Skip to main content

How to Use SQL in pandas Using pandasql Queries

Unleash the power of SQL within pandas and learn when and how to use SQL queries in pandas using the pandasql library for seamless integration.
May 2023  · 8 min read

SQL, or Structured Query Language, is a programming language used to access, extract, wrangle, and explore data stored in relational databases. pandas is a Python open-source library specifically designed for data manipulation and analysis.

In this tutorial, we're going to discuss when and how we can (and when we cannot) use SQL functionality in the framework of pandas. In addition, we'll take a look at various examples of implementing this approach and compare the results with the equivalent code in pure pandas.

Why Use SQL in pandas?

Given the definitions in the introduction, why should one want to use SQL combined with pandas when the latter is an all-inclusive package for data analysis?

The answer is that on some occasions, especially for complex programs, SQL queries look much more straightforward and easy to read than the corresponding code in pandas. This is particularly true for those people who initially used SQL to work with data and then later learned pandas.

If you need more training on pandas, you can check out our Data Manipulation with pandas course and Pandas Tutorial: DataFrames in Python.

To see SQL readability in action, let's suppose that we have a table (a dataframe) called penguins containing various information on penguins (and we will work with such a table later in this tutorial). To extract all the unique species of penguins who are males and who have flippers longer than 210 mm, we would need the following code in pandas:

penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique()

Instead, to get the same information using SQL, we would run the following code:

SELECT DISTINCT species FROM penguins WHERE sex = 'Male' AND flipper_length_mm > 210

The second piece of code, written in SQL, looks almost like a natural English sentence and hence is much more intuitive. We can further increase its readability by spanning it over multiple lines:

  FROM penguins 
 WHERE sex = 'Male' 
   AND flipper_length_mm > 210

Now that we identified the advantages of using SQL for pandas, let's see how we can technically combine them both.

How to Use pandasql

The pandasql Python library allows querying pandas dataframes by running SQL commands without having to connect to any SQL server. Under the hood, it uses SQLite syntax, automatically detects any pandas dataframe, and treats it as a regular SQL table.

Setting up your environment

First, we need to install pandasql:

pip install pandasql

Then, we import the required packages:

from pandasql import sqldf
import pandas as pd

Above, we directly imported the sqldf function from pandasql, which is virtually the only meaningful function of the library. As its name suggests, it's applied to query dataframes using SQL syntax. Apart from this function, pandasql comes with two simple built-in datasets that can be loaded using the self-explanatory functions load_births() and load_meat().

pandasql Syntax

The syntax of the sqldf function is very simple:

sqldf(query, env=None)

Here, query is a required parameter that takes in a SQL query as a string, and env—an optional (and rarely useful) parameter that can be either locals() or globals() and allows sqldf to access the corresponding set of variables in your Python environment.

The sqldf function returns the result of a query as a pandas dataframe.

When we can use pandasql

The pandasql library allows working with data using the Data Query Language (DQL), which is one of the subsets of SQL. In other words, with pandasql, we can run queries on the data stored in a database to retrieve the necessary information from it. In particular, we can access, extract, filter, sort, group, join, aggregate the data, and perform mathematical or logical operations on it.

When we cannot use pandasql

pandasql doesn't allow employing any other subsets of SQL apart from DQL. This means that we can't apply pandasql to modify (update, truncate, insert, etc.) tables or change (update, delete, or insert) the data in a table.

In addition, since this library is based on SQL syntax, we should beware of the known quirks in SQLite.

Examples of using pandasql

Now, we'll take a more granular look at how to run SQL queries on pandas dataframes using the sqldf function of pandasql. To have some data to practice on, let's load one of the built-in datasets of the seaborn library—penguins:

import seaborn as sns
penguins = sns.load_dataset('penguins')


   species island     bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female

If you need to refresh your SQL skills, our SQL Fundamentals skill track is a good reference point.

Extracting data with pandasql

print(sqldf('''SELECT species, island 
FROM penguins 
LIMIT 5'''))


  species  island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen
3  Adelie  Torgersen
4  Adelie  Torgersen

Above, we extracted information about the species and geography of the first five penguins from the penguins dataframe. Note that running the sqldf function returns a pandas dataframe:

print(type(sqldf('''SELECT species, island 
                      FROM penguins 
                     LIMIT 5''')))


<class 'pandas.core.frame.DataFrame'>

In pure pandas, it would be:

print(penguins[['species', 'island']].head())


   species island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen
3  Adelie  Torgersen
4  Adelie  Torgersen

Another example is extracting unique values from a column:

print(sqldf('''SELECT DISTINCT species 
                 FROM penguins'''))


0     Adelie
1  Chinstrap
2     Gentoo

In pandas, it would be:



['Adelie' 'Chinstrap' 'Gentoo']

Sorting data with pandasql

print(sqldf('''SELECT body_mass_g 
                 FROM penguins 
                ORDER BY body_mass_g DESC 
                LIMIT 5'''))


0       6300.0
1       6050.0
2       6000.0
3       6000.0
4       5950.0

Above, we sorted our penguins by body mass in descending order and displayed the top five values of body mass.

In pandas, it would be:



0    6300.0
1    6050.0
2    6000.0
3    6000.0
4    5950.0
Name: body_mass_g, dtype: float64

Filtering data with pandasql

Let's try the same example that we mentioned in the chapter Why use SQL in pandas: extracting the unique species of penguins who are males and who have flippers longer than 210 mm:

print(sqldf('''SELECT DISTINCT species
                 FROM penguins 
                WHERE sex = 'Male' 
                  AND flipper_length_mm > 210'''))


0  Chinstrap
1     Gentoo

Above, we filtered the data based on two conditions: sex = 'Male' and flipper_length_mm > 210.

The same code in pandas would look a bit more overwhelming:

print(penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique())


['Chinstrap' 'Gentoo']

Grouping and aggregating data with pandasql

Now, let's apply data grouping and aggregation to find the longest bill for each species in the dataframe:

print(sqldf('''SELECT species, MAX(bill_length_mm)
                 FROM penguins 
                GROUP BY species'''))


     species  MAX(bill_length_mm)
0     Adelie                 46.0
1  Chinstrap                 58.0
2     Gentoo                 59.6

The same code in pandas:

print(penguins[['species', 'bill_length_mm']].groupby('species', as_index=False).max())


     species  bill_length_mm
0     Adelie            46.0
1  Chinstrap            58.0
2     Gentoo            59.6

Performing mathematical operations with pandasql

With pandasql, we can easily perform mathematical or logical operations on the data. Let's imagine that we want to calculate the bill length-to-depth ratio for each penguin and display the top five values of this measurement:

print(sqldf('''SELECT bill_length_mm / bill_depth_mm AS length_to_depth
                 FROM penguins
                ORDER BY length_to_depth DESC
                LIMIT 5'''))


0         3.612676
1         3.510490
2         3.505882
3         3.492424
4         3.458599

Note that this time, we used the alias length_to_depth for the column with the ratio values. Otherwise, we would get a column with a monstrous name bill_length_mm / bill_depth_mm.

In pandas, we would need first to create a new column with the ratio values:

penguins['length_to_depth'] = penguins['bill_length_mm'] / penguins['bill_depth_mm']
print(penguins['length_to_depth'].sort_values(ascending=False, ignore_index=True).head())


0    3.612676
1    3.510490
2    3.505882
3    3.492424
4    3.458599
Name: length_to_depth, dtype: float64


To wrap up, in this tutorial, we explored why and when we can combine the functionality of SQL for pandas to write better, more efficient code. We discussed how to set up and use the pandasql library for this purpose and what limitations this package has. Finally, we considered numerous popular examples of the practical application of pandasql and, in each case, compared the code with its pandas counterpart in each case.

Now you have everything you need to apply SQL for pandas in real-world projects. A great place for your practice is the Datacamp Workspace, which is a flexible and powerful environment for conducting data analysis and sharing insights with your collaborators.

Photo of Elena Kosourova
Elena Kosourova

Expand your skills

Data Manipulation in SQL

BeginnerSkill Level
4 hr
Master the complex SQL queries necessary to answer a wide variety of data science questions and prepare robust data sets for analysis in PostgreSQL.
See DetailsRight Arrow
Start Course
See MoreRight Arrow

How to Learn Python From Scratch in 2023: An Expert Guide

Discover how to learn Python, its applications, and the demand for Python skills. Start your Python journey today ​​with our comprehensive guide.
Matt Crabtree's photo

Matt Crabtree

19 min

10 Portfolio-Ready SQL Projects for All Levels

Select your first—or next—SQL project to practice your current SQL skills, develop new ones, and create an outstanding professional portfolio.
Elena Kosourova's photo

Elena Kosourova

11 min

10 Essential Python Skills All Data Scientists Should Master

All data scientists need expertise in Python, but which skills are the most important for them to master? Find out the ten most vital Python skills in the latest rundown.

Thaylise Nakamoto

9 min

Distributed Processing using Ray framework in Python

Unlocking the Power of Scalable Distributed Systems: A Guide to Ray Framework in Python
Moez Ali's photo

Moez Ali

11 min

Geocoding for Data Scientists: An Introduction With Examples

In this tutorial, you will learn three different ways to convert an address into latitude and longitude using Geopy.
Eugenia Anello's photo

Eugenia Anello

9 min

A Complete Guide to Socket Programming in Python

Learn the fundamentals of socket programming in Python
Serhii Orlivskyi's photo

Serhii Orlivskyi

41 min

See MoreSee More