Hit the Ground Running With Pandas
If you’re doing any work with data in Python, it’s only a matter of time before you come across pandas. This popular package provides you with many options for reading in, processing, and writing data; however, it’s not the most intuitive package to use and beginners might find it a bit overwhelming at first. In this blog post, we’ll cover the basics of what you need to know to get up and running with the powerful pandas library, including some of my favorite methods for these operations. So, let’s get you on your way!
Reading in and writing data
In order to start working with data, you obviously need to read some in. Pandas is extremely flexible in terms of the file formats it can read and write. While CSV (comma-separated values) is one of the most commonly used formats, pandas can directly read and write many others, from JSON, Excel, and Parquet files to reading directly from SQL databases. Reading files is done using the “read” family of methods, while writing is done using the “to” family. For example, a CSV file is read using read_csv, and the data can be written back to CSV using to_csv. Let’s take a look at a couple of examples.
Let’s say we want to read in a table in Parquet format. We can use pandas’ read_parquet method and specify the engine we want to use with the engine argument:
my_data = pd.read_parquet("my_parquet_table.parquet", engine="pyarrow")
Now let’s say we want to write this same data to a JSON-formatted file. We simply use the to_json method like so:
my_data.to_json("my_json_file.json")
As you can see, the basics are straightforward, and each method offers a range of arguments to customize reading and writing data to suit your needs. This excellent blog post from Real Python goes into more detail about the specifics of using these methods with a range of different file formats.
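As a small illustration of that flexibility, here’s a quick sketch of a few of the read_csv and to_csv arguments I reach for most often. The file name and column names ("sales.csv", "date", "amount") are made up for the example:

import pandas as pd

# Read a semicolon-delimited file, keeping only two columns and the first 1,000 rows.
sales = pd.read_csv("sales.csv", sep=";", usecols=["date", "amount"], nrows=1000)

# Write the result back out, dropping the index column.
sales.to_csv("sales_clean.csv", index=False)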
Series and DataFrames
Once you’ve read in your data, it’s time to understand how pandas stores it. Pandas has two core data structures: Series and DataFrames. A DataFrame can be thought of as a table containing many columns, with each column represented as a Series.
To explore this a little further, let’s create a new DataFrame using pandas’ DataFrame constructor as follows:
import random

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [random.randint(0, 100) for i in np.arange(0, 10)],
        "col2": list("aaabbbcccc"),
        "col3": [random.random() for i in np.arange(0, 10)],
        "col4": list("jjjjjkkkkl"),
    }
)
df
|   | col1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 46 | a | 0.701735 | j |
1 | 98 | a | 0.387270 | j |
2 | 88 | a | 0.230487 | j |
3 | 53 | b | 0.492995 | j |
4 | 51 | b | 0.711341 | j |
5 | 28 | b | 0.833130 | k |
6 | 87 | c | 0.619907 | k |
7 | 58 | c | 0.992311 | k |
8 | 36 | c | 0.892960 | k |
9 | 49 | c | 0.174617 | l |
We can see that this DataFrame is a table containing four columns, each with a different data type: integer in column 1, text in columns 2 and 4, and float in column 3. If we select one of the columns in this DataFrame, we can see that it is a Series.
type(df["col1"])
pandas.core.series.Series
You can see we’ve selected a column using the square bracket notation, in which you put a column name as a string inside brackets next to the DataFrame name.
As you can probably guess, because most data we work with contains multiple columns, you’ll usually be working with pandas DataFrames. However, there are occasions when you’ll need to work with a single pandas Series, so it’s useful to know the distinction.
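If you ever do need a standalone Series, you can also create one directly with the Series constructor. Here’s a minimal sketch (the values and the name are made up for illustration):

import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([25, 32, 47], name="age")
type(ages)  # pandas.core.series.Series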
Getting an overview of your data
One of the first steps in working with data is to explore the variables, and pandas offers a number of useful ways of doing this. The first thing I like to do with a new dataset is to check the first few rows, in order to get a feel for what the data contains. In pandas, you can do this using the head method. The n argument allows you to specify how many rows you want to display.
df.head(n=3)
|   | col1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 46 | a | 0.701735 | j |
1 | 98 | a | 0.387270 | j |
2 | 88 | a | 0.230487 | j |
Pandas also has some nice methods for checking whether the values of your columns make sense, or whether something weird might be happening in your data. The describe method gives summary statistics for all numeric (continuous) variables: the count, mean, standard deviation, minimum, maximum, and quartiles. It automatically filters its results to these columns and produces a single summary table containing the metrics for each of them.
df.describe()
|   | col1 | col3 |
---|---|---|
count | 10.000000 | 10.000000 |
mean | 59.400000 | 0.603675 |
std | 23.580595 | 0.277145 |
min | 28.000000 | 0.174617 |
25% | 46.750000 | 0.413701 |
50% | 52.000000 | 0.660821 |
75% | 79.750000 | 0.802683 |
max | 98.000000 | 0.992311 |
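By default, describe only covers the numeric columns. If you also want your text columns summarized (count, number of unique values, most frequent value, and its frequency), you can pass the include argument; here’s a quick sketch:

# Summarize every column, numeric and text alike.
df.describe(include="all")

# Or restrict the summary to the text (object-dtype) columns only.
df.describe(include=["object"])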
For categorical variables, we can use the value_counts method to get the frequency of each level. However, unlike describe, it can only be applied to one column at a time. Here you can see we’ve gotten the frequencies of each level of col2:
df["col2"].value_counts()
col2 | |
---|---|
c | 4 |
a | 3 |
b | 3 |
Column names
You can also see the names of all of your DataFrame’s columns using the columns attribute, which outputs the column names in a list-like format:
df.columns
Index(['col1', 'col2', 'col3', 'col4'], dtype='object')
If you want to rename a column, you can use the rename method along with its columns argument. This argument takes a dictionary, where each key is the old column name and the corresponding value is the new column name, as in the example below, where we rename col1 to column1:
df.rename(columns={"col1": "column1"})
|   | column1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 46 | a | 0.701735 | j |
1 | 98 | a | 0.387270 | j |
2 | 88 | a | 0.230487 | j |
3 | 53 | b | 0.492995 | j |
4 | 51 | b | 0.711341 | j |
5 | 28 | b | 0.833130 | k |
6 | 87 | c | 0.619907 | k |
7 | 58 | c | 0.992311 | k |
8 | 36 | c | 0.892960 | k |
9 | 49 | c | 0.174617 | l |
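Note that rename returns a new DataFrame rather than changing df itself, so to keep the change you need to capture the result. In the sketch below I assign it to a new variable, df_renamed (a name of my own choosing), so the df used in the rest of the examples keeps its original column names:

# rename does not modify df in place, so capture the returned DataFrame.
df_renamed = df.rename(columns={"col1": "column1"})
df_renamed.columns  # Index(['column1', 'col2', 'col3', 'col4'], dtype='object')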
Changing values inside existing columns
You may also want to change the values of an existing column. This can easily be done by applying the Series method map to the column of your choice. As with the rename method above, you pass a dictionary to this method, with the old column values as the keys and the updated values as the values:
df["col2"].map({"a": "d", "b": "e", "c": "f"})
|   | col2 |
---|---|
0 | d |
1 | d |
2 | d |
3 | e |
4 | e |
5 | e |
6 | f |
7 | f |
8 | f |
9 | f |
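Like rename, map returns a new Series rather than modifying the column in place. To actually overwrite col2, you would assign the result back to it, as in the one-liner below (we don’t do this here, which is why the later examples still show the original a, b, and c values):

# Assign the mapped values back to overwrite the column.
df["col2"] = df["col2"].map({"a": "d", "b": "e", "c": "f"})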
Creating new columns
In order to create a new column, we use the same square bracket notation that we’ve been using so far to select existing columns, and assign our desired value to it. A simple example is shown below, where we create a new column in which every row has the value 1:
df["col5"] = 1 df
|   | col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 |
1 | 98 | a | 0.387270 | j | 1 |
2 | 88 | a | 0.230487 | j | 1 |
3 | 53 | b | 0.492995 | j | 1 |
4 | 51 | b | 0.711341 | j | 1 |
5 | 28 | b | 0.833130 | k | 1 |
6 | 87 | c | 0.619907 | k | 1 |
7 | 58 | c | 0.992311 | k | 1 |
8 | 36 | c | 0.892960 | k | 1 |
9 | 49 | c | 0.174617 | l | 1 |
New columns can also be created using the values of existing columns. For example, you can combine string columns to create concatenated values, such as joining two columns containing a customer’s first and last names to create a new column containing their full name. There are other string manipulation tricks you can perform, such as creating new columns containing only substrings or lowercase versions of existing columns (see the sketch after the next example). You can also carry out arithmetic operations between numeric columns, as seen below, where we multiply the values of col1 and col3 to get a new column containing their product:
df["col6"] = df["col1"] * df["col3"] df
|   | col1 | col2 | col3 | col4 | col5 | col6 |
---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 |
4 | 51 | b | 0.711341 | j | 1 | 36.278398 |
5 | 28 | b | 0.833130 | k | 1 | 23.327639 |
6 | 87 | c | 0.619907 | k | 1 | 53.931952 |
7 | 58 | c | 0.992311 | k | 1 | 57.554063 |
8 | 36 | c | 0.892960 | k | 1 | 32.146561 |
9 | 49 | c | 0.174617 | l | 1 | 8.556212 |
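As a quick illustration of the string manipulations mentioned above, here’s a minimal sketch using the columns we already have. The results are assigned to standalone variables of my own choosing rather than to new columns, so the DataFrame used in the rest of the examples is unchanged:

# Combine two string columns element-wise (e.g. "a" + "j" -> "aj").
combined = df["col2"] + df["col4"]

# Change case with the .str accessor (lower() works the same way).
upper = df["col2"].str.upper()

# Keep only a substring: the first character of each value in col4.
first_char = df["col4"].str[:1]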
Finally, you can create a new column that is conditional on the values of one or more existing columns, using NumPy’s where function. Below, we create a new column that is equal to 1 when col1 is more than 50, and 0 otherwise.
df["col7"] = np.where(df["col1"] > 50, 1, 0) df
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 |
4 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 |
5 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 |
6 | 87 | c | 0.619907 | k | 1 | 53.931952 | 1 |
7 | 58 | c | 0.992311 | k | 1 | 57.554063 | 1 |
8 | 36 | c | 0.892960 | k | 1 | 32.146561 | 0 |
9 | 49 | c | 0.174617 | l | 1 | 8.556212 | 0 |
Filtering data
There are a huge number of ways to filter DataFrames in pandas, and for beginners it can feel quite overwhelming. While you may at times need more specific filtering options, I’ll list the methods that I use for 90% of my filtering needs.
We’re already familiar with using the square bracket notation to extract one specific column. You can extend this by passing a list of columns to filter your DataFrame to multiple columns at a time. For example, let’s say that we just want to view col1 and col4 – we just need to include them both in a list that we place inside the square brackets:
df[["col1", "col4"]] df
|   | col1 | col4 |
---|---|---|
0 | 46 | j |
1 | 98 | j |
2 | 88 | j |
3 | 53 | j |
4 | 51 | j |
5 | 28 | k |
6 | 87 | k |
7 | 58 | k |
8 | 36 | k |
9 | 49 | l |
The most common way of filtering rows in a DataFrame is by condition. We can do this by nesting a conditional statement inside our trusty square bracket notation. Here, we filter our DataFrame to only those rows where col2 equals “a”:
df[df["col2"] == "a"] df
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 |
It’s also possible to filter based on multiple conditions. In this case, each condition needs to be enclosed in parentheses, as below, where we’re checking for rows where col2 equals “a” and col6 is more than 1:
df[(df["col2"] == "a") & (df["col6"] > 1)] df
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 |
We can also combine column and row filtering using pandas’ loc indexer. Here we’ll have a look at the values of col1 and col7 where col1 is more than 50:
df.loc[df["col1"] > 50, ["col1", "col7"]] df
|   | col1 | col7 |
---|---|---|
1 | 98 | 1 |
2 | 88 | 1 |
3 | 53 | 1 |
4 | 51 | 1 |
6 | 87 | 1 |
7 | 58 | 1 |
While there will be times when your task will need more specific or complex filtering, these basic filtering methods will give you a strong start to explore and manipulate your data.
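If you do need something a little more specific, two options worth knowing about are the isin method, which matches a column against a list of values, and the query method, which lets you write filter conditions as a string. A brief sketch using our example DataFrame:

# Rows where col2 is either "a" or "c".
df[df["col2"].isin(["a", "c"])]

# The same multi-condition filter as above, expressed as a query string.
df.query("col2 == 'a' and col6 > 1")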
Merging two DataFrames
Finally, there will be times when you need to merge two DataFrames. There are a number of ways of doing this in pandas, but in my experience, the most flexible and intuitive to use is the merge method. To see how it works, let’s first create a second DataFrame:
df2 = pd.DataFrame({
    "field1": list("aabb"),
    "field2": df["col1"][5:9].to_list(),
})
df2
|   | field1 | field2 |
---|---|---|
0 | a | 28 |
1 | a | 87 |
2 | b | 58 |
3 | b | 36 |
Let’s say we want to merge these two DataFrames based on col1 in df and field2 in df2. We simply need to define a left table (df), a right table (df2), and the columns to use to join these tables (using the left_on and right_on arguments) in the merge method:
pd.merge(df, df2, left_on="col1", right_on="field2")
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | a | 28 |
1 | 87 | c | 0.619907 | k | 1 | 53.931952 | 1 | a | 87 |
2 | 58 | c | 0.992311 | k | 1 | 57.554063 | 1 | b | 58 |
3 | 36 | c | 0.892960 | k | 1 | 32.146561 | 0 | b | 36 |
As you can see, because the fields we chose to join on contain unique values in both DataFrames, we ended up with a one-to-one join. However, the merge method can also handle one-to-many, many-to-one, and many-to-many joins automatically, depending on the fields you use for the join. We can see this if we try to merge the two DataFrames on col2 and field1, which both contain duplicate values in their respective DataFrames:
pd.merge(df, df2, left_on="col2", right_on="field1")
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | a | 28 |
1 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | a | 87 |
2 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | a | 28 |
3 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | a | 87 |
4 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | a | 28 |
5 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | a | 87 |
6 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | b | 58 |
7 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | b | 36 |
8 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 | b | 58 |
9 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 | b | 36 |
10 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | b | 58 |
11 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | b | 36 |
You can see that this automatically results in a many-to-many join.
The merge method is also capable of performing the different SQL join types. While the default is an inner join, left, right, and outer joins are all possible using the how argument. Here we’ll change the first merge we did, on col1 and field2, to a left join:
pd.merge(df, df2, left_on="col1", right_on="field2", how="left")
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | NaN | NaN |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | NaN | NaN |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | NaN | NaN |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | NaN | NaN |
4 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 | NaN | NaN |
5 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | a | 28.0 |
6 | 87 | c | 0.619907 | k | 1 | 53.931952 | 1 | a | 87.0 |
7 | 58 | c | 0.992311 | k | 1 | 57.554063 | 1 | b | 58.0 |
8 | 36 | c | 0.892960 | k | 1 | 32.146561 | 0 | b | 36.0 |
9 | 49 | c | 0.174617 | l | 1 | 8.556212 | 0 | NaN | NaN |
You can see that we’ve kept all rows from our left table, df, and have missing values wherever the corresponding join values don’t exist in our right table, df2, as expected.
Finally, as well as being able to merge on columns, pandas’ merge method also allows you to merge DataFrames using their indices. To do so, you simply need to use the left_index and right_index arguments instead of left_on and right_on, passing True as the value:
pd.merge(df, df2, left_index=True, right_index=True)
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | a | 28 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | a | 87 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | b | 58 |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | b | 36 |
Here, we’ve merged the first 4 rows of each table using their index values.
As you can see from this tutorial, you can complete a large variety of data exploration and manipulation tasks with only a handful of commands in pandas. I hope this has helped show you that getting up and running with pandas is relatively straightforward, and will boost your confidence when working with data in Python.
Get some hands-on practice with pandas
Continue your journey of learning pandas in DataSpell, where we’ll guide you through how to read in and analyse a dataset of airline delays and cancellations in pandas.