
Hit the Ground Running With Pandas

If you're doing any work with data in Python, it's only a matter of time before you come across pandas. This popular package gives you plenty of options for reading in, processing, and writing data; however, it isn't the most intuitive library to use, and beginners can find it a bit overwhelming at first. In this blog post, we'll cover the basics you need to get up and running with the powerful pandas library, including some of my favorite methods for these operations. So, let's get you on your way!

Reading in and writing data

In order to start working with data, you obviously need to read some in. Pandas is extremely flexible in terms of the data types it can read in and write to file. While CSV (comma-separated values) is one of the most commonly used file formats, pandas is able to directly read and write many other formats, from JSON, Excel, and Parquet files to reading directly from SQL databases. Reading files is done using the “read” family of methods, while writing is done using the “to” family. For example, reading a CSV file is completed using read_csv, and the data can be written back to CSV using to_csv. Let’s take a look at a couple of examples.
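
For example, a minimal round trip with CSV might look something like this (the file names here are just placeholders):

import pandas as pd

# read a CSV file into a DataFrame (placeholder file name)
my_data = pd.read_csv("my_data.csv")

# write it back out to a new CSV file, dropping the index column
my_data.to_csv("my_data_copy.csv", index=False)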

Let's say we want to read in a table in Parquet format. We can use pandas' read_parquet function and specify the engine we want to use with the engine argument:

my_data = pd.read_parquet("my_parquet_table.parquet", engine="pyarrow")

Let’s say we want to write this same data to a JSON-formatted file. We simply use the to_json method like so:

my_data.to_json("my_json_file.json")

As you can see, the basics are straightforward, and each method offers a range of arguments to customize reading and writing data to suit your needs. This excellent blog post from Real Python goes into more detail about the specifics of using these methods with a range of different file formats.

Series and DataFrames

Once you've read in your data, it's time to understand how pandas stores it. Pandas has two core data structures: Series and DataFrames. A DataFrame can be thought of as a table containing many columns, with each column represented as a Series.

To explore this a little further, let's create a new DataFrame using the pandas DataFrame constructor as follows:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [random.randint(0, 100) for i in np.arange(0, 10)],
        "col2": list("aaabbbcccc"),
        "col3": [random.random() for i in np.arange(0, 10)],
        "col4": list("jjjjjkkkkl"),
    }
)

df
   col1 col2      col3 col4
0    46    a  0.701735    j
1    98    a  0.387270    j
2    88    a  0.230487    j
3    53    b  0.492995    j
4    51    b  0.711341    j
5    28    b  0.833130    k
6    87    c  0.619907    k
7    58    c  0.992311    k
8    36    c  0.892960    k
9    49    c  0.174617    l

We can see that this DataFrame is a table containing four columns, each with a different data type: integer in column 1, text in columns 2 and 4, and float in column 3. If we select one of the columns in this DataFrame, we can see that it is a Series.

type(df["col1"])
pandas.core.series.Series

You can see we’ve selected a column using the square bracket notation, in which you put a column name as a string inside brackets next to the DataFrame name.

As you can probably guess, because most data we work with contains multiple columns, you'll usually be working with pandas DataFrames. However, there are occasions when you'll need to work with a single pandas Series, so it is useful to know the distinction.
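
A quick way to see the difference is the bracket notation itself: selecting a column with single square brackets returns a Series, while wrapping the column name in a list returns a one-column DataFrame.

type(df["col1"])    # pandas.core.series.Series
type(df[["col1"]])  # pandas.core.frame.DataFrame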

Getting an overview of your data

One of the first steps in working with data is to explore the variables, and pandas offers a number of useful ways of doing this. One of the first things I like to do when I am working with a new dataset is to check the first few rows, in order to get a feel for what the data contains. In pandas, you can do this using the head method. The n argument will allow you to specify how many rows you want to display.

df.head(n=3)
   col1 col2      col3 col4
0    46    a  0.701735    j
1    98    a  0.387270    j
2    88    a  0.230487    j

Pandas also has some nice methods for checking whether the values of your columns make sense, or whether something weird might be happening in your data. The describe method gives summary statistics for all continuous (numeric) variables, such as the count, mean, standard deviation, minimum, and maximum. It automatically limits its results to these columns and produces a summary table containing the metrics for each of them.

df.describe()
            col1       col3
count  10.000000  10.000000
mean   59.400000   0.603675
std    23.580595   0.277145
min    28.000000   0.174617
25%    46.750000   0.413701
50%    52.000000   0.660821
75%    79.750000   0.802683
max    98.000000   0.992311
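
If you'd also like describe to cover the non-numeric columns, you can pass the include="all" argument, although the statistics it can report for text columns are more limited:

df.describe(include="all")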

For categorical variables, we can use the value_counts method to get the frequency of each level. Unlike describe, though, it is typically applied to one column at a time. Here you can see we've gotten the frequencies of each level of col2:

df["col2"].value_counts()
col2
c    4
a    3
b    3

Column names

You can also see the names of all of your DataFrame's columns using the columns attribute, which outputs the column names in a list-like format:

df.columns
Index(['col1', 'col2', 'col3', 'col4'], dtype='object')

If you want to rename a column, you can use the rename method, along with the columns argument. This argument takes a dictionary, where the key indicates the old column name, and the value indicates the new column name, as in the example below where we rename col1 to column1:

df.rename(columns={"col1": "column1"})
   column1 col2      col3 col4
0       46    a  0.701735    j
1       98    a  0.387270    j
2       88    a  0.230487    j
3       53    b  0.492995    j
4       51    b  0.711341    j
5       28    b  0.833130    k
6       87    c  0.619907    k
7       58    c  0.992311    k
8       36    c  0.892960    k
9       49    c  0.174617    l
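
One thing to keep in mind is that rename returns a new DataFrame rather than changing df itself. To keep the new name, you would assign the result back; we won't do that here, so the rest of this post can keep using col1.

# assign the result back to keep the new column name (not run here)
df = df.rename(columns={"col1": "column1"})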

Changing values inside existing columns

You may also want to change the values of an existing column. This can easily be done by applying the Series method map to the column of your choice. As with the rename method above, you pass a dictionary to this method, with the old values as the keys and the updated values as the values:

df["col2"].map({"a": "d", "b": "e", "c": "f"})
0    d
1    d
2    d
3    e
4    e
5    e
6    f
7    f
8    f
9    f
Name: col2, dtype: object
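
As with rename, map returns a new object rather than modifying the DataFrame in place, so to actually replace the values you would assign the result back to the column. We'll leave df unchanged here, since the examples below still assume col2 contains "a", "b", and "c".

# assign the mapped values back to overwrite the column (not run here)
df["col2"] = df["col2"].map({"a": "d", "b": "e", "c": "f"})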

Creating new columns

In order to create a new column, we use the same square bracket notation that we've been using so far to select existing columns. We then assign our desired value to this new column. A simple example is shown below, where we create a new column in which every row has the value 1:

df["col5"] = 1
df
   col1 col2      col3 col4  col5
0    46    a  0.701735    j     1
1    98    a  0.387270    j     1
2    88    a  0.230487    j     1
3    53    b  0.492995    j     1
4    51    b  0.711341    j     1
5    28    b  0.833130    k     1
6    87    c  0.619907    k     1
7    58    c  0.992311    k     1
8    36    c  0.892960    k     1
9    49    c  0.174617    l     1

New columns can also be created by using the values of existing columns. For example, you can combine string columns to create concatenated values, such as combining two columns containing the first and last name of a customer to create a new column containing their full name. There are other string manipulation tricks you can perform, such as creating new columns containing only substrings or lowercase versions of existing columns. You also have the option to carry out arithmetic operations between numeric columns, as seen below, where we multiply the values of col1 and col3 to get a new column containing their product:

df["col6"] = df["col1"] * df["col3"]
df
   col1 col2      col3 col4  col5       col6
0    46    a  0.701735    j     1  32.279790
1    98    a  0.387270    j     1  37.952489
2    88    a  0.230487    j     1  20.282842
3    53    b  0.492995    j     1  26.128719
4    51    b  0.711341    j     1  36.278398
5    28    b  0.833130    k     1  23.327639
6    87    c  0.619907    k     1  53.931952
7    58    c  0.992311    k     1  57.554063
8    36    c  0.892960    k     1  32.146561
9    49    c  0.174617    l     1   8.556212
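
The string manipulations mentioned above work in much the same way. Here's a small sketch using a hypothetical DataFrame of customer names and the .str accessor (the column names and values are made up purely for illustration):

# hypothetical customer data
customers = pd.DataFrame(
    {
        "first_name": ["Ada", "Grace"],
        "last_name": ["Lovelace", "Hopper"],
    }
)

# combine two string columns into a full name
customers["full_name"] = customers["first_name"] + " " + customers["last_name"]

# lowercase version of an existing column
customers["last_lower"] = customers["last_name"].str.lower()

# substring: just the first three characters of the first name
customers["short_name"] = customers["first_name"].str[:3]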

Finally, you can create a new column that is conditional on the values of one or more existing columns, using the NumPy function np.where. Below we create a new column that is equal to 1 when col1 is more than 50, and 0 otherwise.

df["col7"] = np.where(df["col1"] > 50, 1, 0)
df
   col1 col2      col3 col4  col5       col6  col7
0    46    a  0.701735    j     1  32.279790     0
1    98    a  0.387270    j     1  37.952489     1
2    88    a  0.230487    j     1  20.282842     1
3    53    b  0.492995    j     1  26.128719     1
4    51    b  0.711341    j     1  36.278398     1
5    28    b  0.833130    k     1  23.327639     0
6    87    c  0.619907    k     1  53.931952     1
7    58    c  0.992311    k     1  57.554063     1
8    36    c  0.892960    k     1  32.146561     0
9    49    c  0.174617    l     1   8.556212     0

Filtering data

There are a huge number of ways to filter DataFrames in pandas, and for beginners it can feel quite overwhelming. While you may at times need more specific filtering options, I’ll list the methods that I use for 90% of my filtering needs.

We’re already familiar with using the square bracket notation to extract one specific column. You can extend this by passing a list of columns to filter your DataFrame to multiple columns at a time. For example, let’s say that we just want to view col1 and col4 – we just need to include them both in a list that we place inside of the square brackets:

df[["col1", "col4"]]
   col1 col4
0    46    j
1    98    j
2    88    j
3    53    j
4    51    j
5    28    k
6    87    k
7    58    k
8    36    k
9    49    l

The most common way of filtering rows in a DataFrame is by condition. We can do this by nesting a conditional statement inside of our trusty square brackets notation. Here, we filter our DataFrame to only those rows where col2 equals “a”:

df[df["col2"] == "a"]
   col1 col2      col3 col4  col5       col6  col7
0    46    a  0.701735    j     1  32.279790     0
1    98    a  0.387270    j     1  37.952489     1
2    88    a  0.230487    j     1  20.282842     1

It's also possible to filter based on multiple conditions. In this case, each condition needs to be enclosed in parentheses, as below, where we're checking for rows where col2 equals "a" and col6 is more than 1:

df[(df["col2"] == "a") & (df["col6"] > 1)]
   col1 col2      col3 col4  col5       col6  col7
0    46    a  0.701735    j     1  32.279790     0
1    98    a  0.387270    j     1  37.952489     1
2    88    a  0.230487    j     1  20.282842     1

We can also combine column and row filtering using pandas' loc accessor. Here we'll have a look at the values of col1 and col7 where col1 is more than 50:

df.loc[df["col1"] > 50, ["col1", "col7"]]
   col1  col7
1    98     1
2    88     1
3    53     1
4    51     1
6    87     1
7    58     1

While there will be times when your task will need more specific or complex filtering, these basic filtering methods will give you a strong start to explore and manipulate your data.

Merging two DataFrames

Finally, there will be times when you need to merge two DataFrames. There are a variety of methods of doing this in pandas, but in my experience, the most flexible and intuitive to use is the merge method. To see how it works, let’s first create a second DataFrame:

df2 = pd.DataFrame(
    {
        "field1": list("aabb"),
        "field2": df["col1"][5:9].to_list(),
    }
)
df2
  field1  field2
0      a      28
1      a      87
2      b      58
3      b      36

Let’s say we want to merge these two DataFrames based on col1 in df and field2 in df2. We simply need to define a left table (df), a right table (df2), and the columns to use to join these tables (using the left_on and right_on arguments) in the merge method:

pd.merge(df, df2, left_on="col1", right_on="field2")
   col1 col2      col3 col4  col5       col6  col7 field1  field2
0    28    b  0.833130    k     1  23.327639     0      a      28
1    87    c  0.619907    k     1  53.931952     1      a      87
2    58    c  0.992311    k     1  57.554063     1      b      58
3    36    c  0.892960    k     1  32.146561     0      b      36

As you can see, because the fields we chose to use to join the two DataFrames have unique values in either DataFrame, we ended up with a one-to-one join. However, the merge method can also handle one-to-many, many-to-one, and many-to-many joins automatically, depending on the fields you use for the join. We can see this if we try to merge the two DataFrames on col2 and field1, which both have duplicate values in their respective DataFrames:

pd.merge(df, df2, left_on="col2", right_on="field1")
    col1 col2      col3 col4  col5       col6  col7 field1  field2
0     46    a  0.701735    j     1  32.279790     0      a      28
1     46    a  0.701735    j     1  32.279790     0      a      87
2     98    a  0.387270    j     1  37.952489     1      a      28
3     98    a  0.387270    j     1  37.952489     1      a      87
4     88    a  0.230487    j     1  20.282842     1      a      28
5     88    a  0.230487    j     1  20.282842     1      a      87
6     53    b  0.492995    j     1  26.128719     1      b      58
7     53    b  0.492995    j     1  26.128719     1      b      36
8     51    b  0.711341    j     1  36.278398     1      b      58
9     51    b  0.711341    j     1  36.278398     1      b      36
10    28    b  0.833130    k     1  23.327639     0      b      58
11    28    b  0.833130    k     1  23.327639     0      b      36

You can see that this automatically results in a many-to-many join.
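
If you want to be warned when a merge doesn't produce the relationship you expect, the merge method also accepts a validate argument, for example validate="one_to_one", which raises an error if the join keys would produce a different kind of relationship:

pd.merge(df, df2, left_on="col1", right_on="field2", validate="one_to_one")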

The merge method is also capable of performing the standard SQL join types. While the default is an inner join, left, right, and outer joins are all possible using the how argument. Here we'll change the first merge we did, on col1 and field2, to a left join:

pd.merge(df, df2, left_on="col1", right_on="field2", how="left")
   col1 col2      col3 col4  col5       col6  col7 field1  field2
0    46    a  0.701735    j     1  32.279790     0    NaN     NaN
1    98    a  0.387270    j     1  37.952489     1    NaN     NaN
2    88    a  0.230487    j     1  20.282842     1    NaN     NaN
3    53    b  0.492995    j     1  26.128719     1    NaN     NaN
4    51    b  0.711341    j     1  36.278398     1    NaN     NaN
5    28    b  0.833130    k     1  23.327639     0      a    28.0
6    87    c  0.619907    k     1  53.931952     1      a    87.0
7    58    c  0.992311    k     1  57.554063     1      b    58.0
8    36    c  0.892960    k     1  32.146561     0      b    36.0
9    49    c  0.174617    l     1   8.556212     0    NaN     NaN

You can see that we’ve kept all values from our left table, df, and have missing values where those values don’t exist in our right table, df2, as expected.

Finally, as well as being able to merge on columns, pandas’ merge method also allows you to merge DataFrames using their indices. To do so, you simply need to use the left_index and right_index arguments instead of left_on and right_on, passing True as the value:

pd.merge(df, df2, left_index=True, right_index=True)
   col1 col2      col3 col4  col5       col6  col7 field1  field2
0    46    a  0.701735    j     1  32.279790     0      a      28
1    98    a  0.387270    j     1  37.952489     1      a      87
2    88    a  0.230487    j     1  20.282842     1      b      58
3    53    b  0.492995    j     1  26.128719     1      b      36

Here, we’ve merged the first 4 rows of each table using their index values.

As you can see from this tutorial, you can complete a large variety of data exploration and manipulation tasks with only a handful of commands in pandas. I hope this has helped show you that getting up and running with pandas is relatively straightforward, and will boost your confidence when working with data in Python.

Get some hands-on practice with pandas

Continue your journey of learning pandas in DataSpell, where we’ll guide you through how to read in and analyse a dataset of airline delays and cancellations in pandas.

Get started!
