Hit the Ground Running With Pandas
If you’re doing any work with data in Python, it’s only a matter of time before you come across pandas. This popular package provides you with many options for reading in, processing, and writing data; however, it’s not the most intuitive package to use and beginners might find it a bit overwhelming at first. In this blog post, we’ll cover the basics of what you need to know to get up and running with the powerful pandas library, including some of my favorite methods for these operations. So, let’s get you on your way!
Reading in and writing data
In order to start working with data, you obviously need to read some in. Pandas is extremely flexible in terms of the file formats it can read and write. While CSV (comma-separated values) is one of the most commonly used formats, pandas can directly read and write many others, from JSON, Excel, and Parquet files to reading directly from SQL databases. Reading files is done using the “read” family of methods, while writing is done using the “to” family. For example, a CSV file is read using read_csv, and the data can be written back to CSV using to_csv. Let’s take a look at a couple of examples.
Let’s say we want to read in a table in Parquet format. We can use pandas’ read_parquet method and specify the engine we want to use with the engine argument:
my_data = pd.read_parquet("my_parquet_table.parquet", engine="pyarrow")
Now let’s say we want to write this same data to a JSON-formatted file. We simply use the to_json method like so:
my_data.to_json("my_json_file.json")
As you can see, the basics are straightforward, and each method offers a range of arguments to customize reading and writing data to suit your needs. This excellent blog post from Real Python goes into more detail about the specifics of using these methods with a range of different file formats.
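As a small illustration of that flexibility, here’s a quick sketch of a few of the read_csv and to_csv arguments I reach for most often. The file name and column names ("sales.csv", "date", "amount") are made up for the example:

import pandas as pd

# Read a semicolon-delimited file, keeping only two columns and the first 1,000 rows.
sales = pd.read_csv("sales.csv", sep=";", usecols=["date", "amount"], nrows=1000)

# Write the result back out, dropping the index column.
sales.to_csv("sales_clean.csv", index=False)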
Series and DataFrames
Once you’ve read in your data, it’s time to understand how pandas stores it. Pandas has two core data structures: Series and DataFrames. A DataFrame can be thought of as a table containing many columns, with each column represented as a Series.
To explore this a little further, let’s create a new DataFrame using pandas’ DataFrame constructor as follows:
import random

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [random.randint(0, 100) for i in np.arange(0, 10)],
        "col2": list("aaabbbcccc"),
        "col3": [random.random() for i in np.arange(0, 10)],
        "col4": list("jjjjjkkkkl"),
    }
)
df
|   | col1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 46 | a | 0.701735 | j |
1 | 98 | a | 0.387270 | j |
2 | 88 | a | 0.230487 | j |
3 | 53 | b | 0.492995 | j |
4 | 51 | b | 0.711341 | j |
5 | 28 | b | 0.833130 | k |
6 | 87 | c | 0.619907 | k |
7 | 58 | c | 0.992311 | k |
8 | 36 | c | 0.892960 | k |
9 | 49 | c | 0.174617 | l |
We can see that this DataFrame is a table containing four columns, each with a different data type: integer in column 1, text in columns 2 and 4, and float in column 3. If we select one of the columns in this DataFrame, we can see that it is a Series.
type(df["col1"])
pandas.core.series.Series
You can see we’ve selected a column using the square bracket notation, in which you put a column name as a string inside brackets next to the DataFrame name.
As you can probably guess, because most data we work with contains multiple columns, you’ll usually be working with pandas DataFrames. However, there are occasions when you’ll need to work with a single pandas Series, so it’s useful to know the distinction.
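If you ever do need a standalone Series, you can also create one directly with the Series constructor. Here’s a minimal sketch (the values and the name are made up for illustration):

import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([25, 32, 47], name="age")
type(ages)  # pandas.core.series.Series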
Getting an overview of your data
One of the first steps in working with data is to explore the variables, and pandas offers a number of useful ways of doing this. The first thing I like to do with a new dataset is to check the first few rows, in order to get a feel for what the data contains. In pandas, you can do this using the head method. The n argument allows you to specify how many rows you want to display.
df.head(n=3)
|   | col1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 46 | a | 0.701735 | j |
1 | 98 | a | 0.387270 | j |
2 | 88 | a | 0.230487 | j |
Pandas also has some nice methods for checking whether the values of your columns make sense, or whether something weird might be happening in your data. The describe method gives summary statistics for all numeric (continuous) variables: the count, mean, standard deviation, minimum, maximum, and quartiles. It automatically filters its results to these columns and produces a single summary table containing the metrics for each of them.
df.describe()
|   | col1 | col3 |
---|---|---|
count | 10.000000 | 10.000000 |
mean | 59.400000 | 0.603675 |
std | 23.580595 | 0.277145 |
min | 28.000000 | 0.174617 |
25% | 46.750000 | 0.413701 |
50% | 52.000000 | 0.660821 |
75% | 79.750000 | 0.802683 |
max | 98.000000 | 0.992311 |
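By default, describe only covers the numeric columns. If you also want your text columns summarized (count, number of unique values, most frequent value, and its frequency), you can pass the include argument; here’s a quick sketch:

# Summarize every column, numeric and text alike.
df.describe(include="all")

# Or restrict the summary to the text (object-dtype) columns only.
df.describe(include=["object"])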
For categorical variables, we can use the value_counts method to get the frequency of each level. However, unlike describe, it can only be applied to one column at a time. Here you can see we’ve gotten the frequencies of each level of col2:
df["col2"].value_counts()
col2 | |
---|---|
c | 4 |
a | 3 |
b | 3 |
Column names
You can also see the names of all of your DataFrame’s columns using the columns attribute, which outputs the column names in a list-like format:
df.columns
Index(['col1', 'col2', 'col3', 'col4'], dtype='object')
If you want to rename a column, you can use the rename method along with its columns argument. This argument takes a dictionary, where each key is the old column name and the corresponding value is the new column name, as in the example below, where we rename col1 to column1:
df.rename(columns={"col1": "column1"})
|   | column1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 46 | a | 0.701735 | j |
1 | 98 | a | 0.387270 | j |
2 | 88 | a | 0.230487 | j |
3 | 53 | b | 0.492995 | j |
4 | 51 | b | 0.711341 | j |
5 | 28 | b | 0.833130 | k |
6 | 87 | c | 0.619907 | k |
7 | 58 | c | 0.992311 | k |
8 | 36 | c | 0.892960 | k |
9 | 49 | c | 0.174617 | l |
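Note that rename returns a new DataFrame rather than changing df itself, so to keep the change you need to capture the result. In the sketch below I assign it to a new variable, df_renamed (a name of my own choosing), so the df used in the rest of the examples keeps its original column names:

# rename does not modify df in place, so capture the returned DataFrame.
df_renamed = df.rename(columns={"col1": "column1"})
df_renamed.columns  # Index(['column1', 'col2', 'col3', 'col4'], dtype='object')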
Changing values inside existing columns
You may also want to change the values of an existing column. This can easily be done by applying the Series method map to the column of your choice. As with the rename method above, you pass a dictionary to this method, with the old column values as the keys and the updated values as the values:
df["col2"].map({"a": "d", "b": "e", "c": "f"})
|   | col2 |
---|---|
0 | d |
1 | d |
2 | d |
3 | e |
4 | e |
5 | e |
6 | f |
7 | f |
8 | f |
9 | f |
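Like rename, map returns a new Series rather than modifying the column in place. To actually overwrite col2, you would assign the result back to it, as in the one-liner below (we don’t do this here, which is why the later examples still show the original a, b, and c values):

# Assign the mapped values back to overwrite the column.
df["col2"] = df["col2"].map({"a": "d", "b": "e", "c": "f"})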
Creating new columns
In order to create a new column, we use the same square bracket notation that we’ve been using so far to select existing columns, and assign our desired value to it. A simple example is shown below, where we create a new column in which every row has the value 1:
df["col5"] = 1 df
|   | col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 |
1 | 98 | a | 0.387270 | j | 1 |
2 | 88 | a | 0.230487 | j | 1 |
3 | 53 | b | 0.492995 | j | 1 |
4 | 51 | b | 0.711341 | j | 1 |
5 | 28 | b | 0.833130 | k | 1 |
6 | 87 | c | 0.619907 | k | 1 |
7 | 58 | c | 0.992311 | k | 1 |
8 | 36 | c | 0.892960 | k | 1 |
9 | 49 | c | 0.174617 | l | 1 |
New columns can also be created using the values of existing columns. For example, you can combine string columns to create concatenated values, such as joining two columns containing a customer’s first and last names to create a new column containing their full name. There are other string manipulation tricks you can perform, such as creating new columns containing only substrings or lowercase versions of existing columns (see the sketch after the next example). You can also carry out arithmetic operations between numeric columns, as seen below, where we multiply the values of col1 and col3 to get a new column containing their product:
df["col6"] = df["col1"] * df["col3"] df
|   | col1 | col2 | col3 | col4 | col5 | col6 |
---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 |
4 | 51 | b | 0.711341 | j | 1 | 36.278398 |
5 | 28 | b | 0.833130 | k | 1 | 23.327639 |
6 | 87 | c | 0.619907 | k | 1 | 53.931952 |
7 | 58 | c | 0.992311 | k | 1 | 57.554063 |
8 | 36 | c | 0.892960 | k | 1 | 32.146561 |
9 | 49 | c | 0.174617 | l | 1 | 8.556212 |
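As a quick illustration of the string manipulations mentioned above, here’s a minimal sketch using the columns we already have. The results are assigned to standalone variables of my own choosing rather than to new columns, so the DataFrame used in the rest of the examples is unchanged:

# Combine two string columns element-wise (e.g. "a" + "j" -> "aj").
combined = df["col2"] + df["col4"]

# Change case with the .str accessor (lower() works the same way).
upper = df["col2"].str.upper()

# Keep only a substring: the first character of each value in col4.
first_char = df["col4"].str[:1]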
Finally, you can create a new column that is conditional on the values of one or more existing columns, using NumPy’s where function. Below, we create a new column that is equal to 1 when col1 is more than 50, and 0 otherwise.
df["col7"] = np.where(df["col1"] > 50, 1, 0) df
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 |
4 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 |
5 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 |
6 | 87 | c | 0.619907 | k | 1 | 53.931952 | 1 |
7 | 58 | c | 0.992311 | k | 1 | 57.554063 | 1 |
8 | 36 | c | 0.892960 | k | 1 | 32.146561 | 0 |
9 | 49 | c | 0.174617 | l | 1 | 8.556212 | 0 |
Filtering data
There are a huge number of ways to filter DataFrames in pandas, and for beginners it can feel quite overwhelming. While you may at times need more specific filtering options, I’ll list the methods that I use for 90% of my filtering needs.
We’re already familiar with using the square bracket notation to extract one specific column. You can extend this by passing a list of columns to filter your DataFrame to multiple columns at a time. For example, let’s say that we just want to view col1 and col4 – we just need to include them both in a list that we place inside the square brackets:
df[["col1", "col4"]] df
|   | col1 | col4 |
---|---|---|
0 | 46 | j |
1 | 98 | j |
2 | 88 | j |
3 | 53 | j |
4 | 51 | j |
5 | 28 | k |
6 | 87 | k |
7 | 58 | k |
8 | 36 | k |
9 | 49 | l |
The most common way of filtering rows in a DataFrame is by condition. We can do this by nesting a conditional statement inside our trusty square bracket notation. Here, we filter our DataFrame to only those rows where col2 equals “a”:
df[df["col2"] == "a"] df
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 |
It’s also possible to filter based on multiple conditions. In this case, each condition needs to be enclosed in parentheses, as below, where we’re checking for rows where col2 equals “a” and col6 is more than 1:
df[(df["col2"] == "a") & (df["col6"] > 1)] df
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 |
We can also combine column and row filtering using pandas’ loc indexer. Here we’ll have a look at the values of col1 and col7 where col1 is more than 50:
df.loc[df["col1"] > 50, ["col1", "col7"]] df
|   | col1 | col7 |
---|---|---|
1 | 98 | 1 |
2 | 88 | 1 |
3 | 53 | 1 |
4 | 51 | 1 |
6 | 87 | 1 |
7 | 58 | 1 |
While there will be times when your task will need more specific or complex filtering, these basic filtering methods will give you a strong start to explore and manipulate your data.
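If you do need something a little more specific, two options worth knowing about are the isin method, which matches a column against a list of values, and the query method, which lets you write filter conditions as a string. A brief sketch using our example DataFrame:

# Rows where col2 is either "a" or "c".
df[df["col2"].isin(["a", "c"])]

# The same multi-condition filter as above, expressed as a query string.
df.query("col2 == 'a' and col6 > 1")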
Merging two DataFrames
Finally, there will be times when you need to merge two DataFrames. There are a number of ways of doing this in pandas, but in my experience, the most flexible and intuitive to use is the merge method. To see how it works, let’s first create a second DataFrame:
df2 = pd.DataFrame({
    "field1": list("aabb"),
    "field2": df["col1"][5:9].to_list(),
})
df2
|   | field1 | field2 |
---|---|---|
0 | a | 28 |
1 | a | 87 |
2 | b | 58 |
3 | b | 36 |
Let’s say we want to merge these two DataFrames based on col1 in df and field2 in df2. We simply need to define a left table (df), a right table (df2), and the columns to use to join these tables (using the left_on and right_on arguments) in the merge method:
pd.merge(df, df2, left_on="col1", right_on="field2")
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | a | 28 |
1 | 87 | c | 0.619907 | k | 1 | 53.931952 | 1 | a | 87 |
2 | 58 | c | 0.992311 | k | 1 | 57.554063 | 1 | b | 58 |
3 | 36 | c | 0.892960 | k | 1 | 32.146561 | 0 | b | 36 |
As you can see, because the fields we chose to join on contain unique values in both DataFrames, we ended up with a one-to-one join. However, the merge method can also handle one-to-many, many-to-one, and many-to-many joins automatically, depending on the fields you use for the join. We can see this if we try to merge the two DataFrames on col2 and field1, which both contain duplicate values in their respective DataFrames:
pd.merge(df, df2, left_on="col2", right_on="field1")
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | a | 28 |
1 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | a | 87 |
2 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | a | 28 |
3 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | a | 87 |
4 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | a | 28 |
5 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | a | 87 |
6 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | b | 58 |
7 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | b | 36 |
8 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 | b | 58 |
9 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 | b | 36 |
10 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | b | 58 |
11 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | b | 36 |
You can see that this automatically results in a many-to-many join.
The merge method is also capable of performing the different SQL join types. While the default is an inner join, left, right, and outer joins are all possible using the how argument. Here we’ll change the first merge we did, on col1 and field2, to a left join:
pd.merge(df, df2, left_on="col1", right_on="field2", how="left")
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | NaN | NaN |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | NaN | NaN |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | NaN | NaN |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | NaN | NaN |
4 | 51 | b | 0.711341 | j | 1 | 36.278398 | 1 | NaN | NaN |
5 | 28 | b | 0.833130 | k | 1 | 23.327639 | 0 | a | 28.0 |
6 | 87 | c | 0.619907 | k | 1 | 53.931952 | 1 | a | 87.0 |
7 | 58 | c | 0.992311 | k | 1 | 57.554063 | 1 | b | 58.0 |
8 | 36 | c | 0.892960 | k | 1 | 32.146561 | 0 | b | 36.0 |
9 | 49 | c | 0.174617 | l | 1 | 8.556212 | 0 | NaN | NaN |
You can see that we’ve kept all rows from our left table, df, and have missing values wherever the corresponding join values don’t exist in our right table, df2, as expected.
Finally, as well as being able to merge on columns, pandas’ merge method also allows you to merge DataFrames using their indices. To do so, you simply need to use the left_index and right_index arguments instead of left_on and right_on, passing True as the value:
pd.merge(df, df2, left_index=True, right_index=True)
|   | col1 | col2 | col3 | col4 | col5 | col6 | col7 | field1 | field2 |
---|---|---|---|---|---|---|---|---|---|
0 | 46 | a | 0.701735 | j | 1 | 32.279790 | 0 | a | 28 |
1 | 98 | a | 0.387270 | j | 1 | 37.952489 | 1 | a | 87 |
2 | 88 | a | 0.230487 | j | 1 | 20.282842 | 1 | b | 58 |
3 | 53 | b | 0.492995 | j | 1 | 26.128719 | 1 | b | 36 |
Here, we’ve merged the first 4 rows of each table using their index values.
As you can see from this tutorial, you can complete a large variety of data exploration and manipulation tasks with only a handful of commands in pandas. I hope this has helped show you that getting up and running with pandas is relatively straightforward, and will boost your confidence when working with data in Python.
Get some hands-on practice with pandas
Continue your journey of learning pandas in DataSpell, where we’ll guide you through how to read in and analyse a dataset of airline delays and cancellations in pandas.