Pandas Tutorial: 10 Popular Questions for Python Data Frames
Pandas is one of the first libraries you will learn about when you start working with Python for data analysis and data science. The pandas library helps you work with datasets, transform and clean up your data, and get statistics.
In this tutorial, we will answer 10 of the most frequently asked questions people have when working with pandas. The questions covered in this tutorial mostly come from Stack Overflow.
In the first part of this tutorial, we will work with a dataset containing sample data on city populations, along with the land area and population density of each city.
Pandas loc and iloc
.loc accesses a group of rows and columns by label or by a boolean array.
Let’s use it to select the population for Mexico City.
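As a minimal sketch of label-based selection, the snippet below builds a tiny stand-in DataFrame (the column names and figures are assumptions and placeholders, not the tutorial's dataset) and pulls out the Mexico City population in two ways.

```python
import pandas as pd

# Hypothetical sample; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Tokyo", "Mexico City", "Delhi"],
    "Country": ["Japan", "Mexico", "India"],
    "Population": [30_000_000, 20_000_000, 25_000_000],
})

# Boolean mask selects the rows, the column label selects the column.
mexico_population = df.loc[df["City"] == "Mexico City", "Population"]
print(mexico_population)

# With City as the index, .loc can look the value up by label directly.
by_city = df.set_index("City")
mexico_value = by_city.loc["Mexico City", "Population"]
print(mexico_value)
```

The first form returns a Series (there could be several matching rows); the second returns a single value because both the row and column labels are unique here.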
With .iloc you can select rows and columns by their integer positions.
A few things to keep in mind:
- A plain : is used to select all data across rows/columns.
- 0:2 will select rows/columns 0 and 1; 2 is not included.
- -1 will select the last element.
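The slicing rules above can be sketched on a small stand-in DataFrame; the column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Tokyo", "Mexico City", "Delhi"],
    "Country": ["Japan", "Mexico", "India"],
    "Population": [30_000_000, 20_000_000, 25_000_000],
})

all_rows_first_two_cols = df.iloc[:, 0:2]  # every row, columns 0 and 1
first_two_rows = df.iloc[0:2, :]           # rows 0 and 1, every column
last_row = df.iloc[-1]                     # last row, returned as a Series
last_col = df.iloc[:, -1]                  # last column, returned as a Series
```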
Renaming columns in pandas
Next we’ll rename the columns to make them easier to access in the future.
There are a few ways to do this:
- Directly assigning df.columns an array of column names.
- Using df.rename to rename specific columns.
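Both approaches might look like the sketch below. The original column names are not shown in this tutorial, so the lowercase names used here are purely hypothetical.

```python
import pandas as pd

# Hypothetical starting column names and placeholder values.
df = pd.DataFrame({
    "city name": ["Tokyo"],
    "country name": ["Japan"],
    "population total": ["30,000,000"],
})

# Option 1: replace every column name at once (order and length must match).
df1 = df.copy()
df1.columns = ["City", "Country", "Population"]

# Option 2: rename only selected columns through an old-name -> new-name mapping.
df2 = df.rename(columns={"city name": "City", "population total": "Population"})
```

Assigning df.columns is convenient when every name changes; df.rename is safer when only a few columns need new names, because it does not depend on column order.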
Selecting multiple columns in a pandas DataFrame
Let’s split our DataFrame into two DataFrames containing:
- City, Country, and Population.
- City, Area, and Density.
We can do this in several ways:
- By using .iloc[:, 0:3], where the first argument in the brackets selects all rows and the second argument selects column 0, column 1, and column 2.
- By slicing the DataFrame with double square brackets [[ ]] and entering the column names you want to select.
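A minimal sketch of both selection styles, assuming the five renamed columns are in the order City, Country, Population, Area, Density (values are placeholders):

```python
import pandas as pd

# Hypothetical sample; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Tokyo", "Delhi"],
    "Country": ["Japan", "India"],
    "Population": ["30,000,000", "25,000,000"],
    "Area": ["2,000", "1,500"],
    "Density": ["15,000", "16,667"],
})

# Positional: all rows, columns 0 to 2.
df_population = df.iloc[:, 0:3]

# By name: a list of column labels inside the indexing brackets.
df_area = df[["City", "Area", "Density"]]
```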
Pandas merge two tables by column
Next we’ll join the two tables that we’ve created side by side. Both tables have the same City column, so we will use the pd.merge function to combine them on that column.
The left_on and right_on parameters indicate the column name to merge on in the first and second table.
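The merge could be sketched as follows, using two small stand-in tables with assumed column names and placeholder values.

```python
import pandas as pd

# Hypothetical tables; the figures are placeholders, not real statistics.
df_population = pd.DataFrame({
    "City": ["Tokyo", "Delhi"],
    "Country": ["Japan", "India"],
    "Population": ["30,000,000", "25,000,000"],
})
df_area = pd.DataFrame({
    "City": ["Delhi", "Tokyo"],
    "Area": ["1,500", "2,000"],
    "Density": ["16,667", "15,000"],
})

# Rows are matched on the City value, regardless of their order in each table.
merged = pd.merge(df_population, df_area, left_on="City", right_on="City")
print(merged)
```

When the key column has the same name in both tables, on="City" is an equivalent shorthand for the left_on/right_on pair.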
Change column type in pandas with pandas apply
To work further with the DataFrame we need to transform the Population, Area, and Density columns from strings into numbers.
To do this we will:
- Create a function, to_int(), which will transform a string containing ‘,’ separators into an integer.
- Use the apply function with to_int() on each of these columns.
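The two steps above can be sketched like this. The body of to_int() is an assumption about how such a helper might be written, and the sample values are placeholders.

```python
import pandas as pd

# Hypothetical sample with numbers stored as comma-separated strings.
df = pd.DataFrame({
    "City": ["Tokyo", "Delhi"],
    "Population": ["30,000,000", "25,000,000"],
    "Area": ["2,000", "1,500"],
    "Density": ["15,000", "16,667"],
})


def to_int(value: str) -> int:
    """Strip thousands separators and convert the string to an integer."""
    return int(value.replace(",", ""))


# Apply the converter element by element to each string column.
for column in ["Population", "Area", "Density"]:
    df[column] = df[column].apply(to_int)

print(df.dtypes)
```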
Groupby and turn into a DataFrame
Let’s now group the DataFrame by Country and sum the population of each country in this data sample.
The difficulty with df.groupby is that it returns a GroupBy object, not a DataFrame. Applying an aggregation such as sum() returns a DataFrame indexed by the group key, and reset_index() turns that key back into a regular column.
We’ll group by Country, at the same time calculating the sums for the Population and Area columns. We’ll drop the Density column as we don’t need it anymore.
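One way to get from a grouped object back to an ordinary DataFrame is sketched below; the sample rows and figures are placeholders.

```python
import pandas as pd

# Hypothetical sample; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Tokyo", "Osaka", "Delhi"],
    "Country": ["Japan", "Japan", "India"],
    "Population": [30_000_000, 15_000_000, 25_000_000],
    "Area": [2000, 1000, 1500],
    "Density": [15000, 15000, 16667],
})

grouped = (
    df.drop(columns="Density")             # density is no longer needed
      .groupby("Country")[["Population", "Area"]]
      .sum()                               # aggregate each group
      .reset_index()                       # Country becomes a column again
)
print(grouped)
```

Passing as_index=False to groupby is an alternative to the trailing reset_index() call.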
How to iterate over rows in a DataFrame in pandas
Though iterating over rows might not be the fastest solution, it can still sometimes come in handy. You can do this by using a loop over df.iterrows().
Consider doing the same operation with the apply function or with a vectorized operation on the whole column. On big datasets, this is usually much faster.
We’ll divide the Population column by 1000 to get the population numbers in thousands. This can be done in three ways: a row loop, apply, or a vectorized operation.
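The three alternatives might look like the following sketch. The new column names are hypothetical and the sample values are placeholders.

```python
import pandas as pd

# Hypothetical sample; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Tokyo", "Mexico City"],
    "Population": [30_000_000, 20_000_000],
})

# 1. Explicit loop over rows (slowest, but easy to follow).
thousands = []
for _, row in df.iterrows():
    thousands.append(row["Population"] / 1000)
df["Population_loop"] = thousands

# 2. apply: call a function on every element of the column.
df["Population_apply"] = df["Population"].apply(lambda p: p / 1000)

# 3. Vectorized: operate on the whole column at once.
df["Population_vectorized"] = df["Population"] / 1000
```

All three columns hold the same values; the vectorized form is generally the one to prefer when the operation can be expressed as column arithmetic.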
How to select rows from a DataFrame based on column values
Let’s select countries with a population of more than 10 million people and an area of less than 2000 square kilometers.
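You can do this by entering logical constraints within the square brackets of the DataFrame, wrapping each condition in parentheses and combining them with &.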
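A minimal sketch of such a filter, using assumed column names and placeholder figures:

```python
import pandas as pd

# Hypothetical sample; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "Population": [30_000_000, 20_000_000, 5_000_000],
    "Area": [2200, 1500, 700],
})

# Each condition yields a boolean Series; & keeps rows where both are True.
selected = df[(df["Population"] > 10_000_000) & (df["Area"] < 2000)]
print(selected)
```

The parentheses are required because & binds more tightly than the comparison operators in Python.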
How to change the order of your DataFrame columns
You can do this simply by slicing your existing DataFrame in a different order.
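For instance, assuming a DataFrame with City, Country, and Population columns (placeholder values), the columns can be reordered by listing them in the desired order:

```python
import pandas as pd

# Hypothetical sample; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Tokyo", "Delhi"],
    "Country": ["Japan", "India"],
    "Population": [30_000_000, 25_000_000],
})

# Select the same columns in a new order and reassign the result.
df = df[["Country", "City", "Population"]]
print(df.columns.tolist())
```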
Cleaning up data with pandas
To start working with data, you need to clean it up.
The first basic steps are:
- Drop duplicates in a DataFrame.
- Fill empty cells with meaningful values or drop columns with a lot of empty values.
- Get statistics on the column values.
Let’s download the dataset with the tennis game results.
We’ll drop any duplicates with df.drop_duplicates(). Passing inplace=True applies the changes to the DataFrame directly.
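Since the tennis dataset itself is not reproduced here, the sketch below uses a tiny stand-in table. The winner_name and loser_name columns are hypothetical; only minutes is mentioned in this tutorial.

```python
import pandas as pd

# Hypothetical stand-in for the tennis results; values are placeholders.
df = pd.DataFrame({
    "winner_name": ["Player A", "Player A", "Player B"],
    "loser_name": ["Player B", "Player B", "Player C"],
    "minutes": [90, 90, 120],
})

# Remove rows that are exact copies of an earlier row, modifying df itself.
df.drop_duplicates(inplace=True)
print(len(df))
```

Without inplace=True the method returns a new DataFrame, so the equivalent form is df = df.drop_duplicates().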
Now let’s find out whether there are NaN values in our DataFrame.
df.isna().any() returns True for each column that contains NaN values.
In the minutes column we have 91% NaN values, so we’ll drop this column because it carries too little information to be useful.
Columns such as winner_rank don’t have many NaN values, so we’ll replace their NaN values with the column median.
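These checks and fixes can be sketched on a small stand-in table. The values are placeholders, so the NaN share here differs from the 91% quoted for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tennis results; values are placeholders.
df = pd.DataFrame({
    "winner_rank": [1.0, np.nan, 5.0, 3.0],
    "minutes": [np.nan, np.nan, np.nan, 95.0],
})

has_nan = df.isna().any()     # True for each column containing NaN
nan_share = df.isna().mean()  # fraction of NaN values per column

# Drop the mostly empty column.
df = df.drop(columns="minutes")

# Fill the remaining gaps with the column median (NaN is ignored by median()).
df["winner_rank"] = df["winner_rank"].fillna(df["winner_rank"].median())
```

The mean of a boolean column is the fraction of True values, which is why df.isna().mean() gives the NaN share per column.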
With df.describe() we can get summary statistics for the numeric columns.
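A short sketch on a stand-in table (winner_name is a hypothetical column and the values are placeholders) shows that only numeric columns are summarized by default:

```python
import pandas as pd

# Hypothetical stand-in for the tennis results; values are placeholders.
df = pd.DataFrame({
    "winner_name": ["Player A", "Player B", "Player C"],
    "winner_rank": [1.0, 3.0, 5.0],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
stats = df.describe()
print(stats)
```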
That is it for our pandas tutorial. We’ve tried to provide answers to many of the most common questions people have when they are just starting out with pandas. Tell us in the comments about any other topics you’d like us to cover in future tutorials.