Datalore
Collaborative data science platform for teams
Pandas Tutorial: 10 Popular Questions for Python Data Frames
Pandas is one of the first libraries you will learn about when you start working with Python for data analysis and data science. The pandas library helps you work with datasets, transform and clean up your data, and get statistics.
In this tutorial, we will answer 10 of the most frequently asked questions people have when working with pandas. The questions covered in this tutorial mostly come from Stack Overflow.
Dataset
In the first part of this tutorial, we will work with a dataset containing sample city data: population, land area, and population density.
Pandas loc and iloc
DataFrame.loc[] accesses a group of rows and columns by label or with a boolean array.
Let’s look up Mexico City with .loc[] and print only its population.
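A minimal sketch of this lookup follows. The tutorial's dataset is not reproduced here, so the small DataFrame, its placeholder values, and the column names City, Country, and Population are assumptions for illustration.

```python
import pandas as pd

# Illustrative sample only: placeholder values, not the tutorial's dataset.
df = pd.DataFrame({
    "City": ["Tokyo", "Mexico City", "Cairo"],
    "Country": ["Japan", "Mexico", "Egypt"],
    "Population": ["1,000,000", "2,000,000", "3,000,000"],
})

# Boolean mask selects the rows; the column label selects the column.
mexico_population = df.loc[df["City"] == "Mexico City", "Population"]
print(mexico_population)

# With City as the index, the lookup is purely label-based.
by_city = df.set_index("City")
print(by_city.loc["Mexico City", "Population"])
```

The first form returns a Series (several rows could match the mask), while the second returns a single value because both the row and column labels are given.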
With .iloc[] you can select rows and columns by integer position.
A few things to keep in mind:
- A plain : selects all data along that axis (all rows or all columns).
- 0:2 selects rows/columns 0 and 1; 2 is not included.
- -1 selects the last element.
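The slicing rules above can be sketched as follows, again on an assumed three-row sample with placeholder values.

```python
import pandas as pd

# Illustrative sample only: placeholder values, not the tutorial's dataset.
df = pd.DataFrame({
    "City": ["Tokyo", "Mexico City", "Cairo"],
    "Country": ["Japan", "Mexico", "Egypt"],
    "Population": ["1,000,000", "2,000,000", "3,000,000"],
})

all_rows_first_two_cols = df.iloc[:, 0:2]  # every row, columns 0 and 1
first_two_rows = df.iloc[0:2, :]           # rows 0 and 1, every column
last_row = df.iloc[-1]                     # last row, returned as a Series
last_cell = df.iloc[-1, -1]                # last row, last column

print(all_rows_first_two_cols)
print(last_cell)
```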
Renaming columns in pandas
Next we’ll rename the columns to make them easier to access in the future.
There are a few ways to do this:
- Assign an array of column names directly to df.columns.
- Use df.rename to rename specific columns.
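Both approaches might look like this. The raw header names below are hypothetical, since the original column names are not shown in the text.

```python
import pandas as pd

# Hypothetical raw header names and placeholder values.
df = pd.DataFrame({
    "city name": ["Tokyo", "Cairo"],
    "country name": ["Japan", "Egypt"],
    "population (people)": ["1,000,000", "3,000,000"],
})

# Option 1: replace every column name at once.
# The list must have exactly one entry per existing column.
renamed_all = df.copy()
renamed_all.columns = ["City", "Country", "Population"]

# Option 2: rename only the columns listed in the mapping.
# rename returns a new DataFrame and leaves df unchanged.
renamed_some = df.rename(columns={"city name": "City"})

print(list(renamed_all.columns))
print(list(renamed_some.columns))
```

Assigning to df.columns is convenient when every name changes; df.rename is safer when only a few do, because it does not depend on column order.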
Selecting multiple columns in a pandas DataFrame
Let’s split our DataFrame into two DataFrames containing:
- City, Country, and Population.
- City, Area, and Density.
We can do this in several ways:
- By using .iloc[:, 0:3], where the first argument in the brackets selects all rows and the second argument selects columns 0, 1, and 2.
- By indexing the DataFrame with double brackets [[ ]] and listing the column names you want to select.
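A short sketch of the split, assuming the columns appear in the order City, Country, Population, Area, Density. The values are placeholders.

```python
import pandas as pd

# Illustrative sample only: placeholder values, assumed column order.
df = pd.DataFrame({
    "City": ["Tokyo", "Cairo"],
    "Country": ["Japan", "Egypt"],
    "Population": ["1,000,000", "3,000,000"],
    "Area": ["2,000", "3,000"],
    "Density": ["500", "1,000"],
})

# Positional selection: all rows, columns 0, 1 and 2.
df_population = df.iloc[:, 0:3]

# Label selection: the inner list names the columns to keep.
df_area = df[["City", "Area", "Density"]]

print(df_population)
print(df_area)
```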
Pandas merge two tables by column
Next we’ll join the two tables that we’ve created side by side. Both tables have the same City column, so we will use the pd.merge function to merge them on that column.
The left_on and right_on parameters indicate the column name to merge on in the first and second table, respectively.
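The merge could be sketched like this, using two small assumed tables with placeholder values that share a City column.

```python
import pandas as pd

# Illustrative sample only: placeholder values.
df_population = pd.DataFrame({
    "City": ["Tokyo", "Cairo"],
    "Country": ["Japan", "Egypt"],
    "Population": ["1,000,000", "3,000,000"],
})
df_area = pd.DataFrame({
    "City": ["Tokyo", "Cairo"],
    "Area": ["2,000", "3,000"],
    "Density": ["500", "1,000"],
})

# Join rows whose City values match. The default is an inner join,
# so only cities present in both tables are kept.
merged = pd.merge(df_population, df_area, left_on="City", right_on="City")

print(merged)
```

When the key column has the same name in both tables, on="City" is an equivalent shorthand; left_on and right_on are mainly useful when the names differ.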
Change column type in pandas with pandas apply
To work further with the DataFrame we need to transform the Population, Area, and Density columns from strings into numbers.
To do this we will:
- Create a function, to_int(), which will transform a string containing ‘,’ symbols into an integer.
- Use the apply function with a lambda expression.
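One possible shape for these two steps is shown below. The body of to_int() is an assumption (the text only names the function), and the sample values are placeholders.

```python
import pandas as pd

# Illustrative sample only: placeholder values stored as strings.
df = pd.DataFrame({
    "City": ["Tokyo", "Cairo"],
    "Population": ["1,000,000", "3,000,000"],
    "Area": ["2,000", "3,000"],
    "Density": ["500", "1,000"],
})


def to_int(value):
    """Strip thousands separators and convert the string to an int."""
    return int(value.replace(",", ""))


# Apply the conversion to each value of each numeric column.
for column in ["Population", "Area", "Density"]:
    df[column] = df[column].apply(lambda value: to_int(value))

print(df.dtypes)
```

For large columns, the vectorized string methods, for example df[column].str.replace(",", "", regex=False).astype(int), do the same job without calling a Python function per value.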
Groupby and turn into a DataFrame
Let’s now group the DataFrame by Country and sum the population of each country in this data sample.
The difficulty with df.groupby is that it returns a GroupBy object, not a DataFrame. In the example below, we’ll show how to create a DataFrame from a groupby object.
We’ll group by Country, at the same time calculating the sums for the Population and Area columns. We’ll drop the Density column as we don’t need it anymore.
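A sketch of the grouping, assuming the numeric columns have already been converted from strings. City and country names and all numbers are placeholders.

```python
import pandas as pd

# Illustrative sample only: placeholder names and numbers.
df = pd.DataFrame({
    "City": ["A1", "A2", "B1"],
    "Country": ["A", "A", "B"],
    "Population": [1000, 2000, 5000],
    "Area": [10, 20, 50],
    "Density": [100, 100, 100],
})

# groupby alone returns a DataFrameGroupBy object, not a DataFrame.
grouped = df.drop(columns="Density").groupby("Country")

# Aggregating produces a DataFrame indexed by Country;
# reset_index turns Country back into a regular column.
by_country = grouped[["Population", "Area"]].sum().reset_index()

print(by_country)
```

Passing as_index=False to groupby is an alternative to calling reset_index afterwards.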
How to iterate over rows in a DataFrame in pandas
Though iterating over rows might not be the fastest solution, it can still sometimes come in handy. You can do this by looping over the rows returned by the .iterrows() method.
Consider doing the same operation with an apply function or a vectorized pandas operation instead. On big datasets, this will speed up the calculations.
Below we’ll divide the Population column by 1000 and get the population numbers in thousands. There are 3 alternative code examples below.
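The three alternatives might look as follows. The result column names (Population_loop and so on) are hypothetical and exist only so the approaches can be compared side by side; the values are placeholders.

```python
import pandas as pd

# Illustrative sample only: placeholder values.
df = pd.DataFrame({
    "City": ["Tokyo", "Cairo"],
    "Population": [1000000, 2500000],
})

# 1. Explicit loop with iterrows: yields (index, row) pairs.
thousands = []
for _, row in df.iterrows():
    thousands.append(row["Population"] / 1000)
df["Population_loop"] = thousands

# 2. apply: calls the lambda once per value.
df["Population_apply"] = df["Population"].apply(lambda value: value / 1000)

# 3. Vectorized: one operation on the whole column.
df["Population_vec"] = df["Population"] / 1000

print(df)
```

All three produce the same numbers; the vectorized form is usually the shortest and the fastest on large data.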
How to select rows from a DataFrame based on column values
Let’s select countries with a population of more than 10 million people and an area of less than 2000 square kilometers.
You can do this by entering logical conditions within [], wrapping each condition in parentheses and combining them with &.
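A sketch of this filter on an assumed per-country table with placeholder names and numbers:

```python
import pandas as pd

# Illustrative sample only: placeholder names and numbers.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population": [12_000_000, 15_000_000, 5_000_000],
    "Area": [1500, 3000, 1000],
})

# Each condition yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
selected = df[(df["Population"] > 10_000_000) & (df["Area"] < 2000)]

print(selected)
```

Use | for "or" and ~ for "not"; the Python keywords and, or, and not do not work element-wise on a Series.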
How to change the order of your DataFrame columns
You can do this simply by selecting the columns of your existing DataFrame in the new order.
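For instance, on an assumed three-column sample with placeholder values:

```python
import pandas as pd

# Illustrative sample only: placeholder values.
df = pd.DataFrame({
    "City": ["Tokyo", "Cairo"],
    "Country": ["Japan", "Egypt"],
    "Population": [1000000, 3000000],
})

# List the column names in the order you want them to appear.
reordered = df[["Country", "City", "Population"]]

print(reordered)
```

Any column left out of the list is dropped from the result, so include every column you want to keep.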
Cleaning up data with pandas
To start working with data, you need to clean it up.
The first basic steps are:
- Drop duplicates in a DataFrame.
- Fill empty cells with meaningful values or drop columns with a lot of empty values.
- Get statistics on the column values.
Let’s download the dataset with the tennis game results.
We’ll drop any duplicates with df.drop_duplicates, with inplace=True applying the changes directly to the DataFrame.
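A minimal sketch on a stand-in table follows. The tennis dataset itself is not reproduced here, so the rows below are placeholders that reuse two of its column names.

```python
import pandas as pd

# Stand-in for the tennis dataset: placeholder rows, one exact duplicate.
matches = pd.DataFrame({
    "winner_rank": [1, 1, 5],
    "loser_rank": [10, 10, 20],
})

# With inplace=True the method modifies matches and returns None.
matches.drop_duplicates(inplace=True)

print(matches)
```

The surviving rows keep their original index labels; call reset_index(drop=True) afterwards if you need a contiguous index.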
Now let’s find out whether there are NaN values in our DataFrame.
df.isna().any() is True for each column that contains NaN values.
In the minutes column we have 91% NaN values, so we’ll drop this column because it doesn’t contain any useful information.
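The check and the drop can be sketched as follows. The stand-in table and its values are placeholders, so the NaN share here is 75% rather than the 91% found in the real data.

```python
import numpy as np
import pandas as pd

# Stand-in for the tennis dataset: placeholder values.
matches = pd.DataFrame({
    "winner_rank": [1.0, 5.0, 3.0, 8.0],
    "minutes": [np.nan, np.nan, np.nan, 95.0],
})

# True for every column that contains at least one NaN.
has_nan = matches.isna().any()

# Share of NaN values per column (the mean of a boolean column).
nan_share = matches.isna().mean()

# Drop the mostly empty column; drop returns a new DataFrame.
matches_clean = matches.drop(columns=["minutes"])

print(has_nan)
print(nan_share)
```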
The winner_age, loser_age, loser_rank, and winner_rank columns don’t have many NaN values, so we’ll replace the NaN values with each column’s median.
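One way to fill the gaps, shown on a stand-in table with placeholder values:

```python
import numpy as np
import pandas as pd

# Stand-in for the tennis dataset: placeholder values.
matches = pd.DataFrame({
    "winner_age": [20.0, np.nan, 30.0],
    "loser_age": [25.0, 27.0, np.nan],
    "winner_rank": [1.0, 3.0, np.nan],
    "loser_rank": [np.nan, 10.0, 20.0],
})

columns = ["winner_age", "loser_age", "winner_rank", "loser_rank"]

# median() returns one value per column (NaN is skipped by default);
# fillna matches those values to the columns by name.
matches[columns] = matches[columns].fillna(matches[columns].median())

print(matches)
```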
With df.describe we can get summary statistics for the numeric columns.
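A short sketch on a stand-in table. The winner_name column is hypothetical and is included only to show that text columns are skipped by default; the values are placeholders.

```python
import pandas as pd

# Stand-in for the tennis dataset: placeholder values.
matches = pd.DataFrame({
    "winner_name": ["Player A", "Player B", "Player C"],  # hypothetical column
    "winner_age": [20.0, 25.0, 30.0],
    "winner_rank": [1.0, 3.0, 2.0],
})

# describe() reports count, mean, std, min, quartiles and max
# for the numeric columns.
stats = matches.describe()

print(stats)
```

Passing include="all" to describe adds the non-numeric columns, with counts and most frequent values instead of means.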
That is it for our pandas tutorial. We’ve tried to provide answers to many of the most common questions people have when they are just starting out with pandas. Tell us in the comments about any other topics you’d like us to cover in future tutorials.