Pandas Tutorial: 10 Popular Questions for Python Data Frames

Pandas is one of the first libraries you will learn about when you start working with Python for data analysis and data science. The pandas library helps you work with datasets, transform and clean up your data, and get statistics.

In this tutorial, we will answer 10 of the most frequently asked questions people have when working with pandas. The questions covered in this tutorial mostly come from Stack Overflow.

Dataset

In the first part of this tutorial, we will work with a dataset containing sample data on city populations, along with the land area and population density of each city.

Pandas loc and iloc

DataFrame.loc[] accesses a group of rows and columns by label or with a boolean array.

Let’s select the row for Mexico City by matching on the city name.

Passing a column label as the second argument narrows the result to only the population of Mexico City.
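A minimal sketch of both selections follows. The sample rows and figures are invented placeholders rather than the tutorial's actual dataset, and the variable names are illustrative.

```python
import pandas as pd

# Invented sample rows; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo", "Mumbai"],
    "Country": ["Mexico", "Japan", "India"],
    "Population": ["9,000,000", "14,000,000", "12,000,000"],
})

# Boolean mask: True for the rows whose City equals "Mexico City".
is_mexico_city = df["City"] == "Mexico City"

# All columns of the matching row(s).
mexico_city_row = df.loc[is_mexico_city]

# Row selector plus a column label: only the Population value(s).
mexico_city_population = df.loc[is_mexico_city, "Population"]
print(mexico_city_population)
```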

With .iloc[] you can select rows and columns by their integer positions.

A few things to keep in mind:

  • A plain : is used to select all data across rows/columns.
  • 0:2 will select rows/columns 0 and 1. 2 is not included.
  • -1 will select the last element.
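The three rules above can be tried on a small frame. This is only a sketch on invented placeholder data; the variable names are illustrative.

```python
import pandas as pd

# Invented sample rows; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo", "Mumbai"],
    "Country": ["Mexico", "Japan", "India"],
    "Population": ["9,000,000", "14,000,000", "12,000,000"],
})

first_column = df.iloc[:, 0]      # every row, column 0 (City)
first_two_rows = df.iloc[0:2, :]  # rows 0 and 1, every column; row 2 is excluded
last_row = df.iloc[-1]            # the last row, returned as a Series
last_column = df.iloc[:, -1]      # every row, last column (Population)
```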


Renaming columns in pandas

Next we’ll rename the columns to make them easier to access in the future.

There are a few ways to do this:

  • Assign a list of new column names directly to df.columns (the list must cover every column).
  • Use df.rename to rename specific columns.
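Both options might look like the sketch below. The original column names here ("city name" and so on) are hypothetical, since the raw headers of the dataset are not shown in this text.

```python
import pandas as pd

# Hypothetical raw headers and a placeholder row, for illustration only.
df = pd.DataFrame({
    "city name": ["Mexico City"],
    "country name": ["Mexico"],
    "population (people)": ["9,000,000"],
})

# Option 1: replace every column name at once.
renamed_all = df.copy()
renamed_all.columns = ["City", "Country", "Population"]

# Option 2: rename only selected columns; rename() returns a new DataFrame.
renamed_some = df.rename(columns={"population (people)": "Population"})
```

Assigning to df.columns is convenient when every name changes, while df.rename leaves the columns you do not mention untouched.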


Selecting multiple columns in a pandas DataFrame

Let’s split our DataFrame into two DataFrames containing:

  1. City, Country, and Population.
  2. City, Area, and Density.

We can do this in several ways:

  • By using .iloc[:, 0:3], where the first argument in the brackets selects all rows and the second argument selects column 0, column 1, and column 2.
  • By indexing the DataFrame with double brackets [[ ]] that contain the column names you want to select.
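As a rough illustration, the split could be written as follows. The rows are invented placeholders, and the names df_population and df_area are assumptions made for this sketch.

```python
import pandas as pd

# Invented sample rows; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Country": ["Mexico", "Japan"],
    "Population": ["9,000,000", "14,000,000"],
    "Area": ["1,500", "2,000"],
    "Density": ["6,000", "7,000"],
})

# By position: all rows, columns 0, 1, and 2.
df_population = df.iloc[:, 0:3]

# By name: a list of column labels inside the indexing brackets.
df_area = df[["City", "Area", "Density"]]
```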



Pandas merge two tables by column

Next we’ll join the two tables that we’ve created. Both tables have a City column, so we will use the pd.merge function to combine them side by side, matching rows on City.

The left_on and right_on parameters name the column to merge on in the first (left) and second (right) table, respectively.
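A small sketch of such a merge is shown here, assuming two frames named df_population and df_area with invented placeholder values.

```python
import pandas as pd

# Invented sample rows; the figures are placeholders, not real statistics.
df_population = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Country": ["Mexico", "Japan"],
    "Population": ["9,000,000", "14,000,000"],
})
df_area = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Area": ["1,500", "2,000"],
    "Density": ["6,000", "7,000"],
})

# Join rows whose City values match. The default is an inner join,
# so only cities present in both tables are kept.
merged = pd.merge(df_population, df_area, left_on="City", right_on="City")
print(merged)
```

When the key column has the same name in both tables, on="City" is an equivalent shorthand; left_on and right_on are mainly useful when the names differ.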

Change column type in pandas with pandas apply

To work further with the DataFrame we need to transform the Population, Area, and Density columns from strings into numbers.

To do this we will:

  1. Create a function, to_int(), which converts a string containing ‘,’ separators into an integer.
  2. Run it on every value of a column by using the apply function with a lambda expression.
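The two steps above can be sketched like this. The body of to_int() is an assumption about how such a helper might be written, and the data is an invented placeholder sample.

```python
import pandas as pd

# Invented sample rows; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Population": ["9,000,000", "14,000,000"],
    "Area": ["1,500", "2,000"],
    "Density": ["6,000", "7,000"],
})

def to_int(value: str) -> int:
    """Strip thousands separators and convert the string to an integer."""
    return int(value.replace(",", ""))

# Apply the conversion to every value of each string column.
for column in ["Population", "Area", "Density"]:
    df[column] = df[column].apply(lambda value: to_int(value))

print(df.dtypes)
```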



Groupby and turn into a DataFrame

Let’s now group the DataFrame by Country and sum the population of each country in this data sample.
The difficulty with df.groupby is that it returns a GroupBy object, not a DataFrame. Applying an aggregation such as sum() and then resetting the index turns the result back into a regular DataFrame.

We’ll group by Country, at the same time calculating the sums for the Population and Area columns. We’ll drop the Density column as we don’t need it anymore.
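One way to do this is sketched below on invented placeholder data; the name by_country is an assumption made for the example.

```python
import pandas as pd

# Invented sample rows; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "City": ["Tokyo", "Osaka", "Mexico City"],
    "Country": ["Japan", "Japan", "Mexico"],
    "Population": [14_000_000, 2_700_000, 9_000_000],
    "Area": [2_000, 200, 1_500],
    "Density": [7_000, 13_500, 6_000],
})

by_country = (
    df.drop(columns=["Density"])         # Density is no longer needed
      .groupby("Country")                # a GroupBy object, not yet a DataFrame
      [["Population", "Area"]].sum()     # aggregate: Country becomes the index
      .reset_index()                     # move Country back into a column
)
print(by_country)
```

Passing as_index=False to groupby is an alternative to calling reset_index() afterwards.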


How to iterate over rows in a DataFrame in pandas

Though iterating over rows might not be the fastest solution, it can still sometimes come in handy. You can do this by looping over the rows returned by the .iterrows() method.

Where possible, consider doing the same operation with apply or with a vectorized operation on the whole column. On big datasets, this is usually much faster.

Below we’ll divide the Population column by 1000 to get the population numbers in thousands. There are three alternative approaches: iterrows, apply, and a vectorized operation.
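The three approaches could look roughly like this. The data is an invented placeholder sample, and the new column names are assumptions chosen so the results can be compared side by side.

```python
import pandas as pd

# Invented sample rows; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "Country": ["Japan", "Mexico"],
    "Population": [14_000_000, 9_000_000],
})

# 1. Explicit loop over rows with iterrows(): simple, but the slowest option.
thousands = []
for _, row in df.iterrows():
    thousands.append(row["Population"] / 1000)
df["Population_k_loop"] = thousands

# 2. apply(): calls the function once per value.
df["Population_k_apply"] = df["Population"].apply(lambda p: p / 1000)

# 3. Vectorized: one operation on the whole column.
df["Population_k_vec"] = df["Population"] / 1000

print(df)
```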



How to select rows from a DataFrame based on column values

Let’s select countries with a population of more than 10 million people and an area of less than 2,000 square kilometers.
You can do this by placing boolean conditions inside []. Wrap each condition in parentheses and combine them with & (and) or | (or).
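Here is a minimal sketch of that filter. The country names and figures are invented placeholders.

```python
import pandas as pd

# Invented placeholder data, not real statistics.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "Population": [16_000_000, 9_000_000, 12_000_000],
    "Area": [1_500, 1_000, 3_000],
})

# Each condition yields a boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds more tightly than > and <.
selected = df[(df["Population"] > 10_000_000) & (df["Area"] < 2000)]
print(selected)
```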

How to change the order of your DataFrame columns

You can do this simply by selecting the columns of your existing DataFrame in the new order.
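For instance, on an invented placeholder frame, the reordering might be written as:

```python
import pandas as pd

# Invented placeholder data, not real statistics.
df = pd.DataFrame({
    "Country": ["Japan", "Mexico"],
    "Population": [16_700_000, 9_000_000],
    "Area": [2_200, 1_500],
})

# List the column names in the order you want them to appear.
reordered = df[["Country", "Area", "Population"]]
print(reordered.columns.tolist())
```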

Cleaning up data with pandas

To start working with data, you need to clean it up.

The first basic steps are:

  • Drop duplicates in a DataFrame.
  • Fill empty cells with meaningful values or drop columns with a lot of empty values.
  • Get statistics on the column values.

Let’s download a dataset of tennis match results.

We’ll drop any duplicates with df.drop_duplicates. Passing inplace=True applies the changes to the DataFrame itself instead of returning a new one.
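A short sketch on a tiny invented frame is shown below. The name matches and the values are placeholders, not rows from the real tennis dataset.

```python
import pandas as pd

# Invented placeholder rows; rows 0 and 1 are exact duplicates.
matches = pd.DataFrame({
    "winner_rank": [1, 1, 5],
    "loser_rank": [10, 10, 20],
})

# Remove repeated rows, keeping the first occurrence, and modify
# the DataFrame in place rather than returning a copy.
matches.drop_duplicates(inplace=True)
print(len(matches))
```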

Now let’s find out whether there are NaN values in our DataFrame.

df.isna().any() returns True for each column that contains at least one NaN value.
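The check can be sketched as follows on invented placeholder values. The extra nan_share line is an addition for illustration: it shows one way to get the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Invented placeholder rows, not taken from the real tennis dataset.
matches = pd.DataFrame({
    "winner_rank": [1.0, np.nan, 5.0],
    "loser_rank": [10.0, 12.0, 20.0],
})

# One boolean per column: does it contain at least one NaN?
has_nan = matches.isna().any()

# Fraction of missing values per column (the mean of a boolean column).
nan_share = matches.isna().mean()

print(has_nan)
print(nan_share)
```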


In the minutes column, 91% of the values are NaN, so we’ll drop this column because it carries very little usable information.
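Dropping a column might look like this. The frame is an invented placeholder sample and only mimics a mostly empty minutes column.

```python
import numpy as np
import pandas as pd

# Invented placeholder rows, not taken from the real tennis dataset.
matches = pd.DataFrame({
    "winner_rank": [1.0, 3.0, 5.0],
    "minutes": [np.nan, np.nan, 95.0],
})

# drop() returns a new DataFrame without the listed columns.
matches = matches.drop(columns=["minutes"])
print(matches.columns.tolist())
```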


The winner_age, loser_age, loser_rank, and winner_rank columns don’t have many NaN values, so we’ll replace the NaN values in each column with that column’s median.
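As a hedged sketch, the replacement could be done as follows. The values are invented placeholders, and the list name columns is an assumption.

```python
import numpy as np
import pandas as pd

# Invented placeholder rows, not taken from the real tennis dataset.
matches = pd.DataFrame({
    "winner_age": [25.0, np.nan, 31.0],
    "loser_age": [30.0, 27.0, np.nan],
    "winner_rank": [1.0, 3.0, np.nan],
    "loser_rank": [np.nan, 10.0, 20.0],
})

columns = ["winner_age", "loser_age", "winner_rank", "loser_rank"]

# median() skips NaN by default and returns one value per column;
# fillna() then fills each column with its own median.
matches[columns] = matches[columns].fillna(matches[columns].median())
print(matches)
```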




With df.describe() we can get summary statistics for the numeric columns.
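A minimal illustration on invented placeholder values:

```python
import pandas as pd

# Invented placeholder rows, not taken from the real tennis dataset.
matches = pd.DataFrame({
    "winner_age": [25.0, 28.0, 31.0],
    "winner_rank": [1.0, 3.0, 5.0],
})

# Count, mean, standard deviation, min, quartiles, and max per numeric column.
summary = matches.describe()
print(summary)
```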

That is it for our pandas tutorial. We’ve tried to provide answers to many of the most common questions people have when they are just starting out with pandas. Tell us in the comments about any other topics you’d like us to cover in future tutorials.
