Pandas Tutorial: 10 Popular Questions for Python Data Frames

Alena Guzharina

Pandas is one of the first libraries you will learn about when you start working with Python for data analysis and data science. The pandas library helps you work with datasets, transform and clean up your data, and get statistics.

In this tutorial, we will answer 10 of the most frequently asked questions people have when working with pandas. The questions covered in this tutorial mostly come from Stack Overflow.

Dataset

In the first part of this tutorial, we will work with a dataset containing sample city population data, along with each city's land area and population density.

Pandas loc and iloc

DataFrame.loc[] accesses a group of rows and columns by label or by a boolean array.

Let’s select the population for Mexico City.

Below we’ll print only the population of Mexico City.
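As a minimal sketch of the selection described above: the small DataFrame here is an illustrative stand-in with round placeholder figures, not the tutorial's dataset, and the column names City, Country, and Population are assumed.

```python
import pandas as pd

# Illustrative stand-in data: placeholder figures, not the tutorial's dataset.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo", "Cairo"],
    "Country": ["Mexico", "Japan", "Egypt"],
    "Population": ["9,000,000", "13,000,000", "10,000,000"],
})

# Boolean mask: True only for the Mexico City row.
is_mexico_city = df["City"] == "Mexico City"

# Matching rows, all columns (a DataFrame).
mexico_city_row = df.loc[is_mexico_city]

# Matching rows, only the Population column (a Series).
mexico_city_population = df.loc[is_mexico_city, "Population"]
print(mexico_city_population)
```

The first argument to .loc[] chooses rows and the optional second argument chooses columns, so the same mask can return either the whole row or a single value.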

With .iloc[] you can select rows and columns by their integer positions.

A few things to keep in mind:

A plain : is used to select all data across rows/columns.
0:2 will select rows/columns 0 and 1. 2 is not included.
-1 will select the last element.
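The slicing rules above can be sketched as follows. The DataFrame is an illustrative stand-in with placeholder figures, and the variable names are hypothetical.

```python
import pandas as pd

# Illustrative stand-in data: placeholder figures, not the tutorial's dataset.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo", "Cairo"],
    "Country": ["Mexico", "Japan", "Egypt"],
    "Population": ["9,000,000", "13,000,000", "10,000,000"],
})

all_rows_first_two_cols = df.iloc[:, 0:2]  # every row; columns 0 and 1
first_two_rows = df.iloc[0:2, :]           # rows 0 and 1; every column
last_row = df.iloc[-1]                     # last row, returned as a Series
last_col = df.iloc[:, -1]                  # last column, returned as a Series
```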

Renaming columns in pandas

Next we’ll rename the columns to make them easier to access in the future.

There are a few ways to do this:

Directly assigning df.columns a list of column names.
Using df.rename to rename specific columns.
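Both approaches might look like the sketch below. The raw column names (city_name, country_name, pop) are hypothetical, since the original headers are not shown here.

```python
import pandas as pd

# Hypothetical raw column names; the values are placeholders.
df = pd.DataFrame({
    "city_name": ["Mexico City", "Tokyo"],
    "country_name": ["Mexico", "Japan"],
    "pop": ["9,000,000", "13,000,000"],
})

# Option 1: replace every column name at once.
# The list length must match the number of columns.
df_all = df.copy()
df_all.columns = ["City", "Country", "Population"]

# Option 2: rename only selected columns.
# rename returns a new DataFrame and leaves df unchanged.
df_some = df.rename(columns={"pop": "Population"})
```

Assigning to df.columns is convenient when every name changes, while df.rename is safer when only a few do, because it does not depend on column order.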

Selecting multiple columns in a pandas DataFrame

Let’s split our DataFrame into two DataFrames containing:

City, Country, and Population.
City, Area, and Density.

We can do this in several ways:

By using .iloc[:, 0:3], where the first argument in the brackets selects all rows and the second argument selects column 0, column 1, and column 2.
By slicing the DataFrame with double [] and entering the column names you want to select.
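A short sketch of both selection styles, assuming the columns appear in the order City, Country, Population, Area, Density. The values are placeholders.

```python
import pandas as pd

# Illustrative stand-in data: placeholder figures, not the tutorial's dataset.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Country": ["Mexico", "Japan"],
    "Population": ["9,000,000", "13,000,000"],
    "Area": ["1,500", "2,000"],
    "Density": ["6,000", "6,500"],
})

# By position: all rows, columns 0, 1, and 2.
df_population = df.iloc[:, 0:3]

# By name: the inner [] is a list of column labels.
df_area = df[["City", "Area", "Density"]]
```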

Pandas merge two tables by column

Next we’ll join the two tables that we’ve created side by side. Both tables have a City column, so we will use the pd.merge function to match their rows on it.

The left_on and right_on parameters name the column to merge on in the first (left) and second (right) table, respectively.
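The following sketch merges two small stand-in tables on City. The rows of the second table are deliberately in a different order to show that pd.merge matches on the key values, not on row position. The data is illustrative only.

```python
import pandas as pd

# Illustrative stand-in tables: placeholder figures, not the tutorial's dataset.
df_population = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Country": ["Mexico", "Japan"],
    "Population": ["9,000,000", "13,000,000"],
})
df_area = pd.DataFrame({
    "City": ["Tokyo", "Mexico City"],
    "Area": ["2,000", "1,500"],
    "Density": ["6,500", "6,000"],
})

# Inner join (the default) on the City column of both tables.
merged = pd.merge(df_population, df_area, left_on="City", right_on="City")
print(merged)
```

When the key column has the same name in both tables, on="City" is an equivalent shorthand.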

Change column type in pandas with pandas apply

To work further with the DataFrame we need to transform the Population, Area, and Density columns from strings into numbers.

To do this we will:

Create a function, to_int(), which will transform the string with ‘,’ symbols into integer numbers.
Use the apply function with the lambda expression.
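The two steps above might be sketched like this. The body of to_int() is an assumption about how such a helper could work (strip the commas, then call int), and the data is illustrative.

```python
import pandas as pd

# Illustrative stand-in data: numbers stored as strings with thousands separators.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Population": ["9,000,000", "13,000,000"],
    "Area": ["1,500", "2,000"],
    "Density": ["6,000", "6,500"],
})

def to_int(value: str) -> int:
    """Convert a string such as '9,000,000' to the integer 9000000."""
    return int(value.replace(",", ""))

# Apply the conversion to each value of each numeric-looking column.
for column in ["Population", "Area", "Density"]:
    df[column] = df[column].apply(lambda value: to_int(value))

print(df.dtypes)
```

The lambda is kept to mirror the step described above; passing to_int directly to apply would work the same way.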

Groupby and turn into a DataFrame

Let’s now group the DataFrame by Country and sum the population of each country in this data sample.
The difficulty with df.groupby is that it returns a DataFrameGroupBy object, not a DataFrame. In the example below, we’ll show how to create a DataFrame from a groupby object.

We’ll group by Country, at the same time calculating the sums for the Population and Area columns. We’ll drop the density column as we don’t need it anymore.
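One way to get from a groupby object back to a regular DataFrame is to aggregate and then call reset_index(), as in this sketch. The country and city names are placeholders.

```python
import pandas as pd

# Placeholder data: two cities in Country A, one in Country B.
df = pd.DataFrame({
    "City": ["City A1", "City A2", "City B1"],
    "Country": ["Country A", "Country A", "Country B"],
    "Population": [100, 200, 300],
    "Area": [10, 20, 30],
    "Density": [10, 10, 10],
})

# groupby alone returns a DataFrameGroupBy object, not a DataFrame.
grouped = df.groupby("Country")

# Select only the columns to keep (this drops Density), sum them per group,
# and turn the Country index back into an ordinary column.
by_country = grouped[["Population", "Area"]].sum().reset_index()
print(by_country)
```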

How to iterate over rows in a DataFrame in pandas

Though iterating over rows is rarely the fastest solution, it can still come in handy. You can do this by looping over the rows returned by the .iterrows() method.

Consider doing the same operation with apply or, better, with a vectorized operation on the whole column. On large datasets, this is usually much faster.

Below we’ll divide the Population column by 1000 and get the population numbers in thousands. There are 3 alternative code examples below.
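The three alternatives could look like the sketch below. The output column names (Pop_k_loop, Pop_k_apply, Pop_k_vec) are hypothetical, and the data is illustrative.

```python
import pandas as pd

# Illustrative stand-in data with numeric populations.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Population": [9_000_000, 13_000_000],
})

# 1. Explicit loop: iterrows() yields (index, row) pairs, row being a Series.
thousands = []
for index, row in df.iterrows():
    thousands.append(row["Population"] / 1000)
df["Pop_k_loop"] = thousands

# 2. apply: call a function on each value of the column.
df["Pop_k_apply"] = df["Population"].apply(lambda p: p / 1000)

# 3. Vectorized: divide the whole column in one operation.
df["Pop_k_vec"] = df["Population"] / 1000
```

All three produce the same numbers; the vectorized form is the shortest and typically the fastest.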

How to select rows from a DataFrame based on column values

Let’s select countries with a population of more than 10 million people and an area of less than 2000 square kilometers.
You can do this by passing boolean conditions inside []. Wrap each condition in parentheses and combine them with & (and) or | (or).
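A minimal sketch of this filter, assuming Population and Area are already numeric. The rows are placeholders chosen so that only one of them satisfies both conditions.

```python
import pandas as pd

# Placeholder rows: only "City A" satisfies both conditions.
df = pd.DataFrame({
    "City": ["City A", "City B", "City C"],
    "Population": [12_000_000, 15_000_000, 5_000_000],
    "Area": [1_500, 3_000, 1_000],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
selected = df[(df["Population"] > 10_000_000) & (df["Area"] < 2000)]
print(selected)
```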

How to change the order of your DataFrame columns

You can do this simply by selecting the columns of your existing DataFrame in the order you want.
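For instance, with an illustrative three-column DataFrame (the variable name reordered is hypothetical):

```python
import pandas as pd

# Illustrative stand-in data.
df = pd.DataFrame({
    "City": ["Mexico City", "Tokyo"],
    "Country": ["Mexico", "Japan"],
    "Population": [9_000_000, 13_000_000],
})

# List the column names in the desired order.
reordered = df[["Country", "City", "Population"]]
```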

Cleaning up data with pandas

To start working with data, you need to clean it up.

The first basic steps are:

Drop duplicates in a DataFrame.
Fill empty cells with meaningful values or drop columns with a lot of empty values.
Get statistics on the column values.

Let’s download a dataset with tennis match results.

We’ll drop any duplicate rows with df.drop_duplicates. Passing inplace=True modifies the DataFrame in place instead of returning a new one.
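A small sketch of dropping duplicates. The rows and the winner_name and loser_name columns are hypothetical stand-ins for the tennis data.

```python
import pandas as pd

# Hypothetical stand-in rows; the first two are exact duplicates.
df = pd.DataFrame({
    "winner_name": ["Player A", "Player A", "Player B"],
    "loser_name": ["Player C", "Player C", "Player D"],
    "minutes": [90.0, 90.0, 120.0],
})

# Remove rows that repeat an earlier row in every column.
# inplace=True changes df itself and returns None.
df.drop_duplicates(inplace=True)
print(len(df))
```

Writing df = df.drop_duplicates() is an equivalent alternative that avoids inplace.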

Now let’s find out whether there are NaN values in our DataFrame.

df.isna().any() returns True for each column that contains at least one NaN value.
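The check can be sketched as follows, together with df.isna().mean(), which gives the share of missing values per column. The data is a hypothetical stand-in for the tennis dataset.

```python
import pandas as pd

nan = float("nan")

# Hypothetical stand-in data with some missing values.
df = pd.DataFrame({
    "winner_age": [25.0, nan, 30.0],
    "minutes": [nan, nan, 95.0],
    "winner_rank": [1, 2, 3],
})

# One boolean per column: does it contain any NaN?
has_nan = df.isna().any()

# Fraction of NaN values per column (the mean of a boolean column).
nan_share = df.isna().mean()

print(has_nan)
print(nan_share)
```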

In the minutes column we have 91% NaN values, so we’ll drop this column because it doesn’t contain any useful information.

The winner_age, loser_age, loser_rank, and winner_rank columns don’t have many NaN values, so we’ll replace the NaN values in each of them with that column’s median.
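Dropping a mostly empty column and filling the remaining gaps with the median might look like this sketch. Only two of the four columns are shown, and the values are placeholders.

```python
import pandas as pd

nan = float("nan")

# Hypothetical stand-in data: minutes is mostly empty, the others have few gaps.
df = pd.DataFrame({
    "minutes": [nan, nan, nan, 95.0],
    "winner_age": [24.0, nan, 28.0, 30.0],
    "loser_rank": [10.0, 20.0, nan, 40.0],
})

# Drop the column that is mostly NaN.
df = df.drop(columns=["minutes"])

# Fill each remaining gap with the median of its own column.
# median() ignores NaN values by default.
for column in ["winner_age", "loser_rank"]:
    df[column] = df[column].fillna(df[column].median())

print(df)
```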

With df.describe() we can get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns.
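A brief sketch on hypothetical stand-in data, showing that non-numeric columns are skipped by default:

```python
import pandas as pd

# Hypothetical stand-in data: one text column, one numeric column.
df = pd.DataFrame({
    "winner_name": ["Player A", "Player B", "Player C"],
    "winner_age": [24.0, 28.0, 30.0],
})

# By default, describe() summarizes only the numeric columns.
stats = df.describe()
print(stats)
```

The result is itself a DataFrame, so individual statistics can be read with .loc, for example stats.loc["max", "winner_age"].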

That is it for our pandas tutorial. We’ve tried to provide answers to many of the most common questions people have when they are just starting out with pandas. Tell us in the comments about any other topics you’d like us to cover in future tutorials.

Other tutorials and research

Getting Started Tutorial: Notebook, Video
Advanced Visualization Tutorial with Seaborn: Notebook
Visualization with Pyplot in Datalore: Notebook, Video
Analysis of 10,000,000 Jupyter notebooks: Blogpost, Published notebook
GPU models specification analysis
Developer ecosystem research for Python
