
Picking the Perfect Data Visualization: Barplots

This blogpost is the second in a series where we explain the most common data visualization types and how you can best use them to explore your data and tell its story. In this post, we’ll cover barplots, which can give us great insight into how different groups behave relative to each other.

In this blogpost, we will use the “Airline Delays from 2003-2016” dataset by Priank Ravichandar licensed under CC0 1.0. This dataset contains information on flight delays and cancellations in US airports from 2003 to 2016. All of the code for this blog post can be found in this repo.

Barplot with a single group

Barplots (or barcharts) are ideal for contrasting how the average value of some variable varies between groups. The strength of this chart type is its simplicity. While using the average value leaves out a lot of detail, it also makes it very easy to spot differences and similarities between groups.

Let’s start by creating a barplot that shows the proportion of flights delayed for each airport over the entire data period. We’ll create each chart with lets-plot, a Python plotting library that is a port of the popular ggplot2 library in R. As in the last blogpost, we start by reading the data and removing the first and last years (2003 and 2016), as they only contain partial data.

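import pandas as pd
from lets_plot import *  # provides ggplot, aes, geom_bar, etc. used below

LetsPlot.setup_html()  # needed to render charts inside a notebook

airlines = pd.read_csv("data/airlines.csv")
# Drop 2003 and 2016, which only contain partial data
airlines = airlines[~(airlines["TimeYear"].isin([2003, 2016]))]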

We now need to create a summary DataFrame that gives us the total proportion of flights that were delayed per airport over the whole time period.

flights_delayed_by_airport = (
    airlines[["AirportCode", "FlightsDelayed", "FlightsTotal"]]
    .groupby(["AirportCode"])
    .sum()
    .assign(PropFlightsDelayed=lambda x:
            x["FlightsDelayed"] / x["FlightsTotal"])
    .reset_index()
    .sort_values("PropFlightsDelayed", ascending=False)
)
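
To see why the counts are summed before dividing, here is a minimal sketch in plain Python. The numbers are made up for illustration and assume two reporting periods for a single hypothetical airport; they do not come from the dataset.

```python
# Hypothetical counts for one made-up airport: (delayed, total) per period.
period_counts = [
    (10, 100),     # quiet period: 10% of flights delayed
    (300, 1000),   # busy period: 30% of flights delayed
]

# Pooled proportion: sum the counts first, then divide.
# This mirrors the groupby-sum-then-divide approach used above.
total_delayed = sum(delayed for delayed, _ in period_counts)
total_flights = sum(total for _, total in period_counts)
pooled_proportion = total_delayed / total_flights

# Unweighted alternative: average the per-period proportions.
mean_of_proportions = (
    sum(delayed / total for delayed, total in period_counts)
    / len(period_counts)
)

print(round(pooled_proportion, 3))    # 310 / 1100
print(round(mean_of_proportions, 3))  # (0.1 + 0.3) / 2
```

The pooled figure weights each period by how many flights it had, so busy periods count for more. Averaging the per-period proportions would treat a quiet period and a busy one as equally important.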

We’re now ready to make our barplot. As we want to compare delays between airports, we use AirportCode for our x-axis, and the proportion of all flights that were delayed, PropFlightsDelayed, for the y-axis. In order to make the pattern a little easier to see, we’re going to turn this plot into a horizontal barplot using coord_flip().

(
    ggplot(flights_delayed_by_airport,
           aes(x="AirportCode", y="PropFlightsDelayed"))
    + geom_bar(stat="identity", fill="#b3cde3")
    + coord_flip()
    + xlab("Airport Code")
    + ylab("Flights delayed (proportion)")
    + ggtitle("Proportion of flights delayed in US airports, 2004-2015")
)

This plot allows us to get a really good sense of how airports compare in terms of how many of their flights are delayed. Most airports have less than 20% of their flights delayed over the whole data period, but there are some clear outliers. Salt Lake City (SLC) has only 15% of flights delayed, while a whopping 29% of flights at Newark (EWR) were delayed.

Barplot with multiple groups

If you want to explore your data a bit more deeply, you can also group your barplots by an additional categorical variable. This can allow you to get further insight into why groups differ from each other. For instance, we might want to know why flights are getting delayed at each airport. To make this plot, we first need to create another summary DataFrame, this time finding the proportion of flights that were delayed for each combination of airport and cause of delay.

delays_by_airport_and_cause = (
    airlines[["AirportCode", "NumDelaysLateAircraft",
              "NumDelaysWeather", "NumDelaysSecurity",
              "NumDelaysCarrier", "FlightsTotal"]]
    .groupby("AirportCode")
    .sum()
    .reset_index()
)

delays_by_airport_and_cause = (
    pd.melt(delays_by_airport_and_cause,
            id_vars=["AirportCode", "FlightsTotal"],
            value_vars=["NumDelaysLateAircraft", "NumDelaysWeather",
                        "NumDelaysSecurity", "NumDelaysCarrier"],
            var_name="TypeOfDelay",
            value_name="NumberDelays")
    .assign(TypeOfDelay=lambda x: x["TypeOfDelay"].str.replace("NumDelays", ""))
    .assign(PropFlightsDelayed=lambda x: x["NumberDelays"] / x["FlightsTotal"])
    .assign(PropTypeOfDelay=lambda x: x["NumberDelays"] / x.groupby("AirportCode")["NumberDelays"].transform("sum"))
)
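
If the reshaping step is hard to picture, the following sketch runs the same melt on a tiny made-up table. The airport codes "AAA" and "BBB" and all of the counts are hypothetical, and only two delay causes are included to keep the output short.

```python
import pandas as pd

# Made-up wide table: one row per airport, one column per delay cause.
wide_df = pd.DataFrame({
    "AirportCode": ["AAA", "BBB"],   # hypothetical airport codes
    "FlightsTotal": [1000, 2000],
    "NumDelaysWeather": [50, 40],
    "NumDelaysCarrier": [100, 160],
})

# melt stacks the cause columns into (TypeOfDelay, NumberDelays) pairs,
# repeating the id_vars columns on every resulting row.
long_df = pd.melt(
    wide_df,
    id_vars=["AirportCode", "FlightsTotal"],
    value_vars=["NumDelaysWeather", "NumDelaysCarrier"],
    var_name="TypeOfDelay",
    value_name="NumberDelays",
).assign(
    # Strip the common prefix so the labels read "Weather", "Carrier".
    TypeOfDelay=lambda x: x["TypeOfDelay"].str.replace("NumDelays", ""),
    PropFlightsDelayed=lambda x: x["NumberDelays"] / x["FlightsTotal"],
)

print(long_df)
```

The two-row wide table becomes a four-row long table with one row per airport and cause. This is the shape lets-plot needs in order to map a single column to the fill colour.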

We can now make our plot. Since there won’t be room to fit every airport on the chart, we’ll pick five airports: Salt Lake City, Newark, Denver (DEN), New York (JFK), and San Francisco (SFO). As with the previous barplot, we include AirportCode as the x-axis variable and PropFlightsDelayed as the y-axis variable, but this time we include TypeOfDelay under the argument fill, which tells lets-plot that we want to show separate bars for each of the delay reasons.

(
    ggplot(
        delays_by_airport_and_cause[(delays_by_airport_and_cause["AirportCode"].isin(
            ["EWR", "SLC", "DEN", "JFK", "SFO"]))],
        aes(x="AirportCode", y="PropFlightsDelayed", fill="TypeOfDelay")
    )
    + geom_bar(stat="identity", position="dodge")
    + xlab("Airport Code")
    + ylab("Flights delayed (proportion)")
    + ggtitle("Proportion of flights delayed by cause in US airports, 2004-2015")
    + scale_fill_brewer(type="qual", palette="Pastel1", name="Cause of delay",
                        labels=["Late aircraft", "Weather", "Security", "Carrier"])
    + ggsize(1400, 900)
)

For all airports, late aircraft are the biggest contributor to flight delays, and security-related delays are the least common. Weather-related delays are more common at the northeastern airports (EWR and JFK) than at those located in the western part of the country. Interestingly, although EWR has the highest overall rate of delays, it has the lowest rate of carrier-related delays compared to the other airports.

Stacked barplots

The grouped barplot gives us some interesting insights, but what if we want to directly compare the proportion of delayed flights by cause between airports? It’s a bit hard to see this on the grouped barplot, but we can use another type of barplot called a stacked barplot to see this better. In this barplot, we can break down the proportion of total delays for each airport by their cause, with the understanding that each airport’s bar adds up to 100%. Let’s see how this works.

(
    ggplot(delays_by_airport_and_cause,
           aes(x="AirportCode", y="PropTypeOfDelay", fill="TypeOfDelay"))
    + geom_bar(stat="identity")
    + xlab("Airport Code")
    + ylab("Proportion of delayed flights")
    + ggtitle("Division of delayed US flights by cause, 2004-2015")
    + scale_fill_brewer(type="qual", palette="Pastel1", name="Cause of delay",
                        labels=["Late aircraft", "Weather", "Security", "Carrier"])
    + ggsize(1400, 800)
)

Now it’s much easier to compare the causes of delays across the airports. We can see that Chicago O’Hare (ORD) and Chicago Midway (MDW) are particularly affected by delays from late carriers. It also appears that airports in warmer, drier areas tend to be less affected by weather delays.
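
As a sanity check on the claim that each airport’s bar adds up to 100%, the sketch below repeats the PropTypeOfDelay calculation on made-up numbers and sums the shares per airport. The airport codes and counts are hypothetical.

```python
import pandas as pd

# Made-up long-format delay counts for two hypothetical airports.
delays = pd.DataFrame({
    "AirportCode": ["AAA", "AAA", "BBB", "BBB"],
    "TypeOfDelay": ["Weather", "Carrier", "Weather", "Carrier"],
    "NumberDelays": [50, 150, 40, 60],
})

# transform("sum") returns each airport's total aligned to the original rows,
# so the division gives each cause's share of that airport's delays.
airport_totals = delays.groupby("AirportCode")["NumberDelays"].transform("sum")
delays["PropTypeOfDelay"] = delays["NumberDelays"] / airport_totals

# Each airport's shares should add up to 1, i.e. a full-height stacked bar.
share_sums = delays.groupby("AirportCode")["PropTypeOfDelay"].sum()
print(share_sums)
```

Because the shares are normalised within each airport, the stacked bars all reach the same height. They show how delays are split between causes, not how many delays an airport has overall.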

With that, we’ve covered barplots! This chart type can really help you gain insight into the relative differences between groups, and it forms the launchpad for more detailed exploration of your data. In the next blog post, we’ll look at boxplots, which allow us to capture a lot of the detail left out by barplots.
