Picking the Perfect Data Visualization: Line Plots
Data visualizations are one of the most powerful tools when exploring and presenting data. However, when you first start using visualizations, it’s easy to get overwhelmed by the huge number of plots you can make. In this series of blog posts, we’ll go over five of the most commonly used visualizations, and how they can help you tell your data’s story. First up, we’ll cover line plots.
In this blog post, we’ll use the “Airline Delays from 2003–2016” dataset by Priank Ravichandar, licensed under CC0 1.0. This dataset contains information on flight delays and cancellations in US airports from 2003 to 2016. The code for this blog post can be found in this repo.
Line plots with a single group
Line plots are designed to demonstrate a trend over time. This means that on the x-axis, you’ll use some sort of datetime variable – anything from milliseconds to years. In order to show a trend, the y-axis then needs to contain a continuous variable, like the number of goods in stock, the price of an item, or a volume of water.
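Before plotting, it can help to confirm that your columns have suitable types. The snippet below is a minimal sketch on made-up data; the helper name suits_line_plot is our own invention for illustration and is not part of pandas or lets-plot.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def suits_line_plot(df, x, y):
    """Hypothetical helper: True if x is a datetime column and y is numeric."""
    return bool(is_datetime64_any_dtype(df[x]) and is_numeric_dtype(df[y]))


# Made-up example data, not taken from the airlines dataset
stock = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "DateAsText": ["2024-01-01", "2024-02-01", "2024-03-01"],
    "UnitsInStock": [120, 95, 140],
})

datetime_ok = suits_line_plot(stock, "Date", "UnitsInStock")
text_ok = suits_line_plot(stock, "DateAsText", "UnitsInStock")
print(datetime_ok, text_ok)
```

If the check fails because the dates are stored as text, converting them with pd.to_datetime is usually the first thing to try.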
Let’s use a line plot to take a look at the total number of flight delays due to late aircraft across the whole period of the dataset. First, we’ll read our raw data:
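    import pandas as pd

    airlines = pd.read_csv("data/airlines.csv")

    # Build a proper datetime column from the month-and-year label.
    # infer_datetime_format is deprecated as of pandas 2.0, so it is omitted here.
    airlines["Time"] = pd.to_datetime(airlines["TimeLabel"])

    # Drop 2003 and 2016, which only contain partial data
    airlines = airlines[~airlines["TimeYear"].isin([2003, 2016])]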
Because 2003 and 2016 only have partial data for the year, we’ve removed them from the dataset. We’ve also created an explicit datetime variable from the TimeLabel variable, which contains both the month and year.
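If you already know how your labels are written, you can pass the format explicitly instead of letting pandas guess. The sketch below assumes labels shaped like "2004/01"; that shape is an assumption for illustration, so check the actual TimeLabel values before reusing the format string.

```python
import pandas as pd

# Hypothetical labels; the real TimeLabel values may be formatted differently
toy = pd.DataFrame({
    "TimeLabel": ["2004/01", "2004/02", "2015/12"],
    "TimeYear": [2004, 2004, 2015],
})

# An explicit format makes parsing strict: a malformed label raises an error
toy["Time"] = pd.to_datetime(toy["TimeLabel"], format="%Y/%m")

print(toy["Time"].dt.year.tolist())
print(toy["Time"].dt.month.tolist())
```

Each label is parsed to the first day of its month, which is what gives the x-axis its monthly resolution.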
To make our plots, we need to create a summary DataFrame that contains the number of delays by cause (late aircraft, weather, security, or carrier issues) over time.
    # Sum each type of delay across all airports for every month
    delays_by_time_and_cause = (
        airlines[["Time", "NumDelaysLateAircraft", "NumDelaysWeather",
                  "NumDelaysSecurity", "NumDelaysCarrier"]]
        .groupby("Time")
        .sum()
        .reset_index()
    )

    # Reshape from wide (one column per cause) to long (one row per month and cause)
    delays_by_time_and_cause = (
        pd.melt(delays_by_time_and_cause,
                id_vars="Time",
                value_vars=["NumDelaysLateAircraft", "NumDelaysWeather",
                            "NumDelaysSecurity", "NumDelaysCarrier"])
        .rename(columns={
            "variable": "TypeOfDelay",
            "value": "NumberDelays"
        })
        .assign(TypeOfDelay=lambda x: x["TypeOfDelay"].str.replace("NumDelays", ""))
    )
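The reshaping step is easier to follow on a tiny example. Here is a minimal sketch using invented numbers and only two delay columns; it uses the var_name and value_name arguments of pd.melt as a shortcut for the rename step above.

```python
import pandas as pd

# Invented monthly totals in wide format: one column per cause
wide = pd.DataFrame({
    "Time": pd.to_datetime(["2004-01-01", "2004-02-01"]),
    "NumDelaysWeather": [10, 20],
    "NumDelaysCarrier": [30, 40],
})

# Long format: one row per (month, cause) pair
long_df = (
    pd.melt(wide, id_vars="Time",
            var_name="TypeOfDelay", value_name="NumberDelays")
    .assign(TypeOfDelay=lambda x: x["TypeOfDelay"].str.replace(
        "NumDelays", "", regex=False))
)

print(long_df)
```

Two months and two causes give four rows. The long layout is what lets us later map TypeOfDelay to a single aesthetic such as color or fill.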
We’re now ready to make our plot. We’ll create each chart with lets-plot, a Python plotting library that is a port of the popular ggplot2 R library. We use our Time variable for the x-axis, which is in months, and on the y-axis we use NumberDelays, the total number of delays for each month.
    from lets_plot import *

    LetsPlot.setup_html()

    # Keep only the late-aircraft delays and draw them as a single line
    (
        ggplot(
            delays_by_time_and_cause[
                delays_by_time_and_cause["TypeOfDelay"] == "LateAircraft"
            ],
            aes(x="Time", y="NumberDelays"))
        + geom_line(color="#fbb4ae", size=1)
        + scale_x_datetime()
        + xlab("Time")
        + ylab("Number of delays")
        + ggtitle("Total delays due to late aircraft in US airports, 2004-2015")
    )
From this chart, we see that the number of flight delays increased from 2004 to 2008, decreased until 2012, peaked again in 2013, and then decreased again. We can also see that there is significant seasonal variation in the delays, possibly due to inclement weather or holiday peaks putting pressure on airports.
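One way to check a visual impression of seasonality is to average the counts by calendar month and to total them by year. The sketch below uses invented numbers rather than the airlines data, so only the technique carries over, not the values.

```python
import pandas as pd

# Invented monthly counts covering two years, with summer and December peaks
toy = pd.DataFrame({
    "Time": pd.date_range("2004-01-01", periods=24, freq="MS"),
    "NumberDelays": [100, 90, 95, 80, 85, 120, 130, 125, 90, 85, 95, 140] * 2,
})

# Seasonal profile: average count for each calendar month (1 = January)
monthly_profile = toy.groupby(toy["Time"].dt.month)["NumberDelays"].mean()

# Long-run trend: total count per year
yearly_totals = toy.groupby(toy["Time"].dt.year)["NumberDelays"].sum()

print(monthly_profile.idxmax())
print(yearly_totals)
```

Separating the two views makes it easier to tell a repeating seasonal pattern apart from a multi-year rise or fall.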
Line plots with multiple groups
Line plots are also a great way to compare how a continuous variable changes over time across two or more groups. In the chart below, we’ve compared the number of delays due to late aircraft with those due to carrier issues. We’ve used Time on the x-axis, and NumberDelays on the y-axis. However, this time, we pass TypeOfDelay to the color argument, indicating that we want to plot the delays due to late aircraft and carrier issues as separate lines.
    # Keep two delay types and draw one line per type
    (
        ggplot(
            delays_by_time_and_cause[
                delays_by_time_and_cause["TypeOfDelay"].isin(["LateAircraft", "Carrier"])
            ],
            aes(x="Time", y="NumberDelays", color="TypeOfDelay"))
        + geom_line(size=1)
        + scale_x_datetime()
        + xlab("Time")
        + ylab("Number of delays")
        + ggtitle("Total delays in US airports, 2004-2015")
        + scale_color_brewer(type="qual", palette="Pastel1",
                             name="Type of delay",
                             labels=["Late aircraft", "Carrier"])
    )
The trends of delayed flights are quite similar for both late aircraft and carrier issues over the data period, which suggests that these delay types may be linked or have a common cause.
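If you want a number to go with the visual comparison, you can pivot the long data back to one column per delay type and compute a correlation. This is a minimal sketch on invented values; a high correlation is consistent with a shared cause but does not establish one.

```python
import pandas as pd

# Invented long-format data for two delay types over four months
long_df = pd.DataFrame({
    "Time": list(pd.date_range("2004-01-01", periods=4, freq="MS")) * 2,
    "TypeOfDelay": ["LateAircraft"] * 4 + ["Carrier"] * 4,
    "NumberDelays": [100, 150, 120, 180, 80, 110, 95, 130],
})

# One column per delay type, indexed by month
wide = long_df.pivot(index="Time", columns="TypeOfDelay", values="NumberDelays")

# Pearson correlation between the two monthly series
correlation = wide["LateAircraft"].corr(wide["Carrier"])
print(round(correlation, 2))
```

With real data, strong seasonality alone can push this value up, so it is worth reading alongside the plot rather than instead of it.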
Area plots
Area plots are related to line plots. In area plots, the space under the line is filled in, so these types of graphs are ideal when you really want to emphasize the volume or amount you’re plotting on the y-axis. They are particularly effective for contrasting differences in quantity between groups. You can see this with the area plot below, where we’ve compared the number of delays caused by weather with the number caused by late aircraft.
    # Sort so that weather delays come first, matching the legend labels below
    (
        ggplot(
            delays_by_time_and_cause[
                delays_by_time_and_cause["TypeOfDelay"].isin(
                    ["LateAircraft", "Weather"])
            ].sort_values("TypeOfDelay", ascending=False),
            aes(x="Time", y="NumberDelays", fill="TypeOfDelay"))
        + geom_area(color="white")
        + scale_x_datetime()
        + xlab("Time")
        + ylab("Number of delays")
        + ggtitle("Total delays in US airports, 2004-2015")
        + scale_fill_brewer(type="qual", palette="Pastel1",
                            name="Type of delay",
                            labels=["Weather", "Late aircraft"])
    )
The number of delays from weather is dwarfed by the number occurring because of late aircraft. This shows that, while weather-related delays tend to get the most attention, they are nowhere near as big an issue as routine delays from late aircraft.
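To put a number on how lopsided two groups are, you can total each one and take the ratio. The values below are invented for illustration, so the resulting ratio says nothing about the real dataset.

```python
import pandas as pd

# Invented long-format data: two months for each delay type
long_df = pd.DataFrame({
    "Time": pd.to_datetime(["2004-01-01", "2004-02-01"] * 2),
    "TypeOfDelay": ["LateAircraft", "LateAircraft", "Weather", "Weather"],
    "NumberDelays": [1000, 1200, 50, 60],
})

# Total delays per type across the whole period
totals = long_df.groupby("TypeOfDelay")["NumberDelays"].sum()

# How many late-aircraft delays there are for every weather delay
ratio = totals["LateAircraft"] / totals["Weather"]
print(totals)
print(ratio)
```

Running the same two lines on delays_by_time_and_cause would give the real figure to quote alongside the area plot.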
That concludes our introduction to line plots! We’ve covered how these elegant plots can be used to show how things like price and volume change over time, and how they can be used to spot relationships between variables when the data is divided into subgroups. In the next post, we’ll have a look at bar plots.