Visualizing Geospatial Data in Python
Today we are interviewing Adam Symington, author of the PythonMaps project, which is dedicated to using Python to develop beautiful yet very informative geospatial data visualizations.
Adam is a geospatial data scientist working as the head of data science at Geollect Ltd, a geospatial intelligence company that aims to better understand the maritime world using data.
Before that, he was a full-time researcher and also taught several undergraduate and postgraduate courses in the Department of Chemistry at the University of Bath, introducing students to Python programming and its application in chemical sciences.
“There are lots of tools and methods that can be used to generate the beautiful map of Italy… I am going to walk you through an unconventional approach using Python and hopefully you will come away as convinced as I am that if something is worth doing, it is worth doing in Python.”
VL: Let’s start this interview with a classic first question: How did you get into geospatial visualizations, and why do you think they are important?
AS: I started my data visualization journey when I was in academia and I was starting to put together papers, posters, and talks. I was initially embarrassed at how awful my Microsoft Excel plots looked, and I was also annoyed at how long it took to generate a passable figure using Excel. So I started dabbling with Matplotlib and Python as a way not only to generate nice figures but also to generate them quickly and in an entirely reproducible, replicable way. Which brings me to the second part of the question: data visualizations need to look stunning. A picture paints a thousand words, as they say, and if you want people to read whatever you have written or take notice of what you are presenting, the visuals that you display must capture their attention and their imagination. On top of this, our time is valuable, and if we all spent three days manually tweaking Excel, we would have no time to produce the data going into them.
VL: I couldn’t agree more! Good visualizations make complex information digestible and easy to grasp, and having a way to automate the process is definitely worth the time investment that mastering Python requires. I believe Python is a very popular language within the data science community, mainly because it’s easy to start off with. Also, there are numerous scientific packages that can help with nearly anything you might want to develop with Python. What are your favorite Python packages to use for geospatial visualizations?
AS: For anything involving shapes (points, LineStrings, and polygons), I use a combination of GeoPandas, Cartopy, and Matplotlib, and for anything involving rasters I use a combination of Rasterio and Matplotlib. I am not an expert by any means on the more interactive side of data visualizations in Python, but I am a particular fan of things like Bokeh, HoloViews, and GeoViews for interactive visualizations. I personally am not a fan of Plotly, but I’m sure it has a few fans lurking out there.
“Data visualizations need to look stunning… If you want people to read whatever you have written or take notice of what you are presenting, the visuals that you display must capture their attention and their imagination. On top of this, our time is valuable, and if we all spent three days manually tweaking Excel, we would have no time to produce the data going into them.”
VL: So, from what you’re saying, it sounds like Matplotlib is more or less a must for working with geospatial data. What else do you use? I’m really interested to get a sneak peek at your toolset and understand which elements you enjoy and where you see room for improvement.
AS: I scope things out and develop ideas using Jupyter notebooks, and I develop software and finish projects using PyCharm. Most of my development uses the traditional scientific Python stack, with a geospatial element to it. This includes: SciPy, NumPy, Pandas, GeoPandas, Cartopy, and Rasterio, but there are many more out there. From a geospatial context, I feel that there is a gap in the market for a tool that links the libraries and methods that process rasters and those that process shapes. At times it is difficult to wrangle these different data types into something that can be universally useful.
As far as things that I enjoy – anything visual and geospatial!
VL: You mentioned that you use different libraries for different types of data visualization. Could you elaborate on this? What are the main data types you use for representing geographic information and how do they differ?
AS: The simplest form is just latitude and longitude coordinates, for example, the location of a city. Most things don’t exist at a single point in space, though, so there are more complex data types to describe them. A point may be adequate to describe a city on a global scale, but at a national scale that city has quite a large extent, and hence one could use a LineString to describe the boundary of that city, or indeed a polygon to describe the boundary and everything within it.
The other main data type is a raster: an n-dimensional array of values that describes an image, for example, a satellite image.
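These data types can be sketched with Shapely (the geometry library underneath GeoPandas) and NumPy. The coordinates below are purely illustrative, not a real city boundary:

```python
import numpy as np
from shapely.geometry import Point, LineString, Polygon

# A city as a single point in space: (longitude, latitude)
city = Point(12.4964, 41.9028)

# The same city at a national scale: its boundary as a LineString...
boundary = LineString([
    (12.40, 41.85), (12.60, 41.85), (12.60, 41.99), (12.40, 41.99), (12.40, 41.85),
])

# ...or as a Polygon, which covers the boundary and everything inside it
footprint = Polygon([(12.40, 41.85), (12.60, 41.85), (12.60, 41.99), (12.40, 41.99)])

# A raster: an n-dimensional array of values, here a tiny stand-in "image"
raster = np.random.rand(4, 4)

print(footprint.contains(city))  # the point falls inside the polygon
print(footprint.area)            # note: area in squared degrees, not metres
```

The key practical difference is that a LineString has length but no interior, while a polygon supports operations such as containment tests, intersections, and area calculations.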
“The main challenge is the fact that the Earth is not actually a sphere but is rather ever so slightly egg shaped. So, the mathematical models either require corrections or are slightly wrong.”
VL: Working with geographical data doesn’t sound like an easy job by any means! There are so many tiny issues you always have to be wary of. Can you share some of the unique challenges you face? Is adjusting for the curvature of the Earth one of them?
AS: Thankfully, there are loads of libraries that already exist to do this for us, and even then it is a relatively simple mathematical problem to convert things between a 2D surface and a 3D sphere. With that said, the main challenge is the fact that the Earth is not actually a sphere but is rather ever so slightly egg shaped. So, the mathematical models either require corrections or are slightly wrong.
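As a small illustration of the spherical model Adam is describing, here is the haversine formula, which gives the great-circle distance on a perfect sphere. The city coordinates and mean radius are illustrative; a geodesy library such as pyproj would apply the ellipsoidal corrections he mentions (WGS84 flattens the sphere by roughly 1 part in 298.257):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance on a perfect sphere (no ellipsoid correction)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# London to Paris: roughly 343 km on the spherical model; a WGS84
# ellipsoidal geodesic differs from this by a few hundred metres.
print(haversine_km(51.5074, -0.1278, 48.8566, 2.3522))
```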
VL: Talking about the surface, is the way the oceans are represented somehow different from the land’s representation?
AS: I personally do not really see much of a difference from a data science perspective, although from a visualization perspective, there is much more going on on land and hence visuals can become incredibly complicated when multiple datasets start being layered together.
“If you want to find a geospatial dataset for a particular idea that you are working on, head to Google and there is a good chance you will find something.”
VL: Indeed. It seems like it could be a nightmare trying to read an overpopulated visualization. That’s why planning how to show a complex matter might take a significant amount of time. What about searching for particular datasets? Does it take a lot of time to find them?
AS: I exclusively use open-source data. There are occasionally academic studies that produce a number of geospatial datasets alongside their work. NASA and other space-based organizations produce a lot of raster data, and there are loads of repositories out there with a wealth of geospatial data; Natural Earth comes to mind. If you want to find a geospatial dataset for a particular idea that you are working on, head to Google and there is a good chance you will find something.
VL: I imagine working with open-source data means it can take more time to prepare it for further work. Are there any common manipulations that you use?
AS: Unfortunately, there is no common manipulation because I tend to produce visuals differently every time. Each dataset and each story that you want to tell is different and this requires taking a different approach. I think it is dangerous to become constrained in how you go about doing things because then your visuals will become constrained, and they will eventually all look the same. Therefore, I hate tools like Power BI.
VL: So it looks like you use a unique approach for each of your projects, depending on what you are going to show with your next visualization. Since we are talking about geospatial projects, my impression is that the type of map projection you use can have a big impact on the final visualization. Can you tell more about the different types of map projections, and how do they affect visualizations?
AS: Map projections are different ways of displaying on a 2D surface data that exists on a 3D sphere. There are loads of ways this can be done, and every single one of them has flaws. Some exaggerate the size of the higher (above +60°) and lower (below −60°) latitudes, which is useful if you have a lot of data in Northern Europe and North America; however, this comes at the expense of making Africa look much smaller than it actually is when compared with these areas. The choice of projection ultimately comes down to what you are trying to present.
A common way to show how badly projections can distort the world is to compare the size of Russia with that of Africa. While on most maps they both look pretty large, when placed alongside one another without projection-related trickery, Russia is tiny in comparison.
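The high-latitude exaggeration is easy to quantify. On the Mercator projection, for instance, the local linear scale factor grows as 1/cos(latitude); the snippet below is a small illustrative calculation, not something from the interview:

```python
from math import cos, radians

def mercator_scale(lat_deg):
    """Local linear scale factor of the Mercator projection at a latitude."""
    return 1.0 / cos(radians(lat_deg))

for lat in (0, 30, 60, 80):
    print(f"{lat:>2}°: drawn at {mercator_scale(lat):.2f}x true linear scale")
```

At 60°N, where much of Russia sits, features are drawn at twice their true linear scale, which means four times their true area; this is the trickery behind the Russia-versus-Africa comparison.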
For genuinely global datasets I tend to use the Robinson projection because it is the most balanced when it comes to showing the central latitudes and extreme latitudes.
“More and more problems will now move onto the cloud or HPCs because these services have the capacity to deal with these huge datasets. Data scientists and engineers will need to start understanding these technologies in more detail to interact with this data.”
VL: Taking into consideration how much data global datasets usually contain, they must be enormous. How do you work effectively with such large datasets?
AS: This depends on the use case. I work a lot with shipping data, and ships tend to broadcast their positions every second; however, they do not manage to move particularly far on a second-by-second basis. So, I will often downsample the data to give me average positions within 10-minute windows. Aircraft, on the other hand, move much faster, so downsampling can only really go as far as average positions every minute.
That is a more practical look at individual problems. More and more problems will now move onto the cloud or HPCs (high-performance computers) because these services have the capacity to deal with these huge datasets. Data scientists and engineers will need to start understanding these technologies in more detail to interact with this data.
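The 10-minute downsampling Adam describes can be sketched with pandas. The ship track here is synthetic and the column names are assumptions of mine, not his actual schema:

```python
import numpy as np
import pandas as pd

# Synthetic ship track: one position report per second for one hour
timestamps = pd.date_range("2024-01-01 00:00:00", periods=3600, freq="s")
track = pd.DataFrame(
    {
        "lat": np.linspace(50.0, 50.1, 3600),   # slow northward drift
        "lon": np.linspace(-1.0, -0.9, 3600),
    },
    index=timestamps,
)

# Downsample 3,600 reports to average positions in 10-minute windows
downsampled = track.resample("10min").mean()
print(len(downsampled))  # 6 windows for one hour of data
```

For aircraft, swapping `"10min"` for `"1min"` gives the finer one-minute windows mentioned above.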
VL: Cloud computing may significantly simplify the whole process. Though it still might be difficult for a single person to manage both configuring the cloud to their needs and dealing with the data science side of things. Getting back to your project, PythonMaps, what are your plans for the project’s next steps? From what you’ve said so far, it sounds like there’s a lot to do. Are you thinking about opening it up for collaboration?
AS: I plan to take a much heavier software development approach to this project. This will involve building a website where all these maps can be viewed. The website will also have an application that allows visitors to load different datasets and visualize them together in one interactive map. I will be doing this all in PyCharm and it will have a strong software development focus. I am looking for collaborators to help with this endeavor.
VL: So you are going to produce more beautiful visualizations and will help others to produce them as well. What is your favorite visualization that you’ve done to date?
AS: I made a map of carbon dioxide emissions within 1 km² regions across the world. The reason this is my favorite is that you can see the contributions from individual industries clearly. Shipping lanes become prominent at sea, flight paths between Europe and North America are visible, highways and motorways show up clearly, and population centers shine brightly. It really shows the interconnected world that we live in and how emissions are linked across the world. I also think it is cool that a lot of the other maps I have produced, like shipping lanes or airways, are kind of portrayed within this map as well, so it feels like a nice combination of many of my creations.
VL: The way you describe the connections between these datasets sounds almost poetic. In your eyes, how much of data visualization is art and how much is science?
AS: Data visualization is an artistic representation of science.
VL: Thank you, Adam! That was an inspiring overview of what can be done with great passion for maps and a helping hand from Python.
If you would like to dive deeper into the topic here is Adam’s tutorial from SciPy 2022:
If you are interested in joining Adam and contributing to his project, here is his contact information: