Getting Started in Data Science: EuroPython 2023 Follow-Up
One of my favorite parts of my job as a developer advocate is being able to help people get started in data science. I still remember making the transition from academia to data science almost 8 years ago: how overwhelming it was, and how much I felt I needed to learn just to get started. I am also truly passionate about this wonderful field, and I love to help others get started in an area that is so interesting and rewarding.
I was lucky enough to be involved in a couple of activities geared toward helping data science beginners at EuroPython this year, including the Humble Data workshop and a Q&A session for data science newbies along with Cheuk Ting Ho, Valerio Maggio, and Vaibhav (VB) Srivastav. After both of these sessions I had a lot of great conversations with people who asked about which resources helped me when I was starting, and I wanted to share the content of these conversations a bit more widely.
Let’s first recap what we covered in the Q&A session, and then dive into some further resources to get you started on your data science journey.
What we covered in the Q&A session
How do you define what a data scientist is in 2023?
Just like when I started in 2016, data science is defined differently depending on who you talk to. However, the field has definitely gotten more complicated as it has matured, with additional roles like machine learning and MLOps engineers becoming established in the last few years.
Despite all of the continued confusion, the core of the role remains working with data to tell a story scientifically (after all, it’s in the name!). This involves applying techniques like data preparation and analysis, statistics, and visualization to answer a question that is typically somewhat complex. While machine learning has become synonymous with data science, it’s not actually a core part of data science work. Some data science projects may involve machine learning, but certainly not all of them.
What skills do data scientists tend to have?
There is a well-known Venn diagram that has been circulating since before I even started in data science. It depicts the field as a convergence of mathematical skills, engineering skills, and domain knowledge. When I first started out, this diagram really overwhelmed me; I felt like I needed to master all three of these to even get started!
In reality, it is impossible to know every skill used in data science in depth. Some people will come in with more strengths in mathematics or scientific skills, others will come from a software engineering background, and they’ll all pick up the remaining skills on the job. The split between data science roles also means you can play to your strengths and interests better. Those who have more experience with analysis or statistics may go for a more traditional data scientist role, while those with stronger engineering skills may gravitate toward machine learning engineering.
Finally, unless you work in a tiny startup, it’s unlikely you will be working alone. Data scientists tend to do the research and prototyping side of things, while engineers put the models into production. So don’t worry if you’re not an expert at everything – there’s a place for your skills in this field!
How can I start developing my skills?
One of the most common misconceptions about data science is that you need a PhD or some other advanced degree. However, this is just one possible path for developing the core skill set of data scientists we talked about above.
The best way to develop these skills is just to get hold of datasets that interest you and start creating projects with them. VB in particular found the subreddit r/dataisbeautiful helpful for getting motivation and feedback. I love writing, so I started a blog. Cheuk recommends volunteering for organizations like DataKind and having a community around you. Once you have a feel for working with real data, you have one of the most important skills mastered and you’ll build the rest on top of this.
Finally, the main thing is not to panic! Just choose the tooling (language, development environment, and packages) that you like best in the beginning, and build up your skills using these. I personally loved R when I started because it was designed for people from statistics backgrounds and suited me better, but over time I switched to Python as I moved more into machine learning.
Useful resources
To help you continue your data science journey, I’m also including a list of resources I’ve found useful in the past (or content I’ve created to cover specific topics).
Programming languages
Your first step will be getting some basic programming under your belt – and by basic, I really do mean basic! I’d recommend starting with either R or Python. There are dozens of courses for each online, but I can recommend the two that I used: R for Psychological Science and Learn Python the Hard Way.
You should also try to include SQL in your coding toolbelt. I’ve found that W3Schools’ SQL course is a great place to get started.
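If you would like to try SQL without installing a database server, one option is Python's built-in sqlite3 module. The following is only a minimal sketch: the table name, columns, and values are made up for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this example.
conn.execute("CREATE TABLE penguins (species TEXT, body_mass_g REAL)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 3700.0), ("Adelie", 3800.0), ("Gentoo", 5000.0)],
)

# Average body mass per species, a typical first aggregation query.
rows = conn.execute(
    "SELECT species, AVG(body_mass_g) AS avg_mass "
    "FROM penguins GROUP BY species ORDER BY species"
).fetchall()
conn.close()

print(rows)
```

SELECT, GROUP BY, and ORDER BY will carry you a long way in day-to-day analysis, so they are a reasonable place to start experimenting.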
Data analysis
Learning pandas is fundamental to getting started with data analysis in Python, and I cannot recommend Wes McKinney’s book Python for Data Analysis highly enough. Once you’ve finished with that book, you probably want to start playing with some real data. For this, I recommend two sources: the UC Irvine Machine Learning Repository and Kaggle Datasets.
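To give a flavor of what a first look at a dataset can involve, here is a small pandas sketch. The CSV text is a made-up stand-in for a file you might download from one of those sources, so the column names and values are assumptions rather than a real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content; with a real download you would pass a file path instead.
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Biscoe,
Gentoo,Biscoe,5000
Gentoo,Biscoe,5400
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # the first few rows
print(df.dtypes)   # the type pandas inferred for each column

# Count missing values per column, then summarize by group.
missing = df.isna().sum()
mean_mass = df.groupby("species")["body_mass_g"].mean()
print(missing)
print(mean_mass)
```

Looking at the first rows, the column types, and the missing-value counts before doing anything else is a habit that tends to pay off with real data.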
From there, you will probably want to get into data visualization. For R, the gold standard for graphing is ggplot2, but there is more diversity in Python plotting packages, which include Matplotlib, seaborn, plotly, lets-plot, plotnine, and more. I think the best way to get started with plotting is just to think about what you want to show (maybe check out r/dataisbeautiful for inspiration) and start messing around with a plotting package that you like.
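As one possible starting point, here is a minimal Matplotlib sketch of a bar chart. The numbers are invented for illustration, and the same idea can be expressed in any of the other packages mentioned above.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file, so no display window is needed
import matplotlib.pyplot as plt

# Made-up values, purely for illustration.
species = ["Adelie", "Chinstrap", "Gentoo"]
avg_mass_g = [3700, 3730, 5080]

fig, ax = plt.subplots()
bars = ax.bar(species, avg_mass_g)

# Labels and a title make the plot readable on its own.
ax.set_xlabel("Species")
ax.set_ylabel("Average body mass (g)")
ax.set_title("Average body mass by species")

fig.savefig("body_mass.png")
```

From here you can try swapping the bar chart for a scatter plot or a histogram and see which one tells your story best.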
Once you are ready to tackle data cleaning and data quality issues, you may want to pick up another book or course on the topic. I have a talk where I give an overview of some of the major issues that can come up in datasets and negatively affect your data science work. Much of this talk’s content comes from one of my university statistics books, Using Multivariate Statistics.
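To make this a little more concrete, here is a small pandas sketch of three common checks: missing values, duplicated rows, and implausible values. The data is made up, and the plausible age range of 0 to 120 is an assumption chosen for this example, not a rule taken from the talk or the book.

```python
import pandas as pd

# Hypothetical data with a few deliberately planted problems.
df = pd.DataFrame(
    {
        "participant": [1, 2, 2, 3, 4],
        "age": [34, 29, 29, None, 290],  # one missing value, one implausible value
    }
)

# How many ages are missing?
n_missing = int(df["age"].isna().sum())

# How many rows exactly repeat an earlier row?
n_duplicates = int(df.duplicated().sum())

# Which rows fall outside the assumed plausible range of 0 to 120?
implausible = df[(df["age"] < 0) | (df["age"] > 120)]

print(n_missing, n_duplicates)
print(implausible)
```

What to do about each problem (drop, correct, or impute) depends on the dataset and the question you are asking, so treat these checks as a way to find issues rather than fix them automatically.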
Statistics and machine learning
Once you’re ready to dive into more advanced topics, you can start covering statistics and machine learning. I think these are both topics you can cover bit by bit (as they can be quite dense), so don’t feel like you need to master everything before you can start working as a data scientist.
While I learned statistics from my university textbooks (which are probably a bit too specific to psychology to recommend widely), I have heard nothing but good things about Think Stats. In terms of machine learning, there are a few options. I personally loved Andrew Ng’s Machine Learning Specialization for machine learning and François Chollet’s Deep Learning with Python for an introduction to deep learning. I’ve also had friends who really liked both the classic An Introduction to Statistical Learning and Google’s Machine Learning Crash Course.
Shout out to Humble Data!
And as a final plug – if you’re looking for a way to get started but want some more support, you can also keep your eye out for the next Humble Data workshop! This free workshop is aimed at getting you up and running with basic Python data science, going from the basics of Python programming to working with pandas and data visualization.