Where To Get Data for Your Data Science Projects
Whether you’re starting a new project or expanding an existing one, as a data scientist, you’re always on the lookout for new material to explore. Knowing where to get data for data science projects can be challenging, and finding “good data” can be even more difficult. In this article, we’ll look at what makes “good data”, what format that data might be in, where to find it, and what the next steps are.
What is “good data” for data science projects?
Firstly, we should consider how relevant the dataset is to our work. You can stumble upon lots of datasets that overlap with your work in some way, but it can be difficult to decide which is the best one for you to put your effort into. In this scenario, we’ll briefly explore some of the attributes of the data.
To start with, how consistent is the dataset? Specifically, are there any missing values? Data might be missing for a variety of acceptable reasons, but it can also be a sign of selection bias or other factors that might skew your results. Often, we can choose to either accept missing data or delete the records that contain it before we do our analysis, but knowing about missing data early in the process can help you make an informed decision to use that dataset or not.
Along with missing data, it’s worth checking to see if any of the data is duplicated. Duplicated data might be fine, but it might also signify a lack of consistency that could skew your results. Duplicated data might also reduce your confidence in the dataset as a whole, so it’s important to consider when choosing your dataset.
Another aspect to consider for good data is timeliness. The time over which the data was gathered is usually pertinent to the questions you want to answer when you start analyzing it. Checking if the data was collected in the timespan that you’re interested in and considering the continuity of that timespan is helpful.
When you’re starting your journey into data science and picking your first few datasets to play with, you don’t need to worry about picking the perfect dataset – focus on the process and exploring instead. When you’re ready to learn more about datasets and how to avoid common pitfalls, I recommend you watch this talk from Dr. Jodie Burchell – Garbage data in, garbage models out.
Do you want structured or unstructured data?
Structured data is what you’ll find in a table where each row is an observation, and each column is a variable or field. By contrast, unstructured data usually needs to be pre-processed before you can work with it in a data science project, or it can be used by specialist models that can process it internally. Examples of unstructured data include text, images, and sound.
As you might have guessed, unstructured data is used more in advanced and specialized subfields in data science, like natural language processing and computer vision. Most data scientists start with, and continue working with, structured data for many of their projects. I recommend that this is where you start, too.
I recommend you keep the notion of structured and unstructured data in mind as we explore standard data formats.
What are standard data formats?
In addition to the quality of the data, we also have to choose between available data formats. You’ll come across two broad types of data formats as a data scientist: downloadable data (often CSV) and databases.
Downloadable data is nearly always structured data and often takes the form of comma-separated value (CSV) files. These downloads are available from various online repositories. They are among some of the most prolific and most accessible sources of data. If you’re new to data exploration, this is the best place to get started, as they’re easy to find, human-readable, and easy to work with without any extra steps.
If you’re ready to enter the world of databases, it’s worth understanding that they are further subdivided into relational (SQL) and non-relational (non-SQL) databases. As a broad rule, relational databases contain structured data and non-relational databases contain non-structured data, but determining whether data is structured is not an exact science. Instead, think of non-relational databases as being adaptable to the shape of the data they are storing.
Databases are commonly used in the following cases: when you have large datasets, when multiple people need to access and modify the data simultaneously, when datasets need to be able to scale, and when data is unstructured (non-SQL only). In addition, if you’re commissioned to do data analysis for your company, you may find that you’re given a database to work with as it’s already in-house.
PyCharm Professional has excellent support for SQL and non-SQL databases. If your work involves using various databases and writing SQL queries, you can check out our webinar on Visual SQL Development with PyCharm to get more information about the functionality. Alternatively, you can learn how to explore tables without writing a single line of SQL with PyCharm and import your dataset into PyCharm and explore it.
Try PyCharm Professional for free
Where can I find datasets for my data science projects?
Once you’re ready to find out how to get data, there are plenty of resources you can download to use for your data science project. This is not an endless list, but it’s a good place to start and a natural progression for your data science journey.
UCI Machine Learning Repository
The UCI Machine Learning Repository has over 600 datasets covering a host of exciting topics for you to explore, such as biology, health, physics, and climate. UCI datasets also have a diverse set of data types, including images, sequential, and time series. I recommend looking at a few different datasets and types of data if you’re new to data science, as it will help you expand your understanding of what data often looks like.
Kaggle
Another well-known website for datasets is Kaggle. Not only can you sign up to Kaggle to download datasets for data science projects, but it also has a large community of like-minded people who run company-sponsored competitions designed to help you develop your data science skills. If you’re looking for a famous dataset that you’ve seen used in numerous examples, you’ll almost certainly find it hosted on Kaggle.
Hugging Face
Hugging Face is another resource that is rich in datasets. You can filter the results by modalities, including audio, geospatial, and video, and provide a range for the size of your dataset, which can be particularly helpful when you want to start small. Hugging Face has many natural language and computer vision datasets, so you might want to head over there once you’re past the basics and interested in more specialized fields.
Many more
There are many more places that you can go on your data science journey to find fun datasets to explore. You can check out GitHub for curated open source datasets, FiveThirtyEight for datasets relating to American politics and sports, and lastly, one of my favorites, the UK government, to get datasets relating to public services and the economy in the UK.
What are the next steps?
Congratulations! You’ve gained a better understanding of what “good data” is, and you know where to look to find datasets for data science projects. Once you’ve chosen a dataset, you’re ready to start preparing and analyzing your data.
Remember, you can use Jupyter notebooks inside PyCharm to explore both file format and database datasets.
You can read or watch a video showing just some of the ways you can use Jupyter notebooks inside PyCharm to boost your productivity on your data science journey with your chosen dataset.
Try PyCharm Professional for free