{"id":516844,"date":"2024-10-09T10:00:00","date_gmt":"2024-10-09T09:00:00","guid":{"rendered":"https:\/\/blog.jetbrains.com\/?post_type=pycharm&#038;p=516844"},"modified":"2025-06-10T14:49:30","modified_gmt":"2025-06-10T13:49:30","slug":"how-to-get-data","status":"publish","type":"pycharm","link":"https:\/\/blog.jetbrains.com\/fr\/pycharm\/2024\/10\/how-to-get-data","title":{"rendered":"Where To Get Data for Your Data Science Projects"},"content":{"rendered":"\n<p>Whether you\u2019re starting a new project or expanding an existing one, as a data scientist, you\u2019re always on the lookout for new material to explore. Knowing where to get data for data science projects can be challenging, and finding \u201cgood data\u201d can be even more difficult. In this article, we\u2019ll look at what makes \u201cgood data\u201d, what format that data might be in, where to find it, and what the next steps are.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is \u201cgood data\u201d for data science projects?<\/h2>\n\n\n\n<p>Firstly, we should consider how <em>relevant<\/em> the dataset is to our work. You can stumble upon lots of datasets that overlap with your work in some way, but it can be difficult to decide which is the best one for you to put your effort into. In this scenario, we\u2019ll briefly explore some of the attributes of the data.&nbsp;<\/p>\n\n\n\n<p>To start with, how <em>consistent<\/em> is the dataset? Specifically, are there any missing values? Data might be missing for a variety of acceptable reasons, but it can also be a sign of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Selection_bias\" target=\"_blank\" rel=\"noopener\">selection bias<\/a> or other factors that might skew your results. 
Often, we can choose to either accept missing data or delete the records that contain it before we do our analysis, but knowing about missing data early in the process can help you make an informed decision to use that dataset or not.&nbsp;<\/p>\n\n\n\n<p>Along with missing data, it\u2019s worth checking to see if any of the data is <em>duplicated<\/em>. Duplicated data might be fine, but it might also signify a lack of consistency that could skew your results. Duplicated data might also reduce your confidence in the dataset as a whole, so it\u2019s important to consider when choosing your dataset.&nbsp;<\/p>\n\n\n\n<p>Another aspect to consider for good data is <em>timeliness<\/em>. The time over which the data was gathered is usually pertinent to the questions you want to answer when you start analyzing it. Checking if the data was collected in the timespan that you\u2019re interested in and considering the continuity of that timespan is helpful.&nbsp;<\/p>\n\n\n\n<p>When you\u2019re starting your journey into data science and picking your first few datasets to play with, you don\u2019t need to worry about picking the perfect dataset \u2013 focus on the process and exploring instead. When you\u2019re ready to learn more about datasets and how to avoid common pitfalls, I recommend you watch this talk from Dr. Jodie Burchell \u2013 <a href=\"https:\/\/youtu.be\/9EI_lqPUVEE\" target=\"_blank\" rel=\"noopener\"><em>Garbage data in, garbage models out<\/em><\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Do you want structured or unstructured data?<\/h2>\n\n\n\n<p>Structured data is what you\u2019ll find in a table where each row is an observation, and each column is a variable or field. By contrast, unstructured data usually needs to be pre-processed before you can work with it in a data science project, or it can be used by specialist models that can process it internally. 
Examples of unstructured data include text, images, and sound.&nbsp;<\/p>\n\n\n\n<p>As you might have guessed, unstructured data is used more in advanced and specialized subfields in data science, like natural language processing and computer vision. Most data scientists start with, and continue working with, structured data for many of their projects. I recommend that you start here, too.<\/p>\n\n\n\n<p>Keep the notion of structured and unstructured data in mind as we explore standard data formats.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What are standard data formats?<\/h2>\n\n\n\n<p>In addition to the quality of the data, we also have to choose between available data formats. You\u2019ll come across two broad types of data formats as a data scientist: downloadable data (often CSV) and databases.&nbsp;<\/p>\n\n\n\n<p><em>Downloadable<\/em> data is nearly always structured data and often takes the form of comma-separated values (CSV) files. These downloads are available from various online repositories. They are among the most plentiful and accessible sources of data. If you\u2019re new to data exploration, this is the best place to get started, as they\u2019re easy to find, human-readable, and easy to work with without any extra steps.&nbsp;<\/p>\n\n\n\n<p>If you\u2019re ready to enter the world of databases, it\u2019s worth understanding that they are further subdivided into relational (SQL) and non-relational (NoSQL) databases. As a broad rule, relational databases contain structured data and non-relational databases contain unstructured data, but determining whether data is <em>structured<\/em> is not an exact science. 
Instead, think of non-relational databases as being adaptable to the shape of the data they are storing.&nbsp;<\/p>\n\n\n\n<p>Databases are commonly used in the following cases: when you have large datasets, when multiple people need to access and modify the data simultaneously, when datasets need to be able to scale, and when data is unstructured (NoSQL only). In addition, if you\u2019re commissioned to do data analysis for your company, you may find that you\u2019re given a database to work with as it\u2019s already in-house.&nbsp;<\/p>\n\n\n\n<p>PyCharm Professional has excellent support for SQL and NoSQL databases. If your work involves using various databases and writing SQL queries, you can check out our webinar on <a href=\"https:\/\/www.youtube.com\/watch?v=_FlpiNno088&amp;t=667s\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Visual SQL Development with PyCharm<\/em><\/a> to get more information about the functionality. Alternatively, you can learn <a href=\"https:\/\/youtu.be\/YI6xzNmDRh8\" target=\"_blank\" rel=\"noreferrer noopener\">how to explore tables without writing a single line of SQL<\/a> with PyCharm and <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/7-ways-to-use-jupyter-notebooks-inside-pycharm\/\" target=\"_blank\" rel=\"noreferrer noopener\">import your dataset into PyCharm and explore it<\/a>.&nbsp;<\/p>\n\n\n\n<p align=\"center\">\n    <a class=\"jb-download-button\" href=\"https:\/\/www.jetbrains.com\/pycharm\/data-science\/\" target=\"_blank\" rel=\"noopener\">      \n        Try PyCharm Professional for free\n    <\/a>\n<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where can I find datasets for my data science projects?<\/h2>\n\n\n\n<p>Once you\u2019re ready to find out how to get data, there are plenty of resources where you can download datasets for your data science project. 
This is not an exhaustive list, but it\u2019s a good place to start and a natural progression for your data science journey.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">UCI Machine Learning Repository<\/h3>\n\n\n\n<p>The <a href=\"https:\/\/archive.ics.uci.edu\/\" target=\"_blank\" rel=\"noreferrer noopener\">UCI Machine Learning Repository<\/a> has over 600 datasets covering a host of exciting topics for you to explore, such as biology, health, physics, and climate. UCI datasets also have a diverse set of data types, including image, sequential, and time-series data. I recommend looking at a few different datasets and types of data if you\u2019re new to data science, as it will help you expand your understanding of what data often looks like.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Kaggle<\/h3>\n\n\n\n<p>Another well-known website for datasets is <a href=\"https:\/\/www.kaggle.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Kaggle<\/a>. Not only can you sign up to Kaggle to download datasets for data science projects, but it also has a large community of like-minded people and hosts company-sponsored competitions designed to help you develop your data science skills. If you\u2019re looking for a <a href=\"https:\/\/www.kaggle.com\/datasets\/yasserh\/titanic-dataset\" target=\"_blank\" rel=\"noreferrer noopener\">famous dataset<\/a> that you\u2019ve seen used in numerous examples, you\u2019ll almost certainly find it hosted on Kaggle.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Hugging Face<\/h3>\n\n\n\n<p><a href=\"https:\/\/huggingface.co\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hugging Face<\/a> is another resource that is rich in datasets. You can filter the results by modalities, including audio, geospatial, and video, and provide a range for the size of your dataset, which can be particularly helpful when you want to start small. 
Hugging Face has many natural language and computer vision datasets, so you might want to head over there once you\u2019re past the basics and interested in more specialized fields.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Many more<\/h3>\n\n\n\n<p>There are many more places that you can go on your data science journey to find fun datasets to explore. You can check out <a href=\"https:\/\/github.com\/datasets\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a> for curated open source datasets, <a href=\"https:\/\/projects.fivethirtyeight.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">FiveThirtyEight<\/a> for datasets relating to American politics and sports, and lastly, one of my favorites, the <a href=\"https:\/\/www.data.gov.uk\/\" target=\"_blank\" rel=\"noreferrer noopener\">UK government<\/a>, to get datasets relating to public services and the economy in the UK.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What are the next steps?<\/h2>\n\n\n\n<p>Congratulations! You\u2019ve gained a better understanding of what \u201cgood data\u201d is, and you know where to look to find datasets for data science projects. 
Once you\u2019ve chosen a dataset, you\u2019re ready to start <a href=\"https:\/\/blog.jetbrains.com\/datalore\/2022\/11\/08\/how-to-prepare-your-dataset-for-machine-learning-and-analysis\" target=\"_blank\" rel=\"noreferrer noopener\">preparing and analyzing your data<\/a>.&nbsp;<\/p>\n\n\n\n<p>Remember, you can use Jupyter notebooks inside PyCharm to <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/7-ways-to-use-jupyter-notebooks-inside-pycharm\/\" target=\"_blank\" rel=\"noreferrer noopener\">explore both file format and database datasets<\/a>.&nbsp;<\/p>\n\n\n\n<p>You can <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/how-to-use-jupyter-notebooks-in-pycharm\/\" target=\"_blank\" rel=\"noreferrer noopener\">read<\/a> or <a href=\"https:\/\/www.youtube.com\/watch?v=uiIKaacMGoE\" target=\"_blank\" rel=\"noreferrer noopener\">watch a video<\/a> showing just some of the ways you can use Jupyter notebooks inside PyCharm to boost your productivity on your data science journey with your chosen dataset.&nbsp;<\/p>\n\n\n\n<p align=\"center\">\n    <a class=\"jb-download-button\" href=\"https:\/\/www.jetbrains.com\/pycharm\/data-science\/\" target=\"_blank\" rel=\"noopener\">      \n        Try PyCharm Professional for free\n    
<\/a>\n<\/p>\n","protected":false},"author":1150,"featured_media":516851,"comment_status":"closed","ping_status":"closed","template":"","categories":[6943,952,532,1401,5108,2347],"tags":[8597,8598],"cross-post-tag":[],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/pycharm\/516844"}],"collection":[{"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/pycharm"}],"about":[{"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/types\/pycharm"}],"author":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/users\/1150"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/comments?post=516844"}],"version-history":[{"count":9,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/pycharm\/516844\/revisions"}],"predecessor-version":[{"id":517089,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/pycharm\/516844\/revisions\/517089"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/media\/516851"}],"wp:attachment":[{"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/media?parent=516844"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/categories?post=516844"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/tags?post=516844"},{"taxonomy":"cross-post-tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/fr\/wp-json\/wp\/v2\/cross-post-tag?post=516844"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}