The State of Data Science 2024: 6 Key Data Science Trends
Generative AI and LLMs have been hot topics this year, but are they affecting trends in data science and machine learning? What new trends in data science are worth following? Every year, JetBrains collaborates with the Python Software Foundation to carry out the Python Developer Survey, which can offer some useful insight into these questions.
The results from the latest iteration of the survey, collected between November 2023 and February 2024, included a new Data Science section. This allowed us to get a more complete picture of data science trends over the past year and highlighted how important Python remains in this domain.
While 48% of Python developers are involved in data exploration and processing, the percentage of respondents using Python for data analysis dropped from 51% in 2022 to 44% in 2023. The percentage of respondents using Python for machine learning dropped from 36% in 2022 to 34% in 2023. At the same time, 27% of respondents use Python for data engineering, and 8% use it for MLOps – two new categories that were added to the survey in 2023.
Let’s take a closer look at the trends in the survey results to put these numbers into context and get a better sense of what they mean. Read on to learn about the latest developments in the fields of data science and machine learning to prepare yourself for 2025.
Data processing: pandas remains the top choice, but Polars is gaining ground
Data processing is an essential part of data science. pandas, a project that is now 15 years old, is still at the top of the list of the most commonly used data processing tools, used by 77% of respondents who do data exploration and processing. As a mature project, its API is stable, and many working examples can be found on the internet, so it's no surprise that pandas remains the obvious choice. As a NumFOCUS-sponsored project, pandas has proven to the community that it is sustainable, and its governance model has gained user trust. It is also a great choice for beginners who may still be learning the ropes of data processing, as it does not undergo rapid changes.
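As a quick illustration of the kind of workflow pandas is known for, here is a minimal sketch of a group-and-aggregate step (the table and column names are made up for the example):

```python
import pandas as pd

# A small sales table; groupby + agg is a typical first exploration step
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "revenue": [100, 80, 120, 90],
})

# Total and average revenue per region
summary = df.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```

This stable, widely documented API is exactly why pandas remains the default starting point for most practitioners.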
On the other hand, Polars, which pitches itself as DataFrames for the new era, has been in the spotlight quite a bit both last year and this year, thanks to the advantages it provides in terms of speed and parallel processing. In 2023, a company led by the creator of Polars, Ritchie Vink, was formed to support the development of the project. This ensures Polars will be able to maintain its rapid pace of development. In July of 2024, version 1.0 of Polars was released. Later, Polars expanded its compatibility with other popular data science tools like Hugging Face and NVIDIA RAPIDS. It also provides a lightweight plotting backend, just like pandas.
So, for working professionals in data science, there is an advantage to switching to Polars. As the project matures, it can become a load-bearing tool in your data science workflow and can be used to process more data faster. In the 2023 survey, 10% of respondents said that they are using Polars as their data processing tool. It is not hard to imagine this figure being higher in this year’s survey.
Whether you are a working professional or just starting to process your first dataset, it is important to have an efficient tool at hand that can make your work more enjoyable. With PyCharm, you can inspect your data as interactive tables, which you can scroll, sort, filter, convert to plots, or use to generate heat maps. Moreover, you can get analytics for each column and use AI assistance to explain DataFrames or create visualizations. Apart from pandas and Polars, PyCharm provides this functionality for Hugging Face datasets, NumPy, PyTorch, and TensorFlow.
The popularity of Polars has led to the creation of a new project called Narwhals. Independent from pandas and Polars, Narwhals aims to unite the APIs of both tools (and many others). Since it is a very young project (started in February 2024), it hasn’t yet shown up on our list of the most popular data processing tools, but we suspect it may get there in the next few years.
Also worth mentioning are Spark (16%) and Dask (7%), which are useful for processing large quantities of data thanks to their support for parallel processing. These tools require a bit more engineering capability to set up. However, as the amount of data that projects depend on increasingly exceeds what a traditional Python program can handle, they will become more important, and we may see these figures go up.
Data visualization: Will HoloViz Panel surpass Plotly Dash and Streamlit within the next year?
Data scientists have to be able to create reports and explain their findings to businesses. Various interactive visualization dashboard tools have been developed for working with Python. According to the survey results, the most popular of them is Plotly Dash.
Plotly is best known in the data science community for its interactive graphing libraries, which are built on plotly.js and available for several languages, including Python, R, and JavaScript. In recent years, Dash, a Python framework for building reactive web apps developed by Plotly, has become an obvious choice for those who are used to Plotly and need to build an interactive dashboard. However, Dash's API requires some basic understanding of the elements used in HTML when designing the layout of an app. For users who have little to no frontend experience, this could be a hurdle they need to overcome before making effective use of Dash.
Second place for “best visualization dashboard” goes to Streamlit, which has now joined forces with Snowflake. It doesn’t have as long of a history as Plotly, but it has been gaining a lot of momentum over the past few years because it’s easy to use and comes packaged with a command line tool. Although Streamlit is not as customizable as Plotly, building the layout of the dashboard is quite straightforward, and it supports multipage apps, making it possible to build more complex applications.
However, in the 2024 results these numbers may change a little. There are up-and-coming tools that could catch up to – or even surpass – these apps in popularity. One of them is HoloViz Panel. As one of the libraries in the HoloViz ecosystem, it is sponsored by NumFOCUS and is gaining traction in the PyData community. Panel lets users generate reports in HTML format and also works very well with Jupyter Notebook. It offers templates to help new users get started, as well as a great deal of customization options for expert users who want to fine-tune their dashboards.
ML models: scikit-learn is still prominent, while PyTorch is the most popular for deep learning
Because generative AI and LLMs have been such hot topics in recent years, you might expect deep learning frameworks and libraries to have completely taken over. However, this isn't entirely true. There is still a lot of insight that can be extracted from data using traditional statistics-based methods offered by scikit-learn, a well-known machine learning library mostly maintained by researchers. Sponsored by NumFOCUS since 2020, it remains the most important library in machine learning and data science. SciPy, another Python library that provides support for scientific computing, is also one of the most used libraries in data science.
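As a reminder of how approachable these traditional methods remain, here is a minimal sketch that trains a classic scikit-learn classifier on the bundled iris dataset, no deep learning required:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the bundled iris dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A classic statistics-based model: fast to train, easy to interpret
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

A few lines like these cover a surprising share of everyday modeling work, which goes a long way toward explaining scikit-learn's staying power.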
Having said that, we cannot ignore the impact of deep learning and the increase in popularity of deep learning frameworks. PyTorch, a machine learning library created by Meta, is now under the governance of the Linux Foundation. In light of this change, we can expect PyTorch to continue being a load-bearing library in the open-source ecosystem and to maintain its level of active community involvement. As the most used deep learning framework, it is loved by Python users – especially those who are familiar with NumPy, since "tensors", the basic data structures in PyTorch, are very similar to NumPy arrays.
Unlike TensorFlow, which uses a static computational graph, PyTorch uses a dynamic one – and this makes debugging in Python a blast. To top it all off, PyTorch also provides a profiling API, making it a good choice for research and experimentation. However, if your deep learning project needs to be scalable in deployment and needs to support multiple programming languages, TensorFlow may be a better choice, as it is compatible with many languages, including C++, JavaScript, Python, C#, Ruby, and Swift. Keras, a high-level API that makes TensorFlow more accessible, is also a popular choice for deep learning.
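A small sketch of what "dynamic graph" means in practice: the graph is built as ordinary Python executes, so regular control flow works, and gradients can be inspected immediately.

```python
import torch

# Tensors look and behave much like NumPy arrays
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The computational graph is built as this Python code runs,
# so ordinary control flow (if/for) works naturally
if x.sum() > 0:
    y = (x ** 2).sum()
else:
    y = (x ** 3).sum()

y.backward()       # gradients of y = x1² + x2² + x3²
print(x.grad)      # dy/dx = 2x → tensor([2., 4., 6.])
```

Because each operation runs eagerly, you can drop a breakpoint or a `print` anywhere in the model code, which is what makes debugging so pleasant compared to static-graph frameworks.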
Another framework we cannot ignore for deep learning is Hugging Face Transformers. Hugging Face is a hub that provides many state-of-the-art pre-trained deep learning models that are popular in the data science and machine learning community, which you can download and train further yourself. Transformers is a library maintained by Hugging Face and the community for state-of-the-art machine learning with PyTorch, TensorFlow, and JAX. We can expect Hugging Face Transformers to gain more users in 2024 due to the popularity of LLMs.
With PyCharm you can identify and manage Hugging Face models in a dedicated tool window. PyCharm can also help you to choose the right model for your use case from the large variety of Hugging Face models directly in the IDE.
One new library that is worth paying attention to in 2024 is Scikit-LLM, which allows you to tap into OpenAI models like ChatGPT and integrate them with scikit-learn. This is very handy when text analysis is needed, as you can combine scikit-learn's familiar API with the power of modern LLMs.
MLOps: The future of data science projects
One aspect of data science projects that is essential but frequently overlooked is MLOps (machine learning operations). In the workflow of a data science project, data scientists need to manage data, retrain the model, and have version control for all the data and models used. Sometimes, when a machine learning application is deployed in production, performance and usage also need to be observed and monitored.
In recent years, MLOps tools designed for data science projects have emerged. One of the issues that has been bothering data scientists and data engineers is versioning the data, which is crucial when your pipeline constantly has data flowing in.
Data scientists and engineers also need to track their experiments. Since the machine learning model will be retrained with new data and hyperparameters will be fine-tuned, it's important to keep track of model training and experiment results. Right now, the most popular tool is TensorBoard. However, this may be changing soon. TensorBoard.dev has been deprecated, which means users are now forced to deploy their own TensorBoard installations locally or share results using the TensorBoard integration with Google Colab. As a result, we may see a drop in the usage of TensorBoard and an uptick in that of other experiment-tracking tools like MLflow.
Another MLOps step that is necessary for ensuring that data projects run smoothly is shipping the development environment to production. The use of Docker containers, a common development practice among software engineers, seems to have been adopted by the data science community. This ensures that the development environment and the production environment remain consistent, which is important for data science projects involving machine learning models that need to be deployed as applications. We can see that Docker is a popular tool among Python users who need to deploy services to the cloud.
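A minimal Dockerfile sketch for a model-serving app might look like the following (the file names `requirements.txt` and `serve_model.py` are hypothetical placeholders for your own project files):

```dockerfile
# Pin the Python version so dev and production environments match
FROM python:3.12-slim

WORKDIR /app

# Install pinned dependencies first to take advantage of layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model-serving code and run it
COPY . .
CMD ["python", "serve_model.py"]
```

The same image then runs identically on a laptop and in the cloud, which is the consistency guarantee that makes containers attractive for model deployment.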
This year, Docker is slightly ahead of Anaconda in the "Python installation and upgrade" category.
Big data: How much is enough?
One common misconception is that we will need more data to train better, more complex models and improve predictions. However, this is not the case: since models can overfit, more is not always better in machine learning. Different tools and approaches will be required depending on the use case, the model, and how much data is being handled at once.
The challenge of handling huge amounts of data in Python is that most Python libraries rely on the data being stored in memory. We could simply deploy cloud computing resources with huge amounts of memory, but even this approach has its limitations and can be slow and costly.
When handling huge amounts of data that are hard to fit in memory, a common solution is to use distributed computing resources. Computation tasks and data are distributed over a cluster to be performed and handled in parallel. This approach makes data science and machine learning operations scalable, and the most popular engine for this is Apache Spark. Spark can be used with PySpark, the Python API library for it.
As of Spark 2.0, anyone using the Spark RDD API is encouraged to switch to Spark SQL, which provides better performance. Spark SQL also makes it easier for data scientists to handle data because it enables SQL queries to be executed directly. We can expect PySpark to remain the most popular choice in 2024.
Another popular tool for managing data in clusters is Databricks. If you are using Databricks to work with your data in clusters, now you can benefit from the powerful integration of Databricks and PyCharm. You can write code for your pipelines and jobs in PyCharm, then deploy, test, and run it in real time on your Databricks cluster without any additional configuration.
Communities: Events shifting focus toward data science
Many newcomers to Python are using it for data science, and thus more Python libraries have been catering to data science use cases. In that same vein, Python events like PyCon and EuroPython are beginning to include more tracks, talks, and workshops that focus on data science, while events that are specific to data science, like PyData and SciPy, remain popular, as well.
Final thoughts
Data science and machine learning are becoming increasingly active, and together with the popularity of AI and LLMs, more and more new open source tools have become available for use in data science. The landscape of data science continues to change rapidly, and we are excited to see what becomes most popular in the 2024 survey results.
Enhance your data science experience with PyCharm
Modern data science demands skills for a wide range of tasks, including data processing and visualization, coding, model deployment, and managing large datasets. As an integrated development environment (IDE), PyCharm helps you efficiently build this skill set. It provides intelligent coding assistance, top-tier debugging, version control, integrated database management, and seamless Docker integration. For data science, PyCharm supports Jupyter notebooks, as well as key scientific and machine learning libraries, and it integrates with tools like the Hugging Face models library, Anaconda, and Databricks.
Start using PyCharm for your data science projects today and enjoy its latest improvements, including features for inspecting pandas and Polars DataFrames, and for the layer-by-layer inspection of PyTorch tensors, which is handy when exploring data and building deep learning models.