A Portrait of the Average Data Scientist of 2023 in 3 Facts
Consider the average data scientist of 2023: They spend a significant chunk of their time visualizing data, with line and bar plots as their go-to tools. They’ve mastered their craft through independent study and use Jupyter notebooks for in-depth data analysis. Version control is not their strong suit, and when presenting findings, they tend to lean toward using slide decks or even the notebook itself. Does all this sound familiar to you? We thought it might.
In this blog post, we’ll explore three surprising facts about the average data scientist of 2023 and how Datalore is tailored to their needs – stay tuned for Part 2, which will provide another pair of insights!
Fact 1: Data scientists share their work via PowerPoint, Google Slides, Jupyter notebooks, and spreadsheets
Why is this the case? Our best guess is that results are often shared outside of the core data team and presented to stakeholders via slide decks or spreadsheets, while Jupyter notebooks are used for intra-team sharing.
| Why these sharing mechanics are not optimal | How Datalore can help |
|---|---|
| Slide decks are not interactive. | A variety of interactive controls together with the integrated Report builder allow you to turn your Jupyter notebooks into interactive data apps. |
| Slide decks take extra time and skills to create. | With the Report builder, you can arrange the Jupyter notebook cells on the canvas and share a report in just a few clicks. |
| It’s hard to apply changes to a slide deck after publishing it. | To apply changes after publishing, you just need to update the contents of the initial notebook and click Update report. |
| Spreadsheets don’t capture the narrative and story behind the data. | You can use the Report builder to organize chart, metric, Markdown, and code cells on the canvas. Giving some structure to your data helps make the story behind it more accessible. |
| Jupyter notebooks can be technical and hard to comprehend. | You can completely hide Python, SQL, Scala, or R code cells and instead focus on Markdown cells, visualizations, and interactive widgets. Or you can share the full version of the Jupyter notebook. You decide what content to include. |
| All these sharing mechanics require manual updates for regular reporting. | Datalore allows you to automate report updates with a flexible scheduling interface. |
“We share the results of our work in different ways. Firstly, Datalore’s interactive reports serve as internal tools for our Customer Success Teams, providing valuable insights. Secondly, executives also use regular Datalore reports, containing key customer statistics for tracking and analysis. Lastly, we create customized slide decks, including Datalore visualizations, when sharing insights with clients.”
Fact 2: 59% of data scientists don’t version their notebooks
Almost 6 in 10 data scientists haven’t adopted versioning practices, whereas for software engineers, code versioning with Git is considered standard. Should data scientists use Git to version their notebooks? Let’s look at a few challenges of traditional Git versioning for data science and see how Datalore helps address them.
| Traditional notebook versioning with Git | Datalore’s notebook versioning |
|---|---|
| The environment and data are usually stored separately from the notebook. This causes reproducibility issues. | Datalore stores notebooks together with their environment, data, and other notebook artifacts. |
| Collaborating on notebooks results in time spent resolving merge conflicts. | Datalore’s editor supports real-time collaboration and helps you keep track of changes with internal versioning. |
| Versioning with Git requires extra actions and context switching. | Datalore automatically creates history checkpoints for actions like cell or worksheet deletion and collaborator changes. To create a history checkpoint yourself, just press Cmd/Ctrl+S. |
Fact 3: Data quality is the biggest pain point for data scientists
As a foundational activity, data quality control demands the attention of both data engineers and data scientists. Datalore helps bring the whole data team into the same notebook environment to collaboratively troubleshoot data quality issues.
Data engineers can use built-in data integrations and native SQL cells to query the data, then rely on out-of-the-box visualizations and dataframe statistics to identify outliers and missing values. Finally, they can leave comments and recommendations for data analysts and scientists on how best to work with the data. Watch the Datalore overview to see the integrations and automations in action.
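As a rough sketch of the kind of checks this workflow involves, here is a minimal pandas example that flags missing values and simple IQR-based outliers in a notebook cell. It is not Datalore-specific, and the file name, column selection, and thresholds are placeholders for illustration:

```python
import pandas as pd

# Load the dataset to check; the file name is a placeholder.
df = pd.read_csv("orders.csv")

# Missing values: count and share per column.
missing = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_share": df.isna().mean().round(3),
})
print(missing[missing["missing_count"] > 0])

# Simple IQR-based outlier check for each numeric column.
for col in df.select_dtypes("number").columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
    print(f"{col}: {len(outliers)} potential outliers")
```

A cell along these lines is also a natural candidate for a scheduled run.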
In addition, data engineers can leverage Datalore’s scheduling feature to regularly run data quality checks.
“Data science engineers have started using Datalore more than ever since the recent introduction of the Scheduling feature. Traditionally these engineers would write Airflow DAGs, but we’ve been transitioning to using scheduled runs for some of our use cases.”
From these three facts alone, you can see that most data scientists could stand to streamline their workflows a bit. In Part 2, we’ll delve into the most popular data science activity and discuss the extent to which the industry has moved to the cloud. Stay tuned!
P.S. All the numbers were calculated as part of the annual Developer Ecosystem survey. You can dive deep into the survey results here.