Interview: Dan Tofan for this week’s data science webinar
In the past few years, Python has made a big push into data science, and so has PyCharm. Years ago we added Jupyter Notebook integration; then 2017.3 introduced Scientific Mode for workflows that feel more like an IDE. In 2019.1 we re-invented our Jupyter support so that it, too, feels more like a professional tool.
PyCharm and data science are thus a hot topic. Dan Tofan very recently published a Pluralsight course on using PyCharm for data science and we invited him for a webinar next week.
To help set the stage, below is an interview with Dan.
- Thursday, April 25
- 7PM GMT+3, 9AM Pacific
- Register here
- Aimed at new and intermediate data scientists
Let’s start with the key point: what does PyCharm bring to data scientists?
PyCharm brings a productivity boost to data scientists by helping them explore data, debug Python code, write better Python code, and understand Python code faster. As a PyCharm user, I have experienced and benefited from these productivity boosters, and I distilled them into my first Pluralsight course so that data scientists can make the most of PyCharm in their work.
For the webinar: who is it for and what can people expect you to cover?
If you are a data scientist who has dabbled with PyCharm, then this webinar is for you. I will cover the PyCharm features most relevant to data science: Scientific Mode and the completely rewritten Jupyter support. I will show how these features interplay with other PyCharm features, such as refactoring code from Jupyter cells. I will use easy-to-understand code examples with popular data science libraries.
Now, back to the start: tell us a little about yourself.
Currently, I am a senior backend developer for Dimensions, a research data platform that uses data science and links data on more than 140 million publications, grants, patents and clinical trials. I've always been curious, which led me to do my PhD studies at the University of Groningen (Netherlands) and to learn more about statistics and data analysis.
Do Python data scientists feel like programmers first and data scientists second, or the reverse?
In my opinion, data science is a melting pot of skills from three complementary backgrounds: programmers, statisticians and business analysts. At the start of your data science journey, you are going to rely on the skills from your main background and, as your skills expand, you are going to feel more and more like a data scientist.
Your course has a bunch of sections on software development practices and IDE tips. How important are these practices to “professional” data science?
As part of the melting pot, programmers bring a lot of value with their experiences ranging from software development practices to IDE tips. Data scientists from a programming background are already familiar with most of these, and those from other backgrounds benefit immensely.
Think of a code base that starts to grow: how do you write better code? How do you refactor the code? How can a new team member understand that code faster? These are some of the questions that my course helps with.
The course also covers three major facilities in PyCharm Professional: Scientific Mode, Jupyter support, and the Database tool. How do these fit in?
All of them are data-centric, so they are very relevant to data scientists. These facilities integrate nicely with other PyCharm capabilities such as debugging and refactoring. Overall, after watching the course and getting familiar with these capabilities, data scientists get a nice productivity boost.
This webinar is good timing. You just released the course and we just re-invented our Jupyter support. What do you think of the new, IDE-centric Jupyter integration?
I think the new Jupyter integration is an excellent step in the right direction, because you can use both Jupyter and PyCharm features such as debugging and code completion. Joel Grus gave an insightful and entertaining talk about Jupyter limitations at JupyterCon 2018. I think the new Jupyter integration in PyCharm can eventually help solve some Jupyter pain points raised by Joel, such as hidden state.
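To make the hidden-state pain point concrete, here is a minimal sketch of one form it can take: the same cells giving different results depending on the order in which they were run. It does not use Jupyter itself. The `cells` dictionary and the `run` helper are hypothetical stand-ins for notebook cells and a kernel namespace, chosen only for illustration.

```python
# Each string stands in for one notebook cell (hypothetical example cells).
cells = {
    1: "scale = 2",
    2: "total = 100 * scale",
    3: "scale = 3",  # a later cell that reassigns the same variable
}


def run(order):
    """Execute the cells in the given order in one shared namespace,
    roughly the way a notebook kernel keeps state between cell runs."""
    namespace = {}
    for number in order:
        exec(cells[number], namespace)
    return namespace["total"]


# Running top to bottom: total is computed while scale is still 2.
top_to_bottom = run([1, 2, 3])

# Running cell 3 before cell 2: total silently picks up scale = 3.
out_of_order = run([1, 3, 2])

print(top_to_bottom, out_of_order)
```

The notebook on screen looks identical in both cases, which is why state that depends on execution history is hard to spot by reading the cells alone.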
What’s one big problem or pain point in Jupyter that could benefit from new ideas or tooling?
Reproducibility is important for data science, and it is problematic with Jupyter. For example, it's easy to share a notebook on GitHub, but when someone else tries to run it, they may get different results. Perhaps the solution is a mix of discipline and better tools.
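As one small example of the "discipline" side, the sketch below fixes a random seed so that a sampling step returns the same values on every run. It uses only Python's standard `random` module. The `sample_scores` function and the seed value are hypothetical, and seeding addresses only one source of differing results; library versions, input data and execution order are others.

```python
import random


def sample_scores(seed, n=5):
    """Draw n pseudo-random integer scores from a generator seeded explicitly.

    Using a local random.Random instance avoids depending on global
    generator state left behind by other cells or modules.
    """
    rng = random.Random(seed)
    return [rng.randint(0, 100) for _ in range(n)]


# Two runs with the same seed produce the same sequence.
first = sample_scores(seed=42)
second = sample_scores(seed=42)

print(first == second)
```

Passing the seed as an explicit argument, rather than calling `random.seed` once at the top of a notebook, keeps the result independent of which cells ran earlier.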