
Early Access PyCharm: PyCharm and Data Science

For a very long time, PyCharm’s Data Science tooling has not been a feature set that we’ve talked about at length. We’ve got a lot wrong, but this time around, we’ve changed the way we approach building tools for Data Scientists. I sat down with Andrey Cheptsov, the Product Manager of the newly created DataSpell IDE, and asked him how it’s going to be different this time.


Nafiul: Hey everybody. This is early access PyCharm. I’m your host Nafiul Islam. Today we have a special treat for you. I’m here with Andrey Cheptsov who is the product manager for a new IDE that we’re about to release called DataSpell. That’s going to be targeted towards data scientists, but a lot of the features of DataSpell, in fact, if not all of them, are going to be ported into PyCharm and I’m just super excited about it. So without much further ado, let’s dive in.
So Andrey, you’re the guy that everybody talks to when it comes to the data science features of PyCharm. What have you been up to this past year in terms of data science?
Andrey Cheptsov: All right. I think it’s actually a little bit more than a year. I think we started working on that in October 2019 and yeah, there are many things to talk about. So where do I start?
Nafiul: You can start by talking about the fact that our data science offering has had a few ups and downs. So what have we been up to in trying to fix that?
Andrey Cheptsov: Awesome. Indeed, it’s a good starting point to discuss the data science support in PyCharm. PyCharm is an IDE for Python developers. It’s known for its, let’s say, intelligent coding assistance, which people like: refactoring, code completion, and quick fixes.
And if you look at the community of Python developers, we see that people with totally different, let’s say, job titles and mindsets use the IDE. But what is prevailing? The main one, if I can say so, is probably developers. So PyCharm is very well known within the developer community.
Nafiul: Oh, absolutely. Yeah, it is.
Andrey Cheptsov: And I think it’s for a good reason, because it really nails the core development experience, which is good support for Git and built-in tools such as the terminal. Of course the editor is there; it is the heart of the IDE. And the workflow: well, it takes some time to set up your project, but then you work with that project for quite a significant amount of time, and then you commit all your work through Git, and that’s how it works.
However, Python is a lot more than just development. There are many other things going on, and data science is for sure one of the other main things taking place. And if we take a look at some of the surveys asking how people use Python, at Stack Overflow, for example, or JetBrains’ own DevEcosystem survey, we’ll see that a lot of people use Python for data analysis and also for machine learning. It’s quite interesting because there are different ways of using Python for data analysis and machine learning. So for example, there are software developers who are somehow involved in data analysis. It could be that they write software for data analysis.
So basically they automate things around data. And there are also people who, apart from software development, are involved in ad hoc data analysis, which is a very interesting thing. And that was probably also a starting point for us to look differently at the support for data science within PyCharm.
I think that’s where it started and basically…
Nafiul: But here’s the thing: we’ve added different kinds of support for data science in PyCharm again and again. We did it not just a year ago, but years ago. And… something happened. It wasn’t what people expected. So what are we doing differently this time around to make sure that we nail that data science story?
Andrey Cheptsov: Yeah. That’s a good one. Thanks for asking. If I were to say what has changed in the approach we are taking, the main thing would be to look from the perspective of the people who are involved in data analysis. And maybe we should talk a little bit about what data analysis is, what we are talking about, and what it looks like. If I can describe very briefly and on a primitive level what data analysis is and how it’s different from development, I would probably describe it the following way. There are certain tools which one can use to look at the data.
And when I say, look at the data, I really mean it. Literally staring at the data.
Nafiul: At rows and rows of random numbers and fields, huh?
Andrey Cheptsov: Yeah, exactly. And so to make sense out of data, you have to use certain tools, which we’ll talk about. And then in the end you have to look at the data yourself, and to a certain extent that cannot be automated.
You have to look at either the raw data, the processed data, or the visualized data. And this is how data analysis is done. You process data, and then you visualize data. And there are a lot of different tools used for that.
Nafiul: And so the approach this time around is to really focus on what data scientists want to do and to come at it from their perspective, instead of just making it an offshoot of what we offer software engineers in general.
So it’s more about what they need instead of how we can retrofit some kind of data science support into the IDE.
Andrey Cheptsov: Yeah. One thing that is changing now, and that we’ve been changing for some time: PyCharm is very good at providing a way to read and write code and then also run code.
And that is still a very important part of data analysis, since to do data analysis you still have to write code, and a lot of things are done through code. Yet there’s another key thing, which is an interactive way of working with data, and maybe…
Nafiul: So that means visualizations and just getting quick feedback when you input something.
Andrey Cheptsov: Yeah, absolutely. And that’s maybe a good point in time to talk about the tools which make data analysis so efficient, let’s say. There are many different tools data scientists use today. And you can get a rough idea of what tools data scientists actually use by looking at these, let’s say, surveys, for example Stack Overflow or DevEco. And what you can immediately see is, for example, that there’s the NumPy library, right?
Which lets you work efficiently with data and process this data. And when you process data, you have to analyze what you get, basically look at this data in some way, and this is where two things come in very handy. One is interactive Python. Or,
Nafiul: Yeah.
Andrey Cheptsov: a tool like IPython, for example, which implements this REPL mode where you write something, then you run it, and then you look at it, and then you can run it again and see the results again.
So this is very different from what a code editor typically offers you.
Nafiul: Yes, absolutely.
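[A minimal sketch of the write, run, look, repeat loop described above. It uses only the Python standard library; the `run_snippets` helper and the sample session are hypothetical illustrations and say nothing about how IPython itself is implemented.]

```python
def run_snippets(snippets):
    """Run each snippet in one shared namespace and collect what a REPL would echo."""
    namespace = {}  # state survives between snippets, as in a REPL session
    echoed = []
    for source in snippets:
        try:
            # Expressions are evaluated and their value is echoed back.
            value = eval(source, namespace)
            if value is not None:
                echoed.append(repr(value))
        except SyntaxError:
            # Statements (assignments, imports, ...) run without an echo.
            exec(source, namespace)
    return echoed


# Hypothetical session: create data, look at it, process it, look again.
session = [
    "data = [3, 1, 4, 1, 5, 10]",
    "sum(data) / len(data)",
    "data = sorted(data)",
    "data[:3]",
]

for line in run_snippets(session):
    print(line)
```

The point of the sketch is the shared namespace: each step sees the results of the previous one, so you can inspect intermediate data without rerunning a whole script.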
Andrey Cheptsov: And the other thing is, of course, Jupyter notebooks, which are very famous. There are people who love them, and there are people who hate them. And with all of that, one cannot deny that Jupyter notebooks are probably the best way today to interact and work with the data.
Nafiul: Okay. So in terms of design decisions you went out there and you just saw the different ways in which data scientists were interacting with data, were working with data, were playing with data, and I’m guessing that in the new thing that we’re going to do the most important thing is going to be Jupyter support and top-notch Jupyter support inside of the IDE.
Am I correct in assuming that?
Andrey Cheptsov: Oh, yeah. So one of the main things is better support for Jupyter notebooks. It’s not the only thing, and I think we’ll cover the other things too. But if we start somewhere, Jupyter notebook support is totally the right place to start. If I can describe very briefly what we do in terms of better support for Jupyter notebooks: the approach was to try to see if we can make something like a Jupyter notebook out of the code editor, which PyCharm already has.
And this didn’t seem to work really well, for a great number of reasons. One reason is that notebooks are so handy because they let you see the results immediately, in line with your, let’s say, code. So you’ve…
Nafiul: They’re also quite portable. You can share them with a friend and they can also run the code and they can also see the visualizations or they can see a rendered version of that.
So going back to the old WolframAlpha and Mathematica days where you would have a notebook with different equations being solved by your CAS system. Okay. So your main focus is going to be Jupyter notebooks. You do want to add other features, of course; that’s not the only thing. But what are we working on that makes this offering in PyCharm just really amazing to use? What have we done? What was the special sauce that we added to make this top-notch?
Andrey Cheptsov: Right. Yeah. I think the secret sauce would probably be something very obvious here. We’re going to keep the best parts of the Jupyter notebook.
And make sure that we don’t miss any of the advantages Jupyter notebooks offer. So we are talking about the inline cell outputs; we are talking about the command mode, which makes it easy to navigate over the cells and also apply commands there with familiar shortcuts. We’re also talking about JavaScript outputs, which were a problem previously: if you were using some interactive library like Plotly or Bokeh or widgets, it didn’t work due to the poor interoperability between the IDE and JavaScript. This is also something that we’ve been working to change. Basically, we are going to make notebooks work exactly as you expect them to work, with all of the nice things which you like in Jupyter notebooks, and at the same time, we also want to keep some of the things PyCharm is typically good at and make sure that they apply to this new Jupyter notebook support.
Nafiul: So in terms of performance, are we working on making this as fluid and as fast and as smooth as possible?
In terms of just working with it, is it going to be a similar experience to writing in your editor, where the code completion pops up very well and you can debug and you get all the IDE features that you love about PyCharm, but you now have it in a notebook format?
Andrey Cheptsov: Yeah. So performance is one of the main, let’s say, aspects. And this was a big problem as well with the previous support for Jupyter notebooks, especially when you start to import a lot of data and then you start to visualize all of it. We used to have problems which got in the way of working with notebooks, and now we’ve addressed most of them.
So you should expect at least the same experience which you typically have with Jupyter notebooks, except that you also get coding assistance on top of that without any delays.
Nafiul: You know, this is of course a PyCharm show, but Python isn’t the only language that is prominent in data science. So what other plans do we have for people who are using R, people who are using Julia, or whatever other language they want to use for data science?
Andrey Cheptsov: I think what is worth mentioning here is that we are bringing better support for Jupyter notebooks to Python developers. A big chunk of the work here is to make the notebook support language-independent within the IntelliJ platform. And while of course the notebook support is super important for Python developers at this point in time, more than, for example, for R or Julia or other languages, we think it is super important to make it language-independent. And…
Nafiul: That has to be the case, because if you take a look at the notebook, the IPython kernel, this notebook format is being adopted by Scala. It’s been adopted by Julia. It’s been adopted by other languages because it’s portable.
It helps you transfer that idea and that data processing work from one place to another, from one data scientist or one data analyst to another. So building this as something that works across languages, I would assume, is going to pay dividends when we support more languages.
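[A minimal sketch of why the notebook format travels across languages, assuming the version 4 notebook JSON layout. The `minimal_notebook` helper is hypothetical, and the Julia kernel name is only an illustrative value; real kernel names depend on what is installed.]

```python
import json


def minimal_notebook(kernel_name, language, source):
    """Build a minimal version-4 notebook document as a plain dict."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {
            # The kernelspec names the kernel that should run the cells;
            # this is the only place the language shows up in the structure.
            "kernelspec": {
                "name": kernel_name,
                "display_name": kernel_name,
                "language": language,
            }
        },
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": source,
            }
        ],
    }


python_nb = minimal_notebook("python3", "python", "print('hello')")
julia_nb = minimal_notebook("julia-1.6", "julia", 'println("hello")')  # illustrative name

# Both documents serialize to the same JSON shape; only the kernelspec
# values and the cell source differ.
print(json.dumps(julia_nb["metadata"], indent=2))
```

Because the cell structure is identical for every kernel, a tool that can render and run cells for one language has most of what it needs for the others.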
Andrey Cheptsov: Yeah, absolutely. And of course it’s a higher return, let’s say, on investment on our end: now that we support Python, if other people can use it for other languages, we don’t have to implement it again. That’s one thing.
The other one, which I think is super important as well, is that Python is a super great language for data science, but that doesn’t mean other languages shouldn’t also improve their support for data science workflows. And I see a lot of potential in other languages as well. Of course Scala is one case, but I personally see a trend here that data science is going to be a big thing for other languages as well. And if you look, for example, at what’s going on with JavaScript, there are a lot of things going on, and in the end everybody can benefit from better support for notebooks in other languages.
Nafiul: I can absolutely imagine, but I’m assuming that building that support across libraries, across languages, across frameworks is going to be difficult. But is that made a little bit easier now that we can embed a web browser into the IDE?
Andrey Cheptsov: Yeah, absolutely. And I think you’ve just nailed it.
We can make it possible because we’ve found, let’s say, a working way of integrating JavaScript and the IDE in order to make it interactive. And now that we support Jupyter notebooks, it’s probably already very easy to support non-Python kernels of Jupyter notebooks.
So that’s the easiest, let’s say, step here. A more difficult step will be to support non-Jupyter notebooks, which of course exist.
Nafiul: So Andrey, last question before we wrap this up. When are we going to get all this goodness? When can we actually get our hands on all this goodness that you’re talking about, and how can we get it?
Andrey Cheptsov: Yeah, sure. So I think it depends on whether you are open to trying one of the unstable builds, for example, or some early previews, and eager to also share feedback. If that’s the case, I would strongly suggest that you sign up for the private beta, and we’ll send you new builds. And basically in September we are going to make it public and available for everyone.
So you won’t really have to register. It’s still going to be EAP quality, which means there will be bugs, and we’ll fix them.
And speaking of the release of DataSpell, currently it’s most likely going to be part of the spring 2022.1 release train of IntelliJ-based IDEs.
Nafiul: So when is PyCharm going to see all of this?
Andrey Cheptsov: It’s very likely that PyCharm will get this once DataSpell is released, which is most likely to happen, yeah, in spring. There’s still a chance that it might happen even sooner, but most likely, yeah, spring of next year.
Nafiul: Awesome.
Thank you very much, Andrey, for dropping by, and we’ll see you again soon.
Andrey Cheptsov: Yeah. Thanks for having me.
