PyCharm Scientific Mode with Code Cells

You can use code cells to divide a Python script into chunks that you can individually execute, maintaining the state between them. This means you can re-run only the part of the script you’re developing right now, without having to wait for reloading your data. Code cells were added to PyCharm 2018.1 Professional Edition’s scientific mode.

To try this out, let’s have a look at the raw data from the Python Developer Survey 2017 that was jointly conducted by JetBrains and the Python Software Foundation.

To start, let’s create a scientific project. After opening PyCharm Professional Edition (Scientific mode is not available in the Community Edition), choose to create a new project, and then select ‘Scientific’ as the project type:

Create a Scientific Project

A scientific project will by default be created with a new Conda environment. Of course, for this to work you need to have Anaconda installed on your computer. The scientific project also creates a folder structure for your data.

If we want to analyze data, we’ll first need to go get some data. Please download the CSV file from the ‘Raw Data’ section of the developer survey results page. Afterward, place it in the data folder that was created in the scientific project’s scaffold.

Extract, Transform, Load

Our first challenge will be to load the file. The easiest way to do this would be to run:

So let’s run this. After writing this code in the main.py file that was created for us with the project, right click anywhere in the file and choose ‘Run’. We should see a Python console appear at the bottom of our screen after the script completes execution. On the right-hand side, we should see the variable overview with our dataframe. Click ‘View as DataFrame’ to inspect the DataFrame:

DF After Initial Load

We can see the structure of the CSV file here. The columns have headings like “Java:What other language(s) do you use?”. These columns are the result of multiple-choice answers: respondents were asked ‘What other language(s) do you use?’ and could select multiple answers. If an answer was selected, that string is inserted. Otherwise the string ‘NA’ is inserted (if you open the CSV file directly, you’ll be able to see a lot of ‘NA’ values).

If you scroll through the DataFrame a little more, you’ll see that in some cases Pandas was able to correctly infer some data, but in many cases it would be fairly unwieldy to work with the data in this shape.

To make the data easier to work with, we could recode columns after the read_csv call, and fix things. A better way is to configure the read_csv call with various parameters.

In this step, we’d like to make sure that our columns will be named in a way that’s easier to work with, and to make sure that the data types are all correct. To do this, we can use several parameters of read_csv:

  • names – allows us to specify the names of the columns (instead of reading them from the CSV file). We need to pay attention to the fact that if we specify this parameter, Pandas will import the header column as a data row by default. We can prevent that by explicitly specifying header=0, which indicates that the 0th row (the first row) is a header row.
  • dtype – enables us to specify a datatype per column, as a dict. Pandas will cast values in these columns to the specified datatype.
  • converters – functions that receive the raw value of the cell for a specified column, and return the desired value (with the desired datatype).
  • usecols – allows us to specify exactly which columns to import.

 

For more details, see the documentation for read_csv. Or, just write pd.read_csv in PyCharm to see it in the documentation tool window (this works if you have scientific mode enabled; if not, use Ctrl+Q to see the documentation).

The disadvantage of these parameters is that they take lists and dicts, which become very unwieldy for datasets with many columns. As our dataset has over 150 columns, it would be a pain to write them inline. Also, the information for one column would be spread among these parameters, making it hard to see what is being done to a column.

One great thing about analyzing data with Pandas is that we can use all features of the Python language. So let’s create a data dictionary with plain Python objects, and then use some Python magic to transform these to the structures Pandas needs.

The Data Dictionary

To recap, for every column we want to know what name we will want to give it, and how to encode the values. We also want to have the ability to drop a column.

Let’s create a separate file to hold our data dictionary: survey_data_dictionary.py. In this file, we define a class that describes what we want to do with a column:

Now we can make a big list of all of our columns, and describe one-by-one what to do with them. To make our lives easier, we can use Pandas to get a list of the current names of the columns. Run in the Python console:

This will print the full name of every column as a Python comment. Copy & paste the full list into the data dictionary Python file after the class definition. We can now use regex replacement to create instances of our ColumnDescription class.

Open the Replace tool (Ctrl+R or Edit | Find | Replace), and make sure to check ‘Regex’ to enable regex mode. Enter #([^\n]+) as the regex to find. This looks for the ‘#’ character, and then multiple characters that are *not* a newline. Everything between the parentheses is captured into a group, which we can then use in the replacement (use $1 for the first capture group).

As a replacement type (use Ctrl+Shift+Enter to create newlines):

Make sure you’ve indented the middle line, and used double quotes, and then click “Replace all”. Now we’ve created a lot of ColumnDescription objects. We’ll need them in a list for Pandas, so let’s wrap it with a list constructor now. Write DATA_DICTIONARY = [ before the first ColumnDescription call, and ] at the end of the file. Choose Code | Reformat Code to properly indent all of the ColumnDescription calls.

At this point, we can go back to our main.py and feed this data structure into the read_csv call. Let’s start by adding the ability to rename columns – we do this with the names parameter. This should be a list of strings, with as its length the number of columns in the CSV file.

We can use a Python list comprehension to extract the names from our list of ColumnDescription objects:

At this point we can provide this list to read_csv. We also need to remember to specify the header row explicitly so that Pandas doesn’t import the header row as a data row:

If we run this code, we should see nothing has changed. To see if it worked, let’s go back to our data dictionary and add a name to the first column:

After re-running main.py, we should now see that the first column has been renamed. Crack open a bottle of champagne to celebrate your success!

Let’s provide the other metadata from DATA_DICTIONARY to read_csv with a combination of list comprehensions and dict comprehensions:

Now all that’s left to do is to populate the rest of the data dictionary. Unfortunately, this is manual work; there’s no way for Pandas to know the design of our survey. If you want to follow along with the rest of the blog post without writing the entire data dictionary, you can grab a complete one from the GitHub repo.

For those columns that are either ‘NA’ or the name of the selected columns, we can create a small helper function that will convert these to booleans:

We can then specify this helper function as the converter for a column like this:

Another type of data that’s fairly common in surveys is categorical data: multiple choice, single answer. We can specify Pandas’ CategoricalDType as the data type for those columns:

Cleaning up our Data

Although our columns are now looking good, we may want to make some additional changes to our data. In the Python developer survey, the first question was:

Is Python the main language you use for your current projects?

  • Yes
  • No, I use Python as a secondary language
  • No, I don’t use Python for my current projects

All respondents who selected they don’t use Python were excluded from most of the rest of the survey. So we should drop these data points for our analysis. Let’s create a new code cell, and start cleaning up our data.

Code cells are defined simply by creating a comment that starts with #%%. The rest of the comment is the header of the cell, which you see when you collapse it:

As long as you have the scientific mode enabled in PyCharm Professional, you should see a dividing line appear, and a green ‘play’ icon to run the cell.

It’s fairly easy to select data in Pandas, so let’s complete our cell:

As the remaining choices are basically “Yes” and “No”, we can also turn the remaining data into a boolean:

Analyzing our Data

In our survey, users were asked what they thought the ratio is between the number of Python developers creating web applications, and the number of developers that do data science. To make things interesting, they were also asked what they thought other people thought this ratio was. Let’s see now if people think they agree with the rest of the world.

The questions look like this:

Please think about the total number of Python Web Developers in the world and the total number of Data Scientists using Python.

What do you think is the ratio of these two numbers?

Python Web Developers ( ) 10:1 ( ) 5:1 ( ) 2:1 ( ) 1:1 ( ) 1:2 ( ) 1:5 ( ) 1:10 Python data scientists

What do you think would be the most popular opinion?

Python Web Developers ( ) 10:1 ( ) 5:1 ( ) 2:1 ( ) 1:1 ( ) 1:2 ( ) 1:5 ( ) 1:10 Python data scientists

 

Make sure you’ve specified categorical data types in the data dictionary for both questions.

We can now go ahead and create a new code cell to start our analysis. Let’s start by getting the value counts:

We’re disabling sorting here to maintain the order that we’ve specified using the categorical data type. If we run this cell with the green play icon, we can then click ‘View as Series’ in the variable overview to have a glance at our data.

View as Series

We can also use Matplotlib to get a graphical overview of the data:

Ratio Plot

See the GitHub repository for the exact code used to generate the plot.

We can see in the plot that there’s a difference between what individual respondents thought the ratio was, and what they thought the most popular opinion was. So let’s dive a little deeper: how big is this difference?

Exploring Further

Although the data points are categorical, they represent numbers, so we can see what the numeric difference would be if we turn them into numbers. Let’s create a new code cell, and calculate the difference:

After running this cell, we see:

Turns out the difference in means isn’t very large. However, the distributions are fairly different. We can see in the plot that the ‘self’ distribution has a peak at the 5:1 web dev:data scientist point, whereas the ‘Most popular’ distribution trades some votes from 5:1 to 1:1. Fun fact: this same survey found about a 1:1 distribution between web developers and data scientists with its Python usage questions.

ratio_plot

To see whether or not we have a significant difference, we can use a Chi-Square test. The scipy.stats package contains a method to calculate this statistic. So let’s create a last code cell to finish this investigation:

This results in: Power_divergenceResult(statistic=294.72519972505006, pvalue=1.1037599850410103e-60).

In other words, there’s a 1-60 chance that this is the result of random chance, and we can conclude this is a statistically significant difference.

What’s Next?

We’ve just shown how to ingest a fairly large CSV file into Pandas, and how to handle the conversion of data from its raw form to a form that’s easier to analyze. For the example, we looked into what respondents think the Python ecosystem looks like. And we’ve confirmed that people think that others have a different opinion from themselves (also, water is wet).

Now it’s your turn! Download the CSV (and you may want to grab the data dictionary from this blog post’s repo) and let us know what interesting things you discover in the Python developer survey! It contains many interesting data points: what people use Python for, what their job roles are, what packages they use, and more.

This entry was posted in Tutorial and tagged , . Bookmark the permalink.

3 Responses to PyCharm Scientific Mode with Code Cells

  1. Trenton McKinney says:

    Running:
    Python 3.6.5
    PyCharm 2018.1.1 Pro

    Perhaps there is an issue with PyCharm:
    Running the following does not produce an object in the Special Variables window:
    import pandas as pd
    pd.read_csv(‘data/survey.csv’)

    the same behavior occurs even with a small 5×5 csv. Additionally, even with a very small test dataset the run button shows that the script never finishes and the stop button stays red.

    if test = pd.read_csv(‘data/survey.csv’)
    then a test df is produced, but not otherwise and the Pycharm exhibits the same behavior is relation to the process doesn’t finish. If the red stop button is clicked, the process finishes with exit code -1.

    • Ernst Haagsman says:

      Hi Trenton,

      In your snippet you’re not assigning the dataframe to a variable. Try:

      import pandas as pd
      df = pd.read_csv(‘data/survey.csv’)

  2. Trenton McKinney says:

    Thank you for the response. I’m aware that assigning it to a variable works, but I was confused as I followed along the blog, because pd.read_csv(‘data/survey.csv’) isn’t assigned to a variable at the beginning and the instructions state to run the code and then look for the dataframe in the variables.

    Other than the initial confusion, I’ve enjoyed enjoyed your blog post, especially the steps used to create the data_dictionary and using find/replace to negate the need to manually fill in the full_name.

Leave a Reply

Your email address will not be published. Required fields are marked *