{"id":520479,"date":"2024-10-29T16:47:18","date_gmt":"2024-10-29T15:47:18","guid":{"rendered":"https:\/\/blog.jetbrains.com\/?post_type=pycharm&#038;p=520479"},"modified":"2025-12-06T19:23:38","modified_gmt":"2025-12-06T18:23:38","slug":"data-exploration-with-pandas","status":"publish","type":"pycharm","link":"https:\/\/blog.jetbrains.com\/ja\/pycharm\/2024\/10\/data-exploration-with-pandas","title":{"rendered":"Data Exploration With pandas"},"content":{"rendered":"\n<p>Maybe you\u2019ve heard complicated-sounding phrases such as \u201cStudent\u2019s <em>t<\/em>-test\u201d, \u201cregression models\u201d, \u201csupport vector machines\u201d, and so on. You might think there\u2019s so much you need to learn before you can explore and understand your data, but I am going to show you two tools to help you go faster. These are <strong>summary statistics<\/strong> and <strong>graphs<\/strong>.<\/p>\n\n\n\n<p>Summary statistics and graphs\/plots are used by new and experienced data scientists alike, making them the perfect building blocks for exploring data.<\/p>\n\n\n\n<p>We will be working with <a href=\"https:\/\/www.kaggle.com\/datasets\/prevek18\/ames-housing-dataset\" target=\"_blank\" rel=\"noopener\">this dataset<\/a>, available from Kaggle, if you\u2019d like to follow along. I chose this dataset because it has several interesting properties, such as multiple <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/7-ways-to-use-jupyter-notebooks-inside-pycharm\/#continuous-variables\">continuous<\/a> and <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/7-ways-to-use-jupyter-notebooks-inside-pycharm\/#categorical-variables\">categorical<\/a> variables, missing data, and a variety of distributions and skews. 
I\u2019ll explain each variable I work with and why I chose it, so you can see which tools to apply to your own data set.<\/p>\n\n\n\n<p>In our previous blog posts, we looked at <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/10\/how-to-get-data\/\">where to get data from<\/a> and how to bring that data into PyCharm. You can look at steps <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/7-ways-to-use-jupyter-notebooks-inside-pycharm\/\" target=\"_blank\" rel=\"noreferrer noopener\">1 and 2 from our blog post entitled 7 ways to use Jupyter notebooks in PyCharm<\/a> to create a new Jupyter notebook and import your data as a CSV file if you need a reminder. You can use the dataset I linked above or pick your own for this walkthrough.<\/p>\n\n\n\n<p>We\u2019re going to be using the pandas library in this blog post, so to ensure we\u2019re all on the same page, your code should look something like the following block in a Jupyter notebook \u2013 you\u2019ll need to change the file name and location to match yours, though. 
Make sure you\u2019ve imported <em>matplotlib<\/em>, too, as we will be using that library to explore our data.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import pandas as pd\n# pyplot is the matplotlib plotting interface used later in this post\nimport matplotlib.pyplot as plt\n\n\n# Change the path to wherever your CSV file lives\ndf = pd.read_csv('..\/data\/AmesHousing.csv')\ndf<\/pre>\n\n\n\n<p>When you run that cell, PyCharm will show you your DataFrame, and we can get started.<\/p>\n\n\n\n<p align=\"center\">\n    <a class=\"jb-download-button\" href=\"https:\/\/jb.gg\/m8p92h\" target=\"_blank\" rel=\"noopener\">      \n        Try PyCharm Professional for free\n    <\/a>\n<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary statistics<\/h2>\n\n\n\n<p>When we looked at where to get data from, we discussed <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/7-ways-to-use-jupyter-notebooks-inside-pycharm\/#continuous-variables\" target=\"_blank\" rel=\"noreferrer noopener\">continuous<\/a> and <a href=\"https:\/\/blog.jetbrains.com\/pycharm\/2024\/09\/7-ways-to-use-jupyter-notebooks-inside-pycharm\/#categorical-variables\" target=\"_blank\" rel=\"noreferrer noopener\">categorical<\/a> variables. We can use Jupyter notebooks inside PyCharm to generate different summary statistics for these, and, as you might have already guessed, the summary statistics differ depending on whether the variables are continuous or categorical.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Continuous variables summary statistics<\/h3>\n\n\n\n<p>First, let\u2019s see how we can view our summary statistics. 
Click on the small bar graph icon on the right-hand side of your DataFrame and select <em>Compact<\/em>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1999\" height=\"531\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/compact.png\" alt=\"\" class=\"wp-image-520516\"\/><figcaption class=\"wp-element-caption\">Exploratory data analysis with pandas in an IDE<\/figcaption><\/figure>\n\n\n\n<p>Let me give you a little tip here: if you\u2019re unsure which variables are continuous and which are categorical, PyCharm shows different summary statistics for each one. The ones with the mini graphs (blue in this screenshot) are continuous, and those without are categorical.<\/p>\n\n\n\n<p>This data set has several continuous variables, such as Order, PID, MS SubClass, and more, but we will focus on Lot Frontage first. That is the amount of space at the front of the property.<\/p>\n\n\n\n<p>The summary statistics already give us some clues:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" width=\"664\" height=\"844\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/summary-stastics.png\" alt=\"\" class=\"wp-image-520529\" style=\"aspect-ratio:0.7867298578199052;width:476px;height:auto\"\/><\/figure>\n\n\n\n<p>There\u2019s a lot of data here, so let\u2019s break it down and explore it to understand it better. Immediately, we can see that we have missing data for this variable; that\u2019s something we want to note, as it might mean we have some issues with the dataset, although we won\u2019t go into that in this blog post!<\/p>\n\n\n\n<p>First, you can see the little histogram in blue in my screenshot, which tells us that we have a positive skew in our data because the data tails off to the right. We can confirm this from the numbers, too, because the <em>mean<\/em> is slightly larger than the <em>median<\/em>. 
That\u2019s not entirely surprising, given we\u2019d expect the majority of lot frontages to be of a similar size, but perhaps there are a small number of luxury properties with much bigger lot frontages that are skewing our data. Given this skew, we would be well advised not to use the standard deviation as a measure of dispersion because it is calculated using all data points, so it\u2019s affected by outliers, which we know we have on one side of our distribution.<\/p>\n\n\n\n<p>Next, we can calculate our <em>interquartile range<\/em> as the difference between our 25th percentile of 58.0 and our 75th percentile of 80.0, giving us an <em>interquartile range<\/em> of 22.0. Alongside the <em>interquartile range<\/em>, it\u2019s helpful to consider the <em>median<\/em>, the middle value in our data, which, unlike the <em>mean<\/em>, is not calculated from every data point. The <em>median<\/em> is more appropriate for Lot Frontage than the <em>mean<\/em> because it\u2019s not affected by the outliers we know we have.<\/p>\n\n\n\n<p>Since we\u2019re talking about the <em>median<\/em> and <em>interquartile range<\/em>, it is worth saying that box plots are a great way to represent these values visually. 
We can ask JetBrains AI Assistant to create one for us with a prompt such as this:<\/p>\n\n\n\n<p><em>Create code using matplotlib for a box plot for <\/em>&#8216;Lot Frontage&#8217;<em>. Assume we have all necessary imports and the data exists.<\/em><\/p>\n\n\n\n<p>Here\u2019s the code that was generated:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">plt.figure(figsize=(10, 6))\nplt.boxplot(df['Lot Frontage'].dropna(), vert=False)\nplt.title('Box Plot of Lot Frontage')\nplt.xlabel('Lot Frontage')\nplt.show()<\/pre>\n\n\n\n<p>When I click <em>Accept and run<\/em>, we get our box plot:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1632\" height=\"1078\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/box-plot.png\" alt=\"\" class=\"wp-image-520542\"\/><\/figure>\n\n\n\n<p>The <em>median<\/em> is the line inside the box, which, as you can see, is slightly to the left, confirming the presence of the positive or right-hand skew. The box plot also makes it very easy to see a noticeable number of outliers to the right of the box, known as \u201cthe tail\u201d. That\u2019s the small number of likely luxury properties that we suspect we have.<\/p>\n\n\n\n<p>It\u2019s important to note that coupling the <em>mean<\/em> and <em>standard deviation<\/em> or the <em>median<\/em> and <em>IQR<\/em> gives you two pieces of information for that data: a measure of central tendency and a measure of dispersion. For determining the central tendency, the <em>mean<\/em> is more prone to being affected by outliers, so it is best when there is no skew in your data, whereas the <em>median<\/em> is more robust in that regard. 
Likewise, for the variation, the <em>standard deviation<\/em> can be affected by outliers in your data. In contrast, the <em>interquartile range<\/em> will always tell you the distribution of the middle 50% of your data. Your goals determine which measurements you want to use.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Categorical variables summary statistics<\/h3>\n\n\n\n<p>When it comes to categorical variables in your data, you can use the summary statistics in PyCharm to find patterns. At this point, we need to be clear that we\u2019re talking about descriptive rather than inferential statistics. That means we can see patterns, but we don\u2019t know if they are significant.<\/p>\n\n\n\n<p>Some examples of categorical data in this data set include MS Zoning, Lot Shape, and House Style. You can gain lots of insights just by looking through your data set. For example, looking at the categorical variable Neighborhood, the majority are stated as <em>Other<\/em> in the summary statistics with 75.8%. This tells you that there might well be a lot of categories in Neighborhood, which is something to bear in mind when we move on to graphs.&nbsp;<\/p>\n\n\n\n<p>As another example, the categorical variable House Style states that about 50% of the houses are one-story, while 30% are two-story, leaving 20% that fall into some other category that you might want to explore in more detail. You can ask JetBrains AI for help here with a prompt like:<\/p>\n\n\n\n<p><em>Write pandas code that tells me all the categories for &#8216;House Style&#8217; in my DataFrame &#8216;df&#8217;, which already exists. 
Assume we have all the necessary imports and that the data exists.<\/em><\/p>\n\n\n\n<p>Here\u2019s the resulting code:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">unique_house_styles = df['House Style'].unique()\n\n\nprint(\"Unique categories for 'House Style':\")\nprint(unique_house_styles)<\/pre>\n\n\n\n<p>When we run that we can see that the remaining 20% is split between various codes that we might want to research more to understand what they mean:<\/p>\n\n\n\n<p>Unique categories for &#8216;House Style&#8217;:<\/p>\n\n\n\n<p><code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">['1Story' '2Story' '1.5Fin' 'SFoyer' 'SLvl' '2.5Unf' '1.5Unf' '2.5Fin']<\/code><\/p>\n\n\n\n<p>Have a look through the data set at your categorical variables and see what insights you can gain!<\/p>\n\n\n\n<p>Before we move on to graphs, I want to touch on one more piece of functionality inside PyCharm that you can use to access your summary statistics called <em>Explain DataFrame<\/em>. 
You can access it by clicking on the purple <em>AI<\/em> icon on the top-right of the DataFrame and then choosing <em>AI Actions <\/em>| <em>Explain DataFrame<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1894\" height=\"936\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/explain-dataframe.png\" alt=\"\" class=\"wp-image-520554\"\/><\/figure>\n\n\n\n<p>JetBrains AI lists out your summary statistics but may also add some code snippets that are helpful for you to get your data journey started, such as how to drop missing values, filter rows based on a condition, select specific columns, as well as group and aggregate data.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Graphs<\/h2>\n\n\n\n<p>Graphs or plots are a way of quickly getting patterns to pop out at you that might not be obvious when you\u2019re looking at the numbers in the summary statistics. We\u2019re going to look at some of the plots you can get PyCharm to generate to help you explore your data.<\/p>\n\n\n\n<p>First, let\u2019s revisit our continuous variable, Lot Frontage. We already learned that we have a positive or right-hand skew from the mini histogram in the summary statistics, but we want to know more!&nbsp;<\/p>\n\n\n\n<p>In your DataFrame in PyCharm, click the <em>Chart View<\/em> icon on the left-hand side:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1999\" height=\"629\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/lot-frontage-index.png\" alt=\"\" class=\"wp-image-520565\"\/><\/figure>\n\n\n\n<p>Now click the cog on the right-hand side of the chart that says <em>Show series settings<\/em> and select the Histogram plot icon on the far right-hand side. 
Click <em>x<\/em> to clear the values in the <em>X axis<\/em> and <em>Y axis<\/em> and then select Lot Frontage with <em>group and sort<\/em> for the <em>X axis<\/em> and Lot Frontage with <em>count<\/em> for the <em>Y axis<\/em>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1908\" height=\"936\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/lot-frontage-histogram.png\" alt=\"\" class=\"wp-image-520576\"\/><\/figure>\n\n\n\n<p>PyCharm generates the same histogram as you see in the summary statistics, but we didn\u2019t have to write a single line of code. We can also explore the histogram and mouse over data points to learn more.<\/p>\n\n\n\n<p>Let\u2019s take it to the next level while we\u2019re here. Perhaps we want to see if the condition of the property, as captured by the Overall Cond variable, predicts the sale price.<\/p>\n\n\n\n<p>Change your <em>X axis<\/em> to SalePrice with <em>group and sort<\/em> and your <em>Y axis<\/em> to SalePrice with <em>count<\/em>, and then add Overall Cond under <em>Groups<\/em>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1912\" height=\"940\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/saleprice-overall-condition.png\" alt=\"\" class=\"wp-image-520587\"\/><\/figure>\n\n\n\n<p>Looking at this chart, we can hypothesize that the overall condition of the property is indeed a predictor of the sale price, as the distribution and skew are remarkably similar. One small note is that grouping histograms like this works best when you have a smaller number of categories. If you change <em>Groups<\/em> to Neighborhood, which we know has many more categories, it becomes much harder to view!<\/p>\n\n\n\n<p>Moving on, let\u2019s stick with PyCharm\u2019s plotting capabilities and explore bar graphs. 
These are a companion to frequency charts such as histograms, but can also be used for categorical data. Perhaps you are interested in Neighborhood (a categorical variable) in relation to SalePrice.<\/p>\n\n\n\n<p>Click the <em>Bar<\/em> [chart] icon on the left-hand side of your series settings, then select Neighborhood as <em>Categories<\/em> and SalePrice with the <em>median<\/em> as the <em>Values<\/em>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" width=\"1999\" height=\"866\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/neighbourhood-saleprice.png\" alt=\"\" class=\"wp-image-520598\" style=\"aspect-ratio:2.308314087759815;width:840px;height:auto\"\/><\/figure>\n\n\n\n<p>This helps us understand the neighborhoods with the most expensive and cheapest housing. I chose the <em>median<\/em> for the SalePrice as it\u2019s less susceptible to outliers in the data. For example, I can see that housing in <em>Mitchel<\/em> is likely to be substantially cheaper than in <em>NoRidge<\/em>.<\/p>\n\n\n\n<p>Line plots are another useful plot for your toolkit. You can use these to demonstrate trends between continuous variables over a period of time. For example, select the <em>Line<\/em> [graph] icon and then choose Year Built as the <em>X axis<\/em> and SalePrice with the <em>mean<\/em> as the <em>Y axis<\/em>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1999\" height=\"728\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/yearbuilt-saleprice.png\" alt=\"\" class=\"wp-image-520609\"\/><\/figure>\n\n\n\n<p>This suggests a small positive correlation between the year the house was built and the price of the house, especially after 1950. 
If you\u2019re feeling adventurous, remove the <em>mean<\/em> from SalePrice and see how your graph changes when it has to plot every single price!<\/p>\n\n\n\n<p>The last type of plot I\u2019d like to draw your attention to is the scatter plot. Scatter plots are a great way to see a relationship between two continuous variables and any correlation between them. A correlation shows the strength of a relationship between two variables. To dig deeper, check out this <a href=\"https:\/\/realpython.com\/numpy-scipy-pandas-correlation-python\/\" target=\"_blank\" rel=\"noopener\">beginner-friendly overview from Real Python<\/a>.<\/p>\n\n\n\n<p>For example, if we set our <em>X axis<\/em> to SalePrice and our <em>Y axis<\/em> to Gr Liv Area, we can see that there is a positive correlation between the two variables, and we can also easily spot some outliers in our data, including a couple of houses with a lower sale price but a huge living area!<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1999\" height=\"726\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2024\/10\/sale-price-grt-living-area.png\" alt=\"\" class=\"wp-image-520620\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>Here\u2019s a reminder of what we\u2019ve covered today. You can access your summary statistics in PyCharm either through Explain DataFrame with JetBrains AI or by clicking on the small graph icon on the right-hand side of a DataFrame called <em>Column statistics<\/em> and then selecting <em>Compact<\/em>. You can also use <em>Detailed<\/em> to get even more information than we\u2019ve covered in this blog post.<\/p>\n\n\n\n<p>You can get PyCharm to create graphs to explore your data and create hypotheses for further investigation. 
Some more commonly used ones are histograms, bar charts, line graphs, and scatter plots.<\/p>\n\n\n\n<p>Finally, you can use <a href=\"https:\/\/www.jetbrains.com\/ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">JetBrains AI Assistant<\/a> to generate code with natural language prompts in the <em>AI<\/em> tool window. This is a quick way to learn more about your data and start thinking about the insights on offer.<\/p>\n\n\n\n<p><a href=\"https:\/\/jb.gg\/i8wlty\" target=\"_blank\" rel=\"noreferrer noopener\">Download PyCharm Professional<\/a> to try it out for yourself! Get an extended trial today and experience the difference PyCharm Professional can make in your data science endeavors. Use the promotion code \u201cPyCharmNotebooks\u201d at checkout to activate your free 60-day subscription to PyCharm Professional. The free subscription is available for individual users only.<\/p>\n\n\n\n<p align=\"center\">\n    <a class=\"jb-download-button\" href=\"https:\/\/jb.gg\/m8p92h\" target=\"_blank\" rel=\"noopener\">      \n        Try PyCharm Professional for free\n    <\/a>\n<\/p>\n\n\n\n<p>Using both summary statistics and graphs in PyCharm, we can learn a lot about our data, giving us a solid foundation for our next step \u2013 cleaning our data, which we will talk about in the next blog post in this 
series.<\/p>\n","protected":false},"author":1150,"featured_media":520647,"comment_status":"closed","ping_status":"closed","template":"","categories":[952],"tags":[8477,566,8101],"cross-post-tag":[],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/pycharm\/520479"}],"collection":[{"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/pycharm"}],"about":[{"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/types\/pycharm"}],"author":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/users\/1150"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/comments?post=520479"}],"version-history":[{"count":9,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/pycharm\/520479\/revisions"}],"predecessor-version":[{"id":666203,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/pycharm\/520479\/revisions\/666203"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/media\/520647"}],"wp:attachment":[{"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/media?parent=520479"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/categories?post=520479"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/tags?post=520479"},{"taxonomy":"cross-post-tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/ja\/wp-json\/wp\/v2\/cross-post-tag?post=520479"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}