Redshift is Amazon Web Services’ data warehousing solution. They’ve extended PostgreSQL to better suit large datasets used for analysis. When you hear about this kind of technology as a Python developer, it just makes sense to then unleash Pandas on it. So let’s have a look to see how we can analyze data in Redshift using a Pandas script!
Setting up Redshift
If you haven’t used Redshift before, you should be able to get the cluster up for free for 2 months. As long as you make sure that you don’t use more than 1 instance, and you use the smallest available instance.
To play around, let’s use Amazon’s example dataset, and to keep things very simple, let’s only load the ‘users’ table. Configuring AWS is a complex subject, and they’re a lot better at explaining how to do it than we are, so please complete the first four steps of the AWS tutorial for setting up an example Redshift environment. We’ll use PyCharm Professional Edition as the SQL client.
Connecting to Redshift
After spinning up Redshift, you can connect PyCharm Professional to it by heading over to the database tool window (View | Tool Windows | Database), then use the green ‘+’ button, and select Redshift as the data source type. Then fill in the information for your instance:
Make sure that when you click the ‘test connection’ button you get a ‘connection successful’ notification. If you don’t, make sure that you’ve correctly configured your Redshift cluster’s VPC to allow connections from 0.0.0.0/0 on port 5439.
Now that we’ve connected PyCharm to the Redshift cluster, we can create the tables for Amazon’s example data. Copy the first code listing from here, and paste it into the SQL console that was opened in PyCharm when you connected to the database. Then execute it by pressing Ctrl + Enter, when PyCharm asks which query to execute, make sure to select the full listing. Afterward, you should see all the tables in the database tool window:
To load the sample data, go back to the query window, and use the Redshift ‘load’ command to load data from an Amazon S3 bucket into the database:
The IAM role identifier should be the identifier for the IAM role you’ve created for your Redshift cluster in the second step in the Amazon tutorial. If everything goes right, you should have about 50,000 rows of data in your users table after the command completes.
Loading Redshift Data into a Pandas Dataframe
So let’s get started with the Python code! In our example we’ll use Pandas, Matplotlib, and Seaborn. The easiest way to get all of these installed is by using Anaconda, get the Python 3 version from their website. After installing, we need to choose Anaconda as our project interpreter:
If you can’t find Anaconda in the dropdown, you can click the settings “gear” button, and then select ‘Add Local’ and find your Anaconda installation on your disk. We’re using the root Anaconda environment without Conda, as we will depend on several scientific libraries which are complicated to correctly install in Conda environments.
Pandas relies on SQLAlchemy to load data from an SQL data source. So let’s use the PyCharm package manager to install sqlalchemy: use the green ‘+’ button next to the package list and find the package. To make SQLAlchemy work well with Redshift, we’ll need to install both the postgres driver, and the Redshift additions. For postgres, you can use the PyCharm package manager to install psycopg2. Then we need to install sqlalchemy-redshift to teach SQLAlchemy the specifics of working with a Redshift cluster. This package is unfortunately not available in the default Anaconda repository, so we’ll need to add a custom repository.
To add a custom repository click the ‘Manage Repositories’ button, and then use the green ‘+’ icon to add the ‘conda-forge’ channel. Afterwards, close the ‘Manage Repositories’ screen, and install sqlalchemy-redshift. Now that we’ve done that, we can start coding!
To show how it’s done, let’s analyze something simple in Amazon’s dataset, the users dataset holds fictional users, and then indicates for every user if they like certain types of entertainment. Let’s see if there’s any correlation between the types of entertainment.
Let’s open a new Python file, and start our analysis. At first, we need to load our data. Redshift is accessed just like a regular PostgreSQL database, just with a slightly different connection string to use the redshift driver:
Also note that Redshift by default listens on port 5439, rather than Postgres’ 5432.
After we’ve connected we can use Pandas’ standard way to load data from an SQL database:
likesports as sports,
liketheatre as theater,
likeconcerts as concerts,
likejazz as jazz,
likeclassical as classical,
likeopera as opera,
likerock as rock,
likevegas as vegas,
likebroadway as broadway,
likemusicals as musicals
The dataset holds users’ preferences as False, None, or True. Let’s interpret this as True being a ‘like’, None being ambivalent, and False being a dislike. To make a correlation possible, we should convert this into numeric values:
# Map dataframe to have 1 for 'True', 0 for null, and -1 for False
And now we’re ready to calculate the correlation matrix, and present it. To present it we’ll use Seaborn’s heatmap. We’ll also create a mask to only show the bottom half of the correlation matrix (the top half mirrors the bottom).
# Calculate correlations
After running this code, we can see that there are no correlations in the dataset:
Which is strong evidence for Amazon’s sample dataset being a sample dataset. QED.
Fortunately, PyCharm also works great with real datasets in Redshift. Let us know in the comments what data you’re interested in analyzing!
Last June (2016) JetBrains PyCharm partnered with the Django Software Foundation to generate a big boost to Django fundraising. The campaign was a huge success. Together we raised a total of $50,000 for the Django Software Foundation!
This year we hope to repeat that success. During the two-week campaign, buy a new PyCharm Professional Edition individual license with a 30% discount code, and all the money raised will go to the DSF’s general fundraising and the Django Fellowship program.
Up until Aug 28th, you can effectively donate to Django by purchasing a New Individual PyCharm Professional annual subscription at 30% off. It’s very simple:
1. When buying a new annual PyCharm subscription in our e-store, on the checkout page, сlick “Have a discount code?”.
2. Enter the following 30% discount promo code: ISUPPORTDJANGO Alternatively, just click this shortcut link to go to the e-store with the code automatically applied
3. Fill in the other required fields on the page and click the “Place order” button.
All of the income from this promotion code will go to the DSF fundraising campaign 2017 – not just the profits, but actually the entire sales amount including taxes, transaction fees – everything. The campaign will help the DSF to maintain the healthy state of the Django project and help them continue contributing to their different outreach and diversity programs.
“Django has grown to be a world-class web framework, and coupled with PyCharm’s Django support, we can give tremendous developer productivity,” says Frank Wiles, DSF President. “Last year JetBrains was a great partner for us in support of raising money for the Django Software Foundation, on behalf of the community, I would like to extend our deepest thanks for their generous help. Together we hope to make this a yearly event!”
Security is important, and that’s why we all use key pairs when authenticating to our servers. And as is best practice, we also protect our private key with a passphrase, don’t we? Ideally the passphrase is long and complicated, and therefore a massive pain to type.
So how can we be lazy while still maintaining security? SSH Agent is the answer! Rather than typing the passphrase to your key individually for every connection, we can type it only once when we turn the computer on, and then SSH agent will keep the decrypted key safe in memory.
Let’s have a look how to do this on Windows, Mac, and Linux. Feel free to skip ahead to the section that’s relevant for your OS. At the very end I’ll show how to setup PyCharm to connect using the key in SSH Agent, which is the same on all platforms.
The default tool for SSH connections on Windows is PuTTY. Let’s walk through generating a key and then using it with PuTTY.
When you download PuTTY, you don’t just get PuTTY, you get various other tools as well. We’ll use one of those to generate a key to use with SSH later. Open PuTTYgen, and click the generate button to start creating a key:
After simulating an earthquake with your mouse, you should have a shiny new SSH key.
Give it a secure passphrase, and then save the private key somewhere safe. At this point you should copy the public key to the ~/.ssh/authorized_keys file on the computer you’d like to connect to. If you’d like a computer to try this with, you can create a VM using the Vagrantfile here. If you’re using the VM, choose ‘Save public key’ to overwrite key.pub in the files subfolder, and then run vagrant up to create the virtual machine.
Now that we have a Linux machine with the public key installed, let’s connect to it. To manually use the key, we can configure PuTTY to connect to our box. We will connect to localhost, on port 2222 (the port that Vagrant exposes SSH to by default). And then in the menu on the left select SSH, and then Auth. In the settings you can configure a key file to use:
When you click open now, you should be prompted for the username you’d like to log in as. If you’re using the sample box, the username is ‘example’. At this point you’ll need to enter the passphrase for the private key:
If you right-click the title bar, and choose Duplicate Session, you’ll need to re-enter the passphrase. Let’s see if we can make this a little more efficient.
A tool that comes with PuTTY is Pageant. What this does is hold your private key, and as long as it’s running, it can provide the unlocked private key to PuTTY (and other tools like PyCharm). After you start Pageant, you may notice that nothing seems to happen. Pageant lives in the Windows task bar on the bottom right of your screen. Right-click Pageant’s icon, and you can choose ‘Add Key’ to add your key to Pageant:
After finding your key on the filesystem, you’ll need to enter the key’s passphrase. However, after this is done, you won’t need to enter it again until Pageant is stopped (for example, when you shutdown your computer).
So when we duplicate our session, we can see that we don’t need to enter a passphrase anymore:
Although it’s possible to configure Pageant to start with Windows, it might be more interesting to use another tool: KeeAgent. If you’re already using KeePass to store your passwords, you can use the KeeAgent plugin to make KeePass work just like Pageant. And then when you unlock your KeePass workspace, your private key will automatically be unlocked as well. Read how to set it up here.
To create a key on macOS, open a Terminal window, and run:
ssh-keygen-trsa-f<name of file>
Make sure to add a passphrase for your key.
After this finishes you’ll have both the private key file, and the public key file (with a .pub extension). Let’s get a Linux computer to try our new key with. Clone this repository, copy the .pub file over key.pub in the files subdirectory, and then run vagrant up to create a VM (you’ll need Vagrant for this).
When the VM has started, we can use ssh to connect to it:
We need to disable host key checking, as otherwise SSH will see a mismatch between your computer’s key and the VM’s key. When you run this command, you’ll be prompted for the key’s passphrase. So let’s see if we can get rid of this prompt.
macOS comes with a Keychain application which stores your passwords in the background. To see the stored passwords, use the Keychain Access application. We can store the passphrase there by using:
After adding it to your Keychain, you should see that whenever you run the SSH command again, it’ll take the passphrase from your Keychain.
On Linux you can use the ssh-add command, after running it once and entering your passphrase, you won’t have to re-enter your passphrase until you log out:
Without any parameters this will add the key in ~/.ssh/id_rsa. If you get an error ‘Could not open a connection to your authentication agent’, start ssh-agent by running: eval $(ssh-agent -s). To make this easier, you could add it to your ~/.bashrc, or see if your distribution or desktop environment has a tool that support SSH Agent. For example GNOME Keyring supports ssh-agent.
Using Your Key in PyCharm Professional Edition
After storing your key in SSH Agent (or a compatible tool), you can connect to a Python interpreter on a remote machine using that key for authentication. Please keep in mind that remote interpreters are only available in PyCharm Professional Edition. So let’s open up PyCharm, and go to the interpreter settings (File | Settings | Project Interpreter). Then use the gear icon, and choose ‘Add Remote’.
Then choose “SSH Credentials”, enter the hostname, port, and username. By setting authentication method to ‘Authentication agent’ you’re instructing PyCharm to take the credentials from SSH Agent or a compatible tool.
Finally, just click OK to connect, and you’ll be using the Python interpreter on the machine or VM you’re connecting to.
Yesterday Paul Craven, computer science professor and creator of the Arcade library for Python 2d games, gave a webinar about teaching Python to university students using game writing. The recording is now available:
During the webinar, Paul discussed his techniques for teaching introductory programming, an approach which he has refined over the years based on seeing what works and what doesn’t work. He covered:
The iterative process of producing teaching materials
Having a good toolchain (PyCharm for writing Sphinx docs, GitHub, ReadTheDocs, etc.)
How PyCharm’s IDE features impacted the quality of what students did (PEP 8, function completion, spell checking, etc.)
The effect Python 3.5/3.6 type hinting had on the learning/teaching experience
How games as a topic made learning fun, and how he adjusted the game library to make it more teachable
Teaching self-sufficiency by learning to browse APIs
The importance of already-working game examples
Paul then covered a bit of Arcade and the kinds of games it makes easy to write.
The webinar lasted a bit over an hour and had one of our highest number of questions posed during the session. We’ll try to get Paul back again in the future and go more in depth on game writing.
If you have any questions or comments about the webinar, feel free to leave them in the comments below, or you can reach us on Twitter. Paul Craven is on Twitter as well, his Twitter handle is @professorcraven.
By popular demand, PyCharm 2017.2 Professional Edition expands its Docker Compose support to those of you who run Windows. Let’s take a look and see how this works!
In our last Docker Compose post, we created a guest book in Flask. This time we’ll take a simple todo app in Django, and dockerize it. The starting point today will be a Django todo app which works locally, see the code on GitHub.
Setting Up Docker on Windows
If you don’t have Docker installed yet, you’ll need to make a decision about which version to install:
Are you using anything other than Windows 10 Pro or Enterprise, or do you have Virtualbox, VMware, or anything other than Hyper-V installed: get Docker Toolbox and Virtualbox.
If you’re on Windows 10 Pro or Enterprise, and you have either Hyper-V or no virtualization software installed: get Docker for Windows.
The reason for this is that Docker for Windows is based on Microsoft’s Hyper-V virtualization technology. Hyper-V is a seriously cool bit of tech that wraps your Windows in a hypervisor, rather than installing a hypervisor within Windows. What this means is that effectively you’ll be using a VM when you’re using your computer. Hypervisors are unable to run on a VM, so when you enable Hyper-V on Windows, you can’t run any other VM software anymore.
Setting Up Docker Toolbox
If you installed Docker for Windows, you can skip this section.
Docker Toolbox works by redirecting all your Docker commands to a Docker instance running either on a local VM, or on a cloud service. Today, let’s set up a Virtualbox VM on our local computer. Run with a cmd window:
Let’s verify that it works by connecting our command-line Docker. To do so, we need to run this cryptic looking command in cmd:
To see what it does, run docker-machine env default --shell=cmd, it will output several commands to set environment variables that configure Docker and Docker Compose. The long command above simply runs these all.
At this point, if you run docker run hello-world you should see a cheerful message that confirms that everything works:
Running Django in Docker
For our Django app we’ll need to create two containers: a database container, and a container which holds our actual application. We’ll use Docker Compose to link the containers together.
Let’s get started with writing our Dockerfile:
# By copying over requirements first, we make sure that Docker will cache
# our installed requirements rather than reinstall them on every build
This is enough configuration to start Django, however, before we can proceed we should also make sure that we configure our database. For this we should write a compose file where we add both our Django service, and our postgres service:
The postgres image is easily configured with environment variables, for details read the image’s page on Docker hub. In this case we’re only setting the password, leaving the defaults for username and database. If you want to persist data when the container is destroyed, you’d need to create a named volume for the /var/lib/postgresql/data folder.
After adding these files, let’s just head over to our Django settings.py to configure our new database credentials:
In a Docker Compose project, you can connect to linked containers by their service name unless you’ve specified an alias in the link section of the compose file. In this case we wrote:
Therefore we should tell Django to look for the db host. Although I’m hardcoding it here for simplicity, ideally you’d get this configuration from environment variables.
This is all the configuration we need to do, and we can get started with building our images now.
Let’s first let PyCharm know where to find Docker. Go to Settings | Build, Execution, Deployment | Docker, and make sure that your Docker is configured. If you’re using Docker for Windows, and there’s no Docker listed, just click the green ‘+’ icon, and the defaults should be correct. If you’re using Docker machine, select the ‘Docker Machine’ radio button, and select the correct machine in the dropdown:
After that’s set up, we can go and add our Docker run configuration, go to the ‘Edit Run Configurations’ screen, and add a Docker Deployment run configuration.
Let’s name it Rebuild Images, and in the Deployment field, select the compose file:
Now when we run this configuration, we should see that all the layers are pulled from Docker hub, and both the database and Django are started.
Setting up the Python Remote Docker Interpreter
Now to make sure that we can debug our Django project, let’s configure PyCharm to use the Python interpreter within our Docker container. To do so go to Settings | Project Interpreter, and use the gear icon to select Add Remote:
Choose the Docker Compose interpreter type, and make sure the docker-compose.yml file is selected. The service you choose under ‘service’ is the service you want to debug with this run configuration, when you start it, all services will still be started either way. As the only Python service is ‘web’, let’s select that here:
Afterwards you should see that PyCharm detected the packages we configured in requirements.txt, and the path mappings:
Now we can add a normal Django server run configuration, just make sure to set host to ‘0.0.0.0’ to make sure that we listen to requests coming from outside the Docker container.
Now first run the migrations, by going to Tools | Run manage.py task, and then writing migrate. After this command has completed, we can use the regular run and debug icons in PyCharm to run and debug our Django project. So let’s run it!
To see our Django application in the browser, go to http://localhost:8000 if you’re using Docker for Windows. If you’re using Docker Machine, we’ll first need to check on which IP our Docker Machine is running, run docker-machine ip default on the command line. In my case this is 192.168.99.100, so I’ll go to http://192.168.99.100:8000 in the browser.
Now if you see a message “DisallowedHost at /”, go to Django’s settings.py, and find ALLOWED_HOSTS. During development we can change this to: ALLOWED_HOSTS = [‘*’] to disable this check. Please make sure you appropriately configure it when running in production however.
When everything works, we can add a breakpoint, and debug as usual:
PyCharm Edu 4 is out now! This update brings a better user experience to both learners and educators, making the product’s use as simple as possible, whether it is used for learning, or for teaching.
First of all, we’ve changed the welcoming UI. Now you begin by choosing your role, Learner or Educator. Depending on your choice, you get access to the courses you can join as a learner and can practice with the help of simple and effective “fill in the missing code” exercises. Or, you can create your own code practice tasks and integrated tests as an educator.
With PyCharm Edu 4, learners can now easily choose the course to join and start learning thanks to the new Welcoming UI.
We’ve added some small improvements for editing Python files: a quick fix to change the signature of a function you’re calling, an inspection to make sure your Python strings formatted with str.format() work correctly, and auto-completion for type hints
Docker Compose is additionally supported on Windows (this feature is available only in PyCharm Professional Edition)
PyCharm 2017.2 supports using SSH Agent to handle your SSH private keys. Compatible tools like Pageant on Windows are also supported. (only in Professional Edition)
Database tools fully support connecting to Amazon Redshift and Azure Databases (only in Professional Edition)
Run inline SQL on multiple data source (only in Professional Edition)