Update on Big Data Tools Plugin: Spark, HDFS, Parquet and More

Andrey Cheptsov

It’s been a while since our last update. Last year, we announced IntelliJ IDEA’s integration with Apache Zeppelin and S3, along with experimental support for Apache Spark, which was only available in the unstable update channel. But we have great news: today we’re releasing a new version of the plugin that finally makes Spark support publicly available. It also adds support for HDFS and Parquet.

Spark Monitoring

Now that the Spark integration is available in the public update, let us quickly catch you up on what it can do for you.

To monitor your Spark jobs, all you have to do is go to the Big Data Tools Connections settings and add the URL of your Spark History Server:
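Note that the History Server only lists applications that write event logs. If your applications don’t show up, event logging may be disabled on your cluster; it is typically enabled via `spark-defaults.conf`, along the lines of the sketch below (the HDFS log directory is a hypothetical example — use whatever path your cluster is configured with):

```properties
# Make Spark applications write event logs that the History Server can read.
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs:///spark-logs
# Point the History Server at the same directory.
spark.history.fs.logDirectory   hdfs:///spark-logs
```

After this, applications that finish (or are running) should appear in the History Server UI and, in turn, in the Spark tool window.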

Once you’ve done that, close the settings and open the Spark tool window in the bottom-right corner of the IDE. The Spark tool window displays the list of completed and running Spark applications (on the Applications tab, which is collapsed by default), along with their jobs, stages, and tasks.

By clicking the Executors tab, you’ll see information about the active and inactive executors:

At the moment, the SQL tab shows a list of recent queries, but it doesn’t yet include the actual SQL. Additionally, if you are using Kerberos with Spark, the IDE might not allow you to connect to the server. We’re working on fixing this in one of the upcoming updates. If you use Kerberos, please let us know so we can prioritize this task over others.

HDFS

Similar to the S3 support that we introduced in December, the plugin now allows you to connect to your HDFS servers to explore and manage your files from the IDE. To enable this feature, just go to the Big Data Tools Connections settings and add an HDFS configuration:

Currently, you have to specify the root path and the way to connect to the server: either a Configuration Files Directory or an Explicit URI.
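To give a rough idea of the two options: an explicit URI is typically the address of your NameNode (the host name and port below are hypothetical examples — check your cluster’s settings for the real values):

```
hdfs://namenode.example.com:8020
```

With the Configuration Files Directory option, you instead point the plugin at a directory containing your cluster’s Hadoop configuration files (typically `core-site.xml` and `hdfs-site.xml`), and the connection details are read from there.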

Once you’ve configured HDFS servers, you’ll see them appear in the Big Data Tools tool window (next to your Apache Zeppelin notebooks and S3 buckets, if you’ve configured any, of course):

The Big Data Tools tool window displays the files and folders that are stored in the configured servers. As is the case for S3, the CSV and Parquet files in HDFS can be expanded in the tree to show their file schemas. The context menu invoked on any file or folder provides a variety of actions:

These options allow you to manage files, copy them to your local machine, or preview them in the editor. Previewing lets you see the first chunk of a file’s content without fully copying it to your machine.

Parquet

As mentioned above, this update introduces initial support for Parquet files. Now you can open any Parquet file in the IDE and view its content as a table:

When opening Parquet files, the plugin displays only the first portion of the content rather than the whole file. This is especially useful when you work with very large files.

Note that, just as with Spark, you need direct network access to the servers in order to access the files. This means that if your servers are behind an SSH tunnel, you currently have to establish the tunnel yourself. If you experience any issues or inconveniences when accessing your files, please let us know; otherwise, we might not be aware of specific scenarios that aren’t yet supported. The sooner you provide your feedback, the better!
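One common way to establish such a tunnel yourself is with SSH local port forwarding. The sketch below shows a hypothetical `~/.ssh/config` fragment (all host names, ports, and the user name are assumptions — substitute your own):

```
# ~/.ssh/config — hypothetical hosts and ports
Host hadoop-gateway
    HostName gateway.example.com
    User youruser
    # Forward the Spark History Server and HDFS NameNode ports to localhost
    LocalForward 18080 history-server.internal:18080
    LocalForward 8020  namenode.internal:8020
```

After running `ssh -N hadoop-gateway`, you would point the plugin at `localhost:18080` and `hdfs://localhost:8020` instead of the internal addresses.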

That’s it for today. As you might have noticed, up until now we’ve published our updates in the Scala blog, and this is the first update published in the IntelliJ IDEA blog. We’re making this change because the plugin no longer merely offers Apache Zeppelin and Scala support; it now integrates a much wider variety of tools for working with big data.

To see the complete list of bug fixes in this update, please refer to the release notes.

And last but not least, if you need help using any feature of the plugin, make sure to check out the documentation. Still need help? Please don’t hesitate to leave us a message either here in the comments or on Twitter.

P.S.: Because the plugin is still at an early stage of development, its many integrations may not yet cover every scenario. This is why, at this point in time, we’re relying heavily on your feedback. If you find that an important user scenario (e.g. a certain authorization type, or some other specifics) is not supported, please let us know here in the comments, in the issue tracker, or in our feedback survey.