Introducing Kotlin for Apache Spark Preview
Apache Spark is an open-source unified analytics engine for large-scale distributed data processing. Over the last few years, it has become one of the most popular tools used for processing large amounts of data. It covers a wide range of tasks – from data batch processing and simple ETL (Extract/Transform/Load) to streaming and machine learning.
Thanks to Kotlin’s interoperability with Java, Kotlin developers can already work with Apache Spark via the Java API. This approach, however, does not let them use Kotlin to its full potential, and the overall experience is far from smooth.
Today, we are happy to share the first preview of the Kotlin API for Apache Spark. This project adds a missing layer of compatibility between Kotlin and Apache Spark. It allows you to write idiomatic Kotlin code using familiar language features such as data classes and lambda expressions.
Kotlin for Apache Spark also extends the existing APIs with a few nice features.
withSpark and withCached functions
withSpark is a simple and elegant way to work with SparkSession that automatically takes care of calling spark.stop() at the end of the block for you.
You can pass parameters to it that may be required to run Spark, such as master location, log level, or app name. It also comes with a convenient set of defaults for running Spark locally.
Here’s a classic example of counting occurrences of letters in lines:
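The sketch below illustrates the idea, assuming the org.jetbrains.kotlinx.spark.api package and helpers such as dsOf from the current Kotlin Spark API distribution; the inline input lines and the app name are illustrative stand-ins for a real text file:

```kotlin
import org.jetbrains.kotlinx.spark.api.*

fun main() {
    // withSpark builds a SparkSession with convenient local defaults
    // and guarantees spark.stop() when the block exits.
    withSpark(appName = "LetterCount") {
        // A real job would read lines with spark.read().textFile(...);
        // a small inline dataset keeps the sketch self-contained.
        dsOf("abc", "bcd", "ab")
            .flatMap { line -> line.map { it.toString() }.iterator() }
            .groupByKey { letter -> letter }
            .count()
            .show()
    }
}
```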
Another useful function is withCached. In other APIs, if you want to fork computations into several paths but compute things only once, you would call the cache method. However, this quickly becomes difficult to track, and you have to remember to unpersist the cached data. Otherwise, you risk taking up more memory than intended, or even breaking things altogether.
withCached takes care of tracking and unpersisting for you.
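A minimal sketch of forking two actions over one cached dataset might look like this, again assuming the org.jetbrains.kotlinx.spark.api package; the numbers are illustrative:

```kotlin
import org.jetbrains.kotlinx.spark.api.*

fun main() {
    withSpark {
        dsOf(1, 2, 3, 4, 5)
            .map { it * it }  // pretend this is an expensive computation
            .withCached {
                // Two actions forked over the same cached dataset;
                // without withCached, each action would recompute the map above.
                println("total: ${count()}")
                println("even squares: ${filter { it % 2 == 0 }.count()}")
            }
        // Leaving the withCached block unpersists the dataset automatically.
    }
}
```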
Kotlin for Spark adds rightJoin and other aliases to the existing join methods; these, however, are null safe by design. Consider a city?.coordinate expression after such a join: the city may be null, because a right join keeps rows from the right side that have no match on the left. In other JVM Spark APIs this would have caused a NullPointerException, and tracing the source of the problem would have been rather difficult. Kotlin for Apache Spark takes care of null safety for you, and you can conveniently filter out null results.
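A null-safe right join might be sketched as follows, assuming the rightJoin, filter, and forEach extensions and the dsOf helper from the Kotlin Spark API; the City and Population data classes and their values are purely illustrative:

```kotlin
import org.jetbrains.kotlinx.spark.api.*

data class City(val id: Int, val name: String, val coordinate: String)
data class Population(val cityId: Int, val count: Long)

fun main() {
    withSpark {
        val cities = dsOf(City(1, "Amsterdam", "52.37,4.89"))
        val populations = dsOf(Population(1, 872_757L), Population(2, 8_900_000L))

        cities
            // rightJoin keeps every Population row; the City side becomes nullable.
            .rightJoin(populations, cities.col("id").equalTo(populations.col("cityId")))
            // city is typed City?, so plain city.coordinate would not even compile;
            // filtering drops the population row that has no matching city.
            .filter { (city, _) -> city != null }
            .forEach { (city, population) -> println("${city?.coordinate}: ${population.count}") }
    }
}
```

Because the left side of each resulting pair is typed as City?, the compiler itself forces the null check instead of deferring the failure to runtime.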
This initial version of Kotlin for Apache Spark supports Apache Spark 3.0 with the core compiled against Scala 2.12.
The API covers all the methods needed for creating self-contained Spark applications best suited for batch ETL.
Getting started with Kotlin for Apache Spark
To help you quickly get started with Kotlin for Apache Spark, we have prepared a Quick Start Guide that will help you set up the environment, correctly define dependencies for your project, and run your first self-contained Spark application written in Kotlin.
We understand that it takes a while to migrate an existing codebase to a newer framework version, and Spark is no exception. That is why the next update will add support for the earlier Spark versions 2.4.2 – 2.4.6.
We are also working on the Kotlin Spark shell so that you can enjoy working with your data in an interactive manner, and perform exploratory data analysis with it.
Currently, Spark Streaming and Spark MLlib are not covered by this API, but we will be closely listening to your feedback and will address it in our roadmap accordingly.
In the future, we hope to see Kotlin join the official Apache Spark project as a first-class citizen. We believe that it can add value both for Kotlin, and for the Spark community. That is why we have opened a Spark Project Improvement Proposal: Kotlin support for Apache Spark. We encourage you to voice your opinions and join the discussion.