Kotlin

A concise multiplatform language developed by JetBrains

Introducing Kotlin for Apache Spark Preview

Maria Khalusova

Apache Spark is an open-source unified analytics engine for large-scale distributed data processing. Over the last few years, it has become one of the most popular tools used for processing large amounts of data. It covers a wide range of tasks – from data batch processing and simple ETL (Extract/Transform/Load) to streaming and machine learning.

Thanks to Kotlin’s interoperability with Java, Kotlin developers can already work with Apache Spark via the Java API. However, this way they cannot use Kotlin to its full potential, and the overall experience is far from smooth.

Today, we are happy to share the first preview of the Kotlin API for Apache Spark. This project adds a missing layer of compatibility between Kotlin and Apache Spark. It allows you to write idiomatic Kotlin code using familiar language features such as data classes and lambda expressions.

Kotlin for Apache Spark also extends the existing APIs with a few nice features.

withSpark and withCached functions

withSpark is a simple and elegant way to work with SparkSession: it automatically calls spark.stop() for you at the end of the block. You can pass it parameters that may be required to run Spark, such as the master location, log level, or app name. It also comes with a convenient set of defaults for running Spark locally.

Here’s a classic example that counts the lines containing particular letters:

val logFile = "a/path/to/logFile.txt"
withSpark(master = "yarn", logLevel = SparkLogLevel.DEBUG) {
    // The dataset stays cached inside this block and is unpersisted afterwards
    spark.read().textFile(logFile).withCached {
        // Both counts reuse the same cached data
        val numAs = filter { it.contains("a") }.count()
        val numBs = filter { it.contains("b") }.count()
        println("Lines with a: $numAs, lines with b: $numBs")
    }
}

Another useful function in the example above is withCached. In other APIs, if you want to fork a computation into several paths but compute the shared part only once, you call the cache method. However, cached data quickly becomes difficult to track, and you have to remember to unpersist it. Otherwise, you risk taking up more memory than intended or even breaking things altogether. withCached takes care of tracking and unpersisting for you.
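Both withSpark and withCached rely on the same idea: run a block, then release the resource even if the block throws. The sketch below illustrates that idea in plain Kotlin, without a Spark dependency. FakeSession, FakeDataset, withFakeSession, and withFakeCached are hypothetical stand-ins invented for this illustration. It is not how the library itself is implemented.

```kotlin
// Hypothetical stand-in for a session that must be stopped after use.
class FakeSession {
    var stopped = false
        private set

    fun stop() { stopped = true }
}

// Hypothetical stand-in for a dataset that can be cached and must be unpersisted.
class FakeDataset(private val lines: List<String>) {
    var cached = false
        private set

    fun cache(): FakeDataset { cached = true; return this }
    fun unpersist() { cached = false }
    fun countLinesContaining(text: String): Int = lines.count { it.contains(text) }
}

// Runs the block with the session as receiver and always stops the session afterwards.
fun <R> withFakeSession(session: FakeSession, block: FakeSession.() -> R): R =
    try {
        session.block()
    } finally {
        session.stop()
    }

// Caches the dataset, runs the block, and always unpersists afterwards.
fun <R> FakeDataset.withFakeCached(block: FakeDataset.() -> R): R {
    val cachedData = cache()
    return try {
        cachedData.block()
    } finally {
        cachedData.unpersist()
    }
}

fun main() {
    val session = FakeSession()
    val data = FakeDataset(listOf("alpha", "beta", "xyz"))

    val (numAs, numBs) = withFakeSession(session) {
        data.withFakeCached {
            // Two computations fork from the same cached data.
            countLinesContaining("a") to countLinesContaining("b")
        }
    }

    println("Lines with a: $numAs, lines with b: $numBs")
    println("Session stopped: ${session.stopped}, still cached: ${data.cached}")
}
```

Because the cleanup sits in a finally block, the caller never has to remember to call stop() or unpersist(), which is the convenience the functions above are described as providing.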

Null safety

Kotlin for Apache Spark adds leftJoin, rightJoin, and other aliases for the existing join methods. Unlike the originals, these aliases are null safe by design.


fun main() {

   data class Coordinate(val lon: Double, val lat: Double)
   data class City(val name: String, val coordinate: Coordinate)
   data class CityPopulation(val city: String, val population: Long)

   withSpark(appName = "Find biggest cities to visit") {
       val citiesWithCoordinates = dsOf(
               City("Moscow", Coordinate(37.6155600, 55.7522200)),
               // ...
       )

       val populations = dsOf(
               CityPopulation("Moscow", 11_503_501L),
               // ...
       )
       citiesWithCoordinates.rightJoin(populations, citiesWithCoordinates.col("name") `==` populations.col("city"))
               .filter { (_, citiesPopulation) ->
                   citiesPopulation.population > 15_000_000L
               }
               .map { (city, _) ->
                   // A city may potentially be null in this right join!!!
                   city?.coordinate
               }
               .filterNotNull()
               .show()
   }
}

Note the city?.coordinate line in the example above. A city may potentially be null in this right join. This would’ve caused a NullPointerException in other JVM Spark APIs, and it would’ve been rather difficult to debug the source of the problem.
Kotlin for Apache Spark takes care of null safety for you and you can conveniently filter out null results.
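To see why the nullable type matters, here is a minimal sketch of right-join semantics on plain Kotlin lists, with no Spark involved. The rightJoinByName helper and the "Atlantis" row are made up for illustration, and the helper assumes at most one city per name. It only mirrors the shape of the result, not the library's implementation.

```kotlin
data class Coordinate(val lon: Double, val lat: Double)
data class City(val name: String, val coordinate: Coordinate)
data class CityPopulation(val city: String, val population: Long)

// Hypothetical in-memory right join: every right-hand row is kept,
// and the left side is null when no city matches.
// Simplification: assumes at most one city per name.
fun rightJoinByName(
    cities: List<City>,
    populations: List<CityPopulation>
): List<Pair<City?, CityPopulation>> {
    val citiesByName = cities.associateBy { it.name }
    return populations.map { population -> citiesByName[population.city] to population }
}

fun main() {
    val cities = listOf(City("Moscow", Coordinate(37.6155600, 55.7522200)))
    val populations = listOf(
        CityPopulation("Moscow", 11_503_501L),
        CityPopulation("Atlantis", 20_000_000L) // made-up row with no matching city
    )

    val coordinates = rightJoinByName(cities, populations)
        .filter { (_, population) -> population.population > 10_000_000L }
        // city has type City?, so the compiler rejects city.coordinate without ?.
        .mapNotNull { (city, _) -> city?.coordinate }

    println(coordinates)
}
```

Since the left side of each pair is declared as City?, forgetting the safe call is a compile-time error rather than a NullPointerException at run time.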

What’s supported

This initial version of Kotlin for Apache Spark supports Apache Spark 3.0 with the core compiled against Scala 2.12.

The API covers all the methods needed for creating self-contained Spark applications best suited for batch ETL.

Getting started with Kotlin for Apache Spark

To help you get started quickly with Kotlin for Apache Spark, we have prepared a Quick Start Guide that walks you through setting up the environment, correctly defining dependencies for your project, and running your first self-contained Spark application written in Kotlin.

What’s next

We understand that it takes a while to upgrade any existing framework to a newer version, and Spark is no exception. That is why, in the next update, we are going to add support for earlier Spark versions: 2.4.2 – 2.4.6.

We are also working on the Kotlin Spark shell so that you can enjoy working with your data in an interactive manner, and perform exploratory data analysis with it.

Currently, Spark Streaming and Spark MLlib are not covered by this API, but we will be closely listening to your feedback and will address it in our roadmap accordingly.

In the future, we hope to see Kotlin join the official Apache Spark project as a first-class citizen. We believe that it can add value both for Kotlin and for the Spark community. That is why we have opened a Spark Project Improvement Proposal: Kotlin support for Apache Spark. We encourage you to voice your opinions and join the discussion.

Go ahead and try Kotlin for Apache Spark and let us know what you think!
