Kotlin for Apache Spark: One Step Closer to Your Production Cluster
We have released Preview 2 of Kotlin for Apache Spark. First of all, we'd like to thank the community for providing us with all their feedback and even some pull requests! Now let's take a look at what we have implemented since Preview 1.
Scala 2.11 and Spark 2.4.1+ support
The main change in this preview is the introduction of Spark 2.4 and Scala 2.11 support. This means that you can now run jobs written in Kotlin in your production environment.
The syntax remains the same as the Apache Spark 3.0 compatible version, but the installation procedure differs a bit. You can now use sbt to add the correct dependency, for example:
libraryDependencies += "org.jetbrains.kotlinx.spark" %% "kotlin-spark-api-2.4" % "1.0.0-preview2"
Of course, for other build systems, such as Maven, we would use the standard:
<dependency>
  <groupId>org.jetbrains.kotlinx.spark</groupId>
  <artifactId>kotlin-spark-api-2.4_2.11</artifactId>
  <version>1.0.0-preview2</version>
</dependency>
And for Gradle we would use:
implementation 'org.jetbrains.kotlinx.spark:kotlin-spark-api-2.4_2.11:1.0.0-preview2'
Spark 3 only supports Scala 2.12 for now, so there is no need to specify the Scala version for the Spark 3 artifact.
You can read more about dependencies here.
Support for a custom SparkSession.Builder
Imagine you have a company-wide SparkSession.Builder that is set up to work with your YARN/Mesos/etc. cluster. You don't want to recreate it in every job, and the "withSpark" function doesn't give you enough flexibility. Before Preview 2, you had only one option: rewrite the whole builder from Scala to Kotlin. Now you have another, because the "withSpark" function accepts a SparkSession.Builder as an argument:
val builder = // obtain your builder here

withSpark(builder, logLevel = DEBUG) {
    // your spark is initialized with the correct config here
}
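For a fuller picture, here is a minimal, self-contained sketch. The builder options and the dataset are placeholders for illustration; in a real setup the builder would come from your shared configuration code, and we assume the wildcard import brings in withSpark, SparkLogLevel, and dsOf from the Kotlin Spark API:

import org.apache.spark.sql.SparkSession
import org.jetbrains.kotlinx.spark.api.*

fun main() {
    // A stand-in for a company-wide, pre-configured builder
    // (in practice this would carry your cluster master and shared options).
    val builder = SparkSession.builder()
        .appName("CompanyWideJob")
        .config("spark.some.shared.option", "value")

    // withSpark takes the pre-configured builder and manages the session lifecycle.
    withSpark(builder, logLevel = SparkLogLevel.DEBUG) {
        dsOf(1, 2, 3)
            .map { it * it }
            .show()
    }
}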
Broadcast variable support
Sometimes we need to share data between executor nodes, for example to provide each executor with dictionary-like data without shipping it to every node individually. Apache Spark lets you do this with broadcast variables. You were previously able to achieve this using the "encoder" function, but we aim to provide an API experience that is as consistent as possible with the Scala API, so we've added explicit support for broadcast variables. For example:
// Broadcast data should implement Serializable
data class SomeClass(val a: IntArray, val b: Int) : Serializable

// ... some other code skipped ...

val receivedBroadcast = broadcast.value // the broadcast value is available
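To show how this fits together end to end, here is a rough sketch. We are assuming here that the broadcast variable is created via a spark.broadcast(...) helper available inside withSpark, and the data is made up for illustration:

import java.io.Serializable
import org.jetbrains.kotlinx.spark.api.*

// Broadcast data should implement Serializable
data class SomeClass(val a: IntArray, val b: Int) : Serializable

fun main() = withSpark {
    // Created once on the driver; spark.broadcast(...) is our assumption
    // for the helper added in Preview 2.
    val broadcast = spark.broadcast(SomeClass(a = intArrayOf(5, 6), b = 3))

    dsOf(1, 2, 3)
        .map { it + broadcast.value.b } // the broadcast value is read on the executors
        .show()
}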
We’d like to thank Jolan Rensen for the contribution.
Primitive array support for Spark 3
If you use arrays of primitives in your data pipeline, you may be glad to know that they now work in Kotlin for Apache Spark. Arrays of data (or Java bean) classes were supported in Preview 1, but we've only just added support for primitive arrays in Preview 2. Please note that this works correctly only if you use the dedicated Kotlin primitive array types, such as IntArray, BooleanArray, and LongArray. In other cases, we propose using lists or alternative collection types.
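As a quick illustration (the class and values here are made up), a data class holding an IntArray can be used directly in a typed Dataset:

import org.jetbrains.kotlinx.spark.api.*

// A data class with a Kotlin primitive array field (IntArray)
data class Measurements(val id: Int, val samples: IntArray)

fun main() = withSpark {
    dsOf(
        Measurements(1, intArrayOf(1, 2, 3)),
        Measurements(2, intArrayOf(4, 5, 6))
    )
        .map { it.id to it.samples.sum() } // the IntArray field is encoded and read back correctly
        .show()
}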
Future work
Now that Spark 2.4 is supported, our main goal for Preview 3 is to extend the Column API and add more operations on columns. Once Apache Spark with Scala 2.13 support is available, we will add Scala 2.13 support as well.
With everything introduced in Preview 2, now is a great time to give it a try! The official repository and the quick start guide are excellent places to start.