Making Kotlin Ready for Data Science

This year at KotlinConf 2019, Roman Belov gave an overview of Kotlin’s approach to data science. Now that the talk is available for everyone to see, we decided to recap it and share a bit more on the current state of Kotlin tools and libraries for data science.

How does Kotlin fit data science? Driven by the need to analyze ever-larger amounts of data, the last few years have brought a true renaissance to the data science discipline. This renaissance would not have been possible without the right tools. Where you once needed a programming language designed specifically for data science, today you can do the same work in a general-purpose language, provided that language makes the right design decisions and its community pitches in. This is what has made certain general-purpose languages, such as Python, more popular for data science than others.

With the concept of Kotlin Multiplatform, Kotlin aims to replicate its developer experience and extend its interoperability to other platforms as well. The major qualities of Kotlin by design include conciseness, safety, and interoperability. These fundamental language traits make it a great tool for a wide variety of tasks and platforms. Data science is certainly one of these tasks.

The great news is that the community has already begun adopting Kotlin for data science, and this adoption is happening at a fast pace. The brief report below outlines how ready Kotlin is for data science, covering the available tools and libraries.

Jupyter Notebooks


First and foremost, thanks to their interactivity, Jupyter notebooks are very convenient for transforming, visualizing, and presenting data. Thanks to its extensibility and open-source nature, Jupyter has grown into a large ecosystem around data science and has been integrated into tons of other data-related solutions. Among these integrations is the Kotlin kernel for Jupyter notebooks. With this kernel, you can write and run Kotlin code in Jupyter notebooks and use third-party data science frameworks written in Java and Kotlin.

An example of a reproducible Kotlin Jupyter notebook can be found in this repo. To quickly play with a Kotlin notebook, you can launch it on Binder (please note the environment will normally take a minute to set up).
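To give a feel for what running Kotlin in a notebook looks like, here is the kind of code a single cell might execute. The snippet below is plain Kotlin with no extra dependencies, computing two summary statistics over a small sample:

```kotlin
// Plain Kotlin, as you might run it in a single Kotlin Jupyter cell.
fun main() {
    val samples = listOf(3.1, 4.7, 2.2, 5.9, 4.4)
    val mean = samples.average()
    // Unbiased sample variance (divide the squared deviations by n - 1).
    val variance = samples.sumOf { (it - mean) * (it - mean) } / (samples.size - 1)
    println("mean = $mean")
    println("sample variance = $variance")
}
```

In a notebook, the output of the last expression in a cell is rendered right beneath it, which is what makes the format so convenient for exploration.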

Apache Zeppelin

Due to its strong support for Spark and Scala, Apache Zeppelin is very popular among data engineers. Like Jupyter, Zeppelin has a plugin API (called Interpreters) for extending its core with support for other tools and languages. The latest release of Zeppelin (0.8.2) doesn’t come with a bundled Kotlin interpreter, but one is already available in Zeppelin’s master branch. To learn how to deploy Zeppelin with Kotlin support in a Spark cluster, see these instructions.

Apache Spark

Since Spark has a robust Java API, you can already use Kotlin with the Spark Java API from both Jupyter and Zeppelin without any problems. However, we’re working on improving this integration by adding full support for Kotlin classes in Spark’s Dataset API. Support for Kotlin in Spark’s shell is also in progress.

Kotlin libraries


Using Kotlin for data science alone, without libraries, makes little sense. Luckily, thanks to the recent efforts of the community, there are already a number of nice Kotlin libraries that you can use right away.

Here are some of the most useful libraries:

  • kotlin-statistics is a library that provides a set of extension functions to perform exploratory and production statistics. It supports basic numeric list/sequence/array functions (from sum to skewness), slicing operators (e.g. countBy, simpleRegressionBy), binning operations, discrete PDF sampling, naive Bayes classification, clustering, linear regression, and more.
  • kmath is a library inspired by NumPy; it supports algebraic structures and operations, array-like structures, math expressions, histograms, streaming operations, wrappers around commons-math and koma, and more.
  • krangl is a library inspired by R’s dplyr and Python’s pandas; this library provides functionality for data manipulation using a functional-style API; it allows you to filter, transform, aggregate, and reshape tabular data.
  • lets-plot is a library for declaratively creating plots based on tabular data. This library is inspired by R’s ggplot and The Grammar of Graphics, and is integrated tightly with the Kotlin kernel. It is multi-platform and can be used not just with JVM, but also from JS and Python.
  • kravis is another library inspired by R’s ggplot for visualizing tabular data.
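To give a feel for the extension-function style that kotlin-statistics and similar libraries favor, here is a small hand-rolled sketch; `mean` and `sampleStdDev` below are illustrative stand-ins written for this post, not the actual library APIs:

```kotlin
import kotlin.math.sqrt

// Illustrative extension functions in the style of kotlin-statistics;
// these are hand-rolled stand-ins, not the library's own API.
fun List<Double>.mean(): Double = sum() / size

fun List<Double>.sampleStdDev(): Double {
    val m = mean()
    // Unbiased estimator: divide the squared deviations by n - 1.
    return sqrt(sumOf { (it - m) * (it - m) } / (size - 1))
}

fun main() {
    val heights = listOf(1.62, 1.75, 1.68, 1.80, 1.71)
    println("mean = ${heights.mean()}")
    println("stdDev = ${heights.sampleStdDev()}")
}
```

Because Kotlin lets you attach extension functions to existing types such as `List<Double>`, libraries can offer statistics as fluent calls on ordinary collections without wrapper classes.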

For a more complete list of useful links, please refer to Kotlin data science resources by Thomas Nield.

Lets-Plot for Kotlin

Lets-Plot is an open-source plotting library for statistical data written entirely in Kotlin. Being a multiplatform library, it has an API designed specifically for Kotlin. You can familiarize yourself with how to use this API by reading its user guide.

For interactivity, Lets-Plot is tightly integrated with the Kotlin kernel for Jupyter notebooks. Once you have the Kotlin kernel installed and enabled, add the following line to a Jupyter notebook:

%use lets-plot

Then you will be able to call the Lets-Plot API from your cells and see the resulting plots immediately beneath them, just as you would with ggplot in R or Python.
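As a sketch, a cell that draws a scatter plot might look like the following. It assumes the `%use lets-plot` line above has already run in the notebook, and the exact function names may differ between library versions:

```kotlin
// Runs in a Kotlin Jupyter cell after `%use lets-plot`.
// Function names follow the ggplot-style API and may vary by version.
val data = mapOf(
    "x" to listOf(1.0, 2.0, 3.0, 4.0, 5.0),
    "y" to listOf(1.2, 2.3, 1.8, 3.9, 4.1)
)
lets_plot(data) + geom_point { x = "x"; y = "y" }
```

The last expression of the cell is the plot object, so the rendered chart appears directly below the cell.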

Kotlin bindings for NumPy

NumPy is a popular package for scientific computing with Python. It provides powerful capabilities for multi-dimensional array processing, linear algebra, Fourier transform, random numbers, and other mathematical tasks. Kotlin Bindings for NumPy is a Kotlin library that enables calling NumPy functions from Kotlin code by providing statically typed wrappers for NumPy functions.

How you can help


The entire Kotlin ecosystem is based on the idea of open source and would not be possible without the help of many contributors. Kotlin for data science is only emerging and needs your help now as ever! Here’s how you can pitch in:

  • Talk about your pain points and share your ideas on how to make Kotlin even better-suited for data-science tasks – your tasks.
  • Contribute to the open source data-science-related libraries, and create your own libraries and tools – anything that you think can help Kotlin become a language of choice for data science.

The Kotlin community has a dedicated channel called #datascience in its Slack. We invite you to join this channel to ask questions, find out in what areas help is needed and how you can contribute, and of course share your feedback and your work with the community.

Keep in mind that Kotlin is still at a very early stage of becoming a tool of choice for data scientists. It’s going to be an exciting and challenging journey: it will require building a rich ecosystem of tools and libraries, as well as adjusting the language design to meet the needs of data-related tasks. Give the tools described above a try, especially the Jupyter kernel and the libraries, and share your feedback with us. If you see something not working as you would expect, please share your experience, or get involved and help fix it.


Most of the information in this post, and much more, can be found on the official Kotlin website.

KotlinConf 2019 featured more inspiring talks about data science, including Kotlin for Science by Alexander Nozik and Gradient Descent with Kotlin by Erik Meijer.

We also recommend watching these talks from the past two KotlinConf conferences: this talk by Holger Brandl (the creator of krangl, Kotlin’s analog of Python’s pandas) and this talk by Thomas Nield (the creator of kotlin-statistics).

That’s it for today (and probably for this year). Wrapping it all up, the community is adopting Kotlin for data science at a good pace, so now it’s your turn.

Let’s Kotlin!
