Making Kotlin Ready for Data Science
By Andrey Cheptsov, JetBrains
January 2, 2020
This year at KotlinConf 2019, Roman Belov gave an overview of Kotlin's approach to data science. Now that the talk is available for everyone to see, we decided to recap it and share a bit more on the current state of Kotlin tools and libraries for data science.
How does Kotlin fit data science? Following the need to analyze large amounts of data, the last few years have brought a true renaissance to the data science discipline. This renaissance would not have been possible without proper tools. Previously, you needed a programming language designed specifically for data science, but today you can do it with general-purpose languages. Of course, this requires general-purpose languages to make the right design decisions, not to mention getting the community to help. All this has made certain languages, such as Python, more popular for data science than others.
With the concept of Kotlin Multiplatform, Kotlin aims to replicate its developer experience and extend its interoperability to other platforms as well. The major qualities of Kotlin by design include conciseness, safety, and interoperability. These fundamental language traits make it a great tool for a wide variety of tasks and platforms. Data science is certainly one of these tasks.
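To make these three traits a little more concrete, here is a minimal sketch using only the Kotlin and Java standard libraries. The `Reading` type, the `summarize` function, and the sample values are hypothetical and exist only for illustration: a data class keeps the record definition concise, a nullable field makes missing values explicit to the compiler, and a plain Java class (`java.util.DoubleSummaryStatistics`) is used directly from Kotlin.

```kotlin
import java.util.DoubleSummaryStatistics

// Hypothetical record type; the nullable field models a missing measurement.
data class Reading(val sensor: String, val value: Double?)

// Groups readings by sensor and summarizes the non-missing values
// with a class from the Java standard library.
fun summarize(readings: List<Reading>): Map<String, DoubleSummaryStatistics> =
    readings
        .groupBy { it.sensor }
        .mapValues { (_, rows) ->
            val stats = DoubleSummaryStatistics()
            // mapNotNull drops the missing values, so no null can reach accept().
            rows.mapNotNull { it.value }.forEach { stats.accept(it) }
            stats
        }

fun main() {
    val readings = listOf(
        Reading("a", 1.0),
        Reading("a", null),
        Reading("a", 3.0),
        Reading("b", 10.0)
    )
    for ((sensor, stats) in summarize(readings)) {
        println("$sensor: n=${stats.count}, mean=${stats.average}")
    }
}
```

The compiler rejects any attempt to pass a `Double?` where a `Double` is expected, so the missing reading has to be handled explicitly rather than surfacing later as a runtime error.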
The great news is that the community has already begun adopting Kotlin for data science, and this adoption is happening at a fast pace. The brief report below outlines how ready Kotlin is for data science, including the Kotlin libraries and Kotlin tools for data science.
First and foremost, thanks to their interactivity, Jupyter notebooks are very convenient for transforming, visualizing, and presenting data. Thanks to its extensibility and open-source nature, Jupyter has grown into a large ecosystem around data science and has been integrated into many other data-related solutions. Among them is the Kotlin kernel for Jupyter notebooks. With this kernel, you can write and run Kotlin code in Jupyter notebooks and use third-party data science frameworks written in Java and Kotlin.
An example of a reproducible Kotlin Jupyter notebook can be found in this repo. To quickly play with a Kotlin notebook, you can launch it on Binder (please note the environment will normally take a minute to set up).
Thanks to its strong support for Spark and Scala, Apache Zeppelin is very popular among data engineers. Similar to Jupyter, Zeppelin has a plugin API (called Interpreters) to extend its core with support for other tools and languages. Currently, the latest release of Zeppelin (0.8.2) doesn't come with a bundled Kotlin interpreter. However, it is available in the master branch of Zeppelin. To learn how to deploy Zeppelin with Kotlin support in a Spark cluster, see these instructions.
Since Spark has a robust Java API, you can already use Kotlin to work with the Spark Java API from both Jupyter and Zeppelin without any problems. However, we're working on improving this integration by adding full support for Kotlin classes with Spark's Dataset API. Support for Kotlin with Spark's shell is also in progress.
Using Kotlin for data science alone, without libraries, makes little sense. Luckily, thanks to the recent efforts of the community, there are already a number of nice Kotlin libraries that you can use right away.
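As a rough sketch of what calling the Spark Java API from Kotlin can look like, the example below counts even numbers with a `Dataset`. It assumes the Spark SQL artifact is on the classpath. The `countEvens` function and the local `master("local[*]")` session are illustrative choices, not part of any Kotlin-specific integration; in a notebook, a session may already be provided for you.

```kotlin
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Counts the even numbers in [0, upTo) using only Spark's Java-facing Dataset API.
fun countEvens(spark: SparkSession, upTo: Long): Long =
    spark.range(0L, upTo)                      // Dataset with a single "id" column
        .filter(col("id").mod(2).equalTo(0))   // Column expressions are called like any Java method
        .count()

fun main() {
    // A local session for experimentation only (an assumption of this sketch).
    val spark = SparkSession.builder()
        .appName("kotlin-spark-sketch")
        .master("local[*]")
        .getOrCreate()
    try {
        println(countEvens(spark, 10L))
    } finally {
        spark.stop()
    }
}
```

This sketch works on untyped column expressions only, so it does not depend on how Spark encodes Kotlin classes.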
Here are some of the most useful libraries:
For a more complete list of useful links, please refer to Kotlin data science resources by Thomas Nield.
Lets-Plot for Kotlin
Lets-Plot is an open-source plotting library for statistical data written entirely in Kotlin. Being a multiplatform library, it has an API designed specifically for Kotlin. You can familiarize yourself with how to use this API by reading its user guide.
For interactivity, Lets-Plot is tightly integrated with the Kotlin kernel for Jupyter notebooks. Once you have the Kotlin kernel installed and enabled, add the following line to a Jupyter notebook:
Then you will be able to call Lets-Plot API functions from your cells, and see the results immediately beneath the cells as you would normally have by using ggplot with R or Python:
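A minimal sketch of such a notebook cell follows. It assumes the Kotlin kernel's `%use lets-plot` line magic and the camelCase function names (`letsPlot`, `geomPoint`) found in recent Lets-Plot Kotlin releases; earlier releases used snake_case names such as `lets_plot` and `geom_point`, so adjust to the version you have installed. The data values are made up for illustration.

```kotlin
%use lets-plot

// Hypothetical sample data: column name -> list of values.
val data = mapOf<String, Any>(
    "x" to listOf(1, 2, 3, 4, 5),
    "y" to listOf(2.1, 3.9, 6.2, 7.8, 10.1)
)

// Map columns to aesthetics and add a point layer.
// As the last expression in the cell, the plot is rendered beneath it.
letsPlot(data) { x = "x"; y = "y" } + geomPoint()
```

Layers are combined with `+`, mirroring the grammar-of-graphics style of ggplot.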
Kotlin bindings for NumPy
NumPy is a popular package for scientific computing with Python. It provides powerful capabilities for multi-dimensional array processing, linear algebra, Fourier transforms, random numbers, and other mathematical tasks. Kotlin Bindings for NumPy is a Kotlin library that enables calling NumPy functions from Kotlin code by providing statically typed wrappers for NumPy functions.
The entire Kotlin ecosystem is based on the idea of open source and would not be possible without the help of many contributors. Kotlin for data science is only emerging and needs your help now more than ever! Here's how you can pitch in:
The Kotlin community has a dedicated channel called #datascience in its Slack. We invite you to join this channel to ask questions, find out in what areas help is needed and how you can contribute, and of course share your feedback and your work with the community.
Keep in mind that Kotlin is still in the very early stages of becoming the tool of choice for data scientists. It's going to be an exciting and challenging journey! It will require building a rich ecosystem of tools and libraries, as well as adjusting the language design to meet the needs of data-related tasks. If you see things not working as you would expect, please share your experience, or get involved and help fix them. Give the tools a try, especially the Jupyter kernel and the libraries, and share your feedback with us.
Most of the information in this post, and much more, can be found on the official Kotlin website.
We also recommend watching these talks from the past two KotlinConf conferences: a talk by Holger Brandl (the creator of krangl, Kotlin's analog of Python's pandas), and this talk by Thomas Nield (the creator of kotlin-statistics).