AI-Friendly Programming Languages: the Kotlin Story

To stay relevant in today’s AI-driven world, a programming language needs to be well represented in the ML community and in language models. The less well represented a language is, the lower the quality of its generated code, which leads to decreased usage of the language and, in turn, even worse representation. You might be wondering what exactly we mean by “representation”. Read on!

To support the future growth of Kotlin’s popularity and ensure the language is well represented in the new generation of developer tools, we introduce 💜Kotlin ML Pack: a set of necessary tools, data, and models to promote code modeling tasks for the Kotlin language. It is based on extensive research performed by the JetBrains Research team and provides ML researchers with more tools and ideas that they can apply to other programming languages.

Kotlin Data / Datasets

Good data is the cornerstone of machine learning in any domain, programming languages included. While popular, high-quality datasets for teaching and measuring various aspects of Python language modeling already exist, such datasets were virtually non-existent for Kotlin. We bridge this gap by collecting and open-sourcing two main datasets: a Kotlin language corpus and a dataset of instructions for Kotlin code generation.

Language corpus datasets

The following two datasets are the result of our research related to language corpus:

  • KStack – Kotlin large language corpus. The most complete, permissively licensed, and up-to-date collection of open-source Kotlin code.
  • KStack-clean – a curated dataset for better model training. A highly filtered version of KStack containing 25,000 high-quality examples.

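The exact filtering procedure behind KStack-clean is described in the technical report, but classifier-style quality filtering of this kind can be sketched as follows. Note that `quality_score` below is a hypothetical stand-in heuristic, not the actual learned scorer used to build the dataset:

```python
def quality_score(source: str) -> float:
    """Hypothetical quality heuristic for a Kotlin source file.

    A real pipeline would use a trained classifier's score; here we
    simply reward documented, reasonably sized files for illustration.
    """
    lines = source.splitlines()
    if not lines:
        return 0.0
    doc_ratio = sum(1 for l in lines if l.lstrip().startswith("//")) / len(lines)
    size_bonus = 1.0 if 10 <= len(lines) <= 500 else 0.5
    return doc_ratio + size_bonus


def select_top_k(files: list[str], k: int) -> list[str]:
    """Keep only the k highest-scoring files, KStack-clean style."""
    return sorted(files, key=quality_score, reverse=True)[:k]
```

In KStack-clean's case, this kind of selection reduces millions of files down to the 25,000 highest-quality examples.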
The table below compares the descriptive statistics for these two new datasets and the Kotlin subset of The Stack v2.

| Dataset | Files | Repositories | Lines | Tokens |
|---|---|---|---|---|
| The Stack v2 | 2M | 109,547 | 162M | 1.7B |
| KStack | 4M | 163,310 | 293M | 3.1B |
| KStack-clean | 25,000 | 3,366 | 2M | 22M |

KExercises: Kotlin instructions datasets

Another focus of our dataset development was the creation of the Kotlin dataset for instruct-tuning. Typically, such datasets consist of sets of instructions or tasks along with their solutions. Training on this data aids models in better comprehending the relationship between natural and programming languages.

A number of such datasets are available, some for the Python programming language and others covering multiple languages. However, in these datasets Kotlin is either only modestly represented or missing altogether.

Rather than creating an entire dataset from scratch, we decided to adapt an existing one by translating it from Python to Kotlin. For this purpose, we selected a dataset of Python exercises with demonstrated effectiveness. We then used GPT-3.5-turbo to translate the data from Python to Kotlin. After the translation, we manually reviewed a subsample of the data to verify the accuracy of the translations. Finally, we compiled an instruct dataset comprising 15,000 Kotlin tasks (approximately 3.5M tokens and 335,000 lines of code).
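The translation step above amounts to wrapping each Python exercise in a prompt and sending it to a chat model. The exact prompt wording used for KExercises is not published in this post, so the following is an illustrative sketch only:

```python
def build_translation_prompt(python_task: str, python_solution: str) -> str:
    """Assemble an illustrative Python-to-Kotlin translation prompt.

    The prompt text is a hypothetical example, not the one actually
    used to produce KExercises.
    """
    return (
        "Translate the following Python exercise and its solution into "
        "idiomatic Kotlin. Keep the task description equivalent and make "
        "sure the Kotlin solution compiles.\n\n"
        f"### Task\n{python_task}\n\n"
        f"### Python solution\n{python_solution}\n"
    )
```

Each resulting prompt would then be sent to GPT-3.5-turbo, and the model's response stored as the Kotlin instruction-solution pair.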

Evaluation

Another vital aspect of machine learning is accurate and efficient evaluation procedures. Thankfully, HumanEval has become a standard for such evaluations in the world of code LLMs. Though initially designed for Python, HumanEval has been translated into multiple programming languages. It has also been adapted for use with compiled languages and has been expanded with new tasks.
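HumanEval-style benchmarks are usually reported with the pass@k metric: the probability that at least one of k sampled completions passes the tests. The standard unbiased estimator from the original HumanEval paper (Chen et al., 2021) can be computed as:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total completions sampled per task
    c: number of those completions that pass the tests
    k: sample budget being evaluated
    """
    if n - c < k:
        # Every size-k subset contains at least one passing completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 pass, pass@1 is 0.5; averaging this quantity over all benchmark tasks gives the reported pass rate.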

HumanEval for Kotlin

Unfortunately, the existing HumanEval for Kotlin required significant improvement before it could be used. We therefore set out to rebuild HumanEval for Kotlin from scratch, using a different approach that involves human experts.

All JetBrains HumanEval solutions and tests were written by an expert competitive programmer with six years of experience in Kotlin and independently checked by a programmer with four years of experience in Kotlin. The tests we implement are equivalent to the original HumanEval tests for Python, and we fix the prompt signatures to address the generic variable signatures described above.

The new HumanEval benchmark is available on Hugging Face, together with usage instructions and benchmark evaluation results for different language models.

Training models for Kotlin

To showcase our datasets, we trained several models in different setups.

  • Code Llama 7B is an autoregressive language model based on an optimized transformer architecture. It supports infilling, was fine-tuned on sequences of up to 16,000 tokens, and supports contexts of up to 100,000 tokens at inference time.
  • The DeepSeek-coder-6.7B base model, implemented by DeepSeek, is a 6.7B-parameter model with multi-head attention, trained on two trillion tokens of natural language text in English and Chinese. It is also pre-trained on a project-level code corpus with a window size of 16,000 tokens and an extra fill-in-the-blank task, to support project-level code completion and infilling.
  • DeepSeek-coder-1.3B shares the same architecture and training procedure, but with fewer parameters.

We used the three datasets described above as part of the training setup. Fine-tuning was performed on an NVIDIA A100 GPU in bf16 precision, using the AdamW optimizer. Additionally, to stabilize the training process, we used a variety of techniques, such as Z-loss, weight decay, and gradient norm clipping.
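These stabilization tricks are standard practice. Gradient norm clipping, for instance, rescales the entire gradient vector whenever its global L2 norm exceeds a threshold. A minimal sketch of the idea (not the actual training code, which would use a framework utility such as PyTorch's built-in clipping):

```python
from math import sqrt


def clip_grad_norm(grads: list[float], max_norm: float, eps: float = 1e-6) -> list[float]:
    """Rescale gradients so that their global L2 norm is at most max_norm.

    Gradients whose norm is already within the bound pass through unchanged.
    """
    total_norm = sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / (total_norm + eps)
    return [g * scale for g in grads]
```

Capping the gradient norm this way prevents rare, very large gradient spikes from destabilizing the optimizer state during fine-tuning.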

As a result, we saw improvements across all the approaches we used. We achieved the most significant boost by combining DeepSeek-coder-6.7B with fine-tuning on the KExercises dataset, resulting in a pass rate of 55.28%. Fine-tuning on instructions produced great results on the other two base models as well. At the same time, fine-tuning on the full KStack dataset gave weak results, increasing the pass rate for CodeLlama by only three percentage points. The clean version of KStack performs much better during fine-tuning, but the pass rate is still lower than the one we achieved with the KExercises dataset.

We will not stop here. Our goals go beyond simply improving the quality of Kotlin code generation. We also strive to provide researchers with more tools and ideas, so that developer tooling continues to evolve through the application of ML to code generation and to software development in general.

This work and the Kotlin ML Pack that we’ve published cover the essentials of the Kotlin learning pipeline, like data and evaluation. However, the Kotlin and JetBrains ecosystems can offer much more to the language modeling and ML community, such as learning from tools like compilers or linters, additional code for datasets, and new benchmarks more relevant to day-to-day production development tasks.

For a deeper dive and a more detailed description of the research by the JetBrains Research team, read the Kotlin ML Pack: Technical Report.

Alternatively, watch the related section of the KotlinConf’24 keynote.
