Kotlin DataFrame 0.9.1 released!

Jolan Rensen

It’s time for another Kotlin DataFrame update to start off the new year.
There have been a lot of exciting changes since the last 0.8.0 preview release. So without any further ado, let’s jump right in!

TL;DR:

OpenAPI type schemas can now be parsed and converted into data schemas.
New JSON reading options include type clash tactics and key/value paths.
Support for writing Apache Arrow files has been added.
Many bugs have been fixed.
Make sure to update your Kotlin Jupyter kernel if you use DataFrame there.

Kotlin DataFrame on GitHub

OpenAPI Type Schemas

JSON schema inference is great, but it’s not perfect. DataFrame has had the ability to generate data schemas based on given data for a while now, but this can lead to errors in types or nullability when the sample doesn’t correctly reflect how future data might look.
Today, more and more APIs offer OpenAPI (Swagger) specifications. Aside from API endpoints, they also hold Data Models (Schemas) which include all the information about the types that can be returned from or supplied to the API. Obviously, we don’t want to reinvent the wheel and use our own schema inference when we can use the one provided by the API. Not only will we now get the proper names of the types, but we will also get enums, correct inheritance, and overall better type safety.

From DataFrame 0.9.1 onward, we will support the automatic generation of data schemas based on OpenAPI 3.0 type schemas.

To get started, simply import the OpenAPI specification file (.json or .yaml) as you would import any other data you would want to generate data schemas for. An OpenAPI file can contain any number of type schemas that will all be converted to a data schema.
We’ll use the pet store example from OpenAPI itself.

Your project does need an extra dependency for this to work:

implementation("org.jetbrains.kotlinx:dataframe-openapi:{VERSION}")

Importing data schemas can be done using a file annotation:

@file:ImportDataSchema(
    path = "https://petstore3.swagger.io/api/v3/openapi.json",
    name = "PetStore",
)

import org.jetbrains.kotlinx.dataframe.annotations.ImportDataSchema

Or using Gradle:

dataframes {
    schema {
        data = "https://petstore3.swagger.io/api/v3/openapi.json"
        name = "PetStore"
    }
}

And in Jupyter:

val PetStore = importDataSchema(
    "https://petstore3.swagger.io/api/v3/openapi.json"
)

After generating the data schemas, all type schemas from the OpenAPI spec file will have a corresponding data schema in Kotlin that’s ready to parse any JSON content adhering to it.
These will be grouped together under the name you give, which in this case is PetStore. Since the pet store OpenAPI schema has the type schemas Order, Customer, Pet, etc., you will have access to the data schemas PetStore.Order, PetStore.Customer, PetStore.Pet, etc., which you can use to read and parse JSON data. (Hint: You can explore this generated code in your IDE and see what it looks like.)

For example:

val df = PetStore.Pet.readJson(
   "https://petstore3.swagger.io/api/v3/pet/findByStatus?status=available"
)
val names: DataColumn<String> = df
    .filter { /* this: DataRow<Pet>, it: DataRow<Pet> */
        category.name == "Dogs" &&
            status == Status1.AVAILABLE
    }
    .name

If you’re interested in the specifics of how this is done, I’ll break down an example below. Otherwise, you can continue to the next section.

OpenAPI Deep Dive

We can compare and see how, for instance, Pet is converted from the OpenAPI spec to Kotlin DataSchema interfaces (examples have been cleaned up a bit):

In the OpenAPI spec, Pet is defined as:

"Pet": {
  "required": [ "name", "photoUrls" ],
  "type": "object",
  "properties": {
    "id": {
      "type": "integer",
      "format": "int64",
      "example": 10
    },
    "name": {
      "type": "string",
      "example": "doggie"
    },
    "category": { "$ref": "#/components/schemas/Category" },
    "photoUrls": {
      "type": "array",
      "items": { "type": "string" }
    },
    "tags": {
      "type": "array",
      "items": { "$ref": "#/components/schemas/Tag" }
    },
    "status": {
      "type": "string",
      "description": "pet status in the store",
      "enum": [ "available", "pending", "sold" ]
    }
  }
}

As you can see, it’s an object type that has multiple properties. Some properties are required, like name and photoUrls. Others, like id and category, are not. No properties are nullable in this particular example, but since Kotlin has no concept of undefined properties, non-required properties will be seen as nullable too. There are primitive properties, such as id and name, but also references to other types, like Category and Tag. Let’s see what DataFrame generates using this example:

enum class Status1(override val value: String) : DataSchemaEnum {
    AVAILABLE("available"),
    PENDING("pending"),
    SOLD("sold");
}
    
@DataSchema(isOpen = false)
interface Pet {
    val id: Long?
    val name: String
    val category: Category?
    val photoUrls: List<String>
    val tags: DataFrame<Tag?>
    val status: Status1?
    
    companion object {
      val keyValuePaths: List<JsonPath>
        get() = listOf()
      fun DataFrame<*>.convertToPet(convertTo: ConvertSchemaDsl<Pet>.() -> Unit = {}): DataFrame<Pet> = convertTo<Pet> {
          convertDataRowsWithOpenApi()
          convertTo()
      }
      fun readJson(url: java.net.URL): DataFrame<Pet> = 
        DataFrame.readJson(url, typeClashTactic = ANY_COLUMNS, keyValuePaths = keyValuePaths)
          .convertToPet()
      fun readJson(path: String): DataFrame<Pet> = ...
      ...
    }
}

Let’s look at the generated interface Pet. All properties from the OpenAPI JSON appear to be there: id, name, and so on. Non-required or nullable properties are correctly marked with a ?. References to other types, like Category and Tag, are working too and are present elsewhere in the generated file.
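The nullability rule described here can be sketched in a few lines of plain Kotlin. This is only an illustration of the rule, not the generator DataFrame actually uses; the Property class and the render function are hypothetical names made up for this sketch.

```kotlin
// Hypothetical, simplified model of one OpenAPI property; not a DataFrame type.
data class Property(
    val name: String,
    val kotlinType: String,
    val nullable: Boolean = false, // stands for an explicit "nullable: true" in the schema
)

// A property is rendered as nullable when it is optional (not listed under
// "required") or explicitly nullable in the schema.
fun render(property: Property, required: Set<String>): String {
    val optional = property.name !in required
    val mark = if (optional || property.nullable) "?" else ""
    return "val ${property.name}: ${property.kotlinType}$mark"
}

fun main() {
    // The "required" list and property types from the Pet schema above.
    val required = setOf("name", "photoUrls")
    val properties = listOf(
        Property("id", "Long"),
        Property("name", "String"),
        Property("category", "Category"),
        Property("photoUrls", "List<String>"),
        Property("status", "Status1"),
    )
    properties.forEach { println(render(it, required)) }
}
```

The tags property is left out of this sketch on purpose, because arrays of objects are represented as data frames rather than as nullable lists.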
Interestingly, since tags is supposed to come in the form of an array of objects, this is represented as a List of DataRows, or more specifically, a data frame. Thus, when Pet is used as a DataFrame type, tags will become a FrameColumn.
Finally, status was an enum that was defined inline in the OpenAPI JSON. We cannot define a type inline like that in Kotlin, so it’s generated outside of Pet.
Since DataSchemaEnum is used here, this might also be a good opportunity to introduce it properly. Enums can implement this interface to control how their values are read/written from/to data frames. This allows enums to be created with names that might be illegal in Kotlin (such as numbers or empty strings) but legal in other languages.
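To make the idea a bit more concrete, here is a minimal sketch of an enum whose raw values would be illegal as Kotlin identifiers. To keep it runnable without the library, it uses a local stand-in interface called RawValueEnum; that interface, the Size enum, and the parseSize function are all assumptions made for this example and are not part of DataFrame.

```kotlin
// Stand-in for the idea behind DataSchemaEnum: each constant carries the raw
// value that appears in the data. This local interface is an assumption so
// the sketch runs without the library.
interface RawValueEnum {
    val value: String
}

// Raw values such as "1" or "" are not legal Kotlin identifiers, so the
// constants get legal names and keep the original text in `value`.
enum class Size(override val value: String) : RawValueEnum {
    ONE("1"),
    TWO("2"),
    UNKNOWN("");
}

// Reading: find the constant whose raw value matches the incoming data.
// Throws NoSuchElementException if no constant matches.
fun parseSize(raw: String): Size =
    Size.values().first { it.value == raw }

fun main() {
    println(parseSize("1"))     // ONE
    println(parseSize("").name) // UNKNOWN
    println(Size.TWO.value)     // 2 (writing back uses the raw value)
}
```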

To be able to quickly read data as a certain type, the generated types have specific .readJson() methods. The example only shows the URL case in full, but the others are very similar. After calling one of them, the data frame is converted to the right type (in this case, using convertToPet(), which applies, corrects, and converts all the properties to the expected types). Those conversion functions can also be used to convert your own data frames to one of these generated types.

Adding support for OpenAPI type schemas was a difficult task. OpenAPI is very flexible in ways Kotlin and DataFrame cannot always follow. We’re certain it will not work with 100% of the OpenAPI specifications out there, so if you notice some strange behavior with one of your APIs, please let us know on GitHub or Slack so we can improve the support. :-)

JSON Options

To make the OpenAPI integration work better, we made several changes to how JSON is read in DataFrame. While the default behavior is the same, we added some extra options that might be directly beneficial to you too!

Key/Value Paths

Have you ever encountered a JSON file that, when read into a data frame, resulted in hundreds of columns? This can happen if your JSON data contains an object with many properties (key/value pairs). Unlike a large list of data, a huge map like this is not so easily stored in a column-based fashion, making it easy for you to lose grip on your data. Plus, if you’re generating data schemas, the compiler will most likely run out of memory due to the sheer number of interfaces it needs to create.

It would make more sense to convert all these columns into just two columns: “key” and “value”. This is exactly what the new key/value paths achieve.
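The idea itself is independent of DataFrame and can be sketched with plain Kotlin collections. The KeyValue class, the toKeyValueRows function, and the sample data are hypothetical and only illustrate the reshaping; DataFrame does this for you when you pass key/value paths.

```kotlin
// One row of the two-column result.
data class KeyValue(val key: String, val value: Any?)

// Turn an object with arbitrarily many properties into rows of key and value,
// instead of one column per property.
fun toKeyValueRows(obj: Map<String, Any?>): List<KeyValue> =
    obj.map { (key, value) -> KeyValue(key, value) }

fun main() {
    // Made-up sample data. Imagine thousands of entries here; as columns
    // this would become a very wide data frame.
    val apis = mapOf(
        "example.com" to mapOf("preferred" to "1.0"),
        "example.org" to mapOf("preferred" to "2.1"),
    )
    val rows = toKeyValueRows(apis)
    // The number of rows grows with the data; the number of columns stays at two.
    rows.forEach { println("${it.key} -> ${it.value}") }
}
```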

Let’s look at an example:

Reading the API list from APIS.GURU (a website/API that holds a collection of OpenAPI APIs) gives us a data frame with 2366 columns:

DataFrame.read("https://api.apis.guru/v2/list.json")

Inspecting the JSON as a data frame, we can find two places where conversion to keys/values might be useful: the root of the JSON and the versions property inside each website’s object. Let’s read it again, but now with these key/value paths. We can use the JsonPath class to help construct these paths (it is available in Gradle too, but not in KSP). Since we have a key/value object at the root, we’ll need to unpack the result by taking the first row and first column:

DataFrame.readJson(
    path = "https://api.apis.guru/v2/list.json",
    keyValuePaths = listOf(
        JsonPath(), // generates '$'
        JsonPath() // generates '$[*]["versions"]'
            .appendWildcard()
            .append("versions"),
    ),
)[0][0] as AnyFrame

Way more manageable, right? To play around more with this example, check out the Jupyter notebook or Datalore. This notebook contains examples of key/value paths and examples of the new OpenAPI functionality.
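If you’re wondering how the two path strings in the comments above come about, here is a rough sketch of how such a builder can compose them. PathSketch is a hypothetical stand-in written for this post, not the library’s JsonPath class; it only mirrors the two strings shown in the comments.

```kotlin
// Hypothetical stand-in for a JSON path builder. Each call returns a new
// instance with one more segment appended to the path string.
class PathSketch(val path: String = "\$") {
    // Selects the property with the given name.
    fun append(name: String) = PathSketch("$path[\"$name\"]")

    // Selects every child of the current element.
    fun appendWildcard() = PathSketch("$path[*]")
}

fun main() {
    val root = PathSketch()
    val versions = PathSketch()
        .appendWildcard()
        .append("versions")

    println(root.path)     // $
    println(versions.path) // $[*]["versions"]
}
```

The first path points at the root object itself, while the second one points at the versions property of every child of the root.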

Type Clash Tactics

A little-known feature of DataFrame is how type clashes are handled when creating data frames from JSON. Let’s look at an example:

With the default type clash tactic, ARRAY_AND_VALUE_COLUMNS, the following JSON is read into the data frame shown beneath it:

[
    { "a": "text" },
    { "a": { "b": 2 } },
    { "a": [6, 7, 8] }
]

⌌----------------------------------------------⌍
|  | a:{b:Int?, value:String?, array:List<Int>}|
|--|-------------------------------------------|
| 0|         { b:null, value:"text", array:[] }|
| 1|              { b:2, value:null, array:[] }|
| 2|    { b:null, value:null, array:[6, 7, 8] }|
⌎----------------------------------------------⌏

Clashes between array elements, value elements, and object elements are solved by creating a ColumnGroup in the data frame with the columns array (containing all arrays), value (containing all values), and a column for each property in all of the objects. For non-array elements, the array column will contain an empty list. For non-value elements, the value column will contain null. This also applies to elements that don’t contain a property of one of the objects.
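As a rough mental model, the splitting for the sample above could be written like this in plain Kotlin. This is a simplified sketch for exactly this sample and not how DataFrame implements the tactic; ClashRow and toClashRow are hypothetical names, and plain Kotlin values stand in for parsed JSON.

```kotlin
// One row of the column group "a" from the sample: a column per object
// property (only "b" here), plus the "value" and "array" columns.
data class ClashRow(val b: Int?, val value: String?, val array: List<Int>)

// Simplified model of the tactic for the three element kinds in the sample.
fun toClashRow(element: Any?): ClashRow = when (element) {
    // Objects fill their property columns; value is null, array is empty.
    is Map<*, *> -> ClashRow(b = element["b"] as? Int, value = null, array = emptyList())
    // Arrays fill the array column; everything else is null.
    is List<*> -> ClashRow(b = null, value = null, array = element.filterIsInstance<Int>())
    // Plain values fill the value column; array is empty.
    else -> ClashRow(b = null, value = element?.toString(), array = emptyList())
}

fun main() {
    // The three elements of "a" from the JSON sample.
    val a = listOf("text", mapOf("b" to 2), listOf(6, 7, 8))
    a.map { toClashRow(it) }.forEach { println(it) }
}
```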

If you’re not very fond of this conversion and would rather have a more direct representation of the JSON data, you could use the type clash tactic ANY_COLUMNS. This tactic is also used by OpenAPI to better represent the provided type schema. Using this tactic to read the same JSON sample as above results in the following data frame:

⌌-------------⌍
|  |     a:Any|
|--|----------|
| 0|    "text"|
| 1|   { b:2 }|
| 2| [6, 7, 8]|
⌎-------------⌏

We could consider more type clash tactics in the future. Let us know if you have any ideas!

How to use JSON Options

Both of these JSON options can be used when reading JSON using the DataFrame.readJson() functions and (for generating data schemas) using the Gradle and KSP plugins:

Functions:

DataFrame.readJson(
    path = "src/main/resources/someData.json",
    keyValuePaths = listOf(
        JsonPath()
            .appendArrayWithWildcard()
            .append("data"),
    ),
    typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS,
)

Gradle:

dataframes {
    schema {
        data = "src/main/resources/someData.json"
        name = "com.example.package.SomeData"
        jsonOptions {
            keyValuePaths = listOf(
                JsonPath()
                    .appendArrayWithWildcard()
                    .append("data"),
            )
            typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS
        }
    }
}

KSP:

@file:ImportDataSchema(
    path = "src/main/resources/someData.json",
    name = "SomeData",
    jsonOptions = JsonOptions(
        keyValuePaths = [
            """$[*]["data"]""",
        ],
        typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS,
    ),
)

Apache Arrow

Thanks to @Kopilov, our support for Apache Arrow files has further improved!

To use it, add the following dependency:

implementation("org.jetbrains.kotlinx:dataframe-arrow:{VERSION}")

On the reading side, this includes better reading of Date and Time types, UInts, and configurable nullability options. For more information, check out the docs.

On the writing side, well, this is completely new! DataFrame gained the ability to write to both the Arrow IPC Streaming format (.ipc) and the Arrow Random Access format (.feather). You can use both formats to save the data to a file, stream, byte channel, or byte array:

df.writeArrowIPC(file) // writes df to an .ipc file
df.writeArrowFeather(file) // writes df to a .feather file
 
val ipcByteArray: ByteArray = df.saveArrowIPCToByteArray()
val featherByteArray: ByteArray = df.saveArrowFeatherToByteArray()

If you need more configuration, then you can use arrowWriter. For example:

// Get schema from anywhere you want. It can be deserialized from JSON, generated from another dataset
// (including the DataFrame.columns().toArrowSchema() method), created manually, and so on.
val schema = Schema.fromJSON(schemaJson)

df.arrowWriter(

    // Specify your schema
    targetSchema = schema,

    // Specify desired behavior mode
    mode = ArrowWriter.Mode(
        restrictWidening = true,
        restrictNarrowing = true,
        strictType = true,
        strictNullable = false,
    ),

    // Specify mismatch subscriber
    mismatchSubscriber = { message: ConvertingMismatch ->
        System.err.println(message)
    },

).use { writer: ArrowWriter ->

    // Save to any format and sink, like in the previous example
    writer.writeArrowFeather(file)
}

For more information, check out the docs.

Other New Stuff

Let’s finish this blog post with a quick-fire round of some bug fixes and new features. Of course, there are far too many to mention, so we’ll stick to the ones that stand out:

The skipRows parameter in the DataFrame.readExcel() function (thanks @Burtan for the idea).
Locale fixes for parsing Double values.
ISO_DATE_TIME support when parsing from String.
Examples updated.
Improved generated data frame accessors with regard to nullability.
CSVFormat.withSkipHeaderRecord() now actually works in the writeCSV() function (thanks @Vhuc).
Improved type recognition:
- More consistent behavior.
- The type Nothing can now show up, for instance for empty lists.
convertTo improvements:
- Better exceptions: CellConversionException can now be thrown (thanks @Kopilov).
- fill support for missing columns in the DSL (thanks @Nikitinas).
- ColumnGroups and FrameColumns can now also be converted, not just ValueColumns.
- New advanced convertIf function in the DSL for all the cases that convert<>.with {} cannot handle easily.
- Columns filled with empty data frames or nulls can now be generated if needed.
- Empty rows/columns can be converted to anything.
- See the docs for more information.
Improved dataFrameConfig {} DSL in Jupyter.
DataSchemaEnum: an interface that enums can now implement to control how their values are stored in data frames.
The unfold operation can now unwrap columns of any objects into ColumnGroups/FrameColumns; see the docs (thanks @Holgerbrandl for the idea).
Many more bug fixes, too many to name.