Data Engineering Annotated Monthly – June 2022
Hi, I’m Pasha Finkelshteyn, and I’ll be your guide today through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on Twitter and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to get this source of data engineering information delivered straight to your inbox each month, you can subscribe to the newsletter here.
A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.
Apache Ambari: Resurrected – In February, Apache Ambari was moved to the Apache Attic. It made me think that the era of on-premises free Hadoop installations had come to an end. However, a miracle happened! This is actually the first instance I remember of something being revived after it was already in the attic. The process of returning to active maintenance is not even described in the docs. I’m actually happy that this has happened – Hadoop was there for me at the very beginning of my career and I have very positive feelings associated with it.
ShardingSphere – One more thing I learned while preparing this installment is that there is an entire top-level Apache project dedicated to turning traditional databases into distributed ones. To be honest, I’m a little skeptical. How is it possible to support distributed transactions and solve the other complex problems of distributed systems? Amazon spent tremendous amounts of money building Redshift on top of Postgres. Greenplum is in some sense the successor of Postgres-XL and Postgres-XC. Between them, these projects took many years to produce something both distributed and functional. And yet, there is a publication on the topic, and who am I to argue with them, anyway? The creators of ShardingSphere promise that it is SQL-aware and can transparently proxy SQL traffic, while also being pluggable, meaning you can extend the whole sphere with custom plugins.
Druid 0.23.0 – Druid (not the tree-folk kind) has recently picked up its development pace tremendously. In this release, the authors have implemented dozens of features, and some of them are very significant. For example, grouping on arrays without exploding the arrays significantly improves the readability of queries. This is crucial because, as we know, code is only written once but is read a potentially infinite number of times. There are also multiple improvements for streaming support (for Kafka and Kinesis), along with many other changes.
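To make the difference between "exploded" and "whole-array" grouping concrete, here is a minimal sketch in plain Python. It is not Druid SQL and does not use any Druid API; the row layout and the function names `group_exploded` and `group_by_whole_array` are made up for illustration, and it only mirrors the idea of the two grouping semantics.

```python
from collections import Counter

# Hypothetical rows with a multi-value "tags" column (illustrative data only).
rows = [
    {"user": "a", "tags": ["x", "y"]},
    {"user": "b", "tags": ["x", "y"]},
    {"user": "c", "tags": ["x"]},
]


def group_exploded(rows):
    """Count rows per individual tag: every array element becomes its own group."""
    return Counter(tag for row in rows for tag in row["tags"])


def group_by_whole_array(rows):
    """Count rows per whole array: the array itself is the grouping key."""
    # Lists are not hashable, so each array is converted to a tuple first.
    return Counter(tuple(row["tags"]) for row in rows)


print(group_exploded(rows))
print(group_by_whole_array(rows))
```

With exploded grouping, a row that carries two tags is counted in two groups; with whole-array grouping, each row lands in exactly one group, which is usually easier to reason about when reading a query.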
InLong 1.2.0 – This is one of the more interesting projects I hadn’t heard of before preparing this installment. Apache InLong was formerly named TubeMQ and was initially created by Tencent, a huge multimedia company with roots in China. When I say “huge”, I mean it’s one of the highest grossing multimedia companies in the world. Just like any multimedia company, they handle very large amounts of data. And, of course, they have created a solution that suits their ingestion needs. In a nutshell, InLong is a SaaS-based streaming platform that scales. It wouldn’t be quite right to call it “Kafka on steroids”, because it comes with many more batteries included. It integrates with different message queues out of the box, provides a real-time ETL experience, and offers built-in alerting and monitoring. On top of that, on the main page of its documentation you can find an impressive list of integrations.
Data engineering tools are evolving every day. This section is about updates that are in the works for technologies you may want to keep an eye on.
Kafka: Monitor KRaft Controller Quorum Health – In the previous installment I wrote about KRaft, the new consensus protocol in Kafka. However, when you implement such a major feature, you need to provide customers with the ability to monitor it. KIP-835 has already been accepted and is waiting to be implemented. Once it has been, we’ll be able to understand not only the health of the Kafka cluster, but specifically the state of the quorum as well.
Flink: Add Retry Support For Async I/O In DataStream API – There is no such thing as a “reliable data source”. Everything we connect to remotely is inherently faulty. Networks are unreliable, slow, and error-prone. They can drop packets and even lie on occasion, depending on the communication protocol. There are, of course, different strategies for dealing with these issues. We can just ignore the absence of information and accept that it will be lost in the haystack of data (and that can be perfectly fine), but sometimes we need to obtain it at any cost. In this case, the usual solution is to retry until the call succeeds. In Flink, customers currently have to write the whole logic for their retries themselves. But with the implementation of FLIP-232, this might be about to change!
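For a feel of the kind of retry logic people end up writing by hand, here is a minimal sketch in Python with `asyncio`. It is not Flink code and does not reflect the API proposed in FLIP-232; the names `retry_async` and `FlakyLookup`, the choice of `ConnectionError` as the retryable error, and the exponential backoff policy are all assumptions made for illustration.

```python
import asyncio


async def retry_async(operation, max_attempts=3, base_delay=0.01):
    """Await `operation()` until it succeeds or `max_attempts` is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except ConnectionError as error:
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed: surface the last error to the caller.
    raise last_error


class FlakyLookup:
    """Hypothetical remote lookup that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated network failure")
        return "enriched-record"


if __name__ == "__main__":
    lookup = FlakyLookup(failures_before_success=2)
    print(asyncio.run(retry_async(lookup)), "after", lookup.calls, "calls")
```

Even this toy version has to decide which errors are retryable, how long to wait, and when to give up, which is exactly the boilerplate that built-in retry support would take off our hands.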
Spark: Spark Connect – A client and server interface for Apache Spark – This proposed improvement has a lot of potential! The authors claim that Spark, with its current architecture, lacks four important traits: built-in remote connectivity, a rich developer experience, stability, and upgradability. They say that they are aiming to introduce a new API that will make it possible to work with Spark in a client-server manner. This means the client would connect through this API to a running Spark cluster, which would make it much easier to perform exploratory data analysis (a common task for both data engineers and data scientists). And who knows? Maybe the Kotlin API for Apache Spark can benefit from it too!
This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.
Recap of Databricks Machine Learning announcements from Data & AI Summit – This year’s Data & AI Summit was huge, and it was full of interesting announcements. A month ago it might have seemed like Databricks just provided notebooks, but that’s not the case anymore. The platform’s new features include MLflow 2.0, serverless model endpoints, model monitoring, and much more, all aimed at MLOps and production-ready data science models and experiments.
The State of Data Engineering 2022 – I like this kind of content. Somebody looks at what’s going on in the world of data engineering today, classifies it, and puts it all together into one nice image. I’ve already shared a similar piece by Matt Turck, who does this every year for the whole data landscape. I hope the folks at lakeFS continue their good work and update this yearly. Keep it up!
Cache in Distributed Systems – There are two hard problems in programming: variable naming and cache invalidation. Maybe you’ve already found your own solution to the first, but the second is likely still an issue for you. This article not only describes invalidation, but also addresses matters like eviction, hits, and misses. Of course, it won’t solve the problem of cache invalidation, but it might just help you understand caches a little better.
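To put the vocabulary of hits, misses, eviction, and invalidation side by side, here is a minimal in-process sketch in Python. It is not taken from the article: the `LRUCache` class is hypothetical, and least-recently-used is assumed as the eviction policy only because it is a familiar one. A real distributed cache adds many concerns this toy ignores.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from oldest to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            # Eviction: drop the least recently used entry to make room.
            self.entries.popitem(last=False)

    def invalidate(self, key):
        # Invalidation: remove an entry because the source data changed.
        self.entries.pop(key, None)


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")     # hit, and "a" becomes the most recently used entry
    cache.put("c", 3)  # cache is full, so "b" is evicted
    print(cache.get("b"), cache.hits, cache.misses)
```

Eviction happens because the cache ran out of room, while invalidation happens because the cached value is no longer true. The second one is the hard part, since something has to know when the source changed.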
Current 2022: The Next Generation of Kafka Summit – This is the most popular conference dedicated to Kafka, and it is hosted by one of Kafka’s main maintainers – Confluent. Of course, the main topic is data streaming, as always.
Big Data Event: London – This is going to be a huge data event in London. It’s likely there will be thousands of attendees, and there are already dozens of speakers from a wide selection of companies, including the widely known Aerospike, Stack Overflow, and Snowflake.
That wraps up June’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at email@example.com or send a DM to my personal Twitter account. You can also get in touch with our team at firstname.lastname@example.org. We’d love to know about any other interesting data engineering articles you come across!