Data Engineering Annotated Monthly – September 2022
Greetings from sunny Berlin! It’s been a bustling two months here, so busy that I had to skip the digests. I’m delighted to be back, collecting the most exciting news from the world of data engineering for you. I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, hit me up on Twitter and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to receive this information as an email, you can subscribe to the newsletter here.
A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.
Brooklin 4.1.0 – Once again, I learned something new while preparing this article: Brooklin, a LinkedIn service for streaming data in a heterogeneous environment. The project’s GitHub page says that it is characterized by high reliability and throughput, claiming that Brooklin can run hundreds of streaming pipelines simultaneously. One of the use cases from the product page that stood out to me in particular was mirroring multiple Kafka clusters with a single Brooklin cluster!
Ambry v0.3.870 – It turns out that last month was rich in releases from LinkedIn, all of them related in one way or another to data engineering. The authors of Ambry, another new product, release so often that they must already be getting tired of writing changelogs. Nevertheless, the project looks very interesting. It is a distributed on-premise object repository that targets trillions of small immutable objects or billions of large ones. I often hear that MinIO does not perform well in large installations. Perhaps it’s time to try an alternative!
Apache Pegasus 2.3.0 – Have you ever been in a situation where you were designing a storage architecture and all the solutions in some areas just seemed wrong, leaving you to choose between an unsuitable option and an even less suitable one? Key-value storage is a fairly typical context in which this problem arises. HBase is too slow and brittle, and Redis is fast but can lose data. Apache Pegasus might be the alternative you are looking for, if not now, then in your next project. On the one hand, it is written in C++, which probably makes it faster, but on the other, Pegasus is very concerned about data persistence on disk, which is critical when migrating data, for example.
Cloudstack 184.108.40.206 (LTS) – While it may sometimes seem like Kubernetes is on top of the pile, the truth is that even the mighty k8s needs hardware to run on. If we use cloud providers, we don’t have to think about that. With your own hardware, it can be easy too: take your favorite hypervisor and go with it. But what if there are many hardware clusters? Apache Cloudstack provides a free IaaS stack that is compatible with all the most popular hypervisors, both paid and free. Why should data engineers care about this tool? It’s very simple: at some point, while talking to the Ops team, it will turn out that they’re tired of creating virtual machines with different characteristics all over the world for us. This is where Cloudstack comes in handy: it lets data engineers manage the hardware they need themselves instead of going to the Ops team every time they need a slight change in resources for their workloads. Both sides can benefit from this product – the Ops and Data teams.
DuaLip 2.4.1 – Sometimes the job of a data engineer is not just to build pipelines but also to help data science professionals optimize their solutions. Imagine, for example, that your colleagues are working on a sales or scheduling task that requires a constraint-based solution. They have their algorithm. They have their data. And they know what they need to do. They have a problem, however. Their solution is too slow and isn’t scalable. While OptaPlanner, the most famous product for completing these tasks, does not scale, LinkedIn’s DuaLip runs on a tool that many of us know and love – Apache Spark, which allows us to scale tasks and thus run them faster.
Druid 24.0.0 – Apache Druid has made the leap from 0.23.0 to 24.0.0. I think it’s about time! Druid seems to have really made the transition from an immature project to a production-ready solution. But maturity does not signal a halt in development! On the contrary, new features appear in every release, including noticeable ones. This release, for example, introduced a multi-stage query task engine that promises significant changes in the execution speed for batch queries. Unfortunately, it is not clear yet whether all your current queries will work as before, and the developers recommend testing your queries on the staging environment first. As they say, you can’t make an omelet without breaking any eggs.
Data engineering tools are evolving every day. This section is about updates that are in the works for technologies and that you may want to keep an eye on.
Potentially breaking change in Spark 3.3.1 – While reading Apache Spark’s mailing list, I stumbled upon a concern expressed by one of the project’s maintainers: The SPARK-40218 fix seems to have introduced a change in the framework’s behavior. If your code happens to work correctly now, before updating to version 3.3.1, there is no guarantee that it will still work after this release. Don’t forget to check your tests! At the time of writing, the fix has not yet been released, as voting is still in progress. You can find more info on the voting process here.
Support for Standalone mode in Flink’s Kubernetes operator – As you may have noticed, I write about Kubernetes all the time. Indeed, I believe that it’s the future of data engineering, at least for on-premises solutions. And that’s why I get especially excited when popular products add or extend support for k8s. Over the past few months, Flink’s developers realized that the Kubernetes operator’s current support is insufficient and may also lead to security risks. In addition to solving this problem, the Flink developers also addressed the issue of how to run an older version of Flink that does not yet know anything about Kubernetes-native features. This will be possible after the release thanks to support for running Flink in standalone mode instead of native Kubernetes mode.
Multicasting record results with Kafka – I can’t say it any better than the author of this KIP did:
[Often] in Kafka Streams users want to send a record to more than one partition on the sink topic. Currently, if a user wants to replicate a message into N partitions, the only way of doing that is to replicate the message N times and then plug-in a new custom partitioner to write the message N times into N different partitions.
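To make the quote a bit more concrete, here is a minimal sketch of the difference. It does not use the Kafka Streams API at all: the "topic" is a plain dictionary, and every function name (`multicast_by_duplication`, `multicast_by_partitioner`, `even_partitions`) is hypothetical. The first function mimics the workaround from the quote; the second shows one plausible shape for the improvement, under the assumption that a partitioner could return several partitions at once.

```python
from collections import defaultdict


def multicast_by_duplication(record, target_partitions, sink):
    """Workaround from the quote: issue one send per target partition."""
    sends = 0
    for partition in target_partitions:
        # Each iteration stands in for a separate write, routed by a
        # custom partitioner pinned to a single partition.
        sink[partition].append(record)
        sends += 1
    return sends


def multicast_by_partitioner(record, choose_partitions, num_partitions, sink):
    """Hypothetical alternative: one send, the partitioner picks many partitions."""
    for partition in sorted(choose_partitions(record, num_partitions)):
        sink[partition].append(record)
    return 1


def even_partitions(record, num_partitions):
    """Hypothetical partitioner: select every even-numbered partition."""
    return {p for p in range(num_partitions) if p % 2 == 0}


# The "sink topic" is modeled as a mapping from partition number to records.
sink_a = defaultdict(list)
sends_a = multicast_by_duplication("order-42", [0, 2], sink_a)

sink_b = defaultdict(list)
sends_b = multicast_by_partitioner("order-42", even_partitions, 4, sink_b)
```

Both calls leave the record in the same partitions, but the first issues one send per partition while the second issues a single send and leaves the fan-out to the partitioner.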
This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.
Upgrading Data Warehouse Infrastructure at Airbnb – The value of data lakes – or quasi-mutable storages, as I like to call them – is difficult to overstate, as it is becoming increasingly difficult for companies to fit all their data into completely immutable structures. GDPR and DMCA compliance alone is already an arduous enough process! These conditions have led to a wave of changes at Airbnb. They’ve replaced their simple Parquet files with Apache Iceberg and migrated to Spark 3 for query optimization. From Iceberg they need not only create, read, and update operations but also its metastore, which reduces the number of S3 bucket listings their clients perform.
Is DataOps the Future Of the Modern Data Stack? – DevOps, a process built around the interaction of developers (Dev) with the Operations (Ops) team, has been with us for a long time. This article discusses the need to establish a similar framework around data. Some people maintain data while others use it, and it may be useful to have a discrete field that brings them together. From my perspective, the introduction of DataOps is long overdue. In fact, the term is not entirely new, which may be an indication that we are about to experience a sea change.
Spanner on a modern columnar storage engine – From time to time, industry giants share some insight into how they solve their problems, even if they don’t reveal the specific details of the tools they create. This practice can be extremely helpful, and in fact, famous, industry-changing open-source tools like Hadoop have been born out of it. In early September, Google decided to share how columnar storage works in Spanner and how it was migrated from the old engine to the new one under a load of two billion requests per second. Who knows? Maybe that’s precisely the information you need to create the next industry-changing product.
That wraps up September’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at email@example.com or send a DM to my personal Twitter account. You can also get in touch with our team at firstname.lastname@example.org. We’d love to know about any other interesting data engineering articles you come across!