Data Engineering Annotated Monthly – August 2021
August is usually a quiet month, with vacations taking their toll. But data engineering never stops. I’m Pasha Finkelshteyn and I will be your guide through this month’s news, my impressions of the developments, and ideas from the wider community. If you think I missed something worthwhile, ping me on Twitter and suggest a topic, link, or anything else.
News
A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in data engineering right now.
Fairlens 0.1.0 – Ethical ML is huge right now. But it is incredibly hard to determine manually whether a dataset is ethical, unbiased, and not skewed. Given this is a hot topic and there’s a boatload of money in it, you would expect there to be a wealth of tools to verify data ethics… but you’d be wrong. At least until Fairlens came on the scene. It hasn’t had its first stable release yet, but the promise is that it will un-bias your data for you! How cool is that?
Kafka 3.0.0-rc0 – If you like to try new releases of popular products, the time has come to test Kafka 3 on your staging environment and report any issues you find! Support for Scala 2.12 and Java 8 still exists but is deprecated. There are also several changes in KRaft (namely Revise KRaft Metadata Records and Producer ID generation in KRaft mode), along with many other changes. Unfortunately, the feature that was most awaited (at least by me) – tiered storage – has been postponed to a later release.
ClickHouse v21.8 – This release of ClickHouse is massive. For fans of open-source tools, the most interesting change is support for the MaterializedPostgreSQL table engine, which lets you replicate a whole Postgres table or database into ClickHouse with ease.
MLflow 1.12.0 – MLflow is a popular MLOps framework that lets you store and serve ML models, and this is a minor release of it. One of the changes that looks exciting to me is “Add pip_requirements and extra_pip_requirements to mlflow.*.log_model and mlflow.*.save_model for directly specifying the pip requirements of the model to log / save.”
Apache Pinot 0.8.0 – Apache Pinot is a real-time distributed OLAP datastore, designed to answer OLAP queries with low latency. In some sense, it competes with ClickHouse, as both target the same workflow. There are multiple differences, of course; for example, Pinot is intended to work in big clusters. There are a couple of comparisons on the internet, like this one, but it’s worth mentioning that they are quite old and both systems have changed a lot, so if you’re aware of more recent comparisons, please let me know! One of the interesting changes here is support for Bloom filters for IN predicates.
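To make the Bloom filter idea a bit more concrete, here is a minimal sketch of how such a filter can help with an IN predicate: a segment is scanned only if its filter reports that at least one of the IN values might be present. This is an illustration of the general technique, not Pinot's implementation; the class and function names (BloomFilter, build_filter, segments_to_scan) and the sizing parameters are made up for this example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        # Sizes are arbitrary illustration values, not tuned for any workload.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filter(values) -> BloomFilter:
    """Build a filter over all values of one column in one segment."""
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return bloom


def segments_to_scan(segment_filters, in_values):
    """Keep only segments whose filter matches at least one IN value."""
    return [
        name
        for name, bloom in segment_filters.items()
        if any(bloom.might_contain(v) for v in in_values)
    ]


if __name__ == "__main__":
    filters = {
        "segment_a": build_filter(["berlin", "paris"]),
        "segment_b": build_filter(["tokyo", "osaka"]),
    }
    # Roughly: WHERE city IN ('paris', 'rome')
    print(segments_to_scan(filters, ["paris", "rome"]))
```

A segment that really contains one of the values is always kept, while a segment that contains none of them is skipped unless the filter happens to produce a false positive, in which case it is scanned unnecessarily but the result stays correct.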
LakeFS 0.48.0 – We described LakeFS in the July issue of our Annotated. Now it has added support for having multiple AWS regions for underlying buckets. While this may be more expensive in terms of both money and performance, it still sounds like a nice disaster recovery option. Even if a meteorite hits your data center, your big data is still going to be safe!
Future improvements
Data engineering technologies are evolving every day. This section is about what’s in the works for technologies that you may want to keep on your radar.
Cache for ORC metadata in Spark – ORC is one of the most popular binary formats for data storage, featuring awesome compression and encoding capabilities. But what if we need to query the same dataset multiple times? Reading file metadata is costly because it is an IO operation, which is slow. And more files means more time. With caching, though, execution times may be decreased dramatically (on some workloads).
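The caching idea can be sketched in a few lines. The example below is a rough illustration under stated assumptions, not the design of the Spark change: it keeps parsed file metadata in memory, keyed by path, and re-reads it only when the file's size or modification time changes. FooterCache and fake_read_footer are hypothetical names, and fake_read_footer merely stands in for real ORC footer parsing.

```python
import os
import tempfile
from typing import Callable, Dict, Tuple


class FooterCache:
    """Caches parsed file metadata; an entry is reused until the file changes."""

    def __init__(self, read_footer: Callable[[str], dict]) -> None:
        self._read_footer = read_footer
        # path -> ((size, mtime_ns), parsed footer)
        self._entries: Dict[str, Tuple[Tuple[int, int], dict]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> dict:
        # A stat call is still IO, but far cheaper than reading and parsing
        # the footer itself.
        stat = os.stat(path)
        fingerprint = (stat.st_size, stat.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == fingerprint:
            self.hits += 1
            return cached[1]
        self.misses += 1
        footer = self._read_footer(path)
        self._entries[path] = (fingerprint, footer)
        return footer


def fake_read_footer(path: str) -> dict:
    # Hypothetical stand-in for ORC footer parsing: only records the file size.
    return {"path": path, "num_bytes": os.path.getsize(path)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "part-0.orc")
        with open(path, "wb") as f:
            f.write(b"not a real ORC file")
        cache = FooterCache(fake_read_footer)
        cache.get(path)  # first query: footer is read
        cache.get(path)  # second query: served from memory
        print(cache.hits, cache.misses)
```

Keying on size and modification time is one simple invalidation choice; a real engine would also have to bound the cache's memory use and decide how long entries may live.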
Custom netty HTTP request inbound/outbound handlers in Flink – Sometimes we need to perform HTTP requests while processing data with Flink, and a plain request is not always enough: we may need to customize it, for example, by adding authentication or custom headers, which can be especially helpful in strict corporate environments. It looks like this will be available soon in Flink!
Cassandra Paxos Improvements – Cassandra’s Paxos implementation is known to be good, but not perfect. For example, Lightweight Transactions (LWT) are known to suffer from poor performance. Don’t take it from me – this comes from Cassandra developers themselves. So, they’ve decided to improve this in the foreseeable future and the work is already underway, which I think is awesome.
Articles
This section is about inspiration. We’ll try to list some great articles and posts that can help us all learn from the experience of other people, teams, and companies dealing with data engineering.
Change Data Capture at DeviantArt – I think we all know what Debezium is. But while it is a tool for streaming data from DBs to Kafka, it cannot cover all CDC needs or scenarios. In this article, the folks from DeviantArt describe the whole architecture of their CDC solution, with concrete recipes and tips.
How Uber Achieves Operational Excellence in the Data Quality Experience – Uber is known for having a huge Hadoop installation in Kubernetes. This blog post is more about data quality, though, describing how they built their data quality platform. Who would have thought that building a data quality platform could be this challenging and exciting? 100% test coverage sounds amazing, too, so good job!
Apache Hudi – The Data Lake Platform – Quasi-mutable data storage formats are not only trending, but also mysterious. How do they really work under the hood? At what cost do we get this mutability? In this detailed post, Hudi developers meticulously describe how Apache Hudi works and why it’s good for streaming.
Hive Metastore – It didn’t age well – The folks from LakeFS continue to delight us with interesting articles about data engineering. This time they describe what is wrong with the popular Hive Metastore and explain how it works in detail.
Tools
sqlglot – I often found myself digging through the web for specific SQL dialect details. Should I backtick the identifiers here? Should I use double quotes or single ones? And don’t get me started on formatting. Sometimes I just didn’t want to launch my favorite DataGrip to format a single SQL statement. Then I discovered sqlglot, a tool that can transpile my syntax from one dialect to another in an instant. That’s one less headache for me!
Conferences
SmartData 2021 – This international conference on data engineering is organized by a Russian company, but it aims to have at least 30% of the talks in English. Most of the topics, from data quality to DWH architecture, are hot! Speakers from Databricks, Microsoft, Netflix, and other huge companies are going!
That wraps up August’s Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter, or you can get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!