Big Data Tools
A data engineering plugin
Data Engineering Annotated Monthly – September 2021
In most countries, students start learning in September. As data engineers, let’s follow their lead and learn something new, too! I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of developments and highlight ideas from the wider community. If you think I missed something worthwhile, ping me on Twitter and suggest a topic, link, or anything else.
News
A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in data engineering right now.
Zingg 0.3.0 – MDM (Master Data Management) is tricky. You have multiple sources of data and you have to define what is true and what is not. You also have to somehow determine whether records belong to the same person or not. Are Tim Burton, Timothy Burton, and T. W. Burton the same person? Zingg is a tool that integrates with Spark and tries to answer this question automatically, without the quadratic complexity of the task!
Kafka 3.0.0 – The Apache Software Foundation needed less than one month to go from Kafka version 3.0.0-rc0 to the release of 3.0.0. It involved only 10 commits with minor fixes, which is an incredible indication of the quality of their software.
Camel K 1.6.0 – This is not a huge release of Camel K, but I wanted to share this awesome project, which is not widely known inside my bubble. Lots of happy customers are aware of Apache Camel, an integration framework that makes it possible to connect almost anything to everything. Truth be told, it’s quite cumbersome, with each instance requiring a separate installation and XML configuration. Camel K combines Camel’s power with Kubernetes’ native experience. This specific release allows for the scaling of KameletBindings.
Hudi 0.9 – This release adds something huge: Spark DDL and DML support (experimental). Boundaries between Hudi and Hive are slowly disappearing as you are reading this post! Also, Hudi tables are now being registered as Spark data source tables, which means that Hive fallback is no longer needed to find tables from Spark.
Druid 0.22.0 – Apache Druid is claimed to be a high-performance analytical database competing with ClickHouse. An interesting nuance: it has a very nice user interface that allows users to build charts and change queries interactively! This release brings over 400 new features, but my favorites are the array aggregation functions in SQL.
PostgreSQL 14 – Sometimes I forget, but traditional relational databases play a big role in the lives of data engineers. And of course, PostgreSQL is one of the most popular databases. This release brings more features that are important for complex analytical queries. They say that “performance improvements have been made for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming”, and a lot more.
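Coming back to the Zingg item above: the usual way to avoid comparing every record with every other record is "blocking", where records are first grouped by a cheap key and only compared within a group. Here is a minimal sketch of that idea. It is not Zingg's API or its actual algorithm; the records, the blocking_key rule, and the function names are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; the Burton names come from the example in the text.
records = [
    {"id": 1, "name": "Tim Burton"},
    {"id": 2, "name": "Timothy Burton"},
    {"id": 3, "name": "T. W. Burton"},
    {"id": 4, "name": "Helena Carter"},
    {"id": 5, "name": "H. Carter"},
]


def blocking_key(record):
    """Assumed hand-written rule: first initial plus lowercased surname."""
    parts = record["name"].replace(".", "").lower().split()
    return parts[0][0] + "|" + parts[-1]


def candidate_pairs(records):
    """Group records by blocking key and pair them up only within a group."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}: {pairs}")
```

Only the candidate pairs would then go to a more expensive similarity check. A hand-written key like this one is brittle (a typo in the surname breaks it), which is exactly why a tool that works out the grouping for you is attractive.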
Future improvements
Data engineering technologies are evolving every day. This section covers updates that are in the works and that you may want to keep on your radar.
Massive compile time improvement for Spark – In some rare cases, the compilation of queries with deeply nested case-when statements can take a lot of time, sometimes even more than 24 hours! This happens because PushDownPredicate pushes rules down one by one, and in huge case-when statements there may be thousands of rules. This proposed optimization will reduce the required time in such cases from hours to minutes.
Add Broker Count Metric to Kafka – I never thought about it, but here’s a funny observation: if a Kafka broker is alive, this does not automatically mean that it’s healthy and fully functional. It may not even be able to send metrics due to failed DNS resolution. This KIP aims to add a new metric, the total number of brokers, so that administrators can compare the number of healthy and unhealthy brokers with the total number of brokers.
Improve YARN Registry DNS Server qps – In massive Hadoop clusters, there may be a lot of DNS queries. It turns out that in YARN Registry’s DNS Server implementation, resolution speed is suboptimal. In the linked issue, people from naver.com improved resolution throughput by more than 9 times!
Operational Analytics Framework for NiFi – It’s always beneficial for administration in a low-code tool to be as simple as development in it. NiFi is going to take a new step in this direction by implementing operational analytics with cluster behaviour prediction inside NiFi itself!
Articles
This section is all about inspiration. Here are some great articles and posts that can help us all learn from the experience of other people, teams, and companies who work in data engineering.
Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot – As an expert in distributed systems, I’m always very skeptical when I read or hear the words “exactly once”. But I’m not so presumptuous as to think that I’m more experienced than the engineers from Uber. They say that they know how to build exactly-once delivery for the event stream, which really sounds like the holy grail of stream processing. Sounds unbelievable, right? Check out the article for more details!
Auto-generating an Airflow DAG using the dbt manifest – dbt is commonly used as a tool for more or less declaratively describing the building blocks of your data and its transformation. But generally it’s not enough to run the dbt run command and say that your solution is production ready. This article describes how to generate Airflow DAGs from dbt manifests. Airflow is almost almighty!
Treating data as a product at Adevinta – Having data is not enough! People should be able to access and, more importantly, use data that is not sensitive from a security or privacy standpoint. In this article, Adevinta describes several practices they implemented to make data more accessible and useful.
Why you should try something else than Airflow for data pipeline orchestration – If you asked me what orchestrator you should use at work, I would answer “It depends, but probably Airflow”. Other people have different views, and it’s good to get a variety of perspectives!
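To make the dbt manifest idea from the list above a bit more concrete, here is a rough sketch of the first step: reading model dependencies out of a manifest. This is not the code from the linked article. The manifest below is a hypothetical, heavily trimmed stand-in for a real target/manifest.json, and the task_id naming scheme is my own assumption.

```python
import json

# Hypothetical, heavily trimmed manifest; a real one has many more fields.
MANIFEST_JSON = """
{
  "nodes": {
    "model.shop.stg_orders": {
      "resource_type": "model",
      "depends_on": {"nodes": ["source.shop.raw.orders"]}
    },
    "model.shop.stg_customers": {
      "resource_type": "model",
      "depends_on": {"nodes": []}
    },
    "model.shop.customer_orders": {
      "resource_type": "model",
      "depends_on": {"nodes": ["model.shop.stg_orders", "model.shop.stg_customers"]}
    },
    "test.shop.not_null_orders_id": {
      "resource_type": "test",
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    }
  }
}
"""


def model_dependencies(manifest):
    """Map each dbt model to the upstream models it depends on."""
    models = {
        node_id
        for node_id, node in manifest["nodes"].items()
        if node["resource_type"] == "model"
    }
    return {
        node_id: sorted(
            dep
            for dep in manifest["nodes"][node_id]["depends_on"]["nodes"]
            if dep in models  # drop sources, tests, and other non-model nodes
        )
        for node_id in sorted(models)
    }


def task_id(node_id):
    """Hypothetical naming scheme: one task per model, named after the model."""
    return "dbt_run_" + node_id.split(".")[-1]


deps = model_dependencies(json.loads(MANIFEST_JSON))
for node_id, upstream in deps.items():
    for parent in upstream:
        # In an Airflow DAG file this edge would become: parent_task >> child_task
        print(f"{task_id(parent)} >> {task_id(node_id)}")
```

In a real DAG file, each printed edge would connect two Airflow tasks that each run dbt for a single model, so a failed model can be retried without rerunning the whole project.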
Tools
askgit – SQL is a native language for many data engineers. On the other hand, git log is extremely complicated: its documentation takes up many pages! After selecting the correct way to build a log, you need to use bash magic to extract data. For example, what if I want to find the first commit by each committer inside the Kotlin API for Apache Spark?
With Bash, it will look like this:
declare -A committers
IFS=';'
while read -ra ITEM; do
  [ "${committers[${ITEM[1]}]+abc}" ] || committers[${ITEM[1]}]=${ITEM[0]}
done < <(git log --full-history --reverse "--format=format:%at;%ae")
for x in "${!committers[@]}"; do
  printf "%s:\t%s\n" "$x" "$(date -d @${committers[$x]})"
done
The output of such a script will not be particularly readable:
vitaly.khudobakhshov@gmail.com: Sun Nov 17 10:05:08 AM MSK 2019
209830+plastic-karma@users.noreply.github.com: Thu Sep 10 08:58:37 AM MSK 2020
asm0dey@jetbrains.com: Sun Nov 17 04:45:37 PM MSK 2019
nonpool@163.com: Thu Jul 15 12:33:52 AM MSK 2021
49699333+dependabot[bot]@users.noreply.github.com: Fri Jun 19 09:35:47 AM MSK 2020
j.j.r.rensen@student.tue.nl: Tue Dec 1 01:27:32 AM MSK 2020
ugai@jp.fujitsu.com: Fri Aug 21 10:16:15 AM MSK 2020
pavel.finkelshteyn@jetbrains.com: Fri Mar 20 09:39:55 PM MSK 2020
felix.engl@hotmail.com: Thu Oct 8 11:23:11 AM MSK 2020
gunnar.schulze@gmail.com: Tue Sep 22 10:04:43 AM MSK 2020
kafooster@gmail.com: Tue Jun 2 09:11:19 PM MSK 2020
pavel.finkelshtein@gmail.com: Sat Jun 6 02:04:55 AM MSK 2020
408698+cra@users.noreply.github.com: Mon Jun 1 05:56:17 PM MSK 2020
Here’s what the command looks like with askgit:
askgit 'SELECT author_email AS email, min(author_when) AS date FROM commits GROUP BY author_email ORDER BY date'
And the output will be nice:
+---------------------------------------------------+---------------------------+
| EMAIL                                             | DATE                      |
+---------------------------------------------------+---------------------------+
| vitaly.khudobakhshov@gmail.com                    | 2019-11-17T10:05:08+03:00 |
+---------------------------------------------------+---------------------------+
| asm0dey@jetbrains.com                             | 2019-11-17T16:45:37+03:00 |
+---------------------------------------------------+---------------------------+
| pavel.finkelshteyn@jetbrains.com                  | 2020-03-20T18:39:55Z      |
+---------------------------------------------------+---------------------------+
| 408698+cra@users.noreply.github.com               | 2020-06-01T17:56:17+03:00 |
+---------------------------------------------------+---------------------------+
| kafooster@gmail.com                               | 2020-06-02T14:11:19-04:00 |
+---------------------------------------------------+---------------------------+
| pavel.finkelshtein@gmail.com                      | 2020-06-06T02:04:55+03:00 |
+---------------------------------------------------+---------------------------+
| 49699333+dependabot[bot]@users.noreply.github.com | 2020-06-19T09:35:47+03:00 |
+---------------------------------------------------+---------------------------+
| ugai@jp.fujitsu.com                               | 2020-08-21T16:16:15+09:00 |
+---------------------------------------------------+---------------------------+
| 209830+plastic-karma@users.noreply.github.com     | 2020-09-09T22:58:37-07:00 |
+---------------------------------------------------+---------------------------+
| gunnar.schulze@gmail.com                          | 2020-09-22T09:04:43+02:00 |
+---------------------------------------------------+---------------------------+
| felix.engl@hotmail.com                            | 2020-10-08T10:23:11+02:00 |
+---------------------------------------------------+---------------------------+
| j.j.r.rensen@student.tue.nl                       | 2020-11-30T23:27:32+01:00 |
+---------------------------------------------------+---------------------------+
| nonpool@163.com                                   | 2021-07-15T05:33:52+08:00 |
+---------------------------------------------------+---------------------------+
Which script is more readable? Even for those who know shell scripting very well, I bet it’s still the second one. Which output is better? Definitely the second! So, do we still really need git log? Maybe it’s a good time to switch to a specialized tool!
That wraps up September’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!