Data Engineering Annotated Monthly – October 2021
The lockdowns are back again in Moscow, which means that conferences are again out of the question for me for some time. The good news is that I had time to put together this new installment of our Data Engineering Annotated! Hi, I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on Twitter and suggest a topic, link, or anything else you want to see. BTW, if you would prefer to get this in your email, you can subscribe to the newsletter here.
A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.
Spark Release 3.2.0 – We’ll start with the big news first. Apache Spark® 3.2.0 has been released and there are a load of changes, including ANSI SQL support, a Pandas API layer over PySpark, and lots and lots of other things. Also, this release is compatible with Scala 2.13 – the latest stable release line of the language before Scala 3.
Airflow 2.2.0 – One of the most popular orchestrators released a new version in October, too. Its most awaited feature is Timetables. If you’re wondering what Timetables are, check out the Articles section below for a nice description. Other notable changes include pre/post task hooks and better support for Kerberos.
Apache Ranger 2.2.0 released – Somehow there were no announcements, but yours truly managed to dig one out from the deepest, darkest depths it was buried in. This release is huge! There are 5 pages of fixes and improvements, and many of them look very technical and not really user-facing. But they are! For example, Ranger now supports groups with 300K+ members. If you are curious about what Apache Ranger is – it’s a framework for managing security across the whole Hadoop platform.
Apache Flink 1.14.0 – This release of Flink is also humongous. They’ve removed the legacy SQL engine, added an Apache Pulsar connector (I just love how the Apache ecosystem works together), and implemented something called “hybrid source” – it lets you combine multiple sources into one that reads the underlying sources one after another and exposes all their data as a single stream.
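To make the hybrid source idea a bit more tangible, here is a minimal sketch in plain Python. It is not the Flink API: `hybrid_source` and the two stand-in sources are hypothetical names I made up for illustration. It only shows the behavior as I understand it, which is reading each underlying source to the end, in order, and handing the result to the consumer as one stream.

```python
from itertools import chain
from typing import Iterable, Iterator


def hybrid_source(*sources: Iterable[str]) -> Iterator[str]:
    """Read each source to exhaustion, in order, and expose them as one stream.

    Hypothetical helper for illustration only; Flink's real hybrid source
    is a Java class with its own builder and switching logic.
    """
    return chain.from_iterable(sources)


# Hypothetical stand-ins: a bounded "historical" source and a "live" one.
historical_files = ["event-1", "event-2"]
live_stream = iter(["event-3", "event-4"])

# The consumer sees a single stream and does not care where a record came from.
events = list(hybrid_source(historical_files, live_stream))
print(events)
```

A typical use case would be replaying old data from files first and then continuing with fresh data from a message queue, without the downstream job noticing the switch.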
Apache Beam 2.33.0 – Have you ever heard anyone (like me, for example) say that there is almost no place for Go in data engineering? Well, that is no longer true. The Go SDK is now official for Apache Beam, together with Go Modules for dependency management!
Data engineering technologies are evolving every day. This section is about updates that are still in the works and that you may want to keep an eye on.
Kafka: Allow configuring num.network.threads per listener – Sometimes you find yourself in a situation with Kafka brokers where some listeners are less active than others (and are in some sense more equal than others). But every listener gets the same fixed number of network threads in its pool, so the threads of the quiet listeners waste resources that could be utilized in a better way. This KIP proposes a way to configure the thread count for each listener individually.
Flink: Extend unified Sink interface to support small file compaction – When you use Flink and write data to files, usually everything is fine, but sometimes – just sometimes – you end up with tons of tiny files. And this may be a big problem! Reading lots of small files takes more time than reading one big one. Flink isn’t aware of this – yet. If this FLIP is implemented, it could one day lead to support for small file compaction.
Spark: Constraint Propagation code causes OOM issues or increasing compilation time to hours – Under certain conditions, constraint propagation code may be suboptimal or even cause an application to crash with an OutOfMemoryError. This proposal promises to eliminate all possible occurrences of this issue.
This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.
Scaling with Presto on Spark – An exciting tale of how nicely Presto and Spark can work together to achieve better results on batch workloads. Everything starts with Presto’s MPP architecture and how Spark can augment it to become more batch-ready.
The Future of the Data Engineer – As the author describes it: “Is the data engineer still the ‘worst seat at the table’? Thoughts on the past, present, and future of tooling, processes, and culture in our industry.” What else can I even add? This is an interesting read, especially when you live in a country where people prefer on-premises solutions to cloud ones.
Airflow Timetable: Schedule your DAGs like never before – A promising post about Apache Airflow Timetables. Some say that the semantics of start_date and execution_date are not very transparent or understandable, and therefore not very maintainable. Timetables should solve this.
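If you have never been bitten by execution_date, here is a small sketch of why people find it confusing. It is plain Python, not Airflow code: `daily_run` is a hypothetical helper, and I am assuming a simple daily schedule. The point is that a run is labeled with the start of the period it covers, but it can only begin once that period is over.

```python
from datetime import datetime, timedelta


def daily_run(execution_date: datetime) -> dict:
    """Describe one run of a daily schedule (hypothetical helper, not Airflow API)."""
    interval = timedelta(days=1)
    return {
        # The label of the run is the START of the period it processes.
        "execution_date": execution_date,
        "data_interval_start": execution_date,
        "data_interval_end": execution_date + interval,
        # The run can begin only after the whole period has passed.
        "actually_starts_at": execution_date + interval,
    }


run = daily_run(datetime(2021, 10, 1))
print(run["execution_date"].date(), "->", run["actually_starts_at"].date())
# prints: 2021-10-01 -> 2021-10-02
```

So the run "for October 1" starts on October 2, which is exactly the kind of thing that explicit data intervals are meant to make obvious.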
Processing billions of events in real time at Twitter – OK, frankly, processing billions of events in real time is something I dreamed of doing when I became a data engineer. And of course, this task still looks extremely appealing to me. Well, for some it’s a dream and for some it’s reality. I will say more: they did more with a simpler architecture! Thrilled? Look into Twitter’s post!
DuckDB – We all know what SQLite is. It’s an awesome embedded database that is both powerful and simple. It has integrations with all the major languages and even has support for Python UDFs. But it has one shortcoming: it is not tailored to our analytical and engineering needs. Infinitely complex queries, mathematical functions, and Parquet support – that’s what DuckDB gives you out of the box. “Parquet support?” you ask. Yes! One simple expression and data is loaded from Parquet to DuckDB (either in-memory or in-file – both will work), and you have all the power of analytical SQL in your hands!
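To make the SQLite side of that comparison concrete, here is a minimal sketch of a Python UDF using the standard-library sqlite3 module. The table, the data, and the `fahrenheit` function are made up for illustration.

```python
import sqlite3


def fahrenheit(celsius: float) -> float:
    """Plain Python function that we expose to SQL as a UDF."""
    return celsius * 9 / 5 + 32


# In-memory database, so nothing is written to disk.
con = sqlite3.connect(":memory:")

# Register the Python function under the SQL name "fahrenheit" with 1 argument.
con.create_function("fahrenheit", 1, fahrenheit)

con.execute("CREATE TABLE readings (city TEXT, celsius REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("Moscow", 10.0), ("Berlin", 20.0)],
)

# The UDF is now callable from SQL like any built-in function.
rows = con.execute(
    "SELECT city, fahrenheit(celsius) FROM readings ORDER BY city"
).fetchall()
print(rows)
```

This is handy for small jobs, but every row goes through the Python function one at a time, which is part of why an engine built for analytics is attractive for heavier workloads.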
That wraps up October’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at firstname.lastname@example.org or send a DM to my personal Twitter account. You can also get in touch with our team at email@example.com. We’d love to know about any other interesting data engineering articles you come across!