{"id":246941,"date":"2022-05-19T12:08:50","date_gmt":"2022-05-19T11:08:50","guid":{"rendered":"https:\/\/blog.jetbrains.com\/?post_type=big-data-tools&#038;p=246941"},"modified":"2022-10-26T13:38:19","modified_gmt":"2022-10-26T12:38:19","slug":"data-engineering-annotated-monthly-april-2022","status":"publish","type":"big-data-tools","link":"https:\/\/blog.jetbrains.com\/zh-hans\/big-data-tools\/2022\/05\/19\/data-engineering-annotated-monthly-april-2022","title":{"rendered":"Data Engineering Annotated Monthly \u2013 April 2022"},"content":{"rendered":"\n<p>Long time no see! Sorry about the silence, but luckily we\u2019re back.<\/p>\n\n\n\n<p>Hi, I&#8217;m <a href=\"https:\/\/blog.jetbrains.com\/author\/pavel-finkelshteyn-jetbrains-com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Pasha Finkelshteyn<\/a>, and I\u2019ll be your guide through this month\u2019s news. I\u2019ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on <a href=\"https:\/\/twitter.com\/asm0di0\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter<\/a> and suggest a topic, link, or anything else you want to see. And please feel free to <a href=\"https:\/\/www.jetbrains.com\/resources\/newsletters\/\" target=\"_blank\" rel=\"noreferrer noopener\">subscribe to this newsletter<\/a> to get it in your email inbox every month.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"News\">News<\/h1>\n\n\n\n<p>A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here\u2019s what\u2019s happening in the world of data engineering right now.<\/p>\n\n\n\n<p id=\"Airflow-2.3.0\"><a href=\"https:\/\/github.com\/apache\/airflow\/releases\/tag\/2.3.0\" target=\"_blank\" rel=\"noreferrer noopener\">Airflow 2.3.0<\/a> \u2013 This popular orchestrator got a new release. 
Some say it&#8217;s &#8220;almost 3.0&#8221;, and yes, it does bring a lot of changes. Take the new <a href=\"https:\/\/www.astronomer.io\/guides\/dynamic-tasks\/\" target=\"_blank\" rel=\"noreferrer noopener\">dynamic tasks<\/a>, for example. Based on the &#8220;map-reduce&#8221; paradigm, they allow you to compute the next tasks from the current state \u2013 a very useful feature, which incidentally has been available in Luigi for a while. Additionally, the Tree view has been replaced by the Grid view, which, in my opinion, is much more informative.<\/p>\n\n\n\n<p id=\"hudi\"><a href=\"https:\/\/github.com\/apache\/hudi\/releases\/tag\/release-0.11.0\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Hudi 0.11.0<\/a> \u2013 This release of the well-known data lake has added many interesting changes. First, they\u2019ve implemented asynchronous indexing. Second, they\u2019ve significantly improved Spark integration. Third, Google BigQuery now has support for Hudi as an external source. I could go on and on. Now\u2019s a good time to update your Hudi!<\/p>\n\n\n\n<p id=\"YuniKorn\"><a href=\"https:\/\/yunikorn.apache.org\/release-announce\/1.0.0\" target=\"_blank\" rel=\"noreferrer noopener\">YuniKorn 1.0.0<\/a> \u2013 If you&#8217;ve been anxiously waiting for Kubernetes to come to data engineering, your wishes have been granted. A top-level ASF project, YuniKorn 1.0 is a scheduler targeting big data and ML workflows, and of course, it is cloud-native.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/incubator-kyuubi\/releases\/tag\/v1.5.1-incubating-rc0\" target=\"_blank\" rel=\"noreferrer noopener\">Kyuubi 1.5.1<\/a> \u2013 Kyuubi is a JDBC server built over Apache Spark, but as of version 1.5.0, it supports two more SQL engines, Flink and Trino\/Presto. 
The team has also added the ability to run Scala for the SparkSQL engine.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/pulsar\/releases\/tag\/v2.10.0\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Pulsar 2.10.0<\/a> \u2013 No fewer than 14 PIPs (Pulsar Improvement Proposals) were implemented in this version! Notably, cluster failover is now supported on the client side. Read more about Pulsar 2.10.0 <a href=\"https:\/\/github.com\/apache\/pulsar\/issues\/13315\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/rocketmq-streams\/releases\/tag\/rocketmq-streams-1.0.1-preview\" target=\"_blank\" rel=\"noreferrer noopener\">RocketMQ Streams 1.0.1 preview<\/a> \u2013 I&#8217;ve mentioned RocketMQ before, in the <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/2021\/12\/07\/data-engineering-annotated-monthly-november-2021\/\" target=\"_blank\" rel=\"noreferrer noopener\">November Annotated<\/a>, but here\u2019s a good reason to write about it again. Virtually every technology seems to be adding some kind of streaming API these days. Kafka was the first, and soon enough, everybody was trying to grab their own share of the market. In the case of RocketMQ, their attempt is very interesting because, unlike Kafka and Pulsar, RocketMQ is closer to traditional MQs like ActiveMQ (which isn\u2019t really surprising, seeing how it&#8217;s based on ActiveMQ).<\/p>\n\n\n\n<p><a href=\"https:\/\/flink.apache.org\/news\/2022\/05\/05\/1.15-announcement.html\" target=\"_blank\" rel=\"noreferrer noopener\">Flink 1.15.0<\/a> \u2013 What I like about this release of Flink, a top framework for streaming data processing, is that it comes with quality documentation. The docs clarify the semantics of checkpoints and savepoints, making them much easier to understand. 
The release isn\u2019t short on technical improvements, either, such as elastic scaling with reactive mode and an adaptive scheduler.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"Future-improvements\">Future improvements<\/h1>\n\n\n\n<p>Data engineering tools are evolving every day. This section covers updates that are in the works and that you may want to keep an eye on.<\/p>\n\n\n\n<p><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/KAFKA\/KIP-813%3A+Shareable+State+Stores\" target=\"_blank\" rel=\"noreferrer noopener\">Kafka: Shareable State Stores<\/a> \u2013 This improvement in Kafka looks very interesting. The authors promise that under certain conditions, it will be possible to share data between topics without needing to copy it across nodes. However, this improvement depends on the implementation of tiered storage support, which hasn\u2019t yet landed.<\/p>\n\n\n\n<p><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/KAFKA\/KIP-808%3A+Add+support+for+different+unix+precisions+in+TimestampConverter+SMT\" target=\"_blank\" rel=\"noreferrer noopener\">Kafka: Add support for different unix precisions in TimestampConverter SMT<\/a> \u2013 Have you ever been in a situation where the timestamp is of type Long and you can&#8217;t tell what it represents? This is an inherent issue with the Unix timestamp. Some systems think that it should be in milliseconds, and some think that it should be in seconds. This KIP promises to resolve the issue by making the TimestampConverter class support the precision of Unix timestamps.<\/p>\n\n\n\n<p><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/FLINK\/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator\" target=\"_blank\" rel=\"noreferrer noopener\">Flink: Introduce Flink Kubernetes Operator<\/a> \u2013 Since Kubernetes is dominating virtually everywhere, other data engineering tools are having to catch up and introduce k8s integration. 
It\u2019s true that there is a scheduler for data engineering for k8s \u2013 YuniKorn \u2013 but some would prefer to run Flink ad hoc, and that requires these tools to implement the k8s operator.<\/p>\n\n\n\n<p><a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-39088\" target=\"_blank\" rel=\"noreferrer noopener\">Spark: Add support for forwarding Spark History requests to a live running driver when present<\/a> \u2013 While a Spark job is running, we can see it on a Spark history server, but that doesn&#8217;t provide us with full and up-to-date information. This enhancement, which has already been implemented, automatically redirects us from a history server to the live driver, where we can find the complete information. Neat!<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"Articles\">Articles<\/h1>\n\n\n\n<p>This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.<\/p>\n\n\n\n<p><a href=\"https:\/\/neo4j.com\/blog\/analyzing-panama-papers-neo4j\/\" target=\"_blank\" rel=\"noreferrer noopener\">Analyzing the Panama Papers With Neo4j: Data Models, Queries, and More<\/a> \u2013 Graph databases are extremely useful, but few of us have a lot of experience with them. Most of us have some difficulty identifying whether problems could be better solved with the help of a graph database. In addition, typical examples of using graph databases are oversimplified. That is why this example from the creators of Neo4j was so insightful for me. 
It provides a great view into the different aspects of using graph databases in general, and it covers the specifics of Neo4j in detail, as well.<\/p>\n\n\n\n<p><a href=\"https:\/\/towardsdatascience.com\/scalable-efficient-big-data-analytics-machine-learning-pipeline-architecture-on-cloud-4d59efc092b5\" target=\"_blank\" rel=\"noreferrer noopener\">Architecture for High-Throughput Low-Latency Big Data Pipeline on Cloud<\/a> \u2013 The title of the article speaks for itself. The premise might sound familiar, but this isn\u2019t some boring repetition of what we all know already. There\u2019s at least one interesting twist that goes like this: &#8220;A data pipeline has five stages grouped into three heads.&#8221; If that sounds intriguing, read the article to find out more.<\/p>\n\n\n\n<p><a href=\"https:\/\/bytearray.io\/corrections-in-data-lakehouse-table-format-comparisons-b72eb63ece32\" target=\"_blank\" rel=\"noreferrer noopener\">Corrections in data lakehouse table format comparisons<\/a> \u2013 Quasi-mutable (a.k.a. data lake) formats are improving almost at the speed of thought. Sooner or later, we data engineers will have to choose which one to make our standard! This live document is a big and growing set of corrections to the original and very well-known <a href=\"https:\/\/www.dremio.com\/subsurface\/comparison-of-data-lake-table-formats-iceberg-hudi-and-delta-lake\/\" target=\"_blank\" rel=\"noopener\">comparison by Dremio<\/a>. Check back often if you want to keep up with what&#8217;s new in the world of Hudi, Iceberg, and DeltaLake.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity\"\/>\n\n\n\n<p>That wraps up April\u2019s Data Engineering Annotated. Follow JetBrains Big Data Tools on <a href=\"https:\/\/twitter.com\/BigDataTools\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter<\/a> and subscribe to our <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/\" target=\"_blank\" rel=\"noreferrer noopener\">blog<\/a> for more news! 
You can always reach me, Pasha Finkelshteyn, at <a href=\"mailto:asm0dey@jetbrains.com\">asm0dey@jetbrains.com<\/a> or send a DM to <a href=\"https:\/\/twitter.com\/asm0di0\" target=\"_blank\" rel=\"noopener\">my personal Twitter<\/a> account. You can also get in touch with our team at <a href=\"mailto:big-data-tools@jetbrains.com\" target=\"_blank\" rel=\"noreferrer noopener\">big-data-tools@jetbrains.com<\/a>. We\u2019d love to know about any other interesting data engineering articles you come across!<\/p>\n\n\n\n<p><\/p>\n","protected":false},"author":1234,"featured_media":246944,"comment_status":"closed","ping_status":"closed","template":"","categories":[],"tags":[2319,589,6586,6749,1731,76,91],"cross-post-tag":[],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/246941"}],"collection":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools"}],"about":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/types\/big-data-tools"}],"author":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/users\/1234"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/comments?post=246941"}],"version-history":[{"count":4,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/246941\/revisions"}],"predecessor-version":[{"id":291568,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/246941\/revisions\/291568"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media\/246944"}],"wp:attachment":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media?parent=246941"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/categories?post=246941"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/w
p-json\/wp\/v2\/tags?post=246941"},{"taxonomy":"cross-post-tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/cross-post-tag?post=246941"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}