{"id":254394,"date":"2022-06-08T10:00:00","date_gmt":"2022-06-08T09:00:00","guid":{"rendered":"https:\/\/blog.jetbrains.com\/?post_type=big-data-tools&#038;p=254394"},"modified":"2022-11-30T18:58:35","modified_gmt":"2022-11-30T17:58:35","slug":"data-engineering-annotated-monthly-may-2022","status":"publish","type":"big-data-tools","link":"https:\/\/blog.jetbrains.com\/zh-hans\/big-data-tools\/2022\/06\/08\/data-engineering-annotated-monthly-may-2022","title":{"rendered":"Data Engineering Annotated Monthly \u2013 May 2022"},"content":{"rendered":"\n<p>It&#8217;s the start of June. That means it\u2019s time to start taking summer vacations and enjoying some fresh juice alongside your fresh news! Hi, I&#8217;m <a href=\"https:\/\/blog.jetbrains.com\/author\/pavel-finkelshteyn-jetbrains-com\/\">Pasha Finkelshteyn<\/a>, and I\u2019ll be your guide through this month\u2019s news. I\u2019ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on <a href=\"https:\/\/twitter.com\/asm0di0\" target=\"_blank\" rel=\"noopener\">Twitter<\/a> and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to receive this information as an email, you can subscribe to the newsletter <a href=\"https:\/\/www.jetbrains.com\/resources\/newsletters\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">News<\/h1>\n\n\n\n<p>A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here\u2019s what\u2019s happening in the world of data engineering right now.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/datahub-project\/datahub\/releases\/tag\/v0.8.36\" target=\"_blank\" rel=\"noopener\">DataHub 0.8.36<\/a> \u2013 Metadata management is a big and complicated topic. There are several solutions. 
Some of them are free, some of them are paid, but none of them are particularly easy to use. I\u2019ve had some experience with Apache Atlas, and even with the help of my colleagues, I wasn\u2019t able to make it do what I wanted it to. On top of that, it&#8217;s a part of the Hadoop platform, which created additional work that we otherwise would not have had to do. DataHub is a completely independent product by LinkedIn, and the folks there definitely know what metadata is and how important it is. If you haven\u2019t found your perfect metadata management system just yet, maybe it&#8217;s time to try DataHub! This new release brings exciting features like support for Apache Iceberg!<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/linkedin\/feathr\/releases\/tag\/v0.4.0\" target=\"_blank\" rel=\"noopener\">Feathr 0.4.0<\/a> \u2013 This feature store by LinkedIn is developing quickly. I know that many companies have not been able to find a suitable feature store on the market and have had to write their own. This task is not easy, and it takes a very long time and significant engineering resources to do properly. Meanwhile, it looks like LinkedIn has the necessary resources and is even ready to open up its solution to external contributors! The most notable change in the latest release is support for streaming, which means you can now ingest data from streaming sources.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/pulsar-manager\/releases\/tag\/v0.3.0\" target=\"_blank\" rel=\"noopener\">Pulsar Manager 0.3.0<\/a> \u2013 Lots of enterprise systems lack a nice management interface. They need to be configured with configuration files or via the command line. I am an old-school guy. I adore the command line, vim, and so on, but I also understand that sometimes configuration is such a complex task that it would really be easier to do once with a UI and then just not have to think about it again. 
Apache Pulsar takes a step in this direction and adds an official management UI! In this release, there are some improvements to the dashboard, as well as several bug fixes.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/bookkeeper\/releases\/tag\/release-4.15.0\" target=\"_blank\" rel=\"noopener\">BookKeeper 4.15.0<\/a> \u2013 And while we&#8217;re on the subject of Pulsar, we should not forget to mention the engine behind Pulsar: BookKeeper. BookKeeper is usually perceived exclusively as a backend for Pulsar, but the truth is that nothing can stop you from using it in your own systems. BookKeeper\u2019s team presents it as a &#8220;fault-tolerant and low-latency storage service optimized for append-only workloads&#8221;, so if you need to store something in a distributed manner, you may not need a traditional database. Perhaps BookKeeper would suit your needs better! In the latest version, <a href=\"https:\/\/bookkeeper.apache.org\/bps\/BP-46-run-without-journal\/\" target=\"_blank\" rel=\"noopener\">BP-46: Running without a journal<\/a> has been implemented, along with several other features.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/impala\/releases\/tag\/4.1.0\" target=\"_blank\" rel=\"noopener\">Impala 4.1.0<\/a> \u2013 While almost all data engineering SQL query engines are written in JVM languages, Impala is written in C++. This means that the Impala authors had to go above and beyond to integrate it with different Java\/Python-oriented systems. And yet it is still compatible with different clouds, storage formats, and storage engines (including <a href=\"https:\/\/kudu.apache.org\/\" target=\"_blank\" rel=\"noopener\">Kudu<\/a>, <a href=\"https:\/\/ozone.apache.org\/\" target=\"_blank\" rel=\"noopener\">Ozone<\/a>, and many others). It shouldn\u2019t come as a surprise that Cloudera managed to achieve this, as they know how to create on-premises data engineering products. 
I don&#8217;t know how this happened, but there is not even an official changelog yet at the time of writing. However, you can find a diff with the 4.0.0 version <a href=\"https:\/\/github.com\/apache\/impala\/compare\/4.0.0...4.1.0\" target=\"_blank\" rel=\"noopener\">on GitHub<\/a>.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/facebook\/rocksdb\/releases\/tag\/v7.2.2\" target=\"_blank\" rel=\"noopener\">RocksDB 7.2.2<\/a> \u2013 We often forget that certain data engineering products only work so well because they have other powerful tools under the hood. For proof of this, look no further than systems like Flink and Camunda, which rely on RocksDB. RocksDB is a storage engine written as a C++ library. It exposes a key\/value interface, where keys and values are arbitrary byte streams. It can store data virtually anywhere, for example in memory or on any kind of permanent storage device. And yes, it pays attention to correctness and efficiency when storing data.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Future improvements<\/h1>\n\n\n\n<p>Data engineering tools are evolving every day. This section covers upcoming updates to technologies that you may want to keep an eye on.<\/p>\n\n\n\n<p><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/KAFKA\/KIP-833%3A+Mark+KRaft+as+Production+Ready\" target=\"_blank\" rel=\"noopener\">Kafka: Mark KRaft as Production Ready<\/a> \u2013 One of the most interesting changes to Kafka in recent years is that it can now work without ZooKeeper. This is possible thanks to KRaft, a <a href=\"https:\/\/raft.github.io\" target=\"_blank\" rel=\"noopener\">Raft<\/a>-based consensus protocol designed specifically for the needs of Kafka. 
This Kafka Improvement Proposal&#8217;s goal is to declare KRaft production-ready and to make support and operations related to Kafka clusters much easier.<\/p>\n\n\n\n<p><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/FLINK\/FLIP-214+Support+Advanced+Function+DDL\" target=\"_blank\" rel=\"noopener\">Flink: Support Advanced Function DDL<\/a> \u2013 SQL query engines like Hive and Spark have supported external functions in SQL for quite some time. This allows developers and data engineers to enrich traditional SQL with their own extensions, which can be useful when you need to perform business-specific operations inside a regular query. Hopefully with the implementation of this Flink Improvement Proposal, Flink will support them too.<\/p>\n\n\n\n<p><a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-39312\" target=\"_blank\" rel=\"noopener\">Spark: Use Parquet in predicate for Spark In filter<\/a> \u2013 Though it is usually hidden behind the scenes, one of the most popular storage formats \u2013 Parquet \u2013 is evolving too. At this point in time, filters have been implemented on the storage level in Parquet, and Spark needs to catch up by adding support for native filtering. This improvement can make our queries dramatically faster in some cases!<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Articles<\/h1>\n\n\n\n<p>This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.<\/p>\n\n\n\n<p><a href=\"https:\/\/rockset.com\/blog\/rocksdb-is-eating-the-database-world\/\" target=\"_blank\" rel=\"noopener\">RocksDB Is Eating the Database World<\/a> \u2013 Continuing on the topic of RocksDB, here is an older, but still very interesting, article on what RocksDB is and how it works. 
It also provides some insight into why its popularity is growing rapidly.<\/p>\n\n\n\n<p><a href=\"https:\/\/martinfowler.com\/articles\/patterns-of-distributed-systems\/replicated-log.html\" target=\"_blank\" rel=\"noopener\">Replicated Log<\/a> \u2013 Here\u2019s a relatively long and detailed article about replicated logs. A replicated log is a way to synchronize data among nodes in a distributed system. There are multiple ways to implement a replicated log, and most of them rely on consensus protocols such as Paxos and Raft.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Events<\/h1>\n\n\n\n<p><a href=\"https:\/\/currentevent.io\/\" target=\"_blank\" rel=\"noopener\">Current 2022: The Next Generation of Kafka Summit<\/a> \u2013 This conference, the most popular one related to Kafka, is organized by Confluent, one of Kafka\u2019s main maintainers. Of course, the main topic is data streaming.<\/p>\n\n\n\n<p><a href=\"https:\/\/bigdataldn.com\/\" target=\"_blank\" rel=\"noopener\">Big Data Event: London<\/a> \u2013 Thousands of attendees are expected to participate in this big data event in London. The organizers have already booked a large number of speakers from a wide range of companies, including the widely known Aerospike, Stack Overflow, and Snowflake.<\/p>\n\n\n\n<p>That wraps up May\u2019s Data Engineering Annotated. Follow JetBrains Big Data Tools on <a href=\"https:\/\/twitter.com\/BigDataTools\" target=\"_blank\" rel=\"noopener\">Twitter<\/a> and subscribe to our <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/\">blog<\/a> for more news! You can always reach me, Pasha Finkelshteyn, at <a href=\"mailto:asm0dey@jetbrains.com\">asm0dey@jetbrains.com<\/a> or send a DM to <a href=\"https:\/\/twitter.com\/asm0di0\" target=\"_blank\" rel=\"noopener\">my personal Twitter<\/a> account. You can also get in touch with our team at <a href=\"mailto:big-data-tools@jetbrains.com\">big-data-tools@jetbrains.com<\/a>. 
We\u2019d love to know about any other interesting data engineering articles you come across!<\/p>\n","protected":false},"author":1234,"featured_media":254395,"comment_status":"closed","ping_status":"closed","template":"","categories":[],"tags":[7020,6749,7017,7018,7023,7021,6597,7019,7022],"cross-post-tag":[],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/254394"}],"collection":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools"}],"about":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/types\/big-data-tools"}],"author":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/users\/1234"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/comments?post=254394"}],"version-history":[{"count":4,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/254394\/revisions"}],"predecessor-version":[{"id":291567,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/254394\/revisions\/291567"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media\/254395"}],"wp:attachment":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media?parent=254394"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/categories?post=254394"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/tags?post=254394"},{"taxonomy":"cross-post-tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/cross-post-tag?post=254394"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}