{"id":296090,"date":"2022-11-09T11:33:44","date_gmt":"2022-11-09T10:33:44","guid":{"rendered":"https:\/\/blog.jetbrains.com\/?post_type=big-data-tools&#038;p=296090"},"modified":"2022-11-09T21:09:00","modified_gmt":"2022-11-09T20:09:00","slug":"data-engineering-annotated-monthly-october-2022","status":"publish","type":"big-data-tools","link":"https:\/\/blog.jetbrains.com\/zh-hans\/big-data-tools\/2022\/11\/09\/data-engineering-annotated-monthly-october-2022","title":{"rendered":"Data Engineering Annotated Monthly \u2013 October 2022"},"content":{"rendered":"\n<p>Greetings from sunny Berlin! Yes, it\u2019s still 20+ \u00b0C here \u2013 perfect conditions for sitting down on your balcony with the latest issue of your favorite Annotated! I&#8217;m <a href=\"https:\/\/blog.jetbrains.com\/author\/pavel-finkelshteyn-jetbrains-com\/\">Pasha Finkelshteyn<\/a>, and I\u2019ll be your guide through this month\u2019s news. I\u2019ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, hit me up on <a href=\"https:\/\/twitter.com\/asm0di0\" target=\"_blank\" rel=\"noopener\">Twitter<\/a> and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to receive this information as an email, you can subscribe to the newsletter <a href=\"https:\/\/www.jetbrains.com\/resources\/newsletters\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">News<\/h1>\n\n\n\n<p>A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here\u2019s what\u2019s happening in the world of data engineering right now.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/doris\/releases\/tag\/1.1.3-rc02\" target=\"_blank\" rel=\"noopener\">Apache Doris 1.1.3<\/a> \u2013 Here\u2019s another interesting database for you. 
We aren\u2019t aware of many MPP databases, and few of them live under the motley umbrella of the Apache Software Foundation. Doris is built specifically for ad-hoc queries, report analysis, and other similar tasks. For example, take a look at this picture from Doris\u2019 site:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1600\" height=\"631\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2022\/11\/image-31.png\" alt=\"\" class=\"wp-image-296091\"\/><figcaption>Typical usage pattern of Apache Doris<\/figcaption><\/figure>\n\n\n\n<p>This looks like an excellent candidate for use as your next DWH, doesn\u2019t it?<\/p>\n\n\n\n<p>One of the great things about ASF projects is that they usually work nicely together, and this is no exception. For example, the current 1.1.3 release supports Apache Parquet as an output file format.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/age\/releases\/tag\/v1.1.0-rc0\" target=\"_blank\" rel=\"noopener\">Apache AGE 1.1.0<\/a> \u2013 Sometimes, we data engineers do work that doesn\u2019t deal directly with big data. Sometimes our job is just to ensure things are designed correctly, which can require us to use familiar tools in atypical ways. Take, for example, Postgres. It\u2019s one of the most popular databases; it is extensible, has a good enough planner, and is tunable. Some extensions for it are fairly well-known. For example, <code>ltree<\/code> is a popular extension from postgres-contrib that provides a special data type for representing tree structures conveniently. But today, I want to highlight Apache AGE, an extension that makes it possible to use Postgres as a graph database. 
The query language is a mix of traditional SQL and <a href=\"https:\/\/neo4j.com\/developer\/cypher\/\" target=\"_blank\" rel=\"noopener\">Cypher<\/a>, which is, as far as I know, the most popular graph query language today.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/apache\/rocketmq\/releases\/tag\/rocketmq-all-5.0.0\" target=\"_blank\" rel=\"noopener\">RocketMQ 5.0.0<\/a> \u2013 I\u2019ve already mentioned Apache RocketMQ, a high-performance queue based on ActiveMQ, in previous installments of this series. This new major release is notable, however, because it introduces a handy new concept: logic queues. Currently, MessageQueue is coupled to the broker name, and the broker name is coupled to the number of presently active brokers. When there\u2019s a change, planned or unplanned, to the number of brokers, queue rebalancing begins. This rebalancing can take minutes, and the queue will not always be available during that time. That, in turn, can lead to significant degradation of the overall quality of service. Logic queues decouple queues from the number of nodes, so when the number of nodes changes, rebalancing takes significantly less time, if any at all.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.scylladb.com\/product\/release-notes\/scylladb-open-source-5-0-5\/\" target=\"_blank\" rel=\"noopener\">ScyllaDB 5.0.5<\/a> \u2013 I\u2019ve had my eye on ScyllaDB for a long time now, but I still haven\u2019t had a chance to share its progress. ScyllaDB is interesting because it\u2019s a drop-in replacement for Apache Cassandra. Many years ago, when Java seemed slow, and its JIT compiler was not as cool as it is today, some of the people working on the OSv operating system recognized that they could make many more optimizations in user space than they could in kernel space. 
One example of an application they targeted for improvement was Apache Cassandra, as it was powerful but slow\u2026 Fast forward seven years, and it looks like they achieved their goal and built a sustainable business as well! Among the most notable changes in ScyllaDB 5.0 is the implementation of (experimental) support for strongly consistent DDL, which is very important, especially when your schema changes, as it is likely to do.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Future changes<\/h1>\n\n\n\n<p>Data engineering tools are evolving every day. This section covers updates that are in the works for technologies you may want to keep an eye on.<\/p>\n\n\n\n<p><a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-40513\" target=\"_blank\" rel=\"noopener\">Docker Official Image for Spark<\/a> \u2013 A proposal to make Spark a Docker Official Image was recently approved. The Docker Official Image status is a way to indicate that an image is top-level and very important for the community. This type of image will usually be located not in some directory in Docker Hub but rather in the root (in this case, it should be <a href=\"https:\/\/hub.docker.com\/_\/spark\" target=\"_blank\" rel=\"noopener\">https:\/\/hub.docker.com\/_\/spark<\/a>). But of course, making Spark a Docker Official Image would not just entail small cosmetic changes. Docker Official Images are rebuilt proactively, so if you find yourself exposed to a new vulnerability that\u2019s already fixed in master, you can just download the latest version of the DOI and get back to safety. Additionally, DOIs are maintained by the Docker community, which usually means best practices are adhered to. 
Of course, some work still needs to be done to achieve this goal, but the proposal has already been approved and is halfway implemented.<\/p>\n\n\n\n<p><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/FLINK\/FLIP-250%3A+Support+Customized+Kubernetes+Schedulers+Proposal\" target=\"_blank\" rel=\"noopener\">Flink: Support Customized Kubernetes Schedulers<\/a> \u2013 This proposal is fascinating. On the one hand, the authors say that the current integration of Flink with k8s is already excellent. On the other hand, they say that resource scheduling is implemented only with a very narrow set of techniques, which is not enough. They postulate that different Flink workflows require different kinds of resource scheduling and describe four strategies for resource allocation and scheduling that could significantly improve the performance of Flink jobs.<\/p>\n\n\n\n<p><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/KAFKA\/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol\" target=\"_blank\" rel=\"noopener\">Kafka: The Next Generation of the Consumer Rebalance Protocol<\/a> \u2013 The current rebalance protocol in Kafka has existed for a long time. It\u2019s superior to the original ZooKeeper-based approach, but it is nevertheless already a legacy protocol. It relies on intelligent clients that know everything about other consumers in their consumer group and can act accordingly when the number of consumers in the group changes. The authors of the proposal state that the majority of bugs they have encountered in the protocol this year required fixes on the client side, which is a real problem because, for a variety of reasons, we often don\u2019t control the consumers\u2019 code. The coming change is absolutely massive and has lots of goals, but for me, the most crucial difference is that now the broker will decide how clients should be rebalanced, allowing it to dictate changes that are as small as possible. 
And for clients, the process should be completely transparent.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Articles<\/h1>\n\n\n\n<p>This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.<\/p>\n\n\n\n<p><a href=\"https:\/\/datatalks.club\/blog\/data-engineers-arent-plumbers.html\" target=\"_blank\" rel=\"noopener\">Data Engineers Aren&#8217;t Plumbers<\/a> \u2013 When people ask me what a data engineer is, I usually use the metaphor of a plumber. We work with pipes. We make sure they work correctly, that they aren\u2019t clogged, and so on, right? Well, it looks like Lu\u00eds Oliveira disagrees with me. He says that data engineering work is indeed related to pipes, but in a different way, and he uses a different pipe-related metaphor. Will I use it in my future explanations? Maybe. But I suspect his explanation will raise even more questions. Nevertheless, the analogy is beautiful.<\/p>\n\n\n\n<p><a href=\"https:\/\/towardsdatascience.com\/how-to-create-a-dbt-package-ca795d1dbe12\" target=\"_blank\" rel=\"noopener\">How to create a dbt package<\/a> \u2013 I like dbt and even wrote <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/tag\/dbt\/\">a couple of posts about it<\/a>. But this post brought something to my attention that I hadn\u2019t given a lot of thought to: dbt packages. dbt packages allow one project to depend on others, making dbt much more manageable when your warehouses are massive. 
Different teams can, for example, reuse the \u201ccommon\u201d package that contains all the basic models from your anchor-organized Data Warehouse.<\/p>\n\n\n\n<p><a href=\"https:\/\/towardsdatascience.com\/how-to-run-your-data-team-like-a-product-team-7efd8a0fd423\" target=\"_blank\" rel=\"noopener\">How to run your data team as a product team<\/a> \u2013 Sometimes, it can be tempting to conceive of data engineering as purely a technical enterprise, like plumbing. But the truth is that all data engineers are (or should be) working on a data product \u2013 a product that will solve particular problems confronting management, customers, or another party. And that\u2019s why we should think about our tasks both as engineers and as members of a product team. This post offers insight into that dual mindset.<\/p>\n\n\n\n<p>That wraps up October\u2019s Data Engineering Annotated. Follow JetBrains Big Data Tools on <a href=\"https:\/\/twitter.com\/BigDataTools\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter<\/a> and subscribe to our <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/\">blog<\/a> for more news! You can always reach me, Pasha Finkelshteyn, at <a href=\"mailto:asm0dey@jetbrains.com\">asm0dey@jetbrains.com<\/a> or send a DM to <a href=\"https:\/\/twitter.com\/asm0di0\" target=\"_blank\" rel=\"noopener\">my personal Twitter<\/a> account. You can also get in touch with our team at <a href=\"mailto:big-data-tools@jetbrains.com\">big-data-tools@jetbrains.com<\/a>. 
We\u2019d love to know about any other exciting data engineering articles you come across!<\/p>\n","protected":false},"author":1234,"featured_media":296158,"comment_status":"closed","ping_status":"closed","template":"","categories":[],"tags":[7148,6749,6918,7147,7149,7150],"cross-post-tag":[],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/296090"}],"collection":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools"}],"about":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/types\/big-data-tools"}],"author":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/users\/1234"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/comments?post=296090"}],"version-history":[{"count":5,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/296090\/revisions"}],"predecessor-version":[{"id":296468,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/296090\/revisions\/296468"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media\/296158"}],"wp:attachment":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media?parent=296090"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/categories?post=296090"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/tags?post=296090"},{"taxonomy":"cross-post-tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/cross-post-tag?post=296090"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}