{"id":231334,"date":"2022-02-22T08:00:38","date_gmt":"2022-02-22T07:00:38","guid":{"rendered":"https:\/\/blog.jetbrains.com\/?post_type=big-data-tools&#038;p=231334"},"modified":"2022-11-30T12:19:47","modified_gmt":"2022-11-30T11:19:47","slug":"dbt-deeper-concepts-materialization","status":"publish","type":"big-data-tools","link":"https:\/\/blog.jetbrains.com\/zh-hans\/big-data-tools\/2022\/02\/22\/dbt-deeper-concepts-materialization","title":{"rendered":"dbt\u00ae deeper concepts: materialization"},"content":{"rendered":"\n<p>In the <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/2022\/01\/25\/how-i-started-out-with-dbt\/\">first part<\/a> of this blog series, I described basic dbt\u00ae concepts such as installation, creation of views, and describing models. I could have stopped there, but indeed, there are some drawbacks to only using views to build the whole transformation layer in our database. Sometimes we don&#8217;t really need to use a view and a view may run slowly even in databases oriented toward analytical workflows.<\/p>\n\n\n\n<p>I\u2019ll start by giving an overview of ephemeral views.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ephemeral-views\">Ephemeral views<\/h2>\n\n\n\n<p>In some cases, we don&#8217;t really want to have an entity for a dbt\u00ae model, rather we want this model to be inlined in other entities. As an example, take a look at the <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/2022\/01\/25\/how-i-started-out-with-dbt\/#first-model\">first model<\/a> from the previous post. It&#8217;s very basic. I reckon we&#8217;ll be ok with inlining it into the second model. To make the model ephemeral, we need to perform one single action:<\/p>\n\n\n\n<p>Add a line <code>{{ config(materialized='ephemeral') }}<\/code> to the top of the model.<\/p>\n\n\n\n<p>If we do it with our <code>user<\/code> model, the code will become the following:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{{ config(materialized='ephemeral') }}\nSELECT * FROM customer<\/pre>\n\n\n\n<p>After running the <code>dbt build<\/code> command (and deleting all the existing views just in case), I see that there is now only one view in the DB: <code>user_post_count<\/code>. It is exactly as it was defined in the <a href=\"https:\/\/blog.jetbrains.com\/big-data-tools\/2022\/01\/25\/how-i-started-out-with-dbt\/#second-model\">user_post_count<\/a> model, but the compiled code is different, it&#8217;s not:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">WITH users AS (\n    SELECT u.id user_id, u.username username\n    FROM \"demo\".\"public\".\"user\" u\n)\nSELECT user_id,\n       username,\n       count(post.id) post_number\nFROM users,\n     post\nWHERE post.author = user_id\nGROUP BY user_id, username<\/pre>\n\n\n\n<p>but rather:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">WITH  __dbt__cte__user as (\n\nselect * from customer\n),users AS (\n    SELECT u.id user_id, u.username username\n    FROM __dbt__cte__user u\n)\nSELECT user_id,\n       username,\n       count(post.id) post_number\nFROM users,\n     post\nWHERE post.author = user_id\nGROUP BY user_id, username<\/pre>\n\n\n\n<p>Do you see how this trick works? dbt added a new, \u201csynthetic\u201d CTE named <code>__dbt__cte__user<\/code>, and now uses this instead of the call to view, named <code>user<\/code>. But all the tests still run!<\/p>\n\n\n\n<p>If you want to check what will be executed, you can call <code>dbt compile<\/code> and then look at everything that is generated, including tests in the <code>target<\/code> directory.<\/p>\n\n\n\n<p>Of course, ephemeral views do not solve execution speed problems\u2013 it is mainly just syntactic sugar.<\/p>\n\n\n\n<p>What if we need to materialize the data because the queries are taking too long?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"table-views\">Table views<\/h2>\n\n\n\n<p>To make the model materialize into a table, we just need to add the appropriate config to it:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{{ config(materialized='table') }}\nWITH users AS (\n    SELECT u.id user_id, u.username username\n    FROM {{ ref('user') }} u\n)\nSELECT user_id,\n       username,\n       count(post.id) post_number\nFROM users,\n     post\nWHERE post.author = user_id\nGROUP BY user_id, username<\/pre>\n\n\n\n<p>After execution, we will find a table named <code>user_post_count<\/code> in our database. There are cons with such a solution that are obvious. For instance, we will need to rebuild the whole thing every time and, as we know, when we\u2019re working with a large amount of data, queries can take hours to complete. More data means that more time is required to build a view. And what if we need to have lots of materialized views of this sort? It could take forever!<\/p>\n\n\n\n<p>The solution to this is to use incremental models!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"incremental-models\">Incremental models<\/h2>\n\n\n\n<p>Incremental views allow you to store only the changed\/new parts of underlying entities. Of course, there is no magic solution, so making incremental models is a bit more complex and involved.<\/p>\n\n\n\n<p>In this case, it won&#8217;t be enough to use any of the examples from the previous blog post, so I will use something else. I have chosen the Sakila database.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"sakila-database\">Sakila database<\/h3>\n\n\n\n<p>The Sakila database is a normalized model of the DVD rental business. It was originally developed by Mike Hillyer for MySQL. It has since been ported to other databases by DB Software Laboratory (<a href=\"http:\/\/www.etl-tools.com\" target=\"_blank\" rel=\"noopener\">http:\/\/www.etl-tools.com<\/a>).<\/p>\n\n\n\n<p>You can find everything you need to deploy the Sakila database to PostgreSQL on GitHub at <a href=\"https:\/\/github.com\/jOOQ\/sakila\" target=\"_blank\" rel=\"noopener\">jOOQ repository<\/a>, courtesy of Lukas Eder.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a target=\"_blank\" href=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2022\/02\/actor.png\" rel=\"noopener\"><img decoding=\"async\" loading=\"lazy\" width=\"2800\" height=\"1476\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2022\/02\/actor-2800x1476.png\" alt=\"\" class=\"wp-image-231404\"><\/a><\/figure>\n\n\n\n<p>What is very cool about Sakila is that it already contains everything needed to build incremental views! What makes Sakila so different from our previous test database? Well, first off, every table contains a column called <code>last_updated<\/code>. We can use these columns to determine if the model needs to be updated and what data should be added there.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"first-incremental-model\">First incremental model<\/h3>\n\n\n\n<p>Let&#8217;s make an example of what can be counted incrementally in our DVD rental store. The first idea that comes to my mind is the total amount of daily payments.<\/p>\n\n\n\n<p>So, let&#8217;s start with a simple model: <code>daily_payment.sql<\/code><\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">SELECT sum(amount) daily_amount, date(payment_date) date\nFROM payment\nGROUP BY date(payment_date)<\/pre>\n\n\n\n<p>and describe it in an <code>YAML<\/code> file<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">models:\n  - name: daily_payment\n    description: \"Total amount of payments, daily\"\n    columns:\n      - name: daily_amount\n        type: decimal\n        tests:\n          - not_null\n      - name: date\n        type: date\n        tests:\n          - not_null\n          - unique<\/pre>\n\n\n\n<p>Of course, a view named <code>daily_payment<\/code> will appear in the DB, and it will take longer to open it each time. In our example, there is not a huge amount of data in the <code>payments<\/code> table, but in real life, it would be different. It looks like an awesome candidate to become an incremental model.<\/p>\n\n\n\n<p>The first thing we need to do is obvious \u2013 we should define a different materialization type in the config:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{{ config(materialized='incremental') }}\n\nSELECT sum(amount) daily_amount, date(payment_date) date\nFROM payment\nGROUP BY date(payment_date)<\/pre>\n\n\n\n<p>But this time around, this is not enough. We need to define a way for dbt to find an increment. Doing so will require one more instance of Jinja magic:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{{ config(materialized='incremental') }}\n\nSELECT sum(amount) daily_amount, date(payment_date) date\nFROM payment\n\n{% if is_incremental() %}\nWHERE date(payment_date) > (SELECT max(date) from {{this}})\n{% endif %}\n\nGROUP BY date(payment_date)<\/pre>\n\n\n\n<p>In this code, you can see an <code>is is_incremental()<\/code> block, which adds an additional condition to our SQL query. Why do we need a separate block? Why can\u2019t we just include the <code>WHERE<\/code> condition in the main body of our SQL query? The reason is that we need to be able to change our model to a different type by replacing only one literal.<\/p>\n\n\n\n<p>There is one more thing we need to look at in this model: the <code>{{this}}<\/code> term inside the sub-query. This term refers to the current view (<code>daily_payment<\/code> in our case). It&#8217;s the part of the Jinja magic that allows us to change the name of the model without rewriting the whole query: dbt is made with refactoring in mind!<\/p>\n\n\n\n<p>We have the option of defining a unique key constraint for our model. This will ensure that the rows are being updated, not added, when data changes. I won&#8217;t demonstrate it here because payments are in order, we hope.<\/p>\n\n\n\n<p>Let&#8217;s build our model\u2026<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1628\" height=\"82\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2022\/02\/incremental1.png\" alt=\"\" class=\"wp-image-231416\" title=\"image_tooltip\"\/><\/figure>\n\n\n\n<p>And look at the generated SQL:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">  create  table \"sakila\".\"public\".\"daily_payment\"\n  as (\n\n\nSELECT sum(amount) daily_amount, date(payment_date) date\nFROM payment\n\n\n\nGROUP BY date(payment_date)\n  );<\/pre>\n\n\n\n<p>Hmm, that is interesting! There is no sign of our <code>WHERE<\/code> clause here! Apparently, when there is no materialization of a current model, dbt is smart and throws out everything <code>incremental<\/code>-related. That is another reason to have an <code>is is_incremental()<\/code> block! But how will it work when we put a bit more data into the <code>payments<\/code> table? Let&#8217;s insert the data in now:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">INSERT INTO payment(customer_id, staff_id, rental_id, amount, payment_date)\nVALUES (19, 2, 234, 3.99, now());\nINSERT INTO payment(customer_id, staff_id, rental_id, amount, payment_date)\nVALUES (19, 2, 235, 2.99, now());\nINSERT INTO payment(customer_id, staff_id, rental_id, amount, payment_date)\nVALUES (19, 2, 236, 5.99, now());<\/pre>\n\n\n\n<p>Now let&#8217;s call <code>dbt build<\/code> again. At first glance, it seems like nothing has changed:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1642\" height=\"74\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2022\/02\/incremental2.png\" alt=\"\" class=\"wp-image-231428\" title=\"image_tooltip\"\/><\/figure>\n\n\n\n<p>But if we look at the compiled SQL, we can see that something is different!<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">    insert into \"sakila\".\"public\".\"daily_payment\" (\"daily_amount\", \"date\")\n    (\n        select \"daily_amount\", \"date\"\n        from \"daily_payment__dbt_tmp103757815292\"\n    )<\/pre>\n\n\n\n<p>dbt, I see what you did there! dbt creates a temporary view with the necessary data only. After that, the data is inserted into our incremental table from this temporary view and the temporary view is dropped.<\/p>\n\n\n\n<p>Of course, after execution we can see a new line in our table:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"654\" height=\"288\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2022\/02\/table.png\" alt=\"\" class=\"wp-image-231440\" title=\"image_tooltip\"\/><\/figure>\n\n\n\n<p>The issue here is that the day is not over yet, so if I rebuild the model in the middle of the day (to obtain my very important report ASAP), the data will be incomplete!<\/p>\n\n\n\n<p>This is another chance to apply our knowledge about unique constraints!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"incremental-with-unique-constraint\">Incremental with unique constraint<\/h3>\n\n\n\n<p>Let&#8217;s update our model a bit:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{{\n    config(\n        materialized='incremental',\n        unique_key='date'\n    )\n}}\n\nSELECT sum(amount) daily_amount, date(payment_date) date\nFROM payment\n\n{% if is_incremental() %}\nWHERE date(payment_date) >= (SELECT max(date) from {{this}})\n{% endif %}\n\nGROUP BY date(payment_date)<\/pre>\n\n\n\n<p>A couple things to notice here are:<\/p>\n\n\n\n<ol><li>I&#8217;ve added a <code>unique_key<\/code> configuration.<\/li><li>I&#8217;ve replaced <code>&gt;<\/code> with <code>&gt;=<\/code> in the incremental predicate.<\/li><\/ol>\n\n\n\n<p>Now let\u2019s rebuild our model. This command will refresh all the incremental views. This step is necessary because we have made several changes that dbt doesn&#8217;t know about (yet).<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">dbt build --full-refresh<\/pre>\n\n\n\n<p>Of course, in the table, we will find the same $12.97 for today, so let&#8217;s add one more payment:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"sql\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">INSERT INTO payment(customer_id, staff_id, rental_id, amount, payment_date)\nVALUES (19, 2, 237, 4.99, now());<\/pre>\n\n\n\n<p>\u2026and then execute <code>dbt build<\/code> again. The table is updated and the manager will be happy!<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"664\" height=\"300\" src=\"https:\/\/blog.jetbrains.com\/wp-content\/uploads\/2022\/02\/incremental3.png\" alt=\"\" class=\"wp-image-231452\" title=\"image_tooltip\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"summary\">Summary<\/h2>\n\n\n\n<p>In this post, we learned about three more types of model materialization (<code>table<\/code>, <code>ephemeral<\/code> and <code>incremental<\/code>) that are available in addition to the default ones. <code>incremental<\/code> models require more effort to make them work, but this effort pays off! If you want to know more about view materialization, please refer to <a href=\"https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/building-models\/configuring-incremental-models\" target=\"_blank\" rel=\"noopener\">dbt docs<\/a> to find more details on incremental views, for example, how to update only a portion of the models, not all of them.<\/p>\n\n\n\n<p>dbt continues to show itself to be an extremely easy-to-use tool for transformation layer creation.<\/p>\n\n\n\n<p>Follow this blog for the next part: the holy grail of data engineering \u2013 SQL testing!<\/p>\n","protected":false},"author":1234,"featured_media":231538,"comment_status":"closed","ping_status":"closed","template":"","categories":[594],"tags":[588,6920,6918,201],"cross-post-tag":[],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/231334"}],"collection":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools"}],"about":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/types\/big-data-tools"}],"author":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/users\/1234"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/comments?post=231334"}],"version-history":[{"count":10,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/231334\/revisions"}],"predecessor-version":[{"id":302188,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/big-data-tools\/231334\/revisions\/302188"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media\/231538"}],"wp:attachment":[{"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/media?parent=231334"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/categories?post=231334"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/tags?post=231334"},{"taxonomy":"cross-post-tag","embeddable":true,"href":"https:\/\/blog.jetbrains.com\/zh-hans\/wp-json\/wp\/v2\/cross-post-tag?post=231334"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}