Bringing your dbt Core incremental logic to the next level

Piotr Sieminski
6 min read · Jun 26, 2024


Source: https://docs.getdbt.com/best-practices/materializations/4-incremental-models

Key considerations before you begin

  1. To reap all of the benefits of parametrisation, you should have a production deployment that lets you pass flags and variables to the dbt command executed in production.
  2. You have to take into account any window functions, aggregations etc. happening in a model you want to make incremental. Speaking from experience, you can usually get away with a larger reload period (e.g. reprocess the past 30 days), which still performs better than a full refresh each day. Be cautious though, as your calculations can easily get skewed if your window is too short.
  3. Find a reliable date column (or timestamp, if your batches run more often than once per day). By ‘reliable’ I mean one that corresponds closely to the actual time when the event you are processing happened.

Common mistakes

  1. Forgetting about delays in data arrival — if your producer may send the data up to 48h after it was produced, you had better account for it, or else these events will get lost, filtered out by the incremental logic. It is safe to add one more day of processing in this case (see the sketch after this list).
  2. Inaccurate CI handling — if your CI checks do a full refresh of the tables every single time, you might want to adjust the commands you use there.
  3. Lack of safety checks preventing unnecessary or unwanted full refreshes — there are legacy tables which you are forbidden to refresh fully (yes, that happened to me), but also other cases where you should have conditional logic in place to avoid unnecessary stress and costs. More on this topic in the sections below.
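For the late-arrival case from point 1, here is a minimal sketch of what such a buffer could look like inside an is_incremental() block (the table name, column name and window values are purely illustrative):

{% if is_incremental() %}
    -- base lookback of 2 days plus 1 extra day as a buffer
    -- for events arriving up to 48h late
    and table_name.date_column >= (
        select dateadd(day, -(2 + 1), max(t.date_column))
        from {{ this }} t
    )
{% endif %}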

Parameterise incremental reload window

It’s quite simple. All you have to do is make Jinja read the lookback window from your model config instead of hardcoding it inside the query (especially useful when multiple is_incremental() blocks are applied). Here are the steps required:

  • Add a parameter to your config (I named it lookback_window_in_days, but the choice is yours).
- name: <your_model>
  description: "<your_description>"
  config:
    materialized: incremental
    incremental_strategy: delete+insert
    on_schema_change: append_new_columns
    unique_key: surrogate_key
    cluster_by: <date_column>
    lookback_window_in_days: 2
  • At the beginning of your model, set a variable which reads this parameter; add a default value as a safety net.
{%- set lookback_window_in_days = config.get('lookback_window_in_days', default='2') -%}
  • Filter on your date column, selecting the max date from {{ this }} and subtracting the parametrised lookback window.
-- your table logic here

and table_name.date_column >= (
    select date_trunc(
        day
        , dateadd(day, -{{ lookback_window_in_days }}::int
        , max(t.date_column))
    ) from {{ this }} t
)

This way, you have standardised the lookback window. You can adjust it to hours or timestamps — as you please. It is helpful not only for readability and documentation purposes, but primarily when you have two or more incremental blocks within one model. If you want to change the lookback window, there is one place and one place only.
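If your batches run more often than daily, the same pattern can be shifted to an hourly grain. A quick sketch, assuming a timestamp column and a lookback_window_in_hours config parameter set up the same way as above (both names are illustrative):

and table_name.event_timestamp >= (
    select dateadd(hour, -{{ lookback_window_in_hours }}::int, max(t.event_timestamp))
    from {{ this }} t
)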

Custom lookback window

What happens when there was a bug in production and you only discovered it after a week? You can add the possibility to pass the number of days (or another value of your choice), which will be read and applied in the incremental block. Here is how:

  • Extend your is_incremental() block with an if-else clause, where you check for the existence of a variable (in my case custom_lookback_window_in_days, but name it as you please).
  • Set none as the default value for your custom_lookback_window_in_days.
  • Add another filtering block reading from your custom variable.

Here you can see parametrised code from above, extended with the custom lookback window:

-- here goes your table specification and key setting

-- here we start custom incremental processing
{% if is_incremental() %}

-- this part allows a partial backfill by specifying the number of
-- backfill days required as a variable
-- accepts --vars '{"custom_lookback_window_in_days": "<number>"}'
-- e.g.: dbt run --select <your_model> --vars '{"custom_lookback_window_in_days": "10"}'

{% if var('custom_lookback_window_in_days', none) %}

and table_name.date_column >= (
    select date_trunc(
        day
        , dateadd(day, -'{{ var("custom_lookback_window_in_days") }}'::int
        , max(t.date_column))
    ) from {{ this }} t
)

{% else %}

and table_name.date_column >= (
    select date_trunc(
        day
        , dateadd(day, -{{ lookback_window_in_days }}::int
        , max(t.date_column))
    ) from {{ this }} t
)

{% endif %}

{% endif %}

And just like that, you gained the superpower of custom reloads! Be very careful when setting all these variables up — single or double quotes, or the lack of them — these are the tiny, irritating errors I have encountered over the years. The implementation above works — that’s guaranteed :)

Unique key creation / selection

You need a unique key so dbt knows what to do when it has to insert newly processed rows whose key already exists in the table. The details of incremental strategies are out of scope for this article — choose whichever strategy works for you, but let me suggest a great way to go about unique_key.

If you already have a rock-solid primary key — you can go for it. I always recommend creating a surrogate_key though. It makes the table more readable, and you always know at which grain you have a unique combination of values. Here’s the process:

  1. Install the dbt_utils package (if not already installed).
  2. Add the macro call generating the surrogate key:
{{ dbt_utils.generate_surrogate_key([
      'some_id'
    , 'another_id'
    , 'some_name'
    , 'another_name'
    , 'final_value'
]) }} as surrogate_key

I recommend adding it at the very last step of your model — in the final selection. That way it is clear where it is produced, and you can be sure it won’t get lost or changed during the model’s calculations.
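For illustration, the final select of a model could then look roughly like this (the CTE and column names are made up):

final as (
    select
        {{ dbt_utils.generate_surrogate_key(['some_id', 'another_id']) }} as surrogate_key
        , some_id
        , another_id
        , final_value
    from joined
)

select * from final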

All that’s left is to set this column as the unique_key in the model config.

Tip: you can use any columns you want as part of the surrogate key. The surrogate key macro handles null values, different data types etc. Have a detailed read here.

Clustering

For incremental models with high volumes, it usually makes sense to cluster your tables. If you are using the same date columns to apply incremental logic across different layers of your pipeline, then it only seems natural to use that date as a clustering key. As clustering in general is out of scope for this article, let me leave you a snippet of the model YAML definition, so you know where it goes:

- name: <your_model>
  description: "<your_description>"
  config:
    materialized: incremental
    incremental_strategy: delete+insert
    on_schema_change: append_new_columns
    unique_key: surrogate_key
    cluster_by: <date_column>
    lookback_window_in_days: 2

Conditional logic possibilities

Here are three that I have found most useful for incremental pipelines. You can find the full list of dbt Jinja functions here.

  • execute

Useful when you want to avoid statement execution during manifest generation; you can check the execute flag like this:

{% if execute %}
{{ some_macro('val1', 'val2') }}
{% endif %}
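A common pattern, sketched below, is to guard a run_query() call with execute so the statement only runs at execution time and not while dbt parses the project — the query and variable names here are illustrative:

{% if execute %}
    {# only hit the warehouse at execution time, not during parsing #}
    {% set result = run_query("select max(date_column) from " ~ this) %}
    {% set max_loaded_date = result.columns[0].values()[0] %}
{% endif %}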
  • FULL_REFRESH

Mostly used to execute some additional macros or DML queries.

For example, let’s say you leverage Snowflake sequences in your model, which only have to be dropped and recreated upon a full refresh. All you have to do is execute the macro when the FULL_REFRESH flag is on:

{% if execute and flags.FULL_REFRESH %}
{{ recreate_sequence('<sequence_name>', '<sequence_comment>') }}
{% endif %}
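recreate_sequence is a custom macro, not a dbt built-in. Purely to illustrate the idea, here is a minimal sketch of what it could look like on Snowflake (the exact body may differ):

{% macro recreate_sequence(sequence_name, sequence_comment) %}
    {# recreate the sequence so its numbering starts over on a full refresh #}
    {% do run_query(
        "create or replace sequence " ~ sequence_name
        ~ " start = 1 increment = 1 comment = '" ~ sequence_comment ~ "'"
    ) %}
{% endmacro %}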
  • raise_compiler_error

Let’s say you have a legacy table and you want to limit the number of days it can be reloaded for (the scenario from the beginning of this article). You can throw an exception after evaluating the lookback parameter: is it within the range of your liking?

{% if lookback_window_in_days | int < 0 or lookback_window_in_days | int > 100 %}
    {{ exceptions.raise_compiler_error("This table's max reload is 100 days") }}
{% endif %}
