advanced dbt macros, maybe they don't suck?

Within the wacky world of dbt there are a lot of peculiarities. One of the most confusing, and therefore under-utilized, features is the macro. Macros are a way to create reusable code within your dbt project. They are extremely powerful but also a bit of a pain to work with. In this post I’ll go over some advanced examples that illuminate some of the dark witchcraft of writing quality macros.

Use A Template

First things first, having a standard template will help to speed up development while keeping everyone consistent.

jinja
{#
<insert_macro_name>

<insert_description>

usage:
    {{ insert_macro_name(some_value) }}
    (sql) -> <insert sql generated>
    {{ insert_macro_name(some_value, some_value) }}
    (sql) -> <insert sql generated>
    {{ insert_macro_name(arg_datatype=some_value, arg2_datatype=some_value) }}
    (sql) -> <insert sql generated>
#}
sql
{% macro insert_macro_name(arg_datatype=none, arg2_datatype=none) -%}

    {% if arg_datatype is none %}
        {{ exceptions.raise_compiler_error("arg_datatype is required") }}
    {% endif %}

    {%- if arg2_datatype is not none -%}
        {# sql generated when the optional argument is provided #}
    {%- else -%}
        {# sql generated otherwise #}
    {%- endif -%}

{%- endmacro %}

The above template, while a little verbose, clearly establishes a pattern while nudging the developer towards best practices. Using jinja comments as a form of macro documentation makes intent and usage much easier to understand. Additionally, spelling out, in a very literal sense, what sql gets generated for different invocations is hugely helpful for others who may not have as much experience. Finally, including the expected datatype in the argument name is a great way to ensure the macro is being used correctly.
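To make the template concrete, here is a small hypothetical macro filled in using that pattern (the name and arguments are made up for illustration):

sql
{% macro cents_to_dollars(column_str=none, precision_int=2) -%}

    {% if column_str is none %}
        {{ exceptions.raise_compiler_error("column_str is required") }}
    {% endif %}

    {# convert an integer cents column to a decimal dollar amount #}
    ({{ column_str }} / 100)::numeric(16, {{ precision_int }})

{%- endmacro %}

So `{{ cents_to_dollars('amount_cents') }}` compiles to `(amount_cents / 100)::numeric(16, 2)`.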

Something Actually Useful

The following macro is great for speeding up local development and CI runs. Essentially, we limit the amount of data processed by using Snowflake’s sample function. Furthermore, we optimize this by using the sampling method appropriate to the context of the query. In general, the system sampling method is much faster than bernoulli; however, it cannot be used on the result of a join. Therefore, we expose the method as an option so the developer can configure it appropriately. Lastly, we add an option to override the behavior for particular local runs that may need the entire dataset.

jinja
{#
dev_limit

A macro to limit the number of rows
processed during development and staging.
In production, the limit is removed.

By default, the macro uses the
bernoulli sampling method.
The system method is faster, but:
    - it cannot be used on the result of a join
    - it cannot be used on a view
    - the sample size must be a percentage
      (.01, 10, etc.), not a row count

Thus, `is_system_bool=true` opts into system sampling.


Slightly confusing, so refer to the link below:

https://docs.snowflake.com/en/sql-reference/constructs/sample.html#examples

Lastly, pass `skip_bool=true` to skip the limit entirely.


usage:
    select id, name
    from {{ source('schema', 'table') }}
    {{ dev_limit(1000) }}
    (sql) -> sample bernoulli (1000 rows)
    {{ dev_limit(.01, is_system_bool=true) }}
    (sql) -> sample system (.01)
    {{ dev_limit(1000, skip_bool=true) }}
    (sql) -> nothing (skip limit)
#}
sql
{%- macro dev_limit(sample_size=none, is_system_bool=false, skip_bool=false) -%}

    {%- if sample_size is none -%}
        {{ exceptions.raise_compiler_error("Invalid must provide a `sample_size` value.") }}
    {%- endif -%}

    {#- only sample outside of prod, and only when not explicitly skipped -#}
    {%- if not skip_bool and target.name != 'prod' -%}
        {%- if is_system_bool -%}
            sample system ({{ sample_size }})
        {%- else -%}
            sample bernoulli ({{ sample_size }} rows)
        {%- endif -%}
    {%- endif -%}

{%- endmacro -%}
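
For example, a model using the macro might look like the following (the source names are made up). Against a dev target the sample clause is emitted; a prod run compiles it away entirely:

sql
select
    id,
    name
from {{ source('app', 'users') }}
{{ dev_limit(1000) }}

-- dev target compiles to:  ... sample bernoulli (1000 rows)
-- prod target compiles to: no sample clause at all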

Even More Useful

Everyone knows that incremental models can be a bit of a pain. Yet a quality macro can help smooth out the bumpy process. In particular, this macro provides sensible defaults while exposing a lot of configurability via variables.

An additional quirk is that this macro is optimized for Snowflake in that it eschews predicates containing a subquery, which Snowflake does not prune on, in favor of literal values. This, in theory, should lead to a significant performance improvement.
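To make the difference concrete, here is roughly what the two predicate styles compile to (the table name and date are illustrative):

sql
-- subquery predicate: Snowflake cannot use this to prune micro-partitions
where event_date > (select max(event_date) from analytics.my_model)

-- literal predicate: Snowflake prunes directly on the value
where event_date > '2022-12-01'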

jinja
{#
incremental_filter

This macro is a utility to help cut down on
repetitive code for incremental models. Also,
it should help to standardize our approach.

usage:
    {{
        config(
            materialized='incremental'
        )
    }}
    
    select event_id, event_date
    from {{ source('schema', 'table') }}
    where 1=1
        {{ incremental_filter('event_date') }}

basic example:
    {{ incremental_filter('event_date') }}
    (sql) -> and event_date > '<literal max(event_date) from {{ this }}>'
    {{ incremental_filter('event_date', -3) }}
    (sql) -> and event_date >= current_date() + -3

backfill example:
    dbt run --models my_model --vars '{backfill_start_date: 2022-11-01}'
    (sql) -> and event_date >= '2022-11-01'
    dbt run --models my_model --vars '{backfill_start_date: 2022-11-01, backfill_end_date: 2022-12-01}'
    (sql) -> and event_date >= '2022-11-01' and event_date <= '2022-12-01'
    dbt run --models my_model --vars '{backfill_start_int: -8, backfill_end_int: -5}'
    (sql) -> and event_date >= current_date() + -8 and event_date <= current_date() + -5
#}
sql
{%- macro incremental_filter(column_str=none, lookback_int=none) -%}

    {%- if is_incremental() -%}

        {%- if column_str is none -%}
            {{ exceptions.raise_compiler_error("Invalid must provide a `column_str` value.") }}
        {%- endif -%}

        {%- if var('backfill_start_date', default=false) -%}
            and {{ column_str }} >= '{{ var("backfill_start_date") }}'
            {%- if var('backfill_end_date', default=false) -%}
                and {{ column_str }} <= '{{ var("backfill_end_date") }}'
            {%- endif -%}

        {%- elif var('backfill_start_int', default=false) and var('backfill_end_int', default=false) -%}
            and {{ column_str }} >= current_date() + {{ var('backfill_start_int') }}
            and {{ column_str }} <= current_date() + {{ var('backfill_end_int') }}
        {%- else -%}

            {%- if lookback_int is not none -%}
                and {{ column_str }} >= current_date() + {{ lookback_int }}
            {%- else -%}
                {%- set sql -%}select max({{ column_str }}) from {{ this }}{%- endset -%}
                {%- if execute -%}
                    and {{ column_str }} > '{{ run_query(sql).columns[0].values()[0] }}'
                {%- endif -%}
            {%- endif -%}

        {%- endif -%}

    {%- endif -%}

{%- endmacro -%}

The first thing to note is the use of variables. Here we’ve provided a set of default configurations that should meet standard day-to-day needs. Nevertheless, data gets messy and sometimes a backfill is necessary. In that case, there are three separate ways to invoke one:

  • backfill_start_date and backfill_end_date - the most explicit option; use these when you know the exact date range you want to backfill.
  • backfill_start_int and backfill_end_int - the most implicit option; use these to backfill a range of days relative to the current date.
  • backfill_start_date alone - a middle ground; everything from an explicit start date through the present gets reprocessed.

The next thing to consider is the rewriting of the standard incremental pattern:

sql
{{ column_str }} > (select max({{ column_str }}) from {{ this }})

In this case, a subquery is used to generate the date value. This is problematic because Snowflake does not prune on subqueries. Therefore, we’ve rewritten the pattern to use a literal value instead:

sql
{{ column_str }} > '{{ run_query(sql).columns[0].values()[0] }}'

This literal value is acquired by running a query to get the max value of the column. This is a bit of a hack, but it’s the only way to get a literal value in this case. The `execute` variable ensures the query only runs during execution, not while dbt is parsing the project.
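
As a minimal sketch of that pattern in isolation (the column and variable names here are just for illustration):

jinja
{#- only touch the warehouse while executing, never while parsing -#}
{%- if execute -%}
    {%- set query -%}select max(event_date) from {{ this }}{%- endset -%}
    {%- set max_date = run_query(query).columns[0].values()[0] -%}
    and event_date > '{{ max_date }}'
{%- endif -%}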

Conclusion

All in all, I’m not really sure I would call these truly advanced dbt macros - there are a ton of features within Jinja that rarely seem to be utilized in macros. Therefore, I might do a follow-up article to see how weird we can get with Jinja + dbt. Still, these examples are a great point of reference and can, I presume, get most people to start considering the possibilities.