Unlock Advanced dbt Use Cases with the Meta Config and the Graph Variable

Unlock Advanced dbt Use Cases with the Meta Config and the Graph Variable

·

10 min read

The meta config can unlock very powerful use cases for your dbt project, but working with them during runtime can be tricky. In this post, we’ll be exploring how to access the meta config via the output of dbt’s graph context variable. We'll consider how to safely iterate over all models where only a few have meta configs defined. To do that, we’ll learn about dot notation and get() as handy methods for accessing JSON objects.

💡 Use Case

Our scenario is a DAG of a parent model containing user IDs and email addresses, and child models that pull those IDs & emails from the parent model and joins to high-output transactional data. The child models, which are incremental models, are appended with new transactional data multiple times daily.

Here's the catch: our analysts need the child models to contain the most up-to-date email addresses. However, the parent model containing user IDs and email addresses can only be built once a day. That means the child models will contain new transaction data but with stale email data. Other--and easier--options such as using different materializations (i.e., materialized views), lookback windows, or increasing build frequency are not available to us.

Our solution was this:

  1. Write a macro with an UPDATE TABLE SQL statement to update email addresses on the child models.

  2. Set the macro as a post-hook on the parent model.

  3. Set update_table meta config on child models. This allows us to identify them as models to run the update table macro against.

  4. Write a second macro to search through the graph object for models with the update_table meta config, and pass a list of models to the update table macro.

The operational flow looked like this:

  1. Parent model runs

  2. Child models run

  3. Later in the day, the parent model runs again

  4. The post-hook then kicked off:

    1. Search graph for child models containing the meta config

    2. Update those child models with the latest email addresses

A lot's happening under the hood here! Let’s take it step by step.

🔧 meta config

dbt supports a meta config that can be defined within a model’s {{ config() }} statement or under the model’s config YAML.

{{
  config(
      meta={
          'update_emails': {'enabled': true}
      }
  )
}}

That’s it. We could have gotten away with something even simpler, such as meta = {'update_emails': true}. However, setting the enabled key allows for easier development – this way, devs can just turn it on and off without having to touch anything else in the config block, which is handy if there are several properties to define.

This is infinitely extensible! You could do something like the below config to address requirements that vary from model to model.

{{
  config(
      meta={
          'update_emails': {
              'enabled': true,
              'user_id_field': 'user_id',
              'email_field': 'user_email',
              'days_to_skip_updates': ['Saturday', 'Sunday']
          }
      }
  )
}}

Why don't we just use dbt tags? Tags are largely for grouping things so you can easily run a subset of your dbt project, and tags also have inheritance. It’s not the right tool for this use case. meta config, by contrast, is flexible and extensible; you can define key-value pairs and nest them as needed to power unique use cases.

⛓️ graph

The graph context variable is super cool and powerful because you can query the structure and contents of your dbt project during runtime! That means when a model is running, it can fetch information about other models and perform actions using that information.

If you queried the graph for one of the child models that has the meta tag (we’ll get to how in the next section!), you’d get a giant JSON blob like this, which has been truncated for this post:

{
   "resource_type":"model",
   "config":{
      "enabled":true,
      "tags":[],
      "meta":{
         "update_emails":{
            "enabled":true,
            "user_id_field":"user_id",
            "email_field":"email",
            "days_to_skip_updates": ['Saturday', 'Sunday']
         }
      },
      "materialized":"table",
      "incremental_strategy":"None",
      "post-hook":[],
      "pre-hook":[]
   },
   "database":"analytics",
   "schema":"dbt_adamstone_child_example",
   "fqn":[
      "abc_company",
      "child_example"
   ],
   "unique_id":"model.abc_company.child_example",
   "root_path":"/Users/adam.stone/git/abc_company/data",
   "path":"child_example.sql",
   "original_file_path":"models/child_example.sql",
   "name":"child_example",
   "created_at":1688548667.9344058
}

See the update_emails part in the meta config? So handy! There's a caveat from dbt though about how these nodes are currently structured:

The exact contract for these model and source nodes is not currently documented, but that will change in the future.

🔍 Querying the graph

Using dot (.) notation, you can transverse through that JSON object to access the enabled value: model.config.meta.update_emails.enabled. Here’s a simple example:

{% macro find_nodes() %}

{# Run macro only in execution mode #}
{% if execute %}

    {# Iterate over all models contained in graph.nodes #}
    {% for model in graph.nodes.values() %}

        {# Extract model properties #}
        {% set model_name = model.name %}
        {% set enabled_value = model.config.meta.update_emails.enabled %}

        {# Print model properties to console #}  
        {{ log(model_name ~ ' has update_emails meta set to ' ~ enabled_value, info=True) }}

    {% endfor %}

{% endif %}

{% endmacro %}

If you run dbt run-operation find_nodes it would print to the console "child_model has update_emails meta set to True." The for loop will iterate over all models (called “nodes” in the graph context variable) and print out each model's model.config.meta.update_emails.enabled value.

So far, so good…except not all models will have the update_emails meta config! We only are adding that to the models that need it. We can't add this optional config parameter to every model and set most of these to enabled: false. That’s just not good coding practice.

The consequence of only a few models having the update_emails meta config is that it will break any loop that assumes all models have that meta config. In the above find_nodes() macro, once the for loop lands on a model that doesn't have the update_emails meta config, it will try to look for model.config.meta.update_emails.enabled and find...nothing!

Encountered an error while running operation: Compilation Error in macro find_nodes (macros/find_nodes.sql)
  'dict object' has no attribute 'update_emails'

What to do? How do we safely know which models have the meta config before we can even look through all of them? There are two methods, and we’ll use both of them.

☝️ Method 1: Check if the meta config exists

Using dot notation, accessing model.config.meta.update_emails will return whatever’s nested in it, or a None value if that key doesn't exist. We can use that to our advantage by passing that to an {% if %} statement.

{# Iterate over all models contained in graph.nodes #}
{% for model in graph.nodes.values() %}

    {# Check if model contains update_emails meta config #}
    {% if model.config.meta.update_emails %}

        {# Extract model properties #}
        {% set model_name = model.name %}
        {% set enabled_value = model.config.meta.update_emails.enabled %}

        {# Output model properties to the console #}
        {{ log(model_name ~ ' has update_emails meta set to ' ~ enabled_value, info=True) }}

    {% endif %}

{% endfor %}

✌️ Method 2: Use get() to safely access values

The first method is great for some cases but not all. Sometimes we need to retrieve values without knowing in advance what’s going to be in them and then execute logic based on those values. Or we’ve got many nested key-value pairs, all of which are optional. Remember, the meta config is infinitely flexible! It becomes impractical to write nested {% if %} statements for every possible combination of keys.

Instead, we can use the get() function which returns (1) the value of an item given a key, and (2) returns a default variable if there is no key! For example:

  • If we access model.config.meta.update_emails.enabled for a model that doesn’t have this, it’ll error out

  • If we run model.config.meta.update_emails.get(‘enabled’, false) for a model that doesn't have this, it’ll return false

Let’s use get() to build a list of models whose emails need to be updated. This is the part where we're going to use it:

{% if model.config.meta.update_emails.get("enabled", false) == true %}
    {% do list_of_models_to_update.append(model) %}
{% endif %}

In many cases, it's safer than {% if model.config.meta.update_emails %} because the dot notation approach simply tries to access the update_emails key – and doesn’t check if there’s a proper enabled key-value attribute within it. After all, the unique_emails key may contain wrong or missing values.

🏗️ Building a “find these tables” macro

Using both methods to access meta config in the graph context variable, we can now build our macro! It will navigate the graph, find models containing the update_emails meta config, and return a list of those models.

{% macro find_email_models() %}

{# Run macro only in execution mode #}
{% if execute %}

    {# Communicate to user what is happening #}
    {{ log("Searching for models to update emails", info=True) }}

    {# Instantiate empty list variable #}
    {% set models_to_update = [] %}

    {# Iterate over all models contained in graph.nodes #}
    {% for model in graph.nodes.values() %}

        {# Get model name for current iteration #}
        {% set current_model = model.name %}

        {# Use get() to be safe if the property does not exist for a model #}
        {% set update_emails_config = model.config.meta.get('update_emails', {}) %}

        {# Check if model contains update_emails meta config #}
        {% if update_emails_config %}

                {# Extract model properties #}
                {% set enabled_value = update_emails_config.get("enabled", "not set") %}

                {# Output model properties to the console #}
                {{
                    log(
                        "Model " ~ current_model ~ " has property update_emails.enabled: " ~ enabled_value, info=True
                    )
                }}

                {# If enabled, append model name to list of models to update #}
                {% if enabled_value == true %}
                    {% do models_to_update.append(current_model) %}
                {% endif %}

        {% endif %}

    {% endfor %}

    {# Return list of all models to update #}
    {{ return(models_to_update) }}


{% endif %}

{% endmacro %}

📩 Let's update those emails

Phew! After all that work, you still need to update the models themselves. You'll want a second macro to do this, and it should:

  1. Calls the find_email_models() macro – the one we just wrote above – and saves its results to a list called models_to_update

  2. Grab all properties of all models from the graph context variable and save it to all_models

  3. Loop through each model in models_to_update:

    1. From all_models, extract the current model's properties

    2. Make sure it exists in the database by checking its source relation (if we attempt to run an UPDATE statement on a non-existent table, it will error)

    3. Run SQL to update the table

  4. Done!

The handy thing about extracting model properties in advance is that you have the entire meta config ready to go if you're using several properties.

{% macro update_email_models() %}

{# Run macro only in execution mode #}
{% if execute %}

    {{ log("Beginning update of emails", info=True)}}

    {# Call helper macro to find all models that need updating #}
    {% set models_to_update=find_email_models() %}

    {# Grab all models and their properties; we will pull out what we need #}
    {% set all_models = graph.nodes.values() %}

    {% for model_name in models_to_update %}

        {# Extract properties of the current model in this iteration #}
        {% set model_properties = (all_models | selectattr('name', 'equalto', model_name) | list).pop() %}

        {# Confirm model actually exists in the database before we try to update it #}
        {% set model_db = model_properties.database %}
        {% set model_schema = model_properties.schema %}
        {% set source_relation = adapter.get_relation(
            database=model_db,
            schema=model_schema,
            identifier=model_name
            ) %}

        {% if source_relation %}

            {# The actual SQL query to run #}
            {% set sql %}

                begin;

                update {{ source_relation }}
                set email = parent.email
                from {{ this }} as parent
                where {{ source_relation }}.user_id = parent.user_id
                    and {{ source_relation }}.email != parent.email;

                commit;

            {% endset %}

            {{ log("Updating emails in " ~ model_name, info=True) }}
            {% do run_query(sql) %}
            {{ log("Updated emails in " ~ source_relation, info=True) }}

        {% else %}

            {# If model does not exist, raise error #}
            {% do exceptions.raise_compiler_error(model_db ~ "." ~ model_schema ~ "." ~ model_name ~ " doesn't exist! Skipping any further email updates") %}

        {% endif %}

    {% endfor %}

    {# Let the user know we are all done! #}
    {{ log("Finished updating emails", info=True)}}

{% endif %}

{% endmacro %}

🪝 Now the post-hook

Last, add that update_email_models() macro as a post-hook to the parent model! This is the easiest step of all. This goes on top of your parent model definition:

{{ 
    config(
        post_hook="{{ update_email_models() }}",
        materialized='table'
    ) 
}}

select * from {{ ref('source_data') }}

When you’re orchestrating this for the first time, be mindful that the child models may not exist yet, so you may want to deploy child models with the update_emails meta config set to false first, then enable it later.

🪩 Conclusion

In this real-world example, where email address data changed frequently throughout the day, we wanted the child models to reflect the latest information. Using the meta config and graph variable, we can successfully ensure the freshest data exists in the child models without increasing build frequency or using alternative materializations.

I'm sure this is just one among many unique and interesting use cases for leveraging these advanced dbt features -- I'd love to hear more examples out there! Shout out to Gemma, Dan, and Jared for creative, technical, and moral support in working with these gnarly lines of Jinja and JSON code. 🫶