The meta config can unlock very powerful use cases for your dbt project, but working with them during runtime can be tricky. In this post, we’ll be exploring how to access the meta config via the output of dbt’s graph
context variable. We'll consider how to safely iterate over all models where only a few have meta configs defined. To do that, we’ll learn about dot notation and get()
as handy methods for accessing JSON objects.
💡 Use Case
Our scenario is a DAG of a parent model containing user IDs and email addresses, and child models that pull those IDs & emails from the parent model and joins to high-output transactional data. The child models, which are incremental models, are appended with new transactional data multiple times daily.
Here's the catch: our analysts need the child models to contain the most up-to-date email addresses. However, the parent model containing user IDs and email addresses can only be built once a day. That means the child models will contain new transaction data but with stale email data. Other--and easier--options such as using different materializations (i.e., materialized views), lookback windows, or increasing build frequency are not available to us.
Our solution was this:
Write a macro with an
UPDATE TABLE
SQL statement to update email addresses on the child models.Set the macro as a post-hook on the parent model.
Set
update_table
meta config on child models. This allows us to identify them as models to run the update table macro against.Write a second macro to search through the graph object for models with the
update_table
meta config, and pass a list of models to the update table macro.
The operational flow looked like this:
Parent model runs
Child models run
Later in the day, the parent model runs again
The post-hook then kicked off:
Search graph for child models containing the meta config
Update those child models with the latest email addresses
A lot's happening under the hood here! Let’s take it step by step.
🔧 meta config
dbt supports a meta config that can be defined within a model’s {{ config() }}
statement or under the model’s config YAML.
{{
config(
meta={
'update_emails': {'enabled': true}
}
)
}}
That’s it. We could have gotten away with something even simpler, such as meta = {'update_emails': true}
. However, setting the enabled
key allows for easier development – this way, devs can just turn it on and off without having to touch anything else in the config block, which is handy if there are several properties to define.
This is infinitely extensible! You could do something like the below config to address requirements that vary from model to model.
{{
config(
meta={
'update_emails': {
'enabled': true,
'user_id_field': 'user_id',
'email_field': 'user_email',
'days_to_skip_updates': ['Saturday', 'Sunday']
}
}
)
}}
Why don't we just use dbt tags? Tags are largely for grouping things so you can easily run a subset of your dbt project, and tags also have inheritance. It’s not the right tool for this use case. meta config, by contrast, is flexible and extensible; you can define key-value pairs and nest them as needed to power unique use cases.
⛓️ graph
The graph context variable is super cool and powerful because you can query the structure and contents of your dbt project during runtime! That means when a model is running, it can fetch information about other models and perform actions using that information.
If you queried the graph for one of the child models that has the meta tag (we’ll get to how in the next section!), you’d get a giant JSON blob like this, which has been truncated for this post:
{
"resource_type":"model",
"config":{
"enabled":true,
"tags":[],
"meta":{
"update_emails":{
"enabled":true,
"user_id_field":"user_id",
"email_field":"email",
"days_to_skip_updates": ['Saturday', 'Sunday']
}
},
"materialized":"table",
"incremental_strategy":"None",
"post-hook":[],
"pre-hook":[]
},
"database":"analytics",
"schema":"dbt_adamstone_child_example",
"fqn":[
"abc_company",
"child_example"
],
"unique_id":"model.abc_company.child_example",
"root_path":"/Users/adam.stone/git/abc_company/data",
"path":"child_example.sql",
"original_file_path":"models/child_example.sql",
"name":"child_example",
"created_at":1688548667.9344058
}
See the update_emails
part in the meta
config? So handy! There's a caveat from dbt though about how these nodes are currently structured:
The exact contract for these model and source nodes is not currently documented, but that will change in the future.
🔍 Querying the graph
Using dot (.) notation, you can transverse through that JSON object to access the enabled
value: model.config.meta.update_emails.enabled
. Here’s a simple example:
{% macro find_nodes() %}
{# Run macro only in execution mode #}
{% if execute %}
{# Iterate over all models contained in graph.nodes #}
{% for model in graph.nodes.values() %}
{# Extract model properties #}
{% set model_name = model.name %}
{% set enabled_value = model.config.meta.update_emails.enabled %}
{# Print model properties to console #}
{{ log(model_name ~ ' has update_emails meta set to ' ~ enabled_value, info=True) }}
{% endfor %}
{% endif %}
{% endmacro %}
If you run dbt run-operation find_nodes
it would print to the console "child_model has update_emails meta set to True." The for
loop will iterate over all models (called “nodes” in the graph context variable) and print out each model's model.config.meta.update_emails.enabled
value.
So far, so good…except not all models will have the update_emails
meta config! We only are adding that to the models that need it. We can't add this optional config parameter to every model and set most of these to enabled: false
. That’s just not good coding practice.
The consequence of only a few models having the update_emails
meta config is that it will break any loop that assumes all models have that meta config. In the above find_nodes()
macro, once the for loop lands on a model that doesn't have the update_emails
meta config, it will try to look for model.config.meta.update_emails.enabled
and find...nothing!
Encountered an error while running operation: Compilation Error in macro find_nodes (macros/find_nodes.sql)
'dict object' has no attribute 'update_emails'
What to do? How do we safely know which models have the meta config before we can even look through all of them? There are two methods, and we’ll use both of them.
☝️ Method 1: Check if the meta config exists
Using dot notation, accessing model.config.meta.update_emails
will return whatever’s nested in it, or a None
value if that key doesn't exist. We can use that to our advantage by passing that to an {% if %}
statement.
{# Iterate over all models contained in graph.nodes #}
{% for model in graph.nodes.values() %}
{# Check if model contains update_emails meta config #}
{% if model.config.meta.update_emails %}
{# Extract model properties #}
{% set model_name = model.name %}
{% set enabled_value = model.config.meta.update_emails.enabled %}
{# Output model properties to the console #}
{{ log(model_name ~ ' has update_emails meta set to ' ~ enabled_value, info=True) }}
{% endif %}
{% endfor %}
✌️ Method 2: Use get() to safely access values
The first method is great for some cases but not all. Sometimes we need to retrieve values without knowing in advance what’s going to be in them and then execute logic based on those values. Or we’ve got many nested key-value pairs, all of which are optional. Remember, the meta config is infinitely flexible! It becomes impractical to write nested {% if %}
statements for every possible combination of keys.
Instead, we can use the get() function which returns (1) the value of an item given a key, and (2) returns a default variable if there is no key! For example:
If we access
model.config.meta.update_emails.enabled
for a model that doesn’t have this, it’ll error outIf we run
model.config.meta.update_emails.get(‘enabled’, false)
for a model that doesn't have this, it’ll returnfalse
Let’s use get()
to build a list of models whose emails need to be updated. This is the part where we're going to use it:
{% if model.config.meta.update_emails.get("enabled", false) == true %}
{% do list_of_models_to_update.append(model) %}
{% endif %}
In many cases, it's safer than {% if model.config.meta.update_emails %}
because the dot notation approach simply tries to access the update_emails
key – and doesn’t check if there’s a proper enabled
key-value attribute within it. After all, the unique_emails
key may contain wrong or missing values.
🏗️ Building a “find these tables” macro
Using both methods to access meta config in the graph context variable, we can now build our macro! It will navigate the graph, find models containing the update_emails
meta config, and return a list of those models.
{% macro find_email_models() %}
{# Run macro only in execution mode #}
{% if execute %}
{# Communicate to user what is happening #}
{{ log("Searching for models to update emails", info=True) }}
{# Instantiate empty list variable #}
{% set models_to_update = [] %}
{# Iterate over all models contained in graph.nodes #}
{% for model in graph.nodes.values() %}
{# Get model name for current iteration #}
{% set current_model = model.name %}
{# Use get() to be safe if the property does not exist for a model #}
{% set update_emails_config = model.config.meta.get('update_emails', {}) %}
{# Check if model contains update_emails meta config #}
{% if update_emails_config %}
{# Extract model properties #}
{% set enabled_value = update_emails_config.get("enabled", "not set") %}
{# Output model properties to the console #}
{{
log(
"Model " ~ current_model ~ " has property update_emails.enabled: " ~ enabled_value, info=True
)
}}
{# If enabled, append model name to list of models to update #}
{% if enabled_value == true %}
{% do models_to_update.append(current_model) %}
{% endif %}
{% endif %}
{% endfor %}
{# Return list of all models to update #}
{{ return(models_to_update) }}
{% endif %}
{% endmacro %}
📩 Let's update those emails
Phew! After all that work, you still need to update the models themselves. You'll want a second macro to do this, and it should:
Calls the
find_email_models()
macro – the one we just wrote above – and saves its results to a list calledmodels_to_update
Grab all properties of all models from the graph context variable and save it to
all_models
Loop through each model in
models_to_update
:From
all_models
, extract the current model's propertiesMake sure it exists in the database by checking its source relation (if we attempt to run an UPDATE statement on a non-existent table, it will error)
Run SQL to update the table
Done!
The handy thing about extracting model properties in advance is that you have the entire meta config ready to go if you're using several properties.
{% macro update_email_models() %}
{# Run macro only in execution mode #}
{% if execute %}
{{ log("Beginning update of emails", info=True)}}
{# Call helper macro to find all models that need updating #}
{% set models_to_update=find_email_models() %}
{# Grab all models and their properties; we will pull out what we need #}
{% set all_models = graph.nodes.values() %}
{% for model_name in models_to_update %}
{# Extract properties of the current model in this iteration #}
{% set model_properties = (all_models | selectattr('name', 'equalto', model_name) | list).pop() %}
{# Confirm model actually exists in the database before we try to update it #}
{% set model_db = model_properties.database %}
{% set model_schema = model_properties.schema %}
{% set source_relation = adapter.get_relation(
database=model_db,
schema=model_schema,
identifier=model_name
) %}
{% if source_relation %}
{# The actual SQL query to run #}
{% set sql %}
begin;
update {{ source_relation }}
set email = parent.email
from {{ this }} as parent
where {{ source_relation }}.user_id = parent.user_id
and {{ source_relation }}.email != parent.email;
commit;
{% endset %}
{{ log("Updating emails in " ~ model_name, info=True) }}
{% do run_query(sql) %}
{{ log("Updated emails in " ~ source_relation, info=True) }}
{% else %}
{# If model does not exist, raise error #}
{% do exceptions.raise_compiler_error(model_db ~ "." ~ model_schema ~ "." ~ model_name ~ " doesn't exist! Skipping any further email updates") %}
{% endif %}
{% endfor %}
{# Let the user know we are all done! #}
{{ log("Finished updating emails", info=True)}}
{% endif %}
{% endmacro %}
🪝 Now the post-hook
Last, add that update_email_models()
macro as a post-hook to the parent model! This is the easiest step of all. This goes on top of your parent model definition:
{{
config(
post_hook="{{ update_email_models() }}",
materialized='table'
)
}}
select * from {{ ref('source_data') }}
When you’re orchestrating this for the first time, be mindful that the child models may not exist yet, so you may want to deploy child models with the update_emails
meta config set to false
first, then enable it later.
🪩 Conclusion
In this real-world example, where email address data changed frequently throughout the day, we wanted the child models to reflect the latest information. Using the meta config and graph
variable, we can successfully ensure the freshest data exists in the child models without increasing build frequency or using alternative materializations.
I'm sure this is just one among many unique and interesting use cases for leveraging these advanced dbt features -- I'd love to hear more examples out there! Shout out to Gemma, Dan, and Jared for creative, technical, and moral support in working with these gnarly lines of Jinja and JSON code. 🫶