How do I loop through nested structs using Jinja in DBT?
Asked Answered
L

2

6

I'm trying to build a model in DBT that flattens out a struct with name properties that contains about a hundred structs inside it (e.g. property1, property2, etc.), each with 5 different columns of which I want to extract one called value. I could type properties.propertyX.value 100 times, but I figured I could try to find a way to loop through each struct within properties and obtain propertyX.value within a SELECT statement by using Jinja, but I guess I'm either unfamiliar with the syntax or its limitations because I don't know what to do. I've tried:

WITH t as (
    SELECT
        properties
    FROM
        src
)
SELECT
    {% for property in properties %}
    {{property}}.value
    {% endfor %}
    {%- if not loop.last %},{% endif -%}
FROM
    t

but I realized I have to set properties as a variable and I don't really know how to do that in a way that it references the individual properties in the properties struct. Anyway, I'm quite lost and if someone could help I would be so grateful.

Lavena answered 14/6, 2021 at 16:30 Comment(0)
A
7

Not sure if I followed 100% your data structure, but let's say it's similar to this:

{
  "properties": {
    "property1": {
      "column1": "...",
      "column2": "...",
      "column3": "...",
      "value": "my value 1.0"
    },
    "property2": {
      "column1": "...",
      "column2": "...",
      "column3": "...",
      "value": "my value 2.0"
    },
    "propertyX": {
      "column1": "...",
      "column2": "...",
      "column3": "...",
      "value": "my value 3.0"
    }
  }
}

As you mention, you need to use set to create variables and to be able to manipulate your data. Personally, I like to create different variables to handle the query statement, the query result, and the query values. So following this strategy you would have something like this:

{% set data_structure_query %}
    select properties from src
{% endset %}

{% set results = run_query(data_structure_query) %}

{% set properties = results.columns[0].values() %}

Note that results.columns[0].values() will bring the data of the first column of your query which in this case it's the properties.

The .values() get the values of the column as a tuple, where the items most of the time is defined as a string. So in order to access the properties of your data, you will have to deserialize the json string into a Python object, e.g. dict. To do that, you need to use the fromjson method:

...

{% set properties = results.columns[0].values() %}

{% set properties_dict = fromjson(properties[0]) %}

...

Assuming your query return only one row with the JSON format, I specified the properties[0] to access the first row of the result query.

Before jumping to the next step, it's important to know that dbt has a jinja variable that informs us when dbt is in the "execute mode". That's a thing we need to worry about since it can raise issues to build our models. In short, any jinja that relies on a result being returned from the database will throw an error.

In your case, the results variable is depending on a value that needs to be executed in the database which means if you just try to run the model most likely you'll get an issue with Compilation Error. To avoid that, you need to add an if condition to check if dbt is in "execute mode" or not:

...

{% set results = run_query(data_structure_query) %}

{% if execute %}
    {% set properties = results.columns[0].values() %}
    {% set properties_dict = fromjson(properties[0]) %}
{% else %}
    {% set properties = [] %}
    {% set properties_dict = [] %}
{% endif %}

...

Finally, you can proceed with a loop to build your columns:

select
{%- for property in properties_dict.properties %}
    {{ property }}.value
    {%- if not loop.last %},{% endif -%}
{%- endfor %}
from 
...

This will be compiled to:

select
    property1.value,
    property2.value,
    propertyX.value
from
...

If you want to access the values for each column, then:

select
{%- for property in properties_dict.properties %}
    '{{ properties_dict.properties[property].value }}'
    {%- if not loop.last %},{% endif -%}
{%- endfor %}
from
...

Which will be compiled to:

select
    'my value',
    'my value 1.0',
    'my value 2.0'
from
...

Might be worth taking a look in your database/warehouse and check if there is any internal function that handles semi-structured data. This can help you with the logic too. For instance, Snowflake has the lateral flatten that does a similar behavior to split the properties into multiple rows.

For debug purposes, I recommend to compile your model and use the logs ({{ log('my message', info=True) }}) to understand how dbt/jinja is processing the data. Some of the code I provided might change depending on the output of your query.

Some helpful links:

https://docs.getdbt.com/reference/dbt-jinja-functions/run_query

https://docs.getdbt.com/reference/dbt-jinja-functions/execute

https://docs.getdbt.com/reference/dbt-jinja-functions/fromjson/

https://docs.getdbt.com/tutorial/using-jinja

Angi answered 15/6, 2021 at 6:1 Comment(0)
E
0

Assuming your data structure looks like this:

{
  "properties": [
    {
      "value": "Value 1"
    },
    {
      "value": "Value 2"
    },
    ...
  ]
}

You just need to move the .value lookup into the variable delimiters: {{ property.value }}

Entelechy answered 14/6, 2021 at 19:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.