Not sure if I followed 100% your data structure, but let's say it's similar to this:
{
"properties": {
"property1": {
"column1": "...",
"column2": "...",
"column3": "...",
"value": "my value 1.0"
},
"property2": {
"column1": "...",
"column2": "...",
"column3": "...",
"value": "my value 2.0"
},
"propertyX": {
"column1": "...",
"column2": "...",
"column3": "...",
"value": "my value 3.0"
}
}
}
As you mention, you need to use set
to create variables and to be able to manipulate your data. Personally, I like to create different variables to handle the query statement
, the query result
, and the query values
. So following this strategy you would have something like this:
{% set data_structure_query %}
select properties from src
{% endset %}
{% set results = run_query(data_structure_query) %}
{% set properties = results.columns[0].values() %}
Note that results.columns[0].values()
will bring the data of the first column of your query which in this case it's the properties
.
The .values()
get the values of the column as a tuple, where the items most of the time is defined as a string
. So in order to access the properties of your data, you will have to deserialize the json string into a Python object, e.g. dict
. To do that, you need to use the fromjson
method:
...
{% set properties = results.columns[0].values() %}
{% set properties_dict = fromjson(properties[0]) %}
...
Assuming your query return only one row with the JSON format, I specified the properties[0]
to access the first row of the result query.
Before jumping to the next step, it's important to know that dbt has a jinja variable that informs us when dbt is in the "execute mode". That's a thing we need to worry about since it can raise issues to build our models. In short, any jinja that relies on a result being returned from the database will throw an error.
In your case, the results
variable is depending on a value that needs to be executed in the database which means if you just try to run the model most likely you'll get an issue with Compilation Error
. To avoid that, you need to add an if condition
to check if dbt is in "execute mode" or not:
...
{% set results = run_query(data_structure_query) %}
{% if execute %}
{% set properties = results.columns[0].values() %}
{% set properties_dict = fromjson(properties[0]) %}
{% else %}
{% set properties = [] %}
{% set properties_dict = [] %}
{% endif %}
...
Finally, you can proceed with a loop
to build your columns:
select
{%- for property in properties_dict.properties %}
{{ property }}.value
{%- if not loop.last %},{% endif -%}
{%- endfor %}
from
...
This will be compiled to:
select
property1.value,
property2.value,
propertyX.value
from
...
If you want to access the values for each column, then:
select
{%- for property in properties_dict.properties %}
'{{ properties_dict.properties[property].value }}'
{%- if not loop.last %},{% endif -%}
{%- endfor %}
from
...
Which will be compiled to:
select
'my value',
'my value 1.0',
'my value 2.0'
from
...
Might be worth taking a look in your database/warehouse and check if there is any internal function that handles semi-structured data. This can help you with the logic too. For instance, Snowflake has the lateral flatten
that does a similar behavior to split the properties into multiple rows.
For debug purposes, I recommend to compile
your model and use the logs ({{ log('my message', info=True) }}
) to understand how dbt/jinja is processing the data. Some of the code I provided might change depending on the output of your query.
Some helpful links:
https://docs.getdbt.com/reference/dbt-jinja-functions/run_query
https://docs.getdbt.com/reference/dbt-jinja-functions/execute
https://docs.getdbt.com/reference/dbt-jinja-functions/fromjson/
https://docs.getdbt.com/tutorial/using-jinja