Automatically add new column to incremental (or other type)
Asked Answered
P

1

6

I need some wisdom for a new DBT use case that I am trying to solve. I am pretty new to DBT and not sure what is the most efficient DBT way. We are using snowflake as our DWH.

Problem

We have a lot of incremental models that are managed with DBT. Lately, we had the need to add a new column to all models. What would be the most efficient DBT way to do it? Should we override the incremental macro script? (I found this for snowflake.) I assume that the last resort will be to add the new column manually to each model.

Penetrant answered 18/12, 2021 at 12:46 Comment(0)
G
5

You can either --full-refresh all the incremental models or perform this schema migration outside of dbt.

I would recommend using --full-refresh if you are able to. since --full-refresh rebuilds the table, it takes care of the schema changes and the historical values of the new column.

On the current version of dbt v0.21.0, a new incremental setting was introduced, on_schema_change. You can set it to append_new_columns

Quoting some relevant sections from the documentation:

New on_schema_change config in dbt version v0.21.0 Incremental models can now be configured to include an optional on_schema_change parameter to enable additional control when incremental model columns change. These options enable dbt to continue running incremental models in the presence of schema changes, resulting in fewer --full-refresh scenarios and saving query costs.

append_new_columns: Append new columns to the existing table. Note that this setting does not remove columns from the existing table that are not present in the new data.

Note: None of the on_schema_change behaviors backfill values in old records for newly added columns. If you need to populate those values, we recommend running manual updates, or triggering a --full-refresh.

If --full-refresh is not an option or you're on an older dbt version, then your schema migration would have to be done manually.

The steps are:

  1. add the new column via an alter table alter table my_incremental_table add column new_column_name data_type
  2. run an update query to hydrate the new column
  3. edit the dbt model for my_incremental_table to add new_column_name to the end of the select query's list of columns

This will work because dbt is stateless, but as this is a manual operation, I don't recommend this if you can avoid it.

note also, that if you use the on_schema_change method, you'd still need to do a backfill of the new column manually.

Gudgeon answered 19/12, 2021 at 18:29 Comment(3)
Thanks @HaleemurAli. Not sure I understand how --full-refresh will help this situation. Also, alter table add column, I am not sure if thats an option since we need to add a Jinja logic column for example {% if var('id') %} '{{ var("id") }}' {% else %} null {% endif %} as id and since we've got a few dozens I do not wish to edit each .sql file. I found this The --full-refresh flag will force dbt to drop cascade the existing table before rebuilding it. so I am not sure I want to drop the existing tablePenetrant
@ShayHa, I'm not sure what the jinja-logic column is doing. are you passing in a variable through the CLI called id and determines what is used to build the id column. If yes, then I think you may have a different data modelling issue as well, though I can't be sure with the information that I have. if you're on the most recent version of dbt, there's a new parameter for incremental models to control what happens when a column gets added on_schema_change, that automates the schema change on the incremental model by introspecting the defining query & the database schema. answer updated.Gudgeon
Thanks. I managed to understand how to override the ``` incremental.sql``` macro and I will add my logic there so every run will insert another column to all incremental models.Penetrant

© 2022 - 2024 — McMap. All rights reserved.