How to choose the latest partition of a bigquery table in DBT without scanning the whole table?
Asked Answered
D

3

7

I'm trying to select the latest partition from a BigQuery table without scanning the whole table in a DBT model in order to save query costs.

DBT doesnt allow using semicolons in a data model so using the DECLARE+SET scripting statements doesn't work as suggested here.

DBT has a sql_header macro which allows setting some variables in the header but that header doesn't accept references to a data model or at least the following code is not compiling:

{{ config(
  sql_header="  DECLARE latest_partition_date DATE;
  DECLARE latest_load_timestamp TIMESTAMP;
  SET latest_partition_date = (SELECT MAX(_PARTITIONDATE) FROM {{ ref("model") }} );
  SET latest_load_timestamp = (SELECT MAX(loaded_at) FROM {{ ref("model") }} WHERE _PARTITIONDATE = latest_partition_date);"
) }}

-- set the main query
SELECT * FROM {{ ref("model") }}
WHERE 
-- Select the latest partition to reduce 'Bytes processed' for loading the query.
_PARTITIONDATE = latest_partition_date
-- Select the latest load within the latest partition to get only one duplicate of data.
AND loaded_at = latest_load_timestamp

I need to solve this in standard SQL.

Other methods that were suggested included setting WHERE _PARTITIONDATE = CURRENT_DATE() or using DATE_SUB(CURRENT_DATE(), 3) but those don't satisfy because data load breakages are unpredictable and only dynamically selecting the latest would work here. Is that possible?

Dulcinea answered 19/2, 2021 at 11:58 Comment(0)
D
2

Since the original question was working with dates, the right datatype conversion was the missing piece.

In the end I figured that the conversion to the right datatype needs to be done within jinja and not with SQL for the queries to accept right variables. Also, {{ max_date }} needed quotes.

The final solution that I got working was this:


{%- call statement('max_partition_date_query', True) -%}
  SELECT MAX(_PARTITIONDATE) as max_partition_date FROM {{ ref('model') }}
{%- endcall -%}

{%- set max_timestamp = load_result('max_partition_date_query')['data'][0][0] -%}
{%- set max_date = max_timestamp.strftime('%Y-%m-%d') -%}

select * FROM {{ ref('model') }}
WHERE _PARTITIONDATE = '{{ max_date }}'
Dulcinea answered 22/2, 2021 at 16:4 Comment(0)
C
2

You could do it in another query and get the result as a variable, something like this:

    {%- call statement('max_partition', fetch_result=True) -%}
      SELECT MAX(_PARTITIONDATE) FROM {{ ref("model") }} )
    {%- endcall -%}

    {%- set max_date = load_result('max_partition')['data'][0][0] -%}

    SELECT * FROM {{ ref("model") }}
    WHERE 
    _PARTITIONDATE = {{ max_date }}
Claytonclaytonia answered 19/2, 2021 at 13:42 Comment(3)
When I run this i get the error message: No matching signature for operator = for argument types: DATE, INT64. Supported signature: ANY = ANY. It's probably because _PARTITIONDATE is type DATE but it seems to be evaluated as and INT. Any idea how to fix that?Dulcinea
Then probably you need to cast it in order to retrieve the value, SELECT CAST(DATE(MAX(_PARTITIONDATE)) AS STRING) FROM {{ ref("model") }} probably will work to retrieve it, maybe you'll need to cast it again to use it in the main query. I have a similar example working using a Timestamp column, but I haven't tried with _PARTITIONDATEMisapply
I tried casting but at compilation bgiquery just ignored that and still continued forcing the variable to be integer. the solution I got working required conversion within jinjaDulcinea
D
2

Since the original question was working with dates, the right datatype conversion was the missing piece.

In the end I figured that the conversion to the right datatype needs to be done within jinja and not with SQL for the queries to accept right variables. Also, {{ max_date }} needed quotes.

The final solution that I got working was this:


{%- call statement('max_partition_date_query', True) -%}
  SELECT MAX(_PARTITIONDATE) as max_partition_date FROM {{ ref('model') }}
{%- endcall -%}

{%- set max_timestamp = load_result('max_partition_date_query')['data'][0][0] -%}
{%- set max_date = max_timestamp.strftime('%Y-%m-%d') -%}

select * FROM {{ ref('model') }}
WHERE _PARTITIONDATE = '{{ max_date }}'
Dulcinea answered 22/2, 2021 at 16:4 Comment(0)
S
1

The approaches above will work for https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time but not for https://cloud.google.com/bigquery/docs/partitioned-tables#date_timestamp_partitioned_tables.

The other option is to utilise https://cloud.google.com/bigquery/docs/information-schema-tables to get the latest partition for the date/timestamp partitioned table.

{% set last_snapshot_at_query %}
SELECT 
    PARSE_TIMESTAMP("%Y%m%d", MAX(PARTITION_ID)) 
FROM
    {{ source("change-me-source", "INFORMATION_SCHEMA.PARTITIONS") }} 
WHERE
    table_name = "change-me-table"
{% endset %}

{% if execute %}
{% set last_snapshot_at = run_query(last_snapshot_at_query).columns[0][0] %}
{% endif %}

SELECT
    *
FROM
    {{ source('change-me-source', 'change-me-table') }},
WHERE 
    snapshot_at = "{{ last_snapshot_at }}"
Stylist answered 9/8, 2022 at 20:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.