How to choose the latest partition in BigQuery table?
Asked Answered
E

9

19

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.

I've tried (as far as I know, BigQuery does not support QUALIFY):

SELECT col FROM table WHERE _PARTITIONTIME = (
  SELECT pt FROM (
    SELECT pt, RANK() OVER(ORDER by pt DESC) as rnk FROM (
      SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1)
    )
  )
  WHERE rnk = 1
);

But this does not work and reads all rows.

SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')

where 'YYYY-MM-DD' is a specific date does work.

However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?

Elliottellipse answered 27/9, 2016 at 20:34 Comment(7)
can you clarify - how it doesn't work? what is wrong with it?Norwood
I've commented below, but both queries do not read from only the latest partition. When I explicitly type the last _partitiontime, the query reads 18 MB. But when I try either of the two queries below, they read 15.4 GB.Elliottellipse
i got it now. thank you for update! interesting to understand why. @felipe-hoffa mentioned something about caching - but this was not clearNorwood
I can first run the subquery separately, then explicitly type the result of the subquery as a workaround.Elliottellipse
yes. this way it works, but interesting why it does not work inline - in one queryNorwood
i think i am getting it now. obvious but this is my guess ... so when value is used - partitioning get's involved and bq engine knows what to scan and what not to; but when expression is used - it works totally different - partitioning per se is not really working (i think that is what Felipe meant by cache) so query in this case is treated as join so whole table is scanned and joined on result of expression. Still something that easily to be addressed by having that logic scripted - as you mentioned - first getting result of subquery and then use it in final queryNorwood
I am currently having the same problem. Is there any updated solution to this problem?Hap
N
18

October 2019 Update

Support for Scripting and Stored Procedures is now in beta (as of October 2019)

You can submit multiple statements separated with semi-colons and BigQuery is able to run them now

See example below

DECLARE max_date TIMESTAMP;
SET max_date = (
  SELECT MAX(_PARTITIONTIME) FROM project.dataset.partitioned_table`);

SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;

Update for those who like downvoting without checking context, etc.

I think, this answer was accepted because it addressed the OP's main question Is there a way I can pull data only from the latest partition in BigQuery? and in comments it was mentioned that it is obvious that BQ engine still scans ALL rows but returns result based on ONLY recent partition. As it was already mentioned in comment for question - Still something that easily to be addressed by having that logic scripted - first getting result of subquery and then use it in final query

Try

SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
  SELECT MAX(TIMESTAMP(partition_id))
  FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)  

or

SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
  SELECT MAX(_PARTITIONTIME) 
  FROM [dataset.partitioned_table]
)  
Norwood answered 27/9, 2016 at 20:47 Comment(5)
Note that this will stop query results caching (might impact in costs).Ecliptic
This answer addresses Is there a way I can pull data only from the latest partition in BigQuery? question! and it is obvious that engine still scans ALL rows but returns result based on ONLY recent partition. As it was already mentioned in comment for question - Still something that easily to be addressed by having that logic scripted - first getting result of subquery and then use it in final queryNorwood
The scripting method is great when you're manually exploring data, but doesn't work when using api clients and storing the results in a destination table. In those cases you will still have to run as two jobs with output of the first used as input to the second, unfortunately: configuration.query.destinationTable cannot be set for scriptsTalithatalk
@LarsHaugseth you could use an INSERT INTO or CREATE AND REPLACE in your script to get around not being able to set a destination tableChirrup
The bad part with DECLARE is that you cannot use it in a ViewFlanagan
S
9

Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.

As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:

Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.

So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.

The trick is to use a more-or-less-constant to compare to, instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by getting yesterdays partition like so:

SELECT * FROM [dataset.partitioned_table]
    WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

Sure, this isn't always the latest data, but in my case this happens to be close enough. Use INTERVAL 0 DAY if you want todays data, and don't care that the query will return 0 results for the part of the day where the partition hasn't been created yet.

I'm happy to learn if there is a better workaround to get the latest partition!

Shonda answered 28/10, 2018 at 11:30 Comment(0)
S
6

I found the workaround to this issue. You can use with statement, select last few partitions and filter out the result. This is I think better approach because:

  1. You are not limited by fixed partition date (like today - 1 day). It will always take the latest partition from given range.
  2. It will only scan last few partitions and not whole table.

Example with last 3 partitions scan:

WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME 
    FROM dataset.partitioned_table 
    WHERE  _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME from last_three_partitions 
WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)
Stilu answered 3/4, 2019 at 16:0 Comment(2)
you are limited not to today - 1 but to today - 3, feels like the same constraint.Interrogatory
Depends on your use case. For us, it is pretty common to have last day's partition being not available because of various reasons but we always have bit older data available. Thus 3 days solves our problem.Stilu
M
5

List all partitions with:

#standardSQL
SELECT
  _PARTITIONTIME as pt
FROM
  `[DATASET].[TABLE]`
GROUP BY 1

And then choose the latest timestamp.

Good luck :)

https://cloud.google.com/bigquery/docs/querying-partitioned-tables

Madisonmadlen answered 24/1, 2018 at 18:21 Comment(1)
This query processes 0 bytes when runPortia
W
3

A compromise that manages to query only a few partitions without resorting to scripting or failing with missing partitions for fixed dates.

WITH latest_partitions AS (
  SELECT *, _PARTITIONDATE AS date
  FROM `myproject.mydataset.mytable`
  WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
  *
FROM
  latest_partitions
WHERE
  date = (SELECT MAX(date) FROM latest_partitions)
Wilding answered 31/3, 2020 at 16:56 Comment(0)
P
1

You can leverage the __TABLES__ list of tables to avoid re-scanning everything or having to hope latest partition is ~3 days ago. I did the split and ordinal stuff to guard against in case my table prefix appears more than once in the table name for some reason.

This should work for either _PARTITIONTIME or _TABLE_SUFFIX.

select * from `project.dataset.tablePrefix*` 
where _PARTITIONTIME = (
    SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__` 
    where table_id like 'tablePrefix%'
    order by table_id desc limit 1)
Puppet answered 26/6, 2019 at 23:17 Comment(1)
FYI if you use a subquery in the "where" clause bigquery scans all partitions. cloud.google.com/bigquery/docs/…Chime
E
0

I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):

Mikhail's answer looks like this (working on public data):

SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)     
AND wiki='es' 
# 122.2 MB processed

But it seems the question wants something like this:

SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')     
AND wiki='es'
# 50.6 GB processed

... but for way less than 50.6GB

What you need now is some sort of scripting, to perform this in 2 steps:

max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')   

;

SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed

You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.

Ecliptic answered 10/9, 2019 at 1:21 Comment(0)
P
0

Building on the answer from Chase. If you have a table that requires you filter over a column, and you're receiving the error:

Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination

Then you can use:

SELECT
  MAX(_PARTITIONTIME) AS pt
FROM
  `myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL

Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing max to min.

Pocosin answered 30/11, 2022 at 14:15 Comment(0)
W
0

All these above comments will in some way hit the main tables to attempt to get the partition data. If you're running tb datasets, this will get very expensive.

SELECT
  MAX(partition_id) as max_partition_id

  FROM
  `your_project_name.your_dataset.INFORMATION_SCHEMA.PARTITIONS`
  
WHERE 
  table_name = 'my_table_name'


  -- should scan <10mb
  -- limited to 1000 rows so make sure you're using table_name filter

ref: https://cloud.google.com/bigquery/docs/information-schema-partitions

  • Always shows actual partition information
  • can be done in a declare + set statement before the query
  • DO NOT use this within a where statement because Bigquery will still scan all partitions for some reason.

EDIT

You will also need to coerce the data from partition_id to the partition type that your table will be in, since you'll likely see stuff like 20240530 for DAY partitions, even though the actual partition in your table will be a date.

-- partition_id to date
PARSE_DATE('%Y%m%d', max(partition_id)) as max_partition_id -- day

https://cloud.google.com/bigquery/docs/information-schema-partitions#examples examples have breakdowns of the patterns you should be looking for.

Wessex answered 30/5 at 1:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.