Make existing bigquery table clustered
Asked Answered
A

7

15

I have a quite huge existing partitioned table in bigquery. I want to make the table clustered, at least for the new partition.

From the documentation: https://cloud.google.com/bigquery/docs/creating-clustered-tables, it is said that we are able to Creating a clustered table when you load data and I have tried to load a new partition using clustering fields: job_config.clustering_fields = ["event_type"].

The load finished successfully, however it seems that the new partition is not clustered (I am not really sure how to check whether it is clustered or not, but when I query to that particular partition it would always scan all rows).

Is there a good way to make clustering field for an existing partitioned table?

Any comment, suggestion, or answer is well appreciated.

Thanks a lot, Yosua

Athena answered 29/8, 2018 at 12:29 Comment(0)
W
7

This answer is no longer valid / correct

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec


You can only specify clustering columns when a table is created
So, obviously you cannot expect existing non-clustered table and especially just new partitions to become clustered

The "workaround" is to create new table to be properly partitioned / clustered and load data into it from Google Cloud Storage (GCS). You can export data from original table into GCS first for this so whole process will be free of charge

Wrightson answered 29/8, 2018 at 12:39 Comment(10)
Thanks! Yeah, I think I can use the work around for now :)Athena
Hi, I have two table one is non-cluster and cluster table with same fields. How to copy the data from non-cluster table to cluster table if both table exist. ?Circuitous
I think this answer is not completely correct. According to the official docs : By calling the tables.update or tables.patch methods, table clustering specifications can be changed or removed. The set of clustered columns in a clustered table can also be changed to a different set of columns. When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward. What I understood is that you can specify a clustering column also later. It's just that data added previously won't be clusteredGabbard
the answer was given in August 2018 at which time things were different :o)Wrightson
Is this working by now? I'm trying to update an existing table with clustering information, but the API responds with a "Cannot add, update, or remove clustering without partitioning field." error (that doesn't really make sense to me).Spoilt
I have the strong suspicion that the docs are lying... after adding a partitioning column the error becomes "Cannot change partitioning/clustering spec for a partitioned/clustered table." - which imho contradicts the statement in the docs.Spoilt
After some more fiddling, it works as expected but only on tables with field partitioning. I believe this might be a bug, because one can CREATE non-partitioned tables with clustering - so it should also be possible to CHANGE the clustering on those tables. I created an issue at github.com/googleapis/google-cloud-php/issues/3655Spoilt
As of Aug 15, 2022, it appears to allow clustering to be added to tables that only have _PARTITIONTIME partitioning (not field partitioning) too. I was able to add clustering to a table like this, that was created about 5 years ago, in my project.Contradiction
@MikhailBerlyant - The workaround is the only option even now when we add a clustering field for an existing table. The old data can only be re-clustered if we re-create the table and load the table. Is there any other alternative? I couldn't find any document to do this even now.Frankfort
"This answer is no longer valid / correct". So I would recommend deleting itKeratosis
S
22

BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.

You can change the clustering specification in the following ways:

Call the tables.update or tables.patch API method.

Call the bq command-line tool's bq update command with the --clustering_fields flag.

Reference

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

Sailor answered 25/4, 2021 at 9:47 Comment(2)
I can confirm bq update --clustering_fields={clustering_variable} {dataset_name}.{table_name} does the trick.Yolande
Also confirming that this command works like charm :)Baltoslavic
M
14

What I missed from the above answers was a real example, so here it goes:

bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table

Where tool, qualifier and user_id are the three columns I want the table to be clustered by (in that order) and the table is my_dataset.my_table.

Megasporophyll answered 28/4, 2022 at 5:59 Comment(2)
+1 thank you. I tried adding the project-id and it did not like it... I needed to 'gcloud config set project PROJECT_ID' and then run the command from the correct project.Lucianolucias
The command to use another project is: bq update --project_id=the_project_to_run_from --clustering_fields=tool,qualifier,user_id my_project:my_dataset.my_table In the resource part, a colon separates the project from the dataset and table.Megasporophyll
W
7

This answer is no longer valid / correct

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec


You can only specify clustering columns when a table is created
So, obviously you cannot expect existing non-clustered table and especially just new partitions to become clustered

The "workaround" is to create new table to be properly partitioned / clustered and load data into it from Google Cloud Storage (GCS). You can export data from original table into GCS first for this so whole process will be free of charge

Wrightson answered 29/8, 2018 at 12:39 Comment(10)
Thanks! Yeah, I think I can use the work around for now :)Athena
Hi, I have two table one is non-cluster and cluster table with same fields. How to copy the data from non-cluster table to cluster table if both table exist. ?Circuitous
I think this answer is not completely correct. According to the official docs : By calling the tables.update or tables.patch methods, table clustering specifications can be changed or removed. The set of clustered columns in a clustered table can also be changed to a different set of columns. When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward. What I understood is that you can specify a clustering column also later. It's just that data added previously won't be clusteredGabbard
the answer was given in August 2018 at which time things were different :o)Wrightson
Is this working by now? I'm trying to update an existing table with clustering information, but the API responds with a "Cannot add, update, or remove clustering without partitioning field." error (that doesn't really make sense to me).Spoilt
I have the strong suspicion that the docs are lying... after adding a partitioning column the error becomes "Cannot change partitioning/clustering spec for a partitioned/clustered table." - which imho contradicts the statement in the docs.Spoilt
After some more fiddling, it works as expected but only on tables with field partitioning. I believe this might be a bug, because one can CREATE non-partitioned tables with clustering - so it should also be possible to CHANGE the clustering on those tables. I created an issue at github.com/googleapis/google-cloud-php/issues/3655Spoilt
As of Aug 15, 2022, it appears to allow clustering to be added to tables that only have _PARTITIONTIME partitioning (not field partitioning) too. I was able to add clustering to a table like this, that was created about 5 years ago, in my project.Contradiction
@MikhailBerlyant - The workaround is the only option even now when we add a clustering field for an existing table. The old data can only be re-clustered if we re-create the table and load the table. Is there any other alternative? I couldn't find any document to do this even now.Frankfort
"This answer is no longer valid / correct". So I would recommend deleting itKeratosis
S
2

I can confirm that the instructions here work.

There was another answer that said that part 2 doesn't work. To use the example from the existing answer for Part 1:

Enter this into the bq command line tool:

bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table

Part 2 works by using this code and running a query (not in the bq command line):

UPDATE 'my_project.my_dataset.my_table' SET tool=tool, qualifier=qualifier, user_id=user_id WHERE true

You have to split up the clustering columns (my guess it that they were trying to do "tool,qualifier,user_id=tool,qualifier,user_id" which is what it fills in automatically on the link above).

Stratiform answered 4/10, 2023 at 14:3 Comment(0)
B
0

Updating the BigQuery clustering fields by below command are only working from that time onward of dataset arrival by insert/update only.

 bq update --clustering_fields=CLUSTER_COLUMN DATASET.ORIGINAL_TABLE 

From the above command you can also changed the set of clustered columns in a clustered table to a different set of columns

Reference: Modify clustering specification Point 1 of this documentation working fine but the point 2, does not work at all.

Barkeeper answered 9/9, 2023 at 1:10 Comment(0)
K
0

After updating the clustering spec you need to run a DML query to overwrite the clustering field in order to apply clustering to that column [1]. This is the step that will process data and may incur cost.

bq update --clustering_fields=[CLUSTER_COLUMN] [DATASET.ORIGINAL_TABLE] 
bq query --nouse_legacy_sql "UPDATE [DATASET.ORIGINAL_TABLE] SET [CLUSTER_COLUMN]=[CLUSTER_COLUMN] WHERE true"
Keratosis answered 17/1, 2024 at 16:7 Comment(0)
E
0

I used 3 queries to do so.

CREATE TABLE `my_project.my_dataset.my_table_clustered`
PARTITION BY DATE(datetime) CLUSTER BY type AS SELECT * FROM `my_project.my_dataset.my_table` WHERE DATE(datetime) >= "2013-07-01";
DROP TABLE `my_project.my_dataset.my_table`;
ALTER TABLE `my_project.my_dataset.my_table_clustered` RENAME TO my_table;
Ernest answered 17/7, 2024 at 19:10 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.