Make existing bigquery table clustered

A

7

15

I have a quite huge existing partitioned table in bigquery. I want to make the table clustered, at least for the new partition.

From the documentation: https://cloud.google.com/bigquery/docs/creating-clustered-tables, it is said that we are able to Creating a clustered table when you load data and I have tried to load a new partition using clustering fields: job_config.clustering_fields = ["event_type"].

The load finished successfully, however it seems that the new partition is not clustered (I am not really sure how to check whether it is clustered or not, but when I query to that particular partition it would always scan all rows).

Is there a good way to make clustering field for an existing partitioned table?

Any comment, suggestion, or answer is well appreciated.

Thanks a lot, Yosua

Athena answered 29/8, 2018 at 12:29 Comment(0)

W

7

This answer is no longer valid / correct

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

You can only specify clustering columns when a table is created
So, obviously you cannot expect existing non-clustered table and especially just new partitions to become clustered

The "workaround" is to create new table to be properly partitioned / clustered and load data into it from Google Cloud Storage (GCS). You can export data from original table into GCS first for this so whole process will be free of charge

Wrightson answered 29/8, 2018 at 12:39 Comment(10)

Thanks! Yeah, I think I can use the work around for now :) – Athena 30/8, 2018 at 9:41

Hi, I have two table one is non-cluster and cluster table with same fields. How to copy the data from non-cluster table to cluster table if both table exist. ? – Circuitous 14/1, 2019 at 10:37

I think this answer is not completely correct. According to the official docs :

By calling the tables.update or tables.patch methods, table clustering specifications can be changed or removed. The set of clustered columns in a clustered table can also be changed to a different set of columns. When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward.

What I understood is that you can specify a clustering column also later. It's just that data added previously won't be clustered – Gabbard 14/5, 2020 at 13:37

the answer was given in August 2018 at which time things were different :o) – Wrightson 14/5, 2020 at 13:39

Is this working by now? I'm trying to update an existing table with clustering information, but the API responds with a "Cannot add, update, or remove clustering without partitioning field." error (that doesn't really make sense to me). – Spoilt 8/1, 2021 at 9:6

I have the strong suspicion that the docs are lying... after adding a partitioning column the error becomes "Cannot change partitioning/clustering spec for a partitioned/clustered table." - which imho contradicts the statement in the docs. – Spoilt 8/1, 2021 at 9:17

After some more fiddling, it works as expected but only on tables with field partitioning. I believe this might be a bug, because one can CREATE non-partitioned tables with clustering - so it should also be possible to CHANGE the clustering on those tables. I created an issue at github.com/googleapis/google-cloud-php/issues/3655 – Spoilt 8/1, 2021 at 10:18

As of Aug 15, 2022, it appears to allow clustering to be added to tables that only have _PARTITIONTIME partitioning (not field partitioning) too. I was able to add clustering to a table like this, that was created about 5 years ago, in my project. – Contradiction 15/8, 2022 at 19:41

@MikhailBerlyant - The workaround is the only option even now when we add a clustering field for an existing table. The old data can only be re-clustered if we re-create the table and load the table. Is there any other alternative? I couldn't find any document to do this even now. – Frankfort 1/3, 2023 at 10:14

"This answer is no longer valid / correct". So I would recommend deleting it – Keratosis 17/1, 2024 at 16:1

S

22

BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.

You can change the clustering specification in the following ways:

Call the tables.update or tables.patch API method.

Call the bq command-line tool's bq update command with the --clustering_fields flag.

Reference

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

Sailor answered 25/4, 2021 at 9:47 Comment(2)

I can confirm bq update --clustering_fields={clustering_variable} {dataset_name}.{table_name} does the trick. – Yolande 31/1, 2022 at 7:59

Also confirming that this command works like charm :) – Baltoslavic 31/3, 2023 at 21:54

M

14

What I missed from the above answers was a real example, so here it goes:

bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table

Where tool, qualifier and user_id are the three columns I want the table to be clustered by (in that order) and the table is my_dataset.my_table.

Megasporophyll answered 28/4, 2022 at 5:59 Comment(2)

+1 thank you. I tried adding the project-id and it did not like it... I needed to 'gcloud config set project PROJECT_ID' and then run the command from the correct project. – Lucianolucias 23/6, 2023 at 6:56

The command to use another project is: bq update --project_id=the_project_to_run_from --clustering_fields=tool,qualifier,user_id my_project:my_dataset.my_table In the resource part, a colon separates the project from the dataset and table. – Megasporophyll 27/6, 2023 at 12:26

W

7

This answer is no longer valid / correct

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

You can only specify clustering columns when a table is created
So, obviously you cannot expect existing non-clustered table and especially just new partitions to become clustered

The "workaround" is to create new table to be properly partitioned / clustered and load data into it from Google Cloud Storage (GCS). You can export data from original table into GCS first for this so whole process will be free of charge

Wrightson answered 29/8, 2018 at 12:39 Comment(10)

Thanks! Yeah, I think I can use the work around for now :) – Athena 30/8, 2018 at 9:41

Hi, I have two table one is non-cluster and cluster table with same fields. How to copy the data from non-cluster table to cluster table if both table exist. ? – Circuitous 14/1, 2019 at 10:37

I think this answer is not completely correct. According to the official docs :

By calling the tables.update or tables.patch methods, table clustering specifications can be changed or removed. The set of clustered columns in a clustered table can also be changed to a different set of columns. When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward.