how to list ALL table sizes in a project
Asked Answered
A

6

6

Is there a way to list all the table size in BigQuery?

I know a command like this:

select 
  table_id,
  sum(size_bytes)/pow(10,9) as size
from
  certain_dataset.__TABLES__
group by 
  1

But I want to know all the tables in ALL datasets.

Thanks

Allynallys answered 18/10, 2018 at 9:7 Comment(0)
O
12

This problem got a little easier with the introduction of dynamic SQL to BigQuery scripting in 2020. Now, we can build up a query dynamically and execute it via EXECUTE IMMEDIATE.

Somthing like this would do for most circumstances where all the datasets are in region-us:

DECLARE dataset_names ARRAY<STRING>;

SET dataset_names = (
    SELECT ARRAY_AGG(SCHEMA_NAME) FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
);

EXECUTE IMMEDIATE (
    SELECT STRING_AGG(
        (SELECT """
            SELECT project_id, dataset_id, table_id, row_count, size_bytes 
            FROM `""" || s || 
            """.__TABLES__`"""), 
            " UNION ALL ")
    FROM UNNEST(dataset_names) AS s);

If there are a large number of datasets then this may return a rate limit error when trying to read all the meta data concurrently.

If this happens then we can then fall back on a "batched" approach that's a little more complicated to read and slower/less-efficient but still gets the job done:

DECLARE dataset_names ARRAY<STRING>;
DECLARE batch ARRAY<STRING>;
DECLARE batch_size INT64 DEFAULT 25;

CREATE TEMP TABLE results (
    project_id STRING,
    dataset_id STRING,
    table_id STRING,
    row_count INT64,
    size_bytes INT64
);

SET dataset_names = (
        SELECT ARRAY_AGG(SCHEMA_NAME) 
        FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
    );

LOOP
    IF ARRAY_LENGTH(dataset_names) < 1 THEN 
        LEAVE;
    END IF;

    SET batch = (
        SELECT ARRAY_AGG(d) 
        FROM UNNEST(dataset_names) AS d WITH OFFSET i 
        WHERE i < batch_size);

    EXECUTE IMMEDIATE (
        SELECT """INSERT INTO results """ 
            || STRING_AGG(
                    (SELECT """
                        SELECT project_id, dataset_id, table_id, row_count, size_bytes 
                        FROM `""" || s || """.__TABLES__`"""), 
                " UNION ALL ")
        FROM UNNEST(batch) AS s);

    SET dataset_names = (
        SELECT ARRAY_AGG(d) 
        FROM UNNEST(dataset_names) AS d
        WHERE d NOT IN (SELECT * FROM UNNEST(batch)));
        
END LOOP; 

SELECT * FROM results;
Osset answered 10/2, 2021 at 14:51 Comment(0)
E
7

At the moment there's no possible way to do that in a single query, but you can do it with a script, here is my python script that prints out the list:

from google.cloud import bigquery

client = bigquery.Client()

datasets = list(client.list_datasets())
project = client.project

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))

        query_job = client.query("select table_id, sum(size_bytes)/pow(10,9) as size from `"+dataset.dataset_id+"`.__TABLES__ group by 1")

        results = query_job.result()
        for row in results:
            print("\tTable: {} : {}".format(row.table_id, row.size))

else:
    print('{} project does not contain any datasets.'.format(project))
Ereshkigal answered 18/10, 2018 at 11:6 Comment(2)
Hello there Alex. How do you provide credentials for this code? When I hit this line, 'client = bigquery.Client()', I get this message, 'DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see developers.google.com/accounts/docs/….' Thanks.Gery
Oh, I see how it works. You create a .json file and reference it like this: client = bigquery.Client.from_service_account_json('C:/your_path_here.123456789.json')Gery
P
2

This can be done with EXECUTE IMMEDIATE which needs a string input.
Please note that you need to set the correct region where your datasets are, in my case eu:

DECLARE dataset_names ARRAY<STRING>;
DECLARE query_first_part STRING;

SET dataset_names = (SELECT ARRAY_AGG(SCHEMA_NAME) FROM `region-eu.INFORMATION_SCHEMA.SCHEMATA`);
SET query_first_part = """
            SELECT 
                project_id, 
                dataset_id,
                table_id,
                size_bytes / pow(1024, 3) size_gb,
            FROM `""";

EXECUTE IMMEDIATE (
    SELECT 
        STRING_AGG(
            query_first_part || dataset_name || ".__TABLES__`", 
            " UNION ALL "
        )
    FROM UNNEST(dataset_names) AS dataset_name
);

More info on EXECUTE IMMEDIATE:
https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language#execute_immediate

Palikar answered 5/5, 2022 at 14:53 Comment(0)
B
0

If you would like a python script that will crawl all the tables and capture all metadata such as column types, table size, descriptions etc I have a script on my github account that will do it.

It saves the output to a Bigquery table, CSV or JSON depending on what you need.

I created this for a customer who had to audit tens of thousands of tables but I use to find specific table features in the public datasets.

https://github.com/go-dustin/gcp_data_utilities/tree/master/BigQuery/meta_data_crawler

Bustard answered 25/1, 2021 at 22:47 Comment(0)
S
0
from google.cloud import bigquery
import os  

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOURGCPSERVICEACCOUNTKEY.json"
GCP_PROJECT_ID = "YOURGCPPROJECT"

client = bigquery.Client(project=GCP_PROJECT_ID)
datasets = list(client.list_datasets())
project = client.project

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))

        query_job = client.query(
            f"""
            SELECT
            table_id,
            size_bytes/pow(1024,3) AS size_GB,
            FROM `{GCP_PROJECT_ID}.{dataset.dataset_id}`.__TABLES__
            """
        )

        results = query_job.result()
        for row in results:
            print(f"\tTable: {row.table_id} : {row.size_GB} GB")

else:
    print('project does not contain any datasets.'.format(project))

Replying on top on @Alex's answer, this answer will get more accurate number for size in GB. 1kb should be represented as 1024 byte

Staccato answered 4/8, 2021 at 7:49 Comment(0)
U
0

You can use information schema TABLE_STORAGE views to list all of the table in a specific project https://cloud.google.com/bigquery/docs/information-schema-table-storage

select
 table_schema,
 table_name,
 total_logical_bytes
from `my-project`.`region-us-central1`.INFORMATION_SCHEMA.TABLE_STORAGE

Change the project and region based on yours.

Urfa answered 16/7, 2024 at 8:16 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.