Extract BigQuery partitioned table
Is there a way to extract a complete BigQuery partitioned table with one command, so that the data of each partition is extracted into a separate folder of the format part_col=date_yyyy-mm-dd?

Since BigQuery partitioned tables can read files from Hive-style partitioned directories, is there a way to extract the data in a similar layout? I can extract each partition separately, however that is very cumbersome when I am extracting a lot of partitions.
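
For illustration, the layout I am after would look something like this (bucket and table names are hypothetical):

gs://my-bucket/my_table/part_col=2019-07-01/data-000000000000.csv
gs://my-bucket/my_table/part_col=2019-07-02/data-000000000000.csv
gs://my-bucket/my_table/part_col=2019-07-03/data-000000000000.csv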

Earring answered 2/7, 2019 at 14:38 Comment(1)
Seems like a feature request for the issue tracker. – Nadda

You could do this programmatically. For instance, you can export partitioned data by using a partition decorator such as table$20190801, and in the bq extract command you can use URI patterns for the GCS objects (see the worker-pattern example in the docs).

Since all objects will be within the same bucket, the folders are just a hierarchical illusion, so you can use URI patterns in the folder paths as well, just not in the bucket name.

So you could write a script that loops over the DATE value, with something like:

bq extract \
--destination_format [CSV, NEWLINE_DELIMITED_JSON, AVRO] \
--compression [GZIP, AVRO supports DEFLATE and SNAPPY] \
--field_delimiter [DELIMITER] \
--print_header [true, false] \
'[PROJECT_ID]:[DATASET].[TABLE]$[DATE]' \
gs://[BUCKET]/part_col=[DATE]/[FILENAME]-*.[csv, json, avro]
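
Filled in with hypothetical project, dataset, table, and bucket names, a single iteration would look like this (the single quotes keep the shell from treating $20190801 as a variable):

bq extract \
--destination_format CSV \
--compression GZIP \
--print_header true \
'my-project:my_dataset.my_table$20190801' \
gs://my-bucket/part_col=2019-08-01/export-*.csv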

You can't do it automatically with just a bq command. For this it would be better to raise a feature request as suggested by Felipe.

Greenman answered 1/8, 2019 at 17:9 Comment(4)
Thanks Neri, I have implemented this programmatically, but I was just curious whether we could do it using a single command. As confirmed by you and Felipe, such a feature does not exist, so I will raise a feature request. – Earring
Hi, when I run bq extract --destination_format=CSV test_dataset.test_partitiontime$20210716 gs://testbucket/part_col=20210716/test.csv, I get this output: Not found: Table test-net:temp.test_partitiontime0210716 was not found in location EU. My table is partitioned by date; what do you think the problem could be? PS: normally I have 7k rows for the 2021-07-16 date. – Honegger
Hi Emily, try escaping the $ with a backslash (\$20210716). You are probably using a bash shell, and $2 is being treated as a shell variable that expands to an empty value, which is why the table name in your error is missing the 2. – Earring
Can someone please share a script, if they have one, for exporting a BigQuery dataset based on the type of table? – Donative
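
To make the shell-expansion pitfall from the comments above concrete, here is a quick bash demonstration ($2 is an unset positional parameter, so it expands to nothing):

$ echo test_partitiontime$20210716
test_partitiontime0210716
$ echo 'test_partitiontime$20210716'
test_partitiontime$20210716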

Set the default project (the one containing test_dataset) using gcloud init before running the command below.

bq extract --destination_format=CSV 'test_partitiontime$20210716' gs://testbucket/20210716/test*.csv

This will create a folder named 20210716 inside testbucket and write the exported file(s) there.
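
If you prefer a non-interactive setup, a minimal alternative to gcloud init is gcloud config (assuming a hypothetical project ID my-project); the single quotes around the table spec in the command above likewise stop the shell from expanding $20210716:

gcloud config set project my-project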

Varioloid answered 4/2, 2022 at 11:16 Comment(0)

I created a script to iterate over a range of date partitions, in case it helps. It would be easy to convert it to take parameters:

#!/bin/bash
# Export one day-partition per iteration into a Hive-style date_col= folder.
input_start=2018-1-1
input_end=2018-3-1
project=myproject
dataset=mydataset
table=table_of_stuff

# Normalize the inputs to ISO dates (YYYY-MM-DD); abort if parsing fails.
startdate=$(date -I -d "$input_start") || exit 1
enddate=$(date -I -d "$input_end")     || exit 1

d="$startdate"
# Lexicographic < is safe here because ISO dates sort chronologically.
while [[ "$d" < "$enddate" ]]; do
  year_val=$(date --date="$d" +%Y)
  mon_val=$(date --date="$d" +%m)
  day_val=$(date --date="$d" +%d)
  # echo prints each bq command for review; drop the echo to execute it.
  echo bq extract --location=US --destination_format PARQUET --compression SNAPPY \
    "$project:$dataset.$table\$$year_val$mon_val$day_val" \
    "gs://your_bucket/table_archive/$year_val/$dataset/$table/date_col=$year_val-$mon_val-$day_val/*.parquet"
  d=$(date -I -d "$d + 1 day")
done
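
Because the loop only echoes the commands, you can review the output first and then pipe it into a shell to actually run the exports, e.g. (assuming the script is saved as export_partitions.sh):

./export_partitions.sh | bash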
Ruler answered 16/10, 2024 at 16:10 Comment(0)
