Copy table from one dataset to another in Google BigQuery

I intend to copy a set of tables from one dataset to another within the same project. I am executing the code in an IPython notebook.

I get the list of table names to be copied into the variable “value” using the code below:

import re  # needed for the name match below; "bq" here is Datalab's BigQuery module

dataset = bq.DataSet('test:TestDataset')  # avoid shadowing the built-in "list"

for x in dataset.tables():
    if re.match('table1(.*)', x.name.table_id):
        value = 'test:TestDataset.' + x.name.table_id

Then I tried using the “bq cp” command to copy a table from one dataset to another, but I cannot execute the bq command in the notebook:

!bq cp $value proj1:test1.table1_20162020

Note:

I checked the bigquery command for an associated copy command but could not find one.

Galatea answered 2/8, 2016 at 19:33 Comment(0)

I have created the following script to copy all the tables from one dataset to another, with a couple of validations.

from google.cloud import bigquery

client = bigquery.Client()

projectFrom = 'source_project_id'
datasetFrom = 'source_dataset'

projectTo = 'destination_project_id'
datasetTo = 'destination_dataset'

# Create dataset references from the google-cloud-bigquery client
dataset_from = client.dataset(dataset_id=datasetFrom, project=projectFrom)
dataset_to = client.dataset(dataset_id=datasetTo, project=projectTo)

for source_table_ref in client.list_dataset_tables(dataset=dataset_from):
    # Destination table reference
    destination_table_ref = dataset_to.table(source_table_ref.table_id)

    job = client.copy_table(
      source_table_ref,
      destination_table_ref)

    job.result()
    assert job.state == 'DONE'

    dest_table = client.get_table(destination_table_ref)
    source_table = client.get_table(source_table_ref)

    assert dest_table.num_rows > 0  # validation 1: destination is not empty
    assert dest_table.num_rows == source_table.num_rows  # validation 2: row counts match

    print("Source - table: {} row count {}".format(source_table.table_id, source_table.num_rows))
    print("Destination - table: {} row count {}".format(dest_table.table_id, dest_table.num_rows))
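
Note that client.dataset(...) and list_dataset_tables(...) come from older releases of google-cloud-bigquery and were deprecated later. A minimal sketch of the same loop against the newer API, assuming the same project and dataset names:

from google.cloud import bigquery

client = bigquery.Client()
source = bigquery.DatasetReference('source_project_id', 'source_dataset')
destination = bigquery.DatasetReference('destination_project_id', 'destination_dataset')

for item in client.list_tables(source):  # yields TableListItem objects
    job = client.copy_table(source.table(item.table_id),
                            destination.table(item.table_id))
    job.result()  # block until the copy job finishes
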
Xanthippe answered 30/10, 2018 at 11:29 Comment(0)

Assuming you want to copy most of the tables, you can first copy the entire BigQuery dataset, then delete the tables you don't want in the destination.

The copy-dataset UI is similar to the copy-table UI: just click the "Copy dataset" button on the source dataset and specify the destination dataset in the pop-up form. You can copy a dataset to another project or another region. See the screenshots below.

[Screenshot: Copy dataset button]

[Screenshot: Copy dataset form]
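
If you go this route, the cleanup step in the destination is a per-table delete. A minimal sketch with an illustrative table name (-f skips the confirmation prompt, -t says the target is a table):

bq rm -f -t destination_dataset.unwanted_table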

Betti answered 12/9, 2019 at 0:10 Comment(0)

If you are using the BigQuery API with Python, you can run a copy job:

https://cloud.google.com/bigquery/docs/tables#copyingtable

Copying the Python example from the docs:

import pprint
import time

from apiclient.errors import HttpError  # googleapiclient.errors in newer client releases


def copyTable(service):
    try:
        sourceProjectId = raw_input("What is your source project? ")
        sourceDatasetId = raw_input("What is your source dataset? ")
        sourceTableId = raw_input("What is your source table? ")

        targetProjectId = raw_input("What is your target project? ")
        targetDatasetId = raw_input("What is your target dataset? ")
        targetTableId = raw_input("What is your target table? ")

        jobCollection = service.jobs()
        jobData = {
            "projectId": sourceProjectId,
            "configuration": {
                "copy": {
                    "sourceTable": {
                        "projectId": sourceProjectId,
                        "datasetId": sourceDatasetId,
                        "tableId": sourceTableId,
                    },
                    "destinationTable": {
                        "projectId": targetProjectId,
                        "datasetId": targetDatasetId,
                        "tableId": targetTableId,
                    },
                    "createDisposition": "CREATE_IF_NEEDED",
                    "writeDisposition": "WRITE_TRUNCATE",
                }
            }
        }

        insertResponse = jobCollection.insert(projectId=targetProjectId, body=jobData).execute()

        # Poll for status until the job is done, with a short pause between calls.
        while True:
            status = jobCollection.get(projectId=targetProjectId,
                                       jobId=insertResponse['jobReference']['jobId']).execute()
            if 'DONE' == status['status']['state']:
                break
            print 'Waiting for the copy to complete...'
            time.sleep(10)

        if 'errors' in status['status']:
            print 'Error copying table: ', pprint.pprint(status)
            return

        print 'Copied the table:', pprint.pprint(status)

        # Now query and print out the copied table.
        # (queryTableData is another helper defined in the same docs sample.)
        queryTableData(service, targetProjectId, targetDatasetId, targetTableId)

    except HttpError as err:
        print 'Error in copyTable: ', pprint.pprint(err.resp)

The bq cp command does basically the same thing internally (you could call that function too, depending on which bq you are importing).
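
For reference, the equivalent standalone CLI call is a one-liner; dataset and table names here are illustrative:

bq cp source_dataset.source_table destination_dataset.destination_table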

Chlorohydrin answered 2/8, 2016 at 19:50 Comment(3)
Thanks Felipe. But in my scenario I have to copy multiple tables that start with the same name but have different timestamps at the end. That is why I loop through the table list and collect the tables that start with 'table1'.Galatea
Seems like something easy to automate with Python and the provided code?Chlorohydrin
How does the bq cp command work between datasets? I only see an error message claiming the destination dataset doesn't exist, which is clearly wrong (I also see this in the console; both the destination dataset and the incorrect error).Kieserite

Copying a dataset is now available in the BigQuery Data Transfer Service. Select the transfer service in the BigQuery web console, fill in the source and destination details, and run it on demand or schedule it at a specified time interval.


Or simply run the following bq command to achieve this:

bq mk --transfer_config --project_id=[PROJECT_ID] --data_source=[DATA_SOURCE] --target_dataset=[DATASET] --display_name=[NAME] --params='[PARAMETERS]'
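
For a dataset copy specifically, the data source is cross_region_copy and the source dataset goes into the params. A sketch with illustrative project and dataset names:

bq mk --transfer_config \
  --project_id=my_project \
  --data_source=cross_region_copy \
  --target_dataset=dest_dataset \
  --display_name='Copy my dataset' \
  --params='{"source_dataset_id":"src_dataset","source_project_id":"src_project"}'
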
Xanthippe answered 31/10, 2019 at 13:5 Comment(0)

I am not sure why it is not working for you, since it works perfectly for me.

projectFrom = 'project1'
datasetFrom = 'dataset1'
tableSearchString = 'test1'

projectTo = 'project2'
datasetTo = 'dataset2'

tables = bq.DataSet(projectFrom + ':' + datasetFrom).tables()

for table in tables:
  if tableSearchString in table.name.table_id:

    tableFrom = projectFrom + ':' + datasetFrom + '.' + table.name.table_id
    tableTo = projectTo + ':' + datasetTo + '.' + table.name.table_id

    !bq cp $tableFrom $tableTo

Try this in your notebook, since it works well for me.
Just wondering, what error code does your script return?
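
If the notebook refuses the ! shell magic entirely, one workaround is to shell out from plain Python. A minimal sketch, with illustrative table names:

import subprocess

# Invoke the bq CLI from Python when the "!" notebook magic is unavailable.
subprocess.check_call(
    ['bq', 'cp',
     'test:TestDataset.table1_20160801',  # illustrative source table
     'proj1:test1.table1_20160801'])      # illustrative destination table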

Ingurgitate answered 4/8, 2016 at 14:4 Comment(0)

I think this will help you.

# Pre-1.0 google-cloud-bigquery API; assumes "import uuid" and
# "import google.cloud.bigquery", and that self.bigquery_client is a
# bigquery.Client created elsewhere in the class.
tables = source_dataset.list_tables()
for table in tables:
    job_id = str(uuid.uuid4())  # each copy job needs a unique ID
    dest_table = dest_dataset.table(table.name)
    source_table = source_dataset.table(table.name)
    if not dest_table.exists():  # copy only tables missing from the destination
        job = self.bigquery_client.copy_table(job_id, dest_table, source_table)
        job.create_disposition = (
            google.cloud.bigquery.job.CreateDisposition.CREATE_IF_NEEDED)
        job.begin()   # start the job
        job.result()  # block until the copy finishes
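
In later releases of google-cloud-bigquery the job ID is generated for you and the disposition moves into a job config. A minimal sketch of the same copy, where source_ref and dest_ref are hypothetical TableReference objects:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.CopyJobConfig(
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED)
job = client.copy_table(source_ref, dest_ref, job_config=job_config)  # job ID auto-generated
job.result()  # wait for the copy to finish
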
Haugen answered 12/8, 2018 at 15:28 Comment(0)

Why not use the CLI:

bq ls src_dataset | grep TABLE | cut -f 3 -d ' ' | xargs -I {} bq cp src_dataset.{} dst_dataset.{}

First, get the list of table names from the source dataset, then copy each table to the destination.

If you want to see the commands before executing them, put an echo at the beginning:

bq ls src_dataset | grep TABLE | cut -f 3 -d ' ' | xargs -I {} echo bq cp src_dataset.{} dst_dataset.{}
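
Since the original question only needs tables matching a prefix, a variant with one more grep does the filtering (the 'table1' prefix is illustrative):

bq ls src_dataset | grep TABLE | cut -f 3 -d ' ' | grep '^table1' | xargs -I {} bq cp src_dataset.{} dst_dataset.{}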
Geller answered 29/11, 2022 at 11:47 Comment(0)
