I have a CSV stored in GCS that I want to load into a BigQuery table. I need to do some preprocessing first, so I read it into a DataFrame and then load the DataFrame into the BigQuery table.
import pandas as pd
import json
from google.cloud import bigquery
cols_name_list = [....]  # column names, in order
uri = "gs://<bucket>/<path>/<csv_file>"
df = pd.read_csv(uri, dtype="string")
df = df.reindex(columns=cols_name_list)
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
... # schema fields added here to match the table's column types
)
job = client.load_table_from_dataframe(
df, "<bq_table_id>", job_config=job_config
)
job.result()
In the code above, I reorder the DataFrame columns to match the column order of the BigQuery table (I'm not sure whether this matters) and read every column as a string.
I got the error shown below:
pyarrow.lib.ArrowInvalid: Could not convert '47803' with type str: tried to convert to int
I also ran it without forcing the dtypes to string and got a different error:
pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
The code and data look normal, so I tried downgrading numpy and pyarrow, but the same error still occurs.
Update:
I updated the code to force the string dtype only on the string column:
df = pd.read_csv(uri, dtype={"B": "string"})
This is the sample CSV data I am working with:
A,B,C,D,E,F,G
47803,000000000020030263,629,,2021-01-12 23:26:37,,
The column types of the BQ table should be as follows:
job_config = bigquery.LoadJobConfig(
schema = [
bigquery.SchemaField("A", "INTEGER"),
bigquery.SchemaField("B", "STRING"),
bigquery.SchemaField("C", "INTEGER"),
bigquery.SchemaField("D", "INTEGER"),
bigquery.SchemaField("E", "DATETIME"),
bigquery.SchemaField("F", "INTEGER"),
bigquery.SchemaField("G", "DATETIME")
]
)
Now, when I try to load the data with load_table_from_dataframe() and this config, I get this error:
pyarrow.lib.ArrowTypeError: object of type <class 'str'> cannot be converted to int
So I printed the dtypes:
A int64
B string
C int64
D float64
E object
F float64
G float64
dtype: object
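To narrow down which columns disagree with the schema, a comparison along these lines might help. This is only a rough sketch: `dtype_kind` and `find_mismatches` are hypothetical helpers, the schema is written as plain (name, type) pairs standing in for the `SchemaField` list, and the CSV is read from an in-memory string rather than GCS. A flagged column differs from the declared type, which does not necessarily mean it is the one raising the error.

```python
import io

import pandas as pd
from pandas.api import types as ptypes

# Sample row from the question, read from memory instead of GCS (assumption).
csv_text = "A,B,C,D,E,F,G\n47803,000000000020030263,629,,2021-01-12 23:26:37,,\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={"B": "string"})

# Plain (name, type) pairs mirroring the SchemaField list (stand-in, not the real schema object).
schema = [
    ("A", "INTEGER"), ("B", "STRING"), ("C", "INTEGER"), ("D", "INTEGER"),
    ("E", "DATETIME"), ("F", "INTEGER"), ("G", "DATETIME"),
]


def dtype_kind(series):
    """Hypothetical helper: coarse label for a column's pandas dtype."""
    if ptypes.is_integer_dtype(series.dtype):
        return "INTEGER"
    if ptypes.is_datetime64_any_dtype(series.dtype):
        return "DATETIME"
    if ptypes.is_float_dtype(series.dtype):
        return "FLOAT"
    if ptypes.is_string_dtype(series):
        return "STRING"
    return "OTHER"


def find_mismatches(frame, schema_pairs):
    """Return (column, expected, actual) for columns whose dtype differs from the schema."""
    mismatches = []
    for name, expected in schema_pairs:
        actual = dtype_kind(frame[name])
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches


mismatches = find_mismatches(df, schema)
for name, expected, actual in mismatches:
    print(f"{name}: schema says {expected}, DataFrame holds {actual}")
```

With the sample row, this flags E (text where DATETIME is declared) and the empty columns D, F and G (read as float64).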
Which column is causing the issue, and how can I fix it? The error message is not very useful for debugging: the columns that are supposed to be int are already int, and the only string column does not need to be converted to int, yet the error is still thrown.
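One thing that may be worth trying, under the assumption that the DATETIME columns (E and G) arriving as text or float are what triggers the conversion error, is to make every pandas dtype match the schema before calling load_table_from_dataframe(). The sketch below uses pandas' nullable "Int64" dtype for the INTEGER columns and pd.to_datetime for the DATETIME columns. It reads the sample row from an in-memory string instead of GCS, and it has not been verified against BigQuery itself.

```python
import io

import pandas as pd

# Sample row from the question, read from memory instead of GCS (assumption).
csv_text = "A,B,C,D,E,F,G\n47803,000000000020030263,629,,2021-01-12 23:26:37,,\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "A": "Int64",   # nullable integer, so empty cells become <NA> instead of float NaN
        "B": "string",  # keeps the leading zeros
        "C": "Int64",
        "D": "Int64",
        "F": "Int64",
    },
)

# Convert the DATETIME columns explicitly; empty cells become NaT.
for col in ["E", "G"]:
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```

After this, the integer columns stay integers even when empty, and E and G hold datetime values rather than str or float, which is closer to what the schema declares.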