I have a parquet dataset stored in my S3 bucket as multiple partition files. I want to read it into a pandas dataframe, but I'm now getting an ArrowInvalid error that I didn't get before.
Occasionally this data is overwritten with a previous snapshot, written from pandas like this:
import os
import pandas as pd  # version 1.3.4
# pyarrow version 5.0

df.to_parquet(
    f's3a://{bucket_and_prefix}',
    storage_options={
        "key": os.getenv("AWS_ACCESS_KEY_ID"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify': os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    },
    index=False
)
But when reading it with:
df = pd.read_parquet(
    f"s3a://{bucket_and_prefix}",
    storage_options={
        "key": os.getenv("AWS_ACCESS_KEY_ID"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify': os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    }
)
It fails with this error:
ArrowInvalid: GetFileInfo() yielded path 'bucket/folder/data.parquet/year=2021/month=2/abcde.parquet', which is outside base dir 's3://bucket/folder/data.parquet'
Any idea why this ArrowInvalid error happens and how I can read the parquet data into pandas?
Comments:

You need to pass a filesystem argument (typically an s3fs.FileSystem), otherwise it will use the local file system (which doesn't know about s3://). – Librium

That's for pyarrow.parquet.read_table, but I'm using pd.read_parquet and I don't need to pass a file system. In fact I'm able to run the above for most parquet datasets in my S3 bucket. – Emanuele

Any additional kwargs are passed to the engine as **kwargs, so you can pass an s3 file system as an argument and it will be passed to pyarrow.parquet.read_table. – Librium
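
For illustration, a minimal sketch of what Librium suggests, assuming the pyarrow engine forwards a filesystem kwarg through to pyarrow.parquet.read_table. The endpoint URL and environment variable names are carried over from the question; passing the path without a scheme is an assumption, motivated by the error message, where the yielded paths have no scheme but the base dir does:

import os

import pandas as pd
import s3fs

# Build the same S3 connection explicitly as an s3fs filesystem,
# reusing the credentials and endpoint from the question.
fs = s3fs.S3FileSystem(
    key=os.getenv("AWS_ACCESS_KEY_ID"),
    secret=os.getenv("AWS_SECRET_ACCESS_KEY"),
    client_kwargs={
        "verify": os.getenv("AWS_CA_BUNDLE"),
        "endpoint_url": "https://prd-data.company.com/",
    },
)

# With engine="pyarrow", extra kwargs such as filesystem are forwarded
# to pyarrow.parquet.read_table. storage_options is omitted because the
# filesystem object already carries the credentials, and the path is
# given without an s3:// or s3a:// scheme (assumption: this sidesteps
# the base-dir/scheme mismatch shown in the ArrowInvalid message).
df = pd.read_parquet(
    bucket_and_prefix,  # e.g. "bucket/folder/data.parquet"
    engine="pyarrow",
    filesystem=fs,
)

Note that storage_options and an explicit filesystem shouldn't be combined in the same call; the filesystem takes over the role that storage_options played in the original snippet.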