ArrowInvalid: GetFileInfo() yielded path which is outside base dir parquet

I have a Parquet dataset in my S3 bucket, split across multiple partition files. I want to read it into a pandas DataFrame, but I now get this ArrowInvalid error, which I did not get before.

Occasionally, the data gets overwritten with an earlier snapshot written from pandas, like this:

import os

import pandas as pd  # version 1.3.4
# pyarrow version 5.0

df.to_parquet(
    f's3a://{bucket_and_prefix}',
    storage_options={
        "key"          : os.getenv("AWS_ACCESS_KEY_ID"),
        "secret"       : os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify'      : os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    },
    index=False
)

But when I read it back with:

df = pd.read_parquet(
    f"s3a://{bucket_and_prefix}",
    storage_options={
        "key"          : os.getenv("AWS_ACCESS_KEY_ID"),
        "secret"       : os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify'      : os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    }
)

it fails with this error:

ArrowInvalid: GetFileInfo() yielded path 'bucket/folder/data.parquet/year=2021/month=2/abcde.parquet', which is outside base dir 's3://bucket/folder/data.parquet'
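
One thing visible in the message itself: the yielded path has no scheme, while the base dir still carries its `s3://` prefix. The sketch below only illustrates how a plain prefix comparison treats those two strings. It is not Arrow's actual implementation, and `strip_scheme` and `is_under` are hypothetical helper names introduced here.

```python
def strip_scheme(path: str) -> str:
    """Return the path without a leading '<scheme>://' prefix, if it has one."""
    _scheme, sep, rest = path.partition("://")
    return rest if sep else path


def is_under(base_dir: str, path: str) -> bool:
    """Plain string check (illustration only): does `path` sit below `base_dir`?"""
    return path.startswith(base_dir.rstrip("/") + "/")


# Both strings are copied from the error message.
base_dir = "s3://bucket/folder/data.parquet"
yielded = "bucket/folder/data.parquet/year=2021/month=2/abcde.parquet"

print(is_under(base_dir, yielded))                # False: the scheme breaks the match
print(is_under(strip_scheme(base_dir), yielded))  # True once the scheme is dropped
```

If the real check behaves similarly, the mismatch would come from how the base path is normalized rather than from the files themselves, but that is an assumption, not something the message confirms.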

Any idea why this ArrowInvalid error happens and how I can read the parquet data into pandas?

Emanuele answered 28/4, 2022 at 18:9 Comment(3)
According to the pyarrow documentation, arrow.apache.org/docs/python/generated/… you need to pass a filesystem argument (typically an s3fs.S3FileSystem), otherwise it will use the local file system (which doesn't know about s3://). Librium
@Librium that is for pyarrow.parquet.read_table, but I'm using pd.read_parquet and I don't need to pass a file system. In fact, I'm able to run the above for most Parquet datasets in my S3 bucket. Emanuele
According to pandas.pydata.org/pandas-docs/version/1.3/reference/api/… any additional kwargs are passed to the engine as **kwargs, so you can pass an S3 file system as an argument and it will be passed on to pyarrow.parquet.read_table. Librium
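
The suggestion in the last comment, handing an S3 file system to the pyarrow engine through pd.read_parquet, might look roughly like the sketch below. It is untested against the bucket in the question, and whether it avoids the ArrowInvalid error is not confirmed. `s3_read_args` and `read_dataset` are hypothetical helper names, passing a scheme-less path is an assumption, and the credentials and endpoint mirror the question's `storage_options`.

```python
import os


def s3_read_args(bucket_and_prefix: str, endpoint_url: str):
    """Build a scheme-less path and s3fs constructor kwargs (hypothetical helper)."""
    # Drop any 's3://' or 's3a://' prefix and a trailing slash.
    path = bucket_and_prefix.split("://", 1)[-1].rstrip("/")
    fs_kwargs = {
        "key": os.getenv("AWS_ACCESS_KEY_ID"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            "verify": os.getenv("AWS_CA_BUNDLE"),
            "endpoint_url": endpoint_url,
        },
    }
    return path, fs_kwargs


def read_dataset(bucket_and_prefix: str, endpoint_url: str):
    """Read the dataset with an explicit s3fs file system instead of storage_options."""
    # Imported lazily so s3_read_args can be used without these packages installed.
    import pandas as pd
    import s3fs

    path, fs_kwargs = s3_read_args(bucket_and_prefix, endpoint_url)
    fs = s3fs.S3FileSystem(**fs_kwargs)
    # `filesystem` is forwarded to the pyarrow engine.
    return pd.read_parquet(path, engine="pyarrow", filesystem=fs)
```

Usage would be along the lines of `df = read_dataset(bucket_and_prefix, 'https://prd-data.company.com/')`, with `storage_options` left out because the file system object already carries the credentials.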
