Is it possible to use Pandas' DataFrame.to_parquet
functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M rows x 100 columns), and I'm using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy')
to write it to a single file, but the result is about 4GB. I'd instead like it split into many ~100MB files.
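For reference, the closest workaround I've come up with is slicing the frame by rows and calling to_parquet once per slice; rows_per_file below is just a guess I'd have to tune to land near ~100MB per file:

rows_per_file = 2_500_000  # hypothetical value; adjust until each output is roughly 100MB

# df is the 100M-row DataFrame described above
for i, start in enumerate(range(0, len(df), rows_per_file)):
    # write each row slice to its own snappy-compressed parquet file
    df.iloc[start:start + rows_per_file].to_parquet(
        f'data-{i:04d}.snappy.parquet', engine='pyarrow', compression='snappy')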
ddf = dask.dataframe.from_pandas(df, chunksize=5000000); ddf.to_parquet('/path/to/save/') which saves one file per chunk. – Cocker
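A minimal, runnable sketch of that dask suggestion (the chunksize and output directory are placeholders; each dask partition becomes one parquet file, so the rows per partition are what control the file size):

import dask.dataframe as dd

# df is the large pandas DataFrame from the question.
# chunksize sets rows per partition; dask writes one parquet file per partition,
# so tune it until each file comes out near the desired ~100MB.
ddf = dd.from_pandas(df, chunksize=5_000_000)
ddf.to_parquet('/path/to/save/', engine='pyarrow', compression='snappy')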