How to compress parquet file with zstandard using pandas
I'm using pandas to convert DataFrames to .parquet files with this command:

df.to_parquet(file_name, engine='pyarrow', compression='gzip')

I need to use Zstandard as the compression algorithm, but the function above accepts only gzip, snappy, and brotli. Is there a way to include zstd in this function? If not, how can I do that with other packages? I tried the zstandard package, but it seems to accept only bytes-like objects.

Singultus answered 28/10, 2019 at 16:54 Comment(0)

I usually use zstandard as my compression algorithm for my dataframes.

This is the code I use (a bit simplified) to write those parquet files:

import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

parquetFilename = "test.parquet"

df = pd.DataFrame(
    {
        "num_legs": [2, 4, 8, 0],
        "num_wings": [2, 0, 0, 0],
        "num_specimen_seen": [10, 2, 1, 8],
    },
    index=["falcon", "dog", "spider", "fish"],
)

table = pa.Table.from_pandas(df)
pq.write_table(table, parquetFilename, compression="zstd")

And to read these parquet files:

import pyarrow.parquet as pq

parquetFilename = "test.parquet"
df = pq.read_table(parquetFilename)
df = df.to_pandas()


Finally, a shameless plug for a blog post I wrote about the speed-versus-space trade-off of Zstandard and snappy compression in parquet files using pyarrow. It is relevant to your question and includes some more "real world" examples of reading and writing parquet files with Zstandard. I will be writing a follow-up soon too; if you're interested, let me know.

Calices answered 26/3, 2020 at 3:25 Comment(0)

You can actually just use

df.to_parquet(file_name, engine='pyarrow', compression='zstd')

Note: Only pyarrow supports Zstandard compression, fastparquet does not.

Reading is even easier, since you don't have to name the compression algorithm:

df = pd.read_parquet(file_name)

As of pandas 1.5.3, this was documented only for the pyarrow backend (the support itself exists since pandas 1.4.0). The missing documentation in the pandas interface has been fixed in the current development version.

Baines answered 22/2, 2023 at 2:0 Comment(0)

It seems it is not supported yet:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html

compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'. Name of the compression to use; use None for no compression.

Bismuthinite answered 4/2, 2020 at 18:51 Comment(0)

Dependencies: %pip install "pandas[parquet,compression]>=1.4"

Code: df.to_parquet(filepath, compression='zstd')

Documentation

  • Installed by "parquet": pyarrow is the default parquet/feather engine; fastparquet also exists.
  • Installed by "compression": Zstandard is only mentioned in the docs from pandas>=1.4, and in to_parquet from pandas>=2.1
Amethist answered 22/1 at 12:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.