A simple method to write a pandas DataFrame to Parquet. Assuming df is the pandas DataFrame, we need to import the following libraries:
import pyarrow as pa
import pyarrow.parquet as pq
First, convert the DataFrame df into a pyarrow table.
# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)
Second, write the table into a Parquet file, say file_name.parquet.
# Write Arrow Table to a Parquet file (Snappy compression by default)
pq.write_table(table, 'file_name.parquet')
NOTE: Parquet files can be further compressed while writing. The following are the popular compression formats:
- Snappy (default, requires no argument)
- Gzip
- Brotli
Parquet with Snappy compression
pq.write_table(table, 'file_name.parquet')
Parquet with GZIP compression
pq.write_table(table, 'file_name.parquet', compression='GZIP')
Parquet with Brotli compression
pq.write_table(table, 'file_name.parquet', compression='BROTLI')
[Figure: comparison of file sizes achieved with the different Parquet compression formats]
Reference:
https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/