Methods for writing Parquet files using Python?

I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it.

Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support.

I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql?

Midgut answered 5/10, 2015 at 2:18 Comment(1)
It seems that the Parquet format has Thrift definition files; can't you use those to access it? – Abeyta

Update (March 2017): There are currently 2 libraries capable of writing Parquet files:

  1. fastparquet
  2. pyarrow

Both still appear to be under heavy development and come with a number of disclaimers (e.g. no support for nested data), so you will have to check whether they support everything you need.
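
As a rough sketch (based on the APIs at the time of writing; treat it as illustrative, not definitive), writing the same small pandas DataFrame with each library looks like this:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# pyarrow: convert to an Arrow Table, then write it out
import pyarrow as pa
import pyarrow.parquet as pq
pq.write_table(pa.Table.from_pandas(df), 'data_pyarrow.parquet')

# fastparquet: writes a pandas DataFrame directly
from fastparquet import write
write('data_fastparquet.parq', df)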

OLD ANSWER:

As of February 2016 there seems to be no Python-only library capable of writing Parquet files.

If you only need to read Parquet files, there is python-parquet.

As a workaround you will have to rely on some other process, e.g. pyspark.sql (which uses Py4J and runs on the JVM, so it cannot be used directly from your average CPython program).

Karlow answered 3/2, 2016 at 17:12 Comment(1)
If you need to be able to append data to existing files, like writing multiple DataFrames in batches, fastparquet does the trick. I could not find a single mention of append in pyarrow, and it seems the code is not ready for it (March 2017). – Eo

A simple method to write a pandas DataFrame to Parquet.

Assuming df is the pandas DataFrame, we need to import the following libraries.

import pyarrow as pa
import pyarrow.parquet as pq

First, convert the DataFrame df into a PyArrow Table.

# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)

Second, write the table to a Parquet file, say file_name.parquet.

# Write the Table to Parquet (Snappy compression by default)
pq.write_table(table, 'file_name.parquet')

NOTE: Parquet files can be further compressed while writing. The following are the most popular compression formats.

  • Snappy (the default, requires no argument)
  • Gzip
  • Brotli

Parquet with Snappy compression

pq.write_table(table, 'file_name.parquet')

Parquet with GZIP compression

pq.write_table(table, 'file_name.parquet', compression='GZIP')

Parquet with Brotli compression

pq.write_table(table, 'file_name.parquet', compression='BROTLI')
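
To sanity-check the result, the file can be read back into pandas (a quick sketch, assuming the file_name.parquet written above):

import pyarrow.parquet as pq

# Read the Parquet file back into a pandas DataFrame
df_roundtrip = pq.read_table('file_name.parquet').to_pandas()
print(df_roundtrip.head())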

[Image: comparison of the file sizes achieved with the different Parquet compression formats]

Reference: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/

Emilioemily answered 31/12, 2019 at 3:20 Comment(1)
Note that compression depends on the content. Brotli is especially efficient on code/English text, as its fixed dictionary was tuned on HTML and text samples. – Levenson

fastparquet does have write support; here is a snippet that writes a DataFrame df to a file:

from fastparquet import write
write('outfile.parq', df)
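
fastparquet's write also takes a compression argument; a minimal sketch, assuming the python-snappy package is installed and df is a pandas DataFrame:

from fastparquet import write

# Snappy-compressed Parquet file (requires python-snappy)
write('outfile.snappy.parq', df, compression='SNAPPY')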
Bendwise answered 3/1, 2017 at 8:42 Comment(0)

I've written a comprehensive guide to Python and Parquet with an emphasis on taking advantage of Parquet's three primary optimizations: columnar storage, columnar compression and data partitioning. There is a fourth optimization that isn't covered yet, row groups, but they aren't commonly used. The ways of working with Parquet in Python are pandas, PyArrow, fastparquet, PySpark, Dask and AWS Data Wrangler.
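
As a quick illustration of the partitioning optimization mentioned above, here is a minimal PyArrow sketch (the column names and paths are made up for the example):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'year': [2019, 2019, 2020], 'value': [1.0, 2.0, 3.0]})
table = pa.Table.from_pandas(df)

# Write one sub-directory per distinct 'year' value (Hive-style partitioning)
pq.write_to_dataset(table, root_path='dataset', partition_cols=['year'])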

Check out the post here: Python and Parquet Performance In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask

Taxable answered 2/11, 2020 at 21:39 Comment(0)

Using fastparquet you can write a pandas DataFrame to Parquet with either Snappy or gzip compression as follows.

Make sure you have installed the following:

$ conda install python-snappy
$ conda install fastparquet

Do the imports:

import pandas as pd 
import snappy
import fastparquet

Assume you have the following pandas DataFrame:

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

Send df to Parquet with Snappy compression:

df.to_parquet('df.snap.parquet', compression='snappy')

Send df to Parquet with gzip compression:

df.to_parquet('df.gzip.parquet', compression='gzip')
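
Note that df.to_parquet uses whichever Parquet engine pandas detects (pyarrow or fastparquet). To be sure fastparquet does the writing, you can pass the engine explicitly (a sketch):

# Force the fastparquet engine instead of relying on auto-detection
df.to_parquet('df.snap.parquet', engine='fastparquet', compression='snappy')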

Check:

Read the Parquet file back into a pandas DataFrame:

pd.read_parquet('df.snap.parquet')

or

pd.read_parquet('df.gzip.parquet')

output:

   col1  col2
0     1     3
1     2     4
Boltrope answered 4/10, 2018 at 13:41 Comment(0)

pyspark seems to be the best alternative right now for writing out Parquet with Python. It may seem like using a sword in place of a needle, but that's how it is at the moment.

  • It supports most compression types, such as LZO and Snappy; Zstd support should arrive soon.
  • It has complete schema support (nested types, structs, etc.).

Simply run pip install pyspark and you are good to go.
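
A minimal write sketch (a local SparkSession purely for illustration; the file path and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write").getOrCreate()
df = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"])

# Snappy is the default codec; it is passed explicitly here for clarity
df.write.parquet("out.parquet", compression="snappy")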

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Quicklime answered 14/9, 2019 at 1:48 Comment(0)

Two more Python libraries for fast CSV => parquet transformations:

  1. DuckDB https://duckdb.org
  2. Polars https://github.com/pola-rs/polars

They may not have all the bells and whistles of fastparquet, but they are really fast and easy to master.

Edit: Polars can write Parquet using Arrow, which supports new Parquet versions and options: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
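
A minimal CSV => Parquet sketch with each library (file names are made up; check the current docs for options):

# DuckDB: stream a CSV straight into a Parquet file with SQL
import duckdb
con = duckdb.connect()
con.execute("COPY (SELECT * FROM read_csv_auto('input.csv')) TO 'output_duckdb.parquet' (FORMAT PARQUET)")

# Polars: read the CSV into a DataFrame and write it back out as Parquet
import polars as pl
pl.read_csv('input.csv').write_parquet('output_polars.parquet')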

Disembarrass answered 4/11, 2021 at 14:19 Comment(0)
