How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation I expected that partition_cols would be passed as a kwarg to the pyarrow library. How can a partitioned file be written to local disk using pandas?

Quiche asked 22/10, 2018 at 16:56

Comments (3):

Are you sure there is not a typo in the partition_cols argument? (Leonelle)
Yeah, this was not the problem. Notice that the error message spells the argument correctly. (Quiche)
partition_cols was added in pandas 0.24.0: pandas.pydata.org/pandas-docs/stable/reference/api/… (Dud)
Answer (25 votes)

First, make sure that you have reasonably recent versions of pandas and pyarrow:

pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0

Then you can use partition_cols to produce the partitioned parquet files:

import pandas as pd

# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year':  [2020, 2020, 2021],
                        'month': [1,12,2], 
                        'day':   [1,31,28], 
                        'value': [1000,2000,3000]})

df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])

This produces:

mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
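
As a quick sanity check, the dataset can be read straight back into a single dataframe; pandas reconstructs the partition columns from the directory names:

import pandas as pd

# read every partition under ./mydf back into one dataframe;
# year/month/day come back as columns parsed from the paths
df_roundtrip = pd.read_parquet('./mydf')
print(df_roundtrip)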
Lavery answered 11/3, 2021 at 12:54
Answer (18 votes)

Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) followed by pq.write_table(table, ...) (see pandas/io/parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# example dataframe with the partition columns from the question
df = pd.DataFrame(data={'partone': ['a', 'a', 'b'],
                        'partwo':  [1, 2, 1],
                        'value':   [10, 20, 30]})
table = pa.Table.from_pandas(df)

pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'partwo'],
)
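
With the example values above, this writes a Hive-style directory tree rather than a single file, roughly as follows (the leaf file names are auto-generated):

output.parquet/partone=a/partwo=1/<generated-name>.parquet
output.parquet/partone=a/partwo=2/<generated-name>.parquet
output.parquet/partone=b/partwo=1/<generated-name>.parquet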

For more info, see the pyarrow documentation.

In general, I would always use the PyArrow API directly when reading / writing parquet files, since the Pandas wrapper is rather limited in what it can do.
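
For example, here is a minimal sketch of the direct-API round trip, reading back the dataset written above and converting to pandas only at the end:

import pyarrow.parquet as pq

# read every file under the partitioned root into a single Arrow table;
# the partone/partwo values are recovered from the directory names
table = pq.read_table('output.parquet')
df = table.to_pandas()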

Crider answered 22/10, 2018 at 18:41

Comments (3):

I believe that I did, via the engine='pyarrow' option, and it seems that the default engine is pyarrow and not fastparquet: "engine : {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’. Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable." pandas.pydata.org/pandas-docs/stable/generated/… (Quiche)
Yes, you are right. They must have changed it in one of the recent versions. (Crider)
Recent pandas has incorporated partition_cols and uses write_to_dataset under the hood as well. (Astra)
Answer (9 votes)

You need to update to pandas version 0.24 or above; the partition_cols argument was added in that version.
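
A quick way to check which version is installed, and to upgrade if needed:

import pandas as pd

# partition_cols in DataFrame.to_parquet requires pandas >= 0.24
print(pd.__version__)

pip install --upgrade pandas pyarrow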

Bracketing answered 27/7, 2019 at 11:41
