How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation I expected that partition_cols would be passed as a kwarg to the pyarrow library. How can a partitioned file be written to local disk using pandas?

Quiche asked 22/10, 2018 at 16:56

Comments (3):

Are you sure there is not a typo in the partition_cols argument? (Leonelle)
Yeah, this was not the problem. Notice that the error message spells the argument correctly. (Quiche)
partition_cols was added in pandas 0.24.0: pandas.pydata.org/pandas-docs/stable/reference/api/… (Dud)
Answer (25 votes)

First, make sure that you have reasonably recent versions of pandas and pyarrow:

pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0

Then you can use partition_cols to produce the partitioned parquet files:

import pandas as pd

# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year':  [2020, 2020, 2021],
                        'month': [1,12,2], 
                        'day':   [1,31,28], 
                        'value': [1000,2000,3000]})

df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])

This produces:

mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
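
As a quick sanity check, the dataset can be read straight back into a single dataframe; pandas reconstructs the partition columns from the directory names:

import pandas as pd

# read every partition under ./mydf back into one dataframe;
# year/month/day come back as columns parsed from the paths
df_roundtrip = pd.read_parquet('./mydf')
print(df_roundtrip)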
Lavery answered 11/3, 2021 at 12:54
Answer (18 votes)

Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) followed by pq.write_table(table, ...) (see pandas/io/parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# example dataframe with the partition columns from the question
df = pd.DataFrame(data={'partone': ['a', 'a', 'b'],
                        'partwo':  [1, 2, 1],
                        'value':   [10, 20, 30]})
table = pa.Table.from_pandas(df)

pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'partwo'],
)
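
With the example values above, this writes a Hive-style directory tree rather than a single file, roughly as follows (the leaf file names are auto-generated):

output.parquet/partone=a/partwo=1/<generated-name>.parquet
output.parquet/partone=a/partwo=2/<generated-name>.parquet
output.parquet/partone=b/partwo=1/<generated-name>.parquet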

For more info, see the pyarrow documentation.

In general, I would always use the PyArrow API directly when reading / writing parquet files, since the Pandas wrapper is rather limited in what it can do.
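
For example, here is a minimal sketch of the direct-API round trip, reading back the dataset written above and converting to pandas only at the end:

import pyarrow.parquet as pq

# read every file under the partitioned root into a single Arrow table;
# the partone/partwo values are recovered from the directory names
table = pq.read_table('output.parquet')
df = table.to_pandas()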

Crider answered 22/10, 2018 at 18:41

Comments (3):

I believe that I did, via the engine='pyarrow' option, and it seems that the default engine is pyarrow and not fastparquet: "engine : {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’. Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable." pandas.pydata.org/pandas-docs/stable/generated/… (Quiche)
Yes, you are right. They must have changed it in one of the recent versions. (Crider)
Recent pandas has incorporated partition_cols and uses write_to_dataset under the hood as well. (Astra)
Answer (9 votes)

You need to update to pandas version 0.24 or above; the partition_cols argument was added in that version.
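
A quick way to check which version is installed, and to upgrade if needed:

import pandas as pd

# partition_cols in DataFrame.to_parquet requires pandas >= 0.24
print(pd.__version__)

pip install --upgrade pandas pyarrow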

Bracketing answered 27/7, 2019 at 11:41
