pandas write dataframe to parquet format with append

I am trying to write a pandas DataFrame to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?

The write syntax is:

df.to_parquet(path, mode='append')

The read syntax is:

pd.read_parquet(path)
Rammish answered 8/11, 2017 at 23:48 Comment(6)
Try opening the file in append mode.Sonja
This does not work (it makes no difference from the previous situation).Rammish
From this link "#39234891" it looks like append is not supported in the parquet client API.Rammish
In the docs there is no append mode for the to_parquet() API. If you want to append to a file, the append mode applies to the file itself. That's what I tried to express earlier.Sonja
See here: #47114313Myrt
In case you want to append to the SAME file, forget my comment, but sometimes it can be useful to write the new parquet file to the same directory with another name. Next time you can read from the directory instead of a specific file and you will get the data from every parquet file in that directory.Sudan

It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.

Below is from pandas doc:

DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)

We have to pass in both engine and **kwargs:

  • engine : {‘auto’, ‘pyarrow’, ‘fastparquet’}
  • **kwargs - additional arguments passed on to the parquet library.

In **kwargs we need to pass append=True (a fastparquet option).

import pandas as pd
from pathlib import Path

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
file_path = Path("D:\\dev\\output.parquet")

if file_path.exists():
    # append a new row group to the existing file
    df.to_parquet(file_path, engine='fastparquet', append=True)
else:
    # first write creates the file
    df.to_parquet(file_path, engine='fastparquet')

If append is set to True and the file does not exist, you will see the error below:

AttributeError: 'ParquetFile' object has no attribute 'fmd'

Running the above script 3 times appends the same rows to the parquet file three times.

If I inspect the metadata, I can see that this resulted in 3 row groups.

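For reference, here is a minimal sketch of checking the row-group count (assuming pyarrow is installed; fastparquet exposes similar metadata):

import pyarrow.parquet as pq

pf = pq.ParquetFile("D:\\dev\\output.parquet")
print(pf.metadata.num_row_groups)  # 3 after running the script three times
print(pf.metadata.num_rows)        # total rows across all row groups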


Note:

Append can be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression works better, since compression operates within a row group only, and there is less overhead spent on storing statistics, since each row group stores its own statistics.
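
As an illustration of that point, here is a rough sketch of buffering small chunks in memory and flushing them as one larger row group. This is only a sketch: BUFFER_ROWS, add_chunk and flush are illustrative names, and file_path is assumed to be a pathlib.Path.

import pandas as pd

BUFFER_ROWS = 100_000  # illustrative threshold, tune for your data

buffer = []

def add_chunk(chunk, file_path):
    # collect small DataFrames instead of appending each one immediately
    buffer.append(chunk)
    if sum(len(c) for c in buffer) >= BUFFER_ROWS:
        flush(file_path)

def flush(file_path):
    # write everything buffered so far as a single, larger row group
    if not buffer:
        return
    batch = pd.concat(buffer, ignore_index=True)
    batch.to_parquet(file_path, engine='fastparquet', append=file_path.exists())
    buffer.clear()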

Gunslinger answered 26/10, 2022 at 14:48 Comment(3)
I tried this with pyarrow and it failed, so it seems to only work with fastparquet, as the author suggests.Hindgut
If you're getting TypeError: expected list of bytes, ensure that your df.dtypes are well defined (e.g. using df = df.astype({"col1": "str", ...})).Deponent
If I have many small row groups and drop the row group columns, does that eliminate the performance problem? Or, how would I resolve the performance issue?Firkin

To append, do this:

import pandas as pd 
import pyarrow.parquet as pq
import pyarrow as pa

dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"

# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)

# Write direct to your parquet file
pq.write_to_dataset(table, root_path=output)

This will automatically append to your table.
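
As a follow-up, since write_to_dataset() writes a directory of part files rather than a single file, you read it back by pointing at the directory (a short sketch, reusing the output variable from above):

# reads every parquet file under the dataset directory into one dataframe
df_all = pd.read_parquet(output)

# or, staying in pyarrow:
table_all = pq.read_table(output)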

Capias answered 26/9, 2019 at 17:16 Comment(1)
It will create a directory with several parquet files, as a pyarrow dataset.Freestone

I used the awswrangler library. It works like a charm.

Below are the reference docs:

https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html

I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. I have not included the JSON processing logic, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.

Below is the sample code I used:

!pip install awswrangler
import awswrangler as wr
import pandas as pd

event_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
                          columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # optional
    )
    print("write successful")
except Exception as err:
    print(str(err))

Happy to help with any clarifications. In a few other posts I have read that you should read the data back and overwrite it again, but as the data gets larger that slows the process down and is inefficient.
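
For completeness, the appended dataset can also be read back with awswrangler; this is only a sketch and reuses the placeholder bucket/folder names from the snippet above:

# reads all parquet files under the prefix and recovers the e/f partition columns
root_path = "s3://<your bucket>/table/temp/<your folder name>/"
df_all = wr.s3.read_parquet(path=root_path, dataset=True)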

Scraggly answered 13/11, 2020 at 3:8 Comment(2)
Hey, thanks for this - it seems to create snappy.parquet files. Is there a way to create singular parquet files, or at least non-snappy files?Felipafelipe
Oops, never mind, it has an option for compression=None.Felipafelipe

There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting it.
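
A minimal sketch of that read-modify-write approach (the file name and columns here are made up for illustration):

import pandas as pd

new_rows = pd.DataFrame({'col1': [5], 'col2': [6]})    # rows to add
existing = pd.read_parquet("data.parquet")             # read the current file
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_parquet("data.parquet")                    # overwrite with the combined data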

Byerly answered 10/3, 2018 at 12:2 Comment(0)

Use the fastparquet write function

from fastparquet import write

write(file_name, df, append=True)

The file must already exist as I understand it.

API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
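
A small variation on the same call (reusing file_name and df from above): only pass append=True when the file already exists, so the first write creates it:

import os
from fastparquet import write

write(file_name, df, append=os.path.exists(file_name))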

Nodus answered 19/9, 2022 at 14:35 Comment(0)

If you are considering the use of partitions:

As per the Pyarrow docs (this is the function called behind the scenes when using partitions), you might want to combine partition_cols with a unique basename_template name, i.e. something like the following:

df.to_parquet(root_path, partition_cols=["..."], basename_template="{i}")

You could omit basename_template if df does not overlap existing data. But if you do have overlaps, it would create duplicate .parquet files.

This is very handy if your partition column consists of timestamps. This way you can have a "rolling" DataFrame: no duplicates are written, and only new files corresponding to new times get created.
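
A minimal sketch of that pattern (the column names and the events_parquet path are only illustrative):

import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2023-07-07 10:00", "2023-07-07 11:00"]),
    "value": [1, 2],
})
df["date"] = df["ts"].dt.date  # partition column derived from the timestamp

# each run only adds/overwrites files under the partitions present in df
df.to_parquet("events_parquet", partition_cols=["date"], basename_template="{i}")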

Mangum answered 7/7, 2023 at 11:10 Comment(0)

Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it already exists. To append to a parquet dataset, just add a new file to the same parquet directory.

import os
import datetime

import pandas as pd

os.makedirs(path, exist_ok=True)

# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read
pd.read_parquet(path)
Blondell answered 13/12, 2021 at 15:30 Comment(1)
It strikes me that this would scale linearly in time, i.e. better than the append mode suggested by @Naveen in https://mcmap.net/q/483636/-pandas-write-dataframe-to-parquet-format-with-append - am I right? And to_parquet() supports S3, correct?Tuneless
