How can one append to parquet files and how does it affect partitioning?

Does parquet allow appending to a parquet file periodically?

How does appending relate to partitioning, if at all? For example, if I identified a column with low cardinality and partitioned by that column, and I then appended more data, would parquet automatically append the data while preserving the partitioning, or would I have to repartition the file?

Clew answered 9/9, 2021 at 20:23 Comment(1)
Parquet files cannot be modified, is my understanding. pyarrow does not allow appending to parquet files either. The only option to 'append' to a partitioned parquet file is using the Spark API: https://mcmap.net/q/477213/-how-to-append-data-to-an-existing-parquet-file. I'd point you to the comments under that answer though, as this is not truly an 'append'.Didactic

Does parquet allow appending to a parquet file periodically?

Yes and no. The parquet spec describes a format that can be appended to by reading the existing footer, writing a row group, and then writing out a modified footer. This process is described a bit here.

Not all implementations support this operation. The only implementation I am aware of at the moment is fastparquet (see this answer). It is usually acceptable, simpler, and potentially better for performance to cache and batch instead, either by buffering data in memory or by writing the small files and batching them together at some point later.
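A minimal sketch of the cache-and-batch approach, assuming pandas and pyarrow; the threshold, global state, and file names are illustrative choices, not part of the answer:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

buffer = []          # small DataFrames accumulated in memory
rows_buffered = 0
FLUSH_AT = 100_000   # arbitrary threshold; tune it to hit your target file size
part_number = 0

def add_batch(df):
    # Buffer incoming data and write one larger parquet file per batch.
    global rows_buffered, part_number
    buffer.append(df)
    rows_buffered += len(df)
    if rows_buffered >= FLUSH_AT:
        table = pa.Table.from_pandas(pd.concat(buffer, ignore_index=True))
        pq.write_table(table, f"data_part_{part_number}.parquet")
        part_number += 1
        buffer.clear()
        rows_buffered = 0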

How does appending relate to partitioning if any?

Parquet does not have any concept of partitioning.

Many tools that support parquet implement partitioning. For example, pyarrow has a datasets feature which supports partitioning. If you were to append new data using this feature a new file would be created in the appropriate partition directory.
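A minimal sketch of that behaviour, assuming pyarrow and a hypothetical low-cardinality column named country; repeated calls add new files under the matching partition directories instead of rewriting existing ones:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'country': ['US', 'DE', 'US'], 'value': [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Writes files like my_dataset/country=US/<uuid>.parquet; running this again
# with new data adds further files to the same partition directories.
pq.write_to_dataset(table, root_path='my_dataset', partition_cols=['country'])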

Tree answered 10/9, 2021 at 0:28 Comment(4)
Appending row groups to an existing parquet file is possible using the fastparquet library.Insupportable
Hmm...I think fastparquet's append feature is used to add files to data sets and not to add row groups to existing files.Tree
It does append a new row group. I have posted the answer below. It's a very useful feature that I didn't know was possible.Insupportable
@shadow0359 I've updated my answer to reflect that fastparquet supports this operation. Thanks for the help!Tree

It's possible to append row groups to an already existing parquet file using fastparquet.

Here is my SO answer on the same topic.

From the fastparquet docs:

append: bool (False) or ‘overwrite’ If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.

from fastparquet import write
write('output.parquet', df, append=True)

EXAMPLE UPDATE:

Here is a Python script. On the first run it will create a file with one row group. On subsequent runs it will append row groups to the same parquet file.

import os.path
import pandas as pd
from fastparquet import write

df = pd.DataFrame(data={'col1': [1, 2,], 'col2': [3, 4]})
file_path = "C:\\Users\\nsuser\\dev\\write_parq_row_group.parquet"
if not os.path.isfile(file_path):
  write(file_path, df)               # first run: create the file
else:
  write(file_path, df, append=True)  # later runs: add a new row group
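If you want to verify that each run added a row group to the same file rather than creating a new file, pyarrow can report the count (a small check added here, not part of the original snippet):

import pyarrow.parquet as pq
print(pq.ParquetFile(file_path).metadata.num_row_groups)  # grows by one per append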
Insupportable answered 29/10, 2022 at 12:30 Comment(4)
Can you maybe include a complete example? I have been unable to figure out how to use fastparquet to actually append to existing files. This gist is my current test.Tree
@Tree could you check this answer? I have added the example here. https://mcmap.net/q/483636/-pandas-write-dataframe-to-parquet-format-with-appendInsupportable
@Tree I have updated the gist with example.Insupportable
That works great, thanks. I was getting the 'fmd' error and hadn't realized I needed to leave off append on the first call.Tree

Exactly. Whether appending data with fastparquet makes sense depends on the situation:

  1. If each piece of your data is very small, say around 10 MB, appending will hurt performance slightly. I recommend periodically compacting the small pieces into larger files (a sketch follows this list); an appropriate size for each parquet file is roughly 500 MB to 1 GB.

  2. If each piece of your data is big enough, this case would not harm your performance.
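A minimal compaction sketch, assuming pyarrow and that the small files sit in a hypothetical incoming/ directory; all files must share the same schema, and the paths are placeholders:

import glob
import pyarrow as pa
import pyarrow.parquet as pq

# Read every small parquet file and rewrite them as a single larger file.
small_files = sorted(glob.glob('incoming/*.parquet'))
tables = [pq.read_table(f) for f in small_files]
pq.write_table(pa.concat_tables(tables), 'compacted.parquet')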

Boothe answered 28/3 at 0:57 Comment(0)
