PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger-than-memory parquet files)

I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC):

import pyarrow as pa
import pyarrow.parquet as pq
import pickle
import time

arrow_schema = pickle.load(open('schema.pickle', 'rb'))
rows_dataframe = pickle.load(open('rows.pickle', 'rb'))

output_file = 'test.parq'

with pq.ParquetWriter(
        output_file,
        arrow_schema,
        compression='snappy',
        allow_truncated_timestamps=True,
        version='2.0',  # Highest available format version
        data_page_version='2.0',  # Highest available data page version
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                arrow_schema,
            )
        )

But even though I'm yielding chunks (like 10,000 rows in my case) and using write_table, it's still keeping the entire dataset in memory.
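
For reference, function_that_yields_data() is only a placeholder here; a hypothetical generator along these lines (column names and row counts are made up for illustration) produces chunks in the expected dict-of-columns shape:

def function_that_yields_data(total_rows=1_000_000, chunk_size=10_000):
    # Hypothetical stand-in for the real data source: yields plain
    # dict-of-columns chunks, chunk_size rows at a time.
    for start in range(0, total_rows, chunk_size):
        end = min(start + chunk_size, total_rows)
        yield {
            'id': list(range(start, end)),
            'value': ['x'] * (end - start),
        }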

Turns out ParquetWriter keeps the entire dataset in memory while it incrementally writes to disk.

Is there any way to force the ParquetWriter not to keep the entire dataset in memory, or is it simply not possible for good reasons?

Satirical answered 14/9, 2020 at 20:11 Comment(1)
Related question: #68375754 (Gladiate)

Based on the analysis in the Arrow bug report, this is potentially caused by the collection of metadata, which can only be flushed when the file is closed.
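
If that accumulated metadata really is what holds on to the memory, one possible workaround (a sketch only, not part of the bug report; write_in_parts, prefix and rows_per_file are made-up names) is to split the output across several smaller files, closing each writer so whatever it has buffered is flushed before the next part starts:

import pyarrow as pa
import pyarrow.parquet as pq

def write_in_parts(chunks, arrow_schema, prefix='test', rows_per_file=1_000_000):
    # Write incoming dict-of-columns chunks across several parquet files,
    # closing each writer once it reaches rows_per_file so the metadata
    # it has buffered can be flushed and freed.
    writer = None
    part = 0
    rows_in_file = 0
    try:
        for chunk in chunks:
            table = pa.Table.from_pydict(chunk, schema=arrow_schema)
            if writer is None:
                writer = pq.ParquetWriter(
                    f'{prefix}-{part}.parq',
                    arrow_schema,
                    compression='snappy',
                )
            writer.write_table(table)
            rows_in_file += table.num_rows
            if rows_in_file >= rows_per_file:
                writer.close()  # flushes this part's footer and metadata
                writer = None
                part += 1
                rows_in_file = 0
    finally:
        if writer is not None:
            writer.close()

The resulting part files can still be read back together, for example with pyarrow.dataset.dataset or by passing the list of paths to pq.ParquetDataset.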

Fried answered 21/10, 2020 at 18:10 Comment(0)
