I'm trying to write a large parquet file onto disk (larger than memory). I naively thought I could be clever and use ParquetWriter and write_table to incrementally write a file, like this (POC):
import pyarrow as pa
import pyarrow.parquet as pq
import pickle
import time

arrow_schema = pickle.load(open('schema.pickle', 'rb'))
rows_dataframe = pickle.load(open('rows.pickle', 'rb'))

output_file = 'test.parq'

with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # Highest available format version
    data_page_version='2.0',  # Highest available data page version
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                schema=arrow_schema,
            )
        )
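function_that_yields_data() is just a stand-in for my real data source; for the sake of the POC it could be any generator that yields column dicts in chunks of ~10 000 rows, something like this (with a made-up single int64 column instead of my real schema):

def function_that_yields_data():
    # Stand-in generator: yields dicts of columns, 10 000 rows at a time.
    # The single column 'x' is made up for illustration; the real columns
    # match arrow_schema from schema.pickle.
    chunk_size = 10_000
    for start in range(0, 1_000_000, chunk_size):
        yield {'x': list(range(start, start + chunk_size))}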
But even though I'm yielding chunks (like 10 000 rows in my case) and using write_table,
it's still keeping the entire dataset in memory.
It turns out ParquetWriter keeps the entire dataset in memory while it incrementally writes to disk.
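For reference, a minimal way to observe this is to print the process RSS after each write, for example with psutil (reusing the stand-in generator and the names from the POC above):

import os
import psutil

process = psutil.Process(os.getpid())

with pq.ParquetWriter(output_file, arrow_schema, compression='snappy') as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(pa.Table.from_pydict(rows_dataframe, schema=arrow_schema))
        # RSS climbs chunk after chunk instead of staying roughly flat
        print(f'RSS after chunk: {process.memory_info().rss / 1024 ** 2:.0f} MiB')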
Is there any way to force ParquetWriter not to keep the entire dataset in memory, or is it simply not possible for good reasons?