It appears the most common way to create Parquet files in Python is to first build a pandas DataFrame and then use pyarrow to write it out as Parquet. I worry that this might be overly taxing on memory, since it requires at least one full copy of the dataset to be held in memory just to construct the DataFrame.
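For concreteness, the approach I'm describing is roughly the following (the column names and data are just placeholders for whatever my real records look like):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder data: in reality these records arrive one at a time from a stream.
records = [{"id": i, "value": i * 0.5} for i in range(10_000)]

# The whole dataset has to sit in memory to build the DataFrame...
df = pd.DataFrame(records)

# ...and is then handed to pyarrow to be written out as Parquet.
pq.write_table(pa.Table.from_pandas(df), "output.parquet")
```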
I wonder whether loading the entire dataset into memory is actually required because of Parquet's columnar compression, or whether there is a more efficient, stream-based approach. In my case, the records arrive in a streaming fashion. For a similar CSV output process, we write rows to disk in batches of 1000, so the number of rows held in memory never approaches the size of the full dataset.
Should I...?
- Just create a pandas DataFrame and then write it out to Parquet. (Meaning the entire dataset will need to be stored in memory, but we treat this as a necessary requirement.)
- Use some streaming-friendly way to write 1000 or so rows at a time as we receive them, minimizing the total point-in-time RAM consumption over the course of the process. (I didn't see any documentation on how to do this and I'm not sure it's even an option for Parquet; I've sketched what I'm imagining just after this list.)
- Write everything to CSV and then use a function that smartly reads/analyzes the CSV contents and creates the compressed Parquet file after the fact. (Slower runtime perhaps, but a lower memory profile and a lower chance of failing on a very large file.)
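If something like the second option is possible, I imagine it would look roughly like the sketch below, using pyarrow's `ParquetWriter` to append one row group per batch. The schema, column names, and `stream_records` generator are placeholders, and I don't know whether this is the intended usage or whether it actually keeps memory bounded (`Table.from_pylist` also seems to need a reasonably recent pyarrow):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder schema; the real one would match the incoming records.
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

def stream_records():
    # Stand-in for however the records actually arrive.
    for i in range(10_000):
        yield {"id": i, "value": i * 0.5}

batch = []
with pq.ParquetWriter("output.parquet", schema) as writer:
    for record in stream_records():
        batch.append(record)
        if len(batch) == 1000:
            # Each write_table call appends a row group, so only ~1000
            # rows would need to be held in memory at any one time.
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))
            batch = []
    if batch:  # flush whatever is left over
        writer.write_table(pa.Table.from_pylist(batch, schema=schema))
```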
Thoughts? Suggestions?