How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB, about 30 million rows) into my Jupyter Notebook (in Python 3) using the pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which read_parquet can use as its engine for Parquet files. Unfortunately, while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to rerun the code since that will cause another freeze, so I don't have the verbatim error message).

Is there a good way to read only part of the Parquet file into memory so this doesn't happen? I know Parquet files are columnar and it may not be possible to load only a subset of the records into memory, but I'd like to split the work up if there is a workaround, or to find out whether I am doing something wrong while reading it in.

I do have a relatively weak computer in terms of specs, with only 6 GB of memory and an i3 CPU at 2.2 GHz (with Turbo Boost available).

Antacid answered 11/2, 2020 at 3:59

It's possible to read Parquet data

  • in batches
  • by reading certain row groups, or iterating over row groups
  • by reading only certain columns

This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
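
For fastparquet, a rough sketch would be to iterate row group by row group (I believe ParquetFile.iter_row_groups yields one pandas DataFrame per row group; the file and column names here are just placeholders):

from fastparquet import ParquetFile

pf = ParquetFile('example.parquet')
# Yield one pandas DataFrame per row group instead of materialising the whole file
for df in pf.iter_row_groups(columns=['user_address']):
    print(df.shape)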

In the case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
# Stream the file in RecordBatches of 1,000 rows instead of loading it all at once
for i in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(i.to_pandas())

The example above simply reads 1,000 records at a time. You can further restrict the read to certain row groups or even certain columns, like below.

for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
    print(i.to_pandas())
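
If you're unsure which row groups exist, the file's metadata tells you how many there are and how big each one is before you read any data. A small sketch using the same parquet_file object from above:

# Inspect the file's structure before deciding which row groups/columns to read
print(parquet_file.metadata.num_row_groups)         # total number of row groups
print(parquet_file.metadata.row_group(0).num_rows)  # rows in the first row group
print(parquet_file.schema_arrow.names)              # column names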
Miserere answered 22/10, 2022 at 17:6

Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
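
For example, pandas.read_parquet accepts a columns argument, so something like the following loads only what you need (the column names here are hypothetical):

import pandas as pd

# Load only the columns you actually use; the names are placeholders
df = pd.read_parquet('example.parquet', columns=['user_id', 'user_address'])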

A second possibility is to use an online machine (like Google Colab) to load the Parquet file and then save it in HDF5 format. Once you have the HDF file, you can read it back in chunks.
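
A minimal sketch of that round trip, assuming PyTables is installed and the file fits in the online machine's memory (file names are placeholders):

import pandas as pd

# On the machine with enough RAM: convert Parquet to an HDF5 table
df = pd.read_parquet('example.parquet')
df.to_hdf('example.h5', key='data', format='table')

# On the low-memory machine: iterate over the HDF file in chunks
for chunk in pd.read_hdf('example.h5', key='data', chunksize=1_000_000):
    print(chunk.shape)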

Perfecto answered 11/2, 2020 at 8:51

You can use Dask instead of pandas. It is built on top of pandas, so it has a similar API that you will likely be familiar with, and it is meant for larger-than-memory data.

https://examples.dask.org/dataframes/01-data-access.html
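
A minimal sketch, assuming dask[dataframe] is installed (the file and column names are placeholders):

import dask.dataframe as dd

# Dask reads the Parquet file lazily and processes it partition by partition,
# so the whole dataset never has to fit in RAM at once
ddf = dd.read_parquet('example.parquet')
print(ddf['user_address'].value_counts().compute())  # computation runs out-of-core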

White answered 7/3, 2020 at 4:38
