How to loop over a large parquet file with generators in Python?
4

Is it possible to open parquet files and iterate line by line, using generators? This is to avoid loading the whole parquet file into memory.

The content of the file is a pandas DataFrame.

Snuffer answered 8/6, 2018 at 7:32 Comment(0)
8

You cannot iterate line by line, because that is not how the data is stored. You can iterate over the row groups as follows:

from fastparquet import ParquetFile

pf = ParquetFile('myfile.parq')
for df in pf.iter_row_groups():
    ...  # process the sub-DataFrame df (one row group at a time)
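If you specifically want a line-by-line generator, here is a minimal sketch wrapping row-group iteration (iter_rows is a hypothetical helper; it assumes pandas and fastparquet are installed):

from collections.abc import Iterator

import pandas as pd
from fastparquet import ParquetFile

def iter_rows(path: str) -> Iterator[pd.Series]:
    # Yield one row at a time while keeping only one row group in memory.
    pf = ParquetFile(path)
    for df in pf.iter_row_groups():
        for _, row in df.iterrows():
            yield row

# usage: fetch the first row without reading the whole file
first_row = next(iter_rows('myfile.parq'))

Note that iterrows is convenient but slow on wide frames; itertuples is a faster alternative if you only need the values.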
Havoc answered 11/6, 2018 at 7:59 Comment(1)
By default it is 50 million rows per sub-DataFrame; is there a way to change the value of n_row_per_group? – Ephemeron
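Regarding the comment above: the row-group size is fixed when the file is written, not when it is read. A hedged sketch of rewriting a file with smaller row groups, assuming fastparquet's write function and its row_group_offsets parameter (an integer is interpreted as rows per group; the default is 50 million):

import pandas as pd
from fastparquet import write

df = pd.read_parquet('myfile.parq')   # one-off rewrite, needs enough memory for the whole file
write('myfile_small_groups.parq', df,
      row_group_offsets=100_000)      # roughly 100k rows per row group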
1

You can iterate using tensorflow_io.

import tensorflow_io as tfio

dataset = tfio.IODataset.from_parquet('myfile.parquet')

# print the first 3 rows
for line in dataset.take(3):
    print(line)
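If you want plain Python values rather than tensors, here is a hedged variant, assuming each element is a dict-like mapping of column names to scalar tensors (which is how multi-column files are yielded in recent tensorflow-io versions):

import tensorflow_io as tfio

dataset = tfio.IODataset.from_parquet('myfile.parquet')

for row in dataset.take(3):
    # convert each column tensor to a native Python/NumPy value
    print({name: tensor.numpy() for name, tensor in row.items()})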
Roos answered 10/6, 2021 at 1:6 Comment(2)
Be aware of CPU instruction set compilation and compatibility, or it will throw the error "Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA" – Diondione
Good point @madmatrix; you should definitely have your TensorFlow install optimized for your environment for best performance. – Roos
1

You can use the pyarrow package for this; it allows you to iterate per batch:

import pyarrow
import pyarrow.parquet as pq

print(pyarrow.__version__) # 13.0.0

path = "" # path to your parquet file
batch_size = 50_000 # number of rows to load in memory

parquet_file = pq.ParquetFile(path)

for batch in parquet_file.iter_batches(batch_size=batch_size):
    ...  # process your batch (a pyarrow.RecordBatch)

In the proposed solution each batch is a pyarrow.RecordBatch. If you want to iterate line by line and represent each line as a dict, you can use the following snippet:

import pyarrow
import pyarrow.parquet as pq

from typing import Any
from collections.abc import Iterator

print(pyarrow.__version__) # 13.0.0

def iter_by_line_parquet(path: str, batch_size: int) -> Iterator[dict[str, Any]]:
    """Iterate over a parquet file line by line.
    
    Each line is represented by a dict.

    Args:
        path: path to the .parquet file.
        batch_size: number of rows to load in memory.

    Yields:
        line as dict.
    """
    parquet_file = pq.ParquetFile(path)

    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield from batch.to_pylist()

path = "" # path to your parquet file
batch_size = 50_000

# print the first row of your parquet file
print(next(iter_by_line_parquet(path, batch_size)))
Aw answered 28/2 at 14:19 Comment(0)
0

If, as is usually the case, the Parquet data is stored as multiple files in one directory, you can run:

import glob
import pandas as pd

parquet_dir = ""  # directory containing the .parquet files
for parquet_file in glob.glob(parquet_dir + "/*.parquet"):
    df = pd.read_parquet(parquet_file)
    for value1, value2, value3 in zip(df['col1'], df['col2'], df['col3']):
        ...  # process row
    del df

Only one file will be in memory at a time.
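To match the generator-based phrasing of the question, here is a minimal sketch wrapping this pattern (iter_rows_from_dir is a hypothetical helper; it assumes the same col1/col2/col3 columns as above):

import glob
from collections.abc import Iterator

import pandas as pd

def iter_rows_from_dir(parquet_dir: str) -> Iterator[tuple]:
    # Yield (col1, col2, col3) tuples, holding one file's DataFrame in memory at a time.
    for parquet_file in sorted(glob.glob(parquet_dir + "/*.parquet")):
        df = pd.read_parquet(parquet_file)
        yield from zip(df['col1'], df['col2'], df['col3'])
        del df

# usage: stream rows lazily across all files in a directory
for col1, col2, col3 in iter_rows_from_dir("my_parquet_dir"):
    ...  # process row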

Souza answered 2/2, 2022 at 20:55 Comment(0)
