Is it possible to open parquet files and iterate line by line, using generators? This is to avoid loading the whole parquet file into memory.
The content of the file is pandas DataFrame.
You cannot iterate line by line, because that is not how the data is stored. You can iterate through the row groups as follows:
from fastparquet import ParquetFile

pf = ParquetFile('myfile.parq')
for df in pf.iter_row_groups():
    # process the sub-DataFrame df here
    print(df.shape)
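If you specifically want a row-level generator on top of this, a minimal sketch (the iter_rows helper and the 'myfile.parq' filename are illustrative) is to wrap the row-group iterator with DataFrame.itertuples, so only one row group is materialised at a time:

from collections.abc import Iterator
from fastparquet import ParquetFile

def iter_rows(path: str) -> Iterator[tuple]:
    """Yield one row at a time; only one row group is held in memory."""
    pf = ParquetFile(path)
    for df in pf.iter_row_groups():
        # itertuples yields namedtuples lazily instead of building a full list
        yield from df.itertuples(index=False)

# hypothetical usage: print the first row of 'myfile.parq'
print(next(iter_rows('myfile.parq')))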
You can iterate row by row using tensorflow_io:
import tensorflow_io as tfio

dataset = tfio.IODataset.from_parquet('myfile.parquet')

# print the first 3 rows
for line in dataset.take(3):
    print(line)
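If you prefer plain Python/NumPy values rather than tensors, a minimal sketch, assuming the IODataset behaves like any other tf.data.Dataset (it subclasses it), is to go through as_numpy_iterator():

import tensorflow_io as tfio

dataset = tfio.IODataset.from_parquet('myfile.parquet')

# as_numpy_iterator() converts each element's tensors to NumPy/Python values,
# yielding one element at a time
for line in dataset.as_numpy_iterator():
    print(line)
    break  # stop after the first element, just for illustration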
You can use the pyarrow package for this; it lets you iterate per batch:
import pyarrow
import pyarrow.parquet as pq

print(pyarrow.__version__)  # 13.0.0

path = ""  # path to your parquet file
batch_size = 50_000  # number of rows to load in memory

parquet_file = pq.ParquetFile(path)
for batch in parquet_file.iter_batches(batch_size=batch_size):
    # process your batch here
    print(batch.num_rows)
In the proposed solution, each batch is a pyarrow.RecordBatch. If you want to iterate line by line, with each line represented as a dict, you can use the following snippet:
import pyarrow
import pyarrow.parquet as pq
from typing import Any
from collections.abc import Iterator

print(pyarrow.__version__)  # 13.0.0


def iter_by_line_parquet(path: str, batch_size: int) -> Iterator[dict[str, Any]]:
    """Iterate over a parquet file line by line.

    Each line is represented by a dict.

    Args:
        path: path to the .parquet file.
        batch_size: number of rows to load in memory.

    Yields:
        line as dict.
    """
    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield from batch.to_pylist()


path = ""  # path to your parquet file
batch_size = 50_000

# print the first row of your parquet file
print(next(iter_by_line_parquet(path, batch_size)))
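To keep memory usage even lower, you can read only the columns you actually need: iter_batches accepts a columns argument. A small sketch (the column names "col1" and "col2" below are placeholders):

import pyarrow.parquet as pq

path = ""  # path to your parquet file
batch_size = 50_000

parquet_file = pq.ParquetFile(path)
# only "col1" and "col2" (placeholder names) are loaded for each batch
for batch in parquet_file.iter_batches(batch_size=batch_size, columns=["col1", "col2"]):
    for line in batch.to_pylist():
        # each line is a dict containing just the requested columns
        print(line)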
If, as is often the case, the Parquet data is stored as multiple files in one directory, you can run:
import glob

import pandas as pd

for parquet_file in glob.glob(parquet_dir + "/*.parquet"):
    df = pd.read_parquet(parquet_file)
    for value1, value2, value3 in zip(df['col1'], df['col2'], df['col3']):
        print(value1, value2, value3)  # placeholder: process the row
    del df
Only one file will be in memory at a time.
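To turn this into a generator, so the caller can still consume rows lazily as the question asks, a minimal sketch (the iter_rows_from_dir helper name and "my_parquet_dir" path are illustrative) could look like this:

import glob
from collections.abc import Iterator

import pandas as pd

def iter_rows_from_dir(parquet_dir: str) -> Iterator[tuple]:
    """Yield rows file by file; only one file's DataFrame is in memory at a time."""
    for parquet_file in sorted(glob.glob(parquet_dir + "/*.parquet")):
        df = pd.read_parquet(parquet_file)
        # itertuples yields one namedtuple per row without copying the whole frame
        yield from df.itertuples(index=False)

# hypothetical usage: print the first row found in the directory
print(next(iter_rows_from_dir("my_parquet_dir")))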