I am using Pyarrow library for optimal storage of Pandas DataFrame. I need to process pyarrow Table row by row as fast as possible without converting it to pandas DataFrame (it won't fit in memory). Pandas has iterrows()/iterrtuples() methods. Is there any fast way to iterate Pyarrow Table except for-loop and index addressing?
Fastest way to iterate Pyarrow Table
The software is not optimized at all for this use case at the moment. I would recommend using Cython or C++ or interact with the data row by row. If you have further questions, please reach out on the developer mailing list [email protected]
Is the answer any different today? –
Harlene
There is now a experimental API for user-defined scalar functions: arrow.apache.org/docs/python/api/… –
Bratton
This code worked for me:
for batch in table.to_batches():
d = batch.to_pydict()
for c1, c2, c3 in zip(d['c1'], d['c2'], d['c3']):
# Do something with the row of c1, c2, c3
If you have a large parquet data set split into mupltiple files, this seems reasonably fast and memory-efficient.
import argparse
import pyarrow.parquet as pq
from glob import glob
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('parquet_dir')
return parser.parse_args()
def iter_parquet(dirpath):
for fpath in glob(f'{dirpath}/*.parquet'):
tbl = pq.ParquetFile(fpath)
for group_i in range(tbl.num_row_groups):
row_group = tbl.read_row_group(group_i)
for batch in row_group.to_batches():
for row in zip(*batch.columns):
yield row
if __name__ == '__main__':
args = parse_args()
total_count = 0
for row in iter_parquet(args.parquet_dir):
total_count += 1
print(total_count)
The software is not optimized at all for this use case at the moment. I would recommend using Cython or C++ or interact with the data row by row. If you have further questions, please reach out on the developer mailing list [email protected]
Is the answer any different today? –
Harlene
There is now a experimental API for user-defined scalar functions: arrow.apache.org/docs/python/api/… –
Bratton
© 2022 - 2024 — McMap. All rights reserved.