Is there a way to use a pyarrow Parquet dataset to read specific columns and, if possible, filter the data, instead of reading a whole file into a DataFrame?
Pyarrow Dataset read specific columns and specific rows
As of pyarrow==2.0.0, this is possible at least with pyarrow.parquet.ParquetDataset.
To read specific columns, its read and read_pandas methods have a columns option. You can also do this with pandas.read_parquet.
To read specific rows, its __init__ method has a filters option.
Excellent! I wonder whether all rows are read into memory before the bad ones are filtered out. If so, memory pressure would still spike, though I bet the Arrow table is smaller than the pandas DataFrame. – Cattleya
With pd.read_parquet() you can specify the columns with the columns arg. To my knowledge you can't filter on load. – Ghat
… filter argument in the docs arrow.apache.org/docs/python/generated/…). Filtering within a single file as well is being worked on (see issues.apache.org/jira/browse/ARROW-1796) – Raila