Pyarrow Dataset read specific columns and specific rows

Is there a way to use a pyarrow Parquet dataset to read only specific columns and, if possible, filter rows, instead of reading the whole file into a DataFrame?

Preciosity answered 10/9, 2019 at 22:7 Comment(3)
Yes to reading specific columns; that's one of the strengths of the Parquet format. In general, with pd.read_parquet() you can specify the columns with the columns arg. To my knowledge you can't filter on load. - Ghat
You can also filter a dataset when reading, but for now only in the case of a partitioned dataset (consisting of multiple files in nested directories; see the filter argument in the docs arrow.apache.org/docs/python/generated/…). Filtering within a single file is being worked on (see issues.apache.org/jira/browse/ARROW-1796). - Raila
See also the answer to this file: #56523477 - Raila

As of pyarrow==2.0.0, this is possible at least with pyarrow.parquet.ParquetDataset.

To read specific columns, its read and read_pandas methods have a columns option. You can also do this with pandas.read_parquet.

To read specific rows, its __init__ method has a filters option.

Quadriplegia answered 23/10, 2020 at 15:19 Comment(1)
Excellent! Hmm, I wonder if all rows are read into memory before the bad ones are filtered out. If so, memory pressure still spikes, though I bet the Arrow table is smaller than the pandas DataFrame. - Cattleya
