Pyarrow Dataset read specific columns and specific rows


Asked 10/9, 2019 at 22:7 Answered 23/10, 2020 at 15:19

Is there a way to use a pyarrow Parquet dataset to read specific columns and, if possible, filter the data, instead of reading a whole file into a dataframe?

Preciosity answered 10/9, 2019 at 22:7 Comment(3)

Yes, to reading specific columns, that's one of the strengths of the Parquet format. In general, with pd.read_parquet() you can specify the columns with the columns arg. To my knowledge you can't filter on load. – Ghat 10/9, 2019 at 22:12

You can also filter a dataset when reading, but for now only in the case of a partitioned dataset (consisting of multiple files in nested directories; see the filters argument in the docs arrow.apache.org/docs/python/generated/…). Filtering within a single file is being worked on (see issues.apache.org/jira/browse/ARROW-1796) – Raila 11/9, 2019 at 13:40

See also the answer to this question: #56523477 – Raila 11/9, 2019 at 13:40

As of pyarrow==2.0.0, this is possible at least with pyarrow.parquet.ParquetDataset.

To read specific columns, its read and read_pandas methods have a columns option. You can also do this with pandas.read_parquet.

To read specific rows, its __init__ method has a filters option.

Quadriplegia answered 23/10, 2020 at 15:19 Comment(1)

Excellent! Hmm, I wonder if all rows are read into memory before filtering out the bad ones. If so, memory pressure still spikes, though I bet the Arrow table is smaller than the pandas df. – Cattleya 14/5, 2022 at 2:30
