How can I read each Parquet row group into a separate partition?

Asked 30/1, 2020 at 14:27 Answered 30/1, 2024 at 12:58

I have a parquet file with 10 row groups:

In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups)
10

But when I load it using Dask Dataframe, it is read into a single partition:

In [31]: print(dask.dataframe.read_parquet("/tmp/test2.parquet").npartitions)
1

This appears to contradict this answer, which states that Dask Dataframe reads each Parquet row group into a separate partition.

How can I read each Parquet row group into a separate partition with Dask Dataframe? Or must the data be distributed over different files for this to work?

Salop answered 30/1, 2020 at 14:27

I believe that fastparquet will read each row-group separately. The fact that pyarrow apparently doesn't could be considered a bug, or at least a feature enhancement that you could request on the Dask issue tracker. I would tend to agree that a set of files containing one row-group each and a single file containing the same row-groups should result in the same partition structure.

Cruiser answered 30/1, 2020 at 15:0

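You can read the file in batches with pyarrow. Note that batch_size is the maximum number of rows per batch, so these batches are not the same thing as row groups.

import pyarrow.parquet as pq

batch_size = 1  # maximum number of rows per batch
_file = pq.ParquetFile("file.parquet")
batches = _file.iter_batches(batch_size)  # batches is a generator of record batches

for batch in batches:
    process(batch)  # process() stands in for your own handling code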

Amathiste answered 3/3, 2021 at 20:35

You may use split_row_groups=True with dask.dataframe.read_parquet:


import dask.dataframe as dd
df = dd.read_parquet(file, split_row_groups=True)

Docs for split_row_groups:

split_row_groups : 'infer', 'adaptive', bool, or int, default 'infer'
    If True, then each output dataframe partition will correspond to a single parquet-file row-group. If False, each partition will correspond to a complete file. If a positive integer value is given, each dataframe partition will correspond to that number of parquet row-groups (or fewer). If 'adaptive', the metadata of each file will be used to ensure that every partition satisfies blocksize. If 'infer' (the default), the uncompressed storage-size metadata in the first file will be used to automatically set split_row_groups to either 'adaptive' or False.

Expiate answered 30/1, 2024 at 12:58
