Python: Obtain number of rows for ParquetDataset?

How do I obtain the number of rows of a ParquetDataset that is stored as a folder containing multiple Parquet files?

I tried

from pyarrow.parquet import ParquetDataset
a = ParquetDataset(path)
a.metadata
a.schema
a.common_metadata

I want to find the total number of rows without reading the dataset, as it can be quite large.

What's the best way to do that?

Swarey answered 1/4, 2020 at 0:39
8

You will still have to touch each individual file, but luckily Parquet saves the total row count of each file in its footer. Thus you only need to read the metadata of each file to get its row count. The following code computes the number of rows in the ParquetDataset:

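from pyarrow.parquet import ParquetDataset

nrows = 0
dataset = ParquetDataset(..)
# dataset.pieces is part of the legacy ParquetDataset API (older pyarrow releases)
for piece in dataset.pieces:
    # get_metadata() reads only the file footer, not the column data
    nrows += piece.get_metadata().num_rows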
Hartman answered 1/4, 2020 at 8:42
6

For pyarrow >= 5.0.0:

from pyarrow.parquet import ParquetDataset
dataset = ParquetDataset(path, use_legacy_dataset=False)
nrows = sum(p.count_rows() for p in dataset.fragments)
Decolorize answered 7/5, 2022 at 15:12
1

Modern pyarrow (version 17.0.0 at the time of writing this answer):

import pyarrow.dataset
dataset = pyarrow.dataset.dataset("...")
sum(row_group.num_rows for fragment in dataset.get_fragments() for row_group in fragment.row_groups)

On my machine, this is around 100 times faster on larger datasets than count_rows() or the pandas alternatives presented here.

This is because .count_rows() will actually traverse the file with a given filter, while .num_rows is an O(1) lookup in the row-group metadata stored in the file footer.

Butterbur answered 23/10 at 15:22
0

A pyarrow Table that is already in memory also exposes num_rows:

import pyarrow as pa
import pandas as pd
df = pd.DataFrame({'n_legs': [None, 4, 5, None],
                   'animals': ["Flamingo", "Horse", None, "Centipede"]})
table = pa.Table.from_pandas(df)
table.num_rows
Searle answered 15/8 at 11:37