Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend
Let's say I have the following dataframe:

Code  Price
AA1      10
AA1      20
BB2      30

And I want to perform the following operation on it:

df.groupby("code").aggregate({
    "price": "sum"
})

To experiment with the new pyarrow dtypes introduced in Pandas 2.0, I created three copies of the dataframe with different dtype combinations and, for each copy, measured the execution time of the operation above (average of 5 runs). The results are in the table below; a sketch of the setup follows it.

Code column dtype    Price column dtype    Execution time
object               float64               2.94 s
string[pyarrow]      double[pyarrow]       49.5 s
string[pyarrow]      float64               1.11 s
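For reference, here is a minimal sketch of how the three copies and timings could be reproduced. The column names and the 5-run average come from the description above; the row count and value distribution are assumptions, chosen only so the timing differences are visible:

    import time

    import numpy as np
    import pandas as pd

    # Synthetic frame; the size is an assumption for illustration.
    n = 10_000_000
    df = pd.DataFrame({
        "Code": np.random.choice(["AA1", "BB2", "CC3"], size=n),
        "Price": np.random.rand(n) * 100,
    })

    # The three dtype combinations from the table above.
    variants = {
        "object / float64": df,
        "string[pyarrow] / double[pyarrow]": df.astype(
            {"Code": "string[pyarrow]", "Price": "double[pyarrow]"}
        ),
        "string[pyarrow] / float64": df.astype({"Code": "string[pyarrow]"}),
    }

    for label, frame in variants.items():
        start = time.perf_counter()
        for _ in range(5):  # average of 5 executions, as in the question
            frame.groupby("Code").aggregate({"Price": "sum"})
        elapsed = (time.perf_counter() - start) / 5
        print(f"{label}: {elapsed:.2f} s")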

Can anyone explain why applying an aggregate function to a column with the pyarrow double dtype is so much slower than with the standard numpy float64 dtype?

Tnt answered 3/4, 2023 at 9:6 Comment(1)
The last stable pandas version is 1.5.3; version 2.0.0 is the first release candidate. If you have found an issue, create a ticket on GitHub, or search first to see whether one already exists. – Reseta

https://github.com/pandas-dev/pandas/issues/52070

Looks like groupby aggregation isn't implemented for arrow-backed columns yet, so there's likely an arrow -> numpy conversion happening internally, which causes the loss of performance.
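If that's the bottleneck, a possible workaround (a sketch, using the column names from the question) is to cast the arrow-backed numeric column back to numpy float64 yourself before grouping, so the aggregation runs on the fast numpy path:

    import pandas as pd

    # Example frame with arrow-backed dtypes, as in the question.
    df = pd.DataFrame(
        {"Code": ["AA1", "AA1", "BB2"], "Price": [10.0, 20.0, 30.0]}
    ).astype({"Code": "string[pyarrow]", "Price": "double[pyarrow]"})

    # Cast the numeric column to numpy float64 up front, rather than
    # paying for an implicit conversion inside the aggregation.
    result = (
        df.assign(Price=df["Price"].astype("float64"))
          .groupby("Code")
          .aggregate({"Price": "sum"})
    )
    print(result)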

Kithara answered 3/4, 2023 at 9:12 Comment(0)
