I have a PySpark DataFrame with an A field, a few B fields that depend on A (A->B), and C fields that I want to aggregate per A. For example:
A | B | C
----------
A | 1 | 6
A | 1 | 7
B | 2 | 8
B | 2 | 4
I wish to group by A, present any of B, and run an aggregation (let's say SUM) on C.
The expected result would be:
A | B | C
----------
A | 1 | 13
B | 2 | 12
SQL-wise I would do:
SELECT A, COALESCE(B) as B, SUM(C) as C
FROM T
GROUP BY A
What is the PySpark way to do that?
I can group by A and B together, or select MIN(B) for each A, for example:
df.groupBy('A').agg(F.min('B').alias('B'),F.sum('C').alias('C'))
or
df.groupBy(['A','B']).agg(F.sum('C').alias('C'))
but that seems inefficient. Is there anything similar to SQL coalesce in PySpark?
Thanks
first here is computationally equivalent to any? I'm using this function a lot when doing a window (over partition order by), but then it requires sorting. – Dispend