How to count boolean values in a grouped Spark DataFrame

I want to count how many records are true in a column of a grouped Spark DataFrame, but I don't know how to do that in Python. For example, I have data with Region, Salary, and IsUnemployed columns, where IsUnemployed is a Boolean. I want to see how many unemployed people there are in each region. I know I can filter and then group by, but I want to compute two aggregations at the same time, as below:

from pyspark.sql import functions as F

# F.count counts all non-null values, not just the True ones
data.groupby("Region").agg(F.avg("Salary"), F.count("IsUnemployed"))
Adai answered 18/2, 2016 at 22:28

Probably the simplest solution is a plain CAST (C-style, where TRUE -> 1 and FALSE -> 0) combined with SUM:

(data
    .groupby("Region")
    .agg(F.avg("Salary"), F.sum(F.col("IsUnemployed").cast("long"))))
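
Note (not from the original answer): both variants skip NULL values, since SUM ignores NULLs after the cast and COUNT only counts the non-null results of the CASE WHEN; rows with a missing IsUnemployed therefore do not contribute to the count.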

A slightly more universal and idiomatic solution is CASE WHEN combined with COUNT:

(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))

but here it is clearly overkill.
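
For completeness, a minimal self-contained sketch (the SparkSession setup and sample rows are made up for illustration, not from the question) showing that both approaches agree:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample rows, for illustration only
data = spark.createDataFrame(
    [("East", 50000, True), ("East", 60000, False), ("West", 55000, True)],
    ["Region", "Salary", "IsUnemployed"],
)

(data
    .groupby("Region")
    .agg(
        F.avg("Salary").alias("AvgSalary"),
        # CAST + SUM: True -> 1, False -> 0
        F.sum(F.col("IsUnemployed").cast("long")).alias("UnemployedViaSum"),
        # CASE WHEN + COUNT: count only rows where the condition holds
        F.count(F.when(F.col("IsUnemployed"), True)).alias("UnemployedViaCount"),
    )
    .show())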

Shrubbery answered 18/2, 2016 at 22:40

Comment: Does the second approach run slower, or does Catalyst optimize both to roughly similar bytecode? – Recount
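
One way to check (not part of the original thread) is to compare the physical plans Spark produces; if Catalyst rewrites both aggregations to the same form, the explain output will match. A quick sketch, assuming a DataFrame named data as in the question:

from pyspark.sql import functions as F

# Build both variants and print their physical plans
sum_plan = data.groupby("Region").agg(F.sum(F.col("IsUnemployed").cast("long")))
count_plan = data.groupby("Region").agg(F.count(F.when(F.col("IsUnemployed"), True)))

sum_plan.explain()    # physical plan for CAST + SUM
count_plan.explain()  # physical plan for CASE WHEN + COUNT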

count_if function

PySpark 3.5 introduced pyspark.sql.functions.count_if, documented as "Returns the number of TRUE values for the col."

So for your example, you could do:

from pyspark.sql import functions as F

results = (
    data
    .groupby("Region")
    .agg(
        F.avg("Salary").alias("AverageSalary"),
        # new count_if method: counts rows where the column is True
        F.count_if("IsUnemployed").alias("CountUnemployed"),
        # the old casting method is still needed to get the proportion of True values
        F.avg(F.col("IsUnemployed").cast("integer")).alias("ProportionUnemployed"),
    )
)
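
On PySpark versions that predate the count_if wrapper, the same aggregate should be reachable through the SQL function of the same name via F.expr (the exact minimum Spark version for the SQL count_if is an assumption here):

# Equivalent expression for versions without F.count_if
results_compat = (
    data
    .groupby("Region")
    .agg(F.expr("count_if(IsUnemployed)").alias("CountUnemployed"))
)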
Southwestward answered 31/5 at 14:43
