Get mode (most often) value in Spark column with groupBy

I have a SparkR DataFrame and I want to get the mode (most often) value for each unique name. How can I do this? There doesn't seem to be a built-in mode function. Either a SparkR or PySpark solution will do.

# Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
DF <- createDataFrame(df)

name   | value
-----------------
Thomas |  5
Thomas |  5
Thomas |  4
Bill   |  3
Bill   |  3
Bill   |  9

# What I want to get
name   | mode(value)
-----------------
Thomas |   5
Bill   |   3 
Evocative answered 28/6, 2017 at 15:25

You could achieve that using a combination of groupBy() and window functions, like this:

from pyspark.sql import Window
from pyspark.sql.functions import col, desc, row_number

# Count occurrences of each (name, value) pair,
# then rank each name's values by descending count
grouped = df.groupBy('name', 'value').count()
window = Window.partitionBy("name").orderBy(desc("count"))
grouped\
    .withColumn('order', row_number().over(window))\
    .where(col('order') == 1)\
    .show()

outputs:

+------+-----+-----+-----+
|  name|value|count|order|
+------+-----+-----+-----+
|  Bill|    3|    2|    1|
|Thomas|    5|    2|    1|
+------+-----+-----+-----+
Ghastly answered 28/6, 2017 at 15:46
Thanks, @pandaromeo. Can you explain what the Window.partitionBy('name').orderBy(desc('count')) does? I'm also having trouble converting this code to SparkR, though the windowPartitionBy command exists there. – Evocative
After groupBy and count() you end up with a count column. Imagine that you perform orderBy on it in descending order. With a window you not only sort it but also give each row an index, which enables you to filter out only the rows that interest you; in this case, the mode (most often) value (a concrete sketch of the intermediate frame follows these comments). Unfortunately I have no experience with R, so I can't translate it to SparkR. – Ghastly
Actually, this link explains it well: github.com/apache/spark/blob/master/R/pkg/vignettes/…. – Evocative
If you want to ignore null values, do grouped = df.where(df["value"].isNotNull()).groupBy("name", "value").count() – Inheritor
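
To make the window mechanics described above concrete, here is the intermediate frame before the order == 1 filter (a sketch using the question's example data; row order from show() may vary):

grouped.withColumn('order', row_number().over(window)).show()
# +------+-----+-----+-----+
# |  name|value|count|order|
# +------+-----+-----+-----+
# |  Bill|    3|    2|    1|
# |  Bill|    9|    1|    2|
# |Thomas|    5|    2|    1|
# |Thomas|    4|    1|    2|
# +------+-----+-----+-----+

Note that row_number() breaks ties arbitrarily when two values share the top count; rank() over the same window would keep every tied mode.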

Spark 3.4+ has a mode column function.

PySpark full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Thomas', 5),
     ('Thomas', 5),
     ('Thomas', 4),
     ('Bill', 3),
     ('Bill', 3),
     ('Bill', 9)],
    ['name', 'value'])

df.groupBy('name').agg(F.mode('value')).show()
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|          5|
# |  Bill|          3|
# +------+-----------+

SparkR full example:

df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
df <- as.DataFrame(df)

df <- agg(groupBy(df, 'name'), expr("mode(value)"))

showDF(df)
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|        5.0|
# |  Bill|        3.0|
# +------+-----------+
Chiclayo answered 1/8, 2023 at 11:25

Here's the SparkR version of the solution:

# Count occurrences of each (name, value) pair
grouped <- agg(groupBy(df, 'name', 'value'), count = count(df$value))
# Rank each name's values by descending count
window <- orderBy(windowPartitionBy("name"), desc(grouped$count))
# Keep only the top-ranked (most frequent) value per name
dfmode <- withColumn(grouped, 'order', over(row_number(), window))
dfmode <- filter(dfmode, dfmode$order == 1)
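
Displaying the result should mirror the first answer's PySpark output (a sketch, assuming the question's example data; value shows as a double because the R vector is numeric):

showDF(dfmode)
# +------+-----+-----+-----+
# |  name|value|count|order|
# +------+-----+-----+-----+
# |  Bill|  3.0|    2|    1|
# |Thomas|  5.0|    2|    1|
# +------+-----+-----+-----+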
Evocative answered 28/6, 2017 at 20:09
