Get mode (most often) value in Spark column with groupBy

I have a SparkR DataFrame and I want to get the mode (most often) value for each unique name. How can I do this? There doesn't seem to be a built-in mode function. Either a SparkR or PySpark solution will do.

# Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
DF <- createDataFrame(df)

name   | value
-----------------
Thomas |  5
Thomas |  5
Thomas |  4
Bill   |  3
Bill   |  3
Bill   |  9

# What I want to get
name   | mode(value)
-----------------
Thomas |   5
Bill   |   3 
Evocative answered 28/6, 2017 at 15:25

You could achieve that using a combination of groupBy() and window functions, like this:

from pyspark.sql import Window
from pyspark.sql.functions import col, desc, row_number

# Count occurrences of each (name, value) pair,
# then rank each name's values by descending count
grouped = df.groupBy('name', 'value').count()
window = Window.partitionBy("name").orderBy(desc("count"))
grouped\
    .withColumn('order', row_number().over(window))\
    .where(col('order') == 1)\
    .show()

outputs:

+------+-----+-----+-----+
|  name|value|count|order|
+------+-----+-----+-----+
|  Bill|    3|    2|    1|
|Thomas|    5|    2|    1|
+------+-----+-----+-----+
Ghastly answered 28/6, 2017 at 15:46
Thanks, @pandaromeo. Can you explain what the Window.partitionBy('name').orderBy(desc('count')) does? I'm also having trouble converting this code to SparkR, though the windowPartitionBy command exists there. – Evocative
After groupBy and count() you end up with a count column. Imagine that you perform orderBy on it in descending order. With a window you not only sort it but also give each row an index, which enables you to filter out only the rows that interest you; in this case, the mode (most often) value (a concrete sketch of the intermediate frame follows these comments). Unfortunately I have no experience with R, so I can't translate it to SparkR. – Ghastly
Actually, this link explains it well: github.com/apache/spark/blob/master/R/pkg/vignettes/…. – Evocative
If you want to ignore null values, do grouped = df.where(df["value"].isNotNull()).groupBy("name", "value").count() – Inheritor
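
To make the window mechanics described above concrete, here is the intermediate frame before the order == 1 filter (a sketch using the question's example data; row order from show() may vary):

grouped.withColumn('order', row_number().over(window)).show()
# +------+-----+-----+-----+
# |  name|value|count|order|
# +------+-----+-----+-----+
# |  Bill|    3|    2|    1|
# |  Bill|    9|    1|    2|
# |Thomas|    5|    2|    1|
# |Thomas|    4|    1|    2|
# +------+-----+-----+-----+

Note that row_number() breaks ties arbitrarily when two values share the top count; rank() over the same window would keep every tied mode.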

Spark 3.4+ has a mode column function.

PySpark full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Thomas', 5),
     ('Thomas', 5),
     ('Thomas', 4),
     ('Bill', 3),
     ('Bill', 3),
     ('Bill', 9)],
    ['name', 'value'])

df.groupBy('name').agg(F.mode('value')).show()
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|          5|
# |  Bill|          3|
# +------+-----------+

SparkR full example:

df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
df <- as.DataFrame(df)

df <- agg(groupBy(df, 'name'), expr("mode(value)"))

showDF(df)
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|        5.0|
# |  Bill|        3.0|
# +------+-----------+
Chiclayo answered 1/8, 2023 at 11:25

Here's the SparkR version of the solution:

# Count occurrences of each (name, value) pair
grouped <- agg(groupBy(df, 'name', 'value'), count = count(df$value))
# Rank each name's values by descending count
window <- orderBy(windowPartitionBy("name"), desc(grouped$count))
# Keep only the top-ranked (most frequent) value per name
dfmode <- withColumn(grouped, 'order', over(row_number(), window))
dfmode <- filter(dfmode, dfmode$order == 1)
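
Displaying the result should mirror the first answer's PySpark output (a sketch, assuming the question's example data; value shows as a double because the R vector is numeric):

showDF(dfmode)
# +------+-----+-----+-----+
# |  name|value|count|order|
# +------+-----+-----+-----+
# |  Bill|  3.0|    2|    1|
# |Thomas|  5.0|    2|    1|
# +------+-----+-----+-----+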
Evocative answered 28/6, 2017 at 20:09
