What's the equivalent of Panda's value_counts() in PySpark?

I have the following Python/pandas command:

df.groupby('Column_Name').agg(lambda x: x.value_counts().max())

which gives me the value counts for ALL columns in a DataFrameGroupBy object.
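A tiny made-up example of what that command returns: for every group, the count of the most frequent value in each column (the data and column names below are only for illustration).

import pandas as pd

# Made-up data purely to illustrate the pandas behaviour
df = pd.DataFrame({
    'Column_Name': ['A', 'A', 'A', 'B', 'B'],
    'col_x':       ['x', 'x', 'y', 'z', 'z'],
    'col_y':       [1,   2,   2,   3,   3],
})

# For each group, value_counts().max() is the frequency of the most common value
print(df.groupby('Column_Name').agg(lambda x: x.value_counts().max()))
# output (roughly):
#              col_x  col_y
# Column_Name
# A                2      2
# B                2      2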

How do I do this action in PySpark?

Pralltriller answered 27/6, 2018 at 13:8 Comment(6)
The task I am asking about is very simple. I want to get the value counts (the highest count of any distinct value) for all columns in a grouped DataFrame. This is easily done in Pandas with the value_counts() method. Pralltriller
Here is my DF:

>>> schemaTrans.show()
+----+----+------+-----+----+----+
|COL1|COL2|  COL3| COL4|COL5|  ID|
+----+----+------+-----+----+----+
| 123| 456|ABC123|  XYZ| 525|ID01|
| 123| 456|ABC123|  XYZ| 634|ID01|
| 123| 456|ABC123|  XYZ| 802|ID01|
| 456| 123|  BC01|K_L_M| 213|ID01|
| 456| 123|  BC01|K_L_M| 401|ID01|
| 456| 123|  BC01|P_Q_M| 213|ID01|
| 123| 456|XYZ012|  ABC| 117|ID02|
| 123| 456|XYZ012|  ABE| 117|ID02|
| 456| 123| QPR12|S_T_U| 204|ID02|
| 456| 123| QPR12|S_T_X| 415|ID02|
+----+----+------+-----+----+----+
Pralltriller
from pyspark.sql.functions import count
exprs = {x: "count" for x in schemaTrans.columns}
schemaTrans.groupBy("ID").agg(exprs).show(5)

+----+---------+-----------+-----------+-----------+-----------+-----------+
|  ID|count(ID)|count(COL4)|count(COL2)|count(COL3)|count(COL1)|count(COL5)|
+----+---------+-----------+-----------+-----------+-----------+-----------+
|ID01|        6|          6|          6|          6|          6|          6|
|ID02|        4|          4|          4|          4|          4|          4|
+----+---------+-----------+-----------+-----------+-----------+-----------+
Pralltriller
exprs = [countDistinct(x) for x in schemaTrans.columns]
schemaTrans.groupBy("ID").agg(*exprs).show(5)

+----+---------------+---------------+---------------+---------------+---------------+-------------+
|  ID|(DISTINCT COL1)|(DISTINCT COL2)|(DISTINCT COL3)|(DISTINCT COL4)|(DISTINCT COL5)|(DISTINCT ID)|
+----+---------------+---------------+---------------+---------------+---------------+-------------+
|ID01|              2|              2|              2|              3|              5|            1|
|ID02|              2|              2|              2|              4|              3|            1|
+----+---------------+---------------+---------------+---------------+---------------+-------------+
Pralltriller
But I would like to have:

+----+----------+-----------+-----------+-----------+-----------+--------+
|  ID|(VL COL1) | (VL COL2) | (VL COL3) | (VL COL4) | (VL COL5) | (VL ID)|
+----+----------+-----------+-----------+-----------+-----------+--------+
|ID01|         3|          3|          3|          3|          2|       1|
|ID02|         2|          2|          2|          2|          2|       1|
+----+----------+-----------+-----------+-----------+-----------+--------+
Pralltriller
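One way to get that kind of output in PySpark is sketched below. It reuses the schemaTrans DataFrame shown above; the ID column itself is skipped so it is not used as a grouping key twice, and the "VL_" result column names are an assumption, not part of any API.

from functools import reduce
from pyspark.sql import functions as F

# For each column: count how often every value occurs within each ID, then keep
# the largest of those counts per ID (the same idea as value_counts().max()).
value_cols = [c for c in schemaTrans.columns if c != "ID"]
per_column = [
    schemaTrans.groupBy("ID", c).count()
               .groupBy("ID")
               .agg(F.max("count").alias("VL_" + c))
    for c in value_cols
]

# Join the per-column results back together, one row per ID
result = reduce(lambda left, right: left.join(right, on="ID"), per_column)
result.show()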
Please do not add these as comments. Edit your question and put it there. Please also read how do I format my code blocks. Also check out how to make good reproducible apache spark dataframe examples.Murton

It's more or less the same:

spark_df.groupBy('column_name').count().orderBy('count')

In the groupBy you can pass multiple columns, separated by commas.

For example groupBy('column_1', 'column_2')
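A minimal sketch of that pattern, assuming a DataFrame named spark_df and those two column names:

from pyspark.sql import functions as F

# Frequency of each (column_1, column_2) combination, most frequent first;
# this is the closest PySpark analogue of pandas' value_counts()
(spark_df
 .groupBy('column_1', 'column_2')
 .count()
 .orderBy(F.col('count').desc())
 .show())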

Ani answered 27/6, 2018 at 13:29 Comment(3)
Hi Tanjin, thank you for your reply! I am not getting the same result. I have been doing the following. (Action-1): from pyspark.sql.functions import count; exprs = {x: "count" for x in df.columns}; df.groupBy("ID").agg(exprs).show(5). This works, but it gives me the total record count for each group, which is NOT what I want. (Action-2): from pyspark.sql.functions import countDistinct; exprs = [countDistinct(x) for x in df.columns]; df.groupBy("ID").agg(*exprs).show(5). This breaks! It errors out as follows: ERROR client.TransportResponseHandler: Pralltriller
The missing .show() that you need to add onto the end of that line to actually see the results might be confusing to beginners.Bucolic
To match the behavior in Pandas you want to return the counts in descending order: spark_df.groupBy('column_name').count().orderBy(col('count').desc()).show() Cacka

Try this when you want to control the order:

data.groupBy('col_name').count().orderBy('count', ascending=False).show()
Paries answered 5/1, 2021 at 10:11 Comment(0)

Try this:

spark_df.groupBy('column_name').count().show()
Stoical answered 16/8, 2020 at 13:42 Comment(0)
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc

spark = SparkSession.builder.appName('whatever_name').getOrCreate()

# Read the CSV file (with a header row) into a Spark DataFrame
spark_df = spark.read.option('header', True).csv(your_file)

# Count how often each value occurs in the column, most frequent first
value_counts = (spark_df
                .select('Column_Name')
                .groupBy('Column_Name')
                .agg(count('Column_Name').alias('counts'))
                .orderBy(desc('counts')))
value_counts.show()

But Spark is much slower than pandas' value_counts() on a single machine.
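For comparison, a single-machine pandas sketch of the same aggregation (reusing the your_file placeholder and column name from above):

import pandas as pd

# value_counts() already returns the frequencies sorted in descending order
pdf = pd.read_csv(your_file)
print(pdf['Column_Name'].value_counts())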

Catercornered answered 23/11, 2021 at 10:38 Comment(0)

df.groupBy('column_name').count().orderBy('count').show()

Anselma answered 3/3, 2022 at 5:59 Comment(2)
Please read "How to Answer". It helps more if you supply an explanation why this is the preferred solution and explain how it works. We want to educate, not just provide code.Kathie
I think it does not count null values. Ebby
