What's the equivalent of Panda's value_counts() in PySpark?

I have the following Python/pandas command:

df.groupby('Column_Name').agg(lambda x: x.value_counts().max())

which gives me the value counts for ALL columns in a DataFrameGroupBy object.
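A tiny made-up example of what that command returns: for every group, the count of the most frequent value in each column (the data and column names below are only for illustration).

import pandas as pd

# Made-up data purely to illustrate the pandas behaviour
df = pd.DataFrame({
    'Column_Name': ['A', 'A', 'A', 'B', 'B'],
    'col_x':       ['x', 'x', 'y', 'z', 'z'],
    'col_y':       [1,   2,   2,   3,   3],
})

# For each group, value_counts().max() is the frequency of the most common value
print(df.groupby('Column_Name').agg(lambda x: x.value_counts().max()))
# output (roughly):
#              col_x  col_y
# Column_Name
# A                2      2
# B                2      2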

How do I do this action in PySpark?

Pralltriller answered 27/6, 2018 at 13:8 Comment(6)
The task I am asking about is very simple. I want to get the value counts (the highest count of any distinct value) for all columns in a grouped DataFrame. This is easily done in Pandas with the value_counts() method. Pralltriller
Here is my DF:

>>> schemaTrans.show()
+----+----+------+-----+----+----+
|COL1|COL2|  COL3| COL4|COL5|  ID|
+----+----+------+-----+----+----+
| 123| 456|ABC123|  XYZ| 525|ID01|
| 123| 456|ABC123|  XYZ| 634|ID01|
| 123| 456|ABC123|  XYZ| 802|ID01|
| 456| 123|  BC01|K_L_M| 213|ID01|
| 456| 123|  BC01|K_L_M| 401|ID01|
| 456| 123|  BC01|P_Q_M| 213|ID01|
| 123| 456|XYZ012|  ABC| 117|ID02|
| 123| 456|XYZ012|  ABE| 117|ID02|
| 456| 123| QPR12|S_T_U| 204|ID02|
| 456| 123| QPR12|S_T_X| 415|ID02|
+----+----+------+-----+----+----+
Pralltriller
from pyspark.sql.functions import count
exprs = {x: "count" for x in schemaTrans.columns}
schemaTrans.groupBy("ID").agg(exprs).show(5)

+----+---------+-----------+-----------+-----------+-----------+-----------+
|  ID|count(ID)|count(COL4)|count(COL2)|count(COL3)|count(COL1)|count(COL5)|
+----+---------+-----------+-----------+-----------+-----------+-----------+
|ID01|        6|          6|          6|          6|          6|          6|
|ID02|        4|          4|          4|          4|          4|          4|
+----+---------+-----------+-----------+-----------+-----------+-----------+
Pralltriller
exprs = [countDistinct(x) for x in schemaTrans.columns]
schemaTrans.groupBy("ID").agg(*exprs).show(5)

+----+---------------+---------------+---------------+---------------+---------------+-------------+
|  ID|(DISTINCT COL1)|(DISTINCT COL2)|(DISTINCT COL3)|(DISTINCT COL4)|(DISTINCT COL5)|(DISTINCT ID)|
+----+---------------+---------------+---------------+---------------+---------------+-------------+
|ID01|              2|              2|              2|              3|              5|            1|
|ID02|              2|              2|              2|              4|              3|            1|
+----+---------------+---------------+---------------+---------------+---------------+-------------+
Pralltriller
But I would like to have:

+----+----------+-----------+-----------+-----------+-----------+--------+
|  ID|(VL COL1) | (VL COL2) | (VL COL3) | (VL COL4) | (VL COL5) | (VL ID)|
+----+----------+-----------+-----------+-----------+-----------+--------+
|ID01|         3|          3|          3|          3|          2|       1|
|ID02|         2|          2|          2|          2|          2|       1|
+----+----------+-----------+-----------+-----------+-----------+--------+
Pralltriller
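One way to get that kind of output in PySpark is sketched below. It reuses the schemaTrans DataFrame shown above; the ID column itself is skipped so it is not used as a grouping key twice, and the "VL_" result column names are an assumption, not part of any API.

from functools import reduce
from pyspark.sql import functions as F

# For each column: count how often every value occurs within each ID, then keep
# the largest of those counts per ID (the same idea as value_counts().max()).
value_cols = [c for c in schemaTrans.columns if c != "ID"]
per_column = [
    schemaTrans.groupBy("ID", c).count()
               .groupBy("ID")
               .agg(F.max("count").alias("VL_" + c))
    for c in value_cols
]

# Join the per-column results back together, one row per ID
result = reduce(lambda left, right: left.join(right, on="ID"), per_column)
result.show()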
Please do not add these as comments. Edit your question and put it there. Please also read how do I format my code blocks. Also check out how to make good reproducible apache spark dataframe examples.Murton

It's more or less the same:

spark_df.groupBy('column_name').count().orderBy('count')

In the groupBy you can pass multiple columns, separated by commas.

For example groupBy('column_1', 'column_2')
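A minimal sketch of that pattern, assuming a DataFrame named spark_df and those two column names:

from pyspark.sql import functions as F

# Frequency of each (column_1, column_2) combination, most frequent first;
# this is the closest PySpark analogue of pandas' value_counts()
(spark_df
 .groupBy('column_1', 'column_2')
 .count()
 .orderBy(F.col('count').desc())
 .show())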

Ani answered 27/6, 2018 at 13:29 Comment(3)
Hi Tanjin, thank you for your reply! I am not getting the same result. I have been doing the following. (Action-1): from pyspark.sql.functions import count; exprs = {x: "count" for x in df.columns}; df.groupBy("ID").agg(exprs).show(5). This works, but it gives me the total record count for each group, which is NOT what I want. (Action-2): from pyspark.sql.functions import countDistinct; exprs = [countDistinct(x) for x in df.columns]; df.groupBy("ID").agg(*exprs).show(5). This breaks! It errors out as follows: ERROR client.TransportResponseHandler: Pralltriller
The missing .show() that you need to add onto the end of that line to actually see the results might be confusing to beginners.Bucolic
To match the behavior in Pandas you want to return the counts in descending order: spark_df.groupBy('column_name').count().orderBy(col('count').desc()).show() Cacka

Try this when you want to control the order:

data.groupBy('col_name').count().orderBy('count', ascending=False).show()
Paries answered 5/1, 2021 at 10:11 Comment(0)

Try this:

spark_df.groupBy('column_name').count().show()
Stoical answered 16/8, 2020 at 13:42 Comment(0)
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc

spark = SparkSession.builder.appName('whatever_name').getOrCreate()

# Read the CSV file (with a header row) into a Spark DataFrame
spark_df = spark.read.option('header', True).csv(your_file)

# Count how often each value occurs in the column, most frequent first
value_counts = (spark_df
                .select('Column_Name')
                .groupBy('Column_Name')
                .agg(count('Column_Name').alias('counts'))
                .orderBy(desc('counts')))
value_counts.show()

But Spark is much slower than pandas' value_counts() on a single machine.
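For comparison, a single-machine pandas sketch of the same aggregation (reusing the your_file placeholder and column name from above):

import pandas as pd

# value_counts() already returns the frequencies sorted in descending order
pdf = pd.read_csv(your_file)
print(pdf['Column_Name'].value_counts())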

Catercornered answered 23/11, 2021 at 10:38 Comment(0)

df.groupBy('column_name').count().orderBy('count').show()

Anselma answered 3/3, 2022 at 5:59 Comment(2)
Please read "How to Answer". It helps more if you supply an explanation why this is the preferred solution and explain how it works. We want to educate, not just provide code.Kathie
I think it does not count null values. Ebby
