I have the following python/pandas command:
df.groupby('Column_Name').agg(lambda x: x.value_counts().max())
with which I get the value counts for ALL columns in a DataFrameGroupBy
object.
How do I do this in PySpark?
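For context, here is what that pandas command computes on a small synthetic DataFrame (the column names 'a' and 'b' and the data are made up for illustration): per group, the count of the most frequent value in each remaining column.

```python
import pandas as pd

# Hypothetical data; 'a' and 'b' are made-up column names.
df = pd.DataFrame({
    'Column_Name': ['x', 'x', 'x', 'y', 'y'],
    'a': [1, 1, 2, 3, 3],
    'b': ['p', 'q', 'q', 'p', 'p'],
})

# For each group, the size of the largest value bucket in every other column
result = df.groupby('Column_Name').agg(lambda x: x.value_counts().max())
print(result)
```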
It's more or less the same:
spark_df.groupBy('column_name').count().orderBy('count')
In groupBy you can pass multiple columns, separated by commas:
for example, groupBy('column_1', 'column_2')
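For comparison, the pandas analogue of a multi-column groupBy-and-count looks like this (the column names and data here are illustrative, not from the question):

```python
import pandas as pd

# Illustrative data; 'column_1' and 'column_2' are made-up names.
df = pd.DataFrame({
    'column_1': ['a', 'a', 'b', 'b', 'b'],
    'column_2': ['x', 'x', 'x', 'y', 'y'],
})

# pandas equivalent of spark_df.groupBy('column_1', 'column_2').count()
counts = df.groupby(['column_1', 'column_2']).size().reset_index(name='count')
print(counts)
```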
The .show() that you need to add onto the end of that line to actually see the results might be confusing to beginners. – Bucolic
spark_df.groupBy('column_name').count().orderBy(col('count').desc()).show() (this needs from pyspark.sql.functions import col) – Cacka
Try this when you want to control the sort order:
data.groupBy('col_name').count().orderBy('count', ascending=False).show()
Try this:
spark_df.groupBy('column_name').count().show()
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc

spark = SparkSession.builder.appName('whatever_name').getOrCreate()
spark_df = spark.read.option('header', True).csv(your_file)

# Count occurrences of each value in Column_Name, most frequent first
value_counts = (
    spark_df
    .groupBy('Column_Name')
    .agg(count('Column_Name').alias('counts'))
    .orderBy(desc('counts'))
)
value_counts.show()
Note, though, that on a single machine Spark is much slower than pandas' value_counts().
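To illustrate that comparison, pandas' value_counts() gives the same counts-sorted-descending result in a single call (synthetic data; the values are made up):

```python
import pandas as pd

# Made-up sample data for demonstration
df = pd.DataFrame({'Column_Name': ['a', 'b', 'a', 'a', 'c']})

# pandas one-liner matching groupBy('Column_Name').count().orderBy(desc('counts'))
vc = df['Column_Name'].value_counts()
print(vc)
```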
df.groupBy('column_name').count().orderBy('count').show()
Null values – Ebby