Count non-null values in a column in PySpark

I have a dataframe which contains null values:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(125, '2012-10-10', 'tv'),
     (20, '2012-10-10', 'phone'),
     (40, '2012-10-10', 'tv'),
     (None, '2012-10-10', 'tv')],
    ["Sales", "date", "product"]
)

I need to count the non-null values in the "Sales" column.

I tried 3 methods.

The first one I got right:

df.where(F.col("Sales").isNotNull()).groupBy('product')\
  .agg((F.count(F.col("Sales")).alias("sales_count"))).show()

# product   | sales_count
# phone     |  1
# tv        |  2

The second one is not correct:

df.groupBy('product')\
  .agg((F.count(F.col("Sales").isNotNull()).alias("sales_count"))).show()

# product   | sales_count
# phone     |  1
# tv        |  3

With the third one, I got an error:

df.groupBy('product')\
  .agg((F.col("Sales").isNotNull().count()).alias("sales_count")).show()

TypeError: 'Column' object is not callable

What might cause errors in the second and third methods?

Gowrie asked 5/2, 2018 at 21:50

Your first attempt filters out the rows where Sales is null before the aggregation, so it gives you the correct result.

But with the second code

df.groupBy('product') \
    .agg((F.count(F.col("Sales").isNotNull()).alias("sales_count"))).show()

Here you haven't filtered anything out; the aggregation runs over the whole dataset. Look closely: F.col("Sales").isNotNull() produces a boolean column of true/false values, and such a column is never null. So F.count(F.col("Sales").isNotNull()) simply counts every row in each group, which becomes evident if you materialize the expression as a new column:

df.withColumn("isNotNull", F.col("Sales").isNotNull()).show()

which would give you

+-----+----------+-------+---------+
|Sales|      date|product|isNotNull|
+-----+----------+-------+---------+
|  125|2012-10-10|     tv|     true|
|   20|2012-10-10|  phone|     true|
|   40|2012-10-10|     tv|     true|
| null|2012-10-10|     tv|    false|
+-----+----------+-------+---------+

So the counts from your second attempt are correct only for what the expression actually computes: the number of rows per group, nulls in Sales included.
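
If you want to keep the shape of the second attempt (no pre-filter), the boolean has to be turned into something the aggregate handles correctly; a minimal sketch of two equivalent fixes, using the df from the question:

# when() without otherwise() yields null where the condition is false,
# and count() skips nulls.
df.groupBy('product') \
    .agg(F.count(F.when(F.col("Sales").isNotNull(), 1)).alias("sales_count")).show()

# Alternatively, cast the boolean to 0/1 and sum it.
df.groupBy('product') \
    .agg(F.sum(F.col("Sales").isNotNull().cast("int")).alias("sales_count")).show()

Both print a sales_count of 1 for phone and 2 for tv.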

For your third attempt, .count() is a DataFrame action, not a Column method, so Python ends up trying to call a Column object and raises TypeError: 'Column' object is not callable. .agg() only accepts expressions that evaluate to a Column; those can come from built-in functions, UDFs, or your own functions that return a Column.
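
Note also that F.count() already skips nulls on its own, so the simplest correct version (a sketch on the df from the question) needs neither the pre-filter nor isNotNull:

df.groupBy('product') \
    .agg(F.count(F.col("Sales")).alias("sales_count")).show()

# product   | sales_count
# phone     |  1
# tv        |  2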

Inapprehensive answered 6/2, 2018 at 2:17

There is an easier way:

>>> df.groupBy("product").agg({"Sales":"count"}).show()
+-------+------------+
|product|count(Sales)|
+-------+------------+
|  phone|           1|
|     tv|           2|
+-------+------------+
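
A note on naming: the dict form doesn't accept an alias, so if you want the sales_count name from the question, one option (a minimal sketch on the same df) is to rename the generated count(Sales) column afterwards:

>>> df.groupBy("product").agg({"Sales": "count"}) \
...     .withColumnRenamed("count(Sales)", "sales_count").show()
+-------+-----------+
|product|sales_count|
+-------+-----------+
|  phone|          1|
|     tv|          2|
+-------+-----------+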
Fabriane answered 5/2, 2018 at 22:15

Count non-null values

  • for string and numeric columns only:

    df.summary("count").show()
    # +-------+-----+----+-------+
    # |summary|Sales|date|product|
    # +-------+-----+----+-------+
    # |  count|    3|   4|      4|
    # +-------+-----+----+-------+
    
  • for every column of any type:

    df.agg(*[F.count(c).alias(c) for c in df.columns]).show()
    # +-----+----+-------+
    # |Sales|date|product|
    # +-----+----+-------+
    # |    3|   4|      4|
    # +-----+----+-------+
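
  • counting nulls per column, the complement (a sketch reusing the same pattern; F.count only sees non-null values, so sum a casted isNull() instead):

    df.agg(*[F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()
    # +-----+----+-------+
    # |Sales|date|product|
    # +-----+----+-------+
    # |    1|   0|      0|
    # +-----+----+-------+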
    
Lyndes answered 5/10, 2022 at 6:06

Check out the count_if() function. Place your condition inside it:

df.select(F.count_if(F.col("Sales").isNotNull())).show()

Available in pyspark.sql.functions from Spark 3.5 (see the comment below); for earlier 3.x releases, a sketch with F.expr follows.
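
On earlier 3.x releases the same aggregate is available as a SQL function through F.expr (a minimal sketch, assuming the df and the F alias from the question):

df.select(F.expr("count_if(Sales IS NOT NULL)").alias("sales_count")).show()
# +-----------+
# |sales_count|
# +-----------+
# |          3|
# +-----------+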

Harberd answered 21/11, 2023 at 11:00 Comment(1)
It is supported from 3.5; I have 3.4.1, just tried it, and it failed. But F.sum(F.when(col_with_condition, 1).otherwise(0)) worked. – Gaylordgaylussac
