I have a PySpark DataFrame that contains null values:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(125, '2012-10-10', 'tv'),
     (20, '2012-10-10', 'phone'),
     (40, '2012-10-10', 'tv'),
     (None, '2012-10-10', 'tv')],
    ["Sales", "date", "product"]
)
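For context, here is what df.show() prints (on newer Spark versions the null may render as NULL):
df.show()
# +-----+----------+-------+
# |Sales|      date|product|
# +-----+----------+-------+
# |  125|2012-10-10|     tv|
# |   20|2012-10-10|  phone|
# |   40|2012-10-10|     tv|
# | null|2012-10-10|     tv|
# +-----+----------+-------+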
I need to count the non-null values in the "Sales" column for each product.
I tried three methods.
The first one gives the correct result:
df.where(F.col("sales").isNotNull()).groupBy('product')\
.agg((F.count(F.col("Sales")).alias("sales_count"))).show()
# product | sales_count
# phone   | 1
# tv      | 2
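Interestingly, dropping the where() filter and counting the column directly gives the same answer, so count() by itself already seems to skip nulls:
df.groupBy("product") \
    .agg(F.count(F.col("Sales")).alias("sales_count")).show()
# product | sales_count
# phone   | 1
# tv      | 2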
The second one returns the wrong counts:
df.groupBy("product") \
    .agg(F.count(F.col("Sales").isNotNull()).alias("sales_count")).show()
# product | sales_count
# phone   | 1
# tv      | 3
The third one raises a TypeError:
df.groupBy("product") \
    .agg(F.col("Sales").isNotNull().count().alias("sales_count")).show()
TypeError: 'Column' object is not callable
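For what it's worth, a conditional count with F.when() does give the expected result, so I have a workaround; I just want to understand the behavior:
df.groupBy("product") \
    .agg(F.count(F.when(F.col("Sales").isNotNull(), 1)).alias("sales_count")).show()
# product | sales_count
# phone   | 1
# tv      | 2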
What causes the wrong counts in the second method and the TypeError in the third?