PySpark - Resolving isnan errors with TimeStamp datatype
I'm trying to create a function to check the quality of data (NaNs, nulls, etc.). I have the following code running on a PySpark DataFrame:

from pyspark.sql import functions as f

df.select([f.count(f.when((f.isnan(c) | f.col(c).isNull()), c)).alias(c) for c in cols_check]).show()

As long as the columns to check are strings or integers, I have no issue. However, when I check columns with a date or timestamp datatype, I receive the following error:

cannot resolve 'isnan(Date_Time)' due to data type mismatch: argument 1 requires (double or float) type, however, 'Date_Time' is of timestamp type.;;\n'Aggregate...

There are clearly null values in the column; how can I remedy this?
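For reference, here is a minimal sketch that reproduces the error (the session setup and sample data are made up for illustration):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with a nullable timestamp column
df = spark.createDataFrame(
    [("2021-12-23 07:07:00",), (None,)], ["Date_Time"]
).withColumn("Date_Time", f.col("Date_Time").cast("timestamp"))

cols_check = ["Date_Time"]

# Fails at analysis time: isnan() requires a double or float argument,
# and Spark will not implicitly cast a timestamp to double
df.select([f.count(f.when((f.isnan(c) | f.col(c).isNull()), c)).alias(c) for c in cols_check]).show()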

Shoeshine answered 23/12, 2021 at 7:7

You can use df.dtypes to check the type of each column, and handle the null count differently for timestamp and date columns, like this:

from pyspark.sql import functions as F

df.select(*[
    (
        # Non-temporal columns: count both NaNs and nulls
        F.count(F.when((F.isnan(c) | F.col(c).isNull()), c)) if t not in ("timestamp", "date")
        # Timestamp and date columns cannot hold NaN, so count only nulls
        else F.count(F.when(F.col(c).isNull(), c))
    ).alias(c)
    for c, t in df.dtypes if c in cols_check
]).show()
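As a quick sanity check, here is a sketch on a small, made-up DataFrame (the data and column names are illustrative). Running the snippet above against it counts the NaN in amount and the null in Date_Time:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, "2021-12-23 07:07:00"), (float("nan"), None)],
    ["amount", "Date_Time"],
).withColumn("Date_Time", F.col("Date_Time").cast("timestamp"))

cols_check = ["amount", "Date_Time"]

# amount is a double, so it takes the isnan | isNull branch;
# Date_Time is a timestamp, so it takes the isNull-only branch.
# Expected output:
# +------+---------+
# |amount|Date_Time|
# +------+---------+
# |     1|        1|
# +------+---------+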
Furring answered 23/12, 2021 at 14:29
Comment from Malta: # List of columns to check: cols_check = ["VendorID", "passenger_count"]
