Differences between null and NaN in Spark? How to deal with them?

In my DataFrame, there are columns containing null and NaN values respectively, for example:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Is there any difference between them? How can each be dealt with?

Multivibrator answered 10/5, 2017 at 2:33 Comment(0)

null represents "no value" or "nothing"; it is not even an empty string or zero. It can be used to represent that nothing useful exists.

NaN stands for "Not a Number"; it is usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.

One possible way to handle null values is to remove them with:

df.na.drop()

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0)
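
For example, applied to the DataFrame from the question (a sketch, assuming the same spark session; the Scala API docs describe na.drop and na.fill as treating NaN in numeric columns the same way as null, so both rows are affected):

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))

df.na.drop().show()   # drops rows containing null (and NaN in numeric columns)
df.na.fill(0).show()  # replaces null (and NaN in numeric columns) with 0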

Another way would be to select the rows where a specific column is null for further processing:

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can be selected in the same way, using the isnan function:

from pyspark.sql.functions import col, isnan
df.where(isnan(col("a")))
Homogony answered 10/5, 2017 at 3:13 Comment(7)
I got the full answers from @Shadio. Thanks! – Multivibrator
That is an example of a perfect answer, just nice :) – Rollback
Thanks, nice to be of help. :) – Homogony
But when I calculate 1.0/0.0, I get null instead of NaN. Why? – Storms
@PanChao: What happens when dividing by zero can differ between languages and types. For example, in Scala 1.0/0.0 would be infinity: #43938918. 0.0/0.0 would most probably give you a NaN. – Homogony
@Homogony Thanks for the clarification. I am using PySpark. – Storms
.filter((col("colum_x").isNotNull()).and(col("colum_x").notEqual("null"))) – Chap

You can detect your NaN values using the function isnan, as in this example:

>>> from pyspark.sql.functions import isnan
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(isnan("a").alias("r1"), isnan(df.a).alias("r2")).collect()
[Row(r1=False, r2=False), Row(r1=True, r2=True)]

The difference is in the type of the object that generates the value. NaN (Not a Number) is an old-fashioned way to deal with "no value" for a number: you can think of having all the numbers (..., -2, -1, 0, 1, 2, ...) plus the need for an extra value for error cases (for example 1/0). You want 1/0 to give you a number, but which one? Since there is no number for 1/0, a new value called NaN was created, and it is still of type Number.

None is used for the void, the absence of an element; it is even more abstract, because inside the number type you have, besides the NaN value, the None value. The None value is present in the sets of values of all types.
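
A minimal plain-Python sketch of this distinction (not Spark-specific): NaN is still a value of a numeric type, while None is the absence of a value.

import math

x = float('nan')
print(type(x))        # <class 'float'> -- NaN is a value of a numeric type
print(x == x)         # False -- NaN does not equal anything, not even itself
print(math.isnan(x))  # True

y = None
print(type(y))        # <class 'NoneType'> -- None is not a number at all
print(y is None)      # True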

Rollback answered 10/5, 2017 at 2:39 Comment(2)
Thanks. Could you point out the difference between the two types null and NaN in Spark? I am still confused about why Spark has those two types to represent nothing. – Multivibrator
I hope this clears it up; your question was very interesting, because it is not simple to understand and deal with these kinds of concepts. Nice :) – Rollback

You can deal with it using this code, which operates on a pandas DataFrame (for example, before loading the data into Spark):

import pandas

df = df.where(pandas.notnull(df), None)

The code will convert any NaN value into None, which becomes null once the data is in Spark.

Below is the reference link

Link
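
A minimal sketch of this idea, assuming you start from a pandas DataFrame and have an active SparkSession named spark (the DataFrame contents here are illustrative):

import pandas as pd

pdf = pd.DataFrame({"a": [1.0, float('nan')], "b": [float('nan'), 2.0]})

# Replace NaN with None (depending on the pandas version you may need to cast
# to object first, e.g. pdf.astype(object).where(pd.notnull(pdf), None))
pdf = pdf.where(pd.notnull(pdf), None)

# When converted to a Spark DataFrame, the None entries show up as null
spark.createDataFrame(pdf).show()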

Spancake answered 1/6, 2020 at 4:23 Comment(0)

I have a different idea: maybe you can change the NaN or null values into another value, like this:

from pyspark.sql.functions import col, isnan, when
xxDf.withColumn("xxColumn", when(col("xxColumn").isNull(), "xxx").when(isnan(col("xxColumn")), "xxx").otherwise(col("xxColumn")))
Liberia answered 26/8, 2022 at 10:11 Comment(0)
