How to handle null entries in SparkR

Asked 23/7, 2015 at 21:46 Answered 25/7, 2015 at 13:44

Solved r apache-spark sparkr apache-spark-1.4

I have a SparkSQL DataFrame.

Some entries in this data are empty but they don't behave like NULL or NA. How could I remove them? Any ideas?

In R I can easily remove them but in sparkR it say that there is a problem with the S4 system/methods.

Thanks.

Fetterlock answered 23/7, 2015 at 21:46 Comment(0)

SparkR Column provides a long list of useful methods including isNull and isNotNull:

> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)

  Id Age
1  1  21
2  2  18
3  3  NA

> filter(people, isNotNull(people$Age)) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30

> filter(people, isNull(people$Age)) %>% head()
  Id Age
1  4  NA

Please keep in mind that there is no distinction between NA and NaN in SparkR.

If you prefer operations on a whole data frame there is a set of NA functions including fillna and dropna:

> fillna(people, 99) %>% head()
 Id Age
1  1  21
2  2  18
3  3  30
4  4  99

> dropna(people) %>% head()
 Id Age
1  1  21
2  2  18
3  3  30

Both can be adjusted to consider only some subset of columns (cols), and dropna has some additional useful parameters. For example you can specify minimal number of not null columns:

> people_with_names_local <- data.frame(
    Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
4  4  NA  <NA>

> dropna(people_with_names, minNonNulls=2) %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob

Fiji answered 25/7, 2015 at 13:44 Comment(0)

It is not the nicest workaround, but if you cast them as strings, they are stored as "NaN" and then you can filter them, a short example:

testFrame   <- createDataFrame(sqlContext, data.frame(a=c(1,2,3),b=c(1,NA,3)))
testFrame$c <- cast(testFrame$b,"string")

resultFrame <- collect(filter(testFrame, testFrame$c!="NaN"))
resultFrame$c <- NULL

This omits the entire row where the element in column b is missing.

Mcgannon answered 24/7, 2015 at 6:13 Comment(4)

It would require separate code for different types and will fail completely with booleans: df <- createDataFrame(sqlContext, data.frame(x=c(FALSE, NA, TRUE))); df$chx <- cast(df$x,"string"); df %>% head() – Fiji 25/7, 2015 at 14:56

Well the problem is that the entry is empty. Let me give an example: ID = 1 2 3. AGE = 21 EMPTY 20. So enry 2th in this data is empty but in sparkR it does not have the 'value' NAN, NA or NULL. The idea is maybe to give this empty entry a value? – Fetterlock 26/7, 2015 at 11:34

Could you give an initialization example? Just to understand if you have data.frame(ID=c(1,2,3),AGE=c(21,"EMPTY",20)) or data.frame(ID=c(1,2,3),AGE=c(21,,20)) or still something else. – Mcgannon 27/7, 2015 at 6:40

I have a dataset from a csv-file. I call it 'pgz'. When I type pgz in sparkR I get "DataFrame[id:string, age:string]" as output. Lets look at first data. ID = 1 2 3. Lets say AGE for ID=1 is 21 and AGE for ID=3 is 20. AGE for ID=2 is empty. This gives me a problem in sparkR when I will run a function on the data. Example: Take the AGE for each ID and add 1 to it. This works fine for ID=1 and ID=3 but it crashes for ID=2 because this ID is empty. Therefore I wish to delete this ID. – Fetterlock 27/7, 2015 at 7:54

Recommended topics

Hot tags