Quick replace of NA - an error or warning
Asked Answered
C

3

5

I have a big data.frame called "mat" of 49952 obs. of 7597 variables and I'm trying to replace NAs with zeros. Here is and example how my data.frame looks like:

    A   B   C   E   F   D   Q   Z   . . .
1   1   1   0   NA  NA  0   NA  NA
2   0   0   1   NA  NA  0   NA  NA
3   0   0   0   NA  NA  1   NA  NA
4   NA  NA  NA  NA  NA  NA  NA  NA
5   0   1   0   1   NA  0   NA  NA 
6   1   1   1   0   NA  0   NA  NA
7   0   0   1   0   NA  1   NA  NA 
.
.
.

I need realy fast tool to replace them. The result should look like:

    A   B   C   E   F   D   Q   Z   . . .
1   1   1   0   0   0   0   0   0
2   0   0   1   0   0   0   0   0 
3   0   0   0   0   0   1   0   0
4   0   0   0   0   0   0   0   0
5   0   1   0   1   0   0   0   0 
6   1   1   1   0   0   0   0   0
7   0   0   1   0   0   1   0   0 
.
.
.

I already tried lapply(mat, function(x){replace(x, is.na(x),0)}) - didn't work - mat[is.na(mat)] <- 0 - error and and maybe too slow - and also link - didn't work too.

@Sotos already advised me plyr::rbind.fill(lapply(L, as.data.frame)) but it didn't work, because it makes data.frame of 379485344 observations and 1 variable (which is 49952x7597) so I have to also trafnsform it back. Is there any better way to do this?

The real structure of my data.frame:

> str(mat)
'data.frame':   49952 obs. of  7597 variables:
 $ 6794602   : num  1 NA NA NA NA 0 0 0 0 0 ...
 $ 1008667   : num  NA 1 0 NA NA 0 0 0 0 0 ...
 $ 8009082   : num  NA 0 1 NA NA NA NA NA NA NA ...
 $ 6740421   : num  NA NA NA 1 NA 0 0 0 0 0 ...
 $ 6777805   : num  NA NA NA NA 1 NA NA NA NA NA ...
 $ 1001682   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1001990   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1002541   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1002790   : num  NA NA NA NA NA 0 0 0 0 0 ...

Note:

when I tried mat[is.na(mat)] <- 0 there was a warning:

> mat[is.na(mat)] <- 0
Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
  invalid factor level, NA generated
> nlevels(mat)
[1] 0

Data.frame mat after using mat[is.na(mat)] <- 0:

> str(mat)
'data.frame':   49952 obs. of  7597 variables:
 $ 6794602   : num  1 0 0 0 0 0 0 0 0 0 ...
 $ 1008667   : num  0 1 0 0 0 0 0 0 0 0 ...
 $ 8009082   : num  0 0 1 0 0 0 0 0 0 0 ...
 $ 6740421   : num  0 0 0 1 0 0 0 0 0 0 ...
 $ 6777805   : num  0 0 0 0 1 0 0 0 0 0 ...
 $ 1001682   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1001990   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002541   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002790   : num  0 0 0 0 0 0 0 0 0 0 ...

So the questions are:

  1. Is there any other fast way to replace the NA?
  2. Is the warning big deal? Because data after using mat[is.na(mat)] <- 0 looks like what I want, but there are too many values, so I can't check if they are all right.
Craving answered 8/8, 2017 at 17:6 Comment(6)
mat[is.na(mat)] = 0 should be the fastest way, hands down (on dense matrices). If it isn’t, that’s a glaring bug in R …Sempiternal
That's a warning not an error, and it explains well what is happening, right? If you're surprised that the data has factors, maybe try View(mat[sapply(mat, is.factor)]) or maybe str instead of View there.Marybelle
@Marybelle Well the question contains the output of str(mat) and there are no factors. But the warning message simply doesn’t fit that output.Sempiternal
@Konrad OP truncated the str output. Try str(as.data.frame(replicate(7597, 1, simplify=FALSE))) -- first, OP showed us less than they saw; second, even the full displayed output won't show all 7597 columns. Anyway, we cannot say for sure when OP only provides glimpses of their data instead of a good example...Marybelle
@Marybelle @KonradRudolph I think there shouldn't be any factor I make this data.frame from: 'data.frame': 199235 obs. of 3 variables: $ Invoice_Date: Factor w/ 627 levels $ SKU : Factor w/ 53113 levels $ CustomerID : Factor w/ 55945 levels where I split it into 627 data frames accoring to Invoice_Date and use droplevels to simpler computation and then I made frequency data frames of SKU in columns and CustomerID in rows and then I use mat <- rbindlist(cop.data1, fill=T) to put it back together (I don't need CusotmerID) and I get the data.frame matHawsepipe
Edit: I found out two factors! But it was like what, why?!Hawsepipe
E
11

Try the following:

mat %>% replace(is.na(.), 0)
Ester answered 8/8, 2017 at 17:42 Comment(2)
It takes longer than mat[is.na(mat)] <- 0 but maybe I will let it run overnight to find out if there will be also the warning. Edit: it takes longer and again - warning. So as I wrote above, I think I will ignore the warning.Hawsepipe
Can you check the following issue to understand that warning message? Might help resolve. https://mcmap.net/q/56636/-warning-message-in-invalid-factor-level-na-generatedEster
N
2

If suspect that some of your columns are factor, you can use the following code to detect and change them to numeric.

inx <- sapply(mat, inherits, "factor")
mat[inx] <- lapply(mat[inx], function(x) as.numeric(as.character(x)))

Then try the following.

mat[] <- lapply(mat, function(x) {x[is.na(x)] <- 0; x})
mat

And here's the data.

mat <-
structure(list(A = c(1L, 0L, 0L, NA, 0L, 1L, 0L), B = c(1L, 0L, 
0L, NA, 1L, 1L, 0L), C = c(0L, 1L, 0L, NA, 0L, 1L, 1L), E = c(NA, 
NA, NA, NA, 1L, 0L, 0L), F = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), D = c(0L, 0L, 1L, NA, 
0L, 0L, 1L), Q = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), Z = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), .Names = c("A", "B", "C", "E", 
"F", "D", "Q", "Z"), row.names = c("1", "2", "3", "4", "5", "6", 
"7"), class = "data.frame")
Ninetta answered 8/8, 2017 at 17:28 Comment(4)
again warning messages: Warning messages: 1: In [<-.factor(*tmp*, is.na(x), value = 0) : invalid factor level, NA generated 2: In [<-.factor(*tmp*, is.na(x), value = 0) : invalid factor level, NA generated maybe I will just ignore warning, it's not the error as in other cases.Hawsepipe
@MartinaZapletalová If your output of str(mat) is correct that shouldn't happen, since all columns are of class numeric. Are you sure that all of those 7597 variables are numeric? If two of them are factors you may need to convert them to numeric first.Ninetta
so I tried this: a <- 0 for (i in 1:ncol(mat)){ if (class(mat[[i]]) == "numeric"){ a <- a+1 } else { a <- a+1 print(i) } } and found out that there is problem in mat[[2260]] and mat[[2261]] so I looked at it and you were right they are factor but I don't understand how it could happend.Hawsepipe
@MartinaZapletalová OK, I'll edit my answer with code for you to get rid of the factors in an automated way.Ninetta
F
0

See my detailed answer here.

#install.packages("xlsx")
library(xlsx)
extracted_df <- read.xlsx("test.xlsx", sheetName='Sheet1', stringsAsFactors=FALSE)
# Replace all NAs in a data frame with "G" character
extracted_df[is.na(extracted_df)] <- "G"
Flawy answered 11/6, 2020 at 15:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.