I have a big data.frame called "mat" of 49952 obs. of 7597 variables and I'm trying to replace the NAs with zeros. Here is an example of what my data.frame looks like:
A B C E F D Q Z . . .
1 1 1 0 NA NA 0 NA NA
2 0 0 1 NA NA 0 NA NA
3 0 0 0 NA NA 1 NA NA
4 NA NA NA NA NA NA NA NA
5 0 1 0 1 NA 0 NA NA
6 1 1 1 0 NA 0 NA NA
7 0 0 1 0 NA 1 NA NA
.
.
.
I need a really fast tool to replace them. The result should look like:
A B C E F D Q Z . . .
1 1 1 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0
3 0 0 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0
5 0 1 0 1 0 0 0 0
6 1 1 1 0 0 0 0 0
7 0 0 1 0 0 1 0 0
.
.
.
I already tried lapply(mat, function(x){replace(x, is.na(x),0)}) (didn't work), mat[is.na(mat)] <- 0 (gave an error and is maybe too slow), and also link (didn't work either).
@Sotos already advised me to use plyr::rbind.fill(lapply(L, as.data.frame)), but it didn't work, because it makes a data.frame of 379485344 observations and 1 variable (which is 49952x7597), so I would also have to transform it back. Is there any better way to do this?
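One possible reason the lapply() attempt appears not to work is that lapply() returns a plain list, and that list has to be assigned back somewhere. A minimal sketch, using a small made-up data frame `df` (hypothetical, not the real `mat`), that keeps the data.frame shape by assigning into `df[]`:

```r
# Toy stand-in for the real data (hypothetical values)
df <- data.frame(A = c(1, NA, 0), B = c(NA, NA, 1), C = c(0, 1, NA))

# lapply() returns a list; assigning into df[] keeps the data.frame
# class, the dimensions and the column names while replacing each column.
df[] <- lapply(df, function(x) replace(x, is.na(x), 0))

str(df)
```

This only covers the case where every column is numeric, as in the str() output shown in the question.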
The real structure of my data.frame:
> str(mat)
'data.frame': 49952 obs. of 7597 variables:
$ 6794602 : num 1 NA NA NA NA 0 0 0 0 0 ...
$ 1008667 : num NA 1 0 NA NA 0 0 0 0 0 ...
$ 8009082 : num NA 0 1 NA NA NA NA NA NA NA ...
$ 6740421 : num NA NA NA 1 NA 0 0 0 0 0 ...
$ 6777805 : num NA NA NA NA 1 NA NA NA NA NA ...
$ 1001682 : num NA NA NA NA NA 0 0 0 0 0 ...
$ 1001990 : num NA NA NA NA NA 0 0 0 0 0 ...
$ 1002541 : num NA NA NA NA NA 0 0 0 0 0 ...
$ 1002790 : num NA NA NA NA NA 0 0 0 0 0 ...
Note: when I tried mat[is.na(mat)] <- 0 there was a warning:
> mat[is.na(mat)] <- 0
Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
invalid factor level, NA generated
> nlevels(mat)
[1] 0
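A side note on the nlevels(mat) check: nlevels() looks at the levels attribute of the object it is given, so on a whole data.frame it returns 0 whether or not some columns are factors. A rough sketch of a per-column check, on a made-up data frame `df` (hypothetical data, not the real `mat`):

```r
# Toy data frame with one numeric and one factor column (hypothetical)
df <- data.frame(x = c(1, NA, 0),
                 f = factor(c("a", NA, "b")))

# The data frame itself has no levels attribute, so this is 0
# even though column f is a factor.
nlevels(df)

# Check each column separately and list the factor columns by name.
is_fac <- vapply(df, is.factor, logical(1))
names(df)[is_fac]
```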
The data.frame mat after using mat[is.na(mat)] <- 0:
> str(mat)
'data.frame': 49952 obs. of 7597 variables:
$ 6794602 : num 1 0 0 0 0 0 0 0 0 0 ...
$ 1008667 : num 0 1 0 0 0 0 0 0 0 0 ...
$ 8009082 : num 0 0 1 0 0 0 0 0 0 0 ...
$ 6740421 : num 0 0 0 1 0 0 0 0 0 0 ...
$ 6777805 : num 0 0 0 0 1 0 0 0 0 0 ...
$ 1001682 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1001990 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1002541 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1002790 : num 0 0 0 0 0 0 0 0 0 0 ...
So the questions are:
- Is there any other fast way to replace the NA?
- Is the warning a big deal? Because the data after using mat[is.na(mat)] <- 0 looks like what I want, but there are too many values, so I can't check if they are all right.
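On the "can't check if they are all right" part: the result does not have to be inspected by eye. A small sketch, again on a made-up data frame `df` (hypothetical), that counts the NAs left after the replacement, in total and per column:

```r
# Toy stand-in for the real data (hypothetical values)
df <- data.frame(A = c(1, NA, 0), B = c(NA, NA, 1))
df[is.na(df)] <- 0

# Total number of NAs still present after the replacement
total_na <- sum(is.na(df))

# NAs per column; columns with a non-zero count were not fully replaced
na_per_col <- colSums(is.na(df))
na_per_col[na_per_col > 0]
```

If the per-column counts are non-zero for some columns, those would be the ones to look at first, for example to see whether they are factors.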
mat[is.na(mat)] = 0 should be the fastest way, hands down (on dense matrices). If it isn’t, that’s a glaring bug in R … – Sempiternal
View(mat[sapply(mat, is.factor)]) or maybe str instead of View there. – Marybelle
str(mat) and there are no factors. But the warning message simply doesn’t fit that output. – Sempiternal
str(as.data.frame(replicate(7597, 1, simplify=FALSE))) -- first, OP showed us less than they saw; second, even the full displayed output won't show all 7597 columns. Anyway, we cannot say for sure when OP only provides glimpses of their data instead of a good example... – Marybelle
'data.frame': 199235 obs. of 3 variables: $ Invoice_Date: Factor w/ 627 levels $ SKU : Factor w/ 53113 levels $ CustomerID : Factor w/ 55945 levels where I split it into 627 data frames according to Invoice_Date and use droplevels to simplify the computation, and then I made frequency data frames with SKU in columns and CustomerID in rows, and then I use mat <- rbindlist(cop.data1, fill=T) to put it back together (I don't need CustomerID) and I get the data.frame mat – Hawsepipe
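If a factor column (such as a leftover CustomerID) did survive into the combined data, that could explain the "invalid factor level" warning, since 0 is not one of the factor's levels. A hedged sketch of one way around it, restricting the replacement to numeric columns; the data frame `df` and its column names are made up for illustration:

```r
# Toy data: one factor column plus two numeric columns (hypothetical)
df <- data.frame(id = factor(c("a", "b", NA)),
                 x  = c(1, NA, 0),
                 y  = c(NA, NA, 1))

# Select only the numeric columns, so factor columns are never assigned 0.
num <- vapply(df, is.numeric, logical(1))

# Replace NAs with 0 in those columns only; other columns stay untouched.
df[num] <- lapply(df[num], function(x) replace(x, is.na(x), 0))

str(df)
```

Whether the NAs in a factor column should be left alone, dropped, or recoded is a separate decision that depends on what the column means.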