Converting Factor Levels to Numbers

I apologize if there is an answer out there already for this... I looked but could not find one.

I am trying to convert a matrix of factors into a matrix of numbers that corresponds to each of the factor values for the column. Simple, right? Yet I have run into a variety of very odd problems when I try to do this.

Let me explain. Here is a sample dataset:

demodata2 <- matrix(c("A","B","B","C",NA,"A","B","B",NA,"C","A","B",NA,"B",NA,"C","A","B",NA,NA,NA,"B","C","A","B","B",NA,"B","B",NA,"B","B",NA,"C","A",NA), nrow=6, ncol=6)
democolnames <- c("Q","R","S","T","U","W")
colnames(demodata2) <- democolnames

Yielding:

     Q   R   S   T   U   W  
[1,] "A" "B" NA  NA  "B" "B"
[2,] "B" "B" "B" NA  "B" "B"
[3,] "B" NA  NA  NA  NA  NA 
[4,] "C" "C" "C" "B" "B" "C"
[5,] NA  "A" "A" "C" "B" "A"
[6,] "A" "B" "B" "A" NA  NA 

Ok. So what I want is this:

     Q    R    S    T    U    W
1    1    2 <NA> <NA>    1    2
2    2    2    2 <NA>    1    2
3    2 <NA> <NA> <NA> <NA> <NA>
4    3    3    3    2    1    3
5 <NA>    1    1    3    1    1
6    1    2    2    1 <NA> <NA>

No problem. Let's try as.numeric(demodata2)

> as.numeric(demodata2)
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [30] NA NA NA NA NA NA NA
 Warning message:
 NAs introduced by coercion 

Less than satisfying. Let's try only one column...

> as.numeric(demodata2[,3])
[1] NA NA NA NA NA NA
Warning message:
NAs introduced by coercion 

* edit *

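These are actually supposed to be factors, not characters (thanks @Carl Witthoft and @smci)... so let's make this into a data frame. (In R 4.0.0 and later, stringsAsFactors = TRUE has to be passed explicitly to get factor columns.)

> demodata2 <- as.data.frame(demodata2, stringsAsFactors = TRUE)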
> as.numeric(demodata2)
Error: (list) object cannot be coerced to type 'double'

Nope. But wait... here's where it gets interesting...

> as.numeric(demodata2$S)
[1] NA  2 NA  3  1  2

Well, that is right. Let's validate I can do this calling columns by number:

> as.numeric(demodata2[,3])
[1] NA  2 NA  3  1  2

Ok. So I can do this column by column assembling my new matrix by iterating through ncol times... but is there a better way?

And why does it barf in matrix form, but not as a data frame? (Edit: this is now pretty obvious. In matrix form these are characters, not factors. My bad.) The question still stands for the data frame, though...

Thanks! (and pointing me to an existing answer is totally fine)
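
The character-versus-factor distinction behind the NA results can be seen in a minimal sketch. The vector names below (x_chr, x_fac) are made up for illustration and are not part of the original data:

```r
# Character vs. factor: as.numeric() behaves differently on each.
x_chr <- c("A", "B", NA, "C")
x_fac <- factor(x_chr)

# Non-numeric strings cannot be parsed as numbers, so every element becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
chr_codes <- suppressWarnings(as.numeric(x_chr))

# A factor stores integer codes that index into levels(x_fac);
# as.numeric() returns those codes.
fac_codes <- as.numeric(x_fac)

print(chr_codes)      # NA NA NA NA
print(levels(x_fac))  # "A" "B" "C"
print(fac_codes)      # 1 2 NA 3
```

The codes only carry meaning together with the levels of that particular factor, which is why the same label can map to different numbers in different columns.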

Holoblastic answered 23/12, 2014 at 21:2 Comment(3)
Your example is not factors. Be careful about your nomenclature. - Brushwork
Your example is a matrix of strings, not factors. Strings don't have any factor levels, etc. - Sportswoman
My apologies. This question got started with an imported dataset, where strings are automatically assumed to be factors (unless specified otherwise). The error occurred when I tried to recreate it for stackoverflow usage. - Holoblastic
A
7

It seems like your U column should be 2 corresponding to "B", not 1. Please clarify that.

You could try match()

matrix(match(demodata2, LETTERS), nrow(demodata2), dimnames=dimnames(demodata2))
#       Q  R  S  T  U  W
# [1,]  1  2 NA NA  2  2
# [2,]  2  2  2 NA  2  2
# [3,]  2 NA NA NA NA NA
# [4,]  3  3  3  2  2  3
# [5,] NA  1  1  3  2  1
# [6,]  1  2  2  1 NA NA

You could also get this result with

m <- match(demodata2, LETTERS)
attributes(m) <- attributes(demodata2)

And then look at m
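
If the labels are not single capital letters, LETTERS will not work as the lookup table. A possible generalization, sketched here on made-up data (the matrix x and the labels "low", "mid", "high" are assumptions for illustration), builds the table from the values themselves:

```r
# Small character matrix with labels that are not single capital letters.
x <- matrix(c("low", "high", NA, "mid", "low", "high"), nrow = 3,
            dimnames = list(NULL, c("Q", "R")))

# Lookup table: sorted unique labels. sort() drops NA by default.
lev <- sort(unique(as.vector(x)))

# match() returns each element's position in the table (NA stays NA);
# copying the attributes restores the dimensions and column names.
m <- match(x, lev)
attributes(m) <- attributes(x)

print(lev)
print(m)
```

Because the table is shared by the whole matrix, a given label gets the same number in every column.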


Update for the revised data set:

For your updated data, where every column is a factor, try

demodata2[] <- lapply(demodata2, as.numeric) 
demodata2
#    Q  R  S  T  U  W
# 1  1  2 NA NA  1  2
# 2  2  2  2 NA  1  2
# 3  2 NA NA NA NA NA
# 4  3  3  3  2  1  3
# 5 NA  1  1  3  1  1
# 6  1  2  2  1 NA NA

Now you have the 1's in the U column because each column is factored individually, so B is the first (and only) level in that column.
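
If codes that are consistent across columns are wanted instead, one option is to give every column the same level set before extracting the codes. This is only a sketch on a toy data frame (df, per_column, shared and shared_levels are names invented here):

```r
# Toy data frame mirroring columns Q and U from the question.
df <- data.frame(Q = c("A", "B", "B", "C", NA, "A"),
                 U = c("B", "B", NA, "B", "B", NA),
                 stringsAsFactors = TRUE)

# Per-column codes: U contains only "B", so "B" is level 1 in that column.
per_column <- df
per_column[] <- lapply(df, as.integer)

# Shared codes: collect the labels from all columns, then re-factor each
# column against that common level set before taking the codes.
shared_levels <- sort(unique(unlist(lapply(df, as.character))))
shared <- df
shared[] <- lapply(df, function(col) as.integer(factor(col, levels = shared_levels)))

print(per_column$U)  # 1 1 NA 1 1 NA
print(shared$U)      # 2 2 NA 2 2 NA
```

Which of the two is right depends on whether the columns share one set of categories or each have their own.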

Arella answered 23/12, 2014 at 21:10 Comment(4)
Wonderful answer for the question I asked... but apparently I asked the wrong question. Make demodata2 into a data frame first (which automatically puts the character fields into factors) and then you have the question I meant to ask. Thank you very much, and I hope you can help with this additional challenge. - Holoblastic
@Holoblastic - it's even more simple for your updated data. Do demodata2[] <- lapply(demodata2, as.numeric). Now you have the 1's in the U column because each column is factored individually and hence B is the first (and only) value. - Arella
Thank you so much! Simple? Perhaps. But I had been going around and around on this one, so your help is greatly appreciated. - Holoblastic
Well, more simple in terms of the code is what I meant :-) - Arella
W
5

Mechanically, this is very similar to the 'dim<-' answer. A little more transparent, but probably less efficient (maybe?).

matrix(as.numeric(factor(demodata2)), ncol = ncol(demodata2))

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2   NA   NA    2    2
[2,]    2    2    2   NA    2    2
[3,]    2   NA   NA   NA   NA   NA
[4,]    3    3    3    2    2    3
[5,]   NA    1    1    3    2    1
[6,]    1    2    2    1   NA   NA
Worse answered 23/12, 2014 at 21:24 Comment(3)
Whoops, thanks. Turns out the as.vector() is also unnecessary. - Worse
My guess is that it would be more efficient than 'dim<-' simply because it skips the nrow part. I just wanted to be a bit slick with it :) - Biparty
@Gregor: Wonderful answer for the question I asked... but apparently I asked the wrong question. Make demodata2 into a data frame first (which automatically puts the character fields into factors) and then you have the question I meant to ask. Thank you very much, and I hope you can help with this additional challenge. - Holoblastic
B
3

Or using dim<-

`dim<-`(as.numeric(factor(demodata2)), c(nrow(demodata2), ncol(demodata2)))
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    2   NA   NA    2    2
# [2,]    2    2    2   NA    2    2
# [3,]    2   NA   NA   NA   NA   NA
# [4,]    3    3    3    2    2    3
# [5,]   NA    1    1    3    2    1
# [6,]    1    2    2    1   NA   NA

If you need the column names, you'll have to do this in two steps, as in

Res <- `dim<-`(as.numeric(factor(demodata2)), c(nrow(demodata2), ncol(demodata2)))
colnames(Res) <- colnames(demodata2)
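
For comparison, here is a sketch of a single-call variant that keeps the column names, relying on the dimnames argument of matrix(). The small matrix x is made up for illustration:

```r
# Character matrix with column names, as in the question (smaller for brevity).
x <- matrix(c("A", "B", NA, "C", "B", "A"), nrow = 3,
            dimnames = list(NULL, c("Q", "R")))

# factor(x) pools all cells into one factor, so the levels are shared.
# matrix() takes a dimnames argument, so shape and names are set together.
res <- matrix(as.numeric(factor(x)), nrow = nrow(x), dimnames = dimnames(x))

print(res)
```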
Biparty answered 23/12, 2014 at 21:20 Comment(2)
Another way to rewrite your line: matrix(as.numeric(factor(demodata2)),ncol=ncol(demodata2)) - Afterdamp
@David Arenburg: Wonderful answer for the question I asked... but apparently I asked the wrong question. Make demodata2 into a data frame first (which automatically puts the character fields into factors) and then you have the question I meant to ask. Thank you very much, and I hope you can help with this additional challenge. - Holoblastic
T
2
apply(demodata2, 2, function(x)
  as.numeric(factor(x, levels = unique(as.vector(demodata2)))))
#---------------
      Q  R  S  T  U  W
[1,]  1  2 NA NA  2  2
[2,]  2  2  2 NA  2  2
[3,]  2 NA NA NA NA NA
[4,]  3  3  3  2  2  3
[5,] NA  1  1  3  2  1
[6,]  1  2  2  1 NA NA

(I discovered via getting the wrong answer that unique on a matrix doesn't return what I expected.)
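
The behaviour mentioned in that remark can be reproduced with a short sketch (the matrix x below is invented for illustration): on a matrix, unique() removes duplicated rows rather than returning the distinct cell values.

```r
# Rows are ("A","B"), ("B","C"), ("A","B"): the third row repeats the first.
x <- matrix(c("A", "B", "A", "B", "C", "B"), nrow = 3,
            dimnames = list(NULL, c("Q", "R")))

# unique() on a matrix dispatches to unique.matrix(), which drops duplicated rows.
unique_rows <- unique(x)

# Converting to a plain vector first gives the distinct values instead.
unique_values <- unique(as.vector(x))

print(unique_rows)    # a 2 x 2 matrix with rows ("A","B") and ("B","C")
print(unique_values)  # "A" "B" "C"
```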

Trevortrevorr answered 24/12, 2014 at 1:28 Comment(1)
Wonderful answer for the question I asked... but apparently I asked the wrong question. Make demodata2 into a data frame first (which automatically puts the character fields into factors) and then you have the question I meant to ask. Thank you very much, and I hope you can help with this additional challenge. - Holoblastic
A
0

Once demodata2 is a dataframe, there are two steps:

Step 1: Convert your characters into factors:

demodata2[sapply(demodata2, is.character)] <- lapply(demodata2[sapply(demodata2, is.character)], as.factor)

Step 2: Convert your factors into numeric using as.integer:

demodata2[sapply(demodata2, is.factor)] <- lapply(demodata2[sapply(demodata2, is.factor)], as.integer)

Result:

> demodata2
   Q  R  S  T  U  W
1  1  2 NA NA  1  2
2  2  2  2 NA  1  2
3  2 NA NA NA NA NA
4  3  3  3  2  1  3
5 NA  1  1  3  1  1
6  1  2  2  1 NA NA

This selects all your preferred columns at once like you wanted, rather than selecting a single column at a time. And this factors each column individually so you don't get a blend of factor levels across columns.
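
The two steps could also be wrapped in a small helper. The function name factor_codes and the demo data frame are hypothetical, introduced here only as a sketch:

```r
# Hypothetical helper (not part of base R): replace character and factor
# columns with per-column integer codes, leaving other columns untouched.
factor_codes <- function(df) {
  is_label <- vapply(df, function(col) is.character(col) || is.factor(col),
                     logical(1))
  df[is_label] <- lapply(df[is_label], function(col) as.integer(as.factor(col)))
  df
}

# Demo data: two label columns and one numeric column that should stay as it is.
demo <- data.frame(Q = c("A", "B", NA, "C"),
                   U = c("B", "B", NA, "B"),
                   n = c(10.5, 20, 30, 40),
                   stringsAsFactors = FALSE)

coded <- factor_codes(demo)
print(coded)
```

Using as.integer rather than as.numeric keeps the codes as integers, which matches how factors store them internally.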

Admirable answered 19/1, 2023 at 19:55 Comment(0)
