How to convert a factor to integer\numeric without loss of information?
M

14

708

When I convert a factor to a numeric or integer, I get the underlying level codes, not the values as numbers.

f <- factor(sample(runif(5), 20, replace = TRUE))
##  [1] 0.0248644019011408 0.0248644019011408 0.179684827337041 
##  [4] 0.0284090070053935 0.363644931698218  0.363644931698218 
##  [7] 0.179684827337041  0.249704354675487  0.249704354675487 
## [10] 0.0248644019011408 0.249704354675487  0.0284090070053935
## [13] 0.179684827337041  0.0248644019011408 0.179684827337041 
## [16] 0.363644931698218  0.249704354675487  0.363644931698218 
## [19] 0.179684827337041  0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218

as.numeric(f)
##  [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2

as.integer(f)
##  [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2

I have to resort to paste to get the real values:

as.numeric(paste(f))
##  [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
##  [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901

Is there a better way to convert a factor to numeric?

Moslemism answered 5/8, 2010 at 18:53 Comment(3)
The levels of a factor are stored as character data type anyway (attributes(f)), so I don't think there is anything wrong with as.numeric(paste(f)). Perhaps it would be better to think why (in the specific context) you are getting a factor in the first place, and try to stop that. E.g., is the dec argument in read.table set correctly?Explosive
If you use a dataframe you can use convert from hablar. df %>% convert(num(column)). Or if you have a factor vector you can use as_reliable_num(factor_vector)Guaiacol
Thank God for this question. It is SO frustrating to see numbers get transformed into other numbers pretty much at random.Garlen
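The upstream fix suggested in the first comment (checking the dec argument of read.table) can be sketched as follows. This is only an illustration, assuming a semicolon-separated input with decimal commas; the sample text and column names are made up.

```r
# Hypothetical input: semicolon-separated, decimal commas (made-up data).
raw <- "id;value\n1;0,25\n2;1,50\n3;2,75"

# Without dec = ",", the column cannot be parsed as numeric and comes back
# as character (or as a factor when stringsAsFactors = TRUE).
bad <- read.table(text = raw, header = TRUE, sep = ";")
class(bad$value)

# Telling read.table() about the decimal mark yields a numeric column directly.
good <- read.table(text = raw, header = TRUE, sep = ";", dec = ",")
class(good$value)
```

If the column is numeric on import, no factor-to-numeric conversion is needed afterwards.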
P
844

See the Warning section of ?factor:

In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

The FAQ on R has similar advice.


Why is as.numeric(levels(f))[f] more efficient than as.numeric(as.character(f))?

as.numeric(as.character(f)) is effectively as.numeric(levels(f)[f]), so you are performing the conversion to numeric on length(f) values, rather than on nlevels(f) values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
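To make the difference concrete, here is a small step-by-step sketch of the two routes. The factor values are made up and the intermediate variable names are only for illustration.

```r
# Small factor with numeric-looking levels (made-up values).
f <- factor(c("0.5", "10", "0.5", "2.25", "10"))

lv <- levels(f)          # character, one entry per distinct level
codes <- as.integer(f)   # integer codes pointing into lv

# Route 1: convert each distinct level once, then expand by indexing.
num_levels <- as.numeric(lv)
by_levels <- num_levels[codes]   # same as as.numeric(levels(f))[f]

# Route 2: expand to a full-length character vector, then convert every element.
by_elements <- as.numeric(as.character(f))

identical(by_levels, by_elements)
```

Route 1 calls as.numeric() on nlevels(f) strings, route 2 on length(f) strings, which is where the timing difference comes from.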


Some timings

library(microbenchmark)
microbenchmark(
  as.numeric(levels(f))[f],
  as.numeric(levels(f)[f]),
  as.numeric(as.character(f)),
  paste0(x),
  paste(x),
  times = 1e5
)
## Unit: microseconds
##                         expr   min    lq      mean median     uq      max neval
##     as.numeric(levels(f))[f] 3.982 5.120  6.088624  5.405  5.974 1981.418 1e+05
##     as.numeric(levels(f)[f]) 5.973 7.111  8.352032  7.396  8.250 4256.380 1e+05
##  as.numeric(as.character(f)) 6.827 8.249  9.628264  8.534  9.671 1983.694 1e+05
##                    paste0(x) 7.964 9.387 11.026351  9.956 10.810 2911.257 1e+05
##                     paste(x) 7.965 9.387 11.127308  9.956 11.093 2419.458 1e+05
Pander answered 5/8, 2010 at 19:1 Comment(6)
For timings see this answer: #6980125Glossotomy
Many thanks for your solution. Can I ask why the as.numeric(levels(f))[f] is more precise and faster? Thanks.Erskine
@Erskine as.character(f) requires a "primitive lookup" to find the function as.character.factor(), which is effectively levels(f)[f].Faison
When I apply as.numeric(levels(f))[f] or as.numeric(as.character(f)), I get a warning message: "NAs introduced by coercion". Do you know what the problem could be? Thank you!Watters
@Watters did you overcome this issue?Festoon
@Festoon I have the same issue as maycca. I suspect this is from gradual changes in R over time (this answer was posted in 2010), and this answer is now outdatedPoised
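The "NAs introduced by coercion" warning raised in these comments usually means that at least one level is not a parseable number, not that the idiom itself has stopped working. A minimal sketch for finding the offending levels, using made-up data:

```r
# Made-up factor in which two levels are not valid numbers.
f <- factor(c("1.5", "2", "n/a", "3,7", "2"))

lv <- levels(f)
parsed <- suppressWarnings(as.numeric(lv))

# Levels that failed to parse; these are the source of the NAs.
bad_levels <- lv[is.na(parsed)]
bad_levels

# The conversion still works for the remaining levels.
parsed[f]
```

Inspecting bad_levels typically reveals a stray text marker or a decimal comma that should be fixed at import time.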
D
109

R has a number of (undocumented) convenience functions for converting factors:

  • as.character.factor
  • as.data.frame.factor
  • as.Date.factor
  • as.list.factor
  • as.vector.factor
  • ...

But annoyingly, there is nothing to handle the factor -> numeric conversion. As an extension of Joshua Ulrich's answer, I would suggest filling this gap by defining your own idiomatic function:

as.double.factor <- function(x) {as.numeric(levels(x))[x]}

that you can store at the beginning of your script, or even better in your .Rprofile file.
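Given the caution in the comments about S3 methods, one possible variant is a helper whose name does not follow the generic.class pattern, so it cannot be picked up by method dispatch. The name factor_to_numeric is hypothetical and not part of base R.

```r
# Hypothetical helper name; deliberately not of the form generic.class,
# so as.numeric()/as.double() never dispatch to it.
factor_to_numeric <- function(x) {
  stopifnot(is.factor(x))
  as.numeric(levels(x))[x]
}

f <- factor(c("3.5", "1", "3.5", "20"))
factor_to_numeric(f)
as.numeric(f)  # still returns the integer codes, unchanged
```

This keeps the documented behaviour of as.numeric() on factors intact while still giving a short name to the conversion.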

Diabolic answered 27/3, 2014 at 23:39 Comment(7)
There's nothing to handle the factor-to-integer (or numeric) conversion because it's expected that as.integer(factor) returns the underlying integer codes (as shown in the examples section of ?factor). It's probably okay to define this function in your global environment, but you might cause problems if you actually register it as an S3 method.Pander
That's a good point and I agree: a complete redefinition of the factor->numeric conversion is likely to mess up a lot of things. I found myself writing the cumbersome factor->numeric conversion a lot before realizing that it is in fact a shortcoming of R: some convenience function should be available... Calling it as.numeric.factor makes sense to me, but YMMV.Diabolic
If you find yourself doing that a lot, then you should do something upstream to avoid it altogether.Pander
as.numeric.factor returns NA?Dupin
@jO.: in the cases where you used something like v=NA;as.numeric.factor(v) or v='something';as.numeric.factor(v), then it should, otherwise you have a weird thing going on somewhere.Diabolic
as.numeric(as.character.factor(x)) just did the trick for meAfrikah
@rui-barradas comment = as a historical anomaly, R has two types for floating point vectors: numeric and double. According to the documentation, it is better to write code for the double type, thus as.double.factor seems like a more proper name. Link to documentation: stat.ethz.ch/R-manual/R-devel/library/base/html/numeric.html . Thanks @rui-barradas !Diabolic
I
48

Note: this particular answer is not for converting numeric-valued factors to numerics, it is for converting categorical factors to their corresponding level numbers.


Every answer in this post failed for me: NAs were generated.

y2 <- factor(c("A", "B", "C", "D", "A"))
as.numeric(levels(y2))[y2]
# [1] NA NA NA NA NA
# Warning message: NAs introduced by coercion

What worked for me is this -

as.integer(y2)
# [1] 1 2 3 4 1
Instead answered 22/2, 2017 at 18:26 Comment(9)
Are you sure you had a factor? Look at this example.y<-factor(c("5","15","20","2")); unclass(y) %>% as.numeric This returns 4,1,3,2, not 5,15,20,2. This seems like incorrect information.Delirious
Ok, this is similar to what I was trying to do today :- y2<-factor(c("A","B","C","D","A")); as.numeric(levels(y2))[y2] [1] NA NA NA NA NA Warning message: NAs introduced by coercion whereas unclass(y2) %>% as.numeric gave me the results that I needed.Instead
Let me update my scenario in the answer that I had providedInstead
OK, well that's not the question that was asked above. In this question the factor levels are all "numeric". In your case , as.numeric(y) should have worked just fine, no need for the unclass(). But again, that's not what this question was about. This answer isn't appropriate here.Delirious
Well, I really hope it helps someone who was in a hurry like me and read just the title !Instead
@jogo %>% is from the magrittr package.Flannelette
If you have characters representing the integers as factors, this is the one I would recommend. this is the only one that worked for me.Transship
This is the answer so many of us are after and the first hit in Google. I can't find a similar question.Alow
To convert categorical factors into their corresponding numeric levels, there is an existing question and answer here: <https://mcmap.net/q/64636/-converting-factor-levels-to-numbers>.Quarters
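The two conversions discussed in this thread can be put side by side. This small sketch reuses the example values from the comments above; it only illustrates the difference between level codes and the numbers spelled out by the labels.

```r
y <- factor(c("5", "15", "20", "2"))

levels(y)                 # levels are sorted as strings
as.integer(y)             # integer codes: position of each value in levels(y)
as.numeric(levels(y))[y]  # the numbers that the labels spell out

# For a purely categorical factor only the codes are meaningful.
y2 <- factor(c("A", "B", "C", "D", "A"))
as.integer(y2)
```

Which of the two is "right" depends on whether the labels are numbers stored as text or genuine categories.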
S
43

The easiest way would be to use the unfactor function from the varhandle package, which accepts a factor vector or even a data frame:

unfactor(your_factor_variable)

This example can be a quick start:

x <- rep(c("a", "b", "c"), 20)
y <- rep(c(1, 1, 0), 20)

class(x)  # -> "character"
class(y)  # -> "numeric"

x <- factor(x)
y <- factor(y)

class(x)  # -> "factor"
class(y)  # -> "factor"

library(varhandle)
x <- unfactor(x)
y <- unfactor(y)

class(x)  # -> "character"
class(y)  # -> "numeric"

You can also use it on a dataframe. For example the iris dataset:

sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
# load the package
library("varhandle")
# pass the iris to unfactor
tmp_iris <- unfactor(iris)
# check the classes of the columns
sapply(tmp_iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"  "character"
# check if the last column is correctly converted
tmp_iris$Species
  [1] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
  [6] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [11] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [16] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [21] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [26] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
 [31] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [36] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [41] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [46] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"
 [51] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [56] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [61] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [66] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [71] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [76] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [81] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [86] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [91] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
 [96] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
[101] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[106] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[111] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[116] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[121] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[126] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[131] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[136] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[141] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
[146] "virginica"  "virginica"  "virginica"  "virginica"  "virginica"
Sham answered 1/12, 2015 at 14:11 Comment(7)
The unfactor function converts to character data type first and then converts back to numeric. Type unfactor at the console and you can see it in the middle of the function. Therefore it doesn't really give a better solution than what the asker already had.Explosive
Having said that, the levels of a factor are of character type anyway, so nothing is lost by this approach.Explosive
The unfactor function takes care of things that cannot be converted to numeric. Check the examples in help("unfactor")Sham
Error: could not find function "unfactor"Lorimer
@Lorimer I've mentioned that this function is available in varhandle package, meaning you should load the package (library("varhandle")) first (as I mentioned in the first line of my answer!!)Sham
I appreciate that your package probably has some other nice functions too, but installing a new package (and adding an external dependency to your code) isn't as nice or easy as typing as.character(as.numeric()).Judsonjudus
@Gregor Adding a light dependency usually does no harm, and of course, if you are looking for the most efficient way, writing the code yourself might perform faster. But as you can also see in your comment, this is not trivial, since you put as.numeric() and as.character() in the wrong order ;) What your code chunk does is turn the factor's level indices into characters, so what you have at the end is a character vector containing the numbers that were once assigned to the levels of your factor. The functions in that package are there to prevent this kind of confusion.Sham
K
13

This is possible only when the factor labels match the original values. I will explain with an example.

Assume the data is vector x:

x <- c(20, 10, 30, 20, 10, 40, 10, 40)

Now I will create a factor with four labels:

f <- factor(x, levels = c(10, 20, 30, 40), labels = c("A", "B", "C", "D"))

1) x has type double, while f has type integer. This is the first unavoidable loss of information. Factors are always stored as integers.

> typeof(x)
[1] "double"
> typeof(f)
[1] "integer"

2) It is not possible to revert back to the original values (10, 20, 30, 40) having only f available. We can see that f holds only integer values 1, 2, 3, 4 and two attributes - the list of labels ("A", "B", "C", "D") and the class attribute "factor". Nothing more.

> str(f)
 Factor w/ 4 levels "A","B","C","D": 2 1 3 2 1 4 1 4
> attributes(f)
$levels
[1] "A" "B" "C" "D"

$class
[1] "factor"

To revert back to the original values we have to know the values of levels used in creating the factor. In this case c(10, 20, 30, 40). If we know the original levels (in correct order), we can revert back to the original values.

> orig_levels <- c(10, 20, 30, 40)
> x1 <- orig_levels[f]
> all.equal(x, x1)
[1] TRUE

This works only when labels have been defined for all possible values in the original data.

So if you need the original values, you have to keep them. Otherwise there is a high chance that you will not be able to recover them from the factor alone.
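One way to "keep" the original values is a small lookup table stored next to the factor. The sketch below reuses x and f from this answer; the lookup vector is a hypothetical addition for illustration.

```r
x <- c(20, 10, 30, 20, 10, 40, 10, 40)
f <- factor(x, levels = c(10, 20, 30, 40), labels = c("A", "B", "C", "D"))

# Hypothetical lookup table kept alongside the factor: label -> original value.
lookup <- c(A = 10, B = 20, C = 30, D = 40)

# Index by label (as character), not by the factor itself, so the result
# does not depend on the order of the entries in `lookup`.
x_restored <- unname(lookup[as.character(f)])
identical(x_restored, x)
```

Indexing by name rather than by position avoids silent errors if the lookup table is later reordered.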

Kith answered 9/10, 2015 at 12:34 Comment(0)
G
5

You can use hablar::convert if you have a data frame. The syntax is easy:

Sample df

library(hablar)
library(dplyr)

df <- dplyr::tibble(a = as.factor(c("7", "3")),
                    b = as.factor(c("1.5", "6.3")))

Solution

df %>% 
  convert(num(a, b))

gives you:

# A tibble: 2 x 2
      a     b
  <dbl> <dbl>
1    7.  1.50
2    3.  6.30

Or if you want one column to be integer and one numeric:

df %>% 
  convert(int(a),
          num(b))

results in:

# A tibble: 2 x 2
      a     b
  <int> <dbl>
1     7  1.50
2     3  6.30
Guaiacol answered 1/11, 2018 at 10:5 Comment(1)
However, loading another package just for that single operation is not parsimoniousGarlen
Y
5

strtoi() works if your factor levels are integers.
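A short sketch of what this looks like, using made-up values. It relies on strtoi() coercing its input with as.character(), and passes base = 10L explicitly.

```r
f <- factor(c("12", "7", "12", "30"))

# strtoi() coerces its input with as.character(), so the labels are parsed.
strtoi(f, base = 10L)

# Same result, parsing each distinct level only once.
strtoi(levels(f), base = 10L)[f]

# Labels that are not whole numbers are not parsed and come back as NA.
strtoi(c("1.5", "abc"), base = 10L)
```

Note that the result is an integer vector, so this route is not suitable for factors whose levels contain decimals.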

Yasmeen answered 6/5, 2021 at 19:47 Comment(1)
Nice simple solution, as fast as other solutions too.Flannelette
B
4

Late to the game: I accidentally found that trimws() can convert factor(3:5) to c("3","4","5"). Then you can call as.numeric(). That is:

as.numeric(trimws(x_factor_var))
Blouson answered 13/11, 2018 at 2:37 Comment(2)
Is there a reason you would recommend using trimws over as.character as described in the accepted answer? It seems to me like unless you actually had whitespace you needed to remove, trimws is just going to do a bunch of unnecessary regular expression work to return the same result.Delirious
as.numeric(levels(f))[f] might be a bit confusing and hard to remember for beginners. trimws does no harm.Blouson
A
3

type.convert(f) on a factor whose levels are completely numeric is another base option.

Performance-wise it's about equivalent to as.numeric(as.character(f)) but not nearly as quick as as.numeric(levels(f))[f].

identical(type.convert(f), as.numeric(levels(f))[f])

[1] TRUE

That said, if the reason the vector was created as a factor in the first instance has not been addressed (i.e. it likely contained some characters that could not be coerced to numeric) then this approach won't work and it will return a factor.

levels(f)[1] <- "some character level"
identical(type.convert(f), as.numeric(levels(f))[f])

[1] FALSE
Aurangzeb answered 17/6, 2020 at 4:43 Comment(0)
C
1

If you have many factor columns to convert to numeric,

df <- rapply(df, function(x) as.numeric(levels(x))[x], "factor", how =  "replace")

This solution is robust for data.frames containing mixed types, provided all factor levels are numbers.
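When it is not certain that every factor column has numeric levels, a guarded variant can convert only the columns that qualify. This is a base R sketch using lapply rather than rapply; the data frame and the helper name is_numeric_factor are made up for illustration.

```r
# Made-up data frame with mixed column types.
df <- data.frame(
  id    = 1:3,
  price = factor(c("1.5", "20", "1.5")),
  group = factor(c("a", "b", "a")),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# TRUE only for factor columns whose levels all parse as numbers.
is_numeric_factor <- function(x) {
  is.factor(x) && !anyNA(suppressWarnings(as.numeric(levels(x))))
}

convert <- vapply(df, is_numeric_factor, logical(1))
df[convert] <- lapply(df[convert], function(x) as.numeric(levels(x))[x])

sapply(df, class)
```

Here price becomes numeric while group stays a factor, because its levels are not numbers.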

Caron answered 16/10, 2022 at 20:9 Comment(0)
T
1

I found as.numeric(levels(f))[f] difficult to apply across a list of column names using tidyverse syntax. Converting to character first and then to numeric gave me the original numeric values without having to add additional packages. Perhaps not the most performant or elegant solution, but it kept things simple and readable.

library(tidyverse)

tbl_df <- tibble(a = as.factor(c("7", "3")),
                 b = as.factor(c("1.5", "6.3")))

cols <- c("a", "b")

tbl_df %>%
  mutate(across(all_of(cols), as.character)) %>% 
  mutate(across(all_of(cols), as.numeric))
Tailspin answered 13/3, 2023 at 9:38 Comment(0)
C
0

The collapse package includes a wrapper around as.numeric(levels(f))[f] and as.character(levels(f))[f] in as_numeric_factor and as_character_factor.

library(collapse)
set.seed(1)
f <- factor(sample(runif(5), 5, replace = TRUE))

as_numeric_factor(f)
# [1] 0.2016819 0.5728534 0.3721239 0.5728534 0.5728534

as_character_factor(f)
# [1] "0.201681931037456" "0.572853363351896" "0.37212389963679" "0.572853363351896" "0.572853363351896"

Its performance is similar to that of as.numeric(levels(f))[f].

# Unit: milliseconds
#                      expr      min        lq       mean    median        uq      max neval
#  as.numeric(levels(f))[f]   2.6026   3.01305   5.834900   3.54310   8.57450  66.3497   100
#  as.numeric(levels(f)[f]) 317.2509 336.78690 350.215388 349.85620 361.57980 401.1002   100
#      as_numeric_factor(f)   2.5793   2.92970   5.383223   3.23355   4.29355  68.4460   100

Code:

set.seed(1)
f <- factor(sample(runif(5), 1e6, replace = TRUE))
library(microbenchmark)
microbenchmark(
  as.numeric(levels(f))[f],
  as.numeric(levels(f)[f]),
  as_numeric_factor(f),
  times = 100
)
Courtly answered 4/8, 2023 at 10:5 Comment(0)
P
-1

From the many answers I could read, the only approach given was to expand the number of variables according to the number of factor levels. If you have a variable "pet" with levels "dog" and "cat", you would end up with pet_dog and pet_cat.

In my case I wanted to keep the same number of variables by simply translating each factor variable to a numeric one, in a way that can be applied to many variables with many levels, so that cat=1 and dog=0 for instance.

Please find the corresponding solution below:

# stringsAsFactors = TRUE is needed on R >= 4.0.0 for city to become a factor
crime <- data.frame(city = c("SF", "SF", "NYC"),
                    year = c(1990, 2000, 1990),
                    crime = 1:3,
                    stringsAsFactors = TRUE)

indx <- sapply(crime, is.factor)

# Recode each factor column as numbers, in order of first appearance
crime[indx] <- lapply(crime[indx], function(x) {
  listOri <- unique(x)
  res <- factor(x, levels = listOri)
  as.numeric(res)
})
Palmapalmaceous answered 27/11, 2019 at 19:4 Comment(0)
C
-2

It looks like the solution as.numeric(levels(f))[f] no longer works in R 4.0.

Alternative solution:

factor2number <- function(x){
    data.frame(levels(x), 1:length(levels(x)), row.names = 1)[x, 1]
}

factor2number(yourFactor)
Courageous answered 24/5, 2020 at 16:38 Comment(1)
?? On R 4.1, it does work.Garlen
