Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level

Asked 19/2, 2011 at 3:23 Answered 20/4, 2022 at 20:27

118

I have an R data frame containing a factor that I want to "expand" so that for each factor level, there is an associated column in a new data frame, which contains a 1/0 indicator. E.g., suppose I have:

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))

I want:

df.desired  <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))

Certain analyses, such as principal component analysis, need a completely numeric data frame, so I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names, and if something exists already, I'd rather use that.

Pragmatics answered 19/2, 2011 at 3:23 Comment(0)

139

Use the model.matrix function:

model.matrix( ~ Species - 1, data=iris )
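To see what the "- 1" changes, here is a minimal sketch that assumes only base R and the built-in iris data. It compares the column names produced with and without the intercept:

```r
# With "- 1": no intercept, one 0/1 column per level of Species
full <- model.matrix(~ Species - 1, data = iris)
colnames(full)

# Without "- 1": an intercept column, plus indicators for all but the first level
contr <- model.matrix(~ Species, data = iris)
colnames(contr)
```

In the first matrix every row contains exactly one 1, which is the indicator layout the question asks for.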

Cattan answered 19/2, 2011 at 3:50 Comment(11)

Can I just add that this method was so much faster than using cast for me. – Faruq 8/12, 2013 at 15:3

@RyanChase, in the 14 hours between you writing your comment and me noticing it to respond you could have looked at the help page ?formula and found the answer in the 2nd paragraph of the Details section. Or you could have tried the code with and without the "-1" and compared the output to see the effects. But I guess you are more patient than I am. The "-1" specifies to not fit an intercept (there are other ways as well) and therefore to create an indicator variable for each level rather than differences based on contrasts. – Cattan 26/9, 2015 at 19:52

@GregSnow I reviewed the 2nd paragraph of ?formula as well as ?model.matrix, but it was unclear (could just be my lack of depth of knowledge in matrix algebra and model formulation). After digging more, I've been able to gather that the -1 is just specifying not to include the "intercept" column. If you leave out the -1, you'll see an intercept column of 1's in the output with one binary column left out. You can tell which rows belong to the omitted level because all the other indicator columns are 0 there. The documentation seems cryptic. Is there another good resource? – Machicolation 5/10, 2015 at 22:25

@RyanChase, there are many online tutorials and books about R/S (several that have brief descriptions on the r-project.org webpage). My own learning of S and R has been rather eclectic (and long), so I am not the best to give an opinion on how current books/tutorials appeal to beginners. I am, however, a fan of experimentation. Trying something out in a fresh R session can be very enlightening and not dangerous (the worst that has happened to me is crashing R, and that rarely, which led to improvements in R). Stackoverflow is then a good resource for understanding what happened. – Cattan 6/10, 2015 at 16:15

And if you want to convert all factor columns, you can use: model.matrix(~., data=iris)[,-1] – Origan 5/1, 2016 at 0:32
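Note that the one-liner in the comment above drops the intercept column afterwards, so each factor still loses the indicator for its first level. If a column for every level of every factor is wanted, one possible sketch passes full (non-contrast) coding through the contrasts.arg argument of model.matrix. The toy data frame d and the helper names are illustrative assumptions:

```r
# Toy data with two factor columns and one numeric column (illustrative only)
d <- data.frame(eggs = factor(c("foo", "foo", "bar", "bar")),
                ham  = factor(c("red", "blue", "green", "red")),
                x    = c(1, 2, 3, 4))

is_fac <- sapply(d, is.factor)
# contrasts = FALSE requests a full set of indicator columns for each factor
full_contrasts <- lapply(d[is_fac], contrasts, contrasts = FALSE)
mm <- model.matrix(~ . - 1, data = d, contrasts.arg = full_contrasts)
colnames(mm)
```

Here each row has one 1 among the eggs columns and one 1 among the ham columns, and the numeric column x passes through unchanged.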

@DestaHaileselassieHagos, the return value is a matrix, you can change the names however you want using functions like dimnames or colnames. Functions like sub or gsub may be of help as well. – Cattan 30/3, 2016 at 16:56

This answer is incomplete. How do you merge the output of model.matrix back with the columns in the original data frame, as the OP asked? – Latonialatoniah 21/5, 2016 at 0:48

@stackoverflowuser2010, the original question asked for the indicators to be in a "new" data frame, not merged with the original (note the reference to techniques requiring only numeric data). But if you really want it combined with the original you can use cbind. – Cattan 23/5, 2016 at 17:45

This method does not work for numeric or date types without first coercing the column type to factor. – Brookweed 28/11, 2018 at 17:40
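A short sketch of the coercion mentioned in the comment above, using the numeric Month column of the built-in airquality data. The object names are illustrative:

```r
# Month is stored as a number, so the formula treats it as a single numeric column
mm_num <- model.matrix(~ Month - 1, data = airquality)
dim(mm_num)

# Wrapping it in factor() yields one indicator column per distinct month
mm_fac <- model.matrix(~ factor(Month) - 1, data = airquality)
dim(mm_fac)
colnames(mm_fac)
```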

resurrecting this thread- any way to have this pass NA values when present? I realize the intention of the model.matrix() function is to create something without NA values for analysis, but if I am trying to use it for something else, this breaks down. Could write a custom function to do this, but would be cool if there was something embedded here that I am missing... – Intolerable 18/12, 2018 at 1:12

@colin, Not fully automatic, but you can use naresid to put the missing values back in after using na.exclude. A quick example:

tmp <- data.frame(x=factor(c('a','b','c',NA,'a'))); tmp2 <- na.exclude(tmp); tmp3 <- model.matrix( ~x-1, tmp2); tmp4 <- naresid(attr(tmp2,'na.action'), tmp3)

– Cattan 18/12, 2018 at 18:57
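Another approach that may work, sketched here on the same toy data, is to build the model frame yourself with na.action = na.pass so that incomplete rows are not dropped before the matrix is created. The row with the missing value should then come back filled with NA, but it is worth checking that on your own data:

```r
tmp <- data.frame(x = factor(c("a", "b", "c", NA, "a")))

# Build the model frame without dropping incomplete rows ...
mf <- model.frame(~ x, data = tmp, na.action = na.pass)
# ... then expand it; the result keeps one row per row of tmp
mm <- model.matrix(~ x - 1, data = mf)
nrow(mm)
mm
```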

If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif function from the ade4 package :

R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
  eggs.bar eggs.foo ham.blue ham.green ham.red
1        0        1        0         0       1
2        0        1        1         0       0
3        1        0        0         1       0
4        1        0        0         0       1

Not exactly the case you are describing, but it can be useful too...

Akeylah answered 19/2, 2011 at 12:49 Comment(2)

Thanks, this helped me a lot as it uses less memory than model.matrix! – Pilchard 11/5, 2015 at 15:21

I like the way the variables get named; I dislike that they are returned as storage-hungry numeric when they should (IMHO) just be logicals. – Artistry 26/8, 2016 at 1:8
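The storage point in the comment above can be illustrated without ade4. This sketch swaps in model.matrix from base R and simply compares a numeric 0/1 matrix with its logical equivalent:

```r
mm <- model.matrix(~ Species - 1, data = iris)  # 0/1 values stored as doubles
lg <- mm == 1                                   # same shape, stored as logicals

typeof(mm)
typeof(lg)
# The logical version takes less memory for the same information
object.size(lg) < object.size(mm)
```

Whether logicals are acceptable depends on the downstream analysis; many numeric routines will coerce them back to numbers anyway.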

A quick way using the reshape2 package:

require(reshape2)

> dcast(df.original, ham ~ eggs, length)

Using ham as value column: use value_var to override.
  ham bar foo
1   1   0   1
2   2   0   1
3   3   1   0
4   4   1   0

Note that this produces precisely the column names you want.

Hoiden answered 19/2, 2011 at 13:9 Comment(2)

Good. But be careful with duplicate values of ham. For example, d <- data.frame(eggs = c("foo", "bar", "foo"), ham = c(1,2,1)); dcast(d, ham ~ eggs, length) makes foo = 2. – Came 19/2, 2011 at 22:58

@Kohske, true, but I was assuming ham is a unique row id. If ham is not a unique id then one must use some other unique-id (or create a dummy one) and use that in place of ham. Converting a categorical label to a binary indicator would only make sense for unique ids. – Hoiden 19/2, 2011 at 23:42
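A minimal sketch of the "create a dummy id" idea from the comment above. To stay dependency-free it swaps in base R's table() for dcast, and row_id is a hypothetical column name introduced only for this illustration:

```r
d <- data.frame(eggs = c("foo", "bar", "foo"), ham = c(1, 2, 1))

# ham repeats, so count against a synthetic unique row id instead
d$row_id <- seq_len(nrow(d))
ind <- as.data.frame.matrix(table(d$row_id, d$eggs))
ind
```

Because every row id occurs exactly once, each count is 0 or 1 and the duplicated ham values no longer inflate the indicators.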

A dummy variable is probably what you want. In that case, model.matrix is useful:

> with(df.original, data.frame(model.matrix(~eggs+0), ham))
  eggsbar eggsfoo ham
1       0       1   1
2       0       1   2
3       1       0   3
4       1       0   4

Came answered 19/2, 2011 at 3:49 Comment(0)

A late entry: class.ind from the nnet package.

library(nnet)
 with(df.original, data.frame(class.ind(eggs), ham))
  bar foo ham
1   0   1   1
2   0   1   2
3   1   0   3
4   1   0   4

Justify answered 19/2, 2013 at 5:4 Comment(0)

Just came across this old thread and thought I'd add a function that utilizes ade4 to take a dataframe consisting of factors and/or numeric data and returns a dataframe with factors as dummy codes.

dummy <- function(df) {
    # acm.disjonctif() comes from the ade4 package
    require(ade4)

    # drop = FALSE keeps a data frame (and its column names) even when
    # only a single column matches
    NUM <- function(dataframe) dataframe[, sapply(dataframe, is.numeric), drop = FALSE]
    FAC <- function(dataframe) dataframe[, sapply(dataframe, is.factor), drop = FALSE]

    # Numeric columns are kept as they are; factor columns become 0/1 indicators
    data.frame(NUM(df), acm.disjonctif(FAC(df)))
}

Let's try it.

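# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
# converted to factors by default
df <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
            ham = c("red","blue","green","red"), x=rnorm(4),
            stringsAsFactors = TRUE)
dummy(df)

df2 <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
            ham = c("red","blue","green","red"),
            stringsAsFactors = TRUE)
dummy(df2)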

Stenopetalous answered 30/10, 2011 at 4:38 Comment(0)

Here is a clearer way to do it. I use model.matrix to create the dummy boolean variables and then merge them back into the original dataframe.

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
#   eggs ham
# 1  foo   1
# 2  foo   2
# 3  bar   3
# 4  bar   4

# Create the dummy boolean variables using the model.matrix() function.
mm <- model.matrix(~eggs-1, df.original)
mm
#   eggsbar eggsfoo
# 1       0       1
# 2       0       1
# 3       1       0
# 4       1       0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"

# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
#   bar foo
# 1   0   1
# 2   0   1
# 3   1   0
# 4   1   0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"

# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
#   eggs ham bar foo
# 1  foo   1   0   1
# 2  foo   2   0   1
# 3  bar   3   1   0
# 4  bar   4   1   0

# At this point, you can select out the columns that you want.
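
As a sketch of that last step, the following repeats the pipeline and then keeps only the columns the question asked for. It uses sub() with an anchored pattern instead of gsub(), an assumption made here so that a level whose name happens to contain "eggs" is not mangled; df.new is an illustrative name:

```r
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))

mm <- model.matrix(~ eggs - 1, df.original)
# "^eggs" strips only the leading prefix added by model.matrix
colnames(mm) <- sub("^eggs", "", colnames(mm))
result <- cbind(df.original, mm)

# Keep the indicator columns and ham, in the order used in the question
df.new <- result[, c("foo", "bar", "ham")]
df.new
```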

Latonialatoniah answered 21/5, 2016 at 1:8 Comment(0)

I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package. This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.

# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
  exploders <- colnames(data)[sapply(data, function(col){
      is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
    })]
  if (length(exploders) > 0) {
    exploded <- lapply(exploders, function(exp){
        col <- data[, exp]
        n <- length(col)
        dummies <- matrix(values[1], n, length(levels(col)))
        dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
        colnames(dummies) <- paste(exp, levels(col), sep = '_')
        dummies
      })
    # Only keep numeric data.
    data <- data[sapply(data, is.numeric)]
    # Add exploded values.
    data <- cbind(data, exploded)
  }
  return(data)
}

Caen answered 22/6, 2015 at 9:57 Comment(0)

(The question is 10yo, but for the sake of completeness...)

The function i() from the fixest package does exactly that.

Beyond creating a design matrix from a factor-like variable, you can also very easily do two extra things on the fly:

binning values (with the argument 'bin'),
excluding some factor values (with the argument ref).

And since it is made for this task, if your variable happens to be numeric you don't need to wrap it with factor(x_num) (as opposed to the model.matrix solution).

Here's an example:

library(fixest)
data(airquality)
table(airquality$Month)
#>  5  6  7  8  9 
#> 31 30 31 31 30

head(i(airquality$Month))
#>      5 6 7 8 9
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 1 0 0 0 0
#> [4,] 1 0 0 0 0
#> [5,] 1 0 0 0 0
#> [6,] 1 0 0 0 0

#
# Binning (check out the help, there are many many ways to bin)
#

colSums(i(airquality$Month, bin = 5:6))
#>  5  7  8  9 
#> 61 31 31 30 

#
# References
#

head(i(airquality$Month, ref = c(6, 9)), 3)
#>      5 7 8
#> [1,] 1 0 0
#> [2,] 1 0 0
#> [3,] 1 0 0

And here's a little wrapper expanding all non-numeric variables (by default):

library(fixest)

# data: data.frame
# var: vector of variable names // if missing, all non numeric variables
# no argument checking
expand_factor = function(data, var){
    
    if(missing(var)){
        var = names(data)[!sapply(data, is.numeric)]
        if(length(var) == 0) return(data)
    }
    
    data_list = unclass(data)
    new = lapply(var, \(x) i(data_list[[x]]))
    data_list[names(data_list) %in% var] = new
    
    do.call("cbind", data_list)
}

my_data = data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))

expand_factor(my_data)
#>      bar foo ham
#> [1,]   0   1   1
#> [2,]   0   1   2
#> [3,]   1   0   3
#> [4,]   1   0   4

Finally, for those wondering, the timing is equivalent to the model.matrix solution.

library(microbenchmark)
my_data = data.frame(x = as.factor(sample(100, 1e6, TRUE)))

microbenchmark(mm = model.matrix(~x, my_data),
               i = i(my_data$x), times = 5)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>    mm 155.1904 156.7751 209.2629 182.4964 197.9084 353.9443     5
#>     i 154.1697 154.7893 159.5202 155.4166 163.9706 169.2550     5

Flatworm answered 4/10, 2021 at 20:56 Comment(0)

Applying == over the unique values of eggs with sapply generates the dummy vectors:

x <- with(df.original, data.frame(+sapply(unique(eggs), `==`, eggs), ham))
x
#  foo bar ham
#1   1   0   1
#2   1   0   2
#3   0   1   3
#4   0   1   4

all.equal(x, df.desired)
#[1] TRUE
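
For readers unfamiliar with the idiom, here is the same one-liner unpacked step by step. This is a sketch assuming R >= 4.0 (so eggs is a character vector); lv, is_level and ind are illustrative names:

```r
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))

# Distinct values in order of first appearance
lv <- unique(df.original$eggs)

# One logical column per value: TRUE where eggs equals that value
is_level <- sapply(lv, function(l) df.original$eggs == l)

# Unary plus coerces TRUE/FALSE to 1/0 while keeping the column names
ind <- +is_level

data.frame(ind, ham = df.original$ham)
```

Because the columns follow the order of unique(), foo comes before bar here, unlike the alphabetical ordering produced by the factor-based approaches.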

A possibly faster variant - Result best used as list or data.frame:

. <- unique(df.original$eggs)
with(df.original, 
     data.frame(+do.call(cbind, lapply(setNames(., .), `==`, eggs)), ham))

Indexing in a matrix - Result best used as matrix:

. <- unique(df.original$eggs)
i <- match(df.original$eggs, .)
nc <- length(.)
nr <- length(i)
cbind(matrix(`[<-`(integer(nc * nr), 1:nr + nr * (i - 1), 1), nr, nc,
                 dimnames=list(NULL, .)), df.original["ham"])

Using outer - Result best used as matrix:

. <- unique(df.original$eggs)
cbind(+outer(df.original$eggs, setNames(., .), `==`), df.original["ham"])

Using rep - Result best used as matrix:

. <- unique(df.original$eggs)
n <- nrow(df.original)
cbind(+matrix(df.original$eggs == rep(., each=n), n, dimnames=list(NULL, .)),
 df.original["ham"])

Englishry answered 20/4, 2022 at 20:27 Comment(0)
