Recode categorical factor with N categories into N binary columns

Original data frame:

v1 = sample(letters[1:3], 10, replace=TRUE)
v2 = sample(letters[1:3], 10, replace=TRUE)
df = data.frame(v1,v2)
df
   v1 v2
1   b  c
2   a  a
3   c  c
4   b  a
5   c  c
6   c  b
7   a  a
8   a  b
9   a  c
10  a  b

New data frame:

new_df = data.frame(row.names=rownames(df))
for (i in colnames(df)) {
    for (x in letters[1:3]) {
        #new_df[x] = as.numeric(df[i] == x)
        new_df[paste0(i, "_", x)] = as.numeric(df[i] == x)
    }
}
   v1_a v1_b v1_c v2_a v2_b v2_c
1     0    1    0    0    0    1
2     1    0    0    1    0    0
3     0    0    1    0    0    1
4     0    1    0    1    0    0
5     0    0    1    0    0    1
6     0    0    1    0    1    0
7     1    0    0    1    0    0
8     1    0    0    0    1    0
9     1    0    0    0    0    1
10    1    0    0    0    1    0

For small datasets this is fine, but it becomes slow for much larger datasets.

Does anyone know of a way to do this without looping?

Aristippus answered 24/4, 2013 at 19:8 Comment(5)
Your first data frame had two variables, but it looks like you only converted the second one. Can you clarify that a bit? – Fm
You're overwriting your data. It should have 6 columns in the output. – Hooper
Sorry, that was a mistake on my part -- I fixed it in the code above. There should be three new columns for each original column in the above example. Thanks for catching that! – Aristippus
@Keith, have you checked the answers that have been posted? – Hooper
@Hooper Done. There were many helpful solutions. I appreciate everyone's input! – Aristippus

Even better with the help of @AnandaMahto's search capabilities,

model.matrix(~ . + 0, data=df, contrasts.arg = lapply(df, contrasts, contrasts=FALSE))
#    v1a v1b v1c v2a v2b v2c
# 1    0   1   0   0   0   1
# 2    1   0   0   1   0   0
# 3    0   0   1   0   0   1
# 4    0   1   0   1   0   0
# 5    0   0   1   0   0   1
# 6    0   0   1   0   1   0
# 7    1   0   0   1   0   0
# 8    1   0   0   0   1   0
# 9    1   0   0   0   0   1
# 10   1   0   0   0   1   0
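
One caveat, assuming R 4.0.0 or later, where data.frame() no longer converts strings to factors by default: contrasts() only accepts factors, so the call above can fail with "contrasts apply only to factors" when the columns are character vectors. A minimal sketch that converts the columns first (the names df_chr and df_f are only illustrative, not from the question):

```r
# Assumption: character columns, as data.frame() produces by default in R >= 4.0.0
df_chr <- data.frame(v1 = c("b", "a", "c", "b"),
                     v2 = c("c", "a", "b", "a"),
                     stringsAsFactors = FALSE)

# Convert every column to a factor so that contrasts() accepts it
df_f <- as.data.frame(lapply(df_chr, factor))

# Same call as above: full indicator coding for every factor
mm <- model.matrix(~ . + 0, data = df_f,
                   contrasts.arg = lapply(df_f, contrasts, contrasts = FALSE))
colnames(mm)
```

If the columns are already factors (for example, when the data frame was created with stringsAsFactors = TRUE), the conversion step is harmless.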

I think this is what you're looking for. I'd be happy to delete if it's not so. Thanks to @G.Grothendieck (once again) for the excellent usage of model.matrix!

cbind(with(df, model.matrix(~ v1 + 0)), with(df, model.matrix(~ v2 + 0)))
#    v1a v1b v1c v2a v2b v2c
# 1    0   1   0   0   0   1
# 2    1   0   0   1   0   0
# 3    0   0   1   0   0   1
# 4    0   1   0   1   0   0
# 5    0   0   1   0   0   1
# 6    0   0   1   0   1   0
# 7    1   0   0   1   0   0
# 8    1   0   0   0   1   0
# 9    1   0   0   0   0   1
# 10   1   0   0   0   1   0

Note: The output originally posted in the question (before it was edited) is just:

with(df, model.matrix(~ v2 + 0))

Note 2: This gives a matrix. Fairly obvious, but still, wrap it with as.data.frame(.) if you want a data.frame.
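
As a rough sketch of that last note, using a small made-up data frame (df_ex is illustrative, not the data from the question):

```r
# Hypothetical three-row example with the same column names as in the question
df_ex <- data.frame(v1 = c("b", "a", "c"),
                    v2 = c("c", "a", "b"))

# One indicator block per column, bound side by side (this is a numeric matrix)
mm <- cbind(model.matrix(~ v1 + 0, data = df_ex),
            model.matrix(~ v2 + 0, data = df_ex))

# Wrap the matrix to get a data.frame with one 0/1 column per category
new_df <- as.data.frame(mm)
str(new_df)
```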

Hooper answered 24/4, 2013 at 19:19 Comment(0)

There is a function in the caret package that does what you require: dummyVars. Here is an example of its usage, taken from the author's documentation: http://topepo.github.io/caret/preprocess.html

library(earth)
data(etitanic)

dummies <- caret::dummyVars(survived ~ ., data = etitanic)
head(predict(dummies, newdata = etitanic))

  pclass.1st pclass.2nd pclass.3rd sex.female sex.male     age sibsp parch
1          1          0          0          1        0 29.0000     0     0
2          1          0          0          0        1  0.9167     1     2
3          1          0          0          1        0  2.0000     1     2
4          1          0          0          0        1 30.0000     1     2
5          1          0          0          1        0 25.0000     1     2
6          1          0          0          0        1 48.0000     0     0

The model.matrix approach could be useful if you have sparse data and want to use Matrix::sparse.model.matrix instead.
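
A hedged sketch of what that might look like, assuming the Matrix package is installed and reusing the contrasts.arg idiom from the model.matrix answer (df_f is an illustrative data frame, not the etitanic data):

```r
# Illustrative factor columns; sparse.model.matrix() comes from the Matrix package
df_f <- data.frame(v1 = factor(c("b", "a", "c", "b")),
                   v2 = factor(c("c", "a", "b", "a")))

# Keep every level of every factor by switching off reduced contrasts
sm <- Matrix::sparse.model.matrix(
  ~ . + 0, data = df_f,
  contrasts.arg = lapply(df_f, contrasts, contrasts = FALSE)
)
class(sm)  # a sparse matrix class from the Matrix package
dim(sm)
```

The sparse result stores only the non-zero entries, which tends to matter once factors have many levels.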

Exactitude answered 17/1, 2015 at 3:54 Comment(0)

I just saw a closed question redirected here, and nobody has mentioned the dummies package yet:

You can recode your variables using the dummy.data.frame() function, which is built on top of model.matrix() but has an easier syntax and some good options, and returns a data frame:

> dummy.data.frame(df, sep="_")
   v1_a v1_b v1_c v2_a v2_b v2_c
1     0    1    0    0    0    1
2     1    0    0    1    0    0
3     0    0    1    0    0    1
4     0    1    0    1    0    0
5     0    0    1    0    0    1
6     0    0    1    0    1    0
7     1    0    0    1    0    0
8     1    0    0    0    1    0
9     1    0    0    0    0    1
10    1    0    0    0    1    0

Some nice aspects of this function are that you can easily specify the delimiter for the new names (sep=) and omit non-encoded variables (all=FALSE), and that it comes with its own option, dummy.classes, which lets you specify which classes of column should be encoded.

You can also just use the dummy() function to apply this to just one column.

Rainmaker answered 18/12, 2017 at 15:58 Comment(0)

A fairly direct approach is to just use table on each column, cross-tabulating the row index (1 to the number of rows in the data.frame) against the values in the column:

allLevels <- levels(factor(unlist(df)))
do.call(cbind, 
        lapply(df, function(x) table(sequence(nrow(df)), 
                                     factor(x, levels = allLevels))))
#    a b c a b c
# 1  0 1 0 0 0 1
# 2  1 0 0 1 0 0
# 3  0 0 1 0 0 1
# 4  0 1 0 1 0 0
# 5  0 0 1 0 0 1
# 6  0 0 1 0 1 0
# 7  1 0 0 1 0 0
# 8  1 0 0 0 1 0
# 9  1 0 0 0 0 1
# 10 1 0 0 0 1 0

I've used factor on "x" to make sure that even in cases where there are, say, no "c" values in a column, there will still be a "c" column in the output, filled with zeroes.
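
The column names above repeat ("a b c a b c"). If distinct names are wanted, one possible variation is to prefix each block with its source column name. This is only a sketch on a small made-up data frame (df_ex and the "_" separator are illustrative choices):

```r
# Hypothetical three-row example
df_ex <- data.frame(v1 = c("b", "a", "c"),
                    v2 = c("c", "a", "a"))

allLevels <- levels(factor(unlist(df_ex)))

blocks <- lapply(names(df_ex), function(nm) {
  # Cross-tabulate row index against the column's values, keeping all levels
  tab <- table(seq_len(nrow(df_ex)), factor(df_ex[[nm]], levels = allLevels))
  # Rebuild as a plain integer matrix with prefixed column names
  matrix(as.integer(tab), nrow = nrow(df_ex),
         dimnames = list(NULL, paste(nm, allLevels, sep = "_")))
})

res <- do.call(cbind, blocks)
colnames(res)
```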

Kaffraria answered 25/4, 2013 at 7:10 Comment(0)

I recently came across another way. I noticed that when you run any of the contrast functions with contrasts set to FALSE, you get a one-hot encoding. For example, contr.sum(5, contrasts = FALSE) gives

  1 2 3 4 5
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1

To get this behavior for all of your factors, you can create a new contrast function and set it as the default. For example,

contr.onehot = function (n, contrasts, sparse = FALSE) {
  contr.sum(n = n, contrasts = FALSE, sparse = sparse)
}

options(contrasts = c("contr.onehot", "contr.onehot"))
model.matrix(~ . - 1, data = df)

This results in

   v1a v1b v1c v2a v2b v2c
1    0   0   1   0   0   1
2    0   1   0   1   0   0
3    0   0   1   0   1   0
4    1   0   0   0   1   0
5    0   1   0   0   1   0
6    0   1   0   0   0   1
7    1   0   0   0   1   0
8    0   1   0   0   1   0
9    0   1   0   1   0   0
10   0   0   1   0   0   1
Musty answered 31/3, 2016 at 21:43 Comment(0)

Here is a solution for a more general case, where the number of letters is not specified a priori:

convertABC <- function(x) {

    # Output width: position in `letters` of the highest letter found anywhere
    # in the global `df`, so that every column is encoded to the same width
    n <- max(match(as.matrix(df), letters))
    hold <- rep(0, n)                                # pre-format output

    codify <- function(ch) {                         # define function for single char

        output <- hold                               # take empty vector
        output[match(ch, letters)] <- 1              # place 1 according to letter pos
        return(output)
    }

    to.return <- t(sapply(as.character(x), codify))  # apply it to whole vector
    rownames(to.return) <- 1:nrow(to.return)         # nice rownames
    colnames(to.return) <- letters[1:n]              # nice column names
    return(to.return)
}

This function takes a vector of characters, and recodes it into binary values. To process all variables in df:

do.call(cbind,lapply(df,convertABC))
Kandacekandahar answered 24/4, 2013 at 20:20 Comment(0)
library(correlationfunnel)
library(dplyr)
v1 = sample(letters[1:3], 10, replace=TRUE)
v2 = sample(letters[1:3], 10, replace=TRUE)
df = data.frame(v1,v2)
df

   v1 v2
1   b  c
2   c  c
3   c  a
4   c  c
5   a  a
6   b  b
7   b  c
8   b  c
9   c  a
10  b  c

df$id= 1:nrow(df)
df %>%
   select(-id) %>%
   binarize()

# A tibble: 10 x 6
   v1__a v1__b v1__c v2__a v2__b v2__c
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     0     1     0     0     0     1
 2     0     0     1     0     0     1
 3     0     0     1     1     0     0
 4     0     0     1     0     0     1
 5     1     0     0     1     0     0
 6     0     1     0     0     1     0
 7     0     1     0     0     0     1
 8     0     1     0     0     0     1
 9     0     0     1     1     0     0
10     0     1     0     0     0     1
Ectomere answered 14/6, 2020 at 4:9 Comment(0)