Generate a dummy-variable
B

17

95

I have had trouble generating the following dummy-variables in R:

I'm analyzing yearly time series data (time period 1948-2009). I have two questions:

  1. How do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)?

  2. How do I generate a dummy variable which is zero before 1957 and takes the value 1 from 1957 and onwards to 2009?

Babs answered 2/8, 2012 at 23:7 Comment(0)
B
118

Another option that can work better if you have many variables is factor and model.matrix.

year.f = factor(year)
dummies = model.matrix(~year.f)

This will include an intercept column (all ones) and one column for each of the years in your data set except one, which will be the "default" or intercept value.

You can change how the "default" is chosen by messing with contrasts.arg in model.matrix.
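As a rough illustration of choosing the reference level, the sketch below uses a made-up year vector and picks 1957 as the baseline (both are assumptions for the example). It shows two routes: releveling the factor, or passing a treatment-contrast matrix through contrasts.arg.

```r
# Toy data (assumed for illustration)
year <- c(1956, 1957, 1957, 1958)
year.f <- factor(year)

# Option 1: make 1957 the reference level before building the matrix
m1 <- model.matrix(~ relevel(year.f, ref = "1957"))

# Option 2: keep the factor as is and pass a treatment contrast whose
# base is the second level (1957)
m2 <- model.matrix(
  ~ year.f,
  contrasts.arg = list(year.f = contr.treatment(levels(year.f), base = 2))
)

# In both cases the rows for 1957 have zeros in every dummy column
m1[year == 1957, -1]
m2[year == 1957, -1]
```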

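Also, if you want to omit the intercept, you can either drop the first column of the result or add + 0 to the end of the formula. Note that the two are not equivalent: dropping the column leaves the k-1 dummies, while + 0 makes model.matrix return one column per level.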
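A small sketch of the difference, using a made-up year vector (an assumption for illustration):

```r
# Toy data (assumed for illustration)
year <- c(1956, 1957, 1957, 1958)
year.f <- factor(year)

with_intercept <- model.matrix(~ year.f)      # intercept + k-1 dummies
no_intercept   <- model.matrix(~ year.f + 0)  # k dummies, no intercept

colnames(with_intercept)
# "(Intercept)" "year.f1957" "year.f1958"
colnames(no_intercept)
# "year.f1956" "year.f1957" "year.f1958"
```

With + 0 every row has exactly one 1, so the columns sum to a vector of ones; that is why such a matrix should not be combined with an intercept in a regression.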

Hope this is useful.

Baleful answered 3/8, 2012 at 1:24 Comment(10)
What if you want to generate dummy variables for all levels (instead of k-1) with no intercept? - Polygynous
Note that model.matrix() accepts multiple variables to transform into dummies: model.matrix(~ var1 + var2, data = df). Again, just be sure that they are factors. - Citrine
@FernandoHocesDeLaGuardia I too am wondering that. Can anybody answer? - Orangery
@Orangery table(1:n, factor), where factor is the original variable and n is its length. - Polygynous
@FernandoHocesDeLaGuardia I'm sorry, I don't understand. What do you do with that table? - Orangery
@Orangery That table is an n x k matrix with all k indicator variables (instead of k-1). - Polygynous
@FernandoHocesDeLaGuardia Thanks, I got it. I was using the wrong value for factor. It makes sense now. That's much simpler. - Orangery
@FernandoHocesDeLaGuardia You can remove the intercept from a formula with either + 0 or - 1. So model.matrix(~ year.f + 0) will give dummy variables without a reference level. - Strohbehn
Note also that the order of the dummies will be alphabetical with regard to the column values. - Compulsive
Note that for table(1:n, factor) the second argument does not need to be converted using as.factor; just passing it in as is works. - Diaphony
D
61

The simplest way to produce these dummy variables is something like the following:

> print(year)
[1] 1956 1957 1957 1958 1958 1959
> dummy <- as.numeric(year == 1957)
> print(dummy)
[1] 0 1 1 0 0 0
> dummy2 <- as.numeric(year >= 1957)
> print(dummy2)
[1] 0 1 1 1 1 1

More generally, you can use ifelse to choose between two values depending on a condition. So if instead of a 0-1 dummy variable, for some reason you wanted to use, say, 4 and 7, you could use ifelse(year == 1957, 4, 7).
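Applied to the series in the question, a minimal sketch looks like this. It assumes the years are stored as a plain numeric vector running from 1948 to 2009; the names d_1957 and d_from_1957 are made up for the example.

```r
# Yearly index as described in the question (assumed to be complete)
year <- 1948:2009

# 1 only in 1957, 0 otherwise
d_1957 <- as.integer(year == 1957)

# 0 before 1957, 1 from 1957 onwards
d_from_1957 <- as.integer(year >= 1957)

which(d_1957 == 1)  # 10, i.e. observation #10
sum(d_from_1957)    # 53 years from 1957 to 2009
```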

Debutant answered 2/8, 2012 at 23:38 Comment(0)
O
53

Using dummies::dummy():

library(dummies)

# example data
df1 <- data.frame(id = 1:4, year = 1991:1994)

df1 <- cbind(df1, dummy(df1$year, sep = "_"))

df1
#   id year df1_1991 df1_1992 df1_1993 df1_1994
# 1  1 1991        1        0        0        0
# 2  2 1992        0        1        0        0
# 3  3 1993        0        0        1        0
# 4  4 1994        0        0        0        1
Oppugnant answered 31/10, 2016 at 13:34 Comment(8)
Maybe adding "fun = factor" in the dummy function can help, if that is the meaning of the variable. - Eosin
@FilippoMazza I prefer to keep them as integers, but yes, we could set factor if needed. - Oppugnant
How do you remove the df1 prefix from each dummy column name? - Aaren
@Aaren colnames(df1) <- gsub("df1_", "", fixed = TRUE, colnames(df1)) - Oppugnant
The fact that users have to resort to a third-party library to accomplish this common statistical task is a major and unfortunate feature omission in R. A good statistics language would have a simple built-in syntax for this. - Benedic
@DonF It is just an option; did you see the most voted base answer above? - Oppugnant
@DonF model.matrix() :) - Poche
An unmaintained package that creates problems with certain commands. Not recommended. - Mensch
H
20

Package mlr includes createDummyFeatures for this purpose:

library(mlr)
df <- data.frame(var = sample(c("A", "B", "C"), 10, replace = TRUE))
df

#    var
# 1    B
# 2    A
# 3    C
# 4    B
# 5    C
# 6    A
# 7    C
# 8    A
# 9    B
# 10   C

createDummyFeatures(df, cols = "var")

#    var.A var.B var.C
# 1      0     1     0
# 2      1     0     0
# 3      0     0     1
# 4      0     1     0
# 5      0     0     1
# 6      1     0     0
# 7      0     0     1
# 8      1     0     0
# 9      0     1     0
# 10     0     0     1

createDummyFeatures drops original variable.

https://www.rdocumentation.org/packages/mlr/versions/2.9/topics/createDummyFeatures

H answered 10/11, 2016 at 16:54 Comment(2)
Enrique, I've tried installing the package, but it doesn't seem to be working after doing library(mlr). I get the following error: «Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : there is no package called ‘ggvis’ In addition: Warning message: package ‘mlr’ was built under R version 3.2.5 Error: package or namespace load failed for ‘mlr’» - Ellga
You need to install 'ggvis' first. - Gulf
C
19

The other answers here offer direct routes to accomplish this task—one that many models (e.g. lm) will do for you internally anyway. Nonetheless, here are ways to make dummy variables with Max Kuhn's popular caret and recipes packages. While somewhat more verbose, they both scale easily to more complicated situations, and fit neatly into their respective frameworks.
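To illustrate the remark that models such as lm build the dummies internally, here is a small base R sketch with simulated data (the data frame toy and its values are assumptions for the example):

```r
# Simulated data (assumed for illustration)
set.seed(1)
toy <- data.frame(
  letter = factor(rep(c("a", "b", "c"), each = 4)),
  y      = rnorm(12)
)

# lm expands the factor into k-1 dummy columns on its own
fit <- lm(y ~ letter, data = toy)

names(coef(fit))
# "(Intercept)" "letterb" "letterc"

# The design matrix that lm actually used
head(model.matrix(fit))
```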


caret::dummyVars

With caret, the relevant function is dummyVars, which has a predict method to apply it on a data frame:

df <- data.frame(letter = rep(c('a', 'b', 'c'), each = 2),
                 y = 1:6)

library(caret)

dummy <- dummyVars(~ ., data = df, fullRank = TRUE)

dummy
#> Dummy Variable Object
#> 
#> Formula: ~.
#> 2 variables, 1 factors
#> Variables and levels will be separated by '.'
#> A full rank encoding is used

predict(dummy, df)
#>   letter.b letter.c y
#> 1        0        0 1
#> 2        0        0 2
#> 3        1        0 3
#> 4        1        0 4
#> 5        0        1 5
#> 6        0        1 6

recipes::step_dummy

With recipes, the relevant function is step_dummy:

library(recipes)

dummy_recipe <- recipe(y ~ letter, df) %>% 
    step_dummy(letter)

dummy_recipe
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          1
#> 
#> Steps:
#> 
#> Dummy variables from letter

Depending on context, extract the data with prep and either bake or juice:

# Prep and bake on new data...
dummy_recipe %>% 
    prep() %>% 
    bake(df)
#> # A tibble: 6 x 3
#>       y letter_b letter_c
#>   <int>    <dbl>    <dbl>
#> 1     1        0        0
#> 2     2        0        0
#> 3     3        1        0
#> 4     4        1        0
#> 5     5        0        1
#> 6     6        0        1

# ...or use `retain = TRUE` and `juice` to extract training data
dummy_recipe %>% 
    prep(retain = TRUE) %>% 
    juice()
#> # A tibble: 6 x 3
#>       y letter_b letter_c
#>   <int>    <dbl>    <dbl>
#> 1     1        0        0
#> 2     2        0        0
#> 3     3        1        0
#> 4     4        1        0
#> 5     5        0        1
#> 6     6        0        1
Chickie answered 17/12, 2017 at 21:59 Comment(0)
L
16

For the use case presented in the question, you can also just multiply the logical condition by 1 (or, maybe even better, by 1L):

# example data
df1 <- data.frame(yr = 1951:1960)

# create the dummies
df1$is.1957 <- 1L * (df1$yr == 1957)
df1$after.1957 <- 1L * (df1$yr >= 1957)

which gives:

> df1
     yr is.1957 after.1957
1  1951       0          0
2  1952       0          0
3  1953       0          0
4  1954       0          0
5  1955       0          0
6  1956       0          0
7  1957       1          1
8  1958       0          1
9  1959       0          1
10 1960       0          1
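The reason 1L may be preferable is the storage type of the result. A quick check on a made-up vector (an assumption for illustration):

```r
# Toy input (assumed for illustration)
yr <- c(1956, 1957, 1958)

typeof(1L * (yr == 1957))  # "integer"
typeof(1  * (yr == 1957))  # "double"

# A unary plus also coerces a logical vector to integer
typeof(+(yr == 1957))      # "integer"
```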

For use cases such as those in the answers of @zx8754 and @Sotos, there are some further options which haven't been covered yet.

1) Make your own make_dummies-function

# example data
df2 <- data.frame(id = 1:5, year = c(1991:1994,1992))

# create a function
make_dummies <- function(v, prefix = '') {
  s <- sort(unique(v))
  d <- outer(v, s, function(v, s) 1L * (v == s))
  colnames(d) <- paste0(prefix, s)
  d
}

# bind the dummies to the original dataframe
cbind(df2, make_dummies(df2$year, prefix = 'y'))

which gives:

  id year y1991 y1992 y1993 y1994
1  1 1991     1     0     0     0
2  2 1992     0     1     0     0
3  3 1993     0     0     1     0
4  4 1994     0     0     0     1
5  5 1992     0     1     0     0

2) use the dcast-function from either data.table or reshape2

 dcast(df2, id + year ~ year, fun.aggregate = length)

which gives:

  id year 1991 1992 1993 1994
1  1 1991    1    0    0    0
2  2 1992    0    1    0    0
3  3 1993    0    0    1    0
4  4 1994    0    0    0    1
5  5 1992    0    1    0    0

However, this will not work when there are duplicate values in the column for which the dummies have to be created. In that case a specific aggregation function is needed for dcast, and the result of dcast needs to be merged back into the original:

# example data
df3 <- data.frame(var = c("B", "C", "A", "B", "C"))

# aggregation function to get dummy values
f <- function(x) as.integer(length(x) > 0)

# reshape to wide with the custom aggregation function and merge back to the original
merge(df3, dcast(df3, var ~ var, fun.aggregate = f), by = 'var', all.x = TRUE)

which gives (note that the result is ordered according to the by column):

  var A B C
1   A 1 0 0
2   B 0 1 0
3   B 0 1 0
4   C 0 0 1
5   C 0 0 1

3) use the spread-function from tidyr (with mutate from dplyr)

library(dplyr)
library(tidyr)

df2 %>% 
  mutate(v = 1, yr = year) %>% 
  spread(yr, v, fill = 0)

which gives:

  id year 1991 1992 1993 1994
1  1 1991    1    0    0    0
2  2 1992    0    1    0    0
3  3 1993    0    0    1    0
4  4 1994    0    0    0    1
5  5 1992    0    1    0    0
Lepper answered 13/2, 2018 at 18:38 Comment(0)
A
11

What I normally do to work with this kind of dummy variables is:

(1) how do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)

data$factor_year_1 <- factor(with(data, ifelse(year == 1957, 1, 0)))

(2) how do I generate a dummy-variable which is zero before 1957 and takes the value 1 from 1957 and onwards to 2009?

data$factor_year_2 <- factor(with(data, ifelse(year < 1957, 0, 1)))

Then, I can introduce this factor as a dummy variable in my models. For example, to see whether there is a long-term trend in a variable y:

summary(lm(y ~ t, data = data))
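The model call above assumes that data already contains columns y and t. A minimal self-contained sketch of a trend model that also includes the step dummy, using simulated data (the column name step_1957, the simulated coefficients, and the noise level are all assumptions made for illustration):

```r
# Simulated yearly series with a level shift in 1957 (assumed for illustration)
set.seed(42)
data <- data.frame(year = 1948:2009)
data$t <- seq_len(nrow(data))                     # linear time trend
data$step_1957 <- as.integer(data$year >= 1957)   # 0 before 1957, 1 after
data$y <- 0.1 * data$t + 2 * data$step_1957 + rnorm(nrow(data), sd = 0.5)

# Trend plus level shift; the coefficient on step_1957 estimates the shift
fit <- lm(y ~ t + step_1957, data = data)
summary(fit)
```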

Hope this helps!

Abacist answered 3/8, 2012 at 9:44 Comment(0)
P
7

If you want to get K dummy variables, instead of K-1, try:

dummies = table(1:length(year),as.factor(year))  

Best,

Polygynous answered 27/3, 2015 at 17:45 Comment(1)
The resulting table cannot be used as a data.frame. If that's a problem, use as.data.frame.matrix(dummies) to convert it into one. - Sanctus
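A short sketch of the table approach together with the conversion suggested in the comment, on a made-up year vector (an assumption for illustration):

```r
# Toy data (assumed for illustration)
year <- c(1956, 1957, 1957, 1958)

# One row per observation, one column per distinct year
dummies <- table(seq_along(year), year)

# Convert the table to an ordinary data frame of 0/1 columns
dummies_df <- as.data.frame.matrix(dummies)
dummies_df
```

seq_along(year) is used here instead of 1:length(year) only because it also behaves sensibly for a zero-length vector.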
P
7

I read this on the kaggle forum:

#Generate example dataframe with character column
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
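The same idea can be written without an explicit loop. This is only a sketch on a shortened version of the example data; the names lv and dummies are made up here.

```r
# Shortened example data (assumed for illustration)
example <- data.frame(strcol = c("A", "A", "B", "F", "C"),
                      stringsAsFactors = FALSE)

# One 0/1 column per distinct value, built in a single sapply call
lv <- sort(unique(example$strcol))
dummies <- sapply(lv, function(l) as.integer(example$strcol == l))
colnames(dummies) <- paste("dummy", lv, sep = "_")

example <- cbind(example, dummies)
example
```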
Patch answered 16/5, 2015 at 10:37 Comment(0)
W
5

The ifelse function is best for simple logic like this.

> x <- seq(1950, 1960, 1)

    ifelse(x == 1957, 1, 0)
    ifelse(x <= 1957, 1, 0)

>  [1] 0 0 0 0 0 0 0 1 0 0 0
>  [1] 1 1 1 1 1 1 1 1 0 0 0

Also, if you want it to return character data then you can do so.

> x <- seq(1950, 1960, 1)

    ifelse(x == 1957, "foo", "bar")
    ifelse(x <= 1957, "foo", "bar")

>  [1] "bar" "bar" "bar" "bar" "bar" "bar" "bar" "foo" "bar" "bar" "bar"
>  [1] "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "bar" "bar" "bar"

Categorical variables with nesting...

> x <- seq(1950, 1960, 1)

    ifelse(x == 1957, "foo", ifelse(x == 1958, "bar","baz"))

>  [1] "baz" "baz" "baz" "baz" "baz" "baz" "baz" "foo" "bar" "baz" "baz"

This is the most straightforward option.
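When the categories are ranges rather than single values, nested ifelse calls can get hard to read. One base R alternative is cut. The sketch below is an illustration only; the break points and the labels "before", "1957" and "after" are assumptions chosen to match the question.

```r
x <- seq(1950, 1960, 1)

# Intervals are (-Inf, 1956], (1956, 1957], (1957, Inf)
period <- cut(x,
              breaks = c(-Inf, 1956, 1957, Inf),
              labels = c("before", "1957", "after"))

table(period)
# before   1957  after
#      7      1      3

# One dummy column per category, if a 0/1 matrix is needed
model.matrix(~ period + 0)
```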

Waddell answered 9/12, 2015 at 22:41 Comment(0)
A
5

Another way is to use mtabulate from qdapTools package, i.e.

df <- data.frame(var = sample(c("A", "B", "C"), 5, replace = TRUE))
  var
#1   C
#2   A
#3   C
#4   B
#5   B

library(qdapTools)
mtabulate(df$var)

which gives,

  A B C
1 0 0 1
2 1 0 0
3 0 0 1
4 0 1 0
5 0 1 0
Agape answered 6/10, 2017 at 6:32 Comment(0)
L
5

This one liner in base R

model.matrix( ~ iris$Species - 1)

gives

    iris$Speciessetosa iris$Speciesversicolor iris$Speciesvirginica
1                    1                      0                     0
2                    1                      0                     0
3                    1                      0                     0
4                    1                      0                     0
5                    1                      0                     0
6                    1                      0                     0
7                    1                      0                     0
8                    1                      0                     0
9                    1                      0                     0
10                   1                      0                     0
11                   1                      0                     0
12                   1                      0                     0
13                   1                      0                     0
14                   1                      0                     0
15                   1                      0                     0
16                   1                      0                     0
17                   1                      0                     0
18                   1                      0                     0
19                   1                      0                     0
20                   1                      0                     0
21                   1                      0                     0
22                   1                      0                     0
23                   1                      0                     0
24                   1                      0                     0
25                   1                      0                     0
26                   1                      0                     0
27                   1                      0                     0
28                   1                      0                     0
29                   1                      0                     0
30                   1                      0                     0
31                   1                      0                     0
32                   1                      0                     0
33                   1                      0                     0
34                   1                      0                     0
35                   1                      0                     0
36                   1                      0                     0
37                   1                      0                     0
38                   1                      0                     0
39                   1                      0                     0
40                   1                      0                     0
41                   1                      0                     0
42                   1                      0                     0
43                   1                      0                     0
44                   1                      0                     0
45                   1                      0                     0
46                   1                      0                     0
47                   1                      0                     0
48                   1                      0                     0
49                   1                      0                     0
50                   1                      0                     0
51                   0                      1                     0
52                   0                      1                     0
53                   0                      1                     0
54                   0                      1                     0
55                   0                      1                     0
56                   0                      1                     0
57                   0                      1                     0
58                   0                      1                     0
59                   0                      1                     0
60                   0                      1                     0
61                   0                      1                     0
62                   0                      1                     0
63                   0                      1                     0
64                   0                      1                     0
65                   0                      1                     0
66                   0                      1                     0
67                   0                      1                     0
68                   0                      1                     0
69                   0                      1                     0
70                   0                      1                     0
71                   0                      1                     0
72                   0                      1                     0
73                   0                      1                     0
74                   0                      1                     0
75                   0                      1                     0
76                   0                      1                     0
77                   0                      1                     0
78                   0                      1                     0
79                   0                      1                     0
80                   0                      1                     0
81                   0                      1                     0
82                   0                      1                     0
83                   0                      1                     0
84                   0                      1                     0
85                   0                      1                     0
86                   0                      1                     0
87                   0                      1                     0
88                   0                      1                     0
89                   0                      1                     0
90                   0                      1                     0
91                   0                      1                     0
92                   0                      1                     0
93                   0                      1                     0
94                   0                      1                     0
95                   0                      1                     0
96                   0                      1                     0
97                   0                      1                     0
98                   0                      1                     0
99                   0                      1                     0
100                  0                      1                     0
101                  0                      0                     1
102                  0                      0                     1
103                  0                      0                     1
104                  0                      0                     1
105                  0                      0                     1
106                  0                      0                     1
107                  0                      0                     1
108                  0                      0                     1
109                  0                      0                     1
110                  0                      0                     1
111                  0                      0                     1
112                  0                      0                     1
113                  0                      0                     1
114                  0                      0                     1
115                  0                      0                     1
116                  0                      0                     1
117                  0                      0                     1
118                  0                      0                     1
119                  0                      0                     1
120                  0                      0                     1
121                  0                      0                     1
122                  0                      0                     1
123                  0                      0                     1
124                  0                      0                     1
125                  0                      0                     1
126                  0                      0                     1
127                  0                      0                     1
128                  0                      0                     1
129                  0                      0                     1
130                  0                      0                     1
131                  0                      0                     1
132                  0                      0                     1
133                  0                      0                     1
134                  0                      0                     1
135                  0                      0                     1
136                  0                      0                     1
137                  0                      0                     1
138                  0                      0                     1
139                  0                      0                     1
140                  0                      0                     1
141                  0                      0                     1
142                  0                      0                     1
143                  0                      0                     1
144                  0                      0                     1
145                  0                      0                     1
146                  0                      0                     1
147                  0                      0                     1
148                  0                      0                     1
149                  0                      0                     1
150                  0                      0                     1
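The iris$ prefix in the column names comes from referring to the column through the data frame inside the formula. A possible variant, passing the data frame through the data argument instead, gives shorter names:

```r
# Same dummies, but with the data frame supplied via `data`
m <- model.matrix(~ Species - 1, data = iris)

colnames(m)
# "Speciessetosa" "Speciesversicolor" "Speciesvirginica"

# Each species accounts for 50 of the 150 rows
colSums(m)
```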
Landis answered 17/4, 2020 at 11:8 Comment(0)
L
2

Convert your data to a data.table and use set by reference and row filtering

library(data.table)

dt <- as.data.table(your.dataframe.or.whatever)
dt[, is.1957 := 0]
dt[year == 1957, is.1957 := 1]

Proof-of-concept toy example:

library(data.table)

dt <- as.data.table(cbind(c(1, 1, 1), c(2, 2, 3)))
dt[, is.3 := 0]
dt[V2 == 3, is.3 := 1]
Lineage answered 15/2, 2018 at 3:48 Comment(0)
A
1

I use such a function (for data.table):

# For a data.table object and a factor variable var.name, this function creates dummy variables named "var.name: (level1)"
factorToDummy <- function(dtable, var.name){
  stopifnot(is.data.table(dtable))
  stopifnot(var.name %in% names(dtable))
  stopifnot(is.factor(dtable[, get(var.name)]))

  dtable[, paste0(var.name,": ",levels(get(var.name)))] -> new.names
  dtable[, (new.names) := transpose(lapply(get(var.name), FUN = function(x){x == levels(get(var.name))})) ]

  cat(paste("\nAdded dummy variables: ", paste0(new.names, collapse = ", ")))
}

Usage:

data <- data.table(data)
data[, x:= droplevels(x)]
factorToDummy(data, "x")
Anthozoan answered 18/8, 2015 at 9:50 Comment(0)
B
1

We can also use cSplit_e from splitstackshape. Using @zx8754's data

df1 <- data.frame(id = 1:4, year = 1991:1994)
splitstackshape::cSplit_e(df1, "year", fill = 0)

#  id year year_1 year_2 year_3 year_4
#1  1 1991      1      0      0      0
#2  2 1992      0      1      0      0
#3  3 1993      0      0      1      0
#4  4 1994      0      0      0      1

To make it work for data other than numeric we need to specify type as "character" explicitly

df1 <- data.frame(id = 1:4, let = LETTERS[1:4])
splitstackshape::cSplit_e(df1, "let", fill = 0, type = "character")

#  id let let_A let_B let_C let_D
#1  1   A     1     0     0     0
#2  2   B     0     1     0     0
#3  3   C     0     0     1     0
#4  4   D     0     0     0     1
Berley answered 2/10, 2019 at 2:5 Comment(0)
L
0

Hi, I wrote this general function to generate a dummy variable; it essentially replicates the replace function in Stata.

If x is the data frame and I want a dummy variable called a that takes the value 1 when x$b takes the value c:

introducedummy<-function(x,a,b,c){
   g<-c(a,b,c)
  n<-nrow(x)
  newcol<-g[1]
  p<-colnames(x)
  p2<-c(p,newcol)
  new1<-numeric(n)
  state<-x[,g[2]]
  interest<-g[3]
  for(i in 1:n){
    if(state[i]==interest){
      new1[i]=1
    }
    else{
      new1[i]=0
    }
  }
    x$added<-new1
    colnames(x)<-p2
    x
  }
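Since == is vectorized in R, the loop can be collapsed into one comparison. The following is a sketch of an equivalent helper, not the author's function; the name introduce_dummy and the toy data are made up for illustration. One behavioural difference to be aware of: where x[[b]] is NA, this version returns NA, whereas the loop above would stop with an error.

```r
# Hypothetical vectorized equivalent of the function above
introduce_dummy <- function(x, a, b, c) {
  x[[a]] <- as.integer(x[[b]] == c)  # 1 where column b equals c, else 0
  x
}

# Usage on toy data (assumed for illustration)
df <- data.frame(state = c("NY", "CA", "NY"), stringsAsFactors = FALSE)
introduce_dummy(df, "is_ny", "state", "NY")
```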
Lem answered 6/2, 2015 at 17:18 Comment(0)
M
0

Another way to do it is to use ifelse:

ifelse(year < 1965 , 1, 0)
Medlock answered 9/5, 2018 at 21:9 Comment(0)
