Consistent factor levels for same value over different datasets

train = c("March","April","January","November","January")
train = as.factor(train)
str(train)   # Factor w/ 4 levels "April","January",..: 3 1 2 4 2
test = c("March","April")
test = as.factor(test)
str(test)    # Factor w/ 2 levels "April","March": 2 1

question

As shown above, as.factor creates factor levels (I believe that is what they are called) for each month. However, the levels do not necessarily match up between the two vectors.

For example, "April" is 1 in both, but "January" is 2 in train while "March" is 2 in test.

If I were to incorporate this into a model, I don't think I would get an error, since all the categorical values in the test set are already in the training set... but would the appropriate coefficients/values be used?

Please help, I'm very confused.

When you use as.factor to convert (coerce) a vector into a factor, R takes all unique values of the vector and associates an integer id with each of them. By default the levels are the sorted unique values (alphabetical for character vectors), and that order decides which value gets 1, 2, and so on.
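A minimal sketch of that default behaviour, reusing the test vector from the question. It only illustrates how the codes relate to the sorted levels:

```r
test <- as.factor(c("March", "April"))

levels(test)        # sorted unique values: "April" "March"
as.integer(test)    # codes are positions in levels(): 2 1

# The label, not the integer code, is what identifies a value
as.character(test)  # "March" "April"
```

The integer code is just an index into levels(test), so the same label can end up with different codes in different vectors.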

If you have different vectors which live in a common "universe" of values and you want to convert them into consistent factors (i.e. a value appearing in different vectors / dfs is associated to the same numerical id), do this:

x <- letters[1:5]
y <- letters[3:8]
allvalues <- unique(union(x,y))  # superfluous but I think it adds clarity
x <- factor(x, levels = allvalues)
y <- factor(y, levels = allvalues)
str(x)   # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
str(y)   # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8
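Applied to the train and test vectors from the question, the same idea might look like the sketch below. Taking the shared levels from the training data is an assumption made for this example, and month_levels is a name introduced only for illustration:

```r
train <- c("March", "April", "January", "November", "January")
test  <- c("March", "April")

# Shared level set, here taken from the training data (an assumption of this sketch)
month_levels <- sort(unique(train))

train <- factor(train, levels = month_levels)
test  <- factor(test,  levels = month_levels)

as.integer(train)  # 3 1 2 4 2
as.integer(test)   # 3 1  ("March" and "April" now carry the same codes as in train)
```

A value in test that is missing from month_levels would become NA, so the shared level set has to cover every value you expect to see.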

Edit

A small experiment to show that R is smart enough to recognize factor values in different vectors, even if they had been assigned inconsistent numerical ids:

y <- sample(1:2, size = 20, replace = TRUE)      # group index, 1 or 2
x <- factor(letters[y], levels = c("b","a"))    # so a~2 and b~1
y <- y + rnorm(n = 20, mean = 0, sd = 0.2)      # response: group index plus small noise
Set <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = Set)

To get descriptions of everything: str(x), str(y), summary(fit).

So fit is trained to associate x = a (which as a factor has a numerical tag of 2) with the value y ~= 1, and x = b (numerical tag 1) with the value y ~= 2.

Now let's make a "confusing" test set:

x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
str(x2)   # Factor w/ 4 levels "c","d","a","b": 3 4

Let's use predict to see what R makes of it:

predict(fit, newdata = data.frame(x = x2))
#        1        2 
# 1.060569 1.961109

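Which is what we'd expect from R: "a" is predicted near 1 and "b" near 2, so the values were matched by label, not by their numerical ids.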
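To check this more directly, one could build the same test values with two different level orderings and compare the predictions. The sketch below is self-contained and differs slightly from the experiment above: it uses a fixed group pattern and set.seed so it is repeatable, and the names g, newdata_a and newdata_b are hypothetical.

```r
set.seed(1)                      # assumption: fixed seed so the sketch is repeatable
g   <- rep(1:2, times = 10)      # group index, 1 or 2
Set <- data.frame(x = factor(letters[g], levels = c("b", "a")),
                  y = g + rnorm(n = 20, mean = 0, sd = 0.2))
fit <- lm(y ~ x, data = Set)

# Same labels, different underlying integer codes
newdata_a <- data.frame(x = factor(c("a", "b"), levels = c("a", "b")))
newdata_b <- data.frame(x = factor(c("a", "b"), levels = c("c", "d", "a", "b")))

as.integer(newdata_a$x)  # 1 2
as.integer(newdata_b$x)  # 3 4

# Compare the predictions obtained from the two encodings
same <- all.equal(predict(fit, newdata_a), predict(fit, newdata_b))
same
```

If same is TRUE, the two encodings gave the same predictions, which supports the point that the labels rather than the integer codes drive the result.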
