Consistent factor levels for same value over different datasets
R


I'm not sure if I completely understand how factors work. So please correct me in an easy to understand way if I'm wrong.

I always assumed that when doing regressions and whatnot, R converts categorical variables into integers behind the scenes, but I never thought much about that part.

I assumed it would use the categorical values in the training set and, after building a model, look for the same categorical values in the test dataset. Whatever the underlying 'levels' were didn't matter to me.

However, I've been thinking about this more and need clarification, especially on how to fix it if I'm doing this wrong.

     train <- c("March", "April", "January", "November", "January")
     train <- as.factor(train)
     str(train)
     # Factor w/ 4 levels "April","January",..: 3 1 2 4 2

     test <- c("March", "April")
     test <- as.factor(test)
     str(test)
     # Factor w/ 2 levels "April","March": 2 1

Question

As you can see above, R creates what I believe are called factor levels for each month. However, the levels do not necessarily match up between the two vectors.

For example, "April" is 1 in both, but "January" is 2 in train while "March" is 2 in test (and 3 in train).

If I were to incorporate this into a model, I don't think I would get an error, since all the categorical values in the test set are already in the training set... but would the appropriate coefficients/values be used?

Please help, I'm very confused.

Rox answered 24/2, 2016 at 7:8 Comment(0)

When you use as.factor to convert / coerce a vector into a factor, R takes all unique values of your vector and associates an integer id with each of them. By default the levels are the sorted unique values, so the first value in sort order gets 1, the next gets 2, etc.
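To make that mapping concrete, here is a small sketch using the months from the question. `levels()` lists the labels in their default sorted order and `as.integer()` exposes the underlying codes. The variable names are just for illustration.

```r
train <- factor(c("March", "April", "January", "November", "January"))
test  <- factor(c("March", "April"))

levels(train)      # "April" "January" "March" "November"
as.integer(train)  # 3 1 2 4 2

levels(test)       # "April" "March"
as.integer(test)   # 2 1

# Same label, different integer code in the two factors
march_in_train <- as.integer(train)[train == "March"]  # 3
march_in_test  <- as.integer(test)[test == "March"]    # 2
```

So the codes only have meaning relative to the levels of the factor they belong to.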

If you have different vectors which live in a common "universe" of values and you want to convert them into consistent factors (i.e. a value appearing in different vectors / data frames maps to the same integer id), do this:

x <- letters[1:5]
y <- letters[3:8]
allvalues <- unique(union(x,y))  # superfluous but I think it adds clarity
x <- factor(x, levels = allvalues)
y <- factor(y, levels = allvalues)
str(x)   # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
str(y)   # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8

Edit

A small experiment to show that R is smart enough to recognize factor values in different vectors, even if they had been assigned inconsistent numerical ids:

idx <- sample(1:2, size = 20, replace = TRUE)    # 1 picks "a", 2 picks "b"
x <- factor(letters[idx], levels = c("b", "a"))  # so a~2 and b~1
y <- idx + rnorm(20, mean = 0, sd = 0.2)         # y ~ 1 when x is "a", y ~ 2 when x is "b"
Set <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = Set)

To get descriptions of everything: str(x), str(y), summary(fit).

So fit is trained to associate x = a (which as a factor has a numerical tag of 2) with the value y ~= 1 and x = b (numerical tag of 1) with the value y ~= 2.

Now let's make a "confusing" test set:

x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
str(x2)   # Factor w/ 4 levels "c","d","a","b": 3 4

Let's use predict to see what R makes of it:

predict(fit, newdata = data.frame(x = x2))
#        1        2 
# 1.060569 1.961109 

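Which is what we'd expect from R: predict matches the new values to the fitted levels by label ("a", "b"), not by integer code, so "a" still predicts about 1 and "b" about 2 even though their codes in x2 are 3 and 4.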
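One way to see why the codes do not matter is to look at what the fit stores. The sketch below uses deterministic toy data (my own, not the random data above) so the result is easy to check: the design matrix column is named after the level label, and the fit keeps the training levels in `fit$xlevels`. As far as I know, `predict` uses those stored levels to re-level the new factor by label.

```r
# Deterministic toy data: y is exactly 1 for "a" and 2 for "b"
train <- data.frame(
  x = factor(c("a", "b", "a", "b"), levels = c("b", "a")),  # b = 1, a = 2
  y = c(1, 2, 1, 2)
)
fit <- lm(y ~ x, data = train)

colnames(model.matrix(fit))  # "(Intercept)" "xa": indicator named by label
fit$xlevels                  # levels stored with the fit: x = "b" "a"

# New data with a different level order, hence different integer codes
new_x <- factor(c("a", "b"), levels = c("c", "d", "a", "b"))
as.integer(new_x)            # 3 4
pred <- predict(fit, newdata = data.frame(x = new_x))
pred                         # about 1 for "a" and 2 for "b"
```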

Synergetic answered 24/2, 2016 at 7:15 Comment(5)
Thank you for the fast response. That is a very helpful method. HOWEVER, I am assuming that even when each set draws from the 'common universe' of values, the code might run without error while inappropriate factor coefficients are used in the regression? This would be the case when the test set contains just a subset of the factor values in the training set, potentially with a different alphanumeric sort order (my example above). - Rox
What do you call "inappropriate factor coefficients"? If your factor variable is used as an input variable in a regression, the numerical ids associated with each factor value are not used anyway. R just creates dummy variables: e.g. if X = c("a", "b", "c", "d"), R picks a base value, for instance "a", and creates (without telling you) the variables X-is-b = (0,1,0,0), X-is-c = (0,0,1,0) and X-is-d = (0,0,0,1) and uses them as input for the regression. - Synergetic
What I mean in my example is that in the 'test' part March==2 and in the 'train' part March==3. - Rox
Well, if you're concerned about that, just force the factors in your train and test sets to use a common levels parameter. If you have to train your model before getting the test set, just force the test set variable to use the train set variable's levels as a pre-processing step (values which are in test but not in train will be transformed into NAs). - Synergetic
I personally would use table(x) and table(y) over the str(x) etc. recommendation above, since str truncates the list of levels, whereas table shows every level by name in the same order. - Logogriph
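
The pre-processing step suggested in the comments can be sketched roughly as follows, using the months from the question. The extra value "July" and the name `test_raw` are my own additions, included only to show what happens to a value the training set never saw.

```r
train <- factor(c("March", "April", "January", "November", "January"))
test_raw <- c("March", "April", "July")   # "July" never appears in train

# Reuse the training levels so every label keeps the same code
test <- factor(test_raw, levels = levels(train))

as.integer(test)   # 3 1 NA: March and April match train, July becomes NA
anyNA(test)        # TRUE, a cue to inspect unseen values before predicting

# table() lists every level with its count, including empty ones
table(train)
table(test)
```

Checking `anyNA(test)` before calling predict is a cheap way to catch values that silently turned into NA.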
