I'm not sure I completely understand how factors work, so please correct me in an easy-to-understand way if I'm wrong.
I always assumed that when running regressions and the like, R converts categorical variables into integers behind the scenes, but I never gave that part much thought.
It would use the categorical values in the training set and, after building a model, look for the same categorical values in the test dataset. Whatever the underlying 'levels' were didn't matter to me.
However, I've been thinking about it more and need clarification, especially on how to fix this if I'm doing it wrong.
train = c("March","April","January","November","January")
train = as.factor(train)
str(train)
# Factor w/ 4 levels "April","January",..: 3 1 2 4 2
test = c("March","April")
test = as.factor(test)
str(test)
# Factor w/ 2 levels "April","March": 2 1
Question
As you can see above, R creates what I believe are called factor levels for each month. However, the levels do not necessarily match up between the two factors.
For example, "April" is 1 in both, but in train "January" is 2, while in test it is "March" that is 2.
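To make the mismatch concrete, here is a minimal sketch that pulls out the integer codes behind each factor. The names train_codes, test_codes, test_aligned and aligned_codes are just ones I made up for illustration. The last step rebuilds test on train's levels, which is one way I assume the codes could be made to line up.

```r
train <- factor(c("March", "April", "January", "November", "January"))
test  <- factor(c("March", "April"))

# Integer codes stored behind each factor (levels are sorted alphabetically)
train_codes <- as.integer(train)
test_codes  <- as.integer(test)

# The same label can carry a different code in each factor:
# "March" is the 3rd level of train but the 2nd level of test
levels(train)
levels(test)

# Rebuild test using train's levels so the codes agree with train's
test_aligned  <- factor(as.character(test), levels = levels(train))
aligned_codes <- as.integer(test_aligned)
```

After the rebuild, test_aligned has the same four levels as train, so "March" and "April" carry the same codes in both.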
If I were to use this in a model, I don't think I would get an error, since all the categorical values in the test set already appear in the training set... but would the appropriate coefficients/values be used?
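Here is a minimal sketch of the check I have in mind. The response y is made up purely for illustration, and train_df, test_df, fit and pred are names I picked. As far as I understand, lm() stores the training levels in fit$xlevels, and predict() matches the new data against them by label rather than by integer code. If that is right, the two predictions should come back as the March and April training values.

```r
# Training data with a made-up response (illustration only)
train_df <- data.frame(
  month = factor(c("March", "April", "January", "November", "January")),
  y     = c(10, 20, 30, 40, 32)
)

# Test factor built separately, so its integer codes differ from train's
test_df <- data.frame(month = factor(c("March", "April")))

fit <- lm(y ~ month, data = train_df)

# Levels recorded with the model at fit time
fit$xlevels$month

# Predict on the test set; compare against the March and April rows of train_df
pred <- unname(predict(fit, newdata = test_df))
pred
```

March and April each appear once in the training data, so their fitted values equal their y values, which makes it easy to see whether the right coefficients were picked up.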
Please help, I'm very confused.