R: Kaggle Titanic Dataset Random Forest NAs introduced by coercion

I'm currently practicing R on Kaggle with the Titanic data set, using the random forest algorithm.

Below is the code:

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age_Bucket + Embarked
                + Age_Bucket + Fare_Bucket + F_Name + Title + FamilySize + FamilyID, 
                data=train, importance=TRUE, ntree=5000)

I am getting the following error:

Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In data.matrix(x) : NAs introduced by coercion
4: In data.matrix(x) : NAs introduced by coercion

My data looks like this:

$ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
$ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
$ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1...
$ Age_Bucket : chr  "20-25" "30-40" "25-30" "30-40" ...
$ Fare_Bucket: chr  "<10" "30+" "<10" "30+" ...
$ Title      : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ F_Name     : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
$ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ FamilyID   : chr  "Small" "Small" "Alone" "Small" ...

If I run just the line below, I get no coercion warnings. As far as I can see, this is the only place where coercion could introduce NA values:

as.factor(Survived)

Can anyone see the problem?

Thank you for your time

Corcovado answered 10/5, 2015 at 13:30

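You need to convert your character columns into factors. randomForest passes the predictors through data.matrix(), which turns factors into their integer codes but, as the warnings in your output show, coerces character columns to NA. See the following small demonstration: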

Data:

library(randomForest)

df <- data.frame(y = sample(0:1, 26, replace = TRUE), x1 = runif(26), x2 = letters, stringsAsFactors = FALSE)

df$y <- as.factor(df$y)

> str(df)
'data.frame':   26 obs. of  3 variables:
 $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
 $ x1: num  0.457 0.296 0.517 0.478 0.764 ...
 $ x2: chr  "a" "b" "c" "d" ...

Now if I run my randomForest function:

> randomForest(y ~ x1 + x2, data=df)
Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In data.matrix(x) : NAs introduced by coercion

I get the same error you did.

Whereas if I convert the char column into factor:

df$x2 <- as.factor(df$x2)

> randomForest(y ~ x1 + x2, data=df)

Call:
 randomForest(formula = y ~ x1 + x2, data = df) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 61.54%
Confusion matrix:
  0  1 class.error
0 0 16           1
1 0 10           0

It works great!
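
To apply the same fix to a data frame with several character columns, such as train in the question, the conversion can be done in one pass. This is a minimal sketch in base R only; the toy data frame is made up for illustration and just mimics a few of the question's columns.

```r
# Toy stand-in for the question's `train` (illustrative values, not real Kaggle data)
train <- data.frame(
  Survived   = c(0L, 1L, 1L, 0L),
  Age_Bucket = c("20-25", "30-40", "25-30", "30-40"),
  FamilyID   = c("Small", "Small", "Alone", "Small"),
  FamilySize = c(2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Find every character column and convert each one to a factor
char_cols <- names(train)[sapply(train, is.character)]
train[char_cols] <- lapply(train[char_cols], factor)

# Numeric and integer columns are left untouched
str(train)
```

After this step, str(train) should report Age_Bucket and FamilyID as factors, and the data frame can be passed to randomForest in the same way as df above.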

Cadel answered 10/5, 2015 at 13:47 Comment(8)
Hi, sorry, I should have been clearer. I ran the line "as.factor(Survived)" on its own and it converted everything into a factor without problems, as that is what I originally thought the problem was. When I run it inside the random forest code, it gives me the coercion error. - Corcovado
Can you please dput the data? - Cadel
I found the reason why it breaks! You have + FamilyID in your code, but this column is not in your dataset. - Cadel
Hi LyzandeR, I just ran that piece of code to determine whether that was where it was failing, as it is the only place I can see where NAs could be introduced by coercion. The error only occurs when I run the first segment of code in the OP. It doesn't happen if I run "as.factor(Survived)" on its own. - Corcovado
I see. I guess you need to dput the data, otherwise no one will be able to troubleshoot. I don't have a Kaggle account to get it myself. - Cadel
Oh, you have character columns in there, and the matrix creation inside the randomForest function is failing. Can you please convert those to factors and try again? Age_Bucket, for example, is character, and when the matrix is created everything in it is coerced to NA. - Cadel
That looks to be it :) Thank you. Sadly, it looks like I have too many factor levels to run it ("Can not handle categorical predictors with more than 53 categories."), so I will try inference trees instead. - Corcovado
You are welcome :). Glad I could be of help. I updated the answer as well. This is an annoying limitation, I know... - Cadel
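
The 53-category limit quoted in the last comments can be checked before fitting. The following is a rough sketch in base R; count_levels is a hypothetical helper name, and the toy data frame is made up to mimic a high-cardinality column such as F_Name.

```r
# Hypothetical helper: count distinct values in each character or factor column
count_levels <- function(df) {
  is_cat <- sapply(df, function(col) is.character(col) || is.factor(col))
  sapply(df[is_cat], function(col) length(unique(col)))
}

# Toy data: 60 made-up surnames stand in for a column like F_Name
toy <- data.frame(
  F_Name = paste0("Name", 1:60),
  Sex    = rep(c("female", "male"), 30),
  Fare   = runif(60),
  stringsAsFactors = FALSE
)

n_levels <- count_levels(toy)

# Columns that exceed the limit quoted in the error message
too_many <- names(n_levels)[n_levels > 53]
print(too_many)
```

Any column reported here would have to be dropped from the formula or have its rare values grouped into fewer categories before randomForest accepts it.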
