XGBoost predictor in R predicts the same value for all rows [duplicate]

I looked into the post on the same issue in Python, but I want a solution in R. I'm working on the Titanic dataset from Kaggle, and my training set looks like this:

    'data.frame':   891 obs. of  13 variables:
     $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
     $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
     $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
     $ Age        : num  22 38 26 35 35 ...
     $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
     $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
     $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
     $ Child      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
     $ Embarked.C : num  0 1 0 0 0 0 0 0 0 1 ...
     $ Embarked.Q : num  0 0 0 0 0 1 0 0 0 0 ...
     $ Embarked.S : num  1 0 1 1 1 0 1 1 1 0 ...
     $ Sex.female : num  0 1 1 1 0 0 0 0 1 1 ...
     $ Sex.male   : num  1 0 0 0 1 1 1 1 0 0 ...

This is after I created dummy variables. My test set:

    'data.frame':   418 obs. of  12 variables:
     $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
     $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
     $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
     $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
     $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
     $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
     $ Child      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
     $ Embarked.C : num  0 0 0 0 0 0 0 0 1 0 ...
     $ Embarked.Q : num  1 0 1 0 0 0 1 0 0 0 ...
     $ Embarked.S : num  0 1 0 1 1 1 0 1 0 1 ...
     $ Sex.female : num  0 1 0 0 1 0 1 0 1 0 ...
     $ Sex.male   : num  1 0 1 1 0 1 0 1 0 1 ...

I ran xgboost using the following code:

    > param <- list("objective" = "multi:softprob",
    +               "max.depth" = 25)
    > xgb = xgboost(param, data = trmat, label = y, nround = 7)
    [0] train-rmse:0.350336
    [1] train-rmse:0.245470
    [2] train-rmse:0.171994
    [3] train-rmse:0.120511
    [4] train-rmse:0.084439
    [5] train-rmse:0.059164
    [6] train-rmse:0.041455

trmat is:

    trmat = data.matrix(train)

and temat is:

    temat = data.matrix(test)

and y is the Survived variable:

    y = train$Survived

But when I run the predict function:

    > x = predict(xgb, newdata = temat)
    > x[1:10]
     [1] 0.9584613 0.9584613 0.9584613 0.9584613 0.9584613 0.9584613 0.9584613
     [8] 0.9584613 0.9584613 0.9584613

All probabilities are predicted to be the same. In the Python question, someone suggested increasing max.depth, but that didn't help. What am I doing wrong?

Self asked 27/6, 2016 at 13:10

You must remove the Survived variable from your training matrix before fitting xgboost, since this is the variable you want to predict:

    trmat = data.matrix(train[, colnames(train) != "Survived"])

This should solve your problem.
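For context, here is a minimal end-to-end sketch of the corrected workflow (assuming the `train` and `test` data frames from the question; `binary:logistic` is assumed here as the natural objective for a 0/1 target rather than the question's `multi:softprob`, and naming the `params` argument explicitly avoids it being mismatched when passed positionally):

```r
library(xgboost)

# Keep the label out of the feature matrix: if Survived is among the
# features, the model simply memorizes it and predict() degenerates
# on new data, where that column does not exist.
y     <- train$Survived
trmat <- data.matrix(train[, colnames(train) != "Survived"])
temat <- data.matrix(test)  # now the same 12 columns as trmat

# Binary 0/1 target -> one survival probability per row
param <- list(objective = "binary:logistic", max_depth = 6)
xgb   <- xgboost(params = param, data = trmat, label = y, nrounds = 7)

pred <- predict(xgb, newdata = temat)
head(pred)  # probabilities should now vary across passengers
```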

Spoils answered 27/6, 2016 at 13:46

I may be late to answer, but I faced the same problem when I first used xgboost. Removing the "Survived" column from the training set should solve your problem. If the training set still contains the column we use as the label in xgboost, the algorithm ends up predicting the same probability for every row.
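As a quick sanity check (a sketch against the question's objects, assuming `trmat` and `temat` have already been built), you can assert before training that the label has not leaked into the feature matrix and that train and test columns line up:

```r
# Fails fast if the label column is still among the features
stopifnot(!"Survived" %in% colnames(trmat))

# Train and test feature matrices should expose identical columns
stopifnot(identical(sort(colnames(trmat)), sort(colnames(temat))))
```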

Rabelaisian answered 8/3, 2017 at 18:47
I appreciate the contribution, but is this any different from the answer provided above by jlesuffleur? – Remotion
The column should be removed from the train set. Jlesuffleur mentioned the test set, though the code she gave was for the train set. – Rabelaisian

© 2022 - 2024 — McMap. All rights reserved.