XGBoost - Poisson distribution with varying exposure / offset

Asked 26/2, 2016 at 19:56 Answered 13/9, 2016 at 14:54

I am trying to use XGBoost to model claims frequency of data generated from unequal length exposure periods, but have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset - are you able to do this in XGBoost?

(A similar question was posted here: xgboost, offset exposure?)

To illustrate the issue, the R code below generates some data with the fields:

x1, x2 - factors (either 0 or 1)
exposure - length of policy period on observed data
frequency - mean number of claims per unit exposure
claims - number of observed claims ~Poisson(frequency*exposure)

The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.

Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.
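Written out, the relationship described above is the usual Poisson offset model. The symbol f(x1, x2) for the log-frequency function is notation introduced here for illustration only; it does not appear in the code below.

```latex
\begin{aligned}
\text{claims} &\sim \mathrm{Poisson}(\text{frequency} \times \text{exposure}) \\
\log \mathbb{E}[\text{claims}] &= \log(\text{exposure}) + f(x_1, x_2),
\qquad \text{frequency} = e^{f(x_1, x_2)}
\end{aligned}
```

The term log(exposure) enters with a fixed coefficient of 1, which is what "offset" means here.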

The code tries to predict this using XGBoost by:

1. Setting exposure as a weight in the model matrix
2. Setting log(exposure) as an offset

Below these, I've shown how I would handle the situation for a tree (rpart) or gbm.

set.seed(1)
size<-10000
d <- data.frame(
  x1 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  x2 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  exposure = runif(size, 1, 10)*0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "poisson-nloglik"  # logloss is a binary-classification metric
  , "eta" = 1
  , "subsample" = 1
  , "colsample_bytree" = 1
  , "min_child_weight" = 1
  , "max_depth" = 2
)

## 1 - set weight in xgb.Matrix

xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_1 <- predict(xgb, xgtrain)

## 2 - set as offset in xgb.Matrix
xgtrain.mf  <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m  <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain  <- xgb.DMatrix(xgtrain.m,label = d$claims)

xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_2 <- predict(xgb, xgtrain)

#### Fit a tree
require(rpart)
# For method = "poisson", rpart takes a two-column response: (exposure time, event count)
tree <- rpart(cbind(exposure, claims) ~ x1 + x2,
              data = d,
              method = "poisson")

d$Tree_F <- predict(tree, newdata = d)

#### Fit a GBM
require(gbm)

gbm <- gbm(claims~x1+x2+offset(log(exposure)), 
           data = d,
           distribution = "poisson",
           n.trees = 1,
           shrinkage=1,
           interaction.depth=2,
           bag.fraction = 0.5)

d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")

Orel answered 26/2, 2016 at 19:56 Comment(0)

At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weights=exposure. That is, normalize your count by exposure to get frequency, and model frequency with exposure as the weight. Your estimated coefficients should be the same in both cases when using glm for Poisson regression. Try it for yourself using a sample data set.
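A minimal sketch of that check, using simulated data in the spirit of the question. The data-generating values and the object names (fit_offset, fit_weight) are assumptions made for illustration.

```r
# Simulated policies: two binary factors and an unequal exposure period.
set.seed(1)
n <- 10000
d <- data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.5),
  exposure = runif(n, 0.3, 3)
)
# True frequency is 2 when x1 = x2 = 1 and 1 otherwise.
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

# (a) Claim counts with log(exposure) as an offset.
fit_offset <- glm(claims ~ x1 * x2 + offset(log(exposure)),
                  family = poisson(link = "log"), data = d)

# (b) Frequency as the response, exposure as the prior weight.
# glm() warns about non-integer responses here, so the warning is suppressed.
fit_weight <- suppressWarnings(
  glm(I(claims / exposure) ~ x1 * x2,
      family = poisson(link = "log"), weights = exposure, data = d)
)

# Side-by-side coefficients and their largest absolute difference.
print(cbind(offset = coef(fit_offset), weighted = coef(fit_weight)))
print(max(abs(coef(fit_offset) - coef(fit_weight))))
```

Both fits solve the same score equations, so any difference in the coefficients should be at the level of the solver's convergence tolerance.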

I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect setting your target variable as frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures are varying.
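A sketch of what this suggestion might look like in xgboost, assuming count:poisson accepts a non-negative, non-integer label and that the row weight scales each row's contribution to the loss. The hyperparameters (eta, max_depth, nrounds) are illustrative guesses, not tuned values, and whether the result matches an offset fit may depend on them.

```r
library(xgboost)

set.seed(1)
n <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), n, replace = TRUE),
  x2 = sample(c(0, 1), n, replace = TRUE),
  exposure = runif(n, 1, 10) * 0.3
)
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

# Label is the observed frequency; exposure is the row weight.
X <- as.matrix(d[, c("x1", "x2")])
dtrain_w <- xgb.DMatrix(X, label = d$claims / d$exposure, weight = d$exposure)

params_w <- list(
  objective = "count:poisson",
  eval_metric = "poisson-nloglik",
  eta = 0.3,       # assumed value, for illustration only
  max_depth = 2
)
fit_w <- xgb.train(params = params_w, data = dtrain_w, nrounds = 50)

# Predictions are already on the frequency (per unit exposure) scale.
d$freq_w <- predict(fit_w, dtrain_w)
print(aggregate(freq_w ~ x1 + x2, data = d, FUN = mean))
```

With only one boosting round, as in the question's code, the fit may still be far from converged, so a comparison is more informative after several rounds.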

Elute answered 2/9, 2016 at 4:16 Comment(1)

Thanks Vinh. This is one of the options I had tried but didn't seem to work as expected in simple cases. I believe I have now found the solution and have posted it here. – Orel 13/9, 2016 at 15:2

I have now worked out how to do this: use setinfo to set the base_margin attribute to the offset (on the linear predictor scale), i.e.:

setinfo(xgtrain, "base_margin", log(d$exposure))
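For context, a fuller sketch of how this line might fit into a complete run. With objective count:poisson the prediction is exp(margin), so starting each row's margin at log(exposure) gives a fitted mean of exposure * exp(sum of trees), which is the offset model. The hyperparameters and the names dtrain, fit, pred_claims and pred_freq are assumptions for illustration.

```r
library(xgboost)

set.seed(1)
n <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), n, replace = TRUE),
  x2 = sample(c(0, 1), n, replace = TRUE),
  exposure = runif(n, 1, 10) * 0.3
)
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

X <- as.matrix(d[, c("x1", "x2")])
dtrain <- xgb.DMatrix(X, label = d$claims)
# Offset on the log scale: each row starts boosting from log(exposure).
setinfo(dtrain, "base_margin", log(d$exposure))

params <- list(
  objective = "count:poisson",
  eval_metric = "poisson-nloglik",
  eta = 0.3,       # assumed value, for illustration only
  max_depth = 2
)
fit <- xgb.train(params = params, data = dtrain, nrounds = 50)

# predict() on dtrain uses its base_margin, so it returns expected claims
# for the observed exposure ...
d$pred_claims <- predict(fit, dtrain)
# ... and dividing by exposure recovers the frequency per unit exposure.
d$pred_freq <- d$pred_claims / d$exposure

print(aggregate(pred_freq ~ x1 + x2, data = d, FUN = mean))
```

When scoring new policies, the DMatrix used for prediction needs its own base_margin (for example log of the intended exposure, or 0 for unit exposure); otherwise the model falls back to its global base score.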

Orel answered 13/9, 2016 at 14:54 Comment(2)

So does this suffice for XGBoost? I mean, if you specify the base_margin, do you still have to specify weights in xgb.DMatrix? Thanks! – Skulduggery 9/11, 2018 at 17:31

base_margin: the base margin is the base prediction XGBoost will boost from. How do you propose that base_margin will work as an offset in the traditional sense of a Poisson problem? – Mantilla 19/7, 2020 at 19:11
