XGBoost - Poisson distribution with varying exposure / offset

I am trying to use XGBoost to model claims frequency from data with exposure periods of unequal length, but have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset. Is this possible in XGBoost?

(A similar question was posted here: xgboost, offset exposure?)

To illustrate the issue, the R code below generates some data with the fields:

  • x1, x2 - factors (either 0 or 1)
  • exposure - length of policy period on observed data
  • frequency - mean number of claims per unit exposure
  • claims - number of observed claims ~Poisson(frequency*exposure)

The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.

Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.
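
In equation form, the offset relationship described above can be sketched as follows. The symbol f is introduced here purely for illustration and stands for the model's score on the log scale; it is not notation from the original post.

```latex
\begin{aligned}
\text{claims} \mid x_1, x_2 &\sim \mathrm{Poisson}\bigl(\text{frequency}(x_1, x_2) \cdot \text{exposure}\bigr) \\
\log \mathbb{E}[\text{claims} \mid x_1, x_2] &= \underbrace{\log(\text{exposure})}_{\text{offset}} + f(x_1, x_2),
\qquad \text{frequency}(x_1, x_2) = e^{f(x_1, x_2)}
\end{aligned}
```

The offset enters the linear predictor with a fixed coefficient of 1, so the model only has to learn f from x1 and x2.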

The code tries to predict this using XGBoost by:

  1. Setting exposure as a weight in the model matrix
  2. Setting log(exposure) as an offset

Below these, I've shown how I would handle the situation for a tree (rpart) or gbm.

set.seed(1)
size<-10000
d <- data.frame(
  x1 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  x2 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  exposure = runif(size, 1, 10)*0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "poisson-nloglik"
  , "eta" = 1
  , "subsample" = 1
  , "colsample_bytree" = 1
  , "min_child_weight" = 1
  , "max_depth" = 2
)

## 1 - set weight in xgb.Matrix

xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_1 <- predict(xgb, xgtrain)

## 2 - set as offset in xgb.Matrix
## Note: model.matrix() does not carry offset terms into the design matrix,
## so the resulting matrix only contains (Intercept), x1 and x2
xgtrain.mf  <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m  <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain  <- xgb.DMatrix(xgtrain.m,label = d$claims)

xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_2 <- predict(xgb, xgtrain)

#### Fit a tree
require(rpart)
# For method = "poisson", rpart takes a two-column response: (exposure, number of events)
tree <- rpart(cbind(exposure, claims) ~ x1 + x2,
              data = d,
              method = "poisson")

d$Tree_F <- predict(tree, newdata = d)

#### Fit a GBM
require(gbm)
gbm <- gbm(claims~x1+x2+offset(log(exposure)), 
           data = d,
           distribution = "poisson",
           n.trees = 1,
           shrinkage=1,
           interaction.depth=2,
           bag.fraction = 0.5)

d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")
Orel answered 26/2, 2016 at 19:56 Comment(0)

At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weights=exposure. That is, normalize your count by exposure to get frequency, and model frequency with exposure as the weight. The estimated coefficients should be the same in both cases when using glm for Poisson regression (R will warn about non-integer counts in the second form, but the coefficient estimates are unaffected). Try it for yourself using a sample data set.
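
A minimal sketch of that check, using simulated data along the lines of the question. The data frame d and the names fit_offset and fit_weight are made up for this illustration.

```r
set.seed(1)
n <- 5000
d <- data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.5),
  exposure = runif(n, 0.3, 3)
)
# True frequency: 2 if x1 = x2 = 1, otherwise 1
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

# Form 1: counts as response, log(exposure) as offset
fit_offset <- glm(claims ~ x1 + x2 + offset(log(exposure)),
                  family = poisson(link = "log"), data = d)

# Form 2: frequency as response, exposure as weight
# (suppressWarnings hides the "non-integer x" warnings from the Poisson family)
fit_weight <- suppressWarnings(
  glm(I(claims / exposure) ~ x1 + x2,
      family = poisson(link = "log"), weights = exposure, data = d)
)

# Compare the two sets of coefficients
print(cbind(offset = coef(fit_offset), weight = coef(fit_weight)))
print(all.equal(coef(fit_offset), coef(fit_weight), tolerance = 1e-5))
```

Both forms lead to the same score equations, which is why the coefficients agree up to the convergence tolerance of the fitting algorithm.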

I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect setting your target variable as frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures are varying.
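
A rough sketch of what this suggestion could look like in xgboost, on data simulated as in the question. The names dtrain_w, fit_w and freq_w are illustrative, and the choice of eta = 0.3 with 100 rounds is an assumption made so that boosting has room to converge. It is not the single-round setup from the question.

```r
library(xgboost)

set.seed(1)
n <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), n, replace = TRUE),
  x2 = sample(c(0, 1), n, replace = TRUE),
  exposure = runif(n, 1, 10) * 0.3
)
# True frequency: 2 if x1 = x2 = 1, otherwise 1
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

X <- as.matrix(d[, c("x1", "x2")])

# Target is the observed frequency; exposure is attached as the row weight
dtrain_w <- xgb.DMatrix(X, label = d$claims / d$exposure)
setinfo(dtrain_w, "weight", d$exposure)

params <- list(objective = "count:poisson", eta = 0.3, max_depth = 2)
fit_w <- xgb.train(params = params, data = dtrain_w, nrounds = 100)

# Predictions are on the frequency scale (claims per unit exposure)
d$freq_w <- predict(fit_w, xgb.DMatrix(X))
print(aggregate(freq_w ~ x1 + x2, data = d, FUN = mean))
```

With a log link, weighting the frequency by exposure gives the same gradient as fitting counts with a log(exposure) offset, so after enough rounds the predictions for the four cells should sit near the true frequencies.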

Elute answered 2/9, 2016 at 4:16 Comment(1)
Thanks Vinh. This is one of the options I had tried but didn't seem to work as expected in simple cases. I believe I have now found the solution and have posted it here. – Orel

I have now worked out how to do this using setinfo to change the base_margin attribute to be the offset (as a linear predictor), i.e.:

setinfo(xgtrain, "base_margin", log(d$exposure))
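
For context, here is a fuller sketch of how this might fit together end to end, on data simulated as in the question. The names dtrain_o, dunit, pred_claims and pred_freq are illustrative, and using eta = 0.3 with 100 rounds is an assumption made so the fit can converge. Because base_margin is on the scale of the linear predictor, log(exposure) acts like an offset and no weights are set in this sketch.

```r
library(xgboost)

set.seed(1)
n <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), n, replace = TRUE),
  x2 = sample(c(0, 1), n, replace = TRUE),
  exposure = runif(n, 1, 10) * 0.3
)
# True frequency: 2 if x1 = x2 = 1, otherwise 1
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

X <- as.matrix(d[, c("x1", "x2")])

# Counts as label; log(exposure) as the starting margin for every row
dtrain_o <- xgb.DMatrix(X, label = d$claims)
setinfo(dtrain_o, "base_margin", log(d$exposure))

params <- list(objective = "count:poisson", eta = 0.3, max_depth = 2)
fit_o <- xgb.train(params = params, data = dtrain_o, nrounds = 100)

# Predicting on a matrix that carries the offset gives expected claims
pred_claims <- predict(fit_o, dtrain_o)

# Predicting with a zero margin (unit exposure) gives the frequency
dunit <- xgb.DMatrix(X)
setinfo(dunit, "base_margin", rep(0, nrow(X)))
pred_freq <- predict(fit_o, dunit)

print(aggregate(pred_freq ~ x1 + x2, data = cbind(d, pred_freq), FUN = mean))
```

Setting the margin explicitly at prediction time avoids relying on the default base_score, which would otherwise be used as the starting point when no base_margin is attached.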
Orel answered 13/9, 2016 at 14:54 Comment(2)
So does this suffice for xgboost? I mean, if you specify the base_margin, do you still have to specify weights in xgb.DMatrix? Thanks! – Skulduggery
base_margin: the base margin is the base prediction XGBoost will boost from. How do you propose base_margin will work as an offset in the traditional sense of a Poisson problem? – Mantilla