GLM function in R with log link not working

Asked 26/11, 2012 at 14:44 Answered 27/11, 2012 at 14:16

I am using the book "Generalized Linear Models and Extension" by Hardin and Hilbe (second edition, 2007) at the moment. The authors suggest that instead of OLS models, "the log link is generally used for response data that take only positive values on the continuous scale". Of course they also suggest residual plots to check whether a "normal" linear model using an identity link can still be used.

I am trying to replicate in R what they do in the book in STATA. Indeed, I have no problems in STATA with the log link. However, when calling the same model using R's glm-function, but specifying family=gaussian(link="log") I am asked to provide starting values. When I set them all equal to zero, I always get the message that the algorithm did not converge. Picking other values the message is sometimes the same, but more often I get:

Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  :
     NA/NaN/Inf in 'x'

As I said, in STATA I can run these models without setting starting values and without errors. I tried many different models, and different datasets, but the problem is always the same (unless I only include one single independent variable). Could anyone tell me why this is the case, or what I do wrong, or why the suggested models from the book might not be appropriate? I'd appreciate any help, thanks!

Edit: As an example which reproduces the error consider the dataset which can be downloaded here. With this dataset loaded, I run the following model:

mod <- glm(betaplasma ~ age + vituse, family=gaussian(link="log"), data=data2, start=c(0,0,0))

This produces the the warning message that the algorithm did not converge.

Edit2: I was asked to also provide the STATA output for that model. Here it is:

. glm betaplasma age vituse, link(log)

Iteration 0:   log likelihood = -2162.1385  
Iteration 1:   log likelihood = -2096.4765  
Iteration 2:   log likelihood = -2076.2465  
Iteration 3:   log likelihood = -2076.2244  
Iteration 4:   log likelihood = -2076.2244  

Generalized linear models                          No. of obs      =       315
Optimization     : ML                              Residual df     =       312
                                                   Scale parameter =  31384.51
Deviance         =  9791967.359                    (1/df) Deviance =  31384.51
Pearson          =  9791967.359                    (1/df) Pearson  =  31384.51

Variance function: V(u) = 1                        [Gaussian]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  13.20142
Log likelihood   = -2076.224437                    BIC             =   9790173

------------------------------------------------------------------------------
             |                 OIM
  betaplasma |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0056809   .0032737     1.74   0.083    -.0007354    .0120972
      vituse |   -.273027   .0650773    -4.20   0.000    -.4005762   -.1454779
       _cons |   5.467577   .2131874    25.65   0.000     5.049738    5.885417
------------------------------------------------------------------------------

Purport answered 26/11, 2012 at 14:44 Comment(5)

this is cross-posted from R-help . I was going to answer there ... crossposting is not explicitly forbidden, but at the very least you should mention that you are cross-posting. glm may well be more fragile than Stata's glm-fitting code, but I have used a log link successfully many times in the past. Can you post a reproducible example [ tinyurl.com/reproducible-000 ] ? – Mayweed 26/11, 2012 at 14:48

Thanks for the answer and sorry for the crossposting. As a relative noob I didn't know this was a problem. As an example consider the dataset which can be downloaded here. Then I run the following model: ´mod <- glm(betaplasma ~ age + vituse, family=gaussian(link="log"), data=data2, start=c(0,0,0))´, which gives me the warning message that the algorithm did not converge. – Purport 26/11, 2012 at 16:18

it would be good to add this comment to your question (by editing the original question) ... – Mayweed 27/11, 2012 at 14:25

Would be interesting to compare the actual Stata output to that produced from the quasipoisson model. Care to add that also to your question? – Mello 27/11, 2012 at 15:1

DWin, just added the output, it is indeed very similar. Thanks you, too, for your effort! – Purport 27/11, 2012 at 15:12

As I said in my comment, it's probably true that Stata has more robust (in the numerical, not the statistical sense) GLM fitting than R. That said, fitting this particular dataset doesn't seem too hard.

Read data:

data2 <- read.table("http://lib.stat.cmu.edu/datasets/Plasma_Retinol",
         skip=30,nrows=315)
dnames <- c("age","sex","smokstat","quetelet","vituse","calories","fat","fiber",
           "alcohol","cholesterol","betadiet","retdiet","betaplasma","retplasma")
names(data2) <- dnames

Plot the data:

par(mfrow=c(1,2),las=1,bty="l")
with(data2,plot(betaplasma~age))
with(data2,boxplot(betaplasma~vituse))

enter image description here

It's fairly easy to get these to fit by setting the starting value of the intercept parameter to something reasonable (i.e. something close to the mean of the data on the log scale: either of these works

mod <- glm(betaplasma ~ age + vituse, family=gaussian(link="log"), data=data2,
           start=c(10,0,0))
mod <- glm(betaplasma ~ age + vituse, family=gaussian(link="log"), data=data2,
           start=c(log(mean(data2$betaplasma)),0,0))

The latter case is probably a reasonable default strategy for starting log-link fits. The results (slightly abbreviated) match Stata's very closely:

summary(mod)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.467575   0.218360  25.039  < 2e-16 ***
## age          0.005681   0.003377   1.682   0.0935 .  
## vituse      -0.273027   0.065552  -4.165 4.03e-05 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
## 
## (Dispersion parameter for gaussian family taken to be 31385.26)
## 
##     Null deviance: 10515638  on 314  degrees of freedom
## Residual deviance:  9791967  on 312  degrees of freedom
## AIC: 4160.4
## 
## Number of Fisher Scoring iterations: 9

confint(mod)
##                     2.5 %      97.5 %
## (Intercept)  5.0364648709  5.87600710
## age         -0.0007913795  0.01211007
## vituse      -0.4075213916 -0.14995759

(R is using t rather than Z statistics for the p-values and (?) confidence intervals)

However, there are a few reasons I might not fit this model to these data. In particular, the assumption of constant variance (associated with the Gaussian model) is not very reasonable -- these data seem better suited for a lognormal model (or equivalently, for just log-transforming and analyzing with a standard Gaussian model).

Plotting on a log(1+x) scale (there is a zero entry in the data):

with(data2,plot(log(1+betaplasma)~age))
with(data2,boxplot(log(1+betaplasma)~vituse))

enter image description here

Plotting with ggplot (this fits separate lines for each value of vituse rather than fitting an additive model)

library(ggplot)
theme_set(theme_bw())
(g1 <- qplot(age,1+betaplasma,colour=factor(vituse),data=data2)+
    geom_smooth(method="lm")+
    scale_y_log10())

enter image description here

View without 'outlier':

g1 %+% subset(data2,betaplasma>0)

enter image description here

Two other points: (1) it's a bit odd that there's a response with a value of 0 in this data set -- not impossible, but odd; (2) it looks like vituse should be treated as a factor rather than as numeric ("1=Yes, fairly often, 2=Yes, not often, 3=No") -- possibly ordinal.

Mayweed answered 27/11, 2012 at 14:16 Comment(1)

This is absolutely awesome, thanks you very much for all your effort! I really like the method to generate starting values, I instead tried various random values, without success. And also thanks for pointing out that vituse should be treated as a factor. Also, sorry again for cross-posting! – Purport 27/11, 2012 at 14:38

I would like to suggest that it may be the errors that could be non-normal. If you agree (or rather if the data agrees) then consider this construction:

?family
?glm
?binomial
lfit <- glm( dep <- indep1 + indep2, data=dat, family=binomial(link="probit")

This should provide Binomial errors around an identity-inked model. The advantage of this is that your estimates are easier to interpret on the original scale of the variables. Apologies for earlier incorrect suggestion to use family = poisson with probit link. Remember you never presented any data or even description of the distributions. Obviously binomial errors is not an appropriate for the data set that @BenBolker offers.

If you have non-integer values with log-normally distributed errors you should consider the quasipoisson models. If you run this model on the data that Ben Bolker presented and compare the gaussian(link="log) modela they are pretty much indistinguishable and no starting values are needed.

> mod2 <- glm(betaplasma ~ age + vituse, family=quasipoisson, data=data2         )
> mod2

Call:  glm(formula = betaplasma ~ age + vituse, family = quasipoisson, 
    data = data2)

Coefficients:
(Intercept)          age       vituse  
   5.452014     0.006096    -0.276679  

Degrees of Freedom: 314 Total (i.e. Null);  312 Residual
Null Deviance:      37270 
Residual Deviance: 33420    AIC: NA 

> glm(betaplasma ~ age + vituse, family=gaussian(link="log"), data=data2,
+            start=c(10,0,0))

Call:  glm(formula = betaplasma ~ age + vituse, family = gaussian(link = "log"), 
    data = data2, start = c(10, 0, 0))

Coefficients:
(Intercept)          age       vituse  
   5.467575     0.005681    -0.273027  

Degrees of Freedom: 314 Total (i.e. Null);  312 Residual
Null Deviance:      10520000 
Residual Deviance: 9792000  AIC: 4160

You should probably use a slightly more complex model since vituse is obviously a three level factor:

> mod2 <- glm(betaplasma ~ age + factor(vituse), family=quasipoisson, data=data2         )
> mod2

Call:  glm(formula = betaplasma ~ age + factor(vituse), family = quasipoisson, 
    data = data2)

Coefficients:
    (Intercept)              age  factor(vituse)2  factor(vituse)3  
       5.151076         0.006359        -0.224107        -0.562727  

Degrees of Freedom: 314 Total (i.e. Null);  311 Residual
Null Deviance:      37270 
Residual Deviance: 33380    AIC: NA

Mello answered 26/11, 2012 at 16:36 Comment(2)

Thanks for the answer, but doesn't the Poisson-Model require count data as dependent variable? And as an addition, according to the help file in R the poisson family only accepts the links log, identity, and sqrt. – Purport 26/11, 2012 at 17:0

Ooops, should have written family=binomial(link="probit"). Will fix. – Mello 27/11, 2012 at 7:22

Recommended topics

Hot tags