What do these R glm error messages mean: "Error: no valid set of coefficients has been found: please supply starting values"

Here are two related questions, but they are not duplicates of mine: the first has a solution specific to its data set, and the second involves a failure of glm when start is supplied alongside an offset.

https://mcmap.net/q/1168181/-error-please-supply-starting-values
https://mcmap.net/q/855033/-glm-starting-values-not-accepted-log-link

I have the following dataset:

library(data.table)
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))
df$response = lapply(df$probs, function(i){
  rbinom(50, 1, i)  
})



dt <- data.table(df)

dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]

such that dt is:

> dt
     names     probs response 
  1:     1 0.0000000        0 
  2:     1 0.0000000        0 
  3:     1 0.0000000        0 
  4:     1 0.0000000        0 
  5:     1 0.0000000        0 
 ---                                     
496:    10 0.9446753        0 
497:    10 0.9446753        1 
498:    10 0.9446753        1 
499:    10 0.9446753        1 
500:    10 0.9446753        1 

I am trying to fit a logistic regression model with the identity link, using lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity')).

This gives an error:

Error: no valid set of coefficients has been found: please supply starting values

I tried fixing it by supplying a start argument, but then I get another error.

> lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0, 1))
Error: cannot find valid starting values: please specify some

At this point these errors make no sense to me and I have no idea what to do.

EDIT: @iraserd has thrown some more light on this problem. Using start = c(0.5, 0.5), I get:

> lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0.5, 0.5))
There were 25 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: step size truncated: out of bounds
2: step size truncated: out of bounds
3: step size truncated: out of bounds
4: step size truncated: out of bounds
5: step size truncated: out of bounds
6: step size truncated: out of bounds
7: step size truncated: out of bounds
8: step size truncated: out of bounds
9: step size truncated: out of bounds
10: step size truncated: out of bounds
11: step size truncated: out of bounds
12: step size truncated: out of bounds
13: step size truncated: out of bounds
14: step size truncated: out of bounds
15: step size truncated: out of bounds
16: step size truncated: out of bounds
17: step size truncated: out of bounds
18: step size truncated: out of bounds
19: step size truncated: out of bounds
20: step size truncated: out of bounds
21: step size truncated: out of bounds
22: step size truncated: out of bounds
23: step size truncated: out of bounds
24: step size truncated: out of bounds
25: glm.fit: algorithm stopped at boundary value

and

> summary(lm2)

Call:
glm(formula = response ~ probs, family = binomial(link = "identity"), 
    data = dt, start = c(0.5, 0.5))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4023  -0.6710   0.3389   0.4641   1.7897  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.486e-08  1.752e-06   0.008    0.993    
probs       9.995e-01  2.068e-03 483.372   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 69312  on 49999  degrees of freedom
Residual deviance: 35984  on 49998  degrees of freedom
AIC: 35988

Number of Fisher Scoring iterations: 24

I highly suspect this has something to do with the fact that some of the responses are generated with true probability zero which causes problems as the coefficient of probs approaches 1.

Waiver answered 25/2, 2016 at 3:59 Comment(0)

There are two places in the glm.fit code where it terminates with the error "no valid set of coefficients has been found: please supply starting values". One is reached when a calculated deviance becomes infinite; the other seems to occur when invalid etastart and mustart options are provided.

See also the answer to "How do I use a custom link function in glm?", which elaborates in detail.

As you are trying to run a regression on probabilities (values between 0 and 1), I guess you need to specify starting values other than 0 and 1:

lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start=c(0.5,0.5))

This throws a lot of warnings and stops at a boundary value, probably because of the artificial nature of the example.

Changing the formula to use the logit link (as you want a logistic regression according to your question) gets rid of the warnings (and does not need starting parameters):

    lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='logit'))
Conceivable answered 25/2, 2016 at 13:58 Comment(5)
The help for glm is completely confusing: "start: starting values for the parameters in the linear predictor." I thought these were the initial guesses for the coefficients of the covariates. In fact, if you supply only one start value, you get this error: "length of 'start' should equal 2 and correspond to initial coefs for c("(Intercept)", "probs")". There is no reason why my initial starting values should fail, as the data is (well, as you note, contrived) generated using these values. – Waiver
@Waiver It would be easier if you provided info on your real data. This example looks like a panel data set, for which pglm would fit better? – Conceivable
My real data is probably a bit too noisy to show here. I am simulating a regression where I wish to estimate the probability associated with a category level, and I am having problems when the actual probabilities are very close to zero. – Waiver
@Waiver Fitting a logistic regression with the 'logit' link yields no errors. Why specifically do you want the 'identity' link? I have never used it, and ?family does not list 'identity' as a valid link function for binomial. Also, if you want to estimate category probabilities, why not use a dummy variable approach on names? – Conceivable
The final model that you are suggesting is not the model that the OP is simulating from. – Camboose

As iraserd argues, the error can come from either of two places in glm.fit. Both are in the main loop of the iteratively reweighted least squares (IRLS) algorithm.

The first check can fail if any of the deviances are not finite. In your case (and for all link functions with the binomial family), these come from binomial("identity")$dev.resids, which calls a C function. In some cases this evaluates the log at a negative value, namely when the mean mu is outside (0,1), i.e. outside the valid range.
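As a minimal sketch of that first failure mode, the deviance residuals can be evaluated by hand for a valid mean, a boundary mean, and an out-of-range mean. The values of y and mu below are made up for illustration and are not taken from the question's data:

```r
# Illustrative values only: three successes (y = 1) paired with a valid mean,
# a mean exactly on the boundary, and a mean below zero.
fam <- binomial("identity")
y   <- c(1, 1, 1)
mu  <- c(0.5, 0, -0.1)
wt  <- rep(1, 3)

d <- fam$dev.resids(y, mu, wt)

# Only the first deviance residual is finite; the other two involve
# log(1 / 0) and the log of a negative number respectively.
is.finite(d)
#R> [1]  TRUE FALSE FALSE
```

A single non-finite entry is enough to make the summed deviance non-finite, which is the condition the first check guards against.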

We reach the second branch if any of the linear predictors, eta, or the means, mu, are not valid and we are in the first iteration, in which case coefold is NULL:

if (!(valideta(eta) && validmu(mu))) {
  if(is.null(coefold))
    stop("no valid set of coefficients has been found: please supply starting values", call. = FALSE)
  # ...
}

Looking at the family you are using, valideta and validmu are

with(binomial("identity"), {
    print(valideta)
    print(validmu)
})
#R> function (eta) 
#R> TRUE
#R> <environment: namespace:stats>
#R> function (mu) 
#R> all(is.finite(mu)) && all(mu > 0 & mu < 1)
#R> <bytecode: 0x55de9ffd4448>
#R> <environment: 0x55dea8ee2418>

which makes sense, as the probabilities (the means) must lie in (0,1). Thus, we can conclude that some of the means must at some point fall outside the (0,1) range during the iteratively reweighted least squares.
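The same validity functions can be used to screen candidate starting values before calling glm. The sketch below assumes a model with an intercept and a single covariate, as in the question; start_is_valid is a hypothetical helper written for this illustration, not a function from stats:

```r
# Rebuild the covariate values from the question (two exact zeros, eight draws).
set.seed(0)
probs <- c(0, 0, runif(8, 0, 1))
fam   <- binomial("identity")

# Hypothetical helper: does `start` give valid eta and mu for covariate `x`?
start_is_valid <- function(start, x, family) {
  eta <- start[1] + start[2] * x   # linear predictor: intercept + slope * x
  mu  <- family$linkinv(eta)       # identity link, so mu equals eta
  family$valideta(eta) && family$validmu(mu)
}

start_is_valid(c(0, 1), probs, fam)      # mu is exactly 0 where probs == 0
#R> [1] FALSE
start_is_valid(c(0.5, 0.5), probs, fam)  # mu lies in [0.5, 1)
#R> [1] TRUE
```

This is consistent with what the question reports: start = c(0, 1) is rejected outright, while start = c(0.5, 0.5) lets the iterations begin. It says nothing about whether later iterations stay inside the valid range.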

The link function you are using does not guarantee that the means are inside the (0,1) range, since the inverse link function is

binomial("identity")$linkinv
#R> function (eta) 
#R> eta
#R> <environment: namespace:stats>

and this is your problem. There is no guarantee or check in glm that ensures that everything stays valid. However, the constraint is always satisfied with some link functions, such as the logit. Specifying starting values may keep you from entering areas with invalid means during the iteratively reweighted least squares.
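To make the contrast between link functions concrete, here is a small sketch that pushes the same linear predictor values through the identity and logit inverse links. The eta values are arbitrary and chosen only so that some fall outside (0,1):

```r
# Arbitrary linear predictor values, two of them outside (0, 1).
eta <- c(-2, 0.5, 3)

mu_identity <- binomial("identity")$linkinv(eta)  # returned unchanged
mu_logit    <- binomial("logit")$linkinv(eta)     # squashed into (0, 1)

validmu <- binomial()$validmu

validmu(mu_identity)
#R> [1] FALSE
validmu(mu_logit)
#R> [1] TRUE
```

With the logit link any finite eta maps to a mean strictly between 0 and 1 (up to floating-point rounding for extreme values), whereas with the identity link the fitting routine has to rely on the coefficients themselves staying in a region where every fitted mean is valid.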

You wrote: "I highly suspect this has something to do with the fact that some of the responses are generated with true probability zero which causes problems as the coefficient of probs approaches 1."

Yes, this is exactly the issue. Simply replacing your example with

library(data.table)
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))
df$response = lapply(df$probs, function(i){
    rbinom(50, 1, i)  
})

dt <- data.table(df)
dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]

tmp <- dt$probs
tmp <- pmin(pmax(tmp, .Machine$double.eps), 1 - .Machine$double.eps)
dt$probs_logit <- log(tmp / (1 - tmp))
fit <- glm(data = dt, formula = response ~ probs_logit - 1, family = binomial("logit"))
#R> Warning message:
#R> glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit)
#R> 
#R> Call:
#R> glm(formula = response ~ probs_logit - 1, family = binomial("logit"), 
#R>     data = dt)
#R> 
#R> Deviance Residuals: 
#R>     Min       1Q   Median       3Q      Max  
#R> -2.4320  -0.6616   0.0000   0.4519   1.8038  
#R> 
#R> Coefficients:
#R>             Estimate Std. Error z value Pr(>|z|)    
#R> probs_logit  1.02336    0.09468   10.81   <2e-16 ***
#R> ---
#R> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#R> 
#R> (Dispersion parameter for binomial family taken to be 1)
#R> 
#R>     Null deviance: 693.15  on 500  degrees of freedom
#R> Residual deviance: 355.18  on 499  degrees of freedom
#R> AIC: 357.18
#R> 
#R> Number of Fisher Scoring iterations: 8
#R> 

gives you a warning but allows you to simulate from almost the right model after truncating and transforming the probabilities.

Camboose answered 25/6, 2020 at 9:20 Comment(0)
