pull out p-values and r-squared from a linear regression
Asked Answered
C

12

235

How do you pull out the p-value (for the significance of the coefficient of the single explanatory variable being non-zero) and R-squared value from a simple linear regression model? For example...

x = cumsum(c(0, runif(100, -1, +1)))
y = cumsum(c(0, runif(100, -1, +1)))
fit = lm(y ~ x)
summary(fit)

I know that summary(fit) displays the p-value and R-squared value, but I want to be able to stick these into other variables.

Chester answered 7/4, 2011 at 21:7 Comment(1)
It only displays the values if you don't assign the output to an object (e.g. r <- summary(lm(rnorm(10)~runif(10))) does not display anything).Muire
F
223

r-squared: You can return the r-squared value directly from the summary object summary(fit)$r.squared. See names(summary(fit)) for a list of all the items you can extract directly.

Model p-value: If you want to obtain the p-value of the overall regression model, this blog post outlines a function to return the p-value:

lmp <- function (modelobject) {
    if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
    f <- summary(modelobject)$fstatistic
    p <- pf(f[1],f[2],f[3],lower.tail=F)
    attributes(p) <- NULL
    return(p)
}

> lmp(fit)
[1] 1.622665e-05

In the case of a simple regression with one predictor, the model p-value and the p-value for the coefficient will be the same.

Coefficient p-values: If you have more than one predictor, then the above will return the model p-value, and the p-value for coefficients can be extracted using:

summary(fit)$coefficients[,4]  

Alternatively, you can grab the p-value of coefficients from the anova(fit) object in a similar fashion to the summary object above.

Fulani answered 7/4, 2011 at 21:17 Comment(3)
It's a bit better to use inherits rather than class directly. And maybe you want unname(pf(f[1],f[2],f[3],lower.tail=F))?Holtorf
If you prefer a one-liner: summary(fit)$fstatistic %>% {unname(pf(.[1],.[2],.[3],lower.tail=F))}Scandic
Or as pipe-less one-liner: with(summary(fit), pf(fstatistic[1],fstatistic[2],fstatistic[3],lower.tail=F))Scandic
F
185

Notice that summary(fit) generates an object with all the information you need. The beta, se, t and p vectors are stored in it. Get the p-values by selecting the 4th column of the coefficients matrix (stored in the summary object):

summary(fit)$coefficients[,4] 
summary(fit)$r.squared

Try str(summary(fit)) to see all the info that this object contains.

Edit: I had misread Chase's answer which basically tells you how to get to what I give here.

Ferdinana answered 7/4, 2011 at 23:3 Comment(6)
Note: this is the only method which gives you easy access to the p-value of the intercept as well as the other predictors. By far the best of above.Huzzah
This is the RIGHT answer. The top-rated answer did NOT work for me.Gradin
IF YOU WANT EASY ACCESS TO P-VALUE, USE THIS ANSWER. Why would you go through writing multi-line functions or creating new objects (i.e., anova outputs), when you just have to look a bit harder to find p-value in the summary output itself. To isolate an individual p-value itself, you'd add a row number to Vincent's answer: for example, summary(fit)$coefficients[1,4] for thei nterceptCorrigan
Note: this method works for models created using lm() but does not work for gls() models.Corrigan
See my answer below for an extension of this answer that works for lm() AND gls()Corrigan
Chase's answer returns the p-value of the model, this answer returns the p-value of the coefficients. In the case of a simple regression, they are the same, but in the case of a model with multiple predictors, they are not the same. Thus, both answers are useful depending on what you want to extract.Hyposensitize
S
52

You can see the structure of the object returned by summary() by calling str(summary(fit)). Each piece can be accessed using $. The p-value for the F statistic is more easily had from the object returned by anova.

Concisely, you can do this:

rSquared <- summary(fit)$r.squared
pVal <- anova(fit)$'Pr(>F)'[1]
Sahara answered 7/4, 2011 at 21:13 Comment(2)
this works only for univariate regressions where the p val of the regression is the same of the predictorKristykristyn
For some reason this answer does not work when using the summary object from aovCousteau
M
33

I came across this question while exploring suggested solutions for a similar problem; I presume that for future reference it may be worthwhile to update the available list of answer with a solution utilising the broom package.

Sample code

x = cumsum(c(0, runif(100, -1, +1)))
y = cumsum(c(0, runif(100, -1, +1)))
fit = lm(y ~ x)
require(broom)
glance(fit)

Results

>> glance(fit)
  r.squared adj.r.squared    sigma statistic    p.value df    logLik      AIC      BIC deviance df.residual
1 0.5442762     0.5396729 1.502943  118.2368 1.3719e-18  2 -183.4527 372.9055 380.7508 223.6251          99

Side notes

I find the glance function is useful as it neatly summarises the key values. The results are stored as a data.frame which makes further manipulation easy:

>> class(glance(fit))
[1] "data.frame"
Madrid answered 1/5, 2016 at 16:6 Comment(1)
This is a great answer!Improve
N
24

While both of the answers above are good, the procedure for extracting parts of objects is more general.

In many cases, functions return lists, and the individual components can be accessed using str() which will print the components along with their names. You can then access them using the $ operator, i.e. myobject$componentname.

In the case of lm objects, there are a number of predefined methods one can use such as coef(), resid(), summary() etc, but you won't always be so lucky.

Nicholenicholl answered 7/4, 2011 at 21:30 Comment(0)
C
17

Extension of @Vincent 's answer:

For lm() generated models:

summary(fit)$coefficients[,4]   ##P-values 
summary(fit)$r.squared          ##R squared values

For gls() generated models:

summary(fit)$tTable[,4]         ##P-values
##R-squared values are not generated b/c gls uses max-likelihood not Sums of Squares

To isolate an individual p-value itself, you'd add a row number to the code:

For example to access the p-value of the intercept in both model summaries:

summary(fit)$coefficients[1,4]
summary(fit)$tTable[1,4]  
  • Note, you can replace the column number with the column name in each of the above instances:

    summary(fit)$coefficients[1,"Pr(>|t|)"]  ##lm 
    summary(fit)$tTable[1,"p-value"]         ##gls 
    

If you're still unsure of how to access a value form the summary table use str() to figure out the structure of the summary table:

str(summary(fit))
Corrigan answered 20/2, 2016 at 3:36 Comment(0)
A
10

This is the easiest way to pull the p-values:

coef(summary(modelname))[, "Pr(>|t|)"]
Ambsace answered 16/9, 2016 at 15:6 Comment(1)
I tried this method, but it will fail if the linear model contains any NA termsCellulosic
P
5

I used this lmp function quite a lot of times.

And at one point I decided to add new features to enhance data analysis. I am not in expert in R or statistics but people are usually looking at different information of a linear regression :

  • p-value
  • a and b
  • and of course the aspect of the point distribution

Let's have an example. You have here

Here a reproducible example with different variables:

Ex<-structure(list(X1 = c(-36.8598, -37.1726, -36.4343, -36.8644, 
-37.0599, -34.8818, -31.9907, -37.8304, -34.3367, -31.2984, -33.5731
), X2 = c(64.26, 63.085, 66.36, 61.08, 61.57, 65.04, 72.69, 63.83, 
67.555, 76.06, 68.61), Y1 = c(493.81544, 493.81544, 494.54173, 
494.61364, 494.61381, 494.38717, 494.64122, 493.73265, 494.04246, 
494.92989, 494.98384), Y2 = c(489.704166, 489.704166, 490.710962, 
490.653212, 490.710612, 489.822928, 488.160904, 489.747776, 490.600579, 
488.946738, 490.398958), Y3 = c(-19L, -19L, -19L, -23L, -30L, 
-43L, -43L, -2L, -58L, -47L, -61L)), .Names = c("X1", "X2", "Y1", 
"Y2", "Y3"), row.names = c(NA, 11L), class = "data.frame")


library(reshape2)
library(ggplot2)
Ex2<-melt(Ex,id=c("X1","X2"))
colnames(Ex2)[3:4]<-c("Y","Yvalue")
Ex3<-melt(Ex2,id=c("Y","Yvalue"))
colnames(Ex3)[3:4]<-c("X","Xvalue")

ggplot(Ex3,aes(Xvalue,Yvalue))+
          geom_smooth(method="lm",alpha=0.2,size=1,color="grey")+
          geom_point(size=2)+
          facet_grid(Y~X,scales='free')


#Use the lmp function

lmp <- function (modelobject) {
  if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
  f <- summary(modelobject)$fstatistic
    p <- pf(f[1],f[2],f[3],lower.tail=F)
    attributes(p) <- NULL
    return(p)
    }

# create function to extract different informations from lm

lmtable<-function (var1,var2,data,signi=NULL){
  #var1= y data : colnames of data as.character, so "Y1" or c("Y1","Y2") for example
  #var2= x data : colnames of data as.character, so "X1" or c("X1","X2") for example
  #data= data in dataframe, variables in columns
  # if signi TRUE, round p-value with 2 digits and add *** if <0.001, ** if < 0.01, * if < 0.05.

  if (class(data) != "data.frame") stop("Not an object of class 'data.frame' ")
  Tabtemp<-data.frame(matrix(NA,ncol=6,nrow=length(var1)*length(var2)))
  for (i in 1:length(var2))
       {
  Tabtemp[((length(var1)*i)-(length(var1)-1)):(length(var1)*i),1]<-var1
  Tabtemp[((length(var1)*i)-(length(var1)-1)):(length(var1)*i),2]<-var2[i]
  colnames(Tabtemp)<-c("Var.y","Var.x","p-value","a","b","r^2")

  for (n in 1:length(var1))
  {
  Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),3]<-lmp(lm(data[,var1[n]]~data[,var2[i]],data))

  Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),4]<-coef(lm(data[,var1[n]]~data[,var2[i]],data))[1]

  Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),5]<-coef(lm(data[,var1[n]]~data[,var2[i]],data))[2]

  Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),6]<-summary(lm(data[,var1[n]]~data[,var2[i]],data))$r.squared
  }
  }

  signi2<-data.frame(matrix(NA,ncol=3,nrow=nrow(Tabtemp)))
  signi2[,1]<-ifelse(Tabtemp[,3]<0.001,paste0("***"),ifelse(Tabtemp[,3]<0.01,paste0("**"),ifelse(Tabtemp[,3]<0.05,paste0("*"),paste0(""))))
  signi2[,2]<-round(Tabtemp[,3],2)
  signi2[,3]<-paste0(format(signi2[,2],digits=2),signi2[,1])

  for (l in 1:nrow(Tabtemp))
    {
  Tabtemp$"p-value"[l]<-ifelse(is.null(signi),
         Tabtemp$"p-value"[l],
         ifelse(isTRUE(signi),
                paste0(signi2[,3][l]),
                Tabtemp$"p-value"[l]))
  }

   Tabtemp
}

# ------- EXAMPLES ------

lmtable("Y1","X1",Ex)
lmtable(c("Y1","Y2","Y3"),c("X1","X2"),Ex)
lmtable(c("Y1","Y2","Y3"),c("X1","X2"),Ex,signi=TRUE)

There is certainly a faster solution than this function but it works.

Panteutonism answered 2/6, 2013 at 16:16 Comment(0)
P
4

For the final p-value displayed at the end of summary(), the function uses pf() to calculate from the summary(fit)$fstatistic values.

fstat <- summary(fit)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail=FALSE)

Source: [1], [2]

Phyllida answered 18/2, 2019 at 10:38 Comment(0)
S
2

Another option is to use the cor.test function, instead of lm:

> x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
> y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)

> mycor = cor.test(x,y)
> mylm = lm(x~y)

# r and rsquared:
> cor.test(x,y)$estimate ** 2
      cor 
0.3262484 
> summary(lm(x~y))$r.squared
[1] 0.3262484

# P.value 

> lmp(lm(x~y))  # Using the lmp function defined in Chase's answer
[1] 0.1081731
> cor.test(x,y)$p.value
[1] 0.1081731
Scratch answered 5/6, 2014 at 14:2 Comment(0)
S
2

Use:

(summary(fit))$coefficients[***num***,4]

where num is a number which denotes the row of the coefficients matrix. It will depend on how many features you have in your model and which one you want to pull out the p-value for. For example, if you have only one variable there will be one p-value for the intercept which will be [1,4] and the next one for your actual variable which will be [2,4]. So your num will be 2.

Snip answered 6/5, 2017 at 21:22 Comment(0)
R
1
x = cumsum(c(0, runif(100, -1, +1)))
y = cumsum(c(0, runif(100, -1, +1)))
fit = lm(y ~ x)
> names(summary(fit))
[1] "call"          "terms"        
 [3] "residuals"     "coefficients" 
 [5] "aliased"       "sigma"        
 [7] "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"   
[11] "cov.unscaled" 
    summary(fit)$r.squared
Readership answered 26/3, 2016 at 20:1 Comment(2)
Care to provide an explanation, even if briefly, on why this code works?Ceuta
how does this improve on the existing answers (and in particular the accepted answer)?Boutonniere

© 2022 - 2024 — McMap. All rights reserved.