Linear Regression and group by in R

Asked 23/7, 2009 at 4:0 Answered 24/4, 2018 at 15:13

Solved r regression linear-regression lm

136

I want to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would do a 'by' statement and in SQL I would do a 'group by'. What's the R way of doing this?

Intestinal answered 23/7, 2009 at 4:0 Comment(1)

Just want to tell people that although there are lots of group-by functions in R, not all of them are the right one for group-by regression. For example, aggregate is not a right one; neither is tapply. – Brush 18/8, 2018 at 20:21

Here's one way using the lme4 package.

 library(lme4)
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

Impound answered 23/7, 2009 at 4:55 Comment(3)

Is there a way to list R2 for both of these two models? e.g. add an R2 column after year. Also add p-value for each of the coeff? – Excrescent 27/8, 2014 at 15:52

@Excrescent here you can find a feaseble solution (better late than never): Your.df[,summary(lm(Y~X))$r.squared,by=Your.factor] where: Y, X and Your.factor are your variables. Please keep in mind that Your.df must be a data.table class – Trexler 17/11, 2015 at 9:33

I can't find the function xyplot, it doesn't show up after loading lme4 – Berardo 31/5, 2022 at 13:4

Since 2009, dplyr has been released which actually provides a very nice way to do this kind of grouping, closely resembling what SAS does.

library(dplyr)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                year=rep(1:10, 2),
                response=c(rnorm(10), rnorm(10)))
fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
# Source: local data frame [2 x 2]
# Groups: <by row>
#
#    state   model
#   (fctr)   (chr)
# 1     CA <S3:lm>
# 2     NY <S3:lm>
fitted_models$model
# [[1]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.06354      0.02677  
#
#
# [[2]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.35136      0.09385

To retrieve the coefficients and Rsquared/p.value, one can use the broom package. This package provides:

three S3 generics: tidy, which summarizes a model's statistical findings such as coefficients of a regression; augment, which adds columns to the original data such as predictions, residuals and cluster assignments; and glance, which provides a one-row summary of model-level statistics.

library(broom)
fitted_models %>% tidy(model)
# Source: local data frame [4 x 6]
# Groups: state [2]
# 
#    state        term    estimate  std.error  statistic   p.value
#   (fctr)       (chr)       (dbl)      (dbl)      (dbl)     (dbl)
# 1     CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
# 2     CA        year  0.02677048 0.13515755  0.1980687 0.8479318
# 3     NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
# 4     NY        year  0.09385309 0.09686043  0.9689519 0.3609470
fitted_models %>% glance(model)
# Source: local data frame [2 x 12]
# Groups: state [2]
# 
#    state   r.squared adj.r.squared     sigma statistic   p.value    df
#   (fctr)       (dbl)         (dbl)     (dbl)     (dbl)     (dbl) (int)
# 1     CA 0.004879969  -0.119510035 1.2276294 0.0392312 0.8479318     2
# 2     NY 0.105032068  -0.006838924 0.8797785 0.9388678 0.3609470     2
# Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
#   df.residual (int)
fitted_models %>% augment(model)
# Source: local data frame [20 x 10]
# Groups: state [2]
# 
#     state   response  year      .fitted   .se.fit     .resid      .hat
#    (fctr)      (dbl) (int)        (dbl)     (dbl)      (dbl)     (dbl)
# 1      CA  0.4547765     1 -0.036769875 0.7215439  0.4915464 0.3454545
# 2      CA  0.1217003     2 -0.009999399 0.6119518  0.1316997 0.2484848
# 3      CA -0.6153836     3  0.016771076 0.5146646 -0.6321546 0.1757576
# 4      CA -0.9978060     4  0.043541551 0.4379605 -1.0413476 0.1272727
# 5      CA  2.1385614     5  0.070312027 0.3940486  2.0682494 0.1030303
# 6      CA -0.3924598     6  0.097082502 0.3940486 -0.4895423 0.1030303
# 7      CA -0.5918738     7  0.123852977 0.4379605 -0.7157268 0.1272727
# 8      CA  0.4671346     8  0.150623453 0.5146646  0.3165112 0.1757576
# 9      CA -1.4958726     9  0.177393928 0.6119518 -1.6732666 0.2484848
# 10     CA  1.7481956    10  0.204164404 0.7215439  1.5440312 0.3454545
# 11     NY -0.6285230     1 -0.257504572 0.5170932 -0.3710185 0.3454545
# 12     NY  1.0566099     2 -0.163651479 0.4385542  1.2202614 0.2484848
# 13     NY -0.5274693     3 -0.069798386 0.3688335 -0.4576709 0.1757576
# 14     NY  0.6097983     4  0.024054706 0.3138637  0.5857436 0.1272727
# 15     NY -1.5511940     5  0.117907799 0.2823942 -1.6691018 0.1030303
# 16     NY  0.7440243     6  0.211760892 0.2823942  0.5322634 0.1030303
# 17     NY  0.1054719     7  0.305613984 0.3138637 -0.2001421 0.1272727
# 18     NY  0.7513057     8  0.399467077 0.3688335  0.3518387 0.1757576
# 19     NY -0.1271655     9  0.493320170 0.4385542 -0.6204857 0.2484848
# 20     NY  1.2154852    10  0.587173262 0.5170932  0.6283119 0.3454545
# Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)

Osi answered 23/11, 2015 at 11:37 Comment(8)

I had to do rowwise(fitted_models) %>% tidy(model) to get the broom package to work, but otherwise, great answer. – Baryta 23/2, 2016 at 15:21

Works great... can do this all without leaving the pipe: d %>% group_by(state) %>% do(model = lm(response ~ year, data = .)) %>% rowwise() %>% tidy(model) – Anthia 16/2, 2018 at 0:30

@Baryta and @holastello, this no longer works, at least with R 3.6.1, broom_0.7.0, dplyr_0.8.3.

d %>% group_by(state) %>% do(model=lm(response ~year, data = .)) %>% rowwise() %>% tidy(model) Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :    Calling var(x) on a factor x is defunct.   Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. In addition: Warning messages: 1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. ...

– Viscacha 27/8, 2020 at 13:30

Now (dplyr 1.0.4, tidyverse 1.3.0), you can do:

library(broom); library(tidyverse) d %>% nest(data = -state) %>% mutate(model = map(data, ~lm(response~year, data = .)), tidied = map(model, tidy)) %>% unnest(tidied)

– Labourer 14/5, 2021 at 7:26

@JonSpring is it still working this way? – Icing 17/4, 2022 at 7:55

Still works for me, yes. (on dplyr 1.0.8 and tidyverse 1.3.1) – Labourer 17/4, 2022 at 9:16

the fix only stores estimate, std error, statistic and p-value. Is there no way to get the whole summary() output, including R-squared, DOF, et cetera? – Kappel 13/7, 2022 at 13:58

too late to edit: not beautiful but works: looping through the list of fittedmodels and calling summary() on each, such as summary(fittedmodels[[2]][[1]]) – Kappel 13/7, 2022 at 14:5

Here's an approach using the plyr package:

d <- data.frame(
  state = rep(c('NY', 'CA'), 10),
  year = rep(1:10, 2),
  response= rnorm(20)
)

library(plyr)
# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(d, "state", function(df) 
  lm(response ~ year, data = df))

# Apply coef to each model and return a data frame
ldply(models, coef)

# Print the summary of each model
l_ply(models, summary, .print = TRUE)

Khan answered 31/7, 2009 at 19:30 Comment(4)

Say you added a additional independent variable that wasn't available in all states (i.e. miles.of.ocean.shoreline) that was represented by NA in your data. Wouldn't the lm call fail? How could it be dealt with? – Murrhine 14/3, 2013 at 1:58

Inside the function you'd need to test for that case and use a different formula – Khan 14/3, 2013 at 15:24

Is it possible to add the name of the subgroup to each call in the summary (last step)? – Ruphina 16/4, 2014 at 9:31

if you run layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page and then l_ply(models, plot) you get each of the residuals plots too. Is it possible to label each of the plots with the group (e.g., "state" in this case)? – Lubricous 23/4, 2015 at 17:23

Here's one way using the lme4 package.

 library(lme4)
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

Impound answered 23/7, 2009 at 4:55 Comment(3)

Is there a way to list R2 for both of these two models? e.g. add an R2 column after year. Also add p-value for each of the coeff? – Excrescent 27/8, 2014 at 15:52

I can't find the function xyplot, it doesn't show up after loading lme4 – Berardo 31/5, 2022 at 13:4

In my opinion is a mixed linear model a better approach for this kind of data. The code below given in the fixed effect the overall trend. The random effects indicate how the trend for each individual state differ from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).

library(nlme)
lme(response ~ year, random = ~year|state, correlation = corAR1(~year))

Counts answered 24/7, 2009 at 11:32 Comment(3)

This is a really good general stats theory answer which makes me think about some things I had not considered. The application which caused me to ask the question would not be applicable to this solution, but I'm glad you brought it up. Thank you. – Intestinal 24/7, 2009 at 14:11

It's not a good idea to start with a mixed model - how do you know that any of the assumptions are warranted? – Khan 31/7, 2009 at 19:26

One should check the assumption by model validation (and knowledge of the data). BTW you cannot warrant the assumption on the individual lm's either. You would have to validate all models seperately. – Counts 31/7, 2009 at 22:59

A nice solution using data.table was posted here in CrossValidated by @Zach. I'd just add that it is possible to obtain iteratively also the regression coefficient r^2:

## make fake data
    library(data.table)
    set.seed(1)
    dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))

##calculate the regression coefficient r^2
    dat[,summary(lm(y~x))$r.squared,by=grp]
       grp         V1
    1:   1 0.01465726
    2:   2 0.02256595

as well as all the other output from summary(lm):

dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp]
   grp         r2        f
1:   1 0.01465726 0.714014
2:   2 0.02256595 1.108173

Trexler answered 17/11, 2015 at 10:4 Comment(0)

I think it's worthwhile to add the purrr::map approach to this problem.

library(tidyverse)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                                 year=rep(1:10, 2),
                                 response=c(rnorm(10), rnorm(10)))

d %>% 
  group_by(state) %>% 
  nest() %>% 
  mutate(model = map(data, ~lm(response ~ year, data = .)))

See @Paul Hiemstra's answer for further ideas on using the broom package with these results.

Moluccas answered 24/4, 2018 at 15:13 Comment(2)

A little extension in case you want a column of fitted values or residuals: wrap the lm() call in a resid() call and then pipe everything in the last line into an unnest() call. Of course, you would want to change the variable name from "model" to something more relevant. – Indeliberate 16/7, 2019 at 22:52

Hi, I adapted your code for my own data. However, R is throwing "<tibble> <S3: lm> " instead of the results. Could you please help me ? – Flitting 11/1 at 20:27

## make fake data
 ngroups <- 2
 group <- 1:ngroups
 nobs <- 100
 dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
 head(dta)
#--------------------
  group          y         x
1     1  0.6482007 0.5429575
2     1 -0.4637118 0.7052843
3     1 -0.5129840 0.7312955
4     1 -0.6612649 0.9028034
5     1 -0.5197448 0.1661308
6     1  0.4240346 0.8944253
#------------ 
## function to extract the results of one model
 foo <- function(z) {
   ## coef and se in a data frame
   mr <- data.frame(coef(summary(lm(y~x,data=z))))
   ## put row names (predictors/indep variables)
   mr$predictor <- rownames(mr)
   mr
 }
 ## see that it works
 foo(subset(dta,group==1))
#=========
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
#----------
## one option: use command by
 res <- by(dta,dta$group,foo)
 res
#=========
dta$group: 1
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
------------------------------------------------------------ 
dta$group: 2
               Estimate Std..Error    t.value  Pr...t..   predictor
(Intercept) -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
x            0.06286456  0.3020321  0.2081387 0.8355526           x

## using package plyr is better
 library(plyr)
 res <- ddply(dta,"group",foo)
 res
#----------
  group    Estimate Std..Error    t.value  Pr...t..   predictor
1     1  0.21764767  0.1919140  1.1340897 0.2595235 (Intercept)
2     1 -0.36698898  0.3321875 -1.1047647 0.2719666           x
3     2 -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
4     2  0.06286456  0.3020321  0.2081387 0.8355526           x

Cacodemon answered 23/7, 2009 at 4:22 Comment(0)

I now my answer comes a bit late, but I was looking for a similar functionality. It would seem the built-in function 'by' in R can also do the grouping easily:

?by contains the following example, which fits per group and extracts the coefficients with sapply:

require(stats)
## now suppose we want to extract the coefficients by group 
tmp <- with(warpbreaks,
            by(warpbreaks, tension,
               function(x) lm(breaks ~ wool, data = x)))
sapply(tmp, coef)

Chang answered 28/2, 2017 at 14:53 Comment(0)

The lm() function above is an simple example. By the way, I imagine that your database has the columns as in the following form:

year state var1 var2 y...

In my point of view, you can to use the following code:

require(base) 
library(base) 
attach(data) # data = your data base
             #state is your label for the states column
modell<-by(data, data$state, function(data) lm(y~I(1/var1)+I(1/var2)))
summary(modell)

Apoenzyme answered 21/8, 2012 at 23:25 Comment(0)

The question seems to be about how to call regression functions with formulas which are modified inside a loop.

Here is how you can do it in (using diamonds dataset):

attach(ggplot2::diamonds)
strCols = names(ggplot2::diamonds)

formula <- list(); model <- list()
for (i in 1:1) {
  formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i])
  model[[i]] = glm(formula[[i]]) 

  #then you can plot the results or anything else ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
  par(mfrow = c(2, 2))      
  plot(model[[i]])
  dev.off()
  }

Titbit answered 28/3, 2017 at 17:26 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags