Linear Regression and group by in R
Asked Answered
I

10

136

I want to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would do a 'by' statement and in SQL I would do a 'group by'. What's the R way of doing this?

Intestinal answered 23/7, 2009 at 4:0 Comment(1)
Just want to tell people that although there are lots of group-by functions in R, not all of them are the right one for group-by regression. For example, aggregate is not a right one; neither is tapply.Brush
I
61

Here's one way using the lme4 package.

 library(lme4)
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
Impound answered 23/7, 2009 at 4:55 Comment(3)
Is there a way to list R2 for both of these two models? e.g. add an R2 column after year. Also add p-value for each of the coeff?Excrescent
@Excrescent here you can find a feaseble solution (better late than never): Your.df[,summary(lm(Y~X))$r.squared,by=Your.factor] where: Y, X and Your.factor are your variables. Please keep in mind that Your.df must be a data.table classTrexler
I can't find the function xyplot, it doesn't show up after loading lme4Berardo
O
76

Since 2009, dplyr has been released which actually provides a very nice way to do this kind of grouping, closely resembling what SAS does.

library(dplyr)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                year=rep(1:10, 2),
                response=c(rnorm(10), rnorm(10)))
fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
# Source: local data frame [2 x 2]
# Groups: <by row>
#
#    state   model
#   (fctr)   (chr)
# 1     CA <S3:lm>
# 2     NY <S3:lm>
fitted_models$model
# [[1]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.06354      0.02677  
#
#
# [[2]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.35136      0.09385  

To retrieve the coefficients and Rsquared/p.value, one can use the broom package. This package provides:

three S3 generics: tidy, which summarizes a model's statistical findings such as coefficients of a regression; augment, which adds columns to the original data such as predictions, residuals and cluster assignments; and glance, which provides a one-row summary of model-level statistics.

library(broom)
fitted_models %>% tidy(model)
# Source: local data frame [4 x 6]
# Groups: state [2]
# 
#    state        term    estimate  std.error  statistic   p.value
#   (fctr)       (chr)       (dbl)      (dbl)      (dbl)     (dbl)
# 1     CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
# 2     CA        year  0.02677048 0.13515755  0.1980687 0.8479318
# 3     NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
# 4     NY        year  0.09385309 0.09686043  0.9689519 0.3609470
fitted_models %>% glance(model)
# Source: local data frame [2 x 12]
# Groups: state [2]
# 
#    state   r.squared adj.r.squared     sigma statistic   p.value    df
#   (fctr)       (dbl)         (dbl)     (dbl)     (dbl)     (dbl) (int)
# 1     CA 0.004879969  -0.119510035 1.2276294 0.0392312 0.8479318     2
# 2     NY 0.105032068  -0.006838924 0.8797785 0.9388678 0.3609470     2
# Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
#   df.residual (int)
fitted_models %>% augment(model)
# Source: local data frame [20 x 10]
# Groups: state [2]
# 
#     state   response  year      .fitted   .se.fit     .resid      .hat
#    (fctr)      (dbl) (int)        (dbl)     (dbl)      (dbl)     (dbl)
# 1      CA  0.4547765     1 -0.036769875 0.7215439  0.4915464 0.3454545
# 2      CA  0.1217003     2 -0.009999399 0.6119518  0.1316997 0.2484848
# 3      CA -0.6153836     3  0.016771076 0.5146646 -0.6321546 0.1757576
# 4      CA -0.9978060     4  0.043541551 0.4379605 -1.0413476 0.1272727
# 5      CA  2.1385614     5  0.070312027 0.3940486  2.0682494 0.1030303
# 6      CA -0.3924598     6  0.097082502 0.3940486 -0.4895423 0.1030303
# 7      CA -0.5918738     7  0.123852977 0.4379605 -0.7157268 0.1272727
# 8      CA  0.4671346     8  0.150623453 0.5146646  0.3165112 0.1757576
# 9      CA -1.4958726     9  0.177393928 0.6119518 -1.6732666 0.2484848
# 10     CA  1.7481956    10  0.204164404 0.7215439  1.5440312 0.3454545
# 11     NY -0.6285230     1 -0.257504572 0.5170932 -0.3710185 0.3454545
# 12     NY  1.0566099     2 -0.163651479 0.4385542  1.2202614 0.2484848
# 13     NY -0.5274693     3 -0.069798386 0.3688335 -0.4576709 0.1757576
# 14     NY  0.6097983     4  0.024054706 0.3138637  0.5857436 0.1272727
# 15     NY -1.5511940     5  0.117907799 0.2823942 -1.6691018 0.1030303
# 16     NY  0.7440243     6  0.211760892 0.2823942  0.5322634 0.1030303
# 17     NY  0.1054719     7  0.305613984 0.3138637 -0.2001421 0.1272727
# 18     NY  0.7513057     8  0.399467077 0.3688335  0.3518387 0.1757576
# 19     NY -0.1271655     9  0.493320170 0.4385542 -0.6204857 0.2484848
# 20     NY  1.2154852    10  0.587173262 0.5170932  0.6283119 0.3454545
# Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)
Osi answered 23/11, 2015 at 11:37 Comment(8)
I had to do rowwise(fitted_models) %>% tidy(model) to get the broom package to work, but otherwise, great answer.Baryta
Works great... can do this all without leaving the pipe: d %>% group_by(state) %>% do(model = lm(response ~ year, data = .)) %>% rowwise() %>% tidy(model)Anthia
@Baryta and @holastello, this no longer works, at least with R 3.6.1, broom_0.7.0, dplyr_0.8.3. d %>% group_by(state) %>% do(model=lm(response ~year, data = .)) %>% rowwise() %>% tidy(model) Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. In addition: Warning messages: 1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. ... Viscacha
Now (dplyr 1.0.4, tidyverse 1.3.0), you can do: library(broom); library(tidyverse) d %>% nest(data = -state) %>% mutate(model = map(data, ~lm(response~year, data = .)), tidied = map(model, tidy)) %>% unnest(tidied)Labourer
@JonSpring is it still working this way?Icing
Still works for me, yes. (on dplyr 1.0.8 and tidyverse 1.3.1)Labourer
the fix only stores estimate, std error, statistic and p-value. Is there no way to get the whole summary() output, including R-squared, DOF, et cetera?Kappel
too late to edit: not beautiful but works: looping through the list of fittedmodels and calling summary() on each, such as summary(fittedmodels[[2]][[1]])Kappel
K
66

Here's an approach using the plyr package:

d <- data.frame(
  state = rep(c('NY', 'CA'), 10),
  year = rep(1:10, 2),
  response= rnorm(20)
)

library(plyr)
# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(d, "state", function(df) 
  lm(response ~ year, data = df))

# Apply coef to each model and return a data frame
ldply(models, coef)

# Print the summary of each model
l_ply(models, summary, .print = TRUE)
Khan answered 31/7, 2009 at 19:30 Comment(4)
Say you added a additional independent variable that wasn't available in all states (i.e. miles.of.ocean.shoreline) that was represented by NA in your data. Wouldn't the lm call fail? How could it be dealt with?Murrhine
Inside the function you'd need to test for that case and use a different formulaKhan
Is it possible to add the name of the subgroup to each call in the summary (last step)?Ruphina
if you run layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page and then l_ply(models, plot) you get each of the residuals plots too. Is it possible to label each of the plots with the group (e.g., "state" in this case)?Lubricous
I
61

Here's one way using the lme4 package.

 library(lme4)
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
Impound answered 23/7, 2009 at 4:55 Comment(3)
Is there a way to list R2 for both of these two models? e.g. add an R2 column after year. Also add p-value for each of the coeff?Excrescent
@Excrescent here you can find a feaseble solution (better late than never): Your.df[,summary(lm(Y~X))$r.squared,by=Your.factor] where: Y, X and Your.factor are your variables. Please keep in mind that Your.df must be a data.table classTrexler
I can't find the function xyplot, it doesn't show up after loading lme4Berardo
C
26

In my opinion is a mixed linear model a better approach for this kind of data. The code below given in the fixed effect the overall trend. The random effects indicate how the trend for each individual state differ from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).

library(nlme)
lme(response ~ year, random = ~year|state, correlation = corAR1(~year))
Counts answered 24/7, 2009 at 11:32 Comment(3)
This is a really good general stats theory answer which makes me think about some things I had not considered. The application which caused me to ask the question would not be applicable to this solution, but I'm glad you brought it up. Thank you.Intestinal
It's not a good idea to start with a mixed model - how do you know that any of the assumptions are warranted?Khan
One should check the assumption by model validation (and knowledge of the data). BTW you cannot warrant the assumption on the individual lm's either. You would have to validate all models seperately.Counts
T
23

A nice solution using data.table was posted here in CrossValidated by @Zach. I'd just add that it is possible to obtain iteratively also the regression coefficient r^2:

## make fake data
    library(data.table)
    set.seed(1)
    dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))

##calculate the regression coefficient r^2
    dat[,summary(lm(y~x))$r.squared,by=grp]
       grp         V1
    1:   1 0.01465726
    2:   2 0.02256595

as well as all the other output from summary(lm):

dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp]
   grp         r2        f
1:   1 0.01465726 0.714014
2:   2 0.02256595 1.108173
Trexler answered 17/11, 2015 at 10:4 Comment(0)
M
19

I think it's worthwhile to add the purrr::map approach to this problem.

library(tidyverse)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                                 year=rep(1:10, 2),
                                 response=c(rnorm(10), rnorm(10)))

d %>% 
  group_by(state) %>% 
  nest() %>% 
  mutate(model = map(data, ~lm(response ~ year, data = .)))

See @Paul Hiemstra's answer for further ideas on using the broom package with these results.

Moluccas answered 24/4, 2018 at 15:13 Comment(2)
A little extension in case you want a column of fitted values or residuals: wrap the lm() call in a resid() call and then pipe everything in the last line into an unnest() call. Of course, you would want to change the variable name from "model" to something more relevant.Indeliberate
Hi, I adapted your code for my own data. However, R is throwing "<tibble> <S3: lm> " instead of the results. Could you please help me ?Flitting
C
9
## make fake data
 ngroups <- 2
 group <- 1:ngroups
 nobs <- 100
 dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
 head(dta)
#--------------------
  group          y         x
1     1  0.6482007 0.5429575
2     1 -0.4637118 0.7052843
3     1 -0.5129840 0.7312955
4     1 -0.6612649 0.9028034
5     1 -0.5197448 0.1661308
6     1  0.4240346 0.8944253
#------------ 
## function to extract the results of one model
 foo <- function(z) {
   ## coef and se in a data frame
   mr <- data.frame(coef(summary(lm(y~x,data=z))))
   ## put row names (predictors/indep variables)
   mr$predictor <- rownames(mr)
   mr
 }
 ## see that it works
 foo(subset(dta,group==1))
#=========
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
#----------
## one option: use command by
 res <- by(dta,dta$group,foo)
 res
#=========
dta$group: 1
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
------------------------------------------------------------ 
dta$group: 2
               Estimate Std..Error    t.value  Pr...t..   predictor
(Intercept) -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
x            0.06286456  0.3020321  0.2081387 0.8355526           x

## using package plyr is better
 library(plyr)
 res <- ddply(dta,"group",foo)
 res
#----------
  group    Estimate Std..Error    t.value  Pr...t..   predictor
1     1  0.21764767  0.1919140  1.1340897 0.2595235 (Intercept)
2     1 -0.36698898  0.3321875 -1.1047647 0.2719666           x
3     2 -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
4     2  0.06286456  0.3020321  0.2081387 0.8355526           x
Cacodemon answered 23/7, 2009 at 4:22 Comment(0)
C
9

I now my answer comes a bit late, but I was looking for a similar functionality. It would seem the built-in function 'by' in R can also do the grouping easily:

?by contains the following example, which fits per group and extracts the coefficients with sapply:

require(stats)
## now suppose we want to extract the coefficients by group 
tmp <- with(warpbreaks,
            by(warpbreaks, tension,
               function(x) lm(breaks ~ wool, data = x)))
sapply(tmp, coef)
Chang answered 28/2, 2017 at 14:53 Comment(0)
A
6

The lm() function above is an simple example. By the way, I imagine that your database has the columns as in the following form:

year state var1 var2 y...

In my point of view, you can to use the following code:

require(base) 
library(base) 
attach(data) # data = your data base
             #state is your label for the states column
modell<-by(data, data$state, function(data) lm(y~I(1/var1)+I(1/var2)))
summary(modell)
Apoenzyme answered 21/8, 2012 at 23:25 Comment(0)
T
1

The question seems to be about how to call regression functions with formulas which are modified inside a loop.

Here is how you can do it in (using diamonds dataset):

attach(ggplot2::diamonds)
strCols = names(ggplot2::diamonds)

formula <- list(); model <- list()
for (i in 1:1) {
  formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i])
  model[[i]] = glm(formula[[i]]) 

  #then you can plot the results or anything else ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
  par(mfrow = c(2, 2))      
  plot(model[[i]])
  dev.off()
  }
Titbit answered 28/3, 2017 at 17:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.