Trying to use aggregate to run linear model on subset of values in a column

Asked 10/9, 2019 at 14:19 Answered 10/9, 2019 at 14:30

I have roughly 7,500 subsidence values. Each subsidence value (V9) corresponds to a station (V2) and a year (V3). I want a line of best fit (V9~V3) for each station.

I created a function using lm that works when I manually subset the data. However, when I try to use aggregate to run a linear model on each station, I get the same value for every station.

Here's an example of what some of the data looks like:

V2        V3    V9
CRMS0002  2008  -28.4990000
CRMS0002  2009  -28.8080808
CRMS0002  2012  -31.9871795
CRMS0006  2008  -56.8998413
CRMS0006  2013   40.8611111
CRMS0006  2015   32.8555555
CRMS0033  2007  -16.8044444

This is the code:

sub_rate = function(x) {lm(CRMSsub$V9~CRMSsub$V3)}
agg <- aggregate(CRMSsub$V9, by = list(CRMSsub$V2), FUN = sub_rate)

I also tried:

agg <- lapply(split(CRMSsub, CRMSsub$V3), FUN = sub_rate)

The aggregate by part of the first and second code works. So I get 354 elements that are organized by station, but the linear model results which give me intercept and slope are the same for every station, which means it's not performing my function by station. Here's an example of the result:

Group.1         x
CRMS0002        c(`(Intercept)` = -2333.06378840009, `CRMSsub$V3` = 1.1541441797906)
CRMS0006        c(`(Intercept)` = -2333.06378840009, `CRMSsub$V3` = 1.1541441797906)
CRMS0033        c(`(Intercept)` = -2333.06378840009, `CRMSsub$V3` = 1.1541441797906)

Aime answered 10/9, 2019 at 14:19 Comment(0)

The problem with your approach is that you specified your dataset in the call of sub_rate. You also need to specify the dataset as x in lapply(). For instance you can do:

library(dplyr)

sub_rate <- function(x){lm(x$V9~x$V3)}

lapply(CRMSsub %>% split(.$V2),sub_rate)

Corby answered 10/9, 2019 at 14:30 Comment(0)

Try not to overcomplicate it. Just split on the variable you want to regress, and simply apply lm on the right columns

lapply(split(CRMSsub, CRMSsub$V2), function(i)lm(V9 ~ V3, i))

Which gives,

$CRMS0002

Call:
lm(formula = V9 ~ V3, data = i)

Coefficients:
(Intercept)           V3  
  1809.7832      -0.9153  


$CRMS0006

Call:
lm(formula = V9 ~ V3, data = i)

Coefficients:
(Intercept)           V3  
  -28396.65        14.12  


$CRMS0033

Call:
lm(formula = V9 ~ V3, data = i)

Coefficients:
(Intercept)           V3  
      -16.8           NA

Reggie answered 10/9, 2019 at 14:28 Comment(0)

The problem with your approach is that you specified your dataset in the call of sub_rate. You also need to specify the dataset as x in lapply(). For instance you can do:

library(dplyr)

sub_rate <- function(x){lm(x$V9~x$V3)}

lapply(CRMSsub %>% split(.$V2),sub_rate)

Corby answered 10/9, 2019 at 14:30 Comment(0)

Recommended topics

Hot tags