"weighted" regression in R
I have created a script like the one below to do what I call "weighted" regression:

library(plyr)

set.seed(100)

temp.df <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=sample(c(1:10),size=200,replace=TRUE),
                      stringsAsFactors=FALSE)

temp.df.expand <- ddply(temp.df,
                        c("uid"),
                        function(df) {
                          data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                     age=rep(df[,"age"],df[,"weight"]),
                                     stringsAsFactors=FALSE)})

temp.df.lm <- lm(bp~age,data=temp.df,weights=weight)
temp.df.expand.lm <- lm(bp~age,data=temp.df.expand)

You can see that in temp.df each row has a weight. What I mean is that there are 1178 samples in total, but rows with the same bp and age have been merged into one row, with the count recorded in the weight column.

I used the weights parameter of the lm function, then cross-checked the result against another data frame in which temp.df is "expanded". But I found the lm outputs differ between the two data frames.

Did I misinterpret the weights parameter of the lm function, and can anyone tell me how to run the regression properly (i.e. without expanding the data frame manually) for a dataset presented like temp.df? Thanks.
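As an aside, the expansion cross-check can be done without plyr by repeating row indices according to the weights. A minimal sketch, rebuilding the same temp.df so the block is self-contained (temp.df.expand2 is an illustrative name, not from the script above):

```r
# Rebuild the question's data frame
set.seed(100)
temp.df <- data.frame(uid    = 1:200,
                      bp     = sample(100:200, size = 200, replace = TRUE),
                      age    = sample(30:65,  size = 200, replace = TRUE),
                      weight = sample(1:10,   size = 200, replace = TRUE))

# Base-R alternative to the ddply expansion: repeat each row index
# weight-many times, keeping only the bp and age columns
temp.df.expand2 <- temp.df[rep(seq_len(nrow(temp.df)), temp.df$weight),
                           c("bp", "age")]

# One expanded row per unit of weight
stopifnot(nrow(temp.df.expand2) == sum(temp.df$weight))
```

Fitting lm(bp ~ age) on this expanded copy should reproduce temp.df.expand.lm exactly, since the rows are the same up to ordering.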

Moonscape answered 22/4, 2012 at 14:16 Comment(6)
The two regressions yield identical results for me. – Irritate
See the summary output; they are different. – Moonscape
The coefficients are the same, but the p-values are indeed different. I guess the following happens. When you expand the data, the observations are assumed to be independent: since there is a lot of data, you can be very confident in the estimates and the p-values are low. When using weights, the number of observations remains small, and the p-values are high. – Irritate
But the 2 data frames refer to the same dataset (1178 rows of data), just presented differently: temp.df represents the 1178 rows using 200 rows. They should give the same p-value if the same regression is performed on both. I need to get this fixed because in my case I may have more than 1 million rows; if I don't use the weight method, I may not have enough memory to store them all. – Moonscape
@VincentZoonekynd That does appear to be the case. The summaries show this directly. summary(temp.df.lm): ... Residual standard error: 69.89 on 198 degrees of freedom ... summary(temp.df.expand.lm): ... Residual standard error: 28.68 on 1176 degrees of freedom ... – Bonkers
This question should be moved to Cross Validated. – Ijssel

The problem here is that the degrees of freedom are not properly summed, so the residual Df and mean-square statistics are wrong. The following corrects the anova table:

temp.df.lm.aov <- anova(temp.df.lm)
# Replace the residual Df (last entry) with sum(weights) - model Df - 1
temp.df.lm.aov$Df[length(temp.df.lm.aov$Df)] <-
        sum(temp.df.lm$weights) -
        sum(temp.df.lm.aov$Df[-length(temp.df.lm.aov$Df)]) - 1
# Recompute the mean squares, F statistic and p-value from the corrected Df
temp.df.lm.aov$`Mean Sq` <- temp.df.lm.aov$`Sum Sq` / temp.df.lm.aov$Df
temp.df.lm.aov$`F value`[1] <- temp.df.lm.aov$`Mean Sq`[1] /
                                        temp.df.lm.aov$`Mean Sq`[2]
temp.df.lm.aov$`Pr(>F)`[1] <- pf(temp.df.lm.aov$`F value`[1], 1,
                                      temp.df.lm.aov$Df, lower.tail=FALSE)[2]
temp.df.lm.aov
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4        

Compare with:

> anova(temp.df.expand.lm)
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

I am a bit surprised this has not come up more often on R-help. Either that or my search strategy development powers are weakening with old age.

Venous answered 22/4, 2012 at 18:47 Comment(7)
There is an error in the upper block of code (the line that recomputes 'Mean Sq' from 'Sum Sq'). Note that the code did not correct the problem (the ANOVA tables differ). – Anastigmat
I have attempted a correction. Please ensure you approve. Note that I used subsetting / indexing (i.e., [1]), and it's not clear that's your style / as general as you may have wanted this to be. (However, the output does now match the output you wanted it to.) – Anastigmat
There were syntactic errors (unmatched back-ticks) that I did not have time to investigate. Thanks for trying to fix it. – Venous
You're welcome. The code, as fixed, doesn't match the correct version, e.g. p=.18 != p=.001; I thought you were trying to show that if the df were computed correctly, the anova() outputs would match. Was that not your point? – Anastigmat
Yes. My point was that you needed to divide the sums of squares by the correct Df and then recalculate the F-statistic and the Pr(>F). – Venous
Can someone share a reference for where the justification for this correction comes from? – Silique
@MarkWhite You are asking about an answer that is almost 7 years old without saying what part of it is unclear. Do you have a specific area of confusion? – Venous
