"weighted" regression in R
I have created a script like the one below to do what I call "weighted" regression:

library(plyr)

set.seed(100)

temp.df <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=sample(c(1:10),size=200,replace=TRUE),
                      stringsAsFactors=FALSE)

temp.df.expand <- ddply(temp.df,
                        c("uid"),
                        function(df) {
                          data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                     age=rep(df[,"age"],df[,"weight"]),
                                     stringsAsFactors=FALSE)})

temp.df.lm <- lm(bp~age,data=temp.df,weights=weight)
temp.df.expand.lm <- lm(bp~age,data=temp.df.expand)

You can see that in temp.df each row has a weight. What I mean is that there are 1178 samples in total, but rows with the same bp and age have been merged into one row, with the count recorded in the weight column.

I used the weights parameter of the lm function, then cross-checked the result against another data frame in which temp.df is "expanded". But I found the lm outputs differ between the two data frames.

Did I misinterpret the weights parameter of the lm function, and can anyone tell me how to run the regression properly (i.e. without expanding the data frame manually) for a dataset presented like temp.df? Thanks.
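As an aside, the expansion cross-check can be done without plyr by repeating row indices according to the weights. A minimal sketch, rebuilding the same temp.df so the block is self-contained (temp.df.expand2 is an illustrative name, not from the script above):

```r
# Rebuild the question's data frame
set.seed(100)
temp.df <- data.frame(uid    = 1:200,
                      bp     = sample(100:200, size = 200, replace = TRUE),
                      age    = sample(30:65,  size = 200, replace = TRUE),
                      weight = sample(1:10,   size = 200, replace = TRUE))

# Base-R alternative to the ddply expansion: repeat each row index
# weight-many times, keeping only the bp and age columns
temp.df.expand2 <- temp.df[rep(seq_len(nrow(temp.df)), temp.df$weight),
                           c("bp", "age")]

# One expanded row per unit of weight
stopifnot(nrow(temp.df.expand2) == sum(temp.df$weight))
```

Fitting lm(bp ~ age) on this expanded copy should reproduce temp.df.expand.lm exactly, since the rows are the same up to ordering.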

Moonscape answered 22/4, 2012 at 14:16 Comment(6)
The two regressions yield identical results for me. – Irritate
See the summary output; they are different. – Moonscape
The coefficients are the same, but the p-values are indeed different. I guess the following happens. When you expand the data, the observations are assumed to be independent: since there is a lot of data, you can be very confident in the estimates and the p-values are low. When using weights, the number of observations remains small, and the p-values are high. – Irritate
But the 2 data frames refer to the same dataset (1178 rows of data), just presented differently: temp.df represents the 1178 rows using 200 rows. They should give the same p-value if the same regression is performed on both. I need to get this fixed because in my case I may have more than 1 million rows; if I don't use the weight method, I may not have enough memory to store them all. – Moonscape
@VincentZoonekynd That does appear to be the case. The summaries show this directly. summary(temp.df.lm): ... Residual standard error: 69.89 on 198 degrees of freedom ... summary(temp.df.expand.lm): ... Residual standard error: 28.68 on 1176 degrees of freedom ... – Bonkers
This question should be moved to Cross Validated. – Ijssel

The problem here is that the degrees of freedom are not properly summed, so the residual Df and mean-square statistics are wrong. The following corrects the anova table:

temp.df.lm.aov <- anova(temp.df.lm)
# Replace the residual Df (last entry) with sum(weights) - model Df - 1
temp.df.lm.aov$Df[length(temp.df.lm.aov$Df)] <-
        sum(temp.df.lm$weights) -
        sum(temp.df.lm.aov$Df[-length(temp.df.lm.aov$Df)]) - 1
# Recompute the mean squares, F statistic and p-value from the corrected Df
temp.df.lm.aov$`Mean Sq` <- temp.df.lm.aov$`Sum Sq` / temp.df.lm.aov$Df
temp.df.lm.aov$`F value`[1] <- temp.df.lm.aov$`Mean Sq`[1] /
                                        temp.df.lm.aov$`Mean Sq`[2]
temp.df.lm.aov$`Pr(>F)`[1] <- pf(temp.df.lm.aov$`F value`[1], 1,
                                      temp.df.lm.aov$Df, lower.tail=FALSE)[2]
temp.df.lm.aov
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4        

Compare with:

> anova(temp.df.expand.lm)
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

I am a bit surprised this has not come up more often on R-help. Either that or my search strategy development powers are weakening with old age.

Venous answered 22/4, 2012 at 18:47 Comment(7)
There is an error in the upper block of code (the line that recomputes 'Mean Sq' from 'Sum Sq'). Note that the code did not correct the problem (the ANOVA tables differ). – Anastigmat
I have attempted a correction. Please ensure you approve. Note that I used subsetting / indexing (i.e., [1]), and it's not clear that's your style / as general as you may have wanted this to be. (However, the output does now match the output you wanted it to.) – Anastigmat
There were syntactic errors (unmatched back-ticks) that I did not have time to investigate. Thanks for trying to fix it. – Venous
You're welcome. The code, as fixed, doesn't match the correct version, e.g. p=.18 != p=.001; I thought you were trying to show that if the df were computed correctly, the anova() outputs would match. Was that not your point? – Anastigmat
Yes. My point was that you needed to divide the sums of squares by the correct Df and then recalculate the F-statistic and the Pr(>F). – Venous
Can someone share a reference for where the justification for this correction comes from? – Silique
@MarkWhite You are asking about an answer that is almost 7 years old without saying what part of it is unclear. Do you have a specific area of confusion? – Venous
