How to get regression coefficients and model fits using correlation or covariance matrix instead of data frame using R?
Asked Answered
A

4

6

I want to be able to regression coefficients from multiple linear regression by supplying a correlation or covariance matrix instead of a data.frame. I realise you lose some information relevant to determining the intercept and so on, but it should even the correlation matrix should be sufficient for getting standardised coefficients and estimates of variance explained.

So for example, if you had the following data

# get some data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]

You could run a regression as follows:

lm(EngineSize ~ Horsepower + RPM, x)

but what if instead of having data you had the correlation matrix or the covariance matrix:

corx <- cor(x)
covx <- cov(x)
  • What function in R allows you to run a regression based on the correlation or covariance matrix? Ideally it should be similar to lm so that you can easily obtain things like r-squared, adjusted r-squared, predicted values and so on. Presumably, for some of these things, you would need to also provide the sample size and possibly a vector of means. But that would also be fine.

I.e., something like:

lm(EngineSize ~ Horsepower + RPM, cov = covx) # obviously this doesn't work

Note that this answer on Stats.SE provides a theoretical explanation for why it's possible, and provides an example of some custom R code for calculating coefficients?

Alsup answered 25/7, 2016 at 0:43 Comment(3)
Is this helpful? stats.stackexchange.com/questions/107597/…Sunshade
@Sunshade Thanks I've integrated some points about that Stats.SE question into the question. It seems like that post could be adapted to get coefficients. I've tweaked my question. I guess what I'm hoping for is a function that is similar to lm but merely takes covariance instead of data. I.e., so that it's then easy to get things like model fits, and so on.Alsup
You could use lavaan. It will take a correlation/covariance matrix as input.Gloze
G
3

Using lavaan you could do the following:

library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]

lav.input<- cov(x)
lav.mean <- colMeans(x)

library(lavaan)
m1 <- 'EngineSize ~ Horsepower+RPM'
fit <- sem(m1, sample.cov = lav.input,sample.nobs = nrow(x), meanstructure = TRUE, sample.mean = lav.mean)
summary(fit, standardize=TRUE)

Results are:

Regressions:
                   Estimate    Std.Err  Z-value  P(>|z|)   Std.lv    Std.all
  EngineSize ~                                                              
    Horsepower          0.015    0.001   19.889    0.000      0.015    0.753
    RPM                -0.001    0.000  -15.197    0.000     -0.001   -0.576

Intercepts:
                  Estimate    Std.Err  Z-value  P(>|z|)   Std.lv    Std.all
   EngineSize          5.805    0.362   16.022    0.000      5.805    5.627

Variances:
                  Estimate    Std.Err  Z-value  P(>|z|)   Std.lv    Std.all
    EngineSize          0.142    0.021    6.819    0.000      0.142    0.133
Gloze answered 25/7, 2016 at 1:18 Comment(1)
I missed you wanted R-squared values. So: summary(fit, standardize=TRUE, rsquare=TRUE) will give you what you want. Most other functions associated with lm will work including predict and anova etc. Plus all the goodies of lavaan so := can be used to define new parameters within the model rather than using deltaMethod from car after fitting.Gloze
A
2

I think lavaan sounds like a good option, I note that @Philip pointed me in the right direction. I just mention here how to extract a few extra model features using lavaan (particularly, r-squared and adjusted r-squared) that you might want.

For the latest version see: https://gist.github.com/jeromyanglim/9f766e030966eaa1241f10bd7d6e2812 :

# get data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]

# define sample statistics 
covx <- cov(x)
n <- nrow(x)
means <- sapply(x, mean) # this is optional


fit <- lavaan::sem("EngineSize ~ Horsepower + RPM", sample.cov = covx,
                   sample.mean = means,
                    sample.nobs = n)

coef(fit) # unstandardised coefficients
standardizedSolution(fit) # Standardised coefficients
inspect(fit, 'r2') # r-squared

# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared)  * ((n-1)/(n-p-1))
# update p below with number of predictor variables
adjr2(inspect(fit, 'r2'), n = inspect(fit, "nobs"), p = 2) 

Custom function

And here is a bit of a function that supplies the fit from lavaan along with a few features of relevance (i.e., basically packaging most of the above). It assumes a case where you don't have the means.

covlm <- function(dv, ivs, n, cov) {
    # Assumes lavaan package
    # library(lavaan)
    # dv: charcter vector of length 1 with name of outcome variable
    # ivs: character vector of names of predictors
    # n: numeric vector of length 1: sample size
    # cov: covariance matrix where row and column names 
    #       correspond to dv and ivs
    # Return
    #      list with lavaan model fit
    #      and various other features of the model

    results <- list()
    eq <- paste(dv, "~", paste(ivs, collapse = " + "))
    results$fit <- lavaan::sem(eq, sample.cov = cov,
                       sample.nobs = n)

    # coefficients
    ufit <- parameterestimates(results$fit) 
    ufit <- ufit[ufit$op == "~", ]
    results$coef <- ufit$est
    names(results$coef) <- ufit$rhs

    sfit <- standardizedsolution(results$fit) 
    sfit <- sfit[sfit$op == "~", ]
    results$standardizedcoef <- sfit$est.std
    names(results$standardizedcoef) <- sfit$rhs

    # use unclass to not limit r2 to 3 decimals
     results$r.squared <- unclass(inspect(results$fit, 'r2')) # r-squared

    # adjusted r-squared
      adjr2 <- function(rsquared, n, p) 1 - (1-rsquared)  * ((n-1)/(n-p-1))
    results$adj.r.squared <- adjr2(unclass(inspect(results$fit, 'r2')), 
                                n = n, p = length(ivs)) 
    results

}

For example:

x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
covlm(dv = "EngineSize", ivs = c("Horsepower", "RPM"),
      n = nrow(x), cov = cov(x))

This all produces:

$fit
lavaan (0.5-20) converged normally after  27 iterations

  Number of observations                            93

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0
  Minimum Function Value               0.0000000000000

$coef
 Horsepower         RPM 
 0.01491908 -0.00100051 

$standardizedcoef
Horsepower        RPM 
 0.7532350 -0.5755326 

$r.squared
EngineSize 
     0.867 

$adj.r.squared
EngineSize 
     0.864 
Alsup answered 25/7, 2016 at 1:37 Comment(0)
S
1

Remember that:

$beta=(X'X)^-1. X'Y$

Try:

(bs<-solve(covx[-1,-1],covx[-1,1]))

 Horsepower         RPM 
 0.01491908 -0.00100051 

For the Intercept you will need averages of the variables. For example:

  ms=colMeans(x)
  (b0=ms[1]-bs%*%ms[-1])

         [,1]
[1,] 5.805301
Samphire answered 25/7, 2016 at 0:59 Comment(0)
C
1

Another kind of funky solution is to generate a data set that has the same variance-covariance matrix as the original data. You can do this with mvrnorm() in the MASS package. Using lm() on this new data set will yield parameter estimates and standard errors identical to those that would have been estimated from the original data set (except for the intercept, which is inaccessible unless you have the means of each variable). Here's an example of what this would look like:

#Assuming the variance covariance matrix is called VC
n <- 100 #sample size
nvar <- ncol(VC)
fake.data <- mvrnorm(n, mu = rep(0, nvar), sigma = VC, empirical = TRUE)
lm(Y~., data = fake.data)
Cotenant answered 25/8, 2017 at 21:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.