I'm not going to answer your question exactly; instead I'll address the more general question of how, in R, I would test the hypothesis of a difference in slopes across two groups when unequal variance is suspected in the response variable.
Overview
There are several options, two of which I'll go into. All of the good options involve combining the two datasets into a single modelling exercise and comparing a "full" model, which includes an interaction between gender and the slopes of the other variables, against a "no interaction" model, which has an additive gender effect but the same slopes for the other variables.
If we were prepared to assume the variance were the same in the two gender groups, we could just fit both models to the combined data with ordinary least squares and use a classical F test:
Data <- data.frame(
  gender = sample(c("men", "women"), 2000, replace = TRUE),
  var1 = sample(c("value1", "value2"), 2000, replace = TRUE),
  var2 = sample(c("valueA", "valueB"), 2000, replace = TRUE),
  var3 = sample(c("valueY", "valueZ"), 2000, replace = TRUE),
  y = sample(0:10, 2000, replace = TRUE)
)
lm_full <- lm(y ~ (var1 + var2 + var3) * gender, data = Data)
lm_nointeraction <- lm(y ~ var1 + var2 + var3 + gender, data = Data)
# If the variance were equal we could just do an F-test:
anova(lm_full, lm_nointeraction)
However, this assumption is not acceptable, so we need an alternative. I think this discussion on Cross-Validated is useful.
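As a quick, informal check before choosing an approach (my own aside, not part of the workflow below), you can compare the residual spread by group from the OLS fit; Bartlett's test is one option, though it is itself sensitive to non-normality:
# Residual standard deviation by gender (a rough eyeball check)
tapply(residuals(lm_full), Data$gender, sd)
# Bartlett's test of equal variances (sensitive to non-normality)
bartlett.test(residuals(lm_full) ~ Data$gender)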
Option 1 - weighted least squares
I'm not sure if this is the same as a Welch's t-test; I suspect it is a higher-level generalization of it. This is quite a simple parametric approach to the problem. Basically, we just model the variance of the response at the same time as its mean. Then, in the fitting process (which becomes iterative), we give less weight to the points that are expected to have higher variance, i.e. more randomness. The gls function (generalized least squares) in the nlme package does this for us.
# Option 1 - modelling variance, and making weights inversely proportional to it
library(nlme)
# Fit by maximum likelihood (method = "ML") so the two fits, which differ in
# their fixed effects, are comparable by AIC; the default REML fits would not be.
gls_full <- gls(y ~ (var1 + var2 + var3) * gender, data = Data,
                weights = varPower(), method = "ML")
gls_nointeraction <- gls(y ~ var1 + var2 + var3 + gender, data = Data,
                         weights = varPower(), method = "ML")
# test the two models against each other (preferred):
AIC(gls_full, gls_nointeraction) # lower value wins
# or test individual interactions:
summary(gls_full)$tTable
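A side note on the variance model (again my own aside): varPower makes the variance a power function of the fitted mean. If what you really suspect is simply a different variance in each gender group, varIdent expresses that directly, and because the models are nested and fit by ML you can also use a likelihood-ratio test:
# A sketch, assuming the unequal variance is between the gender groups:
gls_full_vi <- gls(y ~ (var1 + var2 + var3) * gender, data = Data,
                   weights = varIdent(form = ~ 1 | gender), method = "ML")
gls_nointeraction_vi <- gls(y ~ var1 + var2 + var3 + gender, data = Data,
                            weights = varIdent(form = ~ 1 | gender), method = "ML")
# Likelihood-ratio test of the nested ML fits:
anova(gls_full_vi, gls_nointeraction_vi)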
Option 2 - robust regression, comparison via bootstrap
A second option is to use M-estimation, a form of robust regression that is resistant to outliers and heavy-tailed errors, and hence less sensitive than least squares to unequal variances in groups within the data. Good practice when comparing two models fitted by robust regression is to pick some kind of validation statistic and use the bootstrap to see which of the models, on average, does better on that statistic.
This is a little more complicated, but here is a worked example with your simulated data:
# Option 2 - use robust regression and the bootstrap
library(MASS)
library(boot)
rlm_full <- rlm(y ~ (var1 + var2 + var3) * gender, data = Data)
rlm_nointeraction <- rlm(y ~ var1 + var2 + var3 + gender, data = Data)
# Could just check which model fits best by AIC (lower value wins), though
# AIC for rlm fits rests on a pseudo-likelihood, so treat it cautiously
AIC(rlm_full, rlm_nointeraction)
# or - preferred - use the bootstrap to validate each model and pick the best one.
# First we need a function to give us a performance statistic on how good
# a model is at predicting values compared to actuality. Let's use root
# mean squared error:
RMSE <- function(predicted, actual){
  sqrt(mean((actual - predicted) ^ 2))
}
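As a quick sanity check (my own aside; remember that in-sample error will always flatter the bigger model):
# In-sample RMSE of each model (no newdata, so predict() returns fitted values):
RMSE(predict(rlm_full), Data$y)
RMSE(predict(rlm_nointeraction), Data$y)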
# This function takes a dataset full_data, "scrambled" by the resampling vector i.
# It fits the model to the resampled/scrambled version of the data, and uses this
# to predict the values of y in the full original unscrambled dataset. This is
# described as the "simple bootstrap" in Harrell *Regression Modeling Strategies*,
# building on Efron and Tibshirani.
simple_bootstrap <- function(full_data, i){
  sampled_data <- full_data[i, ]
  rlm_full <- rlm(y ~ (var1 + var2 + var3) * gender, data = sampled_data)
  rlm_nointeraction <- rlm(y ~ var1 + var2 + var3 + gender, data = sampled_data)
  pred_full <- predict(rlm_full, newdata = full_data)
  pred_nointeraction <- predict(rlm_nointeraction, newdata = full_data)
  rmse_full <- RMSE(pred_full, full_data$y)
  rmse_nointeraction <- RMSE(pred_nointeraction, full_data$y)
  return(rmse_full - rmse_nointeraction)
}
rlm_boot <- boot(Data, statistic = simple_bootstrap, R = 500, strata = Data$gender)
# Confidence interval for the improvement from the full model, compared to the one with no interaction:
boot.ci(rlm_boot, type = "perc")
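To read the result (a sketch of the interpretation, not output from a real run):
ci <- boot.ci(rlm_boot, type = "perc")
ci$percent[4:5]  # lower and upper bounds of the 95% percentile interval
# The statistic is rmse_full - rmse_nointeraction, so an interval entirely
# below zero favours the full (interaction) model; an interval containing
# zero gives no clear evidence that the interaction improves prediction.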
Conclusion
Either of the above would be suitable. When I suspect unequal variances, I generally regard the bootstrap as an important part of the inference; it could be used even with nlme::gls, for instance. The bootstrap is more robust than the classical tests and makes redundant many of the older named statistical tests for dealing with specific situations.