is there a way to have a similar, nice output for the scikit logistic regression models as in statsmodels? With all the p-values, std. errors etc. in one table?
As you and others have pointed out, this is a limitation of scikit-learn. Before describing a scikit-learn approach to your question, note that the “best” option is to use statsmodels, as follows:
import statsmodels.api as sm
smlog = sm.Logit(y,sm.add_constant(X)).fit()
smlog.summary()
X represents your input features/predictors matrix and y represents the outcome variable. Statsmodels works well if X lacks highly correlated features, lacks low-variance features, has no feature(s) that produce “perfect/quasi-perfect separation”, and has any categorical features reduced to “n-1” levels, i.e., dummy-coded (and not “n” levels, i.e., one-hot encoded, as described here: dummy variable trap).
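As a minimal sketch of the “n-1” dummy coding mentioned above, assuming your predictors live in a pandas DataFrame: the toy columns `age` and `region` are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical toy data: one numeric feature and one categorical feature with 3 levels
df = pd.DataFrame({
    "age": [34, 51, 29, 46],
    "region": ["north", "south", "west", "south"],
})

# drop_first=True keeps n-1 indicator columns; the dropped level ("north",
# first in sorted order) becomes the reference category for the coefficients
X = pd.get_dummies(df, columns=["region"], drop_first=True, dtype=float)

print(X.columns.tolist())  # ['age', 'region_south', 'region_west']
```

The coefficients for `region_south` and `region_west` are then interpreted relative to the dropped level.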
However, if the above isn't feasible/practical, one scikit-learn approach is coded below for fairly equivalent results, in terms of feature coefficients/odds ratios with their standard errors and 95% CI estimates. Essentially, the code derives these results from separate scikit-learn logistic regression models, each trained on a distinct test-train split of your data. Again, make sure categorical features are dummy-coded to n-1 levels (or your scikit-learn coefficients will be incorrect for categorical features).
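import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
#Instantiate logistic regression model with regularization turned OFF
#(scikit-learn >= 1.2 expects penalty=None; older versions use penalty="none")
log_nr = LogisticRegression(fit_intercept=True, penalty=None)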
##Generate 5 distinct random numbers - as random seeds for 5 test-train splits
import random
randomlist = random.sample(range(1, 10000), 5)
##Create features column
coeff_table = pd.DataFrame(X.columns, columns=["features"])
##Assemble coefficients over logistic regression models on 5 random data splits
#iterate over random states while keeping track of `i`
from sklearn.model_selection import train_test_split
for i, state in enumerate(randomlist):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state)  #5 test-train splits
    log_nr.fit(train_x, train_y)  #fit logistic model
    coeff_table[f"coefficients_{i+1}"] = log_nr.coef_.ravel()  #one coefficient per feature
##Calculate mean and std error for model coefficients (from 5 models above)
coeff_table["mean_coeff"] = coeff_table.iloc[:, 1:6].mean(axis=1)  #skip the non-numeric "features" column
coeff_table["se_coeff"] = coeff_table.iloc[:, 1:6].sem(axis=1)
#Calculate 95% CI intervals for feature coefficients
coeff_table["95ci_se_coeff"] = 1.96*coeff_table["se_coeff"]
coeff_table["coeff_95ci_LL"] = coeff_table["mean_coeff"] - coeff_table["95ci_se_coeff"]
coeff_table["coeff_95ci_UL"] = coeff_table["mean_coeff"] + coeff_table["95ci_se_coeff"]
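To make the arithmetic behind the mean, standard error and 95% CI columns concrete, here is a small standard-library sketch for a single feature. The five coefficient values are made up for illustration, and `summarize_coefficients` is a hypothetical helper name, not part of pandas or scikit-learn.

```python
import math
import statistics

def summarize_coefficients(values, z=1.96):
    """Mean, standard error of the mean, and normal-approximation CI
    for one feature's coefficients collected across several splits."""
    n = len(values)
    mean = statistics.mean(values)
    # Sample standard deviation (ddof=1) divided by sqrt(n),
    # which matches the default behaviour of pandas' .sem()
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = z * sem
    return {
        "mean": mean,
        "sem": sem,
        "ci_lower": mean - half_width,
        "ci_upper": mean + half_width,
    }

# Hypothetical coefficients for one feature from 5 test-train splits
example = summarize_coefficients([0.50, 0.55, 0.45, 0.52, 0.48])
print(example)
```

The multiplier 1.96 is the normal-approximation value used in the answer's code; you can pass a different `z` if you prefer another critical value.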
Finally, (optionally) convert the coefficients to odds ratios by exponentiating them, as follows. Odds ratios are my favorite output from logistic regression, and the code below appends them to your dataframe.
#Calculate odds ratios and 95% CI (LL = lower limit, UL = upper limit) intervals for each feature
coeff_table["odds_mean"] = np.exp(coeff_table["mean_coeff"])
coeff_table["95ci_odds_LL"] = np.exp(coeff_table["coeff_95ci_LL"])
coeff_table["95ci_odds_UL"] = np.exp(coeff_table["coeff_95ci_UL"])
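The conversion above can be checked by hand for a single feature. This sketch uses only the standard library; `to_odds_ratio` is a hypothetical helper and the input numbers are chosen purely for illustration.

```python
import math

def to_odds_ratio(mean_coeff, ci_lower, ci_upper):
    """Exponentiate a log-odds coefficient and its CI bounds.

    The interval is built on the coefficient (log-odds) scale first and
    only then exponentiated, so it is not symmetric around the odds ratio.
    """
    return math.exp(mean_coeff), math.exp(ci_lower), math.exp(ci_upper)

# Illustrative values: a coefficient of 0 with bounds of -ln(2) and +ln(2)
odds, odds_ll, odds_ul = to_odds_ratio(0.0, -math.log(2), math.log(2))
print(odds, odds_ll, odds_ul)  # approximately 1.0, 0.5, 2.0
```

An odds ratio of 1 means no association, which is why an interval that spans 1 corresponds to a coefficient interval that spans 0.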
This answer builds upon a somewhat related reply by @pciunkiewicz, available here: Collate model coefficients across multiple test-train splits from sklearn