You just need to standardize your original DataFrame using z-scores (i.e., subtract each column's mean and divide by its standard deviation) first, and then perform a linear regression on the standardized data.
Assume your DataFrame is named df, with independent variables x1, x2, and x3, and dependent variable y. Consider the following code:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.formula.api as smf
# standardizing dataframe
df_z = df.select_dtypes(include=[np.number]).dropna().apply(stats.zscore)
# fitting regression
formula = 'y ~ x1 + x2 + x3'
result = smf.ols(formula, data=df_z).fit()
# checking results
print(result.summary())
Now, the coef column will show you the standardized (beta) coefficients so that you can compare their influence on your dependent variable.
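If you want to try this end to end without your own data, here is a minimal self-contained sketch; the synthetic df and its column names are made up purely for illustration:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# build a small synthetic DataFrame (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x1': rng.normal(size=100),
    'x2': rng.normal(size=100),
    'x3': rng.normal(size=100),
})
df['y'] = 2 * df['x1'] - 0.5 * df['x2'] + rng.normal(size=100)

# standardize every numeric column, then fit
df_z = df.select_dtypes(include=[np.number]).dropna().apply(stats.zscore)
result = smf.ols('y ~ x1 + x2 + x3', data=df_z).fit()
print(result.summary())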
Notes:
- Keep in mind that you need .dropna(); otherwise, stats.zscore will return all NaN for a column if it has any missing values (see the first sketch after these notes).
- Instead of using .select_dtypes(), you can select the columns manually, but make sure all the columns you select are numeric (second sketch below).
- If you only care about the standardized (beta) coefficients, you can use result.params to return just those. They will usually be displayed in scientific notation; you can use something like round(result.params, 5) to round them (third sketch below).
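To see why .dropna() matters (first note), here is a quick sketch of the default stats.zscore behavior on a column with a missing value:

import numpy as np
from scipy import stats

col = np.array([1.0, 2.0, np.nan, 4.0])
# the NaN propagates through the mean and std, so every entry becomes NaN
print(stats.zscore(col))  # [nan nan nan nan]

Newer SciPy versions also accept nan_policy='omit', but dropping the rows up front keeps all columns of the DataFrame aligned.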
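For the second note, manual selection might look like this (df and the column names are assumed from the example above):

from scipy import stats

# pick the model's columns explicitly instead of using .select_dtypes();
# all of them must be numeric for stats.zscore to work
df_z = df[['y', 'x1', 'x2', 'x3']].dropna().apply(stats.zscore)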
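And for the third note, pulling out just the coefficients from the result fitted above:

# result.params is a pandas Series of the standardized coefficients
betas = result.params
print(round(betas, 5))   # round() works directly on a Series
print(betas.round(5))    # equivalent Series method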