Linear regression with dummy/categorical variables

I have a set of data and have used pandas to convert the variables into dummy and categorical variables, respectively. Now I want to know how to run a multiple linear regression in Python (I am using statsmodels). Are there considerations to keep in mind, or do I have to indicate somehow in my code that the variables are dummy/categorical? Or is the transformation of the variables enough, so that I can just run the regression as model = sm.OLS(y, X).fit()?

My code is the following:

import pandas as pd

datos = pd.read_csv("datos_2.csv")
df = pd.DataFrame(datos)  # read_csv already returns a DataFrame, so this wrapper is optional
print(df)

I get this:

Age  Gender    Wage         Job         Classification 
32    Male  450000       Professor           High
28    Male  500000  Administrative           High
40  Female   20000       Professor            Low
47    Male   70000       Assistant         Medium
50  Female  345000       Professor         Medium
27  Female  156000       Assistant            Low
56    Male  432000  Administrative            Low
43  Female  100000  Administrative            Low

Then I encode Gender as 1 = Male, 0 = Female, and Job as 1: Professor, 2: Administrative, 3: Assistant, this way:

df['Sex_male'] = df.Gender.map({'Female': 0, 'Male': 1})
df['Job_index'] = df.Job.map({'Professor': 1, 'Administrative': 2, 'Assistant': 3})
print(df)

Getting this:

 Age  Gender    Wage             Job Classification  Sex_male  Job_index
 32    Male  450000       Professor           High         1          1
 28    Male  500000  Administrative           High         1          2
 40  Female   20000       Professor            Low         0          1
 47    Male   70000       Assistant         Medium         1          3
 50  Female  345000       Professor         Medium         0          1
 27  Female  156000       Assistant            Low         0          3
 56    Male  432000  Administrative            Low         1          2
 43  Female  100000  Administrative            Low         0          2

Now, if I were to run a multiple linear regression, for example:

import statsmodels.api as sm

y = df['Wage']
X = df[['Sex_male', 'Job_index', 'Age']]
X = sm.add_constant(X)
model1 = sm.OLS(y, X).fit()
results1 = model1.summary(alpha=0.05)
print(results1)

The result is shown normally, but is it fine? Or do I have to indicate somehow that the variables are dummy or categorical? Please help, I am new to Python and I want to learn. Greetings from South America - Chile.

Appalling answered 7/6, 2018 at 4:34

You'll need to indicate that either Job or Job_index is a categorical variable; otherwise, Job_index will be treated as a continuous variable (one that just happens to take the values 1, 2, and 3), which isn't right.

You can use a few different kinds of notation in statsmodels; here's the formula approach, which uses C() to mark a variable as categorical:

from statsmodels.formula.api import ols

fit = ols('Wage ~ C(Sex_male) + C(Job) + Age', data=df).fit() 

fit.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Wage   R-squared:                       0.592
Model:                            OLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     1.089
Date:                Wed, 06 Jun 2018   Prob (F-statistic):              0.492
Time:                        22:35:43   Log-Likelihood:                -104.59
No. Observations:                   8   AIC:                             219.2
Df Residuals:                       3   BIC:                             219.6
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept             3.67e+05   3.22e+05      1.141      0.337   -6.57e+05    1.39e+06
C(Sex_male)[T.1]     2.083e+05   1.39e+05      1.498      0.231   -2.34e+05    6.51e+05
C(Job)[T.Assistant] -2.167e+05   1.77e+05     -1.223      0.309    -7.8e+05    3.47e+05
C(Job)[T.Professor] -9273.0556   1.61e+05     -0.058      0.958   -5.21e+05    5.03e+05
Age                 -3823.7419   6850.345     -0.558      0.616   -2.56e+04     1.8e+04
==============================================================================
Omnibus:                        0.479   Durbin-Watson:                   1.620
Prob(Omnibus):                  0.787   Jarque-Bera (JB):                0.464
Skew:                          -0.108   Prob(JB):                        0.793
Kurtosis:                       1.839   Cond. No.                         215.
==============================================================================

Note: Job and Job_index won't use the same categorical level as a baseline, so you'll see slightly different results for the dummy coefficients at each level, even though the overall model fit remains the same.
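
If you want to control which level serves as the baseline, you can pass a Treatment contrast inside C(). Here's a minimal sketch (choosing 'Professor' as the reference is just an illustrative choice; by default patsy uses the first level in sorted order, 'Administrative' here):

from statsmodels.formula.api import ols

# Treatment(reference=...) overrides the default baseline level
fit = ols("Wage ~ C(Sex_male) + C(Job, Treatment(reference='Professor')) + Age",
          data=df).fit()
print(fit.summary())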

Egon answered 7/6, 2018 at 5:40
Ok, thank you very much, but a new question arises: why do the results show a coefficient for only one of the two categories of Sex_male (Male and Female)? And likewise for Job, why only two of its three categories? Could you explain why that happens? - Spokane
In regression, any categorical variable needs to use one level as a baseline against which the other levels are compared. That's how you get separate coefficients for each category level: each coefficient indicates the predictive signal of that level compared to the baseline. The baseline does not get compared to itself, so there is no coefficient for it. You can look into contrasts for more information (but further questions about this are more appropriate for CrossValidated than for StackOverflow). - Egon
@Héctor Alonso if this answer has resolved your original question, please mark it accepted by clicking the check symbol next to the answer. Thanks! - Egon
Done. Thanks :D it was really helpful - Spokane
For some reason, this solution doesn't work for me :/ I have posted my question here: #73373597 - Quondam

In linear regression with categorical variables, you should be careful of the Dummy Variable Trap: a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This makes the design matrix singular, meaning your model just won't work.
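
To see the trap concretely, here is a small sketch using a Gender-like column: a full set of dummy columns always sums to 1, which exactly duplicates the intercept column of the design matrix:

import pandas as pd

gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female'])

# One dummy column per level: 'Female' and 'Male'
full = pd.get_dummies(gender)

# Female + Male == 1 on every row, exactly like the intercept column,
# so the design matrix is perfectly collinear (singular)
print((full.sum(axis=1) == 1).all())  # True

# Dropping one level per variable breaks the collinearity
print(pd.get_dummies(gender, drop_first=True))  # only a 'Male' column remains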

The idea is to use dummy variable encoding with drop_first=True, which omits one column from each category when converting categorical variables into dummy/indicator variables. You will not lose any relevant information by doing that, simply because every point in the dataset can be fully explained by the rest of the features.

Here is the complete code for your jobs dataset.

So you have your X features:

Age, Gender, Job, Classification 

And one numerical feature that you are trying to predict:

Wage

First you need to split your initial dataset into input variables and the prediction target; assuming it's a pandas DataFrame, it would look like this:

Input variables (your dataset is a bit different, but the whole code remains the same: put every column from the dataset in X, except the one that goes to Y. pd.get_dummies works without problems like that - it will just convert the categorical variables and won't touch the numerical ones):

X = jobs[['Age','Gender','Job','Classification']]

Prediction:

Y = jobs['Wage']

Convert the categorical variables into dummy/indicator variables, dropping one in each category:

X = pd.get_dummies(data=X, drop_first=True)

So now, if you check the shape of X (X.shape) with drop_first=True, you will see that it has one fewer column for each of your categorical variables (three fewer than without drop_first in this example).
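
As a quick sanity check on the question's data (a sketch, assuming df holds the eight example rows from above):

import pandas as pd

X = df[['Age', 'Gender', 'Job', 'Classification']]
print(X.shape)  # (8, 4)

# Without drop_first: 1 numeric column + 2 + 3 + 3 dummy columns
print(pd.get_dummies(data=X).shape)  # (8, 9)

# With drop_first: one dummy column dropped per categorical variable
print(pd.get_dummies(data=X, drop_first=True).shape)  # (8, 6)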

You can now continue to use them in your linear model. For a scikit-learn implementation it could look like this:

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)

regr = linear_model.LinearRegression()  # do not use fit_intercept=False if you have dropped a column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
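
To inspect the fit, you could then look at the learned coefficients and a simple holdout score; a minimal sketch continuing from the variables above (with only eight rows the numbers are illustrative, not meaningful):

from sklearn.metrics import r2_score

# Coefficients line up with X_train.columns; the dropped baseline
# levels are absorbed into the intercept
for name, coef in zip(X_train.columns, regr.coef_):
    print(f"{name}: {coef:.2f}")
print("intercept:", regr.intercept_)

print("test R^2:", r2_score(Y_test, predicted))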
Dysphemism answered 13/12, 2018 at 6:20
Hello! How do I plot these values? Mine says that the X and y values are not the same size. - Pavilion
Very concise descriptions! Thanks! - Expectorate
