Linear regression analysis with string/categorical features (variables)?
Asked Answered
P

4

77

Regression algorithms seem to be working on features represented as numbers. For example:

simple data without categorical features

This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price.


But now I want to do a regression analysis on data that contain categorical features:

data-set with categorical features

There are 5 features: District, Condition, Material, Security, Type


How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? I mean if I have to create some encoding rules and according to that rules transform all data to numeric values.

Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Are there some risks that the regression model will be somehow incorrect due to "bad encoding"?

Polacca answered 30/11, 2015 at 20:21 Comment(1)
Quick note to point to 2 open-source Python packages for encoding of categorical variables: category encoders (contrib.scikit-learn.org/category_encoders) and Feature-engine (feature-engine.readthedocs.io/en/latest/index.html)Barfuss
M
106

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.

Usually there are three possibilities:

  1. One-Hot encoding for categorical data
  2. Arbitrary numbers for ordinal data
  3. Use something like group means for categorical data (e. g. mean prices for city districts).

You have to be carefull to not infuse information you do not have in the application case.

One hot encoding

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.

E. g.

idx color
0   blue
1   green
2   green
3   red

to

idx blue green red
0   1    0     0
1   0    1     0
2   0    1     0
3   0    0     1

This can easily be done with pandas:

import pandas as pd

data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(data))

will result in:

   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1

Numbers for ordinal data

Create a mapping of your sortable categories, e. g. old < renovated < new → 0, 1, 2

This is also possible with pandas:

data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
data['q'] = data['q'].cat.codes
print(data['q'])

Result:

0    0
1    2
2    2
3    1
Name: q, dtype: int8

Using categorical data for groupby operations

You could use the mean for each category over past (known events).

Say you have a DataFrame with the last known mean prices for cities:

prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})

print(data.merge(mean_price, on='city', how='left'))

Result:

  city  price
0    A      1
1    B      2
2    C      3
3    A      1
4    B      2
5    A      1
Mistiemistime answered 30/11, 2015 at 21:25 Comment(2)
But how could hotencoding help you when you will try to predict a new color ? Maybe in your case you have to retrain the model. Do you have any solution ?Folder
@Folder you can see #57946506Icj
G
23

In linear regression with categorical variables you should be careful of the Dummy Variable Trap. The Dummy Variable trap is a scenario in which the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others. This can produce singularity of a model, meaning your model just won't work. Read about it here

Idea is to use dummy variable encoding with drop_first=True, this will omit one column from each category after converting categorical variable into dummy/indicator variables. You WILL NOT lose any relevant information by doing that simply because your all point in dataset can fully be explained by rest of the features.

Here is complete code on how you can do it for your housing dataset

So you have categorical features:

District, Condition, Material, Security, Type

And one numerical features that you are trying to predict:

Price

First you need to split your initial dataset on input variables and prediction, assuming its pandas dataframe it would look like this:

Input variables:

X = housing[['District','Condition','Material','Security','Type']]

Prediction:

Y = housing['Price']

Convert categorical variable into dummy/indicator variables and drop one in each category:

X = pd.get_dummies(data=X, drop_first=True)

So now if you check shape of X with drop_first=True you will see that it has 4 columns less - one for each of your categorical variables.

You can now continue to use them in your linear model. For scikit-learn implementation it could look like this:

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .20, random_state = 40)

regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
Gaylagayle answered 13/12, 2018 at 6:14 Comment(3)
what does "4 columns less - one for each of your categorical variables" mean?Osanna
Idea is to use dummy variable encoding with drop_first=True, this will omit one column from each category after converting categorical variable into dummy/indicator variables. So for 4 categorical variables you will omit all together 4 columns.Gaylagayle
How can we convert this model into coremltoolsCompletion
B
17

You can use "Dummy Coding" in this case. There are Python libraries to do dummy coding, you have a few options:

  • You may use scikit-learn library. Take a look at here.
  • Or, if you are working with pandas, it has a built-in function to create dummy variables.

An example with pandas is below:

import pandas as pd

sample_data = [[1,2,'a'],[3,4,'b'],[5,6,'c'],[7,8,'b']]
df = pd.DataFrame(sample_data, columns=['numeric1','numeric2','categorical'])
dummies = pd.get_dummies(df.categorical)
df.join(dummies)
Bottoms answered 30/11, 2015 at 21:9 Comment(0)
R
11

One way to achieve regression with categorical variables as independent variables is as mentioned above - Using encoding. Another way of doing is by using R like statistical formula using statmodels library. Here is a code snippet

from statsmodels.formula.api import ols
tips = sns.load_dataset("tips")

model = ols('tip ~ total_bill + C(sex) + C(day) + C(day) + size', data=tips)
fitted_model = model.fit()
fitted_model.summary()

Dataset

total_bill  tip     sex  smoker day  time  size
0   16.99   1.01    Female  No  Sun Dinner  2
1   10.34   1.66    Male    No  Sun Dinner  3
2   21.01   3.50    Male    No  Sun Dinner  3
3   23.68   3.31    Male    No  Sun Dinner  2
4   24.59   3.61    Female  No  Sun Dinner  4

Summary of regression

enter image description here

Ruhnke answered 4/7, 2019 at 0:14 Comment(5)
Elegant solution which saves you from going through encoding!Yawn
what does C(sex) or C(day) do?New
They treat sex and day as categorical variables and mostly create dummy variables under the hoodRuhnke
what does T.Fri mean in the above output?Noellenoellyn
And how do you make predictions with the categorical variables?Heelpost

© 2022 - 2024 — McMap. All rights reserved.