Using categorical variables in statsmodels OLS class
I want to use statsmodels OLS class to create a multiple regression model. Consider the following dataset:

import statsmodels.api as sm
import pandas as pd
import numpy as np

dict = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
  'debt_ratio':np.random.randn(5), 'cash_flow':np.random.randn(5) + 90} 

df = pd.DataFrame.from_dict(dict)

x = df[['debt_ratio', 'industry']]
y = df['cash_flow']

def reg_sm(x, y):
    x = np.array(x).T
    x = sm.add_constant(x)
    results = sm.OLS(endog = y, exog = x).fit()
    return results

When I run the following code:

reg_sm(x, y)

I get the following error:

TypeError: '>=' not supported between instances of 'float' and 'str'

I've tried converting the industry variable to categorical, but I still get an error. I'm out of options.

Diallage answered 18/4, 2019 at 1:45 Comment(1)
This is because 'industry' is a categorical variable, but OLS expects numbers (this can be seen from its source code). Drop industry, or group your data by industry and apply OLS to each group.Dikdik

You're on the right path with converting to a Categorical dtype. However, once you convert the DataFrame to a NumPy array, you get an object dtype (a NumPy array has one uniform dtype as a whole). This means the individual values are still str underneath, which a regression is definitely not going to like.
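A quick illustration of that collapse (the values here are made up):

```python
import numpy as np
import pandas as pd

# A DataFrame with mixed column dtypes collapses to a single object dtype
# when converted to a NumPy array.
df = pd.DataFrame({"industry": ["mining", "finance"], "debt_ratio": [0.5, 1.2]})
arr = np.asarray(df)
print(arr.dtype)         # object
print(type(arr[0, 0]))   # <class 'str'> -- the values are still str underneath
```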

What you might want to do is to dummify this feature. Instead of factorizing it, which would effectively treat the variable as continuous, you want to maintain some semblance of categorization:

>>> import statsmodels.api as sm
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> data = {
...     'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
...    'debt_ratio':np.random.randn(5),
...    'cash_flow':np.random.randn(5) + 90
... }
>>> data = pd.DataFrame.from_dict(data)
>>> data = pd.concat((
...     data,
...     pd.get_dummies(data['industry'], drop_first=True)), axis=1)
>>> # You could also use data.drop('industry', axis=1)
>>> # in the call to pd.concat()
>>> data
         industry  debt_ratio  cash_flow  finance  hospitality  mining  transportation
0          mining    0.357440  88.856850        0            0       1               0
1  transportation    0.377538  89.457560        0            0       0               1
2     hospitality    1.382338  89.451292        0            1       0               0
3         finance    1.175549  90.208520        1            0       0               0
4   entertainment   -0.939276  90.212690        0            0       0               0

Now you have dtypes that statsmodels can better work with. The purpose of drop_first is to avoid the dummy trap:

>>> y = data['cash_flow']
>>> x = data.drop(['cash_flow', 'industry'], axis=1)
>>> sm.OLS(y, x).fit()
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x115b87cf8>
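To see why the dummy trap matters, here is a minimal sketch with a toy series (not the answer's data): if you keep all k dummy columns, they sum to the constant column, so the design matrix is rank-deficient.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])

# All 3 levels kept: the dummy columns sum to 1 in every row, which duplicates
# the constant column -- the design matrix loses a rank (the dummy trap).
full = pd.get_dummies(s, dtype=int)
X_full = np.column_stack([np.ones(4), full.to_numpy()])
print(np.linalg.matrix_rank(X_full))   # 3, although there are 4 columns

# drop_first=True removes one level, restoring full column rank.
reduced = pd.get_dummies(s, drop_first=True, dtype=int)
X_red = np.column_stack([np.ones(4), reduced.to_numpy()])
print(np.linalg.matrix_rank(X_red))    # 3 == number of columns
```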

Lastly, just a small pointer: try to avoid giving references names that shadow built-in types, such as dict.

Arbitral answered 18/4, 2019 at 2:4 Comment(3)
As alternative to using pandas for creating the dummy variables, the formula interface automatically converts string categorical through patsy.Concenter
@Concenter Can you elaborate on how to (cleanly) do that? As Pandas is converting any string to np.object. And I get ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data). when I tried to use a DataFrame with some columns being strings.Madore
add dtype='int' to avoid boolean values in the dummies #71705740Ullrich

I had this problem as well, with lots of columns that needed to be treated as categorical, which makes dummifying them quite annoying to deal with. Converting to string didn't work for me either.

For anyone looking for a solution without one-hot encoding the data, the R-style formula interface provides a nice way of doing this:

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

data = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
  'debt_ratio': np.random.randn(5), 'cash_flow': np.random.randn(5) + 90}

df = pd.DataFrame.from_dict(data)

x = df[['debt_ratio', 'industry']]
y = df['cash_flow']

# NB. unlike sm.OLS, an intercept term is included automatically here
smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()

Reference: https://www.statsmodels.org/stable/example_formulas.html#categorical-variables
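A follow-up that often comes up is predicting with categorical features. With the formula interface you can pass a DataFrame with the same column names and the categorical encoding is re-applied. A minimal sketch (the data are made up to mirror the example above, repeated so every industry appears more than once):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)
df = pd.DataFrame({
    "industry": ["mining", "transportation", "hospitality",
                 "finance", "entertainment"] * 2,
    "debt_ratio": np.random.randn(10),
    "cash_flow": np.random.randn(10) + 90,
})

model = smf.ols("cash_flow ~ debt_ratio + C(industry)", data=df).fit()

# Predict on new rows: same column names, levels seen during fitting.
new = pd.DataFrame({"industry": ["mining", "finance"], "debt_ratio": [0.1, 0.5]})
print(model.predict(new))
```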

Madore answered 6/9, 2019 at 6:45 Comment(6)
How to predict with cat features in this case?Gustafsson
It should be similar to what has been discussed here. statsmodels.org/stable/examples/notebooks/generated/…Madore
Personally, I would have accepted this answer, it is much cleaner (and I don't know R)!Sorus
the function .get_prediction() doesnt seem to work on this method. I need to get the confidence/prediction intervals out. Is there a way to do this with the smf r syntax?Willhite
@OceanScientist In the latest version of statsmodels (v0.12.2), .get_prediction() method works.Minos
Otherwise model.predict(dataframe) works.Delldella

Just another example with categorical variables, from a similar case; it matches the results from a statistics course taught in R (Hanken, Finland).

import wooldridge as woo
import statsmodels.formula.api as smf
import numpy as np

df = woo.dataWoo('beauty')
print(df.describe())

df['abvavg'] = (df['looks']>=4).astype(int) # good looking
df['belavg'] = (df['looks']<=2).astype(int) # bad looking

df_female = df[df['female']==1]
df_male = df[df['female']==0]

results_female = smf.ols(formula='np.log(wage) ~ belavg + abvavg', data=df_female).fit()
print(f"FEMALE results, summary \n {results_female.summary()}")

results_male = smf.ols(formula='np.log(wage) ~ belavg + abvavg', data=df_male).fit()
print(f"MALE results, summary \n {results_male.summary()}")

Best regards, Markus

Logorrhea answered 24/4, 2022 at 9:17 Comment(0)
