The question:
How can I add a dummy / factor variable to a model using sm.OLS()?
The details:
Data sample structure:
Date A B weekday
2013-05-04 25.03 88.51 Saturday
2013-05-05 52.98 67.99 Sunday
2013-05-06 39.93 75.19 Monday
2013-05-07 47.31 86.99 Tuesday
2013-05-08 19.61 87.94 Wednesday
2013-05-09 39.51 83.10 Thursday
2013-05-10 21.22 62.16 Friday
2013-05-11 19.04 58.79 Saturday
2013-05-12 18.53 75.27 Sunday
2013-05-13 11.90 75.43 Monday
2013-05-14 47.64 64.76 Tuesday
2013-05-15 27.47 91.65 Wednesday
2013-05-16 11.20 59.83 Thursday
2013-05-17 25.10 67.47 Friday
2013-05-18 19.89 64.70 Saturday
2013-05-19 38.91 76.68 Sunday
2013-05-20 42.11 94.36 Monday
2013-05-21 7.845 73.67 Tuesday
2013-05-22 35.45 76.67 Wednesday
2013-05-23 29.43 79.05 Thursday
2013-05-24 33.51 78.53 Friday
2013-05-25 13.58 59.26 Saturday
2013-05-26 37.38 68.59 Sunday
2013-05-27 37.09 67.79 Monday
2013-05-28 21.70 70.54 Tuesday
2013-05-29 11.85 60.00 Wednesday
The following fits a linear regression of A on B using sm.OLS() (including a constant term added with sm.add_constant()).
Complete code with data sample for regression analysis using statsmodels:
# imports
import pandas as pd
import statsmodels.api as sm
# same data as described above
data = {'Date': {0: '2013-05-04',
1: '2013-05-05',
2: '2013-05-06',
3: '2013-05-07',
4: '2013-05-08',
5: '2013-05-09',
6: '2013-05-10',
7: '2013-05-11',
8: '2013-05-12',
9: '2013-05-13',
10: '2013-05-14',
11: '2013-05-15',
12: '2013-05-16',
13: '2013-05-17',
14: '2013-05-18',
15: '2013-05-19',
16: '2013-05-20',
17: '2013-05-21',
18: '2013-05-22',
19: '2013-05-23',
20: '2013-05-24',
21: '2013-05-25',
22: '2013-05-26',
23: '2013-05-27',
24: '2013-05-28',
25: '2013-05-29'},
'A': {0: 25.03,
1: 52.98,
2: 39.93,
3: 47.31,
4: 19.61,
5: 39.51,
6: 21.22,
7: 19.04,
8: 18.53,
9: 11.9,
10: 47.64,
11: 27.47,
12: 11.2,
13: 25.1,
14: 19.89,
15: 38.91,
16: 42.11,
17: 7.845,
18: 35.45,
19: 29.43,
20: 33.51,
21: 13.58,
22: 37.38,
23: 37.09,
24: 21.7,
25: 11.85},
'B': {0: 88.51,
1: 67.99,
2: 75.19,
3: 86.99,
4: 87.94,
5: 83.1,
6: 62.16,
7: 58.79,
8: 75.27,
9: 75.43,
10: 64.76,
11: 91.65,
12: 59.83,
13: 67.47,
14: 64.7,
15: 76.68,
16: 94.36,
17: 73.67,
18: 76.67,
19: 79.05,
20: 78.53,
21: 59.26,
22: 68.59,
23: 67.79,
24: 70.54,
25: 60.0},
'weekday': {0: 'Saturday',
1: 'Sunday',
2: 'Monday',
3: 'Tuesday',
4: 'Wednesday',
5: 'Thursday',
6: 'Friday',
7: 'Saturday',
8: 'Sunday',
9: 'Monday',
10: 'Tuesday',
11: 'Wednesday',
12: 'Thursday',
13: 'Friday',
14: 'Saturday',
15: 'Sunday',
16: 'Monday',
17: 'Tuesday',
18: 'Wednesday',
19: 'Thursday',
20: 'Friday',
21: 'Saturday',
22: 'Sunday',
23: 'Monday',
24: 'Tuesday',
25: 'Wednesday'}}
df = pd.DataFrame(data)
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df['B'].to_frame()
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
Output (shortened):
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -1.4328 17.355 -0.083 0.935 -37.252 34.386
B 0.4034 0.233 1.729 0.097 -0.078 0.885
==============================================================================
Now I'd like to add weekday as an explanatory factor variable. I was hoping it would be as easy as changing the data type in the dataframe, but unfortunately that doesn't work, even though the column is accepted by the x = sm.add_constant(independent) step.
import pandas as pd
import statsmodels.api as sm
df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df[['B', 'weekday']]
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
The model = sm.OLS(df['A'], x).fit() call then raises a ValueError:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
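The check suggested by the error message can be reproduced on a few rows. This is a minimal sketch using only pandas and numpy; the frame name `small` is a hypothetical stand-in for the first rows of the data above, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the data above
small = pd.DataFrame({
    'B': [88.51, 67.99, 75.19],
    'weekday': ['Saturday', 'Sunday', 'Monday'],
})

# Mixing a float column with a string column leaves numpy no common
# numeric type, so the resulting array falls back to dtype object
mixed = np.asarray(small[['B', 'weekday']])
print(mixed.dtype)

# With the numeric column alone the array stays float64
numeric_only = np.asarray(small[['B']])
print(numeric_only.dtype)
```

So the string column appears to be what forces the object dtype that the error complains about.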
Any other suggestions?
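One direction that might work is to encode weekday as numeric indicator columns before building the design matrix. The sketch below is an assumption on my part rather than a confirmed solution: it uses pd.get_dummies with drop_first=True (one weekday becomes the baseline, which avoids perfect collinearity with the constant) and dtype=float (so the columns are not boolean). The names `small`, `dummies` and `design` are hypothetical.

```python
import pandas as pd

# Hypothetical four-row stand-in for the data above
small = pd.DataFrame({
    'A': [25.03, 52.98, 39.93, 47.31],
    'B': [88.51, 67.99, 75.19, 86.99],
    'weekday': ['Saturday', 'Sunday', 'Monday', 'Tuesday'],
})

# One 0/1 column per weekday; drop_first removes the first level
# (alphabetically 'Monday' here), which then acts as the baseline
dummies = pd.get_dummies(small['weekday'], prefix='weekday',
                         drop_first=True, dtype=float)

# Numeric design matrix: B plus the weekday indicators
design = pd.concat([small[['B']], dummies], axis=1)
print(design.dtypes)
```

Since every column of `design` is float, it should be possible to pass it through sm.add_constant() and sm.OLS() in the same way as `independent` above, although I have not verified the fit on the full data.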