I have a data set, and I have used pandas to convert some of its columns into dummy and categorical variables. Now I want to know how to run a multiple linear regression in Python (I am using statsmodels). Are there special considerations, or do I have to indicate in my code that the variables are dummy/categorical? Or is transforming the variables enough, so that I can just run the regression as model = sm.OLS(y, X).fit()?
My code is the following:
import pandas as pd

# read_csv already returns a DataFrame; df is a second frame built from it
datos = pd.read_csv("datos_2.csv")
df = pd.DataFrame(datos)
print(df)
I get this:
Age  Gender  Wage    Job             Classification
32   Male    450000  Professor       High
28   Male    500000  Administrative  High
40   Female  20000   Professor       Low
47   Male    70000   Assistant       Medium
50   Female  345000  Professor       Medium
27   Female  156000  Assistant       Low
56   Male    432000  Administrative  Low
43   Female  100000  Administrative  Low
Then I encode Gender as 1 = Male, 0 = Female, and Job as 1 = Professor, 2 = Administrative, 3 = Assistant, like this:
df['Sex_male']=df.Gender.map({'Female':0,'Male':1})
df['Job_index']=df.Job.map({'Professor':1,'Administrative':2,'Assistant':3})
print(df)
I get this:
Age  Gender  Wage    Job             Classification  Sex_male  Job_index
32   Male    450000  Professor       High            1         1
28   Male    500000  Administrative  High            1         2
40   Female  20000   Professor       Low             0         1
47   Male    70000   Assistant       Medium          1         3
50   Female  345000  Professor       Medium          0         1
27   Female  156000  Assistant       Low             0         3
56   Male    432000  Administrative  Low             1         2
43   Female  100000  Administrative  Low             0         2
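My doubt about Job_index is that coding the jobs as 1, 2, 3 makes them look like a numeric scale, so a regression would fit a single slope across the three jobs. The alternative I have seen mentioned is one 0/1 column per job, with one job left out as the reference level. A minimal sketch of that encoding with pd.get_dummies, assuming an inline copy of the rows above instead of datos_2.csv (the names job_dummies and df_encoded are only illustrative):

```python
import pandas as pd

# Inline copy of the sample rows shown above (stands in for datos_2.csv).
df = pd.DataFrame({
    "Age": [32, 28, 40, 47, 50, 27, 56, 43],
    "Gender": ["Male", "Male", "Female", "Male",
               "Female", "Female", "Male", "Female"],
    "Wage": [450000, 500000, 20000, 70000, 345000, 156000, 432000, 100000],
    "Job": ["Professor", "Administrative", "Professor", "Assistant",
            "Professor", "Assistant", "Administrative", "Administrative"],
})

# One 0/1 column per job level. drop_first=True drops the first level in
# sorted order ("Administrative"), which then acts as the reference category.
job_dummies = pd.get_dummies(df["Job"], prefix="Job", drop_first=True, dtype=int)

# Attach the dummy columns to the original frame.
df_encoded = pd.concat([df, job_dummies], axis=1)
print(df_encoded[["Job", "Job_Assistant", "Job_Professor"]])
```

With this encoding, the columns Job_Assistant and Job_Professor would replace Job_index in X.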
Now, if I run a multiple linear regression, for example:
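import statsmodels.api as sm

# Use df, the frame that holds the new Sex_male and Job_index columns
y = df['Wage']
X = df[['Sex_male', 'Job_index', 'Age']]
X = sm.add_constant(X)  # add the intercept column
model1 = sm.OLS(y, X).fit()
results1 = model1.summary(alpha=0.05)
print(results1)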
The summary is printed without errors, but is the model specified correctly this way? Or do I have to indicate somehow that the variables are dummy or categorical? Please help; I am new to Python and I want to learn. Greetings from Chile, South America.
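For comparison, the other approach I am aware of is the statsmodels formula interface, where wrapping a column name in C() marks it as categorical and the dummy columns and intercept are built automatically. A minimal sketch, assuming the same column names as above and an inline copy of the data instead of datos_2.csv (model2 is just an illustrative name, and I am not sure this is the recommended way):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Inline copy of the sample rows shown above (stands in for datos_2.csv).
df = pd.DataFrame({
    "Age": [32, 28, 40, 47, 50, 27, 56, 43],
    "Gender": ["Male", "Male", "Female", "Male",
               "Female", "Female", "Male", "Female"],
    "Wage": [450000, 500000, 20000, 70000, 345000, 156000, 432000, 100000],
    "Job": ["Professor", "Administrative", "Professor", "Assistant",
            "Professor", "Assistant", "Administrative", "Administrative"],
})

# C(...) tells the formula parser to treat the column as categorical.
# Each categorical column gets one coefficient per non-reference level,
# and the intercept is included without calling add_constant.
model2 = smf.ols("Wage ~ Age + C(Gender) + C(Job)", data=df).fit()

# Five coefficients: intercept, Age, one for Gender, two for Job.
print(model2.params)
```

Here the original text columns Gender and Job are used directly, so the manual Sex_male and Job_index mappings would not be needed.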