I am trying to build a multiple linear regression model from scratch to identify the key factors contributing to the number of views of a song on Facebook. For each song we collect the following information; these are the variables I'm using:
df.dtypes
clicked int64
listened_5s int64
listened_20s int64
views int64
percentage_listened float64
reactions_total int64
shared_songs int64
comments int64
avg_time_listened int64
song_length int64
likes int64
listened_later int64
I'm using the number of views as my dependent variable and all other variables in the dataset as independent variables. The model is shown below:
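import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# df_x --> new dataframe of independent variables
df_x = df.drop(columns=['views'])
# df_y --> new dataframe of the dependent variable, views
# (df.ix was removed from pandas; plain column selection does the same thing here)
df_y = df[['views']]
names = list(df_x)
regr = linear_model.LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)
# Fitting the model to the training dataset
regr.fit(x_train, y_train)
regr.intercept_
print('Coefficients: \n', regr.coef_)
# y_test is a one-column DataFrame, so convert it to a NumPy array before averaging
print("Mean Squared Error(MSE): %.2f"
      % np.mean((regr.predict(x_test) - y_test.to_numpy()) ** 2))
# score() returns R^2 on the test set
print('Variance Score: %.2f' % regr.score(x_test, y_test))
regr.coef_[0].tolist()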
Output here:
regr.intercept_
array([-1173904.20950487])
MSE: 19722838329246.82
Variance Score: 0.99
It looks like something went badly wrong.
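One way to put the MSE above in perspective: MSE is measured in squared views, so its square root (RMSE) is on the same scale as views and can be compared with typical view counts. This is a minimal sketch using only the MSE printed above. The helper `relative_rmse` and the value of `typical_views` are hypothetical placeholders, not figures from the dataset:

```python
import math

# MSE reported in the output above (units: views squared)
mse = 19722838329246.82

# RMSE is on the same scale as the target variable (views)
rmse = math.sqrt(mse)
print("RMSE: %.2f views" % rmse)


def relative_rmse(rmse_value, typical_value):
    """Return RMSE as a fraction of a typical target value."""
    return rmse_value / typical_value


# Hypothetical placeholder: substitute a real figure such as df['views'].mean()
typical_views = 50_000_000
print("RMSE relative to typical views: %.1f%%"
      % (100 * relative_rmse(rmse, typical_views)))
```

Whether the error is large depends on how it compares with the real scale of `views`, not on the raw size of the MSE.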
Trying the OLS model:
import statsmodels.api as sm

# Note: sm.OLS does not add an intercept unless a constant column is added to x_train
model = sm.OLS(y_train, x_train)
result = model.fit()
print(result.summary())
Output:
R-squared: 0.992
F-statistic: 6121.
                           coef    std err          t      P>|t|    [95.0% Conf. Int.]
clicked                  0.3333      0.012     28.257      0.000       0.310     0.356
listened_5s             -0.4516      0.115     -3.944      0.000      -0.677    -0.227
listened_20s             1.9015      0.138     13.819      0.000       1.631     2.172
percentage_listened   7693.2520   1.44e+04      0.534      0.594   -2.06e+04   3.6e+04
reactions_total          8.6680      3.561      2.434      0.015       1.672    15.664
shared_songs           -36.6376      3.688     -9.934      0.000     -43.884   -29.392
comments                34.9031      5.921      5.895      0.000      23.270    46.536
avg_time_listened     1.702e+05   4.22e+04      4.032      0.000    8.72e+04  2.53e+05
song_length          -6309.8021   5425.543     -1.163      0.245    -1.7e+04  4349.413
likes                    4.8448      4.194      1.155      0.249      -3.395    13.085
listened_later          -2.3761      0.160    -14.831      0.000      -2.691    -2.061
Omnibus:          233.399   Durbin-Watson:        1.983
Prob(Omnibus):      0.000   Jarque-Bera (JB):  2859.005
Skew:               1.621   Prob(JB):              0.00
Kurtosis:          14.020   Cond. No.          2.73e+07
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+07. This might indicate that there are strong multicollinearity or other numerical problems.
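A large condition number can come from features on very different scales as well as from features that are genuinely collinear. The sketch below uses synthetic data (not the song dataset) and NumPy's `np.linalg.cond` to tell the two cases apart: standardizing the columns removes the scale effect, but not the effect of near-duplicate columns. All variable names, and the helper `standardize`, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000


def standardize(X):
    # Center each column and scale it to unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)


# Case 1: independent columns on very different scales
small = rng.normal(0.0, 1.0, n)
large = rng.normal(0.0, 1e6, n)
X_scale = np.column_stack([small, large])

# Case 2: two nearly identical columns (genuine multicollinearity)
base = rng.normal(0.0, 1.0, n)
near_copy = base + rng.normal(0.0, 1e-4, n)
X_collinear = np.column_stack([base, near_copy])

cond_scale_raw = np.linalg.cond(X_scale)
cond_scale_std = np.linalg.cond(standardize(X_scale))
cond_coll_raw = np.linalg.cond(X_collinear)
cond_coll_std = np.linalg.cond(standardize(X_collinear))

print("different scales: raw %.3g, standardized %.3g"
      % (cond_scale_raw, cond_scale_std))
print("near-duplicates:  raw %.3g, standardized %.3g"
      % (cond_coll_raw, cond_coll_std))
```

If the condition number of the real design matrix stays large after standardizing, that points to collinearity among the predictors rather than to units alone.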
Just from looking at this output, it seems that something went seriously wrong.
I believe that something went wrong with the training/testing sets and with creating the two separate data frames x and y, but I can't figure out what. This problem should be solvable with multiple regression. Should the model not be linear? Could you please help me figure out what went wrong?
coef_ will give you the feature importance. In DecisionTreeRegressor, use the feature_importances_ attribute. Also, posting this on Cross Validated may help more. – Grandparent
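On reading coef_ as feature importance: raw coefficients are expressed per unit of each feature, so they are only directly comparable when the features share a scale. This is a minimal sketch on synthetic data (not the song dataset), using plain NumPy least squares rather than scikit-learn. The helper `fit_ols` is a hypothetical name introduced here for illustration:

```python
import numpy as np


def fit_ols(X, y):
    # Least-squares fit with an intercept; returns only the slope coefficients
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 1.0, n)       # feature measured in small units
x2 = rng.normal(0.0, 1000.0, n)    # feature measured in large units
# By construction, both features contribute about equally to y
y = 2.0 * x1 + 0.002 * x2 + rng.normal(0.0, 0.1, n)

X = np.column_stack([x1, x2])
raw_coefs = fit_ols(X, y)

# Standardize so each coefficient means "change in y per one standard deviation"
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_coefs = fit_ols(X_std, y)

print("raw coefficients:         ", raw_coefs)
print("standardized coefficients:", std_coefs)
```

The raw coefficients differ by roughly three orders of magnitude purely because of units, while the standardized ones come out at a similar size.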