How do I implement multiple linear regression in Python?

I am trying to build a multiple linear regression model from scratch to identify the key factors contributing to the number of views of a song on Facebook. For each song we collect the following information, i.e. these are the variables I'm using:

df.dtypes
clicked                      int64
listened_5s                  int64
listened_20s                 int64
views                        int64
percentage_listened          float64
reactions_total              int64
shared_songs                 int64
comments                     int64
avg_time_listened            int64
song_length                  int64
likes                        int64
listened_later               int64

I'm using the number of views as my dependent variable and all other variables in the dataset as independent ones. The model is posted below:

# Imports used below
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# df_x --> new dataframe of independent variables
df_x = df.drop(['views'], axis=1)

# df_y --> new dataframe of the dependent variable views
df_y = df.loc[:, ['views']]   # df.ix is deprecated; use .loc instead

names = list(df_x.columns)

regr = linear_model.LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)

# Fitting the model to the training dataset
regr.fit(x_train, y_train)
regr.intercept_   # inspect the intercept (shown in the output below)
print('Coefficients: \n', regr.coef_)
print("Mean Squared Error (MSE): %.2f"
      % np.mean((regr.predict(x_test) - y_test.values) ** 2))
print('Variance Score: %.2f' % regr.score(x_test, y_test))
regr.coef_[0].tolist()   # coefficients as a plain list

Output:

 regr.intercept_
 array([-1173904.20950487])
 MSE: 19722838329246.82
 Variance Score: 0.99

Looks like something went miserably wrong: the variance score is 0.99, yet the MSE is astronomical (its square root is an error of roughly 4.4 million views per song).
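One way to put that MSE in context (a sketch reusing regr, x_test, and y_test from above) is to bring it back to the scale of the target and compare it with the spread of views itself:

import numpy as np

# RMSE is in the same units as 'views', unlike the squared MSE
rmse = np.sqrt(np.mean((regr.predict(x_test) - y_test.values) ** 2))
print('RMSE: %.0f views' % rmse)
print('Std of views in the test set: %.0f' % y_test.values.std())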

Trying the OLS model:

import statsmodels.api as sm

# Note: sm.OLS does not add an intercept automatically;
# wrap x_train in sm.add_constant() to fit one.
model = sm.OLS(y_train, x_train)
result = model.fit()
print(result.summary())

Output:

     R-squared:                       0.992
     F-statistic:                     6121.   

                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
clicked                 0.3333      0.012     28.257      0.000         0.310     0.356
listened_5s            -0.4516      0.115     -3.944      0.000        -0.677    -0.227
listened_20s            1.9015      0.138     13.819      0.000         1.631     2.172
percentage_listened  7693.2520   1.44e+04      0.534      0.594     -2.06e+04   3.6e+04
reactions_total         8.6680      3.561      2.434      0.015         1.672    15.664
shared_songs          -36.6376      3.688     -9.934      0.000       -43.884   -29.392
comments               34.9031      5.921      5.895      0.000        23.270    46.536
avg_time_listened    1.702e+05   4.22e+04      4.032      0.000      8.72e+04  2.53e+05
song_length         -6309.8021   5425.543     -1.163      0.245      -1.7e+04  4349.413
likes                   4.8448      4.194      1.155      0.249        -3.395    13.085
listened_later         -2.3761      0.160    -14.831      0.000        -2.691    -2.061


Omnibus:                      233.399   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2859.005
Skew:                           1.621   Prob(JB):                         0.00
Kurtosis:                      14.020   Cond. No.                     2.73e+07

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+07. This might indicate that there are strong multicollinearity or other numerical problems.

Just from looking at this output, it looks like something went seriously wrong.
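A quick way to check the multicollinearity that the condition-number warning points to would be variance inflation factors (a sketch assuming the x_train data frame from above; VIFs far above ~10 usually signal trouble):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add an intercept column so the VIFs are computed against a proper model
X = sm.add_constant(x_train)
for i, name in enumerate(X.columns):
    print('%-22s VIF = %.1f' % (name, variance_inflation_factor(X.values, i)))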

I believe that something went wrong with the training/testing sets, or with creating the two separate data frames x and y, but I can't figure out what. This problem should be solvable with multiple regression. Should it perhaps not be linear? Could you please help me figure out what went wrong?

Maddocks answered 15/1, 2018 at 4:58
Most of the columns you are using seem to be an "after-effect" of being viewed. – Grandparent
@VivekKumar What's the recommendation then? Not to use multiple linear regression? What should I use instead? – Maddocks
I don't quite understand what you mean by "multiple linear regression". And in the above comment, I was implying that most of your data may be correlated, as you found in the statsmodels output (maybe because they all depend on "views"). I would advise you to use other features if you can, like the content of the audio, what it is about, the artist, the genre, etc. – Grandparent
I don't have such information, unfortunately. The dataset was given to me as it is, and I have to find the key factors contributing to views (standing for how many times that song's video appeared in users' news feeds). You are right that the other variables can be seen as a post-effect; however, in social networks a share or a like also means that my friends are going to see the video of the song in their news feeds as well. In other words: more likes, more views; more comments, more views. The question now is how I find the key contributors given the data in this set. What would you use? – Maddocks
Then first try standardizing the data, and then use LinearRegression or DecisionTreeRegressor. In LinearRegression, the coef_ attribute will give you the feature importances; in DecisionTreeRegressor, use the feature_importances_ attribute. Also, posting this on Cross Validated may help more. – Grandparent
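A minimal sketch of that suggestion (assuming the df from the question: standardize the predictors, then compare LinearRegression's coef_ with DecisionTreeRegressor's feature_importances_):

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Standardize the predictors so the linear coefficients are on a comparable scale
features = df.drop(['views'], axis=1)
X = StandardScaler().fit_transform(features)
y = df['views']

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Compare the two notions of feature importance side by side
for name, coef, imp in zip(features.columns, lin.coef_, tree.feature_importances_):
    print('%-22s coef = %12.3f   importance = %.3f' % (name, coef, imp))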

I would suggest using regularisation with the Lasso, which also performs feature selection:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Standardize the data (excluding 'views' which is the target variable)
scaler = StandardScaler()
df_x_standardized = scaler.fit_transform(df.drop(['views'], axis=1))
df_y = df['views']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(df_x_standardized, df_y, test_size=0.2, random_state=42)

# Initialize and fit the Lasso regression model with cross-validation
lasso = LassoCV(cv=5, random_state=42).fit(x_train, y_train)

# Print the coefficients and intercept
print('Intercept: ', lasso.intercept_)
print('Coefficients: \n', lasso.coef_)

# Evaluate the model
print("Mean Squared Error (MSE): %.2f" % np.mean((lasso.predict(x_test) - y_test) ** 2))
print('Variance Score (R^2): %.2f' % lasso.score(x_test, y_test))
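Because the inputs are standardized, the fitted coefficients are directly comparable in magnitude, and any feature whose coefficient LassoCV shrinks exactly to zero is effectively dropped from the model, which is the feature selection mentioned above.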
Restorative answered 31/1 at 18:0
