How do I implement multiple linear regression in Python?

I am trying to build a multiple linear regression model from scratch to identify the key factors contributing to the number of views of a song on Facebook. For each song we collect the following information, i.e. these are the variables I'm using:

df.dtypes
clicked                      int64
listened_5s                  int64
listened_20s                 int64
views                        int64
percentage_listened          float64
reactions_total              int64
shared_songs                 int64
comments                     int64
avg_time_listened            int64
song_length                  int64
likes                        int64
listened_later               int64

I'm using the number of views as my dependent variable and all other variables in the dataset as independent ones. The model is posted below:

# Imports used below
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# df_x --> new dataframe of independent variables
df_x = df.drop(['views'], axis=1)

# df_y --> new dataframe of the dependent variable views
df_y = df.loc[:, ['views']]   # df.ix is deprecated; use .loc instead

names = list(df_x.columns)

regr = linear_model.LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)

# Fitting the model to the training dataset
regr.fit(x_train, y_train)
regr.intercept_   # inspect the intercept (shown in the output below)
print('Coefficients: \n', regr.coef_)
print("Mean Squared Error (MSE): %.2f"
      % np.mean((regr.predict(x_test) - y_test.values) ** 2))
print('Variance Score: %.2f' % regr.score(x_test, y_test))
regr.coef_[0].tolist()   # coefficients as a plain list

Output:

 regr.intercept_
 array([-1173904.20950487])
 MSE: 19722838329246.82
 Variance Score: 0.99

Looks like something went miserably wrong: the variance score is 0.99, yet the MSE is astronomical (its square root is an error of roughly 4.4 million views per song).
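One way to put that MSE in context (a sketch reusing regr, x_test, and y_test from above) is to bring it back to the scale of the target and compare it with the spread of views itself:

import numpy as np

# RMSE is in the same units as 'views', unlike the squared MSE
rmse = np.sqrt(np.mean((regr.predict(x_test) - y_test.values) ** 2))
print('RMSE: %.0f views' % rmse)
print('Std of views in the test set: %.0f' % y_test.values.std())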

Trying the OLS model:

import statsmodels.api as sm

# Note: sm.OLS does not add an intercept automatically;
# wrap x_train in sm.add_constant() to fit one.
model = sm.OLS(y_train, x_train)
result = model.fit()
print(result.summary())

Output:

     R-squared:                       0.992
     F-statistic:                     6121.   

                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
clicked                 0.3333      0.012     28.257      0.000         0.310     0.356
listened_5s            -0.4516      0.115     -3.944      0.000        -0.677    -0.227
listened_20s            1.9015      0.138     13.819      0.000         1.631     2.172
percentage_listened  7693.2520   1.44e+04      0.534      0.594     -2.06e+04   3.6e+04
reactions_total         8.6680      3.561      2.434      0.015         1.672    15.664
shared_songs          -36.6376      3.688     -9.934      0.000       -43.884   -29.392
comments               34.9031      5.921      5.895      0.000        23.270    46.536
avg_time_listened    1.702e+05   4.22e+04      4.032      0.000      8.72e+04  2.53e+05
song_length         -6309.8021   5425.543     -1.163      0.245      -1.7e+04  4349.413
likes                   4.8448      4.194      1.155      0.249        -3.395    13.085
listened_later         -2.3761      0.160    -14.831      0.000        -2.691    -2.061


Omnibus:                      233.399   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2859.005
Skew:                           1.621   Prob(JB):                         0.00
Kurtosis:                      14.020   Cond. No.                     2.73e+07

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+07. This might indicate that there are strong multicollinearity or other numerical problems.

Just from looking at this output, it looks like something went seriously wrong.
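A quick way to check the multicollinearity that the condition-number warning points to would be variance inflation factors (a sketch assuming the x_train data frame from above; VIFs far above ~10 usually signal trouble):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add an intercept column so the VIFs are computed against a proper model
X = sm.add_constant(x_train)
for i, name in enumerate(X.columns):
    print('%-22s VIF = %.1f' % (name, variance_inflation_factor(X.values, i)))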

I believe that something went wrong with the training/testing sets, or with creating the two separate data frames x and y, but I can't figure out what. This problem should be solvable with multiple regression. Should it perhaps not be linear? Could you please help me figure out what went wrong?

Maddocks answered 15/1, 2018 at 4:58
Most of the columns you are using seem to be an "after-effect" of being viewed. – Grandparent
@VivekKumar What's the recommendation then? Not to use multiple linear regression? What should I use instead? – Maddocks
I don't quite understand what you mean by "multiple linear regression". And in the above comment, I was implying that most of your data may be correlated, as you found in the statsmodels output (maybe because they all depend on "views"). I would advise you to use other features if you can, like the content of the audio, what it is about, the artist, the genre, etc. – Grandparent
I don't have such information, unfortunately. The dataset was given to me as it is, and I have to find the key factors contributing to views (standing for how many times that song's video appeared in users' news feeds). You are right that the other variables can be seen as a post-effect; however, in social networks a share or a like also means that my friends are going to see the video of the song in their news feeds as well. In other words: more likes, more views; more comments, more views. The question now is how I find the key contributors given the data in this set. What would you use? – Maddocks
Then first try standardizing the data, and then use LinearRegression or DecisionTreeRegressor. In LinearRegression, the coef_ attribute will give you the feature importances; in DecisionTreeRegressor, use the feature_importances_ attribute. Also, posting this on Cross Validated may help more. – Grandparent
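A minimal sketch of that suggestion (assuming the df from the question: standardize the predictors, then compare LinearRegression's coef_ with DecisionTreeRegressor's feature_importances_):

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Standardize the predictors so the linear coefficients are on a comparable scale
features = df.drop(['views'], axis=1)
X = StandardScaler().fit_transform(features)
y = df['views']

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Compare the two notions of feature importance side by side
for name, coef, imp in zip(features.columns, lin.coef_, tree.feature_importances_):
    print('%-22s coef = %12.3f   importance = %.3f' % (name, coef, imp))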

I would suggest using regularisation with the Lasso, which also performs feature selection:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Standardize the data (excluding 'views' which is the target variable)
scaler = StandardScaler()
df_x_standardized = scaler.fit_transform(df.drop(['views'], axis=1))
df_y = df['views']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(df_x_standardized, df_y, test_size=0.2, random_state=42)

# Initialize and fit the Lasso regression model with cross-validation
lasso = LassoCV(cv=5, random_state=42).fit(x_train, y_train)

# Print the coefficients and intercept
print('Intercept: ', lasso.intercept_)
print('Coefficients: \n', lasso.coef_)

# Evaluate the model
print("Mean Squared Error (MSE): %.2f" % np.mean((lasso.predict(x_test) - y_test) ** 2))
print('Variance Score (R^2): %.2f' % lasso.score(x_test, y_test))
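Because the inputs are standardized, the fitted coefficients are directly comparable in magnitude, and any feature whose coefficient LassoCV shrinks exactly to zero is effectively dropped from the model, which is the feature selection mentioned above.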
Restorative answered 31/1 at 18:0
