Linear Regression: How to find the distance between the points and the prediction line?

I'm looking to find the distance between the points and the prediction line. Ideally I would like the results to be displayed in a new column which contains the distance, called 'Distance'.

My Imports:

import os.path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
%matplotlib inline 

Sample of my data:

idx  Exam Results  Hours Studied
0       93          8.232795
1       94          7.879095
2       92          6.972698
3       88          6.854017
4       91          6.043066
5       87          5.510013
6       89          5.509297

My code so far:

x = df['Hours Studied'].values[:,np.newaxis]
y = df['Exam Results'].values

model = LinearRegression()
model.fit(x, y)

plt.scatter(x, y,color='r')
plt.plot(x, model.predict(x),color='k')
plt.show()

My plot

Any help would be greatly appreciated. Thanks

Apsis answered 16/4, 2018 at 14:21 Comment(1)
Check this answer: #39840530 - Tartarean

You simply need to assign the difference between y and model.predict(x) to a new column (or take the absolute value if you only want the magnitude of the difference):

#df["Distance"] = abs(y - model.predict(x))  # if you only want magnitude
df["Distance"] = y - model.predict(x)
print(df)
#   Exam Results  Hours Studied  Distance
#0            93       8.232795 -0.478739
#1            94       7.879095  1.198511
#2            92       6.972698  0.934043
#3            88       6.854017 -2.838712
#4            91       6.043066  1.714063
#5            87       5.510013 -1.265269
#6            89       5.509297  0.736102

This works because your model predicts a y value (the dependent variable) for each x value (the independent variable). Each point and its prediction share the same x coordinate, so the difference in y is the vertical distance from the point to the line.
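As a self-contained sketch of the above, the snippet below rebuilds the seven sample rows from the question, fits the model, and stores the vertical distance (the residual) in a 'Distance' column. The 'Perpendicular' column is an extra assumption, not something the question asked for: it applies the standard point-to-line formula |residual| / sqrt(1 + slope^2), which is only meaningful if both axes are on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample rows copied from the question.
df = pd.DataFrame({
    "Exam Results": [93, 94, 92, 88, 91, 87, 89],
    "Hours Studied": [8.232795, 7.879095, 6.972698, 6.854017,
                      6.043066, 5.510013, 5.509297],
})

X = df[["Hours Studied"]]   # 2-D input, as scikit-learn expects
y = df["Exam Results"]

model = LinearRegression().fit(X, y)

# Vertical distance: observed value minus predicted value (signed residual).
df["Distance"] = y - model.predict(X)

# Optional (assumption): shortest, perpendicular distance to the fitted line.
slope = model.coef_[0]
df["Perpendicular"] = df["Distance"].abs() / np.sqrt(1 + slope ** 2)

print(df)
```

Positive values in 'Distance' mean the point lies above the line, negative values mean it lies below.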

Micmac answered 16/4, 2018 at 14:32 Comment(5)
I get this error when I run that line of code: ValueError: Length of values does not match length of index. Any idea why? The shape of x is (132, 1), and the shape of y is (132,). - Apsis
What's the length of your dataframe? That error indicates that the problem comes from the df["Distance"] = part rather than the y - model.predict(x) part. You could also do df["Distance"] = df['Exam Results'].values - model.predict(df['Hours Studied'].values[:,np.newaxis]). - Micmac
The length of my dataframe is 1789, but that last piece of code seems to do the trick, thanks very much. Any idea why I encountered this problem? - Apsis
The error occurred because you were trying to assign 132 values to a dataframe with 1789 rows. I suspect that you built your model on only a subset of the data and then tried to calculate the Distance for every row. - Micmac
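The mismatch described in these comments can be reproduced with a small sketch. The data here is synthetic and the sizes (20 rows, 8 used for fitting) are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the full dataframe (assumption: 20 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Hours Studied": rng.uniform(1, 10, size=20)})
df["Exam Results"] = 60 + 4 * df["Hours Studied"] + rng.normal(0, 2, size=20)

# The model is fitted on a subset only (assumption: the first 8 rows).
train = df.iloc[:8]
x = train["Hours Studied"].values[:, np.newaxis]
y = train["Exam Results"].values
model = LinearRegression().fit(x, y)

# Assigning 8 residuals to a 20-row dataframe fails.
error_message = None
try:
    df["Distance"] = y - model.predict(x)
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Predicting for every row of the dataframe gives one value per row.
all_x = df["Hours Studied"].values[:, np.newaxis]
df["Distance"] = df["Exam Results"].values - model.predict(all_x)
```

Computing both terms from the dataframe itself keeps the lengths in sync regardless of which rows were used to fit the model.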
I know this is an old post, but can anyone help with how to do this for each group after a groupby? - Adalard
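One possible approach to the per-group case, as a sketch only: fit a separate LinearRegression for each group and write the residuals back by index. The 'Class' grouping column and the data below are hypothetical, invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data with a made-up 'Class' grouping column.
df = pd.DataFrame({
    "Class": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Hours Studied": [8.2, 7.9, 7.0, 6.9, 6.0, 5.5, 5.4, 4.8],
    "Exam Results": [93.0, 94.0, 92.0, 88.0, 91.0, 87.0, 89.0, 84.0],
})

df["Distance"] = np.nan
for _, group in df.groupby("Class"):
    X = group[["Hours Studied"]]
    y = group["Exam Results"]
    # One model per group, fitted only on that group's rows.
    model = LinearRegression().fit(X, y)
    # The residuals keep the group's index, so they align on assignment.
    df.loc[group.index, "Distance"] = y - model.predict(X)

print(df)
```

Each row's 'Distance' is then measured against the line fitted to its own group rather than a single line for the whole dataframe.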
