How do I calculate r-squared using Python and Numpy?

Asked 21/5, 2009 at 15:55 Answered 17/2 at 22:6

Solved python math statistics numpy curve-fitting

140

I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (coefficient of correlation) and r-squared(coefficient of determination). I am comparing my results with Excel's best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best-fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here's my function:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)
     # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    correlation = numpy.corrcoef(x, y)[0,1]

     # r
    results['correlation'] = correlation
     # r-squared
    results['determination'] = correlation**2

    return results

Joellyn answered 21/5, 2009 at 15:55 Comment(4)

Note: you use the degree only in the calculation of coeffs. – Floaty 21/5, 2009 at 17:11

tydok is correct. You are calculating the correlation of x and y and r-squared for y=p_0 + p_1 * x. See my answer below for some code that should work. If you don't mind me asking, what is your ultimate goal? Are you doing model selection (choosing what degree to use)? Or something else? – Granitite 21/5, 2009 at 22:51

@Granitite -- The request boils down to "do it like Excel does". I'm getting the feeling from these answers that the users may be reading too much into the r-squared value when using a non-linear best-fit curve. Nonetheless, I'm not a math wizard, and this is the requested functionality. – Joellyn 22/5, 2009 at 0:45

side question : doesn't pandas corr() function return the r^"2 pearson coeffcient? – Welsh 18/2, 2021 at 10:6

From the numpy.polyfit documentation, it is fitting linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

E(y|x) = p_d * x**d + p_{d-1} * x **(d-1) + ... + p_1 * x + p_0

So you just need to calculate the R-squared for that fit. The wikipedia page on linear regression gives full details. You are interested in R^2 which you can calculate in a couple of ways, the easisest probably being

SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST

Where I use 'y_bar' for the mean of the y's, and 'y_ihat' to be the fit value for each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

     # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared
    p = numpy.poly1d(coeffs)
    # fit values, and mean
    yhat = p(x)                         # or [p(z) for z in x]
    ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
    ssreg = numpy.sum((yhat-ybar)**2)   # or sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar)**2)    # or sum([ (yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results

Granitite answered 21/5, 2009 at 20:48 Comment(3)

I just want to point out that using the numpy array functions instead of list comprehension will be much faster, e.g. numpy.sum((yi - ybar)**2) and easier to read – Wrinkle 18/10, 2010 at 3:31

According to wiki page en.wikipedia.org/wiki/Coefficient_of_determination, the most general definition of R^2 is R^2 = 1 - SS_err/SS_tot, with R^2 = SS_reg/SS_tot being just a special case. – Cloudberry 29/4, 2013 at 0:3

What about R squared for a non-linear least square function? – Hierarch 27/10, 2021 at 8:19

201

A very late reply, but just in case someone needs a ready function for this:

scipy.stats.linregress

i.e.

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)

as in @Adam Marples's answer.

Topminnow answered 4/10, 2009 at 21:15 Comment(3)

It's reasonable to analyze with coefficient of correlation, and then to do the bigger job, regression. – Lunette 17/1, 2012 at 18:59

This reply only works for linear regression, which is the simplest polynomial regression – Comfy 4/8, 2015 at 9:49

Caution: r_value here is a Pearson's correlation coefficient, not R-squared. r_squared = r_value**2 – Cumuliform 12/2, 2020 at 7:36

From yanl (yet-another-library) sklearn.metrics has an r2_score function;

from sklearn.metrics import r2_score

coefficient_of_dermination = r2_score(y, p(x))

Gaskell answered 12/6, 2015 at 13:41 Comment(7)

(Beware: "Default value corresponds to ‘variance_weighted’, this behaviour is deprecated since version 0.17 and will be changed to ‘uniform_average’ starting from 0.19") – Cavorelievo 4/8, 2017 at 22:3

r2_score in sklearn could be negative value, which is not the normal case. – Gallinaceous 21/11, 2017 at 11:25

Why is r2_score([1,2,3],[4,5,7]) = -16? – Neral 1/9, 2020 at 10:43

One thing I like is it doesn't require training the model -- often I'm computing metrics from models trained in different environment. – Volny 11/12, 2020 at 15:23

@cz the formula for R2 is R2= 1 - (SSres/SStot) here the calculation: average of array 1 equals to 2, so. ((4-1)^2 + (5-2)^2 + (7-3)^2) / ((1-2)^2 + (2-2)^2 + (3-2)^2) = 17 1 - 17 = -16. – Watkin 9/5, 2023 at 22:24

@cz I am assuming we are used to seeing a positive R2 because a real best fit line would never fit the points [1, 2, 3] with [4, 5, 7] – Dactylogram 23/6, 2023 at 21:14

Why using Numpy do we get this? y_true = [1,2,3] y_pred = [4,5,7] print(np.corrcoef(y_true, y_pred)[0,1]**2) # -> 0.9642857142857141 – Peplos 21/3 at 16:12

From the numpy.polyfit documentation, it is fitting linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

E(y|x) = p_d * x**d + p_{d-1} * x **(d-1) + ... + p_1 * x + p_0

SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST

Where I use 'y_bar' for the mean of the y's, and 'y_ihat' to be the fit value for each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

     # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared
    p = numpy.poly1d(coeffs)
    # fit values, and mean
    yhat = p(x)                         # or [p(z) for z in x]
    ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
    ssreg = numpy.sum((yhat-ybar)**2)   # or sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar)**2)    # or sum([ (yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results

Granitite answered 21/5, 2009 at 20:48 Comment(3)

I just want to point out that using the numpy array functions instead of list comprehension will be much faster, e.g. numpy.sum((yi - ybar)**2) and easier to read – Wrinkle 18/10, 2010 at 3:31

What about R squared for a non-linear least square function? – Hierarch 27/10, 2021 at 8:19

I have been using this successfully, where x and y are array-like.

Note: for linear regression only

def rsquared(x, y):
    """ Return R^2 where x and y are array-like."""

    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
    return r_value**2

Dordrecht answered 17/12, 2012 at 16:37 Comment(7)

This is not Perason's coefficient of determination, but the square of the correlation coefficient - something else entirely. – Spurgeon 24/10, 2021 at 5:34

@Spurgeon It's my understanding that the coefficient of determination is the square of the coefficient of correlation – Dordrecht 25/10, 2021 at 9:25

I think this is only true when using linear regression: en.wikipedia.org/wiki/Coefficient_of_determination "One class of such cases includes that of simple linear regression where r2 is used instead of R2. When only an intercept is included, then r2 is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values." – Spurgeon 26/10, 2021 at 0:4

@Spurgeon I am using r**2 from linear regression in my answer, scipy.stats.linregress, so it is correct – Dordrecht 27/10, 2021 at 18:3

I think you are confusing linear regression, which can fit a polynomial of any degree, with fitting a linear model. You are using linear regression - but only to fit a polynomial of the first degree. So, you are not really answering the question, which was about the "...best fit polynomial of arbitrary degree". I suggest you edit your answer to reflect that it is only useful for polynomials of first degree (i.e., model function a + bx). – Spurgeon 27/10, 2021 at 20:8

Ah yes I did not properly read the question. In my defence it was 9 years ago and I still haven't. – Dordrecht 29/10, 2021 at 0:41

r-squared is already defined here: en.wikipedia.org/wiki/Coefficient_of_determination and it is so easy to calculate?! – Your 14/4, 2023 at 12:37

I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.

statsmodels has the capability to calculate the r^2 of a polynomial fit directly, here are 2 methods...

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Construct the columns for the different powers of x
def get_r2_statsmodels(x, y, k=1):
    xpoly = np.column_stack([x**i for i in range(k+1)])    
    return sm.OLS(y, xpoly).fit().rsquared

# Use the formula API and construct a formula describing the polynomial
def get_r2_statsmodels_formula(x, y, k=1):
    formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
    data = {'x': x, 'y': y}
    return smf.ols(formula, data).fit().rsquared # or rsquared_adj

To further take advantage of statsmodels, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.

model = sm.OLS(y, xpoly)
results = model.fit()
results.summary()

Below is my original Answer where I benchmarked various linear regression r^2 methods...

The corrcoef function used in the Question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r.

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2

These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:

Pure Python (direct r calculation)
- 1000 loops, best of 3: 1.59 ms per loop
Numpy polyfit (applicable to n-th degree polynomial fits)
- 1000 loops, best of 3: 326 µs per loop
Numpy Manual (direct r calculation)
- 10000 loops, best of 3: 62.1 µs per loop
Numpy corrcoef (direct r calculation)
- 10000 loops, best of 3: 56.6 µs per loop
Scipy (linear regression with r as an output)
- 1000 loops, best of 3: 676 µs per loop
Statsmodels (can do n-th degree polynomial and many other fits)
- 1000 loops, best of 3: 422 µs per loop

The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than the scipy.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.

import numpy as np
from scipy import stats
import statsmodels.api as sm
import math

n=1000
x = np.random.rand(1000)*10
x.sort()
y = 10 * x + (5+np.random.randn(1000)*10-5)

x_list = list(x)
y_list = list(y)

def get_r2_numpy(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
    return r_squared
    
def get_r2_scipy(x, y):
    _, _, r_value, _, _ = stats.linregress(x, y)
    return r_value**2
    
def get_r2_statsmodels(x, y):
    return sm.OLS(y, sm.add_constant(x)).fit().rsquared
    
def get_r2_python(x_list, y_list):
    n = len(x_list)
    x_bar = sum(x_list)/n
    y_bar = sum(y_list)/n
    x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
    y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
    zx = [(xi-x_bar)/x_std for xi in x_list]
    zy = [(yi-y_bar)/y_std for yi in y_list]
    r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
    return r**2
    
def get_r2_numpy_manual(x, y):
    zx = (x-np.mean(x))/np.std(x, ddof=1)
    zy = (y-np.mean(y))/np.std(y, ddof=1)
    r = np.sum(zx*zy)/(len(x)-1)
    return r**2
    
def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2
    
print('Python')
%timeit get_r2_python(x_list, y_list)
print('Numpy polyfit')
%timeit get_r2_numpy(x, y)
print('Numpy Manual')
%timeit get_r2_numpy_manual(x, y)
print('Numpy corrcoef')
%timeit get_r2_numpy_corrcoef(x, y)
print('Scipy')
%timeit get_r2_scipy(x, y)
print('Statsmodels')
%timeit get_r2_statsmodels(x, y)

7/28/21 Benchmark results. (Python 3.7, numpy 1.19, scipy 1.6, statsmodels 0.12)

Python
2.41 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy polyfit
318 µs ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy Manual
79.3 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy corrcoef
83.8 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Scipy
221 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Statsmodels
375 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Righthander answered 5/1, 2016 at 17:21 Comment(8)

You are comparing 3 methods with fitting a slope and regression with 3 methods without fitting a slope. – Wrinkle 5/1, 2016 at 18:37

Yeah, I knew that much... but now I feel silly for not reading the original question and seeing that it uses corrcoef already and is specifically addressing r^2 for higher order polynomials... now I feel silly for posting my benchmarks which were for a different purpose. Oops... – Righthander 5/1, 2016 at 18:44

I've updated my answer with a solution to the original question using statsmodels, and apologized for the needless benchmarking of linear regression r^2 methods, which I kept as interesting, yet off-topic info. – Righthander 5/1, 2016 at 20:32

I still find the benchmark interesting because I didn't expect scipy's linregress to be slower than statsmodels which does more generic work. – Wrinkle 5/1, 2016 at 21:50

Note, np.column_stack([x**i for i in range(k+1)]) can be vectorized in numpy with x[:,None]**np.arange(k+1) or using numpy's vander functions which have reversed order in columns. – Wrinkle 5/1, 2016 at 21:51

There is a typo, it should be n = len(x_list) for pure Python – Itin 17/7, 2020 at 17:59

Thanks @FabianSchn. - fixed – Righthander 27/8, 2020 at 16:42

For what it's worth @Wrinkle 2016 comment about linregress being slower than statsmodels is not true in 2021. I'm assuming linregress got faster. My 2021 benchmark results have been added to the answer. – Righthander 28/7, 2021 at 17:26

Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):

from __future__ import division 
import numpy as np

def compute_r2_weighted(y_true, y_pred, weight):
    sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    tse = (weight * (y_true - np.average(
        y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse

Example:

from __future__ import print_function, division 
import sklearn.metrics 

def compute_r2_weighted(y_true, y_pred, weight):
    sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    tse = (weight * (y_true - np.average(
        y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse    

def compute_r2(y_true, y_predicted):
    sse = sum((y_true - y_predicted)**2)
    tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse

def main():
    '''
    Demonstrate the use of compute_r2_weighted() and checks the results against sklearn
    '''        
    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    weight = [1, 5, 1, 2]
    r2_score = sklearn.metrics.r2_score(y_true, y_pred)
    print('r2_score: {0}'.format(r2_score))  
    r2_score,_,_ = compute_r2(np.array(y_true), np.array(y_pred))
    print('r2_score: {0}'.format(r2_score))
    r2_score = sklearn.metrics.r2_score(y_true, y_pred,weight)
    print('r2_score weighted: {0}'.format(r2_score))
    r2_score,_,_ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
    print('r2_score weighted: {0}'.format(r2_score))

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

outputs:

r2_score: 0.9486081370449679
r2_score: 0.9486081370449679
r2_score weighted: 0.9573170731707317
r2_score weighted: 0.9573170731707317

This corresponds to the formula (mirror):

with f_i is the predicted value from the fit, y_{av} is the mean of the observed data y_i is the observed data value. w_i is the weighting applied to each data point, usually w_i=1. SSE is the sum of squares due to error and SST is the total sum of squares.

If interested, the code in R: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f (mirror)

Cavorelievo answered 7/8, 2017 at 0:55 Comment(0)

Here's a very simple python function to compute R^2 from the actual and predicted values assuming y and y_hat are pandas series:

def r_squared(y, y_hat):
    y_bar = y.mean()
    ss_tot = ((y-y_bar)**2).sum()
    ss_res = ((y-y_hat)**2).sum()
    return 1 - (ss_res/ss_tot)

Nim answered 1/5, 2020 at 18:48 Comment(2)

This formula gives a different answer than the numpy module for non-trivial data. This is likely because r_squared is an optimization problem with multiple solutions for the slope and offset of the best fit line. – Diastema 22/8, 2021 at 0:0

The function above applies to any model, linear, nonlinear, ML etc… It only looks at the differences between the predicted values and the actual values. Each model will typically create a different R^2. Fitting a given model involves minimizing R^2 by varying the parameters of the model. A straight line fit for a curve with one independent variable and one dependent variable has a unique solution (the local minima == the global minima). More complicated models, particularly with additional independent variables, may have many local minima and finding the global minima may be very difficult. – Nim 22/8, 2021 at 19:54

R-squared is a statistic that only applies to linear regression.

Essentially, it measures how much variation in your data can be explained by the linear regression.

So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean. . .

formula1

where y_bar is the mean of the y's.

Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean

formula2

and find the ratio of those two.

Now, all you would have to do for a polynomial fit is plug in the y_hat's from that model, but it's not accurate to call that r-squared.

Here is a link I found that speaks to it a little.

Numismatist answered 21/5, 2009 at 16:54 Comment(4)

This seems to be the root of my problem. How does Excel get a different r-squared value for a polynomial fit vs. a linear regression then? – Joellyn 21/5, 2009 at 16:59

are you just giving excel the fits from a linear regression, and the fits from a polynomial model? It's going to calculate the rsq from two arrays of data, and just assume that you're giving it the fits from a linear model. What are you giving excel? What is the 'best fit trendline' command in excel? – Numismatist 21/5, 2009 at 17:45

It's part of the graphing functions of Excel. You can plot some data, right-click on it, then choose from several different types of trend lines. There is the option to see the equation of the line as well as an r-squared value for each type. The r-squared value is also different for each type. – Joellyn 21/5, 2009 at 20:19

@Travis Beale -- you are going to get a different r-squared for each different mean function you try (unless two models are nested and the extra coeffecients in the larger model all work to be 0). So of course Excel gives a different r-squared values. @Numismatist -- this is linear regression so it is r-squared. – Granitite 21/5, 2009 at 20:20

The wikipedia article on r-squareds suggests that it may be used for general model fitting rather than just linear regression.

Goldstein answered 25/8, 2009 at 6:6 Comment(1)

Here's a good description of the issue with R2 for non-linear regression: blog.minitab.com/blog/adventures-in-statistics/… – Uric 10/4, 2015 at 15:57

Using the numpy module (tested in python3):

import numpy as np
def linear_regression(x, y): 
    coefs = np.polynomial.polynomial.polyfit(x, y, 1)
    ffit = np.poly1d(coefs)
    m = ffit[0]
    b = ffit[1] 
    eq = 'y = {}x + {}'.format(round(m, 3), round(b, 3))
    rsquared = np.corrcoef(x, y)[0, 1]**2
    return rsquared, eq, m, b

rsquared, eq, m, b = linear_regression(x,y)
print(rsquared, m, b)
print(eq)

Output:

0.013378252355751777 0.1316331351105754 0.7928782850418713 
y = 0.132x + 0.793

Note: r² ≠ R²
r² is called the "Coefficient of Determination"
R² is the square of the Pearson Coefficient

R², officially conflated as r², is probably the one you want, as it's a least-square fit, which is better than the simple fraction of sums that r² is. Numpy is not afraid to call it "corrcoef", which presupposes Pearson is the de-facto correlation coefficient.

Diastema answered 21/8, 2021 at 23:50 Comment(1)

I posted this solution because the wikipedia article formula gives a different result than the numpy solution. I believe the numpy module is correct because the wikipedia formula does not consider that multiple solutions exist (different slope and offsets of best fit line) and numpy apparently solves an actual optimization problem and not just calculate a fraction of sums. Evidence of the [simple] wikipedia formula being wrong is that it produces negative r_squared values, which means it's coming up with the wrong slope for the best fit line for non-trivial data. – Diastema 22/8, 2021 at 0:11

You can execute this code directly, this will find you the polynomial, and will find you the R-value you can put a comment down below if you need more explanation.

from scipy.stats import linregress
import numpy as np

x = np.array([1,2,3,4,5,6])
y = np.array([2,3,5,6,7,8])

p3 = np.polyfit(x,y,3) # 3rd degree polynomial, you can change it to any degree you want
xp = np.linspace(1,6,6)  # 6 means the length of the line
poly_arr = np.polyval(p3,xp)

poly_list = [round(num, 3) for num in list(poly_arr)]
slope, intercept, r_value, p_value, std_err = linregress(x, poly_list)
print(r_value**2)

Pulmonic answered 1/7, 2020 at 13:38 Comment(0)

From scipy.stats.linregress source. They use the average sum of squares method.

import numpy as np

x = np.array(x)
y = np.array(y)

# average sum of squares:
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat

r_num = ssxym
r_den = np.sqrt(ssxm * ssym)
r = r_num / r_den

if r_den == 0.0:
    r = 0.0
else:
    r = r_num / r_den

    if r > 1.0:
        r = 1.0
    elif r < -1.0:
        r = -1.0

Buzzell answered 16/11, 2019 at 0:43 Comment(0)

you can do it directly in numpy==1.26 without any extra libraries by passing full=True:

np.polyfit(xs, ys, 1, full=True)

Example:

# make 5 (x, y) points with fake variation
xs = np.linspace(0, 1, 5)
ys = np.array([0, 0.22, 0.55, 0.70, 1])

# ss_r: residuals – sum of squared residuals of the least squares fit.
#       ss_r will be 1D array if `ys` is 1D.
(m, b), ss_r, _, _, _ = np.polyfit(xs, ys, 1, full=True)
m, b, ss_r

(0.9920000000000001 -0.001999999999999821 array([0.00568]))

# calculate line via `polyfit` then total sum of squares
# y = mx + b
r_ys = xs*m + b
ss_tot = np.sum((r_ys - np.mean(r_ys)) ** 2)

r_squared = 1 - (ss_r / ss_tot)
r_squared

array([0.99630472])

Dosi answered 17/2 at 22:6 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags