How to get the numerical fitting results when plotting a regression in seaborn
If I use the seaborn library in Python to plot the result of a linear regression, is there a way to find out the numerical results of the regression? For example, I might want to know the fitting coefficients or the R² of the fit.

I could re-run the same fit using the underlying statsmodels interface, but that would seem to be unnecessary duplicate effort, and anyway I'd want to be able to compare the resulting coefficients to be sure the numerical results are the same as what I'm seeing in the plot.

Nevlin answered 4/4, 2014 at 2:6 Comment(2)
does anyone know if you can plot the actual values for each bar on a bar graph in seaborn, rather than guessing the value by looking across at the Y axis and trying to match it up? In all the examples I have seen of the seaborn visualisation library, no one has put actual values on the individual bars to show the exact values; they are all purely visual.Irregular
See here for a solution to a possibly duplicated question. As noted in the comments on that answer, one can easily get the line's equation from two points and then plot it.Encaustic
There's no way to do this.

In my opinion, asking a visualization library to give you statistical modeling results is backwards. statsmodels, a modeling library, lets you fit a model and then draw a plot that corresponds exactly to the model you fit. If you want that exact correspondence, this order of operations makes more sense to me.

You might say "but the plots in statsmodels don't have as many aesthetic options as seaborn". But I think that makes sense — statsmodels is a modeling library that sometimes uses visualization in the service of modeling. seaborn is a visualization library that sometimes uses modeling in the service of visualization. It is good to specialize, and bad to try to do everything.

Fortunately, both seaborn and statsmodels use tidy data. That means you need very little duplicated effort to get both plots and models through the appropriate tools.

Medication answered 4/4, 2014 at 2:9 Comment(12)
I think it's inside-out or upside-down if you try to get the regression out of the plot instead of the plot from the regression.Slapstick
@user333700, agreed. I am not using seaborn currently because of this limitation, though I might take a look at it. If there isn't a way to do it now, I might suggest a feature where a fit object from statsmodels could be used as an input to the appropriate seaborn plotting functions.Nevlin
Isn't there a line being drawn after the regression? even in the most basic seaborn example there is a line being drawn with a clear slope and intercept. stanford.edu/~mwaskom/software/seaborn/tutorial/regression.htmlLias
@mwaskom, I just got notified this question has gotten 2500 views. Just a data point in case you're wondering how many people are looking for this feature.Nevlin
OK, Thanks. I'd +1 again if I could. FWIW, the use case I have in mind is if I use seaborn to make a plot and put it in a powerpoint. And if someone asks "what's the R-squared for that fit?" or something like that, I want to be able to confidently give an answer that really matches the plot I showed them.Nevlin
I agree that Seaborn should include it. I use this library a lot for data exploration and it is always useful to be able to quickly see the actual coefficient value and R square without having to create a separate model for it.Recommendation
@Medication sometimes one uses a visualisation with a regression model to say 'oh! there's a correlation, I did not expect that!' but most times one uses a regression model, to actually model something. And to do that, you need the parameters of the model. This is the reason why this question has 6000+ views.Goulash
@user333700 why would you want to run a regression twice? Seaborn is already driving the car for you, it just forgets to tell you where it's parked. It sends you a snapshot and wishes you good luck finding it.Goulash
Still relevant. I was trusting seaborn with the regression, but since I can't check the parameters used, not much point in it... good to know that it is better to do it myself. One less library to use....Superfluity
Is there an easy way to get seaborn to plot the model returned from statsmodels? That seems like something that's very easy to implement without hurting specialization (or might already exist).Pernell
This really seems like a basic requirement even for a visualization package. In most situations it is not acceptable to present a figure without reporting a p-value, an R² value, and the coefficients. I would not consider this a specialized feature. As others have mentioned in the comments, it really renders the seaborn regression useless for any legitimate purpose like a research article.Dorsy
@mwaskom: I appreciate your position, but suppose I follow your advice: I run a statsmodels ("sm") regression on my data, but there's no given way to show that estimator in the graph or to guarantee that the estimator shown is the one found in sm. As a novice, I'm just trying to replace Google Sheets with numpy/seaborn, and Google Sheets has a "show formula" option that I use all the time. At the opposite end of the spectrum, above, people require this for use in research papers. support.google.com/docs/thread/11313654?msgid=11379847 Would you accept a "show_equation" pull request?Decastyle

Seaborn's creator has unfortunately stated that he won't add such a feature. Below are some options. (The last section contains my original suggestion, which was a hack that used private implementation details of seaborn and was not particularly flexible.)

Simple alternative version of regplot

The following function overlays a fit line on a scatter plot and returns the results from statsmodels. This supports the simplest and perhaps most common usage for sns.regplot, but does not implement any of the fancier functionality.

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm


def simple_regplot(
    x, y, n_std=2, n_pts=100, ax=None, scatter_kws=None, line_kws=None, ci_kws=None
):
    """ Draw a regression line with error interval. """
    ax = plt.gca() if ax is None else ax

    # calculate best-fit line and interval
    x_fit = sm.add_constant(x)
    fit_results = sm.OLS(y, x_fit).fit()

    eval_x = sm.add_constant(np.linspace(np.min(x), np.max(x), n_pts))
    pred = fit_results.get_prediction(eval_x)

    # draw the fit line and error interval
    ci_kws = {} if ci_kws is None else ci_kws
    ax.fill_between(
        eval_x[:, 1],
        pred.predicted_mean - n_std * pred.se_mean,
        pred.predicted_mean + n_std * pred.se_mean,
        alpha=0.5,
        **ci_kws,
    )
    line_kws = {} if line_kws is None else line_kws
    h = ax.plot(eval_x[:, 1], pred.predicted_mean, **line_kws)

    # draw the scatterplot
    scatter_kws = {} if scatter_kws is None else scatter_kws
    ax.scatter(x, y, c=h[0].get_color(), **scatter_kws)

    return fit_results

The results from statsmodels contain a wealth of information, e.g.:

>>> print(fit_results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.477
Model:                            OLS   Adj. R-squared:                  0.471
Method:                 Least Squares   F-statistic:                     89.23
Date:                Fri, 08 Jan 2021   Prob (F-statistic):           1.93e-15
Time:                        17:56:00   Log-Likelihood:                -137.94
No. Observations:                 100   AIC:                             279.9
Df Residuals:                      98   BIC:                             285.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1417      0.193     -0.735      0.464      -0.524       0.241
x1             3.1456      0.333      9.446      0.000       2.485       3.806
==============================================================================
Omnibus:                        2.200   Durbin-Watson:                   1.777
Prob(Omnibus):                  0.333   Jarque-Bera (JB):                1.518
Skew:                          -0.002   Prob(JB):                        0.468
Kurtosis:                       2.396   Cond. No.                         4.35
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

A drop-in replacement (almost) for sns.regplot

The advantage of the method above over my original answer below is that it's easy to extend it to more complex fits.

Shameless plug: here is such an extended regplot function that I wrote that implements a large fraction of sns.regplot's functionality: https://github.com/ttesileanu/pydove.

While some features are still missing, the function I wrote

  • allows flexibility by separating the plotting from the statistical modeling (and you also get easy access to the fitting results).
  • is much faster for large datasets because it lets statsmodels calculate confidence intervals instead of using bootstrapping.
  • allows for slightly more diverse fits (e.g., polynomials in log(x)).
  • allows for slightly more fine-grained plotting options.

Old answer

Seaborn's creator has unfortunately stated that he won't add such a feature, so here's a workaround.

import copy

import matplotlib.pyplot as plt
import seaborn as sns


def regplot(
    *args,
    line_kws=None,
    marker=None,
    scatter_kws=None,
    **kwargs
):
    # this is the class that `sns.regplot` uses
    plotter = sns.regression._RegressionPlotter(*args, **kwargs)

    # this is essentially the code from `sns.regplot`
    ax = kwargs.get("ax", None)
    if ax is None:
        ax = plt.gca()

    scatter_kws = {} if scatter_kws is None else copy.copy(scatter_kws)
    scatter_kws["marker"] = marker
    line_kws = {} if line_kws is None else copy.copy(line_kws)

    plotter.plot(ax, scatter_kws, line_kws)

    # unfortunately the regression results aren't stored, so we rerun
    grid, yhat, err_bands = plotter.fit_regression(plt.gca())

    # also unfortunately, this doesn't return the parameters, so we infer them
    slope = (yhat[-1] - yhat[0]) / (grid[-1] - grid[0])
    intercept = yhat[0] - slope * grid[0]
    return slope, intercept

Note that this only works for linear regression because it simply infers the slope and intercept from the regression results. The nice thing is that it uses seaborn's own regression class and so the results are guaranteed to be consistent with what's shown. The downside is of course that we're using a private implementation detail in seaborn that can break at any point.
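The slope and intercept inference at the end is just the two-point form of a line. In isolation, with made-up grid values standing in for seaborn's `grid`/`yhat`, the recovery looks like this:

```python
import numpy as np

# Recover slope and intercept from any two points on a straight fitted
# line, exactly as the workaround above does with seaborn's output.
grid = np.array([0.0, 10.0])
yhat = 2.0 * grid + 1.0          # pretend fitted values for y = 2x + 1
slope = (yhat[-1] - yhat[0]) / (grid[-1] - grid[0])
intercept = yhat[0] - slope * grid[0]
# slope == 2.0, intercept == 1.0
```

This is also why the trick breaks for anything but a straight line: two points determine a line, but not a higher-order polynomial or a lowess smooth.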

Pernell answered 15/1, 2020 at 17:47 Comment(4)
Probably a long shot since this answer dates back to Jan 15, however I've tried to use this code above and I get the following error: local variable 'scatter_kws' referenced before assignment - How can I get around it?Grant
Turns out I was missing some keyword arguments in the def. Should work now, thanks for pointing this out, @Marioanzas!Pernell
Thanks, this is a really nice function you provided here! A small improvement makes the alpha value changeable too: if 'alpha' in ci_kws: alpha = ci_kws['alpha'] del ci_kws['alpha'] else: alpha= 0.5Liking
@Liking definitely, I just wanted to include a short proof of concept in the answer itself. The function in my repo at github.com/ttesileanu/pygrutils has a lot more features, as well as better compatibility to seaborn.Pernell

Looking through the currently available documentation, the closest I've been able to come to this functionality is by using the scipy.stats.pearsonr module.

r2 = stats.pearsonr("pct", "rdiff", df)

In attempting to make it work directly within a Pandas dataframe, an error is raised for violating scipy's basic input requirements:

TypeError: pearsonr() takes exactly 2 arguments (3 given)
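For reference, scipy's pearsonr takes the two data vectors themselves, not column names plus a DataFrame. A minimal sketch with made-up numbers (the commented line shows the equivalent call for the `pct`/`rdiff` columns from the example above):

```python
import numpy as np
from scipy import stats

# Correct usage: pass the two columns directly, e.g.
# r, p = stats.pearsonr(df["pct"], df["rdiff"])
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
r, p = stats.pearsonr(x, y)      # r is the correlation, p its p-value
```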

I managed to locate another Pandas Seaborn user who evidently solved it: https://github.com/scipy/scipy/blob/v0.14.0/scipy/stats/stats.py#L2392

sns.regplot("rdiff", "pct", df, corr_func=stats.pearsonr);

But unfortunately I haven't managed to get that to work, as it appears the author created his own custom 'corr_func', or else there's an undocumented seaborn argument-passing method that's available using a more manual method:

# x and y should have same length.
x = np.asarray(x)
y = np.asarray(y)
n = len(x)
mx = x.mean()
my = y.mean()
xm, ym = x - mx, y - my
r_num = np.add.reduce(xm * ym)
r_den = np.sqrt(ss(xm) * ss(ym))
r = r_num / r_den

# Presumably, if abs(r) > 1, then it is only some small artifact of floating
# point arithmetic.
r = max(min(r, 1.0), -1.0)
df = n - 2
if abs(r) == 1.0:
    prob = 0.0
else:
    t_squared = r * r * (df / ((1.0 - r) * (1.0 + r)))
    prob = betai(0.5 * df, 0.5, df / (df + t_squared))
return r, prob

Hope this helps to advance the original request toward an interim solution, as there's much-needed utility in adding regression fit statistics to the seaborn package, as a replacement for what one can easily get from MS Excel or a stock matplotlib line plot.

Gersham answered 24/6, 2014 at 20:50 Comment(2)
Thank you, is there a sequencing dependency? For example this call plots the scatter + linreg line: sns.lmplot("total_bill", "tip", tips); and this one adds the bi-variate distributions + the pearsonsr: sns.jointplot("total_bill", "tip", tips); but no linreg line. Is a workaround possible to add the linreg manually to this?: sns.lmplot("total_bill", "tip", tips, scatter_kws={"marker": ".", "color": "slategray"}, line_kws={"linewidth": 1, "color": "seagreen"});Gersham
Why doesn't the developer want to include this basic information? I keep seeing advice like "it's easy, just use these other 10 lines of code." But this doesn't feel very pythonic (especially reproducing a fit that's already been done). Why would I use Seaborn rather than just doing the fits using scipy and matplotlib, since I'm basically guaranteed to want the equations enough of the time?Tremor

Unfortunately it is not possible to directly extract numerical information from e.g. seaborn.regplot. Therefore, the minimal function below fits a polynomial regression and returns values of the smoothed line and corresponding confidence interval.

import numpy as np
from scipy import stats

def polynomial_regression(X, y, order=1, confidence=95, num=100):
    confidence = 1 - ((1 - (confidence / 100)) / 2)
    fit_params = np.polyfit(X, y, order)
    y_model = np.polyval(fit_params, X)
    residual = y - y_model
    n = X.size                        # number of observations
    m = order + 1                     # number of fitted parameters
    dof = n - m                       # degrees of freedom
    t = stats.t.ppf(confidence, dof)  # t-value for the confidence level
    std_error = (np.sum(residual**2) / dof) ** 0.5
    X_line = np.linspace(np.min(X), np.max(X), num)
    y_line = np.polyval(fit_params, X_line)
    # confidence interval of the mean prediction
    ci = t * std_error * (1 / n + (X_line - np.mean(X))**2 / np.sum((X - np.mean(X))**2)) ** 0.5
    return X_line, y_line, ci

Example run:

import matplotlib.pyplot as plt

X = np.linspace(0, 1, 100)
y = np.random.random(100)

X_line, y_line, ci = polynomial_regression(X, y, order=3)

plt.scatter(X, y)
plt.plot(X_line, y_line)
plt.fill_between(X_line, y_line - ci, y_line + ci, alpha=.5)

[Resulting figure: scatter of the data with the cubic fit line and shaded confidence band]

Insistence answered 26/4, 2022 at 7:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.