How to plot statsmodels linear regression (OLS) cleanly

Problem Statement:

I have some nice data in a pandas dataframe. I'd like to run simple linear regression on it:

[image: plot of the data, made with pandas' plot function]

Using statsmodels, I perform my regression. Now, how do I get my plot? I've tried statsmodels' plot_fit method, but the plot is a little funky:

[image: output of plot_fit]

I was hoping to get a horizontal line which represents the actual result of the regression.

Statsmodels has a variety of methods for plotting regression (a few more details about them here) but none of them seem to be the super simple "just plot the regression line on top of your data" -- plot_fit seems to be the closest thing.

Questions:

  • The first picture above is from pandas' plot function, which returns a matplotlib.axes._subplots.AxesSubplot. Can I overlay a regression line easily onto that plot?
  • Is there a function in statsmodels I've overlooked?
  • Is there a better way to put together this figure?

Two related questions:

Neither seems to have a good answer.

Sample data

        motifScore  expression
6870    1.401123    0.55
10456   1.188554    -1.58
12455   1.476361    -1.75
18052   1.805736    0.13
19725   1.110953    2.30
30401   1.744645    -0.49
30716   1.098253    -1.59
30771   1.098253    -2.04

abline_plot

I had tried this, but it doesn't seem to work... not sure why:

[image: abline_plot attempt]

Eucharis answered 15/2, 2017 at 23:20 Comment(1)
Please post a sample dataset (looks like yours is small anyway, so you can post the whole thing). In general, I would recommend seaborn.regplot, which will accomplish what you need if you are okay with having that dependency. - Gaitskell
G
36

As I mentioned in the comments, seaborn is a great choice for statistical data visualization.

import seaborn as sns

sns.regplot(x='motifScore', y='expression', data=motif)


Alternatively, you can use statsmodels.regression.linear_model.OLS and manually plot a regression line.

import numpy as np
import statsmodels.api as sm

# regress "expression" onto "motifScore" (plus an intercept)
model = sm.OLS(motif.expression, sm.add_constant(motif.motifScore))
p = model.fit().params

# generate x-values for your regression line (two is sufficient)
x = np.arange(1, 3)

# scatter-plot data
ax = motif.plot(x='motifScore', y='expression', kind='scatter')

# plot regression line on the same axes, set x-axis limits
ax.plot(x, p.const + p.motifScore * x)
ax.set_xlim([1, 2])


Yet another solution is statsmodels.graphics.regressionplots.abline_plot which takes away some of the boilerplate from the above approach.

import statsmodels.api as sm
from statsmodels.graphics.regressionplots import abline_plot

# regress "expression" onto "motifScore" (plus an intercept)
model = sm.OLS(motif.expression, sm.add_constant(motif.motifScore))

# scatter-plot data
ax = motif.plot(x='motifScore', y='expression', kind='scatter')

# plot regression line
abline_plot(model_results=model.fit(), ax=ax)

Gaitskell answered 16/2, 2017 at 1:28 Comment(1)
Thanks @IgorRaush! I think I'll stick to the second solution. Despite seaborn seeming like an excellent library, I'd like to keep the number of dependencies low, and since I'm only making one kind of plot and already depending on pandas and statsmodels, I'll stick to what those can do for me. But I hope others are inspired to use seaborn! - Eucharis
P
0

I agree with @IgorRaush that seaborn makes it very easy to plot a simple regression line, especially because the OLS fit is done under the hood.

With seaborn, you can turn off the confidence interval (ci=None) and pass keyword arguments for the line and the scatter points separately.

import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    'motifScore': [1.401123, 1.188554, 1.476361, 1.805736, 1.110953, 1.744645, 1.098253, 1.098253],
    'expression': [0.55, -1.58, -1.75, 0.13, 2.3, -0.49, -1.59, -2.04]})

sns.regplot(x='motifScore', y='expression', data=df, ci=None,
            line_kws={'color': 'red'}, scatter_kws={'s': 20, 'alpha': 0.7})


The relevant statsmodels function is abline_plot(). Under the hood it draws the fit with a matplotlib.lines.Line2D, so the line may not be visible if the axis limits are not set appropriately. For example, with the default limits of (0, 1) on both axes, the line of fit doesn't show up at all for the sample data.

import statsmodels.api as sm

X = sm.add_constant(df['motifScore'])
y = df['expression']
results = sm.OLS(y, X).fit()

fig = sm.graphics.abline_plot(model_results=results, color='red')
fig.axes[0].set(ylim=(-1,0), xlim=(1,2))

abline_plot() doesn't plot the original data, so the data must be plotted separately. Because the fitted line passes through the cloud of points, the axis limits set by the scatter plot usually suit the line too, and there's no need to adjust them by hand. Draw the scatter plot before calling abline_plot() so that the axis limits are already determined by the data.

import matplotlib.pyplot as plt
plt.scatter(df['motifScore'], df['expression'])
fig = sm.graphics.abline_plot(model_results=results, color='red', ax=plt.gca())


If you want to stick to statsmodels.graphics, there's another plotter worth checking out: plot_ccpr(). It draws a CCPR (component and component-plus-residual) plot, whose main purpose is to show the effect a particular regressor has on the dependent variable. For the model y = a + b*x it plots x against b*x, so its line is offset from the regression line by the constant term a. If the y-tick values are not important, it's still useful.

fig = sm.graphics.plot_ccpr(results, 'motifScore')
# the above is the same as the following (uncomment to see it drawn)
# notice that results.params.const is missing from y
# fig.axes[0].plot(range(1,3), [results.params['motifScore']*i for i in range(1,3)]);

Pinelli answered 13/5, 2023 at 7:28 Comment(0)
