Bug in the autocorrelation plot of matplotlib's plt.acorr?

I am plotting the autocorrelation of a series with Python in three ways: 1. pandas, 2. matplotlib, 3. statsmodels. The graph I get from matplotlib is not consistent with the other two. The code is:

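 import matplotlib.pyplot as plt
 from pandas.plotting import autocorrelation_plot
 from statsmodels.graphics.tsaplots import plot_acf

 # print out data (mydata is assumed to be a pandas Series)
 print(mydata.values)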

 #1. pandas
 p=autocorrelation_plot(mydata)
 plt.title('mydata')

 #2. matplotlib
 fig=plt.figure()
 plt.acorr(mydata,maxlags=150)
 plt.title('mydata')

 #3. statsmodels.graphics.tsaplots.plot_acf
 plot_acf(mydata)
 plt.title('mydata')

The graph is here: http://quant365.com/viewtopic.php?f=4&t=33

Visit answered 18/12, 2014 at 7:29 Comment(7)
This question appears to be off-topic because it is a bug report. - Publican
Not only do bug reports not belong on SO, but your example is not runnable (mydata is undefined and imports are missing) and your graphs are password protected. Not sure what kind of responses you expect. If you want to improve this question, I recommend focusing on asking what each particular function is actually doing. There's a chance that matplotlib is taking a different, but equally valid approach. - Publican
It should be OK now. I cannot put the graph here because the image is like quant365.com/download/file.php?id=5, which cannot be posted here. - Visit
It's not a bug, it's just that plt.acorr is a lower-level function than the autocorrelation plot in statsmodels. In the matplotlib version, you're seeing the "full" autocorrelation, and it hasn't "centered" (i.e. zero mean) your data for you. The calculation is correct, however. - Rhododendron
At lag 0, the ACF is 0.5, but the matplotlib version is obviously not 0.5! How can it be correct? - Visit
@WuFuheng - Actually, they're all 1 at lag 0, by definition. I'll add an example of exactly what's going on after I get home. They're all the correct calculation, though, just using different assumptions and displaying it in different ways. - Rhododendron
Yes, you should compare the output of R's ACF. That one is really clear. - Visit

This is a result of different common definitions between statistics and signal processing. Basically, the signal processing definition assumes that you're going to handle the detrending. The statistical definition assumes that subtracting the mean is all the detrending you'll do, and does it for you.

First off, let's demonstrate the problem with a stand-alone example:

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
from statsmodels.graphics import tsaplots

def label(ax, string):
    ax.annotate(string, (1, 1), xytext=(-8, -8), ha='right', va='top',
                size=14, xycoords='axes fraction', textcoords='offset points')

np.random.seed(1977)
data = np.random.normal(0, 1, 100).cumsum()

fig, axes = plt.subplots(nrows=4, figsize=(8, 12))
fig.tight_layout()

axes[0].plot(data)
label(axes[0], 'Raw Data')

axes[1].acorr(data, maxlags=data.size-1)
label(axes[1], 'Matplotlib Autocorrelation')

tsaplots.plot_acf(data, ax=axes[2])
label(axes[2], 'Statsmodels Autocorrelation')

# pd.tools.plotting was removed from pandas; the function now lives in pd.plotting
pd.plotting.autocorrelation_plot(data, ax=axes[3])
label(axes[3], 'Pandas Autocorrelation')

# Remove some of the titles and labels that were automatically added
for ax in axes.flat:
    ax.set(title='', xlabel='')
plt.show()

[Figure: four stacked panels showing the raw data and the matplotlib, statsmodels, and pandas autocorrelation plots]

So, why the heck am I saying that they're all correct? They're clearly different!

Let's write our own autocorrelation function to demonstrate what plt.acorr is doing:

def acorr(x, ax=None):
    if ax is None:
        ax = plt.gca()
    autocorr = np.correlate(x, x, mode='full')
    autocorr /= autocorr.max()

    return ax.stem(autocorr)

If we plot this with our data, we'll get a more-or-less identical result to plt.acorr (I'm leaving out properly labeling the lags, simply because I'm lazy):

fig, ax = plt.subplots()
acorr(data)
plt.show()

[Figure: stem plot produced by the code above]
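For anyone who does want the lags labeled, here is a minimal sketch. The name full_autocorr is mine (it is not a matplotlib or numpy function), and it assumes a 1-D array of real values. It relies on np.correlate(..., mode='full') returning 2N - 1 values with zero lag at index N - 1:

```python
import numpy as np

def full_autocorr(x):
    """Return (lags, autocorr) for the two-sided, uncentered autocorrelation."""
    x = np.asarray(x, dtype=float)
    autocorr = np.correlate(x, x, mode='full')
    # For an autocorrelation the largest value is the zero-lag one,
    # so dividing by the max scales zero lag to exactly 1.
    autocorr = autocorr / autocorr.max()
    # 'full' mode returns 2*N - 1 values; zero lag sits at index N - 1.
    lags = np.arange(-(x.size - 1), x.size)
    return lags, autocorr

# Same kind of random walk as in the example above
rng = np.random.RandomState(1977)
data = rng.normal(0, 1, 100).cumsum()
lags, autocorr = full_autocorr(data)
```

Passing both arrays to ax.stem(lags, autocorr) should then put zero lag in the middle of the x-axis, the way plt.acorr does.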

This is a perfectly valid autocorrelation. It's all a matter of whether your background is signal processing or statistics.

This is the definition used in signal processing. The assumption is that you're going to handle detrending your data (note the detrend kwarg in plt.acorr). If you want it detrended, you'll explicitly ask for it (and probably do something better than just subtracting the mean), and otherwise it shouldn't be assumed.
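As a rough illustration of that detrend kwarg: matplotlib ships mean removal as matplotlib.mlab.detrend_mean, and passing it to acorr should give the mean-subtracted (but still two-sided) result. The Agg backend line is my assumption so the sketch runs without a display; drop it in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
from matplotlib import mlab
import numpy as np

np.random.seed(1977)
data = np.random.normal(0, 1, 100).cumsum()

fig, ax = plt.subplots()
# acorr returns (lags, c, line, b); only the first two are needed here.
lags, c = ax.acorr(data, detrend=mlab.detrend_mean,
                   maxlags=data.size - 1)[:2]

# The same calculation by hand: subtract the mean, correlate,
# then normalize by the zero-lag value (index N - 1 of the 'full' output).
x = data - data.mean()
manual = np.correlate(x, x, mode='full')
manual = manual / manual[x.size - 1]
```

With the mean removed this way, c and manual should agree to floating-point precision, and both are 1 at zero lag.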

In statistics, simply subtracting the mean is assumed to be what you wanted to do for detrending.

All of the other functions are subtracting the mean of the data before the correlation, similar to this:

def acorr(x, ax=None):
    if ax is None:
        ax = plt.gca()

    x = x - x.mean()

    autocorr = np.correlate(x, x, mode='full')
    autocorr /= autocorr.max()

    return ax.stem(autocorr)

fig, ax = plt.subplots()
acorr(data)
plt.show()

[Figure: stem plot of the mean-subtracted autocorrelation produced by the code above]

However, we still have one large difference. This one is purely a plotting convention.

In most signal processing textbooks (that I've seen, anyway), the "full" autocorrelation is displayed, such that zero lag is in the center, and the result is symmetric on each side. R, on the other hand, has the very reasonable convention of displaying only one side of it. (After all, the other side is completely redundant.) The statistical plotting functions follow the R convention, and plt.acorr follows what Matlab does, which is the opposite convention.

Basically, you'd want this:

def acorr(x, ax=None):
    if ax is None:
        ax = plt.gca()

    x = x - x.mean()

    autocorr = np.correlate(x, x, mode='full')
    # Zero lag is at index x.size - 1 of the 'full' output; keep lags 0..N-1
    autocorr = autocorr[x.size - 1:]
    autocorr /= autocorr.max()

    return ax.stem(autocorr)

fig, ax = plt.subplots()
acorr(data)
plt.show()

[Figure: stem plot of the one-sided, mean-subtracted autocorrelation produced by the code above]
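As a sanity check on the one-sided version, the sketch below compares it with the sample autocorrelation formula as it is usually written in statistics, r_k = sum_t (x_t - m)(x_{t+k} - m) / sum_t (x_t - m)^2, where m is the mean. Both helper names are hypothetical. Note that zero lag sits at index N - 1 of the 'full' output, so the slice has to start there for lag 0 to be kept:

```python
import numpy as np

def one_sided_acf(x):
    """Mean-subtracted autocorrelation for lags 0 .. N-1 via np.correlate."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    full = np.correlate(x, x, mode='full')
    # Keep lags 0 .. N-1; zero lag is at index N - 1 of the 'full' output.
    acf = full[x.size - 1:]
    return acf / acf[0]

def textbook_acf(x):
    """The same quantity written as an explicit sum over each lag k."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.sum(d * d)
    # r_k = sum_t d[t] * d[t + k] / sum_t d[t]**2
    return np.array([np.sum(d[:d.size - k] * d[k:]) / denom
                     for k in range(d.size)])

rng = np.random.RandomState(1977)
data = rng.normal(0, 1, 100).cumsum()
same = np.allclose(one_sided_acf(data), textbook_acf(data))
```

If same comes out True, the np.correlate shortcut and the explicit formula agree, with a value of 1 at lag 0 in both.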

Rhododendron answered 19/12, 2014 at 0:55 Comment(3)
Thanks. I still have one thing that is not very clear: "However, we still have one large difference. This one is purely a plotting convention." May I know what plotting convention it is? I am from a stats background and have no knowledge of signal processing. Why doesn't Python provide a uniform version? And I just found there is no partial autocorrelation in pandas, which is disappointing. - Visit
Why is the autocorrelation at lag 0 in my graph 0.5? This is wrong, as it should be 1! But I think I used the function correctly. Or did I miss anything? - Visit
@WuFuheng - The plotting convention is the difference between the last two figures: whether the full, symmetric autocorrelation is shown, or just one half of it. As far as why your graphs have 0.5 at lag 0, I have no idea. I get an autocorrelation of 1 at lag 0 with the exact same functions. - Rhododendron
