Numpy Correlate is not providing an offset
I am trying to look at astronomical spectra using Python, and I'm using numpy.correlate to try and find a radial velocity shift. I'm comparing each spectrum I have to one template spectrum. The problem that I'm encountering is that, no matter which spectra I use, numpy.correlate states that the maximal value of the correlation function occurs with a shift of zero pixels, i.e. the spectra already line up, which is very clearly not true. Here is some of the relevant code:

corr = np.correlate(temp_data, imag_data, mode='same')
ax1.plot(delta_data, corr, c='g')
ax1.plot(delta_data, 100*temp_data, c='b')
ax1.plot(delta_data, 100*imag_data, c='r')

The output of this code is shown here:

What I Have

Note that the cross-correlation function peaks at an offset of zero pixels despite the template (blue) and observed (red) spectra clearly showing an offset. What I would expect to see would be something a bit like (albeit not exactly like; this is merely the closest representation I could produce):

What I Want

Here I have introduced an artificial offset of 50 pixels in the template data, and they more or less line up now. What I would like is, for a case like this, for a peak to appear at an offset of 50 pixels rather than at zero (I don't care whether the spectra at the bottom appear lined up; that is merely for visual representation). However, despite several hours of work and research online, I can't find anyone who even describes this problem, let alone a solution. I've attempted to use SciPy's correlate and Matplotlib's xcorr, and both show this same behaviour (although I'm led to believe that they are essentially the same function).

Why is the cross-correlation not acting the way I expect, and how do I get it to act in a useful way?

Borroff answered 9/4, 2018 at 22:26 Comment(4)
We need to see some data samples that cause the unexpected output.Sideling
I suspect this has something to do with the zero padding. Have you tried truncating one of your series?Dextrogyrate
When the signals are of the same length, the correlation of the unshifted signals will be highest simply because when the signals are shifted, their tails don't overlap and hence can't be multiplied. Are your signals 0-centered with a standard deviation of 1?Kinny
@Kinny They have the signals plotted as well, and they are not 0-centered. I think the boundary effect essentially dominates the correlation, but I'm not an expert in SP and I'm not sure how to fix it. You probably should write an answer.Dextrogyrate

The issue you're experiencing is probably that your spectra are not zero-centered; their RMS value looks to be about 100 in whichever units you're plotting, instead of 0. This matters because numpy.correlate works by "sliding" imag_data over temp_data to get their dot product at each possible offset between the two series. (See the Wikipedia article on cross-correlation to understand the operation itself.) When using mode='same' to produce an output that is the same length as your first input (temp_data), NumPy has to "pad" a bunch of dummy values--zeroes--onto the ends of imag_data in order to calculate the dot products of all the shifted versions of imag_data. At any non-zero offset between the spectra, some of the values in temp_data are multiplied by those dummy zero-padding values instead of by values from imag_data. If the values in the spectra were centered around zero (zero mean), this zero-padding would not bias the dot product, but because these spectra sit at a level of around 100 units, the dot product (our correlation) is largest when we lay the two spectra directly on top of one another with no offset.
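To see this effect in isolation, here's a minimal sketch with made-up spectra (two Gaussian lines on a constant level of 100, offset by 40 samples; temp_data and imag_data are stand-ins for the question's arrays, not the real data):

```python
import numpy as np

x = np.arange(200)
# Hypothetical "spectra": a narrow line on a constant ~100-unit level,
# with the observed line shifted 40 samples from the template line
temp_data = 100 + 10 * np.exp(-((x - 80) ** 2) / 8)
imag_data = 100 + 10 * np.exp(-((x - 120) ** 2) / 8)

corr = np.correlate(temp_data, imag_data, mode='same')
# With mode='same' and length N, zero lag sits at index N//2
lag = np.argmax(corr) - len(corr) // 2
print(lag)  # 0 -- the large constant level dominates, hiding the 40-sample shift
```

Even though the lines are clearly 40 samples apart, the peak of the raw cross-correlation lands at zero lag, exactly as in the question.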

Notice that your cross-correlation result looks like a triangular pulse, which is what you might expect from the cross-correlation of two square pulses (cf. Convolution of a Rectangular "Pulse" With Itself). That's because your spectra, once padded, look like a step function from zero up to a pulse of slightly noisy values around 100. You can try correlating with mode='full' to see the entire response of the two spectra, or notice that with mode='valid' you should get only one value in return, since your two spectra are exactly the same length, so there is only one offset (zero!) at which they can be entirely lined up.

To sidestep this issue, you can either subtract away the RMS value of the spectra so that they are zero-centered, or manually pad the beginning and end of imag_data with (len(temp_data)/2-1) dummy values equal to np.sqrt(np.mean(imag_data**2)).
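A minimal sketch of the zero-centering fix, using made-up spectra (Gaussian lines on a constant 100-unit level, offset by 40 samples; the names mirror the question's arrays but are not the real data):

```python
import numpy as np

x = np.arange(200)
temp_data = 100 + 10 * np.exp(-((x - 80) ** 2) / 8)   # template line near pixel 80
imag_data = 100 + 10 * np.exp(-((x - 120) ** 2) / 8)  # observed line near pixel 120

# Subtract each spectrum's mean level so the zero padding no longer biases
# the dot products toward zero offset
corr = np.correlate(temp_data - temp_data.mean(),
                    imag_data - imag_data.mean(), mode='same')
lag = np.argmax(corr) - len(corr) // 2
print(abs(lag))  # 40 -- the peak now sits at the true 40-sample offset
```

The sign of the lag depends on np.correlate's argument order; the magnitude is the pixel shift between the two lines.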

Edit: In response to your questions in the comments, I thought I'd include a graphic to make the point I'm trying to describe a little clearer.

Say we have two vectors of values, not entirely unlike your spectra, each with some large non-zero mean.

# Generate two noisy, but correlated series
import numpy as np

t = np.linspace(0,250,250) # time domain from 0 to 250 steps
# signal_model = narrow_peak + gaussian_noise + constant
f = 10*np.exp(-((t-90)**2)/8) + np.random.randn(250) + 40
g = 10*np.exp(-((t-180)**2)/8) + np.random.randn(250) + 40

Fake signals

f has a spike around t=90, and g has a spike around t=180. So we expect the correlation of g and f to have a spike around a lag of 90 timesteps (in the case of spectra, frequency bins instead of timesteps.)

But in order to get an output that is the same shape as our inputs, as in np.correlate(g,f,mode='same'), we have to "pad" f on either side with half its length in dummy values: np.correlate pads with zeroes. If we don't pad f (as in np.correlate(g,f,mode='valid')), we will only get one value in return (the correlation with zero offset), because f and g are the same length, and there is no room to shift one of the signals relative to the other.
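The three modes and their output lengths can be checked directly; a quick sketch with two constant 250-sample arrays:

```python
import numpy as np

f = np.ones(250)
g = np.ones(250)

# 'valid': no padding; equal-length inputs allow only the zero-offset overlap
n_valid = len(np.correlate(g, f, mode='valid'))
# 'same': pad so the output length matches the inputs
n_same = len(np.correlate(g, f, mode='same'))
# 'full': every offset with at least one overlapping sample (2*250 - 1)
n_full = len(np.correlate(g, f, mode='full'))

print(n_valid, n_same, n_full)  # 1 250 499
```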

When you calculate the correlation of g and f after that padding, you find that it peaks when the non-zero portions of the signals align completely, that is, when there is no offset between the original f and g. This is because the RMS value of the signals is so much higher than zero--the size of the overlap of f and g depends much more strongly on the number of elements overlapping at this high RMS level than on the relatively small fluctuations each function has around it. We can remove this large contribution to the correlation by subtracting the RMS level from each series. In the graph below, the gray line on the right shows the cross-correlation of the two series before zero-centering, and the teal line shows the cross-correlation after. The gray line is, like your first attempt, triangular, tracking the overlap of the two non-zero signals. The teal line better reflects the correlation between the fluctuations of the two signals, as we desired.
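Numerically, the same point can be made by reading the peak location off each cross-correlation (regenerating f and g as above, but seeded here so the result is reproducible; with mode='same' and length 250, zero lag sits at index 125):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
t = np.linspace(0, 250, 250)
f = 10 * np.exp(-((t - 90) ** 2) / 8) + rng.standard_normal(250) + 40
g = 10 * np.exp(-((t - 180) ** 2) / 8) + rng.standard_normal(250) + 40

xcorr = np.correlate(g, f, 'same')                 # raw: constant level dominates
xcorr_rms = np.correlate(g - 40, f - 40, 'same')   # zero-centered

lag_raw = np.argmax(xcorr) - len(xcorr) // 2
lag_centered = np.argmax(xcorr_rms) - len(xcorr_rms) // 2
print(lag_raw)       # 0
print(lag_centered)  # close to the true 90-step shift
```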

Cross-Correlations

import matplotlib.pyplot as plt

xcorr = np.correlate(g,f,'same')
xcorr_rms = np.correlate(g-40,f-40,'same')
fig, axes = plt.subplots(5,2,figsize=(18,18),gridspec_kw={'width_ratios':[5,2]})
for n, axis in enumerate(axes):
    # Show the padded signals at a few representative offsets
    offset = (0,75,125,215,250)[n]
    fp = np.pad(f,[offset,250-offset],mode='constant',constant_values=0.)
    gp = np.pad(g,[125,125],mode='constant',constant_values=0.)

    axis[0].plot(fp,color='purple',lw=1.65)
    axis[0].plot(gp,color='orange',lw=1.65)
    axis[0].axvspan(max(125,offset),min(375,offset+250),color='blue',alpha=0.06)
    axis[0].axvspan(0,max(125,offset),color='brown',alpha=0.03)
    axis[0].axvspan(min(375,offset+250),500,color='brown',alpha=0.03)
    if n==0:
        axis[0].legend(['f','g'])
    axis[0].set_title('offset={}'.format(offset-125))

    axis[1].plot(xcorr/(40*40),color='gray')  # scaled down for display
    axis[1].plot(xcorr_rms,color='teal')
    axis[1].axvline(offset,-100,350,color='maroon',lw=5,alpha=0.5)
    if n == 0:
        axis[1].legend([r"$g \star f$",r"$g' \star f'$","offset"],loc='upper left')

plt.show()
Ingeborg answered 9/4, 2018 at 22:58 Comment(6)
Thank you very much! Subtracting the RMS value appeared to do the trick, and it now gives an offset that seems reasonable. Granted, I'm not entirely certain why it works, and I'll admit that I don't understand a lot of what you said. I'm not that savvy at coding, to be honest, so talking about padding and zero-centered goes well over my head. Is there some reference you could point me at to explain these terms a bit better? Perhaps one that also explains cross-correlation functions, as I apparently didn't understand them as well as I thought?Borroff
1) If an answer solves your problem, please consider accepting it. 2) If you have a new question, please don't discuss it in the comments, ask a new question instead. But I have to point out that asking for off-site resources like tutorials is considered off topic for SOPolyandry
@Borroff Cross-correlation is like sliding one function f over another g and finding the overlap of the two functions (the integral of their product) for every possible value of the "lag" or "offset" between them. But here, instead of having functions that are defined over all the real numbers, we just have two arrays of numbers (in this case, spectra of equal length.) It's easy to calculate the overlap of these two arrays when they are entirely lined up (just take their dot product), but we have to decide what to do when we shift one array relative to the other. ... contIngeborg
@Borroff (2) The three "modes" in np.correlate offer different options for how to handle the calculation when your arrays do not overlap completely. (See SciPy's documentation.) In order to get a result that is the same length as your input arrays, you end up multiplying all the values that don't overlap by zero. The effect of this process is that the value of your cross-correlation is higher when your two input arrays overlap more, because the average (RMS) value of your spectra is not zero. ..cont...Ingeborg
@Borroff (3) When the arrays are offset, you get a correlation value that is close to (f_avg)*(g_avg)*len(overlap), which is largest when lag=0. When you subtract away the RMS value of the spectra ("zero centering," since the data now oscillate around zero), the way we deal with non-overlapping elements of the arrays is not as impactful, since the expected value of any element is now also zero. See SciPy's reference tutorials for more detail on how NumPy calculates convolution and correlation.Ingeborg
You can also use scipy.signal.detrend with type='constant' to remove the DC offset.Fideicommissary
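A minimal sketch of that detrend approach (the spectrum here is a made-up array standing in for real data):

```python
import numpy as np
from scipy.signal import detrend

x = np.arange(200)
spectrum = 100 + 10 * np.exp(-((x - 80) ** 2) / 8)  # line on a 100-unit continuum

# type='constant' subtracts the mean, removing the DC offset in one call
centered = detrend(spectrum, type='constant')
print(np.isclose(centered.mean(), 0.0))  # True
```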
