Use convolution to find a reference audio sample in a continuous stream of sound

Asked 1/5, 2011 at 9:23 Answered 5/5, 2011 at 9:40

Solved c#.net signal-processing convolution cross-correlation

in my previous question on finding a reference audio sample in a bigger audio sample, it was proposed, that I should use convolution.
Using DSPUtil, I was able to do this. I played a little with it and tried different combinations of audio samples, to see what the result was. To visualize the data, I just dumped the raw audio as numbers to Excel and created a chart using this numbers. A peak is visible, but I don't really know how this helps me. I have these problems:

I don't know, how to infer the starting position of the match in the original audio sample from the location of the peak.
I don't know, how I should apply this with a continuous stream of audio, so I can react, as soon as the reference audio sample occurs.
I don't understand, why picture 2 and picture 4 (see below) differ so much, although, both represent an audio sample convolved with itself...

Any help is highly appreciated.

The following pictures are the result of the analysis using Excel:

A longer audio sample with the reference audio (a beep) near the end:
The beep convolved with itself:
A longer audio sample without the beep convolved with the beep:
The longer audio sample of point 3 convolved with itself:

UPDATE and solution:
Thanks to the extensive help of Han, I was able to achieve my goal.
After I rolled my own slow implementation without FFT, I found alglib which provides a fast implementation. There is one basic assumption to my problem: One of the audio samples is contained completely within the other.
So, the following code returns the offset in samples in the larger of the two audio samples and the normalized cross-correlation value at that offset. 1 means complete correlation, 0 means no correlation at all and -1 means complete negative correlation:

private void CalcCrossCorrelation(IEnumerable<double> data1, 
                                  IEnumerable<double> data2, 
                                  out int offset, 
                                  out double maximumNormalizedCrossCorrelation)
{
    var data1Array = data1.ToArray();
    var data2Array = data2.ToArray();
    double[] result;
    alglib.corrr1d(data1Array, data1Array.Length, 
                   data2Array, data2Array.Length, out result);

    var max = double.MinValue;
    var index = 0;
    var i = 0;
    // Find the maximum cross correlation value and its index
    foreach (var d in result)
    {
        if (d > max)
        {
            index = i;
            max = d;
        }
        ++i;
    }
    // if the index is bigger than the length of the first array, it has to be
    // interpreted as a negative index
    if (index >= data1Array.Length)
    {
        index *= -1;
    }

    var matchingData1 = data1;
    var matchingData2 = data2;
    var biggerSequenceCount = Math.Max(data1Array.Length, data2Array.Length);
    var smallerSequenceCount = Math.Min(data1Array.Length, data2Array.Length);
    offset = index;
    if (index > 0)
        matchingData1 = data1.Skip(offset).Take(smallerSequenceCount).ToList();
    else if (index < 0)
    {
        offset = biggerSequenceCount + smallerSequenceCount + index;
        matchingData2 = data2.Skip(offset).Take(smallerSequenceCount).ToList();
        matchingData1 = data1.Take(smallerSequenceCount).ToList();
    }
    var mx = matchingData1.Average();
    var my = matchingData2.Average();
    var denom1 = Math.Sqrt(matchingData1.Sum(x => (x - mx) * (x - mx)));
    var denom2 = Math.Sqrt(matchingData2.Sum(y => (y - my) * (y - my)));
    maximumNormalizedCrossCorrelation = max / (denom1 * denom2);
}

BOUNTY:
No new answers required! I started the bounty to award it to Han for his continued effort with this question!

Toomey answered 1/5, 2011 at 9:23 Comment(6)

Thanks for the code. comparison returns 2.3, what does that mean? – Comeback 5/11, 2017 at 16:15

I would say that means there is an error in your code. It should return values between -1 and 1 – Toomey 5/11, 2017 at 17:49

Thanks. Do you have any sample code? I have used your code + the code from this question #40143721. Please suggest – Comeback 5/11, 2017 at 18:18

Now the code always returns 0.6 to 0.8 but even when I am not speaking. – Comeback 5/11, 2017 at 18:19

You should open a new question with the details of what you have now – Toomey 5/11, 2017 at 18:26

Here you go #47125610 – Comeback 5/11, 2017 at 18:33

Here we go for the bounty :)

To find a particular reference signal in a larger audio fragment, you need to use a cross-correlation algorithm. The basic formulae can be found in this Wikipedia article.

Cross-correlation is a process by which 2 signals are compared. This is done by multiplying both signals and summing the results for all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero), or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure for for how much both signals are alike and the delay between them. Another article that may provide more insight can be found here.

This article by Paul Bourke also contains source code for a straightforward time-domain implementation. Note that the article is written for a general signal. Audio has the special property that the long-time average is usualy 0. This means that the averages used in Paul Bourkes formula (mx and my) can be left out. There are also fast implementations of the cross-correlation based on the FFT (see ALGLIB).

The (maximum) value of the correlation depends on the sample values in the audio signals. In Paul Bourke's algorithm however the maximum is scaled to 1.0. In cases where one of the signals is contained entirely within another signal, the maximum value will reach 1. In the more general case the maximum will be lower and a threshold value will have to be determined to decide whether the signals are sufficiently alike.

Fluoric answered 5/5, 2011 at 9:40 Comment(2)

Thanks! I already awarded the bounty to your other answer :-) – Toomey 5/5, 2011 at 10:10

I have two wave audio signals. the first, reference signal contains only the half of the pattern at the end. If I use corrr1d on them, it will result the highest correlation value somewhere at the beginning. Here is a simplified version: Signal: { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }; Pattern: { 10, 11, 12, 13, 14, 15 }; So the signal only has 10,11,12 from the pattern. Do you know why is this happening? – Cecrops 21/7, 2015 at 23:53

Instead of a convolution you should use a correlation. The size of the correlation peak tells you how much both signals are alike, the position of the peak their relative position in time, or the delay between both signals.

Fluoric answered 1/5, 2011 at 10:8 Comment(17)

Can you please elaborate a little bit? How to do this correlation? – Toomey 1/5, 2011 at 11:24

You can find relevant information on the correlation here: link. The same article also links to pattern recognition which is what you are attempting to do. There are fast algorithms for the cross-correlation based on the FFT. – Fluoric 1/5, 2011 at 15:24

The wikipedia article is pretty theoretic. Isn't there are more practical introduction? Or even source code? ;-) – Toomey 2/5, 2011 at 10:55

This is the first result for a Google search on 'cross-correlation source code': paulbourke.net/miscellaneous/correlate. It doesn't use a fast algorithm, but it does give some insight on what cross-correlation does. I'm sure you could find much more if you tried. – Fluoric 2/5, 2011 at 12:26

@Han: Thanks! I sure can find examples in google, but because I don't understand the concept even after reading the wikipedia article, I am not sure whether the google result that turns up is really what I need to use... :| – Toomey 2/5, 2011 at 12:29

Cross-correlation is a process by which 2 signals are compared. This is done by multiplying both signals and summing the results for all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero), or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure for for how much both signals are alike. Try to perform the calculations as given in the formula by hand. – Fluoric 2/5, 2011 at 21:2

@Han: Again, thanks a lot. The link to paul bourkes article helped a lot in understanding the concept! I rolled my own implementation of the algorithm based on his code and as expected, it returns a correlation of 1 with an offset of 0, when both arrays are read from the same file. However, in my real world scenario above, it only returns a maximum correlation of 0.39, although, the reference signal was extracted from the source signal using audacity... I would have expected something more closer to one, like 0.95 or so. So, for me the question is: What is a reasonable threshold? – Toomey 2/5, 2011 at 22:13

@Han: I did some more testing and it seems as if my understanding wasn't really correct: Is it possible, to have a correlation of 1 of both arrays aren't of the same length? – Toomey 2/5, 2011 at 22:30

@Han: In paul bourkes code is the denominator defined as Math.Sqrt(Math.Pow(x.Sum() - x.Average(), 2)) * Math.Sqrt(Math.Pow(y.Sum() - y.Average(), 2)). If I change this to simply Math.Pow(y.Sum() - y.Average(), 2) where y is the array that contains the reference data I am looking for, it returns the expected values! Is this the correct way of doing it? Or is it just a co-incidence? – Toomey 2/5, 2011 at 22:42

@Han: Sorry, for bombarding you with questions... It really was a coincidence that it worked. The right way is to use the denominator as described in the article, with the one difference, that the part for the larger of the two arrays is only calculated from the offset a match is found and for the length of the shorter array. Correct? – Toomey 3/5, 2011 at 7:47

The denominator is simply a normalizing factor, and replacing it as you propose seems incorrect, or at least make thresholding more difficult. Determining the threshold by which to judge whether the 2 signals are equal is in fact the most difficult part of the procedure, because this threshold is in most cases a hueristic. If you have a-priori knowledge about the signals (for instance a frequency range) then you could pre-filter the signals to remove irrelevant frequency components and improve the cross-correlation peak. – Fluoric 3/5, 2011 at 7:59

@Han: My simple tests show, that I get a correlation close to 1, when normalizing the way I described last. It seems to make sense to me, because I only use those values from the sequences that seem to match and use them for the normalization. What I know about the signals is that one of them is contained completely inside the other. – Toomey 3/5, 2011 at 8:15

@Han: I updated my question with the solution I used. Can you please have a look at it? – Toomey 3/5, 2011 at 8:37

Daniel, the solution looks fine for your purpose, where you know a-priori that one signal is contained within the other, and a maximum correlation value of 1.0 is required. It is not very efficient though. It seems to me that the normalizing factor (if at all necessary) should be calculated directly in the alglib code. Also, note that the long time average of an audio signal is usually zero, making the calculation of the averages superfluous in this case. – Fluoric 4/5, 2011 at 8:38

You can also use the knowledge that one signal is contained in the other, to skip the calculation of one of the integrals in the denominator. E.g. denom2 = denom1 * (max2 / max1) – Fluoric 4/5, 2011 at 8:47

@Han: Thanks a lot for the additional info. I started a bounty, so I can award you some more rep for your continued effort. I would like to ask you however, to summarize the main points discussed in the comments section in your answer, so others can more easily access this info. :-) – Toomey 4/5, 2011 at 9:16

@Han: I already thought about optimizing the code for the normalization but after profiling the code, I didn't bother with it, because the real bottle neck is the calculation of the cross correlation itself. It takes 8000 times longer than the normalization. – Toomey 4/5, 2011 at 9:25

Here we go for the bounty :)

To find a particular reference signal in a larger audio fragment, you need to use a cross-correlation algorithm. The basic formulae can be found in this Wikipedia article.

Fluoric answered 5/5, 2011 at 9:40 Comment(2)

Thanks! I already awarded the bounty to your other answer :-) – Toomey 5/5, 2011 at 10:10

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags