Use convolution to find a reference audio sample in a continuous stream of sound
Asked Answered
T

2

11

in my previous question on finding a reference audio sample in a bigger audio sample, it was proposed, that I should use convolution.
Using DSPUtil, I was able to do this. I played a little with it and tried different combinations of audio samples, to see what the result was. To visualize the data, I just dumped the raw audio as numbers to Excel and created a chart using this numbers. A peak is visible, but I don't really know how this helps me. I have these problems:

  • I don't know, how to infer the starting position of the match in the original audio sample from the location of the peak.
  • I don't know, how I should apply this with a continuous stream of audio, so I can react, as soon as the reference audio sample occurs.
  • I don't understand, why picture 2 and picture 4 (see below) differ so much, although, both represent an audio sample convolved with itself...

Any help is highly appreciated.

The following pictures are the result of the analysis using Excel:

  1. A longer audio sample with the reference audio (a beep) near the end:
  2. The beep convolved with itself:
  3. A longer audio sample without the beep convolved with the beep:
  4. The longer audio sample of point 3 convolved with itself:

UPDATE and solution:
Thanks to the extensive help of Han, I was able to achieve my goal.
After I rolled my own slow implementation without FFT, I found alglib which provides a fast implementation. There is one basic assumption to my problem: One of the audio samples is contained completely within the other.
So, the following code returns the offset in samples in the larger of the two audio samples and the normalized cross-correlation value at that offset. 1 means complete correlation, 0 means no correlation at all and -1 means complete negative correlation:

private void CalcCrossCorrelation(IEnumerable<double> data1, 
                                  IEnumerable<double> data2, 
                                  out int offset, 
                                  out double maximumNormalizedCrossCorrelation)
{
    var data1Array = data1.ToArray();
    var data2Array = data2.ToArray();
    double[] result;
    alglib.corrr1d(data1Array, data1Array.Length, 
                   data2Array, data2Array.Length, out result);

    var max = double.MinValue;
    var index = 0;
    var i = 0;
    // Find the maximum cross correlation value and its index
    foreach (var d in result)
    {
        if (d > max)
        {
            index = i;
            max = d;
        }
        ++i;
    }
    // if the index is bigger than the length of the first array, it has to be
    // interpreted as a negative index
    if (index >= data1Array.Length)
    {
        index *= -1;
    }

    var matchingData1 = data1;
    var matchingData2 = data2;
    var biggerSequenceCount = Math.Max(data1Array.Length, data2Array.Length);
    var smallerSequenceCount = Math.Min(data1Array.Length, data2Array.Length);
    offset = index;
    if (index > 0)
        matchingData1 = data1.Skip(offset).Take(smallerSequenceCount).ToList();
    else if (index < 0)
    {
        offset = biggerSequenceCount + smallerSequenceCount + index;
        matchingData2 = data2.Skip(offset).Take(smallerSequenceCount).ToList();
        matchingData1 = data1.Take(smallerSequenceCount).ToList();
    }
    var mx = matchingData1.Average();
    var my = matchingData2.Average();
    var denom1 = Math.Sqrt(matchingData1.Sum(x => (x - mx) * (x - mx)));
    var denom2 = Math.Sqrt(matchingData2.Sum(y => (y - my) * (y - my)));
    maximumNormalizedCrossCorrelation = max / (denom1 * denom2);
}

BOUNTY:
No new answers required! I started the bounty to award it to Han for his continued effort with this question!

Toomey answered 1/5, 2011 at 9:23 Comment(6)
Thanks for the code. comparison returns 2.3, what does that mean?Comeback
I would say that means there is an error in your code. It should return values between -1 and 1Toomey
Thanks. Do you have any sample code? I have used your code + the code from this question #40143721. Please suggestComeback
Now the code always returns 0.6 to 0.8 but even when I am not speaking.Comeback
You should open a new question with the details of what you have nowToomey
Here you go #47125610Comeback
F
3

Here we go for the bounty :)

To find a particular reference signal in a larger audio fragment, you need to use a cross-correlation algorithm. The basic formulae can be found in this Wikipedia article.

Cross-correlation is a process by which 2 signals are compared. This is done by multiplying both signals and summing the results for all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero), or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure for for how much both signals are alike and the delay between them. Another article that may provide more insight can be found here.

This article by Paul Bourke also contains source code for a straightforward time-domain implementation. Note that the article is written for a general signal. Audio has the special property that the long-time average is usualy 0. This means that the averages used in Paul Bourkes formula (mx and my) can be left out. There are also fast implementations of the cross-correlation based on the FFT (see ALGLIB).

The (maximum) value of the correlation depends on the sample values in the audio signals. In Paul Bourke's algorithm however the maximum is scaled to 1.0. In cases where one of the signals is contained entirely within another signal, the maximum value will reach 1. In the more general case the maximum will be lower and a threshold value will have to be determined to decide whether the signals are sufficiently alike.

Fluoric answered 5/5, 2011 at 9:40 Comment(2)
Thanks! I already awarded the bounty to your other answer :-)Toomey
I have two wave audio signals. the first, reference signal contains only the half of the pattern at the end. If I use corrr1d on them, it will result the highest correlation value somewhere at the beginning. Here is a simplified version: Signal: { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }; Pattern: { 10, 11, 12, 13, 14, 15 }; So the signal only has 10,11,12 from the pattern. Do you know why is this happening?Cecrops
F
4

Instead of a convolution you should use a correlation. The size of the correlation peak tells you how much both signals are alike, the position of the peak their relative position in time, or the delay between both signals.

Fluoric answered 1/5, 2011 at 10:8 Comment(17)
Can you please elaborate a little bit? How to do this correlation?Toomey
You can find relevant information on the correlation here: link. The same article also links to pattern recognition which is what you are attempting to do. There are fast algorithms for the cross-correlation based on the FFT.Fluoric
The wikipedia article is pretty theoretic. Isn't there are more practical introduction? Or even source code? ;-)Toomey
This is the first result for a Google search on 'cross-correlation source code': paulbourke.net/miscellaneous/correlate. It doesn't use a fast algorithm, but it does give some insight on what cross-correlation does. I'm sure you could find much more if you tried.Fluoric
@Han: Thanks! I sure can find examples in google, but because I don't understand the concept even after reading the wikipedia article, I am not sure whether the google result that turns up is really what I need to use... :|Toomey
Cross-correlation is a process by which 2 signals are compared. This is done by multiplying both signals and summing the results for all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero), or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure for for how much both signals are alike. Try to perform the calculations as given in the formula by hand.Fluoric
@Han: Again, thanks a lot. The link to paul bourkes article helped a lot in understanding the concept! I rolled my own implementation of the algorithm based on his code and as expected, it returns a correlation of 1 with an offset of 0, when both arrays are read from the same file. However, in my real world scenario above, it only returns a maximum correlation of 0.39, although, the reference signal was extracted from the source signal using audacity... I would have expected something more closer to one, like 0.95 or so. So, for me the question is: What is a reasonable threshold?Toomey
@Han: I did some more testing and it seems as if my understanding wasn't really correct: Is it possible, to have a correlation of 1 of both arrays aren't of the same length?Toomey
@Han: In paul bourkes code is the denominator defined as Math.Sqrt(Math.Pow(x.Sum() - x.Average(), 2)) * Math.Sqrt(Math.Pow(y.Sum() - y.Average(), 2)). If I change this to simply Math.Pow(y.Sum() - y.Average(), 2) where y is the array that contains the reference data I am looking for, it returns the expected values! Is this the correct way of doing it? Or is it just a co-incidence?Toomey
@Han: Sorry, for bombarding you with questions... It really was a coincidence that it worked. The right way is to use the denominator as described in the article, with the one difference, that the part for the larger of the two arrays is only calculated from the offset a match is found and for the length of the shorter array. Correct?Toomey
The denominator is simply a normalizing factor, and replacing it as you propose seems incorrect, or at least make thresholding more difficult. Determining the threshold by which to judge whether the 2 signals are equal is in fact the most difficult part of the procedure, because this threshold is in most cases a hueristic. If you have a-priori knowledge about the signals (for instance a frequency range) then you could pre-filter the signals to remove irrelevant frequency components and improve the cross-correlation peak.Fluoric
@Han: My simple tests show, that I get a correlation close to 1, when normalizing the way I described last. It seems to make sense to me, because I only use those values from the sequences that seem to match and use them for the normalization. What I know about the signals is that one of them is contained completely inside the other.Toomey
@Han: I updated my question with the solution I used. Can you please have a look at it?Toomey
Daniel, the solution looks fine for your purpose, where you know a-priori that one signal is contained within the other, and a maximum correlation value of 1.0 is required. It is not very efficient though. It seems to me that the normalizing factor (if at all necessary) should be calculated directly in the alglib code. Also, note that the long time average of an audio signal is usually zero, making the calculation of the averages superfluous in this case.Fluoric
You can also use the knowledge that one signal is contained in the other, to skip the calculation of one of the integrals in the denominator. E.g. denom2 = denom1 * (max2 / max1)Fluoric
@Han: Thanks a lot for the additional info. I started a bounty, so I can award you some more rep for your continued effort. I would like to ask you however, to summarize the main points discussed in the comments section in your answer, so others can more easily access this info. :-)Toomey
@Han: I already thought about optimizing the code for the normalization but after profiling the code, I didn't bother with it, because the real bottle neck is the calculation of the cross correlation itself. It takes 8000 times longer than the normalization.Toomey
F
3

Here we go for the bounty :)

To find a particular reference signal in a larger audio fragment, you need to use a cross-correlation algorithm. The basic formulae can be found in this Wikipedia article.

Cross-correlation is a process by which 2 signals are compared. This is done by multiplying both signals and summing the results for all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero), or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure for for how much both signals are alike and the delay between them. Another article that may provide more insight can be found here.

This article by Paul Bourke also contains source code for a straightforward time-domain implementation. Note that the article is written for a general signal. Audio has the special property that the long-time average is usualy 0. This means that the averages used in Paul Bourkes formula (mx and my) can be left out. There are also fast implementations of the cross-correlation based on the FFT (see ALGLIB).

The (maximum) value of the correlation depends on the sample values in the audio signals. In Paul Bourke's algorithm however the maximum is scaled to 1.0. In cases where one of the signals is contained entirely within another signal, the maximum value will reach 1. In the more general case the maximum will be lower and a threshold value will have to be determined to decide whether the signals are sufficiently alike.

Fluoric answered 5/5, 2011 at 9:40 Comment(2)
Thanks! I already awarded the bounty to your other answer :-)Toomey
I have two wave audio signals. the first, reference signal contains only the half of the pattern at the end. If I use corrr1d on them, it will result the highest correlation value somewhere at the beginning. Here is a simplified version: Signal: { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }; Pattern: { 10, 11, 12, 13, 14, 15 }; So the signal only has 10,11,12 from the pattern. Do you know why is this happening?Cecrops

© 2022 - 2024 — McMap. All rights reserved.