Find sound effect inside an audio file
I have a load of 3-hour MP3 files, and every ~15 minutes a distinct 1-second sound effect is played, signalling the beginning of a new chapter.

Is it possible to identify each time this sound effect is played, so I can note the time offsets?

The sound effect is similar every time, but because it's been encoded in a lossy file format, there will be a small amount of variation.

The time offsets will be stored in the ID3 Chapter Frame MetaData.


Example source, where the sound effect plays twice. The two samples below were extracted with ffmpeg:

ffmpeg -ss 0.9 -i source.mp3 -t 0.95 -acodec copy -y sample1.mp3

ffmpeg -ss 4.5 -i source.mp3 -t 0.95 -acodec copy -y sample2.mp3


I'm very new to audio processing, but my initial thought was to extract a sample of the 1-second sound effect, then use librosa in Python to extract a floating-point time series for both files, round the floating-point numbers, and try to find a match.

import numpy
import librosa

print("Load files")

source_series, source_rate = librosa.load('source.mp3') # 3 hour file
sample_series, sample_rate = librosa.load('sample.mp3') # 1 second file

print("Round series")

source_series = numpy.around(source_series, decimals=5)
sample_series = numpy.around(sample_series, decimals=5)

print("Process series")

source_start = 0
sample_matching = 0
sample_length = len(sample_series)

for source_id, source_sample in enumerate(source_series):

    if source_sample == sample_series[sample_matching]:

        sample_matching += 1

        if sample_matching >= sample_length:

            print(float(source_start) / source_rate)

            sample_matching = 0

        elif sample_matching == 1:

            source_start = source_id

    else:

        sample_matching = 0

This approach does not work with the MP3 files above, but it did with an MP4 version, where it found the one sample I had extracted, though only that one (not all 12).

I should also note this script takes just over 1 minute to process the 3 hour file (which includes 237,426,624 samples). So I can imagine that some kind of averaging on every loop would cause this to take considerably longer.

Phagocyte answered 29/9, 2018 at 21:26 Comment(6)
Audio is continuous wave data, but a time series is discrete, so what you're doing here would really only work if all the occurrences of your sound clip are synchronised with respect to the sampling rate. You might want to try to do an onset detection and then use the onsets to match up notes. – Repeated

Thanks @LieRyan, you make a good point, and it has highlighted that these sound effects aren't as similar as I thought they were. I've added some example files, and created spectrograms of the two samples (which includes the onset detection details). I've also had a play with averaging these, and using frames_to_time, but must admit I'm not sure I'm going about it in the right way (will keep trying though). Thanks again. – Phagocyte

I didn't really look into it, but one idea would be to calculate the "correlation" between the marker sound and the whole file. The correlation should have peaks at the times where the markers occur in the file. – Blockage

@Matthias, thanks for the suggestion, I did have a play, but there are really only 3 peaks, and getting them to line up wasn't particularly accurate. I'm currently having a play with the data that goes into a spectrogram, as I think that might do well at analysing the sound in more detail (i.e. a bang from a gun sounds different to a drum)... but must admit I am guessing a lot here :-) – Phagocyte

See also github.com/topics/sound-event-detection and github.com/jim-schwoebel/sound_event_detection – Salvatoresalvay

Related: Answer to: find the timestamp of a sound sample of an mp3 with linux or python – Expiate
To follow up on the answers by @jonnor and @paul-john-leonard: they are both correct. By using frames (FFT) I was able to do Audio Event Detection.

I've written up the full source code at:

https://github.com/craigfrancis/audio-detect

Some notes though:

  • To create the templates, I used ffmpeg:

    ffmpeg -ss 13.15 -i source.mp4 -t 0.8 -acodec copy -y templates/01.mp4;

  • I decided to use librosa.core.stft, but I needed to make my own implementation of this stft function for the 3 hour file I'm analysing, as it's far too big to keep in memory.

  • When using stft I tried using a hop_length of 64 at first, rather than the default (512), as I assumed that would give me more data to work with... the theory might be true, but 64 was far too detailed, and caused it to fail most of the time.

  • I still have no idea how to get cross-correlation between frame and template to work (via numpy.correlate). Instead I took the results per frame (the 1025 buckets, not 1024, which I believe relate to the frequencies found) and did a very simple average difference check, then ensured that average was above a certain value (my test case worked at 0.15; the main files I'm using this on required 0.55, presumably because they had been compressed quite a bit more):

    hz_score = abs(source[0:1025,x] - template[2][0:1025,y])
    hz_score = sum(hz_score)/float(len(hz_score))

  • When checking these scores, it's really useful to show them on a graph. I often used something like the following:

    import matplotlib.pyplot as plt

    plt.figure(figsize=(30, 5))
    plt.axhline(y=hz_match_required_start, color='y')

    while x < source_length:
        # ... calculate hz_score for frame x, as above ...
        debug.append(hz_score)
        if x == mark_frame:
            plt.axvline(x=len(debug), ymin=0.1, ymax=1, color='r')
        x += 1

    plt.plot(debug)
    plt.show()

  • When you create the template, you need to trim off any leading silence (to avoid bad matching), and an extra ~5 frames (it seems that the compression / re-encoding process alters this)... likewise, remove the last 2 frames (I think the frames include a bit of data from their surroundings, where the last one in particular can be a bit off).

  • When you start finding a match, you might find it's ok for the first few frames, then it fails... you will probably need to try again a frame or two later. I found it easier having a process that supported multiple templates (slight variations on the sound), and would check their first testable (e.g. 6th) frame and if that matched, put them in a list of potential matches. Then, as it progressed on to the next frames of the source, it could compare it to the next frames of the template, until all frames in the template had been matched (or failed).
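The real implementation lives in the repository above, but the core frame-matching idea in these notes can be sketched with plain numpy. This is a hypothetical simplification, not the repository's code: it uses a single template, and it assumes a frame matches when the average absolute spectral difference is below a threshold (the threshold value and FFT parameters here are illustrative):

```python
import numpy as np

def stft_frames(signal, n_fft=2048, hop=512):
    """Magnitude spectrum of each overlapping frame (a minimal STFT)."""
    window = np.hanning(n_fft)
    return np.array([np.abs(np.fft.rfft(signal[i:i + n_fft] * window))
                     for i in range(0, len(signal) - n_fft, hop)])

def find_template(source, template, threshold=0.15, n_fft=2048, hop=512):
    """Frame offsets where every template frame has a low average difference."""
    src = stft_frames(source, n_fft, hop)
    tpl = stft_frames(template, n_fft, hop)
    matches = []
    for x in range(len(src) - len(tpl) + 1):
        # Average absolute difference per aligned frame pair.
        scores = [np.mean(np.abs(src[x + y] - tpl[y])) for y in range(len(tpl))]
        if max(scores) < threshold:  # every frame must score below the threshold
            matches.append(x)
    return matches  # multiply by hop (and divide by the sample rate) for times
```

Multiplying a returned frame offset by the hop length, then dividing by the sample rate, gives the time offset to store in the ID3 chapter data.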

Phagocyte answered 4/1, 2019 at 2:4 Comment(2)
'but I needed to make my own implementation of this stft function for the 3 hour file I'm analysing, as it's far too big to keep in memory.' - For a similar task (matching jingles in long 44.1 kHz files) I simply loaded the first audio channel of the inputs into float32 numpy arrays and cross-correlated them via scipy.signal.correlate(..., mode='full', method='fft'), which took 'just' 10.2 GiB of RAM (~12 s runtime) for a 2-hour input file. Arguably, by today's standards, this isn't excessive, since MS Teams/Chrome/Eclipse/VSCode/etc. likely use much more after startup... – Expiate

'I still have no idea how to get cross-correlation between frame and template to work' - You have to search for peaks, e.g. via scipy.signal.find_peaks(). As the height parameter you can use something like 70 percent of np.dot(sample, sample), i.e. of the self-correlation of the sample you are searching for. – Expiate
Trying to directly match waveform samples in the time domain is not a good idea. The MP3 encoding will preserve the perceptual properties, but it is quite likely the phases of the frequency components will be shifted, so the sample values will not match.

You could try matching the volume envelopes of your effect and your sample instead. These are less likely to be affected by the MP3 encoding.

First, normalise your sample so the embedded effects are at the same level as your reference effect. Then construct new waveforms from the effect and the sample by using the average of the peak values over time frames that are just short enough to capture the relevant features (better still, use overlapping frames). Finally, use cross-correlation in the time domain.
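A minimal numpy sketch of the envelope approach, assuming peak values over overlapping frames as the envelope, and a fraction of the effect envelope's self-correlation as the detection threshold (the frame sizes and the 0.8 factor are illustrative choices, not prescribed here):

```python
import numpy as np

def volume_envelope(signal, frame=256, hop=128):
    """Peak absolute value over overlapping frames: a coarse volume envelope."""
    return np.array([np.max(np.abs(signal[i:i + frame]))
                     for i in range(0, len(signal) - frame + 1, hop)])

def envelope_matches(source, effect, frame=256, hop=128, factor=0.8):
    """Approximate sample offsets where the effect's envelope appears in the source."""
    src_env = volume_envelope(source, frame, hop)
    eff_env = volume_envelope(effect, frame, hop)
    # Slide the effect envelope over the source envelope.
    corr = np.correlate(src_env, eff_env, mode='valid')
    # A perfect match scores dot(eff_env, eff_env); accept some fraction of it.
    height = factor * np.dot(eff_env, eff_env)
    return (np.nonzero(corr >= height)[0] * hop).tolist()
```

Because the envelope is much shorter than the raw signal, this correlation is also far cheaper than correlating the full waveforms.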

If this does not work, you could analyse each frame using an FFT, which gives you a feature vector per frame, and then try to match the sequence of feature vectors from your effect against the sample. This is similar to @jonnor's suggestion (https://stackoverflow.com/users/1967571/jonnor). MFCC features are used in speech recognition, but since you are not detecting speech, a plain FFT is probably OK.
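A numpy-only sketch of that frame-feature idea, using FFT magnitudes as the per-frame feature vector and cosine similarity as the match score. The frame length, hop, and 0.95 threshold are hypothetical choices for illustration:

```python
import numpy as np

def feature_frames(signal, n_fft=1024, hop=256):
    """One FFT-magnitude feature vector per overlapping frame."""
    window = np.hanning(n_fft)
    return np.array([np.abs(np.fft.rfft(signal[i:i + n_fft] * window))
                     for i in range(0, len(signal) - n_fft + 1, hop)])

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def match_sequence(source, effect, n_fft=1024, hop=256, threshold=0.95):
    """Sample offsets where every effect frame resembles the aligned source frame."""
    src = feature_frames(source, n_fft, hop)
    eff = feature_frames(effect, n_fft, hop)
    offsets = []
    for x in range(len(src) - len(eff) + 1):
        if all(cosine(src[x + y], eff[y]) >= threshold for y in range(len(eff))):
            offsets.append(x * hop)
    return offsets
```

Requiring the whole sequence of frames to match (rather than any single frame) is what keeps the false-positive rate down.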

I am assuming the effect plays by itself (no background noise) and is added to the recording electronically (as opposed to being recorded via a microphone). If this is not the case, the problem becomes more difficult.

Ascensive answered 6/10, 2018 at 20:54 Comment(3)
Thanks @paul-john-leonard, at the moment I've simply cropped the sound effect out of a sample file, so I assume there is no need to normalise it yet (that said, it might be needed when looking at the real files). I've also been using librosa.stft, which I believe is doing the FFT bit, and it's passing the data through util.normalize. Do you think I'm on the same path you're suggesting, or am I missing something? Also, it is added electronically, but I think there may be some compression going on during the recording process, so there may be a bit of variation. – Phagocyte

I am not familiar with librosa, but from your code it looks like you are going along the same lines as I suggest. – Ascensive

As you mention in your own answer, the next step is to determine what is a match. If you get stuck manually tweaking, then examples of real and false matches could be used as training data for, say, nearest neighbours. – Ascensive
This is an Audio Event Detection problem. If the sound is always the same and there are no other sounds at the same time, it can probably be solved with a Template Matching approach. At least if there are no other sounds with other meanings that sound similar.

The simplest kind of template matching is to compute the cross-correlation between your input signal and the template.

  1. Cut out an example of the sound to detect (using e.g. Audacity). Take as much as possible, but avoid the start and end. Store this as a .wav file.
  2. Load the .wav template using librosa.load().
  3. Chop up the input file into a series of overlapping frames. The length should be the same as your template. This can be done with librosa.util.frame.
  4. Iterate over the frames, and compute the cross-correlation between each frame and the template using numpy.correlate.
  5. High values of cross-correlation indicate a good match. A threshold can be applied to decide what is an event or not, and the frame number can be used to calculate the time of the event.
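The steps above can be sketched in plain numpy. Framing is done with a simple slice loop rather than librosa.util.frame so the sketch stays self-contained, and the threshold (a fraction of the template's self-correlation) is an illustrative 0.7, not a recommended value:

```python
import numpy as np

def detect_events(source, template, sr, threshold=0.7, frame_len=None):
    """Cross-correlate overlapping frames of `source` with `template` and
    return the times (in seconds) where the score clears the threshold."""
    n = len(template)
    if frame_len is None:
        frame_len = 8 * n
    hop = frame_len - n + 1  # overlap frames by n-1 samples so no event is missed
    perfect = np.dot(template, template)  # score of an exact match
    times = []
    for start in range(0, len(source), hop):
        frame = source[start:start + frame_len]
        if len(frame) < n:
            break
        corr = np.correlate(frame, template, mode='valid')
        for offset in np.nonzero(corr >= threshold * perfect)[0]:
            times.append((start + offset) / sr)
    return times
```

Processing one frame at a time like this also keeps memory bounded, which matters for a 3-hour input file.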

You should probably prepare some shorter test files which have both some examples of the sound to detect as well as other typical sounds.

If the volume of the recordings is inconsistent you'll want to normalize that before running detection.

If cross-correlation in the time-domain does not work, you can compute the melspectrogram or MFCC features and cross-correlate that. If this does not yield OK results either, a machine learning model can be trained using supervised learning, but this requires labeling a bunch of data as event/not-event.

Schulman answered 6/10, 2018 at 19:48 Comment(1)
Thanks @jonnor, I've not used librosa.util.frame directly, but I have been playing with librosa.stft which, in its source code, uses it to "window the time series"... admittedly I have no idea what STFT is doing yet (I'm using it because it's used to create a spectrogram). But would your suggestion be a better/simpler route, or have I stumbled on to something that's basically the same? Also, thanks for mentioning "Audio Event Detection", I was struggling to search for further info; I'll have a look at MelSpectrogram/MFCC as well. – Phagocyte
This might not be an answer; it's just where I got to before I started researching the answers by @jonnor and @paul-john-leonard.

I was looking at the spectrograms you can get by using librosa stft and amplitude_to_db, and thinking that if I take the data that goes into the graphs, with a bit of rounding, I could potentially find the one sound effect being played:

https://librosa.github.io/librosa/generated/librosa.display.specshow.html

The code I've written below kind of works; although it:

  1. Does return quite a few false positives, which might be fixed by tweaking the parameters of what is considered a match.

  2. I would need to replace the librosa functions with something that can parse, round, and do the match checks in one pass, as a 3-hour audio file causes Python to run out of memory on a computer with 16 GB of RAM after ~30 minutes, before it even gets to the rounding bit.


import sys
import numpy
import librosa

#--------------------------------------------------

if len(sys.argv) == 3:
    source_path = sys.argv[1]
    sample_path = sys.argv[2]
else:
    print('Missing source and sample files as arguments')
    sys.exit()

#--------------------------------------------------

print('Load files')

source_series, source_rate = librosa.load(source_path) # The 3 hour file
sample_series, sample_rate = librosa.load(sample_path) # The 1 second file

source_time_total = float(len(source_series)) / source_rate

#--------------------------------------------------

print('Parse Data')

source_data_raw = librosa.amplitude_to_db(abs(librosa.stft(source_series, hop_length=64)))
sample_data_raw = librosa.amplitude_to_db(abs(librosa.stft(sample_series, hop_length=64)))

sample_height = sample_data_raw.shape[0]

#--------------------------------------------------

print('Round Data') # Also switches X and Y indexes, so X becomes time.

def round_data(raw, height):

    length = raw.shape[1]

    data = []

    range_length = range(1, (length - 1))
    range_height = range(1, (height - 1))

    for x in range_length:

        x_data = []

        for y in range_height:

            # neighbours = []
            # for a in [(x - 1), x, (x + 1)]:
            #     for b in [(y - 1), y, (y + 1)]:
            #         neighbours.append(raw[b][a])
            #
            # neighbours = (sum(neighbours) / len(neighbours));
            #
            # x_data.append(round(((raw[y][x] + raw[y][x] + neighbours) / 3), 2))

            x_data.append(round(raw[y][x], 2))

        data.append(x_data)

    return data

source_data = round_data(source_data_raw, sample_height)
sample_data = round_data(sample_data_raw, sample_height)

#--------------------------------------------------

sample_data = sample_data[50:268] # Temp: Crop the sample_data (318 to 218)

#--------------------------------------------------

source_length = len(source_data)
sample_length = len(sample_data)
sample_height -= 2 # round_data trimmed the first and last rows

source_timing = source_time_total / source_length

#--------------------------------------------------

print('Process series')

hz_diff_match = 18 # For every comparison, how much of a difference is still considered a match - With the Source, using Sample 2, the maximum diff was 66.06, with an average of ~9.9

hz_match_required_switch = 30 # After matching "start" for X, drop to the lower "end" requirement
hz_match_required_start = 850 # Out of a maximum match value of 1023
hz_match_required_end = 650
hz_match_required = hz_match_required_start

source_start = 0
sample_matched = 0

x = 0
while x < source_length:

    hz_matched = 0
    for y in range(0, sample_height):
        diff = abs(source_data[x][y] - sample_data[sample_matched][y])
        if diff < hz_diff_match:
            hz_matched += 1

    # print('  {} Matches - {} @ {}'.format(sample_matched, hz_matched, (x * source_timing)))

    if hz_matched >= hz_match_required:

        sample_matched += 1

        if sample_matched >= sample_length:

            print('      Found @ {}'.format(source_start * source_timing))

            sample_matched = 0 # Prep for next match

            hz_match_required = hz_match_required_start

        elif sample_matched == 1: # First match, record where we started

            source_start = x

        if sample_matched > hz_match_required_switch:

            hz_match_required = hz_match_required_end # Go to a weaker match requirement

    elif sample_matched > 0:

        # print('  Reset {} / {} @ {}'.format(sample_matched, hz_matched, (source_start * source_timing)))

        x = source_start # Matched something, so try again with x+1

        sample_matched = 0 # Prep for next match

        hz_match_required = hz_match_required_start

    x += 1

#--------------------------------------------------
Phagocyte answered 14/10, 2018 at 12:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.