find the timestamp of a sound sample of an mp3 with linux or python

I am slowly working on a project where it would be very useful if the computer could find where in an mp3 file a certain sample occurs. I would restrict this problem to a fairly exact snippet of the audio, not, for example, the chorus of a song on a different recording by the same band, where it would become more of a machine learning problem. I am thinking that if the snippet has no noise added and comes from the same file, it should somehow be possible to locate the time at which it occurs without machine learning, just like grep can find the lines in a text file where a word occurs.

In case you don't have an mp3 lying around, you can set up the problem with some public-domain music available on the net, so nobody complains:

curl https://web.archive.org/web/20041019004300/http://www.navyband.navy.mil/anthems/ANTHEMS/United%20Kingdom.mp3 --output godsavethequeen.mp3

It's a minute long:

exiftool godsavethequeen.mp3 | grep Duration
Duration                        : 0:01:03 (approx)

Now cut out a bit between 30 and 33 seconds (the bit which goes la la la la..):

ffmpeg -ss 30 -to 33 -i godsavethequeen.mp3 gstq_sample.mp3

Both files are now in the folder:

$ ls -la
-rw-r--r-- 1 cardamom cardamom   48736 Jun 23 00:08 gstq_sample.mp3
-rw-r--r-- 1 cardamom cardamom 1007055 Jun 22 23:57 godsavethequeen.mp3

For some reason exiftool seems to overestimate the duration of the sample:

$ exiftool gstq_sample.mp3 | grep Duration
Duration                        : 6.09 s (approx)

...but I suppose it's only approximate, like it tells you.

This is what I am after:

$ findsoundsample gstq_sample.mp3 godsavethequeen.mp3
start 30 end 33

I am happy with either a bash script or a Python solution, even one using some kind of Python library. Sometimes if you use the wrong tool, the solution might work but look horrible, so whichever tool is more suitable. This is a one-minute mp3. I have not thought about performance yet, just about getting it done at all, but I would like some scalability, e.g. finding ten seconds somewhere in half an hour.

Have been looking at the following resources as I try to solve this myself:

How to recognize a music sample using Python and Gracenote?

https://github.com/craigfrancis/audio-detect

https://madmom.readthedocs.io/en/latest/introduction.html

Reading *.wav files in Python

https://github.com/aubio/aubio

aubionset is a good candidate

https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/

Joyless answered 22/6, 2020 at 22:39 Comment(0)

As suggested in Carson's answer, processing the audio gets a lot easier once the files are converted to the .wav format.

You may do so using Wernight's answer on reading mp3 in python:

ffmpeg -i godsavethequeen.mp3 -vn -acodec pcm_s16le -ac 1 -ar 44100 -f wav godsavethequeen.wav
ffmpeg -i gstq_sample.mp3 -vn -acodec pcm_s16le -ac 1 -ar 44100 -f wav gstq_sample.wav

Finding the position of the sample is then mostly a matter of locating the peak of the cross-correlation function between the source (godsavethequeen.wav in this case) and the sample to look for (gstq_sample.wav). In essence, this finds the shift at which the sample looks the most like the corresponding portion of the source. This can be done in Python using scipy.signal.correlate.

A small Python script to do just that could look like:

import sys

import numpy as np
from scipy import signal
from scipy.io import wavfile

snippet_path = sys.argv[1]
source_path = sys.argv[2]

# read the sample to look for
rate_snippet, snippet = wavfile.read(snippet_path)
snippet = np.array(snippet, dtype='float')

# read the source
rate, source = wavfile.read(source_path)
source = np.array(source, dtype='float')

# resample such that both signals are at the same sampling rate (if required)
if rate != rate_snippet:
    num = int(np.round(rate * len(snippet) / rate_snippet))
    snippet = signal.resample(snippet, num)

# compute the cross-correlation (default mode is 'full')
z = signal.correlate(source, snippet)

# in 'full' mode the peak index is where the snippet's last sample
# lines up with the source, so the start is len(snippet) - 1 samples earlier
peak = np.argmax(np.abs(z))
start = (peak - len(snippet) + 1) / rate
end = peak / rate

print("start {} end {}".format(start, end))

Note that for good measure I've included a check to make sure both .wav files have the same sampling rate (and resample as needed), but you could alternatively make sure they are always the same when you convert them from .mp3 by passing the -ar 44100 argument to ffmpeg.

Chancy answered 30/6, 2020 at 2:32 Comment(4)

Omg you've done it, fantastic! - start 30.0 end 32.999977324263035 Very interesting.. Can see on your profile you know about signal processing and fft. It took 0.7 seconds on my machine, quick.. Will look at the intermediate results in your code on each line. – Joyless 30/6, 2020 at 11:50

This is so great. Would there be a way to get multiple start-end time pairs, if a sound is repetitive within the audio file? – Adapter 9/5, 2021 at 3:1

FWIW, in my experiments, doing such processing on dtype='float32' (instead of dtype='float') arrays increases performance. – Chordate 29/6 at 12:28

@Adapter yes, there is a way. You can use scipy.signal.find_peaks() to find all matching locations, for example. See also how I use it for such a task in a project of mine. – Chordate 29/6 at 12:36
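To illustrate the find_peaks suggestion, here is a minimal sketch, not the code from the project mentioned above. The function name find_matches and the rel_height threshold are assumptions made for this example. It uses mode="valid" so that each index of the correlation is directly the offset of the snippet in the source, and it requires peaks to be at least one snippet length apart so that a single match is not reported twice.

```python
import numpy as np
from scipy import signal


def find_matches(source, snippet, rate, rel_height=0.8):
    """Return the start time (in seconds) of every strong match.

    rel_height is a hypothetical tuning knob: a peak counts as a match
    if it reaches this fraction of the strongest correlation peak.
    """
    source = np.asarray(source, dtype="float32")
    snippet = np.asarray(snippet, dtype="float32")

    # 'valid' mode: z[i] is the correlation with the snippet placed at offset i
    z = np.abs(signal.correlate(source, snippet, mode="valid"))

    # keep only strong peaks that are at least one snippet length apart
    peaks, _ = signal.find_peaks(
        z, height=rel_height * z.max(), distance=len(snippet)
    )
    return peaks / rate
```

Note that find_peaks does not report a maximum sitting on the very first or last index, so a match at the exact start or end of the source would need separate handling.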

MP3 is an interesting format. The underlying data is stored in 'frames', each about 0.026 seconds long at a 44.1 kHz sample rate. Each frame holds a frequency-domain encoding of the sound wave (MP3 uses a modified discrete cosine transform, a relative of the Fourier transform), stored with varying degrees of quality depending on the frame size, bitrate, and so on. In your case, are you certain that the mp3s have matching bitrates? If they do, a relatively straightforward grep-style approach should be possible, provided that you cut on frame boundaries. However, it is entirely possible, even likely, that this is not the case.
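As a rough illustration of the frame arithmetic: assuming an MPEG-1 Layer III stream at 44.1 kHz (an assumption, since the sample rate of the file in the question is not stated), each frame carries 1152 samples per channel, which is where the 0.026 s figure comes from. The helper names below are hypothetical. Under these assumptions the sketch also shows that a cut at exactly 30 s would not land on a frame boundary.

```python
SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III; assumed for this sketch


def frame_duration(sample_rate_hz):
    """Duration of one frame in seconds."""
    return SAMPLES_PER_FRAME / sample_rate_hz


def samples_past_boundary(t_seconds, sample_rate_hz=44100):
    """How many samples past the previous frame boundary time t falls."""
    sample = round(t_seconds * sample_rate_hz)
    return sample % SAMPLES_PER_FRAME


print(frame_duration(44100))      # roughly 0.0261 s
print(samples_past_boundary(30))  # non-zero: 30 s is not on a frame boundary
```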

For a true solution, you need to decode the mp3 file to some degree, to abstract away the encoding. However, there is no guarantee that the resulting waves match exactly even for matching sounds, as bitrates and possibly frame alignment may differ. This uncertainty makes the problem much harder.

I will give you my approach to this problem, but it is worth noting that this is not the perfect way to do things, just my best swing. Even though it's the same file, there's no guarantee that frame boundaries are aligned, so I think you need to take a very wave-oriented approach, rather than a data-oriented one.

First, convert the mp3s to waves. I know it would be great to leave them compressed, but again I think wave-oriented is our only hope. Then, use a high-pass filter to try to remove any artifacts of audio compression that would differ between samples. Once you have two waveforms, it should be relatively straightforward to find the wavelet in the wave: iterate through possible starting positions and subtract the waves. When the difference gets close to zero, you know you're close.
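The "iterate and subtract" idea can be sketched as a brute-force search. This is only an illustration under stated assumptions: both signals are already decoded to mono arrays at the same sample rate, the filtering step is left out, and the function name find_by_subtraction and the step parameter are made up for this example.

```python
import numpy as np


def find_by_subtraction(source, snippet, rate, step=1):
    """Slide the snippet over the source and return (start_seconds, error)
    for the offset where the mean absolute difference is smallest."""
    source = np.asarray(source, dtype=float)
    snippet = np.asarray(snippet, dtype=float)
    n = len(snippet)

    best_offset, best_err = 0, np.inf
    for offset in range(0, len(source) - n + 1, step):
        # subtract the two waves at this position; close to zero means a match
        err = np.mean(np.abs(source[offset:offset + n] - snippet))
        if err < best_err:
            best_offset, best_err = offset, err
    return best_offset / rate, best_err
```

The cost grows with the product of the two lengths, so on real audio at 44.1 kHz this is likely too slow unless the signals are downsampled first or a larger step is used, and a larger step can miss the exact alignment.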

Footwork answered 25/6, 2020 at 15:55 Comment(2)

Thanks, good to see someone knows this area a bit. I easily converted from .mp3 to .wav in the terminal, the file was at least double as big. When you say "iterate through possible starting positions and subtract the waves, when you get close to zero, you know you're close.", had also intuitively had a thought like that. Get the sample and iterate it to different positions over the longer mp3 / wav. – Joyless 25/6, 2020 at 17:47

was reading again. Could I trouble you to say a bit more about how to find the wavelet in the wave, just name the technique or choice of techniques. It sounds like you are busy with this kind of thing at the moment.. – Joyless 29/6, 2020 at 23:9
