Sound sample recognition library/code

Asked 12/5, 2010 at 9:54 Answered 7/4, 2012 at 17:14

Solved audio signal-processing audio-processing

I don't want sound-to-text software. What I need is the following:

I'll record multiple (say 50+) audio streams (recordings of radio stations).
From those recordings, I'll mark interesting audio clips. Their length ranges from 2 to 60 seconds, and there will be a few thousand such clips.
The library should be able to find other instances of the same audio clips in the recorded sound streams.
A confidence factor should be reported to the user, who can then provide additional input so that recognition performs better next time.

Do you know of such a software library? LGPL would be most valuable to me, but I can go for a commercial license as well.

Audio clips will contain music, speech, effects, or any combination thereof. So, TEXT recognition is out of the question.

Architecture: C++, C# for glue, CUDA if possible.

Blinker answered 12/5, 2010 at 9:54 Comment(5)

Will the audio clips contain speech, sounds, music, all of these? – Curb 15/5, 2010 at 2:28

Do you have a specific language or processor architecture in mind? – Mage 15/5, 2010 at 4:55

BTW, I created my own implementation, after 2 years of development, and it is available for commercial exploitation :) videophill.com/index.php?page=playkontrol – Culosio 3/4, 2012 at 19:5

Database Connection Failed on videophill.com/index.php?page=playkontrol @DanielMošmondor – Postnatal 8/11, 2013 at 22:21

MIT licensed Python library here: github.com/worldveil/dejavu – Egeria 21/7, 2014 at 4:43

I have not found any libraries (yet), but two interesting papers, which may give you terminology and background to refine your searches:

EDIT: Searching for "Audio fingerprinting" led me to a page of implementations, both open source and commercial.

http://wiki.musicbrainz.org/AudioFingerprint
Picard seems to be well established, and could be useful if your clips contain music.

Here is an introduction to Audio fingerprinting

Curb answered 15/5, 2010 at 2:33 Comment(1)

First of your proposals looks promising, and I know of Picard, but I'm not sure that is appropriate for 'sample from stream' detection. – Culosio 16/5, 2010 at 15:3

What you are describing is a matched filter and all you need is a cross-correlation function which should be part of any reasonable DSP library. Depending upon your choice of processor architecture and language you may even be able to find a vectorized library that can perform this operation more efficiently.

If you don't really care about performance you could use Python...

$ python
Python 2.6.4 (r264:75706, Dec  7 2009, 18:45:15) 
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy
>>> interesting_clip = [ 5, 7, 2, 1]
>>> full_stream = [ 1, 5, 7, 2, 1, 4, 3, 2, 4, 7, 1, 2, 2, 5, 1]
>>> correlation = scipy.correlate (full_stream, interesting_clip)
>>> print correlation
[56 79 55 28 41 49 44 53 73 48 28 35]
>>> for offset, value in enumerate(correlation) :
...     if (value > 60) :
...         print "match at position", offset, "with value of", value
... 
match at position 1 with value of 79
match at position 8 with value of 73

My threshold above is arbitrary. You should experimentally determine what is appropriate for you.

Keep in mind that the longer your "interesting clip", the longer it will take to compute the correlation. While longer clips will help actual matches stand out better from non-matches, you probably won't need more than a few seconds.
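Since the raw correlation values above depend on how loud the signal is, one way to make the threshold less arbitrary is to normalize each window before comparing. The following is only a sketch of that idea, written for Python 3 with NumPy. The function name `normalized_matches` and the default threshold of 0.95 are assumptions made for illustration, not part of any library. The score is the cosine similarity of the mean-removed clip and window, so it always lies between -1 and 1.

```python
import numpy as np

def normalized_matches(stream, clip, threshold=0.95):
    """Return (offset, score) pairs where the clip resembles the stream.

    The score is the normalized cross-correlation of the mean-removed
    clip and window, so it does not depend on overall signal level.
    The threshold is a hypothetical starting point, to be tuned.
    """
    stream = np.asarray(stream, dtype=float)
    clip = np.asarray(clip, dtype=float)
    n = len(clip)
    clip_centered = clip - clip.mean()
    clip_norm = np.linalg.norm(clip_centered)
    matches = []
    for offset in range(len(stream) - n + 1):
        window = stream[offset:offset + n]
        window_centered = window - window.mean()
        denom = clip_norm * np.linalg.norm(window_centered)
        if denom == 0:
            continue  # flat clip or flat window: correlation is undefined
        score = float(np.dot(window_centered, clip_centered) / denom)
        if score >= threshold:
            matches.append((offset, score))
    return matches

interesting_clip = [5, 7, 2, 1]
full_stream = [1, 5, 7, 2, 1, 4, 3, 2, 4, 7, 1, 2, 2, 5, 1]
for offset, score in normalized_matches(full_stream, interesting_clip):
    print("match at position", offset, "with score", round(score, 3))
```

The explicit loop is slow for real audio. For long streams an FFT-based correlation would be the usual replacement, but the scoring idea stays the same.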

Mage answered 15/5, 2010 at 2:58 Comment(1)

OK, correlation seems fine enough, but in WHAT feature space??? What would you propose? – Culosio 16/5, 2010 at 14:34

AudioDB is an open source C++ project that searches for similar sections of audio, handles noisy streams, and can give you a measure of similarity. It can be run as client/server, but I believe you can also build it as a standalone program.
The other answers about dsp correlation are kind of correct, but in general these dsp algorithms want to compare two streams of the same length, which have the similar parts overlapping.
What you need requires it to work on arbitrary segments of the stream; this is what AudioDB was built for. (One application is to find hidden references/sampling or blatant copyright misuse.) I've used it for finding sounds that were played backwards, and it also finds the case where some noise or speech changes are introduced.
Note that it is still under development even though the dates on the home page seem to be off. I would subscribe to the mailing list and ask what the current state is and how you might go about incorporating it.

Cleodal answered 18/5, 2010 at 2:4 Comment(0)

You might want to look at this paper by Li-Chun Wang regarding www.shazam.com.

It is not an API but it does give details of how their algorithm was developed.
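To give a rough feel for the general idea of fingerprinting with spectrogram peaks, here is a toy sketch in Python 3 with NumPy. It is not the algorithm from the paper and not Shazam's implementation. It keeps only the single strongest frequency bin per frame, hashes pairs of nearby peaks together with their time distance, and then votes on the time offset at which the clip's hashes line up with the stream's. All names (`spectrogram_peaks`, `fingerprints`, `best_offset`) and all parameters (frame size, hop, `fan_out`) are assumptions made for this example, and the test signal is synthetic.

```python
import numpy as np
from collections import Counter

def spectrogram_peaks(signal, frame=1024, hop=512):
    """Return the index of the strongest frequency bin in each frame."""
    window = np.hanning(frame)
    peaks = []
    for start in range(0, len(signal) - frame + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        peaks.append(int(np.argmax(spectrum)))
    return peaks

def fingerprints(signal, fan_out=3):
    """Build hashes (f1, f2, dt), each tagged with its anchor frame index."""
    peaks = spectrogram_peaks(signal)
    hashes = []
    for i, f1 in enumerate(peaks):
        for dt in range(1, fan_out + 1):
            if i + dt < len(peaks):
                hashes.append(((f1, peaks[i + dt], dt), i))
    return hashes

def best_offset(stream, clip):
    """Vote on the frame offset where clip hashes align with stream hashes.

    Returns (offset_in_frames, votes), or (None, 0) if nothing matched.
    """
    index = {}
    for key, t_stream in fingerprints(stream):
        index.setdefault(key, []).append(t_stream)
    votes = Counter()
    for key, t_clip in fingerprints(clip):
        for t_stream in index.get(key, []):
            votes[t_stream - t_clip] += 1
    if not votes:
        return None, 0
    return votes.most_common(1)[0]

# Synthetic test data (an assumption for this sketch): 40 random tones.
rng = np.random.default_rng(0)
rate = 8000
t = np.arange(2048) / rate
tones = rng.integers(200, 3000, size=40)
stream = np.concatenate([np.sin(2 * np.pi * f * t) for f in tones])

# The clip is cut starting at frame 20 (20 * 512 samples), with added noise.
start = 20 * 512
clip = stream[start:start + 8 * 2048] + 0.05 * rng.standard_normal(8 * 2048)
print(best_offset(stream, clip))
```

The number of votes at the winning offset, relative to the number of hashes in the clip, could serve as a crude confidence factor. A realistic system would keep several peaks per frame and would not assume the clip is aligned to the frame grid.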

Negotiation answered 20/5, 2010 at 8:34 Comment(0)

Take a look at the Microsoft Speech API (SAPI):
http://msdn.microsoft.com/en-us/library/ee125077%28VS.85%29.aspx

All the other requirements you listed are basically implementation details that you'll have to implement on your own. For example, as the software interprets the audio streams, it can store them in SQL server with full text indexing ... from that you do the searches to find similar/same audio clips.

There are of course other ways to implement that, and this is but one idea :-)

Denning answered 14/5, 2010 at 20:33 Comment(1)

Well, since my question explicitly stated that I don't want sound-to-text recognition, because I have no use for it in finding jingles or some other kind of sounds, I'll have to drop you -1 on this. – Culosio 16/5, 2010 at 14:35

I would go somewhere in line with Tim Kryger's answer and use simple statistical correlation functions, as you want to stay content-agnostic.

As for the features, I would definitely try MFCC, as it's used both in speech processing and in music recognition (genres, songs). You can find MFCC and a wealth of other audio features in the excellent open source Vamp plugins (or their higher-level companion, a program called Sonic Annotator), or alternatively in the Marsyas framework.
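To make the combination of MFCC features and simple correlation more concrete, here is a hand-rolled sketch in Python 3 with NumPy. It is not how Vamp or Marsyas compute their features. The filter count, frame sizes, number of coefficients, and the helper names `mel_filterbank`, `mfcc` and `find_clip` are all assumptions chosen for illustration, and the audio is synthetic. Coefficient 0 is dropped before matching because it mostly tracks overall loudness.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, rate):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

def mfcc(signal, rate, frame=512, hop=256, n_filters=20, n_coeffs=13):
    """Return an array of shape (n_frames, n_coeffs) of MFCC-like features."""
    window = np.hamming(frame)
    bank = mel_filterbank(n_filters, frame, rate)
    # DCT-II basis written out by hand to avoid extra dependencies.
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * np.arange(n_coeffs)[:, None])
    frames = []
    for start in range(0, len(signal) - frame + 1, hop):
        power = np.abs(np.fft.rfft(signal[start:start + frame] * window)) ** 2
        log_energy = np.log(bank @ power + 1e-10)  # small constant avoids log(0)
        frames.append(dct @ log_energy)
    return np.array(frames)

def find_clip(stream_feats, clip_feats):
    """Slide clip features over stream features; return (best_frame, score)."""
    n = len(clip_feats)
    c = clip_feats[:, 1:].ravel()          # drop coefficient 0 (loudness)
    c = c - c.mean()
    best = (-1, -1.0)
    for offset in range(len(stream_feats) - n + 1):
        w = stream_feats[offset:offset + n, 1:].ravel()
        w = w - w.mean()
        score = float(w @ c / (np.linalg.norm(w) * np.linalg.norm(c) + 1e-12))
        if score > best[1]:
            best = (offset, score)
    return best

# Synthetic test data (an assumption for this sketch): noisy random tones.
rng = np.random.default_rng(0)
rate = 8000
t = np.arange(2048) / rate
tones = rng.integers(200, 3000, size=40)
stream = np.concatenate([np.sin(2 * np.pi * f * t) for f in tones])
stream = stream + 0.05 * rng.standard_normal(len(stream))

# The clip is cut at frame 40 (40 * 256 samples), played quieter, with extra noise.
start = 40 * 256
clip = 0.5 * stream[start:start + 8 * 2048] + 0.01 * rng.standard_normal(8 * 2048)
best_frame, score = find_clip(mfcc(stream, rate), mfcc(clip, rate))
print(best_frame, round(score, 3))
```

The returned score lies between -1 and 1 and could double as the confidence factor asked for in the question, though suitable thresholds would have to be found experimentally on real recordings.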

Frederigo answered 7/4, 2012 at 17:14 Comment(0)
