Algorithm for voice comparison

Asked 11/5, 2010 at 7:46 Answered 11/5, 2010 at 9:45

Given two recorded voices in digital format, is there an algorithm to compare the two and return a coefficient of similarity?

Farrow answered 11/5, 2010 at 7:46 Comment(2)

Are you trying to determine whether the speakers are the same or similar, whether the speech itself is the same or similar, or both? – Jopa 11/5, 2010 at 8:2

Sorry that I didn't clarify this: independent of speakers is preferred. I am looking for similarity of the speech itself. – Farrow 11/5, 2010 at 8:23

Given your clarification I think what you are looking for falls under speech recognition algorithms.

Even though you only want a measure of similarity and are not trying to turn speech into text, the concepts are the same, and I would not be surprised if a large part of those algorithms turned out to be useful.

However, you will have to define this coefficient of similarity more formally and precisely to get anywhere.

EDIT: I believe speech recognition algorithms would be useful because they abstract the sound into features and compare those against known patterns. Conceptually this is not that different from taking two recordings, abstracting each, and comparing the results.

From the Wikipedia article on hidden Markov models (HMM):

"In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients."

So if you run such an algorithm on both recordings you would end up with coefficients that represent the recordings and it might be far easier to measure and establish similarities between the two.
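As a rough illustration of the quoted recipe (Fourier transform of a short frame, log magnitude, cosine transform, keep the first few coefficients), here is a minimal Python sketch. The function names, the naive DFT, the `floor` parameter and the default of 10 coefficients are my own assumptions for illustration; a real front end would also window the frame and usually apply mel warping, both of which are omitted here.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """One-sided DFT magnitudes of a real frame (naive O(N^2) DFT, fine for short frames)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def cepstral_coefficients(frame, n_coeffs=10, floor=1e-10):
    """First n_coeffs coefficients of a DCT-II over the log magnitude spectrum.

    `floor` (an assumed value) keeps log() away from zero for empty bins.
    """
    log_spec = [math.log(max(m, floor)) for m in magnitude_spectrum(frame)]
    m = len(log_spec)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / m)
            for i, v in enumerate(log_spec))
        for k in range(n_coeffs)
    ]
```

Running this on each short frame of both recordings would give two sequences of vectors to compare. One useful property: scaling the amplitude of a frame only changes coefficient 0, so the remaining coefficients do not depend on recording level.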

But again now you come to the question of defining the 'similarity coefficient' and introducing dogs and horses did not really help.

(Well it does a bit, but in terms of evaluating algorithms and choosing one over another, you will have to do better).
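One possible way to pin such a coefficient down, offered only as a sketch: treat each recording as a sequence of feature vectors and align the two sequences with dynamic time warping (DTW), so that differences in speaking rate are not penalised. DTW is my own choice here, not something the Wikipedia quote prescribes, and the function names and the 1 / (1 + distance) mapping are assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Length-normalised DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a frame of b
                                 cost[i][j - 1],      # repeat a frame of a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m] / (n + m)


def similarity(seq_a, seq_b):
    """Map the distance into (0, 1]; 1.0 means the sequences align perfectly."""
    return 1.0 / (1.0 + dtw_distance(seq_a, seq_b))
```

With this definition, a sequence and a time-stretched copy of it score 1.0, while unrelated sequences score lower. Whether that matches what you mean by "similar" is exactly the part you still have to decide.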

Crompton answered 11/5, 2010 at 9:1 Comment(2)

I am not trying to extract any meaning from the sound source. For example, if I record two dog barks and one horse neigh, comparing the two barks should give a higher coefficient than comparing a bark with the neigh. – Farrow 11/5, 2010 at 9:17

@Horace Ho, replied in the EDIT as part of the answer – Crompton 11/5, 2010 at 9:47

I recommend taking a look at the HTK toolkit for speech recognition, http://htk.eng.cam.ac.uk/, especially the part on feature extraction.

Features that I would assume to be good indicators:

Mel-cepstrum coefficients (general timbre)
LPC coefficients (spectral envelope, i.e. the formant structure)
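To give an idea of what the LPC feature involves, here is a minimal Python sketch of the autocorrelation method with the Levinson-Durbin recursion. This is a textbook formulation, not HTK's code; the function names are my own, and a real extractor would window and pre-emphasise each frame first.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a real frame."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Predictor coefficients a[1..order] such that x[t] ~ sum(a[k] * x[t - k])."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)
    err = r[0]  # prediction error energy, shrinks as the order grows
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable frame: stop early
        # Reflection coefficient for order i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]
```

For a frame that decays as 0.9^t, `lpc(frame, 2)` comes out close to `[0.9, 0.0]`, i.e. each sample is predicted as 0.9 times the previous one.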

Taka answered 11/5, 2010 at 9:45 Comment(2)

Does the license (htk.eng.cam.ac.uk/docs/license.shtml) permit using the toolkit in another application for distribution? – Farrow 11/5, 2010 at 10:0

From what I remember it is extremely restrictive. However, you can also try clam-project.org, which is free software. There you'll find efficient implementations of the feature extraction algorithms provided by HTK (and some more). – Taka 11/5, 2010 at 10:19

There are many different algorithms - the general name for this task is Speaker Identification - start with this Wikipedia page and work from there: http://en.wikipedia.org/wiki/Speaker_recognition

Opportunity answered 11/5, 2010 at 7:52 Comment(0)

I'm not sure this will work for sound files, but I hope it gives you an idea of how to proceed. It is a basic way of finding a pattern (an image) inside another image.

You first calculate the FFT of both sound files and then correlate them. As a formula it would look like this (pseudocode):

fftSoundFile1 = fft(soundFile1);
fftConjSoundFile2 = conj(fft(soundFile2));
result_corr = real(ifft(fftSoundFile1 .* fftConjSoundFile2));

Where fft = fast Fourier transform, ifft = its inverse, and conj = complex conjugate. The FFT is performed on the sample values of the sound files. The peaks in the result_corr vector then give you the positions (lags) of high correlation. Note that both sound files must be the same length in this case; otherwise, pad the shorter one with zeros up to max(soundFileLength).

Regards

Edit: .* means (in MATLAB style) element-wise multiplication; you must not use a vector (matrix) product! Next edit: Note that you have to operate on complex numbers, but there are several Complex classes out there, so I don't think you need to worry about this.
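For anyone who wants to try the correlation idea, here is a self-contained Python sketch using only the standard library. The small radix-2 FFT, the zero-padding to a power of two, and the energy normalisation that turns the peak into a single coefficient are my own additions for illustration. Keep in mind that correlating raw samples is a weak measure for time-varying signals such as speech.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def cross_correlation(a, b):
    """Cross-correlation of two real signals via real(ifft(fft(a) .* conj(fft(b))))."""
    size = 1
    while size < len(a) + len(b):  # pad enough that circular wrap-around cannot overlap
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    prod = [p * q.conjugate() for p, q in zip(fa, fb)]  # element-wise product
    return [v.real / size for v in fft(prod, inverse=True)]


def peak_similarity(a, b):
    """Highest correlation peak divided by the signal energies; at most 1.0."""
    norm = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    if norm == 0.0:
        return 0.0
    return max(cross_correlation(a, b)) / norm
```

A value of 1.0 means one signal is a shifted, positively scaled copy of the other; the index of the peak gives the shift.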

Carson answered 11/5, 2010 at 8:3 Comment(2)

This does not even get close to being a working solution. The spectrum of speech is time varying and noisy. You could only really do something like this for a very small segment of speech where the speaker is saying, e.g. the same vowel, and even then it probably won't work very well, if at all. – Opportunity 11/5, 2010 at 8:16

Sorry, I'm not a "speech expert", but I thought that for a simple "how similar are these sound files" it would be OK as a first approach, since it works with images. – Carson 11/5, 2010 at 8:27
