Comparing two recorded voices

I need to find some literature on how to compare a voice recorded in real time (from a mic) against a database of pre-recorded voices. After comparing, I would then need to output a match percentage.

I am researching audio fingerprinting, but I can't really reach any conclusion from the literature on such an implementation. Is there any expert out here who can guide me in achieving this?

Below answered 11/1, 2015 at 20:19 Comment(1)
Do you have any results on this work? I'm working on the same problem; I have the MFCC vectors and need to start comparing with specific criteria. – Menorah

I have done similar work before, so I may be the right person to describe the procedure to you.

I had clean recordings of sounds which I considered gold standards. I had written Python scripts to convert these sounds into arrays of MFCC vectors. Read more about MFCCs here.

Extracting MFCCs can be considered the first step in processing an audio file: they are features that are good at capturing the acoustic content. I generated MFCCs every 10 ms, each with 39 attributes. So a sound file that was 5 seconds long had around 500 MFCC vectors, each with 39 attributes.
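The original scripts aren't shown in the answer; a minimal sketch of this extraction step, assuming the librosa library and a hypothetical extract_mfcc helper name, could look like this:

```python
# Minimal MFCC-extraction sketch (assumes the librosa package; not the author's original script).
import librosa
import numpy as np

def extract_mfcc(path, sr=16000, frame_ms=10, n_mfcc=13):
    """Return an (n_frames, 39) array: 13 MFCCs plus delta and delta-delta per 10 ms frame."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(sr * frame_ms / 1000)                 # 10 ms hop -> ~100 frames per second
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    delta = librosa.feature.delta(mfcc)             # first derivative of the MFCCs
    delta2 = librosa.feature.delta(mfcc, order=2)   # second derivative
    feats = np.vstack([mfcc, delta, delta2])        # shape (39, n_frames)
    return feats.T                                  # shape (n_frames, 39)

# A 5 s file at a 10 ms hop gives roughly 500 frames of 39 attributes each, as in the answer.
```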

Then I wrote artificial neural network code along these lines. More about neural networks can be read here.

Then I trained the neural network's weights and biases (commonly known as the network parameters) using stochastic gradient descent, with gradients computed by backpropagation. The trained model was then saved to identify unknown sounds.

The new sounds were then represented as sequences of MFCC vectors and given as input to the neural network. The network predicts, for each MFCC instance obtained from the new sound file, one of the sound classes it was trained on. The number of correctly classified MFCC instances gives the accuracy with which the neural network was able to classify the unknown sound.
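The answer doesn't include the network code; a rough stand-in sketch using scikit-learn's MLPClassifier (my substitution, not the author's implementation), trained with SGD as described and reusing the hypothetical extract_mfcc helper from the earlier sketch, might look like this:

```python
# Frame-level sound classification sketch (scikit-learn stands in for the hand-written network).
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical training data: one labelled file per class.
training_files = {"whistle": "whistle.wav", "car_horn": "horn.wav",
                  "dog_bark": "bark.wav", "siren": "siren.wav"}

X, y = [], []
for label, path in training_files.items():
    frames = extract_mfcc(path)                 # (n_frames, 39) from the previous sketch
    X.append(frames)
    y.extend([label] * len(frames))
X = np.vstack(X)

# Stochastic gradient descent with backpropagation, as described in the answer.
clf = MLPClassifier(hidden_layer_sizes=(64,), solver="sgd",
                    learning_rate_init=0.01, max_iter=500, random_state=0)
clf.fit(X, y)

# Classify every MFCC frame of an unknown recording.
frame_predictions = clf.predict(extract_mfcc("unknown.wav"))
```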

Consider an example: you train your neural network on 4 types of sounds, 1. whistle, 2. car horn, 3. dog bark and 4. siren, using the procedure described above.

Say the new sound is a siren that is 5 s long. You will obtain approximately 500 MFCC instances. The trained neural network will try to classify each MFCC instance into one of the classes it was trained on, so you may get something like this:

30 instances were classified as whistle, 20 as car horn, 10 as dog bark, and the remaining 440 instances were correctly classified as siren.

The accuracy of classification, or rather the commonness between the sounds, can be approximated as the ratio of the number of correctly classified instances to the total number of instances, which in this case is 440 / 500, i.e. 88%. This field is relatively new, and much work has been done using similar machine learning methods such as Hidden Markov Models, Support Vector Machines and more.
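Turning the per-frame predictions from the sketches above into that match percentage is just a vote count; continuing the same hypothetical example:

```python
from collections import Counter

counts = Counter(frame_predictions)            # e.g. {"siren": 440, "whistle": 30, ...}
total = len(frame_predictions)
match_pct = 100.0 * counts["siren"] / total    # 440 / 500 -> 88% in the answer's example
print(counts.most_common(1), match_pct)
```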

This problem has been tackled before, and you can find research papers about it on Google Scholar.

Recency answered 9/6, 2016 at 21:10 Comment(2)
Are there any implemented products/solutions in this field? – Tufthunter
I'm wondering if there are any implementations too. – Walczak

I'm no expert in this field (so treat this accordingly), but you should look at the following.

How to approach?

  1. filter voices

    recognizable speech sits roughly in the 0.4-3.4 kHz band (that is why this band is used in old phone filters). The human voice usually extends up to about 12.7 kHz, so if you are sure you have unfiltered recordings, filter up to 12.7 kHz and also take out the 50 Hz or 60 Hz hum from power lines.

  2. Make the dataset

    if you have recordings of the same sentence to compare, then you can just compute the spectrum via DFFT or DFCT of the same tone/letter (for example at the start, middle and end), filter out unused areas, and build a voice-print dataset from the data. If not, then you first need to find similar tones/letters in the recordings; for that you either need speech recognition to be sure, or you look for parts of the recordings that have similar properties. Which properties those are you have to learn (by trial, or by researching speech-recognition papers); here are some hints: tempo, dynamic volume range, frequency ranges.

  3. compare dataset

    numeric comparison can be done with the correlation coefficient, which is pretty straightforward (and my favorite). You can also use a neural network for this (even for bullet 2), and there may also be some fuzzy approach. I recommend using correlation because its output is similar to what you want and it is deterministic, so there are no problems with over/under-fitting, invalid architecture, etc. (a sketch of these three steps follows after this list).
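This is not the answerer's code, but a minimal sketch of the three steps above under my own assumptions (scipy/numpy, two mono recordings of the same sentence, sampled well above 2 × 12.7 kHz, e.g. 44.1 kHz) might look like this:

```python
# Band-pass + mains-hum filtering, magnitude spectrum, and correlation-coefficient comparison.
# A sketch only: assumes mono recordings with a sample rate comfortably above 25.4 kHz.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt, iirnotch

def clean_voice(x, fs, hum_hz=50.0):
    # Step 1: keep roughly 0.4-12.7 kHz and notch out the power-line hum (50 or 60 Hz).
    b, a = butter(4, [400, 12700], btype="bandpass", fs=fs)
    x = filtfilt(b, a, x)
    bn, an = iirnotch(hum_hz, Q=30.0, fs=fs)
    return filtfilt(bn, an, x)

def voice_print(x):
    # Step 2: normalised magnitude spectrum (DFFT) as a crude voice print.
    spectrum = np.abs(np.fft.rfft(x))
    return spectrum / (np.max(spectrum) + 1e-12)   # normalise so loudness doesn't dominate

def similarity(a, b, fs):
    # Step 3: correlation coefficient of the two voice prints (1.0 = identical, ~0 = unrelated).
    n = min(len(a), len(b))                        # crude alignment: truncate to common length
    pa = voice_print(clean_voice(a[:n], fs))
    pb = voice_print(clean_voice(b[:n], fs))
    return np.corrcoef(pa, pb)[0, 1]

fs1, rec1 = wavfile.read("voice_a.wav")            # hypothetical file names,
fs2, rec2 = wavfile.read("voice_b.wav")            # assumed to share the same sample rate
print("match:", similarity(rec1.astype(float), rec2.astype(float), fs1))
```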

[edit1]

People also use formant filters to generate vowels and speech. Their properties mimic human vocalization paths, and the math behind them can also be used in speech recognition: by inspecting the major frequencies of the filter you can detect vowels, intonation, tempo, and so on, which might be used for speech detection directly. However, that is way outside my field of expertise; there are many papers about this out there, so just google...

Silkstocking answered 12/1, 2015 at 9:1 Comment(0)

This is definitely not a trivial problem.

If you're seriously trying to solve it, I suggest you take a close look at how speech encoders work.

A rough break-down of the steps involved:

  1. Identify the intervals in the recording that contain vowels
  2. Determine the fundamental frequency and the harmonics of the vowel sound
  3. Determine the relative amplitude of the harmonics and the average frequency of the fundamental
  4. Develop a "distance" metric that measures how close two vowel sounds are to each other based on the parameters from step 3
  5. Calculate the distance from the vowel sounds of a new recording to those of the recordings in the database.

The parameters from step 3 are a sort of "fingerprint" of the vocal tract. Typically the consonant sounds are not sufficiently different to be of substantial use (unless the vowel sounds from two individuals are very similar).

As a first and very simple step, try to determine the average fundamental of the vowels and use that frequency as the signature.
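Jens gives no code, but a rough sketch of that simple first step (average fundamental as the signature), using plain NumPy autocorrelation over short frames and hypothetical thresholds of my own, might look like this:

```python
# Rough average-fundamental estimate via frame-wise autocorrelation (a sketch, not Jens's method).
import numpy as np

def average_f0(x, fs, fmin=75.0, fmax=400.0, frame_len=2048):
    """Estimate the average fundamental frequency (Hz) over the voiced parts of a signal."""
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    f0s = []
    for start in range(0, len(x) - frame_len, frame_len):
        frame = x[start:start + frame_len]
        frame = frame - np.mean(frame)
        if np.max(np.abs(frame)) < 0.01 * np.max(np.abs(x)):
            continue                                   # skip near-silent frames
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        peak = lag_min + np.argmax(ac[lag_min:lag_max])
        if ac[peak] > 0.3 * ac[0]:                     # crude voicing check (hypothetical threshold)
            f0s.append(fs / peak)
    return float(np.mean(f0s)) if f0s else 0.0

# Crude signature comparison: a smaller difference in average fundamentals
# suggests a closer match between the new recording and a database recording.
# distance = abs(average_f0(new_rec, fs) - average_f0(db_rec, fs))
```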

Good luck,

Jens

Botts answered 14/1, 2015 at 19:7 Comment(0)
