Speaker Recognition [closed]

Asked 29/1, 2011 at 14:56 Answered 24/12, 2013 at 10:5

How could I differentiate between two people speaking? For example, if one person says "hello" and then another person says "hello", what kind of signature should I be looking for in the audio data? Periodicity?

Thanks a lot to anyone who can answer this!

Prove answered 29/1, 2011 at 14:56 Comment(0)

The solution to this problem lies in digital signal processing (DSP). Speaker recognition is a complex problem that brings computing and communication engineering together. Most speaker identification techniques combine signal processing with machine learning: the system is trained on a database of speakers and then identifies a speaker using that training data. An outline of an algorithm that may be followed:

Record the audio in raw format. This serves as the digital signal that needs to be processed.
Apply pre-processing routines to the captured signal. These could be as simple as signal normalization, or filtering the signal to remove noise, for example with a band-pass filter covering the normal frequency range of the human voice. A band-pass filter can in turn be built by combining a low-pass and a high-pass filter.
Once the captured signal is reasonably free of noise, the feature extraction phase begins. Known techniques for extracting voice features include Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), and simple FFT features.
Now there are two phases: training and testing.
First, the system needs to be trained on the voice features of different speakers before it can distinguish between them. To ensure that the features are calculated reliably, it is recommended to collect several (>10) voice samples from the speakers for training.
Training can be done with techniques such as neural networks or distance-based classification, which find the differences between the voice features of different speakers.
In the testing phase, the training data is used to find the voice feature set that lies at the smallest distance from the signal being tested. Distance measures such as Euclidean or Chebyshev distance can be used to calculate this proximity.
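
As a rough illustration of the outline above, here is a minimal Python/NumPy sketch of the whole pipeline: pre-processing, simple FFT features, training, and nearest-distance testing. Everything specific in it is an assumption made for the example, not part of the answer: the 16 kHz sample rate, the 80-4000 Hz band, the crude band-pass done by zeroing FFT bins, the use of log band energies instead of MFCC or LPC, the function names, and the synthetic harmonic tones from `fake_voice` that stand in for real recordings.

```python
import numpy as np

RATE = 16000  # assumed sample rate in Hz

def preprocess(signal, low=80.0, high=4000.0):
    """Crude band-pass (zero FFT bins outside [low, high] Hz), then peak-normalize."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / RATE)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    filtered = np.fft.irfft(spectrum, n=len(signal))
    peak = np.max(np.abs(filtered))
    return filtered / peak if peak > 0 else filtered

def fft_features(signal, n_bands=32, high=4000.0):
    """Simple FFT features: log energy in n_bands equal-width bands up to `high` Hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / RATE)
    edges = np.linspace(0.0, high, n_bands + 1)
    bands = [power[(freqs >= lo) & (freqs < hi)].sum()
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.log(np.array(bands) + 1e-10)

def train(samples_by_speaker):
    """Training: store the mean feature vector of each speaker's samples."""
    return {name: np.mean([fft_features(preprocess(s)) for s in samples], axis=0)
            for name, samples in samples_by_speaker.items()}

def identify(signal, models, metric="euclidean"):
    """Testing: return the speaker whose stored features are closest to the signal's."""
    feats = fft_features(preprocess(signal))

    def distance(a, b):
        diff = np.abs(a - b)
        if metric == "chebyshev":
            return diff.max()
        return np.sqrt((diff ** 2).sum())

    return min(models, key=lambda name: distance(feats, models[name]))

def fake_voice(pitch, rng, seconds=0.5):
    """Hypothetical stand-in for a recording: harmonics of `pitch` plus noise."""
    t = np.arange(int(RATE * seconds)) / RATE
    tone = sum(np.sin(2 * np.pi * pitch * k * t) / k for k in range(1, 6))
    return tone + 0.05 * rng.standard_normal(t.size)

rng = np.random.default_rng(0)
models = train({
    "alice": [fake_voice(210.0, rng) for _ in range(10)],
    "bob": [fake_voice(120.0, rng) for _ in range(10)],
})
print(identify(fake_voice(210.0, rng), models))
```

Real voices are far less tidy than these synthetic tones, so in practice the feature step would likely be swapped for MFCC or LPC features and the recordings split into short frames.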

Two open source implementations that enable speaker identification are ALIZE (http://mistral.univ-avignon.fr/index_en.html) and MARF (http://marf.sourceforge.net/).

I know it's a bit late to answer this question, but I hope someone finds it useful.

Nathanson answered 24/12, 2013 at 10:5 Comment(2)

A third open source option now exists: Recognito github.com/amaurycrickx/recognito. Its main advantage is its short learning curve. I would recommend reading "Fundamentals of Speaker Recognition" by Homayoon Beigi for in-depth explanations on the subject – Jackinthepulpit 4/4, 2014 at 23:2

@ExtremeCoder and I are looking for a "signature". More specifically, how are we arriving at the conclusion that MFCCs are sufficient to differentiate speakers? Do you have any references for this? – Hartsell 3/11, 2018 at 17:34

This is an extremely hard problem, even for experts in speech and signal processing. This page has much more information: http://en.wikipedia.org/wiki/Speaker_recognition

And some suggested technology starting points:

The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization and decision trees. Some systems also use "anti-speaker" techniques, such as cohort models and world models.

Mucor answered 29/1, 2011 at 15:8 Comment(0)

Having only two people to differentiate, and having them utter the same word or phrase, will make this much easier. I suggest starting with something simple, and only adding complexity as needed.

To begin, I'd try sample counts of the digital waveform, binned by time and magnitude, or (if you have the software functionality handy) an FFT of the entire utterance. I'd also start with a basic modeling process, such as a linear discriminant (or whatever you already have available).
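
To make that concrete, here is a minimal Python/NumPy sketch of the FFT-plus-linear-discriminant route for two speakers. It is only an illustration under stated assumptions: the 8 kHz sample rate, the 8 coarse spectrum bins, the small ridge term added to keep the scatter matrix invertible, the function names, and the synthetic `fake_utterance` tones that stand in for real recordings of the same word are all made up for the example.

```python
import numpy as np

RATE = 8000  # assumed sample rate in Hz

def spectrum_features(signal, n_bins=8):
    """FFT of the entire utterance, magnitudes averaged into n_bins coarse bins."""
    magnitude = np.abs(np.fft.rfft(signal))
    chunks = np.array_split(magnitude, n_bins)
    return np.log(np.array([c.mean() for c in chunks]) + 1e-10)

def fit_lda(X0, X1, ridge=1e-3):
    """Two-class Fisher linear discriminant; rows of X0 and X1 are feature vectors."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter, with a small ridge (an assumption) for stability.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sw += ridge * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m0)
    threshold = w @ (m0 + m1) / 2.0  # midpoint between the projected class means
    return w, threshold

def predict(x, w, threshold):
    """Return 0 for the first speaker, 1 for the second."""
    return int(w @ x > threshold)

def fake_utterance(pitch, rng, seconds=0.5):
    """Hypothetical stand-in for a recording: harmonics of `pitch` plus noise."""
    t = np.arange(int(RATE * seconds)) / RATE
    tone = sum(np.sin(2 * np.pi * pitch * k * t) / k for k in range(1, 6))
    return tone + 0.05 * rng.standard_normal(t.size)

rng = np.random.default_rng(1)
X0 = np.array([spectrum_features(fake_utterance(120.0, rng)) for _ in range(10)])
X1 = np.array([spectrum_features(fake_utterance(210.0, rng)) for _ in range(10)])
w, threshold = fit_lda(X0, X1)
print(predict(spectrum_features(fake_utterance(210.0, rng)), w, threshold))
```

Keeping the number of bins small relative to the number of training utterances matters here, since the discriminant needs a usable estimate of the within-class scatter.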

Deformation answered 29/1, 2011 at 16:53 Comment(0)

Another way to go is to use an array of microphones and differentiate between the positions and directions of the vocal sources. I consider this to be an easier approach, since the position calculation is much less complicated than separating different speakers from a mono or stereo source.
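
For a feel of what the position calculation involves, here is a minimal Python/NumPy sketch for the simplest case of two microphones: estimate the arrival-time difference by cross-correlation, then turn it into a bearing. The function names, the 16 kHz sample rate, the 0.2 m microphone spacing, the 343 m/s speed of sound, and the far-field (plane wave) approximation are all assumptions for illustration.

```python
import numpy as np

RATE = 16000            # assumed sample rate in Hz
SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

def estimate_delay(mic_a, mic_b):
    """Samples by which mic_b lags mic_a (positive: sound reached mic_a first)."""
    corr = np.correlate(mic_b, mic_a, mode="full")
    # In "full" mode, index len(mic_a) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(mic_a) - 1)

def bearing_degrees(delay_samples, spacing_m=0.2):
    """Far-field angle from broadside; positive means the source is nearer mic_a."""
    path_difference = SPEED_OF_SOUND * delay_samples / RATE
    ratio = np.clip(path_difference / spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))

# Synthetic check: the same noise burst reaches mic_b 5 samples after mic_a.
rng = np.random.default_rng(2)
source = rng.standard_normal(2000)
mic_a = source
mic_b = np.concatenate([np.zeros(5), source[:-5]])

delay = estimate_delay(mic_a, mic_b)
print(delay, bearing_degrees(delay))
```

Two speakers standing on opposite sides of the microphone pair would then show up as delays of opposite sign, which is enough to tell them apart without looking at the voice itself.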

Nervine answered 30/1, 2011 at 10:5 Comment(0)
