audio comparison with R
I am working on a project where my task deals with speech/audio/voice comparison. The project is used for judging the winner in mimicry competitions. Practically, I need to capture the user's speech/voice, compare it with the original audio file, and return a percentage match. I need to develop this in the R language.

I have already tried voice-related packages in R (tuneR, audio, seewave), but in my search I was not able to find any comparison-related information.

I need some assistance from you: where can I find information related to my work, what is the best way to handle this type of problem, and what are the prerequisites for processing this kind of audio-related work?

Turnkey answered 14/12, 2015 at 10:20 Comment(2)
I am not an audio processing expert, but you can do a lot of stuff with seewave that could be helpful for you. For your specific problem, spectrograms and amplitude normalization come to mind, both of which can easily be done in seewave.Icebox
Thanks for your suggestion. I had already tried amplitude normalization in the seewave package, but as far as I know we need justified reference values when normalizing, and I cannot find them. Please let me know if you have any idea about that. Once again, thank you.Turnkey
  • Basically, the best features for speech/voice comparison are the MFCCs (mel-frequency cepstral coefficients).

There is software that can extract these coefficients, e.g. Praat (Praat website).
You can also try to find a library to extract these coefficients.
[Edit: I've found in the tuneR documentation that it has a function to extract MFCCs - search for the function melfcc()]
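A rough sketch of that extraction step in R, assuming the tuneR package is installed (two synthetic sine tones stand in for the real recordings here; replace them with readWave() calls on the actual files):

```r
library(tuneR)

# synthetic 1-second tones standing in for the two recordings;
# in practice: a <- readWave("original.wav"); b <- readWave("attempt.wav")
a <- sine(440, duration = 16000, samp.rate = 16000)
b <- sine(450, duration = 16000, samp.rate = 16000)

# melfcc() returns a frames-by-coefficients matrix of MFCCs
mfcc_a <- melfcc(a, sr = a@samp.rate)
mfcc_b <- melfcc(b, sr = b@samp.rate)

# collapse each time series into one fixed-length vector per recording
fa <- colMeans(mfcc_a)
fb <- colMeans(mfcc_b)
```

The per-coefficient means are only one possible summary; variances or other statistics over the frames can be appended to the feature vector.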

  • After you've extracted these features, you can use machine learning (SVM, Random Forests, or something like that) to develop a classifier.
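A minimal sketch of that classification step in R, assuming the e1071 package (its svm() can return class probabilities, which could serve as the percentage match the question asks for); the feature matrix below is random toy data standing in for real pairwise features:

```r
library(e1071)

set.seed(1)
# toy data: 200 pairs, 12 pairwise features each (e.g. abs MFCC differences)
feats <- matrix(rnorm(200 * 12), nrow = 200)
label <- factor(sample(c("same", "different"), 200, replace = TRUE))

# train an SVM with probability estimates enabled
model <- svm(x = feats, y = label, probability = TRUE)
pred  <- predict(model, feats[1:5, , drop = FALSE], probability = TRUE)
probs <- attr(pred, "probabilities")  # per-class scores, e.g. P(same)
```

randomForest would slot in the same way; the probability output is what you would rescale into a match percentage.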

I have a seminar that I presented about Speaker Recognition Systems; take a look at it, it may be helpful. (Seminar)

If you have time and interest, you could also read:
Kinnunen, T., & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors.

After you get a feature vector for each audio sample (with MFCC and/or other features), then you'll need to compare pairs of feature vectors (Features from A versus Features from B):
You could try to use the Absolute Difference between these feature vectors:

  • abs(feature vector from A - feature vector from B)

The result of the operation above is a feature vector in which every element is >= 0 and which has the same length as the A (or B) feature vector.

You could also test the element-wise multiplication between A and B features:

  • (A1*B1, A2*B2, ... , An*Bn)

Then you need to label each feature vector
(1 if person A == person B and 0 if person A != person B).

Usually the absolute difference performs better than the multiplication feature vector, but you can concatenate both vectors and test the classifier's performance using the absolute-difference and multiplication features at the same time.
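In base R, the two pairwise feature constructions and the label could look like this (toy vectors; in practice they would be MFCC summaries for the two recordings):

```r
# toy per-recording feature vectors (in practice, e.g. MFCC summary statistics)
fa <- c(1.0, -2.0, 0.5)
fb <- c(0.5, -1.5, 1.5)

diff_feats <- abs(fa - fb)  # element-wise |A - B|: c(0.5, 0.5, 1.0), all >= 0
prod_feats <- fa * fb       # element-wise A * B:   c(0.5, 3.0, 0.75)

combined <- c(diff_feats, prod_feats)  # append both sets for the classifier
label    <- 0L                         # 1 if same speaker, 0 otherwise
```

One row like `combined`, plus its label, would be built per pair of recordings to form the classifier's training set.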

Washedup answered 14/12, 2015 at 11:40 Comment(11)
Nice answer, but the author referenced a competition. I have no background in audio processing so I find this very interesting. But, wouldn't you want to look at some function like a norm of the two vectors, with one vector being what you are trying to mimic and the other being a competitor, instead of attempting classification?Gipon
I've never worked with mimicry, but I have worked with Speaker Recognition Systems (SRS). To solve this mimicry problem I would use the same approach that's used in SRS. In SRS we develop a classifier that must be robust against spoofing (and also mimicry). One way to deal with spoofing is to use the classifier as a predictor (get the probability response of the classifier). Instead of looking at a norm, I would take a look at this probability between the competitor and what you are trying to mimic. The best mimics would have a greater score in the classifier.Washedup
I would choose the approach above because I think the mimicry problem is analogous to the spoofing problem in Speaker Recognition Systems. A norm usually isn't the best approach, but it is one of the easiest to implement. A better yet still simple approach would be to calculate feature vectors for A (the competitor) and B (the target being mimicked), and then calculate the cosine similarity (or the correlation) between both feature vectors.Washedup
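For the cosine-similarity route mentioned above, a minimal base-R sketch (the toy vectors stand in for the two recordings' MFCC summaries):

```r
# cosine similarity between two summary feature vectors
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

fa <- c(2.1, -0.5, 0.3)  # toy feature vector for the competitor
fb <- c(2.0, -0.4, 0.5)  # toy feature vector for the target
cosine_sim(fa, fb)       # close to 1 for similar vectors
```

Cosine similarity lies in [-1, 1]; how to rescale it into the percentage match the question asks for (e.g. 100 * (sim + 1) / 2) is a design choice.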
So do these vectors of statistics have fairly normal distributions (A and B), or is it largely dependent on the application? If they aren't normal I would be hesitant about using a correlation. Ultimately I think this is going to end up involving some subjective weighting through imposing an ordering on an un-ordered set. The alternative is through integration under some distribution, and I'm not sure how possible it is to do in this situation without some fairly large distribution assumptions, or angry contestants :)Gipon
@JonathanLisic Actually I don't know whether it would assume a Gaussian distribution, because I've never used a pairwise distance to measure how similar a mimicry/spoofing sample is to an original audio, and I've never read any previous work with this kind of analysis. I agree with your comments... Fairly large distribution assumptions may indeed be needed if the author of the question wants an initial/quicker/simpler solution to his problem. This simpler solution would require less data than what is needed to develop a good machine learning classifier.Washedup
@JonathanLisic I said correlation because GMMs are usually used to model a projection space for speaker recognition applications. So I guess there's a chance that assuming a Gaussian distribution wouldn't be a bad assumption. GMMs work very well for these systems. (The most cited paper in this area is about the use of GMMs for speaker modeling - paper's resume)Washedup
I have partially understood your answer; could I get some clarity about the coefficients you mentioned in the solution? Moreover, what modifications/changes (in general) need to be done to an audio/speech file before actually applying a comparison algorithm? Please guide me if there are some.Turnkey
@Turnkey My point is that you don't need to compare the audio samples themselves. Instead, you should compare some features of the two audio samples. First extract the MFCCs for both audio samples, and then search for a metric of similarity between the two feature vectors. Don't compare the audio time series; compare feature vectors or time series of feature vectors.Washedup
@Turnkey Answering your question (what are the modifications/changes (general) that need to be done to an audio/speech file before actually applying an algorithm for comparing): you receive an audio file, calculate the MFCCs for it, and then you can compare these MFCCs. How to compare the MFCCs? That's not a trivial thing to do. Easiest/simplest path: compare some statistical metrics of both MFCC time series. Best-performance path: use GMMs (Gaussian Mixture Models) as a projection space and apply an SVM (Support Vector Machine).Washedup
Thanks for your suggestion; let me try this way and get back to you, @Renan.V NovasTurnkey
@Renan.V Novas Hi, the MFCCs are a bit confusing for me; it's hard to understand all the given details. Can you please provide some details on how to approach this work using R packages, or some low-level documents? I hope you understand my situation. Thank you.Turnkey
