CMU Sphinx for Voice/Speaker Recognition

I'm looking for a way to match against a known data set, let's say a list of MP3 or WAV files, each of which is a sample of someone speaking. At this point I know file ABC is of Person X speaking.

I would then like to take another sample and do some voice matching to show whose voice it most likely is, given the known data set.

Also, I don't necessarily care what the person has said, as long as I can find a match, i.e. I don't need any transcription.

I'm aware CMU Sphinx doesn't do voice recognition and is primarily used for voice-to-text, but I have seen other systems, e.g. the LIUM Speaker Diarization (http://cmusphinx.sourceforge.net/wiki/speakerdiarization) or the VoiceID project (https://code.google.com/p/voiceid/), which use CMU Sphinx as a base for this type of work.

If I am to use CMU Sphinx, how can I do voice matching?

Also, if CMU Sphinx isn't the best framework, is there an open-source alternative?

Gastrin answered 10/1, 2013 at 0:37 Comment(1)
Any follow-up? What have you done? Did you succeed? - Cessation

This subject is complex enough for a PhD thesis. There are no good and reliable systems as of right now.

The task you're up for is a very complex one. How you should approach it depends on your situation.

  • Do you have a limited number of people? How many?
  • How much data do you have for each person?

If you have very few people to recognize, you may attempt something as simple as obtaining formants of those people and comparing them to a sample.
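A minimal sketch of that comparison step, assuming you have already extracted average formant frequencies (F1, F2, F3 in Hz) for each recording with some other tool. The extraction itself is not shown, all numbers and names below are made up for illustration, and plain Euclidean distance to a per-person average is just one simple choice, not an established method:

```python
import math

def centroid(vectors):
    """Average a list of equal-length feature vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_profiles(known):
    """Map each speaker name to the average of that speaker's vectors."""
    return {name: centroid(vectors) for name, vectors in known.items()}

def closest_speaker(profiles, sample):
    """Return the speaker whose profile is nearest to the sample."""
    return min(profiles, key=lambda name: distance(profiles[name], sample))

# Hypothetical per-recording formant averages (F1, F2, F3) in Hz.
known = {
    "person_x": [[520.0, 1480.0, 2500.0], [540.0, 1510.0, 2460.0]],
    "person_y": [[640.0, 1750.0, 2800.0], [660.0, 1720.0, 2840.0]],
}

profiles = build_profiles(known)
unknown = [630.0, 1700.0, 2790.0]  # hypothetical new sample
best = closest_speaker(profiles, unknown)
print(best)
```

Formants shift with the vowel being spoken, so averages like these are a crude fingerprint at best. Expect this to be usable only with a handful of clearly different voices.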

Otherwise, you will have to contact academics who work on the subject or jury-rig a solution of your own. Either way, as I said, it is a difficult problem.

Cessation answered 11/2, 2013 at 9:3 Comment(2)
I'm curious about your statement that there are no good and reliable systems. This paper mentions four diarization frameworks, and the LIUM tool (from 2009) mentioned by the OP seems fairly well used, e.g. by the Sphinx community. Do these existing approaches have specific limitations? - Earthen
I should've written "I don't know any". Still, have you seen these results? They're not that great. Using voice as a biometric feature is still very unreliable. - Cessation
