Writing speech-recognition engine [closed]

Asked 20/11, 2011 at 15:52 Answered 21/11, 2011 at 17:34

So, like many others, I decided to create my own speech-recognition engine. As it turned out, it's not easy at all. It's particularly difficult for English, because there is, I'd say, a dramatic difference between the way a word is written and the way it's pronounced. Being from Georgia, I decided to write a speech-recognition engine for the Georgian language. In Georgian, you pronounce words EXACTLY the way you write them. It's just like a transcription. Will this fact significantly ease my task? Or are there even more difficult... difficulties :D ?

Gerry answered 20/11, 2011 at 15:52 Comment(3)

Btw, a friend of mine recently created Georgian ASR. If you are interested, let me know. – Bluebeard 15/4, 2014 at 13:33

Nika, did you create the software? please share what you have done, we are interested too if such software exists. – Deleterious 7/11, 2016 at 19:13

I think the easiest way to do that is to use AI: a multilayer perceptron or something like that (I mean a neural network), and train it... I think with this solution you can easily solve the problem that Yahia mentioned in his answer, GL ;) – Howlet 14/11, 2016 at 12:20

Speech recognition is a complex domain with many specific algorithms, tools and methods. To create your own engine you could start with the CMUSphinx open-source speech recognition toolkit, which will allow you to:

Collect and process the data required to support the Georgian language
Create the models for Georgian
Implement a speech recognition engine for Georgian
Use the engine to create a speech recognition application running on the desktop, on a server or on the iPhone (through OpenEars)

CMUSphinx already supports English, German, Spanish, French, Dutch, Russian, Mandarin, Icelandic, Italian and many other languages. It's very simple to add a new one. For newcomers it usually takes a month or two of concentrated work to go through the required process.

To get started visit the homepage:

http://cmusphinx.sourceforge.net

and read the tutorial

http://cmusphinx.sourceforge.net/wiki/tutorial

If you have any questions, please ask them on the forums or here!

And it's a very common misconception that when you speak Georgian you simply pronounce the letters as written. That's not true for most of the languages in the world. To test the hypothesis, record some audio, open it in an audio editor, and check which sounds are actually pronounced. You'll be surprised. The tutorial above covers this question in detail.
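That said, a letter-by-letter mapping can still serve as a first draft of the pronunciation dictionary, which you then correct wherever the recordings disagree with the spelling. Here is a minimal sketch in Python, assuming one phone per Georgian letter. The phone labels are hypothetical names I made up for illustration (the `X` suffix marks ejectives), not an official CMUSphinx phone set, and the function names are mine too. The output uses the plain-text dictionary layout CMUSphinx reads: the word, then its phones separated by spaces.

```python
# Draft pronunciation dictionary for a language with near-phonemic spelling.
# ASSUMPTION: one phone per letter. Real speech deviates from this, so treat
# the output as a starting point to be corrected by hand.
# ASSUMPTION: the phone labels below are arbitrary ASCII names chosen for
# illustration, not an official phone set.
GEORGIAN_TO_PHONE = {
    "ა": "A", "ბ": "B", "გ": "G", "დ": "D", "ე": "E", "ვ": "V", "ზ": "Z",
    "თ": "T", "ი": "I", "კ": "KX", "ლ": "L", "მ": "M", "ნ": "N", "ო": "O",
    "პ": "PX", "ჟ": "ZH", "რ": "R", "ს": "S", "ტ": "TX", "უ": "U", "ფ": "P",
    "ქ": "K", "ღ": "GH", "ყ": "QX", "შ": "SH", "ჩ": "CH", "ც": "TS", "ძ": "DZ",
    "წ": "TSX", "ჭ": "CHX", "ხ": "KH", "ჯ": "J", "ჰ": "H",
}


def word_to_phones(word):
    """Map each letter of a word to its (hypothetical) phone label."""
    phones = []
    for letter in word:
        if letter not in GEORGIAN_TO_PHONE:
            # Fail loudly on digits, Latin letters, punctuation, etc.
            raise ValueError(f"no phone defined for {letter!r} in {word!r}")
        phones.append(GEORGIAN_TO_PHONE[letter])
    return phones


def build_dictionary(words):
    """Return dictionary text: one 'word PHONE PHONE ...' entry per line."""
    lines = []
    for word in sorted(set(words)):  # drop duplicates, keep a stable order
        lines.append(word + " " + " ".join(word_to_phones(word)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dictionary(["გამარჯობა", "დიახ", "არა"]), end="")
```

Raising an error on unknown characters is deliberate: words that contain digits or foreign letters need a hand-written entry rather than a silent guess.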

Bluebeard answered 21/11, 2011 at 17:34 Comment(2)

So you mean that I can add an absolutely unexplored language, such as Georgian, and "make it work" in a couple of months?!?! – Gerry 23/11, 2011 at 6:40

Yes, why not. Actually, CMUSphinx has made a lot of progress in supporting low-resource languages. – Bluebeard 23/11, 2011 at 16:55

Do all people from Georgia sound absolutely the same? I think not... Lots of major problems in speech recognition are not directly related to the language itself:

different people (women, men, children, elders etc.) have different voices
sometimes the same person sounds different for example when the person has a cold
different background noises
everyday speech sometimes contains words from other languages (like the German word Kindergarten in US English)
some people are not native speakers and learned the language later (they usually sound different)
some persons speak faster, others speak slower
quality of the microphone
etc.

Solving these things is always pretty hard... On top of that you have the language/pronunciation to take care of... I don't know Georgian, and what you describe might make the task a bit easier, but it will still be a hard task.

EDIT - as per comments:

Using good libraries might lower the time-frame and even help in quality... but not every library is good for speech recognition despite perhaps being brilliant on some other audio-related matters...

For reference see the Wikipedia article http://en.wikipedia.org/wiki/Speech_recognition - it has a good overview including some links and book references which are a good starting point...

As for how to design such an API see for example http://java.sun.com/products/java-media/speech/forDevelopers/jsapi-guide/Recognition.html

Bagpipes answered 20/11, 2011 at 15:59 Comment(11)

about the way different people sound: actually, even though they don't sound absolutely the same, there is a great similarity because speaking Georgian is like reading a transcription; and there, you don't have much of a choice, I think. – Gerry 20/11, 2011 at 16:3

@NikaGamkrelidze I suspect if you hear the same word from 2 different persons you can distinguish between the persons (like your mother versus your father versus some friend etc.) ? – Bagpipes 20/11, 2011 at 16:6

of course :DDD I see. it's still difficult :SS but what do you think, is it possible for a complete noob in this sphere (although not a bad programmer who knows lots of math and deals with audio editing) to write a decent speech-recognition engine in, let's say, a year? – Gerry 20/11, 2011 at 16:11

@NikaGamkrelidze it depends on the goal... do you want to recognize all Georgian words or just 100-300... ? – Bagpipes 20/11, 2011 at 16:13

@NikaGamkrelidze writing that from scratch for one developer alone (even a very good one) will be hard and from my POV take 3-5 years for decent quality for the first language, a second one will take less because you already have several base aspects solved... after one year you could have something rough and working for a small subset of the language, the capability to deal with background noises etc. will be rather limited. – Bagpipes 20/11, 2011 at 16:17

@NikaGamkrelidze no, that is general... since most major problems are not language specific... and the range of 3-5 years means for an easier language (Georgian?) perhaps 2.5-3 years and for a harder language (like Japanese/Chinese...) perhaps even 7 years... – Bagpipes 20/11, 2011 at 16:20

@NikaGamkrelidze good luck, you can always come back with more specific questions... another point: using good libraries for parts of the solution can prove helpful (quality and time-frame). – Bagpipes 20/11, 2011 at 16:22

btw, by "from scratch", of course I don't mean writing all those Fourier transform algorithms and cleaning functions, and other standard, open-source stuff that every language needs :D :D now, is it easier? :D – Gerry 20/11, 2011 at 16:26

@NikaGamkrelidze you will have to evaluate the library because not every implementation is good for every usage, i.e. some might be bad for speech recognition but brilliant for some other audio-related stuff... perhaps you can get it down to 2 years if the language is simple and you use some good libraries... – Bagpipes 20/11, 2011 at 16:28

can you give me some advice about which book I should start with? – Gerry 20/11, 2011 at 16:31

@NikaGamkrelidze for a general overview see en.wikipedia.org/wiki/Speech_recognition - it also has references to rather good books on the subject... also read java.sun.com/products/java-media/speech/forDevelopers/… to see how APIs for such things are designed... – Bagpipes 20/11, 2011 at 16:37
