How do I build a large-vocabulary language model for CMU Sphinx?

Asked 24/1, 2011 at 14:49 Answered 5/10, 2011 at 2:1

I would like to build a language model for CMU Sphinx, but my corpus has more than 1000 words, so I cannot use the online tool. How do I build my language model instead, presumably with the scripts in cmuclmtk?

Volteface answered 24/1, 2011 at 14:49 Comment(0)

Please read the tutorial

http://cmusphinx.sourceforge.net/wiki/tutoriallm

Garner answered 24/1, 2011 at 19:20 Comment(14)

That document was very helpful with the exception of 'Generating a dictionary'. Does the distribution come with a script to generate that dictionary? – Volteface 24/1, 2011 at 19:25

You can use the pronounce tool, which you can check out from Subversion: cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/logios/… External g2p packages such as code.google.com/p/phonetisaurus or sequitur-g2p can also be used. – Garner 24/1, 2011 at 21:28

It appears pocketsphinx has a dictionary in the en_US directory, right next to the models. I'm going to try using that one. – Volteface 26/1, 2011 at 21:40

Hi Nikolay, I currently have a large text file containing around 11k words. Can you please tell me the exact command to generate the .lm and .dic/.dmp files from that text file? Thanks in advance. – General 8/9, 2011 at 7:51

Hello ravoorinandan. You can find exact commands (text2wfreq, text2idngram, idngram2lm) in the tutorial above. – Garner 14/9, 2011 at 9:42
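As a rough sketch of how the three commands named above chain together, the Python helper below only assembles the shell command lines as they appear in the CMUCLMTK tutorial; it does not run them. The function name `cmuclmtk_commands` and the file names `corpus.txt` and `model` are hypothetical placeholders, and the flags should be checked against the tutorial for your installed version.

```python
import shlex


def cmuclmtk_commands(corpus, stem):
    """Return the shell commands that turn a text corpus into an ARPA model.

    `corpus` and `stem` are placeholder names chosen for this example.
    """
    q = shlex.quote
    vocab = f"{stem}.vocab"
    idngram = f"{stem}.idngram"
    arpa = f"{stem}.arpa"
    return [
        # 1. Count word frequencies and derive the vocabulary file.
        f"text2wfreq < {q(corpus)} | wfreq2vocab > {q(vocab)}",
        # 2. Convert the corpus into an id n-gram file using that vocabulary.
        f"text2idngram -vocab {q(vocab)} -idngram {q(idngram)} < {q(corpus)}",
        # 3. Estimate the language model and write it in ARPA format.
        f"idngram2lm -vocab_type 0 -idngram {q(idngram)} -vocab {q(vocab)} -arpa {q(arpa)}",
    ]


for command in cmuclmtk_commands("corpus.txt", "model"):
    print(command)
```

Each printed line can be pasted into a shell, in order, once the CMUCLMTK binaries are on the PATH.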

Thanks a lot for your reply, Nikolay. Using the available documentation, I have created .binlm and .arpa files from the corpus text file, but I don't know how to use them in my application. I mean, what key do we need to provide when giving ARPA format as input, apart from .lm or .DMP? – General 17/9, 2011 at 9:34

And can you please let me know how to create a dictionary? Thanks a lot for your help. – General 17/9, 2011 at 9:35

I checked out the link you provided above (cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/logios/), but I am not able to create a file with the .dic extension from the Corpus.txt file. – General 19/9, 2011 at 10:55

Logios is not the only tool; you can find references to others in the tutorial. When you run into trouble, it's always best to provide more details. Since you don't describe where it fails, it's hard to suggest anything. – Garner 20/9, 2011 at 0:44

Yes, thanks a lot for your fast response, Nikolay. I will gather the error details and let you know. Meanwhile, I have one more question: can we convert a .DMP file into .lm? Right now I am using sphinx_lm_convert to convert ARPA into DMP. – General 20/9, 2011 at 6:35

I am also creating a dictionary file (.dict format) by uploading the corpus text file to speech.cs.cmu.edu/tools/lextool.html, which gives me .dict and .word files. Can I use them in my application? – General 20/9, 2011 at 6:38

Hello. You can convert from DMP to ARPA with sphinx_lm_convert too. The Lextool web service is essentially Logios installed as a web service. You can check it out and install it on your own machine. You can try other packages too. – Garner 20/9, 2011 at 19:33
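A minimal sketch of the DMP-to-ARPA conversion mentioned above, written as a Python helper that builds the argument list for `sphinx_lm_convert`. The helper name `lm_convert_args` and the file names are hypothetical; the `-i`, `-o`, `-ifmt`, and `-ofmt` options and the format names `dmp` and `arpa` are assumed to match your sphinxbase version, so confirm them against the tool's own help output.

```python
def lm_convert_args(src, dst, src_fmt=None, dst_fmt=None):
    """Build the argv list for a sphinx_lm_convert call (nothing is executed)."""
    args = ["sphinx_lm_convert", "-i", src, "-o", dst]
    if src_fmt:
        # Explicit input format, for when it cannot be guessed from the file.
        args += ["-ifmt", src_fmt]
    if dst_fmt:
        # Explicit output format.
        args += ["-ofmt", dst_fmt]
    return args


# Placeholder file names: convert a binary DMP model back to ARPA text.
print(" ".join(lm_convert_args("model.lm.DMP", "model.arpa", "dmp", "arpa")))
```

The returned list can be passed to `subprocess.run(args, check=True)` once the tool is installed.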

Yes Nikolay, I am aware of that, but the problem is: does this DMP file work with the Sphinx-II decoder, I mean in vocalkit? Nothing happens when I use the .dic and .DMP files in my voice search (it returns a null value). – General 21/9, 2011 at 11:54

DMP should work with vocalkit. If you have specific issues, you could debug them. – Garner 21/9, 2011 at 18:45

This is not a trivial task: generating a language model is time- and resource-intensive.

If you want a "good" language model, you will need a large or very large text corpus to train it (think on the order of several years of Wall Street Journal text).

"Good" means that the language model is able to generalize from the training data to new, previously unseen input.

You should look at the documentation of the Sphinx and HTK language model toolkits:

http://cmusphinx.sourceforge.net/wiki/tutoriallm

Also check these two threads:

Building openears compatible language model

Ruby Text Analysis

You could take a more general language model, based on a bigger corpus, and interpolate your smaller language model with it, e.g. as a back-off language model, but that's not a trivial task.

see: Katz's back-off model
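To make the interpolation idea concrete, here is a toy sketch of plain linear interpolation between a small domain model and a larger general model, restricted to unigram probabilities. Note that this is simple linear mixing, not Katz back-off itself; the function name `interpolate`, the weight `lam`, and all probabilities are made up for illustration, and real toolkits do this over full n-gram models.

```python
def interpolate(p_domain, p_general, lam):
    """Mix two unigram distributions: lam * domain + (1 - lam) * general."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    words = set(p_domain) | set(p_general)
    # A word missing from one model contributes probability 0 from that model.
    return {
        w: lam * p_domain.get(w, 0.0) + (1.0 - lam) * p_general.get(w, 0.0)
        for w in words
    }


# Toy distributions (invented numbers): a small in-domain model and a general one.
domain = {"sphinx": 0.5, "model": 0.5}
general = {"model": 0.25, "the": 0.75}

mixed = interpolate(domain, general, lam=0.5)
for word in sorted(mixed):
    print(word, mixed[word])
```

Because both inputs sum to 1, the mixture does too, and words seen only in the general model still receive non-zero probability. In practice the weight would be tuned on held-out in-domain text.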

Tugman answered 5/10, 2011 at 2:1 Comment(0)
