Creating ARPA language model file with 50,000 words

I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is there any other link available where I can get a language model for this many words?

Foucault answered 21/4, 2011 at 11:24 Comment(1)
Do you mean to say that you need a collection of English words? – Neutron

I thought I'd answer this one since it has a few votes, although based on Christina's other questions I don't think this will be a usable answer for her: a 50,000-word language model almost certainly won't have an acceptable word error rate or recognition speed (or most likely even run for long) with the in-app recognition systems for iOS that currently use this format of language model, due to hardware constraints. I figured it was worth documenting anyway, because it may be helpful to others using a platform where keeping a vocabulary of this size in memory is more viable, and it may become possible on future device models as well.

There is no web-based tool I'm aware of like the Sphinx Knowledge Base Tool that will munge a 50,000-word plaintext corpus and return an ARPA language model. But, you can obtain an already-complete 64,000-word DMP language model (which can be used with Sphinx at the command line or in other platform implementations in the same way as an ARPA .lm file) with the following steps:

  1. Download this language model from the CMU speech site:

http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20HUB4%20Language%20Model/HUB4_trigram_lm.zip

In that folder is a file called language_model.arpaformat.DMP, which will be your language model.

  2. Download this file from the CMU speech site, which will become your pronunciation dictionary:

https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

Convert the contents of cmu07a.dic to all uppercase letters (a script sketch covering this and the optional trimming step follows the next paragraph).

If you want, you could also trim down the pronunciation dictionary by removing any words from it that aren't found in the language model's vocabulary file, language_model.vocabulary (this is basically a regex/scripting task). These files are intended for use with one of the Sphinx English-language acoustic models.
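
If it helps, here is a minimal Python sketch of those two cleanup steps (uppercasing the dictionary entries and dropping any that aren't in the language model's vocabulary). The output file name is made up for illustration, and the assumption that language_model.vocabulary lists one word per line should be checked against the actual download:

    import re

    # Read the language model vocabulary (assumed one word per line;
    # lines starting with "#" are treated as comments and skipped).
    with open("language_model.vocabulary") as f:
        vocab = {line.strip().upper() for line in f
                 if line.strip() and not line.startswith("#")}

    kept = []
    with open("cmu07a.dic") as f:
        for line in f:
            parts = line.split(None, 1)
            if len(parts) < 2:
                continue
            word, phones = parts
            # Alternate pronunciations look like WORD(2); strip the suffix
            # before checking membership in the vocabulary.
            base = re.sub(r"\(\d+\)$", "", word).upper()
            if base in vocab:
                # Uppercase the head word so it matches the language model.
                kept.append(word.upper() + " " + phones.strip())

    # Hypothetical output name; point your app at whatever you call it.
    with open("cmu07a_upper_trimmed.dic", "w") as out:
        out.write("\n".join(kept) + "\n")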

If the desire to use a 50,000-word English language model is driven by the idea of doing some kind of generalized large-vocabulary speech recognition, and not by the need to use a very specific 50,000 words (for instance, something specialized like a medical dictionary or a 50,000-entry contact list), this approach should give those results if the hardware can handle it. There are probably some Sphinx or Pocketsphinx settings that will need to be changed to optimize searches through a model of this size.
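
As a rough illustration only of the kind of tuning involved (this isn't part of the steps above), with the pocketsphinx Python bindings the search-pruning parameters can be tightened on the decoder configuration before decoding. The file paths and numeric values below are placeholders to experiment with, not recommendations:

    from pocketsphinx.pocketsphinx import Decoder

    config = Decoder.default_config()
    # Paths are placeholders -- point these at an actual Sphinx English
    # acoustic model and the files downloaded in the steps above.
    config.set_string('-hmm', '/path/to/en-us-acoustic-model')
    config.set_string('-lm', 'language_model.arpaformat.DMP')
    config.set_string('-dict', 'cmu07a.dic')

    # Tighter beams and caps on active HMMs/words per frame trade some
    # accuracy for speed on a vocabulary this large; values are guesses.
    config.set_float('-beam', 1e-60)
    config.set_float('-wbeam', 1e-40)
    config.set_int('-maxhmmpf', 10000)
    config.set_int('-maxwpf', 20)

    decoder = Decoder(config)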

Lidia answered 15/6, 2011 at 11:10 Comment(6)
The new version of OpenEars, 0.91, has a built-in feature for creating language model files. That really solved my problem, and I hope everyone else will get help from this too. – Foucault
Hi Christina, happy to hear that OpenEars 0.91's dynamic language model generation is working well for you, but I'm amazed to hear that it works for generating a 50,000-word language model. Is that working on the device or just the Simulator? – Lidia
I'm just asking out of curiosity, since I had no idea it would be used or usable for such large models when I designed the LanguageModelGenerator class -- I was thinking on the order of 10-500 words for context-specific command-and-control language models. – Lidia
It's not like that; whenever we want to create a new language model for any words, we can create it dynamically. I haven't tested it with such a large number of words yet. – Foucault
Then how do you create a language model with a large number of words, approximately 12k? – Crake
Can I use the above two files as follows: NSString *lmPath = [[NSBundle mainBundle] pathForResource:@"OpenEarsLanguageFile" ofType:@"DMP"]; NSString *dicPath = [[NSBundle mainBundle] pathForResource:@"OpenEarsLanguageFile" ofType:@"dic"]; [self.pocketsphinxController startListeningWithLanguageModelAtPath:lmPath dictionaryAtPath:dicPath languageModelIsJSGF:NO]; If this isn't correct, can you let me know the correct way to use these two files? – Infraction

How big is your training corpus? If it's only 50,000 words, that would be tiny / too small.

In general, you could use either the toolkit from CMU or HTK.

Detailed documentation for the HTK Speech Recognition Toolkit is here: http://htk.eng.cam.ac.uk/ftp/software/htkbook_html.tar.gz

Here's also a description of CMU's SLM Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

See also: Building openears compatible language model

You could also take a more general language model, based on a bigger corpus, and interpolate your smaller language model with it (e.g., as a back-off language model), but that's not a trivial task; a toy sketch of the interpolation idea is shown after the link below.

see: http://en.wikipedia.org/wiki/Katz's_back-off_model
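
Not a substitute for a real back-off model, but for intuition, here is a toy Python sketch of linear interpolation between a small domain model and a larger general model at the probability level; the word probabilities, the lambda weight, and the flat floor for unseen words are all made-up placeholders:

    # Toy unigram probabilities standing in for two language models.
    small_lm = {"heart": 0.02, "rate": 0.015, "monitor": 0.01}
    big_lm = {"heart": 0.001, "rate": 0.002, "monitor": 0.0005, "the": 0.05}

    LAMBDA = 0.7  # weight on the small, domain-specific model (a guess)

    def interpolated_prob(word):
        # P(w) = lambda * P_small(w) + (1 - lambda) * P_big(w)
        # A real system would back off to lower-order n-grams instead of
        # using a flat floor for unseen words.
        p_small = small_lm.get(word, 1e-7)
        p_big = big_lm.get(word, 1e-7)
        return LAMBDA * p_small + (1 - LAMBDA) * p_big

    for w in ["heart", "the", "unknownword"]:
        print(w, interpolated_prob(w))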

Waylan answered 5/10, 2011 at 2:12 Comment(0)
