ARPA language model documentation
Where can I find documentation on ARPA language model format?

I am developing a simple speech recognition app with the PocketSphinx STT engine. ARPA models are recommended there for performance reasons. I want to understand how much I can adjust my language model for my custom needs.

All I have found are some very brief descriptions of the ARPA format.

I am a beginner to STT and I have trouble wrapping my head around this (n-grams, etc.). I am looking for more detailed docs, something like the documentation on JSGF grammar here:

http://www.w3.org/TR/jsgf/

Berkie answered 6/5, 2013 at 22:14 Comment(2)
I found this link useful: speech.sri.com/projects/srilm/manpages/ngram-format.5.html It describes the n-gram (aka ARPA, aka Doug Paul) format. – Estrous
Have a look at this MSDN link; the ARPA and args formats are well explained: Compile Grammar Input and Output File Format. – Wun
There is actually not much more to say about the format than what is said in those docs.

Besides, you'll probably want to prepare a text file with sample sentences and generate the language model from it. There is an online tool that can do this for you: lmtool

Lutist answered 7/5, 2013 at 6:27 Comment(5)
Still, it uses some kind of n-grams, backoff, etc. What are those, and where can I find more info about them? – Berkie
@Berkie What is an n-gram? A sequence of N words. Backoff is optional. And the probability is on a log10 scale, as far as I remember. – Lutist
Backoff is a way to estimate the probability of an n-gram unseen during training. It basically backs off to a lower-order n-gram when a higher-order n-gram is not in the LM, e.g. it backs off to a 2-gram if the encountered 3-gram is not present. The back-off weight ensures the joint probability is a true probability, i.e. sums to 1. – Penninite
@Lutist the link to lmtool is down. Is there any other tool to build an ARPA language model? – Ame
I googled and found this: npmjs.com/package/lmtool – Lutist
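The backoff lookup described in the comments above can be sketched in a few lines of Python. This is a toy illustration, not PocketSphinx's actual implementation; the `model` and `backoff` dictionaries and all numbers are made up for the example. Probabilities and back-off weights are in log10, so "multiplying" probabilities becomes adding logs:

```python
def logprob(model, backoff, ngram):
    """Katz-style backoff lookup.

    model   maps word tuples -> log10 probability of the n-gram
    backoff maps context tuples -> log10 back-off weight
    """
    if ngram in model:
        return model[ngram]            # n-gram seen in training: use it directly
    if len(ngram) == 1:
        return float("-inf")           # unknown word, no shorter history to try
    context, shorter = ngram[:-1], ngram[1:]
    # Unseen n-gram: pay the context's back-off weight (0.0 if absent)
    # and recurse on the (n-1)-gram with the shortened history.
    return backoff.get(context, 0.0) + logprob(model, backoff, shorter)

# Toy model: the bigram ("a", "b") is known, ("a", "c") is not,
# so the latter backs off to the unigram ("c",).
model = {("a",): -1.0, ("b",): -1.0, ("c",): -0.5, ("a", "b"): -0.3}
bo = {("a",): -0.2}

print(logprob(model, bo, ("a", "b")))  # -0.3 (bigram found directly)
print(logprob(model, bo, ("a", "c")))  # -0.7 (back-off weight -0.2 + unigram -0.5)
```

This is why every n-gram line in an ARPA file that can serve as a context carries a back-off weight: it is the penalty applied when the model falls back to a shorter history.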

You can complement those docs with this tech report, which gives a comprehensive overview of smoothing for language modeling: http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf You will also find definitions of backoff models and interpolated models there.

Pluri answered 13/11, 2013 at 10:44 Comment(0)

I'm probably very late to answer this, but I found the ARPA LM format well documented in this link from The HTK Book by Steve Young et al.

Each n-gram line of an ARPA file is a triple that stores:

the log probability (base 10) of the n-gram; the n-gram itself; and the back-off weight (also in log space).
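To make the layout concrete, here is a hand-written toy fragment in ARPA format (the words and probability values are invented for illustration; all numbers are log10, and the trailing column, where present, is the back-off weight):

```
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0 </s>
-0.5 <s>    -0.3
-0.7 hello  -0.2

\2-grams:
-0.3 <s> hello
-0.4 hello </s>

\end\
```

Note that the highest-order n-grams have no back-off weight, since there is no longer history to back off from.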
Penninite answered 18/12, 2019 at 7:4 Comment(0)
