Trying to use MEGAM as an NLTK ClassifierBasedPOSTagger?

Asked 17/12, 2010 at 2:29 Answered 24/6, 2014 at 14:57

I am currently trying to build a general purpose (or as general as is practical) POS tagger with NLTK. I have dabbled with the brown and treebank corpora for training, but will probably be settling on the treebank corpus.

Learning as I go, I am finding the classifier POS taggers are the most accurate. The Maximum Entity classifier is meant to be the most accurate, but I find it uses so much memory (and processing time) that I have to significantly reduce the training dataset, so the end result is less accurate than using the default Naive Bayes classifier.

It has been suggested that I use MEGAM. NLTK has some support for MEGAM, but all the examples I have found are for general classifiers (eg. a text classifier that uses a vector of word features), rather than a more specific POS tagger. Without having to recreate my own POS feature extractor and compiler (ie. I prefer to use the one already in NLTK), how can I used the MEGAM MaxEnt classifier? Ie. how can I drop it in some existing MaxEnt code that is along the lines of:

maxent_tagger = ClassifierBasedPOSTagger(train=training_sentences,
                                        classifier_builder=MaxentClassifier.train )

Seaborg answered 17/12, 2010 at 2:29 Comment(2)

Have you read: streamhacker.com/2008/11/03/…? It's a pretty good look at POS taggers in general, if you look at all four articles. – Context 17/12, 2010 at 2:37

Yes, I have his book. The blog has some interesting efficiency comparisons, and I might yet add a Brill tagger on the end of the classifier (as suggested); but the posts don't appear to mention MEGAM? Perhaps I need to look at the NLTK MaxEnt code and reverse engineer or duplicate it to use MEGAM. – Seaborg 17/12, 2010 at 2:49

This one liner should work for training a MEGAM MaxentClassifier for the ClassifierBasedPOSTagger. Of course, that assumes MEGAM is already installed (go here to download)

maxent_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, algorithm='megam', max_iter=10, min_lldelta=0.1))

Jaconet answered 17/12, 2010 at 17:14 Comment(4)

Also checkout train_tagger.py in github.com/japerk/nltk-trainer. Sometime soon I'll write an intro article, but hopefully the help messages are enough to get you started. – Jaconet 17/12, 2010 at 17:21

Thanks Jacob - that looks to be working! (looks like I need to remind myself about Python Lambda functions). Yes I had Megam installed, my problem was getting the (generic) classifier hooked into the POS tagging classifier. So far my tests are giving accuracies that are very similar to the Naive Bayes (within about 1%). It does take longer to build the classifier, but not as long as the default MaxEnt algorithm. I've printed train_tagger.py out - I'll look over it during a coffee break :-) – Seaborg 17/12, 2010 at 19:52

You could try increasing max_iter or decreasing min_lldelta to achieve higher accuracy. Those are just the numbers I usually use since accuracy plateaus quite quickly. – Jaconet 17/12, 2010 at 20:12

Hal moved to Maryland. The new link to download is: umiacs.umd.edu/~hal/megam – Auriga 25/1, 2013 at 11:33

For the future users:

Megam is now available on MAC:

$brew tap homebrew/science
$brew install megam

If you dont have XQuartz, it might ask you to get that first. Here is the direct download link: http://xquartz.macosforge.org/downloads/SL/XQuartz-2.7.5_rc4.dmg

Shamrao answered 24/6, 2014 at 14:57 Comment(0)

Recommended topics

Hot tags