Ngram taggers traditionally work with prior context only, and for good reason: a bigram tagger makes its decision by looking at the tag of the preceding word plus the current word itself. So unless you tag in two passes, the tag of the next word is not yet known. But you are interested in word ngrams, not tag ngrams, so nothing stops you from training an ngram tagger whose ngrams consist of words from both sides. And you can indeed do it easily with the NLTK.
The NLTK's ngram taggers all build their ngrams out of tags, looking only to the left; but you can easily derive your own tagger from their abstract base class, ContextTagger:
import nltk
from nltk.tag import ContextTagger

class TwoSidedTagger(ContextTagger):
    left = 2    # words of left context
    right = 1   # words of right context

    def context(self, tokens, index, history):
        left = self.left
        right = self.right
        tokens = tuple(t.lower() for t in tokens)
        # Pad with sentinels so the window never runs off either edge
        if index < left:
            tokens = ("<start>",) * left + tokens
            index += left
        if index + right >= len(tokens):
            tokens = tokens + ("<end>",) * right
        return tokens[index-left:index+right+1]
This defines a tetragram tagger (2+1+1) in which the current word is third in the ngram, not last as usual. You can then initialize and train the tagger just like the regular ngram taggers (see chapter 5 of the NLTK book, especially sections 5.4ff). Let's first see how you'd build a part-of-speech tagger, using a portion of the Brown corpus as training data:
data = list(nltk.corpus.brown.tagged_sents(categories="news"))
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('NN'))
twosidedtagger._train(train_sents)  # ContextTagger's (internal) training method
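To sanity-check what the tagger actually conditions on, you can call context() directly (a quick check; the sentence is made up, and the output follows from the code above):

>>> twosidedtagger.context("There were dogs everywhere .".split(), 2, None)
('there', 'were', 'dogs', 'everywhere')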
Like all ngram taggers in the NLTK, this one will delegate to the backoff tagger if it is asked to tag an ngram it did not see during training.
For simplicity I used a plain "default tagger" as the backoff, but you'll probably need something more powerful (see the NLTK chapter again).
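For instance, a somewhat stronger setup would back off to a unigram tagger first, and only then to the default tag; this is just a sketch, tune it to your data:

backoff = nltk.DefaultTagger('NN')
backoff = nltk.UnigramTagger(train_sents, backoff=backoff)  # per-word most frequent tag
twosidedtagger = TwoSidedTagger({}, backoff=backoff)
twosidedtagger._train(train_sents)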
You can then use your tagger to tag new text, or evaluate it against an already tagged test set (in recent NLTK versions, evaluate() has been renamed accuracy()):
>>> print(twosidedtagger.tag("There were dogs everywhere .".split()))
>>> print(twosidedtagger.evaluate(test_sents))
Predicting words:
The above tagger assigns a POS tag by considering nearby words; but your goal is to predict the word itself, so you need different training data and a different default tagger. The NLTK API expects training data in the form (word, LABEL), where LABEL is the value you want to generate. In your case, LABEL is just the current word itself; so make your training data as follows:
data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]  # materialize: zip() is a one-shot iterator in Python 3
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('the'))  # back off to the most common word
twosidedtagger._train(train_sents)
It makes no sense for the target word to appear in the "context" ngram, so you should also modify the method context() so that the returned ngram does not include it:
def context(self, tokens, index, history):
    # ... same lowercasing and padding as above ...
    return tokens[index-left:index] + tokens[index+1:index+right+1]
This tagger uses trigrams consisting of two words from the left and one from the right of the current word.
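You can check by hand what the modified method returns; on a fresh instance, with the same made-up sentence as before, the current word ("dogs") drops out of the window:

>>> TwoSidedTagger({}).context("There were dogs everywhere .".split(), 2, None)
('there', 'were', 'everywhere')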
With these modifications, you'll build a tagger that outputs the most likely word at any position. Try it and see how you like it.
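Keep in mind that the table of stored contexts must match the current definition of context(), so after redefining the class you need to rebuild and retrain the tagger (same steps as above):

twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('the'))
twosidedtagger._train(train_sents)
print(twosidedtagger.tag("There were dogs everywhere .".split()))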
Prediction:
My expectation is that you'll need a humongous amount of training data before you can get decent performance. The problem is that ngram taggers can only suggest a tag for contexts they saw during training.
To build a tagger that generalizes, consider using the NLTK to train a "sequential classifier" such as the ClassifierBasedTagger. You can use whatever features you want, including the words before and after the current one; of course, how well it will work is your problem. The NLTK classifier API is similar to that of the ContextTagger, but the context function (aka feature function) returns a dictionary, not a tuple. Again, see the NLTK book and the source code.
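Here's a minimal sketch of what that could look like, assuming ClassifierBasedTagger's feature_detector(tokens, index, history) signature; the feature names and the "<pad>" marker are made up for illustration:

from nltk.tag.sequential import ClassifierBasedTagger

def two_sided_features(tokens, index, history):
    # One feature per neighboring word; the current word itself is excluded
    features = {}
    for offset in (-2, -1, 1):
        pos = index + offset
        if 0 <= pos < len(tokens):
            features["word%+d" % offset] = tokens[pos].lower()
        else:
            features["word%+d" % offset] = "<pad>"  # off the edge of the sentence
    return features

classifier_tagger = ClassifierBasedTagger(
    feature_detector=two_sided_features, train=train_sents)

By default this trains a Naive Bayes classifier; unlike the ngram tagger, it can produce a guess even for word combinations it never saw together during training.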