Probability tree for sentences in nltk employing both lookahead and lookback dependencies

Does NLTK or any other NLP tool allow constructing probability trees based on input sentences, i.e. storing the language model of the input text in a dictionary tree? The following example gives the rough idea, but I need the same functionality such that a word Wt is not modelled probabilistically only on past input words (history) Wt-n but also on lookahead words such as Wt+m. The lookback and lookahead word counts should also be 2 or more, i.e. bigrams or larger. Are there any other libraries in Python which achieve this?

from collections import defaultdict
import nltk
import math

ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]  # a list, so it can be sliced below
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}

The solution requires both lookahead and lookback, and a specially subclassed dictionary may help in solving this problem. You could also point me to relevant resources which talk about implementing such a system. nltk.models seemed to do something similar but is no longer available. Are there any existing design patterns in NLP which implement this idea? Skip-gram based models are similar to this idea too, but I feel this must already have been implemented somewhere.
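
For concreteness, the kind of structure I have in mind is roughly the following (only a sketch; the window sizes and the tuple keys are illustrative):

from collections import defaultdict
import nltk

# Rough sketch: count a word conditioned on BOTH its left and right context.
LEFT, RIGHT = 2, 2
model = defaultdict(lambda: defaultdict(int))

corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    padded = ["<s>"] * LEFT + tokens + ["</s>"] * RIGHT
    for i in range(LEFT, len(padded) - RIGHT):
        left_ctx = tuple(padded[i - LEFT:i])            # two words of history
        right_ctx = tuple(padded[i + 1:i + 1 + RIGHT])  # two lookahead words
        model[(left_ctx, right_ctx)][padded[i]] += 1    # raw counts, to be normalised as above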

Represent answered 13/8, 2015 at 11:5 Comment(4)
You can iterate like this to get the desired slice: for t, token in enumerate(tokens): do_something(tokens[t-n:t+m]) – Stolid
@Stolid Yes, I can do that, but then how do I model it to build a tree? – Represent
Can you clarify what you mean by a probability tree? – Drinkwater
Like the language model probabilities in the example above: given a set of consecutive words and their left and right context, what is the probability that it is a correct sequence of words? @Drinkwater – Represent

If I understand your question correctly, you are looking for a way to predict the probability of a word given its surrounding context (not just the backward context but also the forward context). One quick hack for your purpose is to train two different language models, one from right to left and the other from left to right; the probability of a word given its context is then the normalized sum of the forward and backward scores.

Extending your code:

from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
import numpy as np


ngram = defaultdict(lambda: defaultdict(int))
ngram_rev = defaultdict(lambda: defaultdict(int)) #reversed n-grams
corpus = "The cat is cute. He jumps and he is happy."

for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]  # a list, so it can be zipped twice
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1        # forward model: word -> following word
    for token, rev_token in zip(tokens[1:], tokens):
        ngram_rev[token][rev_token] += 1     # backward model: word -> preceding word
for token in ngram:
    total = np.log(sum(ngram[token].values()))
    ngram[token] = {nxt: np.log(v) - total
                    for nxt, v in ngram[token].items()}
for token in ngram_rev:
    total_rev = np.log(sum(ngram_rev[token].values()))
    ngram_rev[token] = {prv: np.log(v) - total_rev
                        for prv, v in ngram_rev[token].items()}

Now the context is in both ngram and ngram_rev which respectively hold the forward and backward contexts.
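
For example, to score a word given both of its neighbours you could then do something like the following (a sketch only; summing the two log-probabilities is just one way of combining them, and you may want to weight or renormalize):

def score(word, prev_word, next_word):
    # log P(word | prev_word) from the forward model,
    # log P(word | next_word) from the backward model
    fwd = ngram.get(prev_word, {}).get(word)
    bwd = ngram_rev.get(next_word, {}).get(word)
    if fwd is None or bwd is None:
        return None  # unseen context; see the note on smoothing below
    return fwd + bwd

print(score("is", "cat", "cute"))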

You should also account for smoothing. That is, if a given phrase is not seen in your training corpus, you would just get zero probabilities. In order to avoid that, there are many smoothing techniques, the simplest of which is add-one (Laplace) smoothing.
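
Applied to the raw forward counts (i.e. before the log-normalization step above), add-one smoothing could look roughly like this; ngram_rev would be treated the same way, and the vocabulary set here is only an illustration:

# Sketch of add-one (Laplace) smoothing over the raw forward counts.
vocab = set(ngram) | {w for nexts in ngram.values() for w in nexts}
smoothed = {}
for token, nexts in ngram.items():
    total = sum(nexts.values()) + len(vocab)   # one extra count per vocabulary word
    smoothed[token] = {w: np.log(nexts.get(w, 0) + 1) - np.log(total)
                       for w in vocab}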

Drinkwater answered 17/8, 2015 at 18:24 Comment(7)
Actually the context should be at least two words on each side; only unigram search is possible with this – Represent
You specifically provided code for bigrams, and therefore I extended it based on bigrams as well. You can easily generalize it to ngrams with something like: for token, next_token, next_token_2 in zip(tokens, tokens[1:], tokens[2:]): ngram[token]['%s-%s' % (next_token, next_token_2)] += 1 – Drinkwater
That uses the last two words as a constant suffix, thereby limiting the number of valid ngrams – Represent
Can you comment on your response? – Represent
You can modify it to account for the last two words separately, e.g. for token, next_token, next_token_2 in zip(tokens, tokens[1:], tokens[2:]): ngram[token][next_token] += 1; ngram[token]['%s-%s' % (next_token, next_token_2)] += 1 This way you are adding the next immediate context anyway and prevent the problem you mentioned. – Drinkwater
Why not use a nested dictionary? – Represent
@stackit, here I used a nested dict (ngram is nested). – Drinkwater

The normal ngram algorithm traditionally works with prior context only, and for good reason: a bigram tagger, for example, makes decisions by considering the tag of the previous word plus the current word itself. So unless you tag in two passes, the tag of the next word is not yet known. But you are interested in word ngrams, not tag ngrams, so nothing keeps you from training an ngram tagger where the ngram consists of words from both sides. And you can indeed do it easily with the NLTK.

The NLTK's ngram taggers all make tag ngrams, from the left; but you can easily derive your own tagger from their abstract base class, ContextTagger:

import nltk
from nltk.tag import ContextTagger

class TwoSidedTagger(ContextTagger):
    left = 2   # number of words of left context
    right = 1  # number of words of right context

    def context(self, tokens, index, history):
        left = self.left
        right = self.right
        tokens = tuple(t.lower() for t in tokens)
        # pad with boundary markers so the window is always complete
        if index < left:
            tokens = ("<start>",) * left + tokens
            index += left
        if index + right >= len(tokens):
            tokens = tokens + ("<end>",) * right

        return tokens[index-left:index+right+1]

This defines a tetragram tagger (2+1+1) where the current word is third in the ngram, not last as usual. You can then initialize and train a tagger just like the regular ngram taggers (see chapter 5 of the NLTK book, especially sections 5.4ff). Let's see first how you'd build a part-of-speech tagger, using a portion of the Brown corpus as training data:

data = list(nltk.corpus.brown.tagged_sents(categories="news"))
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('NN'))
twosidedtagger._train(train_sents)

Like all ngram taggers in the NLTK, this one will delegate to the backoff tagger if it is asked to tag an ngram it did not see during training. For simplicity I used a simple "default tagger" as the backoff tagger, but you'll probably need to use something more powerful (see the NLTK chapter again).
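
For instance (just a sketch, not part of the recipe above), you could chain a unigram tagger trained on the same data in front of the default tagger:

# A somewhat stronger backoff chain: unseen ngrams fall back to a unigram
# tagger, which in turn falls back to tagging everything as 'NN'.
backoff_chain = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
twosidedtagger = TwoSidedTagger({}, backoff=backoff_chain)
twosidedtagger._train(train_sents)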

You can then use your tagger to tag new text, or evaluate it with an already tagged test set:

>>> print(twosidedtagger.tag("There were dogs everywhere .".split()))
>>> print(twosidedtagger.evaluate(test_sents))

Predicting words:

The above tagger assigns a POS tag by considering nearby words; but your goal is to predict the word itself, so you need different training data and a different default tagger. The NLTK API expects training data in the form (word, LABEL), where LABEL is the value you want to generate. In your case, LABEL is just the current word itself; so make your training data as follows:

data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]  # label each word with itself
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('the')) # most common word
twosidedtagger._train(train_sents)

It makes no sense for the target word to appear in the "context" ngram, so you should also modify the method context() so that the returned ngram does not include it:

    def context(self, tokens, index, history):
        ...
        return tokens[index-left:index] + tokens[index+1:index+right+1]

This tagger uses trigrams consisting of two words from the left and one from the right of the current word.
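
Put together with the padding logic from the first listing, the complete modified method would look roughly like this:

    def context(self, tokens, index, history):
        left = self.left
        right = self.right
        tokens = tuple(t.lower() for t in tokens)
        if index < left:
            tokens = ("<start>",) * left + tokens
            index += left
        if index + right >= len(tokens):
            tokens = tokens + ("<end>",) * right

        # same window as before, but without the current word itself
        return tokens[index-left:index] + tokens[index+1:index+right+1]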

With these modifications, you'll build a tagger that outputs the most likely word at any position. Try it and see how you like it.

Prediction:

My expectation is that you'll need a humongous amount of training data before you can get decent performance. The problem is that ngram taggers can only suggest a tag for contexts they saw during training.

To build a tagger that generalizes, consider using the NLTK to train a "sequential classifier". You can use whatever features you want, including the words before and after; of course, how well it will work is your problem. The NLTK classifier API is similar to that of the ContextTagger, but the context function (aka feature function) returns a dictionary, not a tuple. Again, see the NLTK book and the source code.
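
To make that concrete, here is a rough sketch of a classifier-based tagger with a hand-written feature function; the feature names and the "<pad>" marker are illustrative only, and it reuses the (word, word) training data built above:

from nltk.tag import ClassifierBasedTagger

def two_sided_features(tokens, index, history):
    # Features drawn from both sides of the current position; the target word
    # itself is deliberately not included.
    def word(i):
        return tokens[i].lower() if 0 <= i < len(tokens) else "<pad>"
    return {"left-2": word(index - 2),
            "left-1": word(index - 1),
            "right+1": word(index + 1)}

word_predictor = ClassifierBasedTagger(feature_detector=two_sided_features,
                                       train=train_sents)
print(word_predictor.tag("There were dogs everywhere .".split()))

Note that training a classifier to predict an open vocabulary of words can be slow and memory-hungry, so a smaller corpus or a restricted candidate list is advisable for experiments.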

Cates answered 27/8, 2015 at 11:5 Comment(15)
Thanks. How can this be used to predict the next word given the context (both left and right)? That is what the question was about. – Represent
Also, this just tags the words with a POS tag; better to tag with the most probable word – Represent
I must admit I don't understand what application you have in mind: if you already know the words at positions i+1, i+2 etc., how are you "predicting" the word at position i? But anyway, if you want the tagger to produce words, just replace the training data with (word_i, word_i+1) tuples. It's a one-liner; do you need help with it? – Cates
The application is basically a word suggester/corrector: assuming an incorrect sentence is given, with incorrect usage of words, grammar etc., correct it. – Represent
How would you use this to suggest the most probable word for the current word, given both sides of the context as well as the current word itself? – Represent
In the following example, how would you train on the dwdiff output to get error suggestions based on the model stated above: What is {+a+} genetic risk? Genetic risk refers [-more-] to [-your-] chance of inheriting a disorder or disease. People get [-certain disease-] {+diseases+} because of genetic changes. How much a genetic change tells us about your chance of developing a disorder is not always clear. If your genetic results [-indicate-] {+indicated+}... – Represent
Are you able to construct a training set that pairs each word in a corpus with the next word, or are you not? – Cates
I mean, OK, I get what you want it for. But was the explanation in my comment sufficient or was it not? – Cates
Yes, I do have a training set which has a mapping between correct words and incorrect words – Represent
Can you write an example of how to actually use it for prediction and evaluate accuracy? And does it generate a tree internally? – Represent
"Generate a tree internally"? What are you talking about? I've revised the code to predict the current word at each position, instead of its POS, and show how to create the training data. – Cates
So does it use a classifier internally or pure ngram matching? And how can the "sequential classifier" be implemented? – Represent
Can it utilize the context's POS tag values? – Represent
Your question was how to make an ngram tagger that uses words from both sides, and how to train it to estimate the current word instead of a POS tag. I have answered both parts. Read my answer again, and the NLTK book, for the answers to your follow-up questions. (But yes, you can use POS tags too) – Cates
And of course, if you are stuck with some other aspect of your project, you are always encouraged to ask a new question. – Cates
