Statistical sentence suggestion model like spell checking

There are already spell-checking models available which help us find suggested correct spellings based on a corpus of trained correct spellings. Can the granularity be increased from letters to words, so that we can have phrase suggestions? That is, if an incorrect phrase is entered, it should suggest the nearest correct phrase from a corpus of correct phrases, trained of course on a list of valid phrases.

Are there any Python libraries which already provide this functionality, or how should I proceed with an existing large gold-standard phrase corpus to get statistically relevant suggestions?

Note: this is different from a spell checker, because the alphabet of a spell checker is finite, whereas in a phrase corrector each "letter" is itself a word, hence the alphabet is theoretically infinite; however, we can limit the number of words with a phrase bank.

Alexis asked 5/8, 2015 at 8:52

Comments (2):

Drogin: What do you mean by a "correct phrase"? Do you mean like how many phones now suggest the "next word" that you would like to type? Or do you mean a phrase that is grammatically correct?

Alexis: Yes, like next word, i.e. statistical; I am not concerned about grammar. @KristyHughes

What you want to build is an N-gram model, which consists in estimating the probability of each word given the n - 1 words that precede it.

You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
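For example, here is a minimal sketch of both options (it assumes the punkt and gutenberg NLTK data packages have already been fetched with nltk.download):

import nltk
from nltk.corpus import gutenberg

# Assumes the data was fetched once, e.g.:
# nltk.download("punkt"); nltk.download("gutenberg")

# Option 1: use an NLTK corpus, already split into tokenized sentences
sentences = gutenberg.sents("austen-emma.txt")
print(sentences[0])

# Option 2: tokenize your own raw text
text = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))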

You can consider a 2-gram model (a Markov model):

What is the probability for "kitten" to follow "cute"?

...or 3-gram:

What is the probability for "kitten" to follow "the cute"?

etc.
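To make the 2-gram case concrete, the maximum-likelihood estimate is simply a ratio of counts; the numbers below are made up purely for illustration:

# Maximum-likelihood estimate for a 2-gram:
#   P(kitten | cute) = count("cute kitten") / count("cute")
# These counts are invented, just to illustrate the arithmetic.
count_cute_kitten = 12   # times the bigram "cute kitten" appears in the corpus
count_cute = 80          # times the word "cute" appears in the corpus

p_kitten_given_cute = count_cute_kitten / count_cute
print(p_kitten_given_cute)  # 0.15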

Obviously, training an (n+1)-gram model is costlier than an n-gram model.

Instead of considering words alone, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
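For instance, a minimal sketch of building such (word, pos) pairs (it assumes the averaged_perceptron_tagger data has been downloaded):

import nltk

# Assumes: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("The cat is cute.")
tagged = nltk.pos_tag(tokens)  # e.g. [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ...]

# Use (word, pos) pairs as the units of the model instead of bare words
units = [(word.lower(), pos) for word, pos in tagged]
print(units)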

You can also try using lemmas instead of the full word forms.
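For example, a rough sketch using NLTK's WordNetLemmatizer (it assumes the wordnet data has been downloaded; a real pipeline would choose the lemmatizer's pos argument from the POS tags above):

import nltk
from nltk.stem import WordNetLemmatizer

# Assumes: nltk.download("punkt"); nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("He jumps and he is happy.")

# Treating every token as a verb here is a simplification for the sketch
lemmas = [lemmatizer.lemmatize(token.lower(), pos="v") for token in tokens]
print(lemmas)  # e.g. ['he', 'jump', 'and', 'he', 'be', 'happy', '.']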

Here are some interesting lectures about N-gram modelling:

  1. Introduction to N-grams
  2. Estimating N-gram Probabilities

Here is a simple, short, unoptimized example of code for a 2-gram model:

from collections import defaultdict
import math

import nltk

# ngram[w1][w2] holds the count (later the log-probability) of w2 following w1
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

# Count bigram occurrences sentence by sentence
for sentence in nltk.sent_tokenize(corpus):
    # Build a list (not a map iterator) so it can be sliced and zipped with itself
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Convert counts into log10 conditional probabilities: log P(next_token | token)
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(count) - total
                    for nxt, count in ngram[token].items()}
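With the table built above, one naive way to turn it into a suggester is to return the continuation with the highest log-probability; suggest_next below is a hypothetical helper, not part of NLTK, and it simply returns None for words the model has never seen:

def suggest_next(word):
    """Return the most probable next word after `word`, or None if unseen."""
    followers = ngram.get(word.lower())
    if not followers:
        return None
    # Pick the follower with the highest log10 probability
    return max(followers, key=followers.get)

print(suggest_next("cat"))  # 'is'
print(suggest_next("is"))   # 'cute' or 'happy' (tied counts, ties broken arbitrarily)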
Controversy answered 5/8, 2015 at 14:34

Comments (5):

Alexis: Are there no existing tools which work at the word level?

Controversy: @Alexis I guess so, but I have never done this task before, so I don't know. I think it is quite easy to code. I edited my answer to suggest some code. Obviously, it is not optimized... I can't tell you more, sorry...

Alexis: Nice code, but why do some of the log probabilities come out as 0 while others do not? Is smoothing needed? Also, it only works with a one-word sliding window; should I use a nested dict to build an N-gram model for arbitrary N? This seems to be a very common construct in language modelling. NLTK has an ngram model package; can that be used instead, or is there any other existing construct with the same functionality as the code above?

Alexis: Related bounty question: #31986966

Controversy: @Alexis You should add a small delta to avoid log(0), indeed... Actually, storing probabilities as logs is optional, but additions are cheaper than multiplications, as explained in the lecture (as I recall).
