For an application we built, we use a simple statistical word-prediction model (similar to Google Autocomplete) to guide search.
It uses n-gram statistics gathered from a large corpus of relevant text documents: given the previous N-1 words, it suggests the 5 most likely next words in descending order of probability, using Katz back-off.
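To make the setup concrete, here is a minimal Python sketch of the kind of predictor we mean (the class name, the toy corpus, and the crude "longest context with data" back-off are illustrative stand-ins; our real model uses Katz back-off):

    from collections import Counter, defaultdict

    class NgramPredictor:
        def __init__(self, n=3):
            self.n = n
            # counts[k][context][word]: frequency of word after a k-word context
            self.counts = [defaultdict(Counter) for _ in range(n)]

        def train(self, tokens):
            for k in range(self.n):
                for i in range(len(tokens) - k):
                    self.counts[k][tuple(tokens[i:i + k])][tokens[i + k]] += 1

        def suggest(self, context, top=5):
            # Back off to shorter contexts until some continuation is found
            # (a crude stand-in for Katz back-off).
            ctx = tuple(context[-(self.n - 1):])
            while ctx and not self.counts[len(ctx)][ctx]:
                ctx = ctx[1:]
            followers = self.counts[len(ctx)][ctx]
            total = sum(followers.values())
            if total == 0:
                return []
            return [(w, c / total) for w, c in followers.most_common(top)]

    model = NgramPredictor(n=3)
    model.train("the cat in the hat the cat sat on the mat".split())
    print(model.suggest(["the", "cat"]))   # e.g. [('in', 0.5), ('sat', 0.5)]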
We would like to extend this to predict phrases (multiple words) instead of a single word. However, when predicting a phrase, we would prefer not to also display its prefixes.
For example, consider the input "the cat". In this case we would like to make predictions like "the cat in the hat", but not "the cat in" and not "the cat in the".
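Stated as a rule: a candidate should be suppressed whenever it is a proper prefix of another candidate we are about to display. A small hypothetical helper illustrates the filter we have in mind:

    def drop_prefixes(phrases):
        # Remove any candidate that is a proper prefix of another candidate.
        # `phrases` is a list of token tuples, e.g. ("in", "the", "hat").
        keep = []
        for p in phrases:
            if not any(q[:len(p)] == p for q in phrases if len(q) > len(p)):
                keep.append(p)
        return keep

    cands = [("in",), ("in", "the"), ("in", "the", "hat"), ("sat",)]
    print(drop_prefixes(cands))  # [('in', 'the', 'hat'), ('sat',)]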
Assumptions:
We do not have access to past search statistics
We do not have tagged text data (for instance, we do not know the parts of speech)
What is a typical way to make these kinds of multi-word predictions? We've tried multiplicative and additive weighting of longer phrases, but our weights are arbitrary and overfit to our tests.
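For concreteness, the multiplicative variant we tried looks roughly like the sketch below (the beam parameters and the per-word weight alpha are exactly the arbitrary knobs in question; drop_prefixes is the helper from above, and model is the single-word predictor sketched earlier):

    import heapq

    def suggest_phrases(model, context, beam=10, max_len=4, alpha=0.7):
        # Expand phrases breadth-first, scoring each as the product of its
        # step probabilities times alpha per extra word. alpha is the
        # arbitrary multiplicative weight we cannot justify.
        frontier = [(1.0, tuple(context), ())]  # (score, full context, phrase)
        candidates = []
        for _ in range(max_len):
            expanded = []
            for score, ctx, phrase in frontier:
                for word, p in model.suggest(list(ctx), top=beam):
                    cand = (score * p * alpha, ctx + (word,), phrase + (word,))
                    expanded.append(cand)
                    candidates.append((cand[0], cand[2]))
            frontier = heapq.nlargest(beam, expanded)
        # Suppress candidates that are prefixes of other candidates, then
        # return the five best surviving phrases.
        kept = set(drop_prefixes([ph for _, ph in candidates]))
        ranked = sorted(((s, ph) for s, ph in candidates if ph in kept),
                        reverse=True)
        return [" ".join(ph) for _, ph in ranked[:5]]

The trouble, as noted, is that no fixed alpha seems principled: it trades short phrases against long ones in a way we can only tune by eye against our own tests.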