When are n-grams (n>3) important as opposed to just bigrams or trigrams?

I am wondering what use n-grams with n > 3 (and their occurrence frequencies) have, considering the computational overhead of computing them. Are there applications where bigrams or trigrams are simply not enough?

If so, what is the state-of-the-art in n-gram extraction? Any suggestions? I am aware of the following:

Dawndawna answered 23/4, 2012 at 18:20 Comment(6)
This probably doesn't reach the level of information Legend is looking for, but this video from PyCon 2012 does a pretty good job explaining the basics of computing n-grams in Python (and using them to build a search engine): pyvideo.org/video/715/building-a-python-based-search-engine. For anyone else who stumbles on this question. – Meliorism
The "computational overhead" of computing ngrams is negligible: you can do it in a single pass through your corpus. Even storing higher-order ngrams is not a huge deal. The real cost is that for larger n, you need a bigger and bigger corpus to overcome sparsity problems. – Taeniafuge
@alexis: It would be great if you could provide more information. Specifically, something related to the sparsity problems, or any research showing that the "computational overhead of computing n-grams is negligible"? Thank you. – Dawndawna
@alexis: Just checking with you again (regarding my earlier comment). Thanks. – Dawndawna
@Legend, did you see my answer below? – Taeniafuge
@alexis: Oops... Sorry! I must have missed that notification. Thanks for your time. – Dawndawna

I'm not familiar with a good deal of the tags listed here, but n-grams (the abstract concept) are often useful in connection with statistical models. Here are some applications that aren't restricted merely to bigrams and trigrams:

  • Compression algorithms (especially the PPM variety), where the length of the grams depends on how much data is available for providing specific contexts.
  • Approximate string matching (e.g. BLAST for genetic sequence matching)
  • Predictive models (e.g. name generators)
  • Speech recognition (phoneme n-grams are used to help evaluate the likelihood of candidates for the phoneme currently being recognized)

Those are the ones off the top of my head, but there's much more listed on Wikipedia.

As far as "state-of-the-art" n-gram extraction goes, no idea. N-gram "extraction" is an ad hoc attempt to speed up certain processes while still maintaining the benefits of n-gram style modeling. In short, "state-of-the-art" depends on what you're trying to do. If you're looking at fuzzy matching or fuzzy grouping, it depends on what kind of data you're matching/grouping (e.g. street addresses are going to be very different to fuzzy match than first names); a rough sketch of the idea is below.
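
As a rough illustration of the fuzzy-matching case, here is a minimal sketch (my own illustrative code, not from any of the tools tagged here) that compares strings by the overlap of their character trigram sets; the same idea extends to any n:

    # Sketch: character-n-gram Jaccard similarity for fuzzy matching (illustrative names).
    def char_ngrams(s, n=3):
        s = s.lower()
        return {s[i:i+n] for i in range(len(s) - n + 1)}

    def jaccard(a, b, n=3):
        ga, gb = char_ngrams(a, n), char_ngrams(b, n)
        return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

    # Street addresses and first names behave quite differently under the same measure:
    print(jaccard("123 Main Street", "123 Main St."))   # ~0.64 with these inputs
    print(jaccard("Katherine", "Catherine"))            # ~0.75 with these inputs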

Bireme answered 23/4, 2012 at 18:43 Comment(0)

One (unconventional) way to think about higher-order n-grams is to make the connection to an unnormalized autocorrelation function, i.e. the correlation of a signal with itself. A 2-gram corpus measures the correlation of a word with a "time" lag of a single word, while a 3-gram gives us the information for a "time" lag of two steps. Higher-order n-grams give a measure of the probability distribution of a particular corpus (be it Moby Dick or human DNA). In this way, if an n-gram's frequency is distinct from the null expected value, then there is useful statistical information at that value of n.
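
To make the lag analogy concrete, here is a minimal sketch (my own illustrative code, assuming a plain Python list of tokens) that compares how often a word pair occurs at lag k against what independence would predict:

    from collections import Counter

    def lag_ratio(tokens, k):
        """Observed vs. independence-expected frequency of word pairs at lag k --
        an (unnormalized) autocorrelation-style statistic over a token sequence."""
        pairs = Counter(zip(tokens, tokens[k:]))   # counts of (w_i, w_{i+k})
        uni = Counter(tokens)
        N, M = len(tokens), len(tokens) - k        # number of tokens, number of pairs
        return {p: (c / M) / ((uni[p[0]] / N) * (uni[p[1]] / N))
                for p, c in pairs.items()}

    # lag_ratio(tokens, 1) mirrors the bigram view; lag_ratio(tokens, 2) the "two steps back"
    # part of a trigram. Ratios far from 1 are the cases "distinct from the null expected value".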

Bonnibelle answered 23/4, 2012 at 19:2 Comment(0)

I don't think your question is posed quite right: ngrams are a tool, not a problem to be solved, so there is no "state of the art" in ngrams. As @Hooked pointed out, an ngram is a kind of auto-correlation function (or "autoregressive function"). So what you really want to know is whether there are any problems for which the state-of-the-art solutions involve long ngrams.

For numerical applications such as fitting financial or weather models, or speech recognition, you'd definitely use vectors of dimension > 3. For example, autoregressive Hidden Markov Models fit a piecewise function of the last n measurements, where n can be moderately large if past states are relevant to predicting the future.
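
To illustrate the numerical side, here is a minimal sketch (plain least-squares autoregression, not an autoregressive HMM; the function name is mine) of fitting a model on the last n measurements, where n > 3 is routine:

    import numpy as np

    def fit_ar(series, n):
        """Least-squares fit of x_t ~ a_1*x_{t-n} + ... + a_n*x_{t-1}:
        an order-n autoregression on the last n measurements."""
        X = np.column_stack([series[k:len(series) - n + k] for k in range(n)])  # lagged inputs
        y = np.asarray(series[n:])                                              # targets
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coeffs   # coeffs[-1] weights the most recent measurement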

But all your examples concern word ngrams, and I can't think of any work that found n > 3 to be useful in that domain. I don't think it's a question of computational cost or finding enough training data: Superficial auto-correlation in language seems to peter out after 3 words or so. Random example: this article tries to reinterpret Zipf's law in terms of ngram-based information content. They consider n up to 4, but get the highest overall correlations for the trigram counts.

I don't mean to say that n > 3 is not useful; but your observation that it doesn't seem to come up much is well founded.

But note that the complexity of counting ngrams in a text is not an issue: If you have a tokenized corpus of length L, you could collect all ngrams of the corpus like this:

    from collections import Counter
    ngrams = Counter()                       # missing keys count as 0
    for i in range(L - n + 1):               # one window per starting position
        ngrams[tuple(corpus[i:i+n])] += 1    # tuples are hashable; list slices are not

As you can see this requires only O(L) steps, i.e., it is linear in the size of the corpus and does not grow with n. So collecting ngrams of any dimension is a non-issue. But the number of possible ngrams quickly mushrooms. To illustrate, if you distinguish 32 letter tokens (letters and some punctuation classes), there are 1,024 possible letter bigrams but 1,048,576 (32^4) possible tetragrams. To find enough of them to populate your frequency tables, you need exponentially more text.

For word ngrams the sparsity problem is even worse, since not only do you have a lot more than 32 different word tokens, but the vocabulary size increases (slowly) with corpus size: the famous "long tail" property. So your data will be sparse (even for small n) no matter how large a corpus you collect. You'll then need to fit complicated statistical models, whose computation cost depends on the number of distinct ngrams.

As a result, sparsity is always an issue in word ngram applications (hence "smoothing" is usually necessary). If you google "ngram sparsity" you'll find a ton of references.
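
For what it's worth, the very simplest smoothing scheme looks like this (add-one / Laplace; a toy sketch with my own names, real systems use Kneser-Ney or similar):

    from collections import Counter

    def laplace_prob(ngram, ngram_counts, context_counts, vocab_size):
        """P(w_n | w_1..w_{n-1}) with add-one smoothing: an unseen ngram gets a small
        nonzero probability instead of zero."""
        return (ngram_counts[ngram] + 1) / (context_counts[ngram[:-1]] + vocab_size)

    # ngram_counts: Counter of n-gram tuples; context_counts: Counter of (n-1)-gram tuples.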

Taeniafuge answered 30/4, 2012 at 10:33 Comment(0)

In addition to Kaganar's answer:

Any kind of stylometric analysis (e.g. author profiling based on writing style, or trying to detect the epoch of a text) will require longer n-grams for shallow syntactic parsing. Usually such approaches are complemented by deep syntactic parsing based on PCFG, TAG, etc.
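
As a hedged illustration (a generic sketch with my own names; character n-grams are just one common stylometric feature alongside the shallow-parse n-grams mentioned above):

    from collections import Counter

    def char_ngram_profile(text, n=4, top=300):
        """Relative frequencies of the most common character n-grams: a crude style
        fingerprint; real stylometry also uses POS/word n-grams over shallow parses."""
        grams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
        total = sum(grams.values())
        return {g: c / total for g, c in grams.most_common(top)} if total else {}

    # Profiles from two texts can then be compared (e.g. by cosine or rank distance)
    # to guess authorship or epoch.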

Nonary answered 23/4, 2012 at 20:34 Comment(0)

You can also use n > 3 language models if your dataset is very large.

Buckthorn answered 30/5, 2017 at 10:41 Comment(1)
Should be a comment. – Rachmaninoff
