What exactly is an n-gram?

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct:

Sentence: "I live in NY."

word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"
character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

When you have this array of n-gram parts, you drop the duplicate ones and keep a counter for each part giving its frequency (a code sketch follows the counts below):

word-level bigrams: [1, 1, 1, 1, 1]
character-level bigrams: [2, 1, 1, ...]
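A minimal Python sketch of this extraction and counting, for concreteness. The function names are mine, and the lowercasing step is an assumption: it is the only way the count of 2 appears, since "I" and "in" then share the bigram "#i".

```python
from collections import Counter

def word_bigrams(sentence):
    # Pad the token list with '#' sentence-boundary markers, then pair neighbours.
    tokens = ["#"] + sentence.strip(".").split() + ["#"]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def char_bigrams(sentence):
    # Pad each word with '#' word-boundary markers, then slide a 2-char window.
    grams = []
    for word in sentence.strip(".").lower().split():  # lowercasing is assumed
        padded = "#" + word + "#"
        grams.extend(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams

sentence = "I live in NY."
print(Counter(word_bigrams(sentence)))   # all five word bigrams occur once
print(Counter(char_bigrams(sentence)))   # '#i' occurs twice ('I' and 'in')
```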

Someone in the answer section confirmed this was correct, but unfortunately I'm a bit lost beyond that as I didn't fully understand everything else that was said! I'm using LingPipe and following a tutorial which stated I should choose a value between 7 and 12 - but without stating why.

What is a good n-gram value, and how should I take it into account when using a tool like LingPipe?

Edit: This was the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

Catha answered 12/8, 2013 at 17:40 Comment(0)

N-grams are simply all combinations of adjacent words or letters of length n that you can find in your source text. For example, given the word fox, all 2-grams (or “bigrams”) are fo and ox. You may also count the word boundary – that would expand the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.

You can do the same on the word level. As an example, the hello, world! text contains the following word-level bigrams: # hello, hello world, world #.

The basic point of n-grams is that they capture the language structure from the statistical point of view, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
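As a toy illustration of that statistical point (not from the answer itself), here is a short Python sketch that estimates, from raw n-gram counts, which character is likely to follow a given context:

```python
from collections import Counter, defaultdict

def next_char_distribution(text, n=3):
    """Estimate P(next char | previous n-1 chars) from raw n-gram counts."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    # Normalize the counts for each context into probabilities.
    return {ctx: {ch: c / sum(ctr.values()) for ch, c in ctr.items()}
            for ctx, ctr in counts.items()}

dist = next_char_distribution("the theory of the thing", n=3)
print(dist["th"])  # {'e': 0.75, 'i': 0.25} -- 'e' usually follows 'th' here
```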

Spoonful answered 12/8, 2013 at 17:48 Comment(4)
So the smaller the n-gram, the more comparisons made and the more accurate the analysis? I'm trying to understand why this tutorial suggested a number between 7 and 12. – Catha
So for doing sentiment analysis on tweets, how should I pick a number? Just pot luck? – Catha
I guess the easiest way to figure out the best number is to experiment. For example, you could split your training data in two halves, train on the first half, and then use the number that gets you the best results on the second (sketched below). Or try tea leaves! – Spoonful
Tea leaves it is. Thanks! – Catha
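To make that experiment concrete, here is a hedged Python sketch of the holdout idea. The nearest-profile classifier is a toy stand-in (it is not LingPipe's language-model classifier, and all names are mine); the point is only the loop that sweeps candidate n values and keeps the one that scores best on the held-out half.

```python
from collections import Counter

def char_ngrams(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train(labeled_texts, n):
    """Build one character n-gram profile per label (toy stand-in model)."""
    profiles = {}
    for label, text in labeled_texts:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles

def classify(profiles, text, n):
    # Pick the label whose profile shares the most n-grams with the text.
    grams = char_ngrams(text, n)
    return max(profiles, key=lambda lbl: sum((profiles[lbl] & grams).values()))

def pick_best_n(train_half, test_half, candidates=range(2, 13)):
    best_n, best_acc = None, -1.0
    for n in candidates:
        profiles = train(train_half, n)
        hits = sum(classify(profiles, text, n) == label
                   for label, text in test_half)
        acc = hits / len(test_half)
        if acc > best_acc:
            best_n, best_acc = n, acc
    return best_n, best_acc

# Tiny made-up example data, just to show the calling convention.
train_half = [("pos", "i love this team"), ("neg", "i hate this ref")]
test_half = [("pos", "love the team"), ("neg", "hate the ref")]
print(pick_best_n(train_half, test_half, candidates=range(2, 6)))
```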

Usually a picture is worth a thousand words. [Image: comparison chart of n-gram language models; no longer available inline.]

Source: http://recognize-speech.com/language-model/n-gram-model/comparison

Stefaniestefano answered 3/8, 2017 at 7:20 Comment(4)
The link is locked. – Chapple
archived version of the link – Carabiniere
Exactly – a picture is worth a thousand words. So few people can present complex material in such a simple, visual way. – Beer
Words of wisdom in this answer, haha! – Stewardson

An n-gram is an n-tuple or group of n words or characters (grams, for pieces of grammar) that follow one another. So an n of 3 for the words of your sentence would give "# I live", "I live in", "live in NY", "in NY #". This is used to build an index of how often words follow one another. You can use it in a Markov chain to create something that resembles natural language: as you populate a mapping of the distributions of word or character groups, you can recombine them at random, and the longer the n-gram, the higher the probability that the output will read naturally.

Too high a number and your output will be a word-for-word copy of the original; too low a number and the output will be too messy.
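A toy Python sketch of that generate-from-n-grams idea (all names are mine; "#" marks the start and end of the text, in the same spirit as the boundary markers above):

```python
import random
from collections import defaultdict

def build_chain(text, n=3):
    """Map each (n-1)-word context to the words observed to follow it."""
    words = ["#"] * (n - 1) + text.split() + ["#"]
    chain = defaultdict(list)
    for i in range(len(words) - n + 1):
        *context, nxt = words[i:i + n]
        chain[tuple(context)].append(nxt)
    return chain

def generate(chain, n=3, max_words=20):
    context = ("#",) * (n - 1)   # start at the text boundary
    out = []
    for _ in range(max_words):
        followers = chain.get(context)
        if not followers:
            break
        nxt = random.choice(followers)
        if nxt == "#":           # reached the end boundary
            break
        out.append(nxt)
        context = context[1:] + (nxt,)
    return " ".join(out)

chain = build_chain("I live in NY. I work in NY. I live well.", n=3)
print(generate(chain))  # recombines the source's word trigrams at random
```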

Busily answered 12/8, 2013 at 18:7 Comment(5)
Would you have a recommendation for the n-gram size for tweet analysis? – Catha
My stock answer is: it depends on the goals of your analysis. Are you looking for trending hashtags, common phrases, or semantic analysis of word-group trends? – Busily
Sorry for the delay in responding. I am collecting all tweets I can containing the words (manchester united, man united, man utd, mufc) and I want to analyse the overall sentiment in these tweets – whether they are positive or negative. This is only a simplistic version of my tool (I have a more sophisticated version in Python). I created a classifier already, but in it I used an n-gram of 7 without really understanding why – as I said, I just picked a number between 7 and 12, as recommended by my tutorial. – Catha
So, your question as I interpret it is: "Is an n-gram of 7 sufficient to detect good/bad sentiment?" And the answer is: what common 7-word phrases are showing up? If you're looking for occurrences of "what a rubbish call", that requires an n-gram of 4. If you're looking at n-grams of 7, you'll find something like "what a rubbish call! The refs are". What you may find necessary is to perform multiple analyses of your input content across a range of n-gram sizes – maybe between 4 and 10 – and develop a heuristic analysis technique. – Busily
Thanks. To be honest, I hadn't even thought about it to that degree. I just followed this wee tutorial and thought nothing more of it. It was only when I was doing my write-up as to WHY I chose the number 7 that I couldn't explain it. – Catha
