What are n-gram counts and how do I implement them using NLTK?

4

15

I've read a paper that uses n-gram counts as features for a classifier, and I was wondering what exactly this means.

Example text: "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"

I can create unigrams, bigrams, trigrams, etc. out of this text, but I have to define the "level" at which to create these n-grams. The "level" can be character, syllable, word, ...

So creating unigrams out of the sentence above would simply create a list of all words?

Creating bigrams would result in word pairs bringing together words that follow each other?

So if the paper talks about ngram counts, it simply creates unigrams, bigrams, trigrams, etc. out of the text, and counts how often which ngram occurs?
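To make that reading concrete, here is a minimal word-level sketch. It is pure Python with a naive tokenizer (lowercase, drop commas, split on whitespace), which is an assumption for illustration only; the helper name `word_ngrams` is made up. As far as I know, `nltk.ngrams(tokens, n)` yields the same tuples.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (sliding window)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"
# Naive tokenization (assumption): lowercase, strip commas, split on whitespace.
tokens = text.lower().replace(",", "").split()

unigrams = word_ngrams(tokens, 1)  # one 1-tuple per word
bigrams = word_ngrams(tokens, 2)   # pairs of adjacent words

# "n-gram counts": how often each n-gram occurs; these become the features.
bigram_counts = Counter(bigrams)
print(bigrams[:2])  # [('lorem', 'ipsum'), ('ipsum', 'dolor')]
```

Each distinct n-gram is one feature, and its count is the feature value.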

Is there an existing method in Python's NLTK package? Or do I have to implement a version of my own?

Sula answered 10/10, 2012 at 14:1 Comment(2)
Yours is a common interpretation, but the "gram" unit could be e.g. bytes or characters, too. So character 3-grams of "lorem" could be "lor" and "em" or even "lor", "ore", "rem" if you use a sliding window.Leeann
Useful: github.com/hb20007/hands-on-nltk-tutorial/blob/master/…Anschluss
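The two character-level readings from the first comment can be sketched like this. The function names are made up for illustration; the only difference between them is the step size of the window.

```python
def char_chunks(s, n):
    """Non-overlapping pieces of length n (the last one may be shorter)."""
    return [s[i:i + n] for i in range(0, len(s), n)]

def char_ngrams(s, n):
    """Sliding-window character n-grams (window moves one character at a time)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_chunks("lorem", 3))  # ['lor', 'em']
print(char_ngrams("lorem", 3))  # ['lor', 'ore', 'rem']
```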
B
18

I found some old code of mine; maybe it's useful.

import nltk
from nltk import bigrams
from nltk import trigrams

text="""Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam tempus vitae. Morbi justo mauris,
congue sit amet imperdiet ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam"""
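# split the text into tokens (word_tokenize needs NLTK's tokenizer data to be installed)
tokens = nltk.word_tokenize(text)
# lowercase and drop single-character tokens such as punctuation; this list is the unigrams
tokens = [token.lower() for token in tokens if len(token) > 1]
# bigrams() and trigrams() return generators in NLTK 3, so turn them into lists
bi_tokens = list(bigrams(tokens))
tri_tokens = list(trigrams(tokens))

# print each distinct trigram with its count
# (newer NLTK tokenizers may split off sentence-final periods, giving 'elit' instead of 'elit.')
print([(item, tri_tokens.count(item)) for item in sorted(set(tri_tokens))])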
>>> 
[(('adipiscing', 'elit.', 'nullam'), 2), (('amet', 'consectetur', 'adipiscing'), 2),(('amet', 'imperdiet', 'ipsum'), 1), (('congue', 'sit', 'amet'), 1), (('consectetur', 'adipiscing', 'elit.'), 2), (('diam', 'tempus', 'vitae.'), 1), (('dolor', 'sit', 'amet'), 2), (('elit.', 'nullam', 'ornare'), 2), (('imperdiet', 'ipsum', 'dolor'), 1), (('ipsum', 'dolor', 'sit'), 2), (('justo', 'mauris', 'congue'), 1), (('lacus', 'quis', 'pellentesque'), 2), (('lorem', 'ipsum', 'dolor'), 1), (('mauris', 'congue', 'sit'), 1), (('morbi', 'justo', 'mauris'), 1), (('nullam', 'ornare', 'tempor'), 2), (('ornare', 'tempor', 'lacus'), 2), (('pellentesque', 'diam', 'tempus'), 1), (('quis', 'pellentesque', 'diam'), 2), (('sit', 'amet', 'consectetur'), 2), (('sit', 'amet', 'imperdiet'), 1), (('tempor', 'lacus', 'quis'), 2), (('tempus', 'vitae.', 'morbi'), 1), (('vitae.', 'morbi', 'justo'), 1)]
Bigham answered 10/10, 2012 at 14:7 Comment(1)
Is it correct that it counts ['tempus', 'vitae', 'morbi'] as a trigram if they are not in the same sentence?Bulge
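On the comment's question: the code in this answer treats the whole text as one token stream, so trigrams do span sentence boundaries. Whether that is desirable depends on the task. If it is not, one option is to build the n-grams per sentence, roughly as sketched below. The regex-based sentence split is a crude assumption to keep the sketch dependency-free; `nltk.sent_tokenize` would be the more robust choice.

```python
import re
from collections import Counter

def trigrams_within_sentences(text):
    """Count word trigrams without crossing sentence boundaries."""
    counts = Counter()
    # Assumption: a sentence ends at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        # zip over three shifted views yields consecutive triples.
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts

sample = "Quis pellentesque diam tempus vitae. Morbi justo mauris, congue sit amet."
counts = trigrams_within_sentences(sample)
print(("tempus", "vitae", "morbi") in counts)  # False
print(counts[("diam", "tempus", "vitae")])     # 1
```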
T
3

When you count n-grams, it's better to use a hash table (dictionary) than to call count on the list for every item. For the above example:

unigrams = {}
for token in tokens:
  if token not in unigrams:
    unigrams[token] = 1
  else:
    unigrams[token] += 1

This gives you O(n) time overall, whereas calling list.count once per distinct n-gram rescans the whole list each time.
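The standard library already packages this dictionary pattern. A minimal sketch with `collections.Counter` and `collections.defaultdict`, using a made-up token list in place of the `tokens` from the earlier answer:

```python
from collections import Counter, defaultdict

# Made-up tokens for illustration.
tokens = ["lorem", "ipsum", "dolor", "sit", "amet", "ipsum", "dolor"]

# Counter does the "start at 1, otherwise add 1" bookkeeping itself.
unigram_counts = Counter(tokens)

# The same idea with defaultdict, closer to the explicit loop.
manual_counts = defaultdict(int)
for token in tokens:
    manual_counts[token] += 1

print(unigram_counts["ipsum"])  # 2
print(manual_counts["sit"])     # 1
```

Both are single passes over the tokens. Which one is faster in practice depends on the Python version, so time it if it matters.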

Towland answered 30/9, 2016 at 0:44 Comment(2)
Is this an answer? if so please post it with details.Riposte
This is not true in Python 3.4+. Surprising results with Python timeit: Counter() vs defaultdict() vs dict()Tournai
C
1

There is a concept called Collocations in NLTK.

You may find it useful.
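Roughly, a collocation is a word pair that occurs together more often than the individual word frequencies would suggest. NLTK has its own tools for this (for example `BigramCollocationFinder` in `nltk.collocations`). The sketch below is not NLTK's implementation, just the underlying idea in plain Python, scoring adjacent pairs by pointwise mutual information. The function name, the `min_count` cutoff, and the sample sentence are all made up for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)
    scores = {}
    for (w1, w2), c in pair_counts.items():
        if c < min_count:  # very rare pairs give unreliable scores
            continue
        p_pair = c / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # PMI > 0 means the pair co-occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "new york is big and new york is busy but the city is new".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```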

Computation answered 6/9, 2013 at 6:34 Comment(0)
Y
-1

I don't think there is a specific method in nltk to help with this. This isn't tough, though. If you have a sentence of n words (assuming you're working at the word level), get all n-grams of length 1 to n, iterate through them, and make each one a key in an associative array, with the value being its count. It shouldn't take more than 30 lines of code; you could build your own package for this and import it where needed.
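The hand-rolled approach described above might look like the sketch below. The function name and the optional `max_n` cap are assumptions added for illustration. For what it's worth, NLTK does ship n-gram helpers such as `nltk.util.ngrams`, so writing this yourself is optional.

```python
def count_all_ngrams(tokens, max_n=None):
    """Count every n-gram of length 1..max_n (default: the full length)."""
    if max_n is None:
        max_n = len(tokens)
    counts = {}  # the "associative array": n-gram tuple -> count
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts

counts = count_all_ngrams("to be or not to be".split())
print(counts[("to", "be")])  # 2
print(counts[("to",)])       # 2
```

A sentence of n words has n(n+1)/2 n-gram occurrences in total, so for longer texts it is usually worth capping `max_n` at a small value such as 3.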

Yanyanaton answered 10/10, 2012 at 14:6 Comment(1)
Ok, then it seems like I understand the ngram stuff correctly :)Sula
