Computing N Grams using Python

I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                      # start with an empty list
    if N > 1:
        space = " " * (N-1)         # add N - 1 spaces
        text = space + text + space # add both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                    # return the list

# test code
for i in range(5):
    print N_Gram(i+1, "text")

# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in iter(nList):
    print '"' + ngram + '"'

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this produces all the character-level n-grams within a word, when what I want are word-level n-grams, as in CYSTIC and FIBROSIS or CYSTIC FIBROSIS. Can someone help me figure out how to do this?

Obedience answered 16/11, 2012 at 20:26 Comment(2)
are you coming from a MATLAB background? you don't need the semicolons at the end of every line anymore!!Babylonian
Woops! Sorry.. I'm kinda new here and didn't know I have to accept answers!!!.... The code isn't mine.. it's from the website I've given...Obedience

Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count the n-grams up in a dict:

grams = {}
for g in (' '.join(x) for x in ngrams(input, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1

Putting that all together into one final function gives:

def ngrams(input, n):
   input = input.split(' ')
   output = {}
   for i in range(len(input)-n+1):
       g = ' '.join(input[i:i+n])
       output.setdefault(g, 0)
       output[g] += 1
   return output

ngrams('a a a a', 2) # {'a a': 3}
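As a side note (my sketch, not part of the original answer), the counting step can be written more compactly with `collections.Counter` from the standard library:

```python
from collections import Counter

def ngram_counts(text, n):
    # split on whitespace, then count every n-word window
    words = text.split()
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))

ngram_counts('a a a a', 2)  # Counter({'a a': 3})
```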
Quaky answered 16/11, 2012 at 20:33 Comment(5)
So, I'm going to get the total number of DISTINCT WORDS as the output for n=1??Obedience
Also, any clue how I can generate a sentence using this? That would really help!Obedience
Yes, you will get distinct words (though punctuation will affect all of this to a degree). To generate sentences, I assume that you want something like a Markov chain? I actually wrote up an article on word generation using markov chains a few years ago. The basic ideas are the same: ohthehugemanatee.net/2009/10/…. You'll need to find a way to label "starting" words in the data structure which this does not do, as well as "ending" or "terminal" words.Quaky
Hmm..I actually had to make one using the frequency of words... which I can see the code is able to calculate..Obedience
Is there any way to use N-gram to check a whole document such as txt ? I am not familiar with Python so I don't know if it can open up a txt file and then use the N-gram analysis to check through and give a result like this?Timetable

A short Pythonesque solution from this blog:

def find_ngrams(input_list, n):
  return zip(*[input_list[i:] for i in range(n)])

Usage:

>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3)
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
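One caveat (my note, not the original answerer's): the transcript above is from Python 2, where zip returns a list. In Python 3, zip returns a lazy iterator, so you would wrap the call in list() to get the same output:

```python
def find_ngrams(input_list, n):
    # zip n shifted views of the list; list() materializes the iterator on Python 3
    return list(zip(*[input_list[i:] for i in range(n)]))

find_ngrams(['all', 'this', 'happened'], 2)  # [('all', 'this'), ('this', 'happened')]
```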
Prearrange answered 3/6, 2015 at 0:53 Comment(0)

Use NLTK (the Natural Language Toolkit): tokenize (split) your text into a list of words, then find the bigrams and trigrams.

import nltk
words = nltk.word_tokenize(my_text)
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)
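If NLTK is not available, the same word bigrams can be produced with plain zip (a sketch, not NLTK's actual implementation):

```python
def bigrams(words):
    # pair each word with its successor
    return list(zip(words, words[1:]))

bigrams(['cystic', 'fibrosis', 'affects'])  # [('cystic', 'fibrosis'), ('fibrosis', 'affects')]
```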
Isla answered 17/11, 2012 at 15:26 Comment(3)
This here is brilliant... But I wanted distinct ones...!!Obedience
@Obedience in python, computing distinct elements in a list is very easy: distinct_list= list(set(original_list)) You may omit that last conversion to list again if a set has enough functionality for you.Bor
If you like this, then you are going to love the NLTK book, especially chapter 1: nltk.org/book/ch01.htmlIsla
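Building on the comment above about distinct elements, a quick plain-Python sketch (no NLTK needed) that collects the distinct word bigrams of a text:

```python
# tokenize naively on whitespace, then deduplicate the bigrams with a set
words = "the cat sat on the mat the cat".split()
bigrams = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]
distinct = sorted(set(bigrams))
# distinct == ['cat sat', 'mat the', 'on the', 'sat on', 'the cat', 'the mat']
```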

There is another interesting module in Python called scikit-learn (sklearn). It can give you all the n-grams in a given range. Here is the code:

from sklearn.feature_extraction.text import CountVectorizer 
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))
analyzer = vectorizer.build_analyzer()
print(analyzer(text))

Output is

[u'this', u'is', u'foo', u'bar', u'sentences', u'and', u'want', u'to', u'ngramize', u'it', u'this is', u'is foo', u'foo bar', u'bar sentences', u'sentences and', u'and want', u'want to', u'to ngramize', u'ngramize it', u'this is foo', u'is foo bar', u'foo bar sentences', u'bar sentences and', u'sentences and want', u'and want to', u'want to ngramize', u'to ngramize it', u'this is foo bar', u'is foo bar sentences', u'foo bar sentences and', u'bar sentences and want', u'sentences and want to', u'and want to ngramize', u'want to ngramize it', u'this is foo bar sentences', u'is foo bar sentences and', u'foo bar sentences and want', u'bar sentences and want to', u'sentences and want to ngramize', u'and want to ngramize it', u'this is foo bar sentences and', u'is foo bar sentences and want', u'foo bar sentences and want to', u'bar sentences and want to ngramize', u'sentences and want to ngramize it']

Here it gives all the n-grams in the range 1 to 6, using the CountVectorizer class.

Naivete answered 30/10, 2014 at 14:14 Comment(0)

Using collections.deque:

from collections import deque
from itertools import islice

def ngrams(message, n=1):
    it = iter(message.split())
    window = deque(islice(it, n), maxlen=n)
    yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)
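As one commenter below points out, the first yield can produce a partial tuple when the input has fewer than n words; seeding the window with n - 1 items avoids that. A sketch of that variant (my rewording of the comment):

```python
from collections import deque
from itertools import islice

def ngrams(message, n=1):
    # prime the window with n - 1 tokens so every yield holds a full n-gram
    it = iter(message.split())
    window = deque(islice(it, n - 1), maxlen=n)
    for item in it:
        window.append(item)
        yield tuple(window)
```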

...or maybe you could do it in one line as a list comprehension:

n = 2
message = "Hello, how are you?".split()
myNgrams = [message[i:i+n] for i in range(len(message) - n + 1)]
Concelebrate answered 16/11, 2012 at 20:39 Comment(2)
You can avoid the superfluous first yield if you "islice(it, n-1)"Telling
myNgrams = [message[i:i+n] for i in range(len(message) - n + 1) ]Fosse

nltk has native support for n-grams.

'n' is the n-gram size, e.g. n=3 gives trigrams. Note that each text passed to ngrams should already be a list of tokens; passing a raw string yields character n-grams.

from nltk import ngrams

def ngramize(texts, n):
    output=[]
    for text in texts:
        output += ngrams(text,n)
    return output
Cutoff answered 28/9, 2016 at 22:35 Comment(0)

Though the post is old, I thought I'd mention my answer here so that most of the n-gram creation logic can be found in one post.

There is a package in Python named TextBlob. It creates n-grams very easily, similar to NLTK.

Below is the code snippet with its output for easy understanding.

from textblob import TextBlob

sent = """This is to show the usage of Text Blob in Python"""
blob = TextBlob(sent)
unigrams = blob.ngrams(n=1)
bigrams = blob.ngrams(n=2)
trigrams = blob.ngrams(n=3)

And the output is :

unigrams
[WordList(['This']),
 WordList(['is']),
 WordList(['to']),
 WordList(['show']),
 WordList(['the']),
 WordList(['usage']),
 WordList(['of']),
 WordList(['Text']),
 WordList(['Blob']),
 WordList(['in']),
 WordList(['Python'])]

bigrams
[WordList(['This', 'is']),
 WordList(['is', 'to']),
 WordList(['to', 'show']),
 WordList(['show', 'the']),
 WordList(['the', 'usage']),
 WordList(['usage', 'of']),
 WordList(['of', 'Text']),
 WordList(['Text', 'Blob']),
 WordList(['Blob', 'in']),
 WordList(['in', 'Python'])]

trigrams
[WordList(['This', 'is', 'to']),
 WordList(['is', 'to', 'show']),
 WordList(['to', 'show', 'the']),
 WordList(['show', 'the', 'usage']),
 WordList(['the', 'usage', 'of']),
 WordList(['usage', 'of', 'Text']),
 WordList(['of', 'Text', 'Blob']),
 WordList(['Text', 'Blob', 'in']),
 WordList(['Blob', 'in', 'Python'])]

As simple as that.

TextBlob can do quite a bit more than this. Please have a look at the docs for more details - https://textblob.readthedocs.io/en/dev/

Musicale answered 20/9, 2017 at 13:11 Comment(0)

If efficiency is an issue and you have to build multiple different n-grams, I would consider using the following code (building on Franck's excellent answer):

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

Usage :

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK (note that both nltk.ngrams and n_grams return lazy generators here, so these timings measure generator construction rather than full n-gram production):

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Argufy answered 18/1, 2018 at 7:34 Comment(2)
Great answer! But could be improved if your variable names were according to PEP :) Meaning lowercase with underscores.Socalled
I normally like differentiating between names of callables and variables but because it's an answer for other people, I edited for PEP8 style :)Argufy