n-grams in python, four, five, six grams?
Asked Answered
E

17

177

I'm looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams?

Thanks!

Expropriate answered 8/7, 2013 at 16:35 Comment(3)
Do you want the text split into groups of n size by word or character? Can you give an example of what output should look like for the above?Phrygia
Never done nltk but looks like there is a function ingrams whose second parameter is the degree of the ngrams you want. Is THIS the version of nltk you are using? Even if not, here is the source EDIT: There is ngrams and ingrams in there, ingrams being a generator.Greensward
There is also an answer under this thread that may be useful: https://mcmap.net/q/144223/-fast-n-gram-calculationPhrygia
S
279

Great native python based answers given by other users. But here's the nltk approach (just in case, the OP gets penalized for reinventing what's already existing in the nltk library).

There is an ngram module that people seldom use in nltk. It's not because it's hard to read ngrams, but training a model base on ngrams where n > 3 will result in much data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and I want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
  print(grams)
Sinfonia answered 9/7, 2013 at 12:10 Comment(3)
For character ngrams, please also look at: #22428520Sinfonia
Is there any way to use N-gram to check a whole document such as txt ? I am not familiar with Python so I don't know if it can open up a txt file and then use the N-gram analysis to check through ?Ephor
Can someone comment on how to test the accuracy of sixgrams?Josephina
E
85

I'm surprised that this hasn't shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i: i + N] for i in range(len(sentence) - N + 1)]

In [37]: for gram in grams: print (gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
Equivalence answered 8/7, 2013 at 16:54 Comment(3)
That's exactly what the first answer does minus the frequency counting and tuple conversion.Greensward
It is nicer to see it rewritten as a comprehension though.Greensward
@amirouche: good catch. Thanks for the bug reports. It's been fixed nowEquivalence
S
26

People have already answered pretty nicely for the scenario where you need bigrams or trigrams but if you need everygram for the sentence in that case you can use nltk.util.everygrams

>>> from nltk.util import everygrams

>>> message = "who let the dogs out"

>>> msg_split = message.split()

>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

Incase you have a limit like in case of trigrams where the max length should be 3 then you can use max_len param to specify it.

>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can just modify the max_len param to achieve whatever gram i.e four gram, five gram, six or even hundred gram.

The previous mentioned solutions can be modified to implement the above mentioned solution but this solution is much straight forward than that.

For further reading click here

And when you just need a specific gram like bigram or trigram etc you can use the nltk.util.ngrams as mentioned in M.A.Hassan's answer.

Savil answered 18/12, 2018 at 10:39 Comment(2)
This is an important answer for all those that do not want to make repeated calls to ngrams. Great answer!Lower
Great! I was searching for this!Etiolate
P
21

Using only nltk tools

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n ):
    n_grams = ngrams(word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]

Example output

get_ngrams('This is the simplest text i could think of', 3 )

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams in array format just remove ' '.join

Potion answered 31/8, 2015 at 9:28 Comment(0)
B
17

here is another simple way for do n-grams

>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]
Bronchial answered 18/6, 2014 at 17:59 Comment(1)
Had to do nltk.download('punkt') to use the nltk.word_tokenize() function. Also to print the results had to convert the generator object like bigrams, trigrams and fourgrams to list using list(<genrator_object>).Savil
C
9

You can easily whip up your own function to do this using itertools:

from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
Countercharge answered 8/7, 2013 at 16:50 Comment(1)
Can you please explain izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N)))) I do not quite understand it.Leda
L
6

A more elegant approach to build bigrams with python’s builtin zip(). Simply convert the original string into a list by split(), then pass the list once normally and once offset by one element.

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
  input_list = s.split(" ")
  return zip(*[input_list[i:] for i in range(n)])

find_bigrams(string)

[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
Lasko answered 29/3, 2016 at 5:22 Comment(0)
O
3

If efficiency is an issue and you have to build multiple different n-grams (up to a hundred as you say), but you want to use pure python I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an itirator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage :

>>> input_list = input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.

Openmouthed answered 18/1, 2018 at 7:52 Comment(0)
O
2

I have never dealt with nltk but did N-grams as part of some small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D would give you the histogram of your N-words.

D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N): # N-grams
    try:
        D[tuple(strparts[i:i+N])] += 1
    except:
        D[tuple(strparts[i:i+N])] = 1
Outsell answered 8/7, 2013 at 16:44 Comment(1)
collections.Counter(tuple(strparts[i:i+N]) for i in xrange(len(strparts)-N)) will work faster than the try-exceptEquivalence
I
2

For four_grams it is already in NLTK, here is a piece of code that can help you toward this:

 from nltk.collocations import *
 import nltk
 #You should tokenize your text
 text = "I do not like green eggs and ham, I do not like them Sam I am!"
 tokens = nltk.wordpunct_tokenize(text)
 fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
 for fourgram, freq in fourgrams.ngram_fd.items():  
       print fourgram, freq

I hope it helps.

Ingar answered 3/2, 2015 at 16:52 Comment(0)
R
2

You can use sklearn.feature_extraction.text.CountVectorizer:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

You can set to ngram_size to any positive integer. I.e. you can split a text in four-grams, five-grams or even hundred-grams.

Remus answered 14/8, 2015 at 19:57 Comment(0)
G
2

Nltk is great, but sometimes is a overhead for some projects:

import re
def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]

Example use:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]
Gaddis answered 28/8, 2015 at 14:48 Comment(0)
E
2

After about seven years, here's a more elegant answer using collections.deque:

def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)
    print(' '.join(window))

words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

Output:

In [236]: ngrams(words, 2)
I am
am become
become death,
death, the
the destroyer
destroyer of
of worlds

In [237]: ngrams(words, 3)
I am become
am become death,
become death, the
death, the destroyer
the destroyer of
destroyer of worlds

In [238]: ngrams(words, 4)
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of
the destroyer of worlds

In [239]: ngrams(words, 1)
I
am
become
death,
the
destroyer
of
worlds

Equivalence answered 21/2, 2020 at 21:13 Comment(2)
The last ngram appears to be missing.Shupe
@BjörnLindqvist: thanks for the bugreport. Fixed now :)Equivalence
C
2

If you want a pure iterator solution for large strings with constant memory usage:

from typing import Iterable  
import itertools

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Skip first words
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))  

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']
Coroner answered 1/5, 2020 at 23:43 Comment(0)
H
1

You can get all 4-6gram using the code without other package below:

from itertools import chain

def get_m_2_ngrams(input_list, min, max):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

the output is below:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

you can find more detail on this blog

Huskey answered 29/1, 2018 at 9:5 Comment(0)
A
1

It's quite easy to do n gram in python, for example:

def n_gram(list,n): 
    return [ list[i:i+n] for i in range(len(list)-n+1) ]

and if you do :

str = "I really like python, it's pretty awesome."
n_gram(str.split(" "),4)

You will get

[['I', 'really', 'like', 'python,'], 
['really', 'like', 'python,', "it's"], 
['like', 'python,', "it's", 'pretty'], 
['python,', "it's", 'pretty', 'awesome.']]
Aerialist answered 1/11, 2021 at 5:49 Comment(0)
N
1

It's an old question, but if you want to actually get the n-grams as a list of substrings (not as list of lists or tuples) and don't want to import anything, the following code works just fine and is easy to read:

def get_substrings(phrase, n):
    phrase = phrase.split()
    substrings = []
    for i in range(len(phrase)):
        if len(phrase[i:i+n]) == n:
            substrings.append(' '.join(phrase[i:i+n]))
    return substrings

You can use it e.g. in this way to get all n-grams of a list of terms up to a words length:

a = 5
terms = [
    "An n-gram is a contiguous sequence of n items",
    "An n-gram of size 1 is referred to as a unigram",
]

for term in terms:
    for i in range(1, a+1):
        print(f"{i}-grams: {get_substrings(term, i)}")

Prints:

1-grams: ['An', 'n-gram', 'is', 'a', 'contiguous', 'sequence', 'of', 'n', 'items']
2-grams: ['An n-gram', 'n-gram is', 'is a', 'a contiguous', 'contiguous sequence', 'sequence of', 'of n', 'n items']
3-grams: ['An n-gram is', 'n-gram is a', 'is a contiguous', 'a contiguous sequence', 'contiguous sequence of', 'sequence of n', 'of n items']
4-grams: ['An n-gram is a', 'n-gram is a contiguous', 'is a contiguous sequence', 'a contiguous sequence of', 'contiguous sequence of n', 'sequence of n items']
5-grams: ['An n-gram is a contiguous', 'n-gram is a contiguous sequence', 'is a contiguous sequence of', 'a contiguous sequence of n', 'contiguous sequence of n items']
1-grams: ['An', 'n-gram', 'of', 'size', '1', 'is', 'referred', 'to', 'as', 'a', 'unigram']
2-grams: ['An n-gram', 'n-gram of', 'of size', 'size 1', '1 is', 'is referred', 'referred to', 'to as', 'as a', 'a unigram']
3-grams: ['An n-gram of', 'n-gram of size', 'of size 1', 'size 1 is', '1 is referred', 'is referred to', 'referred to as', 'to as a', 'as a unigram']
4-grams: ['An n-gram of size', 'n-gram of size 1', 'of size 1 is', 'size 1 is referred', '1 is referred to', 'is referred to as', 'referred to as a', 'to as a unigram']
5-grams: ['An n-gram of size 1', 'n-gram of size 1 is', 'of size 1 is referred', 'size 1 is referred to', '1 is referred to as', 'is referred to as a', 'referred to as a unigram']
Necessitate answered 28/12, 2021 at 15:0 Comment(1)
Can you please add what is different in this answer than the previous ones?Decalcomania

© 2022 - 2024 — McMap. All rights reserved.