Python - compare n-grams across multiple text files

Asked 10/12, 2014 at 23:36 Answered 21/8, 2019 at 18:4

First time poster - I am a new Python user with limited programming skills. Ultimately I am trying to identify and compare n-grams across numerous text documents found in the same directory. My analysis is somewhat similar to plagiarism detection - I want to calculate the percentage of text documents in which a particular n-gram can be found. For now, I am attempting a simpler version of the larger problem, trying to compare n-grams across two text documents. I have no problem identifying the n-grams but I am struggling to compare across the two documents. Is there a way to store the n-grams in a list to effectively compare which ones are present in the two documents? Here's what I've done so far (forgive the naive coding). For reference, I provide basic sentences below as opposed to the text documents I am actually reading in my code.

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)

print(trigrams1)
for grams in trigrams1:
    print(grams)

def compare(trigrams1, trigrams2):
    for grams1 in trigrams1:
        if grams1 in trigrams2:
            print(grams1)
    return False

Thanks to everyone for your help!

Glean answered 10/12, 2014 at 23:36 Comment(2)

Any example of input files, or few rows from them? – Clew 10/12, 2014 at 23:37

The text documents I am reading are about 1-3 pages each. I updated the simple example with two brief sentences for reference. Thanks! – Glean 10/12, 2014 at 23:49

Use a list, say common, in the compare function. Append each n-gram that appears in both sets of trigrams to this list, and return the list at the end:

>>> # lower() ignores sentence case; list() materializes the n-grams so they can be printed and reused.
>>> trigrams1 = list(ngrams(text1.lower().split(), n))
>>> trigrams2 = list(ngrams(text2.lower().split(), n))
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...     common = []
...     for grams1 in trigrams1:
...         # Keep every n-gram from the first text that also occurs in the second.
...         if grams1 in trigrams2:
...             common.append(grams1)
...     return common
... 
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
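
If the order of the results and repeated n-grams do not matter, a set intersection does the same comparison without the explicit loop. This is only a sketch: word_ngrams below is a hypothetical helper standing in for nltk.util.ngrams, so that the snippet runs without NLTK installed.

```python
def word_ngrams(text, n):
    """Return the word n-grams of text as a list of lowercase tuples.

    Hypothetical stand-in for list(ngrams(text.lower().split(), n)).
    """
    tokens = text.lower().split()
    # Zipping n staggered views of the token list yields each window of n words.
    return list(zip(*(tokens[i:] for i in range(n))))


def common_ngrams(text1, text2, n=3):
    """Return the set of n-grams that occur in both texts."""
    return set(word_ngrams(text1, n)) & set(word_ngrams(text2, n))


shared = common_ngrams('Hello my name is Jason', 'My name is not Mike')
print(shared)  # {('my', 'name', 'is')}
```

Sets also make each membership check cheap, which may matter once the documents are a few pages long.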

Animadversion answered 11/12, 2014 at 0:17 Comment(1)

ngrams returns a generator object, not a list. The compare function will not work unless it is converted to list first. – Bohr 21/8, 2019 at 17:53
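
The point in the comment above can be seen with any Python generator, not just the one returned by ngrams. In this small sketch, make_grams is a hypothetical generator function used only for illustration:

```python
def make_grams():
    """Yield a few trigrams lazily, as a generator-returning function would."""
    yield ('my', 'name', 'is')
    yield ('name', 'is', 'not')
    yield ('is', 'not', 'mike')


grams = make_grams()

# A membership test scans the generator until it finds a match. The match
# here is the last item, so the whole generator gets consumed.
first_check = ('is', 'not', 'mike') in grams

# The generator is now exhausted, so an item it did yield is no longer found.
second_check = ('my', 'name', 'is') in grams

# Converting to a list once makes repeated membership tests reliable.
grams_list = list(make_grams())
list_check = ('my', 'name', 'is') in grams_list and ('is', 'not', 'mike') in grams_list

print(first_check, second_check, list_check)  # True False True
```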

I think it may be easier to join the elements of each n-gram into a single string, build a list of those strings, and then do the comparison.

Let's go over the process with the example you provided.

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

After applying the ngrams function from nltk (and wrapping the result in list() if it comes back as a generator), you get the following two lists, which I again name text1 and text2:

text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]

Before comparing the n-grams, lowercase all the elements; otherwise 'my' and 'My' count as different tokens, which we don't want.

The following function joins each n-gram into one string and lowercases it.

def append_elements(n_gram):
    # Join the words of each n-gram into one lowercase, space-separated string,
    # e.g. ('Hello', 'my', 'name') -> 'hello my name'.
    return [' '.join(gram).lower() for gram in n_gram]

So if we feed it text1 we get ['hello my name', 'my name is', 'name is jason'] which is easier to process.

Next we make the compare function. You were right in assuming that we could use a list to store commonalities. I named it common here:

def compare(n_gram1, n_gram2):
    # Convert both lists of n-gram tuples to lists of lowercase phrases.
    n_gram1 = append_elements(n_gram1)
    n_gram2 = append_elements(n_gram2)
    # Collect every phrase from the first list that also occurs in the second.
    common = []
    for phrase in n_gram1:
        if phrase in n_gram2:
            common.append(phrase)
    if not common:
        # Nothing in common; you could print a message here instead.
        return False
    else:
        for i in common:
            print(i)

The test if not common succeeds when the common list is empty, in which case the function returns False (or you could have it print a message instead).

Now in your example, when we run compare(text1, text2) the only commonality is:

>>> 
my name is
>>>

which is the correct answer.

Logician answered 11/12, 2014 at 0:57 Comment(0)

I was doing a task very similar to yours when I came across this old thread. The approach works well except for one bug, so I am adding this answer in case someone else stumbles upon it. ngrams from nltk.util returns a generator object, not a list, so it needs to be converted to a list before the compare function you wrote can use it. I also use lower() for a case-insensitive match.

Complete example:

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.lower().split(), n)
trigrams2 = ngrams(text2.lower().split(), n)

def compare_ngrams(trigrams1, trigrams2):
    trigrams1 = list(trigrams1)
    trigrams2 = list(trigrams2)
    common=[]
    for gram in trigrams1:
        if gram in trigrams2:
            common.append(gram)
    return common

common = compare_ngrams(trigrams1, trigrams2)
print(common)

Output:

[('my', 'name', 'is')]
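
The same idea extends to the original goal of finding the share of documents that contain a given n-gram. What follows is a minimal sketch, not a tested solution for your data: word_ngrams is a hypothetical stand-in for nltk.util.ngrams (so the snippet runs without NLTK), and read_texts assumes the documents are UTF-8 .txt files in a single directory.

```python
from collections import Counter
from pathlib import Path


def word_ngrams(text, n):
    """Yield lowercase word n-grams (hypothetical stand-in for nltk.util.ngrams)."""
    tokens = text.lower().split()
    return zip(*(tokens[i:] for i in range(n)))


def ngram_document_share(texts, n=3):
    """Map each n-gram to the fraction of texts that contain it at least once."""
    texts = list(texts)
    doc_counts = Counter()
    for text in texts:
        # set() so an n-gram repeated inside one document is counted only once.
        doc_counts.update(set(word_ngrams(text, n)))
    return {gram: count / len(texts) for gram, count in doc_counts.items()}


def read_texts(directory, pattern='*.txt'):
    """Read every file matching pattern in directory (assumed UTF-8 plain text)."""
    return [path.read_text(encoding='utf-8')
            for path in sorted(Path(directory).glob(pattern))]


shares = ngram_document_share(['Hello my name is Jason', 'My name is not Mike'])
print(shares[('my', 'name', 'is')])   # 1.0
print(shares[('name', 'is', 'not')])  # 0.5
```

For real files you would call ngram_document_share(read_texts(your_directory)) and multiply the values by 100 to get percentages.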

Bohr answered 21/8, 2019 at 18:4 Comment(0)
