Counting bi-gram frequencies

Asked 4/5, 2011 at 12:43 Answered 5/5, 2011 at 1:10

I've written a piece of code that counts word frequencies and writes them to an ARFF file for use with Weka. I'd like to alter it so that it counts bi-gram frequencies, i.e. pairs of words instead of single words, but my attempts so far have been unsuccessful.

I realise there's a lot to look at, but any help on this is greatly appreciated. Here's my code:

    import re
    import nltk

    # Quran subset
    filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

    # create list of lower case words
    word_list = re.split('\s+', file(filename).read().lower())
    print 'Words in text:', len(word_list)
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    word_list = [punctuation.sub("", word) for word in word_list]

    word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]



    # create dictionary of word:frequency pairs
    freq_dic = {}


    for word in word_list2:

        # form dictionary
        try: 
            freq_dic[word] += 1
        except: 
            freq_dic[word] = 1


    print '-'*30

    print "sorted by highest frequency first:"
    # create list of (val, key) tuple pairs
    freq_list2 = [(val, key) for key, val in freq_dic.items()]
    # sort by val or frequency
    freq_list2.sort(reverse=True)
    freq_list3 = list(freq_list2)
    # display result as top 10 most frequent words
    freq_list4 =[]
    freq_list4=freq_list3[:10]

    words = []

    for item in freq_list4:
        a = str(item[1])
        a = a.lower()
        words.append(a)



    f = open(filename)

    newlist = []

    for line in f:
        line = punctuation.sub("", line)
        line = line.lower()
        newlist.append(line)

    f2 = open('Lines.txt','w')

    newlist2= []
    for line in newlist:
        line = line.split()
        newlist2.append(line)
        f2.write(str(line))
        f2.write("\n")


    print newlist2

    # ARFF Creation

    arff = open('output.arff','w')
    arff.write('@RELATION wordfrequency\n\n')
    for word in words:
        arff.write('@ATTRIBUTE ')
        arff.write(str(word))
        arff.write(' numeric\n')

    arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
    arff.write('@DATA\n')
    # Counting word frequencies for each verse
    for line in newlist2:
        word_occurrences = str("")
        for word in words:
            matches = int(0)
            for item in line:
                if str(item) == str(word):
                    matches = matches + int(1)
                else:
                    continue
            word_occurrences = word_occurrences + str(matches) + ","
        word_occurrences = word_occurrences + "endofworld"
        arff.write(word_occurrences)
        arff.write("\n")

    print words

Grekin answered 4/5, 2011 at 12:43 Comment(0)

This should get you started:

def bigrams(words):
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w

Note that the first bigram is (None, w1) where w1 is the first word, so you have a special bigram that marks start-of-text. If you also want an end-of-text bigram, add yield (wprev, None) after the loop.
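
To turn the pairs into frequencies, something like the following might do. This is only a sketch: `bigrams` is the generator above with the optional end-of-text pair added behind a `mark_end` flag, while `count_bigrams` and the sample sentence are made-up names for illustration. It uses a plain dict, so it should run on Python 2.6 as well as newer versions.

```python
def bigrams(words, mark_end=False):
    """Yield (previous, current) pairs; the first pair is (None, first_word)."""
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w
    if mark_end:
        # optional end-of-text marker, as described above
        yield (wprev, None)


def count_bigrams(words, mark_end=False):
    """Return a dict mapping each bigram to the number of times it occurs."""
    freq = {}
    for pair in bigrams(words, mark_end):
        freq[pair] = freq.get(pair, 0) + 1
    return freq


# hypothetical sample input, already split into lower-case words
sample = 'the cat sat on the cat'.split()
freq = count_bigrams(sample)
```

Here `freq[('the', 'cat')]` is 2, and the start-of-text pair `(None, 'the')` is counted once.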

Silicone answered 4/5, 2011 at 12:57 Comment(3)

It would be nice if your first item was not (None, first_word), but rather (first_word, second_word), so that the caller need not write a special case for the first item. – Homeopathy 4/5, 2011 at 13:4

@Steven: it's quite common to have a special bigram that marks start-of-text. In fact, for a real application I would add a line yield (wprev, None) at the end as well. – Silicone 4/5, 2011 at 13:6

This answer is the same idea as used in the recipe for a pairwise iterator in the documentation for the itertools module (see docs.python.org/library/itertools.html#recipes). – Homeopathy 4/5, 2011 at 13:11
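
For reference, the pairwise recipe mentioned in that comment looks roughly like this. It starts at (first_word, second_word), so there is no None marker to special-case. The `try`/`except` around `izip` is an addition here so the sketch runs on both Python 2 and Python 3; the sample list is made up.

```python
from itertools import tee

try:
    from itertools import izip as zip  # Python 2: lazy zip
except ImportError:
    pass  # Python 3: the built-in zip is already lazy


def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)


# hypothetical sample input
pairs = list(pairwise(['the', 'cat', 'sat', 'on']))
```

An input with fewer than two words simply yields no pairs.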

Generalized to n-grams with optional padding. This version uses defaultdict(int) for the frequencies, so it also works in Python 2.6:

from collections import defaultdict

def ngrams(words, n=2, padding=False):
    "Compute n-grams with optional padding"
    pad = [] if not padding else [None]*(n-1)
    grams = pad + words + pad
    return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

# grab n-grams
words = ['the','cat','sat','on','the','dog','on','the','cat']
for size, padding in ((3, 0), (4, 0), (2, 1)):
    print '\n%d-grams padding=%d' % (size, padding)
    print list(ngrams(words, size, padding))

# show frequency
counts = defaultdict(int)
for ng in ngrams(words, 2, False):
    counts[ng] += 1

print '\nfrequencies of bigrams:'
for c, ng in sorted(((c, ng) for ng, c in counts.iteritems()), reverse=True):
    print c, ng

Output:

3-grams padding=0
[('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), 
 ('on', 'the', 'dog'), ('the', 'dog', 'on'), ('dog', 'on', 'the'), 
 ('on', 'the', 'cat')]

4-grams padding=0
[('the', 'cat', 'sat', 'on'), ('cat', 'sat', 'on', 'the'), 
 ('sat', 'on', 'the', 'dog'), ('on', 'the', 'dog', 'on'), 
 ('the', 'dog', 'on', 'the'), ('dog', 'on', 'the', 'cat')]

2-grams padding=1
[(None, 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), 
 ('on', 'the'), ('the', 'dog'), ('dog', 'on'), ('on', 'the'), 
 ('the', 'cat'), ('cat', None)]

frequencies of bigrams:
2 ('the', 'cat')
2 ('on', 'the')
1 ('the', 'dog')
1 ('sat', 'on')
1 ('dog', 'on')
1 ('cat', 'sat')
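
To carry bigram counts back into the per-line ARFF rows from the question, something along these lines may work. It is a sketch under assumptions, not a drop-in replacement: the helper names (`bigram_counts`, `arff_attributes`, `arff_rows`), the `word1_word2` attribute naming and the sample data are all invented for illustration, and the class label is simply passed in as the question's code hard-codes it.

```python
from collections import defaultdict


def bigram_counts(words):
    """Count adjacent word pairs in a list of words (no padding)."""
    counts = defaultdict(int)
    for pair in zip(words, words[1:]):
        counts[pair] += 1
    return counts


def arff_attributes(top_bigrams):
    """One numeric @ATTRIBUTE line per bigram, named word1_word2 (assumed convention)."""
    return ['@ATTRIBUTE %s_%s numeric' % bg for bg in top_bigrams]


def arff_rows(lines, top_bigrams, label):
    """Build one comma-separated @DATA row per line of tokenised text."""
    rows = []
    for words in lines:
        counts = bigram_counts(words)
        values = [str(counts.get(bg, 0)) for bg in top_bigrams]
        rows.append(','.join(values + [label]))
    return rows


# hypothetical data: two tokenised lines and the bigrams chosen as attributes
lines = [['the', 'cat', 'sat'], ['on', 'the', 'cat', 'the', 'cat']]
top = [('the', 'cat'), ('on', 'the')]
header = arff_attributes(top)
rows = arff_rows(lines, top, 'endofworld')
```

The underscore in the attribute names is there only to avoid spaces, which would otherwise have to be quoted in the ARFF header.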

Forklift answered 4/5, 2011 at 14:51 Comment(0)

Life is much easier if you use NLTK's FreqDist to do the counting. NLTK also has a bigrams function. Examples of both are on the following page.

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html

Meaghanmeagher answered 5/5, 2011 at 1:10 Comment(0)

I've rewritten the first bit for you, because it's icky. Points to note:

List comprehensions are your friend, use more of them.
collections.Counter is great!

OK, code:

import re
import nltk
import collections

# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')

# create list of lower case words
word_list = re.split('\s+', open(filename).read().lower())
print 'Words in text:', len(word_list)

words = (punctuation.sub("", word).strip() for word in word_list)
words = (word for word in words if word not in nltk.corpus.stopwords.words('english'))

# create dictionary of word:frequency pairs
frequencies = collections.Counter(words)

print '-'*30

print "sorted by highest frequency first:"
# show the full word:frequency mapping
print frequencies

# display result as top 10 most frequent words
print frequencies.most_common(10)

# keep just the top 10 words, as the rest of the original script expects
words = [word for word, frequency in frequencies.most_common(10)]
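
Since the question was about bigrams, the same Counter approach can probably be pointed at word pairs instead of single words. A minimal sketch, assuming Python 2.7 or later (where collections.Counter exists); `top_bigrams` and the sample list are made-up names for illustration:

```python
import collections


def top_bigrams(words, n=10):
    """Return the n most common adjacent word pairs with their counts."""
    words = list(words)  # materialise generators so the list can be sliced
    pairs = zip(words, words[1:])
    return collections.Counter(pairs).most_common(n)


# hypothetical sample input
sample = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']
result = top_bigrams(sample, 2)
```

With this sample, the two most common pairs are ('the', 'cat') and ('on', 'the'), each seen twice.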

Wept answered 4/5, 2011 at 13:21 Comment(3)

-1. An excellent rewrite, but you do not answer the question which asked about altering the code to count bi-gram frequencies. – Homeopathy 4/5, 2011 at 13:31

I can't seem to figure out why but it keeps coming up with an error frequencies = collections.Counter(words) AttributeError: 'module' object has no attribute 'Counter' – Grekin 6/5, 2011 at 16:25

@Alex: You're using Python version 2.6 or lower; Counter was introduced in 2.7. Either upgrade or write your own Counter... – Wept 7/5, 2011 at 10:12
