Effective 1-5 grams extraction with python

Asked 13/10, 2014 at 13:45 Answered 22/10, 2014 at 10:10

python nlp nltk information-retrieval n-gram

I have a huge files of 3,000,000 lines and each line have 20-40 words. I have to extract 1 to 5 ngrams from the corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

Currently, I am doing it as below but this don't seem to be a efficient way to extract the 1-5grams:

#!/usr/bin/env python -*- coding: utf-8 -*-

import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')

with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file. 
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1] # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1] # Removes the final '<s>'

    # Unigrams count.
    src_unigrams = Counter(src_words) 
    trg_unigrams = Counter(trg_words) 
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())

    # Bigrams count.
    src_bigrams = Counter(zip(src_words,src_words[1:]))
    trg_bigrams = Counter(zip(trg_words,trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())

    # Trigrams count.
    src_trigrams = Counter(zip(src_words,src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words,trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_bigrams.values())
    trg_sum_trigrams = sum(trg_bigrams.values())

Is there any other way to do this more efficiently?

How to optimally extract different N ngrams simultaneously?

From Fast/Optimize N-gram implementations in python, essentially this:

zip(*[words[i:] for i in range(n)])

when hard-coded is this for bigrams, n=2:

zip(src_words,src_words[1:])

and is this for trigrams, n=3:

zip(src_words,src_words[1:],src_words[2:])

Staphylorrhaphy answered 13/10, 2014 at 13:45 Comment(1)

What is the format of the input files? – Shingly 15/10, 2014 at 15:46

If you are interested only in the most common (frequent) n-grams (which is your case I suppose), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought as the number of lines that a given n-gram is contained in, it efficiently searches for all such n-grams.

The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (may be optimized as discussed later), scan the whole corpus and get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you subset F1 of frequent 1-grams. Then test all the possible 2-grams by combining all the 1-grams from F1. Again, select those which hold the s_min criterion and you'll get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you'll get F3. Repeat for as long as Fn is non-empty.

Many optimizations can be done here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y may only be combined to form (n+1)-gram iff x[1:] == y[:-1] (may be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can extremely speed up the query function. For each 1-gram, store a hash-set of line indices containing the given 1-gram. When combining two n-grams into an (n+1)-gram, use intersection of the two corresponding sets, obtaining a set of lines where the (n+1)-gram may be contained.

The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered as the algorithm runs, saving computational time for the frequent ones only.

Intonation answered 22/10, 2014 at 10:10 Comment(0)

I am giving you a bunch of pointers regarding the general problems you are trying to solve.. One or more of these should be useful for you and help you figure this out.

For what you are doing (I am guessing some sort of machine translation experiment) you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided).. Processing them separately will be less expensive in terms of the amount of stuff you need to hold in memory at a given time.

You are reading a ton of data into memory, processing it (which takes even more memory), and then holding the results in some in-memory data-structures. Instead of doing that, you should strive to be lazier. Learn about python generators and write a generator which streams out all the ngrams from a given text without needing to hold the entire text in memory at any given point in time. The itertools python package will probably come in handy while writing this.

Beyond a point, it will no longer be feasible for you to hold all this data in memory. You should consider looking at map-reduce to help you break this down. Check out the mrjob python package which lets you write map reduce jobs in python. In the mapper step, you will break text down into its ngrams, and in the reducer stage you will count the number of times you see each ngram to get its overall count. mrjob's can also be run locally which obviously won't give you any parallelization benefits, but will be nice cause mrjob will still do a lot of heavy lifting for you.

If you are compelled to hold all the counts in memory at the same time (for a massive amount of text), then either implement some pruning strategy to prune out very rare ngrams, or consider using some persistent file-based lookup table such sqlite to hold all the data for you.

Aldin answered 16/10, 2014 at 7:16 Comment(0)

Assuming you don't want to count ngrams between lines, and assuming naive tokenization:

def ngrams(n, f):
    deque = collections.deque(maxlen=n)
    for line in f:
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1]) # pre-seed so 5-gram counter doesn't count incomplete 5-grams
        for word in words[n-1:]:
            deque.append(word)
            yield tuple(str(w) for w in deque) # n-gram tokenization
counters = [collections.Counter(ngrams(i, open('somefile.txt'))) for i in range(5)]

edit: added beginning/end line tokens

The resultant data object is I believe about as sparse as possible. 3m lines with 40 words is ~120m tokens. With ~1m words in English (though less commonly used), you'll probably get a rather long tail. If you can imagine your data to be exchangeable / iid, then you can add some pruning in the middle:

def ngrams(n, f, prune_after=10000):
    counter = collections.Counter()
    deque = collections.deque(maxlen=n)
    for i, line in enumerate(f):
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1])
        for word in words[n-1:]:
            deque.append(word)
            ngram = tuple(str(w) for w in deque)
            if i < prune_after or ngram in counter:
                counter[ngram] += 1
    return counter

Relaxing the exchangeability assumption would require something like Tregoreg's answer for efficient pruning, but in most cases exchangeability should hold.

As far as raw speed, I think zip (like the original code) vs deque is the fundamental question. zip removes the innermost loop, so it is likely already very fast. deque requires the innermost loop but also consumes the data iteratively, so its working memory footprint should be much smaller. Which is better will likely depend on your machine, but I'd imagine for large machines/small data that zip would be faster. Once you start running out of memory (especially if you start talking about pruning), however, deque gets a few more advantages.

Frederickson answered 21/10, 2014 at 21:23 Comment(2)

He's adding those tags because it's useful to have extra "words" representing the beginning and end of a sentence. This can be useful, for example, for language models. See the section on "Language Modeling" on this Coursera course if you want more information. – Noncooperation 22/10, 2014 at 10:32

That makes sense, I guess I'd consider it a modified n-gram. – Frederickson 22/10, 2014 at 15:16

Recommended topics

Hot tags