Counting bigrams real fast (with or without multiprocessing) - python

Given the big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (Imagine that I have to repeat this counting 100,000 times).

According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be optimal:

_bigrams = zip(*[text[i:] for i in range(2)])

And since I'm using Python 3, the zip object is lazy and won't be evaluated until I materialize it with list(_bigrams) or some other function that consumes it.
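For example, on a toy string the same recipe gives (a minimal illustration, not from the original post):

from collections import Counter

toy = 'abcab'
_bigrams = zip(*[toy[i:] for i in range(2)])  # lazy zip object in Python 3
print(Counter(_bigrams))  # ('a', 'b') twice, ('b', 'c') and ('c', 'a') once each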

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(_bigrams).most_common(100)
    print(time.time() - start)  # roughly 1+ second per iteration
    # Do some manipulation to text and repeat the counting.
    text = manipulate(text, top100)

But that takes around a second or more per iteration, and 100,000 iterations (over a day in total) would be too long.

I've also tried sklearn's CountVectorizer, but the time to extract, count and get the top 100 bigrams is comparable to the native Python version.
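Roughly, the CountVectorizer version was along these lines (a sketch; the analyzer and parameter choices here are assumptions, not the exact code I ran):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))  # character-level bigrams
X = vectorizer.fit_transform([text])        # 1 x n_bigrams sparse count matrix
counts = X.toarray()[0]
vocab = vectorizer.get_feature_names()      # bigram strings, aligned with counts
top100 = sorted(zip(vocab, counts), key=lambda t: t[1], reverse=True)[:100]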

Then I've experimented with some multiprocessing, using a slight modification of Python multiprocessing and a shared counter and http://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing:

from multiprocessing import Process, Manager, Lock

class MultiProcCounter(object):
    def __init__(self):
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    procs = [Process(target=func, args=(counter,_in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [1,1,1,1,2,2,3,4,4,5,2,2,3,1,2]

print(multiproc_count(inputs))

But using the MultiProcCounter in the bigram counting takes even longer than 1+ seconds per iteration. I have no idea why that is the case; with the dummy list of ints, multiproc_count works perfectly.

I've tried:

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(multiproc_count(_bigrams)).most_common(100)

Is there any way to count the bigrams really fast in Python?

Qktp answered 2/11, 2016 at 6:3 Comment(5)
Sounds like you should be looking into distributed processing and map/reduce if you really cannot avoid performing the same thing 100,000 times. I assume you mean you have data which is even bigger, not that you literally repeat the same computation 100,000 times; if that's really what you mean, this sounds like there is a flaw in your basic plan.Fortney
It's repeating the same thing 100,000 times, but each time it takes the top 100 bigrams and manipulates the text, so the input text to extract bigrams from is different at every iteration.Qktp
Do you consider the first 20 bigrams of big.txt to be ['th', 'he', 'e ', ' p', 'pr', 'ro', 'oj', 'je', 'ec', 'ct', 't ', ' g', 'gu', 'ut', 'te', 'en', 'nb', 'be', 'er', 'rg'] as your code produces, or a word-oriented subset like ['th', 'he', 'pr', 'ro', 'oj', 'je', 'ec', 'ct', 'gu', 'ut', 'te', 'en', 'nb', 'be', 'er', 'rg', 'eb', 'bo', 'oo', 'ok']? Just trying to understand the rules of the game.Zoniazoning
Are the edits so complex you cannot update the bigram counts as you change the text? E.g. when you replace o → a in dog, you can decrement do and og and increment da and ag. If most of the text does not change, this should be faster than repeating the calculation (a sketch of this follows these comments).Merriott
Yes, updating the counts of bigrams is possible, but that would mean I would need a hashtable of size n^2, given n is the no. of characters, and in some cases n = 300,000 =(Qktp
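A minimal sketch of the incremental-update idea from the comment above (replace_char is an illustrative helper, not code from the post; only the bigrams touching the edited position are adjusted):

from collections import Counter

def replace_char(text, counts, pos, new_char):
    # Patch the bigram counts for a single-character replacement,
    # then return the edited text.
    old_char = text[pos]
    if pos > 0:  # bigram ending at pos
        counts[(text[pos - 1], old_char)] -= 1
        counts[(text[pos - 1], new_char)] += 1
    if pos < len(text) - 1:  # bigram starting at pos
        counts[(old_char, text[pos + 1])] -= 1
        counts[(new_char, text[pos + 1])] += 1
    return text[:pos] + new_char + text[pos + 1:]

text = 'dog'
counts = Counter(zip(text, text[1:]))
text = replace_char(text, counts, 1, 'a')  # 'dog' -> 'dag'
# counts now holds ('d','a'): 1 and ('a','g'): 1, while 'do' and 'og' drop to 0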
import thread  # Python 2 low-level threading module
from multiprocessing import cpu_count

text = 'I really like cheese'  # just load whatever you want here, this is just an example

CORE_NUMBER = cpu_count()  # may raise NotImplementedError on some platforms; just replace with how many cores you have if it crashes

ready = []
bigrams = []

def extract_bigrams(cores):
    global ready, bigrams
    bigrams = []
    ready = []
    for a in xrange(cores):  # xrange is best for performance
        bigrams.append(0)
        ready.append(0)
    cpnt = 0  # current point
    iterator = len(text) // cores  # chunk size per thread
    for a in xrange(cores - 1):
        thread.start_new(extract_bigrams2, (cpnt, cpnt + iterator + 1, a))  # overlap is intentional
        cpnt += iterator
    thread.start_new(extract_bigrams2, (cpnt, len(text), cores - 1))  # last thread takes the remainder
    while 0 in ready:  # busy-wait until every thread reports done
        pass

def extract_bigrams2(startpoint, endpoint, threadnum):
    global ready, bigrams
    ready[threadnum] = 0
    bigrams[threadnum] = zip(*[text[startpoint + i:endpoint] for i in xrange(2)])
    ready[threadnum] = 1

extract_bigrams(CORE_NUMBER)
thebigrams = []
for a in bigrams:
    thebigrams += a

print thebigrams

There are some issues with this program, such as not filtering out whitespace or punctuation, but I wrote it to show what you should be shooting for. You can easily edit it to suit your needs.

This program auto-detects how many cores your computer has and creates that number of threads, attempting to evenly distribute the areas where it looks for bigrams. I've only been able to test this code in an online browser on a school-owned computer, so I can't be certain it works completely. If there are any problems or questions, please leave them in the comments.

Intertwist answered 10/11, 2016 at 19:12 Comment(2)
I appreciate your threaded approach -- things like data filtering and case folding happen before the threading so won't significantly affect the performance gain. However, your solution doesn't actually tally the bigrams -- once the threads complete, the main program still has to merge all the counts, which is a complication that single-threaded solutions don't face. Without a more complete example, it's hard to know whether that extra overhead might negate potential gains (a merge sketch follows these comments).Zoniazoning
Did you try to benchmark this with big.txt vs the native python non-threaded / non-multiprocessing approach?Qktp
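A rough sketch of the chunk-and-merge approach the comments point at, using multiprocessing.Pool instead of the thread module (CPython threads won't run this CPU-bound counting in parallel); count_chunk and parallel_bigram_count are illustrative names, not part of the answer above:

from collections import Counter
from multiprocessing import Pool, cpu_count

def count_chunk(chunk):
    # Each worker counts its own chunk with a local Counter (no shared state).
    return Counter(zip(chunk, chunk[1:]))

def parallel_bigram_count(text, workers=None):
    workers = workers or cpu_count()
    step = len(text) // workers + 1
    # Overlap chunks by one character so bigrams across chunk borders are kept.
    chunks = [text[i:i + step + 1] for i in range(0, len(text), step)]
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, chunks)
    total = Counter()
    for partial in partials:
        total.update(partial)  # one merge per worker, not per bigram
    return total

if __name__ == '__main__':
    print(parallel_bigram_count('I really like cheese').most_common(5))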

My proposal:

Text= "The Project Gutenberg EBook of The Adventures of Sherlock Holmes"
"by Sir Arthur Conan Doyle"

# Counters
Counts= [[0 for x in range(128)] for y in range(128)]

# Perform the counting
R= ord(Text[0])
for i in range(1, len(Text)):
    L= R; R= ord(Text[i])
    Counts[L][R]+= 1

# Output the results
for i in range(ord('A'), ord('{')):
    if i < ord('[') or i >= ord('a'):
        for j in range(ord('A'), ord('{')):
            if (j < ord('[') or j >= ord('a')) and Counts[i][j] > 0:
                print chr(i) + chr(j), Counts[i][j]


Ad 1
Bo 1
EB 1
Gu 1
Ho 1
Pr 1
Sh 1
Th 2
be 1
ck 1
ct 1
dv 1
ec 1
en 2
er 2
es 2
he 3
je 1
lm 1
lo 1
me 1
nb 1
nt 1
oc 1
of 2
oj 1
ok 1
ol 1
oo 1
re 1
rg 1
rl 1
ro 1
te 1
tu 1
ur 1
ut 1
ve 1

This version is case sensitive; probably better to first lowercase the whole text.

Biblioclast answered 2/11, 2016 at 13:53 Comment(7)
That's assuming that the no. of characters is fixed, no?Qktp
@alvas: what number of characters?Biblioclast
I mean the fixed array list of Counts= [[0 for x in range(128)] for y in range(128)].Qktp
@alvas: yep, an array is faster than a dictionary.Biblioclast
I don't understand your claim. On my system, your array-based solution is slower than the OP's dictionary-based solution and slower than a dictionary-based solution that includes all the same data filtering as yours plus case folding. What am I missing?Zoniazoning
@cdlane: I didn't claim anything. On what text length did you benchmark?Biblioclast
The claim I question is, "an array is faster than a dictionary" in your comment previous to mine. The text I tested with was the norvig.com/big.txt file to which the OP provided a link in his question (a timing sketch follows these comments).Zoniazoning
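For anyone who wants to reproduce that comparison, a timing sketch along these lines should do (count_array and count_counter are illustrative names; the array version adds a bounds check, which the answer above does not have, so that any non-ASCII characters in big.txt don't crash it; results will vary by machine and Python version):

import io
import timeit
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower()

def count_counter():
    # Dictionary-based counting, as in the question.
    return Counter(zip(text, text[1:]))

def count_array():
    # Array-based counting, as in the answer above.
    counts = [[0] * 128 for _ in range(128)]
    right = ord(text[0])
    for ch in text[1:]:
        left, right = right, ord(ch)
        if left < 128 and right < 128:
            counts[left][right] += 1
    return counts

print(timeit.timeit(count_counter, number=3))
print(timeit.timeit(count_array, number=3))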
