Given the big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (imagine that I have to repeat this counting 100,000 times).
According to Fast/Optimize N-gram implementations in python, extracting bigrams like this should be the most efficient:
_bigrams = zip(*[text[i:] for i in range(2)])
And since I'm using Python 3, zip returns a lazy iterator that won't be evaluated until I materialize it with list(_bigrams) or some other function that consumes it.
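For example, with a toy string (just to illustrate the laziness, not the real data):

text = 'abcd'
_bigrams = zip(*[text[i:] for i in range(2)])
print(_bigrams)        # <zip object at 0x...>: nothing computed yet
print(list(_bigrams))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]: evaluated now

Here's the full loop: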
import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    # Lowercase and swap spaces for a private-use character so they
    # behave like ordinary symbols in the bigrams.
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(_bigrams).most_common(100)
    # Do some manipulation to text and repeat the counting.
    text = manipulate(text, top100)
But that takes around 1+ seconds per iteration, and 100,000 iterations would take far too long.
I've also tried sklearn's CountVectorizer, but the time to extract, count, and get the top 100 bigrams is comparable to the native Python version.
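Roughly like this (a minimal sketch, assuming character-level analysis; the exact parameters I used may have differed, and get_feature_names_out needs a recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform([text])   # 1 x vocabulary sparse matrix
freqs = counts.sum(axis=0).A1               # total count per bigram
vocab = vectorizer.get_feature_names_out()
top100 = sorted(zip(vocab, freqs), key=lambda x: -x[1])[:100]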
Then I experimented with multiprocessing, using a slight modification of Python multiprocessing and a shared counter and http://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing:
from multiprocessing import Process, Manager, Lock

class MultiProcCounter(object):
    def __init__(self):
        # A dict proxy shared across processes, guarded by a lock.
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    # One process per input item.
    procs = [Process(target=func, args=(counter, _in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 2, 2, 3, 1, 2]
print(multiproc_count(inputs))
But using the MultiProcCounter in the bigram counting takes even longer than 1+ seconds per iteration. I have no idea why that is the case; with the dummy list-of-ints example, multiproc_count works perfectly.
I've tried:
import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(multiproc_count(_bigrams)).most_common(100)
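For comparison, a minimal sketch of the more conventional chunked pattern (one local Counter per worker, merged at the end, so there is no shared dict or lock; nprocs and the one-character chunk overlap are my assumptions, and on some platforms this needs an if __name__ == '__main__': guard):

from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # Count bigrams locally; no shared state, no locking.
    return Counter(zip(chunk, chunk[1:]))

def pool_count(text, nprocs=4):
    step = len(text) // nprocs + 1
    # Overlap chunks by one character so bigrams spanning a boundary are kept.
    chunks = [text[i:i + step + 1] for i in range(0, len(text), step)]
    with Pool(nprocs) as pool:
        total = Counter()
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total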
Is there any way to count the bigrams really fast in Python?
If a manipulation changes, say, o → a in dog, you can decrement do and og and increment da and ag. If most of the text does not change, this should be faster than repeating the calculation. – Merriott
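A minimal sketch of Merriott's incremental idea (update_bigrams and the single-character-substitution interface are hypothetical; the real manipulate would need to report every changed position):

from collections import Counter

def update_bigrams(counts, text, pos, new_char):
    # Replace text[pos] with new_char and adjust the bigram counts in place.
    old_char = text[pos]
    if pos > 0:  # the bigram ending at pos
        counts[(text[pos - 1], old_char)] -= 1
        counts[(text[pos - 1], new_char)] += 1
    if pos < len(text) - 1:  # the bigram starting at pos
        counts[(old_char, text[pos + 1])] -= 1
        counts[(new_char, text[pos + 1])] += 1
    return text[:pos] + new_char + text[pos + 1:]

text = 'dog'
counts = Counter(zip(text, text[1:]))        # full count, done once
text = update_bigrams(counts, text, 1, 'a')  # o -> a: dog becomes dag

With counts built once up front, each single-character edit then costs O(1) instead of a full recount over the whole text.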