I wonder why whoosh is so slow with the code below. The commit in particular takes quite a long time.
I tried limitmb=2048 for the writer instead of the default 128, but it makes almost no difference. As suggested, I also tried procs=3 for the writer, which makes the indexing a little faster but the commit even slower. commit(merge=False) doesn't help here either, since the index is empty.
I get results like this:
index_documents 12.41 seconds
commit 22.79 seconds
run 35.34 seconds
That seems like a lot for such a small schema and roughly 45000 documents.
I tested with whoosh 2.5.7 and Python 2.7.
Is that normal and am I just expecting too much, or am I doing something wrong?
I also did a little profiling, and it looks like whoosh writes out and then reads back in a lot of pickles. That seems to be related to how the transactions are handled.
from contextlib import contextmanager
from whoosh import fields
from whoosh.analysis import NgramWordAnalyzer
from whoosh.index import create_in
import functools
import itertools
import tempfile
import shutil
import time


def timecall(f):
    @functools.wraps(f)
    def wrapper(*args, **kw):
        start = time.time()
        result = f(*args, **kw)
        end = time.time()
        print "%s %.2f seconds" % (f.__name__, end - start)
        return result
    return wrapper


def schema():
    return fields.Schema(
        path=fields.ID(stored=True, unique=True),
        text=fields.TEXT(analyzer=NgramWordAnalyzer(2, 4), stored=False, phrase=False))


@contextmanager
def create_index():
    directory = tempfile.mkdtemp()
    try:
        yield create_in(directory, schema())
    finally:
        shutil.rmtree(directory)


def iter_documents():
    for root in ('egg', 'ham', 'spam'):
        for i in range(1000, 16000):
            yield {
                u"path": u"/%s/%s" % (root, i),
                u"text": u"%s %s" % (root, i)}


@timecall
def index_documents(writer):
    start = time.time()
    counter = itertools.count()
    for doc in iter_documents():
        count = counter.next()
        current = time.time()
        if (current - start) > 1:
            print count
            start = current
        writer.add_document(**doc)


@timecall
def commit(writer):
    writer.commit()


@timecall
def run():
    with create_index() as ix:
        writer = ix.writer()
        index_documents(writer)
        commit(writer)


if __name__ == '__main__':
    run()
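
For anyone who wants to reproduce the profiling mentioned above, here is a minimal sketch of the kind of thing I mean, using only cProfile and pstats from the standard library. The names profile_call and pickle_roundtrip are hypothetical helpers made up for this illustration, and the workload is a stand-in that just pickles small dicts; it is not the whoosh code path. To profile the real thing, one would pass run (or commit with a writer) to profile_call instead.

```python
from __future__ import print_function

import cProfile
import pickle
import pstats


def pickle_roundtrip(n):
    """Stand-in workload (hypothetical): pickle and unpickle n small dicts."""
    total = 0
    for i in range(n):
        data = pickle.dumps({"path": "/egg/%d" % i})
        # Each dict has exactly one key, so this adds 1 per iteration.
        total += len(pickle.loads(data))
    return total


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, stats).

    Hypothetical helper: the caller decides how to sort and print the stats.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stats = pstats.Stats(profiler)
    return result, stats


if __name__ == "__main__":
    result, stats = profile_call(pickle_roundtrip, 1000)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
```

Sorting by cumulative time is what makes the pickle dump/load calls stand out, since it attributes time to the callers higher up the stack as well.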