Using cPickle to serialize a large dictionary causes MemoryError

I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.

The data model looks something like: {word : { doc_name : [location_list] } }
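
For example (with made-up document names):

{ 'python' : { 'doc1.txt' : [4, 17], 'doc2.txt' : [3] },
  'search' : { 'doc2.txt' : [9, 12] } }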

Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:

# Write the index out to disk
import sys
import cPickle
serializedIndex = open(sys.argv[3], 'wb')
cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)
serializedIndex.close()

Right before serialization, my program is using about 50% of memory (1.6 GB). As soon as I make the call to cPickle, memory usage skyrockets to 80% and then the process crashes.

Why does cPickle use so much memory during serialization? Is there a better way to approach this problem?

Sipe answered 18/2, 2011 at 3:52 Comment(0)

cPickle needs a bunch of extra memory because it does cycle detection (it keeps a memo of every object it has already written, so shared or self-referential objects are serialized only once). You could try the marshal module if you are sure your data has no cycles.
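
For reference, a minimal sketch of the swap, reusing the index variable and output path from the question:

import sys
import marshal

# marshal skips pickle's memo/cycle tracking, so it needs far less extra
# memory, but it only handles built-in types (dicts, lists, strings, ints, ...)
# and its on-disk format is not guaranteed stable across Python versions.
with open(sys.argv[3], 'wb') as serializedIndex:
    marshal.dump(index, serializedIndex)

# Loading it back later:
with open(sys.argv[3], 'rb') as f:
    index = marshal.load(f)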

Rape answered 18/2, 2011 at 4:38 Comment(4)
Worked like a charm. Incredibly simple fix -- basically just changed "pickle" to "marshal" and was done. I didn't realize cPickle performed cycle detection. With marshal, writing to disk took a matter of seconds instead of 20 minutes, and memory consumption dropped from 30% (and crashing) to almost 0%. Thanks! – Sipe
Simple solution plus a concise explanation, 100% awesome. – Vlaminck
@John how can we know the data has no cycles? – Piroshki
@JoãoAlmeida: more often than not, objects don't contain references to themselves (including nested references); you should know if yours do. One simple example that would contain cycles is a doubly linked list. – Rape

There's the pure-Python pickle module you could try instead of cPickle. Also, there might be some cPickle settings you could change.

Other options: break your dictionary into smaller pieces and cPickle each piece separately, then recombine them when you load everything back in.
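
A rough sketch of that chunking idea, assuming the top-level keywords can be grouped arbitrarily (the chunk size and file-naming scheme here are made up for illustration):

import cPickle
import itertools

def dump_in_chunks(index, base_path, chunk_size=10000):
    # Pickle the index a slice of keywords at a time so the pickler's
    # working set stays small.
    keys_iter = iter(index)
    for n in itertools.count():
        keys = list(itertools.islice(keys_iter, chunk_size))
        if not keys:
            break
        chunk = dict((k, index[k]) for k in keys)
        with open("%s.%d" % (base_path, n), 'wb') as f:
            cPickle.dump(chunk, f, cPickle.HIGHEST_PROTOCOL)

def load_chunks(paths):
    # Merge the chunk files back into a single dictionary.
    index = {}
    for path in paths:
        with open(path, 'rb') as f:
            index.update(cPickle.load(f))
    return index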

Sorry this is vague; I'm just writing off the top of my head. I figured it might still be helpful since no one else has answered.

Aptitude answered 18/2, 2011 at 4:37 Comment(0)

You may well be using the wrong tool for this job. If you want to persist a huge amount of indexed data, I'd strongly suggest using an SQLite on-disk database (or, of course, just a normal database) with an ORM like SQLObject or SQLAlchemy.

These will take care of the mundane things like compatibility, optimising the format for the purpose, and not holding all the data in memory simultaneously, so that you don't run out of memory...

Added: Because I was working on a near-identical thing anyway, but mainly because I'm such a nice person, here's a demo that appears to do what you need (it'll create an SQLite file in your current directory, and delete it first if a file with that name already exists, so put it somewhere empty):

import sqlobject
from sqlobject import SQLObject, UnicodeCol, ForeignKey, IntCol
import os

DB_NAME = "mydb"
ENCODING = "utf8"

class Document(SQLObject):
    dbName = UnicodeCol(dbEncoding=ENCODING)

class Location(SQLObject):
    """ Location of each individual occurrence of a word within a document.
    """
    dbWord = UnicodeCol(dbEncoding=ENCODING)
    dbDocument = ForeignKey('Document')
    dbLocation = IntCol()

TEST_DATA = {
    'one' : {
        'doc1' : [1,2,10],
        'doc3' : [6],
    },

    'two' : {
        'doc1' : [2, 13],
        'doc2' : [5,6,7],
    },

    'three' : {
        'doc3' : [1],
    },
}        

if __name__ == "__main__":
    db_filename = os.path.abspath(DB_NAME)
    if os.path.exists(db_filename):
        os.unlink(db_filename)
    connection = sqlobject.connectionForURI("sqlite:%s" % (db_filename))
    sqlobject.sqlhub.processConnection = connection

    # Create the tables
    Document.createTable()
    Location.createTable()

    # Import the dict data:
    for word, locs in TEST_DATA.items():
        for doc, indices in locs.items():
            sql_doc = Document(dbName=doc)
            for index in indices:
                Location(dbWord=word, dbDocument=sql_doc, dbLocation=index)

    # Let's check out the data... where can we find 'two'?
    locs_for_two = Location.selectBy(dbWord = 'two')

    # Or...
    # locs_for_two = Location.select(Location.q.dbWord == 'two')

    print "Word 'two' found at..."
    for loc in locs_for_two:
        print "Found: %s, p%s" % (loc.dbDocument.dbName, loc.dbLocation)

    # What documents have 'one' in them?
    docs_with_one = Location.selectBy(dbWord = 'one').throughTo.dbDocument

    print
    print "Word 'one' found in documents..."
    for doc in docs_with_one:
        print "Found: %s" % doc.dbName

This is certainly not the only way (or necessarily the best way) to do this. Whether the Document or Word tables should be separate tables from the Location table depends on your data and typical usage. In your case, the "Word" table could probably be a separate table with some added settings for indexing and uniqueness.
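
For instance, a hypothetical variant along those lines (untested, just to show the shape), where each distinct word gets its own row with a uniqueness constraint:

class Word(SQLObject):
    # Hypothetical: each distinct word stored once; unique=True puts a
    # uniqueness constraint on the column.
    dbWord = UnicodeCol(dbEncoding=ENCODING, unique=True)

class Location(SQLObject):
    dbWord = ForeignKey('Word')
    dbDocument = ForeignKey('Document')
    dbLocation = IntCol()

You'd then create the Word table alongside the others and look each word up (or create it) before inserting its Location rows.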

Violone answered 18/2, 2011 at 5:54 Comment(2)
Thanks for your suggestion. For now, I'm going to use marshal instead of pickle, but I may revisit this and migrate to a db-based solution in the future. Cheers! – Sipe
@Stephen Poletto – that's cool; if marshal works, it works, and this can remain here for posterity :) – Violone
