You may well be using the wrong tool for this job. If you want to persist a huge amount of indexed data, I'd strongly suggest using an SQLite on-disk database (or, of course, just a normal database) with an ORM like SQLObject or SQL Alchemy.
These will take care of the mundane things like compatibility, optimising format for purpose, and not holding all the data in memory simultaneously so that you run out of memory...
Added: Because I was working on a near identical thing anyway, but mainly because I'm such a nice person, here's a demo that appears to do what you need (it'll create an SQLite file in your current dir, and delete it if a file with that name already exists, so put it somewhere empty first):
import sqlobject
from sqlobject import SQLObject, UnicodeCol, ForeignKey, IntCol, SQLMultipleJoin
import os
DB_NAME = "mydb"
ENCODING = "utf8"
class Document(SQLObject):
dbName = UnicodeCol(dbEncoding=ENCODING)
class Location(SQLObject):
""" Location of each individual occurrence of a word within a document.
"""
dbWord = UnicodeCol(dbEncoding=ENCODING)
dbDocument = ForeignKey('Document')
dbLocation = IntCol()
TEST_DATA = {
'one' : {
'doc1' : [1,2,10],
'doc3' : [6],
},
'two' : {
'doc1' : [2, 13],
'doc2' : [5,6,7],
},
'three' : {
'doc3' : [1],
},
}
if __name__ == "__main__":
db_filename = os.path.abspath(DB_NAME)
if os.path.exists(db_filename):
os.unlink(db_filename)
connection = sqlobject.connectionForURI("sqlite:%s" % (db_filename))
sqlobject.sqlhub.processConnection = connection
# Create the tables
Document.createTable()
Location.createTable()
# Import the dict data:
for word, locs in TEST_DATA.items():
for doc, indices in locs.items():
sql_doc = Document(dbName=doc)
for index in indices:
Location(dbWord=word, dbDocument=sql_doc, dbLocation=index)
# Let's check out the data... where can we find 'two'?
locs_for_two = Location.selectBy(dbWord = 'two')
# Or...
# locs_for_two = Location.select(Location.q.dbWord == 'two')
print "Word 'two' found at..."
for loc in locs_for_two:
print "Found: %s, p%s" % (loc.dbDocument.dbName, loc.dbLocation)
# What documents have 'one' in them?
docs_with_one = Location.selectBy(dbWord = 'one').throughTo.dbDocument
print
print "Word 'one' found in documents..."
for doc in docs_with_one:
print "Found: %s" % doc.dbName
This is certainly not the only way (or necessarily the best way) to do this. Whether the Document or Word tables should be separate tables from the Location table depends on your data and typical usage. In your case, the "Word" table could probably be a separate table with some added settings for indexing and uniqueness.