Python Shelve Module Memory Consumption
I have been assigned the task of reading a .txt file which is a log of various events and writing some of those events into a dictionary.

The problem is that the file can sometimes grow beyond 3GB, which means the dictionary gets too big to fit into main memory. It seems that shelve is a good way to solve this problem. However, since I will be constantly modifying the dictionary, I must have the writeback option enabled. This is where I am concerned: the documentation says that writeback slows down reads and writes and uses more memory, but I am unable to find statistics on how much the speed and memory are affected.
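
For reference, this is roughly the access pattern I mean (a sketch; the filename and keys are made up):

import shelve

events = shelve.open("events.db", writeback=True)

# Build a nested structure; mutating the inner dict in place only
# persists because writeback=True keeps the outer entry cached in memory.
if "host1" not in events:
    events["host1"] = {}
events["host1"]["2011-05-24"] = "login"

events.sync()   # writes the (potentially large) cache back to disk
events.close()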

Can anyone clarify by how much the read/write speed and memory are affected so that I can decide whether to use the writeback option or sacrifice some readability for code efficiency?

Thank you

Swanskin answered 24/5, 2011 at 18:30 Comment(7)
It depends what you're doing with the dictionary: if you only need to modify it by replacing values (shelf['key'] = newvalue), you don't need writeback. If you're modifying mutable types in it (shelf['key'].append(x)), you need writeback. Of course, you can leave writeback off and always remember to modify and replace values in your shelf, if you prefer.Yogi
I only need to add key-value pairs. But since I'm working with nested dictionaries, I will be adding key-value pairs to the inner dicts as well.Swanskin
Can you write it so that you always grab a value from the shelf, add to that at whatever level, and then put it back on the shelf? (See the sketch after these comments.)Yogi
Would that be the way shown in the tutorial that uses the temp variable?Swanskin
That's what I'm trying to avoid doing, though. I can do it if I have to, but I'd like to avoid it as much as possible - hence the whole writeback issue. Thoughts?Swanskin
Take a look at the source code: svn.python.org/view/python/branches/release27-maint/Lib/… In short, it will store everything you touch in memory until you call .sync(), at which point it's rewritten to disk and freed. So the hit depends on what sort of pattern you're accessing the file in.Yogi
I know that this doesn't answer your question, but for data on that scale it may be worth looking into another embedded document store. Maybe unqlite?Animalism
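
A minimal sketch of the grab/modify/put-back pattern discussed in the comments above, which works with writeback left off (filename and keys are illustrative):

import shelve

events = shelve.open("events.db")      # writeback defaults to False
inner = events.get("host1", {})        # grab the value off the shelf
inner["2011-05-24"] = "login"          # modify the temporary copy
events["host1"] = inner                # put it back so it persists
events.close()

The explicit reassignment is what triggers the write; without it, the change to inner would be lost when the shelf is closed.
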
For databases this size, shelve really is the wrong tool. If you do not need a highly available client/server architecture, and you just want to convert your TXT file into a local database that you can access as if it were in memory, you really should be using ZODB.

If you need something highly available, you will of course need to switch to a formal "NoSQL" database, of which there are many to choose from.

Here's a simple example of how to convert your shelve database to a ZODB database, which should solve your memory usage and performance problems.

#!/usr/bin/env python
import shelve
import ZODB, ZODB.FileStorage
import transaction
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-i", "--input", dest="in_file", help="original shelve database filename")
parser.add_option("-o", "--output", dest="out_file", help="new ZODB database filename")
options, args = parser.parse_args()

if not options.in_file or not options.out_file:
    parser.error("need input and output database filenames")

# Open the shelve database read-only; writeback is not needed for a copy.
db = shelve.open(options.in_file, flag="r")

zstorage = ZODB.FileStorage.FileStorage(options.out_file)
zdb = ZODB.DB(zstorage)
zconnection = zdb.open()
newdb = zconnection.root()

for key, value in db.items():
    print("Copying key: " + str(key))
    newdb[key] = value

# One commit at the end; for very large inputs, committing every few
# thousand keys instead keeps the transaction's memory footprint bounded.
transaction.commit()

zconnection.close()
zdb.close()
db.close()
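
To verify the copy, the new database can be read back like this (a sketch; "events.fs" stands in for whatever was passed to --output):

import ZODB, ZODB.FileStorage

zdb = ZODB.DB(ZODB.FileStorage.FileStorage("events.fs"))
conn = zdb.open()
root = conn.root()

print(len(root), "keys copied")
for key in list(root)[:5]:    # peek at a few entries
    print(key, "->", root[key])

conn.close()
zdb.close()

Note that ZODB loads data lazily per persistent object; since this script stores plain values under a single root mapping, very large datasets are better kept in a BTrees.OOBTree inside the root so that entries can be loaded and evicted individually.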
Chemesh answered 30/5, 2015 at 3:3 Comment(0)
