Your goal is to make your process manageable within memory constraints. To do this with the ZODB as a tool, you need to understand how ZODB transactions work and how to use them.
Why your ZODB grows so large
First of all you need to understand what a transaction commit does here, which also explains why your Data.fs is getting so large.
The ZODB writes data out per transaction, where any persistent object that has changed gets written to disk. The important detail here is persistent object that has changed; the ZODB works in units of persistent objects.
Not every Python value is a persistent object. If I define a straight-up Python class, it will not be persistent, nor are any of the built-in Python types such as int or list. On the other hand, any class you define that inherits from persistence.Persistent is a persistent object. The BTrees set of classes, as well as the PersistentList class you use in your code, do inherit from Persistent.
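To make the distinction concrete, here is a minimal sketch, assuming the usual persistent package import path; the class names are made up for illustration:

from persistent import Persistent

class PlainNode(object):        # a regular class: the ZODB does not track changes to it
    pass

class TrackedNode(Persistent):  # a persistent class: changed instances are written
    def __init__(self, score):  # to disk as part of a transaction commit
        self.score = score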
Now, on a transaction commit, any persistent object that has changed is written to disk as part of that transaction. So any PersistentList object that has been appended to will be written in its entirety to disk. BTrees handle this a little more efficiently; they store Buckets, themselves persistent, which in turn hold the actually stored objects. So for every few new nodes you create, a Bucket is written to the transaction, not the whole BTree structure. Note that because the items held in the tree are themselves persistent objects, only references to them are stored in the Bucket records.
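To sketch the difference (the object names below are illustrative, not from your code): appending to a PersistentList marks the entire list object as changed, while inserting into a BTree only marks the affected Bucket.

from persistent.list import PersistentList
from BTrees.OOBTree import OOBTree

plist = PersistentList()
plist.append('value')      # the whole list object is rewritten at the next commit

tree = OOBTree()
tree['key'] = 'value'      # only the Bucket holding 'key' is rewritten at the next commit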
Now, the ZODB writes transaction data by appending it to the Data.fs file, and it does not remove old data automatically. It can construct the current state of the database by finding the most recent revision of a given object in the store. This is why your Data.fs is growing so much: you are writing out new revisions of larger and larger PersistentList instances as transactions are committed.
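For reference, a minimal sketch of the kind of setup this assumes, with a FileStorage-backed Data.fs; your actual setup may differ:

from ZODB import FileStorage, DB
from BTrees.OOBTree import OOBTree
import transaction

storage = FileStorage.FileStorage('Data.fs')   # append-only file on disk
db = DB(storage)
connection = db.open()
root = connection.root()

if 'container' not in root:                    # a container like the one in your code
    root['container'] = OOBTree()
transaction.commit()                           # each commit appends new object revisions to Data.fs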
Removing the old data is called packing, which is similar to the VACUUM command in PostgreSQL and other relational databases. Simply call the .pack() method on the db variable to remove all old revisions, or use the t and days parameters of that method to set limits on how much history to retain: t is a time.time() timestamp (seconds since the epoch) before which revisions may be packed away, and days is the number of days of history to retain before the current time, or before t if it is specified. Packing should reduce your data file considerably, as the partial lists in older transactions are removed. Do note that packing is an expensive operation and can thus take a while, depending on the size of your dataset.
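For example, a sketch assuming db is the ZODB.DB instance opened as above:

import time

db.pack()                        # discard all old object revisions
db.pack(days=7)                  # keep the last 7 days of history
db.pack(t=time.time() - 3600)    # keep only revisions from the last hour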
Using transaction to manage memory
You are trying to build a very large dataset, using persistence to work around memory constraints, and are using transactions to try to flush things to disk. Normally, however, committing a transaction signals that you have completed constructing your dataset, something you can use as one atomic whole.
What you need to use here is a savepoint. Savepoints are essentially subtransactions: a point during the whole transaction where you can ask for data to be temporarily stored on disk. They'll be made permanent when you commit the transaction. To create a savepoint, call the .savepoint() method on the transaction:
import transaction
from persistent.list import PersistentList

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)   # placeholder for your scoring operation
        btree_container.setdefault(Gnodes, PersistentList()).append(
            [Hnodes, score, -1])
    # flush the changes made so far to disk; they become permanent on commit
    transaction.savepoint(True)
transaction.commit()
In the above example I set the optimistic flag to True, meaning: I do not intend to roll back to this savepoint. Some storages do not support rolling back, and signalling that you do not need this makes your code work in such situations.
Also note that the transaction.commit() happens when the whole data set has been processed, which is what a commit is supposed to achieve.
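If you do find you need the ability to roll back, create a non-optimistic savepoint and keep the returned object around; a sketch, where risky_update is a hypothetical operation of your own:

sp = transaction.savepoint()         # optimistic defaults to False
try:
    risky_update(btree_container)    # hypothetical operation that might fail
except Exception:
    sp.rollback()                    # undo everything changed since the savepoint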
One thing a savepoint does is trigger a garbage collection of the ZODB caches, which means that any data not currently in use is removed from memory.
Note the 'not currently in use' part there; if any of your code holds on to large values in a variable, that data cannot be cleared from memory. As far as I can determine from the code you've shown us, this looks fine. But I do not know how your operations work or how you generate the nodes; be careful to avoid building complete lists in memory where an iterator will do, or building large dictionaries that reference all your lists of lists, for example.
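For example, if the node generation is under your control, prefer a generator over materialising a full list; a sketch with made-up function names:

def generate_nodes(source):
    for raw in source:
        yield make_node(raw)   # one node at a time; nothing accumulates in memory

# versus building everything up front:
# nodes = [make_node(raw) for raw in source]   # holds every node in memory at once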
You can experiment a little with where you create your savepoints; you could create one every time you've processed one Hnodes, or only when done with a Gnodes loop, as I've done above. You are constructing a list per Gnodes, so it would be kept in memory while looping over all of H.nodes() anyway, and flushing to disk would probably only make sense once you've completed constructing it in full.
If, however, you find that you need to clear memory more often, you should consider using either a BTrees.OOBTree.TreeSet class or a BTrees.IOBTree.BTree class instead of a PersistentList to break up your data into more persistent objects. A TreeSet is ordered but not (easily) indexable, while a BTree could be used as a list by using simple incrementing index keys:
for i, Hnodes in enumerate(H.nodes()):
    ...
    btree_container.setdefault(Gnodes, IOBTree())[i] = [Hnodes, score, -1]
    if i % 100 == 0:
        transaction.savepoint(True)
The above code uses an IOBTree instead of a PersistentList and creates a savepoint every 100 Hnodes processed. Because the BTree uses buckets, which are persistent objects in themselves, the whole structure can be flushed to a savepoint more easily, without having to stay in memory for all of H.nodes() to be processed.
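Reading the data back in order is then straightforward, because an IOBTree iterates over its integer keys in sorted order; a sketch, where process stands in for whatever you do with each entry:

entries = btree_container[Gnodes]    # the IOBTree built above
for i, entry in entries.items():     # (index, [Hnodes, score, -1]) pairs in key order
    process(entry)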