Does shelve write to disk on every change?
I wish to use shelve in an asyncio program and I fear that every change will cause the main event loop to stall.

While I don't mind the occasional slowdown of the pickling operation, the disk writes may be substantial.

How often does shelve sync to disk? Is it a blocking operation? Do I have to call .sync()?

If I schedule the sync() to run under a different thread, a different asyncio task may modify the shelve at the same time, which violates the requirement of single-thread writes.

Dejecta answered 8/8 at 21:27 Comment(0)

shelve is, by default, backed by the dbm module, which is in turn backed by whatever dbm implementation is available on the system. Neither the shelve module nor the dbm module makes any effort to minimize writes; assigning a value to a key causes a write every time. Even with writeback=True, new assignments are still written immediately to the backing dbm; the write happens to make sure the value is stored, and a cache entry is made on top of that because the assigned object might be mutated after assignment and needs to be handled just like a freshly read object (meaning it will be written again on sync or close, in case it changed).
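The immediate-write behavior is easy to observe: an assignment persists without any explicit sync() or close() being needed to trigger the write. A minimal sketch (paths here are illustrative):

```python
import os
import shelve
import tempfile

# Illustrative path; any writable location works.
path = os.path.join(tempfile.mkdtemp(), "demo_shelf")

with shelve.open(path) as shelf:
    shelf["a"] = [1, 2, 3]  # pickled and handed to the dbm backend right away

# Reopening shows the value persisted; shelve did no user-level batching.
with shelve.open(path) as shelf:
    print(shelf["a"])  # [1, 2, 3]
```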

While it's possible some implementations of the underlying dbm libraries include caching, AFAICT most try to write immediately (that is, they push data to the kernel right away, without user-mode buffering); they just don't necessarily force immediate synchronization to disk (though that can be requested, e.g. with gdbm_sync).

writeback=True makes things worse, because when it does sync, it's a major effort (it literally rewrites every object read or written since the last sync, because it has no way of knowing which of them might have been modified), as opposed to the small per-assignment effort of rewriting a single key/value pair at a time.

In short, if you're really concerned about blocking writes, you can't use unthreaded async code without some potential blocking, but that blocking is likely short-lived as long as writeback=True is not involved (or as long as you don't sync/close until performance considerations no longer matter). If you need truly non-blocking async behavior, all shelve interactions will need to occur under a lock in worker threads, and either writeback must be False (to avoid race conditions while pickling data) or, if writeback is True, you must take care not to modify any object that might be in the cache during the sync/close.

Westbrook answered 8/8 at 21:55 Comment(2)
As a side-note: The shelve module is not remotely thread-safe (the mechanism behind sync is to set writeback to False, redo all the assignments, then set it back to True, which means anything writing after the sync begins will be written but not cached, so it won't benefit from writeback protections), so the locking will be important in any threaded scenario.Westbrook
The user-mode buffering was precisely my question, and I wondered if any dbm library caches the operations. Thanks for the detailed answer. Looks like I'm going to have to create a MRSW lock, might as well publish it open source.Dejecta

It writes to disk every time you update the shelve object itself. So if you do

shelf[key] = something

or

shelf.update(somedict)

it will write to the file.

However, if there are mutable values in the dictionary, modifying them will not trigger a write to the file. Objects in Python don't have any reference back to the containers that reference them, so there's no way for the shelve object to detect those changes and write the file. If you need to support mutable values in the dictionary, you should use the writeback=True option when creating the shelve, to create an in-memory cache; the file will then be updated whenever you sync() or close().
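The difference between the two modes can be shown in a few lines (the path is illustrative):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wb_shelf")  # illustrative path

# Without writeback, mutating a fetched value is lost:
with shelve.open(path) as shelf:
    shelf["nums"] = [1]          # this assignment IS written
    shelf["nums"].append(2)      # mutates a temporary unpickled copy only
with shelve.open(path) as shelf:
    print(shelf["nums"])         # [1] -- the append was not persisted

# With writeback=True, the cached object is mutated and close() writes it back:
with shelve.open(path, writeback=True) as shelf:
    shelf["nums"].append(2)      # mutation happens on the cached object
with shelve.open(path) as shelf:
    print(shelf["nums"])         # [1, 2]
```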

Bridgman answered 8/8 at 21:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.