Persistent multiprocess shared cache in Python with stdlib or minimal dependencies

I just tried using Python's shelve module as a persistent cache for data fetched from an external service. The complete example is included below.

I was wondering what would be the best approach if I want to make this multiprocess-safe. I am aware of Redis, memcached and similar "real solutions", but I'd like to use only parts of the Python standard library or very minimal dependencies, to keep my code compact and avoid introducing unnecessary complexity when running the code in a single-process, single-thread model.

It's easy to come up with a single-process solution, but this does not work well with current Python web run-times. Specifically, the problem is that in an Apache + mod_wsgi environment:

  • Only one process at a time is updating the cached data (file locks, somehow?)

  • Other processes use the cached data while the update is under way

  • If a process fails to update the cached data, there is a penalty of N minutes before another process can try again (to prevent thundering herd and such) - how to signal this between mod_wsgi processes?

  • You do not utilize any "heavy tools" for this, only Python standard libraries and UNIX

Also, if some PyPI package does this without external dependencies, please let me know of it. Alternative approaches and recommendations, like "just use sqlite", are welcome.

Example:

import datetime
import os
import shelve
import logging


logger = logging.getLogger(__name__)


class Converter:

    def __init__(self, fpath, api_url=None, refresh_delay=datetime.timedelta(minutes=15)):
        # Endpoint of the external service and how long cached data stays
        # fresh; the defaults here are illustrative only
        self.api_url = api_url
        self.refresh_delay = refresh_delay
        self.last_updated = None

        if os.path.exists(fpath):
            # Treat the cache file's mtime as the time of the last successful refresh
            self.last_updated = datetime.datetime.fromtimestamp(os.path.getmtime(fpath))

        # shelve.open() creates the backing file if it does not exist yet
        self.data = shelve.open(fpath)

    def convert(self, source, target, amount, update=True, determiner="24h_avg"):
        # Do something with cached data
        pass

    def is_up_to_date(self):
        if not self.last_updated:
            return False

        return datetime.datetime.now() < self.last_updated + self.refresh_delay

    def update(self):
        try:
            # Update data from the external server
            self.last_updated = datetime.datetime.now()
            self.data.sync()
        except Exception as e:
            logger.error("Could not refresh market data: %s %s", self.api_url, e)
            logger.exception(e)
Ectoblast answered 6/12, 2013 at 9:18 Comment(2)
Does your cached data need to be able to swap and/or persist to disk, or is it safe to assume that your cache will fit into available memory?Sheply
It's just a small amount of data, but it needs to be there after a server cold start to avoid having no data available on startup.Ectoblast

I'd say you'd want to use an existing caching library; dogpile.cache comes to mind. It already has many features, and you can easily plug in the backends you might need.

The dogpile.cache documentation says the following:

This “get-or-create” pattern is the entire key to the “Dogpile” system, which coordinates a single value creation operation among many concurrent get operations for a particular key, eliminating the issue of an expired value being redundantly re-generated by many workers simultaneously.
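For illustration, here is a minimal sketch of that get-or-create pattern using dogpile.cache's file-backed DBM backend, which coordinates regeneration across processes with a file lock. The cache path, expiration time and fetch_from_external_service function are placeholders of mine, not from the question:

from dogpile.cache import make_region

# File-backed cache shared by all mod_wsgi processes on the same host
region = make_region().configure(
    "dogpile.cache.dbm",
    expiration_time=300,  # seconds before a value is considered stale
    arguments={"filename": "/tmp/app_cache.dbm"},
)

@region.cache_on_arguments()
def load_rates(source, target):
    # Only one worker regenerates an expired value; concurrent callers
    # keep getting the old value until the new one is ready.
    return fetch_from_external_service(source, target)  # placeholder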

Bombycid answered 6/12, 2013 at 15:46 Comment(4)
+1 for dogpile, it's pretty good and can deal with the thundering herd. It is a 3rd-party package and probably requires some dependencies in real-life use, so it's a bit out of tune with the OP's constraints.Insistent
I am currently researching the very same task, and came across dogpile as well. While I still do not fully understand the internals of dogpile, there are multiple occasions of threads being used in the docs. According to [deadlock](#24510150) and others, including my own experience, multiprocessing + multithreading + logging can lead to deadlock. This can be avoided by first spawning the processes and only then the threads.Newsboy
@ZoltanK. I refer you to my other answer https://mcmap.net/q/700447/-safe-to-call-multiprocessing-from-a-thread-in-python :)Protectionist
Also, dogpile.cache itself is thread-free; it just uses threading.LockProtectionist

Let's consider your requirements systematically:

minimum or no external dependencies

Your use case will determine whether you can use in-band synchronisation (a file descriptor or memory region inherited across fork) or out-of-band synchronisation (POSIX file locks, System V shared memory).
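As a rough sketch of the out-of-band route (assuming a Unix host; the lock path and the refresh() call are placeholders), an advisory flock lets exactly one process perform the refresh while the others carry on with stale data:

import fcntl

def refresh_if_lock_available(lock_path="/tmp/cache.lock"):
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock; raises BlockingIOError if another
            # process already holds it
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # someone else is refreshing, keep serving old data
        try:
            refresh()  # placeholder for the actual cache update
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)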

Then you may have other requirements, e.g. cross-platform availability of the tools, etc.

There really isn't that much in the standard library except bare tools. One module, however, stands out: sqlite3. SQLite uses fcntl/POSIX locks, but there are performance limitations: multiple processes imply a file-backed database, and SQLite requires an fdatasync on commit.

Thus there's a limit to transactions/s in SQLite imposed by your hard drive's RPM (a 7200 RPM disk makes 120 rotations per second, so roughly that many synced commits per second at best). This is not a big deal if you have hardware RAID, but it can be a major handicap on commodity hardware, e.g. a laptop, USB flash or SD card. Plan for ~100 tps if you use a regular rotating hard drive.

Your processes can also block on SQLite if you use special transaction modes (e.g. BEGIN IMMEDIATE or BEGIN EXCLUSIVE).
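A sketch of what "just use sqlite" could look like as a shared key-value cache; the table name, schema and staleness policy are my own choices, not part of the answer:

import json
import sqlite3
import time

def open_cache(path="/tmp/cache.sqlite"):
    # Each process opens its own connection; SQLite's file locking
    # serialises writers across processes (the timeout controls how long
    # a writer waits for the database lock before raising)
    conn = sqlite3.connect(path, timeout=10)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT, updated REAL)")
    return conn

def put(conn, key, value):
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO cache (key, value, updated) VALUES (?, ?, ?)",
            (key, json.dumps(value), time.time()))

def get(conn, key, max_age=900):
    row = conn.execute(
        "SELECT value, updated FROM cache WHERE key = ?", (key,)).fetchone()
    if row is None or time.time() - row[1] > max_age:
        return None  # missing or stale
    return json.loads(row[0])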

preventing thundering herd

There are two major approaches for this:

  • probabilistically refresh the cache item earlier than required, or
  • refresh only when required, but block other callers

Presumably if you trust another process with the cache value, you don't have any security considerations. Thus either will work, or perhaps a combination of both.
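A sketch of the first approach, probabilistic early refresh; the decay curve and constants are illustrative only:

import random
import time

def should_refresh(age, ttl, beta=1.0):
    """Decide whether this caller should refresh the cached value.

    The closer the entry gets to its TTL, the more likely a caller is to
    refresh it early, so regeneration is spread out instead of every worker
    piling on at the exact expiry moment.
    """
    if age >= ttl:
        return True
    # Probability grows smoothly from 0 towards 1 as age approaches ttl
    return random.random() < beta * (age / ttl) ** 4

# Example: an entry refreshed 14 minutes ago with a 15-minute TTL
print(should_refresh(age=14 * 60, ttl=15 * 60))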

Insistent answered 11/12, 2013 at 15:17 Comment(2)
Cool, this is the high-quality input I was looking for.Ectoblast
Wrt. the transactions/s limit: if you don't need a consistent cache over reboot, use PRAGMA synchronous = OFF; then the fdatasyncs are not done and performance is good again. Do make sure to clear the cache on startup. Alternatively, keep your databases on a tmpfs/ramfs volume.Insistent

I wrote a locking (thread- and multiprocess-safe) wrapper around the standard shelve module with no external dependencies:

https://github.com/cristoper/shelfcache

It meets many of your requirements, but it does not have any sort of backoff strategy to prevent thundering herds, and if you want a reader-writer lock (so that multiple threads can read, but only one can write) you have to provide your own RW lock; see the sketch below.
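For example, a minimal thread-level reader-writer lock could look like this (my own sketch, not part of shelfcache):

import threading

class RWLock:
    """Many concurrent readers, or one exclusive writer (no writer preference)."""

    def __init__(self):
        self._readers = 0
        self._counter_lock = threading.Lock()  # guards the reader count
        self._write_lock = threading.Lock()    # held while readers or a writer are active

    def acquire_read(self):
        with self._counter_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()  # first reader blocks writers

    def release_read(self):
        with self._counter_lock:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()  # last reader lets writers in again

    def acquire_write(self):
        self._write_lock.acquire()

    def release_write(self):
        self._write_lock.release()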

However, if I were to do it again I'd probably "just use sqlite". The shelve module, which abstracts over several different dbm implementations (which themselves abstract over various OS locking mechanisms), is a pain: using the shelfcache flock option with gdbm on Mac OS X (or busybox), for example, results in a deadlock.

There are several Python projects which try to provide a standard dict interface to sqlite or other persistent stores, e.g. https://github.com/RaRe-Technologies/sqlitedict

(Note that sqlitedict is thread-safe even for the same database connection, but it is not safe to share the same database connection between processes.)
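A quick illustration of the dict-style interface, based on sqlitedict's documented usage; the file name and keys are made up, so treat it as a sketch:

from sqlitedict import SqliteDict

# Each process should create its own SqliteDict rather than sharing a connection
with SqliteDict("/tmp/rates.sqlite", autocommit=True) as cache:
    cache["EUR/USD"] = {"rate": 1.09, "fetched_at": "2013-12-06T09:18:00"}
    print(cache.get("EUR/USD"))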

Khrushchev answered 28/9, 2018 at 19:24 Comment(0)
