Memoize a function so that it isn't reset when I rerun the file in Python
Asked Answered
D

4

2

I often do interactive work in Python that involves some expensive operations that I don't want to repeat often. I'm generally running whatever Python file I'm working on frequently.

If I write:

import functools32

@functools32.lru_cache()
def square(x):
    print "Squaring", x
    return x*x

I get this behavior:

>>> square(10)
Squaring 10
100
>>> square(10)
100
>>> runfile(...)
>>> square(10)
Squaring 10
100

That is, rerunning the file clears the cache. This works:

try:
    safe_square
except NameError:
    @functools32.lru_cache()
    def safe_square(x):
        print "Squaring", x
        return x*x

but when the function is long it feels strange to have its definition inside a try block. I can do this instead:

def _square(x):
    print "Squaring", x
    return x*x

try:
    safe_square_2
except NameError:
    safe_square_2 = functools32.lru_cache()(_square)

but it feels pretty contrived (for example, in calling the decorator without an '@' sign)

Is there a simple way to handle this, something like:

@non_resetting_lru_cache()
def square(x):
    print "Squaring", x
    return x*x

?

Decoupage answered 20/11, 2013 at 1:46 Comment(3)
Your change doesn't really "work", unless you're re-running the script in a single Python interpreter session by, e.g., execfile-ing it.Omnivore
See below; I removed the word "persistent" to be clearer. Re-running the script in a single Python interpreter session is what I'm shooting for here.Decoupage
This is a very odd way to run scripts, but I can see why you'd want to do it. Let me write another answer with some comments.Omnivore
O
7

Writing a script to be executed repeatedly in the same session is an odd thing to do.

I can see why you'd want to do it, but it's still odd, and I don't think it's unreasonable for the code to expose that oddness by looking a little odd, and having a comment explaining it.

However, you've made things uglier than necessary.

First, you can just do this:

@functools32.lru_cache()
def _square(x):
    print "Squaring", x
    return x*x

try:
    safe_square_2
except NameError:
    safe_square_2 = _square

There is no harm in attaching a cache to the new _square definition. It won't waste any time, or more than a few bytes of storage, and, most importantly, it won't affect the cache on the previous _square definition. That's the whole point of closures.


There is a potential problem here with recursive functions. It's already inherent in the way you're working, and the cache doesn't add to it in any way, but you might only notice it because of the cache, so I'll explain it and show how to fix it. Consider this function:

@lru_cache()
def _fact(n):
    if n < 2:
        return 1
    return _fact(n-1) * n

When you re-exec the script, even if you have a reference to the old _fact, it's going to end up calling the new _fact, because it's accessing _fact as a global name. It has nothing to do with the @lru_cache; remove that, and the old function will still end up calling the new _fact.

But if you're using the renaming trick above, you can just call the renamed version:

@lru_cache()
def _fact(n):
    if n < 2:
        return 1
    return fact(n-1) * n

Now the old _fact will call fact, which is still the old _fact. Again, this works identically with or without the cache decorator.


Beyond that initial trick, you can factor that whole pattern out into a simple decorator. I'll explain step by step below, or see this blog post.


Anyway, even with the less-ugly version, it's still a bit ugly and verbose. And if you're doing this dozens of times, my "well, it should look a bit ugly" justification will wear thin pretty fast. So, you'll want to handle this the same way you always factor out ugliness: wrap it in a function.

You can't really pass names around as objects in Python. And you don't want to use a hideous frame hack just to deal with this. So you'll have to pass the names around as strings. ike this:

globals().setdefault('fact', _fact)

The globals function just returns the current scope's global dictionary. Which is a dict, which means it has the setdefault method, which means this will set the global name fact to the value _fact if it didn't already have a value, but do nothing if it did. Which is exactly what you wanted. (You could also use setattr on the current module, but I think this way emphasizes that the script is meant to be (repeatedly) executed in someone else's scope, not used as a module.)

So, here that is wrapped up in a function:

def new_bind(name, value):
    globals().setdefault(name, value)

… which you can turn that into a decorator almost trivially:

def new_bind(name):
    def wrap(func):
        globals().setdefault(name, func)
        return func
    return wrap

Which you can use like this:

@new_bind('foo')
def _foo():
    print(1)

But wait, there's more! The func that new_bind gets is going to have a __name__, right? If you stick to a naming convention, like that the "private" name must be the "public" name with a _ prefixed, we can do this:

def new_bind(func):
    assert func.__name__[0] == '_'
    globals().setdefault(func.__name__[1:], func)
    return func

And you can see where this is going:

@new_bind
@lru_cache()
def _square(x):
    print "Squaring", x
    return x*x

There is one minor problem: if you use any other decorators that don't wrap the function properly, they will break your naming convention. So… just don't do that. :)


And I think this works exactly the way you want in every edge case. In particular, if you've edited the source and want to force the new definition with a new cache, you just del square before rerunning the file, and it works.


And of course if you want to merge those two decorators into one, it's trivial to do so, and call it non_resetting_lru_cache.

However, I'd keep them separate. I think it's more obvious what they do. And if you ever want to wrap another decorator around @lru_cache, you're probably still going to want @new_bind to be the outermost decorator, right?


What if you want to put new_bind into a module that you can import? Then it's not going to work, because it will be referring to the globals of that module, not the one you're currently writing.

You can fix that by explicitly passing your globals dict, or your module object, or your module name as an argument, like @new_bind(__name__), so it can find your globals instead of its. But that's ugly and repetitive.

You can also fix it with an ugly frame hack. At least in CPython, sys._getframe() can be used to get your caller's frame, and frame objects have a reference to their globals namespace, so:

def new_bind(func):
    assert func.__name__[0] == '_'
    g = sys._getframe(1).f_globals
    g.setdefault(func.__name__[1:], func)
    return func

Notice the big box in the docs that tells you this is an "implementation detail" that may only apply to CPython and is "for internal and specialized purposes only". Take this seriously. Whenever someone has a cool idea for the stdlib or builtins that could be implemented in pure Python, but only by using _getframe, it's generally treated almost the same as an idea that can't be implemented in pure Python at all. But if you know what you're doing, and you want to use this, and you only care about present-day versions of CPython, it will work.

Omnivore answered 20/11, 2013 at 3:11 Comment(7)
Apologies for writing something twice as long as the persistent answer, even though the final result is 6 lines of pretty trivial code (as opposed to 239 lines that were complex enough that I felt the need to git them). But I think it was worth going through all the steps to get there. I could pull all the middle stuff out and turn it into a blog post somewhere and just link to that and summarize it here, if anyone thinks this is just way too much for an answer.Omnivore
Clever! I would not have thought of creating the new cache but not using it. To run on my Python (2.7) I had to change "rebind\n" to "rebind()\n".Decoupage
@kuzzooroo: Sorry, yeah. The earlier rebind is a function returning a decorator; the later one can just be a decorator itself, so you won't need that, but not without me editing the code. I'll do that; thanks for pointing it out.Omnivore
I just realized that rebind was a terrible name for a function that does the exact opposite of that—bind a new name, but never rebind an existing name. So I change it to new_bind. Not beautiful, but at least not completely wrong…Omnivore
I believe this doesn't work if the place I'm trying to use the decorator is not in the same module as the decorator's definition. For example, the code below doesn't work for me (second assertion fails). Is there a way to stick create_if_not_exists into a library without running into this issue? from functools32 import lrucache\n from cine import create_if_not_exists\n \n @create_if_not_exists\n @lru_cache()\n def _square(x):\n print "Squaring", x\n return x*x\n \n assert "_square" in globals()\n assert "square" in globals()\nDecoupage
@kuzzooroo: Yes, there is. First, you can make it explicitly take a globals dict or a module or a module name, instead of using globals(). That's nice and clean, but a bit ugly to use—you have to write @new_bind(globals()) or @new_bind(__name__) or something similar.Omnivore
@kuzzooroo: Alternatively, you can use sys._getframe to peek up to your caller's frame, and get the globals from that frame. As the docs imply, this is implementation-specific, and any use of it pretty much qualifies as an ugly hack… but it will work, at least in CPython, for exactly what you want. I'll edit the answer with more details.Omnivore
O
6

There is no persistent_lru_cache in the stdlib. But you can build one pretty easily.

The functools source is linked directly from the docs, because this is one of those modules that's as useful as sample code as it is for using it directly.

As you can see, the cache is just a dict. If you replace that with, say, a shelf, it will become persistent automatically:

def persistent_lru_cache(filename, maxsize=128, typed=False):
    """new docstring explaining what dbpath does"""
    # same code as before up to here
    def decorating_function(user_function):
        cache = shelve.open(filename)
        # same code as before from here on.

Of course that only works if your arguments are strings. And it could be a little slow.

So, you might want to instead keep it as an in-memory dict, and just write code that pickles it to a file atexit, and restores it from a file if present at startup:

    def decorating_function(user_function):
        # ...

        try:
            with open(filename, 'rb') as f:
                cache = pickle.load(f)
            except:
                cache = {}
        def cache_save():
            with lock:
                with open(filename, 'wb') as f:
                    pickle.dump(cache, f)
        atexit.register(cache_save)

        # …
        wrapper.cache_save = cache_save
        wrapper.cache_filename = filename

Or, if you want it to write every N new values (so you don't lose the whole cache on, say, an _exit or a segfault or someone pulling the cord), add this to the second and third versions of wrapper, right after the misses += 1:

            if misses % N == 0:
                cache_save()

See here for a working version of everything up to this point (using save_every as the "N" argument, and defaulting to 1, which you probably don't want in real life).

If you want to be really clever, maybe copy the cache and save that in a background thread.

You might want to extend the cache_info to include something like number of cache writes, number of misses since last cache write, number of entries in the cache at startup, …

And there are probably other ways to improve this.

From a quick test, with save_every=1, this makes the cache on both get_pep and fib (from the functools docs) persistent, with no measurable slowdown to get_pep and a very small slowdown to fib the first time (note that fib(100) has 100097 hits vs. 101 misses…), and of course a large speedup to get_pep (but not fib) when you re-run it. So, just what you'd expect.

Omnivore answered 20/11, 2013 at 1:52 Comment(1)
I've changed "@persistent..." to "@non_resetting..." to be clearer. Having the cache stay around if I leave Python and restart it isn't the key; the key is that when I rerun my file in the same console window I don't lose all my state. I believe a shelf would work if my keys were all strings, with the side effect that I would get real persistence.Decoupage
D
0

I can't say I won't just use @abarnert's "ugly frame hack", but here is the version that requires you to pass in the calling module's globals dict. I think it's worth posting given that decorator functions with arguments are tricky and meaningfully different from those without arguments.

def create_if_not_exists_2(my_globals):
    def wrap(func):
        if "_" != func.__name__[0]:
            raise Exception("Function names used in cine must begin with'_'")
        my_globals.setdefault(func.__name__[1:], func)
        def wrapped(*args):
            func(*args)
        return wrapped
    return wrap

Which you can then use in a different module like this:

from functools32 import lru_cache
from cine import create_if_not_exists_2

@create_if_not_exists_2(globals())
@lru_cache()
def _square(x):
    print "Squaring", x
    return x*x

assert "_square" in globals()
assert "square" in globals()
Decoupage answered 25/11, 2013 at 23:55 Comment(0)
D
0

I've gained enough familiarity with decorators during this process that I was comfortable taking a swing at solving the problem another way:

from functools32 import lru_cache

try:
    my_cine
except NameError:
    class my_cine(object):
        _reg_funcs = {}

        @classmethod
        def func_key (cls, f):
            try:
                name = f.func_name
            except AttributeError:
                name = f.__name__
            return (f.__module__, name)

        def __init__(self, f):
            k = self.func_key(f)
            self._f = self._reg_funcs.setdefault(k, f)

        def __call__(self, *args, **kwargs):
            return self._f(*args, **kwargs)


if __name__ == "__main__":
    @my_cine
    @lru_cache()
    def fact_my_cine(n):
        print "In fact_my_cine for", n
        if n < 2:
            return 1
        return fact_my_cine(n-1) * n

    x = fact_my_cine(10)
    print "The answer is", x

@abarnert, if you are still watching, I'd be curious to hear your assessment of the downsides of this method. I know of two:

  1. You have to know in advance what attributes to look in for a name to associate with the function. My first stab at it only looked at func_name which failed when passed an lru_cache object.
  2. Resetting a function is painful: del my_cine._reg_funcs[('__main__', 'fact_my_cine')], and the swing I took at adding a __delitem__ was unsuccessful.
Decoupage answered 27/11, 2013 at 1:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.