Caching a generator

A recent similar question (isinstance(foo, types.GeneratorType) or inspect.isgenerator(foo)?) got me curious about how to implement this generically.

It actually seems like a generally useful thing to have: a generator-type object that caches the first time through (like itertools.cycle), reports StopIteration, and then returns items from the cache the next time through - but if the object isn't a generator (i.e. a list or dict that inherently supports O(1) lookup), skips the caching and gives the same behaviour over the original list.
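For comparison, itertools.tee from the standard library does something related for a known, fixed number of passes - it buffers values so several iterators can consume one generator - though each tee'd iterator is itself single-use:

```python
import itertools

gen = (x * x * x for x in range(4))
a, b = itertools.tee(gen)        # two independent iterators over one generator
assert list(a) == [0, 1, 8, 27]
assert list(b) == [0, 1, 8, 27]  # b still sees everything a already consumed
assert list(a) == []             # but each tee'd iterator is single-use
```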

Possibilities:

1) Modify itertools.cycle. It looks like this:

def cycle(iterable):
    # cycle('ABCD') --> A B C D A B C D A B C D ...
    saved = []
    try:
        saved.append(next(iterable))  # only works if it's an iterator/generator
        yield saved[-1]
        isiter = True
    except TypeError:                 # not an iterator (e.g. a plain list)
        saved = iterable
        isiter = False
    for element in iterable:
        yield element
        if isiter:
            saved.append(element)

    # ??? What next?

If I could restart the generator, that would be perfect - I could raise StopIteration, and then on the next gen.next() return entry 0 again, i.e. `A B C D StopIteration A B C D StopIteration ...` - but it doesn't look like that's actually possible.
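A quick check confirms that an exhausted generator can't be restarted - once it raises StopIteration, it stays done:

```python
def gen():
    yield 'A'
    yield 'B'

g = gen()
assert list(g) == ['A', 'B']
assert list(g) == []               # exhausted: no way to rewind
assert next(g, 'done') == 'done'   # every subsequent next() just stops
```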

A second option: once StopIteration is hit, saved holds the full cache - but there doesn't seem to be any way to reach the internal saved list from outside the generator. Maybe a class version of this?
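A rough sketch of what that class version might look like (the class and attribute names are hypothetical, and it assumes only one full pass is active at a time): cache values as they stream past, then replay from the cache once the source is exhausted:

```python
class CachedIter(object):
    """Hypothetical sketch: replay from an internal cache after the first pass."""
    def __init__(self, iterable):
        self.saved = []            # the cache the plain-generator version couldn't expose
        self._it = iter(iterable)
        self._done = False

    def __iter__(self):
        if self._done:
            return iter(self.saved)
        return self._fill()

    def _fill(self):
        # first pass: pull from the source, caching as we go
        for item in self._it:
            self.saved.append(item)
            yield item
        self._done = True
```

After one full pass, the cache is reachable from outside as `.saved`, which the plain generator version couldn't offer.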

2) Or I could pass in the list directly:

def cycle(iterable, saved=[]):    # mutable default: callers should pass their own list
    del saved[:]                  # clear in place (list.clear() is Python 3.3+)
    try:
        saved.append(next(iterable))
        yield saved[-1]
        isiter = True
    except TypeError:             # not an iterator (e.g. a plain list)
        saved = iterable
        isiter = False
    # cycle('ABCD') --> A B C D A B C D A B C D ...
    for element in iterable:
        yield element
        if isiter:
            saved.append(element)

mysaved = []
myiter = cycle(someiter, mysaved)

But that just looks nasty. In C/C++ I could pass in a reference and repoint saved at iterable - but you can't actually do that in Python: assigning to saved inside the function only rebinds the local name, so the caller's list never changes. So this doesn't even work.
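The rebinding problem is easy to demonstrate - assignment to a parameter inside a function only rebinds the local name:

```python
def swap_ref(saved, iterable):
    saved = iterable   # rebinds the local name 'saved' only

mysaved = []
swap_ref(mysaved, [1, 2, 3])
assert mysaved == []   # the caller's list is untouched
```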

Other options?

Edit: More data. The CachingIterable method appears to be too slow to be effective, but it did push me in a direction that might work. It's slightly slower than the naive method (converting to a list myself), but appears not to take the hit if the input is already a list.

Some code and data:

def cube_generator(max=100):
    i = 0
    while i < max:
        yield i*i*i
        i += 1

# Base case: use generator each time
%%timeit
cg = cube_generator(); [x for x in cg]
cg = cube_generator(); [x for x in cg]
cg = cube_generator(); [x for x in cg]
10000 loops, best of 3: 55.4 us per loop

# Fastest case: flatten to list, then iterate
%%timeit
cg = cube_generator()
cl = list(cg)
[x for x in cl]
[x for x in cl]
[x for x in cl]
10000 loops, best of 3: 27.4 us per loop

%%timeit
cg = cube_generator()
ci2 = CachingIterable(cg)
[x for x in ci2]
[x for x in ci2]
[x for x in ci2]
1000 loops, best of 3: 239 us per loop

# Another attempt, which is closer to the above
# Not exactly the original solution using next, but close enough i guess
class CacheGen(object):
    def __init__(self, iterable):
        if isinstance(iterable, (list, tuple, dict)):
            self._myiter = iterable
        else:
            self._myiter = list(iterable)
    def __iter__(self):
        return self._myiter.__iter__()
    def __contains__(self, key):
        return self._myiter.__contains__(key)
    def __getitem__(self, key):
        return self._myiter.__getitem__(key)

%%timeit
cg = cube_generator()
ci = CacheGen(cg)
[x for x in ci]
[x for x in ci]
[x for x in ci]
10000 loops, best of 3: 30.5 us per loop

# But if you start with a list, it is faster
cg = cube_generator()
cl = list(cg)
%%timeit
[x for x in cl]
[x for x in cl]
[x for x in cl]
100000 loops, best of 3: 11.6 us per loop

%%timeit
ci = CacheGen(cl)
[x for x in ci]
[x for x in ci]
[x for x in ci]
100000 loops, best of 3: 13.5 us per loop

Any faster recipes that can get closer to the 'pure' loop?

Landpoor answered 21/10, 2013 at 19:52 Comment(8)
The main problem is that once StopIteration is raised, then by the generator specification, it should no longer yield anything...Bassarisk
yes, that's exactly my problem. i just wanted something you could iterate over, but i guess an iterable works just as well. as an aside, i realized it would be somewhat simple to take a class that wraps a list, returns the list's __iter__ for its own __iter__, and if you pass a generator, just unwrap it with list(generator) and do the same thing.Landpoor
Why did the flatten case take 23.5 us per loop early on, yet 11.6 us per loop after? Are you testing in the same stable environment?Evan
i don't see a 23.5, but if you meant the 27.4 vs. 11.6, the 27.4 is timing for creating the list from the generator & iterating the list 3 times; the 11.6 is only for iterating the list 3 times. It's only meant to show that this CacheGen implementation is not copying the list if it gets one, only if it gets a generator.Landpoor
@CorleyBrigman: ok, gotcha, that makes sense. so yea it seems any solution will be slower than just doing list() and then iterating over the list - so your CacheGen would be the way to go. if ultimately you have to exhaust the whole iterator then you might as well do it all in one go at the beginning. But if you have infinite generators then you won't be able to do it that way. or if you might not iterate over the whole thing you'll waste resources. I've updated my answer with a more efficient "as you go" cacher, but still slower than the simple oneEvan
my intention here is that this would only be used if the user knows he wants to iterate multiple times over the 'iterable', but doesn't know if the input is a generator or iterable. this lets you ignore that distinction, while not losing (much) performance. if you needed as-you-go, or thought you might get an infinite generator, you probably wouldn't use this at all. i can see different uses for your generator though, so i'm keeping this bookmarked :)Landpoor
@CorleyBrigman: oh gotcha. then perhaps all you need is a function, ensure_list = lambda i: i if isinstance(i, (list, tuple, dict)) else list(i) =P.Evan
that's actually a very good idea, thanks! simple to type, but effective. it might be slower in some cases, because you have to iterate through the list one additional time compared to a solution like CacheGen, but at least for the smallish list above & 3 iterations through, the data above says that this method is still faster, even with the additional iteration. if you post that as an answer i'll mark it as such.Landpoor

Based on this comment:

my intention here is that this would only be used if the user knows he wants to iterate multiple times over the 'iterable', but doesn't know if the input is a generator or iterable. this lets you ignore that distinction, while not losing (much) performance.

This simple solution does exactly that:

def ensure_list(it):
    if isinstance(it, (list, tuple, dict)):
        return it
    else:
        return list(it)

Now ensure_list(a_list) is practically a no-op - two function calls - while ensure_list(a_generator) will turn it into a list and return it, which turned out to be faster than any other approach.
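A quick check of both paths (restating the function so the snippet stands alone): a list comes back as the very same object, while a generator gets materialized exactly once:

```python
def ensure_list(it):
    if isinstance(it, (list, tuple, dict)):
        return it
    return list(it)

a = [1, 2, 3]
assert ensure_list(a) is a            # a list passes straight through, no copy
g = (x * x * x for x in range(3))
assert ensure_list(g) == [0, 1, 8]    # a generator is flattened exactly once
assert ensure_list(g) == []           # ...and is exhausted afterwards
```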

Evan answered 22/10, 2013 at 18:3 Comment(1)
This requires the entire iterable to be available when ensure_list is called, or else it will take a long time before returning. This is problematic when dealing with I/O streams such as HTTP responses.Transported

What you want is not an iterator, but an iterable. An iterator can only iterate once through its contents. You want something that takes an iterator and can then be iterated over multiple times, producing the same values, even though the underlying iterator (a generator, say) doesn't remember them. Then it's just a matter of special-casing those inputs which don't need caching. Here's a non-thread-safe example (EDIT: updated for efficiency):

import itertools
class AsYouGoCachingIterable(object):
    def __init__(self, iterable):
        self.iterable = iterable
        self.iter = iter(iterable)
        self.done = False
        self.vals = []

    def __iter__(self):
        if self.done:
            return iter(self.vals)
        #chain vals so far & then gen the rest
        return itertools.chain(self.vals, self._gen_iter())

    def _gen_iter(self):
        #gen new vals, appending as it goes
        for new_val in self.iter:
            self.vals.append(new_val)
            yield new_val
        self.done = True

And some timings:

class ListCachingIterable(object):
    def __init__(self, obj):
        self.vals = list(obj)

    def __iter__(self):
        return iter(self.vals)

def cube_generator(max=1000):
    i = 0
    while i < max:
        yield i*i*i
        i += 1

def runit(iterable_factory):
    for i in xrange(5):
        for what in iterable_factory():
            pass

def puregen():
    runit(lambda: cube_generator())
def listtheniter():
    res = list(cube_generator())
    runit(lambda: res)
def listcachingiterable():
    res = ListCachingIterable(cube_generator())
    runit(lambda: res)
def asyougocachingiterable():
    res = AsYouGoCachingIterable(cube_generator())
    runit(lambda: res)

Results are:

In [59]: %timeit puregen()
1000 loops, best of 3: 774 us per loop

In [60]: %timeit listtheniter()
1000 loops, best of 3: 345 us per loop

In [61]: %timeit listcachingiterable()
1000 loops, best of 3: 348 us per loop

In [62]: %timeit asyougocachingiterable()
1000 loops, best of 3: 630 us per loop

So the simplest approach in terms of a class, ListCachingIterable, works just about as well as doing the list manually. The "as-you-go" variant is almost twice as slow, but has advantages if you don't consume the entire list, e.g. say you're only looking for the first cube over 100:

def first_cube_past_100(cubes):
    for cube in cubes:
        if cube > 100:
            return cube
    raise ValueError("No cube > 100 in this iterable")

Then:

In [76]: %timeit first_cube_past_100(cube_generator())
100000 loops, best of 3: 2.92 us per loop

In [77]: %timeit first_cube_past_100(ListCachingIterable(cube_generator()))
1000 loops, best of 3: 255 us per loop

In [78]: %timeit first_cube_past_100(AsYouGoCachingIterable(cube_generator()))
100000 loops, best of 3: 10.2 us per loop
Evan answered 21/10, 2013 at 20:36 Comment(9)
this looks pretty reasonable, i will think about this one and see if it completely solves my problem. the non-caching is sometimes an issue, but an example might be join, where it's going to go through the list twice, and not be modified. standard procedure is to pass a list (for performance), but it's not necessarily to duplicate it if it's already a list - you could do something like ''.join(CachingIterable(my_real_iterable)) and it would be 'automatic'...Landpoor
hmm, i don't think i can accept this answer... mostly, because it's very slow for a small number of iterations - doing it 3 times, it's about a factor of 5 slower than just using the generators without caching. maybe an optimized method?Landpoor
@CorleyBrigman: hmm perhaps, can you put a codepad or pastebin of your test case so I can mess around with it?Evan
just wanted to thank you for the additional timing detail. and AsYouGoCachingIterable could probably be sped up slightly by replacing self.vals.append(new_val) with self.vals += (new_val,)...Landpoor
@CorleyBrigman: I'm not so sure. the former just appends an element to a list, which is amortized O(1), while the latter has to construct a new tuple with an additional element, which is O(n). check it outEvan
that makes sense for tuples - due to immutability, they have to be rebuilt. lists don't, and it has been shown before that += is faster than append, don't know why. however, self.vals += [new_val] is slower than both - it's kind of 'accidental', and only works because constructing a tuple is cheaper than constructing a list, but for list addition, += is faster than append.Landpoor
@CorleyBrigman: l += [1] isn't faster than l.append(1) either. the former has to create a new list, the latter just modifies the existing one. but if you have to create a new list, then l += [1] is indeed way faster than l = l[:]; l.append(1).Evan
maybe not exactly... but if set l1 = range(100); l2 = [101] (python 2.6), then do %%timeit l1 += l2 and %%timeit l1.extend(l2), the += version is about 20% faster. not directly related, true, but %timeit showed that doing += with a tuple is about 15% faster than append.Landpoor
@CorleyBrigman: oh interesting, looks like you're right. list += is faster than both .extend and .append with the first element, if the list to plus is already constructed. if you only have one element, then l.append(el) is faster than l += [el], though. i think this is because the interpreter can be more efficient in picking which function to call - with extend it does a dict lookup but with += it doesn't. constructing a one-element list negates this speed-up though. it seems i was wrong, += doesn't copy the list, it does it in-place.Evan
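The conclusion of that thread can be verified directly - list += mutates the list in place via __iadd__ (much like extend), rather than building a new list:

```python
l = [1, 2]
alias = l
l += (3,)              # in-place: equivalent to l.extend((3,))
l += [4]
assert alias == [1, 2, 3, 4]   # the original object was mutated
assert l is alias              # no new list was created
```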

I just made a library that solves exactly this - it supports caching for functions returning iterators:

from typing import *
from cacheable_iter import iter_cache

@iter_cache
def iterator_function(n: int) -> Iterator[int]:
    yield from range(n)

An example of usage:

from typing import *
from cacheable_iter import iter_cache

@iter_cache
def my_iter(n: int) -> Iterator[int]:
    print(" * my_iter called")
    for i in range(n):
        print(f" * my_iter step {i}")
        yield i

gen1 = my_iter(4)
print("Creating an iterator...")
print(f"The first value of gen1 is {next(gen1)}")
print(f"The second value of gen1 is {next(gen1)}")

gen2 = my_iter(4)
print("Creating an iterator...")
print(f"The first value of gen2 is {next(gen2)}")
print(f"The second value of gen2 is {next(gen2)}")
print(f"The third value of gen2 is {next(gen2)}")

Which would print:

Creating an iterator...
 * my_iter called
 * my_iter step 0
The first value of gen1 is 0
 * my_iter step 1
The second value of gen1 is 1
Creating an iterator...
The first value of gen2 is 0
The second value of gen2 is 1
 * my_iter step 2
The third value of gen2 is 2

It also supports caching awaitable iterators and asynchronous iterators.

Wend answered 21/6, 2021 at 10:39 Comment(0)
