When is it not a good time to use Python generators?

11

92

This is rather the inverse of What can you use Python generator functions for?: Python generators, generator expressions, and the itertools module are some of my favorite features of Python these days. They're especially useful when setting up chains of operations to perform on a big pile of data--I often use them when processing DSV files.

So when is it not a good time to use a generator, or a generator expression, or an itertools function?

  • When should I prefer zip() over itertools.izip(), or
  • range() over xrange(), or
  • [x for x in foo] over (x for x in foo)?

Obviously, we eventually need to "resolve" a generator into actual data, usually by creating a list or iterating over it with a non-generator loop. Sometimes we just need to know the length. This isn't what I'm asking.

We use generators so that we're not allocating new lists in memory for interim data. This especially makes sense for large datasets. Does it make sense for small datasets too? Is there a noticeable memory/CPU trade-off?

I'm especially interested if anyone has done some profiling on this, in light of the eye-opening discussion of list comprehension performance vs. map() and filter(). (alt link)
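
For concreteness, here's roughly the micro-benchmark I have in mind (a minimal Python 3 sketch; the sizes and the doubling step are arbitrary placeholders):

import timeit

for n in (10, 100, 10000):
    setup = "data = list(range(%d))" % n
    list_t = timeit.timeit("[x * 2 for x in data]", setup=setup, number=10000)
    gen_t = timeit.timeit("list(x * 2 for x in data)", setup=setup, number=10000)
    print("n=%-6d listcomp %.4fs genexpr %.4fs" % (n, list_t, gen_t))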

Sauterne answered 29/10, 2008 at 4:25 Comment(3)
I posed a similar question here and did some analysis to find that in my particular example lists are faster for iterables of length <5.Pattipattie
Does this answer your question? Generator Expressions vs. List ComprehensionWordsworth
In 3.x, zip behaves lazily, and itertools.izip has been removed. Similarly with range and xrange.Allness
68

Use a list instead of a generator when:

1) You need to access the data multiple times (i.e. cache the results instead of recomputing them):

for i in outer:           # used once, okay to be a generator or return a list
    for j in inner:       # used multiple times, reusing a list is better
         ...

2) You need random access (or any access other than forward sequential order):

for i in reversed(data): ...     # generators aren't reversible

s[i], s[j] = s[j], s[i]          # generators aren't indexable

3) You need to join strings (which requires two passes over the data):

s = ''.join(data)                # lists are faster than generators in this use case

4) You are using PyPy, which sometimes can't optimize generator code as well as it can normal function calls and list manipulations.
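
To make the first two points concrete, a minimal runnable sketch (Python 3):

inner = (x * x for x in range(3))     # a generator: one forward pass only

for i in range(2):
    # The second time through, the generator is already exhausted,
    # so this prints an empty list instead of recomputing.
    print(i, list(inner))

data = (x for x in range(5))
# reversed(data) and data[0] both raise TypeError:
# generators are neither reversible nor indexable.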

Insensitive answered 29/10, 2014 at 16:36 Comment(5)
For #3, couldn't the two passes be avoided by using ireduce to replicate the join?Electromagnetism
Thanks! I wasn't aware of the string joining behavior. Can you provide or link to an explanation of why it requires two passes?Sauterne
@DavidEyk str.join makes one pass to add up the lengths of all the string fragments so it knows how much memory to allocate for the combined final result. The second pass copies the string fragments into the new buffer to create a single new string. See hg.python.org/cpython/file/82fd95c2851b/Objects/stringlib/…Insensitive
Interesting, I very often use generators to join strings. But I wonder, how does it work if it needs two passes? For instance ''.join('%s' % i for i in xrange(10))Monochromat
@ikaros45 If the input to join isn't a list, it has to do extra work to build a temporary list for the two passes. Roughly this: `data = data if isinstance(data, list) else list(data); n = sum(map(len, data)); buffer = bytearray(n); ... <copy fragments into buffer>`.Insensitive
44

In general, don't use a generator when you need list operations, like len(), reversed(), and so on.

There may also be times when you don't want lazy evaluation (e.g. to do all the calculation up front so you can release a resource). In that case, a list expression might be better.
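
For instance, a minimal sketch of how eager evaluation changes where an error surfaces (the failing value is contrived):

def frob(x):
    if x == 2:
        raise ValueError("bad value")
    return x * 10

items = [0, 1, 2, 3]

try:
    strict = [frob(x) for x in items]   # fails up front; nothing is processed
except ValueError:
    print("list: error raised at creation time")

lazy = (frob(x) for x in items)         # creating the generator raises nothing
try:
    for y in lazy:
        print("processed", y)           # 0 and 10 are handled before the error
except ValueError:
    print("generator: error raised mid-loop")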

Herve answered 29/10, 2008 at 4:42 Comment(2)
Also, doing all the calculation up front ensures that if the calculation of the list elements throws an exception, it will be thrown at the point where the list is created, not in the loop that subsequently iterates through it. If you need to ensure error-free processing of the entire list before continuing, generators are no good.Teeming
That's a good point. It's very frustrating to get halfway through processing a generator, only to have everything explode. It can potentially be dangerous.Sauterne
29

Profile, Profile, Profile.

Profiling your code is the only way to know if what you're doing has any effect at all.

Most uses of xrange, generators, etc. are over static-size, small datasets. It's only when you get to large datasets that it really makes a difference. range() vs. xrange() is mostly just a matter of making the code look a tiny bit uglier, while losing nothing and maybe gaining something.

Profile, Profile, Profile.
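
As a starting point, a minimal sketch measuring both sides of the trade-off (the workload is an arbitrary placeholder):

import sys
import timeit

n = 1000
print("time, list:", timeit.timeit("sum([x * x for x in range(%d)])" % n, number=1000))
print("time, gen: ", timeit.timeit("sum(x * x for x in range(%d))" % n, number=1000))

# The intermediate object's own footprint: the list grows with n,
# while the generator stays a small constant.
print("size, list:", sys.getsizeof([x * x for x in range(n)]), "bytes")
print("size, gen: ", sys.getsizeof(x * x for x in range(n)), "bytes")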

Pilliwinks answered 29/10, 2008 at 11:37 Comment(2)
Profile, indeed. One of these days, I'll try and do an empirical comparison. Until then, I was just hoping someone else already had. :)Sauterne
Profile, Profile, Profile. I completely agree. Profile, Profile, Profile.Ophiology
17

You should never favor zip over izip, range over xrange, or list comprehensions over generator expressions. In Python 3.0, range has xrange-like semantics and zip has izip-like semantics.

List comprehensions are actually clearer written as list(frob(x) for x in foo) for those times you need an actual list.
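
A quick illustration of the Python 3 semantics (a minimal sketch):

r = range(10 ** 12)        # constant memory; no trillion-element list is built
z = zip("abc", [1, 2, 3])  # a lazy iterator, not a list

print(r[10])               # ranges still support indexing and len()
print(list(z))             # materialize only when you actually need a list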

Chanellechaney answered 29/10, 2008 at 4:28 Comment(10)
@Steven I don't disagree, but I am wondering what the reasoning behind your answer is. Why should zip, range, and list comprehensions never be favoured over the corresponding "lazy" version??Watch
because, as he said, the old behaviour of zip and range will go away soon.Jannjanna
@Steven: Good point. I'd forgotten about these changes in 3.0, which probably means that someone up there is convinced of their general superiority. Re: List comprehensions, they are often clearer (and faster than expanded for loops!), but one can easily write incomprehensible list comprehensions.Sauterne
I meant that list(frob(x) for x in foo) is more descriptive than [frob(x) for x in foo] -- i.e. the [] list comprehension "sugar" is not helpful.Chanellechaney
I see what you mean, but I find the [] form descriptive enough (and more concise, and less cluttered, generally). But this is just a matter of taste.Sauterne
And it looks like this will be the official answer, mainly for the point about generators becoming the normal forms in 3.0. Nobody's brought up any serious detriments to the careful use of generators, even on short datasets, so I will continue to use them with abandon.Sauterne
Please check my response with performance numbers below. List comprehensions can be significantly faster than generator expressions when using psyco.Herve
The list operations are faster for small data sizes, but everything is fast when the data size is small, so you should always prefer generators unless you have a specific reason to use lists (for such reasons, see Ryan Ginstrom's answer).Teeming
This is a rather weak point; I can't think of a case where you can't get the lazy version by playing with try/except NameError or ImportError.Monochromat
Using timeit in Python 3.8 gives [frob(x) for x in foo] as 50% faster than list(frob(x) for x in foo). The latter is not a list comprehension as the post states--it's a generator expression.Wordsworth
7

As you mention, "this especially makes sense for large datasets"; I think that answers your question.

If you're not hitting any walls performance-wise, you can still stick to lists and standard functions. Then, when you run into performance problems, make the switch.

As mentioned by @u0b34a0f6ae in the comments, however, using generators at the start can make it easier for you to scale to larger datasets.

Unmindful answered 29/10, 2008 at 8:50 Comment(1)
+1 Generators make your code more ready for big datasets without you having to anticipate it.Bungalow
6

Regarding performance: if using psyco, lists can be quite a bit faster than generators. In the example below, lists are almost 50% faster when using psyco.full()

import psyco
import time
import cStringIO

def time_func(func):
    """The amount of time it requires func to run"""
    start = time.clock()
    func()
    return time.clock() - start

def fizzbuzz(num):
    """That algorithm we all know and love"""
    if not num % 3 and not num % 5:
        return "%d fizz buzz" % num
    elif not num % 3:
        return "%d fizz" % num
    elif not num % 5:
        return "%d buzz" % num
    return None

def with_list(num):
    """Try getting fizzbuzz with a list comprehension and range"""
    out = cStringIO.StringIO()
    for fibby in [fizzbuzz(x) for x in range(1, num) if fizzbuzz(x)]:
        print >> out, fibby
    return out.getvalue()

def with_genx(num):
    """Try getting fizzbuzz with generator expression and xrange"""
    out = cStringIO.StringIO()
    for fibby in (fizzbuzz(x) for x in xrange(1, num) if fizzbuzz(x)):
        print >> out, fibby
    return out.getvalue()

def main():
    """
    Test speed of generator expressions versus list comprehensions,
    with and without psyco.
    """

    #our variables
    nums = [10000, 100000]
    funcs = [with_list, with_genx]

    #  try without psyco 1st
    print "without psyco"
    for num in nums:
        print "  number:", num
        for func in funcs:
            print func.__name__, time_func(lambda : func(num)), "seconds"
        print

    #  now with psyco
    print "with psyco"
    psyco.full()
    for num in nums:
        print "  number:", num
        for func in funcs:
            print func.__name__, time_func(lambda : func(num)), "seconds"
        print

if __name__ == "__main__":
    main()

Results:

without psyco
  number: 10000
with_list 0.0519102208309 seconds
with_genx 0.0535933367509 seconds

  number: 100000
with_list 0.542204280744 seconds
with_genx 0.557837353115 seconds

with psyco
  number: 10000
with_list 0.0286369007033 seconds
with_genx 0.0513424889137 seconds

  number: 100000
with_list 0.335414877839 seconds
with_genx 0.580363490491 seconds
Herve answered 1/11, 2008 at 5:53 Comment(2)
That's because psyco doesn't speed up generators at all, so it's more of a shortcoming of psyco than of generators. Good answer, though.Chanellechaney
Also, psyco is pretty much unmaintained now. All the developers are spending their time on PyPy's JIT, which, to the best of my knowledge, does optimise generators.Railey
3

You should prefer list comprehensions if you need to keep the values around for something else later and the size of your set is not too large.

For example: you are creating a list that you will loop over several times later in your program.

To some extent you can think of generators as a replacement for iteration (loops) vs. list comprehensions as a type of data structure initialization. If you want to keep the data structure then use list comprehensions.
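
If you only need limited look-ahead on a stream, itertools.tee can offer a middle ground before reaching for a full list; a minimal sketch (the stream here is an arbitrary placeholder):

import itertools

stream = (x * x for x in range(5))        # any one-shot iterator
current, ahead = itertools.tee(stream, 2)
next(ahead, None)                         # advance one copy to peek ahead

for cur, nxt in zip(current, ahead):
    print("current:", cur, "next:", nxt)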

Complimentary answered 1/11, 2008 at 23:49 Comment(1)
If you only need limited look-ahead / look-behind on the stream, then maybe itertools.tee() can help you. But generally, if you want more than one pass, or random access to some intermediate data, make a list/set/dict of it.Footpath
2

As far as performance is concerned, I can't think of any case where you would want to use a list over a generator.

Ergosterol answered 29/10, 2008 at 11:44 Comment(1)
all(True for _ in range(10 ** 8)) is slower than all([True for _ in range(10 ** 8)]) in Python 3.8. I'd prefer a list over a generator here.Wordsworth
2

I've never found a situation where generators would hinder what you're trying to do. There are, however, plenty of instances where using generators would not help you any more than not using them.

For example:

sorted(xrange(5))

Does not offer any improvement over:

sorted(range(5))
Teodoro answered 29/10, 2008 at 16:44 Comment(1)
Neither of those offers any improvement over range(5), since the resulting list is already sorted.Kktp
0

A generator builds an enumerable sequence of values. Enumerables are useful when an iterative process can consume the values on demand. It takes time to build a generator, though, so if the dataset is millions of records in size, it may be more efficient to have SQL Server process the data in SQL.

Mortise answered 27/5, 2021 at 17:36 Comment(0)
0

16 years later, I can add some other reasons not to use generators:

When you integrate them extensively into your workflow, you will discover that they are not treated nicely by many of the tools you may be using, or even by the Python language itself. For instance:

  • Numba: doesn't support generators that use 'send', 'throw', or 'yield from'.
  • Multiprocessing: may fail depending on how you use them, since generators can't be pickled.
  • Pickle: generators can't be pickled (see the sketch after this list).
  • Async: async generators with 'yield from' are not supported, and probably never will be (Python doesn't support the syntax).
  • Itertools with async: some functions, such as itertools.tee, won't work on async generators, and there are no built-in async alternatives, though you can usually write your own without excessive effort.
  • Send with itertools: generators driven with 'send' seem to have been barely, if at all, taken into account in itertools or any other built-in library I know of, so expect more DIY or reliance on third parties; I don't know of any good library in this regard yet.
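
For instance, a minimal sketch of the pickling limitation (Python 3):

import pickle

gen = (x for x in range(10))

try:
    pickle.dumps(gen)
except TypeError as exc:
    # CPython refuses: a generator's execution state isn't serializable.
    print("can't pickle:", exc)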

Please feel free to add your own items to the list.

That said, I really like the concept of generators. They definitely come with memory savings and make for much cleaner operations, like a prime number generator, which works nicely where listing all primes would be impossible (see the sketch below).

But the issues I mention, and many others, may narrow their range of application to fewer use cases, and I would take that into account before relying too heavily on them.

If you don't plan to use advanced generator features (send, throw, yield from), I would say simple generators are well supported in most use cases.
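
For example, a minimal sketch of the prime-number generator mentioned above (naive trial division, purely for illustration):

import itertools

def primes():
    """Yield primes indefinitely; materializing 'all primes' is impossible."""
    found = []
    candidate = 2
    while True:
        if all(candidate % p for p in found):
            found.append(candidate)
            yield candidate
        candidate += 1

print(list(itertools.islice(primes(), 10)))   # first ten primes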

Sable answered 16/4, 2024 at 9:28 Comment(0)
