Why can't I iterate twice over the same iterator? How can I "reset" the iterator or reuse the data?
F

6

93

Consider the code:

def test(data):
    for row in data:
        print("first loop")
    for row in data:
        print("second loop")

When data is an iterator, for example a list iterator or a generator expression*, this does not work:

>>> test(iter([1, 2]))
first loop
first loop
>>> test((_ for _ in [1, 2]))
first loop
first loop

This prints first loop a few times, since data is non-empty. However, it does not print second loop. Why does iterating over data work the first time, but not the second time? How can I make it work a second time?

Aside from for loops, the same problem appears to occur with any kind of iteration: list/set/dict comprehensions, passing the iterator to list(), sum() or reduce(), etc.
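As a quick illustration of that point (a minimal sketch; the variable names are made up for the example), consuming an iterator with sum() leaves nothing for any later consumer:

```python
it = iter([1, 2, 3])

first_total = sum(it)    # consumes every element of the iterator
second_total = sum(it)   # nothing is left, so this sums zero elements
leftover = list(it)      # still exhausted

print(first_total, second_total, leftover)
```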

On the other hand, if data is another kind of iterable, such as a list or a range (which are both sequences), both loops run as expected:

>>> test([1, 2])
first loop
first loop
second loop
second loop
>>> test(range(2))
first loop
first loop
second loop
second loop

* More examples:


For general theory and terminology explanation, see What are iterator, iterable, and iteration?.

To detect whether the input is an iterator or a "reusable" iterable, see Ensure that an argument can be iterated twice.

Fractostratus answered 16/8, 2014 at 3:42 Comment(8)
Iterable vs. iterator.Misprize
I'm not saying that this is a duplicate, but you might also want to refer to #9884632 for some more context / explanationAlaster
Related: Resetting an iterator objectInfatuated
The code presented in this question is not the shortest possible to recreate the problem. The question could be improved by presenting a better code example.Trouper
@Trilarion Yes, I think the def _view(self,dbName): db = self.dictDatabases[dbName] data = db[3] can be removed safely since no other answer discusses that portion of the code.Nilla
@MateenUlhaq Thanks for the improvement. I despair a bit at the question because as a debugging question it never showed runnable code and as a knowledge question (already knowing that it's an iterator) it doesn't show any research, yet it got so many upvotes. Added a bit of research because that is what a good question would have done.Trouper
I think there's an unanswered question here, one that can trip up novices: "How can I tell if my data is an iterator or just iterable?" For example, why can I go through this list twice, but not through this file twice?Expressive
@Trilarion I'd like to invite you (and Mateen) to check out my rework of the question. Including a function wrapper is useful, since it allows us to easily show the behaviour for different values of data. The issue with the db lines isn't so much that they were unnecessary, but that they didn't explain how data came to be an iterator.Guelph
A
70

An iterator can only be consumed once. For example:

data = [1, 2, 3]
it = iter(data)

next(it)
# => 1
next(it)
# => 2
next(it)
# => 3
next(it)
# => raises StopIteration

When the iterator is supplied to a for loop instead, that final StopIteration causes the loop to exit. Using the same iterator in another for loop raises StopIteration again immediately, because the iterator has already been consumed.

A simple way to work around this is to save all the elements to a list, which can be traversed as many times as needed. For example:

data = list(it)

However, if the separate iterations will advance through the elements at roughly the same time, it's a better idea to create independent iterators using tee():

import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed

Now each one can be iterated over separately:

next(it1)
# => 1
next(it1)
# => 2
next(it2)
# => 1
next(it2)
# => 2
next(it1)
# => 3
next(it2)
# => 3
Alfonse answered 16/8, 2014 at 3:45 Comment(3)
@ÓscarLópez Note from the documentation on tee: "This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee()." So if you're using it1 and it2 like you are in the example, you might not be getting any real benefit out of tee (while probably taking some extra overhead).Tyika
I support @Tyika: in this case tee will create a full copy of the iterator's values in a slightly less efficient way than a single list call. One should use tee not when there are a lot of elements in the iterable (this is not relevant), but when there is locality of usage; in that case tee's cache can be smaller than the whole list. For example, if two iterators go neck and neck, like in a zip(a, islice(b, 1, None)) call.Hamrick
@user2357112supportsMonica Your edits to this answer are being discussed on meta.Supramolecular
T
36

Iterators (e.g. from calling iter, from generator expressions, or from generator functions which yield) are stateful and can only be consumed once.

This is explained in Óscar López's answer. However, that answer's recommendation to use itertools.tee(data) instead of list(data) for performance reasons is misleading. In most cases, where you want to iterate through the whole of data and then iterate through the whole of it again, tee takes more time and uses more memory than simply consuming the whole iterator into a list and then iterating over it twice. According to the documentation:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

tee may be preferred if you will only consume the first few elements of each iterator, or if you will alternate between consuming a few elements from one iterator and then a few from the other.
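A case where the two iterators stay close together is the classic "pairwise" recipe, sketched below under the assumption that we want differences between neighbouring values (the readings data is invented for the example). Here tee only ever needs to buffer one element. Python 3.10 and later also ship this as itertools.pairwise.

```python
from itertools import tee

def pairwise(iterable):
    """Yield overlapping pairs (s0, s1), (s1, s2), ... from any iterable."""
    a, b = tee(iterable)
    next(b, None)      # advance the second iterator by one element
    return zip(a, b)   # both iterators now move forward in lockstep

readings = iter([3, 5, 4, 8])  # a one-shot iterator
differences = [later - earlier for earlier, later in pairwise(readings)]
print(differences)
```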

Trifurcate answered 16/8, 2014 at 3:42 Comment(2)
This would be more convincing with some concrete profiling results and/or theoretical examination of the work tee needs to do vs. creating an auxiliary list.Guelph
@KarlKnechtel This claim comes from the documentation - I've edited to include a quote and a link. I agree that some empirical analysis would be an improvement too.Trifurcate
B
13

Once an iterator is exhausted, it will not yield any more.

>>> it = iter([3, 1, 2])
>>> for x in it: print(x)
...
3
1
2
>>> for x in it: print(x)
...
>>>
Bakken answered 16/8, 2014 at 3:45 Comment(4)
that makes sense, but how do I get around it?Fractostratus
@JSchwartz, Convert the iterator into sequence object (list, tuple). Then iterate the sequence object. (Only if the size of the csv is not huge)Bakken
@JSchwartz, Alternatively, if you can access the underlying file object and it is seekable, you can change the file position before the second loop: csv_file_object.seek(0)Bakken
This answer is obsoleted by my attempt to improve the question as a canonical (after explaining the question as clearly as possible and giving concrete examples, the answer now repeats information present in the question). Sorry about that.Guelph
N
12

How do I loop over an iterator twice?

It is usually impossible. (Explained later.) Instead, do one of the following:

  • Collect the iterator into something that can be looped over multiple times.

    items = list(iterator)
    
    for item in items:
        ...
    

    Downside: This costs memory.

  • Create a new iterator. It usually takes only a microsecond to make a new iterator.

    for item in create_iterator():
        ...
    
    for item in create_iterator():
        ...
    

    Downside: Iteration itself may be expensive (e.g. reading from disk or network).

  • Reset the "iterator". For example, with file iterators:

    with open(...) as f:
        for item in f:
            ...
    
        f.seek(0)
    
        for item in f:
            ...
    

    Downside: Most iterators cannot be "reset".
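One way to package the "create a new iterator" option is a small wrapper that stores a factory and calls it on every loop. This is only a sketch: Reiterable is a hypothetical helper, not something from the standard library, and it assumes the factory really does produce the same elements each time.

```python
class Reiterable:
    """Hypothetical helper: wraps a zero-argument factory that builds a fresh iterator."""

    def __init__(self, factory):
        self._factory = factory

    def __iter__(self):
        # Each for loop calls __iter__, so each loop gets a brand-new iterator.
        return self._factory()

squares = Reiterable(lambda: (n * n for n in range(4)))

first_pass = list(squares)
second_pass = list(squares)
print(first_pass, second_pass)
```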


Philosophy of an Iterator

Typically, though not technically [1]:

  • Iterable: A for-loopable object that represents data. Examples: list, tuple, str.
  • Iterator: A pointer to some element of an iterable.

If we were to define a sequence iterator, it might look something like this:

class SequenceIterator:
    index: int
    items: Sequence  # Sequences can be randomly indexed via items[index].

    def __next__(self):
        """Increment index, and return the latest item."""

The important thing here is that typically, an iterator does not store any actual data inside itself.

Iterators usually model a temporary "stream" of data. That data source is consumed by the process of iteration. This is a good hint as to why one cannot loop over an arbitrary source of data more than once. We need to open a new temporary stream of data (i.e. create a new iterator) to do that.

Exhausting an Iterator

What happens when we extract items from an iterator, starting with the current element of the iterator, and continuing until it is entirely exhausted? That's what a for loop does:

iterable = "ABC"
iterator = iter(iterable)

for item in iterator:
    print(item)

Let's support this functionality in SequenceIterator by telling the for loop how to extract the next item:

class SequenceIterator:
    def __next__(self):
        item = self.items[self.index]
        self.index += 1
        return item

Hold on. What if index goes past the last element of items? We should raise a safe exception for that:

class SequenceIterator:
    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # Safely says, "no more items in iterator!"
        self.index += 1
        return item

Now, the for loop knows when to stop extracting items from the iterator.

What happens if we now try to loop over the iterator again?

iterable = "ABC"
iterator = iter(iterable)

# iterator.index == 0

for item in iterator:
    print(item)

# iterator.index == 3

for item in iterator:
    print(item)

# iterator.index == 3

Since the second loop starts from the current iterator.index, which is 3, it does not have anything else to print and so iterator.__next__ raises the StopIteration exception, causing the loop to end immediately.
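Putting the fragments together gives a version that can actually be run. This is a sketch for illustration: the __init__ and __iter__ methods are additions needed to make the class usable in a for loop, and the real built-in iterators are not implemented this way in Python code.

```python
class SequenceIterator:
    def __init__(self, items):
        self.items = items   # the underlying sequence; not copied
        self.index = 0       # position of the next element to return

    def __iter__(self):
        return self          # an iterator returns itself from __iter__

    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration from None
        self.index += 1
        return item

iterator = SequenceIterator("ABC")
first_pass = [item for item in iterator]
index_after_first = iterator.index
second_pass = [item for item in iterator]   # starts at index 3, so it is empty
print(first_pass, index_after_first, second_pass)
```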


[1] Technically:

  • Iterable: An object that returns an iterator when __iter__ is called on it.
  • Iterator: An object that one can repeatedly call __next__ on in a loop in order to extract items. Furthermore, calling __iter__ on it should return itself.

More details here.

Nilla answered 15/2, 2022 at 1:58 Comment(1)
There is a lot of good information here, but also several minor technical inaccuracies. I started trying to edit it, but ended up deciding I could do much better by starting over, with the material organized totally differently.Guelph
G
3

Why doesn't iterating work the second time for iterators?

It does "work", in the sense that the for loop in the examples does run. It simply performs zero iterations. This happens because the iterator is "exhausted"; it has already iterated over all of the elements.

Why does it work for other kinds of iterables?

Because, behind the scenes, a new iterator is created for each loop, based on that iterable. Creating the iterator from scratch means that it starts at the beginning.

This happens because iterating requires an iterator. If an iterator was already provided, it will be used as-is; but otherwise, a conversion is necessary, which creates a new object.
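A small sketch of that difference (the variable names are illustrative): asking a list for an iterator twice yields two independent objects, while asking an iterator for an iterator hands back the very same object.

```python
data = [1, 2, 3]

first = iter(data)
second = iter(data)
distinct = first is not second   # two separate iterator objects

next(first)                      # advance only the first one
second_value = next(second)      # the second still starts at the beginning

it = iter(data)
same_object = iter(it) is it     # an iterator is used as-is

print(distinct, second_value, same_object)
```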

Given an iterator, how can we iterate twice over the data?

By caching the data; starting over with a new iterator (assuming we can re-create the initial condition); or, if the iterator was specifically designed for it, seeking or resetting the iterator. Relatively few iterators offer seeking or resetting.

Caching

The only fully general approach is to remember what elements were seen (or determine what elements will be seen) the first time and iterate over them again. The simplest way is by creating a list or tuple from the iterator:

elements = list(iterator)
for element in elements:
    ...

for element in elements:
    ...

Since the list is a non-iterator iterable, each loop will create a new iterator that iterates over all the elements. If the original iterator is already "part way through" an iteration when we do this, the list will only contain the "following" elements:

abstract = (x for x in range(10)) # represents integers from 0 to 9 inclusive
next(abstract) # skips the 0
concrete = list(abstract) # makes a list with the rest
for element in concrete:
    print(element) # starts at 1, because the list does

for element in concrete:
    print(element) # also starts at 1, because a new iterator is created

A more sophisticated way is using itertools.tee. This essentially creates a "buffer" of elements from the original source as they're iterated over, and then creates and returns several custom iterators that work by remembering an index, fetching from the buffer if possible, and appending to the buffer (using the original iterable) when necessary. (In the reference implementation of modern Python versions, this does not use native Python code.)

from itertools import tee
concrete = list(range(10)) # `tee` works on any iterable, iterator or not
x, y = tee(concrete, 2) # the second argument is the number of instances.
for element in x:
    print(element)
    if element == 3:
        break

for element in y:
    print(element) # starts over at 0, taking 0, 1, 2, 3 from a buffer

Starting over

If we know and can recreate the starting conditions for the iterator when the iteration started, that also solves the problem. This is implicitly what happens when iterating multiple times over a list: the "starting conditions for the iterator" are just the contents of the list, and all the iterators created from it give the same results. For another example, if a generator function does not depend on an external state, we can simply call it again with the same parameters:

def powers_of(base, *range_args):
    for i in range(*range_args):
        yield base ** i

exhaustible = powers_of(2, 1, 12)

for value in exhaustible:
    print(value)

print('exhausted')

for value in exhaustible: # no results from here
    print(value)

# Want the same values again? Then call the same generator function again:
print('replenished')
for value in powers_of(2, 1, 12):
    print(value)

Seekable or resettable iterators

Some specific iterators may make it possible to "reset" iteration to the beginning, or even to "seek" to a specific point in the iteration. In general, iterators need to have some kind of internal state in order to keep track of "where" they are in the iteration. Making an iterator "seekable" or "resettable" simply means allowing external access to, respectively, modify or re-initialize that state.

Nothing in Python disallows this, but in many cases it's not feasible to provide a simple interface; in most other cases, it just isn't supported even though it might be trivial. For generator functions, on the other hand, the internal state in question is quite complex, and protects itself against modification.

The classic example of a seekable iterator is an open file object created using the built-in open function. The state in question is a position within the underlying file on disk; the .tell and .seek methods allow us to inspect and modify that position value - e.g. .seek(0) will set the position to the beginning of the file, effectively resetting the iterator. Similarly, csv.reader is a wrapper around a file; seeking within that file will therefore affect the subsequent results of iteration.
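The following sketch shows the rewind in action. To keep it self-contained it uses an in-memory io.StringIO in place of a file on disk (an assumption made for the example; the sample CSV text is invented), since StringIO offers the same .seek interface.

```python
import csv
import io

# In-memory stand-in for an open text file.
buffer = io.StringIO("name,score\nada,3\nbob,5\n")
reader = csv.reader(buffer)

first_pass = list(reader)
second_pass = list(reader)   # the underlying stream is at its end

buffer.seek(0)               # rewind the underlying stream
third_pass = list(reader)    # the same reader yields the rows again

print(first_pass, second_pass, third_pass)
```

Note that the reader keeps no record of the rewind, so bookkeeping such as reader.line_num simply keeps counting.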

In all but the simplest, deliberately-designed cases, rewinding an iterator will be difficult to impossible. Even if the iterator is designed to be seekable, this leaves the question of figuring out where to seek to - i.e., what the internal state was at the desired point in the iteration. In the case of the powers_of generator shown above, that's straightforward: just modify i. For a file, we'd need to know what the file position was at the beginning of the desired line, not just the line number. That's why the file interface provides .tell as well as .seek.

Here's a re-worked example of powers_of representing an unbounded sequence, and designed to be seekable, rewindable and resettable via an exponent property:

class PowersOf:
    def __init__(self, base):
        self._exponent = 0
        self._base = base
    def __iter__(self):
        return self
    def __next__(self):
        result = self._base ** self._exponent
        self._exponent += 1
        return result
    @property
    def exponent(self):
        return self._exponent
    @exponent.setter
    def exponent(self, new_value):
        if not isinstance(new_value, int):
            raise TypeError("must set with an integer")
        if new_value < 0:
            raise ValueError("can't set to negative value")
        self._exponent = new_value

Examples:

pot = PowersOf(2)
for i in pot:
    if i > 1000:
        break
    print(i)

pot.exponent = 5 # jump to this point in the (unbounded) sequence
print(next(pot)) # 32
print(next(pot)) # 64

Technical detail

Iterators vs. iterables

Recall that, briefly:

  • "iteration" means looking at each element in turn, of some abstract, conceptual sequence of values. This can include:
  • "iterable" means an object that represents such a sequence. (What the Python documentation calls a "sequence" is in fact more specific than that: basically it also needs to be finite and ordered.) Note that the elements do not need to be "stored" in memory, on disk or anywhere else; it is sufficient that we can determine them during the process of iteration.
  • "iterator" means an object that represents a process of iteration; in some sense, it keeps track of "where we are" in the iteration.

Combining the definitions, an iterable is something that represents elements that can be examined in a specified order; an iterator is something that allows us to examine elements in a specified order. Certainly an iterator "represents" those elements - since we can find out what they are, by examining them - and certainly they can be examined in a specified order - since that's what the iterator enables. So, we can conclude that an iterator is a kind of iterable - and Python's definitions agree.

How iteration works

In order to iterate, we need an iterator; but in normal cases (i.e. except in poorly written user-defined code), any iterable is permissible wherever Python iterates. Behind the scenes, Python will convert other iterables to corresponding iterators; the logic for this is available via the built-in iter function. To iterate, Python repeatedly asks the iterator for a "next element" until the iterator raises a StopIteration exception. The logic for this is available via the built-in next function.

Generally, when iter is given a single argument that already is an iterator, that same object is returned unchanged. But if it's some other kind of iterable, a new iterator object will be created. This directly leads to the problem in the OP. User-defined types can break both of these rules, but they probably shouldn't.
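That convention suggests a simple guard for functions that need two passes. The helper below is hypothetical (ensure_reiterable is not a standard function) and relies on the convention just described, so a user-defined type that breaks the rules could fool it.

```python
def ensure_reiterable(data):
    """Hypothetical helper: return something that can be looped over more than once."""
    if iter(data) is data:   # iterators return themselves from iter()
        return list(data)    # cache whatever elements remain
    return data              # already re-iterable; leave it untouched

numbers = [1, 2, 3]
same = ensure_reiterable(numbers)
cached = ensure_reiterable(x * 2 for x in numbers)

print(same is numbers, cached)
```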

The iterator protocol

Python roughly defines an "iterator protocol" that specifies how it decides whether a type is an iterable (or specifically an iterator), and how types can provide the iteration functionality. The details have changed slightly over the years, but the modern setup works like so:

  • Anything that has an __iter__ or a __getitem__ method is an iterable. Anything that defines an __iter__ method and a __next__ method is specifically an iterator. (Note in particular that if there is a __getitem__ and a __next__ but no __iter__, the __next__ has no particular meaning, and the object is a non-iterator iterable.)

  • Given a single argument, iter will attempt to call the __iter__ method of that argument, verify that the result has a __next__ method, and return that result. (It does not ensure the presence of an __iter__ method on the result. Such objects can often be used in places where an iterator is expected, but will fail if e.g. iter is called on them.) If there is no __iter__, it will look for __getitem__, and use that to create an instance of a built-in iterator type. That iterator is roughly equivalent to

class Iterator:
    def __init__(self, bound_getitem):
        self._index = 0
        self._bound_getitem = bound_getitem
    def __iter__(self):
        return self
    def __next__(self):
        try:
            result = self._bound_getitem(self._index)
        except IndexError:
            raise StopIteration
        self._index += 1
        return result
  • Given a single argument, next will attempt to call the __next__ method of that argument, allowing any StopIteration to propagate.

  • With all of this machinery in place, it is possible to implement a for loop in terms of while. Specifically, a loop like

for element in iterable:
    ...

will approximately translate to:

iterator = iter(iterable)
while True:
    try:
        element = next(iterator)
    except StopIteration:
        break
    ...

except that the iterator is not actually assigned any name (the syntax here is to emphasize that iter is only called once, and is called even if there are no iterations of the ... code).
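To see the __getitem__ fallback at work, here is a sketch with a made-up Countdown class that defines neither __iter__ nor __next__. Because iter builds a fresh index-based iterator for it each time, it can be looped over repeatedly.

```python
class Countdown:
    """Hypothetical class that supports indexing but defines no __iter__."""

    def __init__(self, start):
        self._start = start

    def __getitem__(self, index):
        if not 0 <= index <= self._start:
            raise IndexError(index)   # tells the fallback iterator to stop
        return self._start - index

countdown = Countdown(3)

first_pass = list(countdown)
second_pass = list(countdown)                     # a new iterator, so it starts over
has_iter = hasattr(countdown, "__iter__")
is_own_iterator = iter(countdown) is countdown    # it is not an iterator itself

print(first_pass, second_pass, has_iter, is_own_iterator)
```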

Guelph answered 9/1, 2023 at 1:9 Comment(1)
I ended up giving a lot more detail than I planned on, but the important points are all up front.Guelph
T
1

The other answers are all correct, but there is one more option that was not made explicit. It might be just a little hacky, but some situations demand a hacky solution.

Say you're given some function like this, which you are not allowed to modify:

def do_something(items):
    items_copy = list(items)
    
    for item in items:
        ...  # actual work

This function iterates over the items argument multiple times, so items can only be a sized collection (such as a list, tuple or set) to achieve the desired result, as otherwise the iterator will be exhausted after the call to list. So supplying a custom iterator to the for loop (such as a progress bar that advances at each iteration) seems out of the question without a rewrite of the function.

Or does it? Let's create a simple custom iterator that wraps a number of iterators and returns them one after the other:

class StaggeredChain:
    def __init__(self, *iters):
        self.iters = iter(iters)
    
    def __iter__(self):
        return iter(next(self.iters, ()))

Note that this differs from itertools.chain in that it can be iterated over multiple times and at each step behaves like the corresponding individual wrapped iterator:

>>> chained = StaggeredChain(range(5), range(4, -1, -1))
>>> list(chained)
[0, 1, 2, 3, 4]
>>> list(chained)
[4, 3, 2, 1, 0]
>>> list(chained)
[]

With this class, we can achieve the goal of adding a progress bar to the inner loop:

>>> from tqdm import tqdm
>>> vals = range(5)
>>> do_something(StaggeredChain(vals, tqdm(vals)))
100%|█████████████████████████████████|

(Aside: tqdm in this case will see the first iteration start at its own constructor and last until the first iteration of the loop ends, which might be a lot longer than just the loop iteration. Ideally you'd want to delay initialisation of the progress bar until that generator is actually next'ed, but that's a tqdm-specific detail. One way would be to change the constructor of StaggeredChain to __init__(self, iters) and pass in a single argument that generates the individual iterators.)
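A minimal sketch of that single-argument variant might look like this. LazyStaggeredChain and make_passes are hypothetical names, and the created list exists only to show when each wrapped iterable is actually built; with tqdm, the progress bar would be constructed inside the generator instead.

```python
class LazyStaggeredChain:
    """Variant of StaggeredChain taking one iterable that yields the wrapped iterables."""

    def __init__(self, iters):
        self.iters = iter(iters)

    def __iter__(self):
        return iter(next(self.iters, ()))

created = []  # records when each pass is actually built

def make_passes():
    for name, values in (("first", range(3)), ("second", range(2, -1, -1))):
        created.append(name)   # runs only when the chain asks for this pass
        yield values

chained = LazyStaggeredChain(make_passes())
created_before = list(created)        # nothing has been built yet
first_pass = list(chained)
created_after_first = list(created)
second_pass = list(chained)
third_pass = list(chained)            # no passes left, so this is empty

print(created_before, first_pass, created_after_first, second_pass, third_pass)
```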

If the requirement is only to repeat a given set of values a number of times and then stop, we can do something like this:

import itertools

class StaggeredRepeat:
    def __init__(self, vals, loops=1):
        self.iters = itertools.repeat(tuple(vals), loops)
    
    def __iter__(self):
        return iter(next(self.iters, ()))

Now you can iterate over a given collection a desired number of times:

>>> rep = StaggeredRepeat(range(5), 2)
>>> list(rep)
[0, 1, 2, 3, 4]
>>> list(rep)
[0, 1, 2, 3, 4]
>>> list(rep)
[]
Torsion answered 25/4, 2023 at 10:8 Comment(0)
