Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22871

The Python Coding Blog: Beware Python Iterators That Are Not Independent

$
0
0

Python iterators are cool. And very useful. Often, you use them to perform operations on existing data without creating copies of those data.

An iterator is a Python object that represents a stream of data. This means it allows you to go from one item from a data source to the next. You’ll see a couple of short examples soon to get you started.

However, an iterator is dependent on the data from which you created it. So what happens if you create two or more iterators from the same data? The iterators are all dependant on the same data, so are they dependent on each other, too? This article will explore the topic of iterator independence. Sounds weird and abstract? Lots of things are at first. I’ll go through why this is relevant and interesting in this article.

What’s the context? I often like to dive randomly into documentation pages, find a function that I’ve not seen or used before, and explore it. Some hobby I have, you’re thinking. I came across itertools.tee(), which had a curious name. Is this the golfing term tee or the letter ‘T’, or perhaps the tea you drink spelt incorrectly? Who knows? (It’s actually the second option, I later found out, but by now, I was sufficiently intrigued to explore further.)

This led me to itertools.islice() and to the issue of when iterators are dependent of each other, or not. This article will look at these two itertools functions, too.

Looking At Python Iterators

Start with a list of numbers:

numbers = [1, 3, 5, 7, 9]

A list is an iterable. This means you can iterate through it. Even simpler, it means you can use it in a for loop as the object that comes after the in keyword in the for loop statement.

When you use a list in a for loop, its iterator is created. Whereas the list contains all the data, the iterator made from it only represents the stream of data and doesn’t store all the items. The iterator created from a list relies on the data stored in the list.

Let’s see why this matters:

>>> numbers = [1, 3, 5, 7, 9]
>>> numbers_iter = iter(numbers)  # Create the iterator

>>> next(numbers_iter)
1
>>> next(numbers_iter)
3

Each time you call next() on the iterator, you get the next value from the data the iterator is based on. But, before you call next() again to get the value that’s next in line, try removing a value from the original list:

# Same session as previous code snippet
>>> numbers.remove(1)
>>> numbers
[3, 5, 7, 9]

>>> next(numbers_iter)
7

You’ve removed the value 1 from the list which was the first item in the list. Note that the iterator had already gone through this value. You might expect that removing the first element from the list will not have any effect on the iterator which has already gone past that point.

You can see from the result of next() that that’s not what happens. The call to next() doesn’t return 5 but 7. When you removed the first element from the list, all the other elements fell down by one place. And when you call next() on the iterator, the number 5 is skipped.

You can look at another example using zip():

>>> first_names = ["Isaac", "René", "Christiaan", "Pierre", "Gottfried"]
>>> last_names = ["Newton", "Descartes", "Huygens", "de Fermat", "Leibniz"]

>>> full_names = zip(first_names, last_names)

>>> next(full_names)
('Isaac', 'Newton')
>>> next(full_names)
('René', 'Descartes')

# Now, remove an item from first_names
>>> first_names.remove("Isaac")
>>> next(full_names)
('Pierre', 'Huygens')

By using zip(), you get an iterator which uses the data from the lists first_names and last_names. The data are not duplicated, saving memory. The first two calls of next(full_names) give you the result you expect. But you then remove the name "Isaac" from first_names and the next call to next() creates a hybrid 17th century scientist called Pierre Huygens!

Let’s now fast forward to my serendipitous encounter with itertools.tee()

Exploring itertools.tee()

So I read the docs for itertools.tee(). The function “returns n independent iterators from a single iterable”. OK, seems simple enough, no? You’ve read about iterables and iterators above. The documentation goes on to show code that’s equivalent to what tee() does. This is one of those instances where the Python docs weren’t enough for me to say “Ah, great, it’s all very clear now.”

So I googled a bit more and found lots of dry examples showing how tee() works in a four-line-code-snippet-type example. They show whatitertools.tee() does. But they don’t shed any light on why you’d want to use it and when.

Luckily, it didn’t take long to find David Amos’s RealPython article. Finally, some sense. Do read this overview of itertools through lots of great examples. But first, finish reading this article, of course!

The boring bit

So, I’m now contractually obliged to give you one of those dry examples that shows you what itertools.tee() does, but nothing else. Don’t worry. Better examples are coming later on!

>>> import itertools
>>> original_generator = (number for number in range(10))
>>> original_generator
<generator object <genexpr> at 0x7fec3027a4a0>>>> first, second, third = itertools.tee(original_generator, 3)
>>> first
<itertools._tee object at 0x7fec3028a500>>>> second
<itertools._tee object at 0x7fec3028a140>>>> third
<itertools._tee object at 0x7fec3028acc0>

As the documentation said, tee() returns independent iterators from the original iterable. All three will iterate through all the items in the original iterable. The iterators returned are _tee objects. In this case, the original iterable is a generator.

The second argument in tee() determines how many independent iterators the function returns. Let’s check they’re independent:

# Get the first two values from `first`
>>> next(first)
0
>>> next(first)
1

# Now exhaust `second` fully
>>> for item in second:
...     print(item)

0
1
2
3
4
5
6
7
8
9

# And get a value from `third`
>>> next(third)
0

Each of the three iterators first, second, and third go through values independently from each other. When you looped through second, the code printed all numbers from 0 to 9 even though you had already used up 0 and 1 in first. And third was still untouched!

Note that the three iterators are independent of each other, but they’re not independent of the original generator:

# Recreate the original generator and the three independent iterators
>>> original_generator = (number for number in range(10))
>>> first, second, third = itertools.tee(original_generator, 3)

# Use up the first two values from the original generator
>>> next(original_generator)
0
>>> next(original_generator)
1

# The iterators from tee() start from where you've just left off!
>>> next(first)
2
>>> next(second)
2

You’ll return to tee() later to see how and when it can be useful. You’ll also revisit the issue of when generators are and aren’t independent of each other.

Exploring itertools.islice()

Let’s dive into another function in itertools. You can create a slice in an iterable by using itertools.islice(). This returns an iterator. The concept is similar to slicing through sequences in the normal way with the difference that the result is an iterator:

>>> import itertools
>>> original_generator = (number for number in range(10))
>>> a_slice = itertools.islice(original_generator, 4, 8)
>>> a_slice
<itertools.islice object at 0x7fec3026d720>>>> next(a_slice)
4
>>> next(a_slice)
5
>>> next(a_slice)
6
>>> next(a_slice)
7
>>> next(a_slice)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
StopIteration

# But, be careful!
>>> next(original_generator)
8

The iterator slice you created starts from the value at index 4 and goes up to, but excluding, the value at index 8. You’ve set these values using the second and third arguments in islice().

You can see how you call next(a_slice) four times successfully. These calls return 4, 5, 6, and 7. However, when you call next(a_slice) again, you get a StopIteration error as the islice iterator is exhausted.

What about original_generator? So far, you haven’t explicitly used original_generator except for creating the islice. However, the result of next(original_generator) is 8. This means that original_generator and a_slice are not independent. When you advanced through a_slice, you also advanced through original_generator because a_slice depends on original_generator in a similar way to how, earlier in this article, the iterator you created from a list was dependant on the data in the list.

How to Shuffle a Generator in Python Without Converting Into A List

You’ve set yourself the task to shuffle the values in a generator without ever converting it into a list. This is a drill or an exercise as in real-world programs you’re likely to be better off converting into a list. But for the sake of this exercise, let’s try this challenge.

I will stick with the simple generator with numbers from 0 to 9 in this example. Of course, if you wanted a generator with random numbers from 0 to 9, you could create one directly. However, this is not the case for other generators you may have in your code. I’ll keep using this example as it’s easy to demonstrate what’s going on.

You can’t use functions such as random.shuffle() or numpy.random.shuffle() on a generator:

>>> import random
>>> original_generator = (number for number in range(10))

>>> random.shuffle(original_generator)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/random.py", line 359, in shuffle
    for i in reversed(range(1, len(x))):
TypeError: object of type 'generator' has no len()

A generator doesn’t have a __len__ attribute. Therefore, these functions can’t work. This is true for iterators in general. Whereas an iterable such as a list has a length, and iterator doesn’t.

The solution in many cases is to convert the generator into a list, shuffle the list, and then convert it back into a generator if that’s what you’d like to have:

>>> import random
>>> original_generator = (number for number in range(10))

>>> numbers = list(original_generator)
>>> random.shuffle(numbers)
>>> numbers
[3, 7, 6, 5, 2, 0, 8, 9, 1, 4]

>>> new_generator = (number for number in numbers)

Often, this is a perfectly good solution. But, for the sake of this exercise, you’ve set yourself the task to avoid converting the generator into a list.

Luckily, itertools and the two functions you’ve explored earlier in this article can come to the rescue.

Planning and Writing the Code

The technique you’ll use here is the following:

  • Create two independent iterators from the original generator
  • Choose a random index and slice the two iterators using this index so that one has the first part of the original and the other has the second part
  • Yield the value at the location of the split
  • Merge the remaining parts back into a single iterator and repeat the process until you’ve used up all the values in the original generator

This method is inspired by David Amos’s example in the article I mentioned in the introduction.

You can start by creating the generator you’ve already used several times in this article and define a generator function using the yield keyword. I’ll use a script for this example rather than the console sessions I used earlier.

# shuffle_generators.py

n = 10
original_generator = (number for number in range(n))

def randomise_generator(original, length):
    while True:
        yield

new_generator = randomise_generator(original_generator, n)

for number in new_generator:
    print(number)

The generator function randomise_generator() yields None forever for the time being. You’ll fix this soon.

You’ve also written code to create a new generator from the generator function randomise_generator() and test it by going through the new generator using a for loop.

If you run this code now, it will print None forever!

First attempt: Just using islice()

Let’s try to use itertools.islice() directly on the original generator first. Spoiler alert: this won’t work. But let’s see why:

# shuffle_generators.py

import itertools
import random

n = 10
original_generator = (number for number in range(n))

def randomise_generator(original, length):
    while True:
        idx = random.randint(0, length - 1)
        first_part = itertools.islice(original, idx)
        second_part = itertools.islice(original, idx, None)
        yield next(second_part)
        original = itertools.chain(first_part, second_part)
        length -= 1
        if length == 0:
            return

new_generator = randomise_generator(original_generator, n)

for number in new_generator:
    print(number)

You’re first choosing a random index where you’ll split your generator. Next, you use this index to create two iterator slices from the original generator. Note that when you use islice() with two arguments, the second argument is the stop parameter and the start defaults to index 0. Therefore, first_part is a slice from the beginning of the original generator up to, but excluding, the value with index idx.

When you call islice() with three arguments, the second and third are the start and stop parameters. If the third is None, the slice goes to the end.

Next, you yield the first value of second_part. This is the value just after the point where you split the generator into two.

Following the yield statement, you put the two remaining parts together again using itertools.chain(). The plan is to merge the remaining parts of the original iterator minus the one value you’ve removed already.

You decrease the value of length by 1 to account for the element you’ve removed and yielded already and put in a condition to end the generator function when there are no more elements left.

You run this code, and you get this:

0
4
9
Traceback (most recent call last):
  File "<file_path>", line 15, in randomise_generator
    yield next(second_part)
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<file_path>", line 23, in <module>
    for number in new_generator:
RuntimeError: generator raised StopIteration

Both the values and the number of outputs you’ll get before the error will be different each time you run this code. But you’ll always end up with the StopIteration error.

Let’s investigate this problem by going back into the console. In this example, you’re splitting the generator at index 6:

>>> import itertools
>>> original_generator = (number for number in range(10))
>>> first_part = itertools.islice(original_generator, 6)
>>> second_part = itertools.islice(original_generator, 6, None)

>>> for number in first_part:
...     print(number)

0
1
2
3
4
5

>>> for number in second_part:
...     print(number)


>>> 
# There's no output from the second `for` loop

You intend to create two iterator slices. One from 0 to 5 and the other from 6 to 9. The first islice seems to be correct. When you iterate through first_part, you get the expected numbers.

However, when you iterate through second_part you get no output. The iterator second_part is empty.

You can check if the same thing happens if you use second_part before first_part. Remember you’ll need to recreate the original generator and the slices each time:

>>> original_generator = (number for number in range(10))
>>> first_part = itertools.islice(original_generator, 6)
>>> second_part = itertools.islice(original_generator, 6, None)

>>> for number in second_part:
...     print(number)

6
7
8
9

>>> for number in first_part:
...     print(number)

>>>
# Again, no output from the second loop

This time, it’s first_part that’s empty. This is because the iterator slices are not independent of the original generator. When you exhaust an iterator slice, you’re also using up the original generator. You’ve seen this issue earlier in this article when you first read about itertools.islice()

Second attempt: tee() to the rescue

This is where itertools.tee() comes in useful. This function creates two independent iterators from an iterable. The independence is the important part here!

To be able to shuffle a generator in Python, you can update the code to include itertools.tee():

# shuffle_generators.py

import itertools
import random

n = 10
original_generator = (number for number in range(n))

def randomise_generator(original, length):
    while True:
        idx = random.randint(0, length - 1)
        first_iter, second_iter = itertools.tee(original, 2)
        first_part = itertools.islice(first_iter, idx)
        second_part = itertools.islice(second_iter, idx, None)
        yield next(second_part)
        original = itertools.chain(first_part, second_part)
        length -= 1
        if length == 0:
            return

new_generator = randomise_generator(original_generator, n)

for number in new_generator:
    print(number)

First, you create first_iter and second_iter using itertools.tee(). Both iterators go through all the elements of the original generator, but they’re independent of each other.

Next, you create iterator slices from first_iter and second_iter. You no longer have the problem you encountered in the previous section as these are independent iterators now.

You can verify this in the console:

>>> import itertools
>>> original_generator = (number for number in range(10))
>>> first_iter, second_iter = itertools.tee(original_generator, 2)
>>> first_part = itertools.islice(first_iter, 6)
>>> second_part = itertools.islice(second_iter, 6, None)

>>> for number in first_part:
...     print(number)

0
1
2
3
4
5

>>> for number in second_part:
...     print(number)

6
7
8
9

In this example, first_part goes from 0 to 5 and second_part goes from 6 to 9. Independence problem solved!

You can run the shuffle_generators.py script now. You’ll verify that new_generator is a generator which has all the values in original_generator, but they’ve been shuffled:

5
8
6
7
1
0
2
3
9
4

This way of shuffling a generator is not very efficient, so if you ever need to shuffle a generator, you’re better off converting to a list first!

Final Words

In this article, you’ve explored a bit how Python iterators work and how they are dependent on the original data they’re created from. You’ve also looked at two functions which create iterators from the itertools module.

When using iterators, keep in mind that the iterators depend on the data you’ve created them from. This information could save you hours looking for hard-to-find bugs in some cases when you’re using iterators!


You may also like the article on stacks, queues, and deques


Get the latest blog updates

No spam promise. You’ll get an email when a new blog post is published


The post Beware Python Iterators That Are Not Independent appeared first on The Python Coding Book.


Viewing all articles
Browse latest Browse all 22871

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>