On demand data in Python, Part 2
The magic of itertools
Find a versatile toolkit for working with iterators in the standard library
This content is part # of # in the series: On demand data in Python, Part 2
This content is part of the series:On demand data in Python, Part 2
Stay tuned for additional content in this series.
In Part 1 of this series, you looked at Python iterators and the most straightforward way of creating your own iterators, which is by writing generator functions or expressions. Once you get the basics, you will quickly start to have all sorts of ideas for building on iterators. However, it’s a good idea to make sure you’re not reinventing the wheel.
Python’s standard library is a marvel of valuable support and tools, with some a bit more buried than others. There are several modules useful in manipulating iterators, but first you should look at some facilities available right in the built-ins, meaning for the most part, you don’t need to import anything to use them.
Using map to build on iterators
map function, you can process each item in an iterator, which leaves you with a new iterator of the results.
>>> r = range(10) >>> m = map(str, r) >>> next(m) '0' >>> list(m) ['1', '2', '3', '4', '5', '6', '7', '8', '9']
The first argument to map is the mapping function, the function to apply to each item.
In this case, I use
str to convert each integer from the range into a
string. When I request the first item from the map iterator, I get a string. I then use list to extract everything in the sequence. Notice that the list starts with
1 rather than
0. That’s because I already extracted the first item from the map. Iterators, unlike lists or tuples, do not support random access. You get items one at a time, in a forward direction.
Of course, you can write your own mapping function. It expects one input value from
map. Paste the following into your interpreter session:
def get_letter(i): return chr(ord('a')+i)
Here, I use
ord to get the numerical code corresponding to the character
'a' and add the function’s argument to derive a new character code. The function
chr converts this code back to a string character. Let’s use this function with
>>> r = range(10) >>> list(map(get_letter, r)) ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Map versus generator expressions
At this point, you probably remember Part 1 and realize this is much like a generator expression. Indeed, I could have done the following.
>>> list( ( get_letter(i) for i in range(10) ) ) ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Notice how I put the generator expression right in as an argument to
That should help illustrate that you can use generator expressions just like other
Python expressions. In this case, the
get_letter function is simple enough that you could even build it right into the generator expression.
>>> list( ( chr(ord('a')+i) for i in range(10) ) ) ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
This seems redundant, but map has existed in Python longer than generator expressions, and there are specialized cases where it’s still handy. For the most part though, you would probably use generator expressions for such simple manipulations of iterators.
Making reductions from iterators
Sometimes, you don’t want to turn one iterator into another but rather compute one single value from an iterator.
>>> sum(range(20, 30)) 245
sum function takes an iterator of numbers (think ints or floats) and returns a sum of all items in the iterator, in this case the sum of all numbers from 20 to 30. This is called a reduction, where a series of values is reduced to one new value. Python also provides a generalized function for reduction. Start by pasting in the following function:
def sum_of_squares(acc, i): return acc + i**2
Use this to reduce as follows. You need the
>>> import functools >>> functools.reduce(sum_of_squares, range(10)) 285
The first argument is a function that takes two arguments. It takes an accumulated value and then it takes a new value to combine into an updated accumulated value in some way, which is then returned. In this case, the accumulated value is the sum of squares of items so far.
Just as generator expressions can be more readable than using
are often more readable alternatives to reduce, but it’s useful to have for certain
specialized uses (and it’s a conceptual counterpoint to
The idea of
map is to take an iterator and generate a derived iterator. The idea of
reduce is to take an iterator and return a single value derived from all its items. This combination of concepts turns out to be a powerful way to think about processing large quantities of data.
The string.join method
I want to mention a special method that acts as a reduction.
>>> tenletters = ( chr(ord('a')+i) for i in range(10) ) >>> '!'.join(tenletters) 'a!b!c!d!e!f!g!h!i!j'
join method on a string takes an iterator and creates a new string by
joining together the iterator item strings with a specified string in between. Playing around a bit more:
>>> '!'.join(tenletters) '' >>> tenletters = ( chr(ord('a')+i) for i in range(10) ) >>> ''.join(tenletters) 'abcdefghij' >>> tenletters = ( chr(ord('a')+i) for i in range(10) ) >>> ' and '.join(tenletters) 'a and b and c and d and e and f and g and h and i and j'
The output from the first line is a good reminder that iterators are one way, and once used up, you won’t get any more out of them. Pay attention because this is a common cause of bugs in code using iterators. In the next line, I set up a new generator, and you can see that a join on the empty string just concatenates all the strings in the given iterator. You can also join with longer strings as demonstrated in the final couple of lines.
There is another built-in function you can use to create an iterator containing only some subset of its input. In Part 1, I presented the following example.
>>> import math >>> notby2or3 = ( n for n in range(1, 20) if math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1 ) >>> list(notby2or3) [1, 5, 7, 11, 13, 17, 19]
Go back and check Part
1 if you need a refresher on the
notby2or3 function. You can
implement the equivalent of this generator expression using the filter
function, which (again) is just good to be aware of.
>>> import math >>> def notby2or3(n): ... return math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1 ... >>> list(filter(notby2or3, range(1, 20))) [1, 5, 7, 11, 13, 17, 19]
I could have written this with one or two lines fewer using what’s called a
lambda function, but that’s outside the scope of this tutorial series.
There are many small patterns for using iterators that come up over and
over again. The
itertools module provides high-performance implementations of many of these, which are well worth learning.
Suppose you’d like to create a longer iterator that concatenates a sequence of other iterators:
>>> import itertools >>> it = itertools.chain(range(5), range(5, 0, -1), range(5)) >>> list(it) [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4]
itertools.chain the iterators as function arguments and
it produces a chain of the outputs of these. If you have a list or tuple sequence of iterators, you can use Python's special syntax for variable positional function arguments.
>>> list_of_iters = [range(5), range(5, 0, -1), range(5)] >>> it = itertools.chain(*list_of_iters) >>> list(it) [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4]
You can use any iterator for functional variable arguments. It doesn’t have to be a list or tuple. Paste in the following:
def forth_back_forth(n): yield range(n) yield range(n, 0, -1) yield range(n)
You can use this generator to feed itertools.chain its arguments:
>>> it = itertools.chain(*forth_back_forth(3)) >>> list(it) [0, 1, 2, 3, 2, 1, 0, 1, 2]
The combination of
itertools.chain makes me wonder what it would be like to continue that back/forth/back number pattern even further. Paste in the following:
def zigzag_iters(period): while True: yield range(period) yield range(period, 0, -1)
Then, try it out:
>>> it = zigzag_iters(2) >>> next(it) range(0, 2) >>> next(it) range(2, 0, -1) >>> next(it) range(0, 2) >>> next(it) range(2, 0, -1) >>> next(it) range(0, 2)
If you continue to enter
next(it), it continues to cycle
between the two output ranges over and over because
True implements an infinite loop, and there is no code within it to break the
loop’s execution. This generator is in a constant state of suspension and
resumption until the Python process (the interactive interpreter, in this
case) terminates. Or you could use the
close() method on any generator object to end it regardless of where it is in execution.
You might have noticed that I’m not using
list to illustrate
the iterator's contents. If you try to do this, you are basically trying to create an infinite list, and of course, you will run out of memory.
Remember also that this is returning iterators, not items. With
forth_back_forth, the idea was to use
itertools.chain to get the items from those iterators. You
cannot do the same thing with
zigzag_iters because this also
tries to create a collection of infinite items, and all the arguments to a
function must be known before the function is executed, even in the case
of variable arguments. This would would mean the Python runtime would be trying to get all items from an infinite iterator, which is not possible.
Infinite iterators are actually a useful concept once you start using generators. But, you have to be aware of operations that attempt to build any collection from the contents of such iterators.
Because you can’t use
zigzag_iters, create a more direct version.
def zigzag(period): while True: for n in range(period): yield n for n in range(period, 0, -1): yield n
Paste this in and see how the iterator continues indefinitely:
>>> it = zigzag(2) >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1
This is another common pattern where an outer iterator is a cascade of one
or more inner iterators. It’s common enough that Python provides a
standard syntax for it, the
yield from statement, as you can see from the following rewrite of
zigzag that is functionally identical.
def zigzag(period): while True: yield from range(period) yield from range(period, 0, -1)
If you think of an iterator providing the entirety of another iterator,
yield from. The provided iterator will be completely
consumed in this case. If it is an infinite iterator, the outer iterator will also be infinite and continue on the
yield from statement indefinitely.
A variant on itertools.chain
One final note, for completeness: Earlier, I said that you can’t use the
zigzag_iters function with
itertools.chain because it would attempt to create an infinite collection of function arguments. The minds behind
itertools also thought of that and provided a variant that does work,
>>> it = itertools.chain.from_iterable(zigzag_iters(2)) >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1
More from itertools
You’ll find in
itertools some handy routines for dealing with
infinite iterators. One is
itertools.islice. With this, you
can extract a subset sequence from an iterator, including from an infinite iterator.
>>> it1 = zigzag(5) >>> it2 = itertools.islice(it1, 0, 20) >>> list(it2) [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 4, 3, 2, 1] >>> it2 = itertools.islice(it1, 1, 20) >>> list(it2) [1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 4, 3, 2, 1] >>> it2 = itertools.islice(it1, 0, 20, 2) >>> list(it2) [0, 2, 4, 4, 2, 0, 2, 4, 4, 2]
The function takes the iterator to excerpt from, the starting index, and
the ending index. You can also optionally provide a step, so in some ways,
it feels similar to
range. Be warned though: Unlike with
range, a negative step value is not allowed.
One of the most useful bits of
itertools is also one of its trickiest, but well worth understanding.
itertools.groupby to segment iterators according to some criteria. Here is one that groups hours into quadrants of a twelve-hour period.
>>> hours = range(12) >>> def divide_by_3(n): return n//3 ... >>> it = itertools.groupby(hours, divide_by_3) >>> next(it) (0, <itertools._grouper object at 0x1075ac9e8>)
itertools.groupby with an input iterator and a grouping
function, which is executed on each item in that iterator. When the result of the grouping function changes from one item to the next, a new group is created, with all the items until the next change in grouping function value. The grouping function,
divide_by_3, uses the floor division operator, //, to divide and then round down to the next smallest integer.
The Figure 1 helps visualize the operation of
itertools.groupby in this example.
Figure 1. A visualization of itertools.groupby
The key to understanding
itertools.groupby is remembering that
the change in the value of the grouping function is what leads to the
grouped items in the output. With some practice, you’ll be able to expertly write grouping functions for all sorts of use cases.
Let’s dig a bit more into our
>>> hours = range(12) >>> for quadrant, group_it in itertools.groupby(hours, divide_by_3): ... print('Items in quadrant', quadrant) ... for i in group_it: ... print(i) ... Items in quadrant 0 0 1 2 Items in quadrant 1 3 4 5 Items in quadrant 2 6 7 8 Items in quadrant 3 9 10 11
Here, the outer loop invokes
itertools.groupby, but each item
in the loop is a tuple. The second element in this tuple,
group_it, is in turn an iterator, as you can see in Figure 1. So, the inner loop works on the iterator of each group.
Grouping with infinite iterators
To pull together a couple of concepts from this tutorial, you can use grouping with infinite iterators. Start by looking at a feature from itertools, which neatly creates infinite iterators.
>>> inf_it = itertools.cycle(range(12)) >>> print(list(itertools.islice(inf_it, 32))) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7]
In the first line, I use
itertools.cycle, which creates an
infinite cycle over a finite iterator you provide. In this case, it just
generates the 0 to 11 sequence over and over again. I use
itertools.islice to safely take a subset of it and print that as a list. There is also
itertools.repeat, which takes a single value and iterates that one value over and over.
itertools.cycle is the last tool to explore in this tutorial.
>>> inf_it = itertools.cycle(range(12)) >>> hours_32 = itertools.islice(inf_it, 32) >>> for quadrant, group_it in itertools.groupby(hours_32, divide_by_3): ... print('Items in quadrant', quadrant, ':', list(group_it)) ... Items in quadrant 0 : [0, 1, 2] Items in quadrant 1 : [3, 4, 5] Items in quadrant 2 : [6, 7, 8] Items in quadrant 3 : [9, 10, 11] Items in quadrant 0 : [0, 1, 2] Items in quadrant 1 : [3, 4, 5] Items in quadrant 2 : [6, 7, 8] Items in quadrant 3 : [9, 10, 11] Items in quadrant 0 : [0, 1, 2] Items in quadrant 1 : [3, 4, 5] Items in quadrant 2 : [6, 7]
Study this session carefully to get a valuable grasp of how these iterator tools interact. For example, you can see how the cycle of the hour numbers leads to a cycle of the grouping function results.
You're starting to see how compactly you can pull together basic building blocks to produce interesting patterns of iterators. As you combine such facilities with your own specialized generators, this becomes a powerful paradigm for breaking down problems dealing with data in series efficiently.
Go forth and iterate
There is more to itertools than I was able to cover in this tutorial, but you now have a basic idea of some of its offerings, and a good look at one of its trickiest but potentially most valuable facilities,
As with the entire standard library, you can find comprehensive documentation of the module, but as usual, this is more for a well-versed developer to check on details. It’s not designed as a primer. I recommend that you start with the building blocks in this tutorial and the previous one, and experiment bit-by-bit with the features of itertools, and the other facilities available in built-ins and the standard library for working with iterators.
Take a trip on the even wilder side in Part 3 of this series. You'll explore
coroutines, a special sort of generator function. You'll also learn about another powerful but tricky standard library module: