On demand data in Python, Part 2

# The magic of itertools

Find a versatile toolkit for working with iterators in the standard library

### Content series:

## This content is part # of # in the series: On demand data in Python, Part 2

## This content is part of the series:On demand data in Python, Part 2

Stay tuned for additional content in this series.

In Part 1 of this series, you looked at Python iterators and the most straightforward way of creating your own iterators, which is by writing generator functions or expressions. Once you get the basics, you will quickly start to have all sorts of ideas for building on iterators. However, it’s a good idea to make sure you’re not reinventing the wheel.

Python’s standard library is a marvel of valuable support and tools, with some a bit more buried than others. There are several modules useful in manipulating iterators, but first you should look at some facilities available right in the built-ins, meaning for the most part, you don’t need to import anything to use them.

## Using map to build on iterators

With the `map`

function, you can process each item in an iterator, which leaves you with a new iterator of the results.

>>> r = range(10) >>> m = map(str, r) >>> next(m) '0' >>> list(m) ['1', '2', '3', '4', '5', '6', '7', '8', '9']

The first argument to map is the mapping function, the function to apply to each item.
In this case, I use `str`

to convert each integer from the range into a
string. When I request the first item from the map iterator, I get a string. I then use list to extract everything in the sequence. Notice that the list starts with `1`

rather than `0`

. That’s because I already extracted the first item from the map. Iterators, unlike lists or tuples, do not support random access. You get items one at a time, in a forward direction.

Of course, you can write your own mapping function. It expects one input value from
`map`

. Paste the following into your interpreter session:

def get_letter(i): return chr(ord('a')+i)

Here, I use `ord`

to get the numerical code corresponding to the character
`'a'`

and add the function’s argument to derive a new character code. The function `chr`

converts this code back to a string character. Let’s use this function with `map`

.

>>> r = range(10) >>> list(map(get_letter, r)) ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

### Map versus generator expressions

At this point, you probably remember Part 1 and realize this is much like a generator expression. Indeed, I could have done the following.

>>> list( ( get_letter(i) for i in range(10) ) ) ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Notice how I put the generator expression right in as an argument to `list`

.
That should help illustrate that you can use generator expressions just like other
Python expressions. In this case, the `get_letter`

function is simple enough that you could even build it right into the generator expression.

>>> list( ( chr(ord('a')+i) for i in range(10) ) ) ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

This seems redundant, but map has existed in Python longer than generator expressions, and there are specialized cases where it’s still handy. For the most part though, you would probably use generator expressions for such simple manipulations of iterators.

## Making reductions from iterators

Sometimes, you don’t want to turn one iterator into another but rather compute one single value from an iterator.

>>> sum(range(20, 30)) 245

The `sum`

function takes an iterator of numbers (think ints or floats) and returns a sum of all items in the iterator, in this case the sum of all numbers from 20 to 30. This is called a reduction, where a series of values is reduced to one new value. Python also provides a generalized function for reduction. Start by pasting in the following function:

def sum_of_squares(acc, i): return acc + i**2

Use this to reduce as follows. You need the `functools`

module.

>>> import functools >>> functools.reduce(sum_of_squares, range(10)) 285

The first argument is a function that takes two arguments. It takes an accumulated value and then it takes a new value to combine into an updated accumulated value in some way, which is then returned. In this case, the accumulated value is the sum of squares of items so far.

Just as generator expressions can be more readable than using `map`

, there
are often more readable alternatives to reduce, but it’s useful to have for certain
specialized uses (and it’s a conceptual counterpoint to `map`

).

The idea of `map`

is to take an iterator and generate a derived iterator. The idea of `reduce`

is to take an iterator and return a single value derived from all its items. This combination of concepts turns out to be a powerful way to think about processing large quantities of data.

### The string.join method

I want to mention a special method that acts as a reduction.

>>> tenletters = ( chr(ord('a')+i) for i in range(10) ) >>> '!'.join(tenletters) 'a!b!c!d!e!f!g!h!i!j'

The `join`

method on a string takes an iterator and creates a new string by
joining together the iterator item strings with a specified string in between. Playing around a bit more:

>>> '!'.join(tenletters) '' >>> tenletters = ( chr(ord('a')+i) for i in range(10) ) >>> ''.join(tenletters) 'abcdefghij' >>> tenletters = ( chr(ord('a')+i) for i in range(10) ) >>> ' and '.join(tenletters) 'a and b and c and d and e and f and g and h and i and j'

The output from the first line is a good reminder that iterators are one way, and once used up, you won’t get any more out of them. Pay attention because this is a common cause of bugs in code using iterators. In the next line, I set up a new generator, and you can see that a join on the empty string just concatenates all the strings in the given iterator. You can also join with longer strings as demonstrated in the final couple of lines.

### Filtering

There is another built-in function you can use to create an iterator containing only some subset of its input. In Part 1, I presented the following example.

>>> import math >>> notby2or3 = ( n for n in range(1, 20) if math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1 ) >>> list(notby2or3) [1, 5, 7, 11, 13, 17, 19]

Go back and check Part
1 if you need a refresher on the `notby2or3`

function. You can
implement the equivalent of this generator expression using the filter
function, which (again) is just good to be aware of.

>>> import math >>> def notby2or3(n): ... return math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1 ... >>> list(filter(notby2or3, range(1, 20))) [1, 5, 7, 11, 13, 17, 19]

I could have written this with one or two lines fewer using what’s called a `lambda`

function, but that’s outside the scope of this tutorial series.

## Introducing itertools

There are many small patterns for using iterators that come up over and
over again. The `itertools`

module provides high-performance implementations of many of these, which are well worth learning.

Suppose you’d like to create a longer iterator that concatenates a sequence of other iterators:

>>> import itertools >>> it = itertools.chain(range(5), range(5, 0, -1), range(5)) >>> list(it) [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4]

Pass `itertools.chain`

the iterators as function arguments and
it produces a chain of the outputs of these. If you have a list or tuple sequence of iterators, you can use Python's special syntax for variable positional function arguments.

>>> list_of_iters = [range(5), range(5, 0, -1), range(5)] >>> it = itertools.chain(*list_of_iters) >>> list(it) [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4]

You can use any iterator for functional variable arguments. It doesn’t have to be a list or tuple. Paste in the following:

def forth_back_forth(n): yield range(n) yield range(n, 0, -1) yield range(n)

You can use this generator to feed itertools.chain its arguments:

>>> it = itertools.chain(*forth_back_forth(3)) >>> list(it) [0, 1, 2, 3, 2, 1, 0, 1, 2]

### Infinite iterators

The combination of `forth_back_forth`

and `itertools.chain`

makes me wonder what it would be like to continue that back/forth/back number pattern even further. Paste in the following:

def zigzag_iters(period): while True: yield range(period) yield range(period, 0, -1)

Then, try it out:

>>> it = zigzag_iters(2) >>> next(it) range(0, 2) >>> next(it) range(2, 0, -1) >>> next(it) range(0, 2) >>> next(it) range(2, 0, -1) >>> next(it) range(0, 2)

If you continue to enter `next(it)`

, it continues to cycle
between the two output ranges over and over because ```
while
True
```

implements an infinite loop, and there is no code within it to break the
loop’s execution. This generator is in a constant state of suspension and
resumption until the Python process (the interactive interpreter, in this
case) terminates. Or you could use the `close()`

method on any generator object to end it regardless of where it is in execution.

You might have noticed that I’m not using `list`

to illustrate
the iterator's contents. If you try to do this, you are basically trying to create an infinite list, and of course, you will run out of memory.

Remember also that this is returning iterators, not items. With
`forth_back_forth`

, the idea was to use
`itertools.chain`

to get the items from those iterators. You
cannot do the same thing with `zigzag_iters`

because this also
tries to create a collection of infinite items, and all the arguments to a
function must be known before the function is executed, even in the case
of variable arguments. This would would mean the Python runtime would be trying to get all items from an infinite iterator, which is not possible.

Infinite iterators are actually a useful concept once you start using generators. But, you have to be aware of operations that attempt to build any collection from the contents of such iterators.

## Cascading iterators

Because you can’t use `itertools.chain`

with `zigzag_iters`

, create a more direct version.

def zigzag(period): while True: for n in range(period): yield n for n in range(period, 0, -1): yield n

Paste this in and see how the iterator continues indefinitely:

>>> it = zigzag(2) >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1

This is another common pattern where an outer iterator is a cascade of one
or more inner iterators. It’s common enough that Python provides a
standard syntax for it, the `yield from`

statement, as you can see from the following rewrite of `zigzag`

that is functionally identical.

def zigzag(period): while True: yield from range(period) yield from range(period, 0, -1)

If you think of an iterator providing the entirety of another iterator,
think `yield from`

. The provided iterator will be completely
consumed in this case. If it is an infinite iterator, the outer iterator will also be infinite and continue on the `yield from`

statement indefinitely.

### A variant on itertools.chain

One final note, for completeness: Earlier, I said that you can’t use the
`zigzag_iters`

function with `itertools.chain`

because it would attempt to create an infinite collection of function arguments. The minds behind `itertools`

also thought of that and provided a variant that does work, `itertools.chain.from_iterable`

.

>>> it = itertools.chain.from_iterable(zigzag_iters(2)) >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1 >>> next(it) 0 >>> next(it) 1 >>> next(it) 2 >>> next(it) 1

### More from itertools

You’ll find in `itertools`

some handy routines for dealing with
infinite iterators. One is `itertools.islice`

. With this, you
can extract a subset sequence from an iterator, including from an infinite iterator.

>>> it1 = zigzag(5) >>> it2 = itertools.islice(it1, 0, 20) >>> list(it2) [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 4, 3, 2, 1] >>> it2 = itertools.islice(it1, 1, 20) >>> list(it2) [1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 4, 3, 2, 1] >>> it2 = itertools.islice(it1, 0, 20, 2) >>> list(it2) [0, 2, 4, 4, 2, 0, 2, 4, 4, 2]

The function takes the iterator to excerpt from, the starting index, and
the ending index. You can also optionally provide a step, so in some ways,
it feels similar to `range`

. Be warned though: Unlike with `range`

, a negative step value is not allowed.

### Grouping

One of the most useful bits of `itertools`

is also one of its trickiest, but well worth understanding.

Use `itertools.groupby`

to segment iterators according to some criteria. Here is one that groups hours into quadrants of a twelve-hour period.

>>> hours = range(12) >>> def divide_by_3(n): return n//3 ... >>> it = itertools.groupby(hours, divide_by_3) >>> next(it) (0, <itertools._grouper object at 0x1075ac9e8>)

You provide `itertools.groupby`

with an input iterator and a grouping
function, which is executed on each item in that iterator. When the result of the grouping function changes from one item to the next, a new group is created, with all the items until the next change in grouping function value. The grouping function, `divide_by_3`

, uses the floor division operator, //, to divide and then round down to the next smallest integer.

The Figure 1 helps visualize the operation of `itertools.groupby`

in this example.

##### Figure 1. A visualization of itertools.groupby

The key to understanding `itertools.groupby`

is remembering that
the change in the value of the grouping function is what leads to the
grouped items in the output. With some practice, you’ll be able to expertly write grouping functions for all sorts of use cases.

Let’s dig a bit more into our `itertools.groupby`

example.

>>> hours = range(12) >>> for quadrant, group_it in itertools.groupby(hours, divide_by_3): ... print('Items in quadrant', quadrant) ... for i in group_it: ... print(i) ... Items in quadrant 0 0 1 2 Items in quadrant 1 3 4 5 Items in quadrant 2 6 7 8 Items in quadrant 3 9 10 11

Here, the outer loop invokes `itertools.groupby`

, but each item
in the loop is a tuple. The second element in this tuple,
`group_it`

, is in turn an iterator, as you can see in Figure 1. So, the inner loop works on the iterator of each group.

### Grouping with infinite iterators

To pull together a couple of concepts from this tutorial, you can use grouping with infinite iterators. Start by looking at a feature from itertools, which neatly creates infinite iterators.

>>> inf_it = itertools.cycle(range(12)) >>> print(list(itertools.islice(inf_it, 32))) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7]

In the first line, I use `itertools.cycle`

, which creates an
infinite cycle over a finite iterator you provide. In this case, it just
generates the 0 to 11 sequence over and over again. I use `itertools.islice`

to safely take a subset of it and print that as a list. There is also `itertools.repeat`

, which takes a single value and iterates that one value over and over.

Combining `itertools.groupby`

and `itertools.cycle`

is the last tool to explore in this tutorial.

>>> inf_it = itertools.cycle(range(12)) >>> hours_32 = itertools.islice(inf_it, 32) >>> for quadrant, group_it in itertools.groupby(hours_32, divide_by_3): ... print('Items in quadrant', quadrant, ':', list(group_it)) ... Items in quadrant 0 : [0, 1, 2] Items in quadrant 1 : [3, 4, 5] Items in quadrant 2 : [6, 7, 8] Items in quadrant 3 : [9, 10, 11] Items in quadrant 0 : [0, 1, 2] Items in quadrant 1 : [3, 4, 5] Items in quadrant 2 : [6, 7, 8] Items in quadrant 3 : [9, 10, 11] Items in quadrant 0 : [0, 1, 2] Items in quadrant 1 : [3, 4, 5] Items in quadrant 2 : [6, 7]

Study this session carefully to get a valuable grasp of how these iterator tools interact. For example, you can see how the cycle of the hour numbers leads to a cycle of the grouping function results.

You're starting to see how compactly you can pull together basic building blocks to produce interesting patterns of iterators. As you combine such facilities with your own specialized generators, this becomes a powerful paradigm for breaking down problems dealing with data in series efficiently.

## Go forth and iterate

There is more to itertools than I was able to cover in this tutorial, but you now have a basic idea of some of its offerings, and a good look at one of its trickiest but potentially most valuable facilities, `groupby`

.

As with the entire standard library, you can find comprehensive documentation of the module, but as usual, this is more for a well-versed developer to check on details. It’s not designed as a primer. I recommend that you start with the building blocks in this tutorial and the previous one, and experiment bit-by-bit with the features of itertools, and the other facilities available in built-ins and the standard library for working with iterators.

Take a trip on the even wilder side in Part 3 of this series. You'll explore
coroutines, a special sort of generator function. You'll also learn about another powerful but tricky standard library module: `asyncio`

.