Contents


On demand data in Python, Part 2

The magic of itertools

Find a versatile toolkit for working with iterators in the standard library

Comments

Content series:

This content is part # of # in the series: On demand data in Python, Part 2

Stay tuned for additional content in this series.

This content is part of the series:On demand data in Python, Part 2

Stay tuned for additional content in this series.

In Part 1 of this series, you looked at Python iterators and the most straightforward way of creating your own iterators, which is by writing generator functions or expressions. Once you get the basics, you will quickly start to have all sorts of ideas for building on iterators. However, it’s a good idea to make sure you’re not reinventing the wheel.

Python’s standard library is a marvel of valuable support and tools, with some a bit more buried than others. There are several modules useful in manipulating iterators, but first you should look at some facilities available right in the built-ins, meaning for the most part, you don’t need to import anything to use them.

Using map to build on iterators

With the map function, you can process each item in an iterator, which leaves you with a new iterator of the results.

>>> r = range(10)
>>> m = map(str, r)
>>> next(m)
'0'
>>> list(m)
['1', '2', '3', '4', '5', '6', '7', '8', '9']

The first argument to map is the mapping function, the function to apply to each item. In this case, I use str to convert each integer from the range into a string. When I request the first item from the map iterator, I get a string. I then use list to extract everything in the sequence. Notice that the list starts with 1 rather than 0. That’s because I already extracted the first item from the map. Iterators, unlike lists or tuples, do not support random access. You get items one at a time, in a forward direction.

Of course, you can write your own mapping function. It expects one input value from map. Paste the following into your interpreter session:

def get_letter(i):
    return chr(ord('a')+i)

Here, I use ord to get the numerical code corresponding to the character 'a' and add the function’s argument to derive a new character code. The function chr converts this code back to a string character. Let’s use this function with map.

>>> r = range(10)
>>> list(map(get_letter, r))
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Map versus generator expressions

At this point, you probably remember Part 1 and realize this is much like a generator expression. Indeed, I could have done the following.

>>> list( ( get_letter(i) for i in range(10) ) )
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Notice how I put the generator expression right in as an argument to list. That should help illustrate that you can use generator expressions just like other Python expressions. In this case, the get_letter function is simple enough that you could even build it right into the generator expression.

>>> list( ( chr(ord('a')+i) for i in range(10) ) )
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

This seems redundant, but map has existed in Python longer than generator expressions, and there are specialized cases where it’s still handy. For the most part though, you would probably use generator expressions for such simple manipulations of iterators.

Making reductions from iterators

Sometimes, you don’t want to turn one iterator into another but rather compute one single value from an iterator.

>>> sum(range(20, 30))
245

The sum function takes an iterator of numbers (think ints or floats) and returns a sum of all items in the iterator, in this case the sum of all numbers from 20 to 30. This is called a reduction, where a series of values is reduced to one new value. Python also provides a generalized function for reduction. Start by pasting in the following function:

def sum_of_squares(acc, i):
    return acc + i**2

Use this to reduce as follows. You need the functools module.

>>> import functools
>>> functools.reduce(sum_of_squares, range(10))
285

The first argument is a function that takes two arguments. It takes an accumulated value and then it takes a new value to combine into an updated accumulated value in some way, which is then returned. In this case, the accumulated value is the sum of squares of items so far.

Just as generator expressions can be more readable than using map, there are often more readable alternatives to reduce, but it’s useful to have for certain specialized uses (and it’s a conceptual counterpoint to map).

The idea of map is to take an iterator and generate a derived iterator. The idea of reduce is to take an iterator and return a single value derived from all its items. This combination of concepts turns out to be a powerful way to think about processing large quantities of data.

The string.join method

I want to mention a special method that acts as a reduction.

>>> tenletters = ( chr(ord('a')+i) for i in range(10) )
>>> '!'.join(tenletters)
'a!b!c!d!e!f!g!h!i!j'

The join method on a string takes an iterator and creates a new string by joining together the iterator item strings with a specified string in between. Playing around a bit more:

>>> '!'.join(tenletters)
''
>>> tenletters = ( chr(ord('a')+i) for i in range(10) )
>>> ''.join(tenletters)
'abcdefghij'
>>> tenletters = ( chr(ord('a')+i) for i in range(10) )
>>> ' and '.join(tenletters)
'a and b and c and d and e and f and g and h and i and j'

The output from the first line is a good reminder that iterators are one way, and once used up, you won’t get any more out of them. Pay attention because this is a common cause of bugs in code using iterators. In the next line, I set up a new generator, and you can see that a join on the empty string just concatenates all the strings in the given iterator. You can also join with longer strings as demonstrated in the final couple of lines.

Filtering

There is another built-in function you can use to create an iterator containing only some subset of its input. In Part 1, I presented the following example.

>>> import math
>>> notby2or3 = ( n for n in range(1, 20) if math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1 )
>>> list(notby2or3)
[1, 5, 7, 11, 13, 17, 19]

Go back and check Part 1 if you need a refresher on the notby2or3 function. You can implement the equivalent of this generator expression using the filter function, which (again) is just good to be aware of.

>>> import math
>>> def notby2or3(n):
...     return math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1
... 
>>> list(filter(notby2or3, range(1, 20)))
[1, 5, 7, 11, 13, 17, 19]

I could have written this with one or two lines fewer using what’s called a lambda function, but that’s outside the scope of this tutorial series.

Introducing itertools

There are many small patterns for using iterators that come up over and over again. The itertools module provides high-performance implementations of many of these, which are well worth learning.

Suppose you’d like to create a longer iterator that concatenates a sequence of other iterators:

>>> import itertools
>>> it = itertools.chain(range(5), range(5, 0, -1), range(5))
>>> list(it)
[0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4]

Pass itertools.chain the iterators as function arguments and it produces a chain of the outputs of these. If you have a list or tuple sequence of iterators, you can use Python's special syntax for variable positional function arguments.

>>> list_of_iters = [range(5), range(5, 0, -1), range(5)]
>>> it = itertools.chain(*list_of_iters)
>>> list(it)
[0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4]

You can use any iterator for functional variable arguments. It doesn’t have to be a list or tuple. Paste in the following:

def forth_back_forth(n):
    yield range(n)
    yield range(n, 0, -1)
    yield range(n)

You can use this generator to feed itertools.chain its arguments:

>>> it = itertools.chain(*forth_back_forth(3))
>>> list(it)
[0, 1, 2, 3, 2, 1, 0, 1, 2]

Infinite iterators

The combination of forth_back_forth and itertools.chain makes me wonder what it would be like to continue that back/forth/back number pattern even further. Paste in the following:

def zigzag_iters(period):
    while True:
        yield range(period)
        yield range(period, 0, -1)

Then, try it out:

>>> it = zigzag_iters(2)
>>> next(it)
range(0, 2)
>>> next(it)
range(2, 0, -1)
>>> next(it)
range(0, 2)
>>> next(it)
range(2, 0, -1)
>>> next(it)
range(0, 2)

If you continue to enter next(it), it continues to cycle between the two output ranges over and over because while True implements an infinite loop, and there is no code within it to break the loop’s execution. This generator is in a constant state of suspension and resumption until the Python process (the interactive interpreter, in this case) terminates. Or you could use the close() method on any generator object to end it regardless of where it is in execution.

You might have noticed that I’m not using list to illustrate the iterator's contents. If you try to do this, you are basically trying to create an infinite list, and of course, you will run out of memory.

Remember also that this is returning iterators, not items. With forth_back_forth, the idea was to use itertools.chain to get the items from those iterators. You cannot do the same thing with zigzag_iters because this also tries to create a collection of infinite items, and all the arguments to a function must be known before the function is executed, even in the case of variable arguments. This would would mean the Python runtime would be trying to get all items from an infinite iterator, which is not possible.

Infinite iterators are actually a useful concept once you start using generators. But, you have to be aware of operations that attempt to build any collection from the contents of such iterators.

Cascading iterators

Because you can’t use itertools.chain with zigzag_iters, create a more direct version.

def zigzag(period):
    while True:
        for n in range(period):
            yield n
        for n in range(period, 0, -1):
            yield n

Paste this in and see how the iterator continues indefinitely:

>>> it = zigzag(2)
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
1
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
1
>>> next(it)
2
>>> next(it)
1
>>> next(it)
0
>>> next(it)
1

This is another common pattern where an outer iterator is a cascade of one or more inner iterators. It’s common enough that Python provides a standard syntax for it, the yield from statement, as you can see from the following rewrite of zigzag that is functionally identical.

def zigzag(period):
    while True:
        yield from range(period)
        yield from range(period, 0, -1)

If you think of an iterator providing the entirety of another iterator, think yield from. The provided iterator will be completely consumed in this case. If it is an infinite iterator, the outer iterator will also be infinite and continue on the yield from statement indefinitely.

A variant on itertools.chain

One final note, for completeness: Earlier, I said that you can’t use the zigzag_iters function with itertools.chain because it would attempt to create an infinite collection of function arguments. The minds behind itertools also thought of that and provided a variant that does work, itertools.chain.from_iterable.

>>> it = itertools.chain.from_iterable(zigzag_iters(2))
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
1
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
1
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
1

More from itertools

You’ll find in itertools some handy routines for dealing with infinite iterators. One is itertools.islice. With this, you can extract a subset sequence from an iterator, including from an infinite iterator.

>>> it1 = zigzag(5)
>>> it2 = itertools.islice(it1, 0, 20)
>>> list(it2)
[0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 4, 3, 2, 1]
>>> it2 = itertools.islice(it1, 1, 20)
>>> list(it2)
[1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 4, 3, 2, 1]
>>> it2 = itertools.islice(it1, 0, 20, 2)
>>> list(it2)
[0, 2, 4, 4, 2, 0, 2, 4, 4, 2]

The function takes the iterator to excerpt from, the starting index, and the ending index. You can also optionally provide a step, so in some ways, it feels similar to range. Be warned though: Unlike with range, a negative step value is not allowed.

Grouping

One of the most useful bits of itertools is also one of its trickiest, but well worth understanding.

Use itertools.groupby to segment iterators according to some criteria. Here is one that groups hours into quadrants of a twelve-hour period.

>>> hours = range(12)
>>> def divide_by_3(n): return n//3
... 
>>> it = itertools.groupby(hours, divide_by_3)
>>> next(it)
(0, <itertools._grouper object at 0x1075ac9e8>)

You provide itertools.groupby with an input iterator and a grouping function, which is executed on each item in that iterator. When the result of the grouping function changes from one item to the next, a new group is created, with all the items until the next change in grouping function value. The grouping function, divide_by_3, uses the floor division operator, //, to divide and then round down to the next smallest integer.

The Figure 1 helps visualize the operation of itertools.groupby in this example.

Figure 1. A visualization of itertools.groupby

The key to understanding itertools.groupby is remembering that the change in the value of the grouping function is what leads to the grouped items in the output. With some practice, you’ll be able to expertly write grouping functions for all sorts of use cases.

Let’s dig a bit more into our itertools.groupby example.

>>> hours = range(12)
>>> for quadrant, group_it in itertools.groupby(hours, divide_by_3):
...     print('Items in quadrant', quadrant)
...     for i in group_it:
...         print(i)
... 
Items in quadrant 0
0
1
2
Items in quadrant 1
3
4
5
Items in quadrant 2
6
7
8
Items in quadrant 3
9
10
11

Here, the outer loop invokes itertools.groupby, but each item in the loop is a tuple. The second element in this tuple, group_it, is in turn an iterator, as you can see in Figure 1. So, the inner loop works on the iterator of each group.

Grouping with infinite iterators

To pull together a couple of concepts from this tutorial, you can use grouping with infinite iterators. Start by looking at a feature from itertools, which neatly creates infinite iterators.

>>> inf_it = itertools.cycle(range(12))
>>> print(list(itertools.islice(inf_it, 32)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7]

In the first line, I use itertools.cycle, which creates an infinite cycle over a finite iterator you provide. In this case, it just generates the 0 to 11 sequence over and over again. I use itertools.islice to safely take a subset of it and print that as a list. There is also itertools.repeat, which takes a single value and iterates that one value over and over.

Combining itertools.groupby and itertools.cycle is the last tool to explore in this tutorial.

>>> inf_it = itertools.cycle(range(12))
>>> hours_32 = itertools.islice(inf_it, 32)
>>> for quadrant, group_it in itertools.groupby(hours_32, divide_by_3):
...     print('Items in quadrant', quadrant, ':', list(group_it))
... 
Items in quadrant 0 : [0, 1, 2]
Items in quadrant 1 : [3, 4, 5]
Items in quadrant 2 : [6, 7, 8]
Items in quadrant 3 : [9, 10, 11]
Items in quadrant 0 : [0, 1, 2]
Items in quadrant 1 : [3, 4, 5]
Items in quadrant 2 : [6, 7, 8]
Items in quadrant 3 : [9, 10, 11]
Items in quadrant 0 : [0, 1, 2]
Items in quadrant 1 : [3, 4, 5]
Items in quadrant 2 : [6, 7]

Study this session carefully to get a valuable grasp of how these iterator tools interact. For example, you can see how the cycle of the hour numbers leads to a cycle of the grouping function results.

You're starting to see how compactly you can pull together basic building blocks to produce interesting patterns of iterators. As you combine such facilities with your own specialized generators, this becomes a powerful paradigm for breaking down problems dealing with data in series efficiently.

Go forth and iterate

There is more to itertools than I was able to cover in this tutorial, but you now have a basic idea of some of its offerings, and a good look at one of its trickiest but potentially most valuable facilities, groupby.

As with the entire standard library, you can find comprehensive documentation of the module, but as usual, this is more for a well-versed developer to check on details. It’s not designed as a primer. I recommend that you start with the building blocks in this tutorial and the previous one, and experiment bit-by-bit with the features of itertools, and the other facilities available in built-ins and the standard library for working with iterators.

Take a trip on the even wilder side in Part 3 of this series. You'll explore coroutines, a special sort of generator function. You'll also learn about another powerful but tricky standard library module: asyncio.


Downloadable resources


Related topics


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Cloud computing
ArticleID=1061800
ArticleTitle=On demand data in Python, Part 2: The magic of itertools
publish-date=06222018