
Iterators

East is East, and West is West, and never the twain shall meet.
Rudyard Kipling

 

Diving In

Generators are really just a special case of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Remember the Fibonacci generator? Here it is as a built-from-scratch iterator:

[download fibonacci2.py]

class Fib:
    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a, self.b = 0, 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib

Let's take that one line at a time.

class Fib:

class? What's a class?

Defining Classes

Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you've defined.

Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that's all that's required, since a class doesn't need to inherit from any other class.


class PapayaWhip:  
    pass           
  1. The name of this class is PapayaWhip, and it doesn't inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement.
  2. You probably guessed this, but everything in a class is indented, just like the code within a function, if statement, for loop, or any other block of code. The first line not indented is outside the class.

    This PapayaWhip class doesn't define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It's a statement that does nothing, and it's a good placeholder when you're stubbing out functions or classes.

    The pass statement in Python is like an empty set of curly braces ({}) in Java or C.
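As a small illustration of pass as a placeholder, here is a sketch that stubs out one function and one class. The names not_written_yet and EmptyShell are made up for this example.

```python
def not_written_yet():
    pass  # placeholder body; calling the function just returns None


class EmptyShell:
    pass  # a class with no methods or attributes of its own


print(not_written_yet())                      # None
print(isinstance(EmptyShell(), EmptyShell))   # True
```

Both definitions are syntactically complete, so the rest of a module can refer to them while their real bodies are still unwritten.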

    Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Although it's not required, Python classes can have something similar to a constructor: the __init__ method.
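To make the idea of "something similar to a constructor" concrete, here is a minimal sketch. The class name Tally and its count attribute are hypothetical, chosen only for illustration.

```python
class Tally:
    def __init__(self, start):
        # Python calls this automatically right after creating the instance;
        # self is the new instance, start is whatever the caller passed in.
        self.count = start


t = Tally(10)    # creates the instance, then runs __init__(t, 10)
print(t.count)   # 10
```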

    The __init__() Method

    FIXME - port from DiP

    Know When To Use self and __init__

    FIXME - port from DiP

    Instantiating Classes

    FIXME - port from DiP

    A Note About Garbage Collection

    FIXME - port from DiP, verify it's still true

    Special Method Names

    FIXME - port from DiP, link to http://docs.python.org/3.0/reference/datamodel.html#special-method-names

    FIXME - do we want to make an appendix out of some of the special methods? The organization in the Python docs is somewhat haphazard and most names have no examples at all

    Class Attributes

    FIXME

    A Fibonacci Iterator

    FIXME

    [download fibonacci2.py]

    class Fib:                                        
        def __init__(self, max):                      
            self.max = max
    
        def __iter__(self):                           
            self.a, self.b = 0, 1
            return self
    
        def __next__(self):                           
            fib = self.a
            if fib > self.max:
                raise StopIteration                   
            self.a, self.b = self.b, self.a + self.b
            return fib                                
    1. To build an iterator from scratch, fib needs to be a class, not a function.
    2. “Calling” Fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later.
    3. The __iter__() method is called whenever someone calls iter(fib). (As you’ll see in a minute, a for loop will call this automatically, but you can also call it yourself manually.) After performing beginning-of-iteration initialization (in this case, resetting self.a and self.b, our two counters), the __iter__() method can return any object that implements a __next__() method. In this case (and in most cases), __iter__() simply returns self, since this class implements its own __next__() method.
    4. The __next__() method is called whenever someone calls next() on an iterator of an instance of a class. That will make more sense in a minute.
    5. When the __next__() method raises a StopIteration exception, this signals to the caller that the iteration is over; no more values are available. If the caller is a for loop, it will notice this StopIteration exception and gracefully exit the loop. (In other words, it will swallow the exception.) This little bit of magic is actually the key to using iterators in for loops.
    6. To spit out the next value, an iterator’s __next__() method simply returns the value. Do not use yield here; that’s a bit of syntactic sugar that only applies when you’re using generators. Here you’re creating your own iterator from scratch; use return instead.
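The points above can be checked by hand with the built-in iter() and next() functions. This is only a sketch: it repeats the Fib class so that it runs on its own, and the small limit of 3 is an arbitrary choice to keep the output short.

```python
class Fib:
    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a, self.b = 0, 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


fib = Fib(3)
it = iter(fib)          # calls fib.__iter__(), which resets a and b
print(it is fib)        # True: __iter__() returned self

print(next(it), next(it), next(it), next(it), next(it))   # 0 1 1 2 3

try:
    next(it)            # the next value, 5, is past the limit
except StopIteration:
    print('no more values')

# Because __iter__() resets the counters, the same instance can be reused.
print(list(fib))        # [0, 1, 1, 2, 3]
print(list(fib))        # [0, 1, 1, 2, 3]
```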

    Thoroughly confused yet? Excellent. Let’s see how to call this iterator:

    >>> from fibonacci2 import Fib
    >>> for n in Fib(1000):
    ...     print(n, end=' ')
    0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987

    Why, it’s exactly the same! Byte for byte identical to how you called Fibonacci-as-a-generator (modulo one capital letter). But how?

    There’s a bit of magic involved in for loops. Here’s what happens:

    • The for loop calls Fib(1000), as shown. This returns an instance of the Fib class. Call this fib_inst.
    • Secretly, and quite cleverly, the for loop calls iter(fib_inst), which returns an iterator object. Call this fib_iter. In this case, fib_iter is fib_inst (the very same object), because the __iter__() method returns self, but the for loop doesn’t know (or care) about that.
    • To “loop through” the iterator, the for loop calls next(fib_iter), which calls the __next__() method on the fib_iter object, which does the next-Fibonacci-number calculations and returns a value. The for loop takes this value and assigns it to n, then executes the body of the for loop for that value of n.
    • How does the for loop know when to stop? I’m glad you asked! When next(fib_iter) raises a StopIteration exception, the for loop will swallow the exception and gracefully exit. (Any other exception will pass through and be raised as usual.) And where have you seen a StopIteration exception? In the __next__() method, of course!
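The steps above can be written out as an explicit while loop. This is a rough sketch of what the for loop does on your behalf, not the interpreter's actual implementation. The Fib class is repeated so the example runs on its own, and the collected list is an addition for this example.

```python
class Fib:
    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a, self.b = 0, 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


fib_inst = Fib(1000)          # step 1: create the instance
fib_iter = iter(fib_inst)     # step 2: ask it for an iterator
collected = []
while True:
    try:
        n = next(fib_iter)    # step 3: fetch the next value
    except StopIteration:     # step 4: swallow the exception and stop
        break
    collected.append(n)       # stands in for the body of the for loop

print(collected)
```

Any exception other than StopIteration is not caught by the except clause, so it propagates to the caller just as it would from a real for loop.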

    A Plural Rule Iterator

    Now it’s time for the finale. Let's rewrite the plural rules generator as an iterator.

    [download plural6.py]

    class LazyRules:
        rules_filename = 'plural6-rules.txt'
    
        def __init__(self):
            self.pattern_file = open(self.rules_filename)
            self.cache = []
    
        def __iter__(self):
            self.cache_index = 0
            return self
    
        def __next__(self):
            self.cache_index += 1
            if len(self.cache) >= self.cache_index:
                return self.cache[self.cache_index - 1]
    
            if self.pattern_file.closed:
                raise StopIteration
    
            line = self.pattern_file.readline()
            if not line:
                self.pattern_file.close()
                raise StopIteration
    
            pattern, search, replace = line.split(None, 3)
            funcs = build_match_and_apply_functions(
                pattern, search, replace)
            self.cache.append(funcs)
            return funcs
    
    rules = LazyRules()

    So this is a class that implements __iter__() and __next__(), so it can be used as an iterator. Then, you instantiate the class and assign it to rules. This happens just once, on import.

    Let’s take the class one bite at a time.

    class LazyRules:
        def __init__(self):                                
            self.pattern_file = open(self.rules_filename)  
            self.cache = []                                
    1. The __init__() method is only going to be called once, when you instantiate the class and assign it to rules.
    2. Since this is only going to get called once, it’s the perfect place to open the pattern file. You’ll read it later; no point doing more than you absolutely have to until absolutely necessary!
    3. Also, this is a good place to initialize the cache, which you’ll use later as you read the patterns from the pattern file.
        def __iter__(self):       
            self.cache_index = 0  
            return self           
    
    1. The __iter__() method will be called every time someone — say, a for loop — calls iter(rules).
    2. This is the place to reset the counter that we’re going to use to retrieve items from the cache (that we haven’t built yet — patience, grasshopper).
    3. Finally, the __iter__() method returns self, which signals that this class will take care of returning its own values throughout an iteration.
        def __next__(self):                                 
            .
            .
            .
            pattern, search, replace = line.split(None, 3)
            funcs = build_match_and_apply_functions(        
                pattern, search, replace)
            self.cache.append(funcs)                        
            return funcs
    1. The __next__() method gets called whenever someone — say, a for loop — calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that.
    2. The last part of this function should look familiar, at least. The build_match_and_apply_functions() function hasn’t changed; it’s the same as it ever was. Each line of the pattern file will be read exactly once, as late as possible.
    3. The only difference is that, before returning the match and apply functions (which are stored in the tuple funcs), we’re going to save them in self.cache. Each match and apply function will be built exactly once, as late as possible, then cached.
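One detail in that snippet is easy to miss: line.split(None, 3) splits on runs of whitespace and also drops the trailing newline. The sketch below shows this on a made-up rule line. The exact layout of plural6-rules.txt is an assumption here (three whitespace-separated fields per line).

```python
line = '[sxz]$ $ es\n'   # hypothetical line from the rules file

# Up to three splits on whitespace; the leftover is only the newline,
# which is whitespace, so no fourth item is produced.
pattern, search, replace = line.split(None, 3)
print(repr(pattern), repr(search), repr(replace))   # '[sxz]$' '$' 'es'

# With only two splits allowed, the third field keeps its newline.
print(line.split(None, 2))   # ['[sxz]$', '$', 'es\n']
```

A line with more than three fields would produce four items and make the three-name unpacking raise a ValueError, so the rules file has to stick to three fields per line.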

    Moving backwards…

        def __next__(self):
            .
            .
            .
            line = self.pattern_file.readline()  
            if not line:                         
                self.pattern_file.close()
                raise StopIteration              
            .
            .
            .
    1. A bit of advanced file trickery here. The readline() method (note: singular, not the plural readlines()) reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all the way down…)
    2. If there was a line for readline() to read, line will not be an empty string. Even if the file contained a blank line, line would end up as the one-character string '\n' (a newline). If line is really an empty string, that means there are no more lines to read from the file.
    3. When we reach the end of the file, we should close the file and raise the magic StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. (The party’s over…)
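The difference between a blank line and the end of the file is easy to see in isolation. This sketch uses io.StringIO as an in-memory stand-in for a real file. The sample text and the name results are made up for the example.

```python
import io

f = io.StringIO('first\n\nlast\n')   # three lines; the middle one is blank

results = [f.readline() for _ in range(4)]
for text in results:
    print(repr(text))
# 'first\n'
# '\n'        <- a blank line is still a one-character string
# 'last\n'
# ''          <- the empty string means end of file
```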

    Moving backwards all the way to the start of the __next__() method…

        def __next__(self):
            self.cache_index += 1
            if len(self.cache) >= self.cache_index:
                return self.cache[self.cache_index - 1]     
    
            if self.pattern_file.closed:
                raise StopIteration                         
            .
            .
            .
    1. self.cache will be a list of the functions we need to match and apply individual rules. (At least that should sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t exhausted the cache yet (i.e. if, after incrementing self.cache_index, the length of self.cache is still greater than or equal to it), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch.
    2. On the other hand, if we don’t get a hit from the cache, and the file object has been closed (which could happen, further down the method, as you saw in the previous code snippet), then there’s nothing more we can do. If the file is closed, it means we’ve exhausted it — we’ve already read through every line from the pattern file, and we’ve already built and cached the match and apply functions for each pattern. The file is exhausted; the cache is exhausted; I’m exhausted. Wait, what? Hang in there, we’re almost done.

    Putting it all together, here’s what happens when:

    • When the module is imported, it creates a single instance of the LazyRules class, called rules, which opens the pattern file but does not read from it.
    • When asked for the first match and apply function, it checks its cache but finds the cache is empty. So it reads a single line from the pattern file, builds the match and apply functions from those patterns, and caches them.
    • Let’s say, for the sake of argument, that the very first rule matched. If so, no further match and apply functions are built, and no further lines are read from the pattern file.
    • Furthermore, for the sake of argument, suppose that the caller calls the plural() function again to pluralize a different word. The for loop in the plural() function will call iter(rules), which will reset the cache index but will not reset the open file object.
    • The first time through, the for loop will ask for a value from rules, which will invoke its __next__() method. This time, however, the cache is primed with a single pair of match and apply functions, corresponding to the patterns in the first line of the pattern file. Since they were built and cached in the course of pluralizing the previous word, they’re retrieved from the cache. The cache index increments, and the open file is never touched.
    • Let’s say, for the sake of argument, that the first rule does not match this time around. So the for loop comes around again and asks for another value from rules. This invokes the __next__() method a second time. This time, the cache is exhausted — it only contained one item, and we’re asking for a second — so the __next__() method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them.
    • This read-build-and-cache process will continue as long as the rules being read from the pattern file don’t match the word we’re trying to pluralize. If we do find a matching rule before the end of the file, we simply use it and stop, with the file still open. The file pointer will stay wherever we stopped reading, waiting for the next readline() command. In the meantime, the cache now has more items in it, and if we start all over again trying to pluralize a new word, each of those items in the cache will be tried before reading the next line from the pattern file.
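The sequence above can be watched in action with a stripped-down stand-in for LazyRules. Everything named here is an assumption made for the demonstration: the class LazyUpper, the lines_read counter, the use of io.StringIO in place of the pattern file, and upper-casing a line in place of building match and apply functions. The control flow of __iter__() and __next__() follows the listing above.

```python
import io


class LazyUpper:
    """Hypothetical stand-in for LazyRules: caches one item per line read."""

    def __init__(self, source):
        self.source = source    # any open file-like object
        self.cache = []
        self.lines_read = 0     # bookkeeping for this demo only

    def __iter__(self):
        self.cache_index = 0    # restart from the cache, not from the file
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]   # cache hit

        if self.source.closed:
            raise StopIteration

        line = self.source.readline()
        if not line:
            self.source.close()
            raise StopIteration

        self.lines_read += 1
        item = line.strip().upper()   # stands in for building the functions
        self.cache.append(item)
        return item


items = LazyUpper(io.StringIO('a\nb\nc\n'))

for item in items:        # first pass stops after the first item
    break
print(items.lines_read)   # 1

for item in items:        # 'A' comes from the cache, 'B' from the file
    if item == 'B':
        break
print(items.lines_read)   # 2

print(list(items))        # ['A', 'B', 'C']
print(items.lines_read)   # 3
print(list(items))        # ['A', 'B', 'C'], served entirely from the cache
print(items.lines_read)   # still 3
```

However many passes are made, each line is read at most once, which is the worst-case behavior described above.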

    Thus, we have achieved our combined goal [FIXME xref]:

    1. Minimal startup cost. The only thing that happens on import is instantiating a single class and opening a file (but not reading from it).
    2. Maximum performance. The previous example would read through the file and build functions dynamically every time you wanted to pluralize a word. This version will cache functions as soon as they’re built, and in the worst case, it will only read through the pattern file once, no matter how many words you pluralize.
    3. Separation of code and data. All the patterns are stored in a separate file. Code is code, and data is data, and never the twain shall meet.

    Further Reading

    © 2001–9 Mark Pilgrim