2.4. Everything Is an Object

2.6. Testing Modules

Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them. Here's an example that uses the if __name__ trick.

if __name__ == "__main__":

Some quick observations before you get to the good stuff. First, parentheses are not required around the if expression. Second, the if statement ends with a colon, and is followed by indented code.
Note: Like C, Python uses == for comparison and = for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.

So why is this particular if statement a trick? Modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone program, in which case __name__ will be a special default value, __main__.

>>> import odbchelper
>>> odbchelper.__name__
'odbchelper'

Knowing this, you can design a test suite for your module within the module itself by putting it in this if statement. When you run the module directly, __name__ is __main__, so the test suite executes. When you import the module, __name__ is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating them into a larger program.
Tip: On MacPython, there is an additional step to make the if __name__ trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and make sure Run as __main__ is checked.

Further Reading on Importing Modules

3.4. Declaring variables

Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from Chapter 2, odbchelper.py.

Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring into existence by being assigned a value, and they are automatically destroyed when they go out of scope.

Example 3.17. Defining the myParams Variable


if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }

Notice the indentation. An if statement is a code block and needs to be indented just like a function.

Also notice that the variable assignment is one command split over several lines, with a backslash (“\”) serving as a line-continuation marker.
Note: When a command is split among several lines with the line-continuation marker (“\”), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python IDE auto-indents the continued line, you should probably accept its default unless you have a burning reason not to.

Strictly speaking, expressions in parentheses, straight brackets, or curly braces (like defining a dictionary) can be split into multiple lines with or without the line continuation character (“\”). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's a matter of style. [unbound variable exception example was here]
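The two continuation styles can be compared side by side. This is only a sketch; the variable names with_backslashes, without_backslashes, and total are made up for the comparison.

```python
# Explicit continuation: each physical line ends with a backslash.
with_backslashes = {"server": "mpilgrim", \
                    "database": "master", \
                    "uid": "sa" \
                    }

# Implicit continuation: inside curly braces, no backslash is needed.
without_backslashes = {
    "server": "mpilgrim",
    "database": "master",
    "uid": "sa",
}

# Outside brackets, the backslash is required to continue a statement.
total = 1 + \
        2 + \
        3

print(with_backslashes == without_backslashes)
print(total)
```

Both dictionaries compare equal, so the choice between the two styles is purely a matter of readability.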

3.4.2. Assigning Multiple Values at Once

One of the cooler programming shortcuts in Python is using sequences to assign multiple values at once.

Example 3.19. Assigning multiple values at once

>>> v = ('a', 'b', 'e')
>>> (x, y, z) = v     
>>> x
'a'
>>> y
'b'
>>> z
'e'
  1. v is a tuple of three elements, and (x, y, z) is a tuple of three variables. Assigning one to the other assigns each of the values of v to each of the variables, in order.

    This has all sorts of uses. I often want to assign names to a range of values. In C, you would use enum and manually list each constant and its associated value, which seems especially tedious when the values are consecutive. In Python, you can use the built-in range function with multi-variable assignment to quickly assign consecutive values.

    Example 3.20. Assigning Consecutive Values

    >>> range(7)              
    [0, 1, 2, 3, 4, 5, 6]
    >>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7) 
    >>> MONDAY                
    0
    >>> TUESDAY
    1
    >>> SUNDAY
    6
    1. The built-in range function returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting up to but not including the upper limit. (If you like, you can pass other parameters to specify a base other than 0 and a step other than 1. You can print range.__doc__ for details.)
    2. MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, and SUNDAY are the variables you're defining. (This example came from the calendar module, a fun little module that prints calendars, like the UNIX program cal. The calendar module defines integer constants for days of the week.)
    3. Now each variable has its value: MONDAY is 0, TUESDAY is 1, and so forth.

      You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the os module, which you'll explore in Chapter 6.
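A minimal sketch of a function that returns several values at once. The function min_and_max is hypothetical and exists only to show the two ways a caller can receive the result.

```python
def min_and_max(values):
    """Return the smallest and largest items as a two-element tuple."""
    return (min(values), max(values))


# Treat the result as a single tuple...
result = min_and_max([3, 1, 4, 1, 5])

# ...or unpack it straight into individual variables.
(lowest, highest) = min_and_max([3, 1, 4, 1, 5])

print(result)
print(lowest)
print(highest)
```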

      Further Reading on Variables

      Example 6.12. Introducing sys.modules

      >>> import sys        
      >>> print '\n'.join(sys.modules.keys()) 
      win32api
      os.path
      os
      exceptions
      __main__
      ntpath
      nt
      sys
      __builtin__
      site
      signal
      UserDict
      stat
      1. The sys module contains system-level information, such as the version of Python you're running (sys.version or sys.version_info), and system-level options such as the maximum allowed recursion depth (sys.getrecursionlimit() and sys.setrecursionlimit()).
      2. sys.modules is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported. Python preloads some modules on startup, and if you're using a Python IDE, sys.modules contains all the modules imported by all the programs you've run within the IDE.

        This example demonstrates how to use sys.modules.

        Example 6.13. Using sys.modules

        >>> import fileinfo         
        >>> print '\n'.join(sys.modules.keys())
        win32api
        os.path
        os
        fileinfo
        exceptions
        __main__
        ntpath
        nt
        sys
        __builtin__
        site
        signal
        UserDict
        stat
        >>> fileinfo
        <module 'fileinfo' from 'fileinfo.pyc'>
        >>> sys.modules["fileinfo"] 
        <module 'fileinfo' from 'fileinfo.pyc'>
        1. As new modules are imported, they are added to sys.modules. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in sys.modules, so importing the second time is simply a dictionary lookup.
        2. Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the sys.modules dictionary.
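The caching behavior can be checked directly. This sketch uses the standard json module purely as a convenient stand-in for fileinfo, which is not assumed to be available.

```python
import sys

import json                      # first import: loads the module and caches it
cached = sys.modules["json"]     # look the module up by its name, as a string

import json as json_again        # second import: satisfied from the cache

# All three names refer to the very same module object.
same_object = (cached is json) and (json_again is json)
print(same_object)
```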

          The next example shows how to use the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined.

          Example 6.14. The __module__ Class Attribute

          >>> from fileinfo import MP3FileInfo
          >>> MP3FileInfo.__module__              
          'fileinfo'
          >>> sys.modules[MP3FileInfo.__module__] 
          <module 'fileinfo' from 'fileinfo.pyc'>
          1. Every Python class has a built-in class attribute __module__, which is the name of the module in which the class is defined.
          2. Combining this with the sys.modules dictionary, you can get a reference to the module in which a class is defined.

            Now you're ready to see how sys.modules is used in fileinfo.py, the sample program introduced in Chapter 5. This example shows that portion of the code.

            Example 6.15. sys.modules in fileinfo.py

            
                def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):       
                    "get file info class from filename extension"           
                    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        
                    return hasattr(module, subclass) and getattr(module, subclass) or FileInfo 
            1. This is a function with two arguments; filename is required, but module is optional and defaults to the module that contains the FileInfo class. This looks inefficient, because you might expect Python to evaluate the sys.modules expression every time the function is called. In fact, Python evaluates default expressions only once, when the def statement runs, which happens the first time the module is imported. As you'll see later, you never call this function with a module argument, so module serves as a function-level constant.
            2. You'll plow through this line later, after you dive into the os module. For now, take it on faith that subclass ends up as the name of a class, like MP3FileInfo.
            3. You already know about getattr, which gets a reference to an object by name. hasattr is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has a particular class (although it works for any object and any attribute, just like getattr). In English, this line of code says, “If this module has the class named by subclass then return it, otherwise return the base class FileInfo.”
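The lookup can be tried in isolation. In this sketch the two classes are empty hypothetical stand-ins, registry is a types.SimpleNamespace (Python 3.3 or later) playing the role of the module, and get_file_info_class is an illustrative variant rather than the code from fileinfo.py. It relies on the optional third argument of getattr, which supplies a fallback when the attribute is missing.

```python
import os
import types


class FileInfo(object):
    "hypothetical stand-in for the base class"


class MP3FileInfo(FileInfo):
    "hypothetical stand-in for a format-specific subclass"


# A namespace object standing in for the module that holds the classes.
registry = types.SimpleNamespace(FileInfo=FileInfo, MP3FileInfo=MP3FileInfo)


def get_file_info_class(filename, module=registry):
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    # The third argument to getattr is returned when the attribute is
    # missing, so unknown extensions fall back to the base class.
    return getattr(module, subclass, FileInfo)


print(get_file_info_class("mahadeva.mp3").__name__)
print(get_file_info_class("notes.txt").__name__)
```

The default for module is bound once, when the def statement runs, which is why get_file_info_class.__defaults__[0] is the same registry object on every call.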

              Further Reading on Modules

              6.5. Working with Directories

              The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory.

              Example 6.16. Constructing Pathnames

              >>> import os
              >>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")  
              'c:\\music\\ap\\mahadeva.mp3'
              >>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   
              'c:\\music\\ap\\mahadeva.mp3'
              >>> os.path.expanduser("~")       
              'c:\\Documents and Settings\\mpilgrim\\My Documents'
              >>> os.path.join(os.path.expanduser("~"), "Python") 
              'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
              1. os.path is a reference to a module -- which module depends on your platform. Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module.
              2. The join function of os.path constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing with pathnames on Windows is annoying because the backslash character must be escaped.)
              3. In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language. Do not write this stupid little function in Python; smart people have already taken care of it for you.
              4. expanduser will expand a pathname that uses ~ to represent the current user's home directory. This works on any platform where users have a home directory, like Windows, UNIX, and Mac OS X; it has no effect on Mac OS.
              5. Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory.
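To see the platform-specific modules behind os.path, you can call them by name. This sketch assumes a platform where os.path is either ntpath or posixpath; the UNIX-style path is an invented example.

```python
import ntpath      # the Windows implementation of os.path
import os
import posixpath   # the UNIX implementation of os.path

# On Windows, os.path is ntpath; on UNIX-like systems, it is posixpath.
print(os.path is ntpath or os.path is posixpath)

# Calling the Windows flavor directly gives Windows results on any platform.
windows_a = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
windows_b = ntpath.join("c:\\music\\ap", "mahadeva.mp3")

# The UNIX flavor inserts a forward slash instead of a backslash.
unix = posixpath.join("/music/ap", "mahadeva.mp3")

print(windows_a)
print(windows_b)
print(unix)
```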

                Example 6.17. Splitting Pathnames

                >>> os.path.split("c:\\music\\ap\\mahadeva.mp3")      
                ('c:\\music\\ap', 'mahadeva.mp3')
                >>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") 
                >>> filepath      
                'c:\\music\\ap'
                >>> filename      
                'mahadeva.mp3'
                >>> (shortname, extension) = os.path.splitext(filename)                 
                >>> shortname
                'mahadeva'
                >>> extension
                '.mp3'
                1. The split function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use multi-variable assignment to return multiple values from a function? Well, split is such a function.
                2. You assign the return value of the split function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.
                3. The first variable, filepath, receives the value of the first element of the tuple returned from split, the file path.
                4. The second variable, filename, receives the value of the second element of the tuple returned from split, the filename.
                5. os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique to assign each of them to separate variables.

                  Example 6.18. Listing Directories

                  >>> os.listdir("c:\\music\\_singles\\")              
                  ['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
                  'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3', 
                  'spinning.mp3']
                  >>> dirname = "c:\\"
                  >>> os.listdir(dirname)            
                  ['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
                  'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
                  'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
                  'Program Files', 'Python20', 'RECYCLER',
                  'System Volume Information', 'TEMP', 'WINNT']
                  >>> [f for f in os.listdir(dirname)
                  ...    if os.path.isfile(os.path.join(dirname, f))] 
                  ['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
                  'NTDETECT.COM', 'ntldr', 'pagefile.sys']
                  >>> [f for f in os.listdir(dirname)
                  ...    if os.path.isdir(os.path.join(dirname, f))]  
                  ['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
                  'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
                  'System Volume Information', 'TEMP', 'WINNT']
                  1. The listdir function takes a pathname and returns a list of the contents of the directory.
                  2. listdir returns both files and folders, with no indication of which is which.
                  3. You can use list filtering and the isfile function of the os.path module to separate the files from the folders. isfile takes a pathname and returns 1 if the path represents a file, and 0 otherwise. Here you're using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory. You can use os.getcwd() to get the current working directory.
                  4. os.path also has an isdir function, which returns 1 if the path represents a directory, and 0 otherwise. You can use this to get a list of the subdirectories within a directory.
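The same filtering can be reproduced on any machine with a throwaway directory. This is a sketch only; the names notes.txt and music are invented, and the temporary directory is removed at the end.

```python
import os
import shutil
import tempfile

# Build a temporary directory holding one file and one subdirectory.
dirname = tempfile.mkdtemp()
try:
    open(os.path.join(dirname, "notes.txt"), "w").close()
    os.mkdir(os.path.join(dirname, "music"))

    # listdir gives no indication of which entries are files or folders.
    everything = sorted(os.listdir(dirname))

    # Join each name to the directory so isfile and isdir see a full path.
    files = [f for f in os.listdir(dirname)
             if os.path.isfile(os.path.join(dirname, f))]
    folders = [f for f in os.listdir(dirname)
               if os.path.isdir(os.path.join(dirname, f))]
finally:
    shutil.rmtree(dirname)   # clean up the temporary directory

print(everything)
print(files)
print(folders)
```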

                    Example 6.19. Listing Directories in fileinfo.py

                    
                    def listDirectory(directory, fileExtList):    
                        "get list of file info objects for files of particular extensions" 
                        fileList = [os.path.normcase(f)
                                    for f in os.listdir(directory)]             
                        fileList = [os.path.join(directory, f) 
                                    for f in fileList
                                    if os.path.splitext(f)[1] in fileExtList]    
                    1. os.listdir(directory) returns a list of all the files and folders in directory.
                    2. Iterating through the list with f, you use os.path.normcase(f) to normalize the case according to operating system defaults. normcase is a useful little function that compensates for case-insensitive operating systems that think that mahadeva.mp3 and mahadeva.MP3 are the same file. For instance, on Windows and Mac OS, normcase will convert the entire filename to lowercase; on UNIX-compatible systems, it will return the filename unchanged.
                    3. Iterating through the normalized list with f again, you use os.path.splitext(f) to split each filename into name and extension.
                    4. For each file, you see if the extension is in the list of file extensions you care about (fileExtList, which was passed to the listDirectory function).
                    5. For each file you care about, you use os.path.join(directory, f) to construct the full pathname of the file, and return a list of the full pathnames.
                      Note: Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python.
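The normalize, filter, and join steps can be exercised end to end. In this sketch, list_files_with_extensions is a hypothetical variant of the function that simply returns the list of full pathnames, and the file names are invented.

```python
import os
import shutil
import tempfile


def list_files_with_extensions(directory, file_ext_list):
    "return full pathnames of files in directory with the given extensions"
    # Normalize case first, so the extension test behaves the same way
    # on case-insensitive file systems.
    file_list = [os.path.normcase(f) for f in os.listdir(directory)]
    return [os.path.join(directory, f)
            for f in file_list
            if os.path.splitext(f)[1] in file_ext_list]


directory = tempfile.mkdtemp()
try:
    for name in ("kairo.mp3", "spinning.mp3", "readme.txt"):
        open(os.path.join(directory, name), "w").close()
    matches = sorted(list_files_with_extensions(directory, [".mp3"]))
    basenames = [os.path.basename(path) for path in matches]
finally:
    shutil.rmtree(directory)   # clean up the temporary directory

print(basenames)
```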

                      There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line.

                      Example 6.20. Listing Directories with glob

                      >>> os.listdir("c:\\music\\_singles\\")               
                      ['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
                      'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
                      'spinning.mp3']
                      >>> import glob
                      >>> glob.glob('c:\\music\\_singles\\*.mp3')           
                      ['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
                      'c:\\music\\_singles\\hellraiser.mp3',
                      'c:\\music\\_singles\\kairo.mp3',
                      'c:\\music\\_singles\\long_way_home1.mp3',
                      'c:\\music\\_singles\\sidewinder.mp3',
                      'c:\\music\\_singles\\spinning.mp3']
                      >>> glob.glob('c:\\music\\_singles\\s*.mp3')          
                      ['c:\\music\\_singles\\sidewinder.mp3',
                      'c:\\music\\_singles\\spinning.mp3']
                      >>> glob.glob('c:\\music\\*\\*.mp3')
                      
                      1. As you saw earlier, os.listdir simply takes a directory path and lists all files and directories in that directory.
                      2. The glob module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard. Here the wildcard is a directory path plus "*.mp3", which will match all .mp3 files. Note that each element of the returned list already includes the full path of the file.
                      3. If you want to find all the files in a specific directory that start with "s" and end with ".mp3", you can do that too.
                      4. Now consider this scenario: you have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory. You can get a list of all of those with a single call to glob, by using two wildcards at once. One wildcard is the "*.mp3" (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music. That's a crazy amount of power packed into one deceptively simple-looking function!
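The two-wildcard pattern can be tried against a small temporary tree. This is only a sketch; the directory and file names are invented, and os.path.join builds the pattern so that it works on any platform.

```python
import glob
import os
import shutil
import tempfile

music = tempfile.mkdtemp()
try:
    # Create two subdirectories with a mix of .mp3 and other files.
    for album, track in (("ap", "mahadeva.mp3"),
                         ("_singles", "kairo.mp3"),
                         ("_singles", "cover.jpg")):
        album_dir = os.path.join(music, album)
        if not os.path.isdir(album_dir):
            os.mkdir(album_dir)
        open(os.path.join(album_dir, track), "w").close()

    # One wildcard for the subdirectory, one for the filename.
    pattern = os.path.join(music, "*", "*.mp3")
    found = sorted(os.path.basename(path) for path in glob.glob(pattern))
finally:
    shutil.rmtree(music)   # clean up the temporary tree

print(found)
```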

                        Further Reading on the os Module

                        [HTML stuff was here]

                        8.5. locals and globals

                        Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.

                        Remember locals? You first saw it here:

                        
                            def unknown_starttag(self, tag, attrs):
                                strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
                                self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
                        

                        No, wait, you can't learn about locals yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention.

                        Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.

                        At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which keeps track of the function's variables, including function arguments and locally defined variables. Each module has its own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any module, which holds built-in functions and exceptions.

                        When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:

                        1. local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching.
                        2. global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching.
                        3. built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable.

                        If Python doesn't find x in any of these namespaces, it gives up and raises a NameError with the message There is no variable named 'x', which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn't appreciate how much work Python was doing before giving you that error.
                        Important: Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or lambda function, Python will search for that variable in the current (nested or lambda) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or lambda) function's namespace, then in the parent function's namespace, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:
                        
                        from __future__ import nested_scopes
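The search order can be observed with a handful of small functions. This sketch assumes Python 3, where the built-in namespace is exposed as the builtins module (Python 2 calls it __builtin__) and nested scopes are always on; all function names are invented for illustration.

```python
import builtins   # Python 3 name for the module behind the built-in namespace

x = "global x"


def uses_local():
    x = "local x"       # found in the local namespace, so the search stops
    return x


def uses_global():
    return x            # no local x, so the module-level x is used


def uses_builtin():
    return len          # neither local nor global, so the built-in is used


def outer():
    y = "outer y"

    def inner():
        return y        # nested scopes: found in the enclosing function
    return inner()


def uses_missing():
    try:
        return no_such_name   # not found in any namespace
    except NameError:
        return "NameError raised"


print(uses_local())
print(uses_global())
print(uses_builtin() is builtins.len)
print(outer())
print(uses_missing())
```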

                        Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function.

                        Example 8.10. Introducing locals

                        >>> def foo(arg): 
                        ...    x = 1
                        ...    print locals()
                        ...    
                        >>> foo(7)        
                        {'arg': 7, 'x': 1}
                        >>> foo('bar')    
                        {'arg': 'bar', 'x': 1}
                        1. The function foo has two variables in its local namespace: arg, whose value is passed in to the function, and x, which is defined within the function.
                        2. locals returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values of the dictionary are the actual values of the variables. So calling foo with 7 prints the dictionary containing the function's two local variables: arg (7) and x (1).
                        3. Remember, Python has dynamic typing, so you could just as easily pass a string in for arg; the function (and the call to locals) would still work just as well. locals works with all variables of all datatypes.
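Here is a sketch of how the dictionary returned by locals feeds string formatting, modeled on the unknown_starttag method shown earlier. The standalone function start_tag is hypothetical; it returns the string instead of appending it to self.pieces.

```python
def start_tag(tag, attrs):
    "rebuild an opening tag from its name and a list of (key, value) pairs"
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # locals() holds entries for tag, attrs, and strattrs, so the
    # %(name)s placeholders are filled from the local variables by name.
    return "<%(tag)s%(strattrs)s>" % locals()


print(start_tag("a", [("href", "index.html"), ("title", "Home")]))
print(start_tag("br", []))
```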

                          What locals does for the local (function) namespace, globals does for the global (module) namespace. globals is more exciting, though, because a module's namespace is more exciting. Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes defined in the module. Plus, it includes anything that was imported into the module.

                          Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access any of its functions or attributes: module.function. But with from module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you access them directly without referencing the original module they came from. With the globals function, you can actually see this happen.

                          Example 8.11. Introducing globals

                          Look at the following block of code at the bottom of BaseHTMLProcessor.py:

                          
                          if __name__ == "__main__":
                              for k, v in globals().items():             
                                  print k, "=", v
                          1. Just so you don't get intimidated, remember that you've seen all this before. The globals function returns a dictionary, and you're iterating through the dictionary using the items method and multi-variable assignment. The only thing new here is the globals function.

                            Now running the script from the command line gives this output (note that your output may be slightly different, depending on your platform and where you installed Python):

                            c:\docbook\dip\py> python BaseHTMLProcessor.py
                            
                            SGMLParser = sgmllib.SGMLParser                
                            htmlentitydefs = <module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'> 
                            BaseHTMLProcessor = __main__.BaseHTMLProcessor 
                            __name__ = __main__          
                            ... rest of output omitted for brevity...
                            1. SGMLParser was imported from sgmllib, using from module import. That means that it was imported directly into the module's namespace, and here it is.
                            2. Contrast this with htmlentitydefs, which was imported using import. That means that the htmlentitydefs module itself is in the namespace, but the entitydefs variable defined within htmlentitydefs is not.
                            3. This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class.
                            4. Remember the if __name__ trick? When running a module (as opposed to importing it from another module), the built-in __name__ attribute is a special value, __main__. Since you ran this module as a script from the command line, __name__ is __main__, which is why the little test code to print the globals got executed.
                              Note: Using the locals and globals functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors the functionality of the getattr function, which allows you to access arbitrary functions dynamically by providing the function name as a string.

                              There is one other important difference between the locals and globals functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning it.

                              Example 8.12. locals is read-only, globals is not

                              
                              def foo(arg):
                                  x = 1
                                  print locals()    
                                  locals()["x"] = 2 
                                  print "x=",x      
                              
                              z = 7
                              print "z=",z
                              foo(3)
                              globals()["z"] = 8    
                              print "z=",z          
                              
                              1. Since foo is called with 3, this will print {'arg': 3, 'x': 1}. This should not be a surprise.
                              2. locals is a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this would change the value of the local variable x to 2, but it doesn't. locals does not actually return the local namespace, it returns a copy. So changing it does nothing to the value of the variables in the local namespace.
                              3. This prints x= 1, not x= 2.
                              4. After being burned by locals, you might think that this wouldn't change the value of z, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), globals returns the actual global namespace, not a copy: the exact opposite behavior of locals. So any changes to the dictionary returned by globals directly affect your global variables.
                              5. This prints z= 8, not z= 7. [XML stuff was here]
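The same contrast can be written so that the results are captured in variables instead of printed. This is a sketch assuming it runs as an ordinary script or module; the function names are invented.

```python
z = 7


def try_to_change_local():
    x = 1
    snapshot = locals()
    snapshot["x"] = 2          # changes the dictionary, not the variable
    return x


def change_global():
    globals()["z"] = 8         # changes the real module-level variable


local_result = try_to_change_local()
change_global()

print(local_result)
print(z)
```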

                                9.2. Packages

                                Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages.

                                Example 9.5. Loading an XML document (a sneak peek)

                                >>> from xml.dom import minidom 
                                >>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')
                                1. This is a syntax you haven't seen before. It looks almost like the from module import you know and love, but the "." gives it away as something above and beyond a simple import. In fact, xml is what is known as a package, dom is a nested package within xml, and minidom is a module within xml.dom.
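If you don't have an XML file handy, the same import can be tried with minidom.parseString, which takes the XML text directly. This is a sketch; the inline document and its element names are made up and are not the contents of binary.xml.

```python
from xml.dom import minidom

# parseString takes the XML text directly, so no file is needed.
xmldoc = minidom.parseString("<grammar><ref id='bit'/></grammar>")

# The parsed document exposes its root element and child elements.
root_name = xmldoc.documentElement.tagName
ref_id = xmldoc.getElementsByTagName("ref")[0].getAttribute("id")

print(minidom.__name__)
print(root_name)
print(ref_id)
```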

                                  That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still just .py files, like always, except that they're in a subdirectory instead of the main lib/ directory of your Python installation.

                                  Example 9.6. File layout of a package

                                  Python21/           root Python installation (home of the executable)
                                  |
                                  +--lib/             library directory (home of the standard library modules)
                                     |
                                     +-- xml/         xml package (really just a directory with other stuff in it)
                                         |
                                         +--sax/      xml.sax package (again, just a directory)
                                         |
                                         +--dom/      xml.dom package (contains minidom.py)
                                         |
                                         +--parsers/  xml.parsers package (used internally)

                                  So when you say from xml.dom import minidom, Python figures out that that means “look in the xml directory for a dom directory, and look in that for the minidom module, and import it as minidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import specific classes or functions from a module contained within a package. You can also import the package itself as a module. The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.

                                  Example 9.7. Packages are modules, too

                                  >>> from xml.dom import minidom         
                                  >>> minidom
                                  <module 'xml.dom.minidom' from 'C:\Python21\lib\xml\dom\minidom.pyc'>
                                  >>> minidom.Element
                                  <class xml.dom.minidom.Element at 01095744>
                                  >>> from xml.dom.minidom import Element 
                                  >>> Element
                                  <class xml.dom.minidom.Element at 01095744>
                                  >>> minidom.Element
                                  <class xml.dom.minidom.Element at 01095744>
                                  >>> from xml import dom                 
                                  >>> dom
                                  <module 'xml.dom' from 'C:\Python21\lib\xml\dom\__init__.pyc'>
                                  >>> import xml        
                                  >>> xml
                                  <module 'xml' from 'C:\Python21\lib\xml\__init__.pyc'>
                                  1. Here you're importing a module (minidom) from a nested package (xml.dom). The result is that minidom is imported into your namespace, and in order to reference classes within the minidom module (like Element), you need to preface them with the module name.
                                  2. Here you are importing a class (Element) from a module (minidom) from a nested package (xml.dom). The result is that Element is imported directly into your namespace. Note that this does not interfere with the previous import; the Element class can now be referenced in two ways (but it's all still the same class).
                                  3. Here you are importing the dom package (a nested package of xml) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even have its own attributes and methods, just like the modules you've seen before.
                                  4. Here you are importing the root level xml package as a module.
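The claims in these notes are easy to check for yourself. The following is a small sketch, written for Python 3, that performs the same imports and then inspects the results; the variable names same_class, package_file, and dom_has_node are only illustrative.

```python
import os

import xml
from xml import dom
from xml.dom import minidom
from xml.dom.minidom import Element

# Both spellings refer to one and the same class object.
same_class = minidom.Element is Element

# Importing a package really loads the __init__ file inside its directory.
package_file = os.path.basename(xml.__file__)

# A package can carry attributes of its own, such as xml.dom's Node class.
dom_has_node = hasattr(dom, "Node")

print(same_class, package_file, dom_has_node)
```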

                                    So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)? The answer is the magical __init__.py file. You see, packages are not simply directories; they are directories with a specific file, __init__.py, inside. This file defines the attributes and methods of the package. For instance, xml.dom contains a Node class, which is defined in xml/dom/__init__.py. When you import a package as a module (like dom from xml), you're really importing its __init__.py file.
                                    Note: A package is a directory with the special __init__.py file in it. The __init__.py file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, but it has to exist. If __init__.py doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.
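To see the role of __init__.py in action, you can build a throwaway package in a temporary directory. This is only a sketch for Python 3: the package name demopkg_sketch, its helper module, and the GREETING attribute are all invented for the example.

```python
import importlib
import os
import sys
import tempfile

# Create <tempdir>/demopkg_sketch/ containing __init__.py and helper.py.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demopkg_sketch")
os.mkdir(package_dir)

with open(os.path.join(package_dir, "__init__.py"), "w") as init_file:
    init_file.write("GREETING = 'hello from __init__.py'\n")

with open(os.path.join(package_dir, "helper.py"), "w") as module_file:
    module_file.write("def double(n):\n    return n * 2\n")

# Make the temporary directory importable, then import the package
# itself and a module that lives inside it.
sys.path.insert(0, root)
importlib.invalidate_caches()

import demopkg_sketch
from demopkg_sketch import helper

print(demopkg_sketch.GREETING)
print(helper.double(21))
```

The attribute on the package comes straight from its __init__.py file, and the module inside it is still just a .py file in that directory.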

                                    So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an xml package with sax and dom packages inside, the authors could have chosen to put all the sax functionality in xmlsax.py and all the dom functionality in xmldom.py, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the XML package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different areas simultaneously).

                                    If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good package architecture. It's one of the many things Python is good at, so take advantage of it.

                                    9.3. Parsing XML

                                    As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you.

                                    10.6. Handling command-line arguments

                                    Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short- or long-style flags to specify various options. None of this is XML-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.

                                    It's difficult to talk about command-line processing without understanding how command-line arguments are exposed to your Python program, so let's write a simple program to see them.

                                    Example 10.20. Introducing sys.argv

                                    If you have not already done so, you can download this and other examples used in this book.

                                    
                                    #argecho.py
                                    import sys
                                    
                                    for arg in sys.argv: 
                                        print arg
                                    1. Each command-line argument passed to the program will be in sys.argv, which is just a list. Here you are printing each argument on a separate line.

                                      Example 10.21. The contents of sys.argv

                                      [you@localhost py]$ python argecho.py             
                                      argecho.py
                                      [you@localhost py]$ python argecho.py abc def     
                                      argecho.py
                                      abc
                                      def
                                      [you@localhost py]$ python argecho.py --help      
                                      argecho.py
                                      --help
                                      [you@localhost py]$ python argecho.py -m kant.xml 
                                      argecho.py
                                      -m
                                      kant.xml
                                      1. The first thing to know about sys.argv is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later, in Chapter 16, Functional Programming. Don't worry about it for now.
                                      2. Command-line arguments are separated by spaces, and each shows up as a separate element in the sys.argv list.
                                      3. Command-line flags, like --help, also show up as their own element in the sys.argv list.
                                      4. To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag (-m) which takes an argument (kant.xml). Both the flag itself and the flag's argument are simply sequential elements in the sys.argv list. No attempt is made to associate one with the other; all you get is a list.

                                        So as you can see, you certainly have all the information passed on the command line, but then again, it doesn't look like it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags, you can simply use sys.argv[1] to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need the getopt module.
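For the simple case, a sketch like the following is usually enough. It is written for Python 3, and first_argument is a hypothetical helper name, not something from the sample program; it just guards against sys.argv[1] not being there.

```python
import sys


def first_argument(argv, default=None):
    """Return the first real argument, skipping the script name in argv[0]."""
    if len(argv) > 1:
        return argv[1]
    return default


# In a real script you would pass sys.argv; fixed lists keep the sketch predictable.
with_argument = first_argument(["argecho.py", "abc", "def"])
without_argument = first_argument(["argecho.py"], default="kant.xml")
print(with_argument, without_argument)
```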

                                        Example 10.22. Introducing getopt

                                        
                                        def main(argv):       
                                            grammar = "kant.xml"                 
                                            try:              
                                                opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) 
                                            except getopt.GetoptError:           
                                                usage()        
                                                sys.exit(2)   
                                        
                                        ...
                                        
                                        if __name__ == "__main__":
                                            main(sys.argv[1:])
                                        1. First off, look at the bottom of the example and notice that you're calling the main function with sys.argv[1:]. Remember, sys.argv[0] is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off and pass the rest of the list.
                                        2. This is where all the interesting processing happens. The getopt function of the getopt module takes three parameters: the argument list (which you got from sys.argv[1:]), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is explained in more detail below.
                                        3. If anything goes wrong trying to parse these command-line flags, getopt will raise an exception, which you catch. You told getopt all the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.
                                        4. As is standard practice in the UNIX world, when the script is passed flags it doesn't understand, you print out a summary of proper usage and exit gracefully. Note that I haven't shown the usage function here. You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.
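Here is a small sketch of that error path, written for Python 3 (note the "except ... as err" spelling). The try_parse function is hypothetical and returns a message instead of calling usage() and sys.exit(2), so that you can experiment without ending your session.

```python
import getopt


def try_parse(argv):
    """Parse argv with the same flags as the example, reporting errors as text."""
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError as err:
        # An unknown flag, or a flag missing its required argument, lands here.
        return "error: %s" % err
    return opts, args


unknown_flag = try_parse(["-x"])       # -x was never declared
missing_argument = try_parse(["-g"])   # -g needs a value after it
accepted = try_parse(["-h"])           # a flag getopt knows about

print(unknown_flag)
print(missing_argument)
print(accepted)
```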

                                          So what are all those parameters you pass to the getopt function? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element, the script name, which you already chopped off before calling the main function). The second is the list of short command-line flags that the script accepts.

                                          "hg:d"

                                          -h
                                          print usage summary
                                          -g ...
                                          use specified grammar file or URL
                                          -d
                                          show debugging information while parsing

                                          The first and third flags are simply standalone flags; you specify them or you don't, and they do things (print help) or change state (turn on debugging). However, the second flag (-g) must be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address, and you don't know which yet (you'll figure it out later), but you know it has to be something. So you tell getopt this by putting a colon after the g in that second parameter to the getopt function.

                                          To further complicate things, the script accepts either short flags (like -h) or long flags (like --help), and you want them to do the same thing. This is what the third parameter to getopt is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.

                                          ["help", "grammar="]

                                          --help
                                          print usage summary
                                          --grammar ...
                                          use specified grammar file or URL

                                          Three things of note here:

                                          1. All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling getopt. They are understood.
                                          2. The --grammar flag must always be followed by an additional argument, just like the -g flag. This is notated by an equals sign, "grammar=".
                                          3. The list of long flags is shorter than the list of short flags, because the -d flag does not have a corresponding long version. This is fine; only -d will turn on debugging. Note that getopt does not actually pair up short flags with long ones, so the order of the two lists doesn't matter; it is up to your own code to treat -h and --help (or -g and --grammar) as equivalent.
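It can help to see what getopt actually hands back for these two parameters. The following sketch, written for Python 3, feeds it a couple of hand-made argument lists; the constant names SHORT_FLAGS and LONG_FLAGS and the sample arguments are just for illustration.

```python
import getopt

SHORT_FLAGS = "hg:d"                # -h, -g <value>, -d
LONG_FLAGS = ["help", "grammar="]   # --help, --grammar <value>

# Short flags, followed by one leftover argument.
short_opts, short_args = getopt.getopt(
    ["-d", "-g", "binary.xml", "extra"], SHORT_FLAGS, LONG_FLAGS)

# Long flags; the value may be attached with an equals sign.
long_opts, long_args = getopt.getopt(
    ["--help", "--grammar=binary.xml"], SHORT_FLAGS, LONG_FLAGS)

print(short_opts, short_args)
print(long_opts, long_args)
```

Each flag comes back as a (flag, value) tuple with its dashes intact, and a flag that takes no value is paired with an empty string.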

                                          Confused yet? Let's look at the actual code and see if it makes sense in context.

                                          Example 10.23. Handling command-line arguments in kgp.py

                                          
                                          def main(argv):        
                                              grammar = "kant.xml"                
                                              try:              
                                                  opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
                                              except getopt.GetoptError:          
                                                  usage()       
                                                  sys.exit(2)   
                                              for opt, arg in opts:                
                                                  if opt in ("-h", "--help"):      
                                                      usage()   
                                                      sys.exit()
                                                  elif opt == '-d':                
                                                      global _debug               
                                                      _debug = 1
                                                  elif opt in ("-g", "--grammar"): 
                                                      grammar = arg               
                                          
                                              source = "".join(args)               
                                          
                                              k = KantGenerator(grammar, source)
                                              print k.output()
                                          1. The grammar variable will keep track of the grammar file you're using. You initialize it here in case it's not specified on the command line (using either the -g or the --grammar flag).
                                          2. The opts variable that you get back from getopt contains a list of tuples: flag and argument. If the flag doesn't take an argument, then arg will simply be an empty string. This makes it easier to loop through the flags.
                                          3. getopt validates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags. If you specify the -h flag, opt will contain "-h"; if you specify the --help flag, opt will contain "--help". So you need to check for both.
                                          4. Remember, the -d flag didn't have a corresponding long flag, so you only need to check for the short form. If you find it, you set a global variable that you'll refer to later to print out debugging information. (I used this during the development of the script. What, you thought all these examples worked on the first try?)
                                          5. If you find a grammar file, either with a -g flag or a --grammar flag, you save the argument that followed it (stored in arg) into the grammar variable, overwriting the default that you initialized at the top of the main function.
                                          6. That's it. You've looped through and dealt with all the command-line flags. That means that anything left must be command-line arguments. These come back from the getopt function in the args variable. In this case, you're treating them as source material for the parser. If there are no command-line arguments specified, args will be an empty list, and source will end up as the empty string.
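If you would like to experiment with this loop outside of kgp.py, here is one possible variation, written for Python 3. It is a sketch, not the script's actual code: parse_command_line and the settings dictionary are hypothetical, and it returns what it found instead of setting a global or calling usage().

```python
import getopt


def parse_command_line(argv, default_grammar="kant.xml"):
    """Return the chosen settings as a dictionary instead of using globals."""
    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    settings = {"grammar": default_grammar, "debug": False, "help": False}
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            settings["help"] = True
        elif opt == "-d":
            settings["debug"] = True
        elif opt in ("-g", "--grammar"):
            settings["grammar"] = arg
    # Whatever getopt did not consume is treated as source material.
    settings["source"] = "".join(args)
    return settings


defaults = parse_command_line([])
custom = parse_command_line(["-d", "--grammar", "binary.xml", "<xref id='x'/>"])
print(defaults)
print(custom)
```

Returning a dictionary makes the function easy to call from an interactive session or a unit test, at the cost of a little more code in the caller.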

                                            10.7. Putting it all together

                                            You've covered a lot of ground. Let's step back and see how all the pieces fit together.

                                            To start with, this is a script that takes its arguments on the command line, using the getopt module.

                                            
                                            def main(argv):       
                                            ...
                                                try:              
                                                    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
                                                except getopt.GetoptError:          
                                            ...
                                                for opt, arg in opts:               
                                            ...

                                            You create a new instance of the KantGenerator class, and pass it the grammar file and source that may or may not have been specified on the command line.

                                            
                                                k = KantGenerator(grammar, source)

                                            The KantGenerator instance automatically loads the grammar, which is an XML file. You use your custom openAnything function to open the file (which could be stored in a local file or a remote web server), then use the built-in minidom parsing functions to parse the XML into a tree of Python objects.

                                            
                                                def _load(self, source):
                                                    sock = toolbox.openAnything(source)
                                                    xmldoc = minidom.parse(sock).documentElement
                                                    sock.close()

                                            Oh, and along the way, you take advantage of your knowledge of the structure of the XML document to set up a little cache of references, which are just elements in the XML document.

                                            
                                                def loadGrammar(self, grammar):       
                                                    for ref in self.grammar.getElementsByTagName("ref"):
                                                        self.refs[ref.attributes["id"].value] = ref     

                                            If you specified some source material on the command line, you use that; otherwise you rip through the grammar looking for the "top-level" reference (that isn't referenced by anything else) and use that as a starting point.

                                            
                                                def getDefaultSource(self):
                                                    xrefs = {}
                                                    for xref in self.grammar.getElementsByTagName("xref"):
                                                        xrefs[xref.attributes["id"].value] = 1
                                                    xrefs = xrefs.keys()
                                                    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
                                                    return '<xref id="%s"/>' % random.choice(standaloneXrefs)
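The reference cache and the search for a "top-level" reference can be tried out on a tiny grammar. This is a sketch for Python 3 under a few assumptions: the two-entry grammar below is made up, and it uses a set and a dictionary comprehension where the class uses loops.

```python
from xml.dom import minidom

# A made-up grammar: "sentence" refers to "noun", and nothing refers to "sentence".
grammar = minidom.parseString(
    '<grammar>'
    '<ref id="sentence"><p><xref id="noun"/></p></ref>'
    '<ref id="noun"><p>cat</p></ref>'
    '</grammar>'
).documentElement

# Cache every ref element by its id, as loadGrammar does.
refs = {ref.attributes["id"].value: ref
        for ref in grammar.getElementsByTagName("ref")}

# Collect every id that some xref points at.
xrefs = {xref.attributes["id"].value
         for xref in grammar.getElementsByTagName("xref")}

# A ref that is never the target of an xref is a possible starting point.
standalone = [name for name in refs if name not in xrefs]
print(standalone)
```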

                                            Now you rip through the source material. The source material is also XML, and you parse it one node at a time. To keep the code separated and more maintainable, you use separate handlers for each node type.

                                            
                                                def parse_Element(self, node): 
                                                    handlerMethod = getattr(self, "do_%s" % node.tagName)
                                                    handlerMethod(node)
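The getattr dispatch is worth trying in isolation. Here is a stripped-down sketch for Python 3; the TagDispatcher class and its seen list are hypothetical stand-ins for KantGenerator, and the handlers only record which tag they were called for.

```python
from xml.dom import minidom


class TagDispatcher:
    """Route each element to a method named after its tag."""

    def __init__(self):
        self.seen = []

    def parse_Element(self, node):
        # Build the handler name from the tag, then look it up on self.
        handler = getattr(self, "do_%s" % node.tagName)
        handler(node)

    def do_p(self, node):
        self.seen.append("p")

    def do_choice(self, node):
        self.seen.append("choice")


document = minidom.parseString("<p><choice/></p>")
dispatcher = TagDispatcher()
dispatcher.parse_Element(document.documentElement)             # the <p> element
dispatcher.parse_Element(document.documentElement.firstChild)  # the <choice> element
print(dispatcher.seen)
```

Supporting a new tag then means adding one more do_ method; the dispatching code itself never changes.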

                                            You bounce through the grammar, parsing all the children of each p element,

                                            
                                                def do_p(self, node):
                                            ...
                                                    if doit:
                                                        for child in node.childNodes: self.parse(child)

                                            replacing choice elements with a random child,

                                            
                                                def do_choice(self, node):
                                                    self.parse(self.randomChildElement(node))

                                            and replacing xref elements with a random child of the corresponding ref element, which you previously cached.

                                            
                                                def do_xref(self, node):
                                                    id = node.attributes["id"].value
                                                    self.parse(self.randomChildElement(self.refs[id]))

                                            Eventually, you parse your way down to plain text,

                                            
                                                def parse_Text(self, node):    
                                                    text = node.data
                                            ...
                                                        self.pieces.append(text)

                                            which you print out.

                                            
                                            def main(argv):       
                                            ...
                                                k = KantGenerator(grammar, source)
                                                print k.output()

                                            10.8. Summary

                                            Python comes with powerful libraries for parsing and manipulating XML documents. The minidom module takes an XML file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments, error handling, even the ability to take input from the piped result of a previous program.

                                            Before moving on to the next chapter, you should be comfortable doing all of these things:

                                            [HTTP web services stuff was here] [unit testing stuff was here]

                                            Chapter 14. Test-First Programming

                                            14.1. roman.py, stage 1

                                            Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps in roman.py.

                                            Example 14.1. roman1.py

                                            This file is available in py/roman/stage1/ in the examples directory.

                                            
                                            """Convert to and from Roman numerals"""
                                            
                                            #Define exceptions
                                            class RomanError(Exception): pass                
                                            class OutOfRangeError(RomanError): pass          
                                            class NotIntegerError(RomanError): pass
                                            class InvalidRomanNumeralError(RomanError): pass 
                                            
                                            def to_roman(n):
                                                """convert integer to Roman numeral"""
                                                pass     
                                            
                                            def from_roman(s):
                                                """convert Roman numeral to integer"""
                                                pass
                                            
                                            1. This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not required) that you subclass Exception, which is the base class that all built-in exceptions inherit from. Here I am defining RomanError (inherited from Exception) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily have inherited each individual exception from the Exception class directly.
                                            2. The OutOfRangeError and NotIntegerError exceptions will eventually be used by to_roman() to flag various forms of invalid input, as specified in ToRomanBadInput.
                                            3. The InvalidRomanNumeralError exception will eventually be used by from_roman() to flag invalid input, as specified in FromRomanBadInput.
                                            4. At this stage, you want to define the API of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word pass.
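To get a feel for what the shared base class buys you, here is a short sketch for Python 3. The exception classes and the to_roman stub mirror the listing; describe_failure is a hypothetical helper added only for this demonstration.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass


def to_roman(n):
    """convert integer to Roman numeral"""
    pass


def describe_failure(exception_class):
    """Raise the given exception and catch it through the common base class."""
    try:
        raise exception_class("bad input")
    except RomanError as err:
        return type(err).__name__


caught = [describe_failure(cls) for cls in
          (OutOfRangeError, NotIntegerError, InvalidRomanNumeralError)]
stub_result = to_roman(5)   # a function stubbed out with pass returns None

print(caught)
print(stub_result)
```

A single except RomanError clause handles all three custom exceptions, and the stubbed function quietly returns None, which is why calling a string method on its result produces an error rather than a clean failure.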

                                              Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to romantest.py and re-evaluate why you coded a test so useless that it passes with do-nothing functions.

                                              Run romantest1.py with the -v command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:

                                              Example 14.2. Output of romantest1.py against roman1.py

                                              from_roman should only accept uppercase input ... ERROR
                                              to_roman should always return uppercase ... ERROR
                                              from_roman should fail with malformed antecedents ... FAIL
                                              from_roman should fail with repeated pairs of numerals ... FAIL
                                              from_roman should fail with too many repeated numerals ... FAIL
                                              from_roman should give known result with known input ... FAIL
                                              to_roman should give known result with known input ... FAIL
                                              from_roman(to_roman(n))==n for all n ... FAIL
                                              to_roman should fail with non-integer input ... FAIL
                                              to_roman should fail with negative input ... FAIL
                                              to_roman should fail with large input ... FAIL
                                              to_roman should fail with 0 input ... FAIL
                                              
                                              ======================================================================
                                              ERROR: from_roman should only accept uppercase input
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 154, in testFromRomanCase
                                                  roman1.from_roman(numeral.upper())
                                              AttributeError: 'None' object has no attribute 'upper'
                                              ======================================================================
                                              ERROR: to_roman should always return uppercase
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 148, in testToRomanCase
                                                  self.assertEqual(numeral, numeral.upper())
                                              AttributeError: 'None' object has no attribute 'upper'
                                              ======================================================================
                                              FAIL: from_roman should fail with malformed antecedents
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 133, in testMalformedAntecedent
                                                  self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
                                                File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                  raise self.failureException, excName
                                              AssertionError: InvalidRomanNumeralError
                                              ======================================================================
                                              FAIL: from_roman should fail with repeated pairs of numerals
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 127, in testRepeatedPairs
                                                  self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
                                                File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                  raise self.failureException, excName
                                              AssertionError: InvalidRomanNumeralError
                                              ======================================================================
                                              FAIL: from_roman should fail with too many repeated numerals
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 122, in testTooManyRepeatedNumerals
                                                  self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
                                                File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                  raise self.failureException, excName
                                              AssertionError: InvalidRomanNumeralError
                                              ======================================================================
                                              FAIL: from_roman should give known result with known input
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 99, in testFromRomanKnownValues
                                                  self.assertEqual(integer, result)
                                                File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                                                  raise self.failureException, (msg or '%s != %s' % (first, second))
                                              AssertionError: 1 != None
                                              ======================================================================
                                              FAIL: to_roman should give known result with known input
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 93, in testToRomanKnownValues
                                                  self.assertEqual(numeral, result)
                                                File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                                                  raise self.failureException, (msg or '%s != %s' % (first, second))
                                              AssertionError: I != None
                                              ======================================================================
                                              FAIL: from_roman(to_roman(n))==n for all n
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 141, in testSanity
                                                  self.assertEqual(integer, result)
                                                File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                                                  raise self.failureException, (msg or '%s != %s' % (first, second))
                                              AssertionError: 1 != None
                                              ======================================================================
                                              FAIL: to_roman should fail with non-integer input
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 116, in testNonInteger
                                                  self.assertRaises(roman1.NotIntegerError, roman1.to_roman, 0.5)
                                                File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                  raise self.failureException, excName
                                              AssertionError: NotIntegerError
                                              ======================================================================
                                              FAIL: to_roman should fail with negative input
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 112, in testNegative
                                                  self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, -1)
                                                File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                  raise self.failureException, excName
                                              AssertionError: OutOfRangeError
                                              ======================================================================
                                              FAIL: to_roman should fail with large input
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 104, in testTooLarge
                                                  self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 4000)
                                                File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                  raise self.failureException, excName
                                              AssertionError: OutOfRangeError
                                              ======================================================================
                                              FAIL: to_roman should fail with 0 input               
                                              ----------------------------------------------------------------------
                                              Traceback (most recent call last):
                                                File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 108, in testZero
                                                  self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 0)
                                                File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                  raise self.failureException, excName
                                              AssertionError: OutOfRangeError    
                                              ----------------------------------------------------------------------
                                              Ran 12 tests in 0.040s             
                                              
                                              FAILED (failures=10, errors=2)     

                                              14.2. roman.py, stage 2

                                              Now that you have the framework of the roman module laid out, it's time to start writing code and passing test cases.

                                              Example 14.3. roman2.py

                                              This file is available in py/roman/stage2/ in the examples directory.

                                              If you have not already done so, you can download this and other examples used in this book.

                                              
                                              """Convert to and from Roman numerals"""
                                              
                                              #Define exceptions
                                              class RomanError(Exception): pass
                                              class OutOfRangeError(RomanError): pass
                                              class NotIntegerError(RomanError): pass
                                              class InvalidRomanNumeralError(RomanError): pass
                                              
                                              #Define digit mapping
                                              romanNumeralMap = (('M',  1000),
                                                                 ('CM', 900),
                                                                 ('D',  500),
                                                                 ('CD', 400),
                                                                 ('C',  100),
                                                                 ('XC', 90),
                                                                 ('L',  50),
                                                                 ('XL', 40),
                                                                 ('X',  10),
                                                                 ('IX', 9),
                                                                 ('V',  5),
                                                                 ('IV', 4),
                                                                 ('I',  1))
                                              
                                              def to_roman(n):
                                                  """convert integer to Roman numeral"""
                                                  result = ""
                                                  for numeral, integer in romanNumeralMap:
                                                      while n >= integer:      
                                                          result += numeral
                                                          n -= integer
                                                  return result
                                              
                                              def from_roman(s):
                                                  """convert Roman numeral to integer"""
                                                  pass
                                              
                                              1. romanNumeralMap is a tuple of tuples which defines three things:
                                                1. The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals; you're also defining two-character pairs like CM (“one hundred less than one thousand”); this will make the to_roman() code simpler later.
                                                2. The order of the Roman numerals. They are listed in descending value order, from M all the way down to I.
                                                3. The value of each Roman numeral. Each inner tuple is a pair of (numeral, value).
                                              2. The loop in to_roman() is where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule. To convert to Roman numerals, you simply iterate through romanNumeralMap looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
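The lookup-and-subtract loop described in the notes above can be sketched on its own. This is a minimal sketch written in Python 3 syntax (the listings in this chapter use Python 2); the names ROMAN_NUMERAL_MAP and to_roman_steps are made up for illustration and are not part of roman2.py. Instead of printing, it records every (value, numeral) pair it consumes, so you can inspect the steps afterwards.

```python
# Same pairs as romanNumeralMap above, in descending value order.
# ROMAN_NUMERAL_MAP and to_roman_steps are illustrative names only.
ROMAN_NUMERAL_MAP = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)


def to_roman_steps(n):
    """Return (numeral, steps); steps lists each (value, numeral) consumed."""
    result = ""
    steps = []
    for numeral, integer in ROMAN_NUMERAL_MAP:
        # Use the current value as many times as it still fits in n.
        while n >= integer:
            result += numeral
            n -= integer
            steps.append((integer, numeral))
    return result, steps


numeral, steps = to_roman_steps(1424)
print(numeral)  # MCDXXIV
print(steps)    # [(1000, 'M'), (400, 'CD'), (10, 'X'), (10, 'X'), (4, 'IV')]
```

Because CM, CD, XC, XL, IX, and IV are entries in the table like any other, the loop never needs a special case for the subtraction rule.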

                                                Example 14.4. How to_roman() works

                                                If you're not clear on how to_roman() works, add a print statement to the end of the while loop:

                                                
                                                        while n >= integer:
                                                            result += numeral
                                                            n -= integer
                                                            print 'subtracting', integer, 'from input, adding', numeral, 'to output'
                                                >>> import roman2
                                                >>> roman2.to_roman(1424)
                                                subtracting 1000 from input, adding M to output
                                                subtracting 400 from input, adding CD to output
                                                subtracting 10 from input, adding X to output
                                                subtracting 10 from input, adding X to output
                                                subtracting 4 from input, adding IV to output
                                                'MCDXXIV'
                                                

                                                So to_roman() appears to work, at least in this manual spot check. But will it pass the unit tests? Not entirely.

                                                Example 14.5. Output of romantest2.py against roman2.py

                                                Remember to run romantest2.py with the -v command-line flag to enable verbose mode.

                                                from_roman should only accept uppercase input ... FAIL
                                                to_roman should always return uppercase ... ok
                                                from_roman should fail with malformed antecedents ... FAIL
                                                from_roman should fail with repeated pairs of numerals ... FAIL
                                                from_roman should fail with too many repeated numerals ... FAIL
                                                from_roman should give known result with known input ... FAIL
                                                to_roman should give known result with known input ... ok       
                                                from_roman(to_roman(n))==n for all n ... FAIL
                                                to_roman should fail with non-integer input ... FAIL            
                                                to_roman should fail with negative input ... FAIL
                                                to_roman should fail with large input ... FAIL
                                                to_roman should fail with 0 input ... FAIL
                                                1. to_roman() does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already.
                                                2. Here's the big news: this version of the to_roman() function passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
                                                3. However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to be raised (via assertRaises), and you're never raising them. You'll do that in the next stage.
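The claims in the second note above about 3999 and 3888 are easy to check by brute force. The following is a rough sketch in Python 3 syntax, assuming a standalone copy of the conversion loop; ROMAN_NUMERAL_MAP and this local to_roman are stand-ins for the code in roman2.py, not imports from it.

```python
# Stand-in copy of the stage 2 conversion, written for Python 3.
ROMAN_NUMERAL_MAP = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)


def to_roman(n):
    """Convert an integer to a Roman numeral (no input checking)."""
    result = ""
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result


# Search every valid input for the one with the longest numeral.
longest_input = max(range(1, 4000), key=lambda n: len(to_roman(n)))
print(longest_input, to_roman(longest_input))  # 3888 MMMDCCCLXXXVIII
print(to_roman(3999))                          # MMMCMXCIX

# Each single-character numeral is produced by exactly its own value.
single_characters = [to_roman(n) for n in (1, 5, 10, 50, 100, 500, 1000)]
print(single_characters)  # ['I', 'V', 'X', 'L', 'C', 'D', 'M']
```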

                                                  Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.

                                                  
                                                  ======================================================================
                                                  FAIL: from_roman should only accept uppercase input
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 156, in testFromRomanCase
                                                      roman2.from_roman, numeral.lower())
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: InvalidRomanNumeralError
                                                  ======================================================================
                                                  FAIL: from_roman should fail with malformed antecedents
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 133, in testMalformedAntecedent
                                                      self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: InvalidRomanNumeralError
                                                  ======================================================================
                                                  FAIL: from_roman should fail with repeated pairs of numerals
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 127, in testRepeatedPairs
                                                      self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: InvalidRomanNumeralError
                                                  ======================================================================
                                                  FAIL: from_roman should fail with too many repeated numerals
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 122, in testTooManyRepeatedNumerals
                                                      self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: InvalidRomanNumeralError
                                                  ======================================================================
                                                  FAIL: from_roman should give known result with known input
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 99, in testFromRomanKnownValues
                                                      self.assertEqual(integer, result)
                                                    File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                                                      raise self.failureException, (msg or '%s != %s' % (first, second))
                                                  AssertionError: 1 != None
                                                  ======================================================================
                                                  FAIL: from_roman(to_roman(n))==n for all n
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 141, in testSanity
                                                      self.assertEqual(integer, result)
                                                    File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                                                      raise self.failureException, (msg or '%s != %s' % (first, second))
                                                  AssertionError: 1 != None
                                                  ======================================================================
                                                  FAIL: to_roman should fail with non-integer input
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 116, in testNonInteger
                                                      self.assertRaises(roman2.NotIntegerError, roman2.to_roman, 0.5)
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: NotIntegerError
                                                  ======================================================================
                                                  FAIL: to_roman should fail with negative input
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 112, in testNegative
                                                      self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, -1)
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: OutOfRangeError
                                                  ======================================================================
                                                  FAIL: to_roman should fail with large input
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 104, in testTooLarge
                                                      self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: OutOfRangeError
                                                  ======================================================================
                                                  FAIL: to_roman should fail with 0 input
                                                  ----------------------------------------------------------------------
                                                  Traceback (most recent call last):
                                                    File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 108, in testZero
                                                      self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 0)
                                                    File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                      raise self.failureException, excName
                                                  AssertionError: OutOfRangeError
                                                  ----------------------------------------------------------------------
                                                  Ran 12 tests in 0.320s
                                                  
                                                  FAILED (failures=10)
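The assertRaises failures above all have the same cause: the function under test returned normally instead of raising the expected exception. Here is a small sketch of that mechanism in Python 3 syntax. The names to_roman_unchecked, to_roman_checked, and run_zero_test are hypothetical helpers invented for this illustration, and OutOfRangeError is a local stand-in for the class in the roman module.

```python
import unittest


class OutOfRangeError(Exception):
    """Local stand-in for the roman module's OutOfRangeError."""


ROMAN_NUMERAL_MAP = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)


def to_roman_unchecked(n):
    """Stage 2 behavior: convert without validating the input."""
    result = ""
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result


def to_roman_checked(n):
    """Same conversion, but reject out-of-range input first."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    return to_roman_unchecked(n)


def run_zero_test(func):
    """Run one assertRaises test against func; return (failures, errors)."""

    class ZeroInputTest(unittest.TestCase):
        def test_zero(self):
            # Fails unless func(0) raises OutOfRangeError.
            self.assertRaises(OutOfRangeError, func, 0)

    result = unittest.TestResult()
    ZeroInputTest("test_zero").run(result)
    return len(result.failures), len(result.errors)


print(run_zero_test(to_roman_unchecked))  # (1, 0)
print(run_zero_test(to_roman_checked))    # (0, 0)
```

The unchecked version quietly returns an empty string for 0, so the test is counted as a failure, not an error: nothing crashed, but the promised exception never arrived.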

                                                  14.3. roman.py, stage 3

                                                  Now that to_roman() behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else).

                                                  Example 14.6. roman3.py

                                                  This file is available in py/roman/stage3/ in the examples directory.

                                                  If you have not already done so, you can download this and other examples used in this book.

                                                  
                                                  """Convert to and from Roman numerals"""
                                                  
                                                  #Define exceptions
                                                  class RomanError(Exception): pass
                                                  class OutOfRangeError(RomanError): pass
                                                  class NotIntegerError(RomanError): pass
                                                  class InvalidRomanNumeralError(RomanError): pass
                                                  
                                                  #Define digit mapping
                                                  romanNumeralMap = (('M',  1000),
                                                                     ('CM', 900),
                                                                     ('D',  500),
                                                                     ('CD', 400),
                                                                     ('C',  100),
                                                                     ('XC', 90),
                                                                     ('L',  50),
                                                                     ('XL', 40),
                                                                     ('X',  10),
                                                                     ('IX', 9),
                                                                     ('V',  5),
                                                                     ('IV', 4),
                                                                     ('I',  1))
                                                  
                                                  def to_roman(n):
                                                      """convert integer to Roman numeral"""
                                                      if not (0 < n < 4000):         
                                                          raise OutOfRangeError, "number out of range (must be 1..3999)" 
                                                      if int(n) != n:
                                                          raise NotIntegerError, "non-integers can not be converted"
                                                  
                                                      result = ""  
                                                      for numeral, integer in romanNumeralMap:
                                                          while n >= integer:
                                                              result += numeral
                                                              n -= integer
                                                      return result
                                                  
                                                  def from_roman(s):
                                                      """convert Roman numeral to integer"""
                                                      pass
                                                  
                                                  1. The test if not (0 < n < 4000) uses a nice Pythonic shortcut: multiple comparisons at once. It is equivalent to if not ((0 < n) and (n < 4000)), but it's much easier to read. This is the range check, and it should catch inputs that are too large, negative, or zero.
                                                  2. You raise exceptions yourself with the raise statement. You can raise any of the built-in exceptions or any custom exception that you've defined. The second parameter, the error message, is optional; if given, it is displayed in the traceback that is printed if the exception is never handled.
                                                  3. The comparison of int(n) with n is the non-integer check. Non-integers cannot be converted to Roman numerals.
                                                  4. The rest of the function is unchanged.
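The two checks can also be tried in isolation. The following is only a minimal sketch, written in Python 3 syntax (`raise OutOfRangeError(...)` and `!=` rather than the `raise X, "message"` and `<>` forms used in the listing above). The helper name `check_input` is hypothetical and is not part of roman3.py.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

def check_input(n):
    """hypothetical helper: only the two input checks from to_roman"""
    # chained comparison, equivalent to (0 < n) and (n < 4000)
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    # int() truncates, so int(1.5) is 1, which is not equal to 1.5
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    return True

# each of these inputs trips one of the two checks
for bad in (0, -1, 4000, 1.5):
    try:
        check_input(bad)
    except RomanError as e:
        print(bad, type(e).__name__)
```

Note the order of the checks: a value such as 1.5 is inside the range, so it passes the first check and is caught only by the second.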

                                                    Example 14.7. Watching to_roman() handle bad input

                                                    >>> import roman3
                                                    >>> roman3.to_roman(4000)
                                                    Traceback (most recent call last):
                                                      File "<interactive input>", line 1, in ?
                                                      File "roman3.py", line 27, in to_roman
                                                        raise OutOfRangeError, "number out of range (must be 1..3999)"
                                                    OutOfRangeError: number out of range (must be 1..3999)
                                                    >>> roman3.to_roman(1.5)
                                                    Traceback (most recent call last):
                                                      File "<interactive input>", line 1, in ?
                                                      File "roman3.py", line 29, in to_roman
                                                        raise NotIntegerError, "non-integers can not be converted"
                                                    NotIntegerError: non-integers can not be converted
                                                    

                                                    Example 14.8. Output of romantest3.py against roman3.py

                                                    from_roman should only accept uppercase input ... FAIL
                                                    to_roman should always return uppercase ... ok
                                                    from_roman should fail with malformed antecedents ... FAIL
                                                    from_roman should fail with repeated pairs of numerals ... FAIL
                                                    from_roman should fail with too many repeated numerals ... FAIL
                                                    from_roman should give known result with known input ... FAIL
                                                    to_roman should give known result with known input ... ok 
                                                    from_roman(to_roman(n))==n for all n ... FAIL
                                                    to_roman should fail with non-integer input ... ok        
                                                    to_roman should fail with negative input ... ok           
                                                    to_roman should fail with large input ... ok
                                                    to_roman should fail with 0 input ... ok
                                                    1. to_roman() still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything.
                                                    2. More exciting is the fact that all of the bad input tests now pass. This test, testNonInteger, passes because of the int(n) <> n check. When a non-integer is passed to to_roman(), the int(n) <> n check notices it and raises the NotIntegerError exception, which is what testNonInteger is looking for.
                                                    3. This test, testNegative, passes because of the not (0 < n < 4000) check, which raises an OutOfRangeError exception, which is what testNegative is looking for.
                                                      
                                                      ======================================================================
                                                      FAIL: from_roman should only accept uppercase input
                                                      ----------------------------------------------------------------------
                                                      Traceback (most recent call last):
                                                        File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 156, in testFromRomanCase
                                                          roman3.from_roman, numeral.lower())
                                                        File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                          raise self.failureException, excName
                                                      AssertionError: InvalidRomanNumeralError
                                                      ======================================================================
                                                      FAIL: from_roman should fail with malformed antecedents
                                                      ----------------------------------------------------------------------
                                                      Traceback (most recent call last):
                                                        File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 133, in testMalformedAntecedent
                                                          self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
                                                        File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                          raise self.failureException, excName
                                                      AssertionError: InvalidRomanNumeralError
                                                      ======================================================================
                                                      FAIL: from_roman should fail with repeated pairs of numerals
                                                      ----------------------------------------------------------------------
                                                      Traceback (most recent call last):
                                                        File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 127, in testRepeatedPairs
                                                          self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
                                                        File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                          raise self.failureException, excName
                                                      AssertionError: InvalidRomanNumeralError
                                                      ======================================================================
                                                      FAIL: from_roman should fail with too many repeated numerals
                                                      ----------------------------------------------------------------------
                                                      Traceback (most recent call last):
                                                        File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 122, in testTooManyRepeatedNumerals
                                                          self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
                                                        File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                          raise self.failureException, excName
                                                      AssertionError: InvalidRomanNumeralError
                                                      ======================================================================
                                                      FAIL: from_roman should give known result with known input
                                                      ----------------------------------------------------------------------
                                                      Traceback (most recent call last):
                                                        File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 99, in testFromRomanKnownValues
                                                          self.assertEqual(integer, result)
                                                        File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                                                          raise self.failureException, (msg or '%s != %s' % (first, second))
                                                      AssertionError: 1 != None
                                                      ======================================================================
                                                      FAIL: from_roman(to_roman(n))==n for all n
                                                      ----------------------------------------------------------------------
                                                      Traceback (most recent call last):
                                                        File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 141, in testSanity
                                                          self.assertEqual(integer, result)
                                                        File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                                                          raise self.failureException, (msg or '%s != %s' % (first, second))
                                                      AssertionError: 1 != None
                                                      ----------------------------------------------------------------------
                                                      Ran 12 tests in 0.401s
                                                      
                                                      FAILED (failures=6) 
                                                      1. You're down to 6 failures, and all of them involve from_roman(): the known values test, the three separate bad input tests, the case check, and the sanity check. That means that to_roman() has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that from_roman() be written, which it isn't yet.) Which means that you must stop coding to_roman() now. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard.
                                                        Note: The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module.
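The bad input tests that now pass can be sketched roughly as follows. This is an illustration in Python 3 syntax, not a copy of romantest3.py: testNonInteger and testNegative are named in the notes above, while the class name, the other test names, and the sample values are assumptions.

```python
import unittest

class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """convert integer to Roman numeral (stage 3 logic, Python 3 syntax)"""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

class ToRomanBadInput(unittest.TestCase):   # class name is an assumption
    def testTooLarge(self):
        self.assertRaises(OutOfRangeError, to_roman, 4000)
    def testZero(self):
        self.assertRaises(OutOfRangeError, to_roman, 0)
    def testNegative(self):
        self.assertRaises(OutOfRangeError, to_roman, -1)
    def testNonInteger(self):
        self.assertRaises(NotIntegerError, to_roman, 0.5)

# run the four tests without exiting the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Each test passes only if to_roman() raises exactly the named exception. Once all four pass, there is nothing left to add to the function's input handling.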

                                                        14.4. roman.py, stage 4

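                                                        Now that to_roman() is done, it's time to start coding from_roman(). Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than the to_roman() function.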

                                                        Example 14.9. roman4.py

                                                        This file is available in py/roman/stage4/ in the examples directory.

                                                        If you have not already done so, you can download this and other examples used in this book.

                                                        
                                                        """Convert to and from Roman numerals"""
                                                        
                                                        #Define exceptions
                                                        class RomanError(Exception): pass
                                                        class OutOfRangeError(RomanError): pass
                                                        class NotIntegerError(RomanError): pass
                                                        class InvalidRomanNumeralError(RomanError): pass
                                                        
                                                        #Define digit mapping
                                                        romanNumeralMap = (('M',  1000),
                                                         ('CM', 900),
                                                         ('D',  500),
                                                         ('CD', 400),
                                                         ('C',  100),
                                                         ('XC', 90),
                                                         ('L',  50),
                                                         ('XL', 40),
                                                         ('X',  10),
                                                         ('IX', 9),
                                                         ('V',  5),
                                                         ('IV', 4),
                                                         ('I',  1))
                                                        
                                                        # to_roman function omitted for clarity (it hasn't changed)
                                                        
                                                        def from_roman(s):
                                                            """convert Roman numeral to integer"""
                                                            result = 0
                                                            index = 0
                                                            for numeral, integer in romanNumeralMap:
                                                                while s[index:index+len(numeral)] == numeral: 
                                                                    result += integer
                                                                    index += len(numeral)
                                                            return result
                                                        
                                                        1. The pattern here is the same as to_roman(). You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer values as often as possible, you match the “highest” Roman numeral character strings as often as possible.

                                                          Example 14.10. How from_roman() works

                                                          If you're not clear how from_roman() works, add a print statement to the end of the while loop:

                                                          
                                                                  while s[index:index+len(numeral)] == numeral:
                                                                      result += integer
                                                                      index += len(numeral)
                                                                      print 'found', numeral, 'of length', len(numeral), ', adding', integer
                                                          >>> import roman4
                                                          >>> roman4.from_roman('MCMLXXII')
                                                          found M of length 1 , adding 1000
                                                          found CM of length 2 , adding 900
                                                          found L of length 1 , adding 50
                                                          found X of length 1 , adding 10
                                                          found X of length 1 , adding 10
                                                          found I of length 1 , adding 1
                                                          found I of length 1 , adding 1
                                                          1972
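If you would rather not edit the module, the same trace can be collected in a list instead of printed. This is a sketch in Python 3 syntax; `from_roman_trace` is a hypothetical name and is not part of roman4.py.

```python
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def from_roman_trace(s):
    """hypothetical variant of from_roman that also returns each match it made"""
    result = 0
    index = 0
    steps = []
    for numeral, integer in romanNumeralMap:
        # consume this numeral as many times as it appears at the current position
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
            steps.append((numeral, integer))
    return result, steps

total, steps = from_roman_trace('MCMLXXII')
for numeral, integer in steps:
    print('found', numeral, 'adding', integer)
print(total)
```

Because the map is ordered from largest to smallest value, two-character numerals such as CM are tried before the single C they begin with, which is why CM is consumed as one unit worth 900.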

                                                          Example 14.11. Output of romantest4.py against roman4.py

                                                          from_roman should only accept uppercase input ... FAIL
                                                          to_roman should always return uppercase ... ok
                                                          from_roman should fail with malformed antecedents ... FAIL
                                                          from_roman should fail with repeated pairs of numerals ... FAIL
                                                          from_roman should fail with too many repeated numerals ... FAIL
                                                          from_roman should give known result with known input ... ok 
                                                          to_roman should give known result with known input ... ok
                                                          from_roman(to_roman(n))==n for all n ... ok
                                                          to_roman should fail with non-integer input ... ok
                                                          to_roman should fail with negative input ... ok
                                                          to_roman should fail with large input ... ok
                                                          to_roman should fail with 0 input ... ok
                                                          1. Two pieces of exciting news here. The first is that from_roman() works for good input, at least for all the known values you test.
                                                          2. The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both to_roman() and from_roman() work properly for all possible good values. (This is not guaranteed; it is theoretically possible that to_roman() has a bug that produces the wrong Roman numeral for some particular set of inputs, and that from_roman() has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that to_roman() generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write more comprehensive test cases until it doesn't bother you.)
                                                            
                                                            ======================================================================
                                                            FAIL: from_roman should only accept uppercase input
                                                            ----------------------------------------------------------------------
                                                            Traceback (most recent call last):
                                                              File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 156, in testFromRomanCase
                                                                roman4.from_roman, numeral.lower())
                                                              File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                                raise self.failureException, excName
                                                            AssertionError: InvalidRomanNumeralError
                                                            ======================================================================
                                                            FAIL: from_roman should fail with malformed antecedents
                                                            ----------------------------------------------------------------------
                                                            Traceback (most recent call last):
                                                              File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 133, in testMalformedAntecedent
                                                                self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
                                                              File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                                raise self.failureException, excName
                                                            AssertionError: InvalidRomanNumeralError
                                                            ======================================================================
                                                            FAIL: from_roman should fail with repeated pairs of numerals
                                                            ----------------------------------------------------------------------
                                                            Traceback (most recent call last):
                                                              File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 127, in testRepeatedPairs
                                                                self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
                                                              File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                                raise self.failureException, excName
                                                            AssertionError: InvalidRomanNumeralError
                                                            ======================================================================
                                                            FAIL: from_roman should fail with too many repeated numerals
                                                            ----------------------------------------------------------------------
                                                            Traceback (most recent call last):
                                                              File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 122, in testTooManyRepeatedNumerals
                                                                self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
                                                              File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                                                                raise self.failureException, excName
                                                            AssertionError: InvalidRomanNumeralError
                                                            ----------------------------------------------------------------------
                                                            Ran 12 tests in 1.222s
                                                            
                                                            FAILED (failures=4)
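The sanity check described above amounts to a round trip over every valid integer. The following sketch, in Python 3 syntax, uses the stage 4 logic; the variable name `mismatches` and the two sample bad inputs are illustrative assumptions. It also shows why four tests still fail: at this stage from_roman() performs no validation at all.

```python
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """convert integer to Roman numeral (input checks omitted for brevity)"""
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    """convert Roman numeral to integer (stage 4: no validation)"""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

# round trip: every n in 1..3999 should survive to_roman followed by from_roman
mismatches = [n for n in range(1, 4000) if from_roman(to_roman(n)) != n]
print(len(mismatches))

# bad input is silently accepted instead of raising an exception
print(from_roman('IIII'))
print(from_roman('mcm'))
```

The round trip finds no mismatches, but 'IIII' quietly converts to 4 and the lowercase 'mcm' to 0, where the test cases expect an InvalidRomanNumeralError.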

                                                            14.5. roman.py, stage 5

                                                            Example 14.12. roman5.py

                                                            This file is available in py/roman/stage5/ in the examples directory.

                                                            
                                                            """Convert to and from Roman numerals"""
                                                            import re
                                                            
                                                            #Define exceptions
                                                            class RomanError(Exception): pass
                                                            class OutOfRangeError(RomanError): pass
                                                            class NotIntegerError(RomanError): pass
                                                            class InvalidRomanNumeralError(RomanError): pass
                                                            
                                                            #Define digit mapping
                                                            romanNumeralMap = (('M',  1000),
                                                             ('CM', 900),
                                                             ('D',  500),
                                                             ('CD', 400),
                                                             ('C',  100),
                                                             ('XC', 90),
                                                             ('L',  50),
                                                             ('XL', 40),
                                                             ('X',  10),
                                                             ('IX', 9),
                                                             ('V',  5),
                                                             ('IV', 4),
                                                             ('I',  1))
                                                            
                                                            def to_roman(n):
                                                                """convert integer to Roman numeral"""
                                                                if not (0 < n < 4000):
                                                                    raise OutOfRangeError, "number out of range (must be 1..3999)"
                                                                if int(n) <> n:
                                                                    raise NotIntegerError, "non-integers can not be converted"
                                                            
                                                                result = ""
                                                                for numeral, integer in romanNumeralMap:
                                                                    while n >= integer:
                                                                        result += numeral
                                                                        n -= integer
                                                                return result
                                                            
                                                            #Define pattern to detect valid Roman numerals
                                                            romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' 
                                                            
                                                            def from_roman(s):
                                                                """convert Roman numeral to integer"""
                                                                if not re.search(romanNumeralPattern, s):
                                                                    raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
                                                            
                                                                result = 0
                                                                index = 0
                                                                for numeral, integer in romanNumeralMap:
                                                                    while s[index:index+len(numeral)] == numeral:
                                                                        result += integer
                                                                        index += len(numeral)
                                                                return result
                                                            
                                                            1. This is just a continuation of the pattern you discussed in Section 7.3, “Case Study: Roman Numerals”. The tens place is either XC (90), XL (40), or an optional L followed by 0 to 3 optional X characters. The ones place is either IX (9), IV (4), or an optional V followed by 0 to 3 optional I characters.
                                                            2. Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If re.search returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid.
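To see how the pattern splits a numeral by place, you can inspect the groups of the match object. This is a small sketch in Python 3 syntax; the helper name `is_valid` and the sample inputs are assumptions made for illustration.

```python
import re

romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

def is_valid(s):
    """hypothetical helper: True if s matches the Roman numeral pattern"""
    return re.search(romanNumeralPattern, s) is not None

# the three parenthesized groups capture the hundreds, tens, and ones places
match = re.search(romanNumeralPattern, 'MCMLXXII')
print(match.groups())

for s in ('MCMLXXII', 'MMMCMXCIX', 'MCMC', 'IIII'):
    print(s, is_valid(s))
```

For 'MCMLXXII' the groups are 'CM', 'LXX', and 'II': the leading M is consumed by the ungrouped thousands part of the pattern. 'MCMC' and 'IIII' fail because, once each place has matched as much as it can, leftover characters remain before the end of the string.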

                                                              At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of invalid Roman numerals. But don't take my word for it; look at the results:

                                                              Example 14.13. Output of romantest5.py against roman5.py

                                                              
                                                              from_roman should only accept uppercase input ... ok          
                                                              to_roman should always return uppercase ... ok
                                                              from_roman should fail with malformed antecedents ... ok      
                                                              from_roman should fail with repeated pairs of numerals ... ok 
                                                              from_roman should fail with too many repeated numerals ... ok
                                                              from_roman should give known result with known input ... ok
                                                              to_roman should give known result with known input ... ok
                                                              from_roman(to_roman(n))==n for all n ... ok
                                                              to_roman should fail with non-integer input ... ok
                                                              to_roman should fail with negative input ... ok
                                                              to_roman should fail with large input ... ok
                                                              to_roman should fail with 0 input ... ok
                                                              
                                                              ----------------------------------------------------------------------
                                                              Ran 12 tests in 2.864s
                                                              
                                                              OK     
                                                              1. One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression romanNumeralPattern was expressed in uppercase characters, the re.search check will reject any input that isn't completely uppercase. So the uppercase input test passes.
                                                              2. More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like MCMC. As you've seen, this does not match the regular expression, so from_roman() raises an InvalidRomanNumeralError exception, which is what the malformed antecedents test case is looking for, so the test passes.
                                                              3. In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test cases.
                                                              4. Note: When all of your tests pass, stop coding.
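The case-sensitivity point in the first callout is easy to check by hand. The following is a minimal sketch in Python 3 syntax (the book's own listings use Python 2). It assumes the pattern has the shape used in roman5.py; the pattern string is reproduced here as an assumption, and is_valid_roman is a hypothetical helper, not part of the module.

```python
import re

# Assumed to mirror romanNumeralPattern in roman5.py; adjust if yours differs.
roman_numeral_pattern = "^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$"


def is_valid_roman(s):
    """Hypothetical helper: True if s matches the pattern, case-sensitively."""
    return re.search(roman_numeral_pattern, s) is not None


print(is_valid_roman("MCMXC"))  # uppercase and well formed
print(is_valid_roman("mcmxc"))  # lowercase: rejected, since matching is case-sensitive
print(is_valid_roman("MCMC"))   # malformed antecedent: rejected

# Only an explicit re.IGNORECASE flag would let the lowercase string through.
print(re.search(roman_numeral_pattern, "mcmxc", re.IGNORECASE) is not None)
```

The last line is there only for contrast: the module does not pass re.IGNORECASE, which is why the uppercase-only test passes.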

                                                                The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the build process for this book; I have unit tests for several of the example programs (not just the roman.py module featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to download them and sit around scratching your head and yelling at your monitor and wondering why they don't work.

                                                                Example 16.1. regression.py

                                                                If you have not already done so, you can download this and other examples used in this book.

                                                                
                                                                """Regression testing framework
                                                                
                                                                This module will search for scripts in the same directory named
                                                                XYZtest.py. Each such script should be a test suite that tests a
                                                                module through PyUnit. (As of Python 2.1, PyUnit is included in
                                                                the standard library as "unittest".)  This script will aggregate all
                                                                found test suites into one big test suite and run them all at once.
                                                                """
                                                                
                                                                import sys, os, re, unittest
                                                                
                                                                def regressionTest():
                                                                    path = os.path.abspath(os.path.dirname(sys.argv[0]))   
                                                                    files = os.listdir(path)             
                                                                    test = re.compile("test\.py$", re.IGNORECASE)          
                                                                    files = filter(test.search, files)   
                                                                    filenameToModuleName = lambda f: os.path.splitext(f)[0]
                                                                    moduleNames = map(filenameToModuleName, files)         
                                                                    modules = map(__import__, moduleNames)                 
                                                                    load = unittest.defaultTestLoader.loadTestsFromModule  
                                                                    return unittest.TestSuite(map(load, modules))          
                                                                
                                                                if __name__ == "__main__": 
                                                                    unittest.main(defaultTest="regressionTest")
                                                                

                                                                Running this script in the same directory as the rest of the example scripts that come with this book will find all the unit tests, named moduletest.py, run them as a single test, and pass or fail them all at once.

                                                                Example 16.2. Sample output of regression.py

                                                                [you@localhost py]$ python regression.py -v
                                                                help should fail with no object ... ok           
                                                                help should return known result for apihelper ... ok
                                                                help should honor collapse argument ... ok
                                                                help should honor spacing argument ... ok
                                                                buildConnectionString should fail with list input ... ok           
                                                                buildConnectionString should fail with string input ... ok
                                                                buildConnectionString should fail with tuple input ... ok
                                                                buildConnectionString handles empty dictionary ... ok
                                                                buildConnectionString returns known result with known input ... ok
                                                                from_roman should only accept uppercase input ... ok                
                                                                to_roman should always return uppercase ... ok
                                                                from_roman should fail with blank string ... ok
                                                                from_roman should fail with malformed antecedents ... ok
                                                                from_roman should fail with repeated pairs of numerals ... ok
                                                                from_roman should fail with too many repeated numerals ... ok
                                                                from_roman should give known result with known input ... ok
                                                                to_roman should give known result with known input ... ok
                                                                from_roman(to_roman(n))==n for all n ... ok
                                                                to_roman should fail with non-integer input ... ok
                                                                to_roman should fail with negative input ... ok
                                                                to_roman should fail with large input ... ok
                                                                to_roman should fail with 0 input ... ok
                                                                kgp a ref test ... ok
                                                                kgp b ref test ... ok
                                                                kgp c ref test ... ok
                                                                kgp d ref test ... ok
                                                                kgp e ref test ... ok
                                                                kgp f ref test ... ok
                                                                kgp g ref test ... ok
                                                                
                                                                ----------------------------------------------------------------------
                                                                Ran 29 tests in 2.799s
                                                                
                                                                OK
                                                                1. The first 4 tests are from apihelpertest.py, which tests the example script from Chapter 4, The Power Of Introspection.
                                                                2. The next 5 tests are from odbchelpertest.py, which tests the example script from Chapter 2, Your First Python Program.
                                                                3. The next 13 tests are from romantest.py, which you studied in depth in Chapter 13, Unit Testing. The final 7 tests, the ones labeled kgp, come from kgptest.py.

                                                                  16.2. Finding the path

                                                                  When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.

                                                                  This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember once you see it. The key to it is sys.argv. As you saw in Chapter 9, XML Processing, this is a list that holds the command-line arguments. However, it also holds the name of the running script, exactly as it was called from the command line, and this is enough information to determine its location.

                                                                  Example 16.3. fullpath.py


                                                                  
                                                                  import sys, os
                                                                  
                                                                  print 'sys.argv[0] =', sys.argv[0]             
                                                                  pathname = os.path.dirname(sys.argv[0])        
                                                                  print 'path =', pathname
                                                                  print 'full path =', os.path.abspath(pathname) 
                                                                  1. Regardless of how you run a script, sys.argv[0] will always contain the name of the script, exactly as it appears on the command line. This may or may not include any path information, as you'll see shortly.
                                                                  2. os.path.dirname takes a filename as a string and returns the directory path portion. If the given filename does not include any path information, os.path.dirname returns an empty string.
                                                                  3. os.path.abspath is the key here. It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.
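The three steps above can be wrapped in a small function, which makes them easy to try without actually launching a script in different ways. This is a sketch in Python 3 syntax; script_directory is a hypothetical name, and its argument stands in for whatever sys.argv[0] would contain.

```python
import os


def script_directory(argv0):
    """Return the absolute directory for a script name as typed on the command line."""
    return os.path.abspath(os.path.dirname(argv0))


# No path information at all: dirname gives '', abspath gives the current directory.
print(script_directory("fullpath.py"))

# A partial path is resolved against the current directory.
print(script_directory(os.path.join("common", "py", "fullpath.py")))
```

In a real script you would call script_directory(sys.argv[0]).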

                                                                    os.path.abspath deserves further explanation. It is very flexible; it can take any kind of pathname.

                                                                    Example 16.4. Further explanation of os.path.abspath

                                                                    >>> import os
                                                                    >>> os.getcwd()
                                                                    '/home/you'
                                                                    >>> os.path.abspath('')
                                                                    '/home/you'
                                                                    >>> os.path.abspath('.ssh')
                                                                    '/home/you/.ssh'
                                                                    >>> os.path.abspath('/home/you/.ssh')
                                                                    '/home/you/.ssh'
                                                                    >>> os.path.abspath('.ssh/../foo/')
                                                                    '/home/you/foo'
                                                                    1. os.getcwd() returns the current working directory.
                                                                    2. Calling os.path.abspath with an empty string returns the current working directory, same as os.getcwd().
                                                                    3. Calling os.path.abspath with a partial pathname constructs a fully qualified pathname out of it, based on the current working directory.
                                                                    4. Calling os.path.abspath with a full pathname simply returns it.
                                                                    5. os.path.abspath also normalizes the pathname it returns. Note that this example worked even though I don't actually have a 'foo' directory. os.path.abspath never checks your actual disk; this is all just string manipulation.
                                                                      Note: The pathnames and filenames you pass to os.path.abspath do not need to exist.
                                                                      Note: os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without turning it into a full pathname, use os.path.normpath instead.
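To see the difference between the two functions side by side, here is a short sketch in Python 3 syntax. The directory names are placeholders and do not need to exist, since neither function looks at the disk.

```python
import os

relative = os.path.join("bin", os.pardir, "local", "bin")

# normpath simplifies the string but leaves it relative.
print(os.path.normpath(relative))

# abspath simplifies it and also anchors it at the current working directory.
print(os.path.abspath(relative))

# Made-up names are fine, because this is all string manipulation.
print(os.path.abspath(os.path.join("no_such_dir", os.pardir, "also_missing")))
```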

                                                                      Example 16.5. Sample output from fullpath.py

                                                                      [you@localhost py]$ python /home/you/diveintopython3/common/py/fullpath.py 
                                                                      sys.argv[0] = /home/you/diveintopython3/common/py/fullpath.py
                                                                      path = /home/you/diveintopython3/common/py
                                                                      full path = /home/you/diveintopython3/common/py
                                                                      [you@localhost diveintopython3]$ python common/py/fullpath.py               
                                                                      sys.argv[0] = common/py/fullpath.py
                                                                      path = common/py
                                                                      full path = /home/you/diveintopython3/common/py
                                                                      [you@localhost diveintopython3]$ cd common/py
                                                                      [you@localhost py]$ python fullpath.py 
                                                                      sys.argv[0] = fullpath.py
                                                                      path = 
                                                                      full path = /home/you/diveintopython3/common/py
                                                                      1. In the first case, sys.argv[0] includes the full path of the script. You can then use the os.path.dirname function to strip off the script name and return the full directory name, and os.path.abspath simply returns what you give it.
                                                                      2. If the script is run by using a partial pathname, sys.argv[0] will still contain exactly what appears on the command line. os.path.dirname will then give you a partial pathname (relative to the current directory), and os.path.abspath will construct a full pathname from the partial pathname.
                                                                      3. If the script is run from the current directory without giving any path, os.path.dirname will simply return an empty string. Given an empty string, os.path.abspath returns the current directory, which is what you want, since the script was run from the current directory.
                                                                        Note: Like the other functions in the os and os.path modules, os.path.abspath is cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of the os module.
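You can preview how another platform would treat a path without leaving your own machine, because the standard library ships the platform-specific implementations that os.path selects from. A brief sketch in Python 3 syntax, with illustrative paths:

```python
import ntpath      # the Windows flavor of os.path
import posixpath   # the Unix flavor of os.path

# Same operation, different separator conventions.
print(posixpath.dirname("/home/you/py/fullpath.py"))
print(ntpath.dirname(r"c:\py\fullpath.py"))

# With no path information, both return an empty string.
print(repr(posixpath.dirname("fullpath.py")))
print(repr(ntpath.dirname("fullpath.py")))
```

Ordinary scripts should keep using os.path, which picks the right implementation for the platform they are running on.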

                                                                        Addendum. One reader was dissatisfied with this solution and wanted to run all the unit tests in the current directory, not the directory where regression.py is located. He suggested this approach instead:

                                                                        Example 16.6. Running scripts in the current directory

                                                                        import sys, os, re, unittest
                                                                        
                                                                        def regressionTest():
                                                                            path = os.getcwd()       
                                                                            sys.path.append(path)    
                                                                            files = os.listdir(path) 
                                                                        
                                                                        1. Instead of setting path to the directory where the currently running script is located, you set it to the current working directory instead. This will be whatever directory you were in before you ran the script, which is not necessarily the same as the directory the script is in. (Read that sentence a few times until you get it.)
                                                                        2. Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them. You didn't need to do this when path was the directory of the currently running script, because Python always looks in that directory.
                                                                        3. The rest of the function is the same.
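The first two steps of that variation can be isolated into a helper of their own. The sketch below uses Python 3 syntax; use_current_directory is a hypothetical name, and the duplicate check is an addition that the reader's version does not include.

```python
import os
import sys


def use_current_directory():
    """Return the current working directory after making it importable."""
    path = os.getcwd()
    if path not in sys.path:  # added guard: avoid duplicate entries on repeated calls
        sys.path.append(path)
    return path


path = use_current_directory()
print(path in sys.path)
```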

                                                                          This technique will allow you to re-use this regression.py script on multiple projects. Just put the script in a common directory, then change to the project's directory before running it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where regression.py is located.

                                                                          16.6. Dynamically importing modules

                                                                          OK, enough philosophizing. Let's talk about dynamically importing modules.

                                                                          First, let's look at how you normally import modules. The import module syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once this way, with a comma-separated list. You did this on the very first line of this chapter's script.

                                                                          Example 16.13. Importing multiple modules at once

                                                                          
                                                                          import sys, os, re, unittest 
                                                                          
                                                                          1. This imports four modules at once: sys (for system functions and access to the command line parameters), os (for operating system functions like directory listings), re (for regular expressions), and unittest (for unit testing).

                                                                            Now let's do the same thing, but with dynamic imports.

                                                                            Example 16.14. Importing modules dynamically

                                                                            >>> sys = __import__('sys')           
                                                                            >>> os = __import__('os')
                                                                            >>> re = __import__('re')
                                                                            >>> unittest = __import__('unittest')
                                                                            >>> sys
                                                                            <module 'sys' (built-in)>
                                                                            >>> os
                                                                            <module 'os' from '/usr/local/lib/python2.2/os.pyc'>
                                                                            
                                                                            1. The built-in __import__ function accomplishes the same goal as using the import statement, but it's an actual function, and it takes a string as an argument.
                                                                            2. The variable sys is now the sys module, just as if you had said import sys. The variable os is now the os module, and so forth.
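One caveat is worth seeing in code. For plain module names, __import__ and the standard library's importlib.import_module return the same object, but for dotted names they differ. A small sketch in Python 3 syntax:

```python
import importlib

os_module = __import__("os")
same_module = importlib.import_module("os")
print(os_module is same_module)

# For a dotted name, __import__ hands back the top-level module,
# while import_module returns the module that was actually named.
print(__import__("os.path").__name__)
print(importlib.import_module("os.path") is os_module.path)
```

The test modules in this chapter all have plain names, so the distinction does not affect regression.py.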

                                                                              So __import__ imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string, but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.

                                                                              Example 16.15. Importing a list of modules dynamically

                                                                              >>> moduleNames = ['sys', 'os', 're', 'unittest'] 
                                                                              >>> moduleNames
                                                                              ['sys', 'os', 're', 'unittest']
                                                                              >>> modules = map(__import__, moduleNames)        
                                                                              >>> modules   
                                                                              [<module 'sys' (built-in)>,
                                                                              <module 'os' from 'c:\Python22\lib\os.pyc'>,
                                                                              <module 're' from 'c:\Python22\lib\re.pyc'>,
                                                                              <module 'unittest' from 'c:\Python22\lib\unittest.pyc'>]
                                                                              >>> modules[0].version          
                                                                              '2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
                                                                              >>> import sys
                                                                              >>> sys.version
                                                                              '2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
                                                                              
                                                                              1. moduleNames is just a list of strings. Nothing fancy, except that the strings happen to be names of modules that you could import, if you wanted to.
                                                                              2. Surprise, you wanted to import them, and you did, by mapping the __import__ function onto the list. Remember, this takes each element of the list (moduleNames) and calls the function (__import__) over and over, once with each element of the list, builds a list of the return values, and returns the result.
                                                                              3. So now from a list of strings, you've created a list of actual modules. (Your paths may be different, depending on your operating system, where you installed Python, the phase of the moon, etc.)
                                                                              4. To drive home the point that these are real modules, let's look at some module attributes. Remember, modules[0] is the sys module, so modules[0].version is sys.version. All the other attributes and methods of these modules are also available. There's nothing magic about the import statement, and there's nothing magic about modules. Modules are objects. Everything is an object.
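If you try this session on Python 3, one detail changes: map returns a lazy iterator instead of a list, so modules[0] would fail until you wrap the call in list(). A minimal sketch of the adjusted version:

```python
import sys

module_names = ["sys", "os", "re", "unittest"]

# list() forces the lazy map object into a real, indexable list of modules.
modules = list(map(__import__, module_names))

print(modules[0] is sys)
print([module.__name__ for module in modules])
```

A list comprehension, [__import__(name) for name in module_names], would do the same job.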

                                                                                Now you should be able to put this all together and figure out what most of this chapter's code sample is doing.

                                                                                16.7. Putting it all together

                                                                                You've learned enough now to deconstruct the first seven lines of this chapter's code sample: reading a directory and importing selected modules within it.

                                                                                Example 16.16. The regressionTest function

                                                                                
                                                                                def regressionTest():
                                                                                    path = os.path.abspath(os.path.dirname(sys.argv[0]))
                                                                                    files = os.listdir(path)
                                                                                    test = re.compile("test\.py$", re.IGNORECASE)
                                                                                    files = filter(test.search, files)
                                                                                    filenameToModuleName = lambda f: os.path.splitext(f)[0]
                                                                                    moduleNames = map(filenameToModuleName, files)
                                                                                    modules = map(__import__, moduleNames)
                                                                                    load = unittest.defaultTestLoader.loadTestsFromModule
                                                                                    return unittest.TestSuite(map(load, modules))
                                                                                

                                                                                Let's look at it line by line, interactively. Assume that the current directory is c:\diveintopython3\py, which contains the examples that come with this book, including this chapter's script. As you saw in Section 16.2, “Finding the path”, the script directory will end up in the path variable, so let's start by hard-coding that and go from there.

                                                                                Example 16.17. Step 1: Get all the files

                                                                                >>> import sys, os, re, unittest
                                                                                >>> path = r'c:\diveintopython3\py'
                                                                                >>> files = os.listdir(path)             
                                                                                >>> files 
                                                                                ['BaseHTMLProcessor.py', 'LICENSE.txt', 'apihelper.py', 'apihelpertest.py',
                                                                                'argecho.py', 'autosize.py', 'builddialectexamples.py', 'dialect.py',
                                                                                'fileinfo.py', 'fullpath.py', 'kgptest.py', 'makerealworddoc.py',
                                                                                'odbchelper.py', 'odbchelpertest.py', 'parsephone.py', 'piglatin.py',
                                                                                'plural.py', 'pluraltest.py', 'pyfontify.py', 'regression.py', 'roman.py', 'romantest.py',
                                                                                'uncurly.py', 'unicode2koi8r.py', 'urllister.py', 'kgp', 'plural', 'roman',
                                                                                'colorize.py']
                                                                                
                                                                                1. files is a list of all the files and directories in the script's directory. (If you've been running some of the examples already, you may also see some .pyc files in there as well.)
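To experiment with os.listdir without depending on the book's example directory, you can build a throwaway directory first. This is a sketch in Python 3 syntax; the entry names are borrowed from the listing above purely for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as path:
    # Two files and one subdirectory, created only for this demonstration.
    for name in ("roman.py", "romantest.py"):
        open(os.path.join(path, name), "w").close()
    os.mkdir(os.path.join(path, "kgp"))

    # listdir returns files and directories alike, in no guaranteed order.
    entries = sorted(os.listdir(path))
    only_files = [e for e in entries if os.path.isfile(os.path.join(path, e))]

print(entries)
print(only_files)
```

The regression script does not need the isfile check, because its next step keeps only names ending in test.py.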

                                                                                  Example 16.18. Step 2: Filter to find the files you care about

                                                                                  >>> test = re.compile("test\.py$", re.IGNORECASE)           
                                                                                  >>> files = filter(test.search, files)    
                                                                                  >>> files               
                                                                                  ['apihelpertest.py', 'kgptest.py', 'odbchelpertest.py', 'pluraltest.py', 'romantest.py']
                                                                                  
                                                                                  1. This regular expression will match any string that ends with test.py. Note that you need to escape the period, since a period in a regular expression usually means “match any single character”, but you actually want to match a literal period instead.
                                                                                  2. The compiled regular expression acts like a function, so you can use it to filter the large list of files and directories, to find the ones that match the regular expression.
                                                                                  3. And you're left with the list of unit testing scripts, because they were the only ones named SOMETHINGtest.py.
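If you want to try the filtering step outside the interactive session, here is a minimal sketch. The list of filenames is made up (it stands in for the real directory listing), and filter is wrapped in list() as an assumption that you may be running Python 3, where filter returns an iterator instead of a list.

```python
import re

# Hypothetical directory listing; the real one comes from os.listdir().
files = ['apihelper.py', 'apihelpertest.py', 'kgp', 'roman.py',
         'romantest.py', 'RomanTest.PY', 'testdata.txt']

# Match names ending in "test.py", ignoring case. The period is escaped
# so it matches a literal "." instead of any single character.
test = re.compile(r"test\.py$", re.IGNORECASE)

# test.search returns a match object (true) or None (false),
# so it can be passed directly to filter as the filtering function.
test_files = list(filter(test.search, files))
print(test_files)
```

Note that testdata.txt is rejected because the pattern is anchored to the end of the string, while RomanTest.PY is accepted because of re.IGNORECASE.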

                                                                                    Example 16.19. Step 3: Map filenames to module names

                                                                                    >>> filenameToModuleName = lambda f: os.path.splitext(f)[0] 
                                                                                    >>> filenameToModuleName('romantest.py')  
                                                                                    'romantest'
                                                                                    >>> filenameToModuleName('odbchelpertest.py')
                                                                                    'odbchelpertest'
                                                                                    >>> moduleNames = map(filenameToModuleName, files)          
                                                                                    >>> moduleNames         
                                                                                    ['apihelpertest', 'kgptest', 'odbchelpertest', 'pluraltest', 'romantest']
                                                                                    
                                                                                    1. As you saw in Section 4.7, “Using lambda Functions”, lambda is a quick-and-dirty way of creating an inline, one-line function. This one takes a filename with an extension and returns just the filename part, using the standard library function os.path.splitext that you saw in Example 6.17, “Splitting Pathnames”.
                                                                                    2. filenameToModuleName is a function. There's nothing magic about lambda functions as opposed to regular functions that you define with a def statement. You can call the filenameToModuleName function like any other, and it does just what you wanted it to do: strips the file extension off of its argument.
                                                                                    3. Now you can apply this function to each file in the list of unit test files, using map.
                                                                                    4. And the result is just what you wanted: a list of modules, as strings.
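Since there is nothing magic about lambda, the same mapping can be written with a def statement and a list comprehension. This is only a sketch: the function name filename_to_module_name and the short list of files are illustrative, not taken from regression.py.

```python
import os.path

# Hypothetical list of unit test scripts, as left by the filtering step.
files = ['apihelpertest.py', 'kgptest.py', 'romantest.py']

def filename_to_module_name(filename):
    """Strip the file extension: 'romantest.py' becomes 'romantest'."""
    # os.path.splitext returns a (root, extension) tuple; keep the root.
    return os.path.splitext(filename)[0]

# Apply the function to every filename, as map did in the example above.
module_names = [filename_to_module_name(f) for f in files]
print(module_names)
```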

                                                                                      Example 16.20. Step 4: Mapping module names to modules

                                                                                      >>> modules = map(__import__, moduleNames)
                                                                                      >>> modules             
                                                                                      [<module 'apihelpertest' from 'apihelpertest.py'>,
                                                                                      <module 'kgptest' from 'kgptest.py'>,
                                                                                      <module 'odbchelpertest' from 'odbchelpertest.py'>,
                                                                                      <module 'pluraltest' from 'pluraltest.py'>,
                                                                                      <module 'romantest' from 'romantest.py'>]
                                                                                      >>> modules[-1]         
                                                                                      <module 'romantest' from 'romantest.py'>
                                                                                      
                                                                                      1. As you saw in Section 16.6, “Dynamically importing modules”, you can use a combination of map and __import__ to map a list of module names (as strings) into actual modules (which you can call or access like any other module).
                                                                                      2. modules is now a list of modules, fully accessible like any other module.
                                                                                      3. The last module in the list is the romantest module, just as if you had said import romantest.
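You can check this behavior without the book's test scripts on hand. The sketch below substitutes standard library module names for the test module names (an assumption made so the code runs anywhere) and uses a list comprehension so the result is a real list on Python 3 as well.

```python
# Standard library names stand in for names like 'romantest'.
module_names = ['os', 're', 'string']

# __import__ takes a module name as a string and returns the module object.
modules = [__import__(name) for name in module_names]

for module in modules:
    print(module.__name__)

# The last element is the very same object a plain import statement binds.
import string
print(modules[-1] is string)
```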

                                                                                        Example 16.21. Step 5: Loading the modules into a test suite

                                                                                        >>> load = unittest.defaultTestLoader.loadTestsFromModule  
                                                                                        >>> map(load, modules)   
                                                                                        [<unittest.TestSuite tests=[
                                                                                          <unittest.TestSuite tests=[<apihelpertest.BadInput testMethod=testNoObject>]>,
                                                                                          <unittest.TestSuite tests=[<apihelpertest.KnownValues testMethod=testApiHelper>]>,
                                                                                          <unittest.TestSuite tests=[
                                                                                            <apihelpertest.ParamChecks testMethod=testCollapse>, 
                                                                                            <apihelpertest.ParamChecks testMethod=testSpacing>]>, 
                                                                                            ...
                                                                                          ]
                                                                                        ]
                                                                                        >>> unittest.TestSuite(map(load, modules)) 
                                                                                        
                                                                                        1. These are real module objects. Not only can you access them like any other module, instantiate classes and call functions, you can also introspect into the module to figure out which classes and functions it has in the first place. That's what the loadTestsFromModule method does: it introspects into each module and returns a unittest.TestSuite object for each module. Each TestSuite object actually contains a list of TestSuite objects, one for each TestCase class in your module, and each of those TestSuite objects contains a list of tests, one for each test method in your module.
                                                                                        2. Finally, you wrap the list of TestSuite objects into one big test suite. The unittest module has no problem traversing this tree of nested test suites within test suites; eventually it gets down to an individual test method and executes it, verifies that it passes or fails, and moves on to the next one.
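To watch the loader walk a module, you can hand it a module object built on the spot. Everything named here is hypothetical: faketest is a stand-in module created with types.ModuleType, and the two test case classes are toy examples. The sketch assumes Python 3 (it uses io.StringIO to capture the test report).

```python
import io
import types
import unittest

class KnownValues(unittest.TestCase):
    def testLower(self):
        self.assertEqual('ABC'.lower(), 'abc')

    def testUpper(self):
        self.assertEqual('abc'.upper(), 'ABC')

class BadInput(unittest.TestCase):
    def testNotANumber(self):
        self.assertRaises(ValueError, int, 'xyz')

# Hypothetical stand-in for an imported test module such as romantest.
fake_module = types.ModuleType('faketest')
fake_module.KnownValues = KnownValues
fake_module.BadInput = BadInput

# The loader introspects each module and returns one suite per module;
# the outer TestSuite wraps them all into one big suite.
load = unittest.defaultTestLoader.loadTestsFromModule
suite = unittest.TestSuite([load(m) for m in [fake_module]])

# countTestCases() walks the nested suites down to individual test methods.
print(suite.countTestCases())

# Run the suite, sending the report to a string buffer instead of stderr.
runner = unittest.TextTestRunner(stream=io.StringIO())
result = runner.run(suite)
print(result.testsRun, result.wasSuccessful())
```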

                                                                                          This introspection process is what the unittest module usually does for us. Remember that magic-looking unittest.main() function that our individual test modules called to kick the whole thing off? unittest.main() actually creates an instance of unittest.TestProgram, which in turn takes unittest.defaultTestLoader (a ready-made instance of unittest.TestLoader) and loads it up with the module that called it. (How does it get a reference to the module that called it if you don't give it one? By using the equally-magic __import__('__main__') command, which dynamically imports the currently-running module. I could write a book on all the tricks and techniques used in the unittest module, but then I'd never finish this one.)
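The __import__('__main__') trick is easy to verify for yourself. This short sketch only demonstrates the lookup; it makes no claim about how unittest uses the result internally.

```python
import sys

# Importing the name '__main__' returns the module that is currently
# running as the top-level program; it is already in sys.modules.
main_module = __import__('__main__')

print(main_module.__name__)
print(main_module is sys.modules['__main__'])
```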

                                                                                          Example 16.22. Step 6: Telling unittest to use your test suite

                                                                                          
                                                                                          if __name__ == "__main__": 
                                                                                              unittest.main(defaultTest="regressionTest") 
                                                                                          
                                                                                          1. Instead of letting the unittest module do all its magic for us, you've done most of it yourself. You've created a function (regressionTest) that imports the modules yourself, calls unittest.defaultTestLoader yourself, and wraps it all up in a test suite. Now all you need to do is tell unittest that, instead of looking for tests and building a test suite in the usual way, it should just call the regressionTest function, which returns a ready-to-use TestSuite.
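Putting the steps together, a function along the following lines could play the role of regressionTest. Treat it as a sketch under stated assumptions rather than the code in regression.py: the name build_regression_suite and its directory parameter are hypothetical (the parameter lets you point it at any folder), the directory is added to sys.path so that __import__ can find the scripts, and list comprehensions replace filter and map so that it behaves the same on Python 3.

```python
import os
import re
import sys
import tempfile
import unittest

def build_regression_suite(directory):
    """Collect every *test.py script in directory into one TestSuite.

    build_regression_suite and its directory parameter are hypothetical
    names chosen for this sketch.
    """
    # Steps 1 and 2: list the directory and keep only SOMETHINGtest.py.
    test = re.compile(r"test\.py$", re.IGNORECASE)
    files = [f for f in sorted(os.listdir(directory)) if test.search(f)]

    # Step 3: strip the extensions to get module names.
    module_names = [os.path.splitext(f)[0] for f in files]

    # Make the scripts importable, then step 4: import each one by name.
    if directory not in sys.path:
        sys.path.insert(0, directory)
    modules = [__import__(name) for name in module_names]

    # Step 5: load each module's tests and wrap them in one big suite.
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite([load(module) for module in modules])

# An empty temporary directory yields an empty (but valid) suite.
suite = build_regression_suite(tempfile.mkdtemp())
print(suite.countTestCases())
```

A zero-argument wrapper around a function like this is the kind of callable that defaultTest can name: unittest looks the name up in the running module, calls it, and uses the TestSuite it returns.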

                                                                                            16.8. Summary

                                                                                            The regression.py program and its output should now make perfect sense.

                                                                                            You should now feel comfortable doing all of these things:



                                                                                            [7] Technically, the second argument to filter can be any sequence, including lists, tuples, and custom classes that act like lists by defining the __getitem__ special method. If possible, filter will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.

                                                                                            [8] Again, I should point out that map can take a list, a tuple, or any object that acts like a sequence. See previous footnote about filter.

                                                                                            Chapter 18. Performance Tuning

                                                                                            Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it too much.

                                                                                            18.1. Diving in

                                                                                            There are so many pitfalls involved in optimizing your code, it's hard to know where to start.

                                                                                            Let's start here: are you sure you need to do it at all? Is your code really so bad? Is it worth the time to tune it? Over the lifetime of your application, how much time is going to be spent running that code, compared to the time spent waiting for a remote database server, or waiting for user input?

                                                                                            Second, are you sure you're done coding? Premature optimization is like spreading frosting on a half-baked cake. You spend hours or days (or more) optimizing your code for performance, only to discover it doesn't do what you need it to do. That's time down the drain.

                                                                                            This is not to say that code optimization is worthless, but you need to look at the whole system and decide whether it's the best use of your time. Every minute you spend optimizing code is a minute you're not spending adding new features, or writing documentation, or playing with your kids, or writing unit tests.

                                                                                            Oh yes, unit tests. It should go without saying that you need a complete set of unit tests before you begin performance tuning. The last thing you need is to introduce new bugs while fiddling with your algorithms.

                                                                                            With these caveats in place, let's look at some techniques for optimizing Python code. The code in question is an implementation of the Soundex algorithm. Soundex was a method used in the early 20th century for categorizing surnames in the United States census. It grouped similar-sounding names together, so even if a name was misspelled, researchers had a chance of finding it. Soundex is still used today for much the same reason, although of course we use computerized database servers now. Most database servers include a Soundex function.

                                                                                            There are several subtle variations of the Soundex algorithm. This is the one used in this chapter:

                                                                                            1. Keep the first letter of the name as-is.
                                                                                            2. Convert the remaining letters to digits, according to a specific table:
                                                                                              • B, F, P, and V become 1.
                                                                                              • C, G, J, K, Q, S, X, and Z become 2.
                                                                                              • D and T become 3.
                                                                                              • L becomes 4.
                                                                                              • M and N become 5.
                                                                                              • R becomes 6.
                                                                                              • All other letters become 9.
                                                                                            3. Remove consecutive duplicates.
                                                                                            4. Remove all 9s altogether.
                                                                                            5. If the result is shorter than four characters (the first letter plus three digits), pad the result with trailing zeros.
                                                                                            6. If the result is longer than four characters, discard everything after the fourth character.

                                                                                            For example, my name, Pilgrim, becomes P942695. That has no consecutive duplicates, so nothing to do there. Then you remove the 9s, leaving P4265. That's too long, so you discard the excess character, leaving P426.

                                                                                            Another example: Woo becomes W99, which becomes W9, which becomes W, which gets padded with zeros to become W000.

                                                                                            Here's a first attempt at a Soundex function:

                                                                                            Example 18.1. soundex/stage1/soundex1a.py

                                                                                            If you have not already done so, you can download this and other examples used in this book.

                                                                                            
                                                                                            import string, re
                                                                                            
                                                                                            charToSoundex = {"A": "9",
                                                                                                             "B": "1",
                                                                                                             "C": "2",
                                                                                                             "D": "3",
                                                                                                             "E": "9",
                                                                                                             "F": "1",
                                                                                                             "G": "2",
                                                                                                             "H": "9",
                                                                                                             "I": "9",
                                                                                                             "J": "2",
                                                                                                             "K": "2",
                                                                                                             "L": "4",
                                                                                                             "M": "5",
                                                                                                             "N": "5",
                                                                                                             "O": "9",
                                                                                                             "P": "1",
                                                                                                             "Q": "2",
                                                                                                             "R": "6",
                                                                                                             "S": "2",
                                                                                                             "T": "3",
                                                                                                             "U": "9",
                                                                                                             "V": "1",
                                                                                                             "W": "9",
                                                                                                             "X": "2",
                                                                                                             "Y": "9",
                                                                                                             "Z": "2"}
                                                                                            
                                                                                            def soundex(source):
                                                                                                "convert string to Soundex equivalent"
                                                                                            
                                                                                                # Soundex requirements:
                                                                                                # source string must be at least 1 character
                                                                                                # and must consist entirely of letters
                                                                                                allChars = string.uppercase + string.lowercase
                                                                                                if not re.search('^[%s]+$' % allChars, source):
                                                                                                    return "0000"
                                                                                            
                                                                                                # Soundex algorithm:
                                                                                                # 1. make first character uppercase
                                                                                                source = source[0].upper() + source[1:]
                                                                                                
                                                                                                # 2. translate all other characters to Soundex digits
                                                                                                digits = source[0]
                                                                                                for s in source[1:]:
                                                                                                    s = s.upper()
                                                                                                    digits += charToSoundex[s]
                                                                                            
                                                                                                # 3. remove consecutive duplicates
                                                                                                digits2 = digits[0]
                                                                                                for d in digits[1:]:
                                                                                                    if digits2[-1] != d:
                                                                                                        digits2 += d
                                                                                                    
                                                                                                # 4. remove all "9"s
                                                                                                digits3 = re.sub('9', '', digits2)
                                                                                                
                                                                                                # 5. pad end with "0"s to 4 characters
                                                                                                while len(digits3) < 4:
                                                                                                    digits3 += "0"
                                                                                                    
                                                                                                # 6. return first 4 characters
                                                                                                return digits3[:4]
                                                                                            
                                                                                            if __name__ == '__main__':
                                                                                                from timeit import Timer
                                                                                                names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                                                                                for name in names:
                                                                                                    statement = "soundex('%s')" % name
                                                                                                    t = Timer(statement, "from __main__ import soundex")
                                                                                                    print name.ljust(15), soundex(name), min(t.repeat())
                                                                                            


                                                                                            18.2. Using the timeit Module

                                                                                            The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function.

                                                                                            Timing short pieces of code is incredibly complex. How much processor time is your computer devoting to running this code? Are there things running in the background? Are you sure? Every modern computer has background processes running, some all the time, some intermittently. Cron jobs fire off at consistent intervals; background services occasionally “wake up” to do useful things like check for new mail, connect to instant messaging servers, check for application updates, scan for viruses, check whether a disk has been inserted into your CD drive in the last 100 nanoseconds, and so on. Before you start your timing tests, turn everything off and disconnect from the network. Then turn off all the things you forgot to turn off the first time, then turn off the service that's incessantly checking whether the network has come back yet, then ...

                                                                                            And then there's the matter of the variations introduced by the timing framework itself. Does the Python interpreter cache method name lookups? Does it cache code block compilations? Regular expressions? Will your code have side effects if run more than once? Don't forget that you're dealing with small fractions of a second, so small mistakes in your timing framework will irreparably skew your results.

                                                                                            The Python community has a saying: “Python comes with batteries included.” Don't write your own timing framework. Python 2.3 comes with a perfectly good one called timeit.

                                                                                            Example 18.2. Introducing timeit


                                                                                            >>> import timeit
                                                                                            >>> t = timeit.Timer("soundex.soundex('Pilgrim')",
                                                                                            ...    "import soundex")   
                                                                                            >>> t.timeit()              
                                                                                            8.21683733547
                                                                                            >>> t.repeat(3, 2000000)    
                                                                                            [16.48319309109, 16.46128984923, 16.44203948912]
                                                                                            
                                                                                            1. The timeit module defines one class, Timer, which takes two arguments. Both arguments are strings. The first argument is the statement you wish to time; in this case, you are timing a call to the Soundex function within the soundex module with an argument of 'Pilgrim'. The second argument to the Timer class is the import statement that sets up the environment for the statement. Internally, timeit sets up an isolated virtual environment, manually executes the setup statement (importing the soundex module), then manually compiles and executes the timed statement (calling the Soundex function).
                                                                                            2. Once you have the Timer object, the easiest thing to do is call timeit(), which calls your function 1 million times and returns the number of seconds it took to do it.
                                                                                            3. The other major method of the Timer object is repeat(), which takes two optional arguments. The first is the number of times to repeat the entire test, and the second is the number of times to call the timed statement within each test. They default to 3 and 1000000 respectively. The repeat() method returns a list of the times each test cycle took, in seconds.
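If you don't have the soundex module handy, you can still experiment with Timer. In this sketch a built-in string method stands in for the Soundex function (an assumption made so the code is self-contained), and the call counts are passed explicitly and kept small so it finishes quickly.

```python
import timeit

# Both arguments are strings: the statement to time, then the setup code
# that runs once before timing starts.
t = timeit.Timer("name.upper()", "name = 'Pilgrim'")

# One test cycle of 100,000 calls; returns the elapsed seconds as a float.
elapsed = t.timeit(100000)
print(elapsed)

# Three test cycles of 100,000 calls each; returns a list of three floats.
times = t.repeat(3, 100000)
print(times)
```

The numbers you see will differ from run to run and from machine to machine, which is exactly the variability the rest of this section deals with.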

                                                                                              You can use the timeit module on the command line to test an existing Python program, without modifying the code. See http://docs.python.org/lib/node396.html for documentation on the command-line flags.

                                                                                              Note that repeat() returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the Python interpreter is getting (and those pesky background processes that you can't get rid of). Your first thought might be to say “Let's take the average and call that The True Number.”

                                                                                              In fact, that's almost certainly wrong. The tests that took longer didn't take longer because of variations in your code or in the Python interpreter; they took longer because of those pesky background processes, or other factors outside of the Python interpreter that you can't fully eliminate. If the different timing results differ by more than a few percent, you still have too much variability to trust the results. Otherwise, take the minimum time and discard the rest.

                                                                                              Python has a handy min function that takes a list and returns the smallest value:

                                                                                              >>> min(t.repeat(3, 1000000))
                                                                                              8.22203948912
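
The repeat-then-min approach can be sketched as follows. This is a minimal illustration written for Python 3, not code from the chapter: the timed statement and its setup string are made up for the example, and the call count is cut to 1000 so that it finishes quickly.

```python
from timeit import Timer

# Hypothetical statement to time; any small expression would do here.
t = Timer("sorted(data)", setup="data = list(range(100, 0, -1))")

# Run 3 test cycles of 1000 calls each (far fewer calls than the
# 1000000 default, so the sketch finishes quickly).
times = t.repeat(repeat=3, number=1000)

# Keep the fastest cycle; the slower ones mostly reflect outside noise.
best = min(times)
print(len(times), best)
```

The exact numbers depend on your machine, so only the relative spread between the three values is worth looking at.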
                                                                                              

                                                                                              The timeit module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out the hotshot module.

                                                                                              18.3. Optimizing Regular Expressions

                                                                                              The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to do this?

                                                                                              If you answered “regular expressions”, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain. Also for performance reasons.

                                                                                              This code fragment from soundex/stage1/soundex1a.py checks whether the function argument source is a word made entirely of letters, with at least one letter (not the empty string):

                                                                                              
                                                                                                  allChars = string.uppercase + string.lowercase
                                                                                                  if not re.search('^[%s]+$' % allChars, source):
                                                                                                      return "0000"
                                                                                              

                                                                                              How does soundex1a.py perform? For convenience, the __main__ section of the script contains this code that calls the timeit module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for each:

                                                                                              
                                                                                              if __name__ == '__main__':
                                                                                                  from timeit import Timer
                                                                                                  names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                                                                                  for name in names:
                                                                                                      statement = "soundex('%s')" % name
                                                                                                      t = Timer(statement, "from __main__ import soundex")
                                                                                                      print name.ljust(15), soundex(name), min(t.repeat())
                                                                                              

                                                                                              So how does soundex1a.py perform with this regular expression?

                                                                                              C:\samples\soundex\stage1>python soundex1a.py
                                                                                              Woo             W000 19.3356647283
                                                                                              Pilgrim         P426 24.0772053431
                                                                                              Flingjingwaller F452 35.0463220884
                                                                                              

                                                                                              As you might expect, the algorithm takes significantly longer when called with longer names. There will be a few things we can do to narrow that gap (make the function take less relative time for longer input), but the nature of the algorithm dictates that it will never run in constant time.

                                                                                              The other thing to keep in mind is that we are testing a representative sample of names. Woo is a kind of trivial case, in that it gets shortened to a single letter and then padded with zeros. Pilgrim is a normal case, of average length and a mixture of significant and ignored letters. Flingjingwaller is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range of different cases.
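
For readers on Python 3, the stage 1a function and its timing harness might look roughly like this. It is a port under stated assumptions rather than the chapter's own file: string.ascii_uppercase and string.ascii_lowercase stand in for string.uppercase and string.lowercase, the lookup dictionary is built with zip() instead of being written out, the Timer receives its namespace through the globals argument, and the call count is reduced to 1000.

```python
import re
import string
from timeit import Timer

# Same letter-to-digit pattern as the chapter's dictionary, built compactly.
charToSoundex = dict(zip(string.ascii_uppercase, "91239129922455912623919292"))

def soundex(source):
    # Stage 1a input check: a non-empty run of ASCII letters.
    allChars = string.ascii_uppercase + string.ascii_lowercase
    if not re.search('^[%s]+$' % allChars, source):
        return "0000"
    source = source[0].upper() + source[1:]
    digits = source[0]
    for s in source[1:]:
        digits += charToSoundex[s.upper()]
    # Collapse consecutive duplicates, then drop the "9" placeholders.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    # Pad with zeros and truncate to four characters.
    while len(digits3) < 4:
        digits3 += "0"
    return digits3[:4]

if __name__ == '__main__':
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex(%r)" % name
        # globals= hands the timed statement this module's namespace,
        # replacing the "from __main__ import soundex" setup string.
        t = Timer(statement, globals=globals())
        print(name.ljust(15), soundex(name), min(t.repeat(repeat=3, number=1000)))
```

With only 1000 calls per cycle the absolute times will be far smaller than the figures in this chapter; the point is the shape of the harness, not the numbers.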

                                                                                              So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters (A-Z in uppercase, and a-z in lowercase), we can use a shorthand regular expression syntax. Here is soundex/stage1/soundex1b.py:

                                                                                              
                                                                                                  if not re.search('^[A-Za-z]+$', source):
                                                                                                      return "0000"
                                                                                              

                                                                                              timeit says soundex1b.py is slightly faster than soundex1a.py, but nothing to get terribly excited about:

                                                                                              C:\samples\soundex\stage1>python soundex1b.py
                                                                                              Woo             W000 17.1361133887
                                                                                              Pilgrim         P426 21.8201693232
                                                                                              Flingjingwaller F452 32.7262294509
                                                                                              

                                                                                              We saw in Section 15.3, “Refactoring” that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across function calls, we can compile it once and use the compiled version. Here is soundex/stage1/soundex1c.py:

                                                                                              
                                                                                              isOnlyChars = re.compile('^[A-Za-z]+$').search
                                                                                              def soundex(source):
                                                                                                  if not isOnlyChars(source):
                                                                                                      return "0000"
                                                                                              

                                                                                              Using a compiled regular expression in soundex1c.py is significantly faster:

                                                                                              C:\samples\soundex\stage1>python soundex1c.py
                                                                                              Woo             W000 14.5348347346
                                                                                              Pilgrim         P426 19.2784703084
                                                                                              Flingjingwaller F452 30.0893873383
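
The compile-once idea can be isolated in a few lines. This is a Python 3 sketch; the two wrapper function names are hypothetical and exist only so the variants can be compared side by side.

```python
import re

# Compile the pattern once, at import time, and keep only the bound
# search method of the resulting pattern object.
isOnlyChars = re.compile('^[A-Za-z]+$').search

def check_compiled(source):
    # Hypothetical wrapper: calls the precompiled pattern directly.
    return bool(isOnlyChars(source))

def check_uncompiled(source):
    # Hypothetical wrapper: re.search() has to find the compiled form of
    # the pattern (via the re module's internal cache) on every call.
    return bool(re.search('^[A-Za-z]+$', source))

for s in ('Woo', '', 'Pilgrim1', 'abc def'):
    print(repr(s), check_compiled(s), check_uncompiled(s))
```

Both variants give the same answers; the compiled one simply skips the per-call pattern lookup.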
                                                                                              

                                                                                              But is this the wrong path? The logic here is simple: the input source needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each character, and do away with regular expressions altogether?

                                                                                              Here is soundex/stage1/soundex1d.py:

                                                                                              
                                                                                                  if not source:
                                                                                                      return "0000"
                                                                                                  for c in source:
                                                                                                      if not ('A' <= c <= 'Z') and not ('a' <= c <= 'z'):
                                                                                                          return "0000"
                                                                                              

                                                                                              It turns out that this technique in soundex1d.py is not faster than using a compiled regular expression (and, judging by these timings, it beats the non-compiled regular expression only for the shortest name):

                                                                                              C:\samples\soundex\stage1>python soundex1d.py
                                                                                              Woo             W000 15.4065058548
                                                                                              Pilgrim         P426 22.2753567842
                                                                                              Flingjingwaller F452 37.5845122774
                                                                                              

                                                                                              Why isn't soundex1d.py faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this loop is written in Python, and runs through the Python interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted. Regular expressions are never the right answer... except when they are.
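
To try the comparison yourself, something like the following Python 3 sketch could be used. The function names are hypothetical, the call count is arbitrary, and the relative speeds you see may differ from the figures above depending on the interpreter version and the machine.

```python
import re
from timeit import timeit

isOnlyChars = re.compile('^[A-Za-z]+$').search

def letters_regex(source):
    # The check runs inside the regular expression engine, written in C.
    return bool(isOnlyChars(source))

def letters_loop(source):
    # The same check as a loop executed by the Python interpreter.
    if not source:
        return False
    for c in source:
        if not ('A' <= c <= 'Z') and not ('a' <= c <= 'z'):
            return False
    return True

samples = ['Woo', 'Pilgrim', 'Flingjingwaller', '', 'R2D2']
agree = all(letters_regex(s) == letters_loop(s) for s in samples)
print("same answers:", agree)

# Time each variant on the longest sample name.
for func in (letters_regex, letters_loop):
    elapsed = timeit(lambda: func('Flingjingwaller'), number=10000)
    print(func.__name__.ljust(15), elapsed)
```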

                                                                                              It turns out that Python offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book. The method is called isalpha(), and it checks whether a string contains only letters.
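
A quick way to see how isalpha() behaves on the edge cases, sketched for Python 3. The helper name should_reject is made up for this illustration.

```python
def should_reject(source):
    # Hypothetical helper: True means soundex() would return "0000".
    # The two conditions are joined with "or", so either one alone is
    # enough to reject the input. Joined with "and", a string such as
    # "R2D2" would slip through, because it is non-empty.
    return (not source) or (not source.isalpha())

checks = {s: should_reject(s) for s in ("Woo", "", "R2D2", "Pil grim")}
print(checks)

# isalpha() already returns False for the empty string.
print("".isalpha())

# Python 3 caveat: isalpha() is Unicode-aware, so it also accepts letters
# outside A-Z and a-z, which an A-Z lookup table would not cover.
print("\u00e9".isalpha())
```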

                                                                                              This is soundex/stage1/soundex1e.py:

                                                                                              
                                                                                                  if (not source) or (not source.isalpha()):
                                                                                                      return "0000"
                                                                                              

                                                                                              How much did we gain by using this specific method in soundex1e.py? Quite a bit.

                                                                                              C:\samples\soundex\stage1>python soundex1e.py
                                                                                              Woo             W000 13.5069504644
                                                                                              Pilgrim         P426 18.2199394057
                                                                                              Flingjingwaller F452 28.9975225902
                                                                                              

                                                                                              Example 18.3. Best Result So Far: soundex/stage1/soundex1e.py

                                                                                              
                                                                                              import string, re
                                                                                              
                                                                                              charToSoundex = {"A": "9",
                                                                                                               "B": "1",
                                                                                                               "C": "2",
                                                                                                               "D": "3",
                                                                                                               "E": "9",
                                                                                                               "F": "1",
                                                                                                               "G": "2",
                                                                                                               "H": "9",
                                                                                                               "I": "9",
                                                                                                               "J": "2",
                                                                                                               "K": "2",
                                                                                                               "L": "4",
                                                                                                               "M": "5",
                                                                                                               "N": "5",
                                                                                                               "O": "9",
                                                                                                               "P": "1",
                                                                                                               "Q": "2",
                                                                                                               "R": "6",
                                                                                                               "S": "2",
                                                                                                               "T": "3",
                                                                                                               "U": "9",
                                                                                                               "V": "1",
                                                                                                               "W": "9",
                                                                                                               "X": "2",
                                                                                                               "Y": "9",
                                                                                                               "Z": "2"}
                                                                                              
                                                                                              def soundex(source):
                                                                                                  if (not source) or (not source.isalpha()):
                                                                                                      return "0000"
                                                                                                  source = source[0].upper() + source[1:]
                                                                                                  digits = source[0]
                                                                                                  for s in source[1:]:
                                                                                                      s = s.upper()
                                                                                                      digits += charToSoundex[s]
                                                                                                  digits2 = digits[0]
                                                                                                  for d in digits[1:]:
                                                                                                      if digits2[-1] != d:
                                                                                                          digits2 += d
                                                                                                  digits3 = re.sub('9', '', digits2)
                                                                                                  while len(digits3) < 4:
                                                                                                      digits3 += "0"
                                                                                                  return digits3[:4]
                                                                                              
                                                                                              if __name__ == '__main__':
                                                                                                  from timeit import Timer
                                                                                                  names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                                                                                  for name in names:
                                                                                                      statement = "soundex('%s')" % name
                                                                                                      t = Timer(statement, "from __main__ import soundex")
                                                                                                      print name.ljust(15), soundex(name), min(t.repeat())
                                                                                              

                                                                                              18.4. Optimizing Dictionary Lookups

                                                                                              The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to do this?

                                                                                              The most obvious solution is to define a dictionary with individual characters as keys and their corresponding digits as values, and do dictionary lookups on each character. This is what we have in soundex/stage1/soundex1c.py (the compiled regular expression version from the previous section):

                                                                                              
                                                                                              charToSoundex = {"A": "9",
                                                                                                               "B": "1",
                                                                                                               "C": "2",
                                                                                                               "D": "3",
                                                                                                               "E": "9",
                                                                                                               "F": "1",
                                                                                                               "G": "2",
                                                                                                               "H": "9",
                                                                                                               "I": "9",
                                                                                                               "J": "2",
                                                                                                               "K": "2",
                                                                                                               "L": "4",
                                                                                                               "M": "5",
                                                                                                               "N": "5",
                                                                                                               "O": "9",
                                                                                                               "P": "1",
                                                                                                               "Q": "2",
                                                                                                               "R": "6",
                                                                                                               "S": "2",
                                                                                                               "T": "3",
                                                                                                               "U": "9",
                                                                                                               "V": "1",
                                                                                                               "W": "9",
                                                                                                               "X": "2",
                                                                                                               "Y": "9",
                                                                                                               "Z": "2"}
                                                                                              
                                                                                              def soundex(source):
                                                                                                  # ... input check omitted for brevity ...
                                                                                                  source = source[0].upper() + source[1:]
                                                                                                  digits = source[0]
                                                                                                  for s in source[1:]:
                                                                                                      s = s.upper()
                                                                                                      digits += charToSoundex[s]
                                                                                              

                                                                                              You timed soundex1c.py already; this is how it performs:

                                                                                              C:\samples\soundex\stage1>python soundex1c.py
                                                                                              Woo             W000 14.5341678901
                                                                                              Pilgrim         P426 19.2650071448
                                                                                              Flingjingwaller F452 30.1003563302
                                                                                              

                                                                                              This code is straightforward, but is it the best solution? Calling upper() on each individual character seems inefficient; it would probably be better to call upper() once on the entire string.

                                                                                              Then there's the matter of incrementally building the digits string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one.

                                                                                              Python is good at lists, though. It can treat a string as a list of characters automatically. And lists are easy to combine into strings again, using the string method join().
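
The two ways of building the digits string can be put side by side. This is a Python 3 sketch with hypothetical function names; the lookup table is built with zip() rather than written out longhand.

```python
charToSoundex = dict(zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                         "91239129922455912623919292"))

def digits_incremental(source):
    # Grow the string one character at a time; each += builds a new string.
    digits = source[0].upper()
    for s in source[1:]:
        digits += charToSoundex[s.upper()]
    return digits

def digits_joined(source):
    # Upper-case once, collect the pieces in a list, and join at the end.
    source = source.upper()
    parts = [source[0]]
    for c in source[1:]:
        parts.append(charToSoundex[c])
    return "".join(parts)

print(digits_incremental("Pilgrim"), digits_joined("Pilgrim"))
```

Both functions return the same intermediate string; they differ only in how many temporary strings are created along the way.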

                                                                                              Here is soundex/stage2/soundex2a.py, which converts letters to digits by using map and lambda:

                                                                                              
                                                                                              def soundex(source):
                                                                                                  # ...
                                                                                                  source = source.upper()
                                                                                                  digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
                                                                                              

                                                                                              Surprisingly, soundex2a.py is not faster:

                                                                                              C:\samples\soundex\stage2>python soundex2a.py
                                                                                              Woo             W000 15.0097526362
                                                                                              Pilgrim         P426 19.254806407
                                                                                              Flingjingwaller F452 29.3790847719
                                                                                              

                                                                                              The overhead of the anonymous lambda function kills any performance you gain by dealing with the string as a list of characters.

                                                                                              soundex/stage2/soundex2b.py uses a list comprehension instead of map and lambda:

                                                                                              
                                                                                                  source = source.upper()
                                                                                                  digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
                                                                                              

                                                                                              Using a list comprehension in soundex2b.py is faster than using map and lambda in soundex2a.py, and by the timings below it is also somewhat faster than the original code (incrementally building a string in soundex1c.py):

                                                                                              C:\samples\soundex\stage2>python soundex2b.py
                                                                                              Woo             W000 13.4221324219
                                                                                              Pilgrim         P426 16.4901234654
                                                                                              Flingjingwaller F452 25.8186157738
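
The map-with-lambda and list comprehension variants produce identical strings, which a short Python 3 sketch can confirm. The function names are hypothetical, and the third variant, which passes the dictionary's own lookup method to map, is an addition for comparison and does not come from the chapter's files.

```python
charToSoundex = dict(zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                         "91239129922455912623919292"))

def digits_map(source):
    # One Python-level function call (the lambda) per character.
    source = source.upper()
    return source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))

def digits_listcomp(source):
    # The lookup happens inline, with no extra function call.
    source = source.upper()
    return source[0] + "".join([charToSoundex[c] for c in source[1:]])

def digits_map_getitem(source):
    # Assumption for comparison only: map over the dictionary's bound
    # __getitem__ method, which avoids the lambda altogether.
    source = source.upper()
    return source[0] + "".join(map(charToSoundex.__getitem__, source[1:]))

for f in (digits_map, digits_listcomp, digits_map_getitem):
    print(f.__name__.ljust(20), f("Flingjingwaller"))
```

Which variant runs fastest depends on the interpreter version, so it is worth timing them with timeit on your own setup.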
                                                                                              

                                                                                              It's time for a radically different approach. Dictionary lookups are a general purpose tool. Dictionary keys can be any length string (or many other data types), but in this case we are only dealing with single-character keys and single-character values. It turns out that Python has a specialized function for handling exactly this situation: the string.maketrans function.

                                                                                              This is soundex/stage2/soundex2c.py:

                                                                                              
                                                                                              allChar = string.uppercase + string.lowercase
                                                                                              charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
                                                                                              def soundex(source):
                                                                                                  # ...
                                                                                                  digits = source[0].upper() + source[1:].translate(charToSoundex)
                                                                                              

                                                                                              What the heck is going on here? string.maketrans creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument is the string ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz, and the second argument is the string 9123912992245591262391929291239129922455912623919292. See the pattern? It's the same conversion pattern we were setting up longhand with a dictionary. A maps to 9, B maps to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized data structure that you can access using the string method translate, which translates each character into the corresponding digit, according to the matrix defined by string.maketrans.
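
In Python 3 the same idea is spelled a little differently, so here is a hedged sketch of the equivalent code. It assumes str.maketrans in place of string.maketrans and string.ascii_uppercase / string.ascii_lowercase in place of string.uppercase / string.lowercase; the function name to_digits is made up for the example.

```python
import string

allChar = string.ascii_uppercase + string.ascii_lowercase
# Maps each letter (upper- or lowercase) to its Soundex digit.
charToSoundex = str.maketrans(allChar, "91239129922455912623919292" * 2)

def to_digits(source):
    # Keep the first letter, translate every remaining character in one call.
    return source[0].upper() + source[1:].translate(charToSoundex)

print(to_digits("Pilgrim"))
# In Python 3 the table is a dict keyed by character code points.
print(charToSoundex[ord("A")] == ord("9"))
```

Because the table covers both cases, no separate call to upper() is needed for the characters after the first.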

                                                                                              timeit shows that soundex2c.py is significantly faster than defining a dictionary and looping through the input and building the output incrementally:

                                                                                              C:\samples\soundex\stage2>python soundex2c.py
                                                                                              Woo             W000 11.437645008
                                                                                              Pilgrim         P426 13.2825062962
                                                                                              Flingjingwaller F452 18.5570110168
                                                                                              

                                                                                              You're not going to get much better than that. Python has a specialized function that does exactly what you want to do; use it and move on.

                                                                                              Example 18.4. Best Result So Far: soundex/stage2/soundex2c.py

                                                                                              
                                                                                              import string, re
                                                                                              
                                                                                              allChar = string.uppercase + string.lowercase
                                                                                              charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
                                                                                              isOnlyChars = re.compile('^[A-Za-z]+$').search
                                                                                              
                                                                                              def soundex(source):
                                                                                                  if not isOnlyChars(source):
                                                                                                      return "0000"
                                                                                                  digits = source[0].upper() + source[1:].translate(charToSoundex)
                                                                                                  digits2 = digits[0]
                                                                                                  for d in digits[1:]:
                                                                                                      if digits2[-1] != d:
                                                                                                          digits2 += d
                                                                                                  digits3 = re.sub('9', '', digits2)
                                                                                                  while len(digits3) < 4:
                                                                                                      digits3 += "0"
                                                                                                  return digits3[:4]
                                                                                              
                                                                                              if __name__ == '__main__':
                                                                                                  from timeit import Timer
                                                                                                  names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                                                                                  for name in names:
                                                                                                      statement = "soundex('%s')" % name
                                                                                                      t = Timer(statement, "from __main__ import soundex")
                                                                                                      print name.ljust(15), soundex(name), min(t.repeat())
                                                                                              

                                                                                              18.5. Optimizing List Operations

                                                                                              The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this?

                                                                                              Here's the code we have so far, in soundex/stage2/soundex2c.py:

                                                                                              
                                                                                                  digits2 = digits[0]
                                                                                                  for d in digits[1:]:
                                                                                                      if digits2[-1] != d:
                                                                                                          digits2 += d
                                                                                              

                                                                                              Here are the performance results for soundex2c.py:

                                                                                              C:\samples\soundex\stage2>python soundex2c.py
                                                                                              Woo             W000 12.6070768771
                                                                                              Pilgrim         P426 14.4033353401
                                                                                              Flingjingwaller F452 19.7774882003
                                                                                              

The first thing to consider is whether it's efficient to check digits2[-1] each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate variable, and checking that instead?

                                                                                              To answer this question, here is soundex/stage3/soundex3a.py:

                                                                                              
                                                                                                  digits2 = ''
                                                                                                  last_digit = ''
                                                                                                  for d in digits:
                                                                                                      if d != last_digit:
                                                                                                          digits2 += d
                                                                                                          last_digit = d
                                                                                              

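soundex3a.py does not run any faster than soundex2c.py, and may even be slightly slower. Timings vary from run to run, so compare these numbers with the best soundex2c.py run shown in the previous section (11.44, 13.28, and 18.56 seconds); the difference is not enough to say for sure: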

                                                                                              C:\samples\soundex\stage3>python soundex3a.py
                                                                                              Woo             W000 11.5346048171
                                                                                              Pilgrim         P426 13.3950636184
                                                                                              Flingjingwaller F452 18.6108927252
                                                                                              

Why isn't soundex3a.py faster? It turns out that index lookups in Python are extremely efficient, whether the sequence is a list or, as with digits2 here, a string. Repeatedly accessing digits2[-1] is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have two variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating the index lookup.
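If you want to repeat this comparison on your own machine, a small harness along these lines may help. It is only a sketch, written for Python 3: the function names dedupe_by_index and dedupe_by_variable are hypothetical wrappers around the two loops shown above, and the sample string is what 'Flingjingwaller' looks like after translation.

```python
from timeit import timeit

def dedupe_by_index(digits):
    # soundex2c.py approach: compare against the last character kept so far.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    return digits2

def dedupe_by_variable(digits):
    # soundex3a.py approach: remember the last character in its own variable.
    digits2 = ''
    last_digit = ''
    for d in digits:
        if d != last_digit:
            digits2 += d
            last_digit = d
    return digits2

sample = "F49522952994496"  # 'F' + translated 'lingjingwaller'
assert dedupe_by_index(sample) == dedupe_by_variable(sample)

# Time each variant; the absolute numbers depend entirely on your machine.
for func in (dedupe_by_index, dedupe_by_variable):
    seconds = timeit(lambda: func(sample), number=100000)
    print(func.__name__.ljust(20), seconds)
```

Checking that both variants return the same string before timing them is cheap insurance; a faster function that gives a different answer is not an optimization.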

                                                                                              Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character in the list, and that's not easy to do with a straightforward list comprehension.

                                                                                              However, it is possible to create a list of index numbers using the built-in range() function, and use those index numbers to progressively search through the list and pull out each character that is different from the previous character. That will give you a list of characters, and you can use the string method join() to reconstruct a string from that.

                                                                                              Here is soundex/stage3/soundex3b.py:

                                                                                              
                                                                                                  digits2 = "".join([digits[i] for i in range(len(digits))
                                                                                                   if i == 0 or digits[i-1] != digits[i]])
                                                                                              

                                                                                              Is this faster? In a word, no.

                                                                                              C:\samples\soundex\stage3>python soundex3b.py
                                                                                              Woo             W000 14.2245271396
                                                                                              Pilgrim         P426 17.8337165757
                                                                                              Flingjingwaller F452 25.9954005327
                                                                                              

It's possible that the techniques so far have been too “string-centric”. Python can convert a string into a list of characters with a single command: list('abc') returns ['a', 'b', 'c']. Furthermore, lists can be modified in place very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around within a single list?

                                                                                              Here is soundex/stage3/soundex3c.py, which modifies a list in place to remove consecutive duplicate elements:

                                                                                              
                                                                                                  digits = list(source[0].upper() + source[1:].translate(charToSoundex))
                                                                                                  i=0
                                                                                                  for item in digits:
                                                                                                      if item==digits[i]: continue
                                                                                                      i+=1
                                                                                                      digits[i]=item
                                                                                                  del digits[i+1:]
                                                                                                  digits2 = "".join(digits)
                                                                                              

Is this faster than soundex3a.py or soundex3b.py? It beats soundex3b.py, but it is still slower than soundex3a.py and the original soundex2c.py:

                                                                                              C:\samples\soundex\stage3>python soundex3c.py
                                                                                              Woo             W000 14.1662554878
                                                                                              Pilgrim         P426 16.0397885765
                                                                                              Flingjingwaller F452 22.1789341942
                                                                                              

                                                                                              We haven't made any progress here at all, except to try and rule out several “clever” techniques. The fastest code we've seen so far was the original, most straightforward method (soundex2c.py). Sometimes it doesn't pay to be clever.
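The in-place version is the hardest of these to follow, so here is a small sketch that isolates it and checks it against the list comprehension. It assumes Python 3; the function names dedupe_in_place and dedupe_comprehension are hypothetical, and the input "W9922" is a made-up string chosen to show two runs of duplicates.

```python
def dedupe_in_place(chars):
    # Sketch of the soundex3c.py idea: compact a list of characters in place.
    i = 0                    # index of the last character we decided to keep
    for item in chars:
        if item == chars[i]:
            continue         # same as the last kept character, so skip it
        i += 1
        chars[i] = item      # move the new character into the next free slot
    del chars[i + 1:]        # throw away the leftover tail
    return chars

def dedupe_comprehension(digits):
    # The soundex3b.py idea: keep each character that differs from its neighbor.
    return "".join([digits[i] for i in range(len(digits))
                    if i == 0 or digits[i - 1] != digits[i]])

digits = list("W9922")
print("".join(dedupe_in_place(digits)))   # consecutive duplicates collapsed
print(dedupe_comprehension("W9922"))      # same result, different technique
```

Writing to chars[i] while iterating is safe here only because i never runs ahead of the loop's own position, which is exactly the kind of subtlety that makes "clever" code harder to maintain.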

                                                                                              Example 18.5. Best Result So Far: soundex/stage2/soundex2c.py

                                                                                              
                                                                                              import string, re
                                                                                              
                                                                                              allChar = string.uppercase + string.lowercase
                                                                                              charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
                                                                                              isOnlyChars = re.compile('^[A-Za-z]+$').search
                                                                                              
                                                                                              def soundex(source):
                                                                                                  if not isOnlyChars(source):
                                                                                                      return "0000"
                                                                                                  digits = source[0].upper() + source[1:].translate(charToSoundex)
                                                                                                  digits2 = digits[0]
                                                                                                  for d in digits[1:]:
                                                                                                      if digits2[-1] != d:
                                                                                                          digits2 += d
                                                                                                  digits3 = re.sub('9', '', digits2)
                                                                                                  while len(digits3) < 4:
                                                                                                      digits3 += "0"
                                                                                                  return digits3[:4]
                                                                                              
                                                                                              if __name__ == '__main__':
                                                                                                  from timeit import Timer
                                                                                                  names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                                                                                  for name in names:
                                                                                                      statement = "soundex('%s')" % name
                                                                                                      t = Timer(statement, "from __main__ import soundex")
                                                                                                      print name.ljust(15), soundex(name), min(t.repeat())
                                                                                              

                                                                                              18.6. Optimizing String Manipulation

                                                                                              The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best way to do this?

                                                                                              This is what we have so far, taken from soundex/stage2/soundex2c.py:

                                                                                              
                                                                                                  digits3 = re.sub('9', '', digits2)
                                                                                                  while len(digits3) < 4:
                                                                                                      digits3 += "0"
                                                                                                  return digits3[:4]
                                                                                              

                                                                                              These are the results for soundex2c.py:

                                                                                              C:\samples\soundex\stage2>python soundex2c.py
                                                                                              Woo             W000 12.6070768771
                                                                                              Pilgrim         P426 14.4033353401
                                                                                              Flingjingwaller F452 19.7774882003
                                                                                              

                                                                                              The first thing to consider is replacing that regular expression with a loop. This code is from soundex/stage4/soundex4a.py:

                                                                                              
                                                                                                  digits3 = ''
                                                                                                  for d in digits2:
                                                                                                      if d != '9':
                                                                                                          digits3 += d
                                                                                              

Is soundex4a.py faster? Yes, it is:

                                                                                              C:\samples\soundex\stage4>python soundex4a.py
                                                                                              Woo             W000 6.62865531792
                                                                                              Pilgrim         P426 9.02247576158
                                                                                              Flingjingwaller F452 13.6328416042
                                                                                              

                                                                                              But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's soundex/stage4/soundex4b.py:

                                                                                              
                                                                                                  digits3 = digits2.replace('9', '')
                                                                                              

                                                                                              Is soundex4b.py faster? That's an interesting question. It depends on the input:

                                                                                              C:\samples\soundex\stage4>python soundex4b.py
                                                                                              Woo             W000 6.75477414029
                                                                                              Pilgrim         P426 7.56652144337
                                                                                              Flingjingwaller F452 10.8727729362
                                                                                              

                                                                                              The string method in soundex4b.py is faster than the loop for most names, but it's actually slightly slower than soundex4a.py in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case faster can sometimes make other cases slower. In this case, the majority of cases will benefit from the change, so let's leave it at that, but the principle is an important one to remember.
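To check the input dependence for yourself, you could time the three ways of removing the 9s side by side. The sketch below assumes Python 3; the strip_with_* function names are hypothetical, and the sample strings are the deduplicated forms of 'Woo', 'Pilgrim', and 'Flingjingwaller'.

```python
import re
from timeit import timeit

def strip_with_regex(digits2):
    # soundex2c.py: regular expression substitution
    return re.sub('9', '', digits2)

def strip_with_loop(digits2):
    # soundex4a.py: build the result one character at a time
    digits3 = ''
    for d in digits2:
        if d != '9':
            digits3 += d
    return digits3

def strip_with_replace(digits2):
    # soundex4b.py: a single string method call
    return digits2.replace('9', '')

variants = (strip_with_regex, strip_with_loop, strip_with_replace)
for sample in ("W9", "P942695", "F49529529496"):
    # All three must agree before their timings mean anything.
    assert len({func(sample) for func in variants}) == 1
    for func in variants:
        seconds = timeit(lambda: func(sample), number=50000)
        print(sample.ljust(14), func.__name__.ljust(20), seconds)
```

Timing several inputs of different lengths, rather than one, is what reveals that an optimization helps some cases and not others.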

                                                                                              Last but not least, let's examine the final two steps of the algorithm: padding short results with zeros, and truncating long results to four characters. The code you see in soundex4b.py does just that, but it's horribly inefficient. Take a look at soundex/stage4/soundex4c.py to see why:

                                                                                              
                                                                                                  digits3 += '000'
                                                                                                  return digits3[:4]
                                                                                              

                                                                                              Why do we need a while loop to pad out the result? We know in advance that we're going to truncate the result to four characters, and we know that we already have at least one character (the initial letter, which is passed unchanged from the original source variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution.
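The reasoning above can be checked directly. This is a minimal sketch, assuming Python 3; pad_with_loop and pad_with_concat are hypothetical names for the two approaches, and the sample inputs are made up to cover every length from one character up to a string that needs truncating.

```python
def pad_with_loop(digits3):
    # The original approach: add zeros one at a time, then truncate.
    while len(digits3) < 4:
        digits3 += "0"
    return digits3[:4]

def pad_with_concat(digits3):
    # The soundex4c.py approach: always add three zeros, then truncate.
    return (digits3 + '000')[:4]

for digits3 in ("W", "P4", "P42", "P426", "F4525246"):
    assert pad_with_loop(digits3) == pad_with_concat(digits3)
    print(digits3.ljust(10), pad_with_concat(digits3))
```

Note that the shortcut depends on the guarantee that at least one character is present: for an empty string the loop would produce '0000' but the concatenation would produce only '000'. The function never reaches this code with an empty string, so the simplification is safe here.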

                                                                                              How much speed do we gain in soundex4c.py by dropping the while loop? It's significant:

                                                                                              C:\samples\soundex\stage4>python soundex4c.py
                                                                                              Woo             W000 4.89129791636
                                                                                              Pilgrim         P426 7.30642134685
                                                                                              Flingjingwaller F452 10.689832367
                                                                                              

                                                                                              Finally, there is still one more thing you can do to these three lines of code to make them faster: you can combine them into one line. Take a look at soundex/stage4/soundex4d.py:

                                                                                              
                                                                                                  return (digits2.replace('9', '') + '000')[:4]
                                                                                              

Putting all this code on one line in soundex4d.py is barely faster than soundex4c.py, and for the shortest name it is not faster at all:

                                                                                              C:\samples\soundex\stage4>python soundex4d.py
                                                                                              Woo             W000 4.93624105857
                                                                                              Pilgrim         P426 7.19747593619
                                                                                              Flingjingwaller F452 10.5490700634
                                                                                              

                                                                                              It is also significantly less readable, and for not much performance gain. Is that worth it? I hope you have good comments. Performance isn't everything. Your optimization efforts must always be balanced against threats to your program's readability and maintainability.
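For reference, here is how the pieces from this chapter might fit together. This is a sketch, not the actual contents of soundex4d.py, which is not shown in full here: it assumes Python 3 (str.maketrans, string.ascii_uppercase, print as a function), uses hypothetical snake_case names, and simply combines the best result from each section.

```python
import re
import string

# Assumption: Python 3 equivalents of string.uppercase and string.maketrans.
all_char = string.ascii_uppercase + string.ascii_lowercase
char_to_soundex = str.maketrans(all_char, "91239129922455912623919292" * 2)
is_only_chars = re.compile('^[A-Za-z]+$').search

def soundex(source):
    # Reject anything that is not purely alphabetic.
    if not is_only_chars(source):
        return "0000"
    # Keep the first letter, translate the rest to digits in one call.
    digits = source[0].upper() + source[1:].translate(char_to_soundex)
    # Remove consecutive duplicates with the plain loop from soundex2c.py.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # Drop the 9s, pad with zeros, and truncate to four characters.
    return (digits2.replace('9', '') + '000')[:4]

for name in ('Woo', 'Pilgrim', 'Flingjingwaller'):
    print(name.ljust(15), soundex(name))
```

If the one-line return bothers you, splitting it back into the three lines from soundex4c.py costs almost nothing in speed, as the timings above suggest.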

                                                                                              18.7. Summary

                                                                                              This chapter has illustrated several important aspects of performance tuning in Python, and performance tuning in general.

                                                                                              • If you need to choose between regular expressions and writing a loop, choose regular expressions. The regular expression engine is compiled in C and runs natively on your computer; your loop is written in Python and runs through the Python interpreter.
                                                                                              • If you need to choose between regular expressions and string methods, choose string methods. Both are compiled in C, so choose the simpler one.
• General-purpose dictionary lookups are fast, but specialty functions such as string.maketrans and string methods such as isalpha() are faster. If Python has a custom-tailored function for you, use it.
                                                                                              • Don't be too clever. Sometimes the most obvious algorithm is also the fastest.
                                                                                              • Don't sweat it too much. Performance isn't everything.

                                                                                              I can't emphasize that last point strongly enough. Over the course of this chapter, you made this function three times faster and saved 20 seconds over 1 million function calls. Great. Now think: over the course of those million function calls, how many seconds will your surrounding application wait for a database connection? Or wait for disk I/O? Or wait for user input? Don't spend too much time over-optimizing one algorithm, or you'll ignore obvious improvements somewhere else. Develop an instinct for the sort of code that Python runs well, correct obvious blunders if you find them, and leave the rest alone.