diff --git a/comprehensions.html b/comprehensions.html
index 92d9b98..fa30792 100644
--- a/comprehensions.html
+++ b/comprehensions.html
@@ -20,30 +20,203 @@ body{counter-reset:h1 3}

 

Diving In

-

FIXME +

This chapter will teach you about list comprehensions, dictionary comprehensions, and set comprehensions: three related concepts centered around one very powerful technique. But first, I want to take a little detour into two modules that will help you navigate your local file system. + +

The os module

+ +

Python 3 comes with a module called os, which stands for “operating system.” The os module contains a plethora of functions to get information on — and in some cases, to manipulate — local directories, files, processes, and environment variables. Python does its best to offer a unified API across all supported operating systems so your programs can run on any computer with as little platform-specific code as possible. + +

The Current Working Directory

+ +

When you’re just getting started with Python, you’re going to spend a lot of time in the Python Shell. Throughout this book, you will see examples that go like this: + +

    +
  1. Import one of the modules in the examples folder +
  2. Call a function in that module +
  3. Explain the result +
+ +

If you don’t know about the current working directory, step 1 will probably fail with an ImportError. Why? Because Python will look for the example module in the import search path, but it won’t find it because the examples folder isn’t one of the directories in the search path. To get past this, you can do one of two things: + +

    +
  1. Add the examples folder to the import search path +
  2. Change the current working directory to the examples folder +
+ +
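A minimal sketch of the first option, under stated assumptions: the examples folder is simulated with a temporary directory, and the module name `demo_module` is made up for illustration. Substitute the real path to your examples folder.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for the examples folder, holding one tiny module.
examples_dir = tempfile.mkdtemp()
with open(os.path.join(examples_dir, "demo_module.py"), "w") as f:
    f.write("ANSWER = 42\n")

# Add the folder to the front of the import search path.
sys.path.insert(0, examples_dir)
importlib.invalidate_caches()  # make sure the new file is noticed

# The import now succeeds instead of raising ImportError.
demo_module = importlib.import_module("demo_module")
print(demo_module.ANSWER)
```

The second option works in the interactive shell because the first entry of `sys.path` there is the empty string, which Python treats as the current working directory.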

The current working directory is an invisible property that Python holds in memory at all times. There is always a current working directory, whether you’re in the Python Shell, running your own Python script from the command line, or running a Python CGI script on a web server somewhere. + +

The os module contains two functions to deal with the current working directory. + +

+>>> import os                                            
+>>> print(os.getcwd())                                   
+C:\Python31
+>>> os.chdir('/Users/pilgrim/diveintopython3/examples')  
+>>> print(os.getcwd())                                   
+C:\Users\pilgrim\diveintopython3\examples
+
    +
  1. When you run the graphical Python Shell, the current working directory starts as the directory where the Python Shell executable is. On Windows, this depends on where you installed Python; the default directory is c:\Python31. If you run the Python Shell from the command line, the current working directory starts as the directory you were in when you ran python3. +
  2. FIXME +
  3. FIXME +
  4. FIXME +
+ +
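The same two functions can be tried without touching any real folder. This is a sketch that assumes a throwaway temporary directory stands in for the examples folder; the variable names are illustrative only.

```python
import os
import tempfile

original = os.getcwd()      # remember where we started
print(original)

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)       # change the current working directory
    inside = os.getcwd()    # now reports the temporary directory
    print(inside)
    os.chdir(original)      # step back out before the directory is deleted

restored = os.getcwd()
print(restored == original)
```

Changing back before the `with` block ends matters: some platforms refuse to delete a directory that is still the current working directory of a running process.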

The os.path module

+ +

FIXME The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory. +

+>>> import os
+>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")  
+'c:\\music\\ap\\mahadeva.mp3'
+>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   
+'c:\\music\\ap\\mahadeva.mp3'
+>>> os.path.expanduser("~")       
+'c:\\Documents and Settings\\mpilgrim\\My Documents'
+>>> os.path.join(os.path.expanduser("~"), "Python") 
+'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
+
    +
  1. os.path is a reference to a module -- which module depends on your platform. Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module. +
  2. The join function of os.path constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing + with pathnames on Windows is annoying because the backslash character must be escaped.) +
  3. In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since +addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language. Do not write this stupid little function in Python; smart people have already taken care of it for you. +
  4. expanduser will expand a pathname that uses ~ to represent the current user's home directory. This works on any platform where users have a home directory, including Windows, UNIX, and Mac OS X; if the home directory cannot be determined, the pathname is returned unchanged. +
  5. Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory. +
+ +
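To see the platform-specific modules behind os.path directly, here is a small sketch that calls them by name. `ntpath` and `posixpath` are the standard library modules that os.path points to on Windows and on UNIX-like systems respectively, and both can be imported on any platform, so the Windows examples above can be reproduced anywhere.

```python
import ntpath      # the Windows flavour of os.path
import os
import posixpath   # the UNIX / Mac OS X flavour of os.path

# join adds a separator only when one is missing.
with_slash = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
without_slash = ntpath.join("c:\\music\\ap", "mahadeva.mp3")
print(with_slash == without_slash)

# The same call under POSIX rules uses forward slashes.
posix_path = posixpath.join("/music/ap", "mahadeva.mp3")
print(posix_path)

# Build a pathname under the current user's home directory.
python_dir = os.path.join(os.path.expanduser("~"), "Python")
print(python_dir)
```

In ordinary code you would call os.path.join and let Python pick the right flavour; naming `ntpath` explicitly is only useful when you need to handle another platform's pathnames.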

FIXME + +

>>> os.path.split("c:\\music\\ap\\mahadeva.mp3")      
+('c:\\music\\ap', 'mahadeva.mp3')
+>>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") 
+>>> filepath      
+'c:\\music\\ap'
+>>> filename      
+'mahadeva.mp3'
+>>> (shortname, extension) = os.path.splitext(filename)                 
+>>> shortname
+'mahadeva'
+>>> extension
+'.mp3'
+
    +
  1. The split function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use +multi-variable assignment to return multiple values from a function? Well, split is such a function. +
  2. You assign the return value of the split function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple. +
  3. The first variable, filepath, receives the value of the first element of the tuple returned from split, the file path. +
  4. The second variable, filename, receives the value of the second element of the tuple returned from split, the filename. +
  5. os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique + to assign each of them to separate variables. +
+ +
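As a compact recap of the splitting functions, the sketch below uses `ntpath` (the Windows implementation of os.path, importable everywhere) so the Windows-style pathname behaves the same on any machine. The `backup.tar.gz` filename is an invented example.

```python
import ntpath

full_path = "c:\\music\\ap\\mahadeva.mp3"

# split returns a (directory, filename) tuple, unpacked into two variables.
filepath, filename = ntpath.split(full_path)

# splitext returns a (name, extension) tuple; the extension keeps its dot.
shortname, extension = ntpath.splitext(filename)
print(filepath, filename, shortname, extension)

# Only the last extension is split off.
stem, last_ext = ntpath.splitext("backup.tar.gz")
print(stem, last_ext)
```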

FIXME + +

>>> os.listdir("c:\\music\\_singles\\")              
+['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
+'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3', 
+'spinning.mp3']
+>>> dirname = "c:\\"
+>>> os.listdir(dirname)            
+['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
+'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
+'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
+'Program Files', 'Python20', 'RECYCLER',
+'System Volume Information', 'TEMP', 'WINNT']
+>>> [f for f in os.listdir(dirname)
+...    if os.path.isfile(os.path.join(dirname, f))] 
+['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
+'NTDETECT.COM', 'ntldr', 'pagefile.sys']
+>>> [f for f in os.listdir(dirname)
+...    if os.path.isdir(os.path.join(dirname, f))]  
+['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
+'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
+'System Volume Information', 'TEMP', 'WINNT']
+
    +
  1. The listdir function takes a pathname and returns a list of the contents of the directory. +
  2. listdir returns both files and folders, with no indication of which is which. +
  3. You can use list filtering and the isfile function of the os.path module to separate the files from the folders. isfile takes a pathname and returns True if the path represents a file, and False otherwise. Here you're using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory. You can use os.getcwd() to get the current working directory. +
  4. os.path also has an isdir function, which returns True if the path represents a directory, and False otherwise. You can use this to get a list of the subdirectories within a directory. +
+ +
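The filtering technique above can be tried on any machine with a throwaway directory. This sketch assumes made-up contents (two files and one subdirectory) created inside a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # Hypothetical contents: two empty files and one subdirectory.
    for name in ("a.txt", "b.mp3"):
        with open(os.path.join(dirname, name), "w") as f:
            f.write("")
    os.mkdir(os.path.join(dirname, "sub"))

    # listdir makes no promise about order, so sort for stable output.
    entries = sorted(os.listdir(dirname))
    files = [f for f in entries
             if os.path.isfile(os.path.join(dirname, f))]
    dirs = [f for f in entries
            if os.path.isdir(os.path.join(dirname, f))]

print(entries)  # ['a.txt', 'b.mp3', 'sub']
print(files)    # ['a.txt', 'b.mp3']
print(dirs)     # ['sub']
```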

The glob module

+ +

FIXME + +

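import os
+
+def listDirectory(directory, fileExtList):
+    "get list of full pathnames for files of particular extensions"
+    # normalize case according to the platform's conventions
+    fileList = [os.path.normcase(f)
+                for f in os.listdir(directory)]
+    # keep only the wanted extensions, as full pathnames
+    fileList = [os.path.join(directory, f)
+                for f in fileList
+                if os.path.splitext(f)[1] in fileExtList]
+    return fileList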
+
    +
  1. os.listdir(directory) returns a list of all the files and folders in directory. +
  2. Iterating through the list with f, you use os.path.normcase(f) to normalize the case according to operating system defaults. normcase is a useful little function that compensates for case-insensitive operating systems that think that mahadeva.mp3 and mahadeva.MP3 are the same file. On Windows, normcase will convert the entire filename to lowercase (and forward slashes to backslashes); on all other systems, including UNIX-compatible ones and Mac OS X, it will return the filename unchanged. +
  3. Iterating through the normalized list with f again, you use os.path.splitext(f) to split each filename into name and extension. +
  4. For each file, you see if the extension is in the list of file extensions you care about (fileExtList, which was passed to the listDirectory function). +
  5. For each file you care about, you use os.path.join(directory, f) to construct the full pathname of the file, and return a list of the full pathnames. +
+ +
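Because normcase leaves filenames unchanged outside Windows, a portable variant may prefer to lowercase the extension itself. The following is a sketch under that assumption, not the book's function: `list_directory` and the sample filenames are invented for illustration, and it uses a set comprehension for the wanted extensions.

```python
import os
import tempfile


def list_directory(directory, extensions):
    """Return full pathnames of entries whose extension is in extensions.

    Extensions are compared case-insensitively on every platform.
    """
    wanted = {ext.lower() for ext in extensions}
    return [os.path.join(directory, name)
            for name in sorted(os.listdir(directory))
            if os.path.splitext(name)[1].lower() in wanted]


with tempfile.TemporaryDirectory() as music:
    # Hypothetical files: two MP3s with different case, one text file.
    for name in ("kairo.mp3", "LOUD.MP3", "notes.txt"):
        open(os.path.join(music, name), "w").close()
    found = [os.path.basename(p) for p in list_directory(music, [".mp3"])]

print(found)
```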
+

Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like os.path.split() work on UNIX, Windows, Mac OS X, and any other platform supported by Python. +

+ +

There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line. + +

+>>> os.listdir("c:\\music\\_singles\\")               
+['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
+'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
+'spinning.mp3']
+>>> import glob
+>>> glob.glob('c:\\music\\_singles\\*.mp3')           
+['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
+'c:\\music\\_singles\\hellraiser.mp3',
+'c:\\music\\_singles\\kairo.mp3',
+'c:\\music\\_singles\\long_way_home1.mp3',
+'c:\\music\\_singles\\sidewinder.mp3',
+'c:\\music\\_singles\\spinning.mp3']
+>>> glob.glob('c:\\music\\_singles\\s*.mp3')          
+['c:\\music\\_singles\\sidewinder.mp3',
+'c:\\music\\_singles\\spinning.mp3']
+>>> glob.glob('c:\\music\\*\\*.mp3')
+
+
    +
  1. As you saw earlier, os.listdir simply takes a directory path and lists all files and directories in that directory. +
  2. The glob function of the glob module, on the other hand, takes a wildcard pattern and returns the path of every file and directory matching it. Here the pattern is a directory path plus "*.mp3", which will match all .mp3 files. Note that each element of the returned list already includes the directory portion you supplied in the pattern. +
  3. If you want to find all the files in a specific directory that start with "s" and end with ".mp3", you can do that too. +
  4. Now consider this scenario: you have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory. You can get a list of all of those with a single call to glob, by using two wildcards at once. One wildcard is the "*.mp3" (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music. That's a crazy amount of power packed into one deceptively simple-looking function! +
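The two-wildcard call can be checked against a small invented tree. This sketch builds a temporary music directory with hypothetical subdirectories and filenames, then runs both kinds of pattern; results are sorted because glob does not promise any particular order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as music:
    # Hypothetical layout: music/_singles/*.mp3 and music/ap/*
    for sub, name in [("_singles", "sidewinder.mp3"),
                      ("_singles", "spinning.mp3"),
                      ("_singles", "kairo.mp3"),
                      ("ap", "mahadeva.mp3"),
                      ("ap", "cover.jpg")]:
        os.makedirs(os.path.join(music, sub), exist_ok=True)
        open(os.path.join(music, sub, name), "w").close()

    # One wildcard: files in a single directory that start with "s".
    pattern = os.path.join(music, "_singles", "s*.mp3")
    s_files = sorted(os.path.basename(p) for p in glob.glob(pattern))

    # Two wildcards: any subdirectory, any .mp3 file.
    pattern = os.path.join(music, "*", "*.mp3")
    all_mp3 = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(s_files)  # ['sidewinder.mp3', 'spinning.mp3']
print(all_mp3)  # ['kairo.mp3', 'mahadeva.mp3', 'sidewinder.mp3', 'spinning.mp3']
```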

List Comprehensions

-

FIXME - +

⁂
@@ -89,7 +266,7 @@ as params.items(), but each element in the

Further Reading

© 2001–9 Mark Pilgrim
diff --git a/dip2 b/dip2
index f4716ec..59210bb 100755
--- a/dip2
+++ b/dip2
@@ -181,132 +181,11 @@ stat

  • Python Library Reference documents the sys module. -

    6.5. Working with Directories

    -

    The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing - the contents of a directory. -

    Example 6.16. Constructing Pathnames

    ->>> import os
    ->>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")  
    -'c:\\music\\ap\\mahadeva.mp3'
    ->>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   
    -'c:\\music\\ap\\mahadeva.mp3'
    ->>> os.path.expanduser("~")       
    -'c:\\Documents and Settings\\mpilgrim\\My Documents'
    ->>> os.path.join(os.path.expanduser("~"), "Python") 
    -'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
    -
      -
    1. os.path is a reference to a module -- which module depends on your platform. Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module. -
    2. The join function of os.path constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing - with pathnames on Windows is annoying because the backslash character must be escaped.) -
    3. In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since -addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language. Do not write this stupid little function in Python; smart people have already taken care of it for you. -
    4. expanduser will expand a pathname that uses ~ to represent the current user's home directory. This works on any platform where users have a home directory, like Windows, -UNIX, and Mac OS X; it has no effect on Mac OS. -
    5. Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory. -

      Example 6.17. Splitting Pathnames

      >>> os.path.split("c:\\music\\ap\\mahadeva.mp3")      
      -('c:\\music\\ap', 'mahadeva.mp3')
      ->>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") 
      ->>> filepath      
      -'c:\\music\\ap'
      ->>> filename      
      -'mahadeva.mp3'
      ->>> (shortname, extension) = os.path.splitext(filename)                 
      ->>> shortname
      -'mahadeva'
      ->>> extension
      -'.mp3'
      -
        -
      1. The split function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use -multi-variable assignment to return multiple values from a function? Well, split is such a function. -
      2. You assign the return value of the split function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple. -
      3. The first variable, filepath, receives the value of the first element of the tuple returned from split, the file path. -
      4. The second variable, filename, receives the value of the second element of the tuple returned from split, the filename. -
      5. os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique - to assign each of them to separate variables. -

        Example 6.18. Listing Directories

        >>> os.listdir("c:\\music\\_singles\\")              
        -['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
        -'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3', 
        -'spinning.mp3']
        ->>> dirname = "c:\\"
        ->>> os.listdir(dirname)            
        -['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
        -'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
        -'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
        -'Program Files', 'Python20', 'RECYCLER',
        -'System Volume Information', 'TEMP', 'WINNT']
        ->>> [f for f in os.listdir(dirname)
        -...    if os.path.isfile(os.path.join(dirname, f))] 
        -['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
        -'NTDETECT.COM', 'ntldr', 'pagefile.sys']
        ->>> [f for f in os.listdir(dirname)
        -...    if os.path.isdir(os.path.join(dirname, f))]  
        -['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
        -'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
        -'System Volume Information', 'TEMP', 'WINNT']
        -
          -
        1. The listdir function takes a pathname and returns a list of the contents of the directory. -
        2. listdir returns both files and folders, with no indication of which is which. -
        3. You can use list filtering and the isfile function of the os.path module to separate the files from the folders. isfile takes a pathname and returns 1 if the path represents a file, and 0 otherwise. Here you're using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory. You can use os.getcwd() to get the current working directory. -
        4. os.path also has a isdir function which returns 1 if the path represents a directory, and 0 otherwise. You can use this to get a list of the subdirectories - within a directory. -

          Example 6.19. Listing Directories in fileinfo.py

          
          -def listDirectory(directory, fileExtList):    
          -    "get list of file info objects for files of particular extensions" 
          -    fileList = [os.path.normcase(f)
          -                for f in os.listdir(directory)]             
          -    fileList = [os.path.join(directory, f) 
          -               for f in fileList
          -                if os.path.splitext(f)[1] in fileExtList]    
          -
            -
          1. os.listdir(directory) returns a list of all the files and folders in directory. -
          2. Iterating through the list with f, you use os.path.normcase(f) to normalize the case according to operating system defaults. normcase is a useful little function that compensates for case-insensitive operating systems that think that mahadeva.mp3 and mahadeva.MP3 are the same file. For instance, on Windows and Mac OS, normcase will convert the entire filename to lowercase; on UNIX-compatible systems, it will return the filename unchanged. -
          3. Iterating through the normalized list with f again, you use os.path.splitext(f) to split each filename into name and extension. -
          4. For each file, you see if the extension is in the list of file extensions you care about (fileExtList, which was passed to the listDirectory function). -
          5. For each file you care about, you use os.path.join(directory, f) to construct the full pathname of the file, and return a list of the full pathnames. - - -
            NoteWhenever possible, you should use the functions in os and os.path for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like -os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python. -

            There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you -may already be familiar with from working on the command line. -

            Example 6.20. Listing Directories with glob

            ->>> os.listdir("c:\\music\\_singles\\")               
            -['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
            -'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
            -'spinning.mp3']
            ->>> import glob
            ->>> glob.glob('c:\\music\\_singles\\*.mp3')           
            -['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
            -'c:\\music\\_singles\\hellraiser.mp3',
            -'c:\\music\\_singles\\kairo.mp3',
            -'c:\\music\\_singles\\long_way_home1.mp3',
            -'c:\\music\\_singles\\sidewinder.mp3',
            -'c:\\music\\_singles\\spinning.mp3']
            ->>> glob.glob('c:\\music\\_singles\\s*.mp3')          
            -['c:\\music\\_singles\\sidewinder.mp3',
            -'c:\\music\\_singles\\spinning.mp3']
            ->>> glob.glob('c:\\music\\*\\*.mp3')
            -
            -
              -
            1. As you saw earlier, os.listdir simply takes a directory path and lists all files and directories in that directory. -
            2. The glob module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard. - Here the wildcard is a directory path plus "*.mp3", which will match all .mp3 files. Note that each element of the returned list already includes the full path of the file. -
            3. If you want to find all the files in a specific directory that start with "s" and end with ".mp3", you can do that too. -
            4. Now consider this scenario: you have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory. You can get a list of all of those with a single call to glob, by using two wildcards at once. One wildcard is the "*.mp3" (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music. That's a crazy amount of power packed into one deceptively simple-looking function! -
              -

              Further Reading on the os Module

              -
              -[HTML stuff was here]
              @@ -690,731 +569,6 @@ def main(argv):
              -[HTTP web services stuff was here]
              -
              -
              -
              -
              -
              -[unit testing stuff was here]
              -
              -
              -
              -
              -
              -

              Chapter 14. Test-First Programming

              -

              14.1. roman.py, stage 1

              -

              Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're - going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps - in roman.py. -

              Example 14.1. roman1.py

              -

              This file is available in py/roman/stage1/ in the examples directory. -

              If you have not already done so, you can download this and other examples used in this book. -

              
              -"""Convert to and from Roman numerals"""
              -
              -#Define exceptions
              -class RomanError(Exception): pass                
              -class OutOfRangeError(RomanError): pass          
              -class NotIntegerError(RomanError): pass
              -class InvalidRomanNumeralError(RomanError): pass 
              -
              -def to_roman(n):
              -    """convert integer to Roman numeral"""
              -    pass     
              -
              -def from_roman(s):
              -    """convert Roman numeral to integer"""
              -    pass
              -
              -
                -
              1. This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not - required) that you subclass Exception, which is the base class that all built-in exceptions inherit from. Here I am defining RomanError (inherited from Exception) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily - have inherited each individual exception from the Exception class directly. -
              2. The OutOfRangeError and NotIntegerError exceptions will eventually be used by to_roman() to flag various forms of invalid input, as specified in ToRomanBadInput. -
              3. The InvalidRomanNumeralError exception will eventually be used by from_roman() to flag invalid input, as specified in FromRomanBadInput. -
              4. At this stage, you want to define the API of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word pass. -

                Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At -this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to romantest.py and re-evaluate why you coded a test so useless that it passes with do-nothing functions. -

                Run romantest1.py with the -v command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs. -With any luck, your output should look like this: -

                Example 14.2. Output of romantest1.py against roman1.py

                from_roman should only accept uppercase input ... ERROR
                -to_roman should always return uppercase ... ERROR
                -from_roman should fail with malformed antecedents ... FAIL
                -from_roman should fail with repeated pairs of numerals ... FAIL
                -from_roman should fail with too many repeated numerals ... FAIL
                -from_roman should give known result with known input ... FAIL
                -to_roman should give known result with known input ... FAIL
                -from_roman(to_roman(n))==n for all n ... FAIL
                -to_roman should fail with non-integer input ... FAIL
                -to_roman should fail with negative input ... FAIL
                -to_roman should fail with large input ... FAIL
                -to_roman should fail with 0 input ... FAIL
                -
                -======================================================================
                -ERROR: from_roman should only accept uppercase input
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 154, in testFromRomanCase
                -    roman1.from_roman(numeral.upper())
                -AttributeError: 'None' object has no attribute 'upper'
                -======================================================================
                -ERROR: to_roman should always return uppercase
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 148, in testToRomanCase
                -    self.assertEqual(numeral, numeral.upper())
                -AttributeError: 'None' object has no attribute 'upper'
                -======================================================================
                -FAIL: from_roman should fail with malformed antecedents
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 133, in testMalformedAntecedent
                -    self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
                -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                -    raise self.failureException, excName
                -AssertionError: InvalidRomanNumeralError
                -======================================================================
                -FAIL: from_roman should fail with repeated pairs of numerals
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 127, in testRepeatedPairs
                -    self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
                -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                -    raise self.failureException, excName
                -AssertionError: InvalidRomanNumeralError
                -======================================================================
                -FAIL: from_roman should fail with too many repeated numerals
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 122, in testTooManyRepeatedNumerals
                -    self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
                -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                -    raise self.failureException, excName
                -AssertionError: InvalidRomanNumeralError
                -======================================================================
                -FAIL: from_roman should give known result with known input
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 99, in testFromRomanKnownValues
                -    self.assertEqual(integer, result)
                -  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                -    raise self.failureException, (msg or '%s != %s' % (first, second))
                -AssertionError: 1 != None
                -======================================================================
                -FAIL: to_roman should give known result with known input
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 93, in testToRomanKnownValues
                -    self.assertEqual(numeral, result)
                -  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                -    raise self.failureException, (msg or '%s != %s' % (first, second))
                -AssertionError: I != None
                -======================================================================
                -FAIL: from_roman(to_roman(n))==n for all n
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 141, in testSanity
                -    self.assertEqual(integer, result)
                -  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                -    raise self.failureException, (msg or '%s != %s' % (first, second))
                -AssertionError: 1 != None
                -======================================================================
                -FAIL: to_roman should fail with non-integer input
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 116, in testNonInteger
                -    self.assertRaises(roman1.NotIntegerError, roman1.to_roman, 0.5)
                -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                -    raise self.failureException, excName
                -AssertionError: NotIntegerError
                -======================================================================
                -FAIL: to_roman should fail with negative input
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 112, in testNegative
                -    self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, -1)
                -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                -    raise self.failureException, excName
                -AssertionError: OutOfRangeError
                -======================================================================
                -FAIL: to_roman should fail with large input
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 104, in testTooLarge
                -    self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 4000)
                -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                -    raise self.failureException, excName
                -AssertionError: OutOfRangeError
                -======================================================================
                -FAIL: to_roman should fail with 0 input               
                -----------------------------------------------------------------------
                -Traceback (most recent call last):
                -  File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 108, in testZero
                -    self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 0)
                -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                -    raise self.failureException, excName
                -AssertionError: OutOfRangeError    
                -----------------------------------------------------------------------
                -Ran 12 tests in 0.040s             
                -
                -FAILED (failures=10, errors=2)     
                -

                14.2. roman.py, stage 2

                -

                Now that you have the framework of the roman module laid out, it's time to start writing code and passing test cases. -

                Example 14.3. roman2.py

                -

                This file is available in py/roman/stage2/ in the examples directory. -

                If you have not already done so, you can download this and other examples used in this book. -

                
                -"""Convert to and from Roman numerals"""
                -
                -#Define exceptions
                -class RomanError(Exception): pass
                -class OutOfRangeError(RomanError): pass
                -class NotIntegerError(RomanError): pass
                -class InvalidRomanNumeralError(RomanError): pass
                -
                -#Define digit mapping
                -romanNumeralMap = (('M',  1000), 
                - ('CM', 900),
                - ('D',  500),
                - ('CD', 400),
                - ('C',  100),
                - ('XC', 90),
                - ('L',  50),
                - ('XL', 40),
                - ('X',  10),
                - ('IX', 9),
                - ('V',  5),
                - ('IV', 4),
                - ('I',  1))
                -
                -def to_roman(n):
                -    """convert integer to Roman numeral"""
                -    result = ""
                -    for numeral, integer in romanNumeralMap:
                -        while n >= integer:      
                -            result += numeral
                -            n -= integer
                -    return result
                -
                -def from_roman(s):
                -    """convert Roman numeral to integer"""
                -    pass
                -
                -
                  -
                1. romanNumeralMap is a tuple of tuples which defines three things: -
                  -
                    -
                  1. The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals; - you're also defining two-character pairs like CM (“one hundred less than one thousand”); this will make the to_roman() code simpler later. - -
                  2. The order of the Roman numerals. They are listed in descending value order, from M all the way down to I. - -
                  3. The value of each Roman numeral. Each inner tuple is a pair of (numeral, value). - -
                  -
                2. Here's where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule. - To convert to Roman numerals, you simply iterate through romanNumeralMap looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation - to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat. -
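For readers following along in Python 3, here is a minimal sketch of the same greedy loop. The names `romanNumeralMap` and `to_roman` come from the listing above. The spot-check values are just illustrations, and, as in stage 2, there is no input checking.

```python
# Python 3 sketch of the stage 2 conversion (no input checking yet).
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """Convert an integer to a Roman numeral."""
    result = ""
    for numeral, integer in romanNumeralMap:
        # Take the largest remaining value as many times as it still fits.
        while n >= integer:
            result += numeral
            n -= integer
    return result

print(to_roman(1424))  # MCDXXIV
```

Because pairs like CM and IV are entries in the map, the loop never needs a special case for the subtraction rule.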

                  Example 14.4. How to_roman() works

                  -

                  If you're not clear how to_roman() works, add a print statement to the end of the while loop:

                  
                  -        while n >= integer:
                  -            result += numeral
                  -            n -= integer
                  -            print 'subtracting', integer, 'from input, adding', numeral, 'to output'
                  ->>> import roman2
                  ->>> roman2.to_roman(1424)
                  -subtracting 1000 from input, adding M to output
                  -subtracting 400 from input, adding CD to output
                  -subtracting 10 from input, adding X to output
                  -subtracting 10 from input, adding X to output
                  -subtracting 4 from input, adding IV to output
                  -'MCDXXIV'
                  -

                  So to_roman() appears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely. -

                  Example 14.5. Output of romantest2.py against roman2.py

                  -

                  Remember to run romantest2.py with the -v command-line flag to enable verbose mode. -

                  from_roman should only accept uppercase input ... FAIL
                  -to_roman should always return uppercase ... ok
                  -from_roman should fail with malformed antecedents ... FAIL
                  -from_roman should fail with repeated pairs of numerals ... FAIL
                  -from_roman should fail with too many repeated numerals ... FAIL
                  -from_roman should give known result with known input ... FAIL
                  -to_roman should give known result with known input ... ok       
                  -from_roman(to_roman(n))==n for all n ... FAIL
                  -to_roman should fail with non-integer input ... FAIL            
                  -to_roman should fail with negative input ... FAIL
                  -to_roman should fail with large input ... FAIL
                  -to_roman should fail with 0 input ... FAIL
                  -
                    -
                  1. to_roman() does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already. -
                  2. Here's the big news: this version of the to_roman() function passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including - inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it. -
                  3. However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to - be raised (via assertRaises), and you're never raising them. You'll do that in the next stage. -

                    Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10. -

                    
                    -======================================================================
                    -FAIL: from_roman should only accept uppercase input
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 156, in testFromRomanCase
                    -    roman2.from_roman, numeral.lower())
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: InvalidRomanNumeralError
                    -======================================================================
                    -FAIL: from_roman should fail with malformed antecedents
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 133, in testMalformedAntecedent
                    -    self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: InvalidRomanNumeralError
                    -======================================================================
                    -FAIL: from_roman should fail with repeated pairs of numerals
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 127, in testRepeatedPairs
                    -    self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: InvalidRomanNumeralError
                    -======================================================================
                    -FAIL: from_roman should fail with too many repeated numerals
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 122, in testTooManyRepeatedNumerals
                    -    self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: InvalidRomanNumeralError
                    -======================================================================
                    -FAIL: from_roman should give known result with known input
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 99, in testFromRomanKnownValues
                    -    self.assertEqual(integer, result)
                    -  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                    -    raise self.failureException, (msg or '%s != %s' % (first, second))
                    -AssertionError: 1 != None
                    -======================================================================
                    -FAIL: from_roman(to_roman(n))==n for all n
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 141, in testSanity
                    -    self.assertEqual(integer, result)
                    -  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                    -    raise self.failureException, (msg or '%s != %s' % (first, second))
                    -AssertionError: 1 != None
                    -======================================================================
                    -FAIL: to_roman should fail with non-integer input
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 116, in testNonInteger
                    -    self.assertRaises(roman2.NotIntegerError, roman2.to_roman, 0.5)
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: NotIntegerError
                    -======================================================================
                    -FAIL: to_roman should fail with negative input
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 112, in testNegative
                    -    self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, -1)
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: OutOfRangeError
                    -======================================================================
                    -FAIL: to_roman should fail with large input
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 104, in testTooLarge
                    -    self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: OutOfRangeError
                    -======================================================================
                    -FAIL: to_roman should fail with 0 input
                    -----------------------------------------------------------------------
                    -Traceback (most recent call last):
                    -  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 108, in testZero
                    -    self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 0)
                    -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                    -    raise self.failureException, excName
                    -AssertionError: OutOfRangeError
                    -----------------------------------------------------------------------
                    -Ran 12 tests in 0.320s
                    -
                    -FAILED (failures=10)

                    14.3. roman.py, stage 3

                    -

                    Now that to_roman() behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else). -

                    Example 14.6. roman3.py

                    -

                    This file is available in py/roman/stage3/ in the examples directory. -

                    If you have not already done so, you can download this and other examples used in this book. -

                    
                    -"""Convert to and from Roman numerals"""
                    -
                    -#Define exceptions
                    -class RomanError(Exception): pass
                    -class OutOfRangeError(RomanError): pass
                    -class NotIntegerError(RomanError): pass
                    -class InvalidRomanNumeralError(RomanError): pass
                    -
                    -#Define digit mapping
                    -romanNumeralMap = (('M',  1000),
                    - ('CM', 900),
                    - ('D',  500),
                    - ('CD', 400),
                    - ('C',  100),
                    - ('XC', 90),
                    - ('L',  50),
                    - ('XL', 40),
                    - ('X',  10),
                    - ('IX', 9),
                    - ('V',  5),
                    - ('IV', 4),
                    - ('I',  1))
                    -
                    -def to_roman(n):
                    -    """convert integer to Roman numeral"""
                    -    if not (0 < n < 4000):         
                    -        raise OutOfRangeError, "number out of range (must be 1..3999)" 
                    -    if int(n) <> n:                
                    -        raise NotIntegerError, "non-integers can not be converted"
                    -
                    -    result = ""  
                    -    for numeral, integer in romanNumeralMap:
                    -        while n >= integer:
                    -            result += numeral
                    -            n -= integer
                    -    return result
                    -
                    -def from_roman(s):
                    -    """convert Roman numeral to integer"""
                    -    pass
                    -
                    -
                      -
                    1. This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to if not ((0 < n) and (n < 4000)), but it's much easier to read. This is the range check, and it should catch inputs that are too large, negative, or zero. -
                    2. You raise exceptions yourself with the raise statement. You can raise any of the built-in exceptions, or you can raise any of your custom exceptions that you've defined. - The second parameter, the error message, is optional; if given, it is displayed in the traceback that is printed if the exception - is never handled. -
                    3. This is the non-integer check. Non-integers can not be converted to Roman numerals. -
                    4. The rest of the function is unchanged. -
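The listing above uses Python 2 syntax. In Python 3, the `raise Class, "message"` form and the `<>` operator are gone, so the same two checks might be sketched like this. `check_input` is a hypothetical helper introduced only for illustration; in the module, these checks sit at the top of `to_roman()`.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

def check_input(n):
    """Hypothetical helper: the two stage 3 checks, in the same order."""
    # Chained comparison: same as (0 < n) and (n < 4000).
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    # Python 3 spells "not equal" as != (the <> operator was removed).
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")

for bad in (4000, 0, -1, 1.5):
    try:
        check_input(bad)
    except RomanError as e:
        print(bad, "->", type(e).__name__)
```

Note that the range check runs first, so a value like 4000.5 is reported as out of range rather than as a non-integer.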

                      Example 14.7. Watching to_roman() handle bad input

                      ->>> import roman3
                      ->>> roman3.to_roman(4000)
                      -Traceback (most recent call last):
                      -  File "<interactive input>", line 1, in ?
                      -  File "roman3.py", line 27, in to_roman
                      -    raise OutOfRangeError, "number out of range (must be 1..3999)"
                      -OutOfRangeError: number out of range (must be 1..3999)
                      ->>> roman3.to_roman(1.5)
                      -Traceback (most recent call last):
                      -  File "<interactive input>", line 1, in ?
                      -  File "roman3.py", line 29, in to_roman
                      -    raise NotIntegerError, "non-integers can not be converted"
                      -NotIntegerError: non-integers can not be converted
                      -

                      Example 14.8. Output of romantest3.py against roman3.py

                      from_roman should only accept uppercase input ... FAIL
                      -to_roman should always return uppercase ... ok
                      -from_roman should fail with malformed antecedents ... FAIL
                      -from_roman should fail with repeated pairs of numerals ... FAIL
                      -from_roman should fail with too many repeated numerals ... FAIL
                      -from_roman should give known result with known input ... FAIL
                      -to_roman should give known result with known input ... ok 
                      -from_roman(to_roman(n))==n for all n ... FAIL
                      -to_roman should fail with non-integer input ... ok        
                      -to_roman should fail with negative input ... ok           
                      -to_roman should fail with large input ... ok
                      -to_roman should fail with 0 input ... ok
                      -
                        -
                      1. to_roman() still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything. -
                      2. More exciting is the fact that all of the bad input tests now pass. This test, testNonInteger, passes because of the int(n) <> n check. When a non-integer is passed to to_roman(), the int(n) <> n check notices it and raises the NotIntegerError exception, which is what testNonInteger is looking for. -
                      3. This test, testNegative, passes because of the not (0 < n < 4000) check, which raises an OutOfRangeError exception, which is what testNegative is looking for. -
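To make the link between the checks and the tests concrete, here is a small Python 3 sketch of the four bad-input tests. The test method names and arguments are taken from the tracebacks earlier in this chapter. The class name `ToRomanBadInput` and the stubbed-out `to_roman()` body are assumptions made for illustration.

```python
import unittest

class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

def to_roman(n):
    """Only the stage 3 input checks; the conversion loop is left out."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    return ""  # placeholder: the real function builds the numeral here

class ToRomanBadInput(unittest.TestCase):  # hypothetical class name
    def testTooLarge(self):
        self.assertRaises(OutOfRangeError, to_roman, 4000)

    def testZero(self):
        self.assertRaises(OutOfRangeError, to_roman, 0)

    def testNegative(self):
        self.assertRaises(OutOfRangeError, to_roman, -1)

    def testNonInteger(self):
        self.assertRaises(NotIntegerError, to_roman, 0.5)

# Run the suite directly instead of calling unittest.main(),
# so the snippet also behaves inside an interactive session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Each `assertRaises` call passes only if `to_roman()` raises exactly the named exception (or a subclass of it), which is why these tests failed in stage 2 and pass now.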
                        
                        -======================================================================
                        -FAIL: from_roman should only accept uppercase input
                        -----------------------------------------------------------------------
                        -Traceback (most recent call last):
                        -  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 156, in testFromRomanCase
                        -    roman3.from_roman, numeral.lower())
                        -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                        -    raise self.failureException, excName
                        -AssertionError: InvalidRomanNumeralError
                        -======================================================================
                        -FAIL: from_roman should fail with malformed antecedents
                        -----------------------------------------------------------------------
                        -Traceback (most recent call last):
                        -  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 133, in testMalformedAntecedent
                        -    self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
                        -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                        -    raise self.failureException, excName
                        -AssertionError: InvalidRomanNumeralError
                        -======================================================================
                        -FAIL: from_roman should fail with repeated pairs of numerals
                        -----------------------------------------------------------------------
                        -Traceback (most recent call last):
                        -  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 127, in testRepeatedPairs
                        -    self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
                        -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                        -    raise self.failureException, excName
                        -AssertionError: InvalidRomanNumeralError
                        -======================================================================
                        -FAIL: from_roman should fail with too many repeated numerals
                        -----------------------------------------------------------------------
                        -Traceback (most recent call last):
                        -  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 122, in testTooManyRepeatedNumerals
                        -    self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
                        -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                        -    raise self.failureException, excName
                        -AssertionError: InvalidRomanNumeralError
                        -======================================================================
                        -FAIL: from_roman should give known result with known input
                        -----------------------------------------------------------------------
                        -Traceback (most recent call last):
                        -  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 99, in testFromRomanKnownValues
                        -    self.assertEqual(integer, result)
                        -  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                        -    raise self.failureException, (msg or '%s != %s' % (first, second))
                        -AssertionError: 1 != None
                        -======================================================================
                        -FAIL: from_roman(to_roman(n))==n for all n
                        -----------------------------------------------------------------------
                        -Traceback (most recent call last):
                        -  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 141, in testSanity
                        -    self.assertEqual(integer, result)
                        -  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
                        -    raise self.failureException, (msg or '%s != %s' % (first, second))
                        -AssertionError: 1 != None
                        -----------------------------------------------------------------------
                        -Ran 12 tests in 0.401s
                        -
                        -FAILED (failures=6) 
                        -
                          -
                        1. You're down to 6 failures, and all of them involve from_roman(): the known values test, the three separate bad input tests, the case check, and the sanity check. That means that to_roman() has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that from_roman() be written, which it isn't yet.) Which means that you must stop coding to_roman() now. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard. - - -
                          NoteThe most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for - a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module. -

                          14.4. roman.py, stage 4

                          -

                          Now that to_roman() is done, it's time to start coding from_roman(). As the code below shows, it follows the same pattern as the to_roman() function. -

                          Example 14.9. roman4.py

                          -

                          This file is available in py/roman/stage4/ in the examples directory. -

                          If you have not already done so, you can download this and other examples used in this book. -

                          
                          -"""Convert to and from Roman numerals"""
                          -
                          -#Define exceptions
                          -class RomanError(Exception): pass
                          -class OutOfRangeError(RomanError): pass
                          -class NotIntegerError(RomanError): pass
                          -class InvalidRomanNumeralError(RomanError): pass
                          -
                          -#Define digit mapping
                          -romanNumeralMap = (('M',  1000),
                          - ('CM', 900),
                          - ('D',  500),
                          - ('CD', 400),
                          - ('C',  100),
                          - ('XC', 90),
                          - ('L',  50),
                          - ('XL', 40),
                          - ('X',  10),
                          - ('IX', 9),
                          - ('V',  5),
                          - ('IV', 4),
                          - ('I',  1))
                          -
                          -# to_roman function omitted for clarity (it hasn't changed)
                          -
                          -def from_roman(s):
                          -    """convert Roman numeral to integer"""
                          -    result = 0
                          -    index = 0
                          -    for numeral, integer in romanNumeralMap:
                          -        while s[index:index+len(numeral)] == numeral: 
                          -            result += integer
                          -            index += len(numeral)
                          -    return result
                          -
                          -
                            -
                          1. The pattern here is the same as to_roman(). You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer - values as often as possible, you match the “highest” Roman numeral character strings as often as possible. -
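The same matching loop can be sketched in Python 3 as follows. The names come from the listing above. The extra calls at the end are illustrations added here to show that this stage does no validation at all.

```python
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def from_roman(s):
    """Convert a Roman numeral to an integer (no validation yet)."""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        # Consume this numeral from the current position as often as it matches.
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

print(from_roman('MCMLXXII'))  # 1972
print(from_roman('IIII'))      # 4: malformed input is silently accepted
print(from_roman('mcm'))       # 0: lowercase never matches the map
```

The last two calls return a value instead of raising an exception, which is consistent with the bad-input tests for from_roman() still failing at this stage.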

                            Example 14.10. How from_roman() works

                            -

                            If you're not clear how from_roman() works, add a print statement to the end of the while loop:

                            
                            -        while s[index:index+len(numeral)] == numeral:
                            -            result += integer
                            -            index += len(numeral)
                            -            print 'found', numeral, 'of length', len(numeral), ', adding', integer
                            ->>> import roman4
                            ->>> roman4.from_roman('MCMLXXII')
                            -found M of length 1 , adding 1000
                            -found CM of length 2 , adding 900
                            -found L of length 1 , adding 50
                            -found X of length 1 , adding 10
                            -found X of length 1 , adding 10
                            -found I of length 1 , adding 1
                            -found I of length 1 , adding 1
                            -1972

                            Example 14.11. Output of romantest4.py against roman4.py

                            from_roman should only accept uppercase input ... FAIL
                            -to_roman should always return uppercase ... ok
                            -from_roman should fail with malformed antecedents ... FAIL
                            -from_roman should fail with repeated pairs of numerals ... FAIL
                            -from_roman should fail with too many repeated numerals ... FAIL
                            -from_roman should give known result with known input ... ok 
                            -to_roman should give known result with known input ... ok
                            -from_roman(to_roman(n))==n for all n ... ok
                            -to_roman should fail with non-integer input ... ok
                            -to_roman should fail with negative input ... ok
                            -to_roman should fail with large input ... ok
                            -to_roman should fail with 0 input ... ok
                            -
                              -
                            1. Two pieces of exciting news here. The first is that from_roman() works for good input, at least for all the known values you test. -
                            2. The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both to_roman() and from_roman() work properly for all possible good values. (This is not guaranteed; it is theoretically possible that to_roman() has a bug that produces the wrong Roman numeral for some particular set of inputs, and that from_roman() has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that to_roman() generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write - more comprehensive test cases until it doesn't bother you.) -
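As a rough illustration of the sanity check and of why known values still matter, here is a Python 3 sketch that round-trips every integer from 1 to 3999 and then compares a few fixed pairs. The pairs listed are an assumption chosen for this example, not the book's actual test table.

```python
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """Convert an integer to a Roman numeral (input checks omitted)."""
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    """Convert a Roman numeral to an integer (no validation yet)."""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

# Sanity check: converting there and back should return the original number.
for n in range(1, 4000):
    assert from_roman(to_roman(n)) == n

# Known values pin both functions to independent answers, which is what
# rules out a pair of bugs that happen to cancel each other out.
known_values = ((1424, 'MCDXXIV'), (1972, 'MCMLXXII'),
                (3888, 'MMMDCCCLXXXVIII'), (3999, 'MMMCMXCIX'))
for integer, numeral in known_values:
    assert to_roman(integer) == numeral
    assert from_roman(numeral) == integer

print("round trip and known values agree")
```

The round trip alone only proves the two functions are consistent with each other. The known values are what tie them to the correct answers.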
                              
                              -======================================================================
                              -FAIL: from_roman should only accept uppercase input
                              -----------------------------------------------------------------------
                              -Traceback (most recent call last):
                              -  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 156, in testFromRomanCase
                              -    roman4.from_roman, numeral.lower())
                              -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                              -    raise self.failureException, excName
                              -AssertionError: InvalidRomanNumeralError
                              -======================================================================
                              -FAIL: from_roman should fail with malformed antecedents
                              -----------------------------------------------------------------------
                              -Traceback (most recent call last):
                              -  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 133, in testMalformedAntecedent
                              -    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
                              -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                              -    raise self.failureException, excName
                              -AssertionError: InvalidRomanNumeralError
                              -======================================================================
                              -FAIL: from_roman should fail with repeated pairs of numerals
                              -----------------------------------------------------------------------
                              -Traceback (most recent call last):
                              -  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 127, in testRepeatedPairs
                              -    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
                              -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                              -    raise self.failureException, excName
                              -AssertionError: InvalidRomanNumeralError
                              -======================================================================
                              -FAIL: from_roman should fail with too many repeated numerals
                              -----------------------------------------------------------------------
                              -Traceback (most recent call last):
                              -  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 122, in testTooManyRepeatedNumerals
                              -    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
                              -  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
                              -    raise self.failureException, excName
                              -AssertionError: InvalidRomanNumeralError
                              -----------------------------------------------------------------------
                              -Ran 12 tests in 1.222s
                              -
                              -FAILED (failures=4)

                              14.5. roman.py, stage 5

                              -

                              Example 14.12. roman5.py

                              -

                              This file is available in py/roman/stage5/ in the examples directory. -

                              If you have not already done so, you can download this and other examples used in this book. -

                              
                              -"""Convert to and from Roman numerals"""
                              -import re
                              -
                              -#Define exceptions
                              -class RomanError(Exception): pass
                              -class OutOfRangeError(RomanError): pass
                              -class NotIntegerError(RomanError): pass
                              -class InvalidRomanNumeralError(RomanError): pass
                              -
                              -#Define digit mapping
                              -romanNumeralMap = (('M',  1000),
                              - ('CM', 900),
                              - ('D',  500),
                              - ('CD', 400),
                              - ('C',  100),
                              - ('XC', 90),
                              - ('L',  50),
                              - ('XL', 40),
                              - ('X',  10),
                              - ('IX', 9),
                              - ('V',  5),
                              - ('IV', 4),
                              - ('I',  1))
                              -
                              -def to_roman(n):
                              -    """convert integer to Roman numeral"""
                              -    if not (0 < n < 4000):
                              -        raise OutOfRangeError, "number out of range (must be 1..3999)"
                              -    if int(n) <> n:
                              -        raise NotIntegerError, "non-integers can not be converted"
                              -
                              -    result = ""
                              -    for numeral, integer in romanNumeralMap:
                              -        while n >= integer:
                              -            result += numeral
                              -            n -= integer
                              -    return result
                              -
                              -#Define pattern to detect valid Roman numerals
                              -romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' 
                              -
                              -def from_roman(s):
                              -    """convert Roman numeral to integer"""
                              -    if not re.search(romanNumeralPattern, s):
                              -        raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
                              -
                              -    result = 0
                              -    index = 0
                              -    for numeral, integer in romanNumeralMap:
                              -        while s[index:index+len(numeral)] == numeral:
                              -            result += integer
                              -            index += len(numeral)
                              -    return result
                              -
                              -
                                -
                              1. This is just a continuation of the pattern you discussed in Section 7.3, “Case Study: Roman Numerals”. The tens place is either XC (90), XL (40), or an optional L followed by 0 to 3 optional X characters. The ones place is either IX (9), IV (4), or an optional V followed by 0 to 3 optional I characters. -
                              2. Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If -re.search returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid. -
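If you want to poke at the pattern before running the full test suite, here is a minimal sketch in Python 3 syntax. The pattern string is copied from roman5.py; the helper name `is_valid_roman` and the sample inputs are made up for illustration and are not part of roman5.py.

```python
import re

# Same pattern string as romanNumeralPattern in roman5.py.
ROMAN_NUMERAL_PATTERN = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

def is_valid_roman(s):
    """Hypothetical helper: True if s matches the pattern, False otherwise."""
    # re.search returns a match object on success and None on failure.
    return re.search(ROMAN_NUMERAL_PATTERN, s) is not None

# One good numeral, then one sample for each kind of bad input the tests name:
# lowercase, malformed antecedent, repeated pair, too many repeated numerals.
for sample in ('MCMLXXXIV', 'mcmlxxxiv', 'MCMC', 'CMCM', 'MMMM'):
    print(sample.ljust(10), is_valid_roman(sample))
```

Only the first sample should come back valid; the other four are rejected without any extra checking code.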

                                At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of -invalid Roman numerals. But don't take my word for it, look at the results: -

                                Example 14.13. Output of romantest5.py against roman5.py

                                
                                -from_roman should only accept uppercase input ... ok          
                                -to_roman should always return uppercase ... ok
                                -from_roman should fail with malformed antecedents ... ok      
                                -from_roman should fail with repeated pairs of numerals ... ok 
                                -from_roman should fail with too many repeated numerals ... ok
                                -from_roman should give known result with known input ... ok
                                -to_roman should give known result with known input ... ok
                                -from_roman(to_roman(n))==n for all n ... ok
                                -to_roman should fail with non-integer input ... ok
                                -to_roman should fail with negative input ... ok
                                -to_roman should fail with large input ... ok
                                -to_roman should fail with 0 input ... ok
                                -
                                -----------------------------------------------------------------------
                                -Ran 12 tests in 2.864s
                                -
                                -OK     
                                -
                                  -
                                1. One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression -romanNumeralPattern was expressed in uppercase characters, the re.search check will reject any input that isn't completely uppercase. So the uppercase input test passes. -
                                2. More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like MCMC. As you've seen, this does not match the regular expression, so from_roman() raises an InvalidRomanNumeralError exception, which is what the malformed antecedents test case is looking for, so the test passes. -
                                3. In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test - cases. -
                                4. - -
                                  Note: When all of your tests pass, stop coding. - - - - - -[functional programming stuff was here] - - - - -

                                  The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the build process for this book; I have unit tests for several of the example programs (not just the roman.py module featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this
                                  @@ -1762,621 +916,3 @@ if __name__ == "__main__":

                                  [7] Technically, the second argument to filter can be any sequence, including lists, tuples, and custom classes that act like lists by defining the __getitem__ special method. If possible, filter will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.

                                  [8] Again, I should point out that map can take a list, a tuple, or any object that acts like a sequence. See previous footnote about filter. - - - - - - - - - - -

                                  -

                                  Chapter 18. Performance Tuning

                                  -

                                  Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it too much. -

                                  18.1. Diving in

                                  -

                                  There are so many pitfalls involved in optimizing your code, it's hard to know where to start. -

                                  Let's start here: are you sure you need to do it at all? Is your code really so bad? Is it worth the time to tune it? Over the lifetime of your application, how much time is going -to be spent running that code, compared to the time spent waiting for a remote database server, or waiting for user input? -

                                  Second, are you sure you're done coding? Premature optimization is like spreading frosting on a half-baked cake. You spend hours or days (or more) optimizing your -code for performance, only to discover it doesn't do what you need it to do. That's time down the drain. -

                                  This is not to say that code optimization is worthless, but you need to look at the whole system and decide whether it's the -best use of your time. Every minute you spend optimizing code is a minute you're not spending adding new features, or writing -documentation, or playing with your kids, or writing unit tests. -

                                  Oh yes, unit tests. It should go without saying that you need a complete set of unit tests before you begin performance tuning. -The last thing you need is to introduce new bugs while fiddling with your algorithms. -

                                  With these caveats in place, let's look at some techniques for optimizing Python code. The code in question is an implementation of the Soundex algorithm. Soundex was a method used in the early 20th century -for categorizing surnames in the United States census. It grouped similar-sounding names together, so even if a name was -misspelled, researchers had a chance of finding it. Soundex is still used today for much the same reason, although of course -we use computerized database servers now. Most database servers include a Soundex function. -

                                  There are several subtle variations of the Soundex algorithm. This is the one used in this chapter: -

                                  -
                                    -
                                  1. Keep the first letter of the name as-is. -
                                  2. Convert the remaining letters to digits, according to a specific table: -
                                    -
                                      -
                                    • B, F, P, and V become 1. -
                                    • C, G, J, K, Q, S, X, and Z become 2. -
                                    • D and T become 3. -
                                    • L becomes 4. -
                                    • M and N become 5. -
                                    • R becomes 6. -
                                    • All other letters become 9. -
                                    - -
                                  3. Remove consecutive duplicates. -
                                  4. Remove all 9s altogether. -
                                  5. If the result is shorter than four characters (the first letter plus three digits), pad the result with trailing zeros. -
                                  6. If the result is longer than four characters, discard everything after the fourth character. -
                                  -

                                  For example, my name, Pilgrim, becomes P942695. That has no consecutive duplicates, so nothing to do there. Then you remove the 9s, leaving P4265. That's -too long, so you discard the excess character, leaving P426. -

                                  Another example: Woo becomes W99, which becomes W9, which becomes W, which gets padded with zeros to become W000. -
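The two worked examples above can be traced step by step. This is a rough sketch in Python 3 syntax, not the chapter's soundex1a.py; the names `DIGIT_GROUPS`, `CHAR_TO_DIGIT`, and `soundex_stages` are hypothetical, and the helper assumes the input is a non-empty string of ASCII letters.

```python
# Hypothetical lookup tables built from the conversion table in step 2.
DIGIT_GROUPS = {'1': 'BFPV', '2': 'CGJKQSXZ', '3': 'DT',
                '4': 'L', '5': 'MN', '6': 'R'}
CHAR_TO_DIGIT = {letter: digit
                 for digit, letters in DIGIT_GROUPS.items()
                 for letter in letters}

def soundex_stages(name):
    """Return the intermediate strings after steps 2, 3, 4, and 6.

    Assumes name is a non-empty string of ASCII letters (no validation here).
    """
    # Steps 1-2: keep the first letter, convert the rest; "9" for all others.
    converted = name[0].upper() + ''.join(
        CHAR_TO_DIGIT.get(c, '9') for c in name[1:].upper())
    # Step 3: remove consecutive duplicates.
    deduped = converted[0]
    for c in converted[1:]:
        if c != deduped[-1]:
            deduped += c
    # Step 4: remove all 9s.
    stripped = deduped.replace('9', '')
    # Steps 5-6: pad with trailing zeros, then keep four characters.
    final = stripped.ljust(4, '0')[:4]
    return converted, deduped, stripped, final

print(soundex_stages('Pilgrim'))
print(soundex_stages('Woo'))
```

Returning every stage, rather than just the final code, makes it easy to compare each step against the hand-worked values.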

                                  Here's a first attempt at a Soundex function: -

                                  Example 18.1. soundex/stage1/soundex1a.py

                                  -


                                  
                                  -import string, re
                                  -
                                  -charToSoundex = {"A": "9",
                                  -                 "B": "1",
                                  -                 "C": "2",
                                  -                 "D": "3",
                                  -                 "E": "9",
                                  -                 "F": "1",
                                  -                 "G": "2",
                                  -                 "H": "9",
                                  -                 "I": "9",
                                  -                 "J": "2",
                                  -                 "K": "2",
                                  -                 "L": "4",
                                  -                 "M": "5",
                                  -                 "N": "5",
                                  -                 "O": "9",
                                  -                 "P": "1",
                                  -                 "Q": "2",
                                  -                 "R": "6",
                                  -                 "S": "2",
                                  -                 "T": "3",
                                  -                 "U": "9",
                                  -                 "V": "1",
                                  -                 "W": "9",
                                  -                 "X": "2",
                                  -                 "Y": "9",
                                  -                 "Z": "2"}
                                  -
                                  -def soundex(source):
                                  -    "convert string to Soundex equivalent"
                                  -
                                  -    # Soundex requirements:
                                  -    # source string must be at least 1 character
                                  -    # and must consist entirely of letters
                                  -    allChars = string.uppercase + string.lowercase
                                  -    if not re.search('^[%s]+$' % allChars, source):
                                  -        return "0000"
                                  -
                                  -    # Soundex algorithm:
                                  -    # 1. make first character uppercase
                                  -    source = source[0].upper() + source[1:]
                                  -    
                                  -    # 2. translate all other characters to Soundex digits
                                  -    digits = source[0]
                                  -    for s in source[1:]:
                                  -        s = s.upper()
                                  -        digits += charToSoundex[s]
                                  -
                                  -    # 3. remove consecutive duplicates
                                  -    digits2 = digits[0]
                                  -    for d in digits[1:]:
                                  -        if digits2[-1] != d:
                                  -            digits2 += d
                                  -        
                                  -    # 4. remove all "9"s
                                  -    digits3 = re.sub('9', '', digits2)
                                  -    
                                  -    # 5. pad end with "0"s to 4 characters
                                  -    while len(digits3) < 4:
                                  -        digits3 += "0"
                                  -        
                                  -    # 6. return first 4 characters
                                  -    return digits3[:4]
                                  -
                                  -if __name__ == '__main__':
                                  -    from timeit import Timer
                                  -    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                  -    for name in names:
                                  -        statement = "soundex('%s')" % name
                                  -        t = Timer(statement, "from __main__ import soundex")
                                  -        print name.ljust(15), soundex(name), min(t.repeat())
                                  -
                                  -

                                  Further Reading on Soundex

                                  - -

                                  18.2. Using the timeit Module

                                  -

                                  The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function. -

                                  Timing short pieces of code is incredibly complex. How much processor time is your computer devoting to running this code? -Are there things running in the background? Are you sure? Every modern computer has background processes running, some all -the time, some intermittently. Cron jobs fire off at consistent intervals; background services occasionally “wake up” to do useful things like check for new mail, connect to instant messaging servers, check for application updates, scan for -viruses, check whether a disk has been inserted into your CD drive in the last 100 nanoseconds, and so on. Before you start -your timing tests, turn everything off and disconnect from the network. Then turn off all the things you forgot to turn off -the first time, then turn off the service that's incessantly checking whether the network has come back yet, then ... -

                                  And then there's the matter of the variations introduced by the timing framework itself. Does the Python interpreter cache method name lookups? Does it cache code block compilations? Regular expressions? Will your code have -side effects if run more than once? Don't forget that you're dealing with small fractions of a second, so small mistakes -in your timing framework will irreparably skew your results. -

                                  The Python community has a saying: “Python comes with batteries included.” Don't write your own timing framework. Python 2.3 comes with a perfectly good one called timeit. -

                                  Example 18.2. Introducing timeit

                                  -


                                  ->>> import timeit
                                  ->>> t = timeit.Timer("soundex.soundex('Pilgrim')",
                                  -...    "import soundex")   
                                  ->>> t.timeit()              
                                  -8.21683733547
                                  ->>> t.repeat(3, 2000000)    
                                  -[16.48319309109, 16.46128984923, 16.44203948912]
                                  -
                                  -
                                    -
                                  1. The timeit module defines one class, Timer, which takes two arguments. Both arguments are strings. The first argument is the statement you wish to time; in this case, - you are timing a call to the Soundex function within the soundex module with an argument of 'Pilgrim'. The second argument to the Timer class is the import statement that sets up the environment for the statement. Internally, timeit sets up an isolated virtual environment, manually executes the setup statement (importing the soundex module), then manually compiles and executes the timed statement (calling the Soundex function). -
                                  2. Once you have the Timer object, the easiest thing to do is call timeit(), which calls your function 1 million times and returns the number of seconds it took to do it. -
                                  3. The other major method of the Timer object is repeat(), which takes two optional arguments. The first argument is the number of times to repeat the entire test, and the second - argument is the number of times to call the timed statement within each test. They default - to 3 and 1000000 respectively. The repeat() method returns a list of the times each test cycle took, in seconds. -
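The same calls can be tried without the soundex module on hand. This is a small sketch in Python 3 syntax; the timed statement and the `words` list are stand-ins chosen only for illustration, and the iteration counts are kept low so it finishes quickly. One difference to be aware of: in recent Python 3 releases (3.7 and later, as far as I know), repeat() defaults to 5 repetitions rather than 3, so passing the count explicitly avoids surprises.

```python
import timeit

# Stand-in statement and setup string; both arguments are strings,
# just like the soundex example.
t = timeit.Timer("'-'.join(words)", setup="words = ['a', 'b', 'c']")

# timeit() runs the statement `number` times (1 million if omitted)
# and returns the total elapsed time in seconds as a float.
single = t.timeit(number=10000)

# repeat() runs the whole test `repeat` times and returns a list of floats,
# one total per test cycle.
results = t.repeat(repeat=3, number=10000)

print(single)
print(results)
```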
                                    -

                                    You can use the timeit module on the command line to test an existing Python program, without modifying the code. See http://docs.python.org/lib/node396.html for documentation on the command-line flags. -

                                    Note that repeat() returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the -Python interpreter is getting (and those pesky background processes that you can't get rid of). Your first thought might be to -say “Let's take the average and call that The True Number.” -

                                    In fact, that's almost certainly wrong. The tests that took longer didn't take longer because of variations in your code -or in the Python interpreter; they took longer because of those pesky background processes, or other factors outside of the Python interpreter that you can't fully eliminate. If the different timing results differ by more than a few percent, you still -have too much variability to trust the results. Otherwise, take the minimum time and discard the rest. -

                                    Python has a handy min function that takes a list and returns the smallest value: -

                                    ->>> min(t.repeat(3, 1000000))
                                    -8.22203948912
                                    -
                                    -
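The "differ by more than a few percent" rule can be folded into the same step as taking the minimum. A minimal sketch in Python 3 syntax, assuming a 5% cutoff; both the helper name `best_time` and the `tolerance` value are my own choices for illustration, not something timeit provides.

```python
def best_time(times, tolerance=0.05):
    """Return the minimum of times, or raise ValueError if the runs are too noisy.

    tolerance is the largest acceptable gap between the fastest and slowest
    run, as a fraction of the fastest (0.05 means 5%; an assumed cutoff).
    """
    fastest, slowest = min(times), max(times)
    spread = (slowest - fastest) / fastest
    if spread > tolerance:
        raise ValueError('timings vary by %.1f%%; too noisy to trust'
                         % (spread * 100))
    return fastest

# The three values returned by t.repeat(3, 2000000) earlier in this section.
print(best_time([16.48319309109, 16.46128984923, 16.44203948912]))
```

Those three runs differ by well under one percent, so the helper simply returns the smallest one.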

                                    The timeit module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out the hotshot module.

                                    18.3. Optimizing Regular Expressions

                                    -

                                    The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to - do this? -

                                    If you answered “regular expressions”, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should -be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain. -Also for performance reasons. -

                                    This code fragment from soundex/stage1/soundex1a.py checks whether the function argument source is a word made entirely of letters, with at least one letter (not the empty string): -

                                    
                                    -    allChars = string.uppercase + string.lowercase
                                    -    if not re.search('^[%s]+$' % allChars, source):
                                    -        return "0000"
                                    -

                                    How does soundex1a.py perform? For convenience, the __main__ section of the script contains this code that calls the timeit module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for -each: -

                                    
                                    -if __name__ == '__main__':
                                    -    from timeit import Timer
                                    -    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                    -    for name in names:
                                    -        statement = "soundex('%s')" % name
                                    -        t = Timer(statement, "from __main__ import soundex")
                                    -        print name.ljust(15), soundex(name), min(t.repeat())
                                    -

                                    So how does soundex1a.py perform with this regular expression? -

                                    -C:\samples\soundex\stage1>python soundex1a.py
                                    -Woo             W000 19.3356647283
                                    -Pilgrim         P426 24.0772053431
                                    -Flingjingwaller F452 35.0463220884
                                    -

                                    As you might expect, the algorithm takes significantly longer when called with longer names. There will be a few things we -can do to narrow that gap (make the function take less relative time for longer input), but the nature of the algorithm dictates -that it will never run in constant time. -

                                    The other thing to keep in mind is that we are testing a representative sample of names. Woo is a kind of trivial case, in that it gets shortened down to a single letter and then padded with zeros. Pilgrim is a normal case, of average length and a mixture of significant and ignored letters. Flingjingwaller is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range -of different cases. -

                                    So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters -(A-Z in uppercase, and a-z in lowercase), we can use a shorthand regular expression syntax. Here is soundex/stage1/soundex1b.py: -

                                    
                                    -    if not re.search('^[A-Za-z]+$', source):
                                    -        return "0000"
                                    -

                                    timeit says soundex1b.py is slightly faster than soundex1a.py, but nothing to get terribly excited about: -

                                    -C:\samples\soundex\stage1>python soundex1b.py
                                    -Woo             W000 17.1361133887
                                    -Pilgrim         P426 21.8201693232
                                    -Flingjingwaller F452 32.7262294509
                                    -

                                    We saw in Section 15.3, “Refactoring” that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across -function calls, we can compile it once and use the compiled version. Here is soundex/stage1/soundex1c.py: -

                                    
                                    -isOnlyChars = re.compile('^[A-Za-z]+$').search
                                    -def soundex(source):
                                    -    if not isOnlyChars(source):
                                    -        return "0000"
                                    -

                                    Using a compiled regular expression in soundex1c.py is significantly faster: -

                                    -C:\samples\soundex\stage1>python soundex1c.py
                                    -Woo             W000 14.5348347346
                                    -Pilgrim         P426 19.2784703084
                                    -Flingjingwaller F452 30.0893873383
                                    -
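The difference between the two regular expression variants can be reproduced with a small Python 3 sketch. This is not the book's soundex1b.py or soundex1c.py; the function names `check_uncompiled` and `check_compiled` are made up for illustration, and the iteration count is kept low so the script finishes quickly. Absolute timings will differ from the figures quoted above.

```python
import re
from timeit import timeit

PATTERN = r'^[A-Za-z]+$'

# Compile once at import time and keep the bound search method around.
is_only_chars = re.compile(PATTERN).search

def check_uncompiled(source):
    # re.search() has to find the compiled pattern again on every call.
    return re.search(PATTERN, source) is not None

def check_compiled(source):
    # The pattern object already exists; this is a direct method call.
    return is_only_chars(source) is not None

if __name__ == '__main__':
    for name in ('Woo', 'Pilgrim', 'Flingjingwaller'):
        for fn in (check_uncompiled, check_compiled):
            seconds = timeit(lambda: fn(name), number=100_000)
            print(f'{name:<15} {fn.__name__:<17} {seconds:.4f}')
```

Both functions return the same answers; only the per-call overhead differs.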

                                    But is this the wrong path? The logic here is simple: the input source needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each -character, and do away with regular expressions altogether? -

                                    Here is soundex/stage1/soundex1d.py: -

                                    
                                    -    if not source:
                                    -        return "0000"
                                    -    for c in source:
                                    -        if not ('A' <= c <= 'Z') and not ('a' <= c <= 'z'):
                                    -            return "0000"
                                    -

                                    It turns out that this technique in soundex1d.py is not faster than using a compiled regular expression (although it is faster than using a non-compiled regular expression): -

                                    -C:\samples\soundex\stage1>python soundex1d.py
                                    -Woo             W000 15.4065058548
                                    -Pilgrim         P426 22.2753567842
                                    -Flingjingwaller F452 37.5845122774
                                    -

                                    Why isn't soundex1d.py faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this -loop is written in Python, and runs through the Python interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted. -Regular expressions are never the right answer... except when they are. -

                                    It turns out that Python offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book. -The method is called isalpha(), and it checks whether a string contains only letters. -

                                    This is soundex/stage1/soundex1e.py: -

                                    
                                    -    if (not source) or (not source.isalpha()):
                                    -        return "0000"
                                    -

                                    How much did we gain by using this specific method in soundex1e.py? Quite a bit. -

                                    -C:\samples\soundex\stage1>python soundex1e.py
                                    -Woo             W000 13.5069504644
                                    -Pilgrim         P426 18.2199394057
                                    -Flingjingwaller F452 28.9975225902
                                    -
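How the two tests are combined matters here. The sketch below (Python 3, with hypothetical helper names) compares three ways of writing the rejection test. Joining the tests with `and` rejects only the empty string, while `or` rejects anything that is empty or contains a non-letter. Because `''.isalpha()` is already `False`, the `isalpha()` test alone gives the same answers as the `or` version.

```python
def rejects_with_and(source):
    # True only when the string is empty AND not alphabetic,
    # which in practice means only the empty string is rejected.
    return (not source) and (not source.isalpha())

def rejects_with_or(source):
    # True when the string is empty OR contains a non-letter.
    return (not source) or (not source.isalpha())

def rejects_with_isalpha_only(source):
    # ''.isalpha() is False, so no separate emptiness test is needed.
    return not source.isalpha()

if __name__ == '__main__':
    for candidate in ('Pilgrim', '', 'Woo1', 'two words'):
        print(repr(candidate).ljust(12),
              rejects_with_and(candidate),
              rejects_with_or(candidate),
              rejects_with_isalpha_only(candidate))
```

One caveat if you port this to Python 3: `str.isalpha()` also accepts non-ASCII letters such as `'é'`, so it is not a drop-in replacement for the `[A-Za-z]` pattern.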

                                    Example 18.3. Best Result So Far: soundex/stage1/soundex1e.py

                                    
                                    -import string, re
                                    -
                                    -charToSoundex = {"A": "9",
                                    -                 "B": "1",
                                    -                 "C": "2",
                                    -                 "D": "3",
                                    -                 "E": "9",
                                    -                 "F": "1",
                                    -                 "G": "2",
                                    -                 "H": "9",
                                    -                 "I": "9",
                                    -                 "J": "2",
                                    -                 "K": "2",
                                    -                 "L": "4",
                                    -                 "M": "5",
                                    -                 "N": "5",
                                    -                 "O": "9",
                                    -                 "P": "1",
                                    -                 "Q": "2",
                                    -                 "R": "6",
                                    -                 "S": "2",
                                    -                 "T": "3",
                                    -                 "U": "9",
                                    -                 "V": "1",
                                    -                 "W": "9",
                                    -                 "X": "2",
                                    -                 "Y": "9",
                                    -                 "Z": "2"}
                                    -
                                    -def soundex(source):
                                    -    if (not source) or (not source.isalpha()):
                                    -        return "0000"
                                    -    source = source[0].upper() + source[1:]
                                    -    digits = source[0]
                                    -    for s in source[1:]:
                                    -        s = s.upper()
                                    -        digits += charToSoundex[s]
                                    -    digits2 = digits[0]
                                    -    for d in digits[1:]:
                                    -        if digits2[-1] != d:
                                    -            digits2 += d
                                    -    digits3 = re.sub('9', '', digits2)
                                    -    while len(digits3) < 4:
                                    -        digits3 += "0"
                                    -    return digits3[:4]
                                    -
                                    -if __name__ == '__main__':
                                    -    from timeit import Timer
                                    -    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                    -    for name in names:
                                    -        statement = "soundex('%s')" % name
                                    -        t = Timer(statement, "from __main__ import soundex")
                                    -        print name.ljust(15), soundex(name), min(t.repeat())
                                    -

                                    18.4. Optimizing Dictionary Lookups

                                    -

                                    The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to - do this? -

                                    The most obvious solution is to define a dictionary with individual characters as keys and their corresponding digits as values, -and do dictionary lookups on each character. This is what we have in soundex/stage1/soundex1c.py (the compiled regular expression version timed earlier): -

                                    
                                    -charToSoundex = {"A": "9",
                                    -                 "B": "1",
                                    -                 "C": "2",
                                    -                 "D": "3",
                                    -                 "E": "9",
                                    -                 "F": "1",
                                    -                 "G": "2",
                                    -                 "H": "9",
                                    -                 "I": "9",
                                    -                 "J": "2",
                                    -                 "K": "2",
                                    -                 "L": "4",
                                    -                 "M": "5",
                                    -                 "N": "5",
                                    -                 "O": "9",
                                    -                 "P": "1",
                                    -                 "Q": "2",
                                    -                 "R": "6",
                                    -                 "S": "2",
                                    -                 "T": "3",
                                    -                 "U": "9",
                                    -                 "V": "1",
                                    -                 "W": "9",
                                    -                 "X": "2",
                                    -                 "Y": "9",
                                    -                 "Z": "2"}
                                    -
                                    -def soundex(source):
                                    -    # ... input check omitted for brevity ...
                                    -    source = source[0].upper() + source[1:]
                                    -    digits = source[0]
                                    -    for s in source[1:]:
                                    -        s = s.upper()
                                    -        digits += charToSoundex[s]
                                    -

                                    You timed soundex1c.py already; this is how it performs: -

                                    -C:\samples\soundex\stage1>python soundex1c.py
                                    -Woo             W000 14.5341678901
                                    -Pilgrim         P426 19.2650071448
                                    -Flingjingwaller F452 30.1003563302
                                    -

                                    This code is straightforward, but is it the best solution? Calling upper() on each individual character seems inefficient; it would probably be better to call upper() once on the entire string. -

                                    Then there's the matter of incrementally building the digits string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one. -

                                    Python is good at lists, though. It can treat a string as a list of characters automatically. And lists are easy to combine into -strings again, using the string method join(). -

                                    Here is soundex/stage2/soundex2a.py, which converts letters to digits by using map and lambda: -

                                    
                                    -def soundex(source):
                                    -    # ...
                                    -    source = source.upper()
                                    -    digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
                                    -

                                    Surprisingly, soundex2a.py is not faster: -

                                    -C:\samples\soundex\stage2>python soundex2a.py
                                    -Woo             W000 15.0097526362
                                    -Pilgrim         P426 19.254806407
                                    -Flingjingwaller F452 29.3790847719
                                    -

                                    The overhead of the anonymous lambda function kills any performance you gain by dealing with the string as a list of characters. -

                                    soundex/stage2/soundex2b.py uses a list comprehension instead of map and lambda: -

                                    
                                    -    source = source.upper()
                                    -    digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
                                    -

                                    Using a list comprehension in soundex2b.py is faster than using map and lambda in soundex2a.py, but still not faster than the original code (incrementally building a string in soundex1c.py): -

                                    -C:\samples\soundex\stage2>python soundex2b.py
                                    -Woo             W000 13.4221324219
                                    -Pilgrim         P426 16.4901234654
                                    -Flingjingwaller F452 25.8186157738
                                    -
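To see that the three letter-to-digit strategies discussed so far differ only in speed, here is a Python 3 sketch that puts them side by side. The function names are invented for this illustration, and the dictionary is built from the 26-digit string rather than typed out longhand. It produces the same mapping as the `charToSoundex` table.

```python
import string

# Same A..Z -> digit mapping as the longhand dictionary.
char_to_soundex = dict(zip(string.ascii_uppercase,
                           "91239129922455912623919292"))

def digits_incremental(source):
    # Builds the result one character at a time with +=.
    source = source.upper()
    digits = source[0]
    for c in source[1:]:
        digits += char_to_soundex[c]
    return digits

def digits_map_lambda(source):
    # map() plus an anonymous function, joined into a string.
    source = source.upper()
    return source[0] + "".join(map(lambda c: char_to_soundex[c], source[1:]))

def digits_comprehension(source):
    # List comprehension, joined into a string.
    source = source.upper()
    return source[0] + "".join([char_to_soundex[c] for c in source[1:]])

if __name__ == '__main__':
    for name in ('Woo', 'Pilgrim', 'Flingjingwaller'):
        print(name.ljust(15), digits_incremental(name),
              digits_map_lambda(name), digits_comprehension(name))
```

Since all three return identical strings, any of them can be swapped in and timed without changing the rest of the function.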

                                    It's time for a radically different approach. Dictionary lookups are a general purpose tool. Dictionary keys can be any -length string (or many other data types), but in this case we are only dealing with single-character keys and single-character values. It turns out that Python has a specialized function for handling exactly this situation: the string.maketrans function. -

                                    This is soundex/stage2/soundex2c.py: -

                                    
                                    -allChar = string.uppercase + string.lowercase
                                    -charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
                                    -def soundex(source):
                                    -    # ...
                                    -    digits = source[0].upper() + source[1:].translate(charToSoundex)
                                    -

                                    What the heck is going on here? string.maketrans creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument -is the string ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz, and the second argument is the string 9123912992245591262391929291239129922455912623919292. See the pattern? It's the same conversion pattern we were setting up longhand with a dictionary. A maps to 9, B maps -to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized data structure that you can access using the -string method translate, which translates each character into the corresponding digit, according to the matrix defined by string.maketrans. -
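For readers on Python 3, the same idea can be sketched as follows. The names `string.uppercase`, `string.lowercase`, and `string.maketrans` used above are Python 2 spellings. Python 3 offers `string.ascii_uppercase`, `string.ascii_lowercase`, and the static method `str.maketrans` instead. The helper name `to_digits` is hypothetical.

```python
import string

all_chars = string.ascii_uppercase + string.ascii_lowercase

# One digit per letter: the 26-digit pattern is repeated so that
# lowercase letters map to the same digits as uppercase ones.
char_to_soundex = str.maketrans(all_chars, "91239129922455912623919292" * 2)

def to_digits(source):
    # Keep the first letter (uppercased) and translate the rest in one call.
    return source[0].upper() + source[1:].translate(char_to_soundex)

if __name__ == '__main__':
    for name in ('Woo', 'Pilgrim', 'Flingjingwaller'):
        print(name.ljust(15), to_digits(name))
```

In Python 3 the table returned by `str.maketrans` is an ordinary dictionary keyed by code point, but `translate()` still does the per-character work in C.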

                                    timeit shows that soundex2c.py is significantly faster than defining a dictionary and looping through the input and building the output incrementally: -

                                    -C:\samples\soundex\stage2>python soundex2c.py
                                    -Woo             W000 11.437645008
                                    -Pilgrim         P426 13.2825062962
                                    -Flingjingwaller F452 18.5570110168
                                    -

                                    You're not going to get much better than that. Python has a specialized function that does exactly what you want to do; use it and move on. -

                                    Example 18.4. Best Result So Far: soundex/stage2/soundex2c.py

                                    
                                    -import string, re
                                    -
                                    -allChar = string.uppercase + string.lowercase
                                    -charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
                                    -isOnlyChars = re.compile('^[A-Za-z]+$').search
                                    -
                                    -def soundex(source):
                                    -    if not isOnlyChars(source):
                                    -        return "0000"
                                    -    digits = source[0].upper() + source[1:].translate(charToSoundex)
                                    -    digits2 = digits[0]
                                    -    for d in digits[1:]:
                                    -        if digits2[-1] != d:
                                    -            digits2 += d
                                    -    digits3 = re.sub('9', '', digits2)
                                    -    while len(digits3) < 4:
                                    -        digits3 += "0"
                                    -    return digits3[:4]
                                    -
                                    -if __name__ == '__main__':
                                    -    from timeit import Timer
                                    -    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                    -    for name in names:
                                    -        statement = "soundex('%s')" % name
                                    -        t = Timer(statement, "from __main__ import soundex")
                                    -        print name.ljust(15), soundex(name), min(t.repeat())
                                    -

                                    18.5. Optimizing List Operations

                                    -

                                    The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this? -

                                    Here's the code we have so far, in soundex/stage2/soundex2c.py: -

                                    
                                    -    digits2 = digits[0]
                                    -    for d in digits[1:]:
                                    -        if digits2[-1] != d:
                                    -            digits2 += d
                                    -

                                    Here are the performance results for soundex2c.py: -

                                    -C:\samples\soundex\stage2>python soundex2c.py
                                    -Woo             W000 12.6070768771
                                    -Pilgrim         P426 14.4033353401
                                    -Flingjingwaller F452 19.7774882003
                                    -

                                    The first thing to consider is whether it's efficient to check digits2[-1] each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate -variable, and checking that instead? -

                                    To answer this question, here is soundex/stage3/soundex3a.py: -

                                    
                                    -    digits2 = ''
                                    -    last_digit = ''
                                    -    for d in digits:
                                    -        if d != last_digit:
                                    -            digits2 += d
                                    -            last_digit = d
                                    -

                                    soundex3a.py does not run any faster than soundex2c.py, and may even be slightly slower (although it's not enough of a difference to say for sure): -

                                    -C:\samples\soundex\stage3>python soundex3a.py
                                    -Woo             W000 11.5346048171
                                    -Pilgrim         P426 13.3950636184
                                    -Flingjingwaller F452 18.6108927252
                                    -

                                    Why isn't soundex3a.py faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing digits2[-1] is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have two variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating -the list lookup. -

                                    Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible -to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character -in the list, and that's not easy to do with a straightforward list comprehension. -

                                    However, it is possible to create a list of index numbers using the built-in range() function, and use those index numbers to progressively search through the list and pull out each character that is different -from the previous character. That will give you a list of characters, and you can use the string method join() to reconstruct a string from that. -

                                    Here is soundex/stage3/soundex3b.py: -

                                    
                                    -    digits2 = "".join([digits[i] for i in range(len(digits))
                                    -     if i == 0 or digits[i-1] != digits[i]])
                                    -

                                    Is this faster? In a word, no. -

                                    -C:\samples\soundex\stage3>python soundex3b.py
                                    -Woo             W000 14.2245271396
                                    -Pilgrim         P426 17.8337165757
                                    -Flingjingwaller F452 25.9954005327
                                    -

                                    It's possible that the techniques so far have been too “string-centric”. Python can convert a string into a list of characters with a single command: list('abc') returns ['a', 'b', 'c']. Furthermore, lists can be modified in place very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around -within a single list? -

                                    Here is soundex/stage3/soundex3c.py, which modifies a list in place to remove consecutive duplicate elements: -

                                    
                                    -    digits = list(source[0].upper() + source[1:].translate(charToSoundex))
                                    -    i=0
                                    -    for item in digits:
                                    -        if item==digits[i]: continue
                                    -        i+=1
                                    -        digits[i]=item
                                    -    del digits[i+1:]
                                    -    digits2 = "".join(digits)
                                    -

                                    Is this faster than soundex3a.py or soundex3b.py? No, in fact it's the slowest method yet: -

                                    -C:\samples\soundex\stage3>python soundex3c.py
                                    -Woo             W000 14.1662554878
                                    -Pilgrim         P426 16.0397885765
                                    -Flingjingwaller F452 22.1789341942
                                    -
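Before comparing speeds, it is worth confirming that the duplicate-removal variants agree with each other. The Python 3 sketch below wraps the straightforward loop, the index-based list comprehension, and the in-place list version in functions with made-up names and runs them on the same digit strings.

```python
def dedupe_loop(digits):
    # Straightforward version: compare against the last character kept.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    return digits2

def dedupe_comprehension(digits):
    # Keep position i only if it differs from position i - 1.
    return "".join([digits[i] for i in range(len(digits))
                    if i == 0 or digits[i - 1] != digits[i]])

def dedupe_in_place(digits):
    # Compact a list in place; i marks the last slot that has been kept.
    items = list(digits)
    i = 0
    for item in items:
        if item == items[i]:
            continue
        i += 1
        items[i] = item
    del items[i + 1:]
    return "".join(items)

if __name__ == '__main__':
    for digits in ('W99', 'P942695', 'F49522952994496'):
        print(digits.ljust(16), dedupe_loop(digits),
              dedupe_comprehension(digits), dedupe_in_place(digits))
```

The in-place version is safe even though it writes to the list it is iterating over, because it only ever writes to positions the loop has already passed.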

                                    We haven't made any progress here at all, except to try and rule out several “clever” techniques. The fastest code we've seen so far was the original, most straightforward method (soundex2c.py). Sometimes it doesn't pay to be clever. -

                                    Example 18.5. Best Result So Far: soundex/stage2/soundex2c.py

                                    
                                    -import string, re
                                    -
                                    -allChar = string.uppercase + string.lowercase
                                    -charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
                                    -isOnlyChars = re.compile('^[A-Za-z]+$').search
                                    -
                                    -def soundex(source):
                                    -    if not isOnlyChars(source):
                                    -        return "0000"
                                    -    digits = source[0].upper() + source[1:].translate(charToSoundex)
                                    -    digits2 = digits[0]
                                    -    for d in digits[1:]:
                                    -        if digits2[-1] != d:
                                    -            digits2 += d
                                    -    digits3 = re.sub('9', '', digits2)
                                    -    while len(digits3) < 4:
                                    -        digits3 += "0"
                                    -    return digits3[:4]
                                    -
                                    -if __name__ == '__main__':
                                    -    from timeit import Timer
                                    -    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
                                    -    for name in names:
                                    -        statement = "soundex('%s')" % name
                                    -        t = Timer(statement, "from __main__ import soundex")
                                    -        print name.ljust(15), soundex(name), min(t.repeat())
                                    -

                                    18.6. Optimizing String Manipulation

                                    -

                                    The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best - way to do this? -

                                    This is what we have so far, taken from soundex/stage2/soundex2c.py: -

                                    
                                    -    digits3 = re.sub('9', '', digits2)
                                    -    while len(digits3) < 4:
                                    -        digits3 += "0"
                                    -    return digits3[:4]
                                    -

                                    These are the results for soundex2c.py: -

                                    -C:\samples\soundex\stage2>python soundex2c.py
                                    -Woo             W000 12.6070768771
                                    -Pilgrim         P426 14.4033353401
                                    -Flingjingwaller F452 19.7774882003
                                    -

                                    The first thing to consider is replacing that regular expression with a loop. This code is from soundex/stage4/soundex4a.py: -

                                    
                                    -    digits3 = ''
                                    -    for d in digits2:
                                    -        if d != '9':
                                    -            digits3 += d
                                    -

                                    Is soundex4a.py faster? Yes it is: -

                                    -C:\samples\soundex\stage4>python soundex4a.py
                                    -Woo             W000 6.62865531792
                                    -Pilgrim         P426 9.02247576158
                                    -Flingjingwaller F452 13.6328416042
                                    -

                                    But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's soundex/stage4/soundex4b.py: -

                                    
                                    -    digits3 = digits2.replace('9', '')
                                    -

                                    Is soundex4b.py faster? That's an interesting question. It depends on the input: -

                                    -C:\samples\soundex\stage4>python soundex4b.py
                                    -Woo             W000 6.75477414029
                                    -Pilgrim         P426 7.56652144337
                                    -Flingjingwaller F452 10.8727729362
                                    -

                                    The string method in soundex4b.py is faster than the loop for most names, but it's actually slightly slower than soundex4a.py in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case -faster can sometimes make other cases slower. In this case, the majority of cases will benefit from the change, so let's -leave it at that, but the principle is an important one to remember. -

                                    Last but not least, let's examine the final two steps of the algorithm: padding short results with zeros, and truncating long -results to four characters. The code you see in soundex4b.py does just that, but it's horribly inefficient. Take a look at soundex/stage4/soundex4c.py to see why: -

                                    
                                    -    digits3 += '000'
                                    -    return digits3[:4]
                                    -

                                    Why do we need a while loop to pad out the result? We know in advance that we're going to truncate the result to four characters, and we know that -we already have at least one character (the initial letter, which is passed unchanged from the original source variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the -exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution. -
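The reasoning is easy to check. In this Python 3 sketch (function names are illustrative only), both versions are applied to inputs of every relevant length, from a lone initial letter up to a result that needs truncating.

```python
def pad_with_loop(digits3):
    # Append zeros one at a time until there are four characters.
    while len(digits3) < 4:
        digits3 += "0"
    return digits3[:4]

def pad_with_constant(digits3):
    # At least one character is always present, so three zeros are
    # enough to reach four; the slice discards whatever is left over.
    return (digits3 + "000")[:4]

if __name__ == '__main__':
    for digits3 in ('W', 'W1', 'P42', 'P426', 'F4525246'):
        print(digits3.ljust(9), pad_with_loop(digits3),
              pad_with_constant(digits3))
```

The shortcut relies on the input never being empty. That holds here only because invalid input has already been rejected earlier in the function.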

                                    How much speed do we gain in soundex4c.py by dropping the while loop? It's significant: -

                                    -C:\samples\soundex\stage4>python soundex4c.py
                                    -Woo             W000 4.89129791636
                                    -Pilgrim         P426 7.30642134685
                                    -Flingjingwaller F452 10.689832367
                                    -

                                    Finally, there is still one more thing you can do to these three lines of code to make them faster: you can combine them into -one line. Take a look at soundex/stage4/soundex4d.py: -

                                    
                                    -    return (digits2.replace('9', '') + '000')[:4]
                                    -

                                    Putting all this code on one line in soundex4d.py is barely faster than soundex4c.py: -

                                    -C:\samples\soundex\stage4>python soundex4d.py
                                    -Woo             W000 4.93624105857
                                    -Pilgrim         P426 7.19747593619
                                    -Flingjingwaller F452 10.5490700634
                                    -

                                    It is also significantly less readable, and for not much performance gain. Is that worth it? I hope you have good comments. -Performance isn't everything. Your optimization efforts must always be balanced against threats to your program's readability -and maintainability. -
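Putting the pieces together, a Python 3 port of the function might look like the sketch below. It is an assumption-laden reconstruction, not one of the book's sample files: it uses `str.maketrans` in place of `string.maketrans`, substitutes `isascii()` plus `isalpha()` for the input check (`str.isascii()` needs Python 3.7 or later), and runs far fewer timing iterations than the figures quoted in this chapter.

```python
import string
from timeit import Timer

ALL_CHARS = string.ascii_uppercase + string.ascii_lowercase
CHAR_TO_SOUNDEX = str.maketrans(ALL_CHARS, "91239129922455912623919292" * 2)

def soundex(source):
    # Assumption: accept ASCII letters only, like the [A-Za-z] pattern.
    if not (source.isascii() and source.isalpha()):
        return "0000"
    # Step 2: translate every letter after the first into its digit.
    digits = source[0].upper() + source[1:].translate(CHAR_TO_SOUNDEX)
    # Step 3: drop consecutive duplicates with the straightforward loop.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # Step 4: remove the placeholder 9s, pad with zeros, truncate to four.
    return (digits2.replace('9', '') + '000')[:4]

if __name__ == '__main__':
    for name in ('Woo', 'Pilgrim', 'Flingjingwaller'):
        t = Timer(lambda: soundex(name))
        best = min(t.repeat(repeat=3, number=20_000))
        print(name.ljust(15), soundex(name), best)
```

Whether the one-line return statement is worth keeping is the same readability trade-off as before; splitting it into three lines changes nothing about the result.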

                                    18.7. Summary

                                    -

                                    This chapter has illustrated several important aspects of performance tuning in Python, and performance tuning in general. -

                                    -
                                      -
                                    • If you need to choose between regular expressions and writing a loop, choose regular expressions. The regular expression - engine is compiled in C and runs natively on your computer; your loop is written in Python and runs through the Python interpreter. - -
                                    • If you need to choose between regular expressions and string methods, choose string methods. Both are compiled in C, so choose - the simpler one. - -
                                    • General-purpose dictionary lookups are fast, but specialty functions such as string.maketrans and string methods such as isalpha() are faster. If Python has a custom-tailored function for you, use it. - -
                                    • Don't be too clever. Sometimes the most obvious algorithm is also the fastest. -
                                    • Don't sweat it too much. Performance isn't everything. -
                                    -
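If you would rather measure than take these rules on faith, a small `timeit` harness is enough. The sketch below is not code from the chapter: the helper names, the test word, and the repetition count are all assumptions chosen for illustration. It checks whether a string is purely alphabetic in three ways (a regular expression, a string method, and a hand-written loop) and times each one. No timings are printed here because they vary by machine and Python version.

```python
import re
import timeit

# Compiled once, so the timing measures matching rather than compilation.
ASCII_LETTERS = re.compile(r'[A-Za-z]+')


def is_alpha_regex(text):
    """Regular expression: the whole string must be ASCII letters."""
    return ASCII_LETTERS.fullmatch(text) is not None


def is_alpha_method(text):
    """String method: the simplest of the three."""
    return text.isalpha()


def is_alpha_loop(text):
    """Hand-written loop, executed character by character in Python."""
    if not text:
        return False
    for ch in text:
        if not ('a' <= ch <= 'z' or 'A' <= ch <= 'Z'):
            return False
    return True


def compare(word='Flingjingwaller', number=10000):
    """Return a dict mapping each function name to its total time in seconds."""
    candidates = (is_alpha_regex, is_alpha_method, is_alpha_loop)
    return {
        func.__name__: timeit.timeit(lambda: func(word), number=number)
        for func in candidates
    }


if __name__ == '__main__':
    for name, seconds in compare().items():
        print(name, seconds)
```

One caveat: `isalpha()` also accepts non-ASCII letters, so the three functions only agree on ASCII input. Run the comparison on data that looks like your real data before you draw conclusions from it.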

                                    I can't emphasize that last point strongly enough. Over the course of this chapter, you made this function three times faster -and saved 20 seconds over 1 million function calls. Great. Now think: over the course of those million function calls, how -many seconds will your surrounding application wait for a database connection? Or wait for disk I/O? Or wait for user input? -Don't spend too much time over-optimizing one algorithm, or you'll ignore obvious improvements somewhere else. Develop an -instinct for the sort of code that Python runs well, correct obvious blunders if you find them, and leave the rest alone. - - diff --git a/diveintopython3.org b/diveintopython3.org index 5f7e69d..3332e63 100755 --- a/diveintopython3.org +++ b/diveintopython3.org @@ -2,10 +2,10 @@ * Your First Python Program ** TODO mention why from module import * is only allowed at module level * Native Datatypes -** TODO section (chapter?) on comprehensions -*** TODO list comprehensions -*** TODO set comprehensions -*** TODO dictionary comprehensions +* TODO Comprehensions +** List comprehensions +** Set comprehensions +** Dictionary comprehensions * Strings * Regular Expressions * Closures & Generators @@ -13,9 +13,7 @@ * DONE 2nd draft Advanced Iterators SCHEDULED: <2009-07-15 Wed> CLOSED: [2009-07-15 Wed 20:57] * TODO 2nd draft Unit Testing -* TODO 1st draft Advanced Unit Testing * TODO 2nd draft Refactoring -* TODO 1st draft Advanced Classes * DONE 1st draft Files SCHEDULED: <2009-07-16 Thu> CLOSED: [2009-07-19 Sun 15:26] ** Reading from text files