From de5be3ca0a2d346993379e1f48df3d5c811d42ea Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Thu, 5 Feb 2009 00:55:24 -0500
Subject: [PATCH] finished "your first python program", wrote synchronized highlighting script for callouts within a [pre], moved scripts to common .js file

---
 case-study-porting-chardet-to-python-3.html |    3 +-
 dip2                                        | 2966 +++++++++----------
 dip3.css                                    |   14 +-
 dip3.js                                     |   75 +
 index.html                                  |   11 +-
 porting-code-to-python-3-with-2to3.html     |   42 +-
 your-first-python-program.html              |  138 +-
 7 files changed, 1624 insertions(+), 1625 deletions(-)
 create mode 100644 dip3.js

diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 965a80c..04a5530 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -8,10 +8,11 @@
+

skip to main content -

+

Case study: porting chardet to Python 3

Words, words. They’re all we have to go on.
Rosencrantz and Guildenstern are Dead
diff --git a/dip2 b/dip2
index f18bbf2..a16be0c 100644
--- a/dip2
+++ b/dip2
@@ -316,7 +316,7 @@ several months behind in updating their ActivePython installer when new version

If you are using Windows 95, Windows 98, or Windows ME, you will also need to download and install Windows Installer 2.0 before installing ActivePython.

  • -

    Double-click the installer, ActivePython-2.2.2-224-win32-ix86.msi. +

    Double-click the installer, ActivePython-2.2.2-224-win32-ix86.msi.

  • Step through the installer program.
@@ -341,13 +341,13 @@ see 'Help/About PythonWin' for further copyright information.

    Download the latest Python Windows installer by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then downloading the .exe installer.

  • -

    Double-click the installer, Python-2.xxx.yyy.exe. The name will depend on the version of Python available when you read this. +

    Double-click the installer, Python-2.xxx.yyy.exe. The name will depend on the version of Python available when you read this.

  • Step through the installer program.

  • -

    If disk space is tight, you can deselect the HTMLHelp file, the utility scripts (Tools/), and/or the test suite (Lib/test/). +

    If disk space is tight, you can deselect the HTMLHelp file, the utility scripts (Tools/), and/or the test suite (Lib/test/).

  • If you do not have administrative rights on your machine, you can select Advanced Options, then choose Non-Admin Install. This just affects where Registry entries and Start menu shortcuts are created.
@@ -380,13 +380,13 @@ interactive shell.

    To use the preinstalled version of Python, follow these steps:

    1. -

      Open the /Applications folder. +

      Open the /Applications folder.

    2. -

      Open the Utilities folder. +

      Open the Utilities folder.

    3. -

      Double-click Terminal to open a terminal window and get to a command line. +

      Double-click Terminal to open a terminal window and get to a command line.

    4. Type python at the command prompt.
@@ -406,13 +406,13 @@ Type "help", "copyright", "credits", or "license" for more information.

      Follow these steps to download and install the latest version of Python:

      1. -

        Download the MacPython-OSX disk image from http://homepages.cwi.nl/~jack/macpython/download.html. +

        Download the MacPython-OSX disk image from http://homepages.cwi.nl/~jack/macpython/download.html.

      2. -

        If your browser has not already done so, double-click MacPython-OSX-2.3-1.dmg to mount the disk image on your desktop. +

        If your browser has not already done so, double-click MacPython-OSX-2.3-1.dmg to mount the disk image on your desktop.

      3. -

        Double-click the installer, MacPython-OSX.pkg. +

        Double-click the installer, MacPython-OSX.pkg.

      4. The installer will prompt you for your administrative username and password.
@@ -421,13 +421,13 @@ Type "help", "copyright", "credits", or "license" for more information.

        Step through the installer program.

      5. -

        After installation is complete, close the installer and open the /Applications folder. +

        After installation is complete, close the installer and open the /Applications folder.

      6. -

        Open the MacPython-2.3 folder +

        Open the MacPython-2.3 folder

      7. -

        Double-click PythonIDE to launch Python. +

        Double-click PythonIDE to launch Python.

      The MacPython IDE should display a splash screen, then take you to the interactive shell. If the interactive shell does not appear, select
@@ -458,25 +458,25 @@ Type "help", "copyright", "credits", or "license" for more information.

      Follow these steps to install Python on Mac OS 9:

      1. -

        Download the MacPython23full.bin file from http://homepages.cwi.nl/~jack/macpython/download.html. +

        Download the MacPython23full.bin file from http://homepages.cwi.nl/~jack/macpython/download.html.

      2. -

        If your browser does not decompress the file automatically, double-click MacPython23full.bin to decompress the file with Stuffit Expander. +

        If your browser does not decompress the file automatically, double-click MacPython23full.bin to decompress the file with Stuffit Expander.

      3. -

        Double-click the installer, MacPython23full. +

        Double-click the installer, MacPython23full.

      4. Step through the installer program.

      5. -

        After installation is complete, close the installer and open the /Applications folder. +

        After installation is complete, close the installer and open the /Applications folder.

      6. -

        Open the MacPython-OS9 2.3 folder. +

        Open the MacPython-OS9 2.3 folder.

      7. -

        Double-click Python IDE to launch Python. +

        Double-click Python IDE to launch Python.

      The MacPython IDE should display a splash screen, and then take you to the interactive shell. If the interactive shell does not appear, select
@@ -490,7 +490,7 @@ MacPython IDE 1.0.1

      1.5. Python on RedHat Linux

      Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built binary packages are available for most popular Linux distributions. Or you can always compile from source. -

      Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the rpms/ directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here: +

      Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the rpms/ directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here:

      Example 1.2. Installing on RedHat Linux 9

       localhost:~$ su -
       Password: [enter your root password]
      @@ -571,7 +571,7 @@ logout
       Type "help", "copyright", "credits" or "license" for more information.
       >>> [press Ctrl+D to exit]
       

      1.7. Python Installation from Source

      -

      If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the .tgz file, and then do the usual configure, make, make install dance. +

      If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the .tgz file, and then do the usual configure, make, make install dance.

      Example 1.4. Installing from source

       localhost:~$ su -
       Password: [enter your root password]
      @@ -658,7 +658,7 @@ Let's skip all that.
       

      Here is a complete, working Python program.

      It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it. -

      Example 2.1. odbchelper.py

      +

      Example 2.1. odbchelper.py

      If you have not already done so, you can download this and other examples used in this book.

       def buildConnectionString(params):
           """Build a connection string from a dictionary of parameters.
      @@ -687,7 +687,7 @@ File->Run... (Ctrl-R).  Output is displayed in the i
       
       
       In the Python IDE on Mac OS, you can run a Python program with
      -Python->Run window... (Cmd-R), but there is an important option you must set first.  Open the .py file in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked.  This is a per-file setting, but you'll only need to do it once per file.
      +Python->Run window... (Cmd-R), but there is an important option you must set first.  Open the .py file in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked.  This is a per-file setting, but you'll only need to do it once per file.
       
       
       
      @@ -695,10 +695,10 @@ Python->Run window... (Cmd-R), but there is an impor
       
      -
      +
      Tip
      On UNIX-compatible systems (including Mac OS X), you can run a Python program from the command line: python odbchelper.py
      -

      The output of odbchelper.py will look like this:

      server=mpilgrim;uid=sa;database=master;pwd=secret

      2.2. Declaring Functions

      +

      The output of odbchelper.py will look like this:

      server=mpilgrim;uid=sa;database=master;pwd=secret

      2.2. Declaring Functions

      Python has functions like most other languages, but it does not have separate header files like C++ or interface/implementation sections like Pascal. When you need a function, just declare it, like this:

       def buildConnectionString(params):

      Note that the keyword def starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments
@@ -744,7 +744,7 @@ In fact, every Python function returns a value; if the function ever executes a

      So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters).
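      To make the two halves of that sentence concrete, here is a minimal sketch. The helper try_add is hypothetical and exists only for illustration; the code is written so that it should behave the same under Python 2.6+ and Python 3.

```python
# Dynamic typing: no declarations, and a name can be rebound
# to a value of a different datatype at any time.
value = 1
value = "one"


def try_add(a, b):
    """Return a + b, or the exception's name if the types do not mix.

    Hypothetical helper, used only to show strong typing in action.
    """
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


# Strong typing: Python will not silently coerce a string into a number.
mixed = try_add("1", 2)

# An explicit conversion makes the intent clear, and the addition works.
explicit = try_add(int("1"), 2)
```

      Here mixed ends up holding the name of the exception rather than a sum, while explicit holds the integer result, because the conversion was spelled out.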

      2.3. Documenting Functions

      You can document a Python function by giving it a docstring. -

      Example 2.2. Defining the buildConnectionString Function's docstring

      +

      Example 2.2. Defining the buildConnectionString Function's docstring

       def buildConnectionString(params):
           """Build a connection string from a dictionary of parameters.
       
      @@ -776,166 +776,6 @@ need to give your function a docstring, but you always should.  I k
       
       
       

      2.4. Everything Is an Object

      -

      In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. -

      A function, like everything else in Python, is an object. -

      Open your favorite Python IDE and follow along: -

      Example 2.3. Accessing the buildConnectionString Function's docstring

      >>> import odbchelper            1
      ->>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
      ->>> print odbchelper.buildConnectionString(params) 2
      -server=mpilgrim;uid=sa;database=master;pwd=secret
      ->>> print odbchelper.buildConnectionString.__doc__ 3
      -Build a connection string from a dictionary
      -
      -Returns string.
      - - - - - - - - - - - - - -
      1 -The first line imports the odbchelper program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in Chapter 4.) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this - to access functionality in other modules, and you can do it in the IDE too. This is an important concept, and you'll talk more about it later. -
      2 -When you want to use functions defined in imported modules, you need to include the module name. So you can't just say buildConnectionString; it must be odbchelper.buildConnectionString. If you've used classes in Java, this should feel vaguely familiar. -
      3 -Instead of calling the function as you would expect to, you asked for one of the function's attributes, __doc__. -
      -
      - - - - - - -
      Note
      import in Python is like require in Perl. Once you import a Python module, you access its functions with module.function; once you require a Perl module, you access its functions with module::function. -
      -

      2.4.1. The Import Search Path

      -

      Before you go any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in sys.path. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists - later in this chapter.) -

      Example 2.4. Import Search Path

      ->>> import sys                 1
      ->>> sys.path 2
      -['', '/usr/local/lib/python2.2', '/usr/local/lib/python2.2/plat-linux2', 
      -'/usr/local/lib/python2.2/lib-dynload', '/usr/local/lib/python2.2/site-packages', 
      -'/usr/local/lib/python2.2/site-packages/PIL', '/usr/local/lib/python2.2/site-packages/piddle']
      ->>> sys      3
      -<module 'sys' (built-in)>
      ->>> sys.path.append('/my/new/path') 4
      - - - - - - - - - - - - - - - - - -
      1 -Importing the sys module makes all of its functions and attributes available. -
      2 -sys.path is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating - system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a .py file matching the module name you're trying to import. -
      3 -Actually, I lied; the truth is more complicated than that, because not all modules are stored as .py files. Some, like the sys module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The sys module is written in C.) -
      4 -You can add a new directory to Python's search path at runtime by appending the directory name to sys.path, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll talk more about append and other list methods in Chapter 3.) -
      -

      2.4.2. What's an Object?

      -

      Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute __doc__, which returns the docstring defined in the function's source code. The sys module is an object which has (among other things) an attribute called path. And so forth. -

      Still, this begs the question. What is an object? Different programming languages define “object” in different ways. In some, it means that all objects must have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in Chapter 3), and not all objects are subclassable (more on this in Chapter 5). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function - (more on this in Chapter 4). -

      This is so important that I'm going to repeat it in case you missed it the first few times: everything in Python is an object. Strings are objects. Lists are objects. Functions are objects. Even modules are objects. -

      -

      Further Reading on Objects

      - -

      2.5. Indenting Code

      -

      Python functions have no explicit begin or end, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (:) and the indentation of the code itself. -

      Example 2.5. Indenting the buildConnectionString Function

      -def buildConnectionString(params):
      -    """Build a connection string from a dictionary of parameters.
      -
      -    Returns string."""
      -    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

      Code blocks are defined by their indentation. By "code block", I mean functions, if statements, for loops, while loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. -This means that whitespace is significant, and must be consistent. In this example, the function code (including the docstring) is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not -indented is outside the function. -

      Example 2.6, “if Statements” shows an example of code indentation with if statements. -

      Example 2.6. if Statements

      -def fib(n): 1
      -    print 'n =', n            2
      -    if n > 1:                 3
      -        return n * fib(n - 1)
      -    else:   4
      -        print 'end of the line'
      -        return 1
      -
      - - - - - - - - - - - - - - - - - -
      1 -This is a function named fib that takes one argument, n. All the code within the function is indented. -
      2 -Printing to the screen is very easy in Python, just use print. print statements can take any data type, including strings, integers, and other native types like dictionaries and lists that you'll - learn about in the next chapter. You can even mix and match to print several things on one line by using a comma-separated - list of values. Each value is printed on the same line, separated by spaces (the commas don't print). So when fib is called with 5, this will print "n = 5". -
      3 -if statements are a type of code block. If the if expression evaluates to true, the indented block is executed, otherwise it falls to the else block. -
      4 -Of course if and else blocks can contain multiple lines, as long as they are all indented the same amount. This else block has two lines of code in it. There is no other special syntax for multi-line code blocks. Just indent and get on - with your life. -
      -

      After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read -and understand other people's Python code. - - - - - - -
      Note
      Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. C++ and Java use semicolons to separate statements and curly braces to separate code blocks. -
      -

      -

      Further Reading on Code Indentation

      -

      2.6. Testing Modules

      Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them. Here's an example that uses the if __name__ trick. @@ -988,7 +828,7 @@ them into a larger program.Note -
      A dictionary in Python is like an instance of the Hashtable class in Java. +A dictionary in Python is like an instance of the Hashtable class in Java.
      @@ -996,7 +836,7 @@ them into a larger program.
      Note -
      A dictionary in Python is like an instance of the Scripting.Dictionary object in Visual Basic. +A dictionary in Python is like an instance of the Scripting.Dictionary object in Visual Basic.
      @@ -1016,7 +856,7 @@ KeyError: mpilgrim

      1 -First, you create a new dictionary with two elements and assign it to the variable d. Each element is a key-value pair, and the whole set of elements is enclosed in curly braces. +First, you create a new dictionary with two elements and assign it to the variable d. Each element is a key-value pair, and the whole set of elements is enclosed in curly braces. @@ -1138,13 +978,13 @@ KeyError: mpilgrim
      1 -del lets you delete individual items from a dictionary by key. +del lets you delete individual items from a dictionary by key. 2 -clear deletes all items from a dictionary. Note that the set of empty curly braces signifies a dictionary without any items. +clear deletes all items from a dictionary. Note that the set of empty curly braces signifies a dictionary without any items. @@ -1174,7 +1014,7 @@ KeyError: mpilgrim
      Note -A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the ArrayList class, which can hold arbitrary objects and can expand dynamically as new items are added. +A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the ArrayList class, which can hold arbitrary objects and can expand dynamically as new items are added. @@ -1289,7 +1129,7 @@ KeyError: mpilgrim
      4 -If both slice indices are left out, all elements of the list are included. But this is not the same as the original li list; it is a new list that happens to have all the same elements. li[:] is shorthand for making a complete copy of a list. +If both slice indices are left out, all elements of the list are included. But this is not the same as the original li list; it is a new list that happens to have all the same elements. li[:] is shorthand for making a complete copy of a list. @@ -1309,24 +1149,24 @@ KeyError: mpilgrim
      1 -append adds a single element to the end of the list. +append adds a single element to the end of the list. 2 -insert inserts a single element into a list. The numeric argument is the index of the first element that gets bumped out of position. +insert inserts a single element into a list. The numeric argument is the index of the first element that gets bumped out of position. Note that list elements do not need to be unique; there are now two separate elements with the value 'new', li[2] and li[6]. 3 -extend concatenates lists. Note that you do not call extend with multiple arguments; you call it with one argument, a list. In this case, that list has two elements. +extend concatenates lists. Note that you do not call extend with multiple arguments; you call it with one argument, a list. In this case, that list has two elements. -

      Example 3.11. The Difference between extend and append

      +

      Example 3.11. The Difference between extend and append

       >>> li = ['a', 'b', 'c']
       >>> li.extend(['d', 'e', 'f']) 1
       >>> li
      @@ -1348,7 +1188,7 @@ KeyError: mpilgrim
      1 -Lists have two methods, extend and append, that look like they do the same thing, but are in fact completely different. extend takes a single argument, which is always a list, and adds each of the elements of that list to the original list. +Lists have two methods, extend and append, that look like they do the same thing, but are in fact completely different. extend takes a single argument, which is always a list, and adds each of the elements of that list to the original list. @@ -1360,14 +1200,14 @@ KeyError: mpilgrim
      3 -On the other hand, append takes one argument, which can be any data type, and simply adds it to the end of the list. Here, you're calling the append method with a single argument, which is a list of three elements. +On the other hand, append takes one argument, which can be any data type, and simply adds it to the end of the list. Here, you're calling the append method with a single argument, which is a list of three elements. 4 Now the original list, which started as a list of three elements, contains four elements. Why four? Because the last element - that you just appended is itself a list. Lists can contain any type of data, including other lists. That may be what you want, or maybe not. Don't use append if you mean extend. + that you just appended is itself a list. Lists can contain any type of data, including other lists. That may be what you want, or maybe not. Don't use append if you mean extend. @@ -1388,13 +1228,13 @@ False
      1 -index finds the first occurrence of a value in the list and returns the index. +index finds the first occurrence of a value in the list and returns the index. 2 -index finds the first occurrence of a value in the list. In this case, 'new' occurs twice in the list, in li[2] and li[6], but index will return only the first index, 2. +index finds the first occurrence of a value in the list. In this case, 'new' occurs twice in the list, in li[2] and li[6], but index will return only the first index, 2. @@ -1408,7 +1248,7 @@ False
      4 -To test whether a value is in the list, use in, which returns True if the value is found or False if it is not. +To test whether a value is in the list, use in, which returns True if the value is found or False if it is not. @@ -1420,7 +1260,7 @@ False
      Before version 2.2.1, Python had no separate boolean datatype. To compensate for this, Python accepted almost anything in a boolean context (like an if statement), according to the following rules:
        -
      • 0 is false; all other numbers are true. +
      • 0 is false; all other numbers are true.
      • An empty string ("") is false, all other strings are true. @@ -1456,26 +1296,26 @@ ValueError: list.remove(x): x not in list 1 -remove removes the first occurrence of a value from a list. +remove removes the first occurrence of a value from a list. 2 -remove removes only the first occurrence of a value. In this case, 'new' appeared twice in the list, but li.remove("new") removed only the first occurrence. +remove removes only the first occurrence of a value. In this case, 'new' appeared twice in the list, but li.remove("new") removed only the first occurrence. 3 -If the value is not found in the list, Python raises an exception. This mirrors the behavior of the index method. +If the value is not found in the list, Python raises an exception. This mirrors the behavior of the index method. 4 -pop is an interesting beast. It does two things: it removes the last element of the list, and it returns the value that it removed. - Note that this is different from li[-1], which returns a value but does not change the list, and different from li.remove(value), which changes the list but does not return a value. +pop is an interesting beast. It does two things: it removes the last element of the list, and it returns the value that it removed. + Note that this is different from li[-1], which returns a value but does not change the list, and different from li.remove(value), which changes the list but does not return a value. @@ -1494,7 +1334,7 @@ ValueError: list.remove(x): x not in list 1 -Lists can also be concatenated with the + operator. list = list + otherlist has the same result as list.extend(otherlist). But the + operator returns a new (concatenated) list as a value, whereas extend only alters an existing list. This means that extend is faster, especially for large lists. +Lists can also be concatenated with the + operator. list = list + otherlist has the same result as list.extend(otherlist). 
But the + operator returns a new (concatenated) list as a value, whereas extend only alters an existing list. This means that extend is faster, especially for large lists. @@ -1582,25 +1422,25 @@ True
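The difference between altering a list in place and building a new one can be checked directly. This is a small sketch, not part of the original examples; the names alias and combined are illustrative only.

```python
li = ['a', 'b']
alias = li                  # a second name for the same list object

li.extend(['c'])            # extend alters the existing list in place
same_object = alias is li   # still the very same object
alias_sees_change = alias == ['a', 'b', 'c']

combined = li + ['d']       # + builds and returns a new list
is_new_object = combined is not li
original_untouched = li == ['a', 'b', 'c']
```

Because extend never creates a second list, every name bound to the original list sees the change; the + operator leaves the original alone and hands back a separate object.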
        1 -You can't add elements to a tuple. Tuples have no append or extend method. +You can't add elements to a tuple. Tuples have no append or extend method. 2 -You can't remove elements from a tuple. Tuples have no remove or pop method. +You can't remove elements from a tuple. Tuples have no remove or pop method. 3 -You can't find elements in a tuple. Tuples have no index method. +You can't find elements in a tuple. Tuples have no index method. 4 -You can, however, use in to see if an element exists in the tuple. +You can, however, use in to see if an element exists in the tuple. @@ -1623,7 +1463,7 @@ True
        Note -Tuples can be converted into lists, and vice-versa. The built-in tuple function takes a list and returns a tuple with the same elements, and the list function takes a tuple and returns a list. In effect, tuple freezes a list, and list thaws a tuple. +Tuples can be converted into lists, and vice-versa. The built-in tuple function takes a list and returns a tuple with the same elements, and the list function takes a tuple and returns a list. In effect, tuple freezes a list, and list thaws a tuple. @@ -1638,10 +1478,10 @@ True
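The freeze and thaw analogy can be sketched as follows. The variable names are illustrative, and the AttributeError check simply confirms that tuples have no append method, as described above.

```python
params = ["server", "database", "uid"]

frozen = tuple(params)      # "freeze": same elements, but immutable

try:
    frozen.append("pwd")    # tuples have no append method
    outcome = "appended"
except AttributeError:
    outcome = "no append method"

thawed = list(frozen)       # "thaw": a new list you can change again
thawed.append("pwd")
```

Note that neither conversion touches the original: params and frozen keep their three elements, and only thawed gains the fourth.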

      3.4. Declaring variables

      -

      Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from Chapter 2, odbchelper.py. +

      Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from Chapter 2, odbchelper.py.

      Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring into existence by being assigned a value, and they are automatically destroyed when they go out of scope. -

      Example 3.17. Defining the myParams Variable

      +

      Example 3.17. Defining the myParams Variable

       if __name__ == "__main__":
           myParams = {"server":"mpilgrim", \
                       "database":"master", \
      @@ -1659,7 +1499,7 @@ if __name__ == "__main__":
       
       

      Strictly speaking, expressions in parentheses, straight brackets, or curly braces (like defining a dictionary) can be split into multiple lines with or without the line continuation character (“\”). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's a matter of style. -

      Third, you never declared the variable myParams, you just assigned a value to it. This is like VBScript without the option explicit option. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception. +

      Third, you never declared the variable myParams, you just assigned a value to it. This is like VBScript without the option explicit option. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception.

      3.4.1. Referencing Variables

      Example 3.18. Referencing an Unbound Variable

      >>> x
       Traceback (innermost last):
      @@ -1682,12 +1522,12 @@ NameError: There is no variable named 'x'
       
       1 
       
      -v is a tuple of three elements, and (x, y, z) is a tuple of three variables.  Assigning one to the other assigns each of the values of v to each of the variables, in order.
      +v is a tuple of three elements, and (x, y, z) is a tuple of three variables.  Assigning one to the other assigns each of the values of v to each of the variables, in order.
       
       
       
       

      This has all sorts of uses. I often want to assign names to a range of values. In C, you would use enum and manually list each constant and its associated value, which seems especially tedious when the values are consecutive. - In Python, you can use the built-in range function with multi-variable assignment to quickly assign consecutive values. + In Python, you can use the built-in range function with multi-variable assignment to quickly assign consecutive values.

      Example 3.20. Assigning Consecutive Values

      >>> range(7)              1
       [0, 1, 2, 3, 4, 5, 6]
       >>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7) 2
      @@ -1701,25 +1541,25 @@ NameError: There is no variable named 'x'
       
       1 
       
      -The built-in range function returns a list of integers.  In its simplest form, it takes an upper limit and returns a zero-based list counting
      -               up to but not including the upper limit.  (If you like, you can pass other parameters to specify a base other than 0 and a step other than 1.  You can print range.__doc__ for details.)
      +The built-in range function returns a list of integers.  In its simplest form, it takes an upper limit and returns a zero-based list counting
      +               up to but not including the upper limit.  (If you like, you can pass other parameters to specify a base other than 0 and a step other than 1.  You can print range.__doc__ for details.)
       
       
       
       2 
       
      -MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, and SUNDAY are the variables you're defining.  (This example came from the calendar module, a fun little module that prints calendars, like the UNIX program cal.  The calendar module defines integer constants for days of the week.)
      +MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, and SUNDAY are the variables you're defining.  (This example came from the calendar module, a fun little module that prints calendars, like the UNIX program cal.  The calendar module defines integer constants for days of the week.)
       
       
       
       3 
       
      -Now each variable has its value: MONDAY is 0, TUESDAY is 1, and so forth.
      +Now each variable has its value: MONDAY is 0, TUESDAY is 1, and so forth.
       
       
       
       

      You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of - all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the os module, which you'll discuss in Chapter 6. + all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the os module, which you'll discuss in Chapter 6.
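
      As a rough sketch of that idea (the helper name split_key_value is made up for illustration, and the syntax is chosen so that it also runs under Python 3), a function can return a tuple and the caller can keep it whole or unpack it. os.path.split is one standard library function that behaves this way.

```python
import os.path

def split_key_value(pair):
    """Hypothetical helper: split 'key=value' into a (key, value) tuple."""
    key, value = pair.split("=", 1)
    return key, value                 # the comma builds a two-element tuple

# The caller can treat the result as a single tuple...
result = split_key_value("server=mpilgrim")

# ...or assign the values to individual variables.
key, value = split_key_value("server=mpilgrim")

# os.path.split returns a (directory, filename) tuple in the same way.
directory, filename = os.path.split("/music/ap/mahadeva.mp3")
```

      Unpacking only works when the number of variables matches the number of elements in the returned tuple; a mismatch raises an exception.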

      Further Reading on Variables

        @@ -1736,7 +1576,7 @@ NameError: There is no variable named 'x' Note -String formatting in Python uses the same syntax as the sprintf function in C. +String formatting in Python uses the same syntax as the sprintf function in C. @@ -1748,7 +1588,7 @@ NameError: There is no variable named 'x' 1 -The whole expression evaluates to a string. The first %s is replaced by the value of k; the second %s is replaced by the value of v. All other characters in the string (in this case, the equal sign) stay as they are. +The whole expression evaluates to a string. The first %s is replaced by the value of k; the second %s is replaced by the value of v. All other characters in the string (in this case, the equal sign) stay as they are. @@ -1785,7 +1625,7 @@ TypeError: cannot concatenate 'str' and 'int' objects
      (userCount, ) is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the - comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether (userCount) was a tuple with one element or just the value of userCount. + comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether (userCount) was a tuple with one element or just the value of userCount. @@ -1802,7 +1642,7 @@ TypeError: cannot concatenate 'str' and 'int' objects
      printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values. +

      As with printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.

      Example 3.23. Formatting Numbers

       >>> print "Today's stock price: %f" % 50.4625   1
       50.462500
      @@ -1855,7 +1695,7 @@ TypeError: cannot concatenate 'str' and 'int' objects
      1 -To make sense of this, look at it from right to left. li is the list you're mapping. Python loops through li one element at a time, temporarily assigning the value of each element to the variable elem. Python then applies the function elem*2 and appends that result to the returned list. +To make sense of this, look at it from right to left. li is the list you're mapping. Python loops through li one element at a time, temporarily assigning the value of each element to the variable elem. Python then applies the function elem*2 and appends that result to the returned list. @@ -1871,9 +1711,9 @@ TypeError: cannot concatenate 'str' and 'int' objects
      -

      Here are the list comprehensions in the buildConnectionString function that you declared in Chapter 2:

      -["%s=%s" % (k, v) for k, v in params.items()]

      First, notice that you're calling the items function of the params dictionary. This function returns a list of tuples of all the data in the dictionary. -

      Example 3.25. The keys, values, and items Functions

      >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
      +

      Here are the list comprehensions in the buildConnectionString function that you declared in Chapter 2:

      +["%s=%s" % (k, v) for k, v in params.items()]

      First, notice that you're calling the items function of the params dictionary. This function returns a list of tuples of all the data in the dictionary. +

      Example 3.25. The keys, values, and items Functions

      >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
       >>> params.keys()   1
       ['server', 'uid', 'database', 'pwd']
       >>> params.values() 2
      @@ -1884,26 +1724,26 @@ TypeError: cannot concatenate 'str' and 'int' objects
      1 -The keys method of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined +The keys method of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined (remember that elements in a dictionary are unordered), but it is a list. 2 -The values method returns a list of all the values. The list is in the same order as the list returned by keys, so params.values()[n] == params[params.keys()[n]] for all values of n. +The values method returns a list of all the values. The list is in the same order as the list returned by keys, so params.values()[n] == params[params.keys()[n]] for all values of n. 3 -The items method returns a list of tuples of the form (key, value). The list contains all the data in the dictionary. +The items method returns a list of tuples of the form (key, value). The list contains all the data in the dictionary. -

      Now let's see what buildConnectionString does. It takes a list, params.items(), and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements -as params.items(), but each element in the new list will be a string that contains both a key and its associated value from the params dictionary. -

      Example 3.26. List Comprehensions in buildConnectionString, Step by Step

      >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
      +

      Now let's see what buildConnectionString does. It takes a list, params.items(), and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements +as params.items(), but each element in the new list will be a string that contains both a key and its associated value from the params dictionary. +

      Example 3.26. List Comprehensions in buildConnectionString, Step by Step

      >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
       >>> params.items()
       [('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]
       >>> [k for k, v in params.items()]                1
      @@ -1916,13 +1756,13 @@ as params.items
       
       1 
       
      -Note that you're using two variables to iterate through the params.items() list.  This is another use of multi-variable assignment.  The first element of params.items() is ('server', 'mpilgrim'), so in the first iteration of the list comprehension, k will get 'server' and v will get 'mpilgrim'.  In this case, you're ignoring the value of v and only including the value of k in the returned list, so this list comprehension ends up being equivalent to params.keys().
      +Note that you're using two variables to iterate through the params.items() list.  This is another use of multi-variable assignment.  The first element of params.items() is ('server', 'mpilgrim'), so in the first iteration of the list comprehension, k will get 'server' and v will get 'mpilgrim'.  In this case, you're ignoring the value of v and only including the value of k in the returned list, so this list comprehension ends up being equivalent to params.keys().
       
       
       
       2 
       
      -Here you're doing the same thing, but ignoring the value of k, so this list comprehension ends up being equivalent to params.values().
      +Here you're doing the same thing, but ignoring the value of k, so this list comprehension ends up being equivalent to params.values().
       
       
       
      @@ -1936,36 +1776,36 @@ as params.items
       

      Further Reading on List Comprehensions

      3.7. Joining Lists and Splitting Strings

      -

      You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object. +

      You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object.

      -

      Here is an example of joining a list from the buildConnectionString function:

      +

      Here is an example of joining a list from the buildConnectionString function:

           return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

      One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything -is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string ";" itself is an object, and you are calling its join method. -

      The join method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't +is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string ";" itself is an object, and you are calling its join method. +

      The join method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't need to be a semi-colon; it doesn't even need to be a single character. It can be any string. -
      Caution
      join works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements +join works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception.
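
      To make the caution concrete, here is a small sketch (the mixed list is invented for illustration). Joining it directly raises a TypeError; converting each element with str first is one common way around that.

```python
values = ["uid", 42, None]            # hypothetical list with non-string elements

try:
    ";".join(values)
    join_failed = False
except TypeError:
    join_failed = True                # join does no type coercion

# Convert every element to a string first, then join.
joined = ";".join([str(v) for v in values])
```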
      -

      Example 3.27. Output of odbchelper.py

      >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
      +

      Example 3.27. Output of odbchelper.py

      >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
       >>> ["%s=%s" % (k, v) for k, v in params.items()]
       ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
       >>> ";".join(["%s=%s" % (k, v) for k, v in params.items()])
      -'server=mpilgrim;uid=sa;database=master;pwd=secret'

      This string is then returned from the odbchelper function and printed by the calling block, which gives you the output that you marveled at when you started reading this +'server=mpilgrim;uid=sa;database=master;pwd=secret'

      This string is then returned from the odbchelper function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter.

      You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's -called split. +called split.

      Example 3.28. Splitting a String

      >>> li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
       >>> s = ";".join(li)
       >>> s
      @@ -1978,13 +1818,13 @@ called split.
       
       1 
       
      -split reverses join by splitting a string into a multi-element list.  Note that the delimiter (“;”) is stripped out completely; it does not appear in any of the elements of the returned list.
      +split reverses join by splitting a string into a multi-element list.  Note that the delimiter (“;”) is stripped out completely; it does not appear in any of the elements of the returned list.
       
       
       
       2 
       
      -split takes an optional second argument, which is the number of times to split.  (“Oooooh, optional arguments...”  You'll learn how to do this in your own functions in the next chapter.)
      +split takes an optional second argument, which is the number of times to split.  (“Oooooh, optional arguments...”  You'll learn how to do this in your own functions in the next chapter.)
       
       
       
      @@ -1993,7 +1833,7 @@ called split.
       Tip
       
       
      -anystring.split(delimiter, 1) is a useful technique when you want to search a string for a substring and then work with everything before the substring
      +anystring.split(delimiter, 1) is a useful technique when you want to search a string for a substring and then work with everything before the substring
             (which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
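
       A minimal sketch of the tip, using a made-up string whose value happens to contain the delimiter. Because the split happens only once, everything after the first delimiter stays together in the second element.

```python
pair = "pwd=se=cret"                  # hypothetical value that contains '='

# Split once: text before the first '=' and everything after it.
before, after = pair.split("=", 1)

# If the delimiter never appears, the list has only one element,
# so check its length before unpacking into two variables.
parts = "no delimiter here".split("=", 1)
```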
       
       
      @@ -2005,18 +1845,18 @@ called split.
       
       
    5. Python Library Reference summarizes all the string methods. -
    6. Python Library Reference documents the string module. +
    7. Python Library Reference documents the string module. -
    8. The Whole Python FAQ explains why join is a string method instead of a list method. +
    9. The Whole Python FAQ explains why join is a string method instead of a list method.

      3.7.1. Historical Note on String Methods

      -

      When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story - behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed - important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of - the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead. +

      When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story + behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed + important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of + the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead.

      3.8. Summary

      -

      The odbchelper.py program and its output should now make perfect sense. +

      The odbchelper.py program and its output should now make perfect sense.

       def buildConnectionString(params):
           """Build a connection string from a dictionary of parameters.
      @@ -2031,7 +1871,7 @@ if __name__ == "__main__":
                       "pwd":"secret" \
                       }
           print buildConnectionString(myParams)
      -

      Here is the output of odbchelper.py:

      server=mpilgrim;uid=sa;database=master;pwd=secret
      +

      Here is the output of odbchelper.py:

      server=mpilgrim;uid=sa;database=master;pwd=secret

      Before diving into the next chapter, make sure you're comfortable doing all of these things:

        @@ -2059,7 +1899,7 @@ functions whose names you don't even know ahead of time.

        4.1. Diving In

        Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered in Chapter 2, Your First Python Program. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter. -

        Example 4.1. apihelper.py

        +

        Example 4.1. apihelper.py

        If you have not already done so, you can download this and other examples used in this book.

         def info(object, spacing=10, collapse=1): 1 2 3
             """Print methods and docstrings.
        @@ -2078,13 +1918,13 @@ if __name__ == "__main__":                1 
         
        -This module has one function, info.  According to its function declaration, it takes three parameters: object, spacing, and collapse.  The last two are actually optional parameters, as you'll see shortly.
        +This module has one function, info.  According to its function declaration, it takes three parameters: object, spacing, and collapse.  The last two are actually optional parameters, as you'll see shortly.
         
         
         
         2 
         
        -The info function has a multi-line docstring that succinctly describes the function's purpose.  Note that no return value is mentioned; this function will be used solely
        +The info function has a multi-line docstring that succinctly describes the function's purpose.  Note that no return value is mentioned; this function will be used solely
                     for its effects, rather than its value.
         
         
        @@ -2098,7 +1938,7 @@ if __name__ == "__main__":                4 
         
         The if __name__ trick allows this program to do something useful when run by itself, without interfering with its use as a module for other programs.
        -             In this case, the program simply prints out the docstring of the info function.
        +             In this case, the program simply prints out the docstring of the info function.
         
         
         
        @@ -2108,9 +1948,9 @@ if __name__ == "__main__":                info function is designed to be used by you, the programmer, while working in the Python IDE.  It takes any object that has functions or methods (like a module, which has functions, or a list, which has methods) and
        +

        The info function is designed to be used by you, the programmer, while working in the Python IDE. It takes any object that has functions or methods (like a module, which has functions, or a list, which has methods) and prints out the functions and their docstrings. -

        Example 4.2. Sample Usage of apihelper.py

        >>> from apihelper import info
        +

        Example 4.2. Sample Usage of apihelper.py

        >>> from apihelper import info
         >>> li = []
         >>> info(li)
         append     L.append(object) -- append object to end
        @@ -2121,8 +1961,8 @@ insert     L.insert(index, object) -- insert object before index
         pop        L.pop([index]) -> item -- remove and return item at index (default last)
         remove     L.remove(value) -- remove first occurrence of value
         reverse    L.reverse() -- reverse *IN PLACE*
        -sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1

        By default the output is formatted to be easy to read. Multi-line docstrings are collapsed into a single long line, but this option can be changed by specifying 0 for the collapse argument. If the function names are longer than 10 characters, you can specify a larger value for the spacing argument to make the output easier to read. -

        Example 4.3. Advanced Usage of apihelper.py

        >>> import odbchelper
        +sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1

        By default the output is formatted to be easy to read. Multi-line docstrings are collapsed into a single long line, but this option can be changed by specifying 0 for the collapse argument. If the function names are longer than 10 characters, you can specify a larger value for the spacing argument to make the output easier to read. +

        Example 4.3. Advanced Usage of apihelper.py

        >>> import odbchelper
         >>> info(odbchelper)
         buildConnectionString Build a connection string from a dictionary Returns string.
         >>> info(odbchelper, 30)
        @@ -2135,11 +1975,11 @@ buildConnectionString          Build a connection string from a dictionary Retur
         

        Python allows function arguments to have default values; if the function is called without the argument, the argument gets its default value. Furthermore, arguments can be specified in any order by using named arguments. Stored procedures in SQL Server Transact/SQL can do this, so if you're a SQL Server scripting guru, you can skim this part.

        -

        Here is an example of info, a function with two optional arguments:

        -def info(object, spacing=10, collapse=1):

        spacing and collapse are optional, because they have default values defined. object is required, because it has no default value. If info is called with only one argument, spacing defaults to 10 and collapse defaults to 1. If info is called with two arguments, collapse still defaults to 1. -

        Say you want to specify a value for collapse but want to accept the default value for spacing. In most languages, you would be out of luck, because you would need to call the function with three arguments. But in +

        Here is an example of info, a function with two optional arguments:

        +def info(object, spacing=10, collapse=1):

        spacing and collapse are optional, because they have default values defined. object is required, because it has no default value. If info is called with only one argument, spacing defaults to 10 and collapse defaults to 1. If info is called with two arguments, collapse still defaults to 1. +

        Say you want to specify a value for collapse but want to accept the default value for spacing. In most languages, you would be out of luck, because you would need to call the function with three arguments. But in Python, arguments can be specified by name, in any order. -

        Example 4.4. Valid Calls of info

        +

        Example 4.4. Valid Calls of info

         info(odbchelper)  1
         info(odbchelper, 12)                2
         info(odbchelper, collapse=0)        3
        @@ -2148,25 +1988,25 @@ info(spacing=15, object=odbchelper) 1 
         
        -With only one argument, spacing gets its default value of 10 and collapse gets its default value of 1.
        +With only one argument, spacing gets its default value of 10 and collapse gets its default value of 1.
         
         
         
         2 
         
        -With two arguments, collapse gets its default value of 1.
        +With two arguments, collapse gets its default value of 1.
         
         
         
         3 
         
        -Here you are naming the collapse argument explicitly and specifying its value.  spacing still gets its default value of 10.
        +Here you are naming the collapse argument explicitly and specifying its value.  spacing still gets its default value of 10.
         
         
         
         4 
         
        -Even required arguments (like object, which has no default value) can be named, and named arguments can appear in any order.
        +Even required arguments (like object, which has no default value) can be named, and named arguments can appear in any order.
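
         If you want to try these calls without the real apihelper module, a stand-in works. The sketch below assumes a stub with the same signature as info that simply returns the arguments it received; it is not the actual function, and the string "odbchelper" stands in for the module object.

```python
def info(object, spacing=10, collapse=1):
    """Stand-in for apihelper.info: report the arguments that were received."""
    return (object, spacing, collapse)

call_1 = info("odbchelper")                      # both defaults used
call_2 = info("odbchelper", 12)                  # collapse keeps its default
call_3 = info("odbchelper", collapse=0)          # spacing keeps its default
call_4 = info(spacing=15, object="odbchelper")   # named arguments, any order
```

         Calling info() with no arguments at all would fail, because object has no default value.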
         
         
         
        @@ -2187,13 +2027,13 @@ time, you'll call functions the “normal” way, but you always have th
         
      • Python Tutorial discusses exactly when and how default arguments are evaluated, which matters when the default value is a list or an expression with side effects.
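
      Here is a short sketch of why that evaluation timing matters (both function names are invented for illustration). A default value is evaluated once, when the function is defined, so a list used as a default is shared between calls. Using None as the default and creating the list inside the function is one common way to avoid the surprise.

```python
def append_shared(item, bucket=[]):
    """The default list is created once and reused on every call."""
    bucket.append(item)
    return bucket

def append_fresh(item, bucket=None):
    """A new list is created on each call that omits bucket."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first = append_shared(1)
second = append_shared(2)     # same list object as first, now holding two items

third = append_fresh(1)
fourth = append_fresh(2)      # an independent list
```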
      -

      4.3. Using type, str, dir, and Other Built-In Functions

      +

      4.3. Using type, str, dir, and Other Built-In Functions

      Python has a small set of extremely useful built-in functions. All other functions are partitioned off into modules. This was actually a conscious design decision, to keep the core language from getting bloated like other scripting languages (cough cough, Visual Basic). -

      4.3.1. The type Function

      -

      The type function returns the datatype of any arbitrary object. The possible types are listed in the types module. This is useful for helper functions that can handle several types of data. -

      Example 4.5. Introducing type

      >>> type(1)           1
      +

      4.3.1. The type Function

      +

      The type function returns the datatype of any arbitrary object. The possible types are listed in the types module. This is useful for helper functions that can handle several types of data. +

      Example 4.5. Introducing type

      >>> type(1)           1
       <type 'int'>
       >>> li = []
       >>> type(li)          2
      @@ -2208,32 +2048,32 @@ True
      1 -type takes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions, +type takes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions, classes, modules, even types are acceptable. 2 -type can take a variable and return its datatype. +type can take a variable and return its datatype. 3 -type also works on modules. +type also works on modules. 4 -You can use the constants in the types module to compare types of objects. This is what the info function does, as you'll see shortly. +You can use the constants in the types module to compare types of objects. This is what the info function does, as you'll see shortly. -

      4.3.2. The str Function

      -

      The str coerces data into a string. Every datatype can be coerced into a string. -

      Example 4.6. Introducing str

      +

      4.3.2. The str Function

      +

      The str function coerces data into a string. Every datatype can be coerced into a string. +

      Example 4.6. Introducing str

       >>> str(1)          1
       '1'
       >>> horsemen = ['war', 'pestilence', 'famine']
      @@ -2250,32 +2090,32 @@ True
      1 -For simple datatypes like integers, you would expect str to work, because almost every language has a function to convert an integer to a string. +For simple datatypes like integers, you would expect str to work, because almost every language has a function to convert an integer to a string. 2 -However, str works on any object of any type. Here it works on a list which you've constructed in bits and pieces. +However, str works on any object of any type. Here it works on a list which you've constructed in bits and pieces. 3 -str also works on modules. Note that the string representation of the module includes the pathname of the module on disk, so +str also works on modules. Note that the string representation of the module includes the pathname of the module on disk, so yours will be different. 4 -A subtle but important behavior of str is that it works on None, the Python null value. It returns the string 'None'. You'll use this to your advantage in the info function, as you'll see shortly. +A subtle but important behavior of str is that it works on None, the Python null value. It returns the string 'None'. You'll use this to your advantage in the info function, as you'll see shortly. -

      At the heart of the info function is the powerful dir function. dir returns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much +

      At the heart of the info function is the powerful dir function. dir returns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much anything. -

      Example 4.7. Introducing dir

      >>> li = []
      +

      Example 4.7. Introducing dir

      >>> li = []
       >>> dir(li)           1
       ['append', 'count', 'extend', 'index', 'insert',
       'pop', 'remove', 'reverse', 'sort']
      @@ -2289,25 +2129,25 @@ True
      1 -li is a list, so dir(li) returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not +li is a list, so dir(li) returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not the methods themselves. 2 -d is a dictionary, so dir(d) returns a list of the names of dictionary methods. At least one of these, keys, should look familiar. +d is a dictionary, so dir(d) returns a list of the names of dictionary methods. At least one of these, keys, should look familiar. 3 -This is where it really gets interesting. odbchelper is a module, so dir(odbchelper) returns a list of all kinds of stuff defined in the module, including built-in attributes, like __name__, __doc__, and whatever other attributes and methods you define. In this case, odbchelper has only one user-defined method, the buildConnectionString function described in Chapter 2. +This is where it really gets interesting. odbchelper is a module, so dir(odbchelper) returns a list of all kinds of stuff defined in the module, including built-in attributes, like __name__, __doc__, and whatever other attributes and methods you define. In this case, odbchelper has only one user-defined method, the buildConnectionString function described in Chapter 2. -

      Finally, the callable function takes any object and returns True if the object can be called, or False otherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.) -

      Example 4.8. Introducing callable

      +

      Finally, the callable function takes any object and returns True if the object can be called, or False otherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.) +

      Example 4.8. Introducing callable

       >>> import string
       >>> string.punctuation           1
       '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
      @@ -2329,40 +2169,40 @@ True
       
       1 
       
      -The functions in the string module are deprecated (although many people still use the join function), but the module contains a lot of useful constants like this string.punctuation, which contains all the standard punctuation characters.
      +The functions in the string module are deprecated (although many people still use the join function), but the module contains a lot of useful constants like this string.punctuation, which contains all the standard punctuation characters.
       
       
       
       2 
       
      -string.join is a function that joins a list of strings.
      +string.join is a function that joins a list of strings.
       
       
       
       3 
       
      -string.punctuation is not callable; it is a string.  (A string does have callable methods, but the string itself is not callable.)
      +string.punctuation is not callable; it is a string.  (A string does have callable methods, but the string itself is not callable.)
       
       
       
       4 
       
      -string.join is callable; it's a function that takes two arguments.
      +string.join is callable; it's a function that takes two arguments.
       
       
       
       5 
       
      -Any callable object may have a docstring.  By using the callable function on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes)
      +Any callable object may have a docstring.  By using the callable function on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes)
                      and which you want to ignore (constants and so on) without knowing anything about the object ahead of time.
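
The callable test described in these callouts can be tried in a few lines. This is a minimal sketch in Python 3 syntax (assuming Python 3.2 or later, where `callable` is available as a built-in); because the `string.join` function no longer exists there, the sketch uses the `str.join` method as its callable example instead. The names `punctuation_is_callable`, `join_is_callable`, and `callable_names` are illustrative only.

```python
import string

# string.punctuation is a plain string constant, so it is not callable.
punctuation_is_callable = callable(string.punctuation)

# str.join is a method, so it is callable.
join_is_callable = callable(str.join)

# Keep only the attribute names of the string module that refer to
# callable objects (functions and classes), skipping the constants.
callable_names = [name for name in dir(string)
                  if callable(getattr(string, name))]

print(punctuation_is_callable, join_is_callable)  # False True
print(callable_names)
```

The exact contents of `callable_names` depend on the Python version, which is why no expected output is shown for the last line.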
       
       
       
       

      4.3.3. Built-In Functions

      -

      type, str, dir, and all the rest of Python's built-in functions are grouped into a special module called __builtin__. (That's two underscores before and after.) If it helps, you can think of Python automatically executing from __builtin__ import * on startup, which imports all the “built-in” functions into the namespace so you can use them directly. +

      type, str, dir, and all the rest of Python's built-in functions are grouped into a special module called __builtin__. (That's two underscores before and after.) If it helps, you can think of Python automatically executing from __builtin__ import * on startup, which imports all the “built-in” functions into the namespace so you can use them directly.

      The advantage of thinking like this is that you can access all the built-in functions and attributes as a group by getting - information about the __builtin__ module. And guess what, Python has a function called info. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the - built-in error classes, like AttributeError, should already look familiar.) + information about the __builtin__ module. And guess what, Python has a function called info. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the + built-in error classes, like AttributeError, should already look familiar.)

      Example 4.9. Built-in Attributes and Functions

      >>> from apihelper import info
       >>> import __builtin__
       >>> info(__builtin__, 20)
      @@ -2391,10 +2231,10 @@ IOError              I/O operation failed.
       
    10. Python Library Reference documents all the built-in functions and all the built-in exceptions. -

      4.4. Getting Object References With getattr

      +

      4.4. Getting Object References With getattr

      You already know that Python functions are objects. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the -getattr function. -

      Example 4.10. Introducing getattr

      >>> li = ["Larry", "Curly"]
      +getattr function.
      +

      Example 4.10. Introducing getattr

      >>> li = ["Larry", "Curly"]
       >>> li.pop     1
       <built-in method pop of list object at 010DF884>
       >>> getattr(li, "pop")           2
      @@ -2412,38 +2252,38 @@ AttributeError: 'tuple' object has no attribute 'pop'
      1 -This gets a reference to the pop method of the list. Note that this is not calling the pop method; that would be li.pop(). This is the method itself. +This gets a reference to the pop method of the list. Note that this is not calling the pop method; that would be li.pop(). This is the method itself. 2 -This also returns a reference to the pop method, but this time, the method name is specified as a string argument to the getattr function. getattr is an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list, - and the attribute is the pop method. +This also returns a reference to the pop method, but this time, the method name is specified as a string argument to the getattr function. getattr is an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list, + and the attribute is the pop method. 3 -In case it hasn't sunk in just how incredibly useful this is, try this: the return value of getattr is the method, which you can then call just as if you had said li.append("Moe") directly. But you didn't call the function directly; you specified the function name as a string instead. +In case it hasn't sunk in just how incredibly useful this is, try this: the return value of getattr is the method, which you can then call just as if you had said li.append("Moe") directly. But you didn't call the function directly; you specified the function name as a string instead. 4 -getattr also works on dictionaries. +getattr also works on dictionaries. 5 -In theory, getattr would work on tuples, except that tuples have no methods, so getattr will raise an exception no matter what attribute name you give. +In theory, getattr would work on tuples, except that tuples have no methods, so getattr will raise an exception no matter what attribute name you give. -

      4.4.1. getattr with Modules

      -

      getattr isn't just for built-in datatypes. It also works on modules. -

      Example 4.11. The getattr Function in apihelper.py

      >>> import odbchelper
      +

      4.4.1. getattr with Modules

      +

      getattr isn't just for built-in datatypes. It also works on modules. +

      Example 4.11. The getattr Function in apihelper.py

      >>> import odbchelper
       >>> odbchelper.buildConnectionString             1
       <function buildConnectionString at 00D18DD4>
       >>> getattr(odbchelper, "buildConnectionString") 2
      @@ -2463,40 +2303,40 @@ True
      1 -This returns a reference to the buildConnectionString function in the odbchelper module, which you studied in Chapter 2, Your First Python Program. (The hex address you see is specific to my machine; your output will be different.) +This returns a reference to the buildConnectionString function in the odbchelper module, which you studied in Chapter 2, Your First Python Program. (The hex address you see is specific to my machine; your output will be different.) 2 -Using getattr, you can get the same reference to the same function. In general, getattr(object, "attribute") is equivalent to object.attribute. If object is a module, then attribute can be anything defined in the module: a function, class, or global variable. +Using getattr, you can get the same reference to the same function. In general, getattr(object, "attribute") is equivalent to object.attribute. If object is a module, then attribute can be anything defined in the module: a function, class, or global variable. 3 -And this is what you actually use in the info function. object is passed into the function as an argument; method is a string which is the name of a method or function. +And this is what you actually use in the info function. object is passed into the function as an argument; method is a string which is the name of a method or function. 4 -In this case, method is the name of a function, which you can prove by getting its type. +In this case, method is the name of a function, which you can prove by getting its type. 5 -Since method is a function, it is callable. +Since method is a function, it is callable. -

      4.4.2. getattr As a Dispatcher

      -

      A common usage pattern of getattr is as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could +

      4.4.2. getattr As a Dispatcher

      +

      A common usage pattern of getattr is as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could define separate functions for each output format and use a single dispatch function to call the right one.

      For example, let's imagine a program that prints site statistics in HTML, XML, and plain text formats. The choice of output format could be specified on the command line, or stored in a configuration - file. A statsout module defines three functions, output_html, output_xml, and output_text. Then the main program defines a single output function, like this: -

      Example 4.12. Creating a Dispatcher with getattr

      +   file.  A statsout module defines three functions, output_html, output_xml, and output_text.  Then the main program defines a single output function, like this:
      +

      Example 4.12. Creating a Dispatcher with getattr

       import statsout
       
       def output(data, format="text"):            1
      @@ -2507,28 +2347,28 @@ def output(data, format="text"):            
       1 
       
      -The output function takes one required argument, data, and one optional argument, format.  If format is not specified, it defaults to text, and you will end up calling the plain text output function.
      +The output function takes one required argument, data, and one optional argument, format.  If format is not specified, it defaults to text, and you will end up calling the plain text output function.
       
       
       
       2 
       
      -You concatenate the format argument with "output_" to produce a function name, and then go get that function from the statsout module.  This allows you to easily extend the program later to support other output formats, without changing this dispatch
      -            function.  Just add another function to statsout named, for instance, output_pdf, and pass "pdf" as the format into the output function.
      +You concatenate the format argument with "output_" to produce a function name, and then go get that function from the statsout module.  This allows you to easily extend the program later to support other output formats, without changing this dispatch
      +            function.  Just add another function to statsout named, for instance, output_pdf, and pass "pdf" as the format into the output function.
       
       
       
       3 
       
      -Now you can simply call the output function in the same way as any other function.  The output_function variable is a reference to the appropriate function from the statsout module.
      +Now you can simply call the output function in the same way as any other function.  The output_function variable is a reference to the appropriate function from the statsout module.
       
       
       
       

      Did you see the bug in the previous example? This is a very loose coupling of strings and functions, and there is no error - checking. What happens if the user passes in a format that doesn't have a corresponding function defined in statsout? Well, getattr will raise an AttributeError, because no default value was given, and the program will crash before output_function is ever assigned. That's + checking. What happens if the user passes in a format that doesn't have a corresponding function defined in statsout? Well, getattr will raise an AttributeError, because no default value was given, and the program will crash before output_function is ever assigned. That's bad. -

      Luckily, getattr takes an optional third argument, a default value. -

      Example 4.13. getattr Default Values

      +

      Luckily, getattr takes an optional third argument, a default value. +

      Example 4.13. getattr Default Values

       import statsout
       
       def output(data, format="text"):
      @@ -2539,17 +2379,17 @@ def output(data, format="text"):
       
       1 
       
      -This function call is guaranteed to work, because you added a third argument to the call to getattr.  The third argument is a default value that is returned if the attribute or method specified by the second argument wasn't
      +This function call is guaranteed to work, because you added a third argument to the call to getattr.  The third argument is a default value that is returned if the attribute or method specified by the second argument wasn't
                      found.
       
       
       
      -

      As you can see, getattr is quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters. +

      As you can see, getattr is quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters.
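
To make the dispatcher concrete, here is a small self-contained sketch in Python 3 syntax. The `statsout` object below is only a stand-in for the module described above, built with `types.SimpleNamespace`, and the bodies of its two functions are hypothetical placeholders. It shows both behaviors of `getattr`: with a third argument the lookup falls back to the default, and without one a missing name raises `AttributeError`.

```python
import types

# Stand-in for the statsout module; the function bodies are
# hypothetical placeholders, not the real implementations.
statsout = types.SimpleNamespace(
    output_text=lambda data: "text: %s" % data,
    output_html=lambda data: "<p>%s</p>" % data,
)

def output(data, format="text"):
    # Build the function name from the format string and look it up,
    # falling back to the plain text function for unknown formats.
    output_function = getattr(statsout, "output_%s" % format,
                              statsout.output_text)
    return output_function(data)

print(output("42 hits", "html"))  # <p>42 hits</p>
print(output("42 hits", "pdf"))   # text: 42 hits

# Without a default value, a missing attribute raises an exception.
try:
    getattr(statsout, "output_pdf")
except AttributeError:
    print("no output_pdf function defined")
```

Adding an `output_pdf` function to the stand-in would make the `"pdf"` format work without touching `output` at all, which is the point of the pattern.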

      4.5. Filtering Lists

      As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (Section 3.6, “Mapping Lists”). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely.

      Here is the list filtering syntax:

      -[mapping-expression for element in source-list if filter-expression]

      This is an extension of the list comprehensions that you know and love. The first two thirds are the same; the last part, starting with the if, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be almost anything). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored, +[mapping-expression for element in source-list if filter-expression]

      This is an extension of the list comprehensions that you know and love. The first two thirds are the same; the last part, starting with the if, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be almost anything). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored, so they are never put through the mapping expression and are not included in the output list.

      Example 4.14. Introducing List Filtering

      >>> li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]
       >>> [elem for elem in li if len(elem) > 1]       1
      @@ -2577,25 +2417,25 @@ so they are never put through the mapping expression and are not included in the
       
       3 
       
      -count is a list method that returns the number of times a value occurs in a list.  You might think that this filter would eliminate
      +count is a list method that returns the number of times a value occurs in a list.  You might think that this filter would eliminate
                   duplicates from a list, returning a list containing only one copy of each value in the original list.  But it doesn't, because
                   values that appear twice in the original list (in this case, b and d) are excluded completely.  There are ways of eliminating duplicates from a list, but filtering is not the solution.
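
The difference between filtering on `count` and actually removing duplicates is easy to check. The following sketch (Python 3 syntax) reuses the list from the example; the variable names and the `dict.fromkeys` approach are illustrative choices, one of several ways to deduplicate, and rely on dictionaries preserving insertion order (Python 3.7 and later).

```python
li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]

# Keep only the elements longer than one character.
long_names = [elem for elem in li if len(elem) > 1]

# Keep only the values that occur exactly once; "b" and "d" occur
# twice, so every copy of them is dropped.
unique_only = [elem for elem in li if li.count(elem) == 1]

# One way to really remove duplicates while keeping the first
# occurrence of each value.
deduplicated = list(dict.fromkeys(li))

print(long_names)    # ['mpilgrim', 'foo']
print(unique_only)   # ['a', 'mpilgrim', 'foo', 'c']
print(deduplicated)  # ['a', 'mpilgrim', 'foo', 'b', 'c', 'd']
```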
       
       
       
      -

      Let's get back to this line from apihelper.py:

      +

      Let's get back to this line from apihelper.py:

           methodList = [method for method in dir(object) if callable(getattr(object, method))]

      This looks complicated, and it is complicated, but the basic structure is the same. The whole filter expression returns a -list, which is assigned to the methodList variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression, -which returns the value of each element. dir(object) returns a list of object's attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the if. -

      The filter expression looks scary, but it's not. You already know about callable, getattr, and in. As you saw in the previous section, the expression getattr(object, method) returns a function object if object is a module and method is the name of a function in that module. -

      So this expression takes an object (named object). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters +list, which is assigned to the methodList variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression, +which returns the value of each element. dir(object) returns a list of object's attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the if. +

      The filter expression looks scary, but it's not. You already know about callable, getattr, and in. As you saw in the previous section, the expression getattr(object, method) returns a function object if object is a module and method is the name of a function in that module. +

      So this expression takes an object (named object). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters that list to weed out all the stuff that you don't care about. You do the weeding out by taking the name of each attribute/method/function -and getting a reference to the real thing, via the getattr function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like -the pop method of a list) and user-defined (like the buildConnectionString function of the odbchelper module). You don't care about other attributes, like the __name__ attribute that's built in to every module. +and getting a reference to the real thing, via the getattr function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like +the pop method of a list) and user-defined (like the buildConnectionString function of the odbchelper module). You don't care about other attributes, like the __name__ attribute that's built in to every module.

      Further Reading on Filtering Lists

      4.6. The Peculiar Nature of and and or

      @@ -2611,7 +2451,7 @@ the pop method of a list) and user-defined (like t 1 -When using and, values are evaluated in a boolean context from left to right. 0, '', [], (), {}, and None are false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are +When using and, values are evaluated in a boolean context from left to right. 0, '', [], (), {}, and None are false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are true in a boolean context, but you can define special methods in your class to make an instance evaluate to false. You'll learn all about classes and special methods in Chapter 5. If all values are true in a boolean context, and returns the last value. In this case, and evaluates 'a', which is true, then 'b', which is true, and returns 'b'. @@ -2663,11 +2503,11 @@ the pop method of a list) and user-defined (like t 4 Note that or evaluates values only until it finds one that is true in a boolean context, and then it ignores the rest. This distinction - is important if some values can have side effects. Here, the function sidefx is never called, because or evaluates 'a', which is true, and returns 'a' immediately. + is important if some values can have side effects. Here, the function sidefx is never called, because or evaluates 'a', which is true, and returns 'a' immediately. -

      If you're a C hacker, you are certainly familiar with the bool ? a : b expression, which evaluates to a if bool is true, and b otherwise. Because of the way and and or work in Python, you can accomplish the same thing. +

      If you're a C hacker, you are certainly familiar with the bool ? a : b expression, which evaluates to a if bool is true, and b otherwise. Because of the way and and or work in Python, you can accomplish the same thing.

      4.6.1. Using the and-or Trick

      Example 4.17. Introducing the and-or Trick

      >>> a = "first"
       >>> b = "second"
      @@ -2680,18 +2520,18 @@ the pop method of a list) and user-defined (like t
       
       1 
       
      -This syntax looks similar to the bool ? a : b expression in C.  The entire expression is evaluated from left to right, so the and is evaluated first.  1 and 'first' evaluates to 'first', then 'first' or 'second' evaluates to 'first'.
      +This syntax looks similar to the bool ? a : b expression in C.  The entire expression is evaluated from left to right, so the and is evaluated first.  1 and 'first' evaluates to 'first', then 'first' or 'second' evaluates to 'first'.
       
       
       
       2 
       
      -0 and 'first' evaluates to 0, and then 0 or 'second' evaluates to 'second'.
      +0 and 'first' evaluates to 0, and then 0 or 'second' evaluates to 'second'.
       
       
       
       

      However, since this Python expression is simply boolean logic, and not a special construct of the language, there is one extremely important difference - between this and-or trick in Python and the bool ? a : b syntax in C. If the value of a is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?) + between this and-or trick in Python and the bool ? a : b syntax in C. If the value of a is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?)

      Example 4.18. When the and-or Trick Fails

      >>> a = ""
       >>> b = "second"
       >>> 1 and a or b         1
      @@ -2700,12 +2540,12 @@ the pop method of a list) and user-defined (like t
       
       1 
       
      -Since a is an empty string, which Python considers false in a boolean context, 1 and '' evaluates to '', and then '' or 'second' evaluates to 'second'.  Oops!  That's not what you wanted.
      +Since a is an empty string, which Python considers false in a boolean context, 1 and '' evaluates to '', and then '' or 'second' evaluates to 'second'.  Oops!  That's not what you wanted.
       
       
       
      -

      The and-or trick, bool and a or b, will not work like the C expression bool ? a : b when a is false in a boolean context. -

      The real trick behind the and-or trick, then, is to make sure that the value of a is never false. One common way of doing this is to turn a into [a] and b into [b], then take the first element of the returned list, which will be either a or b. +

      The and-or trick, bool and a or b, will not work like the C expression bool ? a : b when a is false in a boolean context. +

      The real trick behind the and-or trick, then, is to make sure that the value of a is never false. One common way of doing this is to turn a into [a] and b into [b], then take the first element of the returned list, which will be either a or b.

      Example 4.19. Using the and-or Trick Safely

      >>> a = ""
       >>> b = "second"
       >>> (1 and [a] or [b])[0] 1
      @@ -2714,12 +2554,12 @@ the pop method of a list) and user-defined (like t
       
       1 
       
      -Since [a] is a non-empty list, it is never false.  Even if a is 0 or '' or some other false value, the list [a] is true because it has one element.
      +Since [a] is a non-empty list, it is never false.  Even if a is 0 or '' or some other false value, the list [a] is true because it has one element.
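
The two forms of the trick can be compared side by side. In this sketch (Python 3 syntax), `choose_naive` and `choose_safe` are hypothetical helper names introduced only for illustration; they wrap the plain and the list-wrapped forms of the expression.

```python
def choose_naive(condition, a, b):
    # Plain and-or trick: goes wrong whenever a is false.
    return condition and a or b

def choose_safe(condition, a, b):
    # Wrapping both values in one-element lists guarantees that the
    # middle operand is true, whatever a happens to be.
    return (condition and [a] or [b])[0]

print(repr(choose_naive(1, "first", "second")))  # 'first'
print(repr(choose_naive(1, "", "second")))       # 'second' (not what you wanted)
print(repr(choose_safe(1, "", "second")))        # ''
print(repr(choose_safe(0, "", "second")))        # 'second'
```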
       
       
       
       

      By now, this trick may seem like more trouble than it's worth. You could, after all, accomplish the same thing with an if statement, so why go through all this fuss? Well, in many cases, you are choosing between two constant values, so you can - use the simpler syntax and not worry, because you know that the a value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so. + use the simpler syntax and not worry, because you know that the a value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so. For example, there are some cases in Python where if statements are not allowed, such as in lambda functions.

      Further Reading on the and-or Trick

      @@ -2770,10 +2610,10 @@ a lambda function; if you need something more complex, define a nor

      4.7.1. Real-World lambda Functions

      -

      Here are the lambda functions in apihelper.py:

      +

      Here are the lambda functions in apihelper.py:

           processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

      Notice that this uses the simple form of the and-or trick, which is okay, because a lambda function is always true in a boolean context. (That doesn't mean that a lambda function can't return a false value. The function is always true; its return value could be anything.) -

      Also notice that you're using the split function with no arguments. You've already seen it used with one or two arguments, but without any arguments it splits on whitespace. -

      Example 4.21. split With No Arguments

      >>> s = "this   is\na\ttest"  1
      +

      Also notice that you're using the split function with no arguments. You've already seen it used with one or two arguments, but without any arguments it splits on whitespace. +

      Example 4.21. split With No Arguments

      >>> s = "this   is\na\ttest"  1
       >>> print s
       this   is
       a	test
      @@ -2791,19 +2631,19 @@ a	test
       
       2 
       
      -split without any arguments splits on whitespace.  So three spaces, a newline, and a tab character are all the same.
      +split without any arguments splits on whitespace.  So three spaces, a newline, and a tab character are all the same.
       
       
       
       3 
       
      -You can normalize whitespace by splitting a string with split and then rejoining it with join, using a single space as a delimiter.  This is what the info function does to collapse multi-line docstrings into a single line.
      +You can normalize whitespace by splitting a string with split and then rejoining it with join, using a single space as a delimiter.  This is what the info function does to collapse multi-line docstrings into a single line.
       
       
       
      -

      So what is the info function actually doing with these lambda functions, splits, and and-or tricks? +

      So what is the info function actually doing with these lambda functions, splits, and and-or tricks?

      -    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

      processFunc is now a function, but which function it is depends on the value of the collapse variable. If collapse is true, processFunc(string) will collapse whitespace; otherwise, processFunc(string) will return its argument unchanged. + processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

      processFunc is now a function, but which function it is depends on the value of the collapse variable. If collapse is true, processFunc(string) will collapse whitespace; otherwise, processFunc(string) will return its argument unchanged.
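
Here is the same selection logic in a form you can run by itself. It is a sketch in Python 3 syntax; the wrapper function `make_process_func` is a hypothetical name added so that both values of `collapse` can be tried.

```python
def make_process_func(collapse):
    # Pick one of two lambda functions up front; a lambda is always
    # true in a boolean context, so the simple and-or form is safe.
    return collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

s = "this   is\na\ttest"

collapsed = make_process_func(True)(s)
unchanged = make_process_func(False)(s)

print(repr(collapsed))  # 'this is a test'
print(unchanged == s)   # True
```

The decision about collapsing is made once, when the function is chosen, rather than on every call.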

      To do this in a less robust language, like Visual Basic, you would probably create a function that took a string and a collapse argument and used an if statement to decide whether to collapse the whitespace or not, then returned the appropriate value. This would be inefficient, because the function would need to handle every possible case. Every time you called it, it would need to decide whether to collapse whitespace before it could give you what you wanted. In Python, you can take that decision logic out of the function and define a lambda function that is custom-tailored to give you exactly (and only) what you want. This is more efficient, more elegant, and @@ -2823,14 +2663,14 @@ a test is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time to knock them down.

      -

      This is the meat of apihelper.py:

      +

      This is the meat of apihelper.py:

           print "\n".join(["%s %s" %
           (method.ljust(spacing),
            processFunc(str(getattr(object, method).__doc__)))
          for method in methodList])

      Note that this is one command, split over multiple lines, but it doesn't use the line continuation character (\). Remember when I said that some expressions can be split into multiple lines without using a backslash? A list comprehension is one of those expressions, since the entire expression is contained in square brackets.

      Now, let's take it from the end and work backwards. The

      -for method in methodList

      shows that this is a list comprehension. As you know, methodList is a list of all the methods you care about in object. So you're looping through that list with method. +for method in methodList

      shows that this is a list comprehension. As you know, methodList is a list of all the methods you care about in object. So you're looping through that list with method.

      Example 4.22. Getting a docstring Dynamically

      >>> import odbchelper
       >>> object = odbchelper 1
       >>> method = 'buildConnectionString'      2
      @@ -2844,19 +2684,19 @@ for method in methodList

      shows that this is a 1 -In the info function, object is the object you're getting help on, passed in as an argument. +In the info function, object is the object you're getting help on, passed in as an argument. 2 -As you're looping through methodList, method is the name of the current method. +As you're looping through methodList, method is the name of the current method. 3 -Using the getattr function, you're getting a reference to the method function in the object module. +Using the getattr function, you're getting a reference to the method function in the object module. @@ -2866,8 +2706,8 @@ for method in methodList

      shows that this is a -

      The next piece of the puzzle is the use of str around the docstring. As you may recall, str is a built-in function that coerces data into a string. But a docstring is always a string, so why bother with the str function? The answer is that not every function has a docstring, and if it doesn't, its __doc__ attribute is None. -

      Example 4.23. Why Use str on a docstring?

      >>> def foo(): print 2
      +

      The next piece of the puzzle is the use of str around the docstring. As you may recall, str is a built-in function that coerces data into a string. But a docstring is always a string, so why bother with the str function? The answer is that not every function has a docstring, and if it doesn't, its __doc__ attribute is None. +

      Example 4.23. Why Use str on a docstring?

      >>> def foo(): print 2
       >>> foo()
       2
       >>> foo.__doc__     1
      @@ -2892,7 +2732,7 @@ True
       
       3 
       
      -The str function takes the null value and returns a string representation of it, 'None'.
      +The str function takes the null value and returns a string representation of it, 'None'.
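The point of the str call can be checked in a few lines. This is a minimal sketch written for Python 3, so it differs slightly from the Python 2 session above; collapse is a hypothetical stand-in for processFunc, not the code from apihelper.py.

```python
def foo():
    # Deliberately no docstring, so foo.__doc__ is None.
    return 2

def collapse(s):
    # Stand-in for processFunc: collapse runs of whitespace into single spaces.
    return " ".join(s.split())

doc = foo.__doc__              # None, because foo has no docstring
as_text = str(doc)             # the string 'None'
collapsed = collapse(as_text)  # safe: as_text is a real string

try:
    collapse(doc)              # None has no split method
    failed = False
except AttributeError:
    failed = True
```

Passing the raw __doc__ value fails with AttributeError, while the str-wrapped value goes through unchanged.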
       
       
       
      @@ -2905,9 +2745,9 @@ True
       
       
       
      -

      Now that you are guaranteed to have a string, you can pass the string to processFunc, which you have already defined as a function that either does or doesn't collapse whitespace. Now you see why it was important to use str to convert a None value into a string representation. processFunc is assuming a string argument and calling its split method, which would crash if you passed it None because None doesn't have a split method. -

      Stepping back even further, you see that you're using string formatting again to concatenate the return value of processFunc with the return value of method's ljust method. This is a new string method that you haven't seen before. -

      Example 4.24. Introducing ljust

      >>> s = 'buildConnectionString'
      +

      Now that you are guaranteed to have a string, you can pass the string to processFunc, which you have already defined as a function that either does or doesn't collapse whitespace. Now you see why it was important to use str to convert a None value into a string representation. processFunc is assuming a string argument and calling its split method, which would crash if you passed it None because None doesn't have a split method. +

      Stepping back even further, you see that you're using string formatting again to concatenate the return value of processFunc with the return value of method's ljust method. This is a new string method that you haven't seen before. +

      Example 4.24. Introducing ljust

      >>> s = 'buildConnectionString'
       >>> s.ljust(30) 1
       'buildConnectionString         '
       >>> s.ljust(20) 2
      @@ -2916,17 +2756,17 @@ True
       
       1 
       
      -ljust pads the string with spaces to the given length.  This is what the info function uses to make two columns of output and line up all the docstrings in the second column.
      +ljust pads the string with spaces to the given length.  This is what the info function uses to make two columns of output and line up all the docstrings in the second column.
       
       
       
       2 
       
      -If the given length is smaller than the length of the string, ljust will simply return the string unchanged.  It never truncates the string.
      +If the given length is smaller than the length of the string, ljust will simply return the string unchanged.  It never truncates the string.
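Here is a small sketch of ljust used for column layout, written for Python 3. The method names and descriptions are placeholder data chosen for illustration, not output from the info function.

```python
names = ["append", "buildConnectionString"]
descriptions = ["add an item", "build a connection string"]  # placeholder text

# Pad each name to 10 columns; a name longer than 10 comes back unchanged.
rows = ["%s %s" % (name.ljust(10), text)
        for name, text in zip(names, descriptions)]

table = "\n".join(rows)
print(table)
```

The short name is padded so its description starts in column 12, and the long name is left alone rather than truncated.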
       
       
       
      -

      You're almost finished. Given the padded method name from the ljust method and the (possibly collapsed) docstring from the call to processFunc, you concatenate the two and get a single string. Since you're mapping methodList, you end up with a list of strings. Using the join method of the string "\n", you join this list into a single string, with each element of the list on a separate line, and print the result. +

      You're almost finished. Given the padded method name from the ljust method and the (possibly collapsed) docstring from the call to processFunc, you concatenate the two and get a single string. Since you're mapping methodList, you end up with a list of strings. Using the join method of the string "\n", you join this list into a single string, with each element of the list on a separate line, and print the result.

      Example 4.25. Printing a List

      >>> li = ['a', 'b', 'c']
       >>> print "\n".join(li) 1
       a
      @@ -2946,7 +2786,7 @@ c
      (method.ljust(spacing), processFunc(str(getattr(object, method).__doc__))) for method in methodList])

      4.9. Summary

      -

      The apihelper.py program and its output should now make perfect sense. +

      The apihelper.py program and its output should now make perfect sense.

       def info(object, spacing=10, collapse=1):
           """Print methods and docstrings.
      @@ -2961,7 +2801,7 @@ def info(object, spacing=10, collapse=1):
       
       if __name__ == "__main__":
           print info.__doc__
      -

      Here is the output of apihelper.py:

      >>> from apihelper import info
      +

      Here is the output of apihelper.py:

      >>> from apihelper import info
       >>> li = []
       >>> info(li)
       append     L.append(object) -- append object to end
      @@ -2977,9 +2817,9 @@ sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1,
       
      • Defining and calling functions with optional and named arguments -
      • Using str to coerce any arbitrary value into a string representation +
      • Using str to coerce any arbitrary value into a string representation -
      • Using getattr to get references to functions and other attributes dynamically +
      • Using getattr to get references to functions and other attributes dynamically
      • Extending the list comprehension syntax to do list filtering
      • Recognizing the and-or trick and using it safely @@ -2995,7 +2835,7 @@ sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1,

        5.1. Diving In

        Here is a complete, working Python program. Read the docstrings of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't worry about the stuff you don't understand; that's what the rest of the chapter is for. -

        Example 5.1. fileinfo.py

        +

        Example 5.1. fileinfo.py

        If you have not already done so, you can download this and other examples used in this book.

         """Framework for getting filetype-specific metadata.
         
        @@ -3131,18 +2971,18 @@ title=Spinning
         genre=255
         name=/music/_singles/spinning.mp3
         year=2000
        -comment=http://mp3.com/artists/95/vxp

        5.2. Importing Modules Using from module import

        -

        Python has two ways of importing modules. Both are useful, and you should know when to use each. One way, import module, you've already seen in Section 2.4, “Everything Is an Object”. The other way accomplishes the same thing, but it has subtle and important differences. +comment=http://mp3.com/artists/95/vxp

      5.2. Importing Modules Using from module import

      +

      Python has two ways of importing modules. Both are useful, and you should know when to use each. One way, import module, you've already seen in Section 2.4, “Everything Is an Object”. The other way accomplishes the same thing, but it has subtle and important differences.

      -

      Here is the basic from module import syntax:

      +

      Here is the basic from module import syntax:

       from UserDict import UserDict
      -

      This is similar to the import module syntax that you know and love, but with an important difference: the attributes and methods of the imported module types are imported directly into the local namespace, so they are available directly, without qualification by module name. You -can import individual items or use from module import * to import everything. +

      This is similar to the import module syntax that you know and love, but with an important difference: the attributes and methods of the imported module types are imported directly into the local namespace, so they are available directly, without qualification by module name. You +can import individual items or use from module import * to import everything.

      -
      Note
      from module import * in Python is like use module in Perl; import module in Python is like require module in Perl. +from module import * in Python is like use module in Perl; import module in Python is like require module in Perl.
      @@ -3150,11 +2990,11 @@ can import individual items or use from module -
      Note
      from module import * in Python is like import module.* in Java; import module in Python is like import module in Java. +from module import * in Python is like import module.* in Java; import module in Python is like import module in Java.
      -

      Example 5.2. import module vs. from module import

      >>> import types
      +

      Example 5.2. import module vs. from module import

      >>> import types
       >>> types.FunctionType             1
       <type 'function'>
       >>> FunctionType 2
      @@ -3168,36 +3008,36 @@ NameError: There is no variable named 'FunctionType'
       
       1 
       
      -The types module contains no methods; it just has attributes for each Python object type.  Note that the attribute, FunctionType, must be qualified by the module name, types.
      +The types module contains no methods; it just has attributes for each Python object type.  Note that the attribute, FunctionType, must be qualified by the module name, types.
       
       
       
       2 
       
      -FunctionType by itself has not been defined in this namespace; it exists only in the context of types.
      +FunctionType by itself has not been defined in this namespace; it exists only in the context of types.
       
       
       
       3 
       
      -This syntax imports the attribute FunctionType from the types module directly into the local namespace.
      +This syntax imports the attribute FunctionType from the types module directly into the local namespace.
       
       
       
       4 
       
      -Now FunctionType can be accessed directly, without reference to types.
      +Now FunctionType can be accessed directly, without reference to types.
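The same comparison can be written as a short script. This sketch assumes Python 3, where the types module still provides FunctionType; sample is a throwaway function added for illustration.

```python
import types

# Qualified access: the attribute lives in the types namespace.
qualified = types.FunctionType

# Before the from-import, the bare name is not defined here.
try:
    FunctionType
    bare_name_defined = True
except NameError:
    bare_name_defined = False

from types import FunctionType

def sample():
    pass

same_object = FunctionType is types.FunctionType   # both names refer to one object
is_function = isinstance(sample, FunctionType)
```

Both spellings end up referring to the same object; the only difference is which namespace holds the name.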
       
       
       
      -

      When should you use from module import? +

      When should you use from module import?

        -
      • If you will be accessing attributes and methods often and don't want to type the module name over and over, use from module import. +
      • If you will be accessing attributes and methods often and don't want to type the module name over and over, use from module import. -
      • If you want to selectively import some attributes and methods but not others, use from module import. +
      • If you want to selectively import some attributes and methods but not others, use from module import. -
      • If the module contains attributes or functions with the same name as ones in your module, you must use import module to avoid name conflicts. +
      • If the module contains attributes or functions with the same name as ones in your module, you must use import module to avoid name conflicts.

      Other than that, it's just a matter of style, and you will see Python code written both ways. @@ -3213,9 +3053,9 @@ NameError: There is no variable named 'FunctionType'

      Further Reading on Module Importing Techniques

      5.3. Defining Classes

      @@ -3230,7 +3070,7 @@ class Loaf: 1
      - @@ -3258,8 +3098,8 @@ class Loaf: 1

      Of course, realistically, most classes will be inherited from other classes, and they will define their own class methods and attributes. But as you've just seen, there is nothing that a class absolutely must have, other than a name. In particular, -C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Python classes do have something similar to a constructor: the __init__ method. -

      Example 5.4. Defining the FileInfo Class

      +C++ programmers may find it odd that Python classes don't have explicit constructors and destructors.  Python classes do have something similar to a constructor: the __init__ method.
      +

      Example 5.4. Defining the FileInfo Class

       from UserDict import UserDict
       
       class FileInfo(UserDict): 1
      @@ -3267,9 +3107,9 @@ class FileInfo(UserDict): 1 -
      1 The name of this class is Loaf, and it doesn't inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement. +The name of this class is Loaf, and it doesn't inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement.
      In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the FileInfo class is inherited from the UserDict class (which was imported from the UserDict module). UserDict is a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior. - (There are similar classes UserList and UserString which allow you to subclass lists and strings.) There is a bit of black magic behind this, which you will demystify later - in this chapter when you explore the UserDict class in more depth. +In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the FileInfo class is inherited from the UserDict class (which was imported from the UserDict module). UserDict is a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior. + (There are similar classes UserList and UserString which allow you to subclass lists and strings.) There is a bit of black magic behind this, which you will demystify later + in this chapter when you explore the UserDict class in more depth.
      @@ -3286,8 +3126,8 @@ class FileInfo(UserDict): FileInfo
      class using the __init__ method. -

      Example 5.5. Initializing the FileInfo Class

      +

      This example shows the initialization of the FileInfo class using the __init__ method. +

      Example 5.5. Initializing the FileInfo Class

       class FileInfo(UserDict):
           "store file metadata"              1
           def __init__(self, filename=None): 2 3 4
      @@ -3301,23 +3141,23 @@ class FileInfo(UserDict): 2 -__init__ is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor - of the class. It's tempting, because it looks like a constructor (by convention, __init__ is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance - of the class), and even sounds like one (“init” certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time __init__ is called, and you already have a valid reference to the new instance of the class. But __init__ is the closest thing you're going to get to a constructor in Python, and it fills much the same role. +__init__ is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor + of the class. It's tempting, because it looks like a constructor (by convention, __init__ is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance + of the class), and even sounds like one (“init” certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time __init__ is called, and you already have a valid reference to the new instance of the class. But __init__ is the closest thing you're going to get to a constructor in Python, and it fills much the same role. 3 -The first argument of every class method, including __init__, is always a reference to the current instance of the class. By convention, this argument is always named self. In the __init__ method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although +The first argument of every class method, including __init__, is always a reference to the current instance of the class. 
By convention, this argument is always named self. In the __init__ method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically. 4 -__init__ methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making - them optional to the caller. In this case, filename has a default value of None, which is the Python null value. +__init__ methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making + them optional to the caller. In this case, filename has a default value of None, which is the Python null value. @@ -3330,7 +3170,7 @@ class FileInfo(UserDict): -

      Example 5.6. Coding the FileInfo Class

      +

      Example 5.6. Coding the FileInfo Class

       class FileInfo(UserDict):
           "store file metadata"
           def __init__(self, filename=None):
      @@ -3348,18 +3188,18 @@ class FileInfo(UserDict):
       
       2 
       
      -I told you that this class acts like a dictionary, and here is the first sign of it.  You're assigning the argument filename as the value of this object's name key.
      +I told you that this class acts like a dictionary, and here is the first sign of it.  You're assigning the argument filename as the value of this object's name key.
       
       
       
       3 
       
      -Note that the __init__ method never returns a value.
      +Note that the __init__ method never returns a value.
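Putting the pieces together, the following sketch defines the class and then instantiates it. It assumes Python 3, where UserDict is imported from collections rather than from the UserDict module used in this chapter.

```python
from collections import UserDict  # Python 3 location of UserDict

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)    # explicitly call the ancestor's __init__
        self["name"] = filename    # dictionary-style assignment on the instance

f = FileInfo("/music/_singles/kairo.mp3")  # Python supplies self automatically
g = FileInfo()                             # filename falls back to None
```

Note that self appears in the definition of __init__ and in the explicit call to the ancestor, but not in the calls that create f and g.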
       
       
       
      -

      5.3.2. Knowing When to Use self and __init__

      -

      When defining your class methods, you must explicitly list self as the first argument for each method, including __init__. When you call a method of an ancestor class from within your class, you must include the self argument. But when you call your class method from outside, you do not specify anything for the self argument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent, +

      5.3.2. Knowing When to Use self and __init__

      +

      When defining your class methods, you must explicitly list self as the first argument for each method, including __init__. When you call a method of an ancestor class from within your class, you must include the self argument. But when you call your class method from outside, you do not specify anything for the self argument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent, but it may appear inconsistent because it relies on a distinction (between bound and unbound methods) that you don't know about yet.

      Whew. I realize that's a lot to absorb, but you'll get the hang of it. All Python classes work the same way, so once you learn one, you've learned them all. If you forget everything else, remember this @@ -3368,7 +3208,7 @@ class FileInfo(UserDict): Note -__init__ methods are optional, but when you define one, you must remember to explicitly call the ancestor's __init__ method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, +__init__ methods are optional, but when you define one, you must remember to explicitly call the ancestor's __init__ method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments. @@ -3387,8 +3227,8 @@ class FileInfo(UserDict):

      5.4. Instantiating Classes

      Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the -__init__ method defines. The return value will be the newly created object. -

      Example 5.7. Creating a FileInfo Instance

      >>> import fileinfo
      +__init__ method defines.  The return value will be the newly created object.
      +

      Example 5.7. Creating a FileInfo Instance

      >>> import fileinfo
       >>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3") 1
       >>> f.__class__    2
       <class fileinfo.FileInfo at 010EC204>
      @@ -3400,14 +3240,14 @@ class FileInfo(UserDict):
       
       1 
       
      -You are creating an instance of the FileInfo class (defined in the fileinfo module) and assigning the newly created instance to the variable f.  You are passing one parameter, /music/_singles/kairo.mp3, which will end up as the filename argument in FileInfo's __init__ method.
      +You are creating an instance of the FileInfo class (defined in the fileinfo module) and assigning the newly created instance to the variable f.  You are passing one parameter, /music/_singles/kairo.mp3, which will end up as the filename argument in FileInfo's __init__ method.
       
       
       
       2 
       
       Every class instance has a built-in attribute, __class__, which is the object's class.  (Note that the representation of this includes the physical address of the class object on my
      -            machine; your representation will be different.)  Java programmers may be familiar with the Class class, which contains methods like getName and getSuperclass to get metadata information about an object.  In Python, this kind of metadata is available directly on the object itself through attributes like __class__, __name__, and __bases__.
      +            machine; your representation will be different.)  Java programmers may be familiar with the Class class, which contains methods like getName and getSuperclass to get metadata information about an object.  In Python, this kind of metadata is available directly on the object itself through attributes like __class__, __name__, and __bases__.
       
       
       
      @@ -3419,7 +3259,7 @@ class FileInfo(UserDict):
       
       4 
       
      -Remember when the __init__ method assigned its filename argument to self["name"]?  Well, here's the result.  The arguments you pass when you create the class instance get sent right along to the __init__ method (along with the object reference, self, which Python adds for free).
      +Remember when the __init__ method assigned its filename argument to self["name"]?  Well, here's the result.  The arguments you pass when you create the class instance get sent right along to the __init__ method (along with the object reference, self, which Python adds for free).
       
       
       
      @@ -3444,17 +3284,17 @@ class FileInfo(UserDict):
       
       1 
       
      -Every time the leakmem function is called, you are creating an instance of FileInfo and assigning it to the variable f, which is a local variable within the function.  Then the function ends without ever freeing f, so you would expect a memory leak, but you would be wrong.  When the function ends, the local variable f goes out of scope.  At this point, there are no longer any references to the newly created instance of FileInfo (since you never assigned it to anything other than f), so Python destroys the instance for us.
      +Every time the leakmem function is called, you are creating an instance of FileInfo and assigning it to the variable f, which is a local variable within the function.  Then the function ends without ever freeing f, so you would expect a memory leak, but you would be wrong.  When the function ends, the local variable f goes out of scope.  At this point, there are no longer any references to the newly created instance of FileInfo (since you never assigned it to anything other than f), so Python destroys the instance for us.
       
       
       
       2 
       
      -No matter how many times you call the leakmem function, it will never leak memory, because every time, Python will destroy the newly created FileInfo class before returning from leakmem.
      +No matter how many times you call the leakmem function, it will never leak memory, because every time, Python will destroy the newly created FileInfo instance before returning from leakmem.
       
       
       
      -

      The technical term for this form of garbage collection is “reference counting”. Python keeps a list of references to every instance created. In the above example, there was only one reference to the FileInfo instance: the local variable f. When the function ends, the variable f goes out of scope, so the reference count drops to 0, and Python destroys the instance automatically. +

      The technical term for this form of garbage collection is “reference counting”. Python keeps a count of the references to every instance created. In the above example, there was only one reference to the FileInfo instance: the local variable f. When the function ends, the variable f goes out of scope, so the reference count drops to 0, and Python destroys the instance automatically.

      In previous versions of Python, there were situations where reference counting failed, and Python couldn't clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list, where each node has a pointer to the previous and next node in the list), neither instance would ever be destroyed automatically because Python (correctly) believed that there is always a reference to each instance. Python 2.0 has an additional form of garbage collection called “mark-and-sweep” which is smart enough to notice this virtual gridlock and clean up circular references correctly. @@ -3465,11 +3305,11 @@ class FileInfo(UserDict):
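Both behaviors can be observed with weak references, which point at an object without keeping it alive. This is a sketch that assumes CPython 3; other Python implementations may free objects at different times. The FileInfo and Node classes here are minimal stand-ins defined only for this illustration.

```python
import gc
import weakref

class FileInfo:
    """Minimal stand-in for the chapter's FileInfo class."""
    def __init__(self, filename=None):
        self.name = filename

def leakmem():
    f = FileInfo("/music/_singles/kairo.mp3")
    return weakref.ref(f)      # a weak reference does not keep f alive

ref = leakmem()
freed_by_refcount = ref() is None   # on CPython, f was freed when leakmem returned

class Node:
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a    # each instance now references the other
    return weakref.ref(a)

cycle_ref = make_cycle()
gc.collect()                   # ask the cycle collector to run now
freed_by_gc = cycle_ref() is None
```

The first instance disappears as soon as its only reference goes out of scope; the two nodes that reference each other are reclaimed only once the cycle collector runs.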

      -

      5.5. Exploring UserDict: A Wrapper Class

      -

      As you've seen, FileInfo is a class that acts like a dictionary. To explore this further, let's look at the UserDict class in the UserDict module, which is the ancestor of the FileInfo class. This is nothing special; the class is written in Python and stored in a .py file, just like any other Python code. In particular, it's stored in the lib directory in your Python installation. +

      5.5. Exploring UserDict: A Wrapper Class

      +

      As you've seen, FileInfo is a class that acts like a dictionary. To explore this further, let's look at the UserDict class in the UserDict module, which is the ancestor of the FileInfo class. This is nothing special; the class is written in Python and stored in a .py file, just like any other Python code. In particular, it's stored in the lib directory in your Python installation.

      @@ -3479,7 +3319,7 @@ File->Locate... (Ctrl-L).
      Tip
      -

      Example 5.9. Defining the UserDict Class

      +

      Example 5.9. Defining the UserDict Class

       class UserDict:              1
           def __init__(self, dict=None):             2
               self.data = {}       3
      @@ -3489,29 +3329,29 @@ class UserDict:              1 
       
      -Note that UserDict is a base class, not inherited from any other class.
      +Note that UserDict is a base class, not inherited from any other class.
       
       
       
       2 
       
      -This is the __init__ method that you overrode in the FileInfo class.  Note that the argument list in this ancestor class is different than the descendant.  That's okay; each subclass can have
      +This is the __init__ method that you overrode in the FileInfo class.  Note that the argument list in this ancestor class differs from the one in the descendant.  That's okay; each subclass can have
                   its own set of arguments, as long as it calls the ancestor with the correct arguments.  Here the ancestor class has a way
      -            to define initial values (by passing a dictionary in the dict argument) which the FileInfo does not use.
      +            to define initial values (by passing a dictionary in the dict argument) which FileInfo does not use.
       
       
       
       3 
       
      -Python supports data attributes (called “instance variables” in Java and Powerbuilder, and “member variables” in C++).  Data attributes are pieces of data held by a specific instance of a class.  In this case, each instance of UserDict will have a data attribute data.  To reference this attribute from code outside the class, you qualify it with the instance name, instance.data, in the same way that you qualify a function with its module name.  To reference a data attribute from within the class,
      -            you use self as the qualifier.  By convention, all data attributes are initialized to reasonable values in the __init__ method.  However, this is not required, since data attributes, like local variables, spring into existence when they are first assigned a value.
      +Python supports data attributes (called “instance variables” in Java and Powerbuilder, and “member variables” in C++).  Data attributes are pieces of data held by a specific instance of a class.  In this case, each instance of UserDict will have a data attribute data.  To reference this attribute from code outside the class, you qualify it with the instance name, instance.data, in the same way that you qualify a function with its module name.  To reference a data attribute from within the class,
      +            you use self as the qualifier.  By convention, all data attributes are initialized to reasonable values in the __init__ method.  However, this is not required, since data attributes, like local variables, spring into existence when they are first assigned a value.
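To see data attributes spring into existence, consider this Python 3 sketch. The Counter and Lazy classes are hypothetical examples, not part of fileinfo.py or UserDict.

```python
class Counter:
    def __init__(self):
        self.count = 0     # initialized up front, as the convention recommends

class Lazy:
    def start(self):
        self.count = 0     # exists only after start() has been called

c = Counter()
value = c.count            # qualified by the instance name from outside the class

lazy = Lazy()
try:
    lazy.count             # not assigned yet, so the attribute does not exist
    missing = False
except AttributeError:
    missing = True

lazy.start()
now_present = lazy.count == 0
```

The Lazy instance raises AttributeError until start has run, which is the kind of surprise that initializing everything in __init__ avoids.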
       
       
       
       4 
       
      -The update method is a dictionary duplicator: it copies all the keys and values from one dictionary to another.  This does not clear the target dictionary first; if the target dictionary already has some keys, the ones from the source dictionary will
      -            be overwritten, but others will be left untouched.  Think of update as a merge function, not a copy function.
      +The update method is a dictionary duplicator: it copies all the keys and values from one dictionary to another.  This does not clear the target dictionary first; if the target dictionary already has some keys, the ones from the source dictionary will
      +            be overwritten, but others will be left untouched.  Think of update as a merge function, not a copy function.
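A quick sketch of update behaving as a merge, using plain dictionaries with made-up keys and values:

```python
target = {"name": "kairo.mp3", "genre": 255}
source = {"genre": 31, "year": 2000}

target.update(source)   # copies source's items into target, in place

# "name" is untouched, "genre" is overwritten, "year" is added.
merged = dict(target)
```

The source dictionary itself is left unchanged by the call.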
       
       
       
      @@ -3531,7 +3371,7 @@ class UserDict:              Java and Powerbuilder support function overloading by argument list, i.e. one class can have multiple methods with the same name but a different number of arguments, or arguments of different types.
              Other languages (most notably PL/SQL) even support function overloading by argument name; i.e. one class can have multiple methods with the same name and the same number of arguments of the same type but different argument
             names.  Python supports neither of these; it has no form of function overloading whatsoever.  Methods are defined solely by their name,
      -      and there can be only one method per class with a given name.  So if a descendant class has an __init__ method, it always overrides the ancestor __init__ method, even if the descendant defines it with a different argument list.  And the same rule applies to any other method.
      +      and there can be only one method per class with a given name.  So if a descendant class has an __init__ method, it always overrides the ancestor __init__ method, even if the descendant defines it with a different argument list.  And the same rule applies to any other method.
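The absence of overloading can be demonstrated directly. In this Python 3 sketch, Base, Child, and Twice are hypothetical classes written only to show the rule.

```python
class Base:
    def greet(self):
        return "base"

class Child(Base):
    def greet(self, name="world"):   # different argument list, same name
        return "child says hello, " + name

class Twice:
    def greet(self):
        return "first"
    def greet(self, name):           # replaces the definition above
        return "second " + name

child_result = Child().greet()       # the descendant's version always wins
twice_result = Twice().greet("x")

try:
    Twice().greet()                  # the zero-argument version no longer exists
    old_version_survived = True
except TypeError:
    old_version_survived = False
```

Defining greet a second time in the same class does not add an overload; it simply rebinds the name to the newer function.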
       
       
       
      @@ -3550,11 +3390,11 @@ class UserDict:              Caution
      -
      Always assign an initial value to all of an instance's data attributes in the __init__ method. It will save you hours of debugging later, tracking down AttributeError exceptions because you're referencing uninitialized (and therefore non-existent) attributes. +Always assign an initial value to all of an instance's data attributes in the __init__ method. It will save you hours of debugging later, tracking down AttributeError exceptions because you're referencing uninitialized (and therefore non-existent) attributes.
      -

      Example 5.10. UserDict Normal Methods

      +

      Example 5.10. UserDict Normal Methods

           def clear(self): self.data.clear()          1
           def copy(self):           2
               if self.__class__ is UserDict:          3
      @@ -3569,35 +3409,35 @@ class UserDict:              1 
       
      -clear is a normal class method; it is publicly available to be called by anyone at any time.  Notice that clear, like all class methods, has self as its first argument.  (Remember that you don't include self when you call the method; it's something that Python adds for you.)  Also note the basic technique of this wrapper class: store a real dictionary (data) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding
      -            method on the real dictionary.  (In case you'd forgotten, a dictionary's clear method deletes all of its keys and their associated values.)
      +clear is a normal class method; it is publicly available to be called by anyone at any time.  Notice that clear, like all class methods, has self as its first argument.  (Remember that you don't include self when you call the method; it's something that Python adds for you.)  Also note the basic technique of this wrapper class: store a real dictionary (data) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding
      +            method on the real dictionary.  (In case you'd forgotten, a dictionary's clear method deletes all of its keys and their associated values.)
       
       
       
       2 
       
      -The copy method of a real dictionary returns a new dictionary that is an exact duplicate of the original (all the same key-value pairs).
      -             But UserDict can't simply redirect to self.data.copy, because that method returns a real dictionary, and what you want is to return a new instance that is the same class as self.
      +The copy method of a real dictionary returns a new dictionary that is an exact duplicate of the original (all the same key-value pairs).
      +             But UserDict can't simply redirect to self.data.copy, because that method returns a real dictionary, and what you want is to return a new instance that is the same class as self.
       
       
       
       3 
       
      -You use the __class__ attribute to see if self is a UserDict; if so, you're golden, because you know how to copy a UserDict: just create a new UserDict and give it the real dictionary that you've squirreled away in self.data.  Then you immediately return the new UserDict you don't even get to the import copy on the next line.
      +You use the __class__ attribute to see if self is a UserDict; if so, you're golden, because you know how to copy a UserDict: just create a new UserDict and give it the real dictionary that you've squirreled away in self.data.  Then you immediately return the new UserDict; you don't even get to the import copy on the next line.
       
       
       
       4 
       
      -If self.__class__ is not UserDict, then self must be some subclass of UserDict (like maybe FileInfo), in which case life gets trickier.  UserDict doesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined
      -            in the subclass, so you would need to iterate through them and make sure to copy all of them.  Luckily, Python comes with a module to do exactly this, and it's called copy.  I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own).
      -             Suffice it to say that copy can copy arbitrary Python objects, and that's how you're using it here.
      +If self.__class__ is not UserDict, then self must be some subclass of UserDict (like maybe FileInfo), in which case life gets trickier.  UserDict doesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined
      +            in the subclass, so you would need to iterate through them and make sure to copy all of them.  Luckily, Python comes with a module to do exactly this, and it's called copy.  I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own).
      +             Suffice it to say that copy can copy arbitrary Python objects, and that's how you're using it here.
       
       
       
       5 
       
      -The rest of the methods are straightforward, redirecting the calls to the built-in methods on self.data.
      +The rest of the methods are straightforward, redirecting the calls to the built-in methods on self.data.
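The copy logic described in callouts 2 through 4 can be sketched on its own. This is not the UserDict source; Base and Tagged are hypothetical stand-ins, and the code assumes Python 3.

```python
import copy


class Base:
    def __init__(self, data=None):
        self.data = dict(data or {})

    def copy(self):
        if self.__class__ is Base:
            # You know exactly how to build another Base.
            return Base(self.data)
        # A subclass may carry extra data attributes you know nothing
        # about, so let the copy module duplicate the whole instance.
        return copy.copy(self)


class Tagged(Base):
    def __init__(self, data=None, tag="none"):
        Base.__init__(self, data)
        self.tag = tag


t = Tagged({"name": "kairo.mp3"}, tag="single")
c = t.copy()
print(c.__class__.__name__)  # the copy is the same class as the original
print(c.tag)                 # the subclass-only attribute came along too
```

One thing to keep in mind: copy.copy makes a shallow copy, so the copied instance's attributes refer to the same objects as the original's.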
       
       
       
      @@ -3607,12 +3447,12 @@ class UserDict:              In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries.  To compensate for
      -      this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: UserString, UserList, and UserDict.  Using a combination of normal and special methods, the UserDict class does an excellent imitation of a dictionary.  In Python 2.2 and later, you can inherit classes directly from built-in datatypes like dict.  An example of this is given in the examples that come with this book, in fileinfo_fromdict.py.
      +      this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: UserString, UserList, and UserDict.  Using a combination of normal and special methods, the UserDict class does an excellent imitation of a dictionary.  In Python 2.2 and later, you can inherit classes directly from built-in datatypes like dict.  An example of this is given in the examples that come with this book, in fileinfo_fromdict.py.
       
       
       
      -

      In Python, you can inherit directly from the dict built-in datatype, as shown in this example. There are three differences here compared to the UserDict version. -

      Example 5.11. Inheriting Directly from Built-In Datatype dict

      +

      In Python, you can inherit directly from the dict built-in datatype, as shown in this example. There are three differences here compared to the UserDict version. +

      Example 5.11. Inheriting Directly from Built-In Datatype dict

       class FileInfo(dict):1
           "store file metadata"
           def __init__(self, filename=None): 2
      @@ -3622,20 +3462,20 @@ class FileInfo(dict):
       1 
       
      -The first difference is that you don't need to import the UserDict module, since dict is a built-in datatype and is always available.  The second is that you are inheriting from dict directly, instead of from UserDict.UserDict.
      +The first difference is that you don't need to import the UserDict module, since dict is a built-in datatype and is always available.  The second is that you are inheriting from dict directly, instead of from UserDict.UserDict.
       
       
       
       2 
       
      -The third difference is subtle but important.  Because of the way UserDict works internally, it requires you to manually call its __init__ method to properly initialize its internal data structures.  dict does not work like this; it is not a wrapper, and it requires no explicit initialization.
      +The third difference is subtle but important.  Because of the way UserDict works internally, it requires you to manually call its __init__ method to properly initialize its internal data structures.  dict does not work like this; it is not a wrapper, and it requires no explicit initialization.
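As a rough sketch of what this buys you, here is a hypothetical FileRecord class (a stand-in, not the book's FileInfo) that inherits from dict and never calls an ancestor __init__. It assumes Python 3.

```python
class FileRecord(dict):
    "store file metadata (hypothetical stand-in for FileInfo)"

    def __init__(self, filename=None):
        # No ancestor __init__ call is needed to set up storage;
        # the instance already is a dictionary.
        self["name"] = filename


f = FileRecord("/music/_singles/kairo.mp3")
print(isinstance(f, dict))  # it really is a dictionary
print(f["name"])
print(len(f))
```

Because the instance is a real dictionary, len, keys, and the rest of the dictionary machinery work without any wrapper methods.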
       
       
       
       
      -

      Further Reading on UserDict

      +

      Further Reading on UserDict

      5.6. Special Class Methods

      @@ -3645,7 +3485,7 @@ class FileInfo(dict):get and set items with a syntax that doesn't include explicitly invoking methods. This is where special class methods come in: they provide a way to map non-method-calling syntax into method calls.

      5.6.1. Getting and Setting Items

      -

      Example 5.12. The __getitem__ Special Method

      +

      Example 5.12. The __getitem__ Special Method

           def __getitem__(self, key): return self.data[key]
      >>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3")
       >>> f
       {'name':'/music/_singles/kairo.mp3'}
      @@ -3657,19 +3497,19 @@ provide a way to map non-method-calling syntax into method calls.
       
       1 
       
      -The __getitem__ special method looks simple enough.  Like the normal methods clear, keys, and values, it just redirects to the dictionary to return its value.  But how does it get called?  Well, you can call __getitem__ directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works.  The right way
      -               to use __getitem__ is to get Python to call it for you.
      +The __getitem__ special method looks simple enough.  Like the normal methods clear, keys, and values, it just redirects to the dictionary to return its value.  But how does it get called?  Well, you can call __getitem__ directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works.  The right way
      +               to use __getitem__ is to get Python to call it for you.
       
       
       
       2 
       
      -This looks just like the syntax you would use to get a dictionary value, and in fact it returns the value you would expect.  But here's the missing link: under the covers, Python has converted this syntax to the method call f.__getitem__("name").  That's why __getitem__ is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax.
      +This looks just like the syntax you would use to get a dictionary value, and in fact it returns the value you would expect.  But here's the missing link: under the covers, Python has converted this syntax to the method call f.__getitem__("name").  That's why __getitem__ is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax.
       
       
       
      -

      Of course, Python has a __setitem__ special method to go along with __getitem__, as shown in the next example. -

      Example 5.13. The __setitem__ Special Method

      +

      Of course, Python has a __setitem__ special method to go along with __getitem__, as shown in the next example. +

      Example 5.13. The __setitem__ Special Method

           def __setitem__(self, key, item): self.data[key] = item
      >>> f
       {'name':'/music/_singles/kairo.mp3'}
       >>> f.__setitem__("genre", 31) 1
      @@ -3682,23 +3522,23 @@ provide a way to map non-method-calling syntax into method calls.
       
       1 
       
      -Like the __getitem__ method, __setitem__ simply redirects to the real dictionary self.data to do its work.  And like __getitem__, you wouldn't ordinarily call it directly like this; Python calls __setitem__ for you when you use the right syntax.
      +Like the __getitem__ method, __setitem__ simply redirects to the real dictionary self.data to do its work.  And like __getitem__, you wouldn't ordinarily call it directly like this; Python calls __setitem__ for you when you use the right syntax.
       
       
       
       2 
       
      -This looks like regular dictionary syntax, except of course that f is really a class that's trying very hard to masquerade as a dictionary, and __setitem__ is an essential part of that masquerade.  This line of code actually calls f.__setitem__("genre", 32) under the covers.
      +This looks like regular dictionary syntax, except of course that f is really a class that's trying very hard to masquerade as a dictionary, and __setitem__ is an essential part of that masquerade.  This line of code actually calls f.__setitem__("genre", 32) under the covers.
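If you want to watch Python make these calls for you, here is a small sketch. Wrapper is a hypothetical class, not part of fileinfo, and the calls list exists only to record which special method ran.

```python
class Wrapper:
    def __init__(self):
        self.data = {}
        self.calls = []  # records which special methods were invoked

    def __getitem__(self, key):
        self.calls.append("get")
        return self.data[key]

    def __setitem__(self, key, item):
        self.calls.append("set")
        self.data[key] = item


w = Wrapper()
w["genre"] = 32             # Python turns this into a __setitem__ call
value = w["genre"]          # Python turns this into a __getitem__ call
w.__setitem__("genre", 31)  # legal, but not how you'd normally write it
print(w.calls)
```

The square-bracket syntax and the explicit method call both end up in the same method, which is the whole point.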
       
       
       
      -

      __setitem__ is a special class method because it gets called for you, but it's still a class method. Just as easily as the __setitem__ method was defined in UserDict, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act +

      __setitem__ is a special class method because it gets called for you, but it's still a class method. Just as easily as the __setitem__ method was defined in UserDict, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act like dictionaries in some ways but define their own behavior above and beyond the built-in dictionary.

      This concept is the basis of the entire framework you're studying in this chapter. Each file type can have a handler class that knows how to get metadata from a particular type of file. Once some attributes (like the file's name and location) are - known, the handler class knows how to derive other attributes automatically. This is done by overriding the __setitem__ method, checking for particular keys, and adding additional processing when they are found. -

      For example, MP3FileInfo is a descendant of FileInfo. When an MP3FileInfo's name is set, it doesn't just set the name key (like the ancestor FileInfo does); it also looks in the file itself for MP3 tags and populates a whole set of keys. The next example shows how this works. -

      Example 5.14. Overriding __setitem__ in MP3FileInfo

      +   known, the handler class knows how to derive other attributes automatically.  This is done by overriding the __setitem__ method, checking for particular keys, and adding additional processing when they are found.
      +

      For example, MP3FileInfo is a descendant of FileInfo. When an MP3FileInfo's name is set, it doesn't just set the name key (like the ancestor FileInfo does); it also looks in the file itself for MP3 tags and populates a whole set of keys. The next example shows how this works. +

      Example 5.14. Overriding __setitem__ in MP3FileInfo

           def __setitem__(self, key, item):         1
               if key == "name" and item:            2
                   self.__parse(item)                3
      @@ -3707,27 +3547,27 @@ provide a way to map non-method-calling syntax into method calls.
       
       1 
       
      -Notice that this __setitem__ method is defined exactly the same way as the ancestor method.  This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments.  (Technically speaking,
      +Notice that this __setitem__ method is defined exactly the same way as the ancestor method.  This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments.  (Technically speaking,
                      the names of the arguments don't matter; only the number of arguments is important.)
       
       
       
       2 
       
      -Here's the crux of the entire MP3FileInfo class: if you're assigning a value to the name key, you want to do something extra.
      +Here's the crux of the entire MP3FileInfo class: if you're assigning a value to the name key, you want to do something extra.
       
       
       
       3 
       
      -The extra processing you do for names is encapsulated in the __parse method.  This is another class method defined in MP3FileInfo, and when you call it, you qualify it with self.  Just calling __parse would look for a normal function defined outside the class, which is not what you want.  Calling self.__parse will look for a class method defined within the class.  This isn't anything new; you reference data attributes the same way.
      +The extra processing you do for names is encapsulated in the __parse method.  This is another class method defined in MP3FileInfo, and when you call it, you qualify it with self.  Just calling __parse would look for a normal function defined outside the class, which is not what you want.  Calling self.__parse will look for a class method defined within the class.  This isn't anything new; you reference data attributes the same way.
       
       
       
       4 
       
      -After doing this extra processing, you want to call the ancestor method.  Remember that this is never done for you in Python; you must do it manually.  Note that you're calling the immediate ancestor, FileInfo, even though it doesn't have a __setitem__ method.  That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually
      -               find and call the __setitem__ defined in UserDict.
      +After doing this extra processing, you want to call the ancestor method.  Remember that this is never done for you in Python; you must do it manually.  Note that you're calling the immediate ancestor, FileInfo, even though it doesn't have a __setitem__ method.  That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually
      +               find and call the __setitem__ defined in UserDict.
       
       
       
      @@ -3736,11 +3576,11 @@ provide a way to map non-method-calling syntax into method calls.
       Note
       
       
      -When accessing data attributes within a class, you need to qualify the attribute name: self.attribute.  When calling other methods within a class, you need to qualify the method name: self.method.
      +When accessing data attributes within a class, you need to qualify the attribute name: self.attribute.  When calling other methods within a class, you need to qualify the method name: self.method.
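The pattern in Example 5.14 can be sketched without any MP3 parsing. Everything here is hypothetical: Info and TrackInfo stand in for FileInfo and MP3FileInfo, Info inherits from dict instead of UserDict, and the made-up __parse just records the file extension. The code assumes Python 3.

```python
class Info(dict):
    "intermediate class with no __setitem__ of its own"


class TrackInfo(Info):
    def __setitem__(self, key, item):
        if key == "name" and item:
            # Qualify the method with self, just like a data attribute.
            self.__parse(item)
        # Call the ancestor method manually. Info doesn't define
        # __setitem__, so Python keeps walking up and finds dict's.
        Info.__setitem__(self, key, item)

    def __parse(self, filename):
        # Made-up extra processing: derive one more key from the name.
        self["extension"] = filename.rsplit(".", 1)[-1]


t = TrackInfo()
t["name"] = "kairo.mp3"
print(sorted(t.items()))
```

Setting the name key quietly populates extension as well. Setting any other key skips the extra processing and goes straight to the ancestor method.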
       
       
       
      -

      Example 5.15. Setting an MP3FileInfo's name

      >>> import fileinfo
      +

      Example 5.15. Setting an MP3FileInfo's name

      >>> import fileinfo
       >>> mp3file = fileinfo.MP3FileInfo() 1
       >>> mp3file
       {'name':None}
      @@ -3758,29 +3598,29 @@ provide a way to map non-method-calling syntax into method calls.
       
       1 
       
      -First, you create an instance of MP3FileInfo, without passing it a filename.  (You can get away with this because the filename argument of the __init__ method is optional.)  Since MP3FileInfo has no __init__ method of its own, Python walks up the ancestor tree and finds the __init__ method of FileInfo.  This __init__ method manually calls the __init__ method of UserDict and then sets the name key to filename, which is None, since you didn't pass a filename.  Thus, mp3file initially looks like a dictionary with one key, name, whose value is None.
      +First, you create an instance of MP3FileInfo, without passing it a filename.  (You can get away with this because the filename argument of the __init__ method is optional.)  Since MP3FileInfo has no __init__ method of its own, Python walks up the ancestor tree and finds the __init__ method of FileInfo.  This __init__ method manually calls the __init__ method of UserDict and then sets the name key to filename, which is None, since you didn't pass a filename.  Thus, mp3file initially looks like a dictionary with one key, name, whose value is None.
                      
       
       
       
       2 
       
      -Now the real fun begins.  Setting the name key of mp3file triggers the __setitem__ method on MP3FileInfo (not UserDict), which notices that you're setting the name key with a real value and calls self.__parse.  Although you haven't traced through the __parse method yet, you can see from the output that it sets several other keys: album, artist, genre, title, year, and comment.
      +Now the real fun begins.  Setting the name key of mp3file triggers the __setitem__ method on MP3FileInfo (not UserDict), which notices that you're setting the name key with a real value and calls self.__parse.  Although you haven't traced through the __parse method yet, you can see from the output that it sets several other keys: album, artist, genre, title, year, and comment.
                      
       
       
       
       3 
       
      -Modifying the name key will go through the same process again: Python calls __setitem__, which calls self.__parse, which sets all the other keys.
      +Modifying the name key will go through the same process again: Python calls __setitem__, which calls self.__parse, which sets all the other keys.
                      
       
       
       
       

      5.7. Advanced Special Class Methods

      -

      Python has more special methods than just __getitem__ and __setitem__. Some of them let you emulate functionality that you may not even know about. -

      This example shows some of the other special methods in UserDict. -

      Example 5.16. More Special Methods in UserDict

      +

      Python has more special methods than just __getitem__ and __setitem__. Some of them let you emulate functionality that you may not even know about. +

      This example shows some of the other special methods in UserDict. +

      Example 5.16. More Special Methods in UserDict

           def __repr__(self): return repr(self.data)     1
           def __cmp__(self, dict):     2
               if isinstance(dict, UserDict):            
      @@ -3793,29 +3633,29 @@ provide a way to map non-method-calling syntax into method calls.
       
       1 
       
      -__repr__ is a special method that is called when you call repr(instance).  The repr function is a built-in function that returns a string representation of an object.  It works on any object, not just class
      -            instances.  You're already intimately familiar with repr and you don't even know it.  In the interactive window, when you type just a variable name and press the ENTER key, Python uses repr to display the variable's value.  Go create a dictionary d with some data and then print repr(d) to see for yourself.
      +__repr__ is a special method that is called when you call repr(instance).  The repr function is a built-in function that returns a string representation of an object.  It works on any object, not just class
      +            instances.  You're already intimately familiar with repr and you don't even know it.  In the interactive window, when you type just a variable name and press the ENTER key, Python uses repr to display the variable's value.  Go create a dictionary d with some data and then print repr(d) to see for yourself.
       
       
       
       2 
       
      -__cmp__ is called when you compare class instances.  In general, you can compare any two Python objects, not just class instances, by using ==.  There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they
      +__cmp__ is called when you compare class instances.  In general, you can compare any two Python objects, not just class instances, by using ==.  There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they
                   have all the same keys and values, and strings are equal when they are the same length and contain the same sequence of characters.
      -             For class instances, you can define the __cmp__ method and code the comparison logic yourself, and then you can use == to compare instances of your class and Python will call your __cmp__ special method for you.
      +             For class instances, you can define the __cmp__ method and code the comparison logic yourself, and then you can use == to compare instances of your class and Python will call your __cmp__ special method for you.
       
       
       
       3 
       
      -__len__ is called when you call len(instance).  The len function is a built-in function that returns the length of an object.  It works on any object that could reasonably be thought
      -            of as having a length.  The len of a string is its number of characters; the len of a dictionary is its number of keys; the len of a list or tuple is its number of elements.  For class instances, define the __len__ method and code the length calculation yourself, and then call len(instance) and Python will call your __len__ special method for you.
      +__len__ is called when you call len(instance).  The len function is a built-in function that returns the length of an object.  It works on any object that could reasonably be thought
      +            of as having a length.  The len of a string is its number of characters; the len of a dictionary is its number of keys; the len of a list or tuple is its number of elements.  For class instances, define the __len__ method and code the length calculation yourself, and then call len(instance) and Python will call your __len__ special method for you.
       
       
       
       4 
       
      -__delitem__ is called when you call del instance[key], which you may remember as the way to delete individual items from a dictionary.  When you use del on a class instance, Python calls the __delitem__ special method for you.
      +__delitem__ is called when you call del instance[key], which you may remember as the way to delete individual items from a dictionary.  When you use del on a class instance, Python calls the __delitem__ special method for you.
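Here is a short sketch of three of these special methods on a class that is not a dictionary wrapper at all. Playlist is a hypothetical class. The sketch assumes Python 3 and leaves out __cmp__, which Python 3 no longer uses.

```python
class Playlist:
    def __init__(self, tracks):
        self.tracks = list(tracks)

    def __repr__(self):
        # Called by repr(instance) and by the interactive shell.
        return "Playlist(%r)" % (self.tracks,)

    def __len__(self):
        # Called by len(instance).
        return len(self.tracks)

    def __delitem__(self, index):
        # Called by del instance[index].
        del self.tracks[index]


p = Playlist(["a.mp3", "b.mp3", "c.mp3"])
del p[0]
print(repr(p))
print(len(p))
```

You never call __repr__, __len__, or __delitem__ by name here; the built-in functions and the del statement do it for you.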
       
       
       
      @@ -3828,20 +3668,20 @@ provide a way to map non-method-calling syntax into method calls.
       
       
       
      -

      At this point, you may be thinking, “All this work just to do something in a class that I can do with a built-in datatype.” And it's true that life would be easier (and the entire UserDict class would be unnecessary) if you could inherit from built-in datatypes like a dictionary. But even if you could, special -methods would still be useful, because they can be used in any class, not just wrapper classes like UserDict. -

      Special methods mean that any class can store key/value pairs like a dictionary, just by defining the __setitem__ method. Any class can act like a sequence, just by defining the __getitem__ method. Any class that defines the __cmp__ method can be compared with ==. And if your class represents something that has a length, don't define a GetLength method; define the __len__ method and use len(instance). +

      At this point, you may be thinking, “All this work just to do something in a class that I can do with a built-in datatype.” And it's true that life would be easier (and the entire UserDict class would be unnecessary) if you could inherit from built-in datatypes like a dictionary. But even if you could, special +methods would still be useful, because they can be used in any class, not just wrapper classes like UserDict. +

      Special methods mean that any class can store key/value pairs like a dictionary, just by defining the __setitem__ method. Any class can act like a sequence, just by defining the __getitem__ method. Any class that defines the __cmp__ method can be compared with ==. And if your class represents something that has a length, don't define a GetLength method; define the __len__ method and use len(instance).

      -
      Note
      While other object-oriented languages only let you define the physical model of an object (“this object has a GetLength method”), Python's special class methods like __len__ allow you to define the logical model of an object (“this object has a length”). +While other object-oriented languages only let you define the physical model of an object (“this object has a GetLength method”), Python's special class methods like __len__ allow you to define the logical model of an object (“this object has a length”).

      Python has a lot of other special methods. There's a whole set of them that let classes act like numbers, allowing you to add, subtract, and do other arithmetic operations on class instances. (The canonical example of this is a class that represents -complex numbers, numbers with both real and imaginary components.) The __call__ method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods +complex numbers, numbers with both real and imaginary components.) The __call__ method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods that allow classes to have read-only and write-only data attributes; you'll learn more about those in later chapters.

      Further Reading on Special Class Methods

      @@ -3881,13 +3721,13 @@ class MP3FileInfo(FileInfo):
       1 
      -MP3FileInfo is the class itself, not any particular instance of the class.
      +MP3FileInfo is the class itself, not any particular instance of the class.
       2 
      -tagDataMap is a class attribute: literally, an attribute of the class.  It is available before creating any instances of the class.
      +tagDataMap is a class attribute: literally, an attribute of the class.  It is available before creating any instances of the class.
      @@ -3901,11 +3741,11 @@ class MP3FileInfo(FileInfo):
       Note
      -In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the static keyword, one without).  In Python, only class attributes can be defined here; data attributes are defined in the __init__ method.
      +In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the static keyword, one without).  In Python, only class attributes can be defined here; data attributes are defined in the __init__ method.
      -

      Class attributes can be used as class-level constants (which is how you use them in MP3FileInfo), but they are not really constants. You can also change them. +

      Class attributes can be used as class-level constants (which is how you use them in MP3FileInfo), but they are not really constants. You can also change them.

      @@ -3939,32 +3779,32 @@ class MP3FileInfo(FileInfo): - - - - -
      Note
      1 count is a class attribute of the counter class. +count is a class attribute of the counter class.
      2 __class__ is a built-in attribute of every class instance (of every class). It is a reference to the class that self is an instance of (in this case, the counter class). +__class__ is a built-in attribute of every class instance (of every class). It is a reference to the class that self is an instance of (in this case, the counter class).
      3 Because count is a class attribute, it is available through direct reference to the class, before you have created any instances of the +Because count is a class attribute, it is available through direct reference to the class, before you have created any instances of the class.
      4 Creating an instance of the class calls the __init__ method, which increments the class attribute count by 1. This affects the class itself, not just the newly created instance. +Creating an instance of the class calls the __init__ method, which increments the class attribute count by 1. This affects the class itself, not just the newly created instance.
      5 Creating a second instance will increment the class attribute count again. Notice how the class attribute is shared by the class and all instances of the class. +Creating a second instance will increment the class attribute count again. Notice how the class attribute is shared by the class and all instances of the class.
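The behavior walked through in these callouts can be sketched as follows. The class is reconstructed from the descriptions above, not copied from the book's listing, and the variable names before, c, and d are made up. It assumes Python 3.

```python
class counter:
    count = 0  # class attribute: belongs to the class itself

    def __init__(self):
        # Reach the class through the instance and bump the shared value.
        self.__class__.count += 1


before = counter.count  # available before any instance exists
c = counter()
d = counter()
print(before)
print(counter.count)     # incremented once per instance created
print(c.count, d.count)  # every instance sees the same shared value
```

There is only one count, and it lives on the class. The instances simply look it up there.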
      @@ -3980,13 +3820,13 @@ class MP3FileInfo(FileInfo):

      If the name of a Python function, class method, or attribute starts with (but doesn't end with) two underscores, it's private; everything else is public. Python has no concept of protected class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible only in their own class) or public (accessible from anywhere). -

      In MP3FileInfo, there are two methods: __parse and __setitem__. As you have already seen, __setitem__ is a special method; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could -call it directly (even from outside the fileinfo module) if you had a really good reason. However, __parse is private, because it has two underscores at the beginning of its name. +

      In MP3FileInfo, there are two methods: __parse and __setitem__. As you have already seen, __setitem__ is a special method; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could +call it directly (even from outside the fileinfo module) if you had a really good reason. However, __parse is private, because it has two underscores at the beginning of its name.

      - @@ -4003,7 +3843,7 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse' @@ -4015,17 +3855,17 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse'

      5.10. Summary

      -

      That's it for the hard-core object trickery. You'll see a real-world application of special class methods in Chapter 12, which uses getattr to create a proxy to a remote web service. +

      That's it for the hard-core object trickery. You'll see a real-world application of special class methods in Chapter 12, which uses getattr to create a proxy to a remote web service.

      The next chapter will continue using this code sample to explore other Python concepts, such as exceptions, file objects, and for loops.

      Before diving into the next chapter, make sure you're comfortable doing all of these things:

        -
      • Importing modules using either import module or from module import +
      • Importing modules using either import module or from module import
      • Defining and instantiating classes -
      • Defining __init__ methods and other special class methods, and understanding when they are called +
      • Defining __init__ methods and other special class methods, and understanding when they are called -
      • Subclassing UserDict to define classes that act like dictionaries +
      • Subclassing UserDict to define classes that act like dictionaries
      • Defining data attributes and class attributes, and understanding the differences between them @@ -4033,7 +3873,7 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse'

        Chapter 6. Exceptions and File Handling

        -

        In this chapter, you will dive into exceptions, file objects, for loops, and the os and sys modules. If you've used exceptions in another programming language, you can skim the first section to get a sense of Python's syntax. Be sure to tune in again for file handling. +

        In this chapter, you will dive into exceptions, file objects, for loops, and the os and sys modules. If you've used exceptions in another programming language, you can skim the first section to get a sense of Python's syntax. Be sure to tune in again for file handling.

        6.1. Handling Exceptions

        Like many other programming languages, Python has exception handling via try...except blocks.

      Note
      In Python, all special methods (like __setitem__) and built-in attributes (like __doc__) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and +In Python, all special methods (like __setitem__) and built-in attributes (like __doc__) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and attributes this way, because it will only confuse you (and others) later.
      If you try to call a private method, Python will raise a slightly misleading exception, saying that the method does not exist. Of course it does exist, but it's private, so it's not accessible outside the class. Strictly speaking, private methods are accessible outside their class, just not easily accessible. Nothing in Python is truly private; internally, the names of private methods and attributes are mangled and unmangled on the fly to make them - seem inaccessible by their given names. You can access the __parse method of the MP3FileInfo class by the name _MP3FileInfo__parse. Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a + seem inaccessible by their given names. You can access the __parse method of the MP3FileInfo class by the name _MP3FileInfo__parse. Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a reason, but like many other things in Python, their privateness is ultimately a matter of convention, not force.
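To see the name mangling in action without the MP3FileInfo class at hand, here is a minimal sketch. The Demo class and its methods are hypothetical stand-ins invented for illustration; only the double-underscore behavior itself comes from the text above. It is written for a current Python 3 interpreter.

```python
class Demo:
    def __parse(self):             # two leading underscores: private
        return "parsed"

    def run(self):
        return self.__parse()      # calling it from inside the class works

d = Demo()
inside = d.run()

try:
    d.__parse()                    # the given name is not found from outside
    outside = "callable"
except AttributeError:
    outside = "AttributeError"

mangled = d._Demo__parse()         # the mangled name works; never rely on this
```

The last line is the trick described above: it works, it is interesting, and it has no place in real code.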
      @@ -4047,15 +3887,15 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse'Exceptions are everywhere in Python. Virtually every module in the standard Python library uses them, and Python itself will raise them in a lot of different circumstances. You've already seen them repeatedly throughout this book.

      In each of these cases, you were simply playing around in the Python IDE: an error occurred, the exception was printed (depending on your IDE, perhaps in an intentionally jarring shade of red), and that was that. This is called an unhandled exception. When the exception was raised, there was no code to explicitly notice it and deal with it, so it bubbled its @@ -4079,7 +3919,7 @@ This line will always print

      - @@ -4091,14 +3931,14 @@ This line will always print
      -
      1 Using the built-in open function, you can try to open a file for reading (more on open in the next section). But the file doesn't exist, so this raises the IOError exception. Since you haven't provided any explicit check for an IOError exception, Python just prints out some debugging information about what happened and then gives up. +Using the built-in open function, you can try to open a file for reading (more on open in the next section). But the file doesn't exist, so this raises the IOError exception. Since you haven't provided any explicit check for an IOError exception, Python just prints out some debugging information about what happened and then gives up.
      3 When the open function raises an IOError exception, you're ready for it. The except IOError: line catches the exception and executes your own block of code, which in this case just prints a more pleasant error message. +When the open function raises an IOError exception, you're ready for it. The except IOError: line catches the exception and executes your own block of code, which in this case just prints a more pleasant error message.
      4 Once an exception has been handled, processing continues normally on the first line after the try...except block. Note that this line will always print, whether or not an exception occurs. If you really did have a file called -notthere in your root directory, the call to open would succeed, the except clause would be ignored, and this line would still be executed. +notthere in your root directory, the call to open would succeed, the except clause would be ignored, and this line would still be executed.
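The handled case described in these notes can be sketched as follows. This is an illustration under assumptions, not the book's listing: the path is built inside a fresh temporary directory so that the file is guaranteed to be missing, the messages are collected in a list instead of being printed, and the wording of the first message is made up. It runs on Python 3, where IOError is an alias of OSError.

```python
import os
import tempfile

# Hypothetical path inside a new, empty temporary directory: the file cannot exist
missing = os.path.join(tempfile.mkdtemp(), "notthere")

messages = []
try:
    fsock = open(missing, "r")     # raises IOError because the file is missing
    fsock.close()                  # only reached if the open succeeded
except IOError:
    messages.append("The file does not exist, exiting gracefully")
messages.append("This line will always print")
```

Whether or not the open call succeeds, the final append runs, which is the point of note 4.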
      @@ -4109,11 +3949,11 @@ line that you would need to trace back to the source. I'm sure you've experienc exceptions, errors occur immediately, and you can handle them in a standard way at the source of the problem.

      6.1.1. Using Exceptions For Other Purposes

      There are a lot of other uses for exceptions besides handling actual error conditions. A common use in the standard Python library is to try to import a module, and then check whether it worked. Importing a module that does not exist will raise - an ImportError exception. You can use this to define multiple levels of functionality based on which modules are available at run-time, + an ImportError exception. You can use this to define multiple levels of functionality based on which modules are available at run-time, or to support multiple platforms (where platform-specific code is separated into different modules). -

      You can also define your own exceptions by creating a class that inherits from the built-in Exception class, and then raise your exceptions with the raise command. See the further reading section if you're interested in doing this. +

      You can also define your own exceptions by creating a class that inherits from the built-in Exception class, and then raise your exceptions with the raise command. See the further reading section if you're interested in doing this.

      The next example demonstrates how to use an exception to support platform-specific functionality. This code comes from the -getpass module, a wrapper module for getting a password from the user. Getting a password is accomplished differently on UNIX, Windows, and Mac OS platforms, but this code encapsulates all of those differences. +getpass module, a wrapper module for getting a password from the user. Getting a password is accomplished differently on UNIX, Windows, and Mac OS platforms, but this code encapsulates all of those differences.

      Example 6.2. Supporting Platform-Specific Functionality

         # Bind the name getpass to the appropriate function
         try:
      @@ -4136,34 +3976,34 @@ exceptions, errors occur immediately, and you can handle them in a standard way
       
       1 
       
      -termios is a UNIX-specific module that provides low-level control over the input terminal.  If this module is not available (because it's not
      -               on your system, or your system doesn't support it), the import fails and Python raises an ImportError, which you catch.
      +termios is a UNIX-specific module that provides low-level control over the input terminal.  If this module is not available (because it's not
      +               on your system, or your system doesn't support it), the import fails and Python raises an ImportError, which you catch.
       
       
       
       2 
       
      -OK, you didn't have termios, so let's try msvcrt, which is a Windows-specific module that provides an API to many useful functions in the Microsoft Visual C++ runtime services.  If this import fails, Python will raise an ImportError, which you catch.
      +OK, you didn't have termios, so let's try msvcrt, which is a Windows-specific module that provides an API to many useful functions in the Microsoft Visual C++ runtime services.  If this import fails, Python will raise an ImportError, which you catch.
       
       
       
       3 
       
      -If the first two didn't work, you try to import a function from EasyDialogs, which is a Mac OS-specific module that provides functions to pop up dialog boxes of various types.  Once again, if this import fails, Python will raise an ImportError, which you catch.
      +If the first two didn't work, you try to import a function from EasyDialogs, which is a Mac OS-specific module that provides functions to pop up dialog boxes of various types.  Once again, if this import fails, Python will raise an ImportError, which you catch.
       
       
       
       4 
       
       None of these platform-specific modules is available (which is possible, since Python has been ported to a lot of different platforms), so you need to fall back on a default password input function (which is
      -               defined elsewhere in the getpass module).  Notice what you're doing here: assigning the function default_getpass to the variable getpass.  If you read the official getpass documentation, it tells you that the getpass module defines a getpass function.  It does this by binding getpass to the correct function for your platform.  Then when you call the getpass function, you're really calling a platform-specific function that this code has set up for you.  You don't need to know or
      -               care which platform your code is running on -- just call getpass, and it will always do the right thing.
      +               defined elsewhere in the getpass module).  Notice what you're doing here: assigning the function default_getpass to the variable getpass.  If you read the official getpass documentation, it tells you that the getpass module defines a getpass function.  It does this by binding getpass to the correct function for your platform.  Then when you call the getpass function, you're really calling a platform-specific function that this code has set up for you.  You don't need to know or
      +               care which platform your code is running on -- just call getpass, and it will always do the right thing.
       
       
       
       5 
       
      -A try...except block can have an else clause, like an if statement.  If no exception is raised during the try block, the else clause is executed afterwards.  In this case, that means that the from EasyDialogs import AskPassword import worked, so you should bind getpass to the AskPassword function.  Each of the other try...except blocks has similar else clauses to bind getpass to the appropriate function when you find an import that works.
      +A try...except block can have an else clause, like an if statement.  If no exception is raised during the try block, the else clause is executed afterwards.  In this case, that means that the from EasyDialogs import AskPassword import worked, so you should bind getpass to the AskPassword function.  Each of the other try...except blocks has similar else clauses to bind getpass to the appropriate function when you find an import that works.
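The same import-and-fall-back pattern works for any optional module. The sketch below is a small analogue of the idea, not code from getpass: hypothetical_fast_math is an invented module name that is assumed not to be installed, and fallback_sqrt and square_root are illustrative names. It is written for Python 3.

```python
def fallback_sqrt(x):
    # Default implementation, used when no optional module is available
    return x ** 0.5

try:
    # Hypothetical optional module; assumed to be absent on your system
    from hypothetical_fast_math import sqrt as fast_sqrt
except ImportError:
    square_root = fallback_sqrt    # import failed: bind the name to the default
else:
    square_root = fast_sqrt        # import worked: bind the name to the fast version

result = square_root(9.0)
```

Callers only ever use square_root; they do not need to know or care which implementation ended up behind the name.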
       
       
       
      @@ -4176,13 +4016,13 @@ exceptions, errors occur immediately, and you can handle them in a standard way
       
       
    11. Python Library Reference documents the getpass module. -
    12. Python Library Reference documents the traceback module, which provides low-level access to exception attributes after an exception is raised. +
    13. Python Library Reference documents the traceback module, which provides low-level access to exception attributes after an exception is raised.
    14. Python Reference Manual discusses the inner workings of the try...except block.

      6.2. Working with File Objects

      -

      Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file. +

      Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file.

      Example 6.3. Opening a File

      >>> f = open("/music/_singles/kairo.mp3", "rb") 1
       >>> f       2
       <open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
      @@ -4194,7 +4034,7 @@ exceptions, errors occur immediately, and you can handle them in a standard way
       
       1 
       
      -The open function can take up to three parameters: a filename, a mode, and a buffering parameter.  Only the first one, the filename,
      +The open function can take up to three parameters: a filename, a mode, and a buffering parameter.  Only the first one, the filename,
                   is required; the other two are optional.  If not specified, the file is opened for reading in text mode.  Here you are opening the file for reading in binary mode.
                    (print open.__doc__ displays a great explanation of all the possible modes.)
       
      @@ -4202,19 +4042,19 @@ exceptions, errors occur immediately, and you can handle them in a standard way
       
       2 
       
      -The open function returns an object (by now, this should not surprise you).  A file object has several useful attributes.
      +The open function returns an object (by now, this should not surprise you).  A file object has several useful attributes.
       
       
       
       3 
       
      -The mode attribute of a file object tells you in which mode the file was opened.
      +The mode attribute of a file object tells you in which mode the file was opened.
       
       
       
       4 
       
      -The name attribute of a file object tells you the name of the file that the file object has open.
      +The name attribute of a file object tells you the name of the file that the file object has open.
       
       
       
      @@ -4238,35 +4078,35 @@ Rave Mix    2000http://mp3.com/DJMARYJANE     \037'
       
       1 
       
      -A file object maintains state about the file it has open.  The tell method of a file object tells you your current position in the open file.  Since you haven't done anything with this file
      -               yet, the current position is 0, which is the beginning of the file.
      +A file object maintains state about the file it has open.  The tell method of a file object tells you your current position in the open file.  Since you haven't done anything with this file
      +               yet, the current position is 0, which is the beginning of the file.
       
       
       
       2 
       
      -The seek method of a file object moves to another position in the open file.  The second parameter specifies what the first one means;
      -0 means move to an absolute position (counting from the start of the file), 1 means move to a relative position (counting from the current position), and 2 means move to a position relative to the end of the file.  Since the MP3 tags you're looking for are stored at the end of the file, you use 2 and tell the file object to move to a position 128 bytes from the end of the file.
      +The seek method of a file object moves to another position in the open file.  The second parameter specifies what the first one means;
      +0 means move to an absolute position (counting from the start of the file), 1 means move to a relative position (counting from the current position), and 2 means move to a position relative to the end of the file.  Since the MP3 tags you're looking for are stored at the end of the file, you use 2 and tell the file object to move to a position 128 bytes from the end of the file.
       
       
       
       3 
       
      -The tell method confirms that the current file position has moved.
      +The tell method confirms that the current file position has moved.
       
       
       
       4 
       
      -The read method reads a specified number of bytes from the open file and returns a string with the data that was read.  The optional
      -               parameter specifies the maximum number of bytes to read.  If no parameter is specified, read will read until the end of the file.  (You could have simply said read() here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.)  The read data
      -               is assigned to the tagData variable, and the current position is updated based on how many bytes were read.
      +The read method reads a specified number of bytes from the open file and returns a string with the data that was read.  The optional
      +               parameter specifies the maximum number of bytes to read.  If no parameter is specified, read will read until the end of the file.  (You could have simply said read() here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.)  The read data
      +               is assigned to the tagData variable, and the current position is updated based on how many bytes were read.
       
       
       
       5 
       
      -The tell method confirms that the current position has moved.  If you do the math, you'll see that after reading 128 bytes, the position
      +The tell method confirms that the current position has moved.  If you do the math, you'll see that after reading 128 bytes, the position
                      has been incremented by 128.
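If you don't have an MP3 file handy, you can follow the same tell, seek, and read steps on a throwaway file. This sketch assumes a made-up 200-byte file created in a temporary directory; the file name and contents are placeholders, and the code is written for Python 3.

```python
import os
import tempfile

# Build a 200-byte stand-in file; size and contents are invented for this sketch
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
out = open(path, "wb")
out.write(b"x" * 72 + b"T" * 128)
out.close()

f = open(path, "rb")
start = f.tell()          # nothing read yet, so you are at the beginning
f.seek(-128, 2)           # 2 means "relative to the end of the file"
after_seek = f.tell()     # 200 - 128
tagData = f.read(128)     # read at most 128 bytes
after_read = f.tell()     # the position advanced by the number of bytes read
f.close()
```

Doing the math from note 5: the position after the read is the position after the seek plus 128.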
       
       
      @@ -4301,40 +4141,40 @@ ValueError: I/O operation on closed file
       
       1 
       
      -The closed attribute of a file object indicates whether the object has a file open or not.  In this case, the file is still open (closed is False).
      +The closed attribute of a file object indicates whether the object has a file open or not.  In this case, the file is still open (closed is False).
       
       
       
       2 
       
      -To close a file, call the close method of the file object.  This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any)
      +To close a file, call the close method of the file object.  This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any)
                      that the system hadn't gotten around to actually writing yet, and releases the system resources.
       
       
       
       3 
       
      -The closed attribute confirms that the file is closed.
      +The closed attribute confirms that the file is closed.
       
       
       
       4 
       
      -Just because a file is closed doesn't mean that the file object ceases to exist.  The variable f will continue to exist until it goes out of scope or gets manually deleted.  However, none of the methods that manipulate an open file will work once the file has been closed;
      +Just because a file is closed doesn't mean that the file object ceases to exist.  The variable f will continue to exist until it goes out of scope or gets manually deleted.  However, none of the methods that manipulate an open file will work once the file has been closed;
                      they all raise an exception.
       
       
       
       5 
       
      -Calling close on a file object whose file is already closed does not raise an exception; it fails silently.
      +Calling close on a file object whose file is already closed does not raise an exception; it fails silently.
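These five notes can be replayed on any file. The sketch below uses a small temporary file (a placeholder created only for this purpose) and records what happens at each step. It is written for Python 3.

```python
import os
import tempfile

# Throwaway file so the sketch does not depend on anything on your disk
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
out = open(path, "wb")
out.write(b"some data")
out.close()

f = open(path, "rb")
open_state = f.closed              # the file is still open
f.close()
closed_state = f.closed            # now it is closed

try:
    f.read()                       # methods that need an open file now fail
    read_result = "read worked"
except ValueError:
    read_result = "ValueError"

f.close()                          # closing again is harmless; no exception
```

The variable f still exists after the close; it just refers to a file object that can no longer do anything useful.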
       
       
       
       

      6.2.3. Handling I/O Errors

      -

      Now you've seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle +

      Now you've seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle errors. -

      Example 6.6. File Objects in MP3FileInfo

      +

      Example 6.6. File Objects in MP3FileInfo

               try:              1
                   fsock = open(filename, "rb", 0) 2
                   try:         
      @@ -4357,31 +4197,31 @@ ValueError: I/O operation on closed file
       
       2 
       
      -The open function may raise an IOError.  (Maybe the file doesn't exist.)
      +The open function may raise an IOError.  (Maybe the file doesn't exist.)
       
       
       
       3 
       
      -The seek method may raise an IOError.  (Maybe the file is smaller than 128 bytes.)
      +The seek method may raise an IOError.  (Maybe the file is smaller than 128 bytes.)
       
       
       
       4 
       
      -The read method may raise an IOError.  (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.)
      +The read method may raise an IOError.  (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.)
       
       
       
       5 
       
      -This is new: a try...finally block.  Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods.  That's what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception.  Think of it as code that gets executed on the way out, regardless of what happened before.
      +This is new: a try...finally block.  Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods.  That's what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception.  Think of it as code that gets executed on the way out, regardless of what happened before.
       
       
       
       6 
       
      -At last, you handle your IOError exception.  This could be the IOError exception raised by the call to open, seek, or read.  Here, you really don't care, because all you're going to do is ignore it silently and continue.  (Remember, pass is a Python statement that does nothing.)  That's perfectly legal; “handling” an exception can mean explicitly doing nothing.  It still counts as handled, and processing will continue normally on the
      +At last, you handle your IOError exception.  This could be the IOError exception raised by the call to open, seek, or read.  Here, you really don't care, because all you're going to do is ignore it silently and continue.  (Remember, pass is a Python statement that does nothing.)  That's perfectly legal; “handling” an exception can mean explicitly doing nothing.  It still counts as handled, and processing will continue normally on the
                      next line of code after the try...except block.
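Here is the nesting of try...finally inside try...except pulled out into a standalone function. This is a sketch under assumptions: read_tag is a hypothetical helper name (the book's code lives inside a method of MP3FileInfo), the test file is a made-up 200-byte file, and the code targets Python 3, where IOError is an alias of OSError.

```python
import os
import tempfile

def read_tag(filename):
    """Return the last 128 bytes of a file, or None if anything goes wrong."""
    tagdata = None
    try:
        fsock = open(filename, "rb", 0)    # may raise IOError
        try:
            fsock.seek(-128, 2)            # may raise IOError
            tagdata = fsock.read(128)      # may raise IOError
        finally:
            fsock.close()                  # always runs once the file is open
    except IOError:
        pass                               # handled by explicitly doing nothing
    return tagdata

# Try it on a made-up 200-byte file and on a path that does not exist
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.bin")
out = open(path, "wb")
out.write(b"x" * 72 + b"T" * 128)
out.close()

found = read_tag(path)
missing = read_tag(os.path.join(folder, "notthere"))
```

Note that the finally block sits inside the outer try, so the file is closed before the except clause gets its chance to swallow the error.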
       
       
      @@ -4412,33 +4252,33 @@ test succeededline 2
       
       1 
       
      -You start boldly by creating the new file test.log (or overwriting the existing file) and opening it for writing.  (The second parameter "w" means open the file for writing.)  Yes, that's all as dangerous as it sounds.  I hope you didn't care about the previous
      +You start boldly by creating the new file test.log (or overwriting the existing file) and opening it for writing.  (The second parameter "w" means open the file for writing.)  Yes, that's all as dangerous as it sounds.  I hope you didn't care about the previous
                      contents of that file, because it's gone now.
       
       
       
       2 
       
      -You can add data to the newly opened file with the write method of the file object returned by open.
      +You can add data to the newly opened file with the write method of the file object returned by open.
       
       
       
       3 
       
      -file is a synonym for open.  This one-liner opens the file, reads its contents, and prints them.
      +file is a synonym for open.  This one-liner opens the file, reads its contents, and prints them.
       
       
       
       4 
       
      -You happen to know that test.log exists (since you just finished writing to it), so you can open it and append to it.  (The "a" parameter means open the file for appending.)  Actually you could do this even if the file didn't exist, because opening
      +You happen to know that test.log exists (since you just finished writing to it), so you can open it and append to it.  (The "a" parameter means open the file for appending.)  Actually you could do this even if the file didn't exist, because opening
                      the file for appending will create the file if necessary.  But appending will never harm the existing contents of the file.
       
       
       
       5 
       
      -As you can see, both the original line you wrote and the second line you appended are now in test.log.  Also note that line breaks are not included.  Since you didn't write them explicitly to the file either time, the
      +As you can see, both the original line you wrote and the second line you appended are now in test.log.  Also note that line breaks are not included.  Since you didn't write them explicitly to the file either time, the
                      file doesn't include them.  You can write a line break with the "\n" (newline) character.  Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line.
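The write-then-append sequence is easy to try for yourself. This sketch puts test.log in a temporary directory (an assumption, so that nothing you care about gets overwritten) and closes each file object explicitly. It is written for Python 3.

```python
import os
import tempfile

logpath = os.path.join(tempfile.mkdtemp(), "test.log")

logfile = open(logpath, "w")       # "w" creates the file or wipes an existing one
logfile.write("test succeeded")
logfile.close()

logfile = open(logpath, "a")       # "a" appends and never harms existing contents
logfile.write("line 2")
logfile.close()

reader = open(logpath)
contents = reader.read()           # both writes, with nothing between them
reader.close()

logfile = open(logpath, "a")
logfile.write("\nline 3")          # an explicit "\n" starts a new line
logfile.close()

reader = open(logpath)
lines = reader.read().split("\n")
reader.close()
```

Only the third write contains a "\n", so the file ends up with two lines: the first two writes run together and the third stands alone.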
       
       
      @@ -4473,7 +4313,7 @@ e
      1 -The syntax for a for loop is similar to list comprehensions. li is a list, and s will take the value of each element in turn, starting from the first element. +The syntax for a for loop is similar to list comprehensions. li is a list, and s will take the value of each element in turn, starting from the first element. @@ -4485,7 +4325,7 @@ e
      3 -This is the reason you haven't seen the for loop yet: you haven't needed it yet. It's amazing how often you use for loops in other languages when all you really want is a join or a list comprehension. +This is the reason you haven't seen the for loop yet: you haven't needed it yet. It's amazing how often you use for loops in other languages when all you really want is a join or a list comprehension. @@ -4511,7 +4351,7 @@ e 1 -As you saw in Example 3.20, “Assigning Consecutive Values”, range produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress +As you saw in Example 3.20, “Assigning Consecutive Values”, range produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress occasionally) useful to have a counter loop. @@ -4545,14 +4385,14 @@ USERNAME=mpilgrim 1 -os.environ is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables +os.environ is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables accessible from MS-DOS. In UNIX, they are the variables exported in your shell's startup scripts. In Mac OS, there is no concept of environment variables, so this dictionary is empty. 2 -os.environ.items() returns a list of tuples: [(key1, value1), (key2, value2), ...]. The for loop iterates through this list. The first round, it assigns key1 to k and value1 to v, so k = USERPROFILE and v = C:\Documents and Settings\mpilgrim. In the second round, k gets the second key, OS, and v gets the corresponding value, Windows_NT. +os.environ.items() returns a list of tuples: [(key1, value1), (key2, value2), ...]. The for loop iterates through this list. The first round, it assigns key1 to k and value1 to v, so k = USERPROFILE and v = C:\Documents and Settings\mpilgrim. 
In the second round, k gets the second key, OS, and v gets the corresponding value, Windows_NT. @@ -4560,12 +4400,12 @@ USERNAME=mpilgrim With multi-variable assignment and list comprehensions, you can replace the entire for loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it because it makes it clear that what I'm doing is mapping a dictionary into a list, then joining the list into a single string. - Other programmers prefer to write this out as a for loop. The output is the same in either case, although this version is slightly faster, because there is only one print statement instead of many. + Other programmers prefer to write this out as a for loop. The output is the same in either case, although this version is slightly faster, because there is only one print statement instead of many. -
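The single-statement version described above can be sketched like this. The sample dictionary is a stand-in built from the two values quoted in the notes, so that the result is predictable; the last line applies the same expression to os.environ itself. The code is written for Python 3 and builds a string rather than printing it.

```python
import os

# Stand-in for os.environ, using the two example values from the notes above
sample = {"USERPROFILE": "C:\\Documents and Settings\\mpilgrim",
          "OS": "Windows_NT"}

# Map the dictionary into a list of "key=value" strings, then join the list
pairs = ["%s=%s" % (k, v) for k, v in sample.items()]
listing = "\n".join(pairs)

# The same idea applied to the real environment of the running process
env_listing = "\n".join(["%s=%s" % (k, v) for k, v in os.environ.items()])
```

Each element of items() is a (key, value) pair, which is why the loop can unpack it straight into k and v.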

      Now we can look at the for loop in MP3FileInfo, from the sample fileinfo.py program introduced in Chapter 5. -

      Example 6.11. for Loop in MP3FileInfo

      +

      Now we can look at the for loop in MP3FileInfo, from the sample fileinfo.py program introduced in Chapter 5. +

      Example 6.11. for Loop in MP3FileInfo

           tagDataMap = {"title"   : (  3,  33, stripnulls),
       "artist"  : ( 33,  63, stripnulls),
       "album"   : ( 63,  93, stripnulls),
      @@ -4582,27 +4422,27 @@ USERNAME=mpilgrim
       
       1 
       
      -tagDataMap is a class attribute that defines the tags you're looking for in an MP3 file.  Tags are stored in fixed-length fields.  Once you read the last 128 bytes of the file, bytes 3 through 32 of those
      +tagDataMap is a class attribute that defines the tags you're looking for in an MP3 file.  Tags are stored in fixed-length fields.  Once you read the last 128 bytes of the file, bytes 3 through 32 of those
                   are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth.  Note
      -            that tagDataMap is a dictionary of tuples, and each tuple contains two integers and a function reference.
      +            that tagDataMap is a dictionary of tuples, and each tuple contains two integers and a function reference.
       
       
       
       2 
       
      -This looks complicated, but it's not.  The structure of the for variables matches the structure of the elements of the list returned by items.  Remember that items returns a list of tuples of the form (key, value).  The first element of that list is ("title", (3, 33, <function stripnulls>)), so the first time around the loop, tag gets "title", start gets 3, end gets 33, and parseFunc gets the function stripnulls.
      +This looks complicated, but it's not.  The structure of the for variables matches the structure of the elements of the list returned by items.  Remember that items returns a list of tuples of the form (key, value).  The first element of that list is ("title", (3, 33, <function stripnulls>)), so the first time around the loop, tag gets "title", start gets 3, end gets 33, and parseFunc gets the function stripnulls.
       
       
       
       3 
       
      -Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy.  You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self.  After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like.
      +Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy.  You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self.  After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like.
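
The unpacking described in these callouts is easy to try on its own. What follows is a minimal sketch in Python 3 syntax, not the code from fileinfo.py: the 128-character tagdata record is fabricated for illustration, strip_nulls is a hypothetical stand-in for stripnulls, and the map holds only the three fields whose offsets are given above.

```python
def strip_nulls(data):
    "Hypothetical stand-in for stripnulls: drop NUL padding and whitespace."
    return data.replace("\x00", "").strip()

# Each value is (start, end, parse function), the same shape as tagDataMap.
tag_data_map = {
    "title":  (3, 33, strip_nulls),
    "artist": (33, 63, strip_nulls),
    "album":  (63, 93, strip_nulls),
}

# Fabricated record: a "TAG" marker, three 30-character fields, NUL padding.
tagdata = (
    "TAG"
    + "Mahadeva".ljust(30, "\x00")
    + "Some Artist".ljust(30, "\x00")
    + "Some Album".ljust(30, "\x00")
).ljust(128, "\x00")

info = {}
# The loop variables mirror the structure of each (key, value) pair,
# where the value is itself a three-element tuple.
for tag, (start, end, parse_func) in tag_data_map.items():
    info[tag] = parse_func(tagdata[start:end])

print(info)
```

The parentheses around (start, end, parse_func) are what let a single for statement unpack both the key and the pieces of the value.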
       
       
       
      -

      6.4. Using sys.modules

      -

      Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules. -

      Example 6.12. Introducing sys.modules

      >>> import sys        1
      +

      6.4. Using sys.modules

      +

      Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules. +

      Example 6.12. Introducing sys.modules

      >>> import sys        1
       >>> print '\n'.join(sys.modules.keys()) 2
       win32api
       os.path
      @@ -4621,18 +4461,18 @@ stat
      1 -The sys module contains system-level information, such as the version of Python you're running (sys.version or sys.version_info), and system-level options such as the maximum allowed recursion depth (sys.getrecursionlimit() and sys.setrecursionlimit()). +The sys module contains system-level information, such as the version of Python you're running (sys.version or sys.version_info), and system-level options such as the maximum allowed recursion depth (sys.getrecursionlimit() and sys.setrecursionlimit()). 2 -sys.modules is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported. Python preloads some modules on startup, and if you're using a Python IDE, sys.modules contains all the modules imported by all the programs you've run within the IDE. +sys.modules is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported. Python preloads some modules on startup, and if you're using a Python IDE, sys.modules contains all the modules imported by all the programs you've run within the IDE. -

      This example demonstrates how to use sys.modules. -

      Example 6.13. Using sys.modules

      >>> import fileinfo         1
      +

      This example demonstrates how to use sys.modules. +

      Example 6.13. Using sys.modules

      >>> import fileinfo         1
       >>> print '\n'.join(sys.modules.keys())
       win32api
       os.path
      @@ -4656,17 +4496,17 @@ stat
       
       1 
       
      -As new modules are imported, they are added to sys.modules.  This explains why importing the same module twice is very fast: Python has already loaded and cached the module in sys.modules, so importing the second time is simply a dictionary lookup.
      +As new modules are imported, they are added to sys.modules.  This explains why importing the same module twice is very fast: Python has already loaded and cached the module in sys.modules, so importing the second time is simply a dictionary lookup.
       
       
       
       2 
       
      -Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the sys.modules dictionary.
      +Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the sys.modules dictionary.
       
       
       
      -

      The next example shows how to use the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined. +

      The next example shows how to use the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined.

      Example 6.14. The __module__ Class Attribute

      >>> from fileinfo import MP3FileInfo
       >>> MP3FileInfo.__module__              1
       'fileinfo'
      @@ -4682,12 +4522,12 @@ stat
       
       2 
       
      -Combining this with the sys.modules dictionary, you can get a reference to the module in which a class is defined.
      +Combining this with the sys.modules dictionary, you can get a reference to the module in which a class is defined.
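
If you don't have fileinfo.py handy, the same lookup works with any imported class. This is a small sketch in Python 3 syntax that assumes nothing beyond the standard library; string.Template is used only because it is a class defined in a module of a known name.

```python
import os.path
import string
import sys

# Once imported, a module is cached in sys.modules under its name.
print(sys.modules["string"] is string)
print(sys.modules["os.path"] is os.path)

# A class records the name of its defining module in __module__ ...
module_name = string.Template.__module__
print(module_name)

# ... and that name is a key into sys.modules, which gets you the module.
defining_module = sys.modules[module_name]
print(defining_module.Template is string.Template)
```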
       
       
       
      -

      Now you're ready to see how sys.modules is used in fileinfo.py, the sample program introduced in Chapter 5. This example shows that portion of the code. -

      Example 6.15. sys.modules in fileinfo.py

      +

      Now you're ready to see how sys.modules is used in fileinfo.py, the sample program introduced in Chapter 5. This example shows that portion of the code. +

      Example 6.15. sys.modules in fileinfo.py

           def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):       1
               "get file info class from filename extension"           
               subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        2
      @@ -4696,21 +4536,21 @@ stat
       
       1 
       
      -This is a function with two arguments; filename is required, but module is optional and defaults to the module that contains the FileInfo class.  This looks inefficient, because you might expect Python to evaluate the sys.modules expression every time the function is called.  In fact, Python evaluates default expressions only once, when the def statement runs, which here is the first time the module is imported.  As you'll see later, you never call this
      -            function with a module argument, so module serves as a function-level constant.
      +This is a function with two arguments; filename is required, but module is optional and defaults to the module that contains the FileInfo class.  This looks inefficient, because you might expect Python to evaluate the sys.modules expression every time the function is called.  In fact, Python evaluates default expressions only once, when the def statement runs, which here is the first time the module is imported.  As you'll see later, you never call this
      +            function with a module argument, so module serves as a function-level constant.
       
       
       
       2 
       
      -You'll plow through this line later, after you dive into the os module.  For now, take it on faith that subclass ends up as the name of a class, like MP3FileInfo.
      +You'll plow through this line later, after you dive into the os module.  For now, take it on faith that subclass ends up as the name of a class, like MP3FileInfo.
       
       
       
       3 
       
      -You already know about getattr, which gets a reference to an object by name.  hasattr is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has
      -            a particular class (although it works for any object and any attribute, just like getattr).  In English, this line of code says, “If this module has the class named by subclass then return it, otherwise return the base class FileInfo.”
      +You already know about getattr, which gets a reference to an object by name.  hasattr is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has
      +            a particular class (although it works for any object and any attribute, just like getattr).  In English, this line of code says, “If this module has the class named by subclass then return it, otherwise return the base class FileInfo.”
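
Here is one way to play with that lookup in isolation. It is a sketch under stated assumptions, not the code from fileinfo.py: the two classes are empty placeholders, handlers is a throwaway module object standing in for the fileinfo module, and the three-argument form of getattr (which takes a fallback value) replaces the hasattr test.

```python
import os.path
import types

class FileInfo:
    "Placeholder base handler."

class MP3FileInfo(FileInfo):
    "Placeholder MP3 handler."

# Stand-in for the fileinfo module: a module object holding the handlers.
handlers = types.ModuleType("handlers")
handlers.FileInfo = FileInfo
handlers.MP3FileInfo = MP3FileInfo

def get_file_info_class(filename, module=handlers):
    "Return the handler class for filename, or FileInfo if none is defined."
    # The default for module was evaluated once, when this def statement ran.
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    # getattr with a third argument returns the fallback if the name is missing.
    return getattr(module, subclass, FileInfo)

print(get_file_info_class("mahadeva.mp3").__name__)
print(get_file_info_class("notes.txt").__name__)
```

Whether you prefer hasattr plus getattr or the single getattr call with a fallback is a matter of taste; both return the class itself, not an instance.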
       
       
       
      @@ -4719,11 +4559,11 @@ stat
       
       

      6.5. Working with Directories

      -

      The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing +

      The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory.

      Example 6.16. Constructing Pathnames

       >>> import os
      @@ -4739,27 +4579,27 @@ stat
       
       1 
       
      -os.path is a reference to a module -- which module depends on your platform.  Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module.
      +os.path is a reference to a module -- which module depends on your platform.  Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module.
       
       
       
       2 
       
      -The join function of os.path constructs a pathname out of one or more partial pathnames.  In this case, it simply concatenates strings.  (Note that dealing
      +The join function of os.path constructs a pathname out of one or more partial pathnames.  In this case, it simply concatenates strings.  (Note that dealing
                   with pathnames on Windows is annoying because the backslash character must be escaped.)
       
       
       
       3 
       
      -In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename.  I was overjoyed when I discovered this, since
      -addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language.  Do not write this stupid little function in Python; smart people have already taken care of it for you.
      +In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename.  I was overjoyed when I discovered this, since
      +addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language.  Do not write this stupid little function in Python; smart people have already taken care of it for you.
       
       
       
       4 
       
      -expanduser will expand a pathname that uses ~ to represent the current user's home directory.  This works on any platform where users have a home directory, like Windows,
      +expanduser will expand a pathname that uses ~ to represent the current user's home directory.  This works on any platform where users have a home directory, like Windows,
       UNIX, and Mac OS X; it has no effect on Mac OS.
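
You can watch join do its job without sitting at a Windows machine. In the standard library, os.path resolves to ntpath on Windows and to posixpath on UNIX-like systems, and both can be imported directly on any platform. A short sketch in Python 3 syntax, using the pathnames from this section:

```python
import ntpath      # the Windows flavor of os.path
import posixpath   # the UNIX flavor of os.path

# With a trailing backslash already present, join simply concatenates.
with_slash = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")

# Without one, join adds the separator for you.
without_slash = ntpath.join("c:\\music\\ap", "mahadeva.mp3")

# The POSIX flavor does the same with forward slashes.
unix_style = posixpath.join("/music/ap", "mahadeva.mp3")

print(with_slash)
print(without_slash)
print(unix_style)
```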
       
       
      @@ -4785,32 +4625,32 @@ stat
       
       1 
       
      -The split function splits a full pathname and returns a tuple containing the path and filename.  Remember when I said you could use
      -multi-variable assignment to return multiple values from a function?  Well, split is such a function.
      +The split function splits a full pathname and returns a tuple containing the path and filename.  Remember when I said you could use
      +multi-variable assignment to return multiple values from a function?  Well, split is such a function.
       
       
       
       2 
       
      -You assign the return value of the split function into a tuple of two variables.  Each variable receives the value of the corresponding element of the returned tuple.
      +You assign the return value of the split function into a tuple of two variables.  Each variable receives the value of the corresponding element of the returned tuple.
       
       
       
       3 
       
      -The first variable, filepath, receives the value of the first element of the tuple returned from split, the file path.
      +The first variable, filepath, receives the value of the first element of the tuple returned from split, the file path.
       
       
       
       4 
       
      -The second variable, filename, receives the value of the second element of the tuple returned from split, the filename.
      +The second variable, filename, receives the value of the second element of the tuple returned from split, the filename.
       
       
       
       5 
       
      -os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension.   You use the same technique
      +os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension.   You use the same technique
                   to assign each of them to separate variables.
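
The two splitting functions chain together naturally. A minimal sketch in Python 3 syntax, using posixpath (the UNIX flavor of os.path) so the result does not depend on your platform; the pathname is illustrative:

```python
import posixpath

# split returns (path, filename); unpack both in one assignment.
filepath, filename = posixpath.split("/music/ap/mahadeva.mp3")

# splitext returns (name, extension); the extension keeps its dot.
shortname, extension = posixpath.splitext(filename)

print(filepath, filename, shortname, extension)
```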
       
       
      @@ -4839,30 +4679,30 @@ stat
       
       1 
       
      -The listdir function takes a pathname and returns a list of the contents of the directory.
      +The listdir function takes a pathname and returns a list of the contents of the directory.
       
       
       
       2 
       
      -listdir returns both files and folders, with no indication of which is which.
      +listdir returns both files and folders, with no indication of which is which.
       
       
       
       3 
       
      -You can use list filtering and the isfile function of the os.path module to separate the files from the folders.  isfile takes a pathname and returns 1 if the path represents a file, and 0 otherwise.  Here you're using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory.  You can use os.getcwd() to get the current working directory.
      +You can use list filtering and the isfile function of the os.path module to separate the files from the folders.  isfile takes a pathname and returns 1 if the path represents a file, and 0 otherwise.  Here you're using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory.  You can use os.getcwd() to get the current working directory.
       
       
       
       4 
       
      -os.path also has an isdir function, which returns 1 if the path represents a directory, and 0 otherwise.  You can use this to get a list of the subdirectories
      +os.path also has an isdir function, which returns 1 if the path represents a directory, and 0 otherwise.  You can use this to get a list of the subdirectories
                   within a directory.
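
To try the filtering without touching your own music collection, you can build a scratch directory first. This is a sketch in Python 3 syntax; the directory contents are made up, and in current Python versions isfile and isdir return True and False rather than 1 and 0.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # One subdirectory and one empty file, both invented for the demo.
    os.mkdir(os.path.join(dirname, "albums"))
    with open(os.path.join(dirname, "kairo.mp3"), "w"):
        pass

    # listdir makes no promise about order, so sort for stable output.
    entries = sorted(os.listdir(dirname))

    # Join each name back onto the directory before testing it.
    files = [f for f in entries if os.path.isfile(os.path.join(dirname, f))]
    dirs = [f for f in entries if os.path.isdir(os.path.join(dirname, f))]

print(entries)
print(files)
print(dirs)
```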
       
       
       
      -

      Example 6.19. Listing Directories in fileinfo.py

      +

      Example 6.19. Listing Directories in fileinfo.py

       def listDirectory(directory, fileExtList):    
           "get list of file info objects for files of particular extensions" 
           fileList = [os.path.normcase(f)
      @@ -4874,25 +4714,25 @@ def listDirectory(directory, fileExtList):
       
       1 
       
      -os.listdir(directory) returns a list of all the files and folders in directory.
      +os.listdir(directory) returns a list of all the files and folders in directory.
       
       
       
       2 
       
      -Iterating through the list with f, you use os.path.normcase(f) to normalize the case according to operating system defaults.  normcase is a useful little function that compensates for case-insensitive operating systems that think that mahadeva.mp3 and mahadeva.MP3 are the same file.  For instance, on Windows and Mac OS, normcase will convert the entire filename to lowercase; on UNIX-compatible systems, it will return the filename unchanged.
      +Iterating through the list with f, you use os.path.normcase(f) to normalize the case according to operating system defaults.  normcase is a useful little function that compensates for case-insensitive operating systems that think that mahadeva.mp3 and mahadeva.MP3 are the same file.  For instance, on Windows and Mac OS, normcase will convert the entire filename to lowercase; on UNIX-compatible systems, it will return the filename unchanged.
       
       
       
       3 
       
      -Iterating through the normalized list with f again, you use os.path.splitext(f) to split each filename into name and extension.
      +Iterating through the normalized list with f again, you use os.path.splitext(f) to split each filename into name and extension.
       
       
       
       4 
       
      -For each file, you see if the extension is in the list of file extensions you care about (fileExtList, which was passed to the listDirectory function).
      +For each file, you see if the extension is in the list of file extensions you care about (fileExtList, which was passed to the listDirectory function).
       
       
       
      @@ -4907,14 +4747,14 @@ def listDirectory(directory, fileExtList):
       Note
       
       
      -Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations.  These modules are wrappers for platform-specific modules, so functions like
      -os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python.
      +Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations.  These modules are wrappers for platform-specific modules, so functions like
      +os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python.
       
       
       
       

      There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line. -

      Example 6.20. Listing Directories with glob

      +

      Example 6.20. Listing Directories with glob

       >>> os.listdir("c:\\music\\_singles\\")               1
       ['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
       'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
      @@ -4936,14 +4776,14 @@ may already be familiar with from working on the command line.
       
       1 
       
      -As you saw earlier, os.listdir simply takes a directory path and lists all files and directories in that directory.
      +As you saw earlier, os.listdir simply takes a directory path and lists all files and directories in that directory.
       
       
       
       2 
       
      -The glob module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard.
      -             Here the wildcard is a directory path plus "*.mp3", which will match all .mp3 files.  Note that each element of the returned list already includes the full path of the file.
      +The glob module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard.
      +             Here the wildcard is a directory path plus "*.mp3", which will match all .mp3 files.  Note that each element of the returned list already includes the full path of the file.
       
       
       
      @@ -4954,22 +4794,22 @@ may already be familiar with from working on the command line.
       
       4 
       
      -Now consider this scenario: you have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory.  You can get a list of all of those with a single call to glob, by using two wildcards at once.  One wildcard is the "*.mp3" (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music.  That's a crazy amount of power packed into one deceptively simple-looking function!
      +Now consider this scenario: you have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory.  You can get a list of all of those with a single call to glob, by using two wildcards at once.  One wildcard is the "*.mp3" (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music.  That's a crazy amount of power packed into one deceptively simple-looking function!
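
The two-wildcard trick can be checked against a scratch directory tree. A sketch in Python 3 syntax, with invented subdirectory and file names; relpath is used only to make the output independent of where the temporary directory lives.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as music:
    # Two subdirectories, two .mp3 files, and one file that should not match.
    layout = [("ap", "mahadeva.mp3"), ("singles", "kairo.mp3"), ("singles", "notes.txt")]
    for sub, name in layout:
        os.makedirs(os.path.join(music, sub), exist_ok=True)
        with open(os.path.join(music, sub, name), "w"):
            pass

    # One wildcard for the subdirectory, one for the filename.
    matches = sorted(glob.glob(os.path.join(music, "*", "*.mp3")))

    # Each match already carries its full path; strip the prefix for display.
    relative = [os.path.relpath(m, music) for m in matches]

print(relative)
```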
       
       
       
       
      -

      Further Reading on the os Module

      +

      Further Reading on the os Module

      6.6. Putting It All Together

      Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all fits together. -

      Example 6.21. listDirectory

      +

      Example 6.21. listDirectory

       def listDirectory(directory, fileExtList):     1
           "get list of file info objects for files of particular extensions"
           fileList = [os.path.normcase(f)
      @@ -4986,51 +4826,51 @@ def listDirectory(directory, fileExtList):     1 
       
      -listDirectory is the main attraction of this entire module.  It takes a directory (like c:\music\_singles\ in my case) and a list of interesting file extensions (like ['.mp3']), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in
      +listDirectory is the main attraction of this entire module.  It takes a directory (like c:\music\_singles\ in my case) and a list of interesting file extensions (like ['.mp3']), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in
                   that directory.  And it does it in just a few straightforward lines of code.
       
       
       
       2 
       
      -As you saw in the previous section, this line of code gets a list of the full pathnames of all the files in directory that have an interesting file extension (as specified by fileExtList).
      +As you saw in the previous section, this line of code gets a list of the full pathnames of all the files in directory that have an interesting file extension (as specified by fileExtList).
       
       
       
       3 
       
      -Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions -- literally, a function within a function.  The nested function getFileInfoClass can be called only from the function in which it is defined, listDirectory.  As with any other function, you don't need an interface declaration or anything fancy; just define the function and code
      +Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions -- literally, a function within a function.  The nested function getFileInfoClass can be called only from the function in which it is defined, listDirectory.  As with any other function, you don't need an interface declaration or anything fancy; just define the function and code
                   it.
       
       
       
       4 
       
      -Now that you've seen the os module, this line should make more sense.  It gets the extension of the file (os.path.splitext(filename)[1]), forces it to uppercase (.upper()), slices off the dot ([1:]), and constructs a class name out of it with string formatting.  So c:\music\ap\mahadeva.mp3 becomes .mp3 becomes .MP3 becomes MP3 becomes MP3FileInfo.
      +Now that you've seen the os module, this line should make more sense.  It gets the extension of the file (os.path.splitext(filename)[1]), forces it to uppercase (.upper()), slices off the dot ([1:]), and constructs a class name out of it with string formatting.  So c:\music\ap\mahadeva.mp3 becomes .mp3 becomes .MP3 becomes MP3 becomes MP3FileInfo.
       
       
       
       5 
       
       Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually
      -            exists in this module.  If it does, you return the class, otherwise you return the base class FileInfo.  This is a very important point: this function returns a class.  Not an instance of a class, but the class itself.
      +            exists in this module.  If it does, you return the class, otherwise you return the base class FileInfo.  This is a very important point: this function returns a class.  Not an instance of a class, but the class itself.
       
       
       
       6 
       
      -For each file in the “interesting files” list (fileList), you call getFileInfoClass with the filename (f).  Calling getFileInfoClass(f) returns a class; you don't know exactly which class, but you don't care.  You then create an instance of this class (whatever
      -            it is) and pass the filename (f again) to the __init__ method.  As you saw earlier in this chapter, the __init__ method of FileInfo sets self["name"], which triggers __setitem__, which is overridden in the descendant (MP3FileInfo) to parse the file appropriately to pull out the file's metadata.  You do all that for each interesting file and return a
      +For each file in the “interesting files” list (fileList), you call getFileInfoClass with the filename (f).  Calling getFileInfoClass(f) returns a class; you don't know exactly which class, but you don't care.  You then create an instance of this class (whatever
      +            it is) and pass the filename (f again) to the __init__ method.  As you saw earlier in this chapter, the __init__ method of FileInfo sets self["name"], which triggers __setitem__, which is overridden in the descendant (MP3FileInfo) to parse the file appropriately to pull out the file's metadata.  You do all that for each interesting file and return a
                   list of the resulting instances.
       
       
       
      -

      Note that listDirectory is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined +

      Note that listDirectory is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own -module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class: -HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results. +module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class: +HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results.

      6.7. Summary

      -

      The fileinfo.py program introduced in Chapter 5 should now make perfect sense. +

      The fileinfo.py program introduced in Chapter 5 should now make perfect sense.

       """Framework for getting filetype-specific metadata.
       
      @@ -5116,7 +4956,7 @@ if __name__ == "__main__":
       
    15. Protecting external resources with try...finally
    16. Reading from files
    17. Assigning multiple values at once in a for loop -
    18. Using the os module for all your cross-platform file manipulation needs +
    19. Using the os module for all your cross-platform file manipulation needs
    20. Dynamically instantiating classes of unknown type by treating classes as objects and passing them around @@ -5124,13 +4964,13 @@ if __name__ == "__main__":

      Chapter 7. Regular Expressions

      Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of -characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the re module to get an overview of the available functions and their arguments. +characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the re module to get an overview of the available functions and their arguments.

      7.1. Diving In

      -

      Strings have methods for searching (index, find, and count), replacing (replace), and parsing (split), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are -always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations. +

      Strings have methods for searching (index, find, and count), replacing (replace), and parsing (split), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are +always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations.

      If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different - string functions with if statements to handle special cases, or if you're combining them with split and join and list comprehensions in weird unreadable ways, you may need to move up to regular expressions. + string functions with if statements to handle special cases, or if you're combining them with split and join and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.

      Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions to make them practically self-documenting.

      7.2. Case Study: Street Addresses

      @@ -5153,13 +4993,13 @@ within regular expressions to make them practically self-documenting. 1 -My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I thought this was simple enough that I could just use the string method replace. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a constant. And in this deceptively simple example, s.replace does indeed work. +My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I thought this was simple enough that I could just use the string method replace. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a constant. And in this deceptively simple example, s.replace does indeed work. 2 -Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that 'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word. The replace method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed. +Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that 'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word. The replace method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed. @@ -5172,7 +5012,7 @@ within regular expressions to make them practically self-documenting. 4 -It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the re module. +It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the re module. 
@@ -5184,7 +5024,7 @@ within regular expressions to make them practically self-documenting. 6 -Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that's part of the word BROAD, because that's in the middle of s. +Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that's part of the word BROAD, because that's in the middle of s. @@ -5224,7 +5064,7 @@ ended with the street name. Most of the time, I got away with it, but if the st *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word 'ROAD' as a whole word by itself, but it wasn't at the end, because the address had an apartment number after the street designation. - Because 'ROAD' isn't at the very end of the string, it doesn't match, so the entire call to re.sub ends up replacing nothing at all, and you get the original string back, which is not what you want. + Because 'ROAD' isn't at the very end of the string, it doesn't match, so the entire call to re.sub ends up replacing nothing at all, and you get the original string back, which is not what you want. @@ -5252,7 +5092,7 @@ ended with the street name. Most of the time, I got away with it, but if the st
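
      The street address pitfalls described in these callouts can be reproduced in a few lines. This is a sketch, not the book's exact listing: the address strings are illustrative, and the final whole-word pattern using \b is one possible fix, shown here as an assumption about where the case study is heading.

```python
import re

broad = '100 NORTH BROAD ROAD'
apartment = '100 BROAD ROAD APT. 3'

# str.replace blindly replaces every occurrence, including the one in BROAD.
naive = broad.replace('ROAD', 'RD.')

# 'ROAD$' only matches ROAD at the very end of the string.
anchored = re.sub('ROAD$', 'RD.', broad)

# With an apartment number after it, ROAD is no longer at the end,
# so nothing matches and the original string comes back.
missed = re.sub('ROAD$', 'RD.', apartment)

# Assumed fix: \b matches a word boundary, so ROAD must be a whole word.
# The raw string keeps the backslashes from being interpreted by Python.
whole_word = re.sub(r'\bROAD\b', 'RD.', apartment)

for result in (naive, anchored, missed, whole_word):
    print(result)
```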

      The following are some general rules for constructing Roman numerals:

        -
      • Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8. +
      • Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.
      • The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). The number 40 is written as XL (10 less than 50), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (10 less than 50, then 1 less than 5). @@ -5301,8 +5141,8 @@ ended with the street name. Most of the time, I got away with it, but if the st 2 -The essence of the re module is the search function, that takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found, search returns an object which has various methods to describe the match; if no match is found, search returns None, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return - value of search. 'M' matches this regular expression, because the first optional M matches and the second and third optional M characters are ignored. +The essence of the re module is the search function, that takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found, search returns an object which has various methods to describe the match; if no match is found, search returns None, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return + value of search. 'M' matches this regular expression, because the first optional M matches and the second and third optional M characters are ignored. @@ -5320,7 +5160,7 @@ ended with the street name. Most of the time, I got away with it, but if the st 5 -'MMMM' does not match. All three M characters match, but then the regular expression insists on the string ending (because of the $ character), and the string doesn't end yet (because of the fourth M). So search returns None. 
+'MMMM' does not match. All three M characters match, but then the regular expression insists on the string ending (because of the $ character), and the string doesn't end yet (because of the fourth M). So search returns None. @@ -5649,7 +5489,7 @@ it a verbose regular expression. This example shows how. 1 The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when - working with them: re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has + working with them: re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it's a lot more readable. @@ -5669,7 +5509,7 @@ it a verbose regular expression. This example shows how. 4 -This does not match. Why? Because it doesn't have the re.VERBOSE flag, so the re.search function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose. +This does not match. Why? Because it doesn't have the re.VERBOSE flag, so the re.search function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose. 
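
      Here is a small sketch that pulls the last few callouts together. It uses only the thousands-place pattern discussed above (three optional M characters), once in compact form and once as a verbose regular expression; the variable names are made up for illustration.

```python
import re

compact = '^M?M?M?$'

# The same pattern, written verbosely with whitespace and comments.
verbose = '''
    ^        # beginning of string
    M?M?M?   # zero to three M characters
    $        # end of string
'''

# search returns a match object on success and None on failure.
one_m = re.search(compact, 'M')
four_m = re.search(compact, 'MMMM')

# With re.VERBOSE, whitespace and comments in the pattern are ignored.
with_flag = re.search(verbose, 'MM', re.VERBOSE)

# Without the flag, the whitespace and hash marks are taken literally,
# so the pattern cannot match a plain Roman numeral.
without_flag = re.search(verbose, 'MM')

print(one_m is not None, four_m, with_flag is not None, without_flag)
```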
@@ -5713,7 +5553,7 @@ examples of regular expressions that purported to do this, but none of them were 2 -To get access to the groups that the regular expression parser remembered along the way, use the groups() method on the object that the search function returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you +To get access to the groups that the regular expression parser remembered along the way, use the groups() method on the object that the search function returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits. @@ -5747,7 +5587,7 @@ examples of regular expressions that purported to do this, but none of them were 2 -The groups() method now returns a tuple of four elements, since the regular expression now defines four groups to remember. +The groups() method now returns a tuple of four elements, since the regular expression now defines four groups to remember. @@ -5848,7 +5688,7 @@ examples of regular expressions that purported to do this, but none of them were 4 -Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the groups() method still returns a tuple of four elements, but the fourth element is just an empty string. +Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the groups() method still returns a tuple of four elements, but the fourth element is just an empty string. @@ -5983,7 +5823,7 @@ you made.
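
      The behavior of groups() described in these callouts can be sketched as follows. The two patterns and the phone numbers are illustrative assumptions, not necessarily the exact ones from the case study: the first defines three remembered groups, the second adds a fourth, optional group for the extension.

```python
import re

# Three remembered groups: area code, trunk, and the rest of the number.
three_groups = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

# Four groups, with \D* allowing any non-digit separators (or none),
# and (\d*) making the extension optional.
four_groups = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

basic = three_groups.search('800-555-1212').groups()
with_ext = four_groups.search('800.555.1212 x1234').groups()
no_ext = four_groups.search('800-555-1212').groups()

print(basic)
print(with_ext)
print(no_ext)
```

      When there is no extension, the tuple still has four elements; the last one is just an empty string, because (\d*) happily matches zero digits.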

        7.7. Summary

        @@ -6012,7 +5852,7 @@ you made.
      • (a|b|c) matches either a or b or c. -
      • (x) in general is a remembered group. You can get the value of what matched by using the groups() method of the object returned by re.search. +
      • (x) in general is a remembered group. You can get the value of what matched by using the groups() method of the object returned by re.search.

      Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough @@ -6036,9 +5876,9 @@ they solve.

      Chapter 8. HTML Processing

      8.1. Diving in

      I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions. -

      Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the docstrings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how +

      Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the docstrings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how any of these class methods ever get called. Don't worry, all will be revealed in due time. -

      Example 8.1. BaseHTMLProcessor.py

      +

      Example 8.1. BaseHTMLProcessor.py

      If you have not already done so, you can download this and other examples used in this book.

       from sgmllib import SGMLParser
       import htmlentitydefs
      @@ -6110,7 +5950,7 @@ class BaseHTMLProcessor(SGMLParser):
       
           def output(self):              
               """Return processed HTML as a single string"""
      -        return "".join(self.pieces)

      Example 8.2. dialect.py

      +        return "".join(self.pieces)

      Example 8.2. dialect.py

       import re
       from BaseHTMLProcessor import BaseHTMLProcessor
       
      @@ -6263,7 +6103,7 @@ def test(url):
               webbrowser.open_new(outfile)
       
       if __name__ == "__main__":
      -    test("http://diveintopython3.org/odbchelper_list.html")

      Example 8.3. Output of dialect.py

      + test("http://diveintopython3.org/odbchelper_list.html")
      Example 8.3. Output of dialect.py

      Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the code listings and screen examples were left untouched.

       <div class="abstract">
      @@ -6273,38 +6113,38 @@ If youw onwy expewience wif wists is awways in
       in <span class="application">Powewbuiwdew</span>, bwace youwsewf fow
       <span class="application">Pydon</span> wists.</p>
       </div>
      -

      8.2. Introducing sgmllib.py

      -

      HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library. +

      8.2. Introducing sgmllib.py

      +

      HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.

      The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags -and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally. -

      sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, -it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method. -

      SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them: +and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally. +

      sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, +it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method. +

      SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:

      Start tag
      -
      An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag's attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes. +
      An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag's attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes.
      End tag
      -
      An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method, otherwise it calls unknown_endtag with the tag name. +
      An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method, otherwise it calls unknown_endtag with the tag name.
      Character reference
      -
      An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent. +
      An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent.
      Entity reference
      -
      An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity. +
      An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
      Comment
      -
      An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment. +
      An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
      Processing instruction
      -
      An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction. +
      An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.
      Declaration
      -
      An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration. +
      An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
      Text data
      -
      A block of text. Anything that doesn't fit into the other 7 categories. When found, SGMLParser calls handle_data with the text. +
      A block of text. Anything that doesn't fit into the other 7 categories. When found, SGMLParser calls handle_data with the text.
      @@ -6312,12 +6152,12 @@ it calls a method on itself based on what it found. In order to use the parser, -
      Important
      Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1. +Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1.
      -

      sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing -the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments. +

      sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing +the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.
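
      If you want to see this "call a method for each piece" idea in action without sgmllib.py, here is a rough sketch. Be warned that it swaps in a different parser: sgmllib was removed in Python 3, so this uses html.parser.HTMLParser from the standard library as a stand-in. The handler names differ slightly (handle_starttag and handle_endtag instead of unknown_starttag and unknown_endtag), and the EventLogger class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical stand-in for sgmllib's test parser: record every event."""

    def __init__(self):
        # convert_charrefs=False keeps character and entity references
        # as separate events instead of folding them into the text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start tag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end tag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed('<!DOCTYPE html><html><!-- note -->'
            '<p class="x">&copy; 2009 &#160;ok</p></html>')
logger.close()

for event in logger.events:
    print(event)
```

      The point is the same as with SGMLParser: the structure of the HTML determines which methods get called, in what order, and with what arguments.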

      @@ -6326,7 +6166,7 @@ the SGMLParser class and defining

      Example 8.4. Sample test of sgmllib.py

      +

      Example 8.4. Sample test of sgmllib.py

      Here is a snippet from the table of contents of the HTML version of this book. Of course, your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython3.org/.)

       c:\python23\lib> type "c:\downloads\diveintopython3\html\toc\index.html"
       
      @@ -6340,7 +6180,7 @@ the SGMLParser class and defining 
       
       ... rest of file omitted for brevity ...
      -

      Running this through the test suite of sgmllib.py yields this output:

      +

      Running this through the test suite of sgmllib.py yields this output:

       c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython3\html\toc\index.html"
       data: '\n\n'
       start tag: <html >
      @@ -6360,22 +6200,22 @@ data: '\n      '
       

      Here's the roadmap for the rest of the chapter:

        -
      • Subclass SGMLParser to create classes that extract interesting data out of HTML documents. +
      • Subclass SGMLParser to create classes that extract interesting data out of HTML documents. -
      • Subclass SGMLParser to create BaseHTMLProcessor, which overrides all 8 handler methods and uses them to reconstruct the original HTML from the pieces. +
      • Subclass SGMLParser to create BaseHTMLProcessor, which overrides all 8 handler methods and uses them to reconstruct the original HTML from the pieces. -
      • Subclass BaseHTMLProcessor to create Dialectizer, which adds some methods to process specific HTML tags specially, and overrides the handle_data method to provide a framework for processing the text blocks between the HTML tags. +
      • Subclass BaseHTMLProcessor to create Dialectizer, which adds some methods to process specific HTML tags specially, and overrides the handle_data method to provide a framework for processing the text blocks between the HTML tags. -
      • Subclass Dialectizer to create classes that define text processing rules used by Dialectizer.handle_data. +
      • Subclass Dialectizer to create classes that define text processing rules used by Dialectizer.handle_data. -
      • Write a test suite that grabs a real web page from http://diveintopython3.org/ and processes it. +
      • Write a test suite that grabs a real web page from http://diveintopython3.org/ and processes it.
      -

      Along the way, you'll also learn about locals, globals, and dictionary-based string formatting. +

      Along the way, you'll also learn about locals, globals, and dictionary-based string formatting.

      8.3. Extracting data from HTML documents

      -

      To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture. +

      To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.

      The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages. -

      Example 8.5. Introducing urllib

      +

      Example 8.5. Introducing urllib

       >>> import urllib   1
       >>> sock = urllib.urlopen("http://diveintopython3.org/") 2
       >>> htmlSource = sock.read()          3
      @@ -6400,35 +6240,35 @@ data: '\n      '
       
      - - - - -
      Tip
      1 The urllib module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages). +The urllib module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages).
      2 The simplest use of urllib is to retrieve the entire text of a web page using the urlopen function. Opening a URL is similar to opening a file. The return value of urlopen is a file-like object, which has some of the same methods as a file object. +The simplest use of urllib is to retrieve the entire text of a web page using the urlopen function. Opening a URL is similar to opening a file. The return value of urlopen is a file-like object, which has some of the same methods as a file object.
      3 The simplest thing to do with the file-like object returned by urlopen is read, which reads the entire HTML of the web page into a single string. The object also supports readlines, which reads the text line by line into a list. +The simplest thing to do with the file-like object returned by urlopen is read, which reads the entire HTML of the web page into a single string. The object also supports readlines, which reads the text line by line into a list.
      4 When you're done with the object, make sure to close it, just like a normal file object. +When you're done with the object, make sure to close it, just like a normal file object.
      5 You now have the complete HTML of the home page of http://diveintopython3.org/ in a string, and you're ready to parse it. +You now have the complete HTML of the home page of http://diveintopython3.org/ in a string, and you're ready to parse it.
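
      The same open, read, close sequence can be sketched in Python 3, where urlopen lives in urllib.request rather than urllib. To keep the sketch runnable without a network connection, it opens a data: URL built from a made-up page instead of a live web site; that substitution, and every name below, is an assumption for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# A made-up page, wrapped in a data: URL so no network access is needed.
page = "<html><body><p>Hello</p></body></html>"
url = "data:text/html," + quote(page)

sock = urlopen(url)          # returns a file-like object
raw = sock.read()            # the entire response, as bytes in Python 3
sock.close()                 # close it, just like a normal file object

html_source = raw.decode("ascii")
print(html_source)
```

      One difference from the Python 2 session above: read returns bytes, so you need to decode them before treating the result as a string of HTML.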
      -

      +

      If you have not already done so, you can download this and other examples used in this book.

       from sgmllib import SGMLParser
       
      @@ -6445,30 +6285,30 @@ class URLLister(SGMLParser):
       
       1 
       
      -reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created.  So if you need to do any initialization,
      -            do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance.
      +reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created.  So if you need to do any initialization,
      +            do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance.
       
       
       
       2 
       
      -start_a is called by SGMLParser whenever it finds an <a> tag.  The tag may contain an href attribute, and/or other attributes, like name or title.  The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...].  Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.
      +start_a is called by SGMLParser whenever it finds an <a> tag.  The tag may contain an href attribute, and/or other attributes, like name or title.  The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...].  Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.
       
       
       
       3 
       
      -You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension.
      +You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension.
       
       
       
       4 
       
      -String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
      +String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
       
       
       
      -

      Example 8.7. Using urllister.py

      +

      Example 8.7. Using urllister.py

       >>> import urllib, urllister
       >>> usock = urllib.urlopen("http://diveintopython3.org/")
       >>> parser = urllister.URLLister()
      @@ -6495,34 +6335,34 @@ download/diveintopython3-common-5.0.zip
       
       1 
       
      -Call the feed method, defined in SGMLParser, to get HTML into the parser.[1]  It takes a string, which is what usock.read() returns.
      +Call the feed method, defined in SGMLParser, to get HTML into the parser.[1]  It takes a string, which is what usock.read() returns.
       
       
       
       2 
       
      -Like files, you should close your URL objects as soon as you're done with them.
      +Like files, you should close your URL objects as soon as you're done with them.
       
       
       
       3 
       
      -You should close your parser object, too, but for a different reason.  You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more.  Be sure to call close to flush the buffer and force everything to be fully parsed.
      +You should close your parser object, too, but for a different reason.  You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more.  Be sure to call close to flush the buffer and force everything to be fully parsed.
       
       
       
       4 
       
      -Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document.  (Your output may look different, if the download links have been updated by the time you read this.)
      +Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document.  (Your output may look different, if the download links have been updated by the time you read this.)
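
      Here is a hedged sketch of the same link-listing technique for Python 3. Since sgmllib no longer exists there, it is built on html.parser.HTMLParser, which has no per-tag start_a hook; instead, handle_starttag is called for every start tag and the sketch checks the tag name itself. The HTML snippet is made up for illustration.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Sketch of the URLLister idea, using html.parser as a stand-in."""

    def reset(self):
        # reset is called by HTMLParser.__init__, so initialize here
        # and the list is re-created whenever the parser is re-used.
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (attribute, value) tuples; attribute
            # names have already been converted to lowercase.
            href = [v for k, v in attrs if k == "href"]
            if href:
                self.urls.extend(href)


parser = URLLister()
parser.feed('<p><a HREF="index.html">Home</a> '
            '<a name="top">anchor</a> '
            '<a href="toc/index.html">Contents</a></p>')
parser.close()   # flush anything the parser may still be buffering

for url in parser.urls:
    print(url)
```

      The middle tag has no href attribute, so the list comprehension comes back empty and nothing is added for it.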
       
       
       
      -

      8.4. Introducing BaseHTMLProcessor.py

      -

      SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it - finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll - take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer. -

      BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data. -

      Example 8.8. Introducing BaseHTMLProcessor

      +

      8.4. Introducing BaseHTMLProcessor.py

      +

      SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it + finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll + take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer. +

      BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data. +

      Example 8.8. Introducing BaseHTMLProcessor

       class BaseHTMLProcessor(SGMLParser):
           def reset(self):      1
               self.pieces = []
      @@ -6558,13 +6398,13 @@ class BaseHTMLProcessor(SGMLParser):
       
       1 
       
      -reset, called by SGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method.  self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing.  Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces.  Note that self.pieces is a list.  You might be tempted to define it as a string and just keep appending each piece to it.  That would work, but
      +reset, called by SGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method.  self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing.  Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces.  Note that self.pieces is a list.  You might be tempted to define it as a string and just keep appending each piece to it.  That would work, but
       Python is much more efficient at dealing with lists.[2]
       
       
       2 
       
      -Since BaseHTMLProcessor does not define any methods for specific tags (like the start_a method in URLLister), SGMLParser will call unknown_starttag for every start tag.  This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces.  The string formatting here is a little strange; you'll untangle that (and also the odd-looking locals function) later in this chapter.
      +Since BaseHTMLProcessor does not define any methods for specific tags (like the start_a method in URLLister), SGMLParser will call unknown_starttag for every start tag.  This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces.  The string formatting here is a little strange; you'll untangle that (and also the odd-looking locals function) later in this chapter.
       
       
       
      @@ -6576,21 +6416,21 @@ Python is much more efficient at dealing with lists.[
       4 
       
      -When SGMLParser finds a character reference, it calls handle_charref with the bare reference.  If the HTML document contains the reference &#160;, ref will be 160.  Reconstructing the original complete character reference just involves wrapping ref in &#...; characters.
      +When SGMLParser finds a character reference, it calls handle_charref with the bare reference.  If the HTML document contains the reference &#160;, ref will be 160.  Reconstructing the original complete character reference just involves wrapping ref in &#...; characters.
       
       
       
       5 
       
       Entity references are similar to character references, but without the hash mark.  Reconstructing the original entity reference
      -            requires wrapping ref in &...; characters.  (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this.  Only certain standard
      -HTML entities end in a semicolon; other similar-looking entities do not.  Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs.  Hence the extra if statement.)
      +            requires wrapping ref in &...; characters.  (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this.  Only certain standard
      +HTML entities end in a semicolon; other similar-looking entities do not.  Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs.  Hence the extra if statement.)
       
       
       
       6 
       
      -Blocks of text are simply appended to self.pieces unaltered.
      +Blocks of text are simply appended to self.pieces unaltered.
       
       
       
      @@ -6611,12 +6451,12 @@ Python is much more efficient at dealing with lists.[Important
       
       
      -The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't).  BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML.  For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes.  SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script.  Always protect your client-side script
      +The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't).  BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML.  For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes.  SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script.  Always protect your client-side script
             within HTML comments.
       
       
       
      -

      Example 8.9. BaseHTMLProcessor output

      +

      Example 8.9. BaseHTMLProcessor output

           def output(self):               1
               """Return processed HTML as a single string"""
               return "".join(self.pieces) 2
      @@ -6624,13 +6464,13 @@ Python is much more efficient at dealing with lists.[ 1 -This is the one method in BaseHTMLProcessor that is never called by the ancestor SGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it. +This is the one method in BaseHTMLProcessor that is never called by the ancestor SGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it. 2 -If you prefer, you could use the join method of the string module instead: string.join(self.pieces, "") +If you prefer, you could use the join method of the string module instead: string.join(self.pieces, "")
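The collect-then-join pattern described above can be sketched in isolation. This is a minimal illustration, not the real BaseHTMLProcessor: the class name PieceCollector and its add method are made up for the example, and the code assumes Python 3 syntax.

```python
class PieceCollector:
    """Hypothetical stand-in for the self.pieces handling in BaseHTMLProcessor."""

    def __init__(self):
        # A list, not a string: each handler just appends its piece.
        self.pieces = []

    def add(self, piece):
        self.pieces.append(piece)

    def output(self):
        """Return all collected pieces as a single string."""
        # The complete string is only built when somebody asks for it.
        return "".join(self.pieces)


collector = PieceCollector()
for piece in ("<p>", "Hello, ", "world", "</p>"):
    collector.add(piece)

result = collector.output()
print(result)
```

Note that the string.join function from the string module mentioned above is not available in Python 3; the join method on the separator string is the form that works in both.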
      @@ -6638,17 +6478,17 @@ Python is much more efficient at dealing with lists.[
      W3C discusses character and entity references. -
      Python Library Reference confirms your suspicions that the htmlentitydefs module is exactly what it sounds like. +
      Python Library Reference confirms your suspicions that the htmlentitydefs module is exactly what it sounds like. -

      8.5. locals and globals

      -

      Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables. -

      Remember locals? You first saw it here: +

      8.5. locals and globals

      +

      Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables. +

      Remember locals? You first saw it here:

           def unknown_starttag(self, tag, attrs):
               strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
               self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
      -

      No, wait, you can't learn about locals yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention. +

      No, wait, you can't learn about locals yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention.

      Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.

      At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which @@ -6656,17 +6496,17 @@ keeps track of the function's variables, including function arguments and locall own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any module, which holds built-in functions and exceptions. -

      When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order: +

      When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:

        -
      1. local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching. +
      1. local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching. -
      2. global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching. +
      2. global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching. -
      3. built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable. +
      3. built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable.
      -

      If Python doesn't find x in any of these namespaces, it gives up and raises a NameError with the message There is no variable named 'x', which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn't appreciate how much work Python was doing before giving you that error. +

      If Python doesn't find x in any of these namespaces, it gives up and raises a NameError with the message There is no variable named 'x', which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn't appreciate how much work Python was doing before giving you that error.
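If you want to watch the search order happen, here is a small sketch. It assumes Python 3 syntax, and every function name in it (uses_local, uses_global, uses_builtin, lookup_missing) is invented for the illustration.

```python
x = "global x"          # lives in the module's (global) namespace


def uses_local():
    x = "local x"       # found in the local namespace first, so the search stops here
    return x


def uses_global():
    return x            # no local x, so Python falls back to the global namespace


def uses_builtin():
    return len          # neither local nor global, so it comes from the built-in namespace


def lookup_missing():
    try:
        return no_such_variable   # not defined in any namespace
    except NameError as error:
        return type(error).__name__


print(uses_local())
print(uses_global())
print(uses_builtin()("abc"))
print(lookup_missing())
```

The exact wording of the NameError message differs between Python versions, so the sketch only reports the exception's class name.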

      @@ -6675,8 +6515,8 @@ module, which holds built-in functions and exceptions. from __future__ import nested_scopes
      Important
      -

      Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function. -

      Example 8.10. Introducing locals

      >>> def foo(arg): 1
      +

      Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function. +

      Example 8.10. Introducing locals

      >>> def foo(arg): 1
       ...     x = 1
       ...     print locals()
       ...     
      @@ -6688,30 +6528,30 @@ from __future__ import nested_scopes
      1 -The function foo has two variables in its local namespace: arg, whose value is passed in to the function, and x, which is defined within the function. +The function foo has two variables in its local namespace: arg, whose value is passed in to the function, and x, which is defined within the function. 2 -locals returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values - of the dictionary are the actual values of the variables. So calling foo with 7 prints the dictionary containing the function's two local variables: arg (7) and x (1). +locals returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values + of the dictionary are the actual values of the variables. So calling foo with 7 prints the dictionary containing the function's two local variables: arg (7) and x (1). 3 -Remember, Python has dynamic typing, so you could just as easily pass a string in for arg; the function (and the call to locals) would still work just as well. locals works with all variables of all datatypes. +Remember, Python has dynamic typing, so you could just as easily pass a string in for arg; the function (and the call to locals) would still work just as well. locals works with all variables of all datatypes. -
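For comparison, here is the same experiment as a sketch in Python 3 syntax. It returns the dictionary instead of printing it so you can inspect it afterwards; the names with_number and with_string are just labels for this example.

```python
def foo(arg):
    x = 1
    # locals() returns a dictionary: variable names (as strings) mapped to values.
    return locals()


with_number = foo(7)      # arg is an integer here...
with_string = foo("bar")  # ...and a string here; locals() doesn't care

print(with_number)
print(with_string)
```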

      What locals does for the local (function) namespace, globals does for the global (module) namespace. globals is more exciting, though, because a module's namespace is more exciting.[3] Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes +

      What locals does for the local (function) namespace, globals does for the global (module) namespace. globals is more exciting, though, because a module's namespace is more exciting.[3] Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes defined in the module. Plus, it includes anything that was imported into the module. -

      Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access -any of its functions or attributes: module.function. But with from module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you -access them directly without referencing the original module they came from. With the globals function, you can actually see this happen. -

      Example 8.11. Introducing globals

      -

      Look at the following block of code at the bottom of BaseHTMLProcessor.py:

      +

      Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access +any of its functions or attributes: module.function. But with from module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you +access them directly without referencing the original module they came from. With the globals function, you can actually see this happen. +

      Example 8.11. Introducing globals

      +

      Look at the following block of code at the bottom of BaseHTMLProcessor.py:

       if __name__ == "__main__":
           for k, v in globals().items():             1
               print k, "=", v
      @@ -6719,7 +6559,7 @@ if __name__ == "__main__": 1 -Just so you don't get intimidated, remember that you've seen all this before. The globals function returns a dictionary, and you're iterating through the dictionary using the items method and multi-variable assignment. The only thing new here is the globals function. +Just so you don't get intimidated, remember that you've seen all this before. The globals function returns a dictionary, and you're iterating through the dictionary using the items method and multi-variable assignment. The only thing new here is the globals function. @@ -6734,25 +6574,25 @@ __name__ = __main__ 1 -SGMLParser was imported from sgmllib, using from module import. That means that it was imported directly into the module's namespace, and here it is. +SGMLParser was imported from sgmllib, using from module import. That means that it was imported directly into the module's namespace, and here it is. 2 -Contrast this with htmlentitydefs, which was imported using import. That means that the htmlentitydefs module itself is in the namespace, but the entitydefs variable defined within htmlentitydefs is not. +Contrast this with htmlentitydefs, which was imported using import. That means that the htmlentitydefs module itself is in the namespace, but the entitydefs variable defined within htmlentitydefs is not. 3 -This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class. +This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class. 4 -Remember the if __name__ trick? When running a module (as opposed to importing it from another module), the built-in __name__ attribute is a special value, __main__. Since you ran this module as a script from the command line, __name__ is __main__, which is why the little test code to print the globals got executed. 
+Remember the if __name__ trick? When running a module (as opposed to importing it from another module), the built-in __name__ attribute is a special value, __main__. Since you ran this module as a script from the command line, __name__ is __main__, which is why the little test code to print the globals got executed. @@ -6761,14 +6601,14 @@ __name__ = __main__ Note -Using the locals and globals functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors - the functionality of the getattr function, which allows you to access arbitrary functions dynamically by providing the function name as a string. +Using the locals and globals functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors + the functionality of the getattr function, which allows you to access arbitrary functions dynamically by providing the function name as a string. -

      There is one other important difference between the locals and globals functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning +

      There is one other important difference between the locals and globals functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning it. -

      Example 8.12. locals is read-only, globals is not

      +

      Example 8.12. locals is read-only, globals is not

       def foo(arg):
           x = 1
           print locals()    1
      @@ -6785,14 +6625,14 @@ print "z=",z          
       1 
       
      -Since foo is called with 3, this will print {'arg': 3, 'x': 1}.  This should not be a surprise.
      +Since foo is called with 3, this will print {'arg': 3, 'x': 1}.  This should not be a surprise.
       
       
       
       2 
       
      -locals is a function that returns a dictionary, and here you are setting a value in that dictionary.  You might think that this
      -            would change the value of the local variable x to 2, but it doesn't.  locals does not actually return the local namespace, it returns a copy.  So changing it does nothing to the value of the variables
      +locals is a function that returns a dictionary, and here you are setting a value in that dictionary.  You might think that this
      +            would change the value of the local variable x to 2, but it doesn't.  locals does not actually return the local namespace, it returns a copy.  So changing it does nothing to the value of the variables
                   in the local namespace.
       
       
      @@ -6805,7 +6645,7 @@ print "z=",z          
       4 
       
      -After being burned by locals, you might think that this wouldn't change the value of z, but it does.  Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), globals returns the actual global namespace, not a copy: the exact opposite behavior of locals.  So any changes to the dictionary returned by globals directly affect your global variables.
      +After being burned by locals, you might think that this wouldn't change the value of z, but it does.  Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), globals returns the actual global namespace, not a copy: the exact opposite behavior of locals.  So any changes to the dictionary returned by globals directly affect your global variables.
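The difference is easy to check for yourself. The sketch below assumes Python 3 syntax and the standard CPython interpreter, and it returns x rather than printing it; otherwise it follows the same steps as the example above.

```python
z = 7


def foo(arg):
    x = 1
    # This only changes the dictionary that locals() handed back,
    # not the local variable x itself.
    locals()["x"] = 2
    return x


x_after = foo(3)

# globals() is the real module namespace, so this assignment sticks.
globals()["z"] = 8

print("x =", x_after)
print("z =", z)
```

If you run this as a script, x keeps its original value and z picks up the new one.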
       
       
       
      @@ -6816,7 +6656,7 @@ print "z=",z          
       
       

      8.6. Dictionary-based string formatting

      -

      Why did you learn about locals and globals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in +

      Why did you learn about locals and globals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're constantly switching between reading the string and reading the tuple of values. @@ -6833,14 +6673,14 @@ constantly switching between reading the string and reading the tuple of values. 1 -Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker. +Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker. 2 Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the - formatting will fail with a KeyError. + formatting will fail with a KeyError. @@ -6851,8 +6691,8 @@ constantly switching between reading the string and reading the tuple of values.

      So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys and values simply to do string formatting in the next line; it's really most useful when you happen to have a dictionary of -meaningful keys and values already. Like locals. -

      Example 8.14. Dictionary-based string formatting in BaseHTMLProcessor.py

      +meaningful keys and values already.  Like locals.
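Before moving on to locals, here is dictionary-based formatting on its own, as a sketch in Python 3 syntax. The params dictionary is cut down to two keys for illustration; the uid entry and the sentence being formatted are assumptions of this example.

```python
params = {"uid": "sa", "pwd": "secret"}

# A single named marker: the name in parentheses is looked up in params.
single = "%(pwd)s" % params

# Several named markers, in any order, each looked up by key.
combined = "%(uid)s uses password %(pwd)s" % params

# A marker whose key is not in the dictionary fails with a KeyError.
try:
    "%(missing)s" % params
except KeyError as error:
    missing_key = error.args[0]

print(single)
print(combined)
print("no such key:", missing_key)
```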
      +

      Example 8.14. Dictionary-based string formatting in BaseHTMLProcessor.py

           def handle_comment(self, text):        
               self.pieces.append("<!--%(text)s-->" % locals()) 1
       
      @@ -6860,8 +6700,8 @@ meaningful keys and values already. Like 1 -Using the built-in locals function is the most common use of dictionary-based string formatting. It means that you can use the names of local variables - within your string (in this case, text, which was passed to the class method as an argument) and each named variable will be replaced by its value. If text is 'Begin page footer', the string formatting "<!--%(text)s-->" % locals() will resolve to the string '<!--Begin page footer-->'. +Using the built-in locals function is the most common use of dictionary-based string formatting. It means that you can use the names of local variables + within your string (in this case, text, which was passed to the class method as an argument) and each named variable will be replaced by its value. If text is 'Begin page footer', the string formatting "<!--%(text)s-->" % locals() will resolve to the string '<!--Begin page footer-->'. @@ -6874,20 +6714,20 @@ meaningful keys and values already. Like 1 -When this method is called, attrs is a list of key/value tuples, just like the items of a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down: +When this method is called, attrs is a list of key/value tuples, just like the items of a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down:
        -
      1. Suppose attrs is [('href', 'index.html'), ('title', 'Go to home page')]. +
      1. Suppose attrs is [('href', 'index.html'), ('title', 'Go to home page')]. -
      2. In the first round of the list comprehension, key will get 'href', and value will get 'index.html'. +
      2. In the first round of the list comprehension, key will get 'href', and value will get 'index.html'.
      3. The string formatting ' %s="%s"' % (key, value) will resolve to ' href="index.html"'. This string becomes the first element of the list comprehension's return value. -
      4. In the second round, key will get 'title', and value will get 'Go to home page'. +
      4. In the second round, key will get 'title', and value will get 'Go to home page'.
      5. The string formatting will resolve to ' title="Go to home page"'. -
      6. The list comprehension returns a list of these two resolved strings, and strattrs will join both elements of this list together to form ' href="index.html" title="Go to home page"'. +
      6. The list comprehension returns a list of these two resolved strings, and strattrs will join both elements of this list together to form ' href="index.html" title="Go to home page"'.
      @@ -6895,7 +6735,7 @@ meaningful keys and values already. Like 2 -Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is 'a', the final result would be '<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces. +Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is 'a', the final result would be '<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces. @@ -6904,14 +6744,14 @@ meaningful keys and values already. Like Important -Using dictionary-based string formatting with locals is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a - slight performance hit in making the call to locals, since locals builds a copy of the local namespace. +Using dictionary-based string formatting with locals is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a + slight performance hit in making the call to locals, since locals builds a copy of the local namespace.
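The walkthrough above can be run end to end. This sketch pulls the two lines out of unknown_starttag into a standalone function so you can try them without a parser; the name build_start_tag is made up for the example, and the code assumes Python 3 syntax.

```python
def build_start_tag(tag, attrs):
    # Each (key, value) pair becomes ' key="value"', and the pieces are joined together.
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # locals() supplies tag and strattrs by name to the dictionary-based format string.
    return "<%(tag)s%(strattrs)s>" % locals()


attrs = [("href", "index.html"), ("title", "Go to home page")]
start_tag = build_start_tag("a", attrs)
print(start_tag)
```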

      8.7. Quoting attribute values

      -

      A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor. -

      BaseHTMLProcessor consumes HTML (since it's descended from SGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase +

      A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor. +

      BaseHTMLProcessor consumes HTML (since it's descended from SGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes at all. It is this last side effect that you can take advantage of.

      Example 8.16. Quoting attribute values

      @@ -6947,7 +6787,7 @@ at all.  It is this last side effect that you can take advantage of.
       
       1 
       
      -Note that the attribute values of the href attributes in the <a> tags are not properly quoted.  (Also note that you're using triple quotes for something other than a docstring.  And directly in the IDE, no less.  They're very useful.)
      +Note that the attribute values of the href attributes in the <a> tags are not properly quoted.  (Also note that you're using triple quotes for something other than a docstring.  And directly in the IDE, no less.  They're very useful.)
       
       
       
      @@ -6958,14 +6798,14 @@ at all.  It is this last side effect that you can take advantage of.
       
       3 
       
      -Using the output function defined in BaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values.  While this may seem anti-climactic, think
      -            about how much has actually happened here: SGMLParser parsed the entire HTML document, breaking it down into tags, refs, data, and so forth; BaseHTMLProcessor used those elements to reconstruct pieces of HTML (which are still stored in parser.pieces, if you want to see them); finally, you called parser.output, which joined all the pieces of HTML into one string.
      +Using the output function defined in BaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values.  While this may seem anti-climactic, think
      +            about how much has actually happened here: SGMLParser parsed the entire HTML document, breaking it down into tags, refs, data, and so forth; BaseHTMLProcessor used those elements to reconstruct pieces of HTML (which are still stored in parser.pieces, if you want to see them); finally, you called parser.output, which joined all the pieces of HTML into one string.
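If you are on Python 3, sgmllib is no longer in the standard library, so you can't run the example above as written. The sketch below swaps in html.parser.HTMLParser to show the same consume-and-reproduce idea. It is not BaseHTMLProcessor: the class name AttributeQuoter is made up, and it only handles tags and text (comments, declarations, and character or entity references are left out to keep it short).

```python
from html.parser import HTMLParser


class AttributeQuoter(HTMLParser):
    """Hypothetical, cut-down analogue of BaseHTMLProcessor built on html.parser."""

    def __init__(self):
        # Leave character references alone instead of converting them in text.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Rebuild the tag, always enclosing attribute values in double quotes.
        strattrs = "".join(
            ' %s="%s"' % (key, value) if value is not None else " %s" % key
            for key, value in attrs
        )
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        # Blocks of text pass through unaltered.
        self.pieces.append(data)

    def output(self):
        """Return processed HTML as a single string."""
        return "".join(self.pieces)


parser = AttributeQuoter()
parser.feed("<a href=index.html>Home</a> <A HREF=toc.html>Contents</A>")
parser.close()
quoted = parser.output()
print(quoted)
```

As with SGMLParser, tag and attribute names come out in lowercase, which is why the second link changes case as well as gaining quotes.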
       
       
       
      -

      8.8. Introducing dialect.py

      -

      Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered. -

      To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre. +

      8.8. Introducing dialect.py

      +

      Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered. +

      To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre.

      Example 8.17. Handling specific tags

           def start_pre(self, attrs):             1
               self.verbatim += 1                  2
      @@ -6978,25 +6818,25 @@ at all.  It is this last side effect that you can take advantage of.
       
       1 
       
      -start_pre is called every time SGMLParser finds a <pre> tag in the HTML source.  (In a minute, you'll see exactly how this happens.)  The method takes a single parameter, attrs, which contains the attributes of the tag (if any).  attrs is a list of key/value tuples, just like unknown_starttag takes.
      +start_pre is called every time SGMLParser finds a <pre> tag in the HTML source.  (In a minute, you'll see exactly how this happens.)  The method takes a single parameter, attrs, which contains the attributes of the tag (if any).  attrs is a list of key/value tuples, just like unknown_starttag takes.
       
       
       
       2 
       
      -In the reset method, you initialize a data attribute that serves as a counter for <pre> tags.  Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you'll decrement the counter.  (You could just use this as a flag and set it to 1 and reset it to 0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.)  In a minute, you'll see how this counter is put to good use.
      +In the reset method, you initialize a data attribute that serves as a counter for <pre> tags.  Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you'll decrement the counter.  (You could just use this as a flag and set it to 1 and reset it to 0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.)  In a minute, you'll see how this counter is put to good use.
       
       
       
       3 
       
      -That's it, that's the only special processing you do for <pre> tags.  Now you pass the list of attributes along to unknown_starttag so it can do the default processing.
      +That's it, that's the only special processing you do for <pre> tags.  Now you pass the list of attributes along to unknown_starttag so it can do the default processing.
       
       
       
       4 
       
      -end_pre is called every time SGMLParser finds a </pre> tag.  Since end tags can not contain attributes, the method takes no parameters.
      +end_pre is called every time SGMLParser finds a </pre> tag.  Since end tags can not contain attributes, the method takes no parameters.
       
       
       
      @@ -7007,12 +6847,12 @@ at all.  It is this last side effect that you can take advantage of.
       
       6 
       
      -Second, you decrement your counter to signal that this <pre> block has been closed.
      +Second, you decrement your counter to signal that this <pre> block has been closed.
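
If you'd like to watch the counter at work without the rest of Dialectizer, here is a minimal sketch. It assumes Python 3, where sgmllib no longer exists, so it swaps in html.parser.HTMLParser and its generic handle_starttag and handle_endtag hooks instead of start_pre and end_pre. The class name PreTracker and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 stand-in; sgmllib is Python 2 only


class PreTracker(HTMLParser):
    """Hypothetical class: records each text block and whether it sat inside <pre>."""

    def __init__(self):
        super().__init__()
        self.verbatim = 0   # depth counter, same idea as in the text
        self.pieces = []    # list of (inside_pre, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.verbatim += 1   # entering a <pre> block (possibly nested)

    def handle_endtag(self, tag):
        if tag == "pre":
            self.verbatim -= 1   # leaving a <pre> block

    def handle_data(self, data):
        self.pieces.append((self.verbatim > 0, data))


tracker = PreTracker()
tracker.feed("<p>one</p><pre>two<pre>three</pre>four</pre><p>five</p>")
tracker.close()

inside_pre = "".join(text for inside, text in tracker.pieces if inside)
outside_pre = "".join(text for inside, text in tracker.pieces if not inside)
```

Because verbatim is a counter and not a flag, the text after the inner </pre> still counts as being inside the outer block.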
       
       
       
      -

      At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding. -

      Example 8.18. SGMLParser

      +

      At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding. +

      Example 8.18. SGMLParser

           def finish_starttag(self, tag, attrs):               1
               try:        
                   method = getattr(self, 'start_' + tag)       2
      @@ -7036,46 +6876,46 @@ at all.  It is this last side effect that you can take advantage of.
       
       1 
       
      -At this point, SGMLParser has already found a start tag and parsed the attribute list.  The only thing left to do is figure out whether there is a
      -            specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).
      +At this point, SGMLParser has already found a start tag and parsed the attribute list.  The only thing left to do is figure out whether there is a
      +            specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).
       
       
       
       2 
       
      -The “magic” of SGMLParser is nothing more than your old friend, getattr.  What you may not have realized before is that getattr will find methods defined in descendants of an object as well as the object itself.  Here the object is self, the current instance.  So if tag is 'pre', this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class.
      +The “magic” of SGMLParser is nothing more than your old friend, getattr.  What you may not have realized before is that getattr will find methods defined in descendants of an object as well as the object itself.  Here the object is self, the current instance.  So if tag is 'pre', this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class.
       
       
       
       3 
       
      -getattr raises an AttributeError if the method it's looking for doesn't exist in the object (or any of its descendants), but that's okay, because you wrapped
      -            the call to getattr inside a try...except block and explicitly caught the AttributeError.
      +getattr raises an AttributeError if the method it's looking for doesn't exist in the object (or any of its descendants), but that's okay, because you wrapped
      +            the call to getattr inside a try...except block and explicitly caught the AttributeError.
       
       
       
       4 
       
      -Since you didn't find a start_xxx method, you'll also look for a do_xxx method before giving up.  This alternate naming scheme is generally used for standalone tags, like <br>, which have no corresponding end tag.  But you can use either naming scheme; as you can see, SGMLParser tries both for every tag.  (You shouldn't define both a start_xxx and do_xxx handler method for the same tag, though; only the start_xxx method will get called.)
      +Since you didn't find a start_xxx method, you'll also look for a do_xxx method before giving up.  This alternate naming scheme is generally used for standalone tags, like <br>, which have no corresponding end tag.  But you can use either naming scheme; as you can see, SGMLParser tries both for every tag.  (You shouldn't define both a start_xxx and do_xxx handler method for the same tag, though; only the start_xxx method will get called.)
       
       
       
       5 
       
      -Another AttributeError, which means that the call to getattr failed with do_xxx.  Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag.
      +Another AttributeError, which means that the call to getattr failed with do_xxx.  Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag.
       
       
       
       6 
       
      -Remember, try...except blocks can have an else clause, which is called if no exception is raised during the try...except block.  Logically, that means that you did find a do_xxx method for this tag, so you're going to call it.
      +Remember, try...except blocks can have an else clause, which is called if no exception is raised during the try...except block.  Logically, that means that you did find a do_xxx method for this tag, so you're going to call it.
       
       
       
       7 
       
       By the way, don't worry about these different return values; in theory they mean something, but they're never actually used.
      -             Don't worry about the self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this
      +             Don't worry about the self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this
                   information either.  In theory, you could use this module to validate that your tags were fully balanced, but it's probably
                   not worth it, and it's beyond the scope of this chapter.  You have better things to worry about right now.
       
      @@ -7083,40 +6923,40 @@ at all.  It is this last side effect that you can take advantage of.
       
       8 
       
      -start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched.  You don't need that level of control, so you just let this method do its thing, which is to call
      -            the method (start_xxx or do_xxx) with the list of attributes.  Remember, method is a function, returned from getattr, and functions are objects.  (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run
      +start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched.  You don't need that level of control, so you just let this method do its thing, which is to call
      +            the method (start_xxx or do_xxx) with the list of attributes.  Remember, method is a function, returned from getattr, and functions are objects.  (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run
                   out of ways to use it to my advantage.)  Here, the function object is passed into this dispatch method as an argument, and
                   this method turns around and calls the function.  At this point, you don't need to know what the function is, what it's named,
      -            or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.
      +            or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.
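
The dispatch logic is easier to see once it is stripped down. The following is a minimal sketch, not the real SGMLParser: TinyDispatcher and PreAware are hypothetical classes, and the sketch uses the three-argument form of getattr (which returns a default instead of raising AttributeError) in place of the nested try...except blocks shown above.

```python
class TinyDispatcher:
    """Hypothetical stand-in for the dispatch idea; not the real SGMLParser."""

    def __init__(self):
        self.calls = []

    def finish_starttag(self, tag, attrs):
        # Look for start_<tag> first, then do_<tag>, then give up.
        method = getattr(self, "start_" + tag, None)
        if method is None:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            self.handle_starttag(tag, method, attrs)

    def handle_starttag(self, tag, method, attrs):
        # method is a bound method object; call it without knowing its name.
        method(attrs)

    def unknown_starttag(self, tag, attrs):
        self.calls.append(("unknown", tag))


class PreAware(TinyDispatcher):
    # Defined on the descendant; getattr on the instance still finds them.
    def start_pre(self, attrs):
        self.calls.append(("start_pre", attrs))

    def do_br(self, attrs):
        self.calls.append(("do_br", attrs))


parser = PreAware()
parser.finish_starttag("pre", [("class", "code")])
parser.finish_starttag("br", [])
parser.finish_starttag("blink", [])
```

After these three calls, parser.calls records one start_pre call, one do_br call, and one fallback to unknown_starttag, in that order.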
       
       
       
      -

      Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, -you need to override the handle_data method. -

      Example 8.19. Overriding the handle_data method

      +

      Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, +you need to override the handle_data method. +

      Example 8.19. Overriding the handle_data method

           def handle_data(self, text):     1
               self.pieces.append(self.verbatim and text or self.process(text)) 2
      - -
      1 handle_data is called with only one argument, the text to process. +handle_data is called with only one argument, the text to process.
      2 In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the +In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using the and-or trick.
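
A small sketch of the and-or trick on its own, with one caveat worth knowing. The helper names pick_and_or and pick_conditional are made up for illustration, and the conditional expression in the second one assumes Python 2.5 or later. If the text is an empty string, the and-or form falls through to the processing branch even when you are inside a <pre> block; the conditional expression does not.

```python
def pick_and_or(verbatim, text, process):
    # The and-or trick: misbehaves when text is falsy (for example, "").
    return verbatim and text or process(text)


def pick_conditional(verbatim, text, process):
    # Conditional expression (Python 2.5+): chooses on verbatim alone.
    return text if verbatim else process(text)


def shout(text):
    # Hypothetical stand-in for Dialectizer's process method.
    return text.upper() + "!"


inside = pick_and_or(1, "abc", shout)              # verbatim: text passes through
outside = pick_and_or(0, "abc", shout)             # not verbatim: text is processed
empty_and_or = pick_and_or(1, "", shout)           # pitfall: processed anyway
empty_conditional = pick_conditional(1, "", shout)
```

For handle_data the pitfall is mostly harmless, since processing an empty string usually yields an empty string, but it matters whenever the processing step can turn empty input into non-empty output, as shout does here.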
      -

      You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes -later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough +

      You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes +later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough for one chapter.

      8.9. Putting it all together

      It's time to put everything you've learned so far to good use. I hope you were paying attention. -

      Example 8.20. The translate function, part 1

      +

      Example 8.20. The translate function, part 1

       def translate(url, dialectName="chef"): 1
           import urllib     2
           sock = urllib.urlopen(url)          3
      @@ -7127,7 +6967,7 @@ def translate(url, dialectName="chef"): 1 
       
      -The translate function has an optional argument dialectName, which is a string that specifies the dialect you'll be using.  You'll see how this is used in a minute.
      +The translate function has an optional argument dialectName, which is a string that specifies the dialect you'll be using.  You'll see how this is used in a minute.
       
       
       
      @@ -7147,7 +6987,7 @@ def translate(url, dialectName="chef"): 

      Example 8.21. The translate function, part 2: curiouser and curiouser

      +

      Example 8.21. The translate function, part 2: curiouser and curiouser

           parserName = "%sDialectizer" % dialectName.capitalize() 1
           parserClass = globals()[parserName]   2
           parser = parserClass()                3
      @@ -7156,32 +6996,32 @@ def translate(url, dialectName="chef"): 1 
       
      -capitalize is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else
      -            to lowercase.  Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class.  If dialectName is the string 'chef', parserName will be the string 'ChefDialectizer'.
      +capitalize is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else
      +            to lowercase.  Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class.  If dialectName is the string 'chef', parserName will be the string 'ChefDialectizer'.
       
       
       
       2 
       
      -You have the name of a class as a string (parserName), and you have the global namespace as a dictionary (globals()).  Combined, you can get a reference to the class which the string names.  (Remember, classes are objects, and they can be assigned to variables just like any other object.)  If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer.
      +You have the name of a class as a string (parserName), and you have the global namespace as a dictionary (globals()).  Combined, you can get a reference to the class which the string names.  (Remember, classes are objects, and they can be assigned to variables just like any other object.)  If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer.
       
       
       
       3 
       
      -Finally, you have a class object (parserClass), and you want an instance of the class.  Well, you already know how to do that: call the class like a function.  The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable
      -            like a function, and out pops an instance of the class.  If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer.
      +Finally, you have a class object (parserClass), and you want an instance of the class.  Well, you already know how to do that: call the class like a function.  The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable
      +            like a function, and out pops an instance of the class.  If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer.
       
       
       
      -

      Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there's no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing 'foo' as the dialectName. -

      Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You've already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file. +

      Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there's no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing 'foo' as the dialectName. +

      Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You've already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file.
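
Here is the lookup in isolation, as a minimal sketch. The two classes are empty stand-ins (the real ones descend from Dialectizer), and make_parser is a hypothetical helper that mirrors just the three lines of translate shown above.

```python
class ChefDialectizer:
    """Empty stand-in for the real class in dialect.py."""


class FooDialectizer:
    """The imaginary dialect you 'defined tomorrow'."""


def make_parser(dialect_name="chef"):
    # 'chef' -> 'ChefDialectizer'
    parser_name = "%sDialectizer" % dialect_name.capitalize()
    # Look the class up by name in this module's global namespace.
    parser_class = globals()[parser_name]
    # Classes are objects; calling one creates an instance.
    return parser_class()


chef = make_parser()
foo = make_parser("foo")

try:
    make_parser("klingon")
    missing = False
except KeyError:
    # No KlingonDialectizer in globals(), so the dictionary lookup fails.
    missing = True
```

Note that make_parser never mentions FooDialectizer by name; defining the class was enough. An unknown dialect name surfaces as a KeyError from the dictionary lookup.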

      Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page. -

      Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't +

      Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an -appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go. -

      Example 8.22. The translate function, part 3

      +appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class).  Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go.
      +

      Example 8.22. The translate function, part 3

           parser.feed(htmlSource) 1
           parser.close()          2
           return parser.output()  3
      @@ -7190,21 +7030,21 @@ appropriately-named file in the plug-ins directory (like 
       
       1 
       
      -After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation.  You had the entire HTML source in a single string, so you only had to call feed once.  However, you can call feed as often as you want, and the parser will just keep parsing.  So if you were worried about memory usage (or you knew you
      +After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation.  You had the entire HTML source in a single string, so you only had to call feed once.  However, you can call feed as often as you want, and the parser will just keep parsing.  So if you were worried about memory usage (or you knew you
                   were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser.  The result would be the same.
       
       
       
       2 
       
      -Because feed maintains an internal buffer, you should always call the parser's close method when you're done (even if you fed it all at once, like you did).  Otherwise you may find that your output is missing
      +Because feed maintains an internal buffer, you should always call the parser's close method when you're done (even if you fed it all at once, like you did).  Otherwise you may find that your output is missing
                   the last few bytes.
       
       
       
       3 
       
      -Remember, output is the function you defined on BaseHTMLProcessor that joins all the pieces of output you've buffered and returns them in a single string.
      +Remember, output is the function you defined on BaseHTMLProcessor that joins all the pieces of output you've buffered and returns them in a single string.
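
The claim that feeding in pieces gives the same result is easy to check for yourself. This is a minimal sketch assuming Python 3, so it uses html.parser.HTMLParser (which also has feed and close methods) in place of SGMLParser. TextCollector, collect, and the sample HTML are hypothetical, and output here only joins the text, not the tags.

```python
from html.parser import HTMLParser  # Python 3 stand-in for SGMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that buffers text pieces, BaseHTMLProcessor-style."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def output(self):
        # Join all the buffered pieces into a single string.
        return "".join(self.pieces)


def collect(html_source, chunk_size=None):
    parser = TextCollector()
    if chunk_size is None:
        parser.feed(html_source)            # everything at once
    else:
        for i in range(0, len(html_source), chunk_size):
            parser.feed(html_source[i:i + chunk_size])   # a few characters at a time
    parser.close()                          # flush whatever is still buffered
    return parser.output()


html_source = "<html><body><p>Hello, <b>world</b>!</p></body></html>"
all_at_once = collect(html_source)
in_chunks = collect(html_source, chunk_size=8)
```

The chunk boundaries fall in the middle of tags, which is exactly the situation the parser's internal buffer exists to handle.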
       
       
       
      @@ -7216,26 +7056,26 @@ appropriately-named file in the plug-ins directory (like 
       
       
       

      8.10. Summary

      -

      Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways. +

      Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways.

      • parsing the HTML looking for something specific
      • aggregating the results, like the URL lister
      • altering the structure along the way, like the attribute quoter -
      • transforming the HTML into something else by manipulating the text while leaving the tags alone, like the Dialectizer +
      • transforming the HTML into something else by manipulating the text while leaving the tags alone, like the Dialectizer

      Along with these examples, you should be comfortable doing all of the following things:



      -

      [1] The technical term for a parser like SGMLParser is a consumer: it consumes HTML and breaks it down. Presumably, the name feed was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or +

      [1] The technical term for a parser like SGMLParser is a consumer: it consumes HTML and breaks it down. Presumably, the name feed was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that's just me. In any event, it's an interesting mental image. @@ -7243,7 +7083,7 @@ appropriately-named file in the plug-ins directory (like

      [2] The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list just adds the element and updates the index. Since strings can not be changed after they are created, code like s = s + newpiece will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets - longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n^2). + longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n^2).

      [3] I don't get out much.

      @@ -7253,14 +7093,14 @@ appropriately-named file in the plug-ins directory (like

      9.1. Diving in

      These next two chapters are about XML processing in Python. It would be helpful if you already knew what an XML document looks like, that it's made up of structured tags to form a hierarchy of elements, and so on. If this doesn't make sense to you, there are many XML tutorials that can explain the basics. -

      If you're not particularly interested in XML, you should still read these chapters, which cover important topics like Python packages, Unicode, command line arguments, and how to use getattr for method dispatching. +

      If you're not particularly interested in XML, you should still read these chapters, which cover important topics like Python packages, Unicode, command line arguments, and how to use getattr for method dispatching.

      Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer science. -

      There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM. +

      There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM.
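
To make the difference concrete, here is a minimal sketch that reads the same tiny document both ways, using the standard xml.dom.minidom and xml.sax modules. The XML string and the NameRecorder class are made up for illustration; this is not the grammar format used by kgp.py.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, not one of the book's grammar files.
XML_SOURCE = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

# DOM: parse the whole document into a tree, then walk the tree.
doc = minidom.parseString(XML_SOURCE)
dom_texts = [p.firstChild.data for p in doc.getElementsByTagName("p")]


# SAX: the parser calls your method for each element as it reads it.
class NameRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


handler = NameRecorder()
xml.sax.parseString(XML_SOURCE.encode("utf-8"), handler)
sax_names = handler.names
```

With the DOM you ask questions of a finished tree; with SAX you are told about each element in document order and have to keep your own notes.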

      The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry yet if you don't understand what that means; you'll examine both the program's input and its output in more depth throughout these next two chapters. -

      Example 9.1. kgp.py

      +

      Example 9.1. kgp.py

      If you have not already done so, you can download this and other examples used in this book.

       """Kant Generator for Python
       
      @@ -7502,7 +7342,7 @@ def main(argv):
       
       if __name__ == "__main__":
           main(sys.argv[1:])
      -

      Example 9.2. toolbox.py

      +

      Example 9.2. toolbox.py

       """Miscellaneous utility functions"""
       
       def openAnything(source):            
      @@ -7549,8 +7389,8 @@ def openAnything(source):
           # treat source as string
           import StringIO     
           return StringIO.StringIO(str(source)) 
      -

      Run the program kgp.py by itself, and it will parse the default XML-based grammar, in kant.xml, and print several paragraphs worth of philosophy in the style of Immanuel Kant. -

      Example 9.3. Sample output of kgp.py

      [you@localhost kgp]$ python kgp.py
      +

      Run the program kgp.py by itself, and it will parse the default XML-based grammar, in kant.xml, and print several paragraphs worth of philosophy in the style of Immanuel Kant. +

      Example 9.3. Sample output of kgp.py

      [you@localhost kgp]$ python kgp.py
            As is shown in the writings of Hume, our a priori concepts, in
       reference to ends, abstract from all content of knowledge; in the study
       of space, the discipline of human reason, in accordance with the
      @@ -7589,13 +7429,13 @@ the sort of thing that Kant would have agreed with), some of it is blatantly fal
       But all of it is in the style of Immanuel Kant.
       

      Let me repeat that this is much, much funnier if you are now or have ever been a philosophy major.

      The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous -example was derived from the grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be +example was derived from the grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be completely different. -

      Example 9.4. Simpler output from kgp.py

      [you@localhost kgp]$ python kgp.py -g binary.xml
      +

      Example 9.4. Simpler output from kgp.py

      [you@localhost kgp]$ python kgp.py -g binary.xml
       00101001
       [you@localhost kgp]$ python kgp.py -g binary.xml
       10110100

      You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is -that the grammar file defines the structure of the output, and the kgp.py program reads through the grammar and makes random decisions about which words to plug in where. +that the grammar file defines the structure of the output, and the kgp.py program reads through the grammar and makes random decisions about which words to plug in where.

      9.2. Packages

      Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages. @@ -7606,13 +7446,13 @@ that the grammar file defines the structure of the output, and the 1 -This is a syntax you haven't seen before. It looks almost like the from module import you know and love, but the "." gives it away as something above and beyond a simple import. In fact, xml is what is known as a package, dom is a nested package within xml, and minidom is a module within xml.dom. +This is a syntax you haven't seen before. It looks almost like the from module import you know and love, but the "." gives it away as something above and beyond a simple import. In fact, xml is what is known as a package, dom is a nested package within xml, and minidom is a module within xml.dom.

      That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still -just .py files, like always, except that they're in a subdirectory instead of the main lib/ directory of your Python installation. +just .py files, like always, except that they're in a subdirectory instead of the main lib/ directory of your Python installation.

      Example 9.6. File layout of a package

      Python21/           root Python installation (home of the executable)
       |
       +--lib/             library directory (home of the standard library modules)
      @@ -7623,7 +7463,7 @@ just .py files, like always, except that they're i
              |
              +--dom/      xml.dom package (contains minidom.py)
              |
      -       +--parsers/  xml.parsers package (used internally)

      So when you say from xml.dom import minidom, Python figures out that that means “look in the xml directory for a dom directory, and look in that for the minidom module, and import it as minidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import + +--parsers/ xml.parsers package (used internally)

      So when you say from xml.dom import minidom, Python figures out that that means “look in the xml directory for a dom directory, and look in that for the minidom module, and import it as minidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import specific classes or functions from a module contained within a package. You can also import the package itself as a module. The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.

      Example 9.7. Packages are modules, too

      >>> from xml.dom import minidom         1
      @@ -7646,41 +7486,41 @@ The syntax is all the same; Python figures out what you mean based on the file l
       
       1 
       
      -Here you're importing a module (minidom) from a nested package (xml.dom).  The result is that minidom is imported into your namespace, and in order to reference classes within the minidom module (like Element), you need to preface them with the module name.
      +Here you're importing a module (minidom) from a nested package (xml.dom).  The result is that minidom is imported into your namespace, and in order to reference classes within the minidom module (like Element), you need to preface them with the module name.
       
       
       
       2 
       
      -Here you are importing a class (Element) from a module (minidom) from a nested package (xml.dom).  The result is that Element is imported directly into your namespace.  Note that this does not interfere with the previous import; the Element class can now be referenced in two ways (but it's all still the same class).
      +Here you are importing a class (Element) from a module (minidom) from a nested package (xml.dom).  The result is that Element is imported directly into your namespace.  Note that this does not interfere with the previous import; the Element class can now be referenced in two ways (but it's all still the same class).
       
       
       
       3 
       
      -Here you are importing the dom package (a nested package of xml) as a module in and of itself.  Any level of a package can be treated as a module, as you'll see in a moment.  It can even
      +Here you are importing the dom package (a nested package of xml) as a module in and of itself.  Any level of a package can be treated as a module, as you'll see in a moment.  It can even
                   have its own attributes and methods, just like the modules you've seen before.
       
       
       
       4 
       
      -Here you are importing the root level xml package as a module.
      +Here you are importing the root level xml package as a module.
       
       
       
       

      So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)? -The answer is the magical __init__.py file. You see, packages are not simply directories; they are directories with a specific file, __init__.py, inside. This file defines the attributes and methods of the package. For instance, xml.dom contains a Node class, which is defined in xml/dom/__init__.py. When you import a package as a module (like dom from xml), you're really importing its __init__.py file. +The answer is the magical __init__.py file. You see, packages are not simply directories; they are directories with a specific file, __init__.py, inside. This file defines the attributes and methods of the package. For instance, xml.dom contains a Node class, which is defined in xml/dom/__init__.py. When you import a package as a module (like dom from xml), you're really importing its __init__.py file.
      -
      Note
      A package is a directory with the special __init__.py file in it. The __init__.py file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, - but it has to exist. But if __init__.py doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages. +A package is a directory with the special __init__.py file in it. The __init__.py file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, + but it has to exist. But if __init__.py doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.
      -

      So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an xml package with sax and dom packages inside, the authors could have chosen to put all the sax functionality in xmlsax.py and all the dom functionality in xmldom.py, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the XML package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different +

      So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an xml package with sax and dom packages inside, the authors could have chosen to put all the sax functionality in xmlsax.py and all the dom functionality in xmldom.py, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the XML package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different areas simultaneously).

      If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good package architecture. It's one of the many things Python is good at, so take advantage of it. @@ -7707,26 +7547,26 @@ package architecture. It's one of the many things Python is good at, so take ad 1 -As you saw in the previous section, this imports the minidom module from the xml.dom package. +As you saw in the previous section, this imports the minidom module from the xml.dom package. 2 -Here is the one line of code that does all the work: minidom.parse takes one argument and returns a parsed representation of the XML document. The argument can be many things; in this case, it's simply a filename of an XML document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.) +Here is the one line of code that does all the work: minidom.parse takes one argument and returns a parsed representation of the XML document. The argument can be many things; in this case, it's simply a filename of an XML document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.) But you can also pass a file object, or even a file-like object. You'll take advantage of this flexibility later in this chapter. 3 -The object returned from minidom.parse is a Document object, a descendant of the Node class. This Document object is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed to minidom.parse. +The object returned from minidom.parse is a Document object, a descendant of the Node class. This Document object is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed to minidom.parse. 
4 -toxml is a method of the Node class (and is therefore available on the Document object you got from minidom.parse). toxml prints out the XML that this Node represents. For the Document node, this prints out the entire XML document. +toxml is a method of the Node class (and is therefore available on the Document object you got from minidom.parse). toxml prints out the XML that this Node represents. For the Document node, this prints out the entire XML document. @@ -7742,7 +7582,7 @@ package architecture. It's one of the many things Python is good at, so take ad 1 -Every Node has a childNodes attribute, which is a list of the Node objects. A Document always has only one child node, the root element of the XML document (in this case, the grammar element). +Every Node has a childNodes attribute, which is a list of the Node objects. A Document always has only one child node, the root element of the XML document (in this case, the grammar element). @@ -7755,11 +7595,11 @@ package architecture. It's one of the many things Python is good at, so take ad 3 -Since getting the first child node of a node is a useful and common activity, the Node class has a firstChild attribute, which is synonymous with childNodes[0]. (There is also a lastChild attribute, which is synonymous with childNodes[-1].) +Since getting the first child node of a node is a useful and common activity, the Node class has a firstChild attribute, which is synonymous with childNodes[0]. (There is also a lastChild attribute, which is synonymous with childNodes[-1].) -

      Example 9.10. toxml works on any node

      +

      Example 9.10. toxml works on any node

       >>> grammarNode = xmldoc.firstChild
       >>> print grammarNode.toxml() 1
       <grammar>
      @@ -7776,7 +7616,7 @@ package architecture.  It's one of the many things Python is good at, so take ad
       
       1 
       
      -Since the toxml method is defined in the Node class, it is available on any XML node, not just the Document element.
      +Since the toxml method is defined in the Node class, it is available on any XML node, not just the Document element.
       
       
       
      @@ -7806,31 +7646,31 @@ package architecture.  It's one of the many things Python is good at, so take ad
       
       1 
       
      -Looking at the XML in binary.xml, you might think that the grammar has only two child nodes, the two ref elements.  But you're missing something: the carriage returns!  After the '<grammar>' and before the first '<ref>' is a carriage return, and this text counts as a child node of the grammar element.  Similarly, there is a carriage return after each '</ref>'; these also count as child nodes.  So grammar.childNodes is actually a list of 5 objects: 3 Text objects and 2 Element objects.
      +Looking at the XML in binary.xml, you might think that the grammar has only two child nodes, the two ref elements.  But you're missing something: the carriage returns!  After the '<grammar>' and before the first '<ref>' is a carriage return, and this text counts as a child node of the grammar element.  Similarly, there is a carriage return after each '</ref>'; these also count as child nodes.  So grammar.childNodes is actually a list of 5 objects: 3 Text objects and 2 Element objects.
       
       
       
       2 
       
      -The first child is a Text object representing the carriage return after the '<grammar>' tag and before the first '<ref>' tag.
      +The first child is a Text object representing the carriage return after the '<grammar>' tag and before the first '<ref>' tag.
       
       
       
       3 
       
      -The second child is an Element object representing the first ref element.
      +The second child is an Element object representing the first ref element.
       
       
       
       4 
       
      -The fourth child is an Element object representing the second ref element.
      +The fourth child is an Element object representing the second ref element.
       
       
       
       5 
       
      -The last child is a Text object representing the carriage return after the '</ref>' end tag and before the '</grammar>' end tag.
      +The last child is a Text object representing the carriage return after the '</ref>' end tag and before the '</grammar>' end tag.
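
If you want to check the child-node count without the downloaded examples, here is a small sketch under a few assumptions: it targets Python 3, it parses an inline string (SAMPLE, a trimmed stand-in for binary.xml) with minidom.parseString, and it uses documentElement to reach the root grammar element.

```python
from xml.dom import minidom

# A trimmed, hypothetical stand-in for binary.xml; the line breaks are the point.
SAMPLE = (
    "<grammar>\n"
    "<ref id='bit'><p>0</p><p>1</p></ref>\n"
    "<ref id='byte'><p>x</p></ref>\n"
    "</grammar>"
)

xmldoc = minidom.parseString(SAMPLE)
grammar = xmldoc.documentElement

# Text nodes report the name '#text'; elements report their tag name.
kinds = [node.nodeName for node in grammar.childNodes]

# Keep only the real elements, skipping the whitespace Text nodes.
elements = [node for node in grammar.childNodes
            if node.nodeType == node.ELEMENT_NODE]

print(kinds)
print(len(elements))
```

Filtering on nodeType is one common way to ignore the whitespace nodes when you only care about elements.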
       
       
       
      @@ -7857,31 +7697,31 @@ u'0'
      1 -As you saw in the previous example, the first ref element is grammarNode.childNodes[1], since childNodes[0] is a Text node for the carriage return. +As you saw in the previous example, the first ref element is grammarNode.childNodes[1], since childNodes[0] is a Text node for the carriage return. 2 -The ref element has its own set of child nodes, one for the carriage return, a separate one for the spaces, one for the p element, and so forth. +The ref element has its own set of child nodes, one for the carriage return, a separate one for the spaces, one for the p element, and so forth. 3 -You can even use the toxml method here, deeply nested within the document. +You can even use the toxml method here, deeply nested within the document. 4 -The p element has only one child node (you can't tell that from this example, but look at pNode.childNodes if you don't believe me), and it is a Text node for the single character '0'. +The p element has only one child node (you can't tell that from this example, but look at pNode.childNodes if you don't believe me), and it is a Text node for the single character '0'. 5 -The .data attribute of a Text node gives you the actual string that the text node represents. But what is that 'u' in front of the string? The answer to that deserves its own section. +The .data attribute of a Text node gives you the actual string that the text node represents. But what is that 'u' in front of the string? The answer to that deserves its own section. @@ -7925,7 +7765,7 @@ Dive in
      2 -When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference. +When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference. @@ -7947,20 +7787,20 @@ La Peña
      2 -Remember I said that the print function attempts to convert a unicode string to ASCII so it can print it? Well, that's not going to work here, because your unicode string contains non-ASCII characters, so Python raises a UnicodeError error. +Remember I said that the print function attempts to convert a unicode string to ASCII so it can print it? Well, that's not going to work here, because your unicode string contains non-ASCII characters, so Python raises a UnicodeError error. 3 -Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. s is a unicode string, but print can only print a regular string. To solve this problem, you call the encode method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme, +Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. s is a unicode string, but print can only print a regular string. To solve this problem, you call the encode method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme, which you pass as a parameter. In this case, you're using latin-1 (also known as iso-8859-1), which includes the tilde-n (whereas the default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127).
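
The same encoding step can be sketched in Python 3, with one caveat: there, every string is already unicode, encode returns a bytes object rather than a regular string, and the failure is reported as UnicodeEncodeError (a subclass of UnicodeError). Treat this as an illustration of the idea, not a transcript of the session above.

```python
s = "La Pe\xf1a"   # the same text as above; \xf1 is the tilde-n

# ASCII only covers characters 0 through 127, so this encoding must fail.
try:
    s.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# latin-1 (iso-8859-1) includes the tilde-n as a single byte.
latin1_bytes = s.encode("latin-1")

# UTF-8 also works, but needs two bytes for the same character.
utf8_bytes = s.encode("utf-8")

print(ascii_failed)
print(latin1_bytes)
print(utf8_bytes)
```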

      Remember I said Python usually converted unicode to ASCII whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which you can customize. -

      Example 9.15. sitecustomize.py

      +

      Example 9.15. sitecustomize.py

       # sitecustomize.py 1
       # this file can be anywhere in your Python path,
       # but it usually goes in ${pythondir}/lib/site-packages/
      @@ -7971,14 +7811,14 @@ sys.setdefaultencoding('iso-8859-1') 1 
       
      -sitecustomize.py is a special script; Python will try to import it on startup, so any code in it will be run automatically.  As the comment mentions, it can go anywhere
      -            (as long as import can find it), but it usually goes in the site-packages directory within your Python lib directory.
      +sitecustomize.py is a special script; Python will try to import it on startup, so any code in it will be run automatically.  As the comment mentions, it can go anywhere
      +            (as long as import can find it), but it usually goes in the site-packages directory within your Python lib directory.
       
       
       
       2 
       
      -setdefaultencoding function sets, well, the default encoding.  This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.
      +The setdefaultencoding function sets, well, the default encoding.  This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.
       
       
       
      @@ -7993,8 +7833,8 @@ La Peña
      1 -This example assumes that you have made the changes listed in the previous example to your sitecustomize.py file, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even - call sys.setdefaultencoding after Python has started up. Dig into site.py and search for “setdefaultencoding” to find out how.) +This example assumes that you have made the changes listed in the previous example to your sitecustomize.py file, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even + call sys.setdefaultencoding after Python has started up. Dig into site.py and search for “setdefaultencoding” to find out how.) @@ -8004,13 +7844,13 @@ La Peña
      -

      Example 9.17. Specifying encoding in .py files

      -

      If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:

      +

      Example 9.17. Specifying encoding in .py files

      +

      If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:

       #!/usr/bin/env python
       # -*- coding: UTF-8 -*-
       

      Now, what about XML? Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R is popular for Russian texts. The encoding, if specified, is in the header of the XML document. -

      Example 9.18. russiansample.xml

      
      +

      Example 9.18. russiansample.xml

      
       <?xml version="1.0" encoding="koi8-r"?>       1
       <preface>
       <title>Предисловие</title>  2
      @@ -8030,7 +7870,7 @@ is popular for Russian texts.  The encoding, if specified, is in the header of t
       
       
       
      -

      Example 9.19. Parsing russiansample.xml

      +

      Example 9.19. Parsing russiansample.xml

       >>> from xml.dom import minidom
       >>> xmldoc = minidom.parse('russiansample.xml') 1
       >>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
      @@ -8049,15 +7889,15 @@ UnicodeError: ASCII encoding error: ordinal not in range(128)
       
       1 
       
      -I'm assuming here that you saved the previous example as russiansample.xml in the current directory.  I am also, for the sake of completeness, assuming that you've changed your default encoding back
      -            to 'ascii' by removing your sitecustomize.py file, or at least commenting out the setdefaultencoding line.
      +I'm assuming here that you saved the previous example as russiansample.xml in the current directory.  I am also, for the sake of completeness, assuming that you've changed your default encoding back
      +            to 'ascii' by removing your sitecustomize.py file, or at least commenting out the setdefaultencoding line.
       
       
       
       2 
       
      -Note that the text data of the title tag (now in the title variable, thanks to that long concatenation of Python functions which I hastily skipped over and, annoyingly, won't explain until the next section) -- the text data inside the
      -XML document's title element is stored in unicode.
      +Note that the text data of the XML document's title element (now in the title variable, thanks to that long concatenation of Python functions which I hastily skipped over and, annoyingly, won't explain until the next section) is stored in unicode.
       
       
       
      @@ -8089,14 +7929,14 @@ in Python.  If your XML documents are all 7-bit ASCI
       
       
    26. Unicode Tutorial has some more examples of how to use Python's unicode functions, including how to force Python to coerce unicode into ASCII even when it doesn't really want to. -
    27. PEP 263 goes into more detail about how and when to define a character encoding in your .py files. +
    28. PEP 263 goes into more detail about how and when to define a character encoding in your .py files.
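
To tie the XML half of this section together, here is a hedged Python 3 sketch that builds a small koi8-r document in memory instead of reading russiansample.xml from disk. The document text is a cut-down stand-in for the sample file, and the sketch assumes your Python's XML parser can look up the koi8-r codec from the declaration in the header.

```python
from xml.dom import minidom

# The Russian title from the sample document, written with escapes so this
# source file stays pure ASCII.
expected_title = "\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435"

# Build the document as text, then encode it the way the header declares.
xml_text = (
    '<?xml version="1.0" encoding="koi8-r"?>\n'
    "<preface>\n"
    "<title>%s</title>\n"
    "</preface>" % expected_title
)
xml_bytes = xml_text.encode("koi8-r")

# The parser reads the encoding from the header and hands back unicode text.
xmldoc = minidom.parseString(xml_bytes)
title = xmldoc.getElementsByTagName("title")[0].firstChild.data

# Converting back to koi8-r is an explicit step, just like latin-1 earlier.
title_koi8 = title.encode("koi8-r")

print(title == expected_title)
print(len(title), len(title_koi8))
```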

      9.5. Searching for elements

      Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within - your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName. -

      For this section, you'll be using the binary.xml grammar file, which looks like this: -

      Example 9.20. binary.xml

      <?xml version="1.0"?>
      +   your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName.
      +

      For this section, you'll be using the binary.xml grammar file, which looks like this: +

      Example 9.20. binary.xml

      <?xml version="1.0"?>
       <!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
       <grammar>
       <ref id="bit">
      @@ -8107,8 +7947,8 @@ in Python.  If your XML documents are all 7-bit ASCI
         <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
       <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
       </ref>
      -</grammar>

      It has two refs, 'bit' and 'byte'. A bit is either a '0' or '1', and a byte is 8 bits. -

      Example 9.21. Introducing getElementsByTagName

      +</grammar>

      It has two refs, 'bit' and 'byte'. A bit is either a '0' or '1', and a byte is 8 bits. +

      Example 9.21. Introducing getElementsByTagName

       >>> from xml.dom import minidom
       >>> xmldoc = minidom.parse('binary.xml')
       >>> reflist = xmldoc.getElementsByTagName('ref') 1
      @@ -8129,7 +7969,7 @@ in Python.  If your XML documents are all 7-bit ASCI
       
       1 
       
      -getElementsByTagName takes one argument, the name of the element you wish to find.  It returns a list of Element objects, corresponding to the XML elements that have that name.  In this case, you find two ref elements.
      +getElementsByTagName takes one argument, the name of the element you wish to find.  It returns a list of Element objects, corresponding to the XML elements that have that name.  In this case, you find two ref elements.
       
       
       
      @@ -8151,19 +7991,19 @@ in Python.  If your XML documents are all 7-bit ASCI
       
       1 
       
      -Continuing from the previous example, the first object in your reflist is the 'bit' ref element.
      +Continuing from the previous example, the first object in your reflist is the 'bit' ref element.
       
       
       
       2 
       
      -You can use the same getElementsByTagName method on this Element to find all the <p> elements within the 'bit' ref element.
      +You can use the same getElementsByTagName method on this Element to find all the <p> elements within the 'bit' ref element.
       
       
       
       3 
       
      -Just as before, the getElementsByTagName method returns a list of all the elements it found.  In this case, you have two, one for each bit.
      +Just as before, the getElementsByTagName method returns a list of all the elements it found.  In this case, you have two, one for each bit.
       
       
       
      @@ -8182,25 +8022,25 @@ in Python.  If your XML documents are all 7-bit ASCI
       
       1 
       
      -Note carefully the difference between this and the previous example.  Previously, you were searching for p elements within firstref, but here you are searching for p elements within xmldoc, the root-level object that represents the entire XML document.  This does find the p elements nested within the ref elements within the root grammar element.
      +Note carefully the difference between this and the previous example.  Previously, you were searching for p elements within firstref, but here you are searching for p elements within xmldoc, the root-level object that represents the entire XML document.  This does find the p elements nested within the ref elements within the root grammar element.
       
       
       
       2 
       
      -The first two p elements are within the first ref (the 'bit' ref).
      +The first two p elements are within the first ref (the 'bit' ref).
       
       
       
       3 
       
      -The last p element is the one within the second ref (the 'byte' ref).
      +The last p element is the one within the second ref (the 'byte' ref).
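
The difference between searching inside one element and searching the whole document can be checked with a short sketch. It is written for Python 3 and parses an inline copy of the grammar (BINARY_XML, with the DOCTYPE line left out) using minidom.parseString, so it runs without the examples directory.

```python
from xml.dom import minidom

# Inline copy of the binary.xml grammar, minus the DOCTYPE declaration.
BINARY_XML = """<?xml version="1.0"?>
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""

xmldoc = minidom.parseString(BINARY_XML)

# Search the whole document for ref elements.
reflist = xmldoc.getElementsByTagName("ref")
firstref = reflist[0]

# Search only within the first ref (the 'bit' ref).
plist_in_bit = firstref.getElementsByTagName("p")

# Search the whole document again, this time for p elements at any depth.
plist_all = xmldoc.getElementsByTagName("p")

print(len(reflist), len(plist_in_bit), len(plist_all))
print([p.firstChild.data for p in plist_in_bit])
```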
       
       
       
       

      9.6. Accessing element attributes

      XML elements can have one or more attributes, and it is incredibly simple to access them once you have parsed an XML document. -

      For this section, you'll be using the binary.xml grammar file that you saw in the previous section. +

      For this section, you'll be using the binary.xml grammar file that you saw in the previous section.

      @@ -8231,13 +8071,13 @@ in Python. If your XML documents are all 7-bit ASCI - - @@ -8249,14 +8089,14 @@ in Python. If your XML documents are all 7-bit ASCI - -
      Note
      1 Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it. +Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it.
      2 Treating the NamedNodeMap as a dictionary, you can get a list of the names of the attributes of this element by using attributes.keys(). This element has only one attribute, 'id'. +Treating the NamedNodeMap as a dictionary, you can get a list of the names of the attributes of this element by using attributes.keys(). This element has only one attribute, 'id'.
      4 Again treating the NamedNodeMap as a dictionary, you can get a list of the values of the attributes by using attributes.values(). The values are themselves objects, of type Attr. You'll see how to get useful information out of this object in the next example. +Again treating the NamedNodeMap as a dictionary, you can get a list of the values of the attributes by using attributes.values(). The values are themselves objects, of type Attr. You'll see how to get useful information out of this object in the next example.
      5 Still treating the NamedNodeMap as a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been - paying extra-close attention will already know how the NamedNodeMap class accomplishes this neat trick: by defining a __getitem__ special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.) +Still treating the NamedNodeMap as a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been + paying extra-close attention will already know how the NamedNodeMap class accomplishes this neat trick: by defining a __getitem__ special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.)
      @@ -8272,7 +8112,7 @@ u'bit'

      1 -The Attr object completely represents a single XML attribute of a single XML element. The name of the attribute (the same name as you used to find this object in the bitref.attributes NamedNodeMap pseudo-dictionary) is stored in a.name. +The Attr object completely represents a single XML attribute of a single XML element. The name of the attribute (the same name as you used to find this object in the bitref.attributes NamedNodeMap pseudo-dictionary) is stored in a.name. @@ -8287,18 +8127,18 @@ u'bit'
      Note -Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the Attr objects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes +Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the Attr objects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes by name, like the keys of a dictionary.
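
      To make the "access by name" advice concrete, here is a minimal sketch. It assumes Python 3 and a made-up one-element document modeled on the binary.xml grammar; the variable names are illustrative and not taken from the book's examples.

```python
from xml.dom import minidom

# Hypothetical document, modeled on the binary.xml grammar file.
xmldoc = minidom.parseString(
    "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
)
bitref = xmldoc.getElementsByTagName("ref")[0]

# The NamedNodeMap acts like a dictionary: list its keys...
attr_names = list(bitref.attributes.keys())

# ...and look up an individual Attr object by name, never by position.
a = bitref.attributes["id"]
attr_name, attr_value = a.name, a.value
```

      Because the lookup goes through the attribute name, the code keeps working no matter what order the attributes appear in the source document.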

      9.7. Segue

      OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on - other aspects that make the program more flexible: using streams for input processing, using getattr for method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code. + other aspects that make the program more flexible: using streams for input processing, using getattr for method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code.

      Before moving on to the next chapter, you should be comfortable doing all of these things:

    29. One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object.

      Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close it when they're done. But they don't. Instead, they take a file-like object. -

      In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When -called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left +

      In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When +called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left off and returns the next chunk of data.

      This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply -calls the object's read method, the function can handle any kind of input source without specific code to handle each kind. -

      In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object. +calls the object's read method, the function can handle any kind of input source without specific code to handle each kind. +

      In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object.

      Example 10.1. Parsing XML from a file

       >>> from xml.dom import minidom
       >>> fsock = open('binary.xml')    1
      @@ -8348,24 +8188,24 @@ calls the object's read method, the function can h
       
       2 
       
      -You pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk.
      +You pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk.
       
       
       
       3 
       
      -Be sure to call the close method of the file object after you're done with it.  minidom.parse will not do this for you.
      +Be sure to call the close method of the file object after you're done with it.  minidom.parse will not do this for you.
       
       
       
       4 
       
      -Calling the toxml() method on the returned XML document prints out the entire thing.
      +Calling the toxml() method on the returned XML document prints out the entire thing.
       
       
       
      -

      Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're -just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet. +

      Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're +just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet.

      Example 10.2. Parsing XML from a URL

       >>> import urllib
       >>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') 1
      @@ -8398,19 +8238,19 @@ just going to be parsing a local file, you can pass the filename and 
       1 
       
      -As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object.  Most importantly, this object has a read method which returns the HTML source of the web page.
      +As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object.  Most importantly, this object has a read method which returns the HTML source of the web page.
       
       
       
       2 
       
      -Now you pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns.  The fact that this XML data is now coming straight from a web page is completely irrelevant.  minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.
      +Now you pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns.  The fact that this XML data is now coming straight from a web page is completely irrelevant.  minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.
       
       
       
       3 
       
      -As soon as you're done with it, be sure to close the file-like object that urlopen gives you.
      +As soon as you're done with it, be sure to close the file-like object that urlopen gives you.
       
       
       
      @@ -8430,14 +8270,14 @@ just going to be parsing a local file, you can pass the filename and 
       1 
       
      -minidom has a method, parseString, which takes an entire XML document as a string and parses it.  You can use this instead of minidom.parse if you know you already have your entire XML document in a string.
      +minidom has a method, parseString, which takes an entire XML document as a string and parses it.  You can use this instead of minidom.parse if you know you already have your entire XML document in a string.
       
       
       
      -

      OK, so you can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a -file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying. -

      If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO. -

      Example 10.4. Introducing StringIO

      +

      OK, so you can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a +file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying. +

      If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO. +

      Example 10.4. Introducing StringIO

       >>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
       >>> import StringIO
       >>> ssock = StringIO.StringIO(contents)   1
      @@ -8457,38 +8297,38 @@ file, a URL, or a string, you'll need special logic to check
       
       1 
       
      -The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object.  The StringIO class takes the string as a parameter when creating an instance.
      +The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object.  The StringIO class takes the string as a parameter when creating an instance.
       
       
       
       2 
       
      -Now you have a file-like object, and you can do all sorts of file-like things with it.  Like read, which returns the original string.
      +Now you have a file-like object, and you can do all sorts of file-like things with it.  Like read, which returns the original string.
       
       
       
       3 
       
      -Calling read again returns an empty string.  This is how real file objects work too; once you read the entire file, you can't read any
      -            more without explicitly seeking to the beginning of the file.  The StringIO object works the same way.
      +Calling read again returns an empty string.  This is how real file objects work too; once you read the entire file, you can't read any
      +            more without explicitly seeking to the beginning of the file.  The StringIO object works the same way.
       
       
       
       4 
       
      -You can explicitly seek to the beginning of the string, just like seeking through a file, by using the seek method of the StringIO object.
      +You can explicitly seek to the beginning of the string, just like seeking through a file, by using the seek method of the StringIO object.
       
       
       
       5 
       
      -You can also read the string in chunks, by passing a size parameter to the read method.
      +You can also read the string in chunks, by passing a size parameter to the read method.
       
       
       
       6 
       
      -At any time, read will return the rest of the string that you haven't read yet.  All of this is exactly how file objects work; hence the term
      +At any time, read will return the rest of the string that you haven't read yet.  All of this is exactly how file objects work; hence the term
       file-like object.
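
      The same read, seek, and chunked-read behavior can be sketched in Python 3, where (as an assumption of this sketch, since the example above uses the Python 2 StringIO module) the class lives in the io module instead. The variable names are illustrative.

```python
import io

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
ssock = io.StringIO(contents)      # wrap the string in a file-like object

everything = ssock.read()          # reads the whole string
nothing_left = ssock.read()        # position is at the end, so this is ''

ssock.seek(0)                      # rewind to the beginning
first_chunk = ssock.read(15)       # read only the first 15 characters
rest = ssock.read()                # read whatever has not been read yet
ssock.close()
```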
       
       
      @@ -8505,12 +8345,12 @@ file, a URL, or a string, you'll need special logic to check
       
       1 
       
      -Now you can pass the file-like object (really a StringIO) to minidom.parse, which will call the object's read method and happily parse away, never knowing that its input came from a hard-coded string.
      +Now you can pass the file-like object (really a StringIO) to minidom.parse, which will call the object's read method and happily parse away, never knowing that its input came from a hard-coded string.
       
       
       
      -

      So now you know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use urlopen to get a file-like object; for a local file, you use open; and for a string, you use StringIO. Now let's take it one step further and generalize these differences as well. -

      Example 10.6. openAnything

      +

      So now you know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use urlopen to get a file-like object; for a local file, you use open; and for a string, you use StringIO. Now let's take it one step further and generalize these differences as well. +

      Example 10.6. openAnything

       def openAnything(source):1
           # try to open with urllib (if source is http, ftp, or file URL)
           import urllib       
      @@ -8532,31 +8372,31 @@ def openAnything(source):1 
       
      -The openAnything function takes a single parameter, source, and returns a file-like object.  source is a string of some sort; it can either be a URL (like 'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like 'binary.xml'), or a string that contains actual XML data to be parsed.
      +The openAnything function takes a single parameter, source, and returns a file-like object.  source is a string of some sort; it can either be a URL (like 'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like 'binary.xml'), or a string that contains actual XML data to be parsed.
       
       
       
       2 
       
      -First, you see if source is a URL.  You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL.  This is actually elegant in the sense that, if urllib ever supports new types of URLs in the future, you will also support them without recoding.  If urllib is able to open source, then the return kicks you out of the function immediately and the following try statements never execute.
      +First, you see if source is a URL.  You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL.  This is actually elegant in the sense that, if urllib ever supports new types of URLs in the future, you will also support them without recoding.  If urllib is able to open source, then the return kicks you out of the function immediately and the following try statements never execute.
       
       
       
       3 
       
      -On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it.  Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between different platforms anyway, so you'd probably
      +On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it.  Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between different platforms anyway, so you'd probably
                   get them wrong anyway).  Instead, you just blindly open the file, and silently trap any errors.
       
       
       
       4 
       
      -By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIO to create a file-like object out of it and return that.  (In fact, since you're using the str function, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its __str__ special method.)
      +By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIO to create a file-like object out of it and return that.  (In fact, since you're using the str function, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its __str__ special method.)
       
       
       
      -

      Now you can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it. -

      Example 10.7. Using openAnything in kgp.py

      +

      Now you can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it. +

      Example 10.7. Using openAnything in kgp.py

       class KantGenerator:
           def _load(self, source):
               sock = toolbox.openAnything(source)
      @@ -8565,7 +8405,7 @@ class KantGenerator:
               return xmldoc
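
      As a rough Python 3 sketch of the same idea: the helper below tries a local file first and falls back to treating the source as literal XML. It deliberately leaves out the URL step from Example 10.6 so that it never touches the network, and the names open_anything and load are hypothetical stand-ins, not the functions from toolbox.py or kgp.py.

```python
import io
from xml.dom import minidom

def open_anything(source):
    """Return a file-like object for a local path or, failing that, a literal string."""
    # Assumption of this sketch: the URL step is omitted on purpose.
    try:
        return open(source, "rb")
    except (OSError, ValueError):
        pass  # not an openable path, so fall through to the string case
    return io.StringIO(source)

def load(source):
    """Parse XML from whatever open_anything could make of source."""
    sock = open_anything(source)
    try:
        return minidom.parse(sock)
    finally:
        sock.close()

xmldoc = load("<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>")
root_tag = xmldoc.documentElement.tagName
```

      The parsing code never needs to know which branch produced the file-like object; it only relies on the read method.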

      10.2. Standard input, output, and error

      UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. -

      Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program +

      Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.)

      Example 10.8. Introducing stdout and stderr

      @@ -8585,13 +8425,13 @@ Dive inDive inDive in
      1 -As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in range function to build simple counter loops that repeat something a set number of times. +As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in range function to build simple counter loops that repeat something a set number of times. 2 -stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what the print function really does; it adds a carriage return to the end of the string you're printing, and calls sys.stdout.write. +stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what the print statement really does; it adds a newline to the end of the string you're printing, and calls sys.stdout.write. @@ -8601,7 +8441,7 @@ Dive inDive inDive in
      -

      stdout and stderr are both file-like objects, like the ones you discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output. +

      stdout and stderr are both file-like objects, like the ones you discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output.

      Example 10.9. Redirecting output

       [you@localhost kgp]$ python stdout.py
       Dive in
      @@ -8660,7 +8500,7 @@ fsock.close()        7Close the log file.
       
       
      -

      Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout. +

      Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout.
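
      The redirection technique can be sketched in Python 3 as follows. This is a minimal sketch under stated assumptions: an in-memory io.StringIO buffer stands in for the log file, the message text is made up, and print is the Python 3 function rather than the Python 2 statement. Swapping sys.stderr works the same way.

```python
import io
import sys

buffer = io.StringIO()        # stands in for an open log file
saved_stdout = sys.stdout     # remember the real stdout before replacing it
sys.stdout = buffer           # everything printed now goes to the buffer
try:
    print("This message will be logged instead of displayed")
finally:
    sys.stdout = saved_stdout  # always put the original stdout back

logged = buffer.getvalue()
```

      The try...finally block is a design choice of this sketch: it guarantees that stdout is restored even if the code in between raises an exception.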

      Example 10.10. Redirecting error information

       [you@localhost kgp]$ python stderr.py
       [you@localhost kgp]$ cat error.log
      @@ -8690,7 +8530,7 @@ raise Exception, 'this error will be logged' 3 
       
      -Raise an exception.  Note from the screen output that this does not print anything on screen.  All the normal traceback information has been written to error.log.
      +Raise an exception.  Note from the screen output that this does not print anything on screen.  All the normal traceback information has been written to error.log.
       
       
       
      @@ -8713,14 +8553,14 @@ entering function
       
       1 
       
      -This shorthand syntax of the print statement can be used to write to any open file, or file-like object.  In this case, you can redirect a single print statement to stderr without affecting subsequent print statements.
      +This shorthand syntax of the print statement can be used to write to any open file, or file-like object.  In this case, you can redirect a single print statement to stderr without affecting subsequent print statements.
       
       
       
       

      Standard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some previous program. This will likely not make much sense to classic Mac OS users, or even Windows users unless you were ever fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any -special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting +special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting one program's output to the next program's input.

      Example 10.12. Chaining commands

       [you@localhost kgp]$ python kgp.py -g binary.xml         1
      @@ -8744,36 +8584,36 @@ one program's output to the next program's input.
       
       1 
       
      -As you saw in Section 9.1, “Diving in”, this will print a string of eight random bits, 0 or 1.
      +As you saw in Section 9.1, “Diving in”, this will print a string of eight random bits, 0 or 1.
       
       
       
       2 
       
      -This simply prints out the entire contents of binary.xml.  (Windows users should use type instead of cat.)
      +This simply prints out the entire contents of binary.xml.  (Windows users should use type instead of cat.)
       
       
       
       3 
       
      -This prints the contents of binary.xml, but the “|” character, called the “pipe” character, means that the contents will not be printed to the screen.  Instead, they will become the standard input of the
      +This prints the contents of binary.xml, but the “|” character, called the “pipe” character, means that the contents will not be printed to the screen.  Instead, they will become the standard input of the
                   next command, which in this case calls your Python script.
       
       
       
       4 
       
      -Instead of specifying a module (like binary.xml), you specify “-”, which causes your script to load the grammar from standard input instead of from a specific file on disk.  (More on how
      +Instead of specifying a grammar file (like binary.xml), you specify “-”, which causes your script to load the grammar from standard input instead of from a specific file on disk.  (More on how
                   this happens in the next example.)  So the effect is the same as the first syntax, where you specified the grammar filename
                   directly, but think of the expansion possibilities here.  Instead of simply doing cat binary.xml, you could run a script that dynamically generates the grammar, then you can pipe it into your script.  It could come from
                   anywhere: a database, or some grammar-generating meta-script, or whatever.  The point is that you don't need to change your
      -kgp.py script at all to incorporate any of this functionality.  All you need to do is be able to take grammar files from standard
      +kgp.py script at all to incorporate any of this functionality.  All you need to do is be able to take grammar files from standard
                   input, and you can separate all the other logic into another program.
       
       
       
       

      So how does the script “know” to read from standard input when the grammar file is “-”? It's not magic; it's just code. -

      Example 10.13. Reading from standard input in kgp.py

      +

      Example 10.13. Reading from standard input in kgp.py

       def openAnything(source):
           if source == "-":    1
               import sys
      @@ -8788,17 +8628,17 @@ def openAnything(source):
       
       1 
       
      -This is the openAnything function from toolbox.py, which you previously examined in Section 10.1, “Abstracting input sources”.  All you've done is add three lines of code at the beginning of the function to check if the source is “-”; if so, you return sys.stdin.  Really, that's it!  Remember, stdin is a file-like object with a read method, so the rest of the code (in kgp.py, where you call openAnything) doesn't change a bit.
      +This is the openAnything function from toolbox.py, which you previously examined in Section 10.1, “Abstracting input sources”.  All you've done is add three lines of code at the beginning of the function to check if the source is “-”; if so, you return sys.stdin.  Really, that's it!  Remember, stdin is a file-like object with a read method, so the rest of the code (in kgp.py, where you call openAnything) doesn't change a bit.
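
      To see the "-" convention at work without a real shell pipe, the following Python 3 sketch temporarily replaces sys.stdin with an in-memory buffer. The helper open_source is hypothetical and much simpler than openAnything: it only distinguishes "-" from a local path.

```python
import io
import sys
from xml.dom import minidom

def open_source(source):
    """Hypothetical helper: '-' means standard input, anything else is a local path."""
    if source == "-":
        return sys.stdin
    return open(source)

# Simulate what the shell does when another program's output is piped in.
saved_stdin = sys.stdin
sys.stdin = io.StringIO("<grammar><ref id='bit'/></grammar>")
try:
    xmldoc = minidom.parse(open_source("-"))
finally:
    sys.stdin = saved_stdin   # restore the real standard input

piped_root = xmldoc.documentElement.tagName
```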
       
       
       
       

      10.3. Caching node lookups

      -

      kgp.py employs several tricks which may or may not be useful to you in your XML processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes. -

      A grammar file defines a series of ref elements. Each ref contains one or more p elements, which can contain a lot of different things, including xrefs. Whenever you encounter an xref, you look for a corresponding ref element with the same id attribute, and choose one of the ref element's children and parse it. (You'll see how this random choice is made in the next section.) -

      This is how you build up the grammar: define ref elements for the smallest pieces, then define ref elements which "include" the first ref elements by using xref, and so forth. Then you parse the "largest" reference and follow each xref, and eventually output real text. The text you output depends on the (random) decisions you make each time you fill in an -xref, so the output is different each time. -

      This is all very flexible, but there is one downside: performance. When you find an xref and need to find the corresponding ref element, you have a problem. The xref has an id attribute, and you want to find the ref element that has that same id attribute, but there is no easy way to do that. The slow way to do it would be to get the entire list of ref elements each time, then manually loop through and look at each id attribute. The fast way is to do that once and build a cache, in the form of a dictionary. -

      Example 10.14. loadGrammar

      +

      kgp.py employs several tricks which may or may not be useful to you in your XML processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes. +

      A grammar file defines a series of ref elements. Each ref contains one or more p elements, which can contain a lot of different things, including xrefs. Whenever you encounter an xref, you look for a corresponding ref element with the same id attribute, and choose one of the ref element's children and parse it. (You'll see how this random choice is made in the next section.) +

      This is how you build up the grammar: define ref elements for the smallest pieces, then define ref elements which "include" the first ref elements by using xref, and so forth. Then you parse the "largest" reference and follow each xref, and eventually output real text. The text you output depends on the (random) decisions you make each time you fill in an +xref, so the output is different each time. +

      This is all very flexible, but there is one downside: performance. When you find an xref and need to find the corresponding ref element, you have a problem. The xref has an id attribute, and you want to find the ref element that has that same id attribute, but there is no easy way to do that. The slow way to do it would be to get the entire list of ref elements each time, then manually loop through and look at each id attribute. The fast way is to do that once and build a cache, in the form of a dictionary. +

      Example 10.14. loadGrammar

           def loadGrammar(self, grammar):       
               self.grammar = self._load(grammar)
               self.refs = {}   1
      @@ -8808,36 +8648,36 @@ def openAnything(source):
       
       1 
       
      -Start by creating an empty dictionary, self.refs.
      +Start by creating an empty dictionary, self.refs.
       
       
       
       2 
       
      -As you saw in Section 9.5, “Searching for elements”, getElementsByTagName returns a list of all the elements of a particular name.  You easily can get a list of all the ref elements, then simply loop through that list.
      +As you saw in Section 9.5, “Searching for elements”, getElementsByTagName returns a list of all the elements of a particular name.  You can easily get a list of all the ref elements, then simply loop through that list.
       
       
       
       3 
       
      -As you saw in Section 9.6, “Accessing element attributes”, you can access individual attributes of an element by name, using standard dictionary syntax.  So the keys of the self.refs dictionary will be the values of the id attribute of each ref element.
      +As you saw in Section 9.6, “Accessing element attributes”, you can access individual attributes of an element by name, using standard dictionary syntax.  So the keys of the self.refs dictionary will be the values of the id attribute of each ref element.
       
       
       
       4 
       
      -The values of the self.refs dictionary will be the ref elements themselves.  As you saw in Section 9.3, “Parsing XML”, each element, each node, each comment, each piece of text in a parsed XML document is an object.
      +The values of the self.refs dictionary will be the ref elements themselves.  As you saw in Section 9.3, “Parsing XML”, each element, each node, each comment, each piece of text in a parsed XML document is an object.
       
       
       
      -

      Once you build this cache, whenever you come across an xref and need to find the ref element with the same id attribute, you can simply look it up in self.refs. -

      Example 10.15. Using the ref element cache

      +

      Once you build this cache, whenever you come across an xref and need to find the ref element with the same id attribute, you can simply look it up in self.refs. +

      Example 10.15. Using the ref element cache

           def do_xref(self, node):
               id = node.attributes["id"].value
      -        self.parse(self.randomChildElement(self.refs[id]))

      You'll explore the randomChildElement function in the next section. + self.parse(self.randomChildElement(self.refs[id]))

      You'll explore the randomChildElement function in the next section.
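
      The caching idea is easy to try on its own. What follows is a minimal sketch, not the book's loadGrammar: the tiny GRAMMAR string and the build_ref_cache helper are made up for illustration, and only the minidom calls already discussed above are used.

```python
from xml.dom import minidom

# Hypothetical miniature grammar, invented for this sketch.
GRAMMAR = """<grammar>
<ref id="bit"><p>0</p><p>1</p></ref>
<ref id="byte"><p><xref id="bit"/><xref id="bit"/></p></ref>
</grammar>"""

def build_ref_cache(xmldoc):
    """Map the id attribute of each ref element to the element itself."""
    refs = {}
    for ref in xmldoc.getElementsByTagName("ref"):
        refs[ref.attributes["id"].value] = ref
    return refs

xmldoc = minidom.parseString(GRAMMAR)
refs = build_ref_cache(xmldoc)          # walk the ref elements once

# Resolving an xref is now a single dictionary lookup, not a loop.
xref = xmldoc.getElementsByTagName("xref")[0]
target = refs[xref.attributes["id"].value]
```

      The loop over the ref elements runs once, when the cache is built; every later xref costs one dictionary lookup.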

      10.4. Finding direct children of a node

      -

      Another useful technique when parsing XML documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a ref element can have several p elements, each of which can contain many things, including other p elements. You want to find just the p elements that are children of the ref, not p elements that are children of other p elements. -

      You might think you could simply use getElementsByTagName for this, but you can't. getElementsByTagName searches recursively and returns a single list for all the elements it finds. Since p elements can contain other p elements, you can't use getElementsByTagName, because it would return nested p elements that you don't want. To find only direct child elements, you'll need to do it yourself. +

      Another useful technique when parsing XML documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a ref element can have several p elements, each of which can contain many things, including other p elements. You want to find just the p elements that are children of the ref, not p elements that are children of other p elements. +

      You might think you could simply use getElementsByTagName for this, but you can't. getElementsByTagName searches recursively and returns a single list for all the elements it finds. Since p elements can contain other p elements, you can't use getElementsByTagName, because it would return nested p elements that you don't want. To find only direct child elements, you'll need to do it yourself.

      Example 10.16. Finding direct child elements

           def randomChildElement(self, node):
               choices = [e for e in node.childNodes
      @@ -8848,32 +8688,32 @@ def openAnything(source):
       
       1 
       
      -As you saw in Example 9.9, “Getting child nodes”, the childNodes attribute returns a list of all the child nodes of an element.
      +As you saw in Example 9.9, “Getting child nodes”, the childNodes attribute returns a list of all the child nodes of an element.
       
       
       
       2 
       
      -However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by childNodes contains all different types of nodes, including text nodes.  That's not what you're looking for here.  You only want the
      +However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by childNodes contains all different types of nodes, including text nodes.  That's not what you're looking for here.  You only want the
                   children that are elements.
       
       
       
       3 
       
      -Each node has a nodeType attribute, which can be ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, or any number of other values.  The complete list of possible values is in the __init__.py file in the xml.dom package.  (See Section 9.2, “Packages” for more on packages.)  But you're just interested in nodes that are elements, so you can filter the list to only include
      -            those nodes whose nodeType is ELEMENT_NODE.
      +Each node has a nodeType attribute, which can be ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, or any number of other values.  The complete list of possible values is in the __init__.py file in the xml.dom package.  (See Section 9.2, “Packages” for more on packages.)  But you're just interested in nodes that are elements, so you can filter the list to only include
      +            those nodes whose nodeType is ELEMENT_NODE.
       
       
       
       4 
       
      -Once you have a list of actual elements, choosing a random one is easy.  Python comes with a module called random which includes several useful functions.  The random.choice function takes a list of any number of items and returns a random item.  For example, if the ref element contains several p elements, then choices would be a list of p elements, and chosen would end up being assigned exactly one of them, selected at random.
      +Once you have a list of actual elements, choosing a random one is easy.  Python comes with a module called random which includes several useful functions.  The random.choice function takes a list of any number of items and returns a random item.  For example, if the ref element contains several p elements, then choices would be a list of p elements, and chosen would end up being assigned exactly one of them, selected at random.
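
      To see the difference between the recursive search and the direct-children filter, here is a small self-contained sketch. The SNIPPET string and the direct_child_elements helper are hypothetical stand-ins; the filter itself is the same childNodes and nodeType test described above.

```python
import random
from xml.dom import minidom

# Hypothetical fragment: one p contains a nested p, plus whitespace and a comment.
SNIPPET = "<ref id='x'><p>a<p>nested</p></p> <p>b</p><!-- note --></ref>"

def direct_child_elements(node):
    """Return only the children of node that are elements (no text, no comments)."""
    return [e for e in node.childNodes if e.nodeType == e.ELEMENT_NODE]

ref = minidom.parseString(SNIPPET).documentElement

all_p = ref.getElementsByTagName("p")    # recursive: includes the nested p
direct_p = direct_child_elements(ref)    # only the two top-level p elements

chosen = random.choice(direct_p)         # exactly one of them, picked at random
```

      Here getElementsByTagName finds three p elements, while the filtered list holds only the two that are direct children of the ref.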
       
       
       
       

      10.5. Creating separate handlers by node type

      -

      The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a Document object. The Document then contains one or more Element objects (for actual XML tags), each of which may contain other Element objects, Text objects (for bits of text), or Comment objects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type. +

      The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a Document object. The Document then contains one or more Element objects (for actual XML tags), each of which may contain other Element objects, Text objects (for bits of text), or Comment objects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type.

      Example 10.17. Class names of parsed XML objects

       >>> from xml.dom import minidom
       >>> xmldoc = minidom.parse('kant.xml') 1
      @@ -8887,13 +8727,13 @@ def openAnything(source):
       
       1 
       
      -Assume for a moment that kant.xml is in the current directory.
      +Assume for a moment that kant.xml is in the current directory.
       
       
       
       2 
       
      -As you saw in Section 9.2, “Packages”, the object returned by parsing an XML document is a Document object, as defined in minidom.py in the xml.dom package.  As you saw in Section 5.4, “Instantiating Classes”, __class__ is a built-in attribute of every Python object.
      +As you saw in Section 9.2, “Packages”, the object returned by parsing an XML document is a Document object, as defined in minidom.py in the xml.dom package.  As you saw in Section 5.4, “Instantiating Classes”, __class__ is a built-in attribute of every Python object.
       
       
       
      @@ -8904,8 +8744,8 @@ def openAnything(source):
       
       
       
      -

      Fine, so now you can get the class name of any particular XML node (since each XML node is represented as a Python object). How can you use this to your advantage to separate the logic of parsing each node type? The answer is getattr, which you first saw in Section 4.4, “Getting Object References With getattr”. -

      Example 10.18. parse, a generic XML node dispatcher

      +

      Fine, so now you can get the class name of any particular XML node (since each XML node is represented as a Python object). How can you use this to your advantage to separate the logic of parsing each node type? The answer is getattr, which you first saw in Section 4.4, “Getting Object References With getattr”. +

      Example 10.18. parse, a generic XML node dispatcher

           def parse(self, node):          
               parseMethod = getattr(self, "parse_%s" % node.__class__.__name__) 1 2
               parseMethod(node) 3
      @@ -8913,13 +8753,13 @@ def openAnything(source): 1 -First off, notice that you're constructing a larger string based on the class name of the node you were passed (in the node argument). So if you're passed a Document node, you're constructing the string 'parse_Document', and so forth. +First off, notice that you're constructing a larger string based on the class name of the node you were passed (in the node argument). So if you're passed a Document node, you're constructing the string 'parse_Document', and so forth. 2 -Now you can treat that string as a function name, and get a reference to the function itself using getattr +Now you can treat that string as a function name, and get a reference to the function itself using getattr 3 @@ -8929,7 +8769,7 @@ def openAnything(source): -

      Example 10.19. Functions called by the parse dispatcher

      +

      Example 10.19. Functions called by the parse dispatcher

           def parse_Document(self, node): 1
               self.parse(node.documentElement)
       
      @@ -8952,34 +8792,34 @@ def openAnything(source):
       
       1 
       
      -parse_Document is only ever called once, since there is only one Document node in an XML document, and only one Document object in the parsed XML representation.  It simply turns around and parses the root element of the grammar file.
      +parse_Document is only ever called once, since there is only one Document node in an XML document, and only one Document object in the parsed XML representation.  It simply turns around and parses the root element of the grammar file.
       
       
       
       2 
       
      -parse_Text is called on nodes that represent bits of text.  The function itself does some special processing to handle automatic capitalization
      +parse_Text is called on nodes that represent bits of text.  The function itself does some special processing to handle automatic capitalization
                   of the first word of a sentence, but otherwise simply appends the represented text to a list.
       
       
       
       3 
       
      -parse_Comment is just a pass, since you don't care about embedded comments in the grammar files.  Note, however, that you still need to define the function
      -            and explicitly make it do nothing.  If the function did not exist, the generic parse function would fail as soon as it stumbled on a comment, because it would try to find the non-existent parse_Comment function.  Defining a separate function for every node type, even ones you don't use, allows the generic parse function to stay simple and dumb.
      +parse_Comment is just a pass, since you don't care about embedded comments in the grammar files.  Note, however, that you still need to define the function
      +            and explicitly make it do nothing.  If the function did not exist, the generic parse function would fail as soon as it stumbled on a comment, because it would try to find the non-existent parse_Comment function.  Defining a separate function for every node type, even ones you don't use, allows the generic parse function to stay simple and dumb.
       
       
       
       4 
       
      -The parse_Element method is actually itself a dispatcher, based on the name of the element's tag.  The basic idea is the same: take what distinguishes
      +The parse_Element method is actually itself a dispatcher, based on the name of the element's tag.  The basic idea is the same: take what distinguishes
                   elements from each other (their tag names) and dispatch to a separate function for each of them.  You construct a string like
      -'do_xref' (for an <xref> tag), find a function of that name, and call it.  And so forth for each of the other tag names that might be found in the
      -            course of parsing a grammar file (<p> tags, <choice> tags).
      +'do_xref' (for an <xref> tag), find a function of that name, and call it.  And so forth for each of the other tag names that might be found in the
      +            course of parsing a grammar file (<p> tags, <choice> tags).
       
       
       
      -

      In this example, the dispatch functions parse and parse_Element simply find other methods in the same class. If your processing is very complex (or you have many different tag names), +

      In this example, the dispatch functions parse and parse_Element simply find other methods in the same class. If your processing is very complex (or you have many different tag names), you could break up your code into separate modules, and use dynamic importing to import each module and call whatever functions you needed. Dynamic importing will be discussed in Chapter 16, Functional Programming.
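
      The two levels of dispatch can be seen working together in a stripped-down sketch. The TextCollector class below is hypothetical (it is not KantGenerator, and it does none of the capitalization handling); it only assumes that minidom names its node classes Document, Element, Text, and Comment, as shown earlier.

```python
from xml.dom import minidom

class TextCollector:
    """Hypothetical example class: gathers all text from p elements."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # First level: dispatch on the class name of the node.
        parse_method = getattr(self, "parse_%s" % node.__class__.__name__)
        parse_method(node)

    def parse_Document(self, node):
        self.parse(node.documentElement)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Comment(self, node):
        pass  # must exist, or parse would fail on the first comment

    def parse_Element(self, node):
        # Second level: dispatch on the tag name of the element.
        handler = getattr(self, "do_%s" % node.tagName)
        handler(node)

    def do_p(self, node):
        for child in node.childNodes:
            self.parse(child)

collector = TextCollector()
collector.parse(minidom.parseString("<p>Hello, <!-- skipped --><p>world</p></p>"))
result = "".join(collector.pieces)   # "Hello, world"
```

      Adding support for a new tag means adding one do_ method; neither dispatcher changes. A tag with no matching method raises AttributeError from getattr, which is the failure the text warns about for a missing parse_Comment.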

      10.6. Handling command-line arguments

      @@ -8987,7 +8827,7 @@ you needed. Dynamic importing will be discussed in XML-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.

      It's difficult to talk about command-line processing without understanding how command-line arguments are exposed to your Python program, so let's write a simple program to see them. -

      Example 10.20. Introducing sys.argv

      +

      Example 10.20. Introducing sys.argv

      If you have not already done so, you can download this and other examples used in this book.

       #argecho.py
       import sys
      @@ -8998,11 +8838,11 @@ for arg in sys.argv: 
       1 
       
      -Each command-line argument passed to the program will be in sys.argv, which is just a list.  Here you are printing each argument on a separate line.
      +Each command-line argument passed to the program will be in sys.argv, which is just a list.  Here you are printing each argument on a separate line.
       
       
       
      -

      Example 10.21. The contents of sys.argv

      +

      Example 10.21. The contents of sys.argv

       [you@localhost py]$ python argecho.py             1
       argecho.py
       [you@localhost py]$ python argecho.py abc def     2
      @@ -9020,34 +8860,34 @@ kant.xml
      1 -The first thing to know about sys.argv is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later, +The first thing to know about sys.argv is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later, in Chapter 16, Functional Programming. Don't worry about it for now. 2 -Command-line arguments are separated by spaces, and each shows up as a separate element in the sys.argv list. +Command-line arguments are separated by spaces, and each shows up as a separate element in the sys.argv list. 3 -Command-line flags, like --help, also show up as their own element in the sys.argv list. +Command-line flags, like --help, also show up as their own element in the sys.argv list. 4 To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag - (-m) which takes an argument (kant.xml). Both the flag itself and the flag's argument are simply sequential elements in the sys.argv list. No attempt is made to associate one with the other; all you get is a list. + (-m) which takes an argument (kant.xml). Both the flag itself and the flag's argument are simply sequential elements in the sys.argv list. No attempt is made to associate one with the other; all you get is a list.

      So as you can see, you certainly have all the information passed on the command line, but then again, it doesn't look like it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags, -you can simply use sys.argv[1] to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need the getopt module. -

      Example 10.22. Introducing getopt

      +you can simply use sys.argv[1] to access the argument.  There's no shame in this; I do it all the time.  For more complex programs, you need the getopt module.
      +

      Example 10.22. Introducing getopt

       def main(argv):       
           grammar = "kant.xml"                 1
           try:              
      @@ -9064,14 +8904,14 @@ if __name__ == "__main__":
       
       1 
       
      -First off, look at the bottom of the example and notice that you're calling the main function with sys.argv[1:].  Remember, sys.argv[0] is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off
      +First off, look at the bottom of the example and notice that you're calling the main function with sys.argv[1:].  Remember, sys.argv[0] is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off
                   and pass the rest of the list.
       
       
       
       2 
       
      -This is where all the interesting processing happens.  The getopt function of the getopt module takes three parameters: the argument list (which you got from sys.argv[1:]), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer
      +This is where all the interesting processing happens.  The getopt function of the getopt module takes three parameters: the argument list (which you got from sys.argv[1:]), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer
                   command-line flags that are equivalent to the single-character versions.  This is quite confusing at first glance, and is
                   explained in more detail below.
       
      @@ -9079,19 +8919,19 @@ if __name__ == "__main__":
       
       3 
       
      -If anything goes wrong trying to parse these command-line flags, getopt will raise an exception, which you catch.  You told getopt all the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.
      +If anything goes wrong trying to parse these command-line flags, getopt will raise an exception, which you catch.  You told getopt all the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.
       
       
       
       4 
       
       As is standard practice in the UNIX world, when the script is passed flags it doesn't understand, you print out a summary of proper usage and exit gracefully.
      -             Note that I haven't shown the usage function here.  You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.
      +             Note that I haven't shown the usage function here.  You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.
       
       
       
      -

      So what are all those parameters you pass to the getopt function? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element, -the script name, which you already chopped off before calling the main function). The second is the list of short command-line flags that the script accepts. +

      So what are all those parameters you pass to the getopt function? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element, +the script name, which you already chopped off before calling the main function). The second is the list of short command-line flags that the script accepts.

      "hg:d"

      @@ -9104,8 +8944,8 @@ the script name, which you already chopped off before calling the getopt this by putting a colon after the g in that second parameter to the getopt function. -

      To further complicate things, the script accepts either short flags (like -h) or long flags (like --help), and you want them to do the same thing. This is what the third parameter to getopt is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter. +and you don't know which yet (you'll figure it out later), but you know it has to be something. So you tell getopt this by putting a colon after the g in that second parameter to the getopt function. +

      To further complicate things, the script accepts either short flags (like -h) or long flags (like --help), and you want them to do the same thing. This is what the third parameter to getopt is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.

      ["help", "grammar="]

      @@ -9117,7 +8957,7 @@ and you don't know which yet (you'll figure it out later), but you know it has t

      Three things of note here:

        -
      1. All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling getopt. They are understood. +
      2. All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling getopt. They are understood.
      3. The --grammar flag must always be followed by an additional argument, just like the -g flag. This is notated by an equals sign, "grammar=". @@ -9126,7 +8966,7 @@ and you don't know which yet (you'll figure it out later), but you know it has t

      Confused yet? Let's look at the actual code and see if it makes sense in context. -

      Example 10.23. Handling command-line arguments in kgp.py

      +

      Example 10.23. Handling command-line arguments in kgp.py

       def main(argv):        1
           grammar = "kant.xml"                
           try:              
      @@ -9152,21 +8992,21 @@ def main(argv):        
       1 
       
      -The grammar variable will keep track of the grammar file you're using.  You initialize it here in case it's not specified on the command
      +The grammar variable will keep track of the grammar file you're using.  You initialize it here in case it's not specified on the command
                   line (using either the -g or the --grammar flag).
       
       
       
       2 
       
      -The opts variable that you get back from getopt contains a list of tuples: flag and argument.  If the flag doesn't take an argument, then arg will simply be an empty string.  This makes it easier to loop through the flags.
      +The opts variable that you get back from getopt contains a list of tuples: flag and argument.  If the flag doesn't take an argument, then arg will simply be an empty string.  This makes it easier to loop through the flags.
       
       
       
       3 
       
      -getopt validates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags.
      -             If you specify the -h flag, opt will contain "-h"; if you specify the --help flag, opt will contain "--help".  So you need to check for both.
      +getopt validates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags.
      +             If you specify the -h flag, opt will contain "-h"; if you specify the --help flag, opt will contain "--help".  So you need to check for both.
       
       
       
      @@ -9180,21 +9020,21 @@ def main(argv):        
       5 
       
      -If you find a grammar file, either with a -g flag or a --grammar flag, you save the argument that followed it (stored in arg) into the grammar variable, overwriting the default that you initialized at the top of the main function.
      +If you find a grammar file, either with a -g flag or a --grammar flag, you save the argument that followed it (stored in arg) into the grammar variable, overwriting the default that you initialized at the top of the main function.
       
       
       
       6 
       
       That's it.  You've looped through and dealt with all the command-line flags.  That means that anything left must be command-line
      -            arguments.  These come back from the getopt function in the args variable.  In this case, you're treating them as source material for the parser.  If there are no command-line arguments
      -            specified, args will be an empty list, and source will end up as the empty string.
      +            arguments.  These come back from the getopt function in the args variable.  In this case, you're treating them as source material for the parser.  If there are no command-line arguments
      +            specified, args will be an empty list, and source will end up as the empty string.
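
      If you want to see exactly what getopt hands back, you can call it directly on a list you build yourself. This is a minimal sketch rather than the real main function from kgp.py: the parse_args helper and the file names other.xml and input.txt are made up for illustration, and nothing here prints usage or exits.

```python
import getopt

def parse_args(argv):
    """Hypothetical helper: return (opts, grammar, show_help, source)."""
    grammar = "kant.xml"      # default, as in the text
    show_help = False
    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    for opt, arg in opts:
        if opt in ("-h", "--help"):        # short and long forms arrive as-is
            show_help = True
        elif opt in ("-g", "--grammar"):
            grammar = arg                  # the argument that followed the flag
    return opts, grammar, show_help, "".join(args)

# Simulates: python kgp.py -h --grammar other.xml input.txt
opts, grammar, show_help, source = parse_args(
    ["-h", "--grammar", "other.xml", "input.txt"])

# A flag that was never declared raises GetoptError.
try:
    getopt.getopt(["--bogus"], "hg:d", ["help", "grammar="])
    unknown_flag_rejected = False
except getopt.GetoptError:
    unknown_flag_rejected = True
```

      In this run opts comes back as [('-h', ''), ('--grammar', 'other.xml')], and the leftover argument input.txt ends up in source.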
       
       
       
       

      10.7. Putting it all together

      You've covered a lot of ground. Let's step back and see how all the pieces fit together. -

      To start with, this is a script that takes its arguments on the command line, using the getopt module. +

      To start with, this is a script that takes its arguments on the command line, using the getopt module.

       def main(argv):       
       ...
      @@ -9203,9 +9043,9 @@ def main(argv):
           except getopt.GetoptError:          
       ...
           for opt, arg in opts:               
      -...

      You create a new instance of the KantGenerator class, and pass it the grammar file and source that may or may not have been specified on the command line. +...

      You create a new instance of the KantGenerator class, and pass it the grammar file and source that may or may not have been specified on the command line.

      -    k = KantGenerator(grammar, source)

      The KantGenerator instance automatically loads the grammar, which is an XML file. You use your custom openAnything function to open the file (which could be stored in a local file or a remote web server), then use the built-in minidom parsing functions to parse the XML into a tree of Python objects. + k = KantGenerator(grammar, source)

      The KantGenerator instance automatically loads the grammar, which is an XML file. You use your custom openAnything function to open the file (which could be stored in a local file or a remote web server), then use the built-in minidom parsing functions to parse the XML into a tree of Python objects.

           def _load(self, source):
               sock = toolbox.openAnything(source)
      @@ -9227,15 +9067,15 @@ the "top-level" reference (that isn't referenced by anything else) and use that
       
           def parse_Element(self, node): 
               handlerMethod = getattr(self, "do_%s" % node.tagName)
      -        handlerMethod(node)

      You bounce through the grammar, parsing all the children of each p element, + handlerMethod(node)

      You bounce through the grammar, parsing all the children of each p element,

           def do_p(self, node):
       ...
               if doit:
      -            for child in node.childNodes: self.parse(child)

      replacing choice elements with a random child, + for child in node.childNodes: self.parse(child)

      replacing choice elements with a random child,

           def do_choice(self, node):
      -        self.parse(self.randomChildElement(node))

      and replacing xref elements with a random child of the corresponding ref element, which you previously cached. + self.parse(self.randomChildElement(node))

      and replacing xref elements with a random child of the corresponding ref element, which you previously cached.

           def do_xref(self, node):
               id = node.attributes["id"].value
      @@ -9250,16 +9090,16 @@ def main(argv):
       ...
           k = KantGenerator(grammar, source)
           print k.output()

      10.8. Summary

      -

      Python comes with powerful libraries for parsing and manipulating XML documents. The minidom takes an XML file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments, +

      Python comes with powerful libraries for parsing and manipulating XML documents. The minidom takes an XML file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments, error handling, even the ability to take input from the piped result of a previous program.

      Before moving on to the next chapter, you should be comfortable doing all of these things:

      Chapter 11. HTTP Web Services

      @@ -9291,8 +9131,8 @@ you to quickly navigate through it. semantics to the underlying HTTP semantics. (They tunnel everything over HTTP POST.) But this chapter will concentrate on using HTTP GET to get data from a remote server, and you'll explore several HTTP features you can use to get the maximum benefit out of pure HTTP web services. -

      Here is a more advanced version of the openanything module that you saw in the previous chapter: -

      Example 11.1. openanything.py

      +

      Here is a more advanced version of the openanything module that you saw in the previous chapter: +

      Example 11.1. openanything.py

      If you have not already done so, you can download this and other examples used in this book.

       import urllib2, urlparse, gzip
       from StringIO import StringIO
      @@ -9414,7 +9254,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
       
       1 
       
      -Downloading anything over HTTP is incredibly easy in Python; in fact, it's a one-liner.  The urllib module has a handy urlopen function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page.  It just can't get much easier.
      +Downloading anything over HTTP is incredibly easy in Python; in fact, it's a one-liner.  The urllib module has a handy urlopen function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page.  It just can't get much easier.
       
       
       
      @@ -9439,7 +9279,7 @@ rude.
          code 200 means “everything's normal, here's the page you asked for”.  Status code 404 means “page not found”.  (You've probably seen 404 errors while browsing the web.)
       

      HTTP has two different ways of signifying that a resource has moved. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location: header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location: header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you're supposed to use the new address from then on. -

      urllib.urlopen will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn't tell you when +

      urllib.urlopen will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn't tell you when it does so. You'll end up getting data you asked for, but you'll never know that the underlying library “helpfully” followed a redirect for you. So you'll continue pounding away at the old address, and each time you'll get redirected to the new address. That's two round trips instead of one: not very efficient! Later in this chapter, you'll see how to work around this so you can deal with permanent redirects properly and efficiently. @@ -9472,7 +9312,7 @@ rude. it in compressed format. You include the Accept-encoding: gzip header in your request, and if the server supports compression, it will send you back gzip-compressed data and mark it with a Content-encoding: gzip header.

      Python's URL library has no built-in support for gzip compression per se, but you can add arbitrary headers to the request. And -Python comes with a separate gzip module, which has functions you can use to decompress the data yourself. +Python comes with a separate gzip module, which has functions you can use to decompress the data yourself.

      Note that our little one-line script to download a syndicated feed did not support any of these HTTP features. Let's see how you can improve it.

      11.4. Debugging HTTP web services

      First, let's turn on the debugging features of Python's HTTP library and see what's being sent over the wire. This will be useful throughout the chapter, as you add more and @@ -9502,7 +9342,7 @@ header: Connection: close 1 -urllib relies on another standard Python library, httplib. Normally you don't need to import httplib directly (urllib does that automatically), but you will here so you can set the debugging flag on the HTTPConnection class that urllib uses internally to connect to the HTTP server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there's no particular standard for naming them or turning them on; you need to read +urllib relies on another standard Python library, httplib. Normally you don't need to import httplib directly (urllib does that automatically), but you will here so you can set the debugging flag on the HTTPConnection class that urllib uses internally to connect to the HTTP server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there's no particular standard for naming them or turning them on; you need to read the documentation of each library to see if such a feature is available. @@ -9516,7 +9356,7 @@ header: Connection: close 3 -When you request the Atom feed, urllib sends three lines to the server. The first line specifies the HTTP verb you're using, and the path of the resource (minus +When you request the Atom feed, urllib sends three lines to the server. The first line specifies the HTTP verb you're using, and the path of the resource (minus the domain name). All the requests in this chapter will use GET, but in the next chapter on SOAP, you'll see that it uses POST for everything. The basic syntax is the same, regardless of the verb. @@ -9530,13 +9370,13 @@ header: Connection: close 5 -The third line is the User-Agent header. What you see here is the generic User-Agent that the urllib library adds by default. 
In the next section, you'll see how to customize this to be more specific. +The third line is the User-Agent header. What you see here is the generic User-Agent that the urllib library adds by default. In the next section, you'll see how to customize this to be more specific. 6 -The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the feeddata variable). The status code here is 200, meaning “everything's normal, here's the data you requested”. The server also tells you the date it responded to your request, some information about the server itself, and the content +The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the feeddata variable). The status code here is 200, meaning “everything's normal, here's the data you requested”. The server also tells you the date it responded to your request, some information about the server itself, and the content type of the data it's giving you. Depending on your application, this might be useful, or not. It's certainly reassuring that you thought you were asking for an Atom feed, and lo and behold, you're getting an Atom feed (application/atom+xml, which is the registered content type for Atom feeds). @@ -9557,8 +9397,8 @@ header: Connection: close
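If you are following along in Python 3, the debugging switch described above can be sketched roughly as follows. This assumes the Python 3 module names (httplib became http.client, and urllib2 became urllib.request); the helper name make_debug_opener is made up for illustration.

```python
import urllib.request


def make_debug_opener():
    """Return an opener whose HTTP handler prints request and response headers."""
    # Passing debuglevel to the handler traces only the connections made
    # through this opener, instead of flipping a global switch.
    return urllib.request.build_opener(urllib.request.HTTPHandler(debuglevel=1))


# The global switch used in this chapter also exists in Python 3:
#     import http.client
#     http.client.HTTPConnection.debuglevel = 1

opener = make_debug_opener()
```

Nothing is sent over the wire until you call opener.open(); at that point the request line and headers are printed much like the output shown above.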

      11.5. Setting the User-Agent

      -

      The first step to improving your HTTP web services client is to identify yourself properly with a User-Agent. To do that, you need to move beyond the basic urllib and dive into urllib2. -

      Example 11.4. Introducing urllib2

      +

      The first step to improving your HTTP web services client is to identify yourself properly with a User-Agent. To do that, you need to move beyond the basic urllib and dive into urllib2. +

      Example 11.4. Introducing urllib2

       >>> import httplib
       >>> httplib.HTTPConnection.debuglevel = 1           1
       >>> import urllib2
      @@ -9591,7 +9431,7 @@ header: Connection: close
       
       2 
       
      -Fetching an HTTP resource with urllib2 is a three-step process, for good reasons that will become clear shortly.  The first step is to create a Request object, which takes the URL of the resource you'll eventually get around to retrieving.  Note that this step doesn't actually
      +Fetching an HTTP resource with urllib2 is a three-step process, for good reasons that will become clear shortly.  The first step is to create a Request object, which takes the URL of the resource you'll eventually get around to retrieving.  Note that this step doesn't actually
                   retrieve anything yet.
       
       
      @@ -9606,12 +9446,12 @@ header: Connection: close
       
       4 
       
      -The final step is to tell the opener to open the URL, using the Request object you created.  As you can see from all the debugging information that gets printed, this step actually retrieves the
      -            resource and stores the returned data in feeddata.
      +The final step is to tell the opener to open the URL, using the Request object you created.  As you can see from all the debugging information that gets printed, this step actually retrieves the
      +            resource and stores the returned data in feeddata.
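The three steps just described can be sketched in Python 3 like this. It is only a sketch: it assumes urllib.request as the Python 3 home of urllib2, and the fetch function is a hypothetical wrapper so that nothing touches the network until you call it.

```python
import urllib.request

FEED_URL = "http://diveintomark.org/xml/atom.xml"  # feed address used in this chapter

# Step 1: describe the request. Nothing is retrieved yet.
request = urllib.request.Request(FEED_URL)

# Step 2: build an opener. Custom handlers would be passed in here.
opener = urllib.request.build_opener()


def fetch(req):
    """Step 3: open the URL and return the response body as bytes."""
    with opener.open(req) as stream:
        return stream.read()
```

Keeping the opener separate from the request is what later lets you plug in your own handlers without changing how the request is described.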
       
       
       
      -

      Example 11.5. Adding headers with the Request

      +

      Example 11.5. Adding headers with the Request

       >>> request            1
       <urllib2.Request instance at 0x00250AA8>
       >>> request.get_full_url()
      @@ -9639,13 +9479,13 @@ header: Connection: close
       
       1 
       
      -You're continuing from the previous example; you've already created a Request object with the URL you want to access.
      +You're continuing from the previous example; you've already created a Request object with the URL you want to access.
       
       
       
       2 
       
      -Using the add_header method on the Request object, you can add arbitrary HTTP headers to the request.  The first argument is the header, the second is the value you're
      +Using the add_header method on the Request object, you can add arbitrary HTTP headers to the request.  The first argument is the header, the second is the value you're
                   providing for that header.  Convention dictates that a User-Agent should be in this specific format: an application name, followed by a slash, followed by a version number.  The rest is free-form,
                   and you'll see a lot of variations in the wild, but somewhere it should include a URL of your application.  The User-Agent is usually logged by the server along with other details of your request, and including a URL of your application allows
                   server administrators looking through their access logs to contact you if something is wrong.
      @@ -9654,13 +9494,13 @@ header: Connection: close
       
       3 
       
      -The opener object you created before can be reused too, and it will retrieve the same feed again, but with your custom User-Agent header.
      +The opener object you created before can be reused too, and it will retrieve the same feed again, but with your custom User-Agent header.
       
       
       
       4 
       
      -And here's you sending your custom User-Agent, in place of the generic one that Python sends by default.  If you look closely, you'll notice that you defined a User-Agent header, but you actually sent a User-agent header.  See the difference?  urllib2 changed the case so that only the first letter was capitalized.  It doesn't really matter; HTTP specifies that header field
      +And here's you sending your custom User-Agent, in place of the generic one that Python sends by default.  If you look closely, you'll notice that you defined a User-Agent header, but you actually sent a User-agent header.  See the difference?  urllib2 changed the case so that only the first letter was capitalized.  It doesn't really matter; HTTP specifies that header field
                   names are completely case-insensitive.
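You can watch the header name being normalized without sending anything. The sketch below uses Python 3's urllib.request (an assumption; the text above uses urllib2), and the User-Agent value is a made-up example that follows the name/version convention.

```python
import urllib.request

request = urllib.request.Request("http://diveintomark.org/xml/atom.xml")

# Hypothetical application name, version, and URL.
request.add_header("User-Agent", "OpenAnything/1.0 +http://diveintopython.org/")

# The header name is stored with only its first letter capitalized.
stored_names = [name for name, value in request.header_items()]
```

Here stored_names holds the single name User-agent, which matches what the debugging output shows going over the wire.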
       
       
      @@ -9709,7 +9549,7 @@ urllib2.HTTPError: HTTP Error 304: Not Modified
       1 
       
       Remember all those HTTP headers you saw printed out when you turned on debugging?  This is how you can get access to them
      -            programmatically: firstdatastream.headers is an object that acts like a dictionary and allows you to get any of the individual headers returned from the HTTP server.
      +            programmatically: firstdatastream.headers is an object that acts like a dictionary and allows you to get any of the individual headers returned from the HTTP server.
       
       
       
      @@ -9721,16 +9561,16 @@ urllib2.HTTPError: HTTP Error 304: Not Modified
       
       3 
       
      -Sure enough, the data hasn't changed.  You can see from the traceback that urllib2 throws a special exception, HTTPError, in response to the 304 status code.  This is a little unusual, and not entirely helpful.  After all, it's not an error; you specifically asked the
      +Sure enough, the data hasn't changed.  You can see from the traceback that urllib2 throws a special exception, HTTPError, in response to the 304 status code.  This is a little unusual, and not entirely helpful.  After all, it's not an error; you specifically asked the
                   server not to send you any data if it hadn't changed, and the data didn't change, so the server told you it wasn't sending
                   you any data.  That's not an error; that's exactly what you were hoping for.
       
       
       
      -

      urllib2 also raises an HTTPError exception for conditions that you would think of as errors, such as 404 (page not found). In fact, it will raise HTTPError for any status code other than 200 (OK), 301 (permanent redirect), or 302 (temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without +

      urllib2 also raises an HTTPError exception for conditions that you would think of as errors, such as 404 (page not found). In fact, it will raise HTTPError for any status code other than 200 (OK), 301 (permanent redirect), or 302 (temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without throwing an exception. To do that, you'll need to define a custom URL handler.

      Example 11.7. Defining URL handlers

      -

      This custom URL handler is part of openanything.py.

      +

      This custom URL handler is part of openanything.py.

       class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    1
           def http_error_default(self, req, fp, code, msg, headers): 2
               result = urllib2.HTTPError(         
      @@ -9742,14 +9582,14 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    1 
       
      -urllib2 is designed around URL handlers.  Each handler is just a class that can define any number of methods.  When something happens
      -            -- like an HTTP error, or even a 304 code -- urllib2 introspects into the list of defined handlers for a method that can handle it.  You used a similar introspection in Chapter 9, XML Processing to define handlers for different node types, but urllib2 is more flexible, and introspects over as many handlers as are defined for the current request.
      +urllib2 is designed around URL handlers.  Each handler is just a class that can define any number of methods.  When something happens
      +            -- like an HTTP error, or even a 304 code -- urllib2 introspects into the list of defined handlers for a method that can handle it.  You used a similar introspection in Chapter 9, XML Processing to define handlers for different node types, but urllib2 is more flexible, and introspects over as many handlers as are defined for the current request.
       
       
       
       2 
       
      -urllib2 searches through the defined handlers and calls the http_error_default method when it encounters a 304 status code from the server. By defining a custom error handler, you can prevent urllib2 from raising an exception.  Instead, you create the HTTPError object, but return it instead of raising it.
      +urllib2 searches through the defined handlers and calls the http_error_default method when it encounters a 304 status code from the server. By defining a custom error handler, you can prevent urllib2 from raising an exception.  Instead, you create the HTTPError object, but return it instead of raising it.
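Here is a rough Python 3 counterpart of that handler, under the assumption that urllib2's classes map onto urllib.request and urllib.error. Instead of adding a status attribute as the original does, this sketch reads the status from the code attribute that HTTPError already carries. The direct call at the bottom uses made-up arguments so the handler can be tried offline.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


class DefaultErrorHandler(urllib.request.HTTPDefaultErrorHandler):
    """Return the HTTPError object instead of raising it."""

    def http_error_default(self, req, fp, code, msg, headers):
        # Build the same object the default handler would raise,
        # but hand it back so the caller can inspect result.code.
        return urllib.error.HTTPError(req.full_url, code, msg, headers, fp)


# Offline check: call the handler directly with a made-up 304 response.
fake_request = urllib.request.Request("http://diveintomark.org/xml/atom.xml")
result = DefaultErrorHandler().http_error_default(
    fake_request, io.BytesIO(b""), 304, "Not Modified", Message()
)
```

In real use you would pass DefaultErrorHandler() to build_opener and let the opener call the method for you.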
       
       
       
      @@ -9776,20 +9616,20 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    1 
       
      -You're continuing the previous example, so the Request object is already set up, and you've already added the If-Modified-Since header.
      +You're continuing the previous example, so the Request object is already set up, and you've already added the If-Modified-Since header.
       
       
       
       2 
       
      -This is the key: now that you've defined your custom URL handler, you need to tell urllib2 to use it.  Remember how I said that urllib2 broke up the process of accessing an HTTP resource into three steps, and for good reason?  This is why building the URL opener
      -            is its own step, because you can build it with your own custom URL handlers that override urllib2's default behavior.
      +This is the key: now that you've defined your custom URL handler, you need to tell urllib2 to use it.  Remember how I said that urllib2 broke up the process of accessing an HTTP resource into three steps, and for good reason?  This is why building the URL opener
      +            is its own step, because you can build it with your own custom URL handlers that override urllib2's default behavior.
       
       
       
       3 
       
      -Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use seconddatastream.headers.dict to acess them), also contains the HTTP status code.  In this case, as you expected, the status is 304, meaning this data hasn't changed since the last time you asked for it.
      +Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use seconddatastream.headers.dict to access them), also contains the HTTP status code.  In this case, as you expected, the status is 304, meaning this data hasn't changed since the last time you asked for it.
       
       
       
      @@ -9831,7 +9671,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    1 
       
      -Using the firstdatastream.headers pseudo-dictionary, you can get the ETag returned from the server.  (What happens if the server didn't send back an ETag?  Then this line would return None.)
      +Using the firstdatastream.headers pseudo-dictionary, you can get the ETag returned from the server.  (What happens if the server didn't send back an ETag?  Then this line would return None.)
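The lookup-then-resend pattern can be sketched as a small helper. This is an illustration in Python 3 (urllib.request standing in for urllib2); conditional_request is a hypothetical name, and the ETag value is invented.

```python
import urllib.request
from email.message import Message


def conditional_request(url, headers):
    """Build a Request that asks for url only if it changed since headers were received."""
    request = urllib.request.Request(url)
    etag = headers.get("ETag")  # None when the server sent no ETag
    last_modified = headers.get("Last-Modified")
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


# Made-up response headers standing in for firstdatastream.headers.
previous = Message()
previous["ETag"] = '"e8284-68e0-4de30f80"'
followup = conditional_request("http://diveintomark.org/xml/atom.xml", previous)
```

Because a missing header comes back as None, the helper simply leaves out whichever validator the server never provided.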
       
       
       
      @@ -9949,13 +9789,13 @@ AttributeError: addinfourl instance has no attribute 'status'
       
       5 
       
      -urllib2 notices the redirect status code and automatically tries to retrieve the data at the new location specified in the Location: header.
      +urllib2 notices the redirect status code and automatically tries to retrieve the data at the new location specified in the Location: header.
       
       
       
       6 
       
      -The object you get back from the opener contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent
      +The object you get back from the opener contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent
                   address).  But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary
                   or permanent.  And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at
                   the old location.  But if it was a permanent redirect (as this was), you should ask for the data at the new location from
      @@ -9963,9 +9803,9 @@ AttributeError: addinfourl instance has no attribute 'status'
       
       
       
      -

      This is suboptimal, but easy to fix. urllib2 doesn't behave exactly as you want it to when it encounters a 301 or 302, so let's override its behavior. How? With a custom URL handler, just like you did to handle 304 codes. +

      This is suboptimal, but easy to fix. urllib2 doesn't behave exactly as you want it to when it encounters a 301 or 302, so let's override its behavior. How? With a custom URL handler, just like you did to handle 304 codes.

      Example 11.11. Defining the redirect handler

      -

      This class is defined in openanything.py.

      +

      This class is defined in openanything.py.

       class SmartRedirectHandler(urllib2.HTTPRedirectHandler):     1
           def http_error_301(self, req, fp, code, msg, headers):  
               result = urllib2.HTTPRedirectHandler.http_error_301( 2
      @@ -9983,13 +9823,13 @@ class SmartRedirectHandler(urllib2.HTTPRedirectHandler):     1 
       
      -Redirect behavior is defined in urllib2 in a class called HTTPRedirectHandler.  You don't want to completely override the behavior, you just want to extend it a little, so you'll subclass HTTPRedirectHandler so you can call the ancestor class to do all the hard work.
      +Redirect behavior is defined in urllib2 in a class called HTTPRedirectHandler.  You don't want to completely override the behavior, you just want to extend it a little, so you'll subclass HTTPRedirectHandler so you can call the ancestor class to do all the hard work.
       
       
       
       2 
       
      -When it encounters a 301 status code from the server, urllib2 will search through its handlers and call the http_error_301 method.   The first thing ours does is just call the http_error_301 method in the ancestor, which handles the grunt work of looking for the Location: header and following the redirect to the new address.
      +When it encounters a 301 status code from the server, urllib2 will search through its handlers and call the http_error_301 method.   The first thing ours does is just call the http_error_301 method in the ancestor, which handles the grunt work of looking for the Location: header and following the redirect to the new address.
       
       
       
      @@ -10056,7 +9896,7 @@ header: Content-Type: application/atom+xml
       
       2 
       
      -You sent off a request, and you got a 301 status code in response.  At this point, the http_error_301 method gets called.  You call the ancestor method, which follows the redirect and sends a request at the new location (http://diveintomark.org/xml/atom.xml).
      +You sent off a request, and you got a 301 status code in response.  At this point, the http_error_301 method gets called.  You call the ancestor method, which follows the redirect and sends a request at the new location (http://diveintomark.org/xml/atom.xml).
       
       
       
      @@ -10064,7 +9904,7 @@ header: Content-Type: application/atom+xml
       
       This is the payoff: now, not only do you have access to the new URL, but you have access to the redirect status code, so you
                   can tell that this was a permanent redirect.  The next time you request this data, you should request it from the new location
      -            (http://diveintomark.org/xml/atom.xml, as specified in f.url).  If you had stored the location in a configuration file or a database, you need to update that so you don't keep pounding
      +            (http://diveintomark.org/xml/atom.xml, as specified in f.url).  If you had stored the location in a configuration file or a database, you need to update that so you don't keep pounding
                   the server with requests at the old address.  It's time to update your address book.
       
       
      @@ -10123,13 +9963,13 @@ http://diveintomark.org/xml/atom.xml
       
       3 
       
      -urllib2 calls your http_error_302 method, which calls the ancestor method of the same name in urllib2.HTTPRedirectHandler, which follows the redirect to the new location.  Then your http_error_302 method stores the status code (302) so the calling application can get it later.
      +urllib2 calls your http_error_302 method, which calls the ancestor method of the same name in urllib2.HTTPRedirectHandler, which follows the redirect to the new location.  Then your http_error_302 method stores the status code (302) so the calling application can get it later.
       
       
       
       4 
       
      -And here you are, having successfully followed the redirect to http://diveintomark.org/xml/atom.xml.  f.status tells you that this was a temporary redirect, which means that you should continue to request data from the original address
      +And here you are, having successfully followed the redirect to http://diveintomark.org/xml/atom.xml.  f.status tells you that this was a temporary redirect, which means that you should continue to request data from the original address
                   (http://diveintomark.org/redir/example302.xml).  Maybe it will redirect next time too, but maybe not.  Maybe it will redirect to a different address.  It's not for you
                   to say.  The server said this redirect was only temporary, so you should respect that.  And now you're exposing enough information
                   that the calling application can respect that.
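Once the status code is exposed, the calling application's decision is a one-liner. The function below is a hypothetical helper, not part of openanything.py; it only encodes the rule spelled out above.

```python
def address_for_next_time(requested_url, final_url, status):
    """Pick the URL to store for future requests, given the redirect status seen."""
    if status == 301:
        # Permanent redirect: switch to the new address from now on.
        return final_url
    # 302 (temporary) or no redirect at all: keep asking at the original address.
    return requested_url
```

For the temporary redirect in this example, the helper hands back http://diveintomark.org/redir/example302.xml, so the next request still goes to the original address.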
      @@ -10171,7 +10011,7 @@ header: Content-Type: application/atom+xml
       
       1 
       
      -This is the key: once you've created your Request object, add an Accept-encoding header to tell the server you can accept gzip-encoded data.  gzip is the name of the compression algorithm you're using.  In theory there could be other compression algorithms, but gzip is the compression algorithm used by 99% of web servers.
      +This is the key: once you've created your Request object, add an Accept-encoding header to tell the server you can accept gzip-encoded data.  gzip is the name of the compression algorithm you're using.  Other compression algorithms exist, but gzip is the one that the vast majority of web servers support.
       
       
       
      @@ -10218,28 +10058,28 @@ header: Content-Type: application/atom+xml
       
       1 
       
      -Continuing from the previous example, f is the file-like object returned from the URL opener.  Using its read() method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first
      +Continuing from the previous example, f is the file-like object returned from the URL opener.  Using its read() method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first
                   step towards getting the data you really want.
       
       
       
       2 
       
      -OK, this step is a little bit of messy workaround.  Python has a gzip module, which reads (and actually writes) gzip-compressed files on disk.  But you don't have a file on disk, you have a gzip-compressed
      +OK, this step is a bit of a messy workaround.  Python has a gzip module, which reads (and also writes) gzip-compressed files on disk.  But you don't have a file on disk; you have a gzip-compressed
                   buffer in memory, and you don't want to write out a temporary file just so you can uncompress it.  So what you're going to
      -            do is create a file-like object out of the in-memory data (compresseddata), using the StringIO module.  You first saw the StringIO module in the previous chapter, but now you've found another use for it.
      +            do is create a file-like object out of the in-memory data (compresseddata), using the StringIO module.  You first saw the StringIO module in the previous chapter, but now you've found another use for it.
       
       
       
       3 
       
      -Now you can create an instance of GzipFile, and tell it that its “file” is the file-like object compressedstream.
      +Now you can create an instance of GzipFile, and tell it that its “file” is the file-like object compressedstream.
       
       
       
       4 
       
      -This is the line that does all the actual work: “reading” from GzipFile will decompress the data.  Strange?  Yes, but it makes sense in a twisted kind of way.  gzipper is a file-like object which represents a gzip-compressed file.  That “file” is not a real file on disk, though; gzipper is really just “reading” from the file-like object you created with StringIO to wrap the compressed data, which is only in memory in the variable compresseddata.  And where did that compressed data come from?  You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built with urllib2.build_opener.  And amazingly, this all just works.  Every step in the chain has no idea that the previous step is faking it.
      +This is the line that does all the actual work: “reading” from GzipFile will decompress the data.  Strange?  Yes, but it makes sense in a twisted kind of way.  gzipper is a file-like object which represents a gzip-compressed file.  That “file” is not a real file on disk, though; gzipper is really just “reading” from the file-like object you created with StringIO to wrap the compressed data, which is only in memory in the variable compresseddata.  And where did that compressed data come from?  You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built with urllib2.build_opener.  And amazingly, this all just works.  Every step in the chain has no idea that the previous step is faking it.
       
       
       
      @@ -10248,7 +10088,7 @@ header: Content-Type: application/atom+xml
       Look ma, real data. (15955 bytes of it, in fact.)
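The same chain of file-like objects can be tried without a server. This sketch assumes Python 3, where the in-memory wrapper for bytes is io.BytesIO rather than StringIO, and it compresses a made-up payload locally to stand in for the data read from the network.

```python
import gzip
import io

# Stand-in for the bytes read from the server; real code would use f.read().
original = b"<feed>example</feed>" * 100
compresseddata = gzip.compress(original)

compressedstream = io.BytesIO(compresseddata)      # wrap the bytes in a file-like object
gzipper = gzip.GzipFile(fileobj=compressedstream)  # treat that object as a gzip "file"
data = gzipper.read()                              # reading decompresses
```

The variable names mirror the ones in the example above, so you can line up each step with its callout.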
       
       
      -

      “But wait!” I hear you cry. “This could be even easier!” I know what you're thinking. You're thinking that opener.open returns a file-like object, so why not cut out the StringIO middleman and just pass f directly to GzipFile? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work. +

      “But wait!” I hear you cry. “This could be even easier!” I know what you're thinking. You're thinking that opener.open returns a file-like object, so why not cut out the StringIO middleman and just pass f directly to GzipFile? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work.

      Example 11.16. Decompressing the data directly from the server

       >>> f = opener.open(request)1
       >>> f.headers.get('Content-Encoding')         2
      @@ -10266,7 +10106,7 @@ AttributeError: addinfourl instance has no attribute 'tell'
       
       1 
       
      -Continuing from the previous example, you already have a Request object set up with an Accept-encoding: gzip header.
      +Continuing from the previous example, you already have a Request object set up with an Accept-encoding: gzip header.
       
       
       
      @@ -10279,17 +10119,17 @@ AttributeError: addinfourl instance has no attribute 'tell'
       
       3 
       
      -Since opener.open returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data,
      -            why not simply pass that file-like object directly to GzipFile?  As you “read” from the GzipFile instance, it will “read” compressed data from the remote HTTP server and decompress it on the fly.  It's a good idea, but unfortunately it doesn't
      -            work.  Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file.  This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and
      -            forth through the data stream.  So the inelegant hack of using StringIO is the best solution: download the compressed data, create a file-like object out of it with StringIO, and then decompress the data from that.
      +Since opener.open returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data,
      +            why not simply pass that file-like object directly to GzipFile?  As you “read” from the GzipFile instance, it will “read” compressed data from the remote HTTP server and decompress it on the fly.  It's a good idea, but unfortunately it doesn't
      +            work.  Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file.  This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and
      +            forth through the data stream.  So the inelegant hack of using StringIO is the best solution: download the compressed data, create a file-like object out of it with StringIO, and then decompress the data from that.
       
       
       
       

      11.9. Putting it all together

      You've seen all the pieces for building an intelligent HTTP web services client. Now let's see how they all fit together. -

      Example 11.17. The openanything function

      -

      This function is defined in openanything.py.

      +

      Example 11.17. The openanything function

      +

      This function is defined in openanything.py.

       def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
           # non-HTTP code omitted for brevity
           if urlparse.urlparse(source)[0] == 'http':   1
      @@ -10308,14 +10148,14 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
       
       1 
       
      -urlparse is a handy utility module for, you guessed it, parsing URLs.  It's primary function, also called urlparse, takes a URL and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier).
      -             Of these, the only thing you care about is the scheme, to make sure that you're dealing with an HTTP URL (which urllib2 can handle).
      +urlparse is a handy utility module for, you guessed it, parsing URLs.  Its primary function, also called urlparse, takes a URL and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier).
      +             Of these, the only thing you care about is the scheme, to make sure that you're dealing with an HTTP URL (which urllib2 can handle).
       
       
       
       2 
       
      -You identify yourself to the HTTP server with the User-Agent passed in by the calling function.  If no User-Agent was specified, you use a default one defined earlier in the openanything.py module.  You never use the default one defined by urllib2.
      +You identify yourself to the HTTP server with the User-Agent passed in by the calling function.  If no User-Agent was specified, you use a default one defined earlier in the openanything.py module.  You never use the default one defined by urllib2.
       
       
       
      @@ -10338,7 +10178,7 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
       
       6 
       
      -Build a URL opener that uses both of the custom URL handlers: SmartRedirectHandler for handling 301 and 302 redirects, and DefaultErrorHandler for handling 304, 404, and other error conditions gracefully.
      +Build a URL opener that uses both of the custom URL handlers: SmartRedirectHandler for handling 301 and 302 redirects, and DefaultErrorHandler for handling 304, 404, and other error conditions gracefully.
       
       
       
      @@ -10347,8 +10187,8 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
       That's it!  Open the URL and return a file-like object to the caller.
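The scheme check from callout 1 can be tried on its own. The sketch below assumes Python 3, where the function lives in urllib.parse rather than in the urlparse module used in the example; the helper name is_http_url is made up for illustration.

```python
from urllib.parse import urlparse  # Python 3 location of the urlparse function


def is_http_url(source):
    """Return True when the scheme (element 0 of the parsed tuple) is 'http'."""
    return urlparse(source)[0] == 'http'


parts = urlparse('http://diveintopython3.org/redir/example301.xml')
print(parts[0])  # scheme
print(parts[1])  # domain
print(parts[2])  # path

print(is_http_url('http://diveintopython3.org/redir/example301.xml'))
print(is_http_url('/tmp/local-file.xml'))
```

Nothing is fetched here; urlparse only splits the string, so a local path simply comes back with an empty scheme.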
       
       
      -

      Example 11.18. The fetch function

      -

      This function is defined in openanything.py.

      +

      Example 11.18. The fetch function

      +

      This function is defined in openanything.py.

       def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):  
           '''Fetch data and metadata from a URL, file, stream, or string'''
           result = {}
      @@ -10374,7 +10214,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
       
       1 
       
      -First, you call the openAnything function with a URL, ETag hash, Last-Modified date, and User-Agent.
      +First, you call the openAnything function with a URL, ETag hash, Last-Modified date, and User-Agent.
       
       
       
      @@ -10385,7 +10225,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
       
       3 
       
      -Save the ETag hash returned from the server, so the calling application can pass it back to you next time, and you can pass it on to openAnything, which can stick it in the If-None-Match header and send it to the remote server.
      +Save the ETag hash returned from the server, so the calling application can pass it back to you next time, and you can pass it on to openAnything, which can stick it in the If-None-Match header and send it to the remote server.
       
       
       
      @@ -10411,7 +10251,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
       If one of the custom URL handlers captured a status code, then save that too.
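The round trip described in these callouts (save the ETag, hand it back on the next call, send it as If-None-Match) comes down to building a small set of request headers. The following is a rough sketch under stated assumptions: build_request_headers is a hypothetical helper, not a function from openanything.py, and it only assembles a dictionary without sending anything.

```python
def build_request_headers(agent, etag=None, last_modified=None):
    """Assemble headers for a conditional HTTP GET (illustrative only)."""
    headers = {'User-Agent': agent}
    if etag:
        # The ETag saved from the previous response goes back to the server
        # in the If-None-Match header.
        headers['If-None-Match'] = etag
    if last_modified:
        # The saved Last-Modified date goes back as If-Modified-Since.
        headers['If-Modified-Since'] = last_modified
    return headers


# First request: nothing cached yet, so only the User-Agent is sent.
print(build_request_headers('MyHTTPWebServicesApp/1.0'))

# Later request: pass back the values saved from the earlier response.
print(build_request_headers('MyHTTPWebServicesApp/1.0',
                            etag='"e842a-3e53-55d97640"',
                            last_modified='Thu, 15 Apr 2004 19:45:21 GMT'))
```

The ETag and date strings are made-up sample values; in practice they come from the dictionary that the previous call to fetch returned.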
       
       
      -

      Example 11.19. Using openanything.py

      +

      Example 11.19. Using openanything.py

       >>> import openanything
       >>> useragent = 'MyHTTPWebServicesApp/1.0'
       >>> url = 'http://diveintopython3.org/redir/example301.xml'
      @@ -10446,7 +10286,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
       2 
       
       What you get back is a dictionary of several useful headers, the HTTP status code, and the actual data returned from the server.
      -             openanything handles the gzip compression internally; you don't care about that at this level.
      +             openanything handles the gzip compression internally; you don't care about that at this level.
       
       
       
      @@ -10470,7 +10310,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
       
       
       

      11.10. Summary

      -

      The openanything.py and its functions should now make perfect sense. +

      The openanything.py module and its functions should now make perfect sense.

      There are 5 important features of HTTP web services that every client should support:

        @@ -10496,7 +10336,7 @@ you can “call” a function through a SOAP library,

        12.1. Diving In

        You use Google, right? It's a popular search engine. Have you ever wished you could programmatically access Google search results? Now you can. Here is a program to search Google from Python. -

        Example 12.1. search.py

        from SOAPpy import WSDL
        +

        Example 12.1. search.py

        from SOAPpy import WSDL
         
         # you'll need to configure these two values;
         # see http://www.google.com/apis/
        @@ -10523,7 +10363,7 @@ if __name__ == '__main__':
         command line, you give the search query as a command-line argument, and it prints out the URL, title, and description of the
         top five Google search results.
         

        Here is the sample output for a search for the word “python”. -

        Example 12.2. Sample Usage of search.py

        +

        Example 12.2. Sample Usage of search.py

         C:\diveintopython3\common\py> python search.py "python"
         <b>Python</b> Programming Language
         http://www.python.org/
        @@ -10576,7 +10416,7 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</b
         

        If you are using Windows, there are several choices. Make sure to download the version of PyXML that matches the version of Python you are using.

      • -

        Double-click the installer. If you download PyXML 0.8.3 for Windows and Python 2.3, the installer program will be PyXML-0.8.3.win32-py2.3.exe. +

        Double-click the installer. If you download PyXML 0.8.3 for Windows and Python 2.3, the installer program will be PyXML-0.8.3.win32-py2.3.exe.

      • Step through the installer program. @@ -10603,7 +10443,7 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</b

        Download the latest version of fpconst from http://www.analytics.washington.edu/statcomp/projects/rzope/fpconst/.

      • -

        There are two downloads available, one in .tar.gz format, the other in .zip format. If you are using Windows, download the .zip file; otherwise, download the .tar.gz file. +

        There are two downloads available, one in .tar.gz format, the other in .zip format. If you are using Windows, download the .zip file; otherwise, download the .tar.gz file.

      • Decompress the downloaded file. On Windows XP, you can right-click on the file and choose Extract All; on earlier versions @@ -10632,7 +10472,7 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</b

        Go to http://pywebsvcs.sourceforge.net/ and select Latest Official Release under the SOAPpy section.

      • -

        There are two downloads available. If you are using Windows, download the .zip file; otherwise, download the .tar.gz file. +

        There are two downloads available. If you are using Windows, download the .zip file; otherwise, download the .tar.gz file.

      • Decompress the downloaded file, just as you did with fpconst. @@ -10666,7 +10506,7 @@ region. 1 -You access the remote SOAP server through a proxy class, SOAPProxy. The proxy handles all the internals of SOAP for you, including creating the XML request document out of the function name and argument list, sending the request over +You access the remote SOAP server through a proxy class, SOAPProxy. The proxy handles all the internals of SOAP for you, including creating the XML request document out of the function name and argument list, sending the request over HTTP to the remote SOAP server, parsing the XML response document, and creating native Python values to return. You'll see what these XML documents look like in the next section. @@ -10681,7 +10521,7 @@ region. 3 -You're creating the SOAPProxy with the service URL and the service namespace. This doesn't make any connection to the SOAP server; it simply creates a local Python object. +You're creating the SOAPProxy with the service URL and the service namespace. This doesn't make any connection to the SOAP server; it simply creates a local Python object. @@ -10695,7 +10535,7 @@ region.

        Let's peek under those covers.

        12.4. Debugging SOAP Web Services

        The SOAP libraries provide an easy way to see what's going on behind the scenes. -

        Turning on debugging is a simple matter of setting two flags in the SOAPProxy's configuration. +

        Turning on debugging is a simple matter of setting two flags in the SOAPProxy's configuration.

        Example 12.7. Debugging SOAP Web Services

         >>> from SOAPpy import SOAPProxy
         >>> url = 'http://services.xmethods.net:80/soap/servlet/rpcrouter'
        @@ -10740,25 +10580,25 @@ region.
         
         1 
         
        -First, create the SOAPProxy like normal, with the service URL and the namespace.
        +First, create the SOAPProxy like normal, with the service URL and the namespace.
         
         
         
         2 
         
        -Second, turn on debugging by setting server.config.dumpSOAPIn and server.config.dumpSOAPOut.
        +Second, turn on debugging by setting server.config.dumpSOAPIn and server.config.dumpSOAPOut.
         
         
         
         3 
         
         Third, call the remote SOAP method as usual.  The SOAP library will print out both the outgoing XML request document, and the incoming XML response document.  This is all the hard
        -            work that SOAPProxy is doing for you.  Intimidating, isn't it?  Let's break it down.
        +            work that SOAPProxy is doing for you.  Intimidating, isn't it?  Let's break it down.
         
         
         
         

        Most of the XML request document that gets sent to the server is just boilerplate. Ignore all the namespace declarations; -they're going to be the same (or similar) for all SOAP calls. The heart of the “function call” is this fragment within the <Body> element: +they're going to be the same (or similar) for all SOAP calls. The heart of the “function call” is this fragment within the <Body> element:

         <ns1:getTemp               1
           xmlns:ns1="urn:xmethods-Temperature"       2
        @@ -10770,7 +10610,7 @@ they're going to be the same (or similar) for all SOAP calls.
         
         1 
         
        -The element name is the function name, getTemp.  SOAPProxy uses getattr as a dispatcher.  Instead of calling separate local methods based on the method name, it actually uses the method name to construct the XML
        +The element name is the function name, getTemp.  SOAPProxy uses getattr as a dispatcher.  Instead of calling separate local methods based on the method name, it actually uses the method name to construct the XML
                     request document.
         
         
        @@ -10778,17 +10618,17 @@ they're going to be the same (or similar) for all SOAP calls.
         2 
         
         The function's XML element is contained in a specific namespace, which is the namespace you specified when you created the
        -SOAPProxy object.  Don't worry about the SOAP-ENC:root; that's boilerplate too.
        +SOAPProxy object.  Don't worry about the SOAP-ENC:root; that's boilerplate too.
         
         
         
         3 
         
        -The arguments of the function also got translated into XML.  SOAPProxy introspects each argument to determine its datatype (in this case it's a string).  The argument datatype goes into the xsi:type attribute, followed by the actual string value.
        +The arguments of the function also got translated into XML.  SOAPProxy introspects each argument to determine its datatype (in this case it's a string).  The argument datatype goes into the xsi:type attribute, followed by the actual string value.
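The dispatcher idea in these callouts can be illustrated with a toy class. This is only a sketch of the technique, written for Python 3, and it is not how SOAPpy is implemented: ToyProxy, the XSD_TYPES table, and the v1, v2 element names are assumptions made for the example, and nothing is sent over the network.

```python
from xml.sax.saxutils import escape

# Toy mapping from Python types to XML Schema type names (an assumption for
# this sketch; a real SOAP library handles many more types).
XSD_TYPES = {str: 'xsd:string', int: 'xsd:int', float: 'xsd:float'}


class ToyProxy:
    """Illustrative stand-in for a SOAP proxy; it builds XML but sends nothing."""

    def __init__(self, namespace):
        self.namespace = namespace

    def __getattr__(self, method_name):
        # __getattr__ runs only when normal attribute lookup fails, so any
        # unknown method name ends up here and becomes the XML element name.
        if method_name.startswith('_'):
            raise AttributeError(method_name)

        def call(*args):
            params = ''.join(
                '<v%d xsi:type="%s">%s</v%d>'
                % (i, XSD_TYPES[type(value)], escape(str(value)), i)
                for i, value in enumerate(args, 1))
            return '<ns1:%s xmlns:ns1="%s">%s</ns1:%s>' % (
                method_name, self.namespace, params, method_name)

        return call


server = ToyProxy('urn:xmethods-Temperature')
print(server.getTemp('27502'))
```

ToyProxy defines no getTemp method anywhere; the method name exists only as the string handed to __getattr__, which is what lets one class stand in for any remote function.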
         
         
         
        -

        The XML return document is equally easy to understand, once you know what to ignore. Focus on this fragment within the <Body>: +

        The XML return document is equally easy to understand, once you know what to ignore. Focus on this fragment within the <Body>:

         <ns1:getTempResponse           1
           xmlns:ns1="urn:xmethods-Temperature"           2
        @@ -10800,25 +10640,25 @@ they're going to be the same (or similar) for all SOAP calls.
         
         1 
         
        -The server wraps the function return value within a <getTempResponse> element.  By convention, this wrapper element is the name of the function, plus Response.  But it could really be almost anything; the important thing that SOAPProxy notices is not the element name, but the namespace.
        +The server wraps the function return value within a <getTempResponse> element.  By convention, this wrapper element is the name of the function, plus Response.  But it could really be almost anything; the important thing that SOAPProxy notices is not the element name, but the namespace.
         
         
         
         2 
         
         The server returns the response in the same namespace we used in the request, the same namespace we specified when we first
        -            create the SOAPProxy.  Later in this chapter we'll see what happens if you forget to specify the namespace when creating the SOAPProxy.
        +            created the SOAPProxy.  Later in this chapter we'll see what happens if you forget to specify the namespace when creating the SOAPProxy.
         
         
         
         3 
         
        -The return value is specified, along with its datatype (it's a float).  SOAPProxy uses this explicit datatype to create a Python object of the correct native datatype and return it.
        +The return value is specified, along with its datatype (it's a float).  SOAPProxy uses this explicit datatype to create a Python object of the correct native datatype and return it.
         
         
         
         

        12.5. Introducing WSDL

        -

        The SOAPProxy class proxies local method calls and transparently turns then into invocations of remote SOAP methods. As you've seen, this is a lot of work, and SOAPProxy does it quickly and transparently. What it doesn't do is provide any means of method introspection. +

        The SOAPProxy class proxies local method calls and transparently turns them into invocations of remote SOAP methods. As you've seen, this is a lot of work, and SOAPProxy does it quickly and transparently. What it doesn't do is provide any means of method introspection.

        Consider this: the previous two sections showed an example of calling a simple remote SOAP method with one argument and one return value, both of simple data types. This required knowing, and keeping track of, the service URL, the service namespace, the function name, the number of arguments, and the datatype of each argument. If any of these is missing or wrong, the whole thing falls apart. @@ -10865,17 +10705,17 @@ a module, and with a little work, drill down to individual function declarations 2 -To use a WSDL file, you again use a proxy class, WSDL.Proxy, which takes a single argument: the WSDL file. Note that in this case you are passing in the URL of a WSDL file stored on the remote server, but the proxy class works just as well with a local copy of the WSDL file. The act of creating the WSDL proxy will download the WSDL file and parse it, so it there are any errors in the WSDL file (or it can't be fetched due to networking problems), you'll know about it immediately. +To use a WSDL file, you again use a proxy class, WSDL.Proxy, which takes a single argument: the WSDL file. Note that in this case you are passing in the URL of a WSDL file stored on the remote server, but the proxy class works just as well with a local copy of the WSDL file. The act of creating the WSDL proxy will download the WSDL file and parse it, so if there are any errors in the WSDL file (or it can't be fetched due to networking problems), you'll know about it immediately. 3 -The WSDL proxy class exposes the available functions as a Python dictionary, server.methods. So getting the list of available methods is as simple as calling the dictionary method keys(). +The WSDL proxy class exposes the available functions as a Python dictionary, server.methods. So getting the list of available methods is as simple as calling the dictionary method keys(). -

        Okay, so you know that this SOAP server offers a single method: getTemp. But how do you call it? The WSDL proxy object can tell you that too. +

        Okay, so you know that this SOAP server offers a single method: getTemp. But how do you call it? The WSDL proxy object can tell you that too.

        Example 12.9. Discovering A Method's Arguments

         >>> callInfo = server.methods['getTemp']  1
         >>> callInfo.inparams   2
        @@ -10889,26 +10729,26 @@ u'zipcode'
         
         1 
         
        -The server.methods dictionary is filled with a SOAPpy-specific structure called CallInfo.  A CallInfo object contains information about one specific function, including the function arguments.
        +The server.methods dictionary is filled with a SOAPpy-specific structure called CallInfo.  A CallInfo object contains information about one specific function, including the function arguments.
         
         
         
         2 
         
        -The function arguments are stored in callInfo.inparams, which is a Python list of ParameterInfo objects that hold information about each parameter.
        +The function arguments are stored in callInfo.inparams, which is a Python list of ParameterInfo objects that hold information about each parameter.
         
         
         
         3 
         
        -Each ParameterInfo object contains a name attribute, which is the argument name.  You are not required to know the argument name to call the function through SOAP, but SOAP does support calling functions with named arguments (just like Python), and WSDL.Proxy will correctly handle mapping named arguments to the remote function if you choose to use them.
        +Each ParameterInfo object contains a name attribute, which is the argument name.  You are not required to know the argument name to call the function through SOAP, but SOAP does support calling functions with named arguments (just like Python), and WSDL.Proxy will correctly handle mapping named arguments to the remote function if you choose to use them.
         
         
         
         4 
         
         Each parameter is also explicitly typed, using datatypes defined in XML Schema.  You saw this in the wire trace in the previous
        -            section; the XML Schema namespace was part of the “boilerplate” I told you to ignore.  For our purposes here, you may continue to ignore it.  The zipcode parameter is a string, and if you pass in a Python string to the WSDL.Proxy object, it will map it correctly and send it to the server.
        +            section; the XML Schema namespace was part of the “boilerplate” I told you to ignore.  For our purposes here, you may continue to ignore it.  The zipcode parameter is a string, and if you pass in a Python string to the WSDL.Proxy object, it will map it correctly and send it to the server.
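Because each parameter carries a name and an XML Schema type, a caller can compare its arguments to the declared types before making a call. The sketch below assumes Python 3 and uses hypothetical stand-ins throughout: Param mimics only the name and type attributes described above (the real ParameterInfo class may hold more), and PYTHON_TYPES is a small mapping invented for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for the name/type pair described in the callouts.
Param = namedtuple('Param', ['name', 'type'])

XSD = 'http://www.w3.org/2001/XMLSchema'

# Illustrative mapping from (namespace, type name) pairs to Python types.
PYTHON_TYPES = {
    (XSD, 'string'): str,
    (XSD, 'int'): int,
    (XSD, 'float'): float,
}


def mismatched_arguments(inparams, args):
    """Return the names of parameters whose argument has the wrong Python type."""
    bad = []
    for param, value in zip(inparams, args):
        expected = PYTHON_TYPES.get(param.type)
        if expected is not None and type(value) is not expected:
            bad.append(param.name)
    return bad


inparams = [Param('zipcode', (XSD, 'string'))]
print(mismatched_arguments(inparams, ['27502']))  # a string, as declared
print(mismatched_arguments(inparams, [27502]))    # an integer, so it is flagged
```

A check like this is optional; it only makes a type mismatch visible locally instead of leaving it for the server to report.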
         
         
         
        @@ -10925,13 +10765,13 @@ u'return'
         
         1 
         
        -The adjunct to callInfo.inparams for function arguments is callInfo.outparams for return value.  It is also a list, because functions called through SOAP can return multiple values, just like Python functions.
        +The counterpart to callInfo.inparams for function arguments is callInfo.outparams for return values.  It is also a list, because functions called through SOAP can return multiple values, just like Python functions.
         
         
         
         2 
         
        -Each ParameterInfo object contains name and type.  This function returns a single value, named return, which is a float.
        +Each ParameterInfo object contains name and type.  This function returns a single value, named return, which is a float.
         
         
         
        @@ -10981,19 +10821,19 @@ u'return'
         
         1 
         
        -The configuration is simpler than calling the SOAP service directly, since the WSDL file contains the both service URL and namespace you need to call the service.  Creating the WSDL.Proxy object downloads the WSDL file, parses it, and configures a SOAPProxy object that it uses to call the actual SOAP web service.
        +The configuration is simpler than calling the SOAP service directly, since the WSDL file contains both the service URL and the namespace you need to call the service.  Creating the WSDL.Proxy object downloads the WSDL file, parses it, and configures a SOAPProxy object that it uses to call the actual SOAP web service.
         
         
         
         2 
         
        -Once the WSDL.Proxy object is created, you can call a function as easily as you did with the SOAPProxy object.  This is not surprising; the WSDL.Proxy is just a wrapper around the SOAPProxy with some introspection methods added, so the syntax for calling functions is the same.
        +Once the WSDL.Proxy object is created, you can call a function as easily as you did with the SOAPProxy object.  This is not surprising; the WSDL.Proxy is just a wrapper around the SOAPProxy with some introspection methods added, so the syntax for calling functions is the same.
         
         
         
         3 
         
        -You can access the WSDL.Proxy's SOAPProxy with server.soapproxy.  This is useful to turning on debugging, so that when you can call functions through the WSDL proxy, its SOAPProxy will dump the outgoing and incoming XML documents that are going over the wire.
        +You can access the WSDL.Proxy's SOAPProxy with server.soapproxy.  This is useful for turning on debugging, so that when you call functions through the WSDL proxy, its SOAPProxy will dump the outgoing and incoming XML documents that are going over the wire.
         
         
         
        @@ -11012,7 +10852,7 @@ u'return'
         

        Also on http://www.google.com/apis/, download the Google Web APIs developer kit. This includes some sample code in several programming languages (but not Python), and more importantly, it includes the WSDL file.

      • -

        Decompress the developer kit file and find GoogleSearch.wsdl. Copy this file to some permanent location on your local drive. You will need it later in this chapter. +

        Decompress the developer kit file and find GoogleSearch.wsdl. Copy this file to some permanent location on your local drive. You will need it later in this chapter.

    Once you have your developer key and your Google WSDL file in a known place, you can start poking around with Google Web Services. @@ -11039,48 +10879,48 @@ oe (u'http://www.w3.org/2001/XMLSchema', u'string') 1 -Getting started with Google web services is easy: just create a WSDL.Proxy object and point it at your local copy of Google's WSDL file. +Getting started with Google web services is easy: just create a WSDL.Proxy object and point it at your local copy of Google's WSDL file. 2 -According to the WSDL file, Google offers three functions: doGoogleSearch, doGetCachedPage, and doSpellingSuggestion. These do exactly what they sound like: perform a Google search and return the results programmatically, get access to the +According to the WSDL file, Google offers three functions: doGoogleSearch, doGetCachedPage, and doSpellingSuggestion. These do exactly what they sound like: perform a Google search and return the results programmatically, get access to the cached version of a page from the last time Google saw it, and offer spelling suggestions for commonly misspelled search words. 3 -The doGoogleSearch function takes a number of parameters of various types. Note that while the WSDL file can tell you what the arguments are called and what datatype they are, it can't tell you what they mean or how to use +The doGoogleSearch function takes a number of parameters of various types. Note that while the WSDL file can tell you what the arguments are called and what datatype they are, it can't tell you what they mean or how to use them. It could theoretically tell you the acceptable range of values for each parameter, if only specific values were allowed, - but Google's WSDL file is not that detailed. WSDL.Proxy can't work magic; it can only give you the information provided in the WSDL file. + but Google's WSDL file is not that detailed. WSDL.Proxy can't work magic; it can only give you the information provided in the WSDL file. -

    Here is a brief synopsis of all the parameters to the doGoogleSearch function: +

    Here is a brief synopsis of all the parameters to the doGoogleSearch function:

      -
    • key - Your Google API key, which you received when you signed up for Google web services. +
    • key - Your Google API key, which you received when you signed up for Google web services. -
    • q - The search word or phrase you're looking for. The syntax is exactly the same as Google's web form, so if you know any +
    • q - The search word or phrase you're looking for. The syntax is exactly the same as Google's web form, so if you know any advanced search syntax or tricks, they all work here as well. -
    • start - The index of the result to start on. Like the interactive web version of Google, this function returns 10 results at a - time. If you wanted to get the second “page” of results, you would set start to 10. +
    • start - The index of the result to start on. Like the interactive web version of Google, this function returns 10 results at a + time. If you wanted to get the second “page” of results, you would set start to 10. -
    • maxResults - The number of results to return. Currently capped at 10, although you can specify fewer if you are only interested in +
    • maxResults - The number of results to return. Currently capped at 10, although you can specify fewer if you are only interested in a few results and want to save a little bandwidth. -
    • filter - If True, Google will filter out duplicate pages from the results. +
    • filter - If True, Google will filter out duplicate pages from the results. -
    • restrict - Set this to country plus a country code to get results only from a particular country. Example: countryUK to search pages in the United Kingdom. You can also specify linux, mac, or bsd to search a Google-defined set of technical sites, or unclesam to search sites about the United States government. +
    • restrict - Set this to country plus a country code to get results only from a particular country. Example: countryUK to search pages in the United Kingdom. You can also specify linux, mac, or bsd to search a Google-defined set of technical sites, or unclesam to search sites about the United States government. -
    • safeSearch - If True, Google will filter out porn sites. +
    • safeSearch - If True, Google will filter out porn sites. -
    • lr (“language restrict”) - Set this to a language code to get results only in a particular language. +
    • lr (“language restrict”) - Set this to a language code to get results only in a particular language. -
    • ie and oe (“input encoding” and “output encoding”) - Deprecated, both must be utf-8. +
    • ie and oe (“input encoding” and “output encoding”) - Deprecated, both must be utf-8.

    Example 12.13. Searching Google

    @@ -11100,23 +10940,23 @@ oe              (u'http://www.w3.org/2001/XMLSchema', u'string')
     
     1 
     
    -After setting up the WSDL.Proxy object, you can call server.doGoogleSearch with all ten parameters.  Remember to use your own Google API key that you received when you signed up for Google web services.
    +After setting up the WSDL.Proxy object, you can call server.doGoogleSearch with all ten parameters.  Remember to use your own Google API key that you received when you signed up for Google web services.
     
     
     
     2 
     
    -There's a lot of information returned, but let's look at the actual search results first.  They're stored in results.resultElements, and you can access them just like a normal Python list.
    +There's a lot of information returned, but let's look at the actual search results first.  They're stored in results.resultElements, and you can access them just like a normal Python list.
     
     
     
     3 
     
    -Each element in the resultElements is an object that has a URL, title, snippet, and other useful attributes.  At this point you can use normal Python introspection techniques like dir(results.resultElements[0]) to see the available attributes.  Or you can introspect through the WSDL proxy object and look through the function's outparams.  Each technique will give you the same information.
    +Each element in the resultElements is an object that has a URL, title, snippet, and other useful attributes.  At this point you can use normal Python introspection techniques like dir(results.resultElements[0]) to see the available attributes.  Or you can introspect through the WSDL proxy object and look through the function's outparams.  Each technique will give you the same information.
     
     
     
    -

    The results object contains more than the actual search results. It also contains information about the search itself, such as how long +

    The results object contains more than the actual search results. It also contains information about the search itself, such as how long it took and how many results were found (even though only 10 were returned). The Google web interface shows this information, and you can access it programmatically too.

    Example 12.14. Accessing Secondary Information From Google

    @@ -11142,7 +10982,7 @@ and you can access it programmatically too.
     
     2 
     
    -In total, there were approximately 30 million results.  You can access them 10 at a time by changing the start parameter and calling server.doGoogleSearch again.
    +In total, there were approximately 30 million results.  You can access them 10 at a time by changing the start parameter and calling server.doGoogleSearch again.
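Walking through a result set 10 at a time comes down to generating the right sequence of start values. A minimal sketch, assuming Python 3 and a zero-based start index (consistent with the second page starting at 10); page_starts is a hypothetical helper and makes no calls to Google.

```python
def page_starts(total_results, page_size=10):
    """Return the start index for each page of results."""
    return range(0, total_results, page_size)


# For 35 results there are four pages; each value would be passed as the
# start parameter of a separate search call.
for start in page_starts(35):
    print(start)
```

Each value produced here would be used for one call to server.doGoogleSearch, with the other parameters left unchanged.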
     
     
     
    @@ -11180,14 +11020,14 @@ Unable to determine object id from call: is the method element namespaced?>
     1 
     
    -Did you spot the mistake?  You're creating a SOAPProxy manually, and you've correctly specified the service URL, but you haven't specified the namespace.  Since multiple services may be routed through the same service URL, the namespace is essential to determine which service you're trying to talk to, and therefore which method you're really
    +Did you spot the mistake?  You're creating a SOAPProxy manually, and you've correctly specified the service URL, but you haven't specified the namespace.  Since multiple services may be routed through the same service URL, the namespace is essential to determine which service you're trying to talk to, and therefore which method you're really
                 calling.
     
     
     
     2 
     
    -The server responds by sending a SOAP Fault, which SOAPpy turns into a Python exception of type SOAPpy.Types.faultType.  All errors returned from any SOAP server will always be SOAP Faults, so you can easily catch this exception.  In this case, the human-readable part of the SOAP Fault gives a clue to the problem: the method element is not namespaced, because the original SOAPProxy object was not configured with a service namespace.
    +The server responds by sending a SOAP Fault, which SOAPpy turns into a Python exception of type SOAPpy.Types.faultType.  All errors returned from any SOAP server will always be SOAP Faults, so you can easily catch this exception.  In this case, the human-readable part of the SOAP Fault gives a clue to the problem: the method element is not namespaced, because the original SOAPProxy object was not configured with a service namespace.
     
     
     
    @@ -11213,14 +11053,14 @@ services.temperature.TempService.getTemp(int) -- no signature match>
     
     1 
     
    -Did you spot the mistake?  It's a subtle one: you're calling server.getTemp with an integer instead of a string.  As you saw from introspecting the WSDL file, the getTemp() SOAP function takes a single argument, zipcode, which must be a string.  WSDL.Proxy will not coerce datatypes for you; you need to pass the exact datatypes that the server expects.
    +Did you spot the mistake?  It's a subtle one: you're calling server.getTemp with an integer instead of a string.  As you saw from introspecting the WSDL file, the getTemp() SOAP function takes a single argument, zipcode, which must be a string.  WSDL.Proxy will not coerce datatypes for you; you need to pass the exact datatypes that the server expects.
     
     
     
     2 
     
    -Again, the server returns a SOAP Fault, and the human-readable part of the error gives a clue as to the problem: you're calling a getTemp function with an integer value, but there is no function defined with that name that takes an integer.  In theory, SOAP allows you to overload functions, so you could have two functions in the same SOAP service with the same name and the same number of arguments, but the arguments were of different datatypes.  This is why
    -            it's important to match the datatypes exactly, and why WSDL.Proxy doesn't coerce datatypes for you.  If it did, you could end up calling a completely different function!  Good luck debugging
    +Again, the server returns a SOAP Fault, and the human-readable part of the error gives a clue as to the problem: you're calling a getTemp function with an integer value, but there is no function defined with that name that takes an integer.  In theory, SOAP allows you to overload functions, so you could have two functions in the same SOAP service with the same name and the same number of arguments, but the arguments were of different datatypes.  This is why
    +            it's important to match the datatypes exactly, and why WSDL.Proxy doesn't coerce datatypes for you.  If it did, you could end up calling a completely different function!  Good luck debugging
                 that one.  It's much easier to be picky about datatypes and fail as quickly as possible if you get them wrong.
     
     
    @@ -11238,8 +11078,8 @@ TypeError: unpack non-sequence
     
     1 
     
    -Did you spot the mistake?  server.getTemp only returns one value, a float, but you've written code that assumes you're getting two values and trying to assign them
    -            to two different variables.  Note that this does not fail with a SOAP fault.  As far as the remote server is concerned, nothing went wrong at all.  The error only occurred after the SOAP transaction was complete, WSDL.Proxy returned a float, and your local Python interpreter tried to accommodate your request to split it into two different variables.  Since the function only returned
    +Did you spot the mistake?  server.getTemp only returns one value, a float, but you've written code that assumes you're getting two values and trying to assign them
    +            to two different variables.  Note that this does not fail with a SOAP fault.  As far as the remote server is concerned, nothing went wrong at all.  The error only occurred after the SOAP transaction was complete, WSDL.Proxy returned a float, and your local Python interpreter tried to accommodate your request to split it into two different variables.  Since the function only returned
                 one value, you get a Python exception trying to split it, not a SOAP Fault.
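You can reproduce this kind of failure without any SOAP server at all. The sketch below is only an illustration: getTemp here is a hypothetical local function standing in for server.getTemp, and the zip code is an arbitrary sample value. It assumes a current Python 3 interpreter, where the error is still a TypeError although its message is worded differently from the traceback shown above.

```python
def getTemp(zipcode):
    """Hypothetical local stand-in for server.getTemp; returns a single float."""
    return 68.0

caught = None
try:
    # Asking for two values when the function returns only one float.
    temperature, humidity = getTemp("27502")
except TypeError as error:
    # The failure is purely local: Python could not split the float in two.
    caught = error

print(type(caught).__name__)
```

The call itself succeeds; the exception comes from the assignment on the left-hand side, which is why no SOAP Fault is involved.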
     
     
    @@ -11374,30 +11214,30 @@ numerals.  You saw the mechanics of constructing and validating Roman numerals i
     
  • There is a limited range of numbers that can be expressed as Roman numerals, specifically 1 through 3999. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by 1000, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from 1 to 3999.) -
  • There is no way to represent 0 in Roman numerals. (Amazingly, the ancient Romans had no concept of 0 as a number. Numbers were for counting things you had; how can you count what you don't have?) +
  • There is no way to represent 0 in Roman numerals. (Amazingly, the ancient Romans had no concept of 0 as a number. Numbers were for counting things you had; how can you count what you don't have?)
  • There is no way to represent negative numbers in Roman numerals.
  • There is no way to represent fractions or non-integer numbers in Roman numerals.

    Given all of this, what would you expect out of a set of functions to convert to and from Roman numerals? -

    roman.py requirements

    +

    roman.py requirements

      -
    1. toRoman should return the Roman numeral representation for all integers 1 to 3999. +
    2. toRoman should return the Roman numeral representation for all integers 1 to 3999. -
    3. toRoman should fail when given an integer outside the range 1 to 3999. +
    4. toRoman should fail when given an integer outside the range 1 to 3999. -
    5. toRoman should fail when given a non-integer number. +
    6. toRoman should fail when given a non-integer number. -
    7. fromRoman should take a valid Roman numeral and return the number that it represents. +
    8. fromRoman should take a valid Roman numeral and return the number that it represents. -
    9. fromRoman should fail when given an invalid Roman numeral. +
    10. fromRoman should fail when given an invalid Roman numeral.
    11. If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number - you started with. So fromRoman(toRoman(n)) == n for all n in 1..3999. + you started with. So fromRoman(toRoman(n)) == n for all n in 1..3999. -
    12. toRoman should always return a Roman numeral using uppercase letters. +
    13. toRoman should always return a Roman numeral using uppercase letters. -
    14. fromRoman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input). +
    15. fromRoman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
    @@ -11412,12 +11252,12 @@ numerals. You saw the mechanics of constructing and validating Roman numerals i behave the way you want them to. You read that right: you're going to write code that tests code that you haven't written yet.

    This is called unit testing, since the set of two conversion functions can be written and tested as a unit, separate from -any larger program they may become part of later. Python has a framework for unit testing, the appropriately-named unittest module. +any larger program they may become part of later. Python has a framework for unit testing, the appropriately-named unittest module.
    -
    Note
    unittest is included with Python 2.1 and later. Python 2.0 users can download it from pyunit.sourceforge.net. +unittest is included with Python 2.1 and later. Python 2.0 users can download it from pyunit.sourceforge.net.
    @@ -11439,11 +11279,11 @@ of development: That way, nobody goes off too far into developing code that won't play well with others.) -

    13.3. Introducing romantest.py

    +

    13.3. Introducing romantest.py

    This is the complete test suite for your Roman numeral conversion functions, which are yet to be written but will eventually - be in roman.py. It is not immediately obvious how it all fits together; none of these classes or methods reference any of the others. + be in roman.py. It is not immediately obvious how it all fits together; none of these classes or methods reference any of the others. There are good reasons for this, as you'll see shortly. -

    Example 13.1. romantest.py

    +

    Example 13.1. romantest.py

    If you have not already done so, you can download this and other examples used in this book.

     """Unit test for roman.py"""
     
    @@ -11581,11 +11421,11 @@ if __name__ == "__main__":
         unittest.main()   

    Further reading

      -
    • The PyUnit home page has an in-depth discussion of using the unittest framework, including advanced features not covered in this chapter. +
    • The PyUnit home page has an in-depth discussion of using the unittest framework, including advanced features not covered in this chapter.
    • The PyUnit FAQ explains why test cases are stored separately from the code they test. -
    • Python Library Reference summarizes the unittest module. +
    • Python Library Reference summarizes the unittest module.
    • ExtremeProgramming.org discusses why you should write unit tests. @@ -11605,10 +11445,10 @@ if __name__ == "__main__":

      Given that, let's build the first test case. You have the following requirement:

        -
      1. toRoman should return the Roman numeral representation for all integers 1 to 3999. +
      2. toRoman should return the Roman numeral representation for all integers 1 to 3999.
      -

      Example 13.2. testToRomanKnownValues

      +

      Example 13.2. testToRomanKnownValues

       class KnownValues(unittest.TestCase):         1
           knownValues = ( (1, 'I'),
         (2, 'II'),
      @@ -11676,7 +11516,7 @@ class KnownValues(unittest.TestCase):         1 
       
      -To write a test case, first subclass the TestCase class of the unittest module.  This class provides many useful methods which you can use in your test case to test specific conditions.
      +To write a test case, first subclass the TestCase class of the unittest module.  This class provides many useful methods which you can use in your test case to test specific conditions.
       
       
       
      @@ -11697,38 +11537,38 @@ class KnownValues(unittest.TestCase):         4 
       
      -Here you call the actual toRoman function.  (Well, the function hasn't been written yet, but once it is, this is the line that will call it.)  Notice that you
      -            have now defined the API for the toRoman function: it must take an integer (the number to convert) and return a string (the Roman numeral representation).  If the
      +Here you call the actual toRoman function.  (Well, the function hasn't been written yet, but once it is, this is the line that will call it.)  Notice that you
      +            have now defined the API for the toRoman function: it must take an integer (the number to convert) and return a string (the Roman numeral representation).  If the
       API is different than that, this test is considered failed.
       
       
       
       5 
       
      -Also notice that you are not trapping any exceptions when you call toRoman.  This is intentional.  toRoman shouldn't raise an exception when you call it with valid input, and these input values are all valid.  If toRoman raises an exception, this test is considered failed.
      +Also notice that you are not trapping any exceptions when you call toRoman.  This is intentional.  toRoman shouldn't raise an exception when you call it with valid input, and these input values are all valid.  If toRoman raises an exception, this test is considered failed.
       
       
       
       6 
       
      -Assuming the toRoman function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check
      -            whether it returned the right value.  This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal.  If the result returned from toRoman (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail.  If the two values are equal, assertEqual will do nothing.  If every value returned from toRoman matches the known value you expect, assertEqual never raises an exception, so testToRomanKnownValues eventually exits normally, which means toRoman has passed this test.
      +Assuming the toRoman function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check
      +            whether it returned the right value.  This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal.  If the result returned from toRoman (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail.  If the two values are equal, assertEqual will do nothing.  If every value returned from toRoman matches the known value you expect, assertEqual never raises an exception, so testToRomanKnownValues eventually exits normally, which means toRoman has passed this test.
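To see the whole known-values pattern run end to end, here is a minimal sketch. It assumes Python 3, and the toRoman defined here is only a small stand-in written for this illustration (the real roman.py comes later in the book); the handful of known values is likewise a short sample rather than the full table.

```python
import unittest

# Stand-in conversion table and function, included only so the test can run.
NUMERAL_MAP = (("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
               ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
               ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1))

def toRoman(n):
    """Convert an integer in 1..3999 to a Roman numeral string."""
    result = ""
    for numeral, value in NUMERAL_MAP:
        while n >= value:
            result += numeral
            n -= value
    return result

class KnownValues(unittest.TestCase):
    # A short sample of (integer, numeral) pairs.
    knownValues = ((1, "I"), (4, "IV"), (9, "IX"),
                   (1994, "MCMXCIV"), (3999, "MMMCMXCIX"))

    def testToRomanKnownValues(self):
        """toRoman should give known result with known input"""
        for integer, numeral in self.knownValues:
            result = toRoman(integer)          # no try...except on purpose
            self.assertEqual(numeral, result)  # raises if the values differ

suite = unittest.defaultTestLoader.loadTestsFromTestCase(KnownValues)
outcome = unittest.TextTestRunner(verbosity=2).run(suite)
```

If any pair fails to match, assertEqual raises and the runner reports the test as failed; otherwise the method returns silently and the test passes.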
       
       
       
       

      13.5. Testing for failure

      It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect. -

      Remember the other requirements for toRoman: +

      Remember the other requirements for toRoman:

        -
      1. toRoman should fail when given an integer outside the range 1 to 3999. +
      2. toRoman should fail when given an integer outside the range 1 to 3999. -
      3. toRoman should fail when given a non-integer number. +
      4. toRoman should fail when given a non-integer number.
      -

      In Python, functions indicate failure by raising exceptions, and the unittest module provides methods for testing whether a function raises a particular exception when given bad input. -

      Example 13.3. Testing bad input to toRoman

      +

      In Python, functions indicate failure by raising exceptions, and the unittest module provides methods for testing whether a function raises a particular exception when given bad input. +

      Example 13.3. Testing bad input to toRoman

       class ToRomanBadInput(unittest.TestCase):          
           def testTooLarge(self):      
               """toRoman should fail with large input""" 
      @@ -11749,9 +11589,9 @@ class ToRomanBadInput(unittest.TestCase):
       
       1 
       
      -The TestCase class of the unittest module provides the assertRaises method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments
      -            you're passing that function.  (If the function you're testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you're testing.)  Pay close attention to what you're doing here:
      -            instead of calling toRoman directly and manually checking that it raises a particular exception (by wrapping it in a try...except block), assertRaises has encapsulated all of that for us.  All you do is give it the exception (roman.OutOfRangeError), the function (toRoman), and toRoman's arguments (4000), and assertRaises takes care of calling toRoman and checking to make sure that it raises roman.OutOfRangeError.  (Also note that you're passing the toRoman function itself as an argument; you're not calling it, and you're not passing the name of it as a string.  Have I mentioned
      +The TestCase class of the unittest module provides the assertRaises method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments
      +            you're passing that function.  (If the function you're testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you're testing.)  Pay close attention to what you're doing here:
      +            instead of calling toRoman directly and manually checking that it raises a particular exception (by wrapping it in a try...except block), assertRaises has encapsulated all of that for us.  All you do is give it the exception (roman.OutOfRangeError), the function (toRoman), and toRoman's arguments (4000), and assertRaises takes care of calling toRoman and checking to make sure that it raises roman.OutOfRangeError.  (Also note that you're passing the toRoman function itself as an argument; you're not calling it, and you're not passing the name of it as a string.  Have I mentioned
                   recently how handy it is that everything in Python is an object, including functions and exceptions?)
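It may help to see the manual try...except check side by side with the assertRaises call that replaces it. This is a rough sketch under stated assumptions: it targets Python 3, and OutOfRangeError and toRoman are minimal stand-ins defined here for illustration, not the versions that will live in roman.py.

```python
import unittest

class OutOfRangeError(ValueError):
    """Stand-in for roman.OutOfRangeError."""

def toRoman(n):
    """Stand-in that only performs the range check."""
    if not 0 < n < 4000:
        raise OutOfRangeError("number out of range (must be 1..3999)")
    return "stub"

class ToRomanBadInput(unittest.TestCase):
    def testTooLargeManually(self):
        """the long way: call the function and trap the exception yourself"""
        try:
            toRoman(4000)
        except OutOfRangeError:
            pass  # expected failure mode
        else:
            self.fail("OutOfRangeError not raised")

    def testTooLarge(self):
        """the short way: pass the function object and its arguments"""
        self.assertRaises(OutOfRangeError, toRoman, 4000)

    def testZero(self):
        """toRoman should fail with 0 input"""
        self.assertRaises(OutOfRangeError, toRoman, 0)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
outcome = unittest.TextTestRunner(verbosity=2).run(suite)
```

Both testTooLargeManually and testTooLarge check the same thing; the second simply lets assertRaises do the calling and the bookkeeping.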
       
       
      @@ -11759,27 +11599,27 @@ class ToRomanBadInput(unittest.TestCase):
       2 
       
       Along with testing numbers that are too large, you need to test numbers that are too small.  Remember, Roman numerals cannot
      -            express 0 or negative numbers, so you have a test case for each of those (testZero and testNegative).  In testZero, you are testing that toRoman raises a roman.OutOfRangeError exception when called with 0; if it does not raise a roman.OutOfRangeError (either because it returns an actual value, or because it raises some other exception), this test is considered failed.
      +            express 0 or negative numbers, so you have a test case for each of those (testZero and testNegative).  In testZero, you are testing that toRoman raises a roman.OutOfRangeError exception when called with 0; if it does not raise a roman.OutOfRangeError (either because it returns an actual value, or because it raises some other exception), this test is considered failed.
       
       
       
       3 
       
      -Requirement #3 specifies that toRoman cannot accept a non-integer number, so here you test to make sure that toRoman raises a roman.NotIntegerError exception when called with 0.5.  If toRoman does not raise a roman.NotIntegerError, this test is considered failed.
      +Requirement #3 specifies that toRoman cannot accept a non-integer number, so here you test to make sure that toRoman raises a roman.NotIntegerError exception when called with 0.5.  If toRoman does not raise a roman.NotIntegerError, this test is considered failed.
       
       
       
      -

      The next two requirements are similar to the first three, except they apply to fromRoman instead of toRoman: +

      The next two requirements are similar to the first three, except they apply to fromRoman instead of toRoman:

        -
      1. fromRoman should take a valid Roman numeral and return the number that it represents. +
      2. fromRoman should take a valid Roman numeral and return the number that it represents. -
      3. fromRoman should fail when given an invalid Roman numeral. +
      4. fromRoman should fail when given an invalid Roman numeral.

      Requirement #4 is handled in the same way as requirement #1, iterating through a sampling of known values and testing each in turn. Requirement #5 is handled in the same way as requirements -#2 and #3, by testing a series of bad inputs and making sure fromRoman raises the appropriate exception. -

      Example 13.4. Testing bad input to fromRoman

      +#2 and #3, by testing a series of bad inputs and making sure fromRoman raises the appropriate exception.
      +

      Example 13.4. Testing bad input to fromRoman

       class FromRomanBadInput(unittest.TestCase):  
           def testTooManyRepeatedNumerals(self):   
               """fromRoman should fail with too many repeated numerals"""              
      @@ -11800,7 +11640,7 @@ class FromRomanBadInput(unittest.TestCase):
       
       1 
       
      -Not much new to say about these; the pattern is exactly the same as the one you used to test bad input to toRoman.  I will briefly note that you have another exception: roman.InvalidRomanNumeralError.  That makes a total of three custom exceptions that will need to be defined in roman.py (along with roman.OutOfRangeError and roman.NotIntegerError).  You'll see how to define these custom exceptions when you actually start writing roman.py, later in this chapter.
      +Not much new to say about these; the pattern is exactly the same as the one you used to test bad input to toRoman.  I will briefly note that you have another exception: roman.InvalidRomanNumeralError.  That makes a total of three custom exceptions that will need to be defined in roman.py (along with roman.OutOfRangeError and roman.NotIntegerError).  You'll see how to define these custom exceptions when you actually start writing roman.py, later in this chapter.
       
       
       
      @@ -11812,10 +11652,10 @@ class FromRomanBadInput(unittest.TestCase):
       
      1. If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number - you started with. So fromRoman(toRoman(n)) == n for all n in 1..3999. + you started with. So fromRoman(toRoman(n)) == n for all n in 1..3999.
      -

      Example 13.5. Testing toRoman against fromRoman

      +

      Example 13.5. Testing toRoman against fromRoman

       class SanityCheck(unittest.TestCase):        
           def testSanity(self):  
               """fromRoman(toRoman(n))==n for all n"""
      @@ -11827,31 +11667,31 @@ class SanityCheck(unittest.TestCase):
       
       1 
       
      -You've seen the range function before, but here it is called with two arguments, which returns a list of integers starting at the first argument (1) and counting consecutively up to but not including the second argument (4000).  Thus, 1..3999, which is the valid range for converting to Roman numerals.
      +You've seen the range function before, but here it is called with two arguments, which returns a list of integers starting at the first argument (1) and counting consecutively up to but not including the second argument (4000).  Thus, 1..3999, which is the valid range for converting to Roman numerals.
       
       
       
       2 
       
      -I just wanted to mention in passing that integer is not a keyword in Python; here it's just a variable name like any other.
      +I just wanted to mention in passing that integer is not a keyword in Python; here it's just a variable name like any other.
       
       
       
       3 
       
      -The actual testing logic here is straightforward: take a number (integer), convert it to a Roman numeral (numeral), then convert it back to a number (result) and make sure you end up with the same number you started with.  If not, assertEqual will raise an exception and the test will immediately be considered failed.  If all the numbers match, assertEqual will always return silently, the entire testSanity method will eventually return silently, and the test will be considered passed.
      +The actual testing logic here is straightforward: take a number (integer), convert it to a Roman numeral (numeral), then convert it back to a number (result) and make sure you end up with the same number you started with.  If not, assertEqual will raise an exception and the test will immediately be considered failed.  If all the numbers match, assertEqual will always return silently, the entire testSanity method will eventually return silently, and the test will be considered passed.
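Here is the round-trip idea as a runnable sketch. The toRoman and fromRoman below are simplified stand-ins written for this illustration (fromRoman does no validation at all), and the code assumes Python 3, where range(1, 4000) produces the same 1..3999 sequence lazily instead of building a list.

```python
import unittest

# Stand-in conversion table shared by both directions.
NUMERAL_MAP = (("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
               ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
               ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1))

def toRoman(n):
    """Convert an integer in 1..3999 to a Roman numeral string."""
    result = ""
    for numeral, value in NUMERAL_MAP:
        while n >= value:
            result += numeral
            n -= value
    return result

def fromRoman(s):
    """Convert a well-formed Roman numeral back to an integer (no validation)."""
    result = 0
    index = 0
    for numeral, value in NUMERAL_MAP:
        while s[index:index + len(numeral)] == numeral:
            result += value
            index += len(numeral)
    return result

class SanityCheck(unittest.TestCase):
    def testSanity(self):
        """fromRoman(toRoman(n))==n for all n"""
        for integer in range(1, 4000):   # 1..3999 inclusive
            numeral = toRoman(integer)
            result = fromRoman(numeral)
            self.assertEqual(integer, result)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(SanityCheck)
outcome = unittest.TextTestRunner(verbosity=2).run(suite)
```

Note that this test says nothing about whether either function is correct on its own; two functions that are wrong in complementary ways would still pass, which is why the known-values tests exist separately.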
       
       
       
       

      The last two requirements are different from the others because they seem both arbitrary and trivial:

        -
      1. toRoman should always return a Roman numeral using uppercase letters. +
      2. toRoman should always return a Roman numeral using uppercase letters. -
      3. fromRoman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input). +
      4. fromRoman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
      -

      In fact, they are somewhat arbitrary. You could, for instance, have stipulated that fromRoman accept lowercase and mixed case input. But they are not completely arbitrary; if toRoman is always returning uppercase output, then fromRoman must at least accept uppercase input, or the “sanity check” (requirement #6) would fail. The fact that it only accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying +

      In fact, they are somewhat arbitrary. You could, for instance, have stipulated that fromRoman accept lowercase and mixed case input. But they are not completely arbitrary; if toRoman is always returning uppercase output, then fromRoman must at least accept uppercase input, or the “sanity check” (requirement #6) would fail. The fact that it only accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying the behavior up front. And if it's worth specifying, it's worth testing.

      Example 13.6. Testing for case

       class CaseCheck(unittest.TestCase): 
      @@ -11873,8 +11713,8 @@ class CaseCheck(unittest.TestCase):
       1 
       
       The most interesting thing about this test case is all the things it doesn't test.  It doesn't test that the value returned
      -            from toRoman is right or even consistent; those questions are answered by separate test cases.  You have a whole test case just to test for uppercase-ness.  You might
      -            be tempted to combine this with the sanity check, since both run through the entire range of values and call toRoman.[6]  But that would violate one of the fundamental rules: each test case should answer only a single question.  Imagine that you combined this case check with the sanity check, and
      +            from toRoman is right or even consistent; those questions are answered by separate test cases.  You have a whole test case just to test for uppercase-ness.  You might
      +            be tempted to combine this with the sanity check, since both run through the entire range of values and call toRoman.[6]  But that would violate one of the fundamental rules: each test case should answer only a single question.  Imagine that you combined this case check with the sanity check, and
                   then that test case failed.  You would need to do further analysis to figure out which part of the test case failed to determine
                   what the problem was.  If you need to analyze the results of your unit testing just to figure out what they mean, it's a sure
                   sign that you've mis-designed your test cases.
      @@ -11883,21 +11723,21 @@ class CaseCheck(unittest.TestCase):
       
       2 
       
      -There's a similar lesson to be learned here: even though “you know” that toRoman always returns uppercase, you are explicitly converting its return value to uppercase here to test that fromRoman accepts uppercase input.  Why?  Because the fact that toRoman always returns uppercase is an independent requirement.  If you changed that requirement so that, for instance, it always
      -            returned lowercase, the testToRomanCase test case would need to change, but this test case would still work.  This was another of the fundamental rules: each test case must be able to work in isolation from any of the others.  Every test case is an island.
      +There's a similar lesson to be learned here: even though “you know” that toRoman always returns uppercase, you are explicitly converting its return value to uppercase here to test that fromRoman accepts uppercase input.  Why?  Because the fact that toRoman always returns uppercase is an independent requirement.  If you changed that requirement so that, for instance, it always
      +            returned lowercase, the testToRomanCase test case would need to change, but this test case would still work.  This was another of the fundamental rules: each test case must be able to work in isolation from any of the others.  Every test case is an island.
       
       
       
       3 
       
      -Note that you're not assigning the return value of fromRoman to anything.  This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value.  In this case, that's what you want.  This test case doesn't test anything about the return
      -            value; it just tests that fromRoman accepts the uppercase input without raising an exception.
      +Note that you're not assigning the return value of fromRoman to anything.  This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value.  In this case, that's what you want.  This test case doesn't test anything about the return
      +            value; it just tests that fromRoman accepts the uppercase input without raising an exception.
       
       
       
       4 
       
      -This is a complicated line, but it's very similar to what you did in the ToRomanBadInput and FromRomanBadInput tests.  You are testing to make sure that calling a particular function (roman.fromRoman) with a particular value (numeral.lower(), the lowercase version of the current Roman numeral in the loop) raises a particular exception (roman.InvalidRomanNumeralError).  If it does (each time through the loop), the test passes; if even one time it does something else (like raising a different
      +This is a complicated line, but it's very similar to what you did in the ToRomanBadInput and FromRomanBadInput tests.  You are testing to make sure that calling a particular function (roman.fromRoman) with a particular value (numeral.lower(), the lowercase version of the current Roman numeral in the loop) raises a particular exception (roman.InvalidRomanNumeralError).  If it does (each time through the loop), the test passes; if even one time it does something else (like raising a different
                   exception, or returning a value without raising an exception at all), the test fails.
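The shape of that loop can be sketched in isolation. Everything here is an assumption made for the sake of a runnable example under Python 3: fromRoman is a deliberately crude stand-in that only checks for uppercase numeral letters, InvalidRomanNumeralError is a local stand-in for the exception in roman.py, and the sample numerals are a short hand-picked list rather than the output of toRoman.

```python
import re
import unittest

class InvalidRomanNumeralError(ValueError):
    """Stand-in for roman.InvalidRomanNumeralError."""

def fromRoman(s):
    """Crude stand-in: accepts only strings of uppercase numeral letters."""
    if not re.fullmatch("[MDCLXVI]+", s):
        raise InvalidRomanNumeralError("Invalid Roman numeral: %s" % s)
    return len(s)  # placeholder; this test never inspects the return value

class CaseCheck(unittest.TestCase):
    sampleNumerals = ("I", "IV", "XLII", "MCMXCIV")

    def testFromRomanCase(self):
        """fromRoman should only accept uppercase input"""
        for numeral in self.sampleNumerals:
            # Uppercase input must be accepted; the return value is discarded.
            fromRoman(numeral.upper())
            # Lowercase input must raise the specific exception, every time.
            self.assertRaises(InvalidRomanNumeralError,
                              fromRoman, numeral.lower())

suite = unittest.defaultTestLoader.loadTestsFromTestCase(CaseCheck)
outcome = unittest.TextTestRunner(verbosity=2).run(suite)
```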
       
       
      @@ -11908,12 +11748,12 @@ class CaseCheck(unittest.TestCase):
       

      [6] “I can resist everything except temptation.” --Oscar Wilde

      Chapter 14. Test-First Programming

      -

      14.1. roman.py, stage 1

      +

      14.1. roman.py, stage 1

      Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps - in roman.py. -

      Example 14.1. roman1.py

      -

      This file is available in py/roman/stage1/ in the examples directory. + in roman.py. +

      Example 14.1. roman1.py

      +

      This file is available in py/roman/stage1/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       """Convert to and from Roman numerals"""
       
      @@ -11936,20 +11776,20 @@ def fromRoman(s):
       1 
       
       This is how you define your own custom exceptions in Python.  Exceptions are classes, and you create your own by subclassing existing exceptions.  It is strongly recommended (but not
      -            required) that you subclass Exception, which is the base class that all built-in exceptions inherit from.  Here I am defining RomanError (inherited from Exception) to act as the base class for all my other custom exceptions to follow.  This is a matter of style; I could just as easily
      -            have inherited each individual exception from the Exception class directly.
      +            required) that you subclass Exception, which is the base class that all built-in exceptions inherit from.  Here I am defining RomanError (inherited from Exception) to act as the base class for all my other custom exceptions to follow.  This is a matter of style; I could just as easily
      +            have inherited each individual exception from the Exception class directly.
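The hierarchy described in this callout can be sketched as follows. This is a minimal illustration written in Python 3 syntax, not the book's own listing (which this hunk does not show); only the class names are taken from the text.

```python
class RomanError(Exception):
    """Base class for all errors raised by the roman module."""


class OutOfRangeError(RomanError):
    """The integer is outside the convertible range."""


class NotIntegerError(RomanError):
    """The value to convert is not an integer."""


class InvalidRomanNumeralError(RomanError):
    """The string is not a valid Roman numeral."""


# Because every custom exception inherits from RomanError, a caller can
# catch them all with a single except clause.
try:
    raise OutOfRangeError("number out of range")
except RomanError as e:
    caught = type(e).__name__

print(caught)
```

Had each exception inherited from Exception directly, the caller would need to name every one of them in the except clause, which is the trade-off behind the "matter of style" remark.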
       
       
       
       2 
       
      -The OutOfRangeError and NotIntegerError exceptions will eventually be used by toRoman to flag various forms of invalid input, as specified in ToRomanBadInput.
      +The OutOfRangeError and NotIntegerError exceptions will eventually be used by toRoman to flag various forms of invalid input, as specified in ToRomanBadInput.
       
       
       
       3 
       
      -The InvalidRomanNumeralError exception will eventually be used by fromRoman to flag invalid input, as specified in FromRomanBadInput.
      +The InvalidRomanNumeralError exception will eventually be used by fromRoman to flag invalid input, as specified in FromRomanBadInput.
       
       
       
      @@ -11960,10 +11800,10 @@ def fromRoman(s):
       
       
       

      Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At -this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to romantest.py and re-evaluate why you coded a test so useless that it passes with do-nothing functions. -

      Run romantest1.py with the -v command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs. +this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to romantest.py and re-evaluate why you coded a test so useless that it passes with do-nothing functions. +

      Run romantest1.py with the -v command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this: -

      Example 14.2. Output of romantest1.py against roman1.py

      fromRoman should only accept uppercase input ... ERROR
      +

      Example 14.2. Output of romantest1.py against roman1.py

      fromRoman should only accept uppercase input ... ERROR
       toRoman should always return uppercase ... ERROR
       fromRoman should fail with malformed antecedents ... FAIL
       fromRoman should fail with repeated pairs of numerals ... FAIL
      @@ -12088,33 +11928,33 @@ FAILED (failures=10, errors=2)     1 
       
      -Running the script runs unittest.main(), which runs each test case, which is to say each method defined in each class within romantest.py.  For each test case, it prints out the docstring of the method and whether that test passed or failed.  As expected, none of the test cases passed.
      +Running the script runs unittest.main(), which runs each test case, which is to say each method defined in each class within romantest.py.  For each test case, it prints out the docstring of the method and whether that test passed or failed.  As expected, none of the test cases passed.
       
       
       
       2 
       
      -For each failed test case, unittest displays the trace information showing exactly what happened.  In this case, the call to assertRaises (also called failUnlessRaises) raised an AssertionError because it was expecting toRoman to raise an OutOfRangeError and it didn't.
      +For each failed test case, unittest displays the trace information showing exactly what happened.  In this case, the call to assertRaises (also called failUnlessRaises) raised an AssertionError because it was expecting toRoman to raise an OutOfRangeError and it didn't.
       
       
       
       3 
       
      -After the detail, unittest displays a summary of how many tests were performed and how long it took.
      +After the detail, unittest displays a summary of how many tests were performed and how long it took.
       
       
       
       4 
       
      -Overall, the unit test failed because at least one test case did not pass.  When a test case doesn't pass, unittest distinguishes between failures and errors.  A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised.  An error is any other sort
      -            of exception raised in the code you're testing or the unit test case itself.  For instance, the testFromRomanCase method (“fromRoman should only accept uppercase input”) was an error, because the call to numeral.upper() raised an AttributeError exception, because toRoman was supposed to return a string but didn't.  But testZero (“toRoman should fail with 0 input”) was a failure, because the call to fromRoman did not raise the InvalidRomanNumeral exception that assertRaises was looking for.
      +Overall, the unit test failed because at least one test case did not pass.  When a test case doesn't pass, unittest distinguishes between failures and errors.  A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised.  An error is any other sort
                  of exception raised in the code you're testing or the unit test case itself.  For instance, the testFromRomanCase method (“fromRoman should only accept uppercase input”) was an error, because the call to numeral.upper() raised an AttributeError exception, because toRoman was supposed to return a string but didn't.  But testZero (“toRoman should fail with 0 input”) was a failure, because the call to toRoman did not raise the OutOfRangeError exception that assertRaises was looking for.
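If the distinction between a failure and an error still seems abstract, the following sketch may help. It is written in Python 3 syntax and is not part of romantest.py; stubToRoman, run_demo, and the Demo class are hypothetical names invented for this illustration, and ValueError stands in for the module's own exceptions.

```python
import unittest


def stubToRoman(n):
    """Do-nothing stand-in for a stage 1 function; it returns None."""
    return None


def run_demo():
    class Demo(unittest.TestCase):
        def testFailure(self):
            # No exception is raised, so assertRaises reports a FAILURE.
            self.assertRaises(ValueError, stubToRoman, 0)

        def testError(self):
            # None has no upper() method, so an AttributeError escapes
            # the test method and is reported as an ERROR.
            stubToRoman(1).upper()

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.failures), len(result.errors)


print(run_demo())  # (number of failures, number of errors)
```

One test lands in each bucket: the assertion that was not satisfied counts as a failure, and the unexpected AttributeError counts as an error.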
       
       
       
      -

      14.2. roman.py, stage 2

      -

      Now that you have the framework of the roman module laid out, it's time to start writing code and passing test cases. -

      Example 14.3. roman2.py

      -

      This file is available in py/roman/stage2/ in the examples directory. +

      14.2. roman.py, stage 2

      +

      Now that you have the framework of the roman module laid out, it's time to start writing code and passing test cases. +

      Example 14.3. roman2.py

      +

      This file is available in py/roman/stage2/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       """Convert to and from Roman numerals"""
       
      @@ -12156,15 +11996,15 @@ def fromRoman(s):
       
       1 
       
      -romanNumeralMap is a tuple of tuples which defines three things:
      +romanNumeralMap is a tuple of tuples which defines three things:
       
      1. The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals; - you're also defining two-character pairs like CM (“one hundred less than one thousand”); this will make the toRoman code simpler later. + you're also defining two-character pairs like CM (“one hundred less than one thousand”); this will make the toRoman code simpler later.
      2. The order of the Roman numerals. They are listed in descending value order, from M all the way down to I. -
      3. The value of each Roman numeral. Each inner tuple is a pair of (numeral, value). +
      3. The value of each Roman numeral. Each inner tuple is a pair of (numeral, value).
      @@ -12173,13 +12013,13 @@ def fromRoman(s): 2 Here's where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule. - To convert to Roman numerals, you simply iterate through romanNumeralMap looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation + To convert to Roman numerals, you simply iterate through romanNumeralMap looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat. -

      Example 14.4. How toRoman works

      -

      If you're not clear how toRoman works, add a print statement to the end of the while loop:

      +

      Example 14.4. How toRoman works

      +

      If you're not clear how toRoman works, add a print statement to the end of the while loop:

               while n >= integer:
                   result += numeral
                   n -= integer
      @@ -12192,9 +12032,9 @@ subtracting 10 from input, adding X to output
       subtracting 10 from input, adding X to output
       subtracting 4 from input, adding IV to output
       'MCDXXIV'
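For reference, here is a sketch of the loop that produces the trace above. It is written in Python 3 syntax, and the contents of romanNumeralMap are reconstructed from the description (a tuple of (numeral, value) pairs in descending order, including the two-character pairs) rather than copied from roman2.py.

```python
# Reconstructed from the description: single-character numerals plus the
# two-character subtraction pairs, in descending value order.
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def toRoman(n):
    """Convert an integer to a Roman numeral (no input checking yet)."""
    result = ""
    for numeral, integer in romanNumeralMap:
        # Take the largest value that still fits, as often as it fits.
        while n >= integer:
            result += numeral
            n -= integer
    return result


print(toRoman(1424))  # MCDXXIV
```

Because CD and IV sit in the map as ordinary entries, the loop never needs a special case for the subtraction rule.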
      -

      So toRoman appears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely. -

      Example 14.5. Output of romantest2.py against roman2.py

      -

      Remember to run romantest2.py with the -v command-line flag to enable verbose mode.

      fromRoman should only accept uppercase input ... FAIL
      +

      So toRoman appears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely. +

      Example 14.5. Output of romantest2.py against roman2.py

      +

      Remember to run romantest2.py with the -v command-line flag to enable verbose mode.

      fromRoman should only accept uppercase input ... FAIL
       toRoman should always return uppercase ... ok1
       fromRoman should fail with malformed antecedents ... FAIL
       fromRoman should fail with repeated pairs of numerals ... FAIL
      @@ -12210,13 +12050,13 @@ toRoman should fail with 0 input ... FAIL
      1 -toRoman does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already. +toRoman does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already. 2 -Here's the big news: this version of the toRoman function passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including +Here's the big news: this version of the toRoman function passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it. @@ -12224,7 +12064,7 @@ toRoman should fail with 0 input ... FAIL
      3 However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to - be raised (via assertRaises), and you're never raising them. You'll do that in the next stage. + be raised (via assertRaises), and you're never raising them. You'll do that in the next stage. @@ -12322,10 +12162,10 @@ AssertionError: OutOfRangeError ---------------------------------------------------------------------- Ran 12 tests in 0.320s -FAILED (failures=10)

      14.3. roman.py, stage 3

      -

      Now that toRoman behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else). -

      Example 14.6. roman3.py

      -

      This file is available in py/roman/stage3/ in the examples directory. +FAILED (failures=10)

      14.3. roman.py, stage 3

      +

      Now that toRoman behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else). +

      Example 14.6. roman3.py

      +

      This file is available in py/roman/stage3/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       """Convert to and from Roman numerals"""
       
      @@ -12394,7 +12234,7 @@ def fromRoman(s):
       The rest of the function is unchanged.
       
       
      -

      Example 14.7. Watching toRoman handle bad input

      +

      Example 14.7. Watching toRoman handle bad input

       >>> import roman3
       >>> roman3.toRoman(4000)
       Traceback (most recent call last):
      @@ -12408,7 +12248,7 @@ OutOfRangeError: number out of range (must be 1..3999)
         File "roman3.py", line 29, in toRoman
           raise NotIntegerError, "non-integers can not be converted"
       NotIntegerError: non-integers can not be converted
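The two checks behind these tracebacks can be sketched on their own. This is an illustration in Python 3 syntax: the book's listing uses the Python 2 forms raise X, "message" and <>, and checkInput is a hypothetical helper name, since the real checks sit at the top of toRoman.

```python
class RomanError(Exception):
    pass


class OutOfRangeError(RomanError):
    pass


class NotIntegerError(RomanError):
    pass


def checkInput(n):
    """Raise the appropriate exception for bad input; return n otherwise."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    return n


for bad in (4000, 0, -1, 0.5):
    try:
        checkInput(bad)
    except RomanError as e:
        print(bad, type(e).__name__)
```

Note that 0.5 passes the range check and is caught only by the integer check, so both checks are needed.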
      -

      Example 14.8. Output of romantest3.py against roman3.py

      fromRoman should only accept uppercase input ... FAIL
      +

      Example 14.8. Output of romantest3.py against roman3.py

      fromRoman should only accept uppercase input ... FAIL
       toRoman should always return uppercase ... ok
       fromRoman should fail with malformed antecedents ... FAIL
       fromRoman should fail with repeated pairs of numerals ... FAIL
      @@ -12424,19 +12264,19 @@ toRoman should fail with 0 input ... ok
      1 -toRoman still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything. +toRoman still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything. 2 -More exciting is the fact that all of the bad input tests now pass. This test, testNonInteger, passes because of the int(n) <> n check. When a non-integer is passed to toRoman, the int(n) <> n check notices it and raises the NotIntegerError exception, which is what testNonInteger is looking for. +More exciting is the fact that all of the bad input tests now pass. This test, testNonInteger, passes because of the int(n) <> n check. When a non-integer is passed to toRoman, the int(n) <> n check notices it and raises the NotIntegerError exception, which is what testNonInteger is looking for. 3 -This test, testNegative, passes because of the not (0 < n < 4000) check, which raises an OutOfRangeError exception, which is what testNegative is looking for. +This test, testNegative, passes because of the not (0 < n < 4000) check, which raises an OutOfRangeError exception, which is what testNegative is looking for. @@ -12503,7 +12343,7 @@ FAILED (failures=6) 1 -You're down to 6 failures, and all of them involve fromRoman: the known values test, the three separate bad input tests, the case check, and the sanity check. That means that toRoman has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that fromRoman be written, which it isn't yet.) Which means that you must stop coding toRoman now. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard. +You're down to 6 failures, and all of them involve fromRoman: the known values test, the three separate bad input tests, the case check, and the sanity check. 
That means that toRoman has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that fromRoman be written, which it isn't yet.) Which means that you must stop coding toRoman now. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard. @@ -12517,11 +12357,11 @@ FAILED (failures=6) 14.4. roman.py, stage 4 -

      Now that toRoman is done, it's time to start coding fromRoman. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than - the toRoman function. -

      Example 14.9. roman4.py

      -

      This file is available in py/roman/stage4/ in the examples directory. +

      14.4. roman.py, stage 4

      +

      Now that toRoman is done, it's time to start coding fromRoman. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than + the toRoman function. +

      Example 14.9. roman4.py

      +

      This file is available in py/roman/stage4/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       """Convert to and from Roman numerals"""
       
      @@ -12562,13 +12402,13 @@ def fromRoman(s):
       
       1 
       
      -The pattern here is the same as toRoman.  You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer
      +The pattern here is the same as toRoman.  You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer
                   values as often as possible, you match the “highest” Roman numeral character strings as often as possible.
       
       
       
      -

      Example 14.10. How fromRoman works

      -

      If you're not clear how fromRoman works, add a print statement to the end of the while loop:

      +

      Example 14.10. How fromRoman works

      +

      If you're not clear how fromRoman works, add a print statement to the end of the while loop:

               while s[index:index+len(numeral)] == numeral:
                   result += integer
                   index += len(numeral)
      @@ -12582,7 +12422,7 @@ found X , of length 1, adding 10
       found X , of length 1, adding 10
       found I , of length 1, adding 1
       found I , of length 1, adding 1
      -1972

      Example 14.11. Output of romantest4.py against roman4.py

      fromRoman should only accept uppercase input ... FAIL
      +1972
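Here is a sketch of the whole function around that while loop, in Python 3 syntax. As with toRoman, the contents of romanNumeralMap are reconstructed from the earlier description, not copied from roman4.py.

```python
# Reconstructed mapping: (numeral, value) pairs in descending order.
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def fromRoman(s):
    """Convert a Roman numeral to an integer (no validation yet)."""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        # Consume this numeral from the front of the string as long as
        # it keeps matching, then move on to the next smaller one.
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result


print(fromRoman('MCMLXXII'))  # 1972
```

This version happily converts malformed input as well; fromRoman('IIII') returns 4, which is exactly why the bad input tests still fail at this stage.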

      Example 14.11. Output of romantest4.py against roman4.py

      fromRoman should only accept uppercase input ... FAIL
       toRoman should always return uppercase ... ok
       fromRoman should fail with malformed antecedents ... FAIL
       fromRoman should fail with repeated pairs of numerals ... FAIL
      @@ -12598,13 +12438,13 @@ toRoman should fail with 0 input ... ok
      1 -Two pieces of exciting news here. The first is that fromRoman works for good input, at least for all the known values you test. +Two pieces of exciting news here. The first is that fromRoman works for good input, at least for all the known values you test. 2 -The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both toRoman and fromRoman work properly for all possible good values. (This is not guaranteed; it is theoretically possible that toRoman has a bug that produces the wrong Roman numeral for some particular set of inputs, and that fromRoman has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that toRoman generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write +The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both toRoman and fromRoman work properly for all possible good values. (This is not guaranteed; it is theoretically possible that toRoman has a bug that produces the wrong Roman numeral for some particular set of inputs, and that fromRoman has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that toRoman generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write more comprehensive test cases until it doesn't bother you.) @@ -12649,15 +12489,15 @@ AssertionError: InvalidRomanNumeralError ---------------------------------------------------------------------- Ran 12 tests in 1.222s -FAILED (failures=4)

      14.5. roman.py, stage 5

      -

      Now that fromRoman works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input. +FAILED (failures=4)

      14.5. roman.py, stage 5

      +

      Now that fromRoman works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input. That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult - than validating numeric input in toRoman, but you have a powerful tool at your disposal: regular expressions. + than validating numeric input in toRoman, but you have a powerful tool at your disposal: regular expressions.

      If you're not familiar with regular expressions and didn't read Chapter 7, Regular Expressions, now would be a good time.

      As you saw in Section 7.3, “Case Study: Roman Numerals”, there are several simple rules for constructing a Roman numeral, using the letters M, D, C, L, X, V, and I. Let's review the rules:

        -
      1. Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8. +
      1. Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.
      2. The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). 40 is written as XL (“10 less than 50”), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (“10 less than 50, then 1 less than 5”). @@ -12668,8 +12508,8 @@ FAILED (failures=4)

      14.5.

      Example 14.12. roman5.py

      -

      This file is available in py/roman/stage5/ in the examples directory. +

      Example 14.12. roman5.py

      +

      This file is available in py/roman/stage5/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       """Convert to and from Roman numerals"""
       import re
      @@ -12736,13 +12576,13 @@ def fromRoman(s):
       2 
       
       Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial.  If
      -re.search returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid.
      +re.search returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid.
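A sketch of that check, in Python 3 syntax. The pattern below is an assumption: it is one way to encode the rules reviewed above (thousands, hundreds, tens, and ones, in that order), and the actual pattern in roman5.py is not shown in this hunk. isValidRoman is a hypothetical helper name.

```python
import re

# Assumed pattern: up to three Ms, then one group each for the hundreds,
# tens, and ones places.
romanNumeralPattern = re.compile(
    r'^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')


def isValidRoman(s):
    """Return True if s matches the pattern, False otherwise."""
    # search() returns a match object on success and None on failure.
    return romanNumeralPattern.search(s) is not None


for candidate in ('MCMLXXII', 'MCMC', 'MMMM', 'mcmlxxii'):
    print(candidate, isValidRoman(candidate))
```

In fromRoman itself, the same test would be followed by raising InvalidRomanNumeralError when no match object comes back.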
       
       
       
       

      At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of invalid Roman numerals. But don't take my word for it, look at the results: -

      Example 14.13. Output of romantest5.py against roman5.py

      
      +

      Example 14.13. Output of romantest5.py against roman5.py

      
       fromRoman should only accept uppercase input ... ok          1
       toRoman should always return uppercase ... ok
       fromRoman should fail with malformed antecedents ... ok      2
      @@ -12765,13 +12605,13 @@ OK     41 
       
       One thing I didn't mention about regular expressions is that, by default, they are case-sensitive.  Since the regular expression
      -romanNumeralPattern was expressed in uppercase characters, the re.search check will reject any input that isn't completely uppercase.  So the uppercase input test passes.
      +romanNumeralPattern was expressed in uppercase characters, the re.search check will reject any input that isn't completely uppercase.  So the uppercase input test passes.
       
       
       
       2 
       
      -More importantly, the bad input tests pass.  For instance, the malformed antecedents test checks cases like MCMC.  As you've seen, this does not match the regular expression, so fromRoman raises an InvalidRomanNumeralError exception, which is what the malformed antecedents test case is looking for, so the test passes.
      +More importantly, the bad input tests pass.  For instance, the malformed antecedents test checks cases like MCMC.  As you've seen, this does not match the regular expression, so fromRoman raises an InvalidRomanNumeralError exception, which is what the malformed antecedents test case is looking for, so the test passes.
       
       
       
      @@ -12784,7 +12624,7 @@ OK     4
       4 
       
      -And the anticlimax award of the year goes to the word “OK”, which is printed by the unittest module when all the tests pass.
      +And the anticlimax award of the year goes to the word “OK”, which is printed by the unittest module when all the tests pass.
       
       
       
      @@ -12809,12 +12649,12 @@ OK     4
       Remember in the previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals?
                    Well, it turns out that this is still true for the final version of the regular expression.  And that's a bug; you want an
      -            empty string to raise an InvalidRomanNumeralError exception just like any other sequence of characters that don't represent a valid Roman numeral.
      +            empty string to raise an InvalidRomanNumeralError exception just like any other sequence of characters that don't represent a valid Roman numeral.
       
       
       
       

      After reproducing the bug, and before fixing it, you should write a test case that fails, thus illustrating the bug. -

      Example 15.2. Testing for the bug (romantest61.py)

      +

      Example 15.2. Testing for the bug (romantest61.py)

       class FromRomanBadInput(unittest.TestCase):  
       
           # previous test cases omitted for clarity (they haven't changed)
      @@ -12827,12 +12667,12 @@ class FromRomanBadInput(unittest.TestCase):
       
       1 
       
      -Pretty simple stuff here.  Call fromRoman with an empty string and make sure it raises an InvalidRomanNumeralError exception.  The hard part was finding the bug; now that you know about it, testing for it is the easy part.
      +Pretty simple stuff here.  Call fromRoman with an empty string and make sure it raises an InvalidRomanNumeralError exception.  The hard part was finding the bug; now that you know about it, testing for it is the easy part.
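To watch such a test catch the bug in isolation, consider this sketch in Python 3 syntax. Everything except the exception name and the test's docstring is an assumption: the pattern is one plausible encoding of the Roman numeral rules, validateRoman stands in for the validating half of fromRoman, and testBlank and run_blank_test are names chosen for this illustration.

```python
import re
import unittest


class InvalidRomanNumeralError(Exception):
    pass


# Assumed pattern; every part is optional, so '' matches it.
romanNumeralPattern = re.compile(
    r'^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')


def validateRoman(s):
    """Regex check only, with no special handling of the empty string."""
    if not romanNumeralPattern.search(s):
        raise InvalidRomanNumeralError("Invalid Roman numeral: %s" % s)


def run_blank_test():
    class FromRomanBadInput(unittest.TestCase):
        def testBlank(self):
            """fromRoman should fail with blank string"""
            self.assertRaises(InvalidRomanNumeralError, validateRoman, "")

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FromRomanBadInput)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_blank_test()
print(len(result.failures), len(result.errors))
```

The empty string slips past the pattern, no exception is raised, and assertRaises records a failure rather than an error.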
       
       
       
       

      Since your code has a bug, and you now have a test case that tests this bug, the test case will fail: -

      Example 15.3. Output of romantest61.py against roman61.py

      fromRoman should only accept uppercase input ... ok
      +

      Example 15.3. Output of romantest61.py against roman61.py

      fromRoman should only accept uppercase input ... ok
       toRoman should always return uppercase ... ok
       fromRoman should fail with blank string ... FAIL
       fromRoman should fail with malformed antecedents ... ok
      @@ -12859,8 +12699,8 @@ AssertionError: InvalidRomanNumeralError
       Ran 13 tests in 2.864s
       
       FAILED (failures=1)

      Now you can fix the bug. -

      Example 15.4. Fixing the bug (roman62.py)

      -

      This file is available in py/roman/stage6/ in the examples directory.

      +

      Example 15.4. Fixing the bug (roman62.py)

      +

      This file is available in py/roman/stage6/ in the examples directory.

       def fromRoman(s):
           """convert Roman numeral to integer"""
           if not s: 1
      @@ -12884,7 +12724,7 @@ def fromRoman(s):
       
       
       
      -

      Example 15.5. Output of romantest62.py against roman62.py

      fromRoman should only accept uppercase input ... ok
      +

      Example 15.5. Output of romantest62.py against roman62.py

      fromRoman should only accept uppercase input ... ok
       toRoman should always return uppercase ... ok
       fromRoman should fail with blank string ... ok 1
       fromRoman should fail with malformed antecedents ... ok
      @@ -12928,8 +12768,8 @@ is tomorrow's regression test.
          if they do, they'll want more in the next release anyway.  So be prepared to update your test cases as requirements change.
       

      Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception to that rule by having 4 M characters in a row to represent 4000. If you make this change, you'll be able to expand the range of convertible numbers from 1..3999 to 1..4999. But first, you need to make some changes to the test cases. -

      Example 15.6. Modifying test cases for new requirements (romantest71.py)

      -

      This file is available in py/roman/stage7/ in the examples directory. +

      Example 15.6. Modifying test cases for new requirements (romantest71.py)

      +

      This file is available in py/roman/stage7/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       import roman71
       import unittest
      @@ -13083,25 +12923,25 @@ if __name__ == "__main__":
       
       2 
       
      -The definition of “large input” has changed.  This test used to call toRoman with 4000 and expect an error; now that 4000-4999 are good values, you need to bump this up to 5000.
      +The definition of “large input” has changed.  This test used to call toRoman with 4000 and expect an error; now that 4000-4999 are good values, you need to bump this up to 5000.
       
       
       
       3 
       
      -The definition of “too many repeated numerals” has also changed.  This test used to call fromRoman with 'MMMM' and expect an error; now that MMMM is considered a valid Roman numeral, you need to bump this up to 'MMMMM'.
      +The definition of “too many repeated numerals” has also changed.  This test used to call fromRoman with 'MMMM' and expect an error; now that MMMM is considered a valid Roman numeral, you need to bump this up to 'MMMMM'.
       
       
       
       4 
       
      -The sanity check and case checks loop through every number in the range, from 1 to 3999.  Since the range has now expanded, these for loops need to be updated as well to go up to 4999.
      +The sanity check and case checks loop through every number in the range, from 1 to 3999.  Since the range has now expanded, these for loops need to be updated as well to go up to 4999.
       
       
       
       

      Now your test cases are up to date with the new requirements, but your code is not, so you expect several of the test cases to fail. -

      Example 15.7. Output of romantest71.py against roman71.py

      
      +

      Example 15.7. Output of romantest71.py against roman71.py

      
       fromRoman should only accept uppercase input ... ERROR        1
       toRoman should always return uppercase ... ERROR
       fromRoman should fail with blank string ... ok
      @@ -13120,25 +12960,25 @@ toRoman should fail with 0 input ... ok
       
       1 
       
      -Our case checks now fail because they loop from 1 to 4999, but toRoman only accepts numbers from 1 to 3999, so it will fail as soon the test case hits 4000.
      +Our case checks now fail because they loop from 1 to 4999, but toRoman only accepts numbers from 1 to 3999, so it will fail as soon as the test case hits 4000.
       
       
       
       2 
       
      -The fromRoman known values test will fail as soon as it hits 'MMMM', because fromRoman still thinks this is an invalid Roman numeral.
      +The fromRoman known values test will fail as soon as it hits 'MMMM', because fromRoman still thinks this is an invalid Roman numeral.
       
       
       
       3 
       
      -The toRoman known values test will fail as soon as it hits 4000, because toRoman still thinks this is out of range.
      +The toRoman known values test will fail as soon as it hits 4000, because toRoman still thinks this is out of range.
       
       
       
       4 
       
      -The sanity check will also fail as soon as it hits 4000, because toRoman still thinks this is out of range.
      +The sanity check will also fail as soon as it hits 4000, because toRoman still thinks this is out of range.
       
       
       
      @@ -13195,8 +13035,8 @@ FAILED (errors=5)

      Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being tested is never “ahead” of the test cases. While it's behind, you still have some work to do, and as soon as it catches up to the test cases, you stop coding.) -

      Example 15.8. Coding the new requirements (roman72.py)

      -

      This file is available in py/roman/stage7/ in the examples directory.

      +

      Example 15.8. Coding the new requirements (roman72.py)

      +

      This file is available in py/roman/stage7/ in the examples directory.

       """Convert to and from Roman numerals"""
       import re
       
      @@ -13257,19 +13097,19 @@ def fromRoman(s):
       
       1 
       
      -toRoman only needs one small change, in the range check.  Where you used to check 0 < n < 4000, you now check 0 < n < 5000.  And you change the error message that you raise to reflect the new acceptable range (1..4999 instead of 1..3999).  You don't need to make any changes to the rest of the function; it handles the new cases already.  (It merrily adds 'M' for each thousand that it finds; given 4000, it will spit out 'MMMM'.  The only reason it didn't do this before is that you explicitly stopped it with the range check.)
      +toRoman only needs one small change, in the range check.  Where you used to check 0 < n < 4000, you now check 0 < n < 5000.  And you change the error message that you raise to reflect the new acceptable range (1..4999 instead of 1..3999).  You don't need to make any changes to the rest of the function; it handles the new cases already.  (It merrily adds 'M' for each thousand that it finds; given 4000, it will spit out 'MMMM'.  The only reason it didn't do this before is that you explicitly stopped it with the range check.)
       
       
       
       2 
       
      -You don't need to make any changes to fromRoman at all.  The only change is to romanNumeralPattern; if you look closely, you'll notice that you added another optional M in the first section of the regular expression.  This will allow up to 4 M characters instead of 3, meaning you will allow the Roman numeral equivalents of 4999 instead of 3999.  The actual fromRoman function is completely general; it just looks for repeated Roman numeral characters and adds them up, without caring how
      +You don't need to make any changes to fromRoman at all.  The only change is to romanNumeralPattern; if you look closely, you'll notice that you added another optional M in the first section of the regular expression.  This will allow up to 4 M characters instead of 3, meaning you will allow the Roman numeral equivalents of 4999 instead of 3999.  The actual fromRoman function is completely general; it just looks for repeated Roman numeral characters and adds them up, without caring how
                many times they repeat.  The only reason it didn't handle 'MMMM' before is that you explicitly stopped it with the regular expression pattern matching.
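If you'd like to convince yourself before running the full test suite, here is a minimal sketch of the two changes. It is a condensed stand-in, not the roman72.py listing: the names to_roman, from_roman, ROMAN_NUMERAL_MAP and ROMAN_NUMERAL_PATTERN are hypothetical, and it raises a plain ValueError where the real module uses its own exception classes.

```python
import re

# Assumption: a condensed stand-in for roman72.py, not the book's listing.
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

# Change 2: four optional M characters instead of three.
ROMAN_NUMERAL_PATTERN = \
    '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

def to_roman(n):
    """Convert an integer in 1..4999 to a Roman numeral."""
    # Change 1: the upper bound of the range check moves from 4000 to 5000.
    if not (0 < n < 5000):
        raise ValueError('number out of range (must be 1..4999)')
    result = ''
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    """Convert a Roman numeral to an integer."""
    if not s or not re.search(ROMAN_NUMERAL_PATTERN, s):
        raise ValueError('invalid Roman numeral: %r' % s)
    result = 0
    index = 0
    # The loop itself is untouched: it never cared how many M's there were.
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
```

Nothing in either loop had to change; only the two gatekeepers (the range check and the pattern) were loosened.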
       
       
       
       

      You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself: -

      Example 15.9. Output of romantest72.py against roman72.py

      fromRoman should only accept uppercase input ... ok
      +

      Example 15.9. Output of romantest72.py against roman72.py

      fromRoman should only accept uppercase input ... ok
       toRoman should always return uppercase ... ok
       fromRoman should fail with blank string ... ok
       fromRoman should fail with malformed antecedents ... ok
      @@ -13300,7 +13140,7 @@ OK 1prove that you didn't.  The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
       

      Refactoring is the process of taking working code and making it work better. Usually, “better” means “faster”, although it can also mean “using less memory”, or “using less disk space”, or simply “more elegantly”. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any program. -

      Here, “better” means “faster”. Specifically, the fromRoman function is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals. +

      Here, “better” means “faster”. Specifically, the fromRoman function is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals. It's probably not worth trying to do away with the regular expression altogether (it would be difficult, and it might not end up any faster), but you can speed up the function by precompiling the regular expression.

      Example 15.10. Compiling regular expressions

      @@ -13319,27 +13159,27 @@ end up any faster), but you can speed up the function by precompiling the regula
       
       1 
       
      -This is the syntax you've seen before: re.search takes a regular expression as a string (pattern) and a string to match against it ('M').  If the pattern matches, the function returns a match object which can be queried to find out exactly what matched and
      +This is the syntax you've seen before: re.search takes a regular expression as a string (pattern) and a string to match against it ('M').  If the pattern matches, the function returns a match object which can be queried to find out exactly what matched and
                   how.
       
       
       
       2 
       
      -This is the new syntax: re.compile takes a regular expression as a string and returns a pattern object.  Note there is no string to match here.  Compiling a
      +This is the new syntax: re.compile takes a regular expression as a string and returns a pattern object.  Note there is no string to match here.  Compiling a
                   regular expression has nothing to do with matching it against any specific strings (like 'M'); it only involves the regular expression itself.
       
       
       
       3 
       
      -The compiled pattern object returned from re.compile has several useful-looking functions, including several (like search and sub) that are available directly in the re module.
      +The compiled pattern object returned from re.compile has several useful-looking functions, including several (like search and sub) that are available directly in the re module.
       
       
       
       4 
       
      -Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'.  Only much, much faster.  (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)
      +Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'.  Only much, much faster.  (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)
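Here is a small sketch you can run to compare the two calling styles yourself. The pattern and the helper name time_both are assumptions made up for illustration, and the sketch deliberately prints no expected numbers: timings depend on your machine, and the re module keeps a cache of recently compiled patterns, so the gap you measure may be smaller than the figures quoted in this chapter.

```python
import re
import timeit

# Assumption: a tiny illustrative pattern, not the full Roman numeral pattern.
pattern = '^M?M?M?$'

# Compile once; no string to match is involved at this point.
compiled = re.compile(pattern)

def time_both(number=20000):
    """Time re.search against the precompiled pattern's search method."""
    module_level = timeit.timeit(lambda: re.search(pattern, 'M'),
                                 number=number)
    precompiled = timeit.timeit(lambda: compiled.search('M'),
                                number=number)
    return module_level, precompiled

if __name__ == '__main__':
    # Both styles find the same match...
    print(re.search(pattern, 'M').group() == compiled.search('M').group())
    # ...but they may not take the same time.
    print('re.search: %.4fs  compiled.search: %.4fs' % time_both())
```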
       
       
       
      @@ -13353,8 +13193,8 @@ end up any faster), but you can speed up the function by precompiling the regula
       
       
       
      -

      Example 15.11. Compiled regular expressions in roman81.py

      -

      This file is available in py/roman/stage8/ in the examples directory. +

      Example 15.11. Compiled regular expressions in roman81.py

      +

      This file is available in py/roman/stage8/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       # toRoman and rest of module omitted for clarity
       
      @@ -13380,19 +13220,19 @@ def fromRoman(s):
       
       1 
       
      -This looks very similar, but in fact a lot has changed.  romanNumeralPattern is no longer a string; it is a pattern object which was returned from re.compile.
      +This looks very similar, but in fact a lot has changed.  romanNumeralPattern is no longer a string; it is a pattern object which was returned from re.compile.
       
       
       
       2 
       
      -That means that you can call methods on romanNumeralPattern directly.  This will be much, much faster than calling re.search every time.  The regular expression is compiled once and stored in romanNumeralPattern when the module is first imported; then, every time you call fromRoman, you can immediately match the input string against the regular expression, without any intermediate steps occurring under
      +That means that you can call methods on romanNumeralPattern directly.  This will be much, much faster than calling re.search every time.  The regular expression is compiled once and stored in romanNumeralPattern when the module is first imported; then, every time you call fromRoman, you can immediately match the input string against the regular expression, without any intermediate steps occurring under
                   the covers.
       
       
       
       

      So how much faster is it to compile regular expressions? See for yourself: -

      Example 15.12. Output of romantest81.py against roman81.py

      .............          1
      +

      Example 15.12. Output of romantest81.py against roman81.py

      .............          1
       ----------------------------------------------------------------------
       Ran 13 tests in 3.385s 2
       
      @@ -13401,7 +13241,7 @@ OK   3
       1 
       
      -Just a note in passing here: this time, I ran the unit test without the -v option, so instead of the full docstring for each test, you only get a dot for each test that passes.  (If a test failed, you'd get an F, and if it had an error, you'd get an E.  You'd still get complete tracebacks for each failure and error, so you could track down any problems.)
      +Just a note in passing here: this time, I ran the unit test without the -v option, so instead of the full docstring for each test, you only get a dot for each test that passes.  (If a test failed, you'd get an F, and if it had an error, you'd get an E.  You'd still get complete tracebacks for each failure and error, so you could track down any problems.)
       
       
       
      @@ -13409,7 +13249,7 @@ OK   3
       You ran 13 tests in 3.385 seconds, compared to 3.685 seconds without precompiling the regular expressions.  That's an 8% improvement overall, and remember that most of the time spent during the unit test is spent doing other things.  (Separately,
                   I time-tested the regular expressions by themselves, apart from the rest of the unit tests, and found that compiling this
      -            regular expression speeds up the search by an average of 54%.)  Not bad for such a simple fix.
      +            regular expression speeds up the search by an average of 54%.)  Not bad for such a simple fix.
       
       
       
      @@ -13420,9 +13260,9 @@ OK   3
       

      There is one other performance optimization that I want to try. Given the complexity of regular expression syntax, it should come as no surprise that there is frequently more than one way to write the same expression. After some discussion about -this module on comp.lang.python, someone suggested that I try using the {m,n} syntax for the optional repeated characters. -

      Example 15.13. roman82.py

      -

      This file is available in py/roman/stage8/ in the examples directory. +this module on comp.lang.python, someone suggested that I try using the {m,n} syntax for the optional repeated characters. +

      Example 15.13. roman82.py

      +

      This file is available in py/roman/stage8/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       # rest of program omitted for clarity
       
      @@ -13443,7 +13283,7 @@ romanNumeralPattern = \
       
       
       

      This form of the regular expression is a little shorter (though not any more readable). The big question is, is it any faster? -

      Example 15.14. Output of romantest82.py against roman82.py

      .............
      +

      Example 15.14. Output of romantest82.py against roman82.py

      .............
       ----------------------------------------------------------------------
       Ran 13 tests in 3.315s 1
       
      @@ -13453,8 +13293,8 @@ OK   21 
       
       Overall, the unit tests run 2% faster with this form of regular expression.  That doesn't sound exciting, but remember that
      -            the search function is a small part of the overall unit test; most of the time is spent doing other things.  (Separately, I time-tested
      -            just the regular expressions, and found that the search function is 11% faster with this syntax.)  By precompiling the regular expression and rewriting part of it to use this new syntax, you've
      +            the search function is a small part of the overall unit test; most of the time is spent doing other things.  (Separately, I time-tested
      +            just the regular expressions, and found that the search function is 11% faster with this syntax.)  By precompiling the regular expression and rewriting part of it to use this new syntax, you've
                   improved the regular expression performance by over 60%, and improved the overall performance of the entire unit test by over 10%.
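Speed aside, you'll want some reassurance that the two spellings accept exactly the same strings. The sketch below checks that on a handful of inputs. Both patterns are written out here as assumptions based on the description in this section (optional repeated characters rewritten with {m,n}); the names long_form, short_form and agree are hypothetical.

```python
import re

# Assumption: the optional-character form described for roman81.py.
long_form = re.compile(
    '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')

# Assumption: the same expression with {m,n} for the repeated characters.
short_form = re.compile(
    '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def agree(s):
    """Return True if both patterns give the same verdict on s."""
    return bool(long_form.search(s)) == bool(short_form.search(s))

samples = ['I', 'IV', 'MCMXC', 'MMMM', 'MMMMCMXCIX',   # valid numerals
           'MMMMM', 'IIII', 'VX', 'abc']               # invalid ones
print(all(agree(s) for s in samples))
```

A spot check like this is no substitute for the unit tests, which exercise every number in the range; it just makes the equivalence easier to see.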
       
       
      @@ -13464,7 +13304,7 @@ OK   2More important than any performance boost is the fact that the module still works perfectly.  This is the freedom I was talking
                   about earlier: the freedom to tweak, change, or rewrite any piece of it and verify that you haven't messed anything up in
                   the process.  This is not a license to endlessly tweak your code just for the sake of tweaking it; you had a very specific
      -            objective (“make fromRoman faster”), and you were able to accomplish that objective without any lingering doubts about whether you introduced new bugs in the
      +            objective (“make fromRoman faster”), and you were able to accomplish that objective without any lingering doubts about whether you introduced new bugs in the
                   process.
       
       
      @@ -13473,8 +13313,8 @@ OK   2how it works, it's still going to be difficult to add new features, fix new bugs, or otherwise maintain it.  As you saw in Section 7.5, “Verbose Regular Expressions”, Python provides a way to document your logic line-by-line.
      -

      Example 15.15. roman83.py

      -

      This file is available in py/roman/stage8/ in the examples directory. +

      Example 15.15. roman83.py

      +

      This file is available in py/roman/stage8/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       # rest of program omitted for clarity
       
      @@ -13499,13 +13339,13 @@ romanNumeralPattern = re.compile('''
       
       1 
       
      -The re.compile function can take an optional second argument, which is a set of one or more flags that control various options about the
      +The re.compile function can take an optional second argument, which is a set of one or more flags that control various options about the
                   compiled regular expression.  Here you're specifying the re.VERBOSE flag, which tells Python that there are in-line comments within the regular expression itself.  The comments and all the whitespace around them are
      -not considered part of the regular expression; the re.compile function simply strips them all out when it compiles the expression.  This new, “verbose” version is identical to the old version, but it is infinitely more readable.
      +not considered part of the regular expression; the re.compile function simply strips them all out when it compiles the expression.  This new, “verbose” version is identical to the old version, but it is infinitely more readable.
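To see the flag at work outside the module, here is a minimal sketch that compiles a compact pattern and a commented version of it, then checks that they match the same strings. The pattern text and its comments are assumptions written for this illustration; they are not copied from roman83.py.

```python
import re

# Assumption: a compact pattern in the {m,n} style described earlier.
compact = re.compile(
    '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

# The same expression with whitespace and in-line comments. re.VERBOSE
# tells the compiler to ignore both.
verbose = re.compile('''
    ^                   # beginning of string
    M{0,4}              # thousands: 0 to 4 M characters
    (CM|CD|D?C{0,3})    # hundreds: 900, 400, or 0-300 with optional 500
    (XC|XL|L?X{0,3})    # tens: 90, 40, or 0-30 with optional 50
    (IX|IV|V?I{0,3})    # ones: 9, 4, or 0-3 with optional 5
    $                   # end of string
    ''', re.VERBOSE)

for s in ['MCMXC', 'MMMM', 'IIII', 'MMMMM']:
    # Each line prints the string and whether both patterns agree on it.
    print(s, bool(compact.search(s)) == bool(verbose.search(s)))
```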
       
       
       
      -

      Example 15.16. Output of romantest83.py against roman83.py

      .............
      +

      Example 15.16. Output of romantest83.py against roman83.py

      .............
       ----------------------------------------------------------------------
       Ran 13 tests in 3.315s 1
       
      @@ -13515,7 +13355,7 @@ OK   21 
       
       This new, “verbose” version runs at exactly the same speed as the old version.  In fact, the compiled pattern objects are the same, since the
      -re.compile function strips out all the stuff you added.
      +re.compile function strips out all the stuff you added.
       
       
       
      @@ -13534,8 +13374,8 @@ OK   2And best of all, he already had a complete set of unit tests.  He changed over half the code in the module, but the unit tests
       stayed the same, so he could prove that his code worked just as well as the original.
      -

      Example 15.17. roman9.py

      -

      This file is available in py/roman/stage9/ in the examples directory. +

      Example 15.17. roman9.py

      +

      This file is available in py/roman/stage9/ in the examples directory.

      If you have not already done so, you can download this and other examples used in this book.

       #Define exceptions
       class RomanError(Exception): pass
      @@ -13604,7 +13444,7 @@ def fillLookupTables():
       
       fillLookupTables()
       

      So how fast is it? -

      Example 15.18. Output of romantest9.py against roman9.py

      +

      Example 15.18. Output of romantest9.py against roman9.py

       
       .............
       ----------------------------------------------------------------------
      @@ -13650,9 +13490,9 @@ only done once, this is negligible in the long run.
       
      • Subclassing unittest.TestCase and writing methods for individual test cases -
      • Using assertEqual to check that a function returns a known value +
      • Using assertEqual to check that a function returns a known value -
      • Using assertRaises to check that a function raises a known exception +
      • Using assertRaises to check that a function raises a known exception
      • Calling unittest.main() in your if __name__ clause to run all your test cases at once @@ -13672,10 +13512,10 @@ only done once, this is negligible in the long run. we will focus more on advanced Python-specific techniques, rather than on unit testing itself.

        The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the -build process for this book; I have unit tests for several of the example programs (not just the roman.py module featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this +build process for this book; I have unit tests for several of the example programs (not just the roman.py module featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to download them and sit around scratching your head and yelling at your monitor and wondering why they don't work. -

        Example 16.1. regression.py

        +

        Example 16.1. regression.py

        If you have not already done so, you can download this and other examples used in this book.

         """Regression testing framework
         
        @@ -13702,8 +13542,8 @@ def regressionTest():
         if __name__ == "__main__": 
             unittest.main(defaultTest="regressionTest")
         

        Running this script in the same directory as the rest of the example scripts that come with this book will find all the unit -tests, named moduletest.py, run them as a single test, and pass or fail them all at once. -

        Example 16.2. Sample output of regression.py

        +tests, named moduletest.py, run them as a single test, and pass or fail them all at once.
        +

        Example 16.2. Sample output of regression.py

         [you@localhost py]$ python regression.py -v
         help should fail with no object ... ok           1
         help should return known result for apihelper ... ok
        @@ -13743,19 +13583,19 @@ OK
        1 -The first 5 tests are from apihelpertest.py, which tests the example script from Chapter 4, The Power Of Introspection. +The first 5 tests are from apihelpertest.py, which tests the example script from Chapter 4, The Power Of Introspection. 2 -The next 5 tests are from odbchelpertest.py, which tests the example script from Chapter 2, Your First Python Program. +The next 5 tests are from odbchelpertest.py, which tests the example script from Chapter 2, Your First Python Program. 3 -The rest are from romantest.py, which you studied in depth in Chapter 13, Unit Testing. +The rest are from romantest.py, which you studied in depth in Chapter 13, Unit Testing. @@ -13764,7 +13604,7 @@ OK

        This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember once you see it. The key to it is sys.argv. As you saw in Chapter 9, XML Processing, this is a list that holds the list of command-line arguments. However, it also holds the name of the running script, exactly as it was called from the command line, and this is enough information to determine its location. -

        Example 16.3. fullpath.py

        +

        Example 16.3. fullpath.py

        If you have not already done so, you can download this and other examples used in this book.

         import sys, os
         
        @@ -13783,19 +13623,19 @@ print 'full path =', os.path.abspath(pathname) 
         2 
         
        -os.path.dirname takes a filename as a string and returns the directory path portion.  If the given filename does not include any path information,
        -os.path.dirname returns an empty string.
        +os.path.dirname takes a filename as a string and returns the directory path portion.  If the given filename does not include any path information,
        +os.path.dirname returns an empty string.
         
         
         
         3 
         
        -os.path.abspath is the key here.  It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.
        +os.path.abspath is the key here.  It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.
         
         
         
        -

        os.path.abspath deserves further explanation. It is very flexible; it can take any kind of pathname. -

        Example 16.4. Further explanation of os.path.abspath

        +

        os.path.abspath deserves further explanation. It is very flexible; it can take any kind of pathname. +

        Example 16.4. Further explanation of os.path.abspath

         >>> import os
         >>> os.getcwd()      1
         /home/you
        @@ -13811,31 +13651,31 @@ print 'full path =', os.path.abspath(pathname) 
         1 
         
        -os.getcwd() returns the current working directory.
        +os.getcwd() returns the current working directory.
         
         
         
         2 
         
        -Calling os.path.abspath with an empty string returns the current working directory, same as os.getcwd().
        +Calling os.path.abspath with an empty string returns the current working directory, same as os.getcwd().
         
         
         
         3 
         
        -Calling os.path.abspath with a partial pathname constructs a fully qualified pathname out of it, based on the current working directory.
        +Calling os.path.abspath with a partial pathname constructs a fully qualified pathname out of it, based on the current working directory.
         
         
         
         4 
         
        -Calling os.path.abspath with a full pathname simply returns it.
        +Calling os.path.abspath with a full pathname simply returns it.
         
         
         
         5 
         
        -os.path.abspath also normalizes the pathname it returns.  Note that this example worked even though I don't actually have a 'foo' directory.  os.path.abspath never checks your actual disk; this is all just string manipulation.
        +os.path.abspath also normalizes the pathname it returns.  Note that this example worked even though I don't actually have a 'foo' directory.  os.path.abspath never checks your actual disk; this is all just string manipulation.
         
         
         
        @@ -13844,7 +13684,7 @@ print 'full path =', os.path.abspath(pathname) Note
         
         
        -The pathnames and filenames you pass to os.path.abspath do not need to exist.
        +The pathnames and filenames you pass to os.path.abspath do not need to exist.
         
         
         
        @@ -13852,12 +13692,12 @@ print 'full path =', os.path.abspath(pathname) Note
        -
        os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without - turning it into a full pathname, use os.path.normpath instead. +os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without + turning it into a full pathname, use os.path.normpath instead.
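The following sketch replays those points as assertions, so you can run it from any directory and on any platform. The relative names used here ('bin', 'local', 'foo', 'bar.txt') are placeholders; none of them needs to exist, since this is all string manipulation.

```python
import os

cwd = os.path.normpath(os.getcwd())

# An empty string resolves to the current working directory.
assert os.path.abspath('') == cwd

# A partial pathname is resolved against the current working directory.
assert os.path.abspath('bar.txt') == os.path.join(cwd, 'bar.txt')

# A full pathname comes back unchanged.
assert os.path.abspath(cwd) == cwd

# abspath also normalizes: 'foo/..' cancels out, even though no 'foo'
# directory exists on disk.
messy = os.path.join('foo', os.pardir, 'bar.txt')
assert os.path.abspath(messy) == os.path.join(cwd, 'bar.txt')

# normpath normalizes without turning the result into a full pathname.
relative = os.path.join('bin', os.pardir, 'local', 'bin')
assert os.path.normpath(relative) == os.path.join('local', 'bin')

print('all path checks passed')
```

Building the paths with os.path.join and os.pardir, rather than hard-coding slashes, keeps the sketch itself cross-platform.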
        -

        Example 16.5. Sample output from fullpath.py

        +

        Example 16.5. Sample output from fullpath.py

         [you@localhost py]$ python /home/you/diveintopython3/common/py/fullpath.py 1
         sys.argv[0] = /home/you/diveintopython3/common/py/fullpath.py
         path = /home/you/diveintopython3/common/py
        @@ -13875,19 +13715,19 @@ full path = /home/you/diveintopython3/common/py
        1 -In the first case, sys.argv[0] includes the full path of the script. You can then use the os.path.dirname function to strip off the script name and return the full directory name, and os.path.abspath simply returns what you give it. +In the first case, sys.argv[0] includes the full path of the script. You can then use the os.path.dirname function to strip off the script name and return the full directory name, and os.path.abspath simply returns what you give it. 2 -If the script is run by using a partial pathname, sys.argv[0] will still contain exactly what appears on the command line. os.path.dirname will then give you a partial pathname (relative to the current directory), and os.path.abspath will construct a full pathname from the partial pathname. +If the script is run by using a partial pathname, sys.argv[0] will still contain exactly what appears on the command line. os.path.dirname will then give you a partial pathname (relative to the current directory), and os.path.abspath will construct a full pathname from the partial pathname. 3 -If the script is run from the current directory without giving any path, os.path.dirname will simply return an empty string. Given an empty string, os.path.abspath returns the current directory, which is what you want, since the script was run from the current directory. +If the script is run from the current directory without giving any path, os.path.dirname will simply return an empty string. Given an empty string, os.path.abspath returns the current directory, which is what you want, since the script was run from the current directory. @@ -13896,13 +13736,13 @@ full path = /home/you/diveintopython3/common/py
        Note -Like the other functions in the os and os.path modules, os.path.abspath is cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash - as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of the os module. +Like the other functions in the os and os.path modules, os.path.abspath is cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash + as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of the os module.

        Addendum. One reader was dissatisfied with this solution, and wanted to be able to run all the unit tests in the current directory, -not the directory where regression.py is located. He suggests this approach instead: +not the directory where regression.py is located. He suggests this approach instead:

        Example 16.6. Running scripts in the current directory

        import sys, os, re, unittest
         
         def regressionTest():
        @@ -13914,7 +13754,7 @@ def regressionTest():
         
         1 
         
        -Instead of setting path to the directory where the currently running script is located, you set it to the current working directory instead.  This
        +Instead of setting path to the directory where the currently running script is located, you set it to the current working directory instead.  This
                     will be whatever directory you were in before you ran the script, which is not necessarily the same as the directory the script
                     is in. (Read that sentence a few times until you get it.)
         
        @@ -13922,7 +13762,7 @@ def regressionTest():
         
         2 
         
        -Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them.  You didn't need to do this when path was the directory of the currently running script, because Python always looks in that directory.
        +Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them.  You didn't need to do this when path was the directory of the currently running script, because Python always looks in that directory.
         
         
         
        @@ -13931,13 +13771,13 @@ def regressionTest():
         The rest of the function is the same.
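As a rough sketch of the first two steps, assuming test files are named moduletest.py as described earlier: the helper names find_test_module_names and use_current_directory are hypothetical, and the code stops short of importing and running the tests, which the rest of regressionTest handles.

```python
import os
import re
import sys

def find_test_module_names(path):
    """Return module names for files in path whose names end in test.py."""
    is_test = re.compile(r'test\.py$', re.IGNORECASE)
    return sorted(os.path.splitext(filename)[0]
                  for filename in os.listdir(path)
                  if is_test.search(filename))

def use_current_directory():
    """Look for tests in the directory you ran the script from."""
    # The current working directory, not the directory of sys.argv[0].
    path = os.getcwd()
    # Make sure a later dynamic import can find modules in that directory.
    if path not in sys.path:
        sys.path.append(path)
    return find_test_module_names(path)

if __name__ == '__main__':
    print(use_current_directory())
```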
         
         
        -

        This technique will allow you to re-use this regression.py script on multiple projects. Just put the script in a common directory, then change to the project's directory before running - it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where regression.py is located. +

        This technique will allow you to re-use this regression.py script on multiple projects. Just put the script in a common directory, then change to the project's directory before running + it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where regression.py is located.

        16.3. Filtering lists revisited

        You're already familiar with using list comprehensions to filter lists. There is another way to accomplish this same thing, which some people feel is more expressive. -

        Python has a built-in filter function which takes two arguments, a function and a list, and returns a list.[7] The function passed as the first argument to filter must itself take one argument, and the list that filter returns will contain all the elements from the list passed to filter for which the function passed to filter returns true. +

        Python has a built-in filter function which takes two arguments, a function and a list, and returns a list.[7] The function passed as the first argument to filter must itself take one argument, and the list that filter returns will contain all the elements from the list passed to filter for which the function passed to filter returns true.

        Got all that? It's not as difficult as it sounds. -

        Example 16.7. Introducing filter

        +

        Example 16.7. Introducing filter

         >>> def odd(n):                 1
         ...     return n % 2
         ...     
        @@ -13956,13 +13796,13 @@ def regressionTest():
         
         1 
         
        -odd uses the built-in mod function “%” to return True if n is odd and False if n is even.
        +odd uses the built-in mod function “%” to return True if n is odd and False if n is even.
         
         
         
         2 
         
        -filter takes two arguments, a function (odd) and a list (li).  It loops through the list and calls odd with each element.  If odd returns a true value (remember, any non-zero value is true in Python), then the element is included in the returned list, otherwise it is filtered out.  The result is a list of only the odd
        +filter takes two arguments, a function (odd) and a list (li).  It loops through the list and calls odd with each element.  If odd returns a true value (remember, any non-zero value is true in Python), then the element is included in the returned list, otherwise it is filtered out.  The result is a list of only the odd
                     numbers from the original list, in the same order as they appeared in the original.
         
         
        @@ -13975,12 +13815,12 @@ def regressionTest():
         
         4 
         
        -You could also accomplish the same thing with a for loop.  Depending on your programming background, this may seem more “straightforward”, but functions like filter are much more expressive.  Not only is it easier to write, it's easier to read, too.  Reading the for loop is like standing too close to a painting; you see all the details, but it may take a few seconds to be able to step
        +You could also accomplish the same thing with a for loop.  Depending on your programming background, this may seem more “straightforward”, but functions like filter are much more expressive.  Not only is it easier to write, it's easier to read, too.  Reading the for loop is like standing too close to a painting; you see all the details, but it may take a few seconds to be able to step
                     back and see the bigger picture: “Oh, you're just filtering the list!”
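To put the approaches side by side, here is a small sketch using a sample list chosen for illustration. One caveat if you are following along in Python 3: there, filter returns a lazy iterator rather than a list, so the sketch wraps it in list().

```python
def odd(n):
    # n % 2 is 1 for odd numbers and 0 for even ones
    return n % 2

li = [1, 2, 3, 5, 9, 10, 256, -3]   # sample data, chosen for illustration

# The filter way (list() is needed in Python 3, harmless in Python 2)
filtered = list(filter(odd, li))

# The list comprehension way
comprehension = [e for e in li if odd(e)]

# The for loop way: same result, more bookkeeping
looped = []
for e in li:
    if odd(e):
        looped.append(e)

print(filtered)   # [1, 3, 5, 9, -3]
```

All three produce the same list, in the same order as the original.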
         
         
         
        -

        Example 16.8. filter in regression.py

        +

        Example 16.8. filter in regression.py

             files = os.listdir(path)              1
             test = re.compile("test\.py$", re.IGNORECASE)           2
             files = filter(test.search, files)    3
        @@ -13988,27 +13828,27 @@ def regressionTest(): 1 -As you saw in Section 16.2, “Finding the path”, path may contain the full or partial pathname of the directory of the currently running script, or it may contain an empty string - if the script is being run from the current directory. Either way, files will end up with the names of the files in the same directory as this script you're running. +As you saw in Section 16.2, “Finding the path”, path may contain the full or partial pathname of the directory of the currently running script, or it may contain an empty string + if the script is being run from the current directory. Either way, files will end up with the names of the files in the same directory as this script you're running. 2 This is a compiled regular expression. As you saw in Section 15.3, “Refactoring”, if you're going to use the same regular expression over and over, you should compile it for faster performance. The compiled - object has a search method which takes a single argument, the string to search. If the regular expression matches the string, the search method returns a Match object containing information about the regular expression match; otherwise it returns None, the Python null value. + object has a search method which takes a single argument, the string to search. If the regular expression matches the string, the search method returns a Match object containing information about the regular expression match; otherwise it returns None, the Python null value. 3 -For each element in the files list, you're going to call the search method of the compiled regular expression object, test. If the regular expression matches, the method will return a Match object, which Python considers to be true, so the element will be included in the list returned by filter. If the regular expression does not match, the search method will return None, which Python considers to be false, so the element will not be included. 
+For each element in the files list, you're going to call the search method of the compiled regular expression object, test. If the regular expression matches, the method will return a Match object, which Python considers to be true, so the element will be included in the list returned by filter. If the regular expression does not match, the search method will return None, which Python considers to be false, so the element will not be included. -

        Historical note. Versions of Python prior to 2.0 did not have list comprehensions, so you couldn't filter using list comprehensions; the filter function was the only game in town. Even with the introduction of list comprehensions in 2.0, some people still prefer the -old-style filter (and its companion function, map, which you'll see later in this chapter). Both techniques work at the moment, so which one you use is a matter of style. -There is discussion that map and filter might be deprecated in a future version of Python, but no decision has been made. +

        Historical note. Versions of Python prior to 2.0 did not have list comprehensions, so you couldn't filter using list comprehensions; the filter function was the only game in town. Even with the introduction of list comprehensions in 2.0, some people still prefer the +old-style filter (and its companion function, map, which you'll see later in this chapter). Both techniques work at the moment, so which one you use is a matter of style. +There is discussion that map and filter might be deprecated in a future version of Python, but no decision has been made.

        Example 16.9. Filtering using list comprehensions instead

             files = os.listdir(path)             
             test = re.compile("test\.py$", re.IGNORECASE)          
        @@ -14017,13 +13857,13 @@ There is discussion that map and 1 
         
        -This will accomplish exactly the same result as using the filter function.  Which way is more expressive?  That's up to you.
        +This will accomplish exactly the same result as using the filter function.  Which way is more expressive?  That's up to you.
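Here is a runnable sketch of both styles, using a hypothetical directory listing in place of os.listdir(path). The raw string prefix on the pattern is a choice made for this sketch (it keeps recent versions of Python from complaining about the backslash), and list() is there because filter returns an iterator in Python 3.

```python
import re

# Hypothetical directory listing; regression.py gets this from os.listdir(path)
files = ["apihelper.py", "apihelpertest.py", "romantest.py",
         "RomanTest.PY", "notes.txt"]

test = re.compile(r"test\.py$", re.IGNORECASE)

# test.search returns a match object (true) or None (false) for each filename
with_filter = list(filter(test.search, files))

# The list comprehension spells out the same condition inline
with_comprehension = [f for f in files if test.search(f)]

print(with_filter)   # ['apihelpertest.py', 'romantest.py', 'RomanTest.PY']
```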
         
         
         
         

        16.4. Mapping lists revisited

        -

        You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in map function. It works much the same way as the filter function. -

        Example 16.10. Introducing map

        +

        You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in map function. It works much the same way as the filter function. +

        Example 16.10. Introducing map

         >>> def double(n):
         ...     return n*2
         ...     
        @@ -14042,14 +13882,14 @@ There is discussion that map and 1 
         
        -map takes a function and a list[8] and returns a new list by calling the function with each element of the list in order.  In this case, the function simply
        +map takes a function and a list[8] and returns a new list by calling the function with each element of the list in order.  In this case, the function simply
                     multiplies each element by 2.
         
         
         
         2 
         
        -You could accomplish the same thing with a list comprehension.  List comprehensions were first introduced in Python 2.0; map has been around forever.
        +You could accomplish the same thing with a list comprehension.  List comprehensions were first introduced in Python 2.0; map has been around forever.
         
         
         
        @@ -14059,7 +13899,7 @@ There is discussion that map and 

        Example 16.11. map with lists of mixed datatypes

        +

        Example 16.11. map with lists of mixed datatypes

         >>> li = [5, 'a', (2, 'b')]
         >>> map(double, li)     1
         [10, 'aa', (2, 'b', 2, 'b')]
        @@ -14067,29 +13907,29 @@ There is discussion that map and 1 -As a side note, I'd like to point out that map works just as well with lists of mixed datatypes, as long as the function you're using correctly handles each type. In this - case, the double function simply multiplies the given argument by 2, and Python Does The Right Thing depending on the datatype of the argument. For integers, this means actually multiplying it by 2; for +As a side note, I'd like to point out that map works just as well with lists of mixed datatypes, as long as the function you're using correctly handles each type. In this + case, the double function simply multiplies the given argument by 2, and Python Does The Right Thing depending on the datatype of the argument. For integers, this means actually multiplying it by 2; for strings, it means concatenating the string with itself; for tuples, it means making a new tuple that has all of the elements of the original, then all of the elements of the original again.
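If you want to try this yourself in Python 3, keep in mind that map returns an iterator there, so this sketch wraps the call in list() to get an actual list back.

```python
def double(n):
    return n * 2

li = [5, 'a', (2, 'b')]

# int * 2 multiplies, str * 2 repeats the string, tuple * 2 repeats the tuple
doubled = list(map(double, li))

print(doubled)   # [10, 'aa', (2, 'b', 2, 'b')]
```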

        All right, enough play time. Let's look at some real code. -

        Example 16.12. map in regression.py

        +

        Example 16.12. map in regression.py

             filenameToModuleName = lambda f: os.path.splitext(f)[0] 1
             moduleNames = map(filenameToModuleName, files)          2
        - -
        1 As you saw in Section 4.7, “Using lambda Functions”, lambda defines an inline function. And as you saw in Example 6.17, “Splitting Pathnames”, os.path.splitext takes a filename and returns a tuple (name, extension). So filenameToModuleName is a function which will take a filename and strip off the file extension, and return just the name. +As you saw in Section 4.7, “Using lambda Functions”, lambda defines an inline function. And as you saw in Example 6.17, “Splitting Pathnames”, os.path.splitext takes a filename and returns a tuple (name, extension). So filenameToModuleName is a function which will take a filename and strip off the file extension, and return just the name.
        2 Calling map takes each filename listed in files, passes it to the function filenameToModuleName, and returns a list of the return values of each of those function calls. In other words, you strip the file extension off - of each filename, and store the list of all those stripped filenames in moduleNames. +Calling map takes each filename listed in files, passes it to the function filenameToModuleName, and returns a list of the return values of each of those function calls. In other words, you strip the file extension off + of each filename, and store the list of all those stripped filenames in moduleNames.
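The same two lines can be tried in isolation. In this sketch the files list is hypothetical (in regression.py it is the filtered directory listing), and list() is added because map returns an iterator in Python 3.

```python
import os.path

# os.path.splitext splits "name.ext" into ("name", ".ext")
parts = os.path.splitext("romantest.py")

filenameToModuleName = lambda f: os.path.splitext(f)[0]

files = ["apihelpertest.py", "romantest.py"]   # hypothetical filtered listing
moduleNames = list(map(filenameToModuleName, files))

print(parts)         # ('romantest', '.py')
print(moduleNames)   # ['apihelpertest', 'romantest']
```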
        @@ -14097,30 +13937,30 @@ There is discussion that map and 16.5. Data-centric programming

      By now you're probably scratching your head wondering why this is better than using for loops and straight function calls. And that's a perfectly valid question. Mostly, it's a matter of perspective. Using -map and filter forces you to center your thinking around your data. +map and filter forces you to center your thinking around your data.

      In this case, you started with no data at all; the first thing you did was get the directory path of the current script, and got a list of files in that directory. That was the bootstrap, and it gave you real data to work with: a list of filenames. -

      However, you knew you didn't care about all of those files, only the ones that were actually test suites. You had too much data, so you needed to filter it. How did you know which data to keep? You needed a test to decide, so you defined one and passed it to the filter function. In this case you used a regular expression to decide, but the concept would be the same regardless of how you +

      However, you knew you didn't care about all of those files, only the ones that were actually test suites. You had too much data, so you needed to filter it. How did you know which data to keep? You needed a test to decide, so you defined one and passed it to the filter function. In this case you used a regular expression to decide, but the concept would be the same regardless of how you constructed the test.

      Now you had the filenames of each of the test suites (and only the test suites, since everything else had been filtered out), but you really wanted module names instead. You had the right amount of data, but it was in the wrong format. So you defined a function that would transform a single filename into a module name, and you mapped that function onto the entire list. From one filename, you can get a module name; from a list of filenames, you can get a list of module names. -

      Instead of filter, you could have used a for loop with an if statement. Instead of map, you could have used a for loop with a function call. But using for loops like that is busywork. At best, it simply wastes time; at worst, it introduces obscure bugs. For instance, you need +

      Instead of filter, you could have used a for loop with an if statement. Instead of map, you could have used a for loop with a function call. But using for loops like that is busywork. At best, it simply wastes time; at worst, it introduces obscure bugs. For instance, you need to figure out how to test for the condition “is this file a test suite?” anyway; that's the application-specific logic, and no language can write that for us. But once you've figured that out, -do you really want go to all the trouble of defining a new empty list and writing a for loop and an if statement and manually calling append to add each element to the new list if it passes the condition and then keeping track of which variable holds the new filtered +do you really want to go to all the trouble of defining a new empty list and writing a for loop and an if statement and manually calling append to add each element to the new list if it passes the condition and then keeping track of which variable holds the new filtered data and which one holds the old unfiltered data? Why not just define the test condition, then let Python do the rest of that work for us?

      Oh sure, you could try to be fancy and delete elements in place without creating a new list. But you've been burned by that before. Trying to modify a data structure that you're looping through can be tricky. You delete an element, then loop to the next element, and suddenly you've skipped one. Is Python one of the languages that works that way? How long would it take you to figure it out? Would you remember for certain whether it was safe the next time you tried? Programmers spend so much time and make so many mistakes dealing with purely technical issues like this, and it's all pointless. It doesn't advance your program at all; it's just busywork. -

      I resisted list comprehensions when I first learned Python, and I resisted filter and map even longer. I insisted on making my life more difficult, sticking to the familiar way of for loops and if statements and step-by-step code-centric programming. And my Python programs looked a lot like Visual Basic programs, detailing every step of every operation in every function. And they had all the same types of little problems +

      I resisted list comprehensions when I first learned Python, and I resisted filter and map even longer. I insisted on making my life more difficult, sticking to the familiar way of for loops and if statements and step-by-step code-centric programming. And my Python programs looked a lot like Visual Basic programs, detailing every step of every operation in every function. And they had all the same types of little problems and obscure bugs. And it was all pointless.

      Let it all go. Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind.
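In case you're wondering whether Python really is one of the languages that skips elements when you delete while looping: with a plain list, yes. This sketch (sample data chosen for illustration) shows the in-place attempt going wrong, next to the data-centric version that has no such trap.

```python
numbers = [1, 2, 2, 3, 4]

# Busywork version: try to delete the even numbers in place.
in_place = list(numbers)
for n in in_place:
    if n % 2 == 0:
        # Removing shifts the later elements left, so the loop
        # steps right over the element that slid into this slot.
        in_place.remove(n)

# Data-centric version: describe what to keep, build a new list.
kept = [n for n in numbers if n % 2]

print(in_place)   # [1, 2, 3]  (one of the 2s survived)
print(kept)       # [1, 3]
```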

      16.6. Dynamically importing modules

      OK, enough philosophizing. Let's talk about dynamically importing modules. -

      First, let's look at how you normally import modules. The import module syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once +

      First, let's look at how you normally import modules. The import module syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once this way, with a comma-separated list. You did this on the very first line of this chapter's script.

      Example 16.13. Importing multiple modules at once

       import sys, os, re, unittest 1
      @@ -14129,7 +13969,7 @@ import sys, os, re, unittest 1 
       
      -This imports four modules at once: sys (for system functions and access to the command line parameters), os (for operating system functions like directory listings), re (for regular expressions), and unittest (for unit testing).
      +This imports four modules at once: sys (for system functions and access to the command line parameters), os (for operating system functions like directory listings), re (for regular expressions), and unittest (for unit testing).
       
       
       
      @@ -14148,17 +13988,17 @@ import sys, os, re, unittest 1 
       
      -The built-in __import__ function accomplishes the same goal as using the import statement, but it's an actual function, and it takes a string as an argument.
      +The built-in __import__ function accomplishes the same goal as using the import statement, but it's an actual function, and it takes a string as an argument.
       
       
       
       2 
       
      -The variable sys is now the sys module, just as if you had said import sys.  The variable os is now the os module, and so forth.
      +The variable sys is now the sys module, just as if you had said import sys.  The variable os is now the os module, and so forth.
       
       
       
      -

      So __import__ imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string, +

      So __import__ imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string, but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.

      Example 16.15. Importing a list of modules dynamically

      @@ -14181,14 +14021,14 @@ to doesn't need to match the module name, either.  You could import a series of
       
       1 
       
      -moduleNames is just a list of strings.  Nothing fancy, except that the strings happen to be names of modules that you could import, if
      +moduleNames is just a list of strings.  Nothing fancy, except that the strings happen to be names of modules that you could import, if
                   you wanted to.
       
       
       
       2 
       
      -Surprise, you wanted to import them, and you did, by mapping the __import__ function onto the list.  Remember, this takes each element of the list (moduleNames) and calls the function (__import__) over and over, once with each element of the list, builds a list of the return values, and returns the result.
      +Surprise, you wanted to import them, and you did, by mapping the __import__ function onto the list.  Remember, this takes each element of the list (moduleNames) and calls the function (__import__) over and over, once with each element of the list, builds a list of the return values, and returns the result.
       
       
       
      @@ -14201,7 +14041,7 @@ to doesn't need to match the module name, either.  You could import a series of
       
       4 
       
      -To drive home the point that these are real modules, let's look at some module attributes.  Remember, modules[0] is the sys module, so modules[0].version is sys.version.  All the other attributes and methods of these modules are also available.  There's nothing magic about the import statement, and there's nothing magic about modules.  Modules are objects.  Everything is an object.
      +To drive home the point that these are real modules, let's look at some module attributes.  Remember, modules[0] is the sys module, so modules[0].version is sys.version.  All the other attributes and methods of these modules are also available.  There's nothing magic about the import statement, and there's nothing magic about modules.  Modules are objects.  Everything is an object.
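A self-contained sketch of the same idea follows. The list() wrapper is there because map returns an iterator in Python 3. The second version uses importlib.import_module, which is not something this chapter relies on; it is shown only as the alternative the standard library documentation suggests for importing by name.

```python
import importlib

moduleNames = ['sys', 'os', 're', 'unittest']

# Map the built-in __import__ function onto the list of names
modules = list(map(__import__, moduleNames))

# Alternative spelling, assuming you prefer importlib over __import__
same_modules = [importlib.import_module(name) for name in moduleNames]

# These are the real module objects, attributes and all
print(modules[0].version == same_modules[0].version)   # True
```

One difference worth knowing: for a dotted name like 'os.path', __import__ returns the top-level os package, while importlib.import_module returns the os.path submodule itself. For the flat module names used here, the two behave the same.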
       
       
       
      @@ -14209,7 +14049,7 @@ to doesn't need to match the module name, either.  You could import a series of
       

      16.7. Putting it all together

      You've learned enough now to deconstruct the first seven lines of this chapter's code sample: reading a directory and importing selected modules within it. -

      Example 16.16. The regressionTest function

      +

      Example 16.16. The regressionTest function

       def regressionTest():
           path = os.path.abspath(os.path.dirname(sys.argv[0]))   
           files = os.listdir(path)             
      @@ -14220,7 +14060,7 @@ def regressionTest():
           modules = map(__import__, moduleNames)                 
           load = unittest.defaultTestLoader.loadTestsFromModule
           return unittest.TestSuite(map(load, modules))
      -

      Let's look at it line by line, interactively. Assume that the current directory is c:\diveintopython3\py, which contains the examples that come with this book, including this chapter's script. As you saw in Section 16.2, “Finding the path”, the script directory will end up in the path variable, so let's start hard-code that and go from there. +

      Let's look at it line by line, interactively. Assume that the current directory is c:\diveintopython3\py, which contains the examples that come with this book, including this chapter's script. As you saw in Section 16.2, “Finding the path”, the script directory will end up in the path variable, so let's hard-code that and go from there.

      Example 16.17. Step 1: Get all the files

       >>> import sys, os, re, unittest
       >>> path = r'c:\diveintopython3\py'
      @@ -14238,8 +14078,8 @@ return unittest.TestSuite(map(load, modules))
       
       1 
       
      -files is a list of all the files and directories in the script's directory.  (If you've been running some of the examples already,
      -            you may also see some .pyc files in there as well.)
      +files is a list of all the files and directories in the script's directory.  (If you've been running some of the examples already,
      +            you may also see some .pyc files in there as well.)
       
       
       
      @@ -14266,7 +14106,7 @@ return unittest.TestSuite(map(load, modules))
       
       3 
       
      -And you're left with the list of unit testing scripts, because they were the only ones named SOMETHINGtest.py.
      +And you're left with the list of unit testing scripts, because they were the only ones named SOMETHINGtest.py.
       
       
       
      @@ -14285,19 +14125,19 @@ return unittest.TestSuite(map(load, modules))
       1 
       
       As you saw in Section 4.7, “Using lambda Functions”, lambda is a quick-and-dirty way of creating an inline, one-line function.  This one takes a filename with an extension and returns
      -            just the filename part, using the standard library function os.path.splitext that you saw in Example 6.17, “Splitting Pathnames”.
      +            just the filename part, using the standard library function os.path.splitext that you saw in Example 6.17, “Splitting Pathnames”.
       
       
       
       2 
       
      -filenameToModuleName is a function.  There's nothing magic about lambda functions as opposed to regular functions that you define with a def statement.  You can call the filenameToModuleName function like any other, and it does just what you wanted it to do: strips the file extension off of its argument.
      +filenameToModuleName is a function.  There's nothing magic about lambda functions as opposed to regular functions that you define with a def statement.  You can call the filenameToModuleName function like any other, and it does just what you wanted it to do: strips the file extension off of its argument.
       
       
       
       3 
       
      -Now you can apply this function to each file in the list of unit test files, using map.
      +Now you can apply this function to each file in the list of unit test files, using map.
       
       
       
      @@ -14321,20 +14161,20 @@ return unittest.TestSuite(map(load, modules))
       
       1 
       
      -As you saw in Section 16.6, “Dynamically importing modules”, you can use a combination of map and __import__ to map a list of module names (as strings) into actual modules (which you can call or access like any other module).
      +As you saw in Section 16.6, “Dynamically importing modules”, you can use a combination of map and __import__ to map a list of module names (as strings) into actual modules (which you can call or access like any other module).
                   
       
       
       
       2 
       
      -modules is now a list of modules, fully accessible like any other module.
      +modules is now a list of modules, fully accessible like any other module.
       
       
       
       3 
       
      -The last module in the list is the romantest module, just as if you had said import romantest.
      +The last module in the list is the romantest module, just as if you had said import romantest.
       
       
       
      @@ -14358,21 +14198,21 @@ return unittest.TestSuite(map(load, modules))
       
       These are real module objects.  Not only can you access them like any other module, instantiate classes and call functions,
                   you can also introspect into the module to figure out which classes and functions it has in the first place.  That's what
      -            the loadTestsFromModule method does: it introspects into each module and returns a unittest.TestSuite object for each module.  Each TestSuite object actually contains a list of TestSuite objects, one for each TestCase class in your module, and each of those TestSuite objects contains a list of tests, one for each test method in your module.
      +            the loadTestsFromModule method does: it introspects into each module and returns a unittest.TestSuite object for each module.  Each TestSuite object actually contains a list of TestSuite objects, one for each TestCase class in your module, and each of those TestSuite objects contains a list of tests, one for each test method in your module.
       
       
       
       2 
       
      -Finally, you wrap the list of TestSuite objects into one big test suite.  The unittest module has no problem traversing this tree of nested test suites within test suites; eventually it gets down to an individual
      +Finally, you wrap the list of TestSuite objects into one big test suite.  The unittest module has no problem traversing this tree of nested test suites within test suites; eventually it gets down to an individual
                   test method and executes it, verifies that it passes or fails, and moves on to the next one.
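You can watch the loading and wrapping steps work without any test files on disk. This sketch builds a stand-in module in memory; the module name arithmetictest and the ArithmeticTest class are hypothetical, invented purely for illustration.

```python
import types
import unittest

class ArithmeticTest(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

    def test_doubling(self):
        self.assertEqual(2 * 2, 4)

# A stand-in for a module that map(__import__, moduleNames) would return
fake_module = types.ModuleType("arithmetictest")
fake_module.ArithmeticTest = ArithmeticTest

# Same two steps as regressionTest: load each module, wrap the results
load = unittest.defaultTestLoader.loadTestsFromModule
suite = unittest.TestSuite(map(load, [fake_module]))

# One test per test method, found by introspection
test_count = suite.countTestCases()

# Run the nested suites; the runner reports to stderr
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The loader never asks where the module came from; it only introspects what the module contains, which is why an in-memory stand-in is enough here.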
       
       
       
      -

      This introspection process is what the unittest module usually does for us. Remember that magic-looking unittest.main() function that our individual test modules called to kick the whole thing off? unittest.main() actually creates an instance of unittest.TestProgram, which in turn creates an instance of a unittest.defaultTestLoader and loads it up with the module that called it. (How does it get a reference to the module that called it if you don't give +

      This introspection process is what the unittest module usually does for us. Remember that magic-looking unittest.main() function that our individual test modules called to kick the whole thing off? unittest.main() actually creates an instance of unittest.TestProgram, which in turn creates an instance of a unittest.defaultTestLoader and loads it up with the module that called it. (How does it get a reference to the module that called it if you don't give it one? By using the equally-magic __import__('__main__') command, which dynamically imports the currently-running module. I could write a book on all the tricks and techniques used -in the unittest module, but then I'd never finish this one.) -

      Example 16.22. Step 6: Telling unittest to use your test suite

      +in the unittest module, but then I'd never finish this one.)
      +

      Example 16.22. Step 6: Telling unittest to use your test suite

       if __name__ == "__main__": 
           unittest.main(defaultTest="regressionTest") 1
       
      @@ -14380,29 +14220,29 @@ if __name__ == "__main__": 1 -Instead of letting the unittest module do all its magic for us, you've done most of it yourself. You've created a function (regressionTest) that imports the modules yourself, calls unittest.defaultTestLoader yourself, and wraps it all up in a test suite. Now all you need to do is tell unittest that, instead of looking for tests and building a test suite in the usual way, it should just call the regressionTest function, which returns a ready-to-use TestSuite. +Instead of letting the unittest module do all its magic for us, you've done most of it yourself. You've created a function (regressionTest) that imports the modules yourself, calls unittest.defaultTestLoader yourself, and wraps it all up in a test suite. Now all you need to do is tell unittest that, instead of looking for tests and building a test suite in the usual way, it should just call the regressionTest function, which returns a ready-to-use TestSuite.

      16.8. Summary

      -

      The regression.py program and its output should now make perfect sense. +

      The regression.py program and its output should now make perfect sense.

      You should now feel comfortable doing all of these things:



      -

      [7] Technically, the second argument to filter can be any sequence, including lists, tuples, and custom classes that act like lists by defining the __getitem__ special method. If possible, filter will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple. +

      [7] Technically, the second argument to filter can be any sequence, including lists, tuples, and custom classes that act like lists by defining the __getitem__ special method. If possible, filter will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.

      -

      [8] Again, I should point out that map can take a list, a tuple, or any object that acts like a sequence. See previous footnote about filter. +

      [8] Again, I should point out that map can take a list, a tuple, or any object that acts like a sequence. See previous footnote about filter.

      Chapter 17. Dynamic functions

      17.1. Diving in

      @@ -14431,10 +14271,10 @@ the basic rules:

      Other languages are, of course, completely different.

      Let's design a module that pluralizes nouns. Start with just English nouns, and just these four rules, but keep in mind that you'll inevitably need to add more rules, and you may eventually need to add more languages. -

      17.2. plural.py, stage 1

      +

      17.2. plural.py, stage 1

      So you're looking at words, which at least in English are strings of characters. And you have rules that say you need to find different combinations of characters, and then do different things to them. This sounds like a job for regular expressions. -

      Example 17.1. plural1.py

      +

      Example 17.1. plural1.py

       import re
       
       def plural(noun):          
      @@ -14451,17 +14291,17 @@ def plural(noun):
       
       1 
       
      -OK, this is a regular expression, but it uses a syntax you didn't see in Chapter 7, Regular Expressions.  The square brackets mean “match exactly one of these characters”.  So [sxz] means “s, or x, or z”, but only one of them.  The $ should be familiar; it matches the end of string.  So you're checking to see if noun ends with s, x, or z.
      +OK, this is a regular expression, but it uses a syntax you didn't see in Chapter 7, Regular Expressions.  The square brackets mean “match exactly one of these characters”.  So [sxz] means “s, or x, or z”, but only one of them.  The $ should be familiar; it matches the end of string.  So you're checking to see if noun ends with s, x, or z.
       
       
       
       2 
       
      -This re.sub function performs regular expression-based string substitutions.  Let's look at it in more detail.
      +This re.sub function performs regular expression-based string substitutions.  Let's look at it in more detail.
       
       
       
      -

      Example 17.2. Introducing re.sub

      +

      Example 17.2. Introducing re.sub

       >>> import re
       >>> re.search('[abc]', 'Mark')   1
       <_sre.SRE_Match object at 0x001C1FA8>
      @@ -14498,7 +14338,7 @@ def plural(noun):
       
       
       
      -

      Example 17.3. Back to plural1.py

      +

      Example 17.3. Back to plural1.py

       import re
       
       def plural(noun):          
      @@ -14515,7 +14355,7 @@ def plural(noun):
       
       1 
       
      -Back to the plural function.  What are you doing?  You're replacing the end of string with es.  In other words, adding es to the string.  You could accomplish the same thing with string concatenation, for example noun + 'es', but I'm using regular expressions for everything, for consistency, for reasons that will become clear later in the chapter.
      +Back to the plural function.  What are you doing?  You're replacing the end of string with es.  In other words, adding es to the string.  You could accomplish the same thing with string concatenation, for example noun + 'es', but I'm using regular expressions for everything, for consistency, for reasons that will become clear later in the chapter.
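To see the two pieces working together, here is a small sketch in Python 3. The helper name `plural_sxz` is made up for illustration; it only covers this one rule.

```python
import re


def plural_sxz(noun):
    # [sxz]$ matches exactly one of s, x, or z at the end of the string.
    if re.search('[sxz]$', noun):
        # Replacing the (empty) end of string with 'es' appends 'es'.
        return re.sub('$', 'es', noun)
    return noun  # this sketch leaves every other noun untouched


boxes = plural_sxz('box')
unchanged = plural_sxz('Mark')
```

The substitution gives the same answer as `noun + 'es'`; the regular expression form just keeps every rule in the same shape.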
       
       
       
      @@ -14562,7 +14402,7 @@ def plural(noun):
       
       
       
      -

      Example 17.5. More on re.sub

      +

      Example 17.5. More on re.sub

       >>> re.sub('y$', 'ies', 'vacancy')              1
       'vacancies'
       >>> re.sub('y$', 'ies', 'agency')
      @@ -14574,7 +14414,7 @@ def plural(noun):
       
       1 
       
      -This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted.  Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub.
      +This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted.  Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub.
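A short sketch of why the guard matters, in Python 3. The function name `plural_y` and the fallback of appending `s` are assumptions made for this example only.

```python
import re


def plural_y(noun):
    # Only rewrite the final y when the letter before it is not a vowel.
    if re.search('[^aeiou]y$', noun):
        return re.sub('y$', 'ies', noun)
    return noun + 's'  # assumed fallback for this sketch


guarded = plural_y('boy')                 # the search fails, so no y-to-ies rewrite
unguarded = re.sub('y$', 'ies', 'boy')    # what the bare substitution would do
```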
       
       
       
      @@ -14589,10 +14429,10 @@ def plural(noun):
       

      Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn't directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. And if you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn't get much more direct than that. -

      17.3. plural.py, stage 2

      +

      17.3. plural.py, stage 2

      Now you're going to add a level of abstraction. You started by defining a list of rules: if this, then do that, otherwise go to the next rule. Let's temporarily complicate part of the program so you can simplify another part. -

      Example 17.6. plural2.py

      +

      Example 17.6. plural2.py

       import re
       
       def match_sxz(noun):        
      @@ -14636,30 +14476,30 @@ def plural(noun):
       
       This version looks more complicated (it's certainly longer), but it does exactly the same thing: try to match four different
                   rules, in order, and apply the appropriate regular expression when a match is found.  The difference is that each individual
      -            match and apply rule is defined in its own function, and the functions are then listed in this rules variable, which is a tuple of tuples.
      +            match and apply rule is defined in its own function, and the functions are then listed in this rules variable, which is a tuple of tuples.
       
       
       
       2 
       
      -Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules tuple.  On the first iteration of the for loop, matchesRule will get match_sxz, and applyRule will get apply_sxz.  On the second iteration (assuming you get that far), matchesRule will be assigned match_h, and applyRule will be assigned apply_h.
      +Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules tuple.  On the first iteration of the for loop, matchesRule will get match_sxz, and applyRule will get apply_sxz.  On the second iteration (assuming you get that far), matchesRule will be assigned match_h, and applyRule will be assigned apply_h.
       
       
       
       3 
       
      -Remember that everything in Python is an object, including functions.  rules contains actual functions; not names of functions, but actual functions.  When they get assigned in the for loop, then matchesRule and applyRule are actual functions that you can call.  So on the first iteration of the for loop, this is equivalent to calling matches_sxz(noun).
      +Remember that everything in Python is an object, including functions.  rules contains actual functions; not names of functions, but actual functions.  When they get assigned in the for loop, then matchesRule and applyRule are actual functions that you can call.  So on the first iteration of the for loop, this is equivalent to calling match_sxz(noun).
       
       
       
       4 
       
      -On the first iteration of the for loop, this is equivalent to calling apply_sxz(noun), and so forth.
      +On the first iteration of the for loop, this is equivalent to calling apply_sxz(noun), and so forth.
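The pattern these callouts describe can be sketched as follows, in Python 3. The function bodies are not shown in this excerpt, so the ones below are assumptions based on the rules listed elsewhere in the chapter, and the `h` rule is left out to keep the sketch short.

```python
import re


def match_sxz(noun):
    return re.search('[sxz]$', noun)


def apply_sxz(noun):
    return re.sub('$', 'es', noun)


def match_y(noun):
    return re.search('[^aeiou]y$', noun)


def apply_y(noun):
    return re.sub('y$', 'ies', noun)


def match_default(noun):
    return True  # the last rule always matches


def apply_default(noun):
    return noun + 's'


# A tuple of (match, apply) pairs: the functions themselves, not their names.
rules = (
    (match_sxz, apply_sxz),
    (match_y, apply_y),
    (match_default, apply_default),
)


def plural(noun):
    # Try each pair in order; the first match decides the result.
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)
```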
       
       
       
       

      If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. This for loop is equivalent to the following: -

      Example 17.7. Unrolling the plural function

      +

      Example 17.7. Unrolling the plural function

       def plural(noun):
           if match_sxz(noun):
               return apply_sxz(noun)
      @@ -14669,15 +14509,15 @@ def plural(noun):
               return apply_y(noun)
           if match_default(noun):
               return apply_default(noun)
      -

      The benefit here is that that plural function is now simplified. It takes a list of rules, defined elsewhere, and iterates through them in a generic fashion. -Get a match rule; does it match? Then call the apply rule. The rules could be defined anywhere, in any way. The plural function doesn't care. +

      The benefit here is that the plural function is now simplified. It takes a list of rules, defined elsewhere, and iterates through them in a generic fashion. +Get a match rule; does it match? Then call the apply rule. The rules could be defined anywhere, in any way. The plural function doesn't care.

      Now, was adding this level of abstraction worth it? Well, not yet. Let's consider what it would take to add a new rule to -the function. Well, in the previous example, it would require adding an if statement to the plural function. In this example, it would require adding two functions, match_foo and apply_foo, and then updating the rules list to specify where in the order the new match and apply functions should be called relative to the other rules. +the function. Well, in the previous example, it would require adding an if statement to the plural function. In this example, it would require adding two functions, match_foo and apply_foo, and then updating the rules list to specify where in the order the new match and apply functions should be called relative to the other rules.

      This is really just a stepping stone to the next section. Let's move on. -

      17.4. plural.py, stage 3

      +

      17.4. plural.py, stage 3

      Defining separate named functions for each match and apply rule isn't really necessary. You never call them directly; you - define them in the rules list and call them through there. Let's streamline the rules definition by anonymizing those functions. -

      Example 17.8. plural3.py

      +   define them in the rules list and call them through there.  Let's streamline the rules definition by anonymizing those functions.
      +

      Example 17.8. plural3.py

       import re
       
       rules = \
      @@ -14710,24 +14550,24 @@ def plural(noun):
       1 
       
       This is the same set of rules as you defined in stage 2.  The only difference is that instead of defining named functions
      -            like match_sxz and apply_sxz, you have “inlined” those function definitions directly into the rules list itself, using lambda functions.
      +            like match_sxz and apply_sxz, you have “inlined” those function definitions directly into the rules list itself, using lambda functions.
       
       
       
       2 
       
      -Note that the plural function hasn't changed at all.  It iterates through a set of rule functions, checks the first rule, and if it returns a
      +Note that the plural function hasn't changed at all.  It iterates through a set of rule functions, checks the first rule, and if it returns a
                   true value, calls the second rule and returns the value.  Same as above, word for word.  The only difference is that the rule
      -            functions were defined inline, anonymously, using lambda functions.  But the plural function doesn't care how they were defined; it just gets a list of rules and blindly works through them.
      +            functions were defined inline, anonymously, using lambda functions.  But the plural function doesn't care how they were defined; it just gets a list of rules and blindly works through them.
       
       
       
      -

      Now to add a new rule, all you need to do is define the functions directly in the rules list itself: one match rule, and one apply rule. But defining the rule functions inline like this makes it very clear that +

      Now to add a new rule, all you need to do is define the functions directly in the rules list itself: one match rule, and one apply rule. But defining the rule functions inline like this makes it very clear that you have some unnecessary duplication here. You have four pairs of functions, and they all follow the same pattern. The -match function is a single call to re.search, and the apply function is a single call to re.sub. Let's factor out these similarities. -

      17.5. plural.py, stage 4

      +match function is a single call to re.search, and the apply function is a single call to re.sub. Let's factor out these similarities. +

      17.5. plural.py, stage 4

      Let's factor out the duplication in the code so that defining new rules can be easier. -

      Example 17.9. plural4.py

      +

      Example 17.9. plural4.py

       import re
       
       def buildMatchAndApplyFunctions((pattern, search, replace)):  
      @@ -14739,26 +14579,26 @@ def buildMatchAndApplyFunctions((pattern, search, replace)):
       
       1 
       
      -buildMatchAndApplyFunctions is a function that builds other functions dynamically.  It takes pattern, search and replace (actually it takes a tuple, but more on that in a minute), and you can build the match function using the lambda syntax to be a function that takes one parameter (word) and calls re.search with the pattern that was passed to the buildMatchAndApplyFunctions function, and the word that was passed to the match function you're building.  Whoa.
      +buildMatchAndApplyFunctions is a function that builds other functions dynamically.  It takes pattern, search and replace (actually it takes a tuple, but more on that in a minute), and you can build the match function using the lambda syntax to be a function that takes one parameter (word) and calls re.search with the pattern that was passed to the buildMatchAndApplyFunctions function, and the word that was passed to the match function you're building.  Whoa.
       
       
       
       2 
       
      -Building the apply function works the same way.  The apply function is a function that takes one parameter, and calls re.sub with the search and replace parameters that were passed to the buildMatchAndApplyFunctions function, and the word that was passed to the apply function you're building.  This technique of using the values of outside parameters within a
      -            dynamic function is called closures.  You're essentially defining constants within the apply function you're building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function.
      +Building the apply function works the same way.  The apply function is a function that takes one parameter, and calls re.sub with the search and replace parameters that were passed to the buildMatchAndApplyFunctions function, and the word that was passed to the apply function you're building.  This technique of using the values of outside parameters within a
      +            dynamic function is called closures.  You're essentially defining constants within the apply function you're building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function.
       
       
       
       3 
       
      -Finally, the buildMatchAndApplyFunctions function returns a tuple of two values: the two functions you just created.  The constants you defined within those functions
      -            (pattern within matchFunction, and search and replace within applyFunction) stay with those functions, even after you return from buildMatchAndApplyFunctions.  That's insanely cool.
      +Finally, the buildMatchAndApplyFunctions function returns a tuple of two values: the two functions you just created.  The constants you defined within those functions
      +            (pattern within matchFunction, and search and replace within applyFunction) stay with those functions, even after you return from buildMatchAndApplyFunctions.  That's insanely cool.
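If you want to try the closure idea yourself, here is a sketch for Python 3. Python 3 no longer allows a tuple in the parameter list, so this version takes three ordinary parameters and unpacks each tuple at the call site; the snake_case names are this sketch's own, not the book's.

```python
import re


def build_match_and_apply_functions(pattern, search, replace):
    # Both inner functions remember pattern, search, and replace
    # after this outer function has returned: that is the closure.
    def match_function(word):
        return re.search(pattern, word)

    def apply_function(word):
        return re.sub(search, replace, word)

    return match_function, apply_function


patterns = (
    ('[sxz]$', '$', 'es'),
    ('[^aeioudgkprt]h$', '$', 'es'),
    ('[^aeiou]y$', 'y$', 'ies'),
    ('$', '$', 's'),
)

# One (match, apply) pair per pattern tuple.
rules = [build_match_and_apply_functions(*p) for p in patterns]


def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)
```

Calling `plural('coach')` walks the list until the second pair matches, then applies that pair's substitution.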
       
       
       
       

      If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it. -

      Example 17.10. plural4.py continued

      +

      Example 17.10. plural4.py continued

       patterns = \
         (
           ('[sxz]$', '$', 'es'),
      @@ -14773,18 +14613,18 @@ rules = map(buildMatchAndApplyFunctions, patterns)  1 
       
       Our pluralization rules are now defined as a series of strings (not functions).  The first string is the regular expression
      -            that you would use in re.search to see if this rule matches; the second and third are the search and replace expressions you would use in re.sub to actually apply the rule to turn a noun into its plural.
      +            that you would use in re.search to see if this rule matches; the second and third are the search and replace expressions you would use in re.sub to actually apply the rule to turn a noun into its plural.
       
       
       
       2 
       
      -This line is magic.  It takes the list of strings in patterns and turns them into a list of functions.  How?  By mapping the strings to the buildMatchAndApplyFunctions function, which just happens to take three strings as parameters and return a tuple of two functions.  This means that rules ends up being exactly the same as the previous example: a list of tuples, where each tuple is a pair of functions, where
      -            the first function is the match function that calls re.search, and the second function is the apply function that calls re.sub.
      +This line is magic.  It takes the list of strings in patterns and turns them into a list of functions.  How?  By mapping the strings to the buildMatchAndApplyFunctions function, which just happens to take three strings as parameters and return a tuple of two functions.  This means that rules ends up being exactly the same as the previous example: a list of tuples, where each tuple is a pair of functions, where
      +            the first function is the match function that calls re.search, and the second function is the apply function that calls re.sub.
       
       
       
      -

      I swear I am not making this up: rules ends up with exactly the same list of functions as the previous example. Unroll the rules definition, and you'll get this: +

      I swear I am not making this up: rules ends up with exactly the same list of functions as the previous example. Unroll the rules definition, and you'll get this:

      Example 17.11. Unrolling the rules definition

       rules = \
         (
      @@ -14805,7 +14645,7 @@ rules = \
            lambda word: re.sub('$', 's', word)
           )
          )      
      -

      Example 17.12. plural4.py, finishing up

      +

      Example 17.12. plural4.py, finishing up

       def plural(noun):                
           for matchesRule, applyRule in rules:            1
               if matchesRule(noun):    
      @@ -14815,13 +14655,13 @@ def plural(noun):
       
       1 
       
      -Since the rules list is the same as the previous example, it should come as no surprise that the plural function hasn't changed.  Remember, it's completely generic; it takes a list of rule functions and calls them in order. 
      -            It doesn't care how the rules are defined.  In stage 2, they were defined as seperate named functions.  In stage 3, they were defined as anonymous lambda functions.  Now in stage 4, they are built dynamically by mapping the buildMatchAndApplyFunctions function onto a list of raw strings.  Doesn't matter; the plural function still works the same way.
      +Since the rules list is the same as the previous example, it should come as no surprise that the plural function hasn't changed.  Remember, it's completely generic; it takes a list of rule functions and calls them in order. 
      +            It doesn't care how the rules are defined.  In stage 2, they were defined as separate named functions.  In stage 3, they were defined as anonymous lambda functions.  Now in stage 4, they are built dynamically by mapping the buildMatchAndApplyFunctions function onto a list of raw strings.  Doesn't matter; the plural function still works the same way.
       
       
       
      -

      Just in case that wasn't mind-blowing enough, I must confess that there was a subtlety in the definition of buildMatchAndApplyFunctions that I skipped over. Let's go back and take another look. -

      Example 17.13. Another look at buildMatchAndApplyFunctions

      +

      Just in case that wasn't mind-blowing enough, I must confess that there was a subtlety in the definition of buildMatchAndApplyFunctions that I skipped over. Let's go back and take another look. +

      Example 17.13. Another look at buildMatchAndApplyFunctions

       def buildMatchAndApplyFunctions((pattern, search, replace)):   1
       
      @@ -14830,7 +14670,7 @@ def buildMatchAndApplyFunctions((pattern, search, replace)): Notice the double parentheses? This function doesn't actually take three parameters; it actually takes one parameter, a tuple of three elements. But the tuple is expanded when the function is called, and the three elements of the tuple are each assigned - to different variables: pattern, search, and replace. Confused yet? Let's see it in action. + to different variables: pattern, search, and replace. Confused yet? Let's see it in action.
      @@ -14849,27 +14689,27 @@ apple 1 -The proper way to call the function foo is with a tuple of three elements. When the function is called, the elements are assigned to different local variables within -foo. +The proper way to call the function foo is with a tuple of three elements. When the function is called, the elements are assigned to different local variables within +foo. -

      Now let's go back and see why this auto-tuple-expansion trick was necessary. patterns was a list of tuples, and each tuple had three elements. When you called map(buildMatchAndApplyFunctions, patterns), that means that buildMatchAndApplyFunctions is not getting called with three parameters. Using map to map a single list onto a function always calls the function with a single parameter: each element of the list. In the - case of patterns, each element of the list is a tuple, so buildMatchAndApplyFunctions always gets called with the tuple, and you use the auto-tuple-expansion trick in the definition of buildMatchAndApplyFunctions to assign the elements of that tuple to named variables that you can work with. -

      17.6. plural.py, stage 5

      +

      Now let's go back and see why this auto-tuple-expansion trick was necessary. patterns was a list of tuples, and each tuple had three elements. When you called map(buildMatchAndApplyFunctions, patterns), that means that buildMatchAndApplyFunctions is not getting called with three parameters. Using map to map a single list onto a function always calls the function with a single parameter: each element of the list. In the + case of patterns, each element of the list is a tuple, so buildMatchAndApplyFunctions always gets called with the tuple, and you use the auto-tuple-expansion trick in the definition of buildMatchAndApplyFunctions to assign the elements of that tuple to named variables that you can work with. +

      17.6. plural.py, stage 5

      You've factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them.

      First, let's create a text file that contains the rules you want. No fancy data structures, just space- (or tab-)delimited -strings in three columns. You'll call it rules.en; “en” stands for English. These are the rules for pluralizing English nouns. You could add other rule files for other languages +strings in three columns. You'll call it rules.en; “en” stands for English. These are the rules for pluralizing English nouns. You could add other rule files for other languages later. -

      Example 17.15. rules.en

      +

      Example 17.15. rules.en

       [sxz]$                  $               es
       [^aeioudgkprt]h$        $               es
       [^aeiou]y$              y$              ies
       $     $               s
       

      Now let's see how you can use this rules file. -

      Example 17.16. plural5.py

      +

      Example 17.16. plural5.py

       import re
       import string               
       
      @@ -14897,13 +14737,13 @@ def plural(noun, language='en'):           2 
       
      -Our plural function now takes an optional second parameter, language, which defaults to en.
      +Our plural function now takes an optional second parameter, language, which defaults to en.
       
       
       
       3 
       
      -You use the language parameter to construct a filename, then open the file and read the contents into a list.  If language is en, then you'll open the rules.en file, read the entire thing, break it up by carriage returns, and return a list.  Each line of the file will be one element
      +You use the language parameter to construct a filename, then open the file and read the contents into a list.  If language is en, then you'll open the rules.en file, read the entire thing, break it up at line breaks, and return a list.  Each line of the file will be one element
                   in the list.
       
       
      @@ -14911,13 +14751,13 @@ def plural(noun, language='en'):           4 
       
       As you saw, each line in the file really has three values, but they're separated by whitespace (tabs or spaces, it makes no
      -            difference).  Mapping the string.split function onto this list will create a new list where each element is a tuple of three strings.  So a line like [sxz]$ $ es will be broken up into the tuple ('[sxz]$', '$', 'es').  This means that patterns will end up as a list of tuples, just like you hard-coded it in stage 4.
      +            difference).  Mapping the string.split function onto this list will create a new list where each element is a list of three strings.  So a line like [sxz]$ $ es will be broken up into the list ['[sxz]$', '$', 'es'].  This means that patterns will end up as a list of three-element sequences, just like the tuples you hard-coded in stage 4.
       
       
       
       5 
       
      -If patterns is a list of tuples, then rules will be a list of the functions created dynamically by each call to buildRule.  Calling buildRule(('[sxz]$', '$', 'es')) returns a function that takes a single parameter, word.  When this returned function is called, it will execute re.search('[sxz]$', word) and re.sub('$', 'es', word).
      +If patterns is a list of tuples, then rules will be a list of the functions created dynamically by each call to buildRule.  Calling buildRule(('[sxz]$', '$', 'es')) returns a function that takes a single parameter, word.  When this returned function is called, it will execute re.search('[sxz]$', word) and re.sub('$', 'es', word).
       
       
       
      @@ -14929,12 +14769,12 @@ def plural(noun, language='en'):           plural function can use different rule files, based on the language parameter.
      -

      The downside here is that you're reading that file every time you call the plural function. I thought I could get through this entire book without using the phrase “left as an exercise for the reader”, but here you go: building a caching mechanism for the language-specific rule files that auto-refreshes itself if the rule +file be maintained separately from the code, but you've set up a naming scheme where the same plural function can use different rule files, based on the language parameter. +

      The downside here is that you're reading that file every time you call the plural function. I thought I could get through this entire book without using the phrase “left as an exercise for the reader”, but here you go: building a caching mechanism for the language-specific rule files that auto-refreshes itself if the rule files change between calls is left as an exercise for the reader. Have fun. -
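The file-reading step can be sketched like this in Python 3. It is a rough outline, not the book's listing: `load_rules`, `build_rule`, and the `directory` parameter are names invented here, and no caching is attempted.

```python
import os
import re


def build_rule(pattern, search, replace):
    # Returns one function that applies the rule, or None when it doesn't match.
    def apply_rule(word):
        if re.search(pattern, word):
            return re.sub(search, replace, word)
        return None

    return apply_rule


def load_rules(path):
    # Each non-blank line holds three whitespace-separated columns.
    with open(path, encoding='utf-8') as rules_file:
        patterns = [line.split() for line in rules_file if line.strip()]
    return [build_rule(*fields) for fields in patterns]


def plural(noun, language='en', directory='.'):
    # The file is re-read on every call, which is the downside noted above.
    rules_path = os.path.join(directory, 'rules.' + language)
    for apply_rule in load_rules(rules_path):
        result = apply_rule(noun)
        if result is not None:
            return result
```

With a `rules.en` file like the one shown earlier in the same directory, `plural('box')` returns `'boxes'`.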

      17.7. plural.py, stage 6

      +

      17.7. plural.py, stage 6

      Now you're ready to talk about generators. -

      Example 17.17. plural6.py

      +

      Example 17.17. plural6.py

       import re
       
       def rules(language):           
      @@ -14972,40 +14812,40 @@ def plural(noun, language='en'):
       
       1 
       
      -The presence of the yield keyword in make_counter means that this is not a normal function.  It is a special kind of function which generates values one at a time.  You can
      +The presence of the yield keyword in make_counter means that this is not a normal function.  It is a special kind of function which generates values one at a time.  You can
                   think of it as a resumable function.  Calling it will return a generator that can be used to generate successive values of
      -x.
      +x.
       
       
       
       2 
       
      -To create an instance of the make_counter generator, just call it like any other function.  Note that this does not actually execute the function code.  You can tell
      -            this because the first line of make_counter is a print statement, but nothing has been printed yet.
      +To create an instance of the make_counter generator, just call it like any other function.  Note that this does not actually execute the function code.  You can tell
      +            this because the first line of make_counter is a print statement, but nothing has been printed yet.
       
       
       
       3 
       
      -The make_counter function returns a generator object.
      +The make_counter function returns a generator object.
       
       
       
       4 
       
      -The first time you call the next() method on the generator object, it executes the code in make_counter up to the first yield statement, and then returns the value that was yielded.  In this case, that will be 2, because you originally created the generator by calling make_counter(2).
      +The first time you call the next() method on the generator object, it executes the code in make_counter up to the first yield statement, and then returns the value that was yielded.  In this case, that will be 2, because you originally created the generator by calling make_counter(2).
       
       
       
       5 
       
      -Repeatedly calling next() on the generator object resumes where you left off and continues until you hit the next yield statement.  The next line of code waiting to be executed is the print statement that prints incrementing x, and then after that the x = x + 1 statement that actually increments it.  Then you loop through the while loop again, and the first thing you do is yield x, which returns the current value of x (now 3).
      +Repeatedly calling next() on the generator object resumes where you left off and continues until you hit the next yield statement.  The next line of code waiting to be executed is the print statement that prints incrementing x, and then after that the x = x + 1 statement that actually increments it.  Then you loop through the while loop again, and the first thing you do is yield x, which returns the current value of x (now 3).
       
       
       
       6 
       
      -The second time you call counter.next(), you do all the same things again, but this time x is now 4.  And so forth.  Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values.  But let's look at more productive uses of generators instead.
      +The second time you call counter.next(), you do all the same things again, but this time x is now 4.  And so forth.  Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values.  But let's look at more productive uses of generators instead.
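As a rough sketch of the behaviour these callouts describe, here is a Python 3 version that records events in a list instead of printing them. The log parameter is an addition for illustration and is not part of the original make_counter; Python 3 also spells counter.next() as next(counter).

```python
def make_counter(x, log):
    # `log` is a hypothetical stand-in for the print statements in the text.
    log.append("entering make_counter")
    while True:
        yield x
        log.append("incrementing x")
        x = x + 1

log = []
counter = make_counter(2, log)   # creates the generator; no function code runs yet
log_after_creation = list(log)   # snapshot taken before the first next() call

first = next(counter)    # runs up to the first yield
second = next(counter)   # resumes after the yield, increments, yields again
third = next(counter)
```

The snapshot taken right after creation is empty, which is the same observation the text makes about the missing print output.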
       
       
       
      @@ -15021,19 +14861,19 @@ def fibonacci(max):
       1 
       
       The Fibonacci sequence is a sequence of numbers where each number is the sum of the two numbers before it.  It starts with
      -0 and 1, goes up slowly at first, then more and more rapidly.  To start the sequence, you need two variables: a starts at 0, and b starts at 1.
      +0 and 1, goes up slowly at first, then more and more rapidly.  To start the sequence, you need two variables: a starts at 0, and b starts at 1.
       
       
       
       2 
       
      -a is the current number in the sequence, so yield it.
      +a is the current number in the sequence, so yield it.
       
       
       
       3 
       
      -b is the next number in the sequence, so assign that to a, but also calculate the next value (a+b) and assign that to b for later use.  Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a+b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b).
      +b is the next number in the sequence, so assign that to a, but also calculate the next value (a+b) and assign that to b for later use.  Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a+b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b).
       
       
       
      @@ -15048,17 +14888,17 @@ is easier to read.  Also, it works well with for loops.
       
       1 
       
      -You can use a generator like fibonacci in a for loop directly.  The for loop will create the generator object and successively call the next() method to get values to assign to the for loop index variable (n).
      +You can use a generator like fibonacci in a for loop directly.  The for loop will create the generator object and successively call the next() method to get values to assign to the for loop index variable (n).
       
       
       
       2 
       
      -Each time through the for loop, n gets a new value from the yield statement in fibonacci, and all you do is print it out.  Once fibonacci runs out of numbers (a gets bigger than max, which in this case is 1000), then the for loop exits gracefully.
      +Each time through the for loop, n gets a new value from the yield statement in fibonacci, and all you do is print it out.  Once fibonacci runs out of numbers (a gets bigger than max, which in this case is 1000), then the for loop exits gracefully.
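A minimal Python 3 sketch of the same idea, assuming the generator loops while a is below the limit (the parameter is renamed max_value here to avoid shadowing the built-in max):

```python
def fibonacci(max_value):
    a, b = 0, 1
    while a < max_value:
        yield a
        # Parallel assignment: both right-hand values are computed first.
        a, b = b, a + b

# The for loop creates the generator and calls next() on it behind the scenes.
for n in fibonacci(1000):
    print(n, end=" ")
print()

# A generator can also be drained into a list in one step.
values = list(fibonacci(1000))
```

When the while condition fails, the generator simply returns, and the for loop ends without any special handling on the caller's side.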
       
       
       
      -

      OK, let's go back to the plural function and see how you're using this. +

      OK, let's go back to the plural function and see how you're using this.

      Example 17.21. Generators that generate dynamic functions

       def rules(language):           
           for line in file('rules.%s' % language):      1
      @@ -15074,7 +14914,7 @@ def plural(noun, language='en'):
       
       1 
       
      -for line in file(...) is a common idiom for reading lines from a file, one line at a time.  It works because file actually returns a generator whose next() method returns the next line of the file.  That is so insanely cool, I wet myself just thinking about it.
      +for line in file(...) is a common idiom for reading lines from a file, one line at a time.  It works because file actually returns a generator whose next() method returns the next line of the file.  That is so insanely cool, I wet myself just thinking about it.
       
       
       
      @@ -15086,14 +14926,14 @@ def plural(noun, language='en'):
       
       3 
       
      -And then you yield.  What do you yield?  A function, built dynamically with lambda, that is actually a closure (it uses the local variables pattern, search, and replace as constants).  In other words, rules is a generator that spits out rule functions.
      +And then you yield.  What do you yield?  A function, built dynamically with lambda, that is actually a closure (it uses the local variables pattern, search, and replace as constants).  In other words, rules is a generator that spits out rule functions.
       
       
       
       4 
       
      -Since rules is a generator, you can use it directly in a for loop.  The first time through the for loop, you will call the rules function, which will open the rules file, read the first line out of it, dynamically build a function that matches and applies
      -            the first rule defined in the rules file, and yields the dynamically built function.  The second time through the for loop, you will pick up where you left off in rules (which was in the middle of the for line in file(...) loop), read the second line of the rules file, dynamically build another function that matches and applies the second rule
      +Since rules is a generator, you can use it directly in a for loop.  The first time through the for loop, you will call the rules function, which will open the rules file, read the first line out of it, dynamically build a function that matches and applies
      +            the first rule defined in the rules file, and yields the dynamically built function.  The second time through the for loop, you will pick up where you left off in rules (which was in the middle of the for line in file(...) loop), read the second line of the rules file, dynamically build another function that matches and applies the second rule
                   defined in the rules file, and yields it.  And so forth.
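The pattern can be sketched in Python 3 without touching the file system. Everything below is an illustration under assumptions: RULE_LINES is a hypothetical in-memory stand-in for the rules file, and the lambda binds its values through default arguments (a small departure from a plain closure) so that each yielded function stays correct even if a caller collects them all before using any.

```python
import re

# Hypothetical rule lines: pattern, search, replace (whitespace-separated).
RULE_LINES = [
    "[sxz]$ $ es",
    "[^aeioudgkprt]h$ $ es",
    "[^aeiou]y$ y$ ies",
    "$ $ s",
]

def rules(lines):
    """Yield one dynamically built rule function per line."""
    for line in lines:
        pattern, search, replace = line.split()
        # Default arguments freeze the current values for this function.
        yield lambda word, p=pattern, s=search, r=replace: (
            re.search(p, word) and re.sub(s, r, word)
        )

def plural(noun, lines=RULE_LINES):
    # Rules are built lazily: later lines are never parsed once one matches.
    for apply_rule in rules(lines):
        result = apply_rule(noun)
        if result:
            return result
```

Because rules is a generator, a noun that matches the first rule never causes the remaining lines to be parsed at all.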
       
       
      @@ -15171,7 +15011,7 @@ we use computerized database servers now.  Most database servers include a Sound
       too long, so you discard the excess character, leaving P426.
       

      Another example: Woo becomes W99, which becomes W9, which becomes W, which gets padded with zeros to become W000.

      Here's a first attempt at a Soundex function: -

      Example 18.1. soundex/stage1/soundex1a.py

      +

      Example 18.1. soundex/stage1/soundex1a.py

      If you have not already done so, you can download this and other examples used in this book.

       import string, re
       
      @@ -15251,7 +15091,7 @@ if __name__ == '__main__':
       
    • Soundexing and Genealogy gives a chronology of the evolution of the Soundex and its regional variations.
    -

    18.2. Using the timeit Module

    +

    18.2. Using the timeit Module

    The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function.

    Timing short pieces of code is incredibly complex. How much processor time is your computer devoting to running this code? Are there things running in the background? Are you sure? Every modern computer has background processes running, some all @@ -15262,8 +15102,8 @@ the first time, then turn off the service that's incessantly checking whether th

    And then there's the matter of the variations introduced by the timing framework itself. Does the Python interpreter cache method name lookups? Does it cache code block compilations? Regular expressions? Will your code have side effects if run more than once? Don't forget that you're dealing with small fractions of a second, so small mistakes in your timing framework will irreparably skew your results. -

    The Python community has a saying: “Python comes with batteries included.” Don't write your own timing framework. Python 2.3 comes with a perfectly good one called timeit. -

    Example 18.2. Introducing timeit

    +

    The Python community has a saying: “Python comes with batteries included.” Don't write your own timing framework. Python 2.3 comes with a perfectly good one called timeit. +

    Example 18.2. Introducing timeit

    If you have not already done so, you can download this and other examples used in this book.

     >>> import timeit
     >>> t = timeit.Timer("soundex.soundex('Pilgrim')",
    @@ -15277,22 +15117,22 @@ in your timing framework will irreparably skew your results.
     
     1 
     
    -The timeit module defines one class, Timer, which takes two arguments.  Both arguments are strings.  The first argument is the statement you wish to time; in this case,
    -            you are timing a call to the Soundex function within the soundex module with an argument of 'Pilgrim'.  The second argument to the Timer class is the import statement that sets up the environment for the statement.  Internally, timeit sets up an isolated virtual environment, manually executes the setup statement (importing the soundex module), then manually compiles and executes the timed statement (calling the Soundex function).
    +The timeit module defines one class, Timer, which takes two arguments.  Both arguments are strings.  The first argument is the statement you wish to time; in this case,
    +            you are timing a call to the Soundex function within the soundex module with an argument of 'Pilgrim'.  The second argument to the Timer class is the import statement that sets up the environment for the statement.  Internally, timeit sets up an isolated virtual environment, manually executes the setup statement (importing the soundex module), then manually compiles and executes the timed statement (calling the Soundex function).
     
     
     
     2 
     
    -Once you have the Timer object, the easiest thing to do is call timeit(), which calls your function 1 million times and returns the number of seconds it took to do it.
    +Once you have the Timer object, the easiest thing to do is call timeit(), which calls your function 1 million times and returns the number of seconds it took to do it.
     
     
     
     3 
     
    -The other major method of the Timer object is repeat(), which takes two optional arguments.  The first argument is the number of times to repeat the entire test, and the second
    +The other major method of the Timer object is repeat(), which takes two optional arguments.  The first argument is the number of times to repeat the entire test, and the second
                 argument is the number of times to call the timed statement within each test.  Both arguments are optional, and they default
    -            to 3 and 1000000 respectively.  The repeat() method returns a list of the times each test cycle took, in seconds.
    +            to 3 and 1000000 respectively.  The repeat() method returns a list of the times each test cycle took, in seconds.
     
     
     
    @@ -15301,17 +15141,17 @@ in your timing framework will irreparably skew your results.
     Tip
     
     
    -You can use the timeit module on the command line to test an existing Python program, without modifying the code.  See http://docs.python.org/lib/node396.html for documentation on the command-line flags.
    +You can use the timeit module on the command line to test an existing Python program, without modifying the code.  See http://docs.python.org/lib/node396.html for documentation on the command-line flags.
     
     
     
    -

    Note that repeat() returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the +

    Note that repeat() returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the Python interpreter is getting (and those pesky background processes that you can't get rid of). Your first thought might be to say “Let's take the average and call that The True Number.”

    In fact, that's almost certainly wrong. The tests that took longer didn't take longer because of variations in your code or in the Python interpreter; they took longer because of those pesky background processes, or other factors outside of the Python interpreter that you can't fully eliminate. If the different timing results differ by more than a few percent, you still have too much variability to trust the results. Otherwise, take the minimum time and discard the rest. -

    Python has a handy min function that takes a list and returns the smallest value: +

    Python has a handy min function that takes a list and returns the smallest value:

     >>> min(t.repeat(3, 1000000))
     8.22203948912
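For readers on Python 3, the same "repeat, then take the minimum" approach might look like the sketch below. The shout function is a hypothetical stand-in for soundex.soundex, and the globals argument (available in Python 3.5 and later) replaces the import-style setup string; the call count is reduced from the default so the example finishes quickly.

```python
import timeit

def shout(word):
    # Hypothetical function under test; stands in for soundex.soundex.
    return word.upper()

# The statement is still a string; `globals` tells timeit where to find shout.
timer = timeit.Timer("shout('Pilgrim')", globals={"shout": shout})

# Three test cycles of 10,000 calls each; repeat() returns one time per cycle.
times = timer.repeat(repeat=3, number=10000)

# Slower cycles reflect outside interference, so keep only the fastest.
best = min(times)
print("best of %d cycles: %.6f seconds" % (len(times), best))
```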
    @@ -15320,7 +15160,7 @@ have too much variability to trust the results.  Otherwise, take the minimum tim
     Tip
     
     
    -The timeit module only works if you already know what piece of code you need to optimize.  If you have a larger Python program and don't know where your performance problems are, check out the hotshot module.
    +The timeit module only works if you already know what piece of code you need to optimize.  If you have a larger Python program and don't know where your performance problems are, check out the hotshot module.
     
     
     

    18.3. Optimizing Regular Expressions

    @@ -15329,12 +15169,12 @@ have too much variability to trust the results. Otherwise, take the minimum tim

    If you answered “regular expressions”, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain. Also for performance reasons. -

    This code fragment from soundex/stage1/soundex1a.py checks whether the function argument source is a word made entirely of letters, with at least one letter (not the empty string): +

    This code fragment from soundex/stage1/soundex1a.py checks whether the function argument source is a word made entirely of letters, with at least one letter (not the empty string):

         allChars = string.uppercase + string.lowercase
         if not re.search('^[%s]+$' % allChars, source):
             return "0000"
    -

    How does soundex1a.py perform? For convenience, the __main__ section of the script contains this code that calls the timeit module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for +

    How does soundex1a.py perform? For convenience, the __main__ section of the script contains this code that calls the timeit module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for each:

     if __name__ == '__main__':
    @@ -15344,7 +15184,7 @@ if __name__ == '__main__':
             statement = "soundex('%s')" % name
             t = Timer(statement, "from __main__ import soundex")
             print name.ljust(15), soundex(name), min(t.repeat())
    -

    So how does soundex1a.py perform with this regular expression? +

    So how does soundex1a.py perform with this regular expression?

     C:\samples\soundex\stage1>python soundex1a.py
     Woo             W000 19.3356647283
    @@ -15356,60 +15196,60 @@ that it will never run in constant time.
     

    The other thing to keep in mind is that we are testing a representative sample of names. Woo is a kind of trivial case, in that it gets shortened down to a single letter and then padded with zeros. Pilgrim is a normal case, of average length and a mixture of significant and ignored letters. Flingjingwaller is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range of different cases.

    So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters -(A-Z in uppercase, and a-z in lowercase), we can use a shorthand regular expression syntax. Here is soundex/stage1/soundex1b.py: +(A-Z in uppercase, and a-z in lowercase), we can use a shorthand regular expression syntax. Here is soundex/stage1/soundex1b.py:

         if not re.search('^[A-Za-z]+$', source):
             return "0000"
    -

    timeit says soundex1b.py is slightly faster than soundex1a.py, but nothing to get terribly excited about: +

    timeit says soundex1b.py is slightly faster than soundex1a.py, but nothing to get terribly excited about:

     C:\samples\soundex\stage1>python soundex1b.py
     Woo             W000 17.1361133887
     Pilgrim         P426 21.8201693232
     Flingjingwaller F452 32.7262294509
     

    We saw in Section 15.3, “Refactoring” that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across -function calls, we can compile it once and use the compiled version. Here is soundex/stage1/soundex1c.py: +function calls, we can compile it once and use the compiled version. Here is soundex/stage1/soundex1c.py:

     isOnlyChars = re.compile('^[A-Za-z]+$').search
     def soundex(source):
         if not isOnlyChars(source):
             return "0000"
    -

    Using a compiled regular expression in soundex1c.py is significantly faster: +

    Using a compiled regular expression in soundex1c.py is significantly faster:

     C:\samples\soundex\stage1>python soundex1c.py
     Woo             W000 14.5348347346
     Pilgrim         P426 19.2784703084
     Flingjingwaller F452 30.0893873383
    -

    But is this the wrong path? The logic here is simple: the input source needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each +

    But is this the wrong path? The logic here is simple: the input source needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each character, and do away with regular expressions altogether? -

    Here is soundex/stage1/soundex1d.py: +

    Here is soundex/stage1/soundex1d.py:

         if not source:
             return "0000"
         for c in source:
             if not ('A' <= c <= 'Z') and not ('a' <= c <= 'z'):
                 return "0000"
    -

    It turns out that this technique in soundex1d.py is not faster than using a compiled regular expression (although it is faster than using a non-compiled regular expression): +

    It turns out that this technique in soundex1d.py is not faster than using a compiled regular expression (although it is faster than using a non-compiled regular expression):

     C:\samples\soundex\stage1>python soundex1d.py
     Woo             W000 15.4065058548
     Pilgrim         P426 22.2753567842
     Flingjingwaller F452 37.5845122774
    -

    Why isn't soundex1d.py faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this +

    Why isn't soundex1d.py faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this loop is written in Python, and runs through the Python interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted. Regular expressions are never the right answer... except when they are.

    It turns out that Python offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book. -The method is called isalpha(), and it checks whether a string contains only letters. -

    This is soundex/stage1/soundex1e.py: +The method is called isalpha(), and it checks whether a string contains only letters. +

    This is soundex/stage1/soundex1e.py:

         if (not source) or (not source.isalpha()):
             return "0000"
    -

    How much did we gain by using this specific method in soundex1e.py? Quite a bit. +

    How much did we gain by using this specific method in soundex1e.py? Quite a bit.

     C:\samples\soundex\stage1>python soundex1e.py
     Woo             W000 13.5069504644
     Pilgrim         P426 18.2199394057
     Flingjingwaller F452 28.9975225902
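To make the comparison concrete, here is a Python 3 sketch that puts three of the validation strategies side by side and checks that they agree. The function names are invented for this illustration. Each one rejects input that is empty or that contains a non-letter. One caveat: in Python 3, isalpha() also accepts non-ASCII letters, so the three only agree on ASCII input like the samples below.

```python
import re

# Compile once at import time and keep only the bound search method.
is_only_chars = re.compile(r'^[A-Za-z]+$').search

def valid_regex(source):
    return bool(is_only_chars(source))

def valid_loop(source):
    if not source:
        return False
    for c in source:
        if not ('A' <= c <= 'Z') and not ('a' <= c <= 'z'):
            return False
    return True

def valid_isalpha(source):
    # Reject when the string is empty OR contains anything but letters.
    return bool(source) and source.isalpha()

samples = ["Woo", "Pilgrim", "Flingjingwaller", "", "R2D2", "two words"]
results = [(s, valid_regex(s), valid_loop(s), valid_isalpha(s)) for s in samples]
```

Which of the three is fastest on a given interpreter is something to measure with timeit rather than assume.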
    -

    Example 18.3. Best Result So Far: soundex/stage1/soundex1e.py

    +

    Example 18.3. Best Result So Far: soundex/stage1/soundex1e.py

     import string, re
     
     charToSoundex = {"A": "9",
    @@ -15467,7 +15307,7 @@ if __name__ == '__main__':
     

    The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to do this?

    The most obvious solution is to define a dictionary with individual characters as keys and their corresponding digits as values, -and do dictionary lookups on each character. This is what we have in soundex/stage1/soundex1c.py (the current best result so far): +and do dictionary lookups on each character. This is what we have in soundex/stage1/soundex1c.py (the current best result so far):

     charToSoundex = {"A": "9",
                      "B": "1",
    @@ -15503,60 +15343,60 @@ def soundex(source):
         for s in source[1:]:
             s = s.upper()
             digits += charToSoundex[s]
    -

    You timed soundex1c.py already; this is how it performs: +

    You timed soundex1c.py already; this is how it performs:

     C:\samples\soundex\stage1>python soundex1c.py
     Woo             W000 14.5341678901
     Pilgrim         P426 19.2650071448
     Flingjingwaller F452 30.1003563302
    -

    This code is straightforward, but is it the best solution? Calling upper() on each individual character seems inefficient; it would probably be better to call upper() once on the entire string. -

    Then there's the matter of incrementally building the digits string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one. +

    This code is straightforward, but is it the best solution? Calling upper() on each individual character seems inefficient; it would probably be better to call upper() once on the entire string. +

    Then there's the matter of incrementally building the digits string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one.

    Python is good at lists, though. It can treat a string as a list of characters automatically. And lists are easy to combine into -strings again, using the string method join(). -

    Here is soundex/stage2/soundex2a.py, which converts letters to digits by using map and lambda: +strings again, using the string method join(). +

    Here is soundex/stage2/soundex2a.py, which converts letters to digits by using map and lambda:

     def soundex(source):
         # ...
         source = source.upper()
         digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
    -

    Surprisingly, soundex2a.py is not faster: +

    Surprisingly, soundex2a.py is not faster:

     C:\samples\soundex\stage2>python soundex2a.py
     Woo             W000 15.0097526362
     Pilgrim         P426 19.254806407
     Flingjingwaller F452 29.3790847719
     

    The overhead of the anonymous lambda function kills any performance you gain by dealing with the string as a list of characters. -

    soundex/stage2/soundex2b.py uses a list comprehension instead of map and lambda: +

    soundex/stage2/soundex2b.py uses a list comprehension instead of map and lambda:

         source = source.upper()
         digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
    -

    Using a list comprehension in soundex2b.py is faster than using map and lambda in soundex2a.py, but still not faster than the original code (incrementally building a string in soundex1c.py): +

    Using a list comprehension in soundex2b.py is faster than using map and lambda in soundex2a.py, but still not faster than the original code (incrementally building a string in soundex1c.py):

     C:\samples\soundex\stage2>python soundex2b.py
     Woo             W000 13.4221324219
     Pilgrim         P426 16.4901234654
     Flingjingwaller F452 25.8186157738
     

    It's time for a radically different approach. Dictionary lookups are a general purpose tool. Dictionary keys can be any -length string (or many other data types), but in this case we are only dealing with single-character keys and single-character values. It turns out that Python has a specialized function for handling exactly this situation: the string.maketrans function. -

    This is soundex/stage2/soundex2c.py: +length string (or many other data types), but in this case we are only dealing with single-character keys and single-character values. It turns out that Python has a specialized function for handling exactly this situation: the string.maketrans function. +

    This is soundex/stage2/soundex2c.py:

     allChar = string.uppercase + string.lowercase
     charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
     def soundex(source):
         # ...
         digits = source[0].upper() + source[1:].translate(charToSoundex)
    -

    What the heck is going on here? string.maketrans creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument +

    What the heck is going on here? string.maketrans creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument is the string ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz, and the second argument is the string 9123912992245591262391929291239129922455912623919292. See the pattern? It's the same conversion pattern we were setting up longhand with a dictionary. A maps to 9, B maps to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized data structure that you can access using the -string method translate, which translates each character into the corresponding digit, according to the matrix defined by string.maketrans. -

    timeit shows that soundex2c.py is significantly faster than defining a dictionary and looping through the input and building the output incrementally: +string method translate, which translates each character into the corresponding digit, according to the matrix defined by string.maketrans. +

    timeit shows that soundex2c.py is significantly faster than defining a dictionary and looping through the input and building the output incrementally:

     C:\samples\soundex\stage2>python soundex2c.py
     Woo             W000 11.437645008
     Pilgrim         P426 13.2825062962
     Flingjingwaller F452 18.5570110168
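The translation-table idea carries over to Python 3, where string.maketrans has become the static method str.maketrans and string.uppercase is spelled string.ascii_uppercase. The sketch below builds both the dictionary and the table from the same digit pattern and checks that they produce identical results; the helper names are made up for this illustration.

```python
import string

all_chars = string.ascii_uppercase + string.ascii_lowercase
digit_pattern = "91239129922455912623919292"   # one digit per letter, A through Z

# Two representations of the same mapping.
char_to_soundex_table = str.maketrans(all_chars, digit_pattern * 2)
char_to_soundex_dict = dict(zip(all_chars, digit_pattern * 2))

def digits_with_dict(source):
    # One dictionary lookup per character, joined at the end.
    return source[0].upper() + "".join(
        [char_to_soundex_dict[c] for c in source[1:]])

def digits_with_translate(source):
    # A single call converts every remaining character.
    return source[0].upper() + source[1:].translate(char_to_soundex_table)
```

Both functions return the intermediate digit string only (for example, W99 for Woo); removing duplicates and padding happen in later steps.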
     

    You're not going to get much better than that. Python has a specialized function that does exactly what you want to do; use it and move on. -

    Example 18.4. Best Result So Far: soundex/stage2/soundex2c.py

    +

    Example 18.4. Best Result So Far: soundex/stage2/soundex2c.py

     import string, re
     
     allChar = string.uppercase + string.lowercase
    @@ -15585,21 +15425,21 @@ if __name__ == '__main__':
             print name.ljust(15), soundex(name), min(t.repeat())
     

    18.5. Optimizing List Operations

    The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this? -

    Here's the code we have so far, in soundex/stage2/soundex2c.py: +

    Here's the code we have so far, in soundex/stage2/soundex2c.py:

         digits2 = digits[0]
         for d in digits[1:]:
             if digits2[-1] != d:
                 digits2 += d
    -

    Here are the performance results for soundex2c.py: +

    Here are the performance results for soundex2c.py:

     C:\samples\soundex\stage2>python soundex2c.py
     Woo             W000 12.6070768771
     Pilgrim         P426 14.4033353401
     Flingjingwaller F452 19.7774882003
    -

    The first thing to consider is whether it's efficient to check digits2[-1] each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate +

    The first thing to consider is whether it's efficient to check digits2[-1] each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate variable, and checking that instead? -

    To answer this question, here is soundex/stage3/soundex3a.py: +

    To answer this question, here is soundex/stage3/soundex3a.py:

         digits2 = ''
         last_digit = ''
    @@ -15607,20 +15447,20 @@ variable, and checking that instead?
             if d != last_digit:
                 digits2 += d
                 last_digit = d
    -

    soundex3a.py does not run any faster than soundex2c.py, and may even be slightly slower (although it's not enough of a difference to say for sure): +

    soundex3a.py does not run any faster than soundex2c.py, and may even be slightly slower (although it's not enough of a difference to say for sure):

     C:\samples\soundex\stage3>python soundex3a.py
     Woo             W000 11.5346048171
     Pilgrim         P426 13.3950636184
     Flingjingwaller F452 18.6108927252
    -

    Why isn't soundex3a.py faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing digits2[-1] is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have two variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating +

    Why isn't soundex3a.py faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing digits2[-1] is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have two variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating the list lookup.

    Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character in the list, and that's not easy to do with a straightforward list comprehension. -

    However, it is possible to create a list of index numbers using the built-in range() function, and use those index numbers to progressively search through the list and pull out each character that is different -from the previous character. That will give you a list of characters, and you can use the string method join() to reconstruct a string from that. -

    Here is soundex/stage3/soundex3b.py: +

    However, it is possible to create a list of index numbers using the built-in range() function, and use those index numbers to progressively search through the list and pull out each character that is different +from the previous character. That will give you a list of characters, and you can use the string method join() to reconstruct a string from that. +

    Here is soundex/stage3/soundex3b.py:

         digits2 = "".join([digits[i] for i in range(len(digits))
          if i == 0 or digits[i-1] != digits[i]])
    @@ -15630,9 +15470,9 @@ from the previous character.  That will give you a list of characters, and you c
     Woo             W000 14.2245271396
     Pilgrim         P426 17.8337165757
     Flingjingwaller F452 25.9954005327
    -

    It's possible that the techniques so far have been “string-centric”. Python can convert a string into a list of characters with a single command: list('abc') returns ['a', 'b', 'c']. Furthermore, lists can be modified in place very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around +

    It's possible that the techniques so far have been “string-centric”. Python can convert a string into a list of characters with a single command: list('abc') returns ['a', 'b', 'c']. Furthermore, lists can be modified in place very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around within a single list? -

    Here is soundex/stage3/soundex3c.py, which modifies a list in place to remove consecutive duplicate elements: +

    Here is soundex/stage3/soundex3c.py, which modifies a list in place to remove consecutive duplicate elements:

         digits = list(source[0].upper() + source[1:].translate(charToSoundex))
         i=0
    @@ -15642,14 +15482,14 @@ within a single list?
             digits[i]=item
         del digits[i+1:]
         digits2 = "".join(digits)
    -

    Is this faster than soundex3a.py or soundex3b.py? No, in fact it's the slowest method yet: +

    Is this faster than soundex3a.py or soundex3b.py? No, in fact it's the slowest method yet:

     C:\samples\soundex\stage3>python soundex3c.py
     Woo             W000 14.1662554878
     Pilgrim         P426 16.0397885765
     Flingjingwaller F452 22.1789341942
    -

    We haven't made any progress here at all, except to try and rule out several “clever” techniques. The fastest code we've seen so far was the original, most straightforward method (soundex2c.py). Sometimes it doesn't pay to be clever. -

    Example 18.5. Best Result So Far: soundex/stage2/soundex2c.py

    +

    We haven't made any progress here at all, except to try and rule out several “clever” techniques. The fastest code we've seen so far was the original, most straightforward method (soundex2c.py). Sometimes it doesn't pay to be clever. +

    Example 18.5. Best Result So Far: soundex/stage2/soundex2c.py

     import string, re
     
     allChar = string.uppercase + string.lowercase
    @@ -15679,61 +15519,61 @@ if __name__ == '__main__':
     

    18.6. Optimizing String Manipulation

    The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best way to do this? -

    This is what we have so far, taken from soundex/stage2/soundex2c.py: +

    This is what we have so far, taken from soundex/stage2/soundex2c.py:

         digits3 = re.sub('9', '', digits2)
         while len(digits3) < 4:
             digits3 += "0"
         return digits3[:4]
    -

    These are the results for soundex2c.py: +

    These are the results for soundex2c.py:

     C:\samples\soundex\stage2>python soundex2c.py
     Woo             W000 12.6070768771
     Pilgrim         P426 14.4033353401
     Flingjingwaller F452 19.7774882003
    -

    The first thing to consider is replacing that regular expression with a loop. This code is from soundex/stage4/soundex4a.py: +

    The first thing to consider is replacing that regular expression with a loop. This code is from soundex/stage4/soundex4a.py:

         digits3 = ''
         for d in digits2:
             if d != '9':
                 digits3 += d
    -

    Is soundex4a.py faster? Yes it is: +

    Is soundex4a.py faster? Yes it is:

     C:\samples\soundex\stage4>python soundex4a.py
     Woo             W000 6.62865531792
     Pilgrim         P426 9.02247576158
     Flingjingwaller F452 13.6328416042
    -

    But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's soundex/stage4/soundex4b.py: +

    But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's soundex/stage4/soundex4b.py:

         digits3 = digits2.replace('9', '')
    -

    Is soundex4b.py faster? That's an interesting question. It depends on the input: +

    Is soundex4b.py faster? That's an interesting question. It depends on the input:

     C:\samples\soundex\stage4>python soundex4b.py
     Woo             W000 6.75477414029
     Pilgrim         P426 7.56652144337
     Flingjingwaller F452 10.8727729362
    -

    The string method in soundex4b.py is faster than the loop for most names, but it's actually slightly slower than soundex4a.py in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case +

    The string method in soundex4b.py is faster than the loop for most names, but it's actually slightly slower than soundex4a.py in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case faster can sometimes make other cases slower. In this case, the majority of cases will benefit from the change, so let's leave it at that, but the principle is an important one to remember.
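
One way to check this kind of input-dependent result yourself is to time both approaches on inputs of different lengths. This is only a rough sketch in Python 3 using the standard timeit module; the helper names and sample strings are made up for illustration, and the timings will differ from machine to machine:

```python
import timeit


def strip_nines_loop(digits2):
    """Remove every '9' with an explicit loop, as in soundex4a.py."""
    digits3 = ''
    for d in digits2:
        if d != '9':
            digits3 += d
    return digits3


def strip_nines_replace(digits2):
    """Remove every '9' with the string method, as in soundex4b.py."""
    return digits2.replace('9', '')


# Hypothetical inputs: short, medium, and long digit strings.
for sample in ("W9", "P94296", "F4952959594996"):
    assert strip_nines_loop(sample) == strip_nines_replace(sample)
    loop_time = timeit.timeit(lambda: strip_nines_loop(sample), number=10000)
    replace_time = timeit.timeit(lambda: strip_nines_replace(sample), number=10000)
    print("{0:16} loop={1:.4f}s replace={2:.4f}s".format(sample, loop_time, replace_time))
```

Measuring several input sizes, rather than one, is what reveals that a change can help one case and hurt another.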

    Last but not least, let's examine the final two steps of the algorithm: padding short results with zeros, and truncating long -results to four characters. The code you see in soundex4b.py does just that, but it's horribly inefficient. Take a look at soundex/stage4/soundex4c.py to see why: +results to four characters. The code you see in soundex4b.py does just that, but it's horribly inefficient. Take a look at soundex/stage4/soundex4c.py to see why:

         digits3 += '000'
         return digits3[:4]
     

    Why do we need a while loop to pad out the result? We know in advance that we're going to truncate the result to four characters, and we know that -we already have at least one character (the initial letter, which is passed unchanged from the original source variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the +we already have at least one character (the initial letter, which is passed unchanged from the original source variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution. -
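
To convince yourself that the two approaches agree, you can compare them directly. This is a minimal sketch; the function names and test values are assumptions chosen for illustration, and it relies on the input already containing at least one character:

```python
def pad_with_loop(digits3):
    """Pad with zeros one at a time, then truncate (the soundex2c.py way)."""
    while len(digits3) < 4:
        digits3 += "0"
    return digits3[:4]


def pad_with_slice(digits3):
    """Always add three zeros, then truncate (the soundex4c.py way)."""
    return (digits3 + '000')[:4]


# Inputs shorter than, equal to, and longer than four characters.
for value in ("W", "P42", "P426", "F452526"):
    assert pad_with_loop(value) == pad_with_slice(value)
    print(value, "->", pad_with_slice(value))
```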

    How much speed do we gain in soundex4c.py by dropping the while loop? It's significant: +

    How much speed do we gain in soundex4c.py by dropping the while loop? It's significant:

     C:\samples\soundex\stage4>python soundex4c.py
     Woo             W000 4.89129791636
     Pilgrim         P426 7.30642134685
     Flingjingwaller F452 10.689832367
     

    Finally, there is still one more thing you can do to these three lines of code to make them faster: you can combine them into -one line. Take a look at soundex/stage4/soundex4d.py: +one line. Take a look at soundex/stage4/soundex4d.py:

         return (digits2.replace('9', '') + '000')[:4]
    -

    Putting all this code on one line in soundex4d.py is barely faster than soundex4c.py: +

    Putting all this code on one line in soundex4d.py is barely faster than soundex4c.py:

     C:\samples\soundex\stage4>python soundex4d.py
     Woo             W000 4.93624105857
    @@ -15752,7 +15592,7 @@ and maintainability.
     
  • If you need to choose between regular expressions and string methods, choose string methods. Both are compiled in C, so choose the simpler one. -
  • General-purpose dictionary lookups are fast, but specialty functions such as string.maketrans and string methods such as isalpha() are faster. If Python has a custom-tailored function for you, use it. +
  • General-purpose dictionary lookups are fast, but specialty functions such as string.maketrans and string methods such as isalpha() are faster. If Python has a custom-tailored function for you, use it.
  • Don't be too clever. Sometimes the most obvious algorithm is also the fastest.
  • Don't sweat it too much. Performance isn't everything. diff --git a/dip3.css b/dip3.css index 52aadce..5ebf326 100644 --- a/dip3.css +++ b/dip3.css @@ -7,14 +7,16 @@ a:visited{color:darkorchid} h1 a,h2 a,h3 a,#nav a{color:inherit !important} abbr,acronym{letter-spacing:0.1em;text-transform:lowercase;font-variant:small-caps} h1,h2,h3,p,ul,ol,#search{margin:1.75em 0} -#search div{float:right} +form div{float:right} li ol{margin:0} -h1,h2,h3{font-size:medium} +h1,h2,h3{font-size:medium;clear:both} h1{background:papayawhip;color:#000;width:100%;margin:0} pre{white-space:pre-wrap;margin:2.154em 0;padding:0 0 0 2.154em;border-left:1px dotted} -pre,kbd,code,samp{font-family:Consolas,Inconsolata,Monaco,monospace;font-size:medium;line-height:2.154} +pre,kbd,code,samp{font-family:Consolas,Inconsolata,Monaco,monospace;font-size:medium;line-height:2.154;word-spacing:0} +pre a{display:inline;padding:0.4375em 0;border:0} +pre a:hover{border:0} kbd{font-weight:bold} -samp.prompt{color:#667}/*the neighbor of the beast*/ +.prompt{color:#667}/*the neighbor of the beast*/ td pre{margin:0;padding:0;border:0} .c{text-align:center;font-size:small} p.fancy:first-letter{float:left;background:transparent;color:gainsboro;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif} @@ -31,9 +33,9 @@ span,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style .note{margin-left:4.94em} .note span{display:block;float:left;font-size:xx-large;line-height:0.875em;margin:0 0.22em 0 -1.22em} table.simple th{font-family:inherit !important} -.fr{width:auto;margin-top:4.308em;border:1px dotted} +.fr{width:100%;margin:2.154em 0;border:1px dotted} .fr h4{margin-top:-1.2em;margin-left:-1em;width:8.5em;border:1px dotted;padding: 3px 3px 3px 13px;background:#fff;color:inherit;position:relative} -tr.hover,li.hover{background:#eee;color:inherit;cursor:default} +.hover{background:#eee;color:inherit;cursor:default} body{counter-reset:h1} h1:before{counter-increment:h1;content:counter(h1) ". 
"} h1{counter-reset:h2} diff --git a/dip3.js b/dip3.js new file mode 100644 index 0000000..aabc146 --- /dev/null +++ b/dip3.js @@ -0,0 +1,75 @@ +window.onload = function() { +// synchronized highlighting for code blocks with callouts +var arPre = document.getElementsByTagName('pre'); +for (var i = arPre.length - 1; i >= 0; i--) { + var elmPre = arPre[i]; + var arCallout = elmPre.getElementsByTagName('span'); + if (arCallout.length == 0) { continue; } + var elmCalloutList = elmPre.nextSibling; + while (elmCalloutList && (elmCalloutList.nodeType != 1)) { + elmCalloutList = elmCalloutList.nextSibling; + } + if (elmCalloutList.nodeName.toLowerCase() != 'ol') { continue; } + var arCalloutListItem = elmCalloutList.getElementsByTagName('li'); + if (arCalloutListItem.length != arCallout.length) { + alert('Number of callouts != number of callout list items:\n' + elmPre.innerHTML); + continue; + } + for (var j = arCallout.length - 1; j >= 0; j--) { + var elmCallout = arCallout[j].parentNode; + var elmCalloutListItem = arCalloutListItem[j]; + elmCallout._li = elmCalloutListItem; + elmCalloutListItem._div = elmCallout; + elmCallout.onmouseover = function() { + this.className = 'hover'; + this._li.className = 'hover'; + }; + elmCalloutListItem.onmouseover = function() { + this.className = 'hover'; + this._div.className = 'hover'; + }; + elmCallout.onmouseout = function() { + this.className = ''; + this._li.className = ''; + }; + elmCalloutListItem.onmouseout = function() { + this.className = ''; + this._div.className = ''; + }; + + } +} + +// synchronized highlighting for tables with callouts +var arTables = document.getElementsByTagName('table'); +for (var i = arTables.length - 1; i >= 0; i--) { + var elmTable = arTables[i]; + var olNotes = document.getElementById("skip" + elmTable.id); + if (!olNotes) { continue; } + var arNotes = olNotes.getElementsByTagName('li'); + var arTableRows = elmTable.getElementsByTagName('tr'); + if (arNotes.length == 0) { continue; } + for (var j 
= arTableRows.length - 1; j >= 1; j--) { + var elmTableRow = arTableRows[j]; + var elmNote = arNotes[j - 1]; + elmTableRow._li = elmNote; + elmNote._tr = elmTableRow; + elmTableRow.onmouseover = function() { + this.className = 'hover'; + this._li.className = 'hover'; + }; + elmNote.onmouseover = function() { + this.className = 'hover'; + this._tr.className = 'hover'; + }; + elmTableRow.onmouseout = function() { + this.className = ''; + this._li.className = ''; + }; + elmNote.onmouseout = function() { + this.className = ''; + this._tr.className = ''; + }; + } +} +} diff --git a/index.html b/index.html index 211729a..ae38c5c 100644 --- a/index.html +++ b/index.html @@ -6,6 +6,7 @@ +

    skip to main content -

    +

    Porting code to Python 3 with 2to3

    Life is pleasant. Death is peaceful. It’s the transition that’s troublesome.
    — Isaac Asimov (attributed) @@ -144,8 +145,8 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}

    long data type

    -

    Python 2 had separate int and long types for non-floating-point numbers. An int could not be any larger than sys.maxint, which varied by platform. Longs were defined by appending an L to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called int, which mostly behaves like the long type in Python 2. Further reading: PEP 237: Unifying Long Integers and Integers. -

    Since there are no longer two types, there is no need for special syntax to distinguish them. +

    Python 2 had separate int and long types for non-floating-point numbers. An int could not be any larger than sys.maxint, which varied by platform. Longs were defined by appending an L to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called int, which mostly behaves like the long type in Python 2. Since there are no longer two types, there is no need for special syntax to distinguish them. +
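
A short Python 3 session illustrates the unified type. This sketch uses sys.maxsize, since sys.maxint no longer exists in Python 3; the variable name big is arbitrary:

```python
import sys

# One past the largest "machine-sized" value is still a plain int.
big = sys.maxsize + 1
print(type(big))

# Integers grow as large as they need to, with no L suffix.
print(2 ** 100)
```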

    Further reading: PEP 237: Unifying Long Integers and Integers.

    skip over this table @@ -1271,40 +1272,5 @@ do_stuff(a_list)

    FIXME: once the rest of the book is written, this appendix should contain copious links back to any chapter or section that touches on these features.

    © 2001-4, 2009 Mark Pilgrim, CC-BY-3.0 - diff --git a/your-first-python-program.html b/your-first-python-program.html index ca1f472..82182d6 100644 --- a/your-first-python-program.html +++ b/your-first-python-program.html @@ -6,12 +6,13 @@ +

    skip to main content -

      

    You are here: Dive Into Python 3 Chapter 2 +

      

    You are here: Dive Into Python 3 1. Your first Python program

    Your first Python program

    Don’t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate.
    Ven. Henepola Gunaratana @@ -20,6 +21,19 @@ body{counter-reset:h1 1}

  • Diving in
  • Declaring functions
  • Writing readable code +
      +
    1. Why bother? +
    2. Docstrings +
    3. Function annotations +
    4. Style conventions +
    +
  • Everything is an object +
      +
    1. The import search path +
    2. What's an object? +
    +
  • Indenting code +
  • Running scripts

    Diving in

    You know how other books go on and on about programming fundamentals and finally work up to building something useful? Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it. @@ -67,8 +81,8 @@ if __name__ == "__main__":

    In some languages, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's None), and all functions start with def.

    -

    The approximate_size function takes the two arguments — size and a_kilobyte_is_1024_bytes — but neither argument specifies a datatype. (As you might guess from the =True syntax, the second argument is a boolean. You'll learn what that syntax does in [FIXME xref].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally. -

    +

    The approximate_size function takes the two arguments — size and a_kilobyte_is_1024_bytes — but neither argument specifies a datatype. (As you might guess from the =True syntax, the second argument is a boolean. You'll learn what that syntax does in [FIXME xref-was-#apihelper].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally. +

    In Java, C++, and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
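
As a small illustration of Python tracking types for you, the sketch below asks each value what type it is. The sample values and the name kinds are arbitrary choices for this example:

```python
# No declarations anywhere: each value carries its own type.
samples = (4096, 4096.0, "4.0 KiB", True, None)
kinds = [type(value).__name__ for value in samples]
print(kinds)

# The same name can be rebound to a value of a different type.
size = 4096
size = "four kibibytes"
print(type(size).__name__)
```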

    How Python's Datatypes Compare to Other Programming Languages

    @@ -95,10 +109,10 @@ if __name__ == "__main__":
  • Notes
    Strongly typedPascal, JavaPython, Ruby

    Writing readable code

    - -FIXME - -

    You can document a Python function by giving it a docstring. In this program, the approximate_size function has a docstring: +

    Why bother?

    +

    FIXME +

    Documentation strings

    +

    You can document a Python function by giving it a documentation string (docstring for short). In this program, the approximate_size function has a docstring:

    def approximate_size(size, a_kilobyte_is_1024_bytes=True):
         """Convert a file size to human-readable form.
     
    @@ -110,14 +124,18 @@ FIXME
         Returns: string
     
         """
    -

    Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining a docstring. -

    +

    Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a docstring. +

    Triple quotes are also an easy way to define a string with both single and double quotes, like qq/.../ in Perl 5.

    Everything between the triple quotes is the function's docstring, which documents what the function does. A docstring, if it exists, must be the first thing defined in a function (that is, on the next line after the function declaration). You don't technically need to give your function a docstring, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the docstring is available at runtime as an attribute of the function.

    Many Python IDEs use the docstring to provide context-sensitive documentation, so that when you type a function name, its docstring appears as a tooltip. This can be incredibly helpful, but it's only as good as the docstrings you write.

    +

    Function annotations

    +

    FIXME +

    Style conventions

    +

    FIXME

    Further reading

    +

    Everything is an object

    +

    In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object. +

    Run the interactive Python shell and follow along: +

    +>>> import humansize                               
    +>>> print(humansize.approximate_size(4096, True))  
    +4.0 KiB
    +>>> print(humansize.approximate_size.__doc__)      
    +Convert a file size to human-readable form.
    +
    +    Keyword arguments:
    +    size -- file size in bytes
    +    a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
    +                                if False, use multiples of 1000
    +
    +    Returns: string
    +
    +
    +
      +
    1. The first line imports the humansize program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in [FIXME xref].) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this to access functionality in other modules, and you can do it in the Python interactive shell too. This is an important concept, and you'll see a lot more of it throughout this book. +
    2. When you want to use functions defined in imported modules, you need to include the module name. So you can't just say approximate_size; it must be humansize.approximate_size. If you've used classes in Java, this should feel vaguely familiar. +
    3. Instead of calling the function as you would expect to, you asked for one of the function's attributes, __doc__. +
    +
    +

    import in Python is like require in Perl. Once you import a Python module, you access its functions with module.function; once you require a Perl module, you access its functions with module::function. +

    +

    The import search path

    +

    Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in sys.path. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.) +

    +>>> import sys                       
    +>>> sys.path                         
    +['', '/usr/lib/python30.zip', '/usr/lib/python3.0', '/usr/lib/python3.0/plat-linux2@EXTRAMACHDEPPATH@', '/usr/lib/python3.0/lib-dynload', '/usr/lib/python3.0/dist-packages', '/usr/local/lib/python3.0/dist-packages']
    +>>> sys                              
    +<module 'sys' (built-in)>
    +>>> sys.path.append('/my/new/path')  
    +
      +
    1. Importing the sys module makes all of its functions and attributes available. +
    2. sys.path is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a .py file whose name matches what you're trying to import. +
    3. Actually, I lied; the truth is more complicated than that, because not all modules are stored as .py files. Some, like the sys module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The sys module is written in C.) +
    4. You can add a new directory to Python's search path at runtime by appending the directory name to sys.path, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll learn more about append() and other list methods in [FIXME xref-was-#datatypes].) +
    +

    What's an object?

    +

    Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute __doc__, which returns the docstring defined in the function's source code. The sys module is an object which has (among other things) an attribute called path. And so forth. +

    Still, this doesn't answer the more fundamental question: what is an object? Different programming languages define “object” in different ways. In some, it means that all objects must have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in [FIXME xref-was-#datatypes]), and not all objects are subclassable (more on this in [FIXME xref-was-#fileinfo]). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more on this in [FIXME xref-was-#apihelp]). +

    This is so important that I'm going to repeat it in case you missed it the first few times: everything in Python is an object. Strings are objects. Lists are objects. Functions are objects. Even modules are objects. +
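
Here is a minimal sketch of what "assigned to a variable or passed as an argument" looks like for a function and a module. The names shout, apply_to, and alias are invented for this example:

```python
import math


def shout(text):
    """Return text in upper case."""
    return text.upper()


def apply_to(func, value):
    # func is just another argument; it happens to be callable.
    return func(value)


alias = shout                         # assign the function object to a new name
print(alias("dive in"))
print(apply_to(shout, "dive in"))     # pass the function as an argument
print(shout.__doc__)                  # functions have attributes
print(type(math).__name__)            # modules are objects too
```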

    +
    +

    Further reading

    + +
    +

    Indenting code

    +

    Python functions have no explicit begin or end, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (:) and the indentation of the code itself. +

    
    +def approximate_size(size, a_kilobyte_is_1024_bytes=True):  
    +    if size < 0:                                            
    +        raise ValueError('number must be non-negative')     
    +                                                            
    +    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    +    for suffix in SUFFIXES[multiple]:                       
    +        size /= multiple
    +        if size < multiple:
    +            return "{0:.1f} {1}".format(size, suffix)
    +
    +    raise ValueError('number too large')
    +
      +
    1. Code blocks are defined by their indentation. By "code block," I mean functions, if statements, for loops, while loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. This means that whitespace is significant, and must be consistent. In this example, the function code is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not indented marks the end of the function. +
    2. In Python, an if statement is followed by a code block. If the if expression evaluates to true, the indented block is executed; otherwise, execution falls through to the else block (if any). (Note the lack of parentheses around the expression.) +
    3. This line is inside the if code block. This raise statement will raise an exception (of type ValueError), but only if size < 0. +
    4. This is not the end of the function. Completely blank lines don't count. The function continues on the next line. +
    5. The for loop also marks the start of a code block. Code blocks can contain multiple lines, as long as they are all indented the same amount. This for loop has three lines of code in it. There is no other special syntax for multi-line code blocks. Just indent and get on with your life. +
    +

    After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read and understand other people's Python code. +
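
If you want to see Python enforce these rules without writing broken files, you can hand it source code as a string. This is a rough sketch using the built-in compile() function; the helper name compiles and the three snippets are assumptions made up for illustration:

```python
def compiles(source):
    """Return True if Python accepts the indentation in source."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


four_spaces = "def f(x):\n    if x < 0:\n        return -x\n    return x\n"
two_spaces = "def f(x):\n  if x < 0:\n    return -x\n  return x\n"
# The last line is indented six spaces, which matches no enclosing block.
inconsistent = "def f(x):\n    if x < 0:\n        return -x\n      return x\n"

print(compiles(four_spaces))
print(compiles(two_spaces))
print(compiles(inconsistent))
```

The first two snippets differ only in how far each block is indented, which is why both are accepted: the amount is up to you, as long as it is consistent within a block.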

    +

    Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. C++ and Java use semicolons to separate statements and curly braces to separate code blocks. +

    +
    +

    Further reading

    + +
    +

    Running scripts

    +

    Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of humansize.py: +

    
    +if __name__ == "__main__":
    +    print(approximate_size(1000000000000, False))
    +    print(approximate_size(1000000000000))
    +
    +

    Like C, Python uses == for comparison and = for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing. +

    +

    So what makes this if statement special? Well, modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension. +

    >>> import humansize
    +>>> humansize.__name__
    +'humansize'
    +

    But you can also run the module directly as a standalone program, in which case __name__ will be a special default value, __main__. Python will evaluate this if statement, find a true expression, and execute the if code block. In this case, the block prints two values. +

    c:\home\diveintopython3> c:\python30\python.exe humansize.py
    +1.0 TB
    +931.3 GiB
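
Putting the pieces together, a self-contained sketch of such a script might look like the following. The function body and the final if block are taken from the excerpts above; the SUFFIXES table is an assumption, since its definition is not shown in this chapter excerpt:

```python
# Assumed lookup table: the suffix lists here are a guess at what humansize.py defines.
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}


def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    """Convert a file size to human-readable form."""
    if size < 0:
        raise ValueError('number must be non-negative')

    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    for suffix in SUFFIXES[multiple]:
        size /= multiple
        if size < multiple:
            return "{0:.1f} {1}".format(size, suffix)

    raise ValueError('number too large')


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(approximate_size(1000000000000, False))
    print(approximate_size(1000000000000))
```

Importing this file from another module would define approximate_size without printing anything, because __name__ would then be the module's name rather than "__main__".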

    © 2001-4, 2009 Mark Pilgrim, CC-BY-3.0