Dive Into Python

This book lives at http://diveintopython3.org/. If you're reading it somewhere else, you may not have the latest version.

Table of Contents

1. Installing Python
2. Your First Python Program
3. Native Datatypes
4. The Power Of Introspection
5. Objects and Object-Orientation
6. Exceptions and File Handling
7. Regular Expressions
8. HTML Processing
9. XML Processing
10. Scripts and Streams
11. HTTP Web Services
12. SOAP Web Services
13. Unit Testing
14. Test-First Programming
15. Refactoring
16. Functional Programming
17. Dynamic functions
18. Performance Tuning

Chapter 1. Installing Python

Welcome to Python. Let's dive in. In this chapter, you'll install the version of Python that's right for you.

1.1. Which Python is right for you?

The first thing you need to do with Python is install it. Or do you?

If you're using an account on a hosted server, your ISP may have already installed Python. Most popular Linux distributions come with Python in the default installation. Mac OS X 10.2 and later includes a command-line version of Python, although you'll probably want to install a version that includes a more Mac-like graphical interface.

Windows does not come with any version of Python, but don't despair! There are several ways to point-and-click your way to Python on Windows.

As you can see already, Python runs on a great many operating systems. The full list includes Windows, Mac OS, Mac OS X, and all varieties of free UNIX-compatible systems like Linux. There are also versions that run on Sun Solaris, AS/400, Amiga, OS/2, BeOS, and a plethora of other platforms you've probably never even heard of.

What's more, Python programs written on one platform can, with a little care, run on any supported platform. For instance, I regularly develop Python programs on Windows and later deploy them on Linux.

So back to the question that started this section, “Which Python is right for you?” The answer is whichever one runs on the computer you already have.

1.2. Python on Windows

On Windows, you have a couple of choices for installing Python.

ActiveState makes a Windows installer for Python called ActivePython, which includes a complete version of Python, an IDE with a Python-aware code editor, plus some Windows extensions for Python that allow complete access to Windows-specific services, APIs, and the Windows Registry.

ActivePython is freely downloadable, although it is not open source. It is the IDE I used to learn Python, and I recommend you try it unless you have a specific reason not to. One such reason might be that ActiveState is generally several months behind in updating their ActivePython installer when new versions of Python are released. If you absolutely need the latest version of Python and ActivePython is still a version behind as you read this, you'll want to use the second option for installing Python on Windows.

The second option is the “official” Python installer, distributed by the people who develop Python itself. It is freely downloadable and open source, and it is always current with the latest version of Python.

Procedure 1.1. Option 1: Installing ActivePython

Here is the procedure for installing ActivePython:

Download ActivePython from http://www.activestate.com/Products/ActivePython/.
If you are using Windows 95, Windows 98, or Windows ME, you will also need to download and install Windows Installer 2.0 before installing ActivePython.
Double-click the installer, ActivePython-2.2.2-224-win32-ix86.msi.
Step through the installer program.
If space is tight, you can do a custom installation and deselect the documentation, but I don't recommend this unless you absolutely can't spare the 14MB.
After the installation is complete, close the installer and choose Start->Programs->ActiveState ActivePython 2.2->PythonWin IDE. You'll see something like the following:

PythonWin 2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)] on win32.
Portions Copyright 1994-2001 Mark Hammond (mhammond@skippinet.com.au) -
see 'Help/About PythonWin' for further copyright information.
>>>

Procedure 1.2. Option 2: Installing Python from Python.org

Download the latest Python Windows installer by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then downloading the .exe installer.
Double-click the installer, Python-2.xxx.yyy.exe. The name will depend on the version of Python available when you read this.
Step through the installer program.
If disk space is tight, you can deselect the HTMLHelp file, the utility scripts (Tools/), and/or the test suite (Lib/test/).
If you do not have administrative rights on your machine, you can select Advanced Options, then choose Non-Admin Install. This just affects where Registry entries and Start menu shortcuts are created.
After the installation is complete, close the installer and select Start->Programs->Python 2.3->IDLE (Python GUI). You'll see something like the following:

Python 2.3.2 (#49, Oct  2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.

    ****************************************************************
    Personal firewall software may warn about the connection IDLE
    makes to its subprocess using this computer's internal loopback
    interface.  This connection is not visible on any external
    interface and no data is sent to or received from the Internet.
    ****************************************************************
    
IDLE 1.0
>>>

1.3. Python on Mac OS X

On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it.

Mac OS X 10.2 and later comes with a command-line version of Python preinstalled. If you are comfortable with the command line, you can use this version for the first third of the book. However, the preinstalled version does not come with an XML parser, so when you get to the XML chapter, you'll need to install the full version.

Rather than using the preinstalled version, you'll probably want to install the latest version, which also comes with a graphical interactive shell.

Procedure 1.3. Running the Preinstalled Version of Python on Mac OS X

To use the preinstalled version of Python, follow these steps:

Open the /Applications folder.
Open the Utilities folder.
Double-click Terminal to open a terminal window and get to a command line.
Type python at the command prompt.

Try it out:

Welcome to Darwin!
[localhost:~] you% python
Python 2.2 (#1, 07/14/02, 23:25:09)
[GCC Apple cpp-precomp 6.14] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> [press Ctrl+D to get back to the command prompt]
[localhost:~] you%

Procedure 1.4. Installing the Latest Version of Python on Mac OS X

Follow these steps to download and install the latest version of Python:

Download the MacPython-OSX disk image from http://homepages.cwi.nl/~jack/macpython/download.html.
If your browser has not already done so, double-click MacPython-OSX-2.3-1.dmg to mount the disk image on your desktop.
Double-click the installer, MacPython-OSX.pkg.
The installer will prompt you for your administrative username and password.
Step through the installer program.
After installation is complete, close the installer and open the /Applications folder.
Open the MacPython-2.3 folder.
Double-click PythonIDE to launch Python.

The MacPython IDE should display a splash screen, then take you to the interactive shell. If the interactive shell does not appear, select Window->Python Interactive (Cmd-0). The opening window will look something like this:

Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1
>>>

Note that once you install the latest version, the pre-installed version is still present. If you are running scripts from the command line, you need to be aware of which version of Python you are using.

Example 1.1. Two versions of Python

[localhost:~] you% python
Python 2.2 (#1, 07/14/02, 23:25:09)
[GCC Apple cpp-precomp 6.14] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> [press Ctrl+D to get back to the command prompt]
[localhost:~] you% /usr/local/bin/python
Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> [press Ctrl+D to get back to the command prompt]
[localhost:~] you%

1.4. Python on Mac OS 9

Mac OS 9 does not come with any version of Python, but installation is very simple, and there is only one choice.

Follow these steps to install Python on Mac OS 9:

Download the MacPython23full.bin file from http://homepages.cwi.nl/~jack/macpython/download.html.
If your browser does not decompress the file automatically, double-click MacPython23full.bin to decompress the file with Stuffit Expander.
Double-click the installer, MacPython23full.
Step through the installer program.
After installation is complete, close the installer and open the /Applications folder.
Open the MacPython-OS9 2.3 folder.
Double-click Python IDE to launch Python.

The MacPython IDE should display a splash screen, and then take you to the interactive shell. If the interactive shell does not appear, select Window->Python Interactive (Cmd-0). You'll see a screen like this:

Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1
>>>

1.5. Python on RedHat Linux

Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built binary packages are available for most popular Linux distributions. Or you can always compile from source.

Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the rpms/ directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here:

Example 1.2. Installing on RedHat Linux 9

localhost:~$ su -
Password: [enter your root password]
[root@localhost root]# wget http://python.org/ftp/python/2.3/rpms/redhat-9/python2.3-2.3-5pydotorg.i386.rpm
Resolving python.org... done.
Connecting to python.org[194.109.137.226]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,495,111 [application/octet-stream]
...
[root@localhost root]# rpm -Uvh python2.3-2.3-5pydotorg.i386.rpm
Preparing...                ########################################### [100%]
   1:python2.3              ########################################### [100%]
[root@localhost root]# python          
Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2
Type "help", "copyright", "credits", or "license" for more information.
>>> [press Ctrl+D to exit]
[root@localhost root]# python2.3       
Python 2.3 (#1, Sep 12 2003, 10:53:56)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits", or "license" for more information.
>>> [press Ctrl+D to exit]
[root@localhost root]# which python2.3 
/usr/bin/python2.3

	Whoops! Just typing `python` gives you the older version of Python -- the one that was installed by default. That's not the one you want.
	At the time of this writing, the newest version is called `python2.3`. You'll probably want to change the path on the first line of the sample scripts to point to the newer version.
	This is the complete path of the newer version of Python that you just installed. Use this on the `#!` line (the first line of each script) to ensure that scripts are running under the latest version of Python, and be sure to type `python2.3` to get into the interactive shell.

1.6. Python on Debian GNU/Linux

If you are lucky enough to be running Debian GNU/Linux, you install Python through the apt command.

Example 1.3. Installing on Debian GNU/Linux

localhost:~$ su -
Password: [enter your root password]
localhost:~# apt-get install python
Reading Package Lists... Done
Building Dependency Tree... Done
The following extra packages will be installed:
  python2.3
Suggested packages:
  python-tk python2.3-doc
The following NEW packages will be installed:
  python python2.3
0 upgraded, 2 newly installed, 0 to remove and 3 not upgraded.
Need to get 0B/2880kB of archives.
After unpacking 9351kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Selecting previously deselected package python2.3.
(Reading database ... 22848 files and directories currently installed.)
Unpacking python2.3 (from .../python2.3_2.3.1-1_i386.deb) ...
Selecting previously deselected package python.
Unpacking python (from .../python_2.3.1-1_all.deb) ...
Setting up python (2.3.1-1) ...
Setting up python2.3 (2.3.1-1) ...
Compiling python modules in /usr/lib/python2.3 ...
Compiling optimized python modules in /usr/lib/python2.3 ...
localhost:~# exit
logout
localhost:~$ python
Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
[GCC 3.3.2 20030908 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> [press Ctrl+D to exit]

1.7. Python Installation from Source

If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the .tgz file, and then do the usual configure, make, make install dance.

Example 1.4. Installing from source

localhost:~$ su -
Password: [enter your root password]
localhost:~# wget http://www.python.org/ftp/python/2.3/Python-2.3.tgz
Resolving www.python.org... done.
Connecting to www.python.org[194.109.137.226]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8,436,880 [application/x-tar]
...
localhost:~# tar xfz Python-2.3.tgz
localhost:~# cd Python-2.3
localhost:~/Python-2.3# ./configure
checking MACHDEP... linux2
checking EXTRAPLATDIR...
checking for --without-gcc... no
...
localhost:~/Python-2.3# make
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include  -DPy_BUILD_CORE -o Modules/python.o Modules/python.c
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include  -DPy_BUILD_CORE -o Parser/acceler.o Parser/acceler.c
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include  -DPy_BUILD_CORE -o Parser/grammar1.o Parser/grammar1.c
...
localhost:~/Python-2.3# make install
/usr/bin/install -c python /usr/local/bin/python2.3
...
localhost:~/Python-2.3# exit
logout
localhost:~$ which python
/usr/local/bin/python
localhost:~$ python
Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
[GCC 3.3.2 20030908 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> [press Ctrl+D to get back to the command prompt]
localhost:~$

1.8. The Interactive Shell

Now that you have Python installed, what's this interactive shell thing you're running?

It's like this: Python leads a double life. It's an interpreter for scripts that you can run from the command line or run like applications, by double-clicking the scripts. But it's also an interactive shell that can evaluate arbitrary statements and expressions. This is extremely useful for debugging, quick hacking, and testing. I even know some people who use the Python interactive shell in lieu of a calculator!

Launch the Python interactive shell in whatever way works on your platform, and let's dive in with the steps shown here:

Example 1.5. First Steps in the Interactive Shell

>>> 1 + 1               
2
>>> print 'hello world' 
hello world
>>> x = 1               
>>> y = 2
>>> x + y
3

	The Python interactive shell can evaluate arbitrary Python expressions, including any basic arithmetic expression.
	The interactive shell can execute arbitrary Python statements, including the `print` statement.
	You can also assign values to variables, and the values will be remembered as long as the shell is open (but not any longer than that).

1.9. Summary

You should now have a version of Python installed that works for you.

Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing python on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
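If you are not sure which interpreter a given command actually starts, you can ask Python itself. The following is a minimal sketch that uses the standard `sys` module; the helper name `describe_interpreter` is made up for this illustration, and the code is written so that it should run on current Python 3 interpreters.

```python
import sys

def describe_interpreter():
    """Return the version string and full path of the running interpreter."""
    # sys.version_info starts with (major, minor, micro)
    version = "%d.%d.%d" % sys.version_info[:3]
    # sys.executable is the pathname of the interpreter binary
    # (it can be an empty string if Python cannot determine it)
    return version, sys.executable

version, path = describe_interpreter()
print("Python %s at %s" % (version, path))
```

Run it once with each command you have available (for example `python` and a full pathname), and compare the paths that are printed.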

Congratulations, and welcome to Python.

Chapter 2. Your First Python Program

You know how other books go on and on about programming fundamentals and finally work up to building a complete, working program? Let's skip all that.

2.1. Diving in

Here is a complete, working Python program.

It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.

Example 2.1. `odbchelper.py`

If you have not already done so, you can download this and other examples used in this book.

def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }
    print buildConnectionString(myParams)

Now run this program and see what happens.


	In the ActivePython IDE on Windows, you can run the Python program you're editing by choosing File->Run... (`Ctrl-R`). Output is displayed in the interactive window.


	In the Python IDE on Mac OS, you can run a Python program with Python->Run window... (`Cmd-R`), but there is an important option you must set first. Open the `.py` file in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked. This is a per-file setting, but you'll only need to do it once per file.


	On UNIX-compatible systems (including Mac OS X), you can run a Python program from the command line: `python odbchelper.py`

The output of odbchelper.py will look like this:

server=mpilgrim;uid=sa;database=master;pwd=secret
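The listing above uses the `print` statement, which exists only in Python 2. If you happen to be following along with a Python 3 interpreter (an assumption; the examples in this book target Python 2), here is a minimal sketch of an equivalent program in which `print` is called as a function. Nothing else about the function needs to change.

```python
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

if __name__ == "__main__":
    myParams = {"server": "mpilgrim",
                "database": "master",
                "uid": "sa",
                "pwd": "secret"}
    # In Python 3, print is a function, so it needs parentheses.
    print(buildConnectionString(myParams))
```

On Python 3.7 and later, dictionaries keep their insertion order, so the parameters come out in the order you typed them rather than in the order shown in the output above.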

2.2. Declaring Functions

Python has functions like most other languages, but it does not have separate header files like C++ or interface/implementation sections like Pascal. When you need a function, just declare it, like this:

def buildConnectionString(params):

Note that the keyword def starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments (not shown here) are separated with commas.

Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. In fact, every Python function returns a value; if the function ever executes a return statement, it will return that value, otherwise it will return None, the Python null value.
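To see the "every function returns a value" rule in action, here is a small sketch. The two function names are invented for this illustration; the code is written to run on current Python 3 interpreters.

```python
def with_return(x):
    """Executes a return statement, so the caller gets that value."""
    return x * 2

def without_return(x):
    """Never executes a return statement, so the caller gets None."""
    x * 2  # the result is computed and then thrown away

doubled = with_return(21)
nothing = without_return(21)

print(doubled)          # 42
print(nothing is None)  # True
```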


	In Visual Basic, functions (that return a value) start with `function`, and subroutines (that do not return a value) start with `sub`. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's `None`), and all functions start with `def`.

The argument, params, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.


	In Java, C++, and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.

2.2.1. How Python's Datatypes Compare to Other Programming Languages

An erudite reader sent me this explanation of how Python compares to other programming languages:

statically typed language: A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare all variables with their datatypes before using them. Java and C are statically typed languages.
dynamically typed language: A language in which types are discovered at execution time; the opposite of statically typed. VBScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
strongly typed language: A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
weakly typed language: A language in which types may be ignored; the opposite of strongly typed. VBScript is weakly typed. In VBScript, you can concatenate the string '12' and the integer 3 to get the string '123', then treat that as the integer 123, all without any explicit conversion.

So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters).
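Both halves of that statement can be checked in a few lines. This is only a sketch, written for current Python 3 interpreters; the variable names are arbitrary.

```python
x = 1       # x refers to an integer
x = "12"    # the same name can later refer to a string (dynamic typing)

try:
    result = x + 3      # string + integer: Python refuses to guess (strong typing)
except TypeError:
    result = None       # the addition above raised TypeError

as_number = int(x) + 3  # explicit conversion to integer gives 15
as_text = x + str(3)    # explicit conversion to string gives '123'

print(result, as_number, as_text)
```

Contrast this with the VBScript behavior described above, where the conversion happens silently.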

2.3. Documenting Functions

You can document a Python function by giving it a docstring.

Example 2.2. Defining the `buildConnectionString` Function's `docstring`

def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""

Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining a docstring.


	Triple quotes are also an easy way to define a string with both single and double quotes, like `qq/.../` in Perl.

Everything between the triple quotes is the function's docstring, which documents what the function does. A docstring, if it exists, must be the first thing defined in a function (that is, the first thing after the colon). You don't technically need to give your function a docstring, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the docstring is available at runtime as an attribute of the function.
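Here is a short sketch of what "available at runtime" means in practice. The attribute is called `__doc__`; the variable names `doc` and `first_line` are just for this illustration, and the code is written for current Python 3 interpreters.

```python
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

# The docstring is stored on the function object itself.
doc = buildConnectionString.__doc__
first_line = doc.splitlines()[0]

print(first_line)
```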


	Many Python IDEs use the `docstring` to provide context-sensitive documentation, so that when you type a function name, its `docstring` appears as a tooltip. This can be incredibly helpful, but it's only as good as the `docstring`s you write.

2.4. Everything Is an Object

2.6. Testing Modules

Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them. Here's an example that uses the if __name__ trick.

if __name__ == "__main__":

Some quick observations before you get to the good stuff. First, parentheses are not required around the if expression. Second, the if statement ends with a colon, and is followed by indented code.


	Like C, Python uses `==` for comparison and `=` for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.

So why is this particular if statement a trick? Modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone program, in which case __name__ will be a special default value, __main__.

>>> import odbchelper
>>> odbchelper.__name__
'odbchelper'

Knowing this, you can design a test suite for your module within the module itself by putting it in this if statement. When you run the module directly, __name__ is __main__, so the test suite executes. When you import the module, __name__ is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating them into a larger program.


	On MacPython, there is an additional step to make the `if` `__name__` trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and make sure Run as __main__ is checked.

Chapter 3. Native Datatypes

3.2. Introducing Lists

Lists are Python's workhorse datatype. If your only experience with lists is arrays in Visual Basic or (God forbid) the datastore in Powerbuilder, brace yourself for Python lists.


	A list in Python is like an array in Perl. In Perl, variables that store arrays always start with the `@` character; in Python, variables can be named anything, and Python keeps track of the datatype internally.


	A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the `ArrayList` class, which can hold arbitrary objects and can expand dynamically as new items are added.

3.2.1. Defining Lists

Example 3.6. Defining a List

>>> li = ["a", "b", "mpilgrim", "z", "example"] 
>>> li
['a', 'b', 'mpilgrim', 'z', 'example']
>>> li[0]   
'a'
>>> li[4]   
'example'

	First, you define a list of five elements. Note that they retain their original order. This is not an accident. A list is an ordered set of elements enclosed in square brackets.
	A list can be used like a zero-based array. The first element of any non-empty list is always `li[0]`.
	The last element of this five-element list is `li[4]`, because lists are always zero-based.

Example 3.7. Negative List Indices

>>> li
['a', 'b', 'mpilgrim', 'z', 'example']
>>> li[-1] 
'example'
>>> li[-3] 
'mpilgrim'

	A negative index accesses elements from the end of the list counting backwards. The last element of any non-empty list is always `li[-1]`.
	If the negative index is confusing to you, think of it this way: `li[-n] == li[len(li) - n]`. So in this list, `li[-3] == li[5 - 3] == li[2]`.
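If you'd rather let Python confirm that identity than take it on faith, this small sketch checks it for every valid negative index of the same list. It is written for current Python 3 interpreters.

```python
li = ["a", "b", "mpilgrim", "z", "example"]

# Check li[-n] == li[len(li) - n] for n = 1 .. len(li)
for n in range(1, len(li) + 1):
    assert li[-n] == li[len(li) - n]

third_from_end = li[-3]   # same element as li[5 - 3], that is, li[2]
print(third_from_end)     # mpilgrim
```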

Example 3.8. Slicing a List

>>> li
['a', 'b', 'mpilgrim', 'z', 'example']
>>> li[1:3]  
['b', 'mpilgrim']
>>> li[1:-1] 
['b', 'mpilgrim', 'z']
>>> li[0:3]  
['a', 'b', 'mpilgrim']

	You can get a subset of a list, called a “slice”, by specifying two indices. The return value is a new list containing all the elements of the list, in order, starting with the first slice index (in this case `li[1]`), up to but not including the second slice index (in this case `li[3]`).
	Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first element you want, and the second slice index specifies the first element you don't want. The return value is everything in between.
	Lists are zero-based, so `li[0:3]` returns the first three elements of the list, starting at `li[0]`, up to but not including `li[3]`.

Example 3.9. Slicing Shorthand

>>> li
['a', 'b', 'mpilgrim', 'z', 'example']
>>> li[:3] 
['a', 'b', 'mpilgrim']
>>> li[3:]  
['z', 'example']
>>> li[:]  
['a', 'b', 'mpilgrim', 'z', 'example']

	If the left slice index is 0, you can leave it out, and 0 is implied. So `li[:3]` is the same as `li[0:3]` from Example 3.8, “Slicing a List”.
	Similarly, if the right slice index is the length of the list, you can leave it out. So `li[3:]` is the same as `li[3:5]`, because this list has five elements.
	Note the symmetry here. In this five-element list, `li[:3]` returns the first 3 elements, and `li[3:]` returns the last two elements. In fact, `li[:n]` will always return the first `n` elements, and `li[n:]` will return the rest, regardless of the length of the list.
	If both slice indices are left out, all elements of the list are included. But this is not the same as the original `li` list; it is a new list that happens to have all the same elements. `li[:]` is shorthand for making a complete copy of a list.
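The last two points are easy to verify for yourself. The following sketch, written for current Python 3 interpreters, checks that `li[:n]` and `li[n:]` always add up to the whole list, and that `li[:]` really is a separate list.

```python
li = ["a", "b", "mpilgrim", "z", "example"]

# The two halves of a split always reassemble into the original.
for n in range(len(li) + 1):
    assert li[:n] + li[n:] == li

copy = li[:]          # a new list with the same elements
copy.append("new")    # changes only the copy

print(len(li), len(copy))   # 5 6
```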

3.2.2. Adding Elements to Lists

Example 3.10. Adding Elements to a List

>>> li
['a', 'b', 'mpilgrim', 'z', 'example']
>>> li.append("new")               
>>> li
['a', 'b', 'mpilgrim', 'z', 'example', 'new']
>>> li.insert(2, "new")            
>>> li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new']
>>> li.extend(["two", "elements"]) 
>>> li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']

	`append` adds a single element to the end of the list.
	`insert` inserts a single element into a list. The numeric argument is the index of the first element that gets bumped out of position. Note that list elements do not need to be unique; there are now two separate elements with the value `'new'`, `li[2]` and `li[6]`.
	`extend` concatenates lists. Note that you do not call `extend` with multiple arguments; you call it with one argument, a list. In this case, that list has two elements.

Example 3.11. The Difference between `extend` and `append`

>>> li = ['a', 'b', 'c']
>>> li.extend(['d', 'e', 'f']) 
>>> li
['a', 'b', 'c', 'd', 'e', 'f']
>>> len(li)  
6
>>> li[-1]
'f'
>>> li = ['a', 'b', 'c']
>>> li.append(['d', 'e', 'f']) 
>>> li
['a', 'b', 'c', ['d', 'e', 'f']]
>>> len(li)  
4
>>> li[-1]
['d', 'e', 'f']

	Lists have two methods, `extend` and `append`, that look like they do the same thing, but are in fact completely different. `extend` takes a single argument, which is always a list, and adds each of the elements of that list to the original list.
	Here you started with a list of three elements (`'a'`, `'b'`, and `'c'`), and you extended the list with a list of another three elements (`'d'`, `'e'`, and `'f'`), so you now have a list of six elements.
	On the other hand, `append` takes one argument, which can be any data type, and simply adds it to the end of the list. Here, you're calling the `append` method with a single argument, which is a list of three elements.
	Now the original list, which started as a list of three elements, contains four elements. Why four? Because the last element that you just appended is itself a list. Lists can contain any type of data, including other lists. That may be what you want, or maybe not. Don't use `append` if you mean `extend`.
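One related pitfall is worth a quick sketch. In current Python 3 interpreters (an assumption; this may not match the Python version the rest of this chapter was written against), `extend` accepts any iterable, not only a list. That includes strings, which are added one character at a time.

```python
li = ["a", "b", "c"]

li.extend(("d", "e"))   # a tuple: each item is added separately
li.extend("fg")         # a string is iterable too: adds 'f', then 'g'
li.append("hi")         # append adds the whole string as one element

print(li)   # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'hi']
```

So if you want to add a single string to a list, `append` is the method you mean.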

3.2.3. Searching Lists

Example 3.12. Searching a List

>>> li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
>>> li.index("example") 
5
>>> li.index("new")     
2
>>> li.index("c")       
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: list.index(x): x not in list
>>> "c" in li           
False

	`index` finds the first occurrence of a value in the list and returns the index.
	`index` finds the first occurrence of a value in the list. In this case, `'new'` occurs twice in the list, in `li[2]` and `li[6]`, but `index` will return only the first index, `2`.
	If the value is not found in the list, Python raises an exception. This is notably different from most languages, which will return some invalid index. While this may seem annoying, it is a good thing, because it means your program will crash at the source of the problem, rather than later on when you try to use the invalid index.
	To test whether a value is in the list, use `in`, which returns `True` if the value is found or `False` if it is not.
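When a missing value is an expected situation rather than a bug, you can catch the exception yourself. This is a minimal sketch; the helper name `position_or_none` is made up for this illustration, and the code is written for current Python 3 interpreters.

```python
li = ["a", "b", "new", "mpilgrim", "z", "example", "new", "two", "elements"]

def position_or_none(items, value):
    """Return the index of the first occurrence of value, or None if absent."""
    try:
        return items.index(value)
    except ValueError:
        # index raised the exception because value is not in the list
        return None

print(position_or_none(li, "new"))   # 2
print(position_or_none(li, "c"))     # None
```

Returning `None` rather than a number is deliberate here: any integer, including -1, could be mistaken for a real list index.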


	Before version 2.2.1, Python had no separate boolean datatype. To compensate for this, Python accepted almost anything in a boolean context (like an `if` statement), according to the following rules: `0` is false; all other numbers are true. An empty string (`""`) is false, all other strings are true. An empty list (`[]`) is false; all other lists are true. An empty tuple (`()`) is false; all other tuples are true. An empty dictionary (`{}`) is false; all other dictionaries are true. These rules still apply in Python 2.2.1 and beyond, but now you can also use an actual boolean, which has a value of `True` or `False`. Note the capitalization; these values, like everything else in Python, are case-sensitive.
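You can check these rules directly with the built-in `bool` function. The sketch below is written for current Python 3 interpreters; the list of sample values is arbitrary.

```python
values = [0, 1, -1, "", "a", [], [0], (), (0,), {}, {"k": 0}]

# bool(v) reports how v behaves in a boolean context such as an if statement
truth = [bool(v) for v in values]

for v, t in zip(values, truth):
    print(repr(v), t)
```

Note that `[0]`, `(0,)`, and `{"k": 0}` are all true: what matters is whether the container is empty, not what it contains.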

3.2.4. Deleting List Elements

Example 3.13. Removing Elements from a List

>>> li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
>>> li.remove("z")   
>>> li
['a', 'b', 'new', 'mpilgrim', 'example', 'new', 'two', 'elements']
>>> li.remove("new") 
>>> li
['a', 'b', 'mpilgrim', 'example', 'new', 'two', 'elements']
>>> li.remove("c")   
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: list.remove(x): x not in list
>>> li.pop()         
'elements'
>>> li
['a', 'b', 'mpilgrim', 'example', 'new', 'two']

	`remove` removes the first occurrence of a value from a list.
	`remove` removes only the first occurrence of a value. In this case, `'new'` appeared twice in the list, but `li.remove("new")` removed only the first occurrence.
	If the value is not found in the list, Python raises an exception. This mirrors the behavior of the `index` method.
	`pop` is an interesting beast. It does two things: it removes the last element of the list, and it returns the value that it removed. Note that this is different from `li[-1]`, which returns a value but does not change the list, and different from `li.remove(value)`, which changes the list but does not return a value.
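
The three operations mentioned in the last note are easy to confuse, so here they are side by side. This is a small sketch using the same list as the example; the variable names on the left are made up for the illustration.

```python
li = ['a', 'b', 'mpilgrim', 'example', 'new', 'two', 'elements']

last = li[-1]              # reads the last element; the list is unchanged
length_after_read = len(li)

popped = li.pop()          # removes the last element and returns it
length_after_pop = len(li)

result = li.remove('a')    # removes the first 'a' but returns None
```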

3.2.5. Using List Operators

Example 3.14. List Operators

>>> li = ['a', 'b', 'mpilgrim']
>>> li = li + ['example', 'new'] 
>>> li
['a', 'b', 'mpilgrim', 'example', 'new']
>>> li += ['two']                
>>> li
['a', 'b', 'mpilgrim', 'example', 'new', 'two']
>>> li = [1, 2] * 3              
>>> li
[1, 2, 1, 2, 1, 2]

	Lists can also be concatenated with the `+` operator. `list = list + otherlist` has the same result as `list.extend(otherlist)`. But the `+` operator returns a new (concatenated) list as a value, whereas `extend` only alters an existing list. This means that `extend` is faster, especially for large lists.
	Python supports the `+=` operator. `li += ['two']` is equivalent to `li.extend(['two'])`. The `+=` operator works for lists, strings, and integers, and it can be overloaded to work for user-defined classes as well. (More on classes in Chapter 5.)
	The `*` operator works on lists as a repeater. `li = [1, 2] * 3` is equivalent to `li = [1, 2] + [1, 2] + [1, 2]`, which concatenates the three lists into one.
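
The difference between `+` and `extend` is easiest to see when a second name refers to the same list. The following sketch assumes nothing beyond the list operators described above; `alias` and the other names are made up for the illustration.

```python
li = ['a', 'b', 'mpilgrim']
alias = li                      # a second name for the same list object

li = li + ['example', 'new']    # + builds a new list and rebinds li
plus_made_new_list = li is not alias   # True; alias still has three elements
alias_length_before = len(alias)

alias.extend(['example', 'new'])       # extend changes the list in place
same_contents = (li == alias)          # True: equal contents, different objects

repeated = [1, 2] * 3                  # same as [1, 2] + [1, 2] + [1, 2]
```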

3.3. Introducing Tuples

A tuple is an immutable list. A tuple cannot be changed in any way once it is created.

Example 3.15. Defining a tuple

>>> t = ("a", "b", "mpilgrim", "z", "example") 
>>> t
('a', 'b', 'mpilgrim', 'z', 'example')
>>> t[0]   
'a'
>>> t[-1]  
'example'
>>> t[1:3] 
('b', 'mpilgrim')

	A tuple is defined in the same way as a list, except that the whole set of elements is enclosed in parentheses instead of square brackets.
	The elements of a tuple have a defined order, just like a list. Tuple indices are zero-based, just like a list, so the first element of a non-empty tuple is always `t[0]`.
	Negative indices count from the end of the tuple, just as with a list.
	Slicing works too, just like a list. Note that when you slice a list, you get a new list; when you slice a tuple, you get a new tuple.

Example 3.16. Tuples Have No Methods

>>> t
('a', 'b', 'mpilgrim', 'z', 'example')
>>> t.append("new")    
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'append'
>>> t.remove("z")      
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'remove'
>>> t.index("example") 
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'index'
>>> "z" in t           
True

	You can't add elements to a tuple. Tuples have no `append` or `extend` method.
	You can't remove elements from a tuple. Tuples have no `remove` or `pop` method.
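	You can't find elements in a tuple. Tuples have no `index` method. (This is true of the Python versions shown in these examples; Python 2.6 and later give tuples `index` and `count` methods.)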
	You can, however, use `in` to see if an element exists in the tuple.
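
Immutability goes beyond the missing methods: you can't assign to an element of a tuple either. Here is a short sketch of what does and doesn't work; the names `changed`, `piece`, and `found` are made up for the illustration.

```python
t = ("a", "b", "mpilgrim", "z", "example")

try:
    t[0] = "new"          # tuples do not support item assignment
    changed = True
except TypeError:
    changed = False

piece = t[1:3]            # slicing a tuple gives a new tuple
found = "z" in t          # membership testing still works
```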

So what are tuples good for?

Tuples are faster than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate through it, use a tuple instead of a list.
It makes your code safer if you “write-protect” data that does not need to be changed. Using a tuple instead of a list is like having an implied assert statement that shows this data is constant, and that special thought (and a specific function) is required to override that.
Remember that I said that dictionary keys can be integers, strings, and “a few other types”? Tuples are one of those types. Tuples can be used as keys in a dictionary, but lists can't be used this way. Actually, it's more complicated than that. Dictionary keys must be immutable. Tuples themselves are immutable, but if you have a tuple of lists, that counts as mutable and isn't safe to use as a dictionary key. Only tuples of strings, numbers, or other dictionary-safe tuples can be used as dictionary keys.
Tuples are used in string formatting, as you'll see shortly.


	Tuples can be converted into lists, and vice-versa. The built-in `tuple` function takes a list and returns a tuple with the same elements, and the `list` function takes a tuple and returns a list. In effect, `tuple` freezes a list, and `list` thaws a tuple.
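
Two of the points above, tuples as dictionary keys and converting between tuples and lists, can be sketched in a few lines. The `labels` dictionary and its contents are made up for this illustration.

```python
# Tuples of immutable values can be dictionary keys.
labels = {(0, 0): "origin", (1, 0): "east"}

# A list cannot be a dictionary key.
try:
    labels[[0, 1]] = "north"
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Neither can a tuple that contains a list.
try:
    labels[([0], 1)] = "north"
    nested_key_ok = True
except TypeError:
    nested_key_ok = False

frozen = tuple(["a", "b"])   # list -> tuple
thawed = list(frozen)        # tuple -> list
```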

3.4. Declaring variables

Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from Chapter 2, odbchelper.py.

Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring into existence by being assigned a value, and they are automatically destroyed when they go out of scope.

Example 3.17. Defining the `myParams` Variable

if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }

Notice the indentation. An if statement is a code block and needs to be indented just like a function.

Also notice that the variable assignment is one command split over several lines, with a backslash (“\”) serving as a line-continuation marker.


	When a command is split among several lines with the line-continuation marker (“`\`”), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python IDE auto-indents the continued line, you should probably accept its default unless you have a burning reason not to.

Strictly speaking, expressions in parentheses, square brackets, or curly braces (like defining a dictionary) can be split into multiple lines with or without the line continuation character (“\”). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's a matter of style.
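
To see both forms of line continuation, compare a dictionary written without backslashes to an expression that needs one. This is a sketch; `total` is a made-up variable for the illustration.

```python
# Inside braces, the line breaks need no backslash.
myParams = {"server": "mpilgrim",
            "database": "master",
            "uid": "sa",
            "pwd": "secret"}

# Outside brackets, a backslash is required to continue the line.
total = 1 + 2 + \
        3 + 4
```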

Third, you never declared the variable myParams, you just assigned a value to it. This is like VBScript without the option explicit option. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception.

3.4.1. Referencing Variables

Example 3.18. Referencing an Unbound Variable

>>> x
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
NameError: There is no variable named 'x'
>>> x = 1
>>> x
1

You will thank Python for this one day.

3.4.2. Assigning Multiple Values at Once

One of the cooler programming shortcuts in Python is using sequences to assign multiple values at once.

Example 3.19. Assigning multiple values at once

>>> v = ('a', 'b', 'e')
>>> (x, y, z) = v     
>>> x
'a'
>>> y
'b'
>>> z
'e'

v is a tuple of three elements, and (x, y, z) is a tuple of three variables. Assigning one to the other assigns each of the values of v to each of the variables, in order.

This has all sorts of uses. I often want to assign names to a range of values. In C, you would use enum and manually list each constant and its associated value, which seems especially tedious when the values are consecutive. In Python, you can use the built-in range function with multi-variable assignment to quickly assign consecutive values.

Example 3.20. Assigning Consecutive Values

>>> range(7)              
[0, 1, 2, 3, 4, 5, 6]
>>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7) 
>>> MONDAY                
0
>>> TUESDAY
1
>>> SUNDAY
6

	The built-in `range` function returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting up to but not including the upper limit. (If you like, you can pass other parameters to specify a base other than `0` and a step other than `1`. You can `print range.__doc__` for details.)
	`MONDAY`, `TUESDAY`, `WEDNESDAY`, `THURSDAY`, `FRIDAY`, `SATURDAY`, and `SUNDAY` are the variables you're defining. (This example came from the `calendar` module, a fun little module that prints calendars, like the UNIX program `cal`. The `calendar` module defines integer constants for days of the week.)
	Now each variable has its value: `MONDAY` is `0`, `TUESDAY` is `1`, and so forth.
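
The optional arguments to `range` mentioned above can be sketched like this. One assumption to be aware of: under Python 3, `range` returns a lazy sequence rather than a list, so the sketch wraps it in `list` where a real list is needed. Multi-variable assignment works either way. The names `weekend` and `odds` are made up for the illustration.

```python
(MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)
weekend = [SATURDAY, SUNDAY]

# range(start, stop, step): start at 1, stop before 10, count by 2.
# list() is harmless in Python 2 and required for a real list in Python 3.
odds = list(range(1, 10, 2))
```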

You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the os module, which is discussed in Chapter 6.
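
Here is a minimal sketch of a function that returns two values at once. The function `min_and_max` is made up for this illustration; it is not part of Python or of the sample programs in this book.

```python
def min_and_max(values):
    """Return the smallest and largest value as a (low, high) tuple."""
    return min(values), max(values)    # the comma builds a tuple

both = min_and_max([1, 9, 8, 4])       # treat the result as one tuple
low, high = min_and_max([1, 9, 8, 4])  # or unpack it into two variables
```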

3.5. Formatting Strings

Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert values into a string with the %s placeholder.


	String formatting in Python uses the same syntax as the `sprintf` function in C.

Example 3.21. Introducing String Formatting

>>> k = "uid"
>>> v = "sa"
>>> "%s=%s" % (k, v) 
'uid=sa'

The whole expression evaluates to a string. The first %s is replaced by the value of k; the second %s is replaced by the value of v. All other characters in the string (in this case, the equal sign) stay as they are.

Note that (k, v) is a tuple. I told you they were good for something.

You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.

Example 3.22. String Formatting vs. Concatenating

>>> uid = "sa"
>>> pwd = "secret"
>>> print pwd + " is not a good password for " + uid      
secret is not a good password for sa
>>> print "%s is not a good password for %s" % (pwd, uid) 
secret is not a good password for sa
>>> userCount = 6
>>> print "Users connected: %d" % (userCount, )            
Users connected: 6
>>> print "Users connected: " + userCount                 
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects

	`+` is the string concatenation operator.
	In this trivial case, string formatting accomplishes the same result as concatenation.
	`(userCount, )` is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether `(userCount)` was a tuple with one element or just the value of `userCount`.
	String formatting works with integers by specifying `%d` instead of `%s`.
	Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works only when everything is already a string.
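
The one-element tuple and the type coercion described above can be checked in a few lines. This sketch also shows the usual workaround for concatenation, which is to convert the number with `str` yourself. The variable names other than `userCount` are made up for the illustration.

```python
userCount = 6

one_element = (userCount,)     # a tuple with one element
just_a_value = (userCount)     # parentheses only; this is the integer 6

formatted = "Users connected: %d" % (userCount,)
converted = "Users connected: " + str(userCount)   # explicit conversion

try:
    broken = "Users connected: " + userCount       # str + int is an error
except TypeError:
    broken = None
```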

As with printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.

Example 3.23. Formatting Numbers

>>> print "Today's stock price: %f" % 50.4625
Today's stock price: 50.462500
>>> print "Today's stock price: %.2f" % 50.4625
Today's stock price: 50.46
>>> print "Change since yesterday: %+.2f" % 1.5
Change since yesterday: +1.50

	The `%f` string formatting option treats the value as a decimal, and prints it to six decimal places.
	The ".2" modifier of the `%f` option rounds the value to two decimal places.
	You can even combine modifiers. Adding the `+` modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding the value to exactly two decimal places.
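
If you want to experiment with these modifiers, it helps to capture the formatted strings instead of printing them. In this sketch, the value `50.468` and the negative change are made-up inputs added to show rounding and the minus sign.

```python
price = 50.4625
six_places = "%f" % price          # '50.462500'
two_places = "%.2f" % price        # '50.46'
rounded_up = "%.2f" % 50.468       # '50.47', rounded rather than cut off
signed_gain = "%+.2f" % 1.5        # '+1.50'
signed_loss = "%+.2f" % -1.5       # '-1.50'
```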

3.6. Mapping Lists

One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each of the elements of the list.

Example 3.24. Introducing List Comprehensions

>>> li = [1, 9, 8, 4]
>>> [elem*2 for elem in li]      
[2, 18, 16, 8]
>>> li         
[1, 9, 8, 4]
>>> li = [elem*2 for elem in li] 
>>> li
[2, 18, 16, 8]

	To make sense of this, look at it from right to left. `li` is the list you're mapping. Python loops through `li` one element at a time, temporarily assigning the value of each element to the variable `elem`. Python then applies the function `elem*2` and appends that result to the returned list.
	Note that list comprehensions do not change the original list.
	It is safe to assign the result of a list comprehension to the variable that you're mapping. Python constructs the new list in memory, and when the list comprehension is complete, it assigns the result to the variable.
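
If list comprehensions still look odd, it may help to see one next to the explicit loop that does the same work. This is only a sketch; `doubled` and `doubled_by_loop` are made-up names.

```python
li = [1, 9, 8, 4]

# The list comprehension...
doubled = [elem * 2 for elem in li]

# ...does the same work as this explicit loop.
doubled_by_loop = []
for elem in li:
    doubled_by_loop.append(elem * 2)
```

Either way, `li` itself is left alone.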

Here is the list comprehension in the buildConnectionString function that you declared in Chapter 2:

["%s=%s" % (k, v) for k, v in params.items()]

First, notice that you're calling the items function of the params dictionary. This function returns a list of tuples of all the data in the dictionary.

Example 3.25. The `keys`, `values`, and `items` Functions

>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
>>> params.keys()   
['server', 'uid', 'database', 'pwd']
>>> params.values() 
['mpilgrim', 'sa', 'master', 'secret']
>>> params.items()  
[('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]

	The `keys` method of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined (remember that elements in a dictionary are unordered), but it is a list.
	The `values` method returns a list of all the values. The list is in the same order as the list returned by `keys`, so `params.values()[n] == params[params.keys()[n]]` for all values of `n`.
	The `items` method returns a list of tuples of the form `(key, value)`. The list contains all the data in the dictionary.
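
The claim that `keys`, `values`, and `items` line up position by position can be checked with a short loop. One assumption to note: under Python 3 these methods return view objects instead of lists, so the sketch wraps each call in `list`, which is harmless under Python 2. The names `pairs_match` and `items_match` are made up for the illustration.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

keys = list(params.keys())
values = list(params.values())
items = list(params.items())

# keys and values line up position by position
pairs_match = True
for n in range(len(keys)):
    if values[n] != params[keys[n]]:
        pairs_match = False

# items is the same pairing, as (key, value) tuples
items_match = (items == list(zip(keys, values)))
```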

Now let's see what buildConnectionString does. It takes a list, params.items(), and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements as params.items(), but each element in the new list will be a string that contains both a key and its associated value from the params dictionary.

Example 3.26. List Comprehensions in `buildConnectionString`, Step by Step

>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
>>> params.items()
[('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]
>>> [k for k, v in params.items()]                
['server', 'uid', 'database', 'pwd']
>>> [v for k, v in params.items()]                
['mpilgrim', 'sa', 'master', 'secret']
>>> ["%s=%s" % (k, v) for k, v in params.items()] 
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']

	Note that you're using two variables to iterate through the `params.items()` list. This is another use of multi-variable assignment. The first element of `params.items()` is `('server', 'mpilgrim')`, so in the first iteration of the list comprehension, `k` will get `'server'` and `v` will get `'mpilgrim'`. In this case, you're ignoring the value of `v` and only including the value of `k` in the returned list, so this list comprehension ends up being equivalent to `params.keys()`.
	Here you're doing the same thing, but ignoring the value of `k`, so this list comprehension ends up being equivalent to `params.values()`.
	Combining the previous two examples with some simple string formatting, you get a list of strings that include both the key and value of each element of the dictionary. This looks suspiciously like the output of the program. All that remains is to join the elements in this list into a single string.

3.7. Joining Lists and Splitting Strings

You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object.

Here is an example of joining a list from the buildConnectionString function:

    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string ";" itself is an object, and you are calling its join method.

The join method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't need to be a semi-colon; it doesn't even need to be a single character. It can be any string.


	`join` works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception.
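
Here is what that looks like in practice, along with one common way around it: convert every element with `str` before joining. The list `values` is a made-up sample for this illustration.

```python
values = ["uid", 6, None]       # made-up sample with non-string elements

try:
    joined = ";".join(values)   # join does no type coercion
except TypeError:
    joined = None

# Convert each element with str first, then join.
joined_as_text = ";".join([str(v) for v in values])
```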

Example 3.27. Output of `odbchelper.py`

>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
>>> ["%s=%s" % (k, v) for k, v in params.items()]
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
>>> ";".join(["%s=%s" % (k, v) for k, v in params.items()])
'server=mpilgrim;uid=sa;database=master;pwd=secret'

This string is then returned from the buildConnectionString function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter.

You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's called split.

Example 3.28. Splitting a String

>>> li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
>>> s = ";".join(li)
>>> s
'server=mpilgrim;uid=sa;database=master;pwd=secret'
>>> s.split(";")    
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
>>> s.split(";", 1) 
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']

	`split` reverses `join` by splitting a string into a multi-element list. Note that the delimiter (“`;`”) is stripped out completely; it does not appear in any of the elements of the returned list.
	`split` takes an optional second argument, which is the number of times to split. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)


	`anystring.split(delimiter, 1)` is a useful technique when you want to search a string for a substring and then work with everything before the substring (which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
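
As a sketch of that technique, here is the connection string taken apart one piece at a time. Multi-variable assignment unpacks the two-element list that `split` returns. The variable names are made up for the illustration.

```python
s = "server=mpilgrim;uid=sa;database=master;pwd=secret"

# Split once on the first ";" to get the first pair and the remainder.
first, rest = s.split(";", 1)

# Split the pair once on "=" to separate key from value.
key, value = first.split("=", 1)

# Round trip: splitting and rejoining with the same delimiter restores s.
round_trip = ";".join(s.split(";"))
```

Note that the unpacking only works when the delimiter is actually present; otherwise `split` returns a one-element list and the assignment raises a `ValueError`.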

3.7.1. Historical Note on String Methods

When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead.

3.8. Summary

The odbchelper.py program and its output should now make perfect sense.

def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }
    print buildConnectionString(myParams)

Here is the output of odbchelper.py:

server=mpilgrim;uid=sa;database=master;pwd=secret

Before diving into the next chapter, make sure you're comfortable doing all of these things:

Using the Python IDE to test expressions interactively
Writing Python programs and running them from within your IDE, or from the command line
Importing modules and calling their functions
Declaring functions and using docstrings, local variables, and proper indentation
Defining dictionaries, tuples, and lists
Accessing attributes and methods of any object, including strings, lists, dictionaries, functions, and modules
Concatenating values through string formatting
Mapping lists into other lists using list comprehensions
Splitting strings into lists and joining lists into strings

Chapter 4. The Power Of Introspection

This chapter covers one of Python's strengths: introspection. As you know, everything in Python is an object, and introspection is code looking at other modules and functions in memory as objects, getting information about them, and manipulating them. Along the way, you'll define functions with no name, call functions with arguments out of order, and reference functions whose names you don't even know ahead of time.

4.1. Diving In

Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered in Chapter 2, Your First Python Program. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter.

Example 4.1. `apihelper.py`

If you have not already done so, you can download this and other examples used in this book.

def info(object, spacing=10, collapse=1):   
    """Print methods and docstrings.
    
    Takes module, class, list, dictionary, or string."""
    methodList = [method for method in dir(object) if callable(getattr(object, method))]
    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    print "\n".join(["%s %s" %
                     (method.ljust(spacing),
                      processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

if __name__ == "__main__":                 
    print info.__doc__

	This module has one function, `info`. According to its function declaration, it takes three parameters: `object`, `spacing`, and `collapse`. The last two are actually optional parameters, as you'll see shortly.
	The `info` function has a multi-line `docstring` that succinctly describes the function's purpose. Note that no return value is mentioned; this function will be used solely for its effects, rather than its value.
	Code within the function is indented.
	The `if __name__` trick allows this program to do something useful when run by itself, without interfering with its use as a module for other programs. In this case, the program simply prints out the `docstring` of the `info` function.
	`if` statements use `==` for comparison, and parentheses are not required.

The info function is designed to be used by you, the programmer, while working in the Python IDE. It takes any object that has functions or methods (like a module, which has functions, or a list, which has methods) and prints out the functions and their docstrings.

Example 4.2. Sample Usage of `apihelper.py`

>>> from apihelper import info
>>> li = []
>>> info(li)
append     L.append(object) -- append object to end
count      L.count(value) -> integer -- return number of occurrences of value
extend     L.extend(list) -- extend list by appending list elements
index      L.index(value) -> integer -- return index of first occurrence of value
insert     L.insert(index, object) -- insert object before index
pop        L.pop([index]) -> item -- remove and return item at index (default last)
remove     L.remove(value) -- remove first occurrence of value
reverse    L.reverse() -- reverse *IN PLACE*
sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1

By default the output is formatted to be easy to read. Multi-line docstrings are collapsed into a single long line, but this option can be changed by specifying 0 for the collapse argument. If the function names are longer than 10 characters, you can specify a larger value for the spacing argument to make the output easier to read.

Example 4.3. Advanced Usage of `apihelper.py`

>>> import odbchelper
>>> info(odbchelper)
buildConnectionString Build a connection string from a dictionary Returns string.
>>> info(odbchelper, 30)
buildConnectionString          Build a connection string from a dictionary Returns string.
>>> info(odbchelper, 30, 0)
buildConnectionString          Build a connection string from a dictionary
    
    Returns string.

4.2. Using Optional and Named Arguments

Python allows function arguments to have default values; if the function is called without the argument, the argument gets its default value. Furthermore, arguments can be specified in any order by using named arguments. Stored procedures in SQL Server Transact/SQL can do this, so if you're a SQL Server scripting guru, you can skim this part.

Here is an example of info, a function with two optional arguments:

def info(object, spacing=10, collapse=1):

spacing and collapse are optional, because they have default values defined. object is required, because it has no default value. If info is called with only one argument, spacing defaults to 10 and collapse defaults to 1. If info is called with two arguments, collapse still defaults to 1.

Say you want to specify a value for collapse but want to accept the default value for spacing. In most languages, you would be out of luck, because you would need to call the function with three arguments. But in Python, arguments can be specified by name, in any order.

Example 4.4. Valid Calls of `info`

info(odbchelper)  
info(odbchelper, 12)                
info(odbchelper, collapse=0)        
info(spacing=15, object=odbchelper)

	With only one argument, `spacing` gets its default value of `10` and `collapse` gets its default value of `1`.
	With two arguments, `collapse` gets its default value of `1`.
	Here you are naming the `collapse` argument explicitly and specifying its value. `spacing` still gets its default value of `10`.
	Even required arguments (like `object`, which has no default value) can be named, and named arguments can appear in any order.

This looks totally whacked until you realize that arguments are simply a dictionary. The “normal” method of calling functions without argument names is actually just a shorthand where Python matches up the values with the argument names in the order they're specified in the function declaration. And most of the time, you'll call functions the “normal” way, but you always have the additional flexibility if you need it.


	The only thing you need to do to call a function is specify a value (somehow) for each required argument; the manner and order in which you do that is up to you.
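
To watch the defaults and named arguments at work without importing anything, you can use a stand-in function that simply returns what it was given. `describe` is made up for this illustration; it borrows the signature of `info` but is not part of `apihelper.py`.

```python
def describe(object, spacing=10, collapse=1):
    """Stand-in with the same signature as info; returns its arguments."""
    return (object, spacing, collapse)

a = describe("x")                       # ('x', 10, 1)
b = describe("x", 12)                   # ('x', 12, 1)
c = describe("x", collapse=0)           # ('x', 10, 0)
d = describe(spacing=15, object="x")    # ('x', 15, 1)

try:
    describe(spacing=15)                # the required argument is missing
    missing_ok = True
except TypeError:
    missing_ok = False
```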

4.3. Using `type`, `str`, `dir`, and Other Built-In Functions

Python has a small set of extremely useful built-in functions. All other functions are partitioned off into modules. This was actually a conscious design decision, to keep the core language from getting bloated like other scripting languages (cough cough, Visual Basic).

4.3.1. The `type` Function

The type function returns the datatype of any arbitrary object. The possible types are listed in the types module. This is useful for helper functions that can handle several types of data.

Example 4.5. Introducing `type`

>>> type(1)           
<type 'int'>
>>> li = []
>>> type(li)          
<type 'list'>
>>> import odbchelper
>>> type(odbchelper)  
<type 'module'>
>>> import types      
>>> type(odbchelper) == types.ModuleType
True

	`type` takes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions, classes, modules, even types are acceptable.
	`type` can take a variable and return its datatype.
	`type` also works on modules.
	You can use the constants in the `types` module to compare types of objects. This is what the `info` function does, as you'll see shortly.
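
Here is a sketch of the same comparisons in script form. Since `odbchelper` may not be on your path, the standard `string` module stands in for it; that substitution is an assumption of this example. The names on the left are made up for the illustration.

```python
import types
import string      # any importable module works; string stands in for odbchelper

int_type = type(1)
list_type = type([])
is_module = type(string) == types.ModuleType
type_of_type = type(int_type)      # even types have a type
```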

4.3.2. The `str` Function

The str function coerces data into a string. Every datatype can be coerced into a string.

Example 4.6. Introducing `str`

>>> str(1)          
'1'
>>> horsemen = ['war', 'pestilence', 'famine']
>>> horsemen
['war', 'pestilence', 'famine']
>>> horsemen.append('Powerbuilder')
>>> str(horsemen)   
"['war', 'pestilence', 'famine', 'Powerbuilder']"
>>> str(odbchelper) 
"<module 'odbchelper' from 'c:\\docbook\\dip\\py\\odbchelper.py'>"
>>> str(None)       
'None'

	For simple datatypes like integers, you would expect `str` to work, because almost every language has a function to convert an integer to a string.
	However, `str` works on any object of any type. Here it works on a list which you've constructed in bits and pieces.
	`str` also works on modules. Note that the string representation of the module includes the pathname of the module on disk, so yours will be different.
	A subtle but important behavior of `str` is that it works on `None`, the Python null value. It returns the string `'None'`. You'll use this to your advantage in the `info` function, as you'll see shortly.
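
The reason `str(None)` matters for `info` is that a function without a `docstring` has `None` as its `__doc__` attribute, and `None` has no `split` method. The sketch below shows the idea; the function `no_docstring` is made up for this illustration.

```python
def no_docstring():
    pass

doc = no_docstring.__doc__        # None, because the function has no docstring
as_text = str(doc)                # 'None'

# str makes it safe to call string methods on the result.
collapsed = " ".join(str(doc).split())
```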

At the heart of the info function is the powerful dir function. dir returns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much anything.

Example 4.7. Introducing `dir`

>>> li = []
>>> dir(li)           
['append', 'count', 'extend', 'index', 'insert',
'pop', 'remove', 'reverse', 'sort']
>>> d = {}
>>> dir(d)            
['clear', 'copy', 'get', 'has_key', 'items', 'keys', 'setdefault', 'update', 'values']
>>> import odbchelper
>>> dir(odbchelper)   
['__builtins__', '__doc__', '__file__', '__name__', 'buildConnectionString']

	`li` is a list, so `dir(li)` returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not the methods themselves.
	`d` is a dictionary, so `dir(d)` returns a list of the names of dictionary methods. At least one of these, `keys`, should look familiar.
	This is where it really gets interesting. `odbchelper` is a module, so `dir(odbchelper)` returns a list of all kinds of stuff defined in the module, including built-in attributes, like `__name__`, `__doc__`, and whatever other attributes and methods you define. In this case, `odbchelper` has only one user-defined method, the `buildConnectionString` function described in Chapter 2.

Finally, the callable function takes any object and returns True if the object can be called, or False otherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.)

Example 4.8. Introducing `callable`

>>> import string
>>> string.punctuation           
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.join
<function join at 00C55A7C>
>>> callable(string.punctuation) 
False
>>> callable(string.join)        
True
>>> print string.join.__doc__    
join(list [,sep]) -> string

    Return a string composed of the words in list, with
    intervening occurrences of sep.  The default separator is a
    single space.

    (joinfields and join are synonymous)

	The functions in the `string` module are deprecated (although many people still use the `join` function), but the module contains a lot of useful constants like this `string.punctuation`, which contains all the standard punctuation characters.
	`string.join` is a function that joins a list of strings.
	`string.punctuation` is not callable; it is a string. (A string does have callable methods, but the string itself is not callable.)
	`string.join` is callable; it's a function that takes two arguments.
	Any callable object may have a `docstring`. By using the `callable` function on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes) and which you want to ignore (constants and so on) without knowing anything about the object ahead of time.
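As a rough sketch of that last idea, you can sort a module's attributes into "things you can call" and "everything else" without knowing what the module contains. The list names `public`, `callables`, and `constants` are made up for this illustration; `capwords` is a real function in the `string` module.

```python
import string

# Ignore the double-underscore attributes every module carries around.
public = [name for name in dir(string) if not name.startswith("_")]

# Look each name up with getattr, then ask whether the result can be called.
callables = [name for name in public if callable(getattr(string, name))]
constants = [name for name in public if not callable(getattr(string, name))]

print("callable: %s" % ", ".join(callables))
print("not callable: %s" % ", ".join(constants))
```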

4.3.3. Built-In Functions

type, str, dir, and all the rest of Python's built-in functions are grouped into a special module called __builtin__. (That's two underscores before and after.) If it helps, you can think of Python automatically executing from __builtin__ import * on startup, which imports all the “built-in” functions into the namespace so you can use them directly.

The advantage of thinking like this is that you can access all the built-in functions and attributes as a group by getting information about the __builtin__ module. And guess what, you already have a function for that: info, from apihelper.py. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the built-in error classes, like AttributeError, should already look familiar.)

Example 4.9. Built-in Attributes and Functions

>>> from apihelper import info
>>> import __builtin__
>>> info(__builtin__, 20)
ArithmeticError      Base class for arithmetic errors.
AssertionError       Assertion failed.
AttributeError       Attribute not found.
EOFError             Read beyond end of file.
EnvironmentError     Base class for I/O related errors.
Exception            Common base class for all exceptions.
FloatingPointError   Floating point operation failed.
IOError              I/O operation failed.

[...snip...]
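You can get a similar overview without apihelper by combining `dir`, `getattr`, and `callable` yourself. This is only a sketch: it assumes the module is reachable as `__builtin__` (the name used in this chapter) or as `builtins` (the name later Python versions use), and the list name `builtin_callables` is made up for illustration.

```python
try:
    import __builtin__ as builtins   # the module name used in this chapter
except ImportError:
    import builtins                  # the same module under its newer name

# Names of everything in the built-in namespace that can be called:
# functions like dir and str, and classes like AttributeError.
builtin_callables = [name for name in dir(builtins)
                     if callable(getattr(builtins, name))]

print("%d callable built-ins" % len(builtin_callables))
```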


	Python comes with excellent reference manuals, which you should peruse thoroughly to learn all the modules Python has to offer. But unlike most languages, where you would find yourself referring back to the manuals or man pages to remind yourself how to use these modules, Python is largely self-documenting.

4.4. Getting Object References With `getattr`

You already know that Python functions are objects. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the getattr function.

Example 4.10. Introducing `getattr`

>>> li = ["Larry", "Curly"]
>>> li.pop     
<built-in method pop of list object at 010DF884>
>>> getattr(li, "pop")           
<built-in method pop of list object at 010DF884>
>>> getattr(li, "append")("Moe") 
>>> li
['Larry', 'Curly', 'Moe']
>>> getattr({}, "clear")         
<built-in method clear of dictionary object at 00F113D4>
>>> getattr((), "pop")           
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'pop'

	This gets a reference to the `pop` method of the list. Note that this is not calling the `pop` method; that would be `li.pop()`. This is the method itself.
	This also returns a reference to the `pop` method, but this time, the method name is specified as a string argument to the `getattr` function. `getattr` is an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list, and the attribute is the `pop` method.
	In case it hasn't sunk in just how incredibly useful this is, try this: the return value of `getattr` is the method, which you can then call just as if you had said `li.append("Moe")` directly. But you didn't call the function directly; you specified the function name as a string instead.
	`getattr` also works on dictionaries.
	In theory, `getattr` would work on tuples, except that tuples have no methods, so `getattr` will raise an exception no matter what attribute name you give.
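Here is the same idea as a short script rather than an interactive session. It is a sketch, not part of apihelper.py; the names `method_name` and `tuple_has_pop` are made up for illustration.

```python
li = ["Larry", "Curly"]

# Look the method up by name, then call the result like any other function.
method_name = "append"
getattr(li, method_name)("Moe")

# Asking for an attribute the object does not have raises AttributeError.
try:
    getattr((), "pop")
    tuple_has_pop = True
except AttributeError:
    tuple_has_pop = False

print(li)
```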

4.4.1. `getattr` with Modules

getattr isn't just for built-in datatypes. It also works on modules.

Example 4.11. The `getattr` Function in `apihelper.py`

>>> import odbchelper
>>> odbchelper.buildConnectionString             
<function buildConnectionString at 00D18DD4>
>>> getattr(odbchelper, "buildConnectionString") 
<function buildConnectionString at 00D18DD4>
>>> object = odbchelper
>>> method = "buildConnectionString"
>>> getattr(object, method)    
<function buildConnectionString at 00D18DD4>
>>> type(getattr(object, method))                
<type 'function'>
>>> import types
>>> type(getattr(object, method)) == types.FunctionType
True
>>> callable(getattr(object, method))            
True

	This returns a reference to the `buildConnectionString` function in the `odbchelper` module, which you studied in Chapter 2, Your First Python Program. (The hex address you see is specific to my machine; your output will be different.)
	Using `getattr`, you can get the same reference to the same function. In general, `getattr(object, "attribute")` is equivalent to `object.attribute`. If `object` is a module, then `attribute` can be anything defined in the module: a function, class, or global variable.
	And this is what you actually use in the `info` function. `object` is passed into the function as an argument; `method` is a string which is the name of a method or function.
	In this case, `method` is the name of a function, which you can prove by getting its `type`.
	Since the attribute that `method` names is a function, it is callable.
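If you don't have odbchelper handy, you can try the same thing on a module from the standard library. This sketch uses `math` and its `sqrt` function purely as stand-ins; the variable names are made up for illustration.

```python
import math

module = math    # any imported module would do here
name = "sqrt"    # the attribute name, known only as a string

# getattr(module, "sqrt") performs the same lookup as module.sqrt.
func = getattr(module, name)
same_object = func is math.sqrt

result = func(16)
print(result)
```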

4.4.2. `getattr` As a Dispatcher

A common usage pattern of getattr is as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could define separate functions for each output format and use a single dispatch function to call the right one.

For example, let's imagine a program that prints site statistics in HTML, XML, and plain text formats. The choice of output format could be specified on the command line, or stored in a configuration file. A statsout module defines three functions, output_html, output_xml, and output_text. Then the main program defines a single output function, like this:

Example 4.12. Creating a Dispatcher with `getattr`

import statsout

def output(data, format="text"):            
    output_function = getattr(statsout, "output_%s" % format) 
    return output_function(data)

	The `output` function takes one required argument, `data`, and one optional argument, `format`. If `format` is not specified, it defaults to `text`, and you will end up calling the plain text output function.
	You concatenate the `format` argument with "output_" to produce a function name, and then go get that function from the `statsout` module. This allows you to easily extend the program later to support other output formats, without changing this dispatch function. Just add another function to `statsout` named, for instance, `output_pdf`, and pass "pdf" as the `format` into the `output` function.
	Now you can simply call the output function in the same way as any other function. The `output_function` variable is a reference to the appropriate function from the `statsout` module.

Did you see the bug in the previous example? This is a very loose coupling of strings and functions, and there is no error checking. What happens if the user passes in a format that doesn't have a corresponding function defined in statsout? Well, getattr will raise an AttributeError, nothing will be assigned to output_function, and unless the caller catches the exception, the program will crash. That's bad.

Luckily, getattr takes an optional third argument, a default value.

Example 4.13. `getattr` Default Values

import statsout

def output(data, format="text"):
    output_function = getattr(statsout, "output_%s" % format, statsout.output_text)
    return output_function(data)

This function call is guaranteed to work, because you added a third argument to the call to getattr. The third argument is a default value that is returned if the attribute or method specified by the second argument wasn't found.
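To watch both behaviors side by side, here is a self-contained sketch. The `statsout` module is imaginary, so the sketch fakes it with a small class that has the same attribute names; that class, its two output functions, and `output_strict` are assumptions made purely for illustration.

```python
class statsout(object):
    """Hypothetical stand-in for the statsout module."""

    @staticmethod
    def output_text(data):
        return "text: %s" % data

    @staticmethod
    def output_html(data):
        return "<p>%s</p>" % data


def output_strict(data, format="text"):
    # No default value: an unknown format raises AttributeError.
    return getattr(statsout, "output_%s" % format)(data)


def output(data, format="text"):
    # With a default value: an unknown format falls back to plain text.
    output_function = getattr(statsout, "output_%s" % format,
                              statsout.output_text)
    return output_function(data)


print(output("42 hits", "html"))
print(output("42 hits", "pdf"))
```

Whether a silent fallback is what you want is a design choice. It keeps the program running, but it also hides a misspelled format name from the user.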

As you can see, getattr is quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters.

4.5. Filtering Lists

As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (Section 3.6, “Mapping Lists”). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely.

Here is the list filtering syntax:

[mapping-expression for element in source-list if filter-expression]

This is an extension of the list comprehensions that you know and love. The first two thirds are the same; the last part, starting with the if, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be almost anything). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored, so they are never put through the mapping expression and are not included in the output list.

Example 4.14. Introducing List Filtering

>>> li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]
>>> [elem for elem in li if len(elem) > 1]       
['mpilgrim', 'foo']
>>> [elem for elem in li if elem != "b"]         
['a', 'mpilgrim', 'foo', 'c', 'd', 'd']
>>> [elem for elem in li if li.count(elem) == 1] 
['a', 'mpilgrim', 'foo', 'c']

	The mapping expression here is simple (it just returns the value of each element), so concentrate on the filter expression. As Python loops through the list, it runs each element through the filter expression. If the filter expression is true, the element is mapped and the result of the mapping expression is included in the returned list. Here, you are filtering out all the one-character strings, so you're left with a list of all the longer strings.
	Here, you are filtering out a specific value, `b`. Note that this filters all occurrences of `b`, since each time it comes up, the filter expression will be false.
	`count` is a list method that returns the number of times a value occurs in a list. You might think that this filter would eliminate duplicates from a list, returning a list containing only one copy of each value in the original list. But it doesn't, because values that appear twice in the original list (in this case, `b` and `d`) are excluded completely. There are ways of eliminating duplicates from a list, but filtering is not the solution.
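For the curious, here is one possible way to eliminate duplicates while keeping the original order. It is a sketch that uses an explicit loop instead of a filter, and it assumes a Python version that has the built-in `set` type; the names `seen` and `unique` are made up for illustration.

```python
li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]

seen = set()    # values already copied into the result
unique = []
for elem in li:
    if elem not in seen:    # first time this value shows up
        seen.add(elem)
        unique.append(elem)

print(unique)
```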

Let's get back to this line from apihelper.py:

    methodList = [method for method in dir(object) if callable(getattr(object, method))]

This looks complicated, and it is complicated, but the basic structure is the same. The whole filter expression returns a list, which is assigned to the methodList variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression, which returns the value of each element. dir(object) returns a list of object's attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the if.

The filter expression looks scary, but it's not. You already know about callable, getattr, and in. As you saw in the previous section, the expression getattr(object, method) returns a function object if object is a module and method is the name of a function in that module.

So this expression takes an object (named object). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters that list to weed out all the stuff that you don't care about. You do the weeding out by taking the name of each attribute/method/function and getting a reference to the real thing, via the getattr function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like the pop method of a list) and user-defined (like the buildConnectionString function of the odbchelper module). You don't care about other attributes, like the __name__ attribute that's built in to every module.
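To see the weeding-out at work on something small, the sketch below runs the same filter over an instance of a tiny class. The `Counter` class, its `step` attribute, and its `bump` method are invented for this illustration; they are not part of apihelper.py.

```python
class Counter(object):
    """Hypothetical class used only for this illustration."""
    step = 1                        # a data attribute: not callable

    def bump(self, value):
        return value + self.step    # a method: callable


obj = Counter()

# The same filter as in apihelper.py, applied to obj.
methodList = [method for method in dir(obj)
              if callable(getattr(obj, method))]

# Hide the double-underscore methods to make the result easier to read.
public_methods = [m for m in methodList if not m.startswith("_")]
print(public_methods)
```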

4.6. The Peculiar Nature of `and` and `or`

In Python, and and or perform boolean logic as you would expect, but they do not return boolean values; instead, they return one of the actual values they are comparing.

Example 4.15. Introducing `and`

>>> 'a' and 'b'         
'b'
>>> '' and 'b'          
''
>>> 'a' and 'b' and 'c' 
'c'

	When using `and`, values are evaluated in a boolean context from left to right. `0`, `''`, `[]`, `()`, `{}`, and `None` are false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are true in a boolean context, but you can define special methods in your class to make an instance evaluate to false. You'll learn all about classes and special methods in Chapter 5. If all values are true in a boolean context, `and` returns the last value. In this case, `and` evaluates `'a'`, which is true, then `'b'`, which is true, and returns `'b'`.
	If any value is false in a boolean context, `and` returns the first false value. In this case, `''` is the first false value.
	All values are true, so `and` returns the last value, `'c'`.

Example 4.16. Introducing `or`

>>> 'a' or 'b'          
'a'
>>> '' or 'b'           
'b'
>>> '' or [] or {}      
{}
>>> def sidefx():
...     print "in sidefx()"
...     return 1
>>> 'a' or sidefx()     
'a'

	When using `or`, values are evaluated in a boolean context from left to right, just like `and`. If any value is true, `or` returns that value immediately. In this case, `'a'` is the first true value.
	`or` evaluates `''`, which is false, then `'b'`, which is true, and returns `'b'`.
	If all values are false, `or` returns the last value. `or` evaluates `''`, which is false, then `[]`, which is false, then `{}`, which is false, and returns `{}`.
	Note that `or` evaluates values only until it finds one that is true in a boolean context, and then it ignores the rest. This distinction is important if some values can have side effects. Here, the function `sidefx` is never called, because `or` evaluates `'a'`, which is true, and returns `'a'` immediately.
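If you'd like proof of which values actually get evaluated, you can record the calls. This sketch adds a `label` argument and a `calls` list to `sidefx`; both are assumptions made for the demonstration, not part of the example above.

```python
calls = []

def sidefx(label):
    calls.append(label)    # record that the function really ran
    return 1

first = 'a' or sidefx("or")         # 'a' is true, so sidefx is skipped
second = '' and sidefx("and")       # '' is false, so sidefx is skipped
third = '' or sidefx("fallback")    # '' is false, so sidefx runs

print(calls)
```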

If you're a C hacker, you are certainly familiar with the bool ? a : b expression, which evaluates to a if bool is true, and b otherwise. Because of the way and and or work in Python, you can accomplish the same thing.

4.6.1. Using the `and-or` Trick

Example 4.17. Introducing the `and-or` Trick

>>> a = "first"
>>> b = "second"
>>> 1 and a or b 
'first'
>>> 0 and a or b 
'second'

	This syntax looks similar to the `bool ? a : b` expression in C. The entire expression is evaluated from left to right, so the `and` is evaluated first. `1 and 'first'` evaluates to `'first'`, then `'first' or 'second'` evaluates to `'first'`.
	`0 and 'first'` evaluates to `0`, and then `0 or 'second'` evaluates to `'second'`.

However, since this Python expression is simply boolean logic, and not a special construct of the language, there is one extremely important difference between this and-or trick in Python and the bool ? a : b syntax in C. If the value of a is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?)

Example 4.18. When the `and-or` Trick Fails

>>> a = ""
>>> b = "second"
>>> 1 and a or b         
'second'

Since a is an empty string, which Python considers false in a boolean context, 1 and '' evaluates to '', and then '' or 'second' evaluates to 'second'. Oops! That's not what you wanted.

The and-or trick, bool and a or b, will not work like the C expression bool ? a : b when a is false in a boolean context.

The real trick behind the and-or trick, then, is to make sure that the value of a is never false. One common way of doing this is to turn a into [a] and b into [b], then take the first element of the returned list, which will be either a or b.

Example 4.19. Using the `and-or` Trick Safely

>>> a = ""
>>> b = "second"
>>> (1 and [a] or [b])[0] 
''

Since [a] is a non-empty list, it is never false. Even if a is 0 or '' or some other false value, the list [a] is true because it has one element.
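Wrapping the two forms in functions makes the difference easy to test. The function names `choose_naive` and `choose_safe` are made up for this sketch.

```python
def choose_naive(cond, a, b):
    # The simple form: gives the wrong answer when a is false.
    return cond and a or b

def choose_safe(cond, a, b):
    # The safe form: [a] has one element, so it is always true.
    return (cond and [a] or [b])[0]

print(repr(choose_naive(1, "", "second")))    # picks b by mistake
print(repr(choose_safe(1, "", "second")))     # picks a, as intended
```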

By now, this trick may seem like more trouble than it's worth. You could, after all, accomplish the same thing with an if statement, so why go through all this fuss? Well, in many cases, you are choosing between two constant values, so you can use the simpler syntax and not worry, because you know that the a value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so. For example, there are some cases in Python where if statements are not allowed, such as in lambda functions.

4.7. Using `lambda` Functions

Python supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called lambda functions can be used anywhere a function is required.

Example 4.20. Introducing `lambda` Functions

>>> def f(x):
...     return x*2
...     
>>> f(3)
6
>>> g = lambda x: x*2  
>>> g(3)
6
>>> (lambda x: x*2)(3) 
6

	This is a `lambda` function that accomplishes the same thing as the normal function above it. Note the abbreviated syntax here: there are no parentheses around the argument list, and the `return` keyword is missing (it is implied, since the entire function can only be one expression). Also, the function has no name, but it can be called through the variable it is assigned to.
	You can use a `lambda` function without even assigning it to a variable. This may not be the most useful thing in the world, but it just goes to show that a lambda is just an in-line function.

To generalize, a lambda function is a function that takes any number of arguments (including optional arguments) and returns the value of a single expression. lambda functions cannot contain commands, and they cannot contain more than one expression. Don't try to squeeze too much into a lambda function; if you need something more complex, define a normal function instead and make it as long as you want.


	`lambda` functions are a matter of style. Using them is never required; anywhere you could use them, you could define a separate normal function and use that instead. I use them in places where I want to encapsulate specific, non-reusable code without littering my code with a lot of little one-line functions.
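Two small cases that the examples above don't cover: a lambda with an optional argument, and a lambda passed straight to another function. This is a sketch; the names are made up, and the second half assumes a Python version whose built-in `sorted` accepts a `key` argument.

```python
# A lambda may take several arguments, including optional ones.
scale = lambda x, factor=2: x * factor

doubled = scale(3)       # uses the default factor
tripled = scale(3, 3)    # overrides it

# A throwaway function handed directly to something that expects a function.
words = ["mpilgrim", "a", "foo"]
by_length = sorted(words, key=lambda s: len(s))

print(by_length)
```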

4.7.1. Real-World `lambda` Functions

Here are the lambda functions in apihelper.py:

    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

Notice that this uses the simple form of the and-or trick, which is okay, because a lambda function is always true in a boolean context. (That doesn't mean that a lambda function can't return a false value. The function is always true; its return value could be anything.)

Also notice that you're using the split function with no arguments. You've already seen it used with one or two arguments, but without any arguments it splits on whitespace.

Example 4.21. `split` With No Arguments

>>> s = "this   is\na\ttest"  
>>> print s
this   is
a	test
>>> print s.split()           
['this', 'is', 'a', 'test']
>>> print " ".join(s.split()) 
this is a test

	This is a multiline string, defined by escape characters instead of triple quotes. `\n` is a newline, and `\t` is a tab character.
	`split` without any arguments splits on whitespace. So three spaces, a newline, and a tab character are all the same.
	You can normalize whitespace by splitting a string with `split` and then rejoining it with `join`, using a single space as a delimiter. This is what the `info` function does to collapse multi-line `docstring`s into a single line.

So what is the info function actually doing with these lambda functions, splits, and and-or tricks?

    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

processFunc is now a function, but which function it is depends on the value of the collapse variable. If collapse is true, processFunc(string) will collapse whitespace; otherwise, processFunc(string) will return its argument unchanged.

To do this in a less robust language, like Visual Basic, you would probably create a function that took a string and a collapse argument and used an if statement to decide whether to collapse the whitespace or not, then returned the appropriate value. This would be inefficient, because the function would need to handle every possible case. Every time you called it, it would need to decide whether to collapse whitespace before it could give you what you wanted. In Python, you can take that decision logic out of the function and define a lambda function that is custom-tailored to give you exactly (and only) what you want. This is more efficient, more elegant, and less prone to those nasty oh-I-thought-those-arguments-were-reversed kinds of errors.
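Here is that decision pulled out into a sketch you can run on its own. The wrapper name `make_process_func` is an assumption made for illustration; in apihelper.py the expression sits directly inside the `info` function.

```python
def make_process_func(collapse):
    # Decide once which function to use, then hand that function back.
    return collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

text = "this   is\na\ttest"

collapsed = make_process_func(1)(text)    # whitespace normalized
unchanged = make_process_func(0)(text)    # returned exactly as given

print(collapsed)
```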

4.8. Putting It All Together

The last line of code, the only one you haven't deconstructed yet, is the one that does all the work. But by now the work is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time to knock them down.

This is the meat of apihelper.py:

    print "\n".join(["%s %s" %
                     (method.ljust(spacing),
                      processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

Note that this is one command, split over multiple lines, but it doesn't use the line continuation character (\). Remember when I said that some expressions can be split into multiple lines without using a backslash? A list comprehension is one of those expressions, since the entire expression is contained in square brackets.

Now, let's take it from the end and work backwards. The

for method in methodList

shows that this is a list comprehension. As you know, methodList is a list of all the methods you care about in object. So you're looping through that list with method.

Example 4.22. Getting a `docstring` Dynamically

>>> import odbchelper
>>> object = odbchelper 
>>> method = 'buildConnectionString'      
>>> getattr(object, method)               
<function buildConnectionString at 010D6D74>
>>> print getattr(object, method).__doc__ 
Build a connection string from a dictionary of parameters.

    Returns string.

	In the `info` function, `object` is the object you're getting help on, passed in as an argument.
	As you're looping through `methodList`, `method` is the name of the current method.
	Using the `getattr` function, you're getting a reference to the `method` function in the `object` module.
	Now, printing the actual `docstring` of the method is easy.

The next piece of the puzzle is the use of str around the docstring. As you may recall, str is a built-in function that coerces data into a string. But a docstring is always a string, so why bother with the str function? The answer is that not every function has a docstring, and if it doesn't, its __doc__ attribute is None.

Example 4.23. Why Use `str` on a `docstring`?

>>> def foo(): print 2
>>> foo()
2
>>> foo.__doc__
>>> foo.__doc__ == None 
True
>>> str(foo.__doc__)    
'None'

	You can easily define a function that has no `docstring`, so its `__doc__` attribute is `None`. Confusingly, if you evaluate the `__doc__` attribute directly, the Python IDE prints nothing at all, which makes sense if you think about it, but is still unhelpful.
	You can verify that the value of the `__doc__` attribute is actually `None` by comparing it directly.
	The `str` function takes the null value and returns a string representation of it, `'None'`.


	In SQL, you must use `IS NULL` instead of `= NULL` to compare a null value. In Python, you can use either `== None` or `is None`, but `is None` is faster.

Now that you are guaranteed to have a string, you can pass the string to processFunc, which you have already defined as a function that either does or doesn't collapse whitespace. Now you see why it was important to use str to convert a None value into a string representation. processFunc is assuming a string argument and calling its split method, which would crash if you passed it None because None doesn't have a split method.
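The crash is easy to reproduce. In this sketch, `foo` is a function with no docstring, and the other names are made up for illustration.

```python
def foo():
    pass    # deliberately no docstring

doc = foo.__doc__    # None, because foo has no docstring

# Calling split on None fails, because None has no split method.
try:
    doc.split()
    split_worked = True
except AttributeError:
    split_worked = False

# Coercing with str first always gives you something you can split.
safe = " ".join(str(doc).split())

print(safe)
```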

Stepping back even further, you see that you're using string formatting again to concatenate the return value of processFunc with the return value of method's ljust method. This is a new string method that you haven't seen before.

Example 4.24. Introducing `ljust`

>>> s = 'buildConnectionString'
>>> s.ljust(30) 
'buildConnectionString         '
>>> s.ljust(20) 
'buildConnectionString'

	`ljust` pads the string with spaces to the given length. This is what the `info` function uses to make two columns of output and line up all the `docstring`s in the second column.
	If the given length is smaller than the length of the string, `ljust` will simply return the string unchanged. It never truncates the string.
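A short sketch of the two-column effect, using made-up rows rather than real docstrings. Notice what happens to the name that is longer than the column width.

```python
rows = [("pop", "remove and return item"),
        ("buildConnectionString", "build a connection string")]

# Pad each name to 10 characters, then add a space and the description.
lines = ["%s %s" % (name.ljust(10), doc) for name, doc in rows]

print("\n".join(lines))
```

The long name is not truncated, so its description simply starts further to the right than the others.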

You're almost finished. Given the padded method name from the ljust method and the (possibly collapsed) docstring from the call to processFunc, you concatenate the two and get a single string. Since you're mapping methodList, you end up with a list of strings. Using the join method of the string "\n", you join this list into a single string, with each element of the list on a separate line, and print the result.

Example 4.25. Printing a List

>>> li = ['a', 'b', 'c']
>>> print "\n".join(li) 
a
b
c

This is also a useful debugging trick when you're working with lists. And in Python, you're always working with lists.

That's the last piece of the puzzle. You should now understand this code.

    print "\n".join(["%s %s" %
                     (method.ljust(spacing),
                      processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

4.9. Summary

The apihelper.py program and its output should now make perfect sense.

def info(object, spacing=10, collapse=1):
    """Print methods and docstrings.
    
    Takes module, class, list, dictionary, or string."""
    methodList = [method for method in dir(object) if callable(getattr(object, method))]
    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    print "\n".join(["%s %s" %
                     (method.ljust(spacing),
                      processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

if __name__ == "__main__":
    print info.__doc__

Here is the output of apihelper.py:

>>> from apihelper import info
>>> li = []
>>> info(li)
append     L.append(object) -- append object to end
count      L.count(value) -> integer -- return number of occurrences of value
extend     L.extend(list) -- extend list by appending list elements
index      L.index(value) -> integer -- return index of first occurrence of value
insert     L.insert(index, object) -- insert object before index
pop        L.pop([index]) -> item -- remove and return item at index (default last)
remove     L.remove(value) -- remove first occurrence of value
reverse    L.reverse() -- reverse *IN PLACE*
sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1
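If you want to experiment with the pieces, it can be convenient to have a variant that returns its text instead of printing it. The sketch below is such a variant, not the book's function: the name `info_text`, the parameter name `obj`, and the `Greeter` class are all assumptions made for illustration.

```python
def info_text(obj, spacing=10, collapse=1):
    """Return the text that info would print (hypothetical variant)."""
    methodList = [method for method in dir(obj)
                  if callable(getattr(obj, method))]
    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    return "\n".join(["%s %s" %
                      (method.ljust(spacing),
                       processFunc(str(getattr(obj, method).__doc__)))
                      for method in methodList])


class Greeter(object):
    """Hypothetical class with one documented method."""

    def hello(self):
        """Say
        hello."""
        return "hello"


report = info_text(Greeter(), spacing=8)

# Pick out the line for the one method defined above.
hello_line = [line for line in report.split("\n")
              if line.startswith("hello")][0]
print(hello_line)
```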

Before diving into the next chapter, make sure you're comfortable doing all of these things:

Defining and calling functions with optional and named arguments
Using str to coerce any arbitrary value into a string representation
Using getattr to get references to functions and other attributes dynamically
Extending the list comprehension syntax to do list filtering
Recognizing the and-or trick and using it safely
Defining lambda functions
Assigning functions to variables and calling the function by referencing the variable. I can't emphasize this enough, because this mode of thought is vital to advancing your understanding of Python. You'll see more complex applications of this concept throughout this book.

Chapter 5. Objects and Object-Orientation

This chapter, and pretty much every chapter after this, deals with object-oriented Python programming.

5.1. Diving In

Here is a complete, working Python program. Read the docstrings of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't worry about the stuff you don't understand; that's what the rest of the chapter is for.

Example 5.1. `fileinfo.py`

If you have not already done so, you can download this and other examples used in this book.

"""Framework for getting filetype-specific metadata.

Instantiate appropriate class with filename.  Returned object acts like a
dictionary, with key-value pairs for each piece of metadata.
    import fileinfo
    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])

Or use listDirectory function to get info on all files in a directory.
    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
        ...

Framework can be extended by adding classes for particular file types, e.g.
HTMLFileInfo, MPGFileInfo, DOCFileInfo.  Each class is completely responsible for
parsing its files appropriately; see MP3FileInfo for example.
"""
import os
import sys
from UserDict import UserDict

def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}

    def __parse(self, filename):
        "parse ID3v1.0 tags from MP3 file"
        self.clear()
        try:             
            fsock = open(filename, "rb", 0)
            try:         
                fsock.seek(-128, 2)        
                tagdata = fsock.read(128)  
            finally:     
                fsock.close()              
            if tagdata[:3] == "TAG":
                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                    self[tag] = parseFunc(tagdata[start:end])
        except IOError:  
            pass         

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)
        FileInfo.__setitem__(self, key, item)

def listDirectory(directory, fileExtList):    
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]           
    fileList = [os.path.join(directory, f) 
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList] 
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):      
        "get file info class from filename extension"           
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]       
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
    return [getFileInfoClass(f)(f) for f in fileList]           

if __name__ == "__main__":
    for info in listDirectory("/music/_singles/", [".mp3"]): 
        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
        print

This program's output depends on the files on your hard drive. To get meaningful output, you'll need to change the directory path to point to a directory of MP3 files on your own machine.

This is the output I got on my machine. Your output will be different, unless, by some startling coincidence, you share my exact taste in music.

album=
artist=Ghost in the Machine
title=A Time Long Forgotten (Concept
genre=31
name=/music/_singles/a_time_long_forgotten_con.mp3
year=1999
comment=http://mp3.com/ghostmachine

album=Rave Mix
artist=***DJ MARY-JANE***
title=HELLRAISER****Trance from Hell
genre=31
name=/music/_singles/hellraiser.mp3
year=2000
comment=http://mp3.com/DJMARYJANE

album=Rave Mix
artist=***DJ MARY-JANE***
title=KAIRO****THE BEST GOA
genre=31
name=/music/_singles/kairo.mp3
year=2000
comment=http://mp3.com/DJMARYJANE

album=Journeys
artist=Masters of Balance
title=Long Way Home
genre=31
name=/music/_singles/long_way_home1.mp3
year=2000
comment=http://mp3.com/MastersofBalan

album=
artist=The Cynic Project
title=Sidewinder
genre=18
name=/music/_singles/sidewinder.mp3
year=2000
comment=http://mp3.com/cynicproject

album=Digitosis@128k
artist=VXpanded
title=Spinning
genre=255
name=/music/_singles/spinning.mp3
year=2000
comment=http://mp3.com/artists/95/vxp
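
The tag-parsing step in the listing above can be tried without any MP3 files on hand. The following is a minimal sketch, written for Python 3 (where data read in binary mode is `bytes`), that builds a fake 128-byte tag in memory and slices it with the same kind of offset map. The helper names `make_tag` and `parse_tag` are hypothetical, and the offsets for `title`, `artist`, `album`, and `year` are assumed from the standard ID3v1 layout.

```python
def strip_nulls(data):
    "Decode a bytes field, dropping NUL padding and surrounding whitespace."
    return data.replace(b"\x00", b"").decode("latin-1").strip()

# field name -> (start, end, parse function); offsets assume the ID3v1 layout
TAG_DATA_MAP = {
    "title":   (3, 33, strip_nulls),
    "artist":  (33, 63, strip_nulls),
    "album":   (63, 93, strip_nulls),
    "year":    (93, 97, strip_nulls),
    "comment": (97, 126, strip_nulls),
    "genre":   (127, 128, ord),
}

def make_tag(title, artist, album, year, comment, genre):
    "Build a fake 128-byte tag block (hypothetical helper, for experiments only)."
    def pad(text, width):
        return text.encode("latin-1")[:width].ljust(width, b"\x00")
    return (b"TAG" + pad(title, 30) + pad(artist, 30) + pad(album, 30)
            + pad(year, 4) + pad(comment, 30) + bytes([genre]))

def parse_tag(tagdata):
    "Return a dictionary of tag fields, or an empty one if there is no tag."
    info = {}
    if tagdata[:3] == b"TAG":
        for tag, (start, end, parse_func) in TAG_DATA_MAP.items():
            info[tag] = parse_func(tagdata[start:end])
    return info

tag = make_tag("Sidewinder", "The Cynic Project", "", "2000",
               "http://mp3.com/cynicproject", 18)
info = parse_tag(tag)
print(info["title"], info["genre"])
```

Keeping the offsets in a dictionary means that supporting another field is a one-line change to the map rather than a change to the parsing loop.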

5.2. Importing Modules Using `from module import`

Python has two ways of importing modules. Both are useful, and you should know when to use each. One way, import module, you've already seen in Section 2.4, “Everything Is an Object”. The other way accomplishes the same thing, but it has subtle and important differences.

Here is the basic from module import syntax:

from UserDict import UserDict

This is similar to the import module syntax that you know and love, but with an important difference: the attributes and methods of the imported module are imported directly into the local namespace, so they are available without qualification by module name. You can import individual items or use from module import * to import everything.


	`from module import *` in Python is like `use module` in Perl; `import module` in Python is like `require module` in Perl.


	`from module import *` in Python is like `import module.*` in Java; `import module` in Python is like `import module` in Java.

Example 5.2. `import module` vs. `from module import`

>>> import types
>>> types.FunctionType             
<type 'function'>
>>> FunctionType 
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
NameError: There is no variable named 'FunctionType'
>>> from types import FunctionType 
>>> FunctionType 
<type 'function'>

	The `types` module contains no methods; it just has attributes for each Python object type. Note that the attribute, `FunctionType`, must be qualified by the module name, `types`.
	`FunctionType` by itself has not been defined in this namespace; it exists only in the context of `types`.
	This syntax imports the attribute `FunctionType` from the `types` module directly into the local namespace.
	Now `FunctionType` can be accessed directly, without reference to `types`.

When should you use from module import?

If you will be accessing attributes and methods often and don't want to type the module name over and over, use from module import.
If you want to selectively import some attributes and methods but not others, use from module import.
If the module contains attributes or functions with the same name as ones in your module, you must use import module to avoid name conflicts.

Other than that, it's just a matter of style, and you will see Python code written both ways.


	Use `from module import *` sparingly, because it makes it difficult to determine where a particular function or attribute came from, and that makes debugging and refactoring more difficult.
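
The rules of thumb above can be seen side by side in a short sketch. It assumes a local helper named `join` (hypothetical) that would clash with `os.path.join` if that function were imported directly, so the module-qualified form is used for it, while `basename` is imported by name because nothing local conflicts with it.

```python
import os.path                    # qualified access: os.path.join
from os.path import basename      # direct access: basename

def join(words):
    "A local helper whose name clashes with os.path.join (hypothetical)."
    return " ".join(words)

qualified = os.path.join("music", "kairo.mp3")   # the module's join
unqualified = basename(qualified)                 # no module prefix needed
local = join(["two", "words"])                    # the local join, untouched

print(unqualified)
print(local)
```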

5.3. Defining Classes

Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you've defined.

Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that's all that's required, since a class doesn't need to inherit from any other class.

Example 5.3. The Simplest Python Class

class Loaf: 
    pass

	The name of this class is `Loaf`, and it doesn't inherit from any other class. Class names are usually capitalized, `EachWordLikeThis`, but this is only a convention, not a requirement.
	This class doesn't define any methods or attributes, but syntactically, there needs to be something in the definition, so you use `pass`. This is a Python reserved word that just means “move along, nothing to see here”. It's a statement that does nothing, and it's a good placeholder when you're stubbing out functions or classes.
	You probably guessed this, but everything in a class is indented, just like the code within a function, `if` statement, `for` loop, and so forth. The first thing not indented is not in the class.


	The `pass` statement in Python is like an empty set of braces (`{}`) in Java or C.

Of course, realistically, most classes will be inherited from other classes, and they will define their own class methods and attributes. But as you've just seen, there is nothing that a class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Python classes do have something similar to a constructor: the __init__ method.

Example 5.4. Defining the `FileInfo` Class

from UserDict import UserDict

class FileInfo(UserDict):

In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the FileInfo class is inherited from the UserDict class (which was imported from the UserDict module). UserDict is a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior. (There are similar classes UserList and UserString which allow you to subclass lists and strings.) There is a bit of black magic behind this, which you will demystify later in this chapter when you explore the UserDict class in more depth.


	In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. There is no special keyword like `extends` in Java.

Python supports multiple inheritance. In the parentheses following the class name, you can list as many ancestor classes as you like, separated by commas.
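
As a small illustration of the ancestor list, here is a sketch with two unrelated base classes. The `Sliced` and `SlicedLoaf` classes are hypothetical and exist only for this example.

```python
class Loaf:
    pass

class Sliced:
    "hypothetical second ancestor"
    def slices(self):
        return 12

class SlicedLoaf(Loaf, Sliced):    # ancestors in parentheses, separated by commas
    pass

bread = SlicedLoaf()
print(SlicedLoaf.__bases__)        # the ancestors, in the order they were listed
print(bread.slices())              # method inherited from Sliced
```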

5.3.1. Initializing and Coding Classes

This example shows the initialization of the FileInfo class using the __init__ method.

Example 5.5. Initializing the `FileInfo` Class

class FileInfo(UserDict):
    "store file metadata"              
    def __init__(self, filename=None):

	Classes can (and should) have `docstring`s too, just like modules and functions.
	`__init__` is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It's tempting, because it looks like a constructor (by convention, `__init__` is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance of the class), and even sounds like one (“init” certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time `__init__` is called, and you already have a valid reference to the new instance of the class. But `__init__` is the closest thing you're going to get to a constructor in Python, and it fills much the same role.
	The first argument of every class method, including `__init__`, is always a reference to the current instance of the class. By convention, this argument is always named `self`. In the `__init__` method, `self` refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify `self` explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically.
	`__init__` methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making them optional to the caller. In this case, `filename` has a default value of `None`, which is the Python null value.


	By convention, the first argument of any Python class method (the reference to the current instance) is called `self`. This argument fills the role of the reserved word `this` in C++ or Java, but `self` is not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything but `self`; this is a very strong convention.

Example 5.6. Coding the `FileInfo` Class

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)        
        self["name"] = filename

	Some pseudo-object-oriented languages like Powerbuilder have a concept of “extending” constructors and other events, where the ancestor's method is called automatically before the descendant's method is executed. Python does not do this; you must always explicitly call the appropriate method in the ancestor class.
	I told you that this class acts like a dictionary, and here is the first sign of it. You're assigning the argument `filename` as the value of this object's `name` key.
	Note that the `__init__` method never returns a value.

5.3.2. Knowing When to Use `self` and `__init__`

When defining your class methods, you must explicitly list self as the first argument for each method, including __init__. When you call a method of an ancestor class from within your class, you must include the self argument. But when you call your class method from outside, you do not specify anything for the self argument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent, but it may appear inconsistent because it relies on a distinction (between bound and unbound methods) that you don't know about yet.

Whew. I realize that's a lot to absorb, but you'll get the hang of it. All Python classes work the same way, so once you learn one, you've learned them all. If you forget everything else, remember this one thing, because I promise it will trip you up:


	`__init__` methods are optional, but when you define one, you must remember to explicitly call the ancestor's `__init__` method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments.
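
The two rules in this section can be tried out in a short sketch. It is written for Python 3, where `UserDict` lives in the `collections` module rather than in a module of its own; that substitution is an assumption of the sketch, and `ForgetfulFileInfo` is a hypothetical class that shows what happens when the ancestor's `__init__` is skipped.

```python
from collections import UserDict   # Python 3 location of the class

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)          # ancestor call: self is passed by hand
        self["name"] = filename

class ForgetfulFileInfo(UserDict):
    "hypothetical subclass that skips the ancestor's __init__"
    def __init__(self, filename=None):
        self["name"] = filename          # fails: self.data was never created

f = FileInfo("/music/_singles/kairo.mp3")
via_instance = list(f.keys())            # called from outside: Python supplies self
via_class = list(FileInfo.keys(f))       # same call with self spelled out

try:
    ForgetfulFileInfo("/music/_singles/kairo.mp3")
    forgot_ok = True
except AttributeError:
    forgot_ok = False                    # the wrapped dictionary does not exist

print(via_instance == via_class, forgot_ok)
```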

5.4. Instantiating Classes

Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__ method defines. The return value will be the newly created object.

Example 5.7. Creating a `FileInfo` Instance

>>> import fileinfo
>>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3") 
>>> f.__class__    
<class fileinfo.FileInfo at 010EC204>
>>> f.__doc__      
'store file metadata'
>>> f              
{'name': '/music/_singles/kairo.mp3'}

	You are creating an instance of the `FileInfo` class (defined in the `fileinfo` module) and assigning the newly created instance to the variable `f`. You are passing one parameter, `/music/_singles/kairo.mp3`, which will end up as the `filename` argument in `FileInfo`'s `__init__` method.
	Every class instance has a built-in attribute, `__class__`, which is the object's class. (Note that the representation of this includes the physical address of the class object on my machine; your representation will be different.) Java programmers may be familiar with the `Class` class, which contains methods like `getName` and `getSuperclass` to get metadata about an object. In Python, this kind of metadata is available directly on the object itself through attributes like `__class__`, `__name__`, and `__bases__`.
	You can access the instance's `docstring` just as with a function or a module. All instances of a class share the same `docstring`.
	Remember when the `__init__` method assigned its `filename` argument to `self["name"]`? Well, here's the result. The arguments you pass when you create the class instance get sent right along to the `__init__` method (along with the object reference, `self`, which Python adds for free).


	In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit `new` operator like C++ or Java.
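
Here is a self-contained sketch of the same steps that does not need the `fileinfo` module. It assumes Python 3, so `UserDict` is imported from `collections` and `print` is a function; the class body is copied from the examples above.

```python
from collections import UserDict   # Python 3 location of the class

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

f = FileInfo("/music/_singles/kairo.mp3")   # no `new`; just call the class
cls = f.__class__                           # the object's class
print(cls.__name__)                         # metadata lives on the class object
print(cls.__bases__)
print(f.__doc__)                            # docstring shared by all instances
print(f["name"])                            # set by __init__ from the argument
```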

5.4.1. Garbage Collection

If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances, because they are freed automatically when the variables assigned to them go out of scope. Memory leaks are rare in Python.

Example 5.8. Trying to Implement a Memory Leak

>>> def leakmem():
...     f = fileinfo.FileInfo('/music/_singles/kairo.mp3') 
...     
>>> for i in range(100):
...     leakmem()

Every time the leakmem function is called, you are creating an instance of FileInfo and assigning it to the variable f, which is a local variable within the function. Then the function ends without ever freeing f, so you would expect a memory leak, but you would be wrong. When the function ends, the local variable f goes out of scope. At this point, there are no longer any references to the newly created instance of FileInfo (since you never assigned it to anything other than f), so Python destroys the instance for you.

No matter how many times you call the leakmem function, it will never leak memory, because every time, Python will destroy the newly created FileInfo instance before returning from leakmem.

The technical term for this form of garbage collection is “reference counting”. Python keeps a count of the references to every instance created. In the above example, there was only one reference to the FileInfo instance: the local variable f. When the function ends, the variable f goes out of scope, so the reference count drops to 0, and Python destroys the instance automatically.

In previous versions of Python, there were situations where reference counting failed, and Python couldn't clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list, where each node has a pointer to the previous and next node in the list), neither instance would ever be destroyed automatically because Python (correctly) believed that there was always a reference to each instance. Python 2.0 has an additional form of garbage collection called “mark-and-sweep” which is smart enough to notice this virtual gridlock and clean up circular references correctly.

As a former philosophy major, it disturbs me to think that things disappear when no one is looking at them, but that's exactly what happens in Python. In general, you can simply forget about memory management and let Python clean up after you.
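
Both cleanup mechanisms can be observed with weak references, which let you check whether an object still exists without keeping it alive. The sketch below assumes CPython on Python 3, where an object is freed as soon as its reference count reaches zero; other implementations may reclaim objects later. The `Node` class and both helper functions are hypothetical.

```python
import gc
import weakref

class Node:
    "A hypothetical object that can point at another node."
    def __init__(self):
        self.other = None

def make_one():
    node = Node()
    return weakref.ref(node)    # a weak reference does not keep node alive

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a     # each node now references the other
    return weakref.ref(a)

was_enabled = gc.isenabled()
gc.disable()                    # keep the cycle collector from running on its own
try:
    single = make_one()
    single_gone = single() is None       # count dropped to zero on return

    cycle = make_cycle()
    cycle_alive = cycle() is not None    # the counts never reach zero
    gc.collect()                         # run the cycle collector by hand
    cycle_gone = cycle() is None
finally:
    if was_enabled:
        gc.enable()

print(single_gone, cycle_alive, cycle_gone)
```

Disabling the collector here is only to make the demonstration deterministic; in normal code you would leave it on and never call `gc.collect()` yourself.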

5.5. Exploring `UserDict`: A Wrapper Class

As you've seen, FileInfo is a class that acts like a dictionary. To explore this further, let's look at the UserDict class in the UserDict module, which is the ancestor of the FileInfo class. This is nothing special; the class is written in Python and stored in a .py file, just like any other Python code. In particular, it's stored in the lib directory in your Python installation.


	In the ActivePython IDE on Windows, you can quickly open any module in your library path by selecting File->Locate... (`Ctrl-L`).

Example 5.9. Defining the `UserDict` Class

class UserDict:              
    def __init__(self, dict=None):             
        self.data = {}       
        if dict is not None: self.update(dict)

	Note that `UserDict` is a base class, not inherited from any other class.
	This is the `__init__` method that you overrode in the `FileInfo` class. Note that the argument list in this ancestor class is different from the descendant's. That's okay; each subclass can have its own set of arguments, as long as it calls the ancestor with the correct arguments. Here the ancestor class has a way to define initial values (by passing a dictionary in the `dict` argument) which `FileInfo` does not use.
	Python supports data attributes (called “instance variables” in Java and Powerbuilder, and “member variables” in C++). Data attributes are pieces of data held by a specific instance of a class. In this case, each instance of `UserDict` will have a data attribute `data`. To reference this attribute from code outside the class, you qualify it with the instance name, `instance.data`, in the same way that you qualify a function with its module name. To reference a data attribute from within the class, you use `self` as the qualifier. By convention, all data attributes are initialized to reasonable values in the `__init__` method. However, this is not required, since data attributes, like local variables, spring into existence when they are first assigned a value.
	The `update` method is a dictionary duplicator: it copies all the keys and values from one dictionary to another. This does not clear the target dictionary first; if the target dictionary already has some keys, the ones from the source dictionary will be overwritten, but others will be left untouched. Think of `update` as a merge function, not a copy function.
	This is a syntax you may not have seen before (I haven't used it in the examples in this book). It's an `if` statement, but instead of having an indented block starting on the next line, there is just a single statement on the same line, after the colon. This is perfectly legal syntax, which is just a shortcut you can use when you have only one statement in a block. (It's like specifying a single statement without braces in C++.) You can use this syntax, or you can have indented code on subsequent lines, but you can't do both for the same block.
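
The merge behavior of `update` described above is easy to check on two plain dictionaries. This is a minimal sketch with made-up keys and values; it also uses the one-line `if` form from the last callout.

```python
target = {"name": "kairo.mp3", "genre": 31}
source = {"genre": 32, "year": "2000"}

if source is not None: target.update(source)   # one statement after the colon

# "genre" was overwritten, "year" was added, "name" was left untouched
print(target)
```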


	Java and Powerbuilder support function overloading by argument list, i.e. one class can have multiple methods with the same name but a different number of arguments, or arguments of different types. Other languages (most notably PL/SQL) even support function overloading by argument name; i.e. one class can have multiple methods with the same name and the same number of arguments of the same type but different argument names. Python supports neither of these; it has no form of function overloading whatsoever. Methods are defined solely by their name, and there can be only one method per class with a given name. So if a descendant class has an `__init__` method, it always overrides the ancestor `__init__` method, even if the descendant defines it with a different argument list. And the same rule applies to any other method.
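
To see that a method is identified by its name alone, consider this sketch. The `Greeter` and `LoudGreeter` classes are hypothetical; the second `greet` definition silently replaces the first instead of overloading it.

```python
class Greeter:
    def greet(self):
        return "hello"
    def greet(self, name):        # same name: replaces the definition above
        return "hello, " + name

class LoudGreeter(Greeter):
    def greet(self):              # different argument list, still an override
        return "HELLO"

g = Greeter()
try:
    g.greet()                     # the no-argument version no longer exists
    no_arg_version_survived = True
except TypeError:
    no_arg_version_survived = False

print(no_arg_version_survived, g.greet("world"), LoudGreeter().greet())
```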


	Guido, the original author of Python, explains method overriding this way: "Derived classes may override methods of their base classes. Because methods have no special privileges when calling other methods of the same object, a method of a base class that calls another method defined in the same base class, may in fact end up calling a method of a derived class that overrides it. (For C++ programmers: all methods in Python are effectively virtual.)" If that doesn't make sense to you (it confuses the hell out of me), feel free to ignore it. I just thought I'd pass it along.
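
A tiny example may make that quotation less mysterious. In this sketch (both classes are hypothetical), `describe` is defined only in the base class, yet when it calls `self.name()` on a `Derived` instance it reaches the overriding method.

```python
class Base:
    def describe(self):
        return "I am " + self.name()   # looked up on the instance's own class
    def name(self):
        return "Base"

class Derived(Base):
    def name(self):                    # overrides Base.name
        return "Derived"

print(Base().describe())
print(Derived().describe())
```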


	Always assign an initial value to all of an instance's data attributes in the `__init__` method. It will save you hours of debugging later, tracking down `AttributeError` exceptions because you're referencing uninitialized (and therefore non-existent) attributes.

Example 5.10. `UserDict` Normal Methods

    def clear(self): self.data.clear()          
    def copy(self):           
        if self.__class__ is UserDict:          
            return UserDict(self.data)         
        import copy           
        return copy.copy(self)                 
    def keys(self): return self.data.keys()     
    def items(self): return self.data.items()  
    def values(self): return self.data.values()

	`clear` is a normal class method; it is publicly available to be called by anyone at any time. Notice that `clear`, like all class methods, has `self` as its first argument. (Remember that you don't include `self` when you call the method; it's something that Python adds for you.) Also note the basic technique of this wrapper class: store a real dictionary (`data`) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding method on the real dictionary. (In case you'd forgotten, a dictionary's `clear` method deletes all of its keys and their associated values.)
	The `copy` method of a real dictionary returns a new dictionary that is an exact duplicate of the original (all the same key-value pairs). But `UserDict` can't simply redirect to `self.data.copy`, because that method returns a real dictionary, and what you want is to return a new instance that is the same class as `self`.
	You use the `__class__` attribute to see if `self` is a `UserDict`; if so, you're golden, because you know how to copy a `UserDict`: just create a new `UserDict` and give it the real dictionary that you've squirreled away in `self.data`. Then you immediately return the new `UserDict`; you don't even get to the `import copy` on the next line.
	If `self.__class__` is not `UserDict`, then `self` must be some subclass of `UserDict` (like maybe `FileInfo`), in which case life gets trickier. `UserDict` doesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined in the subclass, so you would need to iterate through them and make sure to copy all of them. Luckily, Python comes with a module to do exactly this, and it's called `copy`. I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own). Suffice it to say that `copy` can copy arbitrary Python objects, and that's how you're using it here.
	The rest of the methods are straightforward, redirecting the calls to the built-in methods on `self.data`.
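
The two branches of `copy` can be exercised with a stripped-down wrapper. This is a sketch, not the library's code: `MiniDict` and `Tagged` are hypothetical classes that mirror only the parts of `UserDict` discussed above.

```python
import copy

class MiniDict:
    "A hypothetical, minimal dictionary wrapper."
    def __init__(self, source=None):
        self.data = {}
        if source is not None: self.data.update(source)
    def copy(self):
        if self.__class__ is MiniDict:
            return MiniDict(self.data)     # easy case: rebuild from the real dict
        return copy.copy(self)             # subclass: let the copy module do it
    def keys(self): return self.data.keys()

class Tagged(MiniDict):
    "A subclass with an extra data attribute the base class knows nothing about."
    def __init__(self, source=None, label=""):
        MiniDict.__init__(self, source)
        self.label = label

plain = MiniDict({"a": 1}).copy()
tagged = Tagged({"a": 1}, label="demo").copy()
print(type(plain).__name__, type(tagged).__name__, tagged.label)
```

Note that `copy.copy` makes a shallow copy, so in this sketch the copied `Tagged` instance keeps the extra `label` attribute but refers to the same underlying `data` dictionary as the original.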


	In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries. To compensate for this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: `UserString`, `UserList`, and `UserDict`. Using a combination of normal and special methods, the `UserDict` class does an excellent imitation of a dictionary. In Python 2.2 and later, you can inherit classes directly from built-in datatypes like `dict`. An example of this is given in the examples that come with this book, in `fileinfo_fromdict.py`.

In Python 2.2 and later, you can inherit directly from the dict built-in datatype, as shown in this example. There are three differences here compared to the UserDict version.

Example 5.11. Inheriting Directly from Built-In Datatype `dict`

class FileInfo(dict):
    "store file metadata"
    def __init__(self, filename=None): 
        self["name"] = filename

	The first difference is that you don't need to import the `UserDict` module, since `dict` is a built-in datatype and is always available. The second is that you are inheriting from `dict` directly, instead of from `UserDict.UserDict`.
	The third difference is subtle but important. Because of the way `UserDict` works internally, it requires you to manually call its `__init__` method to properly initialize its internal data structures. `dict` does not work like this; it is not a wrapper, and it requires no explicit initialization.
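
A quick way to confirm these differences is to use the class. The sketch below repeats the definition from the example and assumes Python 3 only for the `print` calls; nothing is imported and no ancestor `__init__` is called.

```python
class FileInfo(dict):
    "store file metadata"
    def __init__(self, filename=None):
        self["name"] = filename          # works without calling dict.__init__

f = FileInfo("/music/_singles/kairo.mp3")
print(isinstance(f, dict))               # a FileInfo really is a dictionary
print(len(f), f["name"])
```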

5.6. Special Class Methods

In addition to normal class methods, there are a number of special methods that Python classes can define. Instead of being called directly by your code (like normal methods), special methods are called for you by Python in particular circumstances or when specific syntax is used.

As you saw in the previous section, normal methods go a long way towards wrapping a dictionary in a class. But normal methods alone are not enough, because there are a lot of things you can do with dictionaries besides call methods on them. For starters, you can get and set items with a syntax that doesn't include explicitly invoking methods. This is where special class methods come in: they provide a way to map non-method-calling syntax into method calls.

5.6.1. Getting and Setting Items

Example 5.12. The `__getitem__` Special Method

    def __getitem__(self, key): return self.data[key]

>>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3")
>>> f
{'name':'/music/_singles/kairo.mp3'}
>>> f.__getitem__("name") 
'/music/_singles/kairo.mp3'
>>> f["name"]             
'/music/_singles/kairo.mp3'

The __getitem__ special method looks simple enough. Like the normal methods clear, keys, and values, it just redirects to the dictionary to return its value. But how does it get called? Well, you can call __getitem__ directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works. The right way to use __getitem__ is to get Python to call it for you.

This looks just like the syntax you would use to get a dictionary value, and in fact it returns the value you would expect. But here's the missing link: under the covers, Python has converted this syntax to the method call f.__getitem__("name"). That's why __getitem__ is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax.
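
If you want proof that the bracket syntax really goes through `__getitem__`, you can make the method leave a trace. In this sketch the `Catalog` class and its `lookups` list are hypothetical.

```python
class Catalog:
    "A hypothetical wrapper that records every lookup."
    def __init__(self):
        self.data = {"name": "/music/_singles/kairo.mp3"}
        self.lookups = []
    def __getitem__(self, key):
        self.lookups.append(key)     # evidence that this method ran
        return self.data[key]

c = Catalog()
direct = c.__getitem__("name")       # calling the special method yourself
by_syntax = c["name"]                # letting Python call it for you
print(direct == by_syntax, c.lookups)
```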

Of course, Python has a __setitem__ special method to go along with __getitem__, as shown in the next example.

Example 5.13. The `__setitem__` Special Method

    def __setitem__(self, key, item): self.data[key] = item

>>> f
{'name':'/music/_singles/kairo.mp3'}
>>> f.__setitem__("genre", 31) 
>>> f
{'name':'/music/_singles/kairo.mp3', 'genre':31}
>>> f["genre"] = 32            
>>> f
{'name':'/music/_singles/kairo.mp3', 'genre':32}

	Like the `__getitem__` method, `__setitem__` simply redirects to the real dictionary `self.data` to do its work. And like `__getitem__`, you wouldn't ordinarily call it directly like this; Python calls `__setitem__` for you when you use the right syntax.
	This looks like regular dictionary syntax, except of course that `f` is really a class that's trying very hard to masquerade as a dictionary, and `__setitem__` is an essential part of that masquerade. This line of code actually calls `f.__setitem__("genre", 32)` under the covers.

__setitem__ is a special class method because it gets called for you, but it's still a class method. Just as easily as the __setitem__ method was defined in UserDict, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act like dictionaries in some ways but define their own behavior above and beyond the built-in dictionary.

This concept is the basis of the entire framework you're studying in this chapter. Each file type can have a handler class that knows how to get metadata from a particular type of file. Once some attributes (like the file's name and location) are known, the handler class knows how to derive other attributes automatically. This is done by overriding the __setitem__ method, checking for particular keys, and adding additional processing when they are found.

For example, MP3FileInfo is a descendant of FileInfo. When an MP3FileInfo's name is set, it doesn't just set the name key (like the ancestor FileInfo does); it also looks in the file itself for MP3 tags and populates a whole set of keys. The next example shows how this works.

Example 5.14. Overriding `__setitem__` in `MP3FileInfo`

    def __setitem__(self, key, item):         
        if key == "name" and item:            
            self.__parse(item)                
        FileInfo.__setitem__(self, key, item)

	Notice that this `__setitem__` method is defined exactly the same way as the ancestor method. This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments. (Technically speaking, the names of the arguments don't matter; only the number of arguments is important.)
	Here's the crux of the entire `MP3FileInfo` class: if you're assigning a value to the `name` key, you want to do something extra.
	The extra processing you do for `name`s is encapsulated in the `__parse` method. This is another class method defined in `MP3FileInfo`, and when you call it, you qualify it with `self`. Just calling `__parse` would look for a normal function defined outside the class, which is not what you want. Calling `self.__parse` will look for a class method defined within the class. This isn't anything new; you reference data attributes the same way.
	After doing this extra processing, you want to call the ancestor method. Remember that this is never done for you in Python; you must do it manually. Note that you're calling the immediate ancestor, `FileInfo`, even though it doesn't have a `__setitem__` method. That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually find and call the `__setitem__` defined in `UserDict`.


	When accessing data attributes within a class, you need to qualify the attribute name: `self.attribute`. When calling other methods within a class, you need to qualify the method name: `self.method`.
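
The same pattern works for any derived key, not just MP3 tags. The following sketch assumes Python 3 (so `UserDict` comes from `collections`) and uses a hypothetical `ExtFileInfo` class that derives an `extension` key whenever `name` is set.

```python
import os.path
from collections import UserDict   # Python 3 location of the class

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class ExtFileInfo(FileInfo):
    "hypothetical handler that derives the extension from the name"
    def __parse(self, filename):
        self["extension"] = os.path.splitext(filename)[1]

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)                 # extra processing, qualified by self
        FileInfo.__setitem__(self, key, item)  # found further up, in UserDict

info = ExtFileInfo()
before = dict(info)
info["name"] = "/music/_singles/kairo.mp3"
print(before, info["extension"])
```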

Example 5.15. Setting an `MP3FileInfo`'s `name`

>>> import fileinfo
>>> mp3file = fileinfo.MP3FileInfo() 
>>> mp3file
{'name':None}
>>> mp3file["name"] = "/music/_singles/kairo.mp3"      
>>> mp3file
{'album': 'Rave Mix', 'artist': '***DJ MARY-JANE***', 'genre': 31,
'title': 'KAIRO****THE BEST GOA', 'name': '/music/_singles/kairo.mp3',
'year': '2000', 'comment': 'http://mp3.com/DJMARYJANE'}
>>> mp3file["name"] = "/music/_singles/sidewinder.mp3" 
>>> mp3file
{'album': '', 'artist': 'The Cynic Project', 'genre': 18, 'title': 'Sidewinder', 
'name': '/music/_singles/sidewinder.mp3', 'year': '2000', 
'comment': 'http://mp3.com/cynicproject'}

	First, you create an instance of `MP3FileInfo`, without passing it a filename. (You can get away with this because the `filename` argument of the `__init__` method is optional.) Since `MP3FileInfo` has no `__init__` method of its own, Python walks up the ancestor tree and finds the `__init__` method of `FileInfo`. This `__init__` method manually calls the `__init__` method of `UserDict` and then sets the `name` key to `filename`, which is `None`, since you didn't pass a filename. Thus, `mp3file` initially looks like a dictionary with one key, `name`, whose value is `None`.
	Now the real fun begins. Setting the `name` key of `mp3file` triggers the `__setitem__` method on `MP3FileInfo` (not `UserDict`), which notices that you're setting the `name` key with a real value and calls `self.__parse`. Although you haven't traced through the `__parse` method yet, you can see from the output that it sets several other keys: `album`, `artist`, `genre`, `title`, `year`, and `comment`.
	Modifying the `name` key will go through the same process again: Python calls `__setitem__`, which calls `self.__parse`, which sets all the other keys.

5.7. Advanced Special Class Methods

Python has more special methods than just __getitem__ and __setitem__. Some of them let you emulate functionality that you may not even know about.

This example shows some of the other special methods in UserDict.

Example 5.16. More Special Methods in `UserDict`

    def __repr__(self): return repr(self.data)     
    def __cmp__(self, dict):     
        if isinstance(dict, UserDict):            
            return cmp(self.data, dict.data)      
        else: 
            return cmp(self.data, dict)           
    def __len__(self): return len(self.data)       
    def __delitem__(self, key): del self.data[key]

	`__repr__` is a special method that is called when you call `repr(instance)`. The `repr` function is a built-in function that returns a string representation of an object. It works on any object, not just class instances. You're already intimately familiar with `repr` and you don't even know it. In the interactive window, when you type just a variable name and press the `ENTER` key, Python uses `repr` to display the variable's value. Go create a dictionary `d` with some data and then `print repr(d)` to see for yourself.
	`__cmp__` is called when you compare class instances. In general, you can compare any two Python objects, not just class instances, by using `==`. There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they have all the same keys and values, and strings are equal when they are the same length and contain the same sequence of characters. For class instances, you can define the `__cmp__` method and code the comparison logic yourself, and then you can use `==` to compare instances of your class and Python will call your `__cmp__` special method for you.
	`__len__` is called when you call `len(instance)`. The `len` function is a built-in function that returns the length of an object. It works on any object that could reasonably be thought of as having a length. The `len` of a string is its number of characters; the `len` of a dictionary is its number of keys; the `len` of a list or tuple is its number of elements. For class instances, define the `__len__` method and code the length calculation yourself, and then call `len(instance)` and Python will call your `__len__` special method for you.
	`__delitem__` is called when you call `del instance[key]`, which you may remember as the way to delete individual items from a dictionary. When you use `del` on a class instance, Python calls the `__delitem__` special method for you.


	In Java, you determine whether two string variables reference the same physical memory location by using `str1 == str2`. This is called object identity, and it is written in Python as `str1 is str2`. To compare string values in Java, you would use `str1.equals(str2)`; in Python, you would use `str1 == str2`. Java programmers who have been taught to believe that the world is a better place because `==` in Java compares by identity instead of by value may have a difficult time adjusting to Python's lack of such “gotchas”.

At this point, you may be thinking, “All this work just to do something in a class that I can do with a built-in datatype.” And it's true that life would be easier (and the entire UserDict class would be unnecessary) if you could inherit from built-in datatypes like a dictionary. But even if you could, special methods would still be useful, because they can be used in any class, not just wrapper classes like UserDict.

Special methods mean that any class can store key/value pairs like a dictionary, just by defining the __setitem__ method. Any class can act like a sequence, just by defining the __getitem__ method. Any class that defines the __cmp__ method can be compared with ==. And if your class represents something that has a length, don't define a GetLength method; define the __len__ method and use len(instance).
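As a minimal sketch of that idea, the hypothetical `Playlist` class below (it is not part of `fileinfo.py`) wraps a plain list and defines a handful of special methods. It is written to run on newer Python versions as well, which is why `print` is called with parentheses and why equality is defined with `__eq__`; newer versions no longer consult `__cmp__`.

```python
class Playlist:
    """Hypothetical wrapper class, used only to illustrate special methods."""

    def __init__(self, songs=None):
        self.songs = list(songs or [])

    def __repr__(self):
        # Called by repr(instance) and by the interactive prompt.
        return "Playlist(%r)" % (self.songs,)

    def __len__(self):
        # Called by len(instance).
        return len(self.songs)

    def __getitem__(self, index):
        # Called by instance[index].
        return self.songs[index]

    def __delitem__(self, index):
        # Called by del instance[index].
        del self.songs[index]

    def __eq__(self, other):
        # Called by instance == other.
        if isinstance(other, Playlist):
            return self.songs == other.songs
        return self.songs == other


p = Playlist(["kairo", "sidewinder"])
print(repr(p))
print(len(p))
del p[0]
print(p == ["sidewinder"])
```

None of the calling code mentions the special methods by name; `len`, `del`, and `==` find them on their own.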


	While other object-oriented languages only let you define the physical model of an object (“this object has a `GetLength` method”), Python's special class methods like `__len__` allow you to define the logical model of an object (“this object has a length”).

Python has a lot of other special methods. There's a whole set of them that let classes act like numbers, allowing you to add, subtract, and do other arithmetic operations on class instances. (The canonical example of this is a class that represents complex numbers, numbers with both real and imaginary components.) The __call__ method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods that allow classes to have read-only and write-only data attributes; you'll learn more about those in later chapters.
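To make two of these concrete, here is a small sketch. Both classes (`Money` and `Multiplier`) are invented for this illustration; `__add__` is the special method behind the `+` operator, and `__call__` is the one behind calling an instance.

```python
class Money:
    """Hypothetical class that acts like a number for addition."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called for self + other.
        return Money(self.cents + other.cents)


class Multiplier:
    """Hypothetical class whose instances can be called like functions."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        # Called for instance(value).
        return value * self.factor


total = Money(150) + Money(275)
double = Multiplier(2)
print(total.cents)   # 425
print(double(21))    # 42
```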

5.8. Introducing Class Attributes

You already know about data attributes, which are variables owned by a specific instance of a class. Python also supports class attributes, which are variables owned by the class itself.

Example 5.17. Introducing Class Attributes

class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}

>>> import fileinfo
>>> fileinfo.MP3FileInfo            
<class fileinfo.MP3FileInfo at 01257FDC>
>>> fileinfo.MP3FileInfo.tagDataMap 
{'title': (3, 33, <function stripnulls at 0260C8D4>), 
'genre': (127, 128, <built-in function ord>), 
'artist': (33, 63, <function stripnulls at 0260C8D4>), 
'year': (93, 97, <function stripnulls at 0260C8D4>), 
'comment': (97, 126, <function stripnulls at 0260C8D4>), 
'album': (63, 93, <function stripnulls at 0260C8D4>)}
>>> m = fileinfo.MP3FileInfo()      
>>> m.tagDataMap
{'title': (3, 33, <function stripnulls at 0260C8D4>), 
'genre': (127, 128, <built-in function ord>), 
'artist': (33, 63, <function stripnulls at 0260C8D4>), 
'year': (93, 97, <function stripnulls at 0260C8D4>), 
'comment': (97, 126, <function stripnulls at 0260C8D4>), 
'album': (63, 93, <function stripnulls at 0260C8D4>)}

	`MP3FileInfo` is the class itself, not any particular instance of the class.
	`tagDataMap` is a class attribute: literally, an attribute of the class. It is available before creating any instances of the class.
	Class attributes are available both through direct reference to the class and through any instance of the class.


	In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the `static` keyword, one without). In Python, only class attributes can be defined here; data attributes are defined in the `__init__` method.

Class attributes can be used as class-level constants (which is how you use them in MP3FileInfo), but they are not really constants. You can also change them.


	There are no constants in Python. Everything can be changed if you try hard enough. This fits with one of the core principles of Python: bad behavior should be discouraged but not banned. If you really want to change the value of `None`, you can do it, but don't come running to me when your code is impossible to debug.

Example 5.18. Modifying Class Attributes

>>> class counter:
...     count = 0   
...     def __init__(self):
...         self.__class__.count += 1 
...     
>>> counter
<class __main__.counter at 010EAECC>
>>> counter.count   
0
>>> c = counter()
>>> c.count         
1
>>> counter.count
1
>>> d = counter()   
>>> d.count
2
>>> c.count
2
>>> counter.count
2

	`count` is a class attribute of the `counter` class.
	`__class__` is a built-in attribute of every class instance (of every class). It is a reference to the class that `self` is an instance of (in this case, the `counter` class).
	Because `count` is a class attribute, it is available through direct reference to the class, before you have created any instances of the class.
	Creating an instance of the class calls the `__init__` method, which increments the class attribute `count` by `1`. This affects the class itself, not just the newly created instance.
	Creating a second instance will increment the class attribute `count` again. Notice how the class attribute is shared by the class and all instances of the class.
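It may help to see why the example goes through `self.__class__` instead of writing `self.count += 1`. The sketch below compares the two spellings; `BrokenCounter` is a hypothetical class added only for the comparison, and the class names are capitalized here to keep them apart from the example above.

```python
class Counter:
    count = 0                      # class attribute, shared by all instances

    def __init__(self):
        self.__class__.count += 1  # updates the attribute on the class


class BrokenCounter:
    count = 0

    def __init__(self):
        # Reads the class attribute, then assigns the result to a new
        # data attribute on this one instance; the class is untouched.
        self.count += 1


a, b = Counter(), Counter()
x, y = BrokenCounter(), BrokenCounter()
print(Counter.count)        # 2
print(BrokenCounter.count)  # 0
print(y.count)              # 1
```

Assigning through an instance always creates or updates a data attribute on that instance, so the shared value has to be changed on the class itself.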

5.9. Private Functions

Like most languages, Python has the concept of private elements:

Private functions, which can't be called from outside their module
Private class methods, which can't be called from outside their class
Private attributes, which can't be accessed from outside their class

Unlike in most languages, whether a Python function, method, or attribute is private or public is determined entirely by its name.

If the name of a Python function, class method, or attribute starts with (but doesn't end with) two underscores, it's private; everything else is public. Python has no concept of protected class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible only in their own class) or public (accessible from anywhere).

In MP3FileInfo, there are two methods: __parse and __setitem__. As discussed earlier, __setitem__ is a special method; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could call it directly (even from outside the fileinfo module) if you had a really good reason. However, __parse is private, because it has two underscores at the beginning of its name.


	In Python, all special methods (like `__setitem__`) and built-in attributes (like `__doc__`) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and attributes this way, because it will only confuse you (and others) later.

Example 5.19. Trying to Call a Private Method

>>> import fileinfo
>>> m = fileinfo.MP3FileInfo()
>>> m.__parse("/music/_singles/kairo.mp3") 
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
AttributeError: 'MP3FileInfo' instance has no attribute '__parse'

If you try to call a private method, Python will raise a slightly misleading exception, saying that the method does not exist. Of course it does exist, but it's private, so it's not accessible outside the class.

Strictly speaking, private methods are accessible outside their class, just not easily accessible. Nothing in Python is truly private; internally, the names of private methods and attributes are mangled and unmangled on the fly to make them seem inaccessible by their given names. You can access the __parse method of the MP3FileInfo class by the name _MP3FileInfo__parse. Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a reason, but like many other things in Python, their privateness is ultimately a matter of convention, not force.
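The name mangling described above can be seen with any class. Here is a short sketch using a made-up `Vault` class instead of `MP3FileInfo`, so that it runs without the `fileinfo` module; the exact wording of the `AttributeError` message varies between Python versions, so the example catches the exception instead of printing it.

```python
class Vault:
    """Hypothetical class with one private method."""

    def __secret(self):
        return "hidden"

    def reveal(self):
        # Inside the class, the private name works as written.
        return self.__secret()


v = Vault()
print(v.reveal())

try:
    v.__secret()
except AttributeError:
    print("no attribute named __secret from outside the class")

# The mangled name is _ClassName__method. For illustration only.
print(v._Vault__secret())
```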

5.10. Summary

That's it for the hard-core object trickery. You'll see a real-world application of special class methods in Chapter 12, which uses getattr to create a proxy to a remote web service.

The next chapter will continue using this code sample to explore other Python concepts, such as exceptions, file objects, and for loops.

Before diving into the next chapter, make sure you're comfortable doing all of these things:

Importing modules using either import module or from module import
Defining and instantiating classes
Defining __init__ methods and other special class methods, and understanding when they are called
Subclassing UserDict to define classes that act like dictionaries
Defining data attributes and class attributes, and understanding the differences between them
Defining private attributes and methods

Chapter 6. Exceptions and File Handling

In this chapter, you will dive into exceptions, file objects, for loops, and the os and sys modules. If you've used exceptions in another programming language, you can skim the first section to get a sense of Python's syntax. Be sure to tune in again for file handling.

6.1. Handling Exceptions

Like many other programming languages, Python has exception handling via try...except blocks.


	Python uses `try...except` to handle exceptions and `raise` to generate them. Java and C++ use `try...catch` to handle exceptions, and `throw` to generate them.

Exceptions are everywhere in Python. Virtually every module in the standard Python library uses them, and Python itself will raise them in a lot of different circumstances. You've already seen them repeatedly throughout this book.

Accessing a non-existent dictionary key will raise a KeyError exception.
Searching a list for a non-existent value will raise a ValueError exception.
Calling a non-existent method will raise an AttributeError exception.
Referencing a non-existent variable will raise a NameError exception.
Mixing datatypes without coercion will raise a TypeError exception.

In each of these cases, you were simply playing around in the Python IDE: an error occurred, the exception was printed (depending on your IDE, perhaps in an intentionally jarring shade of red), and that was that. This is called an unhandled exception. When the exception was raised, there was no code to explicitly notice it and deal with it, so it bubbled its way back to the default behavior built in to Python, which is to spit out some debugging information and give up. In the IDE, that's no big deal, but if that happened while your actual Python program was running, the entire program would come to a screeching halt.

An exception doesn't need to result in a complete program crash, though. Exceptions, when raised, can be handled. Sometimes an exception is really because you have a bug in your code (like accessing a variable that doesn't exist), but many times, an exception is something you can anticipate. If you're opening a file, it might not exist. If you're connecting to a database, it might be unavailable, or you might not have the correct security credentials to access it. If you know a line of code may raise an exception, you should handle the exception using a try...except block.

Example 6.1. Opening a Non-Existent File

>>> fsock = open("/notthere", "r")      
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
...     fsock = open("/notthere")       
... except IOError:   
...     print "The file does not exist, exiting gracefully"
... print "This line will always print" 
The file does not exist, exiting gracefully
This line will always print

	Using the built-in `open` function, you can try to open a file for reading (more on `open` in the next section). But the file doesn't exist, so this raises the `IOError` exception. Since you haven't provided any explicit check for an `IOError` exception, Python just prints out some debugging information about what happened and then gives up.
	You're trying to open the same non-existent file, but this time you're doing it within a `try...except` block.
	When the `open` function raises an `IOError` exception, you're ready for it. The `except IOError:` line catches the exception and executes your own block of code, which in this case just prints a more pleasant error message.
	Once an exception has been handled, processing continues normally on the first line after the `try...except` block. Note that this line will always print, whether or not an exception occurs. If you really did have a file called `notthere` in your root directory, the call to `open` would succeed, the `except` clause would be ignored, and this line would still be executed.
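The same pattern reads a little differently in a script than at the interactive prompt. The following is a sketch only: the function name `describe` and the path under the temporary directory are invented for the example, and the path is simply assumed not to exist. It returns messages instead of printing them so that both outcomes are easy to check.

```python
import os
import tempfile


def describe(path):
    """Return a short message saying whether path could be opened."""
    try:
        fsock = open(path)
    except IOError:
        # Reached only when open raised the exception.
        return "The file does not exist, exiting gracefully"
    # Reached only when open succeeded.
    fsock.close()
    return "The file was opened"


# Hypothetical path that is assumed not to exist on your system.
missing = os.path.join(tempfile.gettempdir(), "dive-into-python-notthere", "x")
print(describe(missing))
```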

Exceptions may seem unfriendly (after all, if you don't catch the exception, your entire program will crash), but consider the alternative. Would you rather get back an unusable file object to a non-existent file? You'd need to check its validity somehow anyway, and if you forgot, your program would give you strange errors somewhere down the line that you would need to trace back to the source. I'm sure you've experienced this, and you know it's not fun. With exceptions, errors occur immediately, and you can handle them in a standard way at the source of the problem.

6.1.1. Using Exceptions For Other Purposes

There are a lot of other uses for exceptions besides handling actual error conditions. A common use in the standard Python library is to try to import a module, and then check whether it worked. Importing a module that does not exist will raise an ImportError exception. You can use this to define multiple levels of functionality based on which modules are available at run-time, or to support multiple platforms (where platform-specific code is separated into different modules).

You can also define your own exceptions by creating a class that inherits from the built-in Exception class, and then raise your exceptions with the raise command. See the further reading section if you're interested in doing this.
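As a rough sketch of what that looks like, the example below defines a hypothetical `TagError` exception and a `check_tag` function that raises it. Neither name comes from `fileinfo.py`. The `except ... as e` spelling is used so that the example also runs on newer Python versions.

```python
class TagError(Exception):
    """Hypothetical exception for malformed tag data."""


def check_tag(tagdata):
    # Raise our own exception when the data does not start with the marker.
    if tagdata[:3] != "TAG":
        raise TagError("missing TAG marker")
    return True


try:
    check_tag("XYZ...")
except TagError as e:
    print("bad tag: %s" % e)
```

Because `TagError` inherits from `Exception`, it can be caught by name like any built-in exception.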

The next example demonstrates how to use an exception to support platform-specific functionality. This code comes from the getpass module, a wrapper module for getting a password from the user. Getting a password is accomplished differently on UNIX, Windows, and Mac OS platforms, but this code encapsulates all of those differences.

Example 6.2. Supporting Platform-Specific Functionality

  # Bind the name getpass to the appropriate function
  try:
      import termios, TERMIOS   
  except ImportError:
      try:
          import msvcrt         
      except ImportError:
          try:
              from EasyDialogs import AskPassword 
          except ImportError:
              getpass = default_getpass           
          else:                 
              getpass = AskPassword
      else:
          getpass = win_getpass
  else:
      getpass = unix_getpass

	`termios` is a UNIX-specific module that provides low-level control over the input terminal. If this module is not available (because it's not on your system, or your system doesn't support it), the import fails and Python raises an `ImportError`, which you catch.
	OK, you didn't have `termios`, so let's try `msvcrt`, which is a Windows-specific module that provides an API to many useful functions in the Microsoft Visual C++ runtime services. If this import fails, Python will raise an `ImportError`, which you catch.
	If the first two didn't work, you try to import a function from `EasyDialogs`, which is a Mac OS-specific module that provides functions to pop up dialog boxes of various types. Once again, if this import fails, Python will raise an `ImportError`, which you catch.
	None of these platform-specific modules is available (which is possible, since Python has been ported to a lot of different platforms), so you need to fall back on a default password input function (which is defined elsewhere in the `getpass` module). Notice what you're doing here: assigning the function `default_getpass` to the variable `getpass`. If you read the official `getpass` documentation, it tells you that the `getpass` module defines a `getpass` function. It does this by binding `getpass` to the correct function for your platform. Then when you call the `getpass` function, you're really calling a platform-specific function that this code has set up for you. You don't need to know or care which platform your code is running on -- just call `getpass`, and it will always do the right thing.
	A `try...except` block can have an `else` clause, like an `if` statement. If no exception is raised during the `try` block, the `else` clause is executed afterwards. In this case, that means that the `from EasyDialogs import AskPassword` import worked, so you should bind `getpass` to the `AskPassword` function. Each of the other `try...except` blocks has similar `else` clauses to bind `getpass` to the appropriate function when you find an `import` that works.
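The shape of that code is easier to see when it is reduced to two levels. In this sketch, the function `pick_backend` and the strings it returns are made up for illustration; only the module names `termios` and `msvcrt` come from the example above.

```python
def pick_backend():
    """Return the name of the first terminal module that imports cleanly."""
    try:
        import termios          # UNIX-specific
    except ImportError:
        try:
            import msvcrt       # Windows-specific
        except ImportError:
            backend = "default"
        else:
            backend = "msvcrt"
    else:
        # Runs only when the import in the try block succeeded.
        backend = "termios"
    return backend


print(pick_backend())
```

The result depends on the platform you run it on, which is the whole point.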

6.2. Working with File Objects

Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file.

Example 6.3. Opening a File

>>> f = open("/music/_singles/kairo.mp3", "rb") 
>>> f       
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.mode  
'rb'
>>> f.name  
'/music/_singles/kairo.mp3'

	The `open` function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. (`print open.__doc__` displays a great explanation of all the possible modes.)
	The `open` function returns an object (by now, this should not surprise you). A file object has several useful attributes.
	The `mode` attribute of a file object tells you in which mode the file was opened.
	The `name` attribute of a file object tells you the name of the file that the file object has open.

6.2.1. Reading Files

After you open a file, the first thing you'll want to do is read from it, as shown in the next example.

Example 6.4. Reading a File

>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.tell()              
0
>>> f.seek(-128, 2)       
>>> f.tell()              
7542909
>>> tagData = f.read(128) 
>>> tagData
'TAGKAIRO****THE BEST GOA         ***DJ MARY-JANE***            
Rave Mix    2000http://mp3.com/DJMARYJANE     \037'
>>> f.tell()              
7543037

	A file object maintains state about the file it has open. The `tell` method of a file object tells you your current position in the open file. Since you haven't done anything with this file yet, the current position is `0`, which is the beginning of the file.
	The `seek` method of a file object moves to another position in the open file. The second parameter specifies what the first one means; `0` means move to an absolute position (counting from the start of the file), `1` means move to a relative position (counting from the current position), and `2` means move to a position relative to the end of the file. Since the MP3 tags you're looking for are stored at the end of the file, you use `2` and tell the file object to move to a position `128` bytes from the end of the file.
	The `tell` method confirms that the current file position has moved.
	The `read` method reads a specified number of bytes from the open file and returns a string with the data that was read. The optional parameter specifies the maximum number of bytes to read. If no parameter is specified, `read` will read until the end of the file. (You could have simply said `read()` here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data is assigned to the `tagData` variable, and the current position is updated based on how many bytes were read.
	The `tell` method confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position has been incremented by 128.
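If you don't have an MP3 file handy, you can watch the position arithmetic on a scratch file. This is a sketch under stated assumptions: the 300-byte file and its contents are invented, and it is created in the system's temporary directory and removed afterwards.

```python
import os
import tempfile

# Build a 300-byte scratch file so the example has something to read.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 172 + b"TAG" + b"y" * 125)
os.close(fd)

f = open(path, "rb")
start = f.tell()         # 0: nothing has been read yet
f.seek(-128, 2)          # 128 bytes before the end of the file
before = f.tell()        # 300 - 128 = 172
tagdata = f.read(128)
after = f.tell()         # 172 + 128 = 300
f.close()
os.remove(path)

print("%d %d %d" % (start, before, after))
print(len(tagdata))
```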

6.2.2. Closing Files

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.

Example 6.5. Closing a File

>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.closed       
False
>>> f.close()      
>>> f
<closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.closed       
True
>>> f.seek(0)      
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.tell()
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.read()
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.close()

	The `closed` attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (`closed` is `False`).
	To close a file, call the `close` method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) that the system hadn't gotten around to actually writing yet, and releases the system resources.
	The `closed` attribute confirms that the file is closed.
	Just because a file is closed doesn't mean that the file object ceases to exist. The variable `f` will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; they all raise an exception.
	Calling `close` on a file object whose file is already closed does not raise an exception; it fails silently.

6.2.3. Handling I/O Errors

Now you've seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle errors.

Example 6.6. File Objects in `MP3FileInfo`

        try:              
            fsock = open(filename, "rb", 0) 
            try:         
                fsock.seek(-128, 2)         
                tagdata = fsock.read(128)   
            finally:      
                fsock.close()              
            .
            .
            .
        except IOError:   
            pass

	Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a `try...except` block. (Hey, isn't standardized indentation great? This is where you start to appreciate it.)
	The `open` function may raise an `IOError`. (Maybe the file doesn't exist.)
	The `seek` method may raise an `IOError`. (Maybe the file is smaller than 128 bytes.)
	The `read` method may raise an `IOError`. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.)
	This is new: a `try...finally` block. Once the file has been opened successfully by the `open` function, you want to make absolutely sure that you close it, even if an exception is raised by the `seek` or `read` methods. That's what a `try...finally` block is for: code in the `finally` block will always be executed, even if something in the `try` block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before.
	At last, you handle your `IOError` exception. This could be the `IOError` exception raised by the call to `open`, `seek`, or `read`. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember, `pass` is a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the next line of code after the `try...except` block.
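Here is the same nesting as a standalone function you can run. It is a sketch, not the code from `fileinfo.py`: the name `read_tail` is invented, and it returns `None` on an I/O error where the original simply uses `pass`.

```python
import os
import tempfile


def read_tail(filename, size=128):
    """Return the last size bytes of filename, or None on an I/O error."""
    try:
        fsock = open(filename, "rb")
        try:
            fsock.seek(-size, 2)
            return fsock.read(size)
        finally:
            # Runs whether seek and read succeeded or raised.
            fsock.close()
    except IOError:
        return None


# Scratch file: 200 bytes of padding followed by a 128-byte tail.
fd, path = tempfile.mkstemp()
os.write(fd, b"a" * 200 + b"b" * 128)
os.close(fd)

print(read_tail(path) == b"b" * 128)
os.remove(path)
print(read_tail(path))   # the file is gone now, so open raises IOError
```

Note that the `finally` block runs even though the `try` block exits with `return`.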

6.2.4. Writing to Files

As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes:

"Append" mode will add data to the end of the file.
"Write" mode will overwrite the file.

Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open it and start writing.

Example 6.7. Writing to Files

>>> logfile = open('test.log', 'w') 
>>> logfile.write('test succeeded') 
>>> logfile.close()
>>> print file('test.log').read()   
test succeeded
>>> logfile = open('test.log', 'a') 
>>> logfile.write('line 2')
>>> logfile.close()
>>> print file('test.log').read()   
test succeededline 2

	You start boldly by opening `test.log` for writing, which either creates a new file or overwrites the existing one. (The second parameter `"w"` means open the file for writing.) Yes, that's as dangerous as it sounds. I hope you didn't care about the previous contents of that file, because it's gone now.
	You can add data to the newly opened file with the `write` method of the file object returned by `open`.
	`file` is a synonym for `open`. This one-liner opens the file, reads its contents, and prints them.
	You happen to know that `test.log` exists (since you just finished writing to it), so you can open it and append to it. (The `"a"` parameter means open the file for appending.) Actually you could do this even if the file didn't exist, because opening the file for appending will create the file if necessary. But appending will never harm the existing contents of the file.
	As you can see, both the original line you wrote and the second line you appended are now in `test.log`. Also note that line breaks are not included. Since you didn't write them explicitly to the file either time, the file doesn't include them. You can write a line break with the `"\n"` character. Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line.
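For comparison, here is a sketch of the same two writes with an explicit `"\n"` at the end of each one. It writes to a scratch directory (created with `tempfile.mkdtemp`, an assumption of this example) so that it doesn't touch any `test.log` you may already have, and it cleans up after itself.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.log")

logfile = open(path, "w")        # creates or overwrites
logfile.write("test succeeded\n")
logfile.close()

logfile = open(path, "a")        # appends, creating the file if needed
logfile.write("line 2\n")
logfile.close()

f = open(path)
contents = f.read()
f.close()
print(contents)

# Remove the scratch file and directory.
os.remove(path)
os.rmdir(os.path.dirname(path))
```

This time each entry lands on its own line.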

6.3. Iterating with `for` Loops

Like most other languages, Python has for loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often.

Most other languages don't have a powerful list datatype like Python, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters or other iterable entities. But in Python, a for loop simply iterates over a list, the same way list comprehensions work.

Example 6.8. Introducing the `for` Loop

>>> li = ['a', 'b', 'e']
>>> for s in li:         
...     print s          
a
b
e
>>> print "\n".join(li)  
a
b
e

	The syntax for a `for` loop is similar to list comprehensions. `li` is a list, and `s` will take the value of each element in turn, starting from the first element.
	Like an `if` statement or any other indented block, a `for` loop can have any number of lines of code in it.
	This is the reason you haven't seen the `for` loop yet: you haven't needed it yet. It's amazing how often you use `for` loops in other languages when all you really want is a `join` or a list comprehension.

Doing a “normal” (by Visual Basic standards) counter for loop is also simple.

Example 6.9. Simple Counters

>>> for i in range(5):             
...     print i
0
1
2
3
4
>>> li = ['a', 'b', 'c', 'd', 'e']
>>> for i in range(len(li)):       
...     print li[i]
a
b
c
d
e

	As you saw in Example 3.20, “Assigning Consecutive Values”, `range` produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress occasionally) useful to have a counter loop.
	Don't ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example.
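If you really do need the index as well as the element, one option is the built-in `enumerate` function, which this chapter doesn't otherwise cover (it is available in Python 2.3 and later). A minimal sketch:

```python
li = ['a', 'b', 'c', 'd', 'e']

numbered = []
for i, s in enumerate(li):
    # i is the position, s is the element at that position.
    numbered.append("%d: %s" % (i, s))

print("\n".join(numbered))
```

You still iterate through the list itself; the counter just comes along for the ride.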

for loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a for loop to iterate through a dictionary.

Example 6.10. Iterating Through a Dictionary

>>> import os
>>> for k, v in os.environ.items():       
...     print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim

[...snip...]
>>> print "\n".join(["%s=%s" % (k, v)
...     for k, v in os.environ.items()]) 
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim

[...snip...]

	`os.environ` is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables accessible from MS-DOS. In UNIX, they are the variables exported in your shell's startup scripts. In Mac OS, there is no concept of environment variables, so this dictionary is empty.
	`os.environ.items()` returns a list of tuples: `[(key1, value1), (key2, value2), ...]`. The `for` loop iterates through this list. The first round, it assigns `key1` to `k` and `value1` to `v`, so `k` = `USERPROFILE` and `v` = `C:\Documents and Settings\mpilgrim`. In the second round, `k` gets the second key, `OS`, and `v` gets the corresponding value, `Windows_NT`.
	With multi-variable assignment and list comprehensions, you can replace the entire `for` loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it because it makes it clear that what I'm doing is mapping a dictionary into a list, then joining the list into a single string. Other programmers prefer to write this out as a `for` loop. The output is the same in either case, although this version is slightly faster, because there is only one `print` statement instead of many.
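The two styles can be compared side by side on a small dictionary. In this sketch, `settings` is a made-up stand-in for `os.environ`, and the items are sorted only so that the output order is predictable.

```python
# Hypothetical stand-in for os.environ, with just two entries.
settings = {"OS": "Windows_NT", "USERNAME": "mpilgrim"}

# Style 1: an explicit for loop with multi-variable assignment.
lines = []
for k, v in sorted(settings.items()):
    lines.append("%s=%s" % (k, v))

# Style 2: a list comprehension joined into a single string.
joined = "\n".join(["%s=%s" % (k, v) for k, v in sorted(settings.items())])

print(joined)
```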

Now we can look at the for loop in MP3FileInfo, from the sample fileinfo.py program introduced in Chapter 5.

Example 6.11. `for` Loop in `MP3FileInfo`

    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}
    .
    .
    .
            if tagdata[:3] == "TAG":
                for tag, (start, end, parseFunc) in self.tagDataMap.items(): 
                    self[tag] = parseFunc(tagdata[start:end])

	`tagDataMap` is a class attribute that defines the tags you're looking for in an MP3 file. Tags are stored in fixed-length fields. Once you read the last 128 bytes of the file, bytes 3 through 32 of those are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth. Note that `tagDataMap` is a dictionary of tuples, and each tuple contains two integers and a function reference.
	This looks complicated, but it's not. The structure of the `for` variables matches the structure of the elements of the list returned by `items`. Remember that `items` returns a list of tuples of the form `(key, value)`. The first element of that list is `("title", (3, 33, <function stripnulls>))`, so the first time around the loop, `tag` gets `"title"`, `start` gets `3`, `end` gets `33`, and `parseFunc` gets the function `stripnulls`.
	Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy. You slice `tagdata` from `start` to `end` to get the actual data for this tag, call `parseFunc` to post-process the data, and assign this as the value for the key `tag` in the pseudo-dictionary `self`. After iterating through all the elements in `tagDataMap`, `self` has the values for all the tags, and you know what that looks like.
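
The nested unpacking is easier to trust once you have tried it on something small. The following is only a sketch: `field_map` and `record` are invented for illustration (a 14-character record with two fields, not a real ID3 tag), and it uses an ordinary string where a real MP3 file would give you binary data.

```python
def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\x00", "").strip()

# Hypothetical two-field layout; the real tagDataMap has six fields.
field_map = {"title": (3, 13, stripnulls),
             "genre": (13, 14, ord)}

# A made-up record: "TAG", a 10-character title field padded with nulls,
# and a single genre character.
record = "TAG" + "Kairo" + "\x00" * 5 + "\x1f"

parsed = {}
# Each item is (key, (start, end, function)); the for variables mirror that shape.
for tag, (start, end, parse_func) in field_map.items():
    parsed[tag] = parse_func(record[start:end])

print(parsed)
```

After the loop, `parsed["title"]` is `'Kairo'` and `parsed["genre"]` is `31`, because each slice was handed to the function stored next to its offsets.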

6.4. Using `sys.modules`

Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules.

Example 6.12. Introducing `sys.modules`

>>> import sys        
>>> print '\n'.join(sys.modules.keys()) 
win32api
os.path
os
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat

The sys module contains system-level information, such as the version of Python you're running (sys.version or sys.version_info), and system-level options such as the maximum allowed recursion depth (sys.getrecursionlimit() and sys.setrecursionlimit()).

sys.modules is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported. Python preloads some modules on startup, and if you're using a Python IDE, sys.modules contains all the modules imported by all the programs you've run within the IDE.

This example demonstrates how to use sys.modules.

Example 6.13. Using `sys.modules`

>>> import fileinfo         
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat
>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"] 
<module 'fileinfo' from 'fileinfo.pyc'>

	As new modules are imported, they are added to `sys.modules`. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in `sys.modules`, so importing the second time is simply a dictionary lookup.
	Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the `sys.modules` dictionary.
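
You can check both claims yourself. This short sketch uses the standard `json` module purely as an example (any importable module would do) and is written for Python 3.

```python
import sys
import json  # any standard-library module would do

# The module name, as a string, is the key; the module object is the value.
module = sys.modules["json"]
print(module is json)        # the very same object, not a copy

# Importing again just finds the cached entry in sys.modules.
import json as json_again
print(json_again is module)
```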

The next example shows how to use the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined.

Example 6.14. The `__module__` Class Attribute

>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__              
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__] 
<module 'fileinfo' from 'fileinfo.pyc'>

	Every Python class has a built-in class attribute `__module__`, which is the name of the module in which the class is defined.
	Combining this with the `sys.modules` dictionary, you can get a reference to the module in which a class is defined.
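
The same lookup works for any class, not just the ones in `fileinfo.py`. As a sketch, here it is with `Fraction` from the standard `fractions` module, chosen only because it is always available; the code is Python 3.

```python
import sys
from fractions import Fraction  # imports the class, not the module name

module_name = Fraction.__module__      # a string: 'fractions'
module = sys.modules[module_name]      # the module object itself

print(module_name)
print(module.Fraction is Fraction)     # the module really contains the class
```

Note that the name `fractions` was never bound in this code; the only route to the module object is through `sys.modules`.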

Now you're ready to see how sys.modules is used in fileinfo.py, the sample program introduced in Chapter 5. This example shows that portion of the code.

Example 6.15. `sys.modules` in `fileinfo.py`

    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):       
        "get file info class from filename extension"           
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo

	This is a function with two arguments; `filename` is required, but `module` is optional and defaults to the module that contains the `FileInfo` class. This looks inefficient, because you might expect Python to evaluate the `sys.modules` expression every time the function is called. In fact, Python evaluates default expressions only once, when the `def` statement that defines the function is executed, not on each call. As you'll see later, you never call this function with a `module` argument, so `module` serves as a function-level constant.
	You'll plow through this line later, after you dive into the `os` module. For now, take it on faith that `subclass` ends up as the name of a class, like `MP3FileInfo`.
	You already know about `getattr`, which gets a reference to an object by name. `hasattr` is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has a particular class (although it works for any object and any attribute, just like `getattr`). In English, this line of code says, “If this module has the class named by `subclass` then return it, otherwise return the base class `FileInfo`.”
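
One detail is worth a small experiment. The `hasattr(...) and getattr(...) or default` form is safe here because a class is always treated as true, but it misfires when the attribute exists and holds a false value such as `0`. The sketch below is Python 3; `lookup` and `settings` are hypothetical names invented for the demonstration, and it uses the three-argument form of `getattr`, which returns a default when the attribute is missing.

```python
import math
from types import SimpleNamespace

def lookup(obj, name, default=None):
    "return obj.name if it exists, otherwise default"
    return getattr(obj, name, default)

print(lookup(math, "pi"))                        # attribute exists
print(lookup(math, "no_such_name", "fallback"))  # attribute missing

# Hypothetical object whose attribute exists but is false.
settings = SimpleNamespace(zero=0)
and_or = hasattr(settings, "zero") and getattr(settings, "zero") or "fallback"
print(and_or)                                # 'fallback', which is wrong
print(lookup(settings, "zero", "fallback"))  # 0, which is right
```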

6.5. Working with Directories

The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory.

Example 6.16. Constructing Pathnames

>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")  
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")       
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python") 
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'

	`os.path` is a reference to a module -- which module depends on your platform. Just as `getpass` encapsulates differences between platforms by setting `getpass` to a platform-specific function, `os` encapsulates differences between platforms by setting `path` to a platform-specific module.
	The `join` function of `os.path` constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing with pathnames on Windows is annoying because the backslash character must be escaped.)
	In this slightly less trivial case, `join` will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since `addSlashIfNecessary` is one of the stupid little functions I always need to write when building up my toolbox in a new language. Do not write this stupid little function in Python; smart people have already taken care of it for you.
	`expanduser` will expand a pathname that uses `~` to represent the current user's home directory. This works on any platform where users have a home directory, like Windows, UNIX, and Mac OS X; it has no effect on Mac OS.
	Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory.
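
If you are not on Windows, you can still reproduce the results above. The standard library ships the platform-specific modules under their own names, `ntpath` and `posixpath`, so this Python 3 sketch calls them directly instead of going through `os.path`. The directory name `Python` is just the example used above.

```python
import ntpath      # the Windows flavor of os.path
import os
import posixpath   # the UNIX flavor of os.path

# join adds a separator only when the first part does not already end in one.
with_slash = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
without_slash = ntpath.join("c:\\music\\ap", "mahadeva.mp3")
unix_style = posixpath.join("/music/ap", "mahadeva.mp3")
print(with_slash == without_slash)
print(unix_style)

# The result of expanduser depends on who and where you are, so just build on it.
home_python = os.path.join(os.path.expanduser("~"), "Python")
print(home_python)
```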

Example 6.17. Splitting Pathnames

>>> os.path.split("c:\\music\\ap\\mahadeva.mp3")      
('c:\\music\\ap', 'mahadeva.mp3')
>>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") 
>>> filepath      
'c:\\music\\ap'
>>> filename      
'mahadeva.mp3'
>>> (shortname, extension) = os.path.splitext(filename)                 
>>> shortname
'mahadeva'
>>> extension
'.mp3'

	The `split` function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use multi-variable assignment to return multiple values from a function? Well, `split` is such a function.
	You assign the return value of the `split` function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.
	The first variable, `filepath`, receives the value of the first element of the tuple returned from `split`, the file path.
	The second variable, `filename`, receives the value of the second element of the tuple returned from `split`, the filename.
	`os.path` also contains a function `splitext`, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique to assign each of them to separate variables.
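
A few edge cases are worth knowing before you rely on these functions. This is a small Python 3 sketch that uses `posixpath` (the UNIX flavor of `os.path`) so the results do not depend on your platform; the filenames are made up.

```python
import posixpath

filepath, filename = posixpath.split("/music/ap/mahadeva.mp3")
shortname, extension = posixpath.splitext(filename)
print(filepath, filename, shortname, extension)

# A trailing slash means the filename part comes back empty.
print(posixpath.split("/music/ap/"))        # ('/music/ap', '')

# Only the last extension is split off.
print(posixpath.splitext("archive.tar.gz")) # ('archive.tar', '.gz')

# No dot, no extension.
print(posixpath.splitext("README"))         # ('README', '')
```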

Example 6.18. Listing Directories

>>> os.listdir("c:\\music\\_singles\\")              
['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3', 
'spinning.mp3']
>>> dirname = "c:\\"
>>> os.listdir(dirname)            
['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']
>>> [f for f in os.listdir(dirname)
...     if os.path.isfile(os.path.join(dirname, f))] 
['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
'NTDETECT.COM', 'ntldr', 'pagefile.sys']
>>> [f for f in os.listdir(dirname)
...     if os.path.isdir(os.path.join(dirname, f))]  
['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']

	The `listdir` function takes a pathname and returns a list of the contents of the directory.
	`listdir` returns both files and folders, with no indication of which is which.
	You can use list filtering and the `isfile` function of the `os.path` module to separate the files from the folders. `isfile` takes a pathname and returns 1 if the path represents a file, and 0 otherwise. Here you're using `os.path.join` to ensure a full pathname, but `isfile` also works with a partial path, relative to the current working directory. You can use `os.getcwd()` to get the current working directory.
	`os.path` also has an `isdir` function which returns 1 if the path represents a directory, and 0 otherwise. You can use this to get a list of the subdirectories within a directory.
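
To try this without touching your real `c:\` drive, you can build a throwaway directory first. The sketch below is Python 3 and assumes nothing about your disk: the file and folder names are invented, and `tempfile.TemporaryDirectory` cleans everything up at the end of the `with` block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # Build a tiny hypothetical tree: two files and one subdirectory.
    for name in ("a.mp3", "b.txt"):
        open(os.path.join(dirname, name), "w").close()
    os.mkdir(os.path.join(dirname, "sub"))

    # listdir makes no promise about order, so sort for a stable result.
    entries = sorted(os.listdir(dirname))
    files = [f for f in entries if os.path.isfile(os.path.join(dirname, f))]
    dirs = [f for f in entries if os.path.isdir(os.path.join(dirname, f))]

print(entries)  # ['a.mp3', 'b.txt', 'sub']
print(files)    # ['a.mp3', 'b.txt']
print(dirs)     # ['sub']
```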

Example 6.19. Listing Directories in `fileinfo.py`

def listDirectory(directory, fileExtList):    
    "get list of file info objects for files of particular extensions" 
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]             
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList]

	`os.listdir(directory)` returns a list of all the files and folders in `directory`.
	Iterating through the list with `f`, you use `os.path.normcase(f)` to normalize the case according to operating system defaults. `normcase` is a useful little function that compensates for case-insensitive operating systems that think that `mahadeva.mp3` and `mahadeva.MP3` are the same file. For instance, on Windows and Mac OS, `normcase` will convert the entire filename to lowercase; on UNIX-compatible systems, it will return the filename unchanged.
	Iterating through the normalized list with `f` again, you use `os.path.splitext(f)` to split each filename into name and extension.
	For each file, you see if the extension is in the list of file extensions you care about (`fileExtList`, which was passed to the `listDirectory` function).
	For each file you care about, you use `os.path.join(directory, f)` to construct the full pathname of the file, and return a list of the full pathnames.
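
Here is the same filtering logic as a standalone function you can run. It is a sketch under a few assumptions: the name `list_paths` is made up, the files are created in a temporary directory rather than a real music folder, and the code is Python 3.

```python
import os
import tempfile

def list_paths(directory, file_ext_list):
    "return full pathnames of entries whose extension is in file_ext_list"
    names = [os.path.normcase(f) for f in os.listdir(directory)]
    return [os.path.join(directory, f)
            for f in names
            if os.path.splitext(f)[1] in file_ext_list]

with tempfile.TemporaryDirectory() as d:
    for name in ("kairo.mp3", "notes.txt", "cover.jpg"):
        open(os.path.join(d, name), "w").close()
    # Strip the temporary directory again so the result is easy to read.
    found = sorted(os.path.basename(p) for p in list_paths(d, [".mp3", ".jpg"]))

print(found)  # ['cover.jpg', 'kairo.mp3']
```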


	Whenever possible, you should use the functions in `os` and `os.path` for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like `os.path.split` work on UNIX, Windows, Mac OS, and any other platform supported by Python.

There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line.

Example 6.20. Listing Directories with `glob`

>>> os.listdir("c:\\music\\_singles\\")               
['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']
>>> import glob
>>> glob.glob('c:\\music\\_singles\\*.mp3')           
['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
'c:\\music\\_singles\\hellraiser.mp3',
'c:\\music\\_singles\\kairo.mp3',
'c:\\music\\_singles\\long_way_home1.mp3',
'c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']
>>> glob.glob('c:\\music\\_singles\\s*.mp3')          
['c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']
>>> glob.glob('c:\\music\\*\\*.mp3')

	As you saw earlier, `os.listdir` simply takes a directory path and lists all files and directories in that directory.
	The `glob` module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard. Here the wildcard is a directory path plus "*.mp3", which will match all `.mp3` files. Note that each element of the returned list already includes the full path of the file.
	If you want to find all the files in a specific directory that start with "s" and end with ".mp3", you can do that too.
	Now consider this scenario: you have a `music` directory, with several subdirectories within it, with `.mp3` files within each subdirectory. You can get a list of all of those with a single call to `glob`, by using two wildcards at once. One wildcard is the `"*.mp3"` (to match `.mp3` files), and the other is the `*` within the directory path itself, to match any subdirectory within `c:\music`. That's a crazy amount of power packed into one deceptively simple-looking function!
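
The two-wildcard case can be tried on a scratch directory. This Python 3 sketch invents a small `music` tree inside a temporary directory (the subdirectory and file names are only examples) and then matches every `.mp3` file one level down.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as music:
    for sub, name in (("ap", "mahadeva.mp3"),
                      ("_singles", "kairo.mp3"),
                      ("_singles", "readme.txt")):
        os.makedirs(os.path.join(music, sub), exist_ok=True)
        open(os.path.join(music, sub, name), "w").close()

    # One wildcard for the subdirectory, one for the filename.
    pattern = os.path.join(music, "*", "*.mp3")
    # glob returns full paths; make them relative for a readable result.
    matches = sorted(os.path.relpath(p, music) for p in glob.glob(pattern))

print(matches)
```

The `readme.txt` file is skipped, and both `.mp3` files are found even though they live in different subdirectories.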

6.6. Putting It All Together

Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all fits together.

Example 6.21. `listDirectory`

def listDirectory(directory, fileExtList):     
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]           
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList]
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):       
        "get file info class from filename extension"           
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo 
    return [getFileInfoClass(f)(f) for f in fileList]

	`listDirectory` is the main attraction of this entire module. It takes a directory (like `c:\music\_singles\` in my case) and a list of interesting file extensions (like `['.mp3']`), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in that directory. And it does it in just a few straightforward lines of code.
	As you saw in the previous section, this line of code gets a list of the full pathnames of all the files in `directory` that have an interesting file extension (as specified by `fileExtList`).
	Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions -- literally, a function within a function. The nested function `getFileInfoClass` can be called only from the function in which it is defined, `listDirectory`. As with any other function, you don't need an interface declaration or anything fancy; just define the function and code it.
	Now that you've seen the `os` module, this line should make more sense. It gets the extension of the file (`os.path.splitext(filename)[1]`), forces it to uppercase (`.upper()`), slices off the dot (`[1:]`), and constructs a class name out of it with string formatting. So `c:\music\ap\mahadeva.mp3` becomes `.mp3` becomes `.MP3` becomes `MP3` becomes `MP3FileInfo`.
	Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually exists in this module. If it does, you return the class, otherwise you return the base class `FileInfo`. This is a very important point: this function returns a class. Not an instance of a class, but the class itself.
	For each file in the “interesting files” list (`fileList`), you call `getFileInfoClass` with the filename (`f`). Calling `getFileInfoClass(f)` returns a class; you don't know exactly which class, but you don't care. You then create an instance of this class (whatever it is) and pass the filename (`f` again), to the `__init__` method. As you saw earlier in this chapter, the `__init__` method of `FileInfo` sets `self["name"]`, which triggers `__setitem__`, which is overridden in the descendant (`MP3FileInfo`) to parse the file appropriately to pull out the file's metadata. You do all that for each interesting file and return a list of the resulting instances.

Note that listDirectory is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class: HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results.
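
To see the extension mechanism in isolation, here is a stripped-down sketch in Python 3. It is not the book's code: the classes are empty stubs built on `dict` instead of `UserDict`, the handlers live in a `SimpleNamespace` called `handlers` instead of a module, and `make_infos` is a hypothetical name. The point is only that adding `HTMLFileInfo` requires no change to the dispatching function.

```python
import os.path
from types import SimpleNamespace

class FileInfo(dict):
    "store file metadata"
    def __init__(self, filename=None):
        dict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "hypothetical stub; the real class parses ID3 tags"

class HTMLFileInfo(FileInfo):
    "hypothetical new handler, added without touching make_infos"

# Stands in for the module that the book looks up through sys.modules.
handlers = SimpleNamespace(MP3FileInfo=MP3FileInfo, HTMLFileInfo=HTMLFileInfo)

def make_infos(filenames, namespace=handlers):
    def get_class(filename):
        name = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
        return getattr(namespace, name, FileInfo)
    # get_class(f) returns a class; calling that class creates the instance.
    return [get_class(f)(f) for f in filenames]

infos = make_infos(["a.mp3", "index.html", "notes.txt"])
print([type(info).__name__ for info in infos])
# ['MP3FileInfo', 'HTMLFileInfo', 'FileInfo']
```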

6.7. Summary

The fileinfo.py program introduced in Chapter 5 should now make perfect sense.

"""Framework for getting filetype-specific metadata.

Instantiate appropriate class with filename.  Returned object acts like a
dictionary, with key-value pairs for each piece of metadata.
    import fileinfo
    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])

Or use listDirectory function to get info on all files in a directory.
    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
        ...

Framework can be extended by adding classes for particular file types, e.g.
HTMLFileInfo, MPGFileInfo, DOCFileInfo.  Each class is completely responsible for
parsing its files appropriately; see MP3FileInfo for example.
"""
import os
import sys
from UserDict import UserDict

def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}

    def __parse(self, filename):
        "parse ID3v1.0 tags from MP3 file"
        self.clear()
        try:             
            fsock = open(filename, "rb", 0)
            try:         
                fsock.seek(-128, 2)        
                tagdata = fsock.read(128)  
            finally:     
                fsock.close()              
            if tagdata[:3] == "TAG":
                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                    self[tag] = parseFunc(tagdata[start:end])
        except IOError:  
            pass         

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)
        FileInfo.__setitem__(self, key, item)

def listDirectory(directory, fileExtList):    
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]           
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList]
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):      
        "get file info class from filename extension"           
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]       
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
    return [getFileInfoClass(f)(f) for f in fileList]           

if __name__ == "__main__":
    for info in listDirectory("/music/_singles/", [".mp3"]):
        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
        print

Before diving into the next chapter, make sure you're comfortable doing the following things:

Catching exceptions with try...except
Protecting external resources with try...finally
Reading from files
Assigning multiple values at once in a for loop
Using the os module for all your cross-platform file manipulation needs
Dynamically instantiating classes of unknown type by treating classes as objects and passing them around

Chapter 7. Regular Expressions

Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the re module to get an overview of the available functions and their arguments.

7.1. Diving In

Strings have methods for searching (index, find, and count), replacing (replace), and parsing (split), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations.

If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you're combining them with split and join and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.

Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions to make them practically self-documenting.

7.2. Case Study: Street Addresses

This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just make this stuff up; it's actually useful.) This example shows how I approached the problem.

Example 7.1. Matching at the End of a String

>>> s = '100 NORTH MAIN ROAD'
>>> s.replace('ROAD', 'RD.')               
'100 NORTH MAIN RD.'
>>> s = '100 NORTH BROAD ROAD'
>>> s.replace('ROAD', 'RD.')               
'100 NORTH BRD. RD.'
>>> s[:-4] + s[-4:].replace('ROAD', 'RD.') 
'100 NORTH BROAD RD.'
>>> import re            
>>> re.sub('ROAD$', 'RD.', s)               
'100 NORTH BROAD RD.'

	My goal is to standardize a street address so that `'ROAD'` is always abbreviated as `'RD.'`. At first glance, I thought this was simple enough that I could just use the string method `replace`. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, `'ROAD'`, was a constant. And in this deceptively simple example, `s.replace` does indeed work.
	Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that `'ROAD'` appears twice in the address, once as part of the street name `'BROAD'` and once as its own word. The `replace` method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.
	To solve the problem of addresses with more than one `'ROAD'` substring, you could resort to something like this: only search and replace `'ROAD'` in the last four characters of the address (`s[-4:]`), and leave the rest of the string alone (`s[:-4]`). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string you're replacing (if you were replacing `'STREET'` with `'ST.'`, you would need to use `s[:-6]` and `s[-6:].replace(...)`). Would you like to come back in six months and debug this? I know I wouldn't.
	It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the `re` module.
	Take a look at the first parameter: `'ROAD$'`. This is a simple regular expression that matches `'ROAD'` only when it occurs at the end of a string. The `$` means “end of the string”. (There is a corresponding character, the caret `^`, which means “beginning of the string”.)
	Using the `re.sub` function, you search the string `s` for the regular expression `'ROAD$'` and replace it with `'RD.'`. This matches the `ROAD` at the end of the string `s`, but does not match the `ROAD` that's part of the word `BROAD`, because that's in the middle of `s`.
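
As a quick check of both anchors, here is a small Python 3 sketch. The third address is made up to show the caret doing the mirror-image job at the start of the string.

```python
import re

addresses = ["100 NORTH MAIN ROAD", "100 NORTH BROAD ROAD"]
# $ pins the match to the end, so the ROAD inside BROAD is left alone.
fixed = [re.sub("ROAD$", "RD.", s) for s in addresses]
print(fixed)

# ^ pins the match to the beginning, so the second 100 is left alone.
spelled = re.sub("^100", "ONE HUNDRED", "100 NORTH 100TH ROAD")
print(spelled)
```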

Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD' at the end of the address, was not good enough, because not all addresses included a street designation at all; some just ended with the street name. Most of the time, I got away with it, but if the street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of the word 'BROAD', which is not what I wanted.

Example 7.2. Matching Whole Words

>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)  
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)  
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)  
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s) 
'100 BROAD RD. APT. 3'

	What I really wanted was to match `'ROAD'` when it was at the end of the string and it was its own whole word, not a part of some larger word. To express this in a regular expression, you use `\b`, which means “a word boundary must occur right here”. In Python, this is complicated by the fact that the `'\'` character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it's a bug in syntax or a bug in your regular expression.
	To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter `r`. This tells Python that nothing in this string should be escaped; `'\t'` is a tab character, but `r'\t'` is really the backslash character `\` followed by the letter `t`. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly (and regular expressions get confusing quickly enough all by themselves).
	Sigh. Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word `'ROAD'` as a whole word by itself, but it wasn't at the end, because the address had an apartment number after the street designation. Because `'ROAD'` isn't at the very end of the string, it doesn't match, so the entire call to `re.sub` ends up replacing nothing at all, and you get the original string back, which is not what you want.
	To solve this problem, I removed the `$` character and added another `\b`. Now the regular expression reads “match `'ROAD'` when it's a whole word by itself anywhere in the string,” whether at the end, the beginning, or somewhere in the middle.
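
The difference between `'\b'` and `r'\b'` is easy to see for yourself. In this Python 3 sketch, the third address is invented to show the pattern matching at the beginning of a string as well.

```python
import re

# In an ordinary string "\b" is one character (a backspace);
# in a raw string it is two characters, a backslash and the letter b.
lengths = (len("\b"), len(r"\b"))
print(lengths)  # (1, 2)

pattern = re.compile(r"\bROAD\b")
addresses = ["100 BROAD", "100 BROAD ROAD APT. 3", "ROAD RUNNER LANE"]
fixed = [pattern.sub("RD.", s) for s in addresses]
print(fixed)
```

Only the standalone word is replaced: `'100 BROAD'` comes back unchanged, while the other two addresses each get a single `'RD.'`.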

7.3. Case Study: Roman Numerals

You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really does date back to the ancient Roman empire (hence the name).

In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.

I = 1
V = 5
X = 10
L = 50
C = 100
D = 500
M = 1000

The following are some general rules for constructing Roman numerals:

Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.
The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). The number 40 is written as XL (10 less than 50), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (10 less than 50, then 1 less than 5).
Similarly, at 9, you need to subtract from the next highest tens character: 8 is VIII, but 9 is IX (1 less than 10), not VIIII (since the I character can not be repeated four times). The number 90 is XC, 900 is CM.
The fives characters can not be repeated. The number 10 is always represented as X, never as VV. The number 100 is always C, never LL.
Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much. DC is 600; CD is a completely different number (400, 100 less than 500). CI is 101; IC is not even a valid Roman numeral (because you can't subtract 1 directly from 100; you would need to write it as XCIX, for 10 less than 100, then 1 less than 10).
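
The rules above can be sketched as a short conversion function. This is one possible way to write it, not something this chapter depends on: `to_roman` and `ROMAN_PAIRS` are hypothetical names, the range is limited to 1 through 3999 (the most you can express with three `M` characters), and the code is Python 3.

```python
# Value/symbol pairs, highest first, with the subtractive forms spelled out.
ROMAN_PAIRS = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
               (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
               (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    "convert an integer from 1 to 3999 to a Roman numeral"
    if not 0 < n < 4000:
        raise ValueError("number out of range (must be 1..3999)")
    result = ""
    for value, symbol in ROMAN_PAIRS:
        # Take as many of this symbol as fit, then move to the next lower one.
        while n >= value:
            result += symbol
            n -= value
    return result

print(to_roman(44))    # XLIV
print(to_roman(99))    # XCIX
print(to_roman(1946))  # MCMXLVI
```

Because the subtractive pairs (`CM`, `CD`, `XC`, `XL`, `IX`, `IV`) are in the table, no tens character is ever repeated four times and no fives character is ever repeated at all.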

7.3.1. Checking for Thousands

What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of M characters.

Example 7.3. Checking for Thousands

>>> import re
>>> pattern = '^M?M?M?$'       
>>> re.search(pattern, 'M')    
<SRE_Match object at 0106FB58>
>>> re.search(pattern, 'MM')   
<SRE_Match object at 0106C290>
>>> re.search(pattern, 'MMM')  
<SRE_Match object at 0106AA38>
>>> re.search(pattern, 'MMMM') 
>>> re.search(pattern, '')     
<SRE_Match object at 0106F4A8>

	This pattern has three parts: `^` to match what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the `M` characters were, which is not what you want. You want to make sure that the `M` characters, if they're there, are at the beginning of the string. `M?` to optionally match a single `M` character. Since this is repeated three times, you're matching anywhere from zero to three `M` characters in a row. `$` to match what precedes only at the end of the string. When combined with the `^` character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the `M` characters.
	The essence of the `re` module is the `search` function, which takes a regular expression (`pattern`) and a string (`'M'`) to try to match against the regular expression. If a match is found, `search` returns an object which has various methods to describe the match; if no match is found, `search` returns `None`, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return value of `search`. `'M'` matches this regular expression, because the first optional `M` matches and the second and third optional `M` characters are ignored.
	`'MM'` matches because the first and second optional `M` characters match and the third `M` is ignored.
	`'MMM'` matches because all three `M` characters match.
	`'MMMM'` does not match. All three `M` characters match, but then the regular expression insists on the string ending (because of the `$` character), and the string doesn't end yet (because of the fourth `M`). So `search` returns `None`.
	Interestingly, an empty string also matches this regular expression, since all the `M` characters are optional.
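
Since all you care about here is whether the pattern matched, the return value of `search` can be reduced to a plain `True` or `False`. The following is a minimal sketch of that idea; the helper name `has_valid_thousands` is made up for illustration and is not part of the `re` module.

```python
import re

# Same pattern as Example 7.3: zero to three M characters and nothing else.
pattern = '^M?M?M?$'

def has_valid_thousands(s):
    """Return True if s is made up of zero to three M characters."""
    # search returns a match object on success and None on failure,
    # so comparing against None gives a plain True or False.
    return re.search(pattern, s) is not None

for candidate in ['', 'M', 'MM', 'MMM', 'MMMM']:
    print('%r: %s' % (candidate, has_valid_thousands(candidate)))
```

Only the last candidate, `'MMMM'`, should come back `False`.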

7.3.2. Checking for Hundreds

The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed, depending on its value.

100 = C
200 = CC
300 = CCC
400 = CD
500 = D
600 = DC
700 = DCC
800 = DCCC
900 = CM

So there are four possible patterns:

CM
CD
Zero to three C characters (zero if the hundreds place is 0)
D, followed by zero to three C characters

The last two patterns can be combined:

an optional D, followed by zero to three C characters

This example shows how to validate the hundreds place of a Roman numeral.

Example 7.4. Checking for Hundreds

>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$' 
>>> re.search(pattern, 'MCM')            
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')             
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')         
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')           
>>> re.search(pattern, '')               
<SRE_Match object at 01071D98>

	This pattern starts out the same as the previous one, checking for the beginning of the string (`^`), then the thousands place (`M?M?M?`). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: `CM`, `CD`, and `D?C?C?C?` (which is an optional `D` followed by zero to three optional `C` characters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first one that matches, and ignores the rest.
	`'MCM'` matches because the first `M` matches, the second and third `M` characters are ignored, and the `CM` matches (so the `CD` and `D?C?C?C?` patterns are never even considered). `MCM` is the Roman numeral representation of `1900`.
	`'MD'` matches because the first `M` matches, the second and third `M` characters are ignored, and the `D?C?C?C?` pattern matches `D` (each of the three `C` characters is optional and is ignored). `MD` is the Roman numeral representation of `1500`.
	`'MMMCCC'` matches because all three `M` characters match, and the `D?C?C?C?` pattern matches `CCC` (the `D` is optional and is ignored). `MMMCCC` is the Roman numeral representation of `3300`.
	`'MCMC'` does not match. The first `M` matches, the second and third `M` characters are ignored, and the `CM` matches, but then the `$` does not match because you're not at the end of the string yet (you still have an unmatched `C` character). The `C` does not match as part of the `D?C?C?C?` pattern, because the mutually exclusive `CM` pattern has already matched.
	Interestingly, an empty string still matches this pattern, because all the `M` characters are optional and ignored, and the empty string matches the `D?C?C?C?` pattern where all the characters are optional and ignored.
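
To see which of the three alternatives did the work in each case, you can ask the match object what the parenthesized part matched. This sketch peeks ahead slightly: it uses the `group` method of the match object, which is only introduced properly in the phone number case study later in this chapter. The list name `hundreds` is an assumption made for illustration.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'

# The nine hundreds values from the table above, plus the empty string
# for a hundreds place of 0.
hundreds = ['', 'C', 'CC', 'CCC', 'CD', 'D', 'DC', 'DCC', 'DCCC', 'CM']

for h in hundreds:
    match = re.search(pattern, 'M' + h)
    # group(1) is whatever the first set of parentheses matched.
    print('%s -> %r' % ('M' + h, match.group(1)))

# Malformed hundreds places are rejected.
print(re.search(pattern, 'MCCCC'))  # None: four C characters in a row
print(re.search(pattern, 'MDD'))    # None: D can not be repeated
```

Each of the ten valid strings matches, and in each case `group(1)` is exactly the hundreds part that was appended to the `M`.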

Whew! See how quickly regular expressions can get nasty? And you've only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But let's look at another way to express the pattern.

7.4. Using the `{n,m}` Syntax

In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.

Example 7.5. The Old Way: Every Character Optional

>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')    
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')   
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')  
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM') 
>>>

	This matches the start of the string, and then the first optional `M`, but not the second and third `M` (but that's okay because they're optional), and then the end of the string.
	This matches the start of the string, and then the first and second optional `M`, but not the third `M` (but that's okay because it's optional), and then the end of the string.
	This matches the start of the string, and then all three optional `M`, and then the end of the string.
	This matches the start of the string, and then all three optional `M`, but then does not match the end of the string (because there is still one unmatched `M`), so the pattern does not match and returns `None`.

Example 7.6. The New Way: From `n` to `m`

>>> pattern = '^M{0,3}$'       
>>> re.search(pattern, 'M')    
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')   
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')  
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM') 
>>>

	This pattern says: “Match the start of the string, then anywhere from zero to three `M` characters, then the end of the string.” The 0 and 3 can be any numbers; if you want to match at least one but no more than three `M` characters, you could say `M{1,3}`.
	This matches the start of the string, then one `M` out of a possible three, then the end of the string.
	This matches the start of the string, then two `M` out of a possible three, then the end of the string.
	This matches the start of the string, then three `M` out of a possible three, then the end of the string.
	This matches the start of the string, then three `M` out of a possible three, but then does not match the end of the string. The regular expression allows for up to only three `M` characters before the end of the string, but you have four, so the pattern does not match and returns `None`.


	Python's `re` module gives you no way to determine programmatically that two regular expressions are equivalent. The best you can do is write a lot of test cases to make sure they behave the same way on all relevant inputs. You'll learn more about writing test cases later in this book.
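
One way to act on that advice is to throw every short string you can build from a small alphabet at both patterns and look for a disagreement. The sketch below does that for the old and new thousands patterns; the function name `find_disagreements` and the choice of alphabet and length are assumptions made for illustration. Passing this check is evidence that the two patterns agree, not proof.

```python
import re
from itertools import product

old_way = '^M?M?M?$'
new_way = '^M{0,3}$'

def find_disagreements(alphabet, max_length):
    """Return every string over alphabet, up to max_length characters long,
    that one pattern accepts and the other rejects."""
    disagreements = []
    for length in range(max_length + 1):
        for chars in product(alphabet, repeat=length):
            s = ''.join(chars)
            old_matched = re.search(old_way, s) is not None
            new_matched = re.search(new_way, s) is not None
            if old_matched != new_matched:
                disagreements.append(s)
    return disagreements

# All 63 strings built from M and C, from the empty string up to five characters.
print(find_disagreements('MC', 5))  # an empty list means no disagreement was found

# By contrast, M{1,3} insists on at least one M, so it rejects the empty string.
print(re.search('^M{1,3}$', ''))
```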

7.4.1. Checking for Tens and Ones

Now let's expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.

Example 7.7. Checking for Tens

>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
>>> re.search(pattern, 'MCMXL')    
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCML')     
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLX')    
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXX')  
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXX') 
>>>

	This matches the start of the string, then the first optional `M`, then `CM`, then `XL`, then the end of the string. Remember, the `(A|B|C)` syntax means “match exactly one of A, B, or C”. You match `XL`, so you ignore the `XC` and `L?X?X?X?` choices, and then move on to the end of the string. `MCMXL` is the Roman numeral representation of `1940`.
	This matches the start of the string, then the first optional `M`, then `CM`, then `L?X?X?X?`. Of the `L?X?X?X?`, it matches the `L` and skips all three optional `X` characters. Then you move to the end of the string. `MCML` is the Roman numeral representation of `1950`.
	This matches the start of the string, then the first optional `M`, then `CM`, then the optional `L` and the first optional `X`, skips the second and third optional `X`, then the end of the string. `MCMLX` is the Roman numeral representation of `1960`.
	This matches the start of the string, then the first optional `M`, then `CM`, then the optional `L` and all three optional `X` characters, then the end of the string. `MCMLXXX` is the Roman numeral representation of `1980`.
	This matches the start of the string, then the first optional `M`, then `CM`, then the optional `L` and all three optional `X` characters, then fails to match the end of the string because there is still one more `X` unaccounted for. So the entire pattern fails to match, and returns `None`. `MCMLXXXX` is not a valid Roman numeral.

The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.

>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

So what does that look like using this alternate {n,m} syntax? This example shows the new syntax.

Example 7.8. Validating Roman Numerals with `{n,m}`

>>> pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')             
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')         
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMMDCCCLXXXVIII') 
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'I')                
<_sre.SRE_Match object at 0x008EEB48>

	This matches the start of the string, then one of a possible four `M` characters, then `D?C{0,3}`. Of that, it matches the optional `D` and zero of three possible `C` characters. Moving on, it matches `L?X{0,3}` by matching the optional `L` and zero of three possible `X` characters. Then it matches `V?I{0,3}` by matching the optional `V` and zero of three possible `I` characters, and finally the end of the string. `MDLV` is the Roman numeral representation of `1555`.
	This matches the start of the string, then two of a possible four `M` characters, then the `D?C{0,3}` with a `D` and one of three possible `C` characters; then `L?X{0,3}` with an `L` and one of three possible `X` characters; then `V?I{0,3}` with a `V` and one of three possible `I` characters; then the end of the string. `MMDCLXVI` is the Roman numeral representation of `2666`.
	This matches the start of the string, then four out of four `M` characters, then `D?C{0,3}` with a `D` and three out of three `C` characters; then `L?X{0,3}` with an `L` and three out of three `X` characters; then `V?I{0,3}` with a `V` and three out of three `I` characters; then the end of the string. `MMMMDCCCLXXXVIII` is the Roman numeral representation of `4888`, and it's the longest Roman numeral this pattern can match.
	Watch closely. (I feel like a magician. “Watch closely, kids, I'm going to pull a rabbit out of my hat.”) This matches the start of the string, then zero out of four `M`, then matches `D?C{0,3}` by skipping the optional `D` and matching zero out of three `C`, then matches `L?X{0,3}` by skipping the optional `L` and matching zero out of three `X`, then matches `V?I{0,3}` by skipping the optional `V` and matching one out of three `I`. Then the end of the string. Whoa.
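
Four hand-picked examples are reassuring, but the pattern can also be checked against every numeral it is supposed to accept. The sketch below generates the numerals for 1 through 4999 and runs each one through the pattern. The `to_roman` helper and the `NUMERALS` table are hypothetical, written only for this check, and are not taken from this chapter.

```python
import re

pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

# Hypothetical lookup table for illustration: values from largest to smallest.
NUMERALS = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
            (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
            (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]

def to_roman(n):
    """Build the numeral for n by repeatedly taking the largest value that fits."""
    result = ''
    for value, numeral in NUMERALS:
        while n >= value:
            result += numeral
            n -= value
    return result

# Every numeral from 1 to 4999 should match, so this list should stay empty.
failures = [n for n in range(1, 5000) if re.search(pattern, to_roman(n)) is None]
print(failures)

# A few malformed strings, none of which should match.
for bad in ['IIII', 'VV', 'IC', 'XCX', 'MMMMM']:
    print('%s: %s' % (bad, re.search(pattern, bad)))
```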

If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back to your own regular expressions a few months later. I've done it, and it's not a pretty sight.

In the next section you'll explore an alternate syntax that can help keep your expressions maintainable.

7.5. Verbose Regular Expressions

So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee that you'll be able to understand it six months later. What you really need is inline documentation.

Python allows you to do this with something called verbose regular expressions. A verbose regular expression is different from a compact regular expression in two ways:

Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a backslash in front of it.)
Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a # character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your source code, but it works the same way.

This will be more clear with an example. Let's revisit the compact regular expression you've been working with, and make it a verbose regular expression. This example shows how.

Example 7.9. Regular Expressions with Inline Comments

>>> pattern = """
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
>>> re.search(pattern, 'M', re.VERBOSE)                
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)        
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMMDCCCLXXXVIII', re.VERBOSE) 
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M')

	The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: `re.VERBOSE` is a constant defined in the `re` module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it's a lot more readable.
	This matches the start of the string, then one of a possible four `M`, then `CM`, then `L` and three of a possible three `X`, then `IX`, then the end of the string.
	This matches the start of the string, then four of a possible four `M`, then `D` and three of a possible three `C`, then `L` and three of a possible three `X`, then `V` and three of a possible three `I`, then the end of the string.
	This does not match. Why? Because it doesn't have the `re.VERBOSE` flag, so the `re.search` function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.
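
The two rules from the start of this section, ignored whitespace and ignored comments, are easier to trust once you have seen them on a pattern small enough to hold in your head. The tiny patterns below are made up for illustration; they show that an unescaped space or hash mark disappears in verbose mode, and that a backslash brings it back.

```python
import re

# An unescaped space is ignored, so this verbose pattern is really just 'ab'.
plain_on_ab = re.search('a b', 'ab', re.VERBOSE) is not None
plain_on_a_b = re.search('a b', 'a b', re.VERBOSE) is not None

# A backslash in front of the space makes it match a literal space again.
escaped_on_a_b = re.search(r'a\ b', 'a b', re.VERBOSE) is not None

# The same goes for the hash mark: escaped, it is a literal character
# instead of the start of a comment.
hash_on_ext = re.search(r'\#1234', 'ext #1234', re.VERBOSE) is not None

print(plain_on_ab)     # True
print(plain_on_a_b)    # False
print(escaped_on_a_b)  # True
print(hash_on_ext)     # True
```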

7.6. Case study: Parsing Phone Numbers

So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.

This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.

Here are the phone numbers I needed to be able to accept:

800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and the rest of the phone number was 1212. For those with an extension, I need to know that the extension was 1234.

Let's work through developing a solution for phone number parsing. This example shows the first step.

Example 7.10. Finding Numbers

>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$') 
>>> phonePattern.search('800-555-1212').groups()            
('800', '555', '1212')
>>> phonePattern.search('800-555-1212-1234')                
>>>

	Always read regular expressions from left to right. This one matches the beginning of the string, and then `(\d{3})`. What's `\d{3}`? Well, `\d` means “any numeric digit” (`0` through `9`), and the `{3}` means “match exactly three of them”; it's a variation on the `{n,m}` syntax you saw earlier. Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string.
	To get access to the groups that the regular expression parser remembered along the way, use the `groups()` method on the object that the `search` function returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits.
	This regular expression is not the final answer, because it doesn't handle a phone number with an extension on the end. For that, you'll need to expand the regular expression.
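
Calling `groups()` directly on the result of `search` is fine at the interactive prompt, but when the pattern does not match, `search` returns `None`, and `None` has no `groups()` method. A minimal sketch of a safer wrapper follows; the function name `split_phone` is an assumption made for illustration.

```python
import re

phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

def split_phone(number):
    """Return (area code, trunk, rest) as strings, or None if number doesn't match."""
    match = phonePattern.search(number)
    if match is None:
        # No match: there is no match object to ask for groups.
        return None
    # groups() returns one string per set of parentheses, in order.
    area_code, trunk, rest = match.groups()
    return area_code, trunk, rest

print(split_phone('800-555-1212'))
print(split_phone('800-555-1212-1234'))
```

The first call returns the three parts of the number; the second returns `None` because this pattern does not yet allow for an extension.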

Example 7.11. Finding the Extension

>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$') 
>>> phonePattern.search('800-555-1212-1234').groups()             
('800', '555', '1212', '1234')
>>> phonePattern.search('800 555 1212 1234')    
>>> 
>>> phonePattern.search('800-555-1212')         
>>>

	This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered group of four digits. What's new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string.
	The `groups()` method now returns a tuple of four elements, since the regular expression now defines four groups to remember.
	Unfortunately, this regular expression is not the final answer either, because it assumes that the different parts of the phone number are separated by hyphens. What if they're separated by spaces, or commas, or dots? You need a more general solution to match several different types of separators.
	Oops! Not only does this regular expression not do everything you want, it's actually a step backwards, because now you can't parse phone numbers without an extension. That's not what you wanted at all; if the extension is there, you want to know what it is, but if it's not there, you still want to know what the different parts of the main number are.

The next example shows the regular expression to handle separators between the different parts of the phone number.

Example 7.12. Handling Different Separators

>>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$') 
>>> phonePattern.search('800 555 1212 1234').groups() 
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212-1234').groups() 
('800', '555', '1212', '1234')
>>> phonePattern.search('80055512121234')             
>>> 
>>> phonePattern.search('800-555-1212')               
>>>

	Hang on to your hat. You're matching the beginning of the string, then a group of three digits, then `\D+`. What the heck is that? Well, `\D` matches any character except a numeric digit, and `+` means “1 or more”. So `\D+` matches one or more characters that are not digits. This is what you're using instead of a literal hyphen, to try to match different separators.
	Using `\D+` instead of `-` means you can now match phone numbers where the parts are separated by spaces instead of hyphens.
	Of course, phone numbers separated by hyphens still work too.
	Unfortunately, this is still not the final answer, because it assumes that there is a separator at all. What if the phone number is entered without any spaces or hyphens at all?
	Oops! This still hasn't fixed the problem of requiring extensions. Now you have two problems, but you can solve both of them with the same technique.

The next example shows the regular expression for handling phone numbers without separators.

Example 7.13. Handling Numbers Without Separators

>>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') 
>>> phonePattern.search('80055512121234').groups()    
('800', '555', '1212', '1234')
>>> phonePattern.search('800.555.1212 x1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()      
('800', '555', '1212', '')
>>> phonePattern.search('(800)5551212 x1234')         
>>>

	The only change you've made since that last step is changing all the `+` to `*`. Instead of `\D+` between the parts of the phone number, you now match on `\D*`. Remember that `+` means “1 or more”? Well, `*` means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all.
	Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits (`800`), then zero non-numeric characters, then a remembered group of three digits (`555`), then zero non-numeric characters, then a remembered group of four digits (`1212`), then zero non-numeric characters, then a remembered group of an arbitrary number of digits (`1234`), then the end of the string.
	Other variations work now too: dots instead of hyphens, and both a space and an `x` before the extension.
	Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the `groups()` method still returns a tuple of four elements, but the fourth element is just an empty string.
	I hate to be the bearer of bad news, but you're not finished yet. What's the problem here? There's an extra character before the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No problem, you can use the same technique of “zero or more non-numeric characters” to skip over the leading characters before the area code.
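
The difference between “1 or more” and “zero or more” is easiest to see with the two versions of the pattern side by side on the same inputs. This sketch uses the patterns from the last two examples; the names `with_plus`, `with_star`, and `samples` are made up for illustration.

```python
import re

with_plus = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
with_star = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

samples = ['800 555 1212 1234',   # separators and an extension
           '80055512121234',      # no separators at all
           '800-555-1212',        # no extension
           '(800)5551212 x1234']  # leading character before the area code

for s in samples:
    plus = with_plus.search(s)
    star = with_star.search(s)
    # "match and match.groups()" yields None when there was no match.
    print('%-20s plus=%s star=%s' % (s, plus and plus.groups(), star and star.groups()))
```

The `+` version only parses the first sample. The `*` version parses the first three, and neither one can cope with the leading parenthesis in the last.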

The next example shows how to handle leading characters in phone numbers.

Example 7.14. Handling Leading Characters

>>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') 
>>> phonePattern.search('(800)5551212 ext. 1234').groups()                 
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()         
('800', '555', '1212', '')
>>> phonePattern.search('work 1-(800) 555.1212 #1234')   
>>>

	This is the same as in the previous example, except now you're matching `\D*`, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you're not remembering these non-numeric characters (they're not in parentheses). If you find them, you'll just skip over them and then start remembering the area code whenever you get to it.
	You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it's treated as a non-numeric separator and matched by the `\D*` after the first remembered group.)
	Just a sanity check to make sure you haven't broken anything that used to work. Since the leading characters are entirely optional, this matches the beginning of the string, then zero non-numeric characters, then a remembered group of three digits (`800`), then one non-numeric character (the hyphen), then a remembered group of three digits (`555`), then one non-numeric character (the hyphen), then a remembered group of four digits (`1212`), then zero non-numeric characters, then a remembered group of zero digits, then the end of the string.
	This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn't this phone number match? Because there's a `1` before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (`\D*`). Aargh.

Let's back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let's take a different approach: don't explicitly match the beginning of the string at all. This approach is shown in the next example.

Example 7.15. Phone Number, Wherever I May Find Ye

>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') 
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()        
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()     
('800', '555', '1212', '')
>>> phonePattern.search('80055512121234').groups()   
('800', '555', '1212', '1234')

	Note the lack of `^` in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
	Now you can successfully parse a phone number that includes leading characters and a leading digit, plus any number of any kind of separators around each part of the phone number.
	Sanity check. This still works.
	That still works too.

See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can you tell the difference between one and the next?

While you still understand the final answer (and it is the final answer; if you've discovered a case it doesn't handle, I don't want to know about it), let's write it out as a verbose regular expression, before you forget why you made the choices you made.

Example 7.16. Parsing Phone Numbers (Final Version)

>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()        
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()     
('800', '555', '1212', '')

	Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it's no surprise that it parses the same inputs.
	Final sanity check. Yes, this still works. You're done.
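
To put the final pattern to work, you might wrap it in a function and run it against the whole list of phone numbers from the start of this section. What follows is one possible sketch, not the code from the original project; the function name `parse_phone` and the dictionary keys are assumptions made for illustration.

```python
import re

phonePattern = re.compile(r'''
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)

def parse_phone(text):
    """Return the parts of a phone number as a dictionary, or None if there is no match."""
    match = phonePattern.search(text)
    if match is None:
        return None
    area_code, trunk, rest, extension = match.groups()
    return {'area_code': area_code, 'trunk': trunk,
            'rest': rest, 'extension': extension}

numbers = ['800-555-1212', '800 555 1212', '800.555.1212', '(800) 555-1212',
           '1-800-555-1212', '800-555-1212-1234', '800-555-1212x1234',
           '800-555-1212 ext. 1234', 'work 1-(800) 555.1212 #1234']

for number in numbers:
    parts = parse_phone(number)
    print('%-30s %s-%s-%s ext %r' % (number, parts['area_code'], parts['trunk'],
                                     parts['rest'], parts['extension']))
```

Every number in the list comes back with area code `800`, trunk `555`, and `1212` as the rest of the number; the extension is either `'1234'` or an empty string.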

7.7. Summary

This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely overwhelmed by them now, believe me, you ain't seen nothing yet.

You should now be familiar with the following techniques:

^ matches the beginning of a string.
$ matches the end of a string.
\b matches a word boundary.
\d matches any numeric digit.
\D matches any non-numeric character.
x? matches an optional x character (in other words, it matches an x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches an x character at least n times, but not more than m times.
(a|b|c) matches either a or b or c.
(x) in general is a remembered group. You can get the value of what matched by using the groups() method of the object returned by re.search.
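
If you want to convince yourself of each item in the list, a quick table of patterns, inputs, and expected results does the job. The patterns and strings below are made up for illustration; each row exercises one of the techniques above.

```python
import re

# (pattern, string to search, whether a match is expected)
checks = [
    (r'^M',         'MCM',            True),   # ^ anchors at the beginning
    (r'^C',         'MCM',            False),
    (r'M$',         'MCM',            True),   # $ anchors at the end
    (r'\bROAD\b',   '100 BROAD ROAD', True),   # \b needs a word boundary
    (r'\bROAD\b',   '100 BROAD',      False),
    (r'\d',         'ext. 1234',      True),   # \d is any digit
    (r'\D',         '1234',           False),  # \D is any non-digit
    (r'^ab?$',      'a',              True),   # x? is zero or one x
    (r'^ab*$',      'abbb',           True),   # x* is zero or more x
    (r'^ab+$',      'a',              False),  # x+ is one or more x
    (r'^a{2,3}$',   'aaaa',           False),  # x{n,m} is n to m x's
    (r'^(a|b|c)$',  'b',              True),   # (a|b|c) is one of a, b, or c
]

for pattern, string, expected in checks:
    matched = re.search(pattern, string) is not None
    print('%-12s %-16r %-5s %s' % (pattern, string, matched,
                                   'ok' if matched == expected else 'UNEXPECTED'))

# (x) remembers what it matched; groups() hands it back.
print(re.search(r'^(\d{3})-(\d{4})$', '555-1212').groups())
```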

Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than they solve.

	Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
--Jamie Zawinski, in comp.emacs.xemacs

Chapter 8. HTML Processing

8.1. Diving in

I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.

Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the docstrings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how any of these class methods ever get called. Don't worry, all will be revealed in due time.

Example 8.1. `BaseHTMLProcessor.py`

If you have not already done so, you can download this and other examples used in this book.

from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):     
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):         
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):         
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):       
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):           
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):        
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):             
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):              
        """Return processed HTML as a single string"""
        return "".join(self.pieces)

Example 8.2. `dialect.py`

import re
from BaseHTMLProcessor import BaseHTMLProcessor

class Dialectizer(BaseHTMLProcessor):
    subs = ()

    def reset(self):
        # extend (called from __init__ in ancestor)
        # Reset all data attributes
        self.verbatim = 0
        BaseHTMLProcessor.reset(self)

    def start_pre(self, attrs):            
        # called for every <pre> tag in HTML source
        # Increment verbatim mode count, then handle tag like normal
        self.verbatim += 1                 
        self.unknown_starttag("pre", attrs)

    def end_pre(self):   
        # called for every </pre> tag in HTML source
        # Decrement verbatim mode count
        self.unknown_endtag("pre")         
        self.verbatim -= 1                 

    def handle_data(self, text):    
        # override
        # called for every block of text in HTML source
        # If in verbatim mode, save text unaltered;
        # otherwise process the text with a series of substitutions
        self.pieces.append(self.verbatim and text or self.process(text))

    def process(self, text):
        # called from handle_data
        # Process text block by performing series of regular expression
        # substitutions (actual substitutions are defined in descendant)
        for fromPattern, toPattern in self.subs:
            text = re.sub(fromPattern, toPattern, text)
        return text

class ChefDialectizer(Dialectizer):
    """convert HTML to Swedish Chef-speak
    
    based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
    """
    subs = ((r'a([nu])', r'u\1'),
            (r'A([nu])', r'U\1'),
            (r'a\B', r'e'),
            (r'A\B', r'E'),
            (r'en\b', r'ee'),
            (r'\Bew', r'oo'),
            (r'\Be\b', r'e-a'),
            (r'\be', r'i'),
            (r'\bE', r'I'),
            (r'\Bf', r'ff'),
            (r'\Bir', r'ur'),
            (r'(\w*?)i(\w*?)$', r'\1ee\2'),
            (r'\bow', r'oo'),
            (r'\bo', r'oo'),
            (r'\bO', r'Oo'),
            (r'the', r'zee'),
            (r'The', r'Zee'),
            (r'th\b', r't'),
            (r'\Btion', r'shun'),
            (r'\Bu', r'oo'),
            (r'\BU', r'Oo'),
            (r'v', r'f'),
            (r'V', r'F'),
            (r'w', r'w'),
            (r'W', r'W'),
            (r'([a-z])[.]', r'\1.  Bork Bork Bork!'))

class FuddDialectizer(Dialectizer):
    """convert HTML to Elmer Fudd-speak"""
    subs = ((r'[rl]', r'w'),
            (r'qu', r'qw'),
            (r'th\b', r'f'),
            (r'th', r'd'),
            (r'n[.]', r'n, uh-hah-hah-hah.'))

class OldeDialectizer(Dialectizer):
    """convert HTML to mock Middle English"""
    subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
            (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
            (r'ick\b', r'yk'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
            (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
            (r'([aeiou])re\b', r'\1r'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
            (r'tion\b', r'cioun'),
            (r'ion\b', r'ioun'),
            (r'aid', r'ayde'),
            (r'ai', r'ey'),
            (r'ay\b', r'y'),
            (r'ay', r'ey'),
            (r'ant', r'aunt'),
            (r'ea', r'ee'),
            (r'oa', r'oo'),
            (r'ue', r'e'),
            (r'oe', r'o'),
            (r'ou', r'ow'),
            (r'ow', r'ou'),
            (r'\bhe', r'hi'),
            (r've\b', r'veth'),
            (r'se\b', r'e'),
            (r"'s\b", r'es'),
            (r'ic\b', r'ick'),
            (r'ics\b', r'icc'),
            (r'ical\b', r'ick'),
            (r'tle\b', r'til'),
            (r'll\b', r'l'),
            (r'ould\b', r'olde'),
            (r'own\b', r'oune'),
            (r'un\b', r'onne'),
            (r'rry\b', r'rye'),
            (r'est\b', r'este'),
            (r'pt\b', r'pte'),
            (r'th\b', r'the'),
            (r'ch\b', r'che'),
            (r'ss\b', r'sse'),
            (r'([wybdp])\b', r'\1e'),
            (r'([rnt])\b', r'\1\1e'),
            (r'from', r'fro'),
            (r'when', r'whan'))

def translate(url, dialectName="chef"):
    """fetch URL and translate using dialect
    
    dialect in ("chef", "fudd", "olde")"""
    import urllib    
    sock = urllib.urlopen(url)         
    htmlSource = sock.read()           
    sock.close()     
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]  
    parser = parserClass()               
    parser.feed(htmlSource)
    parser.close()         
    return parser.output() 

def test(url):
    """test all dialects against URL"""
    for dialect in ("chef", "fudd", "olde"):
        outfile = "%s.html" % dialect
        fsock = open(outfile, "wb")
        fsock.write(translate(url, dialect))
        fsock.close()
        import webbrowser
        webbrowser.open_new(outfile)

if __name__ == "__main__":
    test("http://diveintopython3.org/odbchelper_list.html")

Example 8.3. Output of `dialect.py`

Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the code listings and screen examples were left untouched.

<div class="abstract">
<p>Lists awe <span class="application">Pydon</span>'s wowkhowse datatype.
If youw onwy expewience wif wists is awways in
<span class="application">Visuaw Basic</span> ow (God fowbid) de datastowe
in <span class="application">Powewbuiwdew</span>, bwace youwsewf fow
<span class="application">Pydon</span> wists.</p>
</div>
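
The text-processing step can be tried in isolation, without any HTML parsing. The following is a minimal sketch, written for Python 3, that copies the rules from `FuddDialectizer.subs` and applies them the same way `Dialectizer.process` does. The names `FUDD_SUBS` and `apply_subs` are made up for this illustration and are not part of `dialect.py`.

```python
import re

# Rules copied from FuddDialectizer.subs. Order matters: r'th\b' must run
# before the more general r'th', or word-final "th" would never become "f".
FUDD_SUBS = (
    (r'[rl]', r'w'),
    (r'qu', r'qw'),
    (r'th\b', r'f'),
    (r'th', r'd'),
    (r'n[.]', r'n, uh-hah-hah-hah.'),
)

def apply_subs(text, subs):
    """Apply each (pattern, replacement) pair to text, in order."""
    for from_pattern, to_pattern in subs:
        text = re.sub(from_pattern, to_pattern, text)
    return text

fudd_text = apply_subs("Lists are Python's workhorse datatype.", FUDD_SUBS)
print(fudd_text)  # Lists awe Pydon's wowkhowse datatype.
```

Each rule sees the output of the rule before it, which is why the sequence of tuples is ordered rather than stored in a dictionary.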

8.2. Introducing `sgmllib.py`

HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.

The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.

sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.

SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:

Start tag: An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag's attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes.
End tag: An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method, otherwise it calls unknown_endtag with the tag name.
Character reference: An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent.
Entity reference: An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
Comment: An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction: An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.
Declaration: An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
Text data: A block of text. Anything that doesn't fit into the other 7 categories. When found, SGMLParser calls handle_data with the text.
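
The lookup rule for start and end tags can be sketched in a few lines. This is not the source of `sgmllib.py`; it is a rough imitation written for Python 3, where `sgmllib` is no longer available, built on the standard `html.parser.HTMLParser` class. The class name `DispatchingParser` and its `events` list are assumptions made for this illustration.

```python
from html.parser import HTMLParser

class DispatchingParser(HTMLParser):
    """Hypothetical stand-in that imitates SGMLParser-style dispatch."""

    def __init__(self):
        # convert_charrefs=False keeps text and references as separate events
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Look for start_<tag>, then do_<tag>; otherwise use the fallback.
        method = (getattr(self, "start_" + tag, None)
                  or getattr(self, "do_" + tag, None))
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Look for end_<tag>; otherwise use the fallback.
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attrs):
        self.events.append(("unknown_starttag", tag, attrs))

    def unknown_endtag(self, tag):
        self.events.append(("unknown_endtag", tag))

    def handle_data(self, data):
        self.events.append(("handle_data", data))

    # Tag-specific handlers: only <pre> gets special treatment here.
    def start_pre(self, attrs):
        self.events.append(("start_pre", attrs))

    def end_pre(self):
        self.events.append(("end_pre",))

parser = DispatchingParser()
parser.feed('<p>hi</p><pre class="screen">x</pre>')
parser.close()
for event in parser.events:
    print(event)
```

The `<p>` tag has no dedicated handler, so it falls through to `unknown_starttag`, while `<pre>` is routed to `start_pre` with only its attribute list.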


	Python 2.0 had a bug where `SGMLParser` would not recognize declarations at all (`handle_decl` would never be called), which meant that `DOCTYPE`s were silently ignored. This is fixed in Python 2.1.

sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.


	In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces.

Example 8.4. Sample test of `sgmllib.py`

Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython3.org/.)

c:\python23\lib> type "c:\downloads\diveintopython3\html\toc\index.html"

<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
   
      <title>Dive Into Python</title>
      <link rel="stylesheet" href="diveintopython3.css" type="text/css">

... rest of file omitted for brevity ...

Running this through the test suite of sgmllib.py yields this output:

c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython3\html\toc\index.html"
data: '\n\n'
start tag: <html >
data: '\n   '
start tag: <head>
data: '\n      '
start tag: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >
data: '\n   \n      '
start tag: <title>
data: 'Dive Into Python'
end tag: </title>
data: '\n      '
start tag: <link rel="stylesheet" href="diveintopython3.css" type="text/css" >
data: '\n      '

... rest of output omitted for brevity ...

Here's the roadmap for the rest of the chapter:

Subclass SGMLParser to create classes that extract interesting data out of HTML documents.
Subclass SGMLParser to create BaseHTMLProcessor, which overrides all 8 handler methods and uses them to reconstruct the original HTML from the pieces.
Subclass BaseHTMLProcessor to create Dialectizer, which adds some methods to process specific HTML tags specially, and overrides the handle_data method to provide a framework for processing the text blocks between the HTML tags.
Subclass Dialectizer to create classes that define text processing rules used by Dialectizer.handle_data.
Write a test suite that grabs a real web page from http://diveintopython3.org/ and processes it.

Along the way, you'll also learn about locals, globals, and dictionary-based string formatting.

8.3. Extracting data from HTML documents

To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.

The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.

Example 8.5. Introducing `urllib`

>>> import urllib   
>>> sock = urllib.urlopen("http://diveintopython3.org/") 
>>> htmlSource = sock.read()          
>>> sock.close()    
>>> print htmlSource
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
      <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
   <title>Dive Into Python</title>
<link rel='stylesheet' href='diveintopython3.css' type='text/css'>
<link rev='made' href='mailto:mark@diveintopython3.org'>
<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
<meta name='description' content='a free Python tutorial for experienced programmers'>
</head>
<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr><td class='header' width='1%' valign='top'>diveintopython3.org</td>
<td width='99%' align='right'><hr size='1' noshade></td></tr>
<tr><td class='tagline' colspan='2'>Python&nbsp;for&nbsp;experienced&nbsp;programmers</td></tr>

[...snip...]

	The `urllib` module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages).
	The simplest use of `urllib` is to retrieve the entire text of a web page using the `urlopen` function. Opening a URL is similar to opening a file. The return value of `urlopen` is a file-like object, which has some of the same methods as a file object.
	The simplest thing to do with the file-like object returned by `urlopen` is `read`, which reads the entire HTML of the web page into a single string. The object also supports `readlines`, which reads the text line by line into a list.
	When you're done with the object, make sure to `close` it, just like a normal file object.
	You now have the complete HTML of the home page of `http://diveintopython3.org/` in a string, and you're ready to parse it.
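
For readers on Python 3, here is a hedged sketch of the same open, read, close sequence. It assumes the Python 3 location of `urlopen` (`urllib.request`) and, so that it runs without a network connection, it opens a `data:` URL built from a made-up HTML string instead of a live web page. The variable names are illustrative only.

```python
from urllib.parse import quote
from urllib.request import urlopen

# A made-up page, embedded in a data: URL so no network access is needed.
sample_html = "<html><head><title>Dive Into Python</title></head></html>"
url = "data:text/html," + quote(sample_html)

# The with statement closes the file-like object when the block ends.
with urlopen(url) as sock:
    html_bytes = sock.read()          # in Python 3, read() returns bytes

html_source = html_bytes.decode("ascii")
print(html_source)
```

Replacing `url` with an `http://` address would fetch a live page in the same way, although the right encoding to pass to `decode` then depends on the page.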

Example 8.6. Introducing `urllister.py`

If you have not already done so, you can download this and other examples used in this book.

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):            
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):   
        href = [v for k, v in attrs if k=='href']  
        if href:
            self.urls.extend(href)

	`reset` is called by the `__init__` method of `SGMLParser`, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in `reset`, not in `__init__`, so that it will be re-initialized properly when someone re-uses a parser instance.
	`start_a` is called by `SGMLParser` whenever it finds an `<a>` tag. The tag may contain an `href` attribute, and/or other attributes, like `name` or `title`, in which case the `attrs` parameter is a list of tuples, `[(attribute, value), (attribute, value), ...]`. Or it may be just a bare `<a>`, a valid (if useless) HTML tag, in which case `attrs` would be an empty list.
	You can find out whether this `<a>` tag has an `href` attribute with a simple multi-variable list comprehension.
	String comparisons like `k=='href'` are always case-sensitive, but that's safe in this case, because `SGMLParser` converts attribute names to lowercase while building `attrs`.
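
The same idea can be sketched for Python 3, which no longer ships `sgmllib`. This version is an approximation built on `html.parser.HTMLParser`, which has no per-tag `start_a` hook, so the tag name is checked by hand inside `handle_starttag`. The sample HTML string is made up for the illustration.

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    def reset(self):
        # HTMLParser.__init__ calls reset, so initialization belongs here
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples with lowercased names;
        # a valueless attribute has the value None, which is skipped
        if tag == "a":
            self.urls.extend(v for k, v in attrs
                             if k == "href" and v is not None)

lister = URLLister()
lister.feed('<a href="toc/index.html">contents</a> '
            '<A HREF="#download" TITLE="get it">download</A> '
            '<a name="top">no link here</a>')
lister.close()
print(lister.urls)  # ['toc/index.html', '#download']
```

The uppercase `HREF` is still matched, because the parser lowercases tag and attribute names before the comparison happens.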

Example 8.7. Using `urllister.py`

>>> import urllib, urllister
>>> usock = urllib.urlopen("http://diveintopython3.org/")
>>> parser = urllister.URLLister()
>>> parser.feed(usock.read())         
>>> usock.close()   
>>> parser.close()  
>>> for url in parser.urls: print url 
toc/index.html
#download
#languages
toc/index.html
appendix/history.html
download/diveintopython3-html-5.0.zip
download/diveintopython3-pdf-5.0.zip
download/diveintopython3-word-5.0.zip
download/diveintopython3-text-5.0.zip
download/diveintopython3-html-flat-5.0.zip
download/diveintopython3-xml-5.0.zip
download/diveintopython3-common-5.0.zip


... rest of output omitted for brevity ...

	Call the `feed` method, defined in `SGMLParser`, to get HTML into the parser.^[1] It takes a string, which is what `usock.read()` returns.
	Like files, you should `close` your URL objects as soon as you're done with them.
	You should `close` your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the `feed` method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call `close` to flush the buffer and force everything to be fully parsed.
	Once the parser is `close`d, the parsing is complete, and `parser.urls` contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.)
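
The buffering behavior is easy to observe by feeding a tag in two pieces. The sketch below uses Python 3's `html.parser.HTMLParser` as a stand-in for `SGMLParser`, on the assumption that the two buffer incomplete tags in a similar way; the class name `LinkCollector` is made up for this illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def reset(self):
        super().reset()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")

collector = LinkCollector()

# The first chunk stops before the closing ">", so the tag is incomplete
# and the parser holds it in its buffer without calling any handler.
collector.feed('<a href="toc/index.html"')
seen_after_first_feed = list(collector.hrefs)

# The second chunk completes the tag; close() flushes anything left over.
collector.feed('>contents</a>')
collector.close()
seen_after_close = list(collector.hrefs)

print(seen_after_first_feed)  # []
print(seen_after_close)       # ['toc/index.html']
```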

8.4. Introducing `BaseHTMLProcessor.py`

SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer.

BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data.

Example 8.8. Introducing `BaseHTMLProcessor`

class BaseHTMLProcessor(SGMLParser):
    def reset(self):      
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs): 
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):          
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):          
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):        
        self.pieces.append("&%(ref)s" % locals())
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):            
        self.pieces.append(text)

    def handle_comment(self, text):         
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):              
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        self.pieces.append("<!%(text)s>" % locals())

	`reset`, called by `SGMLParser.__init__`, initializes `self.pieces` as an empty list before calling the ancestor method. `self.pieces` is a data attribute which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML that `SGMLParser` parsed, and each method will append that string to `self.pieces`. Note that `self.pieces` is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but Python is much more efficient at dealing with lists.^[2]
	Since `BaseHTMLProcessor` does not define any methods for specific tags (like the `start_a` method in `URLLister`), `SGMLParser` will call `unknown_starttag` for every start tag. This method takes the tag (`tag`) and the list of attribute name/value pairs (`attrs`), reconstructs the original HTML, and appends it to `self.pieces`. The string formatting here is a little strange; you'll untangle that (and also the odd-looking `locals` function) later in this chapter.
	Reconstructing end tags is much simpler; just take the tag name and wrap it in the `</...>` brackets.
	When `SGMLParser` finds a character reference, it calls `handle_charref` with the bare reference. If the HTML document contains the reference `&#160;`, `ref` will be `160`. Reconstructing the original complete character reference just involves wrapping `ref` in `&#...;` characters.
	Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference requires wrapping `ref` in `&...;` characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard HTML entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called `htmlentitydefs`. Hence the extra `if` statement.)
	Blocks of text are simply appended to `self.pieces` unaltered.
	HTML comments are wrapped in `<!--...-->` characters.
	Processing instructions are wrapped in `<?...>` characters.
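
To see the consumer-to-producer round trip run end to end on Python 3, here is an approximate port. It is a sketch under stated assumptions, not a drop-in replacement: it is built on `html.parser.HTMLParser` rather than `SGMLParser`, the handler names are therefore `handle_starttag` and `handle_endtag`, and it always closes entity references with a semicolon instead of consulting an entity table. The class name `RoundTripProcessor` is made up.

```python
from html.parser import HTMLParser

class RoundTripProcessor(HTMLParser):
    def __init__(self):
        # Keep character and entity references as separate events
        super().__init__(convert_charrefs=False)

    def reset(self):
        self.pieces = []
        super().reset()

    def handle_starttag(self, tag, attrs):
        # Attribute values are always re-quoted with double quotes
        strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)

processor = RoundTripProcessor()
processor.feed("<!DOCTYPE html><p class='note'>Caf&#233; &copy; 2004<!-- hi --></p>")
processor.close()
rebuilt = processor.output()
print(rebuilt)
```

The output matches the input except for the quotes around `note`, which change from single to double, the same side effect described in the comments of `unknown_starttag`.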


	The HTML specification requires that all non-HTML content (like client-side JavaScript) be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). `BaseHTMLProcessor` is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs, `SGMLParser` may incorrectly think that it has found tags and attributes. `SGMLParser` always converts tags and attribute names to lowercase, which may break the script, and `BaseHTMLProcessor` always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script within HTML comments.

Example 8.9. `BaseHTMLProcessor` output

    def output(self):               
        """Return processed HTML as a single string"""
        return "".join(self.pieces)

	This is the one method in `BaseHTMLProcessor` that is never called by the ancestor `SGMLParser`. Since the other handler methods store their reconstructed HTML in `self.pieces`, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it.
	If you prefer, you could use the `join` method of the `string` module instead: `string.join(self.pieces, "")`

8.5. `locals` and `globals`

Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.

Remember locals? You first saw it here:

    def unknown_starttag(self, tag, attrs):
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

No, wait, you can't learn about locals yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention.

Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.

At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which keeps track of the function's variables, including function arguments and locally defined variables. Each module has its own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any module, which holds built-in functions and exceptions.

When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:

local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching.
global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching.
built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable.

If Python doesn't find x in any of these namespaces, it gives up and raises a NameError with the message There is no variable named 'x', which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn't appreciate how much work Python was doing before giving you that error.
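
The search order can be watched in action with a few throwaway functions. This is a small sketch in Python 3 syntax; every name in it is made up for the illustration, and the exact wording of the `NameError` message depends on the Python version you run.

```python
x = "module-level x"

def reads_local():
    x = "local x"        # found in the local namespace; the search stops
    return x

def reads_global():
    return x             # no local x, so the module's x is used

def reads_builtin():
    return len           # neither local nor global: found among the built-ins

def reads_missing():
    return no_such_name  # hypothetical name that is defined nowhere

print(reads_local())     # local x
print(reads_global())    # module-level x
print(reads_builtin() is len)

try:
    reads_missing()
except NameError as error:
    missing_message = str(error)
    print("NameError:", missing_message)
```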


	Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or `lambda` function, Python will search for that variable in the current (nested or `lambda`) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or `lambda`) function's namespace, then in the parent function's namespace, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2: from __future__ import nested_scopes
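
A short sketch of nested scopes, assuming Python 2.2 or later (it also runs unchanged on Python 3). The function names are made up; the point is that `greet` finds `greeting` in the namespace of its parent function, not in the module.

```python
def make_greeter():
    greeting = "hello"                      # local to make_greeter

    def greet(name):
        # greeting is not local to greet and not a module-level name;
        # with nested scopes it is found in the parent function's namespace
        return "%s, %s" % (greeting, name)

    return greet

greet = make_greeter()
print(greet("world"))   # hello, world
```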

Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function.

Example 8.10. Introducing `locals`

>>> def foo(arg): 
...     x = 1
...     print locals()
...     
>>> foo(7)        
{'arg': 7, 'x': 1}
>>> foo('bar')    
{'arg': 'bar', 'x': 1}

	The function `foo` has two variables in its local namespace: `arg`, whose value is passed in to the function, and `x`, which is defined within the function.
	`locals` returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values of the dictionary are the actual values of the variables. So calling `foo` with `7` prints the dictionary containing the function's two local variables: `arg` (`7`) and `x` (`1`).
	Remember, Python has dynamic typing, so you could just as easily pass a string in for `arg`; the function (and the call to `locals`) would still work just as well. `locals` works with all variables of all datatypes.

What locals does for the local (function) namespace, globals does for the global (module) namespace. globals is more exciting, though, because a module's namespace is more exciting.^[3] Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes defined in the module. Plus, it includes anything that was imported into the module.

Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access any of its functions or attributes: module.function. But with from module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you access them directly without referencing the original module they came from. With the globals function, you can actually see this happen.

Example 8.11. Introducing `globals`

Look at the following block of code at the bottom of BaseHTMLProcessor.py:

if __name__ == "__main__":
    for k, v in globals().items():             
        print k, "=", v

Just so you don't get intimidated, remember that you've seen all this before. The globals function returns a dictionary, and you're iterating through the dictionary using the items method and multi-variable assignment. The only thing new here is the globals function.

Now running the script from the command line gives this output (note that your output may be slightly different, depending on your platform and where you installed Python):

c:\docbook\dip\py> python BaseHTMLProcessor.py

SGMLParser = sgmllib.SGMLParser                
htmlentitydefs = <module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'> 
BaseHTMLProcessor = __main__.BaseHTMLProcessor 
__name__ = __main__          
... rest of output omitted for brevity...

	`SGMLParser` was imported from `sgmllib`, using `from module import`. That means that it was imported directly into the module's namespace, and here it is.
	Contrast this with `htmlentitydefs`, which was imported using `import`. That means that the `htmlentitydefs` module itself is in the namespace, but the `entitydefs` variable defined within `htmlentitydefs` is not.
	This module only defines one class, `BaseHTMLProcessor`, and here it is. Note that the value here is the class itself, not a specific instance of the class.
	Remember the `if __name__` trick? When running a module (as opposed to importing it from another module), the built-in `__name__` attribute is a special value, `__main__`. Since you ran this module as a script from the command line, `__name__` is `__main__`, which is why the little test code to print the `globals` got executed.


	Using the `locals` and `globals` functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors the functionality of the `getattr` function, which allows you to access arbitrary functions dynamically by providing the function name as a string.
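
This is the trick that `translate` in `dialect.py` uses to pick a parser class from a string. A stripped-down sketch follows, in Python 3 syntax; the two empty classes and the `make_dialect` function are placeholders invented for the illustration, and the code assumes it runs at module level so that the classes land in the module's global namespace.

```python
class ChefDialect:
    label = "Swedish Chef"

class FuddDialect:
    label = "Elmer Fudd"

def make_dialect(dialect_name):
    # Build the class name as a string, then look it up in the
    # module's global namespace, which is just a dictionary.
    class_name = "%sDialect" % dialect_name.capitalize()
    dialect_class = globals()[class_name]
    return dialect_class()

dialect = make_dialect("fudd")
print(dialect.label)              # Elmer Fudd

# getattr does the same kind of lookup, but on an object's attributes
print(getattr(dialect, "label"))  # Elmer Fudd
```

Asking for a name that is not in the dictionary, such as `make_dialect("klingon")`, raises a `KeyError` rather than a `NameError`, because the lookup is an ordinary dictionary access.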

There is one other important difference between the locals and globals functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning it.

Example 8.12. `locals` is read-only, `globals` is not

def foo(arg):
    x = 1
    print locals()    
    locals()["x"] = 2 
    print "x=",x      

z = 7
print "z=",z
foo(3)
globals()["z"] = 8    
print "z=",z

	Since `foo` is called with `3`, this will print `{'arg': 3, 'x': 1}`. This should not be a surprise.
	`locals` is a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this would change the value of the local variable `x` to `2`, but it doesn't. `locals` does not actually return the local namespace, it returns a copy. So changing it does nothing to the value of the variables in the local namespace.
	This prints `x= 1`, not `x= 2`.
	After being burned by `locals`, you might think that this wouldn't change the value of `z`, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), `globals` returns the actual global namespace, not a copy: the exact opposite behavior of `locals`. So any changes to the dictionary returned by `globals` directly affect your global variables.
	This prints `z= 8`, not `z= 7`.

8.6. Dictionary-based string formatting

Why did you learn about locals and globals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're constantly switching between reading the string and reading the tuple of values.

There is an alternative form of string formatting that uses dictionaries instead of tuples of values.

Example 8.13. Introducing dictionary-based string formatting

>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
>>> "%(pwd)s" % params
'secret'
>>> "%(pwd)s is not a good password for %(uid)s" % params 
'secret is not a good password for sa'
>>> "%(database)s of mind, %(database)s of body" % params 
'master of mind, master of body'

	Instead of a tuple of explicit values, this form of string formatting uses a dictionary, `params`. And instead of a simple `%s` marker in the string, the marker contains a name in parentheses. This name is used as a key in the `params` dictionary, and the corresponding value, `secret`, is substituted in place of the `%(pwd)s` marker.
	Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the formatting will fail with a `KeyError`.
	You can even specify the same key twice; each occurrence will be replaced with the same value.
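The `KeyError` case is easy to try for yourself. The following is a small sketch, not code from the book's examples: the `describe` helper is hypothetical, and it simply reports the missing key instead of letting the exception propagate.

```python
params = {"server": "mpilgrim", "uid": "sa"}

def describe(template, values):
    # Attempt the substitution; if the template names a key that the
    # dictionary lacks, report that key instead of failing
    try:
        return template % values
    except KeyError as missing:
        return "missing key: %s" % missing.args[0]

ok = describe("%(uid)s@%(server)s", params)    # both keys exist
bad = describe("%(uid)s:%(pwd)s", params)      # there is no "pwd" key
```

Here `ok` is `'sa@mpilgrim'` and `bad` is `'missing key: pwd'`. Extra keys in the dictionary are never a problem; only keys named in the string must exist.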

So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys and values simply to do string formatting in the next line; it's really most useful when you happen to have a dictionary of meaningful keys and values already. Like locals.

Example 8.14. Dictionary-based string formatting in `BaseHTMLProcessor.py`

    def handle_comment(self, text):        
        self.pieces.append("<!--%(text)s-->" % locals())

Using the built-in locals function is the most common use of dictionary-based string formatting. It means that you can use the names of local variables within your string (in this case, text, which was passed to the class method as an argument) and each named variable will be replaced by its value. If text is 'Begin page footer', the string formatting "<!--%(text)s-->" % locals() will resolve to the string '<!--Begin page footer-->'.

Example 8.15. More dictionary-based string formatting

    def unknown_starttag(self, tag, attrs):
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) 
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

When this method is called, attrs is a list of key/value tuples, just like the items of a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down:

Suppose attrs is [('href', 'index.html'), ('title', 'Go to home page')].
In the first round of the list comprehension, key will get 'href', and value will get 'index.html'.
The string formatting ' %s="%s"' % (key, value) will resolve to ' href="index.html"'. This string becomes the first element of the list comprehension's return value.
In the second round, key will get 'title', and value will get 'Go to home page'.
The string formatting will resolve to ' title="Go to home page"'.
The list comprehension returns a list of these two resolved strings, and joining both elements of this list together gives strattrs the value ' href="index.html" title="Go to home page"'.

Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is 'a', the final result would be '<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces.
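If you want to check that walkthrough for yourself, the same two lines can be lifted out of the class into a plain function. This is only a sketch: `format_starttag` is a made-up name, and it returns the string instead of appending it to `self.pieces`.

```python
def format_starttag(tag, attrs):
    # attrs is a list of (key, value) tuples, as in the walkthrough above
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # locals() now contains tag, attrs, and strattrs, keyed by name
    return "<%(tag)s%(strattrs)s>" % locals()

link = format_starttag("a", [("href", "index.html"),
                             ("title", "Go to home page")])
plain = format_starttag("br", [])   # no attributes: strattrs is ''
```

`link` comes out as `'<a href="index.html" title="Go to home page">'`, and `plain` is simply `'<br>'`, since joining an empty list gives an empty string.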


	Using dictionary-based string formatting with `locals` is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a slight performance hit in making the call to `locals`, since `locals` builds a copy of the local namespace.

8.7. Quoting attribute values

A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”^[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor.

BaseHTMLProcessor consumes HTML (since it's descended from SGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes at all. It is this last side effect that you can take advantage of.

Example 8.16. Quoting attribute values

>>> htmlSource = """        
...     <html>
...     <head>
...     <title>Test page</title>
...     </head>
...     <body>
...     <ul>
...     <li><a href=index.html>Home</a></li>
...     <li><a href=toc.html>Table of contents</a></li>
...     <li><a href=history.html>Revision history</a></li>
...     </body>
...     </html>
...     """
>>> from BaseHTMLProcessor import BaseHTMLProcessor
>>> parser = BaseHTMLProcessor()
>>> parser.feed(htmlSource) 
>>> print parser.output()   
<html>
<head>
<title>Test page</title>
</head>
<body>
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="toc.html">Table of contents</a></li>
<li><a href="history.html">Revision history</a></li>
</body>
</html>

	Note that the attribute values of the `href` attributes in the `<a>` tags are not properly quoted. (Also note that you're using triple quotes for something other than a `docstring`. And directly in the IDE, no less. They're very useful.)
	Feed the parser.
	Using the `output` function defined in `BaseHTMLProcessor`, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think about how much has actually happened here: `SGMLParser` parsed the entire HTML document, breaking it down into tags, refs, data, and so forth; `BaseHTMLProcessor` used those elements to reconstruct pieces of HTML (which are still stored in `parser.pieces`, if you want to see them); finally, you called `parser.output`, which joined all the pieces of HTML into one string.
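`sgmllib` was removed in Python 3, so `BaseHTMLProcessor` will not import there as written. As a rough sketch of the same idea on a current interpreter, the class below uses the standard `html.parser` module instead. To be clear, this is a swapped-in technique and not the book's `BaseHTMLProcessor`: the class name `AttributeQuoter` is invented, and it deliberately ignores comments, doctypes, and re-escaping of special characters.

```python
from html.parser import HTMLParser

class AttributeQuoter(HTMLParser):
    """Rebuilds HTML with every attribute value in double quotes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; value is None for
        # attributes written without a value, such as <option selected>
        strattrs = "".join(
            [' %s' % key if value is None else ' %s="%s"' % (key, value)
             for key, value in attrs])
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, text):
        self.pieces.append(text)

    def output(self):
        return "".join(self.pieces)

parser = AttributeQuoter()
parser.feed('<ul><li><a href=index.html>Home</a></li></ul>')
parser.close()
quoted = parser.output()
```

The shape is the same as before: the parser breaks the document into pieces, each handler reconstructs a piece of HTML, and `output` joins them into one string.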

8.8. Introducing `dialect.py`

Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered.

To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre.

Example 8.17. Handling specific tags

    def start_pre(self, attrs):             
        self.verbatim += 1
        self.unknown_starttag("pre", attrs) 

    def end_pre(self):    
        self.unknown_endtag("pre")          
        self.verbatim -= 1

	`start_pre` is called every time `SGMLParser` finds a `<pre>` tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, `attrs`, which contains the attributes of the tag (if any). `attrs` is a list of key/value tuples, just like `unknown_starttag` takes.
	In the `reset` method, you initialize a data attribute that serves as a counter for `<pre>` tags. Every time you hit a `<pre>` tag, you increment the counter; every time you hit a `</pre>` tag, you'll decrement the counter. (You could just use this as a flag and set it to `1` and reset it to `0`, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested `<pre>` tags.) In a minute, you'll see how this counter is put to good use.
	That's it, that's the only special processing you do for `<pre>` tags. Now you pass the list of attributes along to `unknown_starttag` so it can do the default processing.
	`end_pre` is called every time `SGMLParser` finds a `</pre>` tag. Since end tags can not contain attributes, the method takes no parameters.
	First, you want to do the default processing, just like any other end tag.
	Second, you decrement your counter to signal that this `<pre>` block has been closed.

At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding.

Example 8.18. `SGMLParser`

    def finish_starttag(self, tag, attrs):               
        try:        
            method = getattr(self, 'start_' + tag)       
        except AttributeError:         
            try:    
                method = getattr(self, 'do_' + tag)      
            except AttributeError:    
                self.unknown_starttag(tag, attrs)        
                return -1             
            else:   
                self.handle_starttag(tag, method, attrs) 
                return 0              
        else:       
            self.stack.append(tag)    
            self.handle_starttag(tag, method, attrs)    
            return 1 

    def handle_starttag(self, tag, method, attrs):      
        method(attrs)

	At this point, `SGMLParser` has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a specific handler method for this tag, or whether you should fall back on the default method (`unknown_starttag`).
	The “magic” of `SGMLParser` is nothing more than your old friend, `getattr`. What you may not have realized before is that `getattr` will find methods defined in descendants of an object as well as the object itself. Here the object is `self`, the current instance. So if `tag` is `'pre'`, this call to `getattr` will look for a `start_pre` method on the current instance, which is an instance of the `Dialectizer` class.
	`getattr` raises an `AttributeError` if the method it's looking for doesn't exist in the object (or any of its descendants), but that's okay, because you wrapped the call to `getattr` inside a `try...except` block and explicitly caught the `AttributeError`.
	Since you didn't find a `start_xxx` method, you'll also look for a `do_xxx` method before giving up. This alternate naming scheme is generally used for standalone tags, like `<br>`, which have no corresponding end tag. But you can use either naming scheme; as you can see, `SGMLParser` tries both for every tag. (You shouldn't define both a `start_xxx` and `do_xxx` handler method for the same tag, though; only the `start_xxx` method will get called.)
	Another `AttributeError`, which means that the call to `getattr` failed with `do_xxx`. Since you found neither a `start_xxx` nor a `do_xxx` method for this tag, you catch the exception and fall back on the default method, `unknown_starttag`.
	Remember, `try...except` blocks can have an `else` clause, which is called if no exception is raised during the `try...except` block. Logically, that means that you did find a `do_xxx` method for this tag, so you're going to call it.
	By the way, don't worry about these different return values; in theory they mean something, but they're never actually used. Don't worry about the `self.stack.append(tag)` either; `SGMLParser` keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.
	`start_xxx` and `do_xxx` methods are not called directly; the tag, method, and attributes are passed to this function, `handle_starttag`, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call the method (`start_xxx` or `do_xxx`) with the list of attributes. Remember, `method` is a function, returned from `getattr`, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named, or where it's defined; the only thing you need to know about the function is that it is called with one argument, `attrs`.
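The dispatch logic is small enough to reproduce in miniature. The sketch below is not `sgmllib`'s code; `TinyParser` and `PreAwareParser` are hypothetical classes, and for brevity the lookup uses the optional third argument of `getattr` (a default value) in place of the nested `try...except` blocks.

```python
class TinyParser:
    """Hypothetical stand-in for SGMLParser's dispatch logic."""

    def __init__(self):
        self.calls = []

    def finish_starttag(self, tag, attrs):
        # Look for start_xxx first, then do_xxx; the third argument to
        # getattr supplies a default instead of raising AttributeError
        method = getattr(self, "start_" + tag, None)
        if method is None:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            method(attrs)   # bound method, called with the attribute list

    def unknown_starttag(self, tag, attrs):
        self.calls.append("unknown_starttag(%s)" % tag)

class PreAwareParser(TinyParser):
    # Defined only in the descendant, yet found by getattr(self, ...) above
    def start_pre(self, attrs):
        self.calls.append("start_pre")

    def do_br(self, attrs):
        self.calls.append("do_br")

parser = PreAwareParser()
for tag in ["pre", "br", "table"]:
    parser.finish_starttag(tag, [])
```

After the loop, `parser.calls` is `['start_pre', 'do_br', 'unknown_starttag(table)']`: the ancestor's dispatch method found the two handlers defined in the descendant and fell back on the default for the tag that had none.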

Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, you need to override the handle_data method.

Example 8.19. Overriding the `handle_data` method

    def handle_data(self, text):     
        self.pieces.append(self.verbatim and text or self.process(text))

handle_data is called with only one argument, the text to process.

In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using the and-or trick.
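Here is the and-or trick pulled out on its own, next to the conditional expression that Python 2.5 and later offer for the same job. This is a sketch with stand-ins: `process` just uppercases its argument, where the real method applies the dialect substitutions.

```python
def process(text):
    # Placeholder substitution, standing in for the Dialectizer's real ones
    return text.upper()

def handle_data_and_or(verbatim, text):
    # The and-or trick, as used in Example 8.19
    return verbatim and text or process(text)

def handle_data_conditional(verbatim, text):
    # The same decision written as a conditional expression
    return text if verbatim else process(text)

inside_pre = handle_data_and_or(1, "leave me alone")
outside_pre = handle_data_and_or(0, "change me")
```

The two functions agree on these inputs. The one place they part ways is when `text` is an empty string inside a `<pre>` block: the and-or version sees a false value and falls through to `process(text)`. That is harmless as long as processing an empty string returns an empty string, but it is worth knowing about.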

You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough for one chapter.

8.9. Putting it all together

It's time to put everything you've learned so far to good use. I hope you were paying attention.

Example 8.20. The `translate` function, part 1

def translate(url, dialectName="chef"): 
    import urllib     
    sock = urllib.urlopen(url)          
    htmlSource = sock.read()           
    sock.close()

	The `translate` function has an optional argument `dialectName`, which is a string that specifies the dialect you'll be using. You'll see how this is used in a minute.
	Hey, wait a minute, there's an `import` statement in this function! That's perfectly legal in Python. You're used to seeing `import` statements at the top of a program, which means that the imported module is available anywhere in the program. But you can also import modules within a function, which means that the imported module is only available within the function. If you have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you'll appreciate this.)
	Now you get the source of the given URL.
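You can confirm that a function-level `import` creates a local name with the `locals` function you just learned. The two functions below are invented for this sketch; they use the standard `math` module only because it is always available.

```python
def circle_area(radius):
    import math                      # bound as a local name in circle_area
    return math.pi * radius ** 2

def local_names():
    import math
    return sorted(locals().keys())   # the imported module is a local variable

area = circle_area(1)
names = local_names()
```

`names` is `['math']`: inside the function, the module is just another local variable, and it goes away with the rest of the local namespace when the function returns.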

Example 8.21. The `translate` function, part 2: curiouser and curiouser

    parserName = "%sDialectizer" % dialectName.capitalize() 
    parserClass = globals()[parserName]   
    parser = parserClass()

	`capitalize` is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else to lowercase. Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If `dialectName` is the string `'chef'`, `parserName` will be the string `'ChefDialectizer'`.
	You have the name of a class as a string (`parserName`), and you have the global namespace as a dictionary (`globals()`). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If `parserName` is the string `'ChefDialectizer'`, `parserClass` will be the class `ChefDialectizer`.
	Finally, you have a class object (`parserClass`), and you want an instance of the class. Well, you already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable like a function, and out pops an instance of the class. If `parserClass` is the class `ChefDialectizer`, `parser` will be an instance of the class `ChefDialectizer`.
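Those three lines work with any classes that follow the naming scheme, which you can see with a pair of empty stand-ins. In this sketch the two classes have no behavior at all, and `make_parser` is a made-up wrapper around the same three lines.

```python
class ChefDialectizer:
    pass    # empty stand-in for the real class in dialect.py

class FuddDialectizer:
    pass    # empty stand-in

def make_parser(dialectName="chef"):
    # Build the class name, look the class up by name, then instantiate it
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    return parserClass()

chef = make_parser()
fudd = make_parser("FUDD")   # capitalize() turns "FUDD" into "Fudd"
```

`chef` is an instance of `ChefDialectizer` and `fudd` is an instance of `FuddDialectizer`. Passing a dialect name with no matching class raises a `KeyError` from the `globals()` lookup.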

Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there's no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing 'foo' as the dialectName.

Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You've already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file.

Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page.

Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go.

Example 8.22. The `translate` function, part 3

    parser.feed(htmlSource) 
    parser.close()          
    return parser.output()

	After all that imagining, this is going to seem pretty boring, but the `feed` function is what does the entire transformation. You had the entire HTML source in a single string, so you only had to call `feed` once. However, you can call `feed` as often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser. The result would be the same.
	Because `feed` maintains an internal buffer, you should always call the parser's `close` method when you're done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing the last few bytes.
	Remember, `output` is the function you defined on `BaseHTMLProcessor` that joins all the pieces of output you've buffered and returns them in a single string.
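The claim that feeding in pieces gives the same result is easy to test. Since `sgmllib` is not available in Python 3, this sketch uses the standard `html.parser` module, which has the same `feed` and `close` methods; the `TagLister` class and the `list_tags` function are invented for the illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Records the name of every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def list_tags(htmlSource, chunkSize=None):
    parser = TagLister()
    if chunkSize is None:
        parser.feed(htmlSource)                        # everything in one call
    else:
        for i in range(0, len(htmlSource), chunkSize):
            parser.feed(htmlSource[i:i + chunkSize])   # a few characters at a time
    parser.close()                                     # flush the internal buffer
    return parser.tags

htmlSource = '<html><body><a href="index.html">Home</a><br></body></html>'
all_at_once = list_tags(htmlSource)
in_chunks = list_tags(htmlSource, chunkSize=5)
```

Both calls find the same tags in the same order, even though a chunk size of 5 splits most of the tags across two or more calls to `feed`. The parser holds on to an incomplete tag until the rest of it arrives.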

And just like that, you've “translated” a web page, given nothing but a URL and the name of a dialect.

8.10. Summary

Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways.

parsing the HTML looking for something specific
aggregating the results, like the URL lister
altering the structure along the way, like the attribute quoter
transforming the HTML into something else by manipulating the text while leaving the tags alone, like the Dialectizer

Along with these examples, you should be comfortable doing all of the following things:

Using locals() and globals() to access namespaces
Formatting strings using dictionary-based substitutions

^[1]The technical term for a parser like SGMLParser is a consumer: it consumes HTML and breaks it down. Presumably, the name feed was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that's just me. In any event, it's an interesting mental image.

^[2]The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list just adds the element and updates the index. Since strings can not be changed after they are created, code like s = s + newpiece will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n²).

^[3]I don't get out much.

^[4]All right, it's not that common a question. It's not up there with “What editor should I use to write Python code?” (answer: Emacs) or “Is Python better or worse than Perl?” (answer: “Perl is worse than Python because people wanted it worse.” -Larry Wall, 10/14/1998) But questions about HTML processing pop up in one form or another about once a month, and among those questions, this is a popular one.

Chapter 9. XML Processing

9.1. Diving in

These next two chapters are about XML processing in Python. It would be helpful if you already knew what an XML document looks like, that it's made up of structured tags to form a hierarchy of elements, and so on. If this doesn't make sense to you, there are many XML tutorials that can explain the basics.

If you're not particularly interested in XML, you should still read these chapters, which cover important topics like Python packages, Unicode, command line arguments, and how to use getattr for method dispatching.

Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer science.

There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM.
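For a first taste of the DOM approach, here is a minimal sketch that parses a tiny document from a string and pokes at the resulting tree. The XML is made up for this illustration; it only borrows the `<ref>` and `<p>` tag names from the grammar files used later.

```python
from xml.dom import minidom

# Parse the whole document at once into a tree of node objects
xmldoc = minidom.parseString(
    '<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>')

root = xmldoc.documentElement                   # the <grammar> element
ref = root.getElementsByTagName("ref")[0]       # first <ref> anywhere below it
ref_id = ref.attributes["id"].value             # the id attribute, as a string
choices = [p.firstChild.data                    # text inside each <p>
           for p in ref.getElementsByTagName("p")]
```

Nothing was called back while parsing; the entire document is sitting in memory as linked objects, and you navigate it at your leisure. Here `root.tagName` is `'grammar'`, `ref_id` is `'bit'`, and `choices` holds the strings `'0'` and `'1'`.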

The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry yet if you don't understand what that means; you'll examine both the program's input and its output in more depth throughout these next two chapters.

Example 9.1. `kgp.py`

If you have not already done so, you can download this and other examples used in this book.

"""Kant Generator for Python

Generates mock philosophy based on a context-free grammar

Usage: python kgp.py [options] [source]

Options:
  -g ..., --grammar=...   use specified grammar file or URL
  -h, --help              show this help
  -d                      show debugging information while parsing

Examples:
  kgp.py                  generates several paragraphs of Kantian philosophy
  kgp.py -g husserl.xml   generates several paragraphs of Husserl
  kgp.py "<xref id='paragraph'/>"  generates a paragraph of Kant
  kgp.py template.xml     reads from template.xml to decide what to generate
"""
from xml.dom import minidom
import random
import toolbox
import sys
import getopt

_debug = 0

class NoSourceError(Exception): pass

class KantGenerator:
    """generates mock philosophy based on a context-free grammar"""

    def __init__(self, grammar, source=None):
        self.loadGrammar(grammar)
        self.loadSource(source and source or self.getDefaultSource())
        self.refresh()

    def _load(self, source):
        """load XML input source, return parsed XML document

        - a URL of a remote XML file ("http://diveintopython3.org/kant.xml")
        - a filename of a local XML file ("~/diveintopython3/common/py/kant.xml")
        - standard input ("-")
        - the actual XML document, as a string
        """
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc

    def loadGrammar(self, grammar):       
        """load context-free grammar"""   
        self.grammar = self._load(grammar)
        self.refs = {}  
        for ref in self.grammar.getElementsByTagName("ref"):
            self.refs[ref.attributes["id"].value] = ref     

    def loadSource(self, source):
        """load source"""
        self.source = self._load(source)

    def getDefaultSource(self):
        """guess default source of the current grammar
        
        The default source will be one of the <ref>s that is not
        cross-referenced.  This sounds complicated but it's not.
        Example: The default source for kant.xml is
        "<xref id='section'/>", because 'section' is the one <ref>
        that is not <xref>'d anywhere in the grammar.
        In most grammars, the default source will produce the
        longest (and most interesting) output.
        """
        xrefs = {}
        for xref in self.grammar.getElementsByTagName("xref"):
            xrefs[xref.attributes["id"].value] = 1
        xrefs = xrefs.keys()
        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
        if not standaloneXrefs:
            raise NoSourceError, "can't guess source, and no source specified"
        return '<xref id="%s"/>' % random.choice(standaloneXrefs)
        
    def reset(self):
        """reset parser"""
        self.pieces = []
        self.capitalizeNextWord = 0

    def refresh(self):
        """reset output buffer, re-parse entire source file, and return output
        
        Since parsing involves a good deal of randomness, this is an
        easy way to get new output without having to reload a grammar file
        each time.
        """
        self.reset()
        self.parse(self.source)
        return self.output()

    def output(self):
        """output generated text"""
        return "".join(self.pieces)

    def randomChildElement(self, node):
        """choose a random child element of a node
        
        This is a utility method used by do_xref and do_choice.
        """
        choices = [e for e in node.childNodes
                   if e.nodeType == e.ELEMENT_NODE]
        chosen = random.choice(choices)            
        if _debug:               
            sys.stderr.write('%s available choices: %s\n' % \
                (len(choices), [e.toxml() for e in choices]))
            sys.stderr.write('Chosen: %s\n' % chosen.toxml())
        return chosen            

    def parse(self, node):         
        """parse a single XML node
        
        A parsed XML document (from minidom.parse) is a tree of nodes
        of various types.  Each node is represented by an instance of the
        corresponding Python class (Element for a tag, Text for
        text data, Document for the top-level document).  The following
        statement constructs the name of a class method based on the type
        of node we're parsing ("parse_Element" for an Element node,
        "parse_Text" for a Text node, etc.) and then calls the method.
        """
        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
        parseMethod(node)

    def parse_Document(self, node):
        """parse the document node
        
        The document node by itself isn't interesting (to us), but
        its only child, node.documentElement, is: it's the root node
        of the grammar.
        """
        self.parse(node.documentElement)

    def parse_Text(self, node):    
        """parse a text node
        
        The text of a text node is usually added to the output buffer
        verbatim.  The one exception is that <p class='sentence'> sets
        a flag to capitalize the first letter of the next word.  If
        that flag is set, we capitalize the text and reset the flag.
        """
        text = node.data
        if self.capitalizeNextWord:
            self.pieces.append(text[0].upper())
            self.pieces.append(text[1:])
            self.capitalizeNextWord = 0
        else:
            self.pieces.append(text)

    def parse_Element(self, node): 
        """parse an element
        
        An XML element corresponds to an actual tag in the source:
        <xref id='...'>, <p chance='...'>, <choice>, etc.
        Each element type is handled in its own method.  Like we did in
        parse(), we construct a method name based on the name of the
        element ("do_xref" for an <xref> tag, etc.) and
        call the method.
        """
        handlerMethod = getattr(self, "do_%s" % node.tagName)
        handlerMethod(node)

    def parse_Comment(self, node):
        """parse a comment
        
        The grammar can contain XML comments, but we ignore them
        """
        pass
    
    def do_xref(self, node):
        """handle <xref id='...'> tag
        
        An <xref id='...'> tag is a cross-reference to a <ref id='...'>
        tag.  <xref id='sentence'/> evaluates to a randomly chosen child of
        <ref id='sentence'>.
        """
        id = node.attributes["id"].value
        self.parse(self.randomChildElement(self.refs[id]))

    def do_p(self, node):
        """handle <p> tag
        
        The <p> tag is the core of the grammar.  It can contain almost
        anything: freeform text, <choice> tags, <xref> tags, even other
        <p> tags.  If a "class='sentence'" attribute is found, a flag
        is set and the next word will be capitalized.  If a "chance='X'"
        attribute is found, there is an X% chance that the tag will be
        evaluated (and therefore a (100-X)% chance that it will be
        completely ignored)
        """
        keys = node.attributes.keys()
        if "class" in keys:
            if node.attributes["class"].value == "sentence":
                self.capitalizeNextWord = 1
        if "chance" in keys:
            chance = int(node.attributes["chance"].value)
            doit = (chance > random.randrange(100))
        else:
            doit = 1
        if doit:
            for child in node.childNodes: self.parse(child)

    def do_choice(self, node):
        """handle <choice> tag
        
        A <choice> tag contains one or more <p> tags.  One <p> tag
        is chosen at random and evaluated; the rest are ignored.
        """
        self.parse(self.randomChildElement(node))

def usage():
    print __doc__

def main(argv):       
    grammar = "kant.xml"                
    try:              
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:          
        usage()       
        sys.exit(2)   
    for opt, arg in opts:               
        if opt in ("-h", "--help"):     
            usage()   
            sys.exit()
        elif opt == '-d':               
            global _debug               
            _debug = 1
        elif opt in ("-g", "--grammar"):
            grammar = arg               
    
    source = "".join(args)              

    k = KantGenerator(grammar, source)
    print k.output()

if __name__ == "__main__":
    main(sys.argv[1:])

Example 9.2. `toolbox.py`

"""Miscellaneous utility functions"""

def openAnything(source):            
    """URI, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.
    
    Examples:
    >>> from xml.dom import minidom
    >>> sock = openAnything("http://localhost/kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    """
    if hasattr(source, "read"):
        return source

    if source == '-':
        import sys
        return sys.stdin

    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib       
    try:                
        return urllib.urlopen(source)     
    except (IOError, OSError):            
        pass            
    
    # try to open with native open function (if source is pathname)
    try:                
        return open(source)               
    except (IOError, OSError):            
        pass            
    
    # treat source as string
    import StringIO     
    return StringIO.StringIO(str(source))
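
For readers on a newer interpreter, here is a minimal sketch of the same fallback chain in Python 3 syntax. It is an approximation, not the book's code: the name `open_anything` and the choice of `urllib.request` and `io.StringIO` as stand-ins for `urllib` and `StringIO` are assumptions made for this illustration.

```python
import io
import sys
import urllib.request


def open_anything(source):
    """Return a readable file-like object for a stream, URL, path, or string."""
    # Already a stream (anything with a read method): hand it back unchanged.
    if hasattr(source, "read"):
        return source

    # A lone dash conventionally means standard input.
    if source == "-":
        return sys.stdin

    # Try the source as a URL. urlopen raises ValueError for strings that
    # are not URLs at all, and URLError (an OSError) when the fetch fails.
    try:
        return urllib.request.urlopen(source)
    except (ValueError, OSError):
        pass

    # Try the source as a pathname on the local filesystem.
    try:
        return open(source)
    except (ValueError, OSError):
        pass

    # Fall back to treating the source as the data itself.
    return io.StringIO(str(source))


stream = open_anything("<ref id='bit'><p>0</p><p>1</p></ref>")
data = stream.read()
stream.close()
print(data)
```

One thing to keep in mind with this sketch: a URL yields a stream of bytes, while the file and string branches yield text, so a caller that cares about the difference has to check.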

Run the program kgp.py by itself, and it will parse the default XML-based grammar, in kant.xml, and print several paragraphs worth of philosophy in the style of Immanuel Kant.

Example 9.3. Sample output of `kgp.py`

[you@localhost kgp]$ python kgp.py
     As is shown in the writings of Hume, our a priori concepts, in
reference to ends, abstract from all content of knowledge; in the study
of space, the discipline of human reason, in accordance with the
principles of philosophy, is the clue to the discovery of the
Transcendental Deduction.  The transcendental aesthetic, in all
theoretical sciences, occupies part of the sphere of human reason
concerning the existence of our ideas in general; still, the
never-ending regress in the series of empirical conditions constitutes
the whole content for the transcendental unity of apperception.  What
we have alone been able to show is that, even as this relates to the
architectonic of human reason, the Ideal may not contradict itself, but
it is still possible that it may be in contradictions with the
employment of the pure employment of our hypothetical judgements, but
natural causes (and I assert that this is the case) prove the validity
of the discipline of pure reason.  As we have already seen, time (and
it is obvious that this is true) proves the validity of time, and the
architectonic of human reason, in the full sense of these terms,
abstracts from all content of knowledge.  I assert, in the case of the
discipline of practical reason, that the Antinomies are just as
necessary as natural causes, since knowledge of the phenomena is a
posteriori.
    The discipline of human reason, as I have elsewhere shown, is by
its very nature contradictory, but our ideas exclude the possibility of
the Antinomies.  We can deduce that, on the contrary, the pure
employment of philosophy, on the contrary, is by its very nature
contradictory, but our sense perceptions are a representation of, in
the case of space, metaphysics.  The thing in itself is a
representation of philosophy.  Applied logic is the clue to the
discovery of natural causes.  However, what we have alone been able to
show is that our ideas, in other words, should only be used as a canon
for the Ideal, because of our necessary ignorance of the conditions.

[...snip...]

This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although very verbose -- Kant wasn't what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing that Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent. But all of it is in the style of Immanuel Kant.

Let me repeat that this is much, much funnier if you are now or have ever been a philosophy major.

The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous example was derived from the grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be completely different.

Example 9.4. Simpler output from `kgp.py`

[you@localhost kgp]$ python kgp.py -g binary.xml
00101001
[you@localhost kgp]$ python kgp.py -g binary.xml
10110100

You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is that the grammar file defines the structure of the output, and the kgp.py program reads through the grammar and makes random decisions about which words to plug in where.
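
To make "reads through the grammar and makes random decisions" concrete, here is a stripped-down sketch of the idea, written in Python 3 syntax. It is not `kgp.py`: the helper names `random_child_element` and `expand` are made up for this illustration, and it handles only the `ref`, `p`, and `xref` tags that appear in `binary.xml`.

```python
import random
from xml.dom import minidom

GRAMMAR = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""


def random_child_element(node):
    """Pick one child element at random, skipping whitespace text nodes."""
    elements = [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]
    return random.choice(elements)


def expand(node, refs):
    """Return the text produced by one node of the grammar."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    if node.tagName == "xref":
        # A cross-reference expands to a random child of the matching <ref>.
        return expand(random_child_element(refs[node.getAttribute("id")]), refs)
    # Any other element (here, <p>) is the concatenation of its children.
    return "".join(expand(child, refs) for child in node.childNodes)


doc = minidom.parseString(GRAMMAR)
refs = {ref.getAttribute("id"): ref for ref in doc.getElementsByTagName("ref")}
byte = expand(random_child_element(refs["byte"]), refs)
print(byte)
```

Each run prints a different string of eight binary digits, because every `xref` to `bit` is resolved by an independent random choice.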

9.2. Packages

Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages.

Example 9.5. Loading an XML document (a sneak peek)

>>> from xml.dom import minidom 
>>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')

This is a syntax you haven't seen before. It looks almost like the from module import you know and love, but the "." gives it away as something above and beyond a simple import. In fact, xml is what is known as a package, dom is a nested package within xml, and minidom is a module within xml.dom.

That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still just .py files, like always, except that they're in a subdirectory instead of the main lib/ directory of your Python installation.

Example 9.6. File layout of a package

Python21/           root Python installation (home of the executable)
|
+--lib/             library directory (home of the standard library modules)
   |
   +-- xml/         xml package (really just a directory with other stuff in it)
       |
       +--sax/      xml.sax package (again, just a directory)
       |
       +--dom/      xml.dom package (contains minidom.py)
       |
       +--parsers/  xml.parsers package (used internally)

So when you say from xml.dom import minidom, Python figures out that that means “look in the xml directory for a dom directory, and look in that for the minidom module, and import it as minidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import specific classes or functions from a module contained within a package. You can also import the package itself as a module. The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.

Example 9.7. Packages are modules, too

>>> from xml.dom import minidom         
>>> minidom
<module 'xml.dom.minidom' from 'C:\Python21\lib\xml\dom\minidom.pyc'>
>>> minidom.Element
<class xml.dom.minidom.Element at 01095744>
>>> from xml.dom.minidom import Element 
>>> Element
<class xml.dom.minidom.Element at 01095744>
>>> minidom.Element
<class xml.dom.minidom.Element at 01095744>
>>> from xml import dom                 
>>> dom
<module 'xml.dom' from 'C:\Python21\lib\xml\dom\__init__.pyc'>
>>> import xml        
>>> xml
<module 'xml' from 'C:\Python21\lib\xml\__init__.pyc'>

	Here you're importing a module (`minidom`) from a nested package (`xml.dom`). The result is that `minidom` is imported into your namespace, and in order to reference classes within the `minidom` module (like `Element`), you need to preface them with the module name.
	Here you are importing a class (`Element`) from a module (`minidom`) from a nested package (`xml.dom`). The result is that `Element` is imported directly into your namespace. Note that this does not interfere with the previous import; the `Element` class can now be referenced in two ways (but it's all still the same class).
	Here you are importing the `dom` package (a nested package of `xml`) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even have its own attributes and methods, just like the modules you've seen before.
	Here you are importing the root level `xml` package as a module.

So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)? The answer is the magical __init__.py file. You see, packages are not simply directories; they are directories with a specific file, __init__.py, inside. This file defines the attributes and methods of the package. For instance, xml.dom contains a Node class, which is defined in xml/dom/__init__.py. When you import a package as a module (like dom from xml), you're really importing its __init__.py file.


	A package is a directory with the special `__init__.py` file in it. The `__init__.py` file defines the attributes and methods of the package. It doesn't need to define anything; it can be an empty file, but it has to exist. If `__init__.py` doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.
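
If you want to see this for yourself, the following sketch builds a tiny package in a temporary directory and imports it. It uses Python 3 syntax, and every name in it (`demo_pkg`, `shapes`, `VERSION`, `area`) is invented for the illustration.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk:
#   demo_pkg/__init__.py   defines an attribute of the package itself
#   demo_pkg/shapes.py     an ordinary module inside the package
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)

with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VERSION = '1.0'\n")

with open(os.path.join(pkg_dir, "shapes.py"), "w") as f:
    f.write("def area(width, height):\n    return width * height\n")

# Make the temporary directory importable.
sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg                # runs demo_pkg/__init__.py
from demo_pkg import shapes    # imports the module inside the package

print(demo_pkg.VERSION)
print(shapes.area(3, 4))
print(os.path.basename(demo_pkg.__file__))
```

The last line shows that the file behind the imported package is its `__init__.py`, which is why anything defined there shows up as an attribute of the package.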

So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an xml package with sax and dom packages inside, the authors could have chosen to put all the sax functionality in xmlsax.py and all the dom functionality in xmldom.py, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the XML package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different areas simultaneously).

If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good package architecture. It's one of the many things Python is good at, so take advantage of it.

9.3. Parsing XML

As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you.

Example 9.8. Loading an XML document (for real this time)

>>> from xml.dom import minidom      
>>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')  
>>> xmldoc         
<xml.dom.minidom.Document instance at 010BE87C>
>>> print xmldoc.toxml()             
<?xml version="1.0" ?>
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

	As you saw in the previous section, this imports the `minidom` module from the `xml.dom` package.
	Here is the one line of code that does all the work: `minidom.parse` takes one argument and returns a parsed representation of the XML document. The argument can be many things; in this case, it's simply a filename of an XML document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.) But you can also pass a file object, or even a file-like object. You'll take advantage of this flexibility later in this chapter.
	The object returned from `minidom.parse` is a `Document` object, a descendant of the `Node` class. This `Document` object is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed to `minidom.parse`.
	`toxml` is a method of the `Node` class (and is therefore available on the `Document` object you got from `minidom.parse`). `toxml` returns the XML that this `Node` represents, as a string, which you can then print. For the `Document` node, this is the entire XML document.
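
The point about file-like objects can be tried without any file on disk. The sketch below, in Python 3 syntax, feeds `minidom.parse` an in-memory stream and compares the result with `minidom.parseString`; the sample XML and the variable names are assumptions made for the illustration.

```python
import io
from xml.dom import minidom

XML_BYTES = b"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

# A file-like object: anything with a read() method will do.
stream = io.BytesIO(XML_BYTES)
xmldoc = minidom.parse(stream)
stream.close()

# parseString is the shortcut when the XML is already in memory.
same_doc = minidom.parseString(XML_BYTES)

root_name = xmldoc.documentElement.tagName
print(root_name)
print(xmldoc.toxml() == same_doc.toxml())
```

Both routes produce a `Document` whose root element is `grammar`, and both serialize to the same XML.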

Now that you have an XML document in memory, you can start traversing through it.

Example 9.9. Getting child nodes

>>> xmldoc.childNodes    
[<DOM Element: grammar at 17538908>]
>>> xmldoc.childNodes[0] 
<DOM Element: grammar at 17538908>
>>> xmldoc.firstChild    
<DOM Element: grammar at 17538908>

	Every `Node` has a `childNodes` attribute, which is a list of `Node` objects. A `Document` always has only one child node, the root element of the XML document (in this case, the `grammar` element).
	To get the first (and in this case, the only) child node, just use regular list syntax. Remember, there is nothing special going on here; this is just a regular Python list of regular Python objects.
	Since getting the first child node of a node is a useful and common activity, the `Node` class has a `firstChild` attribute, which is synonymous with `childNodes[0]`. (There is also a `lastChild` attribute, which is synonymous with `childNodes[-1]`.)

Example 9.10. `toxml` works on any node

>>> grammarNode = xmldoc.firstChild
>>> print grammarNode.toxml() 
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

Since the toxml method is defined in the Node class, it is available on any XML node, not just the Document element.

Example 9.11. Child nodes can be text

>>> grammarNode.childNodes
[<DOM Text node "\n">, <DOM Element: ref at 17533332>, \
<DOM Text node "\n">, <DOM Element: ref at 17549660>, <DOM Text node "\n">]
>>> print grammarNode.firstChild.toxml()    



>>> print grammarNode.childNodes[1].toxml() 
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> print grammarNode.childNodes[3].toxml() 
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
>>> print grammarNode.lastChild.toxml()

	Looking at the XML in `binary.xml`, you might think that the `grammar` has only two child nodes, the two `ref` elements. But you're missing something: the carriage returns! After the `'<grammar>'` and before the first `'<ref>'` is a carriage return, and this text counts as a child node of the `grammar` element. Similarly, there is a carriage return after each `'</ref>'`; these also count as child nodes. So `grammarNode.childNodes` is actually a list of 5 objects: 3 `Text` objects and 2 `Element` objects.
	The first child is a `Text` object representing the carriage return after the `'<grammar>'` tag and before the first `'<ref>'` tag.
	The second child is an `Element` object representing the first `ref` element.
	The fourth child is an `Element` object representing the second `ref` element.
	The last child is a `Text` object representing the carriage return after the `'</ref>'` end tag and before the `'</grammar>'` end tag.
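
Because whitespace shows up as `Text` nodes, code that only cares about elements usually filters on `nodeType` instead of counting positions. Here is a small sketch of that, in Python 3 syntax; the shortened grammar and the variable names are assumptions made for the illustration.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<grammar>\n'
    '<ref id="bit">\n  <p>0</p>\n  <p>1</p>\n</ref>\n'
    '<ref id="byte">\n  <p><xref id="bit"/></p>\n</ref>\n'
    '</grammar>'
)
grammar_node = doc.firstChild

# Text and Element children alternate: the line breaks count as nodes.
kinds = [child.nodeType for child in grammar_node.childNodes]

# Keep only the Element children, whatever whitespace surrounds them.
elements = [c for c in grammar_node.childNodes if c.nodeType == c.ELEMENT_NODE]
ref_ids = [e.getAttribute("id") for e in elements]

print(len(grammar_node.childNodes))
print(ref_ids)
```

The filter gives you the two `ref` elements directly, without having to remember that they sit at indexes 1 and 3.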

Example 9.12. Drilling down all the way to text

>>> grammarNode
<DOM Element: grammar at 19167148>
>>> refNode = grammarNode.childNodes[1] 
>>> refNode
<DOM Element: ref at 17987740>
>>> refNode.childNodes
[<DOM Text node "\n">, <DOM Text node "  ">, <DOM Element: p at 19315844>, \
<DOM Text node "\n">, <DOM Text node "  ">, \
<DOM Element: p at 19462036>, <DOM Text node "\n">]
>>> pNode = refNode.childNodes[2]
>>> pNode
<DOM Element: p at 19315844>
>>> print pNode.toxml()                 
<p>0</p>
>>> pNode.firstChild  
<DOM Text node "0">
>>> pNode.firstChild.data               
u'0'

	As you saw in the previous example, the first `ref` element is `grammarNode.childNodes[1]`, since childNodes[0] is a `Text` node for the carriage return.
	The `ref` element has its own set of child nodes, one for the carriage return, a separate one for the spaces, one for the `p` element, and so forth.
	You can even use the `toxml` method here, deeply nested within the document.
	The `p` element has only one child node (you can't tell that from this example, but look at `pNode.childNodes` if you don't believe me), and it is a `Text` node for the single character `'0'`.
	The `.data` attribute of a `Text` node gives you the actual string that the text node represents. But what is that `'u'` in front of the string? The answer to that deserves its own section.
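
The same drill-down can be written without hard-coded indexes, which matters because the position of the `p` element depends on how the parser splits the surrounding whitespace into `Text` nodes. This sketch uses Python 3 syntax, and its variable names are assumptions made for the illustration.

```python
from xml.dom import minidom

doc = minidom.parseString('<ref id="bit">\n  <p>0</p>\n  <p>1</p>\n</ref>')
ref_node = doc.documentElement

# Skip the whitespace Text nodes and keep the <p> elements.
p_nodes = [c for c in ref_node.childNodes if c.nodeType == c.ELEMENT_NODE]

# Each <p> has exactly one child, a Text node; .data is its string.
first_text = p_nodes[0].firstChild
values = [p.firstChild.data for p in p_nodes]

print(values)
```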

9.4. Unicode

Unicode is a system to represent characters from all the world's different languages. When Python parses an XML document, all data is stored in memory as unicode.

You'll get to all that in a minute, but first, some background.

Historical note. Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets. Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things. Then think about trying to store these documents in the same place (like in the same database table); you would need to store the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around. Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used escape codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac Greek mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve.

To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.^[5] Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number. Unicode data is never ambiguous.

Of course, there is still the matter of all these legacy encoding systems. 7-bit ASCII, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it (241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters for other languages with the remaining numbers, 256 through 65535.
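
You can check the numbers quoted above from an interactive session. The sketch below uses Python 3 syntax, where `ord` and `chr` convert between characters and their unicode numbers; the variable names are assumptions made for the illustration.

```python
# Code points quoted in the text, checked with ord() and chr().
print(ord("A"), ord("a"))
print(chr(241), chr(252))

# ASCII and ISO-8859-1 store these characters as a single byte
# whose value is the same number unicode assigns to the character.
ascii_ok = "A".encode("ascii") == bytes([65])
latin1_ok = "\xf1".encode("latin-1") == bytes([241])

# A character from another alphabet gets a number above 255.
cyrillic_pe = ord("\u041f")
print(ascii_ok, latin1_ok, cyrillic_pe)
```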

When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an XML document which explicitly specifies the encoding scheme.

And on that note, let's get back to Python.

Python has had unicode support throughout the language since version 2.0. The XML package uses unicode to store all parsed XML data, but you can use unicode anywhere.

Example 9.13. Introducing unicode

>>> s = u'Dive in'            
>>> s
u'Dive in'
>>> print s 
Dive in

To create a unicode string instead of a regular ASCII string, add the letter “u” before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as unicode.

When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference.

Example 9.14. Storing non-ASCII characters

>>> s = u'La Pe\xf1a'         
>>> print s 
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> print s.encode('latin-1') 
La Peña

	The real advantage of unicode, of course, is its ability to store non-ASCII characters, like the Spanish “`ñ`” (`n` with a tilde over it). The unicode character code for the tilde-n is `0xf1` in hexadecimal (241 in decimal), which you can type like this: `\xf1`.
	Remember I said that `print` attempts to convert a unicode string to ASCII so it can print it? Well, that's not going to work here, because your unicode string contains non-ASCII characters, so Python raises a `UnicodeError`.
	Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. `s` is a unicode string, but `print` can only print a regular string. To solve this problem, you call the `encode` method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme, which you pass as a parameter. In this case, you're using `latin-1` (also known as `iso-8859-1`), which includes the tilde-n (whereas the default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127).
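
The failed conversion and the explicit `encode` call can be reproduced as follows. This sketch uses Python 3 syntax, where every string is unicode and `encode` returns a `bytes` object, so the details differ slightly from the session above; the variable names are assumptions made for the illustration.

```python
s = "La Pe\xf1a"

# ASCII has no code for the tilde-n, so the conversion fails.
try:
    s.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# latin-1 does include it, as the single byte 0xf1.
latin1_bytes = s.encode("latin-1")

# Decoding with the same encoding gives back the original string.
round_trip = latin1_bytes.decode("latin-1")

print(ascii_failed, latin1_bytes, round_trip == s)
```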

Remember I said Python usually converted unicode to ASCII whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which you can customize.

Example 9.15. `sitecustomize.py`

# sitecustomize.py 
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('iso-8859-1')

	`sitecustomize.py` is a special script; Python will try to import it on startup, so any code in it will be run automatically. As the comment mentions, it can go anywhere (as long as `import` can find it), but it usually goes in the `site-packages` directory within your Python `lib` directory.
	The `setdefaultencoding` function sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.

Example 9.16. Effects of setting the default encoding

>>> import sys
>>> sys.getdefaultencoding() 
'iso-8859-1'
>>> s = u'La Pe\xf1a'
>>> print s
La Peña

This example assumes that you have made the changes listed in the previous example to your sitecustomize.py file, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even call sys.setdefaultencoding after Python has started up. Dig into site.py and search for “setdefaultencoding” to find out how.)

Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it.

Example 9.17. Specifying encoding in `.py` files

If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

Now, what about XML? Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R is popular for Russian texts. The encoding, if specified, is in the header of the XML document.

Example 9.18. `russiansample.xml`


<?xml version="1.0" encoding="koi8-r"?>       
<preface>
<title>Предисловие</title>  
</preface>

	This is a sample extract from a real Russian XML document; it's part of a Russian translation of this very book. Note the encoding, `koi8-r`, specified in the header.
	These are Cyrillic characters which, as far as I know, spell the Russian word for “Preface”. If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded using the `koi8-r` encoding scheme, but they're being displayed in `iso-8859-1`.

Example 9.19. Parsing `russiansample.xml`

>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('russiansample.xml') 
>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title   
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
>>> print title               
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> convertedtitle = title.encode('koi8-r')     
>>> convertedtitle
'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'
>>> print convertedtitle      
Предисловие

	I'm assuming here that you saved the previous example as `russiansample.xml` in the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back to `'ascii'` by removing your `sitecustomize.py` file, or at least commenting out the `setdefaultencoding` line.
	Note that the text data inside the XML document's `title` element is stored in unicode. (It is now in the `title` variable, thanks to that long concatenation of Python functions which I hastily skipped over and, annoyingly, won't explain until the next section.)
	Printing the title is not possible, because this unicode string contains non-ASCII characters, so Python can't convert it to ASCII because that doesn't make sense.
	You can, however, explicitly convert it to `koi8-r`, in which case you get a (regular, not unicode) string of single-byte characters (`f0`, `d2`, `c5`, and so forth) that are the `koi8-r`-encoded versions of the characters in the original unicode string.
	Printing the `koi8-r`-encoded string will probably show gibberish on your screen, because your Python IDE is interpreting those characters as `iso-8859-1`, not `koi8-r`. But at least they do print. (And, if you look carefully, it's the same gibberish that you saw when you opened the original XML document in a non-unicode-aware text editor. Python converted it from `koi8-r` into unicode when it parsed the XML document, and you've just converted it back.)
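
Here is the same round trip as a self-contained sketch in Python 3 syntax. Instead of reading `russiansample.xml` from disk, it builds the `koi8-r`-encoded bytes in memory; the variable names are assumptions made for the illustration.

```python
from xml.dom import minidom

# The sample document as a unicode string, using the escapes shown above.
source = (
    '<?xml version="1.0" encoding="koi8-r"?>\n'
    '<preface>\n'
    '<title>\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435</title>\n'
    '</preface>'
)

# What would actually be stored in the file: koi8-r bytes.
raw = source.encode("koi8-r")

# The parser reads the encoding from the XML header and decodes to unicode.
xmldoc = minidom.parseString(raw)
title = xmldoc.getElementsByTagName("title")[0].firstChild.data

# Converting back gives single-byte koi8-r characters again.
converted = title.encode("koi8-r")
print(repr(title))
print(converted)
```

The parsed title is ordinary unicode no matter which encoding the document was stored in; the encoding only matters again when you write the text back out.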

To sum up, unicode itself is a bit intimidating if you've never seen it before, but unicode data is really very easy to handle in Python. If your XML documents are all 7-bit ASCII (like the examples in this chapter), you will literally never think about unicode. Python will convert the ASCII data in the XML documents into unicode while parsing, and auto-coerce it back to ASCII whenever necessary, and you'll never even notice. But if you need to deal with data in other languages, Python is ready.

9.5. Searching for elements

Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName.

For this section, you'll be using the binary.xml grammar file, which looks like this:

Example 9.20. `binary.xml`

<?xml version="1.0"?>
<!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

It has two refs, 'bit' and 'byte'. A bit is either a '0' or '1', and a byte is 8 bits.

Example 9.21. Introducing `getElementsByTagName`

>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('binary.xml')
>>> reflist = xmldoc.getElementsByTagName('ref') 
>>> reflist
[<DOM Element: ref at 136138108>, <DOM Element: ref at 136144292>]
>>> print reflist[0].toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> print reflist[1].toxml()
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>

getElementsByTagName takes one argument, the name of the element you wish to find. It returns a list of Element objects, corresponding to the XML elements that have that name. In this case, you find two ref elements.

Example 9.22. Every element is searchable

>>> firstref = reflist[0]    
>>> print firstref.toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> plist = firstref.getElementsByTagName("p") 
>>> plist
[<DOM Element: p at 136140116>, <DOM Element: p at 136142172>]
>>> print plist[0].toxml()   
<p>0</p>
>>> print plist[1].toxml()
<p>1</p>

	Continuing from the previous example, the first object in your `reflist` is the `'bit'` `ref` element.
	You can use the same `getElementsByTagName` method on this `Element` to find all the `<p>` elements within the `'bit'` `ref` element.
	Just as before, the `getElementsByTagName` method returns a list of all the elements it found. In this case, you have two, one for each bit.

Example 9.23. Searching is actually recursive

>>> plist = xmldoc.getElementsByTagName("p") 
>>> plist
[<DOM Element: p at 136140116>, <DOM Element: p at 136142172>, <DOM Element: p at 136146124>]
>>> plist[0].toxml()       
'<p>0</p>'
>>> plist[1].toxml()
'<p>1</p>'
>>> plist[2].toxml()       
'<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>'

	Note carefully the difference between this and the previous example. Previously, you were searching for `p` elements within `firstref`, but here you are searching for `p` elements within `xmldoc`, the root-level object that represents the entire XML document. This does find the `p` elements nested within the `ref` elements within the root `grammar` element.
	The first two `p` elements are within the first `ref` (the `'bit'` `ref`).
	The last `p` element is the one within the second `ref` (the `'byte'` `ref`).
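
The difference between the two searches can be checked directly. The following is a minimal sketch in Python 3 syntax (the sessions above use Python 2's `print` statement). It assumes a shortened copy of the grammar inlined as a string, with only two `xref` elements in the `'byte'` `ref`, so that it runs without `binary.xml` on disk.

```python
from xml.dom import minidom

# Assumption: a shortened inline copy of the grammar, so no file is needed.
GRAMMAR = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""

xmldoc = minidom.parseString(GRAMMAR)

# Searching from the document finds every <p>, however deeply nested.
all_p = xmldoc.getElementsByTagName("p")

# Searching from one element finds only the <p> elements inside it.
firstref = xmldoc.getElementsByTagName("ref")[0]
bit_p = firstref.getElementsByTagName("p")

print(len(all_p), len(bit_p))
print([p.toxml() for p in bit_p])
```

The only thing that changes between the two calls is the object you call `getElementsByTagName` on; that object is the root of the search.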

9.6. Accessing element attributes

XML elements can have one or more attributes, and it is incredibly simple to access them once you have parsed an XML document.

For this section, you'll be using the binary.xml grammar file that you saw in the previous section.


	This section may be a little confusing, because of some overlapping terminology. Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents. I told you it was confusing. I am open to suggestions on how to distinguish these more clearly.

Example 9.24. Accessing element attributes

>>> xmldoc = minidom.parse('binary.xml')
>>> reflist = xmldoc.getElementsByTagName('ref')
>>> bitref = reflist[0]
>>> print bitref.toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> bitref.attributes          
<xml.dom.minidom.NamedNodeMap instance at 0x81e0c9c>
>>> bitref.attributes.keys()    
[u'id']
>>> bitref.attributes.values() 
[<xml.dom.minidom.Attr instance at 0x81d5044>]
>>> bitref.attributes["id"]    
<xml.dom.minidom.Attr instance at 0x81d5044>

	Each `Element` object has an attribute called `attributes`, which is a `NamedNodeMap` object. This sounds scary, but it's not, because a `NamedNodeMap` is an object that acts like a dictionary, so you already know how to use it.
	Treating the `NamedNodeMap` as a dictionary, you can get a list of the names of the attributes of this element by using `attributes.keys()`. This element has only one attribute, `'id'`.
	Attribute names, like all other text in an XML document, are stored in unicode.
	Again treating the `NamedNodeMap` as a dictionary, you can get a list of the values of the attributes by using `attributes.values()`. The values are themselves objects, of type `Attr`. You'll see how to get useful information out of this object in the next example.
	Still treating the `NamedNodeMap` as a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been paying extra-close attention will already know how the `NamedNodeMap` class accomplishes this neat trick: by defining a `__getitem__` special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.)

Example 9.25. Accessing individual attributes

>>> a = bitref.attributes["id"]
>>> a
<xml.dom.minidom.Attr instance at 0x81d5044>
>>> a.name  
u'id'
>>> a.value 
u'bit'

	The `Attr` object completely represents a single XML attribute of a single XML element. The name of the attribute (the same name as you used to find this object in the `bitref.attributes` `NamedNodeMap` pseudo-dictionary) is stored in `a.name`.
	The actual text value of this XML attribute is stored in `a.value`.


	Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the `Attr` objects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes by name, like the keys of a dictionary.
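
To make the access-by-name advice concrete, here is a small sketch in Python 3 syntax. The `lang` attribute is hypothetical, added only so the element has more than one attribute; it is not part of the grammar files used in this book.

```python
from xml.dom import minidom

# Assumption: the lang attribute is invented for this illustration.
xmldoc = minidom.parseString('<ref id="bit" lang="en"><p>0</p></ref>')
ref = xmldoc.documentElement

# Dictionary-style lookup on the NamedNodeMap returns an Attr object.
id_attr = ref.attributes["id"]
print(id_attr.name, id_attr.value)

# Collect every attribute by name rather than relying on position.
by_name = {name: value for name, value in ref.attributes.items()}
print(sorted(by_name))
```

Because everything is looked up by name, the code keeps working no matter what order the attributes appear in the source document.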

9.7. Segue

OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on other aspects that make the program more flexible: using streams for input processing, using getattr for method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code.

Before moving on to the next chapter, you should be comfortable doing all of these things:

Parsing XML documents using minidom, searching through the parsed document, and accessing arbitrary element attributes and element children
Organizing complex libraries into packages
Converting unicode strings to different character encodings

^[5]This, sadly, is still an oversimplification. Unicode now has been extended to handle ancient Chinese, Korean, and Japanese texts, which had so many different characters that the 2-byte unicode system could not represent them all. But Python doesn't currently support that out of the box, and I don't know if there is a project afoot to add it. You've reached the limits of my expertise, sorry.

Chapter 10. Scripts and Streams

10.1. Abstracting input sources

One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object.

Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close it when they're done. But they don't. Instead, they take a file-like object.

In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left off and returns the next chunk of data.

This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply calls the object's read method, the function can handle any kind of input source without specific code to handle each kind.
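
The `read` protocol described above is small enough to write by hand. The sketch below is in Python 3 syntax, and both `StringSource` and `read_in_chunks` are hypothetical names invented for illustration. It shows one function consuming two unrelated objects: a hand-written class, and the standard library's `io.StringIO` (Python 3's home for the `StringIO` class introduced later in this section).

```python
import io


class StringSource:
    """Hypothetical minimal file-like object: just read(size)."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def read(self, size=-1):
        # A negative or missing size means "everything that is left".
        if size is None or size < 0:
            end = len(self._text)
        else:
            end = min(self._pos + size, len(self._text))
        chunk = self._text[self._pos:end]
        self._pos = end  # the next call picks up where this one stopped
        return chunk


def read_in_chunks(source, size=8):
    """Works with anything that has a read(size) method."""
    chunks = []
    while True:
        chunk = source.read(size)
        if not chunk:  # empty string: nothing left to read
            break
        chunks.append(chunk)
    return chunks


text = "<p>0</p><p>1</p>"
print(read_in_chunks(StringSource(text)))
print(read_in_chunks(io.StringIO(text)))
```

`read_in_chunks` never checks what kind of object it was given; it only calls `read`. That is the whole idea behind file-like objects.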

In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object.

Example 10.1. Parsing XML from a file

>>> from xml.dom import minidom
>>> fsock = open('binary.xml')    
>>> xmldoc = minidom.parse(fsock) 
>>> fsock.close()                 
>>> print xmldoc.toxml()          
<?xml version="1.0" ?>
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

	First, you open the file on disk. This gives you a file object.
	You pass the file object to `minidom.parse`, which calls the `read` method of `fsock` and reads the XML document from the file on disk.
	Be sure to call the `close` method of the file object after you're done with it. `minidom.parse` will not do this for you.
	Calling the `toxml()` method on the returned XML document prints out the entire thing.

Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet.

Example 10.2. Parsing XML from a URL

>>> import urllib
>>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') 
>>> xmldoc = minidom.parse(usock)            
>>> usock.close()          
>>> print xmldoc.toxml()   
<?xml version="1.0" ?>
<rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<channel>
<title>Slashdot</title>
<link>http://slashdot.org/</link>
<description>News for nerds, stuff that matters</description>
</channel>

<image>
<title>Slashdot</title>
<url>http://images.slashdot.org/topics/topicslashdot.gif</url>
<link>http://slashdot.org/</link>
</image>

<item>
<title>To HDTV or Not to HDTV?</title>
<link>http://slashdot.org/article.pl?sid=01/12/28/0421241</link>
</item>

[...snip...]

	As you saw in a previous chapter, `urlopen` takes a web page URL and returns a file-like object. Most importantly, this object has a `read` method which returns the HTML source of the web page.
	Now you pass the file-like object to `minidom.parse`, which obediently calls the `read` method of the object and parses the XML data that the `read` method returns. The fact that this XML data is now coming straight from a web page is completely irrelevant. `minidom.parse` doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.
	As soon as you're done with it, be sure to close the file-like object that `urlopen` gives you.
	By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site.

Example 10.3. Parsing XML from a string (the easy but inflexible way)

>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> xmldoc = minidom.parseString(contents) 
>>> print xmldoc.toxml()
<?xml version="1.0" ?>
<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>

minidom has a function, parseString, which takes an entire XML document as a string and parses it. You can use this instead of minidom.parse if you know you already have your entire XML document in a string.

OK, so you can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying.

If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO.

Example 10.4. Introducing `StringIO`

>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> import StringIO
>>> ssock = StringIO.StringIO(contents)   
>>> ssock.read()        
"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> ssock.read()        
''
>>> ssock.seek(0)       
>>> ssock.read(15)      
'<grammar><ref i'
>>> ssock.read(15)
"d='bit'><p>0</p"
>>> ssock.read()
'><p>1</p></ref></grammar>'
>>> ssock.close()

	The `StringIO` module contains a single class, also called `StringIO`, which allows you to turn a string into a file-like object. The `StringIO` class takes the string as a parameter when creating an instance.
	Now you have a file-like object, and you can do all sorts of file-like things with it. Like `read`, which returns the original string.
	Calling `read` again returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any more without explicitly seeking to the beginning of the file. The `StringIO` object works the same way.
	You can explicitly seek to the beginning of the string, just like seeking through a file, by using the `seek` method of the `StringIO` object.
	You can also read the string in chunks, by passing a `size` parameter to the `read` method.
	At any time, `read` will return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term file-like object.

Example 10.5. Parsing XML from a string (the file-like object way)

>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> ssock = StringIO.StringIO(contents)
>>> xmldoc = minidom.parse(ssock) 
>>> ssock.close()
>>> print xmldoc.toxml()
<?xml version="1.0" ?>
<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>

Now you can pass the file-like object (really a StringIO) to minidom.parse, which will call the object's read method and happily parse away, never knowing that its input came from a hard-coded string.
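
For readers on Python 3, where the `StringIO` class lives in the `io` module rather than in a module of its own, the same steps look roughly like this. It is a sketch, not a replacement for the sessions above; the variable names `first` and `rest` are invented here.

```python
import io
from xml.dom import minidom

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

# In Python 3 the StringIO class lives in the io module.
ssock = io.StringIO(contents)
first = ssock.read(15)   # read a chunk...
rest = ssock.read()      # ...then everything that is left
ssock.seek(0)            # rewind before handing it to the parser

xmldoc = minidom.parse(ssock)
ssock.close()

print(first + rest == contents)
print(xmldoc.documentElement.tagName)
```

Note the `seek(0)` before parsing: `minidom.parse` reads from wherever the object currently is, so a file-like object that has already been read to the end would look like an empty document.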

So now you know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use urlopen to get a file-like object; for a local file, you use open; and for a string, you use StringIO. Now let's take it one step further and generalize these differences as well.

Example 10.6. `openAnything`

def openAnything(source):
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib       
    try:                
        return urllib.urlopen(source)      
    except (IOError, OSError):            
        pass            

    # try to open with native open function (if source is pathname)
    try:                
        return open(source)                
    except (IOError, OSError):            
        pass            

    # treat source as string
    import StringIO     
    return StringIO.StringIO(str(source))

	The `openAnything` function takes a single parameter, `source`, and returns a file-like object. `source` is a string of some sort; it can either be a URL (like `'http://slashdot.org/slashdot.rdf'`), a full or partial pathname to a local file (like `'binary.xml'`), or a string that contains actual XML data to be parsed.
	First, you see if `source` is a URL. You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL. This is actually elegant in the sense that, if `urllib` ever supports new types of URLs in the future, you will also support them without recoding. If `urllib` is able to open `source`, then the `return` kicks you out of the function immediately and the following `try` statements never execute.
	On the other hand, if `urllib` yelled at you and told you that `source` wasn't a valid URL, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether `source` is a valid filename or not (the rules for valid filenames vary wildly between different platforms, so you'd probably get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors.
	By this point, you need to assume that `source` is a string that has hard-coded data in it (since nothing else worked), so you use `StringIO` to create a file-like object out of it and return that. (In fact, since you're using the `str` function, `source` doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its `__str__` special method.)
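
The same fall-through strategy can be sketched in Python 3, with some assumptions worth stating up front. The name `open_anything` is hypothetical; `urllib.urlopen` became `urllib.request.urlopen`; and a string with no URL scheme raises `ValueError` there, so that exception is caught alongside `OSError`.

```python
import io
import urllib.request


def open_anything(source):
    """Return a file-like object for a URL, a pathname, or literal data."""
    # Try urllib first. In Python 3, a string with no URL scheme raises
    # ValueError, and network or protocol failures raise OSError subclasses.
    try:
        return urllib.request.urlopen(source)
    except (OSError, ValueError):
        pass

    # Then try the built-in open (source may be a pathname).
    try:
        return open(source)
    except (OSError, ValueError):
        pass

    # Fall back to treating source as the data itself.
    return io.StringIO(str(source))


sock = open_anything("<grammar><ref id='bit'/></grammar>")
print(sock.read())
sock.close()
```

One difference from the Python 2 version: the object returned for a URL yields bytes from `read`, while the other two branches yield text. `minidom.parse` accepts either, but code that handles the data itself would need to account for that.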

Now you can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it.

Example 10.7. Using `openAnything` in `kgp.py`

class KantGenerator:
    def _load(self, source):
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc

10.2. Standard input, output, and error

UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.

Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.)

Example 10.8. Introducing `stdout` and `stderr`

>>> for i in range(3):
...     print 'Dive in'             
Dive in
Dive in
Dive in
>>> import sys
>>> for i in range(3):
...     sys.stdout.write('Dive in') 
Dive inDive inDive in
>>> for i in range(3):
...     sys.stderr.write('Dive in') 
Dive inDive inDive in

	As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in `range` function to build simple counter loops that repeat something a set number of times.
	`stdout` is a file-like object; calling its `write` method will print out whatever string you give it. In fact, this is what the `print` statement really does; it adds a newline to the end of the string you're printing, and calls `sys.stdout.write`.
	In the simplest case, `stdout` and `stderr` send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like `stdout`, `stderr` does not add newlines for you; if you want them, add them yourself.

stdout and stderr are both file-like objects, like the ones discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file or file-like object to them to redirect their output.

Example 10.9. Redirecting output

[you@localhost kgp]$ python stdout.py
Dive in
[you@localhost kgp]$ cat out.log
This message will be logged instead of displayed

(On Windows, you can use type instead of cat to display the contents of a file.)

If you have not already done so, you can download this and other examples used in this book.

#stdout.py
import sys

print 'Dive in'      
saveout = sys.stdout 
fsock = open('out.log', 'w')           
sys.stdout = fsock   
print 'This message will be logged instead of displayed' 
sys.stdout = saveout 
fsock.close()

	This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command line).
	Always save `stdout` before redirecting it, so you can set it back to normal later.
	Open a file for writing. If the file doesn't exist, it will be created. If the file does exist, it will be overwritten.
	Redirect all further output to the new file you just opened.
	This will be “printed” to the log file only; it will not be visible in the IDE window or on the screen.
	Set `stdout` back to the way it was before you mucked with it.
	Close the log file.
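
The save, redirect, restore pattern is easy to get wrong if something raises an exception in the middle. Here is a hedged variation in Python 3 syntax that wraps the redirected section in `try`/`finally`. As an assumption for the sake of a self-contained example, an in-memory `io.StringIO` stands in for the `out.log` file.

```python
import io
import sys

buffer = io.StringIO()      # stands in for the log file in this sketch

saveout = sys.stdout        # always keep a reference to the original
sys.stdout = buffer
try:
    print("This message will be logged instead of displayed")
finally:
    sys.stdout = saveout    # restored even if the print raises

logged = buffer.getvalue()
print("captured:", repr(logged))
```

Python 3's standard library also offers `contextlib.redirect_stdout`, which packages the same idea as a `with` block.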

Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout.

Example 10.10. Redirecting error information

[you@localhost kgp]$ python stderr.py
[you@localhost kgp]$ cat error.log
Traceback (most recent call last):
  File "stderr.py", line 5, in ?
    raise Exception, 'this error will be logged'
Exception: this error will be logged

#stderr.py
import sys

fsock = open('error.log', 'w')               
sys.stderr = fsock         
raise Exception, 'this error will be logged'

	Open the log file where you want to store debugging information.
	Redirect standard error by assigning the file object of the newly-opened log file to `stderr`.
	Raise an exception. Note from the screen output that this does not print anything on screen. All the normal traceback information has been written to `error.log`.
	Also note that you're not explicitly closing your log file, nor are you setting `stderr` back to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for you, and it doesn't make any difference that `stderr` is never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important for `stdout`, if you expect to go do other stuff within the same script afterwards.

Since it is so common to write error messages to standard error, there is a shorthand syntax that can be used instead of going through the hassle of redirecting it outright.

Example 10.11. Printing to `stderr`

>>> print 'entering function'
entering function
>>> import sys
>>> print >> sys.stderr, 'entering function' 
entering function

This shorthand syntax of the print statement can be used to write to any open file, or file-like object. In this case, you can redirect a single print statement to stderr without affecting subsequent print statements.
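
In Python 3, where `print` is a function, the `>>` shorthand no longer exists; the equivalent is the `file` keyword argument. A brief sketch follows, with the `log` object invented for illustration:

```python
import io
import sys

# Python 3 spells "print >> sys.stderr, ..." as a keyword argument.
print("entering function", file=sys.stderr)

# Any object with a write method will do, not only stderr.
log = io.StringIO()
print("entering function", file=log)

print("still on stdout")   # later prints are unaffected
```

As with the Python 2 shorthand, only the one call is redirected; `sys.stdout` itself is never touched.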

Standard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some previous program. This will likely not make much sense to classic Mac OS users, or even Windows users unless you were ever fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting one program's output to the next program's input.

Example 10.12. Chaining commands

[you@localhost kgp]$ python kgp.py -g binary.xml         
01100111
[you@localhost kgp]$ cat binary.xml    
<?xml version="1.0"?>
<!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>
[you@localhost kgp]$ cat binary.xml | python kgp.py -g -  
10110001

	As you saw in Section 9.1, “Diving in”, this will print a string of eight random bits, `0` or `1`.
	This simply prints out the entire contents of `binary.xml`. (Windows users should use `type` instead of `cat`.)
	This prints the contents of `binary.xml`, but the “`|`” character, called the “pipe” character, means that the contents will not be printed to the screen. Instead, they will become the standard input of the next command, which in this case calls your Python script.
	Instead of specifying a grammar file (like `binary.xml`), you specify “`-`”, which causes your script to load the grammar from standard input instead of from a specific file on disk. (More on how this happens in the next example.) So the effect is the same as the first syntax, where you specified the grammar filename directly, but think of the expansion possibilities here. Instead of simply doing `cat binary.xml`, you could run a script that dynamically generates the grammar, then you can pipe it into your script. It could come from anywhere: a database, or some grammar-generating meta-script, or whatever. The point is that you don't need to change your `kgp.py` script at all to incorporate any of this functionality. All you need to do is be able to take grammar files from standard input, and you can separate all the other logic into another program.

So how does the script “know” to read from standard input when the grammar file is “-”? It's not magic; it's just code.

Example 10.13. Reading from standard input in `kgp.py`

def openAnything(source):
    if source == "-":    
        import sys
        return sys.stdin

    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:

[... snip ...]

	Start by creating an empty dictionary, `self.refs`.
	As you saw in Section 9.5, “Searching for elements”, `getElementsByTagName` returns a list of all the elements of a particular name. You can easily get a list of all the `ref` elements, then simply loop through that list.
	As you saw in Section 9.6, “Accessing element attributes”, you can access individual attributes of an element by name, using standard dictionary syntax. So the keys of the `self.refs` dictionary will be the values of the `id` attribute of each `ref` element.
	The values of the `self.refs` dictionary will be the `ref` elements themselves. As you saw in Section 9.3, “Parsing XML”, each element, each node, each comment, each piece of text in a parsed XML document is an object.
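
The code these notes describe is not reproduced in this section. As a rough stand-in, the sketch below (Python 3 syntax) builds the same kind of lookup table. It assumes a shortened inline grammar and uses a plain `refs` variable where the notes speak of `self.refs`.

```python
from xml.dom import minidom

# Assumption: a shortened inline grammar instead of a file on disk.
grammar = minidom.parseString(
    '<grammar>'
    '<ref id="bit"><p>0</p><p>1</p></ref>'
    '<ref id="byte"><p><xref id="bit"/></p></ref>'
    '</grammar>'
)

refs = {}  # value of the id attribute -> the ref element itself
for ref in grammar.getElementsByTagName("ref"):
    refs[ref.attributes["id"].value] = ref

print(sorted(refs))
print(refs["bit"].toxml())
```

Once the dictionary is built, resolving an `xref` is a single dictionary lookup instead of another search through the document.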

	As you saw in Example 9.9, “Getting child nodes”, the `childNodes` attribute returns a list of all the child nodes of an element.
	However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by `childNodes` contains all different types of nodes, including text nodes. That's not what you're looking for here. You only want the children that are elements.
	Each node has a `nodeType` attribute, which can be `ELEMENT_NODE`, `TEXT_NODE`, `COMMENT_NODE`, or any number of other values. The complete list of possible values is in the `__init__.py` file in the `xml.dom` package. (See Section 9.2, “Packages” for more on packages.) But you're just interested in nodes that are elements, so you can filter the list to only include those nodes whose `nodeType` is `ELEMENT_NODE`.
	Once you have a list of actual elements, choosing a random one is easy. Python comes with a module called `random` which includes several useful functions. The `random.choice` function takes a list of any number of items and returns a random item. For example, if a `ref` element contains several `p` elements, then `choices` would be a list of `p` elements, and `chosen` would end up being assigned exactly one of them, selected at random.
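
The filtering and choosing steps can be sketched as follows, in Python 3 syntax. The names `choices` and `chosen` come from the notes above; the inline `ref` element is an assumption made so the example runs without a grammar file.

```python
import random
from xml.dom import minidom

# Assumption: one ref element inlined as a string, whitespace included.
doc = minidom.parseString('<ref id="bit">\n  <p>0</p>\n  <p>1</p>\n</ref>')
ref = doc.documentElement

# childNodes mixes whitespace text nodes with the <p> elements.
print(len(ref.childNodes))

# Keep only the children that are elements.
choices = [node for node in ref.childNodes
           if node.nodeType == node.ELEMENT_NODE]

chosen = random.choice(choices)
print(chosen.toxml())
```

Without the filter, `random.choice` would sometimes pick one of the whitespace text nodes between the `p` elements.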

	Assume for a moment that `kant.xml` is in the current directory.
	As you saw in Section 9.2, “Packages”, the object returned by parsing an XML document is a `Document` object, as defined in `minidom.py` in the `xml.dom` package. As you saw in Section 5.4, “Instantiating Classes”, `__class__` is a built-in attribute of every Python object.
	Furthermore, `__name__` is a built-in attribute of every Python class, and it is a string. This string is not mysterious; it's the same as the class name you type when you define a class yourself. (See Section 5.3, “Defining Classes”.)

	First off, notice that you're constructing a larger string based on the class name of the node you were passed (in the `node` argument). So if you're passed a `Document` node, you're constructing the string `'parse_Document'`, and so forth.
	Now you can treat that string as a function name, and get a reference to the function itself using `getattr`.
	Finally, you can call that function and pass the node itself as an argument. The next example shows the definitions of each of these functions.

	`parse_Document` is only ever called once, since there is only one `Document` node in an XML document, and only one `Document` object in the parsed XML representation. It simply turns around and parses the root element of the grammar file.
	`parse_Text` is called on nodes that represent bits of text. The function itself does some special processing to handle automatic capitalization of the first word of a sentence, but otherwise simply appends the represented text to a list.
	`parse_Comment` is just a `pass`, since you don't care about embedded comments in the grammar files. Note, however, that you still need to define the function and explicitly make it do nothing. If the function did not exist, the generic `parse` function would fail as soon as it stumbled on a comment, because it would try to find the non-existent `parse_Comment` function. Defining a separate function for every node type, even ones you don't use, allows the generic `parse` function to stay simple and dumb.
	The `parse_Element` method is actually itself a dispatcher, based on the name of the element's tag. The basic idea is the same: take what distinguishes elements from each other (their tag names) and dispatch to a separate function for each of them. You construct a string like `'do_xref'` (for an `<xref>` tag), find a function of that name, and call it. And so forth for each of the other tag names that might be found in the course of parsing a grammar file (`<p>` tags, `<choice>` tags).
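
Since the code for these handlers does not appear in this section, here is a stripped-down sketch of the two-level dispatch the notes describe, in Python 3 syntax. `TextCollector`, its `pieces` list, and the single `do_p` handler are hypothetical simplifications, not the book's actual class.

```python
from xml.dom import minidom


class TextCollector:
    """Hypothetical stand-in for the book's generator class."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # Build a method name such as 'parse_Element' from the node's class.
        method = getattr(self, "parse_%s" % node.__class__.__name__)
        method(node)

    def parse_Document(self, node):
        self.parse(node.documentElement)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Comment(self, node):
        pass  # must exist, or getattr in parse would raise AttributeError

    def parse_Element(self, node):
        # Second-level dispatch on the tag name, e.g. 'do_p'.
        handler = getattr(self, "do_%s" % node.tagName)
        handler(node)

    def do_p(self, node):
        for child in node.childNodes:
            self.parse(child)


collector = TextCollector()
collector.parse(minidom.parseString("<p>Dive <!-- ignored -->in</p>"))
print("".join(collector.pieces))
```

Supporting a new tag means adding one more `do_` method; neither `parse` nor `parse_Element` needs to change.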

	The first thing to know about `sys.argv` is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later, in Chapter 16, Functional Programming. Don't worry about it for now.
	Command-line arguments are separated by spaces, and each shows up as a separate element in the `sys.argv` list.
	Command-line flags, like `--help`, also show up as their own element in the `sys.argv` list.
	To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag (`-m`) which takes an argument (`kant.xml`). Both the flag itself and the flag's argument are simply sequential elements in the `sys.argv` list. No attempt is made to associate one with the other; all you get is a list.

	First off, look at the bottom of the example and notice that you're calling the `main` function with `sys.argv[1:]`. Remember, `sys.argv[0]` is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off and pass the rest of the list.
	This is where all the interesting processing happens. The `getopt` function of the `getopt` module takes three parameters: the argument list (which you got from `sys.argv[1:]`), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is explained in more detail below.
	If anything goes wrong trying to parse these command-line flags, `getopt` will raise an exception, which you catch. You told `getopt` all the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.
	As is standard practice in the UNIX world, when the script is passed flags it doesn't understand, you print out a summary of proper usage and exit gracefully. Note that I haven't shown the `usage` function here. You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.

	The `grammar` variable will keep track of the grammar file you're using. You initialize it here in case it's not specified on the command line (using either the `-g` or the `--grammar` flag).
	The `opts` variable that you get back from `getopt` contains a list of tuples: `flag` and `argument`. If the flag doesn't take an argument, then `arg` will simply be an empty string. This makes it easier to loop through the flags.
	`getopt` validates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags. If you specify the `-h` flag, `opt` will contain `"-h"`; if you specify the `--help` flag, `opt` will contain `"--help"`. So you need to check for both.
	Remember, the `-d` flag didn't have a corresponding long flag, so you only need to check for the short form. If you find it, you set a global variable that you'll refer to later to print out debugging information. (I used this during the development of the script. What, you thought all these examples worked on the first try?)
	If you find a grammar file, either with a `-g` flag or a `--grammar` flag, you save the argument that followed it (stored in `arg`) into the `grammar` variable, overwriting the default that you initialized at the top of the `main` function.
	That's it. You've looped through and dealt with all the command-line flags. That means that anything left must be command-line arguments. These come back from the `getopt` function in the `args` variable. In this case, you're treating them as source material for the parser. If there are no command-line arguments specified, `args` will be an empty list, and `source` will end up as the empty string.
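
The whole loop described above can be sketched in a few lines. This is a minimal sketch, not the script's actual `main` function: the function name `parse_args`, the default grammar file name, and the returned dictionary are assumptions made for illustration. It returns `None` where the real script would print its usage summary and exit.

```python
import getopt


def parse_args(argv):
    """Parse -h/--help, -g/--grammar and -d; return None on unknown flags."""
    grammar = "kant.xml"  # hypothetical default grammar file
    debug = False
    show_help = False
    try:
        # "hg:d" means -h and -d take no argument, -g takes one.
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        return None  # the real script would print usage and exit here
    for opt, arg in opts:
        if opt in ("-h", "--help"):  # no short/long conversion: check both
            show_help = True
        elif opt == "-d":  # no long form for this flag
            debug = True
        elif opt in ("-g", "--grammar"):
            grammar = arg
    # Anything left over is a plain command-line argument.
    source = "".join(args)
    return {"grammar": grammar, "debug": debug,
            "help": show_help, "source": source}


settings = parse_args(["--grammar", "husserl.xml", "-d", "input.txt"])
```

Calling `parse_args(sys.argv[1:])` would hook this up to the real command line.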

	`urllib` relies on another standard Python library, `httplib`. Normally you don't need to `import httplib` directly (`urllib` does that automatically), but you will here so you can set the debugging flag on the `HTTPConnection` class that `urllib` uses internally to connect to the HTTP server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there's no particular standard for naming them or turning them on; you need to read the documentation of each library to see if such a feature is available.
	Now that the debugging flag is set, information on the HTTP request and response is printed out in real time. The first thing it tells you is that you're connecting to the server `diveintomark.org` on port 80, which is the standard port for HTTP.
	When you request the Atom feed, `urllib` sends three lines to the server. The first line specifies the HTTP verb you're using, and the path of the resource (minus the domain name). All the requests in this chapter will use `GET`, but in the next chapter on SOAP, you'll see that it uses `POST` for everything. The basic syntax is the same, regardless of the verb.
	The second line is the `Host` header, which specifies the domain name of the service you're accessing. This is important, because a single HTTP server can host multiple separate domains. My server currently hosts 12 domains; other servers can host hundreds or even thousands.
	The third line is the `User-Agent` header. What you see here is the generic `User-Agent` that the `urllib` library adds by default. In the next section, you'll see how to customize this to be more specific.
	The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the `feeddata` variable). The status code here is `200`, meaning “everything's normal, here's the data you requested”. The server also tells you the date it responded to your request, some information about the server itself, and the content type of the data it's giving you. Depending on your application, this might be useful, or not. It's certainly reassuring that you thought you were asking for an Atom feed, and lo and behold, you're getting an Atom feed (`application/atom+xml`, which is the registered content type for Atom feeds).
	The server tells you when this Atom feed was last modified (in this case, about 13 minutes ago). You can send this date back to the server the next time you request the same feed, and the server can do last-modified checking.
	The server also tells you that this Atom feed has an ETag hash of `"e8284-68e0-4de30f80"`. The hash doesn't mean anything by itself; there's nothing you can do with it, except send it back to the server the next time you request this same feed. Then the server can use it to tell you if the data has changed or not.

	If you still have your Python IDE open from the previous section's example, you can skip this step. It turns on HTTP debugging so you can see what you're actually sending over the wire, and what gets sent back.
	Fetching an HTTP resource with `urllib2` is a three-step process, for good reasons that will become clear shortly. The first step is to create a `Request` object, which takes the URL of the resource you'll eventually get around to retrieving. Note that this step doesn't actually retrieve anything yet.
	The second step is to build a URL opener. This can take any number of handlers, which control how responses are handled. But you can also build an opener without any custom handlers, which is what you're doing here. You'll see how to define and use custom handlers later in this chapter when you explore redirects.
	The final step is to tell the opener to open the URL, using the `Request` object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the resource and stores the returned data in `feeddata`.
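
Here is roughly what those three steps look like in Python 3, where `urllib2` became `urllib.request`. Treat it as a sketch under assumptions: the function name `fetch` is made up for illustration, and passing `debuglevel=1` to `HTTPHandler` is one way to get the debugging output in Python 3, not necessarily how the original example turned it on.

```python
import urllib.request


def fetch(url):
    """Retrieve a URL in three explicit steps and return the raw bytes."""
    # Step 1: describe the request. Nothing is sent yet.
    request = urllib.request.Request(url)
    # Step 2: build an opener. No custom handlers here, apart from an
    # HTTPHandler that prints the HTTP conversation as it happens.
    opener = urllib.request.build_opener(
        urllib.request.HTTPHandler(debuglevel=1))
    # Step 3: actually open the URL and read the data.
    response = opener.open(request)
    try:
        return response.read()
    finally:
        response.close()
```

Calling `fetch("http://diveintomark.org/xml/atom.xml")` would perform the request over the network, assuming the feed is still served at that address.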

	You're continuing from the previous example; you've already created a `Request` object with the URL you want to access.
	Using the `add_header` method on the `Request` object, you can add arbitrary HTTP headers to the request. The first argument is the header, the second is the value you're providing for that header. Convention dictates that a `User-Agent` should be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form, and you'll see a lot of variations in the wild, but somewhere it should include a URL of your application. The `User-Agent` is usually logged by the server along with other details of your request, and including a URL of your application allows server administrators looking through their access logs to contact you if something is wrong.
	The `opener` object you created before can be reused too, and it will retrieve the same feed again, but with your custom `User-Agent` header.
	And here's you sending your custom `User-Agent`, in place of the generic one that Python sends by default. If you look closely, you'll notice that you defined a `User-Agent` header, but you actually sent a `User-agent` header. See the difference? `urllib2` changed the case so that only the first letter was capitalized. It doesn't really matter; HTTP specifies that header field names are completely case-insensitive.
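
You can see the case change without touching the network. This sketch uses Python 3's `urllib.request.Request`, which behaves the same way; the `User-Agent` value is only an illustration of the name/version/URL convention, not a string the text prescribes.

```python
import urllib.request

request = urllib.request.Request("http://diveintomark.org/xml/atom.xml")
# Illustrative value: application name, slash, version, then a URL.
request.add_header("User-Agent",
                   "OpenAnything/1.0 +http://diveintopython.org/")

# The header name is stored with only its first letter capitalized.
stored_names = [name for name, value in request.header_items()]
agent = request.get_header("User-agent")
```

Note that `get_header` looks up the stored spelling, `User-agent`, which is worth remembering if you ever read the headers back from a `Request` object.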

	Remember all those HTTP headers you saw printed out when you turned on debugging? This is how you can get access to them programmatically: `firstdatastream.headers` is an object that acts like a dictionary and allows you to get any of the individual headers returned from the HTTP server.
	On the second request, you add the `If-Modified-Since` header with the last-modified date from the first request. If the data hasn't changed, the server should return a `304` status code.
	Sure enough, the data hasn't changed. You can see from the traceback that `urllib2` throws a special exception, `HTTPError`, in response to the `304` status code. This is a little unusual, and not entirely helpful. After all, it's not an error; you specifically asked the server not to send you any data if it hadn't changed, and the data didn't change, so the server told you it wasn't sending you any data. That's not an error; that's exactly what you were hoping for.

	`urllib2` is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens -- like an HTTP error, or even a `304` code -- `urllib2` introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in Chapter 9, XML Processing, to define handlers for different node types, but `urllib2` is more flexible, and introspects over as many handlers as are defined for the current request.
	`urllib2` searches through the defined handlers and calls the `http_error_default` method when it encounters a `304` status code from the server. By defining a custom error handler, you can prevent `urllib2` from raising an exception. Instead, you create the `HTTPError` object, but return it instead of raising it.
	This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access to it from the calling program.
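
A Python 3 version of such a handler might look like the sketch below, with `urllib.request` and `urllib.error` standing in for `urllib2`. One difference is an assumption worth stating: in Python 3 the `HTTPError` object already carries the status code in its `code` attribute, so this sketch reads that instead of saving a separate attribute. The direct call at the bottom only simulates what the opener would do on a `304`; no request is sent.

```python
import email.message
import io
import urllib.error
import urllib.request


class DefaultErrorHandler(urllib.request.HTTPDefaultErrorHandler):
    """Return HTTP errors to the caller instead of raising them."""

    def http_error_default(self, req, fp, code, msg, headers):
        # Create the HTTPError, but return it rather than raise it.
        # The status code stays available as result.code.
        return urllib.error.HTTPError(req.full_url, code, msg, headers, fp)


# Simulate the call the opener would make for a 304 response.
request = urllib.request.Request("http://diveintomark.org/xml/atom.xml")
result = DefaultErrorHandler().http_error_default(
    request, io.BytesIO(b""), 304, "Not Modified", email.message.Message())
status = result.code
```

In real use you would pass the handler to `urllib.request.build_opener(DefaultErrorHandler())` and let the opener call it for you.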

	You're continuing the previous example, so the `Request` object is already set up, and you've already added the `If-Modified-Since` header.
	This is the key: now that you've defined your custom URL handler, you need to tell `urllib2` to use it. Remember how I said that `urllib2` broke up the process of accessing an HTTP resource into three steps, and for good reason? This is why building the URL opener is its own step, because you can build it with your own custom URL handlers that override `urllib2`'s default behavior.
	Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use `seconddatastream.headers.dict` to access them), also contains the HTTP status code. In this case, as you expected, the status is `304`, meaning this data hasn't changed since the last time you asked for it.
	Note that when the server sends back a `304` status code, it doesn't re-send the data. That's the whole point: to save bandwidth by not re-downloading data that hasn't changed. So if you actually want that data, you'll need to cache it locally the first time you get it.

	Using the `firstdatastream.headers` pseudo-dictionary, you can get the `ETag` returned from the server. (What happens if the server didn't send back an `ETag`? Then this line would return `None`.)
	OK, you got the data.
	Now set up the second call by setting the `If-None-Match` header to the `ETag` you got from the first call.
	The second call succeeds quietly (without throwing an exception), and once again you see that the server has sent back a `304` status code. Based on the `ETag` you sent the second time, it knows that the data hasn't changed.
	Regardless of whether the `304` is triggered by `Last-Modified` date checking or `ETag` hash matching, you'll never get the data along with the `304`. That's the whole point.


	In these examples, the HTTP server has supported both `Last-Modified` and `ETag` headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively in case a server only supports one or the other, or neither.
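
Coding defensively here mostly means never assuming a header is present. A small sketch, using a hypothetical helper named `extract_validators` and plain dictionaries in place of real response headers:

```python
def extract_validators(headers):
    """Return (etag, last_modified); either may be None if not sent."""
    # .get() returns None for a missing header instead of raising KeyError.
    etag = headers.get("ETag")
    last_modified = headers.get("Last-Modified")
    return etag, last_modified


# A server that sends both, one that sends only an ETag, and one with neither.
both = extract_validators({"ETag": '"e8284-68e0-4de30f80"',
                           "Last-Modified": "Thu, 15 Apr 2004 19:45:21 GMT"})
etag_only = extract_validators({"ETag": '"e8284-68e0-4de30f80"'})
neither = extract_validators({})
```

The date string is made up for the example. Whatever comes back as `None` simply gets left out of the next request.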

	You'll be better able to see what's happening if you turn on debugging.
	This is a URL which I have set up to permanently redirect to my Atom feed at `http://diveintomark.org/xml/atom.xml`.
	Sure enough, when you try to download the data at that address, the server sends back a `301` status code, telling you that the resource has moved permanently.
	The server also sends back a `Location:` header that gives the new address of this data.
	`urllib2` notices the redirect status code and automatically tries to retrieve the data at the new location specified in the `Location:` header.
	The object you get back from the `opener` contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from now on.

	Redirect behavior is defined in `urllib2` in a class called `HTTPRedirectHandler`. You don't want to completely override the behavior, you just want to extend it a little, so you'll subclass `HTTPRedirectHandler` so you can call the ancestor class to do all the hard work.
	When it encounters a `301` status code from the server, `urllib2` will search through its handlers and call the `http_error_301` method. The first thing ours does is just call the `http_error_301` method in the ancestor, which handles the grunt work of looking for the `Location:` header and following the redirect to the new address.
	Here's the key: before you return, you store the status code (`301`), so that the calling program can access it later.
	Temporary redirects (status code `302`) work the same way: override the `http_error_302` method, call the ancestor, and save the status code before returning.
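
In Python 3 the same idea can be sketched with `urllib.request.HTTPRedirectHandler`. The attribute name `redirect_status` is an assumption chosen for this sketch. It avoids overwriting the response's own `status`, which in Python 3 holds the status code of the final response.

```python
import urllib.request


class SmartRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember which kind they were."""

    def http_error_301(self, req, fp, code, msg, headers):
        # Let the ancestor find the Location header and follow it.
        result = super().http_error_301(req, fp, code, msg, headers)
        if result is not None:
            result.redirect_status = code  # hypothetical attribute name
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        result = super().http_error_302(req, fp, code, msg, headers)
        if result is not None:
            result.redirect_status = code
        return result


# Building the opener does not send anything over the wire.
opener = urllib.request.build_opener(SmartRedirectHandler())
```

After `f = opener.open(request)`, a caller could check `getattr(f, "redirect_status", None)` to find out whether a redirect happened and which kind it was.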

	First, build a URL opener with the redirect handler you just defined.
	You sent off a request, and you got a `301` status code in response. At this point, the `http_error_301` method gets called. You call the ancestor method, which follows the redirect and sends a request at the new location (`http://diveintomark.org/xml/atom.xml`).
	This is the payoff: now, not only do you have access to the new URL, but you have access to the redirect status code, so you can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location (`http://diveintomark.org/xml/atom.xml`, as specified in `f.url`). If you had stored the location in a configuration file or a database, you need to update that so you don't keep pounding the server with requests at the old address. It's time to update your address book.

	This is a sample URL I've set up that is configured to tell clients to temporarily redirect to `http://diveintomark.org/xml/atom.xml`.
	The server sends back a `302` status code, indicating a temporary redirect. The temporary new location of the data is given in the `Location:` header.
	`urllib2` calls your `http_error_302` method, which calls the ancestor method of the same name in `urllib2.HTTPRedirectHandler`, which follows the redirect to the new location. Then your `http_error_302` method stores the status code (`302`) so the calling application can get it later.
	And here you are, having successfully followed the redirect to `http://diveintomark.org/xml/atom.xml`. `f.status` tells you that this was a temporary redirect, which means that you should continue to request data from the original address (`http://diveintomark.org/redir/example302.xml`). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It's not for you to say. The server said this redirect was only temporary, so you should respect that. And now you're exposing enough information that the calling application can respect that.

	This is the key: once you've created your `Request` object, add an `Accept-encoding` header to tell the server you can accept gzip-encoded data. `gzip` is the name of the compression algorithm you're using. In theory there could be other compression algorithms, but `gzip` is the compression algorithm used by 99% of web servers.
	There's your header going across the wire.
	And here's what the server sends back: the `Content-Encoding: gzip` header means that the data you're about to receive has been gzip-compressed.
	The `Content-Length` header is the length of the compressed data, not the uncompressed data. As you'll see in a minute, the actual length of the uncompressed data was 15955, so gzip compression cut your bandwidth by over 60%!

	Continuing from the previous example, `f` is the file-like object returned from the URL opener. Using its `read()` method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first step towards getting the data you really want.
	OK, this step is a bit of a messy workaround. Python has a `gzip` module, which reads (and actually writes) gzip-compressed files on disk. But you don't have a file on disk, you have a gzip-compressed buffer in memory, and you don't want to write out a temporary file just so you can uncompress it. So what you're going to do is create a file-like object out of the in-memory data (`compresseddata`), using the `StringIO` module. You first saw the `StringIO` module in the previous chapter, but now you've found another use for it.
	Now you can create an instance of `GzipFile`, and tell it that its “file” is the file-like object `compressedstream`.
	This is the line that does all the actual work: “reading” from `GzipFile` will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. `gzipper` is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; `gzipper` is really just “reading” from the file-like object you created with `StringIO` to wrap the compressed data, which is only in memory in the variable `compresseddata`. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built with `urllib2.build_opener`. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it.
	Look ma, real data. (15955 bytes of it, in fact.)
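
The chain of file-like objects is easy to try without a server. In this Python 3 sketch, `io.BytesIO` plays the role of `StringIO` (compressed data is bytes), and the "downloaded" data is faked by compressing a made-up sample locally. The variable names follow the walkthrough above.

```python
import gzip
import io

# Stand-in for the bytes you would get from f.read() on a gzip response.
original = b"<feed>example Atom data</feed>" * 100
compresseddata = gzip.compress(original)

# Wrap the in-memory bytes in a file-like object...
compressedstream = io.BytesIO(compresseddata)
# ...and tell GzipFile that this is its "file".
gzipper = gzip.GzipFile(fileobj=compressedstream)
# "Reading" from the GzipFile decompresses the data.
data = gzipper.read()
```

Python 3 also offers `gzip.decompress(compresseddata)` as a one-call shortcut for the same result.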

	Continuing from the previous example, you already have a `Request` object set up with an `Accept-encoding: gzip` header.
	Simply opening the request will get you the headers (though it won't download any data yet). As you can see from the returned `Content-Encoding` header, this data has been sent gzip-compressed.
	Since `opener.open` returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data, why not simply pass that file-like object directly to `GzipFile`? As you “read” from the `GzipFile` instance, it will “read” compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't work. Because of the way gzip compression works, `GzipFile` needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream. So the inelegant hack of using `StringIO` is the best solution: download the compressed data, create a file-like object out of it with `StringIO`, and then decompress the data from that.

	`urlparse` is a handy utility module for, you guessed it, parsing URLs. Its primary function, also called `urlparse`, takes a URL and splits it into a tuple of six parts: scheme, domain, path, params, query string parameters, and fragment identifier. Of these, the only thing you care about is the scheme, to make sure that you're dealing with an HTTP URL (which `urllib2` can handle).
	You identify yourself to the HTTP server with the `User-Agent` passed in by the calling function. If no `User-Agent` was specified, you use a default one defined earlier in the `openanything.py` module. You never use the default one defined by `urllib2`.
	If an `ETag` hash was given, send it in the `If-None-Match` header.
	If a last-modified date was given, send it in the `If-Modified-Since` header.
	Tell the server you would like compressed data if possible.
	Build a URL opener that uses both of the custom URL handlers: `SmartRedirectHandler` for handling `301` and `302` redirects, and `DefaultErrorHandler` for handling `304`, `404`, and other error conditions gracefully.
	That's it! Open the URL and return a file-like object to the caller.
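
The request-building half of those steps can be sketched on its own, without opening anything. This is not the `openanything.py` code itself: the function name `build_request`, the default `User-Agent` string, and returning `None` for non-HTTP URLs are all assumptions for illustration. It uses Python 3's `urllib.parse` and `urllib.request`.

```python
import urllib.parse
import urllib.request

# Illustrative default; the real module defines its own.
USER_AGENT = "OpenAnything/1.0 +http://diveintopython.org/"


def build_request(url, etag=None, lastmodified=None, agent=USER_AGENT):
    """Build a polite HTTP request, or return None for non-HTTP URLs."""
    # Only the scheme matters here.
    if urllib.parse.urlparse(url).scheme != "http":
        return None
    request = urllib.request.Request(url)
    request.add_header("User-Agent", agent)
    if etag:
        request.add_header("If-None-Match", etag)
    if lastmodified:
        request.add_header("If-Modified-Since", lastmodified)
    # Ask for compressed data if the server can provide it.
    request.add_header("Accept-encoding", "gzip")
    return request


request = build_request("http://diveintomark.org/xml/atom.xml",
                        etag='"e8284-68e0-4de30f80"')
sent_headers = dict(request.header_items())
```

The remaining step would be to build an opener with your custom handlers and call its `open` method on the returned request.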

	First, you call the `openAnything` function with a URL, `ETag` hash, `Last-Modified` date, and `User-Agent`.
	Read the actual data returned from the server. This may be compressed; if so, you'll decompress it later.
	Save the `ETag` hash returned from the server, so the calling application can pass it back to you next time, and you can pass it on to `openAnything`, which can stick it in the `If-None-Match` header and send it to the remote server.
	Save the `Last-Modified` date too.
	If the server says that it sent compressed data, decompress it.
	If you got a URL back from the server, save it, and assume that the status code is `200` until you find out otherwise.
	If one of the custom URL handlers captured a status code, then save that too.

	The very first time you fetch a resource, you don't have an `ETag` hash or `Last-Modified` date, so you'll leave those out. (They're optional parameters.)
	What you get back is a dictionary of several useful headers, the HTTP status code, and the actual data returned from the server. `openanything` handles the gzip compression internally; you don't care about that at this level.
	If you ever get a `301` status code, that's a permanent redirect, and you need to update your URL to the new address.
	The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) URL, the `ETag` from the last time, the `Last-Modified` date from the last time, and of course your `User-Agent`.
	What you get back is again a dictionary, but the data hasn't changed, so all you got was a `304` status code and no data.
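
What the calling application does with that dictionary can be sketched as a small cache update. The helper name `update_cache` and the dictionary keys (`url`, `status`, `data`, `etag`, `lastmodified`) are assumptions for this sketch. The rules it follows are the ones stated above: update the address on a `301`, and keep your cached copy on a `304`.

```python
def update_cache(cache, result):
    """Fold one fetch result into a locally cached copy of a resource."""
    status = result.get("status")
    # Permanent redirect: ask at the new address from now on.
    if status == 301 and result.get("url"):
        cache["url"] = result["url"]
    # Not modified: no data was sent, so keep what you already have.
    if status == 304:
        return cache
    cache["data"] = result.get("data")
    cache["etag"] = result.get("etag")
    cache["lastmodified"] = result.get("lastmodified")
    return cache


cache = {"url": "http://diveintomark.org/redir/example301.xml"}
# First fetch: redirected permanently, fresh data arrives.
update_cache(cache, {"status": 301,
                     "url": "http://diveintomark.org/xml/atom.xml",
                     "data": "<feed/>",
                     "etag": '"e8284-68e0-4de30f80"',
                     "lastmodified": None})
# Second fetch: nothing has changed, so nothing is overwritten.
update_cache(cache, {"status": 304, "data": ""})
```

The cached `url`, `etag`, and `lastmodified` values are exactly what you would pass back in on the next fetch.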

	You access the remote SOAP server through a proxy class, `SOAPProxy`. The proxy handles all the internals of SOAP for you, including creating the XML request document out of the function name and argument list, sending the request over HTTP to the remote SOAP server, parsing the XML response document, and creating native Python values to return. You'll see what these XML documents look like in the next section.
	Every SOAP service has a URL which handles all the requests. The same URL is used for all function calls. This particular service only has a single function, but later in this chapter you'll see examples of the Google API, which has several functions. The service URL is shared by all functions.
	Each SOAP service also has a namespace, which is defined by the server and is completely arbitrary. It's simply part of the configuration required to call SOAP methods. It allows the server to share a single service URL and route requests between several unrelated services. It's like dividing Python modules into packages.
	You're creating the `SOAPProxy` with the service URL and the service namespace. This doesn't make any connection to the SOAP server; it simply creates a local Python object.
	Now with everything configured properly, you can actually call remote SOAP methods as if they were local functions. You pass arguments just like a normal function, and you get a return value just like a normal function. But under the covers, there's a heck of a lot going on.

	First, create the `SOAPProxy` like normal, with the service URL and the namespace.
	Second, turn on debugging by setting `server.config.dumpSOAPIn` and `server.config.dumpSOAPOut`.
	Third, call the remote SOAP method as usual. The SOAP library will print out both the outgoing XML request document, and the incoming XML response document. This is all the hard work that `SOAPProxy` is doing for you. Intimidating, isn't it? Let's break it down.

	The element name is the function name, `getTemp`. `SOAPProxy` uses `getattr` as a dispatcher. Instead of calling separate local methods based on the method name, it actually uses the method name to construct the XML request document.
	The function's XML element is contained in a specific namespace, which is the namespace you specified when you created the `SOAPProxy` object. Don't worry about the `SOAP-ENC:root`; that's boilerplate too.
	The arguments of the function also got translated into XML. `SOAPProxy` introspects each argument to determine its datatype (in this case it's a string). The argument datatype goes into the `xsi:type` attribute, followed by the actual string value.
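
To make the dispatcher idea concrete, here is a toy proxy that turns any method call into an XML fragment. It is emphatically not how `SOAPProxy` is implemented and it does not produce a complete SOAP envelope. The class name `ToyProxy`, the element names `v1`, `v2`, and so on, the type table, and the example URL and namespace are all assumptions for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Toy mapping from Python types to XML Schema type names.
XSD_TYPES = {str: "xsd:string", int: "xsd:int", float: "xsd:float"}


class ToyProxy:
    """Turn attribute access into XML request fragments (illustration only)."""

    def __init__(self, url, namespace):
        self.url = url
        self.namespace = namespace

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so any unknown
        # method name ends up here and becomes the XML element name.
        if name.startswith("__"):
            raise AttributeError(name)

        def call(*args):
            return self._build_request(name, args)
        return call

    def _build_request(self, method, args):
        # Introspect each argument's type to fill in xsi:type.
        params = "".join(
            '<v%d xsi:type="%s">%s</v%d>'
            % (i, XSD_TYPES.get(type(arg), "xsd:string"), escape(str(arg)), i)
            for i, arg in enumerate(args, 1))
        return "<ns1:%s xmlns:ns1=%s>%s</ns1:%s>" % (
            method, quoteattr(self.namespace), params, method)


server = ToyProxy("http://example.org/soap", "urn:example-Temperature")
request_xml = server.getTemp("27502")
```

Nothing named `getTemp` is defined anywhere in the class; the method name exists only as the string that `__getattr__` receives.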

	The server wraps the function return value within a `<getTempResponse>` element. By convention, this wrapper element is the name of the function, plus `Response`. But it could really be almost anything; the important thing that `SOAPProxy` notices is not the element name, but the namespace.
	The server returns the response in the same namespace we used in the request, the same namespace we specified when we first created the `SOAPProxy`. Later in this chapter we'll see what happens if you forget to specify the namespace when creating the `SOAPProxy`.
	The return value is specified, along with its datatype (it's a float). `SOAPProxy` uses this explicit datatype to create a Python object of the correct native datatype and return it.