<!DOCTYPE html>
<html lang="en">
<head>
<title>Dive Into Python</title>
<link rel="stylesheet" href="diveintopython3.css" type="text/css">
</head>
<body>
<h1>Dive Into Python</h1>
<p class="pubdate">20 May 2004
<p class="copyright">Copyright &copy; 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython3.org">Mark Pilgrim</a>
<p>This book lives at <a href="http://diveintopython3.org/">http://diveintopython3.org/</a>. If you're reading it somewhere else, you may not have the latest version.
<div class="toc">
<p><b>Table of Contents</b>
<ul>
<li><a href="#install">1. Installing Python</a><ul>
<li><a href="#install.choosing">1.1. Which Python is right for you?</a>
<li><a href="#install.windows">1.2. Python on Windows</a>
<li><a href="#install.macosx">1.3. Python on Mac OS X</a>
<li><a href="#install.macos9">1.4. Python on Mac OS 9</a>
<li><a href="#install.redhat">1.5. Python on RedHat Linux</a>
<li><a href="#install.debian">1.6. Python on Debian GNU/Linux</a>
<li><a href="#install.source">1.7. Python Installation from Source</a>
<li><a href="#install.shell">1.8. The Interactive Shell</a>
<li><a href="#install.summary">1.9. Summary</a>
</ul>
<li><a href="#odbchelper">2. Your First Python Program</a><ul>
<li><a href="#odbchelper.divein">2.1. Diving in</a>
<li><a href="#odbchelper.funcdef">2.2. Declaring Functions</a><ul>
<li><a href="#d0e4188">2.2.1. How Python's Datatypes Compare to Other Programming Languages</a>
</ul>
<li><a href="#odbchelper.docstring">2.3. Documenting Functions</a>
<li><a href="#odbchelper.objects">2.4. Everything Is an Object</a><ul>
<li><a href="#d0e4550">2.4.1. The Import Search Path</a>
<li><a href="#d0e4665">2.4.2. What's an Object?</a>
</ul>
<li><a href="#odbchelper.indenting">2.5. Indenting Code</a>
<li><a href="#odbchelper.testing">2.6. Testing Modules</a>
</ul>
<li><a href="#datatypes">3. Native Datatypes</a><ul>
<li><a href="#odbchelper.dict">3.1. Introducing Dictionaries</a><ul>
<li><a href="#d0e5174">3.1.1. Defining Dictionaries</a>
<li><a href="#d0e5269">3.1.2. Modifying Dictionaries</a>
<li><a href="#d0e5450">3.1.3. Deleting Items From Dictionaries</a>
</ul>
<li><a href="#odbchelper.list">3.2. Introducing Lists</a><ul>
<li><a href="#d0e5623">3.2.1. Defining Lists</a>
<li><a href="#d0e5887">3.2.2. Adding Elements to Lists</a>
<li><a href="#d0e6115">3.2.3. Searching Lists</a>
<li><a href="#d0e6277">3.2.4. Deleting List Elements</a>
<li><a href="#d0e6392">3.2.5. Using List Operators</a>
</ul>
<li><a href="#odbchelper.tuple">3.3. Introducing Tuples</a>
<li><a href="#odbchelper.vardef">3.4. Declaring variables</a><ul>
<li><a href="#d0e6873">3.4.1. Referencing Variables</a>
<li><a href="#odbchelper.multiassign">3.4.2. Assigning Multiple Values at Once</a>
</ul>
<li><a href="#odbchelper.stringformatting">3.5. Formatting Strings</a>
<li><a href="#odbchelper.map">3.6. Mapping Lists</a>
<li><a href="#odbchelper.join">3.7. Joining Lists and Splitting Strings</a><ul>
<li><a href="#d0e7982">3.7.1. Historical Note on String Methods</a>
</ul>
<li><a href="#odbchelper.summary">3.8. Summary</a>
</ul>
<li><a href="#apihelper">4. The Power Of Introspection</a><ul>
<li><a href="#apihelper.divein">4.1. Diving In</a>
<li><a href="#apihelper.optional">4.2. Using Optional and Named Arguments</a>
<li><a href="#apihelper.builtin">4.3. Using type, str, dir, and Other Built-In Functions</a><ul>
<li><a href="#d0e8510">4.3.1. The type Function</a>
<li><a href="#d0e8609">4.3.2. The str Function</a>
<li><a href="#d0e8958">4.3.3. Built-In Functions</a>
</ul>
<li><a href="#apihelper.getattr">4.4. Getting Object References With getattr</a><ul>
<li><a href="#d0e9194">4.4.1. getattr with Modules</a>
<li><a href="#d0e9362">4.4.2. getattr As a Dispatcher</a>
</ul>
<li><a href="#apihelper.filter">4.5. Filtering Lists</a>
<li><a href="#apihelper.andor">4.6. The Peculiar Nature of and and or</a><ul>
<li><a href="#d0e9975">4.6.1. Using the and-or Trick</a>
</ul>
<li><a href="#apihelper.lambda">4.7. Using lambda Functions</a><ul>
<li><a href="#d0e10403">4.7.1. Real-World lambda Functions</a>
</ul>
<li><a href="#apihelper.alltogether">4.8. Putting It All Together</a>
<li><a href="#apihelper.summary">4.9. Summary</a>
</ul>
<li><a href="#fileinfo">5. Objects and Object-Orientation</a><ul>
<li><a href="#fileinfo.divein">5.1. Diving In</a>
<li><a href="#fileinfo.fromimport">5.2. Importing Modules Using from module import</a>
<li><a href="#fileinfo.class">5.3. Defining Classes</a><ul>
<li><a href="#d0e11720">5.3.1. Initializing and Coding Classes</a>
<li><a href="#d0e11896">5.3.2. Knowing When to Use self and __init__</a>
</ul>
<li><a href="#fileinfo.create">5.4. Instantiating Classes</a><ul>
<li><a href="#d0e12165">5.4.1. Garbage Collection</a>
</ul>
<li><a href="#fileinfo.userdict">5.5. Exploring UserDict: A Wrapper Class</a>
<li><a href="#fileinfo.specialmethods">5.6. Special Class Methods</a><ul>
<li><a href="#d0e12822">5.6.1. Getting and Setting Items</a>
</ul>
<li><a href="#fileinfo.morespecial">5.7. Advanced Special Class Methods</a>
<li><a href="#fileinfo.classattributes">5.8. Introducing Class Attributes</a>
<li><a href="#fileinfo.private">5.9. Private Functions</a>
<li><a href="#fileinfo.summary">5.10. Summary</a>
</ul>
<li><a href="#filehandling">6. Exceptions and File Handling</a><ul>
<li><a href="#fileinfo.exception">6.1. Handling Exceptions</a><ul>
<li><a href="#d0e14344">6.1.1. Using Exceptions For Other Purposes</a>
</ul>
<li><a href="#fileinfo.files">6.2. Working with File Objects</a><ul>
<li><a href="#d0e14670">6.2.1. Reading Files</a>
<li><a href="#d0e14800">6.2.2. Closing Files</a>
<li><a href="#d0e14928">6.2.3. Handling I/O Errors</a>
<li><a href="#d0e15055">6.2.4. Writing to Files</a>
</ul>
<li><a href="#fileinfo.for">6.3. Iterating with for Loops</a>
<li><a href="#fileinfo.modules">6.4. Using sys.modules</a>
<li><a href="#fileinfo.os">6.5. Working with Directories</a>
<li><a href="#fileinfo.alltogether">6.6. Putting It All Together</a>
<li><a href="#fileinfo.summary2">6.7. Summary</a>
</ul>
<li><a href="#re">7. Regular Expressions</a><ul>
<li><a href="#re.intro">7.1. Diving In</a>
<li><a href="#re.matching">7.2. Case Study: Street Addresses</a>
<li><a href="#re.roman">7.3. Case Study: Roman Numerals</a><ul>
<li><a href="#d0e17592">7.3.1. Checking for Thousands</a>
<li><a href="#d0e17785">7.3.2. Checking for Hundreds</a>
</ul>
<li><a href="#re.nm">7.4. Using the {n,m} Syntax</a><ul>
<li><a href="#d0e18326">7.4.1. Checking for Tens and Ones</a>
</ul>
<li><a href="#re.verbose">7.5. Verbose Regular Expressions</a>
<li><a href="#re.phone">7.6. Case study: Parsing Phone Numbers</a>
<li><a href="#re.summary">7.7. Summary</a>
</ul>
<li><a href="#dialect">8. HTML Processing</a><ul>
<li><a href="#dialect.divein">8.1. Diving in</a>
<li><a href="#dialect.sgmllib">8.2. Introducing sgmllib.py</a>
<li><a href="#dialect.extract">8.3. Extracting data from HTML documents</a>
<li><a href="#dialect.basehtml">8.4. Introducing BaseHTMLProcessor.py</a>
<li><a href="#dialect.locals">8.5. locals and globals</a>
<li><a href="#dialect.dictsub">8.6. Dictionary-based string formatting</a>
<li><a href="#dialect.quoting">8.7. Quoting attribute values</a>
<li><a href="#dialect.dialectizer">8.8. Introducing dialect.py</a>
<li><a href="#dialect.alltogether">8.9. Putting it all together</a>
<li><a href="#dialect.summary">8.10. Summary</a>
</ul>
<li><a href="#kgp">9. XML Processing</a><ul>
<li><a href="#kgp.divein">9.1. Diving in</a>
<li><a href="#kgp.packages">9.2. Packages</a>
<li><a href="#kgp.parse">9.3. Parsing XML</a>
<li><a href="#kgp.unicode">9.4. Unicode</a>
<li><a href="#kgp.search">9.5. Searching for elements</a>
<li><a href="#kgp.attributes">9.6. Accessing element attributes</a>
<li><a href="#kgp.segue">9.7. Segue</a>
</ul>
<li><a href="#streams">10. Scripts and Streams</a><ul>
<li><a href="#kgp.openanything">10.1. Abstracting input sources</a>
<li><a href="#kgp.stdio">10.2. Standard input, output, and error</a>
<li><a href="#kgp.cache">10.3. Caching node lookups</a>
<li><a href="#kgp.child">10.4. Finding direct children of a node</a>
<li><a href="#kgp.handler">10.5. Creating separate handlers by node type</a>
<li><a href="#kgp.commandline">10.6. Handling command-line arguments</a>
<li><a href="#kgp.alltogether">10.7. Putting it all together</a>
<li><a href="#kgp.summary">10.8. Summary</a>
</ul>
<li><a href="#oa">11. HTTP Web Services</a><ul>
<li><a href="#oa.divein">11.1. Diving in</a>
<li><a href="#oa.review">11.2. How not to fetch data over HTTP</a>
<li><a href="#oa.features">11.3. Features of HTTP</a><ul>
<li><a href="#d0e27596">11.3.1. User-Agent</a>
<li><a href="#d0e27616">11.3.2. Redirects</a>
<li><a href="#d0e27689">11.3.3. Last-Modified/If-Modified-Since</a>
<li><a href="#d0e27724">11.3.4. ETag/If-None-Match</a>
<li><a href="#d0e27752">11.3.5. Compression</a>
</ul>
<li><a href="#oa.debug">11.4. Debugging HTTP web services</a>
<li><a href="#oa.useragent">11.5. Setting the User-Agent</a>
<li><a href="#oa.etags">11.6. Handling Last-Modified and ETag</a>
<li><a href="#oa.redirect">11.7. Handling redirects</a>
<li><a href="#oa.gzip">11.8. Handling compressed data</a>
<li><a href="#oa.alltogether">11.9. Putting it all together</a>
<li><a href="#oa.summary">11.10. Summary</a>
</ul>
<li><a href="#soap">12. SOAP Web Services</a><ul>
<li><a href="#soap.divein">12.1. Diving In</a>
<li><a href="#soap.install">12.2. Installing the SOAP Libraries</a><ul>
<li><a href="#d0e29967">12.2.1. Installing PyXML</a>
<li><a href="#d0e30070">12.2.2. Installing fpconst</a>
<li><a href="#d0e30171">12.2.3. Installing SOAPpy</a>
</ul>
<li><a href="#soap.firststeps">12.3. First Steps with SOAP</a>
<li><a href="#soap.debug">12.4. Debugging SOAP Web Services</a>
<li><a href="#soap.wsdl">12.5. Introducing WSDL</a>
<li><a href="#soap.introspection">12.6. Introspecting SOAP Web Services with WSDL</a>
<li><a href="#soap.google">12.7. Searching Google</a>
<li><a href="#soap.troubleshooting">12.8. Troubleshooting SOAP Web Services</a>
<li><a href="#soap.summary">12.9. Summary</a>
</ul>
<li><a href="#roman">13. Unit Testing</a><ul>
<li><a href="#roman.intro">13.1. Introduction to Roman numerals</a>
<li><a href="#roman.divein">13.2. Diving in</a>
<li><a href="#roman.romantest">13.3. Introducing romantest.py</a>
<li><a href="#roman.success">13.4. Testing for success</a>
<li><a href="#roman.failure">13.5. Testing for failure</a>
<li><a href="#roman.sanity">13.6. Testing for sanity</a>
</ul>
<li><a href="#roman1.5">14. Test-First Programming</a><ul>
<li><a href="#roman.stage1">14.1. roman.py, stage 1</a>
<li><a href="#roman.stage2">14.2. roman.py, stage 2</a>
<li><a href="#roman.stage3">14.3. roman.py, stage 3</a>
<li><a href="#roman.stage4">14.4. roman.py, stage 4</a>
<li><a href="#roman.stage5">14.5. roman.py, stage 5</a>
</ul>
<li><a href="#roman2">15. Refactoring</a><ul>
<li><a href="#roman.bugs">15.1. Handling bugs</a>
<li><a href="#roman.change">15.2. Handling changing requirements</a>
<li><a href="#roman.refactoring">15.3. Refactoring</a>
<li><a href="#roman.postscript">15.4. Postscript</a>
<li><a href="#roman.summary">15.5. Summary</a>
</ul>
<li><a href="#regression">16. Functional Programming</a><ul>
<li><a href="#regression.divein">16.1. Diving in</a>
<li><a href="#regression.path">16.2. Finding the path</a>
<li><a href="#regression.filter">16.3. Filtering lists revisited</a>
<li><a href="#regression.map">16.4. Mapping lists revisited</a>
<li><a href="#regression.datacentric">16.5. Data-centric programming</a>
<li><a href="#regression.import">16.6. Dynamically importing modules</a>
<li><a href="#regression.alltogether">16.7. Putting it all together</a>
<li><a href="#regression.summary">16.8. Summary</a>
</ul>
<li><a href="#plural">17. Dynamic functions</a><ul>
<li><a href="#plural.divein">17.1. Diving in</a>
<li><a href="#plural.stage1">17.2. plural.py, stage 1</a>
<li><a href="#plural.stage2">17.3. plural.py, stage 2</a>
<li><a href="#plural.stage3">17.4. plural.py, stage 3</a>
<li><a href="#plural.stage4">17.5. plural.py, stage 4</a>
<li><a href="#plural.stage5">17.6. plural.py, stage 5</a>
<li><a href="#plural.stage6">17.7. plural.py, stage 6</a>
<li><a href="#plural.summary">17.8. Summary</a>
</ul>
<li><a href="#soundex">18. Performance Tuning</a><ul>
<li><a href="#soundex.divein">18.1. Diving in</a>
<li><a href="#soundex.timeit">18.2. Using the timeit Module</a>
<li><a href="#soundex.stage1">18.3. Optimizing Regular Expressions</a>
<li><a href="#soundex.stage2">18.4. Optimizing Dictionary Lookups</a>
<li><a href="#soundex.stage3">18.5. Optimizing List Operations</a>
<li><a href="#soundex.stage4">18.6. Optimizing String Manipulation</a>
<li><a href="#soundex.summary">18.7. Summary</a>
</ul>
</ul>
</ul>
<div class="chapter">
<h2 id="install">Chapter 1. Installing Python</h2>
<p>Welcome to Python. Let's dive in. In this chapter, you'll install the version of Python that's right for you.
<h2 id="install.choosing">1.1. Which Python is right for you?</h2>
<p>The first thing you need to do with Python is install it. Or do you?
<p>If you're using an account on a hosted server, your ISP may have already installed Python. Most popular Linux distributions come with Python in the default installation. Mac OS X 10.2 and later includes a command-line version of Python, although you'll probably want to install a version that includes a more Mac-like graphical interface.
<p>Windows does not come with any version of Python, but don't despair! There are several ways to point-and-click your way to Python on Windows.
<p>As you can see already, Python runs on a great many operating systems. The full list includes Windows, Mac OS, Mac OS X, and all varieties of free <acronym>UNIX</acronym>-compatible systems like Linux. There are also versions that run on Sun Solaris, AS/400, Amiga, OS/2, BeOS, and a plethora
of other platforms you've probably never even heard of.
<p>What's more, Python programs written on one platform can, with a little care, run on <em>any</em> supported platform. For instance, I regularly develop Python programs on Windows and later deploy them on Linux.
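<p>Here is a minimal sketch of what &#8220;a little care&#8221; can look like in practice. It is an illustration, not code from this book's sample scripts: the function name <code>data_file_path</code>, the <code>data</code> directory, and the file name are all made up for the example. The idea is to let the standard <code>os.path</code> module build paths instead of hard-coding a path separator.

```python
import os
import sys


def data_file_path(base_dir, filename):
    """Build a path using the separator of whatever platform runs this code."""
    # os.path.join inserts the correct separator for the current platform,
    # so the same source file works unchanged on Windows and on Linux.
    # The "data" directory name is an assumption made for this example.
    return os.path.join(base_dir, "data", filename)


if __name__ == "__main__":
    # sys.platform is a short string identifying the current platform.
    print("Running on platform: %s" % sys.platform)
    print(data_file_path(os.getcwd(), "example.txt"))
```

<p>The printed path will differ from one platform to the next, which is exactly the point: the script itself does not need to change.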
<p>So back to the question that started this section, &#8220;Which Python is right for you?&#8221; The answer is whichever one runs on the computer you already have.
<h2 id="install.windows">1.2. Python on Windows</h2>
<p>On Windows, you have a couple of choices for installing Python.
<p>ActiveState makes a Windows installer for Python called ActivePython, which includes a complete version of Python, an <acronym>IDE</acronym> with a Python-aware code editor, plus some Windows extensions for Python that allow complete access to Windows-specific services, <acronym>API</acronym>s, and the Windows Registry.
<p>ActivePython is freely downloadable, although it is not open source. It is the <acronym>IDE</acronym> I used to learn Python, and I recommend you try it unless you have a specific reason not to. One such reason might be that ActiveState is generally
several months behind in updating their ActivePython installer when new versions of Python are released. If you absolutely need the latest version of Python and ActivePython is still a version behind as you read this, you'll want to use the second option for installing Python on Windows.
<p>The second option is the &#8220;official&#8221; Python installer, distributed by the people who develop Python itself. It is freely downloadable and open source, and it is always current with the latest version of Python.
<div class="procedure">
<h3>Procedure 1.1. Option 1: Installing ActivePython</h3>
<p>Here is the procedure for installing ActivePython:
<ol>
<li>
<p>Download ActivePython from <a href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>.
<li>
<p>If you are using Windows 95, Windows 98, or Windows ME, you will also need to download and install <a href="http://download.microsoft.com/download/WindowsInstaller/Install/2.0/W9XMe/EN-US/InstMsiA.exe">Windows Installer 2.0</a> before installing ActivePython.
<li>
<p>Double-click the installer, <code class="filename">ActivePython-2.2.2-224-win32-ix86.msi</code>.
<li>
<p>Step through the installer program.
<li>
<p>If space is tight, you can do a custom installation and deselect the documentation, but I don't recommend this unless you
absolutely can't spare the 14MB.
<li>
<p>After the installation is complete, close the installer and choose Start->Programs->ActiveState ActivePython 2.2->PythonWin IDE. You'll see something like the following:
</ol>
<div class="informalexample"><pre class="screen">
<samp class="computeroutput">PythonWin 2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)] on win32.
Portions Copyright 1994-2001 Mark Hammond (mhammond@skippinet.com.au) -
see 'Help/About PythonWin' for further copyright information.</samp>
<samp class="prompt">>>> </samp>
</pre><div class="procedure">
<h3>Procedure 1.2. Option 2: Installing Python from <a href="http://www.python.org/" title="Python language home page">Python.org</a></h3>
<ol>
<li>
<p>Download the latest Python Windows installer by going to <a href="http://www.python.org/ftp/python/">http://www.python.org/ftp/python/</a> and selecting the highest version number listed, then downloading the <code>.exe</code> installer.
<li>
<p>Double-click the installer, <code class="filename">Python-2.xxx.yyy.exe</code>. The name will depend on the version of Python available when you read this.
<li>
<p>Step through the installer program.
<li>
<p>If disk space is tight, you can deselect the HTMLHelp file, the utility scripts (<code class="filename">Tools/</code>), and/or the test suite (<code class="filename">Lib/test/</code>).
<li>
<p>If you do not have administrative rights on your machine, you can select Advanced Options, then choose Non-Admin Install. This just affects where Registry entries and Start menu shortcuts are created.
<li>
<p>After the installation is complete, close the installer and select Start->Programs->Python 2.3->IDLE (Python GUI). You'll see something like the following:
</ol>
<div class="informalexample"><pre class="screen">
<samp class="computeroutput">Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
****************************************************************
Personal firewall software may warn about the connection IDLE
makes to its subprocess using this computer's internal loopback
interface. This connection is not visible on any external
interface and no data is sent to or received from the Internet.
****************************************************************
IDLE 1.0</samp>
<samp class="prompt">>>> </samp>
</pre><h2 id="install.macosx">1.3. Python on Mac OS X</h2>
<p>On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it.
<p>Mac OS X 10.2 and later comes with a command-line version of Python preinstalled. If you are comfortable with the command line, you can use this version for the first third of the book. However,
the preinstalled version does not come with an <acronym>XML</acronym> parser, so when you get to the <acronym>XML</acronym> chapter, you'll need to install the full version.
<p>Rather than using the preinstalled version, you'll probably want to install the latest version, which also comes with a graphical
interactive shell.
<div class="procedure">
<h3>Procedure 1.3. Running the Preinstalled Version of Python on Mac OS X</h3>
<p>To use the preinstalled version of Python, follow these steps:
<ol>
<li>
<p>Open the <code class="filename">/Applications</code> folder.
<li>
<p>Open the <code class="filename">Utilities</code> folder.
<li>
<p>Double-click <code class="filename">Terminal</code> to open a terminal window and get to a command line.
<li>
<p>Type <kbd>python</kbd> at the command prompt.
</ol>
<p>Try it out:
<div class="informalexample"><pre class="screen">
Welcome to Darwin!
<samp class="prompt">[localhost:~] you% </samp>python
<samp class="computeroutput">Python 2.2 (#1, 07/14/02, 23:25:09)
[GCC Apple cpp-precomp 6.14] on darwin
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class="prompt">>>> </samp>[press Ctrl+D to get back to the command prompt]
<samp class="prompt">[localhost:~] you% </samp>
</pre><div class="procedure">
<h3>Procedure 1.4. Installing the Latest Version of Python on Mac OS X</h3>
<p>Follow these steps to download and install the latest version of Python:
<ol>
<li>
<p>Download the <code class="filename">MacPython-OSX</code> disk image from <a href="http://homepages.cwi.nl/~jack/macpython/download.html">http://homepages.cwi.nl/~jack/macpython/download.html</a>.
<li>
<p>If your browser has not already done so, double-click <code class="filename">MacPython-OSX-2.3-1.dmg</code> to mount the disk image on your desktop.
<li>
<p>Double-click the installer, <code class="filename">MacPython-OSX.pkg</code>.
<li>
<p>The installer will prompt you for your administrative username and password.
<li>
<p>Step through the installer program.
<li>
<p>After installation is complete, close the installer and open the <code class="filename">/Applications</code> folder.
<li>
<p>Open the <code class="filename">MacPython-2.3</code> folder.
<li>
<p>Double-click <code class="filename">PythonIDE</code> to launch Python.
</ol>
<p>The MacPython <acronym>IDE</acronym> should display a splash screen, then take you to the interactive shell. If the interactive shell does not appear, select
Window->Python Interactive (<kbd class="shortcut">Cmd-0</kbd>). The opening window will look something like this:
<div class="informalexample"><pre class="screen">
<samp class="computeroutput">Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1</samp>
<samp class="prompt">>>> </samp>
</pre><p>Note that once you install the latest version, the pre-installed version is still present. If you are running scripts from
the command line, you need to be aware of which version of Python you are using.
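<p>If you are ever unsure which interpreter is running, you can ask Python itself. The following is a small sketch built on the standard <code>sys</code> module. The function name <code>describe_interpreter</code> is invented for this illustration.

```python
import sys


def describe_interpreter():
    """Return (path, version) for the interpreter running this code."""
    # sys.executable is the full path of the running Python binary.
    # sys.version_info holds the version as a sequence of numbers;
    # the first two are the major and minor version.
    major, minor = sys.version_info[0], sys.version_info[1]
    return sys.executable, "%d.%d" % (major, minor)


if __name__ == "__main__":
    path, version = describe_interpreter()
    print("Python %s at %s" % (version, path))
```

<p>Running this script with each of the two commands shown in the next example should report two different paths and two different version numbers.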
<div class="example"><h3>Example 1.1. Two versions of Python</h3><pre class="screen">
<samp class="prompt">[localhost:~] you% </samp>python
<samp class="computeroutput">Python 2.2 (#1, 07/14/02, 23:25:09)
[GCC Apple cpp-precomp 6.14] on darwin
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class="prompt">>>> </samp>[press Ctrl+D to get back to the command prompt]
<samp class="prompt">[localhost:~] you% </samp>/usr/local/bin/python
<samp class="computeroutput">Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)] on darwin
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class="prompt">>>> </samp>[press Ctrl+D to get back to the command prompt]
<samp class="prompt">[localhost:~] you% </samp>
</pre><h2 id="install.macos9">1.4. Python on Mac OS 9</h2>
<p>Mac OS 9 does not come with any version of Python, but installation is very simple, and there is only one choice.
<div class="procedure">
<p>Follow these steps to install Python on Mac OS 9:
<ol>
<li>
<p>Download the <code class="filename">MacPython23full.bin</code> file from <a href="http://homepages.cwi.nl/~jack/macpython/download.html">http://homepages.cwi.nl/~jack/macpython/download.html</a>.
<li>
<p>If your browser does not decompress the file automatically, double-click <code class="filename">MacPython23full.bin</code> to decompress the file with Stuffit Expander.
<li>
<p>Double-click the installer, <code class="filename">MacPython23full</code>.
<li>
<p>Step through the installer program.
<li>
<p>After installation is complete, close the installer and open the <code class="filename">/Applications</code> folder.
<li>
<p>Open the <code class="filename">MacPython-OS9 2.3</code> folder.
<li>
<p>Double-click <code class="filename">Python IDE</code> to launch Python.
</ol>
<p>The MacPython <acronym>IDE</acronym> should display a splash screen, and then take you to the interactive shell. If the interactive shell does not appear, select
Window->Python Interactive (<kbd class="shortcut">Cmd-0</kbd>). You'll see a screen like this:
<div class="informalexample"><pre class="screen">
<samp class="computeroutput">Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1</samp>
<samp class="prompt">>>> </samp>
</pre><h2 id="install.redhat">1.5. Python on RedHat Linux</h2>
<p>Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built
binary packages are available for most popular Linux distributions. Or you can always compile from source.
<p>Download the latest Python <acronym>RPM</acronym> by going to <a href="http://www.python.org/ftp/python/">http://www.python.org/ftp/python/</a> and selecting the highest version number listed, then selecting the <code class="filename">rpms/</code> directory within that. Then download the <acronym>RPM</acronym> with the highest version number. You can install it with the <kbd>rpm</kbd> command, as shown here:
<div class="example"><h3>Example 1.2. Installing on RedHat Linux 9</h3><pre class="screen">
<samp class="prompt">localhost:~$ </samp>su -
<samp class="prompt">Password: </samp>[enter your root password]
<samp class="prompt">[root@localhost root]# </samp>wget http://python.org/ftp/python/2.3/rpms/redhat-9/python2.3-2.3-5pydotorg.i386.rpm
<samp class="computeroutput">Resolving python.org... done.
Connecting to python.org[194.109.137.226]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,495,111 [application/octet-stream]
...</samp>
<samp class="prompt">[root@localhost root]# </samp>rpm -Uvh python2.3-2.3-5pydotorg.i386.rpm
<samp class="computeroutput">Preparing... ########################################### [100%]
1:python2.3 ########################################### [100%]</samp>
<samp class="prompt">[root@localhost root]# </samp>python <img id="install.unix.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class="prompt">>>> </samp>[press Ctrl+D to exit]
<samp class="prompt">[root@localhost root]# </samp>python2.3 <img id="install.unix.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">Python 2.3 (#1, Sep 12 2003, 10:53:56)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class="prompt">>>> </samp>[press Ctrl+D to exit]
<samp class="prompt">[root@localhost root]# </samp>which python2.3 <img id="install.unix.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
/usr/bin/python2.3
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#install.unix.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Whoops! Just typing <kbd>python</kbd> gives you the older version of Python -- the one that was installed by default. That's not the one you want.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#install.unix.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">At the time of this writing, the newest version is called <kbd>python2.3</kbd>. You'll probably want to change the path on the first line of the sample scripts to point to the newer version.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#install.unix.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the complete path of the newer version of Python that you just installed. Use this on the <code>#!</code> line (the first line of each script) to ensure that scripts are running under the latest version of Python, and be sure to type <kbd>python2.3</kbd> to get into the interactive shell.
</td>
</tr>
</table>
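<p>To make the advice about the <code>#!</code> line concrete, here is a hedged sketch of a script that names the newer interpreter on its first line and also checks the version at run time. The path comes from the <kbd>which python2.3</kbd> output above. The minimum version of 2.3 and the names <code>MINIMUM_VERSION</code> and <code>is_new_enough</code> are assumptions made for this illustration, not requirements stated anywhere in this book.

```python
#!/usr/bin/python2.3
import sys

# Assumption for this example: the script needs Python 2.3 or later.
MINIMUM_VERSION = (2, 3)


def is_new_enough(version_info, minimum=MINIMUM_VERSION):
    """Return True if version_info is at least the minimum version."""
    # Tuples compare element by element, so (2, 2) < (2, 3) < (3, 0).
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not is_new_enough(sys.version_info):
        # Passing a string to sys.exit prints it and exits with an error.
        sys.exit("This script needs Python %d.%d or later" % MINIMUM_VERSION)
    print("Python version is new enough")
```

<p>The <code>#!</code> line only matters when you run the script directly from the command line. The run-time check is a safety net for the case where someone types <kbd>python script.py</kbd> and gets the older interpreter instead.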
<h2 id="install.debian">1.6. Python on Debian <acronym>GNU</acronym>/Linux</h2>
<p>If you are lucky enough to be running Debian <acronym>GNU</acronym>/Linux, you can install Python with the <kbd>apt-get</kbd> command.
<div class="example"><h3>Example 1.3. Installing on Debian <acronym>GNU</acronym>/Linux</h3><pre class="screen">
<samp class="prompt">localhost:~$ </samp>su -
<samp class="prompt">Password: </samp>[enter your root password]
<samp class="prompt">localhost:~# </samp>apt-get install python
<samp class="computeroutput">Reading Package Lists... Done
Building Dependency Tree... Done
The following extra packages will be installed:
python2.3
Suggested packages:
python-tk python2.3-doc
The following NEW packages will be installed:
python python2.3
0 upgraded, 2 newly installed, 0 to remove and 3 not upgraded.
Need to get 0B/2880kB of archives.
After unpacking 9351kB of additional disk space will be used.</samp>
<samp class="prompt">Do you want to continue? [Y/n] </samp>Y
<samp class="computeroutput">Selecting previously deselected package python2.3.
(Reading database ... 22848 files and directories currently installed.)
Unpacking python2.3 (from .../python2.3_2.3.1-1_i386.deb) ...
Selecting previously deselected package python.
Unpacking python (from .../python_2.3.1-1_all.deb) ...
Setting up python (2.3.1-1) ...
Setting up python2.3 (2.3.1-1) ...
Compiling python modules in /usr/lib/python2.3 ...
Compiling optimized python modules in /usr/lib/python2.3 ...</samp>
<samp class="prompt">localhost:~# </samp>exit
logout
<samp class="prompt">localhost:~$ </samp>python
<samp class="computeroutput">Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
[GCC 3.3.2 20030908 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.</samp>
<samp class="prompt">>>> </samp>[press Ctrl+D to exit]
</pre><h2 id="install.source">1.7. Python Installation from Source</h2>
<p>If you prefer to build from source, you can download the Python source code from <a href="http://www.python.org/ftp/python/">http://www.python.org/ftp/python/</a>. Select the highest version number listed, download the <code class="filename">.tgz</code> file, and then do the usual <kbd>configure</kbd>, <kbd>make</kbd>, <kbd>make install</kbd> dance.
<div class="example"><h3>Example 1.4. Installing from source</h3><pre class="screen">
<samp class="prompt">localhost:~$ </samp>su -
<samp class="prompt">Password: </samp>[enter your root password]
<samp class="prompt">localhost:~# </samp>wget http://www.python.org/ftp/python/2.3/Python-2.3.tgz
<samp class="computeroutput">Resolving www.python.org... done.
Connecting to www.python.org[194.109.137.226]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8,436,880 [application/x-tar]
...</samp>
<samp class="prompt">localhost:~# </samp>tar xfz Python-2.3.tgz
<samp class="prompt">localhost:~# </samp>cd Python-2.3
<samp class="prompt">localhost:~/Python-2.3# </samp>./configure
<samp class="computeroutput">checking MACHDEP... linux2
checking EXTRAPLATDIR...
checking for --without-gcc... no
...</samp>
<samp class="prompt">localhost:~/Python-2.3# </samp>make
<samp class="computeroutput">gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include -DPy_BUILD_CORE -o Modules/python.o Modules/python.c
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include -DPy_BUILD_CORE -o Parser/acceler.o Parser/acceler.c
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include -DPy_BUILD_CORE -o Parser/grammar1.o Parser/grammar1.c
...</samp>
<samp class="prompt">localhost:~/Python-2.3# </samp>make install
<samp class="computeroutput">/usr/bin/install -c python /usr/local/bin/python2.3
...</samp>
<samp class="prompt">localhost:~/Python-2.3# </samp>exit
logout
<samp class="prompt">localhost:~$ </samp>which python
/usr/local/bin/python
<samp class="prompt">localhost:~$ </samp>python
<samp class="computeroutput">Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
[GCC 3.3.2 20030908 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.</samp>
<samp class="prompt">>>> </samp>[press Ctrl+D to get back to the command prompt]
<samp class="prompt">localhost:~$ </samp>
</pre><h2 id="install.shell">1.8. The Interactive Shell</h2>
<p>Now that you have Python installed, what's this interactive shell thing you're running?
<p>It's like this: Python leads a double life. It's an interpreter for scripts that you can run from the command line or run like applications, by
double-clicking the scripts. But it's also an interactive shell that can evaluate arbitrary statements and expressions.
This is extremely useful for debugging, quick hacking, and testing. I even know some people who use the Python interactive shell in lieu of a calculator!
<p>Launch the Python interactive shell in whatever way works on your platform, and let's dive in with the steps shown here:
<div class="example"><h3>Example 1.5. First Steps in the Interactive Shell</h3><pre class="screen">
<samp class="prompt">>>> </samp>1 + 1 <img id="install.shell.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
2
<samp class="prompt">>>> </samp>print 'hello world' <img id="install.shell.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
hello world
<samp class="prompt">>>> </samp>x = 1 <img id="install.shell.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>y = 2
<samp class="prompt">>>> </samp>x + y
3
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#install.shell.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The Python interactive shell can evaluate arbitrary Python expressions, including any basic arithmetic expression.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#install.shell.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The interactive shell can execute arbitrary Python statements, including the <kbd>print</kbd> statement.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#install.shell.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can also assign values to variables, and the values will be remembered as long as the shell is open (but not any longer
than that).
</td>
</tr>
</table>
<h2 id="install.summary">1.9. Summary</h2>
<p>You should now have a version of Python installed that works for you.
<p>Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing <kbd>python</kbd> on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
<p>Congratulations, and welcome to Python.
<div class="chapter">
<h2 id="odbchelper">Chapter 2. Your First Python Program</h2>
<p>You know how other books go on and on about programming fundamentals and finally work up to building a complete, working program?
Let's skip all that.
<h2 id="odbchelper.divein">2.1. Diving in</h2>
<p>Here is a complete, working Python program.
<p>It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But
read through it first and see what, if anything, you can make of it.
<div class="example"><h3>Example 2.1. <code class="filename">odbchelper.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.
    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }
    print buildConnectionString(myParams)</pre><p>Now run this program and see what happens.<table id="tip.run.windows" class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In the ActivePython <acronym>IDE</acronym> on Windows, you can run the Python program you're editing by choosing
File->Run... (<kbd class="shortcut">Ctrl-R</kbd>). Output is displayed in the interactive window.
</td>
</tr>
</table><table id="tip.run.mac" class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In the Python <acronym>IDE</acronym> on Mac OS, you can run a Python program with
Python->Run window... (<kbd class="shortcut">Cmd-R</kbd>), but there is an important option you must set first. Open the <code class="filename">.py</code> file in the <acronym>IDE</acronym>, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked. This is a per-file setting, but you'll only need to do it once per file.
</td>
</tr>
</table><table id="tip.run.unix" class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">On <acronym>UNIX</acronym>-compatible systems (including Mac OS X), you can run a Python program from the command line: <kbd>python <code class="filename">odbchelper.py</code></kbd></td>
</tr>
</table>
<div class="informalexample" id="odbchelper.output"><p>The output of <code class="filename">odbchelper.py</code> will look like this:<pre class="screen">server=mpilgrim;uid=sa;database=master;pwd=secret</pre><h2 id="odbchelper.funcdef">2.2. Declaring Functions</h2>
<p>Python has functions like most other languages, but it does not have separate header files like <acronym>C++</acronym> or <code>interface</code>/<code>implementation</code> sections like Pascal. When you need a function, just declare it, like this:
<div class="informalexample"><pre class="programlisting">
def buildConnectionString(params):</pre><p>Note that the keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments
(not shown here) are separated with commas.
<p>Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value.
In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.<table id="compare.funcdef.vb" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In Visual Basic, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's <code>None</code>), and all functions start with <code>def</code>.
</td>
</tr>
</table>
<p>The argument, <code>params</code>, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.<table id="compare.funcdef.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In Java, <acronym>C++</acronym>, and other statically-typed languages, you must specify the datatype of the function return value and each function argument.
In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
</td>
</tr>
</table>
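<p>To see the return-value rule in action, here is a minimal sketch. The names <code>greet</code>, <code>log_only</code>, and <code>seen</code> are made up for illustration and are not part of <code class="filename">odbchelper.py</code>. <code>print</code> is written with parentheses around a single value so that the snippet runs both on the Python 2 interpreter used in this book and on later versions.

```python
seen = []   # hypothetical list used only by this illustration

def greet(name):
    """Return a greeting for name."""
    return "hello, " + name

def log_only(message):
    """Record a message; note that there is no return statement."""
    seen.append(message)

greeting = greet("world")       # the value given to the return statement
nothing = log_only("started")   # no return statement, so the result is None

print(greeting)
print(nothing is None)
```

<p>Neither function declares a return type, and both are introduced with <code>def</code>; the only difference is whether a <code>return</code> statement is executed.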
<h3>2.2.1. How Python's Datatypes Compare to Other Programming Languages</h3>
<p>An erudite reader sent me this explanation of how Python compares to other programming languages:
<div class="variablelist">
<dl>
<dt>statically typed language</dt>
<dd>A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare
all variables with their datatypes before using them. Java and <acronym>C</acronym> are statically typed languages.
</dd>
<dt>dynamically typed language</dt>
<dd>A language in which types are discovered at execution time; the opposite of statically typed. VBScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
</dd>
<dt>strongly typed language</dt>
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
</dd>
<dt>weakly typed language</dt>
<dd>A language in which types may be ignored; the opposite of strongly typed. VBScript is weakly typed. In VBScript, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion.
</dd>
</dl>
<p>So Python is both <em>dynamically typed</em> (because it doesn't use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
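<p>A short sketch may make the two properties concrete. The variable names here are arbitrary; the only assumption is that mixing a string and an integer with <code>+</code> raises <code>TypeError</code>, which is how Python enforces its types.

```python
x = 1          # x currently refers to an integer
x = "one"      # rebinding it to a string needs no declaration (dynamic typing)

try:
    result = "12" + 3          # str + int is refused (strong typing)
except TypeError:
    result = "12" + str(3)     # an explicit conversion is required

print(result)
print(int(result) + 1)         # converting back to an integer is explicit too
```

<p>Contrast this with the VBScript behavior described above, where the same concatenation would succeed silently.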
<h2 id="odbchelper.docstring">2.3. Documenting Functions</h2>
<p>You can document a Python function by giving it a <code>doc string</code>.
<div class="example"><h3 id="odbchelper.triplequotes">Example 2.2. Defining the <code class="function">buildConnectionString</code> Function's <code>doc string</code></h3><pre class="programlisting">
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.
    Returns string."""</pre><p>Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including
carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining
a <code>doc string</code>.
</div><table id="compare.quoting.perl" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Triple quotes are also an easy way to define a string with both single and double quotes, like <code>qq/.../</code> in Perl.
</td>
</tr>
</table>
<p>Everything between the triple quotes is the function's <code>doc string</code>, which documents what the function does. A <code>doc string</code>, if it exists, must be the first thing defined in a function (that is, the first thing after the colon). You don't technically
need to give your function a <code>doc string</code>, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the <code>doc string</code> is available at runtime as an attribute of the function.<table id="tip.docstring" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Many Python <acronym>IDE</acronym>s use the <code>doc string</code> to provide context-sensitive documentation, so that when you type a function name, its <code>doc string</code> appears as a tooltip. This can be incredibly helpful, but it's only as good as the <code>doc string</code>s you write.
</td>
</tr>
</table>
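<p>As a small illustration of both points (triple quotes and runtime availability), the following sketch defines a hypothetical function <code>quote</code>, which is not part of the book's examples, and then reads its <code>doc string</code> back through the <code>__doc__</code> attribute.

```python
def quote(text):
    """Wrap text in "double" quotes.

    A doc string may span lines and contain 'single' or "double" quotes.
    """
    return '"' + text + '"'

doc = quote.__doc__                  # the doc string, read at runtime
first_line = doc.split("\n")[0]      # the summary line
print(first_line)
print(quote("hi"))
```

<p>The first line of the <code>doc string</code> is a one-sentence summary, which is the convention that tools displaying documentation tend to rely on.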
<div class="itemizedlist">
<h3>Further Reading on Documenting Functions</h3>
<ul>
<li><a href="http://www.python.org/peps/pep-0257.html">PEP 257</a> defines <code>doc string</code> conventions.
<li><a href="http://www.python.org/doc/essays/styleguide.html"><i class="citetitle">Python Style Guide</i></a> discusses how to write a good <code>doc string</code>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses conventions for <a href="http://www.python.org/doc/current/tut/node6.html#SECTION006750000000000000000">spacing in <code>doc string</code>s</a>.
</ul>
<h2 id="odbchelper.objects">2.4. Everything Is an Object</h2>
<p>In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime.
<p>A function, like everything else in Python, is an object.
<p>Open your favorite Python <acronym>IDE</acronym> and follow along:
<div class="example"><h3 id="odbchelper.import">Example 2.3. Accessing the <code class="function">buildConnectionString</code> Function's <code>doc string</code></h3><pre class="screen"><samp class="prompt">>>> </samp>import odbchelper <img id="odbchelper.objects.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
<samp class="prompt">>>> </samp>print odbchelper.buildConnectionString(params) <img id="odbchelper.objects.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
server=mpilgrim;uid=sa;database=master;pwd=secret
<samp class="prompt">>>> </samp>print odbchelper.buildConnectionString.__doc__ <img id="odbchelper.objects.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">Build a connection string from a dictionary of parameters.
    Returns string.</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.objects.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first line imports the <code class="filename">odbchelper</code> program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in <a href="#apihelper">Chapter 4</a>.) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this
to access functionality in other modules, and you can do it in the <acronym>IDE</acronym> too. This is an important concept, and you'll learn more about it later.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.objects.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When you want to use functions defined in imported modules, you need to include the module name. So you can't just say <code class="function">buildConnectionString</code>; it must be <code class="function">odbchelper.buildConnectionString</code>. If you've used classes in Java, this should feel vaguely familiar.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.objects.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Instead of calling the function as you would expect to, you asked for one of the function's attributes, <code>__doc__</code>.
</td>
</tr>
</table>
</div><table id="compare.import.perl" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code>import</code> in Python is like <code>require</code> in Perl. Once you <code>import</code> a Python module, you access its functions with <code><i class="replaceable">module</i>.<i class="replaceable">function</i></code>; once you <code>require</code> a Perl module, you access its functions with <code><i class="replaceable">module</i>::<i class="replaceable">function</i></code>.
</td>
</tr>
</table>
<h3>2.4.1. The Import Search Path</h3>
<p>Before you go any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in <code class="varname">sys.path</code>. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists
later in this chapter.)
<div class="example"><h3 id="odbchelper.objects.sys.path">Example 2.4. Import Search Path</h3><pre class="screen">
<samp class="prompt">>>> </samp>import sys <img id="odbchelper.objects.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>sys.path <img id="odbchelper.objects.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">['', '/usr/local/lib/python2.2', '/usr/local/lib/python2.2/plat-linux2',
'/usr/local/lib/python2.2/lib-dynload', '/usr/local/lib/python2.2/site-packages',
'/usr/local/lib/python2.2/site-packages/PIL', '/usr/local/lib/python2.2/site-packages/piddle']</samp>
<samp class="prompt">>>> </samp>sys <img id="odbchelper.objects.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;module 'sys' (built-in)>
<samp class="prompt">>>> </samp>sys.path.append('/my/new/path') <img id="odbchelper.objects.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.objects.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Importing the <code class="filename">sys</code> module makes all of its functions and attributes available.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.objects.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">sys.path</code> is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating
system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a <code>.py</code> file matching the module name you're trying to import.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.objects.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Actually, I lied; the truth is more complicated than that, because not all modules are stored as <code>.py</code> files. Some, like the <code class="filename">sys</code> module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The <code class="filename">sys</code> module is written in <acronym>C</acronym>.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.objects.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can add a new directory to Python's search path at runtime by appending the directory name to <code class="varname">sys.path</code>, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll learn more about <code class="function">append</code> and other list methods in <a href="#datatypes">Chapter 3</a>.)
</td>
</tr>
</table>
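<p>Because <code class="varname">sys.path</code> is an ordinary list, the usual list operations apply to it. The sketch below assumes a made-up directory name, <code>/my/new/path</code>, which does not need to exist for the code to run.

```python
import sys

extra_dir = "/my/new/path"          # hypothetical directory, need not exist

print(isinstance(sys.path, list))   # sys.path is a plain list of strings
if extra_dir not in sys.path:       # avoid adding the same entry twice
    sys.path.append(extra_dir)      # appended entries are searched last
print(extra_dir in sys.path)
```

<p>Appending puts the directory at the end of the search order, so modules found in earlier directories still take precedence.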
<h3>2.4.2. What's an Object?</h3>
<p>Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute <code>__doc__</code>, which returns the <code>doc string</code> defined in the function's source code. The <code class="filename">sys</code> module is an object which has (among other things) an attribute called <code class="varname">path</code>. And so forth.
<p>Still, this raises the question: what is an object? Different programming languages define &#8220;object&#8221; in different ways. In some, it means that <em>all</em> objects <em>must</em> have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in <a href="#datatypes">Chapter 3</a>), and not all objects are subclassable (more on this in <a href="#fileinfo">Chapter 5</a>). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function
(more on this in <a href="#apihelper">Chapter 4</a>).
<p>This is so important that I'm going to repeat it in case you missed it the first few times: <em>everything in Python is an object</em>. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
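<p>Here is a minimal sketch of what "assigned to a variable or passed as an argument" means for functions. The functions <code>shout</code>, <code>exclaim</code>, and <code>apply_twice</code> are invented for this illustration.

```python
def shout(text):
    """Return text in upper case."""
    return text.upper()

def exclaim(text):
    """Return text with an exclamation mark appended."""
    return text + "!"

def apply_twice(func, value):
    """Call func on value, then call it again on the result."""
    return func(func(value))

alias = shout                        # a function assigned to a variable
print(alias("hello"))
print(apply_twice(exclaim, "hi"))    # a function passed as an argument
print(shout.__doc__)                 # and it carries attributes, like any object
```

<p>Note that <code>alias = shout</code> has no parentheses: it does not call the function, it just gives the same function object a second name.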
<div class="itemizedlist">
<h3>Further Reading on Objects</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class="citetitle">Python Reference Manual</i></a> explains exactly what it means to say that <a href="http://www.python.org/doc/current/ref/objects.html">everything in Python is an object</a>, because some people are pedantic and like to discuss this sort of thing at great length.
<li><a href="http://www.effbot.org/guides/">eff-bot</a> summarizes <a href="http://www.effbot.org/guides/python-objects.htm">Python objects</a>.
</ul>
<h2 id="odbchelper.indenting">2.5. Indenting Code</h2>
<p>Python functions have no explicit <code>begin</code> or <code>end</code>, and no curly braces to mark where the function code starts and stops. The only delimiters are a colon (<code>:</code>) and the indentation of the code itself.
<div class="example"><h3>Example 2.5. Indenting the <code class="function">buildConnectionString</code> Function</h3><pre class="programlisting">
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.
    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])</pre><p>Code blocks are defined by their indentation. By "code block", I mean functions, <code>if</code> statements, <code>for</code> loops, <code>while</code> loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords.
This means that whitespace is significant, and must be consistent. In this example, the function code (including the <code>doc string</code>) is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not
indented is outside the function.
<p><a href="#odbchelper.indenting.if" title="Example 2.6. if Statements">Example 2.6, &#8220;if Statements&#8221;</a> shows an example of code indentation with <code>if</code> statements.
<div class="example"><h3 id="odbchelper.indenting.if">Example 2.6. <code>if</code> Statements</h3><pre class="programlisting">
def fib(n):                   <img id="odbchelper.indenting.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    print 'n =', n            <img id="odbchelper.indenting.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    if n > 1:                 <img id="odbchelper.indenting.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        return n * fib(n - 1)
    else:                     <img id="odbchelper.indenting.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        print 'end of the line'
        return 1
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.indenting.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a function named <code class="function">fib</code> that takes one argument, <code class="varname">n</code>. All the code within the function is indented.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.indenting.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Printing to the screen is very easy in Python: just use <code class="function">print</code>. <code class="function">print</code> statements can take any data type, including strings, integers, and other native types like dictionaries and lists that you'll
learn about in the next chapter. You can even mix and match to print several things on one line by using a comma-separated
list of values. Each value is printed on the same line, separated by spaces (the commas don't print). So when <code class="function">fib</code> is called with <code>5</code>, this will print "n = 5".
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.indenting.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>if</code> statements are a type of code block. If the <code>if</code> expression evaluates to true, the indented block is executed, otherwise it falls to the <code>else</code> block.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.indenting.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Of course <code>if</code> and <code>else</code> blocks can contain multiple lines, as long as they are all indented the same amount. This <code>else</code> block has two lines of code in it. There is no other special syntax for multi-line code blocks. Just indent and get on
with your life.
</td>
</tr>
</table>
<p>After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read
and understand other people's Python code.<table id="compare.lineend.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. <acronym>C++</acronym> and Java use semicolons to separate statements and curly braces to separate code blocks.
</td>
</tr>
</table>
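<p>Blocks nest simply by indenting further. The following sketch uses a hypothetical function, <code>classify</code>, to show a <code>for</code> loop containing an <code>if</code>/<code>else</code>, with each level ending as soon as the indentation returns to the level above it.

```python
def classify(numbers):
    """Return a list of 'even'/'odd' labels, one per number."""
    labels = []
    for n in numbers:               # the for block starts on the next line
        if n % 2 == 0:              # the if block is nested one level deeper
            labels.append("even")
        else:
            labels.append("odd")
    return labels                   # back at function level: the loop has ended

print(classify([1, 2, 3]))
```

<p>If the <code>return</code> line were indented one level further, it would become part of the loop body and the function would return after the first number.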
<div class="itemizedlist">
<h3>Further Reading on Code Indentation</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class="citetitle">Python Reference Manual</i></a> discusses cross-platform indentation issues and <a href="http://www.python.org/doc/current/ref/indentation.html">shows various indentation errors</a>.
<li><a href="http://www.python.org/doc/essays/styleguide.html"><i class="citetitle">Python Style Guide</i></a> discusses good indentation style.
</ul>
<h2 id="odbchelper.testing">2.6. Testing Modules</h2>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them.
Here's an example that uses the <code>if</code> <code>__name__</code> trick.
<div class="informalexample"><pre id="odbchelper.ifnametrick" class="programlisting">
if __name__ == "__main__":</pre><p>Some quick observations before you get to the good stuff. First, parentheses are not required around the <code>if</code> expression. Second, the <code>if</code> statement ends with a colon, and is followed by <a href="#odbchelper.indenting" title="2.5. Indenting Code">indented code</a>.<table id="compare.equals.c" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Like <acronym>C</acronym>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <acronym>C</acronym>, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
</td>
</tr>
</table>
<p>So why is this particular <code>if</code> statement a trick? Modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone
program, in which case <code>__name__</code> will be a special default value, <code>__main__</code>.
<div class="informalexample"><pre class="screen"><samp class="prompt">>>> </samp>import odbchelper
<samp class="prompt">>>> </samp>odbchelper.<code>__name__</code>
'odbchelper'</pre><p>Knowing this, you can design a test suite for your module within the module itself by putting it in this <code>if</code> statement. When you run the module directly, <code>__name__</code> is <code>__main__</code>, so the test suite executes. When you import the module, <code>__name__</code> is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating
them into a larger program.<table id="tip.mac.runasmain" class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">On MacPython, there is an additional step to make the <code>if</code> <code>__name__</code> trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and
make sure Run as __main__ is checked.
</td>
</tr>
</table>
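<p>Put together, a module with a built-in self-test might look like the sketch below. The function names <code>double</code> and <code>self_test</code> are made up for illustration; only the final <code>if</code> statement is the trick described above.

```python
def double(n):
    """Return n multiplied by two."""
    return n * 2

def self_test():
    """Run a few checks; raises AssertionError if one of them fails."""
    assert double(2) == 4
    assert double(0) == 0
    return "ok"

if __name__ == "__main__":
    # Reached only when the file is run directly, not when it is imported.
    print(self_test())
```

<p>Importing this file from another module defines <code>double</code> and <code>self_test</code> but prints nothing, because in that case <code>__name__</code> holds the module's name rather than <code>__main__</code>.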
<div class="itemizedlist">
<h3>Further Reading on Importing Modules</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class="citetitle">Python Reference Manual</i></a> discusses the low-level details of <a href="http://www.python.org/doc/current/ref/import.html">importing modules</a>.
</ul>
<div class="chapter">
<h2 id="datatypes">Chapter 3. Native Datatypes</h2>
<p>You'll get back to your first Python program in just a minute. But first, a short digression is in order, because you need to know about dictionaries, tuples,
and lists (oh my!). If you're a Perl hacker, you can probably skim the bits about dictionaries and lists, but you should still pay attention to tuples.
<h2 id="odbchelper.dict">3.1. Introducing Dictionaries</h2>
<p>One of Python's built-in datatypes is the dictionary, which defines one-to-one relationships between keys and values.<table id="compare.dict.perl" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">A dictionary in Python is like a hash in Perl. In Perl, variables that store hashes always start with a <code>%</code> character. In Python, variables can be named anything, and Python keeps track of the datatype internally.
</td>
</tr>
</table><table id="compare.dict.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">A dictionary in Python is like an instance of the <code class="classname">Hashtable</code> class in Java.
</td>
</tr>
</table><table id="compare.dict.vb" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">A dictionary in Python is like an instance of the <code class="classname">Scripting.Dictionary</code> object in Visual Basic.
</td>
</tr>
</table>
<h3>3.1.1. Defining Dictionaries</h3>
<div class="example"><h3 id="odbchelper.dict.define">Example 3.1. Defining a Dictionary</h3><pre class="screen"><samp class="prompt">>>> </samp>d = {"server":"mpilgrim", "database":"master"} <img id="odbchelper.dict.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{'server': 'mpilgrim', 'database': 'master'}
<samp class="prompt">>>> </samp>d["server"]<img id="odbchelper.dict.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'mpilgrim'
<samp class="prompt">>>> </samp>d["database"] <img id="odbchelper.dict.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'master'
<samp class="prompt">>>> </samp>d["mpilgrim"] <img id="odbchelper.dict.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
KeyError: mpilgrim</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, you create a new dictionary with two elements and assign it to the variable <code class="varname">d</code>. Each element is a key-value pair, and the whole set of elements is enclosed in curly braces.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'server'</code> is a key, and its associated value, referenced by <code>d["server"]</code>, is <code>'mpilgrim'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'database'</code> is a key, and its associated value, referenced by <code>d["database"]</code>, is <code>'master'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can get values by key, but you can't get keys by value. So <code>d["server"]</code> is <code>'mpilgrim'</code>, but <code>d["mpilgrim"]</code> raises an exception, because <code>'mpilgrim'</code> is not a key.
</td>
</tr>
</table>
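<p>If you are not sure whether a key exists, you do not have to let the <code>KeyError</code> crash your program. The following sketch shows three common ways to guard a lookup, using the same dictionary as the example above; the variable names are illustrative.

```python
d = {"server": "mpilgrim", "database": "master"}

# The "in" operator tests for a key without raising an exception.
has_server = "server" in d
has_mpilgrim = "mpilgrim" in d   # False: 'mpilgrim' is a value, not a key

# get() returns None, or a default you supply, when the key is missing.
missing = d.get("mpilgrim")
fallback = d.get("mpilgrim", "no such key")

# Catching KeyError is the third option.
try:
    value = d["mpilgrim"]
except KeyError:
    value = None

print(has_server)
print(has_mpilgrim)
print(fallback)
```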
<h3>3.1.2. Modifying Dictionaries</h3>
<div class="example"><h3 id="odbchelper.dict.modify">Example 3.2. Modifying a Dictionary</h3><pre class="screen"><samp class="prompt">>>> </samp>d
{'server': 'mpilgrim', 'database': 'master'}
<samp class="prompt">>>> </samp>d["database"] = "pubs" <img id="odbchelper.dict.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{'server': 'mpilgrim', 'database': 'pubs'}
<samp class="prompt">>>> </samp>d["uid"] = "sa" <img id="odbchelper.dict.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs'}</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You cannot have duplicate keys in a dictionary. Assigning a value to an existing key will wipe out the old value.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can add new key-value pairs at any time. This syntax is identical to modifying existing values. (Yes, this will annoy
you someday when you think you are adding new values but are actually just modifying the same value over and over because
your key isn't changing the way you think it is.)
</td>
</tr>
</table>
<p>Note that the new element (key <code>'uid'</code>, value <code>'sa'</code>) appears to be in the middle. In fact, it was just a coincidence that the elements appeared to be in order in the first
example; it is just as much a coincidence that they appear to be out of order now.<table id="tip.dictorder" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Dictionaries have no concept of order among elements. It is incorrect to say that the elements are &#8220;out of order&#8221;; they are simply unordered. This is an important distinction that will annoy you when you want to access the elements of
a dictionary in a specific, repeatable order (like alphabetical order by key). There are ways of doing this, but they're
not built into the dictionary.
</td>
</tr>
</table>
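<p>One way to get a repeatable order, sketched here under the assumption that the keys can be compared with each other (they are all strings in this case), is to copy the keys into a list, sort the list, and look up each value in turn.

```python
d = {"server": "mpilgrim", "uid": "sa", "database": "pubs"}

# Copy the keys into a list and sort that list; the dictionary is untouched.
keys = list(d.keys())
keys.sort()

# Walk the sorted keys and look up each value.
pairs = [(key, d[key]) for key in keys]
for key, value in pairs:
    print("%s=%s" % (key, value))
```

<p>The ordering lives in the separate <code>keys</code> list, not in the dictionary itself.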
<p>When working with dictionaries, you need to be aware that dictionary keys are case-sensitive.
<div class="example"><h3 id="odbchelper.dict.case">Example 3.3. Dictionary Keys Are Case-Sensitive</h3><pre class="screen">
<samp class="prompt">>>> </samp>d = {}
<samp class="prompt">>>> </samp>d["key"] = "value"
<samp class="prompt">>>> </samp>d["key"] = "other value" <img id="odbchelper.dict.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{'key': 'other value'}
<samp class="prompt">>>> </samp>d["Key"] = "third value" <img id="odbchelper.dict.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{'Key': 'third value', 'key': 'other value'}
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Assigning a value to an existing dictionary key simply replaces the old value with a new one.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is not assigning a value to an existing dictionary key, because strings in Python are case-sensitive, so <code>'key'</code> is not the same as <code>'Key'</code>. This creates a new key/value pair in the dictionary; it may look similar to you, but as far as Python is concerned, it's completely different.
</td>
</tr>
</table>
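<p>If you want keys that differ only in case to be treated as the same key, you have to normalize them yourself before using them. A minimal sketch, where <code>normalize</code> is a hypothetical helper and not a dictionary feature:

```python
def normalize(key):
    """Hypothetical helper: fold a string key to lowercase."""
    return key.lower()

d = {}
d[normalize("key")] = "value"
d[normalize("Key")] = "third value"   # same key after normalization

print(d)
```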
<div class="example"><h3 id="odbchelper.dictionarytypes">Example 3.4. Mixing Datatypes in a Dictionary</h3><pre class="screen"><samp class="prompt">>>> </samp>d
{'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs'}
<samp class="prompt">>>> </samp>d["retrycount"] = 3 <img id="odbchelper.dict.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs', 'retrycount': 3}
<samp class="prompt">>>> </samp>d[42] = "douglas" <img id="odbchelper.dict.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
<samp class="computeroutput">{'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs',
42: 'douglas', 'retrycount': 3}</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Dictionaries aren't just for strings. Dictionary values can be any datatype, including strings, integers, objects, or even
other dictionaries. And within a single dictionary, the values don't all need to be the same type; you can mix and match
as needed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match
key datatypes within a dictionary.
</td>
</tr>
</table>
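<p>The restriction on keys is that they must be hashable, which in practice means immutable values such as strings, numbers, and tuples. The sketch below shows a tuple being accepted as a key and a list being rejected; the variable names are illustrative.

```python
d = {}
d["retrycount"] = 3          # string key
d[42] = "douglas"            # integer key
d[(1, 2)] = "a tuple key"    # tuples of immutable values work too

# A list is mutable, so it cannot be used as a key.
try:
    d[[1, 2]] = "a list key"
    list_key_accepted = True
except TypeError:
    list_key_accepted = False

print(list_key_accepted)
```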
<h3>3.1.3. Deleting Items From Dictionaries</h3>
<div class="example"><h3 id="odbchelper.dict.del">Example 3.5. Deleting Items from a Dictionary</h3><pre class="screen"><samp class="prompt">>>> </samp>d
<samp class="computeroutput">{'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs',
42: 'douglas', 'retrycount': 3}</samp>
<samp class="prompt">>>> </samp>del d[42] <img id="odbchelper.dict.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs', 'retrycount': 3}
<samp class="prompt">>>> </samp>d.clear() <img id="odbchelper.dict.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d
{}</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">del</code> lets you delete individual items from a dictionary by key.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.dict.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">clear</code> deletes all items from a dictionary. Note that the set of empty curly braces signifies a dictionary without any items.
</td>
</tr>
</table>
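<p>Two details are worth seeing in action: <code>del</code> raises a <code>KeyError</code> if the key is not there, and <code>clear</code> empties the dictionary in place, so every other name that refers to the same dictionary sees it become empty. A short sketch with illustrative variable names:

```python
d = {"server": "mpilgrim", "uid": "sa", 42: "douglas"}
alias = d          # a second name for the same dictionary object

del d[42]          # removes one key-value pair

# Deleting a key that is no longer present raises KeyError.
try:
    del d[42]
    second_delete_failed = False
except KeyError:
    second_delete_failed = True

d.clear()          # empties the dictionary in place
print(alias)       # the alias sees the change, because it is the same object
```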
<div class="itemizedlist">
<h3>Further Reading on Dictionaries</h3>
<ul>
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class="citetitle">How to Think Like a Computer Scientist</i></a> teaches about dictionaries and shows how to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap10.htm">use dictionaries to model sparse matrices</a>.
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> has a lot of <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/541">example code using dictionaries</a>.
<li><a href="http://www.activestate.com/ASPN/Python/Cookbook/" title="growing archive of annotated code samples">Python Cookbook</a> discusses <a href="http://www.activestate.com/ASPN/Python/Cookbook/Recipe/52306">how to sort the values of a dictionary by key</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/typesmapping.html">all the dictionary methods</a>.
</ul>
<h2 id="odbchelper.list">3.2. Introducing Lists</h2>
<p>Lists are Python's workhorse datatype. If your only experience with lists is arrays in Visual Basic or (God forbid) the datastore in Powerbuilder, brace yourself for Python lists.<table id="compare.list.perl" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">A list in Python is like an array in Perl. In Perl, variables that store arrays always start with the <code>@</code> character; in Python, variables can be named anything, and Python keeps track of the datatype internally.
</td>
</tr>
</table><table id="compare.list.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the <code class="classname">ArrayList</code> class, which can hold arbitrary objects and can expand dynamically as new items are added.
</td>
</tr>
</table>
<h3>3.2.1. Defining Lists</h3>
<div class="example"><h3>Example 3.6. Defining a List</h3><pre class="screen"><samp class="prompt">>>> </samp>li = ["a", "b", "mpilgrim", "z", "example"] <img id="odbchelper.list.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example']
<samp class="prompt">>>> </samp>li[0] <img id="odbchelper.list.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'a'
<samp class="prompt">>>> </samp>li[4] <img id="odbchelper.list.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'example'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, you define a list of five elements. Note that they retain their original order. This is not an accident. A list
is an ordered set of elements enclosed in square brackets.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">A list can be used like a zero-based array. The first element of any non-empty list is always <code>li[0]</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The last element of this five-element list is <code>li[4]</code>, because lists are always zero-based.
</td>
</tr>
</table>
<div class="example"><h3 id="odbchelper.negative.example">Example 3.7. Negative List Indices</h3><pre class="screen"><samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example']
<samp class="prompt">>>> </samp>li[-1] <img id="odbchelper.list.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'example'
<samp class="prompt">>>> </samp>li[-3] <img id="odbchelper.list.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'mpilgrim'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">A negative index accesses elements from the end of the list counting backwards. The last element of any non-empty list is
always <code>li[-1]</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the negative index is confusing to you, think of it this way: <code>li[-n] == li[len(li) - n]</code>. So in this list, <code>li[-3] == li[5 - 3] == li[2]</code>.
</td>
</tr>
</table>
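<p>The identity <code>li[-n] == li[len(li) - n]</code> can be checked for every valid <code>n</code>. The sketch below also shows that a negative index that reaches past the start of the list raises an <code>IndexError</code>, just as a positive index past the end does.

```python
li = ["a", "b", "mpilgrim", "z", "example"]

# Check the identity for n = 1 through len(li).
for n in range(1, len(li) + 1):
    assert li[-n] == li[len(li) - n]

# Counting back further than the list is long is an error.
try:
    li[-6]
    out_of_range = False
except IndexError:
    out_of_range = True

print(out_of_range)
```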
<div class="example"><h3 id="odbchelper.list.slice">Example 3.8. Slicing a List</h3><pre class="screen"><samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example']
<samp class="prompt">>>> </samp>li[1:3] <img id="odbchelper.list.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
['b', 'mpilgrim']
<samp class="prompt">>>> </samp>li[1:-1] <img id="odbchelper.list.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
['b', 'mpilgrim', 'z']
<samp class="prompt">>>> </samp>li[0:3] <img id="odbchelper.list.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
['a', 'b', 'mpilgrim']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can get a subset of a list, called a &#8220;slice&#8221;, by specifying two indices. The return value is a new list containing all the elements of the list, in order, starting with
the first slice index (in this case <code>li[1]</code>), up to but not including the second slice index (in this case <code>li[3]</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Slicing works if one or both of the slice indices are negative. If it helps, you can think of it this way: reading the list
from left to right, the first slice index specifies the first element you want, and the second slice index specifies the first
element you don't want. The return value is everything in between.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Lists are zero-based, so <code>li[0:3]</code> returns the first three elements of the list, starting at <code>li[0]</code>, up to but not including <code>li[3]</code>.
</td>
</tr>
</table>
<div class="example"><h3 id="odbchelper.list.slicing.example">Example 3.9. Slicing Shorthand</h3><pre class="screen"><samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example']
<samp class="prompt">>>> </samp>li[:3] <img id="odbchelper.list.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
['a', 'b', 'mpilgrim']
<samp class="prompt">>>> </samp>li[3:] <img id="odbchelper.list.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"> <img id="odbchelper.list.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
['z', 'example']
<samp class="prompt">>>> </samp>li[:] <img id="odbchelper.list.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
['a', 'b', 'mpilgrim', 'z', 'example']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the left slice index is 0, you can leave it out, and 0 is implied. So <code>li[:3]</code> is the same as <code>li[0:3]</code> from <a href="#odbchelper.list.slice" title="Example 3.8. Slicing a List">Example 3.8, &#8220;Slicing a List&#8221;</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Similarly, if the right slice index is the length of the list, you can leave it out. So <code>li[3:]</code> is the same as <code>li[3:5]</code>, because this list has five elements.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note the symmetry here. In this five-element list, <code>li[:3]</code> returns the first three elements, and <code>li[3:]</code> returns the last two elements. In fact, <code>li[:n]</code> will always return the first <code>n</code> elements, and <code>li[n:]</code> will return the rest, regardless of the length of the list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If both slice indices are left out, all elements of the list are included. But this is not the same as the original <code class="varname">li</code> list; it is a new list that happens to have all the same elements. <code>li[:]</code> is shorthand for making a complete copy of a list.
</td>
</tr>
</table>
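<p>The difference between a copy made with <code>li[:]</code> and a second name for the same list shows up as soon as you modify one of them. The following sketch uses the illustrative names <code>alias</code> and <code>duplicate</code>, and also checks that <code>li[:n]</code> and <code>li[n:]</code> always split the list into two complementary parts.

```python
li = ["a", "b", "mpilgrim", "z", "example"]

alias = li          # another name for the same list
duplicate = li[:]   # a new list with the same elements

duplicate.append("new")   # changes only the copy
alias.append("shared")    # changes li as well, because alias is li

print(li)
print(duplicate)

# li[:n] and li[n:] together always rebuild the whole list.
for n in range(len(li) + 1):
    assert li[:n] + li[n:] == li
```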
<h3>3.2.2. Adding Elements to Lists</h3>
<div class="example"><h3>Example 3.10. Adding Elements to a List</h3><pre class="screen"><samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example']
<samp class="prompt">>>> </samp>li.append("new") <img id="odbchelper.list.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example', 'new']
<samp class="prompt">>>> </samp>li.insert(2, "new") <img id="odbchelper.list.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new']
<samp class="prompt">>>> </samp>li.extend(["two", "elements"]) <img id="odbchelper.list.5.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">append</code> adds a single element to the end of the list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">insert</code> inserts a single element into a list. The numeric argument is the index of the first element that gets bumped out of position.
Note that list elements do not need to be unique; there are now two separate elements with the value <code>'new'</code>, <code>li[2]</code> and <code>li[6]</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.5.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">extend</code> concatenates lists. Note that you do not call <code class="function">extend</code> with multiple arguments; you call it with one argument, a list. In this case, that list has two elements.
</td>
</tr>
</table>
<div class="example"><h3 id="odbchelper.list.append.vs.extend">Example 3.11. The Difference between <code class="function">extend</code> and <code class="function">append</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>li = ['a', 'b', 'c']
<samp class="prompt">>>> </samp>li.extend(['d', 'e', 'f']) <img id="odbchelper.list.5.4" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'c', 'd', 'e', 'f']
<samp class="prompt">>>> </samp>len(li) <img id="odbchelper.list.5.5" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
6
<samp class="prompt">>>> </samp>li[-1]
'f'
<samp class="prompt">>>> </samp>li = ['a', 'b', 'c']
<samp class="prompt">>>> </samp>li.append(['d', 'e', 'f']) <img id="odbchelper.list.5.6" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'c', ['d', 'e', 'f']]
<samp class="prompt">>>> </samp>len(li) <img id="odbchelper.list.5.7" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
4
<samp class="prompt">>>> </samp>li[-1]
['d', 'e', 'f']
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.5.4"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Lists have two methods, <code class="function">extend</code> and <code class="function">append</code>, that look like they do the same thing, but are in fact completely different. <code class="function">extend</code> takes a single argument, which is always a list, and adds each of the elements of that list to the original list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.5.5"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you started with a list of three elements (<code>'a'</code>, <code>'b'</code>, and <code>'c'</code>), and you extended the list with a list of another three elements (<code>'d'</code>, <code>'e'</code>, and <code>'f'</code>), so you now have a list of six elements.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.5.6"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">On the other hand, <code class="function">append</code> takes one argument, which can be any data type, and simply adds it to the end of the list. Here, you're calling the <code class="function">append</code> method with a single argument, which is a list of three elements.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.5.7"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now the original list, which started as a list of three elements, contains four elements. Why four? Because the last element
that you just appended <em>is itself a list</em>. Lists can contain any type of data, including other lists. That may be what you want, or maybe not. Don't use <code class="function">append</code> if you mean <code class="function">extend</code>.
</td>
</tr>
</table>
<h3>3.2.3. Searching Lists</h3>
<div class="example"><h3 id="odbchelper.list.search">Example 3.12. Searching a List</h3><pre class="screen"><samp class="prompt">>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
<samp class="prompt">>>> </samp>li.index("example") <img id="odbchelper.list.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
5
<samp class="prompt">>>> </samp>li.index("new") <img id="odbchelper.list.6.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
2
<samp class="prompt">>>> </samp>li.index("c") <img id="odbchelper.list.6.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
ValueError: list.index(x): x not in list</samp>
<samp class="prompt">>>> </samp>"c" in li <img id="odbchelper.list.6.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
False</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.6.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">index</code> finds the first occurrence of a value in the list and returns the index.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.6.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">index</code> finds the <em>first</em> occurrence of a value in the list. In this case, <code>'new'</code> occurs twice in the list, in <code>li[2]</code> and <code>li[6]</code>, but <code class="function">index</code> will return only the first index, <code>2</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.6.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the value is not found in the list, Python raises an exception. This is notably different from many other languages, which return an invalid index such as <code>-1</code>. While this may
seem annoying, it is a good thing, because it means your program will crash at the source of the problem, rather than later
on when you try to use the invalid index.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.6.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To test whether a value is in the list, use the <code class="function">in</code> operator, which evaluates to <code class="constant">True</code> if the value is found or <code class="constant">False</code> if it is not.
</td>
</tr>
</table>
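<p>When a missing value is an ordinary case rather than a bug, you can combine <code>in</code> and <code>index</code> so that no exception is raised. A minimal sketch, where <code>find_index</code> is a hypothetical helper and not a list method:

```python
def find_index(items, value):
    """Hypothetical helper: return the first index of value, or None if absent."""
    if value in items:
        return items.index(value)
    return None

li = ["a", "b", "new", "mpilgrim", "z", "example", "new", "two", "elements"]

print(find_index(li, "example"))
print(find_index(li, "new"))
print(find_index(li, "c"))
```

<p>Returning <code>None</code> rather than a number keeps the "not found" result from being mistaken for a valid position, since <code>-1</code> is itself a legal list index in Python.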
</div><table id="tip.boolean" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Before version 2.2.1, Python had no separate boolean datatype. To compensate for this, Python accepted almost anything in a boolean context (like an <code>if</code> statement), according to the following rules:
<div class="itemizedlist">
<ul>
<li><code class="constant">0</code> is false; all other numbers are true.
<li>An empty string (<code>""</code>) is false; all other strings are true.
<li>An empty list (<code>[]</code>) is false; all other lists are true.
<li>An empty tuple (<code>()</code>) is false; all other tuples are true.
<li>An empty dictionary (<code>{}</code>) is false; all other dictionaries are true.
</ul>
</div>These rules still apply in Python 2.2.1 and beyond, but now you can also use an actual boolean, which has a value of <code>True</code> or <code>False</code>. Note the capitalization; these values, like everything else in Python, are case-sensitive.
</td>
</tr>
</table>
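<p>The rules in the note above can be checked directly by passing each kind of value to the built-in <code>bool</code> function. This is a small sketch with illustrative variable names; the sample values are arbitrary.

```python
# One empty or zero value of each kind listed in the note.
empty_values = [0, "", [], (), {}]

# A non-empty or non-zero counterpart for each; the contents do not matter.
non_empty_values = [1, -1, "a", [0], (0,), {"key": 0}]

falsy = [bool(value) for value in empty_values]
truthy = [bool(value) for value in non_empty_values]

print(falsy)
print(truthy)

# The same rules are what an if statement applies.
if not []:
    print("an empty list is false")
```

<p>Note that <code>[0]</code> is true even though its only element is false: what matters for a list, tuple, or dictionary is whether it is empty, not what it contains.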
<h3>3.2.4. Deleting List Elements</h3>
<div class="example"><h3 id="odbchelper.list.removingelements">Example 3.13. Removing Elements from a List</h3><pre class="screen"><samp class="prompt">>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
<samp class="prompt">>>> </samp>li.remove("z") <img id="odbchelper.list.7.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'example', 'new', 'two', 'elements']
<samp class="prompt">>>> </samp>li.remove("new") <img id="odbchelper.list.7.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'example', 'new', 'two', 'elements']
<samp class="prompt">>>> </samp>li.remove("c") <img id="odbchelper.list.7.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
ValueError: list.remove(x): x not in list</samp>
<samp class="prompt">>>> </samp>li.pop() <img id="odbchelper.list.7.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'elements'
<samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'example', 'new', 'two']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.7.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">remove</code> removes the first occurrence of a value from a list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.7.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">remove</code> removes <em>only</em> the first occurrence of a value. In this case, <code>'new'</code> appeared twice in the list, but <code>li.remove("new")</code> removed only the first occurrence.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.7.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the value is not found in the list, Python raises an exception. This mirrors the behavior of the <code class="function">index</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.7.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">pop</code> is an interesting beast. It does two things: it removes the last element of the list, and it returns the value that it removed.
Note that this is different from <code>li[-1]</code>, which returns a value but does not change the list, and different from <code>li.remove(<i class="replaceable">value</i>)</code>, which changes the list but does not return a value.
</td>
</tr>
</table>
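<p>The difference between <code class="function">remove</code> and <code class="function">pop</code> is easiest to see by capturing what each one returns. The sketch below uses a shortened, made-up list and illustrative variable names; it also shows one way to handle the exception when the value is missing.

```python
li = ['a', 'b', 'new', 'mpilgrim', 'new']

removed = li.remove('new')   # removes the first 'new' and returns None
last = li.pop()              # removes the last element and returns it

try:
    li.remove('c')           # 'c' is not in the list
    missing_raised = False
except ValueError:
    missing_raised = True    # remove raises ValueError for a missing value
```

<p>After this runs, <code class="varname">removed</code> is <code>None</code>, <code class="varname">last</code> is <code>'new'</code>, and <code class="varname">li</code> is <code>['a', 'b', 'mpilgrim']</code>.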
<h3>3.2.5. Using List Operators</h3>
<div class="example"><h3 id="odbchelper.list.operators">Example 3.14. List Operators</h3><pre class="screen"><samp class="prompt">>>> </samp>li = ['a', 'b', 'mpilgrim']
<samp class="prompt">>>> </samp>li = li + ['example', 'new'] <img id="odbchelper.list.8.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'example', 'new']
<samp class="prompt">>>> </samp>li += ['two'] <img id="odbchelper.list.8.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['a', 'b', 'mpilgrim', 'example', 'new', 'two']
<samp class="prompt">>>> </samp>li = [1, 2] * 3 <img id="odbchelper.list.8.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
[1, 2, 1, 2, 1, 2]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.8.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Lists can also be concatenated with the <code>+</code> operator. <code><i class="replaceable">list</i> = <i class="replaceable">list</i> + <i class="replaceable">otherlist</i></code> has the same result as <code><i class="replaceable">list</i>.extend(<i class="replaceable">otherlist</i>)</code>. But the <code>+</code> operator returns a new (concatenated) list as a value, whereas <code class="function">extend</code> only alters an existing list. This means that <code class="function">extend</code> is faster, especially for large lists.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.8.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Python supports the <code>+=</code> operator. <code>li += ['two']</code> is equivalent to <code>li.extend(['two'])</code>. The <code>+=</code> operator works for lists, strings, and integers, and it can be overloaded to work for user-defined classes as well. (More
on classes in <a href="#fileinfo">Chapter 5</a>.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.list.8.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code>*</code> operator works on lists as a repeater. <code>li = [1, 2] * 3</code> is equivalent to <code>li = [1, 2] + [1, 2] + [1, 2]</code>, which concatenates the three lists into one.
</td>
</tr>
</table>
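<p>One way to see that <code>+</code> builds a new list while <code class="function">extend</code> and <code>+=</code> change the existing one is to keep a second name pointing at the original list. The names <code class="varname">original</code>, <code class="varname">alias</code>, and <code class="varname">combined</code> below are illustrative only.

```python
original = ['a', 'b']
alias = original              # a second name for the same list object

combined = original + ['c']   # builds a new list; original is untouched
original.extend(['d'])        # changes the list in place, so alias sees it
original += ['e']             # for lists, += also works in place

repeated = [1, 2] * 3         # same as [1, 2] + [1, 2] + [1, 2]
```

<p>Here <code class="varname">combined</code> ends up as <code>['a', 'b', 'c']</code>, while <code class="varname">alias</code> and <code class="varname">original</code> are still the same object and both read <code>['a', 'b', 'd', 'e']</code>.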
<div class="itemizedlist">
<h3>Further Reading on Lists</h3>
<ul>
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class="citetitle">How to Think Like a Computer Scientist</i></a> teaches about lists and makes an important point about <a href="http://www.ibiblio.org/obp/thinkCSpy/chap08.htm">passing lists as function arguments</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007110000000000000000">use lists as stacks and queues</a>.
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/534">common questions about lists</a> and has a lot of <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/540">example code using lists</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/typesseq-mutable.html">all the list methods</a>.
</ul>
<h2 id="odbchelper.tuple">3.3. Introducing Tuples</h2>
<p>A tuple is an immutable list. A tuple cannot be changed in any way once it is created.
<div class="example"><h3>Example 3.15. Defining a tuple</h3><pre class="screen"><samp class="prompt">>>> </samp>t = ("a", "b", "mpilgrim", "z", "example") <img id="odbchelper.tuple.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>t
('a', 'b', 'mpilgrim', 'z', 'example')
<samp class="prompt">>>> </samp>t[0] <img id="odbchelper.tuple.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'a'
<samp class="prompt">>>> </samp>t[-1] <img id="odbchelper.tuple.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'example'
<samp class="prompt">>>> </samp>t[1:3] <img id="odbchelper.tuple.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
('b', 'mpilgrim')</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">A tuple is defined in the same way as a list, except that the whole set of elements is enclosed in parentheses instead of
square brackets.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The elements of a tuple have a defined order, just like a list. Tuple indices are zero-based, just like a list, so the first
element of a non-empty tuple is always <code>t[0]</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Negative indices count from the end of the tuple, just as with a list.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Slicing works too, just like a list. Note that when you slice a list, you get a new list; when you slice a tuple, you get
a new tuple.
</td>
</tr>
</table>
<div class="example"><h3 id="odbchelper.tuplemethods">Example 3.16. Tuples Have No Methods</h3><pre class="screen"><samp class="prompt">>>> </samp>t
('a', 'b', 'mpilgrim', 'z', 'example')
<samp class="prompt">>>> </samp>t.append("new") <img id="odbchelper.tuple.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'append'</samp>
<samp class="prompt">>>> </samp>t.remove("z") <img id="odbchelper.tuple.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'remove'</samp>
<samp class="prompt">>>> </samp>t.index("example") <img id="odbchelper.tuple.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'index'</samp>
<samp class="prompt">>>> </samp>"z" in t <img id="odbchelper.tuple.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
True</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can't add elements to a tuple. Tuples have no <code class="function">append</code> or <code class="function">extend</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can't remove elements from a tuple. Tuples have no <code class="function">remove</code> or <code class="function">pop</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can't find elements in a tuple. Tuples have no <code class="function">index</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.tuple.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can, however, use <code class="function">in</code> to see if an element exists in the tuple.
</td>
</tr>
</table>
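<p>As a rough sketch of what immutability means in practice, the following tries to change a tuple in two ways and records the exception each attempt raises. The variable names are illustrative, and the tuple is a shortened version of the one above.

```python
t = ("a", "b", "mpilgrim")

try:
    t.append("new")          # tuples have no append method
    append_error = None
except AttributeError:
    append_error = "AttributeError"

try:
    t[0] = "z"               # assigning to an element is not allowed either
    assign_error = None
except TypeError:
    assign_error = "TypeError"

found = "b" in t             # membership tests still work
```

<p>Both attempts fail and <code class="varname">t</code> is left exactly as it was defined.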
<p>So what are tuples good for?
<div class="itemizedlist">
<ul>
<li>Tuples are faster than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate
through it, use a tuple instead of a list.
<li>It makes your code safer if you &#8220;write-protect&#8221; data that does not need to be changed. Using a tuple instead of a list is like having an implied <code>assert</code> statement that shows this data is constant, and that special thought (and a specific function) is required to override that.
<li>Remember that I said that <a href="#odbchelper.dictionarytypes" title="Example 3.4. Mixing Datatypes in a Dictionary">dictionary keys</a> can be integers, strings, and &#8220;a few other types&#8221;? Tuples are one of those types. Tuples can be used as keys in a dictionary, but lists can't be used this way. Actually, it's more complicated than that. Dictionary keys must be immutable. Tuples themselves are immutable, but if you
have a tuple of lists, that counts as mutable and isn't safe to use as a dictionary key. Only tuples of strings, numbers,
or other dictionary-safe tuples can be used as dictionary keys.
<li>Tuples are used in string formatting, as you'll see shortly.
</ul>
</div><table id="tip.tuple" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Tuples can be converted into lists, and vice-versa. The built-in <code class="function">tuple</code> function takes a list and returns a tuple with the same elements, and the <code class="function">list</code> function takes a tuple and returns a list. In effect, <code class="function">tuple</code> freezes a list, and <code class="function">list</code> thaws a tuple.
</td>
</tr>
</table>
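<p>The points about dictionary keys and about converting between tuples and lists can be tried out together. This is a minimal sketch with invented names (<code class="varname">point</code>, <code class="varname">grid</code>, and so on), not code from the sample program.

```python
point = (3, 4)
grid = {point: "treasure"}        # a tuple of numbers works as a key

try:
    grid[[3, 4]] = "trap"         # a list is rejected as a key
    list_key_rejected = False
except TypeError:
    list_key_rejected = True

try:
    grid[([3], [4])] = "trap"     # so is a tuple that contains lists
    nested_key_rejected = False
except TypeError:
    nested_key_rejected = True

thawed = list(point)              # list() "thaws" the tuple into [3, 4]
thawed.append(5)                  # the list can be changed
frozen = tuple(thawed)            # tuple() "freezes" it again: (3, 4, 5)
```

<p>Converting never changes the original: <code class="varname">point</code> is still <code>(3, 4)</code> after the list built from it has grown.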
<div class="itemizedlist">
<h3>Further Reading on Tuples</h3>
<ul>
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class="citetitle">How to Think Like a Computer Scientist</i></a> teaches about tuples and shows how to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap10.htm">concatenate tuples</a>.
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> shows how to <a href="http://www.faqts.com/knowledge-base/view.phtml/aid/4553/fid/587">sort a tuple</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007300000000000000000">define a tuple with one element</a>.
</ul>
<h2 id="odbchelper.vardef">3.4. Declaring variables</h2>
<p>Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from <a href="#odbchelper">Chapter 2</a>, <code class="filename">odbchelper.py</code>.
<p>Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring
into existence by being assigned a value, and they are automatically destroyed when they go out of scope.
<div class="example"><h3 id="myparamsdef">Example 3.17. Defining the <code class="varname">myParams</code> Variable</h3><pre class="programlisting">
if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }</pre><p>Notice the indentation. An <code>if</code> statement is a code block and needs to be indented just like a function.
<p>Also notice that the variable assignment is one command split over several lines, with a backslash (&#8220;<code>\</code>&#8221;) serving as a line-continuation marker.<table id="tip.multiline" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">When a command is split among several lines with the line-continuation marker (&#8220;<code>\</code>&#8221;), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python <acronym>IDE</acronym> auto-indents the continued line, you should probably accept its default unless you have a burning reason not to.
</td>
</tr>
</table>
<p><a name="tip.implicitmultiline"></a>Strictly speaking, expressions in parentheses, square brackets, or curly braces (like <a href="#myparamsdef" title="Example 3.17. Defining the myParams Variable">defining a dictionary</a>) can be split into multiple lines with or without the line continuation character (&#8220;<code>\</code>&#8221;). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's
a matter of style.
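<p>To make the two continuation styles concrete, here is a small sketch that defines the same dictionary with and without backslashes, plus one expression where the backslash is genuinely needed because nothing is enclosing it. The variable names are illustrative.

```python
# Explicit continuation: each line ends with a backslash.
with_backslashes = {"server": "mpilgrim", \
                    "database": "master"}

# Implicit continuation: the open curly brace keeps the expression going.
without_backslashes = {"server": "mpilgrim",
                       "database": "master"}

# No brackets here, so the backslash is required.
total = 1 + \
        2
```

<p>Both dictionaries are equal. The backslash must be the very last character on its line, with nothing after it.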
<p>Third, you never declared the variable <code class="varname">myParams</code>; you just assigned a value to it. This is like VBScript without the <code class="option">option explicit</code> option. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception.
<h3>3.4.1. Referencing Variables</h3>
<div class="example"><h3 id="odbchelper.unboundvariable">Example 3.18. Referencing an Unbound Variable</h3><pre class="screen"><samp class="prompt">>>> </samp>x
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
NameError: There is no variable named 'x'</samp>
<samp class="prompt">>>> </samp>x = 1
<samp class="prompt">>>> </samp>x
1</pre><p>You will thank Python for this one day.
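<p>In a script, as opposed to the interactive shell, the same mistake can be caught with <code>try</code>...<code>except</code>. The sketch below assumes that the hypothetical name <code class="varname">never_assigned</code> has not been bound anywhere earlier in the program.

```python
try:
    never_assigned            # hypothetical name; nothing has bound it yet
    lookup_failed = False
except NameError:
    lookup_failed = True      # referencing an unbound name raises NameError

never_assigned = 1            # assignment brings the variable into existence
value_after = never_assigned  # now the lookup succeeds
```

<p>Catching <code>NameError</code> like this is mostly useful for demonstration; in real code the exception usually points to a typo that should be fixed rather than handled.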
<h3 id="odbchelper.multiassign">3.4.2. Assigning Multiple Values at Once</h3>
<p>One of the cooler programming shortcuts in Python is using sequences to assign multiple values at once.
<div class="example"><h3>Example 3.19. Assigning multiple values at once</h3><pre class="screen"><samp class="prompt">>>> </samp>v = ('a', 'b', 'e')
<samp class="prompt">>>> </samp>(x, y, z) = v <img id="odbchelper.multiassign.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>x
'a'
<samp class="prompt">>>> </samp>y
'b'
<samp class="prompt">>>> </samp>z
'e'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.multiassign.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">v</code> is a tuple of three elements, and <code>(x, y, z)</code> is a tuple of three variables. Assigning one to the other assigns each of the values of <code class="varname">v</code> to each of the variables, in order.
</td>
</tr>
</table>
<p>This has all sorts of uses. I often want to assign names to a range of values. In <acronym>C</acronym>, you would use <code>enum</code> and manually list each constant and its associated value, which seems especially tedious when the values are consecutive.
In Python, you can use the built-in <code class="function">range</code> function with multi-variable assignment to quickly assign consecutive values.
<div class="example"><h3 id="odbchelper.multiassign.range">Example 3.20. Assigning Consecutive Values</h3><pre class="screen"><samp class="prompt">>>> </samp>range(7) <img id="odbchelper.multiassign.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
[0, 1, 2, 3, 4, 5, 6]
<samp class="prompt">>>> </samp>(MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7) <img id="odbchelper.multiassign.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>MONDAY <img id="odbchelper.multiassign.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
0
<samp class="prompt">>>> </samp>TUESDAY
1
<samp class="prompt">>>> </samp>SUNDAY
6</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.multiassign.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The built-in <code class="function">range</code> function returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting
up to but not including the upper limit. (If you like, you can pass other parameters to specify a base other than <code class="constant">0</code> and a step other than <code class="constant">1</code>. You can <code>print range.__doc__</code> for details.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.multiassign.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">MONDAY</code>, <code class="varname">TUESDAY</code>, <code class="varname">WEDNESDAY</code>, <code class="varname">THURSDAY</code>, <code class="varname">FRIDAY</code>, <code class="varname">SATURDAY</code>, and <code class="varname">SUNDAY</code> are the variables you're defining. (This example came from the <code class="filename">calendar</code> module, a fun little module that prints calendars, like the <acronym>UNIX</acronym> program <code class="filename">cal</code>. The <code class="filename">calendar</code> module defines integer constants for days of the week.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.multiassign.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now each variable has its value: <code class="varname">MONDAY</code> is <code class="constant">0</code>, <code class="varname">TUESDAY</code> is <code class="constant">1</code>, and so forth.
</td>
</tr>
</table>
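<p>Two details of this technique are worth trying for yourself: <code class="function">range</code> can start from a base other than <code class="constant">0</code>, and the number of variables must match the number of values. The constant names below are invented for illustration.

```python
# Zero-based consecutive values.
(RED, GREEN, BLUE) = range(3)

# Passing a start value makes the numbering begin at 1 instead.
(FIRST, SECOND, THIRD) = range(1, 4)

# The counts must match: two variables cannot receive three values.
try:
    (x, y) = ('a', 'b', 'e')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

<p>Here <code class="varname">BLUE</code> is <code class="constant">2</code> and <code class="varname">THIRD</code> is <code class="constant">3</code>, and the mismatched assignment raises <code>ValueError</code>.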
<p>You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of
all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the <code class="filename">os</code> module, which is discussed in <a href="#filehandling">Chapter 6</a>.
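<p>As a sketch of the idea, here is a made-up function, <code class="function">min_and_max</code>, that returns two values as a tuple, followed by a call to <code class="function">os.path.split</code>, a real standard library function that returns a pair in the same way. The path string is just an example.

```python
import os.path

def min_and_max(values):
    """Return the smallest and largest items as a two-element tuple."""
    return (min(values), max(values))

result = min_and_max([4, 1, 9])       # keep the returned tuple whole
low, high = min_and_max([4, 1, 9])    # or unpack it into two variables

# os.path.split returns a pair: (directory part, file name part).
directory, filename = os.path.split("music/song.mp3")
```

<p>The caller decides which form is more convenient; the function itself is the same either way.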
<div class="itemizedlist">
<h3>Further Reading on Variables</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class="citetitle">Python Reference Manual</i></a> shows examples of <a href="http://www.python.org/doc/current/ref/implicit-joining.html">when you can skip the line continuation character</a> and <a href="http://www.python.org/doc/current/ref/explicit-joining.html">when you need to use it</a>.
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class="citetitle">How to Think Like a Computer Scientist</i></a> shows how to use multi-variable assignment to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap09.htm">swap the values of two variables</a>.
</ul>
<h2 id="odbchelper.stringformatting">3.5. Formatting Strings</h2>
<p>Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is
to insert values into a string with the <code>%s</code> placeholder.
<table id="compare.stringformatting.c" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">String formatting in Python uses the same syntax as the <code class="function">sprintf</code> function in <acronym>C</acronym>.
</td>
</tr>
</table>
<div class="example"><h3>Example 3.21. Introducing String Formatting</h3><pre class="screen"><samp class="prompt">>>> </samp>k = "uid"
<samp class="prompt">>>> </samp>v = "sa"
<samp class="prompt">>>> </samp>"%s=%s" % (k, v) <img id="odbchelper.stringformatting.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'uid=sa'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The whole expression evaluates to a string. The first <code>%s</code> is replaced by the value of <code class="varname">k</code>; the second <code>%s</code> is replaced by the value of <code class="varname">v</code>. All other characters in the string (in this case, the equal sign) stay as they are.
</td>
</tr>
</table>
<p>Note that <code>(k, v)</code> is a tuple. I told you they were good for something.
<p>You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that
string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
<div class="example"><h3 id="odbchelper.stringformatting.coerce">Example 3.22. String Formatting vs. Concatenating</h3><pre class="screen"><samp class="prompt">>>> </samp>uid = "sa"
<samp class="prompt">>>> </samp>pwd = "secret"
<samp class="prompt">>>> </samp>print pwd + " is not a good password for " + uid <img id="odbchelper.stringformatting.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
secret is not a good password for sa
<samp class="prompt">>>> </samp>print "%s is not a good password for %s" % (pwd, uid) <img id="odbchelper.stringformatting.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
secret is not a good password for sa
<samp class="prompt">>>> </samp>userCount = 6
<samp class="prompt">>>> </samp>print "Users connected: %d" % (userCount, ) <img id="odbchelper.stringformatting.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"> <img id="odbchelper.stringformatting.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
Users connected: 6
<samp class="prompt">>>> </samp>print "Users connected: " + userCount <img id="odbchelper.stringformatting.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>+</code> is the string concatenation operator.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In this trivial case, string formatting accomplishes the same result as concatenation.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>(userCount, )</code> is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a
tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the
comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether <code>(userCount)</code> was a tuple with one element or just the value of <code class="varname">userCount</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">String formatting works with integers by specifying <code>%d</code> instead of <code>%s</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works
only when everything is already a string.
</td>
</tr>
</table>
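<p>If you do want to concatenate a number onto a string, you have to convert it yourself with the built-in <code class="function">str</code> function. The sketch below compares the two approaches and also shows the difference the trailing comma makes; the variable names are illustrative.

```python
user_count = 6

formatted = "Users connected: %d" % (user_count,)   # formatting coerces the int
converted = "Users connected: " + str(user_count)   # explicit conversion first

try:
    "Users connected: " + user_count                # str + int is not allowed
    concat_failed = False
except TypeError:
    concat_failed = True

one_element = (user_count,)   # trailing comma: a tuple with one element
just_value = (user_count)     # no comma: just the integer in parentheses
```

<p>Both <code class="varname">formatted</code> and <code class="varname">converted</code> hold the same string, but only the formatting version works without an explicit conversion.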
<p>As with <code class="function">printf</code> in <acronym>C</acronym>, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
<div class="example"><h3 id="odbchelper.stringformatting.numbers">Example 3.23. Formatting Numbers</h3><pre class="screen">
<samp class="prompt">>>> </samp>print "Today's stock price: %f" % 50.4625 <img id="odbchelper.stringformatting.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
Today's stock price: 50.462500
<samp class="prompt">>>> </samp>print "Today's stock price: %.2f" % 50.4625 <img id="odbchelper.stringformatting.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
Today's stock price: 50.46
<samp class="prompt">>>> </samp>print "Change since yesterday: %+.2f" % 1.5 <img id="odbchelper.stringformatting.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
Change since yesterday: +1.50
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code>%f</code> string formatting option treats the value as a decimal, and prints it to six decimal places.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The ".2" modifier of the <code>%f</code> option rounds the value to two decimal places.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.stringformatting.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can even combine modifiers. Adding the <code>+</code> modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding
the value to exactly two decimal places.
</td>
</tr>
</table>
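<p>A few more combinations of these modifiers, collected in one place. This is only a sketch; the values are invented, and the last line adds a width with zero-padding, which is one of the techniques covered in the further reading.

```python
price = 50.4625

default_places = "%f" % price       # '50.462500': six decimal places
two_places = "%.2f" % price         # '50.46'
rounded_up = "%.2f" % 2.678         # '2.68': the value is rounded, not cut off
gain = "%+.2f" % 1.5                # '+1.50'
loss = "%+.2f" % -1.5               # '-1.50'
padded = "%08.2f" % price           # '00050.46': total width 8, zero-padded
```

<p>When there is only one value to format, as here, it can be given directly instead of being wrapped in a one-element tuple.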
<div class="itemizedlist">
<h3>Further Reading on String Formatting</h3>
<ul>
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/typesseq-strings.html">all the string formatting format characters</a>.
<li><a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Top"><i class="citetitle">Effective <acronym>AWK</acronym> Programming</i></a> discusses <a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Control+Letters">all the format characters</a> and advanced string formatting techniques like <a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Format+Modifiers">specifying width, precision, and zero-padding</a>.
</ul>
<h2 id="odbchelper.map">3.6. Mapping Lists</h2>
<p>One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each
of the elements of the list.
<div class="example"><h3>Example 3.24. Introducing List Comprehensions</h3><pre class="screen"><samp class="prompt">>>> </samp>li = [1, 9, 8, 4]
<samp class="prompt">>>> </samp>[elem*2 for elem in li] <img id="odbchelper.map.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
[2, 18, 16, 8]
<samp class="prompt">>>> </samp>li <img id="odbchelper.map.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
[1, 9, 8, 4]
<samp class="prompt">>>> </samp>li = [elem*2 for elem in li] <img id="odbchelper.map.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
[2, 18, 16, 8]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To make sense of this, look at it from right to left. <code class="varname">li</code> is the list you're mapping. Python loops through <code class="varname">li</code> one element at a time, temporarily assigning the value of each element to the variable <code class="varname">elem</code>. Python then applies the function <code><code class="varname">elem</code>*2</code> and appends that result to the returned list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that list comprehensions do not change the original list.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">It is safe to assign the result of a list comprehension to the variable that you're mapping. Python constructs the new list in memory, and when the list comprehension is complete, it assigns the result to the variable.
</td>
</tr>
</table>
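<p>The right-to-left reading described in the callouts can be spelled out as an explicit loop. The following is a minimal sketch rather than code from the program; the names <code class="varname">doubled</code> and <code class="varname">result</code> are made up for illustration.

```python
li = [1, 9, 8, 4]

# The list comprehension from Example 3.24.
doubled = [elem * 2 for elem in li]

# The same mapping written as an explicit loop: visit each element,
# apply the expression, and append the result to a new list.
result = []
for elem in li:
    result.append(elem * 2)

print(doubled == result)  # True
print(li)                 # [1, 9, 8, 4] -- the original list is unchanged
```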
<div class="informalexample">
<p>Here is the list comprehension in the <code class="function">buildConnectionString</code> function that you declared in <a href="#odbchelper">Chapter 2</a>:<pre class="programlisting">
["%s=%s" % (k, v) for k, v in params.items()]</pre><p>First, notice that you're calling the <code class="function">items</code> function of the <code class="varname">params</code> dictionary. This function returns a list of tuples of all the data in the dictionary.
<div class="example"><h3 id="odbchelper.items">Example 3.25. The <code class="function">keys</code>, <code class="function">values</code>, and <code class="function">items</code> Functions</h3><pre class="screen"><samp class="prompt">>>> </samp>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
<samp class="prompt">>>> </samp>params.keys() <img id="odbchelper.map.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
['server', 'uid', 'database', 'pwd']
<samp class="prompt">>>> </samp>params.values() <img id="odbchelper.map.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
['mpilgrim', 'sa', 'master', 'secret']
<samp class="prompt">>>> </samp>params.items() <img id="odbchelper.map.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
[('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">keys</code> method of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined
(remember that elements in a dictionary are unordered), but it is a list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">values</code> method returns a list of all the values. The list is in the same order as the list returned by <code class="function">keys</code>, so <code>params.values()[n] == params[params.keys()[n]]</code> for all values of <code class="varname">n</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">items</code> method returns a list of tuples of the form <code>(<i class="replaceable">key</i>, <i class="replaceable">value</i>)</code>. The list contains all the data in the dictionary.
</td>
</tr>
</table>
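<p>The relationship between the three methods can be checked directly. This is a small sketch, not part of <code class="filename">odbchelper.py</code>; it wraps each call in <code class="function">list</code> so that it should behave the same on newer Python versions, where these methods return view objects instead of lists. The names <code class="varname">keys</code>, <code class="varname">values</code>, <code class="varname">items</code>, and <code class="varname">pairs_match</code> are illustrative.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# list() is a no-op copy where these methods already return lists,
# and converts the view objects returned by newer Python versions.
keys = list(params.keys())
values = list(params.values())
items = list(params.items())

# keys and values line up position by position...
for n in range(len(keys)):
    assert values[n] == params[keys[n]]

# ...so pairing them up reproduces what items() returns.
pairs_match = list(zip(keys, values)) == items
print(pairs_match)  # True
```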
<p>Now let's see what <code class="function">buildConnectionString</code> does. It takes a list, <code><code class="varname">params</code>.<code class="function">items</code>()</code>, and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements
as <code><code class="varname">params</code>.<code class="function">items</code>()</code>, but each element in the new list will be a string that contains both a key and its associated value from the <code class="varname">params</code> dictionary.
<div class="example"><h3>Example 3.26. List Comprehensions in <code class="function">buildConnectionString</code>, Step by Step</h3><pre class="screen"><samp class="prompt">>>> </samp>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
<samp class="prompt">>>> </samp>params.items()
[('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]
<samp class="prompt">>>> </samp>[k for k, v in params.items()] <img id="odbchelper.map.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
['server', 'uid', 'database', 'pwd']
<samp class="prompt">>>> </samp>[v for k, v in params.items()] <img id="odbchelper.map.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
['mpilgrim', 'sa', 'master', 'secret']
<samp class="prompt">>>> </samp>["%s=%s" % (k, v) for k, v in params.items()] <img id="odbchelper.map.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that you're using two variables to iterate through the <code>params.items()</code> list. This is another use of <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a>. The first element of <code>params.items()</code> is <code>('server', 'mpilgrim')</code>, so in the first iteration of the list comprehension, <code class="varname">k</code> will get <code>'server'</code> and <code class="varname">v</code> will get <code>'mpilgrim'</code>. In this case, you're ignoring the value of <code class="varname">v</code> and only including the value of <code class="varname">k</code> in the returned list, so this list comprehension ends up being equivalent to <code><code class="varname">params</code>.<code class="function">keys</code>()</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you're doing the same thing, but ignoring the value of <code class="varname">k</code>, so this list comprehension ends up being equivalent to <code><code class="varname">params</code>.<code class="function">values</code>()</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.map.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Combining the previous two examples with some simple <a href="#odbchelper.stringformatting" title="3.5. Formatting Strings">string formatting</a>, you get a list of strings that include both the key and value of each element of the dictionary. This looks suspiciously
like the <a href="#odbchelper.output">output</a> of the program. All that remains is to join the elements in this list into a single string.
</td>
</tr>
</table>
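<p>As a quick check of the claim that the mapped list has one string per dictionary entry, here is a hedged sketch. The variable name <code class="varname">formatted</code> is an assumption for illustration, and the result is sorted only so that the printed order does not depend on how the dictionary happens to order its keys.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

formatted = ["%s=%s" % (k, v) for k, v in params.items()]

# One output string per (key, value) pair in the dictionary.
print(len(formatted) == len(params))  # True

# Sorting makes the display independent of dictionary ordering.
print(sorted(formatted))
# ['database=master', 'pwd=secret', 'server=mpilgrim', 'uid=sa']
```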
<div class="itemizedlist">
<h3>Further Reading on List Comprehensions</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses another way to map lists <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007130000000000000000">using the built-in <code class="function">map</code> function</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007140000000000000000">do nested list comprehensions</a>.
</ul>
<h2 id="odbchelper.join">3.7. Joining Lists and Splitting Strings</h2>
<p>You have a list of key-value pairs in the form <code><i class="replaceable">key</i>=<i class="replaceable">value</i></code>, and you want to join them into a single string. To join any list of strings into a single string, use the <code class="function">join</code> method of a string object.
<div class="informalexample">
<p>Here is an example of joining a list from the <code class="function">buildConnectionString</code> function:<pre class="programlisting">
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])</pre><p>One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything
is an object. You might have thought I meant that string <em>variables</em> are objects. But no, look closely at this example and you'll see that the string <code>";"</code> itself is an object, and you are calling its <code class="function">join</code> method.
<p>The <code class="function">join</code> method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't
need to be a semi-colon; it doesn't even need to be a single character. It can be any string.<table id="tip.join" class="caution" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/caution.png" alt="Caution" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code class="function">join</code> works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements
will raise an exception.
</td>
</tr>
</table>
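<p>To see the caution in action, the sketch below tries to join a list that contains an integer and then repairs it by converting every element with <code class="function">str</code> first. The list <code class="varname">mixed</code> and the flag <code class="varname">joined_ok</code> are hypothetical names used only here.

```python
mixed = ["uid", 42, "pwd"]  # hypothetical list with one non-string element

try:
    ";".join(mixed)
    joined_ok = True
except TypeError:
    # join does no type coercion, so the integer raises a TypeError.
    joined_ok = False

# Converting each element to a string first makes the join succeed.
joined = ";".join([str(elem) for elem in mixed])

print(joined_ok)  # False
print(joined)     # uid;42;pwd
```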
<div class="example"><h3 id="odbchelper.join.example">Example 3.27. Output of <code class="filename">odbchelper.py</code></h3><pre class="screen"><samp class="prompt">>>> </samp>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
<samp class="prompt">>>> </samp>["%s=%s" % (k, v) for k, v in params.items()]
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
<samp class="prompt">>>> </samp>";".join(["%s=%s" % (k, v) for k, v in params.items()])
'server=mpilgrim;uid=sa;database=master;pwd=secret'</pre><p>This string is then returned from the <code class="function">buildConnectionString</code> function and printed by the calling block, which gives you the output that you marveled at in <a href="#odbchelper">Chapter 2</a>.
<p>You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's
called <code class="function">split</code>.
<div class="example"><h3 id="odbchelper.split.example">Example 3.28. Splitting a String</h3><pre class="screen"><samp class="prompt">>>> </samp>li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
<samp class="prompt">>>> </samp>s = ";".join(li)
<samp class="prompt">>>> </samp>s
'server=mpilgrim;uid=sa;database=master;pwd=secret'
<samp class="prompt">>>> </samp>s.split(";") <img id="odbchelper.join.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
<samp class="prompt">>>> </samp>s.split(";", 1) <img id="odbchelper.join.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.join.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">split</code> reverses <code class="function">join</code> by splitting a string into a multi-element list. Note that the delimiter (&#8220;<code>;</code>&#8221;) is stripped out completely; it does not appear in any of the elements of the returned list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#odbchelper.join.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">split</code> takes an optional second argument, which is the number of times to split. (&#8220;Oooooh, optional arguments...&#8221; You'll learn how to do this in your own functions in the next chapter.)
</td>
</tr>
</table>
</div><table id="tip.split" class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code><i class="replaceable">anystring</i>.<code class="function">split</code>(<i class="replaceable">delimiter</i>, 1)</code> is a useful technique when you want to search a string for a substring and then work with everything before the substring
(which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
</td>
</tr>
</table>
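<p>One possible use of this tip is turning the connection string back into a dictionary. The sketch below is an illustration under that assumption, not something <code class="filename">odbchelper.py</code> does; <code class="varname">pairs</code>, <code class="varname">rebuilt</code>, <code class="varname">before</code>, and <code class="varname">after</code> are made-up names.

```python
s = "server=mpilgrim;uid=sa;database=master;pwd=secret"

# Split the string into key=value items, then split each item
# on the first "=" only.
pairs = [item.split("=", 1) for item in s.split(";")]
rebuilt = dict([(k, v) for k, v in pairs])
print(rebuilt["server"])  # mpilgrim

# Limiting the split to 1 keeps a value intact even when the value
# itself contains the delimiter.
before, after = "pwd=a=b".split("=", 1)
print(before)  # pwd
print(after)   # a=b
```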
<div class="itemizedlist">
<h3>Further Reading on String Methods</h3>
<ul>
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/480">common questions about strings</a> and has a lot of <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/539">example code using strings</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/string-methods.html">all the string methods</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-string.html"><code class="filename">string</code> module</a>.
<li><a href="http://www.python.org/doc/FAQ.html"><i class="citetitle">The Whole Python <acronym>FAQ</acronym></i></a> explains <a href="http://www.python.org/cgi-bin/faqw.py?query=4.96&amp;querytype=simple&amp;casefold=yes&amp;req=search">why <code class="function">join</code> is a string method</a> instead of a list method.
</ul>
<h3>3.7.1. Historical Note on String Methods</h3>
<p>When I first learned Python, I expected <code class="function">join</code> to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story
behind the <code class="function">join</code> method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate <code class="filename">string</code> module that contained all the string functions; each function took a string as its first argument. The functions were deemed
important enough to put onto the strings themselves, which made sense for functions like <code class="function">lower</code>, <code class="function">upper</code>, and <code class="function">split</code>. But many hard-core Python programmers objected to the new <code class="function">join</code> method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of
the old <code class="filename">string</code> module (which still has a lot of useful stuff in it). I use the new <code class="function">join</code> method exclusively, but you will see code written either way, and if it really bothers you, you can use the old <code class="function">string.join</code> function instead.
<h2 id="odbchelper.summary">3.8. Summary</h2>
<p>The <code class="filename">odbchelper.py</code> program and its output should now make perfect sense.
<div class="informalexample"><pre class="programlisting">
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.
    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])

if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }
    print buildConnectionString(myParams)</pre><div class="informalexample">
<p>Here is the output of <code class="filename">odbchelper.py</code>:<pre class="screen">server=mpilgrim;uid=sa;database=master;pwd=secret</pre><div class="highlights">
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
<div class="itemizedlist">
<ul>
<li>Using the Python <acronym>IDE</acronym> to test expressions interactively
<li>Writing Python programs and <a href="#odbchelper.testing" title="2.6. Testing Modules">running them from within your <acronym>IDE</acronym></a>, or from the command line
<li><a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's doc string">Importing modules</a> and calling their functions
<li><a href="#odbchelper.funcdef" title="2.2. Declaring Functions">Declaring functions</a> and using <a href="#odbchelper.docstring" title="2.3. Documenting Functions"><code>doc string</code>s</a>, <a href="#odbchelper.vardef" title="3.4. Declaring variables">local variables</a>, and <a href="#odbchelper.indenting" title="2.5. Indenting Code">proper indentation</a>
<li>Defining <a href="#odbchelper.dict" title="3.1. Introducing Dictionaries">dictionaries</a>, <a href="#odbchelper.tuple" title="3.3. Introducing Tuples">tuples</a>, and <a href="#odbchelper.list" title="3.2. Introducing Lists">lists</a>
<li>Accessing attributes and methods of <a href="#odbchelper.objects" title="2.4. Everything Is an Object">any object</a>, including strings, lists, dictionaries, functions, and modules
<li>Concatenating values through <a href="#odbchelper.stringformatting" title="3.5. Formatting Strings">string formatting</a>
<li><a href="#odbchelper.map" title="3.6. Mapping Lists">Mapping lists</a> into other lists using list comprehensions
<li><a href="#odbchelper.join" title="3.7. Joining Lists and Splitting Strings">Splitting strings</a> into lists and joining lists into strings
</ul>
<div class="chapter">
<h2 id="apihelper">Chapter 4. The Power Of Introspection</h2>
<p>This chapter covers one of Python's strengths: introspection. As you know, <a href="#odbchelper.objects" title="2.4. Everything Is an Object">everything in Python is an object</a>, and introspection is code looking at other modules and functions in memory as objects, getting information about them, and
manipulating them. Along the way, you'll define functions with no name, call functions with arguments out of order, and reference
functions whose names you don't even know ahead of time.
<h2 id="apihelper.divein">4.1. Diving In</h2>
<p>Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered
in <a href="#odbchelper" title="Chapter 2. Your First Python Program">Chapter 2, <i>Your First Python Program</i></a>. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter.
<div class="example"><h3>Example 4.1. <code class="filename">apihelper.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
def info(object, spacing=10, collapse=1): <img id="apihelper.intro.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"> <img id="apihelper.intro.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"> <img id="apihelper.intro.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
    """Print methods and doc strings.
    Takes module, class, list, dictionary, or string."""
    methodList = [method for method in dir(object) if callable(getattr(object, method))]
    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    print "\n".join(["%s %s" %
                     (method.ljust(spacing),
                      processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

if __name__ == "__main__": <img id="apihelper.intro.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"> <img id="apihelper.intro.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
    print info.__doc__</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.intro.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This module has one function, <code class="function">info</code>. According to its <a href="#odbchelper.funcdef" title="2.2. Declaring Functions">function declaration</a>, it takes three parameters: <code class="varname">object</code>, <code class="varname">spacing</code>, and <code class="varname">collapse</code>. The last two are actually optional parameters, as you'll see shortly.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.intro.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">info</code> function has a multi-line <a href="#odbchelper.docstring" title="2.3. Documenting Functions"><code>doc string</code></a> that succinctly describes the function's purpose. Note that no return value is mentioned; this function will be used solely
for its effects, rather than its value.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.intro.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Code within the function is <a href="#odbchelper.indenting" title="2.5. Indenting Code">indented</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.intro.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code>if __name__</code> <a href="#odbchelper.ifnametrick">trick</a> allows this program to do something useful when run by itself, without interfering with its use as a module for other programs.
In this case, the program simply prints out the <code>doc string</code> of the <code class="function">info</code> function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.intro.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><a href="#odbchelper.ifnametrick"><code>if</code> statements</a> use <code>==</code> for comparison, and parentheses are not required.
</td>
</tr>
</table>
<p>The <code class="function">info</code> function is designed to be used by you, the programmer, while working in the Python <acronym>IDE</acronym>. It takes any object that has functions or methods (like a module, which has functions, or a list, which has methods) and
prints out the functions and their <code>doc string</code>s.
<div class="example"><h3>Example 4.2. Sample Usage of <code class="filename">apihelper.py</code></h3><pre class="screen"><samp class="prompt">>>> </samp>from apihelper import info
<samp class="prompt">>>> </samp>li = []
<samp class="prompt">>>> </samp>info(li)
<samp class="computeroutput">append L.append(object) -- append object to end
count L.count(value) -> integer -- return number of occurrences of value
extend L.extend(list) -- extend list by appending list elements
index L.index(value) -> integer -- return index of first occurrence of value
insert L.insert(index, object) -- insert object before index
pop L.pop([index]) -> item -- remove and return item at index (default last)
remove L.remove(value) -- remove first occurrence of value
reverse L.reverse() -- reverse *IN PLACE*
sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1</samp></pre><p>By default the output is formatted to be easy to read. Multi-line <code>doc string</code>s are collapsed into a single long line, but this option can be changed by specifying <code class="constant">0</code> for the <i class="parameter"><code>collapse</code></i> argument. If the function names are longer than 10 characters, you can specify a larger value for the <i class="parameter"><code>spacing</code></i> argument to make the output easier to read.
<div class="example"><h3>Example 4.3. Advanced Usage of <code class="filename">apihelper.py</code></h3><pre class="screen"><samp class="prompt">>>> </samp>import odbchelper
<samp class="prompt">>>> </samp>info(odbchelper)
buildConnectionString Build a connection string from a dictionary Returns string.
<samp class="prompt">>>> </samp>info(odbchelper, 30)
buildConnectionString Build a connection string from a dictionary Returns string.
<samp class="prompt">>>> </samp>info(odbchelper, 30, 0)
<samp class="computeroutput">buildConnectionString Build a connection string from a dictionary
Returns string.
</samp></pre><h2 id="apihelper.optional">4.2. Using Optional and Named Arguments</h2>
<p>Python allows function arguments to have default values; if the function is called without the argument, the argument gets its default
value. Furthermore, arguments can be specified in any order by using named arguments. Stored procedures in SQL Server Transact/<acronym>SQL</acronym> can do this, so if you're a SQL Server scripting guru, you can skim this part.
<div class="informalexample">
<p>Here is an example of <code class="function">info</code>, a function with two optional arguments:<pre class="programlisting">
def info(object, spacing=10, collapse=1):</pre><p><code class="varname">spacing</code> and <code class="varname">collapse</code> are optional, because they have default values defined. <code class="varname">object</code> is required, because it has no default value. If <code class="function">info</code> is called with only one argument, <code class="varname">spacing</code> defaults to <code class="constant">10</code> and <code class="varname">collapse</code> defaults to <code class="constant">1</code>. If <code class="function">info</code> is called with two arguments, <code class="varname">collapse</code> still defaults to <code class="constant">1</code>.
<p>Say you want to specify a value for <code class="varname">collapse</code> but want to accept the default value for <code class="varname">spacing</code>. In most languages, you would be out of luck, because you would need to call the function with three arguments. But in
Python, arguments can be specified by name, in any order.
<div class="example"><h3>Example 4.4. Valid Calls of <code class="function">info</code></h3><pre class="programlisting">
info(odbchelper) <img id="apihelper_args.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
info(odbchelper, 12) <img id="apihelper_args.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
info(odbchelper, collapse=0) <img id="apihelper_args.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
info(spacing=15, object=odbchelper) <img id="apihelper_args.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper_args.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">With only one argument, <code class="varname">spacing</code> gets its default value of <code>10</code> and <code class="varname">collapse</code> gets its default value of <code class="constant">1</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper_args.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">With two arguments, <code class="varname">collapse</code> gets its default value of <code class="constant">1</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper_args.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you are naming the <code class="varname">collapse</code> argument explicitly and specifying its value. <code class="varname">spacing</code> still gets its default value of <code>10</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper_args.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Even required arguments (like <code class="varname">object</code>, which has no default value) can be named, and named arguments can appear in any order.
</td>
</tr>
</table>
<p>This looks totally whacked until you realize that arguments are simply a dictionary. The &#8220;normal&#8221; method of calling functions without argument names is actually just a shorthand where Python matches up the values with the argument names in the order they're specified in the function declaration. And most of the
time, you'll call functions the &#8220;normal&#8221; way, but you always have the additional flexibility if you need it.<table id="tip.arguments" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">The only thing you need to do to call a function is specify a value (somehow) for each required argument; the manner and order
in which you do that is up to you.
</td>
</tr>
</table>
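<p>The four calling styles can be tried without <code class="filename">apihelper.py</code> at hand. In the sketch below, <code class="function">describe</code> is a hypothetical stand-in with the same argument layout as <code class="function">info</code>; it simply returns the values it received so you can see how Python matched them up. The flag <code class="varname">missing_ok</code> is also illustrative.

```python
def describe(obj, spacing=10, collapse=1):
    """Hypothetical stand-in for info; returns the values it received."""
    return (obj, spacing, collapse)

print(describe("x"))                  # ('x', 10, 1)
print(describe("x", 12))              # ('x', 12, 1)
print(describe("x", collapse=0))      # ('x', 10, 0)
print(describe(spacing=15, obj="x"))  # ('x', 15, 1)

# Leaving out the required argument is the one thing that fails.
try:
    describe(spacing=15)
    missing_ok = True
except TypeError:
    missing_ok = False
print(missing_ok)  # False
```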
<div class="itemizedlist">
<h3>Further Reading on Optional Arguments</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses exactly <a href="http://www.python.org/doc/current/tut/node6.html#SECTION006710000000000000000">when and how default arguments are evaluated</a>, which matters when the default value is a list or an expression with side effects.
</ul>
<h2 id="apihelper.builtin">4.3. Using <code class="function">type</code>, <code class="function">str</code>, <code class="function">dir</code>, and Other Built-In Functions</h2>
<p>Python has a small set of extremely useful built-in functions. All other functions are partitioned off into modules. This was
actually a conscious design decision, to keep the core language from getting bloated like other scripting languages (cough
cough, Visual Basic).
<h3>4.3.1. The <code class="function">type</code> Function</h3>
<p>The <code class="function">type</code> function returns the datatype of any arbitrary object. The possible types are listed in the <code class="filename">types</code> module. This is useful for helper functions that can handle several types of data.
<div class="example"><h3 id="apihelper.type.intro">Example 4.5. Introducing <code class="function">type</code></h3><pre class="screen"><samp class="prompt">>>> </samp>type(1) <img id="apihelper.builtin.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;type 'int'>
<samp class="prompt">>>> </samp>li = []
<samp class="prompt">>>> </samp>type(li) <img id="apihelper.builtin.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;type 'list'>
<samp class="prompt">>>> </samp>import odbchelper
<samp class="prompt">>>> </samp>type(odbchelper) <img id="apihelper.builtin.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;type 'module'>
<samp class="prompt">>>> </samp>import types <img id="apihelper.builtin.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>type(odbchelper) == types.ModuleType
True</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">type</code> takes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions,
classes, modules, even types are acceptable.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">type</code> can take a variable and return its datatype.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">type</code> also works on modules.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can use the constants in the <code class="filename">types</code> module to compare types of objects. This is what the <code class="function">info</code> function does, as you'll see shortly.
</td>
</tr>
</table>
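<p>As a rough illustration of a helper that handles several types of data, here is a minimal sketch. The <code class="function">describe</code> function is hypothetical (it is not part of <code class="filename">apihelper.py</code>), and it compares against the built-in type names rather than the constants in the <code class="filename">types</code> module, which is an assumption made to keep the example short.

```python
# Hypothetical helper; describe() does not appear anywhere in apihelper.py.
def describe(value):
    """Return a short description chosen by the value's datatype."""
    kind = type(value)
    if kind == int:
        return "integer %d" % value
    if kind == str:
        return "string of length %d" % len(value)
    if kind in (list, tuple):
        return "%s with %d items" % (kind.__name__, len(value))
    # Anything else: fall back to the name of the type itself.
    return "something of type %s" % kind.__name__

print(describe(1))
print(describe("war"))
print(describe(["a", "b"]))
print(describe(None))
```

<p>The last branch relies on the fact that every type has a <code>__name__</code> attribute, so the helper never fails on a type it does not know about.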
<h3>4.3.2. The <code class="function">str</code> Function</h3>
<p>The <code class="function">str</code> function coerces data into a string. Every datatype can be coerced into a string.
<div class="example"><h3 id="apihelper.str.intro">Example 4.6. Introducing <code class="function">str</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>str(1) <img id="apihelper.builtin.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'1'
<samp class="prompt">>>> </samp>horsemen = ['war', 'pestilence', 'famine']
<samp class="prompt">>>> </samp>horsemen
['war', 'pestilence', 'famine']
<samp class="prompt">>>> </samp>horsemen.append('Powerbuilder')
<samp class="prompt">>>> </samp>str(horsemen) <img id="apihelper.builtin.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
"['war', 'pestilence', 'famine', 'Powerbuilder']"
<samp class="prompt">>>> </samp>str(odbchelper) <img id="apihelper.builtin.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
"&lt;module 'odbchelper' from 'c:\\docbook\\dip\\py\\odbchelper.py'>"
<samp class="prompt">>>> </samp>str(None) <img id="apihelper.builtin.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'None'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">For simple datatypes like integers, you would expect <code class="function">str</code> to work, because almost every language has a function to convert an integer to a string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">However, <code class="function">str</code> works on any object of any type. Here it works on a list which you've constructed in bits and pieces.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">str</code> also works on modules. Note that the string representation of the module includes the pathname of the module on disk, so
yours will be different.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">A subtle but important behavior of <code class="function">str</code> is that it works on <code>None</code>, the Python null value. It returns the string <code>'None'</code>. You'll use this to your advantage in the <code class="function">info</code> function, as you'll see shortly.
</td>
</tr>
</table>
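<p>To see why the <code>None</code> case matters, consider a function that has no <code>doc string</code>. The sketch below uses two hypothetical functions, <code class="function">no_docs</code> and <code class="function">with_docs</code>, invented for this illustration.

```python
def no_docs():
    pass

def with_docs():
    """Say something useful."""

# A function without a doc string has __doc__ set to None.
for func in (no_docs, with_docs):
    doc = str(func.__doc__)      # str(None) gives 'None', never an error
    print("%s: %s" % (func.__name__, doc.strip()))
```

<p>Because <code class="function">str</code> always returns a string, the call to <code class="function">strip</code> is safe for both functions. Calling <code>None.strip()</code> directly would raise an exception.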
<p>At the heart of the <code class="function">info</code> function is the powerful <code class="function">dir</code> function. <code class="function">dir</code> returns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much
anything.
<div class="example"><h3 id="apihelper.dir.intro">Example 4.7. Introducing <code class="function">dir</code></h3><pre class="screen"><samp class="prompt">>>> </samp>li = []
<samp class="prompt">>>> </samp>dir(li) <img id="apihelper.builtin.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">['append', 'count', 'extend', 'index', 'insert',
'pop', 'remove', 'reverse', 'sort']</samp>
<samp class="prompt">>>> </samp>d = {}
<samp class="prompt">>>> </samp>dir(d) <img id="apihelper.builtin.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
['clear', 'copy', 'get', 'has_key', 'items', 'keys', 'setdefault', 'update', 'values']
<samp class="prompt">>>> </samp>import odbchelper
<samp class="prompt">>>> </samp>dir(odbchelper) <img id="apihelper.builtin.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
['__builtins__', '__doc__', '__file__', '__name__', 'buildConnectionString']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">li</code> is a list, so <code><code class="function">dir</code>(<code class="varname">li</code>)</code> returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not
the methods themselves.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">d</code> is a dictionary, so <code><code class="function">dir</code>(<code class="varname">d</code>)</code> returns a list of the names of dictionary methods. At least one of these, <a href="#odbchelper.items" title="Example 3.25. The keys, values, and items Functions"><code class="function">keys</code></a>, should look familiar.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is where it really gets interesting. <code class="filename">odbchelper</code> is a module, so <code><code class="function">dir</code>(<code class="filename">odbchelper</code>)</code> returns a list of all kinds of stuff defined in the module, including built-in attributes, like <a href="#odbchelper.ifnametrick"><code>__name__</code></a>, <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's doc string"><code>__doc__</code></a>, and whatever other attributes and methods you define. In this case, <code class="filename">odbchelper</code> has only one user-defined method, the <code class="function">buildConnectionString</code> function described in <a href="#odbchelper">Chapter 2</a>.
</td>
</tr>
</table>
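<p>Since <code class="function">dir</code> returns plain strings, you can process its result like any other list. The following sketch uses the standard <code class="filename">string</code> module as a stand-in for <code class="filename">odbchelper</code> (an assumption, so that the example runs without the book's files) and separates the double-underscore names from the rest. On more recent Python versions, <code class="function">dir</code> typically lists more double-underscore names than the output shown above.

```python
import string

names = dir(string)            # a list of attribute names, as strings

public = []
for name in names:
    if not name.startswith("_"):
        public.append(name)    # keep only the names without a leading underscore

print(public)
```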
<p>Finally, the <code class="function">callable</code> function takes any object and returns <code class="constant">True</code> if the object can be called, or <code class="constant">False</code> otherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.)
<div class="example"><h3 id="apihelper.builtin.callable">Example 4.8. Introducing <code class="function">callable</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import string
<samp class="prompt">>>> </samp>string.punctuation <img id="apihelper.builtin.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'!"#$%&amp;\'()*+,-./:;&lt;=>?@[\\]^_`{|}~'
<samp class="prompt">>>> </samp>string.join <img id="apihelper.builtin.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;function join at 00C55A7C>
<samp class="prompt">>>> </samp>callable(string.punctuation) <img id="apihelper.builtin.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
False
<samp class="prompt">>>> </samp>callable(string.join) <img id="apihelper.builtin.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
True
<samp class="prompt">>>> </samp>print string.join.__doc__ <img id="apihelper.builtin.4.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="computeroutput">join(list [,sep]) -> string
Return a string composed of the words in list, with
intervening occurrences of sep. The default separator is a
single space.
(joinfields and join are synonymous)</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The functions in the <code class="filename">string</code> module are deprecated (although many people still use the <code class="function">join</code> function), but the module contains a lot of useful constants like this <code class="varname">string.punctuation</code>, which contains all the standard punctuation characters.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><a href="#odbchelper.join" title="3.7. Joining Lists and Splitting Strings"><code class="function">string.join</code></a> is a function that joins a list of strings.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">string.punctuation</code> is not callable; it is a string. (A string does have callable methods, but the string itself is not callable.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">string.join</code> is callable; it's a function that takes two arguments.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.builtin.4.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Any callable object may have a <code>doc string</code>. By using the <code class="function">callable</code> function on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes)
and which you want to ignore (constants and so on) without knowing anything about the object ahead of time.
</td>
</tr>
</table>
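<p>Here is a small sketch that applies <code class="function">callable</code> to a mixed bag of objects. The list of candidates is arbitrary and chosen only for illustration; <code class="function">string.capwords</code> is used in place of <code class="function">string.join</code> on the assumption that you may be running a Python version where the latter is no longer available.

```python
import string

candidates = [
    string.punctuation,   # a string constant
    string.capwords,      # a function defined in the string module
    len,                  # a built-in function
    str,                  # a type, which can be called to build strings
    "text".upper,         # a method bound to a particular string
    42,                   # an integer
    None,                 # the null value
]

for candidate in candidates:
    print("%r callable? %s" % (candidate, callable(candidate)))
```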
<h3>4.3.3. Built-In Functions</h3>
<p><code class="function">type</code>, <code class="function">str</code>, <code class="function">dir</code>, and all the rest of Python's built-in functions are grouped into a special module called <code class="filename">__builtin__</code>. (That's two underscores before and after.) If it helps, you can think of Python automatically executing <code>from __builtin__ import *</code> on startup, which imports all the &#8220;built-in&#8221; functions into the namespace so you can use them directly.
<p>The advantage of thinking like this is that you can access all the built-in functions and attributes as a group by getting
information about the <code class="filename">__builtin__</code> module. And guess what, you already have a function for that: <code class="function">info</code>, from <code class="filename">apihelper</code>. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the
built-in error classes, like <a href="#odbchelper.tuplemethods" title="Example 3.16. Tuples Have No Methods"><code class="errorcode">AttributeError</code></a>, should already look familiar.)
<div class="example"><h3 id="apihelper.builtin.list">Example 4.9. Built-in Attributes and Functions</h3><pre class="screen"><samp class="prompt">>>> </samp>from apihelper import info
<samp class="prompt">>>> </samp>import __builtin__
<samp class="prompt">>>> </samp>info(__builtin__, 20)
<samp class="computeroutput">ArithmeticError Base class for arithmetic errors.
AssertionError Assertion failed.
AttributeError Attribute not found.
EOFError Read beyond end of file.
EnvironmentError Base class for I/O related errors.
Exception Common base class for all exceptions.
FloatingPointError Floating point operation failed.
IOError I/O operation failed.
[...snip...]</samp></pre></div><table id="tip.manuals" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Python comes with excellent reference manuals, which you should peruse thoroughly to learn all the modules Python has to offer. But unlike most languages, where you would find yourself referring back to the manuals or man pages to remind
yourself how to use these modules, Python is largely self-documenting.
</td>
</tr>
</table>
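<p>If you don't have <code class="filename">apihelper</code> at hand, you can still confirm that the built-in functions live in a module of their own. This sketch assumes that newer Python versions name the module <code class="filename">builtins</code> instead of <code class="filename">__builtin__</code>, and tries both; the alias <code class="varname">builtins_module</code> is made up for this example.

```python
try:
    import __builtin__ as builtins_module   # the name used in this chapter
except ImportError:
    import builtins as builtins_module      # assumption: renamed in newer versions

# The len you call directly is the very same object stored in the module.
print(builtins_module.len is len)

# Built-in exceptions live there too.
print("AttributeError" in dir(builtins_module))
```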
<div class="itemizedlist">
<h3>Further Reading on Built-In Functions</h3>
<ul>
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents <a href="http://www.python.org/doc/current/lib/built-in-funcs.html">all the built-in functions</a> and <a href="http://www.python.org/doc/current/lib/module-exceptions.html">all the built-in exceptions</a>.
</ul>
<h2 id="apihelper.getattr">4.4. Getting Object References With <code class="function">getattr</code></h2>
<p>You already know that <a href="#odbchelper.objects" title="2.4. Everything Is an Object">Python functions are objects</a>. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the
<code class="function">getattr</code> function.
<div class="example"><h3 id="apihelper.getattr.intro">Example 4.10. Introducing <code class="function">getattr</code></h3><pre class="screen"><samp class="prompt">>>> </samp>li = ["Larry", "Curly"]
<samp class="prompt">>>> </samp>li.pop <img id="apihelper.getattr.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;built-in method pop of list object at 010DF884>
<samp class="prompt">>>> </samp>getattr(li, "pop") <img id="apihelper.getattr.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;built-in method pop of list object at 010DF884>
<samp class="prompt">>>> </samp>getattr(li, "append")("Moe") <img id="apihelper.getattr.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>li
['Larry', 'Curly', 'Moe']
<samp class="prompt">>>> </samp>getattr({}, "clear") <img id="apihelper.getattr.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;built-in method clear of dictionary object at 00F113D4>
<samp class="prompt">>>> </samp>getattr((), "pop") <img id="apihelper.getattr.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'pop'</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This gets a reference to the <code class="function">pop</code> method of the list. Note that this is not calling the <code class="function">pop</code> method; that would be <code>li.pop()</code>. This is the method itself.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This also returns a reference to the <code class="function">pop</code> method, but this time, the method name is specified as a string argument to the <code class="function">getattr</code> function. <code class="function">getattr</code> is an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list,
and the attribute is the <code class="function">pop</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In case it hasn't sunk in just how incredibly useful this is, try this: the return value of <code class="function">getattr</code> <em>is</em> the method, which you can then call just as if you had said <code>li.append("Moe")</code> directly. But you didn't call the function directly; you specified the function name as a string instead.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">getattr</code> also works on dictionaries.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In theory, <code class="function">getattr</code> would work on tuples, except that <a href="#odbchelper.tuplemethods" title="Example 3.16. Tuples Have No Methods">tuples have no methods</a>, so <code class="function">getattr</code> will raise an exception no matter what attribute name you give.
</td>
</tr>
</table>
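<p>The same steps can be written as a short script. The variable names <code class="varname">stooges</code> and <code class="varname">method_name</code> are invented for this sketch; the point is only that the method name is an ordinary string that could come from anywhere.

```python
stooges = ["Larry", "Curly"]

method_name = "append"                  # could come from user input or a file
method = getattr(stooges, method_name)  # the same method as stooges.append
method("Moe")
print(stooges)

# Look up pop by name and call it in one expression.
print(getattr(stooges, "pop")())
print(stooges)
```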
<h3>4.4.1. <code class="function">getattr</code> with Modules</h3>
<p><code class="function">getattr</code> isn't just for built-in datatypes. It also works on modules.
<div class="example"><h3 id="apihelper.getattr.example">Example 4.11. The <code class="function">getattr</code> Function in <code class="filename">apihelper.py</code></h3><pre class="screen"><samp class="prompt">>>> </samp>import odbchelper
<samp class="prompt">>>> </samp>odbchelper.buildConnectionString <img id="apihelper.getattr.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;function buildConnectionString at 00D18DD4>
<samp class="prompt">>>> </samp>getattr(odbchelper, "buildConnectionString") <img id="apihelper.getattr.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;function buildConnectionString at 00D18DD4>
<samp class="prompt">>>> </samp>object = odbchelper
<samp class="prompt">>>> </samp>method = "buildConnectionString"
<samp class="prompt">>>> </samp>getattr(object, method) <img id="apihelper.getattr.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;function buildConnectionString at 00D18DD4>
<samp class="prompt">>>> </samp>type(getattr(object, method)) <img id="apihelper.getattr.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;type 'function'>
<samp class="prompt">>>> </samp>import types
<samp class="prompt">>>> </samp>type(getattr(object, method)) == types.FunctionType
True
<samp class="prompt">>>> </samp>callable(getattr(object, method)) <img id="apihelper.getattr.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
True</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This returns a reference to the <code class="function">buildConnectionString</code> function in the <code class="filename">odbchelper</code> module, which you studied in <a href="#odbchelper" title="Chapter 2. Your First Python Program">Chapter 2, <i>Your First Python Program</i></a>. (The hex address you see is specific to my machine; your output will be different.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using <code class="function">getattr</code>, you can get the same reference to the same function. In general, <code><code class="function">getattr</code>(<i class="replaceable">object</i>, "<i class="replaceable">attribute</i>")</code> is equivalent to <code><i class="replaceable">object</i>.<i class="replaceable">attribute</i></code>. If <i class="replaceable"><code>object</code></i> is a module, then <i class="replaceable"><code>attribute</code></i> can be anything defined in the module: a function, class, or global variable.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And this is what you actually use in the <code class="function">info</code> function. <code class="varname">object</code> is passed into the function as an argument; <code class="varname">method</code> is a string which is the name of a method or function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In this case, <code class="varname">method</code> is the name of a function, which you can prove by getting its <a href="#apihelper.type.intro" title="Example 4.5. Introducing type"><code class="function">type</code></a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <code class="varname">method</code> is a function, it is <a href="#apihelper.builtin.callable" title="Example 4.8. Introducing callable">callable</a>.
</td>
</tr>
</table>
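<p>You can try the same sequence without <code class="filename">odbchelper</code>. The sketch below substitutes the standard <code class="filename">math</code> module (an assumption made so the example is self-contained) and uses the variable name <code class="varname">module</code> instead of <code class="varname">object</code>, to avoid hiding the built-in of that name.

```python
import math
import types

module = math                  # stand-in for odbchelper
name = "sqrt"                  # attribute name known only as a string

function = getattr(module, name)

print(function is math.sqrt)                          # same object either way
print(type(function) == types.BuiltinFunctionType)    # sqrt is implemented in C
print(callable(function))
print(function(16))
```

<p>Note that <code class="function">math.sqrt</code> is a built-in function rather than one written in Python, so its type is <code>types.BuiltinFunctionType</code>, not <code>types.FunctionType</code>.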
<h3>4.4.2. <code class="function">getattr</code> As a Dispatcher</h3>
<p>A common usage pattern of <code class="function">getattr</code> is as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could
define separate functions for each output format and use a single dispatch function to call the right one.
<p>For example, let's imagine a program that prints site statistics in <acronym>HTML</acronym>, <acronym>XML</acronym>, and plain text formats. The choice of output format could be specified on the command line, or stored in a configuration
file. A <code class="filename">statsout</code> module defines three functions, <code class="function">output_html</code>, <code class="function">output_xml</code>, and <code class="function">output_text</code>. Then the main program defines a single output function, like this:
<div class="example"><h3 id="apihelper.getattr.dispatch">Example 4.12. Creating a Dispatcher with <code class="function">getattr</code></h3><pre class="programlisting">
import statsout
def output(data, format="text"): <img id="apihelper.getattr.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
output_function = getattr(statsout, "output_%s" % format) <img id="apihelper.getattr.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
return output_function(data) <img id="apihelper.getattr.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">output</code> function takes one required argument, <code class="varname">data</code>, and one optional argument, <code class="varname">format</code>. If <code class="varname">format</code> is not specified, it defaults to <code>text</code>, and you will end up calling the plain text output function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You concatenate the <code class="varname">format</code> argument with "output_" to produce a function name, and then go get that function from the <code class="filename">statsout</code> module. This allows you to easily extend the program later to support other output formats, without changing this dispatch
function. Just add another function to <code class="filename">statsout</code> named, for instance, <code class="function">output_pdf</code>, and pass "pdf" as the <code class="varname">format</code> into the <code class="function">output</code> function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can simply call the output function in the same way as any other function. The <code class="varname">output_function</code> variable is a reference to the appropriate function from the <code class="filename">statsout</code> module.
</td>
</tr>
</table>
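<p>To experiment with the dispatcher, you need something to dispatch to. The sketch below fakes the <code class="filename">statsout</code> module with a class of the same name, which works because <code class="function">getattr</code> only needs an object that has attributes. The bodies of the output functions, the <code class="function">output_csv</code> format, and the shape of <code class="varname">data</code> are all assumptions made for illustration.

```python
class statsout(object):
    """Hypothetical stand-in for the statsout module described above."""

    @staticmethod
    def output_text(data):
        return "hits=%d" % data["hits"]

    @staticmethod
    def output_csv(data):          # assumed extra format, not in the text
        return "hits,%d" % data["hits"]


def output(data, format="text"):
    # Build the function name from the format string, then look it up.
    output_function = getattr(statsout, "output_%s" % format)
    return output_function(data)


stats = {"hits": 42}
print(output(stats))               # uses the default format, text
print(output(stats, "csv"))
```

<p>Adding a new format means adding one more function to the stand-in; the <code class="function">output</code> function itself does not change.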
<p>Did you see the bug in the previous example? This is a very loose coupling of strings and functions, and there is no error
checking. What happens if the user passes in a format that doesn't have a corresponding function defined in <code class="filename">statsout</code>? Well, <code class="function">getattr</code> will raise an <code class="errorcode">AttributeError</code> exception, nothing will be assigned to <code class="varname">output_function</code>, and the program will crash before it produces any output. That's
bad.
<p>Luckily, <code class="function">getattr</code> takes an optional third argument, a default value.
<div class="example"><h3 id="apihelper.getattr.default">Example 4.13. <code class="function">getattr</code> Default Values</h3><pre class="programlisting">
import statsout
def output(data, format="text"):
output_function = getattr(statsout, "output_%s" % format, statsout.output_text)
return output_function(data) <img id="apihelper.getattr.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.getattr.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This function call is guaranteed to work, because you added a third argument to the call to <code class="function">getattr</code>. The third argument is a default value that is returned if the attribute or method specified by the second argument wasn't
found.
</td>
</tr>
</table>
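<p>The two-argument and three-argument forms of <code class="function">getattr</code> can be compared side by side. The <code class="varname">Settings</code> class and its attribute are hypothetical, made up for this sketch.

```python
class Settings(object):
    """Hypothetical object with a single attribute."""
    color = "blue"

settings = Settings()

print(getattr(settings, "color", "none set"))   # attribute exists
print(getattr(settings, "size", "none set"))    # missing, so the default is used

try:
    getattr(settings, "size")                   # missing, and no default given
except AttributeError as error:
    print("lookup failed: %s" % error)
```

<p>The default is returned only when the attribute is missing; an existing attribute always wins, whatever its value.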
<p>As you can see, <code class="function">getattr</code> is quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters.
<h2 id="apihelper.filter">4.5. Filtering Lists</h2>
<p>As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (<a href="#odbchelper.map" title="3.6. Mapping Lists">Section 3.6, &#8220;Mapping Lists&#8221;</a>). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely.
<div class="informalexample">
<p>Here is the list filtering syntax:<pre class="programlisting">
[<i class="replaceable"><code>mapping-expression</code></i> for <i class="replaceable"><code>element</code></i> in <i class="replaceable"><code>source-list</code></i> if <i class="replaceable"><code>filter-expression</code></i>]</pre><p>This is an extension of the <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehensions</a> that you know and love. The first two thirds are the same; the last part, starting with the <code>if</code>, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be <a href="#tip.boolean">almost anything</a>). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored,
so they are never put through the mapping expression and are not included in the output list.
<div class="example"><h3>Example 4.14. Introducing List Filtering</h3><pre class="screen"><samp class="prompt">>>> </samp>li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]
<samp class="prompt">>>> </samp>[elem for elem in li if len(elem) > 1] <img id="apihelper.filter.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
['mpilgrim', 'foo']
<samp class="prompt">>>> </samp>[elem for elem in li if elem != "b"] <img id="apihelper.filter.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
['a', 'mpilgrim', 'foo', 'c', 'd', 'd']
<samp class="prompt">>>> </samp>[elem for elem in li if li.count(elem) == 1] <img id="apihelper.filter.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
['a', 'mpilgrim', 'foo', 'c']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.filter.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The mapping expression here is simple (it just returns the value of each element), so concentrate on the filter expression.
As Python loops through the list, it runs each element through the filter expression. If the filter expression is true, the element
is mapped and the result of the mapping expression is included in the returned list. Here, you are filtering out all the
one-character strings, so you're left with a list of all the longer strings.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.filter.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here, you are filtering out a specific value, <code>b</code>. Note that this filters all occurrences of <code>b</code>, since each time it comes up, the filter expression will be false.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.filter.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">count</code> is a list method that returns the number of times a value occurs in a list. You might think that this filter would eliminate
duplicates from a list, returning a list containing only one copy of each value in the original list. But it doesn't, because
values that appear twice in the original list (in this case, <code>b</code> and <code>d</code>) are excluded completely. There are ways of eliminating duplicates from a list, but filtering is not the solution.
</td>
</tr>
</table>
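<p>For the curious, here is one possible way to eliminate duplicates while keeping the first occurrence of each value. It is only a sketch using a plain loop, not something taken from <code class="filename">apihelper.py</code>, and the variable name <code class="varname">unique</code> is invented here.

```python
li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]

unique = []
for elem in li:
    if elem not in unique:      # keep only the first occurrence
        unique.append(elem)

print(unique)
```

<p>Unlike the <code class="function">count</code> filter, this keeps one copy of <code>b</code> and one copy of <code>d</code> instead of dropping them entirely.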
<div class="informalexample"><p id="apihelper.filter.care">Let's get back to this line from <code class="filename">apihelper.py</code>:<pre class="programlisting">
methodList = [method for method in dir(object) if callable(getattr(object, method))]</pre><p>This looks complicated, and it is complicated, but the basic structure is the same. The whole filter expression returns a
list, which is assigned to the <code class="varname">methodList</code> variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression,
which means it returns the value of each element unchanged. <code><code class="function">dir</code>(<code class="varname">object</code>)</code> returns a list of <code class="varname">object</code>'s attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the <code>if</code>.
<p>The filter expression looks scary, but it's not. You already know about <a href="#apihelper.builtin.callable" title="Example 4.8. Introducing callable"><code class="function">callable</code></a>, <a href="#apihelper.getattr.intro" title="Example 4.10. Introducing getattr"><code class="function">getattr</code></a>, and <a href="#odbchelper.tuplemethods" title="Example 3.16. Tuples Have No Methods"><code>in</code></a>. As you saw in the <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr">previous section</a>, the expression <code>getattr(object, method)</code> returns a function object if <code class="varname">object</code> is a module and <code class="varname">method</code> is the name of a function in that module.
<p>So this expression takes an object (named <code class="varname">object</code>). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters
that list to weed out all the stuff that you don't care about. You do the weeding out by taking the name of each attribute/method/function
and getting a reference to the real thing, via the <code class="function">getattr</code> function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like
the <code class="function">pop</code> method of a list) and user-defined (like the <code class="function">buildConnectionString</code> function of the <code class="filename">odbchelper</code> module). You don't care about other attributes, like the <code>__name__</code> attribute that's built in to every module.
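<p>As a rough illustration of the same filter, the sketch below wraps it in a function and applies it to a list. The name <code class="function">method_names</code> is hypothetical; only <code class="function">dir</code>, <code class="function">getattr</code>, and <code class="function">callable</code> come from the text above.

```python
def method_names(obj):
    """Return the names of obj's callable attributes (hypothetical helper)."""
    return [name for name in dir(obj) if callable(getattr(obj, name))]


# A list has callable methods such as pop and append,
# plus non-callable attributes such as its __doc__ string.
names = method_names([])
```

<p>Here <code>pop</code> and <code>append</code> survive the filter, while <code>__doc__</code>, which is a plain string, is weeded out.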
<div class="itemizedlist">
<h3>Further Reading on Filtering Lists</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses another way to filter lists <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007130000000000000000">using the built-in <code class="function">filter</code> function</a>.
</ul>
<h2 id="apihelper.andor">4.6. The Peculiar Nature of <code>and</code> and <code>or</code></h2>
<p>In Python, <code>and</code> and <code>or</code> perform boolean logic as you would expect, but they do not return boolean values; instead, they return one of the actual
values they are comparing.
<div class="example"><h3 id="apihelper.andor.intro.example">Example 4.15. Introducing <code>and</code></h3><pre class="screen"><samp class="prompt">>>> </samp>'a' and 'b' <img id="apihelper.andor.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'b'
<samp class="prompt">>>> </samp>'' and 'b' <img id="apihelper.andor.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
''
<samp class="prompt">>>> </samp>'a' and 'b' and 'c' <img id="apihelper.andor.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'c'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When using <code>and</code>, values are evaluated in a boolean context from left to right. <code class="constant">0</code>, <code>''</code>, <code>[]</code>, <code>()</code>, <code>{}</code>, and <code>None</code> are false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are
true in a boolean context, but you can define special methods in your class to make an instance evaluate to false. You'll
learn all about classes and special methods in <a href="#fileinfo">Chapter 5</a>. If all values are true in a boolean context, <code>and</code> returns the last value. In this case, <code>and</code> evaluates <code>'a'</code>, which is true, then <code>'b'</code>, which is true, and returns <code>'b'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If any value is false in a boolean context, <code>and</code> returns the first false value. In this case, <code>''</code> is the first false value.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">All values are true, so <code>and</code> returns the last value, <code>'c'</code>.
</td>
</tr>
</table>
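<p>The remark about class instances can be previewed with a small sketch. The class <code>Basket</code> is hypothetical, and using <code>__len__</code> as the special method is an assumption made here because it affects truth testing in both old and new versions of Python; classes are covered properly in the next chapter.

```python
class Basket:
    """Hypothetical container whose truth value follows its length."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        # An instance with length 0 is false in a boolean context.
        return len(self.items)


empty = Basket([])
full = Basket(["apple"])

first = empty and "has items"    # empty is false, so `and` returns empty itself
second = full and "has items"    # full is true, so `and` returns the last value
```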
<div class="example"><h3>Example 4.16. Introducing <code>or</code></h3><pre class="screen"><samp class="prompt">>>> </samp>'a' or 'b' <img id="apihelper.andor.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'a'
<samp class="prompt">>>> </samp>'' or 'b' <img id="apihelper.andor.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'b'
<samp class="prompt">>>> </samp>'' or [] or {} <img id="apihelper.andor.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
{}
<samp class="prompt">>>> </samp>def sidefx():
<samp class="prompt">... </samp>    print "in sidefx()"
<samp class="prompt">... </samp>    return 1
<samp class="prompt">>>> </samp>'a' or sidefx() <img id="apihelper.andor.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'a'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When using <code>or</code>, values are evaluated in a boolean context from left to right, just like <code>and</code>. If any value is true, <code>or</code> returns that value immediately. In this case, <code>'a'</code> is the first true value.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>or</code> evaluates <code>''</code>, which is false, then <code>'b'</code>, which is true, and returns <code>'b'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If all values are false, <code>or</code> returns the last value. <code>or</code> evaluates <code>''</code>, which is false, then <code>[]</code>, which is false, then <code>{}</code>, which is false, and returns <code>{}</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that <code>or</code> evaluates values only until it finds one that is true in a boolean context, and then it ignores the rest. This distinction
is important if some values can have side effects. Here, the function <code class="function">sidefx</code> is never called, because <code>or</code> evaluates <code>'a'</code>, which is true, and returns <code>'a'</code> immediately.
</td>
</tr>
</table>
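<p>One way to see this short-circuit behavior for yourself is to record which calls actually happen. This is only a sketch; the helper <code class="function">record</code> and the list <code class="varname">calls</code> are hypothetical names introduced for illustration.

```python
calls = []


def record(label, value):
    """Note that we were called, then return value (hypothetical helper)."""
    calls.append(label)
    return value


r1 = "a" or record("or-skipped", 1)    # 'a' is true: the right side never runs
r2 = "" and record("and-skipped", 1)   # '' is false: the right side never runs
r3 = "" or record("or-needed", 1)      # '' is false: the right side must run
```

<p>After these three lines, <code class="varname">calls</code> contains only <code>'or-needed'</code>, because that is the only case where the second operand had to be evaluated.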
<p>If you're a <acronym>C</acronym> hacker, you are certainly familiar with the <code><i class="replaceable">bool</i> ? <code class="varname">a</code> : <code class="varname">b</code></code> expression, which evaluates to <code class="varname">a</code> if <i class="replaceable"><code>bool</code></i> is true, and <code class="varname">b</code> otherwise. Because of the way <code>and</code> and <code>or</code> work in Python, you can accomplish the same thing.
<h3>4.6.1. Using the <code>and-or</code> Trick</h3>
<div class="example"><h3 id="apihelper.andortrick.intro">Example 4.17. Introducing the <code>and-or</code> Trick</h3><pre class="screen"><samp class="prompt">>>> </samp>a = "first"
<samp class="prompt">>>> </samp>b = "second"
<samp class="prompt">>>> </samp>1 and a or b <img id="apihelper.andor.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'first'
<samp class="prompt">>>> </samp>0 and a or b <img id="apihelper.andor.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'second'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This syntax looks similar to the <code><i class="replaceable">bool</i> ? <code class="varname">a</code> : <code class="varname">b</code></code> expression in <acronym>C</acronym>. The entire expression is evaluated from left to right, so the <code>and</code> is evaluated first. <code>1 and 'first'</code> evaluates to <code>'first'</code>, then <code>'first' or 'second'</code> evaluates to <code>'first'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>0 and 'first'</code> evaluates to <code class="constant">0</code>, and then <code>0 or 'second'</code> evaluates to <code>'second'</code>.
</td>
</tr>
</table>
<p>However, since this Python expression is simply boolean logic, and not a special construct of the language, there is one extremely important difference
between this <code>and-or</code> trick in Python and the <code><i class="replaceable">bool</i> ? <code class="varname">a</code> : <code class="varname">b</code></code> syntax in <acronym>C</acronym>. If the value of <code class="varname">a</code> is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?)
<div class="example"><h3>Example 4.18. When the <code>and-or</code> Trick Fails</h3><pre class="screen"><samp class="prompt">>>> </samp>a = ""
<samp class="prompt">>>> </samp>b = "second"
<samp class="prompt">>>> </samp>1 and a or b <img id="apihelper.andor.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'second'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <code class="varname">a</code> is an empty string, which Python considers false in a boolean context, <code>1 and ''</code> evaluates to <code>''</code>, and then <code>'' or 'second'</code> evaluates to <code>'second'</code>. Oops! That's not what you wanted.
</td>
</tr>
</table>
<p>The <code>and-or</code> trick, <code><i class="replaceable">bool</i> and <code class="varname">a</code> or <code class="varname">b</code></code>, will not work like the <acronym>C</acronym> expression <code><i class="replaceable">bool</i> ? <code class="varname">a</code> : <code class="varname">b</code></code> when <code class="varname">a</code> is false in a boolean context.
<p>The real trick behind the <code>and-or</code> trick, then, is to make sure that the value of <code class="varname">a</code> is never false. One common way of doing this is to turn <code class="varname">a</code> into <code>[<code class="varname">a</code>]</code> and <code class="varname">b</code> into <code>[<code class="varname">b</code>]</code>, and then take the first element of the returned list, which will be either <code class="varname">a</code> or <code class="varname">b</code>.
<div class="example"><h3>Example 4.19. Using the <code>and-or</code> Trick Safely</h3><pre class="screen"><samp class="prompt">>>> </samp>a = ""
<samp class="prompt">>>> </samp>b = "second"
<samp class="prompt">>>> </samp>(1 and [a] or [b])[0] <img id="apihelper.andor.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
''</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.andor.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <code>[<code class="varname">a</code>]</code> is a non-empty list, it is never false. Even if <code class="varname">a</code> is <code class="constant">0</code> or <code>''</code> or some other false value, the list <code>[<code class="varname">a</code>]</code> is true because it has one element.
</td>
</tr>
</table>
<p>By now, this trick may seem like more trouble than it's worth. You could, after all, accomplish the same thing with an <code>if</code> statement, so why go through all this fuss? Well, in many cases, you are choosing between two constant values, so you can
use the simpler syntax and not worry, because you know that the <code class="varname">a</code> value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so.
For example, there are some cases in Python where <code>if</code> statements are not allowed, such as in <code>lambda</code> functions.
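<p>To compare the two forms side by side, the sketch below wraps each one in a function. The names <code class="function">choose_naive</code> and <code class="function">choose</code> are hypothetical and exist only for this illustration.

```python
def choose_naive(condition, a, b):
    """Simple and-or trick: goes wrong whenever a is false in a boolean context."""
    return condition and a or b


def choose(condition, a, b):
    """Safe and-or trick: [a] is a one-element list, so it is never false."""
    return (condition and [a] or [b])[0]


naive = choose_naive(1, "", "second")   # falls through to b by mistake
safe = choose(1, "", "second")          # correctly yields the empty string
```

<p>If your interpreter is Python 2.5 or later (an assumption; this chapter predates that release), the conditional expression <code><code class="varname">a</code> if <i class="replaceable">bool</i> else <code class="varname">b</code></code> expresses the same choice directly and may also be used inside a <code>lambda</code>.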
<div class="itemizedlist">
<h3>Further Reading on the <code>and-or</code> Trick</h3>
<ul>
<li><a href="http://www.activestate.com/ASPN/Python/Cookbook/" title="growing archive of annotated code samples">Python Cookbook</a> discusses <a href="http://www.activestate.com/ASPN/Python/Cookbook/Recipe/52310">alternatives to the <code>and-or</code> trick</a>.
</ul>
<h2 id="apihelper.lambda">4.7. Using <code>lambda</code> Functions</h2>
<p>Python supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called <code>lambda</code> functions can be used anywhere a function is required.
<div class="example"><h3>Example 4.20. Introducing <code>lambda</code> Functions</h3><pre class="screen"><samp class="prompt">>>> </samp>def f(x):
<samp class="prompt">... </samp>    return x*2
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>f(3)
6
<samp class="prompt">>>> </samp>g = lambda x: x*2 <img id="apihelper.lambda.1.2" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>g(3)
6
<samp class="prompt">>>> </samp>(lambda x: x*2)(3) <img id="apihelper.lambda.1.3" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
6</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.lambda.1.2"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a <code>lambda</code> function that accomplishes the same thing as the normal function above it. Note the abbreviated syntax here: there are no
parentheses around the argument list, and the <code>return</code> keyword is missing (it is implied, since the entire function can only be one expression). Also, the function has no name,
but it can be called through the variable it is assigned to.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.lambda.1.3"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can use a <code>lambda</code> function without even assigning it to a variable. This may not be the most useful thing in the world, but it just goes to
show that a lambda is just an in-line function.
</td>
</tr>
</table>
<p>To generalize, a <code>lambda</code> function is a function that takes any number of arguments (including <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional arguments</a>) and returns the value of a single expression. <code>lambda</code> functions cannot contain commands, and they cannot contain more than one expression. Don't try to squeeze too much into
a <code>lambda</code> function; if you need something more complex, define a normal function instead and make it as long as you want.<table id="tip.lambda" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code>lambda</code> functions are a matter of style. Using them is never required; anywhere you could use them, you could define a separate
normal function and use that instead. I use them in places where I want to encapsulate specific, non-reusable code without
littering my code with a lot of little one-line functions.
</td>
</tr>
</table>
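<p>As a brief sketch of the "any number of arguments" point, here is a <code>lambda</code> with one required and one optional argument, next to an equivalent normal function. The names <code class="function">scale</code> and <code class="function">scale_def</code> are made up for this example.

```python
# A lambda may take several arguments, including one with a default value.
scale = lambda x, factor=2: x * factor

doubled = scale(3)             # factor falls back to its default of 2
tripled = scale(3, factor=3)   # named argument overrides the default


# The equivalent normal function, for comparison.
def scale_def(x, factor=2):
    return x * factor
```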
<h3>4.7.1. Real-World <code>lambda</code> Functions</h3>
<div class="informalexample">
<p>Here are the <code>lambda</code> functions in <code class="filename">apihelper.py</code>:<pre class="programlisting">
processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)</pre><p>Notice that this uses the simple form of the <a href="#apihelper.andor" title="4.6. The Peculiar Nature of and and or"><code>and-or</code></a> trick, which is okay, because a <code>lambda</code> function is always true <a href="#tip.boolean">in a boolean context</a>. (That doesn't mean that a <code>lambda</code> function can't return a false value. The function is always true; its return value could be anything.)
<p>Also notice that you're using the <code class="function">split</code> function with no arguments. You've already seen it used with <a href="#odbchelper.split.example" title="Example 3.28. Splitting a String">one or two arguments</a>, but without any arguments it splits on whitespace.
<div class="example"><h3>Example 4.21. <code class="function">split</code> With No Arguments</h3><pre class="screen"><samp class="prompt">>>> </samp>s = "this   is\na\ttest" <img id="apihelper.split.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print s
<samp class="computeroutput">this   is
a       test</samp>
<samp class="prompt">>>> </samp>print s.split() <img id="apihelper.split.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
['this', 'is', 'a', 'test']
<samp class="prompt">>>> </samp>print " ".join(s.split()) <img id="apihelper.split.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
this is a test</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.split.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a multiline string, defined by escape characters instead of <a href="#odbchelper.triplequotes" title="Example 2.2. Defining the buildConnectionString Function's doc string">triple quotes</a>. <code>\n</code> is a newline, and <code>\t</code> is a tab character.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.split.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">split</code> without any arguments splits on whitespace. So three spaces, a newline, and a tab character are all the same.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.split.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can normalize whitespace by splitting a string with <code class="function">split</code> and then rejoining it with <code class="function">join</code>, using a single space as a delimiter. This is what the <code class="function">info</code> function does to collapse multi-line <code>doc string</code>s into a single line.
</td>
</tr>
</table>
<p>So what is the <code class="function">info</code> function actually doing with these <code>lambda</code> functions, <code class="function">split</code>s, and <code>and-or</code> tricks?
<div class="informalexample"><pre id="apihelper.funcassign" class="programlisting">
processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)</pre><p><code class="varname">processFunc</code> is now a function, but which function it is depends on the value of the <code class="varname">collapse</code> variable. If <code class="varname">collapse</code> is true, <code><code class="varname">processFunc</code>(<i class="replaceable">string</i>)</code> will collapse whitespace; otherwise, <code><code class="varname">processFunc</code>(<i class="replaceable">string</i>)</code> will return its argument unchanged.
<p>To do this in a less robust language, like Visual Basic, you would probably create a function that took a string and a <i class="parameter"><code>collapse</code></i> argument and used an <code>if</code> statement to decide whether to collapse the whitespace or not, then returned the appropriate value. This would be inefficient,
because the function would need to handle every possible case. Every time you called it, it would need to decide whether
to collapse whitespace before it could give you what you wanted. In Python, you can take that decision logic out of the function and define a <code>lambda</code> function that is custom-tailored to give you exactly (and only) what you want. This is more efficient, more elegant, and
less prone to those nasty oh-I-thought-those-arguments-were-reversed kinds of errors.
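<p>The idea of deciding once and then reusing the chosen function can be sketched as follows. The wrapper <code class="function">make_process_func</code> is a hypothetical name; the expression inside it is the one from <code class="filename">apihelper.py</code>.

```python
def make_process_func(collapse):
    """Pick the string-processing function once, up front (hypothetical helper)."""
    return collapse and (lambda s: " ".join(s.split())) or (lambda s: s)


text = "this   is\na\ttest"

collapsed = make_process_func(1)(text)   # whitespace runs become single spaces
unchanged = make_process_func(0)(text)   # identity function: text comes back as-is
```

<p>Whichever function you get back, you call it the same way, and it never has to re-check <code class="varname">collapse</code>.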
<div class="itemizedlist">
<h3>Further Reading on <code>lambda</code> Functions</h3>
<ul>
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> discusses using <code>lambda</code> to <a href="http://www.faqts.com/knowledge-base/view.phtml/aid/6081/fid/241">call functions indirectly</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node6.html#SECTION006740000000000000000">access outside variables from inside a <code>lambda</code> function</a>. (<a href="http://python.sourceforge.net/peps/pep-0227.html"><acronym>PEP</acronym> 227</a> explains how this will change in future versions of Python.)
<li><a href="http://www.python.org/doc/FAQ.html"><i class="citetitle">The Whole Python <acronym>FAQ</acronym></i></a> has examples of <a href="http://www.python.org/cgi-bin/faqw.py?query=4.15&amp;querytype=simple&amp;casefold=yes&amp;req=search">obfuscated one-liners using <code>lambda</code></a>.
</ul>
<h2 id="apihelper.alltogether">4.8. Putting It All Together</h2>
<p>The last line of code, the only one you haven't deconstructed yet, is the one that does all the work. But by now the work
is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time
to knock them down.
<div class="informalexample">
<p>This is the meat of <code class="filename">apihelper.py</code>:<pre class="programlisting">
print "\n".join(["%s %s" %
                 (method.ljust(spacing),
                  processFunc(str(getattr(object, method).__doc__)))
                 for method in methodList])</pre><p>Note that this is one command, split over multiple lines, but it doesn't use the line continuation character (<code>\</code>). Remember when I said that <a href="#tip.implicitmultiline">some expressions can be split into multiple lines</a> without using a backslash? A list comprehension is one of those expressions, since the entire expression is contained in
square brackets.
<p>Now, let's take it from the end and work backwards. The <pre class="programlisting">
for method in methodList</pre><p>shows that this is a <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehension</a>. As you know, <code class="varname">methodList</code> is a list of <a href="#apihelper.filter.care">all the methods you care about</a> in <code class="varname">object</code>. So you're looping through that list with <code class="varname">method</code>.
<div class="example"><h3>Example 4.22. Getting a <code>doc string</code> Dynamically</h3><pre class="screen"><samp class="prompt">>>> </samp>import odbchelper
<samp class="prompt">>>> </samp>object = odbchelper <img id="apihelper.alltogether.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>method = 'buildConnectionString' <img id="apihelper.alltogether.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>getattr(object, method) <img id="apihelper.alltogether.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;function buildConnectionString at 010D6D74>
<samp class="prompt">>>> </samp>print getattr(object, method).__doc__ <img id="apihelper.alltogether.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">Build a connection string from a dictionary of parameters.
Returns string.</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In the <code class="function">info</code> function, <code class="varname">object</code> is the object you're getting help on, passed in as an argument.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you're looping through <code class="varname">methodList</code>, <code class="varname">method</code> is the name of the current method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using the <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code class="function">getattr</code></a> function, you're getting a reference to the <i class="replaceable"><code>method</code></i> function in the <i class="replaceable"><code>object</code></i> module.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now, printing the actual <code>doc string</code> of the method is easy.
</td>
</tr>
</table>
<p>The next piece of the puzzle is the use of <code class="function">str</code> around the <code>doc string</code>. As you may recall, <code class="function">str</code> is a built-in function that <a href="#apihelper.builtin" title="4.3. Using type, str, dir, and Other Built-In Functions">coerces data into a string</a>. But a <code>doc string</code> is always a string, so why bother with the <code class="function">str</code> function? The answer is that not every function has a <code>doc string</code>, and if it doesn't, its <code>__doc__</code> attribute is <code>None</code>.
<div class="example"><h3>Example 4.23. Why Use <code class="function">str</code> on a <code>doc string</code>?</h3><pre class="screen"><samp class="prompt">>>> </samp>def foo(): print 2
<samp class="prompt">>>> </samp>foo()
2
<samp class="prompt">>>> </samp>foo.__doc__ <img id="apihelper.alltogether.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>foo.__doc__ == None <img id="apihelper.alltogether.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
True
<samp class="prompt">>>> </samp>str(foo.__doc__) <img id="apihelper.alltogether.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'None'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can easily define a function that has no <code>doc string</code>, so its <code>__doc__</code> attribute is <code>None</code>. Confusingly, if you evaluate the <code>__doc__</code> attribute directly, the Python <acronym>IDE</acronym> prints nothing at all, which makes sense if you think about it, but is still unhelpful.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can verify that the value of the <code>__doc__</code> attribute is actually <code>None</code> by comparing it directly.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">str</code> function takes the null value and returns a string representation of it, <code>'None'</code>.
</td>
</tr>
</table>
</div><table id="compare.isnull.sql" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In <acronym>SQL</acronym>, you must use <code>IS NULL</code> instead of <code>= NULL</code> to compare a null value. In Python, you can use either <code>== None</code> or <code>is None</code>, but <code>is None</code> is faster.
</td>
</tr>
</table>
<p>Now that you are guaranteed to have a string, you can pass the string to <code class="varname">processFunc</code>, which you have <a href="#apihelper.lambda" title="4.7. Using lambda Functions">already defined</a> as a function that either does or doesn't collapse whitespace. Now you see why it was important to use <code class="function">str</code> to convert a <code>None</code> value into a string representation. <code class="varname">processFunc</code> is assuming a string argument and calling its <code class="function">split</code> method, which would crash if you passed it <code>None</code> because <code>None</code> doesn't have a <code class="function">split</code> method.
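<p>A short sketch shows the failure that <code class="function">str</code> guards against. The names <code class="function">undocumented</code> and <code class="function">collapse_ws</code> are hypothetical; <code class="function">collapse_ws</code> stands in for the whitespace-collapsing branch of <code class="varname">processFunc</code>.

```python
def undocumented():
    pass


# Stand-in for the whitespace-collapsing branch of processFunc.
collapse_ws = lambda s: " ".join(s.split())

try:
    collapse_ws(undocumented.__doc__)    # __doc__ is None here
    failed = False
except AttributeError:
    failed = True                        # None has no split method

safe = collapse_ws(str(undocumented.__doc__))   # str(None) is the string 'None'
```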
<p>Stepping back even further, you see that you're using string formatting again to concatenate the return value of <code class="varname">processFunc</code> with the return value of <code class="varname">method</code>'s <code class="function">ljust</code> method. This is a new string method that you haven't seen before.
<div class="example"><h3>Example 4.24. Introducing <code class="function">ljust</code></h3><pre class="screen"><samp class="prompt">>>> </samp>s = 'buildConnectionString'
<samp class="prompt">>>> </samp>s.ljust(30) <img id="apihelper.alltogether.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'buildConnectionString         '
<samp class="prompt">>>> </samp>s.ljust(20) <img id="apihelper.alltogether.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'buildConnectionString'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">ljust</code> pads the string with spaces to the given length. This is what the <code class="function">info</code> function uses to make two columns of output and line up all the <code>doc string</code>s in the second column.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the given length is smaller than the length of the string, <code class="function">ljust</code> will simply return the string unchanged. It never truncates the string.
</td>
</tr>
</table>
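<p>To see the two-column effect on its own, here is a minimal sketch. The sample rows and the <code class="function">two_columns</code> helper are made up for illustration; only <code class="function">ljust</code> and string formatting come from the text above. It is written so that it should run unchanged on both Python 2 and Python 3.

```python
# Hypothetical sample data: names paired with short descriptions.
rows = [("append", "add an item to the end"),
        ("extend", "add every item from another list"),
        ("buildConnectionString", "longer than the column width")]

def two_columns(pairs, spacing=10):
    """Pad each name to `spacing` characters, then append its description."""
    return ["%s %s" % (name.ljust(spacing), text) for name, text in pairs]

lines = two_columns(rows)
print("\n".join(lines))
```

<p>The first two descriptions line up in the second column. The third name is longer than <code class="varname">spacing</code>, so <code class="function">ljust</code> leaves it alone and that row simply runs long.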
<p>You're almost finished. Given the padded method name from the <code class="function">ljust</code> method and the (possibly collapsed) <code>doc string</code> from the call to <code class="varname">processFunc</code>, you concatenate the two and get a single string. Since you're mapping <code class="varname">methodList</code>, you end up with a list of strings. Using the <code class="function">join</code> method of the string <code>"\n"</code>, you join this list into a single string, with each element of the list on a separate line, and print the result.
<div class="example"><h3>Example 4.25. Printing a List</h3><pre class="screen"><samp class="prompt">>>> </samp>li = ['a', 'b', 'c']
<samp class="prompt">>>> </samp>print "\n".join(li) <img id="apihelper.alltogether.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">a
b
c</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#apihelper.alltogether.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is also a useful debugging trick when you're working with lists. And in Python, you're always working with lists.
</td>
</tr>
</table>
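<p>One caveat with this trick: <code class="function">join</code> only works on lists of strings. The sketch below uses a made-up mixed list to show the failure and the usual way around it, which is the same <code class="function">str</code> coercion that <code class="function">info</code> relies on. It should behave the same on Python 2 and Python 3.

```python
values = ["a", 1, None]          # hypothetical mixed list

try:
    "\n".join(values)            # join accepts only strings
except TypeError as error:
    problem = str(error)         # keep the message around for inspection

# Coerce every element with str first, then join.
text = "\n".join([str(v) for v in values])
print(text)
```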
<p>That's the last piece of the puzzle. You should now understand this code.
<div class="informalexample"><pre class="programlisting">
print "\n".join(["%s %s" %
                  (method.ljust(spacing),
                   processFunc(str(getattr(object, method).__doc__)))
                 for method in methodList])</pre><h2 id="apihelper.summary">4.9. Summary</h2>
<p>The <code class="filename">apihelper.py</code> program and its output should now make perfect sense.
<div class="informalexample"><pre class="programlisting">
def info(object, spacing=10, collapse=1):
    """Print methods and doc strings.

    Takes module, class, list, dictionary, or string."""
    methodList = [method for method in dir(object) if callable(getattr(object, method))]
    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    print "\n".join(["%s %s" %
                      (method.ljust(spacing),
                       processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

if __name__ == "__main__":
    print info.__doc__</pre><div class="informalexample">
<p>Here is the output of <code class="filename">apihelper.py</code>:<pre class="screen"><samp class="prompt">>>> </samp>from apihelper import info
<samp class="prompt">>>> </samp>li = []
<samp class="prompt">>>> </samp>info(li)
<samp class="computeroutput">append     L.append(object) -- append object to end
count      L.count(value) -> integer -- return number of occurrences of value
extend     L.extend(list) -- extend list by appending list elements
index      L.index(value) -> integer -- return index of first occurrence of value
insert     L.insert(index, object) -- insert object before index
pop        L.pop([index]) -> item -- remove and return item at index (default last)
remove     L.remove(value) -- remove first occurrence of value
reverse    L.reverse() -- reverse *IN PLACE*
sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1</samp></pre><div class="highlights">
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
<div class="itemizedlist">
<ul>
<li>Defining and calling functions with <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional and named arguments</a>
<li>Using <a href="#apihelper.str.intro" title="Example 4.6. Introducing str"><code class="function">str</code></a> to coerce any arbitrary value into a string representation
<li>Using <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code class="function">getattr</code></a> to get references to functions and other attributes dynamically
<li>Extending the list comprehension syntax to do <a href="#apihelper.filter" title="4.5. Filtering Lists">list filtering</a>
<li>Recognizing <a href="#apihelper.andor" title="4.6. The Peculiar Nature of and and or">the <code>and-or</code> trick</a> and using it safely
<li>Defining <a href="#apihelper.lambda" title="4.7. Using lambda Functions"><code>lambda</code> functions</a>
<li><a href="#apihelper.funcassign">Assigning functions to variables</a> and calling the function by referencing the variable. I can't emphasize this enough, because this mode of thought is vital
to advancing your understanding of Python. You'll see more complex applications of this concept throughout this book.
</ul>
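<p>If you want to exercise these ideas on a current interpreter, here is a rough variant of <code class="function">info</code>. It is a sketch, not the book's listing: the name <code class="function">info_text</code> and its renamed variables are invented here, and it returns the listing as a string instead of printing it so that it runs on both Python 2 and Python 3.

```python
def info_text(obj, spacing=10, collapse=True):
    """Return the two-column listing as a string instead of printing it."""
    method_names = [name for name in dir(obj) if callable(getattr(obj, name))]
    # Safe use of the and-or trick: a lambda is always true,
    # so the expression cannot fall through to the wrong branch.
    process = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    return "\n".join(["%s %s" % (name.ljust(spacing),
                                  process(str(getattr(obj, name).__doc__)))
                      for name in method_names])

listing = info_text([], spacing=12)
print(listing)
```

<p>The exact methods listed depend on your Python version, so expect more lines than the output shown earlier in this section.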
<div class="chapter">
<h2 id="fileinfo">Chapter 5. Objects and Object-Orientation</h2>
<p>This chapter, and pretty much every chapter after this, deals with object-oriented Python programming.
<h2 id="fileinfo.divein">5.1. Diving In</h2>
<p>Here is a complete, working Python program. Read the <a href="#odbchelper.docstring" title="2.3. Documenting Functions"><code>doc string</code>s</a> of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't
worry about the stuff you don't understand; that's what the rest of the chapter is for.
<div class="example"><h3>Example 5.1. <code class="filename">fileinfo.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Framework for getting filetype-specific metadata.

Instantiate appropriate class with filename. Returned object acts like a
dictionary, with key-value pairs for each piece of metadata.
    import fileinfo
    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])

Or use listDirectory function to get info on all files in a directory.
    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
        ...

Framework can be extended by adding classes for particular file types, e.g.
HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for
parsing its files appropriately; see MP3FileInfo for example.
"""
import os
import sys
from UserDict import UserDict

def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}

    def __parse(self, filename):
        "parse ID3v1.0 tags from MP3 file"
        self.clear()
        try:
            fsock = open(filename, "rb", 0)
            try:
                fsock.seek(-128, 2)
                tagdata = fsock.read(128)
            finally:
                fsock.close()
            if tagdata[:3] == "TAG":
                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                    self[tag] = parseFunc(tagdata[start:end])
        except IOError:
            pass

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)
        FileInfo.__setitem__(self, key, item)

def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList]
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):
        "get file info class from filename extension"
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
    return [getFileInfoClass(f)(f) for f in fileList]

if __name__ == "__main__":
    for info in listDirectory("/music/_singles/", [".mp3"]): <img id="fileinfo_divein.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
        print</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo_divein.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This program's output depends on the files on your hard drive. To get meaningful output, you'll need to change the directory
path to point to a directory of MP3 files on your own machine.
</td>
</tr>
</table>
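<p>If you don't have a directory of MP3 files handy, you can still watch the slicing idea behind <code class="varname">tagDataMap</code> at work. The sketch below is an illustration, not part of <code class="filename">fileinfo.py</code>: it builds a synthetic 128-byte tag in memory, the names <code class="function">strip_nulls</code> and <code class="varname">TAG_FIELDS</code> are invented, and it assumes Python 3, where file data is bytes and needs decoding (here as Latin-1, another assumption).

```python
def strip_nulls(raw):
    "decode a fixed-width field, then drop NUL padding and whitespace"
    return raw.decode("latin-1").replace("\x00", "").strip()

# (start, end, parser) for a few fields, mirroring tagDataMap in the listing.
TAG_FIELDS = {"title":  (3, 33, strip_nulls),
              "artist": (33, 63, strip_nulls),
              "genre":  (127, 128, ord)}

# Synthetic tag: marker, two padded 30-byte fields, filler, one genre byte.
tag = (b"TAG"
       + b"Sidewinder".ljust(30, b"\x00")
       + b"The Cynic Project".ljust(30, b"\x00")
       + b"\x00" * 64
       + bytes([18]))

parsed = {}
if tag[:3] == b"TAG":
    for name, (start, end, parse) in TAG_FIELDS.items():
        # Each field is a slice of the tag handed to its own parser function.
        parsed[name] = parse(tag[start:end])
print(parsed)
```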
<div class="informalexample">
<p>This is the output I got on my machine. Your output will be different, unless, by some startling coincidence, you share my
exact taste in music.<pre class="screen"><samp class="computeroutput">album=
artist=Ghost in the Machine
title=A Time Long Forgotten (Concept
genre=31
name=/music/_singles/a_time_long_forgotten_con.mp3
year=1999
comment=http://mp3.com/ghostmachine

album=Rave Mix
artist=***DJ MARY-JANE***
title=HELLRAISER****Trance from Hell
genre=31
name=/music/_singles/hellraiser.mp3
year=2000
comment=http://mp3.com/DJMARYJANE

album=Rave Mix
artist=***DJ MARY-JANE***
title=KAIRO****THE BEST GOA
genre=31
name=/music/_singles/kairo.mp3
year=2000
comment=http://mp3.com/DJMARYJANE

album=Journeys
artist=Masters of Balance
title=Long Way Home
genre=31
name=/music/_singles/long_way_home1.mp3
year=2000
comment=http://mp3.com/MastersofBalan

album=
artist=The Cynic Project
title=Sidewinder
genre=18
name=/music/_singles/sidewinder.mp3
year=2000
comment=http://mp3.com/cynicproject

album=Digitosis@128k
artist=VXpanded
title=Spinning
genre=255
name=/music/_singles/spinning.mp3
year=2000
comment=http://mp3.com/artists/95/vxp</samp></pre><h2 id="fileinfo.fromimport">5.2. Importing Modules Using <code>from <i class="replaceable">module</i> import</code></h2>
<p>Python has two ways of importing modules. Both are useful, and you should know when to use each. One way, <code>import <i class="replaceable">module</i></code>, you've already seen in <a href="#odbchelper.objects" title="2.4. Everything Is an Object">Section 2.4, &#8220;Everything Is an Object&#8221;</a>. The other way accomplishes the same thing, but it has subtle and important differences.
<div class="informalexample">
<p>Here is the basic <code>from <i class="replaceable">module</i> import</code> syntax:<pre class="programlisting">
from UserDict import UserDict
</pre><p>This is similar to the <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's doc string"><code>import <i class="replaceable">module</i></code></a> syntax that you know and love, but with an important difference: the attributes and methods you name are imported from the module directly into the local namespace, so they are available directly, without qualification by module name. Here, the <code class="classname">UserDict</code> class is imported from the <code class="filename">UserDict</code> module. You
can import individual items or use <code>from <i class="replaceable">module</i> import *</code> to import everything.<table id="compare.fromimport.perl" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code>from <i class="replaceable">module</i> import *</code> in Python is like <code>use <i class="replaceable">module</i></code> in Perl; <code>import <i class="replaceable">module</i></code> in Python is like <code>require <i class="replaceable">module</i></code> in Perl.
</td>
</tr>
</table><table id="compare.fromimport.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code>from <i class="replaceable">module</i> import *</code> in Python is like <code>import <i class="replaceable">module</i>.*</code> in Java; <code>import <i class="replaceable">module</i></code> in Python is like <code>import <i class="replaceable">module</i></code> in Java.
</td>
</tr>
</table>
<div class="example"><h3>Example 5.2. <code>import <i class="replaceable">module</i></code> <i class="foreignphrase"><acronym>vs.</acronym></i> <code>from <i class="replaceable">module</i> import</code></h3><pre class="screen"><samp class="prompt">>>> </samp>import types
<samp class="prompt">>>> </samp>types.FunctionType <img id="fileinfo.import.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;type 'function'>
<samp class="prompt">>>> </samp>FunctionType <img id="fileinfo.import.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
NameError: There is no variable named 'FunctionType'</samp>
<samp class="prompt">>>> </samp>from types import FunctionType <img id="fileinfo.import.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>FunctionType <img id="fileinfo.import.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;type 'function'></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.import.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="filename">types</code> module contains no methods; it just has attributes for each Python object type. Note that the attribute, <code class="constant">FunctionType</code>, must be qualified by the module name, <code class="filename">types</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.import.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="constant">FunctionType</code> by itself has not been defined in this namespace; it exists only in the context of <code class="filename">types</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.import.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This syntax imports the attribute <code class="constant">FunctionType</code> from the <code class="filename">types</code> module directly into the local namespace.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.import.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now <code class="constant">FunctionType</code> can be accessed directly, without reference to <code class="filename">types</code>.
</td>
</tr>
</table>
<p>When should you use <code>from <i class="replaceable">module</i> import</code>?
<div class="itemizedlist">
<ul>
<li>If you will be accessing attributes and methods often and don't want to type the module name over and over, use <code>from <i class="replaceable">module</i> import</code>.
<li>If you want to selectively import some attributes and methods but not others, use <code>from <i class="replaceable">module</i> import</code>.
<li>If the module contains attributes or functions with the same name as ones in your module, you must use <code>import <i class="replaceable">module</i></code> to avoid name conflicts.
</ul>
<p>Other than that, it's just a matter of style, and you will see Python code written both ways.<table class="caution" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/caution.png" alt="Caution" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Use <code>from module import *</code> sparingly, because it makes it difficult to determine where a particular function or attribute came from, and that makes
debugging and refactoring more difficult.
</td>
</tr>
</table>
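<p>The name-conflict case is easy to trip over, so here is a small sketch of it. The local <code class="function">join</code> helper is made up for the demonstration; <code class="filename">os.path</code> is the real standard library module. It should run on both Python 2 and Python 3.

```python
import os.path

def join(parts):
    "a hypothetical helper of your own that happens to be called join"
    return "-".join(parts)

mine = join(["a", "b"])          # calls your own helper

from os.path import join         # silently rebinds the name join

assert join is os.path.join      # your helper is no longer reachable by this name
theirs = os.path.join("a", "b")  # the qualified name is never ambiguous
print(mine)
print(theirs)
```

<p>Nothing warns you when the second definition replaces the first, which is why the qualified <code>import <i class="replaceable">module</i></code> form is the safer choice whenever names might collide.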
<div class="itemizedlist">
<h3>Further Reading on Module Importing Techniques</h3>
<ul>
<li><a href="http://www.effbot.org/guides/">eff-bot</a> has more to say on <a href="http://www.effbot.org/guides/import-confusion.htm"><code>import <i class="replaceable">module</i></code> <i class="foreignphrase"><acronym>vs.</acronym></i> <code>from <i class="replaceable">module</i> import</code></a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses advanced import techniques, including <a href="http://www.python.org/doc/current/tut/node8.html#SECTION008410000000000000000"><code>from <i class="replaceable">module</i> import *</code></a>.
</ul>
<h2 id="fileinfo.class">5.3. Defining Classes</h2>
<p>Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the
classes you've defined.
<p>Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word <code>class</code>, followed by the class name. Technically, that's all that's required, since a class doesn't need to inherit from any other
class.
<div class="example"><h3 id="fileinfo.class.simplest">Example 5.3. The Simplest Python Class</h3><pre class="programlisting">
class Loaf: <img id="fileinfo.class.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    pass <img id="fileinfo.class.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"> <img id="fileinfo.class.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The name of this class is <code class="classname">Loaf</code>, and it doesn't inherit from any other class. Class names are usually capitalized, <code class="classname">EachWordLikeThis</code>, but this is only a convention, not a requirement.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This class doesn't define any methods or attributes, but syntactically, there needs to be something in the definition, so
you use <code>pass</code>. This is a Python reserved word that just means &#8220;move along, nothing to see here&#8221;. It's a statement that does nothing, and it's a good placeholder when you're stubbing out functions or classes.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You probably guessed this, but everything in a class is indented, just like the code within a function, <code>if</code> statement, <code>for</code> loop, and so forth. The first thing not indented is not in the class.
</td>
</tr>
</table>
</div><table id="compare.pass.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">The <code>pass</code> statement in Python is like an empty set of braces (<code>{}</code>) in Java or <acronym>C</acronym>.
</td>
</tr>
</table>
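<p>Here is a quick sketch of <code>pass</code> as a placeholder in both a class and a function. The <code class="function">knead</code> stub is invented for illustration, and calling <code class="classname">Loaf</code> to create instances jumps slightly ahead to Section 5.4. It should run on both Python 2 and Python 3.

```python
class Loaf:
    pass                      # empty body: no methods, no attributes

def knead(loaf):
    pass                      # stub to fill in later; calling it returns None

a = Loaf()
b = Loaf()
print(a is b)                 # each call to Loaf() makes a separate instance
print(knead(a))
```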
<p>Of course, realistically, most classes will be inherited from other classes, and they will define their own class methods
and attributes. But as you've just seen, there is nothing that a class absolutely must have, other than a name. In particular,
<acronym>C++</acronym> programmers may find it odd that Python classes don't have explicit constructors and destructors. Python classes do have something similar to a constructor: the <code class="function">__init__</code> method.
<div class="example"><h3 id="fileinfo.class.example">Example 5.4. Defining the <code class="classname">FileInfo</code> Class</h3><pre class="programlisting">
from UserDict import UserDict
class FileInfo(UserDict): <img id="fileinfo.class.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the <code class="classname">FileInfo</code> class is inherited from the <code class="classname">UserDict</code> class (which was <a href="#fileinfo.fromimport" title="5.2. Importing Modules Using from module import">imported from the <code class="filename">UserDict</code> module</a>). <code class="classname">UserDict</code> is a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior.
(There are similar classes <code class="classname">UserList</code> and <code class="classname">UserString</code> which allow you to subclass lists and strings.) There is a bit of black magic behind this, which you will demystify later
in this chapter when you explore the <code class="classname">UserDict</code> class in more depth.
</td>
</tr>
</table>
</div><table id="compare.extends.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. There is no special keyword like
<code>extends</code> in Java.
</td>
</tr>
</table>
<p>Python supports multiple inheritance. In the parentheses following the class name, you can list as many ancestor classes as you
like, separated by commas.
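<p>As a rough illustration of multiple inheritance, consider the sketch below. All three classes are hypothetical and exist only for this example. When two ancestors define the same method, the one listed first in the parentheses wins in this simple case.

```python
class Walker:
    def move(self):
        return "walking"

class Swimmer:
    def move(self):
        return "swimming"

    def dive(self):
        return "diving"

class Duck(Walker, Swimmer):   # two ancestors, separated by a comma
    pass

duck = Duck()
print(duck.move())             # found on Walker, the first ancestor listed
print(duck.dive())             # only Swimmer defines dive
```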
<h3>5.3.1. Initializing and Coding Classes</h3>
<p>This example shows the initialization of the <code class="classname">FileInfo</code> class using the <code class="function">__init__</code> method.
<div class="example"><h3 id="fileinfo.init.example">Example 5.5. Initializing the <code class="classname">FileInfo</code> Class</h3><pre class="programlisting">
class FileInfo(UserDict):
    "store file metadata" <img id="fileinfo.class.2.2" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    def __init__(self, filename=None): <img id="fileinfo.class.2.3" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"> <img id="fileinfo.class.2.4" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"> <img id="fileinfo.class.2.5" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.2"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Classes can (and <a href="#tip.docstring">should</a>) have <code>doc string</code>s too, just like modules and functions.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.3"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">__init__</code> is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor
of the class. It's tempting, because it looks like a constructor (by convention, <code class="function">__init__</code> is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance
of the class), and even sounds like one (&#8220;init&#8221; certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time <code class="function">__init__</code> is called, and you already have a valid reference to the new instance of the class. But <code class="function">__init__</code> is the closest thing you're going to get to a constructor in Python, and it fills much the same role.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.4"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first argument of every class method, including <code class="function">__init__</code>, is always a reference to the current instance of the class. By convention, this argument is always named <code>self</code>. In the <code class="function">__init__</code> method, <code>self</code> refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although
you need to specify <code>self</code> explicitly when defining the method, you do <em>not</em> specify it when calling the method; Python will add it for you automatically.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.5"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">__init__</code> methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making
them optional to the caller. In this case, <code class="varname">filename</code> has a default value of <code>None</code>, which is the Python null value.
</td>
</tr>
</table>
</div><table id="compare.self.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">By convention, the first argument of any Python class method (the reference to the current instance) is called <code>self</code>. This argument fills the role of the reserved word <code>this</code> in <acronym>C++</acronym> or Java, but <code>self</code> is not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything but <code>self</code>; this is a very strong convention.
</td>
</tr>
</table>
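<p>To make the role of <code>self</code> concrete, here is a small sketch. The <code class="classname">Counter</code> class is hypothetical and has nothing to do with <code class="filename">fileinfo.py</code>. It shows an optional <code class="function">__init__</code> argument and the same method called two ways: normally, where Python supplies <code>self</code>, and through the class, where you pass the instance yourself.

```python
class Counter:
    "hypothetical class used only to illustrate self"
    def __init__(self, start=0):     # start is optional, like filename above
        self.count = start

    def bump(self, step=1):
        self.count = self.count + step
        return self.count

c = Counter()                 # no argument given, so start defaults to 0
first = c.bump()              # Python passes c as self
second = Counter.bump(c, 5)   # the same kind of call with self written out
print(first)
print(second)
```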
<div class="example"><h3 id="fileinfo.init.code.example">Example 5.6. Coding the <code class="classname">FileInfo</code> Class</h3><pre class="programlisting">
class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self) <img id="fileinfo.class.2.6" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        self["name"] = filename <img id="fileinfo.class.2.7" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        <img id="fileinfo.class.2.8" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.6"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Some pseudo-object-oriented languages like PowerBuilder have a concept of &#8220;extending&#8221; constructors and other events, where the ancestor's method is called automatically before the descendant's method is executed.
Python does not do this; you must always explicitly call the appropriate method in the ancestor class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.7"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">I told you that this class acts like a dictionary, and here is the first sign of it. You're assigning the argument <code class="varname">filename</code> as the value of this object's <code>name</code> key.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.class.2.8"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that the <code class="function">__init__</code> method never returns a value.
</td>
</tr>
</table>
<h3>5.3.2. Knowing When to Use <code>self</code> and <code class="function">__init__</code></h3>
<p>When defining your class methods, you <em>must</em> explicitly list <code>self</code> as the first argument for each method, including <code class="function">__init__</code>. When you call a method of an ancestor class from within your class, you <em>must</em> include the <code>self</code> argument. But when you call your class method from outside, you do not specify anything for the <code>self</code> argument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent,
but it may appear inconsistent because it relies on a distinction (between bound and unbound methods) that you don't know
about yet.
<p>Whew. I realize that's a lot to absorb, but you'll get the hang of it. All Python classes work the same way, so once you learn one, you've learned them all. If you forget everything else, remember this
one thing, because I promise it will trip you up:<table id="tip.initoptional" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code class="function">__init__</code> methods are optional, but when you define one, you must remember to explicitly call the ancestor's <code class="function">__init__</code> method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor,
the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments.
</td>
</tr>
</table>
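<p>Here is what forgetting that call can look like in practice. This is a sketch with two made-up classes, <code class="classname">GoodInfo</code> and <code class="classname">BrokenInfo</code>. It assumes that <code class="classname">UserDict</code> creates its internal storage in its own <code class="function">__init__</code>, which is an implementation detail, and the import is wrapped so the example can run on Python 2 or Python 3.

```python
try:
    from UserDict import UserDict        # Python 2, as used in this chapter
except ImportError:
    from collections import UserDict     # Python 3 location of the same class

class GoodInfo(UserDict):
    def __init__(self, filename=None):
        UserDict.__init__(self)          # ancestor sets up its internal storage
        self["name"] = filename

class BrokenInfo(UserDict):
    def __init__(self, filename=None):
        self["name"] = filename          # ancestor __init__ was never called

good = GoodInfo("kairo.mp3")
try:
    BrokenInfo("kairo.mp3")
    outcome = "no error"
except AttributeError:
    outcome = "AttributeError"           # the storage attribute does not exist yet
print(good["name"])
print(outcome)
```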
<div class="itemizedlist">
<h3>Further Reading on Python Classes</h3>
<ul>
<li><a href="http://www.freenetpages.co.uk/hp/alan.gauld/" title="Python book for first-time programmers"><i class="citetitle">Learning to Program</i></a> has a gentler <a href="http://www.freenetpages.co.uk/hp/alan.gauld/tutclass.htm">introduction to classes</a>.
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class="citetitle">How to Think Like a Computer Scientist</i></a> shows how to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap12.htm">use classes to model compound datatypes</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> has an in-depth look at <a href="http://www.python.org/doc/current/tut/node11.html">classes, namespaces, and inheritance</a>.
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/242">common questions about classes</a>.
</ul>
<h2 id="fileinfo.create">5.4. Instantiating Classes</h2>
<p>Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the
<code class="function">__init__</code> method defines. The return value will be the newly created object.
<div class="example"><h3>Example 5.7. Creating a <code class="classname">FileInfo</code> Instance</h3><pre class="screen"><samp class="prompt">>>> </samp>import fileinfo
<samp class="prompt">>>> </samp>f = fileinfo.FileInfo("/music/_singles/kairo.mp3") <img id="fileinfo.create.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f.__class__ <img id="fileinfo.create.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;class fileinfo.FileInfo at 010EC204>
<samp class="prompt">>>> </samp>f.__doc__ <img id="fileinfo.create.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'store file metadata'
<samp class="prompt">>>> </samp>f <img id="fileinfo.create.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
{'name': '/music/_singles/kairo.mp3'}</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.create.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You are creating an instance of the <code class="classname">FileInfo</code> class (defined in the <code class="filename">fileinfo</code> module) and assigning the newly created instance to the variable <code class="varname">f</code>. You are passing one parameter, <code>/music/_singles/kairo.mp3</code>, which will end up as the <code class="varname">filename</code> argument in <code class="classname">FileInfo</code>'s <code class="function">__init__</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.create.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Every class instance has a built-in attribute, <code>__class__</code>, which is the object's class. (Note that the representation of this includes the memory address of the class object on my
machine; your representation will be different.) Java programmers may be familiar with the <code class="classname">Class</code> class, which contains methods like <code class="function">getName</code> and <code class="function">getSuperclass</code> to get metadata information about an object. In Python, this kind of metadata is available directly through attributes: <code>__class__</code> on the instance, and <code>__name__</code> and <code>__bases__</code> on the class itself.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.create.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can access the instance's <code>doc string</code> just as with a function or a module. All instances of a class share the same <code>doc string</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.create.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember when the <code class="function">__init__</code> method <a href="#fileinfo.class.example" title="Example 5.4. Defining the FileInfo Class">assigned its <code class="varname">filename</code> argument to <code>self["name"]</code></a>? Well, here's the result. The arguments you pass when you create the class instance get sent right along to the <code class="function">__init__</code> method (along with the object reference, <code>self</code>, which Python adds for free).
</td>
</tr>
</table>
</div><table id="compare.new.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit <code>new</code> operator as there is in <acronym>C++</acronym> or Java.
</td>
</tr>
</table>
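<p>Here is a minimal sketch of the same steps that runs without the book's <code class="filename">fileinfo</code> module. The <code class="classname">FileInfo</code> class below is a hypothetical stand-in that inherits from the built-in <code class="classname">dict</code> (an assumption made only to keep the snippet self-contained); it is not the book's implementation.

```python
class FileInfo(dict):
    "store file metadata"
    def __init__(self, filename=None):
        # Python supplies self; filename comes from the call below.
        self["name"] = filename

# Calling the class like a function creates the instance; there is no "new".
f = FileInfo("/music/_singles/kairo.mp3")

print(f.__class__ is FileInfo)   # the instance knows its class
print(f.__class__.__name__)      # name metadata lives on the class object
print(f.__doc__)                 # the doc string is shared by all instances
print(f)                         # the key set by __init__
```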
<h3>5.4.1. Garbage Collection</h3>
<p>If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances,
because they are freed automatically when the variables assigned to them go out of scope. Memory leaks are rare in Python.
<div class="example"><h3 id="fileinfo.scope">Example 5.8. Trying to Implement a Memory Leak</h3><pre class="screen"><samp class="prompt">>>> </samp>def leakmem():
<samp class="prompt">... </samp>    f = fileinfo.FileInfo('/music/_singles/kairo.mp3') <img id="fileinfo.create.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>for i in range(100):
<samp class="prompt">... </samp>    leakmem() <img id="fileinfo.create.2.3" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.create.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Every time the <code class="function">leakmem</code> function is called, you are creating an instance of <code class="classname">FileInfo</code> and assigning it to the variable <code class="varname">f</code>, which is a local variable within the function. Then the function ends without ever freeing <code class="varname">f</code>, so you would expect a memory leak, but you would be wrong. When the function ends, the local variable <code class="varname">f</code> goes out of scope. At this point, there are no longer any references to the newly created instance of <code class="classname">FileInfo</code> (since you never assigned it to anything other than <code class="varname">f</code>), so Python destroys the instance for us.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.create.2.3"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">No matter how many times you call the <code class="function">leakmem</code> function, it will never leak memory, because every time, Python will destroy the newly created <code class="classname">FileInfo</code> instance before returning from <code class="function">leakmem</code>.
</td>
</tr>
</table>
<p>The technical term for this form of garbage collection is &#8220;reference counting&#8221;. Python keeps a count of the references to every instance created. In the above example, there was only one reference to the <code class="classname">FileInfo</code> instance: the local variable <code class="varname">f</code>. When the function ends, the variable <code class="varname">f</code> goes out of scope, so the reference count drops to <code class="constant">0</code>, and Python destroys the instance automatically.
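<p>One way to watch this happen is to give a class a <code class="function">__del__</code> method that records each destruction. This is only a sketch: the <code class="classname">Tracked</code> class and the <code class="varname">destroyed</code> list are hypothetical names, and the immediate cleanup shown here relies on the reference counting in the standard (C) implementation of Python; other implementations may destroy the instance later.

```python
destroyed = []          # hypothetical log of destroyed instances

class Tracked:
    "hypothetical stand-in for FileInfo that reports its own destruction"
    def __init__(self, label):
        self.label = label
    def __del__(self):
        # Called by Python when the last reference goes away.
        destroyed.append(self.label)

def leakmem():
    f = Tracked("kairo.mp3")   # the only reference is the local variable f
    # f goes out of scope here, so the reference count drops to 0

for i in range(3):
    leakmem()

print(destroyed)   # one entry per call if each instance was destroyed on return
```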
<p>In versions of Python before 2.0, there were situations where reference counting failed, and Python couldn't clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list,
where each node has a pointer to the previous and next node in the list), neither instance would ever be destroyed automatically
because Python (correctly) believed that there was always a reference to each instance. Python 2.0 and later have an additional form of garbage collection called &#8220;mark-and-sweep&#8221; which is smart enough to notice this virtual gridlock and clean up circular references correctly.
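<p>The circular case can be sketched with the standard <code class="filename">gc</code> and <code class="filename">weakref</code> modules. The <code class="classname">Node</code> class and the helper function are hypothetical, and the snippet assumes the standard (C) implementation of Python. Automatic collection is switched off briefly so the effect of the explicit <code class="function">gc.collect</code> call is visible.

```python
import gc
import weakref

class Node:
    "hypothetical list node that can point at another node"
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a      # each node now references the other
    return weakref.ref(a)        # a weak reference does not keep a alive

gc.disable()                     # pause automatic cycle collection for the demo
probe = make_cycle()
alive_before = probe() is not None   # reference counts never reached 0
gc.collect()                     # the cycle collector finds the unreachable pair
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)
```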
<p>As a former philosophy major, it disturbs me to think that things disappear when no one is looking at them, but that's exactly
what happens in Python. In general, you can simply forget about memory management and let Python clean up after you.
<div class="itemizedlist">
<h3>Further Reading on Garbage Collection</h3>
<ul>
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/specialattrs.html">built-in attributes like <code>__class__</code></a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-gc.html"><code class="filename">gc</code> module</a>, which gives you low-level control over Python's garbage collection.
</ul>
<h2 id="fileinfo.userdict">5.5. Exploring <code class="classname">UserDict</code>: A Wrapper Class</h2>
<p>As you've seen, <code class="classname">FileInfo</code> is a class that acts like a dictionary. To explore this further, let's look at the <code class="classname">UserDict</code> class in the <code class="filename">UserDict</code> module, which is the ancestor of the <code class="classname">FileInfo</code> class. This is nothing special; the class is written in Python and stored in a <code>.py</code> file, just like any other Python code. In particular, it's stored in the <code class="filename">lib</code> directory in your Python installation.<table id="tip.locate" class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In the ActivePython <acronym>IDE</acronym> on Windows, you can quickly open any module in your library path by selecting
File->Locate... (<kbd class="shortcut">Ctrl-L</kbd>).
</td>
</tr>
</table>
<div class="example"><h3 id="fileinfo.userdict.init.example">Example 5.9. Defining the <code class="classname">UserDict</code> Class</h3><pre class="programlisting">
class UserDict:                                <img id="fileinfo.userdict.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    def __init__(self, dict=None):             <img id="fileinfo.userdict.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        self.data = {}                         <img id="fileinfo.userdict.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        if dict is not None: self.update(dict) <img id="fileinfo.userdict.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"> <img id="fileinfo.userdict.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that <code class="classname">UserDict</code> is a base class, not inherited from any other class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the <code class="function">__init__</code> method that you <a href="#fileinfo.class.example" title="Example 5.4. Defining the FileInfo Class">overrode in the <code class="classname">FileInfo</code> class</a>. Note that the argument list in this ancestor class is different from the one in the descendant. That's okay; each subclass can have
its own set of arguments, as long as it calls the ancestor with the correct arguments. Here the ancestor class has a way
to define initial values (by passing a dictionary in the <code class="varname">dict</code> argument) which <code class="classname">FileInfo</code> does not use.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Python supports data attributes (called &#8220;instance variables&#8221; in Java and Powerbuilder, and &#8220;member variables&#8221; in <acronym>C++</acronym>). Data attributes are pieces of data held by a specific instance of a class. In this case, each instance of <code class="classname">UserDict</code> will have a data attribute <code class="varname">data</code>. To reference this attribute from code outside the class, you qualify it with the instance name, <code><i class="replaceable">instance</i>.data</code>, in the same way that you qualify a function with its module name. To reference a data attribute from within the class,
you use <code>self</code> as the qualifier. By convention, all data attributes are initialized to reasonable values in the <code class="function">__init__</code> method. However, this is not required, since data attributes, like local variables, <a href="#odbchelper.vardef" title="3.4. Declaring variables">spring into existence</a> when they are first assigned a value.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">update</code> method is a dictionary duplicator: it copies all the keys and values from one dictionary to another. This does <em>not</em> clear the target dictionary first; if the target dictionary already has some keys, any that also appear in the source dictionary will
be overwritten, but the others will be left untouched. Think of <code class="function">update</code> as a merge function, not a copy function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a syntax you may not have seen before (I haven't used it in the examples in this book). It's an <code>if</code> statement, but instead of having an indented block starting on the next line, there is just a single statement on the same
line, after the colon. This is perfectly legal syntax, which is just a shortcut you can use when you have only one statement
in a block. (It's like specifying a single statement without braces in <acronym>C++</acronym>.) You can use this syntax, or you can have indented code on subsequent lines, but you can't do both for the same block.
</td>
</tr>
</table>
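<p>The merge behavior of <code class="function">update</code> and the one-line <code>if</code> form can both be seen in a short sketch. The dictionary contents and variable names here are made up for illustration.

```python
target = {"name": "kairo.mp3", "genre": 31}
source = {"genre": 32, "artist": "unknown"}

# update merges: shared keys are overwritten, the rest of target is kept,
# and source itself is left unchanged.
target.update(source)

# A block with a single statement may follow the colon on the same line.
if "name" in target: name_survived = True

print(target)
```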
</div><table id="compare.overloading" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Java and Powerbuilder support function overloading by argument list, <i class="foreignphrase"><acronym>i.e.</acronym></i> one class can have multiple methods with the same name but a different number of arguments, or arguments of different types.
Other languages (most notably <acronym>PL/SQL</acronym>) even support function overloading by argument name; <i class="foreignphrase"><acronym>i.e.</acronym></i> one class can have multiple methods with the same name and the same number of arguments of the same type but different argument
names. Python supports neither of these; it has no form of function overloading whatsoever. Methods are defined solely by their name,
and there can be only one method per class with a given name. So if a descendant class has an <code class="function">__init__</code> method, it <em>always</em> overrides the ancestor <code class="function">__init__</code> method, even if the descendant defines it with a different argument list. And the same rule applies to any other method.
</td>
</tr>
</table><table id="fileinfo.derivedclasses" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Guido, the original author of Python, explains method overriding this way: "Derived classes may override methods of their base classes. Because methods have no
special privileges when calling other methods of the same object, a method of a base class that calls another method defined
in the same base class, may in fact end up calling a method of a derived class that overrides it. (For <acronym>C++</acronym> programmers: all methods in Python are effectively virtual.)" If that doesn't make sense to you (it confuses the hell out of me), feel free to ignore it.
I just thought I'd pass it along.
</td>
</tr>
</table><table id="note.dataattributes" class="caution" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/caution.png" alt="Caution" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Always assign an initial value to all of an instance's data attributes in the <code class="function">__init__</code> method. It will save you hours of debugging later, tracking down <code class="classname">AttributeError</code> exceptions because you're referencing uninitialized (and therefore non-existent) attributes.
</td>
</tr>
</table>
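<p>The notes above on overriding may be easier to follow with a small sketch. The <code class="classname">Base</code> and <code class="classname">Derived</code> classes are hypothetical. The descendant's <code class="function">__init__</code> takes a different argument list yet still replaces the ancestor's, and when the ancestor's <code class="function">__init__</code> calls <code class="function">self.load</code>, it ends up in the descendant's version, which is what Guido means by methods being effectively virtual.

```python
class Base:
    def __init__(self, data=None):
        self.items = []                       # initialize data attributes up front
        if data is not None: self.load(data)  # dispatches on the instance's class

    def load(self, data):
        self.items.extend(data)

class Derived(Base):
    def __init__(self, first, second):        # different arguments; still an override
        Base.__init__(self, [first, second])  # explicit call to the ancestor

    def load(self, data):                     # called by Base.__init__ for Derived instances
        self.items.extend(x.upper() for x in data)

plain = Base(["a", "b"])
d = Derived("a", "b")
print(plain.items, d.items)
```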
<div class="example"><h3 id="fileinfo.userdict.normalmethods">Example 5.10. <code class="classname">UserDict</code> Normal Methods</h3><pre class="programlisting">
    def clear(self): self.data.clear()          <img id="fileinfo.userdict.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    def copy(self):                             <img id="fileinfo.userdict.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        if self.__class__ is UserDict:          <img id="fileinfo.userdict.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
            return UserDict(self.data)
        import copy                             <img id="fileinfo.userdict.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        return copy.copy(self)
    def keys(self): return self.data.keys()     <img id="fileinfo.userdict.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
    def items(self): return self.data.items()
    def values(self): return self.data.values()
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">clear</code> is a normal class method; it is publicly available to be called by anyone at any time. Notice that <code class="function">clear</code>, like all class methods, has <code>self</code> as its first argument. (Remember that you don't include <code>self</code> when you call the method; it's something that Python adds for you.) Also note the basic technique of this wrapper class: store a real dictionary (<code class="varname">data</code>) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding
method on the real dictionary. (In case you'd forgotten, a dictionary's <code class="function">clear</code> method <a href="#odbchelper.dict.del" title="Example 3.5. Deleting Items from a Dictionary">deletes all of its keys</a> and their associated values.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="filename">copy</code> method of a real dictionary returns a new dictionary that is an exact duplicate of the original (all the same key-value pairs).
But <code class="classname">UserDict</code> can't simply redirect to <code class="function">self.data.copy</code>, because that method returns a real dictionary, and what you want is to return a new instance that is the same class as <code>self</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You use the <code>__class__</code> attribute to see if <code>self</code> is a <code class="classname">UserDict</code>; if so, you're golden, because you know how to copy a <code class="classname">UserDict</code>: just create a new <code class="classname">UserDict</code> and give it the real dictionary that you've squirreled away in <code class="varname">self.data</code>. Then you immediately return the new <code class="classname">UserDict</code>; you don't even get to the <code>import copy</code> on the next line.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If <code>self.__class__</code> is not <code class="classname">UserDict</code>, then <code>self</code> must be some subclass of <code class="classname">UserDict</code> (like maybe <code class="classname">FileInfo</code>), in which case life gets trickier. <code class="classname">UserDict</code> doesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined
in the subclass, so you would need to iterate through them and make sure to copy all of them. Luckily, Python comes with a module to do exactly this, and it's called <code class="filename">copy</code>. I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own).
Suffice it to say that <code class="filename">copy</code> can copy arbitrary Python objects, and that's how you're using it here.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The rest of the methods are straightforward, redirecting the calls to the built-in methods on <code class="varname">self.data</code>.
</td>
</tr>
</table>
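<p>A short sketch shows why <code class="function">copy</code> takes the trouble to preserve the class. It assumes a current Python 3 interpreter, where <code class="classname">UserDict</code> is imported from <code class="filename">collections</code> rather than from the <code class="filename">UserDict</code> module used in this book, and where the library's <code class="function">copy</code> method is somewhat more elaborate than the listing above. The <code class="varname">source</code> attribute is a hypothetical extra data attribute.

```python
from collections import UserDict   # Python 3 location (assumption, see above)

class FileInfo(UserDict):
    "hypothetical subclass with one extra data attribute"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self.source = "example"    # not stored in the wrapped dictionary
        self["name"] = filename

f = FileInfo("/music/_singles/kairo.mp3")
c = f.copy()

print(type(c) is FileInfo)   # the copy has the subclass's type, not plain UserDict
print(c.source)              # the extra data attribute came along too
c["genre"] = 32
print("genre" in f)          # the original is not affected by changes to the copy
```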
</div><table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries. To compensate for
this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: <code class="classname">UserString</code>, <code class="classname">UserList</code>, and <code class="classname">UserDict</code>. Using a combination of normal and special methods, the <code class="classname">UserDict</code> class does an excellent imitation of a dictionary. In Python 2.2 and later, you can inherit classes directly from built-in datatypes like <code class="classname">dict</code>. An example of this is given in the examples that come with this book, in <code class="filename">fileinfo_fromdict.py</code>.
</td>
</tr>
</table>
<p>In Python, you can inherit directly from the <code class="classname">dict</code> built-in datatype, as shown in this example. There are three differences here compared to the <code class="filename">UserDict</code> version.
<div class="example"><h3 id="fileinfo.userdict.fromdict">Example 5.11. Inheriting Directly from Built-In Datatype <code class="classname">dict</code></h3><pre class="programlisting">
class FileInfo(dict):                  <img id="fileinfo.userdict.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    "store file metadata"
    def __init__(self, filename=None): <img id="fileinfo.userdict.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        self["name"] = filename
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first difference is that you don't need to import the <code class="filename">UserDict</code> module, since <code class="classname">dict</code> is a built-in datatype and is always available. The second is that you are inheriting from <code class="classname">dict</code> directly, instead of from <code class="function">UserDict.UserDict</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.userdict.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The third difference is subtle but important. Because of the way <code class="filename">UserDict</code> works internally, it requires you to manually call its <code class="function">__init__</code> method to properly initialize its internal data structures. <code class="classname">dict</code> does not work like this; it is not a wrapper, and it requires no explicit initialization.
</td>
</tr>
</table>
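<p>The third difference can be demonstrated by leaving out the ancestor call in both styles. This sketch assumes a current Python 3 interpreter, where <code class="classname">UserDict</code> lives in <code class="filename">collections</code>; the class names <code class="classname">DictInfo</code> and <code class="classname">WrapperInfo</code> are hypothetical.

```python
from collections import UserDict   # Python 3 location (assumption, see above)

class DictInfo(dict):
    def __init__(self, filename=None):
        self["name"] = filename    # fine: dict needs no explicit initialization

class WrapperInfo(UserDict):
    def __init__(self, filename=None):
        # Deliberately skips UserDict.__init__, so self.data never gets created.
        self["name"] = filename

d = DictInfo("kairo.mp3")

try:
    WrapperInfo("kairo.mp3")
    wrapper_error = None
except AttributeError as exc:      # the wrapper has no data attribute to write to
    wrapper_error = exc

print(d, type(wrapper_error).__name__)
```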
<div class="itemizedlist">
<h3>Further Reading on <code class="filename">UserDict</code></h3>
<ul>
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-UserDict.html"><code class="filename">UserDict</code> module</a> and the <a href="http://www.python.org/doc/current/lib/module-copy.html"><code class="filename">copy</code> module</a>.
</ul>
<h2 id="fileinfo.specialmethods">5.6. Special Class Methods</h2>
<p>In addition to normal class methods, there are a number of special methods that Python classes can define. Instead of being called directly by your code (like normal methods), special methods are called for
you by Python in particular circumstances or when specific syntax is used.
<p>As you saw in the <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">previous section</a>, normal methods go a long way towards wrapping a dictionary in a class. But normal methods alone are not enough, because
there are a lot of things you can do with dictionaries besides call methods on them. For starters, you can <a href="#odbchelper.dict.define" title="Example 3.1. Defining a Dictionary">get</a> and <a href="#odbchelper.dict.modify" title="Example 3.2. Modifying a Dictionary">set</a> items with a syntax that doesn't include explicitly invoking methods. This is where special class methods come in: they
provide a way to map non-method-calling syntax into method calls.
<h3>5.6.1. Getting and Setting Items</h3>
<div class="example"><h3>Example 5.12. The <code class="function">__getitem__</code> Special Method</h3><pre class="programlisting">
def __getitem__(self, key): return self.data[key]</pre><pre class="screen"><samp class="prompt">>>> </samp>f = fileinfo.FileInfo("/music/_singles/kairo.mp3")
<samp class="prompt">>>> </samp>f
{'name':'/music/_singles/kairo.mp3'}
<samp class="prompt">>>> </samp>f.__getitem__("name") <img id="fileinfo.specialmethods.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'/music/_singles/kairo.mp3'
<samp class="prompt">>>> </samp>f["name"] <img id="fileinfo.specialmethods.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'/music/_singles/kairo.mp3'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">__getitem__</code> special method looks simple enough. Like the normal methods <code class="function">clear</code>, <code class="function">keys</code>, and <code class="function">values</code>, it just redirects to the dictionary to return its value. But how does it get called? Well, you can call <code class="function">__getitem__</code> directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works. The right way
to use <code class="function">__getitem__</code> is to get Python to call it for you.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This looks just like the syntax you would use to <a href="#odbchelper.dict.define" title="Example 3.1. Defining a Dictionary">get a dictionary value</a>, and in fact it returns the value you would expect. But here's the missing link: under the covers, Python has converted this syntax to the method call <code>f.__getitem__("name")</code>. That's why <code class="function">__getitem__</code> is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax.
</td>
</tr>
</table>
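<p>To see for yourself that the bracket syntax really goes through <code class="function">__getitem__</code>, you can have the method record each call. The <code class="classname">Wrapper</code> class and its <code class="varname">lookups</code> list are hypothetical and exist only for this sketch.

```python
class Wrapper:
    "hypothetical wrapper that records every lookup"
    def __init__(self):
        self.data = {"name": "/music/_singles/kairo.mp3"}
        self.lookups = []

    def __getitem__(self, key):
        self.lookups.append(key)   # evidence that the method was invoked
        return self.data[key]

w = Wrapper()
direct = w.__getitem__("name")     # explicit call, shown only for comparison
indexed = w["name"]                # Python calls __getitem__ for you

print(direct == indexed, w.lookups)
```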
<p>Of course, Python has a <code class="function">__setitem__</code> special method to go along with <code class="function">__getitem__</code>, as shown in the next example.
<div class="example"><h3 id="fileinfo.specialmethods.setitem.example">Example 5.13. The <code class="function">__setitem__</code> Special Method</h3><pre class="programlisting">
def __setitem__(self, key, item): self.data[key] = item</pre><pre class="screen"><samp class="prompt">>>> </samp>f
{'name':'/music/_singles/kairo.mp3'}
<samp class="prompt">>>> </samp>f.__setitem__("genre", 31) <img id="fileinfo.specialmethods.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f
{'name':'/music/_singles/kairo.mp3', 'genre':31}
<samp class="prompt">>>> </samp>f["genre"] = 32 <img id="fileinfo.specialmethods.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f
{'name':'/music/_singles/kairo.mp3', 'genre':32}</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Like the <code class="function">__getitem__</code> method, <code class="function">__setitem__</code> simply redirects to the real dictionary <code class="varname">self.data</code> to do its work. And like <code class="function">__getitem__</code>, you wouldn't ordinarily call it directly like this; Python calls <code class="function">__setitem__</code> for you when you use the right syntax.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This looks like regular dictionary syntax, except of course that <code class="varname">f</code> is really a class that's trying very hard to masquerade as a dictionary, and <code class="function">__setitem__</code> is an essential part of that masquerade. This line of code actually calls <code>f.__setitem__("genre", 32)</code> under the covers.
</td>
</tr>
</table>
<p><code class="function">__setitem__</code> is a special class method because it gets called for you, but it's still a class method. Just as easily as the <code class="function">__setitem__</code> method was defined in <code class="classname">UserDict</code>, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act
like dictionaries in some ways but define their own behavior above and beyond the built-in dictionary.
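<p>As a minimal sketch of such an override, the hypothetical <code class="classname">ExtFileInfo</code> class below adds a derived <code>ext</code> key whenever <code>name</code> is set, then hands the original assignment to its ancestor. The <code class="classname">FileInfo</code> stand-in inherits from the built-in <code class="classname">dict</code> so the snippet is self-contained; neither class is the book's code.

```python
import os.path

class FileInfo(dict):
    "stand-in ancestor (assumption: built on dict for brevity)"
    def __init__(self, filename=None):
        self["name"] = filename            # goes through __setitem__

class ExtFileInfo(FileInfo):
    "hypothetical handler that derives an 'ext' key from the name"
    def __setitem__(self, key, item):
        if key == "name" and item:
            # Extra processing for this one key: store the file extension.
            FileInfo.__setitem__(self, "ext", os.path.splitext(item)[1])
        # Always let the ancestor do the real assignment.
        FileInfo.__setitem__(self, key, item)

f = ExtFileInfo("/music/_singles/kairo.mp3")
f["genre"] = 32                            # an ordinary key, no extra processing
print(f)
```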
<p>This concept is the basis of the entire framework you're studying in this chapter. Each file type can have a handler class
that knows how to get metadata from a particular type of file. Once some attributes (like the file's name and location) are
known, the handler class knows how to derive other attributes automatically. This is done by overriding the <code class="function">__setitem__</code> method, checking for particular keys, and adding additional processing when they are found.
<p>For example, <code class="classname">MP3FileInfo</code> is a descendant of <code class="classname">FileInfo</code>. When an <code class="classname">MP3FileInfo</code>'s <code>name</code> is set, it doesn't just set the <code>name</code> key (like the ancestor <code class="classname">FileInfo</code> does); it also looks in the file itself for <abbr>MP3</abbr> tags and populates a whole set of keys. The next example shows how this works.
<div class="example"><h3>Example 5.14. Overriding <code class="function">__setitem__</code> in <code class="classname">MP3FileInfo</code></h3><pre class="programlisting">
def __setitem__(self, key, item): <img id="fileinfo.specialmethods.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
if key == "name" and item: <img id="fileinfo.specialmethods.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
self.__parse(item) <img id="fileinfo.specialmethods.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
FileInfo.__setitem__(self, key, item) <img id="fileinfo.specialmethods.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Notice that this <code class="function">__setitem__</code> method is defined exactly the same way as the ancestor method. This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments. (Technically speaking,
the names of the arguments don't matter; only the number of arguments is important.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here's the crux of the entire <code class="classname">MP3FileInfo</code> class: if you're assigning a value to the <code>name</code> key, you want to do something extra.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The extra processing you do for <code>name</code>s is encapsulated in the <code class="function">__parse</code> method. This is another class method defined in <code class="classname">MP3FileInfo</code>, and when you call it, you qualify it with <code class="varname">self</code>. Just calling <code class="function">__parse</code> would look for a normal function defined outside the class, which is not what you want. Calling <code class="function">self.__parse</code> will look for a class method defined within the class. This isn't anything new; you reference <a href="#fileinfo.userdict.normalmethods" title="Example 5.10. UserDict Normal Methods">data attributes</a> the same way.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">After doing this extra processing, you want to call the ancestor method. Remember that this is never done for you in Python; you must do it manually. Note that you're calling the immediate ancestor, <code class="classname">FileInfo</code>, even though it doesn't have a <code class="function">__setitem__</code> method. That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually
find and call the <code class="function">__setitem__</code> defined in <code class="classname">UserDict</code>.
</td>
</tr>
</table>
</div><table id="tip.self.call" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">When accessing data attributes within a class, you need to qualify the attribute name: <code>self.<i class="replaceable">attribute</i></code>. When calling other methods within a class, you need to qualify the method name: <code>self.<i class="replaceable">method</i></code>.
</td>
</tr>
</table>
<div class="example"><h3 id="fileinfo.specialmethods.setname">Example 5.15. Setting an <code class="classname">MP3FileInfo</code>'s <code>name</code></h3><pre class="screen"><samp class="prompt">>>> </samp>import fileinfo
<samp class="prompt">>>> </samp>mp3file = fileinfo.MP3FileInfo() <img id="fileinfo.specialmethods.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>mp3file
{'name': None}
<samp class="prompt">>>> </samp>mp3file["name"] = "/music/_singles/kairo.mp3" <img id="fileinfo.specialmethods.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>mp3file
<samp class="computeroutput">{'album': 'Rave Mix', 'artist': '***DJ MARY-JANE***', 'genre': 31,
'title': 'KAIRO****THE BEST GOA', 'name': '/music/_singles/kairo.mp3',
'year': '2000', 'comment': 'http://mp3.com/DJMARYJANE'}</samp>
<samp class="prompt">>>> </samp>mp3file["name"] = "/music/_singles/sidewinder.mp3" <img id="fileinfo.specialmethods.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>mp3file
<samp class="computeroutput">{'album': '', 'artist': 'The Cynic Project', 'genre': 18, 'title': 'Sidewinder',
'name': '/music/_singles/sidewinder.mp3', 'year': '2000',
'comment': 'http://mp3.com/cynicproject'}</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, you create an instance of <code class="classname">MP3FileInfo</code>, without passing it a filename. (You can get away with this because the <code class="varname">filename</code> argument of the <code class="function">__init__</code> method is <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional</a>.) Since <code class="classname">MP3FileInfo</code> has no <code class="function">__init__</code> method of its own, Python walks up the ancestor tree and finds the <code class="function">__init__</code> method of <code class="classname">FileInfo</code>. This <code class="function">__init__</code> method manually calls the <code class="function">__init__</code> method of <code class="classname">UserDict</code> and then sets the <code>name</code> key to <code class="varname">filename</code>, which is <code>None</code>, since you didn't pass a filename. Thus, <code class="varname">mp3file</code> initially looks like a dictionary with one key, <code>name</code>, whose value is <code>None</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now the real fun begins. Setting the <code>name</code> key of <code class="varname">mp3file</code> triggers the <code class="function">__setitem__</code> method on <code class="classname">MP3FileInfo</code> (not <code class="classname">UserDict</code>), which notices that you're setting the <code>name</code> key with a real value and calls <code class="function">self.__parse</code>. Although you haven't traced through the <code class="function">__parse</code> method yet, you can see from the output that it sets several other keys: <code>album</code>, <code>artist</code>, <code>genre</code>, <code>title</code>, <code>year</code>, and <code>comment</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.specialmethods.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Modifying the <code>name</code> key will go through the same process again: Python calls <code class="function">__setitem__</code>, which calls <code class="function">self.__parse</code>, which sets all the other keys.
</td>
</tr>
</table>
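<p>If you don't have an <abbr>MP3</abbr> file handy, the same pattern can be tried on its own. The following is a minimal sketch, not the book's code: <code class="classname">PathInfo</code> and its <code>basename</code> and <code>ext</code> keys are hypothetical, and it assumes a version of Python (2.2 or later) where a class can inherit directly from <code>dict</code> instead of <code class="classname">UserDict</code>.

```python
import os.path


class PathInfo(dict):
    "hypothetical stand-in for MP3FileInfo: derives keys from the file name"

    def __setitem__(self, key, item):
        # Do the extra work only when "name" is set to a real value.
        if key == "name" and item:
            self.__parse(item)
        # Then call the ancestor method explicitly, as Example 5.14 does.
        dict.__setitem__(self, key, item)

    def __parse(self, filename):
        # Store the derived keys through the ancestor method, so they
        # do not pass through the "name" check again.
        base = os.path.basename(filename)
        stem, ext = os.path.splitext(base)
        dict.__setitem__(self, "basename", base)
        dict.__setitem__(self, "ext", ext)


info = PathInfo()
info["name"] = "/music/_singles/kairo.mp3"
print(sorted(info.items()))
```

<p>Setting the <code>name</code> key fills in the other keys as a side effect, just as it does in <code class="classname">MP3FileInfo</code>; only the source of the extra data is different.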
<h2 id="fileinfo.morespecial">5.7. Advanced Special Class Methods</h2>
<p>Python has more special methods than just <code class="function">__getitem__</code> and <code class="function">__setitem__</code>. Some of them let you emulate functionality that you may not even know about.
<p>This example shows some of the other special methods in <code class="filename">UserDict</code>.
<div class="example"><h3 id="fileinfo.morespecial.example">Example 5.16. More Special Methods in <code class="classname">UserDict</code></h3><pre class="programlisting">
def __repr__(self): return repr(self.data) <img id="fileinfo.morespecial.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
def __cmp__(self, dict): <img id="fileinfo.morespecial.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
if isinstance(dict, UserDict):
return cmp(self.data, dict.data)
else:
return cmp(self.data, dict)
def __len__(self): return len(self.data) <img id="fileinfo.morespecial.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
def __delitem__(self, key): del self.data[key] <img id="fileinfo.morespecial.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.morespecial.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">__repr__</code> is a special method that is called when you call <code>repr(<i class="replaceable">instance</i>)</code>. The <code class="function">repr</code> function is a built-in function that returns a string representation of an object. It works on any object, not just class
instances. You're already intimately familiar with <code class="function">repr</code> and you don't even know it. In the interactive window, when you type just a variable name and press the <kbd>ENTER</kbd> key, Python uses <code class="function">repr</code> to display the variable's value. Go create a dictionary <code class="varname">d</code> with some data and then <code>print repr(d)</code> to see for yourself.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.morespecial.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">__cmp__</code> is called when you compare class instances. In general, you can compare any two Python objects, not just class instances, by using <code>==</code>. There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they
have all the same keys and values, and strings are equal when they are the same length and contain the same sequence of characters.
For class instances, you can define the <code class="function">__cmp__</code> method and code the comparison logic yourself, and then you can use <code>==</code> to compare instances of your class and Python will call your <code class="function">__cmp__</code> special method for you.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.morespecial.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">__len__</code> is called when you call <code>len(<i class="replaceable">instance</i>)</code>. The <code class="function">len</code> function is a built-in function that returns the length of an object. It works on any object that could reasonably be thought
of as having a length. The <code class="function">len</code> of a string is its number of characters; the <code class="function">len</code> of a dictionary is its number of keys; the <code class="function">len</code> of a list or tuple is its number of elements. For class instances, define the <code class="function">__len__</code> method and code the length calculation yourself, and then call <code>len(<i class="replaceable">instance</i>)</code> and Python will call your <code class="function">__len__</code> special method for you.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.morespecial.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">__delitem__</code> is called when you call <code>del <i class="replaceable">instance</i>[<i class="replaceable">key</i>]</code>, which you may remember as the way to <a href="#odbchelper.dict.del" title="Example 3.5. Deleting Items from a Dictionary">delete individual items from a dictionary</a>. When you use <code class="function">del</code> on a class instance, Python calls the <code class="function">__delitem__</code> special method for you.
</td>
</tr>
</table>
</div><table id="compare.strequals.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In Java, you determine whether two string variables reference the same physical memory location by using <code>str1 == str2</code>. This is called <em>object identity</em>, and it is written in Python as <code>str1 is str2</code>. To compare string values in Java, you would use <code>str1.equals(str2)</code>; in Python, you would use <code>str1 == str2</code>. Java programmers who have been taught to believe that the world is a better place because <code>==</code> in Java compares by identity instead of by value may have a difficult time adjusting to Python's lack of such &#8220;gotchas&#8221;.
</td>
</tr>
</table>
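<p>None of these special methods is specific to <code class="classname">UserDict</code>. Here is a minimal sketch of a class that is not a dictionary at all; <code class="classname">Playlist</code> is a hypothetical name, not part of <code class="filename">fileinfo.py</code>. It assumes a Python where <code>==</code> is customized with <code class="function">__eq__</code>, which plays the role that <code class="function">__cmp__</code> plays in the example above (Python 3 no longer calls <code class="function">__cmp__</code>).

```python
class Playlist:
    "hypothetical list-like class, for illustration only"

    def __init__(self, tracks):
        self.tracks = list(tracks)

    def __repr__(self):
        # Called by repr(instance) and by the interactive shell.
        return "Playlist(%r)" % (self.tracks,)

    def __eq__(self, other):
        # Called for instance == other.
        if isinstance(other, Playlist):
            return self.tracks == other.tracks
        return self.tracks == other

    def __len__(self):
        # Called by len(instance).
        return len(self.tracks)

    def __delitem__(self, index):
        # Called by del instance[index].
        del self.tracks[index]


a = Playlist(["kairo", "sidewinder"])
b = Playlist(["kairo", "sidewinder"])
print(repr(a))
print(a == b)   # same value
print(a is b)   # but not the same object
del a[0]
print(len(a))
```

<p>Note the difference between the last two comparisons: <code>a == b</code> asks your class whether the values match, while <code>a is b</code> asks Python whether the two names refer to the same object.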
<p>At this point, you may be thinking, &#8220;All this work just to do something in a class that I can do with a built-in datatype.&#8221; And it's true that life is easier (and a wrapper class like <code class="classname">UserDict</code> is often unnecessary) when you can inherit directly from built-in datatypes like a dictionary, as Python 2.2 and later allow. But even then, special
methods are still useful, because they can be used in any class, not just wrapper classes like <code class="classname">UserDict</code>.
<p>Special methods mean that <em>any class</em> can store key/value pairs like a dictionary, just by defining the <code class="function">__setitem__</code> method. <em>Any class</em> can act like a sequence, just by defining the <code class="function">__getitem__</code> method. Any class that defines the <code class="function">__cmp__</code> method can be compared with <code>==</code>. And if your class represents something that has a length, don't define a <code class="function">GetLength</code> method; define the <code class="function">__len__</code> method and use <code>len(<i class="replaceable">instance</i>)</code>.<table id="note.physical.v.logical" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">While other object-oriented languages only let you define the physical model of an object (&#8220;this object has a <code class="function">GetLength</code> method&#8221;), Python's special class methods like <code class="function">__len__</code> allow you to define the logical model of an object (&#8220;this object has a length&#8221;).
</td>
</tr>
</table>
<p>Python has a lot of other special methods. There's a whole set of them that let classes act like numbers, allowing you to add,
subtract, and do other arithmetic operations on class instances. (The canonical example of this is a class that represents
complex numbers, numbers with both real and imaginary components.) The <code class="function">__call__</code> method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods
that allow classes to have read-only and write-only data attributes; you'll learn more about those in later chapters.
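<p>To make the last paragraph a little more concrete, here is a small sketch of a class that acts like a number and a class that acts like a function. Both <code class="classname">Money</code> and <code class="classname">Multiplier</code> are hypothetical names invented for this illustration; only the special method names (<code class="function">__add__</code>, <code class="function">__sub__</code>, and <code class="function">__call__</code>) come from Python itself.

```python
class Money:
    "hypothetical number-like class: an amount in cents"

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called for instance + other.
        return Money(self.cents + other.cents)

    def __sub__(self, other):
        # Called for instance - other.
        return Money(self.cents - other.cents)


class Multiplier:
    "hypothetical function-like class"

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        # Called for instance(value).
        return value * self.factor


total = Money(250) + Money(175) - Money(25)
triple = Multiplier(3)
print(total.cents)
print(triple(14))
```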
<div class="itemizedlist">
<h3>Further Reading on Special Class Methods</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class="citetitle">Python Reference Manual</i></a> documents <a href="http://www.python.org/doc/current/ref/specialnames.html">all the special class methods</a>.
</ul>
<h2 id="fileinfo.classattributes">5.8. Introducing Class Attributes</h2>
<p>You already know about <a href="#fileinfo.userdict.init.example" title="Example 5.9. Defining the UserDict Class">data attributes</a>, which are variables owned by a specific instance of a class. Python also supports class attributes, which are variables owned by the class itself.
<div class="example"><h3 id="fileinfo.classattributes.intro">Example 5.17. Introducing Class Attributes</h3><pre class="programlisting">
class MP3FileInfo(FileInfo):
"store ID3v1.0 MP3 tags"
tagDataMap = {"title" : ( 3, 33, stripnulls),
"artist" : ( 33, 63, stripnulls),
"album" : ( 63, 93, stripnulls),
"year" : ( 93, 97, stripnulls),
"comment" : ( 97, 126, stripnulls),
"genre" : (127, 128, ord)}</pre><pre class="screen"><samp class="prompt">>>> </samp>import fileinfo
<samp class="prompt">>>> </samp>fileinfo.MP3FileInfo <img id="fileinfo.classattributes.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;class fileinfo.MP3FileInfo at 01257FDC>
<samp class="prompt">>>> </samp>fileinfo.MP3FileInfo.tagDataMap <img id="fileinfo.classattributes.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">{'title': (3, 33, &lt;function stripnulls at 0260C8D4>),
'genre': (127, 128, &lt;built-in function ord>),
'artist': (33, 63, &lt;function stripnulls at 0260C8D4>),
'year': (93, 97, &lt;function stripnulls at 0260C8D4>),
'comment': (97, 126, &lt;function stripnulls at 0260C8D4>),
'album': (63, 93, &lt;function stripnulls at 0260C8D4>)}</samp>
<samp class="prompt">>>> </samp>m = fileinfo.MP3FileInfo() <img id="fileinfo.classattributes.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>m.tagDataMap
<samp class="computeroutput">{'title': (3, 33, &lt;function stripnulls at 0260C8D4>),
'genre': (127, 128, &lt;built-in function ord>),
'artist': (33, 63, &lt;function stripnulls at 0260C8D4>),
'year': (93, 97, &lt;function stripnulls at 0260C8D4>),
'comment': (97, 126, &lt;function stripnulls at 0260C8D4>),
'album': (63, 93, &lt;function stripnulls at 0260C8D4>)}</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="classname">MP3FileInfo</code> is the class itself, not any particular instance of the class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">tagDataMap</code> is a class attribute: literally, an attribute of the class. It is available before creating any instances of the class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Class attributes are available both through direct reference to the class and through any instance of the class.</td>
</tr>
</table>
</div><table id="compare.classattr.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the <code>static</code> keyword, one without). In Python, only class attributes can be defined here; data attributes are defined in the <code class="function">__init__</code> method.
</td>
</tr>
</table>
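<p>The difference between the two kinds of attributes is easy to check for yourself. This sketch uses a hypothetical <code class="classname">Track</code> class (not part of <code class="filename">fileinfo.py</code>) with one class attribute, <code class="varname">tagSize</code>, and one data attribute, <code class="varname">title</code>.

```python
class Track:
    "hypothetical class, for illustration only"
    tagSize = 128                 # class attribute: owned by the class

    def __init__(self, title):
        self.title = title        # data attribute: owned by one instance


print(Track.tagSize)              # available before any instance exists
t = Track("Sidewinder")
print(t.tagSize)                  # also reachable through an instance
print(hasattr(Track, "title"))    # the class itself has no title
print("tagSize" in t.__dict__)    # tagSize is not stored on the instance
```

<p>The last line looks in the instance's own attribute dictionary: <code class="varname">tagSize</code> isn't there, because the instance finds it on the class.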
<p>Class attributes can be used as class-level constants (which is how you use them in <code class="classname">MP3FileInfo</code>), but they are not really constants. You can also change them.<table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">There are no constants in Python. Everything can be changed if you try hard enough. This fits with one of the core principles of Python: bad behavior should be discouraged but not banned. If you really want to change the value of <code>None</code>, you can do it, but don't come running to me when your code is impossible to debug.
</td>
</tr>
</table>
<div class="example"><h3 id="fileinfo.classattributes.writeable.example">Example 5.18. Modifying Class Attributes</h3><pre class="screen"><samp class="prompt">>>> </samp>class counter:
<samp class="prompt">... </samp>count = 0 <img id="fileinfo.classattributes.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>def __init__(self):
<samp class="prompt">... </samp> self.__class__.count += 1 <img id="fileinfo.classattributes.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>counter
&lt;class __main__.counter at 010EAECC>
<samp class="prompt">>>> </samp>counter.count <img id="fileinfo.classattributes.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
0
<samp class="prompt">>>> </samp>c = counter()
<samp class="prompt">>>> </samp>c.count <img id="fileinfo.classattributes.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
1
<samp class="prompt">>>> </samp>counter.count
1
<samp class="prompt">>>> </samp>d = counter() <img id="fileinfo.classattributes.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>d.count
2
<samp class="prompt">>>> </samp>c.count
2
<samp class="prompt">>>> </samp>counter.count
2</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">count</code> is a class attribute of the <code class="classname">counter</code> class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>__class__</code> is a built-in attribute of every class instance (of every class). It is a reference to the class that <code class="varname">self</code> is an instance of (in this case, the <code class="classname">counter</code> class).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Because <code class="varname">count</code> is a class attribute, it is available through direct reference to the class, before you have created any instances of the
class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Creating an instance of the class calls the <code class="function">__init__</code> method, which increments the class attribute <code class="varname">count</code> by <code class="constant">1</code>. This affects the class itself, not just the newly created instance.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.classattributes.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Creating a second instance will increment the class attribute <code class="varname">count</code> again. Notice how the class attribute is shared by the class and all instances of the class.
</td>
</tr>
</table>
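<p>You might wonder why the example goes through <code>self.__class__</code> instead of simply writing <code>self.count += 1</code>. The following sketch compares the two; <code class="classname">SharedCounter</code> and <code class="classname">ShadowCounter</code> are hypothetical names chosen for this comparison.

```python
class SharedCounter:
    count = 0

    def __init__(self):
        # Rebinds the class attribute, so every instance sees the change.
        self.__class__.count += 1


class ShadowCounter:
    count = 0

    def __init__(self):
        # Reads the class attribute (0), then stores 1 as a new data
        # attribute on this instance; the class attribute stays 0.
        self.count += 1


a, b = SharedCounter(), SharedCounter()
c, d = ShadowCounter(), ShadowCounter()
print(SharedCounter.count, a.count, b.count)
print(ShadowCounter.count, c.count, d.count)
```

<p>Assigning through <code class="varname">self</code> always creates or changes a data attribute on the instance, so the second class never counts past <code class="constant">1</code>.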
<h2 id="fileinfo.private">5.9. Private Functions</h2>
<p>Like most languages, Python has the concept of private elements:
<div class="itemizedlist">
<ul>
<li>Private functions, which can't be called from outside their module
<li>Private class methods, which can't be called from outside their class
<li>Private attributes, which can't be accessed from outside their class
</ul>
<p>Unlike in most languages, whether a Python function, method, or attribute is private or public is determined entirely by its name.
<p>If the name of a Python function, class method, or attribute starts with (but doesn't end with) two underscores, it's private; everything else is
public. Python has no concept of <em>protected</em> class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible
only in their own class) or public (accessible from anywhere).
<p>In <code class="classname">MP3FileInfo</code>, there are two methods: <code class="function">__parse</code> and <code class="function">__setitem__</code>. As already discussed, <code class="function">__setitem__</code> is a <a href="#fileinfo.specialmethods.setitem.example" title="Example 5.13. The __setitem__ Special Method">special method</a>; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could
call it directly (even from outside the <code class="filename">fileinfo</code> module) if you had a really good reason. However, <code class="function">__parse</code> is private, because it has two underscores at the beginning of its name.<table id="tip.specialmethodnames" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In Python, all special methods (like <a href="#fileinfo.specialmethods.setitem.example" title="Example 5.13. The __setitem__ Special Method"><code class="function">__setitem__</code></a>) and built-in attributes (like <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's doc string"><code>__doc__</code></a>) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and
attributes this way, because it will only confuse you (and others) later.
</td>
</tr>
</table>
<div class="example"><h3>Example 5.19. Trying to Call a Private Method</h3><pre class="screen"><samp class="prompt">>>> </samp>import fileinfo
<samp class="prompt">>>> </samp>m = fileinfo.MP3FileInfo()
<samp class="prompt">>>> </samp>m.__parse("/music/_singles/kairo.mp3") <img id="fileinfo.private.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'MP3FileInfo' instance has no attribute '__parse'</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.private.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If you try to call a private method, Python will raise a slightly misleading exception, saying that the method does not exist. Of course it does exist, but it's private,
so it's not accessible outside the class. Strictly speaking, private methods are accessible outside their class, just not <em>easily</em> accessible. Nothing in Python is truly private; internally, the names of private methods and attributes are mangled and unmangled on the fly to make them
seem inaccessible by their given names. You can access the <code class="function">__parse</code> method of the <code class="classname">MP3FileInfo</code> class by the name <code class="function">_MP3FileInfo__parse</code>. Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a
reason, but like many other things in Python, their privateness is ultimately a matter of convention, not force.
</td>
</tr>
</table>
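<p>You can watch the name mangling happen without touching <code class="filename">fileinfo.py</code>. This is a minimal sketch with a hypothetical <code class="classname">Parser</code> class standing in for <code class="classname">MP3FileInfo</code>; it is here to show the mechanism, not to recommend the last two lines.

```python
class Parser:
    "hypothetical class with one private and one public method"

    def __parse(self, text):
        # Stored on the class under the mangled name _Parser__parse.
        return text.upper()

    def parse(self, text):
        # Inside the class, the private name works as written.
        return self.__parse(text)


p = Parser()
print(p.parse("kairo"))

try:
    p.__parse("kairo")
except AttributeError:
    print("no attribute named __parse outside the class")

print("_Parser__parse" in Parser.__dict__)
print(p._Parser__parse("kairo"))   # works, but don't do this in real code
```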
<div class="itemizedlist">
<h3>Further Reading on Private Functions</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses the inner workings of <a href="http://www.python.org/doc/current/tut/node11.html#SECTION0011600000000000000000">private variables</a>.
</ul>
<h2 id="fileinfo.summary">5.10. Summary</h2>
<p>That's it for the hard-core object trickery. You'll see a real-world application of special class methods in <a href="#soap">Chapter 12</a>, which uses <code class="function">getattr</code> to create a proxy to a remote web service.
<p>The next chapter will continue using this code sample to explore other Python concepts, such as exceptions, file objects, and <code>for</code> loops.
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
<div class="itemizedlist">
<ul>
<li>Importing modules using either <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's doc string"><code>import <i class="replaceable">module</i></code></a> or <a href="#fileinfo.fromimport" title="5.2. Importing Modules Using from module import"><code>from <i class="replaceable">module</i> import</code></a>
<li><a href="#fileinfo.class" title="5.3. Defining Classes">Defining</a> and <a href="#fileinfo.create" title="5.4. Instantiating Classes">instantiating</a> classes
<li>Defining <a href="#fileinfo.class.example" title="Example 5.4. Defining the FileInfo Class"><code class="function">__init__</code> methods</a> and other <a href="#fileinfo.specialmethods" title="5.6. Special Class Methods">special class methods</a>, and understanding when they are called
<li>Subclassing <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class"><code class="classname">UserDict</code></a> to define classes that act like dictionaries
<li>Defining <a href="#fileinfo.userdict.init.example" title="Example 5.9. Defining the UserDict Class">data attributes</a> and <a href="#fileinfo.classattributes" title="5.8. Introducing Class Attributes">class attributes</a>, and understanding the differences between them
<li>Defining <a href="#fileinfo.private" title="5.9. Private Functions">private attributes and methods</a>
</ul>
<div class="chapter">
<h2 id="filehandling">Chapter 6. Exceptions and File Handling</h2>
<p>In this chapter, you will dive into exceptions, file objects, <code>for</code> loops, and the <code class="filename">os</code> and <code class="filename">sys</code> modules. If you've used exceptions in another programming language, you can skim the first section to get a sense of Python's syntax. Be sure to tune in again for file handling.
<h2 id="fileinfo.exception">6.1. Handling Exceptions</h2>
<p>Like many other programming languages, Python has exception handling via <code>try...except</code> blocks.<table id="compare.exceptions.java" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Python uses <code>try...except</code> to handle exceptions and <code>raise</code> to generate them. Java and <acronym>C++</acronym> use <code>try...catch</code> to handle exceptions, and <code>throw</code> to generate them.
</td>
</tr>
</table>
<p>Exceptions are everywhere in Python. Virtually every module in the standard Python library uses them, and Python itself will raise them in a lot of different circumstances. You've already seen them repeatedly throughout this book.
<div class="itemizedlist">
<ul>
<li><a href="#odbchelper.dict.define" title="Example 3.1. Defining a Dictionary">Accessing a non-existent dictionary key</a> will raise a <code class="errorcode">KeyError</code> exception.
<li><a href="#odbchelper.list.search" title="Example 3.12. Searching a List">Searching a list for a non-existent value</a> will raise a <code class="errorcode">ValueError</code> exception.
<li><a href="#odbchelper.tuplemethods" title="Example 3.16. Tuples Have No Methods">Calling a non-existent method</a> will raise an <code class="errorcode">AttributeError</code> exception.
<li><a href="#odbchelper.unboundvariable" title="Example 3.18. Referencing an Unbound Variable">Referencing a non-existent variable</a> will raise a <code class="errorcode">NameError</code> exception.
<li><a href="#odbchelper.stringformatting.coerce" title="Example 3.22. String Formatting vs. Concatenating">Mixing datatypes without coercion</a> will raise a <code class="errorcode">TypeError</code> exception.
</ul>
<p>In each of these cases, you were simply playing around in the Python <acronym>IDE</acronym>: an error occurred, the exception was printed (depending on your <acronym>IDE</acronym>, perhaps in an intentionally jarring shade of red), and that was that. This is called an <em>unhandled</em> exception. When the exception was raised, there was no code to explicitly notice it and deal with it, so it bubbled its
way back to the default behavior built in to Python, which is to spit out some debugging information and give up. In the <acronym>IDE</acronym>, that's no big deal, but if that happened while your actual Python program was running, the entire program would come to a screeching halt.
<p>An exception doesn't need to result in a complete program crash, though. Exceptions, when raised, can be <em>handled</em>. Sometimes an exception really is caused by a bug in your code (like accessing a variable that doesn't exist), but
many times, an exception is something you can anticipate. If you're opening a file, it might not exist. If you're connecting
to a database, it might be unavailable, or you might not have the correct security credentials to access it. If you know
a line of code may raise an exception, you should handle the exception using a <code>try...except</code> block.
<div class="example"><h3>Example 6.1. Opening a Non-Existent File</h3><pre class="screen"><samp class="prompt">>>> </samp>fsock = open("/notthere", "r") <img id="fileinfo.exceptions.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'</samp>
<samp class="prompt">>>> </samp>try:
<samp class="prompt">... </samp>    fsock = open("/notthere") <img id="fileinfo.exceptions.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">... </samp>except IOError: <img id="fileinfo.exceptions.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">... </samp>    print "The file does not exist, exiting gracefully"
<samp class="prompt">... </samp>print "This line will always print" <img id="fileinfo.exceptions.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">The file does not exist, exiting gracefully
This line will always print</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using the built-in <code class="function">open</code> function, you can try to open a file for reading (more on <code class="function">open</code> in the next section). But the file doesn't exist, so this raises the <code class="errorcode">IOError</code> exception. Since you haven't provided any explicit check for an <code class="errorcode">IOError</code> exception, Python just prints out some debugging information about what happened and then gives up.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You're trying to open the same non-existent file, but this time you're doing it within a <code>try...except</code> block.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When the <code class="function">open</code> function raises an <code class="errorcode">IOError</code> exception, you're ready for it. The <code>except IOError:</code> line catches the exception and executes your own block of code, which in this case just prints a more pleasant error message.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Once an exception has been handled, processing continues normally on the first line after the <code>try...except</code> block. Note that this line will always print, whether or not an exception occurs. If you really did have a file called
<code class="filename">notthere</code> in your root directory, the call to <code class="function">open</code> would succeed, the <code>except</code> clause would be ignored, and this line would still be executed.
</td>
</tr>
</table>
<p>Exceptions may seem unfriendly (after all, if you don't catch the exception, your entire program will crash), but consider
the alternative. Would you rather get back an unusable file object to a non-existent file? You'd need to check its validity
somehow anyway, and if you forgot, your program would give you strange errors somewhere down the
line that you would need to trace back to the source. I'm sure you've experienced this, and you know it's not fun. With
exceptions, errors occur immediately, and you can handle them in a standard way at the source of the problem.
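<p>As a rough sketch of what "a standard way" can look like, the exceptions listed at the start of this section can all be caught with the same <code>try...except</code> pattern. The helper name <code>describe_failure</code> is made up for this illustration, and <code>print</code> is written with parentheses so that the sketch also runs on later Python releases.

```python
import sys

def describe_failure(action):
    """Call action() and return the name of the exception it raised, or None."""
    try:
        action()
    except (KeyError, ValueError, AttributeError, NameError, TypeError):
        # sys.exc_info()[0] is the class of the exception being handled
        return sys.exc_info()[0].__name__
    return None

results = [
    describe_failure(lambda: {}["missing"]),      # non-existent dictionary key
    describe_failure(lambda: [1, 2].index(3)),    # value that is not in the list
    describe_failure(lambda: (1, 2).append(3)),   # tuples have no append method
    describe_failure(lambda: undefined_name),     # variable that was never bound
    describe_failure(lambda: "abc" + 1),          # string plus integer, no coercion
]
print(results)
```

<p>Each call fails for a different reason, but none of the failures escapes: every one is caught at the point where it happens and turned into an ordinary return value.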
<h3>6.1.1. Using Exceptions For Other Purposes</h3>
<p>There are a lot of other uses for exceptions besides handling actual error conditions. A common use in the standard Python library is to try to import a module, and then check whether it worked. Importing a module that does not exist will raise
an <code class="errorcode">ImportError</code> exception. You can use this to define multiple levels of functionality based on which modules are available at run-time,
or to support multiple platforms (where platform-specific code is separated into different modules).
<p>You can also define your own exceptions by creating a class that inherits from the built-in <code class="classname">Exception</code> class, and then raise your exceptions with the <code class="function">raise</code> command. See the further reading section if you're interested in doing this.
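<p>A minimal sketch of a user-defined exception might look like the following. The names <code>ConfigError</code> and <code>load_setting</code> are hypothetical and exist only for this illustration; nothing in the standard library defines them.

```python
class ConfigError(Exception):
    """Hypothetical application-specific exception."""
    pass

def load_setting(settings, key):
    """Return settings[key], raising ConfigError if the key is absent."""
    if key not in settings:
        raise ConfigError("missing setting: " + key)
    return settings[key]

try:
    load_setting({"server": "mpilgrim"}, "database")
except ConfigError:
    # The caller decides what a missing setting means for the program
    message = "could not load the setting, using a default"
else:
    message = "setting loaded"
print(message)
```

<p>Because <code>ConfigError</code> inherits from <code class="classname">Exception</code>, it can be raised and caught exactly like the built-in exceptions.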
<p>The next example demonstrates how to use an exception to support platform-specific functionality. This code comes from the
<code class="filename">getpass</code> module, a wrapper module for getting a password from the user. Getting a password is accomplished differently on <acronym>UNIX</acronym>, Windows, and Mac OS platforms, but this code encapsulates all of those differences.
<div class="example"><h3 id="crossplatform.example">Example 6.2. Supporting Platform-Specific Functionality</h3><pre class="programlisting">
# Bind the name getpass to the appropriate function
try:
    import termios, TERMIOS <img id="fileinfo.exceptions.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
except ImportError:
    try:
        import msvcrt <img id="fileinfo.exceptions.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    except ImportError:
        try:
            from EasyDialogs import AskPassword <img id="fileinfo.exceptions.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        except ImportError:
            getpass = default_getpass <img id="fileinfo.exceptions.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        else: <img id="fileinfo.exceptions.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
            getpass = AskPassword
    else:
        getpass = win_getpass
else:
    getpass = unix_getpass</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">termios</code> is a <acronym>UNIX</acronym>-specific module that provides low-level control over the input terminal. If this module is not available (because it's not
on your system, or your system doesn't support it), the import fails and Python raises an <code class="errorcode">ImportError</code>, which you catch.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">OK, you didn't have <code class="filename">termios</code>, so let's try <code class="filename">msvcrt</code>, which is a Windows-specific module that provides an <acronym>API</acronym> to many useful functions in the Microsoft Visual C++ runtime services. If this import fails, Python will raise an <code class="errorcode">ImportError</code>, which you catch.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the first two didn't work, you try to import a function from <code class="filename">EasyDialogs</code>, which is a Mac OS-specific module that provides functions to pop up dialog boxes of various types. Once again, if this import fails, Python will raise an <code class="errorcode">ImportError</code>, which you catch.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">None of these platform-specific modules is available (which is possible, since Python has been ported to a lot of different platforms), so you need to fall back on a default password input function (which is
defined elsewhere in the <code class="filename">getpass</code> module). Notice what you're doing here: assigning the function <code class="function">default_getpass</code> to the variable <code class="varname">getpass</code>. If you read the official <code class="filename">getpass</code> documentation, it tells you that the <code class="filename">getpass</code> module defines a <code class="function">getpass</code> function. It does this by binding <code class="varname">getpass</code> to the correct function for your platform. Then when you call the <code class="function">getpass</code> function, you're really calling a platform-specific function that this code has set up for you. You don't need to know or
care which platform your code is running on -- just call <code class="function">getpass</code>, and it will always do the right thing.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.exceptions.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">A <code>try...except</code> block can have an <code>else</code> clause, like an <code>if</code> statement. If no exception is raised during the <code>try</code> block, the <code>else</code> clause is executed afterwards. In this case, that means that the <code>from EasyDialogs import AskPassword</code> import worked, so you should bind <code class="varname">getpass</code> to the <code class="function">AskPassword</code> function. Each of the other <code>try...except</code> blocks has similar <code>else</code> clauses to bind <code class="varname">getpass</code> to the appropriate function when you find an <code>import</code> that works.
</td>
</tr>
</table>
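<p>The same import-and-fall-back pattern can be tried on a smaller scale. This is only a sketch, not code from the <code class="filename">getpass</code> module: <code>fallback_read_key</code> and <code>read_key</code> are names invented for the illustration, while <code class="filename">msvcrt</code> and its <code class="function">getch</code> function are the real Windows-only module and function.

```python
def fallback_read_key():
    """Portable stand-in, used when no platform-specific module is found."""
    return None

try:
    import msvcrt                  # only available on Windows
except ImportError:
    read_key = fallback_read_key   # the import failed: bind the fallback
else:
    read_key = msvcrt.getch        # the import worked: bind the real function

print(read_key.__name__)
```

<p>Code that calls <code>read_key</code> afterwards does not need to know which branch ran; the name is bound to whichever function fits the platform.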
<div class="itemizedlist">
<h3>Further Reading on Exception Handling</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses <a href="http://www.python.org/doc/current/tut/node10.html#SECTION0010400000000000000000">defining and raising your own exceptions, and handling multiple exceptions at once</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/module-exceptions.html">all the built-in exceptions</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-getpass.html">getpass</a> module.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-traceback.html"><code class="filename">traceback</code> module</a>, which provides low-level access to exception attributes after an exception is raised.
<li><a href="http://www.python.org/doc/current/ref/"><i class="citetitle">Python Reference Manual</i></a> discusses the inner workings of the <a href="http://www.python.org/doc/current/ref/try.html"><code>try...except</code> block</a>.
</ul>
<h2 id="fileinfo.files">6.2. Working with File Objects</h2>
<p>Python has a built-in function, <code class="function">open</code>, for opening a file on disk. <code class="function">open</code> returns a file object, which has methods and attributes for getting information about and manipulating the opened file.
<div class="example"><h3>Example 6.3. Opening a File</h3><pre class="screen"><samp class="prompt">>>> </samp>f = open("/music/_singles/kairo.mp3", "rb") <img id="fileinfo.files.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f <img id="fileinfo.files.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
<samp class="prompt">>>> </samp>f.mode <img id="fileinfo.files.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'rb'
<samp class="prompt">>>> </samp>f.name <img id="fileinfo.files.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'/music/_singles/kairo.mp3'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">open</code> function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename,
is required; the other two are <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional</a>. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode.
(<code>print open.__doc__</code> displays a great explanation of all the possible modes.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">open</code> function returns an object (by now, <a href="#odbchelper.objects" title="2.4. Everything Is an Object">this should not surprise you</a>). A file object has several useful attributes.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">mode</code> attribute of a file object tells you in which mode the file was opened.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">name</code> attribute of a file object tells you the name of the file that the file object has open.
</td>
</tr>
</table>
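<p>If you don't have an <abbr>MP3</abbr> file at that path, you can try the same attributes on a scratch file. The sketch below assumes a temporary file created with the standard <code class="filename">tempfile</code> module; the variable names are made up for the illustration.

```python
import os
import tempfile

# Create an empty scratch file so the example does not depend on your disk layout
fd, path = tempfile.mkstemp(suffix=".mp3")
os.close(fd)

f = open(path, "rb")     # filename plus an explicit mode: read, binary
mode_seen = f.mode       # the mode the file was opened in
name_seen = f.name       # the filename that was passed to open
f.close()
os.remove(path)

print(mode_seen)
print(name_seen == path)
```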
<h3>6.2.1. Reading Files</h3>
<p>After you open a file, the first thing you'll want to do is read from it, as shown in the next example.
<div class="example"><h3>Example 6.4. Reading a File</h3><pre class="screen">
<samp class="prompt">>>> </samp>f
&lt;open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
<samp class="prompt">>>> </samp>f.tell() <img id="fileinfo.files.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
0
<samp class="prompt">>>> </samp>f.seek(-128, 2) <img id="fileinfo.files.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f.tell() <img id="fileinfo.files.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
7542909
<samp class="prompt">>>> </samp>tagData = f.read(128) <img id="fileinfo.files.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>tagData
<samp class="computeroutput">'TAGKAIRO****THE BEST GOA ***DJ MARY-JANE***
Rave Mix 2000http://mp3.com/DJMARYJANE \037'</samp>
<samp class="prompt">>>> </samp>f.tell() <img id="fileinfo.files.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
7543037</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">A file object maintains state about the file it has open. The <code class="function">tell</code> method of a file object tells you your current position in the open file. Since you haven't done anything with this file
yet, the current position is <code class="constant">0</code>, which is the beginning of the file.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">seek</code> method of a file object moves to another position in the open file. The second parameter specifies what the first one means;
<code class="constant">0</code> means move to an absolute position (counting from the start of the file), <code class="constant">1</code> means move to a relative position (counting from the current position), and <code>2</code> means move to a position relative to the end of the file. Since the <abbr>MP3</abbr> tags you're looking for are stored at the end of the file, you use <code>2</code> and tell the file object to move to a position <code>128</code> bytes from the end of the file.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">tell</code> method confirms that the current file position has moved.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">read</code> method reads a specified number of bytes from the open file and returns a string with the data that was read. The optional
parameter specifies the maximum number of bytes to read. If no parameter is specified, <code class="function">read</code> will read until the end of the file. (You could have simply said <code>read()</code> here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data
is assigned to the <code class="varname">tagData</code> variable, and the current position is updated based on how many bytes were read.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">tell</code> method confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position
has been incremented by 128.
</td>
</tr>
</table>
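<p>The position arithmetic is easy to check on a file whose size you control. This sketch assumes a 1000-byte scratch file whose last 128 bytes start with <code>TAG</code>; the file contents and variable names are invented for the illustration, and the <code>b"..."</code> literals are ordinary strings on the Python 2 releases that accept them.

```python
import os
import tempfile

# Build a 1000-byte file: 872 filler bytes, then a 128-byte "tag" block
fd, path = tempfile.mkstemp()
os.close(fd)
out = open(path, "wb")
out.write(b"x" * 872 + b"TAG" + b"y" * 125)
out.close()

f = open(path, "rb")
start = f.tell()           # nothing read yet, so the position is 0
f.seek(-128, 2)            # 2 means "relative to the end of the file"
before = f.tell()          # 1000 - 128
tag_data = f.read(128)     # read at most 128 bytes
after = f.tell()           # the position advanced by the number of bytes read
f.close()
os.remove(path)

print(start)
print(before)
print(after)
```

<p>The difference between the last two positions is exactly the number of bytes that <code class="function">read</code> returned.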
<h3>6.2.2. Closing Files</h3>
<p>Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's
important to close files as soon as you're finished with them.
<div class="example"><h3>Example 6.5. Closing a File</h3><pre class="screen">
<samp class="prompt">>>> </samp>f
&lt;open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
<samp class="prompt">>>> </samp>f.closed <img id="fileinfo.files.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
False
<samp class="prompt">>>> </samp>f.close() <img id="fileinfo.files.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f
&lt;closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
<samp class="prompt">>>> </samp>f.closed <img id="fileinfo.files.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
True
<samp class="prompt">>>> </samp>f.seek(0) <img id="fileinfo.files.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
ValueError: I/O operation on closed file</samp>
<samp class="prompt">>>> </samp>f.tell()
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
ValueError: I/O operation on closed file</samp>
<samp class="prompt">>>> </samp>f.read()
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
ValueError: I/O operation on closed file</samp>
<samp class="prompt">>>> </samp>f.close() <img id="fileinfo.files.3.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">closed</code> attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (<code class="varname">closed</code> is <code class="constant">False</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To close a file, call the <code class="function">close</code> method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any)
that the system hadn't gotten around to actually writing yet, and releases the system resources.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">closed</code> attribute confirms that the file is closed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Just because a file is closed doesn't mean that the file object ceases to exist. The variable <code class="varname">f</code> will continue to exist until it <a href="#fileinfo.scope" title="Example 5.8. Trying to Implement a Memory Leak">goes out of scope</a> or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed;
they all raise an exception.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.3.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling <code class="function">close</code> on a file object whose file is already closed does <em>not</em> raise an exception; it fails silently.
</td>
</tr>
</table>
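<p>The behavior described above can be checked in a script as well as in the interactive shell. This is a small sketch that assumes an empty scratch file; the variable names are made up for the illustration.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "rb")
closed_before = f.closed      # the file is still open
f.close()
closed_after = f.closed       # now it is closed

try:
    f.read()                  # any I/O on a closed file raises ValueError
except ValueError:
    read_result = "ValueError"
else:
    read_result = "no error"

f.close()                     # closing a second time is harmless
os.remove(path)

print(closed_before)
print(closed_after)
print(read_result)
```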
<h3>6.2.3. Handling <acronym>I/O</acronym> Errors</h3>
<p>Now you've seen enough to understand the file handling code in the <code class="filename">fileinfo.py</code> sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle
errors.
<div class="example"><h3 id="fileinfo.files.incode">Example 6.6. File Objects in <code class="classname">MP3FileInfo</code></h3><pre class="programlisting">
try: <img id="fileinfo.files.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    fsock = open(filename, "rb", 0) <img id="fileinfo.files.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    try:
        fsock.seek(-128, 2) <img id="fileinfo.files.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        tagdata = fsock.read(128) <img id="fileinfo.files.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
    finally: <img id="fileinfo.files.4.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
        fsock.close()
    .
    .
    .
except IOError: <img id="fileinfo.files.4.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
    pass </pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a <code>try...except</code> block. (Hey, isn't <a href="#odbchelper.indenting" title="2.5. Indenting Code">standardized indentation</a> great? This is where you start to appreciate it.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">open</code> function may raise an <code class="errorcode">IOError</code>. (Maybe the file doesn't exist.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">seek</code> method may raise an <code class="errorcode">IOError</code>. (Maybe the file is smaller than 128 bytes.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">read</code> method may raise an <code class="errorcode">IOError</code>. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.4.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is new: a <code>try...finally</code> block. Once the file has been opened successfully by the <code class="function">open</code> function, you want to make absolutely sure that you close it, even if an exception is raised by the <code class="function">seek</code> or <code class="function">read</code> methods. That's what a <code>try...finally</code> block is for: code in the <code>finally</code> block will <em>always</em> be executed, even if something in the <code>try</code> block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.4.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">At last, you handle your <code class="errorcode">IOError</code> exception. This could be the <code class="errorcode">IOError</code> exception raised by the call to <code class="function">open</code>, <code class="function">seek</code>, or <code class="function">read</code>. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember, <code>pass</code> is a Python statement that <a href="#fileinfo.class.simplest" title="Example 5.3. The Simplest Python Class">does nothing</a>.) That's perfectly legal; &#8220;handling&#8221; an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the
next line of code after the <code>try...except</code> block.
</td>
</tr>
</table>
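<p>To experiment with this pattern outside of <code class="classname">MP3FileInfo</code>, you can wrap it in a standalone function. This is a sketch under stated assumptions, not the code from <code class="filename">fileinfo.py</code>: the function name <code>read_last_128_bytes</code> is invented, and instead of ignoring the error with <code>pass</code> it returns <code>None</code> so the caller can see that something went wrong.

```python
def read_last_128_bytes(filename):
    """Return the final 128 bytes of filename, or None on any I/O error."""
    try:
        fsock = open(filename, "rb", 0)
        try:
            fsock.seek(-128, 2)          # fails if the file is shorter than 128 bytes
            tagdata = fsock.read(128)
        finally:
            fsock.close()                # runs whether or not seek/read raised
    except IOError:
        return None                      # open, seek, or read failed
    return tagdata

# Assuming there is no file called /notthere, as in Example 6.1
print(read_last_128_bytes("/notthere"))
```

<p>Note that the <code>finally</code> block sits inside the outer <code>try</code>: if <code class="function">open</code> itself fails, there is no file object to close, and control goes straight to the <code>except</code> clause.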
<h3>6.2.4. Writing to Files</h3>
<p>As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes for writing:
<div class="itemizedlist">
<ul>
<li>"Append" mode will add data to the end of the file.
<li>"Write" mode will overwrite the file.
</ul>
<p>Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly
"if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open
it and start writing.
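<p>The difference between the two modes can be sketched with a scratch file. The directory, the file contents, and the helper <code>read_all</code> are assumptions made up for this illustration; each string ends with an explicit <code>"\n"</code> so that every write lands on its own line.

```python
import os
import tempfile

def read_all(path):
    """Return the whole contents of a text file, closing it afterwards."""
    f = open(path)
    try:
        return f.read()
    finally:
        f.close()

# A path inside a fresh temporary directory, so the file does not exist yet
log_path = os.path.join(tempfile.mkdtemp(), "test.log")

logfile = open(log_path, "a")      # append mode creates the missing file
logfile.write("first line\n")
logfile.close()

logfile = open(log_path, "a")      # appending keeps what is already there
logfile.write("second line\n")
logfile.close()
after_append = read_all(log_path)

logfile = open(log_path, "w")      # write mode discards the old contents
logfile.write("fresh start\n")
logfile.close()
after_write = read_all(log_path)

print(after_append)
print(after_write)
```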
<div class="example"><h3 id="fileinfo.files.writeandappend">Example 6.7. Writing to Files</h3><pre class="screen">
<samp class="prompt">>>> </samp>logfile = open('test.log', 'w') <img id="fileinfo.files.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>logfile.write('test succeeded') <img id="fileinfo.files.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>logfile.close()
<samp class="prompt">>>> </samp>print file('test.log').read() <img id="fileinfo.files.5.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
test succeeded
<samp class="prompt">>>> </samp>logfile = open('test.log', 'a') <img id="fileinfo.files.5.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>logfile.write('line 2')
<samp class="prompt">>>> </samp>logfile.close()
<samp class="prompt">>>> </samp>print file('test.log').read() <img id="fileinfo.files.5.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
test succeededline 2
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You start boldly by opening <code class="filename">test.log</code> for writing, which either creates a new file or overwrites the existing one. (The second parameter <code>"w"</code> means open the file for writing.) Yes, that's as dangerous as it sounds. I hope you didn't care about the previous
contents of that file, because they're gone now.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can add data to the newly opened file with the <code class="function">write</code> method of the file object returned by <code class="function">open</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.5.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">file</code> is a synonym for <code class="function">open</code>. This one-liner opens the file, reads its contents, and prints them.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.5.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You happen to know that <code class="filename">test.log</code> exists (since you just finished writing to it), so you can open it and append to it. (The <code>"a"</code> parameter means open the file for appending.) Actually you could do this even if the file didn't exist, because opening
the file for appending will create the file if necessary. But appending will <em>never</em> harm the existing contents of the file.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.files.5.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you can see, both the original line you wrote and the second line you appended are now in <code class="filename">test.log</code>. Also note that line breaks are not included. Since you didn't write them explicitly to the file either time, the
file doesn't include them. You can write a line break with the <code>"\n"</code> character. Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line.
</td>
</tr>
</table>
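<p>If you want each entry on its own line, you need to write the newline yourself. Here is a minimal sketch of the same write-then-append sequence; the scratch directory from <code>tempfile</code> and the name <code>log_path</code> are assumptions made for illustration, and <code>print</code> is written with parentheses so the code also runs on Python 3.

```python
import os
import tempfile

# Hypothetical log path in a scratch directory, so nothing real is overwritten.
log_path = os.path.join(tempfile.mkdtemp(), "test.log")

logfile = open(log_path, "w")       # "w" creates the file or overwrites it
logfile.write("test succeeded\n")   # the newline is written explicitly
logfile.close()

logfile = open(log_path, "a")       # "a" keeps whatever is already there
logfile.write("line 2\n")
logfile.close()

reader = open(log_path)
contents = reader.read()
reader.close()
print(contents)
```

<p>This time the two entries end up on separate lines, because each call to <code class="function">write</code> ended with <code>"\n"</code>.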
<div class="itemizedlist">
<h3>Further Reading on File Handling</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses reading and writing files, including how to <a href="http://www.python.org/doc/current/tut/node9.html#SECTION009210000000000000000">read a file one line at a time into a list</a>.
<li><a href="http://www.effbot.org/guides/">eff-bot</a> discusses efficiency and performance of <a href="http://www.effbot.org/guides/readline-performance.htm">various ways of reading a file</a>.
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/552">common questions about files</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/bltin-file-objects.html">all the file object methods</a>.
</ul>
<h2 id="fileinfo.for">6.3. Iterating with <code>for</code> Loops</h2>
<p>Like most other languages, Python has <code>for</code> loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often.
<p>Most other languages don't have a powerful list datatype like Python's, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters
or other iterable entities. But in Python, a <code>for</code> loop simply iterates over a list, the same way <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehensions</a> work.
<div class="example"><h3>Example 6.8. Introducing the <code>for</code> Loop</h3><pre class="screen"><samp class="prompt">>>> </samp>li = ['a', 'b', 'e']
<samp class="prompt">>>> </samp>for s in li: <img id="fileinfo.for.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>    print s <img id="fileinfo.for.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">a
b
e</samp>
<samp class="prompt">>>> </samp>print "\n".join(li) <img id="fileinfo.for.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">a
b
e</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The syntax for a <code>for</code> loop is similar to <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehensions</a>. <code class="varname">li</code> is a list, and <code class="varname">s</code> will take the value of each element in turn, starting from the first element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Like an <code>if</code> statement or any other <a href="#odbchelper.indenting" title="2.5. Indenting Code">indented block</a>, a <code>for</code> loop can have any number of lines of code in it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the reason you haven't seen the <code>for</code> loop yet: you haven't needed it yet. It's amazing how often you use <code>for</code> loops in other languages when all you really want is a <code class="function">join</code> or a list comprehension.
</td>
</tr>
</table>
<p>Doing a &#8220;normal&#8221; (by Visual Basic standards) counter <code>for</code> loop is also simple.
<div class="example"><h3 id="fileinfo.for.counter">Example 6.9. Simple Counters</h3><pre class="screen">
<samp class="prompt">>>> </samp>for i in range(5): <img id="fileinfo.for.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>    print i
<samp class="computeroutput">0
1
2
3
4</samp>
<samp class="prompt">>>> </samp>li = ['a', 'b', 'c', 'd', 'e']
<samp class="prompt">>>> </samp>for i in range(len(li)): <img id="fileinfo.for.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">... </samp>    print li[i]
<samp class="computeroutput">a
b
c
d
e</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#odbchelper.multiassign.range" title="Example 3.20. Assigning Consecutive Values">Example 3.20, &#8220;Assigning Consecutive Values&#8221;</a>, <code class="function">range</code> produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress
<em>occasionally</em>) useful to have a counter loop.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Don't ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example.
</td>
</tr>
</table>
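<p>If you really do need the position of each element as well as the element itself, the built-in <code class="function">enumerate</code> function hands you both without any index arithmetic. This is a small sketch rather than part of the sample program; the list name <code class="varname">numbered</code> is made up for the example.

```python
li = ['a', 'b', 'c', 'd', 'e']

numbered = []
for i, s in enumerate(li):          # i is the counter, s is the element
    numbered.append("%d: %s" % (i, s))

print("\n".join(numbered))
```

<p>You still iterate through the list directly; the counter just comes along for the ride.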
<p><code>for</code> loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a <code>for</code> loop to iterate through a dictionary.
<div class="example"><h3 id="dictionaryiter.example">Example 6.10. Iterating Through a Dictionary</h3><pre class="screen">
<samp class="prompt">>>> </samp>import os
<samp class="prompt">>>> </samp>for k, v in os.environ.items(): <img id="fileinfo.for.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"> <img id="fileinfo.for.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">... </samp>    print "%s=%s" % (k, v)
<samp class="computeroutput">USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim
[...snip...]</samp>
<samp class="prompt">>>> </samp>print "\n".join(["%s=%s" % (k, v)
<samp class="prompt">... </samp>    for k, v in os.environ.items()]) <img id="fileinfo.for.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim
[...snip...]</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">os.environ</code> is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables
accessible from <acronym>MS-DOS</acronym>. In <acronym>UNIX</acronym>, they are the variables exported in your shell's startup scripts. In Mac OS, there is no concept of environment variables, so this dictionary is empty.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>os.environ.items()</code> returns a list of tuples: <code>[(<i class="replaceable">key1</i>, <i class="replaceable">value1</i>), (<i class="replaceable">key2</i>, <i class="replaceable">value2</i>), ...]</code>. The <code>for</code> loop iterates through this list. The first round, it assigns <code><i class="replaceable">key1</i></code> to <code class="varname">k</code> and <code><i class="replaceable">value1</i></code> to <code class="varname">v</code>, so <code class="varname">k</code> = <code>USERPROFILE</code> and <code class="varname">v</code> = <code>C:\Documents and Settings\mpilgrim</code>. In the second round, <code class="varname">k</code> gets the second key, <code>OS</code>, and <code class="varname">v</code> gets the corresponding value, <code>Windows_NT</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.for.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">With <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a> and <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehensions</a>, you can replace the entire <code>for</code> loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it
because it makes it clear that what I'm doing is mapping a dictionary into a list, then joining the list into a single string.
Other programmers prefer to write this out as a <code>for</code> loop. The output is the same in either case, although this version is slightly faster, because there is only one <code class="function">print</code> statement instead of many.
</td>
</tr>
</table>
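<p>Since the contents of <code class="varname">os.environ</code> differ from machine to machine, here is the same pair of techniques on a small stand-in dictionary. The dictionary <code class="varname">settings</code> and its values are invented for this sketch, and the items are sorted only so that the output order is predictable.

```python
# A small stand-in dictionary; the keys and values here are made up.
settings = {"server": "mpilgrim", "database": "master", "uid": "sa"}

# Written out as a for loop: each item is a (key, value) tuple.
lines = []
for k, v in sorted(settings.items()):
    lines.append("%s=%s" % (k, v))

# The same mapping as a single list comprehension.
lines_again = ["%s=%s" % (k, v) for k, v in sorted(settings.items())]

print("\n".join(lines))
```

<p>Both versions build the same list of strings; which one you write is the matter of style described above.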
<p>Now we can look at the <code>for</code> loop in <code class="classname">MP3FileInfo</code>, from the sample <code class="filename">fileinfo.py</code> program introduced in <a href="#fileinfo">Chapter 5</a>.
<div class="example"><h3 id="fileinfo.multiassign.for.example">Example 6.11. <code>for</code> Loop in <code class="classname">MP3FileInfo</code></h3><pre class="programlisting">
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)} <img id="fileinfo.multiassign.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    .
    .
    .
        if tagdata[:3] == "TAG":
            for tag, (start, end, parseFunc) in self.tagDataMap.items(): <img id="fileinfo.multiassign.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
                self[tag] = parseFunc(tagdata[start:end]) <img id="fileinfo.multiassign.5.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.multiassign.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">tagDataMap</code> is a <a href="#fileinfo.classattributes" title="5.8. Introducing Class Attributes">class attribute</a> that defines the tags you're looking for in an <abbr>MP3</abbr> file. Tags are stored in fixed-length fields. Once you read the last 128 bytes of the file, bytes 3 through 32 of those
are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth. Note
that <code class="varname">tagDataMap</code> is a dictionary of tuples, and each tuple contains two integers and a function reference.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.multiassign.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This looks complicated, but it's not. The structure of the <code>for</code> variables matches the structure of the elements of the list returned by <code class="function">items</code>. Remember that <code class="function">items</code> returns a list of tuples of the form <code>(<i class="replaceable">key</i>, <i class="replaceable">value</i>)</code>. The first element of that list is <code>("title", (3, 33, &lt;function stripnulls>))</code>, so the first time around the loop, <code class="varname">tag</code> gets <code>"title"</code>, <code class="varname">start</code> gets <code>3</code>, <code class="varname">end</code> gets <code>33</code>, and <code class="varname">parseFunc</code> gets the function <code class="function">stripnulls</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.multiassign.5.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now that you've extracted all the parameters for a single <abbr>MP3</abbr> tag, saving the tag data is easy. You <a href="#odbchelper.list.slice" title="Example 3.8. Slicing a List">slice</a> <code class="varname">tagdata</code> from <code class="varname">start</code> to <code class="varname">end</code> to get the actual data for this tag, call <code class="varname">parseFunc</code> to post-process the data, and assign this as the value for the key <code class="varname">tag</code> in the pseudo-dictionary <code class="varname">self</code>. After iterating through all the elements in <code class="varname">tagDataMap</code>, <code class="varname">self</code> has the values for all the tags, and <a href="#fileinfo.specialmethods.setname" title="Example 5.15. Setting an MP3FileInfo's name">you know what that looks like</a>.
</td>
</tr>
</table>
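<p>To watch the loop work without an actual <abbr>MP3</abbr> file, you can feed it a hand-built tag block. The sketch below is an approximation, not the code from <code class="filename">fileinfo.py</code>: the 128-character string <code class="varname">tagdata</code> and the song details in it are made up, the result goes into a plain dictionary named <code class="varname">info</code> instead of <code class="varname">self</code>, and <code class="function">stripnulls</code> is written here on the assumption that it removes nulls and surrounding whitespace.

```python
def stripnulls(data):
    "strip whitespace and nulls (assumed behavior for this sketch)"
    return data.replace("\0", "").strip()

tagDataMap = {"title":   (  3,  33, stripnulls),
              "artist":  ( 33,  63, stripnulls),
              "album":   ( 63,  93, stripnulls),
              "year":    ( 93,  97, stripnulls),
              "comment": ( 97, 126, stripnulls),
              "genre":   (127, 128, ord)}

# A made-up tag block: fixed-length fields padded with nulls, 128 characters in all.
tagdata = ("TAG"
           + "Kairo".ljust(30, "\0")
           + "The Cynic Project".ljust(30, "\0")
           + "Rave Mix".ljust(30, "\0")
           + "2000"
           + "".ljust(30, "\0")
           + chr(31))

info = {}
if tagdata[:3] == "TAG":
    # Each item unpacks into a key plus the three elements of its tuple.
    for tag, (start, end, parseFunc) in tagDataMap.items():
        info[tag] = parseFunc(tagdata[start:end])

print(info["title"])
```

<p>Adding support for another fixed-length field would mean adding one entry to <code class="varname">tagDataMap</code>; the loop itself would not change.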
<h2 id="fileinfo.modules">6.4. Using <code><code class="filename">sys</code>.modules</code></h2>
<p>Modules, like everything else in Python, are objects. Once a module has been imported, you can always get a reference to it through the global dictionary <code><code class="filename">sys</code>.modules</code>.
<div class="example"><h3>Example 6.12. Introducing <code><code class="filename">sys</code>.modules</code></h3><pre class="screen"><samp class="prompt">>>> </samp>import sys <img id="fileinfo.modules.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print '\n'.join(sys.modules.keys()) <img id="fileinfo.modules.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">win32api
os.path
os
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="filename">sys</code> module contains system-level information, such as the version of Python you're running (<code><code class="filename">sys</code>.version</code> or <code><code class="filename">sys</code>.version_info</code>), and system-level options such as the maximum allowed recursion depth (<code><code class="filename">sys</code>.getrecursionlimit()</code> and <code><code class="filename">sys</code>.setrecursionlimit()</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code><code class="filename">sys</code>.modules</code> is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules <em>your</em> program has imported. Python preloads some modules on startup, and if you're using a Python <acronym>IDE</acronym>, <code><code class="filename">sys</code>.modules</code> contains all the modules imported by all the programs you've run within the <acronym>IDE</acronym>.
</td>
</tr>
</table>
<p>This example demonstrates how to use <code><code class="filename">sys</code>.modules</code>.
<div class="example"><h3>Example 6.13. Using <code><code class="filename">sys</code>.modules</code></h3><pre class="screen"><samp class="prompt">>>> </samp>import fileinfo <img id="fileinfo.modules.1.3" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print '\n'.join(sys.modules.keys())
<samp class="computeroutput">win32api
os.path
os
fileinfo
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat</samp>
<samp class="prompt">>>> </samp>fileinfo
&lt;module 'fileinfo' from 'fileinfo.pyc'>
<samp class="prompt">>>> </samp>sys.modules["fileinfo"] <img id="fileinfo.modules.1.4" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;module 'fileinfo' from 'fileinfo.pyc'></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.1.3"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As new modules are imported, they are added to <code><code class="filename">sys</code>.modules</code>. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in <code><code class="filename">sys</code>.modules</code>, so importing the second time is simply a dictionary lookup.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.1.4"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the <code><code class="filename">sys</code>.modules</code> dictionary.
</td>
</tr>
</table>
<p>The next example shows how to use the <code>__module__</code> class attribute with the <code><code class="filename">sys</code>.modules</code> dictionary to get a reference to the module in which a class is defined.
<div class="example"><h3>Example 6.14. The <code>__module__</code> Class Attribute</h3><pre class="screen"><samp class="prompt">>>> </samp>from fileinfo import MP3FileInfo
<samp class="prompt">>>> </samp>MP3FileInfo.__module__ <img id="fileinfo.modules.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'fileinfo'
<samp class="prompt">>>> </samp>sys.modules[MP3FileInfo.__module__] <img id="fileinfo.modules.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;module 'fileinfo' from 'fileinfo.pyc'></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Every Python class has a built-in <a href="#fileinfo.classattributes" title="5.8. Introducing Class Attributes">class attribute</a> <code>__module__</code>, which is the name of the module in which the class is defined.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Combining this with the <code><code class="filename">sys</code>.modules</code> dictionary, you can get a reference to the module in which a class is defined.
</td>
</tr>
</table>
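<p>You can try the same lookup with a class from the standard library if you don't have <code class="filename">fileinfo.py</code> handy. This sketch uses <code class="classname">Fraction</code> from the <code class="filename">fractions</code> module purely as a convenient example; any class from an imported module should behave the same way.

```python
import sys
from fractions import Fraction   # any class from an imported module will do

module_name = Fraction.__module__    # the module's name, as a string
module = sys.modules[module_name]    # the module object itself

print(module_name)
print(module.Fraction is Fraction)
```

<p>The class found through the module object is the very same object you imported, not a copy.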
<p>Now you're ready to see how <code><code class="filename">sys</code>.modules</code> is used in <code class="filename">fileinfo.py</code>, the sample program introduced in <a href="#fileinfo">Chapter 5</a>. This example shows that portion of the code.
<div class="example"><h3>Example 6.15. <code><code class="filename">sys</code>.modules</code> in <code class="filename">fileinfo.py</code></h3><pre class="programlisting">
def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]): <img id="fileinfo.modules.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:] <img id="fileinfo.modules.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    return hasattr(module, subclass) and getattr(module, subclass) or FileInfo <img id="fileinfo.modules.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a function with two arguments; <code class="varname">filename</code> is required, but <code class="varname">module</code> is <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional</a> and defaults to the module that contains the <code class="classname">FileInfo</code> class. This looks inefficient, because you might expect Python to evaluate the <code><code class="filename">sys</code>.modules</code> expression every time the function is called. In fact, Python evaluates default expressions only once, when the <code>def</code> statement itself runs, which happens the first time the module is imported. As you'll see later, you never call this
function with a <code class="varname">module</code> argument, so <code class="varname">module</code> serves as a function-level constant.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You'll plow through this line later, after you dive into the <code class="filename">os</code> module. For now, take it on faith that <code class="varname">subclass</code> ends up as the name of a class, like <code class="classname">MP3FileInfo</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.modules.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You already know about <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code class="function">getattr</code></a>, which gets a reference to an object by name. <code class="function">hasattr</code> is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has
a particular class (although it works for any object and any attribute, just like <code class="function">getattr</code>). In English, this line of code says, &#8220;If this module has the class named by <code class="varname">subclass</code> then return it, otherwise return the base class <code class="classname">FileInfo</code>.&#8221;
</td>
</tr>
</table>
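<p>Here is a self-contained sketch of the same dispatch-by-name idea. It is not the code from <code class="filename">fileinfo.py</code>: the two classes are empty placeholders, the module object <code class="varname">handlers</code> is built by hand with <code>types.ModuleType</code> so the example doesn't depend on where it is run, and the lookup uses the three-argument form of <code class="function">getattr</code>, whose last argument is returned when the attribute is missing.

```python
import os
import types

class FileInfo(dict):
    "placeholder base class, used when no specialized handler exists"

class MP3FileInfo(FileInfo):
    "placeholder handler for .mp3 files"

# A stand-in module object built by hand so the sketch is self-contained;
# in fileinfo.py the real module comes from sys.modules instead.
handlers = types.ModuleType("handlers")
handlers.FileInfo = FileInfo
handlers.MP3FileInfo = MP3FileInfo

def getFileInfoClass(filename, module=handlers):
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    # The third argument to getattr is the fallback value.
    return getattr(module, subclass, FileInfo)

print(getFileInfoClass("kairo.mp3").__name__)
print(getFileInfoClass("notes.txt").__name__)
```

<p>A file with an extension that has no matching class simply falls back to the base class, which is the same behavior the original line spells out with <code class="function">hasattr</code>.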
<div class="itemizedlist">
<h3>Further Reading on Modules</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class="citetitle">Python Tutorial</i></a> discusses exactly <a href="http://www.python.org/doc/current/tut/node6.html#SECTION006710000000000000000">when and how default arguments are evaluated</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-sys.html"><code class="filename">sys</code></a> module.
</ul>
<h2 id="fileinfo.os">6.5. Working with Directories</h2>
<p>The <code class="filename">os.path</code> module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing
the contents of a directory.
<div class="example"><h3 id="fileinfo.os.path.join.example">Example 6.16. Constructing Pathnames</h3><pre class="screen">
<samp class="prompt">>>> </samp>import os
<samp class="prompt">>>> </samp>os.path.join("c:\\music\\ap\\", "mahadeva.mp3") <img id="fileinfo.os.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"> <img id="fileinfo.os.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'c:\\music\\ap\\mahadeva.mp3'
<samp class="prompt">>>> </samp>os.path.join("c:\\music\\ap", "mahadeva.mp3") <img id="fileinfo.os.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'c:\\music\\ap\\mahadeva.mp3'
<samp class="prompt">>>> </samp>os.path.expanduser("~") <img id="fileinfo.os.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'c:\\Documents and Settings\\mpilgrim\\My Documents'
<samp class="prompt">>>> </samp>os.path.join(os.path.expanduser("~"), "Python") <img id="fileinfo.os.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">os.path</code> is a reference to a module -- which module depends on your platform. Just as <a href="#crossplatform.example" title="Example 6.2. Supporting Platform-Specific Functionality"><code class="filename">getpass</code></a> encapsulates differences between platforms by setting <code class="varname">getpass</code> to a platform-specific function, <code class="filename">os</code> encapsulates differences between platforms by setting <code class="varname">path</code> to a platform-specific module.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">join</code> function of <code class="filename">os.path</code> constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing
with pathnames on Windows is annoying because the backslash character must be escaped.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In this slightly less trivial case, <code class="function">join</code> will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since
<code class="function">addSlashIfNecessary</code> is one of the stupid little functions I always need to write when building up my toolbox in a new language. <em>Do not</em> write this stupid little function in Python; smart people have already taken care of it for you.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">expanduser</code> will expand a pathname that uses <code>~</code> to represent the current user's home directory. This works on any platform where users have a home directory, like Windows,
<acronym>UNIX</acronym>, and Mac OS X; it has no effect on Mac OS.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory.</td>
</tr>
</table>
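<p>If you're not on Windows, you can still reproduce these results, because the platform-specific modules behind <code class="filename">os.path</code> can be imported by name. The sketch below assumes you want to compare the Windows flavor (<code class="filename">ntpath</code>) with the <acronym>UNIX</acronym> flavor (<code class="filename">posixpath</code>); the <acronym>UNIX</acronym>-style directory is made up for the comparison.

```python
import ntpath      # the Windows flavor of os.path, importable on any platform
import posixpath   # the UNIX flavor

with_slash = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
without_slash = ntpath.join("c:\\music\\ap", "mahadeva.mp3")
unix_style = posixpath.join("/music/ap", "mahadeva.mp3")

print(with_slash)
print(without_slash)
print(unix_style)
```

<p>In real code you would normally stick with <code class="filename">os.path</code> and let Python pick the right module for the platform it is running on.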
<div class="example"><h3 id="splittingpathnames.example">Example 6.17. Splitting Pathnames</h3><pre class="screen"><samp class="prompt">>>> </samp>os.path.split("c:\\music\\ap\\mahadeva.mp3") <img id="fileinfo.os.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
('c:\\music\\ap', 'mahadeva.mp3')
<samp class="prompt">>>> </samp>(filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") <img id="fileinfo.os.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>filepath <img id="fileinfo.os.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'c:\\music\\ap'
<samp class="prompt">>>> </samp>filename <img id="fileinfo.os.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'mahadeva.mp3'
<samp class="prompt">>>> </samp>(shortname, extension) = os.path.splitext(filename) <img id="fileinfo.os.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>shortname
'mahadeva'
<samp class="prompt">>>> </samp>extension
'.mp3'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">split</code> function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use
<a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a> to return multiple values from a function? Well, <code class="function">split</code> is such a function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You assign the return value of the <code class="function">split</code> function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first variable, <code class="varname">filepath</code>, receives the value of the first element of the tuple returned from <code class="function">split</code>, the file path.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The second variable, <code class="varname">filename</code>, receives the value of the second element of the tuple returned from <code class="function">split</code>, the filename.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">os.path</code> also contains a function <code class="function">splitext</code>, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique
to assign each of them to separate variables.
</td>
</tr>
</table>
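<p>The two splitting functions can be tried side by side in a short script. This is a minimal sketch, not part of <code class="filename">fileinfo.py</code>: it assumes a current Python interpreter (so <code>print</code> is called as a function), and it imports <code class="filename">ntpath</code>, the Windows implementation of <code class="filename">os.path</code>, directly so that the backslash-separated path is split the same way on any platform.

```python
import ntpath  # the Windows flavor of os.path, importable on every platform

full_path = "c:\\music\\ap\\mahadeva.mp3"

# split returns a 2-tuple, unpacked here into two variables
filepath, filename = ntpath.split(full_path)

# splitext does the same for the base name and the extension
shortname, extension = ntpath.splitext(filename)

print(filepath)    # c:\music\ap
print(filename)    # mahadeva.mp3
print(shortname)   # mahadeva
print(extension)   # .mp3
```

<p>In everyday code you would call <code>os.path.split</code> and <code>os.path.splitext</code> and let Python pick the right implementation for the platform it is running on.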
<div class="example"><h3 id="fileinfo.listdir.example">Example 6.18. Listing Directories</h3><pre class="screen"><samp class="prompt">>>> </samp>os.listdir("c:\\music\\_singles\\") <img id="fileinfo.os.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']</samp>
<samp class="prompt">>>> </samp>dirname = "c:\\"
<samp class="prompt">>>> </samp>os.listdir(dirname) <img id="fileinfo.os.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']</samp>
<samp class="prompt">>>> </samp>[f for f in os.listdir(dirname)
<samp class="prompt">... </samp>if os.path.isfile(os.path.join(dirname, f))] <img id="fileinfo.os.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
'NTDETECT.COM', 'ntldr', 'pagefile.sys']</samp>
<samp class="prompt">>>> </samp>[f for f in os.listdir(dirname)
<samp class="prompt">... </samp>if os.path.isdir(os.path.join(dirname, f))] <img id="fileinfo.os.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">listdir</code> function takes a pathname and returns a list of the contents of the directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">listdir</code> returns both files and folders, with no indication of which is which.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can use <a href="#apihelper.filter" title="4.5. Filtering Lists">list filtering</a> and the <code class="function">isfile</code> function of the <code class="filename">os.path</code> module to separate the files from the folders. <code class="function">isfile</code> takes a pathname and returns 1 if the path represents a file, and 0 otherwise. Here you're using <code><code class="filename">os.path</code>.<code class="function">join</code></code> to ensure a full pathname, but <code class="function">isfile</code> also works with a partial path, relative to the current working directory. You can use <code>os.getcwd()</code> to get the current working directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">os.path</code> also has an <code class="function">isdir</code> function which returns 1 if the path represents a directory, and 0 otherwise. You can use this to get a list of the subdirectories
within a directory.
</td>
</tr>
</table>
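<p>The same filtering can be wrapped in a small function and tried against a throwaway directory. This is a sketch under stated assumptions: the function name <code class="function">split_entries</code> and the scratch directory built with <code class="filename">tempfile</code> are illustrative and do not appear in <code class="filename">fileinfo.py</code>, and the code is written for a current Python interpreter.

```python
import os
import shutil
import tempfile

def split_entries(dirname):
    "return (files, folders) found directly inside dirname, each sorted"
    names = os.listdir(dirname)
    # listdir returns bare names, so join each one to dirname before testing it
    files = sorted(f for f in names
                   if os.path.isfile(os.path.join(dirname, f)))
    folders = sorted(f for f in names
                     if os.path.isdir(os.path.join(dirname, f)))
    return files, folders

# build a scratch directory holding two files and one folder
scratch = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(scratch, "Music"))
    for name in ("boot.ini", "ntldr"):
        open(os.path.join(scratch, name), "w").close()
    files, folders = split_entries(scratch)
finally:
    shutil.rmtree(scratch)  # clean up even if something above failed

print(files)    # ['boot.ini', 'ntldr']
print(folders)  # ['Music']
```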
<div class="example"><h3>Example 6.19. Listing Directories in <code class="filename">fileinfo.py</code></h3><pre class="programlisting">
def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)] <img id="fileinfo.os.3a.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"> <img id="fileinfo.os.3a.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList] <img id="fileinfo.os.3a.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"> <img id="fileinfo.os.3a.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"> <img id="fileinfo.os.3a.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3a.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>os.listdir(directory)</code> returns a list of all the files and folders in <code class="varname">directory</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3a.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Iterating through the list with <code class="varname">f</code>, you use <code>os.path.normcase(f)</code> to normalize the case according to operating system defaults. <code class="function">normcase</code> is a useful little function that compensates for case-insensitive operating systems that think that <code class="filename">mahadeva.mp3</code> and <code class="filename">mahadeva.MP3</code> are the same file. For instance, on Windows and Mac OS, <code class="function">normcase</code> will convert the entire filename to lowercase; on <acronym>UNIX</acronym>-compatible systems, it will return the filename unchanged.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3a.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Iterating through the normalized list with <code class="varname">f</code> again, you use <code>os.path.splitext(f)</code> to split each filename into name and extension.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3a.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">For each file, you see if the extension is in the list of file extensions you care about (<code class="varname">fileExtList</code>, which was passed to the <code class="function">listDirectory</code> function).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.3a.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">For each file you care about, you use <code>os.path.join(directory, f)</code> to construct the full pathname of the file, and return a list of the full pathnames.
</td>
</tr>
</table>
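<p>The normalize, filter, and join steps can be exercised without touching the disk by passing in the list of names yourself. This is a minimal sketch for a current Python interpreter; the function name <code class="function">interesting_paths</code> and the sample names are assumptions made for illustration, not code from <code class="filename">fileinfo.py</code>.

```python
import os

def interesting_paths(directory, names, extensions):
    "return full paths for the names whose extension is in extensions"
    # normalize case the way the local operating system would
    normalized = [os.path.normcase(name) for name in names]
    # split off each extension, keep the matches, and join them to directory
    return [os.path.join(directory, name)
            for name in normalized
            if os.path.splitext(name)[1] in extensions]

names = ["kairo.mp3", "notes.txt", "spinning.mp3", "covers"]
paths = interesting_paths("music", names, [".mp3"])
print(paths)
```

<p>The separator in the printed paths depends on the platform, which is exactly why the paths are built with <code>os.path.join</code> rather than by concatenating strings.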
</div><table id="tip.os" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Whenever possible, you should use the functions in <code class="filename">os</code> and <code class="filename">os.path</code> for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like
<code class="function">os.path.split</code> work on <acronym>UNIX</acronym>, Windows, Mac OS, and any other platform supported by Python.
</td>
</tr>
</table>
<p>There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you
may already be familiar with from working on the command line.
<div class="example"><h3 id="fileinfo.os.glob.example">Example 6.20. Listing Directories with <code class="filename">glob</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>os.listdir("c:\\music\\_singles\\") <img id="fileinfo.os.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']</samp>
<samp class="prompt">>>> </samp>import glob
<samp class="prompt">>>> </samp>glob.glob('c:\\music\\_singles\\*.mp3') <img id="fileinfo.os.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
'c:\\music\\_singles\\hellraiser.mp3',
'c:\\music\\_singles\\kairo.mp3',
'c:\\music\\_singles\\long_way_home1.mp3',
'c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']</samp>
<samp class="prompt">>>> </samp>glob.glob('c:\\music\\_singles\\s*.mp3') <img id="fileinfo.os.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">['c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']</samp>
<samp class="prompt">>>> </samp>glob.glob('c:\\music\\*\\*.mp3')<img id="fileinfo.os.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw earlier, <code class="function">os.listdir</code> simply takes a directory path and lists all files and directories in that directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="filename">glob</code> module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard.
Here the wildcard is a directory path plus "*.mp3", which will match all <code class="filename">.mp3</code> files. Note that each element of the returned list already includes the full path of the file.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If you want to find all the files in a specific directory that start with "s" and end with ".mp3", you can do that too.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.os.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now consider this scenario: you have a <code class="filename">music</code> directory, with several subdirectories within it, with <code class="filename">.mp3</code> files within each subdirectory. You can get a list of all of those with a single call to <code class="filename">glob</code>, by using two wildcards at once. One wildcard is the <code>"*.mp3"</code> (to match <code class="filename">.mp3</code> files), and one wildcard is <em>within the directory path itself</em>, to match any subdirectory within <code class="filename">c:\music</code>. That's a crazy amount of power packed into one deceptively simple-looking function!
</td>
</tr>
</table>
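<p>To see the two-wildcard pattern return something, you can build a small directory tree and glob over it. This is a sketch for a current Python interpreter; the album and track names and the scratch directory created with <code class="filename">tempfile</code> are assumptions made for illustration.

```python
import glob
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
try:
    # two subdirectories, two .mp3 files, and one file that should not match
    for album, track in [("ap", "mahadeva.mp3"),
                         ("_singles", "kairo.mp3"),
                         ("_singles", "cover.jpg")]:
        album_dir = os.path.join(root, album)
        if not os.path.isdir(album_dir):
            os.mkdir(album_dir)
        open(os.path.join(album_dir, track), "w").close()

    # one wildcard for the subdirectory, one for the filename
    pattern = os.path.join(root, "*", "*.mp3")
    matches = sorted(glob.glob(pattern))
    # strip the scratch prefix so the result is easy to read
    relative = [os.path.relpath(m, root) for m in matches]
finally:
    shutil.rmtree(root)

print(relative)
```

<p>The results are sorted here because <code>glob.glob</code> makes no promise about the order in which it returns matches.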
<div class="itemizedlist">
<h3>Further Reading on the <code class="filename">os</code> Module</h3>
<ul>
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/240">questions about the <code class="filename">os</code> module</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-os.html"><code class="filename">os</code></a> module and the <a href="http://www.python.org/doc/current/lib/module-os.path.html"><code class="filename">os.path</code></a> module.
</ul>
<h2 id="fileinfo.alltogether">6.6. Putting It All Together</h2>
<p>Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all
fits together.
<div class="example"><h3 id="fileinfo.nested">Example 6.21. <code class="function">listDirectory</code></h3><pre class="programlisting">
def listDirectory(directory, fileExtList): <img id="fileinfo.alltogether.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList] <img id="fileinfo.alltogether.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]): <img id="fileinfo.alltogether.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        "get file info class from filename extension"
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:] <img id="fileinfo.alltogether.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo <img id="fileinfo.alltogether.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
    return [getFileInfoClass(f)(f) for f in fileList] <img id="fileinfo.alltogether.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.alltogether.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">listDirectory</code> is the main attraction of this entire module. It takes a directory (like <code class="filename">c:\music\_singles\</code> in my case) and a list of interesting file extensions (like <code>['.mp3']</code>), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in
that directory. And it does it in just a few straightforward lines of code.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.alltogether.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in the <a href="#fileinfo.os" title="6.5. Working with Directories">previous section</a>, this line of code gets a list of the full pathnames of all the files in <code class="varname">directory</code> that have an interesting file extension (as specified by <code class="varname">fileExtList</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.alltogether.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports <em>nested functions</em> -- literally, a function within a function. The nested function <code class="function">getFileInfoClass</code> can be called only from the function in which it is defined, <code class="function">listDirectory</code>. As with any other function, you don't need an interface declaration or anything fancy; just define the function and code
it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.alltogether.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now that you've seen the <a href="#fileinfo.os" title="6.5. Working with Directories"><code class="filename">os</code></a> module, this line should make more sense. It gets the extension of the file (<code>os.path.splitext(filename)[1]</code>), forces it to uppercase (<code>.upper()</code>), slices off the dot (<code>[1:]</code>), and constructs a class name out of it with string formatting. So <code class="filename">c:\music\ap\mahadeva.mp3</code> becomes <code>.mp3</code> becomes <code>.MP3</code> becomes <code>MP3</code> becomes <code>MP3FileInfo</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.alltogether.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually
exists in this module. If it does, you return the class; otherwise, you return the base class <code class="classname">FileInfo</code>. This is a very important point: <em>this function returns a class</em>. Not an instance of a class, but the class itself.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#fileinfo.alltogether.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">For each file in the &#8220;interesting files&#8221; list (<code class="varname">fileList</code>), you call <code class="function">getFileInfoClass</code> with the filename (<code class="varname">f</code>). Calling <code>getFileInfoClass(f)</code> returns a class; you don't know exactly which class, but you don't care. You then create an instance of this class (whatever
it is) and pass the filename (<code class="varname">f</code> again) to the <code class="function">__init__</code> method. As you saw <a href="#fileinfo.specialmethods.setname" title="Example 5.15. Setting an MP3FileInfo's name">in the previous chapter</a>, the <code class="function">__init__</code> method of <code class="classname">FileInfo</code> sets <code>self["name"]</code>, which triggers <code class="function">__setitem__</code>, which is overridden in the descendant (<code class="classname">MP3FileInfo</code>) to parse the file appropriately to pull out the file's metadata. You do all that for each interesting file and return a
list of the resulting instances.
</td>
</tr>
</table>
<p>Note that <code class="function">listDirectory</code> is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined
that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own
module to see what special handler classes (like <code class="classname">MP3FileInfo</code>) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class:
<code class="classname">HTMLFileInfo</code> for <acronym>HTML</acronym> files, <code class="classname">DOCFileInfo</code> for Word <code>.doc</code> files, and so forth. <code class="function">listDirectory</code> will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results.
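<p>The lookup at the heart of this design can be tried on its own. What follows is a minimal sketch for a current Python interpreter, not the code from <code class="filename">fileinfo.py</code>: the classes are empty stand-ins, the name <code class="function">get_file_info_class</code> is illustrative, and a <code>types.SimpleNamespace</code> object plays the part of the module so the example stays self-contained. It also uses the three-argument form of <code class="function">getattr</code>, which supplies a fallback directly, in place of the <code>and</code>-<code>or</code> idiom.

```python
import os
import types

class FileInfo(dict):
    "fallback handler: remembers only the file name"
    def __init__(self, filename=None):
        dict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "stand-in for a handler that would parse MP3 tags"

def get_file_info_class(filename, namespace):
    "return the handler class named after the file's extension"
    extension = os.path.splitext(filename)[1]           # '.mp3'
    class_name = "%sFileInfo" % extension.upper()[1:]   # 'MP3FileInfo'
    # the third argument to getattr is returned when the name is missing
    return getattr(namespace, class_name, FileInfo)

# hypothetical namespace standing in for the module object
handlers = types.SimpleNamespace(MP3FileInfo=MP3FileInfo)

for name in ["mahadeva.mp3", "readme.txt"]:
    cls = get_file_info_class(name, handlers)  # a class, not an instance
    info = cls(name)                           # calling the class makes an instance
    print(cls.__name__, info["name"])
```

<p>Adding a hypothetical <code class="classname">TXTFileInfo</code> class to the namespace would change the result for <code class="filename">readme.txt</code> without any change to <code class="function">get_file_info_class</code>.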
<h2 id="fileinfo.summary2">6.7. Summary</h2>
<p>The <code class="filename">fileinfo.py</code> program introduced in <a href="#fileinfo">Chapter 5</a> should now make perfect sense.
<div class="informalexample"><pre class="programlisting">
"""Framework for getting filetype-specific metadata.

Instantiate appropriate class with filename. Returned object acts like a
dictionary, with key-value pairs for each piece of metadata.
    import fileinfo
    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])

Or use listDirectory function to get info on all files in a directory.
    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
        ...

Framework can be extended by adding classes for particular file types, e.g.
HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for
parsing its files appropriately; see MP3FileInfo for example.
"""
import os
import sys
from UserDict import UserDict

def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}

    def __parse(self, filename):
        "parse ID3v1.0 tags from MP3 file"
        self.clear()
        try:
            fsock = open(filename, "rb", 0)
            try:
                fsock.seek(-128, 2)
                tagdata = fsock.read(128)
            finally:
                fsock.close()
            if tagdata[:3] == "TAG":
                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                    self[tag] = parseFunc(tagdata[start:end])
        except IOError:
            pass

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)
        FileInfo.__setitem__(self, key, item)

def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList]
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):
        "get file info class from filename extension"
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
    return [getFileInfoClass(f)(f) for f in fileList]

if __name__ == "__main__":
    for info in listDirectory("/music/_singles/", [".mp3"]):
        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
        print</pre><div class="highlights">
<p>Before diving into the next chapter, make sure you're comfortable doing the following things:
<div class="itemizedlist">
<ul>
<li>Catching exceptions with <a href="#fileinfo.exception" title="6.1. Handling Exceptions"><code>try...except</code></a>
<li>Protecting external resources with <a href="#fileinfo.files.incode" title="Example 6.6. File Objects in MP3FileInfo"><code>try...finally</code></a>
<li>Reading from <a href="#fileinfo.files" title="6.2. Working with File Objects">files</a>
<li>Assigning multiple values at once in a <a href="#fileinfo.multiassign.for.example" title="Example 6.11. for Loop in MP3FileInfo"><code>for</code> loop</a>
<li>Using the <a href="#fileinfo.os" title="6.5. Working with Directories"><code class="filename">os</code></a> module for all your cross-platform file manipulation needs
<li>Dynamically <a href="#fileinfo.alltogether" title="6.6. Putting It All Together">instantiating classes of unknown type</a> by treating classes as objects and passing them around
</ul>
<div class="chapter">
<h2 id="re">Chapter 7. Regular Expressions</h2>
<p>Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of
characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the <a href="http://www.python.org/doc/current/lib/module-re.html"><code class="filename">re</code> module</a> to get an overview of the available functions and their arguments.
<h2 id="re.intro">7.1. Diving In</h2>
<p>Strings have methods for searching (<code class="function">index</code>, <code class="function">find</code>, and <code class="function">count</code>), replacing (<code class="function">replace</code>), and parsing (<code class="function">split</code>), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are
always case-sensitive. To do case-insensitive searches of a string <code class="varname">s</code>, you must call <code class="function">s.lower()</code> or <code class="function">s.upper()</code> and make sure your search strings are the appropriate case to match. The <code class="function">replace</code> and <code class="function">split</code> methods have the same limitations.
<p>If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy
to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different
string functions with <code>if</code> statements to handle special cases, or if you're combining them with <code class="function">split</code> and <code class="function">join</code> and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
<p>Although the regular expression syntax is tight and unlike normal code, the result can end up being <em>more</em> readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments
within regular expressions to make them practically self-documenting.
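<p>The case-sensitivity limitation is easy to see in a few lines. This is a small sketch for a current Python interpreter; the sample string is made up for illustration, and the <code>re.IGNORECASE</code> flag is shown only as a preview of what the <code class="filename">re</code> module offers.

```python
import re

s = 'The Quick Brown Fox'

# string methods: find is case-sensitive, so lower the string by hand first
print(s.find('quick'))           # -1
print(s.lower().find('quick'))   # 4

# regular expression: ask for a case-insensitive match instead
match = re.search('quick', s, re.IGNORECASE)
print(match.start(), match.group())   # 4 Quick
```

<p>Note that the regular expression reports the text as it appears in the original string, while the string-method approach has to search a lowercased copy.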
<h2 id="re.matching">7.2. Case Study: Street Addresses</h2>
<p>This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub
and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just
make this stuff up; it's actually useful.) This example shows how I approached the problem.
<div class="example"><h3>Example 7.1. Matching at the End of a String</h3><pre class="screen">
<samp class="prompt">>>> </samp>s = '100 NORTH MAIN ROAD'
<samp class="prompt">>>> </samp>s.replace('ROAD', 'RD.') <img id="re.matching.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'100 NORTH MAIN RD.'
<samp class="prompt">>>> </samp>s = '100 NORTH BROAD ROAD'
<samp class="prompt">>>> </samp>s.replace('ROAD', 'RD.') <img id="re.matching.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'100 NORTH BRD. RD.'
<samp class="prompt">>>> </samp>s[:-4] + s[-4:].replace('ROAD', 'RD.') <img id="re.matching.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'100 NORTH BROAD RD.'
<samp class="prompt">>>> </samp>import re <img id="re.matching.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>re.sub('ROAD$', 'RD.', s) <img id="re.matching.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12"> <img id="re.matching.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
'100 NORTH BROAD RD.'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">My goal is to standardize a street address so that <code>'ROAD'</code> is always abbreviated as <code>'RD.'</code>. At first glance, I thought this was simple enough that I could just use the string method <code class="function">replace</code>. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, <code>'ROAD'</code>, was a constant. And in this deceptively simple example, <code class="function">s.replace</code> does indeed work.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that <code>'ROAD'</code> appears twice in the address, once as part of the street name <code>'BROAD'</code> and once as its own word. The <code class="function">replace</code> method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To solve the problem of addresses with more than one <code>'ROAD'</code> substring, you could resort to something like this: only search and replace <code>'ROAD'</code> in the last four characters of the address (<code>s[-4:]</code>), and leave the rest of the string alone (<code>s[:-4]</code>). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string
you're replacing (if you were replacing <code>'STREET'</code> with <code>'ST.'</code>, you would need to use <code>s[:-6]</code> and <code>s[-6:].replace(...)</code>). Would you like to come back in six months and debug this? I know I wouldn't.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the <code class="filename">re</code> module.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Take a look at the first parameter: <code>'ROAD$'</code>. This is a simple regular expression that matches <code>'ROAD'</code> only when it occurs at the end of a string. The <code>$</code> means &#8220;end of the string&#8221;. (There is a corresponding character, the caret <code>^</code>, which means &#8220;beginning of the string&#8221;.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using the <code class="function">re.sub</code> function, you search the string <code class="varname">s</code> for the regular expression <code>'ROAD$'</code> and replace it with <code>'RD.'</code>. This matches the <code>ROAD</code> at the end of the string <code class="varname">s</code>, but does <em>not</em> match the <code>ROAD</code> that's part of the word <code>BROAD</code>, because that's in the middle of <code class="varname">s</code>.
</td>
</tr>
</table>
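<p>Wrapped in a function, the end-of-string substitution can be checked against both addresses at once. This is a minimal sketch for a current Python interpreter; the function name <code class="function">abbreviate_trailing_road</code> is an assumption made for illustration.

```python
import re

def abbreviate_trailing_road(address):
    "replace 'ROAD' with 'RD.' only when it ends the address"
    return re.sub('ROAD$', 'RD.', address)

print(abbreviate_trailing_road('100 NORTH MAIN ROAD'))    # 100 NORTH MAIN RD.
print(abbreviate_trailing_road('100 NORTH BROAD ROAD'))   # 100 NORTH BROAD RD.

# for comparison, the plain string method still mangles BROAD
print('100 NORTH BROAD ROAD'.replace('ROAD', 'RD.'))      # 100 NORTH BRD. RD.
```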
<p>Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching <code>'ROAD'</code> at the end of the address, was not good enough, because not all addresses included a street designation at all; some just
ended with the street name. Most of the time, I got away with it, but if the street name was <code>'BROAD'</code>, then the regular expression would match <code>'ROAD'</code> at the end of the string as part of the word <code>'BROAD'</code>, which is not what I wanted.
<div class="example"><h3>Example 7.2. Matching Whole Words</h3><pre class="screen">
<samp class="prompt">>>> </samp>s = '100 BROAD'
<samp class="prompt">>>> </samp>re.sub('ROAD$', 'RD.', s)
'100 BRD.'
<samp class="prompt">>>> </samp>re.sub('\\bROAD$', 'RD.', s) <img id="re.matching.2.2" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'100 BROAD'
<samp class="prompt">>>> </samp>re.sub(r'\bROAD$', 'RD.', s) <img id="re.matching.2.3" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'100 BROAD'
<samp class="prompt">>>> </samp>s = '100 BROAD ROAD APT. 3'
<samp class="prompt">>>> </samp>re.sub(r'\bROAD$', 'RD.', s) <img id="re.matching.2.4" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'100 BROAD ROAD APT. 3'
<samp class="prompt">>>> </samp>re.sub(r'\bROAD\b', 'RD.', s) <img id="re.matching.2.5" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'100 BROAD RD. APT. 3'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.2.2"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">What I <em>really</em> wanted was to match <code>'ROAD'</code> when it was at the end of the string <em>and</em> it was its own whole word, not a part of some larger word. To express this in a regular expression, you use <code>\b</code>, which means &#8220;a word boundary must occur right here&#8221;. In Python, this is complicated by the fact that the <code>'\'</code> character in a string must itself be escaped; in an ordinary string literal, <code>'\b'</code> is a backspace character, so you have to write <code>'\\b'</code>. This is sometimes referred to as the backslash plague, and it is one reason
why regular expressions are easier in Perl than in Python. On the downside, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it's a bug in syntax or
a bug in your regular expression.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.2.3"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter <code>r</code>. This tells Python that nothing in this string should be escaped; <code>'\t'</code> is a tab character, but <code>r'\t'</code> is really the backslash character <code>\</code> followed by the letter <code>t</code>. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly
(and regular expressions get confusing quickly enough all by themselves).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.2.4"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><em>*sigh*</em> Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word
<code>'ROAD'</code> as a whole word by itself, but it wasn't at the end, because the address had an apartment number after the street designation.
Because <code>'ROAD'</code> isn't at the very end of the string, it doesn't match, so the entire call to <code class="function">re.sub</code> ends up replacing nothing at all, and you get the original string back, which is not what you want.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.matching.2.5"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To solve this problem, I removed the <code>$</code> character and added another <code>\b</code>. Now the regular expression reads &#8220;match <code>'ROAD'</code> when it's a whole word by itself anywhere in the string,&#8221; whether at the end, the beginning, or somewhere in the middle.
</td>
</tr>
</table>
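<p>Here is a minimal sketch that pulls the pieces of this example together. The function name <code>abbreviate_road</code> is made up for illustration; the pattern is the one from the last step above. It also shows that the escaped string and the raw string spell the same pattern, while an unescaped <code>'\b'</code> is a single backspace character.

```python
import re

# Hypothetical helper name; the pattern is the whole-word one from Example 7.2.
def abbreviate_road(address):
    """Replace ROAD with RD. wherever it appears as a whole word."""
    return re.sub(r'\bROAD\b', 'RD.', address)

# The escaped form and the raw-string form are the same five-plus characters.
print('\\bROAD$' == r'\bROAD$')   # True
# Without escaping, '\b' is one character (a backspace), not a word boundary.
print(len('\b'), len(r'\b'))      # 1 2

print(abbreviate_road('100 NORTH MAIN ROAD'))    # 100 NORTH MAIN RD.
print(abbreviate_road('100 BROAD'))              # 100 BROAD
print(abbreviate_road('100 BROAD ROAD APT. 3'))  # 100 BROAD RD. APT. 3
```

<p>Wrapping the substitution in a function is only a convenience here; the point is that the raw string keeps the pattern readable.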
<h2 id="re.roman">7.3. Case Study: Roman Numerals</h2>
<p>You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies
and television shows (&#8220;Copyright <code>MCMXLVI</code>&#8221; instead of &#8220;Copyright <code>1946</code>&#8221;), or on the dedication walls of libraries or universities (&#8220;established <code>MDCCCLXXXVIII</code>&#8221; instead of &#8220;established <code>1888</code>&#8221;). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really
does date back to the ancient Roman empire (hence the name).
<p>In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
<div class="itemizedlist">
<ul>
<li><code>I</code> = <code>1</code>
<li><code>V</code> = <code>5</code>
<li><code>X</code> = <code>10</code>
<li><code>L</code> = <code>50</code>
<li><code>C</code> = <code>100</code>
<li><code>D</code> = <code>500</code>
<li><code>M</code> = <code>1000</code>
</ul>
<p>The following are some general rules for constructing Roman numerals:
<div class="itemizedlist">
<ul>
<li>Characters are additive. <code>I</code> is <code class="constant">1</code>, <code>II</code> is <code>2</code>, and <code>III</code> is <code>3</code>. <code>VI</code> is <code>6</code> (literally, &#8220;<code>5</code> and <code>1</code>&#8221;), <code>VII</code> is <code>7</code>, and <code>VIII</code> is <code>8</code>.
<li>The tens characters (<code>I</code>, <code>X</code>, <code>C</code>, and <code>M</code>) can be repeated up to three times. At <code>4</code>, you need to subtract from the next highest fives character. You can't represent <code>4</code> as <code>IIII</code>; instead, it is represented as <code>IV</code> (&#8220;<code>1</code> less than <code>5</code>&#8221;). The number <code>40</code> is written as <code>XL</code> (<code>10</code> less than <code>50</code>), <code>41</code> as <code>XLI</code>, <code>42</code> as <code>XLII</code>, <code>43</code> as <code>XLIII</code>, and then <code>44</code> as <code>XLIV</code> (<code>10</code> less than <code>50</code>, then <code>1</code> less than <code>5</code>).
<li>Similarly, at <code>9</code>, you need to subtract from the next highest tens character: <code>8</code> is <code>VIII</code>, but <code>9</code> is <code>IX</code> (<code>1</code> less than <code>10</code>), not <code>VIIII</code> (since the <code>I</code> character can not be repeated four times). The number <code>90</code> is <code>XC</code>, <code>900</code> is <code>CM</code>.
<li>The fives characters can not be repeated. The number <code>10</code> is always represented as <code>X</code>, never as <code>VV</code>. The number <code>100</code> is always <code>C</code>, never <code>LL</code>.
<li>Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much.
<code>DC</code> is <code>600</code>; <code>CD</code> is a completely different number (<code>400</code>, <code>100</code> less than <code>500</code>). <code>CI</code> is <code>101</code>; <code>IC</code> is not even a valid Roman numeral (because you can't subtract <code>1</code> directly from <code>100</code>; you would need to write it as <code>XCIX</code>, for <code>10</code> less than <code>100</code>, then <code>1</code> less than <code>10</code>).
</ul>
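<p>To see the additive and subtractive rules at work, here is a rough sketch that adds up a numeral character by character. The names <code>VALUES</code> and <code>roman_value</code> are invented for this illustration, and the function assumes its input is already a valid Roman numeral; it does no checking of its own, which is exactly the gap the rest of this section sets out to fill.

```python
# Values of the seven characters listed above.
VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_value(numeral):
    """Add up a numeral, assuming it is already valid (no validation here)."""
    total = 0
    for i, ch in enumerate(numeral):
        value = VALUES[ch]
        following = numeral[i + 1:i + 2]   # next character, or '' at the end
        if following and VALUES[following] > value:
            # A smaller character before a larger one is subtracted (IV, XC, CM).
            total -= value
        else:
            total += value
    return total

print(roman_value('VIII'))           # 8
print(roman_value('XLIV'))           # 44
print(roman_value('MCMXLVI'))        # 1946
print(roman_value('MDCCCLXXXVIII'))  # 1888
```

<p>Because nothing is validated, this sketch will happily compute a number for an invalid string like <code>'IC'</code>, so you still need a way to reject bad input first.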
<h3>7.3.1. Checking for Thousands</h3>
<p>What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since
Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000
and higher, the thousands are represented by a series of <code>M</code> characters.
<div class="example"><h3>Example 7.3. Checking for Thousands</h3><pre class="screen">
<samp class="prompt">>>> </samp>import re
<samp class="prompt">>>> </samp>pattern = '^M?M?M?$' <img id="re.roman.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>re.search(pattern, 'M') <img id="re.roman.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;SRE_Match object at 0106FB58>
<samp class="prompt">>>> </samp>re.search(pattern, 'MM') <img id="re.roman.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;SRE_Match object at 0106C290>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMM') <img id="re.roman.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;SRE_Match object at 0106AA38>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMMM') <img id="re.roman.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>re.search(pattern, '') <img id="re.roman.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
&lt;SRE_Match object at 0106F4A8></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This pattern has three parts:
<div class="itemizedlist">
<ul>
<li><code>^</code> to match what follows only at the beginning of the string. If this were not specified, the pattern would match no matter
where the <code>M</code> characters were, which is not what you want. You want to make sure that the <code>M</code> characters, if they're there, are at the beginning of the string.
<li><code>M?</code> to optionally match a single <code>M</code> character. Since this is repeated three times, you're matching anywhere from zero to three <code>M</code> characters in a row.
<li><code>$</code> to match what precedes only at the end of the string. When combined with the <code>^</code> character at the beginning, this means that the pattern must match the entire string, with no other characters before or
after the <code>M</code> characters.
</ul>
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The essence of the <code class="filename">re</code> module is the <code class="function">search</code> function, which takes a regular expression (<code class="varname">pattern</code>) and a string (<code>'M'</code>) to try to match against the regular expression. If a match is found, <code class="function">search</code> returns an object which has various methods to describe the match; if no match is found, <code class="function">search</code> returns <code>None</code>, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return
value of <code class="function">search</code>. <code>'M'</code> matches this regular expression, because the first optional <code>M</code> matches and the second and third optional <code>M</code> characters are ignored.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'MM'</code> matches because the first and second optional <code>M</code> characters match and the third <code>M</code> is ignored.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'MMM'</code> matches because all three <code>M</code> characters match.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'MMMM'</code> does not match. All three <code>M</code> characters match, but then the regular expression insists on the string ending (because of the <code>$</code> character), and the string doesn't end yet (because of the fourth <code>M</code>). So <code class="function">search</code> returns <code>None</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Interestingly, an empty string also matches this regular expression, since all the <code>M</code> characters are optional.
</td>
</tr>
</table>
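<p>Since all you care about is whether <code class="function">search</code> returned <code>None</code>, you could wrap that test in a small function. This is only a sketch; <code>THOUSANDS</code> and <code>is_thousands</code> are names made up for this illustration, and the pattern is the one from the example above.

```python
import re

THOUSANDS = '^M?M?M?$'   # pattern from Example 7.3

def is_thousands(s):
    # re.search returns a match object on success and None on failure,
    # so comparing against None turns the result into a plain boolean.
    return re.search(THOUSANDS, s) is not None

for candidate in ['M', 'MM', 'MMM', 'MMMM', '']:
    print(repr(candidate), is_thousands(candidate))
# 'M' True
# 'MM' True
# 'MMM' True
# 'MMMM' False
# '' True
```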
<h3>7.3.2. Checking for Hundreds</h3>
<p>The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed,
depending on its value.
<div class="itemizedlist">
<ul>
<li><code>100</code> = <code>C</code>
<li><code>200</code> = <code>CC</code>
<li><code>300</code> = <code>CCC</code>
<li><code>400</code> = <code>CD</code>
<li><code>500</code> = <code>D</code>
<li><code>600</code> = <code>DC</code>
<li><code>700</code> = <code>DCC</code>
<li><code>800</code> = <code>DCCC</code>
<li><code>900</code> = <code>CM</code>
</ul>
<p>So there are four possible patterns:
<div class="itemizedlist">
<ul>
<li><code>CM</code>
<li><code>CD</code>
<li>Zero to three <code>C</code> characters (zero if the hundreds place is 0)
<li><code>D</code>, followed by zero to three <code>C</code> characters
</ul>
<p>The last two patterns can be combined:
<div class="itemizedlist">
<ul>
<li>an optional <code>D</code>, followed by zero to three <code>C</code> characters
</ul>
<p>This example shows how to validate the hundreds place of a Roman numeral.
<div class="example"><h3 id="re.roman.hundreds">Example 7.4. Checking for Hundreds</h3><pre class="screen">
<samp class="prompt">>>> </samp>import re
<samp class="prompt">>>> </samp>pattern = '^M?M?M?(CM|CD|D?C?C?C?)$' <img id="re.roman.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>re.search(pattern, 'MCM') <img id="re.roman.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;SRE_Match object at 01070390>
<samp class="prompt">>>> </samp>re.search(pattern, 'MD') <img id="re.roman.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;SRE_Match object at 01073A50>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMMCCC') <img id="re.roman.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;SRE_Match object at 010748A8>
<samp class="prompt">>>> </samp>re.search(pattern, 'MCMC') <img id="re.roman.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>re.search(pattern, '') <img id="re.roman.2.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
&lt;SRE_Match object at 01071D98></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This pattern starts out the same as the previous one, checking for the beginning of the string (<code>^</code>), then the thousands place (<code>M?M?M?</code>). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical
bars: <code>CM</code>, <code>CD</code>, and <code>D?C?C?C?</code> (which is an optional <code>D</code> followed by zero to three optional <code>C</code> characters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first
one that matches, and ignores the rest.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'MCM'</code> matches because the first <code>M</code> matches, the second and third <code>M</code> characters are ignored, and the <code>CM</code> matches (so the <code>CD</code> and <code>D?C?C?C?</code> patterns are never even considered). <code>MCM</code> is the Roman numeral representation of <code>1900</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'MD'</code> matches because the first <code>M</code> matches, the second and third <code>M</code> characters are ignored, and the <code>D?C?C?C?</code> pattern matches <code>D</code> (each of the three <code>C</code> characters is optional and is ignored). <code>MD</code> is the Roman numeral representation of <code>1500</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'MMMCCC'</code> matches because all three <code>M</code> characters match, and the <code>D?C?C?C?</code> pattern matches <code>CCC</code> (the <code>D</code> is optional and is ignored). <code>MMMCCC</code> is the Roman numeral representation of <code>3300</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>'MCMC'</code> does not match. The first <code>M</code> matches, the second and third <code>M</code> characters are ignored, and the <code>CM</code> matches, but then the <code>$</code> does not match because you're not at the end of the string yet (you still have an unmatched <code>C</code> character). The <code>C</code> does <em>not</em> match as part of the <code>D?C?C?C?</code> pattern, because the mutually exclusive <code>CM</code> pattern has already matched.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.roman.2.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Interestingly, an empty string still matches this pattern, because all the <code>M</code> characters are optional and ignored, and the empty string matches the <code>D?C?C?C?</code> pattern where all the characters are optional and ignored.
</td>
</tr>
</table>
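<p>If you want to see which part of the string the parenthesized group ended up matching, the match object can tell you. The sketch below uses the match object's <code class="function">group</code> method; the helper name <code>hundreds_part</code> is invented for this illustration.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'   # pattern from Example 7.4

def hundreds_part(s):
    """Return the text matched by the parenthesized group, or None if no match."""
    match = re.search(pattern, s)
    if match is None:
        return None
    return match.group(1)   # group 1 is the first (and only) set of parentheses

print(repr(hundreds_part('MCM')))     # 'CM'
print(repr(hundreds_part('MD')))      # 'D'
print(repr(hundreds_part('MMMCCC')))  # 'CCC'
print(repr(hundreds_part('MCMC')))    # None
print(repr(hundreds_part('')))        # ''
```

<p>Note the difference between the last two results: <code>None</code> means the pattern did not match at all, while an empty string means it matched with an empty hundreds place.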
<p>Whew! See how quickly regular expressions can get nasty? And you've only covered the thousands and hundreds places of Roman
numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But
let's look at another way to express the pattern.
<h2 id="re.nm">7.4. Using the <code>{n,m}</code> Syntax</h2>
<p>In <a href="#re.roman" title="7.3. Case Study: Roman Numerals">the previous section</a>, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express
this in regular expressions, which some people find more readable. First look at the method we already used in the previous
example.
<div class="example"><h3>Example 7.5. The Old Way: Every Character Optional</h3><pre class="screen">
<samp class="prompt">>>> </samp>import re
<samp class="prompt">>>> </samp>pattern = '^M?M?M?$'
<samp class="prompt">>>> </samp>re.search(pattern, 'M') <img id="re.nm.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EE090>
<samp class="prompt">>>> </samp>pattern = '^M?M?M?$'
<samp class="prompt">>>> </samp>re.search(pattern, 'MM') <img id="re.nm.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>pattern = '^M?M?M?$'
<samp class="prompt">>>> </samp>re.search(pattern, 'MMM') <img id="re.nm.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EE090>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMMM') <img id="re.nm.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, and then the first optional <code>M</code>, but not the second and third <code>M</code> (but that's okay because they're optional), and then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, and then the first and second optional <code>M</code>, but not the third <code>M</code> (but that's okay because it's optional), and then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, and then all three optional <code>M</code>, and then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, and then all three optional <code>M</code>, but then does not match the end of the string (because there is still one unmatched <code>M</code>), so the pattern does not match and returns <code>None</code>.
</td>
</tr>
</table>
<div class="example"><h3>Example 7.6. The New Way: From <code>n</code> to <code>m</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>pattern = '^M{0,3}$' <img id="re.nm.2.0" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>re.search(pattern, 'M') <img id="re.nm.2.1" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MM') <img id="re.nm.2.2" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EE090>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMM') <img id="re.nm.2.3" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEDA8>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMMM') <img id="re.nm.2.4" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.2.0"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This pattern says: &#8220;Match the start of the string, then anywhere from zero to three <code>M</code> characters, then the end of the string.&#8221; The 0 and 3 can be any numbers; if you want to match at least one but no more than three <code>M</code> characters, you could say <code>M{1,3}</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.2.1"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then one <code>M</code> out of a possible three, then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.2.2"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then two <code>M</code> out of a possible three, then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.2.3"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then three <code>M</code> out of a possible three, then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.2.4"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then three <code>M</code> out of a possible three, but then <em>does not match</em> the end of the string. The regular expression allows for up to only three <code>M</code> characters before the end of the string, but you have four, so the pattern does not match and returns <code>None</code>.
</td>
</tr>
</table>
</div><table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Python's <code>re</code> module gives you no way to programmatically determine that two regular expressions are equivalent. The best you can do is write a
lot of test cases to make sure they behave the same way on all relevant inputs. You'll learn more about writing test cases
later in this book.
</td>
</tr>
</table>
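<p>As a rough illustration of the test-case approach, the sketch below tries both patterns against every short string built from a small alphabet and reports whether they ever disagree. The function name <code>agree</code>, the alphabet, and the length limit are all arbitrary choices made for this example; passing the check is evidence, not proof, that two patterns are equivalent.

```python
import itertools
import re

old = '^M?M?M?$'   # the old way
new = '^M{0,3}$'   # the new way

def agree(p1, p2, alphabet='MC', max_len=5):
    """Return True if both patterns accept or reject every test string alike."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = ''.join(chars)
            if (re.search(p1, s) is None) != (re.search(p2, s) is None):
                return False   # found a string on which the patterns differ
    return True

print(agree(old, new))          # True
print(agree(old, '^M{1,3}$'))   # False: the empty string tells them apart
```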
<h3>7.4.1. Checking for Tens and Ones</h3>
<p>Now let's expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for
tens.
<div class="example"><h3 id="re.tens.example">Example 7.7. Checking for Tens</h3><pre class="screen">
<samp class="prompt">>>> </samp>pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
<samp class="prompt">>>> </samp>re.search(pattern, 'MCMXL') <img id="re.nm.3.3" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MCML') <img id="re.nm.3.4" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MCMLX') <img id="re.nm.3.5" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MCMLXXX') <img id="re.nm.3.7" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MCMLXXXX') <img id="re.nm.3.8" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.3.3"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then <code>XL</code>, then the end of the string. Remember, the <code>(A|B|C)</code> syntax means &#8220;match exactly one of A, B, or C&#8221;. You match <code>XL</code>, so you ignore the <code>XC</code> and <code>L?X?X?X?</code> choices, and then move on to the end of the string. <code>MCMXL</code> is the Roman numeral representation of <code>1940</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.3.4"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then <code>L?X?X?X?</code>. Of the <code>L?X?X?X?</code>, it matches the <code>L</code> and skips all three optional <code>X</code> characters. Then you move to the end of the string. <code>MCML</code> is the Roman numeral representation of <code>1950</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.3.5"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then the optional <code>L</code> and the first optional <code>X</code>, skips the second and third optional <code>X</code>, then the end of the string. <code>MCMLX</code> is the Roman numeral representation of <code>1960</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.3.7"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then the optional <code>L</code> and all three optional <code>X</code> characters, then the end of the string. <code>MCMLXXX</code> is the Roman numeral representation of <code>1980</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.3.8"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then the optional <code>L</code> and all three optional <code>X</code> characters, then <em>fails to match</em> the end of the string because there is still one more <code>X</code> unaccounted for. So the entire pattern fails to match, and returns <code>None</code>. <code>MCMLXXXX</code> is not a valid Roman numeral.
</td>
</tr>
</table>
<p>The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.
<div class="informalexample"><pre class="screen">
<samp class="prompt">>>> </samp>pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
</pre><p>So what does that look like using this alternate <code>{n,m}</code> syntax? This example shows the new syntax.
<div class="example"><h3 id="re.nm.example">Example 7.8. Validating Roman Numerals with <code>{n,m}</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
<samp class="prompt">>>> </samp>re.search(pattern, 'MDLV') <img id="re.nm.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMDCLXVI') <img id="re.nm.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMMMDCCCLXXXVIII') <img id="re.nm.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'I') <img id="re.nm.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then one of a possible four <code>M</code> characters, then <code>D?C{0,3}</code>. Of that, it matches the optional <code>D</code> and zero of three possible <code>C</code> characters. Moving on, it matches <code>L?X{0,3}</code> by matching the optional <code>L</code> and zero of three possible <code>X</code> characters. Then it matches <code>V?I{0,3}</code> by matching the optional <code>V</code> and zero of three possible <code>I</code> characters, and finally the end of the string. <code>MDLV</code> is the Roman numeral representation of <code>1555</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then two of a possible four <code>M</code> characters, then the <code>D?C{0,3}</code> with a <code>D</code> and one of three possible <code>C</code> characters; then <code>L?X{0,3}</code> with an <code>L</code> and one of three possible <code>X</code> characters; then <code>V?I{0,3}</code> with a <code>V</code> and one of three possible <code>I</code> characters; then the end of the string. <code>MMDCLXVI</code> is the Roman numeral representation of <code>2666</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then four out of four <code>M</code> characters, then <code>D?C{0,3}</code> with a <code>D</code> and three out of three <code>C</code> characters; then <code>L?X{0,3}</code> with an <code>L</code> and three out of three <code>X</code> characters; then <code>V?I{0,3}</code> with a <code>V</code> and three out of three <code>I</code> characters; then the end of the string. <code>MMMMDCCCLXXXVIII</code> is the Roman numeral representation of <code>4888</code>, and it's the longest Roman numeral you can write without extended syntax.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.nm.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Watch closely. (I feel like a magician. &#8220;Watch closely, kids, I'm going to pull a rabbit out of my hat.&#8221;) This matches the start of the string, then zero out of four <code>M</code>, then matches <code>D?C{0,3}</code> by skipping the optional <code>D</code> and matching zero out of three <code>C</code>, then matches <code>L?X{0,3}</code> by skipping the optional <code>L</code> and matching zero out of three <code>X</code>, then matches <code>V?I{0,3}</code> by skipping the optional <code>V</code> and matching one out of three <code>I</code>. Then the end of the string. Whoa.
</td>
</tr>
</table>
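<p>If you'd rather not trace each match by hand, you can let Python do the checking. Here's a minimal sketch, assuming the compact <code>{n,m}</code> pattern is the one the callouts above describe; the helper name <code>is_roman</code> is made up for this illustration and is not part of the <code class="filename">re</code> module.

```python
import re

# Assumed to be the compact {n,m} pattern that the callouts above walk through.
pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

def is_roman(s):
    """Hypothetical helper: True if s matches the Roman numeral pattern."""
    return re.search(pattern, s) is not None

# The four numerals from the example all match.
for numeral in ('MDLV', 'MMDCLXVI', 'MMMMDCCCLXXXVIII', 'I'):
    print('%-18s %s' % (numeral, is_roman(numeral)))

# Five M characters exceed M{0,4}; four I characters exceed I{0,3}.
print(is_roman('MMMMM'))   # False
print(is_roman('IIII'))    # False
```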
<p>If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand
someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back
to your own regular expressions a few months later. I've done it, and it's not a pretty sight.
<p>In the next section you'll explore an alternate syntax that can help keep your expressions maintainable.
<h2 id="re.verbose">7.5. Verbose Regular Expressions</h2>
<p>So far you've just been dealing with what I'll call &#8220;compact&#8221; regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee
that you'll be able to understand it six months later. What you really need is inline documentation.
<p>Python allows you to do this with something called <em>verbose regular expressions</em>. A verbose regular expression is different from a compact regular expression in two ways:
<div class="itemizedlist">
<ul>
<li>Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're
not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a
backslash in front of it.)
<li>Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a <code>#</code> character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your
source code, but it works the same way.
</ul>
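<p>Both rules are easy to see on a tiny pattern before you tackle a big one. The following is a small sketch of the two rules listed above; the variable names are arbitrary and exist only for this illustration.

```python
import re

# The space in the pattern is ignored, so in verbose mode 'a b' means 'ab'.
ignored = re.search(r'a b', 'ab', re.VERBOSE)
print(ignored is not None)     # True

# For the same reason, the pattern no longer matches a string with a real space.
literal = re.search(r'a b', 'a b', re.VERBOSE)
print(literal is not None)     # False

# A backslash in front of the space makes it match literally again.
escaped = re.search(r'a\ b', 'a b', re.VERBOSE)
print(escaped is not None)     # True

# Everything from # to the end of the line is a comment, not part of the pattern.
commented = re.search(r'''
    \d{3}    # three digits
    ''', '555', re.VERBOSE)
print(commented is not None)   # True
```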
<p>This will be clearer with an example. Let's revisit the compact regular expression you've been working with and rewrite
it as a verbose regular expression.
<div class="example"><h3>Example 7.9. Regular Expressions with Inline Comments</h3><pre class="screen">
<samp class="prompt">>>> </samp><kbd>pattern = """
^ # beginning of string
M{0,4} # thousands - 0 to 4 M's
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
# or 500-800 (D, followed by 0 to 3 C's)
(XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
# or 50-80 (L, followed by 0 to 3 X's)
(IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
# or 5-8 (V, followed by 0 to 3 I's)
$ # end of string
"""</kbd>
<samp class="prompt">>>> </samp>re.search(pattern, 'M', re.VERBOSE) <img id="re.verbose.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MCMLXXXIX', re.VERBOSE) <img id="re.verbose.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'MMMMDCCCLXXXVIII', re.VERBOSE) <img id="re.verbose.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x008EEB48>
<samp class="prompt">>>> </samp>re.search(pattern, 'M') <img id="re.verbose.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.verbose.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when
working with them: <code>re.VERBOSE</code> is a constant defined in the <code class="filename">re</code> module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has
quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the
whitespace and the comments, this is exactly the same regular expression as you saw in <a href="#re.nm" title="7.4. Using the {n,m} Syntax">the previous section</a>, but it's a lot more readable.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.verbose.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then one of a possible four <code>M</code>, then <code>CM</code>, then <code>L</code> and three of a possible three <code>X</code>, then <code>IX</code>, then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.verbose.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This matches the start of the string, then four of a possible four <code>M</code>, then <code>D</code> and three of a possible three <code>C</code>, then <code>L</code> and three of a possible three <code>X</code>, then <code>V</code> and three of a possible three <code>I</code>, then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.verbose.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This does not match. Why? Because it doesn't have the <code>re.VERBOSE</code> flag, so the <code class="function">re.search</code> function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.
</td>
</tr>
</table>
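<p>One way to avoid forgetting the flag is to compile the pattern once with <code>re.VERBOSE</code> and reuse the compiled object (<code class="function">re.compile</code> appears in the phone number case study later in this chapter). This is a sketch of that idea rather than the only way to do it; the name <code>roman</code> is an arbitrary choice.

```python
import re

# Compile once; the re.VERBOSE flag is stored with the compiled pattern.
roman = re.compile(r"""
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
    """, re.VERBOSE)

# No extra argument is needed on each call to search().
print(roman.search('MCMLXXXIX') is not None)   # True
print(roman.search('MCMC') is not None)        # False
```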
<h2 id="re.phone">7.6. Case study: Parsing Phone Numbers</h2>
<p>So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions
are much more powerful than that. When a regular expression <em>does</em> match, you can pick out specific pieces of it. You can find out what matched where.
<p>This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American
phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the
area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many
examples of regular expressions that purported to do this, but none of them were permissive enough.
<p>Here are the phone numbers I needed to be able to accept:
<div class="itemizedlist">
<ul>
<li><code>800-555-1212</code>
<li><code>800 555 1212</code>
<li><code>800.555.1212</code>
<li><code>(800) 555-1212</code>
<li><code>1-800-555-1212</code>
<li><code>800-555-1212-1234</code>
<li><code>800-555-1212x1234</code>
<li><code>800-555-1212 ext. 1234</code>
<li><code>work 1-(800) 555.1212 #1234</code>
</ul>
<p>Quite a variety! In each of these cases, I need to know that the area code was <code>800</code>, the trunk was <code>555</code>, and the rest of the phone number was <code>1212</code>. For those with an extension, I need to know that the extension was <code>1234</code>.
<p>Let's work through developing a solution for phone number parsing. This example shows the first step.
<div class="example"><h3 id="re.phone.example">Example 7.10. Finding Numbers</h3><pre class="screen">
<samp class="prompt">>>> </samp>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$') <img id="re.phone.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212').groups() <img id="re.phone.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
('800', '555', '1212')
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212-1234') <img id="re.phone.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Always read regular expressions from left to right. This one matches the beginning of the string, and then <code>(\d{3})</code>. What's <code>\d{3}</code>? Well, the <code>{3}</code> means &#8220;match exactly three numeric digits&#8221;; it's a variation on the <a href="#re.nm" title="7.4. Using the {n,m} Syntax"><code>{n,m} syntax</code></a> you saw earlier. <code>\d</code> means &#8220;any numeric digit&#8221; (<code>0</code> through <code>9</code>). Putting it in parentheses means &#8220;match exactly three numeric digits, <em>and then remember them as a group that I can ask for later</em>&#8221;. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another
group of exactly four digits. Then match the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To get access to the groups that the regular expression parser remembered along the way, use the <code class="function">groups()</code> method on the object that the <code class="function">search</code> function returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you
defined three groups, one with three digits, one with three digits, and one with four digits.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This regular expression is not the final answer, because it doesn't handle a phone number with an extension on the end. For
that, you'll need to expand the regular expression.
</td>
</tr>
</table>
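<p>Notice that the failed search printed nothing: <code class="function">search</code> returned <code>None</code>, and calling <code class="function">groups()</code> on <code>None</code> would raise an <code>AttributeError</code>. Here's a minimal sketch of checking for that first. The helper name <code>parse</code> is hypothetical, not something the <code class="filename">re</code> module provides.

```python
import re

phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

def parse(number):
    """Hypothetical helper: return the remembered groups, or None on no match."""
    match = phonePattern.search(number)
    if match is None:
        return None
    return match.groups()

match = phonePattern.search('800-555-1212')
print(match.group(0))                # 800-555-1212 (the whole match)
print(match.group(1))                # 800 (the first remembered group)
print(parse('800-555-1212'))         # ('800', '555', '1212')
print(parse('800-555-1212-1234'))    # None
```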
<div class="example"><h3>Example 7.11. Finding the Extension</h3><pre class="screen">
<samp class="prompt">>>> </samp>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$') <img id="re.phone.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212-1234').groups() <img id="re.phone.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('800 555 1212 1234') <img id="re.phone.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212') <img id="re.phone.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then
a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered
group of four digits. What's new is that you then match another hyphen, and a remembered group of one or more digits, then
the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">groups()</code> method now returns a tuple of four elements, since the regular expression now defines four groups to remember.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Unfortunately, this regular expression is not the final answer either, because it assumes that the different parts of the
phone number are separated by hyphens. What if they're separated by spaces, or commas, or dots? You need a more general
solution to match several different types of separators.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Oops! Not only does this regular expression not do everything you want, it's actually a step backwards, because now you can't
parse phone numbers <em>without</em> an extension. That's not what you wanted at all; if the extension is there, you want to know what it is, but if it's not
there, you still want to know what the different parts of the main number are.
</td>
</tr>
</table>
<p>The next example shows the regular expression to handle separators between the different parts of the phone number.
<div class="example"><h3>Example 7.12. Handling Different Separators</h3><pre class="screen">
<samp class="prompt">>>> </samp>phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$') <img id="re.phone.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>phonePattern.search('800 555 1212 1234').groups() <img id="re.phone.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212-1234').groups() <img id="re.phone.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('80055512121234') <img id="re.phone.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212') <img id="re.phone.3.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Hang on to your hat. You're matching the beginning of the string, then a group of three digits, then <code>\D+</code>. What the heck is that? Well, <code>\D</code> matches any character <em>except</em> a numeric digit, and <code>+</code> means &#8220;1 or more&#8221;. So <code>\D+</code> matches one or more characters that are not digits. This is what you're using instead of a literal hyphen, to try to match
different separators.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using <code>\D+</code> instead of <code>-</code> means you can now match phone numbers where the parts are separated by spaces instead of hyphens.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Of course, phone numbers separated by hyphens still work too.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Unfortunately, this is still not the final answer, because it assumes that there is a separator at all. What if the phone
number is entered without any spaces or hyphens at all?
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.3.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Oops! This still hasn't fixed the problem of requiring extensions. Now you have two problems, but you can solve both of
them with the same technique.
</td>
</tr>
</table>
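<p>To see exactly what <code>\D+</code> picks up, you can ask for every run of non-digits in a string. This sketch uses <code class="function">re.findall</code>, which this chapter has not covered; it returns a list of all non-overlapping matches. The sample strings are made up for the illustration.

```python
import re

# Each run of one or more non-digit characters is one separator.
separators = re.findall(r'\D+', '800 - 555.1212')
print(separators)    # [' - ', '.']

# With no separators at all there is nothing for \D+ to match,
# which is why the pattern in this example rejects '80055512121234'.
none_found = re.findall(r'\D+', '80055512121234')
print(none_found)    # []
```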
<p>The next example shows the regular expression for handling phone numbers <em>without</em> separators.
<div class="example"><h3>Example 7.13. Handling Numbers Without Separators</h3><pre class="screen">
<samp class="prompt">>>> </samp>phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') <img id="re.phone.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>phonePattern.search('80055512121234').groups() <img id="re.phone.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('800.555.1212 x1234').groups() <img id="re.phone.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212').groups() <img id="re.phone.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
('800', '555', '1212', '')
<samp class="prompt">>>> </samp>phonePattern.search('(800)5551212 x1234') <img id="re.phone.4.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The only change you've made since the last step is changing all the <code>+</code> to <code>*</code>. Instead of <code>\D+</code> between the parts of the phone number, you now match on <code>\D*</code>. Remember that <code>+</code> means &#8220;1 or more&#8221;? Well, <code>*</code> means &#8220;zero or more&#8221;. So now you should be able to parse phone numbers even when there is no separator character at all.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits
(<code>800</code>), then zero non-numeric characters, then a remembered group of three digits (<code>555</code>), then zero non-numeric characters, then a remembered group of four digits (<code>1212</code>), then zero non-numeric characters, then a remembered group of an arbitrary number of digits (<code>1234</code>), then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Other variations work now too: dots instead of hyphens, and both a space and an <code>x</code> before the extension.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the <code class="function">groups()</code> method still returns a tuple of four elements, but the fourth element is just an empty string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.4.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">I hate to be the bearer of bad news, but you're not finished yet. What's the problem here? There's an extra character before
the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No
problem, you can use the same technique of &#8220;zero or more non-numeric characters&#8221; to skip over the leading characters before the area code.
</td>
</tr>
</table>
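<p>The difference between <code>+</code> and <code>*</code> is easiest to see side by side. Here's a sketch that runs the previous pattern and the new one over the same inputs; the names <code>plus</code> and <code>star</code> are just labels for this comparison.

```python
import re

# The pattern from the previous example (separators and extension required)...
plus = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
# ...and the pattern from this example (separators and extension optional).
star = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

for number in ('800-555-1212-1234', '80055512121234', '800-555-1212'):
    print('%-18s plus=%-5s star=%s' % (
        number,
        plus.search(number) is not None,
        star.search(number) is not None))
```

<p>Only the first number satisfies both patterns; the other two need the <code>*</code> version.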
<p>The next example shows how to handle leading characters in phone numbers.
<div class="example"><h3>Example 7.14. Handling Leading Characters</h3><pre class="screen">
<samp class="prompt">>>> </samp>phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') <img id="re.phone.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>phonePattern.search('(800)5551212 ext. 1234').groups() <img id="re.phone.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212').groups() <img id="re.phone.5.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
('800', '555', '1212', '')
<samp class="prompt">>>> </samp>phonePattern.search('work 1-(800) 555.1212 #1234') <img id="re.phone.5.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the same as in the previous example, except now you're matching <code>\D*</code>, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you're not remembering
these non-numeric characters (they're not in parentheses). If you find them, you'll just skip over them and then start remembering
the area code whenever you get to it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis
after the area code is already handled; it's treated as a non-numeric separator and matched by the <code>\D*</code> after the first remembered group.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.5.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Just a sanity check to make sure you haven't broken anything that used to work. Since the leading characters are entirely
optional, this matches the beginning of the string, then zero non-numeric characters, then a remembered group of three digits
(<code>800</code>), then one non-numeric character (the hyphen), then a remembered group of three digits (<code>555</code>), then one non-numeric character (the hyphen), then a remembered group of four digits (<code>1212</code>), then zero non-numeric characters, then a remembered group of zero digits, then the end of the string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.5.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn't this phone number match?
Because there's a <code>1</code> before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (<code>\D*</code>). Aargh.
</td>
</tr>
</table>
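<p>You can watch the match get stuck. This sketch splits the failing input at the point where <code>^\D*</code> gives up; the variable names are made up for the illustration, and <code class="function">group()</code> with no argument returns the whole matched text.

```python
import re

number = 'work 1-(800) 555.1212 #1234'

# ^\D* can consume only the non-digits at the very start of the string.
leading = re.search(r'^\D*', number).group()
print(repr(leading))    # 'work '

# The pattern now demands three digits, but the next three characters
# are a digit, a hyphen, and a parenthesis.
rest = number[len(leading):len(leading) + 3]
blocked = re.search(r'^\d{3}', rest) is None
print(repr(rest))       # '1-('
print(blocked)          # True
```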
<p>Let's back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you
see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than
trying to match it all just so you can skip over it, let's take a different approach: don't explicitly match the beginning
of the string at all. This approach is shown in the next example.
<div class="example"><h3>Example 7.15. Phone Number, Wherever I May Find Ye</h3><pre class="screen">
<samp class="prompt">>>> </samp>phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') <img id="re.phone.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>phonePattern.search('work 1-(800) 555.1212 #1234').groups() <img id="re.phone.6.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212').groups() <img id="re.phone.6.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
('800', '555', '1212', '')
<samp class="prompt">>>> </samp>phonePattern.search('80055512121234').groups() <img id="re.phone.6.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
('800', '555', '1212', '1234')
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.6.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note the lack of <code>^</code> in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need
to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out
where the input string starts to match, and go from there.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.6.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can successfully parse a phone number that includes leading characters and a leading digit, plus any number of any
kind of separators around each part of the phone number.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.6.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Sanity check. This still works.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.6.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">That still works too.</td>
</tr>
</table>
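<p>Rather than trusting three spot checks, you can run the pattern over every phone number from the list at the start of this section. This is a minimal sketch of such a check; the list name <code>numbers</code> is arbitrary.

```python
import re

phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

# The nine formats listed at the start of this section.
numbers = [
    '800-555-1212',
    '800 555 1212',
    '800.555.1212',
    '(800) 555-1212',
    '1-800-555-1212',
    '800-555-1212-1234',
    '800-555-1212x1234',
    '800-555-1212 ext. 1234',
    'work 1-(800) 555.1212 #1234',
]

for number in numbers:
    print('%-30s %s' % (number, phonePattern.search(number).groups()))
```

<p>Every line should show <code>800</code>, <code>555</code>, and <code>1212</code>, with <code>1234</code> as the fourth element only for the numbers that have an extension.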
<p>See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can
you tell the difference between one and the next?
<p>While you still understand the final answer (and it is the final answer; if you've discovered a case it doesn't handle, I
don't want to know about it), let's write it out as a verbose regular expression, before you forget why you made the choices
you made.
<div class="example"><h3>Example 7.16. Parsing Phone Numbers (Final Version)</h3><pre class="screen">
<samp class="prompt">>>> </samp><kbd>phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D* # optional separator is any number of non-digits
(\d{3}) # trunk is 3 digits (e.g. '555')
\D* # optional separator
(\d{4}) # rest of number is 4 digits (e.g. '1212')
\D* # optional separator
(\d*) # extension is optional and can be any number of digits
$ # end of string
''', re.VERBOSE)</kbd>
<samp class="prompt">>>> </samp>phonePattern.search('work 1-(800) 555.1212 #1234').groups() <img id="re.phone.7.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
('800', '555', '1212', '1234')
<samp class="prompt">>>> </samp>phonePattern.search('800-555-1212').groups() <img id="re.phone.7.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
('800', '555', '1212', '')
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.7.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it's no
surprise that it parses the same inputs.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#re.phone.7.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Final sanity check. Yes, this still works. You're done.</td>
</tr>
</table>
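<p>The original problem was to store each part of the number separately. Here's one way you might wrap the final pattern to do that. This is a sketch under assumptions: the function name <code>parse_phone</code> and the dictionary keys are hypothetical, and turning an empty extension into <code>None</code> is a design choice, not something the pattern requires.

```python
import re

phonePattern = re.compile(r'''
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional
    $           # end of string
    ''', re.VERBOSE)

def parse_phone(text):
    """Hypothetical helper: return the labeled parts, or None on no match."""
    match = phonePattern.search(text)
    if match is None:
        return None
    area, trunk, rest, extension = match.groups()
    # An empty extension string becomes None (an assumption of this sketch).
    return {'area': area, 'trunk': trunk, 'rest': rest,
            'extension': extension or None}

print(parse_phone('800-555-1212')['extension'])                  # None
print(parse_phone('work 1-(800) 555.1212 #1234')['extension'])   # 1234
print(parse_phone('no number here'))                             # None
```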
<div class="itemizedlist">
<h3>Further Reading on Regular Expressions</h3>
<ul>
<li><a href="http://py-howto.sourceforge.net/regex/regex.html">Regular Expression HOWTO</a> teaches about regular expressions and how to use them in Python.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes the <a href="http://www.python.org/doc/current/lib/module-re.html"><code class="filename">re</code> module</a>.
</ul>
<h2 id="re.summary">7.7. Summary</h2>
<p>This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely
overwhelmed by them now, believe me, you ain't seen nothing yet.
<p>You should now be familiar with the following techniques:
<div class="itemizedlist">
<ul>
<li><code>^</code> matches the beginning of a string.
<li><code>$</code> matches the end of a string.
<li><code>\b</code> matches a word boundary.
<li><code>\d</code> matches any numeric digit.
<li><code>\D</code> matches any non-numeric character.
<li><code>x?</code> matches an optional <code>x</code> character (in other words, it matches an <code>x</code> zero or one times).
<li><code>x*</code> matches <code>x</code> zero or more times.
<li><code>x+</code> matches <code>x</code> one or more times.
<li><code>x{n,m}</code> matches an <code>x</code> character at least <code>n</code> times, but not more than <code>m</code> times.
<li><code>(a|b|c)</code> matches either <code>a</code> or <code>b</code> or <code>c</code>.
<li><code>(x)</code> in general is a <em>remembered group</em>. You can get the value of what matched by using the <code class="function">groups()</code> method of the object returned by <code class="function">re.search</code>.
</ul>
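<p>If you want to convince yourself of each item in that list, here is a short sketch with one or two checks per technique. The sample patterns and strings are made up for the illustration; each row holds a pattern, a string, and whether the pattern should match.

```python
import re

checks = [
    (r'^ab',       'abc',    True),    # ^ anchors at the beginning
    (r'^ab',       'cab',    False),
    (r'ab$',       'cab',    True),    # $ anchors at the end
    (r'\bcat\b',   'a cat',  True),    # \b is a word boundary
    (r'\bcat\b',   'concat', False),
    (r'^\d\D$',    '7x',     True),    # \d is a digit, \D is a non-digit
    (r'^ax?b$',    'ab',     True),    # x? is zero or one x
    (r'^ax*b$',    'axxxb',  True),    # x* is zero or more x
    (r'^ax+b$',    'ab',     False),   # x+ needs at least one x
    (r'^x{2,3}$',  'xxxx',   False),   # x{n,m} is between n and m x's
    (r'^(a|b|c)$', 'b',      True),    # (a|b|c) is any one alternative
]

for pattern, text, expected in checks:
    assert (re.search(pattern, text) is not None) == expected

# Parentheses remember what matched; groups() hands it back.
remembered = re.search(r'(\d+)-(\d+)', '555-1212').groups()
print(remembered)    # ('555', '1212')
```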
<p>Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough
about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than
they solve.
<div class="blockquote">
<table border="0" width="100%" cellspacing="0" cellpadding="0" class="blockquote" summary="Block quote">
<tr>
<td width="10%" valign="top"> </td>
<td width="80%" valign="top">
<p>Some people, when confronted with a problem, think &#8220;I know, I'll use regular expressions.&#8221; Now they have two problems.
</td>
<td width="10%" valign="top"> </td>
</tr>
<tr>
<td colspan="2" align="right" valign="top">--Jamie Zawinski, <a href="http://groups.google.com/groups?selm=33F0C496.370D7C45%40netscape.com">in comp.emacs.xemacs</a></td>
<td width="10%" valign="top"> </td>
</tr>
</table>
<div class="chapter">
<h2 id="dialect">Chapter 8. <acronym>HTML</acronym> Processing</h2>
<h2 id="dialect.divein">8.1. Diving in</h2>
<p>I often see questions on <a href="http://groups.google.com/groups?group=comp.lang.python">comp.lang.python</a> like &#8220;How can I list all the [headers|images|links] in my <acronym>HTML</acronym> document?&#8221; &#8220;How do I parse/translate/munge the text of my <acronym>HTML</acronym> document but leave the tags alone?&#8221; &#8220;How can I add/remove/quote attributes of all my <acronym>HTML</acronym> tags at once?&#8221; This chapter will answer all of these questions.
<p>Here is a complete, working Python program in two parts. The first part, <code class="filename">BaseHTMLProcessor.py</code>, is a generic tool to help you process <acronym>HTML</acronym> files by walking through the tags and text blocks. The second part, <code class="filename">dialect.py</code>, is an example of how to use <code class="filename">BaseHTMLProcessor.py</code> to translate the text of an <acronym>HTML</acronym> document but leave the tags alone. Read the <code>doc string</code>s and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how
any of these class methods ever get called. Don't worry, all will be revealed in due time.
<div class="example"><h3 id="dialect.basehtml.listing">Example 8.1. <code class="filename">BaseHTMLProcessor.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for &lt;pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (&lt;!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("&lt;%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for &lt;/pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("&lt;/%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&amp;#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&amp;#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&amp;copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&amp;%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. &lt;!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("&lt;!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. &lt;?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("&lt;?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("&lt;!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)</pre><div class="example"><h3>Example 8.2. <code class="filename">dialect.py</code></h3><pre class="programlisting">
import re
from BaseHTMLProcessor import BaseHTMLProcessor

class Dialectizer(BaseHTMLProcessor):
    subs = ()

    def reset(self):
        # extend (called from __init__ in ancestor)
        # Reset all data attributes
        self.verbatim = 0
        BaseHTMLProcessor.reset(self)

    def start_pre(self, attrs):
        # called for every &lt;pre> tag in HTML source
        # Increment verbatim mode count, then handle tag like normal
        self.verbatim += 1
        self.unknown_starttag("pre", attrs)

    def end_pre(self):
        # called for every &lt;/pre> tag in HTML source
        # Decrement verbatim mode count
        self.unknown_endtag("pre")
        self.verbatim -= 1

    def handle_data(self, text):
        # override
        # called for every block of text in HTML source
        # If in verbatim mode, save text unaltered;
        # otherwise process the text with a series of substitutions
        self.pieces.append(self.verbatim and text or self.process(text))

    def process(self, text):
        # called from handle_data
        # Process text block by performing series of regular expression
        # substitutions (actual substitutions are defined in descendant)
        for fromPattern, toPattern in self.subs:
            text = re.sub(fromPattern, toPattern, text)
        return text

class ChefDialectizer(Dialectizer):
    """convert HTML to Swedish Chef-speak
    based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
    """
    subs = ((r'a([nu])', r'u\1'),
            (r'A([nu])', r'U\1'),
            (r'a\B', r'e'),
            (r'A\B', r'E'),
            (r'en\b', r'ee'),
            (r'\Bew', r'oo'),
            (r'\Be\b', r'e-a'),
            (r'\be', r'i'),
            (r'\bE', r'I'),
            (r'\Bf', r'ff'),
            (r'\Bir', r'ur'),
            (r'(\w*?)i(\w*?)$', r'\1ee\2'),
            (r'\bow', r'oo'),
            (r'\bo', r'oo'),
            (r'\bO', r'Oo'),
            (r'the', r'zee'),
            (r'The', r'Zee'),
            (r'th\b', r't'),
            (r'\Btion', r'shun'),
            (r'\Bu', r'oo'),
            (r'\BU', r'Oo'),
            (r'v', r'f'),
            (r'V', r'F'),
            (r'w', r'w'),
            (r'W', r'W'),
            (r'([a-z])[.]', r'\1. Bork Bork Bork!'))

class FuddDialectizer(Dialectizer):
    """convert HTML to Elmer Fudd-speak"""
    subs = ((r'[rl]', r'w'),
            (r'qu', r'qw'),
            (r'th\b', r'f'),
            (r'th', r'd'),
            (r'n[.]', r'n, uh-hah-hah-hah.'))

class OldeDialectizer(Dialectizer):
    """convert HTML to mock Middle English"""
    subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
            (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
            (r'ick\b', r'yk'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
            (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
            (r'([aeiou])re\b', r'\1r'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
            (r'tion\b', r'cioun'),
            (r'ion\b', r'ioun'),
            (r'aid', r'ayde'),
            (r'ai', r'ey'),
            (r'ay\b', r'y'),
            (r'ay', r'ey'),
            (r'ant', r'aunt'),
            (r'ea', r'ee'),
            (r'oa', r'oo'),
            (r'ue', r'e'),
            (r'oe', r'o'),
            (r'ou', r'ow'),
            (r'ow', r'ou'),
            (r'\bhe', r'hi'),
            (r've\b', r'veth'),
            (r'se\b', r'e'),
            (r"'s\b", r'es'),
            (r'ic\b', r'ick'),
            (r'ics\b', r'icc'),
            (r'ical\b', r'ick'),
            (r'tle\b', r'til'),
            (r'll\b', r'l'),
            (r'ould\b', r'olde'),
            (r'own\b', r'oune'),
            (r'un\b', r'onne'),
            (r'rry\b', r'rye'),
            (r'est\b', r'este'),
            (r'pt\b', r'pte'),
            (r'th\b', r'the'),
            (r'ch\b', r'che'),
            (r'ss\b', r'sse'),
            (r'([wybdp])\b', r'\1e'),
            (r'([rnt])\b', r'\1\1e'),
            (r'from', r'fro'),
            (r'when', r'whan'))

def translate(url, dialectName="chef"):
    """fetch URL and translate using dialect
    dialect in ("chef", "fudd", "olde")"""
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    parser = parserClass()
    parser.feed(htmlSource)
    parser.close()
    return parser.output()

def test(url):
    """test all dialects against URL"""
    for dialect in ("chef", "fudd", "olde"):
        outfile = "%s.html" % dialect
        fsock = open(outfile, "wb")
        fsock.write(translate(url, dialect))
        fsock.close()
        import webbrowser
        webbrowser.open_new(outfile)

if __name__ == "__main__":
    test("http://diveintopython3.org/odbchelper_list.html")</pre><div class="example"><h3>Example 8.3. Output of <code class="filename">dialect.py</code></h3>
<p>Running this script will translate <a href="#odbchelper.list" title="3.2. Introducing Lists">Section 3.2, &#8220;Introducing Lists&#8221;</a> into <a href="../native_data_types/chef.html">mock Swedish Chef-speak</a> (from The Muppets), <a href="../native_data_types/fudd.html">mock Elmer Fudd-speak</a> (from Bugs Bunny cartoons), and <a href="../native_data_types/olde.html">mock Middle English</a> (loosely based on Chaucer's <i class="citetitle">The Canterbury Tales</i>). If you look at the <acronym>HTML</acronym> source of the output pages, you'll see that all the <acronym>HTML</acronym> tags and attributes are untouched, but the text between the tags has been &#8220;translated&#8221; into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the
code listings and screen examples were left untouched.<pre class="programlisting">
&lt;div class="abstract">
&lt;p>Lists awe &lt;span class="application">Pydon&lt;/span>'s wowkhowse datatype.
If youw onwy expewience wif wists is awways in
&lt;span class="application">Visuaw Basic&lt;/span> ow (God fowbid) de datastowe
in &lt;span class="application">Powewbuiwdew&lt;/span>, bwace youwsewf fow
&lt;span class="application">Pydon&lt;/span> wists.&lt;/p>
&lt;/div>
</pre><h2 id="dialect.sgmllib">8.2. Introducing <code class="filename">sgmllib.py</code></h2>
<p><acronym>HTML</acronym> processing is broken into three steps: breaking down the <acronym>HTML</acronym> into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into <acronym>HTML</acronym> again. The first step is done by <code class="filename">sgmllib.py</code>, a part of the standard Python library.
<p>The key to understanding this chapter is to realize that <acronym>HTML</acronym> is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags
and end tags. Usually you don't work with <acronym>HTML</acronym> this way; you work with it <em>textually</em> in a text editor, or <em>visually</em> in a web browser or web authoring tool. <code class="filename">sgmllib.py</code> presents <acronym>HTML</acronym> <em>structurally</em>.
<p><code class="filename">sgmllib.py</code> contains one important class: <code class="classname">SGMLParser</code>. <code class="classname">SGMLParser</code> parses <acronym>HTML</acronym> into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece,
it calls a method on itself based on what it found. In order to use the parser, you subclass the <code class="classname">SGMLParser</code> class and override these methods. This is what I meant when I said that it presents <acronym>HTML</acronym> <em>structurally</em>: the structure of the <acronym>HTML</acronym> determines the sequence of method calls and the arguments passed to each method.
<p><code class="classname">SGMLParser</code> parses <acronym>HTML</acronym> into 8 kinds of data, and calls a separate method for each of them:
<div class="variablelist">
<dl>
<dt>Start tag</dt>
<dd>An <acronym>HTML</acronym> tag that starts a block, like <code class="sgmltag-element">&lt;html></code>, <code class="sgmltag-element">&lt;head></code>, <code class="sgmltag-element">&lt;body></code>, or <code class="sgmltag-element">&lt;pre></code>, or a standalone tag like <code class="sgmltag-element">&lt;br></code> or <code class="sgmltag-element">&lt;img></code>. When it finds a start tag <i class="replaceable"><code>tagname</code></i>, <code class="classname">SGMLParser</code> will look for a method called <code class="function">start_<i class="replaceable"><code>tagname</code></i></code> or <code class="function">do_<i class="replaceable"><code>tagname</code></i></code>. For instance, when it finds a <code class="sgmltag-element">&lt;pre></code> tag, it will look for a <code class="function">start_pre</code> or <code class="function">do_pre</code> method. If found, <code class="classname">SGMLParser</code> calls this method with a list of the tag's attributes; otherwise, it calls <code class="function">unknown_starttag</code> with the tag name and list of attributes.
</dd>
<dt>End tag</dt>
<dd>An <acronym>HTML</acronym> tag that ends a block, like <code class="sgmltag-element">&lt;/html></code>, <code class="sgmltag-element">&lt;/head></code>, <code class="sgmltag-element">&lt;/body></code>, or <code class="sgmltag-element">&lt;/pre></code>. When it finds an end tag, <code class="classname">SGMLParser</code> will look for a method called <code class="function">end_<i class="replaceable"><code>tagname</code></i></code>. If found, <code class="classname">SGMLParser</code> calls this method; otherwise, it calls <code class="function">unknown_endtag</code> with the tag name.
</dd>
<dt>Character reference</dt>
<dd>An escaped character referenced by its decimal or hexadecimal equivalent, like <code>&amp;#160;</code>. When found, <code class="classname">SGMLParser</code> calls <code class="function">handle_charref</code> with the text of the decimal or hexadecimal character equivalent.
</dd>
<dt>Entity reference</dt>
<dd>An <acronym>HTML</acronym> entity, like <code>&amp;copy;</code>. When found, <code class="classname">SGMLParser</code> calls <code class="function">handle_entityref</code> with the name of the <acronym>HTML</acronym> entity.
</dd>
<dt>Comment</dt>
<dd>An <acronym>HTML</acronym> comment, enclosed in <code>&lt;!-- ... --></code>. When found, <code class="classname">SGMLParser</code> calls <code class="function">handle_comment</code> with the body of the comment.
</dd>
<dt>Processing instruction</dt>
<dd>An <acronym>HTML</acronym> processing instruction, enclosed in <code>&lt;? ... ></code>. When found, <code class="classname">SGMLParser</code> calls <code class="function">handle_pi</code> with the body of the processing instruction.
</dd>
<dt>Declaration</dt>
<dd>An <acronym>HTML</acronym> declaration, such as a <code class="sgmltag-element">DOCTYPE</code>, enclosed in <code>&lt;! ... ></code>. When found, <code class="classname">SGMLParser</code> calls <code class="function">handle_decl</code> with the body of the declaration.
</dd>
<dt>Text data</dt>
<dd>A block of text. Anything that doesn't fit into the other 7 categories. When found, <code class="classname">SGMLParser</code> calls <code class="function">handle_data</code> with the text.
</dd>
</dl>
</div><table class="important" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/important.png" alt="Important" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Python 2.0 had a bug where <code class="classname">SGMLParser</code> would not recognize declarations at all (<code class="function">handle_decl</code> would never be called), which meant that <code class="sgmltag-element">DOCTYPE</code>s were silently ignored. This is fixed in Python 2.1.
</td>
</tr>
</table>
<p><code class="filename">sgmllib.py</code> comes with a test suite to illustrate this. You can run <code class="filename">sgmllib.py</code>, passing the name of an <acronym>HTML</acronym> file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing
the <code class="classname">SGMLParser</code> class and defining <code class="function">unknown_starttag</code>, <code class="function">unknown_endtag</code>, <code class="function">handle_data</code> and other methods which simply print their arguments.<table id="tip.commandline.windows" class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In the ActivePython <acronym>IDE</acronym> on Windows, you can specify command line arguments in the &#8220;Run script&#8221; dialog. Separate multiple arguments with spaces.
</td>
</tr>
</table>
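<p>If you are reading this on Python 3, note that <code class="filename">sgmllib</code> was removed from the standard library there. The following is a rough sketch of the same "print what you find" idea using <code class="filename">html.parser</code> instead, which is a swapped-in module rather than the one this chapter describes. The <code class="classname">EventLogger</code> class and <code class="function">log_events</code> function are hypothetical names made up for this illustration. <code class="classname">HTMLParser</code> uses generic <code class="function">handle_starttag</code> and <code class="function">handle_endtag</code> methods in place of <code class="function">unknown_starttag</code> and <code class="function">unknown_endtag</code>.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record every piece the parser hands back."""

    def __init__(self):
        # convert_charrefs=False keeps character and entity references
        # as separate events instead of merging them into text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start tag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end tag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html):
    """Parse an HTML string and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


for event in log_events('<!DOCTYPE html><p class="x">a&amp;b&#160;</p><!-- c -->'):
    print(event)
```

<p>Each of the eight kinds of data shows up as its own tuple, in document order, which is the same structural view of the document that the <code class="filename">sgmllib.py</code> test suite prints.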
<div class="example"><h3>Example 8.4. Sample test of <code class="filename">sgmllib.py</code></h3>
<p>Here is a snippet from the table of contents of the <acronym>HTML</acronym> version of this book. Of course your paths may vary. (If you haven't downloaded the <acronym>HTML</acronym> version of the book, you can do so at <a href="http://diveintopython3.org/">http://diveintopython3.org/</a>.)<pre class="screen">
<samp class="prompt">c:\python23\lib></samp> type "c:\downloads\diveintopython3\html\toc\index.html"
<code>
&lt;!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
&lt;html>
&lt;head>
&lt;meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
&lt;title>Dive Into Python&lt;/title>
&lt;link rel="stylesheet" href="diveintopython3.css" type="text/css">
... rest of file omitted for brevity ...
</code></pre><p>Running this through the test suite of <code class="filename">sgmllib.py</code> yields this output:<pre class="screen">
<samp class="prompt">c:\python23\lib></samp> python sgmllib.py "c:\downloads\diveintopython3\html\toc\index.html"
<samp class="computeroutput">data: '\n\n'
start tag: &lt;html >
data: '\n '
start tag: &lt;head>
data: '\n '
start tag: &lt;meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >
data: '\n \n '
start tag: &lt;title>
data: 'Dive Into Python'
end tag: &lt;/title>
data: '\n '
start tag: &lt;link rel="stylesheet" href="diveintopython3.css" type="text/css" >
data: '\n '
... rest of output omitted for brevity ...
</samp></pre><p>Here's the roadmap for the rest of the chapter:
<div class="itemizedlist">
<ul>
<li>Subclass <code class="classname">SGMLParser</code> to create classes that extract interesting data out of <acronym>HTML</acronym> documents.
<li>Subclass <code class="classname">SGMLParser</code> to create <code class="classname">BaseHTMLProcessor</code>, which overrides all 8 handler methods and uses them to reconstruct the original <acronym>HTML</acronym> from the pieces.
<li>Subclass <code class="classname">BaseHTMLProcessor</code> to create <code class="classname">Dialectizer</code>, which adds some methods to process specific <acronym>HTML</acronym> tags specially, and overrides the <code class="function">handle_data</code> method to provide a framework for processing the text blocks between the <acronym>HTML</acronym> tags.
<li>Subclass <code class="classname">Dialectizer</code> to create classes that define text processing rules used by <code class="function">Dialectizer.handle_data</code>.
<li>Write a test suite that grabs a real web page from <code class="systemitem">http://diveintopython3.org/</code> and processes it.
</ul>
<p>Along the way, you'll also learn about <code class="function">locals</code>, <code class="function">globals</code>, and dictionary-based string formatting.
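<p>To make the "text processing rules" step concrete, here is a minimal standalone sketch of the substitution loop, using the Elmer Fudd rules copied from <code class="classname">FuddDialectizer</code> in <code class="filename">dialect.py</code>. The names <code class="varname">FUDD_SUBS</code> and <code class="function">apply_subs</code> are hypothetical; in the real script the loop lives in <code class="function">Dialectizer.process</code>.

```python
import re

# The Elmer Fudd rules, copied from FuddDialectizer.subs in dialect.py.
FUDD_SUBS = ((r'[rl]', r'w'),
             (r'qu', r'qw'),
             (r'th\b', r'f'),
             (r'th', r'd'),
             (r'n[.]', r'n, uh-hah-hah-hah.'))


def apply_subs(text, subs):
    """Run each (pattern, replacement) pair over the text, in order."""
    for from_pattern, to_pattern in subs:
        text = re.sub(from_pattern, to_pattern, text)
    return text


print(apply_subs("Lists are Python's workhorse datatype.", FUDD_SUBS))
# prints: Lists awe Pydon's wowkhowse datatype.
```

<p>The order of the rules matters: <code>th\b</code> has to run before the general <code>th</code> rule, or every word-final <code>th</code> would already have been turned into <code>d</code>.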
<h2 id="dialect.extract">8.3. Extracting data from <acronym>HTML</acronym> documents</h2>
<p>To extract data from <acronym>HTML</acronym> documents, subclass the <code class="classname">SGMLParser</code> class and define methods for each tag or entity you want to capture.
<p>The first step to extracting data from an <acronym>HTML</acronym> document is getting some <acronym>HTML</acronym>. If you have some <acronym>HTML</acronym> lying around on your hard drive, you can use <a href="#fileinfo.files" title="6.2. Working with File Objects">file functions</a> to read it, but the real fun begins when you get <acronym>HTML</acronym> from live web pages.
<div class="example"><h3 id="dialect.extract.urllib">Example 8.5. Introducing <code class="filename">urllib</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib <img id="dialect.extract.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>sock = urllib.urlopen("http://diveintopython3.org/") <img id="dialect.extract.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>htmlSource = sock.read() <img id="dialect.extract.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>sock.close() <img id="dialect.extract.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print htmlSource <img id="dialect.extract.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="computeroutput">&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">&lt;html>&lt;head>
&lt;meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
&lt;title>Dive Into Python&lt;/title>
&lt;link rel='stylesheet' href='diveintopython3.css' type='text/css'>
&lt;link rev='made' href='mailto:mark@diveintopython3.org'>
&lt;meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
&lt;meta name='description' content='a free Python tutorial for experienced programmers'>
&lt;/head>
&lt;body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
&lt;table cellpadding='0' cellspacing='0' border='0' width='100%'>
&lt;tr>&lt;td class='header' width='1%' valign='top'>diveintopython3.org&lt;/td>
&lt;td width='99%' align='right'>&lt;hr size='1' noshade>&lt;/td>&lt;/tr>
&lt;tr>&lt;td class='tagline' colspan='2'>Python&amp;nbsp;for&amp;nbsp;experienced&amp;nbsp;programmers&lt;/td>&lt;/tr></samp>
[...snip...]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="filename">urllib</code> module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based <acronym>URL</acronym>s (mainly web pages).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The simplest use of <code class="filename">urllib</code> is to retrieve the entire text of a web page using the <code class="function">urlopen</code> function. Opening a <acronym>URL</acronym> is similar to <a href="#fileinfo.files" title="6.2. Working with File Objects">opening a file</a>. The return value of <code class="function">urlopen</code> is a file-like object, which has some of the same methods as a file object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The simplest thing to do with the file-like object returned by <code class="function">urlopen</code> is <code class="function">read</code>, which reads the entire <acronym>HTML</acronym> of the web page into a single string. The object also supports <code class="function">readlines</code>, which reads the text line by line into a list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When you're done with the object, make sure to <code class="function">close</code> it, just like a normal file object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You now have the complete <acronym>HTML</acronym> of the home page of <code class="systemitem">http://diveintopython3.org/</code> in a string, and you're ready to parse it.
</td>
</tr>
</table>
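<p>The session above is written for Python 2. As a hedged sketch for readers on Python 3, where <code class="function">urlopen</code> lives in <code class="filename">urllib.request</code> and <code class="function">read</code> returns bytes rather than a string, the same steps might look like this. The <code class="function">fetch_html</code> name and the fallback to UTF-8 are assumptions made for this illustration.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the body of the URL as text (a Python 3 sketch)."""
    # The with block closes the file-like object for us, like sock.close().
    with urlopen(url) as sock:
        raw = sock.read()  # bytes on Python 3, not str
        # Use the charset the server declared, or assume UTF-8 if none given.
        charset = sock.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

<p>You would call it as <code>htmlSource = fetch_html("http://diveintopython3.org/")</code> and then hand the resulting string to a parser.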
<div class="example"><h3 id="dialect.extract.links">Example 8.6. Introducing <code class="filename">urllister.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
from sgmllib import SGMLParser
class URLLister(SGMLParser):
    def reset(self): <img id="dialect.extract.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs): <img id="dialect.extract.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        href = [v for k, v in attrs if k=='href'] <img id="dialect.extract.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"> <img id="dialect.extract.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        if href:
            self.urls.extend(href)</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">reset</code> is called by the <code class="function">__init__</code> method of <code class="classname">SGMLParser</code>, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization,
do it in <code class="function">reset</code>, not in <code class="function">__init__</code>, so that it will be re-initialized properly when someone re-uses a parser instance.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">start_a</code> is called by <code class="classname">SGMLParser</code> whenever it finds an <code class="sgmltag-element">&lt;a></code> tag. The tag may contain an <code>href</code> attribute, and/or other attributes, like <code>name</code> or <code>title</code>. The <code class="varname">attrs</code> parameter is a list of tuples, <code>[(<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), (<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), ...]</code>. Or it may be just an <code class="sgmltag-element">&lt;a></code>, a valid (if useless) <acronym>HTML</acronym> tag, in which case <code class="varname">attrs</code> would be an empty list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can find out whether this <code class="sgmltag-element">&lt;a></code> tag has an <code>href</code> attribute with a simple <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable</a> <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehension</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">String comparisons like <code>k=='href'</code> are always case-sensitive, but that's safe in this case, because <code class="classname">SGMLParser</code> converts attribute names to lowercase while building <code class="varname">attrs</code>.
</td>
</tr>
</table>
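<p>For comparison, here is a rough Python 3 counterpart, assuming <code class="filename">html.parser</code> as a stand-in for <code class="filename">sgmllib</code> (which Python 3 no longer ships). <code class="classname">LinkLister</code> is a hypothetical name. <code class="classname">HTMLParser</code> has no per-tag <code class="function">start_a</code> hook, so this sketch checks the tag name inside <code class="function">handle_starttag</code> instead.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect href values from <a> tags (hypothetical Python 3 analogue)."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), just as SGMLParser.__init__
        # does, so per-parse state belongs here rather than in __init__.
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so plain == is safe.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


parser = LinkLister()
parser.feed('<a HREF="toc/index.html">TOC</a> <a name="top">top</a> '
            '<a href="#download">Download</a>')
parser.close()
print(parser.urls)
```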
<div class="example"><h3 id="dialect.feed.example">Example 8.7. Using <code class="filename">urllister.py</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib, urllister
<samp class="prompt">>>> </samp>usock = urllib.urlopen("http://diveintopython3.org/")
<samp class="prompt">>>> </samp>parser = urllister.URLLister()
<samp class="prompt">>>> </samp>parser.feed(usock.read()) <img id="dialect.extract.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>usock.close() <img id="dialect.extract.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>parser.close() <img id="dialect.extract.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>for url in parser.urls: print url <img id="dialect.extract.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">toc/index.html
#download
#languages
toc/index.html
appendix/history.html
download/diveintopython3-html-5.0.zip
download/diveintopython3-pdf-5.0.zip
download/diveintopython3-word-5.0.zip
download/diveintopython3-text-5.0.zip
download/diveintopython3-html-flat-5.0.zip
download/diveintopython3-xml-5.0.zip
download/diveintopython3-common-5.0.zip
</samp>
... rest of output omitted for brevity ...</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Call the <code class="function">feed</code> method, defined in <code class="classname">SGMLParser</code>, to get <acronym>HTML</acronym> into the parser.<sup>[<a name="d0e20503" href="#ftn.d0e20503">1</a>]</sup> It takes a string, which is what <code class="function">usock.read()</code> returns.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As with files, you should <code class="function">close</code> your <acronym>URL</acronym> objects as soon as you're done with them.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You should <code class="function">close</code> your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the <code class="function">feed</code> method isn't guaranteed to have actually processed all the <acronym>HTML</acronym> you give it; it may buffer it, waiting for more. Be sure to call <code class="function">close</code> to flush the buffer and force everything to be fully parsed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Once the parser is <code class="function">close</code>d, the parsing is complete, and <code class="varname">parser.urls</code> contains a list of all the linked <acronym>URL</acronym>s in the <acronym>HTML</acronym> document. (Your output may look different, if the download links have been updated by the time you read this.)
</td>
</tr>
</table>
<h2 id="dialect.basehtml">8.4. Introducing <code class="filename">BaseHTMLProcessor.py</code></h2>
<p><code class="classname">SGMLParser</code> doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it
finds, but the methods don't do anything. <code class="classname">SGMLParser</code> is an <acronym>HTML</acronym> <em>consumer</em>: it takes <acronym>HTML</acronym> and breaks it down into small, structured pieces. As you saw in the <a href="#dialect.extract" title="8.3. Extracting data from HTML documents">previous section</a>, you can subclass <code class="classname">SGMLParser</code> to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll
take this one step further by defining a class that catches everything <code class="classname">SGMLParser</code> throws at it and reconstructs the complete <acronym>HTML</acronym> document. In technical terms, this class will be an <acronym>HTML</acronym> <em>producer</em>.
<p><code class="classname">BaseHTMLProcessor</code> subclasses <code class="classname">SGMLParser</code> and provides all 8 essential handler methods: <code class="function">unknown_starttag</code>, <code class="function">unknown_endtag</code>, <code class="function">handle_charref</code>, <code class="function">handle_entityref</code>, <code class="function">handle_comment</code>, <code class="function">handle_pi</code>, <code class="function">handle_decl</code>, and <code class="function">handle_data</code>.
<div class="example"><h3 id="dialect.basehtml.intro">Example 8.8. Introducing <code class="classname">BaseHTMLProcessor</code></h3><pre class="programlisting">
class BaseHTMLProcessor(SGMLParser):
    def reset(self): <img id="dialect.basehtml.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs): <img id="dialect.basehtml.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("&lt;%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag): <img id="dialect.basehtml.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        self.pieces.append("&lt;/%(tag)s>" % locals())

    def handle_charref(self, ref): <img id="dialect.basehtml.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        self.pieces.append("&amp;#%(ref)s;" % locals())

    def handle_entityref(self, ref): <img id="dialect.basehtml.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
        self.pieces.append("&amp;%(ref)s" % locals())
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text): <img id="dialect.basehtml.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
        self.pieces.append(text)

    def handle_comment(self, text): <img id="dialect.basehtml.1.7" src="images/callouts/7.png" alt="7" border="0" width="12" height="12">
        self.pieces.append("&lt;!--%(text)s-->" % locals())

    def handle_pi(self, text): <img id="dialect.basehtml.1.8" src="images/callouts/8.png" alt="8" border="0" width="12" height="12">
        self.pieces.append("&lt;?%(text)s>" % locals())

    def handle_decl(self, text):
        self.pieces.append("&lt;!%(text)s>" % locals())</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">reset</code>, called by <code class="function">SGMLParser.__init__</code>, initializes <code class="varname">self.pieces</code> as an empty list before <a href="#fileinfo.init.code.example" title="Example 5.6. Coding the FileInfo Class">calling the ancestor method</a>. <code class="varname">self.pieces</code> is a <a href="#fileinfo.userdict.init.example" title="Example 5.9. Defining the UserDict Class">data attribute</a> which will hold the pieces of the <acronym>HTML</acronym> document you're constructing. Each handler method will reconstruct the <acronym>HTML</acronym> that <code class="classname">SGMLParser</code> parsed, and each method will append that string to <code class="varname">self.pieces</code>. Note that <code class="varname">self.pieces</code> is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but
Python is much more efficient at dealing with lists.<sup>[<a name="d0e20702" href="#ftn.d0e20702">2</a>]</sup></td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <code class="classname">BaseHTMLProcessor</code> does not define any methods for specific tags (like the <code class="function">start_a</code> method in <a href="#dialect.extract.links" title="Example 8.6. Introducing urllister.py"><code class="classname">URLLister</code></a>), <code class="classname">SGMLParser</code> will call <code class="function">unknown_starttag</code> for every start tag. This method takes the tag (<code class="varname">tag</code>) and the list of attribute name/value pairs (<code class="varname">attrs</code>), reconstructs the original <acronym>HTML</acronym>, and appends it to <code class="varname">self.pieces</code>. The <a href="#odbchelper.stringformatting" title="3.5. Formatting Strings">string formatting</a> here is a little strange; you'll untangle that (and also the odd-looking <code class="function">locals</code> function) later in this chapter.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Reconstructing end tags is much simpler; just take the tag name and wrap it in the <code>&lt;/...></code> brackets.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When <code class="classname">SGMLParser</code> finds a character reference, it calls <code class="function">handle_charref</code> with the bare reference. If the <acronym>HTML</acronym> document contains the reference <code>&amp;#160;</code>, <code class="varname">ref</code> will be <code>160</code>. Reconstructing the original complete character reference just involves wrapping <code class="varname">ref</code> in <code>&amp;#...;</code> characters.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference
requires wrapping <code class="varname">ref</code> in <code>&amp;...;</code> characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard
<acronym>HTML</acronym> entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard <acronym>HTML</acronym> entities is defined in a dictionary in a Python module called <code class="filename">htmlentitydefs</code>. Hence the extra <code>if</code> statement.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Blocks of text are simply appended to <code class="varname">self.pieces</code> unaltered.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.7"><img src="images/callouts/7.png" alt="7" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><acronym>HTML</acronym> comments are wrapped in <code>&lt;!--...--></code> characters.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.1.8"><img src="images/callouts/8.png" alt="8" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Processing instructions are wrapped in <code>&lt;?...></code> characters.
</td>
</tr>
</table>
</div><table class="important" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/important.png" alt="Important" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">The <acronym>HTML</acronym> specification requires that all non-<acronym>HTML</acronym> content (like client-side JavaScript) be enclosed in <acronym>HTML</acronym> comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). <code class="classname">BaseHTMLProcessor</code> is not forgiving; if a script is improperly embedded, it will be parsed as if it were <acronym>HTML</acronym>. For instance, if the script contains less-than and equals signs, <code class="classname">SGMLParser</code> may incorrectly think that it has found tags and attributes. <code class="classname">SGMLParser</code> always converts tag and attribute names to lowercase, which may break the script, and <code class="classname">BaseHTMLProcessor</code> always encloses attribute values in double quotes (even if the original <acronym>HTML</acronym> document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script
within <acronym>HTML</acronym> comments.
</td>
</tr>
</table>
<div class="example"><h3 id="dialect.output.example">Example 8.9. <code class="classname">BaseHTMLProcessor</code> output</h3><pre class="programlisting">
    def output(self): <img id="dialect.basehtml.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        """Return processed HTML as a single string"""
        return "".join(self.pieces) <img id="dialect.basehtml.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the one method in <code class="classname">BaseHTMLProcessor</code> that is never called by the ancestor <code class="classname">SGMLParser</code>. Since the other handler methods store their reconstructed <acronym>HTML</acronym> in <code class="varname">self.pieces</code>, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If you prefer, you could use the <code class="function">join</code> method of the <code class="filename">string</code> module instead: <code>string.join(self.pieces, "")</code></td>
</tr>
</table>
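<p>For readers on Python 3, where <code class="filename">sgmllib</code> no longer exists, the same consume-then-reproduce idea can be sketched with the standard <code>html.parser.HTMLParser</code> class. This is only an illustrative analogue, not the book's <code class="classname">BaseHTMLProcessor</code>: the class name <code>RoundTripParser</code> and the sample markup are made up here, and Python 3's parser calls <code>handle_starttag</code> and <code>handle_endtag</code> where <code class="classname">SGMLParser</code> calls <code class="function">unknown_starttag</code> and <code class="function">unknown_endtag</code>.

```python
from html.parser import HTMLParser


class RoundTripParser(HTMLParser):
    """Hypothetical Python 3 analogue of BaseHTMLProcessor (a sketch only)."""

    def __init__(self):
        # convert_charrefs=False makes the parser report &amp; and &#160;
        # through handle_entityref / handle_charref instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. Valueless attributes arrive
        # with value None; this sketch does not treat them specially.
        strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)


parser = RoundTripParser()
parser.feed('<p class="x">a &amp; b&#160;<!-- c --></p>')
parser.close()  # flush anything the parser is still buffering
result = parser.output()
```

<p>As in the original class, each handler appends one reconstructed piece to a list, and the pieces are joined into a single string only when <code>output</code> is called.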
<div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li><a href="http://www.w3.org/">W3C</a> discusses <a href="http://www.w3.org/TR/REC-html40/charset.html#entities">character and entity references</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> confirms your suspicions that <a href="http://www.python.org/doc/current/lib/module-htmlentitydefs.html">the <code class="filename">htmlentitydefs</code> module</a> is exactly what it sounds like.
</ul>
<h2 id="dialect.locals">8.5. <code class="function">locals</code> and <code class="function">globals</code></h2>
<p>Let's digress from <acronym>HTML</acronym> processing for a minute and talk about how Python handles variables. Python has two built-in functions, <code class="function">locals</code> and <code class="function">globals</code>, which provide dictionary-based access to local and global variables.
<p>Remember <code class="function">locals</code>? You first saw it here:
<div class="informalexample"><pre class="programlisting">
    def unknown_starttag(self, tag, attrs):
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("&lt;%(tag)s%(strattrs)s>" % locals())
</pre><p>No, wait, you can't learn about <code class="function">locals</code> yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention.
<p>Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names
of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.
<p>At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which
keeps track of the function's variables, including function arguments and locally defined variables. Each module has its
own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any
other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any
module, which holds built-in functions and exceptions.
<p>When a line of code asks for the value of a variable <code class="varname">x</code>, Python will search for that variable in all the available namespaces, in order:
<div class="orderedlist">
<ol>
<li>local namespace - specific to the current function or class method. If the function defines a local variable <code class="varname">x</code>, or has an argument <code class="varname">x</code>, Python will use this and stop searching.
<li>global namespace - specific to the current module. If the module has defined a variable, function, or class called <code class="varname">x</code>, Python will use that and stop searching.
<li>built-in namespace - global to all modules. As a last resort, Python will assume that <code class="varname">x</code> is the name of a built-in function or variable.
</ol>
<p>If Python doesn't find <code class="varname">x</code> in any of these namespaces, it gives up and raises a <code class="errorcode">NameError</code> with the message <code class="errorname">There is no variable named 'x'</code>, which you saw back in <a href="#odbchelper.unboundvariable" title="Example 3.18. Referencing an Unbound Variable">Example 3.18, &#8220;Referencing an Unbound Variable&#8221;</a>, but you didn't appreciate how much work Python was doing before giving you that error.<table class="important" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/important.png" alt="Important" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a <a href="#fileinfo.nested" title="Example 6.21. listDirectory">nested function</a> or <a href="#apihelper.lambda" title="4.7. Using lambda Functions"><code>lambda</code> function</a>, Python will search for that variable in the current (nested or <code>lambda</code>) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or <code>lambda</code>) function's namespace, <em>then in the parent function's namespace</em>, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:<pre class="programlisting">
from __future__ import nested_scopes</pre></td>
</tr>
</table>
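<p>The search order described above can be watched in action with a small sketch. All of the names here (<code>x</code>, <code>read_global</code>, <code>outer</code>, <code>inner</code>, <code>use_builtin</code>, <code>missing</code>) are made up for illustration, and the code assumes nested scopes are active and uses the <code>except ... as</code> syntax of Python 2.6 and later.

```python
x = "module-level x"


def read_global():
    # No local x here, so the lookup falls through to the module namespace.
    return x


def outer():
    x = "outer's x"

    def inner():
        # inner has no local x; with nested scopes, the parent function's
        # namespace is searched before the module's namespace.
        return x

    return inner()


def use_builtin():
    # len is neither local nor module-level, so it is found in the
    # built-in namespace.
    return len("abc")


def missing():
    # A name found in no namespace at all raises NameError.
    try:
        return undefined_name
    except NameError as error:
        return "NameError: %s" % error
```

<p>Calling <code>outer()</code> returns the parent function's value rather than the module-level one, which is exactly the nested-scopes behavior the note describes.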
<p>Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are <em>directly accessible at run-time</em>. How? Well, the local namespace is accessible via the built-in <code class="function">locals</code> function, and the global (module level) namespace is accessible via the built-in <code class="function">globals</code> function.
<div class="example"><h3>Example 8.10. Introducing <code class="function">locals</code></h3><pre class="screen"><samp class="prompt">>>> </samp>def foo(arg): <img id="dialect.locals.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>    x = 1
<samp class="prompt">... </samp>    print locals()
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>foo(7) <img id="dialect.locals.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
{'arg': 7, 'x': 1}
<samp class="prompt">>>> </samp>foo('bar') <img id="dialect.locals.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
{'arg': 'bar', 'x': 1}</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The function <code class="function">foo</code> has two variables in its local namespace: <code class="varname">arg</code>, whose value is passed in to the function, and <code class="varname">x</code>, which is defined within the function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">locals</code> returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values
of the dictionary are the actual values of the variables. So calling <code class="function">foo</code> with <code>7</code> prints the dictionary containing the function's two local variables: <code class="varname">arg</code> (<code>7</code>) and <code class="varname">x</code> (<code class="constant">1</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember, Python has dynamic typing, so you could just as easily pass a string in for <code class="varname">arg</code>; the function (and the call to <code class="function">locals</code>) would still work just as well. <code class="function">locals</code> works with all variables of all datatypes.
</td>
</tr>
</table>
<p>What <code class="function">locals</code> does for the local (function) namespace, <code class="function">globals</code> does for the global (module) namespace. <code class="function">globals</code> is more exciting, though, because a module's namespace is more exciting.<sup>[<a name="d0e21226" href="#ftn.d0e21226">3</a>]</sup> Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes
defined in the module. Plus, it includes anything that was imported into the module.
<p>Remember the difference between <a href="#fileinfo.fromimport" title="5.2. Importing Modules Using from module import"><code>from <i class="replaceable">module</i> import</code></a> and <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's doc string"><code>import <i class="replaceable">module</i></code></a>? With <code>import <i class="replaceable">module</i></code>, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access
any of its functions or attributes: <code><i class="replaceable">module</i>.<i class="replaceable">function</i></code>. But with <code>from <i class="replaceable">module</i> import</code>, you're actually importing specific functions and attributes from another module into your own namespace, which is why you
access them directly without referencing the original module they came from. With the <code class="function">globals</code> function, you can actually see this happen.
<div class="example"><h3 id="dialect.globals.example">Example 8.11. Introducing <code class="function">globals</code></h3>
<p>Look at the following block of code at the bottom of <code class="filename">BaseHTMLProcessor.py</code>:<pre class="programlisting">
if __name__ == "__main__":
    for k, v in globals().items(): <img id="dialect.locals.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        print k, "=", v</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Just so you don't get intimidated, remember that you've seen all this before. The <code class="function">globals</code> function returns a dictionary, and you're <a href="#dictionaryiter.example" title="Example 6.10. Iterating Through a Dictionary">iterating through the dictionary</a> using the <code class="function">items</code> method and <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a>. The only thing new here is the <code class="function">globals</code> function.
</td>
</tr>
</table>
<p>Now running the script from the command line gives this output (note that your output may be slightly different, depending
on your platform and where you installed Python):<pre class="screen"><samp class="prompt">c:\docbook\dip\py></samp> python BaseHTMLProcessor.py</pre><pre class="programlisting">
SGMLParser = sgmllib.SGMLParser <img id="dialect.locals.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
htmlentitydefs = &lt;module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'> <img id="dialect.locals.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
BaseHTMLProcessor = __main__.BaseHTMLProcessor <img id="dialect.locals.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
__name__ = __main__ <img id="dialect.locals.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
... rest of output omitted for brevity...</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="classname">SGMLParser</code> was imported from <code class="filename">sgmllib</code>, using <code>from <i class="replaceable">module</i> import</code>. That means that it was imported directly into the module's namespace, and here it is.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Contrast this with <code class="filename">htmlentitydefs</code>, which was imported using <code>import</code>. That means that the <code class="filename">htmlentitydefs</code> module itself is in the namespace, but the <code class="varname">entitydefs</code> variable defined within <code class="filename">htmlentitydefs</code> is not.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This module only defines one class, <code class="classname">BaseHTMLProcessor</code>, and here it is. Note that the value here is <a href="#fileinfo.classattributes.intro" title="Example 5.17. Introducing Class Attributes">the class itself</a>, not a specific instance of the class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember the <a href="#odbchelper.ifnametrick"><code>if __name__</code> trick</a>? When running a module (as opposed to importing it from another module), the built-in <code>__name__</code> attribute is a special value, <code>__main__</code>. Since you ran this module as a script from the command line, <code>__name__</code> is <code>__main__</code>, which is why the little test code to print the <code class="function">globals</code> got executed.
</td>
</tr>
</table>
</div><table id="tip.localsbyname" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Using the <code class="function">locals</code> and <code class="function">globals</code> functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors
the functionality of the <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code class="function">getattr</code></a> function, which allows you to access arbitrary functions dynamically by providing the function name as a string.
</td>
</tr>
</table>
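<p>As a rough sketch of what this note means in practice, the following looks up module-level names from strings and shows the difference between the two styles of import. The helper names <code>lookup</code> and <code>is_global</code> are hypothetical, and <code>math</code> is used only because it is a convenient standard module.

```python
import math
from math import sqrt


def lookup(name):
    # globals() is the module namespace as a dictionary, so a variable
    # name given as a string works as a key.
    return globals()[name]


def is_global(name):
    return name in globals()


root = lookup("sqrt")(16)          # the function imported with from ... import
module = lookup("math")            # the module object itself
pi_by_name = getattr(module, "pi") # attributes of the module need getattr
```

<p>Here <code>sqrt</code> and <code>math</code> are both keys in the module's namespace, but <code>pi</code> is not: it lives inside the <code>math</code> module and has to be reached through it.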
<p>There is one other important difference between the <code class="function">locals</code> and <code class="function">globals</code> functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning
it.
<div class="example"><h3 id="dialect.locals.readonly.example">Example 8.12. <code class="function">locals</code> is read-only, <code class="function">globals</code> is not</h3><pre class="programlisting">
def foo(arg):
    x = 1
    print locals() <img id="dialect.locals.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    locals()["x"] = 2 <img id="dialect.locals.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    print "x=",x <img id="dialect.locals.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
z = 7
print "z=",z
foo(3)
globals()["z"] = 8 <img id="dialect.locals.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
print "z=",z <img id="dialect.locals.4.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <code class="function">foo</code> is called with <code>3</code>, this will print <code>{'arg': 3, 'x': 1}</code>. This should not be a surprise.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">locals</code> is a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this
would change the value of the local variable <code class="varname">x</code> to <code>2</code>, but it doesn't. <code class="function">locals</code> does not actually return the local namespace; it returns a copy. So changing the copy does nothing to the value of the variables
in the local namespace.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This prints <code>x= 1</code>, not <code>x= 2</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">After being burned by <code class="function">locals</code>, you might think that this <em>wouldn't</em> change the value of <code class="varname">z</code>, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), <code class="function">globals</code> returns the actual global namespace, not a copy: the exact opposite behavior of <code class="function">locals</code>. So any changes to the dictionary returned by <code class="function">globals</code> directly affect your global variables.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.locals.4.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This prints <code>z= 8</code>, not <code>z= 7</code>.
</td>
</tr>
</table>
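<p>The same experiment can be written so that each result is returned instead of printed, which makes it easy to check under Python 3 as well. This is a sketch assuming CPython, where writing to the dictionary returned by <code class="function">locals</code> inside a function leaves the real local variable untouched; the function names are made up for illustration.

```python
z = 7


def try_to_change_local():
    x = 1
    locals()["x"] = 2   # changes only the dictionary that locals() returned
    return x            # the real local variable is unaffected


def change_global():
    globals()["z"] = 8  # globals() is the actual module namespace


def read_z():
    return z


local_result = try_to_change_local()
before = read_z()
change_global()
after = read_z()
```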
<h2 id="dialect.dictsub">8.6. Dictionary-based string formatting</h2>
<p>Why did you learn about <code class="function">locals</code> and <code class="function">globals</code>? So you can learn about dictionary-based string formatting. As you recall, <a href="#odbchelper.stringformatting" title="3.5. Formatting Strings">regular string formatting</a> provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in
place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple
values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're
constantly switching between reading the string and reading the tuple of values.
<p>There is an alternative form of string formatting that uses dictionaries instead of tuples of values.
<div class="example"><h3>Example 8.13. Introducing dictionary-based string formatting</h3><pre class="screen">
<samp class="prompt">>>> </samp>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
<samp class="prompt">>>> </samp>"%(pwd)s" % params<img id="dialect.dictsub.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'secret'
<samp class="prompt">>>> </samp>"%(pwd)s is not a good password for %(uid)s" % params <img id="dialect.dictsub.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'secret is not a good password for sa'
<samp class="prompt">>>> </samp>"%(database)s of mind, %(database)s of body" % params <img id="dialect.dictsub.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'master of mind, master of body'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dictsub.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Instead of a tuple of explicit values, this form of string formatting uses a dictionary, <code class="varname">params</code>. And instead of a simple <code>%s</code> marker in the string, the marker contains a name in parentheses. This name is used as a key in the <code class="varname">params</code> dictionary, and the corresponding value, <code>secret</code>, is substituted in place of the <code>%(pwd)s</code> marker.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dictsub.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the
formatting will fail with a <code class="errorcode">KeyError</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dictsub.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can even specify the same key twice; each occurrence will be replaced with the same value.</td>
</tr>
</table>
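<p>As a minimal sketch of the <code class="errorcode">KeyError</code> behavior described above: the <code class="varname">params</code> dictionary is the one from the example, while the misspelled key <code>password</code> and the variable names <code class="varname">greeting</code> and <code class="varname">missing_key</code> are hypothetical, added only for illustration.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# Every name in parentheses is looked up as a key in the dictionary.
greeting = "%(uid)s logs in to %(server)s" % params

# A name that is not a key makes the whole formatting expression fail.
try:
    "%(password)s" % params
    missing_key = None
except KeyError as error:
    # The exception carries the key that could not be found.
    missing_key = error.args[0]
```

<p>Catching the exception is optional; the point is only that a missing key is reported loudly rather than silently replaced with an empty string.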
<p>So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys
and values simply to do string formatting in the next line; it's really most useful when you happen to have a dictionary of
meaningful keys and values already. Like <a href="#dialect.locals" title="8.5. locals and globals"><code class="function">locals</code></a>.
<div class="example"><h3 id="dialect.unknownstarttag">Example 8.14. Dictionary-based string formatting in <code class="filename">BaseHTMLProcessor.py</code></h3><pre class="programlisting">
    def handle_comment(self, text):
        self.pieces.append("&lt;!--%(text)s-->" % locals()) <img id="dialect.dictsub.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dictsub.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using the built-in <code class="function">locals</code> function is the most common use of dictionary-based string formatting. It means that you can use the names of local variables
within your string (in this case, <code class="varname">text</code>, which was passed to the method as an argument) and each named variable will be replaced by its value. If <code class="varname">text</code> is <code>'Begin page footer'</code>, the string formatting <code>"&lt;!--%(text)s-->" % locals()</code> will resolve to the string <code>'&lt;!--Begin page footer-->'</code>.
</td>
</tr>
</table>
<div class="example"><h3>Example 8.15. More dictionary-based string formatting</h3><pre class="programlisting">
    def unknown_starttag(self, tag, attrs):
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) <img id="dialect.dictsub.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        self.pieces.append("&lt;%(tag)s%(strattrs)s>" % locals()) <img id="dialect.dictsub.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dictsub.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When this method is called, <code class="varname">attrs</code> is a list of key/value tuples, just like the <a href="#odbchelper.items" title="Example 3.25. The keys, values, and items Functions"><code class="function">items</code> of a dictionary</a>, which means you can use <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a> to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down:
<div class="orderedlist">
<ol type="a">
<li>Suppose <code class="varname">attrs</code> is <code>[('href', 'index.html'), ('title', 'Go to home page')]</code>.
<li>In the first round of the list comprehension, <code class="varname">key</code> will get <code>'href'</code>, and <code class="varname">value</code> will get <code>'index.html'</code>.
<li>The string formatting <code>' %s="%s"' % (key, value)</code> will resolve to <code>' href="index.html"'</code>. This string becomes the first element of the list comprehension's return value.
<li>In the second round, <code class="varname">key</code> will get <code>'title'</code>, and <code class="varname">value</code> will get <code>'Go to home page'</code>.
<li>The string formatting will resolve to <code>' title="Go to home page"'</code>.
<li>The list comprehension returns a list of these two resolved strings, and <code class="varname">strattrs</code> will join both elements of this list together to form <code>' href="index.html" title="Go to home page"'</code>.
</ol>
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dictsub.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now, using dictionary-based string formatting, you insert the value of <code class="varname">tag</code> and <code class="varname">strattrs</code> into a string. So if <code class="varname">tag</code> is <code>'a'</code>, the final result would be <code>'&lt;a href="index.html" title="Go to home page">'</code>, and that is what gets appended to <code class="varname">self.pieces</code>.
</td>
</tr>
</table>
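<p>The walk-through above can be reproduced outside the class. This is a sketch only: <code class="function">build_start_tag</code> is a hypothetical standalone function, not a method of <code class="classname">BaseHTMLProcessor</code>, and it returns the string instead of appending it to <code class="varname">self.pieces</code>.

```python
def build_start_tag(tag, attrs):
    # attrs is a list of (key, value) tuples, as in the walk-through above.
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # locals() supplies both tag and strattrs to the named markers.
    return "<%(tag)s%(strattrs)s>" % locals()


attrs = [("href", "index.html"), ("title", "Go to home page")]
start_tag = build_start_tag("a", attrs)
empty_tag = build_start_tag("br", [])  # no attributes: strattrs is ''
```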
</div><table class="important" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/important.png" alt="Important" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Using dictionary-based string formatting with <code class="function">locals</code> is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a
slight performance hit in making the call to <code class="function">locals</code>, since <a href="#dialect.locals.readonly.example" title="Example 8.12. locals is read-only, globals is not"><code class="function">locals</code> builds a copy</a> of the local namespace.
</td>
</tr>
</table>
<h2 id="dialect.quoting">8.7. Quoting attribute values</h2>
<p>A common question on <a href="http://groups.google.com/groups?group=comp.lang.python">comp.lang.python</a> is &#8220;I have a bunch of <acronym>HTML</acronym> documents with unquoted attribute values, and I want to properly quote them all. How can I do this?&#8221;<sup>[<a name="d0e21764" href="#ftn.d0e21764">4</a>]</sup> (This is generally precipitated by a project manager who has found the <acronym>HTML</acronym>-is-a-standard religion joining a large project and proclaiming that all pages must validate against an <acronym>HTML</acronym> validator. Unquoted attribute values are a common violation of the <acronym>HTML</acronym> standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding <acronym>HTML</acronym> through <code class="classname">BaseHTMLProcessor</code>.
<p><code class="classname">BaseHTMLProcessor</code> consumes <acronym>HTML</acronym> (since it's descended from <code class="classname">SGMLParser</code>) and produces equivalent <acronym>HTML</acronym>, but the <acronym>HTML</acronym> output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase
or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes
at all. It is this last side effect that you can take advantage of.
<div class="example"><h3 id="dialect.quoting.example">Example 8.16. Quoting attribute values</h3><pre class="screen">
<samp class="prompt">>>> </samp>htmlSource = """ <img id="dialect.basehtml.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>&lt;html>
<samp class="prompt">... </samp>&lt;head>
<samp class="prompt">... </samp>&lt;title>Test page&lt;/title>
<samp class="prompt">... </samp>&lt;/head>
<samp class="prompt">... </samp>&lt;body>
<samp class="prompt">... </samp>&lt;ul>
<samp class="prompt">... </samp>&lt;li>&lt;a href=index.html>Home&lt;/a>&lt;/li>
<samp class="prompt">... </samp>&lt;li>&lt;a href=toc.html>Table of contents&lt;/a>&lt;/li>
<samp class="prompt">... </samp>&lt;li>&lt;a href=history.html>Revision history&lt;/a>&lt;/li>
<samp class="prompt">... </samp>&lt;/body>
<samp class="prompt">... </samp>&lt;/html>
<samp class="prompt">... </samp>"""
<samp class="prompt">>>> </samp>from BaseHTMLProcessor import BaseHTMLProcessor
<samp class="prompt">>>> </samp>parser = BaseHTMLProcessor()
<samp class="prompt">>>> </samp>parser.feed(htmlSource) <img id="dialect.basehtml.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print parser.output() <img id="dialect.basehtml.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">&lt;html>
&lt;head>
&lt;title>Test page&lt;/title>
&lt;/head>
&lt;body>
&lt;ul>
&lt;li>&lt;a href="index.html">Home&lt;/a>&lt;/li>
&lt;li>&lt;a href="toc.html">Table of contents&lt;/a>&lt;/li>
&lt;li>&lt;a href="history.html">Revision history&lt;/a>&lt;/li>
&lt;/body>
&lt;/html></samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that the attribute values of the <code>href</code> attributes in the <code class="sgmltag-element">&lt;a></code> tags are not properly quoted. (Also note that you're using <a href="#odbchelper.triplequotes" title="Example 2.2. Defining the buildConnectionString Function's doc string">triple quotes</a> for something other than a <code>doc string</code>. And directly in the <acronym>IDE</acronym>, no less. They're very useful.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Feed the parser.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.basehtml.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using the <code class="function">output</code> function defined in <code class="classname">BaseHTMLProcessor</code>, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think
about how much has actually happened here: <code class="classname">SGMLParser</code> parsed the entire <acronym>HTML</acronym> document, breaking it down into tags, refs, data, and so forth; <code class="classname">BaseHTMLProcessor</code> used those elements to reconstruct pieces of <acronym>HTML</acronym> (which are still stored in <code class="varname">parser.pieces</code>, if you want to see them); finally, you called <code class="function">parser.output</code>, which joined all the pieces of <acronym>HTML</acronym> into one string.
</td>
</tr>
</table>
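<p>The <code class="filename">sgmllib</code> module that <code class="classname">SGMLParser</code> comes from is not available in Python 3. If you want to experiment with the same idea there, the following is a rough sketch that swaps in the standard library's <code class="classname">html.parser.HTMLParser</code> instead. The class name <code class="classname">QuotingProcessor</code> is hypothetical, and this is not the book's <code class="classname">BaseHTMLProcessor</code>: it handles only start tags, end tags, and text, and it does not deal with comments, entities, or attributes that have no value.

```python
from html.parser import HTMLParser


class QuotingProcessor(HTMLParser):
    """Hypothetical, simplified stand-in for BaseHTMLProcessor."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples with quotes removed,
        # so rebuilding the tag always writes double-quoted values.
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, text):
        self.pieces.append(text)

    def output(self):
        return "".join(self.pieces)


parser = QuotingProcessor()
parser.feed("<UL><LI><A HREF=index.html>Home</A></LI></UL>")
parser.close()
quoted = parser.output()
```

<p>As with <code class="classname">BaseHTMLProcessor</code>, tag and attribute names come out in lowercase and the attribute value gains double quotes.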
<h2 id="dialect.dialectizer">8.8. Introducing <code class="filename">dialect.py</code></h2>
<p><code class="classname">Dialectizer</code> is a simple (and silly) descendant of <code class="classname">BaseHTMLProcessor</code>. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <code><code class="sgmltag-element">&lt;pre></code>...<code class="sgmltag-element">&lt;/pre></code></code> block passes through unaltered.
<p>To handle the <code class="sgmltag-element">&lt;pre></code> blocks, you define two methods in <code class="classname">Dialectizer</code>: <code class="function">start_pre</code> and <code class="function">end_pre</code>.
<div class="example"><h3 id="dialect.specifictags.example">Example 8.17. Handling specific tags</h3><pre class="programlisting">
    def start_pre(self, attrs): <img id="dialect.dialectizer.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        self.verbatim += 1<img id="dialect.dialectizer.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        self.unknown_starttag("pre", attrs) <img id="dialect.dialectizer.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
    def end_pre(self): <img id="dialect.dialectizer.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        self.unknown_endtag("pre") <img id="dialect.dialectizer.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
        self.verbatim -= 1<img id="dialect.dialectizer.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">start_pre</code> is called every time <code class="classname">SGMLParser</code> finds a <code class="sgmltag-element">&lt;pre></code> tag in the <acronym>HTML</acronym> source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, <code class="varname">attrs</code>, which contains the attributes of the tag (if any). <code class="varname">attrs</code> is a list of key/value tuples, the same format that <a href="#dialect.unknownstarttag" title="Example 8.14. Dictionary-based string formatting in BaseHTMLProcessor.py"><code class="function">unknown_starttag</code></a> takes.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In the <code class="function">reset</code> method, you initialize a data attribute that serves as a counter for <code class="sgmltag-element">&lt;pre></code> tags. Every time you hit a <code class="sgmltag-element">&lt;pre></code> tag, you increment the counter; every time you hit a <code class="sgmltag-element">&lt;/pre></code> tag, you'll decrement the counter. (You could just use this as a flag and set it to <code class="constant">1</code> and reset it to <code class="constant">0</code>, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <code class="sgmltag-element">&lt;pre></code> tags.) In a minute, you'll see how this counter is put to good use.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">That's it, that's the only special processing you do for <code class="sgmltag-element">&lt;pre></code> tags. Now you pass the list of attributes along to <code class="function">unknown_starttag</code> so it can do the default processing.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">end_pre</code> is called every time <code class="classname">SGMLParser</code> finds a <code class="sgmltag-element">&lt;/pre></code> tag. Since end tags cannot contain attributes, the method takes no parameters.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, you want to do the default processing, just like any other end tag.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Second, you decrement your counter to signal that this <code class="sgmltag-element">&lt;pre></code> block has been closed.
</td>
</tr>
</table>
<p>At this point, it's worth digging a little further into <code class="classname">SGMLParser</code>. I've claimed repeatedly (and you've taken it on faith so far) that <code class="classname">SGMLParser</code> looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of <code class="function">start_pre</code> and <code class="function">end_pre</code> to handle <code class="sgmltag-element">&lt;pre></code> and <code class="sgmltag-element">&lt;/pre></code>. But how does this happen? Well, it's not magic, it's just good Python coding.
<div class="example"><h3 id="dialect.dialectizer.example">Example 8.18. <code class="classname">SGMLParser</code></h3><pre class="programlisting">
    def finish_starttag(self, tag, attrs): <img id="dialect.dialectizer.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        try:
            method = getattr(self, 'start_' + tag) <img id="dialect.dialectizer.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        except AttributeError: <img id="dialect.dialectizer.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
            try:
                method = getattr(self, 'do_' + tag) <img id="dialect.dialectizer.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
            except AttributeError:
                self.unknown_starttag(tag, attrs) <img id="dialect.dialectizer.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
                return -1
            else:
                self.handle_starttag(tag, method, attrs) <img id="dialect.dialectizer.2.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
                return 0
        else:
            self.stack.append(tag)
            self.handle_starttag(tag, method, attrs)
            return 1 <img id="dialect.dialectizer.2.7" src="images/callouts/7.png" alt="7" border="0" width="12" height="12">
    def handle_starttag(self, tag, method, attrs):
        method(attrs)<img id="dialect.dialectizer.2.8" src="images/callouts/8.png" alt="8" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">At this point, <code class="classname">SGMLParser</code> has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a
specific handler method for this tag, or whether you should fall back on the default method (<code class="function">unknown_starttag</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The &#8220;magic&#8221; of <code class="classname">SGMLParser</code> is nothing more than your old friend, <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code class="function">getattr</code></a>. What you may not have realized before is that <code class="function">getattr</code> looks up the attribute on the actual object you give it, so it finds methods defined in that object's own class as well as methods inherited from its ancestors. Here the object is <code>self</code>, the current instance, so even though this code lives in <code class="classname">SGMLParser</code>, it will find methods defined in a descendant class. If <code class="varname">tag</code> is <code>'pre'</code>, this call to <code class="function">getattr</code> will look for a <code class="function">start_pre</code> method on the current instance, which is an instance of the <code class="classname">Dialectizer</code> class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">getattr</code> raises an <code class="errorcode">AttributeError</code> if the method it's looking for doesn't exist on the object (that is, in the object's class or any of its ancestor classes), but that's okay, because you wrapped
the call to <code class="function">getattr</code> inside a <a href="#fileinfo.exception" title="6.1. Handling Exceptions"><code>try...except</code></a> block and explicitly caught the <code class="errorcode">AttributeError</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since you didn't find a <code class="function">start_xxx</code> method, you'll also look for a <code class="function">do_xxx</code> method before giving up. This alternate naming scheme is generally used for standalone tags, like <code class="sgmltag-element">&lt;br></code>, which have no corresponding end tag. But you can use either naming scheme; as you can see, <code class="classname">SGMLParser</code> tries both for every tag. (You shouldn't define both a <code class="function">start_xxx</code> and <code class="function">do_xxx</code> handler method for the same tag, though; only the <code class="function">start_xxx</code> method will get called.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Another <code class="errorcode">AttributeError</code>, which means that the call to <code class="function">getattr</code> failed with <code class="function">do_xxx</code>. Since you found neither a <code class="function">start_xxx</code> nor a <code class="function">do_xxx</code> method for this tag, you catch the exception and fall back on the default method, <code class="function">unknown_starttag</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember, <code>try...except</code> blocks can have an <code>else</code> clause, which is executed if <a href="#crossplatform.example" title="Example 6.2. Supporting Platform-Specific Functionality">no exception is raised</a> in the <code>try</code> block. Logically, that means that you <em>did</em> find a <code class="function">do_xxx</code> method for this tag, so you're going to call it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.7"><img src="images/callouts/7.png" alt="7" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">By the way, don't worry about these different return values; in theory they mean something, but they're never actually used.
Don't worry about the <code>self.stack.append(tag)</code> either; <code class="classname">SGMLParser</code> keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this
information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably
not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.2.8"><img src="images/callouts/8.png" alt="8" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">start_xxx</code> and <code class="function">do_xxx</code> methods are not called directly; the tag, method, and attributes are passed to this function, <code class="function">handle_starttag</code>, so that descendants can override it and change the way <em>all</em> start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call
the method (<code class="function">start_xxx</code> or <code class="function">do_xxx</code>) with the list of attributes. Remember, <code class="varname">method</code> is a function, returned from <code class="function">getattr</code>, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run
out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and
this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named,
or where it's defined; the only thing you need to know about the function is that it is called with one argument, <code class="varname">attrs</code>.
</td>
</tr>
</table>
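<p>To see the dispatch idea in isolation, here is a small self-contained sketch. The class names <code class="classname">TagDispatcher</code> and <code class="classname">PreAware</code> and the <code class="varname">log</code> attribute are hypothetical, and the sketch uses the three-argument form of <code class="function">getattr</code> (with a default of <code>None</code>) instead of the nested <code>try...except</code> blocks shown above, so it is a simplification rather than the code in <code class="classname">SGMLParser</code>.

```python
class TagDispatcher(object):
    """Hypothetical stand-in for the dispatch logic in SGMLParser."""

    def __init__(self):
        self.log = []

    def finish_starttag(self, tag, attrs):
        # Look for start_xxx first, then do_xxx, then fall back to the default.
        for prefix in ("start_", "do_"):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                # method is a bound method object; call it like any function.
                method(attrs)
                return
        self.unknown_starttag(tag, attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append("unknown " + tag)


class PreAware(TagDispatcher):
    # Defined only in the descendant, but found by the ancestor's getattr call.
    def start_pre(self, attrs):
        self.log.append("start_pre")

    def do_br(self, attrs):
        self.log.append("do_br")


dispatcher = PreAware()
for tag in ("pre", "br", "table"):
    dispatcher.finish_starttag(tag, [])
```

<p>The ancestor class never mentions <code class="function">start_pre</code> or <code class="function">do_br</code> by name; it builds the method name from the tag and lets <code class="function">getattr</code> find whatever the descendant happens to define.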
<p>Now back to our regularly scheduled program: <code class="classname">Dialectizer</code>. When you left, you were in the process of defining specific handler methods for <code class="sgmltag-element">&lt;pre></code> and <code class="sgmltag-element">&lt;/pre></code> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that,
you need to override the <code class="function">handle_data</code> method.
<div class="example"><h3>Example 8.19. Overriding the <code class="function">handle_data</code> method</h3><pre class="programlisting">
    def handle_data(self, text): <img id="dialect.dialectizer.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        self.pieces.append(self.verbatim and text or self.process(text)) <img id="dialect.dialectizer.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">handle_data</code> is called with only one argument, the text to process.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.dialectizer.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In the ancestor <a href="#dialect.basehtml.intro" title="Example 8.8. Introducing BaseHTMLProcessor"><code class="classname">BaseHTMLProcessor</code></a>, the <code class="function">handle_data</code> method simply appended the text to the output buffer, <code class="varname">self.pieces</code>. Here the logic is only slightly more complicated. If you're in the middle of a <code><code class="sgmltag-element">&lt;pre></code>...<code class="sgmltag-element">&lt;/pre></code></code> block, <code class="varname">self.verbatim</code> will be some value greater than <code class="constant">0</code>, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the
substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using <a href="#apihelper.andortrick.intro" title="Example 4.17. Introducing the and-or Trick">the <code>and-or</code> trick</a>.
</td>
</tr>
</table>
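<p>The following sketch shows how the counter and the data handler cooperate, including the nested <code class="sgmltag-element">&lt;pre></code> case. Everything here is an assumption made for illustration: <code class="classname">VerbatimTracker</code> is a hypothetical class that does no parsing, its <code class="function">process</code> method merely uppercases text as a placeholder for the real substitutions, and the conditional expression (<code>x if c else y</code>) needs Python 2.5 or later, whereas the original code uses the <code>and-or</code> trick.

```python
class VerbatimTracker(object):
    """Hypothetical class that mimics only the verbatim counter logic."""

    def __init__(self):
        self.verbatim = 0
        self.pieces = []

    def start_pre(self):
        self.verbatim += 1

    def end_pre(self):
        self.verbatim -= 1

    def process(self, text):
        # Placeholder substitution; the real rules live in dialect.py.
        return text.upper()

    def handle_data(self, text):
        # Same decision as the and-or trick, written as a conditional expression.
        self.pieces.append(text if self.verbatim else self.process(text))


tracker = VerbatimTracker()
tracker.handle_data("outside ")
tracker.start_pre()
tracker.start_pre()                 # a nested <pre> tag
tracker.handle_data("inner ")
tracker.end_pre()
tracker.handle_data("still pre ")   # counter is 1, so text stays unaltered
tracker.end_pre()
tracker.handle_data("after")
result = "".join(tracker.pieces)
```

<p>With a simple on/off flag, the first <code class="function">end_pre</code> call would have switched processing back on too early; the counter keeps the text unaltered until the outermost block closes.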
<p>You're close to completely understanding <code class="classname">Dialectizer</code>. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes
later in <code class="filename">dialect.py</code> define a series of regular expressions that operate on the text between the <acronym>HTML</acronym> tags. But you just had <a href="#re" title="Chapter 7. Regular Expressions">a whole chapter on regular expressions</a>. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough
for one chapter.
<h2 id="dialect.alltogether">8.9. Putting it all together</h2>
<p>It's time to put everything you've learned so far to good use. I hope you were paying attention.
<div class="example"><h3>Example 8.20. The <code class="function">translate</code> function, part 1</h3><pre class="programlisting">
def translate(url, dialectName="chef"): <img id="dialect.alltogether.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    import urllib <img id="dialect.alltogether.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    sock = urllib.urlopen(url) <img id="dialect.alltogether.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
    htmlSource = sock.read()
    sock.close()
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">translate</code> function has an <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional argument</a> <code class="varname">dialectName</code>, which is a string that specifies the dialect you'll be using. You'll see how this is used in a minute.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Hey, wait a minute, there's an <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's doc string"><code>import</code></a> statement in this function! That's perfectly legal in Python. You're used to seeing <code>import</code> statements at the top of a program, which means that the imported module is available anywhere in the program. But you can
also import modules within a function, which means that the imported module is only available within the function. If you
have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find
that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you'll
appreciate this.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you <a href="#dialect.extract.urllib" title="Example 8.5. Introducing urllib">get the source of the given URL</a>.
</td>
</tr>
</table>
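<p>A short sketch of what importing inside a function means in practice. The function <code class="function">hypotenuse</code> and the use of the <code class="filename">math</code> module are hypothetical choices for illustration; the point is only that the module name is bound as a local variable of the function.

```python
def hypotenuse(a, b):
    import math  # the name math is bound only inside this function
    return math.sqrt(a * a + b * b)


length = hypotenuse(3, 4)

# Outside the function, the name math was never bound by that import.
# (If something else in the same program imported math, it would be visible.)
try:
    math
    visible_outside = True
except NameError:
    visible_outside = False
```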
<div class="example"><h3>Example 8.21. The <code class="function">translate</code> function, part 2: curiouser and curiouser</h3><pre class="programlisting">
    parserName = "%sDialectizer" % dialectName.capitalize() <img id="dialect.alltogether.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    parserClass = globals()[parserName] <img id="dialect.alltogether.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    parser = parserClass() <img id="dialect.alltogether.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">capitalize</code> is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else
to lowercase. Combined with some <a href="#odbchelper.stringformatting" title="3.5. Formatting Strings">string formatting</a>, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If <code class="varname">dialectName</code> is the string <code>'chef'</code>, <code class="varname">parserName</code> will be the string <code>'ChefDialectizer'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You have the name of a class as a string (<code class="varname">parserName</code>), and you have the global namespace as a dictionary (<code class="function">globals</code>()). Combined, you can get a reference to the class which the string names. (Remember, <a href="#fileinfo.classattributes" title="5.8. Introducing Class Attributes">classes are objects</a>, and they can be assigned to variables just like any other object.) If <code class="varname">parserName</code> is the string <code>'ChefDialectizer'</code>, <code class="varname">parserClass</code> will be the class <code>ChefDialectizer</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Finally, you have a class object (<code class="varname">parserClass</code>), and you want an instance of the class. Well, you already know how to do that: <a href="#fileinfo.create" title="5.4. Instantiating Classes">call the class like a function</a>. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable
like a function, and out pops an instance of the class. If <code class="varname">parserClass</code> is the class <code>ChefDialectizer</code>, <code class="varname">parser</code> will be an instance of the class <code>ChefDialectizer</code>.
</td>
</tr>
</table>
<p>Why bother? After all, there are only 3 <code class="classname">Dialectizer</code> classes; why not just use a <code class="function">case</code> statement? (Well, there's no <code class="function">case</code> statement in Python, but why not just use a series of <code>if</code> statements?) One reason: extensibility. The <code class="function">translate</code> function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new <code class="classname">FooDialectizer</code> tomorrow; <code class="function">translate</code> would work by passing <code>'foo'</code> as the <code class="varname">dialectName</code>.
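<p>The extensibility argument can be tried out in miniature. The following is only a sketch under stated assumptions: the three classes and their <code class="function">translate_text</code> method are hypothetical stand-ins rather than the real classes from <code class="filename">dialect.py</code>, and it assumes a Python 3 interpreter. The lookup itself uses the same three steps as the example above.

```python
# Hypothetical stand-ins for the Dialectizer classes; only the lookup
# technique (capitalize + globals() + call the class) comes from the text.
class Dialectizer:
    def translate_text(self, text):
        return text

class ChefDialectizer(Dialectizer):
    def translate_text(self, text):
        return text + " Bork, bork, bork!"

class FooDialectizer(Dialectizer):
    # Added "tomorrow": make_parser below does not need to change.
    def translate_text(self, text):
        return text.replace("o", "oo")

def make_parser(dialectName):
    # Build the class name as a string, e.g. 'chef' -> 'ChefDialectizer'
    parserName = "%sDialectizer" % dialectName.capitalize()
    # Look the name up in the module's global namespace to get the class
    parserClass = globals()[parserName]
    # Call the class like a function to get an instance
    return parserClass()

chef = make_parser("chef")
foo = make_parser("FOO")   # capitalize() also lowercases the rest
```

<p>Asking for a dialect with no matching class raises a <code>KeyError</code> from the dictionary lookup, which is one reason you might wrap this in a <code>try</code> block if the name comes from outside the program.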
<p>Even better, imagine putting <code class="classname">FooDialectizer</code> in a separate module, and importing it with <code>from <i class="replaceable">module</i> import</code>. You've already seen that this <a href="#dialect.globals.example" title="Example 8.11. Introducing globals">includes it in <code class="function">globals</code>()</a>, so <code class="function">translate</code> would still work without modification, even though <code class="classname">FooDialectizer</code> was in a separate file.
<p>Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted
value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a <acronym>URL</acronym> and a dialect name (both strings) in the query string of a web page request, and output the &#8220;translated&#8221; web page.
<p>Finally, imagine a <code class="classname">Dialectizer</code> framework with a plug-in architecture. You could put each <code class="classname">Dialectizer</code> class in a separate file, leaving only the <code class="function">translate</code> function in <code class="filename">dialect.py</code>. Assuming a consistent naming scheme, the <code class="function">translate</code> function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't
seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an
appropriately-named file in the plug-ins directory (like <code class="filename">foodialect.py</code> which contains the <code class="classname">FooDialectizer</code> class). Calling the <code class="function">translate</code> function with the dialect name <code>'foo'</code> would find the module <code class="filename">foodialect.py</code>, import the class <code class="classname">FooDialectizer</code>, and away you go.
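<p>As a rough preview of what such a plug-in loader might look like, here is a minimal sketch. It assumes Python 3 and its <code class="filename">importlib</code> module; the temporary plug-ins directory, the contents of <code class="filename">foodialect.py</code>, and the helper names <code class="function">install_demo_plugin</code> and <code class="function">load_dialectizer</code> are all invented for illustration.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical plug-in source; a real Dialectizer would do much more.
PLUGIN_SOURCE = '''
class FooDialectizer:
    def output(self):
        return "foo!"
'''

def install_demo_plugin():
    """Write foodialect.py into a temporary directory and make it importable."""
    plugin_dir = tempfile.mkdtemp()
    with open(os.path.join(plugin_dir, "foodialect.py"), "w") as f:
        f.write(PLUGIN_SOURCE)
    sys.path.insert(0, plugin_dir)
    importlib.invalidate_caches()
    return plugin_dir

def load_dialectizer(dialectName):
    """Given 'foo', import foodialect and return a FooDialectizer instance."""
    module = importlib.import_module("%sdialect" % dialectName.lower())
    parserClass = getattr(module, "%sDialectizer" % dialectName.capitalize())
    return parserClass()

install_demo_plugin()
parser = load_dialectizer("foo")
```

<p>The naming scheme does all the work here: the dialect name alone determines both the module to import and the class to fetch from it.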
<div class="example"><h3>Example 8.22. The <code class="function">translate</code> function, part 3</h3><pre class="programlisting">
parser.feed(htmlSource) <img id="dialect.alltogether.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
parser.close() <img id="dialect.alltogether.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
return parser.output() <img id="dialect.alltogether.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">After all that imagining, this is going to seem pretty boring, but the <code class="function">feed</code> function is what <a href="#dialect.feed.example" title="Example 8.7. Using urllister.py">does the entire transformation</a>. You had the entire <acronym>HTML</acronym> source in a single string, so you only had to call <code class="function">feed</code> once. However, you can call <code class="function">feed</code> as often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you
were going to be dealing with very large <acronym>HTML</acronym> pages), you could set this up in a loop, where you read a few bytes of <acronym>HTML</acronym> and fed it to the parser. The result would be the same.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Because <code class="function">feed</code> maintains an internal buffer, you should always call the parser's <code class="function">close</code> method when you're done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing
the last few bytes.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.alltogether.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember, <code class="function">output</code> is the function you defined on <code class="classname">BaseHTMLProcessor</code> that <a href="#dialect.output.example" title="Example 8.9. BaseHTMLProcessor output">joins all the pieces of output you've buffered</a> and returns them in a single string.
</td>
</tr>
</table>
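<p>The claim that feeding the parser in pieces gives the same result as feeding it all at once is easy to check. This sketch assumes Python 3, where <code class="filename">sgmllib</code> is no longer available, so it swaps in <code class="classname">HTMLParser</code> from <code class="filename">html.parser</code>; the <code class="classname">TextCollector</code> class and both helper functions are hypothetical.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Buffer text pieces in a list, in the spirit of BaseHTMLProcessor."""
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def output(self):
        return "".join(self.pieces)

def parse_at_once(htmlSource):
    parser = TextCollector()
    parser.feed(htmlSource)
    parser.close()
    return parser.output()

def parse_in_chunks(htmlSource, chunkSize):
    parser = TextCollector()
    # Feed a few characters at a time; tags split across chunks are
    # held in the parser's internal buffer until they are complete.
    for start in range(0, len(htmlSource), chunkSize):
        parser.feed(htmlSource[start:start + chunkSize])
    # Flush whatever is still buffered
    parser.close()
    return parser.output()

page = "<html><body><p>Hello, <b>world</b>!</p></body></html>"
whole = parse_at_once(page)
chunked = parse_in_chunks(page, 7)
```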
<p>And just like that, you've &#8220;translated&#8221; a web page, given nothing but a <acronym>URL</acronym> and the name of a dialect.
<div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li>You thought I was kidding about the server-side scripting idea. So did I, until I found <a href="http://rinkworks.com/dialect/">this web-based dialectizer</a>. Unfortunately, source code does not appear to be available.
</ul>
<h2 id="dialect.summary">8.10. Summary</h2>
<p>Python provides you with a powerful tool, <code class="filename">sgmllib.py</code>, to manipulate <acronym>HTML</acronym> by turning its structure into an object model. You can use this tool in many different ways.
<div class="itemizedlist">
<ul>
<li>parsing the <acronym>HTML</acronym> looking for something specific
<li>aggregating the results, like the <a href="#dialect.extract.links" title="Example 8.6. Introducing urllister.py"><acronym>URL</acronym> lister</a>
<li>altering the structure along the way, like the <a href="#dialect.quoting.example" title="Example 8.16. Quoting attribute values">attribute quoter</a>
<li>transforming the <acronym>HTML</acronym> into something else by manipulating the text while leaving the tags alone, like the <a href="#dialect.dialectizer" title="8.8. Introducing dialect.py"><code class="classname">Dialectizer</code></a>
</ul>
<p>Along with these examples, you should be comfortable doing all of the following things:
<div class="itemizedlist">
<ul>
<li>Using <a href="#dialect.locals" title="8.5. locals and globals"><code class="function">locals</code>() and <code class="function">globals</code>()</a> to access namespaces
<li><a href="#dialect.dictsub" title="8.6. Dictionary-based string formatting">Formatting strings</a> using dictionary-based substitutions
</ul>
<div class="footnotes"><br><hr width="100" align="left">
<div class="footnote">
<p><sup>[<a name="ftn.d0e20503" href="#d0e20503">1</a>] </sup>The technical term for a parser like <code class="classname">SGMLParser</code> is a <em>consumer</em>: it consumes <acronym>HTML</acronym> and breaks it down. Presumably, the name <code class="function">feed</code> was chosen to fit into the whole &#8220;consumer&#8221; motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or
evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring
back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the
only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, &#8220;Do not feed the parser.&#8221; But maybe that's just me. In any event, it's an interesting mental image.
<div class="footnote">
<p><sup>[<a name="ftn.d0e20702" href="#d0e20702">2</a>] </sup>The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list
just adds the element and updates the index. Since strings cannot be changed after they are created, code like <code>s = s + newpiece</code> will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original
string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets
longer, so doing <code>s = s + newpiece</code> in a loop is deadly. In technical terms, appending <code class="varname">n</code> items to a list is <code>O(n)</code>, while appending <code class="varname">n</code> items to a string is <code>O(n<sup>2</sup>)</code>.
<div class="footnote">
<p><sup>[<a name="ftn.d0e21226" href="#d0e21226">3</a>] </sup>I don't get out much.
<div class="footnote">
<p><sup>[<a name="ftn.d0e21764" href="#d0e21764">4</a>] </sup>All right, it's not that common a question. It's not up there with &#8220;What editor should I use to write Python code?&#8221; (answer: Emacs) or &#8220;Is Python better or worse than Perl?&#8221; (answer: &#8220;Perl is worse than Python because people wanted it worse.&#8221; -Larry Wall, 10/14/1998) But questions about <acronym>HTML</acronym> processing pop up in one form or another about once a month, and among those questions, this is a popular one.
<div class="chapter">
<h2 id="kgp">Chapter 9. <acronym>XML</acronym> Processing</h2>
<h2 id="kgp.divein">9.1. Diving in</h2>
<p>These next two chapters are about <acronym>XML</acronym> processing in Python. It would be helpful if you already knew what an <acronym>XML</acronym> document looks like, that it's made up of structured tags to form a hierarchy of elements, and so on. If this doesn't make
sense to you, there are <a href="http://directory.google.com/Top/Computers/Data_Formats/Markup_Languages/XML/Resources/FAQs,_Help,_and_Tutorials/">many <acronym>XML</acronym> tutorials</a> that can explain the basics.
<p>If you're not particularly interested in XML, you should still read these chapters, which cover important topics like Python packages, Unicode, command line arguments, and how to use <code class="function">getattr</code> for method dispatching.
<p>Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings
of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer
science.
<p>There are two basic ways to work with <acronym>XML</acronym>. One is called <acronym>SAX</acronym> (&#8220;Simple <acronym>API</acronym> for <acronym>XML</acronym>&#8221;), and it works by reading the <acronym>XML</acronym> a little bit at a time and calling a method for each element it finds. (If you read <a href="#dialect" title="Chapter 8. HTML Processing">Chapter 8, <i>HTML Processing</i></a>, this should sound familiar, because that's how the <code class="filename">sgmllib</code> module works.) The other is called <acronym>DOM</acronym> (&#8220;Document Object Model&#8221;), and it works by reading in the entire <acronym>XML</acronym> document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the <acronym>DOM</acronym>.
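<p>To make the difference concrete, here is a small sketch that runs the same document through both kinds of parser. It assumes Python 3; the tiny grammar and the <code class="classname">TagLister</code> handler are made up for illustration.

```python
import xml.sax
from xml.dom import minidom

# A made-up document, small enough to read at a glance
GRAMMAR = b"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

class TagLister(xml.sax.ContentHandler):
    """SAX style: the parser calls a method for each element it finds."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)

handler = TagLister()
xml.sax.parseString(GRAMMAR, handler)
sax_tags = handler.tags

# DOM style: parse everything first, then walk the resulting tree
xmldoc = minidom.parseString(GRAMMAR)
dom_tags = [p.tagName for p in xmldoc.getElementsByTagName("p")]
ref_id = xmldoc.getElementsByTagName("ref")[0].attributes["id"].value
```

<p>With <acronym>SAX</acronym> you only ever see one element at a time, in document order; with the <acronym>DOM</acronym> you can ask questions of the whole tree after the fact.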
<p>The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an <acronym>XML</acronym> format. Don't worry yet if you don't understand what that means; you'll examine both the program's input and its output
in more depth throughout these next two chapters.
<div class="example"><h3>Example 9.1. <code class="filename">kgp.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Kant Generator for Python

Generates mock philosophy based on a context-free grammar

Usage: python kgp.py [options] [source]

Options:
  -g ..., --grammar=...   use specified grammar file or URL
  -h, --help              show this help
  -d                      show debugging information while parsing

Examples:
  kgp.py                  generates several paragraphs of Kantian philosophy
  kgp.py -g husserl.xml   generates several paragraphs of Husserl
  kgp.py "&lt;xref id='paragraph'/>"  generates a paragraph of Kant
  kgp.py template.xml     reads from template.xml to decide what to generate
"""
from xml.dom import minidom
import random
import toolbox
import sys
import getopt

_debug = 0

class NoSourceError(Exception): pass

class KantGenerator:
    """generates mock philosophy based on a context-free grammar"""

    def __init__(self, grammar, source=None):
        self.loadGrammar(grammar)
        self.loadSource(source and source or self.getDefaultSource())
        self.refresh()

    def _load(self, source):
        """load XML input source, return parsed XML document

        - a URL of a remote XML file ("http://diveintopython3.org/kant.xml")
        - a filename of a local XML file ("~/diveintopython3/common/py/kant.xml")
        - standard input ("-")
        - the actual XML document, as a string
        """
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc

    def loadGrammar(self, grammar):
        """load context-free grammar"""
        self.grammar = self._load(grammar)
        self.refs = {}
        for ref in self.grammar.getElementsByTagName("ref"):
            self.refs[ref.attributes["id"].value] = ref

    def loadSource(self, source):
        """load source"""
        self.source = self._load(source)

    def getDefaultSource(self):
        """guess default source of the current grammar

        The default source will be one of the &lt;ref>s that is not
        cross-referenced.  This sounds complicated but it's not.
        Example: The default source for kant.xml is
        "&lt;xref id='section'/>", because 'section' is the one &lt;ref>
        that is not &lt;xref>'d anywhere in the grammar.
        In most grammars, the default source will produce the
        longest (and most interesting) output.
        """
        xrefs = {}
        for xref in self.grammar.getElementsByTagName("xref"):
            xrefs[xref.attributes["id"].value] = 1
        xrefs = xrefs.keys()
        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
        if not standaloneXrefs:
            raise NoSourceError, "can't guess source, and no source specified"
        return '&lt;xref id="%s"/>' % random.choice(standaloneXrefs)

    def reset(self):
        """reset parser"""
        self.pieces = []
        self.capitalizeNextWord = 0

    def refresh(self):
        """reset output buffer, re-parse entire source file, and return output

        Since parsing involves a good deal of randomness, this is an
        easy way to get new output without having to reload a grammar file
        each time.
        """
        self.reset()
        self.parse(self.source)
        return self.output()

    def output(self):
        """output generated text"""
        return "".join(self.pieces)

    def randomChildElement(self, node):
        """choose a random child element of a node

        This is a utility method used by do_xref and do_choice.
        """
        choices = [e for e in node.childNodes
                   if e.nodeType == e.ELEMENT_NODE]
        chosen = random.choice(choices)
        if _debug:
            sys.stderr.write('%s available choices: %s\n' % \
                (len(choices), [e.toxml() for e in choices]))
            sys.stderr.write('Chosen: %s\n' % chosen.toxml())
        return chosen

    def parse(self, node):
        """parse a single XML node

        A parsed XML document (from minidom.parse) is a tree of nodes
        of various types.  Each node is represented by an instance of the
        corresponding Python class (Element for a tag, Text for
        text data, Document for the top-level document).  The following
        statement constructs the name of a class method based on the type
        of node we're parsing ("parse_Element" for an Element node,
        "parse_Text" for a Text node, etc.) and then calls the method.
        """
        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
        parseMethod(node)

    def parse_Document(self, node):
        """parse the document node

        The document node by itself isn't interesting (to us), but
        its only child, node.documentElement, is: it's the root node
        of the grammar.
        """
        self.parse(node.documentElement)

    def parse_Text(self, node):
        """parse a text node

        The text of a text node is usually added to the output buffer
        verbatim.  The one exception is that &lt;p class='sentence'> sets
        a flag to capitalize the first letter of the next word.  If
        that flag is set, we capitalize the text and reset the flag.
        """
        text = node.data
        if self.capitalizeNextWord:
            self.pieces.append(text[0].upper())
            self.pieces.append(text[1:])
            self.capitalizeNextWord = 0
        else:
            self.pieces.append(text)

    def parse_Element(self, node):
        """parse an element

        An XML element corresponds to an actual tag in the source:
        &lt;xref id='...'>, &lt;p chance='...'>, &lt;choice>, etc.
        Each element type is handled in its own method.  Like we did in
        parse(), we construct a method name based on the name of the
        element ("do_xref" for an &lt;xref> tag, etc.) and
        call the method.
        """
        handlerMethod = getattr(self, "do_%s" % node.tagName)
        handlerMethod(node)

    def parse_Comment(self, node):
        """parse a comment

        The grammar can contain XML comments, but we ignore them
        """
        pass

    def do_xref(self, node):
        """handle &lt;xref id='...'> tag

        An &lt;xref id='...'> tag is a cross-reference to a &lt;ref id='...'>
        tag.  &lt;xref id='sentence'/> evaluates to a randomly chosen child of
        &lt;ref id='sentence'>.
        """
        id = node.attributes["id"].value
        self.parse(self.randomChildElement(self.refs[id]))

    def do_p(self, node):
        """handle &lt;p> tag

        The &lt;p> tag is the core of the grammar.  It can contain almost
        anything: freeform text, &lt;choice> tags, &lt;xref> tags, even other
        &lt;p> tags.  If a "class='sentence'" attribute is found, a flag
        is set and the next word will be capitalized.  If a "chance='X'"
        attribute is found, there is an X% chance that the tag will be
        evaluated (and therefore a (100-X)% chance that it will be
        completely ignored)
        """
        keys = node.attributes.keys()
        if "class" in keys:
            if node.attributes["class"].value == "sentence":
                self.capitalizeNextWord = 1
        if "chance" in keys:
            chance = int(node.attributes["chance"].value)
            doit = (chance > random.randrange(100))
        else:
            doit = 1
        if doit:
            for child in node.childNodes: self.parse(child)

    def do_choice(self, node):
        """handle &lt;choice> tag

        A &lt;choice> tag contains one or more &lt;p> tags.  One &lt;p> tag
        is chosen at random and evaluated; the rest are ignored.
        """
        self.parse(self.randomChildElement(node))

def usage():
    print __doc__

def main(argv):
    grammar = "kant.xml"
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt == '-d':
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"):
            grammar = arg

    source = "".join(args)

    k = KantGenerator(grammar, source)
    print k.output()

if __name__ == "__main__":
    main(sys.argv[1:])
</pre><div class="example"><h3>Example 9.2. <code class="filename">toolbox.py</code></h3><pre class="programlisting">
"""Miscellaneous utility functions"""

def openAnything(source):
    """URI, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.

    Examples:
    >>> from xml.dom import minidom
    >>> sock = openAnything("http://localhost/kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("&lt;ref id='conjunction'>&lt;text>and&lt;/text>&lt;text>or&lt;/text>&lt;/ref>")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    """
    if hasattr(source, "read"):
        return source

    if source == '-':
        import sys
        return sys.stdin

    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source)
    except (IOError, OSError):
        pass

    # try to open with native open function (if source is pathname)
    try:
        return open(source)
    except (IOError, OSError):
        pass

    # treat source as string
    import StringIO
    return StringIO.StringIO(str(source))
</pre><p>Run the program <code class="filename">kgp.py</code> by itself, and it will parse the default <acronym>XML</acronym>-based grammar, in <code class="filename">kant.xml</code>, and print several paragraphs' worth of philosophy in the style of Immanuel Kant.
<div class="example"><h3>Example 9.3. Sample output of <code class="filename">kgp.py</code></h3><pre class="screen"><samp class="prompt">[you@localhost kgp]$ python kgp.py</samp>
<samp class="computeroutput"> As is shown in the writings of Hume, our a priori concepts, in
reference to ends, abstract from all content of knowledge; in the study
of space, the discipline of human reason, in accordance with the
principles of philosophy, is the clue to the discovery of the
Transcendental Deduction. The transcendental aesthetic, in all
theoretical sciences, occupies part of the sphere of human reason
concerning the existence of our ideas in general; still, the
never-ending regress in the series of empirical conditions constitutes
the whole content for the transcendental unity of apperception. What
we have alone been able to show is that, even as this relates to the
architectonic of human reason, the Ideal may not contradict itself, but
it is still possible that it may be in contradictions with the
employment of the pure employment of our hypothetical judgements, but
natural causes (and I assert that this is the case) prove the validity
of the discipline of pure reason. As we have already seen, time (and
it is obvious that this is true) proves the validity of time, and the
architectonic of human reason, in the full sense of these terms,
abstracts from all content of knowledge. I assert, in the case of the
discipline of practical reason, that the Antinomies are just as
necessary as natural causes, since knowledge of the phenomena is a
posteriori.
The discipline of human reason, as I have elsewhere shown, is by
its very nature contradictory, but our ideas exclude the possibility of
the Antinomies. We can deduce that, on the contrary, the pure
employment of philosophy, on the contrary, is by its very nature
contradictory, but our sense perceptions are a representation of, in
the case of space, metaphysics. The thing in itself is a
representation of philosophy. Applied logic is the clue to the
discovery of natural causes. However, what we have alone been able to
show is that our ideas, in other words, should only be used as a canon
for the Ideal, because of our necessary ignorance of the conditions.
[...snip...]</samp></pre><p>This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although
very verbose -- Kant wasn't what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least
the sort of thing that Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent.
But all of it is in the style of Immanuel Kant.
<p>Let me repeat that this is much, much funnier if you are now or have ever been a philosophy major.
<p>The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous
example was derived from the grammar file, <code class="filename">kant.xml</code>. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be
completely different.
<div class="example"><h3>Example 9.4. Simpler output from <code class="filename">kgp.py</code></h3><pre class="screen"><samp class="prompt">[you@localhost kgp]$ python kgp.py -g binary.xml</samp>
00101001
<samp class="prompt">[you@localhost kgp]$ python kgp.py -g binary.xml</samp>
10110100</pre><p>You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is
that the grammar file defines the structure of the output, and the <code class="filename">kgp.py</code> program reads through the grammar and makes random decisions about which words to plug in where.
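<p>If the idea of a grammar driving random output still seems abstract, this stripped-down sketch may help. It is not how <code class="filename">kgp.py</code> works internally: the grammar here is a hypothetical Python dictionary standing in for an <acronym>XML</acronym> file, the <code class="function">expand</code> function is invented for illustration, and it assumes Python 3.

```python
import random

# Hypothetical stand-in for a grammar file: each name maps to a list of
# alternatives. An alternative is either literal text or a list of
# references to other names.
GRAMMAR = {
    "bit": ["0", "1"],
    "byte": [["bit"] * 8],
}

def expand(name, rng=random):
    """Pick one alternative for name at random and expand it recursively."""
    choice = rng.choice(GRAMMAR[name])
    if isinstance(choice, str):
        return choice          # literal text goes straight to the output
    return "".join(expand(ref, rng) for ref in choice)

output = expand("byte")
```

<p>The grammar fixes the shape of the result (eight bits in a row), while the random choices fill in which bit goes where, so each run can produce a different byte.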
<h2 id="kgp.packages">9.2. Packages</h2>
<p>Actually parsing an <acronym>XML</acronym> document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour
to talk about packages.
<div class="example"><h3>Example 9.5. Loading an <acronym>XML</acronym> document (a sneak peek)</h3><pre class="screen">
<samp class="prompt">>>> </samp>from xml.dom import minidom <img id="kgp.packages.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.packages.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a syntax you haven't seen before. It looks almost like the <code>from <i class="replaceable">module</i> import</code> you know and love, but the <code>"."</code> gives it away as something above and beyond a simple import. In fact, <code class="filename">xml</code> is what is known as a package, <code class="filename">dom</code> is a nested package within <code class="filename">xml</code>, and <code class="filename">minidom</code> is a module within <code class="filename">xml.dom</code>.
</td>
</tr>
</table>
<p>That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than
directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still
just <code class="filename">.py</code> files, like always, except that they're in a subdirectory instead of the main <code class="filename">lib/</code> directory of your Python installation.
<div class="example"><h3>Example 9.6. File layout of a package</h3><pre class="screen">Python21/           root Python installation (home of the executable)
|
+--lib/             library directory (home of the standard library modules)
   |
   +-- xml/         xml package (really just a directory with other stuff in it)
       |
       +--sax/      xml.sax package (again, just a directory)
       |
       +--dom/      xml.dom package (contains minidom.py)
       |
       +--parsers/  xml.parsers package (used internally)</pre><p>So when you say <code>from xml.dom import minidom</code>, Python figures out that that means &#8220;look in the <code class="filename">xml</code> directory for a <code class="filename">dom</code> directory, and look in <em>that</em> for the <code class="filename">minidom</code> module, and import it as <code class="filename">minidom</code>&#8221;. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import
specific classes or functions from a module contained within a package. You can also import the package itself as a module.
The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.
<div class="example"><h3>Example 9.7. Packages are modules, too</h3><pre class="screen"><samp class="prompt">>>> </samp>from xml.dom import minidom <img id="kgp.packages.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>minidom
&lt;module 'xml.dom.minidom' from 'C:\Python21\lib\xml\dom\minidom.pyc'>
<samp class="prompt">>>> </samp>minidom.Element
&lt;class xml.dom.minidom.Element at 01095744>
<samp class="prompt">>>> </samp>from xml.dom.minidom import Element <img id="kgp.packages.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>Element
&lt;class xml.dom.minidom.Element at 01095744>
<samp class="prompt">>>> </samp>minidom.Element
&lt;class xml.dom.minidom.Element at 01095744>
<samp class="prompt">>>> </samp>from xml import dom <img id="kgp.packages.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>dom
&lt;module 'xml.dom' from 'C:\Python21\lib\xml\dom\__init__.pyc'>
<samp class="prompt">>>> </samp>import xml <img id="kgp.packages.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>xml
&lt;module 'xml' from 'C:\Python21\lib\xml\__init__.pyc'></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.packages.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you're importing a module (<code class="filename">minidom</code>) from a nested package (<code class="filename">xml.dom</code>). The result is that <code class="filename">minidom</code> is imported into your <a href="#dialect.locals" title="8.5. locals and globals">namespace</a>, and in order to reference classes within the <code class="filename">minidom</code> module (like <code class="classname">Element</code>), you need to preface them with the module name.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.packages.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you are importing a class (<code class="classname">Element</code>) from a module (<code class="filename">minidom</code>) from a nested package (<code class="filename">xml.dom</code>). The result is that <code class="classname">Element</code> is imported directly into your namespace. Note that this does not interfere with the previous import; the <code class="classname">Element</code> class can now be referenced in two ways (but it's all still the same class).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.packages.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you are importing the <code class="filename">dom</code> package (a nested package of <code class="filename">xml</code>) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even
have its own attributes and methods, just like the modules you've seen before.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.packages.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you are importing the root level <code class="filename">xml</code> package as a module.
</td>
</tr>
</table>
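<p>You can confirm these relationships without reading object addresses off the screen. A minimal sketch, assuming a Python 3 interpreter with the standard <code class="filename">xml</code> package available; the variable names are just for illustration.

```python
import types

import xml
from xml import dom
from xml.dom import minidom
from xml.dom.minidom import Element

# Both ways of referring to Element give you the very same class object
same_class = minidom.Element is Element

# Packages at every level are module objects...
packages_are_modules = (isinstance(xml, types.ModuleType)
                        and isinstance(dom, types.ModuleType))

# ...and what marks them as packages is that they carry a __path__
dom_is_package = hasattr(dom, "__path__")
minidom_is_package = hasattr(minidom, "__path__")

# A package can have attributes of its own, like the Node class in xml.dom
element_node_constant = dom.Node.ELEMENT_NODE
```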
<p>So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)?
The answer is the magical <code class="filename">__init__.py</code> file. You see, packages are not simply directories; they are directories with a specific file, <code class="filename">__init__.py</code>, inside. This file defines the attributes and methods of the package. For instance, <code class="filename">xml.dom</code> contains a <code class="classname">Node</code> class, which is defined in <code class="filename">xml/dom/__init__.py</code>. When you import a package as a module (like <code class="filename">dom</code> from <code class="filename">xml</code>), you're really importing its <code class="filename">__init__.py</code> file.<table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">A package is a directory with the special <code class="filename">__init__.py</code> file in it. The <code class="filename">__init__.py</code> file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file,
but it has to exist. But if <code class="filename">__init__.py</code> doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.
</td>
</tr>
</table>
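<p>As a quick check of the claim above, the following sketch asks the interpreter where the <code class="filename">xml.dom</code> package object comes from. It assumes only the standard library and is written to run on either Python 2 or Python 3; the variable names are purely illustrative.

```python
import os
import xml.dom

# A package object is backed by the __init__ file in its directory.
init_file = os.path.basename(xml.dom.__file__)
init_stem = os.path.splitext(init_file)[0]   # drop the .py / .pyc suffix
print(init_stem)

# Node is defined in that __init__ file, so its home module is the
# package itself rather than one of the package's submodules.
node_home = xml.dom.Node.__module__
print(node_home)
```

<p>If the package really is its <code class="filename">__init__.py</code> file, the first value should name that file and the second should be the package name.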
<p>So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an <code class="filename">xml</code> package with <code class="filename">sax</code> and <code class="filename">dom</code> packages inside, the authors could have chosen to put all the <code class="filename">sax</code> functionality in <code class="filename">xmlsax.py</code> and all the <code class="filename">dom</code> functionality in <code class="filename">xmldom.py</code>, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the <acronym>XML</acronym> package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different
areas simultaneously).
<p>If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good
package architecture. It's one of the many things Python is good at, so take advantage of it.
<h2 id="kgp.parse">9.3. Parsing <acronym>XML</acronym></h2>
<p>As I was saying, actually parsing an <acronym>XML</acronym> document is very simple: one line of code. Where you go from there is up to you.
<div class="example"><h3>Example 9.8. Loading an <acronym>XML</acronym> document (for real this time)</h3><pre class="screen">
<samp class="prompt">>>> </samp>from xml.dom import minidom <img id="kgp.parse.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml') <img id="kgp.parse.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>xmldoc <img id="kgp.parse.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;xml.dom.minidom.Document instance at 010BE87C>
<samp class="prompt">>>> </samp>print xmldoc.toxml() <img id="kgp.parse.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">&lt;?xml version="1.0" ?>
&lt;grammar>
&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref>
&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref>
&lt;/grammar></samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in the <a href="#kgp.packages" title="9.2. Packages">previous section</a>, this imports the <code class="filename">minidom</code> module from the <code class="filename">xml.dom</code> package.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here is the one line of code that does all the work: <code class="function">minidom.parse</code> takes one argument and returns a parsed representation of the <acronym>XML</acronym> document. The argument can be many things; in this case, it's simply a filename of an <acronym>XML</acronym> document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.)
But you can also pass a <a href="#fileinfo.files" title="6.2. Working with File Objects">file object</a>, or even a <a href="#dialect.extract.urllib" title="Example 8.5. Introducing urllib">file-like object</a>. You'll take advantage of this flexibility later in this chapter.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The object returned from <code class="function">minidom.parse</code> is a <code class="classname">Document</code> object, a descendant of the <code class="classname">Node</code> class. This <code class="classname">Document</code> object is the root level of a complex tree-like structure of interlocking Python objects that completely represent the <acronym>XML</acronym> document you passed to <code class="function">minidom.parse</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">toxml</code> is a method of the <code class="classname">Node</code> class (and is therefore available on the <code class="classname">Document</code> object you got from <code class="function">minidom.parse</code>). <code class="function">toxml</code> returns, as a string, the <acronym>XML</acronym> that this <code class="classname">Node</code> represents; the <code>print</code> statement then displays it. For the <code class="classname">Document</code> node, this is the entire <acronym>XML</acronym> document.
</td>
</tr>
</table>
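<p>The three kinds of argument mentioned above (a filename, a file object, and a file-like object) can be sketched side by side. This is only an illustration: the <code class="varname">GRAMMAR</code> data is an abbreviated stand-in for <code class="filename">binary.xml</code>, the temporary file is created just for the demonstration, and <code>io.BytesIO</code> plays the role of the file-like object.

```python
import io
import os
import tempfile
from xml.dom import minidom

# Abbreviated stand-in for binary.xml (an assumption for this sketch).
GRAMMAR = b"""<?xml version="1.0"?>
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
</grammar>
"""

fd, path = tempfile.mkstemp(suffix='.xml')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(GRAMMAR)
    # 1. Pass a filename.
    doc_from_name = minidom.parse(path)
    # 2. Pass an open file object.
    with open(path, 'rb') as f:
        doc_from_file = minidom.parse(f)
finally:
    os.remove(path)

# 3. Pass a file-like object that lives only in memory.
doc_from_memory = minidom.parse(io.BytesIO(GRAMMAR))

docs = (doc_from_name, doc_from_file, doc_from_memory)
root_names = [doc.documentElement.tagName for doc in docs]
print(root_names)
```

<p>All three calls should produce equivalent <code class="classname">Document</code> objects, since only the way the bytes are delivered differs.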
<p>Now that you have an <acronym>XML</acronym> document in memory, you can start traversing through it.
<div class="example"><h3 id="kgp.parse.gettingchildnodes.example">Example 9.9. Getting child nodes</h3><pre class="screen">
<samp class="prompt">>>> </samp>xmldoc.childNodes <img id="kgp.parse.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
[&lt;DOM Element: grammar at 17538908>]
<samp class="prompt">>>> </samp>xmldoc.childNodes[0] <img id="kgp.parse.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;DOM Element: grammar at 17538908>
<samp class="prompt">>>> </samp>xmldoc.firstChild <img id="kgp.parse.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;DOM Element: grammar at 17538908></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Every <code class="classname">Node</code> has a <code class="function">childNodes</code> attribute, which is a list of that node's child <code class="classname">Node</code> objects. A <code class="classname">Document</code> has exactly one root element; in this document that root element (the <code class="sgmltag-element">grammar</code> element) is also the only child node.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To get the first (and in this case, the only) child node, just use regular list syntax. Remember, there is nothing special
going on here; this is just a regular Python list of regular Python objects.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since getting the first child node of a node is a useful and common activity, the <code class="classname">Node</code> class has a <code class="function">firstChild</code> attribute, which is synonymous with <code>childNodes[0]</code>. (There is also a <code class="function">lastChild</code> attribute, which is synonymous with <code>childNodes[-1]</code>.)
</td>
</tr>
</table>
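<p>The relationship between <code class="function">childNodes</code>, <code class="function">firstChild</code>, and <code class="function">lastChild</code> can be checked directly. The sketch below uses <code class="function">minidom.parseString</code> on a small inline document (written without line breaks, so that there are no whitespace nodes to worry about yet); the document and variable names are made up for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString('<grammar><ref id="bit"/><ref id="byte"/></grammar>')

root = doc.childNodes[0]                 # ordinary list indexing
same_as_first = doc.firstChild is root   # firstChild is childNodes[0]

refs = root.childNodes
same_as_last = root.lastChild is refs[-1]  # lastChild is childNodes[-1]

first_id = root.firstChild.getAttribute('id')
last_id = root.lastChild.getAttribute('id')
print('%d children: %s ... %s' % (len(refs), first_id, last_id))
```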
<div class="example"><h3>Example 9.10. <code class="function">toxml</code> works on any node</h3><pre class="screen">
<samp class="prompt">>>> </samp>grammarNode = xmldoc.firstChild
<samp class="prompt">>>> </samp>print grammarNode.toxml() <img id="kgp.parse.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">&lt;grammar>
&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref>
&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref>
&lt;/grammar></samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since the <code class="function">toxml</code> method is defined in the <code class="classname">Node</code> class, it is available on any <acronym>XML</acronym> node, not just the <code class="classname">Document</code> node.
</td>
</tr>
</table>
<div class="example"><h3 id="kgp.parse.childnodescanbetext.example">Example 9.11. Child nodes can be text</h3><pre class="screen">
<samp class="prompt">>>> </samp>grammarNode.childNodes<img id="kgp.parse.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">[&lt;DOM Text node "\n">, &lt;DOM Element: ref at 17533332>, \
&lt;DOM Text node "\n">, &lt;DOM Element: ref at 17549660>, &lt;DOM Text node "\n">]</samp>
<samp class="prompt">>>> </samp>print grammarNode.firstChild.toxml() <img id="kgp.parse.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">
</samp>
<samp class="prompt">>>> </samp>print grammarNode.childNodes[1].toxml() <img id="kgp.parse.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref></samp>
<samp class="prompt">>>> </samp>print grammarNode.childNodes[3].toxml() <img id="kgp.parse.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref></samp>
<samp class="prompt">>>> </samp>print grammarNode.lastChild.toxml() <img id="kgp.parse.4.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="computeroutput">
</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Looking at the <acronym>XML</acronym> in <code class="filename">binary.xml</code>, you might think that the <code class="sgmltag-element">grammar</code> has only two child nodes, the two <code class="sgmltag-element">ref</code> elements. But you're missing something: the carriage returns! After the <code>'&lt;grammar>'</code> and before the first <code>'&lt;ref>'</code> is a carriage return, and this text counts as a child node of the <code class="sgmltag-element">grammar</code> element. Similarly, there is a carriage return after each <code>'&lt;/ref>'</code>; these also count as child nodes. So <code>grammarNode.childNodes</code> is actually a list of 5 objects: 3 <code class="classname">Text</code> objects and 2 <code class="classname">Element</code> objects.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first child is a <code class="classname">Text</code> object representing the carriage return after the <code>'&lt;grammar>'</code> tag and before the first <code>'&lt;ref>'</code> tag.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The second child is an <code class="classname">Element</code> object representing the first <code class="sgmltag-element">ref</code> element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The fourth child is an <code class="classname">Element</code> object representing the second <code class="sgmltag-element">ref</code> element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.4.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The last child is a <code class="classname">Text</code> object representing the carriage return after the <code>'&lt;/ref>'</code> end tag and before the <code>'&lt;/grammar>'</code> end tag.
</td>
</tr>
</table>
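<p>Because whitespace between tags turns into <code class="classname">Text</code> nodes, code that only cares about elements usually filters on <code>nodeType</code>. The sketch below is one way to do that, using a small inline document in place of <code class="filename">binary.xml</code>; <code>Node.TEXT_NODE</code> and <code>Node.ELEMENT_NODE</code> are the standard constants from <code class="filename">xml.dom</code>, and the other names are illustrative.

```python
from xml.dom import minidom, Node

XML = '<grammar>\n<ref id="bit"/>\n<ref id="byte"/>\n</grammar>'
grammar = minidom.parseString(XML).documentElement

# Label each child by kind: the line breaks show up as Text nodes.
kinds = ['text' if child.nodeType == Node.TEXT_NODE else 'element'
         for child in grammar.childNodes]
print(kinds)

# Keep only the elements when the whitespace is not interesting.
refs = [child for child in grammar.childNodes
        if child.nodeType == Node.ELEMENT_NODE]
ref_ids = [ref.getAttribute('id') for ref in refs]
print(ref_ids)
```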
<div class="example"><h3>Example 9.12. Drilling down all the way to text</h3><pre class="screen">
<samp class="prompt">>>> </samp>grammarNode
&lt;DOM Element: grammar at 19167148>
<samp class="prompt">>>> </samp>refNode = grammarNode.childNodes[1] <img id="kgp.parse.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>refNode
&lt;DOM Element: ref at 17987740>
<samp class="prompt">>>> </samp>refNode.childNodes<img id="kgp.parse.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">[&lt;DOM Text node "\n">, &lt;DOM Text node " ">, &lt;DOM Element: p at 19315844>, \
&lt;DOM Text node "\n">, &lt;DOM Text node " ">, \
&lt;DOM Element: p at 19462036>, &lt;DOM Text node "\n">]</samp>
<samp class="prompt">>>> </samp>pNode = refNode.childNodes[2]
<samp class="prompt">>>> </samp>pNode
&lt;DOM Element: p at 19315844>
<samp class="prompt">>>> </samp>print pNode.toxml() <img id="kgp.parse.5.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;p>0&lt;/p>
<samp class="prompt">>>> </samp>pNode.firstChild <img id="kgp.parse.5.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;DOM Text node "0">
<samp class="prompt">>>> </samp>pNode.firstChild.data <img id="kgp.parse.5.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
u'0'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in the previous example, the first <code class="sgmltag-element">ref</code> element is <code>grammarNode.childNodes[1]</code>, since <code>childNodes[0]</code> is a <code class="classname">Text</code> node for the carriage return.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="sgmltag-element">ref</code> element has its own set of child nodes, one for the carriage return, a separate one for the spaces, one for the <code class="sgmltag-element">p</code> element, and so forth.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.5.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can even use the <code class="function">toxml</code> method here, deeply nested within the document.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.5.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="sgmltag-element">p</code> element has only one child node (you can't tell that from this example, but look at <code>pNode.childNodes</code> if you don't believe me), and it is a <code class="classname">Text</code> node for the single character <code>'0'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.parse.5.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code>.data</code> attribute of a <code class="classname">Text</code> node gives you the actual string that the text node represents. But what is that <code>'u'</code> in front of the string? The answer to that deserves its own section.
</td>
</tr>
</table>
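<p>Drilling down by fixed position, as in the session above, depends on exactly how the parser splits up whitespace; depending on the Python version, a line break and the indentation after it may arrive as one <code class="classname">Text</code> node or as two. A more defensive sketch, assuming the same shape of document, selects elements by type and then reads each one's text through <code>.data</code>. The helper name <code class="function">element_children</code> is hypothetical, not part of <code class="filename">minidom</code>.

```python
from xml.dom import minidom, Node

XML = '<grammar>\n<ref id="bit">\n  <p>0</p>\n  <p>1</p>\n</ref>\n</grammar>'
grammar = minidom.parseString(XML).documentElement


def element_children(node):
    """Return the child nodes of node that are elements (hypothetical helper)."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


ref_node = element_children(grammar)[0]   # the first ref element
p_nodes = element_children(ref_node)      # its two p elements

# Each p element holds a single Text node; .data is the string itself.
values = [p.firstChild.data for p in p_nodes]
print(values)
```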
<h2 id="kgp.unicode">9.4. Unicode</h2>
<p>Unicode is a system to represent characters from all the world's different languages. When Python parses an <acronym>XML</acronym> document, all data is stored in memory as unicode.
<p>You'll get to all that in a minute, but first, some background.
<p><b>Historical note. </b>Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent
that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the
same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets.
Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character
encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things.
Then think about trying to store these documents in the same place (like in the same database table); you would need to store
the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around.
Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used
escape codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac Greek
mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve.
<p>To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.<sup>[<a name="d0e23786" href="#ftn.d0e23786">5</a>]</sup> Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used
in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number.
Unicode data is never ambiguous.
<p>Of course, there is still the matter of all these legacy encoding systems. 7-bit <acronym>ASCII</acronym>, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital &#8220;<code>A</code>&#8221;, 97 is lowercase &#8220;<code>a</code>&#8221;, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit <acronym>ASCII</acronym>. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called &#8220;latin-1&#8221;), which uses the 7-bit <acronym>ASCII</acronym> characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it
(241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit <acronym>ASCII</acronym> for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters
for other languages with the remaining numbers, 256 through 65535.
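<p>The overlap described above can be verified with a few lines of Python. This is a small sketch, written to run on Python 2 or Python 3: it decodes every possible byte value as ISO-8859-1 and checks that each resulting character has the same number as the byte it came from, then does the same for the <acronym>ASCII</acronym> range.

```python
# Every single-byte value, 0 through 255, in order.
all_bytes = bytes(bytearray(range(256)))

# Decoding as ISO-8859-1 should give character number i for byte i.
as_latin1 = all_bytes.decode('latin-1')
latin1_matches = all(ord(ch) == i for i, ch in enumerate(as_latin1))

# The first 128 values mean the same thing in ASCII.
ascii_part = all_bytes[:128].decode('ascii')
ascii_matches = ascii_part == as_latin1[:128]

# The specific characters mentioned in the text.
code_points = [ord(ch) for ch in u'Aa\xf1\xfc']
print(code_points)
```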
<p>When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding
systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding
scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an <acronym>XML</acronym> document which explicitly specifies the encoding scheme.
<p>And on that note, let's get back to Python.
<p>Python has had unicode support throughout the language since version 2.0. The <acronym>XML</acronym> package uses unicode to store all parsed <acronym>XML</acronym> data, but you can use unicode anywhere.
<div class="example"><h3>Example 9.13. Introducing unicode</h3><pre class="screen">
<samp class="prompt">>>> </samp>s = u'Dive in' <img id="kgp.unicode.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>s
u'Dive in'
<samp class="prompt">>>> </samp>print s <img id="kgp.unicode.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
Dive in</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To create a unicode string instead of a regular <acronym>ASCII</acronym> string, add the letter &#8220;<code>u</code>&#8221; before the string. Note that this particular string doesn't have any non-<acronym>ASCII</acronym> characters. That's fine; unicode is a superset of <acronym>ASCII</acronym> (a very large superset at that), so any regular <acronym>ASCII</acronym> string can also be stored as unicode.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When printing a string, Python will attempt to convert it to your default encoding, which is usually <acronym>ASCII</acronym>. (More on this in a minute.) Since this unicode string is made up of characters that are also <acronym>ASCII</acronym> characters, printing it has the same result as printing a normal <acronym>ASCII</acronym> string; the conversion is seamless, and if you didn't know that <code class="varname">s</code> was a unicode string, you'd never notice the difference.
</td>
</tr>
</table>
<div class="example"><h3>Example 9.14. Storing non-<acronym>ASCII</acronym> characters</h3><pre class="screen">
<samp class="prompt">>>> </samp>s = u'La Pe\xf1a' <img id="kgp.unicode.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print s <img id="kgp.unicode.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
<samp class="prompt">>>> </samp>print s.encode('latin-1') <img id="kgp.unicode.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
La Pe&ntilde;a</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The real advantage of unicode, of course, is its ability to store non-<acronym>ASCII</acronym> characters, like the Spanish &#8220;<code>&ntilde;</code>&#8221; (<code>n</code> with a tilde over it). The unicode character code for the tilde-n is <code>0xf1</code> in hexadecimal (241 in decimal), which you can type like this: <code>\xf1</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember I said that the <code class="function">print</code> statement attempts to convert a unicode string to <acronym>ASCII</acronym> so it can print it? Well, that's not going to work here, because your unicode string contains non-<acronym>ASCII</acronym> characters, so Python raises a <code class="errorname">UnicodeError</code> error.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. <code class="varname">s</code> is a unicode string, but <code class="function">print</code> can only print a regular string. To solve this problem, you call the <code class="function">encode</code> method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme,
which you pass as a parameter. In this case, you're using <code>latin-1</code> (also known as <code>iso-8859-1</code>), which includes the tilde-n (whereas the default <acronym>ASCII</acronym> encoding scheme did not, since it only includes characters numbered 0 through 127).
</td>
</tr>
</table>
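<p>The difference between the two encodings can also be seen without printing anything, by calling <code class="function">encode</code> explicitly and looking at the result. In this sketch the <code>'replace'</code> argument is one of the standard error handlers for <code class="function">encode</code>; it is shown only as one possible fallback, not as something the examples above rely on.

```python
s = u'La Pe\xf1a'

# ASCII has no character numbered 241, so a strict encode fails.
try:
    s.encode('ascii')
    ascii_ok = True
except UnicodeError:
    ascii_ok = False

# latin-1 does include it, as the single byte 0xf1.
latin1_bytes = s.encode('latin-1')

# A lossy fallback: substitute '?' for anything ASCII cannot express.
fallback = s.encode('ascii', 'replace')

print(ascii_ok)
print(repr(latin1_bytes))
print(repr(fallback))
```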
<p>Remember I said Python usually converted unicode to <acronym>ASCII</acronym> whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which
you can customize.
<div class="example"><h3>Example 9.15. <code class="filename">sitecustomize.py</code></h3><pre class="programlisting">
# sitecustomize.py <img id="kgp.unicode.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('iso-8859-1') <img id="kgp.unicode.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">sitecustomize.py</code> is a special script; Python will try to import it on startup, so any code in it will be run automatically. As the comment mentions, it can go anywhere
(as long as <code>import</code> can find it), but it usually goes in the <code class="filename">site-packages</code> directory within your Python <code class="filename">lib</code> directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">setdefaultencoding</code> function sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.
</td>
</tr>
</table>
<div class="example"><h3>Example 9.16. Effects of setting the default encoding</h3><pre class="screen">
<samp class="prompt">>>> </samp>import sys
<samp class="prompt">>>> </samp>sys.getdefaultencoding() <img id="kgp.unicode.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'iso-8859-1'
<samp class="prompt">>>> </samp>s = u'La Pe\xf1a'
<samp class="prompt">>>> </samp>print s<img id="kgp.unicode.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
La Pe&ntilde;a</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This example assumes that you have made the changes listed in the previous example to your <code class="filename">sitecustomize.py</code> file, and restarted Python. If your default encoding still says <code>'ascii'</code>, you didn't set up your <code class="filename">sitecustomize.py</code> properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even
call <code class="function">sys.setdefaultencoding</code> after Python has started up. Dig into <code class="filename">site.py</code> and search for &#8220;<code>setdefaultencoding</code>&#8221; to find out how.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it.
</td>
</tr>
</table>
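<p>You can inspect the current state of affairs from any running interpreter. The sketch below only reads the default encoding and checks whether <code class="function">setdefaultencoding</code> is still reachable through <code class="filename">sys</code> after startup; the value reported depends on your Python version and configuration, so no particular result is assumed here. The last lines show the alternative that works regardless of the default: encoding explicitly.

```python
import sys

# Whatever the interpreter was configured with at startup.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# After startup the setter is normally no longer available on sys.
setter_available = hasattr(sys, 'setdefaultencoding')
print(setter_available)

# An explicit encode does not depend on the default at all.
s = u'La Pe\xf1a'
explicit = s.encode('iso-8859-1')
print(repr(explicit))
```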
<div class="example"><h3>Example 9.17. Specifying encoding in <code class="filename">.py</code> files</h3>
<p>If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual <code class="filename">.py</code> file by putting an encoding declaration at the top of each file. This declaration defines the <code class="filename">.py</code> file to be UTF-8:<pre class="programlisting">
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
</pre><p>Now, what about <acronym>XML</acronym>? Well, every <acronym>XML</acronym> document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R
is popular for Russian texts. The encoding, if specified, is in the header of the <acronym>XML</acronym> document.
<div class="example"><h3>Example 9.18. <code class="filename">russiansample.xml</code></h3><pre class="screen"><samp class="computeroutput">
&lt;?xml version="1.0" encoding="koi8-r"?> </samp><img id="kgp.unicode.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
&lt;preface>
&lt;title>&#1055;&#1088;&#1077;&#1076;&#1080;&#1089;&#1083;&#1086;&#1074;&#1080;&#1077;&lt;/title> </samp><img id="kgp.unicode.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
&lt;/preface></samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a sample extract from a real Russian <acronym>XML</acronym> document; it's part of a Russian translation of this very book. Note the encoding, <code>koi8-r</code>, specified in the header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">These are Cyrillic characters which, as far as I know, spell the Russian word for &#8220;Preface&#8221;. If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded
using the <code>koi8-r</code> encoding scheme, but they're being displayed in <code>iso-8859-1</code>.
</td>
</tr>
</table>
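<p>The &#8220;gibberish&#8221; effect is easy to reproduce: the bytes on disk are the same either way, and only the table used to interpret them changes. In this sketch the title is written with unicode escapes so that the source file itself stays pure <acronym>ASCII</acronym>; the variable names are illustrative.

```python
# The Russian title, spelled out as unicode code points.
title = u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'

# What actually gets stored in a koi8-r file: one byte per character.
raw = title.encode('koi8-r')

# An editor that assumes iso-8859-1 shows accented Latin letters instead.
misread = raw.decode('iso-8859-1')

# Reading the same bytes with the right table recovers the original.
recovered = raw.decode('koi8-r')

print(len(raw))
print(misread == title)
print(recovered == title)
```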
<div class="example"><h3>Example 9.19. Parsing <code class="filename">russiansample.xml</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>from xml.dom import minidom
<samp class="prompt">>>> </samp>xmldoc = minidom.parse('russiansample.xml') <img id="kgp.unicode.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>title = xmldoc.getElementsByTagName('title')[0].firstChild.data
<samp class="prompt">>>> </samp>title <img id="kgp.unicode.6.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
<samp class="prompt">>>> </samp>print title <img id="kgp.unicode.6.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
<samp class="prompt">>>> </samp>convertedtitle = title.encode('koi8-r') <img id="kgp.unicode.6.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>convertedtitle
'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'
<samp class="prompt">>>> </samp>print convertedtitle <img id="kgp.unicode.6.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
&#1055;&#1088;&#1077;&#1076;&#1080;&#1089;&#1083;&#1086;&#1074;&#1080;&#1077;</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.6.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">I'm assuming here that you saved the previous example as <code class="filename">russiansample.xml</code> in the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back
to <code>'ascii'</code> by removing your <code class="filename">sitecustomize.py</code> file, or at least commenting out the <code class="function">setdefaultencoding</code> line.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.6.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that the text data of the <acronym>XML</acronym> document's <code class="sgmltag-element">title</code> element is stored in unicode. (It is now in the <code class="varname">title</code> variable, thanks to that long concatenation of Python functions which I hastily skipped over and, annoyingly, won't explain until the next section.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.6.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Printing the title fails. To print a unicode string, Python first has to convert it to the default encoding, <acronym>ASCII</acronym>, and this string contains characters that have no <acronym>ASCII</acronym> equivalent.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.6.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can, however, explicitly convert it to <code>koi8-r</code>, in which case you get a (regular, not unicode) string of single-byte characters (<code>f0</code>, <code>d2</code>, <code>c5</code>, and so forth) that are the <code>koi8-r</code>-encoded versions of the characters in the original unicode string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.unicode.6.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Printing the <code>koi8-r</code>-encoded string will probably show gibberish on your screen, because your Python <acronym>IDE</acronym> is interpreting those characters as <code>iso-8859-1</code>, not <code>koi8-r</code>. But at least they do print. (And, if you look carefully, it's the same gibberish that you saw when you opened the original
<acronym>XML</acronym> document in a non-unicode-aware text editor. Python converted it from <code>koi8-r</code> into unicode when it parsed the <acronym>XML</acronym> document, and you've just converted it back.)
</td>
</tr>
</table>
<p>To sum up, unicode itself is a bit intimidating if you've never seen it before, but unicode data is really very easy to handle
in Python. If your <acronym>XML</acronym> documents are all 7-bit <acronym>ASCII</acronym> (like the examples in this chapter), you will literally never think about unicode. Python will convert the <acronym>ASCII</acronym> data in the <acronym>XML</acronym> documents into unicode while parsing, and auto-coerce it back to <acronym>ASCII</acronym> whenever necessary, and you'll never even notice. But if you need to deal with text in other languages, Python is ready.
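<p>The round trip in Example 9.19 can be sketched without the <acronym>XML</acronym> file. This is a minimal sketch that assumes Python 3, where every string is already unicode and encoded data is a separate <code>bytes</code> type, unlike the Python 2 sessions shown in this chapter. The variable names are illustrative only.

```python
# The Russian word from the sample document, written with unicode escapes.
title = '\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'

# Encoding to ASCII fails: none of these characters exist in ASCII.
try:
    title.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to koi8-r succeeds and yields one byte per character.
converted = title.encode('koi8-r')

# Decoding those bytes with the wrong encoding produces gibberish;
# decoding with the right one restores the original string.
gibberish = converted.decode('iso-8859-1')
restored = converted.decode('koi8-r')

print(ascii_failed)
print(converted)
print(restored == title)
```

<p>The bytes in <code>converted</code> are the same <code>f0</code>, <code>d2</code>, <code>c5</code> sequence shown in the example, and <code>gibberish</code> is what a viewer sees when it assumes <code>iso-8859-1</code>.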
<div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li><a href="http://www.unicode.org/">Unicode.org</a> is the home page of the unicode standard, including a brief <a href="http://www.unicode.org/standard/principles.html">technical introduction</a>.
<li><a href="http://www.reportlab.com/i18n/python_unicode_tutorial.html">Unicode Tutorial</a> has some more examples of how to use Python's unicode functions, including how to force Python to coerce unicode into <acronym>ASCII</acronym> even when it doesn't really want to.
<li><a href="http://www.python.org/peps/pep-0263.html">PEP 263</a> goes into more detail about how and when to define a character encoding in your <code class="filename">.py</code> files.
</ul>
<h2 id="kgp.search">9.5. Searching for elements</h2>
<p>Traversing <acronym>XML</acronym> documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within
your <acronym>XML</acronym> document, there is a shortcut you can use to find it quickly: <code class="function">getElementsByTagName</code>.
<p>For this section, you'll be using the <code class="filename">binary.xml</code> grammar file, which looks like this:
<div class="example"><h3>Example 9.20. <code class="filename">binary.xml</code></h3><pre class="screen"><samp class="computeroutput">&lt;?xml version="1.0"?>
&lt;!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
&lt;grammar>
&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref>
&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref>
&lt;/grammar></samp></pre><p>It has two <code class="sgmltag-element">ref</code>s, <code>'bit'</code> and <code>'byte'</code>. A <code>bit</code> is either a <code>'0'</code> or <code>'1'</code>, and a <code>byte</code> is 8 <code>bit</code>s.
<div class="example"><h3>Example 9.21. Introducing <code class="function">getElementsByTagName</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>from xml.dom import minidom
<samp class="prompt">>>> </samp>xmldoc = minidom.parse('binary.xml')
<samp class="prompt">>>> </samp>reflist = xmldoc.getElementsByTagName('ref') <img id="kgp.search.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>reflist
[&lt;DOM Element: ref at 136138108>, &lt;DOM Element: ref at 136144292>]
<samp class="prompt">>>> </samp>print reflist[0].toxml()
<samp class="computeroutput">&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref></samp>
<samp class="prompt">>>> </samp>print reflist[1].toxml()
<samp class="computeroutput">&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref>
</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.search.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">getElementsByTagName</code> takes one argument, the name of the element you wish to find. It returns a list of <code class="classname">Element</code> objects, corresponding to the <acronym>XML</acronym> elements that have that name. In this case, you find two <code>ref</code> elements.
</td>
</tr>
</table>
<div class="example"><h3>Example 9.22. Every element is searchable</h3><pre class="screen">
<samp class="prompt">>>> </samp>firstref = reflist[0] <img id="kgp.search.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print firstref.toxml()
<samp class="computeroutput">&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref></samp>
<samp class="prompt">>>> </samp>plist = firstref.getElementsByTagName("p") <img id="kgp.search.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>plist
[&lt;DOM Element: p at 136140116>, &lt;DOM Element: p at 136142172>]
<samp class="prompt">>>> </samp>print plist[0].toxml() <img id="kgp.search.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;p>0&lt;/p>
<samp class="prompt">>>> </samp>print plist[1].toxml()
&lt;p>1&lt;/p></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.search.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Continuing from the previous example, the first object in your <code class="varname">reflist</code> is the <code>'bit'</code> <code class="sgmltag-element">ref</code> element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.search.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can use the same <code class="function">getElementsByTagName</code> method on this <code class="classname">Element</code> to find all the <code class="sgmltag-element">&lt;p></code> elements within the <code>'bit'</code> <code class="sgmltag-element">ref</code> element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.search.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Just as before, the <code class="function">getElementsByTagName</code> method returns a list of all the elements it found. In this case, you have two, one for each bit.
</td>
</tr>
</table>
<div class="example"><h3>Example 9.23. Searching is actually recursive</h3><pre class="screen">
<samp class="prompt">>>> </samp>plist = xmldoc.getElementsByTagName("p") <img id="kgp.search.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>plist
[&lt;DOM Element: p at 136140116>, &lt;DOM Element: p at 136142172>, &lt;DOM Element: p at 136146124>]
<samp class="prompt">>>> </samp>plist[0].toxml() <img id="kgp.search.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'&lt;p>0&lt;/p>'
<samp class="prompt">>>> </samp>plist[1].toxml()
'&lt;p>1&lt;/p>'
<samp class="prompt">>>> </samp>plist[2].toxml() <img id="kgp.search.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">'&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>'</span></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.search.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note carefully the difference between this and the previous example. Previously, you were searching for <code class="sgmltag-element">p</code> elements within <code class="varname">firstref</code>, but here you are searching for <code class="sgmltag-element">p</code> elements within <code class="varname">xmldoc</code>, the root-level object that represents the entire <acronym>XML</acronym> document. This <em>does</em> find the <code class="sgmltag-element">p</code> elements nested within the <code class="sgmltag-element">ref</code> elements within the root <code class="sgmltag-element">grammar</code> element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.search.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first two <code class="sgmltag-element">p</code> elements are within the first <code class="sgmltag-element">ref</code> (the <code>'bit'</code> <code class="sgmltag-element">ref</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.search.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The last <code class="sgmltag-element">p</code> element is the one within the second <code class="sgmltag-element">ref</code> (the <code>'byte'</code> <code class="sgmltag-element">ref</code>).
</td>
</tr>
</table>
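<p>The difference between searching from one element and searching from the whole document can be checked by counting results. The following is a sketch that assumes Python 3 and builds a trimmed copy of <code class="filename">binary.xml</code> in a string (without the <code>DOCTYPE</code> line), so it runs without any file on disk. The names <code>BINARY_XML</code>, <code>bit_ps</code>, <code>all_ps</code>, and <code>all_xrefs</code> are made up for this illustration.

```python
from xml.dom import minidom

# A trimmed copy of binary.xml; the byte ref holds eight xref elements.
BINARY_XML = (
    '<grammar>'
    '<ref id="bit"><p>0</p><p>1</p></ref>'
    '<ref id="byte"><p>' + '<xref id="bit"/>' * 8 + '</p></ref>'
    '</grammar>'
)

xmldoc = minidom.parseString(BINARY_XML)
reflist = xmldoc.getElementsByTagName('ref')

# Searching from one element only looks inside that element.
bit_ps = reflist[0].getElementsByTagName('p')

# Searching from the document looks at every level of nesting.
all_ps = xmldoc.getElementsByTagName('p')
all_xrefs = xmldoc.getElementsByTagName('xref')

print(len(reflist), len(bit_ps), len(all_ps), len(all_xrefs))
```

<p>The document-level search finds the <code class="sgmltag-element">xref</code> elements too, even though they sit three levels below the root.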
<h2 id="kgp.attributes">9.6. Accessing element attributes</h2>
<p><acronym>XML</acronym> elements can have one or more attributes, and it is incredibly simple to access them once you have parsed an <acronym>XML</acronym> document.
<p>For this section, you'll be using the <code class="filename">binary.xml</code> grammar file that you saw in the <a href="#kgp.search" title="9.5. Searching for elements">previous section</a>.<table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">This section may be a little confusing, because of some overlapping terminology. Elements in an <acronym>XML</acronym> document have attributes, and Python objects also have attributes. When you parse an <acronym>XML</acronym> document, you get a bunch of Python objects that represent all the pieces of the <acronym>XML</acronym> document, and some of these Python objects represent attributes of the <acronym>XML</acronym> elements. But the (Python) objects that represent the (<acronym>XML</acronym>) attributes also have (Python) attributes, which are used to access various parts of the (<acronym>XML</acronym>) attribute that the object represents. I told you it was confusing. I am open to suggestions on how to distinguish these
more clearly.
</td>
</tr>
</table>
<div class="example"><h3>Example 9.24. Accessing element attributes</h3><pre class="screen">
<samp class="prompt">>>> </samp>xmldoc = minidom.parse('binary.xml')
<samp class="prompt">>>> </samp>reflist = xmldoc.getElementsByTagName('ref')
<samp class="prompt">>>> </samp>bitref = reflist[0]
<samp class="prompt">>>> </samp>print bitref.toxml()
<samp class="computeroutput">&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref></samp>
<samp class="prompt">>>> </samp>bitref.attributes <img id="kgp.attributes.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;xml.dom.minidom.NamedNodeMap instance at 0x81e0c9c>
<samp class="prompt">>>> </samp>bitref.attributes.keys() <img id="kgp.attributes.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"> <img id="kgp.attributes.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
[u'id']
<samp class="prompt">>>> </samp>bitref.attributes.values() <img id="kgp.attributes.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
[&lt;xml.dom.minidom.Attr instance at 0x81d5044>]
<samp class="prompt">>>> </samp>bitref.attributes["id"] <img id="kgp.attributes.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
&lt;xml.dom.minidom.Attr instance at 0x81d5044></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.attributes.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each <code class="classname">Element</code> object has an attribute called <code>attributes</code>, which is a <code class="classname">NamedNodeMap</code> object. This sounds scary, but it's not, because a <code class="classname">NamedNodeMap</code> is an object that <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">acts like a dictionary</a>, so you already know how to use it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.attributes.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Treating the <code class="classname">NamedNodeMap</code> as a dictionary, you can get a list of the names of the attributes of this element by using <code class="function">attributes.keys()</code>. This element has only one attribute, <code>'id'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.attributes.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Attribute names, like all other text in an <acronym>XML</acronym> document, are stored in <a href="#kgp.unicode" title="9.4. Unicode">unicode</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.attributes.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Again treating the <code class="classname">NamedNodeMap</code> as a dictionary, you can get a list of the values of the attributes by using <code class="function">attributes.values()</code>. The values are themselves objects, of type <code class="classname">Attr</code>. You'll see how to get useful information out of this object in the next example.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.attributes.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Still treating the <code class="classname">NamedNodeMap</code> as a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been
paying extra-close attention will already know how the <code class="classname">NamedNodeMap</code> class accomplishes this neat trick: by defining a <a href="#fileinfo.specialmethods" title="5.6. Special Class Methods"><code class="function">__getitem__</code> special method</a>. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.)
</td>
</tr>
</table>
<div class="example"><h3>Example 9.25. Accessing individual attributes</h3><pre class="screen">
<samp class="prompt">>>> </samp>a = bitref.attributes["id"]
<samp class="prompt">>>> </samp>a
&lt;xml.dom.minidom.Attr instance at 0x81d5044>
<samp class="prompt">>>> </samp>a.name <img id="kgp.attributes.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
u'id'
<samp class="prompt">>>> </samp>a.value <img id="kgp.attributes.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
u'bit'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.attributes.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="classname">Attr</code> object completely represents a single <acronym>XML</acronym> attribute of a single <acronym>XML</acronym> element. The name of the attribute (the same name as you used to find this object in the <code>bitref.attributes</code> <code class="classname">NamedNodeMap</code> pseudo-dictionary) is stored in <code>a.name</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.attributes.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The actual text value of this <acronym>XML</acronym> attribute is stored in <code>a.value</code>.
</td>
</tr>
</table>
</div><table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Like a dictionary, attributes of an <acronym>XML</acronym> element have no ordering. Attributes may <em>happen to be</em> listed in a certain order in the original <acronym>XML</acronym> document, and the <code class="classname">Attr</code> objects may <em>happen to be</em> listed in a certain order when the <acronym>XML</acronym> document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes
by name, like the keys of a dictionary.
</td>
</tr>
</table>
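<p>Putting the two examples together, here is a short sketch of reading an attribute by name. It assumes Python 3 and an inline <acronym>XML</acronym> string rather than the <code class="filename">binary.xml</code> file. It also uses <code class="function">getAttribute</code>, a standard <acronym>DOM</acronym> method on <code class="classname">Element</code> that this section does not cover, as a shortcut for the same lookup.

```python
from xml.dom import minidom

xmldoc = minidom.parseString(
    '<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>')
bitref = xmldoc.getElementsByTagName('ref')[0]

# The NamedNodeMap acts like a dictionary keyed by attribute name.
names = list(bitref.attributes.keys())
attr = bitref.attributes['id']

# getAttribute returns the value string directly, skipping the Attr object.
shortcut = bitref.getAttribute('id')

print(names, attr.name, attr.value, shortcut)
```

<p>Both routes look the attribute up by name, so neither depends on the order in which attributes happen to be listed.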
<h2 id="kgp.segue">9.7. Segue</h2>
<p>OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on
other aspects that make the program more flexible: using streams for input processing, using <code class="function">getattr</code> for method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code.
<p>Before moving on to the next chapter, you should be comfortable doing all of these things:
<div class="itemizedlist">
<ul>
<li><a href="#kgp.parse" title="9.3. Parsing XML">Parsing <acronym>XML</acronym> documents</a> using <code class="filename">minidom</code>, <a href="#kgp.search" title="9.5. Searching for elements">searching through the parsed document</a>, and accessing arbitrary <a href="#kgp.attributes" title="9.6. Accessing element attributes">element attributes</a> and <a href="#kgp.child" title="10.4. Finding direct children of a node">element children</a>
<li>Organizing complex libraries into <a href="#kgp.packages" title="9.2. Packages">packages</a>
<li><a href="#kgp.unicode" title="9.4. Unicode">Converting unicode strings</a> to different character encodings
</ul>
<div class="footnotes"><br><hr width="100" align="left">
<div class="footnote">
<p><sup>[<a name="ftn.d0e23786" href="#d0e23786">5</a>] </sup>This, sadly, is <em>still</em> an oversimplification. Unicode now has been extended to handle ancient Chinese, Korean, and Japanese texts, which had so
many different characters that the 2-byte unicode system could not represent them all. But Python doesn't currently support that out of the box, and I don't know if there is a project afoot to add it. You've reached the
limits of my expertise, sorry.
<div class="chapter">
<h2 id="streams">Chapter 10. Scripts and Streams</h2>
<h2 id="kgp.openanything">10.1. Abstracting input sources</h2>
<p>One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the <em>file-like object</em>.
<p>Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close
it when they're done. But they don't. Instead, they take a <em>file-like object</em>.
<p>In the simplest case, a <em>file-like object</em> is any object with a <code class="function">read</code> method with an optional <code class="varname">size</code> parameter, which returns a string. When called with no <code class="varname">size</code> parameter, it reads everything there is to read from the input source and returns all the data as a single string. When
called with a <code class="varname">size</code> parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left
off and returns the next chunk of data.
<p>This is how <a href="#fileinfo.files" title="6.2. Working with File Objects">reading from real files</a> works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on
disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply
calls the object's <code class="function">read</code> method, the function can handle any kind of input source without specific code to handle each kind.
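<p>To make the idea concrete, here is a minimal sketch of a file-like object that is not a file at all. It assumes Python 3. The class <code class="classname">HardCodedSource</code> and the function <code class="function">count_characters</code> are hypothetical names invented for this illustration, not part of any library.

```python
class HardCodedSource:
    """A hypothetical file-like object backed by a string."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def read(self, size=-1):
        # With no size (or a negative one), return everything that is left.
        if size is None or size < 0:
            size = len(self._text) - self._pos
        chunk = self._text[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


def count_characters(source, chunk_size=4):
    """Consume any object with a read method, one chunk at a time."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


source = HardCodedSource('<grammar/>')
first = source.read(4)      # reads the first four characters
rest = source.read()        # picks up where the last read left off
print(first, rest)
print(count_characters(HardCodedSource('<grammar/>')))
```

<p><code class="function">count_characters</code> never checks what kind of object it was given. It only calls <code class="function">read</code>, so a real file object would work just as well.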
<p>In case you were wondering how this relates to <acronym>XML</acronym> processing, <code class="function">minidom.parse</code> is one such function which can take a file-like object.
<div class="example"><h3>Example 10.1. Parsing <acronym>XML</acronym> from a file</h3><pre class="screen">
<samp class="prompt">>>> </samp>from xml.dom import minidom
<samp class="prompt">>>> </samp>fsock = open('binary.xml') <img id="kgp.openanything.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>xmldoc = minidom.parse(fsock) <img id="kgp.openanything.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>fsock.close() <img id="kgp.openanything.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print xmldoc.toxml() <img id="kgp.openanything.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">&lt;?xml version="1.0" ?>
&lt;grammar>
&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref>
&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref>
&lt;/grammar></samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, you open the file on disk. This gives you a <a href="#fileinfo.files" title="6.2. Working with File Objects">file object</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You pass the file object to <code class="function">minidom.parse</code>, which calls the <code class="function">read</code> method of <code class="varname">fsock</code> and reads the <acronym>XML</acronym> document from the file on disk.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Be sure to call the <code class="function">close</code> method of the file object after you're done with it. <code class="function">minidom.parse</code> will not do this for you.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling the <code class="methodname">toxml()</code> method on the returned <acronym>XML</acronym> document returns the entire document as a string, which <code>print</code> then displays.
</td>
</tr>
</table>
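<p>Since forgetting to close the file is an easy mistake, one alternative is to let a <code>with</code> block do it. This is a sketch that assumes Python 3 and writes a small stand-in for <code class="filename">binary.xml</code> to a temporary directory so that it is self-contained. The temporary path is an assumption of the sketch, not something the example above requires.

```python
import os
import tempfile
from xml.dom import minidom

# Write a small stand-in for binary.xml to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'binary.xml')
with open(path, 'w') as out:
    out.write('<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>')

# The with block closes fsock on exit, even if parsing raises an exception.
with open(path, 'rb') as fsock:
    xmldoc = minidom.parse(fsock)

print(fsock.closed)
print(xmldoc.documentElement.tagName)
```

<p>The parsed document stays usable after the file is closed, because <code class="function">minidom.parse</code> has already read everything it needs.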
<p>Well, that all seems like a colossal waste of time. After all, you've already seen that <code class="function">minidom.parse</code> can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're
just going to be parsing a local file, you can pass the filename and <code class="function">minidom.parse</code> is smart enough to Do The Right Thing&#8482;. But notice how similar -- and easy -- it is to parse an <acronym>XML</acronym> document straight from the Internet.
<div class="example"><h3 id="kgp.openanything.urllib">Example 10.2. Parsing <acronym>XML</acronym> from a <acronym>URL</acronym></h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib
<samp class="prompt">>>> </samp>usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') <img id="kgp.openanything.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>xmldoc = minidom.parse(usock) <img id="kgp.openanything.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>usock.close() <img id="kgp.openanything.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print xmldoc.toxml() <img id="kgp.openanything.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">&lt;?xml version="1.0" ?>
&lt;rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
&lt;channel>
&lt;title>Slashdot&lt;/title>
&lt;link>http://slashdot.org/&lt;/link>
&lt;description>News for nerds, stuff that matters&lt;/description>
&lt;/channel>
&lt;image>
&lt;title>Slashdot&lt;/title>
&lt;url>http://images.slashdot.org/topics/topicslashdot.gif&lt;/url>
&lt;link>http://slashdot.org/&lt;/link>
&lt;/image>
&lt;item>
&lt;title>To HDTV or Not to HDTV?&lt;/title>
&lt;link>http://slashdot.org/article.pl?sid=01/12/28/0421241&lt;/link>
&lt;/item>
[...snip...]</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw <a href="#dialect.extract.urllib" title="Example 8.5. Introducing urllib">in a previous chapter</a>, <code class="function">urlopen</code> takes a web page <acronym>URL</acronym> and returns a file-like object. Most importantly, this object has a <code class="function">read</code> method which returns the <acronym>HTML</acronym> source of the web page.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you pass the file-like object to <code class="function">minidom.parse</code>, which obediently calls the <code class="function">read</code> method of the object and parses the <acronym>XML</acronym> data that the <code class="function">read</code> method returns. The fact that this <acronym>XML</acronym> data is now coming straight from a web page is completely irrelevant. <code class="function">minidom.parse</code> doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As soon as you're done with it, be sure to close the file-like object that <code class="function">urlopen</code> gives you.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">By the way, this <acronym>URL</acronym> is real, and it really is <acronym>XML</acronym>. It's an <acronym>XML</acronym> representation of the current headlines on <a href="http://slashdot.org/">Slashdot</a>, a technical news and gossip site.
</td>
</tr>
</table>
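<p>The same pattern can be tried without a network connection. The sketch below assumes Python 3, where <code class="function">urlopen</code> lives in <code class="filename">urllib.request</code>, and points it at a <code>file:</code> <acronym>URL</acronym> for a temporary file instead of a live web site. The file name <code class="filename">feed.xml</code> and its contents are made up for the illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen
from xml.dom import minidom

# Create a tiny XML file to stand in for a remote feed.
path = Path(tempfile.mkdtemp()) / 'feed.xml'
path.write_text('<channel><title>Example</title></channel>')

# as_uri() turns the path into a file: URL, so no network is needed.
usock = urlopen(path.as_uri())
try:
    xmldoc = minidom.parse(usock)
finally:
    # Close the file-like object whether or not parsing succeeded.
    usock.close()

title = xmldoc.getElementsByTagName('title')[0].firstChild.data
print(title)
```

<p>Swapping the <code>file:</code> <acronym>URL</acronym> for an <code>http:</code> one changes nothing in the parsing code, which is the point of the example above.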
<div class="example"><h3>Example 10.3. Parsing <acronym>XML</acronym> from a string (the easy but inflexible way)</h3><pre class="screen">
<samp class="prompt">>>> </samp>contents = "&lt;grammar>&lt;ref id='bit'>&lt;p>0&lt;/p>&lt;p>1&lt;/p>&lt;/ref>&lt;/grammar>"
<samp class="prompt">>>> </samp>xmldoc = minidom.parseString(contents) <img id="kgp.openanything.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print xmldoc.toxml()
<samp class="computeroutput">&lt;?xml version="1.0" ?>
&lt;grammar>&lt;ref id="bit">&lt;p>0&lt;/p>&lt;p>1&lt;/p>&lt;/ref>&lt;/grammar></span></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">minidom</code> has a function, <code class="function">parseString</code>, which takes an entire <acronym>XML</acronym> document as a string and parses it. You can use this instead of <code class="function">minidom.parse</code> if you know you already have your entire <acronym>XML</acronym> document in a string.
</td>
</tr>
</table>
<p>OK, so you can use the <code class="function">minidom.parse</code> function for parsing both local files and remote <acronym>URL</acronym>s, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a
file, a <acronym>URL</acronym>, or a string, you'll need special logic to check whether it's a string, and call the <code class="function">parseString</code> function instead. How unsatisfying.
<p>If there were a way to turn a string into a file-like object, then you could simply pass this object to <code class="function">minidom.parse</code>. And in fact, there is a module specifically designed for doing just that: <code class="filename">StringIO</code>.
<div class="example"><h3 id="kgp.openanything.stringio.example">Example 10.4. Introducing <code class="filename">StringIO</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>contents = "&lt;grammar>&lt;ref id='bit'>&lt;p>0&lt;/p>&lt;p>1&lt;/p>&lt;/ref>&lt;/grammar>"
<samp class="prompt">>>> </samp>import StringIO
<samp class="prompt">>>> </samp>ssock = StringIO.StringIO(contents) <img id="kgp.openanything.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>ssock.read() <img id="kgp.openanything.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
"&lt;grammar>&lt;ref id='bit'>&lt;p>0&lt;/p>&lt;p>1&lt;/p>&lt;/ref>&lt;/grammar>"
<samp class="prompt">>>> </samp>ssock.read() <img id="kgp.openanything.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
''
<samp class="prompt">>>> </samp>ssock.seek(0) <img id="kgp.openanything.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>ssock.read(15) <img id="kgp.openanything.4.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
'&lt;grammar>&lt;ref i'
<samp class="prompt">>>> </samp>ssock.read(15)
"d='bit'>&lt;p>0&lt;/p"
<samp class="prompt">>>> </samp>ssock.read() <img id="kgp.openanything.4.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
'>&lt;p>1&lt;/p>&lt;/ref>&lt;/grammar>'
<samp class="prompt">>>> </samp>ssock.close()</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="filename">StringIO</code> module contains a single class, also called <code class="classname">StringIO</code>, which allows you to turn a string into a file-like object. The <code class="classname">StringIO</code> class takes the string as a parameter when creating an instance.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you have a file-like object, and you can do all sorts of file-like things with it. Like <code class="function">read</code>, which returns the original string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling <code class="function">read</code> again returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any
more without explicitly seeking to the beginning of the file. The <code class="classname">StringIO</code> object works the same way.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can explicitly seek to the beginning of the string, just like seeking through a file, by using the <code class="function">seek</code> method of the <code class="classname">StringIO</code> object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.4.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can also read the string in chunks, by passing a <code class="varname">size</code> parameter to the <code class="function">read</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.4.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">At any time, <code class="function">read</code> will return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term
<em>file-like object</em>.
</td>
</tr>
</table>
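<p>If you are following along in Python 3, here is a minimal sketch of the same session. It assumes Python 3, where the <code class="classname">StringIO</code> class lives in the <code class="filename">io</code> module rather than in a module of its own. The variable names other than <code class="varname">contents</code> and <code class="varname">ssock</code> are made up for illustration.

```python
# Sketch for Python 3: StringIO is found in the io module there.
import io

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
ssock = io.StringIO(contents)

everything = ssock.read()      # the whole string
nothing_left = ssock.read()    # '' because the position is now at the end

ssock.seek(0)                  # rewind to the beginning
first = ssock.read(15)         # the first 15 characters
second = ssock.read(15)        # the 15 characters after that
rest = ssock.read()            # whatever has not been read yet
ssock.close()
```

<p>Joining <code class="varname">first</code>, <code class="varname">second</code>, and <code class="varname">rest</code> gives back the original string, which is the same behavior you would get from chunked reads on a real file.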
<div class="example"><h3>Example 10.5. Parsing <acronym>XML</acronym> from a string (the file-like object way)</h3><pre class="screen">
<samp class="prompt">>>> </samp>contents = "&lt;grammar>&lt;ref id='bit'>&lt;p>0&lt;/p>&lt;p>1&lt;/p>&lt;/ref>&lt;/grammar>"
<samp class="prompt">>>> </samp>ssock = StringIO.StringIO(contents)
<samp class="prompt">>>> </samp>xmldoc = minidom.parse(ssock) <img id="kgp.openanything.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>ssock.close()
<samp class="prompt">>>> </samp>print xmldoc.toxml()
<samp class="computeroutput">&lt;?xml version="1.0" ?>
&lt;grammar>&lt;ref id="bit">&lt;p>0&lt;/p>&lt;p>1&lt;/p>&lt;/ref>&lt;/grammar></samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can pass the file-like object (really a <code class="classname">StringIO</code>) to <code class="function">minidom.parse</code>, which will call the object's <code class="function">read</code> method and happily parse away, never knowing that its input came from a hard-coded string.
</td>
</tr>
</table>
<p>So now you know how to use a single function, <code class="function">minidom.parse</code>, to parse an <acronym>XML</acronym> document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use <code class="function">urlopen</code> to get a file-like object; for a local file, you use <code class="function">open</code>; and for a string, you use <code class="classname">StringIO</code>. Now let's take it one step further and generalize <em>these</em> differences as well.
<div class="example"><h3 id="kgp.openanything.example">Example 10.6. <code class="function">openAnything</code></h3><pre class="programlisting">
def openAnything(source):<img id="kgp.openanything.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source) <img id="kgp.openanything.6.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    except (IOError, OSError):
        pass
    # try to open with native open function (if source is pathname)
    try:
        return open(source) <img id="kgp.openanything.6.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
    except (IOError, OSError):
        pass
    # treat source as string
    import StringIO
    return StringIO.StringIO(str(source)) <img id="kgp.openanything.6.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.6.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">openAnything</code> function takes a single parameter, <code class="varname">source</code>, and returns a file-like object. <code class="varname">source</code> is a string of some sort; it can either be a <acronym>URL</acronym> (like <code>'http://slashdot.org/slashdot.rdf'</code>), a full or partial pathname to a local file (like <code>'binary.xml'</code>), or a string that contains actual <acronym>XML</acronym> data to be parsed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.6.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, you see if <code class="varname">source</code> is a <acronym>URL</acronym>. You do this through brute force: you try to open it as a <acronym>URL</acronym> and silently ignore errors caused by trying to open something which is not a <acronym>URL</acronym>. This is actually elegant in the sense that, if <code class="filename">urllib</code> ever supports new types of <acronym>URL</acronym>s in the future, you will also support them without recoding. If <code class="filename">urllib</code> is able to open <code class="varname">source</code>, then the <code>return</code> kicks you out of the function immediately and the following <code>try</code> statements never execute.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.6.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">On the other hand, if <code class="filename">urllib</code> yelled at you and told you that <code class="varname">source</code> wasn't a valid <acronym>URL</acronym>, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether <code class="varname">source</code> is a valid filename or not (the rules for valid filenames vary wildly between platforms, so you'd probably
get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.openanything.6.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">By this point, you need to assume that <code class="varname">source</code> is a string that has hard-coded data in it (since nothing else worked), so you use <code class="classname">StringIO</code> to create a file-like object out of it and return that. (In fact, since you're using the <code class="function">str</code> function, <code class="varname">source</code> doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its <code class="function">__str__</code> <a href="#fileinfo.morespecial" title="5.7. Advanced Special Class Methods">special method</a>.)
</td>
</tr>
</table>
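<p>The same try-it-and-see approach can be sketched for Python 3. This is an adaptation, not the book's <code class="filename">toolbox.py</code>: the name <code class="function">open_anything</code> is hypothetical, and it assumes Python 3, where <code class="function">urlopen</code> lives in <code class="filename">urllib.request</code> and raises <code>ValueError</code> (rather than an I/O error) for a string that has no <acronym>URL</acronym> scheme, so both exception types are trapped.

```python
# Sketch for Python 3 (an adaptation, not the book's toolbox.py).
import io
import urllib.request
from xml.dom import minidom


def open_anything(source):
    """Return a file-like object for a URL, a local path, or literal data."""
    # Try urllib first. A string with no URL scheme raises ValueError;
    # a URL that cannot be fetched raises URLError, a kind of OSError.
    try:
        return urllib.request.urlopen(source)
    except (OSError, ValueError):
        pass
    # Next, treat source as a pathname on disk.
    try:
        return open(source)
    except (OSError, ValueError):
        pass
    # Nothing else worked, so wrap the string itself in a file-like object.
    return io.StringIO(str(source))


# Usage: literal XML falls through to the StringIO branch.
sock = open_anything("<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>")
xmldoc = minidom.parse(sock)
sock.close()
```

<p>One difference to keep in mind: in Python 3 the <acronym>URL</acronym> branch yields bytes while the other two branches yield text. <code class="function">minidom.parse</code> copes with either, but code that calls <code class="function">read</code> directly would need to care.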
<p>Now you can use this <code class="function">openAnything</code> function in conjunction with <code class="function">minidom.parse</code> to make a function that takes a <code class="varname">source</code> that refers to an <acronym>XML</acronym> document somehow (either as a <acronym>URL</acronym>, or a local filename, or a hard-coded <acronym>XML</acronym> document in a string) and parses it.
<div class="example"><h3>Example 10.7. Using <code class="function">openAnything</code> in <code class="filename">kgp.py</code></h3><pre class="programlisting">
class KantGenerator:
    def _load(self, source):
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc</pre><h2 id="kgp.stdio">10.2. Standard input, output, and error</h2>
<p><acronym>UNIX</acronym> users are already familiar with the concept of standard input, standard output, and standard error. This section is for
the rest of you.
<p>Standard output and standard error (commonly abbreviated <code>stdout</code> and <code>stderr</code>) are pipes that are built into every <acronym>UNIX</acronym> system. When you <code class="function">print</code> something, it goes to the <code>stdout</code> pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the <code>stderr</code> pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program
prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system
with a window-based Python <acronym>IDE</acronym>, <code>stdout</code> and <code>stderr</code> default to your &#8220;Interactive Window&#8221;.)
<div class="example"><h3>Example 10.8. Introducing <code>stdout</code> and <code>stderr</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>for i in range(3):
<samp class="prompt">... </samp>    print 'Dive in' <img id="kgp.stdio.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">Dive in
Dive in
Dive in</samp>
<samp class="prompt">>>> </samp>import sys
<samp class="prompt">>>> </samp>for i in range(3):
<samp class="prompt">... </samp>    sys.stdout.write('Dive in') <img id="kgp.stdio.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
Dive inDive inDive in
<samp class="prompt">>>> </samp>for i in range(3):
<samp class="prompt">... </samp>    sys.stderr.write('Dive in') <img id="kgp.stdio.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
Dive inDive inDive in</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#fileinfo.for.counter" title="Example 6.9. Simple Counters">Example 6.9, &#8220;Simple Counters&#8221;</a>, you can use Python's built-in <code class="function">range</code> function to build simple counter loops that repeat something a set number of times.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>stdout</code> is a file-like object; calling its <code class="function">write</code> method will print out whatever string you give it. In fact, this is what the <code class="function">print</code> statement really does; it adds a newline to the end of the string you're printing, and calls <code class="function">sys.stdout.write</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In the simplest case, <code>stdout</code> and <code>stderr</code> send their output to the same place: the Python <acronym>IDE</acronym> (if you're in one), or the terminal (if you're running Python from the command line). Like <code>stdout</code>, <code>stderr</code> does not add newlines for you; if you want them, add them yourself.
</td>
</tr>
</table>
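<p>To see the difference between writing and printing without relying on what the terminal shows, you can record the calls yourself. The sketch below assumes Python 3, where <code class="function">print</code> is a function that accepts a <code>file</code> argument; the <code class="classname">Recorder</code> class is a hypothetical stand-in for a write-only file-like object.

```python
# Sketch for Python 3, where print is a function that takes a file argument.
class Recorder:
    """Hypothetical write-only file-like object that remembers each write."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)


rec = Recorder()
for i in range(3):
    rec.write('Dive in')          # write adds nothing of its own
written = ''.join(rec.chunks)

rec2 = Recorder()
for i in range(3):
    print('Dive in', file=rec2)   # print appends a newline each time
printed = ''.join(rec2.chunks)
```

<p>Here <code class="varname">written</code> is one run-together line, while <code class="varname">printed</code> has a newline after each <code>Dive in</code>. Note that <code class="classname">Recorder</code> needs nothing but a <code class="function">write</code> method to be accepted as an output target.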
<p><code>stdout</code> and <code>stderr</code> are both file-like objects, like the ones discussed in <a href="#kgp.openanything" title="10.1. Abstracting input sources">Section 10.1, &#8220;Abstracting input sources&#8221;</a>, but they are both write-only. They have no <code class="function">read</code> method, only <code class="function">write</code>. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output.
<div class="example"><h3>Example 10.9. Redirecting output</h3><pre class="screen">
<samp class="prompt">[you@localhost kgp]$ </samp>python stdout.py
Dive in
<samp class="prompt">[you@localhost kgp]$ </samp>cat out.log
This message will be logged instead of displayed</pre><p>(On Windows, you can use <code>type</code> instead of <code>cat</code> to display the contents of a file.)
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
#stdout.py
import sys
print 'Dive in' <img id="kgp.stdio.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
saveout = sys.stdout <img id="kgp.stdio.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
fsock = open('out.log', 'w') <img id="kgp.stdio.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
sys.stdout = fsock <img id="kgp.stdio.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
print 'This message will be logged instead of displayed' <img id="kgp.stdio.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
sys.stdout = saveout <img id="kgp.stdio.2.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
fsock.close() <img id="kgp.stdio.2.7" src="images/callouts/7.png" alt="7" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This will print to the <acronym>IDE</acronym> &#8220;Interactive Window&#8221; (or the terminal, if running the script from the command line).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Always save <code>stdout</code> before redirecting it, so you can set it back to normal later.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Open a file for writing. If the file doesn't exist, it will be created. If the file does exist, it will be overwritten.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Redirect all further output to the new file you just opened.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This will be &#8220;printed&#8221; to the log file only; it will not be visible in the <acronym>IDE</acronym> window or on the screen.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.2.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Set <code>stdout</code> back to the way it was before you mucked with it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.2.7"><img src="images/callouts/7.png" alt="7" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Close the log file.</td>
</tr>
</table>
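<p>The save, redirect, and restore steps can also be sketched in a form that survives errors. This assumes Python 3 and uses an in-memory <code class="classname">io.StringIO</code> object in place of the log file; the <code>try...finally</code> wrapper is an addition that the script above does not have.

```python
# Sketch for Python 3; io.StringIO stands in for the log file.
import io
import sys

log = io.StringIO()
saveout = sys.stdout           # remember the real stdout
sys.stdout = log               # redirect all further output
try:
    print('This message will be logged instead of displayed')
finally:
    sys.stdout = saveout       # always restore, even after an error

logged = log.getvalue()
```

<p>Putting the restore step in <code>finally</code> means that an exception raised while output is redirected cannot leave <code>stdout</code> pointing at the log for the rest of the program.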
<p>Redirecting <code>stderr</code> works exactly the same way, using <code class="function">sys.stderr</code> instead of <code class="function">sys.stdout</code>.
<div class="example"><h3>Example 10.10. Redirecting error information</h3><pre class="screen">
<samp class="prompt">[you@localhost kgp]$ </samp>python stderr.py
<samp class="prompt">[you@localhost kgp]$ </samp>cat error.log
<samp class="computeroutput">Traceback (most recent call last):
File "stderr.py", line 5, in ?
raise Exception, 'this error will be logged'
Exception: this error will be logged</samp></pre><p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
#stderr.py
import sys
fsock = open('error.log', 'w') <img id="kgp.stdio.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
sys.stderr = fsock <img id="kgp.stdio.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
raise Exception, 'this error will be logged' <img id="kgp.stdio.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"> <img id="kgp.stdio.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Open the log file where you want to store debugging information.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Redirect standard error by assigning the file object of the newly-opened log file to <code>stderr</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Raise an exception. Note from the screen output that this does <em>not</em> print anything on screen. All the normal traceback information has been written to <code class="filename">error.log</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Also note that you're not explicitly closing your log file, nor are you setting <code>stderr</code> back to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for you, and it doesn't make any difference that <code>stderr</code> is never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important for <code>stdout</code>, if you expect to go do other stuff within the same script afterwards.
</td>
</tr>
</table>
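<p>If you want the program to keep running after logging the error, a different design is needed. The following sketch assumes Python 3 and swaps in two things the script above does not use: it catches the exception and reports it with <code class="function">traceback.print_exc</code>, which writes to whatever <code>sys.stderr</code> currently is, and it restores <code>stderr</code> afterwards. An in-memory <code class="classname">io.StringIO</code> object stands in for <code class="filename">error.log</code>.

```python
# Sketch for Python 3; io.StringIO stands in for error.log.
import io
import sys
import traceback

errlog = io.StringIO()
saveerr = sys.stderr
sys.stderr = errlog            # redirect standard error
try:
    try:
        raise Exception('this error will be logged')
    except Exception:
        traceback.print_exc()  # writes to the current sys.stderr
finally:
    sys.stderr = saveerr       # restore, since the program keeps running

logged = errlog.getvalue()
```

<p>Nothing appears on screen; the traceback text ends up in <code class="varname">logged</code>, and later error output goes to the real <code>stderr</code> again.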
<p>Since it is so common to write error messages to standard error, there is a shorthand syntax that can be used instead of going
through the hassle of redirecting it outright.
<div class="example"><h3 id="kgp.stdio.print.example">Example 10.11. Printing to <code>stderr</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>print 'entering function'
entering function
<samp class="prompt">>>> </samp>import sys
<samp class="prompt">>>> </samp>print >> sys.stderr, 'entering function' <img id="kgp.stdio.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
entering function
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.6.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This shorthand syntax of the <code class="function">print</code> statement can be used to write to any open file, or file-like object. In this case, you can redirect a single <code class="function">print</code> statement to <code>stderr</code> without affecting subsequent <code class="function">print</code> statements.
</td>
</tr>
</table>
<p>Standard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some
previous program. This will likely not make much sense to classic Mac OS users, or even to Windows users who were never fluent on the <acronym>MS-DOS</acronym> command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output
becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any
special redirecting itself, just doing normal <code class="function">print</code> statements or whatever), the next program reads from standard input, and the operating system takes care of connecting
one program's output to the next program's input.
<div class="example"><h3>Example 10.12. Chaining commands</h3><pre class="screen">
<samp class="prompt">[you@localhost kgp]$ </samp>python kgp.py -g binary.xml <img id="kgp.stdio.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
01100111
<samp class="prompt">[you@localhost kgp]$ </samp>cat binary.xml <img id="kgp.stdio.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">&lt;?xml version="1.0"?>
&lt;!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
&lt;grammar>
&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref>
&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref>
&lt;/grammar></samp>
<samp class="prompt">[you@localhost kgp]$ </samp>cat binary.xml | python kgp.py -g - <img id="kgp.stdio.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"> <img id="kgp.stdio.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
10110001</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#kgp.divein" title="9.1. Diving in">Section 9.1, &#8220;Diving in&#8221;</a>, this will print a string of eight random bits, <code class="constant">0</code> or <code class="constant">1</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This simply prints out the entire contents of <code class="filename">binary.xml</code>. (Windows users should use <code>type</code> instead of <code>cat</code>.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This prints the contents of <code class="filename">binary.xml</code>, but the &#8220;<code>|</code>&#8221; character, called the &#8220;pipe&#8221; character, means that the contents will not be printed to the screen. Instead, they will become the standard input of the
next command, which in this case calls your Python script.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Instead of specifying a grammar file (like <code class="filename">binary.xml</code>), you specify &#8220;<code>-</code>&#8221;, which causes your script to load the grammar from standard input instead of from a specific file on disk. (More on how
this happens in the next example.) So the effect is the same as the first syntax, where you specified the grammar filename
directly, but think of the expansion possibilities here. Instead of simply doing <code>cat binary.xml</code>, you could run a script that dynamically generates the grammar and pipe its output into your script. The grammar could come from
anywhere: a database, or some grammar-generating meta-script, or whatever. The point is that you don't need to change your
<code class="filename">kgp.py</code> script at all to incorporate any of this functionality. All you need to do is be able to take grammar files from standard
input, and you can separate all the other logic into another program.
</td>
</tr>
</table>
<p>So how does the script &#8220;know&#8221; to read from standard input when the grammar file is &#8220;<code>-</code>&#8221;? It's not magic; it's just code.
<div class="example"><h3>Example 10.13. Reading from standard input in <code class="filename">kgp.py</code></h3><pre class="programlisting">
def openAnything(source):
    if source == "-": <img id="kgp.stdio.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        import sys
        return sys.stdin
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        [... snip ...]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.stdio.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the <code class="function">openAnything</code> function from <code class="filename">toolbox.py</code>, which you previously examined in <a href="#kgp.openanything" title="10.1. Abstracting input sources">Section 10.1, &#8220;Abstracting input sources&#8221;</a>. All you've done is add three lines of code at the beginning of the function to check if the source is &#8220;<code>-</code>&#8221;; if so, you return <code>sys.stdin</code>. Really, that's it! Remember, <code>stdin</code> is a file-like object with a <code class="function">read</code> method, so the rest of the code (in <code class="filename">kgp.py</code>, where you call <code class="function">openAnything</code>) doesn't change a bit.
</td>
</tr>
</table>
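<p>You can try the &#8220;<code>-</code>&#8221; convention without a real pipe by temporarily substituting a fake standard input. The sketch below assumes Python 3; <code class="function">open_source</code> is a hypothetical, cut-down helper that handles only the two cases needed here (standard input or a local path), not the full <code class="function">openAnything</code> logic.

```python
# Sketch for Python 3; open_source is a hypothetical, simplified helper.
import io
import sys


def open_source(source):
    """Return sys.stdin for '-', otherwise open source as a local path."""
    if source == '-':
        return sys.stdin
    return open(source)


# Simulate a pipe by temporarily swapping in a fake standard input.
savein = sys.stdin
sys.stdin = io.StringIO("<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>")
try:
    sock = open_source('-')
    data = sock.read()         # the caller just reads, as from any file
finally:
    sys.stdin = savein
```

<p>The caller never checks where the data came from; it only relies on the returned object having a <code class="function">read</code> method.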
<h2 id="kgp.cache">10.3. Caching node lookups</h2>
<p><code class="filename">kgp.py</code> employs several tricks which may or may not be useful to you in your <acronym>XML</acronym> processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes.
<p>A grammar file defines a series of <code class="sgmltag-element">ref</code> elements. Each <code class="sgmltag-element">ref</code> contains one or more <code class="sgmltag-element">p</code> elements, which can contain a lot of different things, including <code class="sgmltag-element">xref</code>s. Whenever you encounter an <code class="sgmltag-element">xref</code>, you look for a corresponding <code class="sgmltag-element">ref</code> element with the same <code class="sgmltag-element">id</code> attribute, and choose one of the <code class="sgmltag-element">ref</code> element's children and parse it. (You'll see how this random choice is made in the next section.)
<p>This is how you build up the grammar: define <code class="sgmltag-element">ref</code> elements for the smallest pieces, then define <code class="sgmltag-element">ref</code> elements which "include" the first <code class="sgmltag-element">ref</code> elements by using <code class="sgmltag-element">xref</code>, and so forth. Then you parse the "largest" reference and follow each <code class="sgmltag-element">xref</code>, and eventually output real text. The text you output depends on the (random) decisions you make each time you fill in an
<code class="sgmltag-element">xref</code>, so the output is different each time.
<p>This is all very flexible, but there is one downside: performance. When you find an <code class="sgmltag-element">xref</code> and need to find the corresponding <code class="sgmltag-element">ref</code> element, you have a problem. The <code class="sgmltag-element">xref</code> has an <code class="sgmltag-element">id</code> attribute, and you want to find the <code class="sgmltag-element">ref</code> element that has that same <code class="sgmltag-element">id</code> attribute, but there is no easy way to do that. The slow way to do it would be to get the entire list of <code class="sgmltag-element">ref</code> elements each time, then manually loop through and look at each <code class="sgmltag-element">id</code> attribute. The fast way is to do that once and build a cache, in the form of a dictionary.
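<p>To make the contrast concrete, here is a sketch of both lookups side by side. It assumes Python 3 and a tiny grammar held in a string; <code class="function">find_ref_slow</code> and the module-level <code class="varname">refs</code> dictionary are hypothetical names used only for this illustration.

```python
# Sketch for Python 3: linear scan versus a dictionary cache.
from xml.dom import minidom

grammar = minidom.parseString(
    "<grammar>"
    "<ref id='bit'><p>0</p><p>1</p></ref>"
    "<ref id='byte'><p><xref id='bit'/></p></ref>"
    "</grammar>"
).documentElement


def find_ref_slow(ref_id):
    """Scan every ref element again on every lookup."""
    for ref in grammar.getElementsByTagName('ref'):
        if ref.attributes['id'].value == ref_id:
            return ref
    return None


# Scan once, then answer every later lookup from the dictionary.
refs = {}
for ref in grammar.getElementsByTagName('ref'):
    refs[ref.attributes['id'].value] = ref

slow = find_ref_slow('bit')
fast = refs['bit']
```

<p>Both lookups return the very same node object. The difference is that the slow version repeats the whole scan for every <code class="sgmltag-element">xref</code>, while the dictionary pays for the scan once.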
<div class="example"><h3>Example 10.14. <code class="function">loadGrammar</code></h3><pre class="programlisting">
def loadGrammar(self, grammar):
    self.grammar = self._load(grammar)
    self.refs = {} <img id="kgp.cache.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    for ref in self.grammar.getElementsByTagName("ref"): <img id="kgp.cache.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        self.refs[ref.attributes["id"].value] = ref <img id="kgp.cache.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"> <img id="kgp.cache.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.cache.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Start by creating an empty dictionary, <code class="varname">self.refs</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.cache.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#kgp.search" title="9.5. Searching for elements">Section 9.5, &#8220;Searching for elements&#8221;</a>, <code class="function">getElementsByTagName</code> returns a list of all the elements of a particular name. You can easily get a list of all the <code class="sgmltag-element">ref</code> elements, then simply loop through that list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.cache.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#kgp.attributes" title="9.6. Accessing element attributes">Section 9.6, &#8220;Accessing element attributes&#8221;</a>, you can access individual attributes of an element by name, using standard dictionary syntax. So the keys of the <code class="varname">self.refs</code> dictionary will be the values of the <code class="sgmltag-element">id</code> attribute of each <code class="sgmltag-element">ref</code> element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.cache.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The values of the <code class="varname">self.refs</code> dictionary will be the <code class="sgmltag-element">ref</code> elements themselves. As you saw in <a href="#kgp.parse" title="9.3. Parsing XML">Section 9.3, &#8220;Parsing XML&#8221;</a>, each element, each node, each comment, each piece of text in a parsed <acronym>XML</acronym> document is an object.
</td>
</tr>
</table>
<p>Once you build this cache, whenever you come across an <code class="sgmltag-element">xref</code> and need to find the <code class="sgmltag-element">ref</code> element with the same <code class="sgmltag-element">id</code> attribute, you can simply look it up in <code class="varname">self.refs</code>.
<div class="example"><h3>Example 10.15. Using the <code class="sgmltag-element">ref</code> element cache</h3><pre class="programlisting">
def do_xref(self, node):
    id = node.attributes["id"].value
    self.parse(self.randomChildElement(self.refs[id]))</pre><p>You'll explore the <code class="function">randomChildElement</code> function in the next section.
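<p>The caching idea can be tried outside the class with a minimal sketch. The grammar string below is a made-up miniature, not one of the book's grammar files, and the module-level names (<code>GRAMMAR</code>, <code>refs</code>, <code>target</code>) are assumptions chosen for illustration.

```python
from xml.dom import minidom

# Hypothetical miniature grammar, invented for this sketch.
GRAMMAR = """<grammar>
<ref id="bit"><p>0</p><p>1</p></ref>
<ref id="byte"><p><xref id="bit"/><xref id="bit"/></p></ref>
</grammar>"""

doc = minidom.parseString(GRAMMAR)

# Build the cache once: value of the id attribute -> ref element object.
refs = {}
for ref in doc.getElementsByTagName("ref"):
    refs[ref.attributes["id"].value] = ref

# Resolving an xref is now a single dictionary lookup
# instead of a loop over every ref element.
xref = doc.getElementsByTagName("xref")[0]
target = refs[xref.attributes["id"].value]
print(target.toxml())
```

<p>The dictionary stores references to the element objects themselves, so a lookup hands back the same node that lives in the parsed document rather than a copy.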
<h2 id="kgp.child">10.4. Finding direct children of a node</h2>
<p>Another useful technique when parsing <acronym>XML</acronym> documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a <code class="sgmltag-element">ref</code> element can have several <code class="sgmltag-element">p</code> elements, each of which can contain many things, including other <code class="sgmltag-element">p</code> elements. You want to find just the <code class="sgmltag-element">p</code> elements that are children of the <code class="sgmltag-element">ref</code>, not <code class="sgmltag-element">p</code> elements that are children of other <code class="sgmltag-element">p</code> elements.
<p>You might think you could simply use <code class="function">getElementsByTagName</code> for this, but you can't. <code class="function">getElementsByTagName</code> searches recursively and returns a single list for all the elements it finds. Since <code class="sgmltag-element">p</code> elements can contain other <code class="sgmltag-element">p</code> elements, you can't use <code class="function">getElementsByTagName</code>, because it would return nested <code class="sgmltag-element">p</code> elements that you don't want. To find only direct child elements, you'll need to do it yourself.
<div class="example"><h3>Example 10.16. Finding direct child elements</h3><pre class="programlisting">
def randomChildElement(self, node):
    choices = [e for e in node.childNodes
               if e.nodeType == e.ELEMENT_NODE] <img id="kgp.child.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"> <img id="kgp.child.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"> <img id="kgp.child.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
    chosen = random.choice(choices) <img id="kgp.child.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
    return chosen</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.child.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#kgp.parse.gettingchildnodes.example" title="Example 9.9. Getting child nodes">Example 9.9, &#8220;Getting child nodes&#8221;</a>, the <code class="function">childNodes</code> attribute returns a list of all the child nodes of an element.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.child.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">However, as you saw in <a href="#kgp.parse.childnodescanbetext.example" title="Example 9.11. Child nodes can be text">Example 9.11, &#8220;Child nodes can be text&#8221;</a>, the list returned by <code class="function">childNodes</code> contains all different types of nodes, including text nodes. That's not what you're looking for here. You only want the
children that are elements.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.child.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each node has a <code class="varname">nodeType</code> attribute, which can be <code>ELEMENT_NODE</code>, <code>TEXT_NODE</code>, <code>COMMENT_NODE</code>, or any number of other values. The complete list of possible values is in the <code class="filename">__init__.py</code> file in the <code class="classname">xml.dom</code> package. (See <a href="#kgp.packages" title="9.2. Packages">Section 9.2, &#8220;Packages&#8221;</a> for more on packages.) But you're just interested in nodes that are elements, so you can filter the list to only include
those nodes whose <code class="varname">nodeType</code> is <code>ELEMENT_NODE</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.child.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Once you have a list of actual elements, choosing a random one is easy. Python comes with a module called <code class="filename">random</code> which includes several useful functions. The <code class="function">random.choice</code> function takes a list of any number of items and returns a random item. For example, if the <code class="sgmltag-element">ref</code> element contains several <code class="sgmltag-element">p</code> elements, then <code class="varname">choices</code> would be a list of <code class="sgmltag-element">p</code> elements, and <code class="varname">chosen</code> would end up being assigned exactly one of them, selected at random.
</td>
</tr>
</table>
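<p>To see the difference between the recursive search and the direct-child filter, here is a small sketch. The XML fragment is invented for the purpose, and the names <code>all_p</code> and <code>direct</code> are assumptions, not part of <code class="filename">kgp.py</code>.

```python
import random
from xml.dom import minidom

# Invented fragment: one <p> is nested inside another, plus a comment.
doc = minidom.parseString(
    "<ref id='x'><p>a <p>nested</p></p>\n<p>b</p><!-- note --></ref>")
ref = doc.documentElement

# Recursive search: finds every <p>, including the nested one.
all_p = ref.getElementsByTagName("p")

# Direct children only: keep the child nodes that are elements,
# which drops the whitespace text node and the comment.
direct = [e for e in ref.childNodes if e.nodeType == e.ELEMENT_NODE]

print("recursive: %d, direct: %d" % (len(all_p), len(direct)))

# Pick one of the direct children at random, as randomChildElement does.
chosen = random.choice(direct)
```

<p>Here the recursive search reports three <code class="sgmltag-element">p</code> elements while the filter keeps only the two that are immediate children of the <code class="sgmltag-element">ref</code>.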
<h2 id="kgp.handler">10.5. Creating separate handlers by node type</h2>
<p>The third useful <acronym>XML</acronym> processing tip involves separating your code into logical functions, based on node types and element names. Parsed <acronym>XML</acronym> documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a <code class="classname">Document</code> object. The <code class="classname">Document</code> then contains one or more <code class="classname">Element</code> objects (for actual <acronym>XML</acronym> tags), each of which may contain other <code class="classname">Element</code> objects, <code class="classname">Text</code> objects (for bits of text), or <code class="classname">Comment</code> objects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type.
<div class="example"><h3>Example 10.17. Class names of parsed <acronym>XML</acronym> objects</h3><pre class="screen">
<samp class="prompt">>>> </samp>from xml.dom import minidom
<samp class="prompt">>>> </samp>xmldoc = minidom.parse('kant.xml') <img id="kgp.handler.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>xmldoc
&lt;xml.dom.minidom.Document instance at 0x01359DE8>
<samp class="prompt">>>> </samp>xmldoc.__class__ <img id="kgp.handler.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;class xml.dom.minidom.Document at 0x01105D40>
<samp class="prompt">>>> </samp>xmldoc.__class__.__name__ <img id="kgp.handler.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'Document'</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Assume for a moment that <code class="filename">kant.xml</code> is in the current directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#kgp.packages" title="9.2. Packages">Section 9.2, &#8220;Packages&#8221;</a>, the object returned by parsing an <acronym>XML</acronym> document is a <code class="classname">Document</code> object, as defined in <code class="filename">minidom.py</code> in the <code class="filename">xml.dom</code> package. As you saw in <a href="#fileinfo.create" title="5.4. Instantiating Classes">Section 5.4, &#8220;Instantiating Classes&#8221;</a>, <code>__class__</code> is a built-in attribute of every Python object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Furthermore, <code>__name__</code> is a built-in attribute of every Python class, and it is a string. This string is not mysterious; it's the same as the class name you type when you define a class
yourself. (See <a href="#fileinfo.class" title="5.3. Defining Classes">Section 5.3, &#8220;Defining Classes&#8221;</a>.)
</td>
</tr>
</table>
<p>Fine, so now you can get the class name of any particular <acronym>XML</acronym> node (since each <acronym>XML</acronym> node is represented as a Python object). How can you use this to your advantage to separate the logic of parsing each node type? The answer is <code class="function">getattr</code>, which you first saw in <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr">Section 4.4, &#8220;Getting Object References With getattr&#8221;</a>.
<div class="example"><h3>Example 10.18. <code class="function">parse</code>, a generic <acronym>XML</acronym> node dispatcher</h3><pre class="programlisting">
def parse(self, node):
    parseMethod = getattr(self, "parse_%s" % node.__class__.__name__) <img id="kgp.handler.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"> <img id="kgp.handler.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    parseMethod(node) <img id="kgp.handler.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First off, notice that you're constructing a larger string based on the class name of the node you were passed (in the <code class="varname">node</code> argument). So if you're passed a <code class="classname">Document</code> node, you're constructing the string <code>'parse_Document'</code>, and so forth.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can treat that string as a function name, and get a reference to the function itself using <code class="function">getattr</code>.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Finally, you can call that function and pass the node itself as an argument. The next example shows the definitions of each
of these functions.
</td>
</tr>
</table>
<div class="example"><h3>Example 10.19. Functions called by the <code class="function">parse</code> dispatcher</h3><pre class="programlisting">
def parse_Document(self, node): <img id="kgp.handler.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    self.parse(node.documentElement)

def parse_Text(self, node): <img id="kgp.handler.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    text = node.data
    if self.capitalizeNextWord:
        self.pieces.append(text[0].upper())
        self.pieces.append(text[1:])
        self.capitalizeNextWord = 0
    else:
        self.pieces.append(text)

def parse_Comment(self, node): <img id="kgp.handler.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
    pass

def parse_Element(self, node): <img id="kgp.handler.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
    handlerMethod = getattr(self, "do_%s" % node.tagName)
    handlerMethod(node)</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">parse_Document</code> is only ever called once, since there is only one <code class="classname">Document</code> node in an <acronym>XML</acronym> document, and only one <code class="classname">Document</code> object in the parsed <acronym>XML</acronym> representation. It simply turns around and parses the root element of the grammar file.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">parse_Text</code> is called on nodes that represent bits of text. The function itself does some special processing to handle automatic capitalization
of the first word of a sentence, but otherwise simply appends the represented text to a list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">parse_Comment</code> is just a <code>pass</code>, since you don't care about embedded comments in the grammar files. Note, however, that you still need to define the function
and explicitly make it do nothing. If the function did not exist, the generic <code class="function">parse</code> function would fail as soon as it stumbled on a comment, because it would try to find the non-existent <code class="function">parse_Comment</code> function. Defining a separate function for every node type, even ones you don't use, allows the generic <code class="function">parse</code> function to stay simple and dumb.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.handler.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">parse_Element</code> method is actually itself a dispatcher, based on the name of the element's tag. The basic idea is the same: take what distinguishes
elements from each other (their tag names) and dispatch to a separate function for each of them. You construct a string like
<code>'do_xref'</code> (for an <code class="sgmltag-element">&lt;xref></code> tag), find a function of that name, and call it. And so forth for each of the other tag names that might be found in the
course of parsing a grammar file (<code class="sgmltag-element">&lt;p></code> tags, <code class="sgmltag-element">&lt;choice></code> tags).
</td>
</tr>
</table>
<p>In this example, the dispatch functions <code class="function">parse</code> and <code class="function">parse_Element</code> simply find other methods in the same class. If your processing is very complex (or you have many different tag names),
you could break up your code into separate modules, and use dynamic importing to import each module and call whatever functions
you needed. Dynamic importing will be discussed in <a href="#regression" title="Chapter 16. Functional Programming">Chapter 16, <i>Functional Programming</i></a>.
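<p>The two-level dispatch can be sketched as a self-contained class. Everything named here (<code>Walker</code>, <code>do_b</code>, <code>do_default</code>) is hypothetical and exists only for this illustration. The fallback to <code>do_default</code> is also an assumption added for the sketch; the dispatcher shown above has no fallback and would raise <code>AttributeError</code> on an unknown tag.

```python
from xml.dom import minidom

class Walker:
    """Hypothetical walker that dispatches on node class, then on tag name."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # Build a method name such as "parse_Element" and look it up.
        method = getattr(self, "parse_%s" % node.__class__.__name__)
        method(node)

    def parse_Document(self, node):
        self.parse(node.documentElement)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Comment(self, node):
        pass  # must exist, even though it does nothing

    def parse_Element(self, node):
        # Second-level dispatch on the tag name, e.g. "do_b" for <b>.
        # The do_default fallback is an assumption of this sketch.
        handler = getattr(self, "do_%s" % node.tagName, self.do_default)
        handler(node)

    def do_default(self, node):
        for child in node.childNodes:
            self.parse(child)

    def do_b(self, node):
        self.pieces.append("*")
        self.do_default(node)
        self.pieces.append("*")

w = Walker()
w.parse(minidom.parseString("<p>plain <b>bold</b><!-- skip --> text</p>"))
print("".join(w.pieces))
```

<p>Adding support for a new tag then means adding one <code>do_</code> method; neither dispatcher needs to change.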
<h2 id="kgp.commandline">10.6. Handling command-line arguments</h2>
<p>Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short-
or long-style flags to specify various options. None of this is <acronym>XML</acronym>-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.
<p>It's difficult to talk about command-line processing without understanding how command-line arguments are exposed to your
Python program, so let's write a simple program to see them.
<div class="example"><h3>Example 10.20. Introducing <code class="varname">sys.argv</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
#argecho.py
import sys
for arg in sys.argv: <img id="kgp.commandline.0.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    print arg</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.0.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each command-line argument passed to the program will be in <code class="varname">sys.argv</code>, which is just a list. Here you are printing each argument on a separate line.
</td>
</tr>
</table>
<div class="example"><h3>Example 10.21. The contents of <code class="varname">sys.argv</code></h3><pre class="screen">
<samp class="prompt">[you@localhost py]$ </samp>python argecho.py <img id="kgp.commandline.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
argecho.py
<samp class="prompt">[you@localhost py]$ </samp>python argecho.py abc def <img id="kgp.commandline.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">argecho.py
abc
def</samp>
<samp class="prompt">[you@localhost py]$ </samp>python argecho.py --help <img id="kgp.commandline.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">argecho.py
--help</samp>
<samp class="prompt">[you@localhost py]$ </samp>python argecho.py -m kant.xml <img id="kgp.commandline.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">argecho.py
-m
kant.xml</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first thing to know about <code class="varname">sys.argv</code> is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later,
in <a href="#regression" title="Chapter 16. Functional Programming">Chapter 16, <i>Functional Programming</i></a>. Don't worry about it for now.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Command-line arguments are separated by spaces, and each shows up as a separate element in the <code class="varname">sys.argv</code> list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Command-line flags, like <code>--help</code>, also show up as their own element in the <code class="varname">sys.argv</code> list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag
(<code>-m</code>) which takes an argument (<code>kant.xml</code>). Both the flag itself and the flag's argument are simply sequential elements in the <code class="varname">sys.argv</code> list. No attempt is made to associate one with the other; all you get is a list.
</td>
</tr>
</table>
<p>So as you can see, you certainly have all the information passed on the command line, but then again, it doesn't look like
it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags,
you can simply use <code>sys.argv[1]</code> to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need the <code class="filename">getopt</code> module.
<div class="example"><h3>Example 10.22. Introducing <code class="filename">getopt</code></h3><pre class="programlisting">
def main(argv):
    grammar = "kant.xml" <img id="kgp.commandline.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) <img id="kgp.commandline.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    except getopt.GetoptError: <img id="kgp.commandline.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        usage() <img id="kgp.commandline.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        sys.exit(2)

...

if __name__ == "__main__":
    main(sys.argv[1:])</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First off, look at the bottom of the example and notice that you're calling the <code class="function">main</code> function with <code>sys.argv[1:]</code>. Remember, <code>sys.argv[0]</code> is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off
and pass the rest of the list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is where all the interesting processing happens. The <code class="function">getopt</code> function of the <code class="filename">getopt</code> module takes three parameters: the argument list (which you got from <code>sys.argv[1:]</code>), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer
command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is
explained in more detail below.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If anything goes wrong trying to parse these command-line flags, <code class="filename">getopt</code> will raise an exception, which you catch. You told <code class="filename">getopt</code> all the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As is standard practice in the <acronym>UNIX</acronym> world, when the script is passed flags it doesn't understand, you print out a summary of proper usage and exit gracefully.
Note that I haven't shown the <code class="function">usage</code> function here. You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.
</td>
</tr>
</table>
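<p>To watch <code class="filename">getopt</code> reject a bad command line without running the whole script, a minimal sketch follows. The helper name <code>try_parse</code> is hypothetical; it returns the error text instead of calling <code class="function">usage</code> and exiting, purely so the result can be inspected.

```python
import getopt

def try_parse(argv):
    """Return (opts, args), or the error message if getopt rejects argv."""
    try:
        return getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError as err:
        return str(err)

print(try_parse(["-x"]))         # a flag the script does not declare
print(try_parse(["-g"]))         # -g without its required argument
print(try_parse(["-d", "src"]))  # accepted: one flag, one plain argument
```

<p>The first two calls come back as error strings, because both an undeclared flag and a missing flag argument raise <code>getopt.GetoptError</code>.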
<p>So what are all those parameters you pass to the <code class="function">getopt</code> function? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element,
the script name, which you already chopped off before calling the <code class="function">main</code> function). The second is the list of short command-line flags that the script accepts.
<div class="variablelist">
<h3><code>"hg:d"</code></h3>
<dl>
<dt><code>-h</code></dt>
<dd>print usage summary</dd>
<dt><code>-g ...</code></dt>
<dd>use specified grammar file or URL</dd>
<dt><code>-d</code></dt>
<dd>show debugging information while parsing</dd>
</dl>
<p>The first and third flags are simply standalone flags; you specify them or you don't, and they do things (print help) or change
state (turn on debugging). However, the second flag (<code>-g</code>) <em>must</em> be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address,
and you don't know which yet (you'll figure it out later), but you know it has to be <em>something</em>. So you tell <code class="filename">getopt</code> this by putting a colon after the <code>g</code> in that second parameter to the <code class="function">getopt</code> function.
<p>To further complicate things, the script accepts either short flags (like <code>-h</code>) or long flags (like <code>--help</code>), and you want them to do the same thing. This is what the third parameter to <code class="function">getopt</code> is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.
<div class="variablelist">
<h3><code>["help", "grammar="]</code></h3>
<dl>
<dt><code>--help</code></dt>
<dd>print usage summary</dd>
<dt><code>--grammar ...</code></dt>
<dd>use specified grammar file or URL</dd>
</dl>
<p>Three things of note here:
<div class="orderedlist">
<ol>
<li>All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling <code class="function">getopt</code>. They are understood.
<li>The <code>--grammar</code> flag must always be followed by an additional argument, just like the <code>-g</code> flag. This is notated by an equals sign, <code>"grammar="</code>.
<li>The list of long flags is shorter than the list of short flags, because the <code>-d</code> flag does not have a corresponding long version. This is fine; only <code>-d</code> will turn on debugging. Note that <code class="filename">getopt</code> does not pair short flags with long flags by position, so the order of the two lists does not matter; treating <code>-h</code> and <code>--help</code> as
equivalent is something you <em>do</em> yourself, in your own code, when you process the results.
</ol>
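<p>It can help to look at what <code class="function">getopt</code> actually hands back for a sample command line. This is a minimal sketch; the argument list is made up, and <code>source.txt</code> is a hypothetical filename.

```python
import getopt

# A made-up command line: two standalone flags, one flag with an argument,
# and one plain argument at the end.
argv = ["-d", "--grammar", "kant.xml", "-h", "source.txt"]
opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])

for opt, arg in opts:
    # Flags are reported exactly as typed, dashes included.
    print("%s -> %r" % (opt, arg))
print(args)

# The long form can also be written with an equals sign on the command line.
opts_eq, args_eq = getopt.getopt(["--grammar=kant.xml"], "hg:d",
                                 ["help", "grammar="])
```

<p>Each entry in <code class="varname">opts</code> is a (flag, argument) pair, and flags that take no argument are paired with an empty string.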
<p>Confused yet? Let's look at the actual code and see if it makes sense in context.
<div class="example"><h3>Example 10.23. Handling command-line arguments in <code class="filename">kgp.py</code></h3><pre class="programlisting">
def main(argv): <img id="kgp.commandline.3.0" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    grammar = "kant.xml"
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts: <img id="kgp.commandline.3.1" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        if opt in ("-h", "--help"): <img id="kgp.commandline.3.2" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
            usage()
            sys.exit()
        elif opt == '-d': <img id="kgp.commandline.3.3" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"): <img id="kgp.commandline.3.4" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
            grammar = arg

    source = "".join(args) <img id="kgp.commandline.3.5" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">

    k = KantGenerator(grammar, source)
    print k.output()</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.3.0"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">grammar</code> variable will keep track of the grammar file you're using. You initialize it here in case it's not specified on the command
line (using either the <code>-g</code> or the <code>--grammar</code> flag).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.3.1"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">opts</code> variable that you get back from <code class="function">getopt</code> contains a list of tuples: <code class="varname">flag</code> and <code class="varname">argument</code>. If the flag doesn't take an argument, then <code class="varname">arg</code> will simply be an empty string. This makes it easier to loop through the flags.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.3.2"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">getopt</code> validates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags.
If you specify the <code>-h</code> flag, <code class="varname">opt</code> will contain <code>"-h"</code>; if you specify the <code>--help</code> flag, <code class="varname">opt</code> will contain <code>"--help"</code>. So you need to check for both.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.3.3"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember, the <code>-d</code> flag didn't have a corresponding long flag, so you only need to check for the short form. If you find it, you set a global
variable that you'll refer to later to print out debugging information. (I used this during the development of the script.
What, you thought all these examples worked on the first try?)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.3.4"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If you find a grammar file, either with a <code>-g</code> flag or a <code>--grammar</code> flag, you save the argument that followed it (stored in <code class="varname">arg</code>) into the <code class="varname">grammar</code> variable, overwriting the default that you initialized at the top of the <code class="function">main</code> function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#kgp.commandline.3.5"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">That's it. You've looped through and dealt with all the command-line flags. That means that anything left must be command-line
arguments. These come back from the <code class="function">getopt</code> function in the <code class="varname">args</code> variable. In this case, you're treating them as source material for the parser. If there are no command-line arguments
specified, <code class="varname">args</code> will be an empty list, and <code class="varname">source</code> will end up as the empty string.
</td>
</tr>
</table>
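<p>To see the whole loop in one place, here is a minimal sketch of the flag handling described in the callouts above. The function name <code>parse_flags</code>, the default grammar file name, and the returned tuple are assumptions made for this illustration; only the <code>getopt.getopt</code> call and the flag names come from the chapter.

```python
import getopt

def parse_flags(argv):
    """Return (grammar, debug, show_help, source) for a list of arguments.

    Hypothetical helper: the name and the return value are illustrative.
    """
    grammar = "kant.xml"    # assumed default, used when -g is not given
    debug = False
    show_help = False
    # "hg:d": -h and -d stand alone; the colon means -g needs an argument
    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    for opt, arg in opts:
        if opt in ("-h", "--help"):        # short and long forms are checked separately
            show_help = True
        elif opt == "-d":                  # -d has no long form and no useful arg
            debug = True
        elif opt in ("-g", "--grammar"):
            grammar = arg                  # overwrite the default
    source = "".join(args)                 # empty string if nothing is left over
    return grammar, debug, show_help, source

print(parse_flags(["-g", "binary.xml", "-d", "paragraph"]))
```

<p>Anything that <code>getopt</code> does not recognize as a flag ends up in <code>args</code>, which is why the source material is taken from there.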
<h2 id="kgp.alltogether">10.7. Putting it all together</h2>
<p>You've covered a lot of ground. Let's step back and see how all the pieces fit together.
<p>To start with, this is a script that <a href="#kgp.commandline" title="10.6. Handling command-line arguments">takes its arguments on the command line</a>, using the <code class="filename">getopt</code> module.
<div class="informalexample"><pre class="programlisting">
def main(argv):
    ...
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        ...
    for opt, arg in opts:
        ...</pre><p>You create a new instance of the <code class="classname">KantGenerator</code> class, and pass it the grammar file and source that may or may not have been specified on the command line.
<div class="informalexample"><pre class="programlisting">
    k = KantGenerator(grammar, source)</pre><p>The <code class="classname">KantGenerator</code> instance automatically loads the grammar, which is an <acronym>XML</acronym> file. You use your custom <code class="function">openAnything</code> function to open the file (which <a href="#kgp.openanything" title="10.1. Abstracting input sources">could be stored in a local file or a remote web server</a>), then use the built-in <code class="filename">minidom</code> parsing functions to <a href="#kgp.parse" title="9.3. Parsing XML">parse the <acronym>XML</acronym> into a tree of Python objects</a>.
<div class="informalexample"><pre class="programlisting">
    def _load(self, source):
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()</pre><p>Oh, and along the way, you take advantage of your knowledge of the structure of the <acronym>XML</acronym> document to <a href="#kgp.cache" title="10.3. Caching node lookups">set up a little cache of references</a>, which are just elements in the <acronym>XML</acronym> document.
<div class="informalexample"><pre class="programlisting">
    def loadGrammar(self, grammar):
        for ref in self.grammar.getElementsByTagName("ref"):
            self.refs[ref.attributes["id"].value] = ref</pre><p>If you specified some source material on the command line, you use that; otherwise you rip through the grammar looking for
the "top-level" reference (that isn't referenced by anything else) and use that as a starting point.
<div class="informalexample"><pre class="programlisting">
    def getDefaultSource(self):
        xrefs = {}
        for xref in self.grammar.getElementsByTagName("xref"):
            xrefs[xref.attributes["id"].value] = 1
        xrefs = xrefs.keys()
        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
        return '&lt;xref id="%s"/>' % random.choice(standaloneXrefs)</pre><p>Now you rip through the source material. The source material is also <acronym>XML</acronym>, and you parse it one node at a time. To keep the code separated and more maintainable, you use <a href="#kgp.handler" title="10.5. Creating separate handlers by node type">separate handlers for each node type</a>.
<div class="informalexample"><pre class="programlisting">
    def parse_Element(self, node):
        handlerMethod = getattr(self, "do_%s" % node.tagName)
        handlerMethod(node)</pre><p>You bounce through the grammar, <a href="#kgp.child" title="10.4. Finding direct children of a node">parsing all the children</a> of each <code class="sgmltag-element">p</code> element,
<div class="informalexample"><pre class="programlisting">
    def do_p(self, node):
        ...
        if doit:
            for child in node.childNodes: self.parse(child)</pre><p>replacing <code class="sgmltag-element">choice</code> elements with a random child,
<div class="informalexample"><pre class="programlisting">
    def do_choice(self, node):
        self.parse(self.randomChildElement(node))</pre><p>and replacing <code class="sgmltag-element">xref</code> elements with a random child of the corresponding <code class="sgmltag-element">ref</code> element, which you previously cached.
<div class="informalexample"><pre class="programlisting">
    def do_xref(self, node):
        id = node.attributes["id"].value
        self.parse(self.randomChildElement(self.refs[id]))</pre><p>Eventually, you parse your way down to plain text,
<div class="informalexample"><pre class="programlisting">
    def parse_Text(self, node):
        text = node.data
        ...
        self.pieces.append(text)</pre><p>which you print out.
<div class="informalexample"><pre class="programlisting">
def main(argv):
    ...
    k = KantGenerator(grammar, source)
    print k.output()</pre><h2 id="kgp.summary">10.8. Summary</h2>
<p>Python comes with powerful libraries for parsing and manipulating <acronym>XML</acronym> documents. The <code class="filename">minidom</code> module takes an <acronym>XML</acronym> file and parses it into Python objects, providing random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments,
error handling, even the ability to take input from the piped result of a previous program.
<p>Before moving on to the next chapter, you should be comfortable doing all of these things:
<div class="itemizedlist">
<ul>
<li><a href="#kgp.stdio" title="10.2. Standard input, output, and error">Chaining programs</a> with standard input and output
<li><a href="#kgp.handler" title="10.5. Creating separate handlers by node type">Defining dynamic dispatchers</a> with <code class="function">getattr</code>
<li><a href="#kgp.commandline" title="10.6. Handling command-line arguments">Using command-line flags</a> and validating them with <code class="filename">getopt</code>
</ul>
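<p>As a quick refresher on the second item, here is a minimal sketch of a <code>getattr</code>-based dispatcher. The class <code>TagDescriber</code> and its methods are hypothetical, and the fallback handler passed as the third argument to <code>getattr</code> is an addition for this illustration rather than something the chapter's script does.

```python
class TagDescriber(object):
    """Hypothetical class that picks a handler method from a tag name."""

    def describe(self, tag):
        # Build the method name at run time and look it up on the instance;
        # fall back to do_unknown if no matching do_* method exists.
        handler = getattr(self, "do_%s" % tag, self.do_unknown)
        return handler()

    def do_p(self):
        return "paragraph"

    def do_choice(self):
        return "choice"

    def do_unknown(self):
        return "unknown"

describer = TagDescriber()
print(describer.describe("p"))
print(describer.describe("xref"))
```

<p>Adding support for a new tag only requires adding a new <code>do_</code> method; the dispatcher itself never changes.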
<div class="chapter">
<h2 id="oa">Chapter 11. HTTP Web Services</h2>
<h2 id="oa.divein">11.1. Diving in</h2>
<p>You've learned about <a href="#dialect" title="Chapter 8. HTML Processing">HTML processing</a> and <a href="#kgp" title="Chapter 9. XML Processing">XML processing</a>, and along the way you saw <a href="#dialect.extract.urllib" title="Example 8.5. Introducing urllib">how to download a web page</a> and <a href="#kgp.openanything.urllib" title="Example 10.2. Parsing XML from a URL">how to parse XML from a URL</a>, but let's dive into the more general topic of HTTP web services.
<p>Simply stated, HTTP web services are programmatic ways of sending and receiving data from remote servers using the operations
of HTTP directly. If you want to get data from the server, use a straight HTTP GET; if you want to send new data to the server,
use HTTP POST. (Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using
HTTP PUT and HTTP DELETE.) In other words, the &#8220;verbs&#8221; built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for receiving, sending,
modifying, and deleting data.
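<p>You can see the verb being chosen without touching the network. The sketch below builds two request objects and asks each which verb it would use. The URL and form data are placeholders, and the <code>try</code>/<code>except</code> import is an assumption that lets the same code run where <code>urllib2</code> has been renamed <code>urllib.request</code>.

```python
try:
    import urllib2 as urlrequest           # Python 2 name, as used in this chapter
except ImportError:
    import urllib.request as urlrequest    # Python 3 home of the same Request class

# Placeholder URL; nothing is sent until an opener opens the request.
url = "http://example.com/entries"

get_request = urlrequest.Request(url)                        # no data: a GET
post_request = urlrequest.Request(url, data=b"title=hello")  # data attached: a POST

print(get_request.get_method())
print(post_request.get_method())
```

<p>Attaching data to the request is what switches it from <code>GET</code> to <code>POST</code>.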
<p>The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data
-- usually XML data -- can be built and stored statically, or generated dynamically by a server-side script, and all major
languages include an HTTP library for downloading it. Debugging is also easier, because you can load up the web service in
any web browser and see the raw data. Modern browsers will even nicely format and pretty-print XML data for you, to allow
you to quickly navigate through it.
<p>Examples of pure XML-over-HTTP web services:
<div class="itemizedlist">
<ul>
<li><a href="http://www.amazon.com/webservices">Amazon API</a> allows you to retrieve product information from the Amazon.com online store.
<li><a href="http://www.nws.noaa.gov/alerts/">National Weather Service</a> (United States) and <a href="http://demo.xml.weather.gov.hk/">Hong Kong Observatory</a> (Hong Kong) offer weather alerts as a web service.
<li><a href="http://atomenabled.org/">Atom API</a> for managing web-based content.
<li><a href="http://syndic8.com/">Syndicated feeds</a> from weblogs and news sites bring you up-to-the-minute news from a variety of sites.
</ul>
<p>In later chapters, you'll explore APIs which use HTTP as a transport for sending and receiving data, but don't map application
semantics to the underlying HTTP semantics. (They tunnel everything over HTTP POST.) But this chapter will concentrate on
using HTTP GET to get data from a remote server, and you'll explore several HTTP features you can use to get the maximum benefit
out of pure HTTP web services.
<p>Here is a more advanced version of the <code class="filename">openanything</code> module that you saw in <a href="#streams" title="Chapter 10. Scripts and Streams">the previous chapter</a>:
<div class="example"><h3>Example 11.1. <code class="filename">openanything.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
import sys, urllib2, urlparse, gzip
from StringIO import StringIO

USER_AGENT = 'OpenAnything/1.0 +http://diveintopython3.org/http_web_services/'

class SmartRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
    def http_error_default(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(
            req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
    '''URL, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.

    If the etag argument is supplied, it will be used as the value of an
    If-None-Match request header.

    If the lastmodified argument is supplied, it must be a formatted
    date/time string in GMT (as returned in the Last-Modified header of
    a previous request).  The formatted date/time will be used
    as the value of an If-Modified-Since request header.

    If the agent argument is supplied, it will be used as the value of a
    User-Agent request header.
    '''
    if hasattr(source, 'read'):
        return source

    if source == '-':
        return sys.stdin

    if urlparse.urlparse(source)[0] == 'http':
        # open URL with urllib2
        request = urllib2.Request(source)
        request.add_header('User-Agent', agent)
        if etag:
            request.add_header('If-None-Match', etag)
        if lastmodified:
            request.add_header('If-Modified-Since', lastmodified)
        request.add_header('Accept-encoding', 'gzip')
        opener = urllib2.build_opener(SmartRedirectHandler(), DefaultErrorHandler())
        return opener.open(request)

    # try to open with native open function (if source is a filename)
    try:
        return open(source)
    except (IOError, OSError):
        pass

    # treat source as string
    return StringIO(str(source))

def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
    '''Fetch data and metadata from a URL, file, stream, or string'''
    result = {}
    f = openAnything(source, etag, last_modified, agent)
    result['data'] = f.read()
    if hasattr(f, 'headers'):
        # save ETag, if the server sent one
        result['etag'] = f.headers.get('ETag')
        # save Last-Modified header, if the server sent one
        result['lastmodified'] = f.headers.get('Last-Modified')
        if f.headers.get('content-encoding', '') == 'gzip':
            # data came back gzip-compressed, decompress it
            result['data'] = gzip.GzipFile(fileobj=StringIO(result['data'])).read()
    if hasattr(f, 'url'):
        result['url'] = f.url
        result['status'] = 200
    if hasattr(f, 'status'):
        result['status'] = f.status
    f.close()
    return result
</pre><div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li>Paul Prescod believes that <a href="http://webservices.xml.com/pub/a/ws/2002/02/06/rest.html">pure HTTP web services are the future of the Internet</a>.
</ul>
<h2 id="oa.review">11.2. How not to fetch data over HTTP</h2>
<p>Let's say you want to download a resource over HTTP, such as a syndicated Atom feed. But you don't just want to download
it once; you want to download it over and over again, every hour, to get the latest news from the site that's offering the
news feed. Let's do it the quick-and-dirty way first, and then see how you can do better.
<div class="example"><h3>Example 11.2. Downloading a feed the quick-and-dirty way</h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib
<samp class="prompt">>>> </samp>data = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read() <img id="oa.review.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print data
<samp class="computeroutput">&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:lang="en">
&lt;title mode="escaped">dive into mark&lt;/title>
&lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
&lt;-- rest of feed omitted for brevity --></samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.review.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Downloading anything over HTTP is incredibly easy in Python; in fact, it's a one-liner. The <code class="filename">urllib</code> module has a handy <code class="function">urlopen</code> function that takes the address of the page you want, and returns a file-like object that you can just <code class="function">read()</code> from to get the full contents of the page. It just can't get much easier.
</td>
</tr>
</table>
<p>So what's wrong with this? Well, for a quick one-off during testing or development, there's nothing wrong with it. I do
it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any
web page. But once you start thinking in terms of a web service that you want to access on a regular basis -- and remember,
you said you were planning on retrieving this syndicated feed once an hour -- then you're being inefficient, and you're being
rude.
<p>Let's talk about some of the basic features of HTTP.
<h2 id="oa.features">11.3. Features of HTTP</h2>
<p>There are five important features of HTTP which you should support.
<h3>11.3.1. <code>User-Agent</code></h3>
<p>The <code>User-Agent</code> is simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web
service over HTTP. When the client requests a resource, it should always announce who it is, as specifically as possible.
This allows the server-side administrator to get in touch with the client-side developer if anything is going fantastically
wrong.
<p>By default, Python sends a generic <code>User-Agent</code>: <code>Python-urllib/1.15</code>. In the next section, you'll see how to change this to something more specific.
<h3>11.3.2. Redirects</h3>
<p>Sometimes resources move around. Web sites get reorganized, pages move to new addresses. Even web services can reorganize.
A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; for instance, <code>http://www.example.com/index.xml</code> might be redirected to <code>http://server-farm-1.example.com/index.xml</code>.
<p>Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status
code <code>200</code> means &#8220;everything's normal, here's the page you asked for&#8221;. Status code <code>404</code> means &#8220;page not found&#8221;. (You've probably seen 404 errors while browsing the web.)
<p>HTTP has two different ways of signifying that a resource has moved. Status code <code>302</code> is a <em>temporary redirect</em>; it means &#8220;oops, that got moved over here temporarily&#8221; (and then gives the temporary address in a <code>Location:</code> header). Status code <code>301</code> is a <em>permanent redirect</em>; it means &#8220;oops, that got moved permanently&#8221; (and then gives the new address in a <code>Location:</code> header). If you get a <code>302</code> status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but
the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you're supposed to use the new address from then on.
<p><code class="function">urllib.urlopen</code> will automatically &#8220;follow&#8221; redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn't tell you when
it does so. You'll end up getting the data you asked for, but you'll never know that the underlying library &#8220;helpfully&#8221; followed a redirect for you. So you'll continue pounding away at the old address, and each time you'll get redirected to
the new address. That's two round trips instead of one: not very efficient! Later in this chapter, you'll see how to work
around this so you can deal with permanent redirects properly and efficiently.
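<p>The redirect rules boil down to a small piece of bookkeeping. Here is a minimal sketch, assuming you already have the status code and the <code>Location:</code> value in hand; the function name <code>address_for_next_request</code> is made up for this illustration.

```python
def address_for_next_request(requested_url, status, location=None):
    """Return the URL to use the next time this resource is requested.

    Hypothetical helper: status is the HTTP status code of the first
    response, location is the value of its Location: header (if any).
    """
    if status == 301 and location:
        return location        # permanent redirect: switch to the new address
    return requested_url       # 302 (temporary) or anything else: keep the old one

print(address_for_next_request(
    "http://example.com/index.xml", 301, "http://example.com/xml/atom.xml"))
print(address_for_next_request(
    "http://example.com/index.xml", 302, "http://example.com/xml/atom.xml"))
```

<p>Only the <code>301</code> changes the address you store; after a <code>302</code> you keep asking at the original address.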
<h3>11.3.3. <code>Last-Modified</code>/<code>If-Modified-Since</code></h3>
<p>Some data changes all the time. The home page of CNN.com is updated every few minutes. On the other hand, the
home page of Google.com only changes once every few weeks (when they put up a special holiday logo, or advertise a new service).
Web services are no different; usually the server knows when the data you requested last changed, and HTTP provides a way
for the server to include this last-modified date along with the data you requested.
<p>If you ask for the same data a second time (or third, or fourth), you can tell the server the last-modified date that you
got last time: you send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn't changed since then, the
server sends back a special HTTP status code <code>304</code>, which means &#8220;this data hasn't changed since the last time you asked for it&#8221;. Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn't re-send the data</em>. All you get is the status code. So you don't need to download the same data over and over again if it hasn't changed;
the server assumes you have the data cached locally.
<p>All modern web browsers support last-modified date checking. If you've ever visited a page, re-visited the same page a day
later and found that it hadn't changed, and wondered why it loaded so quickly the second time -- this could be why. Your
web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically
sent the last-modified date it got from the server the first time. The server simply says <code>304: Not Modified</code>, so your browser knows to load the page from its cache. Web services can be this smart too.
<p>Python's URL library has no built-in support for last-modified date checking, but since you can add arbitrary headers to each request
and read arbitrary headers in each response, you can add support for it yourself.
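<p>Adding the header yourself takes one line. The following is a sketch, not a full client: it only builds the request and never sends it. The helper name <code>conditional_request</code> is hypothetical, the date string is an example value in the format servers use for <code>Last-Modified</code>, and the <code>try</code>/<code>except</code> import is an assumption to cover the <code>urllib.request</code> rename.

```python
try:
    import urllib2 as urlrequest           # Python 2 name, as used in this chapter
except ImportError:
    import urllib.request as urlrequest    # Python 3 home of the same Request class

def conditional_request(url, lastmodified=None):
    """Build (but do not send) a request that asks for data only if it changed."""
    request = urlrequest.Request(url)
    if lastmodified:
        # Echo back the exact Last-Modified string the server sent last time.
        request.add_header("If-Modified-Since", lastmodified)
    return request

request = conditional_request(
    "http://diveintomark.org/xml/atom.xml",
    "Wed, 14 Apr 2004 22:14:38 GMT")
print(request.header_items())
```

<p>Note that the library normalizes the capitalization of header names when it stores them; HTTP header names are case-insensitive, so the server doesn't mind.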
<h3>11.3.4. <code>ETag</code>/<code>If-None-Match</code></h3>
<p>ETags are an alternate way to accomplish the same thing as the last-modified date checking: don't re-download data that hasn't
changed. The way it works is, the server sends some sort of hash of the data (in an <code>ETag</code> header) along with the data you requested. Exactly how this hash is determined is entirely up to the server. The second
time you request the same data, you include the ETag hash in an <code>If-None-Match:</code> header, and if the data hasn't changed, the server will send you back a <code>304</code> status code. As with the last-modified date checking, the server <em>just</em> sends the <code>304</code>; it doesn't send you the same data a second time. By including the ETag hash in your second request, you're telling the
server that there's no need to re-send the same data if it still matches this hash, since you still have the data from the
last time.
<p>Python's URL library has no built-in support for ETags, but you'll see how to add it later in this chapter.
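<p>To make the exchange concrete, here is a toy model of the server's side of the conversation. It is only a sketch: the function <code>respond</code> is hypothetical, and using a SHA-256 digest as the ETag is an assumption for the example, since a real server is free to compute the value however it likes.

```python
import hashlib

def respond(data, if_none_match=None):
    """Return (status, headers, body) the way an ETag-aware server might."""
    # One possible hash; real servers choose their own scheme.
    etag = '"%s"' % hashlib.sha256(data).hexdigest()
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""    # unchanged: status only, no body
    return 200, {"ETag": etag}, data       # new or changed: send everything

feed = b"<feed>...</feed>"
status, headers, body = respond(feed)                       # first request
status2, headers2, body2 = respond(feed, headers["ETag"])   # second request
print(status)
print(status2)
```

<p>The first call returns the data with a <code>200</code>; the second, which echoes the ETag back, gets a <code>304</code> and an empty body.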
<h3>11.3.5. Compression</h3>
<p>The last important HTTP feature is gzip compression. When you talk about HTTP web services, you're almost always talking
about moving XML back and forth over the wire. XML is text, and quite verbose text at that, and text generally compresses
well. When you request a resource over HTTP, you can ask the server to send any new data in compressed format.
You include the <code>Accept-encoding: gzip</code> header in your request, and if the server supports compression, it will send you back gzip-compressed data and mark it with
a <code>Content-encoding: gzip</code> header.
<p>Python's URL library has no built-in support for gzip compression per se, but you can add arbitrary headers to the request. And
Python comes with a separate <code class="filename">gzip</code> module, which has functions you can use to decompress the data yourself.
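<p>Here is a minimal sketch of what the <code class="filename">gzip</code> module does with in-memory data, with no server involved. The helper names <code>gzip_compress</code> and <code>gzip_decompress</code> and the sample data are made up for this illustration; the first helper plays the role of the server, the second is what your client would do with the response body.

```python
import gzip
from io import BytesIO

def gzip_compress(data):
    """Compress a byte string the way a server might before sending it."""
    buffer = BytesIO()
    compressor = gzip.GzipFile(fileobj=buffer, mode="wb")
    compressor.write(data)
    compressor.close()            # flushes the gzip trailer into buffer
    return buffer.getvalue()

def gzip_decompress(compressed):
    """Decompress a gzip-compressed byte string held in memory."""
    return gzip.GzipFile(fileobj=BytesIO(compressed)).read()

original = b"<entry><title>dive into mark</title></entry>" * 100
compressed = gzip_compress(original)
print("%d bytes compressed to %d bytes" % (len(original), len(compressed)))
print(gzip_decompress(compressed) == original)
```

<p>Wrapping the bytes in a file-like object is the key step, because <code>GzipFile</code> reads from a file object rather than from a string.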
<p>Note that <a href="#oa.review" title="11.2. How not to fetch data over HTTP">our little one-line script</a> to download a syndicated feed did not support any of these HTTP features. Let's see how you can improve it.
<h2 id="oa.debug">11.4. Debugging HTTP web services</h2>
<p>First, let's turn on the debugging features of Python's HTTP library and see what's being sent over the wire. This will be useful throughout the chapter, as you add more and
more features.
<div class="example"><h3>Example 11.3. Debugging HTTP</h3><pre class="screen">
<samp class="prompt">>>> </samp>import httplib
<samp class="prompt">>>> </samp>httplib.HTTPConnection.debuglevel = 1 <img id="oa.debug.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>import urllib
<samp class="prompt">>>> </samp>feeddata = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()
connect: (diveintomark.org, 80) <img id="oa.debug.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
send: '
GET /xml/atom.xml HTTP/1.0 <img id="oa.debug.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
Host: diveintomark.org <img id="oa.debug.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
User-agent: Python-urllib/1.15 <img id="oa.debug.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
'
reply: 'HTTP/1.1 200 OK\r\n' <img id="oa.debug.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
header: Date: Wed, 14 Apr 2004 22:27:30 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Content-Type: application/atom+xml
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT <img id="oa.debug.1.7" src="images/callouts/7.png" alt="7" border="0" width="12" height="12">
header: ETag: "e8284-68e0-4de30f80" <img id="oa.debug.1.8" src="images/callouts/8.png" alt="8" border="0" width="12" height="12">
header: Accept-Ranges: bytes
header: Content-Length: 26848
header: Connection: close
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">urllib</code> relies on another standard Python library, <code class="filename">httplib</code>. Normally you don't need to <code>import httplib</code> directly (<code class="filename">urllib</code> does that automatically), but you will here so you can set the debugging flag on the <code class="classname">HTTPConnection</code> class that <code class="filename">urllib</code> uses internally to connect to the HTTP server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there's no particular standard for naming them or turning them on; you need to read
the documentation of each library to see if such a feature is available.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now that the debugging flag is set, information on the HTTP request and response is printed out in real time. The first
thing it tells you is that you're connecting to the server <code>diveintomark.org</code> on port 80, which is the standard port for HTTP.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When you request the Atom feed, <code class="filename">urllib</code> sends three lines to the server. The first line specifies the HTTP verb you're using, and the path of the resource (minus
the domain name). All the requests in this chapter will use <code>GET</code>, but in the next chapter on <acronym>SOAP</acronym>, you'll see that it uses <code>POST</code> for everything. The basic syntax is the same, regardless of the verb.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The second line is the <code>Host</code> header, which specifies the domain name of the service you're accessing. This is important, because a single HTTP server
can host multiple separate domains. My server currently hosts 12 domains; other servers can host hundreds or even thousands.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The third line is the <code>User-Agent</code> header. What you see here is the generic <code>User-Agent</code> that the <code class="filename">urllib</code> library adds by default. In the next section, you'll see how to customize this to be more specific.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the <code class="varname">feeddata</code> variable). The status code here is <code>200</code>, meaning &#8220;everything's normal, here's the data you requested&#8221;. The server also tells you the date it responded to your request, some information about the server itself, and the content
type of the data it's giving you. Depending on your application, this might be useful, or not. It's certainly reassuring
that you thought you were asking for an Atom feed, and lo and behold, you're getting an Atom feed (<code>application/atom+xml</code>, which is the registered content type for Atom feeds).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.7"><img src="images/callouts/7.png" alt="7" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server tells you when this Atom feed was last modified (in this case, about 13 minutes ago). You can send this date back
to the server the next time you request the same feed, and the server can do last-modified checking.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.debug.1.8"><img src="images/callouts/8.png" alt="8" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server also tells you that this Atom feed has an ETag hash of <code>"e8284-68e0-4de30f80"</code>. The hash doesn't mean anything by itself; there's nothing you can do with it, except send it back to the server the next
time you request this same feed. Then the server can use it to tell you if the data has changed or not.
</td>
</tr>
</table>
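<p>If you would rather not change a class-wide flag, a per-opener alternative is to pass a debug level to the HTTP handler when you build an opener. This is a sketch under assumptions: it relies on <code>HTTPHandler</code> accepting a <code>debuglevel</code> argument, the <code>try</code>/<code>except</code> import covers the <code>urllib.request</code> rename, and nothing is fetched here, so no debugging output appears until you actually open a URL.

```python
try:
    import urllib2 as urlrequest           # Python 2 name, as used in this chapter
except ImportError:
    import urllib.request as urlrequest    # Python 3 home of the same classes

# Ask this one handler to print the HTTP conversation as it happens.
debug_handler = urlrequest.HTTPHandler(debuglevel=1)

# build_opener uses the handler you pass in place of its default HTTP handler.
opener = urlrequest.build_opener(debug_handler)

# opener.open('http://diveintomark.org/xml/atom.xml') would now print
# the request and the response headers, much like the example above.
print(debug_handler in opener.handlers)
```

<p>Check the documentation for your Python version to see which of the two approaches it supports.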
<h2 id="oa.useragent">11.5. Setting the <code>User-Agent</code></h2>
<p>The first step to improving your HTTP web services client is to identify yourself properly with a <code>User-Agent</code>. To do that, you need to move beyond the basic <code class="filename">urllib</code> and dive into <code class="filename">urllib2</code>.
<div class="example"><h3>Example 11.4. Introducing <code class="filename">urllib2</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import httplib
<samp class="prompt">>>> </samp>httplib.HTTPConnection.debuglevel = 1 <img id="oa.useragent.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>import urllib2
<samp class="prompt">>>> </samp>request = urllib2.Request('http://diveintomark.org/xml/atom.xml') <img id="oa.useragent.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>opener = urllib2.build_opener() <img id="oa.useragent.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>feeddata = opener.open(request).read() <img id="oa.useragent.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 14 Apr 2004 23:23:12 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Content-Type: application/atom+xml
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT
header: ETag: "e8284-68e0-4de30f80"
header: Accept-Ranges: bytes
header: Content-Length: 26848
header: Connection: close
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If you still have your Python <acronym>IDE</acronym> open from the previous section's example, you can skip this, but this turns on <a href="#oa.debug" title="11.4. Debugging HTTP web services">HTTP debugging</a> so you can see what you're actually sending over the wire, and what gets sent back.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Fetching an HTTP resource with <code class="filename">urllib2</code> is a three-step process, for good reasons that will become clear shortly. The first step is to create a <code class="classname">Request</code> object, which takes the URL of the resource you'll eventually get around to retrieving. Note that this step doesn't actually
retrieve anything yet.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The second step is to build a URL opener. This can take any number of handlers, which control how responses are handled.
But you can also build an opener without any custom handlers, which is what you're doing here. You'll see how to define
and use custom handlers later in this chapter when you explore redirects.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The final step is to tell the opener to open the URL, using the <code class="classname">Request</code> object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the
resource and stores the returned data in <code class="varname">feeddata</code>.
</td>
</tr>
</table>
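<p>The three steps above can be wrapped in a small helper. This is only a sketch: the name <code>fetch</code> is hypothetical, and the import fallback assumes that on Python 3 the same classes are available from <code>urllib.request</code>.

```python
try:
    import urllib2                      # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 home of the same classes


def fetch(url):
    """Retrieve a URL using the three explicit steps (hypothetical helper)."""
    request = urllib2.Request(url)       # step 1: describe the request; nothing is sent yet
    opener = urllib2.build_opener()      # step 2: build an opener with no custom handlers
    return opener.open(request).read()   # step 3: actually retrieve the resource
```

<p>Keeping the steps separate inside the helper leaves room to add headers to the request or handlers to the opener later.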
<div class="example"><h3>Example 11.5. Adding headers with the <code class="classname">Request</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>request <img id="oa.useragent.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;urllib2.Request instance at 0x00250AA8>
<samp class="prompt">>>> </samp>request.get_full_url()
'http://diveintomark.org/xml/atom.xml'
<samp class="prompt">>>> </samp><kbd>request.add_header('User-Agent',</kbd>
<samp class="prompt">... </samp><kbd>'OpenAnything/1.0 +http://diveintopython3.org/')</kbd> <img id="oa.useragent.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>feeddata = opener.open(request).read() <img id="oa.useragent.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: OpenAnything/1.0 +http://diveintopython3.org/ <img id="oa.useragent.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 14 Apr 2004 23:45:17 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Content-Type: application/atom+xml
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT
header: ETag: "e8284-68e0-4de30f80"
header: Accept-Ranges: bytes
header: Content-Length: 26848
header: Connection: close
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You're continuing from the previous example; you've already created a <code class="classname">Request</code> object with the URL you want to access.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using the <code class="function">add_header</code> method on the <code class="classname">Request</code> object, you can add arbitrary HTTP headers to the request. The first argument is the header, the second is the value you're
providing for that header. Convention dictates that a <code>User-Agent</code> should be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form,
and you'll see a lot of variations in the wild, but somewhere it should include a URL of your application. The <code>User-Agent</code> is usually logged by the server along with other details of your request, and including a URL of your application allows
server administrators looking through their access logs to contact you if something is wrong.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">opener</code> object you created before can be reused too, and it will retrieve the same feed again, but with your custom <code>User-Agent</code> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.useragent.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And here you are, sending your custom <code>User-Agent</code> in place of the generic one that Python sends by default. If you look closely, you'll notice that you defined a <code>User-Agent</code> header, but you actually sent a <code>User-agent</code> header. See the difference? <code class="filename">urllib2</code> changed the case so that only the first letter was capitalized. It doesn't really matter; HTTP specifies that header field
names are completely case-insensitive.
</td>
</tr>
</table>
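<p>You can check the header on the <code class="classname">Request</code> object itself, without sending anything. The following is a minimal sketch; the helper name <code>make_request</code> is hypothetical, and the import fallback assumes Python 3 exposes the same classes as <code>urllib.request</code>.

```python
try:
    import urllib2                      # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 home of the same classes


def make_request(url, user_agent):
    """Build a Request that identifies the client (hypothetical helper)."""
    request = urllib2.Request(url)
    request.add_header('User-Agent', user_agent)
    return request


request = make_request('http://diveintomark.org/xml/atom.xml',
                       'OpenAnything/1.0 +http://diveintopython3.org/')
# The header name is stored with only its first letter capitalized: 'User-agent'
print(request.headers)
```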
<h2 id="oa.etags">11.6. Handling <code>Last-Modified</code> and <code>ETag</code></h2>
<p>Now that you know how to add custom HTTP headers to your web service requests, let's look at adding support for <code>Last-Modified</code> and <code>ETag</code> headers.
<p>These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can
turn it off by setting <code>httplib.HTTPConnection.debuglevel = 0</code>. Or you can just leave debugging on, if that helps you.
<div class="example"><h3 id="oa.etags.example.1">Example 11.6. Testing <code>Last-Modified</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib2
<samp class="prompt">>>> </samp>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
<samp class="prompt">>>> </samp>opener = urllib2.build_opener()
<samp class="prompt">>>> </samp>firstdatastream = opener.open(request)
<samp class="prompt">>>> </samp>firstdatastream.headers.dict <img id="oa.etags.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">{'date': 'Thu, 15 Apr 2004 20:42:41 GMT',
'server': 'Apache/2.0.49 (Debian GNU/Linux)',
'content-type': 'application/atom+xml',
'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT',
'etag': '"e842a-3e53-55d97640"',
'content-length': '15955',
'accept-ranges': 'bytes',
'connection': 'close'}</samp>
<samp class="prompt">>>> </samp>request.add_header('If-Modified-Since',
<samp class="prompt">... </samp>firstdatastream.headers.get('Last-Modified')) <img id="oa.etags.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>seconddatastream = opener.open(request) <img id="oa.etags.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
File "c:\python23\lib\urllib2.py", line 326, in open
'_open', req)
File "c:\python23\lib\urllib2.py", line 306, in _call_chain
result = func(*args)
File "c:\python23\lib\urllib2.py", line 901, in http_open
return self.do_open(httplib.HTTP, req)
File "c:\python23\lib\urllib2.py", line 895, in do_open
return self.parent.error('http', req, fp, code, msg, hdrs)
File "c:\python23\lib\urllib2.py", line 352, in error
return self._call_chain(*args)
File "c:\python23\lib\urllib2.py", line 306, in _call_chain
result = func(*args)
File "c:\python23\lib\urllib2.py", line 412, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 304: Not Modified</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember all those HTTP headers you saw printed out when you turned on debugging? This is how you can get access to them
programmatically: <code class="varname">firstdatastream.headers</code> is <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">an object that acts like a dictionary</a> and allows you to get any of the individual headers returned from the HTTP server.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">On the second request, you add the <code>If-Modified-Since</code> header with the last-modified date from the first request. If the data hasn't changed, the server should return a <code>304</code> status code.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Sure enough, the data hasn't changed. You can see from the traceback that <code class="filename">urllib2</code> raises a special exception, <code class="classname">HTTPError</code>, in response to the <code>304</code> status code. This is a little unusual, and not entirely helpful. After all, it's not an error; you specifically asked the
server not to send you any data if it hadn't changed, and the data didn't change, so the server told you it wasn't sending
you any data. That's not an error; that's exactly what you were hoping for.
</td>
</tr>
</table>
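<p>With a plain opener, the exception in the traceback above has to be caught if you want to look at the status code. The following is one possible sketch, not the approach this chapter ends up taking; <code>open_with_status</code> is a hypothetical name, and the import fallback assumes Python 3 exposes the same classes as <code>urllib.request</code>.

```python
try:
    import urllib2                      # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 home of the same classes


def open_with_status(opener, request):
    """Return (response_or_error, status) instead of letting HTTPError propagate."""
    try:
        response = opener.open(request)
    except urllib2.HTTPError as error:
        # The exception object carries the HTTP status code in its `code` attribute.
        return error, error.code
    # Assumption: a response without a `code` attribute is treated as a plain 200.
    return response, getattr(response, 'code', 200)
```

<p>The drawback is that every caller has to remember to go through a wrapper like this one.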
<p><code class="filename">urllib2</code> also raises an <code class="classname">HTTPError</code> exception for conditions that you would think of as errors, such as <code>404</code> (page not found). In fact, it will raise <code class="classname">HTTPError</code> for <em>any</em> status code other than <code>200</code> (OK), <code>301</code> (permanent redirect), or <code>302</code> (temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without
throwing an exception. To do that, you'll need to define a custom URL handler.
<div class="example"><h3>Example 11.7. Defining URL handlers</h3>
<p>This custom URL handler is part of <code class="filename">openanything.py</code>.<pre class="programlisting">
class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <img id="oa.etags.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    def http_error_default(self, req, fp, code, msg, headers): <img id="oa.etags.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        result = urllib2.HTTPError(
            req.get_full_url(), code, msg, headers, fp)
        result.status = code <img id="oa.etags.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">urllib2</code> is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens
-- like an HTTP error, or even a <code>304</code> code -- <code class="filename">urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code class="filename">urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">urllib2</code> searches through the defined handlers and calls the <code class="methodname">http_error_default</code> method when it encounters a <code>304</code> status code from the server. By defining a custom error handler, you can prevent <code class="filename">urllib2</code> from raising an exception: you still create the <code class="classname">HTTPError</code> object, but you return it rather than raising it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access
to it from the calling program.
</td>
</tr>
</table>
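<p>The handler lookup described above can be tried without touching the network by asking the opener to dispatch an error directly. This is only a sketch under stated assumptions: <code>ReturningErrorHandler</code> is a hypothetical variant that reads the <code>code</code> attribute <code class="classname">HTTPError</code> already carries rather than storing a separate <code>status</code>, the <code>304</code> is simulated by calling <code>opener.error</code> by hand, and the import fallback assumes Python 3 exposes the same classes as <code>urllib.request</code>.

```python
try:
    import urllib2                      # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 home of the same classes


class ReturningErrorHandler(urllib2.HTTPDefaultErrorHandler):
    """Hypothetical handler: hand the error object back instead of raising it."""

    def http_error_default(self, req, fp, code, msg, headers):
        # Build the same HTTPError the default handler would raise, but return it.
        return urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)


opener = urllib2.build_opener(ReturningErrorHandler())
request = urllib2.Request('http://diveintomark.org/xml/atom.xml')

# Simulate a 304 reply: no handler defines http_error_304, so the opener
# falls back to http_error_default, which is the method overridden above.
result = opener.error('http', request, None, 304, 'Not Modified', {})
print(result.code)  # 304
```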
<div class="example"><h3>Example 11.8. Using custom URL handlers</h3><pre class="screen">
<samp class="prompt">>>> </samp>request.headers <img id="oa.etags.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
{'If-modified-since': 'Thu, 15 Apr 2004 19:45:21 GMT'}
<samp class="prompt">>>> </samp>import openanything
<samp class="prompt">>>> </samp>opener = urllib2.build_opener(
<samp class="prompt">... </samp>openanything.DefaultErrorHandler()) <img id="oa.etags.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>seconddatastream = opener.open(request)
<samp class="prompt">>>> </samp>seconddatastream.status <img id="oa.etags.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
304
<samp class="prompt">>>> </samp>seconddatastream.read() <img id="oa.etags.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
''
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You're continuing the previous example, so the <code class="classname">Request</code> object is already set up, and you've already added the <code>If-Modified-Since</code> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the key: now that you've defined your custom URL handler, you need to tell <code class="filename">urllib2</code> to use it. Remember how I said that <code class="filename">urllib2</code> broke up the process of accessing an HTTP resource into three steps, and for good reason? This is why building the URL opener
is its own step, because you can build it with your own custom URL handlers that override <code class="filename">urllib2</code>'s default behavior.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use <code class="varname">seconddatastream.headers.dict</code> to access them), also contains the HTTP status code. In this case, as you expected, the status is <code>304</code>, meaning this data hasn't changed since the last time you asked for it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that when the server sends back a <code>304</code> status code, it doesn't re-send the data. That's the whole point: to save bandwidth by not re-downloading data that hasn't
changed. So if you actually want that data, you'll need to cache it locally the first time you get it.
</td>
</tr>
</table>
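<p>Because a <code>304</code> arrives with an empty body, the caller has to keep the earlier copy around. The following is a minimal sketch of that bookkeeping; the function <code>body_for</code>, the plain dictionary used as a cache, and the placeholder feed text are all assumptions made for illustration.

```python
def body_for(url, status, new_body, cache):
    """Return the data to use for url, updating the cache on fresh downloads."""
    if status == 304:
        # Not modified: the server sent nothing, so fall back to the saved copy.
        return cache[url]
    # Anything else carried real data: remember it for next time.
    cache[url] = new_body
    return new_body


cache = {}  # hypothetical in-memory cache, keyed by URL
feed_url = 'http://diveintomark.org/xml/atom.xml'
first = body_for(feed_url, 200, '<feed>placeholder</feed>', cache)
second = body_for(feed_url, 304, '', cache)
```

<p>A real client would also persist the cache between runs; that part is left out here.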
<p>Handling <code>ETag</code> works much the same way, but instead of checking for <code>Last-Modified</code> and sending <code>If-Modified-Since</code>, you check for <code>ETag</code> and send <code>If-None-Match</code>. Let's start with a fresh <acronym>IDE</acronym> session.
<div class="example"><h3 id="oa.etags.example">Example 11.9. Supporting <code>ETag</code>/<code>If-None-Match</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib2, openanything
<samp class="prompt">>>> </samp>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
<samp class="prompt">>>> </samp>opener = urllib2.build_opener(
<samp class="prompt">... </samp>openanything.DefaultErrorHandler())
<samp class="prompt">>>> </samp>firstdatastream = opener.open(request)
<samp class="prompt">>>> </samp>firstdatastream.headers.get('ETag') <img id="oa.etags.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'"e842a-3e53-55d97640"'
<samp class="prompt">>>> </samp>firstdata = firstdatastream.read()
<samp class="prompt">>>> </samp>print firstdata <img id="oa.etags.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:lang="en">
&lt;title mode="escaped">dive into mark&lt;/title>
&lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
&lt;!-- rest of feed omitted for brevity --></samp>
<samp class="prompt">>>> </samp>request.add_header('If-None-Match',
<samp class="prompt">... </samp>firstdatastream.headers.get('ETag')) <img id="oa.etags.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>seconddatastream = opener.open(request)
<samp class="prompt">>>> </samp>seconddatastream.status <img id="oa.etags.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
304
<samp class="prompt">>>> </samp>seconddatastream.read() <img id="oa.etags.4.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
''
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using the <code class="varname">firstdatastream.headers</code> pseudo-dictionary, you can get the <code>ETag</code> returned from the server. (What happens if the server didn't send back an <code>ETag</code>? Then this line would return <code>None</code>.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">OK, you got the data.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now set up the second call by setting the <code>If-None-Match</code> header to the <code>ETag</code> you got from the first call.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The second call succeeds quietly (without raising an exception), and once again you see that the server has sent back a <code>304</code> status code. Based on the <code>ETag</code> you sent the second time, it knows that the data hasn't changed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.etags.4.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Regardless of whether the <code>304</code> is triggered by <code>Last-Modified</code> date checking or <code>ETag</code> hash matching, you'll never get the data along with the <code>304</code>. That's the whole point.
</td>
</tr>
</table>
</div><table id="tip.etag.vs.lastmodified" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">In these examples, the HTTP server has supported both <code>Last-Modified</code> and <code>ETag</code> headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively
in case a server only supports one or the other, or neither.
</td>
</tr>
</table>
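<p>One way to code defensively is to add each conditional header only when the server actually supplied the matching value. This is a sketch rather than the code from <code class="filename">openanything.py</code>; the name <code>conditional_request</code> is hypothetical, and the import fallback assumes Python 3 exposes the same classes as <code>urllib.request</code>.

```python
try:
    import urllib2                      # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 home of the same classes


def conditional_request(url, etag=None, last_modified=None):
    """Build a Request carrying only the validators the server gave you."""
    request = urllib2.Request(url)
    if etag:
        # The server sent an ETag last time, so ask "has it changed from this?"
        request.add_header('If-None-Match', etag)
    if last_modified:
        # The server sent a Last-Modified date last time.
        request.add_header('If-Modified-Since', last_modified)
    return request
```

<p>Since <code>headers.get()</code> returns <code>None</code> for a missing header, the results of <code>firstdatastream.headers.get('ETag')</code> and <code>firstdatastream.headers.get('Last-Modified')</code> can be passed straight in.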
<h2 id="oa.redirect">11.7. Handling redirects</h2>
<p>You can support permanent and temporary redirects using a different kind of custom URL handler.
<p>First, let's see why a redirect handler is necessary in the first place.
<div class="example"><h3>Example 11.10. Accessing web services without a redirect handler</h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib2, httplib
<samp class="prompt">>>> </samp>httplib.HTTPConnection.debuglevel = 1 <img id="oa.redirect.1.0" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>request = urllib2.Request(
<samp class="prompt">... </samp>'http://diveintomark.org/redir/example301.xml') <img id="oa.redirect.1.1" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>opener = urllib2.build_opener()
<samp class="prompt">>>> </samp>f = opener.open(request)
<samp class="computeroutput">connect: (diveintomark.org, 80)
send: '
GET /redir/example301.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'</samp> <img id="oa.redirect.1.2" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">header: Date: Thu, 15 Apr 2004 22:06:25 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Location: http://diveintomark.org/xml/atom.xml</samp> <img id="oa.redirect.1.3" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">header: Content-Length: 338
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0</samp> <img id="oa.redirect.1.4" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="computeroutput">Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:06:25 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Content-Length: 15955
header: Connection: close
header: Content-Type: application/atom+xml</samp>
<samp class="prompt">>>> </samp>f.url <img id="oa.redirect.1.5" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
'http://diveintomark.org/xml/atom.xml'
<samp class="prompt">>>> </samp>f.headers.dict
<samp class="computeroutput">{'content-length': '15955',
'accept-ranges': 'bytes',
'server': 'Apache/2.0.49 (Debian GNU/Linux)',
'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT',
'connection': 'close',
'etag': '"e842a-3e53-55d97640"',
'date': 'Thu, 15 Apr 2004 22:06:25 GMT',
'content-type': 'application/atom+xml'}</samp>
<samp class="prompt">>>> </samp>f.status
<samp class="traceback">Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
AttributeError: addinfourl instance has no attribute 'status'</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.1.0"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You'll be better able to see what's happening if you turn on debugging.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.1.1"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a URL which I have set up to permanently redirect to my Atom feed at <code>http://diveintomark.org/xml/atom.xml</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.1.2"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Sure enough, when you try to download the data at that address, the server sends back a <code>301</code> status code, telling you that the resource has moved permanently.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.1.3"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server also sends back a <code>Location:</code> header that gives the new address of this data.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.1.4"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">urllib2</code> notices the redirect status code and automatically tries to retrieve the data at the new location specified in the <code>Location:</code> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.1.5"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The object you get back from the <code class="varname">opener</code> contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent
address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary
or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at
the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from
now on.
</td>
</tr>
</table>
<p>This is suboptimal, but easy to fix. <code class="filename">urllib2</code> doesn't behave exactly as you want it to when it encounters a <code>301</code> or <code>302</code>, so let's override its behavior. How? With a custom URL handler, <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag">just like you did to handle <code>304</code> codes</a>.
<div class="example"><h3>Example 11.11. Defining the redirect handler</h3>
<p>This class is defined in <code class="filename">openanything.py</code>.<pre class="programlisting">
class SmartRedirectHandler(urllib2.HTTPRedirectHandler): <img id="oa.redirect.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301( <img id="oa.redirect.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
            self, req, fp, code, msg, headers)
        result.status = code <img id="oa.redirect.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        return result

    def http_error_302(self, req, fp, code, msg, headers): <img id="oa.redirect.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        result = urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
        result.status = code
        return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Redirect behavior is defined in <code class="filename">urllib2</code> in a class called <code class="classname">HTTPRedirectHandler</code>. You don't want to completely override the behavior; you just want to extend it a little. Subclassing <code class="classname">HTTPRedirectHandler</code> lets you call the ancestor class to do all the hard work.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">When it encounters a <code>301</code> status code from the server, <code class="filename">urllib2</code> will search through its handlers and call the <code class="methodname">http_error_301</code> method. The first thing ours does is just call the <code class="methodname">http_error_301</code> method in the ancestor, which handles the grunt work of looking for the <code>Location:</code> header and following the redirect to the new address.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here's the key: before you return, you store the status code (<code>301</code>), so that the calling program can access it later.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Temporary redirects (status code <code>302</code>) work the same way: override the <code>http_error_302</code> method, call the ancestor, and save the status code before returning.
</td>
</tr>
</table>
<p>So what has this bought us? You can now build a URL opener with the custom redirect handler, and it will still automatically
follow redirects, but now it will also expose the redirect status code.
<div class="example"><h3>Example 11.12. Using the redirect handler to detect permanent redirects</h3><pre class="screen">
<samp class="prompt">>>> </samp>request = urllib2.Request('http://diveintomark.org/redir/example301.xml')
<samp class="prompt">>>> </samp>import openanything, httplib
<samp class="prompt">>>> </samp>httplib.HTTPConnection.debuglevel = 1
<samp class="prompt">>>> </samp>opener = urllib2.build_opener(
<samp class="prompt">... </samp>openanything.SmartRedirectHandler()) <img id="oa.redirect.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f = opener.open(request)
<samp class="computeroutput">connect: (diveintomark.org, 80)
send: '
GET /redir/example301.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'</samp> <img id="oa.redirect.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">header: Date: Thu, 15 Apr 2004 22:13:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Location: http://diveintomark.org/xml/atom.xml
header: Content-Length: 338
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:13:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Content-Length: 15955
header: Connection: close
header: Content-Type: application/atom+xml
</samp>
<samp class="prompt">>>> </samp>f.status <img id="oa.redirect.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
301
<samp class="prompt">>>> </samp>f.url
'http://diveintomark.org/xml/atom.xml'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, build a URL opener with the redirect handler you just defined.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You sent off a request, and you got a <code>301</code> status code in response. At this point, the <code class="methodname">http_error_301</code> method gets called. You call the ancestor method, which follows the redirect and sends a request at the new location (<code>http://diveintomark.org/xml/atom.xml</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the payoff: now, not only do you have access to the new URL, but you also have access to the redirect status code, so you
can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location
(<code>http://diveintomark.org/xml/atom.xml</code>, as specified in <code class="varname">f.url</code>). If you stored the location in a configuration file or a database, update it so you don't keep pounding
the server with requests at the old address. It's time to update your address book.
</td>
</tr>
</table>
<p>The same redirect handler can also tell you that you <em>shouldn't</em> update your address book.
<div class="example"><h3>Example 11.13. Using the redirect handler to detect temporary redirects</h3><pre class="screen">
<samp class="prompt">>>> </samp>request = urllib2.Request(
<samp class="prompt">... </samp>'http://diveintomark.org/redir/example302.xml') <img id="oa.redirect.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f = opener.open(request)
<samp class="computeroutput">connect: (diveintomark.org, 80)
send: '
GET /redir/example302.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 302 Found\r\n'</samp> <img id="oa.redirect.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">header: Date: Thu, 15 Apr 2004 22:18:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Location: http://diveintomark.org/xml/atom.xml
header: Content-Length: 314
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0</samp> <img id="oa.redirect.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:18:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Content-Length: 15955
header: Connection: close
header: Content-Type: application/atom+xml</samp>
<samp class="prompt">>>> </samp>f.status <img id="oa.redirect.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
302
<samp class="prompt">>>> </samp>f.url
'http://diveintomark.org/xml/atom.xml'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a sample URL I've set up that is configured to tell clients to <em>temporarily</em> redirect to <code>http://diveintomark.org/xml/atom.xml</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server sends back a <code>302</code> status code, indicating a temporary redirect. The temporary new location of the data is given in the <code>Location:</code> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">urllib2</code> calls your <code class="methodname">http_error_302</code> method, which calls the ancestor method of the same name in <code class="classname">urllib2.HTTPRedirectHandler</code>, which follows the redirect to the new location. Then your <code class="methodname">http_error_302</code> method stores the status code (<code>302</code>) so the calling application can get it later.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.redirect.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And here you are, having successfully followed the redirect to <code>http://diveintomark.org/xml/atom.xml</code>. <code class="varname">f.status</code> tells you that this was a temporary redirect, which means that you should continue to request data from the original address
(<code>http://diveintomark.org/redir/example302.xml</code>). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It's not for you
to say. The server said this redirect was only temporary, so you should respect that. And now you're exposing enough information
that the calling application can respect that.
</td>
</tr>
</table>
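<p>One way a calling application might act on that status code is sketched below. This is only an illustration, not part of <code class="filename">openanything.py</code>; the helper name <code class="function">url_to_remember</code> is hypothetical, and the syntax is Python 3.

```python
def url_to_remember(requested_url, final_url, status):
    """Return the address to store for the next request.

    Hypothetical helper: only a permanent redirect (301) changes the
    stored address; a temporary redirect (302) leaves it alone.
    """
    if status == 301:
        return final_url
    return requested_url


old = 'http://diveintomark.org/redir/example302.xml'
new = 'http://diveintomark.org/xml/atom.xml'
print(url_to_remember(old, new, 302))  # keeps the original address
print(url_to_remember(old, new, 301))  # switches to the new address
```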
<h2 id="oa.gzip">11.8. Handling compressed data</h2>
<p>The last important HTTP feature you want to support is compression. Many web services have the ability to send data compressed,
which can cut down the amount of data sent over the wire by 60% or more. This is especially true of XML web services, since
XML data compresses very well.
<p>Servers won't give you compressed data unless you tell them you can handle it.
<div class="example"><h3>Example 11.14. Telling the server you would like compressed data</h3><pre class="screen">
<samp class="prompt">>>> </samp>import urllib2, httplib
<samp class="prompt">>>> </samp>httplib.HTTPConnection.debuglevel = 1
<samp class="prompt">>>> </samp>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
<samp class="prompt">>>> </samp>request.add_header('Accept-encoding', 'gzip') <img id="oa.gzip.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>opener = urllib2.build_opener()
<samp class="prompt">>>> </samp>f = opener.open(request)
<samp class="computeroutput">connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
Accept-encoding: gzip</samp><img id="oa.gzip.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:24:39 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Vary: Accept-Encoding
header: Content-Encoding: gzip</samp> <img id="oa.gzip.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
header: Content-Length: 6289 <img id="oa.gzip.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">header: Connection: close
header: Content-Type: application/atom+xml</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the key: once you've created your <code class="classname">Request</code> object, add an <code>Accept-encoding</code> header to tell the server you can accept gzip-encoded data. <code>gzip</code> is the name of the compression algorithm you're using. In theory there could be other compression algorithms, but <code>gzip</code> is the compression algorithm used by 99% of web servers.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">There's your header going across the wire.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And here's what the server sends back: the <code>Content-Encoding: gzip</code> header means that the data you're about to receive has been gzip-compressed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code>Content-Length</code> header is the length of the compressed data, not the uncompressed data. As you'll see in a minute, the actual length of
the uncompressed data was 15955, so gzip compression cut your bandwidth by over 60%!
</td>
</tr>
</table>
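<p>The &#8220;over 60%&#8221; figure follows directly from the two lengths quoted in this section. A quick check, written for Python 3 (where <code>/</code> is true division):

```python
compressed = 6289      # Content-Length of the gzip-compressed response
uncompressed = 15955   # length of the same feed once decompressed

# Fraction of bytes that never had to cross the wire.
saving = 1 - compressed / uncompressed
print('%.1f%% fewer bytes on the wire' % (saving * 100))
```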
<div class="example"><h3>Example 11.15. Decompressing the data</h3><pre class="screen">
<samp class="prompt">>>> </samp>compresseddata = f.read() <img id="oa.gzip.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>len(compresseddata)
6289
<samp class="prompt">>>> </samp>import StringIO
<samp class="prompt">>>> </samp>compressedstream = StringIO.StringIO(compresseddata) <img id="oa.gzip.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>import gzip
<samp class="prompt">>>> </samp>gzipper = gzip.GzipFile(fileobj=compressedstream) <img id="oa.gzip.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>data = gzipper.read() <img id="oa.gzip.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>print data <img id="oa.gzip.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="computeroutput">&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:lang="en">
&lt;title mode="escaped">dive into mark&lt;/title>
&lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
&lt;!-- rest of feed omitted for brevity --></samp>
<samp class="prompt">>>> </samp>len(data)
15955
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Continuing from the previous example, <code class="varname">f</code> is the file-like object returned from the URL opener. Using its <code class="methodname">read()</code> method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first
step towards getting the data you really want.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">OK, this step is a bit of a messy workaround. Python has a <code class="filename">gzip</code> module, which reads (and actually writes) gzip-compressed files on disk. But you don't have a file on disk, you have a gzip-compressed
buffer in memory, and you don't want to write out a temporary file just so you can uncompress it. So what you're going to
do is create a file-like object out of the in-memory data (<code class="varname">compresseddata</code>), using the <code class="filename">StringIO</code> module. You first saw the <code class="filename">StringIO</code> module in <a href="#kgp.openanything.stringio.example" title="Example 10.4. Introducing StringIO">the previous chapter</a>, but now you've found another use for it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can create an instance of <code class="classname">GzipFile</code>, and tell it that its &#8220;file&#8221; is the file-like object <code class="varname">compressedstream</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the line that does all the actual work: &#8220;reading&#8221; from <code class="classname">GzipFile</code> will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. <code class="varname">gzipper</code> is a file-like object which represents a gzip-compressed file. That &#8220;file&#8221; is not a real file on disk, though; <code class="varname">gzipper</code> is really just &#8220;reading&#8221; from the file-like object you created with <code class="filename">StringIO</code> to wrap the compressed data, which is only in memory in the variable <code class="varname">compresseddata</code>. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by &#8220;reading&#8221; from the file-like object you built with <code class="function">urllib2.build_opener</code>. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Look ma, real data. (15955 bytes of it, in fact.)</td>
</tr>
</table>
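<p>The same chain can be tried without a network connection. The sketch below is written for Python 3, where the bytes-oriented <code class="classname">io.BytesIO</code> plays the role that <code class="filename">StringIO</code> plays above; <code class="function">gzip.compress</code> merely stands in for the remote server, and the sample payload is made up.

```python
import gzip
import io

original = b'<feed version="0.3">dive into mark</feed>' * 50

# Stand-in for the bytes you would get from f.read() on a gzip-encoded response.
compresseddata = gzip.compress(original)

# Wrap the in-memory bytes in a file-like object, then let GzipFile read from it.
compressedstream = io.BytesIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()

print(len(compresseddata), len(data))
```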
<p>&#8220;But wait!&#8221; I hear you cry. &#8220;This could be even easier!&#8221; I know what you're thinking. You're thinking that <code class="varname">opener.open</code> returns a file-like object, so why not cut out the <code class="filename">StringIO</code> middleman and just pass <code class="varname">f</code> directly to <code class="classname">GzipFile</code>? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work.
<div class="example"><h3>Example 11.16. Decompressing the data directly from the server</h3><pre class="screen">
<samp class="prompt">>>> </samp>f = opener.open(request)<img id="oa.gzip.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>f.headers.get('Content-Encoding') <img id="oa.gzip.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'gzip'
<samp class="prompt">>>> </samp>data = gzip.GzipFile(fileobj=f).read() <img id="oa.gzip.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
File "c:\python23\lib\gzip.py", line 217, in read
self._read(readsize)
File "c:\python23\lib\gzip.py", line 252, in _read
pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Continuing from the previous example, you already have a <code class="classname">Request</code> object set up with an <code>Accept-encoding: gzip</code> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Simply opening the request gets you the headers (though it doesn't download any data yet). As you can see from the returned
<code>Content-Encoding</code> header, this data has been sent gzip-compressed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <code class="methodname">opener.open</code> returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data,
why not simply pass that file-like object directly to <code class="classname">GzipFile</code>? As you &#8220;read&#8221; from the <code class="classname">GzipFile</code> instance, it will &#8220;read&#8221; compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't
work. Because of the way gzip compression works, <code class="classname">GzipFile</code> needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the &#8220;file&#8221; is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and
forth through the data stream. So the inelegant hack of using <code class="filename">StringIO</code> is the best solution: download the compressed data, create a file-like object out of it with <code class="filename">StringIO</code>, and then decompress the data from that.
</td>
</tr>
</table>
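<p>If buffering the whole response is a concern, one alternative that never seeks is to feed the compressed bytes to <code class="filename">zlib</code> as they arrive. This is a different technique from the one this chapter uses, sketched here for Python 3; the function name <code class="function">gunzip_chunks</code> is hypothetical, and the 64-byte slices only simulate a forward-only stream.

```python
import gzip
import zlib


def gunzip_chunks(chunks):
    """Decompress gzip data that arrives as an iterable of byte chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        yield decompressor.decompress(chunk)
    # Emit anything still held in the decompressor's internal buffer.
    yield decompressor.flush()


payload = gzip.compress(b'dive into mark ' * 1000)
# Hand the data over 64 bytes at a time, never going backwards.
pieces = (payload[i:i + 64] for i in range(0, len(payload), 64))
data = b''.join(gunzip_chunks(pieces))
print(len(data))
```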
<h2 id="oa.alltogether">11.9. Putting it all together</h2>
<p>You've seen all the pieces for building an intelligent HTTP web services client. Now let's see how they all fit together.
<div class="example"><h3>Example 11.17. The <code class="function">openAnything</code> function</h3>
<p>This function is defined in <code class="filename">openanything.py</code>.<pre class="programlisting">
def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
    # non-HTTP code omitted for brevity
    if urlparse.urlparse(source)[0] == 'http': <img id="oa.alltogether.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        # open URL with urllib2
        request = urllib2.Request(source)
        request.add_header('User-Agent', agent) <img id="oa.alltogether.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        if etag:
            request.add_header('If-None-Match', etag) <img id="oa.alltogether.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        if lastmodified:
            request.add_header('If-Modified-Since', lastmodified) <img id="oa.alltogether.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        request.add_header('Accept-encoding', 'gzip') <img id="oa.alltogether.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
        opener = urllib2.build_opener(SmartRedirectHandler(), DefaultErrorHandler()) <img id="oa.alltogether.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
        return opener.open(request) <img id="oa.alltogether.1.7" src="images/callouts/7.png" alt="7" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="filename">urlparse</code> is a handy utility module for, you guessed it, parsing URLs. Its primary function, also called <code class="function">urlparse</code>, takes a URL and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier).
Of these, the only thing you care about is the scheme, to make sure that you're dealing with an HTTP URL (which <code class="filename">urllib2</code> can handle).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You identify yourself to the HTTP server with the <code>User-Agent</code> passed in by the calling function. If no <code>User-Agent</code> was specified, you use a default one defined earlier in the <code class="filename">openanything.py</code> module. You never use the default one defined by <code class="filename">urllib2</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If an <code>ETag</code> hash was given, send it in the <code>If-None-Match</code> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If a last-modified date was given, send it in the <code>If-Modified-Since</code> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Tell the server you would like compressed data if possible.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Build a URL opener that uses <em>both</em> of the custom URL handlers: <code class="classname">SmartRedirectHandler</code> for handling <code>301</code> and <code>302</code> redirects, and <code class="classname">DefaultErrorHandler</code> for handling <code>304</code>, <code>404</code>, and other error conditions gracefully.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.1.7"><img src="images/callouts/7.png" alt="7" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">That's it! Open the URL and return a file-like object to the caller.</td>
</tr>
</table>
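<p>The header-building half of this function can be exercised without touching the network. The sketch below is a rough Python 3 rendering, where <code class="filename">urllib2</code> and <code class="filename">urlparse</code> have become <code class="filename">urllib.request</code> and <code class="filename">urllib.parse</code>. The name <code class="function">build_request</code> and the placeholder <code class="varname">USER_AGENT</code> value are assumptions for illustration, and the custom handlers are left out.

```python
import urllib.parse
import urllib.request

USER_AGENT = 'ExampleClient/1.0'  # placeholder default, not the book's value


def build_request(source, etag=None, lastmodified=None, agent=USER_AGENT):
    """Build (but do not send) a request carrying the headers described above."""
    if urllib.parse.urlparse(source)[0] != 'http':
        raise ValueError('not an HTTP URL: %r' % source)
    request = urllib.request.Request(source)
    request.add_header('User-Agent', agent)
    if etag:
        request.add_header('If-None-Match', etag)
    if lastmodified:
        request.add_header('If-Modified-Since', lastmodified)
    request.add_header('Accept-encoding', 'gzip')
    return request


request = build_request('http://diveintomark.org/xml/atom.xml',
                        etag='"e842a-3e53-55d97640"')
# Nothing has been sent yet; this only lists what would go over the wire.
for name, value in sorted(request.header_items()):
    print(name, value)
```

<p>Note that <code class="classname">Request</code> stores header names with only the first letter capitalized, which is why the debugging output in this chapter shows <code>User-agent</code> rather than <code>User-Agent</code>.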
<div class="example"><h3>Example 11.18. The <code class="function">fetch</code> function</h3>
<p>This function is defined in <code class="filename">openanything.py</code>.<pre class="programlisting">
def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
    '''Fetch data and metadata from a URL, file, stream, or string'''
    result = {}
    f = openAnything(source, etag, last_modified, agent) <img id="oa.alltogether.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    result['data'] = f.read() <img id="oa.alltogether.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    if hasattr(f, 'headers'):
        # save ETag, if the server sent one
        result['etag'] = f.headers.get('ETag') <img id="oa.alltogether.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        # save Last-Modified header, if the server sent one
        result['lastmodified'] = f.headers.get('Last-Modified') <img id="oa.alltogether.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        if f.headers.get('content-encoding', '') == 'gzip': <img id="oa.alltogether.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
            # data came back gzip-compressed, decompress it
            result['data'] = gzip.GzipFile(fileobj=StringIO(result['data'])).read()
    if hasattr(f, 'url'): <img id="oa.alltogether.2.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
        result['url'] = f.url
        result['status'] = 200
    if hasattr(f, 'status'): <img id="oa.alltogether.2.7" src="images/callouts/7.png" alt="7" border="0" width="12" height="12">
        result['status'] = f.status
    f.close()
    return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, you call the <code class="function">openAnything</code> function with a URL, <code>ETag</code> hash, <code>Last-Modified</code> date, and <code>User-Agent</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Read the actual data returned from the server. This may be compressed; if so, you'll decompress it later.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Save the <code>ETag</code> hash returned from the server, so the calling application can pass it back to you next time, and you can pass it on to <code class="function">openAnything</code>, which can stick it in the <code>If-None-Match</code> header and send it to the remote server.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Save the <code>Last-Modified</code> date too.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the server says that it sent compressed data, decompress it.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.2.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If you got a URL back from the server, save it, and assume that the status code is <code>200</code> until you find out otherwise.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.2.7"><img src="images/callouts/7.png" alt="7" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If one of the custom URL handlers captured a status code, then save that too.</td>
</tr>
</table>
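<p>Because <code class="function">fetch</code> only asks whether the object it was handed <em>has</em> <code>headers</code>, <code>url</code>, and <code>status</code> attributes, the same logic can be tried against a stand-in object. The Python 3 sketch below assumes a made-up <code class="classname">FakeResponse</code> class and a helper named <code class="function">collect</code>; neither exists in <code class="filename">openanything.py</code>, and the plain dictionary used for headers is case-sensitive, unlike a real headers object.

```python
import gzip
import io


class FakeResponse(io.BytesIO):
    """Hypothetical stand-in for the object an opener returns."""

    def __init__(self, body, headers, url=None, status=None):
        super().__init__(body)
        self.headers = headers
        # Only set the optional attributes when asked, so hasattr() can fail.
        if url is not None:
            self.url = url
        if status is not None:
            self.status = status


def collect(f):
    """Gather data and metadata from a file-like object, as fetch does."""
    result = {'data': f.read()}
    if hasattr(f, 'headers'):
        result['etag'] = f.headers.get('ETag')
        result['lastmodified'] = f.headers.get('Last-Modified')
        if f.headers.get('content-encoding', '') == 'gzip':
            # data came back gzip-compressed, decompress it in memory
            result['data'] = gzip.GzipFile(
                fileobj=io.BytesIO(result['data'])).read()
    if hasattr(f, 'url'):
        result['url'] = f.url
        result['status'] = 200   # assumed until a handler says otherwise
    if hasattr(f, 'status'):
        result['status'] = f.status
    f.close()
    return result


response = FakeResponse(gzip.compress(b'<feed/>'),
                        {'ETag': '"abc"', 'content-encoding': 'gzip'},
                        url='http://example.org/feed.xml', status=301)
print(collect(response))
```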
<div class="example"><h3>Example 11.19. Using <code class="filename">openanything.py</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import openanything
<samp class="prompt">>>> </samp>useragent = 'MyHTTPWebServicesApp/1.0'
<samp class="prompt">>>> </samp>url = 'http://diveintopython3.org/redir/example301.xml'
<samp class="prompt">>>> </samp>params = openanything.fetch(url, agent=useragent) <img id="oa.alltogether.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>params <img id="oa.alltogether.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">{'url': 'http://diveintomark.org/xml/atom.xml',
'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT',
'etag': '"e842a-3e53-55d97640"',
'status': 301,
'data': '&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
&lt;!-- rest of data omitted for brevity -->'}</samp>
<samp class="prompt">>>> </samp>if params['status'] == 301:<img id="oa.alltogether.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">... </samp>    url = params['url']
<samp class="prompt">>>> </samp>newparams = openanything.fetch(
<samp class="prompt">... </samp>url, params['etag'], params['lastmodified'], useragent) <img id="oa.alltogether.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>newparams
<samp class="computeroutput">{'url': 'http://diveintomark.org/xml/atom.xml',
'lastmodified': None,
'etag': '"e842a-3e53-55d97640"',
'status': 304,
'data': ''}</samp> <img id="oa.alltogether.3.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The very first time you fetch a resource, you don't have an <code>ETag</code> hash or <code>Last-Modified</code> date, so you'll leave those out. (They're <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional parameters</a>.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">What you get back is a dictionary of several useful headers, the HTTP status code, and the actual data returned from the server.
<code class="filename">openanything</code> handles the gzip compression internally; you don't care about that at this level.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If you ever get a <code>301</code> status code, that's a permanent redirect, and you need to update your URL to the new address.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) URL, the
<code>ETag</code> from the last time, the <code>Last-Modified</code> date from the last time, and of course your <code>User-Agent</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.alltogether.3.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">What you get back is again a dictionary, but the data hasn't changed, so all you got was a <code>304</code> status code and no data.
</td>
</tr>
</table>
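<p>A <code>304</code> result is only useful if you still have the data from last time. One way a caller might fold the second result into the first is sketched below in Python 3; <code class="function">merge_result</code> is a hypothetical helper, and the <code>data</code> value is abbreviated just as in the example output.

```python
def merge_result(previous, result):
    """Combine an earlier fetch result with a newer one (hypothetical helper)."""
    merged = dict(previous)
    if result.get('status') != 304:
        # Anything other than "not modified" carries fresh data.
        merged['data'] = result['data']
    for key in ('etag', 'lastmodified'):
        # A 304 response may omit a validator; keep the old value in that case.
        if result.get(key):
            merged[key] = result[key]
    merged['status'] = result.get('status')
    return merged


params = {'url': 'http://diveintomark.org/xml/atom.xml',
          'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT',
          'etag': '"e842a-3e53-55d97640"',
          'status': 301,
          'data': '<?xml version="1.0" encoding="iso-8859-1"?>...'}
newparams = {'url': 'http://diveintomark.org/xml/atom.xml',
             'lastmodified': None,
             'etag': '"e842a-3e53-55d97640"',
             'status': 304,
             'data': ''}

current = merge_result(params, newparams)
print(current['status'], current['lastmodified'])
```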
<h2 id="oa.summary">11.10. Summary</h2>
<p>The <code class="filename">openanything.py</code> module and its functions should now make perfect sense.
<p>There are 5 important features of HTTP web services that every client should support:
<div class="itemizedlist">
<ul>
<li>Identifying your application <a href="#oa.useragent" title="11.5. Setting the User-Agent">by setting a proper <code>User-Agent</code></a>.
<li>Handling <a href="#oa.redirect" title="11.7. Handling redirects">permanent redirects properly</a>.
<li>Supporting <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag"><code>Last-Modified</code> date checking</a> to avoid re-downloading data that hasn't changed.
<li>Supporting <a href="#oa.etags.example" title="Example 11.9. Supporting ETag/If-None-Match"><code>ETag</code> hashes</a> to avoid re-downloading data that hasn't changed.
<li>Supporting <a href="#oa.gzip" title="11.8. Handling compressed data">gzip compression</a> to reduce bandwidth even when data <em>has</em> changed.
</ul>
<div class="chapter">
<h2 id="soap">Chapter 12. <acronym>SOAP</acronym> Web Services</h2>
<p><a href="#oa">Chapter 11</a> focused on document-oriented web services over HTTP. The &#8220;input parameter&#8221; was the <acronym>URL</acronym>, and the &#8220;return value&#8221; was an actual XML document which it was your responsibility to parse.
<p>This chapter will focus on <acronym>SOAP</acronym> web services, which take a more structured approach. Rather than dealing with HTTP requests and XML documents directly,
<acronym>SOAP</acronym> allows you to simulate calling functions that return native data types. As you will see, the illusion is almost perfect;
you can &#8220;call&#8221; a function through a <acronym>SOAP</acronym> library, with the standard Python calling syntax, and the function appears to return Python objects and values. But under the covers, the <acronym>SOAP</acronym> library has actually performed a complex transaction involving multiple XML documents and a remote server.
<p><acronym>SOAP</acronym> is a complex specification, and it is somewhat misleading to say that <acronym>SOAP</acronym> is all about calling remote functions. Some people would pipe up to add that <acronym>SOAP</acronym> allows for one-way asynchronous message passing, and document-oriented web services. And those people would be correct;
<acronym>SOAP</acronym> can be used that way, and in many different ways. But this chapter will focus on so-called &#8220;RPC-style&#8221; <acronym>SOAP</acronym> -- calling a remote function and getting results back.
<h2 id="soap.divein">12.1. Diving In</h2>
<p>You use Google, right? It's a popular search engine. Have you ever wished you could programmatically access Google search
results? Now you can. Here is a program to search Google from Python.
<div class="example"><h3>Example 12.1. <code class="filename">search.py</code></h3><pre class="programlisting">from SOAPpy import WSDL
# you'll need to configure these two values;
# see http://www.google.com/apis/
WSDLFILE = '/path/to/copy/of/GoogleSearch.wsdl'
APIKEY = 'YOUR_GOOGLE_API_KEY'
_server = WSDL.Proxy(WSDLFILE)
def search(q):
    """Search Google and return list of {title, link, description}"""
    results = _server.doGoogleSearch(
        APIKEY, q, 0, 10, False, "", False, "", "utf-8", "utf-8")
    return [{"title": r.title.encode("utf-8"),
             "link": r.URL.encode("utf-8"),
             "description": r.snippet.encode("utf-8")}
            for r in results.resultElements]

if __name__ == '__main__':
    import sys
    for r in search(sys.argv[1])[:5]:
        print r['title']
        print r['link']
        print r['description']
        print</pre><p>You can import this as a module and use it from a larger program, or you can run the script from the command line. On the
command line, you give the search query as a command-line argument, and it prints out the title, URL, and description of the
top five Google search results.
<p>Here is the sample output for a search for the word &#8220;python&#8221;.
<div class="example"><h3>Example 12.2. Sample Usage of <code class="filename">search.py</code></h3><pre class="screen">
<samp class="prompt">C:\diveintopython3\common\py></samp> python search.py "python"
<samp class="computeroutput">&lt;b>Python&lt;/b> Programming Language
http://www.python.org/
Home page for &lt;b>Python&lt;/b>, an interpreted, interactive, object-oriented,
extensible&lt;br> programming language. &lt;b>...&lt;/b> &lt;b>Python&lt;/b>
is OSI Certified Open Source: OSI Certified.
&lt;b>Python&lt;/b> Documentation Index
http://www.python.org/doc/
&lt;b>...&lt;/b> New-style classes (aka descrintro). Regular expressions. Database
API. Email Us.&lt;br> docs@&lt;b>python&lt;/b>.org. (c) 2004. &lt;b>Python&lt;/b>
Software Foundation. &lt;b>Python&lt;/b> Documentation. &lt;b>...&lt;/b>
Download &lt;b>Python&lt;/b> Software
http://www.python.org/download/
Download Standard &lt;b>Python&lt;/b> Software. &lt;b>Python&lt;/b> 2.3.3 is the
current production&lt;br> version of &lt;b>Python&lt;/b>. &lt;b>...&lt;/b>
&lt;b>Python&lt;/b> is OSI Certified Open Source:
Pythonline
http://www.pythonline.com/
Dive Into &lt;b>Python&lt;/b>
http://diveintopython3.org/
Dive Into &lt;b>Python&lt;/b>. &lt;b>Python&lt;/b> from novice to pro. Find:
&lt;b>...&lt;/b> It is also available in multiple&lt;br> languages. Read
Dive Into &lt;b>Python&lt;/b>. This book is still being written. &lt;b>...&lt;/b></samp>
</pre><div class="itemizedlist">
<h3>Further Reading on <acronym>SOAP</acronym></h3>
<ul>
<li><a href="http://www.xmethods.net/">http://www.xmethods.net/</a> is a repository of public access <acronym>SOAP</acronym> web services.
<li>The <a href="http://www.w3.org/TR/soap/"><acronym>SOAP</acronym> specification</a> is surprisingly readable, if you like that sort of thing.
</ul>
<h2 id="soap.install">12.2. Installing the SOAP Libraries</h2>
<p>Unlike the other code in this book, this chapter relies on libraries that do not come pre-installed with Python.
<p>Before you can dive into <acronym>SOAP</acronym> web services, you'll need to install three libraries: PyXML, fpconst, and SOAPpy.
<h3>12.2.1. Installing PyXML</h3>
<p>The first library you need is PyXML, an advanced set of <acronym>XML</acronym> libraries that provide more functionality than the built-in <acronym>XML</acronym> libraries we studied in <a href="#kgp">Chapter 9</a>.
<div class="procedure">
<h3>Procedure 12.1. </h3>
<p>Here is the procedure for installing PyXML:
<ol>
<li>
<p>Go to <a href="http://pyxml.sourceforge.net/">http://pyxml.sourceforge.net/</a>, click Downloads, and download the latest version for your operating system.
<li>
<p>If you are using Windows, there are several choices. Make sure to download the version of PyXML that matches the version of Python you are using.
<li>
<p>Double-click the installer. If you download PyXML 0.8.3 for Windows and Python 2.3, the installer program will be <code class="filename">PyXML-0.8.3.win32-py2.3.exe</code>.
<li>
<p>Step through the installer program.
<li>
<p>After the installation is complete, close the installer. There will not be any visible indication of success (no programs
installed on the Start Menu or shortcuts installed on the desktop). PyXML is simply a collection of <acronym>XML</acronym> libraries used by other programs.
</ol>
<p>To verify that you installed PyXML correctly, run your Python <acronym>IDE</acronym> and check the version of the <acronym>XML</acronym> libraries you have installed, as shown here.
<div class="example"><h3>Example 12.3. Verifying PyXML Installation</h3><pre class="screen">
<samp class="prompt">>>> </samp>import xml
<samp class="prompt">>>> </samp>xml.__version__
'0.8.3'
</pre><p>This version number should match the version number of the PyXML installer program you downloaded and ran.
<h3>12.2.2. Installing fpconst</h3>
<p>The second library you need is fpconst, a set of constants and functions for working with IEEE 754 double-precision special values. This provides support for the
special values Not-a-Number (NaN), Positive Infinity (Inf), and Negative Infinity (-Inf), which are part of the <acronym>SOAP</acronym> datatype specification.
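<p>To get a feel for how these special values behave, here is a small sketch that uses only the built-in <code>float</code> type and the <code>math</code> module of current Python versions. It does not use fpconst itself, and the variable names are just for illustration.

```python
import math

nan = float("nan")
pos_inf = float("inf")
neg_inf = float("-inf")

print(math.isnan(nan))        # True
print(nan == nan)             # False: NaN never compares equal, even to itself
print(math.isinf(pos_inf))    # True
print(pos_inf > 1e308)        # True: infinity exceeds any finite float
print(neg_inf == -pos_inf)    # True
```

<p>Because NaN does not compare equal to itself, a library that serializes floats needs an explicit test such as <code>math.isnan</code> rather than an equality check.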
<div class="procedure">
<h3>Procedure 12.2. </h3>
<p>Here is the procedure for installing fpconst:
<ol>
<li>
<p>Download the latest version of fpconst from <a href="http://www.analytics.washington.edu/statcomp/projects/rzope/fpconst/">http://www.analytics.washington.edu/statcomp/projects/rzope/fpconst/</a>.
<li>
<p>There are two downloads available, one in <code class="filename">.tar.gz</code> format, the other in <code class="filename">.zip</code> format. If you are using Windows, download the <code class="filename">.zip</code> file; otherwise, download the <code class="filename">.tar.gz</code> file.
<li>
<p>Decompress the downloaded file. On Windows XP, you can right-click on the file and choose Extract All; on earlier versions
of Windows, you will need a third-party program such as WinZip. On Mac OS X, you can double-click the compressed file to decompress it with StuffIt Expander.
<li>
<p>Open a command prompt and navigate to the directory where you decompressed the fpconst files.
<li>
<p>Type <kbd>python setup.py install</kbd> to run the installation program.
</ol>
<p>To verify that you installed fpconst correctly, run your Python <acronym>IDE</acronym> and check the version number.
<div class="example"><h3>Example 12.4. Verifying fpconst Installation</h3><pre class="screen">
<samp class="prompt">>>> </samp>import fpconst
<samp class="prompt">>>> </samp>fpconst.__version__
'0.6.0'
</pre><p>This version number should match the version number of the fpconst archive you downloaded and installed.
<h3>12.2.3. Installing SOAPpy</h3>
<p>The third and final requirement is the <acronym>SOAP</acronym> library itself: SOAPpy.
<div class="procedure">
<h3>Procedure 12.3. </h3>
<p>Here is the procedure for installing SOAPpy:
<ol>
<li>
<p>Go to <a href="http://pywebsvcs.sourceforge.net/">http://pywebsvcs.sourceforge.net/</a> and select Latest Official Release under the SOAPpy section.
<li>
<p>There are two downloads available. If you are using Windows, download the <code class="filename">.zip</code> file; otherwise, download the <code class="filename">.tar.gz</code> file.
<li>
<p>Decompress the downloaded file, just as you did with fpconst.
<li>
<p>Open a command prompt and navigate to the directory where you decompressed the SOAPpy files.
<li>
<p>Type <kbd>python setup.py install</kbd> to run the installation program.
</ol>
<p>To verify that you installed SOAPpy correctly, run your Python <acronym>IDE</acronym> and check the version number.
<div class="example"><h3>Example 12.5. Verifying SOAPpy Installation</h3><pre class="screen">
<samp class="prompt">>>> </samp>import SOAPpy
<samp class="prompt">>>> </samp>SOAPpy.__version__
'0.11.4'
</pre><p>This version number should match the version number of the SOAPpy archive you downloaded and installed.
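<p>If you would rather check all three libraries in one go, something like the following sketch could work. The helper name <code>installed_version</code> is made up for this example, and it assumes each library exposes <code>__version__</code> as shown in the examples above.

```python
import importlib

def installed_version(module_name):
    """Return the module's __version__, 'unknown' if it has none,
    or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")

for name in ("xml", "fpconst", "SOAPpy"):
    version = installed_version(name)
    print(name, "->", version if version is not None else "not installed")
```

<p>A result of <code>not installed</code> for fpconst or SOAPpy means the corresponding <kbd>python setup.py install</kbd> step needs to be repeated.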
<h2 id="soap.firststeps">12.3. First Steps with <acronym>SOAP</acronym></h2>
<p>The heart of <acronym>SOAP</acronym> is the ability to call remote functions. There are a number of public access <acronym>SOAP</acronym> servers that provide simple functions for demonstration purposes.
<p>The most popular public access <acronym>SOAP</acronym> server is <a href="http://www.xmethods.net/">http://www.xmethods.net/</a>. This example uses a demonstration function that takes a United States zip code and returns the current temperature in that
region.
<div class="example"><h3>Example 12.6. Getting the Current Temperature</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import SOAPProxy <img id="soap.firststeps.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>url = 'http://services.xmethods.net:80/soap/servlet/rpcrouter'
<samp class="prompt">>>> </samp>namespace = 'urn:xmethods-Temperature' <img id="soap.firststeps.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server = SOAPProxy(url, namespace) <img id="soap.firststeps.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.getTemp('27502') <img id="soap.firststeps.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
80.0
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.firststeps.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You access the remote <acronym>SOAP</acronym> server through a proxy class, <code class="classname">SOAPProxy</code>. The proxy handles all the internals of <acronym>SOAP</acronym> for you, including creating the XML request document out of the function name and argument list, sending the request over
HTTP to the remote <acronym>SOAP</acronym> server, parsing the XML response document, and creating native Python values to return. You'll see what these XML documents look like in the next section.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.firststeps.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Every <acronym>SOAP</acronym> service has a <acronym>URL</acronym> which handles all the requests. The same <acronym>URL</acronym> is used for all function calls. This particular service only has a single function, but later in this chapter you'll see
examples of the Google <acronym>API</acronym>, which has several functions. The service <acronym>URL</acronym> is shared by all functions. Each <acronym>SOAP</acronym> service also has a namespace, which is defined by the server and is completely arbitrary. It's simply part of the configuration
required to call <acronym>SOAP</acronym> methods. It allows the server to share a single service <acronym>URL</acronym> and route requests between several unrelated services. It's like dividing Python modules into <a href="#kgp.packages" title="9.2. Packages">packages</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.firststeps.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You're creating the <code class="classname">SOAPProxy</code> with the service <acronym>URL</acronym> and the service namespace. This doesn't make any connection to the <acronym>SOAP</acronym> server; it simply creates a local Python object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.firststeps.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now with everything configured properly, you can actually call remote <acronym>SOAP</acronym> methods as if they were local functions. You pass arguments just like a normal function, and you get a return value just
like a normal function. But under the covers, there's a heck of a lot going on.
</td>
</tr>
</table>
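<p>The idea of a proxy that accepts any method name can be sketched in a few lines. This is a toy class, not SOAPpy's implementation: <code class="classname">RecordingProxy</code> is a hypothetical name, and instead of contacting a server it simply returns a description of the call it would have made.

```python
class RecordingProxy:
    """Toy stand-in for a SOAP proxy: it records calls instead of sending them."""

    def __init__(self, url, namespace):
        # Creating the proxy only stores configuration; nothing is contacted.
        self.url = url
        self.namespace = namespace

    def __getattr__(self, method_name):
        # Python calls this only for attributes that do not exist, such as 'getTemp'.
        def remote_call(*args):
            return {"method": method_name,
                    "namespace": self.namespace,
                    "args": args}
        return remote_call

server = RecordingProxy("http://services.xmethods.net:80/soap/servlet/rpcrouter",
                        "urn:xmethods-Temperature")
print(server.getTemp("27502"))
```

<p>The class never defines <code class="function">getTemp</code>; the method name arrives as a string in <code class="function">__getattr__</code>, which is all a real proxy needs in order to build a request.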
<p>Let's peek under those covers.
<h2 id="soap.debug">12.4. Debugging <acronym>SOAP</acronym> Web Services</h2>
<p>The <acronym>SOAP</acronym> libraries provide an easy way to see what's going on behind the scenes.
<p>Turning on debugging is a simple matter of setting two flags in the <code class="classname">SOAPProxy</code>'s configuration.
<div class="example"><h3>Example 12.7. Debugging <acronym>SOAP</acronym> Web Services</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import SOAPProxy
<samp class="prompt">>>> </samp>url = 'http://services.xmethods.net:80/soap/servlet/rpcrouter'
<samp class="prompt">>>> </samp>n = 'urn:xmethods-Temperature'
<samp class="prompt">>>> </samp>server = SOAPProxy(url, namespace=n) <img id="soap.debug.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.config.dumpSOAPOut = 1 <img id="soap.debug.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.config.dumpSOAPIn = 1
<samp class="prompt">>>> </samp>temperature = server.getTemp('27502') <img id="soap.debug.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">*** Outgoing SOAP ******************************************************
&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/1999/XMLSchema">
&lt;SOAP-ENV:Body>
&lt;ns1:getTemp xmlns:ns1="urn:xmethods-Temperature" SOAP-ENC:root="1">
&lt;v1 xsi:type="xsd:string">27502&lt;/v1>
&lt;/ns1:getTemp>
&lt;/SOAP-ENV:Body>
&lt;/SOAP-ENV:Envelope>
************************************************************************
*** Incoming SOAP ******************************************************
&lt;?xml version='1.0' encoding='UTF-8'?>
&lt;SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
&lt;SOAP-ENV:Body>
&lt;ns1:getTempResponse xmlns:ns1="urn:xmethods-Temperature"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
&lt;return xsi:type="xsd:float">80.0&lt;/return>
&lt;/ns1:getTempResponse>
&lt;/SOAP-ENV:Body>
&lt;/SOAP-ENV:Envelope>
************************************************************************
</samp>
<samp class="prompt">>>> </samp>temperature
80.0
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">First, create the <code class="classname">SOAPProxy</code> like normal, with the service <acronym>URL</acronym> and the namespace.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Second, turn on debugging by setting <code class="varname">server.config.dumpSOAPIn</code> and <code class="varname">server.config.dumpSOAPOut</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Third, call the remote <acronym>SOAP</acronym> method as usual. The <acronym>SOAP</acronym> library will print out both the outgoing XML request document, and the incoming XML response document. This is all the hard
work that <code class="classname">SOAPProxy</code> is doing for you. Intimidating, isn't it? Let's break it down.
</td>
</tr>
</table>
<p>Most of the XML request document that gets sent to the server is just boilerplate. Ignore all the namespace declarations;
they're going to be the same (or similar) for all <acronym>SOAP</acronym> calls. The heart of the &#8220;function call&#8221; is this fragment within the <code class="sgmltag-element">&lt;Body></code> element:
<div class="informalexample"><pre class="programlisting">
&lt;ns1:getTemp <img id="soap.debug.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
xmlns:ns1="urn:xmethods-Temperature" <img id="soap.debug.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
SOAP-ENC:root="1">
&lt;v1 xsi:type="xsd:string">27502&lt;/v1> <img id="soap.debug.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;/ns1:getTemp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The element name is the function name, <code class="function">getTemp</code>. <code class="classname">SOAPProxy</code> uses <a href="#kgp.handler" title="10.5. Creating separate handlers by node type"><code class="function">getattr</code> as a dispatcher</a>. Instead of calling separate local methods based on the method name, it actually uses the method name to construct the XML
request document.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The function's XML element is contained in a specific namespace, which is the namespace you specified when you created the
<code class="classname">SOAPProxy</code> object. Don't worry about the <code>SOAP-ENC:root</code>; that's boilerplate too.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The arguments of the function also got translated into XML. <code class="classname">SOAPProxy</code> introspects each argument to determine its datatype (in this case it's a string). The argument datatype goes into the <code>xsi:type</code> attribute, followed by the actual string value.
</td>
</tr>
</table>
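<p>To make the translation concrete, here is a simplified sketch of building such a request with the standard library's <code>xml.etree.ElementTree</code> (Python 3). It is not how SOAPpy does it internally; the function name <code class="function">build_request</code> and the tiny type table are assumptions for illustration, and the generated namespace prefixes will differ from the ones in the trace above.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
XSI = "http://www.w3.org/1999/XMLSchema-instance"
XSD = "http://www.w3.org/1999/XMLSchema"

# Assumed mapping from Python types to XML Schema type names (deliberately tiny).
XSD_TYPES = {str: "xsd:string", int: "xsd:int", float: "xsd:float"}

def build_request(namespace, method_name, *args):
    """Return a SOAP-style request document as a string."""
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    envelope.set("xmlns:xsd", XSD)  # declare the prefix used in the type names
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    # The element name is the function name, placed in the service namespace.
    call = ET.SubElement(body, "{%s}%s" % (namespace, method_name))
    for position, value in enumerate(args, 1):
        argument = ET.SubElement(call, "v%d" % position)
        argument.set("{%s}type" % XSI, XSD_TYPES[type(value)])
        argument.text = str(value)
    return ET.tostring(envelope, encoding="unicode")

print(build_request("urn:xmethods-Temperature", "getTemp", "27502"))
```

<p>The three pieces called out above are all visible in the function: the method name becomes the element name, the namespace qualifies it, and each argument is introspected to choose its <code>xsi:type</code>.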
<p>The XML return document is equally easy to understand, once you know what to ignore. Focus on this fragment within the <code class="sgmltag-element">&lt;Body></code>:
<div class="informalexample"><pre class="programlisting">
&lt;ns1:getTempResponse <img id="soap.debug.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
xmlns:ns1="urn:xmethods-Temperature" <img id="soap.debug.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
&lt;return xsi:type="xsd:float">80.0&lt;/return> <img id="soap.debug.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;/ns1:getTempResponse>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server wraps the function return value within a <code class="sgmltag-element">&lt;getTempResponse></code> element. By convention, this wrapper element is the name of the function, plus <code>Response</code>. But it could really be almost anything; the important thing that <code class="classname">SOAPProxy</code> notices is not the element name, but the namespace.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server returns the response in the same namespace we used in the request, the same namespace we specified when we first
created the <code class="classname">SOAPProxy</code>. Later in this chapter we'll see what happens if you forget to specify the namespace when creating the <code class="classname">SOAPProxy</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.debug.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The return value is specified, along with its datatype (it's a float). <code class="classname">SOAPProxy</code> uses this explicit datatype to create a Python object of the correct native datatype and return it.
</td>
</tr>
</table>
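<p>Going the other way, the sketch below pulls the return value out of the response shown above and converts it to a native Python value based on its <code>xsi:type</code>. Again this is only an illustration using <code>xml.etree.ElementTree</code> (Python 3), not SOAPpy's code; <code class="function">parse_response</code> and the converter table are assumed names, and the XML declaration is omitted from the sample for brevity.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
SOAP_BODY = "{http://schemas.xmlsoap.org/soap/envelope/}Body"

# Assumed converters for a few XML Schema types; a real library handles many more.
CONVERTERS = {"string": str, "int": int, "float": float}

RESPONSE = """<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<SOAP-ENV:Body>
<ns1:getTempResponse xmlns:ns1="urn:xmethods-Temperature"
  SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<return xsi:type="xsd:float">80.0</return>
</ns1:getTempResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

def parse_response(document):
    """Return the first return value in the response as a native Python object."""
    body = ET.fromstring(document).find(SOAP_BODY)
    wrapper = body[0]   # e.g. getTempResponse; the element name is not checked
    value = wrapper[0]  # the 'return' element
    type_name = value.get(XSI_TYPE).split(":")[-1]
    return CONVERTERS[type_name](value.text)

print(parse_response(RESPONSE))  # 80.0
```

<p>The wrapper element is taken by position rather than by name, which mirrors the point that the element name matters less than where it sits and which namespace it is in.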
<h2 id="soap.wsdl">12.5. Introducing <acronym>WSDL</acronym></h2>
<p>The <code class="classname">SOAPProxy</code> class proxies local method calls and transparently turns them into invocations of remote <acronym>SOAP</acronym> methods. As you've seen, this is a lot of work, and <code class="classname">SOAPProxy</code> does it quickly and transparently. What it doesn't do is provide any means of method introspection.
<p>Consider this: the previous two sections showed an example of calling a simple remote <acronym>SOAP</acronym> method with one argument and one return value, both of simple data types. This required knowing, and keeping track of, the
service <acronym>URL</acronym>, the service namespace, the function name, the number of arguments, and the datatype of each argument. If any of these is
missing or wrong, the whole thing falls apart.
<p>That shouldn't come as a big surprise. If I wanted to call a local function, I would need to know what package or module
it was in (the equivalent of service <acronym>URL</acronym> and namespace). I would need to know the correct function name and the correct number of arguments. Python deftly handles datatyping without explicit types, but I would still need to know how many arguments to pass, and how many
return values to expect.
<p>The big difference is introspection. As you saw in <a href="#apihelper">Chapter 4</a>, Python excels at letting you discover things about modules and functions at runtime. You can list the available functions within
a module, and with a little work, drill down to individual function declarations and arguments.
<p><acronym>WSDL</acronym> lets you do that with <acronym>SOAP</acronym> web services. <acronym>WSDL</acronym> stands for &#8220;Web Services Description Language&#8221;. Although designed to be flexible enough to describe many types of web services, it is most often used to describe <acronym>SOAP</acronym> web services.
<p>A <acronym>WSDL</acronym> file is just that: a file. More specifically, it's an XML file. It usually lives on the same server you use to access the
<acronym>SOAP</acronym> web services it describes, although there's nothing special about it. Later in this chapter, we'll download the <acronym>WSDL</acronym> file for the Google API and use it locally. That doesn't mean we're calling Google locally; the <acronym>WSDL</acronym> file still describes the remote functions sitting on Google's server.
<p>A <acronym>WSDL</acronym> file contains a description of everything involved in calling a <acronym>SOAP</acronym> web service:
<div class="itemizedlist">
<ul>
<li>The service <acronym>URL</acronym> and namespace
<li>The type of web service (probably function calls using <acronym>SOAP</acronym>, although as I mentioned, <acronym>WSDL</acronym> is flexible enough to describe a wide variety of web services)
<li>The list of available functions
<li>The arguments for each function
<li>The datatype of each argument
<li>The return values of each function, and the datatype of each return value
</ul>
<p>In other words, a <acronym>WSDL</acronym> file tells you everything you need to know to be able to call a <acronym>SOAP</acronym> web service.
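<p>Since a <acronym>WSDL</acronym> file is plain XML, you can get a sense of what it contains with nothing more than an XML parser. The sketch below reads a hand-written, heavily trimmed <acronym>WSDL</acronym> document (it is not the real file for any service, and it leaves out the binding and service sections that carry the service <acronym>URL</acronym>) and lists each function with its arguments and return values. The function name <code class="function">describe</code> and the <code>urn:example-temperature</code> namespace are made up for the example.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hand-written, heavily trimmed WSDL used only for illustration.
SAMPLE_WSDL = """<definitions name="TemperatureService"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="urn:example-temperature"
    targetNamespace="urn:example-temperature">
  <message name="getTempRequest">
    <part name="zipcode" type="xsd:string"/>
  </message>
  <message name="getTempResponse">
    <part name="return" type="xsd:float"/>
  </message>
  <portType name="TemperaturePortType">
    <operation name="getTemp">
      <input message="tns:getTempRequest"/>
      <output message="tns:getTempResponse"/>
    </operation>
  </portType>
</definitions>"""

def describe(wsdl_text):
    """Map each operation name to its input and output (name, type) pairs."""
    root = ET.fromstring(wsdl_text)
    messages = {}
    for message in root.findall("wsdl:message", WSDL_NS):
        parts = [(part.get("name"), part.get("type"))
                 for part in message.findall("wsdl:part", WSDL_NS)]
        messages[message.get("name")] = parts
    operations = {}
    for operation in root.findall("wsdl:portType/wsdl:operation", WSDL_NS):
        # Message references look like 'tns:getTempRequest'; drop the prefix.
        inputs = operation.find("wsdl:input", WSDL_NS).get("message").split(":")[-1]
        outputs = operation.find("wsdl:output", WSDL_NS).get("message").split(":")[-1]
        operations[operation.get("name")] = {"in": messages[inputs],
                                             "out": messages[outputs]}
    return operations

print(describe(SAMPLE_WSDL))
```

<p>A real <acronym>WSDL</acronym> parser has far more to handle (complex types, multiple bindings, imports), but the list of functions, arguments, and datatypes is there in the XML for anyone who looks.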
<h2 id="soap.introspection">12.6. Introspecting <acronym>SOAP</acronym> Web Services with <acronym>WSDL</acronym></h2>
<p>Like many things in the web services arena, <acronym>WSDL</acronym> has a long and checkered history, full of political strife and intrigue. I will skip over this history entirely, since it
bores me to tears. There were other standards that tried to do similar things, but <acronym>WSDL</acronym> won, so let's learn how to use it.
<p>The most fundamental thing that <acronym>WSDL</acronym> allows you to do is discover the available methods offered by a <acronym>SOAP</acronym> server.
<div class="example"><h3>Example 12.8. Discovering The Available Methods</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import WSDL <img id="soap.introspection.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl'
<samp class="prompt">>>> </samp>server = WSDL.Proxy(wsdlFile) <img id="soap.introspection.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.methods.keys() <img id="soap.introspection.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
[u'getTemp']
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">SOAPpy includes a <acronym>WSDL</acronym> parser. At the time of this writing, it was labeled as being in the early stages of development, but I had no problem parsing
any of the <acronym>WSDL</acronym> files I tried.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To use a <acronym>WSDL</acronym> file, you again use a proxy class, <code class="classname">WSDL.Proxy</code>, which takes a single argument: the <acronym>WSDL</acronym> file. Note that in this case you are passing in the <acronym>URL</acronym> of a <acronym>WSDL</acronym> file stored on the remote server, but the proxy class works just as well with a local copy of the <acronym>WSDL</acronym> file. The act of creating the <acronym>WSDL</acronym> proxy will download the <acronym>WSDL</acronym> file and parse it, so if there are any errors in the <acronym>WSDL</acronym> file (or it can't be fetched due to networking problems), you'll know about it immediately.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <acronym>WSDL</acronym> proxy class exposes the available functions as a Python dictionary, <code class="varname">server.methods</code>. So getting the list of available methods is as simple as calling the dictionary method <code class="methodname">keys()</code>.
</td>
</tr>
</table>
<p>Okay, so you know that this <acronym>SOAP</acronym> server offers a single method: <code class="methodname">getTemp</code>. But how do you call it? The <acronym>WSDL</acronym> proxy object can tell you that too.
<div class="example"><h3>Example 12.9. Discovering A Method's Arguments</h3><pre class="screen">
<samp class="prompt">>>> </samp>callInfo = server.methods['getTemp'] <img id="soap.introspection.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>callInfo.inparams <img id="soap.introspection.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
[&lt;SOAPpy.wstools.WSDLTools.ParameterInfo instance at 0x00CF3AD0>]
<samp class="prompt">>>> </samp>callInfo.inparams[0].name <img id="soap.introspection.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
u'zipcode'
<samp class="prompt">>>> </samp>callInfo.inparams[0].type <img id="soap.introspection.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
(u'http://www.w3.org/2001/XMLSchema', u'string')
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="varname">server.methods</code> dictionary is filled with a SOAPpy-specific structure called <code class="classname">CallInfo</code>. A <code class="classname">CallInfo</code> object contains information about one specific function, including the function arguments.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The function arguments are stored in <code class="varname">callInfo.inparams</code>, which is a Python list of <code class="classname">ParameterInfo</code> objects that hold information about each parameter.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each <code class="classname">ParameterInfo</code> object contains a <code class="varname">name</code> attribute, which is the argument name. You are not required to know the argument name to call the function through <acronym>SOAP</acronym>, but <acronym>SOAP</acronym> does support calling functions with named arguments (just like Python), and <code class="classname">WSDL.Proxy</code> will correctly handle mapping named arguments to the remote function if you choose to use them.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each parameter is also explicitly typed, using datatypes defined in XML Schema. You saw this in the wire trace in the previous
section; the XML Schema namespace was part of the &#8220;boilerplate&#8221; I told you to ignore. For our purposes here, you may continue to ignore it. The <code class="varname">zipcode</code> parameter is a string, and if you pass in a Python string to the <code class="classname">WSDL.Proxy</code> object, it will map it correctly and send it to the server.
</td>
</tr>
</table>
<p><acronym>WSDL</acronym> also lets you introspect into a function's return values.
<div class="example"><h3>Example 12.10. Discovering A Method's Return Values</h3><pre class="screen">
<samp class="prompt">>>> </samp>callInfo.outparams <img id="soap.introspection.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
[&lt;SOAPpy.wstools.WSDLTools.ParameterInfo instance at 0x00CF3AF8>]
<samp class="prompt">>>> </samp>callInfo.outparams[0].name <img id="soap.introspection.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
u'return'
<samp class="prompt">>>> </samp>callInfo.outparams[0].type
(u'http://www.w3.org/2001/XMLSchema', u'float')
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The counterpart to <code class="varname">callInfo.inparams</code> for function arguments is <code class="varname">callInfo.outparams</code> for return values. It is also a list, because functions called through <acronym>SOAP</acronym> can return multiple values, just like Python functions.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each <code class="classname">ParameterInfo</code> object contains <code class="varname">name</code> and <code class="varname">type</code>. This function returns a single value, named <code class="varname">return</code>, which is a float.
</td>
</tr>
</table>
<p>Let's put it all together, and call a <acronym>SOAP</acronym> web service through a <acronym>WSDL</acronym> proxy.
<div class="example"><h3>Example 12.11. Calling A Web Service Through A <acronym>WSDL</acronym> Proxy</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import WSDL
<samp class="prompt">>>> </samp>wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl'
<samp class="prompt">>>> </samp>server = WSDL.Proxy(wsdlFile) <img id="soap.introspection.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.getTemp('90210') <img id="soap.introspection.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
66.0
<samp class="prompt">>>> </samp>server.soapproxy.config.dumpSOAPOut = 1 <img id="soap.introspection.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.soapproxy.config.dumpSOAPIn = 1
<samp class="prompt">>>> </samp>temperature = server.getTemp('90210')
<samp class="computeroutput">*** Outgoing SOAP ******************************************************
&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/1999/XMLSchema">
&lt;SOAP-ENV:Body>
&lt;ns1:getTemp xmlns:ns1="urn:xmethods-Temperature" SOAP-ENC:root="1">
&lt;v1 xsi:type="xsd:string">90210&lt;/v1>
&lt;/ns1:getTemp>
&lt;/SOAP-ENV:Body>
&lt;/SOAP-ENV:Envelope>
************************************************************************
*** Incoming SOAP ******************************************************
&lt;?xml version='1.0' encoding='UTF-8'?>
&lt;SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
&lt;SOAP-ENV:Body>
&lt;ns1:getTempResponse xmlns:ns1="urn:xmethods-Temperature"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
&lt;return xsi:type="xsd:float">66.0&lt;/return>
&lt;/ns1:getTempResponse>
&lt;/SOAP-ENV:Body>
&lt;/SOAP-ENV:Envelope>
************************************************************************
</samp>
<samp class="prompt">>>> </samp>temperature
66.0
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The configuration is simpler than calling the <acronym>SOAP</acronym> service directly, since the <acronym>WSDL</acronym> file contains both the service <acronym>URL</acronym> and the namespace you need to call the service. Creating the <code class="classname">WSDL.Proxy</code> object downloads the <acronym>WSDL</acronym> file, parses it, and configures a <code class="classname">SOAPProxy</code> object that it uses to call the actual <acronym>SOAP</acronym> web service.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Once the <code class="classname">WSDL.Proxy</code> object is created, you can call a function as easily as you did with the <code class="classname">SOAPProxy</code> object. This is not surprising; the <code class="classname">WSDL.Proxy</code> is just a wrapper around the <code class="classname">SOAPProxy</code> with some introspection methods added, so the syntax for calling functions is the same.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.introspection.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can access the <code class="classname">WSDL.Proxy</code>'s <code class="classname">SOAPProxy</code> with <code class="varname">server.soapproxy</code>. This is useful for turning on debugging, so that when you call functions through the <acronym>WSDL</acronym> proxy, its <code class="classname">SOAPProxy</code> will dump the outgoing and incoming XML documents that are going over the wire.
</td>
</tr>
</table>
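<p>The incoming document in the wire trace is ordinary XML, so you can pick it apart yourself if you ever need to check what a proxy is doing on your behalf. Here is a minimal sketch that uses only the standard library's <code>xml.etree.ElementTree</code> module instead of SOAPpy; the XML is copied from the trace above, and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The incoming SOAP document, copied from the wire trace above.
response = """<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<SOAP-ENV:Body>
<ns1:getTempResponse xmlns:ns1="urn:xmethods-Temperature"
 SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<return xsi:type="xsd:float">66.0</return>
</ns1:getTempResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

# ElementTree spells a namespaced attribute as {namespace-uri}localname.
XSI_TYPE = '{http://www.w3.org/2001/XMLSchema-instance}type'

# Parse from bytes so the encoding declaration is honored.
envelope = ET.fromstring(response.encode('utf-8'))

# The <return> element is not in any namespace, so a plain tag name finds it.
element = envelope.find('.//return')
declared_type = element.get(XSI_TYPE)   # the type the server declared
temperature = float(element.text)       # the value, converted by hand
```

<p>This is roughly the work the proxy does for you: it finds the return element, reads its declared type, and hands back a native Python value.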
<h2 id="soap.google">12.7. Searching Google</h2>
<p>Let's finally turn to the sample code that you saw at the beginning of this chapter, which does something more useful and
exciting than get the current temperature.
<p>Google provides a <acronym>SOAP</acronym> <acronym>API</acronym> for programmatically accessing Google search results. To use it, you will need to sign up for Google Web Services.
<div class="procedure">
<h3>Procedure 12.4. Signing Up for Google Web Services</h3>
<ol>
<li>
<p>Go to <a href="http://www.google.com/apis/">http://www.google.com/apis/</a> and create a Google account. This requires only an email address. After you sign up you will receive your Google API license
key by email. You will need this key to pass as a parameter whenever you call Google's search functions.
<li>
<p>Also on <a href="http://www.google.com/apis/">http://www.google.com/apis/</a>, download the Google Web APIs developer kit. This includes some sample code in several programming languages (but not Python), and more importantly, it includes the <acronym>WSDL</acronym> file.
<li>
<p>Decompress the developer kit file and find <code class="filename">GoogleSearch.wsdl</code>. Copy this file to some permanent location on your local drive. You will need it later in this chapter.
</ol>
<p>Once you have your developer key and your Google <acronym>WSDL</acronym> file in a known place, you can start poking around with Google Web Services.
<div class="example"><h3>Example 12.12. Introspecting Google Web Services</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import WSDL
<samp class="prompt">>>> </samp>server = WSDL.Proxy('/path/to/your/GoogleSearch.wsdl') <img id="soap.google.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.methods.keys() <img id="soap.google.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
[u'doGoogleSearch', u'doGetCachedPage', u'doSpellingSuggestion']
<samp class="prompt">>>> </samp>callInfo = server.methods['doGoogleSearch']
<samp class="prompt">>>> </samp>for arg in callInfo.inparams: <img id="soap.google.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">... </samp>    print arg.name.ljust(15), arg.type
<samp class="computeroutput">key             (u'http://www.w3.org/2001/XMLSchema', u'string')
q               (u'http://www.w3.org/2001/XMLSchema', u'string')
start           (u'http://www.w3.org/2001/XMLSchema', u'int')
maxResults      (u'http://www.w3.org/2001/XMLSchema', u'int')
filter          (u'http://www.w3.org/2001/XMLSchema', u'boolean')
restrict        (u'http://www.w3.org/2001/XMLSchema', u'string')
safeSearch      (u'http://www.w3.org/2001/XMLSchema', u'boolean')
lr              (u'http://www.w3.org/2001/XMLSchema', u'string')
ie              (u'http://www.w3.org/2001/XMLSchema', u'string')
oe              (u'http://www.w3.org/2001/XMLSchema', u'string')</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Getting started with Google web services is easy: just create a <code class="classname">WSDL.Proxy</code> object and point it at your local copy of Google's <acronym>WSDL</acronym> file.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">According to the <acronym>WSDL</acronym> file, Google offers three functions: <code class="function">doGoogleSearch</code>, <code class="function">doGetCachedPage</code>, and <code class="function">doSpellingSuggestion</code>. These do exactly what they sound like: perform a Google search and return the results programmatically, get access to the
cached version of a page from the last time Google saw it, and offer spelling suggestions for commonly misspelled search words.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">doGoogleSearch</code> function takes a number of parameters of various types. Note that while the <acronym>WSDL</acronym> file can tell you what the arguments are called and what datatype they are, it can't tell you what they mean or how to use
them. It could theoretically tell you the acceptable range of values for each parameter, if only specific values were allowed,
but Google's <acronym>WSDL</acronym> file is not that detailed. <code class="classname">WSDL.Proxy</code> can't work magic; it can only give you the information provided in the <acronym>WSDL</acronym> file.
</td>
</tr>
</table>
<p>Here is a brief synopsis of all the parameters to the <code class="function">doGoogleSearch</code> function:
<div class="itemizedlist">
<ul>
<li><code class="varname">key</code> - Your Google API key, which you received when you signed up for Google web services.
<li><code class="varname">q</code> - The search word or phrase you're looking for. The syntax is exactly the same as Google's web form, so if you know any
advanced search syntax or tricks, they all work here as well.
<li><code class="varname">start</code> - The index of the result to start on. Like the interactive web version of Google, this function returns 10 results at a
time. If you wanted to get the second &#8220;page&#8221; of results, you would set <code class="varname">start</code> to 10.
<li><code class="varname">maxResults</code> - The number of results to return. Currently capped at 10, although you can specify fewer if you are only interested in
a few results and want to save a little bandwidth.
<li><code class="varname">filter</code> - If <code class="constant">True</code>, Google will filter out duplicate pages from the results.
<li><code class="varname">restrict</code> - Set this to <code>country</code> plus a country code to get results only from a particular country. Example: <code>countryUK</code> to search pages in the United Kingdom. You can also specify <code>linux</code>, <code>mac</code>, or <code>bsd</code> to search a Google-defined set of technical sites, or <code>unclesam</code> to search sites about the United States government.
<li><code class="varname">safeSearch</code> - If <code class="constant">True</code>, Google will filter out porn sites.
<li><code class="varname">lr</code> (&#8220;language restrict&#8221;) - Set this to a language code to get results only in a particular language.
<li><code class="varname">ie</code> and <code class="varname">oe</code> (&#8220;input encoding&#8221; and &#8220;output encoding&#8221;) - Deprecated, both must be <code>utf-8</code>.
</ul>
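<p>Ten positional arguments are easy to get out of order. One way to keep them straight is a small helper that builds the argument list in the order the <acronym>WSDL</acronym> file gives. This is only a sketch: <code>search_arguments</code>, <code>page</code>, and <code>page_size</code> are hypothetical names, and the defaults simply mirror the values described in the list above.

```python
def search_arguments(key, query, page=0, page_size=10):
    """Build the ten doGoogleSearch arguments, in WSDL order, for one page."""
    return [
        key,               # key: your Google API key
        query,             # q: the search word or phrase
        page * page_size,  # start: index of the first result on this page
        page_size,         # maxResults: capped at 10 by Google
        False,             # filter: do not filter duplicate pages
        "",                # restrict: no country or topic restriction
        False,             # safeSearch: off
        "",                # lr: no language restriction
        "utf-8",           # ie: input encoding
        "utf-8",           # oe: output encoding
    ]

first_page = search_arguments('YOUR_GOOGLE_API_KEY', 'mark')
second_page = search_arguments('YOUR_GOOGLE_API_KEY', 'mark', page=1)
```

<p>With a <code class="classname">WSDL.Proxy</code> object named <code class="varname">server</code>, you would pass the list along with <code>server.doGoogleSearch(*second_page)</code>.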
<div class="example"><h3>Example 12.13. Searching Google</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import WSDL
<samp class="prompt">>>> </samp>server = WSDL.Proxy('/path/to/your/GoogleSearch.wsdl')
<samp class="prompt">>>> </samp>key = 'YOUR_GOOGLE_API_KEY'
<samp class="prompt">>>> </samp>results = server.doGoogleSearch(key, 'mark', 0, 10, False, "",
<samp class="prompt">... </samp>False, "", "utf-8", "utf-8") <img id="soap.google.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>len(results.resultElements) <img id="soap.google.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
10
<samp class="prompt">>>> </samp>results.resultElements[0].URL <img id="soap.google.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'http://diveintomark.org/'
<samp class="prompt">>>> </samp>results.resultElements[0].title
'dive into &lt;b>mark&lt;/b>'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">After setting up the <code class="classname">WSDL.Proxy</code> object, you can call <code class="function">server.doGoogleSearch</code> with all ten parameters. Remember to use your own Google API key that you received when you signed up for Google web services.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">There's a lot of information returned, but let's look at the actual search results first. They're stored in <code class="varname">results.resultElements</code>, and you can access them just like a normal Python list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each element in the <code class="varname">resultElements</code> list is an object that has a <code class="varname">URL</code>, <code class="varname">title</code>, <code class="varname">snippet</code>, and other useful attributes. At this point you can use normal Python introspection techniques like <kbd>dir(results.resultElements[0])</kbd> to see the available attributes. Or you can introspect through the <acronym>WSDL</acronym> proxy object and look through the function's <code class="varname">outparams</code>. Each technique will give you the same information.
</td>
</tr>
</table>
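<p>Notice that the title comes back with Google's bold markup still in it. If you want plain text, you have to strip the tags yourself. The sketch below uses a regular expression as a deliberately crude stand-in for a real HTML parser; <code>strip_tags</code> is a hypothetical helper, and the title string is the one returned above.

```python
import re

def strip_tags(text):
    """Remove anything that looks like an HTML tag (enough for <b>...</b>)."""
    return re.sub(r'<[^>]+>', '', text)

title = 'dive into <b>mark</b>'   # results.resultElements[0].title from above
plain_title = strip_tags(title)
```

<p>For anything more complicated than highlighting tags, a proper parser is the safer choice.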
<p>The <code class="varname">results</code> object contains more than the actual search results. It also contains information about the search itself, such as how long
it took and how many results were found (even though only 10 were returned). The Google web interface shows this information,
and you can access it programmatically too.
<div class="example"><h3>Example 12.14. Accessing Secondary Information From Google</h3><pre class="screen">
<samp class="prompt">>>> </samp>results.searchTime <img id="soap.google.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
0.224919
<samp class="prompt">>>> </samp>results.estimatedTotalResultsCount <img id="soap.google.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
29800000
<samp class="prompt">>>> </samp>results.directoryCategories <img id="soap.google.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">[&lt;SOAPpy.Types.structType item at 14367400>:
{'fullViewableName':
'Top/Arts/Literature/World_Literature/American/19th_Century/Twain,_Mark',
'specialEncoding': ''}]</samp>
<samp class="prompt">>>> </samp>results.directoryCategories[0].fullViewableName
'Top/Arts/Literature/World_Literature/American/19th_Century/Twain,_Mark'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This search took 0.224919 seconds. That does not include the time spent sending and receiving the actual <acronym>SOAP</acronym> XML documents. It's just the time that Google spent processing your request once it received it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In total, there were approximately 30 million results. You can access them 10 at a time by changing the <code class="varname">start</code> parameter and calling <code class="function">server.doGoogleSearch</code> again.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.google.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">For some queries, Google also returns a list of related categories in the <a href="http://directory.google.com/">Google Directory</a>. You can append these category paths to <a href="http://directory.google.com/">http://directory.google.com/</a> to construct the link to the directory category page.
</td>
</tr>
</table>
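<p>Constructing the directory link is plain string concatenation. A minimal sketch, where <code>directory_url</code> is a hypothetical helper and the category path is the one returned in the example above:

```python
DIRECTORY_ROOT = 'http://directory.google.com/'

def directory_url(full_viewable_name):
    """Append a category path to the Google Directory root URL."""
    return DIRECTORY_ROOT + full_viewable_name

category = 'Top/Arts/Literature/World_Literature/American/19th_Century/Twain,_Mark'
link = directory_url(category)
```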
<h2 id="soap.troubleshooting">12.8. Troubleshooting <acronym>SOAP</acronym> Web Services</h2>
<p>Of course, the world of <acronym>SOAP</acronym> web services is not all happiness and light. Sometimes things go wrong.
<p>As you've seen throughout this chapter, <acronym>SOAP</acronym> involves several layers. There's the HTTP layer, since <acronym>SOAP</acronym> is sending XML documents to, and receiving XML documents from, an HTTP server. So all the debugging techniques you learned
in <a href="#oa" title="Chapter 11. HTTP Web Services">Chapter 11, <i>HTTP Web Services</i></a> come into play here. You can <kbd>import httplib</kbd> and then set <kbd>httplib.HTTPConnection.debuglevel = 1</kbd> to see the underlying HTTP traffic.
<p>Beyond the underlying HTTP layer, there are a number of things that can go wrong. SOAPpy does an admirable job hiding the <acronym>SOAP</acronym> syntax from you, but that also means it can be difficult to determine where the problem is when things don't work.
<p>Here are a few examples of common mistakes that I've made in using <acronym>SOAP</acronym> web services, and the errors they generated.
<div class="example"><h3>Example 12.15. Calling a Method With an Incorrectly Configured Proxy</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import SOAPProxy
<samp class="prompt">>>> </samp>url = 'http://services.xmethods.net:80/soap/servlet/rpcrouter'
<samp class="prompt">>>> </samp>server = SOAPProxy(url) <img id="soap.troubleshooting.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>server.getTemp('27502') <img id="soap.troubleshooting.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="traceback">&lt;Fault SOAP-ENV:Server.BadTargetObjectURI:
Unable to determine object id from call: is the method element namespaced?>
Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 453, in __call__
return self.__r_call(*args, **kw)
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 475, in __r_call
self.__hd, self.__ma)
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 389, in __call
raise p
SOAPpy.Types.faultType: &lt;Fault SOAP-ENV:Server.BadTargetObjectURI:
Unable to determine object id from call: is the method element namespaced?></samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.troubleshooting.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Did you spot the mistake? You're creating a <code class="classname">SOAPProxy</code> manually, and you've correctly specified the service <acronym>URL</acronym>, but you haven't specified the namespace. Since multiple services may be routed through the same service <acronym>URL</acronym>, the namespace is essential to determine which service you're trying to talk to, and therefore which method you're really
calling.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.troubleshooting.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The server responds by sending a <acronym>SOAP</acronym> Fault, which SOAPpy turns into a Python exception of type <code class="classname">SOAPpy.Types.faultType</code>. All errors returned from any <acronym>SOAP</acronym> server will always be <acronym>SOAP</acronym> Faults, so you can easily catch this exception. In this case, the human-readable part of the <acronym>SOAP</acronym> Fault gives a clue to the problem: the method element is not namespaced, because the original <code class="classname">SOAPProxy</code> object was not configured with a service namespace.
</td>
</tr>
</table>
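<p>To see what &#8220;namespaced&#8221; means on the wire, compare a method element with and without a namespace. This sketch builds the two elements with the standard library's <code>xml.etree.ElementTree</code> rather than SOAPpy, so the exact prefix it generates will differ from the <code>ns1</code> you saw in the wire traces. The namespace is the one from the temperature service; <code>method_element</code> is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NAMESPACE = 'urn:xmethods-Temperature'

def method_element(name, namespace=None):
    """Serialize an empty method element, namespaced only if one is given."""
    if namespace:
        tag = '{%s}%s' % (namespace, name)   # ElementTree's {uri}name notation
    else:
        tag = name
    return ET.tostring(ET.Element(tag)).decode('utf-8')

# What a proxy with no namespace configured would send.
without_namespace = method_element('getTemp')

# What a correctly configured proxy would send.
with_namespace = method_element('getTemp', NAMESPACE)
```

<p>Only the second form carries the <code>xmlns</code> declaration that tells the server which service the method belongs to.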
<p>Misconfiguring the basic elements of the <acronym>SOAP</acronym> service is one of the problems that <acronym>WSDL</acronym> aims to solve. The <acronym>WSDL</acronym> file contains the service <acronym>URL</acronym> and namespace, so you can't get it wrong. Of course, there are still other things you can get wrong.
<div class="example"><h3>Example 12.16. Calling a Method With the Wrong Arguments</h3><pre class="screen">
<samp class="prompt">>>> </samp>wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl'
<samp class="prompt">>>> </samp>server = WSDL.Proxy(wsdlFile)
<samp class="prompt">>>> </samp>temperature = server.getTemp(27502) <img id="soap.troubleshooting.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="traceback">&lt;Fault SOAP-ENV:Server: Exception while handling service request:
services.temperature.TempService.getTemp(int) -- no signature match> <img id="soap.troubleshooting.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 453, in __call__
return self.__r_call(*args, **kw)
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 475, in __r_call
self.__hd, self.__ma)
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 389, in __call
raise p
SOAPpy.Types.faultType: &lt;Fault SOAP-ENV:Server: Exception while handling service request:
services.temperature.TempService.getTemp(int) -- no signature match></samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.troubleshooting.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Did you spot the mistake? It's a subtle one: you're calling <code class="function">server.getTemp</code> with an integer instead of a string. As you saw from introspecting the <acronym>WSDL</acronym> file, the <code class="function">getTemp()</code> <acronym>SOAP</acronym> function takes a single argument, <code class="varname">zipcode</code>, which must be a string. <code class="classname">WSDL.Proxy</code> will <em>not</em> coerce datatypes for you; you need to pass the exact datatypes that the server expects.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.troubleshooting.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Again, the server returns a <acronym>SOAP</acronym> Fault, and the human-readable part of the error gives a clue as to the problem: you're calling a <code class="function">getTemp</code> function with an integer value, but there is no function defined with that name that takes an integer. In theory, <acronym>SOAP</acronym> allows you to <em>overload</em> functions, so you could have two functions in the same <acronym>SOAP</acronym> service with the same name and the same number of arguments, but with arguments of different datatypes. This is why
it's important to match the datatypes exactly, and why <code class="classname">WSDL.Proxy</code> doesn't coerce datatypes for you. If it did, you could end up calling a completely different function! Good luck debugging
that one. It's much easier to be picky about datatypes and fail as quickly as possible if you get them wrong.
</td>
</tr>
</table>
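<p>Since the <acronym>WSDL</acronym> file tells you the datatype of every argument, you can catch this kind of mistake before anything goes over the wire. The following is a sketch under several assumptions: <code>Param</code> is a stand-in for the <code class="classname">ParameterInfo</code> objects in <code class="varname">callInfo.inparams</code> (only <code class="varname">name</code> and <code class="varname">type</code> are modeled), and <code>EXPECTED</code> and <code>mismatches</code> are hypothetical names that cover only the four XML Schema types seen in this chapter.

```python
from collections import namedtuple

# Stand-in for ParameterInfo: just the two attributes used here.
Param = namedtuple('Param', ['name', 'type'])

XSD = 'http://www.w3.org/2001/XMLSchema'

# XML Schema type name -> the Python type this sketch insists on.
EXPECTED = {'string': str, 'int': int, 'float': float, 'boolean': bool}

def mismatches(inparams, args):
    """Return (name, expected, actual) for each argument of the wrong type."""
    problems = []
    for param, arg in zip(inparams, args):
        expected = EXPECTED.get(param.type[1])
        # Compare exact types: no coercion, and True is not accepted as an int.
        if expected is not None and type(arg) is not expected:
            problems.append((param.name, param.type[1], type(arg).__name__))
    return problems

inparams = [Param('zipcode', (XSD, 'string'))]

wrong = mismatches(inparams, [27502])     # integer where a string is expected
right = mismatches(inparams, ['27502'])   # matches the declared type
```

<p>The check is as strict as the server is: it reports the mismatch and leaves the decision about converting the value to you.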
<p>It's also possible to write Python code that expects a different number of return values than the remote function actually returns.
<div class="example"><h3>Example 12.17. Calling a Method and Expecting the Wrong Number of Return Values</h3><pre class="screen">
<samp class="prompt">>>> </samp>wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl'
<samp class="prompt">>>> </samp>server = WSDL.Proxy(wsdlFile)
<samp class="prompt">>>> </samp>(city, temperature) = server.getTemp('27502') <img id="soap.troubleshooting.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="traceback">Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
TypeError: unpack non-sequence</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.troubleshooting.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Did you spot the mistake? <code class="function">server.getTemp</code> only returns one value, a float, but you've written code that assumes you're getting two values and tries to assign them
to two different variables. Note that this does not fail with a <acronym>SOAP</acronym> fault. As far as the remote server is concerned, nothing went wrong at all. The error only occurred <em>after</em> the <acronym>SOAP</acronym> transaction was complete, <code class="classname">WSDL.Proxy</code> returned a float, and your local Python interpreter tried to accommodate your request to split it into two different variables. Since the function only returned
one value, you get a Python exception trying to split it, not a <acronym>SOAP</acronym> Fault.
</td>
</tr>
</table>
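<p>You can convince yourself that this failure is purely local by reproducing it with no server at all. In this sketch, <code>returned</code> is a stand-in for the single float that the proxy hands back; the other names are made up for illustration.

```python
returned = 66.0   # stand-in for the float returned by the remote call

try:
    # The same assignment as above, with no SOAP involved.
    (city, temperature) = returned
    error_name = None
except TypeError as error:
    # Raised by the Python interpreter itself, not by any server.
    error_name = type(error).__name__
```

<p>The exact wording of the error message varies between Python versions, but the exception type is always <code>TypeError</code>.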
<p>What about Google's web service? The most common problem I've had with it is that I forget to set the application key properly.
<div class="example"><h3>Example 12.18. Calling a Method With An Application-Specific Error</h3><pre class="screen">
<samp class="prompt">>>> </samp>from SOAPpy import WSDL
<samp class="prompt">>>> </samp>server = WSDL.Proxy(r'/path/to/local/GoogleSearch.wsdl')
<samp class="prompt">>>> </samp>results = server.doGoogleSearch('foo', 'mark', 0, 10, False, "", <img id="soap.troubleshooting.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>False, "", "utf-8", "utf-8")
<samp class="traceback">&lt;Fault SOAP-ENV:Server: <img id="soap.troubleshooting.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
Exception from service object: Invalid authorization key: foo:
&lt;SOAPpy.Types.structType detail at 14164616>:
{'stackTrace':
'com.google.soap.search.GoogleSearchFault: Invalid authorization key: foo
at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe(
QueryLimits.java:220)
at com.google.soap.search.QueryLimits.validateKey(QueryLimits.java:127)
at com.google.soap.search.GoogleSearchService.doPublicMethodChecks(
GoogleSearchService.java:825)
at com.google.soap.search.GoogleSearchService.doGoogleSearch(
GoogleSearchService.java:121)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.soap.server.RPCRouter.invoke(RPCRouter.java:146)
at org.apache.soap.providers.RPCJavaProvider.invoke(
RPCJavaProvider.java:129)
at org.apache.soap.server.http.RPCRouterServlet.doPost(
RPCRouterServlet.java:288)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:760)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at com.google.gse.HttpConnection.runServlet(HttpConnection.java:237)
at com.google.gse.HttpConnection.run(HttpConnection.java:195)
at com.google.gse.DispatchQueue$WorkerThread.run(DispatchQueue.java:201)
Caused by: com.google.soap.search.UserKeyInvalidException: Key was of wrong size.
at com.google.soap.search.UserKey.&lt;init>(UserKey.java:59)
at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe(
QueryLimits.java:217)
... 14 more
'}>
Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 453, in __call__
return self.__r_call(*args, **kw)
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 475, in __r_call
self.__hd, self.__ma)
File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 389, in __call
raise p
SOAPpy.Types.faultType: &lt;Fault SOAP-ENV:Server: Exception from service object:
Invalid authorization key: foo:
&lt;SOAPpy.Types.structType detail at 14164616>:
{'stackTrace':
'com.google.soap.search.GoogleSearchFault: Invalid authorization key: foo
at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe(
QueryLimits.java:220)
at com.google.soap.search.QueryLimits.validateKey(QueryLimits.java:127)
at com.google.soap.search.GoogleSearchService.doPublicMethodChecks(
GoogleSearchService.java:825)
at com.google.soap.search.GoogleSearchService.doGoogleSearch(
GoogleSearchService.java:121)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.soap.server.RPCRouter.invoke(RPCRouter.java:146)
at org.apache.soap.providers.RPCJavaProvider.invoke(
RPCJavaProvider.java:129)
at org.apache.soap.server.http.RPCRouterServlet.doPost(
RPCRouterServlet.java:288)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:760)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at com.google.gse.HttpConnection.runServlet(HttpConnection.java:237)
at com.google.gse.HttpConnection.run(HttpConnection.java:195)
at com.google.gse.DispatchQueue$WorkerThread.run(DispatchQueue.java:201)
Caused by: com.google.soap.search.UserKeyInvalidException: Key was of wrong size.
at com.google.soap.search.UserKey.&lt;init>(UserKey.java:59)
at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe(
QueryLimits.java:217)
... 14 more
'}></samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soap.troubleshooting.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Can you spot the mistake? There's nothing wrong with the calling syntax, or the number of arguments, or the datatypes. The
problem is application-specific: the first argument is supposed to be my application key, but <code>foo</code> is not a valid Google key.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soap.troubleshooting.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The Google server responds with a <acronym>SOAP</acronym> Fault and an incredibly long error message, which includes a complete Java stack trace. Remember that <em>all</em> <acronym>SOAP</acronym> errors are signified by <acronym>SOAP</acronym> Faults: errors in configuration, errors in function arguments, and application-specific errors like this. Buried in there
somewhere is the crucial piece of information: <code>Invalid authorization key: foo</code>.
</td>
</tr>
</table>
<div class="itemizedlist">
<h3>Further Reading on Troubleshooting <acronym>SOAP</acronym></h3>
<ul>
<li><a href="http://www-106.ibm.com/developerworks/webservices/library/ws-pyth17.html">New developments for SOAPpy</a> steps through trying to connect to another <acronym>SOAP</acronym> service that doesn't quite work as advertised.
</ul>
<h2 id="soap.summary">12.9. Summary</h2>
<p><acronym>SOAP</acronym> web services are very complicated. The specification is very ambitious and tries to cover many different use cases for web
services. This chapter has touched on some of the simpler use cases.
<div class="highlights">
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
<div class="itemizedlist">
<ul>
<li>Connecting to a <acronym>SOAP</acronym> server and calling remote methods
<li>Loading a <acronym>WSDL</acronym> file and introspecting remote methods
<li>Debugging <acronym>SOAP</acronym> calls with wire traces
<li>Troubleshooting common <acronym>SOAP</acronym>-related errors
</ul>
<div class="chapter">
<h2 id="roman">Chapter 13. Unit Testing</h2>
<h2 id="roman.intro">13.1. Introduction to Roman numerals</h2>
<p>In previous chapters, you &#8220;dived in&#8221; by immediately looking at code and trying to understand it as quickly as possible. Now that you have some Python under your belt, you're going to step back and look at the steps that happen <em>before</em> the code gets written.
<p>In the next few chapters, you're going to write, debug, and optimize a set of utility functions to convert to and from Roman
numerals. You saw the mechanics of constructing and validating Roman numerals in <a href="#re.roman" title="7.3. Case Study: Roman Numerals">Section 7.3, &#8220;Case Study: Roman Numerals&#8221;</a>, but now let's step back and consider what it would take to expand that into a two-way utility.
<p><a href="#re.roman" title="7.3. Case Study: Roman Numerals">The rules for Roman numerals</a> lead to a number of interesting observations:
<div class="orderedlist">
<ol>
<li>There is only one correct way to represent a particular number as Roman numerals.
<li>The converse is also true: if a string of characters is a valid Roman numeral, it represents only one number (<i class="foreignphrase"><acronym>i.e.</acronym></i> it can only be read one way).
<li>There is a limited range of numbers that can be expressed as Roman numerals, specifically <code>1</code> through <code>3999</code>. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent
that its normal value should be multiplied by <code>1000</code>, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from <code>1</code> to <code>3999</code>.)
<li>There is no way to represent <code class="constant">0</code> in Roman numerals. (Amazingly, the ancient Romans had no concept of <code class="constant">0</code> as a number. Numbers were for counting things you had; how can you count what you don't have?)
<li>There is no way to represent negative numbers in Roman numerals.
<li>There is no way to represent fractions or non-integer numbers in Roman numerals.
</ol>
<p>Given all of this, what would you expect out of a set of functions to convert to and from Roman numerals?
<div class="orderedlist"><h3 id="roman.requirements"><code class="filename">roman.py</code> requirements</h3>
<ol>
<li><code class="function">toRoman</code> should return the Roman numeral representation for all integers <code class="constant">1</code> to <code>3999</code>.
<li><code class="function">toRoman</code> should fail when given an integer outside the range <code class="constant">1</code> to <code>3999</code>.
<li><code class="function">toRoman</code> should fail when given a non-integer number.
<li><code class="function">fromRoman</code> should take a valid Roman numeral and return the number that it represents.
<li><code class="function">fromRoman</code> should fail when given an invalid Roman numeral.
<li>If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number
you started with. So <code>fromRoman(toRoman(n)) == n</code> for all <code class="varname">n</code> in <code>1..3999</code>.
<li><code class="function">toRoman</code> should always return a Roman numeral using uppercase letters.
<li><code class="function">fromRoman</code> should only accept uppercase Roman numerals (<i class="foreignphrase"><acronym>i.e.</acronym></i> it should fail when given lowercase input).
</ol>
<div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li><a href="http://www.wilkiecollins.demon.co.uk/roman/front.htm">This site</a> has more on Roman numerals, including a fascinating <a href="http://www.wilkiecollins.demon.co.uk/roman/intro.htm">history</a> of how Romans and other civilizations really used them (short answer: haphazardly and inconsistently).
</ul>
<h2 id="roman.divein">13.2. Diving in</h2>
<p>Now that you've completely defined the behavior you expect from your conversion functions, you're going to do something a
little unexpected: you're going to write a test suite that puts these functions through their paces and makes sure that they
behave the way you want them to. You read that right: you're going to write code that tests code that you haven't written
yet.
<p>This is called unit testing, since the set of two conversion functions can be written and tested as a unit, separate from
any larger program they may become part of later. Python has a framework for unit testing, the appropriately-named <code class="filename">unittest</code> module.<table id="note.unittest" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code class="filename">unittest</code> is included with Python 2.1 and later. Python 2.0 users can download it from <a href="http://pyunit.sourceforge.net/"><code class="systemitem">pyunit.sourceforge.net</code></a>.
</td>
</tr>
</table>
<p>Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important
to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements
change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases
of development:
<div class="itemizedlist">
<ul>
<li>Before writing code, it forces you to detail your requirements in a useful fashion.
<li>While writing code, it keeps you from over-coding. When all the test cases pass, the function is complete.
<li>When refactoring code, it assures you that the new version behaves the same way as the old version.
<li>When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code.
(&#8220;But <em>sir</em>, all the unit tests passed when I checked it in...&#8221;)
<li>When writing code in a team, it increases confidence that the code you're about to commit isn't going to break other people's
code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment,
everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team.
That way, nobody goes off too far into developing code that won't play well with others.)
</ul>
<h2 id="roman.romantest">13.3. Introducing <code class="filename">romantest.py</code></h2>
<p>This is the complete test suite for your Roman numeral conversion functions, which are yet to be written but will eventually
be in <code class="filename">roman.py</code>. It is not immediately obvious how it all fits together; none of these classes or methods reference any of the others.
There are good reasons for this, as you'll see shortly.
<div class="example"><h3>Example 13.1. <code class="filename">romantest.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Unit test for roman.py"""
import roman
import unittest

class KnownValues(unittest.TestCase):
    knownValues = ( (1, 'I'),
                    (2, 'II'),
                    (3, 'III'),
                    (4, 'IV'),
                    (5, 'V'),
                    (6, 'VI'),
                    (7, 'VII'),
                    (8, 'VIII'),
                    (9, 'IX'),
                    (10, 'X'),
                    (50, 'L'),
                    (100, 'C'),
                    (500, 'D'),
                    (1000, 'M'),
                    (31, 'XXXI'),
                    (148, 'CXLVIII'),
                    (294, 'CCXCIV'),
                    (312, 'CCCXII'),
                    (421, 'CDXXI'),
                    (528, 'DXXVIII'),
                    (621, 'DCXXI'),
                    (782, 'DCCLXXXII'),
                    (870, 'DCCCLXX'),
                    (941, 'CMXLI'),
                    (1043, 'MXLIII'),
                    (1110, 'MCX'),
                    (1226, 'MCCXXVI'),
                    (1301, 'MCCCI'),
                    (1485, 'MCDLXXXV'),
                    (1509, 'MDIX'),
                    (1607, 'MDCVII'),
                    (1754, 'MDCCLIV'),
                    (1832, 'MDCCCXXXII'),
                    (1993, 'MCMXCIII'),
                    (2074, 'MMLXXIV'),
                    (2152, 'MMCLII'),
                    (2212, 'MMCCXII'),
                    (2343, 'MMCCCXLIII'),
                    (2499, 'MMCDXCIX'),
                    (2574, 'MMDLXXIV'),
                    (2646, 'MMDCXLVI'),
                    (2723, 'MMDCCXXIII'),
                    (2892, 'MMDCCCXCII'),
                    (2975, 'MMCMLXXV'),
                    (3051, 'MMMLI'),
                    (3185, 'MMMCLXXXV'),
                    (3250, 'MMMCCL'),
                    (3313, 'MMMCCCXIII'),
                    (3408, 'MMMCDVIII'),
                    (3501, 'MMMDI'),
                    (3610, 'MMMDCX'),
                    (3743, 'MMMDCCXLIII'),
                    (3844, 'MMMDCCCXLIV'),
                    (3888, 'MMMDCCCLXXXVIII'),
                    (3940, 'MMMCMXL'),
                    (3999, 'MMMCMXCIX'))

    def testToRomanKnownValues(self):
        """toRoman should give known result with known input"""
        for integer, numeral in self.knownValues:
            result = roman.toRoman(integer)
            self.assertEqual(numeral, result)

    def testFromRomanKnownValues(self):
        """fromRoman should give known result with known input"""
        for integer, numeral in self.knownValues:
            result = roman.fromRoman(numeral)
            self.assertEqual(integer, result)

class ToRomanBadInput(unittest.TestCase):
    def testTooLarge(self):
        """toRoman should fail with large input"""
        self.assertRaises(roman.OutOfRangeError, roman.toRoman, 4000)

    def testZero(self):
        """toRoman should fail with 0 input"""
        self.assertRaises(roman.OutOfRangeError, roman.toRoman, 0)

    def testNegative(self):
        """toRoman should fail with negative input"""
        self.assertRaises(roman.OutOfRangeError, roman.toRoman, -1)

    def testNonInteger(self):
        """toRoman should fail with non-integer input"""
        self.assertRaises(roman.NotIntegerError, roman.toRoman, 0.5)

class FromRomanBadInput(unittest.TestCase):
    def testTooManyRepeatedNumerals(self):
        """fromRoman should fail with too many repeated numerals"""
        for s in ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'):
            self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)

    def testRepeatedPairs(self):
        """fromRoman should fail with repeated pairs of numerals"""
        for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'):
            self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)

    def testMalformedAntecedent(self):
        """fromRoman should fail with malformed antecedents"""
        for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV',
                  'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'):
            self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)

class SanityCheck(unittest.TestCase):
    def testSanity(self):
        """fromRoman(toRoman(n))==n for all n"""
        for integer in range(1, 4000):
            numeral = roman.toRoman(integer)
            result = roman.fromRoman(numeral)
            self.assertEqual(integer, result)

class CaseCheck(unittest.TestCase):
    def testToRomanCase(self):
        """toRoman should always return uppercase"""
        for integer in range(1, 4000):
            numeral = roman.toRoman(integer)
            self.assertEqual(numeral, numeral.upper())

    def testFromRomanCase(self):
        """fromRoman should only accept uppercase input"""
        for integer in range(1, 4000):
            numeral = roman.toRoman(integer)
            roman.fromRoman(numeral.upper())
            self.assertRaises(roman.InvalidRomanNumeralError,
                              roman.fromRoman, numeral.lower())

if __name__ == "__main__":
    unittest.main() </pre><div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li><a href="http://pyunit.sourceforge.net/">The PyUnit home page</a> has an in-depth discussion of <a href="http://pyunit.sourceforge.net/pyunit.html">using the <code class="filename">unittest</code> framework</a>, including advanced features not covered in this chapter.
<li><a href="http://pyunit.sourceforge.net/pyunit.html">The PyUnit <acronym>FAQ</acronym></a> explains <a href="http://pyunit.sourceforge.net/pyunit.html#WHERE">why test cases are stored separately</a> from the code they test.
<li><a href="http://www.python.org/doc/current/lib/"><i class="citetitle">Python Library Reference</i></a> summarizes the <a href="http://www.python.org/doc/current/lib/module-unittest.html"><code class="filename">unittest</code></a> module.
<li><a href="http://www.extremeprogramming.org/">ExtremeProgramming.org</a> discusses <a href="http://www.extremeprogramming.org/rules/unittests.html">why you should write unit tests</a>.
<li><a href="http://www.c2.com/cgi/wiki">The Portland Pattern Repository</a> has an ongoing discussion of <a href="http://www.c2.com/cgi/wiki?UnitTests">unit tests</a>, including a <a href="http://www.c2.com/cgi/wiki?StandardDefinitionOfUnitTest">standard definition</a>, why you should <a href="http://www.c2.com/cgi/wiki?CodeUnitTestFirst">code unit tests first</a>, and several in-depth <a href="http://www.c2.com/cgi/wiki?UnitTestTrial">case studies</a>.
</ul>
<h2 id="roman.success">13.4. Testing for success</h2>
<p>The most fundamental part of unit testing is constructing individual test cases. A test case answers a single question about
the code it is testing.
<p>A test case should be able to...
<div class="itemizedlist">
<ul>
<li>...run completely by itself, without any human input. Unit testing is about automation.
<li>...determine by itself whether the function it is testing has passed or failed, without a human interpreting the results.
<li>...run in isolation, separate from any other test cases (even if they test the same functions). Each test case is an island.
</ul>
<p>Given that, let's build the first test case. You have the following <a href="#roman.requirements">requirement</a>:
<div class="orderedlist">
<ol>
<li><code class="function">toRoman</code> should return the Roman numeral representation for all integers <code class="constant">1</code> to <code>3999</code>.
</ol>
<div class="example"><h3 id="roman.testtoromanknownvalues.example">Example 13.2. <code class="function">testToRomanKnownValues</code></h3><pre class="programlisting">
class KnownValues(unittest.TestCase): <img id="roman.success.1.0" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    knownValues = ( (1, 'I'),
                    (2, 'II'),
                    (3, 'III'),
                    (4, 'IV'),
                    (5, 'V'),
                    (6, 'VI'),
                    (7, 'VII'),
                    (8, 'VIII'),
                    (9, 'IX'),
                    (10, 'X'),
                    (50, 'L'),
                    (100, 'C'),
                    (500, 'D'),
                    (1000, 'M'),
                    (31, 'XXXI'),
                    (148, 'CXLVIII'),
                    (294, 'CCXCIV'),
                    (312, 'CCCXII'),
                    (421, 'CDXXI'),
                    (528, 'DXXVIII'),
                    (621, 'DCXXI'),
                    (782, 'DCCLXXXII'),
                    (870, 'DCCCLXX'),
                    (941, 'CMXLI'),
                    (1043, 'MXLIII'),
                    (1110, 'MCX'),
                    (1226, 'MCCXXVI'),
                    (1301, 'MCCCI'),
                    (1485, 'MCDLXXXV'),
                    (1509, 'MDIX'),
                    (1607, 'MDCVII'),
                    (1754, 'MDCCLIV'),
                    (1832, 'MDCCCXXXII'),
                    (1993, 'MCMXCIII'),
                    (2074, 'MMLXXIV'),
                    (2152, 'MMCLII'),
                    (2212, 'MMCCXII'),
                    (2343, 'MMCCCXLIII'),
                    (2499, 'MMCDXCIX'),
                    (2574, 'MMDLXXIV'),
                    (2646, 'MMDCXLVI'),
                    (2723, 'MMDCCXXIII'),
                    (2892, 'MMDCCCXCII'),
                    (2975, 'MMCMLXXV'),
                    (3051, 'MMMLI'),
                    (3185, 'MMMCLXXXV'),
                    (3250, 'MMMCCL'),
                    (3313, 'MMMCCCXIII'),
                    (3408, 'MMMCDVIII'),
                    (3501, 'MMMDI'),
                    (3610, 'MMMDCX'),
                    (3743, 'MMMDCCXLIII'),
                    (3844, 'MMMDCCCXLIV'),
                    (3888, 'MMMDCCCLXXXVIII'),
                    (3940, 'MMMCMXL'),
                    (3999, 'MMMCMXCIX')) <img id="roman.success.1.1" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">

    def testToRomanKnownValues(self): <img id="roman.success.1.2" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        """toRoman should give known result with known input"""
        for integer, numeral in self.knownValues:
            result = roman.toRoman(integer) <img id="roman.success.1.3" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"> <img id="roman.success.1.4" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
            self.assertEqual(numeral, result) <img id="roman.success.1.5" src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.success.1.0"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To write a test case, first subclass the <code class="classname">TestCase</code> class of the <code class="filename">unittest</code> module. This class provides many useful methods which you can use in your test case to test specific conditions.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.success.1.1"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a tuple of integer/numeral pairs that I verified manually. It includes the lowest ten numbers, the highest number,
every number that translates to a single-character Roman numeral, and a random sampling of other valid numbers. The point
of a unit test is not to test every possible input, but to test a representative sample.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.success.1.2"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Every individual test is its own method, which must take no parameters (other than <code class="varname">self</code>) and return no value. If the method exits normally
without raising an exception, the test is considered passed; if the method raises an exception, the test is considered failed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.success.1.3"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you call the actual <code class="function">toRoman</code> function. (Well, the function hasn't been written yet, but once it is, this is the line that will call it.) Notice that you
have now defined the <acronym>API</acronym> for the <code class="function">toRoman</code> function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the
<acronym>API</acronym> is different from that, this test is considered failed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.success.1.4"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Also notice that you are not trapping any exceptions when you call <code class="function">toRoman</code>. This is intentional. <code class="function">toRoman</code> shouldn't raise an exception when you call it with valid input, and these input values are all valid. If <code class="function">toRoman</code> raises an exception, this test is considered failed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.success.1.5"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Assuming the <code class="function">toRoman</code> function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check
whether it returned the <em>right</em> value. This is a common question, and the <code class="classname">TestCase</code> class provides a method, <code class="function">assertEqual</code>, to check whether two values are equal. If the result returned from <code class="function">toRoman</code> (<code class="varname">result</code>) does not match the known value you were expecting (<code class="varname">numeral</code>), <code class="function">assertEqual</code> will raise an exception and the test will fail. If the two values are equal, <code class="function">assertEqual</code> will do nothing. If every value returned from <code class="function">toRoman</code> matches the known value you expect, <code class="function">assertEqual</code> never raises an exception, so <code class="function">testToRomanKnownValues</code> eventually exits normally, which means <code class="function">toRoman</code> has passed this test.
</td>
</tr>
</table>
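<p>To watch that pass/fail logic in action before <code class="filename">roman.py</code> exists, here is a minimal sketch. It assumes two hypothetical lookup tables, <code>GOOD</code> and <code>BROKEN</code>, standing in for <code class="function">toRoman</code>; the helper names <code>make_case</code> and <code>run</code> are also invented for this illustration. None of it is the conversion code this chapter goes on to build.

```python
import unittest

# Hypothetical stand-ins for roman.toRoman: plain lookup tables, not real
# conversion code. BROKEN deliberately holds the wrong numeral for 4.
GOOD = {1: 'I', 4: 'IV', 5: 'V'}
BROKEN = {1: 'I', 4: 'IIII', 5: 'V'}

def make_case(table):
    """Build a TestCase class that checks a lookup table against known values."""
    class KnownValues(unittest.TestCase):
        knownValues = ((1, 'I'), (4, 'IV'), (5, 'V'))

        def testToRomanKnownValues(self):
            """toRoman should give known result with known input"""
            for integer, numeral in self.knownValues:
                result = table[integer]          # stands in for roman.toRoman(integer)
                self.assertEqual(numeral, result)
    return KnownValues

def run(case):
    """Run every test method in a TestCase class and return the TestResult."""
    suite = unittest.TestLoader().loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result

good = run(make_case(GOOD))
broken = run(make_case(BROKEN))
print(good.wasSuccessful())     # True
print(broken.wasSuccessful())   # False
print(len(broken.failures))     # 1
```

<p>With the broken table, <code class="function">assertEqual</code> raises an exception as soon as it reaches <code>4</code>, and the whole test method is reported as a single failure.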
<h2 id="roman.failure">13.5. Testing for failure</h2>
<p>It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input.
And not just any sort of failure; they must fail in the way you expect.
<p>Remember the <a href="#roman.requirements">other requirements</a> for <code class="function">toRoman</code>:
<div class="orderedlist">
<ol start="2">
<li><code class="function">toRoman</code> should fail when given an integer outside the range <code class="constant">1</code> to <code>3999</code>.
<li><code class="function">toRoman</code> should fail when given a non-integer number.
</ol>
<p>In Python, functions indicate failure by raising <a href="#fileinfo.exception" title="6.1. Handling Exceptions">exceptions</a>, and the <code class="filename">unittest</code> module provides methods for testing whether a function raises a particular exception when given bad input.
<div class="example"><h3 id="roman.tobadinput.example">Example 13.3. Testing bad input to <code class="function">toRoman</code></h3><pre class="programlisting">
class ToRomanBadInput(unittest.TestCase):
    def testTooLarge(self):
        """toRoman should fail with large input"""
        self.assertRaises(roman.OutOfRangeError, roman.toRoman, 4000) <img id="roman.failure.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">

    def testZero(self):
        """toRoman should fail with 0 input"""
        self.assertRaises(roman.OutOfRangeError, roman.toRoman, 0) <img id="roman.failure.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">

    def testNegative(self):
        """toRoman should fail with negative input"""
        self.assertRaises(roman.OutOfRangeError, roman.toRoman, -1)

    def testNonInteger(self):
        """toRoman should fail with non-integer input"""
        self.assertRaises(roman.NotIntegerError, roman.toRoman, 0.5) <img id="roman.failure.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.failure.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="classname">TestCase</code> class of the <code class="filename">unittest</code> module provides the <code class="function">assertRaises</code> method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments
you're passing that function. (If the function you're testing takes more than one argument, pass them all to <code class="function">assertRaises</code>, in order, and it will pass them right along to the function you're testing.) Pay close attention to what you're doing here:
instead of calling <code class="function">toRoman</code> directly and manually checking that it raises a particular exception (by wrapping it in a <a href="#fileinfo.exception" title="6.1. Handling Exceptions"><code>try...except</code> block</a>), <code class="function">assertRaises</code> has encapsulated all of that for us. All you do is give it the exception (<code class="errorcode">roman.OutOfRangeError</code>), the function (<code class="function">toRoman</code>), and <code class="function">toRoman</code>'s arguments (<code>4000</code>), and <code class="function">assertRaises</code> takes care of calling <code class="function">toRoman</code> and checking to make sure that it raises <code class="errorcode">roman.OutOfRangeError</code>. (Also note that you're passing the <code class="function">toRoman</code> function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned
recently how handy it is that <a href="#odbchelper.objects" title="2.4. Everything Is an Object">everything in Python is an object</a>, including functions and exceptions?)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.failure.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Along with testing numbers that are too large, you need to test numbers that are too small. Remember, Roman numerals cannot
express <code class="constant">0</code> or negative numbers, so you have a test case for each of those (<code class="function">testZero</code> and <code class="function">testNegative</code>). In <code class="function">testZero</code>, you are testing that <code class="function">toRoman</code> raises a <code class="errorcode">roman.OutOfRangeError</code> exception when called with <code class="constant">0</code>; if it does <em>not</em> raise a <code class="errorcode">roman.OutOfRangeError</code> (either because it returns an actual value, or because it raises some other exception), this test is considered failed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.failure.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><a href="#roman.requirements">Requirement #3</a> specifies that <code class="function">toRoman</code> cannot accept a non-integer number, so here you test to make sure that <code class="function">toRoman</code> raises a <code class="errorcode">roman.NotIntegerError</code> exception when called with <code>0.5</code>. If <code class="function">toRoman</code> does not raise a <code class="errorcode">roman.NotIntegerError</code>, this test is considered failed.
</td>
</tr>
</table>
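<p>To see what <code class="function">assertRaises</code> saves you, compare it with trapping the exception by hand. This is a sketch only: the <code>OutOfRangeError</code> class and the <code>toRoman</code> function below are hypothetical stand-ins that do nothing but the range check, and <code>testTooLargeByHand</code> is a name made up for the comparison.

```python
import unittest

class OutOfRangeError(Exception):
    """Hypothetical stand-in for roman.OutOfRangeError."""

def toRoman(n):
    """Hypothetical stand-in: performs the range check only, no conversion."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    return "?"

class ToRomanBadInput(unittest.TestCase):
    def testTooLargeByHand(self):
        """the long way: call toRoman and trap the exception yourself"""
        try:
            toRoman(4000)
        except OutOfRangeError:
            pass                                  # expected exception: test passes
        else:
            self.fail("OutOfRangeError not raised")

    def testTooLarge(self):
        """the short way: assertRaises does the calling and the trapping"""
        self.assertRaises(OutOfRangeError, toRoman, 4000)

suite = unittest.TestLoader().loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun)          # 2
print(result.wasSuccessful())   # True
```

<p>Both methods test the same thing. The second one just hands the exception, the function, and the argument to <code class="function">assertRaises</code> and lets it do the work.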
<p>The next two <a href="#roman.requirements">requirements</a> are similar to the first three, except they apply to <code class="function">fromRoman</code> instead of <code class="function">toRoman</code>:
<div class="orderedlist">
<ol start="4">
<li><code class="function">fromRoman</code> should take a valid Roman numeral and return the number that it represents.
<li><code class="function">fromRoman</code> should fail when given an invalid Roman numeral.
</ol>
<p>Requirement #4 is handled in the same way as <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">requirement #1</a>, iterating through a sampling of known values and testing each in turn. Requirement #5 is handled in the same way as requirements
#2 and #3, by testing a series of bad inputs and making sure <code class="function">fromRoman</code> raises the appropriate exception.
<div class="example"><h3 id="roman.frombadinput.example">Example 13.4. Testing bad input to <code class="function">fromRoman</code></h3><pre class="programlisting">
class FromRomanBadInput(unittest.TestCase):
    def testTooManyRepeatedNumerals(self):
        """fromRoman should fail with too many repeated numerals"""
        for s in ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'):
            self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s) <img id="roman.failure.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">

    def testRepeatedPairs(self):
        """fromRoman should fail with repeated pairs of numerals"""
        for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'):
            self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)

    def testMalformedAntecedent(self):
        """fromRoman should fail with malformed antecedents"""
        for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV',
                  'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'):
            self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.failure.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Not much new to say about these; the pattern is exactly the same as the one you used to test bad input to <code class="function">toRoman</code>. I will briefly note that you have another exception: <code class="errorcode">roman.InvalidRomanNumeralError</code>. That makes a total of three custom exceptions that will need to be defined in <code class="filename">roman.py</code> (along with <code class="errorcode">roman.OutOfRangeError</code> and <code class="errorcode">roman.NotIntegerError</code>). You'll see how to define these custom exceptions when you actually start writing <code class="filename">roman.py</code>, later in this chapter.
</td>
</tr>
</table>
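<p>Taken together, the tests have now pinned down every name that <code class="filename">roman.py</code> has to export: three exceptions and two functions. As a rough preview, here is one shape that skeleton could take. It is a sketch under assumptions: the common base class <code>RomanError</code> is a hypothetical name chosen for illustration, and the two functions are empty stubs that exist only so the names resolve.

```python
class RomanError(Exception):
    """Assumed common base class, so callers can trap all three at once."""

class OutOfRangeError(RomanError):
    """toRoman was given an integer outside 1..3999."""

class NotIntegerError(RomanError):
    """toRoman was given a non-integer number."""

class InvalidRomanNumeralError(RomanError):
    """fromRoman was given a string that is not a valid Roman numeral."""

def toRoman(n):
    """convert integer to Roman numeral (stub: not implemented yet)"""
    pass

def fromRoman(s):
    """convert Roman numeral to integer (stub: not implemented yet)"""
    pass

# Trapping the base class also traps any of the specific errors.
try:
    raise OutOfRangeError("4000")
except RomanError as e:
    caught = type(e).__name__
print(caught)   # OutOfRangeError
```

<p>Deriving all three from one base class is a design choice, not something the tests require; <code class="function">assertRaises</code> only cares that the specific exception named in each test is the one that gets raised.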
<h2 id="roman.sanity">13.6. Testing for sanity</h2>
<p>Often, you will find that a unit of code contains a set of reciprocal functions, usually in the form of conversion functions
where one converts A to B and the other converts B to A. In these cases, it is useful to create a &#8220;sanity check&#8221; to make sure that you can convert A to B and back to A without losing precision, incurring rounding errors, or triggering
any other sort of bug.
<p>Consider this <a href="#roman.requirements">requirement</a>:
<div class="orderedlist">
<ol start="6">
<li>If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number
you started with. So <code>fromRoman(toRoman(n)) == n</code> for all <code class="varname">n</code> in <code>1..3999</code>.
</ol>
<div class="example"><h3 id="roman.sanity.example">Example 13.5. Testing <code class="function">toRoman</code> against <code class="function">fromRoman</code></h3><pre class="programlisting">
class SanityCheck(unittest.TestCase):
    def testSanity(self):
        """fromRoman(toRoman(n))==n for all n"""
        for integer in range(1, 4000): <img id="roman.sanity.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"> <img id="roman.sanity.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
            numeral = roman.toRoman(integer)
            result = roman.fromRoman(numeral)
            self.assertEqual(integer, result) <img id="roman.sanity.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.sanity.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You've seen <a href="#odbchelper.multiassign.range" title="Example 3.20. Assigning Consecutive Values">the <code class="function">range</code> function</a> before, but here it is called with two arguments, which returns a list of integers starting at the first argument (<code class="constant">1</code>) and counting consecutively up to <em>but not including</em> the second argument (<code>4000</code>). Thus, <code>1..3999</code>, which is the valid range for converting to Roman numerals.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.sanity.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">I just wanted to mention in passing that <code class="varname">integer</code> is not a keyword in Python; here it's just a variable name like any other.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.sanity.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The actual testing logic here is straightforward: take a number (<code class="varname">integer</code>), convert it to a Roman numeral (<code class="varname">numeral</code>), then convert it back to a number (<code class="varname">result</code>) and make sure you end up with the same number you started with. If not, <code class="function">assertEqual</code> will raise an exception and the test will immediately be considered failed. If all the numbers match, <code class="function">assertEqual</code> will always return silently, the entire <code class="function">testSanity</code> method will eventually return silently, and the test will be considered passed.
</td>
</tr>
</table>
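<p>The round-trip pattern is not specific to Roman numerals. As a minimal sketch, assuming Python 3 and using the built-in <code>str</code> and <code>int</code> as stand-ins for <code class="function">toRoman</code> and <code class="function">fromRoman</code> (the names <code>RoundTripCheck</code> and <code>runRoundTripCheck</code> are hypothetical and not part of the book's examples), the same test structure might look like this:

```python
import unittest

class RoundTripCheck(unittest.TestCase):
    def testRoundTrip(self):
        """int(str(n))==n for all n in 1..3999"""
        # range(1, 4000) stops before 4000, so the last value tested is 3999
        for integer in range(1, 4000):
            text = str(integer)   # stand-in for roman.toRoman
            result = int(text)    # stand-in for roman.fromRoman
            self.assertEqual(integer, result)

def runRoundTripCheck():
    """Run the test case programmatically and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RoundTripCheck)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

<p>Because every <code class="function">assertEqual</code> call returns silently, the single test method passes and the returned result reports success.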
<p>The <a href="#roman.requirements">last two requirements</a> are different from the others because they seem both arbitrary and trivial:
<div class="orderedlist">
<ol start="7">
<li><code class="function">toRoman</code> should always return a Roman numeral using uppercase letters.
<li><code class="function">fromRoman</code> should only accept uppercase Roman numerals (<i class="foreignphrase"><acronym>i.e.</acronym></i> it should fail when given lowercase input).
</ol>
<p>In fact, they are somewhat arbitrary. You could, for instance, have stipulated that <code class="function">fromRoman</code> accept lowercase and mixed case input. But they are not completely arbitrary; if <code class="function">toRoman</code> is always returning uppercase output, then <code class="function">fromRoman</code> must at least accept uppercase input, or the &#8220;sanity check&#8221; (requirement #6) would fail. The fact that it <em>only</em> accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying
the behavior up front. And if it's worth specifying, it's worth testing.
<div class="example"><h3>Example 13.6. Testing for case</h3><pre class="programlisting">
class CaseCheck(unittest.TestCase):
    def testToRomanCase(self):
        """toRoman should always return uppercase"""
        for integer in range(1, 4000):
            numeral = roman.toRoman(integer)
            self.assertEqual(numeral, numeral.upper()) <img id="roman.sanity.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    def testFromRomanCase(self):
        """fromRoman should only accept uppercase input"""
        for integer in range(1, 4000):
            numeral = roman.toRoman(integer)
            roman.fromRoman(numeral.upper()) <img id="roman.sanity.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"> <img id="roman.sanity.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
            self.assertRaises(roman.InvalidRomanNumeralError,
                              roman.fromRoman, numeral.lower()) <img id="roman.sanity.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.sanity.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The most interesting thing about this test case is all the things it doesn't test. It doesn't test that the value returned
from <code class="function">toRoman</code> is <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">right</a> or even <a href="#roman.sanity.example" title="Example 13.5. Testing toRoman against fromRoman">consistent</a>; those questions are answered by separate test cases. You have a whole test case just to test for uppercase-ness. You might
be tempted to combine this with the <a href="#roman.sanity.example" title="Example 13.5. Testing toRoman against fromRoman">sanity check</a>, since both run through the entire range of values and call <code class="function">toRoman</code>.<sup>[<a name="d0e32781" href="#ftn.d0e32781">6</a>]</sup> But that would violate one of the <a href="#roman.success" title="13.4. Testing for success">fundamental rules</a>: each test case should answer only a single question. Imagine that you combined this case check with the sanity check, and
then that test case failed. You would need to do further analysis to figure out which part of the test case failed to determine
what the problem was. If you need to analyze the results of your unit testing just to figure out what they mean, it's a sure
sign that you've mis-designed your test cases.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.sanity.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">There's a similar lesson to be learned here: even though &#8220;you know&#8221; that <code class="function">toRoman</code> always returns uppercase, you are explicitly converting its return value to uppercase here to test that <code class="function">fromRoman</code> accepts uppercase input. Why? Because the fact that <code class="function">toRoman</code> always returns uppercase is an independent requirement. If you changed that requirement so that, for instance, it always
returned lowercase, the <code class="function">testToRomanCase</code> test case would need to change, but this test case would still work. This was another of the <a href="#roman.success" title="13.4. Testing for success">fundamental rules</a>: each test case must be able to work in isolation from any of the others. Every test case is an island.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.sanity.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that you're not assigning the return value of <code class="function">fromRoman</code> to anything. This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value. In this case, that's what you want. This test case doesn't test anything about the return
value; it just tests that <code class="function">fromRoman</code> accepts the uppercase input without raising an exception.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.sanity.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a complicated line, but it's very similar to what you did in the <code class="classname">ToRomanBadInput</code> and <code class="classname">FromRomanBadInput</code> tests. You are testing to make sure that calling a particular function (<code class="function">roman.fromRoman</code>) with a particular value (<code>numeral.lower()</code>, the lowercase version of the current Roman numeral in the loop) raises a particular exception (<code>roman.InvalidRomanNumeralError</code>). If it does (each time through the loop), the test passes; if even one time it does something else (like raising a different
exception, or returning a value without raising an exception at all), the test fails.
</td>
</tr>
</table>
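<p>To see the two halves of the case test in isolation, here is a minimal sketch, assuming Python 3. The function <code>strictFromRoman</code> is a hypothetical stand-in that checks only the case of its argument and converts nothing, and the exception class defined here merely imitates <code>roman.InvalidRomanNumeralError</code>; neither is the book's implementation.

```python
import unittest

class InvalidRomanNumeralError(Exception):
    """Stand-in for roman.InvalidRomanNumeralError."""

def strictFromRoman(s):
    """Hypothetical stand-in: checks only the case of s, converts nothing."""
    if s != s.upper():
        raise InvalidRomanNumeralError(s)
    return None

class CaseSketch(unittest.TestCase):
    def testLowercaseRejected(self):
        # Pass the function and its argument separately;
        # assertRaises makes the call itself and watches for the exception
        self.assertRaises(InvalidRomanNumeralError, strictFromRoman, "mcmxc")

    def testUppercaseAccepted(self):
        # The return value is discarded; the test passes if nothing is raised
        strictFromRoman("MCMXC")

    def testMissingExceptionIsReported(self):
        # When the expected exception is not raised, assertRaises raises
        # self.failureException (AssertionError by default)
        try:
            self.assertRaises(InvalidRomanNumeralError, strictFromRoman, "MCMXC")
        except self.failureException:
            pass
        else:
            self.fail("assertRaises did not report the missing exception")

def runCaseSketch():
    """Run the test case programmatically and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CaseSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

<p>The third method is there only to make the failure path visible; in a real test module you would let <code class="function">assertRaises</code> fail the test on its own.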
<p>In the next chapter, you'll see how to write code that passes these tests.
<div class="footnotes"><br><hr width="100" align="left">
<div class="footnote">
<p><sup>[<a name="ftn.d0e32781" href="#d0e32781">6</a>] </sup>&#8220;I can resist everything except temptation.&#8221; --Oscar Wilde
<div class="chapter">
<h2 id="roman1.5">Chapter 14. Test-First Programming</h2>
<h2 id="roman.stage1">14.1. <code class="filename">roman.py</code>, stage 1</h2>
<p>Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're
going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps
in <code class="filename">roman.py</code>.
<div class="example"><h3>Example 14.1. <code class="filename">roman1.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage1/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass <img id="roman.stage1.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
class OutOfRangeError(RomanError): pass <img id="roman.stage1.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass <img id="roman.stage1.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
def toRoman(n):
    """convert integer to Roman numeral"""
    pass <img id="roman.stage1.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
def fromRoman(s):
    """convert Roman numeral to integer"""
    pass
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not
required) that you subclass <code class="errorcode">Exception</code>, which is the base class that all built-in exceptions inherit from. Here I am defining <code class="errorcode">RomanError</code> (inherited from <code class="errorcode">Exception</code>) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily
have inherited each individual exception from the <code class="errorcode">Exception</code> class directly.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="errorcode">OutOfRangeError</code> and <code class="errorcode">NotIntegerError</code> exceptions will eventually be used by <code class="function">toRoman</code> to flag various forms of invalid input, as specified in <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to toRoman"><code class="classname">ToRomanBadInput</code></a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="errorcode">InvalidRomanNumeralError</code> exception will eventually be used by <code class="function">fromRoman</code> to flag invalid input, as specified in <a href="#roman.frombadinput.example" title="Example 13.4. Testing bad input to fromRoman"><code class="classname">FromRomanBadInput</code></a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">At this stage, you want to define the <acronym>API</acronym> of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word <a href="#fileinfo.class.simplest" title="Example 5.3. The Simplest Python Class"><code>pass</code></a>.
</td>
</tr>
</table>
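<p>One practical consequence of giving the custom exceptions a shared base class is that a caller can catch all of them with a single <code>except</code> clause. The following is a minimal sketch, assuming Python 3; the helper <code>classify</code> is hypothetical and exists only to show which clause catches what.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

def classify(excClass):
    """Raise excClass and report which except clause caught it."""
    try:
        raise excClass("demo")
    except RomanError:
        # Catches RomanError and every exception derived from it
        return "RomanError"
    except Exception:
        # Catches unrelated exceptions such as ValueError
        return "Exception"
```

<p>Calling <code>classify(OutOfRangeError)</code> returns <code>'RomanError'</code>, while <code>classify(ValueError)</code> returns <code>'Exception'</code>. Had each exception inherited from <code class="errorcode">Exception</code> directly, the first clause would catch none of them.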
<p>Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At
this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to <code class="filename">romantest.py</code> and re-evaluate why you coded a test so useless that it passes with do-nothing functions.
<p>Run <code class="filename">romantest1.py</code> with the <code class="option">-v</code> command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs.
With any luck, your output should look like this:
<div class="example"><h3 id="roman.stage1.output">Example 14.2. Output of <code class="filename">romantest1.py</code> against <code class="filename">roman1.py</code></h3><pre class="screen"><samp class="computeroutput">fromRoman should only accept uppercase input ... ERROR
toRoman should always return uppercase ... ERROR
fromRoman should fail with malformed antecedents ... FAIL
fromRoman should fail with repeated pairs of numerals ... FAIL
fromRoman should fail with too many repeated numerals ... FAIL
fromRoman should give known result with known input ... FAIL
toRoman should give known result with known input ... FAIL
fromRoman(toRoman(n))==n for all n ... FAIL
toRoman should fail with non-integer input ... FAIL
toRoman should fail with negative input ... FAIL
toRoman should fail with large input ... FAIL
toRoman should fail with 0 input ... FAIL
======================================================================
ERROR: fromRoman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 154, in testFromRomanCase
roman1.fromRoman(numeral.upper())
AttributeError: 'None' object has no attribute 'upper'</span><samp class="computeroutput">
======================================================================
ERROR: toRoman should always return uppercase
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 148, in testToRomanCase
self.assertEqual(numeral, numeral.upper())
AttributeError: 'None' object has no attribute 'upper'</span><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 133, in testMalformedAntecedent
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 127, in testRepeatedPairs
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp class="computeroutput">
======================================================================
FAIL: fromRoman should give known result with known input
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 99, in testFromRomanKnownValues
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp class="computeroutput">
======================================================================
FAIL: toRoman should give known result with known input
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 93, in testToRomanKnownValues
self.assertEqual(numeral, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: I != None</span><samp class="computeroutput">
======================================================================
FAIL: fromRoman(toRoman(n))==n for all n
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 141, in testSanity
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with non-integer input
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 116, in testNonInteger
self.assertRaises(roman1.NotIntegerError, roman1.toRoman, 0.5)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: NotIntegerError</span><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with negative input
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 112, in testNegative
self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, -1)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with large input
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 104, in testTooLarge
self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, 4000)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with 0 input </span><img id="roman.stage1.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
----------------------------------------------------------------------
</span><samp class="traceback">Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 108, in testZero
self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, 0)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError </span><img id="roman.stage1.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
----------------------------------------------------------------------
Ran 12 tests in 0.040s </span><img id="roman.stage1.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"><samp class="computeroutput">
FAILED (failures=10, errors=2) </span><img id="roman.stage1.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Running the script runs <code class="function">unittest.main()</code>, which runs each test case, which is to say each method defined in each class within <code class="filename">romantest.py</code>. For each test case, it prints out the <code>doc string</code> of the method and whether that test passed or failed. As expected, none of the test cases passed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">For each failed test case, <code class="filename">unittest</code> displays the trace information showing exactly what happened. In this case, the call to <code class="function">assertRaises</code> (also called <code class="function">failUnlessRaises</code>) raised an <code class="errorcode">AssertionError</code> because it was expecting <code class="function">toRoman</code> to raise an <code class="errorcode">OutOfRangeError</code> and it didn't.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">After the detail, <code class="filename">unittest</code> displays a summary of how many tests were performed and how long it took.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage1.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass, <code class="filename">unittest</code> distinguishes between failures and errors. A failure is a call to an <code class="function">assertXYZ</code> method, like <code class="function">assertEqual</code> or <code class="function">assertRaises</code>, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort
of exception raised in the code you're testing or the unit test case itself. For instance, the <code class="function">testFromRomanCase</code> method (&#8220;<code class="function">fromRoman</code> should only accept uppercase input&#8221;) was an error: the call to <code class="function">numeral.upper()</code> raised an <code class="errorcode">AttributeError</code> exception, because <code class="function">toRoman</code> was supposed to return a string but didn't. But <code class="function">testZero</code> (&#8220;<code class="function">toRoman</code> should fail with 0 input&#8221;) was a failure, because the call to <code class="function">toRoman</code> did not raise the <code class="errorcode">OutOfRangeError</code> exception that <code class="function">assertRaises</code> was looking for.
</td>
</tr>
</table>
<h2 id="roman.stage2">14.2. <code class="filename">roman.py</code>, stage 2</h2>
<p>Now that you have the framework of the <code class="filename">roman</code> module laid out, it's time to start writing code and passing test cases.
<div class="example"><h3 id="roman.stage2.example">Example 14.3. <code class="filename">roman2.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage2/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000), <img id="roman.stage2.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))
def toRoman(n):
    """convert integer to Roman numeral"""
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer: <img id="roman.stage2.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
            result += numeral
            n -= integer
    return result
def fromRoman(s):
    """convert Roman numeral to integer"""
    pass
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage2.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">romanNumeralMap</code> is a tuple of tuples which defines three things:
<div class="orderedlist">
<ol>
<li>The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals;
you're also defining two-character pairs like <code>CM</code> (&#8220;one hundred less than one thousand&#8221;); this will make the <code class="function">toRoman</code> code simpler later.
<li>The order of the Roman numerals. They are listed in descending value order, from <code>M</code> all the way down to <code>I</code>.
<li>The value of each Roman numeral. Each inner tuple is a pair of <code>(<i class="replaceable">numeral</i>, <i class="replaceable">value</i>)</code>.
</ol>
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage2.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here's where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule.
To convert to Roman numerals, you simply iterate through <code class="varname">romanNumeralMap</code> looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation
to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
</td>
</tr>
</table>
<div class="example"><h3>Example 14.4. How <code class="function">toRoman</code> works</h3>
<p>If you're not clear how <code class="function">toRoman</code> works, add a <code class="function">print</code> statement to the end of the <code>while</code> loop:<pre class="programlisting">
        while n >= integer:
            result += numeral
            n -= integer
            print 'subtracting', integer, 'from input, adding', numeral, 'to output'</pre><pre class="screen">
<samp class="prompt">>>> </samp>import roman2
<samp class="prompt">>>> </samp>roman2.toRoman(1424)
<samp class="computeroutput">subtracting 1000 from input, adding M to output
subtracting 400 from input, adding CD to output
subtracting 10 from input, adding X to output
subtracting 10 from input, adding X to output
subtracting 4 from input, adding IV to output
'MCDXXIV'</span>
</pre><p>So <code class="function">toRoman</code> appears to work, at least in this manual spot check. But will it pass the unit tests? Well no, not entirely.
<div class="example"><h3>Example 14.5. Output of <code class="filename">romantest2.py</code> against <code class="filename">roman2.py</code></h3>
<p>Remember to run <code class="filename">romantest2.py</code> with the <code>-v</code> command-line flag to enable verbose mode.<pre class="screen"><samp class="computeroutput">fromRoman should only accept uppercase input ... FAIL
toRoman should always return uppercase ... ok</span><img id="roman.stage2.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
fromRoman should fail with malformed antecedents ... FAIL
fromRoman should fail with repeated pairs of numerals ... FAIL
fromRoman should fail with too many repeated numerals ... FAIL
fromRoman should give known result with known input ... FAIL
toRoman should give known result with known input ... ok </span><img id="roman.stage2.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
fromRoman(toRoman(n))==n for all n ... FAIL
toRoman should fail with non-integer input ... FAIL </span><img id="roman.stage2.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should fail with negative input ... FAIL
toRoman should fail with large input ... FAIL
toRoman should fail with 0 input ... FAIL</span></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage2.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">toRoman</code> does, in fact, always return uppercase, because <code class="varname">romanNumeralMap</code> defines the Roman numeral representations as uppercase. So this test passes already.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage2.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here's the big news: this version of the <code class="function">toRoman</code> function passes the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values test</a>. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including
inputs that produce every single-character Roman numeral, the largest possible input (<code>3999</code>), and the input that produces the longest possible Roman numeral (<code>3888</code>). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage2.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">However, the function does not &#8220;work&#8221; for bad values; it fails every single <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to toRoman">bad input test</a>. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to
be raised (via <code class="function">assertRaises</code>), and you're never raising them. You'll do that in the next stage.
</td>
</tr>
</table>
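<p>You can check both halves of that result by hand. The sketch below, assuming Python 3, repeats the stage 2 <code class="function">toRoman</code> unchanged and then probes it; the variable names <code>largest</code>, <code>longest</code>, and <code>badResults</code> are hypothetical and not part of <code class="filename">roman2.py</code>.

```python
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def toRoman(n):
    """convert integer to Roman numeral (stage 2: no input checking)"""
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

# Good input: the largest valid value, and the value whose numeral is longest
largest = toRoman(3999)
longest = max(range(1, 4000), key=lambda n: len(toRoman(n)))

# Bad input: nothing is raised; the function quietly returns a string anyway
badResults = {n: toRoman(n) for n in (0, -1, 0.5, 4000)}
```

<p>Here <code>largest</code> is <code>'MMMCMXCIX'</code> and <code>longest</code> is <code>3888</code>. For the bad inputs, <code>0</code>, <code>-1</code>, and <code>0.5</code> all produce an empty string and <code>4000</code> produces <code>'MMMM'</code>, so the bad input tests have no exception to find.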
<p>Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.<pre class="screen"><samp class="computeroutput">
======================================================================
FAIL: fromRoman should only accept uppercase input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 156, in testFromRomanCase
    roman2.fromRoman, numeral.lower())
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with malformed antecedents
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 133, in testMalformedAntecedent
    self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 127, in testRepeatedPairs
    self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with too many repeated numerals
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 122, in testTooManyRepeatedNumerals
    self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should give known result with known input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 99, in testFromRomanKnownValues
    self.assertEqual(integer, result)
  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
    raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman(toRoman(n))==n for all n
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 141, in testSanity
    self.assertEqual(integer, result)
  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
    raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</samp><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with non-integer input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 116, in testNonInteger
    self.assertRaises(roman2.NotIntegerError, roman2.toRoman, 0.5)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: NotIntegerError</samp><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with negative input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 112, in testNegative
    self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, -1)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: OutOfRangeError</samp><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with large input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 104, in testTooLarge
    self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, 4000)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: OutOfRangeError</samp><samp class="computeroutput">
======================================================================
FAIL: toRoman should fail with 0 input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 108, in testZero
    self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, 0)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: OutOfRangeError</samp><samp class="computeroutput">
----------------------------------------------------------------------
Ran 12 tests in 0.320s
FAILED (failures=10)</samp></pre><h2 id="roman.stage3">14.3. <code class="filename">roman.py</code>, stage 3</h2>
<p>Now that <code class="function">toRoman</code> behaves correctly with good input (integers from <code>1</code> to <code>3999</code>), it's time to make it behave correctly with bad input (everything else).
<div class="example"><h3>Example 14.6. <code class="filename">roman3.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage3/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Convert to and from Roman numerals"""

#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

#Define digit mapping
romanNumeralMap = (('M', 1000),
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 &lt; n &lt; 4000): <img id="roman.stage3.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        raise OutOfRangeError, "number out of range (must be 1..3999)" <img id="roman.stage3.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    if int(n) &lt;> n: <img id="roman.stage3.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        raise NotIntegerError, "non-integers can not be converted"
    result = "" <img id="roman.stage3.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def fromRoman(s):
    """convert Roman numeral to integer"""
    pass
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to <code>if not ((0 &lt; n) and (n &lt; 4000))</code>, but it's much easier to read. This is the range check, and it should catch inputs that are too large, negative, or zero.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You raise exceptions yourself with the <code>raise</code> statement. You can raise any of the built-in exceptions or any custom exception you've defined. The second parameter, the error message, is optional; if given, it is displayed in the traceback that is printed if the exception is never handled.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the non-integer check. Non-integers cannot be converted to Roman numerals.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The rest of the function is unchanged.</td>
</tr>
</table>
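<p>The two checks can be tried in isolation. The following is a minimal sketch in Python 3 syntax, where <code>raise</code> takes an exception instance and the inequality operator is spelled <code>!=</code> instead of <code>&lt;></code>. The helpers <code>check_roman_input</code> and <code>outcome</code> are hypothetical names introduced only for this illustration.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass


def check_roman_input(n):
    """Apply the range check and the non-integer check (hypothetical helper)."""
    # Chained comparison: equivalent to not ((0 < n) and (n < 4000)).
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    # Python 3 spelling of the book's "int(n) <> n" test.
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    return True


def outcome(n):
    """Return 'ok', or the name of the exception raised for n."""
    try:
        check_roman_input(n)
        return "ok"
    except RomanError as e:
        return type(e).__name__


outcomes = {n: outcome(n) for n in (1, 3999, 0, -1, 4000, 1.5)}
print(outcomes)
```

<p>Because the range check runs first, zero, negative numbers, and <code>4000</code> are reported as out of range, and only an in-range value such as <code>1.5</code> reaches the non-integer check.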
<div class="example"><h3>Example 14.7. Watching <code class="function">toRoman</code> handle bad input</h3><pre class="screen">
<samp class="prompt">>>> </samp>import roman3
<samp class="prompt">>>> </samp>roman3.toRoman(4000)
<samp class="traceback">Traceback (most recent call last):
  File "&lt;interactive input>", line 1, in ?
  File "roman3.py", line 27, in toRoman
    raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)</samp>
<samp class="prompt">>>> </samp>roman3.toRoman(1.5)
<samp class="traceback">Traceback (most recent call last):
  File "&lt;interactive input>", line 1, in ?
  File "roman3.py", line 29, in toRoman
    raise NotIntegerError, "non-integers can not be converted"
NotIntegerError: non-integers can not be converted</samp>
</pre><div class="example"><h3>Example 14.8. Output of <code class="filename">romantest3.py</code> against <code class="filename">roman3.py</code></h3><pre class="screen"><samp class="computeroutput">fromRoman should only accept uppercase input ... FAIL
toRoman should always return uppercase ... ok
fromRoman should fail with malformed antecedents ... FAIL
fromRoman should fail with repeated pairs of numerals ... FAIL
fromRoman should fail with too many repeated numerals ... FAIL
fromRoman should give known result with known input ... FAIL
toRoman should give known result with known input ... ok </samp><img id="roman.stage3.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
fromRoman(toRoman(n))==n for all n ... FAIL
toRoman should fail with non-integer input ... ok </samp><img id="roman.stage3.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should fail with negative input ... ok </samp><img id="roman.stage3.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">toRoman</code> still passes the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values test</a>, which is comforting. All the tests that passed in <a href="#roman.stage2" title="14.2. roman.py, stage 2">stage 2</a> still pass, so the latest code hasn't broken anything.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">More exciting is the fact that all of the <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to toRoman">bad input tests</a> for <code class="function">toRoman</code> now pass. This test, <code class="function">testNonInteger</code>, passes because of the <code>int(n) &lt;> n</code> check: when a non-integer is passed to <code class="function">toRoman</code>, the check notices it and raises the <code class="errorcode">NotIntegerError</code> exception, which is what <code class="function">testNonInteger</code> is looking for.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This test, <code class="function">testNegative</code>, passes because of the <code>not (0 &lt; n &lt; 4000)</code> check, which raises an <code class="errorcode">OutOfRangeError</code> exception, which is what <code class="function">testNegative</code> is looking for.
</td>
</tr>
</table>
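<p>The connection between the checks and the passing tests can be sketched as a small standalone test case. This is written in Python 3 and is only an approximation of the chapter's test suite: <code>to_roman</code> here is a validation-only stand-in, and the class name <code>ToRomanBadInput</code> and the helper <code>run_bad_input_tests</code> are assumptions made for this example.

```python
import unittest


class OutOfRangeError(Exception): pass
class NotIntegerError(Exception): pass


def to_roman(n):
    """Validation-only stand-in for toRoman; the conversion itself is omitted."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    return ""  # placeholder result, not a real Roman numeral


class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        self.assertRaises(OutOfRangeError, to_roman, 4000)

    def test_zero(self):
        self.assertRaises(OutOfRangeError, to_roman, 0)

    def test_negative(self):
        self.assertRaises(OutOfRangeError, to_roman, -1)

    def test_non_integer(self):
        self.assertRaises(NotIntegerError, to_roman, 0.5)


def run_bad_input_tests():
    """Run the four bad input tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_bad_input_tests()
print(result.testsRun, "tests run, success:", result.wasSuccessful())
```

<p>Each <code class="function">assertRaises</code> call passes only when the named exception is actually raised, so the tests succeed as soon as the two checks are in place.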
</div><pre class="screen"><samp class="computeroutput">
======================================================================
FAIL: fromRoman should only accept uppercase input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 156, in testFromRomanCase
    roman3.fromRoman, numeral.lower())
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with malformed antecedents
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 133, in testMalformedAntecedent
    self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 127, in testRepeatedPairs
    self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with too many repeated numerals
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 122, in testTooManyRepeatedNumerals
    self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should give known result with known input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 99, in testFromRomanKnownValues
    self.assertEqual(integer, result)
  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
    raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman(toRoman(n))==n for all n
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 141, in testSanity
    self.assertEqual(integer, result)
  File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
    raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</samp><samp class="computeroutput">
----------------------------------------------------------------------
Ran 12 tests in 0.401s
FAILED (failures=6)</samp> <img id="roman.stage3.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage3.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You're down to 6 failures, and all of them involve <code class="function">fromRoman</code>: the known values test, the three separate bad input tests, the case check, and the sanity check. That means that <code class="function">toRoman</code> has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that <code class="function">fromRoman</code> be written, which it isn't yet.) Which means that you must stop coding <code class="function">toRoman</code> now. No tweaking, no twiddling, no extra checks &#8220;just in case&#8221;. Stop. Now. Back away from the keyboard.
</td>
</tr>
</table>
</div><table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for
a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module.
</td>
</tr>
</table>
<h2 id="roman.stage4">14.4. <code class="filename">roman.py</code>, stage 4</h2>
<p>Now that <code class="function">toRoman</code> is done, it's time to start coding <code class="function">fromRoman</code>. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than
the <code class="function">toRoman</code> function.
<div class="example"><h3>Example 14.9. <code class="filename">roman4.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage4/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Convert to and from Roman numerals"""

#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

#Define digit mapping
romanNumeralMap = (('M', 1000),
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))

# toRoman function omitted for clarity (it hasn't changed)

def fromRoman(s):
    """convert Roman numeral to integer"""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral: <img id="roman.stage4.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
            result += integer
            index += len(numeral)
    return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage4.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The pattern here is the same as <a href="#roman.stage2.example" title="Example 14.3. roman2.py"><code class="function">toRoman</code></a>. You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer
values as often as possible, you match the &#8220;highest&#8221; Roman numeral character strings as often as possible.
</td>
</tr>
</table>
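<p>The matching loop can also be followed by recording each match instead of printing it. This is a minimal sketch in Python 3; <code>from_roman_steps</code> is a hypothetical variant of <code class="function">fromRoman</code> that returns the list of matches alongside the total.

```python
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def from_roman_steps(s):
    """Return (total, steps), where steps lists each (numeral, value) matched."""
    result = 0
    index = 0
    steps = []
    for numeral, integer in ROMAN_NUMERAL_MAP:
        # Consume this numeral from the current position as often as it appears.
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
            steps.append((numeral, integer))
    return result, steps


total, steps = from_roman_steps('MCMLXXII')
print(total, steps)

# No validation yet: malformed or lowercase input is accepted without complaint.
print(from_roman_steps('IIII')[0], from_roman_steps('mcmlxxii')[0])
```

<p>The last line shows why the remaining tests still fail at this stage: the loop adds up whatever it can match and never rejects anything.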
<div class="example"><h3>Example 14.10. How <code class="function">fromRoman</code> works</h3>
<p>If you're not clear how <code class="function">fromRoman</code> works, add a <code class="function">print</code> statement to the end of the <code>while</code> loop:<pre class="programlisting">
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
            print 'found', numeral, 'of length', len(numeral), ', adding', integer</pre><pre class="screen">
<samp class="prompt">>>> </samp>import roman4
<samp class="prompt">>>> </samp>roman4.fromRoman('MCMLXXII')
<samp class="computeroutput">found M of length 1 , adding 1000
found CM of length 2 , adding 900
found L of length 1 , adding 50
found X of length 1 , adding 10
found X of length 1 , adding 10
found I of length 1 , adding 1
found I of length 1 , adding 1
1972</samp></pre><div class="example"><h3>Example 14.11. Output of <code class="filename">romantest4.py</code> against <code class="filename">roman4.py</code></h3><pre class="screen"><samp class="computeroutput">fromRoman should only accept uppercase input ... FAIL
toRoman should always return uppercase ... ok
fromRoman should fail with malformed antecedents ... FAIL
fromRoman should fail with repeated pairs of numerals ... FAIL
fromRoman should fail with too many repeated numerals ... FAIL
fromRoman should give known result with known input ... ok </samp><img id="roman.stage4.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... ok</samp><img id="roman.stage4.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage4.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Two pieces of exciting news here. The first is that <code class="function">fromRoman</code> works for good input, at least for all the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values</a> you test.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage4.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The second is that the <a href="#roman.sanity.example" title="Example 13.5. Testing toRoman against fromRoman">sanity check</a> also passed. Combined with the known values tests, you can be reasonably sure that both <code class="function">toRoman</code> and <code class="function">fromRoman</code> work properly for all possible good values. (This is not guaranteed; it is theoretically possible that <code class="function">toRoman</code> has a bug that produces the wrong Roman numeral for some particular set of inputs, <em>and</em> that <code class="function">fromRoman</code> has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that <code class="function">toRoman</code> generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write
more comprehensive test cases until it doesn't bother you.)
</td>
</tr>
</table>
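<p>The sanity check, and the caveat about reciprocal bugs, can be sketched as follows. This is Python 3, with hypothetical names (<code>to_roman</code>, <code>from_roman</code>, <code>round_trip_failures</code>) and no input validation, since only good values are exercised.

```python
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def to_roman(n):
    """Convert an integer in 1..3999 to a Roman numeral (no validation)."""
    result = ""
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result


def from_roman(s):
    """Convert a well-formed Roman numeral to an integer (no validation)."""
    result = 0
    index = 0
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result


def round_trip_failures(to_func, from_func):
    """Return every n in 1..3999 for which from_func(to_func(n)) != n."""
    return [n for n in range(1, 4000) if from_func(to_func(n)) != n]


real_failures = round_trip_failures(to_roman, from_roman)

# A deliberately wrong pair: str and int are not Roman numeral converters,
# yet they are mutually consistent, so the round trip alone cannot tell.
fake_failures = round_trip_failures(str, int)

print(len(real_failures), len(fake_failures))
```

<p>Both lists come back empty, which is why the sanity check is only convincing in combination with the known values tests.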
</div><pre class="screen"><samp class="computeroutput">
======================================================================
FAIL: fromRoman should only accept uppercase input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 156, in testFromRomanCase
    roman4.fromRoman, numeral.lower())
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with malformed antecedents
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 133, in testMalformedAntecedent
    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 127, in testRepeatedPairs
    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
======================================================================
FAIL: fromRoman should fail with too many repeated numerals
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 122, in testTooManyRepeatedNumerals
    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
----------------------------------------------------------------------
Ran 12 tests in 1.222s
FAILED (failures=4)</samp></pre><h2 id="roman.stage5">14.5. <code class="filename">roman.py</code>, stage 5</h2>
<p>Now that <code class="function">fromRoman</code> works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input.
That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult
than <a href="#roman.stage3" title="14.3. roman.py, stage 3">validating numeric input</a> in <code class="function">toRoman</code>, but you have a powerful tool at your disposal: regular expressions.
<p>If you're not familiar with regular expressions and didn't read <a href="#re" title="Chapter 7. Regular Expressions">Chapter 7, <i>Regular Expressions</i></a>, now would be a good time.
<p>As you saw in <a href="#re.roman" title="7.3. Case Study: Roman Numerals">Section 7.3, &#8220;Case Study: Roman Numerals&#8221;</a>, there are several simple rules for constructing a Roman numeral, using the letters <code>M</code>, <code>D</code>, <code>C</code>, <code>L</code>, <code>X</code>, <code>V</code>, and <code>I</code>. Let's review the rules:
<div class="orderedlist">
<ol>
<li>Characters are additive. <code>I</code> is <code class="constant">1</code>, <code>II</code> is <code>2</code>, and <code>III</code> is <code>3</code>. <code>VI</code> is <code>6</code> (literally, &#8220;<code>5</code> and <code>1</code>&#8221;), <code>VII</code> is <code>7</code>, and <code>VIII</code> is <code>8</code>.
<li>The tens characters (<code>I</code>, <code>X</code>, <code>C</code>, and <code>M</code>) can be repeated up to three times. At <code>4</code>, you need to subtract from the next highest fives character. You can't represent <code>4</code> as <code>IIII</code>; instead, it is represented as <code>IV</code> (&#8220;<code>1</code> less than <code>5</code>&#8221;). <code>40</code> is written as <code>XL</code> (&#8220;<code>10</code> less than <code>50</code>&#8221;), <code>41</code> as <code>XLI</code>, <code>42</code> as <code>XLII</code>, <code>43</code> as <code>XLIII</code>, and then <code>44</code> as <code>XLIV</code> (&#8220;<code>10</code> less than <code>50</code>, then <code>1</code> less than <code>5</code>&#8221;).
<li>Similarly, at <code>9</code>, you need to subtract from the next highest tens character: <code>8</code> is <code>VIII</code>, but <code>9</code> is <code>IX</code> (&#8220;<code>1</code> less than <code>10</code>&#8221;), not <code>VIIII</code> (since the <code>I</code> character can not be repeated four times). <code>90</code> is <code>XC</code>, <code>900</code> is <code>CM</code>.
<li>The fives characters can not be repeated. <code>10</code> is always represented as <code>X</code>, never as <code>VV</code>. <code>100</code> is always <code>C</code>, never <code>LL</code>.
<li>Roman numerals are always written highest to lowest, and read left to right, so order of characters matters very much. <code>DC</code> is <code>600</code>; <code>CD</code> is a completely different number (<code>400</code>, &#8220;<code>100</code> less than <code>500</code>&#8221;). <code>CI</code> is <code>101</code>; <code>IC</code> is not even a valid Roman numeral (because you can't subtract <code>1</code> directly from <code>100</code>; you would need to write it as <code>XCIX</code>, &#8220;<code>10</code> less than <code>100</code>, then <code>1</code> less than <code>10</code>&#8221;).
</ol>
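<p>The rules above can be checked against the greedy conversion used by <code class="function">toRoman</code>. The following is a minimal sketch in Python 3; <code>to_roman</code> and the <code>examples</code> table are hypothetical names, and the table simply collects values quoted in the rules.

```python
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def to_roman(n):
    """Convert an integer in 1..3999 to a Roman numeral (no validation)."""
    result = ""
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result


# Values taken from the rules: additive forms, subtractive forms at 4 and 9,
# and the ordering examples (DC versus CD, XCIX rather than IC).
examples = {3: 'III', 4: 'IV', 6: 'VI', 8: 'VIII', 9: 'IX',
            40: 'XL', 44: 'XLIV', 90: 'XC', 99: 'XCIX',
            400: 'CD', 600: 'DC', 900: 'CM'}

mismatches = {n: to_roman(n) for n, expected in examples.items()
              if to_roman(n) != expected}
print(mismatches)
```

<p>An empty result means the mapping table, which lists the subtractive pairs explicitly, reproduces every example in the rules.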
<div class="example"><h3>Example 14.12. <code class="filename">roman5.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage5/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Convert to and from Roman numerals"""
import re

#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

#Define digit mapping
romanNumeralMap = (('M', 1000),
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 &lt; n &lt; 4000):
        raise OutOfRangeError, "number out of range (must be 1..3999)"
    if int(n) &lt;> n:
        raise NotIntegerError, "non-integers can not be converted"
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

#Define pattern to detect valid Roman numerals
romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' <img id="roman.stage5.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">

def fromRoman(s):
    """convert Roman numeral to integer"""
    if not re.search(romanNumeralPattern, s):<img id="roman.stage5.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage5.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is just a continuation of the pattern discussed in <a href="#re.roman" title="7.3. Case Study: Roman Numerals">Section 7.3, &#8220;Case Study: Roman Numerals&#8221;</a>. The tens place is either <code>XC</code> (<code>90</code>), <code>XL</code> (<code>40</code>), or an optional <code>L</code> followed by 0 to 3 optional <code>X</code> characters. The ones place is either <code>IX</code> (<code>9</code>), <code>IV</code> (<code>4</code>), or an optional <code>V</code> followed by 0 to 3 optional <code>I</code> characters.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage5.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If
<code class="function">re.search</code> returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid.
</td>
</tr>
</table>
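<p>The validation step described above can be tried in isolation. The following is a minimal sketch in Python 3 syntax (the listings in this chapter use older Python 2 syntax); the helper name <code>looks_valid</code> is hypothetical and is not part of <code class="filename">roman5.py</code>.

```python
import re

# Same pattern as in the listing above (up to three optional M characters).
roman_numeral_pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

def looks_valid(s):
    """Return True if s matches the pattern (hypothetical helper)."""
    # re.search returns a match object on success and None otherwise.
    return re.search(roman_numeral_pattern, s) is not None

# One well-formed numeral, then three inputs the pattern rejects:
# a malformed antecedent, too many repeats, and lowercase input.
for candidate in ('MCMXCIX', 'MCMC', 'IIII', 'mcm'):
    print(candidate, looks_valid(candidate))
```

<p>Only the first candidate matches; the other three are rejected by the pattern alone, without any extra validation code.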
<p>At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of
invalid Roman numerals. But don't take my word for it; look at the results:
<div class="example"><h3>Example 14.13. Output of <code class="filename">romantest5.py</code> against <code class="filename">roman5.py</code></h3><pre class="screen"><samp class="computeroutput">
fromRoman should only accept uppercase input ... ok </samp><img id="roman.stage5.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should always return uppercase ... ok
fromRoman should fail with malformed antecedents ... ok </samp><img id="roman.stage5.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
fromRoman should fail with repeated pairs of numerals ... ok </samp><img id="roman.stage5.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"><samp class="computeroutput">
fromRoman should fail with too many repeated numerals ... ok
fromRoman should give known result with known input ... ok
toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... ok
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 12 tests in 2.864s
OK </samp><img id="roman.stage5.4.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage5.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression
<code class="varname">romanNumeralPattern</code> was expressed in uppercase characters, the <code class="function">re.search</code> check will reject any input that isn't completely uppercase. So the uppercase input test passes.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage5.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like <code>MCMC</code>. As you've seen, this does not match the regular expression, so <code class="function">fromRoman</code> raises an <code class="errorcode">InvalidRomanNumeralError</code> exception, which is what the malformed antecedents test case is looking for, so the test passes.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage5.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test
cases.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.stage5.4.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And the anticlimax award of the year goes to the word &#8220;<code>OK</code>&#8221;, which is printed by the <code class="filename">unittest</code> module when all the tests pass.
</td>
</tr>
</table>
</div><table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">When all of your tests pass, stop coding.</td>
</tr>
</table>
<div class="chapter">
<h2 id="roman2">Chapter 15. Refactoring</h2>
<h2 id="roman.bugs">15.1. Handling bugs</h2>
<p>Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by &#8220;bug&#8221;? A bug is a test case you haven't written yet.
<div class="example"><h3>Example 15.1. The bug</h3><pre class="screen"><samp class="prompt">>>> </samp>import roman5
<samp class="prompt">>>> </samp>roman5.fromRoman("") <img id="roman.bugs.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
0</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.bugs.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember in the <a href="#roman.stage5" title="14.5. roman.py, stage 5">previous section</a> when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals?
Well, it turns out that this is still true for the final version of the regular expression. And that's a bug; you want an
empty string to raise an <code class="errorcode">InvalidRomanNumeralError</code> exception just like any other sequence of characters that doesn't represent a valid Roman numeral.
</td>
</tr>
</table>
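<p>Why does the empty string slip through? Every piece of the pattern is optional, so a string with no characters at all satisfies it. The sketch below, written in Python 3 syntax as an illustration rather than as part of the book's example files, shows this directly.

```python
import re

# The stage 5 pattern, copied from roman5.py.
roman_numeral_pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

match = re.search(roman_numeral_pattern, '')

# A match object comes back, so the validation check lets '' through.
print(match is not None)

# Each of the three groups matched an empty string.
print(match.groups())
```

<p>Because the match succeeds, <code class="function">fromRoman</code> skips the <code>raise</code>, finds no numerals to add up, and returns <code>0</code>.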
<p>After reproducing the bug, and before fixing it, you should write a test case that fails, thus illustrating the bug.
<div class="example"><h3>Example 15.2. Testing for the bug (<code class="filename">romantest61.py</code>)</h3><pre class="programlisting">
class FromRomanBadInput(unittest.TestCase):
    # previous test cases omitted for clarity (they haven't changed)
    def testBlank(self):
        """fromRoman should fail with blank string"""
        self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, "") <img id="roman.bugs.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.bugs.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Pretty simple stuff here. Call <code class="function">fromRoman</code> with an empty string and make sure it raises an <code class="errorcode">InvalidRomanNumeralError</code> exception. The hard part was finding the bug; now that you know about it, testing for it is the easy part.
</td>
</tr>
</table>
<p>Since your code has a bug, and you now have a test case that tests this bug, the test case will fail:
<div class="example"><h3>Example 15.3. Output of <code class="filename">romantest61.py</code> against <code class="filename">roman61.py</code></h3><pre class="screen"><samp class="computeroutput">fromRoman should only accept uppercase input ... ok
toRoman should always return uppercase ... ok
fromRoman should fail with blank string ... FAIL
fromRoman should fail with malformed antecedents ... ok
fromRoman should fail with repeated pairs of numerals ... ok
fromRoman should fail with too many repeated numerals ... ok
fromRoman should give known result with known input ... ok
toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... ok
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok
======================================================================
FAIL: fromRoman should fail with blank string
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage6\romantest61.py", line 137, in testBlank
    self.assertRaises(roman61.InvalidRomanNumeralError, roman61.fromRoman, "")
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp class="computeroutput">
----------------------------------------------------------------------
Ran 13 tests in 2.864s
FAILED (failures=1)</samp></pre><p><em>Now</em> you can fix the bug.
<div class="example"><h3>Example 15.4. Fixing the bug (<code class="filename">roman62.py</code>)</h3>
<p>This file is available in <code class="filename">py/roman/stage6/</code> in the examples directory.<pre class="programlisting">
def fromRoman(s):
    """convert Roman numeral to integer"""
    if not s: <img id="roman.bugs.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        raise InvalidRomanNumeralError, 'Input can not be blank'
    if not re.search(romanNumeralPattern, s):
        raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.bugs.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Only two lines of code are required: an explicit check for an empty string, and a <code>raise</code> statement.
</td>
</tr>
</table>
<div class="example"><h3>Example 15.5. Output of <code class="filename">romantest62.py</code> against <code class="filename">roman62.py</code></h3><pre class="screen"><samp class="computeroutput">fromRoman should only accept uppercase input ... ok
toRoman should always return uppercase ... ok
fromRoman should fail with blank string ... ok </samp><img id="roman.bugs.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
fromRoman should fail with malformed antecedents ... ok
fromRoman should fail with repeated pairs of numerals ... ok
fromRoman should fail with too many repeated numerals ... ok
fromRoman should give known result with known input ... ok
toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... ok
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 13 tests in 2.834s
OK</samp> <img id="roman.bugs.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.bugs.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The blank string test case now passes, so the bug is fixed.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.bugs.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">All the other test cases still pass, which means that this bug fix didn't break anything else. Stop coding.</td>
</tr>
</table>
<p>Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs
will require complex test cases. In a testing-centric environment, it may <em>seem</em> like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case),
then fix the bug itself. Then if the test case doesn't pass right away, you need to figure out whether the fix was wrong,
or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code
tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can
easily re-run <em>all</em> the test cases along with your new one, you are much less likely to break old code when fixing new code. Today's unit test
is tomorrow's regression test.
<h2 id="roman.change">15.2. Handling changing requirements</h2>
<p>Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible
nasty things involving scissors and hot wax, requirements will change. Most customers don't know what they want until they
see it, and even if they do, they aren't that good at articulating what they want precisely enough to be useful. And even
if they do, they'll want more in the next release anyway. So be prepared to update your test cases as requirements change.
<p>Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember <a href="#roman.divein" title="13.2. Diving in">the rule</a> that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception
to that rule by having 4 <code>M</code> characters in a row to represent <code>4000</code>. If you make this change, you'll be able to expand the range of convertible numbers from <code>1..3999</code> to <code>1..4999</code>. But first, you need to make some changes to the test cases.
<div class="example"><h3>Example 15.6. Modifying test cases for new requirements (<code class="filename">romantest71.py</code>)</h3>
<p>This file is available in <code class="filename">py/roman/stage7/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
import roman71
import unittest
class KnownValues(unittest.TestCase):
    knownValues = ( (1, 'I'),
                    (2, 'II'),
                    (3, 'III'),
                    (4, 'IV'),
                    (5, 'V'),
                    (6, 'VI'),
                    (7, 'VII'),
                    (8, 'VIII'),
                    (9, 'IX'),
                    (10, 'X'),
                    (50, 'L'),
                    (100, 'C'),
                    (500, 'D'),
                    (1000, 'M'),
                    (31, 'XXXI'),
                    (148, 'CXLVIII'),
                    (294, 'CCXCIV'),
                    (312, 'CCCXII'),
                    (421, 'CDXXI'),
                    (528, 'DXXVIII'),
                    (621, 'DCXXI'),
                    (782, 'DCCLXXXII'),
                    (870, 'DCCCLXX'),
                    (941, 'CMXLI'),
                    (1043, 'MXLIII'),
                    (1110, 'MCX'),
                    (1226, 'MCCXXVI'),
                    (1301, 'MCCCI'),
                    (1485, 'MCDLXXXV'),
                    (1509, 'MDIX'),
                    (1607, 'MDCVII'),
                    (1754, 'MDCCLIV'),
                    (1832, 'MDCCCXXXII'),
                    (1993, 'MCMXCIII'),
                    (2074, 'MMLXXIV'),
                    (2152, 'MMCLII'),
                    (2212, 'MMCCXII'),
                    (2343, 'MMCCCXLIII'),
                    (2499, 'MMCDXCIX'),
                    (2574, 'MMDLXXIV'),
                    (2646, 'MMDCXLVI'),
                    (2723, 'MMDCCXXIII'),
                    (2892, 'MMDCCCXCII'),
                    (2975, 'MMCMLXXV'),
                    (3051, 'MMMLI'),
                    (3185, 'MMMCLXXXV'),
                    (3250, 'MMMCCL'),
                    (3313, 'MMMCCCXIII'),
                    (3408, 'MMMCDVIII'),
                    (3501, 'MMMDI'),
                    (3610, 'MMMDCX'),
                    (3743, 'MMMDCCXLIII'),
                    (3844, 'MMMDCCCXLIV'),
                    (3888, 'MMMDCCCLXXXVIII'),
                    (3940, 'MMMCMXL'),
                    (3999, 'MMMCMXCIX'),
                    (4000, 'MMMM'), <img id="roman.change.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
                    (4500, 'MMMMD'),
                    (4888, 'MMMMDCCCLXXXVIII'),
                    (4999, 'MMMMCMXCIX'))

    def testToRomanKnownValues(self):
        """toRoman should give known result with known input"""
        for integer, numeral in self.knownValues:
            result = roman71.toRoman(integer)
            self.assertEqual(numeral, result)

    def testFromRomanKnownValues(self):
        """fromRoman should give known result with known input"""
        for integer, numeral in self.knownValues:
            result = roman71.fromRoman(numeral)
            self.assertEqual(integer, result)

class ToRomanBadInput(unittest.TestCase):
    def testTooLarge(self):
        """toRoman should fail with large input"""
        self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, 5000) <img id="roman.change.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">

    def testZero(self):
        """toRoman should fail with 0 input"""
        self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, 0)

    def testNegative(self):
        """toRoman should fail with negative input"""
        self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, -1)

    def testNonInteger(self):
        """toRoman should fail with non-integer input"""
        self.assertRaises(roman71.NotIntegerError, roman71.toRoman, 0.5)

class FromRomanBadInput(unittest.TestCase):
    def testTooManyRepeatedNumerals(self):
        """fromRoman should fail with too many repeated numerals"""
        for s in ('MMMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'): <img id="roman.change.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
            self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s)

    def testRepeatedPairs(self):
        """fromRoman should fail with repeated pairs of numerals"""
        for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'):
            self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s)

    def testMalformedAntecedent(self):
        """fromRoman should fail with malformed antecedents"""
        for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV',
                  'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'):
            self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s)

    def testBlank(self):
        """fromRoman should fail with blank string"""
        self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, "")

class SanityCheck(unittest.TestCase):
    def testSanity(self):
        """fromRoman(toRoman(n))==n for all n"""
        for integer in range(1, 5000):<img id="roman.change.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
            numeral = roman71.toRoman(integer)
            result = roman71.fromRoman(numeral)
            self.assertEqual(integer, result)

class CaseCheck(unittest.TestCase):
    def testToRomanCase(self):
        """toRoman should always return uppercase"""
        for integer in range(1, 5000):
            numeral = roman71.toRoman(integer)
            self.assertEqual(numeral, numeral.upper())

    def testFromRomanCase(self):
        """fromRoman should only accept uppercase input"""
        for integer in range(1, 5000):
            numeral = roman71.toRoman(integer)
            roman71.fromRoman(numeral.upper())
            self.assertRaises(roman71.InvalidRomanNumeralError,
                              roman71.fromRoman, numeral.lower())

if __name__ == "__main__":
    unittest.main()
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The existing known values don't change (they're all still reasonable values to test), but you need to add a few more in the
<code>4000</code> range. Here I've included <code>4000</code> (the shortest), <code>4500</code> (the second shortest), <code>4888</code> (the longest), and <code>4999</code> (the largest).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The definition of &#8220;large input&#8221; has changed. This test used to call <code class="function">toRoman</code> with <code>4000</code> and expect an error; now that <code>4000-4999</code> are good values, you need to bump this up to <code>5000</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The definition of &#8220;too many repeated numerals&#8221; has also changed. This test used to call <code class="function">fromRoman</code> with <code>'MMMM'</code> and expect an error; now that <code>MMMM</code> is considered a valid Roman numeral, you need to bump this up to <code>'MMMMM'</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The sanity check and case checks loop through every number in the range, from <code class="constant">1</code> to <code>3999</code>. Since the range has now expanded, these <code>for</code> loops need to be updated as well to go up to <code>4999</code>.
</td>
</tr>
</table>
<p>Now your test cases are up to date with the new requirements, but your code is not, so you expect several of the test cases
to fail.
<div class="example"><h3>Example 15.7. Output of <code class="filename">romantest71.py</code> against <code class="filename">roman71.py</code></h3><pre class="screen"><samp class="computeroutput">
fromRoman should only accept uppercase input ... ERROR </samp><img id="roman.change.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should always return uppercase ... ERROR
fromRoman should fail with blank string ... ok
fromRoman should fail with malformed antecedents ... ok
fromRoman should fail with repeated pairs of numerals ... ok
fromRoman should fail with too many repeated numerals ... ok
fromRoman should give known result with known input ... ERROR </samp><img id="roman.change.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should give known result with known input ... ERROR </samp><img id="roman.change.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"><samp class="computeroutput">
fromRoman(toRoman(n))==n for all n ... ERROR</samp><img id="roman.change.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok
</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Your case checks now fail because they loop from <code class="constant">1</code> to <code>4999</code>, but <code class="function">toRoman</code> only accepts numbers from <code class="constant">1</code> to <code>3999</code>, so it will fail as soon as the test case hits <code>4000</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">fromRoman</code> known values test will fail as soon as it hits <code>'MMMM'</code>, because <code class="function">fromRoman</code> still thinks this is an invalid Roman numeral.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">toRoman</code> known values test will fail as soon as it hits <code>4000</code>, because <code class="function">toRoman</code> still thinks this is out of range.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The sanity check will also fail as soon as it hits <code>4000</code>, because <code class="function">toRoman</code> still thinks this is out of range.
</td>
</tr>
</table>
</div><pre class="screen"><samp class="computeroutput">
======================================================================
ERROR: fromRoman should only accept uppercase input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 161, in testFromRomanCase
    numeral = roman71.toRoman(integer)
  File "roman71.py", line 28, in toRoman
    raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)</samp><samp class="computeroutput">
======================================================================
ERROR: toRoman should always return uppercase
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 155, in testToRomanCase
    numeral = roman71.toRoman(integer)
  File "roman71.py", line 28, in toRoman
    raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)</samp><samp class="computeroutput">
======================================================================
ERROR: fromRoman should give known result with known input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 102, in testFromRomanKnownValues
    result = roman71.fromRoman(numeral)
  File "roman71.py", line 47, in fromRoman
    raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
InvalidRomanNumeralError: Invalid Roman numeral: MMMM</samp><samp class="computeroutput">
======================================================================
ERROR: toRoman should give known result with known input
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 96, in testToRomanKnownValues
    result = roman71.toRoman(integer)
  File "roman71.py", line 28, in toRoman
    raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)</samp><samp class="computeroutput">
======================================================================
ERROR: fromRoman(toRoman(n))==n for all n
----------------------------------------------------------------------
</samp><samp class="traceback">Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 147, in testSanity
    numeral = roman71.toRoman(integer)
  File "roman71.py", line 28, in toRoman
    raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)</samp><samp class="computeroutput">
----------------------------------------------------------------------
Ran 13 tests in 2.213s
FAILED (errors=5)</samp></pre><p>Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line
with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being
tested is never &#8220;ahead&#8221; of the test cases. While it's behind, you still have some work to do, and as soon as it catches up to the test cases, you
stop coding.)
<div class="example"><h3>Example 15.8. Coding the new requirements (<code class="filename">roman72.py</code>)</h3>
<p>This file is available in <code class="filename">py/roman/stage7/</code> in the examples directory.<pre class="programlisting">
"""Convert to and from Roman numerals"""
import re
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M',  1000),
                   ('CM', 900),
                   ('D',  500),
                   ('CD', 400),
                   ('C',  100),
                   ('XC', 90),
                   ('L',  50),
                   ('XL', 40),
                   ('X',  10),
                   ('IX', 9),
                   ('V',  5),
                   ('IV', 4),
                   ('I',  1))

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 &lt; n &lt; 5000): <img id="roman.change.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        raise OutOfRangeError, "number out of range (must be 1..4999)"
    if int(n) &lt;> n:
        raise NotIntegerError, "non-integers can not be converted"
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

#Define pattern to detect valid Roman numerals
romanNumeralPattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' <img id="roman.change.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">

def fromRoman(s):
    """convert Roman numeral to integer"""
    if not s:
        raise InvalidRomanNumeralError, 'Input can not be blank'
    if not re.search(romanNumeralPattern, s):
        raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">toRoman</code> only needs one small change, in the range check. Where you used to check <code>0 &lt; n &lt; 4000</code>, you now check <code>0 &lt; n &lt; 5000</code>. And you change the error message that you <code>raise</code> to reflect the new acceptable range (<code>1..4999</code> instead of <code>1..3999</code>). You don't need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds <code>'M'</code> for each thousand that it finds; given <code>4000</code>, it will spit out <code>'MMMM'</code>. The only reason it didn't do this before is that you explicitly stopped it with the range check.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You don't need to make any changes to <code class="function">fromRoman</code> at all. The only change is to <code class="varname">romanNumeralPattern</code>; if you look closely, you'll notice that you added another optional <code>M</code> in the first section of the regular expression. This will allow up to 4 <code>M</code> characters instead of 3, meaning you will accept Roman numerals up to the equivalent of <code>4999</code> instead of <code>3999</code>. The actual <code class="function">fromRoman</code> function is completely general; it just looks for repeated Roman numeral characters and adds them up, without caring how
many times they repeat. The only reason it didn't handle <code>'MMMM'</code> before is that you explicitly stopped it with the regular expression pattern matching.
</td>
</tr>
</table>
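<p>For readers following along with a current interpreter, the two changes described above can be sketched in Python 3 syntax. This is a port made for illustration, not the book's <code class="filename">roman72.py</code>: the names are rewritten in lowercase with underscores, <code>raise</code> uses call syntax, and <code>!=</code> replaces <code>&lt;&gt;</code>. The logic is otherwise the same.

```python
import re

class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

# Change 2: a fourth optional M at the start of the pattern.
ROMAN_NUMERAL_PATTERN = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

def to_roman(n):
    """convert integer to Roman numeral"""
    # Change 1: the upper bound of the range check moves from 4000 to 5000.
    if not (0 < n < 5000):
        raise OutOfRangeError('number out of range (must be 1..4999)')
    if int(n) != n:
        raise NotIntegerError('non-integers can not be converted')
    result = ''
    for numeral, integer in ROMAN_NUMERAL_MAP:
        # Greedily append each numeral while its value still fits in n.
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    """convert Roman numeral to integer"""
    if not s:
        raise InvalidRomanNumeralError('Input can not be blank')
    if not re.search(ROMAN_NUMERAL_PATTERN, s):
        raise InvalidRomanNumeralError('Invalid Roman numeral: %s' % s)
    result = 0
    index = 0
    for numeral, integer in ROMAN_NUMERAL_MAP:
        # Consume each numeral as many times as it appears at the cursor.
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

print(to_roman(4000))            # MMMM
print(from_roman('MMMMCMXCIX'))  # 4999
```

<p>Neither loop needed to change: the conversion code was always general, and only the two guards limited the range.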
<p>You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself:
<div class="example"><h3 id="roman.roman72.output">Example 15.9. Output of <code class="filename">romantest72.py</code> against <code class="filename">roman72.py</code></h3><pre class="screen"><samp class="computeroutput">fromRoman should only accept uppercase input ... ok
toRoman should always return uppercase ... ok
fromRoman should fail with blank string ... ok
fromRoman should fail with malformed antecedents ... ok
fromRoman should fail with repeated pairs of numerals ... ok
fromRoman should fail with too many repeated numerals ... ok
fromRoman should give known result with known input ... ok
toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... ok
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 13 tests in 3.685s
OK</samp> <img id="roman.change.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.change.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">All the test cases pass. Stop coding.</td>
</tr>
</table>
<p>Comprehensive unit testing means never having to rely on a programmer who says &#8220;Trust me.&#8221;
<h2 id="roman.refactoring">15.3. Refactoring</h2>
<p>The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even
the feeling you get when someone else blames you for breaking their code and you can actually <em>prove</em> that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
<p>Refactoring is the process of taking working code and making it work better. Usually, &#8220;better&#8221; means &#8220;faster&#8221;, although it can also mean &#8220;using less memory&#8221;, or &#8220;using less disk space&#8221;, or simply &#8220;more elegantly&#8221;. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any
program.
<p>Here, &#8220;better&#8221; means &#8220;faster&#8221;. Specifically, the <code class="function">fromRoman</code> function is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals.
It's probably not worth trying to do away with the regular expression altogether (it would be difficult, and it might not
end up any faster), but you can speed up the function by precompiling the regular expression.
<div class="example"><h3>Example 15.10. Compiling regular expressions</h3><pre class="screen">
<samp class="prompt">>>> </samp>import re
<samp class="prompt">>>> </samp>pattern = '^M?M?M?$'
<samp class="prompt">>>> </samp>re.search(pattern, 'M') <img id="roman.refactoring.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;SRE_Match object at 01090490>
<samp class="prompt">>>> </samp>compiledPattern = re.compile(pattern) <img id="roman.refactoring.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>compiledPattern
&lt;SRE_Pattern object at 00F06E28>
<samp class="prompt">>>> </samp>dir(compiledPattern)<img id="roman.refactoring.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
['findall', 'match', 'scanner', 'search', 'split', 'sub', 'subn']
<samp class="prompt">>>> </samp>compiledPattern.search('M') <img id="roman.refactoring.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
&lt;SRE_Match object at 01104928></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the syntax you've seen before: <code class="function">re.search</code> takes a regular expression as a string (<code class="varname">pattern</code>) and a string to match against it (<code>'M'</code>). If the pattern matches, the function returns a match object which can be queried to find out exactly what matched and
how.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the new syntax: <code class="function">re.compile</code> takes a regular expression as a string and returns a pattern object. Note there is no string to match here. Compiling a
regular expression has nothing to do with matching it against any specific strings (like <code>'M'</code>); it only involves the regular expression itself.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The compiled pattern object returned from <code class="function">re.compile</code> has several useful-looking functions, including several (like <code class="function">search</code> and <code class="function">sub</code>) that are available directly in the <code class="filename">re</code> module.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling the compiled pattern object's <code class="function">search</code> function with the string <code>'M'</code> accomplishes the same thing as calling <code class="function">re.search</code> with both the regular expression and the string <code>'M'</code>. Only much, much faster. (In fact, the <code class="function">re.search</code> function simply compiles the regular expression and calls the resulting pattern object's <code class="function">search</code> method for you.)
</td>
</tr>
</table>
</div><table class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Whenever you are going to use a regular expression more than once, you should compile it to get a pattern object, then call
the methods on the pattern object directly.
</td>
</tr>
</table>
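<p>To make the note above concrete, here is a minimal sketch of compiling once and reusing the pattern object on several strings. It is written for Python 3 (an assumption; the listings in this chapter use Python 2 syntax), and the names <code>thousands</code>, <code>candidates</code>, and <code>accepted</code> are made up for this illustration.

```python
import re

# Compile once, at module level, then reuse the pattern object.
thousands = re.compile('^M?M?M?$')

# Illustrative inputs only; none of these come from the roman module.
candidates = ['', 'M', 'MM', 'MMM', 'MMMM', 'MX']

# Each call goes straight to the pattern object's search method,
# so the pattern string never has to be handed to the re module again.
accepted = [s for s in candidates if thousands.search(s)]
print(accepted)
```

<p>Note that the empty string is accepted, because every <code>M</code> in the pattern is optional. That is why <code class="function">fromRoman</code> checks for blank input separately.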
<div class="example"><h3>Example 15.11. Compiled regular expressions in <code class="filename">roman81.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage8/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
# toRoman and rest of module omitted for clarity

romanNumeralPattern = \
    re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$') <img id="roman.refactoring.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">

def fromRoman(s):
    """convert Roman numeral to integer"""
    if not s:
        raise InvalidRomanNumeralError, 'Input can not be blank'
    if not romanNumeralPattern.search(s):<img id="roman.refactoring.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s

    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This looks very similar, but in fact a lot has changed. <code class="varname">romanNumeralPattern</code> is no longer a string; it is a pattern object which was returned from <code class="function">re.compile</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">That means that you can call methods on <code class="varname">romanNumeralPattern</code> directly. This will be much, much faster than calling <code class="function">re.search</code> every time. The regular expression is compiled once and stored in <code class="varname">romanNumeralPattern</code> when the module is first imported; then, every time you call <code class="function">fromRoman</code>, you can immediately match the input string against the regular expression, without any intermediate steps occurring under
the covers.
</td>
</tr>
</table>
<p>So how much faster is it to compile regular expressions? See for yourself:
<div class="example"><h3 id="roman.stage8.1.output">Example 15.12. Output of <code class="filename">romantest81.py</code> against <code class="filename">roman81.py</code></h3><pre class="screen">............. <img id="roman.refactoring.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
----------------------------------------------------------------------
Ran 13 tests in 3.385s </samp><img id="roman.refactoring.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
OK</samp> <img id="roman.refactoring.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Just a note in passing here: this time, I ran the unit test <em>without</em> the <code class="option">-v</code> option, so instead of the full <code>doc string</code> for each test, you only get a dot for each test that passes. (If a test failed, you'd get an <code>F</code>, and if it had an error, you'd get an <code>E</code>. You'd still get complete tracebacks for each failure and error, so you could track down any problems.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You ran <code>13</code> tests in <code>3.385</code> seconds, compared to <a href="#roman.roman72.output" title="Example 15.9. Output of romantest72.py against roman72.py"><code>3.685</code> seconds</a> without precompiling the regular expressions. That's an <code>8%</code> improvement overall, and remember that most of the time spent during the unit test is spent doing other things. (Separately,
I time-tested the regular expressions by themselves, apart from the rest of the unit tests, and found that compiling this
regular expression speeds up the <code class="function">search</code> by an average of <code>54%</code>.) Not bad for such a simple fix.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Oh, and in case you were wondering, precompiling the regular expression didn't break anything, and you just proved it.</td>
</tr>
</table>
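<p>If you want to time the regular expression by itself, apart from the unit tests, something like the following sketch would do it. This is not the script that produced the figures quoted above; it is a Python 3 illustration using the standard <code>timeit</code> module, and the sample numeral and repeat count are arbitrary choices. Expect your numbers to differ: the <code class="filename">re</code> module keeps a cache of recently compiled patterns, so the gap you measure may be smaller than the one reported here.

```python
import re
import timeit

PATTERN = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
compiled = re.compile(PATTERN)
SAMPLE = 'MMMDCCCLXXXVIII'   # 3888, an arbitrary long numeral

def search_with_string():
    # Module-level function: the pattern string is passed in on every call.
    return re.search(PATTERN, SAMPLE)

def search_with_compiled():
    # Method on the pattern object that was compiled once, above.
    return compiled.search(SAMPLE)

REPEAT = 20000               # arbitrary; raise it for steadier numbers
string_time = timeit.timeit(search_with_string, number=REPEAT)
compiled_time = timeit.timeit(search_with_compiled, number=REPEAT)

print('re.search with a string pattern: %.4f s' % string_time)
print('compiled pattern object:         %.4f s' % compiled_time)
```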
<p>There is one other performance optimization that I want to try. Given the complexity of regular expression syntax, it should
come as no surprise that there is frequently more than one way to write the same expression. After some discussion about
this module on <a href="http://groups.google.com/groups?group=comp.lang.python">comp.lang.python</a>, someone suggested that I try using the <code>{<i class="replaceable">m</i>,<i class="replaceable">n</i>}</code> syntax for the optional repeated characters.
<div class="example"><h3>Example 15.13. <code class="filename">roman82.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage8/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
# rest of program omitted for clarity

#old version
#romanNumeralPattern = \
#   re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')

#new version
romanNumeralPattern = \
    re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$') <img id="roman.refactoring.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You have replaced <code>M?M?M?M?</code> with <code>M{0,4}</code>. Both mean the same thing: &#8220;match 0 to 4 <code>M</code> characters&#8221;. Similarly, <code>C?C?C?</code> became <code>C{0,3}</code> (&#8220;match 0 to 3 <code>C</code> characters&#8221;) and so forth for <code>X</code> and <code>I</code>.
</td>
</tr>
</table>
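<p>The unit tests are the real proof that the two spellings behave the same, but you can also compare the patterns directly. The sketch below (Python 3, not part of the <code class="filename">roman</code> module) tries every string of up to five characters drawn from the seven numeral letters and records any string on which the old and new patterns disagree. The length limit of five is an arbitrary choice to keep the run short.

```python
import itertools
import re

old = re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')
new = re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

disagreements = []
for length in range(0, 6):
    # Every possible string of this length made of numeral letters,
    # valid Roman numeral or not.
    for letters in itertools.product('MDCLXVI', repeat=length):
        s = ''.join(letters)
        if bool(old.search(s)) != bool(new.search(s)):
            disagreements.append(s)

print('strings where the patterns disagree:', len(disagreements))
```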
<p>This form of the regular expression is a little shorter (though not any more readable). The big question is, is it any faster?
<div class="example"><h3>Example 15.14. Output of <code class="filename">romantest82.py</code> against <code class="filename">roman82.py</code></h3><pre class="screen"><samp class="computeroutput">.............
----------------------------------------------------------------------
Ran 13 tests in 3.315s </samp><img id="roman.refactoring.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
OK</samp> <img id="roman.refactoring.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Overall, the unit tests run 2% faster with this form of regular expression. That doesn't sound exciting, but remember that
the <code class="function">search</code> function is a small part of the overall unit test; most of the time is spent doing other things. (Separately, I time-tested
just the regular expressions, and found that the <code class="function">search</code> function is <code>11%</code> faster with this syntax.) By precompiling the regular expression and rewriting part of it to use this new syntax, you've
improved the regular expression performance by over <code>60%</code>, and improved the overall performance of the entire unit test by over <code>10%</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">More important than any performance boost is the fact that the module still works perfectly. This is the freedom I was talking
about earlier: the freedom to tweak, change, or rewrite any piece of it and verify that you haven't messed anything up in
the process. This is not a license to endlessly tweak your code just for the sake of tweaking it; you had a very specific
objective (&#8220;make <code class="function">fromRoman</code> faster&#8221;), and you were able to accomplish that objective without any lingering doubts about whether you introduced new bugs in the
process.
</td>
</tr>
</table>
<p>One other tweak I would like to make, and then I promise I'll stop refactoring and put this module to bed. As you've seen
repeatedly, regular expressions can get pretty hairy and unreadable pretty quickly. I wouldn't like to come back to this
module in six months and try to maintain it. Sure, the test cases pass, so I know that it works, but if I can't figure out
<em>how</em> it works, it's still going to be difficult to add new features, fix new bugs, or otherwise maintain it. As you saw in <a href="#re.verbose" title="7.5. Verbose Regular Expressions">Section 7.5, &#8220;Verbose Regular Expressions&#8221;</a>, Python provides a way to document your logic line-by-line.
<div class="example"><h3>Example 15.15. <code class="filename">roman83.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage8/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
# rest of program omitted for clarity

#old version
#romanNumeralPattern = \
#   re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

#new version
romanNumeralPattern = re.compile('''
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    ''', re.VERBOSE) <img id="roman.refactoring.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.6.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">re.compile</code> function can take an optional second argument, which is a set of one or more flags that control various options about the
compiled regular expression. Here you're specifying the <code>re.VERBOSE</code> flag, which tells Python that there are in-line comments within the regular expression itself. The comments and all the whitespace around them are
<em>not</em> considered part of the regular expression; the <code class="function">re.compile</code> function simply strips them all out when it compiles the expression. This new, &#8220;verbose&#8221; version is identical to the old version, but it is infinitely more readable.
</td>
</tr>
</table>
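<p>Here is a small sketch of what <code>re.VERBOSE</code> does, using a shortened pattern (thousands and hundreds only) so it fits on the page. It is a Python 3 illustration, and the sample strings are made up. Whitespace and comments in the verbose pattern are ignored, so a space in the <em>input</em> still fails to match, just as it does with the compact pattern.

```python
import re

compact = re.compile('^M{0,4}(CM|CD|D?C{0,3})$')

verbose = re.compile('''
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds
    $                   # end of string
    ''', re.VERBOSE)

# Illustrative inputs, including one with an embedded space.
samples = ['', 'M', 'MCM', 'MMDCCC', 'CCCC', 'M CM', 'MMMMM']

# One (string, compact result, verbose result) triple per sample.
results = [(s, bool(compact.search(s)), bool(verbose.search(s)))
           for s in samples]
for s, c, v in results:
    print(repr(s), c, v)

# Both patterns also define the same number of groups.
print(compact.groups, verbose.groups)
```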
<div class="example"><h3>Example 15.16. Output of <code class="filename">romantest83.py</code> against <code class="filename">roman83.py</code></h3><pre class="screen"><samp class="computeroutput">.............
----------------------------------------------------------------------
Ran 13 tests in 3.315s </samp><img id="roman.refactoring.7.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
OK</samp> <img id="roman.refactoring.7.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.7.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This new, &#8220;verbose&#8221; version runs at exactly the same speed as the old version. In fact, the compiled pattern objects are the same, since the
<code class="function">re.compile</code> function strips out all the stuff you added.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#roman.refactoring.7.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This new, &#8220;verbose&#8221; version passes all the same tests as the old version. Nothing has changed, except that the programmer who comes back to
this module in six months stands a fighting chance of understanding how the function works.
</td>
</tr>
</table>
<h2 id="roman.postscript">15.4. Postscript</h2>
<p>A clever reader read the <a href="#roman.refactoring" title="15.3. Refactoring">previous section</a> and took it to the next level. The biggest headache (and performance drain) in the program as it is currently written is
the regular expression, which is required because you have no other way of breaking down a Roman numeral. But there are fewer
than 5000 of them; why don't you just build a lookup table once, then simply read that? This idea gets even better when you realize
that you don't need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals,
you can build the reverse lookup table to convert Roman numerals to integers.
<p>And best of all, he already had a complete set of unit tests. He changed over half the code in the module, but the unit tests
stayed the same, so he could prove that his code worked just as well as the original.
<div class="example"><h3>Example 15.17. <code class="filename">roman9.py</code></h3>
<p>This file is available in <code class="filename">py/roman/stage9/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

#Roman numerals must be less than 5000
MAX_ROMAN_NUMERAL = 4999

#Define digit mapping
romanNumeralMap = (('M',  1000),
                   ('CM', 900),
                   ('D',  500),
                   ('CD', 400),
                   ('C',  100),
                   ('XC', 90),
                   ('L',  50),
                   ('XL', 40),
                   ('X',  10),
                   ('IX', 9),
                   ('V',  5),
                   ('IV', 4),
                   ('I',  1))

#Create tables for fast conversion of roman numerals.
#See fillLookupTables() below.
toRomanTable = [ None ]  # Skip an index since Roman numerals have no zero
fromRomanTable = {}

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 &lt; n &lt;= MAX_ROMAN_NUMERAL):
        raise OutOfRangeError, "number out of range (must be 1..%s)" % MAX_ROMAN_NUMERAL
    if int(n) &lt;> n:
        raise NotIntegerError, "non-integers can not be converted"
    return toRomanTable[n]

def fromRoman(s):
    """convert Roman numeral to integer"""
    if not s:
        raise InvalidRomanNumeralError, "Input can not be blank"
    if not fromRomanTable.has_key(s):
        raise InvalidRomanNumeralError, "Invalid Roman numeral: %s" % s
    return fromRomanTable[s]

def toRomanDynamic(n):
    """convert integer to Roman numeral using dynamic programming"""
    result = ""
    for numeral, integer in romanNumeralMap:
        if n >= integer:
            result = numeral
            n -= integer
            break
    if n > 0:
        result += toRomanTable[n]
    return result

def fillLookupTables():
    """compute all the possible roman numerals"""
    #Save the values in two global tables to convert to and from integers.
    for integer in range(1, MAX_ROMAN_NUMERAL + 1):
        romanNumber = toRomanDynamic(integer)
        toRomanTable.append(romanNumber)
        fromRomanTable[romanNumber] = integer

fillLookupTables()
</pre><p>So how fast is it?
<div class="example"><h3>Example 15.18. Output of <code class="filename">romantest9.py</code> against <code class="filename">roman9.py</code></h3><pre class="screen">
<samp class="computeroutput">
.............
----------------------------------------------------------------------
Ran 13 tests in 0.791s
OK
</samp>
</pre><p>Remember, the best performance you ever got in the original version was 13 tests in 3.315 seconds. Of course, it's not entirely
a fair comparison, because this version will take longer to import (when it fills the lookup tables). But since import is
only done once, this is negligible in the long run.
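<p>You can get a rough feel for that one-time cost by timing the table construction on its own. The following is a Python 3 sketch of the same idea, not the <code class="filename">roman9.py</code> module itself: the function name <code>build_tables</code> and the lower-case variable names are invented for this illustration, and the time it reports will depend entirely on your machine.

```python
import time

MAX_ROMAN_NUMERAL = 4999
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def build_tables():
    """Return (to_roman, from_roman) lookup tables for 1..MAX_ROMAN_NUMERAL."""
    to_roman = [None]   # index 0 is unused: Roman numerals have no zero
    from_roman = {}
    for n in range(1, MAX_ROMAN_NUMERAL + 1):
        # Take the largest numeral that fits, then reuse the entry
        # already computed for the remainder.
        for numeral, integer in ROMAN_NUMERAL_MAP:
            if n >= integer:
                remainder = n - integer
                roman = numeral + (to_roman[remainder] if remainder else '')
                break
        to_roman.append(roman)
        from_roman[roman] = n
    return to_roman, from_roman

start = time.perf_counter()
to_roman, from_roman = build_tables()
elapsed = time.perf_counter() - start
print('built %d entries in %.4f s' % (len(from_roman), elapsed))

# Every integer should survive a trip through both tables.
round_trip_ok = all(from_roman[to_roman[n]] == n
                    for n in range(1, MAX_ROMAN_NUMERAL + 1))
print('round trip ok:', round_trip_ok)
```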
<p>The moral of the story?
<div class="itemizedlist">
<ul>
<li>Simplicity is a virtue.
<li>Especially when regular expressions are involved.
<li>And unit tests can give you the confidence to do large-scale refactoring... even if you didn't write the original code.
</ul>
<h2 id="roman.summary">15.5. Summary</h2>
<p>Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility
in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver,
or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers
are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional
testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you've seen it
work, you'll wonder how you ever got along without it.
<p>This chapter covered a lot of ground, and much of it wasn't even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts:
<div class="highlights">
<div class="itemizedlist">
<ul>
<li>Designing test cases that are specific, automated, and independent
<li>Writing test cases <em>before</em> the code they are testing
<li>Writing tests that <a href="#roman.success" title="13.4. Testing for success">test good input</a> and check for proper results
<li>Writing tests that <a href="#roman.failure" title="13.5. Testing for failure">test bad input</a> and check for proper failures
<li>Writing and updating test cases to <a href="#roman.bugs" title="15.1. Handling bugs">illustrate bugs</a> or <a href="#roman.change" title="15.2. Handling changing requirements">reflect new requirements</a>
<li><a href="#roman.refactoring" title="15.3. Refactoring">Refactoring</a> mercilessly to improve performance, scalability, readability, maintainability, or whatever other -ility you're lacking
</ul>
<p>Additionally, you should be comfortable doing all of the following Python-specific things:
<div class="highlights">
<div class="itemizedlist">
<ul>
<li><a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">Subclassing <code>unittest.TestCase</code></a> and writing methods for individual test cases
<li>Using <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues"><code class="function">assertEqual</code></a> to check that a function returns a known value
<li>Using <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to toRoman"><code class="function">assertRaises</code></a> to check that a function raises a known exception
<li>Calling <a href="#roman.stage1.output" title="Example 14.2. Output of romantest1.py against roman1.py"><code>unittest.main()</code></a> in your <code>if __name__</code> clause to run all your test cases at once
<li>Running unit tests in <a href="#roman.stage1.output" title="Example 14.2. Output of romantest1.py against roman1.py">verbose</a> or <a href="#roman.stage8.1.output" title="Example 15.12. Output of romantest81.py against roman81.py">regular</a> mode
</ul>
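<p>As a compact reminder of how those pieces fit together, here is a minimal Python 3 sketch. The function under test, <code>double</code>, and the test class are hypothetical stand-ins invented for this illustration; they are not part of the book's examples. The suite is built and run by hand so that the result object can be inspected; in a script you would normally just call <code>unittest.main()</code>.

```python
import unittest

def double(n):
    """Hypothetical function under test: double an integer."""
    if not isinstance(n, int):
        raise TypeError('double() needs an integer')
    return 2 * n

class DoubleKnownValues(unittest.TestCase):
    def testKnownValue(self):
        """double should give known result with known input"""
        self.assertEqual(double(21), 42)

    def testBadInput(self):
        """double should fail with non-integer input"""
        self.assertRaises(TypeError, double, 'spam')

# In a script you would normally end with:
#     if __name__ == "__main__":
#         unittest.main()
# Here the suite is loaded and run explicitly instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DoubleKnownValues)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

<p>Passing <code>verbosity=2</code> corresponds to the <code class="option">-v</code> option on the command line; leave it out to get one dot per passing test.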
<div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li><a href="http://www.xprogramming.com/">XProgramming.com</a> has links to <a href="http://www.xprogramming.com/software.htm">download unit testing frameworks</a> for many different languages.
</ul>
<div class="chapter">
<h2 id="regression">Chapter 16. Functional Programming</h2>
<h2 id="regression.divein">16.1. Diving in</h2>
<p>In <a href="#roman" title="Chapter 13. Unit Testing">Chapter 13, <i>Unit Testing</i></a>, you learned about the philosophy of unit testing. In <a href="#roman1.5" title="Chapter 14. Test-First Programming">Chapter 14, <i>Test-First Programming</i></a>, you stepped through the implementation of basic unit tests in Python. In <a href="#roman2" title="Chapter 15. Refactoring">Chapter 15, <i>Refactoring</i></a>, you saw how unit testing makes large-scale refactoring easier. This chapter will build on those sample programs, but here
we will focus more on advanced Python-specific techniques, rather than on unit testing itself.
<p>The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual
modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the
build process for this book; I have unit tests for several of the example programs (not just the <code class="filename">roman.py</code> module featured in <a href="#roman" title="Chapter 13. Unit Testing">Chapter 13, <i>Unit Testing</i></a>), and the first thing my automated build script does is run this program to make sure all my examples still work. If this
regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to
download them and sit around scratching your head and yelling at your monitor and wondering why they don't work.
<div class="example"><h3>Example 16.1. <code class="filename">regression.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
"""Regression testing framework
This module will search for scripts in the same directory named
XYZtest.py. Each such script should be a test suite that tests a
module through PyUnit. (As of Python 2.1, PyUnit is included in
the standard library as "unittest".) This script will aggregate all
found test suites into one big test suite and run them all at once.
"""
import sys, os, re, unittest

def regressionTest():
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    files = os.listdir(path)
    test = re.compile("test\.py$", re.IGNORECASE)
    files = filter(test.search, files)
    filenameToModuleName = lambda f: os.path.splitext(f)[0]
    moduleNames = map(filenameToModuleName, files)
    modules = map(__import__, moduleNames)
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite(map(load, modules))

if __name__ == "__main__":
    unittest.main(defaultTest="regressionTest")
</pre><p>Running this script in the same directory as the rest of the example scripts that come with this book will find all the unit
tests, named <code class="filename"><i class="replaceable"><code>module</code></i>test.py</code>, run them as a single test, and pass or fail them all at once.
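<p>The first few steps of <code class="function">regressionTest</code> (list the directory, keep the names ending in <code class="filename">test.py</code>, and strip the extension to get module names) can be tried on their own. The sketch below is a Python 3 illustration written with list comprehensions rather than <code class="function">filter</code> and <code class="function">map</code>; the helper name <code>find_test_module_names</code> and the file names in the scratch directory are made up, and the import and loading steps are left out.

```python
import os
import re
import tempfile

# Same pattern as regression.py, written as a raw string.
TEST_FILE = re.compile(r"test\.py$", re.IGNORECASE)

def find_test_module_names(path):
    """Return sorted module names for every file in path ending in test.py."""
    files = [f for f in os.listdir(path) if TEST_FILE.search(f)]
    return sorted(os.path.splitext(f)[0] for f in files)

# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as scratch:
    for name in ('romantest.py', 'apihelperTEST.py', 'roman.py', 'notes.txt'):
        open(os.path.join(scratch, name), 'w').close()
    names = find_test_module_names(scratch)

print(names)
```

<p>Because the pattern is compiled with <code>re.IGNORECASE</code>, a file named <code class="filename">apihelperTEST.py</code> is picked up too, while <code class="filename">roman.py</code> and <code class="filename">notes.txt</code> are skipped.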
<div class="example"><h3>Example 16.2. Sample output of <code class="filename">regression.py</code></h3><pre class="screen">
<samp class="prompt">[you@localhost py]$ </samp>python regression.py -v
help should fail with no object ... ok <img id="regression.divein.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"><samp class="computeroutput">
help should return known result for apihelper ... ok
help should honor collapse argument ... ok
help should honor spacing argument ... ok
buildConnectionString should fail with list input ... ok </samp><img id="regression.divein.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"><samp class="computeroutput">
buildConnectionString should fail with string input ... ok
buildConnectionString should fail with tuple input ... ok
buildConnectionString handles empty dictionary ... ok
buildConnectionString returns known result with known input ... ok
fromRoman should only accept uppercase input ... ok </samp><img id="regression.divein.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"><samp class="computeroutput">
toRoman should always return uppercase ... ok
fromRoman should fail with blank string ... ok
fromRoman should fail with malformed antecedents ... ok
fromRoman should fail with repeated pairs of numerals ... ok
fromRoman should fail with too many repeated numerals ... ok
fromRoman should give known result with known input ... ok
toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... ok
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok
kgp a ref test ... ok
kgp b ref test ... ok
kgp c ref test ... ok
kgp d ref test ... ok
kgp e ref test ... ok
kgp f ref test ... ok
kgp g ref test ... ok
----------------------------------------------------------------------
Ran 29 tests in 2.799s
OK</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.divein.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first 4 tests are from <code class="filename">apihelpertest.py</code>, which tests the example script from <a href="#apihelper" title="Chapter 4. The Power Of Introspection">Chapter 4, <i>The Power Of Introspection</i></a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.divein.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The next 5 tests are from <code class="filename">odbchelpertest.py</code>, which tests the example script from <a href="#odbchelper" title="Chapter 2. Your First Python Program">Chapter 2, <i>Your First Python Program</i></a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.divein.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The rest are from <code class="filename">romantest.py</code>, which you studied in depth in <a href="#roman" title="Chapter 13. Unit Testing">Chapter 13, <i>Unit Testing</i></a>.
</td>
</tr>
</table>
<h2 id="regression.path">16.2. Finding the path</h2>
<p>When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.
<p>This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember
once you see it. The key to it is <code>sys.argv</code>. As you saw in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a>, this is a list that holds the command-line arguments. However, its first element also holds the name of the running script, exactly
as it was called from the command line, and this is enough information to determine its location.
<div class="example"><h3>Example 16.3. <code class="filename">fullpath.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
import sys, os
print 'sys.argv[0] =', sys.argv[0] <img id="regression.path.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
pathname = os.path.dirname(sys.argv[0]) <img id="regression.path.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
print 'path =', pathname
print 'full path =', os.path.abspath(pathname) <img id="regression.path.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Regardless of how you run a script, <code>sys.argv[0]</code> will always contain the name of the script, exactly as it appears on the command line. This may or may not include any path
information, as you'll see shortly.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">os.path.dirname</code> takes a filename as a string and returns the directory path portion. If the given filename does not include any path information,
<code class="function">os.path.dirname</code> returns an empty string.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">os.path.abspath</code> is the key here. It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.
</td>
</tr>
</table>
<p><code class="function">os.path.abspath</code> deserves further explanation. It is very flexible; it can take any kind of pathname.
<div class="example"><h3>Example 16.4. Further explanation of <code class="function">os.path.abspath</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import os
<samp class="prompt">>>> </samp>os.getcwd() <img id="regression.path.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
/home/you
<samp class="prompt">>>> </samp>os.path.abspath('') <img id="regression.path.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
/home/you
<samp class="prompt">>>> </samp>os.path.abspath('.ssh') <img id="regression.path.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
/home/you/.ssh
<samp class="prompt">>>> </samp>os.path.abspath('/home/you/.ssh') <img id="regression.path.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
/home/you/.ssh
<samp class="prompt">>>> </samp>os.path.abspath('.ssh/../foo/') <img id="regression.path.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
/home/you/foo</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">os.getcwd()</code> returns the current working directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling <code class="function">os.path.abspath</code> with an empty string returns the current working directory, same as <code class="function">os.getcwd()</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling <code class="function">os.path.abspath</code> with a partial pathname constructs a fully qualified pathname out of it, based on the current working directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling <code class="function">os.path.abspath</code> with a full pathname simply returns it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">os.path.abspath</code> also <em>normalizes</em> the pathname it returns. Note that this example worked even though I don't actually have a 'foo' directory. <code class="function">os.path.abspath</code> never checks your actual disk; this is all just string manipulation.
</td>
</tr>
</table>
</div><table id="os.path.abspath.exist.note" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">The pathnames and filenames you pass to <code class="function">os.path.abspath</code> do not need to exist.
</td>
</tr>
</table><table id="os.path.normpath.note" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%"><code class="function">os.path.abspath</code> not only constructs full path names, it also normalizes them. That means that if you are in the <code class="filename">/usr/</code> directory, <code>os.path.abspath('bin/../local/bin')</code> will return <code class="filename">/usr/local/bin</code>. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without
turning it into a full pathname, use <code class="function">os.path.normpath</code> instead.
</td>
</tr>
</table>
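<p>To make the difference concrete, here is a minimal sketch. It uses the <code class="filename">posixpath</code> module (the POSIX flavor of <code class="filename">os.path</code>) so the results do not depend on your platform or your current directory. The helper name <code class="function">absolutize</code> and its <code class="varname">cwd</code> parameter are hypothetical; they only imitate what <code class="function">os.path.abspath</code> does with the real working directory.

```python
import posixpath

def absolutize(pathname, cwd):
    """Imitate os.path.abspath for an explicitly given working directory (hypothetical helper)."""
    # Join onto the working directory (a no-op for absolute pathnames),
    # then normalize. Nothing here touches the disk.
    return posixpath.normpath(posixpath.join(cwd, pathname))

relative = posixpath.normpath('bin/../local/bin')   # normalized, still relative
absolute = absolutize('bin/../local/bin', '/usr')   # normalized and fully qualified

print(relative)   # prints local/bin
print(absolute)   # prints /usr/local/bin
```

<p>Passing an empty string to <code class="function">absolutize</code> gives back the working directory itself, which mirrors the behavior of <code>os.path.abspath('')</code> shown earlier.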
<div class="example"><h3>Example 16.5. Sample output from <code class="filename">fullpath.py</code></h3><pre class="screen">
<samp class="prompt">[you@localhost py]$ </samp>python /home/you/diveintopython3/common/py/fullpath.py <img id="regression.path.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">sys.argv[0] = /home/you/diveintopython3/common/py/fullpath.py
path = /home/you/diveintopython3/common/py
full path = /home/you/diveintopython3/common/py</samp>
<samp class="prompt">[you@localhost diveintopython3]$ </samp>python common/py/fullpath.py <img id="regression.path.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">sys.argv[0] = common/py/fullpath.py
path = common/py
full path = /home/you/diveintopython3/common/py</samp>
<samp class="prompt">[you@localhost diveintopython3]$ </samp>cd common/py
<samp class="prompt">[you@localhost py]$ </samp>python fullpath.py <img id="regression.path.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">sys.argv[0] = fullpath.py
path =
full path = /home/you/diveintopython3/common/py</samp></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">In the first case, <code>sys.argv[0]</code> includes the full path of the script. You can then use the <code class="function">os.path.dirname</code> function to strip off the script name and return the full directory name, and <code class="function">os.path.abspath</code> simply returns what you give it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the script is run by using a partial pathname, <code>sys.argv[0]</code> will still contain exactly what appears on the command line. <code class="function">os.path.dirname</code> will then give you a partial pathname (relative to the current directory), and <code class="function">os.path.abspath</code> will construct a full pathname from the partial pathname.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the script is run from the current directory without giving any path, <code class="function">os.path.dirname</code> will simply return an empty string. Given an empty string, <code class="function">os.path.abspath</code> returns the current directory, which is what you want, since the script was run from the current directory.
</td>
</tr>
</table>
</div><table id="os.path.abspath.crossplatform.note" class="note" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">Like the other functions in the <code class="filename">os</code> and <code class="filename">os.path</code> modules, <code class="function">os.path.abspath</code> is cross-platform. Your results will look slightly different from my examples if you're running on Windows (which uses backslash
as a path separator) or Mac OS 9 (which uses colons), but they'll still work. That's the whole point of the <code class="filename">os</code> module.
</td>
</tr>
</table>
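<p>If you need this trick in more than one script, it can be wrapped in a small function. This is only a sketch: the name <code class="function">script_directory</code> is hypothetical, and it is written with the function-call form of <code>print</code> so that it also runs on newer interpreters.

```python
import os
import sys

def script_directory(argv0):
    """Return the absolute directory of a script, given its sys.argv[0] (hypothetical helper)."""
    # dirname returns '' when the script was run without any path;
    # abspath turns that empty string into the current working directory.
    return os.path.abspath(os.path.dirname(argv0))

if __name__ == '__main__':
    print(script_directory(sys.argv[0]))
```

<p>All three ways of invoking the script (full path, partial path, or no path at all) end up with a fully qualified directory name.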
<p><b>Addendum. </b>One reader was dissatisfied with this solution, and wanted to be able to run all the unit tests in the current directory,
not the directory where <code class="filename">regression.py</code> is located. He suggests this approach instead:
<div class="example"><h3 id="regression.path.cwd.example">Example 16.6. Running scripts in the current directory</h3><pre class="programlisting">import sys, os, re, unittest
def regressionTest():
    path = os.getcwd() <img id="regression.path.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    sys.path.append(path) <img id="regression.path.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    files = os.listdir(path) <img id="regression.path.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Instead of setting <code class="varname">path</code> to the directory where the currently running script is located, you set it to the current working directory instead. This
will be whatever directory you were in before you ran the script, which is not necessarily the same as the directory the script
is in. (Read that sentence a few times until you get it.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them. You didn't need to do this when <code class="varname">path</code> was the directory of the currently running script, because Python always looks in that directory.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.path.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The rest of the function is the same.</td>
</tr>
</table>
<p>This technique will allow you to re-use this <code class="filename">regression.py</code> script on multiple projects. Just put the script in a common directory, then change to the project's directory before running
it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where <code class="filename">regression.py</code> is located.
<h2 id="regression.filter">16.3. Filtering lists revisited</h2>
<p>You're already familiar with <a href="#apihelper.filter" title="4.5. Filtering Lists">using list comprehensions to filter lists</a>. There is another way to accomplish this same thing, which some people feel is more expressive.
<p>Python has a built-in <code class="function">filter</code> function which takes two arguments, a function and a list, and returns a list.<sup>[<a name="d0e35697" href="#ftn.d0e35697">7</a>]</sup> The function passed as the first argument to <code class="function">filter</code> must itself take one argument, and the list that <code class="function">filter</code> returns will contain all the elements from the list passed to <code class="function">filter</code> for which the function passed to <code class="function">filter</code> returns true.
<p>Got all that? It's not as difficult as it sounds.
<div class="example"><h3>Example 16.7. Introducing <code class="function">filter</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>def odd(n): <img id="regression.filter.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>    return n % 2
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>li = [1, 2, 3, 5, 9, 10, 256, -3]
<samp class="prompt">>>> </samp>filter(odd, li) <img id="regression.filter.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
[1, 3, 5, 9, -3]
<samp class="prompt">>>> </samp>[e for e in li if odd(e)] <img id="regression.filter.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
[1, 3, 5, 9, -3]
<samp class="prompt">>>> </samp>filteredList = []
<samp class="prompt">>>> </samp>for n in li: <img id="regression.filter.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="prompt">... </samp>    if odd(n):
<samp class="prompt">... </samp>        filteredList.append(n)
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>filteredList
[1, 3, 5, 9, -3]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">odd</code> uses the built-in modulo operator &#8220;<code>%</code>&#8221; to return <code>1</code> (a true value) if <code class="varname">n</code> is odd and <code>0</code> (a false value) if <code class="varname">n</code> is even.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">filter</code> takes two arguments, a function (<code class="function">odd</code>) and a list (<code class="varname">li</code>). It loops through the list and calls <code class="function">odd</code> with each element. If <code class="function">odd</code> returns a true value (remember, any non-zero value is true in Python), then the element is included in the returned list, otherwise it is filtered out. The result is a list of only the odd
numbers from the original list, in the same order as they appeared in the original.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You could accomplish the same thing using list comprehensions, as you saw in <a href="#apihelper.filter" title="4.5. Filtering Lists">Section 4.5, &#8220;Filtering Lists&#8221;</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You could also accomplish the same thing with a <code>for</code> loop. Depending on your programming background, this may seem more &#8220;straightforward&#8221;, but functions like <code class="function">filter</code> are much more expressive. Not only is it easier to write, it's easier to read, too. Reading the <code>for</code> loop is like standing too close to a painting; you see all the details, but it may take a few seconds to be able to step
back and see the bigger picture: &#8220;Oh, you're just filtering the list!&#8221;
</td>
</tr>
</table>
<div class="example"><h3>Example 16.8. <code class="function">filter</code> in <code class="filename">regression.py</code></h3><pre class="programlisting">
files = os.listdir(path) <img id="regression.filter.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
test = re.compile(r"test\.py$", re.IGNORECASE) <img id="regression.filter.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
files = filter(test.search, files) <img id="regression.filter.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#regression.path" title="16.2. Finding the path">Section 16.2, &#8220;Finding the path&#8221;</a>, <code class="varname">path</code> may contain the full or partial pathname of the directory of the currently running script, or it may contain an empty string
if the script is being run from the current directory. Either way, <code class="varname">files</code> will end up with the names of the files in the same directory as this script you're running.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is a compiled regular expression. As you saw in <a href="#roman.refactoring" title="15.3. Refactoring">Section 15.3, &#8220;Refactoring&#8221;</a>, if you're going to use the same regular expression over and over, you should compile it for faster performance. The compiled
object has a <code class="function">search</code> method which takes a single argument, the string to search. If the regular expression matches the string, the <code class="function">search</code> method returns a <code class="classname">Match</code> object containing information about the regular expression match; otherwise it returns <code>None</code>, the Python null value.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">For each element in the <code class="varname">files</code> list, you're going to call the <code class="function">search</code> method of the compiled regular expression object, <code class="varname">test</code>. If the regular expression matches, the method will return a <code class="classname">Match</code> object, which Python considers to be true, so the element will be included in the list returned by <code class="function">filter</code>. If the regular expression does not match, the <code class="function">search</code> method will return <code>None</code>, which Python considers to be false, so the element will not be included.
</td>
</tr>
</table>
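<p>You can try the same filter without touching the disk. In this sketch the list of filenames is made up (in <code class="filename">regression.py</code> it comes from <code class="function">os.listdir</code>), and the result of <code class="function">filter</code> is wrapped in <code class="function">list</code> as a precaution for interpreters where <code class="function">filter</code> returns a lazy iterator instead of a list.

```python
import re

# Hypothetical directory listing; regression.py gets this from os.listdir(path)
files = ['apihelper.py', 'apihelpertest.py', 'kgptest.py',
         'notes.txt', 'RomanTest.PY', 'testdata.xml']

test = re.compile(r"test\.py$", re.IGNORECASE)

# search returns a match object (true) or None (false) for each filename
matches = [bool(test.search(f)) for f in files]

# filter keeps only the filenames for which search returned a true value
test_files = list(filter(test.search, files))

print(matches)     # prints [False, True, True, False, True, False]
print(test_files)  # prints ['apihelpertest.py', 'kgptest.py', 'RomanTest.PY']
```

<p>Note that <code class="filename">testdata.xml</code> is rejected because the pattern is anchored to the end of the name, and <code class="filename">RomanTest.PY</code> is kept because of <code>re.IGNORECASE</code>.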
<p><b>Historical note. </b>Versions of Python prior to 2.0 did not have <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehensions</a>, so you couldn't <a href="#apihelper.filter" title="4.5. Filtering Lists">filter using list comprehensions</a>; the <code class="function">filter</code> function was the only game in town. Even with the introduction of list comprehensions in 2.0, some people still prefer the
old-style <code class="function">filter</code> (and its companion function, <code class="function">map</code>, which you'll see later in this chapter). Both techniques work at the moment, so which one you use is a matter of style.
There is discussion that <code class="function">map</code> and <code class="function">filter</code> might be deprecated in a future version of Python, but no decision has been made.
<div class="example"><h3>Example 16.9. Filtering using list comprehensions instead</h3><pre class="programlisting">
files = os.listdir(path)
test = re.compile(r"test\.py$", re.IGNORECASE)
files = [f for f in files if test.search(f)] <img id="regression.filter.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.filter.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This will accomplish exactly the same result as using the <code class="function">filter</code> function. Which way is more expressive? That's up to you.
</td>
</tr>
</table>
<h2 id="regression.map">16.4. Mapping lists revisited</h2>
<p>You're already familiar with using <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehensions</a> to map one list into another. There is another way to accomplish the same thing, using the built-in <code class="function">map</code> function. It works much the same way as the <a href="#regression.filter" title="16.3. Filtering lists revisited"><code class="function">filter</code></a> function.
<div class="example"><h3>Example 16.10. Introducing <code class="function">map</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>def double(n):
<samp class="prompt">... </samp>    return n*2
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>li = [1, 2, 3, 5, 9, 10, 256, -3]
<samp class="prompt">>>> </samp>map(double, li) <img id="regression.map.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
[2, 4, 6, 10, 18, 20, 512, -6]
<samp class="prompt">>>> </samp>[double(n) for n in li] <img id="regression.map.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
[2, 4, 6, 10, 18, 20, 512, -6]
<samp class="prompt">>>> </samp>newlist = []
<samp class="prompt">>>> </samp>for n in li: <img id="regression.map.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">... </samp>    newlist.append(double(n))
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>newlist
[2, 4, 6, 10, 18, 20, 512, -6]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.map.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">map</code> takes a function and a list<sup>[<a name="d0e36079" href="#ftn.d0e36079">8</a>]</sup> and returns a new list by calling the function with each element of the list in order. In this case, the function simply
multiplies each element by 2.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.map.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You could accomplish the same thing with a list comprehension. List comprehensions were first introduced in Python 2.0; <code class="function">map</code> has been around forever.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.map.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You could, if you insist on thinking like a Visual Basic programmer, use a <code>for</code> loop to accomplish the same thing.
</td>
</tr>
</table>
<div class="example"><h3>Example 16.11. <code class="function">map</code> with lists of mixed datatypes</h3><pre class="screen">
<samp class="prompt">>>> </samp>li = [5, 'a', (2, 'b')]
<samp class="prompt">>>> </samp>map(double, li) <img id="regression.map.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
[10, 'aa', (2, 'b', 2, 'b')]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.map.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As a side note, I'd like to point out that <code class="function">map</code> works just as well with lists of mixed datatypes, as long as the function you're using correctly handles each type. In this
case, the <code class="function">double</code> function simply multiplies the given argument by 2, and Python Does The Right Thing depending on the datatype of the argument. For integers, this means actually multiplying it by 2; for
strings, it means concatenating the string with itself; for tuples, it means making a new tuple that has all of the elements
of the original, then all of the elements of the original again.
</td>
</tr>
</table>
<p>All right, enough play time. Let's look at some real code.
<div class="example"><h3>Example 16.12. <code class="function">map</code> in <code class="filename">regression.py</code></h3><pre class="programlisting">
filenameToModuleName = lambda f: os.path.splitext(f)[0] <img id="regression.map.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
moduleNames = map(filenameToModuleName, files) <img id="regression.map.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.map.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#apihelper.lambda" title="4.7. Using lambda Functions">Section 4.7, &#8220;Using lambda Functions&#8221;</a>, <code>lambda</code> defines an inline function. And as you saw in <a href="#splittingpathnames.example" title="Example 6.17. Splitting Pathnames">Example 6.17, &#8220;Splitting Pathnames&#8221;</a>, <code class="function">os.path.splitext</code> takes a filename and returns a tuple <code>(<i class="replaceable">name</i>, <i class="replaceable">extension</i>)</code>. So <code class="function">filenameToModuleName</code> is a function which will take a filename and strip off the file extension, and return just the name.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.map.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Calling <code class="function">map</code> takes each filename listed in <code class="varname">files</code>, passes it to the function <code class="function">filenameToModuleName</code>, and returns a list of the return values of each of those function calls. In other words, you strip the file extension off
of each filename, and store the list of all those stripped filenames in <code class="varname">moduleNames</code>.
</td>
</tr>
</table>
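<p>Here is the same mapping step as a standalone sketch. The filenames are made up, the function is written with <code>def</code> instead of <code>lambda</code> purely for readability, and the result of <code class="function">map</code> is wrapped in <code class="function">list</code> as a precaution for interpreters where <code class="function">map</code> returns a lazy iterator.

```python
import os

# Hypothetical list of test scripts, as it might look after filtering
files = ['apihelpertest.py', 'kgptest.py', 'romantest.py']

def filename_to_module_name(f):
    # splitext returns the tuple (name, extension); keep only the name
    return os.path.splitext(f)[0]

module_names = list(map(filename_to_module_name, files))

print(module_names)  # prints ['apihelpertest', 'kgptest', 'romantest']
```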
<p>As you'll see in the rest of the chapter, you can extend this type of data-centric thinking all the way to the final goal,
which is to define and execute a single test suite that contains the tests from all of those individual test suites.
<h2 id="regression.datacentric">16.5. Data-centric programming</h2>
<p>By now you're probably scratching your head wondering why this is better than using <code>for</code> loops and straight function calls. And that's a perfectly valid question. Mostly, it's a matter of perspective. Using
<code class="function">map</code> and <code class="function">filter</code> forces you to center your thinking around your data.
<p>In this case, you started with no data at all; the first thing you did was <a href="#regression.path" title="16.2. Finding the path">get the directory path</a> of the current script, and got a list of files in that directory. That was the bootstrap, and it gave you real data to work
with: a list of filenames.
<p>However, you knew you didn't care about all of those files, only the ones that were actually test suites. You had <em>too much data</em>, so you needed to <code class="function">filter</code> it. How did you know which data to keep? You needed a test to decide, so you defined one and passed it to the <code class="function">filter</code> function. In this case you used a regular expression to decide, but the concept would be the same regardless of how you
constructed the test.
<p>Now you had the filenames of each of the test suites (and only the test suites, since everything else had been filtered out),
but you really wanted module names instead. You had the right amount of data, but it was <em>in the wrong format</em>. So you defined a function that would transform a single filename into a module name, and you mapped that function onto
the entire list. From one filename, you can get a module name; from a list of filenames, you can get a list of module names.
<p>Instead of <code class="function">filter</code>, you could have used a <code>for</code> loop with an <code>if</code> statement. Instead of <code class="function">map</code>, you could have used a <code>for</code> loop with a function call. But using <code>for</code> loops like that is busywork. At best, it simply wastes time; at worst, it introduces obscure bugs. For instance, you need
to figure out how to test for the condition &#8220;is this file a test suite?&#8221; anyway; that's the application-specific logic, and no language can write that for us. But once you've figured that out,
do you really want to go to all the trouble of defining a new empty list and writing a <code>for</code> loop and an <code>if</code> statement and manually calling <code class="function">append</code> to add each element to the new list if it passes the condition and then keeping track of which variable holds the new filtered
data and which one holds the old unfiltered data? Why not just define the test condition, then let Python do the rest of that work for us?
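<p>To make the comparison concrete, here is a minimal sketch of the two approaches side by side. The file list and the <code>is_test_suite</code> helper are made up for illustration (the chapter's script uses a regular expression instead), and the <code>list()</code> call assumes Python 3, where <code class="function">filter</code> returns a lazy iterator rather than a list.

```python
# Hypothetical data and test function, for illustration only.
files = ['apihelper.py', 'apihelpertest.py', 'roman.py', 'romantest.py']


def is_test_suite(filename):
    """The application-specific logic: is this file a test suite?"""
    return filename.endswith('test.py')


# The busywork version: new empty list, loop, condition, append.
filtered_by_loop = []
for filename in files:
    if is_test_suite(filename):
        filtered_by_loop.append(filename)

# The filter version: define the test, then let Python do the rest.
# list() is needed in Python 3, where filter returns an iterator.
filtered = list(filter(is_test_suite, files))

print(filtered_by_loop == filtered)
```

<p>Both versions keep the same files; the only part you had to think about was the test itself.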
<p>Oh sure, you could try to be fancy and delete elements in place without creating a new list. But you've been burned by that
before. Trying to modify a data structure that you're looping through can be tricky. You delete an element, then loop to
the next element, and suddenly you've skipped one. Is Python one of the languages that works that way? How long would it take you to figure it out? Would you remember for certain whether
it was safe the next time you tried? Programmers spend so much time and make so many mistakes dealing with purely technical
issues like this, and it's all pointless. It doesn't advance your program at all; it's just busywork.
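<p>For the record, here is a small sketch of that trap with a made-up list of numbers. The skipping shown is how CPython's list iterator behaves when the list shrinks underneath it; the safe alternative simply describes what to keep and builds a new list.

```python
# Made-up data: try to remove every 2 from the list.
numbers = [1, 2, 2, 3]

# Risky: deleting from the list you are looping through.
in_place = list(numbers)
for n in in_place:
    if n == 2:
        in_place.remove(n)  # later elements shift left; the loop skips one

# Safe: define the test and build a new, filtered list.
kept = list(filter(lambda n: n != 2, numbers))

print(in_place)  # one of the 2s is still there
print(kept)
```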
<p>I resisted list comprehensions when I first learned Python, and I resisted <code class="function">filter</code> and <code class="function">map</code> even longer. I insisted on making my life more difficult, sticking to the familiar way of <code>for</code> loops and <code>if</code> statements and step-by-step code-centric programming. And my Python programs looked a lot like Visual Basic programs, detailing every step of every operation in every function. And they had all the same types of little problems
and obscure bugs. And it was all pointless.
<p>Let it all go. Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have
too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind.
<h2 id="regression.import">16.6. Dynamically importing modules</h2>
<p>OK, enough philosophizing. Let's talk about dynamically importing modules.
<p>First, let's look at how you normally import modules. The <code>import <i class="replaceable">module</i></code> syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once
this way, with a comma-separated list. You did this on the very first line of this chapter's script.
<div class="example"><h3>Example 16.13. Importing multiple modules at once</h3><pre class="programlisting">
import sys, os, re, unittest <img id="regression.import.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.import.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This imports four modules at once: <code class="filename">sys</code> (for system functions and access to the command line parameters), <code class="filename">os</code> (for operating system functions like directory listings), <code class="filename">re</code> (for regular expressions), and <code class="filename">unittest</code> (for unit testing).
</td>
</tr>
</table>
<p>Now let's do the same thing, but with dynamic imports.
<div class="example"><h3>Example 16.14. Importing modules dynamically</h3><pre class="screen">
<samp class="prompt">>>> </samp>sys = __import__('sys') <img id="regression.import.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>os = __import__('os')
<samp class="prompt">>>> </samp>re = __import__('re')
<samp class="prompt">>>> </samp>unittest = __import__('unittest')
<samp class="prompt">>>> </samp>sys <img id="regression.import.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
&lt;module 'sys' (built-in)>
<samp class="prompt">>>> </samp>os
&lt;module 'os' from '/usr/local/lib/python2.2/os.pyc'>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.import.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The built-in <code class="function">__import__</code> function accomplishes the same goal as using the <code>import</code> statement, but it's an actual function, and it takes a string as an argument.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.import.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The variable <code class="varname">sys</code> is now the <code class="filename">sys</code> module, just as if you had said <code>import sys</code>. The variable <code class="varname">os</code> is now the <code class="filename">os</code> module, and so forth.
</td>
</tr>
</table>
<p>So <code class="function">__import__</code> imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string,
but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module
to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.
<div class="example"><h3>Example 16.15. Importing a list of modules dynamically</h3><pre class="screen">
<samp class="prompt">>>> </samp>moduleNames = ['sys', 'os', 're', 'unittest'] <img id="regression.import.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>moduleNames
['sys', 'os', 're', 'unittest']
<samp class="prompt">>>> </samp>modules = map(__import__, moduleNames) <img id="regression.import.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>modules <img id="regression.import.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="computeroutput">[&lt;module 'sys' (built-in)>,
&lt;module 'os' from 'c:\Python22\lib\os.pyc'>,
&lt;module 're' from 'c:\Python22\lib\re.pyc'>,
&lt;module 'unittest' from 'c:\Python22\lib\unittest.pyc'>]</samp>
<samp class="prompt">>>> </samp>modules[0].version <img id="regression.import.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
<samp class="prompt">>>> </samp>import sys
<samp class="prompt">>>> </samp>sys.version
'2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.import.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">moduleNames</code> is just a list of strings. Nothing fancy, except that the strings happen to be names of modules that you could import, if
you wanted to.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.import.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Surprise, you wanted to import them, and you did, by mapping the <code class="function">__import__</code> function onto the list. Remember, this takes each element of the list (<code class="varname">moduleNames</code>) and calls the function (<code class="function">__import__</code>) over and over, once with each element of the list, builds a list of the return values, and returns the result.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.import.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">So now from a list of strings, you've created a list of actual modules. (Your paths may be different, depending on your operating
system, where you installed Python, the phase of the moon, etc.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.import.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To drive home the point that these are real modules, let's look at some module attributes. Remember, <code class="varname">modules[0]</code> <em>is</em> the <code class="filename">sys</code> module, so <code class="varname">modules[0].version</code> <em>is</em> <code class="varname">sys.version</code>. All the other attributes and methods of these modules are also available. There's nothing magic about the <code>import</code> statement, and there's nothing magic about modules. Modules are objects. Everything is an object.
</td>
</tr>
</table>
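<p>One caveat is worth a quick sketch. When the name contains a dot, <code class="function">__import__</code> returns the top-level package, not the submodule you named. The standard library's <code class="function">importlib.import_module</code> function (available in Python 2.7 and later, so newer than the interpreter used in these examples) returns the submodule itself. The choice of <code class="filename">os.path</code> below is just an illustration.

```python
import importlib
import os.path

# __import__ with a dotted name returns the top-level package...
top = __import__('os.path')

# ...while importlib.import_module returns the submodule you asked for.
sub = importlib.import_module('os.path')

print(top.__name__)
print(sub is os.path)
```

<p>For plain module names like the ones in this chapter, the two behave the same way.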
<p>Now you should be able to put this all together and figure out what most of this chapter's code sample is doing.
<h2 id="regression.alltogether">16.7. Putting it all together</h2>
<p>You've learned enough now to deconstruct the first seven lines of this chapter's code sample: reading a directory and importing
selected modules within it.
<div class="example"><h3>Example 16.16. The <code class="function">regressionTest</code> function</h3><pre class="programlisting">
def regressionTest():
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    files = os.listdir(path)
    test = re.compile("test\.py$", re.IGNORECASE)
    files = filter(test.search, files)
    filenameToModuleName = lambda f: os.path.splitext(f)[0]
    moduleNames = map(filenameToModuleName, files)
    modules = map(__import__, moduleNames)
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite(map(load, modules))
</pre><p>Let's look at it line by line, interactively. Assume that the current directory is <code class="filename">c:\diveintopython3\py</code>, which contains the examples that come with this book, including this chapter's script. As you saw in <a href="#regression.path" title="16.2. Finding the path">Section 16.2, &#8220;Finding the path&#8221;</a>, the script directory will end up in the <code class="varname">path</code> variable, so let's hard-code that and go from there.
<div class="example"><h3>Example 16.17. Step 1: Get all the files</h3><pre class="screen">
<samp class="prompt">>>> </samp>import sys, os, re, unittest
<samp class="prompt">>>> </samp>path = r'c:\diveintopython3\py'
<samp class="prompt">>>> </samp>files = os.listdir(path)
<samp class="prompt">>>> </samp>files <img id="regression.alltogether.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">['BaseHTMLProcessor.py', 'LICENSE.txt', 'apihelper.py', 'apihelpertest.py',
'argecho.py', 'autosize.py', 'builddialectexamples.py', 'dialect.py',
'fileinfo.py', 'fullpath.py', 'kgptest.py', 'makerealworddoc.py',
'odbchelper.py', 'odbchelpertest.py', 'parsephone.py', 'piglatin.py',
'plural.py', 'pluraltest.py', 'pyfontify.py', 'regression.py', 'roman.py', 'romantest.py',
'uncurly.py', 'unicode2koi8r.py', 'urllister.py', 'kgp', 'plural', 'roman',
'colorize.py']</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">files</code> is a list of all the files and directories in the script's directory. (If you've been running some of the examples already,
you may also see some <code class="filename">.pyc</code> files in there as well.)
</td>
</tr>
</table>
<div class="example"><h3>Example 16.18. Step 2: Filter to find the files you care about</h3><pre class="screen">
<samp class="prompt">>>> </samp>test = re.compile("test\.py$", re.IGNORECASE) <img id="regression.alltogether.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>files = filter(test.search, files) <img id="regression.alltogether.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>files <img id="regression.alltogether.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
['apihelpertest.py', 'kgptest.py', 'odbchelpertest.py', 'pluraltest.py', 'romantest.py']
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This regular expression will match any string that ends with <code>test.py</code>. Note that you need to escape the period, since a period in a regular expression usually means &#8220;match any single character&#8221;, but you actually want to match a literal period instead.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
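<td valign="top" align="left">The <code class="function">search</code> method of the compiled regular expression is a function like any other, so you can pass it to <code class="function">filter</code> to go through the large list of files and directories
and keep only the ones that match the regular expression.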
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And you're left with the list of unit testing scripts, because they were the only ones named <code class="filename">SOMETHINGtest.py</code>.
</td>
</tr>
</table>
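<p>Here is a minimal sketch of why this works. The <code class="function">search</code> method returns a match object (which counts as true) or <code>None</code> (which counts as false), and that is all <code class="function">filter</code> needs. The file list is a shortened version of the one above, with one made-up upper-case name and one made-up text file added to show the effect of <code>re.IGNORECASE</code> and of the <code>$</code>. The raw string and the <code>list()</code> call are assumptions for running this under Python 3.

```python
import re

# Shortened file list, plus two made-up names for illustration.
files = ['apihelper.py', 'apihelpertest.py', 'LICENSE.txt',
         'roman.py', 'romantest.py', 'ROMANTEST.PY', 'testnotes.txt']

# Raw string, so the backslash reaches the regular expression engine as-is.
test = re.compile(r"test\.py$", re.IGNORECASE)

# test.search(name) returns a match object (true) or None (false).
found = test.search('romantest.py')
missed = test.search('roman.py')

# filter keeps the names for which test.search returned something true.
matches = list(filter(test.search, files))
print(matches)
```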
<div class="example"><h3>Example 16.19. Step 3: Map filenames to module names</h3><pre class="screen">
<samp class="prompt">>>> </samp>filenameToModuleName = lambda f: os.path.splitext(f)[0] <img id="regression.alltogether.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>filenameToModuleName('romantest.py') <img id="regression.alltogether.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'romantest'
<samp class="prompt">>>> </samp>filenameToModuleName('odbchelpertest.py')
'odbchelpertest'
<samp class="prompt">>>> </samp>moduleNames = map(filenameToModuleName, files) <img id="regression.alltogether.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>moduleNames <img id="regression.alltogether.3.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
['apihelpertest', 'kgptest', 'odbchelpertest', 'pluraltest', 'romantest']
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#apihelper.lambda" title="4.7. Using lambda Functions">Section 4.7, &#8220;Using lambda Functions&#8221;</a>, <code>lambda</code> is a quick-and-dirty way of creating an inline, one-line function. This one takes a filename with an extension and returns
just the filename part, using the standard library function <code class="function">os.path.splitext</code> that you saw in <a href="#splittingpathnames.example" title="Example 6.17. Splitting Pathnames">Example 6.17, &#8220;Splitting Pathnames&#8221;</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">filenameToModuleName</code> is a function. There's nothing magic about <code>lambda</code> functions as opposed to regular functions that you define with a <code>def</code> statement. You can call the <code class="varname">filenameToModuleName</code> function like any other, and it does just what you wanted it to do: strips the file extension off of its argument.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can apply this function to each file in the list of unit test files, using <code class="function">map</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.3.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And the result is just what you wanted: a list of module names, as strings.</td>
</tr>
</table>
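<p>A short sketch of what is going on inside that <code>lambda</code>: <code class="function">os.path.splitext</code> returns a two-element tuple, and the <code>[0]</code> keeps the first element. The list comprehension at the end is an equivalent spelling of the same mapping, shown only for comparison; the <code>list()</code> call assumes Python 3, where <code class="function">map</code> returns an iterator.

```python
import os.path

files = ['apihelpertest.py', 'kgptest.py', 'romantest.py']

# splitext returns a (root, extension) tuple; [0] keeps the root.
parts = os.path.splitext('romantest.py')

# The same transformation, with map and as a list comprehension.
names_by_map = list(map(lambda f: os.path.splitext(f)[0], files))
names_by_comprehension = [os.path.splitext(f)[0] for f in files]

print(parts)
print(names_by_map == names_by_comprehension)
```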
<div class="example"><h3>Example 16.20. Step 4: Mapping module names to modules</h3><pre class="screen">
<samp class="prompt">>>> </samp>modules = map(__import__, moduleNames)<img id="regression.alltogether.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>modules <img id="regression.alltogether.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="computeroutput">[&lt;module 'apihelpertest' from 'apihelpertest.py'>,
&lt;module 'kgptest' from 'kgptest.py'>,
&lt;module 'odbchelpertest' from 'odbchelpertest.py'>,
&lt;module 'pluraltest' from 'pluraltest.py'>,
&lt;module 'romantest' from 'romantest.py'>]</samp>
<samp class="prompt">>>> </samp>modules[-1] <img id="regression.alltogether.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;module 'romantest' from 'romantest.py'>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw in <a href="#regression.import" title="16.6. Dynamically importing modules">Section 16.6, &#8220;Dynamically importing modules&#8221;</a>, you can use a combination of <code class="function">map</code> and <code class="function">__import__</code> to map a list of module names (as strings) into actual modules (which you can call or access like any other module).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">modules</code> is now a list of modules, fully accessible like any other module.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The last module in the list <em>is</em> the <code class="filename">romantest</code> module, just as if you had said <code>import romantest</code>.
</td>
</tr>
</table>
<div class="example"><h3>Example 16.21. Step 5: Loading the modules into a test suite</h3><pre class="screen">
<samp class="prompt">>>> </samp>load = unittest.defaultTestLoader.loadTestsFromModule
<samp class="prompt">>>> </samp>map(load, modules) <img id="regression.alltogether.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="computeroutput">[&lt;unittest.TestSuite tests=[
&lt;unittest.TestSuite tests=[&lt;apihelpertest.BadInput testMethod=testNoObject>]>,
&lt;unittest.TestSuite tests=[&lt;apihelpertest.KnownValues testMethod=testApiHelper>]>,
&lt;unittest.TestSuite tests=[
&lt;apihelpertest.ParamChecks testMethod=testCollapse>,
&lt;apihelpertest.ParamChecks testMethod=testSpacing>]>,
...
]
]</samp>
<samp class="prompt">>>> </samp>unittest.TestSuite(map(load, modules)) <img id="regression.alltogether.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">These are real module objects. Not only can you access them like any other module, instantiate classes and call functions,
you can also introspect into the module to figure out which classes and functions it has in the first place. That's what
the <code class="function">loadTestsFromModule</code> method does: it introspects into each module and returns a <code>unittest.TestSuite</code> object for each module. Each <code>TestSuite</code> object actually contains a list of <code>TestSuite</code> objects, one for each <code>TestCase</code> class in your module, and each of those <code>TestSuite</code> objects contains a list of tests, one for each test method in that class.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Finally, you wrap the list of <code>TestSuite</code> objects into one big test suite. The <code class="filename">unittest</code> module has no problem traversing this tree of nested test suites within test suites; eventually it gets down to an individual
test method and executes it, verifies that it passes or fails, and moves on to the next one.
</td>
</tr>
</table>
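<p>If you want to watch the nesting at work without the book's example files, here is a self-contained sketch. The <code>Arithmetic</code> test case and the <code>arithmetictest</code> module name are invented for illustration, and the module object is built by hand with <code>types.ModuleType</code> instead of being imported from a file.

```python
import io
import types
import unittest


class Arithmetic(unittest.TestCase):
    """A made-up TestCase class with two test methods."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)

    def test_multiplication(self):
        self.assertEqual(2 * 3, 6)


# Stand-in for an imported test module (hypothetical name).
fake_module = types.ModuleType('arithmetictest')
fake_module.Arithmetic = Arithmetic

load = unittest.defaultTestLoader.loadTestsFromModule
per_module = [load(m) for m in [fake_module]]  # one TestSuite per module
big_suite = unittest.TestSuite(per_module)     # suites nested in one suite

# Run the nested suite, sending the report to a string buffer.
runner = unittest.TextTestRunner(stream=io.StringIO())
result = runner.run(big_suite)
print(big_suite.countTestCases(), result.wasSuccessful())
```

<p>The loader finds both test methods by introspection, and the runner walks the nested suites down to each one.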
<p>This introspection process is what the <code class="filename">unittest</code> module usually does for us. Remember that magic-looking <code>unittest.main()</code> function that our individual test modules called to kick the whole thing off? <code class="function">unittest.main()</code> actually creates an instance of <code>unittest.TestProgram</code>, which in turn takes <code>unittest.defaultTestLoader</code> (a ready-made instance of <code>unittest.TestLoader</code>) and points it at the module that called it. (How does it get a reference to the module that called it if you don't give
it one? By using the equally-magic <code>__import__('__main__')</code> command, which dynamically imports the currently-running module. I could write a book on all the tricks and techniques used
in the <code class="filename">unittest</code> module, but then I'd never finish this one.)
<div class="example"><h3>Example 16.22. Step 6: Telling <code class="filename">unittest</code> to use your test suite</h3><pre class="programlisting">
if __name__ == "__main__":
    unittest.main(defaultTest="regressionTest") <img id="regression.alltogether.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#regression.alltogether.6.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Instead of letting the <code class="filename">unittest</code> module do all its magic for us, you've done most of it yourself. You've created a function (<code class="function">regressionTest</code>) that imports the modules yourself, calls <code>unittest.defaultTestLoader</code> yourself, and wraps it all up in a test suite. Now all you need to do is tell <code class="filename">unittest</code> that, instead of looking for tests and building a test suite in the usual way, it should just call the <code class="function">regressionTest</code> function, which returns a ready-to-use <code>TestSuite</code>.
</td>
</tr>
</table>
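<p>To tie the steps together, here is a sketch of the same pipeline that you can run anywhere. It is not the chapter's script: the <code>regression_suite</code> function takes the directory as a parameter instead of reading <code>sys.argv</code>, it uses list comprehensions so that it behaves the same under Python 3, and it adds the directory to <code>sys.path</code> so the modules can be imported. The throwaway directory, the <code>demoexampletest.py</code> module, and <code>helper.py</code> are all invented for the demonstration.

```python
import os
import re
import sys
import tempfile
import unittest


def regression_suite(path):
    """Build one TestSuite from every *test.py module found in path."""
    test = re.compile(r"test\.py$", re.IGNORECASE)
    files = [f for f in os.listdir(path) if test.search(f)]          # filter
    module_names = [os.path.splitext(f)[0] for f in files]           # map
    # Assumption: the directory must be on the import search path.
    if path not in sys.path:
        sys.path.insert(0, path)
    modules = [__import__(name) for name in module_names]            # map
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite([load(m) for m in modules])            # map


# Throwaway directory with one made-up test module and one other file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demoexampletest.py"), "w") as f:
    f.write(
        "import unittest\n"
        "class Demo(unittest.TestCase):\n"
        "    def test_truth(self):\n"
        "        self.assertTrue(True)\n"
    )
with open(os.path.join(demo_dir, "helper.py"), "w") as f:
    f.write("VALUE = 1\n")

suite = regression_suite(demo_dir)
print(suite.countTestCases())
```

<p>Only the file whose name ends in <code>test.py</code> is imported; <code class="filename">helper.py</code> is filtered out before the import step ever sees it.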
<h2 id="regression.summary">16.8. Summary</h2>
<p>The <code class="filename">regression.py</code> program and its output should now make perfect sense.
<p>You should now feel comfortable doing all of these things:
<div class="itemizedlist">
<ul>
<li>Manipulating <a href="#regression.path" title="16.2. Finding the path">path information</a> from the command line.
<li>Filtering lists <a href="#regression.filter" title="16.3. Filtering lists revisited">using <code class="function">filter</code></a> instead of list comprehensions.
<li>Mapping lists <a href="#regression.map" title="16.4. Mapping lists revisited">using <code class="function">map</code></a> instead of list comprehensions.
<li>Dynamically <a href="#regression.import" title="16.6. Dynamically importing modules">importing modules</a>.
</ul>
<div class="footnotes"><br><hr width="100" align="left">
<div class="footnote">
<p><sup>[<a name="ftn.d0e35697" href="#d0e35697">7</a>] </sup>Technically, the second argument to <code class="function">filter</code> can be any sequence, including lists, tuples, and custom classes that act like lists by defining the <code class="function">__getitem__</code> special method. If possible, <code class="function">filter</code> will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.
<div class="footnote">
<p><sup>[<a name="ftn.d0e36079" href="#d0e36079">8</a>] </sup>Again, I should point out that <code class="function">map</code> can take a list, a tuple, or any object that acts like a sequence. See previous footnote about <code class="function">filter</code>.
<div class="chapter">
<h2 id="plural">Chapter 17. Dynamic functions</h2>
<h2 id="plural.divein">17.1. Diving in</h2>
<p>I want to talk about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators.
Generators are new in Python 2.3. But first, let's talk about how to make plural nouns.
<p>If you haven't read <a href="#re" title="Chapter 7. Regular Expressions">Chapter 7, <i>Regular Expressions</i></a>, now would be a good time. This chapter assumes you understand the basics of regular expressions, and quickly descends into
more advanced uses.
<p>English is a schizophrenic language that borrows from a lot of other languages, and the rules for making singular nouns into
plural nouns are varied and complex. There are rules, and then there are exceptions to those rules, and then there are exceptions
to the exceptions.
<p>If you grew up in an English-speaking country or learned English in a formal school setting, you're probably familiar with
the basic rules:
<div class="orderedlist">
<ol>
<li>If a word ends in S, X, or Z, add ES. &#8220;Bass&#8221; becomes &#8220;basses&#8221;, &#8220;fax&#8221; becomes &#8220;faxes&#8221;, and &#8220;waltz&#8221; becomes &#8220;waltzes&#8221;.
<li>If a word ends in a noisy H, add ES; if it ends in a silent H, just add S. What's a noisy H? One that gets combined with
other letters to make a sound that you can hear. So &#8220;coach&#8221; becomes &#8220;coaches&#8221; and &#8220;rash&#8221; becomes &#8220;rashes&#8221;, because you can hear the CH and SH sounds when you say them. But &#8220;cheetah&#8221; becomes &#8220;cheetahs&#8221;, because the H is silent.
<li>If a word ends in Y that sounds like I, change the Y to IES; if the Y is combined with a vowel to sound like something else,
just add S. So &#8220;vacancy&#8221; becomes &#8220;vacancies&#8221;, but &#8220;day&#8221; becomes &#8220;days&#8221;.
<li>If all else fails, just add S and hope for the best.
</ol>
<p>(I know, there are a lot of exceptions. &#8220;Man&#8221; becomes &#8220;men&#8221; and &#8220;woman&#8221; becomes &#8220;women&#8221;, but &#8220;human&#8221; becomes &#8220;humans&#8221;. &#8220;Mouse&#8221; becomes &#8220;mice&#8221; and &#8220;louse&#8221; becomes &#8220;lice&#8221;, but &#8220;house&#8221; becomes &#8220;houses&#8221;. &#8220;Knife&#8221; becomes &#8220;knives&#8221; and &#8220;wife&#8221; becomes &#8220;wives&#8221;, but &#8220;lowlife&#8221; becomes &#8220;lowlifes&#8221;. And don't even get me started on words that are their own plural, like &#8220;sheep&#8221;, &#8220;deer&#8221;, and &#8220;haiku&#8221;.)
<p>Other languages are, of course, completely different.
<p>Let's design a module that pluralizes nouns. Start with just English nouns, and just these four rules, but keep in mind that
you'll inevitably need to add more rules, and you may eventually need to add more languages.
<h2 id="plural.stage1">17.2. <code class="filename">plural.py</code>, stage 1</h2>
<p>So you're looking at words, which at least in English are strings of characters. And you have rules that say you need to
find different combinations of characters, and then do different things to them. This sounds like a job for regular expressions.
<div class="example"><h3>Example 17.1. <code class="filename">plural1.py</code></h3><pre class="programlisting">
import re
def plural(noun):
    if re.search('[sxz]$', noun): <img id="plural.stage1.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        return re.sub('$', 'es', noun) <img id="plural.stage1.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    elif re.search('[^aeioudgkprt]h$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">OK, this is a regular expression, but it uses a syntax you didn't see in <a href="#re" title="Chapter 7. Regular Expressions">Chapter 7, <i>Regular Expressions</i></a>. The square brackets mean &#8220;match exactly one of these characters&#8221;. So <code>[sxz]</code> means &#8220;<code>s</code>, or <code>x</code>, or <code>z</code>&#8221;, but only one of them. The <code>$</code> should be familiar; it matches the end of string. So you're checking to see if <code class="varname">noun</code> ends with <code>s</code>, <code>x</code>, or <code>z</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This <code class="function">re.sub</code> function performs regular expression-based string substitutions. Let's look at it in more detail.
</td>
</tr>
</table>
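<p>Before digging into <code class="function">re.sub</code>, a quick sanity check may help. This sketch repeats the function from Example 17.1 and runs it against the example words from the rules in the previous section. The <code>examples</code> table is an illustration added here, not part of <code class="filename">plural1.py</code>.

```python
import re


def plural(noun):
    # Function body as in Example 17.1.
    if re.search('[sxz]$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeioudgkprt]h$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'


# Singular/plural pairs taken from the four rules in section 17.1.
examples = {
    'bass': 'basses', 'fax': 'faxes', 'waltz': 'waltzes',
    'coach': 'coaches', 'rash': 'rashes', 'cheetah': 'cheetahs',
    'vacancy': 'vacancies', 'day': 'days',
}

results = {noun: plural(noun) for noun in examples}
print(results == examples)
```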
<div class="example"><h3>Example 17.2. Introducing <code class="function">re.sub</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>import re
<samp class="prompt">>>> </samp>re.search('[abc]', 'Mark') <img id="plural.stage1.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x001C1FA8>
<samp class="prompt">>>> </samp>re.sub('[abc]', 'o', 'Mark') <img id="plural.stage1.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'Mork'
<samp class="prompt">>>> </samp>re.sub('[abc]', 'o', 'rock') <img id="plural.stage1.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'rook'
<samp class="prompt">>>> </samp>re.sub('[abc]', 'o', 'caps') <img id="plural.stage1.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'oops'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Does the string <code>Mark</code> contain <code>a</code>, <code>b</code>, or <code>c</code>? Yes, it contains <code>a</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">OK, now find <code>a</code>, <code>b</code>, or <code>c</code>, and replace it with <code>o</code>. <code>Mark</code> becomes <code>Mork</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The same function turns <code>rock</code> into <code>rook</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You might think this would turn <code>caps</code> into <code>oaps</code>, but it doesn't. <code>re.sub</code> replaces <em>all</em> of the matches, not just the first one. So this regular expression turns <code>caps</code> into <code>oops</code>, because both the <code>c</code> and the <code>a</code> get turned into <code>o</code>.
</td>
</tr>
</table>
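<p>If you ever do want only the first match replaced, <code class="function">re.sub</code> accepts an optional <code>count</code> argument. The following is a small sketch in Python 3 syntax, not part of the <code class="function">plural</code> function; the variable names are just for illustration.

```python
import re

# By default, re.sub replaces every match of the pattern.
all_replaced = re.sub('[abc]', 'o', 'caps')

# The optional count argument limits how many matches get replaced.
first_only = re.sub('[abc]', 'o', 'caps', count=1)

print(all_replaced)   # oops
print(first_only)     # oaps
```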
<div class="example"><h3>Example 17.3. Back to <code class="filename">plural1.py</code></h3><pre class="programlisting">
import re
def plural(noun):
    if re.search('[sxz]$', noun):
        return re.sub('$', 'es', noun) <img id="plural.stage1.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    elif re.search('[^aeioudgkprt]h$', noun): <img id="plural.stage1.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun): <img id="plural.stage1.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Back to the <code class="function">plural</code> function. What are you doing? You're replacing the end of string with <code>es</code>. In other words, adding <code>es</code> to the string. You could accomplish the same thing with string concatenation, for example <code>noun + 'es'</code>, but I'm using regular expressions for everything, both for consistency and for reasons that will become clear later in the chapter.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Look closely, this is another new variation. The <code>^</code> as the first character inside the square brackets means something special: negation. <code>[^abc]</code> means &#8220;any single character <em>except</em> <code>a</code>, <code>b</code>, or <code>c</code>&#8221;. So <code>[^aeioudgkprt]</code> means any character except <code>a</code>, <code>e</code>, <code>i</code>, <code>o</code>, <code>u</code>, <code>d</code>, <code>g</code>, <code>k</code>, <code>p</code>, <code>r</code>, or <code>t</code>. Then that character needs to be followed by <code>h</code>, followed by end of string. You're looking for words that end in H where the H can be heard.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Same pattern here: match words that end in Y, where the character before the Y is <em>not</em> <code>a</code>, <code>e</code>, <code>i</code>, <code>o</code>, or <code>u</code>. You're looking for words that end in Y that sounds like I.
</td>
</tr>
</table>
<div class="example"><h3>Example 17.4. More on negation regular expressions</h3><pre class="screen">
<samp class="prompt">>>> </samp>import re
<samp class="prompt">>>> </samp>re.search('[^aeiou]y$', 'vacancy') <img id="plural.stage1.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
&lt;_sre.SRE_Match object at 0x001C1FA8>
<samp class="prompt">>>> </samp>re.search('[^aeiou]y$', 'boy') <img id="plural.stage1.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
<samp class="prompt">>>> </samp>re.search('[^aeiou]y$', 'day')
<samp class="prompt">>>> </samp>
<samp class="prompt">>>> </samp>re.search('[^aeiou]y$', 'pita') <img id="plural.stage1.4.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>vacancy</code> matches this regular expression, because it ends in <code>cy</code>, and <code>c</code> is not <code>a</code>, <code>e</code>, <code>i</code>, <code>o</code>, or <code>u</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>boy</code> does not match, because it ends in <code>oy</code>, and you specifically said that the character before the <code>y</code> could not be <code>o</code>. <code>day</code> does not match, because it ends in <code>ay</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.4.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>pita</code> does not match, because it does not end in <code>y</code>.
</td>
</tr>
</table>
<div class="example"><h3>Example 17.5. More on <code class="function">re.sub</code></h3><pre class="screen">
<samp class="prompt">>>> </samp>re.sub('y$', 'ies', 'vacancy') <img id="plural.stage1.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
'vacancies'
<samp class="prompt">>>> </samp>re.sub('y$', 'ies', 'agency')
'agencies'
<samp class="prompt">>>> </samp>re.sub('([^aeiou])y$', r'\1ies', 'vacancy') <img id="plural.stage1.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'vacancies'
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This regular expression turns <code>vacancy</code> into <code>vacancies</code> and <code>agency</code> into <code>agencies</code>, which is what you wanted. Note that it would also turn <code>boy</code> into <code>boies</code>, but that will never happen in the function because you did that <code class="function">re.search</code> first to find out whether you should do this <code class="function">re.sub</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage1.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Just in passing, I want to point out that it is possible to combine these two regular expressions (one to find out if the
rule applies, and another to actually apply it) into a single regular expression. Here's what that would look like. Most
of it should look familiar: you're using a remembered group, which you learned in <a href="#re.phone" title="7.6. Case study: Parsing Phone Numbers">Section 7.6, &#8220;Case study: Parsing Phone Numbers&#8221;</a>, to remember the character before the <code>y</code>. Then in the substitution string, you use a new syntax, <code>\1</code>, which means &#8220;hey, that first group you remembered? put it here&#8221;. In this case, you remember the <code>c</code> before the <code>y</code>, and then when you do the substitution, you substitute <code>c</code> in place of <code>c</code>, and <code>ies</code> in place of <code>y</code>. (If you have more than one remembered group, you can use <code>\2</code> and <code>\3</code> and so on.)
</td>
</tr>
</table>
<p>Regular expression substitutions are extremely powerful, and the <code>\1</code> syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder
to read, and it doesn't directly map to the way you first described the pluralizing rules. You originally laid out rules
like &#8220;if the word ends in S, X, or Z, then add ES&#8221;. And if you look at this function, you have two lines of code that say &#8220;if the word ends in S, X, or Z, then add ES&#8221;. It doesn't get much more direct than that.
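<p>For comparison, here is a rough sketch of the combined approach in Python 3 syntax. The helper name <code class="function">pluralize_y</code> is made up for this illustration. Notice that when the pattern doesn't match, <code class="function">re.sub</code> simply hands the word back unchanged, so the caller can't tell whether the rule applied; <code class="function">re.subn</code> also reports the number of substitutions made.

```python
import re

def pluralize_y(noun):
    # Hypothetical helper: one regular expression both tests for and applies
    # the "consonant + y" rule. The group remembers the consonant before
    # the y, and \1 puts it back in the replacement.
    return re.sub('([^aeiou])y$', r'\1ies', noun)

print(pluralize_y('vacancy'))  # vacancies
print(pluralize_y('boy'))      # boy (no match, so the word comes back unchanged)

# re.subn returns a (new_string, number_of_substitutions) tuple.
print(re.subn('([^aeiou])y$', r'\1ies', 'boy'))  # ('boy', 0)
```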
<h2 id="plural.stage2">17.3. <code class="filename">plural.py</code>, stage 2</h2>
<p>Now you're going to add a level of abstraction. You started by defining a list of rules: if this, then do that, otherwise
go to the next rule. Let's temporarily complicate part of the program so you can simplify another part.
<div class="example"><h3>Example 17.6. <code class="filename">plural2.py</code></h3><pre class="programlisting">
import re
def match_sxz(noun):
    return re.search('[sxz]$', noun)
def apply_sxz(noun):
    return re.sub('$', 'es', noun)
def match_h(noun):
    return re.search('[^aeioudgkprt]h$', noun)
def apply_h(noun):
    return re.sub('$', 'es', noun)
def match_y(noun):
    return re.search('[^aeiou]y$', noun)
def apply_y(noun):
    return re.sub('y$', 'ies', noun)
def match_default(noun):
    return 1
def apply_default(noun):
    return noun + 's'
rules = ((match_sxz, apply_sxz),
         (match_h, apply_h),
         (match_y, apply_y),
         (match_default, apply_default)
         ) <img id="plural.stage2.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
def plural(noun):
    for matchesRule, applyRule in rules: <img id="plural.stage2.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        if matchesRule(noun): <img id="plural.stage2.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
            return applyRule(noun) <img id="plural.stage2.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage2.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This version looks more complicated (it's certainly longer), but it does exactly the same thing: try to match four different
rules, in order, and apply the appropriate regular expression when a match is found. The difference is that each individual
match and apply rule is defined in its own function, and the functions are then listed in this <code class="varname">rules</code> variable, which is a tuple of tuples.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage2.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Using a <code>for</code> loop, you can pull out the match and apply rules two at a time (one match, one apply) from the <code class="varname">rules</code> tuple. On the first iteration of the <code>for</code> loop, <code class="varname">matchesRule</code> will get <code class="function">match_sxz</code>, and <code class="varname">applyRule</code> will get <code class="function">apply_sxz</code>. On the second iteration (assuming you get that far), <code class="varname">matchesRule</code> will be assigned <code class="function">match_h</code>, and <code class="varname">applyRule</code> will be assigned <code class="function">apply_h</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage2.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Remember that <a href="#odbchelper.objects" title="2.4. Everything Is an Object">everything in Python is an object</a>, including functions. <code class="varname">rules</code> contains actual functions; not names of functions, but actual functions. When they get assigned in the <code>for</code> loop, then <code class="varname">matchesRule</code> and <code class="varname">applyRule</code> are actual functions that you can call. So on the first iteration of the <code>for</code> loop, this is equivalent to calling <code class="function">match_sxz(noun)</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage2.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">On the first iteration of the <code>for</code> loop, this is equivalent to calling <code class="function">apply_sxz(noun)</code>, and so forth.
</td>
</tr>
</table>
<p>If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. This <code>for</code> loop is equivalent to the following:
<div class="example"><h3>Example 17.7. Unrolling the <code class="function">plural</code> function</h3><pre class="programlisting">
def plural(noun):
    if match_sxz(noun):
        return apply_sxz(noun)
    if match_h(noun):
        return apply_h(noun)
    if match_y(noun):
        return apply_y(noun)
    if match_default(noun):
        return apply_default(noun)
</pre><p>The benefit here is that the <code class="function">plural</code> function is now simplified. It takes a list of rules, defined elsewhere, and iterates through them in a generic fashion.
Get a match rule; does it match? Then call the apply rule. The rules could be defined anywhere, in any way. The <code class="function">plural</code> function doesn't care.
<p>Now, was adding this level of abstraction worth it? Well, not yet. Consider what it would take to add a new rule to
the function. In the previous example, it would require adding an <code>if</code> statement to the <code class="function">plural</code> function. In this example, it would require adding two functions, <code class="function">match_foo</code> and <code class="function">apply_foo</code>, and then updating the <code class="varname">rules</code> tuple to specify where in the order the new match and apply functions should be called relative to the other rules.
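<p>To make that concrete, here is a trimmed-down sketch in Python 3 syntax. The <code class="function">match_fe</code> and <code class="function">apply_fe</code> functions are a hypothetical rule invented for this illustration (and a deliberately naive one: it handles <code>knife</code> but would mangle <code>safe</code>). The H and Y rules are left out only to keep the sketch short.

```python
import re

def match_sxz(noun):
    return re.search('[sxz]$', noun)

def apply_sxz(noun):
    return re.sub('$', 'es', noun)

# Hypothetical new rule, not in the original program: "knife" -> "knives".
def match_fe(noun):
    return re.search('fe$', noun)

def apply_fe(noun):
    return re.sub('fe$', 'ves', noun)

def match_default(noun):
    return True

def apply_default(noun):
    return noun + 's'

# The position in this tuple decides the order in which rules are tried,
# so the new rule has to go before the catch-all default.
rules = ((match_sxz, apply_sxz),
         (match_fe, apply_fe),
         (match_default, apply_default))

def plural(noun):
    # Unchanged: plural() never needs to know which rules exist.
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

print(plural('knife'))  # knives
print(plural('fax'))    # faxes
print(plural('cat'))    # cats
```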
<p>This is really just a stepping stone to the next section. Let's move on.
<h2 id="plural.stage3">17.4. <code class="filename">plural.py</code>, stage 3</h2>
<p>Defining separate named functions for each match and apply rule isn't really necessary. You never call them directly; you
define them in the <code class="varname">rules</code> list and call them through there. Let's streamline the rules definition by anonymizing those functions.
<div class="example"><h3>Example 17.8. <code class="filename">plural3.py</code></h3><pre class="programlisting">
import re
rules = \
    (
        (
            lambda word: re.search('[sxz]$', word),
            lambda word: re.sub('$', 'es', word)
        ),
        (
            lambda word: re.search('[^aeioudgkprt]h$', word),
            lambda word: re.sub('$', 'es', word)
        ),
        (
            lambda word: re.search('[^aeiou]y$', word),
            lambda word: re.sub('y$', 'ies', word)
        ),
        (
            lambda word: re.search('$', word),
            lambda word: re.sub('$', 's', word)
        )
    ) <img id="plural.stage3.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
def plural(noun):
    for matchesRule, applyRule in rules: <img id="plural.stage3.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        if matchesRule(noun):
            return applyRule(noun)
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage3.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the same set of rules as you defined in stage 2. The only difference is that instead of defining named functions
like <code class="function">match_sxz</code> and <code class="function">apply_sxz</code>, you have &#8220;inlined&#8221; those function definitions directly into the <code class="varname">rules</code> list itself, using <a href="#apihelper.lambda" title="4.7. Using lambda Functions">lambda functions</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage3.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Note that the <code class="function">plural</code> function hasn't changed at all. It iterates through a set of rule functions, checks the first rule, and if it returns a
true value, calls the second rule and returns the value. Same as above, word for word. The only difference is that the rule
functions were defined inline, anonymously, using lambda functions. But the <code class="function">plural</code> function doesn't care how they were defined; it just gets a list of rules and blindly works through them.
</td>
</tr>
</table>
<p>Now to add a new rule, all you need to do is define the functions directly in the <code class="varname">rules</code> list itself: one match rule, and one apply rule. But defining the rule functions inline like this makes it very clear that
you have some unnecessary duplication here. You have four pairs of functions, and they all follow the same pattern. The
match function is a single call to <code class="function">re.search</code>, and the apply function is a single call to <code class="function">re.sub</code>. Let's factor out these similarities.
<h2 id="plural.stage4">17.5. <code class="filename">plural.py</code>, stage 4</h2>
<p>Let's factor out the duplication in the code so that defining new rules can be easier.
<div class="example"><h3 id="plural.stage4.example.1">Example 17.9. <code class="filename">plural4.py</code></h3><pre class="programlisting">
import re
def buildMatchAndApplyFunctions((pattern, search, replace)):
    matchFunction = lambda word: re.search(pattern, word) <img id="plural.stage4.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    applyFunction = lambda word: re.sub(search, replace, word) <img id="plural.stage4.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    return (matchFunction, applyFunction) <img id="plural.stage4.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="function">buildMatchAndApplyFunctions</code> is a function that builds other functions dynamically. It takes <code class="varname">pattern</code>, <code class="varname">search</code> and <code class="varname">replace</code> (actually it takes a tuple, but more on that in a minute), and you can build the match function using the <code>lambda</code> syntax to be a function that takes one parameter (<code class="varname">word</code>) and calls <code class="function">re.search</code> with the <code class="varname">pattern</code> that was passed to the <code class="function">buildMatchAndApplyFunctions</code> function, and the <code class="varname">word</code> that was passed to the match function you're building. Whoa.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Building the apply function works the same way. The apply function is a function that takes one parameter, and calls <code class="function">re.sub</code> with the <code class="varname">search</code> and <code class="varname">replace</code> parameters that were passed to the <code class="function">buildMatchAndApplyFunctions</code> function, and the <code class="varname">word</code> that was passed to the apply function you're building. A dynamic function that uses the values of outside parameters
like this is called a <em>closure</em>. You're essentially defining constants within the apply function you're building: it takes one parameter (<code class="varname">word</code>), but it then acts on that plus two other values (<code class="varname">search</code> and <code class="varname">replace</code>) which were set when you defined the apply function.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Finally, the <code class="function">buildMatchAndApplyFunctions</code> function returns a tuple of two values: the two functions you just created. The constants you defined within those functions
(<code class="varname">pattern</code> within <code class="varname">matchFunction</code>, and <code class="varname">search</code> and <code class="varname">replace</code> within <code class="varname">applyFunction</code>) stay with those functions, even after you return from <code class="function">buildMatchAndApplyFunctions</code>. That's insanely cool.
</td>
</tr>
</table>
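<p>Here is a small sketch of the same closure idea that you can try on its own. It is written in Python 3 syntax and, as a simplification that is not in the original, takes three separate arguments instead of a tuple. The names <code class="varname">matches_y</code>, <code class="varname">apply_y</code>, <code class="varname">matches_sxz</code>, and <code class="varname">apply_sxz</code> are made up for the illustration.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    # Each lambda closes over the arguments of this particular call.
    match_function = lambda word: re.search(pattern, word)
    apply_function = lambda word: re.sub(search, replace, word)
    return (match_function, apply_function)

# Two separate calls produce two independent pairs of functions,
# each remembering its own pattern, search, and replace values.
matches_y, apply_y = build_match_and_apply_functions('[^aeiou]y$', 'y$', 'ies')
matches_sxz, apply_sxz = build_match_and_apply_functions('[sxz]$', '$', 'es')

print(bool(matches_y('vacancy')))  # True
print(apply_y('vacancy'))          # vacancies
print(apply_sxz('fax'))            # faxes

# The remembered value is still attached to the function after
# build_match_and_apply_functions has returned.
print(matches_y.__closure__[0].cell_contents)  # [^aeiou]y$
```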
<p>If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it.
<div class="example"><h3>Example 17.10. <code class="filename">plural4.py</code> continued</h3><pre class="programlisting">
patterns = \
    (
        ('[sxz]$', '$', 'es'),
        ('[^aeioudgkprt]h$', '$', 'es'),
        ('[^aeiou]y$', 'y$', 'ies'),
        ('$', '$', 's')
    ) <img id="plural.stage4.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
rules = map(buildMatchAndApplyFunctions, patterns) <img id="plural.stage4.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Our pluralization rules are now defined as a series of strings (not functions). The first string is the regular expression
that you would use in <code class="function">re.search</code> to see if this rule matches; the second and third are the search and replace expressions you would use in <code class="function">re.sub</code> to actually apply the rule to turn a noun into its plural.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This line is magic. It takes the tuples of strings in <code class="varname">patterns</code> and turns them into a list of functions. How? By mapping the <code class="function">buildMatchAndApplyFunctions</code> function onto <code class="varname">patterns</code>; that function takes three strings (packed into a single tuple) and returns a tuple of two functions. This means that <code class="varname">rules</code> ends up being exactly the same as the previous example: a list of tuples, where each tuple is a pair of functions, where
the first function is the match function that calls <code class="function">re.search</code>, and the second function is the apply function that calls <code class="function">re.sub</code>.
</td>
</tr>
</table>
<p>I swear I am not making this up: <code class="varname">rules</code> ends up with exactly the same list of functions as the previous example. Unroll the <code class="varname">rules</code> definition, and you'll get this:
<div class="example"><h3>Example 17.11. Unrolling the rules definition</h3><pre class="programlisting">
rules = \
    (
        (
            lambda word: re.search('[sxz]$', word),
            lambda word: re.sub('$', 'es', word)
        ),
        (
            lambda word: re.search('[^aeioudgkprt]h$', word),
            lambda word: re.sub('$', 'es', word)
        ),
        (
            lambda word: re.search('[^aeiou]y$', word),
            lambda word: re.sub('y$', 'ies', word)
        ),
        (
            lambda word: re.search('$', word),
            lambda word: re.sub('$', 's', word)
        )
    )
</pre><div class="example"><h3 id="plural.finishing.up">Example 17.12. <code class="filename">plural4.py</code>, finishing up</h3><pre class="programlisting">
def plural(noun):
    for matchesRule, applyRule in rules: <img id="plural.stage4.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        if matchesRule(noun):
            return applyRule(noun)
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since the <code class="varname">rules</code> list is the same as the previous example, it should come as no surprise that the <code class="function">plural</code> function hasn't changed. Remember, it's completely generic; it takes a list of rule functions and calls them in order.
It doesn't care how the rules are defined. In <a href="#plural.stage2" title="17.3. plural.py, stage 2">stage 2</a>, they were defined as separate named functions. In <a href="#plural.stage3" title="17.4. plural.py, stage 3">stage 3</a>, they were defined as anonymous <code>lambda</code> functions. Now in stage 4, they are built dynamically by mapping the <code class="function">buildMatchAndApplyFunctions</code> function onto a list of plain strings. Doesn't matter; the <code class="function">plural</code> function still works the same way.
</td>
</tr>
</table>
<p>Just in case that wasn't mind-blowing enough, I must confess that there was a subtlety in the definition of <code class="function">buildMatchAndApplyFunctions</code> that I skipped over. Let's go back and take another look.
<div class="example"><h3>Example 17.13. Another look at <code class="function">buildMatchAndApplyFunctions</code></h3><pre class="programlisting">
def buildMatchAndApplyFunctions((pattern, search, replace)): <img id="plural.stage4.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Notice the double parentheses? This function doesn't really take three parameters; it takes one parameter, a tuple
of three elements. But the tuple is expanded when the function is called, and the three elements of the tuple are each assigned
to different variables: <code class="varname">pattern</code>, <code class="varname">search</code>, and <code class="varname">replace</code>. Confused yet? Let's see it in action.
</td>
</tr>
</table>
<div class="example"><h3>Example 17.14. Expanding tuples when calling functions</h3><pre class="screen">
<samp class="prompt">>>> </samp>def foo((a, b, c)):
<samp class="prompt">... </samp>    print c
<samp class="prompt">... </samp>    print b
<samp class="prompt">... </samp>    print a
<samp class="prompt">>>> </samp>parameters = ('apple', 'bear', 'catnap')
<samp class="prompt">>>> </samp>foo(parameters) <img id="plural.stage4.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
catnap
bear
apple
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage4.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The proper way to call the function <code class="function">foo</code> is with a tuple of three elements. When the function is called, the elements are assigned to different local variables within
<code class="function">foo</code>.
</td>
</tr>
</table>
<p>Now let's go back and see why this auto-tuple-expansion trick was necessary. <code class="varname">patterns</code> was a list of tuples, and each tuple had three elements. When you called <code>map(buildMatchAndApplyFunctions, patterns)</code>, that means that <code class="function">buildMatchAndApplyFunctions</code> is <em>not</em> getting called with three parameters. Using <code class="function">map</code> to map a single list onto a function always calls the function with a single parameter: each element of the list. In the
case of <code class="varname">patterns</code>, each element of the list is a tuple, so <code class="function">buildMatchAndApplyFunctions</code> always gets called with the tuple, and you use the auto-tuple-expansion trick in the definition of <code class="function">buildMatchAndApplyFunctions</code> to assign the elements of that tuple to named variables that you can work with.
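<p>One caveat if you try this on a newer interpreter: tuple parameters in a <code>def</code> line were removed in Python 3 (PEP 3113), and <code class="function">map</code> there returns a lazy iterator instead of a list. The following is a rough Python 3 sketch of stage 4 under those assumptions; the parameter name <code class="varname">rule</code> and the underscore-style function names are made up here. The idea is unchanged: <code class="function">map</code> still passes one tuple per call, and the function unpacks it itself.

```python
import re

def build_match_and_apply_functions(rule):
    # Python 3 has no tuple parameters, so unpack the three-element
    # tuple explicitly in the function body.
    pattern, search, replace = rule
    return (lambda word: re.search(pattern, word),
            lambda word: re.sub(search, replace, word))

patterns = (
    ('[sxz]$', '$', 'es'),
    ('[^aeioudgkprt]h$', '$', 'es'),
    ('[^aeiou]y$', 'y$', 'ies'),
    ('$', '$', 's'),
)

# map() calls the function once per element, passing the whole tuple as
# a single argument. list() forces evaluation, because map() is lazy in
# Python 3 and plural() needs to loop over the rules more than once.
rules = list(map(build_match_and_apply_functions, patterns))

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

print(plural('coach'))    # coaches
print(plural('vacancy'))  # vacancies
print(plural('day'))      # days
```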
<h2 id="plural.stage5">17.6. <code class="filename">plural.py</code>, stage 5</h2>
<p>You've factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a
list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained
separately from the code that uses them.
<p>First, let's create a text file that contains the rules you want. No fancy data structures, just space- (or tab-)delimited
strings in three columns. You'll call it <code class="filename">rules.en</code>; &#8220;en&#8221; stands for English. These are the rules for pluralizing English nouns. You could add other rule files for other languages
later.
<div class="example"><h3>Example 17.15. <code class="filename">rules.en</code></h3><pre class="programlisting">
[sxz]$               $    es
[^aeioudgkprt]h$     $    es
[^aeiou]y$           y$   ies
$                    $    s
</pre><p>Now let's see how you can use this rules file.
<div class="example"><h3>Example 17.16. <code class="filename">plural5.py</code></h3><pre class="programlisting">
import re
import string
def buildRule((pattern, search, replace)):
    return lambda word: re.search(pattern, word) and re.sub(search, replace, word) <img id="plural.stage5.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
def plural(noun, language='en'): <img id="plural.stage5.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
    lines = file('rules.%s' % language).readlines() <img id="plural.stage5.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
    patterns = map(string.split, lines) <img id="plural.stage5.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
    rules = map(buildRule, patterns) <img id="plural.stage5.1.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
    for rule in rules:
        result = rule(noun) <img id="plural.stage5.1.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
        if result: return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage5.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You're still using the closures technique here (building a function dynamically that uses variables defined outside the function),
but now you've combined the separate match and apply functions into one. (The reason for this change will become clear in
the next section.) This will let you accomplish the same thing as having two functions, but you'll need to call it differently,
as you'll see in a minute.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage5.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Our <code class="function">plural</code> function now takes an optional second parameter, <code class="varname">language</code>, which defaults to <code>en</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage5.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You use the <code class="varname">language</code> parameter to construct a filename, then open the file and read the contents into a list. If <code class="varname">language</code> is <code>en</code>, then you'll open the <code class="filename">rules.en</code> file, read the entire thing, split it at the line breaks, and get back a list. Each line of the file will be one element
in the list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage5.1.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">As you saw, each line in the file really has three values, but they're separated by whitespace (tabs or spaces, it makes no
difference). Mapping the <code class="function">string.split</code> function onto this list will create a new list where each element is a list of three strings. So a line like <code>[sxz]$ $ es</code> will be broken up into the list <code>['[sxz]$', '$', 'es']</code>. This means that <code class="varname">patterns</code> will end up as a list of three-element sequences, just like the tuples you hard-coded in <a href="#plural.stage4" title="17.5. plural.py, stage 4">stage 4</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage5.1.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If <code class="varname">patterns</code> is a list of tuples, then <code class="varname">rules</code> will be a list of the functions created dynamically by each call to <code class="function">buildRule</code>. Calling <code class="function">buildRule(('[sxz]$', '$', 'es'))</code> returns a function that takes a single parameter, <code class="varname">word</code>. When this returned function is called, it will execute <code>re.search('[sxz]$', word) and re.sub('$', 'es', word)</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage5.1.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Because you're now building a combined match-and-apply function, you need to call it differently. Just call the function,
and if it returns something, then that's the plural; if it returns nothing (<code>None</code>), then the rule didn't match and you need to try another rule.
</td>
</tr>
</table>
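<p>To make the combined match-and-apply idea concrete, here is a minimal sketch written for Python 3 (an assumption; the listings in this book target Python 2). The names <code>build_rule</code> and <code>RULE_LINES</code> are hypothetical, and the rule lines are held in a list that stands in for the contents of a rules file.

```python
import re

def build_rule(pattern, search, replace):
    # Return one combined match-and-apply function (a closure over the
    # three strings). It returns the new word, or None if nothing matched.
    def apply_rule(word):
        if re.search(pattern, word):
            return re.sub(search, replace, word)
        return None
    return apply_rule

# Illustrative stand-in for a rules file: one rule per line,
# three whitespace-separated fields.
RULE_LINES = [
    "[sxz]$ $ es",
    "[^aeioudgkprt]h$ $ es",
    "[^aeiou]y$ y$ ies",
    "$ $ s",
]

rules = [build_rule(*line.split()) for line in RULE_LINES]

def plural(noun):
    # Try each rule in order; the first one that returns something wins.
    for apply_rule in rules:
        result = apply_rule(noun)
        if result:
            return result

print(plural("box"))      # boxes
print(plural("vacancy"))  # vacancies
print(plural("dog"))      # dogs
```

<p>Because each rule function returns <code>None</code> when it doesn't match, the caller needs only one call per rule instead of a separate match step followed by an apply step.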
<p>So the improvement here is that you've completely separated the pluralization rules into an external file. Not only can the
file be maintained separately from the code, but you've set up a naming scheme where the same <code class="function">plural</code> function can use different rule files, based on the <code class="varname">language</code> parameter.
<p>The downside here is that you're reading that file every time you call the <code class="function">plural</code> function. I thought I could get through this entire book without using the phrase &#8220;left as an exercise for the reader&#8221;, but here you go: building a caching mechanism for the language-specific rule files that auto-refreshes itself if the rule
files change between calls <em>is left as an exercise for the reader</em>. Have fun.
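<p>If you'd rather peek at one possible answer first, here is a minimal sketch in Python 3 (an assumption; the book's own code is Python 2). It keys the cache on the file's modification time. The names <code>load_rules</code>, <code>build_rule</code>, and <code>_cache</code> are hypothetical, and this is only one of several reasonable designs.

```python
import os
import re

_cache = {}  # filename -> (mtime when loaded, list of rule functions)

def build_rule(pattern, search, replace):
    # Combined match-and-apply function; returns None when no match.
    return lambda word: re.search(pattern, word) and re.sub(search, replace, word)

def load_rules(filename):
    # Re-read the file only if it is new to the cache or its
    # modification time has changed since it was last loaded.
    mtime = os.path.getmtime(filename)
    cached = _cache.get(filename)
    if cached is None or cached[0] != mtime:
        with open(filename) as f:
            rules = [build_rule(*line.split()) for line in f if line.strip()]
        _cache[filename] = (mtime, rules)
    return _cache[filename][1]

def plural(noun, language='en'):
    for apply_rule in load_rules('rules.%s' % language):
        result = apply_rule(noun)
        if result:
            return result
```

<p>One caveat of this sketch: a file rewritten within the same timestamp tick would not be noticed, because its modification time would look unchanged.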
<h2 id="plural.stage6">17.7. <code class="filename">plural.py</code>, stage 6</h2>
<p>Now you're ready to talk about generators.
<div class="example"><h3>Example 17.17. <code class="filename">plural6.py</code></h3><pre class="programlisting">
import re

def rules(language):
    for line in file('rules.%s' % language):
        pattern, search, replace = line.split()
        yield lambda word: re.search(pattern, word) and re.sub(search, replace, word)

def plural(noun, language='en'):
    for applyRule in rules(language):
        result = applyRule(noun)
        if result: return result
</pre><p>This uses a technique called generators, which I'm not even going to try to explain until you look at a simpler example first.
<div class="example"><h3 id="plural.introducing.generators">Example 17.18. Introducing generators</h3><pre class="screen">
<samp class="prompt">>>> </samp>def make_counter(x):
<samp class="prompt">... </samp>    print 'entering make_counter'
<samp class="prompt">... </samp>    while 1:
<samp class="prompt">... </samp>        yield x <img id="plural.stage6.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>        print 'incrementing x'
<samp class="prompt">... </samp>        x = x + 1
<samp class="prompt">... </samp>
<samp class="prompt">>>> </samp>counter = make_counter(2) <img id="plural.stage6.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>counter <img id="plural.stage6.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
&lt;generator object at 0x001C9C10>
<samp class="prompt">>>> </samp>counter.next() <img id="plural.stage6.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
<samp class="computeroutput">entering make_counter
2</samp>
<samp class="prompt">>>> </samp>counter.next() <img id="plural.stage6.2.5" src="images/callouts/5.png" alt="5" border="0" width="12" height="12">
<samp class="computeroutput">incrementing x
3</samp>
<samp class="prompt">>>> </samp>counter.next() <img id="plural.stage6.2.6" src="images/callouts/6.png" alt="6" border="0" width="12" height="12">
<samp class="computeroutput">incrementing x
4</samp>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.2.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The presence of the <code>yield</code> keyword in <code class="function">make_counter</code> means that this is not a normal function. It is a special kind of function which generates values one at a time. You can
think of it as a resumable function. Calling it will return a generator that can be used to generate successive values of
<code class="varname">x</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.2.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To create an instance of the <code class="function">make_counter</code> generator, just call it like any other function. Note that this does not actually execute the function code. You can tell
this because the first line of <code class="function">make_counter</code> is a <code class="function">print</code> statement, but nothing has been printed yet.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.2.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="function">make_counter</code> function returns a generator object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.2.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The first time you call the <code class="function">next()</code> method on the generator object, it executes the code in <code class="function">make_counter</code> up to the first <code>yield</code> statement, and then returns the value that was yielded. In this case, that will be <code>2</code>, because you originally created the generator by calling <code class="function">make_counter(2)</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.2.5"><img src="images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Repeatedly calling <code class="function">next()</code> on the generator object <em>resumes where you left off</em> and continues until you hit the next <code>yield</code> statement. The next line of code waiting to be executed is the <code class="function">print</code> statement that prints <code>incrementing x</code>, and then after that the <code>x = x + 1</code> statement that actually increments it. Then you loop through the <code>while</code> loop again, and the first thing you do is <code>yield x</code>, which returns the current value of <code class="varname">x</code> (now 3).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.2.6"><img src="images/callouts/6.png" alt="6" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The third time you call <code class="function">counter.next()</code>, you do all the same things again, but this time <code class="varname">x</code> is now <code>4</code>. And so forth. Since <code class="function">make_counter</code> sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing <code class="varname">x</code> and spitting out values. But let's look at more productive uses of generators instead.
</td>
</tr>
</table>
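<p>The session above is Python 2. As a hedged sketch of the same experiment in Python 3 (an assumption; nothing else in this chapter uses it), <code>print</code> is a function and you advance a generator with the built-in <code>next()</code> function rather than a <code>.next()</code> method.

```python
def make_counter(x):
    print('entering make_counter')
    while True:
        yield x
        print('incrementing x')
        x = x + 1

counter = make_counter(2)   # creates the generator; nothing is printed yet
first = next(counter)       # runs up to the first yield
second = next(counter)      # resumes after the yield, loops, yields again
third = next(counter)
print(first, second, third)  # 2 3 4
```

<p>The body of <code>make_counter</code> does not start running until the first <code>next()</code> call, which is why <code>entering make_counter</code> appears only then.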
<div class="example"><h3 id="plural.fib.example">Example 17.19. Using generators instead of recursion</h3><pre class="programlisting">
def fibonacci(max):
    a, b = 0, 1 <img id="plural.stage6.3.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
    while a &lt; max:
        yield a <img id="plural.stage6.3.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        a, b = b, a+b <img id="plural.stage6.3.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.3.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The Fibonacci sequence is a sequence of numbers where each number is the sum of the two numbers before it. It starts with
<code class="constant">0</code> and <code class="constant">1</code>, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: <code class="varname">a</code> starts at <code class="constant">0</code>, and <code class="varname">b</code> starts at <code class="constant">1</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.3.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">a</code> is the current number in the sequence, so yield it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.3.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code class="varname">b</code> is the next number in the sequence, so assign that to <code class="varname">a</code>, but also calculate the next value (<code>a+b</code>) and assign that to <code class="varname">b</code> for later use. Note that this happens in parallel; if <code class="varname">a</code> is <code>3</code> and <code class="varname">b</code> is <code>5</code>, then <code>a, b = b, a+b</code> will set <code class="varname">a</code> to <code>5</code> (the previous value of <code class="varname">b</code>) and <code class="varname">b</code> to <code>8</code> (the sum of the previous values of <code class="varname">a</code> and <code class="varname">b</code>).
</td>
</tr>
</table>
<p>So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way
is easier to read. Also, it works well with <code>for</code> loops.
<div class="example"><h3>Example 17.20. Generators in <code>for</code> loops</h3><pre class="screen">
<samp class="prompt">>>> </samp>for n in fibonacci(1000): <img id="plural.stage6.4.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">... </samp>    print n, <img id="plural.stage6.4.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.4.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can use a generator like <code class="function">fibonacci</code> in a <code>for</code> loop directly. The <code>for</code> loop will create the generator object and successively call the <code class="function">next()</code> method to get values to assign to the <code>for</code> loop index variable (<code class="varname">n</code>).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.4.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Each time through the <code>for</code> loop, <code class="varname">n</code> gets a new value from the <code>yield</code> statement in <code class="function">fibonacci</code>, and all you do is print it out. Once <code class="function">fibonacci</code> runs out of numbers (<code class="varname">a</code> is no longer less than <code class="varname">max</code>, which in this case is <code>1000</code>), then the <code>for</code> loop exits gracefully.
</td>
</tr>
</table>
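<p>To see what "exits gracefully" means, here is a rough Python 3 sketch (an assumption; the examples above are Python 2) that does by hand what the <code>for</code> loop does for you: it keeps asking the generator for values until the generator signals that it is finished by raising <code>StopIteration</code>. The parameter is renamed to <code>limit</code> here only to avoid shadowing the built-in <code>max</code>.

```python
def fibonacci(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

# Manual equivalent of: for n in fibonacci(1000): ...
gen = fibonacci(1000)
values = []
while True:
    try:
        values.append(next(gen))
    except StopIteration:
        # The generator function returned, so there are no more values.
        break

print(values[-1])   # 987
print(len(values))  # 17
```

<p>A <code>for</code> loop catches that exception for you, so you never see it in ordinary code.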
<p>OK, let's go back to the <code class="function">plural</code> function and see how you're using this.
<div class="example"><h3>Example 17.21. Generators that generate dynamic functions</h3><pre class="programlisting">
def rules(language):
    for line in file('rules.%s' % language): <img id="plural.stage6.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
        pattern, search, replace = line.split() <img id="plural.stage6.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
        yield lambda word: re.search(pattern, word) and re.sub(search, replace, word) <img id="plural.stage6.5.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">

def plural(noun, language='en'):
    for applyRule in rules(language): <img id="plural.stage6.5.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
        result = applyRule(noun)
        if result: return result
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.5.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>for line in file(...)</code> is a common idiom for reading lines from a file, one line at a time. It works because <em><code class="function">file</code> actually returns an iterator</em>, an object that (just like a generator) has a <code class="function">next()</code> method, which returns the next line of the file. That is so insanely cool, I wet myself just thinking about it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.5.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">No magic here. Remember that the lines of the rules file have three values separated by whitespace, so <code>line.split()</code> returns a list of 3 values, and you assign those values to 3 local variables.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.5.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><em>And then you yield.</em> What do you yield? A function, built dynamically with <code>lambda</code>, that is actually a closure (it uses the local variables <code class="varname">pattern</code>, <code class="varname">search</code>, and <code class="varname">replace</code> as constants). In other words, <code class="function">rules</code> is a generator that spits out rule functions.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#plural.stage6.5.4"><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <code class="function">rules</code> is a generator, you can use it directly in a <code>for</code> loop. The first time through the <code>for</code> loop, you will call the <code class="function">rules</code> function, which will open the rules file, read the first line out of it, dynamically build a function that matches and applies
the first rule defined in the rules file, and yields the dynamically built function. The second time through the <code>for</code> loop, you will pick up where you left off in <code class="function">rules</code> (which was in the middle of the <code>for line in file(...)</code> loop), read the second line of the rules file, dynamically build another function that matches and applies the second rule
defined in the rules file, and yields it. And so forth.
</td>
</tr>
</table>
<p>What have you gained over <a href="#plural.stage5" title="17.6. plural.py, stage 5">stage 5</a>? In stage 5, you read the entire rules file and built a list of all the possible rules before you even tried the first one.
Now with generators, you can do everything lazily: you open the file and read the first rule and create a function to try
it, but if that works you don't ever read the rest of the file or create any other functions.
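<p>The laziness can be observed directly. The following is a minimal Python 3 sketch (an assumption; the chapter's code is Python 2) in which a hypothetical <code>rule_lines</code> generator stands in for the rules file and records every line that is actually read. It also shows a caveat: the yielded <code>lambda</code> looks up <code>pattern</code>, <code>search</code>, and <code>replace</code> when it is called, so each rule function should be used before the generator moves on to the next line.

```python
import re

lines_read = []

def rule_lines():
    # Stand-in for iterating over a rules file; records each line read.
    for line in ["[sxz]$ $ es", "[^aeiou]y$ y$ ies", "$ $ s"]:
        lines_read.append(line)
        yield line

def rules(lines):
    for line in lines:
        pattern, search, replace = line.split()
        yield lambda word: re.search(pattern, word) and re.sub(search, replace, word)

def plural(noun):
    for apply_rule in rules(rule_lines()):
        result = apply_rule(noun)
        if result:
            return result

boxes = plural("box")
lines_after_box = len(lines_read)
print(boxes, lines_after_box)  # boxes 1

# Caveat: collect every rule first, and all of them see the values
# from the last line, because the lambdas share the same variables.
eager = list(rules(rule_lines()))
print(eager[0]("box"))  # boxs
```

<p>Only one line was read to pluralize <code>box</code>. The <code>plural</code> function is safe because it applies each rule before asking for the next one; if you ever need to keep the rule functions around, one common fix is to bind the values as default arguments of the <code>lambda</code>.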
<div class="itemizedlist">
<h3>Further reading</h3>
<ul>
<li><a href="http://www.python.org/peps/pep-0255.html">PEP 255</a> defines generators.
<li><a href="http://www.activestate.com/ASPN/Python/Cookbook/" title="growing archive of annotated code samples">Python Cookbook</a> has <a href="http://www.google.com/search?q=generators+cookbook+site:aspn.activestate.com">many more examples of generators</a>.
</ul>
<h2 id="plural.summary">17.8. Summary</h2>
<p>You talked about several different advanced techniques in this chapter. Not all of them are appropriate for every situation.
<p>You should now be comfortable with all of these techniques:
<div class="itemizedlist">
<ul>
<li>Performing <a href="#plural.stage1" title="17.2. plural.py, stage 1">string substitution with regular expressions</a>.
<li>Treating <a href="#plural.stage2" title="17.3. plural.py, stage 2">functions as objects</a>, storing them in lists, assigning them to variables, and calling them through those variables.
<li>Building <a href="#plural.stage3" title="17.4. plural.py, stage 3">dynamic functions with <code>lambda</code></a>.
<li>Building <a href="#plural.stage4" title="17.5. plural.py, stage 4">closures</a>, dynamic functions that contain surrounding variables as constants.
<li>Building <a href="#plural.stage6" title="17.7. plural.py, stage 6">generators</a>, resumable functions that perform incremental logic and return different values each time you call them.
</ul>
<p>Adding abstractions, building functions dynamically, building closures, and using generators can all make your code simpler,
more readable, and more flexible. But they can also end up making it more difficult to debug later. It's up to you to find
the right balance between simplicity and power.
<div class="chapter">
<h2 id="soundex">Chapter 18. Performance Tuning</h2>
<p>Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it <em>too</em> much.
<h2 id="soundex.divein">18.1. Diving in</h2>
<p>There are so many pitfalls involved in optimizing your code, it's hard to know where to start.
<p>Let's start here: <em>are you sure you need to do it at all?</em> Is your code really so bad? Is it worth the time to tune it? Over the lifetime of your application, how much time is going
to be spent running that code, compared to the time spent waiting for a remote database server, or waiting for user input?
<p>Second, <em>are you sure you're done coding?</em> Premature optimization is like spreading frosting on a half-baked cake. You spend hours or days (or more) optimizing your
code for performance, only to discover it doesn't do what you need it to do. That's time down the drain.
<p>This is not to say that code optimization is worthless, but you need to look at the whole system and decide whether it's the
best use of your time. Every minute you spend optimizing code is a minute you're not spending adding new features, or writing
documentation, or playing with your kids, or writing unit tests.
<p>Oh yes, unit tests. It should go without saying that you need a complete set of unit tests before you begin performance tuning.
The last thing you need is to introduce new bugs while fiddling with your algorithms.
<p>With these caveats in place, let's look at some techniques for optimizing Python code. The code in question is an implementation of the Soundex algorithm. Soundex was a method used in the early 20th century
for categorizing surnames in the United States census. It grouped similar-sounding names together, so even if a name was
misspelled, researchers had a chance of finding it. Soundex is still used today for much the same reason, although of course
we use computerized database servers now. Most database servers include a Soundex function.
<p>There are several subtle variations of the Soundex algorithm. This is the one used in this chapter:
<div class="orderedlist">
<ol>
<li>Keep the first letter of the name as-is.
<li>Convert the remaining letters to digits, according to a specific table:
<div class="itemizedlist">
<ul>
<li>B, F, P, and V become 1.
<li>C, G, J, K, Q, S, X, and Z become 2.
<li>D and T become 3.
<li>L becomes 4.
<li>M and N become 5.
<li>R becomes 6.
<li>All other letters become 9.
</ul>
<li>Remove consecutive duplicates.
<li>Remove all 9s altogether.
<li>If the result is shorter than four characters (the first letter plus three digits), pad the result with trailing zeros.
<li>If the result is longer than four characters, discard everything after the fourth character.
</ol>
<p>For example, my name, <code>Pilgrim</code>, becomes P942695. That has no consecutive duplicates, so nothing to do there. Then you remove the 9s, leaving P4265. That's
too long, so you discard the excess character, leaving P426.
<p>Another example: <code>Woo</code> becomes W99, which becomes W9, which becomes W, which gets padded with zeros to become W000.
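<p>The two worked examples above can be traced step by step with a short sketch. This is Python 3 (an assumption; the listings in this chapter are Python 2), the helper name <code>soundex_stages</code> is hypothetical, and it assumes the input is a non-empty string of letters, so it does no validation.

```python
# Digit table from the list above; every letter not listed maps to "9".
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CHAR_TO_DIGIT = {ch: digit for digit, letters in GROUPS.items() for ch in letters}

def soundex_stages(name):
    """Return the intermediate string after each step of the algorithm."""
    name = name.upper()
    # Steps 1 and 2: keep the first letter, convert the rest to digits.
    coded = name[0] + "".join(CHAR_TO_DIGIT.get(ch, "9") for ch in name[1:])
    # Step 3: remove consecutive duplicates.
    deduped = coded[0]
    for ch in coded[1:]:
        if ch != deduped[-1]:
            deduped += ch
    # Step 4: remove all 9s.
    stripped = deduped.replace("9", "")
    # Steps 5 and 6: pad with zeros, then cut to four characters.
    final = stripped.ljust(4, "0")[:4]
    return coded, deduped, stripped, final

print(soundex_stages("Pilgrim"))  # ('P942695', 'P942695', 'P4265', 'P426')
print(soundex_stages("Woo"))      # ('W99', 'W9', 'W', 'W000')
```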
<p>Here's a first attempt at a Soundex function:
<div class="example"><h3>Example 18.1. <code class="filename">soundex/stage1/soundex1a.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="programlisting">
import string, re

charToSoundex = {"A": "9",
                 "B": "1",
                 "C": "2",
                 "D": "3",
                 "E": "9",
                 "F": "1",
                 "G": "2",
                 "H": "9",
                 "I": "9",
                 "J": "2",
                 "K": "2",
                 "L": "4",
                 "M": "5",
                 "N": "5",
                 "O": "9",
                 "P": "1",
                 "Q": "2",
                 "R": "6",
                 "S": "2",
                 "T": "3",
                 "U": "9",
                 "V": "1",
                 "W": "9",
                 "X": "2",
                 "Y": "9",
                 "Z": "2"}

def soundex(source):
    "convert string to Soundex equivalent"

    # Soundex requirements:
    # source string must be at least 1 character
    # and must consist entirely of letters
    allChars = string.uppercase + string.lowercase
    if not re.search('^[%s]+$' % allChars, source):
        return "0000"

    # Soundex algorithm:
    # 1. make first character uppercase
    source = source[0].upper() + source[1:]

    # 2. translate all other characters to Soundex digits
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]

    # 3. remove consecutive duplicates
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d

    # 4. remove all "9"s
    digits3 = re.sub('9', '', digits2)

    # 5. pad end with "0"s to 4 characters
    while len(digits3) &lt; 4:
        digits3 += "0"

    # 6. return first 4 characters
    return digits3[:4]

if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><div class="itemizedlist">
<h3>Further Reading on Soundex</h3>
<ul>
<li><a href="http://www.avotaynu.com/soundex.html">Soundexing and Genealogy</a> gives a chronology of the evolution of the Soundex and its regional variations.
</ul>
<h2 id="soundex.timeit">18.2. Using the <code class="filename">timeit</code> Module</h2>
<p>The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function.
<p>Timing short pieces of code is incredibly complex. How much processor time is your computer devoting to running this code?
Are there things running in the background? Are you sure? Every modern computer has background processes running, some all
the time, some intermittently. Cron jobs fire off at consistent intervals; background services occasionally &#8220;wake up&#8221; to do useful things like check for new mail, connect to instant messaging servers, check for application updates, scan for
viruses, check whether a disk has been inserted into your CD drive in the last 100 nanoseconds, and so on. Before you start
your timing tests, turn everything off and disconnect from the network. Then turn off all the things you forgot to turn off
the first time, then turn off the service that's incessantly checking whether the network has come back yet, then ...
<p>And then there's the matter of the variations introduced by the timing framework itself. Does the Python interpreter cache method name lookups? Does it cache code block compilations? Regular expressions? Will your code have
side effects if run more than once? Don't forget that you're dealing with small fractions of a second, so small mistakes
in your timing framework will irreparably skew your results.
<p>The Python community has a saying: &#8220;Python comes with batteries included.&#8221; Don't write your own timing framework. Python 2.3 comes with a perfectly good one called <code class="filename">timeit</code>.
<div class="example"><h3>Example 18.2. Introducing <code class="filename">timeit</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.<pre class="screen">
<samp class="prompt">>>> </samp>import timeit
<samp class="prompt">>>> </samp>t = timeit.Timer("soundex.soundex('Pilgrim')",
<samp class="prompt">... </samp>"import soundex") <img id="soundex.timeit.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="prompt">>>> </samp>t.timeit() <img id="soundex.timeit.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
8.21683733547
<samp class="prompt">>>> </samp>t.repeat(3, 2000000) <img id="soundex.timeit.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
[16.48319309109, 16.46128984923, 16.44203948912]
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#soundex.timeit.1.1"><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code class="filename">timeit</code> module defines one class, <code class="classname">Timer</code>, which takes two arguments. Both arguments are strings. The first argument is the statement you wish to time; in this case,
you are timing a call to the Soundex function within the <code class="filename">soundex</code> module with an argument of <code>'Pilgrim'</code>. The second argument to the <code class="classname">Timer</code> class is the import statement that sets up the environment for the statement. Internally, <code class="filename">timeit</code> sets up an isolated virtual environment, manually executes the setup statement (importing the <code class="filename">soundex</code> module), then manually compiles and executes the timed statement (calling the Soundex function).
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soundex.timeit.1.2"><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Once you have the <code class="classname">Timer</code> object, the easiest thing to do is call <code class="methodname">timeit()</code>, which calls your function 1 million times and returns the number of seconds it took to do it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#soundex.timeit.1.3"><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The other major method of the <code class="classname">Timer</code> object is <code class="methodname">repeat()</code>, which takes two optional arguments. The first argument is the number of times to repeat the entire test, and the second
argument is the number of times to call the timed statement within each test. They default
to <code>3</code> and <code>1000000</code> respectively. The <code class="methodname">repeat()</code> method returns a list of the times each test cycle took, in seconds.
</td>
</tr>
</table>
</div><table class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">You can use the <code class="filename">timeit</code> module on the command line to test an existing Python program, without modifying the code. See <a href="http://docs.python.org/lib/node396.html">http://docs.python.org/lib/node396.html</a> for documentation on the command-line flags.
</td>
</tr>
</table>
<p>Note that <code class="methodname">repeat()</code> returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the
Python interpreter is getting (and those pesky background processes that you can't get rid of). Your first thought might be to
say &#8220;Let's take the average and call that The True Number.&#8221;
<p>In fact, that's almost certainly wrong. The tests that took longer didn't take longer because of variations in your code
or in the Python interpreter; they took longer because of those pesky background processes, or other factors outside of the Python interpreter that you can't fully eliminate. If the different timing results differ by more than a few percent, you still
have too much variability to trust the results. Otherwise, take the minimum time and discard the rest.
<p>Python has a handy <code class="function">min</code> function that takes a list and returns the smallest value:
<div class="informalexample"><pre class="screen">
<samp class="prompt">>>> </samp>min(t.repeat(3, 1000000))
8.22203948912
</pre></div><table class="tip" border="0" summary="">
<tr>
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td>
</tr>
<tr>
<td colspan="2" align="left" valign="top" width="99%">The <code class="filename">timeit</code> module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out <a href="http://docs.python.org/lib/module-hotshot.html">the <code class="filename">hotshot</code> module.</a></td>
</tr>
</table>
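<p>Putting the pieces of this section together, here is a minimal sketch for Python 3 (an assumption; the sessions above are Python 2.3). The function <code>sum_of_squares</code> is a hypothetical stand-in for the code under test, the <code>globals</code> argument to <code>Timer</code> is a Python 3 convenience used here in place of an import statement, and the repetition counts are kept small so the sketch finishes quickly. Note that recent Python 3 releases default to 5 repetitions rather than 3, so the counts are passed explicitly.

```python
import timeit

def sum_of_squares(n):
    # Hypothetical stand-in for the function you want to time.
    return sum(i * i for i in range(n))

# The statement is still a string; the function it names is supplied
# through the globals argument.
t = timeit.Timer("sum_of_squares(100)",
                 globals={"sum_of_squares": sum_of_squares})

# Three test cycles of 1000 calls each; repeat() returns one time per cycle.
times = t.repeat(repeat=3, number=1000)

# Keep the minimum and discard the rest, as described above.
best = min(times)
print(len(times), best)
```

<p>The actual numbers will vary from machine to machine and from run to run, which is exactly why you take the minimum rather than the average.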
<h2 id="soundex.stage1">18.3. Optimizing Regular Expressions</h2>
<p>The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to
do this?
<p>If you answered &#8220;regular expressions&#8221;, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should
be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain.
Also for performance reasons.
<p>This code fragment from <code class="filename">soundex/stage1/soundex1a.py</code> checks whether the function argument <code class="varname">source</code> is a word made entirely of letters, with at least one letter (not the empty string):
<div class="informalexample"><pre class="programlisting">
    allChars = string.uppercase + string.lowercase
    if not re.search('^[%s]+$' % allChars, source):
        return "0000"
</pre><p>How does <code class="filename">soundex1a.py</code> perform? For convenience, the <code>__main__</code> section of the script contains this code that calls the <code class="filename">timeit</code> module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for
each:
<div class="informalexample"><pre class="programlisting">
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><p>So how does <code class="filename">soundex1a.py</code> perform with this regular expression?
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage1></samp>python soundex1a.py
<samp class="computeroutput">Woo             W000 19.3356647283
Pilgrim         P426 24.0772053431
Flingjingwaller F452 35.0463220884</samp>
</pre><p>As you might expect, the algorithm takes significantly longer when called with longer names. There will be a few things we
can do to narrow that gap (make the function take less relative time for longer input), but the nature of the algorithm dictates
that it will never run in constant time.
<p>The other thing to keep in mind is that we are testing a representative sample of names. <code>Woo</code> is a kind of trivial case, in that it gets shortened to a single letter and then padded with zeros. <code>Pilgrim</code> is a normal case, of average length and a mixture of significant and ignored letters. <code>Flingjingwaller</code> is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range
of different cases.
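<p>If you are following along in Python 3, the timing harness used throughout this chapter can be sketched like this. It is only a sketch under a few assumptions: <code>soundex_stub</code> is a made-up stand-in for the real function, <code>best_times</code> is a hypothetical helper name, and each run is cut to 1,000 calls (the chapter's figures use <code>timeit</code>'s default of 1,000,000 calls per run) so it finishes quickly.

```python
from timeit import Timer


def soundex_stub(name):
    # Hypothetical stand-in for the chapter's soundex() function.
    return (name[:1].upper() + "000")[:4]


def best_times(func, names, repeat=3, number=1000):
    """Return {name: fastest total time in seconds for `number` calls}."""
    results = {}
    for name in names:
        # A callable works as well as a statement string; the lambda adds
        # a small, constant overhead to every measurement.
        timer = Timer(lambda: func(name))
        results[name] = min(timer.repeat(repeat=repeat, number=number))
    return results


if __name__ == "__main__":
    names = ("Woo", "Pilgrim", "Flingjingwaller")
    for name, seconds in best_times(soundex_stub, names).items():
        print(name.ljust(15), soundex_stub(name), seconds)
```

<p>Taking the minimum of the repeated runs, as the chapter does, filters out runs that were slowed down by whatever else the machine was doing at the time.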
<p>So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters
(<code>A-Z</code> in uppercase, and <code>a-z</code> in lowercase), we can use a shorthand regular expression syntax. Here is <code class="filename">soundex/stage1/soundex1b.py</code>:
<div class="informalexample"><pre class="programlisting">
    if not re.search('^[A-Za-z]+$', source):
        return "0000"
</pre><p><code class="filename">timeit</code> says <code class="filename">soundex1b.py</code> is slightly faster than <code class="filename">soundex1a.py</code>, but nothing to get terribly excited about:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage1></samp>python soundex1b.py
<samp class="computeroutput">Woo             W000 17.1361133887
Pilgrim         P426 21.8201693232
Flingjingwaller F452 32.7262294509</samp>
</pre><p>We saw in <a href="#roman.refactoring" title="15.3. Refactoring">Section 15.3, &#8220;Refactoring&#8221;</a> that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across
function calls, we can compile it once and use the compiled version. Here is <code class="filename">soundex/stage1/soundex1c.py</code>:
<div class="informalexample"><pre class="programlisting">
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
    if not isOnlyChars(source):
        return "0000"
</pre><p>Using a compiled regular expression in <code class="filename">soundex1c.py</code> is significantly faster:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage1></samp>python soundex1c.py
<samp class="computeroutput">Woo             W000 14.5348347346
Pilgrim         P426 19.2784703084
Flingjingwaller F452 30.0893873383</samp>
</pre><p>But is this the wrong path? The logic here is simple: the input <code class="varname">source</code> needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each
character, and do away with regular expressions altogether?
<p>Here is <code class="filename">soundex/stage1/soundex1d.py</code>:
<div class="informalexample"><pre class="programlisting">
    if not source:
        return "0000"
    for c in source:
        if not ('A' &lt;= c &lt;= 'Z') and not ('a' &lt;= c &lt;= 'z'):
            return "0000"
</pre><p>It turns out that this technique in <code class="filename">soundex1d.py</code> is <em>not</em> faster than using a compiled regular expression (although it does beat the non-compiled regular expressions on the shortest name):
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage1></samp>python soundex1d.py
<samp class="computeroutput">Woo             W000 15.4065058548
Pilgrim         P426 22.2753567842
Flingjingwaller F452 37.5845122774</samp>
</pre><p>Why isn't <code class="filename">soundex1d.py</code> faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this
loop is written in Python, and runs through the Python interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted.
Regular expressions are never the right answer... except when they are.
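<p>You can watch the interpreter overhead for yourself with a side-by-side timing. This is a rough Python 3 sketch, not one of the book's sample files: the function names are made up for illustration, and the numbers you get will depend on your machine and Python version, so none are shown here.

```python
import re
from timeit import timeit

# Compiled once; the matching itself runs inside the C regex engine.
is_only_letters_re = re.compile(r"^[A-Za-z]+$").search


def is_only_letters_loop(source):
    # Same check as soundex1d.py: one interpreted comparison per character.
    if not source:
        return False
    for c in source:
        if not ("A" <= c <= "Z") and not ("a" <= c <= "z"):
            return False
    return True


if __name__ == "__main__":
    for name in ("Woo", "Pilgrim", "Flingjingwaller"):
        t_re = timeit(lambda: is_only_letters_re(name), number=100000)
        t_loop = timeit(lambda: is_only_letters_loop(name), number=100000)
        print(name.ljust(15), round(t_re, 4), round(t_loop, 4))
```

<p>Both checks agree on plain names; the only thing being compared is where the per-character work happens.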
<p>It turns out that Python offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book.
The method is called <code class="methodname">isalpha()</code>, and it checks whether a string contains only letters.
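<p>A few quick calls show how <code class="methodname">isalpha()</code> behaves. The helper name <code>is_valid_name</code> is hypothetical. One caveat if you run this under Python 3: there, <code class="methodname">isalpha()</code> also returns true for non-ASCII letters such as <code>é</code>, which the letters-only regular expression would reject.

```python
# isalpha() is True only for a non-empty string made entirely of letters.
print("Woo".isalpha())      # True
print("Woo2".isalpha())     # False: contains a digit
print("".isalpha())         # False: the empty string has no letters
print("O'Brien".isalpha())  # False: the apostrophe is not a letter


def is_valid_name(source):
    # Hypothetical helper: isalpha() alone already covers "non-empty".
    return source.isalpha()
```

<p>Because the empty string fails <code class="methodname">isalpha()</code> on its own, a separate emptiness check is a matter of clarity rather than necessity.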
<p>This is <code class="filename">soundex/stage1/soundex1e.py</code>:
<div class="informalexample"><pre class="programlisting">
    if (not source) or (not source.isalpha()):
        return "0000"
</pre><p>How much did we gain by using this specific method in <code class="filename">soundex1e.py</code>? Quite a bit.
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage1></samp>python soundex1e.py
<samp class="computeroutput">Woo             W000 13.5069504644
Pilgrim         P426 18.2199394057
Flingjingwaller F452 28.9975225902</samp>
</pre><div class="example"><h3>Example 18.3. Best Result So Far: <code class="filename">soundex/stage1/soundex1e.py</code></h3><pre class="programlisting">
import string, re
charToSoundex = {"A": "9",
"B": "1",
"C": "2",
"D": "3",
"E": "9",
"F": "1",
"G": "2",
"H": "9",
"I": "9",
"J": "2",
"K": "2",
"L": "4",
"M": "5",
"N": "5",
"O": "9",
"P": "1",
"Q": "2",
"R": "6",
"S": "2",
"T": "3",
"U": "9",
"V": "1",
"W": "9",
"X": "2",
"Y": "9",
"Z": "2"}
def soundex(source):
    if (not source) or (not source.isalpha()):
        return "0000"
    source = source[0].upper() + source[1:]
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage2">18.4. Optimizing Dictionary Lookups</h2>
<p>The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to
do this?
<p>The most obvious solution is to define a dictionary with individual characters as keys and their corresponding digits as values,
and do dictionary lookups on each character. This is what we have in <code class="filename">soundex/stage1/soundex1c.py</code> (the version the stage 2 scripts build on):
<div class="informalexample"><pre class="programlisting">
charToSoundex = {"A": "9",
"B": "1",
"C": "2",
"D": "3",
"E": "9",
"F": "1",
"G": "2",
"H": "9",
"I": "9",
"J": "2",
"K": "2",
"L": "4",
"M": "5",
"N": "5",
"O": "9",
"P": "1",
"Q": "2",
"R": "6",
"S": "2",
"T": "3",
"U": "9",
"V": "1",
"W": "9",
"X": "2",
"Y": "9",
"Z": "2"}
def soundex(source):
    # ... input check omitted for brevity ...
    source = source[0].upper() + source[1:]
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
</pre><p>You timed <code class="filename">soundex1c.py</code> already; this is how it performs:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage1></samp>python soundex1c.py
<samp class="computeroutput">Woo             W000 14.5341678901
Pilgrim         P426 19.2650071448
Flingjingwaller F452 30.1003563302</samp>
</pre><p>This code is straightforward, but is it the best solution? Calling <code class="methodname">upper()</code> on each individual character seems inefficient; it would probably be better to call <code class="methodname">upper()</code> once on the entire string.
<p>Then there's the matter of incrementally building the <code class="varname">digits</code> string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one.
<p>Python is good at lists, though. It can treat a string as a list of characters automatically. And lists are easy to combine into
strings again, using the string method <code class="methodname">join()</code>.
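<p>In case the round trip from string to list and back is not obvious, here it is in miniature. The variable names are only for illustration.

```python
word = "Pilgrim"

# A string can be split into a list of one-character strings...
chars = list(word)

# ...transformed element by element...
upper_chars = [c.upper() for c in chars]

# ...and glued back together in one step, instead of one += per character.
rebuilt = "".join(upper_chars)

print(chars)    # ['P', 'i', 'l', 'g', 'r', 'i', 'm']
print(rebuilt)  # PILGRIM
```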
<p>Here is <code class="filename">soundex/stage2/soundex2a.py</code>, which converts letters to digits by using <code class="function">map</code> and <code>lambda</code>:
<div class="informalexample"><pre class="programlisting">
def soundex(source):
    # ...
    source = source.upper()
    digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
</pre><p>Surprisingly, <code class="filename">soundex2a.py</code> is not faster:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage2></samp>python soundex2a.py
<samp class="computeroutput">Woo             W000 15.0097526362
Pilgrim         P426 19.254806407
Flingjingwaller F452 29.3790847719</samp>
</pre><p>The overhead of the anonymous <code>lambda</code> function kills any performance you gain by dealing with the string as a list of characters.
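<p>To see where the time goes, you can time the same lookup written three ways. This is a Python 3 sketch with an abbreviated, made-up lookup table; the third variant is not from this chapter and is included only to isolate the cost of calling a Python-level function per character. Timings vary by machine, so none are shown.

```python
from timeit import timeit

# Abbreviated, hypothetical table: only the letters the demo word needs.
char_to_soundex = {"I": "9", "L": "4", "G": "2", "R": "6", "M": "5"}


def with_map_lambda(tail):
    # soundex2a.py style: one Python-level lambda call per character.
    return "".join(map(lambda c: char_to_soundex[c], tail))


def with_list_comprehension(tail):
    # soundex2b.py style: the lookup is inlined in the comprehension.
    return "".join([char_to_soundex[c] for c in tail])


def with_bound_method(tail):
    # Not from the chapter: map over the dict's own lookup method,
    # so no Python-level function runs per character.
    return "".join(map(char_to_soundex.__getitem__, tail))


if __name__ == "__main__":
    tail = "ILGRIM"
    for func in (with_map_lambda, with_list_comprehension, with_bound_method):
        seconds = timeit(lambda: func(tail), number=100000)
        print(func.__name__.ljust(24), func(tail), round(seconds, 4))
```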
<p><code class="filename">soundex/stage2/soundex2b.py</code> uses a list comprehension instead of <code class="function">map</code> and <code>lambda</code>:
<div class="informalexample"><pre class="programlisting">
    source = source.upper()
    digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
</pre><p>Using a list comprehension in <code class="filename">soundex2b.py</code> is faster than using <code class="function">map</code> and <code>lambda</code> in <code class="filename">soundex2a.py</code>, and in this run it also comes out ahead of the original code (incrementally building a string in <code class="filename">soundex1c.py</code>):
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage2></samp>python soundex2b.py
<samp class="computeroutput">Woo             W000 13.4221324219
Pilgrim         P426 16.4901234654
Flingjingwaller F452 25.8186157738</samp>
</pre><p>It's time for a radically different approach. Dictionary lookups are a general purpose tool. Dictionary keys can be any
length string (or many other data types), but in this case we are only dealing with single-character keys <em>and</em> single-character values. It turns out that Python has a specialized function for handling exactly this situation: the <code class="function">string.maketrans</code> function.
<p>This is <code class="filename">soundex/stage2/soundex2c.py</code>:
<div class="informalexample"><pre class="programlisting">
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
def soundex(source):
    # ...
    digits = source[0].upper() + source[1:].translate(charToSoundex)
</pre><p>What the heck is going on here? <code class="function">string.maketrans</code> creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument
is the string <code>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</code>, and the second argument is the string <code>9123912992245591262391929291239129922455912623919292</code>. See the pattern? It's the same conversion pattern we were setting up longhand with a dictionary. A maps to 9, B maps
to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized data structure that you can access using the
string method <code class="methodname">translate</code>, which translates each character into the corresponding digit, according to the matrix defined by <code class="function">string.maketrans</code>.
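<p>Here is the same translation table in isolation, written for Python 3 as an assumption about what you might be running. In Python 3 the pieces are spelled <code>string.ascii_uppercase</code>, <code>string.ascii_lowercase</code>, and <code>str.maketrans</code>, and the table <code>str.maketrans</code> returns is a plain dictionary keyed by character code, though you still use it only through <code class="methodname">translate</code>.

```python
import string

# Python 3 spellings of string.uppercase, string.lowercase, string.maketrans.
all_chars = string.ascii_uppercase + string.ascii_lowercase
char_to_soundex = str.maketrans(all_chars, "91239129922455912623919292" * 2)

# Every letter is replaced by its digit in a single call.
print("ilgrim".translate(char_to_soundex))  # 942695
print("oo".translate(char_to_soundex))      # 99
```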
<p><code class="filename">timeit</code> shows that <code class="filename">soundex2c.py</code> is significantly faster than defining a dictionary and looping through the input and building the output incrementally:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage2></samp>python soundex2c.py
<samp class="computeroutput">Woo             W000 11.437645008
Pilgrim         P426 13.2825062962
Flingjingwaller F452 18.5570110168</samp>
</pre><p>You're not going to get much better than that. Python has a specialized function that does exactly what you want to do; use it and move on.
<div class="example"><h3>Example 18.4. Best Result So Far: <code class="filename">soundex/stage2/soundex2c.py</code></h3><pre class="programlisting">
import string, re
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
    if not isOnlyChars(source):
        return "0000"
    digits = source[0].upper() + source[1:].translate(charToSoundex)
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage3">18.5. Optimizing List Operations</h2>
<p>The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this?
<p>Here's the code we have so far, in <code class="filename">soundex/stage2/soundex2c.py</code>:
<div class="informalexample"><pre class="programlisting">
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
</pre><p>Here are the performance results for <code class="filename">soundex2c.py</code>:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage2></samp>python soundex2c.py
<samp class="computeroutput">Woo             W000 12.6070768771
Pilgrim         P426 14.4033353401
Flingjingwaller F452 19.7774882003</samp>
</pre><p>The first thing to consider is whether it's efficient to check <code class="varname">digits2[-1]</code> each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate
variable, and checking that instead?
<p>To answer this question, here is <code class="filename">soundex/stage3/soundex3a.py</code>:
<div class="informalexample"><pre class="programlisting">
    digits2 = ''
    last_digit = ''
    for d in digits:
        if d != last_digit:
            digits2 += d
            last_digit = d
</pre><p><code class="filename">soundex3a.py</code> does not run any faster than <code class="filename">soundex2c.py</code> did in its best run (the timings in the previous section), and may even be slightly slower (although it's not enough of a difference to say for sure):
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage3></samp>python soundex3a.py
<samp class="computeroutput">Woo             W000 11.5346048171
Pilgrim         P426 13.3950636184
Flingjingwaller F452 18.6108927252</samp>
</pre><p>Why isn't <code class="filename">soundex3a.py</code> faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing <code class="varname">digits2[-1]</code> is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have <em>two</em> variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating
the list lookup.
<p>Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible
to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character
in the list, and that's not easy to do with a straightforward list comprehension.
<p>However, it is possible to create a list of index numbers using the built-in <code class="function">range()</code> function, and use those index numbers to progressively search through the list and pull out each character that is different
from the previous character. That will give you a list of characters, and you can use the string method <code class="methodname">join()</code> to reconstruct a string from that.
<p>Here is <code class="filename">soundex/stage3/soundex3b.py</code>:
<div class="informalexample"><pre class="programlisting">
    digits2 = "".join([digits[i] for i in range(len(digits))
                       if i == 0 or digits[i-1] != digits[i]])
</pre><p>Is this faster? In a word, no.
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage3></samp>python soundex3b.py
<samp class="computeroutput">Woo             W000 14.2245271396
Pilgrim         P426 17.8337165757
Flingjingwaller F452 25.9954005327</samp>
</pre><p>It's possible that the techniques so far have been &#8220;string-centric&#8221;. Python can convert a string into a list of characters with a single command: <code class="function">list('abc')</code> returns <code>['a', 'b', 'c']</code>. Furthermore, lists can be <em>modified in place</em> very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around
within a single list?
<p>Here is <code class="filename">soundex/stage3/soundex3c.py</code>, which modifies a list in place to remove consecutive duplicate elements:
<div class="informalexample"><pre class="programlisting">
    digits = list(source[0].upper() + source[1:].translate(charToSoundex))
    i=0
    for item in digits:
        if item==digits[i]: continue
        i+=1
        digits[i]=item
    del digits[i+1:]
    digits2 = "".join(digits)
</pre><p>Is this faster than <code class="filename">soundex3a.py</code> or <code class="filename">soundex3b.py</code>? It edges out <code class="filename">soundex3b.py</code>, but it is still well behind <code class="filename">soundex3a.py</code>:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage3></samp>python soundex3c.py
<samp class="computeroutput">Woo             W000 14.1662554878
Pilgrim         P426 16.0397885765
Flingjingwaller F452 22.1789341942</samp>
</pre><p>We haven't made any progress here at all, except to try and rule out several &#8220;clever&#8221; techniques. The fastest code we've seen so far was the original, most straightforward method (<code class="filename">soundex2c.py</code>). Sometimes it doesn't pay to be clever.
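<p>When you compare alternatives like this, it is worth confirming that they really are interchangeable before you trust the timings. Here is a small Python 3 sketch that wraps each variant from this section in a function (the function names are made up for illustration) so the results can be compared on the same input.

```python
def dedupe_2c(digits):
    # soundex2c.py: compare against the last character of the result so far.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    return digits2


def dedupe_3a(digits):
    # soundex3a.py: remember the last digit in a separate variable.
    digits2 = ""
    last_digit = ""
    for d in digits:
        if d != last_digit:
            digits2 += d
            last_digit = d
    return digits2


def dedupe_3b(digits):
    # soundex3b.py: list comprehension over index numbers.
    return "".join([digits[i] for i in range(len(digits))
                    if i == 0 or digits[i - 1] != digits[i]])


def dedupe_3c(digits):
    # soundex3c.py: compact a list in place, then cut off the leftover tail.
    items = list(digits)
    i = 0
    for item in items:
        if item == items[i]:
            continue
        i += 1
        items[i] = item
    del items[i + 1:]
    return "".join(items)


if __name__ == "__main__":
    sample = "F49522952994496"  # Flingjingwaller after letter translation
    for func in (dedupe_2c, dedupe_3a, dedupe_3b, dedupe_3c):
        print(func.__name__, func(sample))
```

<p>All four produce the same string for non-empty input, so the timing differences come purely from how each one does the work.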
<div class="example"><h3>Example 18.5. Best Result So Far: <code class="filename">soundex/stage2/soundex2c.py</code></h3><pre class="programlisting">
import string, re
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
    if not isOnlyChars(source):
        return "0000"
    digits = source[0].upper() + source[1:].translate(charToSoundex)
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage4">18.6. Optimizing String Manipulation</h2>
<p>The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best
way to do this?
<p>This is what we have so far, taken from <code class="filename">soundex/stage2/soundex2c.py</code>:
<div class="informalexample"><pre class="programlisting">
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]
</pre><p>These are the results for <code class="filename">soundex2c.py</code>:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage2></samp>python soundex2c.py
<samp class="computeroutput">Woo             W000 12.6070768771
Pilgrim         P426 14.4033353401
Flingjingwaller F452 19.7774882003</samp>
</pre><p>The first thing to consider is replacing that regular expression with a loop. This code is from <code class="filename">soundex/stage4/soundex4a.py</code>:
<div class="informalexample"><pre class="programlisting">
    digits3 = ''
    for d in digits2:
        if d != '9':
            digits3 += d
</pre><p>Is <code class="filename">soundex4a.py</code> faster? Yes it is:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage4></samp>python soundex4a.py
<samp class="computeroutput">Woo             W000 6.62865531792
Pilgrim         P426 9.02247576158
Flingjingwaller F452 13.6328416042</samp>
</pre><p>But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's <code class="filename">soundex/stage4/soundex4b.py</code>:
<div class="informalexample"><pre class="programlisting">
digits3 = digits2.replace('9', '')
</pre><p>Is <code class="filename">soundex4b.py</code> faster? That's an interesting question. It depends on the input:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage4></samp>python soundex4b.py
<samp class="computeroutput">Woo             W000 6.75477414029
Pilgrim         P426 7.56652144337
Flingjingwaller F452 10.8727729362</samp>
</pre><p>The string method in <code class="filename">soundex4b.py</code> is faster than the loop for most names, but it's actually slightly slower than <code class="filename">soundex4a.py</code> in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case
faster can sometimes make other cases slower. In this case, the majority of cases will benefit from the change, so let's
leave it at that, but the principle is an important one to remember.
<p>Last but not least, let's examine the final two steps of the algorithm: padding short results with zeros, and truncating long
results to four characters. The code you see in <code class="filename">soundex4b.py</code> does just that, but it's horribly inefficient. Take a look at <code class="filename">soundex/stage4/soundex4c.py</code> to see why:
<div class="informalexample"><pre class="programlisting">
digits3 += '000'
return digits3[:4]
</pre><p>Why do we need a <code>while</code> loop to pad out the result? We know in advance that we're going to truncate the result to four characters, and we know that
we already have at least one character (the initial letter, which is passed unchanged from the original <code class="varname">source</code> variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the
exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution.
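<p>The reasoning is easy to check. In this Python 3 sketch (the function names are hypothetical), the loop and the pad-then-truncate shortcut agree for every input that has at least one character. That condition matters: for an empty string the loop would give <code>0000</code> while the shortcut would give <code>000</code>, which is why the guarantee of an initial letter is part of the argument.

```python
def pad_with_loop(digits3):
    # Original approach: append zeros one at a time, then truncate.
    while len(digits3) < 4:
        digits3 += "0"
    return digits3[:4]


def pad_once(digits3):
    # soundex4c.py approach: always add three zeros, then truncate.
    return (digits3 + "000")[:4]


if __name__ == "__main__":
    for value in ("W", "P4", "P42", "P426", "F4525246"):
        print(value.ljust(10), pad_with_loop(value), pad_once(value))
```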
<p>How much speed do we gain in <code class="filename">soundex4c.py</code> by dropping the <code>while</code> loop? It's significant:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage4></samp>python soundex4c.py
<samp class="computeroutput">Woo             W000 4.89129791636
Pilgrim         P426 7.30642134685
Flingjingwaller F452 10.689832367</samp>
</pre><p>Finally, there is still one more thing you can do to these three lines of code to make them faster: you can combine them into
one line. Take a look at <code class="filename">soundex/stage4/soundex4d.py</code>:
<div class="informalexample"><pre class="programlisting">
return (digits2.replace('9', '') + '000')[:4]
</pre><p>Putting all this code on one line in <code class="filename">soundex4d.py</code> is barely faster than <code class="filename">soundex4c.py</code>:
<div class="informalexample"><pre class="screen">
<samp class="prompt">C:\samples\soundex\stage4></samp>python soundex4d.py
<samp class="computeroutput">Woo             W000 4.93624105857
Pilgrim         P426 7.19747593619
Flingjingwaller F452 10.5490700634</samp>
</pre><p>It is also significantly less readable, and for not much performance gain. Is that worth it? I hope you have good comments.
Performance isn't everything. Your optimization efforts must always be balanced against threats to your program's readability
and maintainability.
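<p>For reference, here is one way the pieces from this chapter might fit together in Python 3. Treat it as a sketch assembled for illustration rather than one of the book's sample files: it uses the Python 3 names <code>string.ascii_uppercase</code>, <code>string.ascii_lowercase</code>, and <code>str.maketrans</code>, and it adds an <code>isascii()</code> check (available from Python 3.7) as an assumption, because Python 3's <code class="methodname">isalpha()</code> accepts accented letters that the translation table does not cover.

```python
import string

_ALL_CHARS = string.ascii_uppercase + string.ascii_lowercase
_CHAR_TO_SOUNDEX = str.maketrans(_ALL_CHARS, "91239129922455912623919292" * 2)


def soundex(source):
    # Reject empty input and anything that is not a plain ASCII letter.
    if not (source.isalpha() and source.isascii()):
        return "0000"
    # Keep the first letter, translate the rest to digits in one call.
    digits = source[0].upper() + source[1:].translate(_CHAR_TO_SOUNDEX)
    # Collapse consecutive duplicates with the straightforward loop.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # Drop the ignored letters (9), pad with zeros, truncate to four.
    return (digits2.replace("9", "") + "000")[:4]


if __name__ == "__main__":
    for name in ("Woo", "Pilgrim", "Flingjingwaller"):
        print(name.ljust(15), soundex(name))
```

<p>The last line of the function is kept as a one-liner here only because it is short; splitting it into two or three statements costs almost nothing and may read better.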
<h2 id="soundex.summary">18.7. Summary</h2>
<p>This chapter has illustrated several important aspects of performance tuning in Python, and performance tuning in general.
<div class="itemizedlist">
<ul>
<li>If you need to choose between regular expressions and writing a loop, choose regular expressions. The regular expression
engine is compiled in C and runs natively on your computer; your loop is written in Python and runs through the Python interpreter.
<li>If you need to choose between regular expressions and string methods, choose string methods. Both are compiled in C, so choose
the simpler one.
<li>General-purpose dictionary lookups are fast, but specialty functions such as <code class="function">string.maketrans</code> and string methods such as <code class="methodname">isalpha()</code> are faster. If Python has a custom-tailored function for you, use it.
<li>Don't be too clever. Sometimes the most obvious algorithm is also the fastest.
<li>Don't sweat it too much. Performance isn't everything.
</ul>
<p>I can't emphasize that last point strongly enough. Over the course of this chapter, you made this function three times faster
and saved 20 seconds over 1 million function calls. Great. Now think: over the course of those million function calls, how
many seconds will your surrounding application wait for a database connection? Or wait for disk I/O? Or wait for user input?
Don't spend too much time over-optimizing one algorithm, or you'll ignore obvious improvements somewhere else. Develop an
instinct for the sort of code that Python runs well, correct obvious blunders if you find them, and leave the rest alone.
</body>
</html>