<div class=chapter>
<h2 id="install">Chapter 1. Installing Python</h2>
<p>Welcome to Python. Let's dive in. In this chapter, you'll install the version of Python that's right for you.
<h2 id="install.choosing">1.1. Which Python is right for you?</h2>
<p>The first thing you need to do with Python is install it. Or do you?
<p>If you're using an account on a hosted server, your ISP may have already installed Python. Most popular Linux distributions come with Python in the default installation. Mac OS X 10.2 and later includes a command-line version of Python, although you'll probably want to install a version that includes a more Mac-like graphical interface.
<p>Windows does not come with any version of Python, but don't despair! There are several ways to point-and-click your way to Python on Windows.
<p>As you can see already, Python runs on a great many operating systems. The full list includes Windows, Mac OS, Mac OS X, and all varieties of free <abbr>UNIX</abbr>-compatible systems like Linux. There are also versions that run on Sun Solaris, AS/400, Amiga, OS/2, BeOS, and a plethora
of other platforms you've probably never even heard of.
<p>What's more, Python programs written on one platform can, with a little care, run on <em>any</em> supported platform. For instance, I regularly develop Python programs on Windows and later deploy them on Linux.
<p>So back to the question that started this section, &#8220;Which Python is right for you?&#8221; The answer is whichever one runs on the computer you already have.
<h2 id="install.windows">1.2. Python on Windows</h2>
<div class=procedure>
<h3>Procedure 1.1. Option 1: Installing ActivePython</h3>
<p>Here is the procedure for installing ActivePython:
<ol>
<li>
<p>Download ActivePython from <a href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>.
<li>
<p>If you are using Windows 95, Windows 98, or Windows ME, you will also need to download and install <a href="http://download.microsoft.com/download/WindowsInstaller/Install/2.0/W9XMe/EN-US/InstMsiA.exe">Windows Installer 2.0</a> before installing ActivePython.
<li>
<p>Double-click the installer, <code>ActivePython-2.2.2-224-win32-ix86.msi</code>.
<li>
<p>Step through the installer program.
<li>
<p>If space is tight, you can do a custom installation and deselect the documentation, but I don't recommend this unless you
absolutely can't spare the 14MB.
<li>
<p>After the installation is complete, close the installer and choose Start->Programs->ActiveState ActivePython 2.2->PythonWin IDE. You'll see something like the following:
</ol>
<pre class=screen>
<samp>PythonWin 2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)] on win32.
Portions Copyright 1994-2001 Mark Hammond (mhammond@skippinet.com.au) -
see 'Help/About PythonWin' for further copyright information.</samp>
<samp class=p>>>> </samp><kbd></kbd>
</pre><div class=procedure>
<h3>Procedure 1.2. Option 2: Installing Python from <a href="http://www.python.org/" title="Python language home page">Python.org</a></h3>
<ol>
<li>
<p>Download the latest Python Windows installer by going to <a href="http://www.python.org/ftp/python/">http://www.python.org/ftp/python/</a> and selecting the highest version number listed, then downloading the <code>.exe</code> installer.
<li>
<p>Double-click the installer, <code>Python-2.xxx.yyy.exe</code>. The name will depend on the version of Python available when you read this.
<li>
<p>Step through the installer program.
<li>
<p>If disk space is tight, you can deselect the HTMLHelp file, the utility scripts (<code>Tools/</code>), and/or the test suite (<code>Lib/test/</code>).
<li>
<p>If you do not have administrative rights on your machine, you can select Advanced Options, then choose Non-Admin Install. This just affects where Registry entries and Start menu shortcuts are created.
<li>
<p>After the installation is complete, close the installer and select Start->Programs->Python 2.3->IDLE (Python GUI). You'll see something like the following:
</ol>
<pre class=screen>
<samp>Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
****************************************************************
Personal firewall software may warn about the connection IDLE
makes to its subprocess using this computer's internal loopback
interface. This connection is not visible on any external
interface and no data is sent to or received from the Internet.
****************************************************************
IDLE 1.0</samp>
<samp class=p>>>> </samp><kbd></kbd>
</pre><h2 id="install.macosx">1.3. Python on Mac OS X</h2>
<p>On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it.
<p>Mac OS X 10.2 and later comes with a command-line version of Python preinstalled. If you are comfortable with the command line, you can use this version for the first third of the book. However,
the preinstalled version does not come with an <abbr>XML</abbr> parser, so when you get to the <abbr>XML</abbr> chapter, you'll need to install the full version.
<p>Rather than using the preinstalled version, you'll probably want to install the latest version, which also comes with a graphical
interactive shell.
<div class=procedure>
<h3>Procedure 1.3. Running the Preinstalled Version of Python on Mac OS X</h3>
<p>To use the preinstalled version of Python, follow these steps:
<ol>
<li>
<p>Open the <code>/Applications</code> folder.
<li>
<p>Open the <code>Utilities</code> folder.
<li>
<p>Double-click <code>Terminal</code> to open a terminal window and get to a command line.
<li>
<p>Type <kbd>python</kbd> at the command prompt.
</ol>
<p>Try it out:
<pre class=screen>
Welcome to Darwin!
<samp class=p>[localhost:~] you% </samp>python
<samp>Python 2.2 (#1, 07/14/02, 23:25:09)
[GCC Apple cpp-precomp 6.14] on darwin
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class=p>>>> </samp><kbd>[press Ctrl+D to get back to the command prompt]</kbd>
<samp class=p>[localhost:~] you% </samp>
</pre><div class=procedure>
<h3>Procedure 1.4. Installing the Latest Version of Python on Mac OS X</h3>
<p>Follow these steps to download and install the latest version of Python:
<ol>
<li>
<p>Download the <code>MacPython-OSX</code> disk image from <a href="http://homepages.cwi.nl/~jack/macpython/download.html">http://homepages.cwi.nl/~jack/macpython/download.html</a>.
<li>
<p>If your browser has not already done so, double-click <code>MacPython-OSX-2.3-1.dmg</code> to mount the disk image on your desktop.
<li>
<p>Double-click the installer, <code>MacPython-OSX.pkg</code>.
<li>
<p>The installer will prompt you for your administrative username and password.
<li>
<p>Step through the installer program.
<li>
<p>After installation is complete, close the installer and open the <code>/Applications</code> folder.
<li>
<p>Open the <code>MacPython-2.3</code> folder.
<li>
<p>Double-click <code>PythonIDE</code> to launch Python.
</ol>
<p>The MacPython <abbr>IDE</abbr> should display a splash screen, then take you to the interactive shell. If the interactive shell does not appear, select
Window->Python Interactive (<kbd class=shortcut>Cmd-0</kbd>). The opening window will look something like this:
<pre class=screen>
<samp>Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1</samp>
<samp class=p>>>> </samp><kbd></kbd>
</pre><p>Note that once you install the latest version, the pre-installed version is still present. If you are running scripts from
the command line, you need to be aware which version of Python you are using.
<div class=example><h3>Example 1.1. Two versions of Python</h3><pre class=screen>
<samp class=p>[localhost:~] you% </samp>python
<samp>Python 2.2 (#1, 07/14/02, 23:25:09)
[GCC Apple cpp-precomp 6.14] on darwin
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class=p>>>> </samp><kbd>[press Ctrl+D to get back to the command prompt]</kbd>
<samp class=p>[localhost:~] you% </samp>/usr/local/bin/python
<samp>Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)] on darwin
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class=p>>>> </samp><kbd>[press Ctrl+D to get back to the command prompt]</kbd>
<samp class=p>[localhost:~] you% </samp>
</pre><h2 id="install.macos9">1.4. Python on Mac OS 9</h2>
<p>Mac OS 9 does not come with any version of Python, but installation is very simple, and there is only one choice.
<div class=procedure>
<p>Follow these steps to install Python on Mac OS 9:
<ol>
<li>
<p>Download the <code>MacPython23full.bin</code> file from <a href="http://homepages.cwi.nl/~jack/macpython/download.html">http://homepages.cwi.nl/~jack/macpython/download.html</a>.
<li>
<p>If your browser does not decompress the file automatically, double-click <code>MacPython23full.bin</code> to decompress the file with Stuffit Expander.
<li>
<p>Double-click the installer, <code>MacPython23full</code>.
<li>
<p>Step through the installer program.
<li>
<p>After installation is complete, close the installer and open the <code>/Applications</code> folder.
<li>
<p>Open the <code>MacPython-OS9 2.3</code> folder.
<li>
<p>Double-click <code>Python IDE</code> to launch Python.
</ol>
<p>The MacPython <abbr>IDE</abbr> should display a splash screen, and then take you to the interactive shell. If the interactive shell does not appear, select
Window->Python Interactive (<kbd class=shortcut>Cmd-0</kbd>). You'll see a screen like this:
<pre class=screen>
<samp>Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1</samp>
<samp class=p>>>> </samp><kbd></kbd>
</pre><h2 id="install.redhat">1.5. Python on Red Hat Linux</h2>
<p>Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built
binary packages are available for most popular Linux distributions. Or you can always compile from source.
<p>Download the latest Python <abbr>RPM</abbr> by going to <a href="http://www.python.org/ftp/python/">http://www.python.org/ftp/python/</a> and selecting the highest version number listed, then selecting the <code>rpms/</code> directory within that. Then download the <abbr>RPM</abbr> with the highest version number. You can install it with the <kbd>rpm</kbd> command, as shown here:
<div class=example><h3>Example 1.2. Installing on Red Hat Linux 9</h3><pre class=screen>
<samp class=p>localhost:~$ </samp>su -
<samp class=p>Password: </samp>[enter your root password]
<samp class=p>[root@localhost root]# </samp>wget http://python.org/ftp/python/2.3/rpms/redhat-9/python2.3-2.3-5pydotorg.i386.rpm
<samp>Resolving python.org... done.
Connecting to python.org[194.109.137.226]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,495,111 [application/octet-stream]
...</samp>
<samp class=p>[root@localhost root]# </samp>rpm -Uvh python2.3-2.3-5pydotorg.i386.rpm
<samp>Preparing... ########################################### [100%]
1:python2.3 ########################################### [100%]</samp>
<samp class=p>[root@localhost root]# </samp>python <span>&#x2460;</span>
<samp>Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class=p>>>> </samp><kbd>[press Ctrl+D to exit]</kbd>
<samp class=p>[root@localhost root]# </samp>python2.3 <span>&#x2461;</span>
<samp>Python 2.3 (#1, Sep 12 2003, 10:53:56)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits", or "license" for more information.</samp>
<samp class=p>>>> </samp><kbd>[press Ctrl+D to exit]</kbd>
<samp class=p>[root@localhost root]# </samp>which python2.3 <span>&#x2462;</span>
/usr/bin/python2.3
</pre>
<ol>
<li>Whoops! Just typing <kbd>python</kbd> gives you the older version of Python -- the one that was installed by default. That's not the one you want.
<li>At the time of this writing, the newest version is called <kbd>python2.3</kbd>. You'll probably want to change the path on the first line of the sample scripts to point to the newer version.
<li>This is the complete path of the newer version of Python that you just installed. Use this on the <code>#!</code> line (the first line of each script) to ensure that scripts are running under the latest version of Python, and be sure to type <kbd>python2.3</kbd> to get into the interactive shell.
<h2 id="install.debian">1.6. Python on Debian <abbr>GNU</abbr>/Linux</h2>
<p>If you are lucky enough to be running Debian <abbr>GNU</abbr>/Linux, you can install Python with the <kbd>apt-get</kbd> command.
<div class=example><h3>Example 1.3. Installing on Debian <abbr>GNU</abbr>/Linux</h3><pre class=screen>
<samp class=p>localhost:~$ </samp>su -
<samp class=p>Password: </samp>[enter your root password]
<samp class=p>localhost:~# </samp>apt-get install python
<samp>Reading Package Lists... Done
Building Dependency Tree... Done
The following extra packages will be installed:
python2.3
Suggested packages:
python-tk python2.3-doc
The following NEW packages will be installed:
python python2.3
0 upgraded, 2 newly installed, 0 to remove and 3 not upgraded.
Need to get 0B/2880kB of archives.
After unpacking 9351kB of additional disk space will be used.</samp>
<samp class=p>Do you want to continue? [Y/n] </samp>Y
<samp>Selecting previously deselected package python2.3.
(Reading database ... 22848 files and directories currently installed.)
Unpacking python2.3 (from .../python2.3_2.3.1-1_i386.deb) ...
Selecting previously deselected package python.
Unpacking python (from .../python_2.3.1-1_all.deb) ...
Setting up python (2.3.1-1) ...
Setting up python2.3 (2.3.1-1) ...
Compiling python modules in /usr/lib/python2.3 ...
Compiling optimized python modules in /usr/lib/python2.3 ...</samp>
<samp class=p>localhost:~# </samp>exit
logout
<samp class=p>localhost:~$ </samp>python
<samp>Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
[GCC 3.3.2 20030908 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.</samp>
<samp class=p>>>> </samp><kbd>[press Ctrl+D to exit]</kbd>
</pre><h2 id="install.source">1.7. Python Installation from Source</h2>
<p>If you prefer to build from source, you can download the Python source code from <a href="http://www.python.org/ftp/python/">http://www.python.org/ftp/python/</a>. Select the highest version number listed, download the <code>.tgz</code> file, and then do the usual <kbd>configure</kbd>, <kbd>make</kbd>, <kbd>make install</kbd> dance.
<div class=example><h3>Example 1.4. Installing from source</h3><pre class=screen>
<samp class=p>localhost:~$ </samp>su -
<samp class=p>Password: </samp>[enter your root password]
<samp class=p>localhost:~# </samp>wget http://www.python.org/ftp/python/2.3/Python-2.3.tgz
<samp>Resolving www.python.org... done.
Connecting to www.python.org[194.109.137.226]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8,436,880 [application/x-tar]
...</samp>
<samp class=p>localhost:~# </samp>tar xfz Python-2.3.tgz
<samp class=p>localhost:~# </samp>cd Python-2.3
<samp class=p>localhost:~/Python-2.3# </samp>./configure
<samp>checking MACHDEP... linux2
checking EXTRAPLATDIR...
checking for --without-gcc... no
...</samp>
<samp class=p>localhost:~/Python-2.3# </samp>make
<samp>gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include -DPy_BUILD_CORE -o Modules/python.o Modules/python.c
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include -DPy_BUILD_CORE -o Parser/acceler.o Parser/acceler.c
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
-I. -I./Include -DPy_BUILD_CORE -o Parser/grammar1.o Parser/grammar1.c
...</samp>
<samp class=p>localhost:~/Python-2.3# </samp>make install
<samp>/usr/bin/install -c python /usr/local/bin/python2.3
...</samp>
<samp class=p>localhost:~/Python-2.3# </samp>exit
logout
<samp class=p>localhost:~$ </samp>which python
/usr/local/bin/python
<samp class=p>localhost:~$ </samp>python
<samp>Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
[GCC 3.3.2 20030908 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.</samp>
<samp class=p>>>> </samp><kbd>[press Ctrl+D to get back to the command prompt]</kbd>
<samp class=p>localhost:~$ </samp>
</pre><h2 id="install.shell">1.8. The Interactive Shell</h2>
<p>Now that you have Python installed, what's this interactive shell thing you're running?
<p>It's like this: Python leads a double life. It's an interpreter for scripts that you can run from the command line or run like applications, by
double-clicking the scripts. But it's also an interactive shell that can evaluate arbitrary statements and expressions.
This is extremely useful for debugging, quick hacking, and testing. I even know some people who use the Python interactive shell in lieu of a calculator!
<p>Launch the Python interactive shell in whatever way works on your platform, and let's dive in with the steps shown here:
<div class=example><h3>Example 1.5. First Steps in the Interactive Shell</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>1 + 1</kbd> <span>&#x2460;</span>
2
<samp class=p>>>> </samp><kbd>print 'hello world'</kbd> <span>&#x2461;</span>
hello world
<samp class=p>>>> </samp><kbd>x = 1</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>y = 2</kbd>
<samp class=p>>>> </samp><kbd>x + y</kbd>
3
</pre>
<ol>
<li>The Python interactive shell can evaluate arbitrary Python expressions, including any basic arithmetic expression.
<li>The interactive shell can execute arbitrary Python statements, including the <kbd>print</kbd> statement.
<li>You can also assign values to variables, and the values will be remembered as long as the shell is open (but not any longer
than that).
<h2 id="install.summary">1.9. Summary</h2>
<p>You should now have a version of Python installed that works for you.
<p>Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing <kbd>python</kbd> on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
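<p>If you're not sure which interpreter a given command starts, you can ask Python itself. The following is a minimal sketch rather than part of the installation steps above; the helper name <code>describe_interpreter</code> is made up for illustration, and it relies only on the standard <code>sys</code> module.

```python
import sys

def describe_interpreter():
    """Return (version, path) for the interpreter running this code."""
    # sys.version_info starts with the major, minor, and micro numbers.
    version = "%d.%d.%d" % tuple(sys.version_info[:3])
    # sys.executable is the full pathname of the running interpreter.
    return version, sys.executable

version, path = describe_interpreter()
print("Running Python %s from %s" % (version, path))
```

<p>Run it once with each <kbd>python</kbd> command available on your system and compare the results.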
<p>Congratulations, and welcome to Python.
<h2 id="odbchelper.objects">2.4. Everything Is an Object</h2>
<h2 id="odbchelper.testing">2.6. Testing Modules</h2>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them.
Here's an example that uses the <code>if</code> <code>__name__</code> trick.
<pre id="odbchelper.ifnametrick" class=programlisting>
if __name__ == "__main__":</pre><p>Some quick observations before you get to the good stuff. First, parentheses are not required around the <code>if</code> expression. Second, the <code>if</code> statement ends with a colon, and is followed by <a href="#odbchelper.indenting" title="2.5. Indenting Code">indented code</a>.
<table id="compare.equals.c" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
<p>So why is this particular <code>if</code> statement a trick? Modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone
program, in which case <code>__name__</code> will be a special default value, <code>__main__</code>.
<pre class=screen><samp class=p>>>> </samp><kbd>import odbchelper</kbd>
<samp class=p>>>> </samp>odbchelper.<code>__name__</code>
'odbchelper'</pre><p>Knowing this, you can design a test suite for your module within the module itself by putting it in this <code>if</code> statement. When you run the module directly, <code>__name__</code> is <code>__main__</code>, so the test suite executes. When you import the module, <code>__name__</code> is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating
them into a larger program.
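<p>Here is a minimal sketch of that layout. The module name <code>doubler.py</code> and the functions <code>double_all</code> and <code>run_tests</code> are hypothetical, chosen only to keep the example short; they are not part of <code>odbchelper</code>.

```python
# A hypothetical module, saved as doubler.py (the name is illustrative).

def double_all(values):
    """Return a new list with every element doubled."""
    return [value * 2 for value in values]

def run_tests():
    """A tiny self-test; returns True when every check passes."""
    assert double_all([1, 9, 8, 4]) == [2, 18, 16, 8]
    assert double_all([]) == []
    return True

if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    run_tests()
    print("all tests passed")
```

<p>Running the file directly executes the tests. Importing it from another program only defines the two functions and leaves the test suite alone.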
<table id="tip.mac.runasmain" class=tip border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">On MacPython, there is an additional step to make the <code>if</code> <code>__name__</code> trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and
make sure Run as __main__ is checked.
<div class=itemizedlist>
<h3>Further Reading on Importing Modules</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class=citetitle>Python Reference Manual</i></a> discusses the low-level details of <a href="http://www.python.org/doc/current/ref/import.html">importing modules</a>.
</ul>
<h2 id="odbchelper.vardef">3.4. Declaring variables</h2>
<p>Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from <a href="#odbchelper">Chapter 2</a>, <code>odbchelper.py</code>.
<p>Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring
into existence by being assigned a value, and they are automatically destroyed when they go out of scope.
<div class=example><h3 id="myparamsdef">Example 3.17. Defining the <var>myParams</var> Variable</h3><pre><code>
if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }</pre><p>Notice the indentation. An <code>if</code> statement is a code block and needs to be indented just like a function.
<p>Also notice that the variable assignment is one command split over several lines, with a backslash (&#8220;<code>\</code>&#8221;) serving as a line-continuation marker.
<table id="tip.multiline" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">When a command is split among several lines with the line-continuation marker (&#8220;<code>\</code>&#8221;), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python <abbr>IDE</abbr> auto-indents the continued line, you should probably accept its default unless you have a burning reason not to.
<p><a name="tip.implicitmultiline"></a>Strictly speaking, expressions in parentheses, square brackets, or curly braces (like <a href="#myparamsdef" title="Example 3.17. Defining the myParams Variable">defining a dictionary</a>) can be split into multiple lines with or without the line continuation character (&#8220;<code>\</code>&#8221;). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's
a matter of style.
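<p>To see both forms side by side, here is a short sketch. The variable names <code>total</code> and <code>my_params</code> are illustrative, and the ragged indentation of the continued lines is deliberate: it shows that Python ignores it.

```python
# Explicit continuation: the backslash joins the physical lines,
# and the indentation of the continued lines does not matter.
total = 1 + \
    2 + \
        3

# Implicit continuation: inside curly braces no backslash is needed,
# and again the continued lines can be indented in any manner.
my_params = {"server": "mpilgrim",
             "database": "master",
        "uid": "sa",
                 "pwd": "secret"}

print(total)
print(len(my_params))
```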
[unbound variable exception example was here]
<h3 id="odbchelper.multiassign">3.4.2. Assigning Multiple Values at Once</h3>
<p>One of the cooler programming shortcuts in Python is using sequences to assign multiple values at once.
<div class=example><h3>Example 3.19. Assigning multiple values at once</h3><pre class=screen><samp class=p>>>> </samp><kbd>v = ('a', 'b', 'e')</kbd>
<samp class=p>>>> </samp><kbd>(x, y, z) = v</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>x</kbd>
'a'
<samp class=p>>>> </samp><kbd>y</kbd>
'b'
<samp class=p>>>> </samp><kbd>z</kbd>
'e'</pre>
<ol>
<li><var>v</var> is a tuple of three elements, and <code>(x, y, z)</code> is a tuple of three variables. Assigning one to the other assigns each of the values of <var>v</var> to each of the variables, in order.
<p>This has all sorts of uses. I often want to assign names to a range of values. In <abbr>C</abbr>, you would use <code>enum</code> and manually list each constant and its associated value, which seems especially tedious when the values are consecutive.
In Python, you can use the built-in <code>range</code> function with multi-variable assignment to quickly assign consecutive values.
<div class=example><h3 id="odbchelper.multiassign.range">Example 3.20. Assigning Consecutive Values</h3><pre class=screen><samp class=p>>>> </samp><kbd>range(7)</kbd> <span>&#x2460;</span>
[0, 1, 2, 3, 4, 5, 6]
<samp class=p>>>> </samp><kbd>(MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>MONDAY</kbd> <span>&#x2462;</span>
0
<samp class=p>>>> </samp><kbd>TUESDAY</kbd>
1
<samp class=p>>>> </samp><kbd>SUNDAY</kbd>
6</pre>
<ol>
<li>The built-in <code>range</code> function returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting
up to but not including the upper limit. (If you like, you can pass other parameters to specify a base other than <code>0</code> and a step other than <code>1</code>. You can <code>print range.__doc__</code> for details.)
<li><var>MONDAY</var>, <var>TUESDAY</var>, <var>WEDNESDAY</var>, <var>THURSDAY</var>, <var>FRIDAY</var>, <var>SATURDAY</var>, and <var>SUNDAY</var> are the variables you're defining. (This example came from the <code>calendar</code> module, a fun little module that prints calendars, like the <abbr>UNIX</abbr> program <code>cal</code>. The <code>calendar</code> module defines integer constants for days of the week.)
<li>Now each variable has its value: <var>MONDAY</var> is <code>0</code>, <var>TUESDAY</var> is <code>1</code>, and so forth.
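<p>The same trick works with the optional start and step parameters. In this sketch the names <code>JANUARY</code>, <code>LOW</code>, and friends are hypothetical; only the weekday names come from the example above, and they are compared against the constants in the standard <code>calendar</code> module.

```python
import calendar

# Start counting at 1 instead of 0: range(start, stop).
(JANUARY, FEBRUARY, MARCH) = range(1, 4)

# Count in steps of 10: range(start, stop, step).
(LOW, MEDIUM, HIGH) = range(10, 40, 10)

# The assignment from the example, checked against calendar's constants.
(MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)
print("MONDAY matches calendar: %s" % (MONDAY == calendar.MONDAY))
print("SUNDAY matches calendar: %s" % (SUNDAY == calendar.SUNDAY))
```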
<p>You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of
all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the <code>os</code> module, which is discussed in <a href="#filehandling">Chapter 6</a>.
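<p>A minimal sketch of a function that returns multiple values this way follows. The function <code>min_and_max</code> and the file path are made up for illustration; <code>os.path.split</code> is a real standard library function that returns a tuple in the same fashion.

```python
import os.path

def min_and_max(values):
    """Return the smallest and largest element as a two-element tuple."""
    return min(values), max(values)

# The caller can keep the result as a tuple...
result = min_and_max([1, 9, 8, 4])
# ...or unpack it into individual variables.
(low, high) = min_and_max([1, 9, 8, 4])

# os.path.split returns a (directory, filename) tuple in the same way.
(directory, filename) = os.path.split("/music/demo/song.mp3")

print("%s and %s" % (low, high))
print(filename)
```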
<div class=itemizedlist>
<h3>Further Reading on Variables</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class=citetitle>Python Reference Manual</i></a> shows examples of <a href="http://www.python.org/doc/current/ref/implicit-joining.html">when you can skip the line continuation character</a> and <a href="http://www.python.org/doc/current/ref/explicit-joining.html">when you need to use it</a>.
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class=citetitle>How to Think Like a Computer Scientist</i></a> shows how to use multi-variable assignment to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap09.htm">swap the values of two variables</a>.
</ul>
<h2 id="odbchelper.map">3.6. Mapping Lists</h2>
<p>One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each
of the elements of the list.
<div class=example><h3>Example 3.24. Introducing List Comprehensions</h3><pre class=screen><samp class=p>>>> </samp><kbd>li = [1, 9, 8, 4]</kbd>
<samp class=p>>>> </samp><kbd>[elem*2 for elem in li]</kbd> <span>&#x2460;</span>
[2, 18, 16, 8]
<samp class=p>>>> </samp><kbd>li</kbd> <span>&#x2461;</span>
[1, 9, 8, 4]
<samp class=p>>>> </samp><kbd>li = [elem*2 for elem in li]</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>li</kbd>
[2, 18, 16, 8]</pre>
<ol>
<li>To make sense of this, look at it from right to left. <var>li</var> is the list you're mapping. Python loops through <var>li</var> one element at a time, temporarily assigning the value of each element to the variable <var>elem</var>. Python then evaluates the expression <code><var>elem</var>*2</code> and appends the result to the returned list.
<li>Note that list comprehensions do not change the original list.
<li>It is safe to assign the result of a list comprehension to the variable that you're mapping. Python constructs the new list in memory, and when the list comprehension is complete, it assigns the result to the variable.
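<p>If the right-to-left reading still feels odd, it may help to compare the list comprehension with the loop it replaces. This is a sketch for comparison only; the names <code>doubled_loop</code> and <code>doubled_comprehension</code> are illustrative.

```python
li = [1, 9, 8, 4]

# The long way: build the new list one element at a time.
doubled_loop = []
for elem in li:
    doubled_loop.append(elem * 2)

# The list comprehension does the same work in a single expression.
doubled_comprehension = [elem * 2 for elem in li]

print(doubled_loop == doubled_comprehension)
print(li)  # the original list is untouched
```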
<p>Here is the list comprehension in the <code>buildConnectionString</code> function that you declared in <a href="#odbchelper">Chapter 2</a>:<pre><code>
["%s=%s" % (k, v) for k, v in params.items()]</pre><p>First, notice that you're calling the <code>items</code> method of the <var>params</var> dictionary. This method returns a list of tuples of all the data in the dictionary.
<div class=example><h3 id="odbchelper.items">Example 3.25. The <code>keys</code>, <code>values</code>, and <code>items</code> Functions</h3><pre class=screen><samp class=p>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
<samp class=p>>>> </samp><kbd>params.keys()</kbd> <span>&#x2460;</span>
['server', 'uid', 'database', 'pwd']
<samp class=p>>>> </samp><kbd>params.values()</kbd> <span>&#x2461;</span>
['mpilgrim', 'sa', 'master', 'secret']
<samp class=p>>>> </samp><kbd>params.items()</kbd> <span>&#x2462;</span>
[('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]</pre>
<ol>
<li>The <code>keys</code> method of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined
(remember that elements in a dictionary are unordered), but it is a list.
<li>The <code>values</code> method returns a list of all the values. The list is in the same order as the list returned by <code>keys</code>, so <code>params.values()[n] == params[params.keys()[n]]</code> for all values of <var>n</var>.
<li>The <code>items</code> method returns a list of tuples of the form <code>(<var>key</var>, <var>value</var>)</code>. The list contains all the data in the dictionary.
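<p>You can check the ordering guarantee for yourself with a sketch like this one. The <code>list()</code> calls are an assumption about your interpreter: they are there only so the code also runs on newer Python versions, where these methods no longer return plain lists. The names <code>same_order</code> and <code>paired</code> are illustrative.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

keys = list(params.keys())
values = list(params.values())
items = list(params.items())

# values is in the same order as keys...
same_order = [values[n] == params[keys[n]] for n in range(len(keys))]
# ...and items pairs each key with its value in that same order.
paired = [items[n] == (keys[n], values[n]) for n in range(len(keys))]

print(same_order)
print(paired)
```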
<p>Now let's see what <code>buildConnectionString</code> does. It takes a list, <code><var>params</var>.<code>items</code>()</code>, and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements
as <code><var>params</var>.<code>items</code>()</code>, but each element in the new list will be a string that contains both a key and its associated value from the <var>params</var> dictionary.
<div class=example><h3>Example 3.26. List Comprehensions in <code>buildConnectionString</code>, Step by Step</h3><pre class=screen><samp class=p>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
<samp class=p>>>> </samp><kbd>params.items()</kbd>
[('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]
<samp class=p>>>> </samp><kbd>[k for k, v in params.items()]</kbd> <span>&#x2460;</span>
['server', 'uid', 'database', 'pwd']
<samp class=p>>>> </samp><kbd>[v for k, v in params.items()]</kbd> <span>&#x2461;</span>
['mpilgrim', 'sa', 'master', 'secret']
<samp class=p>>>> </samp><kbd>["%s=%s" % (k, v) for k, v in params.items()]</kbd> <span>&#x2462;</span>
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</pre>
<ol>
<li>Note that you're using two variables to iterate through the <code>params.items()</code> list. This is another use of <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a>. The first element of <code>params.items()</code> is <code>('server', 'mpilgrim')</code>, so in the first iteration of the list comprehension, <var>k</var> will get <code>'server'</code> and <var>v</var> will get <code>'mpilgrim'</code>. In this case, you're ignoring the value of <var>v</var> and only including the value of <var>k</var> in the returned list, so this list comprehension ends up being equivalent to <code><var>params</var>.<code>keys</code>()</code>.
<li>Here you're doing the same thing, but ignoring the value of <var>k</var>, so this list comprehension ends up being equivalent to <code><var>params</var>.<code>values</code>()</code>.
<li>Combining the previous two examples with some simple <a href="#odbchelper.stringformatting" title="3.5. Formatting Strings">string formatting</a>, you get a list of strings that include both the key and value of each element of the dictionary. This looks suspiciously
like the <a href="#odbchelper.output">output</a> of the program. All that remains is to join the elements in this list into a single string.
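<p>As a rough sketch of that last step (not the book's <code>odbchelper.py</code> itself), the list of formatted strings can be joined with <code>";"</code>. The call to <code>sorted</code> is an assumption added here so that the result does not depend on dictionary ordering.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# Map each (key, value) pair to a "key=value" string.
pairs = ["%s=%s" % (k, v) for k, v in sorted(params.items())]

# Join the list of strings into a single connection string.
connection_string = ";".join(pairs)
```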
<div class=itemizedlist>
<h3>Further Reading on List Comprehensions</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> discusses another way to map lists <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007130000000000000000">using the built-in <code>map</code> function</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007140000000000000000">do nested list comprehensions</a>.
</ul>
(String splitting stuff was here)
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
<div class=itemizedlist>
<ul>
<li>Using the Python <abbr>IDE</abbr> to test expressions interactively
<li>Writing Python programs and <a href="#odbchelper.testing" title="2.6. Testing Modules">running them from within your <abbr>IDE</abbr></a>, or from the command line
<li><a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's docstring">Importing modules</a> and calling their functions
<li><a href="#odbchelper.funcdef" title="2.2. Declaring Functions">Declaring functions</a> and using <a href="#odbchelper.docstring" title="2.3. Documenting Functions"><code>docstring</code>s</a>, <a href="#odbchelper.vardef" title="3.4. Declaring variables">local variables</a>, and <a href="#odbchelper.indenting" title="2.5. Indenting Code">proper indentation</a>
<li>Defining <a href="#odbchelper.dict" title="3.1. Introducing Dictionaries">dictionaries</a>, <a href="#odbchelper.tuple" title="3.3. Introducing Tuples">tuples</a>, and <a href="#odbchelper.list" title="3.2. Introducing Lists">lists</a>
<li>Accessing attributes and methods of <a href="#odbchelper.objects" title="2.4. Everything Is an Object">any object</a>, including strings, lists, dictionaries, functions, and modules
<li>Concatenating values through <a href="#odbchelper.stringformatting" title="3.5. Formatting Strings">string formatting</a>
<li><a href="#odbchelper.map" title="3.6. Mapping Lists">Mapping lists</a> into other lists using list comprehensions
<li><a href="#odbchelper.join" title="3.7. Joining Lists and Splitting Strings">Splitting strings</a> into lists and joining lists into strings
</ul>
<div class=chapter>
<h2 id="apihelper">Chapter 4. The Power Of Introspection</h2>
<p>This chapter covers one of Python's strengths: introspection. As you know, <a href="#odbchelper.objects" title="2.4. Everything Is an Object">everything in Python is an object</a>, and introspection is code looking at other modules and functions in memory as objects, getting information about them, and
manipulating them. Along the way, you'll define functions with no name, call functions with arguments out of order, and reference
functions whose names you don't even know ahead of time.
<h2 id="apihelper.divein">4.1. Diving In</h2>
<p>Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered
in <a href="#odbchelper" title="Chapter 2. Your First Python Program">Chapter 2, <i>Your First Python Program</i></a>. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter.
<div class=example><h3>Example 4.1. <code>apihelper.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
def info(object, spacing=10, collapse=1): <span>&#x2460;</span> <span>&#x2461;</span> <span>&#x2462;</span>
    """Print methods and docstrings.
    Takes module, class, list, dictionary, or string."""
    methodList = [method for method in dir(object) if callable(getattr(object, method))]
    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    print "\n".join(["%s %s" %
                     (method.ljust(spacing),
                      processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

if __name__ == "__main__": <span>&#x2463;</span> <span>&#x2464;</span>
    print info.__doc__</pre>
<ol>
<li>This module has one function, <code>info</code>. According to its <a href="#odbchelper.funcdef" title="2.2. Declaring Functions">function declaration</a>, it takes three parameters: <var>object</var>, <var>spacing</var>, and <var>collapse</var>. The last two are actually optional parameters, as you'll see shortly.
<li>The <code>info</code> function has a multi-line <a href="#odbchelper.docstring" title="2.3. Documenting Functions"><code>docstring</code></a> that succinctly describes the function's purpose. Note that no return value is mentioned; this function will be used solely
for its effects, rather than its value.
<li>Code within the function is <a href="#odbchelper.indenting" title="2.5. Indenting Code">indented</a>.
<li>The <code>if __name__</code> <a href="#odbchelper.ifnametrick">trick</a> allows this program to do something useful when run by itself, without interfering with its use as a module for other programs.
In this case, the program simply prints out the <code>docstring</code> of the <code>info</code> function.
<li><a href="#odbchelper.ifnametrick"><code>if</code> statements</a> use <code>==</code> for comparison, and parentheses are not required.
<p>The <code>info</code> function is designed to be used by you, the programmer, while working in the Python <abbr>IDE</abbr>. It takes any object that has functions or methods (like a module, which has functions, or a list, which has methods) and
prints out the functions and their <code>docstring</code>s.
<div class=example><h3>Example 4.2. Sample Usage of <code>apihelper.py</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>from apihelper import info</kbd>
<samp class=p>>>> </samp><kbd>li = []</kbd>
<samp class=p>>>> </samp><kbd>info(li)</kbd>
<samp>append L.append(object) -- append object to end
count L.count(value) -> integer -- return number of occurrences of value
extend L.extend(list) -- extend list by appending list elements
index L.index(value) -> integer -- return index of first occurrence of value
insert L.insert(index, object) -- insert object before index
pop L.pop([index]) -> item -- remove and return item at index (default last)
remove L.remove(value) -- remove first occurrence of value
reverse L.reverse() -- reverse *IN PLACE*
sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1</samp></pre><p>By default the output is formatted to be easy to read. Multi-line <code>docstring</code>s are collapsed into a single long line, but this option can be changed by specifying <code>0</code> for the <i class=parameter><code>collapse</code></i> argument. If the function names are longer than 10 characters, you can specify a larger value for the <i class=parameter><code>spacing</code></i> argument to make the output easier to read.
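<p>To make the two options concrete, here is a minimal sketch of how one line of that output could be assembled. The <code>format_entry</code> helper is hypothetical and is not part of <code>apihelper.py</code>; it returns the text instead of printing it.

```python
def format_entry(name, doc, spacing=10, collapse=True):
    """Format one name/docstring pair (hypothetical helper, not part of apihelper.py)."""
    if collapse:
        # split() with no arguments breaks on any run of whitespace, newlines
        # included, so joining with single spaces folds the text onto one line.
        doc = " ".join(doc.split())
    # ljust pads the name with spaces up to the requested column width.
    return "%s %s" % (name.ljust(spacing), doc)


doc = "Build a connection string from a dictionary\n\n    Returns string."
one_line = format_entry("buildConnectionString", doc, spacing=30)
as_written = format_entry("buildConnectionString", doc, spacing=30, collapse=False)
```

<p>With <code>collapse</code> left on, the docstring ends up on a single line; with it off, the original line breaks survive.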
<div class=example><h3>Example 4.3. Advanced Usage of <code>apihelper.py</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import odbchelper</kbd>
<samp class=p>>>> </samp><kbd>info(odbchelper)</kbd>
buildConnectionString Build a connection string from a dictionary Returns string.
<samp class=p>>>> </samp><kbd>info(odbchelper, 30)</kbd>
buildConnectionString Build a connection string from a dictionary Returns string.
<samp class=p>>>> </samp><kbd>info(odbchelper, 30, 0)</kbd>
<samp>buildConnectionString Build a connection string from a dictionary
Returns string.</samp></pre>
(optional and named arguments stuff was here)
<h3>4.3.1. The <code>type</code> Function</h3>
<p>The <code>type</code> function returns the datatype of any arbitrary object. The possible types are listed in the <code>types</code> module. This is useful for helper functions that can handle several types of data.
<div class=example><h3 id="apihelper.type.intro">Example 4.5. Introducing <code>type</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>type(1)</kbd> <span>&#x2460;</span>
&lt;type 'int'>
<samp class=p>>>> </samp><kbd>li = []</kbd>
<samp class=p>>>> </samp><kbd>type(li)</kbd> <span>&#x2461;</span>
&lt;type 'list'>
<samp class=p>>>> </samp><kbd>import odbchelper</kbd>
<samp class=p>>>> </samp><kbd>type(odbchelper)</kbd> <span>&#x2462;</span>
&lt;type 'module'>
<samp class=p>>>> </samp><kbd>import types</kbd> <span>&#x2463;</span>
<samp class=p>>>> </samp><kbd>type(odbchelper) == types.ModuleType</kbd>
True</pre>
<ol>
<li><code>type</code> takes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions,
classes, modules, even types are acceptable.
<li><code>type</code> can take a variable and return its datatype.
<li><code>type</code> also works on modules.
<li>You can use the constants in the <code>types</code> module to compare types of objects. The <code>info</code> function relies on a related test, <code>callable</code>, as you'll see shortly.
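<p>A helper that handles several types of data might look something like the sketch below. The <code>describe</code> function and its labels are made up for illustration; much current code would use <code>isinstance</code> for the same job.

```python
import types


def describe(value):
    """Return a short label for a value based on its type (hypothetical helper)."""
    kind = type(value)
    if kind == types.ModuleType:
        return "module"
    if kind == types.FunctionType:
        return "function"
    if kind in (list, tuple, dict):
        return "container"
    return "other"


labels = [describe(types), describe(describe), describe([]), describe(1)]
```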
<h3>4.3.2. The <code>str</code> Function</h3>
<p>The <code>str</code> function coerces data into a string. Every datatype can be coerced into a string.
<div class=example><h3 id="apihelper.str.intro">Example 4.6. Introducing <code>str</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>str(1)</kbd> <span>&#x2460;</span>
'1'
<samp class=p>>>> </samp><kbd>horsemen = ['war', 'pestilence', 'famine']</kbd>
<samp class=p>>>> </samp><kbd>horsemen</kbd>
['war', 'pestilence', 'famine']
<samp class=p>>>> </samp><kbd>horsemen.append('Powerbuilder')</kbd>
<samp class=p>>>> </samp><kbd>str(horsemen)</kbd> <span>&#x2461;</span>
"['war', 'pestilence', 'famine', 'Powerbuilder']"
<samp class=p>>>> </samp><kbd>str(odbchelper)</kbd> <span>&#x2462;</span>
"&lt;module 'odbchelper' from 'c:\\docbook\\dip\\py\\odbchelper.py'>"
<samp class=p>>>> </samp><kbd>str(None)</kbd> <span>&#x2463;</span>
'None'</pre>
<ol>
<li>For simple datatypes like integers, you would expect <code>str</code> to work, because almost every language has a function to convert an integer to a string.
<li>However, <code>str</code> works on any object of any type. Here it works on a list which you've constructed in bits and pieces.
<li><code>str</code> also works on modules. Note that the string representation of the module includes the pathname of the module on disk, so
yours will be different.
<li>A subtle but important behavior of <code>str</code> is that it works on <code>None</code>, the Python null value. It returns the string <code>'None'</code>. You'll use this to your advantage in the <code>info</code> function, as you'll see shortly.
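<p>Here is a small sketch of why that matters for docstrings. The two functions are invented for the example; the point is that a missing <code>docstring</code> is <code>None</code>, and <code>str</code> turns it into something string methods can handle.

```python
def documented():
    """I have a docstring."""


def undocumented():
    pass


# __doc__ is None when no docstring was written; str() turns it into 'None',
# so string methods such as split() can be applied without a special case.
docs = [str(func.__doc__) for func in (documented, undocumented)]
collapsed = [" ".join(doc.split()) for doc in docs]
```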
<p>At the heart of the <code>info</code> function is the powerful <code>dir</code> function. <code>dir</code> returns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much
anything.
<div class=example><h3 id="apihelper.dir.intro">Example 4.7. Introducing <code>dir</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>li = []</kbd>
<samp class=p>>>> </samp><kbd>dir(li)</kbd> <span>&#x2460;</span>
<samp>['append', 'count', 'extend', 'index', 'insert',
'pop', 'remove', 'reverse', 'sort']</samp>
<samp class=p>>>> </samp><kbd>d = {}</kbd>
<samp class=p>>>> </samp><kbd>dir(d)</kbd> <span>&#x2461;</span>
['clear', 'copy', 'get', 'has_key', 'items', 'keys', 'setdefault', 'update', 'values']
<samp class=p>>>> </samp><kbd>import odbchelper</kbd>
<samp class=p>>>> </samp><kbd>dir(odbchelper)</kbd> <span>&#x2462;</span>
['__builtins__', '__doc__', '__file__', '__name__', 'buildConnectionString']</pre>
<ol>
<li><var>li</var> is a list, so <code><code>dir</code>(<var>li</var>)</code> returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not
the methods themselves.
<li><var>d</var> is a dictionary, so <code><code>dir</code>(<var>d</var>)</code> returns a list of the names of dictionary methods. At least one of these, <a href="#odbchelper.items" title="Example 3.25. The keys, values, and items Functions"><code>keys</code></a>, should look familiar.
<li>This is where it really gets interesting. <code>odbchelper</code> is a module, so <code><code>dir</code>(<code>odbchelper</code>)</code> returns a list of all kinds of stuff defined in the module, including built-in attributes, like <a href="#odbchelper.ifnametrick"><code>__name__</code></a>, <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's docstring"><code>__doc__</code></a>, and whatever other attributes and methods you define. In this case, <code>odbchelper</code> has only one user-defined method, the <code>buildConnectionString</code> function described in <a href="#odbchelper">Chapter 2</a>.
<p>Finally, the <code>callable</code> function takes any object and returns <code>True</code> if the object can be called, or <code>False</code> otherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.)
<div class=example><h3 id="apihelper.builtin.callable">Example 4.8. Introducing <code>callable</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import string</kbd>
<samp class=p>>>> </samp><kbd>string.punctuation</kbd> <span>&#x2460;</span>
'!"#$%&amp;\'()*+,-./:;&lt;=>?@[\\]^_`{|}~'
<samp class=p>>>> </samp><kbd>string.join</kbd><span>&#x2461;</span><!-- " -->
&lt;function join at 00C55A7C>
<samp class=p>>>> </samp><kbd>callable(string.punctuation)</kbd> <span>&#x2462;</span>
False
<samp class=p>>>> </samp><kbd>callable(string.join)</kbd> <span>&#x2463;</span>
True
<samp class=p>>>> </samp><kbd>print string.join.__doc__</kbd> <span>&#x2464;</span>
<samp>join(list [,sep]) -> string
Return a string composed of the words in list, with
intervening occurrences of sep. The default separator is a
single space.
(joinfields and join are synonymous)</samp></pre>
<ol>
<li>The functions in the <code>string</code> module are deprecated (although many people still use the <code>join</code> function), but the module contains a lot of useful constants like this <var>string.punctuation</var>, which contains all the standard punctuation characters.
<li><a href="#odbchelper.join" title="3.7. Joining Lists and Splitting Strings"><code>string.join</code></a> is a function that joins a list of strings.
<li><var>string.punctuation</var> is not callable; it is a string. (A string does have callable methods, but the string itself is not callable.)
<li><code>string.join</code> is callable; it's a function that takes two arguments.
<li>Any callable object may have a <code>docstring</code>. By using the <code>callable</code> function on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes)
and which you want to ignore (constants and so on) without knowing anything about the object ahead of time.
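<p>As an illustration of that idea, the sketch below sorts the attributes of a module into callables and everything else. The <code>math</code> module is used here only because it is available everywhere; any object would do.

```python
import math

# dir() gives attribute names as strings; getattr() turns each name into the
# attribute itself, and callable() says whether it can be called.
callables = [name for name in dir(math) if callable(getattr(math, name))]
constants = [name for name in dir(math) if not callable(getattr(math, name))]
```

<p>Functions such as <code>sqrt</code> land in the first list, while constants such as <code>pi</code> land in the second.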
<h3>4.3.3. Built-In Functions</h3>
<p><code>type</code>, <code>str</code>, <code>dir</code>, and all the rest of Python's built-in functions are grouped into a special module called <code>__builtin__</code>. (That's two underscores before and after.) If it helps, you can think of Python automatically executing <code>from __builtin__ import *</code> on startup, which imports all the &#8220;built-in&#8221; functions into the namespace so you can use them directly.
<p>The advantage of thinking like this is that you can access all the built-in functions and attributes as a group by getting
information about the <code>__builtin__</code> module. And guess what, you already have a function for that: the <code>info</code> function from <code>apihelper.py</code>. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the
built-in error classes, like <a href="#odbchelper.tuplemethods" title="Example 3.16. Tuples Have No Methods"><code>AttributeError</code></a>, should already look familiar.)
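<p>If you want to poke at the module without <code>info</code>, a sketch like the following works. One assumption to flag: Python 3 renamed the module to <code>builtins</code>, so the import is written to cope with either name.

```python
try:
    import __builtin__ as builtins  # Python 2 name, as used in this chapter
except ImportError:
    import builtins  # Python 3 renamed the module

# Every built-in function is simply an attribute of this module.
names = dir(builtins)
error_classes = [name for name in names if name.endswith("Error")]
same_function = getattr(builtins, "len") is len
```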
<div class=example><h3 id="apihelper.builtin.list">Example 4.9. Built-in Attributes and Functions</h3><pre class=screen><samp class=p>>>> </samp><kbd>from apihelper import info</kbd>
<samp class=p>>>> </samp><kbd>import __builtin__</kbd>
<samp class=p>>>> </samp><kbd>info(__builtin__, 20)</kbd>
<samp>ArithmeticError Base class for arithmetic errors.
AssertionError Assertion failed.
AttributeError Attribute not found.
EOFError Read beyond end of file.
EnvironmentError Base class for I/O related errors.
Exception Common base class for all exceptions.
FloatingPointError Floating point operation failed.
IOError I/O operation failed.
[...snip...]</samp></pre><table id="tip.manuals" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Python comes with excellent reference manuals, which you should peruse thoroughly to learn all the modules Python has to offer. But unlike most languages, where you would find yourself referring back to the manuals or man pages to remind
yourself how to use these modules, Python is largely self-documenting.
<div class=itemizedlist>
<h3>Further Reading on Built-In Functions</h3>
<ul>
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents <a href="http://www.python.org/doc/current/lib/built-in-funcs.html">all the built-in functions</a> and <a href="http://www.python.org/doc/current/lib/module-exceptions.html">all the built-in exceptions</a>.
</ul>
<h2 id="apihelper.getattr">4.4. Getting Object References With <code>getattr</code></h2>
<p>You already know that <a href="#odbchelper.objects" title="2.4. Everything Is an Object">Python functions are objects</a>. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the
<code>getattr</code> function.
<div class=example><h3 id="apihelper.getattr.intro">Example 4.10. Introducing <code>getattr</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>li = ["Larry", "Curly"]</kbd>
<samp class=p>>>> </samp><kbd>li.pop</kbd> <span>&#x2460;</span>
&lt;built-in method pop of list object at 010DF884>
<samp class=p>>>> </samp><kbd>getattr(li, "pop")</kbd> <span>&#x2461;</span>
&lt;built-in method pop of list object at 010DF884>
<samp class=p>>>> </samp><kbd>getattr(li, "append")("Moe")</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>li</kbd>
['Larry', 'Curly', 'Moe']
<samp class=p>>>> </samp><kbd>getattr({}, "clear")</kbd> <span>&#x2463;</span>
&lt;built-in method clear of dictionary object at 00F113D4>
<samp class=p>>>> </samp><kbd>getattr((), "pop")</kbd> <span>&#x2464;</span>
<samp class=traceback>Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'pop'</samp></pre>
<ol>
<li>This gets a reference to the <code>pop</code> method of the list. Note that this is not calling the <code>pop</code> method; that would be <code>li.pop()</code>. This is the method itself.
<li>This also returns a reference to the <code>pop</code> method, but this time, the method name is specified as a string argument to the <code>getattr</code> function. <code>getattr</code> is an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list,
and the attribute is the <code>pop</code> method.
<li>In case it hasn't sunk in just how incredibly useful this is, try this: the return value of <code>getattr</code> <em>is</em> the method, which you can then call just as if you had said <code>li.append("Moe")</code> directly. But you didn't call the function directly; you specified the function name as a string instead.
<li><code>getattr</code> also works on dictionaries.
<li><code>getattr</code> also works on tuples, but tuples have no <code>pop</code> method (see <a href="#odbchelper.tuplemethods" title="Example 3.16. Tuples Have No Methods">tuples have no methods</a>), so <code>getattr</code> raises an <code>AttributeError</code>, just as <code>().pop</code> would.
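<p>The same steps can be sketched as a short script. The use of <code>hasattr</code> is an addition here, not something the example above relies on; it simply reports whether the lookup would succeed.

```python
li = ["Larry", "Curly"]

# Look the method up by name, then call the bound method it returns.
append = getattr(li, "append")
append("Moe")

# hasattr reports whether the lookup would succeed.
can_pop_list = hasattr(li, "pop")
can_pop_tuple = hasattr((), "pop")

# A failed lookup raises AttributeError.
try:
    getattr((), "pop")
    message = None
except AttributeError as exc:
    message = str(exc)
```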
<h3>4.4.1. <code>getattr</code> with Modules</h3>
<p><code>getattr</code> isn't just for built-in datatypes. It also works on modules.
<div class=example><h3 id="apihelper.getattr.example">Example 4.11. The <code>getattr</code> Function in <code>apihelper.py</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import odbchelper</kbd>
<samp class=p>>>> </samp><kbd>odbchelper.buildConnectionString</kbd> <span>&#x2460;</span>
&lt;function buildConnectionString at 00D18DD4>
<samp class=p>>>> </samp><kbd>getattr(odbchelper, "buildConnectionString")</kbd> <span>&#x2461;</span>
&lt;function buildConnectionString at 00D18DD4>
<samp class=p>>>> </samp><kbd>object = odbchelper</kbd>
<samp class=p>>>> </samp><kbd>method = "buildConnectionString"</kbd>
<samp class=p>>>> </samp><kbd>getattr(object, method)</kbd> <span>&#x2462;</span>
&lt;function buildConnectionString at 00D18DD4>
<samp class=p>>>> </samp><kbd>type(getattr(object, method))</kbd> <span>&#x2463;</span>
&lt;type 'function'>
<samp class=p>>>> </samp><kbd>import types</kbd>
<samp class=p>>>> </samp><kbd>type(getattr(object, method)) == types.FunctionType</kbd>
True
<samp class=p>>>> </samp><kbd>callable(getattr(object, method))</kbd> <span>&#x2464;</span>
True</pre>
<ol>
<li>This returns a reference to the <code>buildConnectionString</code> function in the <code>odbchelper</code> module, which you studied in <a href="#odbchelper" title="Chapter 2. Your First Python Program">Chapter 2, <i>Your First Python Program</i></a>. (The hex address you see is specific to my machine; your output will be different.)
<li>Using <code>getattr</code>, you can get the same reference to the same function. In general, <code><code>getattr</code>(<var>object</var>, "<var>attribute</var>")</code> is equivalent to <code><var>object</var>.<var>attribute</var></code>. If <var><code>object</code></var> is a module, then <var><code>attribute</code></var> can be anything defined in the module: a function, class, or global variable.
<li>And this is what you actually use in the <code>info</code> function. <var>object</var> is passed into the function as an argument; <var>method</var> is a string which is the name of a method or function.
<li>In this case, <var>method</var> is the name of a function, which you can prove by getting its <a href="#apihelper.type.intro" title="Example 4.5. Introducing type"><code>type</code></a>.
<li>Since <var>method</var> is a function, it is <a href="#apihelper.builtin.callable" title="Example 4.8. Introducing callable">callable</a>.
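<p>Because <code>odbchelper</code> may not be on your path, here is the same idea sketched with the standard <code>math</code> module instead. The choice of <code>sqrt</code> is arbitrary.

```python
import math

# The attribute name is ordinary data, so it can be chosen at run time.
name = "sqrt"
func = getattr(math, name)

# getattr hands back the very same function object as the dotted form.
same_object = func is math.sqrt
result = func(16)
```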
<h3>4.4.2. <code>getattr</code> As a Dispatcher</h3>
<p>A common usage pattern of <code>getattr</code> is as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could
define separate functions for each output format and use a single dispatch function to call the right one.
<p>For example, let's imagine a program that prints site statistics in <abbr>HTML</abbr>, <abbr>XML</abbr>, and plain text formats. The choice of output format could be specified on the command line, or stored in a configuration
file. A <code>statsout</code> module defines three functions, <code>output_html</code>, <code>output_xml</code>, and <code>output_text</code>. Then the main program defines a single output function, like this:
<div class=example><h3 id="apihelper.getattr.dispatch">Example 4.12. Creating a Dispatcher with <code>getattr</code></h3><pre><code>
import statsout

def output(data, format="text"): <span>&#x2460;</span>
    output_function = getattr(statsout, "output_%s" % format) <span>&#x2461;</span>
    return output_function(data) <span>&#x2462;</span>
</pre>
<ol>
<li>The <code>output</code> function takes one required argument, <var>data</var>, and one optional argument, <var>format</var>. If <var>format</var> is not specified, it defaults to <code>text</code>, and you will end up calling the plain text output function.
<li>You concatenate the <var>format</var> argument with "output_" to produce a function name, and then go get that function from the <code>statsout</code> module. This allows you to easily extend the program later to support other output formats, without changing this dispatch
function. Just add another function to <code>statsout</code> named, for instance, <code>output_pdf</code>, and pass "pdf" as the <var>format</var> into the <code>output</code> function.
<li>Now you can simply call the output function in the same way as any other function. The <var>output_function</var> variable is a reference to the appropriate function from the <code>statsout</code> module.
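<p>Since <code>statsout</code> is an imaginary module, the sketch below fakes it with a class that acts as a namespace, so the dispatcher can actually be run. The two output functions and the <code>csv</code> format are assumptions made up for this illustration.

```python
class statsout(object):
    """Stand-in for the statsout module described above (hypothetical contents)."""

    @staticmethod
    def output_text(data):
        return "visits: %d" % data["visits"]

    @staticmethod
    def output_csv(data):
        return "visits,%d" % data["visits"]


def output(data, format="text"):
    # Build the function name from the format, then fetch it by name.
    output_function = getattr(statsout, "output_%s" % format)
    return output_function(data)


text_report = output({"visits": 42})
csv_report = output({"visits": 42}, "csv")
```

<p>Supporting another format only takes another <code>output_</code> function in the namespace; <code>output</code> itself stays the same.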
<p>Did you see the bug in the previous example? This is a very loose coupling of strings and functions, and there is no error
checking. What happens if the user passes in a format that doesn't have a corresponding function defined in <code>statsout</code>? Well, <code>getattr</code> will raise an <code>AttributeError</code>, because <code>statsout</code> has no attribute by that name, and the <code>output</code> function will crash before it ever reaches the line that calls <var>output_function</var>. That's
bad.
<p>Luckily, <code>getattr</code> takes an optional third argument, a default value.
<div class=example><h3 id="apihelper.getattr.default">Example 4.13. <code>getattr</code> Default Values</h3><pre><code>
import statsout

def output(data, format="text"):
    output_function = getattr(statsout, "output_%s" % format, statsout.output_text)
    return output_function(data) <span>&#x2460;</span>
</pre>
<ol>
<li>This function call is guaranteed to work, because you added a third argument to the call to <code>getattr</code>. The third argument is a default value that is returned if the attribute or method specified by the second argument wasn't
found.
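<p>Here is a runnable sketch of the difference the third argument makes. As before, the <code>statsout</code> class is only a stand-in for the imaginary module, and <code>pdf</code> is an arbitrary unsupported format.

```python
class statsout(object):
    """Stand-in for the statsout module (hypothetical contents)."""

    @staticmethod
    def output_text(data):
        return "visits: %d" % data["visits"]


def output(data, format="text"):
    # Unknown formats fall back to plain text instead of raising AttributeError.
    output_function = getattr(statsout, "output_%s" % format, statsout.output_text)
    return output_function(data)


fallback_report = output({"visits": 42}, "pdf")

# Without the third argument, the same lookup fails.
try:
    getattr(statsout, "output_pdf")
    lookup_failed = False
except AttributeError:
    lookup_failed = True
```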
<p>As you can see, <code>getattr</code> is quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters.
<h2 id="apihelper.filter">4.5. Filtering Lists</h2>
<p>As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (<a href="#odbchelper.map" title="3.6. Mapping Lists">Section 3.6, &#8220;Mapping Lists&#8221;</a>). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely.
<p>Here is the list filtering syntax:<pre><code>
[<var><code>mapping-expression</code></var> for <var><code>element</code></var> in <var><code>source-list</code></var> if <var><code>filter-expression</code></var>]</pre><p>This is an extension of the <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehensions</a> that you know and love. The first two thirds are the same; the last part, starting with the <code>if</code>, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be <a href="#tip.boolean">almost anything</a>). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored,
so they are never put through the mapping expression and are not included in the output list.
<div class=example><h3>Example 4.14. Introducing List Filtering</h3><pre class=screen><samp class=p>>>> </samp><kbd>li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]</kbd>
<samp class=p>>>> </samp><kbd>[elem for elem in li if len(elem) > 1]</kbd> <span>&#x2460;</span>
['mpilgrim', 'foo']
<samp class=p>>>> </samp><kbd>[elem for elem in li if elem != "b"]</kbd> <span>&#x2461;</span>
['a', 'mpilgrim', 'foo', 'c', 'd', 'd']
<samp class=p>>>> </samp><kbd>[elem for elem in li if li.count(elem) == 1]</kbd> <span>&#x2462;</span>
['a', 'mpilgrim', 'foo', 'c']</pre>
<ol>
<li>The mapping expression here is simple (it just returns the value of each element), so concentrate on the filter expression.
As Python loops through the list, it runs each element through the filter expression. If the filter expression is true, the element
is mapped and the result of the mapping expression is included in the returned list. Here, you are filtering out all the
one-character strings, so you're left with a list of all the longer strings.
<li>Here, you are filtering out a specific value, <code>b</code>. Note that this filters all occurrences of <code>b</code>, since each time it comes up, the filter expression will be false.
<li><code>count</code> is a list method that returns the number of times a value occurs in a list. You might think that this filter would eliminate
duplicates from a list, returning a list containing only one copy of each value in the original list. But it doesn't, because
values that appear twice in the original list (in this case, <code>b</code> and <code>d</code>) are excluded completely. There are ways of eliminating duplicates from a list, but filtering is not the solution.
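<p>For comparison, here is one possible way to eliminate duplicates while keeping the original order. This is a sketch of a common approach, not something taken from <code>apihelper.py</code>.

```python
li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]

# Filtering on count() drops every value that appears more than once.
only_unique = [elem for elem in li if li.count(elem) == 1]

# One way to keep the first copy of each value, in original order.
seen = set()
deduplicated = []
for elem in li:
    if elem not in seen:
        seen.add(elem)
        deduplicated.append(elem)
```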
<p id="apihelper.filter.care">Let's get back to this line from <code>apihelper.py</code>:<pre><code>
methodList = [method for method in dir(object) if callable(getattr(object, method))]</pre><p>This looks complicated, and it is complicated, but the basic structure is the same. The whole list comprehension returns a
list, which is assigned to the <var>methodList</var> variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression:
it simply returns the value of each element. <code><code>dir</code>(<var>object</var>)</code> returns a list of <var>object</var>'s attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the <code>if</code>.
<p>The filter expression looks scary, but it's not. You already know about <a href="#apihelper.builtin.callable" title="Example 4.8. Introducing callable"><code>callable</code></a>, <a href="#apihelper.getattr.intro" title="Example 4.10. Introducing getattr"><code>getattr</code></a>, and <a href="#odbchelper.tuplemethods" title="Example 3.16. Tuples Have No Methods"><code>in</code></a>. As you saw in the <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr">previous section</a>, the expression <code>getattr(object, method)</code> returns a function object if <var>object</var> is a module and <var>method</var> is the name of a function in that module.
<p>So this expression takes an object (named <var>object</var>). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters
that list to weed out all the stuff that you don't care about. You do the weeding out by taking the name of each attribute/method/function
and getting a reference to the real thing, via the <code>getattr</code> function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like
the <code>pop</code> method of a list) and user-defined (like the <code>buildConnectionString</code> function of the <code>odbchelper</code> module). You don't care about other attributes, like the <code>__name__</code> attribute that's built in to every module.
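<p>Broken into separate steps, the same filter might look like the sketch below. The <code>Greeter</code> class is hypothetical and exists only to give the filter one data attribute and one method to tell apart.

```python
class Greeter(object):
    """Hypothetical class used only for this illustration."""
    greeting = "hello"           # a plain data attribute

    def greet(self):             # a method, so it is callable
        return self.greeting


obj = Greeter()

# Step 1: every attribute name, as strings.
names = dir(obj)

# Step 2: turn each name into the real attribute and keep the callable ones.
methodList = [method for method in names if callable(getattr(obj, method))]

# Leave out the double-underscore names to see just the user-defined part.
public_methods = [m for m in methodList if not m.startswith("__")]
```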
<div class=itemizedlist>
<h3>Further Reading on Filtering Lists</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> discusses another way to filter lists <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007130000000000000000">using the built-in <code>filter</code> function</a>.
</ul>
<h2 id="apihelper.andor">4.6. The Peculiar Nature of <code>and</code> and <code>or</code></h2>
<p>In Python, <code>and</code> and <code>or</code> perform boolean logic as you would expect, but they do not return boolean values; instead, they return one of the actual
values they are comparing.
<div class=example><h3 id="apihelper.andor.intro.example">Example 4.15. Introducing <code>and</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>'a' and 'b'</kbd> <span>&#x2460;</span>
'b'
<samp class=p>>>> </samp><kbd>'' and 'b'</kbd> <span>&#x2461;</span>
''
<samp class=p>>>> </samp><kbd>'a' and 'b' and 'c'</kbd> <span>&#x2462;</span>
'c'</pre>
<ol>
<li>When using <code>and</code>, values are evaluated in a boolean context from left to right. <code>0</code>, <code>''</code>, <code>[]</code>, <code>()</code>, <code>{}</code>, and <code>None</code> are false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are
true in a boolean context, but you can define special methods in your class to make an instance evaluate to false. You'll
learn all about classes and special methods in <a href="#fileinfo">Chapter 5</a>. If all values are true in a boolean context, <code>and</code> returns the last value. In this case, <code>and</code> evaluates <code>'a'</code>, which is true, then <code>'b'</code>, which is true, and returns <code>'b'</code>.
<li>If any value is false in a boolean context, <code>and</code> returns the first false value. In this case, <code>''</code> is the first false value.
<li>All values are true, so <code>and</code> returns the last value, <code>'c'</code>.
<div class=example><h3>Example 4.16. Introducing <code>or</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>'a' or 'b'</kbd> <span>&#x2460;</span>
'a'
<samp class=p>>>> </samp><kbd>'' or 'b'</kbd> <span>&#x2461;</span>
'b'
<samp class=p>>>> </samp><kbd>'' or [] or {}</kbd> <span>&#x2462;</span>
{}
<samp class=p>>>> </samp><kbd>def sidefx():</kbd>
<samp class=p>... </samp><kbd>    print "in sidefx()"</kbd>
<samp class=p>... </samp><kbd>    return 1</kbd>
<samp class=p>>>> </samp><kbd>'a' or sidefx()</kbd> <span>&#x2463;</span>
'a'</pre>
<ol>
<li>When using <code>or</code>, values are evaluated in a boolean context from left to right, just like <code>and</code>. If any value is true, <code>or</code> returns that value immediately. In this case, <code>'a'</code> is the first true value.
<li><code>or</code> evaluates <code>''</code>, which is false, then <code>'b'</code>, which is true, and returns <code>'b'</code>.
<li>If all values are false, <code>or</code> returns the last value. <code>or</code> evaluates <code>''</code>, which is false, then <code>[]</code>, which is false, then <code>{}</code>, which is false, and returns <code>{}</code>.
<li>Note that <code>or</code> evaluates values only until it finds one that is true in a boolean context, and then it ignores the rest. This distinction
is important if some values can have side effects. Here, the function <code>sidefx</code> is never called, because <code>or</code> evaluates <code>'a'</code>, which is true, and returns <code>'a'</code> immediately.
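<p>If you want to see the short-circuiting for yourself, here is a minimal sketch. The helper <code>operand</code> and the list <code>calls</code> are invented for this illustration; they simply record which operands Python actually evaluated.

```python
calls = []  # labels of the operands that were actually evaluated


def operand(label, value):
    """Hypothetical helper: record that it ran, then return value."""
    calls.append(label)
    return value


# 'and' stops at the first false value and returns it.
first = operand("x", "") and operand("y", "b")

# 'or' stops at the first true value and returns it.
second = operand("p", "a") or operand("q", "b")
```

<p>Afterwards, <code>calls</code> contains only <code>"x"</code> and <code>"p"</code>: the right-hand operands were never evaluated, just as <code>sidefx</code> was never called.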
<p>If you're a <abbr>C</abbr> hacker, you are certainly familiar with the <code><var>bool</var> ? <var>a</var> : <var>b</var></code> expression, which evaluates to <var>a</var> if <var><code>bool</code></var> is true, and <var>b</var> otherwise. Because of the way <code>and</code> and <code>or</code> work in Python, you can accomplish the same thing.
<h3>4.6.1. Using the <code>and-or</code> Trick</h3>
<div class=example><h3 id="apihelper.andortrick.intro">Example 4.17. Introducing the <code>and-or</code> Trick</h3><pre class=screen><samp class=p>>>> </samp><kbd>a = "first"</kbd>
<samp class=p>>>> </samp><kbd>b = "second"</kbd>
<samp class=p>>>> </samp><kbd>1 and a or b</kbd> <span>&#x2460;</span>
'first'
<samp class=p>>>> </samp><kbd>0 and a or b</kbd> <span>&#x2461;</span>
'second'
</pre>
<ol>
<li>This syntax looks similar to the <code><var>bool</var> ? <var>a</var> : <var>b</var></code> expression in <abbr>C</abbr>. The entire expression is evaluated from left to right, so the <code>and</code> is evaluated first. <code>1 and 'first'</code> evaluates to <code>'first'</code>, then <code>'first' or 'second'</code> evaluates to <code>'first'</code>.
<li><code>0 and 'first'</code> evaluates to <code>0</code>, and then <code>0 or 'second'</code> evaluates to <code>'second'</code>.
<p>However, since this Python expression is simply boolean logic, and not a special construct of the language, there is one extremely important difference
between this <code>and-or</code> trick in Python and the <code><var>bool</var> ? <var>a</var> : <var>b</var></code> syntax in <abbr>C</abbr>. If the value of <var>a</var> is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?)
<div class=example><h3>Example 4.18. When the <code>and-or</code> Trick Fails</h3><pre class=screen><samp class=p>>>> </samp><kbd>a = ""</kbd>
<samp class=p>>>> </samp><kbd>b = "second"</kbd>
<samp class=p>>>> </samp><kbd>1 and a or b</kbd> <span>&#x2460;</span>
'second'</pre>
<ol>
<li>Since <var>a</var> is an empty string, which Python considers false in a boolean context, <code>1 and ''</code> evaluates to <code>''</code>, and then <code>'' or 'second'</code> evaluates to <code>'second'</code>. Oops! That's not what you wanted.
<p>The <code>and-or</code> trick, <code><var>bool</var> and <var>a</var> or <var>b</var></code>, will not work like the <abbr>C</abbr> expression <code><var>bool</var> ? <var>a</var> : <var>b</var></code> when <var>a</var> is false in a boolean context.
<p>The real trick behind the <code>and-or</code> trick, then, is to make sure that the value of <var>a</var> is never false. One common way of doing this is to turn <var>a</var> into <code>[<var>a</var>]</code> and <var>b</var> into <code>[<var>b</var>]</code>, and then take the first element of the returned list, which will be either <var>a</var> or <var>b</var>.
<div class=example><h3>Example 4.19. Using the <code>and-or</code> Trick Safely</h3><pre class=screen><samp class=p>>>> </samp><kbd>a = ""</kbd>
<samp class=p>>>> </samp><kbd>b = "second"</kbd>
<samp class=p>>>> </samp><kbd>(1 and [a] or [b])[0]</kbd> <span>&#x2460;</span>
''</pre>
<ol>
<li>Since <code>[<var>a</var>]</code> is a non-empty list, it is never false. Even if <var>a</var> is <code>0</code> or <code>''</code> or some other false value, the list <code>[<var>a</var>]</code> is true because it has one element.
<p>By now, this trick may seem like more trouble than it's worth. You could, after all, accomplish the same thing with an <code>if</code> statement, so why go through all this fuss? Well, in many cases, you are choosing between two constant values, so you can
use the simpler syntax and not worry, because you know that the <var>a</var> value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so.
For example, there are some cases in Python where <code>if</code> statements are not allowed, such as in <code>lambda</code> functions.
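<p>To compare the variants side by side, here is a small sketch. The three function names are invented for this illustration. The last one assumes Python 2.5 or later, which added a conditional expression (<code><var>a</var> if <var>bool</var> else <var>b</var></code>) that behaves like the <abbr>C</abbr> form even when <var>a</var> is false and, being an expression, is also allowed inside a <code>lambda</code>.

```python
def choose_and_or(flag, a, b):
    """The simple and-or trick: gives the wrong answer when a is false."""
    return flag and a or b


def choose_safe(flag, a, b):
    """The safe form: a one-element list is never false."""
    return (flag and [a] or [b])[0]


def choose_conditional(flag, a, b):
    """Conditional expression; assumes Python 2.5 or later."""
    return a if flag else b


wrong = choose_and_or(1, "", "second")       # falls through to b
right = choose_safe(1, "", "second")         # returns the empty string
also_right = choose_conditional(1, "", "second")
```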
<div class=itemizedlist>
<h3>Further Reading on the <code>and-or</code> Trick</h3>
<ul>
<li><a href="http://www.activestate.com/ASPN/Python/Cookbook/" title="growing archive of annotated code samples">Python Cookbook</a> discusses <a href="http://www.activestate.com/ASPN/Python/Cookbook/Recipe/52310">alternatives to the <code>and-or</code> trick</a>.
</ul>
<h2 id="apihelper.lambda">4.7. Using <code>lambda</code> Functions</h2>
<p>Python supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called <code>lambda</code> functions can be used anywhere a function is required.
<div class=example><h3>Example 4.20. Introducing <code>lambda</code> Functions</h3><pre class=screen><samp class=p>>>> </samp><kbd>def f(x):</kbd>
<samp class=p>... </samp><kbd>    return x*2</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd>f(3)</kbd>
6
<samp class=p>>>> </samp><kbd>g = lambda x: x*2</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>g(3)</kbd>
6
<samp class=p>>>> </samp><kbd>(lambda x: x*2)(3)</kbd> <span>&#x2461;</span>
6</pre>
<ol>
<li>This is a <code>lambda</code> function that accomplishes the same thing as the normal function above it. Note the abbreviated syntax here: there are no
parentheses around the argument list, and the <code>return</code> keyword is missing (it is implied, since the entire function can only be one expression). Also, the function has no name,
but it can be called through the variable it is assigned to.
<li>You can use a <code>lambda</code> function without even assigning it to a variable. This may not be the most useful thing in the world, but it just goes to
show that a lambda is just an in-line function.
<p>To generalize, a <code>lambda</code> function is a function that takes any number of arguments (including <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional arguments</a>) and returns the value of a single expression. <code>lambda</code> functions cannot contain statements, and they cannot contain more than one expression. Don't try to squeeze too much into
a <code>lambda</code> function; if you need something more complex, define a normal function instead and make it as long as you want.
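<p>As a quick illustration of the point about optional arguments, here is a sketch of a <code>lambda</code> with a default value next to the equivalent normal function. The names <code>scale</code> and <code>scale_def</code> are made up for this example.

```python
# A lambda can declare optional arguments, just like a def.
scale = lambda x, factor=2: x * factor


def scale_def(x, factor=2):
    """The equivalent normal function."""
    return x * factor


doubled = scale(3)            # factor falls back to its default, 2
tripled = scale(3, factor=3)  # a named argument overrides the default
```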
<table id="tip.lambda" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>lambda</code> functions are a matter of style. Using them is never required; anywhere you could use them, you could define a separate
normal function and use that instead. I use them in places where I want to encapsulate specific, non-reusable code without
littering my code with a lot of little one-line functions.
<h3>4.7.1. Real-World <code>lambda</code> Functions</h3>
<p>Here are the <code>lambda</code> functions in <code>apihelper.py</code>:<pre><code>
processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)</pre><p>Notice that this uses the simple form of the <a href="#apihelper.andor" title="4.6. The Peculiar Nature of and and or"><code>and-or</code></a> trick, which is okay, because a <code>lambda</code> function is always true <a href="#tip.boolean">in a boolean context</a>. (That doesn't mean that a <code>lambda</code> function can't return a false value. The function is always true; its return value could be anything.)
<p>Also notice that you're using the <code>split</code> function with no arguments. You've already seen it used with <a href="#odbchelper.split.example" title="Example 3.28. Splitting a String">one or two arguments</a>, but without any arguments it splits on whitespace.
<div class=example><h3>Example 4.21. <code>split</code> With No Arguments</h3><pre class=screen><samp class=p>>>> </samp><kbd>s = "this   is\na\ttest"</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>print s</kbd>
<samp>this   is
a	test</samp>
<samp class=p>>>> </samp><kbd>print s.split()</kbd> <span>&#x2461;</span>
['this', 'is', 'a', 'test']
<samp class=p>>>> </samp><kbd>print " ".join(s.split())</kbd> <span>&#x2462;</span>
this is a test</pre>
<ol>
<li>This is a multiline string, defined by escape characters instead of <a href="#odbchelper.triplequotes" title="Example 2.2. Defining the buildConnectionString Function's docstring">triple quotes</a>. <code>\n</code> is a newline, and <code>\t</code> is a tab character.
<li><code>split</code> without any arguments splits on whitespace. So three spaces, a newline, and a tab character are all the same.
<li>You can normalize whitespace by splitting a string with <code>split</code> and then rejoining it with <code>join</code>, using a single space as a delimiter. This is what the <code>info</code> function does to collapse multi-line <code>docstring</code>s into a single line.
<p>So what is the <code>info</code> function actually doing with these <code>lambda</code> functions, <code>split</code>s, and <code>and-or</code> tricks?
<pre id="apihelper.funcassign" class=programlisting>
processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)</pre><p><var>processFunc</var> is now a function, but which function it is depends on the value of the <var>collapse</var> variable. If <var>collapse</var> is true, <code><var>processFunc</var>(<var>string</var>)</code> will collapse whitespace; otherwise, <code><var>processFunc</var>(<var>string</var>)</code> will return its argument unchanged.
<p>To do this in a less robust language, like Visual Basic, you would probably create a function that took a string and a <i class=parameter><code>collapse</code></i> argument and used an <code>if</code> statement to decide whether to collapse the whitespace or not, then returned the appropriate value. This would be inefficient,
because the function would need to handle every possible case. Every time you called it, it would need to decide whether
to collapse whitespace before it could give you what you wanted. In Python, you can take that decision logic out of the function and define a <code>lambda</code> function that is custom-tailored to give you exactly (and only) what you want. This is more efficient, more elegant, and
less prone to those nasty oh-I-thought-those-arguments-were-reversed kinds of errors.
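<p>Here is the same idea as a self-contained sketch. The wrapper <code>make_process_func</code> is invented for this illustration (in <code>apihelper.py</code> the assignment happens inline inside <code>info</code>); the expression it returns is the one shown above.

```python
def make_process_func(collapse):
    """Decide once, up front, which string-processing function to use.

    make_process_func is a hypothetical wrapper for this illustration.
    """
    # Both lambdas are true in a boolean context, so the simple
    # and-or form is safe here.
    return collapse and (lambda s: " ".join(s.split())) or (lambda s: s)


text = "this   is\na\ttest"
collapsed = make_process_func(1)(text)  # whitespace runs become one space
unchanged = make_process_func(0)(text)  # argument comes back untouched
```

<p>The decision about <var>collapse</var> is made exactly once, when the function is chosen; every later call just runs the chosen function.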
<div class=itemizedlist>
<h3>Further Reading on <code>lambda</code> Functions</h3>
<ul>
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> discusses using <code>lambda</code> to <a href="http://www.faqts.com/knowledge-base/view.phtml/aid/6081/fid/241">call functions indirectly</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node6.html#SECTION006740000000000000000">access outside variables from inside a <code>lambda</code> function</a>. (<a href="http://python.sourceforge.net/peps/pep-0227.html"><abbr>PEP</abbr> 227</a> explains how this will change in future versions of Python.)
<li><a href="http://www.python.org/doc/FAQ.html"><i class=citetitle>The Whole Python <abbr>FAQ</abbr></i></a> has examples of <a href="http://www.python.org/cgi-bin/faqw.py?query=4.15&amp;querytype=simple&amp;casefold=yes&amp;req=search">obfuscated one-liners using <code>lambda</code></a>.
</ul>
<h2 id="apihelper.alltogether">4.8. Putting It All Together</h2>
<p>The last line of code, the only one you haven't deconstructed yet, is the one that does all the work. But by now the work
is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time
to knock them down.
<p>This is the meat of <code>apihelper.py</code>:<pre><code>
print "\n".join(["%s %s" %
                 (method.ljust(spacing),
                  processFunc(str(getattr(object, method).__doc__)))
                 for method in methodList])</pre><p>Note that this is one command, split over multiple lines, but it doesn't use the line continuation character (<code>\</code>). Remember when I said that <a href="#tip.implicitmultiline">some expressions can be split into multiple lines</a> without using a backslash? A list comprehension is one of those expressions, since the entire expression is contained in
square brackets.
<p>Now, let's take it from the end and work backwards. The <pre><code>
for method in methodList</pre><p>shows that this is a <a href="#odbchelper.map" title="3.6. Mapping Lists">list comprehension</a>. As you know, <var>methodList</var> is a list of <a href="#apihelper.filter.care">all the methods you care about</a> in <var>object</var>. So you're looping through that list with <var>method</var>.
<div class=example><h3>Example 4.22. Getting a <code>docstring</code> Dynamically</h3><pre class=screen><samp class=p>>>> </samp><kbd>import odbchelper</kbd>
<samp class=p>>>> </samp><kbd>object = odbchelper</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>method = 'buildConnectionString'</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>getattr(object, method)</kbd> <span>&#x2462;</span>
&lt;function buildConnectionString at 010D6D74>
<samp class=p>>>> </samp><kbd>print getattr(object, method).__doc__</kbd> <span>&#x2463;</span>
<samp>Build a connection string from a dictionary of parameters.
Returns string.</samp></pre>
<ol>
<li>In the <code>info</code> function, <var>object</var> is the object you're getting help on, passed in as an argument.
<li>As you're looping through <var>methodList</var>, <var>method</var> is the name of the current method.
<li>Using the <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code>getattr</code></a> function, you're getting a reference to the <var><code>method</code></var> function in the <var><code>object</code></var> module.
<li>Now, printing the actual <code>docstring</code> of the method is easy.
<p>The next piece of the puzzle is the use of <code>str</code> around the <code>docstring</code>. As you may recall, <code>str</code> is a built-in function that <a href="#apihelper.builtin" title="4.3. Using type, str, dir, and Other Built-In Functions">coerces data into a string</a>. But a <code>docstring</code> is always a string, so why bother with the <code>str</code> function? The answer is that not every function has a <code>docstring</code>, and if it doesn't, its <code>__doc__</code> attribute is <code>None</code>.
<div class=example><h3>Example 4.23. Why Use <code>str</code> on a <code>docstring</code>?</h3><pre class=screen><samp class=p>>>> </samp><kbd>def foo(): print 2</kbd>
<samp class=p>>>> </samp><kbd>foo()</kbd>
2
<samp class=p>>>> </samp><kbd>foo.__doc__</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>foo.__doc__ == None</kbd> <span>&#x2461;</span>
True
<samp class=p>>>> </samp><kbd>str(foo.__doc__)</kbd> <span>&#x2462;</span>
'None'
</pre>
<ol>
<li>You can easily define a function that has no <code>docstring</code>, so its <code>__doc__</code> attribute is <code>None</code>. Confusingly, if you evaluate the <code>__doc__</code> attribute directly, the Python <abbr>IDE</abbr> prints nothing at all, which makes sense if you think about it, but is still unhelpful.
<li>You can verify that the value of the <code>__doc__</code> attribute is actually <code>None</code> by comparing it directly.
<li>The <code>str</code> function takes the null value and returns a string representation of it, <code>'None'</code>.
<table id="compare.isnull.sql" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In <abbr>SQL</abbr>, you must use <code>IS NULL</code> instead of <code>= NULL</code> to compare a null value. In Python, you can use either <code>== None</code> or <code>is None</code>, but <code>is None</code> is faster.
<p>Now that you are guaranteed to have a string, you can pass the string to <var>processFunc</var>, which you have <a href="#apihelper.lambda" title="4.7. Using lambda Functions">already defined</a> as a function that either does or doesn't collapse whitespace. Now you see why it was important to use <code>str</code> to convert a <code>None</code> value into a string representation. <var>processFunc</var> is assuming a string argument and calling its <code>split</code> method, which would crash if you passed it <code>None</code> because <code>None</code> doesn't have a <code>split</code> method.
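<p>A minimal sketch of what goes wrong without <code>str</code>, and why the coercion fixes it. The function <code>undocumented</code> is just a stand-in for any function that has no <code>docstring</code>.

```python
def undocumented():
    return 2


doc = undocumented.__doc__  # None, because there is no docstring

try:
    doc.split()             # None has no split method
    failed = False
except AttributeError:
    failed = True

# str(None) is the string 'None', which splits and joins like any other.
words = " ".join(str(doc).split())
```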
<p>Stepping back even further, you see that you're using string formatting again to concatenate the return value of <var>processFunc</var> with the return value of <var>method</var>'s <code>ljust</code> method. This is a new string method that you haven't seen before.
<div class=example><h3>Example 4.24. Introducing <code>ljust</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>s = 'buildConnectionString'</kbd>
<samp class=p>>>> </samp><kbd>s.ljust(30)</kbd> <span>&#x2460;</span>
'buildConnectionString         '
<samp class=p>>>> </samp><kbd>s.ljust(20)</kbd> <span>&#x2461;</span>
'buildConnectionString'</pre>
<ol>
<li><code>ljust</code> pads the string with spaces to the given length. This is what the <code>info</code> function uses to make two columns of output and line up all the <code>docstring</code>s in the second column.
<li>If the given length is smaller than the length of the string, <code>ljust</code> will simply return the string unchanged. It never truncates the string.
<p>You're almost finished. Given the padded method name from the <code>ljust</code> method and the (possibly collapsed) <code>docstring</code> from the call to <var>processFunc</var>, you concatenate the two and get a single string. Since you're mapping <var>methodList</var>, you end up with a list of strings. Using the <code>join</code> method of the string <code>"\n"</code>, you join this list into a single string, with each element of the list on a separate line, and print the result.
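<p>The padding-and-joining step can be tried in isolation. In this sketch, <code>format_rows</code> and the sample pairs are invented for illustration; the formatting expression is the same <code>"%s %s"</code> plus <code>ljust</code> combination that <code>info</code> uses.

```python
def format_rows(pairs, spacing=10):
    """Return one 'name doc' line per (name, doc) pair, names padded.

    format_rows is a hypothetical helper for this illustration.
    """
    lines = ["%s %s" % (name.ljust(spacing), doc) for name, doc in pairs]
    return "\n".join(lines)


table = format_rows([("pop", "remove and return item"),
                     ("reverse", "reverse in place")])
```

<p>Because every name is padded to the same width, the second column starts at the same position on every line, no matter how long each name is (as long as it fits within <var>spacing</var>).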
<div class=example><h3>Example 4.25. Printing a List</h3><pre class=screen><samp class=p>>>> </samp><kbd>li = ['a', 'b', 'c']</kbd>
<samp class=p>>>> </samp><kbd>print "\n".join(li)</kbd> <span>&#x2460;</span>
<samp>a
b
c</samp></pre>
<ol>
<li>This is also a useful debugging trick when you're working with lists. And in Python, you're always working with lists.
<p>That's the last piece of the puzzle. You should now understand this code.
<pre><code>
print "\n".join(["%s %s" %
                 (method.ljust(spacing),
                  processFunc(str(getattr(object, method).__doc__)))
                 for method in methodList])</pre><h2 id="apihelper.summary">4.9. Summary</h2>
<p>The <code>apihelper.py</code> program and its output should now make perfect sense.
<pre><code>
def info(object, spacing=10, collapse=1):
    """Print methods and docstrings.

    Takes module, class, list, dictionary, or string."""
    methodList = [method for method in dir(object) if callable(getattr(object, method))]
    processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    print "\n".join(["%s %s" %
                     (method.ljust(spacing),
                      processFunc(str(getattr(object, method).__doc__)))
                     for method in methodList])

if __name__ == "__main__":
    print info.__doc__</pre>
<p>Here is the output of <code>apihelper.py</code>:<pre class=screen><samp class=p>>>> </samp><kbd>from apihelper import info</kbd>
<samp class=p>>>> </samp><kbd>li = []</kbd>
<samp class=p>>>> </samp><kbd>info(li)</kbd>
<samp>append     L.append(object) -- append object to end
count      L.count(value) -> integer -- return number of occurrences of value
extend     L.extend(list) -- extend list by appending list elements
index      L.index(value) -> integer -- return index of first occurrence of value
insert     L.insert(index, object) -- insert object before index
pop        L.pop([index]) -> item -- remove and return item at index (default last)
remove     L.remove(value) -- remove first occurrence of value
reverse    L.reverse() -- reverse *IN PLACE*
sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1</samp></pre><div class=highlights>
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
<div class=itemizedlist>
<ul>
<li>Defining and calling functions with <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional and named arguments</a>
<li>Using <a href="#apihelper.str.intro" title="Example 4.6. Introducing str"><code>str</code></a> to coerce any arbitrary value into a string representation
<li>Using <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code>getattr</code></a> to get references to functions and other attributes dynamically
<li>Extending the list comprehension syntax to do <a href="#apihelper.filter" title="4.5. Filtering Lists">list filtering</a>
<li>Recognizing <a href="#apihelper.andor" title="4.6. The Peculiar Nature of and and or">the <code>and-or</code> trick</a> and using it safely
<li>Defining <a href="#apihelper.lambda" title="4.7. Using lambda Functions"><code>lambda</code> functions</a>
<li><a href="#apihelper.funcassign">Assigning functions to variables</a> and calling the function by referencing the variable. I can't emphasize this enough, because this mode of thought is vital
to advancing your understanding of Python. You'll see more complex applications of this concept throughout this book.
</ul>
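<p>As a final check on your understanding, here is a sketch that combines every technique from this chapter into one function that returns its text instead of printing it, which makes it easy to experiment with. The name <code>info_text</code> and the choice to return a string are assumptions made for this illustration; this is not the <code>apihelper.py</code> listing itself.

```python
def info_text(obj, spacing=10, collapse=True):
    """Return the two-column text that info would print for obj.

    info_text is a hypothetical variant written for this illustration.
    """
    # Filter: keep only the names whose attributes are callable.
    method_list = [name for name in dir(obj) if callable(getattr(obj, name))]
    # and-or trick with lambdas: pick the processing function once.
    process = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    # Map each name to a padded line, then join the lines.
    return "\n".join(["%s %s" % (name.ljust(spacing),
                                 process(str(getattr(obj, name).__doc__)))
                      for name in method_list])


report = info_text([])
```

<p>On recent versions of Python, <code>dir</code> also lists special methods such as <code>__add__</code>, so expect more lines than in the output shown above.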
<div class=chapter>
<h2 id="fileinfo">Chapter 5. Objects and Object-Orientation</h2>
<p>This chapter, and pretty much every chapter after this, deals with object-oriented Python programming.
<h2 id="fileinfo.divein">5.1. Diving In</h2>
<p>Here is a complete, working Python program. Read the <a href="#odbchelper.docstring" title="2.3. Documenting Functions"><code>docstring</code>s</a> of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't
worry about the stuff you don't understand; that's what the rest of the chapter is for.
<div class=example><h3>Example 5.1. <code>fileinfo.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Framework for getting filetype-specific metadata.
Instantiate appropriate class with filename. Returned object acts like a
dictionary, with key-value pairs for each piece of metadata.
    import fileinfo
    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])

Or use listDirectory function to get info on all files in a directory.
    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
        ...

Framework can be extended by adding classes for particular file types, e.g.
HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for
parsing its files appropriately; see MP3FileInfo for example.
"""
import os
import sys
from UserDict import UserDict

def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}

    def __parse(self, filename):
        "parse ID3v1.0 tags from MP3 file"
        self.clear()
        try:
            fsock = open(filename, "rb", 0)
            try:
                fsock.seek(-128, 2)
                tagdata = fsock.read(128)
            finally:
                fsock.close()
            if tagdata[:3] == "TAG":
                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                    self[tag] = parseFunc(tagdata[start:end])
        except IOError:
            pass

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)
        FileInfo.__setitem__(self, key, item)

def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList]
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):
        "get file info class from filename extension"
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
    return [getFileInfoClass(f)(f) for f in fileList]

if __name__ == "__main__":
    for info in listDirectory("/music/_singles/", [".mp3"]): <span>&#x2460;</span>
        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
        print</pre>
<ol>
<li>This program's output depends on the files on your hard drive. To get meaningful output, you'll need to change the directory
path to point to a directory of MP3 files on your own machine.
<p>This is the output I got on my machine. Your output will be different, unless, by some startling coincidence, you share my
exact taste in music.
<pre class=screen><samp>album=
artist=Ghost in the Machine
title=A Time Long Forgotten (Concept
genre=31
name=/music/_singles/a_time_long_forgotten_con.mp3
year=1999
comment=http://mp3.com/ghostmachine
album=Rave Mix
artist=***DJ MARY-JANE***
title=HELLRAISER****Trance from Hell
genre=31
name=/music/_singles/hellraiser.mp3
year=2000
comment=http://mp3.com/DJMARYJANE
album=Rave Mix
artist=***DJ MARY-JANE***
title=KAIRO****THE BEST GOA
genre=31
name=/music/_singles/kairo.mp3
year=2000
comment=http://mp3.com/DJMARYJANE
album=Journeys
artist=Masters of Balance
title=Long Way Home
genre=31
name=/music/_singles/long_way_home1.mp3
year=2000
comment=http://mp3.com/MastersofBalan
album=
artist=The Cynic Project
title=Sidewinder
genre=18
name=/music/_singles/sidewinder.mp3
year=2000
comment=http://mp3.com/cynicproject
album=Digitosis@128k
artist=VXpanded
title=Spinning
genre=255
name=/music/_singles/spinning.mp3
year=2000
comment=http://mp3.com/artists/95/vxp</samp></pre><h2 id="fileinfo.fromimport">5.2. Importing Modules Using <code>from <var>module</var> import</code></h2>
<p>Python has two ways of importing modules. Both are useful, and you should know when to use each. One way, <code>import <var>module</var></code>, you've already seen in <a href="#odbchelper.objects" title="2.4. Everything Is an Object">Section 2.4, &#8220;Everything Is an Object&#8221;</a>. The other way accomplishes the same thing, but it has subtle and important differences.
<p>Here is the basic <code>from <var>module</var> import</code> syntax:<pre><code>
from UserDict import UserDict
</pre><p>This is similar to the <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's docstring"><code>import <var>module</var></code></a> syntax that you know and love, but with an important difference: the attributes and methods of the imported module are imported directly into the local namespace, so they are available directly, without qualification by module name. You
can import individual items or use <code>from <var>module</var> import *</code> to import everything.
<table id="compare.fromimport.perl" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>from <var>module</var> import *</code> in Python is like <code>use <var>module</var></code> in Perl; <code>import <var>module</var></code> in Python is like <code>require <var>module</var></code> in Perl.
<table id="compare.fromimport.java" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>from <var>module</var> import *</code> in Python is like <code>import <var>module</var>.*</code> in Java; <code>import <var>module</var></code> in Python is like <code>import <var>module</var></code> in Java.
<div class=example><h3>Example 5.2. <code>import <var>module</var></code> <i class=foreignphrase><abbr>vs.</abbr></i> <code>from <var>module</var> import</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import types</kbd>
<samp class=p>>>> </samp><kbd>types.FunctionType</kbd> <span>&#x2460;</span>
&lt;type 'function'>
<samp class=p>>>> </samp><kbd>FunctionType</kbd> <span>&#x2461;</span>
<samp class=traceback>Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
NameError: There is no variable named 'FunctionType'</samp>
<samp class=p>>>> </samp><kbd>from types import FunctionType</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>FunctionType</kbd> <span>&#x2463;</span>
&lt;type 'function'></pre>
<ol>
<li>The <code>types</code> module contains no methods; it just has attributes for each Python object type. Note that the attribute, <code>FunctionType</code>, must be qualified by the module name, <code>types</code>.
<li><code>FunctionType</code> by itself has not been defined in this namespace; it exists only in the context of <code>types</code>.
<li>This syntax imports the attribute <code>FunctionType</code> from the <code>types</code> module directly into the local namespace.
<li>Now <code>FunctionType</code> can be accessed directly, without reference to <code>types</code>.
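<p>The same distinction can be sketched as a short script. This is only an illustration, and it assumes a Python 3 interpreter, where the wording of the <code>NameError</code> message differs from the session above.

```python
import types

# Qualified access works as soon as the module has been imported.
qualified = types.FunctionType

# The bare name has not been defined in this namespace yet.
try:
    FunctionType
    bare_name_defined = True
except NameError:
    bare_name_defined = False

# Import the attribute directly into the local namespace.
from types import FunctionType

# Both spellings now refer to the very same object.
same_object = FunctionType is types.FunctionType
```

<p>After the second import, <code>FunctionType</code> and <code>types.FunctionType</code> are two names for one object; nothing is copied.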
<p>When should you use <code>from <var>module</var> import</code>?
<div class=itemizedlist>
<ul>
<li>If you will be accessing attributes and methods often and don't want to type the module name over and over, use <code>from <var>module</var> import</code>.
<li>If you want to selectively import some attributes and methods but not others, use <code>from <var>module</var> import</code>.
<li>If the module contains attributes or functions with the same name as ones in your module, you must use <code>import <var>module</var></code> to avoid name conflicts.
</ul>
<p>Other than that, it's just a matter of style, and you will see Python code written both ways.
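<p>The name-conflict case in the last bullet can be sketched as follows. The local <code>join</code> helper is hypothetical and exists only to collide with <code>os.path.join</code>.

```python
import os.path

def join(words):
    """Local helper that happens to share its name with os.path.join."""
    return " ".join(words)

# With "import os.path", the two functions live in separate namespaces.
sentence = join(["dive", "into", "python"])
path = os.path.join("music", "kairo.mp3")

# "from os.path import join" rebinds the name and hides the local helper.
from os.path import join
shadowed = join is os.path.join
```

<p>Once the last import runs, calling <code>join</code> reaches the module's function and the local helper is no longer accessible by that name.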
<table class=caution border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/caution.png" alt="Caution" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Use <code>from module import *</code> sparingly, because it makes it difficult to determine where a particular function or attribute came from, and that makes
debugging and refactoring more difficult.
<div class=itemizedlist>
<h3>Further Reading on Module Importing Techniques</h3>
<ul>
<li><a href="http://www.effbot.org/guides/">eff-bot</a> has more to say on <a href="http://www.effbot.org/guides/import-confusion.htm"><code>import <var>module</var></code> <i class=foreignphrase><abbr>vs.</abbr></i> <code>from <var>module</var> import</code></a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> discusses advanced import techniques, including <a href="http://www.python.org/doc/current/tut/node8.html#SECTION008410000000000000000"><code>from <var>module</var> import *</code></a>.
</ul>
<div class=example><h3 id="fileinfo.class.example">Example 5.4. Defining the <code>FileInfo</code> Class</h3><pre><code>
from UserDict import UserDict
class FileInfo(UserDict): <span>&#x2460;</span></pre>
<ol>
<li>In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the <code>FileInfo</code> class is inherited from the <code>UserDict</code> class (which was <a href="#fileinfo.fromimport" title="5.2. Importing Modules Using from module import">imported from the <code>UserDict</code> module</a>). <code>UserDict</code> is a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior.
(There are similar classes <code>UserList</code> and <code>UserString</code> which allow you to subclass lists and strings.) There is a bit of black magic behind this, which you will demystify later
in this chapter when you explore the <code>UserDict</code> class in more depth.
<table id="compare.extends.java" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. There is no special keyword like
<code>extends</code> in Java.
<p>Python supports multiple inheritance. In the parentheses following the class name, you can list as many ancestor classes as you
like, separated by commas.
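<p>A minimal sketch of both forms of the class statement. It assumes Python 3, where <code>UserDict</code> lives in the <code>collections</code> module rather than in a module of its own, and the <code>Describable</code> class is a hypothetical second ancestor invented for the illustration.

```python
from collections import UserDict  # Python 3 home of the UserDict class

class Describable:
    """Hypothetical second ancestor, used only to show the comma-separated list."""
    def describe(self):
        return "%s with %d keys" % (self.__class__.__name__, len(self))

class FileInfo(UserDict):
    "store file metadata"

class DescribableFileInfo(FileInfo, Describable):
    "inherits from two ancestors at once"

info = DescribableFileInfo({"name": "/music/_singles/kairo.mp3"})
summary = info.describe()   # method found on the second ancestor
```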
<h3>5.3.1. Initializing and Coding Classes</h3>
<p>If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances,
because they are freed automatically when the variables assigned to them go out of scope. Memory leaks are rare in Python.
<div class=example><h3 id="fileinfo.scope">Example 5.8. Trying to Implement a Memory Leak</h3><pre class=screen><samp class=p>>>> </samp><kbd>def leakmem():</kbd>
<samp class=p>... </samp><kbd>    f = fileinfo.FileInfo('/music/_singles/kairo.mp3')</kbd> <span>&#x2460;</span>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd>for i in range(100):</kbd>
<samp class=p>... </samp><kbd>    leakmem()</kbd> <span>&#x2461;</span></pre>
<ol>
<li>Every time the <code>leakmem</code> function is called, you are creating an instance of <code>FileInfo</code> and assigning it to the variable <var>f</var>, which is a local variable within the function. Then the function ends without ever freeing <var>f</var>, so you would expect a memory leak, but you would be wrong. When the function ends, the local variable <var>f</var> goes out of scope. At this point, there are no longer any references to the newly created instance of <code>FileInfo</code> (since you never assigned it to anything other than <var>f</var>), so Python destroys the instance for us.
<li>No matter how many times you call the <code>leakmem</code> function, it will never leak memory, because every time, Python will destroy the newly created <code>FileInfo</code> instance before returning from <code>leakmem</code>.
<p>The technical term for this form of garbage collection is &#8220;reference counting&#8221;. Python keeps a count of the references to every instance created. In the above example, there was only one reference to the <code>FileInfo</code> instance: the local variable <var>f</var>. When the function ends, the variable <var>f</var> goes out of scope, so the reference count drops to <code>0</code>, and Python destroys the instance automatically.
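<p>One way to watch this happen is with weak references, which observe an object without keeping it alive. The sketch below assumes the standard CPython interpreter, where an instance is destroyed as soon as its last reference disappears; the stand-in <code>FileInfo</code> class is defined here only so the example is self-contained.

```python
import weakref

class FileInfo:
    """Stand-in for fileinfo.FileInfo; any class instance behaves the same way."""
    def __init__(self, filename=None):
        self.name = filename

watchers = []

def leakmem():
    f = FileInfo('/music/_singles/kairo.mp3')
    # A weak reference observes the instance without keeping it alive.
    watchers.append(weakref.ref(f))

for i in range(100):
    leakmem()

# Calling a weak reference returns None once its instance has been destroyed.
survivors = [w for w in watchers if w() is not None]
```

<p>Under that assumption, <code>survivors</code> ends up empty: each instance was destroyed when its call to <code>leakmem</code> returned.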
<p>In versions of Python before 2.0, there were situations where reference counting failed, and Python couldn't clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list,
where each node has a pointer to the previous and next node in the list), neither instance would ever be destroyed automatically
because Python (correctly) believed that there was always a reference to each instance. Python 2.0 added a second form of garbage collection (often described as &#8220;mark-and-sweep&#8221;) that detects reference cycles; it is smart enough to notice this virtual gridlock and clean up circular references correctly.
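<p>The circular-reference case can be sketched with the standard <code>gc</code> and <code>weakref</code> modules. The <code>Node</code> class is hypothetical, and the behavior shown assumes the CPython interpreter; the collector is switched off briefly so that it cannot run on its own in the middle of the demonstration.

```python
import gc
import weakref

class Node:
    """Hypothetical doubly-linked list node."""
    def __init__(self):
        self.prev = None
        self.next = None

gc.disable()                    # keep the collector from running by itself
a, b = Node(), Node()
a.next, b.prev = b, a           # a and b now reference each other
watcher = weakref.ref(a)

del a, b                        # no outside references remain
# Reference counting alone cannot free the pair: each still holds the other.
alive_after_del = watcher() is not None

gc.collect()                    # ask the cycle collector to run
alive_after_collect = watcher() is not None
gc.enable()
```

<p>The pair survives the <code>del</code> statement but is gone after <code>gc.collect()</code>, which is normally triggered for you without any explicit call.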
<p>As a former philosophy major, it disturbs me to think that things disappear when no one is looking at them, but that's exactly
what happens in Python. In general, you can simply forget about memory management and let Python clean up after you.
<div class=itemizedlist>
<h3>Further Reading on Garbage Collection</h3>
<ul>
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/specialattrs.html">built-in attributes like <code>__class__</code></a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-gc.html"><code>gc</code> module</a>, which gives you low-level control over Python's garbage collection.
</ul>
<h2 id="fileinfo.userdict">5.5. Exploring <code>UserDict</code>: A Wrapper Class</h2>
<p>As you've seen, <code>FileInfo</code> is a class that acts like a dictionary. To explore this further, let's look at the <code>UserDict</code> class in the <code>UserDict</code> module, which is the ancestor of the <code>FileInfo</code> class. This is nothing special; the class is written in Python and stored in a <code>.py</code> file, just like any other Python code. In particular, it's stored in the <code>lib</code> directory in your Python installation.
<table id="tip.locate" class=tip border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In the ActivePython <abbr>IDE</abbr> on Windows, you can quickly open any module in your library path by selecting
File->Locate... (<kbd class=shortcut>Ctrl-L</kbd>).
<div class=example><h3 id="fileinfo.userdict.init.example">Example 5.9. Defining the <code>UserDict</code> Class</h3><pre><code>
class UserDict:                                <span>&#x2460;</span>
    def __init__(self, dict=None):             <span>&#x2461;</span>
        self.data = {}                         <span>&#x2462;</span>
        if dict is not None: self.update(dict) <span>&#x2463;</span> <span>&#x2464;</span>
</pre>
<ol>
<li>Note that <code>UserDict</code> is a base class, not inherited from any other class.
<li>This is the <code>__init__</code> method that you <a href="#fileinfo.class.example" title="Example 5.4. Defining the FileInfo Class">overrode in the <code>FileInfo</code> class</a>. Note that the argument list in this ancestor class differs from the one in the descendant. That's okay; each subclass can have
its own set of arguments, as long as it calls the ancestor with the correct arguments. Here the ancestor class has a way
to define initial values (by passing a dictionary in the <var>dict</var> argument) which <code>FileInfo</code> does not use.
<li>Python supports data attributes (called &#8220;instance variables&#8221; in Java and Powerbuilder, and &#8220;member variables&#8221; in <abbr>C++</abbr>). Data attributes are pieces of data held by a specific instance of a class. In this case, each instance of <code>UserDict</code> will have a data attribute <var>data</var>. To reference this attribute from code outside the class, you qualify it with the instance name, <code><var>instance</var>.data</code>, in the same way that you qualify a function with its module name. To reference a data attribute from within the class,
you use <var>self</var> as the qualifier. By convention, all data attributes are initialized to reasonable values in the <code>__init__</code> method. However, this is not required, since data attributes, like local variables, <a href="#odbchelper.vardef" title="3.4. Declaring variables">spring into existence</a> when they are first assigned a value.
<li>The <code>update</code> method is a dictionary duplicator: it copies all the keys and values from one dictionary to another. This does <em>not</em> clear the target dictionary first; if the target dictionary already has some keys, the ones that also appear in the source dictionary will
be overwritten with the source's values, but the others will be left untouched. Think of <code>update</code> as a merge function, not a copy function.
<li>This is a syntax you may not have seen before (I haven't used it in the examples in this book). It's an <code>if</code> statement, but instead of having an indented block starting on the next line, there is just a single statement on the same
line, after the colon. This is perfectly legal syntax, which is just a shortcut you can use when you have only one statement
in a block. (It's like specifying a single statement without braces in <abbr>C++</abbr>.) You can use this syntax, or you can have indented code on subsequent lines, but you can't do both for the same block.
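<p>The merge behavior of <code>update</code> and the single-line <code>if</code> can both be seen on a plain dictionary. The keys and values below are made up for the illustration.

```python
target = {"name": "/music/_singles/kairo.mp3", "genre": 31}
source = {"genre": 32, "year": "2000"}

# Single statement on the same line as the if, just like in UserDict.__init__.
if source is not None: target.update(source)

# "genre" was overwritten, "year" was added, "name" was left untouched,
# and the source dictionary itself was not modified.
```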
<table id="compare.overloading" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Java and Powerbuilder support function overloading by argument list, <i class=foreignphrase><abbr>i.e.</abbr></i> one class can have multiple methods with the same name but a different number of arguments, or arguments of different types.
Other languages (most notably <abbr>PL/SQL</abbr>) even support function overloading by argument name; <i class=foreignphrase><abbr>i.e.</abbr></i> one class can have multiple methods with the same name and the same number of arguments of the same type but different argument
names. Python supports neither of these; it has no form of function overloading whatsoever. Methods are defined solely by their name,
and there can be only one method per class with a given name. So if a descendant class has an <code>__init__</code> method, it <em>always</em> overrides the ancestor <code>__init__</code> method, even if the descendant defines it with a different argument list. And the same rule applies to any other method.
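<p>A short sketch of what happens when you try to overload anyway. The <code>Greeter</code> class is hypothetical; the point is that the second definition silently replaces the first.

```python
class Greeter:
    def greet(self):
        return "hello"

    # Same name, different argument list: this replaces the method above.
    def greet(self, name):
        return "hello, " + name

g = Greeter()
try:
    g.greet()                    # the zero-argument version no longer exists
    zero_arg_call_worked = True
except TypeError:
    zero_arg_call_worked = False

one_arg_result = g.greet("world")
```

<p>When one method needs to accept different sets of arguments, the usual approach is an optional argument with a default value, as <code>UserDict.__init__</code> does with <code>dict=None</code>.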
<table id="fileinfo.derivedclasses" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Guido, the original author of Python, explains method overriding this way: "Derived classes may override methods of their base classes. Because methods have no
special privileges when calling other methods of the same object, a method of a base class that calls another method defined
in the same base class, may in fact end up calling a method of a derived class that overrides it. (For <abbr>C++</abbr> programmers: all methods in Python are effectively virtual.)" If that doesn't make sense to you (it confuses the hell out of me), feel free to ignore it.
I just thought I'd pass it along.
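<p>A small sketch may make the quotation more concrete. The two classes are hypothetical: <code>describe</code> is defined only in the base class, yet it ends up calling whichever <code>name</code> method belongs to the instance's own class.

```python
class Base:
    def describe(self):
        # self.name is looked up on the instance, so an override wins.
        return "I am " + self.name()

    def name(self):
        return "Base"

class Derived(Base):
    def name(self):
        return "Derived"

base_text = Base().describe()        # 'I am Base'
derived_text = Derived().describe()  # 'I am Derived'
```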
<table id="note.dataattributes" class=caution border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/caution.png" alt="Caution" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Always assign an initial value to all of an instance's data attributes in the <code>__init__</code> method. It will save you hours of debugging later, tracking down <code>AttributeError</code> exceptions because you're referencing uninitialized (and therefore non-existent) attributes.
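<p>The failure this caution describes looks like the following sketch, in which both classes are hypothetical. The error surfaces only when the missing attribute is first read, which may be far from the place where it should have been assigned.

```python
class Careless:
    def __init__(self):
        pass                      # forgets to initialize self.data

    def size(self):
        return len(self.data)     # fails: the attribute was never assigned

class Careful:
    def __init__(self):
        self.data = {}            # initialized to a reasonable value

    def size(self):
        return len(self.data)

try:
    Careless().size()
    error_name = None
except AttributeError as e:
    error_name = type(e).__name__

careful_size = Careful().size()
```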
<div class=example><h3 id="fileinfo.userdict.normalmethods">Example 5.10. <code>UserDict</code> Normal Methods</h3><pre><code>
    def clear(self): self.data.clear()          <span>&#x2460;</span>
    def copy(self):                             <span>&#x2461;</span>
        if self.__class__ is UserDict:          <span>&#x2462;</span>
            return UserDict(self.data)
        import copy                             <span>&#x2463;</span>
        return copy.copy(self)
    def keys(self): return self.data.keys()     <span>&#x2464;</span>
    def items(self): return self.data.items()
    def values(self): return self.data.values()
</pre>
<ol>
<li><code>clear</code> is a normal class method; it is publicly available to be called by anyone at any time. Notice that <code>clear</code>, like all class methods, has <var>self</var> as its first argument. (Remember that you don't include <var>self</var> when you call the method; it's something that Python adds for you.) Also note the basic technique of this wrapper class: store a real dictionary (<var>data</var>) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding
method on the real dictionary. (In case you'd forgotten, a dictionary's <code>clear</code> method <a href="#odbchelper.dict.del" title="Example 3.5. Deleting Items from a Dictionary">deletes all of its keys</a> and their associated values.)
<li>The <code>copy</code> method of a real dictionary returns a new dictionary that is an exact duplicate of the original (all the same key-value pairs).
But <code>UserDict</code> can't simply redirect to <code>self.data.copy</code>, because that method returns a real dictionary, and what you want is to return a new instance that is the same class as <var>self</var>.
<li>You use the <code>__class__</code> attribute to see if <var>self</var> is a <code>UserDict</code>; if so, you're golden, because you know how to copy a <code>UserDict</code>: just create a new <code>UserDict</code> and give it the real dictionary that you've squirreled away in <var>self.data</var>. Then you immediately return the new <code>UserDict</code>; you don't even get to the <code>import copy</code> on the next line.
<li>If <code>self.__class__</code> is not <code>UserDict</code>, then <var>self</var> must be some subclass of <code>UserDict</code> (like maybe <code>FileInfo</code>), in which case life gets trickier. <code>UserDict</code> doesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined
in the subclass, so you would need to iterate through them and make sure to copy all of them. Luckily, Python comes with a module to do exactly this, and it's called <code>copy</code>. I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own).
Suffice it to say that <code>copy</code> can copy arbitrary Python objects, and that's how you're using it here.
<li>The rest of the methods are straightforward, redirecting the calls to the built-in methods on <var>self.data</var>.
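<p>The two branches of <code>copy</code> can be exercised with a cut-down wrapper. This is a sketch modeled on the excerpt above, not the library's actual source; the <code>Wrapper</code> and <code>Tagged</code> names are hypothetical.

```python
class Wrapper:
    """Cut-down dictionary wrapper modeled on the UserDict excerpt."""
    def __init__(self, dict=None):
        self.data = {}
        if dict is not None: self.data.update(dict)

    def copy(self):
        if self.__class__ is Wrapper:
            return Wrapper(self.data)
        import copy
        return copy.copy(self)       # shallow copy that keeps the subclass

class Tagged(Wrapper):
    """Hypothetical subclass with an extra data attribute."""
    def __init__(self, dict=None, tag=None):
        Wrapper.__init__(self, dict)
        self.tag = tag

plain = Wrapper({"name": "kairo.mp3"}).copy()
tagged = Tagged({"name": "kairo.mp3"}, tag="demo").copy()
```

<p>The copy of the subclass instance is still a <code>Tagged</code> and still carries its <code>tag</code> attribute, which a plain <code>Wrapper(self.data)</code> call would have lost.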
<table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries. To compensate for
this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: <code>UserString</code>, <code>UserList</code>, and <code>UserDict</code>. Using a combination of normal and special methods, the <code>UserDict</code> class does an excellent imitation of a dictionary. In Python 2.2 and later, you can inherit classes directly from built-in datatypes like <code>dict</code>. An example of this is given in the examples that come with this book, in <code>fileinfo_fromdict.py</code>.
<p>In Python, you can inherit directly from the <code>dict</code> built-in datatype, as shown in this example. There are three differences here compared to the <code>UserDict</code> version.
<div class=example><h3 id="fileinfo.userdict.fromdict">Example 5.11. Inheriting Directly from Built-In Datatype <code>dict</code></h3><pre><code>
class FileInfo(dict):                  <span>&#x2460;</span>
    "store file metadata"
    def __init__(self, filename=None): <span>&#x2461;</span>
        self["name"] = filename
</pre>
<ol>
<li>The first difference is that you don't need to import the <code>UserDict</code> module, since <code>dict</code> is a built-in datatype and is always available. The second is that you are inheriting from <code>dict</code> directly, instead of from <code>UserDict.UserDict</code>.
<li>The third difference is subtle but important. Because of the way <code>UserDict</code> works internally, it requires you to manually call its <code>__init__</code> method to properly initialize its internal data structures. <code>dict</code> does not work like this; it is not a wrapper, and it requires no explicit initialization.
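<p>The third difference can be seen side by side in the sketch below. It assumes Python 3, where the wrapper class is imported from <code>collections</code>; the three class names are hypothetical.

```python
from collections import UserDict  # Python 3 home of the UserDict class

class DictInfo(dict):
    "store file metadata, inheriting from dict directly"
    def __init__(self, filename=None):
        self["name"] = filename          # works without calling dict.__init__

class WrapperInfo(UserDict):
    "store file metadata, inheriting from the wrapper class"
    def __init__(self, filename=None):
        UserDict.__init__(self)          # creates self.data; required
        self["name"] = filename

class ForgetfulInfo(UserDict):
    "same as WrapperInfo, but skips the ancestor __init__"
    def __init__(self, filename=None):
        self["name"] = filename          # self.data does not exist yet

d = DictInfo("kairo.mp3")
w = WrapperInfo("kairo.mp3")
try:
    ForgetfulInfo("kairo.mp3")
    forgetful_failed = False
except AttributeError:
    forgetful_failed = True
```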
<div class=itemizedlist>
<h3>Further Reading on <code>UserDict</code></h3>
<ul>
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-UserDict.html"><code>UserDict</code> module</a> and the <a href="http://www.python.org/doc/current/lib/module-copy.html"><code>copy</code> module</a>.
</ul>
<h2 id="fileinfo.specialmethods">5.6. Special Class Methods</h2>
<p>In addition to normal class methods, there are a number of special methods that Python classes can define. Instead of being called directly by your code (like normal methods), special methods are called for
you by Python in particular circumstances or when specific syntax is used.
<p>As you saw in the <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">previous section</a>, normal methods go a long way towards wrapping a dictionary in a class. But normal methods alone are not enough, because
there are a lot of things you can do with dictionaries besides call methods on them. For starters, you can <a href="#odbchelper.dict.define" title="Example 3.1. Defining a Dictionary">get</a> and <a href="#odbchelper.dict.modify" title="Example 3.2. Modifying a Dictionary">set</a> items with a syntax that doesn't include explicitly invoking methods. This is where special class methods come in: they
provide a way to map non-method-calling syntax into method calls.
<h3>5.6.1. Getting and Setting Items</h3>
<div class=example><h3>Example 5.12. The <code>__getitem__</code> Special Method</h3><pre><code>
def __getitem__(self, key): return self.data[key]</pre><pre class=screen><samp class=p>>>> </samp><kbd>f = fileinfo.FileInfo("/music/_singles/kairo.mp3")</kbd>
<samp class=p>>>> </samp><kbd>f</kbd>
{'name':'/music/_singles/kairo.mp3'}
<samp class=p>>>> </samp><kbd>f.__getitem__("name")</kbd> <span>&#x2460;</span>
'/music/_singles/kairo.mp3'
<samp class=p>>>> </samp><kbd>f["name"]</kbd> <span>&#x2461;</span>
'/music/_singles/kairo.mp3'</pre>
<ol>
<li>The <code>__getitem__</code> special method looks simple enough. Like the normal methods <code>clear</code>, <code>keys</code>, and <code>values</code>, it just redirects to the dictionary to return its value. But how does it get called? Well, you can call <code>__getitem__</code> directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works. The right way
to use <code>__getitem__</code> is to get Python to call it for you.
<li>This looks just like the syntax you would use to <a href="#odbchelper.dict.define" title="Example 3.1. Defining a Dictionary">get a dictionary value</a>, and in fact it returns the value you would expect. But here's the missing link: under the covers, Python has converted this syntax to the method call <code>f.__getitem__("name")</code>. That's why <code>__getitem__</code> is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax.
<p>Of course, Python has a <code>__setitem__</code> special method to go along with <code>__getitem__</code>, as shown in the next example.
<div class=example><h3 id="fileinfo.specialmethods.setitem.example">Example 5.13. The <code>__setitem__</code> Special Method</h3><pre><code>
def __setitem__(self, key, item): self.data[key] = item</pre><pre class=screen><samp class=p>>>> </samp><kbd>f</kbd>
{'name':'/music/_singles/kairo.mp3'}
<samp class=p>>>> </samp><kbd>f.__setitem__("genre", 31)</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>f</kbd>
{'name':'/music/_singles/kairo.mp3', 'genre':31}
<samp class=p>>>> </samp><kbd>f["genre"] = 32</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>f</kbd>
{'name':'/music/_singles/kairo.mp3', 'genre':32}</pre>
<ol>
<li>Like the <code>__getitem__</code> method, <code>__setitem__</code> simply redirects to the real dictionary <var>self.data</var> to do its work. And like <code>__getitem__</code>, you wouldn't ordinarily call it directly like this; Python calls <code>__setitem__</code> for you when you use the right syntax.
<li>This looks like regular dictionary syntax, except of course that <var>f</var> is really a class that's trying very hard to masquerade as a dictionary, and <code>__setitem__</code> is an essential part of that masquerade. This line of code actually calls <code>f.__setitem__("genre", 32)</code> under the covers.
<p><code>__setitem__</code> is a special class method because it gets called for you, but it's still a class method. Just as easily as the <code>__setitem__</code> method was defined in <code>UserDict</code>, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act
like dictionaries in some ways but define their own behavior above and beyond the built-in dictionary.
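<p>To see Python making these calls for you, here is a sketch of a hypothetical wrapper that records every special method invocation before redirecting to its real dictionary.

```python
class Tracing:
    """Hypothetical wrapper that records which special methods Python calls."""
    def __init__(self):
        self.data = {}
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(("__getitem__", key))
        return self.data[key]

    def __setitem__(self, key, item):
        self.calls.append(("__setitem__", key, item))
        self.data[key] = item

f = Tracing()
f["genre"] = 32          # Python calls f.__setitem__("genre", 32)
value = f["genre"]       # Python calls f.__getitem__("genre")
```

<p>Neither method is ever named in the last three lines, yet <code>f.calls</code> holds one entry for each of them.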
<p>This concept is the basis of the entire framework you're studying in this chapter. Each file type can have a handler class
that knows how to get metadata from a particular type of file. Once some attributes (like the file's name and location) are
known, the handler class knows how to derive other attributes automatically. This is done by overriding the <code>__setitem__</code> method, checking for particular keys, and adding additional processing when they are found.
<p>For example, <code>MP3FileInfo</code> is a descendant of <code>FileInfo</code>. When an <code>MP3FileInfo</code>'s <code>name</code> is set, it doesn't just set the <code>name</code> key (like the ancestor <code>FileInfo</code> does); it also looks in the file itself for <abbr>MP3</abbr> tags and populates a whole set of keys. The next example shows how this works.
<div class=example><h3>Example 5.14. Overriding <code>__setitem__</code> in <code>MP3FileInfo</code></h3><pre><code>
    def __setitem__(self, key, item):         <span>&#x2460;</span>
        if key == "name" and item:            <span>&#x2461;</span>
            self.__parse(item)                <span>&#x2462;</span>
        FileInfo.__setitem__(self, key, item) <span>&#x2463;</span></pre>
<ol>
<li>Notice that this <code>__setitem__</code> method is defined exactly the same way as the ancestor method. This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments. (Technically speaking,
the names of the arguments don't matter; only the number of arguments is important.)
<li>Here's the crux of the entire <code>MP3FileInfo</code> class: if you're assigning a value to the <code>name</code> key, you want to do something extra.
<li>The extra processing you do for <code>name</code>s is encapsulated in the <code>__parse</code> method. This is another class method defined in <code>MP3FileInfo</code>, and when you call it, you qualify it with <var>self</var>. Just calling <code>__parse</code> would look for a normal function defined outside the class, which is not what you want. Calling <code>self.__parse</code> will look for a class method defined within the class. This isn't anything new; you reference <a href="#fileinfo.userdict.normalmethods" title="Example 5.10. UserDict Normal Methods">data attributes</a> the same way.
<li>After doing this extra processing, you want to call the ancestor method. Remember that this is never done for you in Python; you must do it manually. Note that you're calling the immediate ancestor, <code>FileInfo</code>, even though it doesn't have a <code>__setitem__</code> method. That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually
find and call the <code>__setitem__</code> defined in <code>UserDict</code>.
<table id="tip.self.call" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">When accessing data attributes within a class, you need to qualify the attribute name: <code>self.<var>attribute</var></code>. When calling other methods within a class, you need to qualify the method name: <code>self.<var>method</var></code>.
<div class=example><h3 id="fileinfo.specialmethods.setname">Example 5.15. Setting an <code>MP3FileInfo</code>'s <code>name</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import fileinfo</kbd>
<samp class=p>>>> </samp><kbd>mp3file = fileinfo.MP3FileInfo()</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>mp3file</kbd>
{'name':None}
<samp class=p>>>> </samp><kbd>mp3file["name"] = "/music/_singles/kairo.mp3"</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>mp3file</kbd>
<samp>{'album': 'Rave Mix', 'artist': '***DJ MARY-JANE***', 'genre': 31,
'title': 'KAIRO****THE BEST GOA', 'name': '/music/_singles/kairo.mp3',
'year': '2000', 'comment': 'http://mp3.com/DJMARYJANE'}</samp>
<samp class=p>>>> </samp><kbd>mp3file["name"] = "/music/_singles/sidewinder.mp3"</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>mp3file</kbd>
<samp>{'album': '', 'artist': 'The Cynic Project', 'genre': 18, 'title': 'Sidewinder',
'name': '/music/_singles/sidewinder.mp3', 'year': '2000',
'comment': 'http://mp3.com/cynicproject'}</samp></pre>
<ol>
<li>First, you create an instance of <code>MP3FileInfo</code>, without passing it a filename. (You can get away with this because the <var>filename</var> argument of the <code>__init__</code> method is <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional</a>.) Since <code>MP3FileInfo</code> has no <code>__init__</code> method of its own, Python walks up the ancestor tree and finds the <code>__init__</code> method of <code>FileInfo</code>. This <code>__init__</code> method manually calls the <code>__init__</code> method of <code>UserDict</code> and then sets the <code>name</code> key to <var>filename</var>, which is <code>None</code>, since you didn't pass a filename. Thus, <var>mp3file</var> initially looks like a dictionary with one key, <code>name</code>, whose value is <code>None</code>.
<li>Now the real fun begins. Setting the <code>name</code> key of <var>mp3file</var> triggers the <code>__setitem__</code> method on <code>MP3FileInfo</code> (not <code>UserDict</code>), which notices that you're setting the <code>name</code> key with a real value and calls <code>self.__parse</code>. Although you haven't traced through the <code>__parse</code> method yet, you can see from the output that it sets several other keys: <code>album</code>, <code>artist</code>, <code>genre</code>, <code>title</code>, <code>year</code>, and <code>comment</code>.
<li>Modifying the <code>name</code> key will go through the same process again: Python calls <code>__setitem__</code>, which calls <code>self.__parse</code>, which sets all the other keys.
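<p>The whole chain can be reproduced without any MP3 files. The sketch below assumes Python 3 (<code>UserDict</code> imported from <code>collections</code>) and uses a hypothetical <code>FakeTagInfo</code> handler whose <code>__parse</code> method derives a single key from the filename instead of reading tags from disk.

```python
from collections import UserDict  # Python 3 home of the UserDict class

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class FakeTagInfo(FileInfo):
    """Hypothetical handler; a real one would read tags from the file itself."""
    def __parse(self, filename):
        # Derive one extra key from the name instead of opening the file.
        self["extension"] = filename.rsplit(".", 1)[-1]

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)
        FileInfo.__setitem__(self, key, item)

info = FakeTagInfo()
before = dict(info)                       # only the 'name' key, set to None
info["name"] = "/music/_singles/kairo.mp3"
after = dict(info)                        # 'extension' has been filled in
```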
<h2 id="fileinfo.morespecial">5.7. Advanced Special Class Methods</h2>
<p>Python has more special methods than just <code>__getitem__</code> and <code>__setitem__</code>. Some of them let you emulate functionality that you may not even know about.
<p>This example shows some of the other special methods in <code>UserDict</code>.
<div class=example><h3 id="fileinfo.morespecial.example">Example 5.16. More Special Methods in <code>UserDict</code></h3><pre><code>
    def __repr__(self): return repr(self.data)     <span>&#x2460;</span>
    def __cmp__(self, dict):                       <span>&#x2461;</span>
        if isinstance(dict, UserDict):
            return cmp(self.data, dict.data)
        else:
            return cmp(self.data, dict)
    def __len__(self): return len(self.data)       <span>&#x2462;</span>
    def __delitem__(self, key): del self.data[key] <span>&#x2463;</span></pre>
<ol>
<li><code>__repr__</code> is a special method that is called when you call <code>repr(<var>instance</var>)</code>. The <code>repr</code> function is a built-in function that returns a string representation of an object. It works on any object, not just class
instances. You're already intimately familiar with <code>repr</code> and you don't even know it. In the interactive window, when you type just a variable name and press the <kbd>ENTER</kbd> key, Python uses <code>repr</code> to display the variable's value. Go create a dictionary <var>d</var> with some data and then <code>print repr(d)</code> to see for yourself.
<li><code>__cmp__</code> is called when you compare class instances. In general, you can compare any two Python objects, not just class instances, by using <code>==</code>. There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they
have all the same keys and values, and strings are equal when they are the same length and contain the same sequence of characters.
For class instances, you can define the <code>__cmp__</code> method and code the comparison logic yourself, and then you can use <code>==</code> to compare instances of your class and Python will call your <code>__cmp__</code> special method for you.
<li><code>__len__</code> is called when you call <code>len(<var>instance</var>)</code>. The <code>len</code> function is a built-in function that returns the length of an object. It works on any object that could reasonably be thought
of as having a length. The <code>len</code> of a string is its number of characters; the <code>len</code> of a dictionary is its number of keys; the <code>len</code> of a list or tuple is its number of elements. For class instances, define the <code>__len__</code> method and code the length calculation yourself, and then call <code>len(<var>instance</var>)</code> and Python will call your <code>__len__</code> special method for you.
<li><code>__delitem__</code> is called when you call <code>del <var>instance</var>[<var>key</var>]</code>, which you may remember as the way to <a href="#odbchelper.dict.del" title="Example 3.5. Deleting Items from a Dictionary">delete individual items from a dictionary</a>. When you use <code>del</code> on a class instance, Python calls the <code>__delitem__</code> special method for you.
<table id="compare.strequals.java" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Java, you determine whether two string variables reference the same physical memory location by using <code>str1 == str2</code>. This is called <em>object identity</em>, and it is written in Python as <code>str1 is str2</code>. To compare string values in Java, you would use <code>str1.equals(str2)</code>; in Python, you would use <code>str1 == str2</code>. Java programmers who have been taught to believe that the world is a better place because <code>==</code> in Java compares by identity instead of by value may have a difficult time adjusting to Python's lack of such &#8220;gotchas&#8221;.
<p>At this point, you may be thinking, &#8220;All this work just to do something in a class that I can do with a built-in datatype.&#8221; And it's true that life would be easier (and the entire <code>UserDict</code> class would be unnecessary) if you could inherit from built-in datatypes like a dictionary. But even if you could, special
methods would still be useful, because they can be used in any class, not just wrapper classes like <code>UserDict</code>.
<p>Special methods mean that <em>any class</em> can store key/value pairs like a dictionary, just by defining the <code>__setitem__</code> method. <em>Any class</em> can act like a sequence, just by defining the <code>__getitem__</code> method. Any class that defines the <code>__cmp__</code> method can be compared with <code>==</code>. And if your class represents something that has a length, don't define a <code>GetLength</code> method; define the <code>__len__</code> method and use <code>len(<var>instance</var>)</code>.
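<p>The idea can be sketched with a small class. This is a minimal illustration, not code from <code>fileinfo.py</code>: <code>Playlist</code> is a hypothetical class, and the sketch assumes a current Python 3 interpreter, where <code>__cmp__</code> no longer exists and <code>==</code> calls the <code>__eq__</code> special method instead.

```python
class Playlist:
    """Hypothetical container that relies on special methods instead of
    ad-hoc method names such as GetLength or IsEqual."""

    def __init__(self, *titles):
        self.titles = list(titles)

    def __len__(self):
        # Called by len(instance).
        return len(self.titles)

    def __getitem__(self, index):
        # Called by instance[index]; it also lets a for loop walk the object.
        return self.titles[index]

    def __eq__(self, other):
        # Called by instance == other.
        return isinstance(other, Playlist) and self.titles == other.titles

    def __repr__(self):
        # Called by repr(instance) and by the interactive prompt.
        return "Playlist(%s)" % ", ".join(repr(t) for t in self.titles)


a = Playlist("kairo", "spinning")
b = Playlist("kairo", "spinning")

print(len(a))    # 2
print(a[0])      # kairo
print(a == b)    # True: same contents
print(a is b)    # False: two distinct objects
print(repr(a))   # Playlist('kairo', 'spinning')
```

<p>Note that <code>a == b</code> compares values while <code>a is b</code> compares identity, which is the distinction drawn in the Java note above.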
<table id="note.physical.v.logical" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">While other object-oriented languages only let you define the physical model of an object (&#8220;this object has a <code>GetLength</code> method&#8221;), Python's special class methods like <code>__len__</code> allow you to define the logical model of an object (&#8220;this object has a length&#8221;).
<p>Python has a lot of other special methods. There's a whole set of them that let classes act like numbers, allowing you to add,
subtract, and do other arithmetic operations on class instances. (The canonical example of this is a class that represents
complex numbers, numbers with both real and imaginary components.) The <code>__call__</code> method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods
that allow classes to have read-only and write-only data attributes; you'll learn more about those in later chapters.
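<p>As a rough sketch of two of these, the following defines one class that supports <code>+</code> and <code>-</code> and another whose instances can be called like functions. <code>Money</code> and <code>Multiplier</code> are hypothetical names chosen for illustration, and the code assumes Python 3.

```python
class Money:
    """Hypothetical amount in cents that supports + and -."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called for self + other.
        return Money(self.cents + other.cents)

    def __sub__(self, other):
        # Called for self - other.
        return Money(self.cents - other.cents)

    def __repr__(self):
        return "Money(%d)" % self.cents


class Multiplier:
    """Hypothetical class whose instances behave like functions."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        # Called for instance(value).
        return value * self.factor


total = Money(250) + Money(175) - Money(25)
print(total)        # Money(400)

double = Multiplier(2)
print(double(21))   # 42
```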
<div class=itemizedlist>
<h3>Further Reading on Special Class Methods</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class=citetitle>Python Reference Manual</i></a> documents <a href="http://www.python.org/doc/current/ref/specialnames.html">all the special class methods</a>.
</ul>
<h2 id="fileinfo.classattributes">5.8. Introducing Class Attributes</h2>
<p>You already know about <a href="#fileinfo.userdict.init.example" title="Example 5.9. Defining the UserDict Class">data attributes</a>, which are variables owned by a specific instance of a class. Python also supports class attributes, which are variables owned by the class itself.
<div class=example><h3 id="fileinfo.classattributes.intro">Example 5.17. Introducing Class Attributes</h3><pre><code>
class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}</pre><pre class=screen><samp class=p>>>> </samp><kbd>import fileinfo</kbd>
<samp class=p>>>> </samp><kbd>fileinfo.MP3FileInfo</kbd> <span>&#x2460;</span>
&lt;class fileinfo.MP3FileInfo at 01257FDC>
<samp class=p>>>> </samp><kbd>fileinfo.MP3FileInfo.tagDataMap</kbd> <span>&#x2461;</span>
<samp>{'title': (3, 33, &lt;function stripnulls at 0260C8D4>),
'genre': (127, 128, &lt;built-in function ord>),
'artist': (33, 63, &lt;function stripnulls at 0260C8D4>),
'year': (93, 97, &lt;function stripnulls at 0260C8D4>),
'comment': (97, 126, &lt;function stripnulls at 0260C8D4>),
'album': (63, 93, &lt;function stripnulls at 0260C8D4>)}</samp>
<samp class=p>>>> </samp><kbd>m = fileinfo.MP3FileInfo()</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>m.tagDataMap</kbd>
<samp>{'title': (3, 33, &lt;function stripnulls at 0260C8D4>),
'genre': (127, 128, &lt;built-in function ord>),
'artist': (33, 63, &lt;function stripnulls at 0260C8D4>),
'year': (93, 97, &lt;function stripnulls at 0260C8D4>),
'comment': (97, 126, &lt;function stripnulls at 0260C8D4>),
'album': (63, 93, &lt;function stripnulls at 0260C8D4>)}</samp></pre>
<ol>
<li><code>MP3FileInfo</code> is the class itself, not any particular instance of the class.
<li><var>tagDataMap</var> is a class attribute: literally, an attribute of the class. It is available before creating any instances of the class.
<li>Class attributes are available both through direct reference to the class and through any instance of the class.
<table id="compare.classattr.java" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the <code>static</code> keyword, one without). In Python, only class attributes can be defined here; data attributes are defined in the <code>__init__</code> method.
<p>Class attributes can be used as class-level constants (which is how you use them in <code>MP3FileInfo</code>), but they are not really constants. You can also change them.
<table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">There are no constants in Python. Everything can be changed if you try hard enough. This fits with one of the core principles of Python: bad behavior should be discouraged but not banned. If you really want to change the value of <code>None</code>, you can do it, but don't come running to me when your code is impossible to debug.
<div class=example><h3 id="fileinfo.classattributes.writeable.example">Example 5.18. Modifying Class Attributes</h3><pre class=screen><samp class=p>>>> </samp><kbd>class counter:</kbd>
<samp class=p>... </samp>    count = 0                     <span>&#x2460;</span>
<samp class=p>... </samp>    def __init__(self):
<samp class=p>... </samp>        self.__class__.count += 1 <span>&#x2461;</span>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd>counter</kbd>
&lt;class __main__.counter at 010EAECC>
<samp class=p>>>> </samp><kbd>counter.count</kbd> <span>&#x2462;</span>
0
<samp class=p>>>> </samp><kbd>c = counter()</kbd>
<samp class=p>>>> </samp><kbd>c.count</kbd> <span>&#x2463;</span>
1
<samp class=p>>>> </samp><kbd>counter.count</kbd>
1
<samp class=p>>>> </samp><kbd>d = counter()</kbd> <span>&#x2464;</span>
<samp class=p>>>> </samp><kbd>d.count</kbd>
2
<samp class=p>>>> </samp><kbd>c.count</kbd>
2
<samp class=p>>>> </samp><kbd>counter.count</kbd>
2</pre>
<ol>
<li><var>count</var> is a class attribute of the <code>counter</code> class.
<li><code>__class__</code> is a built-in attribute of every class instance (of every class). It is a reference to the class that <var>self</var> is an instance of (in this case, the <code>counter</code> class).
<li>Because <var>count</var> is a class attribute, it is available through direct reference to the class, before you have created any instances of the
class.
<li>Creating an instance of the class calls the <code>__init__</code> method, which increments the class attribute <var>count</var> by <code>1</code>. This affects the class itself, not just the newly created instance.
<li>Creating a second instance will increment the class attribute <var>count</var> again. Notice how the class attribute is shared by the class and all instances of the class.
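<p>One detail worth seeing in code is why Example 5.18 increments <code>self.__class__.count</code> rather than <code>self.count</code>. The sketch below is an illustration under the assumption of Python 3; the <code>Counter</code> class and its two method names are hypothetical. Assigning through <var>self</var> creates a data attribute on that one instance, which then hides the class attribute of the same name.

```python
class Counter:
    count = 0   # class attribute, shared by the class and every instance

    def bump_shared(self):
        # Rebinds the attribute on the class, as Example 5.18 does.
        self.__class__.count += 1

    def bump_local(self):
        # Reads Counter.count, adds 1, then stores the result on the
        # instance, creating a data attribute that hides the class attribute.
        self.count += 1


a = Counter()
b = Counter()

a.bump_shared()
print(Counter.count, a.count, b.count)   # 1 1 1

a.bump_local()
print(Counter.count, a.count, b.count)   # 1 2 1

# Only a now has its own entry named "count".
print("count" in a.__dict__, "count" in b.__dict__)   # True False
```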
<h2 id="fileinfo.private">5.9. Private Functions</h2>
<p>Like most languages, Python has the concept of private elements:
<div class=itemizedlist>
<ul>
<li>Private functions, which can't be called from outside their module
<li>Private class methods, which can't be called from outside their class
<li>Private attributes, which can't be accessed from outside their class.
</ul>
<p>Unlike in most languages, whether a Python function, method, or attribute is private or public is determined entirely by its name.
<p>If the name of a Python function, class method, or attribute starts with (but doesn't end with) two underscores, it's private; everything else is
public. Python has no concept of <em>protected</em> class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible
only in their own class) or public (accessible from anywhere).
<p>In <code>MP3FileInfo</code>, there are two methods: <code>__parse</code> and <code>__setitem__</code>. As you have already seen, <code>__setitem__</code> is a <a href="#fileinfo.specialmethods.setitem.example" title="Example 5.13. The __setitem__ Special Method">special method</a>; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could
call it directly (even from outside the <code>fileinfo</code> module) if you had a really good reason. However, <code>__parse</code> is private, because it has two underscores at the beginning of its name.
<table id="tip.specialmethodnames" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Python, all special methods (like <a href="#fileinfo.specialmethods.setitem.example" title="Example 5.13. The __setitem__ Special Method"><code>__setitem__</code></a>) and built-in attributes (like <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's docstring"><code>__doc__</code></a>) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and
attributes this way, because it will only confuse you (and others) later.
<div class=example><h3>Example 5.19. Trying to Call a Private Method</h3><pre class=screen><samp class=p>>>> </samp><kbd>import fileinfo</kbd>
<samp class=p>>>> </samp><kbd>m = fileinfo.MP3FileInfo()</kbd>
<samp class=p>>>> </samp><kbd>m.__parse("/music/_singles/kairo.mp3")</kbd> <span>&#x2460;</span>
<samp class=traceback>Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'MP3FileInfo' instance has no attribute '__parse'</samp></pre>
<ol>
<li>If you try to call a private method, Python will raise a slightly misleading exception, saying that the method does not exist. Of course it does exist, but it's private,
so it's not accessible outside the class. Strictly speaking, private methods are accessible outside their class, just not <em>easily</em> accessible. Nothing in Python is truly private; internally, the names of private methods and attributes are mangled and unmangled on the fly to make them
seem inaccessible by their given names. You can access the <code>__parse</code> method of the <code>MP3FileInfo</code> class by the name <code>_MP3FileInfo__parse</code>. Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a
reason, but like many other things in Python, their privateness is ultimately a matter of convention, not force.
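<p>The name mangling described above can be observed directly. This is a small sketch assuming Python 3; <code>Tagged</code> is a hypothetical stand-in for <code>MP3FileInfo</code> so that the example runs without <code>fileinfo.py</code>.

```python
class Tagged:
    def __init__(self):
        self.__secret = "hidden"   # stored on the instance as _Tagged__secret

    def __parse(self):             # stored on the class as _Tagged__parse
        return "parsed " + self.__secret

    def parse(self):
        # Inside the class body, the short name works as written.
        return self.__parse()


t = Tagged()
print(t.parse())                   # parsed hidden

try:
    t.__parse()                    # outside the class, the short name fails
except AttributeError as error:
    print("AttributeError:", error)

# The mangled name works from anywhere, which is why privacy here is a
# convention rather than an enforced rule.
print(t._Tagged__parse())          # parsed hidden

mangled = sorted(name for name in dir(t) if "Tagged" in name)
print(mangled)                     # ['_Tagged__parse', '_Tagged__secret']
```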
<div class=itemizedlist>
<h3>Further Reading on Private Functions</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> discusses the inner workings of <a href="http://www.python.org/doc/current/tut/node11.html#SECTION0011600000000000000000">private variables</a>.
</ul>
<h2 id="fileinfo.summary">5.10. Summary</h2>
<p>That's it for the hard-core object trickery. You'll see a real-world application of special class methods in <a href="#soap">Chapter 12</a>, which uses <code>getattr</code> to create a proxy to a remote web service.
<p>The next chapter will continue using this code sample to explore other Python concepts, such as exceptions, file objects, and <code>for</code> loops.
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
<div class=itemizedlist>
<ul>
<li>Importing modules using either <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's docstring"><code>import <var>module</var></code></a> or <a href="#fileinfo.fromimport" title="5.2. Importing Modules Using from module import"><code>from <var>module</var> import</code></a>
<li><a href="#fileinfo.class" title="5.3. Defining Classes">Defining</a> and <a href="#fileinfo.create" title="5.4. Instantiating Classes">instantiating</a> classes
<li>Defining <a href="#fileinfo.class.example" title="Example 5.4. Defining the FileInfo Class"><code>__init__</code> methods</a> and other <a href="#fileinfo.specialmethods" title="5.6. Special Class Methods">special class methods</a>, and understanding when they are called
<li>Subclassing <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class"><code>UserDict</code></a> to define classes that act like dictionaries
<li>Defining <a href="#fileinfo.userdict.init.example" title="Example 5.9. Defining the UserDict Class">data attributes</a> and <a href="#fileinfo.classattributes" title="5.8. Introducing Class Attributes">class attributes</a>, and understanding the differences between them
<li>Defining <a href="#fileinfo.private" title="5.9. Private Functions">private attributes and methods</a>
</ul>
<div class=chapter>
[exception stuff was here]
[for loop stuff was here]
<div class=example><h3>Example 6.12. Introducing <code><code>sys</code>.modules</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>print '\n'.join(sys.modules.keys())</kbd> <span>&#x2461;</span>
<samp>win32api
os.path
os
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat</samp></pre>
<ol>
<li>The <code>sys</code> module contains system-level information, such as the version of Python you're running (<code><code>sys</code>.version</code> or <code><code>sys</code>.version_info</code>), and system-level options such as the maximum allowed recursion depth (<code><code>sys</code>.getrecursionlimit()</code> and <code><code>sys</code>.setrecursionlimit()</code>).
<li><code><code>sys</code>.modules</code> is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules <em>your</em> program has imported. Python preloads some modules on startup, and if you're using a Python <abbr>IDE</abbr>, <code><code>sys</code>.modules</code> contains all the modules imported by all the programs you've run within the <abbr>IDE</abbr>.
<p>This example demonstrates how to use <code><code>sys</code>.modules</code>.
<div class=example><h3>Example 6.13. Using <code><code>sys</code>.modules</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import fileinfo</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>print '\n'.join(sys.modules.keys())</kbd>
<samp>win32api
os.path
os
fileinfo
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat</samp>
<samp class=p>>>> </samp><kbd>fileinfo</kbd>
&lt;module 'fileinfo' from 'fileinfo.pyc'>
<samp class=p>>>> </samp><kbd>sys.modules["fileinfo"]</kbd> <span>&#x2461;</span>
&lt;module 'fileinfo' from 'fileinfo.pyc'></pre>
<ol>
<li>As new modules are imported, they are added to <code><code>sys</code>.modules</code>. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in <code><code>sys</code>.modules</code>, so importing the second time is simply a dictionary lookup.
<li>Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the <code><code>sys</code>.modules</code> dictionary.
<p>The next example shows how to use the <code>__module__</code> class attribute with the <code><code>sys</code>.modules</code> dictionary to get a reference to the module in which a class is defined.
<div class=example><h3>Example 6.14. The <code>__module__</code> Class Attribute</h3><pre class=screen><samp class=p>>>> </samp><kbd>from fileinfo import MP3FileInfo</kbd>
<samp class=p>>>> </samp><kbd>MP3FileInfo.__module__</kbd> <span>&#x2460;</span>
'fileinfo'
<samp class=p>>>> </samp><kbd>sys.modules[MP3FileInfo.__module__]</kbd> <span>&#x2461;</span>
&lt;module 'fileinfo' from 'fileinfo.pyc'></pre>
<ol>
<li>Every Python class has a built-in <a href="#fileinfo.classattributes" title="5.8. Introducing Class Attributes">class attribute</a> <code>__module__</code>, which is the name of the module in which the class is defined.
<li>Combining this with the <code><code>sys</code>.modules</code> dictionary, you can get a reference to the module in which a class is defined.
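<p>The same two ideas can be tried without <code>fileinfo.py</code>. The sketch below assumes Python 3 and substitutes the standard <code>json</code> module and its <code>JSONDecoder</code> class for <code>fileinfo</code> and <code>MP3FileInfo</code>; that substitution is purely for illustration.

```python
import sys
import json

# The first import stored the module object in sys.modules.
first = sys.modules["json"]

# A repeated import is answered from that cache, so it is the same object.
import json as json_again
print(json_again is first)          # True

# __module__ holds the name of the module where the class is defined ...
owner_name = json.JSONDecoder.__module__
print(owner_name)                   # json.decoder

# ... and sys.modules turns that name back into the module object.
owner = sys.modules[owner_name]
print(owner.JSONDecoder is json.JSONDecoder)   # True
```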
<p>Now you're ready to see how <code><code>sys</code>.modules</code> is used in <code>fileinfo.py</code>, the sample program introduced in <a href="#fileinfo">Chapter 5</a>. This example shows that portion of the code.
<div class=example><h3>Example 6.15. <code><code>sys</code>.modules</code> in <code>fileinfo.py</code></h3><pre><code>
def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]): <span>&#x2460;</span>
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        <span>&#x2461;</span>
    return hasattr(module, subclass) and getattr(module, subclass) or FileInfo <span>&#x2462;</span></pre>
<ol>
<li>This is a function with two arguments; <var>filename</var> is required, but <var>module</var> is <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional</a> and defaults to the module that contains the <code>FileInfo</code> class. This looks inefficient, because you might expect Python to evaluate the <code><code>sys</code>.modules</code> expression every time the function is called. In fact, Python evaluates default expressions only once, when the <code>def</code> statement is executed; for a function defined at the top level of a module, that is the first time the module is imported. As you'll see later, you never call this
function with a <var>module</var> argument, so <var>module</var> serves as a function-level constant.
<li>You'll plow through this line later, after you dive into the <code>os</code> module. For now, take it on faith that <var>subclass</var> ends up as the name of a class, like <code>MP3FileInfo</code>.
<li>You already know about <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code>getattr</code></a>, which gets a reference to an object by name. <code>hasattr</code> is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has
a particular class (although it works for any object and any attribute, just like <code>getattr</code>). In English, this line of code says, &#8220;If this module has the class named by <var>subclass</var> then return it, otherwise return the base class <code>FileInfo</code>.&#8221;
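<p>Both points can be checked with a self-contained sketch. Everything here is an assumption made for illustration: the two empty classes stand in for the real ones, <code>handlers</code> is a throwaway module object standing in for the <code>fileinfo</code> module, and <code>find_module</code> exists only to record when the default expression runs. The sketch also uses the three-argument form of <code>getattr</code>, which returns its third argument when the attribute is missing, as an alternative spelling of the <code>hasattr</code> ... <code>and</code> ... <code>or</code> line.

```python
import os.path
import types


class FileInfo(dict):
    "stand-in for the base class"


class MP3FileInfo(FileInfo):
    "stand-in for the MP3 subclass"


# A throwaway module object playing the role of the fileinfo module.
handlers = types.ModuleType("handlers")
handlers.FileInfo = FileInfo
handlers.MP3FileInfo = MP3FileInfo

evaluations = []


def find_module():
    # Records each time the default expression is evaluated.
    evaluations.append("default evaluated")
    return handlers


def get_file_info_class(filename, module=find_module()):
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    # Falls back to FileInfo when the module has no such class.
    return getattr(module, subclass, FileInfo)


print(get_file_info_class("kairo.mp3").__name__)   # MP3FileInfo
print(get_file_info_class("notes.txt").__name__)   # FileInfo
print(get_file_info_class("README").__name__)      # FileInfo

# Three calls, but the default expression ran once, at definition time.
print(len(evaluations))                            # 1
```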
<div class=itemizedlist>
<h3>Further Reading on Modules</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> discusses exactly <a href="http://www.python.org/doc/current/tut/node6.html#SECTION006710000000000000000">when and how default arguments are evaluated</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-sys.html"><code>sys</code></a> module.
</ul>
<h2 id="fileinfo.os">6.5. Working with Directories</h2>
<p>The <code>os.path</code> module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing
the contents of a directory.
<div class=example><h3 id="fileinfo.os.path.join.example">Example 6.16. Constructing Pathnames</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import os</kbd>
<samp class=p>>>> </samp><kbd>os.path.join("c:\\music\\ap\\", "mahadeva.mp3")</kbd> <span>&#x2460;</span> <span>&#x2461;</span>
'c:\\music\\ap\\mahadeva.mp3'
<samp class=p>>>> </samp><kbd>os.path.join("c:\\music\\ap", "mahadeva.mp3")</kbd> <span>&#x2462;</span>
'c:\\music\\ap\\mahadeva.mp3'
<samp class=p>>>> </samp><kbd>os.path.expanduser("~")</kbd> <span>&#x2463;</span>
'c:\\Documents and Settings\\mpilgrim\\My Documents'
<samp class=p>>>> </samp><kbd>os.path.join(os.path.expanduser("~"), "Python")</kbd> <span>&#x2464;</span>
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'</pre>
<ol>
<li><code>os.path</code> is a reference to a module -- which module depends on your platform. Just as <a href="#crossplatform.example" title="Example 6.2. Supporting Platform-Specific Functionality"><code>getpass</code></a> encapsulates differences between platforms by setting <var>getpass</var> to a platform-specific function, <code>os</code> encapsulates differences between platforms by setting <var>path</var> to a platform-specific module.
<li>The <code>join</code> function of <code>os.path</code> constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing
with pathnames on Windows is annoying because the backslash character must be escaped.)
<li>In this slightly less trivial case, <code>join</code> will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since
<code>addSlashIfNecessary</code> is one of the stupid little functions I always need to write when building up my toolbox in a new language. <em>Do not</em> write this stupid little function in Python; smart people have already taken care of it for you.
<li><code>expanduser</code> will expand a pathname that uses <code>~</code> to represent the current user's home directory. This works on any platform where users have a home directory, like Windows,
<abbr>UNIX</abbr>, and Mac OS X; it has no effect on Mac OS.
<li>Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory.
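<p>To see the platform-specific behavior on any machine, you can import the underlying modules by name: <code>ntpath</code> is the module that <code>os.path</code> refers to on Windows, and <code>posixpath</code> is its counterpart on <abbr>UNIX</abbr>-compatible systems. The sketch below assumes Python 3; the <code>/music</code> paths are made up to mirror the Windows ones.

```python
import ntpath
import posixpath

# Windows rules: a backslash is added only when one is missing.
print(ntpath.join("c:\\music\\ap\\", "mahadeva.mp3"))   # c:\music\ap\mahadeva.mp3
print(ntpath.join("c:\\music\\ap", "mahadeva.mp3"))     # c:\music\ap\mahadeva.mp3

# UNIX rules: the same call, with forward slashes.
print(posixpath.join("/music/ap", "mahadeva.mp3"))      # /music/ap/mahadeva.mp3

# join accepts more than two components.
print(posixpath.join("/music", "ap", "mahadeva.mp3"))   # /music/ap/mahadeva.mp3

# An absolute component discards everything that came before it.
print(posixpath.join("/music/ap", "/tmp/other.mp3"))    # /tmp/other.mp3
```

<p>In real code you would normally call <code>os.path.join</code> and let Python pick the right module for the platform.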
<div class=example><h3 id="splittingpathnames.example">Example 6.17. Splitting Pathnames</h3><pre class=screen><samp class=p>>>> </samp><kbd>os.path.split("c:\\music\\ap\\mahadeva.mp3")</kbd> <span>&#x2460;</span>
('c:\\music\\ap', 'mahadeva.mp3')
<samp class=p>>>> </samp><kbd>(filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3")</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>filepath</kbd> <span>&#x2462;</span>
'c:\\music\\ap'
<samp class=p>>>> </samp><kbd>filename</kbd> <span>&#x2463;</span>
'mahadeva.mp3'
<samp class=p>>>> </samp><kbd>(shortname, extension) = os.path.splitext(filename)</kbd> <span>&#x2464;</span>
<samp class=p>>>> </samp><kbd>shortname</kbd>
'mahadeva'
<samp class=p>>>> </samp><kbd>extension</kbd>
'.mp3'</pre>
<ol>
<li>The <code>split</code> function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use
<a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a> to return multiple values from a function? Well, <code>split</code> is such a function.
<li>You assign the return value of the <code>split</code> function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.
<li>The first variable, <var>filepath</var>, receives the value of the first element of the tuple returned from <code>split</code>, the file path.
<li>The second variable, <var>filename</var>, receives the value of the second element of the tuple returned from <code>split</code>, the filename.
<li><code>os.path</code> also contains a function <code>splitext</code>, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique
to assign each of them to separate variables.
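<p>A few edge cases of <code>split</code> and <code>splitext</code> are easy to get wrong from memory, so here is a short sketch that exercises them. It assumes Python 3 and calls <code>ntpath</code> and <code>posixpath</code> directly so the results do not depend on the platform; the filenames are made up.

```python
import ntpath
import posixpath

# The ordinary case, under Windows rules and under UNIX rules.
print(ntpath.split("c:\\music\\ap\\mahadeva.mp3"))   # ('c:\\music\\ap', 'mahadeva.mp3')
print(posixpath.split("/music/ap/mahadeva.mp3"))     # ('/music/ap', 'mahadeva.mp3')

# A trailing separator leaves the filename part empty.
print(posixpath.split("/music/ap/"))                 # ('/music/ap', '')

# splitext splits at the last dot only.
print(posixpath.splitext("archive.tar.gz"))          # ('archive.tar', '.gz')

# No dot means an empty extension.
print(posixpath.splitext("README"))                  # ('README', '')

# Multi-variable assignment unpacks the returned tuple.
(shortname, extension) = posixpath.splitext("mahadeva.mp3")
print(shortname, extension)                          # mahadeva .mp3
```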
<div class=example><h3 id="fileinfo.listdir.example">Example 6.18. Listing Directories</h3><pre class=screen><samp class=p>>>> </samp><kbd>os.listdir("c:\\music\\_singles\\")</kbd> <span>&#x2460;</span>
<samp>['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>dirname = "c:\\"</kbd>
<samp class=p>>>> </samp><kbd>os.listdir(dirname)</kbd> <span>&#x2461;</span>
<samp>['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']</samp>
<samp class=p>>>> </samp><kbd>[f for f in os.listdir(dirname)</kbd>
<samp class=p>... </samp>if os.path.isfile(os.path.join(dirname, f))] <span>&#x2462;</span>
<samp>['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
'NTDETECT.COM', 'ntldr', 'pagefile.sys']</samp>
<samp class=p>>>> </samp><kbd>[f for f in os.listdir(dirname)</kbd>
<samp class=p>... </samp>if os.path.isdir(os.path.join(dirname, f))] <span>&#x2463;</span>
<samp>['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']</samp></pre>
<ol>
<li>The <code>listdir</code> function takes a pathname and returns a list of the contents of the directory.
<li><code>listdir</code> returns both files and folders, with no indication of which is which.
<li>You can use <a href="#apihelper.filter" title="4.5. Filtering Lists">list filtering</a> and the <code>isfile</code> function of the <code>os.path</code> module to separate the files from the folders. <code>isfile</code> takes a pathname and returns a true value if the path represents a file, and a false value otherwise (<code>1</code> and <code>0</code> in older versions of Python, <code>True</code> and <code>False</code> in newer ones). Here you're using <code><code>os.path</code>.<code>join</code></code> to ensure a full pathname, but <code>isfile</code> also works with a partial path, relative to the current working directory. You can use <code>os.getcwd()</code> to get the current working directory.
<li><code>os.path</code> also has an <code>isdir</code> function, which returns a true value if the path represents a directory, and a false value otherwise. You can use this to get a list of the subdirectories
within a directory.
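<p>If you don't have a <code>c:\</code> drive to experiment with, the same filtering can be tried against a scratch directory. This is a sketch assuming Python 3; the file and folder names are invented, and the directory is created with <code>tempfile.mkdtemp</code> and removed again at the end.

```python
import os
import shutil
import tempfile

# Build a scratch directory holding two folders and two files.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "music"))
os.mkdir(os.path.join(root, "docs"))
for name in ("boot.ini", "notes.txt"):
    with open(os.path.join(root, name), "w") as handle:
        handle.write("placeholder")

# listdir makes no promise about order, so sort for a stable result.
entries = sorted(os.listdir(root))

# isfile and isdir need the full pathname, hence the join.
files = [f for f in entries if os.path.isfile(os.path.join(root, f))]
dirs = [f for f in entries if os.path.isdir(os.path.join(root, f))]

print(entries)   # ['boot.ini', 'docs', 'music', 'notes.txt']
print(files)     # ['boot.ini', 'notes.txt']
print(dirs)      # ['docs', 'music']

shutil.rmtree(root)   # clean up the scratch directory
```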
<div class=example><h3>Example 6.19. Listing Directories in <code>fileinfo.py</code></h3><pre><code>
def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]            <span>&#x2460;</span> <span>&#x2461;</span>
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList]  <span>&#x2462;</span> <span>&#x2463;</span> <span>&#x2464;</span></pre>
<ol>
<li><code>os.listdir(directory)</code> returns a list of all the files and folders in <var>directory</var>.
<li>Iterating through the list with <var>f</var>, you use <code>os.path.normcase(f)</code> to normalize the case according to operating system defaults. <code>normcase</code> is a useful little function that compensates for case-insensitive operating systems that think that <code>mahadeva.mp3</code> and <code>mahadeva.MP3</code> are the same file. For instance, on Windows and Mac OS, <code>normcase</code> will convert the entire filename to lowercase; on <abbr>UNIX</abbr>-compatible systems, it will return the filename unchanged.
<li>Iterating through the normalized list with <var>f</var> again, you use <code>os.path.splitext(f)</code> to split each filename into name and extension.
<li>For each file, you see if the extension is in the list of file extensions you care about (<var>fileExtList</var>, which was passed to the <code>listDirectory</code> function).
<li>For each file you care about, you use <code>os.path.join(directory, f)</code> to construct the full pathname of the file, and return a list of the full pathnames.
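<p>Example 6.19 shows only the list-building part of the function. As a rough, runnable variant, the sketch below returns the matching pathnames themselves; that return value, the <code>list_directory</code> name, and the scratch files are all assumptions made for illustration, and the code assumes Python 3.

```python
import os
import shutil
import tempfile


def list_directory(directory, file_ext_list):
    "return full pathnames of files whose extension is in file_ext_list"
    # Normalize case first, so the extension test behaves the same way
    # on case-insensitive systems.
    file_list = [os.path.normcase(f) for f in os.listdir(directory)]
    return sorted(os.path.join(directory, f)
                  for f in file_list
                  if os.path.splitext(f)[1] in file_ext_list)


# Scratch directory with two .mp3 files and one file to be filtered out.
root = tempfile.mkdtemp()
for name in ("kairo.mp3", "spinning.mp3", "cover.jpg"):
    open(os.path.join(root, name), "w").close()

matches = list_directory(root, [".mp3"])
names = [os.path.basename(path) for path in matches]
print(names)   # ['kairo.mp3', 'spinning.mp3']

shutil.rmtree(root)   # clean up the scratch directory
```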
<table id="tip.os" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Whenever possible, you should use the functions in <code>os</code> and <code>os.path</code> for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like
<code>os.path.split</code> work on <abbr>UNIX</abbr>, Windows, Mac OS, and any other platform supported by Python.
<p>There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you
may already be familiar with from working on the command line.
<div class=example><h3 id="fileinfo.os.glob.example">Example 6.20. Listing Directories with <code>glob</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>os.listdir("c:\\music\\_singles\\")</kbd> <span>&#x2460;</span>
<samp>['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>import glob</kbd>
<samp class=p>>>> </samp><kbd>glob.glob('c:\\music\\_singles\\*.mp3')</kbd> <span>&#x2461;</span>
<samp>['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
'c:\\music\\_singles\\hellraiser.mp3',
'c:\\music\\_singles\\kairo.mp3',
'c:\\music\\_singles\\long_way_home1.mp3',
'c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>glob.glob('c:\\music\\_singles\\s*.mp3')</kbd> <span>&#x2462;</span>
<samp>['c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>glob.glob('c:\\music\\*\\*.mp3')</kbd><span>&#x2463;</span>
</pre>
<ol>
<li>As you saw earlier, <code>os.listdir</code> simply takes a directory path and lists all files and directories in that directory.
<li>The <code>glob</code> module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard.
Here the wildcard is a directory path plus "*.mp3", which will match all <code>.mp3</code> files. Note that each element of the returned list already includes the full path of the file.
<li>If you want to find all the files in a specific directory that start with "s" and end with ".mp3", you can do that too.
<li>Now consider this scenario: you have a <code>music</code> directory, with several subdirectories within it, with <code>.mp3</code> files within each subdirectory. You can get a list of all of those with a single call to <code>glob</code>, by using two wildcards at once. One wildcard is the <code>"*.mp3"</code> (to match <code>.mp3</code> files), and one wildcard is <em>within the directory path itself</em>, to match any subdirectory within <code>c:\music</code>. That's a crazy amount of power packed into one deceptively simple-looking function!
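<p>The three wildcard patterns can be tried against a scratch directory tree. This sketch assumes Python 3; the folder layout imitates the <code>music</code> directory described above, and the <code>relative</code> helper is a hypothetical convenience that strips the temporary prefix and sorts the results, since <code>glob</code> does not promise any particular order.

```python
import glob
import os
import shutil
import tempfile

# Scratch tree: two subdirectories, three .mp3 files and one other file.
root = tempfile.mkdtemp()
layout = {
    "ap": ["mahadeva.mp3"],
    "_singles": ["kairo.mp3", "spinning.mp3", "notes.txt"],
}
for subdir, names in layout.items():
    os.mkdir(os.path.join(root, subdir))
    for name in names:
        open(os.path.join(root, subdir, name), "w").close()


def relative(paths):
    # Hypothetical helper: paths relative to root, with forward slashes.
    return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in paths)


# All .mp3 files in one directory.
one_dir = relative(glob.glob(os.path.join(root, "_singles", "*.mp3")))
print(one_dir)         # ['_singles/kairo.mp3', '_singles/spinning.mp3']

# Only those starting with "s".
starts_with_s = relative(glob.glob(os.path.join(root, "_singles", "s*.mp3")))
print(starts_with_s)   # ['_singles/spinning.mp3']

# A wildcard in the directory part matches every subdirectory.
all_dirs = relative(glob.glob(os.path.join(root, "*", "*.mp3")))
print(all_dirs)        # ['_singles/kairo.mp3', '_singles/spinning.mp3', 'ap/mahadeva.mp3']

shutil.rmtree(root)    # clean up the scratch tree
```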
<div class=itemizedlist>
<h3>Further Reading on the <code>os</code> Module</h3>
<ul>
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/240">questions about the <code>os</code> module</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-os.html"><code>os</code></a> module and the <a href="http://www.python.org/doc/current/lib/module-os.path.html"><code>os.path</code></a> module.
</ul>
[HTML stuff was here]
<h2 id="dialect.locals">8.5. <code>locals</code> and <code>globals</code></h2>
<p>Let's digress from <abbr>HTML</abbr> processing for a minute and talk about how Python handles variables. Python has two built-in functions, <code>locals</code> and <code>globals</code>, which provide dictionary-based access to local and global variables.
<p>Remember <code>locals</code>? You first saw it here:
<pre><code>
def unknown_starttag(self, tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    self.pieces.append("&lt;%(tag)s%(strattrs)s>" % locals())
</pre><p>No, wait, you can't learn about <code>locals</code> yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention.
<p>Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names
of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.
<p>At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which
keeps track of the function's variables, including function arguments and locally defined variables. Each module has its
own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any
other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any
module, which holds built-in functions and exceptions.
<p>When a line of code asks for the value of a variable <var>x</var>, Python will search for that variable in all the available namespaces, in order:
<div class=orderedlist>
<ol>
<li>local namespace - specific to the current function or class method. If the function defines a local variable <var>x</var>, or has an argument <var>x</var>, Python will use this and stop searching.
<li>global namespace - specific to the current module. If the module has defined a variable, function, or class called <var>x</var>, Python will use that and stop searching.
<li>built-in namespace - global to all modules. As a last resort, Python will assume that <var>x</var> is the name of a built-in function or variable.
</ol>
<p>If Python doesn't find <var>x</var> in any of these namespaces, it gives up and raises a <code>NameError</code> with the message <samp>There is no variable named 'x'</samp>, which you saw back in <a href="#odbchelper.unboundvariable" title="Example 3.18. Referencing an Unbound Variable">Example 3.18, &#8220;Referencing an Unbound Variable&#8221;</a>, but you didn't appreciate how much work Python was doing before giving you that error.
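<p>The search order described above can be watched in action with a small sketch. It uses Python 3 syntax, and the names <code>limit</code> and <code>no_such_name</code> are invented for the demonstration.

```python
import builtins

limit = "module-level"            # lives in the global (module) namespace

def show_local(limit="local"):    # the argument lives in the local namespace
    return limit                  # found locally, so the search stops here

def show_global():
    return limit                  # not local, so Python falls back to the module

def show_builtin():
    return len                    # neither local nor global: found in the built-ins

def show_missing():
    try:
        return no_such_name       # hypothetical name that is defined nowhere
    except NameError as err:
        return type(err).__name__

print(show_local())
print(show_global())
print(show_builtin() is builtins.len)
print(show_missing())
```

<p>Each function asks for a name that lives one step further down the list, and the last one runs out of namespaces to search.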
<table class=important border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/important.png" alt="Important" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a <a href="#fileinfo.nested" title="Example 6.21. listDirectory">nested function</a> or <a href="#apihelper.lambda" title="4.7. Using lambda Functions"><code>lambda</code> function</a>, Python will search for that variable in the current (nested or <code>lambda</code>) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or <code>lambda</code>) function's namespace, <em>then in the parent function's namespace</em>, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:<pre><code>
from __future__ import nested_scopes</pre><p>Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are <em>directly accessible at run-time</em>. How? Well, the local namespace is accessible via the built-in <code>locals</code> function, and the global (module level) namespace is accessible via the built-in <code>globals</code> function.
<div class=example><h3>Example 8.10. Introducing <code>locals</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>def foo(arg):</kbd> <span>&#x2460;</span>
<samp class=p>... </samp><kbd>    x = 1</kbd>
<samp class=p>... </samp><kbd>    print locals()</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd>foo(7)</kbd> <span>&#x2461;</span>
{'arg': 7, 'x': 1}
<samp class=p>>>> </samp><kbd>foo('bar')</kbd> <span>&#x2462;</span>
{'arg': 'bar', 'x': 1}</pre>
<ol>
<li>The function <code>foo</code> has two variables in its local namespace: <var>arg</var>, whose value is passed in to the function, and <var>x</var>, which is defined within the function.
<li><code>locals</code> returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values
of the dictionary are the actual values of the variables. So calling <code>foo</code> with <code>7</code> prints the dictionary containing the function's two local variables: <var>arg</var> (<code>7</code>) and <var>x</var> (<code>1</code>).
<li>Remember, Python has dynamic typing, so you could just as easily pass a string in for <var>arg</var>; the function (and the call to <code>locals</code>) would still work just as well. <code>locals</code> works with all variables of all datatypes.
<p>What <code>locals</code> does for the local (function) namespace, <code>globals</code> does for the global (module) namespace. <code>globals</code> is more exciting, though, because a module's namespace is more exciting.
<sup>[<a name="d0e21226" href="#ftn.d0e21226">3</a>]</sup> Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes
defined in the module. Plus, it includes anything that was imported into the module.
<p>Remember the difference between <a href="#fileinfo.fromimport" title="5.2. Importing Modules Using from module import"><code>from <var>module</var> import</code></a> and <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's docstring"><code>import <var>module</var></code></a>? With <code>import <var>module</var></code>, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access
any of its functions or attributes: <code><var>module</var>.<var>function</var></code>. But with <code>from <var>module</var> import</code>, you're actually importing specific functions and attributes from another module into your own namespace, which is why you
access them directly without referencing the original module they came from. With the <code>globals</code> function, you can actually see this happen.
<div class=example><h3 id="dialect.globals.example">Example 8.11. Introducing <code>globals</code></h3>
<p>Look at the following block of code at the bottom of <code>BaseHTMLProcessor.py</code>:<pre><code>
if __name__ == "__main__":
    for k, v in globals().items(): <span>&#x2460;</span>
        print k, "=", v</pre>
<ol>
<li>Just so you don't get intimidated, remember that you've seen all this before. The <code>globals</code> function returns a dictionary, and you're <a href="#dictionaryiter.example" title="Example 6.10. Iterating Through a Dictionary">iterating through the dictionary</a> using the <code>items</code> method and <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a>. The only thing new here is the <code>globals</code> function.
<p>Now running the script from the command line gives this output (note that your output may be slightly different, depending
on your platform and where you installed Python):<pre class=screen><samp class=p>c:\docbook\dip\py></samp> python BaseHTMLProcessor.py</pre><pre><code>
SGMLParser = sgmllib.SGMLParser <span>&#x2460;</span>
htmlentitydefs = &lt;module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'> <span>&#x2461;</span>
BaseHTMLProcessor = __main__.BaseHTMLProcessor <span>&#x2462;</span>
__name__ = __main__ <span>&#x2463;</span>
... rest of output omitted for brevity...</pre>
<ol>
<li><code>SGMLParser</code> was imported from <code>sgmllib</code>, using <code>from <var>module</var> import</code>. That means that it was imported directly into the module's namespace, and here it is.
<li>Contrast this with <code>htmlentitydefs</code>, which was imported using <code>import</code>. That means that the <code>htmlentitydefs</code> module itself is in the namespace, but the <var>entitydefs</var> variable defined within <code>htmlentitydefs</code> is not.
<li>This module only defines one class, <code>BaseHTMLProcessor</code>, and here it is. Note that the value here is <a href="#fileinfo.classattributes.intro" title="Example 5.17. Introducing Class Attributes">the class itself</a>, not a specific instance of the class.
<li>Remember the <a href="#odbchelper.ifnametrick"><code>if __name__</code> trick</a>? When running a module (as opposed to importing it from another module), the built-in <code>__name__</code> attribute is a special value, <code>__main__</code>. Since you ran this module as a script from the command line, <code>__name__</code> is <code>__main__</code>, which is why the little test code to print the <code>globals</code> got executed.
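<p>The difference between the two import styles can also be checked in isolation. The following is a minimal sketch in Python 3 syntax: it runs a single import statement against a fresh dictionary that stands in for a module's global namespace, then lists what ended up in it. The helper <code>names_bound_by</code> is hypothetical, and <code>json</code> is used only because it is a convenient standard module.

```python
def names_bound_by(statement):
    """Run one import statement in a fresh namespace and list the names it binds."""
    namespace = {}                    # stands in for a module's global namespace
    exec(statement, namespace)
    # exec adds a __builtins__ entry of its own; leave it out of the report.
    return sorted(name for name in namespace if name != "__builtins__")

print(names_bound_by("import json"))
print(names_bound_by("from json import dumps"))
```

<p>With <code>import json</code>, only the module name lands in the namespace; with <code>from json import dumps</code>, only the function does.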
<table id="tip.localsbyname" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Using the <code>locals</code> and <code>globals</code> functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors
the functionality of the <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code>getattr</code></a> function, which allows you to access arbitrary functions dynamically by providing the function name as a string.
<p>There is one other important difference between the <code>locals</code> and <code>globals</code> functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning
it.
<div class=example><h3 id="dialect.locals.readonly.example">Example 8.12. <code>locals</code> is read-only, <code>globals</code> is not</h3><pre><code>
def foo(arg):
    x = 1
    print locals() <span>&#x2460;</span>
    locals()["x"] = 2 <span>&#x2461;</span>
    print "x=",x <span>&#x2462;</span>

z = 7
print "z=",z
foo(3)
globals()["z"] = 8 <span>&#x2463;</span>
print "z=",z <span>&#x2464;</span>
</pre>
<ol>
<li>Since <code>foo</code> is called with <code>3</code>, this will print <code>{'arg': 3, 'x': 1}</code>. This should not be a surprise.
<li><code>locals</code> is a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this
would change the value of the local variable <var>x</var> to <code>2</code>, but it doesn't. <code>locals</code> does not actually return the local namespace; it returns a copy. So changing it does nothing to the value of the variables
in the local namespace.
<li>This prints <code>x= 1</code>, not <code>x= 2</code>.
<li>After being burned by <code>locals</code>, you might think that this <em>wouldn't</em> change the value of <var>z</var>, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), <code>globals</code> returns the actual global namespace, not a copy: the exact opposite behavior of <code>locals</code>. So any changes to the dictionary returned by <code>globals</code> directly affect your global variables.
<li>This prints <code>z= 8</code>, not <code>z= 7</code>.
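<p>The same contrast can be reproduced with functions that return values instead of printing them. This is a sketch in Python 3 syntax, assuming the standard CPython interpreter; the function names are made up for the example.

```python
def try_locals():
    x = 1
    locals()["x"] = 2             # changes a dictionary, not the variable itself
    return x

def set_global(name, value):
    globals()[name] = value       # writes straight into the module's namespace

def read_z():
    return z                      # looked up in the module namespace when called

set_global("z", 8)
print(try_locals())
print(read_z())
```

<p>Note that <code>z</code> is never assigned with an ordinary assignment statement anywhere in this sketch; it exists only because of the write through <code>globals</code>.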
[XML stuff was here]
<h2 id="kgp.packages">9.2. Packages</h2>
<p>Actually parsing an <abbr>XML</abbr> document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour
to talk about packages.
<div class=example><h3>Example 9.5. Loading an <abbr>XML</abbr> document (a sneak peek)</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>from xml.dom import minidom</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')</kbd></pre>
<ol>
<li>This is a syntax you haven't seen before. It looks almost like the <code>from <var>module</var> import</code> you know and love, but the <code>"."</code> gives it away as something above and beyond a simple import. In fact, <code>xml</code> is what is known as a package, <code>dom</code> is a nested package within <code>xml</code>, and <code>minidom</code> is a module within <code>xml.dom</code>.
<p>That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than
directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still
just <code>.py</code> files, like always, except that they're in a subdirectory instead of the main <code>lib/</code> directory of your Python installation.
<div class=example><h3>Example 9.6. File layout of a package</h3><pre class=screen>Python21/            root Python installation (home of the executable)
|
+--lib/              library directory (home of the standard library modules)
   |
   +-- xml/          xml package (really just a directory with other stuff in it)
       |
       +--sax/       xml.sax package (again, just a directory)
       |
       +--dom/       xml.dom package (contains minidom.py)
       |
       +--parsers/   xml.parsers package (used internally)</pre><p>So when you say <code>from xml.dom import minidom</code>, Python figures out that that means &#8220;look in the <code>xml</code> directory for a <code>dom</code> directory, and look in <em>that</em> for the <code>minidom</code> module, and import it as <code>minidom</code>&#8221;. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import
specific classes or functions from a module contained within a package. You can also import the package itself as a module.
The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.
<div class=example><h3>Example 9.7. Packages are modules, too</h3><pre class=screen><samp class=p>>>> </samp><kbd>from xml.dom import minidom</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>minidom</kbd>
&lt;module 'xml.dom.minidom' from 'C:\Python21\lib\xml\dom\minidom.pyc'>
<samp class=p>>>> </samp><kbd>minidom.Element</kbd>
&lt;class xml.dom.minidom.Element at 01095744>
<samp class=p>>>> </samp><kbd>from xml.dom.minidom import Element</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>Element</kbd>
&lt;class xml.dom.minidom.Element at 01095744>
<samp class=p>>>> </samp><kbd>minidom.Element</kbd>
&lt;class xml.dom.minidom.Element at 01095744>
<samp class=p>>>> </samp><kbd>from xml import dom</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>dom</kbd>
&lt;module 'xml.dom' from 'C:\Python21\lib\xml\dom\__init__.pyc'>
<samp class=p>>>> </samp><kbd>import xml</kbd> <span>&#x2463;</span>
<samp class=p>>>> </samp><kbd>xml</kbd>
&lt;module 'xml' from 'C:\Python21\lib\xml\__init__.pyc'></pre>
<ol>
<li>Here you're importing a module (<code>minidom</code>) from a nested package (<code>xml.dom</code>). The result is that <code>minidom</code> is imported into your <a href="#dialect.locals" title="8.5. locals and globals">namespace</a>, and in order to reference classes within the <code>minidom</code> module (like <code>Element</code>), you need to preface them with the module name.
<li>Here you are importing a class (<code>Element</code>) from a module (<code>minidom</code>) from a nested package (<code>xml.dom</code>). The result is that <code>Element</code> is imported directly into your namespace. Note that this does not interfere with the previous import; the <code>Element</code> class can now be referenced in two ways (but it's all still the same class).
<li>Here you are importing the <code>dom</code> package (a nested package of <code>xml</code>) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even
have its own attributes and methods, just like the modules you've seen before.
<li>Here you are importing the root level <code>xml</code> package as a module.
<p>So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)?
The answer is the magical <code>__init__.py</code> file. You see, packages are not simply directories; they are directories with a specific file, <code>__init__.py</code>, inside. This file defines the attributes and methods of the package. For instance, <code>xml.dom</code> contains a <code>Node</code> class, which is defined in <code>xml/dom/__init__.py</code>. When you import a package as a module (like <code>dom</code> from <code>xml</code>), you're really importing its <code>__init__.py</code> file.
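<p>You can check this against the standard library itself. The sketch below uses Python 3 syntax and assumes an ordinary installation where the standard library is stored as source files on disk.

```python
import os
import xml.dom

# A package imported as a module is backed by its __init__ file.
init_file = os.path.basename(xml.dom.__file__)
print(init_file)

# Node is defined in xml/dom/__init__.py, so it is an attribute of the package.
print(xml.dom.Node.ELEMENT_NODE)
```

<p>The first line of output names the file behind the package, and the second shows that a class defined in that file is reachable as a plain attribute of <code>xml.dom</code>.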
<table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">A package is a directory with the special <code>__init__.py</code> file in it. The <code>__init__.py</code> file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file,
but it has to exist. If <code>__init__.py</code> doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.
<p>So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an <code>xml</code> package with <code>sax</code> and <code>dom</code> packages inside, the authors could have chosen to put all the <code>sax</code> functionality in <code>xmlsax.py</code> and all the <code>dom</code> functionality in <code>xmldom.py</code>, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the <abbr>XML</abbr> package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different
areas simultaneously).
<p>If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good
package architecture. It's one of the many things Python is good at, so take advantage of it.
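<p>To see how little it takes to make a package of your own, here is a minimal sketch that writes one into a temporary directory and imports it. It uses Python 3 syntax, and the package name <code>shapes_demo</code>, its <code>UNITS</code> attribute, and the <code>square</code> module are all invented for the example.

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "shapes_demo")       # hypothetical package name
os.mkdir(pkg_dir)

# __init__.py turns the directory into a package and holds its attributes.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("UNITS = 'cm'\n")

# An ordinary module that lives inside the package.
with open(os.path.join(pkg_dir, "square.py"), "w") as f:
    f.write("def area(side):\n    return side * side\n")

sys.path.insert(0, root)          # let the import system find the new directory
importlib.invalidate_caches()

import shapes_demo                # really imports shapes_demo/__init__.py
from shapes_demo import square    # imports shapes_demo/square.py

print(shapes_demo.UNITS)
print(square.area(3))
```

<p>In a real project you would create these files by hand and keep them next to your script; the temporary directory is only there to keep the sketch self-contained.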
<h2 id="kgp.parse">9.3. Parsing <abbr>XML</abbr></h2>
<p>As I was saying, actually parsing an <abbr>XML</abbr> document is very simple: one line of code. Where you go from there is up to you.
(Unicode stuff was here)
<div class=chapter>
<h2 id="streams">Chapter 10. Scripts and Streams</h2>
<div class=example><h3>Example 10.12. Chaining commands</h3><pre class=screen>
<samp class=p>[you@localhost kgp]$ </samp>python kgp.py -g binary.xml <span>&#x2460;</span>
01100111
<samp class=p>[you@localhost kgp]$ </samp>cat binary.xml <span>&#x2461;</span>
<samp>&lt;?xml version="1.0"?>
&lt;!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
&lt;grammar>
&lt;ref id="bit">
&lt;p>0&lt;/p>
&lt;p>1&lt;/p>
&lt;/ref>
&lt;ref id="byte">
&lt;p>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>\
&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;xref id="bit"/>&lt;/p>
&lt;/ref>
&lt;/grammar></samp>
<samp class=p>[you@localhost kgp]$ </samp>cat binary.xml | python kgp.py -g - <span>&#x2462;</span> <span>&#x2463;</span>
10110001</pre>
<ol>
<li>As you saw in <a href="#kgp.divein" title="9.1. Diving in">Section 9.1, &#8220;Diving in&#8221;</a>, this will print a string of eight random bits, <code>0</code> or <code>1</code>.
<li>This simply prints out the entire contents of <code>binary.xml</code>. (Windows users should use <code>type</code> instead of <code>cat</code>.)
<li>This prints the contents of <code>binary.xml</code>, but the &#8220;<code>|</code>&#8221; character, called the &#8220;pipe&#8221; character, means that the contents will not be printed to the screen. Instead, they will become the standard input of the
next command, which in this case calls your Python script.
<li>Instead of specifying a grammar file (like <code>binary.xml</code>), you specify &#8220;<code>-</code>&#8221;, which causes your script to load the grammar from standard input instead of from a specific file on disk. (More on how
this happens in the next example.) So the effect is the same as the first syntax, where you specified the grammar filename
directly, but think of the expansion possibilities here. Instead of simply doing <code>cat binary.xml</code>, you could run a script that dynamically generates the grammar, then you can pipe it into your script. It could come from
anywhere: a database, or some grammar-generating meta-script, or whatever. The point is that you don't need to change your
<code>kgp.py</code> script at all to incorporate any of this functionality. All you need to do is be able to take grammar files from standard
input, and you can separate all the other logic into another program.
<p>So how does the script &#8220;know&#8221; to read from standard input when the grammar file is &#8220;<code>-</code>&#8221;? It's not magic; it's just code.
<div class=example><h3>Example 10.13. Reading from standard input in <code>kgp.py</code></h3><pre><code>
def openAnything(source):
    if source == "-": <span>&#x2460;</span>
        import sys
        return sys.stdin

    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
[... snip ...]</pre>
<ol>
<li>This is the <code>openAnything</code> function from <code>toolbox.py</code>, which you previously examined in <a href="#kgp.openanything" title="10.1. Abstracting input sources">Section 10.1, &#8220;Abstracting input sources&#8221;</a>. All you've done is add three lines of code at the beginning of the function to check if the source is &#8220;<code>-</code>&#8221;; if so, you return <code>sys.stdin</code>. Really, that's it! Remember, <code>stdin</code> is a file-like object with a <code>read</code> method, so the rest of the code (in <code>kgp.py</code>, where you call <code>openAnything</code>) doesn't change a bit.
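<p>The idea can be tried without a shell pipe by temporarily swapping in a fake standard input. This is a simplified sketch in Python 3 syntax, not the <code>toolbox.py</code> code: the names <code>open_anything</code> and <code>load_grammar</code> are hypothetical, and everything except the &#8220;<code>-</code>&#8221; case is reduced to opening a local file.

```python
import io
import sys

def open_anything(source):
    """Return a file-like object for source; "-" means standard input."""
    if source == "-":
        return sys.stdin
    return open(source)           # simplified: local files only, no URLs

def load_grammar(source):
    """Read everything from source, whatever kind of source it is."""
    stream = open_anything(source)
    try:
        return stream.read()
    finally:
        if stream is not sys.stdin:
            stream.close()

# Simulate a pipe by replacing stdin; in real use the shell does this for you.
saved_stdin = sys.stdin
sys.stdin = io.StringIO("bit: 0 | 1")
try:
    text = load_grammar("-")
finally:
    sys.stdin = saved_stdin

print(text)
```

<p>The caller, <code>load_grammar</code>, only ever calls <code>read</code>, so it works the same way whether the object it gets back is a file on disk or standard input.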
[more XML stuff was here]
<h2 id="kgp.commandline">10.6. Handling command-line arguments</h2>
<p>Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short-
or long-style flags to specify various options. None of this is <abbr>XML</abbr>-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.
<p>It's difficult to talk about command-line processing without understanding how command-line arguments are exposed to your
Python program, so let's write a simple program to see them.
<div class=example><h3>Example 10.20. Introducing <var>sys.argv</var></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
#argecho.py
import sys

for arg in sys.argv: <span>&#x2460;</span>
    print arg</pre>
<ol>
<li>Each command-line argument passed to the program will be in <var>sys.argv</var>, which is just a list. Here you are printing each argument on a separate line.
<div class=example><h3>Example 10.21. The contents of <var>sys.argv</var></h3><pre class=screen>
<samp class=p>[you@localhost py]$ </samp>python argecho.py <span>&#x2460;</span>
argecho.py
<samp class=p>[you@localhost py]$ </samp>python argecho.py abc def <span>&#x2461;</span>
<samp>argecho.py
abc
def</samp>
<samp class=p>[you@localhost py]$ </samp>python argecho.py --help <span>&#x2462;</span>
<samp>argecho.py
--help</samp>
<samp class=p>[you@localhost py]$ </samp>python argecho.py -m kant.xml <span>&#x2463;</span>
<samp>argecho.py
-m
kant.xml</samp></pre>
<ol>
<li>The first thing to know about <var>sys.argv</var> is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later,
in <a href="#regression" title="Chapter 16. Functional Programming">Chapter 16, <i>Functional Programming</i></a>. Don't worry about it for now.
<li>Command-line arguments are separated by spaces, and each shows up as a separate element in the <var>sys.argv</var> list.
<li>Command-line flags, like <code>--help</code>, also show up as their own element in the <var>sys.argv</var> list.
<li>To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag
(<code>-m</code>) which takes an argument (<code>kant.xml</code>). Both the flag itself and the flag's argument are simply sequential elements in the <var>sys.argv</var> list. No attempt is made to associate one with the other; all you get is a list.
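<p>If you would rather poke at this from inside Python than from a shell, the following sketch starts a child interpreter with a flag and an argument and reports what the child saw. It uses Python 3 syntax and the <code>subprocess</code> module, neither of which appears in the examples above; the one-line child program is made up for the demonstration.

```python
import subprocess
import sys

# The child prints everything after sys.argv[0]. With -c there is no script
# file, so sys.argv[0] is "-c" instead of a script name.
child_code = "import sys; print(sys.argv[1:])"

result = subprocess.run(
    [sys.executable, "-c", child_code, "-m", "kant.xml"],
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = result.stdout.strip()
print(seen_by_child)
```

<p>The flag and its argument arrive as two neighboring list elements, with nothing to say that one belongs to the other.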
<p>So as you can see, you certainly have all the information passed on the command line, but then again, it doesn't look like
it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags,
you can simply use <code>sys.argv[1]</code> to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need the <code>getopt</code> module.
<div class=example><h3>Example 10.22. Introducing <code>getopt</code></h3><pre><code>
def main(argv):
    grammar = "kant.xml" <span>&#x2460;</span>
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) <span>&#x2461;</span>
    except getopt.GetoptError: <span>&#x2462;</span>
        usage() <span>&#x2463;</span>
        sys.exit(2)

...

if __name__ == "__main__":
    main(sys.argv[1:])</pre>
<ol>
<li>First off, look at the bottom of the example and notice that you're calling the <code>main</code> function with <code>sys.argv[1:]</code>. Remember, <code>sys.argv[0]</code> is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off
and pass the rest of the list.
<li>This is where all the interesting processing happens. The <code>getopt</code> function of the <code>getopt</code> module takes three parameters: the argument list (which you got from <code>sys.argv[1:]</code>), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer
command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is
explained in more detail below.
<li>If anything goes wrong trying to parse these command-line flags, <code>getopt</code> will raise an exception, which you catch. You told <code>getopt</code> all the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.
<li>As is standard practice in the <abbr>UNIX</abbr> world, when the script is passed flags it doesn't understand, you print out a summary of proper usage and exit gracefully.
Note that I haven't shown the <code>usage</code> function here. You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.
<p>So what are all those parameters you pass to the <code>getopt</code> function? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element,
the script name, which you already chopped off before calling the <code>main</code> function). The second is the list of short command-line flags that the script accepts.
<div class=variablelist>
<h3><code>"hg:d"</code></h3>
<dl>
<dt><code>-h</code></dt>
<dd>print usage summary</dd>
<dt><code>-g ...</code></dt>
<dd>use specified grammar file or URL</dd>
<dt><code>-d</code></dt>
<dd>show debugging information while parsing</dd>
</dl>
<p>The first and third flags are simply standalone flags; you specify them or you don't, and they do things (print help) or change
state (turn on debugging). However, the second flag (<code>-g</code>) <em>must</em> be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address,
and you don't know which yet (you'll figure it out later), but you know it has to be <em>something</em>. So you tell <code>getopt</code> this by putting a colon after the <code>g</code> in that second parameter to the <code>getopt</code> function.
<p>To further complicate things, the script accepts either short flags (like <code>-h</code>) or long flags (like <code>--help</code>), and you want them to do the same thing. This is what the third parameter to <code>getopt</code> is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.
<div class=variablelist>
<h3><code>["help", "grammar="]</code></h3>
<dl>
<dt><code>--help</code></dt>
<dd>print usage summary</dd>
<dt><code>--grammar ...</code></dt>
<dd>use specified grammar file or URL</dd>
</dl>
<p>Three things of note here:
<div class=orderedlist>
<ol>
<li>All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling <code>getopt</code>. They are understood.
<li>The <code>--grammar</code> flag must always be followed by an additional argument, just like the <code>-g</code> flag. This is notated by an equals sign, <code>"grammar="</code>.
<li>The list of long flags is shorter than the list of short flags, because the <code>-d</code> flag does not have a corresponding long version. This is fine; only <code>-d</code> will turn on debugging. In fact, <code>getopt</code> does not pair short and long flags by position at all: the two lists are independent, and it is
your own code that decides which flags mean the same thing.
</ol>
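<p>It may help to see what <code>getopt</code> actually hands back for these two parameters. The sketch below uses Python 3 syntax and a hand-written argument list in place of <code>sys.argv[1:]</code>; the arguments themselves are invented.

```python
import getopt

# A stand-in for sys.argv[1:], mixing short flags, a long flag, and plain words.
argv = ["-h", "-g", "binary.xml", "--grammar=kant.xml", "-d", "some", "words"]

opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
print(opts)    # list of (flag, argument) pairs, in command-line order
print(args)    # whatever is left once the flags run out

# A flag that was not declared is rejected with GetoptError.
try:
    getopt.getopt(["-x"], "hg:d", ["help", "grammar="])
    rejected = None
except getopt.GetoptError as err:
    rejected = err.opt
print(rejected)
```

<p>Notice that the short and long spellings come back exactly as they were typed, and that the dashes are part of the flag names in the result even though you left them out when calling <code>getopt</code>.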
<p>Confused yet? Let's look at the actual code and see if it makes sense in context.
<div class=example><h3>Example 10.23. Handling command-line arguments in <code>kgp.py</code></h3><pre><code>
def main(argv): <span>&#x2460;</span>
    grammar = "kant.xml"
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts: <span>&#x2461;</span>
        if opt in ("-h", "--help"): <span>&#x2462;</span>
            usage()
            sys.exit()
        elif opt == '-d': <span>&#x2463;</span>
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"): <span>&#x2464;</span>
            grammar = arg

    source = "".join(args) <span>&#x2465;</span>

    k = KantGenerator(grammar, source)
    print k.output()</pre>
<ol>
<li>The <var>grammar</var> variable will keep track of the grammar file you're using. You initialize it here in case it's not specified on the command
line (using either the <code>-g</code> or the <code>--grammar</code> flag).
<li>The <var>opts</var> variable that you get back from <code>getopt</code> contains a list of tuples: <var>flag</var> and <var>argument</var>. If the flag doesn't take an argument, then <var>arg</var> will simply be an empty string. This makes it easier to loop through the flags.
<li><code>getopt</code> validates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags.
If you specify the <code>-h</code> flag, <var>opt</var> will contain <code>"-h"</code>; if you specify the <code>--help</code> flag, <var>opt</var> will contain <code>"--help"</code>. So you need to check for both.
<li>Remember, the <code>-d</code> flag didn't have a corresponding long flag, so you only need to check for the short form. If you find it, you set a global
variable that you'll refer to later to print out debugging information. (I used this during the development of the script.
What, you thought all these examples worked on the first try?)
<li>If you find a grammar file, either with a <code>-g</code> flag or a <code>--grammar</code> flag, you save the argument that followed it (stored in <var>arg</var>) into the <var>grammar</var> variable, overwriting the default that you initialized at the top of the <code>main</code> function.
<li>That's it. You've looped through and dealt with all the command-line flags. That means that anything left must be command-line
arguments. These come back from the <code>getopt</code> function in the <var>args</var> variable. In this case, you're treating them as source material for the parser. If there are no command-line arguments
specified, <var>args</var> will be an empty list, and <var>source</var> will end up as the empty string.
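<p>To see what <code>getopt</code> actually hands back, here is a minimal sketch written for Python 3. The <code>parse</code> helper and the sample argument lists are made up for illustration; only the flag definitions come from <code>kgp.py</code>. It shows that short and long flags are reported exactly as typed, that a flag without an argument is paired with an empty string, and that an unknown flag raises <code>GetoptError</code>.

```python
import getopt

def parse(argv):
    """Hypothetical helper: parse argv with the flag definitions from kgp.py."""
    return getopt.getopt(argv, "hg:d", ["help", "grammar="])

# Flags are reported exactly as typed; anything left over lands in args.
opts, args = parse(["-d", "-g", "binary.xml", "--help", "source.xml"])
print(opts)   # list of (flag, argument) tuples
print(args)   # leftover command-line arguments

# The long form of the grammar flag carries its argument the same way.
opts2, args2 = parse(["--grammar=kant.xml"])
print(opts2)

# A flag that was never declared is rejected with GetoptError.
rejected = False
try:
    parse(["--bogus"])
except getopt.GetoptError as err:
    rejected = True
    print("rejected:", err)
```

<p>Because <code>-d</code> and <code>--help</code> take no argument, their second tuple element is <code>''</code>, which is why the loop in <code>main</code> can unpack every entry as <code>opt, arg</code> without special cases.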
<h2 id="kgp.alltogether">10.7. Putting it all together</h2>
<p>You've covered a lot of ground. Let's step back and see how all the pieces fit together.
<p>To start with, this is a script that <a href="#kgp.commandline" title="10.6. Handling command-line arguments">takes its arguments on the command line</a>, using the <code>getopt</code> module.
<pre><code>
def main(argv):
    ...
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        ...
    for opt, arg in opts:
        ...</pre><p>You create a new instance of the <code>KantGenerator</code> class, and pass it the grammar file and source that may or may not have been specified on the command line.
<pre><code>
k = KantGenerator(grammar, source)</pre><p>The <code>KantGenerator</code> instance automatically loads the grammar, which is an <abbr>XML</abbr> file. You use your custom <code>openAnything</code> function to open the file (which <a href="#kgp.openanything" title="10.1. Abstracting input sources">could be stored in a local file or on a remote web server</a>), then use the built-in <code>minidom</code> parsing functions to <a href="#kgp.parse" title="9.3. Parsing XML">parse the <abbr>XML</abbr> into a tree of Python objects</a>.
<pre><code>
def _load(self, source):
    sock = toolbox.openAnything(source)
    xmldoc = minidom.parse(sock).documentElement
    sock.close()</pre><p>Oh, and along the way, you take advantage of your knowledge of the structure of the <abbr>XML</abbr> document to <a href="#kgp.cache" title="10.3. Caching node lookups">set up a little cache of references</a>, which are just elements in the <abbr>XML</abbr> document.
<pre><code>
def loadGrammar(self, grammar):
    for ref in self.grammar.getElementsByTagName("ref"):
        self.refs[ref.attributes["id"].value] = ref</pre><p>If you specified some source material on the command line, you use that; otherwise you rip through the grammar looking for
the "top-level" reference (that isn't referenced by anything else) and use that as a starting point.
<pre><code>
def getDefaultSource(self):
    xrefs = {}
    for xref in self.grammar.getElementsByTagName("xref"):
        xrefs[xref.attributes["id"].value] = 1
    xrefs = xrefs.keys()
    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
    return '&lt;xref id="%s"/>' % random.choice(standaloneXrefs)</pre><p>Now you rip through the source material. The source material is also <abbr>XML</abbr>, and you parse it one node at a time. To keep the code separated and more maintainable, you use <a href="#kgp.handler" title="10.5. Creating separate handlers by node type">separate handlers for each node type</a>.
<pre><code>
def parse_Element(self, node):
    handlerMethod = getattr(self, "do_%s" % node.tagName)
    handlerMethod(node)</pre><p>You bounce through the grammar, <a href="#kgp.child" title="10.4. Finding direct children of a node">parsing all the children</a> of each <code>p</code> element,
<pre><code>
def do_p(self, node):
    ...
    if doit:
        for child in node.childNodes: self.parse(child)</pre><p>replacing <code>choice</code> elements with a random child,
<pre><code>
def do_choice(self, node):
    self.parse(self.randomChildElement(node))</pre><p>and replacing <code>xref</code> elements with a random child of the corresponding <code>ref</code> element, which you previously cached.
<pre><code>
def do_xref(self, node):
    id = node.attributes["id"].value
    self.parse(self.randomChildElement(self.refs[id]))</pre><p>Eventually, you parse your way down to plain text,
<pre><code>
def parse_Text(self, node):
    text = node.data
    ...
    self.pieces.append(text)</pre><p>which you print out.
<pre><code>
def main(argv):
    ...
    k = KantGenerator(grammar, source)
    print k.output()</pre><h2 id="kgp.summary">10.8. Summary</h2>
<p>Python comes with powerful libraries for parsing and manipulating <abbr>XML</abbr> documents. The <code>minidom</code> module takes an <abbr>XML</abbr> file and parses it into Python objects, providing random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments,
error handling, and even the ability to take input from the piped result of a previous program.
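<p>As a compact reminder of the <code>minidom</code> techniques used in this chapter, the following sketch (written for Python 3) parses a tiny grammar, caches its <code>ref</code> elements by <code>id</code>, and finds the references that no <code>xref</code> points to. The grammar string is made up for illustration and is not one of the book's grammar files.

```python
from xml.dom import minidom

# A tiny made-up grammar, used only for this illustration.
GRAMMAR = """<grammar>
  <ref id="bit"><p>0</p><p>1</p></ref>
  <ref id="byte"><p><xref id="bit"/><xref id="bit"/></p></ref>
</grammar>"""

# Parse the XML and keep the root element of the resulting tree.
doc = minidom.parseString(GRAMMAR).documentElement

# Cache every <ref> element by the value of its id attribute.
refs = {}
for ref in doc.getElementsByTagName("ref"):
    refs[ref.attributes["id"].value] = ref

# Collect the ids that some <xref> element points to.
xref_ids = {x.attributes["id"].value for x in doc.getElementsByTagName("xref")}

# A "top-level" reference is one that nothing else refers to.
standalone = [name for name in refs if name not in xref_ids]

print(sorted(refs))
print(standalone)
```

<p>Here <code>bit</code> is referenced from inside <code>byte</code>, so only <code>byte</code> is left as a possible starting point.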
<p>Before moving on to the next chapter, you should be comfortable doing all of these things:
<div class=itemizedlist>
<ul>
<li><a href="#kgp.stdio" title="10.2. Standard input, output, and error">Chaining programs</a> with standard input and output
<li><a href="#kgp.handler" title="10.5. Creating separate handlers by node type">Defining dynamic dispatchers</a> with <code>getattr</code>
<li><a href="#kgp.commandline" title="10.6. Handling command-line arguments">Using command-line flags</a> and validating them with <code>getopt</code>
</ul>
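<p>If the dispatcher idea is still hazy, here is a minimal sketch of it in Python 3. The <code>Shouter</code> class and its <code>do_upper</code> and <code>do_lower</code> handlers are hypothetical names invented for this illustration; the technique is the same <code>getattr</code> lookup that <code>parse_Element</code> uses with <code>"do_%s" % node.tagName</code>.

```python
class Shouter:
    """Hypothetical class whose handler methods are chosen by name at run time."""

    def do_upper(self, text):
        return text.upper()

    def do_lower(self, text):
        return text.lower()

    def handle(self, kind, text):
        # Build the method name from data, then look it up with getattr.
        handler = getattr(self, "do_%s" % kind)
        return handler(text)

s = Shouter()
print(s.handle("upper", "kant"))
print(s.handle("lower", "KANT"))
```

<p>Adding a new kind of handler means adding one more <code>do_</code> method; <code>handle</code> itself never changes. Asking for a kind with no matching method raises <code>AttributeError</code>.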
<div class=chapter>
<h2 id="roman1.5">Chapter 14. Test-First Programming</h2>
<h2 id="roman.stage1">14.1. <code>roman.py</code>, stage 1</h2>
<p>Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're
going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps
in <code>roman.py</code>.
<div class=example><h3>Example 14.1. <code>roman1.py</code></h3>
<p>This file is available in <code>py/roman/stage1/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass <span>&#x2460;</span>
class OutOfRangeError(RomanError): pass <span>&#x2461;</span>
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass <span>&#x2462;</span>
def to_roman(n):
    """convert integer to Roman numeral"""
    pass <span>&#x2463;</span>

def from_roman(s):
    """convert Roman numeral to integer"""
    pass
</pre>
<ol>
<li>This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not
required) that you subclass <code>Exception</code>, which is the base class that all built-in exceptions inherit from. Here I am defining <code>RomanError</code> (inherited from <code>Exception</code>) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily
have inherited each individual exception from the <code>Exception</code> class directly.
<li>The <code>OutOfRangeError</code> and <code>NotIntegerError</code> exceptions will eventually be used by <code>to_roman()</code> to flag various forms of invalid input, as specified in <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to to_roman"><code>ToRomanBadInput</code></a>.
<li>The <code>InvalidRomanNumeralError</code> exception will eventually be used by <code>from_roman()</code> to flag invalid input, as specified in <a href="#roman.frombadinput.example" title="Example 13.4. Testing bad input to from_roman"><code>FromRomanBadInput</code></a>.
<li>At this stage, you want to define the <abbr>API</abbr> of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word <a href="#fileinfo.class.simplest" title="Example 5.3. The Simplest Python Class"><code>pass</code></a>.
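<p>The following sketch, written for Python 3, shows two consequences of the design described in these notes. Because every custom exception inherits from <code>RomanError</code>, a single <code>except RomanError</code> clause catches all of them; and a function whose body is only <code>pass</code> returns <code>None</code>. The <code>caught</code> list and the <code>"demo"</code> message are illustrative additions, not part of <code>roman1.py</code>.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

def to_roman(n):
    """convert integer to Roman numeral (stubbed out)"""
    pass

# One handler for the base class catches every subclass.
caught = []
for exc_class in (OutOfRangeError, NotIntegerError, InvalidRomanNumeralError):
    try:
        raise exc_class("demo")
    except RomanError as err:
        caught.append(type(err).__name__)

print(caught)
print(to_roman(1))   # a stubbed-out function returns None
```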
<p>Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At
this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to <code>romantest.py</code> and re-evaluate why you coded a test so useless that it passes with do-nothing functions.
<p>Run <code>romantest1.py</code> with the <code>-v</code> command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs.
With any luck, your output should look like this:
<div class=example><h3 id="roman.stage1.output">Example 14.2. Output of <code>romantest1.py</code> against <code>roman1.py</code></h3><pre class=screen><samp>from_roman should only accept uppercase input ... ERROR
to_roman should always return uppercase ... ERROR
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... FAIL
to_roman should give known result with known input ... FAIL
from_roman(to_roman(n))==n for all n ... FAIL
to_roman should fail with non-integer input ... FAIL
to_roman should fail with negative input ... FAIL
to_roman should fail with large input ... FAIL
to_roman should fail with 0 input ... FAIL
======================================================================
ERROR: from_roman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 154, in testFromRomanCase
roman1.from_roman(numeral.upper())
AttributeError: 'None' object has no attribute 'upper'</span><samp>
======================================================================
ERROR: to_roman should always return uppercase
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 148, in testToRomanCase
self.assertEqual(numeral, numeral.upper())
AttributeError: 'None' object has no attribute 'upper'</span><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 133, in testMalformedAntecedent
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 127, in testRepeatedPairs
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 99, in testFromRomanKnownValues
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: to_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 93, in testToRomanKnownValues
self.assertEqual(numeral, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: I != None</span><samp>
======================================================================
FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 141, in testSanity
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: to_roman should fail with non-integer input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 116, in testNonInteger
self.assertRaises(roman1.NotIntegerError, roman1.to_roman, 0.5)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: NotIntegerError</span><samp>
======================================================================
FAIL: to_roman should fail with negative input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 112, in testNegative
self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, -1)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with large input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 104, in testTooLarge
self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 4000)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with 0 input </span><span>&#x2460;</span><samp>
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 108, in testZero
self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 0)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError </span><span>&#x2461;</span><samp>
----------------------------------------------------------------------
Ran 12 tests in 0.040s </span><span>&#x2462;</span><samp>
FAILED (failures=10, errors=2) </span><span>&#x2463;</span></pre>
<h2 id="roman.stage2">14.2. <code>roman.py</code>, stage 2</h2>
<p>Now that you have the framework of the <code>roman</code> module laid out, it's time to start writing code and passing test cases.
<div class=example><h3 id="roman.stage2.example">Example 14.3. <code>roman2.py</code></h3>
<p>This file is available in <code>py/roman/stage2/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000), <span>&#x2460;</span>
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))

def to_roman(n):
    """convert integer to Roman numeral"""
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer: <span>&#x2461;</span>
            result += numeral
            n -= integer
    return result

def from_roman(s):
    """convert Roman numeral to integer"""
    pass
</pre>
<ol>
<li><var>romanNumeralMap</var> is a tuple of tuples which defines three things:
<div class=orderedlist>
<ol>
<li>The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals;
you're also defining two-character pairs like <code>CM</code> (&#8220;one hundred less than one thousand&#8221;); this will make the <code>to_roman()</code> code simpler later.
<li>The order of the Roman numerals. They are listed in descending value order, from <code>M</code> all the way down to <code>I</code>.
<li>The value of each Roman numeral. Each inner tuple is a pair of <code>(<var>numeral</var>, <var>value</var>)</code>.
</ol>
<li>Here's where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule.
To convert to Roman numerals, you simply iterate through <var>romanNumeralMap</var> looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation
to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
<div class=example><h3>Example 14.4. How <code>to_roman()</code> works</h3>
<p>If you're not clear how <code>to_roman()</code> works, add a <code>print</code> statement to the end of the <code>while</code> loop:<pre><code>
    while n >= integer:
        result += numeral
        n -= integer
        print 'subtracting', integer, 'from input, adding', numeral, 'to output'</pre><pre class=screen>
<samp class=p>>>> </samp><kbd>import roman2</kbd>
<samp class=p>>>> </samp><kbd>roman2.to_roman(1424)</kbd>
<samp>subtracting 1000 from input, adding M to output
subtracting 400 from input, adding CD to output
subtracting 10 from input, adding X to output
subtracting 10 from input, adding X to output
subtracting 4 from input, adding IV to output
'MCDXXIV'</samp>
</pre><p>So <code>to_roman()</code> appears to work, at least in this manual spot check. But will it pass the unit tests? Well no, not entirely.
<div class=example><h3>Example 14.5. Output of <code>romantest2.py</code> against <code>roman2.py</code></h3>
<p>Remember to run <code>romantest2.py</code> with the <code>-v</code> command-line flag to enable verbose mode.
<pre class=screen><samp>from_roman should only accept uppercase input ... FAIL
to_roman should always return uppercase ... ok</span><span>&#x2460;</span><samp>
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... FAIL
to_roman should give known result with known input ... ok </span><span>&#x2461;</span><samp>
from_roman(to_roman(n))==n for all n ... FAIL
to_roman should fail with non-integer input ... FAIL </span><span>&#x2462;</span><samp>
to_roman should fail with negative input ... FAIL
to_roman should fail with large input ... FAIL
to_roman should fail with 0 input ... FAIL</span></pre>
<ol>
<li><code>to_roman()</code> does, in fact, always return uppercase, because <var>romanNumeralMap</var> defines the Roman numeral representations as uppercase. So this test passes already.
<li>Here's the big news: this version of the <code>to_roman()</code> function passes the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values test</a>. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including
inputs that produce every single-character Roman numeral, the largest possible input (<code>3999</code>), and the input that produces the longest possible Roman numeral (<code>3888</code>). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
<li>However, the function does not &#8220;work&#8221; for bad values; it fails every single <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to to_roman">bad input test</a>. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to
be raised (via <code>assertRaises</code>), and you're never raising them. You'll do that in the next stage.
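<p>To make these notes concrete, here is the stage 2 conversion as a Python 3 sketch, fed both good and bad values. The function body follows <code>roman2.py</code>; the list of sample inputs is an illustrative choice. Good inputs convert correctly, while bad inputs quietly produce a string instead of raising an exception, which is exactly what the bad input tests complain about.

```python
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """convert integer to Roman numeral (stage 2: no input checking)"""
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

# Good input: the longest numeral and the largest valid value.
print(to_roman(3888))
print(to_roman(3999))

# Bad input: nothing is raised, so a string comes back anyway.
for bad in (0, -1, 0.5, 4000):
    print(repr(bad), '->', repr(to_roman(bad)))
```

<p>Zero, negative numbers, and <code>0.5</code> all fall straight through the loop and yield an empty string, and <code>4000</code> yields <code>'MMMM'</code>.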
<p>Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.
<pre class=screen><samp>
======================================================================
FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 156, in testFromRomanCase
roman2.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 133, in testMalformedAntecedent
self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 127, in testRepeatedPairs
self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 99, in testFromRomanKnownValues
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 141, in testSanity
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: to_roman should fail with non-integer input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 116, in testNonInteger
self.assertRaises(roman2.NotIntegerError, roman2.to_roman, 0.5)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: NotIntegerError</span><samp>
======================================================================
FAIL: to_roman should fail with negative input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 112, in testNegative
self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, -1)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with large input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 104, in testTooLarge
self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with 0 input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 108, in testZero
self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 0)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
----------------------------------------------------------------------
Ran 12 tests in 0.320s
FAILED (failures=10)</span></pre><h2 id="roman.stage3">14.3. <code>roman.py</code>, stage 3</h2>
<p>Now that <code>to_roman()</code> behaves correctly with good input (integers from <code>1</code> to <code>3999</code>), it's time to make it behave correctly with bad input (everything else).
<div class=example><h3>Example 14.6. <code>roman3.py</code></h3>
<p>This file is available in <code>py/roman/stage3/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000),
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))

def to_roman(n):
    """convert integer to Roman numeral"""
    if not (0 &lt; n &lt; 4000): <span>&#x2460;</span>
        raise OutOfRangeError, "number out of range (must be 1..3999)" <span>&#x2461;</span>
    if int(n) &lt;> n: <span>&#x2462;</span>
        raise NotIntegerError, "non-integers can not be converted"
    result = "" <span>&#x2463;</span>
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    """convert Roman numeral to integer"""
    pass
</pre>
<ol>
<li>This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to <code>if not ((0 &lt; n) and (n &lt; 4000))</code>, but it's much easier to read. This is the range check, and it should catch inputs that are too large, negative, or zero.
<li>You raise exceptions yourself with the <code>raise</code> statement. You can raise any of the built-in exceptions or any custom exception that you've defined.
The second parameter, the error message, is optional; if given, it is displayed in the traceback that is printed if the exception
is never handled.
<li>This is the non-integer check. Non-integers cannot be converted to Roman numerals, and <code>int(n)</code> differs from <var>n</var> whenever <var>n</var> has a fractional part.
<li>The rest of the function is unchanged.
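<p>The listing above uses Python 2 spellings (<code>raise Class, "message"</code> and <code>&lt;></code>). As a sketch only, here are the same two checks in Python 3 syntax, where the message is passed to the exception class and inequality is written <code>!=</code>. The <code>check_input</code> and <code>outcome</code> helpers are hypothetical names introduced for this illustration; they are not part of <code>roman3.py</code>.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

def check_input(n):
    """Hypothetical helper holding only the two checks from to_roman()."""
    if not (0 < n < 4000):   # same as: not ((0 < n) and (n < 4000))
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:          # Python 3 spelling of the <> comparison
        raise NotIntegerError("non-integers can not be converted")

def outcome(n):
    """Return the name of the exception raised for n, or 'ok'."""
    try:
        check_input(n)
    except RomanError as err:
        return type(err).__name__
    return "ok"

for value in (1, 3999, 0, -1, 4000, 0.5):
    print(value, outcome(value))
```

<p>Note the order of the checks: <code>0.5</code> passes the range check because it lies between 0 and 4000, and is only rejected by the non-integer check that follows.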
<div class=example><h3>Example 14.7. Watching <code>to_roman()</code> handle bad input</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import roman3</kbd>
<samp class=p>>>> </samp><kbd>roman3.to_roman(4000)</kbd>
<samp class=traceback>Traceback (most recent call last):
File "&lt;interactive input>", line 1, in ?
File "roman3.py", line 27, in to_roman
raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)</samp>
<samp class=p>>>> </samp><kbd>roman3.to_roman(1.5)</kbd>
<samp class=traceback>Traceback (most recent call last):
File "&lt;interactive input>", line 1, in ?
File "roman3.py", line 29, in to_roman
raise NotIntegerError, "non-integers can not be converted"
NotIntegerError: non-integers can not be converted</samp>
</pre><div class=example><h3>Example 14.8. Output of <code>romantest3.py</code> against <code>roman3.py</code></h3><pre class=screen><samp>from_roman should only accept uppercase input ... FAIL
to_roman should always return uppercase ... ok
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... FAIL
to_roman should give known result with known input ... ok </samp><span>&#x2460;</span><samp>
from_roman(to_roman(n))==n for all n ... FAIL
to_roman should fail with non-integer input ... ok </samp><span>&#x2461;</span><samp>
to_roman should fail with negative input ... ok </samp><span>&#x2462;</span><samp>
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok</samp></pre>
<ol>
<li><code>to_roman()</code> still passes the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values test</a>, which is comforting. All the tests that passed in <a href="#roman.stage2" title="14.2. roman.py, stage 2">stage 2</a> still pass, so the latest code hasn't broken anything.
<li>More exciting is the fact that all of the <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to to_roman">bad input tests</a> now pass. This test, <code>testNonInteger</code>, passes because of the <code>int(n) &lt;> n</code> check. When a non-integer is passed to <code>to_roman()</code>, the <code>int(n) &lt;> n</code> check notices it and raises the <code>NotIntegerError</code> exception, which is what <code>testNonInteger</code> is looking for.
<li>This test, <code>testNegative</code>, passes because of the <code>not (0 &lt; n &lt; 4000)</code> check, which raises an <code>OutOfRangeError</code> exception, which is what <code>testNegative</code> is looking for.
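<p>As a rough sketch of how bad-input tests of this kind are built, here is a small Python 3 example using <code>assertRaises</code>. The class name <code>BadInputSketch</code>, the helper <code>run_sketch</code>, and the cut-down <code>to_roman()</code> are stand-ins invented for this illustration; they are not the contents of <code>romantest3.py</code>.

```python
import unittest

class OutOfRangeError(Exception): pass
class NotIntegerError(Exception): pass

def to_roman(n):
    """Cut-down stand-in: only the input checks, no conversion."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    return ""

class BadInputSketch(unittest.TestCase):
    def test_non_integer(self):
        """to_roman should fail with non-integer input"""
        # Passes only if calling to_roman(0.5) raises NotIntegerError
        self.assertRaises(NotIntegerError, to_roman, 0.5)

    def test_negative(self):
        """to_roman should fail with negative input"""
        self.assertRaises(OutOfRangeError, to_roman, -1)

def run_sketch():
    """Load the two tests above and run them with verbose output."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BadInputSketch)
    return unittest.TextTestRunner(verbosity=2).run(suite)

if __name__ == "__main__":
    run_sketch()
```

<p>Each test passes only when the named exception is actually raised, which is why these tests failed until the checks were added to <code>to_roman()</code>.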
<pre class=screen><samp>
======================================================================
FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 156, in testFromRomanCase
roman3.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 133, in testMalformedAntecedent
self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 127, in testRepeatedPairs
self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 99, in testFromRomanKnownValues
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 141, in testSanity
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
----------------------------------------------------------------------
Ran 12 tests in 0.401s
FAILED (failures=6)</span> <span>&#x2460;</span></pre>
<ol>
<li>You're down to 6 failures, and all of them involve <code>from_roman()</code>: the known values test, the three separate bad input tests, the case check, and the sanity check. That means that <code>to_roman()</code> has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that <code>from_roman()</code> be written, which it isn't yet.) Which means that you must stop coding <code>to_roman()</code> now. No tweaking, no twiddling, no extra checks &#8220;just in case&#8221;. Stop. Now. Back away from the keyboard.
<table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for
a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module.
<h2 id="roman.stage4">14.4. <code>roman.py</code>, stage 4</h2>
<p>Now that <code>to_roman()</code> is done, it's time to start coding <code>from_roman()</code>. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than
the <code>to_roman()</code> function.
<div class=example><h3>Example 14.9. <code>roman4.py</code></h3>
<p>This file is available in <code>py/roman/stage4/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000),
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))
# to_roman function omitted for clarity (it hasn't changed)
def from_roman(s):
    """convert Roman numeral to integer"""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral: <span>&#x2460;</span>
            result += integer
            index += len(numeral)
    return result
</pre>
<ol>
<li>The pattern here is the same as <a href="#roman.stage2.example" title="Example 14.3. roman2.py"><code>to_roman()</code></a>. You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer
values as often as possible, you match the &#8220;highest&#8221; Roman numeral character strings as often as possible.
<div class=example><h3>Example 14.10. How <code>from_roman()</code> works</h3>
<p>If you're not clear how <code>from_roman()</code> works, add a <code>print</code> statement to the end of the <code>while</code> loop:<pre><code>
while s[index:index+len(numeral)] == numeral:
    result += integer
    index += len(numeral)
    print 'found', numeral, 'of length', len(numeral), ', adding', integer</pre><pre class=screen>
<samp class=p>>>> </samp><kbd>import roman4</kbd>
<samp class=p>>>> </samp><kbd>roman4.from_roman('MCMLXXII')</kbd>
<samp>found M of length 1 , adding 1000
found CM of length 2 , adding 900
found L of length 1 , adding 50
found X of length 1 , adding 10
found X of length 1 , adding 10
found I of length 1 , adding 1
found I of length 1 , adding 1
1972</span></pre><div class=example><h3>Example 14.11. Output of <code>romantest4.py</code> against <code>roman4.py</code></h3><pre class=screen><samp>from_roman should only accept uppercase input ... FAIL
to_roman should always return uppercase ... ok
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... ok </samp><span>&#x2460;</span><samp>
to_roman should give known result with known input ... ok
from_roman(to_roman(n))==n for all n ... ok</samp><span>&#x2461;</span><samp>
to_roman should fail with non-integer input ... ok
to_roman should fail with negative input ... ok
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok</samp></pre>
<ol>
<li>Two pieces of exciting news here. The first is that <code>from_roman()</code> works for good input, at least for all the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values</a> you test.
<li>The second is that the <a href="#roman.sanity.example" title="Example 13.5. Testing to_roman against from_roman">sanity check</a> also passed. Combined with the known values tests, you can be reasonably sure that both <code>to_roman()</code> and <code>from_roman()</code> work properly for all possible good values. (This is not guaranteed; it is theoretically possible that <code>to_roman()</code> has a bug that produces the wrong Roman numeral for some particular set of inputs, <em>and</em> that <code>from_roman()</code> has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that <code>to_roman()</code> generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write
more comprehensive test cases until it doesn't bother you.)
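<p>The idea behind the sanity check can be sketched in a few lines of Python 3. The two functions below are condensed copies of the stage 4 logic without input validation; the constant name <code>ROMAN_NUMERAL_MAP</code> and the helper <code>round_trip_failures</code> are names chosen for this sketch only.

```python
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """Greedy conversion: subtract the largest value that still fits."""
    result = ""
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    """Greedy parsing: consume the 'highest' numeral string that matches."""
    result = 0
    index = 0
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

def round_trip_failures():
    """Return every n in 1..3999 for which from_roman(to_roman(n)) != n."""
    return [n for n in range(1, 4000) if from_roman(to_roman(n)) != n]

print(round_trip_failures())
```

<p>An empty list means every value survived the round trip. As the caveat above says, this alone cannot detect a pair of mirror-image bugs, which is why the known values tests are still needed.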
<pre class=screen><samp>
======================================================================
FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 156, in testFromRomanCase
roman4.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 133, in testMalformedAntecedent
self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 127, in testRepeatedPairs
self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
----------------------------------------------------------------------
Ran 12 tests in 1.222s
FAILED (failures=4)</span></pre><h2 id="roman.stage5">14.5. <code>roman.py</code>, stage 5</h2>
<p>Now that <code>from_roman()</code> works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input.
That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult
than <a href="#roman.stage3" title="14.3. roman.py, stage 3">validating numeric input</a> in <code>to_roman()</code>, but you have a powerful tool at your disposal: regular expressions.
<p>If you're not familiar with regular expressions and didn't read <a href="#re" title="Chapter 7. Regular Expressions">Chapter 7, <i>Regular Expressions</i></a>, now would be a good time.
<p>As you saw in <a href="#re.roman" title="7.3. Case Study: Roman Numerals">Section 7.3, &#8220;Case Study: Roman Numerals&#8221;</a>, there are several simple rules for constructing a Roman numeral, using the letters <code>M</code>, <code>D</code>, <code>C</code>, <code>L</code>, <code>X</code>, <code>V</code>, and <code>I</code>. Let's review the rules:
<div class=orderedlist>
<ol>
<li>Characters are additive. <code>I</code> is <code>1</code>, <code>II</code> is <code>2</code>, and <code>III</code> is <code>3</code>. <code>VI</code> is <code>6</code> (literally, &#8220;<code>5</code> and <code>1</code>&#8221;), <code>VII</code> is <code>7</code>, and <code>VIII</code> is <code>8</code>.
<li>The tens characters (<code>I</code>, <code>X</code>, <code>C</code>, and <code>M</code>) can be repeated up to three times. At <code>4</code>, you need to subtract from the next highest fives character. You can't represent <code>4</code> as <code>IIII</code>; instead, it is represented as <code>IV</code> (&#8220;<code>1</code> less than <code>5</code>&#8221;). <code>40</code> is written as <code>XL</code> (&#8220;<code>10</code> less than <code>50</code>&#8221;), <code>41</code> as <code>XLI</code>, <code>42</code> as <code>XLII</code>, <code>43</code> as <code>XLIII</code>, and then <code>44</code> as <code>XLIV</code> (&#8220;<code>10</code> less than <code>50</code>, then <code>1</code> less than <code>5</code>&#8221;).
<li>Similarly, at <code>9</code>, you need to subtract from the next highest tens character: <code>8</code> is <code>VIII</code>, but <code>9</code> is <code>IX</code> (&#8220;<code>1</code> less than <code>10</code>&#8221;), not <code>VIIII</code> (since the <code>I</code> character can not be repeated four times). <code>90</code> is <code>XC</code>, <code>900</code> is <code>CM</code>.
<li>The fives characters can not be repeated. <code>10</code> is always represented as <code>X</code>, never as <code>VV</code>. <code>100</code> is always <code>C</code>, never <code>LL</code>.
<li>Roman numerals are always written highest to lowest, and read left to right, so order of characters matters very much. <code>DC</code> is <code>600</code>; <code>CD</code> is a completely different number (<code>400</code>, &#8220;<code>100</code> less than <code>500</code>&#8221;). <code>CI</code> is <code>101</code>; <code>IC</code> is not even a valid Roman numeral (because you can't subtract <code>1</code> directly from <code>100</code>; you would need to write it as <code>XCIX</code>, &#8220;<code>10</code> less than <code>100</code>, then <code>1</code> less than <code>10</code>&#8221;).
</ol>
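<p>The rules above can be checked mechanically. The following Python 3 sketch reuses the greedy conversion from the earlier stages (without input validation) and confirms both the examples quoted in the rules and the two repetition rules. The names <code>EXAMPLES</code> and <code>rule_violations</code> are made up for this illustration.

```python
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """Greedy conversion, as in the earlier stages (no input checks)."""
    result = ""
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result

# Examples taken from the rules above
EXAMPLES = {3: 'III', 4: 'IV', 8: 'VIII', 9: 'IX', 40: 'XL', 44: 'XLIV',
            90: 'XC', 99: 'XCIX', 400: 'CD', 600: 'DC', 900: 'CM'}

def rule_violations():
    """Return numerals for 1..3999 that repeat a tens character four
    times in a row, or repeat a fives character at all."""
    bad = []
    for n in range(1, 4000):
        numeral = to_roman(n)
        four_in_a_row = any(c * 4 in numeral for c in 'IXCM')
        doubled_five = any(c * 2 in numeral for c in 'VLD')
        if four_in_a_row or doubled_five:
            bad.append(numeral)
    return bad

for n, expected in EXAMPLES.items():
    print(n, to_roman(n), to_roman(n) == expected)
print(rule_violations())
```

<p>This only shows that the numerals <code>to_roman()</code> produces obey the rules; rejecting strings that break them is the job of the validation added in this stage.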
<div class=example><h3>Example 14.12. <code>roman5.py</code></h3>
<p>This file is available in <code>py/roman/stage5/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
import re
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000),
                   ('CM', 900),
                   ('D', 500),
                   ('CD', 400),
                   ('C', 100),
                   ('XC', 90),
                   ('L', 50),
                   ('XL', 40),
                   ('X', 10),
                   ('IX', 9),
                   ('V', 5),
                   ('IV', 4),
                   ('I', 1))
def to_roman(n):
    """convert integer to Roman numeral"""
    if not (0 &lt; n &lt; 4000):
        raise OutOfRangeError, "number out of range (must be 1..3999)"
    if int(n) &lt;> n:
        raise NotIntegerError, "non-integers can not be converted"
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result
#Define pattern to detect valid Roman numerals
romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' <span>&#x2460;</span>
def from_roman(s):
    """convert Roman numeral to integer"""
    if not re.search(romanNumeralPattern, s):<span>&#x2461;</span>
        raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
</pre>
<ol>
<li>This is just a continuation of the pattern discussed in <a href="#re.roman" title="7.3. Case Study: Roman Numerals">Section 7.3, &#8220;Case Study: Roman Numerals&#8221;</a>. The tens place is either <code>XC</code> (<code>90</code>), <code>XL</code> (<code>40</code>), or an optional <code>L</code> followed by 0 to 3 optional <code>X</code> characters. The ones place is either <code>IX</code> (<code>9</code>), <code>IV</code> (<code>4</code>), or an optional <code>V</code> followed by 0 to 3 optional <code>I</code> characters.
<li>Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If
<code>re.search</code> returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid.
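<p>To see the pattern at work on its own, here is a small Python 3 sketch that applies it to a handful of strings. The function name <code>is_valid</code> and the choice of sample strings are assumptions made for this illustration; the pattern itself is copied from the listing above.

```python
import re

ROMAN_NUMERAL_PATTERN = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

def is_valid(s):
    """True if s matches the stage 5 pattern (re.search returned a match object)."""
    return re.search(ROMAN_NUMERAL_PATTERN, s) is not None

for candidate in ('MCMLXXII', 'MCMC', 'IIII', 'VV', 'IC', 'mcmlxxii', ''):
    print(repr(candidate), is_valid(candidate))
```

<p>Of these samples, only <code>'MCMLXXII'</code> and the empty string match. The empty string matches because every part of the pattern is optional.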
<p>At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of
invalid Roman numerals. But don't take my word for it, look at the results:
<div class=example><h3>Example 14.13. Output of <code>romantest5.py</code> against <code>roman5.py</code></h3><pre class=screen><samp>
from_roman should only accept uppercase input ... ok </samp><span>&#x2460;</span><samp>
to_roman should always return uppercase ... ok
from_roman should fail with malformed antecedents ... ok </samp><span>&#x2461;</span><samp>
from_roman should fail with repeated pairs of numerals ... ok </samp><span>&#x2462;</span><samp>
from_roman should fail with too many repeated numerals ... ok
from_roman should give known result with known input ... ok
to_roman should give known result with known input ... ok
from_roman(to_roman(n))==n for all n ... ok
to_roman should fail with non-integer input ... ok
to_roman should fail with negative input ... ok
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 12 tests in 2.864s
OK </samp><span>&#x2463;</span></pre>
<ol>
<li>One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression
<var>romanNumeralPattern</var> was expressed in uppercase characters, the <code>re.search</code> check will reject any input that isn't completely uppercase. So the uppercase input test passes.
<li>More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like <code>MCMC</code>. As you've seen, this does not match the regular expression, so <code>from_roman()</code> raises an <code>InvalidRomanNumeralError</code> exception, which is what the malformed antecedents test case is looking for, so the test passes.
<li>In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test
cases.
<li>And the anticlimax award of the year goes to the word &#8220;<code>OK</code>&#8221;, which is printed by the <code>unittest</code> module when all the tests pass.
<table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">When all of your tests pass, stop coding.
<p>The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual
modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the
build process for this book; I have unit tests for several of the example programs (not just the <code>roman.py</code> module featured in <a href="#roman" title="Chapter 13. Unit Testing">Chapter 13, <i>Unit Testing</i></a>), and the first thing my automated build script does is run this program to make sure all my examples still work. If this
regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to
download them and sit around scratching your head and yelling at your monitor and wondering why they don't work.
<div class=example><h3>Example 16.1. <code>regression.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Regression testing framework
This module will search for scripts in the same directory named
XYZtest.py. Each such script should be a test suite that tests a
module through PyUnit. (As of Python 2.1, PyUnit is included in
the standard library as "unittest".) This script will aggregate all
found test suites into one big test suite and run them all at once.
"""
import sys, os, re, unittest
def regressionTest():
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    files = os.listdir(path)
    test = re.compile(r"test\.py$", re.IGNORECASE)
    files = filter(test.search, files)
    filenameToModuleName = lambda f: os.path.splitext(f)[0]
    moduleNames = map(filenameToModuleName, files)
    modules = map(__import__, moduleNames)
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite(map(load, modules))
if __name__ == "__main__":
    unittest.main(defaultTest="regressionTest")
</pre><p>Running this script in the same directory as the rest of the example scripts that come with this book will find all the unit
tests, named <code><var><code>module</code></var>test.py</code>, run them as a single test, and pass or fail them all at once.
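<p>The file-selection step of <code>regressionTest()</code> can be looked at by itself. The sketch below, in Python 3, does the same filtering and name-stripping with list comprehensions instead of <code>filter</code>, <code>map</code>, and <code>lambda</code> (in Python 3 those two built-ins return iterators rather than lists). The function name <code>find_test_modules</code> and the sample filenames are invented for this illustration.

```python
import os
import re

# Same pattern as regression.py: filenames ending in "test.py", any case
TEST_FILE = re.compile(r"test\.py$", re.IGNORECASE)

def find_test_modules(filenames):
    """Return importable module names for files that look like XYZtest.py."""
    matching = [f for f in filenames if TEST_FILE.search(f)]
    # Strip the extension: "romantest.py" -> "romantest"
    return [os.path.splitext(f)[0] for f in matching]

print(find_test_modules(["roman.py", "romantest.py", "apihelpertest.py",
                         "README", "KGPTest.PY"]))
```

<p>The resulting names are what <code>regression.py</code> passes to <code>__import__</code> before asking <code>unittest</code> to load the tests from each module.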
<div class=example><h3>Example 16.2. Sample output of <code>regression.py</code></h3><pre class=screen>
<samp class=p>[you@localhost py]$ </samp>python regression.py -v
help should fail with no object ... ok <span>&#x2460;</span><samp>
help should return known result for apihelper ... ok
help should honor collapse argument ... ok
help should honor spacing argument ... ok
buildConnectionString should fail with list input ... ok </span><span>&#x2461;</span><samp>
buildConnectionString should fail with string input ... ok
buildConnectionString should fail with tuple input ... ok
buildConnectionString handles empty dictionary ... ok
buildConnectionString returns known result with known input ... ok
from_roman should only accept uppercase input ... ok </span><span>&#x2462;</span><samp>
to_roman should always return uppercase ... ok
from_roman should fail with blank string ... ok
from_roman should fail with malformed antecedents ... ok
from_roman should fail with repeated pairs of numerals ... ok
from_roman should fail with too many repeated numerals ... ok
from_roman should give known result with known input ... ok
to_roman should give known result with known input ... ok
from_roman(to_roman(n))==n for all n ... ok
to_roman should fail with non-integer input ... ok
to_roman should fail with negative input ... ok
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok
kgp a ref test ... ok
kgp b ref test ... ok
kgp c ref test ... ok
kgp d ref test ... ok
kgp e ref test ... ok
kgp f ref test ... ok
kgp g ref test ... ok
----------------------------------------------------------------------
Ran 29 tests in 2.799s
OK</span></pre>
<ol>
<li>The first 4 tests are from <code>apihelpertest.py</code>, which tests the example script from <a href="#apihelper" title="Chapter 4. The Power Of Introspection">Chapter 4, <i>The Power Of Introspection</i></a>.
<li>The next 5 tests are from <code>odbchelpertest.py</code>, which tests the example script from <a href="#odbchelper" title="Chapter 2. Your First Python Program">Chapter 2, <i>Your First Python Program</i></a>.
<li>The next 13 tests are from <code>romantest.py</code>, which you studied in depth in <a href="#roman" title="Chapter 13. Unit Testing">Chapter 13, <i>Unit Testing</i></a>. The final 7 tests, whose names start with <code>kgp</code>, belong to another test suite.
<h2 id="regression.path">16.2. Finding the path</h2>
<p>When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.
<p>This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember
once you see it. The key to it is <code>sys.argv</code>. As you saw in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a>, this is a list that holds the command-line arguments. However, it also holds the name of the running script, exactly
as it was called from the command line, and this is enough information to determine its location.
<div class=example><h3>Example 16.3. <code>fullpath.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
import sys, os
print 'sys.argv[0] =', sys.argv[0] <span>&#x2460;</span>
pathname = os.path.dirname(sys.argv[0]) <span>&#x2461;</span>
print 'path =', pathname
print 'full path =', os.path.abspath(pathname) <span>&#x2462;</span></pre>
<ol>
<li>Regardless of how you run a script, <code>sys.argv[0]</code> will always contain the name of the script, exactly as it appears on the command line. This may or may not include any path
information, as you'll see shortly.
<li><code>os.path.dirname</code> takes a filename as a string and returns the directory path portion. If the given filename does not include any path information,
<code>os.path.dirname</code> returns an empty string.
<li><code>os.path.abspath</code> is the key here. It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.
<p><code>os.path.abspath</code> deserves further explanation. It is very flexible; it can take any kind of pathname.
<div class=example><h3>Example 16.4. Further explanation of <code>os.path.abspath</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import os</kbd>
<samp class=p>>>> </samp><kbd>os.getcwd()</kbd> <span>&#x2460;</span>
'/home/you'
<samp class=p>>>> </samp><kbd>os.path.abspath('')</kbd> <span>&#x2461;</span>
'/home/you'
<samp class=p>>>> </samp><kbd>os.path.abspath('.ssh')</kbd> <span>&#x2462;</span>
'/home/you/.ssh'
<samp class=p>>>> </samp><kbd>os.path.abspath('/home/you/.ssh')</kbd> <span>&#x2463;</span>
'/home/you/.ssh'
<samp class=p>>>> </samp><kbd>os.path.abspath('.ssh/../foo/')</kbd> <span>&#x2464;</span>
'/home/you/foo'</pre>
<ol>
<li><code>os.getcwd()</code> returns the current working directory.
<li>Calling <code>os.path.abspath</code> with an empty string returns the current working directory, same as <code>os.getcwd()</code>.
<li>Calling <code>os.path.abspath</code> with a partial pathname constructs a fully qualified pathname out of it, based on the current working directory.
<li>Calling <code>os.path.abspath</code> with a full pathname simply returns it.
<li><code>os.path.abspath</code> also <em>normalizes</em> the pathname it returns. Note that this example worked even though I don't actually have a 'foo' directory. <code>os.path.abspath</code> never checks your actual disk; this is all just string manipulation.
<table id="os.path.abspath.exist.note" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">The pathnames and filenames you pass to <code>os.path.abspath</code> do not need to exist.
<table id="os.path.normpath.note" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>os.path.abspath</code> not only constructs full path names, it also normalizes them. That means that if you are in the <code>/usr/</code> directory, <code>os.path.abspath('bin/../local/bin')</code> will return <code>/usr/local/bin</code>. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without
turning it into a full pathname, use <code>os.path.normpath</code> instead.
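<p>The difference between normalizing and making absolute can be seen in a short Python 3 sketch. It uses <code>posixpath</code>, the POSIX flavour of <code>os.path</code> from the standard library, purely so that the results look the same on any platform; that choice is an assumption of this example, and ordinary code would call <code>os.path.normpath</code> directly.

```python
import posixpath  # POSIX implementation of os.path, importable on any platform

# normpath only simplifies the string; a relative path stays relative
print(posixpath.normpath('bin/../local/bin'))
print(posixpath.normpath('/usr/bin/../local/bin'))
print(posixpath.normpath('.ssh/../foo/'))

# The normalized relative path is still not a full pathname
print(posixpath.isabs(posixpath.normpath('bin/../local/bin')))
```

<p>Like <code>os.path.abspath</code>, <code>normpath</code> is pure string manipulation and never checks whether the directories exist.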
<div class=example><h3>Example 16.5. Sample output from <code>fullpath.py</code></h3><pre class=screen>
<samp class=p>[you@localhost py]$ </samp>python /home/you/diveintopython3/common/py/fullpath.py <span>&#x2460;</span>
<samp>sys.argv[0] = /home/you/diveintopython3/common/py/fullpath.py
path = /home/you/diveintopython3/common/py
full path = /home/you/diveintopython3/common/py</samp>
<samp class=p>[you@localhost diveintopython3]$ </samp>python common/py/fullpath.py <span>&#x2461;</span>
<samp>sys.argv[0] = common/py/fullpath.py
path = common/py
full path = /home/you/diveintopython3/common/py</samp>
<samp class=p>[you@localhost diveintopython3]$ </samp>cd common/py
<samp class=p>[you@localhost py]$ </samp>python fullpath.py <span>&#x2462;</span>
<samp>sys.argv[0] = fullpath.py
path =
full path = /home/you/diveintopython3/common/py</samp></pre>
<ol>
<li>In the first case, <code>sys.argv[0]</code> includes the full path of the script. You can then use the <code>os.path.dirname</code> function to strip off the script name and return the full directory name, and <code>os.path.abspath</code> simply returns what you give it.
<li>If the script is run by using a partial pathname, <code>sys.argv[0]</code> will still contain exactly what appears on the command line. <code>os.path.dirname</code> will then give you a partial pathname (relative to the current directory), and <code>os.path.abspath</code> will construct a full pathname from the partial pathname.
<li>If the script is run from the current directory without giving any path, <code>os.path.dirname</code> will simply return an empty string. Given an empty string, <code>os.path.abspath</code> returns the current directory, which is what you want, since the script was run from the current directory.
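<p>The three cases above all reduce to the same one-line expression. The following Python 3 sketch wraps it in a function so that it can be tried with different values; the name <code>script_directory</code> and its optional argument are hypothetical and do not appear in <code>fullpath.py</code>.

```python
import os
import sys

def script_directory(argv0=None):
    """Hypothetical helper: absolute directory of the script named by argv0."""
    if argv0 is None:
        argv0 = sys.argv[0]
    # dirname() returns '' for a bare filename, and abspath('') is the
    # current working directory, so all three cases are handled alike.
    return os.path.abspath(os.path.dirname(argv0))

print(script_directory('fullpath.py'))
print(script_directory(os.path.join('common', 'py', 'fullpath.py')))
```

<p>Called with no argument, the function falls back to <code>sys.argv[0]</code>, which is how <code>regression.py</code> locates its own directory.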
<table id="os.path.abspath.crossplatform.note" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Like the other functions in the <code>os</code> and <code>os.path</code> modules, <code>os.path.abspath</code> is cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash
as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of the <code>os</code> module.
<p><b>Addendum. </b>One reader was dissatisfied with this solution, and wanted to be able to run all the unit tests in the current directory,
not the directory where <code>regression.py</code> is located. He suggests this approach instead:
<div class=example><h3 id="regression.path.cwd.example">Example 16.6. Running scripts in the current directory</h3><pre><code>import sys, os, re, unittest
def regressionTest():
    path = os.getcwd()       <span>&#x2460;</span>
    sys.path.append(path)    <span>&#x2461;</span>
    files = os.listdir(path) <span>&#x2462;</span>
</pre>
<ol>
<li>Instead of setting <var>path</var> to the directory where the currently running script is located, you set it to the current working directory. This
will be whatever directory you were in before you ran the script, which is not necessarily the same as the directory the script
is in. (Read that sentence a few times until you get it.)
<li>Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them. You didn't need to do this when <var>path</var> was the directory of the currently running script, because Python always looks in that directory.
<li>The rest of the function is the same.
<p>This technique will allow you to re-use this <code>regression.py</code> script on multiple projects. Just put the script in a common directory, then change to the project's directory before running
it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where <code>regression.py</code> is located.
[more functional programming stuff was here]
<h2 id="regression.import">16.6. Dynamically importing modules</h2>
<p>OK, enough philosophizing. Let's talk about dynamically importing modules.
<p>First, let's look at how you normally import modules. The <code>import <var>module</var></code> syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once
this way, with a comma-separated list. You did this on the very first line of this chapter's script.
<div class=example><h3>Example 16.13. Importing multiple modules at once</h3><pre><code>
import sys, os, re, unittest <span>&#x2460;</span>
</pre>
<ol>
<li>This imports four modules at once: <code>sys</code> (for system functions and access to the command line parameters), <code>os</code> (for operating system functions like directory listings), <code>re</code> (for regular expressions), and <code>unittest</code> (for unit testing).
<p>Now let's do the same thing, but with dynamic imports.
<div class=example><h3>Example 16.14. Importing modules dynamically</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>sys = __import__('sys')</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>os = __import__('os')</kbd>
<samp class=p>>>> </samp><kbd>re = __import__('re')</kbd>
<samp class=p>>>> </samp><kbd>unittest = __import__('unittest')</kbd>
<samp class=p>>>> </samp><kbd>sys</kbd> <span>&#x2461;</span>
<samp>&lt;module 'sys' (built-in)></samp>
<samp class=p>>>> </samp><kbd>os</kbd>
<samp>&lt;module 'os' from '/usr/local/lib/python2.2/os.pyc'></samp>
</pre>
<ol>
<li>The built-in <code>__import__</code> function accomplishes the same goal as using the <code>import</code> statement, but it's an actual function, and it takes a string as an argument.
<li>The variable <var>sys</var> is now the <code>sys</code> module, just as if you had said <code>import sys</code>. The variable <var>os</var> is now the <code>os</code> module, and so forth.
<p>So <code>__import__</code> imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string,
but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module
to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.
<div class=example><h3>Example 16.15. Importing a list of modules dynamically</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>moduleNames = ['sys', 'os', 're', 'unittest']</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>moduleNames</kbd>
['sys', 'os', 're', 'unittest']
<samp class=p>>>> </samp><kbd>modules = map(__import__, moduleNames)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>modules</kbd> <span>&#x2462;</span>
<samp>[&lt;module 'sys' (built-in)>,
&lt;module 'os' from 'c:\Python22\lib\os.pyc'>,
&lt;module 're' from 'c:\Python22\lib\re.pyc'>,
&lt;module 'unittest' from 'c:\Python22\lib\unittest.pyc'>]</samp>
<samp class=p>>>> </samp><kbd>modules[0].version</kbd> <span>&#x2463;</span>
'2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
<samp class=p>>>> </samp><kbd>import sys</kbd>
<samp class=p>>>> </samp><kbd>sys.version</kbd>
'2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
</pre>
<ol>
<li><var>moduleNames</var> is just a list of strings. Nothing fancy, except that the strings happen to be names of modules that you could import, if
you wanted to.
<li>Surprise, you wanted to import them, and you did, by mapping the <code>__import__</code> function onto the list. Remember, this takes each element of the list (<var>moduleNames</var>) and calls the function (<code>__import__</code>) over and over, once with each element of the list, builds a list of the return values, and returns the result.
<li>So now from a list of strings, you've created a list of actual modules. (Your paths may be different, depending on your operating
system, where you installed Python, the phase of the moon, etc.)
<li>To drive home the point that these are real modules, let's look at some module attributes. Remember, <var>modules[0]</var> <em>is</em> the <code>sys</code> module, so <var>modules[0].version</var> <em>is</em> <var>sys.version</var>. All the other attributes and methods of these modules are also available. There's nothing magic about the <code>import</code> statement, and there's nothing magic about modules. Modules are objects. Everything is an object.
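<p>The same experiment can be repeated under Python 3, with one difference worth knowing about: there, <code>map</code> returns a lazy iterator rather than a list, so the sketch below wraps it in <code>list()</code>. It also uses <code>importlib.import_module</code>, the standard library function documented for this job, purely as a point of comparison. The variable names are illustrative.

```python
import importlib
import sys

module_names = ["sys", "os", "re", "unittest"]

# In Python 3, map() returns an iterator, so wrap it in list() to get a list.
modules = list(map(__import__, module_names))

# importlib.import_module does the same job for plain, top-level module names.
same_modules = [importlib.import_module(name) for name in module_names]

print(modules[0] is sys)                                    # True
print(all(a is b for a, b in zip(modules, same_modules)))   # True
print([m.__name__ for m in modules])
```

<p>Both approaches hand back the very same module objects, because Python keeps one copy of each imported module.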
<p>Now you should be able to put this all together and figure out what most of this chapter's code sample is doing.
<h2 id="regression.alltogether">16.7. Putting it all together</h2>
<p>You've learned enough now to deconstruct the first seven lines of this chapter's code sample: reading a directory and importing
selected modules within it.
<div class=example><h3>Example 16.16. The <code>regressionTest</code> function</h3><pre><code>
def regressionTest():
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    files = os.listdir(path)
    test = re.compile("test\.py$", re.IGNORECASE)
    files = filter(test.search, files)
    filenameToModuleName = lambda f: os.path.splitext(f)[0]
    moduleNames = map(filenameToModuleName, files)
    modules = map(__import__, moduleNames)
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite(map(load, modules))
</pre><p>Let's look at it line by line, interactively. Assume that the current directory is <code>c:\diveintopython3\py</code>, which contains the examples that come with this book, including this chapter's script. As you saw in <a href="#regression.path" title="16.2. Finding the path">Section 16.2, &#8220;Finding the path&#8221;</a>, the script directory will end up in the <var>path</var> variable, so let's hard-code that and go from there.
<div class=example><h3>Example 16.17. Step 1: Get all the files</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import sys, os, re, unittest</kbd>
<samp class=p>>>> </samp><kbd>path = r'c:\diveintopython3\py'</kbd>
<samp class=p>>>> </samp><kbd>files = os.listdir(path) </kbd>
<samp class=p>>>> </samp><kbd>files</kbd> <span>&#x2460;</span>
<samp>['BaseHTMLProcessor.py', 'LICENSE.txt', 'apihelper.py', 'apihelpertest.py',
'argecho.py', 'autosize.py', 'builddialectexamples.py', 'dialect.py',
'fileinfo.py', 'fullpath.py', 'kgptest.py', 'makerealworddoc.py',
'odbchelper.py', 'odbchelpertest.py', 'parsephone.py', 'piglatin.py',
'plural.py', 'pluraltest.py', 'pyfontify.py', 'regression.py', 'roman.py', 'romantest.py',
'uncurly.py', 'unicode2koi8r.py', 'urllister.py', 'kgp', 'plural', 'roman',
'colorize.py']</samp>
</pre>
<ol>
<li><var>files</var> is a list of all the files and directories in the script's directory. (If you've been running some of the examples already,
you may also see some <code>.pyc</code> files in there as well.)
<div class=example><h3>Example 16.18. Step 2: Filter to find the files you care about</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>test = re.compile("test\.py$", re.IGNORECASE)</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>files = filter(test.search, files)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>files</kbd> <span>&#x2462;</span>
['apihelpertest.py', 'kgptest.py', 'odbchelpertest.py', 'pluraltest.py', 'romantest.py']
</pre>
<ol>
<li>This regular expression will match any string that ends with <code>test.py</code>. Note that you need to escape the period, since a period in a regular expression usually means &#8220;match any single character&#8221;, but you actually want to match a literal period instead.
<li>The <code>search</code> method of the compiled regular expression acts like a function, so you can use it to filter the large list of files and directories,
to find the ones that match the regular expression.
<li>And you're left with the list of unit testing scripts, because they were the only ones named <code>SOMETHINGtest.py</code>.
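<p>Here is a small sketch of the same filtering step, written for Python 3, where <code>filter</code> returns an iterator instead of a list. The file list is a shortened, made-up stand-in for the directory listing above; <code>ROMANTEST.PY</code> and <code>testing.py</code> are hypothetical names added to show what the pattern does and does not match.

```python
import re

# A shortened, hard-coded stand-in for the directory listing above.
files = ["apihelper.py", "apihelpertest.py", "LICENSE.txt",
         "ROMANTEST.PY", "roman.py", "testing.py", "kgp"]

# A raw string keeps the backslash intact for the regular expression engine.
test = re.compile(r"test\.py$", re.IGNORECASE)

# filter() returns an iterator in Python 3; list() materializes it.
matches = list(filter(test.search, files))
print(matches)   # ['apihelpertest.py', 'ROMANTEST.PY']

# The equivalent list comprehension gives the same result.
print([f for f in files if test.search(f)] == matches)   # True
```

<p>Note that <code>testing.py</code> is rejected: the pattern requires the name to end in <code>test.py</code>, not merely to contain <code>test</code>.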
<div class=example><h3>Example 16.19. Step 3: Map filenames to module names</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>filenameToModuleName = lambda f: os.path.splitext(f)[0]</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>filenameToModuleName('romantest.py')</kbd> <span>&#x2461;</span>
'romantest'
<samp class=p>>>> </samp><kbd>filenameToModuleName('odbchelpertest.py')</kbd>
'odbchelpertest'
<samp class=p>>>> </samp><kbd>moduleNames = map(filenameToModuleName, files)</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>moduleNames</kbd> <span>&#x2463;</span>
['apihelpertest', 'kgptest', 'odbchelpertest', 'pluraltest', 'romantest']
</pre>
<ol>
<li>As you saw in <a href="#apihelper.lambda" title="4.7. Using lambda Functions">Section 4.7, &#8220;Using lambda Functions&#8221;</a>, <code>lambda</code> is a quick-and-dirty way of creating an inline, one-line function. This one takes a filename with an extension and returns
just the filename part, using the standard library function <code>os.path.splitext</code> that you saw in <a href="#splittingpathnames.example" title="Example 6.17. Splitting Pathnames">Example 6.17, &#8220;Splitting Pathnames&#8221;</a>.
<li><var>filenameToModuleName</var> is a function. There's nothing magic about <code>lambda</code> functions as opposed to regular functions that you define with a <code>def</code> statement. You can call the <var>filenameToModuleName</var> function like any other, and it does just what you wanted it to do: strips the file extension off of its argument.
<li>Now you can apply this function to each file in the list of unit test files, using <code>map</code>.
<li>And the result is just what you wanted: a list of module names, as strings.
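<p>It can help to see exactly what <code>os.path.splitext</code> does at the edges. The following Python 3 sketch uses a named function (a hypothetical <code>filename_to_module_name</code>) in place of the <code>lambda</code>, and a list comprehension in place of <code>map</code>; the extra filenames are made up for illustration.

```python
import os.path

def filename_to_module_name(filename):
    """Strip the extension; equivalent to the lambda in the example."""
    return os.path.splitext(filename)[0]

files = ["apihelpertest.py", "kgptest.py", "romantest.py"]
module_names = [filename_to_module_name(f) for f in files]
print(module_names)   # ['apihelpertest', 'kgptest', 'romantest']

# splitext removes only the last extension, and leaves a name
# without an extension untouched.
print(os.path.splitext("archive.tar.gz"))   # ('archive.tar', '.gz')
print(os.path.splitext("kgp"))              # ('kgp', '')
```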
<div class=example><h3>Example 16.20. Step 4: Mapping module names to modules</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>modules = map(__import__, moduleNames)</kbd><span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>modules</kbd> <span>&#x2461;</span>
<samp>[&lt;module 'apihelpertest' from 'apihelpertest.py'>,
&lt;module 'kgptest' from 'kgptest.py'>,
&lt;module 'odbchelpertest' from 'odbchelpertest.py'>,
&lt;module 'pluraltest' from 'pluraltest.py'>,
&lt;module 'romantest' from 'romantest.py'>]</samp>
<samp class=p>>>> </samp><kbd>modules[-1]</kbd> <span>&#x2462;</span>
&lt;module 'romantest' from 'romantest.py'>
</pre>
<ol>
<li>As you saw in <a href="#regression.import" title="16.6. Dynamically importing modules">Section 16.6, &#8220;Dynamically importing modules&#8221;</a>, you can use a combination of <code>map</code> and <code>__import__</code> to map a list of module names (as strings) into actual modules (which you can call or access like any other module).
<li><var>modules</var> is now a list of modules, fully accessible like any other module.
<li>The last module in the list <em>is</em> the <code>romantest</code> module, just as if you had said <code>import romantest</code>.
<div class=example><h3>Example 16.21. Step 5: Loading the modules into a test suite</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>load = unittest.defaultTestLoader.loadTestsFromModule </kbd>
<samp class=p>>>> </samp><kbd>map(load, modules)</kbd> <span>&#x2460;</span>
<samp>[&lt;unittest.TestSuite tests=[
&lt;unittest.TestSuite tests=[&lt;apihelpertest.BadInput testMethod=testNoObject>]>,
&lt;unittest.TestSuite tests=[&lt;apihelpertest.KnownValues testMethod=testApiHelper>]>,
&lt;unittest.TestSuite tests=[
&lt;apihelpertest.ParamChecks testMethod=testCollapse>,
&lt;apihelpertest.ParamChecks testMethod=testSpacing>]>,
...
]
]</samp>
<samp class=p>>>> </samp><kbd>unittest.TestSuite(map(load, modules))</kbd> <span>&#x2461;</span>
</pre>
<ol>
<li>These are real module objects. Not only can you access them like any other module, instantiate classes and call functions,
you can also introspect into the module to figure out which classes and functions it has in the first place. That's what
the <code>loadTestsFromModule</code> method does: it introspects into each module and returns a <code>unittest.TestSuite</code> object for each module. Each <code>TestSuite</code> object actually contains a list of <code>TestSuite</code> objects, one for each <code>TestCase</code> class in your module, and each of those <code>TestSuite</code> objects contains a list of tests, one for each test method in that class.
<li>Finally, you wrap the list of <code>TestSuite</code> objects into one big test suite. The <code>unittest</code> module has no problem traversing this tree of nested test suites within test suites; eventually it gets down to an individual
test method and executes it, verifies that it passes or fails, and moves on to the next one.
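<p>The nesting described above can be observed directly. The sketch below, written for Python 3, builds a throwaway module object (the name <code>demotest</code> and both test classes are hypothetical) so that it does not depend on the book's example files, then loads and wraps it the same way <code>regressionTest</code> does.

```python
import types
import unittest

class KnownValues(unittest.TestCase):
    def test_upper(self):
        self.assertEqual("abc".upper(), "ABC")

    def test_lower(self):
        self.assertEqual("ABC".lower(), "abc")

class BadInput(unittest.TestCase):
    def test_int_rejects_letters(self):
        self.assertRaises(ValueError, int, "abc")

# Stand-in for an imported test module (hypothetical name "demotest").
demotest = types.ModuleType("demotest")
demotest.KnownValues = KnownValues
demotest.BadInput = BadInput

load = unittest.defaultTestLoader.loadTestsFromModule
per_module = [load(m) for m in [demotest]]   # one TestSuite per module
suite = unittest.TestSuite(per_module)       # one suite wrapping them all

print(suite.countTestCases())   # 3

# Running the outer suite walks the whole tree of nested suites.
result = unittest.TestResult()
suite.run(result)
print(result.testsRun, result.wasSuccessful())   # 3 True
```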
<p>This introspection process is what the <code>unittest</code> module usually does for us. Remember that magic-looking <code>unittest.main()</code> function that our individual test modules called to kick the whole thing off? <code>unittest.main()</code> actually creates an instance of <code>unittest.TestProgram</code>, which in turn uses <code>unittest.defaultTestLoader</code> (a ready-made <code>TestLoader</code> instance) to load the tests from the module that called it. (How does it get a reference to the module that called it if you don't give
it one? By using the equally-magic <code>__import__('__main__')</code> command, which dynamically imports the currently-running module. I could write a book on all the tricks and techniques used
in the <code>unittest</code> module, but then I'd never finish this one.)
<div class=example><h3>Example 16.22. Step 6: Telling <code>unittest</code> to use your test suite</h3><pre><code>
if __name__ == "__main__":
    unittest.main(defaultTest="regressionTest") <span>&#x2460;</span>
</pre>
<ol>
<li>Instead of letting the <code>unittest</code> module do all its magic for us, you've done most of it yourself. You've created a function (<code>regressionTest</code>) that imports the modules, loads their tests with <code>unittest.defaultTestLoader</code>, and wraps it all up in a test suite. Now all you need to do is tell <code>unittest</code> that, instead of looking for tests and building a test suite in the usual way, it should just call the <code>regressionTest</code> function, which returns a ready-to-use <code>TestSuite</code>.
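<p>To watch <code>defaultTest</code> at work without the book's files, here is a self-contained Python 3 sketch. Everything in it is a stand-in: the <code>Smoke</code> test case, the simplified <code>regressionTest</code>, and the <code>regression</code> module object are all hypothetical. The <code>exit=False</code> argument only stops <code>unittest.main</code> from ending the interpreter, so the result can be inspected afterwards.

```python
import io
import types
import unittest

class Smoke(unittest.TestCase):
    def test_truth(self):
        self.assertTrue(True)

def regressionTest():
    """Return a ready-made suite, as the chapter's function does."""
    return unittest.defaultTestLoader.loadTestsFromTestCase(Smoke)

# Hypothetical stand-in for the regression.py module namespace.
regression = types.ModuleType("regression")
regression.regressionTest = regressionTest

program = unittest.main(
    module=regression,
    defaultTest="regressionTest",   # name looked up on the module, then called
    argv=["regression.py"],         # no test names given on the command line
    testRunner=unittest.TextTestRunner(stream=io.StringIO()),  # keep it quiet
    exit=False,                     # return instead of calling sys.exit()
)
print(program.result.testsRun, program.result.wasSuccessful())   # 1 True
```

<p>Because no test names appear in <code>argv</code>, <code>unittest</code> falls back to the name given in <code>defaultTest</code>, finds the function of that name on the module, calls it, and runs whatever suite comes back.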
<h2 id="regression.summary">16.8. Summary</h2>
<p>The <code>regression.py</code> program and its output should now make perfect sense.
<p>You should now feel comfortable doing all of these things:
<div class=itemizedlist>
<ul>
<li>Manipulating <a href="#regression.path" title="16.2. Finding the path">path information</a> from the command line.
<li>Filtering lists <a href="#regression.filter" title="16.3. Filtering lists revisited">using <code>filter</code></a> instead of list comprehensions.
<li>Mapping lists <a href="#regression.map" title="16.4. Mapping lists revisited">using <code>map</code></a> instead of list comprehensions.
<li>Dynamically <a href="#regression.import" title="16.6. Dynamically importing modules">importing modules</a>.
</ul>
<div class=footnotes><br><hr width="100" align="left">
<div class=footnote>
<p><sup>[<a name="ftn.d0e35697" href="#d0e35697">7</a>] </sup>Technically, the second argument to <code>filter</code> can be any sequence, including lists, tuples, and custom classes that act like lists by defining the <code>__getitem__</code> special method. If possible, <code>filter</code> will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.
<div class=footnote>
<p><sup>[<a name="ftn.d0e36079" href="#d0e36079">8</a>] </sup>Again, I should point out that <code>map</code> can take a list, a tuple, or any object that acts like a sequence. See previous footnote about <code>filter</code>.
<div class=chapter>
<h2 id="soundex">Chapter 18. Performance Tuning</h2>
<p>Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it <em>too</em> much.
<h2 id="soundex.divein">18.1. Diving in</h2>
<p>There are so many pitfalls involved in optimizing your code, it's hard to know where to start.
<p>Let's start here: <em>are you sure you need to do it at all?</em> Is your code really so bad? Is it worth the time to tune it? Over the lifetime of your application, how much time is going
to be spent running that code, compared to the time spent waiting for a remote database server, or waiting for user input?
<p>Second, <em>are you sure you're done coding?</em> Premature optimization is like spreading frosting on a half-baked cake. You spend hours or days (or more) optimizing your
code for performance, only to discover it doesn't do what you need it to do. That's time down the drain.
<p>This is not to say that code optimization is worthless, but you need to look at the whole system and decide whether it's the
best use of your time. Every minute you spend optimizing code is a minute you're not spending adding new features, or writing
documentation, or playing with your kids, or writing unit tests.
<p>Oh yes, unit tests. It should go without saying that you need a complete set of unit tests before you begin performance tuning.
The last thing you need is to introduce new bugs while fiddling with your algorithms.
<p>With these caveats in place, let's look at some techniques for optimizing Python code. The code in question is an implementation of the Soundex algorithm. Soundex was a method used in the early 20th century
for categorizing surnames in the United States census. It grouped similar-sounding names together, so even if a name was
misspelled, researchers had a chance of finding it. Soundex is still used today for much the same reason, although of course
we use computerized database servers now. Most database servers include a Soundex function.
<p>There are several subtle variations of the Soundex algorithm. This is the one used in this chapter:
<div class=orderedlist>
<ol>
<li>Keep the first letter of the name as-is.
<li>Convert the remaining letters to digits, according to a specific table:
<div class=itemizedlist>
<ul>
<li>B, F, P, and V become 1.
<li>C, G, J, K, Q, S, X, and Z become 2.
<li>D and T become 3.
<li>L becomes 4.
<li>M and N become 5.
<li>R becomes 6.
<li>All other letters become 9.
</ul>
<li>Remove consecutive duplicates.
<li>Remove all 9s altogether.
<li>If the result is shorter than four characters (the first letter plus three digits), pad the result with trailing zeros.
<li>If the result is longer than four characters, discard everything after the fourth character.
</ol>
<p>For example, my name, <code>Pilgrim</code>, becomes P942695. That has no consecutive duplicates, so nothing to do there. Then you remove the 9s, leaving P4265. That's
too long, so you discard the excess character, leaving P426.
<p>Another example: <code>Woo</code> becomes W99, which becomes W9, which becomes W, which gets padded with zeros to become W000.
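<p>The two worked examples can be traced in code. The following Python 3 sketch is not this chapter's implementation; it only records the intermediate string after each stage so you can compare it with the walkthrough above. The names <code>GROUPS</code>, <code>DIGIT</code>, and <code>soundex_steps</code> are made up for illustration, and the function assumes its input is a non-empty string of letters.

```python
from itertools import groupby

# Digit groups from the list above; any letter not listed becomes "9".
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
DIGIT = {letter: digit for letters, digit in GROUPS.items() for letter in letters}

def soundex_steps(name):
    """Return the intermediate strings after each stage of the algorithm."""
    name = name.upper()
    # Steps 1-2: keep the first letter, convert the rest to digits.
    coded = name[0] + "".join(DIGIT.get(c, "9") for c in name[1:])
    # Step 3: collapse each run of identical characters into one.
    deduped = "".join(key for key, _ in groupby(coded))
    # Step 4: drop every 9.
    stripped = deduped.replace("9", "")
    # Steps 5-6: pad with zeros, then cut to four characters.
    final = stripped.ljust(4, "0")[:4]
    return [coded, deduped, stripped, final]

print(soundex_steps("Pilgrim"))   # ['P942695', 'P942695', 'P4265', 'P426']
print(soundex_steps("Woo"))       # ['W99', 'W9', 'W', 'W000']
```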
<p>Here's a first attempt at a Soundex function:
<div class=example><h3>Example 18.1. <code>soundex/stage1/soundex1a.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
import string, re
charToSoundex = {"A": "9",
                 "B": "1",
                 "C": "2",
                 "D": "3",
                 "E": "9",
                 "F": "1",
                 "G": "2",
                 "H": "9",
                 "I": "9",
                 "J": "2",
                 "K": "2",
                 "L": "4",
                 "M": "5",
                 "N": "5",
                 "O": "9",
                 "P": "1",
                 "Q": "2",
                 "R": "6",
                 "S": "2",
                 "T": "3",
                 "U": "9",
                 "V": "1",
                 "W": "9",
                 "X": "2",
                 "Y": "9",
                 "Z": "2"}
def soundex(source):
    "convert string to Soundex equivalent"
    # Soundex requirements:
    # source string must be at least 1 character
    # and must consist entirely of letters
    allChars = string.uppercase + string.lowercase
    if not re.search('^[%s]+$' % allChars, source):
        return "0000"
    # Soundex algorithm:
    # 1. make first character uppercase
    source = source[0].upper() + source[1:]
    # 2. translate all other characters to Soundex digits
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
    # 3. remove consecutive duplicates
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # 4. remove all "9"s
    digits3 = re.sub('9', '', digits2)
    # 5. pad end with "0"s to 4 characters
    while len(digits3) &lt; 4:
        digits3 += "0"
    # 6. return first 4 characters
    return digits3[:4]
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><div class=itemizedlist>
<h3>Further Reading on Soundex</h3>
<ul>
<li><a href="http://www.avotaynu.com/soundex.html">Soundexing and Genealogy</a> gives a chronology of the evolution of the Soundex and its regional variations.
</ul>
<h2 id="soundex.timeit">18.2. Using the <code>timeit</code> Module</h2>
<p>The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function.
<p>Timing short pieces of code is incredibly complex. How much processor time is your computer devoting to running this code?
Are there things running in the background? Are you sure? Every modern computer has background processes running, some all
the time, some intermittently. Cron jobs fire off at consistent intervals; background services occasionally &#8220;wake up&#8221; to do useful things like check for new mail, connect to instant messaging servers, check for application updates, scan for
viruses, check whether a disk has been inserted into your CD drive in the last 100 nanoseconds, and so on. Before you start
your timing tests, turn everything off and disconnect from the network. Then turn off all the things you forgot to turn off
the first time, then turn off the service that's incessantly checking whether the network has come back yet, then ...
<p>And then there's the matter of the variations introduced by the timing framework itself. Does the Python interpreter cache method name lookups? Does it cache code block compilations? Regular expressions? Will your code have
side effects if run more than once? Don't forget that you're dealing with small fractions of a second, so small mistakes
in your timing framework will irreparably skew your results.
<p>The Python community has a saying: &#8220;Python comes with batteries included.&#8221; Don't write your own timing framework. Python 2.3 comes with a perfectly good one called <code>timeit</code>.
<div class=example><h3>Example 18.2. Introducing <code>timeit</code></h3>
<pre class=screen>
<samp class=p>>>> </samp><kbd>import timeit</kbd>
<samp class=p>>>> </samp><kbd>t = timeit.Timer("soundex.soundex('Pilgrim')",</kbd>
<samp class=p>... </samp><kbd>    "import soundex")</kbd>   <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>t.timeit()</kbd> <span>&#x2461;</span>
8.21683733547
<samp class=p>>>> </samp><kbd>t.repeat(3, 2000000)</kbd> <span>&#x2462;</span>
[16.48319309109, 16.46128984923, 16.44203948912]
</pre>
<ol>
<li>The <code>timeit</code> module defines one class, <code>Timer</code>, which takes two arguments. Both arguments are strings. The first argument is the statement you wish to time; in this case,
you are timing a call to the Soundex function within the <code>soundex</code> module with an argument of <code>'Pilgrim'</code>. The second argument to the <code>Timer</code> class is the import statement that sets up the environment for the statement. Internally, <code>timeit</code> sets up an isolated virtual environment, manually executes the setup statement (importing the <code>soundex</code> module), then manually compiles and executes the timed statement (calling the Soundex function).
<li>Once you have the <code>Timer</code> object, the easiest thing to do is call <code>timeit()</code>, which calls your function 1 million times and returns the number of seconds it took to do it.
<li>The other major method of the <code>Timer</code> object is <code>repeat()</code>, which takes two optional arguments. The first argument is the number of times to repeat the entire test, and the second
argument is the number of times to call the timed statement within each test. They default
to <code>3</code> and <code>1000000</code> respectively. The <code>repeat()</code> method returns a list of the times each test cycle took, in seconds.
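<p>If you don't have the <code>soundex</code> module at hand, the same two methods can be tried on any statement. The sketch below is written for Python 3 and times a built-in call instead; the statement and the small iteration counts are arbitrary choices to keep it quick. The counts are passed explicitly because the defaults are not the same in every Python version (from Python 3.7 on, <code>repeat()</code> runs five cycles by default rather than three).

```python
import timeit

# Time a cheap built-in call so the example needs no extra modules.
# The setup string runs once per cycle; the statement runs `number` times.
t = timeit.Timer("sorted(data)", setup="data = list(range(100, 0, -1))")

single = t.timeit(number=10000)           # one cycle of 10,000 calls, in seconds
runs = t.repeat(repeat=3, number=10000)   # three such cycles

print(single)
print(runs)
```

<p>The exact figures will differ on every machine and every run, which is why no sample output is shown here.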
<blockquote class="note FIXME">
<p><span>&#x261E;</span>You can use the <code>timeit</code> module on the command line to test an existing Python program, without modifying the code. See <a href="http://docs.python.org/lib/node396.html">http://docs.python.org/lib/node396.html</a> for documentation on the command-line flags.</blockquote>
<p>Note that <code>repeat()</code> returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the
Python interpreter is getting (and those pesky background processes that you can't get rid of). Your first thought might be to
say &#8220;Let's take the average and call that The True Number.&#8221;
<p>In fact, that's almost certainly wrong. The tests that took longer didn't take longer because of variations in your code
or in the Python interpreter; they took longer because of those pesky background processes, or other factors outside of the Python interpreter that you can't fully eliminate. If the different timing results differ by more than a few percent, you still
have too much variability to trust the results. Otherwise, take the minimum time and discard the rest.
<p>Python has a handy <code>min</code> function that takes a list and returns the smallest value:
<pre class=screen>
<samp class=p>>>> </samp><kbd>min(t.repeat(3, 1000000))</kbd>
8.22203948912
</pre><blockquote class="note FIXME">
<p><span>&#x261E;</span>The <code>timeit</code> module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out <a href="http://docs.python.org/lib/module-hotshot.html">the <code>hotshot</code> module</a>.</blockquote>
<h2 id="soundex.stage1">18.3. Optimizing Regular Expressions</h2>
<p>The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to
do this?
<p>If you answered &#8220;regular expressions&#8221;, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should
be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain.
Also for performance reasons.
<p>This code fragment from <code>soundex/stage1/soundex1a.py</code> checks whether the function argument <var>source</var> is a word made entirely of letters, with at least one letter (not the empty string):
<pre><code>
    allChars = string.uppercase + string.lowercase
    if not re.search('^[%s]+$' % allChars, source):
        return "0000"
</pre><p>How does <code>soundex1a.py</code> perform? For convenience, the <code>__main__</code> section of the script contains this code that calls the <code>timeit</code> module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for
each:
<pre><code>
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><p>So how does <code>soundex1a.py</code> perform with this regular expression?
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1a.py
<samp>Woo W000 19.3356647283
Pilgrim P426 24.0772053431
Flingjingwaller F452 35.0463220884</samp>
</pre><p>As you might expect, the algorithm takes significantly longer when called with longer names. There will be a few things we
can do to narrow that gap (make the function take less relative time for longer input), but the nature of the algorithm dictates
that it will never run in constant time.
<p>The other thing to keep in mind is that we are testing a representative sample of names. <code>Woo</code> is a kind of trivial case, in that it gets shortened down to a single letter and then padded with zeros. <code>Pilgrim</code> is a normal case, of average length and a mixture of significant and ignored letters. <code>Flingjingwaller</code> is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range
of different cases.
<p>So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters
(<code>A-Z</code> in uppercase, and <code>a-z</code> in lowercase), we can use a shorthand regular expression syntax. Here is <code>soundex/stage1/soundex1b.py</code>:
<pre><code>
    if not re.search('^[A-Za-z]+$', source):
        return "0000"
</pre><p><code>timeit</code> says <code>soundex1b.py</code> is slightly faster than <code>soundex1a.py</code>, but nothing to get terribly excited about:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1b.py
<samp>Woo W000 17.1361133887
Pilgrim P426 21.8201693232
Flingjingwaller F452 32.7262294509</samp>
</pre><p>We saw in <a href="#roman.refactoring" title="15.3. Refactoring">Section 15.3, &#8220;Refactoring&#8221;</a> that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across
function calls, we can compile it once and use the compiled version. Here is <code>soundex/stage1/soundex1c.py</code>:
<pre><code>
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
    if not isOnlyChars(source):
        return "0000"
</pre><p>Using a compiled regular expression in <code>soundex1c.py</code> is significantly faster:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1c.py
<samp>Woo W000 14.5348347346
Pilgrim P426 19.2784703084
Flingjingwaller F452 30.0893873383</samp>
</pre><p>But is this the wrong path? The logic here is simple: the input <var>source</var> needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each
character, and do away with regular expressions altogether?
<p>Here is <code>soundex/stage1/soundex1d.py</code>:
<pre><code>
if not source:
    return "0000"
for c in source:
    if not ('A' &lt;= c &lt;= 'Z') and not ('a' &lt;= c &lt;= 'z'):
        return "0000"
</pre><p>It turns out that this technique in <code>soundex1d.py</code> is <em>not</em> faster than using a compiled regular expression (although, for the shortest name, it is faster than using a non-compiled regular expression):
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1d.py
<samp>Woo W000 15.4065058548
Pilgrim P426 22.2753567842
Flingjingwaller F452 37.5845122774</samp>
</pre><p>Why isn't <code>soundex1d.py</code> faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this
loop is written in Python, and runs through the Python interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted.
Regular expressions are never the right answer... except when they are.
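<p>If you want to repeat this comparison yourself, here is a minimal sketch that puts the three input checks side by side. It is written in Python 3 compatible syntax, and the function names (<code>valid_uncompiled</code>, <code>valid_compiled</code>, <code>valid_loop</code>) are made up for illustration; they do not appear in the sample files. The absolute timings will differ from the ones in this chapter, since they depend on your machine and Python version.

```python
import re
import timeit

# Compiled once at import time and reused on every call.
_only_letters = re.compile('^[A-Za-z]+$').search

def valid_uncompiled(source):
    # Passes the pattern string to the re module on every call.
    return bool(re.search('^[A-Za-z]+$', source))

def valid_compiled(source):
    # Reuses the precompiled pattern.
    return bool(_only_letters(source))

def valid_loop(source):
    # Pure-Python loop: every comparison runs through the interpreter.
    if not source:
        return False
    for c in source:
        if not ('A' <= c <= 'Z') and not ('a' <= c <= 'z'):
            return False
    return True

if __name__ == '__main__':
    for func in (valid_uncompiled, valid_compiled, valid_loop):
        seconds = min(timeit.repeat(lambda: func('Pilgrim'),
                                    repeat=3, number=10000))
        print(func.__name__.ljust(18), seconds)
```

<p>All three functions accept and reject the same names used in this chapter, so any difference you measure comes from how the check is written, not from what it checks.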
<p>It turns out that Python offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book.
The method is called <code>isalpha()</code>, and it checks whether a string contains only letters.
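<p>Here is a short sketch of how <code>isalpha()</code> behaves on the kinds of input this function has to handle. The helper name <code>is_valid_name</code> is hypothetical. Two things are worth noticing: <code>isalpha()</code> returns <code>False</code> for an empty string, and in Python 3 (which this sketch assumes) it also accepts letters outside <code>A-Z</code>, such as accented characters, so it is not an exact stand-in for <code>[A-Za-z]</code>.

```python
def is_valid_name(source):
    # Hypothetical helper: True only for a non-empty, all-letter string.
    # "".isalpha() is False, so no separate emptiness test is needed.
    return source.isalpha()

for sample in ('Woo', 'Pilgrim', '', 'R2D2', 'two words'):
    print(repr(sample).ljust(12), is_valid_name(sample))
```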
<p>This is <code>soundex/stage1/soundex1e.py</code>:
<pre><code>
if (not source) or (not source.isalpha()):
    return "0000"
</pre><p>How much did we gain by using this specific method in <code>soundex1e.py</code>? Quite a bit.
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1e.py
<samp>Woo W000 13.5069504644
Pilgrim P426 18.2199394057
Flingjingwaller F452 28.9975225902</samp>
</pre><div class=example><h3>Example 18.3. Best Result So Far: <code>soundex/stage1/soundex1e.py</code></h3><pre><code>
import string, re
charToSoundex = {"A": "9",
"B": "1",
"C": "2",
"D": "3",
"E": "9",
"F": "1",
"G": "2",
"H": "9",
"I": "9",
"J": "2",
"K": "2",
"L": "4",
"M": "5",
"N": "5",
"O": "9",
"P": "1",
"Q": "2",
"R": "6",
"S": "2",
"T": "3",
"U": "9",
"V": "1",
"W": "9",
"X": "2",
"Y": "9",
"Z": "2"}
def soundex(source):
    if (not source) or (not source.isalpha()):
        return "0000"
    source = source[0].upper() + source[1:]
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage2">18.4. Optimizing Dictionary Lookups</h2>
<p>The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to
do this?
<p>The most obvious solution is to define a dictionary with individual characters as keys and their corresponding digits as values,
and do dictionary lookups on each character. This is what we have in <code>soundex/stage1/soundex1c.py</code> (the compiled regular expression version from the previous section, which the stage 2 code builds on):
<pre><code>
charToSoundex = {"A": "9",
"B": "1",
"C": "2",
"D": "3",
"E": "9",
"F": "1",
"G": "2",
"H": "9",
"I": "9",
"J": "2",
"K": "2",
"L": "4",
"M": "5",
"N": "5",
"O": "9",
"P": "1",
"Q": "2",
"R": "6",
"S": "2",
"T": "3",
"U": "9",
"V": "1",
"W": "9",
"X": "2",
"Y": "9",
"Z": "2"}
def soundex(source):
    # ... input check omitted for brevity ...
    source = source[0].upper() + source[1:]
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
</pre><p>You timed <code>soundex1c.py</code> already; this is how it performs:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1c.py
<samp>Woo W000 14.5341678901
Pilgrim P426 19.2650071448
Flingjingwaller F452 30.1003563302</samp>
</pre><p>This code is straightforward, but is it the best solution? Calling <code>upper()</code> on each individual character seems inefficient; it would probably be better to call <code>upper()</code> once on the entire string.
<p>Then there's the matter of incrementally building the <var>digits</var> string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one.
<p>Python is good at lists, though. It can treat a string as a list of characters automatically. And lists are easy to combine into
strings again, using the string method <code>join()</code>.
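<p>The difference between the two styles can be sketched in a few lines. Both functions below are hypothetical illustrations, not code from the sample files: one grows a string piece by piece, the other collects the pieces in a list and calls <code>join()</code> once at the end.

```python
def build_incrementally(chars):
    # Conceptually, each += produces a new string and discards the old one.
    result = ''
    for c in chars:
        result += c.upper()
    return result

def build_with_join(chars):
    # A string can be iterated like a list of characters; the pieces are
    # collected in a list and combined in a single join() call.
    return ''.join([c.upper() for c in chars])

print(list('abc'))
print(build_incrementally('pilgrim'), build_with_join('pilgrim'))
```

<p>The two functions return the same string; which one is faster for inputs this short is exactly the kind of question <code>timeit</code> is for.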
<p>Here is <code>soundex/stage2/soundex2a.py</code>, which converts letters to digits by using <code>map</code> and <code>lambda</code>:
<pre><code>
def soundex(source):
    # ...
    source = source.upper()
    digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
</pre><p>Surprisingly, <code>soundex2a.py</code> is not faster:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2a.py
<samp>Woo W000 15.0097526362
Pilgrim P426 19.254806407
Flingjingwaller F452 29.3790847719</samp>
</pre><p>The overhead of the anonymous <code>lambda</code> function kills any performance you gain by dealing with the string as a list of characters.
<p><code>soundex/stage2/soundex2b.py</code> uses a list comprehension instead of <code>map</code> and <code>lambda</code>:
<pre><code>
source = source.upper()
digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
</pre><p>Using a list comprehension in <code>soundex2b.py</code> is faster than using <code>map</code> and <code>lambda</code> in <code>soundex2a.py</code>, but still not faster than the original code (incrementally building a string in <code>soundex1c.py</code>):
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2b.py
<samp>Woo W000 13.4221324219
Pilgrim P426 16.4901234654
Flingjingwaller F452 25.8186157738</samp>
</pre><p>It's time for a radically different approach. Dictionary lookups are a general purpose tool. Dictionary keys can be any
length string (or many other data types), but in this case we are only dealing with single-character keys <em>and</em> single-character values. It turns out that Python has a specialized function for handling exactly this situation: the <code>string.maketrans</code> function.
<p>This is <code>soundex/stage2/soundex2c.py</code>:
<pre><code>
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
def soundex(source):
    # ...
    digits = source[0].upper() + source[1:].translate(charToSoundex)
</pre><p>What the heck is going on here? <code>string.maketrans</code> creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument
is the string <code>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</code>, and the second argument is the string <code>9123912992245591262391929291239129922455912623919292</code>. See the pattern? It's the same conversion pattern we were setting up longhand with a dictionary. A maps to 9, B maps
to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized data structure that you can access using the
string method <code>translate</code>, which translates each character into the corresponding digit, according to the matrix defined by <code>string.maketrans</code>.
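<p>To see the translation table at work in isolation, here is a minimal sketch in Python 3 spelling. That is an assumption on my part: the chapter's code targets Python 2, where the calls are <code>string.maketrans</code> and <code>string.uppercase</code>; in Python 3 the equivalents are <code>str.maketrans</code> and <code>string.ascii_uppercase</code>, and the table returned is an ordinary dictionary keyed by character code. The helper name <code>letters_to_digits</code> is hypothetical.

```python
import string

all_chars = string.ascii_uppercase + string.ascii_lowercase
# Same 26-digit pattern as in the chapter, repeated for lowercase letters.
char_to_soundex = str.maketrans(all_chars, "91239129922455912623919292" * 2)

def letters_to_digits(source):
    # Keep the first letter (uppercased); translate the rest to digits.
    return source[0].upper() + source[1:].translate(char_to_soundex)

for name in ('Woo', 'Pilgrim'):
    print(name.ljust(10), letters_to_digits(name))
```

<p>The output still contains the placeholder 9s and any consecutive duplicates; removing those is the job of the later steps.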
<p><code>timeit</code> shows that <code>soundex2c.py</code> is significantly faster than defining a dictionary and looping through the input and building the output incrementally:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2c.py
<samp>Woo W000 11.437645008
Pilgrim P426 13.2825062962
Flingjingwaller F452 18.5570110168</samp>
</pre><p>You're not going to get much better than that. Python has a specialized function that does exactly what you want to do; use it and move on.
<div class=example><h3>Example 18.4. Best Result So Far: <code>soundex/stage2/soundex2c.py</code></h3><pre><code>
import string, re
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
    if not isOnlyChars(source):
        return "0000"
    digits = source[0].upper() + source[1:].translate(charToSoundex)
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage3">18.5. Optimizing List Operations</h2>
<p>The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this?
<p>Here's the code we have so far, in <code>soundex/stage2/soundex2c.py</code>:
<pre><code>
digits2 = digits[0]
for d in digits[1:]:
    if digits2[-1] != d:
        digits2 += d
</pre><p>Here are the performance results for <code>soundex2c.py</code>:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2c.py
<samp>Woo W000 12.6070768771
Pilgrim P426 14.4033353401
Flingjingwaller F452 19.7774882003</samp>
</pre><p>The first thing to consider is whether it's efficient to check <var>digits2[-1]</var> each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate
variable, and checking that instead?
<p>To answer this question, here is <code>soundex/stage3/soundex3a.py</code>:
<pre><code>
digits2 = ''
last_digit = ''
for d in digits:
    if d != last_digit:
        digits2 += d
        last_digit = d
</pre><p><code>soundex3a.py</code> does not run any faster than <code>soundex2c.py</code>, and may even be slightly slower (although it's not enough of a difference to say for sure):
<pre class=screen>
<samp class=p>C:\samples\soundex\stage3></samp>python soundex3a.py
<samp>Woo W000 11.5346048171
Pilgrim P426 13.3950636184
Flingjingwaller F452 18.6108927252</samp>
</pre><p>Why isn't <code>soundex3a.py</code> faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing <var>digits2[-1]</var> is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have <em>two</em> variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating
the list lookup.
<p>Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible
to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character
in the list, and that's not easy to do with a straightforward list comprehension.
<p>However, it is possible to create a list of index numbers using the built-in <code>range()</code> function, and use those index numbers to progressively search through the list and pull out each character that is different
from the previous character. That will give you a list of characters, and you can use the string method <code>join()</code> to reconstruct a string from that.
<p>Here is <code>soundex/stage3/soundex3b.py</code>:
<pre><code>
digits2 = "".join([digits[i] for i in range(len(digits))
if i == 0 or digits[i-1] != digits[i]])
</pre><p>Is this faster? In a word, no.
<pre class=screen>
<samp class=p>C:\samples\soundex\stage3></samp>python soundex3b.py
<samp>Woo W000 14.2245271396
Pilgrim P426 17.8337165757
Flingjingwaller F452 25.9954005327</samp>
</pre><p>It's possible that the techniques so far have been too &#8220;string-centric&#8221;. Python can convert a string into a list of characters with a single command: <code>list('abc')</code> returns <code>['a', 'b', 'c']</code>. Furthermore, lists can be <em>modified in place</em> very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around
within a single list?
<p>Here is <code>soundex/stage3/soundex3c.py</code>, which modifies a list in place to remove consecutive duplicate elements:
<pre><code>
digits = list(source[0].upper() + source[1:].translate(charToSoundex))
i=0
for item in digits:
    if item==digits[i]: continue
    i+=1
    digits[i]=item
del digits[i+1:]
digits2 = "".join(digits)
</pre><p>Is this faster than <code>soundex3a.py</code> or <code>soundex3b.py</code>? It edges out <code>soundex3b.py</code>, but it is still slower than <code>soundex3a.py</code> and the original loop:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage3></samp>python soundex3c.py
<samp>Woo W000 14.1662554878
Pilgrim P426 16.0397885765
Flingjingwaller F452 22.1789341942</samp>
</pre><p>We haven't made any progress here at all, except to try and rule out several &#8220;clever&#8221; techniques. The fastest code we've seen so far was the original, most straightforward method (<code>soundex2c.py</code>). Sometimes it doesn't pay to be clever.
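<p>When you try several variations like this, it helps to confirm that they really do the same job before you compare their speed. The sketch below wraps each fragment from this section in a function and runs them on the digit string for <code>Flingjingwaller</code>. The function names are made up for illustration, the bodies follow the fragments above, and the syntax is Python 3 compatible. All four assume a non-empty input string, as the real function guarantees by that point.

```python
def dedupe_append(digits):
    # Original approach: compare against the last character kept so far.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    return digits2

def dedupe_last_seen(digits):
    # Track the previous digit in a separate variable.
    digits2 = ''
    last_digit = ''
    for d in digits:
        if d != last_digit:
            digits2 += d
            last_digit = d
    return digits2

def dedupe_indexed(digits):
    # List comprehension over index numbers.
    return "".join([digits[i] for i in range(len(digits))
                    if i == 0 or digits[i-1] != digits[i]])

def dedupe_in_place(digits):
    # Compact a list in place, then cut off the leftover tail.
    digits = list(digits)
    i = 0
    for item in digits:
        if item == digits[i]:
            continue
        i += 1
        digits[i] = item
    del digits[i+1:]
    return "".join(digits)

sample = 'F49522952994496'
for func in (dedupe_append, dedupe_last_seen, dedupe_indexed, dedupe_in_place):
    print(func.__name__.ljust(18), func(sample))
```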
<div class=example><h3>Example 18.5. Best Result So Far: <code>soundex/stage2/soundex2c.py</code></h3><pre><code>
import string, re
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
    if not isOnlyChars(source):
        return "0000"
    digits = source[0].upper() + source[1:].translate(charToSoundex)
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage4">18.6. Optimizing String Manipulation</h2>
<p>The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best
way to do this?
<p>This is what we have so far, taken from <code>soundex/stage2/soundex2c.py</code>:
<pre><code>
digits3 = re.sub('9', '', digits2)
while len(digits3) &lt; 4:
    digits3 += "0"
return digits3[:4]
</pre><p>These are the results for <code>soundex2c.py</code>:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2c.py
<samp>Woo W000 12.6070768771
Pilgrim P426 14.4033353401
Flingjingwaller F452 19.7774882003</samp>
</pre><p>The first thing to consider is replacing that regular expression with a loop. This code is from <code>soundex/stage4/soundex4a.py</code>:
<pre><code>
digits3 = ''
for d in digits2:
    if d != '9':
        digits3 += d
</pre><p>Is <code>soundex4a.py</code> faster? Yes it is:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4a.py
<samp>Woo W000 6.62865531792
Pilgrim P426 9.02247576158
Flingjingwaller F452 13.6328416042</samp>
</pre><p>But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's <code>soundex/stage4/soundex4b.py</code>:
<pre><code>
digits3 = digits2.replace('9', '')
</pre><p>Is <code>soundex4b.py</code> faster? That's an interesting question. It depends on the input:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4b.py
<samp>Woo W000 6.75477414029
Pilgrim P426 7.56652144337
Flingjingwaller F452 10.8727729362</samp>
</pre><p>The string method in <code>soundex4b.py</code> is faster than the loop for most names, but it's actually slightly slower than <code>soundex4a.py</code> in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case
faster can sometimes make other cases slower. In this case, the majority of cases will benefit from the change, so let's
leave it at that, but the principle is an important one to remember.
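<p>Because the answer depends on the input, it is worth timing each variant against each kind of name. Here is a rough sketch of how you might do that in Python 3 compatible syntax. The function names are hypothetical, and the sample strings are the de-duplicated digit strings for the three test names; expect your numbers to differ from the ones shown in this chapter.

```python
import re
import timeit

def strip_with_regex(digits2):
    return re.sub('9', '', digits2)

def strip_with_loop(digits2):
    digits3 = ''
    for d in digits2:
        if d != '9':
            digits3 += d
    return digits3

def strip_with_replace(digits2):
    return digits2.replace('9', '')

if __name__ == '__main__':
    for sample in ('W9', 'P942695', 'F49529529496'):
        for func in (strip_with_regex, strip_with_loop, strip_with_replace):
            seconds = min(timeit.repeat(lambda: func(sample),
                                        repeat=3, number=10000))
            print(sample.ljust(14), func.__name__.ljust(20), seconds)
```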
<p>Last but not least, let's examine the final two steps of the algorithm: padding short results with zeros, and truncating long
results to four characters. The code you see in <code>soundex4b.py</code> does just that, but it's horribly inefficient. Take a look at <code>soundex/stage4/soundex4c.py</code> to see why:
<pre><code>
digits3 += '000'
return digits3[:4]
</pre><p>Why do we need a <code>while</code> loop to pad out the result? We know in advance that we're going to truncate the result to four characters, and we know that
we already have at least one character (the initial letter, which is passed unchanged from the original <var>source</var> variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the
exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution.
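<p>The reasoning above can be checked directly. In this sketch (hypothetical function names, Python 3 compatible syntax), the loop and the pad-then-slice version agree for every input that has at least one character. They only disagree on an empty string, which the real function never reaches because the input check has already returned <code>"0000"</code>.

```python
def pad_with_loop(digits3):
    # Add zeros one at a time until there are four characters.
    while len(digits3) < 4:
        digits3 += "0"
    return digits3[:4]

def pad_with_slice(digits3):
    # Three zeros are always enough when at least one character is present.
    return (digits3 + '000')[:4]

for sample in ('W', 'A1', 'P4265', 'F4525246'):
    print(sample.ljust(10), pad_with_loop(sample), pad_with_slice(sample))
```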
<p>How much speed do we gain in <code>soundex4c.py</code> by dropping the <code>while</code> loop? It's significant:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4c.py
<samp>Woo W000 4.89129791636
Pilgrim P426 7.30642134685
Flingjingwaller F452 10.689832367</samp>
</pre><p>Finally, there is still one more thing you can do to these three lines of code to make them faster: you can combine them into
one line. Take a look at <code>soundex/stage4/soundex4d.py</code>:
<pre><code>
return (digits2.replace('9', '') + '000')[:4]
</pre><p>Putting all this code on one line in <code>soundex4d.py</code> is barely faster than <code>soundex4c.py</code>:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4d.py
<samp>Woo W000 4.93624105857
Pilgrim P426 7.19747593619
Flingjingwaller F452 10.5490700634</samp>
</pre><p>It is also significantly less readable, and for not much performance gain. Is that worth it? I hope you have good comments.
Performance isn't everything. Your optimization efforts must always be balanced against threats to your program's readability
and maintainability.
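<p>For reference, here is one way the pieces from this chapter might fit together. Treat it as a sketch assembled from the fragments shown above, not as the contents of any sample file. It is written for Python 3 (an assumption), so it uses <code>str.maketrans</code> and <code>string.ascii_uppercase</code> in place of their Python 2 counterparts, and the names are respelled in lowercase with underscores.

```python
import re
import string

all_chars = string.ascii_uppercase + string.ascii_lowercase
char_to_soundex = str.maketrans(all_chars, "91239129922455912623919292" * 2)
is_only_chars = re.compile('^[A-Za-z]+$').search

def soundex(source):
    # Reject empty or non-alphabetic input.
    if not is_only_chars(source):
        return "0000"
    # Keep the first letter; translate the rest to Soundex digits.
    digits = source[0].upper() + source[1:].translate(char_to_soundex)
    # Collapse consecutive duplicates.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # Drop the placeholder 9s, pad with zeros, truncate to four characters.
    return (digits2.replace('9', '') + '000')[:4]

if __name__ == '__main__':
    for name in ('Woo', 'Pilgrim', 'Flingjingwaller'):
        print(name.ljust(15), soundex(name))
```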
<h2 id="soundex.summary">18.7. Summary</h2>
<p>This chapter has illustrated several important aspects of performance tuning in Python, and performance tuning in general.
<div class=itemizedlist>
<ul>
<li>If you need to choose between regular expressions and writing a loop, choose regular expressions. The regular expression
engine is compiled in C and runs natively on your computer; your loop is written in Python and runs through the Python interpreter.
<li>If you need to choose between regular expressions and string methods, choose string methods. Both are compiled in C, so choose
the simpler one.
<li>General-purpose dictionary lookups are fast, but specialty functions such as <code>string.maketrans</code> and string methods such as <code>isalpha()</code> are faster. If Python has a custom-tailored function for you, use it.
<li>Don't be too clever. Sometimes the most obvious algorithm is also the fastest.
<li>Don't sweat it too much. Performance isn't everything.
</ul>
<p>I can't emphasize that last point strongly enough. Over the course of this chapter, you made this function three times faster
and saved 20 seconds over 1 million function calls. Great. Now think: over the course of those million function calls, how
many seconds will your surrounding application wait for a database connection? Or wait for disk I/O? Or wait for user input?
Don't spend too much time over-optimizing one algorithm, or you'll ignore obvious improvements somewhere else. Develop an
instinct for the sort of code that Python runs well, correct obvious blunders if you find them, and leave the rest alone.
</body>
</html>