<h2 id="odbchelper.objects">2.4. Everything Is an Object</h2>
<h2 id="odbchelper.testing">2.6. Testing Modules</h2>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them.
Here's an example that uses the <code>if</code> <code>__name__</code> trick.
<pre id="odbchelper.ifnametrick" class=programlisting>
if __name__ == "__main__":</pre><p>Some quick observations before you get to the good stuff. First, parentheses are not required around the <code>if</code> expression. Second, the <code>if</code> statement ends with a colon, and is followed by <a href="#odbchelper.indenting" title="2.5. Indenting Code">indented code</a>.
<table id="compare.equals.c" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not allow <code>=</code> inside an expression, so there's no chance of accidentally assigning the value you thought you were comparing.
<p>So why is this particular <code>if</code> statement a trick? Modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone
program, in which case <code>__name__</code> will be a special default value, <code>__main__</code>.
<pre class=screen><samp class=p>>>> </samp><kbd>import odbchelper</kbd>
<samp class=p>>>> </samp><kbd>odbchelper.__name__</kbd>
'odbchelper'</pre><p>Knowing this, you can design a test suite for your module within the module itself by putting it in this <code>if</code> statement. When you run the module directly, <code>__name__</code> is <code>__main__</code>, so the test suite executes. When you import the module, <code>__name__</code> is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating
them into a larger program.
<table id="tip.mac.runasmain" class=tip border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">On MacPython, there is an additional step to make the <code>if</code> <code>__name__</code> trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and
make sure Run as __main__ is checked.
<div class=itemizedlist>
<h3>Further Reading on Importing Modules</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class=citetitle>Python Reference Manual</i></a> discusses the low-level details of <a href="http://www.python.org/doc/current/ref/import.html">importing modules</a>.
</ul>
<h2 id="odbchelper.vardef">3.4. Declaring variables</h2>
<p>Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from <a href="#odbchelper">Chapter 2</a>, <code>odbchelper.py</code>.
<p>Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring
into existence by being assigned a value, and they are automatically destroyed when they go out of scope.
<div class=example><h3 id="myparamsdef">Example 3.17. Defining the <var>myParams</var> Variable</h3><pre><code>
if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }</pre><p>Notice the indentation. An <code>if</code> statement is a code block and needs to be indented just like a function.
<p>Also notice that the variable assignment is one command split over several lines, with a backslash (&#8220;<code>\</code>&#8221;) serving as a line-continuation marker.
<table id="tip.multiline" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">When a command is split among several lines with the line-continuation marker (&#8220;<code>\</code>&#8221;), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python <abbr>IDE</abbr> auto-indents the continued line, you should probably accept its default unless you have a burning reason not to.
<p><a name="tip.implicitmultiline"></a>Strictly speaking, expressions in parentheses, straight brackets, or curly braces (like <a href="#myparamsdef" title="Example 3.17. Defining the myParams Variable">defining a dictionary</a>) can be split into multiple lines with or without the line continuation character (&#8220;<code>\</code>&#8221;). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's
a matter of style.
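<p>The two forms of continuation can be compared side by side. This is only an illustrative sketch; the variable names <code>explicit</code>, <code>implicit</code>, and <code>total</code> are made up for the example.

```python
# Explicit continuation: each physical line ends with a backslash.
explicit = {"server": "mpilgrim", \
            "database": "master", \
            "uid": "sa"}

# Implicit continuation: the open brace keeps the expression going,
# so no backslash is needed.
implicit = {
    "server": "mpilgrim",
    "database": "master",
    "uid": "sa",
}

# Outside brackets, the backslash is required to join the lines.
total = 1 + 2 + \
        3 + 4
```

<p>Both dictionaries are equal; only the layout of the source differs.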
[unbound variable exception example was here]
<h3 id="odbchelper.multiassign">3.4.2. Assigning Multiple Values at Once</h3>
<p>One of the cooler programming shortcuts in Python is using sequences to assign multiple values at once.
<div class=example><h3>Example 3.19. Assigning multiple values at once</h3><pre class=screen><samp class=p>>>> </samp><kbd>v = ('a', 'b', 'e')</kbd>
<samp class=p>>>> </samp><kbd>(x, y, z) = v</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>x</kbd>
'a'
<samp class=p>>>> </samp><kbd>y</kbd>
'b'
<samp class=p>>>> </samp><kbd>z</kbd>
'e'</pre>
<ol>
<li><var>v</var> is a tuple of three elements, and <code>(x, y, z)</code> is a tuple of three variables. Assigning one to the other assigns each of the values of <var>v</var> to each of the variables, in order.
<p>This has all sorts of uses. I often want to assign names to a range of values. In <abbr>C</abbr>, you would use <code>enum</code> and manually list each constant and its associated value, which seems especially tedious when the values are consecutive.
In Python, you can use the built-in <code>range</code> function with multi-variable assignment to quickly assign consecutive values.
<div class=example><h3 id="odbchelper.multiassign.range">Example 3.20. Assigning Consecutive Values</h3><pre class=screen><samp class=p>>>> </samp><kbd>range(7)</kbd> <span>&#x2460;</span>
[0, 1, 2, 3, 4, 5, 6]
<samp class=p>>>> </samp><kbd>(MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>MONDAY</kbd> <span>&#x2462;</span>
0
<samp class=p>>>> </samp><kbd>TUESDAY</kbd>
1
<samp class=p>>>> </samp><kbd>SUNDAY</kbd>
6</pre>
<ol>
<li>The built-in <code>range</code> function returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting
up to but not including the upper limit. (If you like, you can pass other parameters to specify a base other than <code>0</code> and a step other than <code>1</code>. You can <code>print range.__doc__</code> for details.)
<li><var>MONDAY</var>, <var>TUESDAY</var>, <var>WEDNESDAY</var>, <var>THURSDAY</var>, <var>FRIDAY</var>, <var>SATURDAY</var>, and <var>SUNDAY</var> are the variables you're defining. (This example came from the <code>calendar</code> module, a fun little module that prints calendars, like the <abbr>UNIX</abbr> program <code>cal</code>. The <code>calendar</code> module defines integer constants for days of the week.)
<li>Now each variable has its value: <var>MONDAY</var> is <code>0</code>, <var>TUESDAY</var> is <code>1</code>, and so forth.
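<p>The same assignment can be sketched in a form that runs on both Python 2 and Python 3, since unpacking accepts any iterable of the right length, not only a list. The names <code>LOW</code>, <code>MEDIUM</code>, <code>HIGH</code>, and <code>mismatch_detected</code> are hypothetical additions for illustration.

```python
# Works whether range returns a list (Python 2) or a lazy range object
# (Python 3), because unpacking only needs an iterable of seven items.
(MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)

# An optional start and step give values that are not zero-based
# or not consecutive.
(LOW, MEDIUM, HIGH) = range(10, 40, 10)

# The number of names must match the number of values.
try:
    (a, b) = range(3)
except ValueError:
    mismatch_detected = True
else:
    mismatch_detected = False
```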
<p>You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of
all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the <code>os</code> module, which you'll learn about in <a href="#filehandling">Chapter 6</a>.
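<p>A minimal sketch of a function that returns several values at once follows. <code>min_max</code> is a hypothetical helper invented for this example, not a standard library function.

```python
def min_max(values):
    """Return the smallest and largest items as a two-element tuple."""
    return (min(values), max(values))


# The caller can keep the tuple whole...
result = min_max([3, 1, 4, 1, 5])

# ...or unpack it into individual variables.
(lowest, highest) = min_max([3, 1, 4, 1, 5])

# The same mechanism swaps two variables without a temporary.
x, y = "left", "right"
x, y = y, x
```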
<div class=itemizedlist>
<h3>Further Reading on Variables</h3>
<ul>
<li><a href="http://www.python.org/doc/current/ref/"><i class=citetitle>Python Reference Manual</i></a> shows examples of <a href="http://www.python.org/doc/current/ref/implicit-joining.html">when you can skip the line continuation character</a> and <a href="http://www.python.org/doc/current/ref/explicit-joining.html">when you need to use it</a>.
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class=citetitle>How to Think Like a Computer Scientist</i></a> shows how to use multi-variable assignment to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap09.htm">swap the values of two variables</a>.
</ul>
<div class=example><h3>Example 6.12. Introducing <code><code>sys</code>.modules</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>print '\n'.join(sys.modules.keys())</kbd> <span>&#x2461;</span>
<samp>win32api
os.path
os
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat</samp></pre>
<ol>
<li>The <code>sys</code> module contains system-level information, such as the version of Python you're running (<code><code>sys</code>.version</code> or <code><code>sys</code>.version_info</code>), and system-level options such as the maximum allowed recursion depth (<code><code>sys</code>.getrecursionlimit()</code> and <code><code>sys</code>.setrecursionlimit()</code>).
<li><code><code>sys</code>.modules</code> is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules <em>your</em> program has imported. Python preloads some modules on startup, and if you're using a Python <abbr>IDE</abbr>, <code><code>sys</code>.modules</code> contains all the modules imported by all the programs you've run within the <abbr>IDE</abbr>.
<p>This example demonstrates how to use <code><code>sys</code>.modules</code>.
<div class=example><h3>Example 6.13. Using <code><code>sys</code>.modules</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>import fileinfo</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>print '\n'.join(sys.modules.keys())</kbd>
<samp>win32api
os.path
os
fileinfo
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat</samp>
<samp class=p>>>> </samp><kbd>fileinfo</kbd>
&lt;module 'fileinfo' from 'fileinfo.pyc'>
<samp class=p>>>> </samp><kbd>sys.modules["fileinfo"]</kbd> <span>&#x2461;</span>
&lt;module 'fileinfo' from 'fileinfo.pyc'></pre>
<ol>
<li>As new modules are imported, they are added to <code><code>sys</code>.modules</code>. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in <code><code>sys</code>.modules</code>, so importing the second time is simply a dictionary lookup.
<li>Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the <code><code>sys</code>.modules</code> dictionary.
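<p>The caching behavior can be checked directly. This sketch uses the standard <code>string</code> module as a stand-in for <code>fileinfo</code>, which is an assumption made so that the example runs anywhere.

```python
import sys
import string

# The first import stored the module object under its name.
first = sys.modules["string"]

# A second import is just a lookup in sys.modules, so it yields the
# very same object rather than a fresh copy.
import string as string_again
same_object = string_again is first

# The keys are module names, as strings.
names = sorted(sys.modules.keys())
```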
<p>The next example shows how to use the <code>__module__</code> class attribute with the <code><code>sys</code>.modules</code> dictionary to get a reference to the module in which a class is defined.
<div class=example><h3>Example 6.14. The <code>__module__</code> Class Attribute</h3><pre class=screen><samp class=p>>>> </samp><kbd>from fileinfo import MP3FileInfo</kbd>
<samp class=p>>>> </samp><kbd>MP3FileInfo.__module__</kbd> <span>&#x2460;</span>
'fileinfo'
<samp class=p>>>> </samp><kbd>sys.modules[MP3FileInfo.__module__]</kbd> <span>&#x2461;</span>
&lt;module 'fileinfo' from 'fileinfo.pyc'></pre>
<ol>
<li>Every Python class has a built-in <a href="#fileinfo.classattributes" title="5.8. Introducing Class Attributes">class attribute</a> <code>__module__</code>, which is the name of the module in which the class is defined.
<li>Combining this with the <code><code>sys</code>.modules</code> dictionary, you can get a reference to the module in which a class is defined.
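<p>As a small sketch of the same lookup with a standard library class (<code>string.Template</code> is used here only because it is available everywhere; it plays the role of <code>MP3FileInfo</code>):

```python
import string
import sys

# __module__ holds the defining module's name as a string.
module_name = string.Template.__module__

# sys.modules maps that name back to the module object.
defining_module = sys.modules[module_name]

# From the module object, the class can be reached again by name.
same_class = getattr(defining_module, "Template") is string.Template
```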
<p>Now you're ready to see how <code><code>sys</code>.modules</code> is used in <code>fileinfo.py</code>, the sample program introduced in <a href="#fileinfo">Chapter 5</a>. This example shows that portion of the code.
<div class=example><h3>Example 6.15. <code><code>sys</code>.modules</code> in <code>fileinfo.py</code></h3><pre><code>
def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]): <span>&#x2460;</span>
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:] <span>&#x2461;</span>
    return hasattr(module, subclass) and getattr(module, subclass) or FileInfo <span>&#x2462;</span></pre>
<ol>
<li>This is a function with two arguments; <var>filename</var> is required, but <var>module</var> is <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional</a> and defaults to the module that contains the <code>FileInfo</code> class. This looks inefficient, because you might expect Python to evaluate the <code><code>sys</code>.modules</code> expression every time the function is called. In fact, Python evaluates default expressions only once, when the <code>def</code> statement that defines the function is executed (for a module-level function, that is the first time the module is imported). As you'll see later, you never call this function with a <var>module</var> argument, so <var>module</var> serves as a function-level constant.
<li>You'll plow through this line later, after you dive into the <code>os</code> module. For now, take it on faith that <var>subclass</var> ends up as the name of a class, like <code>MP3FileInfo</code>.
<li>You already know about <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code>getattr</code></a>, which gets a reference to an object by name. <code>hasattr</code> is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has
a particular class (although it works for any object and any attribute, just like <code>getattr</code>). In English, this line of code says, &#8220;If this module has the class named by <var>subclass</var> then return it, otherwise return the base class <code>FileInfo</code>.&#8221;
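<p>Two points from this example can be sketched in isolation: defaults are evaluated once, when <code>def</code> runs, and a class can be looked up on a module by name with a fallback. Everything here is illustrative. <code>handlers</code> is a hypothetical stand-in for the <code>fileinfo</code> module, the classes are empty placeholders, and the three-argument form of <code>getattr</code> is swapped in for the <code>hasattr ... and ... or</code> expression used in <code>fileinfo.py</code>.

```python
import os
import types

evaluations = []


def expensive_default():
    """Record each evaluation so the count can be inspected afterwards."""
    evaluations.append("evaluated")
    return "default"


def show(value=expensive_default()):  # evaluated here, when def runs
    return value


show()
show()
# evaluations holds a single entry, however often show() is called.


class FileInfo(object):
    """Fallback class for unrecognized extensions."""


class MP3FileInfo(FileInfo):
    """Handler for .mp3 files."""


# Hypothetical module object standing in for fileinfo.
handlers = types.ModuleType("handlers")
handlers.FileInfo = FileInfo
handlers.MP3FileInfo = MP3FileInfo


def get_file_info_class(filename, module=handlers):
    """Pick a handler class from the filename extension."""
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    # The third argument to getattr is returned when the attribute
    # does not exist, which covers the fallback in one step.
    return getattr(module, subclass, FileInfo)
```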
<div class=itemizedlist>
<h3>Further Reading on Modules</h3>
<ul>
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> discusses exactly <a href="http://www.python.org/doc/current/tut/node6.html#SECTION006710000000000000000">when and how default arguments are evaluated</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-sys.html"><code>sys</code></a> module.
</ul>
<h2 id="fileinfo.os">6.5. Working with Directories</h2>
<p>The <code>os.path</code> module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing
the contents of a directory.
<div class=example><h3 id="fileinfo.os.path.join.example">Example 6.16. Constructing Pathnames</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import os</kbd>
<samp class=p>>>> </samp><kbd>os.path.join("c:\\music\\ap\\", "mahadeva.mp3")</kbd> <span>&#x2460;</span> <span>&#x2461;</span>
'c:\\music\\ap\\mahadeva.mp3'
<samp class=p>>>> </samp><kbd>os.path.join("c:\\music\\ap", "mahadeva.mp3")</kbd> <span>&#x2462;</span>
'c:\\music\\ap\\mahadeva.mp3'
<samp class=p>>>> </samp><kbd>os.path.expanduser("~")</kbd> <span>&#x2463;</span>
'c:\\Documents and Settings\\mpilgrim\\My Documents'
<samp class=p>>>> </samp><kbd>os.path.join(os.path.expanduser("~"), "Python")</kbd> <span>&#x2464;</span>
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'</pre>
<ol>
<li><code>os.path</code> is a reference to a module -- which module depends on your platform. Just as <a href="#crossplatform.example" title="Example 6.2. Supporting Platform-Specific Functionality"><code>getpass</code></a> encapsulates differences between platforms by setting <var>getpass</var> to a platform-specific function, <code>os</code> encapsulates differences between platforms by setting <var>path</var> to a platform-specific module.
<li>The <code>join</code> function of <code>os.path</code> constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing
with pathnames on Windows is annoying because the backslash character must be escaped.)
<li>In this slightly less trivial case, <code>join</code> will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since
<code>addSlashIfNecessary</code> is one of the stupid little functions I always need to write when building up my toolbox in a new language. <em>Do not</em> write this stupid little function in Python; smart people have already taken care of it for you.
<li><code>expanduser</code> will expand a pathname that uses <code>~</code> to represent the current user's home directory. This works on any platform where users have a home directory, like Windows,
<abbr>UNIX</abbr>, and Mac OS X; it has no effect on Mac OS.
<li>Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory.
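<p>To try the Windows-style examples on any platform, one option is to import the platform-specific modules directly: <code>ntpath</code> is the Windows flavor of <code>os.path</code> and <code>posixpath</code> is the <abbr>UNIX</abbr> flavor. The sketch below is illustrative; the variable names are made up.

```python
import ntpath      # Windows flavor of os.path, importable on any platform
import os
import posixpath   # UNIX flavor of os.path

# join adds the separator only when the first part lacks one.
with_slash = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
without_slash = ntpath.join("c:\\music\\ap", "mahadeva.mp3")

unix_path = posixpath.join("/music/ap", "mahadeva.mp3")

# os.path is whichever flavor matches the running platform.
native = os.path.join("music", "ap", "mahadeva.mp3")
under_home = os.path.join(os.path.expanduser("~"), "Python")
```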
<div class=example><h3 id="splittingpathnames.example">Example 6.17. Splitting Pathnames</h3><pre class=screen><samp class=p>>>> </samp><kbd>os.path.split("c:\\music\\ap\\mahadeva.mp3")</kbd> <span>&#x2460;</span>
('c:\\music\\ap', 'mahadeva.mp3')
<samp class=p>>>> </samp><kbd>(filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3")</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>filepath</kbd> <span>&#x2462;</span>
'c:\\music\\ap'
<samp class=p>>>> </samp><kbd>filename</kbd> <span>&#x2463;</span>
'mahadeva.mp3'
<samp class=p>>>> </samp><kbd>(shortname, extension) = os.path.splitext(filename)</kbd> <span>&#x2464;</span>
<samp class=p>>>> </samp><kbd>shortname</kbd>
'mahadeva'
<samp class=p>>>> </samp><kbd>extension</kbd>
'.mp3'</pre>
<ol>
<li>The <code>split</code> function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use
<a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a> to return multiple values from a function? Well, <code>split</code> is such a function.
<li>You assign the return value of the <code>split</code> function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.
<li>The first variable, <var>filepath</var>, receives the value of the first element of the tuple returned from <code>split</code>, the file path.
<li>The second variable, <var>filename</var>, receives the value of the second element of the tuple returned from <code>split</code>, the filename.
<li><code>os.path</code> also contains a function <code>splitext</code>, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique
to assign each of them to separate variables.
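<p>The same calls can be sketched with <abbr>UNIX</abbr>-style paths, along with two edge cases worth knowing about. <code>posixpath</code> is used explicitly so the results do not depend on the platform; the file names are invented for the example.

```python
import posixpath

(filepath, filename) = posixpath.split("/music/ap/mahadeva.mp3")
(shortname, extension) = posixpath.splitext(filename)

# Only the last dot counts as the start of the extension.
(archive_root, archive_ext) = posixpath.splitext("backup.tar.gz")

# A name without a dot has an empty extension.
(plain_root, plain_ext) = posixpath.splitext("README")
```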
<div class=example><h3 id="fileinfo.listdir.example">Example 6.18. Listing Directories</h3><pre class=screen><samp class=p>>>> </samp><kbd>os.listdir("c:\\music\\_singles\\")</kbd> <span>&#x2460;</span>
<samp>['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>dirname = "c:\\"</kbd>
<samp class=p>>>> </samp><kbd>os.listdir(dirname)</kbd> <span>&#x2461;</span>
<samp>['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']</samp>
<samp class=p>>>> </samp><kbd>[f for f in os.listdir(dirname)</kbd>
<samp class=p>... </samp><kbd>    if os.path.isfile(os.path.join(dirname, f))]</kbd> <span>&#x2462;</span>
<samp>['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
'NTDETECT.COM', 'ntldr', 'pagefile.sys']</samp>
<samp class=p>>>> </samp><kbd>[f for f in os.listdir(dirname)</kbd>
<samp class=p>... </samp><kbd>    if os.path.isdir(os.path.join(dirname, f))]</kbd> <span>&#x2463;</span>
<samp>['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']</samp></pre>
<ol>
<li>The <code>listdir</code> function takes a pathname and returns a list of the contents of the directory.
<li><code>listdir</code> returns both files and folders, with no indication of which is which.
<li>You can use <a href="#apihelper.filter" title="4.5. Filtering Lists">list filtering</a> and the <code>isfile</code> function of the <code>os.path</code> module to separate the files from the folders. <code>isfile</code> takes a pathname and returns <code>True</code> if the path represents a file, and <code>False</code> otherwise (Python versions before 2.3 return <code>1</code> and <code>0</code>). Here you're using <code><code>os.path</code>.<code>join</code></code> to ensure a full pathname, but <code>isfile</code> also works with a partial path, relative to the current working directory. You can use <code>os.getcwd()</code> to get the current working directory.
<li><code>os.path</code> also has an <code>isdir</code> function, which returns <code>True</code> if the path represents a directory, and <code>False</code> otherwise. You can use this to get a list of the subdirectories within a directory.
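<p>Because the directory listings above depend on one particular machine, here is a self-contained sketch that builds a scratch directory first and then separates files from folders. The file and folder names are hypothetical.

```python
import os
import shutil
import tempfile

dirname = tempfile.mkdtemp()  # scratch directory, removed at the end
try:
    # Create two empty files and one subdirectory.
    for name in ("kairo.mp3", "notes.txt"):
        open(os.path.join(dirname, name), "w").close()
    os.mkdir(os.path.join(dirname, "singles"))

    # listdir returns bare names, in no guaranteed order.
    everything = sorted(os.listdir(dirname))

    # Join each name back onto the directory before testing it.
    files = sorted(f for f in os.listdir(dirname)
                   if os.path.isfile(os.path.join(dirname, f)))
    folders = sorted(f for f in os.listdir(dirname)
                     if os.path.isdir(os.path.join(dirname, f)))
finally:
    shutil.rmtree(dirname)
```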
<div class=example><h3>Example 6.19. Listing Directories in <code>fileinfo.py</code></h3><pre><code>
def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)] <span>&#x2460;</span> <span>&#x2461;</span>
    fileList = [os.path.join(directory, f)
                for f in fileList
                if os.path.splitext(f)[1] in fileExtList] <span>&#x2462;</span> <span>&#x2463;</span> <span>&#x2464;</span></pre>
<ol>
<li><code>os.listdir(directory)</code> returns a list of all the files and folders in <var>directory</var>.
<li>Iterating through the list with <var>f</var>, you use <code>os.path.normcase(f)</code> to normalize the case according to operating system defaults. <code>normcase</code> is a useful little function that compensates for case-insensitive operating systems that think that <code>mahadeva.mp3</code> and <code>mahadeva.MP3</code> are the same file. For instance, on Windows and Mac OS, <code>normcase</code> will convert the entire filename to lowercase; on <abbr>UNIX</abbr>-compatible systems, it will return the filename unchanged.
<li>Iterating through the normalized list with <var>f</var> again, you use <code>os.path.splitext(f)</code> to split each filename into name and extension.
<li>For each file, you see if the extension is in the list of file extensions you care about (<var>fileExtList</var>, which was passed to the <code>listDirectory</code> function).
<li>For each file you care about, you use <code>os.path.join(directory, f)</code> to construct the full pathname of the file, and return a list of the full pathnames.
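<p>The steps above can be sketched as a complete, runnable function. This is a simplified stand-in, not the real <code>listDirectory</code>: <code>list_directory</code> is a hypothetical name, it returns the pathnames instead of going on to build file info objects, and the scratch files are invented for the demonstration.

```python
import os
import shutil
import tempfile


def list_directory(directory, file_ext_list):
    """Return full pathnames of files whose extension is in file_ext_list."""
    # Normalize case according to the operating system's conventions.
    file_list = [os.path.normcase(f) for f in os.listdir(directory)]
    # Keep the names with a wanted extension and rebuild the full path.
    return [os.path.join(directory, f)
            for f in file_list
            if os.path.splitext(f)[1] in file_ext_list]


directory = tempfile.mkdtemp()  # scratch directory, removed at the end
try:
    for name in ("kairo.mp3", "cover.jpg", "notes.txt"):
        open(os.path.join(directory, name), "w").close()
    matches = sorted(list_directory(directory, [".mp3", ".jpg"]))
finally:
    shutil.rmtree(directory)
```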
<table id="tip.os" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Whenever possible, you should use the functions in <code>os</code> and <code>os.path</code> for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like
<code>os.path.split</code> work on <abbr>UNIX</abbr>, Windows, Mac OS, and any other platform supported by Python.
<p>There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you
may already be familiar with from working on the command line.
<div class=example><h3 id="fileinfo.os.glob.example">Example 6.20. Listing Directories with <code>glob</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>os.listdir("c:\\music\\_singles\\")</kbd> <span>&#x2460;</span>
<samp>['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>import glob</kbd>
<samp class=p>>>> </samp><kbd>glob.glob('c:\\music\\_singles\\*.mp3')</kbd> <span>&#x2461;</span>
<samp>['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
'c:\\music\\_singles\\hellraiser.mp3',
'c:\\music\\_singles\\kairo.mp3',
'c:\\music\\_singles\\long_way_home1.mp3',
'c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>glob.glob('c:\\music\\_singles\\s*.mp3')</kbd> <span>&#x2462;</span>
<samp>['c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']</samp>
<samp class=p>>>> </samp><kbd>glob.glob('c:\\music\\*\\*.mp3')</kbd><span>&#x2463;</span>
</pre>
<ol>
<li>As you saw earlier, <code>os.listdir</code> simply takes a directory path and lists all files and directories in that directory.
<li>The <code>glob</code> module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard.
Here the wildcard is a directory path plus "*.mp3", which will match all <code>.mp3</code> files. Note that each element of the returned list already includes the full path of the file.
<li>If you want to find all the files in a specific directory that start with "s" and end with ".mp3", you can do that too.
<li>Now consider this scenario: you have a <code>music</code> directory, with several subdirectories within it, with <code>.mp3</code> files within each subdirectory. You can get a list of all of those with a single call to <code>glob</code>, by using two wildcards at once. One wildcard is the <code>"*.mp3"</code> (to match <code>.mp3</code> files), and one wildcard is <em>within the directory path itself</em>, to match any subdirectory within <code>c:\music</code>. That's a crazy amount of power packed into one deceptively simple-looking function!
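<p>Here is a self-contained sketch of the three wildcard patterns, run against a scratch directory tree so that it does not depend on a real <code>c:\music</code> folder. The subdirectory and file names are hypothetical.

```python
import glob
import os
import shutil
import tempfile

music = tempfile.mkdtemp()  # stands in for the music directory
try:
    for sub, name in (("singles", "sidewinder.mp3"),
                      ("singles", "kairo.mp3"),
                      ("singles", "cover.jpg"),
                      ("ap", "mahadeva.mp3")):
        folder = os.path.join(music, sub)
        if not os.path.isdir(folder):
            os.mkdir(folder)
        open(os.path.join(folder, name), "w").close()

    singles = os.path.join(music, "singles")

    # Every .mp3 file in one directory.
    all_mp3 = sorted(glob.glob(os.path.join(singles, "*.mp3")))

    # Only the .mp3 files whose names start with "s".
    s_mp3 = sorted(glob.glob(os.path.join(singles, "s*.mp3")))

    # A wildcard in the directory part matches every subdirectory.
    nested = sorted(glob.glob(os.path.join(music, "*", "*.mp3")))
finally:
    shutil.rmtree(music)
```

<p>Each result is a list of full pathnames, so the matches can be opened directly without another call to <code>os.path.join</code>.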
<div class=itemizedlist>
<h3>Further Reading on the <code>os</code> Module</h3>
<ul>
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/240">questions about the <code>os</code> module</a>.
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-os.html"><code>os</code></a> module and the <a href="http://www.python.org/doc/current/lib/module-os.path.html"><code>os.path</code></a> module.
</ul>
[HTML stuff was here]
<h2 id="dialect.locals">8.5. <code>locals</code> and <code>globals</code></h2>
<p>Let's digress from <abbr>HTML</abbr> processing for a minute and talk about how Python handles variables. Python has two built-in functions, <code>locals</code> and <code>globals</code>, which provide dictionary-based access to local and global variables.
<p>Remember <code>locals</code>? You first saw it here:
<pre><code>
def unknown_starttag(self, tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    self.pieces.append("&lt;%(tag)s%(strattrs)s>" % locals())
</pre><p>No, wait, you can't learn about <code>locals</code> yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention.
<p>Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names
of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.
<p>At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which
keeps track of the function's variables, including function arguments and locally defined variables. Each module has its
own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any
other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any
module, which holds built-in functions and exceptions.
<p>When a line of code asks for the value of a variable <var>x</var>, Python will search for that variable in all the available namespaces, in order:
<div class=orderedlist>
<ol>
<li>local namespace - specific to the current function or class method. If the function defines a local variable <var>x</var>, or has an argument <var>x</var>, Python will use this and stop searching.
<li>global namespace - specific to the current module. If the module has defined a variable, function, or class called <var>x</var>, Python will use that and stop searching.
<li>built-in namespace - global to all modules. As a last resort, Python will assume that <var>x</var> is the name of a built-in function or variable.
</ol>
<p>If Python doesn't find <var>x</var> in any of these namespaces, it gives up and raises a <code>NameError</code> with a message such as <samp>name 'x' is not defined</samp> (older Python releases worded it <samp>There is no variable named 'x'</samp>). You saw this error back in <a href="#odbchelper.unboundvariable" title="Example 3.18. Referencing an Unbound Variable">Example 3.18, &#8220;Referencing an Unbound Variable&#8221;</a>, but you didn't appreciate how much work Python was doing before giving you that error.
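<p>The search order can be sketched with four small functions, one for each outcome. All of the function names, and <code>undefined_name</code>, are hypothetical and exist only for this illustration.

```python
x = "global"


def local_wins():
    x = "local"  # found in the local namespace first
    return x


def global_wins():
    return x  # no local x, so the module-level x is used


def builtin_wins():
    return max  # neither local nor global, so the built-in is used


def nothing_found():
    try:
        return undefined_name  # not local, global, or built-in
    except NameError as error:
        return str(error)
```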
<table class=important border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/important.png" alt="Important" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a <a href="#fileinfo.nested" title="Example 6.21. listDirectory">nested function</a> or <a href="#apihelper.lambda" title="4.7. Using lambda Functions"><code>lambda</code> function</a>, Python will search for that variable in the current (nested or <code>lambda</code>) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or <code>lambda</code>) function's namespace, <em>then in the parent function's namespace</em>, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:<pre><code>
from __future__ import nested_scopes</pre><p>Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are <em>directly accessible at run-time</em>. How? Well, the local namespace is accessible via the built-in <code>locals</code> function, and the global (module level) namespace is accessible via the built-in <code>globals</code> function.
<div class=example><h3>Example 8.10. Introducing <code>locals</code></h3><pre class=screen><samp class=p>>>> </samp><kbd>def foo(arg):</kbd> <span>&#x2460;</span>
<samp class=p>... </samp><kbd>    x = 1</kbd>
<samp class=p>... </samp><kbd>    print locals()</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd>foo(7)</kbd> <span>&#x2461;</span>
{'arg': 7, 'x': 1}
<samp class=p>>>> </samp><kbd>foo('bar')</kbd> <span>&#x2462;</span>
{'arg': 'bar', 'x': 1}</pre>
<ol>
<li>The function <code>foo</code> has two variables in its local namespace: <var>arg</var>, whose value is passed in to the function, and <var>x</var>, which is defined within the function.
<li><code>locals</code> returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values
of the dictionary are the actual values of the variables. So calling <code>foo</code> with <code>7</code> prints the dictionary containing the function's two local variables: <var>arg</var> (<code>7</code>) and <var>x</var> (<code>1</code>).
<li>Remember, Python has dynamic typing, so you could just as easily pass a string in for <var>arg</var>; the function (and the call to <code>locals</code>) would still work just as well. <code>locals</code> works with all variables of all datatypes.
<p>What <code>locals</code> does for the local (function) namespace, <code>globals</code> does for the global (module) namespace. <code>globals</code> is more exciting, though, because a module's namespace is more exciting.
<sup>[<a name="d0e21226" href="#ftn.d0e21226">3</a>]</sup> Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes
defined in the module. Plus, it includes anything that was imported into the module.
<p>Remember the difference between <a href="#fileinfo.fromimport" title="5.2. Importing Modules Using from module import"><code>from <var>module</var> import</code></a> and <a href="#odbchelper.import" title="Example 2.3. Accessing the buildConnectionString Function's docstring"><code>import <var>module</var></code></a>? With <code>import <var>module</var></code>, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access
any of its functions or attributes: <code><var>module</var>.<var>function</var></code>. But with <code>from <var>module</var> import</code>, you're actually importing specific functions and attributes from another module into your own namespace, which is why you
access them directly without referencing the original module they came from. With the <code>globals</code> function, you can actually see this happen.
<div class=example><h3 id="dialect.globals.example">Example 8.11. Introducing <code>globals</code></h3>
<p>Look at the following block of code at the bottom of <code>BaseHTMLProcessor.py</code>:<pre><code>
if __name__ == "__main__":
    for k, v in globals().items(): <span>&#x2460;</span>
        print k, "=", v</pre>
<ol>
<li>Just so you don't get intimidated, remember that you've seen all this before. The <code>globals</code> function returns a dictionary, and you're <a href="#dictionaryiter.example" title="Example 6.10. Iterating Through a Dictionary">iterating through the dictionary</a> using the <code>items</code> method and <a href="#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable assignment</a>. The only thing new here is the <code>globals</code> function.
<p>Now running the script from the command line gives this output (note that your output may be slightly different, depending
on your platform and where you installed Python):<pre class=screen><samp class=p>c:\docbook\dip\py></samp> python BaseHTMLProcessor.py</pre><pre><code>
SGMLParser = sgmllib.SGMLParser <span>&#x2460;</span>
htmlentitydefs = &lt;module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'> <span>&#x2461;</span>
BaseHTMLProcessor = __main__.BaseHTMLProcessor <span>&#x2462;</span>
__name__ = __main__ <span>&#x2463;</span>
... rest of output omitted for brevity...</pre>
<ol>
<li><code>SGMLParser</code> was imported from <code>sgmllib</code>, using <code>from <var>module</var> import</code>. That means that it was imported directly into the module's namespace, and here it is.
<li>Contrast this with <code>htmlentitydefs</code>, which was imported using <code>import</code>. That means that the <code>htmlentitydefs</code> module itself is in the namespace, but the <var>entitydefs</var> variable defined within <code>htmlentitydefs</code> is not.
<li>This module only defines one class, <code>BaseHTMLProcessor</code>, and here it is. Note that the value here is <a href="#fileinfo.classattributes.intro" title="Example 5.17. Introducing Class Attributes">the class itself</a>, not a specific instance of the class.
<li>Remember the <a href="#odbchelper.ifnametrick"><code>if __name__</code> trick</a>? When running a module (as opposed to importing it from another module), the built-in <code>__name__</code> attribute is a special value, <code>__main__</code>. Since you ran this module as a script from the command line, <code>__name__</code> is <code>__main__</code>, which is why the little test code to print the <code>globals</code> got executed.
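<p>The difference between the two import styles can be reproduced in a few lines. The following is only a sketch, written in Python 3 syntax: the standard <code>math</code> and <code>json</code> modules stand in for <code>sgmllib</code> and <code>htmlentitydefs</code>, and the helper name <code>names_visible</code> is hypothetical.

```python
# Assumption: math and json are stand-ins for the modules in the example above.
from math import sqrt   # copies the name sqrt into this module's namespace
import json             # binds only the name json; its contents stay inside it


def names_visible(*names):
    """Report which of the given names exist in the module's global namespace."""
    namespace = globals()
    return {name: name in namespace for name in names}


report = names_visible("sqrt", "json", "dumps")
for name, present in sorted(report.items()):
    print(name, "=", present)
```

<p>Here <code>sqrt</code> and <code>json</code> should be reported as present, while <code>dumps</code> (which lives inside <code>json</code>) should not.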
<table id="tip.localsbyname" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Using the <code>locals</code> and <code>globals</code> functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors
the functionality of the <a href="#apihelper.getattr" title="4.4. Getting Object References With getattr"><code>getattr</code></a> function, which allows you to access arbitrary functions dynamically by providing the function name as a string.
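<p>As a minimal sketch of looking up a variable by name (Python 3 syntax; the variables <code>grammar</code> and <code>debug</code> and the helper <code>lookup</code> are hypothetical names chosen for illustration):

```python
# Hypothetical module-level settings, looked up by name at run time.
grammar = "kant.xml"
debug = 0


def lookup(name, default=None):
    """Return the module-level variable called name, or default if absent."""
    return globals().get(name, default)


print(lookup("grammar"))             # the value bound to the name grammar
print(lookup("no_such_name", "?"))   # falls back to the supplied default
```

<p>Because <code>globals</code> returns an ordinary dictionary, the usual dictionary methods such as <code>get</code> apply to it.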
<p>There is one other important difference between the <code>locals</code> and <code>globals</code> functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning
it.
<div class=example><h3 id="dialect.locals.readonly.example">Example 8.12. <code>locals</code> is read-only, <code>globals</code> is not</h3><pre><code>
def foo(arg):
    x = 1
    print locals() <span>&#x2460;</span>
    locals()["x"] = 2 <span>&#x2461;</span>
    print "x=",x <span>&#x2462;</span>

z = 7
print "z=",z
foo(3)
globals()["z"] = 8 <span>&#x2463;</span>
print "z=",z <span>&#x2464;</span>
</pre>
<ol>
<li>Since <code>foo</code> is called with <code>3</code>, this will print <code>{'arg': 3, 'x': 1}</code>. This should not be a surprise.
<li><code>locals</code> is a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this
would change the value of the local variable <var>x</var> to <code>2</code>, but it doesn't. <code>locals</code> does not actually return the local namespace; it returns a copy. So changing it does nothing to the value of the variables
in the local namespace.
<li>This prints <code>x= 1</code>, not <code>x= 2</code>.
<li>After being burned by <code>locals</code>, you might think that this <em>wouldn't</em> change the value of <var>z</var>, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), <code>globals</code> returns the actual global namespace, not a copy: the exact opposite behavior of <code>locals</code>. So any changes to the dictionary returned by <code>globals</code> directly affect your global variables.
<li>This prints <code>z= 8</code>, not <code>z= 7</code>.
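<p>The same experiment can be condensed so that the function reports what it saw. This is a sketch in Python 3 syntax, and the behavior shown is what CPython does inside a function; the language reference only promises that changes to the <code>locals</code> dictionary may not affect the real local variables.

```python
z = 7  # module-level (global) variable


def foo(arg):
    x = 1
    locals()["x"] = 2     # writes to a dictionary, not to the real local x
    globals()["z"] = 8    # writes straight into the module's namespace
    return x, z


result = foo(3)
print(result)
```

<p>The first element of <code>result</code> is still <code>1</code>, while the second reflects the new global value <code>8</code>.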
[XML stuff was here]
<h2 id="kgp.packages">9.2. Packages</h2>
<p>Actually parsing an <abbr>XML</abbr> document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour
to talk about packages.
<div class=example><h3>Example 9.5. Loading an <abbr>XML</abbr> document (a sneak peek)</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>from xml.dom import minidom</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp>xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')</pre>
<ol>
<li>This is a syntax you haven't seen before. It looks almost like the <code>from <var>module</var> import</code> you know and love, but the <code>"."</code> gives it away as something above and beyond a simple import. In fact, <code>xml</code> is what is known as a package, <code>dom</code> is a nested package within <code>xml</code>, and <code>minidom</code> is a module within <code>xml.dom</code>.
<p>That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than
directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still
just <code>.py</code> files, like always, except that they're in a subdirectory instead of the main <code>lib/</code> directory of your Python installation.
<div class=example><h3>Example 9.6. File layout of a package</h3><pre class=screen>Python21/           root Python installation (home of the executable)
|
+--lib/             library directory (home of the standard library modules)
   |
   +-- xml/         xml package (really just a directory with other stuff in it)
       |
       +--sax/      xml.sax package (again, just a directory)
       |
       +--dom/      xml.dom package (contains minidom.py)
       |
       +--parsers/  xml.parsers package (used internally)</pre><p>So when you say <code>from xml.dom import minidom</code>, Python figures out that that means &#8220;look in the <code>xml</code> directory for a <code>dom</code> directory, and look in <em>that</em> for the <code>minidom</code> module, and import it as <code>minidom</code>&#8221;. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import
specific classes or functions from a module contained within a package. You can also import the package itself as a module.
The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.
<div class=example><h3>Example 9.7. Packages are modules, too</h3><pre class=screen><samp class=p>>>> </samp><kbd>from xml.dom import minidom</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>minidom</kbd>
&lt;module 'xml.dom.minidom' from 'C:\Python21\lib\xml\dom\minidom.pyc'>
<samp class=p>>>> </samp><kbd>minidom.Element</kbd>
&lt;class xml.dom.minidom.Element at 01095744>
<samp class=p>>>> </samp><kbd>from xml.dom.minidom import Element</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>Element</kbd>
&lt;class xml.dom.minidom.Element at 01095744>
<samp class=p>>>> </samp><kbd>minidom.Element</kbd>
&lt;class xml.dom.minidom.Element at 01095744>
<samp class=p>>>> </samp><kbd>from xml import dom</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>dom</kbd>
&lt;module 'xml.dom' from 'C:\Python21\lib\xml\dom\__init__.pyc'>
<samp class=p>>>> </samp><kbd>import xml</kbd> <span>&#x2463;</span>
<samp class=p>>>> </samp><kbd>xml</kbd>
&lt;module 'xml' from 'C:\Python21\lib\xml\__init__.pyc'></pre>
<ol>
<li>Here you're importing a module (<code>minidom</code>) from a nested package (<code>xml.dom</code>). The result is that <code>minidom</code> is imported into your <a href="#dialect.locals" title="8.5. locals and globals">namespace</a>, and in order to reference classes within the <code>minidom</code> module (like <code>Element</code>), you need to preface them with the module name.
<li>Here you are importing a class (<code>Element</code>) from a module (<code>minidom</code>) from a nested package (<code>xml.dom</code>). The result is that <code>Element</code> is imported directly into your namespace. Note that this does not interfere with the previous import; the <code>Element</code> class can now be referenced in two ways (but it's all still the same class).
<li>Here you are importing the <code>dom</code> package (a nested package of <code>xml</code>) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even
have its own attributes and methods, just like the modules you've seen before.
<li>Here you are importing the root level <code>xml</code> package as a module.
<p>So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)?
The answer is the magical <code>__init__.py</code> file. You see, packages are not simply directories; they are directories with a specific file, <code>__init__.py</code>, inside. This file defines the attributes and methods of the package. For instance, <code>xml.dom</code> contains a <code>Node</code> class, which is defined in <code>xml/dom/__init__.py</code>. When you import a package as a module (like <code>dom</code> from <code>xml</code>), you're really importing its <code>__init__.py</code> file.
<table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">A package is a directory with the special <code>__init__.py</code> file in it. The <code>__init__.py</code> file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file,
but it has to exist. But if <code>__init__.py</code> doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.
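<p>To see the role of <code>__init__.py</code> first-hand, you can build a tiny package in a temporary directory and import it. This is a sketch in Python 3 syntax; the package name <code>demo_pkg</code>, the module <code>tools</code>, and their contents are all hypothetical.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical package name and contents, created in a throwaway directory.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demo_pkg")
os.mkdir(package_dir)

# __init__.py supplies the attributes of the package itself.
with open(os.path.join(package_dir, "__init__.py"), "w") as f:
    f.write("GREETING = 'hello from the package'\n")

# An ordinary module that lives inside the package.
with open(os.path.join(package_dir, "tools.py"), "w") as f:
    f.write("def double(n):\n    return n * 2\n")

sys.path.insert(0, root)          # make the temporary directory importable
importlib.invalidate_caches()     # make sure the new files are noticed

import demo_pkg                   # runs demo_pkg/__init__.py
from demo_pkg import tools        # runs demo_pkg/tools.py

print(demo_pkg.GREETING)
print(tools.double(21))
```

<p>The attribute <code>GREETING</code> comes from <code>__init__.py</code>, which is also the file that <code>demo_pkg.__file__</code> points to.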
<p>So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an <code>xml</code> package with <code>sax</code> and <code>dom</code> packages inside, the authors could have chosen to put all the <code>sax</code> functionality in <code>xmlsax.py</code> and all the <code>dom</code> functionality in <code>xmldom.py</code>, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the <abbr>XML</abbr> package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different
areas simultaneously).
<p>If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good
package architecture. It's one of the many things Python is good at, so take advantage of it.
<h2 id="kgp.parse">9.3. Parsing <abbr>XML</abbr></h2>
<p>As I was saying, actually parsing an <abbr>XML</abbr> document is very simple: one line of code. Where you go from there is up to you.
<h2 id="kgp.commandline">10.6. Handling command-line arguments</h2>
<p>Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short-
or long-style flags to specify various options. None of this is <abbr>XML</abbr>-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.
<p>It's difficult to talk about command-line processing without understanding how command-line arguments are exposed to your
Python program, so let's write a simple program to see them.
<div class=example><h3>Example 10.20. Introducing <var>sys.argv</var></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
#argecho.py
import sys
for arg in sys.argv: <span>&#x2460;</span>
    print arg</pre>
<ol>
<li>Each command-line argument passed to the program will be in <var>sys.argv</var>, which is just a list. Here you are printing each argument on a separate line.
<div class=example><h3>Example 10.21. The contents of <var>sys.argv</var></h3><pre class=screen>
<samp class=p>[you@localhost py]$ </samp>python argecho.py <span>&#x2460;</span>
argecho.py
<samp class=p>[you@localhost py]$ </samp>python argecho.py abc def <span>&#x2461;</span>
<samp>argecho.py
abc
def</samp>
<samp class=p>[you@localhost py]$ </samp>python argecho.py --help <span>&#x2462;</span>
<samp>argecho.py
--help</samp>
<samp class=p>[you@localhost py]$ </samp>python argecho.py -m kant.xml <span>&#x2463;</span>
<samp>argecho.py
-m
kant.xml</samp></pre>
<ol>
<li>The first thing to know about <var>sys.argv</var> is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later,
in <a href="#regression" title="Chapter 16. Functional Programming">Chapter 16, <i>Functional Programming</i></a>. Don't worry about it for now.
<li>Command-line arguments are separated by spaces, and each shows up as a separate element in the <var>sys.argv</var> list.
<li>Command-line flags, like <code>--help</code>, also show up as their own element in the <var>sys.argv</var> list.
<li>To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag
(<code>-m</code>) which takes an argument (<code>kant.xml</code>). Both the flag itself and the flag's argument are simply sequential elements in the <var>sys.argv</var> list. No attempt is made to associate one with the other; all you get is a list.
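<p>The splitting itself is done by your shell before Python ever starts, but you can approximate it from inside Python. The sketch below (Python 3 syntax) assumes that <code>shlex.split</code> from the standard library is a reasonable stand-in for a POSIX shell; the command strings are made up for illustration.

```python
import shlex

# Assumption: shlex.split approximates how a POSIX shell breaks up a command.
command = "argecho.py -m kant.xml"
argv = shlex.split(command)
print(argv)

# A quoted argument stays together as one element.
quoted = shlex.split('argecho.py "two words" abc')
print(quoted)
```

<p>Either way, what the program receives is a flat list of strings, with nothing to tie <code>-m</code> to <code>kant.xml</code>.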
<p>So as you can see, you certainly have all the information passed on the command line, but then again, it doesn't look like
it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags,
you can simply use <code>sys.argv[1]</code> to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need the <code>getopt</code> module.
<div class=example><h3>Example 10.22. Introducing <code>getopt</code></h3><pre><code>
def main(argv):
    grammar = "kant.xml" <span>&#x2460;</span>
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) <span>&#x2461;</span>
    except getopt.GetoptError: <span>&#x2462;</span>
        usage() <span>&#x2463;</span>
        sys.exit(2)
...
if __name__ == "__main__":
    main(sys.argv[1:])</pre>
<ol>
<li>First off, look at the bottom of the example and notice that you're calling the <code>main</code> function with <code>sys.argv[1:]</code>. Remember, <code>sys.argv[0]</code> is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off
and pass the rest of the list.
<li>This is where all the interesting processing happens. The <code>getopt</code> function of the <code>getopt</code> module takes three parameters: the argument list (which you got from <code>sys.argv[1:]</code>), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer
command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is
explained in more detail below.
<li>If anything goes wrong trying to parse these command-line flags, <code>getopt</code> will raise an exception, which you catch. You told <code>getopt</code> all the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.
<li>As is standard practice in the <abbr>UNIX</abbr> world, when the script is passed flags it doesn't understand, you print out a summary of proper usage and exit gracefully.
Note that I haven't shown the <code>usage</code> function here. You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.
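<p>Since the <code>usage</code> function is not shown, here is one way it might look. This is only a sketch in Python 3 syntax: the wording of the summary and the optional <code>stream</code> parameter are assumptions, and only the flags themselves come from this section.

```python
import io
import sys

# Hypothetical usage text; the flags are the ones this section describes.
USAGE = """\
Usage: python kgp.py [options] [source]

Options:
  -g ..., --grammar=...   use specified grammar file or URL
  -h, --help              print this usage summary
  -d                      show debugging information while parsing
"""


def usage(stream=None):
    """Write the usage summary to stream (standard error by default)."""
    if stream is None:
        stream = sys.stderr
    stream.write(USAGE)


# Capture the summary in memory instead of writing it to the terminal.
buffer = io.StringIO()
usage(buffer)
summary = buffer.getvalue()
print(summary, end="")
```

<p>Writing to standard error by default keeps the summary out of any output that is being piped to another program.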
<p>So what are all those parameters you pass to the <code>getopt</code> function? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element,
the script name, which you already chopped off before calling the <code>main</code> function). The second is the list of short command-line flags that the script accepts.
<div class=variablelist>
<h3><code>"hg:d"</code></h3>
<dl>
<dt><code>-h</code></dt>
<dd>print usage summary</dd>
<dt><code>-g ...</code></dt>
<dd>use specified grammar file or URL</dd>
<dt><code>-d</code></dt>
<dd>show debugging information while parsing</dd>
</dl>
<p>The first and third flags are simply standalone flags; you specify them or you don't, and they do things (print help) or change
state (turn on debugging). However, the second flag (<code>-g</code>) <em>must</em> be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address,
and you don't know which yet (you'll figure it out later), but you know it has to be <em>something</em>. So you tell <code>getopt</code> this by putting a colon after the <code>g</code> in that second parameter to the <code>getopt</code> function.
<p>To further complicate things, the script accepts either short flags (like <code>-h</code>) or long flags (like <code>--help</code>), and you want them to do the same thing. This is what the third parameter to <code>getopt</code> is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.
<div class=variablelist>
<h3><code>["help", "grammar="]</code></h3>
<dl>
<dt><code>--help</code></dt>
<dd>print usage summary</dd>
<dt><code>--grammar ...</code></dt>
<dd>use specified grammar file or URL</dd>
</dl>
<p>Three things of note here:
<div class=orderedlist>
<ol>
<li>All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling <code>getopt</code>. They are understood.
<li>The <code>--grammar</code> flag must always be followed by an additional argument, just like the <code>-g</code> flag. This is notated by an equals sign, <code>"grammar="</code>.
<li>The list of long flags is shorter than the list of short flags, because the <code>-d</code> flag does not have a corresponding long version. This is fine; only <code>-d</code> will turn on debugging. The two lists are independent as far as <code>getopt</code> is concerned, so their order does not have to match; listing the short
flags that <em>do</em> have corresponding long flags first simply makes the pairing easier to read.
</ol>
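<p>It may help to call <code>getopt</code> directly and look at what comes back. The following sketch uses Python 3 syntax and a made-up command line; the variable names are hypothetical.

```python
import getopt

short_flags = "hg:d"
long_flags = ["help", "grammar="]

# A hypothetical command line: one short flag, one long flag, one argument.
opts, args = getopt.getopt(
    ["-d", "--grammar=binary.xml", "source.xml"], short_flags, long_flags
)
print(opts)   # list of (flag, value) pairs
print(args)   # whatever is left after the flags

# A flag that was never declared raises getopt.GetoptError.
try:
    getopt.getopt(["-x"], short_flags, long_flags)
    unknown_flag_rejected = False
except getopt.GetoptError:
    unknown_flag_rejected = True
print(unknown_flag_rejected)
```

<p>Note that the dashes reappear in the returned flag names even though you left them out of the declarations.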
<p>Confused yet? Let's look at the actual code and see if it makes sense in context.
<div class=example><h3>Example 10.23. Handling command-line arguments in <code>kgp.py</code></h3><pre><code>
def main(argv): <span>&#x2460;</span>
    grammar = "kant.xml"
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts: <span>&#x2461;</span>
        if opt in ("-h", "--help"): <span>&#x2462;</span>
            usage()
            sys.exit()
        elif opt == '-d': <span>&#x2463;</span>
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"): <span>&#x2464;</span>
            grammar = arg
    source = "".join(args) <span>&#x2465;</span>
    k = KantGenerator(grammar, source)
    print k.output()</pre>
<ol>
<li>The <var>grammar</var> variable will keep track of the grammar file you're using. You initialize it here in case it's not specified on the command
line (using either the <code>-g</code> or the <code>--grammar</code> flag).
<li>The <var>opts</var> variable that you get back from <code>getopt</code> contains a list of two-element tuples: <var>flag</var> and <var>argument</var>. If the flag doesn't take an argument, then <var>arg</var> will simply be an empty string. This makes it easier to loop through the flags.
<li><code>getopt</code> validates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags.
If you specify the <code>-h</code> flag, <var>opt</var> will contain <code>"-h"</code>; if you specify the <code>--help</code> flag, <var>opt</var> will contain <code>"--help"</code>. So you need to check for both.
<li>Remember, the <code>-d</code> flag didn't have a corresponding long flag, so you only need to check for the short form. If you find it, you set a global
variable that you'll refer to later to print out debugging information. (I used this during the development of the script.
What, you thought all these examples worked on the first try?)
<li>If you find a grammar file, either with a <code>-g</code> flag or a <code>--grammar</code> flag, you save the argument that followed it (stored in <var>arg</var>) into the <var>grammar</var> variable, overwriting the default that you initialized at the top of the <code>main</code> function.
<li>That's it. You've looped through and dealt with all the command-line flags. That means that anything left must be command-line
arguments. These come back from the <code>getopt</code> function in the <var>args</var> variable. In this case, you're treating them as source material for the parser. If there are no command-line arguments
specified, <var>args</var> will be an empty list, and <var>source</var> will end up as the empty string.
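<p>If you want to experiment with this loop without running the whole script, one option is to wrap it in a function that returns its results. This is a sketch in Python 3 syntax, not the code from <code>kgp.py</code>: the name <code>parse_command_line</code> is hypothetical, the help flag is accepted but ignored, and a return value replaces the global <var>_debug</var>.

```python
import getopt


def parse_command_line(argv, default_grammar="kant.xml"):
    """Return (grammar, debug, source) parsed from a list of arguments.

    Hypothetical helper: it mirrors the loop in main but returns its
    results instead of setting a global or exiting.
    """
    grammar = default_grammar
    debug = False
    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    for opt, arg in opts:
        if opt == "-d":
            debug = True
        elif opt in ("-g", "--grammar"):
            grammar = arg
    source = "".join(args)
    return grammar, debug, source


print(parse_command_line(["-d", "-g", "binary.xml"]))
print(parse_command_line([]))
```

<p>Returning a tuple makes the parsing easy to check from the interactive shell, since nothing depends on <code>sys.argv</code> or on global state.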
<h2 id="kgp.alltogether">10.7. Putting it all together</h2>
<p>You've covered a lot of ground. Let's step back and see how all the pieces fit together.
<p>To start with, this is a script that <a href="#kgp.commandline" title="10.6. Handling command-line arguments">takes its arguments on the command line</a>, using the <code>getopt</code> module.
<pre><code>
def main(argv):
    ...
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        ...
    for opt, arg in opts:
        ...</pre><p>You create a new instance of the <code>KantGenerator</code> class, and pass it the grammar file and source that may or may not have been specified on the command line.
<pre><code>
k = KantGenerator(grammar, source)</pre><p>The <code>KantGenerator</code> instance automatically loads the grammar, which is an <abbr>XML</abbr> file. You use your custom <code>openAnything</code> function to open the file (which <a href="#kgp.openanything" title="10.1. Abstracting input sources">could be stored in a local file or a remote web server</a>), then use the built-in <code>minidom</code> parsing functions to <a href="#kgp.parse" title="9.3. Parsing XML">parse the <abbr>XML</abbr> into a tree of Python objects</a>.
<pre><code>
def _load(self, source):
    sock = toolbox.openAnything(source)
    xmldoc = minidom.parse(sock).documentElement
    sock.close()</pre><p>Oh, and along the way, you take advantage of your knowledge of the structure of the <abbr>XML</abbr> document to <a href="#kgp.cache" title="10.3. Caching node lookups">set up a little cache of references</a>, which are just elements in the <abbr>XML</abbr> document.
<pre><code>
def loadGrammar(self, grammar):
    for ref in self.grammar.getElementsByTagName("ref"):
        self.refs[ref.attributes["id"].value] = ref</pre><p>If you specified some source material on the command line, you use that; otherwise you rip through the grammar looking for
the "top-level" reference (that isn't referenced by anything else) and use that as a starting point.
<pre><code>
def getDefaultSource(self):
    xrefs = {}
    for xref in self.grammar.getElementsByTagName("xref"):
        xrefs[xref.attributes["id"].value] = 1
    xrefs = xrefs.keys()
    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
    return '&lt;xref id="%s"/>' % random.choice(standaloneXrefs)</pre><p>Now you rip through the source material. The source material is also <abbr>XML</abbr>, and you parse it one node at a time. To keep the code separated and more maintainable, you use <a href="#kgp.handler" title="10.5. Creating separate handlers by node type">separate handlers for each node type</a>.
<pre><code>
def parse_Element(self, node):
    handlerMethod = getattr(self, "do_%s" % node.tagName)
    handlerMethod(node)</pre><p>You bounce through the grammar, <a href="#kgp.child" title="10.4. Finding direct children of a node">parsing all the children</a> of each <code>p</code> element,
<pre><code>
def do_p(self, node):
    ...
    if doit:
        for child in node.childNodes: self.parse(child)</pre><p>replacing <code>choice</code> elements with a random child,
<pre><code>
def do_choice(self, node):
    self.parse(self.randomChildElement(node))</pre><p>and replacing <code>xref</code> elements with a random child of the corresponding <code>ref</code> element, which you previously cached.
<pre><code>
def do_xref(self, node):
    id = node.attributes["id"].value
    self.parse(self.randomChildElement(self.refs[id]))</pre><p>Eventually, you parse your way down to plain text,
<pre><code>
def parse_Text(self, node):
    text = node.data
    ...
    self.pieces.append(text)</pre><p>which you print out.
<pre><code>
def main(argv):
    ...
    k = KantGenerator(grammar, source)
    print k.output()</pre><h2 id="kgp.summary">10.8. Summary</h2>
<p>Python comes with powerful libraries for parsing and manipulating <abbr>XML</abbr> documents. The <code>minidom</code> module takes an <abbr>XML</abbr> file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments,
error handling, even the ability to take input from the piped result of a previous program.
<p>Before moving on to the next chapter, you should be comfortable doing all of these things:
<div class=itemizedlist>
<ul>
<li><a href="#kgp.stdio" title="10.2. Standard input, output, and error">Chaining programs</a> with standard input and output
<li><a href="#kgp.handler" title="10.5. Creating separate handlers by node type">Defining dynamic dispatchers</a> with <code>getattr</code>
<li><a href="#kgp.commandline" title="10.6. Handling command-line arguments">Using command-line flags</a> and validating them with <code>getopt</code>
</ul>
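<p>As a quick refresher on the dispatcher idea from the list above, here is a self-contained sketch in Python 3 syntax. The class <code>ShapeDescriber</code> and its methods are hypothetical, and the fallback handler is an addition that the chapter's own dispatcher does not have.

```python
class ShapeDescriber:
    """Hypothetical class that picks a handler method by name."""

    def describe(self, kind):
        # Build the method name from the data, then look it up on self.
        handler = getattr(self, "do_%s" % kind, self.do_unknown)
        return handler()

    def do_circle(self):
        return "round"

    def do_square(self):
        return "four equal sides"

    def do_unknown(self):
        return "no handler"


describer = ShapeDescriber()
print(describer.describe("circle"))
print(describer.describe("triangle"))
```

<p>Supporting a new kind of input then means adding one more <code>do_</code> method; the dispatching code itself does not change.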
[HTTP web services stuff was here]
[unit testing stuff was here]
<div class=chapter>
<h2 id="roman1.5">Chapter 14. Test-First Programming</h2>
<h2 id="roman.stage1">14.1. <code>roman.py</code>, stage 1</h2>
<p>Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're
going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps
in <code>roman.py</code>.
<div class=example><h3>Example 14.1. <code>roman1.py</code></h3>
<p>This file is available in <code>py/roman/stage1/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass <span>&#x2460;</span>
class OutOfRangeError(RomanError): pass <span>&#x2461;</span>
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass <span>&#x2462;</span>
def to_roman(n):
    """convert integer to Roman numeral"""
    pass <span>&#x2463;</span>
def from_roman(s):
    """convert Roman numeral to integer"""
    pass
</pre>
<ol>
<li>This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not
required) that you subclass <code>Exception</code>, which is the base class that all built-in exceptions inherit from. Here I am defining <code>RomanError</code> (inherited from <code>Exception</code>) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily
have inherited each individual exception from the <code>Exception</code> class directly.
<li>The <code>OutOfRangeError</code> and <code>NotIntegerError</code> exceptions will eventually be used by <code>to_roman()</code> to flag various forms of invalid input, as specified in <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to to_roman"><code>ToRomanBadInput</code></a>.
<li>The <code>InvalidRomanNumeralError</code> exception will eventually be used by <code>from_roman()</code> to flag invalid input, as specified in <a href="#roman.frombadinput.example" title="Example 13.4. Testing bad input to from_roman"><code>FromRomanBadInput</code></a>.
<li>At this stage, you want to define the <abbr>API</abbr> of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word <a href="#fileinfo.class.simplest" title="Example 5.3. The Simplest Python Class"><code>pass</code></a>.
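<p>One practical payoff of the shared <code>RomanError</code> base class is that a caller can catch every custom exception with a single <code>except</code> clause. The sketch below uses Python 3 syntax; the <code>check_input</code> function, its error messages, and the range of 1 to 3999 are assumptions made for illustration, not the stage 1 code.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass


def check_input(n):
    """Hypothetical validator; the range 1..3999 is an assumption here."""
    if not isinstance(n, int):
        raise NotIntegerError("non-integers can not be converted")
    if n not in range(1, 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    return n


caught = []
for value in (0, 2.5, 12):
    try:
        check_input(value)
    except RomanError as error:      # one clause catches every subclass
        caught.append(type(error).__name__)
print(caught)
```

<p>Code that cares about the specific failure can still catch <code>OutOfRangeError</code> or <code>NotIntegerError</code> individually.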
<p>Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At
this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to <code>romantest.py</code> and re-evaluate why you coded a test so useless that it passes with do-nothing functions.
<p>Run <code>romantest1.py</code> with the <code>-v</code> command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs.
With any luck, your output should look like this:
<div class=example><h3 id="roman.stage1.output">Example 14.2. Output of <code>romantest1.py</code> against <code>roman1.py</code></h3><pre class=screen><samp>from_roman should only accept uppercase input ... ERROR
to_roman should always return uppercase ... ERROR
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... FAIL
to_roman should give known result with known input ... FAIL
from_roman(to_roman(n))==n for all n ... FAIL
to_roman should fail with non-integer input ... FAIL
to_roman should fail with negative input ... FAIL
to_roman should fail with large input ... FAIL
to_roman should fail with 0 input ... FAIL
======================================================================
ERROR: from_roman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 154, in testFromRomanCase
roman1.from_roman(numeral.upper())
AttributeError: 'None' object has no attribute 'upper'</span><samp>
======================================================================
ERROR: to_roman should always return uppercase
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 148, in testToRomanCase
self.assertEqual(numeral, numeral.upper())
AttributeError: 'None' object has no attribute 'upper'</span><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 133, in testMalformedAntecedent
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 127, in testRepeatedPairs
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 99, in testFromRomanKnownValues
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: to_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 93, in testToRomanKnownValues
self.assertEqual(numeral, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: I != None</span><samp>
======================================================================
FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 141, in testSanity
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: to_roman should fail with non-integer input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 116, in testNonInteger
self.assertRaises(roman1.NotIntegerError, roman1.to_roman, 0.5)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: NotIntegerError</span><samp>
======================================================================
FAIL: to_roman should fail with negative input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 112, in testNegative
self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, -1)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with large input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 104, in testTooLarge
self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 4000)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with 0 input </span><span>&#x2460;</span><samp>
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 108, in testZero
self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 0)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError </span><span>&#x2461;</span><samp>
----------------------------------------------------------------------
Ran 12 tests in 0.040s </span><span>&#x2462;</span><samp>
FAILED (failures=10, errors=2) </span><span>&#x2463;</span></pre>
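<p>To see why do-nothing stubs fail both kinds of test, here is a small self-contained sketch in Python 3 syntax. The <code>run_stub_check</code> helper and the <code>StubCheck</code> class inside it are hypothetical and are not part of <code>romantest1.py</code>; <code>ValueError</code> stands in for the module's own exception classes. An equality test fails because the stub returns <code>None</code>, and an <code>assertRaises</code> test fails because the stub never raises anything.

```python
import unittest

def to_roman(n):
    """convert integer to Roman numeral"""
    pass  # stub, as in stage 1

def run_stub_check():
    """Hypothetical helper: run two tiny tests against the stub and return the result."""
    class StubCheck(unittest.TestCase):
        def test_known_value(self):
            # Fails: the stub returns None, not "I".
            self.assertEqual("I", to_roman(1))

        def test_zero(self):
            # Fails: the stub returns normally instead of raising.
            self.assertRaises(ValueError, to_roman, 0)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(StubCheck)
    result = unittest.TestResult()
    suite.run(result)
    return result

result = run_stub_check()
print(result.testsRun, len(result.failures), len(result.errors))
```

<p>Both tests are reported as failures rather than errors, because in each case the assertion itself is what went wrong; no unexpected exception escaped from the test.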
<h2 id="roman.stage2">14.2. <code>roman.py</code>, stage 2</h2>
<p>Now that you have the framework of the <code>roman</code> module laid out, it's time to start writing code and passing test cases.
<div class=example><h3 id="roman.stage2.example">Example 14.3. <code>roman2.py</code></h3>
<p>This file is available in <code>py/roman/stage2/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000), <span>&#x2460;</span>
('CM', 900),
('D', 500),
('CD', 400),
('C', 100),
('XC', 90),
('L', 50),
('XL', 40),
('X', 10),
('IX', 9),
('V', 5),
('IV', 4),
('I', 1))
def to_roman(n):
"""convert integer to Roman numeral"""
result = ""
for numeral, integer in romanNumeralMap:
while n >= integer: <span>&#x2461;</span>
result += numeral
n -= integer
return result
def from_roman(s):
"""convert Roman numeral to integer"""
pass
</pre>
<ol>
<li><var>romanNumeralMap</var> is a tuple of tuples which defines three things:
<div class=orderedlist>
<ol>
<li>The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals;
you're also defining two-character pairs like <code>CM</code> (&#8220;one hundred less than one thousand&#8221;); this will make the <code>to_roman()</code> code simpler later.
<li>The order of the Roman numerals. They are listed in descending value order, from <code>M</code> all the way down to <code>I</code>.
<li>The value of each Roman numeral. Each inner tuple is a pair of <code>(<var>numeral</var>, <var>value</var>)</code>.
</ol>
<li>Here's where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule.
To convert to Roman numerals, you simply iterate through <var>romanNumeralMap</var> looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation
to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
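<p>One way to see what the two-character pairs buy you is to run the same loop against a table that lacks them. The following is a sketch in Python 3 syntax; the <code>greedy</code> function and the <code>singles_only</code> table are hypothetical names introduced for this comparison and do not appear in <code>roman2.py</code>.

```python
# Python 3 sketch; greedy() and singles_only are illustrative names only.
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

# The same table with the subtractive pairs (CM, CD, XC, XL, IX, IV) removed.
singles_only = tuple(pair for pair in romanNumeralMap if len(pair[0]) == 1)

def greedy(n, table):
    """Repeatedly take the largest value in table that still fits into n."""
    result = ""
    for numeral, integer in table:
        while n >= integer:
            result += numeral
            n -= integer
    return result

print(greedy(4, romanNumeralMap), greedy(4, singles_only))
print(greedy(900, romanNumeralMap), greedy(900, singles_only))
```

<p>With the full table the loop yields <code>IV</code> and <code>CM</code>; with single characters only, the identical loop yields <code>IIII</code> and <code>DCCCC</code>. The subtraction rule lives entirely in the data.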
<div class=example><h3>Example 14.4. How <code>to_roman()</code> works</h3>
<p>If you're not clear how <code>to_roman()</code> works, add a <code>print</code> statement to the end of the <code>while</code> loop:<pre><code>
while n >= integer:
result += numeral
n -= integer
print 'subtracting', integer, 'from input, adding', numeral, 'to output'</pre><pre class=screen>
<samp class=p>>>> </samp><kbd>import roman2</kbd>
<samp class=p>>>> </samp><kbd>roman2.to_roman(1424)</kbd>
<samp>subtracting 1000 from input, adding M to output
subtracting 400 from input, adding CD to output
subtracting 10 from input, adding X to output
subtracting 10 from input, adding X to output
subtracting 4 from input, adding IV to output
'MCDXXIV'</samp>
</pre><p>So <code>to_roman()</code> appears to work, at least in this manual spot check. But will it pass the unit tests? Well no, not entirely.
<div class=example><h3>Example 14.5. Output of <code>romantest2.py</code> against <code>roman2.py</code></h3>
<p>Remember to run <code>romantest2.py</code> with the <code>-v</code> command-line flag to enable verbose mode.
<pre class=screen><samp>from_roman should only accept uppercase input ... FAIL
to_roman should always return uppercase ... ok</span><span>&#x2460;</span><samp>
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... FAIL
to_roman should give known result with known input ... ok </span><span>&#x2461;</span><samp>
from_roman(to_roman(n))==n for all n ... FAIL
to_roman should fail with non-integer input ... FAIL </span><span>&#x2462;</span><samp>
to_roman should fail with negative input ... FAIL
to_roman should fail with large input ... FAIL
to_roman should fail with 0 input ... FAIL</span></pre>
<ol>
<li><code>to_roman()</code> does, in fact, always return uppercase, because <var>romanNumeralMap</var> defines the Roman numeral representations as uppercase. So this test passes already.
<li>Here's the big news: this version of the <code>to_roman()</code> function passes the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values test</a>. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including
inputs that produce every single-character Roman numeral, the largest possible input (<code>3999</code>), and the input that produces the longest possible Roman numeral (<code>3888</code>). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
<li>However, the function does not &#8220;work&#8221; for bad values; it fails every single <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to to_roman">bad input test</a>. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to
be raised (via <code>assertRaises</code>), and you're never raising them. You'll do that in the next stage.
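<p>It may help to see what the unchecked function actually does with bad input. This is a sketch in Python 3 syntax that reproduces the stage 2 loop under the hypothetical name <code>to_roman_unchecked</code>. Instead of raising an exception, it quietly returns a string, which is exactly why the bad input tests fail.

```python
# Python 3 sketch of the stage 2 loop; to_roman_unchecked is an illustrative name.
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman_unchecked(n):
    """Convert n with no input validation at all."""
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

# The four bad inputs used by the test cases: too large, zero, negative, non-integer.
for bad in (4000, 0, -1, 0.5):
    print(repr(bad), '->', repr(to_roman_unchecked(bad)))
```

<p>Here <code>4000</code> produces <code>'MMMM'</code>, and the other three inputs produce an empty string, because none of them is ever greater than or equal to any value in the table.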
<p>Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.
<pre class=screen><samp>
======================================================================
FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 156, in testFromRomanCase
roman2.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 133, in testMalformedAntecedent
self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 127, in testRepeatedPairs
self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 99, in testFromRomanKnownValues
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 141, in testSanity
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: to_roman should fail with non-integer input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 116, in testNonInteger
self.assertRaises(roman2.NotIntegerError, roman2.to_roman, 0.5)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: NotIntegerError</span><samp>
======================================================================
FAIL: to_roman should fail with negative input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 112, in testNegative
self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, -1)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with large input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 104, in testTooLarge
self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
======================================================================
FAIL: to_roman should fail with 0 input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 108, in testZero
self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 0)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError</span><samp>
----------------------------------------------------------------------
Ran 12 tests in 0.320s
FAILED (failures=10)</span></pre><h2 id="roman.stage3">14.3. <code>roman.py</code>, stage 3</h2>
<p>Now that <code>to_roman()</code> behaves correctly with good input (integers from <code>1</code> to <code>3999</code>), it's time to make it behave correctly with bad input (everything else).
<div class=example><h3>Example 14.6. <code>roman3.py</code></h3>
<p>This file is available in <code>py/roman/stage3/</code> in the examples directory.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000),
('CM', 900),
('D', 500),
('CD', 400),
('C', 100),
('XC', 90),
('L', 50),
('XL', 40),
('X', 10),
('IX', 9),
('V', 5),
('IV', 4),
('I', 1))
def to_roman(n):
"""convert integer to Roman numeral"""
if not (0 &lt; n &lt; 4000): <span>&#x2460;</span>
raise OutOfRangeError, "number out of range (must be 1..3999)" <span>&#x2461;</span>
if int(n) &lt;> n: <span>&#x2462;</span>
raise NotIntegerError, "non-integers can not be converted"
result = "" <span>&#x2463;</span>
for numeral, integer in romanNumeralMap:
while n >= integer:
result += numeral
n -= integer
return result
def from_roman(s):
"""convert Roman numeral to integer"""
pass
</pre>
<ol>
<li>This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to <code>if not ((0 &lt; n) and (n &lt; 4000))</code>, but it's much easier to read. This is the range check, and it should catch inputs that are too large, negative, or zero.
<li>You raise exceptions yourself with the <code>raise</code> statement. You can raise any of the built-in exceptions, or you can raise any of your custom exceptions that you've defined.
The second parameter, the error message, is optional; if given, it is displayed in the traceback that is printed if the exception
is never handled.
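<li>This is the non-integer check. <code>int(n)</code> discards any fractional part, so it differs from <var>n</var> exactly when <var>n</var> is not a whole number. Non-integers cannot be converted to Roman numerals.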
<li>The rest of the function is unchanged.
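<p>The listing above uses the older <code>raise Class, "message"</code> form and the <code>&lt;></code> operator. As a sketch only, here are the same two checks in Python 3 syntax, where the message is passed to the exception class as an argument and inequality is written <code>!=</code>. The <code>outcome</code> helper is hypothetical and is included just to make the behavior easy to inspect.

```python
# Python 3 sketch of the stage 3 checks; outcome() is an illustrative helper.
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """convert integer to Roman numeral"""
    if not (0 < n < 4000):   # chained comparison: (0 < n) and (n < 4000)
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:          # true only when n has a fractional part
        raise NotIntegerError("non-integers can not be converted")
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def outcome(n):
    """Hypothetical helper: return the numeral, or the name of the exception raised."""
    try:
        return to_roman(n)
    except RomanError as e:
        return type(e).__name__

for value in (1424, 4000, 0, -1, 0.5):
    print(repr(value), '->', outcome(value))
```

<p>Note that the order of the checks matters for an input like <code>0.5</code>: it passes the range check and is then rejected by the non-integer check.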
<div class=example><h3>Example 14.7. Watching <code>to_roman()</code> handle bad input</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import roman3</kbd>
<samp class=p>>>> </samp><kbd>roman3.to_roman(4000)</kbd>
<samp class=traceback>Traceback (most recent call last):
File "&lt;interactive input>", line 1, in ?
File "roman3.py", line 27, in to_roman
raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)</samp>
<samp class=p>>>> </samp><kbd>roman3.to_roman(1.5)</kbd>
<samp class=traceback>Traceback (most recent call last):
File "&lt;interactive input>", line 1, in ?
File "roman3.py", line 29, in to_roman
raise NotIntegerError, "non-integers can not be converted"
NotIntegerError: non-integers can not be converted</samp>
</pre><div class=example><h3>Example 14.8. Output of <code>romantest3.py</code> against <code>roman3.py</code></h3><pre class=screen><samp>from_roman should only accept uppercase input ... FAIL
to_roman should always return uppercase ... ok
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... FAIL
to_roman should give known result with known input ... ok </span><span>&#x2460;</span><samp>
from_roman(to_roman(n))==n for all n ... FAIL
to_roman should fail with non-integer input ... ok </span><span>&#x2461;</span><samp>
to_roman should fail with negative input ... ok </span><span>&#x2462;</span><samp>
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok</span></pre>
<ol>
<li><code>to_roman()</code> still passes the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values test</a>, which is comforting. All the tests that passed in <a href="#roman.stage2" title="14.2. roman.py, stage 2">stage 2</a> still pass, so the latest code hasn't broken anything.
<li>More exciting is the fact that all of the <a href="#roman.tobadinput.example" title="Example 13.3. Testing bad input to to_roman">bad input tests</a> now pass. This test, <code>testNonInteger</code>, passes because of the <code>int(n) &lt;> n</code> check. When a non-integer is passed to <code>to_roman()</code>, the <code>int(n) &lt;> n</code> check notices it and raises the <code>NotIntegerError</code> exception, which is what <code>testNonInteger</code> is looking for.
<li>This test, <code>testNegative</code>, passes because of the <code>not (0 &lt; n &lt; 4000)</code> check, which raises an <code>OutOfRangeError</code> exception, which is what <code>testNegative</code> is looking for.
<pre class=screen><samp>
======================================================================
FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 156, in testFromRomanCase
roman3.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 133, in testMalformedAntecedent
self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 127, in testRepeatedPairs
self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 122, in testTooManyRepeatedNumerals
self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</span><samp>
======================================================================
FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 99, in testFromRomanKnownValues
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
======================================================================
FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
</span><samp class=traceback>Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 141, in testSanity
self.assertEqual(integer, result)
File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None</span><samp>
----------------------------------------------------------------------
Ran 12 tests in 0.401s
FAILED (failures=6)</span> <span>&#x2460;</span></pre>
<ol>
<li>You're down to 6 failures, and all of them involve <code>from_roman()</code>: the known values test, the three separate bad input tests, the case check, and the sanity check. That means that <code>to_roman()</code> has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that <code>from_roman()</code> be written, which it isn't yet.) Which means that you must stop coding <code>to_roman()</code> now. No tweaking, no twiddling, no extra checks &#8220;just in case&#8221;. Stop. Now. Back away from the keyboard.
<table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for
a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module.
<h2 id="roman.stage4">14.4. <code>roman.py</code>, stage 4</h2>
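<p>Now that <code>to_roman()</code> is done, it's time to start coding <code>from_roman()</code>. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than the <code>to_roman()</code> function.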
<div class=example><h3>Example 14.9. <code>roman4.py</code></h3>
<p>This file is available in <code>py/roman/stage4/</code> in the examples directory.
<pre><code>
"""Convert to and from Roman numerals"""
#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass
#Define digit mapping
romanNumeralMap = (('M', 1000),
('CM', 900),
('D', 500),
('CD', 400),
('C', 100),
('XC', 90),
('L', 50),
('XL', 40),
('X', 10),
('IX', 9),
('V', 5),
('IV', 4),
('I', 1))
# to_roman function omitted for clarity (it hasn't changed)
def from_roman(s):
"""convert Roman numeral to integer"""
result = 0
index = 0
for numeral, integer in romanNumeralMap:
while s[index:index+len(numeral)] == numeral: <span>&#x2460;</span>
result += integer
index += len(numeral)
return result
</pre>
<ol>
<li>The pattern here is the same as <a href="#roman.stage2.example" title="Example 14.3. roman2.py"><code>to_roman()</code></a>. You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer
values as often as possible, you match the &#8220;highest&#8221; Roman numeral character strings as often as possible.
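<p>The matching loop described above can be sketched in Python 3 syntax as follows. The name <code>from_roman_traced</code> and the <code>found</code> list are additions for illustration; the list records each numeral as it is matched so you can inspect the order in which the string is consumed.

```python
# Python 3 sketch of the stage 4 loop; from_roman_traced is an illustrative name.
romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                   ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                   ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def from_roman_traced(s):
    """Return (integer, found), where found lists the numerals matched in order."""
    result = 0
    index = 0
    found = []
    for numeral, integer in romanNumeralMap:
        # Compare the slice starting at index against the current numeral;
        # a slice that runs past the end of s is simply shorter and won't match.
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
            found.append(numeral)
    return result, found

print(from_roman_traced('MCMLXXII'))
```

<p>Because <code>CM</code> comes before <code>C</code> and <code>D</code> in the table, the second and third characters of <code>MCMLXXII</code> are consumed together as <code>900</code> rather than separately.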
<div class=example><h3>Example 14.10. How <code>from_roman()</code> works</h3>
<p>If you're not clear how <code>from_roman()</code> works, add a <code>print</code> statement to the end of the <code>while</code> loop:<pre><code>
while s[index:index+len(numeral)] == numeral:
result += integer
index += len(numeral)
print 'found', numeral, 'of length', len(numeral), ', adding', integer</pre><pre class=screen>
<samp class=p>>>> </samp><kbd>import roman4</kbd>
<samp class=p>>>> </samp><kbd>roman4.from_roman('MCMLXXII')</kbd>
<samp>found M of length 1 , adding 1000
found CM of length 2 , adding 900
found L of length 1 , adding 50
found X of length 1 , adding 10
found X of length 1 , adding 10
found I of length 1 , adding 1
found I of length 1 , adding 1
1972</span></pre><div class=example><h3>Example 14.11. Output of <code>romantest4.py</code> against <code>roman4.py</code></h3><pre class=screen><samp>from_roman should only accept uppercase input ... FAIL
to_roman should always return uppercase ... ok
from_roman should fail with malformed antecedents ... FAIL
from_roman should fail with repeated pairs of numerals ... FAIL
from_roman should fail with too many repeated numerals ... FAIL
from_roman should give known result with known input ... ok </samp><span>&#x2460;</span><samp>
to_roman should give known result with known input ... ok
from_roman(to_roman(n))==n for all n ... ok</samp><span>&#x2461;</span><samp>
to_roman should fail with non-integer input ... ok
to_roman should fail with negative input ... ok
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok</samp></pre>
<ol>
<li>Two pieces of exciting news here. The first is that <code>from_roman()</code> works for good input, at least for all the <a href="#roman.testtoromanknownvalues.example" title="Example 13.2. testToRomanKnownValues">known values</a> you test.
<li>The second is that the <a href="#roman.sanity.example" title="Example 13.5. Testing to_roman against from_roman">sanity check</a> also passed. Combined with the known values tests, you can be reasonably sure that both <code>to_roman()</code> and <code>from_roman()</code> work properly for all possible good values. (This is not guaranteed; it is theoretically possible that <code>to_roman()</code> has a bug that produces the wrong Roman numeral for some particular set of inputs, <em>and</em> that <code>from_roman()</code> has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that <code>to_roman()</code> generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write
more comprehensive test cases until it doesn't bother you.)
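<p>The round-trip sanity check described above can be sketched as follows. This is a minimal sketch, assuming Python 3 rather than the Python 2 used in this chapter's listings; the snake_case names are illustrative, and the range and integer checks in <code>to_roman()</code> are left out for brevity.

```python
# Same digit mapping as the chapter's romanNumeralMap.
roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def to_roman(n):
    """Convert an integer to a Roman numeral (no input checks here)."""
    result = ""
    for numeral, integer in roman_numeral_map:
        while n >= integer:
            result += numeral
            n -= integer
    return result


def from_roman(s):
    """Convert a Roman numeral to an integer, stage 4 style (no validation)."""
    result = 0
    index = 0
    for numeral, integer in roman_numeral_map:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result


# Round trip: every value from 1 to 3999 should survive both conversions.
mismatches = [n for n in range(1, 4000) if from_roman(to_roman(n)) != n]
print("round-trip mismatches:", len(mismatches))

# Without validation, malformed input is converted instead of rejected.
print("from_roman('MCMC') ->", from_roman('MCMC'))
```

<p>The last line shows why the bad-input tests still fail at this stage: nothing in <code>from_roman()</code> rejects a malformed string yet.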
<pre class=screen><samp>
======================================================================
FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
</samp><samp class=traceback>Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 156, in testFromRomanCase
    roman4.from_roman, numeral.lower())
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp>
======================================================================
FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
</samp><samp class=traceback>Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 133, in testMalformedAntecedent
    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp>
======================================================================
FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
</samp><samp class=traceback>Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 127, in testRepeatedPairs
    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp>
======================================================================
FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
</samp><samp class=traceback>Traceback (most recent call last):
  File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 122, in testTooManyRepeatedNumerals
    self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
  File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
    raise self.failureException, excName
AssertionError: InvalidRomanNumeralError</samp><samp>
----------------------------------------------------------------------
Ran 12 tests in 1.222s
FAILED (failures=4)</samp></pre><h2 id="roman.stage5">14.5. <code>roman.py</code>, stage 5</h2>
<div class=example><h3>Example 14.12. <code>roman5.py</code></h3>
<p>This file is available in <code>py/roman/stage5/</code> in the examples directory.
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Convert to and from Roman numerals"""
import re

#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

#Define digit mapping
romanNumeralMap = (('M',  1000),
                   ('CM', 900),
                   ('D',  500),
                   ('CD', 400),
                   ('C',  100),
                   ('XC', 90),
                   ('L',  50),
                   ('XL', 40),
                   ('X',  10),
                   ('IX', 9),
                   ('V',  5),
                   ('IV', 4),
                   ('I',  1))

def to_roman(n):
    """convert integer to Roman numeral"""
    if not (0 &lt; n &lt; 4000):
        raise OutOfRangeError, "number out of range (must be 1..3999)"
    if int(n) &lt;> n:
        raise NotIntegerError, "non-integers can not be converted"
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

#Define pattern to detect valid Roman numerals
romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' <span>&#x2460;</span>

def from_roman(s):
    """convert Roman numeral to integer"""
    if not re.search(romanNumeralPattern, s):<span>&#x2461;</span>
        raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
</pre>
<ol>
<li>This is just a continuation of the pattern discussed in <a href="#re.roman" title="7.3. Case Study: Roman Numerals">Section 7.3, &#8220;Case Study: Roman Numerals&#8221;</a>. The tens place is either <code>XC</code> (<code>90</code>), <code>XL</code> (<code>40</code>), or an optional <code>L</code> followed by 0 to 3 optional <code>X</code> characters. The ones place is either <code>IX</code> (<code>9</code>), <code>IV</code> (<code>4</code>), or an optional <code>V</code> followed by 0 to 3 optional <code>I</code> characters.
<li>Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If
<code>re.search</code> returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid.
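<p>To see the pattern at work on a few inputs, here is a minimal sketch, assuming Python 3. The pattern is copied from <code>roman5.py</code> (written as a raw string); the helper name <code>is_valid_roman</code> and the sample inputs are illustrative and not part of the module.

```python
import re

# Same pattern as romanNumeralPattern in roman5.py.
roman_numeral_pattern = (
    r'^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
)


def is_valid_roman(s):
    """Return True if re.search finds a match, False if it returns None."""
    return re.search(roman_numeral_pattern, s) is not None


# One well-formed numeral, then one example of each kind of bad input
# that the test suite checks for.
for candidate in ('MCMLXXII',   # well-formed
                  'MCMC',       # malformed antecedent
                  'IIII',       # too many repeated numerals
                  'IXIX',       # repeated pair
                  'mcmlxxii'):  # lowercase
    print(candidate, is_valid_roman(candidate))
```

<p>Only the first candidate matches; the others leave characters that no part of the pattern can consume before the <code>$</code> anchor.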
<p>At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of
invalid Roman numerals. But don't take my word for it; look at the results:
<div class=example><h3>Example 14.13. Output of <code>romantest5.py</code> against <code>roman5.py</code></h3><pre class=screen><samp>
from_roman should only accept uppercase input ... ok </samp><span>&#x2460;</span><samp>
to_roman should always return uppercase ... ok
from_roman should fail with malformed antecedents ... ok </samp><span>&#x2461;</span><samp>
from_roman should fail with repeated pairs of numerals ... ok </samp><span>&#x2462;</span><samp>
from_roman should fail with too many repeated numerals ... ok
from_roman should give known result with known input ... ok
to_roman should give known result with known input ... ok
from_roman(to_roman(n))==n for all n ... ok
to_roman should fail with non-integer input ... ok
to_roman should fail with negative input ... ok
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 12 tests in 2.864s
OK </samp><span>&#x2463;</span></pre>
<ol>
<li>One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression
<var>romanNumeralPattern</var> was expressed in uppercase characters, the <code>re.search</code> check will reject any input that isn't completely uppercase. So the uppercase input test passes.
<li>More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like <code>MCMC</code>. As you've seen, this does not match the regular expression, so <code>from_roman()</code> raises an <code>InvalidRomanNumeralError</code> exception, which is what the malformed antecedents test case is looking for, so the test passes.
<li>In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test
cases.
<li><table class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">When all of your tests pass, stop coding.
[functional programming stuff was here]
<p>The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual
modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the
build process for this book; I have unit tests for several of the example programs (not just the <code>roman.py</code> module featured in <a href="#roman" title="Chapter 13. Unit Testing">Chapter 13, <i>Unit Testing</i></a>), and the first thing my automated build script does is run this program to make sure all my examples still work. If this
regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to
download them and sit around scratching your head and yelling at your monitor and wondering why they don't work.
<div class=example><h3>Example 16.1. <code>regression.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
"""Regression testing framework
This module will search for scripts in the same directory named
XYZtest.py. Each such script should be a test suite that tests a
module through PyUnit. (As of Python 2.1, PyUnit is included in
the standard library as "unittest".) This script will aggregate all
found test suites into one big test suite and run them all at once.
"""
import sys, os, re, unittest
def regressionTest():
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    files = os.listdir(path)
    test = re.compile("test\.py$", re.IGNORECASE)
    files = filter(test.search, files)
    filenameToModuleName = lambda f: os.path.splitext(f)[0]
    moduleNames = map(filenameToModuleName, files)
    modules = map(__import__, moduleNames)
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite(map(load, modules))

if __name__ == "__main__":
    unittest.main(defaultTest="regressionTest")
</pre><p>Running this script in the same directory as the rest of the example scripts that come with this book will find all the unit
tests, named <code><var><code>module</code></var>test.py</code>, run them as a single test, and pass or fail them all at once.
<div class=example><h3>Example 16.2. Sample output of <code>regression.py</code></h3><pre class=screen>
<samp class=p>[you@localhost py]$ </samp>python regression.py -v
<samp>help should fail with no object ... ok </samp><span>&#x2460;</span><samp>
help should return known result for apihelper ... ok
help should honor collapse argument ... ok
help should honor spacing argument ... ok
buildConnectionString should fail with list input ... ok </samp><span>&#x2461;</span><samp>
buildConnectionString should fail with string input ... ok
buildConnectionString should fail with tuple input ... ok
buildConnectionString handles empty dictionary ... ok
buildConnectionString returns known result with known input ... ok
from_roman should only accept uppercase input ... ok </samp><span>&#x2462;</span><samp>
to_roman should always return uppercase ... ok
from_roman should fail with blank string ... ok
from_roman should fail with malformed antecedents ... ok
from_roman should fail with repeated pairs of numerals ... ok
from_roman should fail with too many repeated numerals ... ok
from_roman should give known result with known input ... ok
to_roman should give known result with known input ... ok
from_roman(to_roman(n))==n for all n ... ok
to_roman should fail with non-integer input ... ok
to_roman should fail with negative input ... ok
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok
kgp a ref test ... ok
kgp b ref test ... ok
kgp c ref test ... ok
kgp d ref test ... ok
kgp e ref test ... ok
kgp f ref test ... ok
kgp g ref test ... ok
----------------------------------------------------------------------
Ran 29 tests in 2.799s
OK</samp></pre>
<ol>
<li>The first 4 tests are from <code>apihelpertest.py</code>, which tests the example script from <a href="#apihelper" title="Chapter 4. The Power Of Introspection">Chapter 4, <i>The Power Of Introspection</i></a>.
<li>The next 5 tests are from <code>odbchelpertest.py</code>, which tests the example script from <a href="#odbchelper" title="Chapter 2. Your First Python Program">Chapter 2, <i>Your First Python Program</i></a>.
<li>The next 13 tests are from <code>romantest.py</code>, which you studied in depth in <a href="#roman" title="Chapter 13. Unit Testing">Chapter 13, <i>Unit Testing</i></a>. The remaining 7 tests (<code>kgp a ref test</code> through <code>kgp g ref test</code>) are from <code>kgptest.py</code>.
<h2 id="regression.path">16.2. Finding the path</h2>
<p>When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.
<p>This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember
once you see it. The key to it is <code>sys.argv</code>. As you saw in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a>, this is a list that holds the command-line arguments. However, it also holds the name of the running script, exactly
as it was called from the command line, and this is enough information to determine its location.
<div class=example><h3>Example 16.3. <code>fullpath.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
import sys, os
print 'sys.argv[0] =', sys.argv[0] <span>&#x2460;</span>
pathname = os.path.dirname(sys.argv[0]) <span>&#x2461;</span>
print 'path =', pathname
print 'full path =', os.path.abspath(pathname) <span>&#x2462;</span></pre>
<ol>
<li>Regardless of how you run a script, <code>sys.argv[0]</code> will always contain the name of the script, exactly as it appears on the command line. This may or may not include any path
information, as you'll see shortly.
<li><code>os.path.dirname</code> takes a filename as a string and returns the directory path portion. If the given filename does not include any path information,
<code>os.path.dirname</code> returns an empty string.
<li><code>os.path.abspath</code> is the key here. It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.
<p><code>os.path.abspath</code> deserves further explanation. It is very flexible; it can take any kind of pathname.
<div class=example><h3>Example 16.4. Further explanation of <code>os.path.abspath</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import os</kbd>
<samp class=p>>>> </samp><kbd>os.getcwd()</kbd>                        <span>&#x2460;</span>
'/home/you'
<samp class=p>>>> </samp><kbd>os.path.abspath('')</kbd>                <span>&#x2461;</span>
'/home/you'
<samp class=p>>>> </samp><kbd>os.path.abspath('.ssh')</kbd>            <span>&#x2462;</span>
'/home/you/.ssh'
<samp class=p>>>> </samp><kbd>os.path.abspath('/home/you/.ssh')</kbd>  <span>&#x2463;</span>
'/home/you/.ssh'
<samp class=p>>>> </samp><kbd>os.path.abspath('.ssh/../foo/')</kbd>    <span>&#x2464;</span>
'/home/you/foo'</pre>
<ol>
<li><code>os.getcwd()</code> returns the current working directory.
<li>Calling <code>os.path.abspath</code> with an empty string returns the current working directory, same as <code>os.getcwd()</code>.
<li>Calling <code>os.path.abspath</code> with a partial pathname constructs a fully qualified pathname out of it, based on the current working directory.
<li>Calling <code>os.path.abspath</code> with a full pathname simply returns it.
<li><code>os.path.abspath</code> also <em>normalizes</em> the pathname it returns. Note that this example worked even though I don't actually have a 'foo' directory. <code>os.path.abspath</code> never checks your actual disk; this is all just string manipulation.
<table id="os.path.abspath.exist.note" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">The pathnames and filenames you pass to <code>os.path.abspath</code> do not need to exist.
<table id="os.path.normpath.note" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>os.path.abspath</code> not only constructs full path names, it also normalizes them. That means that if you are in the <code>/usr/</code> directory, <code>os.path.abspath('bin/../local/bin')</code> will return <code>/usr/local/bin</code>. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without
turning it into a full pathname, use <code>os.path.normpath</code> instead.
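<p>The difference between <code>os.path.abspath</code> and <code>os.path.normpath</code> can be checked without relying on any particular directory layout. This is a minimal sketch, assuming Python 3; it compares results against <code>os.getcwd()</code> instead of hard-coding paths, so it should behave the same wherever it is run.

```python
import os

cwd = os.getcwd()

# An empty string resolves to the current working directory.
print(os.path.abspath('') == cwd)

# A partial pathname is resolved against the current working directory.
print(os.path.abspath('.ssh') == os.path.join(cwd, '.ssh'))

# abspath also normalizes: bin/../local/bin collapses to local/bin.
messy = os.path.join('bin', os.pardir, 'local', 'bin')
print(os.path.abspath(messy) == os.path.join(cwd, 'local', 'bin'))

# normpath normalizes too, but leaves the result relative.
relative = os.path.normpath(messy)
print(relative == os.path.join('local', 'bin'))
print(os.path.isabs(relative))
```

<p>Every comparison prints <code>True</code> except the last line, which prints <code>False</code> because <code>normpath</code> does not make the path absolute. None of these directories needs to exist.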
<div class=example><h3>Example 16.5. Sample output from <code>fullpath.py</code></h3><pre class=screen>
<samp class=p>[you@localhost py]$ </samp>python /home/you/diveintopython3/common/py/fullpath.py <span>&#x2460;</span>
<samp>sys.argv[0] = /home/you/diveintopython3/common/py/fullpath.py
path = /home/you/diveintopython3/common/py
full path = /home/you/diveintopython3/common/py</samp>
<samp class=p>[you@localhost diveintopython3]$ </samp>python common/py/fullpath.py <span>&#x2461;</span>
<samp>sys.argv[0] = common/py/fullpath.py
path = common/py
full path = /home/you/diveintopython3/common/py</samp>
<samp class=p>[you@localhost diveintopython3]$ </samp>cd common/py
<samp class=p>[you@localhost py]$ </samp>python fullpath.py <span>&#x2462;</span>
<samp>sys.argv[0] = fullpath.py
path =
full path = /home/you/diveintopython3/common/py</samp></pre>
<ol>
<li>In the first case, <code>sys.argv[0]</code> includes the full path of the script. You can then use the <code>os.path.dirname</code> function to strip off the script name and return the full directory name, and <code>os.path.abspath</code> simply returns what you give it.
<li>If the script is run by using a partial pathname, <code>sys.argv[0]</code> will still contain exactly what appears on the command line. <code>os.path.dirname</code> will then give you a partial pathname (relative to the current directory), and <code>os.path.abspath</code> will construct a full pathname from the partial pathname.
<li>If the script is run from the current directory without giving any path, <code>os.path.dirname</code> will simply return an empty string. Given an empty string, <code>os.path.abspath</code> returns the current directory, which is what you want, since the script was run from the current directory.
<table id="os.path.abspath.crossplatform.note" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">Like the other functions in the <code>os</code> and <code>os.path</code> modules, <code>os.path.abspath</code> is cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash
as a path separator) or classic Mac OS (which used colons), but they'll still work. That's the whole point of the <code>os</code> module.
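<p>The three cases above reduce to a single expression. Here is a minimal sketch, assuming Python 3; the function name <code>script_directory</code> and the sample paths are illustrative, and the function takes the script name as an argument so that it can be tried without actually launching a script three different ways.

```python
import os


def script_directory(argv0):
    """Return the absolute directory of a script, given sys.argv[0]."""
    return os.path.abspath(os.path.dirname(argv0))


samples = (
    os.path.join(os.getcwd(), 'common', 'py', 'fullpath.py'),  # full path
    os.path.join('common', 'py', 'fullpath.py'),               # partial path
    'fullpath.py',                                             # no path at all
)
for argv0 in samples:
    print(repr(argv0), '->', script_directory(argv0))
```

<p>In a real script you would call <code>script_directory(sys.argv[0])</code>. The first two samples resolve to the same directory, and the third resolves to the current working directory.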
<p><b>Addendum. </b>One reader was dissatisfied with this solution, and wanted to be able to run all the unit tests in the current directory,
not the directory where <code>regression.py</code> is located. He suggests this approach instead:
<div class=example><h3 id="regression.path.cwd.example">Example 16.6. Running scripts in the current directory</h3><pre><code>import sys, os, re, unittest
def regressionTest():
    path = os.getcwd()       <span>&#x2460;</span>
    sys.path.append(path)    <span>&#x2461;</span>
    files = os.listdir(path) <span>&#x2462;</span>
</pre>
<ol>
<li>Instead of setting <var>path</var> to the directory where the currently running script is located, you set it to the current working directory. This
will be whatever directory you were in before you ran the script, which is not necessarily the same as the directory the script
is in. (Read that sentence a few times until you get it.)
<li>Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them. You didn't need to do this when <var>path</var> was the directory of the currently running script, because Python always looks in that directory.
<li>The rest of the function is the same.
<p>This technique will allow you to re-use this <code>regression.py</code> script on multiple projects. Just put the script in a common directory, then change to the project's directory before running
it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where <code>regression.py</code> is located.
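<p>The effect of appending a directory to the search path can be seen in isolation. This is a minimal sketch, assuming Python 3; the helper <code>import_from_directory</code> and the module name <code>gr_demo_helper</code> are hypothetical, and a temporary directory stands in for a project directory.

```python
import importlib
import os
import sys
import tempfile


def import_from_directory(directory, module_name):
    """Import a module by name after making its directory importable."""
    if directory not in sys.path:
        sys.path.append(directory)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# A throwaway "project directory" containing one small module.
with tempfile.TemporaryDirectory() as project_dir:
    with open(os.path.join(project_dir, 'gr_demo_helper.py'), 'w') as f:
        f.write('ANSWER = 42\n')
    helper = import_from_directory(project_dir, 'gr_demo_helper')

print(helper.ANSWER)
```

<p>Without the <code>sys.path.append</code> call the import would fail, because the temporary directory is neither the script's directory nor anywhere else on the search path.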
[more functional programming stuff was here]
<h2 id="regression.import">16.6. Dynamically importing modules</h2>
<p>OK, enough philosophizing. Let's talk about dynamically importing modules.
<p>First, let's look at how you normally import modules. The <code>import <var>module</var></code> syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once
this way, with a comma-separated list. You did this on the very first line of this chapter's script.
<div class=example><h3>Example 16.13. Importing multiple modules at once</h3><pre><code>
import sys, os, re, unittest <span>&#x2460;</span>
</pre>
<ol>
<li>This imports four modules at once: <code>sys</code> (for system functions and access to the command line parameters), <code>os</code> (for operating system functions like directory listings), <code>re</code> (for regular expressions), and <code>unittest</code> (for unit testing).
<p>Now let's do the same thing, but with dynamic imports.
<div class=example><h3>Example 16.14. Importing modules dynamically</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>sys = __import__('sys')</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>os = __import__('os')</kbd>
<samp class=p>>>> </samp><kbd>re = __import__('re')</kbd>
<samp class=p>>>> </samp><kbd>unittest = __import__('unittest')</kbd>
<samp class=p>>>> </samp><kbd>sys</kbd>                       <span>&#x2461;</span>
<samp>&lt;module 'sys' (built-in)></samp>
<samp class=p>>>> </samp><kbd>os</kbd>
<samp>&lt;module 'os' from '/usr/local/lib/python2.2/os.pyc'></samp>
</pre>
<ol>
<li>The built-in <code>__import__</code> function accomplishes the same goal as using the <code>import</code> statement, but it's an actual function, and it takes a string as an argument.
<li>The variable <var>sys</var> is now the <code>sys</code> module, just as if you had said <code>import sys</code>. The variable <var>os</var> is now the <code>os</code> module, and so forth.
<p>So <code>__import__</code> imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string,
but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module
to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.
<div class=example><h3>Example 16.15. Importing a list of modules dynamically</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>moduleNames = ['sys', 'os', 're', 'unittest']</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>moduleNames</kbd>
['sys', 'os', 're', 'unittest']
<samp class=p>>>> </samp><kbd>modules = map(__import__, moduleNames)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>modules</kbd> <span>&#x2462;</span>
<samp>[&lt;module 'sys' (built-in)>,
&lt;module 'os' from 'c:\Python22\lib\os.pyc'>,
&lt;module 're' from 'c:\Python22\lib\re.pyc'>,
&lt;module 'unittest' from 'c:\Python22\lib\unittest.pyc'>]</samp>
<samp class=p>>>> </samp><kbd>modules[0].version</kbd> <span>&#x2463;</span>
'2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
<samp class=p>>>> </samp><kbd>import sys</kbd>
<samp class=p>>>> </samp><kbd>sys.version</kbd>
'2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
</pre>
<ol>
<li><var>moduleNames</var> is just a list of strings. Nothing fancy, except that the strings happen to be names of modules that you could import, if
you wanted to.
<li>Surprise, you wanted to import them, and you did, by mapping the <code>__import__</code> function onto the list. Remember, this takes each element of the list (<var>moduleNames</var>) and calls the function (<code>__import__</code>) over and over, once with each element of the list, builds a list of the return values, and returns the result.
<li>So now from a list of strings, you've created a list of actual modules. (Your paths may be different, depending on your operating
system, where you installed Python, the phase of the moon, etc.)
<li>To drive home the point that these are real modules, let's look at some module attributes. Remember, <var>modules[0]</var> <em>is</em> the <code>sys</code> module, so <var>modules[0].version</var> <em>is</em> <var>sys.version</var>. All the other attributes and methods of these modules are also available. There's nothing magic about the <code>import</code> statement, and there's nothing magic about modules. Modules are objects. Everything is an object.
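<p>The same experiment carries over to Python 3 with one adjustment: there, <code>map</code> returns a lazy iterator rather than a list, so it has to be wrapped in <code>list()</code> before you can index into it. A minimal sketch, assuming Python 3:

```python
import sys

module_names = ['sys', 'os', 're', 'unittest']

# map() is lazy in Python 3; list() forces the imports to happen now.
modules = list(map(__import__, module_names))

# The first element is the very same object that "import sys" gives you.
print(modules[0] is sys)
print([module.__name__ for module in modules])
```

<p>For new code, the standard library documentation suggests <code>importlib.import_module</code> over calling <code>__import__</code> directly; for top-level module names like these, the two return the same module objects.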
<p>Now you should be able to put this all together and figure out what most of this chapter's code sample is doing.
<h2 id="regression.alltogether">16.7. Putting it all together</h2>
<p>You've learned enough now to deconstruct the first seven lines of this chapter's code sample: reading a directory and importing
selected modules within it.
<div class=example><h3>Example 16.16. The <code>regressionTest</code> function</h3><pre><code>
def regressionTest():
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    files = os.listdir(path)
    test = re.compile("test\.py$", re.IGNORECASE)
    files = filter(test.search, files)
    filenameToModuleName = lambda f: os.path.splitext(f)[0]
    moduleNames = map(filenameToModuleName, files)
    modules = map(__import__, moduleNames)
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite(map(load, modules))
</pre><p>Let's look at it line by line, interactively. Assume that the current directory is <code>c:\diveintopython3\py</code>, which contains the examples that come with this book, including this chapter's script. As you saw in <a href="#regression.path" title="16.2. Finding the path">Section 16.2, &#8220;Finding the path&#8221;</a>, the script directory will end up in the <var>path</var> variable, so let's hard-code that and go from there.
<div class=example><h3>Example 16.17. Step 1: Get all the files</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import sys, os, re, unittest</kbd>
<samp class=p>>>> </samp><kbd>path = r'c:\diveintopython3\py'</kbd>
<samp class=p>>>> </samp><kbd>files = os.listdir(path) </kbd>
<samp class=p>>>> </samp><kbd>files</kbd> <span>&#x2460;</span>
<samp>['BaseHTMLProcessor.py', 'LICENSE.txt', 'apihelper.py', 'apihelpertest.py',
'argecho.py', 'autosize.py', 'builddialectexamples.py', 'dialect.py',
'fileinfo.py', 'fullpath.py', 'kgptest.py', 'makerealworddoc.py',
'odbchelper.py', 'odbchelpertest.py', 'parsephone.py', 'piglatin.py',
'plural.py', 'pluraltest.py', 'pyfontify.py', 'regression.py', 'roman.py', 'romantest.py',
'uncurly.py', 'unicode2koi8r.py', 'urllister.py', 'kgp', 'plural', 'roman',
'colorize.py']</samp>
</pre>
<ol>
<li><var>files</var> is a list of all the files and directories in the script's directory. (If you've been running some of the examples already,
you may also see some <code>.pyc</code> files in there as well.)
<div class=example><h3>Example 16.18. Step 2: Filter to find the files you care about</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>test = re.compile("test\.py$", re.IGNORECASE)</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>files = filter(test.search, files)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>files</kbd> <span>&#x2462;</span>
['apihelpertest.py', 'kgptest.py', 'odbchelpertest.py', 'pluraltest.py', 'romantest.py']
</pre>
<ol>
<li>This regular expression will match any string that ends with <code>test.py</code>. Note that you need to escape the period, since a period in a regular expression usually means &#8220;match any single character&#8221;, but you actually want to match a literal period instead.
<li>The <code>search</code> method of the compiled regular expression is a function like any other, so you can pass it to <code>filter</code> and keep only the files and directories
whose names match the regular expression.
<li>And you're left with the list of unit testing scripts, because they were the only ones named <code>SOMETHINGtest.py</code>.
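<p>To see why the escaped period matters, compare the pattern with and without the backslash. This is a minimal sketch, assuming Python 3 (where a list comprehension replaces <code>filter</code>, which returns a lazy iterator there); the file list is shortened, and <code>contestspy</code> is a made-up name added only to trip up the unescaped pattern.

```python
import re

files = ['apihelper.py', 'apihelpertest.py', 'kgptest.py', 'roman.py',
         'romantest.py', 'kgp', 'contestspy']

escaped = re.compile(r"test\.py$", re.IGNORECASE)    # literal period
unescaped = re.compile(r"test.py$", re.IGNORECASE)   # period matches any character

print([f for f in files if escaped.search(f)])
print([f for f in files if unescaped.search(f)])
```

<p>The escaped pattern keeps only the three names that really end in <code>test.py</code>. The unescaped pattern also lets <code>contestspy</code> through, because its <code>s</code> satisfies the "any single character" period.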
<div class=example><h3>Example 16.19. Step 3: Map filenames to module names</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>filenameToModuleName = lambda f: os.path.splitext(f)[0]</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>filenameToModuleName('romantest.py')</kbd> <span>&#x2461;</span>
'romantest'
<samp class=p>>>> </samp><kbd>filenameToModuleName('odbchelpertest.py')</kbd>
'odbchelpertest'
<samp class=p>>>> </samp><kbd>moduleNames = map(filenameToModuleName, files)</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>moduleNames</kbd> <span>&#x2463;</span>
['apihelpertest', 'kgptest', 'odbchelpertest', 'pluraltest', 'romantest']
</pre>
<ol>
<li>As you saw in <a href="#apihelper.lambda" title="4.7. Using lambda Functions">Section 4.7, &#8220;Using lambda Functions&#8221;</a>, <code>lambda</code> is a quick-and-dirty way of creating an inline, one-line function. This one takes a filename with an extension and returns
just the filename part, using the standard library function <code>os.path.splitext</code> that you saw in <a href="#splittingpathnames.example" title="Example 6.17. Splitting Pathnames">Example 6.17, &#8220;Splitting Pathnames&#8221;</a>.
<li><var>filenameToModuleName</var> is a function. There's nothing magic about <code>lambda</code> functions as opposed to regular functions that you define with a <code>def</code> statement. You can call the <var>filenameToModuleName</var> function like any other, and it does just what you wanted it to do: strips the file extension off of its argument.
<li>Now you can apply this function to each file in the list of unit test files, using <code>map</code>.
<li>And the result is just what you wanted: a list of module names, as strings.
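<p>Here is the same step as a minimal sketch, assuming Python 3, with a named function in place of the <code>lambda</code> to show that the two are interchangeable. The name <code>filename_to_module_name</code> and the <code>archive.tar.gz</code> sample are illustrative.

```python
import os


def filename_to_module_name(filename):
    """Strip the file extension, leaving the module name."""
    return os.path.splitext(filename)[0]


files = ['apihelpertest.py', 'kgptest.py', 'romantest.py']
module_names = [filename_to_module_name(f) for f in files]
print(module_names)

# splitext returns a (root, extension) pair and splits only at the last period.
print(os.path.splitext('romantest.py'))
print(os.path.splitext('archive.tar.gz'))
```

<p>Because <code>os.path.splitext</code> splits only at the last period, a name with several periods keeps everything before the final extension.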
<div class=example><h3>Example 16.20. Step 4: Mapping module names to modules</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>modules = map(__import__, moduleNames)</kbd><span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>modules</kbd> <span>&#x2461;</span>
<samp>[&lt;module 'apihelpertest' from 'apihelpertest.py'>,
&lt;module 'kgptest' from 'kgptest.py'>,
&lt;module 'odbchelpertest' from 'odbchelpertest.py'>,
&lt;module 'pluraltest' from 'pluraltest.py'>,
&lt;module 'romantest' from 'romantest.py'>]</samp>
<samp class=p>>>> </samp><kbd>modules[-1]</kbd> <span>&#x2462;</span>
&lt;module 'romantest' from 'romantest.py'>
</pre>
<ol>
<li>As you saw in <a href="#regression.import" title="16.6. Dynamically importing modules">Section 16.6, &#8220;Dynamically importing modules&#8221;</a>, you can use a combination of <code>map</code> and <code>__import__</code> to map a list of module names (as strings) into actual modules (which you can call or access like any other module).
<li><var>modules</var> is now a list of modules, fully accessible like any other module.
<li>The last module in the list <em>is</em> the <code>romantest</code> module, just as if you had said <code>import romantest</code>.
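<p>The same mapping can be sketched outside the interactive shell. This version is written for Python 3, where <code>map</code> returns a lazy iterator rather than a list; the module names are standard-library modules picked only for illustration, not the test modules from this chapter.

```python
# Python 3 sketch: turn module names (strings) into module objects.
# The names below are standard-library modules chosen only for illustration.
module_names = ["os", "json", "string"]

# In Python 3, map() returns a lazy iterator, so wrap it in list()
# to get an indexable list like the one in the Python 2 session above.
modules = list(map(__import__, module_names))

# The last element is the real 'string' module object,
# just as if you had written "import string".
last = modules[-1]
print(last.__name__)
print(last.ascii_uppercase[:3])
```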
<div class=example><h3>Example 16.21. Step 5: Loading the modules into a test suite</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>load = unittest.defaultTestLoader.loadTestsFromModule </kbd>
<samp class=p>>>> </samp><kbd>map(load, modules)</kbd> <span>&#x2460;</span>
<samp>[&lt;unittest.TestSuite tests=[
&lt;unittest.TestSuite tests=[&lt;apihelpertest.BadInput testMethod=testNoObject>]>,
&lt;unittest.TestSuite tests=[&lt;apihelpertest.KnownValues testMethod=testApiHelper>]>,
&lt;unittest.TestSuite tests=[
&lt;apihelpertest.ParamChecks testMethod=testCollapse>,
&lt;apihelpertest.ParamChecks testMethod=testSpacing>]>,
...
]
]</samp>
<samp class=p>>>> </samp><kbd>unittest.TestSuite(map(load, modules))</kbd> <span>&#x2461;</span>
</pre>
<ol>
<li>These are real module objects. Not only can you access them like any other module, instantiate classes and call functions,
you can also introspect into the module to figure out which classes and functions it has in the first place. That's what
the <code>loadTestsFromModule</code> method does: it introspects into each module and returns a <code>unittest.TestSuite</code> object for each module. Each <code>TestSuite</code> object actually contains a list of <code>TestSuite</code> objects, one for each <code>TestCase</code> class in your module, and each of those <code>TestSuite</code> objects contains a list of tests, one for each test method in that class.
<li>Finally, you wrap the list of <code>TestSuite</code> objects into one big test suite. The <code>unittest</code> module has no problem traversing this tree of nested test suites within test suites; eventually it gets down to an individual
test method and executes it, verifies that it passes or fails, and moves on to the next one.
<p>This introspection process is what the <code>unittest</code> module usually does for us. Remember that magic-looking <code>unittest.main()</code> function that our individual test modules called to kick the whole thing off? <code>unittest.main()</code> actually creates an instance of <code>unittest.TestProgram</code>, which in turn uses <code>unittest.defaultTestLoader</code> (a ready-made <code>unittest.TestLoader</code> instance) to load the tests from the module that called it. (How does it get a reference to the module that called it if you don't give
it one? By using the equally-magic <code>__import__('__main__')</code> command, which dynamically imports the currently-running module. I could write a book on all the tricks and techniques used
in the <code>unittest</code> module, but then I'd never finish this one.)
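<p>A minimal sketch of that <code>__import__('__main__')</code> trick, written for Python 3. It only shows that the call hands back the module object for the currently running top-level program; it is not a copy of what <code>unittest</code> does internally.

```python
import sys

# __import__ returns the module object registered under the given name.
# '__main__' is the name Python gives to the script (or interactive session)
# that is currently running as the top-level program.
main_module = __import__("__main__")

# It is the very same object that sys.modules holds under that key.
same_object = main_module is sys.modules["__main__"]
print(main_module.__name__, same_object)
```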
<div class=example><h3>Example 16.22. Step 6: Telling <code>unittest</code> to use your test suite</h3><pre><code>
if __name__ == "__main__":
    unittest.main(defaultTest="regressionTest") <span>&#x2460;</span>
</pre>
<ol>
<li>Instead of letting the <code>unittest</code> module do all its magic for us, you've done most of it yourself. You've created a function (<code>regressionTest</code>) that imports the modules yourself, calls <code>unittest.defaultTestLoader</code> yourself, and wraps it all up in a test suite. Now all you need to do is tell <code>unittest</code> that, instead of looking for tests and building a test suite in the usual way, it should just call the <code>regressionTest</code> function, which returns a ready-to-use <code>TestSuite</code>.
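<p>Put together, a <code>regressionTest</code> function along these lines could look like the sketch below (Python 3). The <code>KnownValues</code> class and the in-memory <code>fakestringtest</code> module are hypothetical stand-ins so the example runs without any test files on disk; the real <code>regression.py</code> finds and imports its modules from the filesystem instead.

```python
import io
import types
import unittest


class KnownValues(unittest.TestCase):
    """Hypothetical stand-in for a test class in one of your test modules."""

    def test_upper(self):
        self.assertEqual("abc".upper(), "ABC")

    def test_join(self):
        self.assertEqual("".join(["a", "b"]), "ab")


# Hypothetical in-memory module, so the sketch needs no files on disk.
fake_test_module = types.ModuleType("fakestringtest")
fake_test_module.KnownValues = KnownValues


def regressionTest(modules=(fake_test_module,)):
    """Return one TestSuite wrapping the tests of every module given."""
    load = unittest.defaultTestLoader.loadTestsFromModule
    return unittest.TestSuite([load(module) for module in modules])


suite = regressionTest()
count = suite.countTestCases()

# Run the suite directly, sending the report to an in-memory buffer.
# In a script you would instead call unittest.main(defaultTest="regressionTest").
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(count, result.wasSuccessful())
```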
<h2 id="regression.summary">16.8. Summary</h2>
<p>The <code>regression.py</code> program and its output should now make perfect sense.
<p>You should now feel comfortable doing all of these things:
<div class=itemizedlist>
<ul>
<li>Manipulating <a href="#regression.path" title="16.2. Finding the path">path information</a> from the command line.
<li>Filtering lists <a href="#regression.filter" title="16.3. Filtering lists revisited">using <code>filter</code></a> instead of list comprehensions.
<li>Mapping lists <a href="#regression.map" title="16.4. Mapping lists revisited">using <code>map</code></a> instead of list comprehensions.
<li>Dynamically <a href="#regression.import" title="16.6. Dynamically importing modules">importing modules</a>.
</ul>
<div class=footnotes><br><hr width="100" align="left">
<div class=footnote>
<p><sup>[<a name="ftn.d0e35697" href="#d0e35697">7</a>] </sup>Technically, the second argument to <code>filter</code> can be any sequence, including lists, tuples, and custom classes that act like lists by defining the <code>__getitem__</code> special method. If possible, <code>filter</code> will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.
<div class=footnote>
<p><sup>[<a name="ftn.d0e36079" href="#d0e36079">8</a>] </sup>Again, I should point out that <code>map</code> can take a list, a tuple, or any object that acts like a sequence. See previous footnote about <code>filter</code>.
<div class=chapter>
<h2 id="soundex">Chapter 18. Performance Tuning</h2>
<p>Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it <em>too</em> much.
<h2 id="soundex.divein">18.1. Diving in</h2>
<p>There are so many pitfalls involved in optimizing your code, it's hard to know where to start.
<p>Let's start here: <em>are you sure you need to do it at all?</em> Is your code really so bad? Is it worth the time to tune it? Over the lifetime of your application, how much time is going
to be spent running that code, compared to the time spent waiting for a remote database server, or waiting for user input?
<p>Second, <em>are you sure you're done coding?</em> Premature optimization is like spreading frosting on a half-baked cake. You spend hours or days (or more) optimizing your
code for performance, only to discover it doesn't do what you need it to do. That's time down the drain.
<p>This is not to say that code optimization is worthless, but you need to look at the whole system and decide whether it's the
best use of your time. Every minute you spend optimizing code is a minute you're not spending adding new features, or writing
documentation, or playing with your kids, or writing unit tests.
<p>Oh yes, unit tests. It should go without saying that you need a complete set of unit tests before you begin performance tuning.
The last thing you need is to introduce new bugs while fiddling with your algorithms.
<p>With these caveats in place, let's look at some techniques for optimizing Python code. The code in question is an implementation of the Soundex algorithm. Soundex was a method used in the early 20th century
for categorizing surnames in the United States census. It grouped similar-sounding names together, so even if a name was
misspelled, researchers had a chance of finding it. Soundex is still used today for much the same reason, although of course
we use computerized database servers now. Most database servers include a Soundex function.
<p>There are several subtle variations of the Soundex algorithm. This is the one used in this chapter:
<div class=orderedlist>
<ol>
<li>Keep the first letter of the name as-is.
<li>Convert the remaining letters to digits, according to a specific table:
<div class=itemizedlist>
<ul>
<li>B, F, P, and V become 1.
<li>C, G, J, K, Q, S, X, and Z become 2.
<li>D and T become 3.
<li>L becomes 4.
<li>M and N become 5.
<li>R becomes 6.
<li>All other letters become 9.
</ul>
<li>Remove consecutive duplicates.
<li>Remove all 9s altogether.
<li>If the result is shorter than four characters (the first letter plus three digits), pad the result with trailing zeros.
<li>If the result is longer than four characters, discard everything after the fourth character.
</ol>
<p>For example, my name, <code>Pilgrim</code>, becomes P942695. That has no consecutive duplicates, so nothing to do there. Then you remove the 9s, leaving P4265. That's
too long, so you discard the excess character, leaving P426.
<p>Another example: <code>Woo</code> becomes W99, which becomes W9, which becomes W, which gets padded with zeros to become W000.
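<p>To check the two worked examples by machine, here is a small tracing sketch (Python 3) that returns the intermediate string after each step. The names <code>GROUPS</code>, <code>DIGIT</code>, and <code>soundex_stages</code> are made up for this illustration, and the function assumes its input is a non-empty string of letters.

```python
# Digit table from the list above; every letter not listed becomes "9".
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
DIGIT = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def soundex_stages(name):
    """Return the intermediate result after each step, for inspection."""
    name = name.upper()
    # Steps 1-2: keep the first letter, convert the rest to digits.
    coded = name[0] + "".join(DIGIT.get(c, "9") for c in name[1:])
    # Step 3: drop a character when it equals the one just before it.
    deduped = coded[0]
    for c in coded[1:]:
        if c != deduped[-1]:
            deduped += c
    # Step 4: remove the 9s.
    stripped = deduped.replace("9", "")
    # Steps 5-6: pad with zeros to four characters, then truncate.
    final = (stripped + "000")[:4]
    return coded, deduped, stripped, final


print(soundex_stages("Pilgrim"))  # ('P942695', 'P942695', 'P4265', 'P426')
print(soundex_stages("Woo"))      # ('W99', 'W9', 'W', 'W000')
```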
<p>Here's a first attempt at a Soundex function:
<div class=example><h3>Example 18.1. <code>soundex/stage1/soundex1a.py</code></h3>
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
<pre><code>
import string, re

charToSoundex = {"A": "9",
                 "B": "1",
                 "C": "2",
                 "D": "3",
                 "E": "9",
                 "F": "1",
                 "G": "2",
                 "H": "9",
                 "I": "9",
                 "J": "2",
                 "K": "2",
                 "L": "4",
                 "M": "5",
                 "N": "5",
                 "O": "9",
                 "P": "1",
                 "Q": "2",
                 "R": "6",
                 "S": "2",
                 "T": "3",
                 "U": "9",
                 "V": "1",
                 "W": "9",
                 "X": "2",
                 "Y": "9",
                 "Z": "2"}

def soundex(source):
    "convert string to Soundex equivalent"
    # Soundex requirements:
    # source string must be at least 1 character
    # and must consist entirely of letters
    allChars = string.uppercase + string.lowercase
    if not re.search('^[%s]+$' % allChars, source):
        return "0000"
    # Soundex algorithm:
    # 1. make first character uppercase
    source = source[0].upper() + source[1:]
    # 2. translate all other characters to Soundex digits
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
    # 3. remove consecutive duplicates
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # 4. remove all "9"s
    digits3 = re.sub('9', '', digits2)
    # 5. pad end with "0"s to 4 characters
    while len(digits3) &lt; 4:
        digits3 += "0"
    # 6. return first 4 characters
    return digits3[:4]

if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><div class=itemizedlist>
<h3>Further Reading on Soundex</h3>
<ul>
<li><a href="http://www.avotaynu.com/soundex.html">Soundexing and Genealogy</a> gives a chronology of the evolution of the Soundex and its regional variations.
</ul>
<h2 id="soundex.timeit">18.2. Using the <code>timeit</code> Module</h2>
<p>The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function.
<p>Timing short pieces of code is incredibly complex. How much processor time is your computer devoting to running this code?
Are there things running in the background? Are you sure? Every modern computer has background processes running, some all
the time, some intermittently. Cron jobs fire off at consistent intervals; background services occasionally &#8220;wake up&#8221; to do useful things like check for new mail, connect to instant messaging servers, check for application updates, scan for
viruses, check whether a disk has been inserted into your CD drive in the last 100 nanoseconds, and so on. Before you start
your timing tests, turn everything off and disconnect from the network. Then turn off all the things you forgot to turn off
the first time, then turn off the service that's incessantly checking whether the network has come back yet, then ...
<p>And then there's the matter of the variations introduced by the timing framework itself. Does the Python interpreter cache method name lookups? Does it cache code block compilations? Regular expressions? Will your code have
side effects if run more than once? Don't forget that you're dealing with small fractions of a second, so small mistakes
in your timing framework will irreparably skew your results.
<p>The Python community has a saying: &#8220;Python comes with batteries included.&#8221; Don't write your own timing framework. Python 2.3 comes with a perfectly good one called <code>timeit</code>.
<div class=example><h3>Example 18.2. Introducing <code>timeit</code></h3>
<pre class=screen>
<samp class=p>>>> </samp><kbd>import timeit</kbd>
<samp class=p>>>> </samp><kbd>t = timeit.Timer("soundex.soundex('Pilgrim')",</kbd>
<samp class=p>... </samp>"import soundex") <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>t.timeit()</kbd> <span>&#x2461;</span>
8.21683733547
<samp class=p>>>> </samp><kbd>t.repeat(3, 2000000)</kbd> <span>&#x2462;</span>
[16.48319309109, 16.46128984923, 16.44203948912]
</pre>
<ol>
<li>The <code>timeit</code> module defines one class, <code>Timer</code>, which takes two arguments. Both arguments are strings. The first argument is the statement you wish to time; in this case,
you are timing a call to the Soundex function within the <code>soundex</code> module with an argument of <code>'Pilgrim'</code>. The second argument to the <code>Timer</code> class is the import statement that sets up the environment for the statement. Internally, <code>timeit</code> sets up an isolated virtual environment, manually executes the setup statement (importing the <code>soundex</code> module), then manually compiles and executes the timed statement (calling the Soundex function).
<li>Once you have the <code>Timer</code> object, the easiest thing to do is call <code>timeit()</code>, which calls your function 1 million times and returns the number of seconds it took to do it.
<li>The other major method of the <code>Timer</code> object is <code>repeat()</code>, which takes two optional arguments. The first argument is the number of times to repeat the entire test, and the second
argument is the number of times to call the timed statement within each test. They default
to <code>3</code> and <code>1000000</code> respectively. The <code>repeat()</code> method returns a list of the times each test cycle took, in seconds.
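<p>As a script rather than an interactive session, the same calls might look like the sketch below (Python 3). The timed statement is a trivial stand-in chosen so the example runs without the <code>soundex</code> module, and the iteration counts are deliberately small; both are assumptions made for illustration.

```python
import timeit

# Statement and setup are both strings, as described above.
# The statement here is a stand-in chosen for illustration; substitute
# "soundex.soundex('Pilgrim')" and "import soundex" to time the real thing.
timer = timeit.Timer("'Pilgrim'.upper()", "pass")

# Use small, explicit counts so the sketch finishes quickly;
# the default of one million calls per test would run far longer.
single = timer.timeit(number=10000)            # one float, in seconds
runs = timer.repeat(repeat=3, number=10000)    # a list of three floats

print(single, min(runs))
```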
<blockquote class="note FIXME">
<p><span>&#x261E;</span>You can use the <code>timeit</code> module on the command line to test an existing Python program, without modifying the code. See <a href="http://docs.python.org/lib/node396.html">http://docs.python.org/lib/node396.html</a> for documentation on the command-line flags.
<p>Note that <code>repeat()</code> returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the
Python interpreter is getting (and those pesky background processes that you can't get rid of). Your first thought might be to
say &#8220;Let's take the average and call that The True Number.&#8221;
<p>In fact, that's almost certainly wrong. The tests that took longer didn't take longer because of variations in your code
or in the Python interpreter; they took longer because of those pesky background processes, or other factors outside of the Python interpreter that you can't fully eliminate. If the different timing results differ by more than a few percent, you still
have too much variability to trust the results. Otherwise, take the minimum time and discard the rest.
<p>Python has a handy <code>min</code> function that takes a list and returns the smallest value:
<pre class=screen>
<samp class=p>>>> </samp><kbd>min(t.repeat(3, 1000000))</kbd>
8.22203948912
</pre><blockquote class="note FIXME">
<p><span>&#x261E;</span>The <code>timeit</code> module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out <a href="http://docs.python.org/lib/module-hotshot.html">the <code>hotshot</code> module</a>.
<h2 id="soundex.stage1">18.3. Optimizing Regular Expressions</h2>
<p>The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to
do this?
<p>If you answered &#8220;regular expressions&#8221;, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should
be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain.
Also for performance reasons.
<p>This code fragment from <code>soundex/stage1/soundex1a.py</code> checks whether the function argument <var>source</var> is a word made entirely of letters, with at least one letter (not the empty string):
<pre><code>
    allChars = string.uppercase + string.lowercase
    if not re.search('^[%s]+$' % allChars, source):
        return "0000"
</pre><p>How does <code>soundex1a.py</code> perform? For convenience, the <code>__main__</code> section of the script contains this code that calls the <code>timeit</code> module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for
each:
<pre><code>
if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><p>So how does <code>soundex1a.py</code> perform with this regular expression?
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1a.py
<samp>Woo             W000 19.3356647283
Pilgrim         P426 24.0772053431
Flingjingwaller F452 35.0463220884</samp>
</pre><p>As you might expect, the algorithm takes significantly longer when called with longer names. There will be a few things we
can do to narrow that gap (make the function take less relative time for longer input), but the nature of the algorithm dictates
that it will never run in constant time.
<p>The other thing to keep in mind is that we are testing a representative sample of names. <code>Woo</code> is a kind of trivial case, in that it gets shortened to a single letter and then padded with zeros. <code>Pilgrim</code> is a normal case, of average length and a mixture of significant and ignored letters. <code>Flingjingwaller</code> is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range
of different cases.
<p>So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters
(<code>A-Z</code> in uppercase, and <code>a-z</code> in lowercase), we can use a shorthand regular expression syntax. Here is <code>soundex/stage1/soundex1b.py</code>:
<pre><code>
    if not re.search('^[A-Za-z]+$', source):
        return "0000"
</pre><p><code>timeit</code> says <code>soundex1b.py</code> is slightly faster than <code>soundex1a.py</code>, but nothing to get terribly excited about:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1b.py
<samp>Woo             W000 17.1361133887
Pilgrim         P426 21.8201693232
Flingjingwaller F452 32.7262294509</samp>
</pre><p>We saw in <a href="#roman.refactoring" title="15.3. Refactoring">Section 15.3, &#8220;Refactoring&#8221;</a> that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across
function calls, we can compile it once and use the compiled version. Here is <code>soundex/stage1/soundex1c.py</code>:
<pre><code>
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
    if not isOnlyChars(source):
        return "0000"
</pre><p>Using a compiled regular expression in <code>soundex1c.py</code> is significantly faster:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1c.py
<samp>Woo             W000 14.5348347346
Pilgrim         P426 19.2784703084
Flingjingwaller F452 30.0893873383</samp>
</pre><p>But is this the wrong path? The logic here is simple: the input <var>source</var> needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each
character, and do away with regular expressions altogether?
<p>Here is <code>soundex/stage1/soundex1d.py</code>:
<pre><code>
    if not source:
        return "0000"
    for c in source:
        if not ('A' &lt;= c &lt;= 'Z') and not ('a' &lt;= c &lt;= 'z'):
            return "0000"
</pre><p>It turns out that this technique in <code>soundex1d.py</code> is <em>not</em> faster than using a compiled regular expression (and, in these timings, it beats the non-compiled regular expression only for the shortest name):
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1d.py
<samp>Woo             W000 15.4065058548
Pilgrim         P426 22.2753567842
Flingjingwaller F452 37.5845122774</samp>
</pre><p>Why isn't <code>soundex1d.py</code> faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this
loop is written in Python, and runs through the Python interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted.
Regular expressions are never the right answer... except when they are.
<p>It turns out that Python offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book.
The method is called <code>isalpha()</code>, and it checks whether a string contains only letters.
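<p>Before looking at timings, it helps to confirm that the three ways of validating input agree. The sketch below (Python 3) wraps each technique in a function with a made-up name and runs them over a few sample strings, also chosen just for illustration.

```python
import re

is_only_letters = re.compile("^[A-Za-z]+$").search


def valid_regex(source):
    # Compiled regular expression, as in soundex1c.py.
    return bool(is_only_letters(source))


def valid_loop(source):
    # Character-by-character loop, as in soundex1d.py.
    if not source:
        return False
    for c in source:
        if not ("A" <= c <= "Z") and not ("a" <= c <= "z"):
            return False
    return True


def valid_isalpha(source):
    # "".isalpha() is False, so the empty string is rejected
    # without a separate emptiness test.
    return source.isalpha()


samples = ["Pilgrim", "Woo", "", "O'Brien", "R2D2", "two words"]
for s in samples:
    print(repr(s), valid_regex(s), valid_loop(s), valid_isalpha(s))
```

<p>One difference to keep in mind: in Python 3, <code>isalpha()</code> also accepts non-ASCII letters such as <code>'é'</code>, which the <code>[A-Za-z]</code> pattern and the loop both reject.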
<p>This is <code>soundex/stage1/soundex1e.py</code>:
<pre><code>
    if (not source) or (not source.isalpha()):
        return "0000"
</pre><p>How much did we gain by using this specific method in <code>soundex1e.py</code>? Quite a bit.
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1e.py
<samp>Woo             W000 13.5069504644
Pilgrim         P426 18.2199394057
Flingjingwaller F452 28.9975225902</samp>
</pre><div class=example><h3>Example 18.3. Best Result So Far: <code>soundex/stage1/soundex1e.py</code></h3><pre><code>
import string, re

charToSoundex = {"A": "9",
                 "B": "1",
                 "C": "2",
                 "D": "3",
                 "E": "9",
                 "F": "1",
                 "G": "2",
                 "H": "9",
                 "I": "9",
                 "J": "2",
                 "K": "2",
                 "L": "4",
                 "M": "5",
                 "N": "5",
                 "O": "9",
                 "P": "1",
                 "Q": "2",
                 "R": "6",
                 "S": "2",
                 "T": "3",
                 "U": "9",
                 "V": "1",
                 "W": "9",
                 "X": "2",
                 "Y": "9",
                 "Z": "2"}

def soundex(source):
    if (not source) or (not source.isalpha()):
        return "0000"
    source = source[0].upper() + source[1:]
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]

if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage2">18.4. Optimizing Dictionary Lookups</h2>
<p>The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to
do this?
<p>The most obvious solution is to define a dictionary with individual characters as keys and their corresponding digits as values,
and do dictionary lookups on each character. This is what we have in <code>soundex/stage1/soundex1c.py</code> (the version with the compiled regular expression, which the stage 2 code builds on):
<pre><code>
charToSoundex = {"A": "9",
                 "B": "1",
                 "C": "2",
                 "D": "3",
                 "E": "9",
                 "F": "1",
                 "G": "2",
                 "H": "9",
                 "I": "9",
                 "J": "2",
                 "K": "2",
                 "L": "4",
                 "M": "5",
                 "N": "5",
                 "O": "9",
                 "P": "1",
                 "Q": "2",
                 "R": "6",
                 "S": "2",
                 "T": "3",
                 "U": "9",
                 "V": "1",
                 "W": "9",
                 "X": "2",
                 "Y": "9",
                 "Z": "2"}

def soundex(source):
    # ... input check omitted for brevity ...
    source = source[0].upper() + source[1:]
    digits = source[0]
    for s in source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
</pre><p>You timed <code>soundex1c.py</code> already; this is how it performs:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage1></samp>python soundex1c.py
<samp>Woo             W000 14.5341678901
Pilgrim         P426 19.2650071448
Flingjingwaller F452 30.1003563302</samp>
</pre><p>This code is straightforward, but is it the best solution? Calling <code>upper()</code> on each individual character seems inefficient; it would probably be better to call <code>upper()</code> once on the entire string.
<p>Then there's the matter of incrementally building the <var>digits</var> string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one.
<p>Python is good at lists, though. It can treat a string as a list of characters automatically. And lists are easy to combine into
strings again, using the string method <code>join()</code>.
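<p>A tiny sketch (Python 3) of the two styles being compared, using a made-up lowercase conversion as the per-character work. It shows only that both produce the same string; it makes no claim about which is faster.

```python
word = "PILGRIM"

# A string can be iterated (or converted) character by character.
chars = list(word)

# Incremental approach: each += builds a new string object.
incremental = ""
for c in chars:
    incremental += c.lower()

# List-based approach: collect the pieces, then join them once.
joined = "".join([c.lower() for c in chars])

print(incremental, joined)
```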
<p>Here is <code>soundex/stage2/soundex2a.py</code>, which converts letters to digits by using <code>map</code> and <code>lambda</code>:
<pre><code>
def soundex(source):
    # ...
    source = source.upper()
    digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
</pre><p>Surprisingly, <code>soundex2a.py</code> is not faster:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2a.py
<samp>Woo             W000 15.0097526362
Pilgrim         P426 19.254806407
Flingjingwaller F452 29.3790847719</samp>
</pre><p>The overhead of the anonymous <code>lambda</code> function kills any performance you gain by dealing with the string as a list of characters.
<p><code>soundex/stage2/soundex2b.py</code> uses a list comprehension instead of <code>map</code> and <code>lambda</code>:
<pre><code>
    source = source.upper()
    digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
</pre><p>Using a list comprehension in <code>soundex2b.py</code> is faster than using <code>map</code> and <code>lambda</code> in <code>soundex2a.py</code>, and in these timings it also comes in ahead of the original code (incrementally building a string in <code>soundex1c.py</code>):
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2b.py
<samp>Woo             W000 13.4221324219
Pilgrim         P426 16.4901234654
Flingjingwaller F452 25.8186157738</samp>
</pre><p>It's time for a radically different approach. Dictionary lookups are a general purpose tool. Dictionary keys can be any
length string (or many other data types), but in this case we are only dealing with single-character keys <em>and</em> single-character values. It turns out that Python has a specialized function for handling exactly this situation: the <code>string.maketrans</code> function.
<p>This is <code>soundex/stage2/soundex2c.py</code>:
<pre><code>
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
def soundex(source):
    # ...
    digits = source[0].upper() + source[1:].translate(charToSoundex)
</pre><p>What the heck is going on here? <code>string.maketrans</code> creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument
is the string <code>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</code>, and the second argument is the string <code>9123912992245591262391929291239129922455912623919292</code>. See the pattern? It's the same conversion pattern we were setting up longhand with a dictionary. A maps to 9, B maps
to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized data structure that you can access using the
string method <code>translate</code>, which translates each character into the corresponding digit, according to the matrix defined by <code>string.maketrans</code>.
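<p>For readers on Python 3, where <code>string.maketrans</code>, <code>string.uppercase</code>, and <code>string.lowercase</code> no longer exist, the same idea can be sketched with <code>str.maketrans</code> and the <code>string.ascii_*</code> constants. Note that in Python 3 the table returned by <code>str.maketrans</code> actually is a dictionary, keyed by character code points. The snake_case variable names are just this sketch's spelling.

```python
import string

# Python 3 spelling of the same idea: str.maketrans replaces string.maketrans,
# and string.ascii_uppercase / ascii_lowercase replace string.uppercase / lowercase.
all_chars = string.ascii_uppercase + string.ascii_lowercase
char_to_soundex = str.maketrans(all_chars, "91239129922455912623919292" * 2)

source = "Pilgrim"
# Keep the first letter, translate every remaining letter to its digit.
digits = source[0].upper() + source[1:].translate(char_to_soundex)
print(digits)  # P942695
```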
<p><code>timeit</code> shows that <code>soundex2c.py</code> is significantly faster than defining a dictionary and looping through the input and building the output incrementally:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2c.py
<samp>Woo             W000 11.437645008
Pilgrim         P426 13.2825062962
Flingjingwaller F452 18.5570110168</samp>
</pre><p>You're not going to get much better than that. Python has a specialized function that does exactly what you want to do; use it and move on.
<div class=example><h3>Example 18.4. Best Result So Far: <code>soundex/stage2/soundex2c.py</code></h3><pre><code>
import string, re

allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search

def soundex(source):
    if not isOnlyChars(source):
        return "0000"
    digits = source[0].upper() + source[1:].translate(charToSoundex)
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]

if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</pre><h2 id="soundex.stage3">18.5. Optimizing List Operations</h2>
<p>The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this?
<p>Here's the code we have so far, in <code>soundex/stage2/soundex2c.py</code>:
<pre><code>
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
</pre><p>Here are the performance results for <code>soundex2c.py</code>:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2c.py
<samp>Woo             W000 12.6070768771
Pilgrim         P426 14.4033353401
Flingjingwaller F452 19.7774882003</samp>
</pre><p>The first thing to consider is whether it's efficient to check <var>digits2[-1]</var> each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate
variable, and checking that instead?
<p>To answer this question, here is <code>soundex/stage3/soundex3a.py</code>:
<pre><code>
    digits2 = ''
    last_digit = ''
    for d in digits:
        if d != last_digit:
            digits2 += d
            last_digit = d
</pre><p><code>soundex3a.py</code> does not run any faster than <code>soundex2c.py</code>, and may even be slightly slower (although it's not enough of a difference to say for sure):
<pre class=screen>
<samp class=p>C:\samples\soundex\stage3></samp>python soundex3a.py
<samp>Woo             W000 11.5346048171
Pilgrim         P426 13.3950636184
Flingjingwaller F452 18.6108927252</samp>
</pre><p>Why isn't <code>soundex3a.py</code> faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing <var>digits2[-1]</var> is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have <em>two</em> variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating
the list lookup.
<p>Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible
to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character
in the list, and that's not easy to do with a straightforward list comprehension.
<p>However, it is possible to create a list of index numbers using the built-in <code>range()</code> function, and use those index numbers to progressively search through the list and pull out each character that is different
from the previous character. That will give you a list of characters, and you can use the string method <code>join()</code> to reconstruct a string from that.
<p>Here is <code>soundex/stage3/soundex3b.py</code>:
<pre><code>
    digits2 = "".join([digits[i] for i in range(len(digits))
                       if i == 0 or digits[i-1] != digits[i]])
</pre><p>Is this faster? In a word, no.
<pre class=screen>
<samp class=p>C:\samples\soundex\stage3></samp>python soundex3b.py
<samp>Woo             W000 14.2245271396
Pilgrim         P426 17.8337165757
Flingjingwaller F452 25.9954005327</samp>
</pre><p>It's possible that the techniques so far have been too &#8220;string-centric&#8221;. Python can convert a string into a list of characters with a single command: <code>list('abc')</code> returns <code>['a', 'b', 'c']</code>. Furthermore, lists can be <em>modified in place</em> very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around
within a single list?
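<p>The list operations this approach relies on can be tried on their own first. This is a small sketch with made-up values (the names <var>chars</var>, <var>original</var> and <var>rebuilt</var> are illustrative only), written for Python 3:

```python
chars = list("abc")        # a string becomes a list of one-character strings
original = list(chars)     # keep a copy so the starting state stays visible

chars[1] = "z"             # item assignment changes the list in place
del chars[2:]              # slice deletion truncates the list in place
rebuilt = "".join(chars)   # join() converts the list back into a string
```

<p>No new list is created by the assignment or the <code>del</code>; both operate on the existing list object.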
<p>Here is <code>soundex/stage3/soundex3c.py</code>, which modifies a list in place to remove consecutive duplicate elements:
<pre><code>
digits = list(source[0].upper() + source[1:].translate(charToSoundex))
i=0
for item in digits:
    if item==digits[i]: continue
    i+=1
    digits[i]=item
del digits[i+1:]
digits2 = "".join(digits)
</code></pre><p>Is this faster than <code>soundex3a.py</code> or <code>soundex3b.py</code>? It edges out <code>soundex3b.py</code>, but it is still slower than <code>soundex3a.py</code>:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage3></samp>python soundex3c.py
<samp>Woo W000 14.1662554878
Pilgrim P426 16.0397885765
Flingjingwaller F452 22.1789341942</samp>
</pre><p>We haven't made any progress here at all, except to try and rule out several &#8220;clever&#8221; techniques. The fastest code we've seen so far was the original, most straightforward method (<code>soundex2c.py</code>). Sometimes it doesn't pay to be clever.
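<p>Before comparing speed, it is worth confirming that the approaches really do the same job. The following is a sketch, not the chapter's own test code: the three function names are hypothetical wrappers around the loop from <code>soundex2c.py</code>, the list comprehension from <code>soundex3b.py</code>, and the in-place version from <code>soundex3c.py</code>. It is written for Python 3 and uses a smaller iteration count than <code>timeit</code>'s default so that it finishes quickly.

```python
from timeit import Timer


def dedupe_loop(digits):
    """Straightforward version: build a new string one character at a time."""
    out = digits[0]
    for d in digits[1:]:
        if out[-1] != d:
            out += d
    return out


def dedupe_indexes(digits):
    """List comprehension over index numbers."""
    return "".join([digits[i] for i in range(len(digits))
                    if i == 0 or digits[i - 1] != digits[i]])


def dedupe_in_place(digits):
    """Compact a list in place, then cut off the unused tail."""
    items = list(digits)
    i = 0
    for item in items:
        if item == items[i]:
            continue
        i += 1
        items[i] = item
    del items[i + 1:]
    return "".join(items)


SAMPLE = "F49522952994496"   # translated form of 'Flingjingwaller'
candidates = (dedupe_loop, dedupe_indexes, dedupe_in_place)
results = [f(SAMPLE) for f in candidates]

if __name__ == "__main__":
    for f in candidates:
        t = Timer(lambda: f(SAMPLE))
        # Report the best of three runs, as the chapter's scripts do.
        print(f.__name__.ljust(16), min(t.repeat(repeat=3, number=20000)))
```

<p>All three entries in <var>results</var> are identical, so any timing difference comes from how the work is done, not from what is computed. The absolute numbers will differ from the ones shown in this chapter, which were measured on different hardware and an older Python.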
<div class=example><h3>Example 18.5. Best Result So Far: <code>soundex/stage2/soundex2c.py</code></h3><pre><code>
import string, re

allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search

def soundex(source):
    if not isOnlyChars(source):
        return "0000"
    # keep the first letter, translate the rest to digits
    digits = source[0].upper() + source[1:].translate(charToSoundex)
    # collapse consecutive duplicates
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # remove the 9s, pad with zeros, truncate to four characters
    digits3 = re.sub('9', '', digits2)
    while len(digits3) &lt; 4:
        digits3 += "0"
    return digits3[:4]

if __name__ == '__main__':
    from timeit import Timer
    names = ('Woo', 'Pilgrim', 'Flingjingwaller')
    for name in names:
        statement = "soundex('%s')" % name
        t = Timer(statement, "from __main__ import soundex")
        print name.ljust(15), soundex(name), min(t.repeat())
</code></pre><h2 id="soundex.stage4">18.6. Optimizing String Manipulation</h2>
<p>The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best
way to do this?
<p>This is what we have so far, taken from <code>soundex/stage2/soundex2c.py</code>:
<pre><code>
digits3 = re.sub('9', '', digits2)
while len(digits3) &lt; 4:
    digits3 += "0"
return digits3[:4]
</code></pre><p>These are the results for <code>soundex2c.py</code>:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage2></samp>python soundex2c.py
<samp>Woo W000 12.6070768771
Pilgrim P426 14.4033353401
Flingjingwaller F452 19.7774882003</samp>
</pre><p>The first thing to consider is replacing that regular expression with a loop. This code is from <code>soundex/stage4/soundex4a.py</code>:
<pre><code>
digits3 = ''
for d in digits2:
    if d != '9':
        digits3 += d
</code></pre><p>Is <code>soundex4a.py</code> faster? Yes it is:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4a.py
<samp>Woo W000 6.62865531792
Pilgrim P426 9.02247576158
Flingjingwaller F452 13.6328416042</samp>
</pre><p>But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's <code>soundex/stage4/soundex4b.py</code>:
<pre><code>
digits3 = digits2.replace('9', '')
</code></pre><p>Is <code>soundex4b.py</code> faster? That's an interesting question. It depends on the input:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4b.py
<samp>Woo W000 6.75477414029
Pilgrim P426 7.56652144337
Flingjingwaller F452 10.8727729362</samp>
</pre><p>The string method in <code>soundex4b.py</code> is faster than the loop for most names, but it's actually slightly slower than <code>soundex4a.py</code> in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case
faster can sometimes make other cases slower. In this case, the majority of cases will benefit from the change, so let's
leave it at that, but the principle is an important one to remember.
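<p>Since all three ways of removing the <code>'9'</code> placeholders are meant to be interchangeable, the choice between them is purely about speed. The sketch below puts them side by side under hypothetical function names and checks that they agree on the intermediate strings for the three test names; it is written for Python 3 and is not taken from the soundex scripts.

```python
import re


def strip_with_regex(digits2):
    """The soundex2c.py approach: a regular expression substitution."""
    return re.sub('9', '', digits2)


def strip_with_loop(digits2):
    """The soundex4a.py approach: copy every character except '9'."""
    digits3 = ''
    for d in digits2:
        if d != '9':
            digits3 += d
    return digits3


def strip_with_replace(digits2):
    """The soundex4b.py approach: a single string method call."""
    return digits2.replace('9', '')


# Intermediate values of digits2 for 'Woo', 'Pilgrim', 'Flingjingwaller'.
samples = ('W9', 'P942695', 'F49529529496')
for s in samples:
    assert strip_with_regex(s) == strip_with_loop(s) == strip_with_replace(s)
stripped = [strip_with_replace(s) for s in samples]
```

<p>To compare their speed on your own inputs, each function can be wrapped in a <code>timeit.Timer</code> in the same way the chapter's scripts time <code>soundex()</code>.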
<p>Last but not least, let's examine the final two steps of the algorithm: padding short results with zeros, and truncating long
results to four characters. The code you see in <code>soundex4b.py</code> does just that, but it's horribly inefficient. Take a look at <code>soundex/stage4/soundex4c.py</code> to see why:
<pre><code>
digits3 += '000'
return digits3[:4]
</code></pre><p>Why do we need a <code>while</code> loop to pad out the result? We know in advance that we're going to truncate the result to four characters, and we know that
we already have at least one character (the initial letter, which is passed unchanged from the original <var>source</var> variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the
exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution.
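<p>The equivalence is easy to check for every length that matters. This is a small sketch under hypothetical function names (neither appears in the soundex scripts), written for Python 3:

```python
def pad_with_loop(digits3):
    """Append zeros one at a time until there are four characters, then truncate."""
    while len(digits3) < 4:
        digits3 += "0"
    return digits3[:4]


def pad_with_slice(digits3):
    """Append three zeros unconditionally, then truncate."""
    return (digits3 + "000")[:4]


# Every real input has at least one character: the initial letter.
inputs = ["W", "W1", "W12", "W123", "P4265", "F4525246"]
padded = [pad_with_slice(s) for s in inputs]
```

<p>The shortcut depends on the guarantee that at least one character is present. For an empty string, <code>pad_with_slice</code> would return only three characters while the loop would return four, so the two are interchangeable only because <code>soundex()</code> never reaches this step with an empty string.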
<p>How much speed do we gain in <code>soundex4c.py</code> by dropping the <code>while</code> loop? It's significant:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4c.py
<samp>Woo W000 4.89129791636
Pilgrim P426 7.30642134685
Flingjingwaller F452 10.689832367</samp>
</pre><p>Finally, there is still one more thing you can do to these three lines of code to make them faster: you can combine them into
one line. Take a look at <code>soundex/stage4/soundex4d.py</code>:
<pre><code>
return (digits2.replace('9', '') + '000')[:4]
</code></pre><p>Putting all this code on one line in <code>soundex4d.py</code> is barely faster than <code>soundex4c.py</code>, and for the shortest name it is actually a hair slower:
<pre class=screen>
<samp class=p>C:\samples\soundex\stage4></samp>python soundex4d.py
<samp>Woo W000 4.93624105857
Pilgrim P426 7.19747593619
Flingjingwaller F452 10.5490700634</samp>
</pre><p>It is also significantly less readable, and for not much performance gain. Is that worth it? I hope you have good comments.
Performance isn't everything. Your optimization efforts must always be balanced against threats to your program's readability
and maintainability.
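<p>For reference, here is what the whole function might look like with the one-line ending in place. This is a reconstruction, not the actual <code>soundex4d.py</code> file: it takes the <code>soundex2c.py</code> listing, swaps in the final line shown above, and adapts the names to Python 3 (<code>string.ascii_uppercase</code>, <code>string.ascii_lowercase</code> and <code>str.maketrans</code> in place of the Python 2 spellings).

```python
import re
import string

# Python 3 spellings of the Python 2 names used in the chapter.
allChar = string.ascii_uppercase + string.ascii_lowercase
charToSoundex = str.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search


def soundex(source):
    if not isOnlyChars(source):
        return "0000"
    # Keep the first letter, translate the rest to Soundex digits.
    digits = source[0].upper() + source[1:].translate(charToSoundex)
    # Collapse consecutive duplicates.
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    # Strip the 9 placeholders, pad with zeros, truncate to four characters.
    return (digits2.replace('9', '') + '000')[:4]


codes = [soundex(name) for name in ('Woo', 'Pilgrim', 'Flingjingwaller')]
```

<p>The comment above the <code>return</code> statement does the work that three separate lines used to do by themselves, which is the readability cost being weighed here.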
<h2 id="soundex.summary">18.7. Summary</h2>
<p>This chapter has illustrated several important aspects of performance tuning in Python, and performance tuning in general.
<div class=itemizedlist>
<ul>
<li>If you need to choose between regular expressions and writing a loop, start with regular expressions. The regular expression
engine is compiled in C and runs natively on your computer; your loop is written in Python and runs through the Python interpreter. But measure before you commit: in this chapter, a plain loop beat <code>re.sub</code> at stripping a single character.
<li>If you need to choose between regular expressions and string methods, choose string methods. Both are compiled in C, so choose
the simpler one.
<li>General-purpose dictionary lookups are fast, but specialty functions such as <code>string.maketrans</code> and string methods such as <code>isalpha()</code> are faster. If Python has a custom-tailored function for you, use it.
<li>Don't be too clever. Sometimes the most obvious algorithm is also the fastest.
<li>Don't sweat it too much. Performance isn't everything.
</ul>
<p>I can't emphasize that last point strongly enough. Over the course of this chapter, you made this function three times faster
and saved 20 seconds over 1 million function calls. Great. Now think: over the course of those million function calls, how
many seconds will your surrounding application wait for a database connection? Or wait for disk I/O? Or wait for user input?
Don't spend too much time over-optimizing one algorithm, or you'll ignore obvious improvements somewhere else. Develop an
instinct for the sort of code that Python runs well, correct obvious blunders if you find them, and leave the rest alone.
</body>
</html>