couple of sections of new-and-improved "unit testing" chapter

Mark Pilgrim
2009-02-16 23:36:13 -05:00
parent 29129df299
commit 93849215bc
14 changed files with 765 additions and 527 deletions
+6 -6
@@ -44,7 +44,7 @@ body{counter-reset:h1 20}
<li><a href=#cantconvertbytesobject>Can&#8217;t convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
</ol>
</ol>
<h2 id=divingin>Introducing <code class=filename>chardet</code>: a mini-FAQ</h2>
<h2 id=divingin>Introducing <code class=filename>chardet</code>: a mini-<abbr>FAQ</abbr></h2>
<p class=fancy>When you think of &#8220;text,&#8221; you probably think of &#8220;characters and symbols I see on my computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it&#8217;s &#8220;text&#8221;, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
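<p>A minimal sketch of the &#8220;decryption key&#8221; idea, using only built-in <code>bytes</code> and <code>str</code> methods. The sample bytes and the three encodings are chosen purely for illustration; they do not come from <code class=filename>chardet</code>.

```python
# The same two bytes, decoded with three different "keys".
raw = b'\xc3\xa9'

as_utf8 = raw.decode('utf-8')      # one character: e with acute accent
as_latin1 = raw.decode('latin-1')  # two characters: A with tilde, copyright sign
as_cp1251 = raw.decode('cp1251')   # two characters again, but different ones

# Going the other way, one character maps to different byte sequences
# depending on the encoding used to store it.
utf8_bytes = '\u00e9'.encode('utf-8')      # two bytes
latin1_bytes = '\u00e9'.encode('latin-1')  # one byte

print(repr(as_utf8), repr(as_latin1), repr(as_cp1251))
print(utf8_bytes, latin1_bytes)
```

<p>None of the three decodings raises an error, which is why a wrong guess about the encoding tends to produce garbage rather than an exception.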
<h3 id=faq.what>What is character encoding auto-detection?</h3>
@@ -58,11 +58,11 @@ body{counter-reset:h1 20}
<h3 id=faq.yippie>Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</h3>
<p>Don&#8217;t do that. Virtually every format and protocol contains a method for specifying character encoding.
<ul>
<li>HTTP can define a <code>charset</code> parameter in the <code>Content-type</code> header.
<li>HTML documents can define a <code>&lt;meta http-equiv="content-type"&gt;</code> element in the <code>&lt;head&gt;</code> of a web page.
<li>XML documents can define an <code>encoding</code> attribute in the XML prolog.
<li><abbr>HTTP</abbr> can define a <code>charset</code> parameter in the <code>Content-type</code> header.
<li><abbr>HTML</abbr> documents can define a <code>&lt;meta http-equiv="content-type"&gt;</code> element in the <code>&lt;head&gt;</code> of a web page.
<li><abbr>XML</abbr> documents can define an <code>encoding</code> attribute in the <abbr>XML</abbr> prolog.
</ul>
<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over <abbr>HTTP</abbr>, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
<p>Despite the complexity, it&#8217;s worthwhile to follow standards and <a href=http://www.w3.org/2001/tag/doc/mime-respect>respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
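<p>As a rough sketch of what &#8220;use the explicit information&#8221; can look like for <abbr>HTTP</abbr>, the standard library&#8217;s <code>email.message.Message</code> class can parse the <code>charset</code> parameter out of a <code>Content-type</code> value. The helper name <code>charset_from_content_type()</code> is hypothetical, and this covers only the <abbr>HTTP</abbr> header, not the <abbr>HTML</abbr> or <abbr>XML</abbr> cases.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the declared charset (lowercased), or None if there is none.

    Hypothetical helper for illustration; it leans on the MIME header
    parsing that the standard library already provides.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset()

declared = charset_from_content_type('text/html; charset=UTF-8')
missing = charset_from_content_type('text/html')
print(declared, missing)
```

<p>Only when this kind of lookup returns nothing, and the relevant standard defines no usable default, does auto-detection become worth considering.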
<h3 id=faq.why>Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</h3>
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn&#8217;t work. There are also some poorly designed standards that have no way to specify encoding at all.
@@ -676,7 +676,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
<pre><code>class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that&#8217;s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII <em>bytes</em> in the range 128-255.
<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128&ndash;255 (0x80&ndash;0xFF). Wait, that&#8217;s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128-255.
<p>And therein lies the problem.
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code class=filename>universaldetector.py</code>:
<p class=skip><a href=#skipfeedhighbitdetectorcode>skip over this</a>
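<p>The string-versus-bytes mismatch described above can be reproduced in isolation. This is a standalone sketch, not the code from <code class=filename>universaldetector.py</code>; the sample data and variable names are made up for illustration.

```python
import re

data = b'caf\xc3\xa9'  # raw bytes, as you would get from a file opened in binary mode

str_pattern = re.compile(r'[\x80-\xFF]')    # pattern defined by a string
bytes_pattern = re.compile(b'[\x80-\xFF]')  # same range, pattern defined by bytes

# A string pattern cannot search a bytes object in Python 3.
try:
    str_pattern.search(data)
    error = None
except TypeError as e:
    error = e

# A bytes pattern can, and it finds the first byte in the range 128-255.
match = bytes_pattern.search(data)
print(type(error).__name__, match.start(), match.group())
```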
+388 -410
File diff suppressed because it is too large.
+1 -1
@@ -30,7 +30,7 @@ a:visited{color:darkorchid}
pre{white-space:pre-wrap;padding-left:2.154em;line-height:2.154;border-left:1px dotted}
.widgets{float:left}
.widgets,.widgets a,.download{font-size:small;line-height:2.154}
.block{clear:left}
.block,ol{clear:left}
pre a,.widgets a{padding:0.4375em 0;border:0}
.widgets a{text-decoration:underline}
pre a:hover{border:0}
+1 -1
@@ -23,7 +23,7 @@ li:last-child:before{content:"A. \00a0 \00a0"}
<li><a href=regular-expressions.html>Regular expressions</a>
<li>
<li>
<li>
<li><a href=unit-testing.html>Unit testing</a>
<li>
<li>
<li>
+4 -4
@@ -111,7 +111,7 @@ body{counter-reset:h1 2}
<li>Integers can be arbitrarily large.
</ol>
<blockquote class="note compare python2">
<p><span>&#x261E;</span>Python 2 had separate types for <code>int</code> and <code>long</code>. The <code>int</code> datatype was limited by <code>sys.maxint</code>, which varied by platform but was usually <code>2<sup>31</sup>-1</code>. Python 3 has just one integer type, which behaves mostly like the old <code>long</code> type from Python 2. See <a href=http://www.python.org/dev/peps/pep-0237>PEP 237</a> for details.
<p><span>&#x261E;</span>Python 2 had separate types for <code>int</code> and <code>long</code>. The <code>int</code> datatype was limited by <code>sys.maxint</code>, which varied by platform but was usually <code>2<sup>31</sup>-1</code>. Python 3 has just one integer type, which behaves mostly like the old <code>long</code> type from Python 2. See <a href=http://www.python.org/dev/peps/pep-0237><abbr>PEP</abbr> 237</a> for details.
</blockquote>
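<p>A quick way to see the single integer type at work in Python 3. The variable names are arbitrary.

```python
import sys

big = 2 ** 100              # far beyond any 32-bit or 64-bit machine word
bigger = big * big + 1      # arithmetic keeps working at any size

print(type(big))            # the same type as a small number like 1
print(big)
print(hasattr(sys, 'maxint'))  # the Python 2 limit no longer exists
```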
<p>You can do all kinds of things with numbers.
<pre class=screen>
@@ -137,7 +137,7 @@ body{counter-reset:h1 2}
<li>The <code>%</code> operator gives the remainder after performing integer division. <code>11</code> divided by <code>2</code> is <code>5</code> with a remainder of <code>1</code>, so the result here is <code>1</code>.
</ol>
<blockquote class="note compare python2">
<p><span>&#x261E;</span>In Python 2, the <code>/</code> operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the <code>/</code> operator always means floating point division. See <a href=http://www.python.org/dev/peps/pep-0238/>PEP 238</a> for details.
<p><span>&#x261E;</span>In Python 2, the <code>/</code> operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the <code>/</code> operator always means floating point division. See <a href=http://www.python.org/dev/peps/pep-0238/><abbr>PEP</abbr> 238</a> for details.
</blockquote>
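<p>The three division-related operators, side by side, as they behave in Python 3:

```python
true_quotient = 11 / 2      # floating point division
floor_quotient = 11 // 2    # integer (floor) division
remainder = 11 % 2          # remainder after integer division
negative_floor = -11 // 2   # floor division rounds down, not toward zero

print(true_quotient, floor_quotient, remainder, negative_floor)
```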
<p>FIXME fractions, math module, numbers in a boolean context
<h2 id=lists>Lists</h2>
@@ -357,8 +357,8 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<ul>
<li>fractions
<li>math module
<li>PEP 237
<li>PEP 238
<li><abbr>PEP</abbr> 237
<li><abbr>PEP</abbr> 238
<li>links to appendix
<li>...etc...
</ul>
+11 -11
@@ -145,7 +145,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
<p id=skipcompareunicode>
<h2 id=long><code>long</code> data type</h2>
<p>Python 2 had separate <code>int</code> and <code>long</code> types for non-floating-point numbers. An <code>int</code> could not be any larger than <a href=#renames><code>sys.maxint</code></a>, which varied by platform. Longs were defined by appending an <code>L</code> to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called <code>int</code>, which mostly behaves like the <code>long</code> type in Python 2. Since there are no longer two types, there is no need for special syntax to distinguish them.
<p>Further reading: <a href=http://www.python.org/dev/peps/pep-0237/>PEP 237: Unifying Long Integers and Integers</a>.
<p>Further reading: <a href=http://www.python.org/dev/peps/pep-0237/><abbr>PEP</abbr> 237: Unifying Long Integers and Integers</a>.
<p class=skip><a href=#skipcomparelong>skip over this table</a>
<table id=comparelong>
<tr><th>Notes</th>
@@ -259,7 +259,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
<h2 id=imports>Modules that have been renamed or reorganized</h2>
<p>Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical.
<h3 id=http><code>http</code></h3>
<p>In Python 3, several related HTTP modules have been combined into a single package, <code>http</code>.
<p>In Python 3, several related <abbr>HTTP</abbr> modules have been combined into a single package, <code>http</code>.
<p class=skip><a href=#skipcompareimporthttp>skip over this table</a>
<table id=compareimporthttp>
<tr><th>Notes</th>
@@ -282,10 +282,10 @@ import CGIHttpServer</code></pre></td>
<td><code>import http.server</code></td></tr>
</table>
<ol id=skipcompareimporthttp>
<li>The <code>http.client</code> module implements a low-level library that can request HTTP resources and interpret HTTP responses.
<li>The <code>http.cookies</code> module provides a Pythonic interface to browser cookies that are sent in a <code>Set-Cookie:</code> HTTP header.
<li>The <code>http.client</code> module implements a low-level library that can request <abbr>HTTP</abbr> resources and interpret <abbr>HTTP</abbr> responses.
<li>The <code>http.cookies</code> module provides a Pythonic interface to browser cookies that are sent in a <code>Set-Cookie:</code> <abbr>HTTP</abbr> header.
<li>The <code>http.cookiejar</code> module manipulates the actual files on disk that popular web browsers use to store cookies.
<li>The <code>http.server</code> module provides a basic HTTP server.
<li>The <code>http.server</code> module provides a basic <abbr>HTTP</abbr> server.
</ol>
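<p>As a small taste of the reorganized package, here is a sketch that uses <code>http.cookies</code> to parse a cookie string. The cookie name and value are invented for the example.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie.load('session=abc123; Path=/')   # made-up cookie for illustration

value = cookie['session'].value         # the cookie's value
path = cookie['session']['path']        # an attribute attached to the cookie

print(value, path)
```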
<h3 id=urllib><code>urllib</code></h3>
<p>Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, <code>urllib</code>.
@@ -319,15 +319,15 @@ from urllib2 import HTTPError</code></pre></td>
from urllib.error import HTTPError</code></pre></td></tr>
</table>
<ol id=skipcompareimporturllib>
<li>The old <code>urllib</code> module in Python 2 had a variety of functions, including <code>urlopen()</code> for fetching data and <code>splittype()</code>, <code>splithost()</code>, and <code>splituser()</code> for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new <code>urllib</code> package. <code>2to3</code> will also change all calls to these functions so they use the new naming scheme.
<li>The old <code>urllib</code> module in Python 2 had a variety of functions, including <code>urlopen()</code> for fetching data and <code>splittype()</code>, <code>splithost()</code>, and <code>splituser()</code> for splitting a <abbr>URL</abbr> into its constituent parts. These functions have been reorganized more logically within the new <code>urllib</code> package. <code>2to3</code> will also change all calls to these functions so they use the new naming scheme.
<li>The old <code>urllib2</code> module in Python 2 has been folded into the <code>urllib</code> package in Python 3. All your <code>urllib2</code> favorites &mdash; the <code>build_opener()</code> function, <code>Request</code> objects, and <code>HTTPBasicAuthHandler</code> and friends &mdash; are still available.
<li>The <code>urllib.parse</code> module in Python 3 contains all the parsing functions from the old <code>urlparse</code> module in Python 2.
<li>The <code>urllib.robotparser</code> module parses <a href=http://www.robotstxt.org/><code>robots.txt</code> files</a>.
<li>The <code>FancyURLopener</code> class, which handles HTTP redirects and other status codes, is still available in the new <code>urllib.request</code> module. The <code>urlencode</code> function has moved to <code>urllib.parse</code>.
<li>The <code>FancyURLopener</code> class, which handles <abbr>HTTP</abbr> redirects and other status codes, is still available in the new <code>urllib.request</code> module. The <code>urlencode</code> function has moved to <code>urllib.parse</code>.
<li>The <code>Request</code> object is still available in <code>urllib.request</code>, but exceptions like <code>HTTPError</code> have been moved to <code>urllib.error</code>.
</ol>
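<p>A short sketch of where things live after the reorganization. It only builds and parses a <abbr>URL</abbr>, so nothing is fetched over the network; <code>example.com</code> and the query parameters are placeholders.

```python
from urllib.parse import urlencode, urlparse, parse_qs
from urllib.error import HTTPError
from urllib.request import Request

# urlencode() now lives in urllib.parse
query = urlencode({'q': 'dive into python', 'page': 3})
url = 'http://example.com/search?' + query

# so do the functions from the old urlparse module
parts = urlparse(url)
params = parse_qs(parts.query)

# Request lives in urllib.request; HTTPError lives in urllib.error
request = Request(url)

print(query)
print(parts.netloc, parts.path, params)
print(request.full_url, issubclass(HTTPError, Exception))
```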
<h3 id=dbm><code>dbm</code></h3>
<p>All the various DBM clones are now in a single package, <code>dbm</code>. If you need a specific variant like GNU DBM, you can import the appropriate module within the <code>dbm</code> package.
<p>All the various <abbr>DBM</abbr> clones are now in a single package, <code>dbm</code>. If you need a specific variant like <abbr>GNU</abbr> <abbr>DBM</abbr>, you can import the appropriate module within the <code>dbm</code> package.
<p class=skip><a href=#skipcompareimportdbm>skip over this table</a>
<table id=compareimportdbm>
<tr><th>Notes</th>
@@ -353,7 +353,7 @@ import whichdb</code></pre></td>
</table>
<p id=skipcompareimportdbm>
<h3 id=xmlrpc><code>xmlrpc</code></h3>
<p>XML-RPC is a lightweight method of performing remote RPC calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, <code>xmlrpc</code>.
<p><abbr>XML-RPC</abbr> is a lightweight method of performing remote <abbr>RPC</abbr> calls over <abbr>HTTP</abbr>. The <abbr>XML-RPC</abbr> client library and several <abbr>XML-RPC</abbr> server implementations are now combined in a single package, <code>xmlrpc</code>.
<p class=skip><a href=#skipcompareimportxmlrpc>skip over this table</a>
<table id=compareimportxmlrpc>
<tr><th>Notes</th>
@@ -417,14 +417,14 @@ except ImportError:
<li>The <code>copyreg</code> module adds pickle support for custom types defined in C.
<li>The <code>queue</code> module implements a multi-producer, multi-consumer queue.
<li>The <code>socketserver</code> module provides generic base classes for implementing different kinds of socket servers.
<li>The <code>configparser</code> module parses INI-style configuration files.
<li>The <code>configparser</code> module parses <abbr>INI</abbr>-style configuration files.
<li>The <code>reprlib</code> module reimplements the built-in <code>repr()</code> function, but with limits on how many values are represented.
<li>The <code>subprocess</code> module allows you to spawn processes, connect to their pipes, and obtain their return codes.
</ol>
<h2 id=import>Relative imports within a package</h2>
<p>A package is a group of related modules that function as a single entity. In Python 2, when modules within a package need to reference each other, you use <code>import foo</code> or <code>from foo import Bar</code>. The Python 2 interpreter first searches within the current package to find <code>foo.py</code>, and then moves on to the other directories in the Python search path (<code>sys.path</code>). Python 3 works a bit differently. Instead of searching the current package, it goes directly to the Python search path. If you want one module within a package to import another module in the same package, you need to explicitly provide the relative path between the two modules.
<p>Suppose you had this package, with multiple files in the same directory:
<p class=skip><a href=#skippackageart>skip over this ASCII art</a>
<p class=skip><a href=#skippackageart>skip over this <abbr>ASCII</abbr> art</a>
<pre>chardet/
|
+--__init__.py
+57 -57
@@ -15,7 +15,7 @@ body{counter-reset:h1 4}
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Regular expressions</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Some people, when confronted with a problem, think &#8220;I know, I'll use regular expressions.&#8221; Now they have two problems. <span>&#x275E;</span><br>&mdash; <cite>Jamie Zawinski</cite>
<p><span>&#x275D;</span> Some people, when confronted with a problem, think &#8220;I know, I&#8217;ll use regular expressions.&#8221; Now they have two problems. <span>&#x275E;</span><br>&mdash; <cite>Jamie Zawinski</cite>
</blockquote>
<ol>
<li><a href=#divingin>Diving in</a>
@@ -34,13 +34,13 @@ body{counter-reset:h1 4}
<li><a href=#summary>Summary</a>
</ol>
<h2 id=divingin>Diving in</h2>
<p>Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of
characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the <a href=http://docs.python.org/dev/library/re.html#module-contents><code>re</code> module</a> to get an overview of the available functions and their arguments.
<p>Strings have methods for searching and replacing &mdash; <code>index()</code>, <code>find()</code>, <code>split()</code>, <code>count()</code>, <code>replace()</code>, <i class=baa>&amp;</i>c. &mdash; but they are limited to the simplest of cases. For example, the <code>index()</code> method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string <var>s</var>, you must call <code>s.lower()</code> or <code>s.upper()</code> and make sure your search strings are the appropriate case to match. The <code>replace()</code> and <code>split()</code> methods have the same limitations.
<p>If your goal can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with <code>if</code> statements to handle special cases, or if you're combining them with <code>split()</code> and <code>join()</code> and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
<p class=fancy>Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of
characters. If you&#8217;ve used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the <a href=http://docs.python.org/dev/library/re.html#module-contents><code>re</code> module</a> to get an overview of the available functions and their arguments.
<p>Strings have methods for searching and replacing: <code>index()</code>, <code>find()</code>, <code>split()</code>, <code>count()</code>, <code>replace()</code>, <i class=baa>&amp;</i>c. But these methods are limited to the simplest of cases. For example, the <code>index()</code> method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string <var>s</var>, you must call <code>s.lower()</code> or <code>s.upper()</code> and make sure your search strings are the appropriate case to match. The <code>replace()</code> and <code>split()</code> methods have the same limitations.
<p>If your goal can be accomplished with string functions, you should use them. They&#8217;re fast and simple and easy to read, and there&#8217;s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with <code>if</code> statements to handle special cases, or if you&#8217;re combining them with <code>split()</code> and <code>join()</code> and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
<p>Although the regular expression syntax is tight and unlike normal code, the result can end up being <em>more</em> readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions, so you can include fine-grained documentation within them.
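<p>To make the comparison concrete, here is a sketch of a case-insensitive search done both ways, followed by the same regular expression written with embedded comments. The sample address is made up; <code>re.IGNORECASE</code> and <code>re.VERBOSE</code> are standard flags of the <code>re</code> module.

```python
import re

s = '100 North Main Road'

# String methods: normalize the case yourself, then search.
found_with_str = 'road' in s.lower()

# Regular expression: ask for a case-insensitive match directly.
match = re.search('road', s, re.IGNORECASE)

# The same search as a verbose pattern with fine-grained documentation.
pattern = re.compile(r'''
    road    # the literal word, in any mix of upper and lower case
    $       # ...but only at the end of the string
    ''', re.VERBOSE | re.IGNORECASE)
verbose_match = pattern.search(s)

print(found_with_str, match.group(), verbose_match.group())
```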
<h2 id=streetaddresses>Case study: street addresses</h2>
<p>This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just make this stuff up; it's actually useful.) This example shows how I approached the problem.
<p>This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don&#8217;t just make this stuff up; it&#8217;s actually useful.) This example shows how I approached the problem.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>s = '100 NORTH MAIN ROAD'</kbd>
<a><samp class=prompt>>>> </samp><kbd>s.replace('ROAD', 'RD.')</kbd> <span>&#x2460;</span></a>
@@ -56,9 +56,9 @@ characters. If you've used regular expressions in other languages (like Perl), t
<ol>
<li>My goal is to standardize a street address so that <code>'ROAD'</code> is always abbreviated as <code>'RD.'</code>. At first glance, I thought this was simple enough that I could just use the string method <code>replace()</code>. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, <code>'ROAD'</code>, was a constant. And in this deceptively simple example, <code>s.replace()</code> does indeed work.
<li>Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that <code>'ROAD'</code> appears twice in the address, once as part of the street name <code>'BROAD'</code> and once as its own word. The <code>replace()</code> method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.
<li>To solve the problem of addresses with more than one <code>'ROAD'</code> substring, you could resort to something like this: only search and replace <code>'ROAD'</code> in the last four characters of the address (<code>s[-4:]</code>), and leave the string alone (<code>s[:-4]</code>). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string you're replacing. (If you were replacing <code>'STREET'</code> with <code>'ST.'</code>, you would need to use <code>s[:-6]</code> and <code>s[-6:].replace(...)</code>.) Would you like to come back in six months and debug this? I know I wouldn't.
<li>It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the <code>re</code> module.
<li>Take a look at the first parameter: <code>'ROAD$'</code>. This is a simple regular expression that matches <code>'ROAD'</code> only when it occurs at the end of a string. The <code>$</code> means &#8220;end of the string.&#8221; (There is a corresponding character, the caret <code>^</code>, which means &#8220;beginning of the string.&#8221;) Using the <code>re.sub</code> function, you search the string <var>s</var> for the regular expression <code>'ROAD$'</code> and replace it with <code>'RD.'</code>. This matches the <code>ROAD</code> at the end of the string <var>s</var>, but does <em>not</em> match the <code>ROAD</code> that's part of the word <code>BROAD</code>, because that's in the middle of <var>s</var>.
<li>To solve the problem of addresses with more than one <code>'ROAD'</code> substring, you could resort to something like this: only search and replace <code>'ROAD'</code> in the last four characters of the address (<code>s[-4:]</code>), and leave the string alone (<code>s[:-4]</code>). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string you&#8217;re replacing. (If you were replacing <code>'STREET'</code> with <code>'ST.'</code>, you would need to use <code>s[:-6]</code> and <code>s[-6:].replace(...)</code>.) Would you like to come back in six months and debug this? I know I wouldn&#8217;t.
<li>It&#8217;s time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the <code>re</code> module.
<li>Take a look at the first parameter: <code>'ROAD$'</code>. This is a simple regular expression that matches <code>'ROAD'</code> only when it occurs at the end of a string. The <code>$</code> means &#8220;end of the string.&#8221; (There is a corresponding character, the caret <code>^</code>, which means &#8220;beginning of the string.&#8221;) Using the <code>re.sub</code> function, you search the string <var>s</var> for the regular expression <code>'ROAD$'</code> and replace it with <code>'RD.'</code>. This matches the <code>ROAD</code> at the end of the string <var>s</var>, but does <em>not</em> match the <code>ROAD</code> that&#8217;s part of the word <code>BROAD</code>, because that&#8217;s in the middle of <var>s</var>.
</ol>
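<p>The three approaches from the notes above, gathered into one runnable sketch so the results can be compared directly. The address is an illustrative one containing <code>'ROAD'</code> twice.

```python
import re

s = '100 NORTH BROAD ROAD'

# 1. Blind replacement hits both occurrences, including the one inside 'BROAD'.
blind = s.replace('ROAD', 'RD.')

# 2. Slicing works, but the 4 is tied to the length of 'ROAD'.
sliced = s[:-4] + s[-4:].replace('ROAD', 'RD.')

# 3. The regular expression says what is meant: 'ROAD' at the end of the string.
anchored = re.sub('ROAD$', 'RD.', s)

print(blind)
print(sliced)
print(anchored)
```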
<p>Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching <code>'ROAD'</code> at the end of the address, was not good enough, because not all addresses included a street designation at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the street name was <code>'BROAD'</code>, then the regular expression would match <code>'ROAD'</code> at the end of the string as part of the word <code>'BROAD'</code>, which is not what I wanted.
<pre class=screen>
@@ -75,13 +75,13 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>re.sub(r'\bROAD\b', 'RD.', s)</kbd> <span>&#x2463;</span></a>
<samp>'100 BROAD RD. APT 3'</samp></pre>
<ol>
<li>What I <em>really</em> wanted was to match <code>'ROAD'</code> when it was at the end of the string <em>and</em> it was its own word (and not a part of some larger word). To express this in a regular expression, you use <code>\b</code>, which means &#8220;a word boundary must occur right here.&#8221; In Python, this is complicated by the fact that the <code>'\'</code> character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it's a bug in syntax or a bug in your regular expression.
<li>What I <em>really</em> wanted was to match <code>'ROAD'</code> when it was at the end of the string <em>and</em> it was its own word (and not a part of some larger word). To express this in a regular expression, you use <code>\b</code>, which means &#8220;a word boundary must occur right here.&#8221; In Python, this is complicated by the fact that the <code>'\'</code> character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it&#8217;s a bug in syntax or a bug in your regular expression.
<li>To work around the backslash plague, you can use what is called a <i>raw string</i> [FIXME reference to strings chapter], by prefixing the string with the letter <code>r</code>. This tells Python that nothing in this string should be escaped; <code>'\t'</code> is a tab character, but <code>r'\t'</code> is really the backslash character <code>\</code> followed by the letter <code>t</code>. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly (and regular expressions are confusing enough already).
<li><em>*sigh*</em> Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word <code>'ROAD'</code> as a whole word by itself, but it wasn't at the end, because the address had an apartment number after the street designation. Because <code>'ROAD'</code> isn't at the very end of the string, it doesn't match, so the entire call to <code>re.sub</code> ends up replacing nothing at all, and you get the original string back, which is not what you want.
<li>To solve this problem, I removed the <code>$</code> character and added another <code>\b</code>. Now the regular expression reads &#8220;match <code>'ROAD'</code> when it's a whole word by itself anywhere in the string,&#8221; whether at the end, the beginning, or somewhere in the middle.
<li><em>*sigh*</em> Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word <code>'ROAD'</code> as a whole word by itself, but it wasn&#8217;t at the end, because the address had an apartment number after the street designation. Because <code>'ROAD'</code> isn&#8217;t at the very end of the string, it doesn&#8217;t match, so the entire call to <code>re.sub</code> ends up replacing nothing at all, and you get the original string back, which is not what you want.
<li>To solve this problem, I removed the <code>$</code> character and added another <code>\b</code>. Now the regular expression reads &#8220;match <code>'ROAD'</code> when it&#8217;s a whole word by itself anywhere in the string,&#8221; whether at the end, the beginning, or somewhere in the middle.
</ol>
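<p>A sketch of why the raw string matters here. Without the <code>r</code> prefix, Python turns <code>'\b'</code> into a backspace character before the <code>re</code> module ever sees it, so the pattern quietly matches nothing. The address is illustrative.

```python
import re

s = '100 BROAD ROAD APT 3'

# In a normal string, '\b' is a single backspace character.
plain_length = len('\b')   # one character
raw_length = len(r'\b')    # two characters: a backslash and the letter b

# Anchored at the end of the string: no match, so s comes back unchanged.
end_only = re.sub(r'\bROAD$', 'RD.', s)

# Whole word anywhere in the string: only the standalone 'ROAD' is replaced.
whole_word = re.sub(r'\bROAD\b', 'RD.', s)

# Forgetting the r prefix: the pattern looks for backspaces, finds none.
forgot_raw = re.sub('\bROAD\b', 'RD.', s)

print(plain_length, raw_length)
print(end_only)
print(whole_word)
print(forgot_raw)
```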
<h2 id=romannumerals>Case study: Roman numerals</h2>
<p>You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies and television shows (&#8220;Copyright <code>MCMXLVI</code>&#8221; instead of &#8220;Copyright <code>1946</code>&#8221;), or on the dedication walls of libraries or universities (&#8220;established <code>MDCCCLXXXVIII</code>&#8221; instead of &#8220;established <code>1888</code>&#8221;). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really does date back to the ancient Roman empire (hence the name).
<p>You&#8217;ve most likely seen Roman numerals, even if you didn&#8217;t recognize them. You may have seen them in copyrights of old movies and television shows (&#8220;Copyright <code>MCMXLVI</code>&#8221; instead of &#8220;Copyright <code>1946</code>&#8221;), or on the dedication walls of libraries or universities (&#8220;established <code>MDCCCLXXXVIII</code>&#8221; instead of &#8220;established <code>1888</code>&#8221;). You may also have seen them in outlines and bibliographical references. It&#8217;s a system of representing numbers that really does date back to the ancient Roman empire (hence the name).
<p>In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
<ul>
<li><code>I = 1</code>
@@ -95,13 +95,13 @@ characters. If you've used regular expressions in other languages (like Perl), t
<p>The following are some general rules for constructing Roman numerals:
<ul>
<li>Characters are additive. <code>I</code> is <code>1</code>, <code>II</code> is <code>2</code>, and <code>III</code> is <code>3</code>. <code>VI</code> is <code>6</code> (literally, &#8220;<code>5</code> and <code>1</code>&#8221;), <code>VII</code> is <code>7</code>, and <code>VIII</code> is <code>8</code>.
<li>The tens characters (<code>I</code>, <code>X</code>, <code>C</code>, and <code>M</code>) can be repeated up to three times. At <code>4</code>, you need to subtract from the next highest fives character. You can't represent <code>4</code> as <code>IIII</code>; instead, it is represented as <code>IV</code> (&#8220;<code>1</code> less than <code>5</code>&#8221;). The number <code>40</code> is written as <code>XL</code> (<code>10</code> less than <code>50</code>), <code>41</code> as <code>XLI</code>, <code>42</code> as <code>XLII</code>, <code>43</code> as <code>XLIII</code>, and then <code>44</code> as <code>XLIV</code> (<code>10</code> less than <code>50</code>, then <code>1</code> less than <code>5</code>).
<li>The tens characters (<code>I</code>, <code>X</code>, <code>C</code>, and <code>M</code>) can be repeated up to three times. At <code>4</code>, you need to subtract from the next highest fives character. You can&#8217;t represent <code>4</code> as <code>IIII</code>; instead, it is represented as <code>IV</code> (&#8220;<code>1</code> less than <code>5</code>&#8221;). The number <code>40</code> is written as <code>XL</code> (<code>10</code> less than <code>50</code>), <code>41</code> as <code>XLI</code>, <code>42</code> as <code>XLII</code>, <code>43</code> as <code>XLIII</code>, and then <code>44</code> as <code>XLIV</code> (<code>10</code> less than <code>50</code>, then <code>1</code> less than <code>5</code>).
<li>Similarly, at <code>9</code>, you need to subtract from the next highest tens character: <code>8</code> is <code>VIII</code>, but <code>9</code> is <code>IX</code> (<code>1</code> less than <code>10</code>), not <code>VIIII</code> (since the <code>I</code> character cannot be repeated four times). The number <code>90</code> is <code>XC</code>, and <code>900</code> is <code>CM</code>.
<li>The fives characters cannot be repeated. The number <code>10</code> is always represented as <code>X</code>, never as <code>VV</code>. The number <code>100</code> is always <code>C</code>, never <code>LL</code>.
<li>Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much. <code>DC</code> is <code>600</code>; <code>CD</code> is a completely different number (<code>400</code>, <code>100</code> less than <code>500</code>). <code>CI</code> is <code>101</code>; <code>IC</code> is not even a valid Roman numeral (because you can't subtract <code>1</code> directly from <code>100</code>; you would need to write it as <code>XCIX</code>, for <code>10</code> less than <code>100</code>, then <code>1</code> less than <code>10</code>).
<li>Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much. <code>DC</code> is <code>600</code>; <code>CD</code> is a completely different number (<code>400</code>, <code>100</code> less than <code>500</code>). <code>CI</code> is <code>101</code>; <code>IC</code> is not even a valid Roman numeral (because you can&#8217;t subtract <code>1</code> directly from <code>100</code>; you would need to write it as <code>XCIX</code>, for <code>10</code> less than <code>100</code>, then <code>1</code> less than <code>10</code>).
</ul>
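<p>The additive and subtractive rules can be sketched in a few lines of Python. This is only an illustration: <code>roman_to_int</code> is a hypothetical helper name, and it assumes its input is already a well-formed numeral. It does no validation at all, which is the problem the regular expressions in the rest of this chapter address.

```python
# Values of the seven Roman numeral characters.
VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_to_int(numeral):
    """Evaluate a well-formed Roman numeral (hypothetical helper, no validation)."""
    total = 0
    for i, ch in enumerate(numeral):
        value = VALUES[ch]
        # A smaller character directly before a larger one is subtracted
        # (the I in IV, the X in XL, the C in CM); otherwise it is added.
        if i + 1 < len(numeral) and value < VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total

for numeral in ['VIII', 'XLIV', 'DC', 'CD', 'MCMXLVI', 'MDCCCLXXXVIII']:
    print(numeral, roman_to_int(numeral))
```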
<h3 id=thousands>Checking for thousands</h3>
<p>What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of <code>M</code> characters.
<p>What would it take to validate that an arbitrary string is a valid Roman numeral? Let&#8217;s take it one digit at a time. Since Roman numerals are always written highest to lowest, let&#8217;s start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of <code>M</code> characters.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import re</kbd>
<a><samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd> <span>&#x2460;</span></a>
@@ -115,11 +115,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, '')</kbd> <span>&#x2465;</span></a>
<samp>&lt;SRE_Match object at 0106F4A8></samp></pre>
<ol>
<li>This pattern has three parts. <code>^</code> matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the <code>M</code> characters were, which is not what you want. You want to make sure that the <code>M</code> characters, if they're there, are at the beginning of the string. <code>M?</code> optionally matches a single <code>M</code> character. Since this is repeated three times, you're matching anywhere from zero to three <code>M</code> characters in a row. And <code>$</code> matches the end of the string. When combined with the <code>^</code> character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the <code>M</code> characters.
<li>This pattern has three parts. <code>^</code> matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the <code>M</code> characters were, which is not what you want. You want to make sure that the <code>M</code> characters, if they&#8217;re there, are at the beginning of the string. <code>M?</code> optionally matches a single <code>M</code> character. Since this is repeated three times, you&#8217;re matching anywhere from zero to three <code>M</code> characters in a row. And <code>$</code> matches the end of the string. When combined with the <code>^</code> character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the <code>M</code> characters.
<li>The essence of the <code>re</code> module is the <code>search()</code> function, which takes a regular expression (<var>pattern</var>) and a string (<code>'M'</code>) to try to match against the regular expression. If a match is found, <code>search()</code> returns an object which has various methods to describe the match; if no match is found, <code>search()</code> returns <code>None</code>, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return value of <code>search()</code>. <code>'M'</code> matches this regular expression, because the first optional <code>M</code> matches and the second and third optional <code>M</code> characters are ignored.
<li><code>'MM'</code> matches because the first and second optional <code>M</code> characters match and the third <code>M</code> is ignored.
<li><code>'MMM'</code> matches because all three <code>M</code> characters match.
<li><code>'MMMM'</code> does not match. All three <code>M</code> characters match, but then the regular expression insists on the string ending (because of the <code>$</code> character), and the string doesn't end yet (because of the fourth <code>M</code>). So <code>search()</code> returns <code>None</code>.
<li><code>'MMMM'</code> does not match. All three <code>M</code> characters match, but then the regular expression insists on the string ending (because of the <code>$</code> character), and the string doesn&#8217;t end yet (because of the fourth <code>M</code>). So <code>search()</code> returns <code>None</code>.
<li>Interestingly, an empty string also matches this regular expression, since all the <code>M</code> characters are optional.
</ol>
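<p>Outside the interactive shell, the same checks can be run in a loop. This is a small sketch; the variable names <code>candidates</code> and <code>results</code> are only for illustration.

```python
import re

pattern = '^M?M?M?$'
candidates = ['M', 'MM', 'MMM', 'MMMM', '']

# search() returns a match object or None; "is not None" turns that into a bool.
results = {s: re.search(pattern, s) is not None for s in candidates}

for s, matched in results.items():
    print(repr(s), 'matches' if matched else 'does not match')
```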
<h3 id=hundreds>Checking for hundreds</h3>
@@ -164,10 +164,10 @@ characters. If you've used regular expressions in other languages (like Perl), t
<li><code>'MCM'</code> matches because the first <code>M</code> matches, the second and third <code>M</code> characters are ignored, and the <code>CM</code> matches (so the <code>CD</code> and <code>D?C?C?C?</code> patterns are never even considered). <code>MCM</code> is the Roman numeral representation of <code>1900</code>.
<li><code>'MD'</code> matches because the first <code>M</code> matches, the second and third <code>M</code> characters are ignored, and the <code>D?C?C?C?</code> pattern matches <code>D</code> (each of the three <code>C</code> characters is optional and is ignored). <code>MD</code> is the Roman numeral representation of <code>1500</code>.
<li><code>'MMMCCC'</code> matches because all three <code>M</code> characters match, and the <code>D?C?C?C?</code> pattern matches <code>CCC</code> (the <code>D</code> is optional and is ignored). <code>MMMCCC</code> is the Roman numeral representation of <code>3300</code>.
<li><code>'MCMC'</code> does not match. The first <code>M</code> matches, the second and third <code>M</code> characters are ignored, and the <code>CM</code> matches, but then the <code>$</code> does not match because you're not at the end of the string yet (you still have an unmatched <code>C</code> character). The <code>C</code> does <em>not</em> match as part of the <code>D?C?C?C?</code> pattern, because the mutually exclusive <code>CM</code> pattern has already matched.
<li><code>'MCMC'</code> does not match. The first <code>M</code> matches, the second and third <code>M</code> characters are ignored, and the <code>CM</code> matches, but then the <code>$</code> does not match because you&#8217;re not at the end of the string yet (you still have an unmatched <code>C</code> character). The <code>C</code> does <em>not</em> match as part of the <code>D?C?C?C?</code> pattern, because the mutually exclusive <code>CM</code> pattern has already matched.
<li>Interestingly, an empty string still matches this pattern, because all the <code>M</code> characters are optional and ignored, and the empty string matches the <code>D?C?C?C?</code> pattern where all the characters are optional and ignored.
</ol>
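<p>Here is a sketch that replays the cases above in one go. It assumes the hundreds pattern is <code>'^M?M?M?(CM|CD|D?C?C?C?)$'</code>, which is the form the notes describe; the variable names are illustrative.

```python
import re

# Thousands, then exactly one of: CM (900), CD (400), or D?C?C?C? (0 to 800).
pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
candidates = ['MCM', 'MD', 'MMMCCC', 'MCMC', '']

results = {s: re.search(pattern, s) is not None for s in candidates}

for s, matched in results.items():
    print(repr(s), 'matches' if matched else 'does not match')
```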
<p>Whew! See how quickly regular expressions can get nasty? And you've only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But let's look at another way to express the pattern.
<p>Whew! See how quickly regular expressions can get nasty? And you&#8217;ve only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they&#8217;re exactly the same pattern. But let&#8217;s look at another way to express the pattern.
<h2 id=nmsyntax>Using the <code>{n,m}</code> Syntax</h2>
<p>In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.
<pre class=screen>
@@ -184,8 +184,8 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMMM')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp></pre>
<ol>
<li>This matches the start of the string, and then the first optional <code>M</code>, but not the second and third <code>M</code> (but that's okay because they're optional), and then the end of the string.
<li>This matches the start of the string, and then the first and second optional <code>M</code>, but not the third <code>M</code> (but that's okay because it's optional), and then the end of the string.
<li>This matches the start of the string, and then the first optional <code>M</code>, but not the second and third <code>M</code> (but that&#8217;s okay because they&#8217;re optional), and then the end of the string.
<li>This matches the start of the string, and then the first and second optional <code>M</code>, but not the third <code>M</code> (but that&#8217;s okay because it&#8217;s optional), and then the end of the string.
<li>This matches the start of the string, and then all three optional <code>M</code>, and then the end of the string.
<li>This matches the start of the string, and then all three optional <code>M</code>, but then does not match the end of the string (because there is still one unmatched <code>M</code>), so the pattern does not match and returns <code>None</code>.
</ol>
@@ -207,7 +207,7 @@ characters. If you've used regular expressions in other languages (like Perl), t
<li>This matches the start of the string, then three <code>M</code> out of a possible three, but then <em>does not match</em> the end of the string. The regular expression allows for up to only three <code>M</code> characters before the end of the string, but you have four, so the pattern does not match and returns <code>None</code>.
</ol>
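<p>One way to convince yourself that the two spellings are interchangeable is to run both against the same inputs. A small sketch, with illustrative variable names:

```python
import re

optional_form = '^M?M?M?$'   # three optional M characters
counted_form = '^M{0,3}$'    # between zero and three M characters

# Compare the two patterns on strings of zero through five M characters.
outcomes = []
for count in range(6):
    s = 'M' * count
    a = re.search(optional_form, s) is not None
    b = re.search(counted_form, s) is not None
    outcomes.append((count, a, b))
    print(count, a, b)

# True when both patterns gave the same answer for every input.
agree = all(a == b for _, a, b in outcomes)
print(agree)
```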
<h3 id=tensandones>Checking for tens and ones</h3>
<p>Now let's expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
<p>Now let&#8217;s expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCMXL')</kbd> <span>&#x2460;</span></a>
@@ -227,7 +227,7 @@ characters. If you've used regular expressions in other languages (like Perl), t
<li>This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then the optional <code>L</code> and all three optional <code>X</code> characters, then the end of the string. <code>MCMLXXX</code> is the Roman numeral representation of <code>1980</code>.
<li>This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then the optional <code>L</code> and all three optional <code>X</code> characters, then <em>fails to match</em> the end of the string because there is still one more <code>X</code> unaccounted for. So the entire pattern fails to match, and returns <code>None</code>. <code>MCMLXXXX</code> is not a valid Roman numeral.
</ol>
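<p>The tens check can be exercised the same way. The inputs below are chosen for illustration; the last one has a fourth <code>X</code> and should be rejected.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
candidates = ['MCMXL', 'MCML', 'MCMLX', 'MCMLXXX', 'MCMLXXXX']

results = {s: re.search(pattern, s) is not None for s in candidates}

for s, matched in results.items():
    print(s, 'matches' if matched else 'does not match')
```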
<p>The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.
<p>The expression for the ones place follows the same pattern. I&#8217;ll spare you the details and show you the end result.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'</kbd>
</pre><p>So what does that look like using this alternate <code>{n,m}</code> syntax? This example shows the new syntax.
@@ -244,19 +244,19 @@ characters. If you've used regular expressions in other languages (like Perl), t
<ol>
<li>This matches the start of the string, then one of a possible four <code>M</code> characters, then <code>D?C{0,3}</code>. Of that, it matches the optional <code>D</code> and zero of three possible <code>C</code> characters. Moving on, it matches <code>L?X{0,3}</code> by matching the optional <code>L</code> and zero of three possible <code>X</code> characters. Then it matches <code>V?I{0,3}</code> by matching the optional <code>V</code> and zero of three possible <code>I</code> characters, and finally the end of the string. <code>MDLV</code> is the Roman numeral representation of <code>1555</code>.
<li>This matches the start of the string, then two of a possible four <code>M</code> characters, then the <code>D?C{0,3}</code> with a <code>D</code> and one of three possible <code>C</code> characters; then <code>L?X{0,3}</code> with an <code>L</code> and one of three possible <code>X</code> characters; then <code>V?I{0,3}</code> with a <code>V</code> and one of three possible <code>I</code> characters; then the end of the string. <code>MMDCLXVI</code> is the Roman numeral representation of <code>2666</code>.
<li>This matches the start of the string, then four out of four <code>M</code> characters, then <code>D?C{0,3}</code> with a <code>D</code> and three out of three <code>C</code> characters; then <code>L?X{0,3}</code> with an <code>L</code> and three out of three <code>X</code> characters; then <code>V?I{0,3}</code> with a <code>V</code> and three out of three <code>I</code> characters; then the end of the string. <code>MMMMDCCCLXXXVIII</code> is the Roman numeral representation of <code>4888</code>, and it's the longest Roman numeral you can write without extended syntax.
<li>Watch closely. (I feel like a magician. &#8220;Watch closely, kids, I'm going to pull a rabbit out of my hat.&#8221;) This matches the start of the string, then zero out of four <code>M</code>, then matches <code>D?C{0,3}</code> by skipping the optional <code>D</code> and matching zero out of three <code>C</code>, then matches <code>L?X{0,3}</code> by skipping the optional <code>L</code> and matching zero out of three <code>X</code>, then matches <code>V?I{0,3}</code> by skipping the optional <code>V</code> and matching one out of three <code>I</code>. Then the end of the string. Whoa.
<li>This matches the start of the string, then four out of four <code>M</code> characters, then <code>D?C{0,3}</code> with a <code>D</code> and three out of three <code>C</code> characters; then <code>L?X{0,3}</code> with an <code>L</code> and three out of three <code>X</code> characters; then <code>V?I{0,3}</code> with a <code>V</code> and three out of three <code>I</code> characters; then the end of the string. <code>MMMMDCCCLXXXVIII</code> is the Roman numeral representation of <code>4888</code>, and it&#8217;s the longest Roman numeral you can write without extended syntax.
<li>Watch closely. (I feel like a magician. &#8220;Watch closely, kids, I&#8217;m going to pull a rabbit out of my hat.&#8221;) This matches the start of the string, then zero out of four <code>M</code>, then matches <code>D?C{0,3}</code> by skipping the optional <code>D</code> and matching zero out of three <code>C</code>, then matches <code>L?X{0,3}</code> by skipping the optional <code>L</code> and matching zero out of three <code>X</code>, then matches <code>V?I{0,3}</code> by skipping the optional <code>V</code> and matching one out of three <code>I</code>. Then the end of the string. Whoa.
</ol>
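<p>To see the whole pattern at work, here is a sketch of a validator. It assumes the <code>{n,m}</code> pattern allows up to four <code>M</code> characters, as the notes above describe; <code>is_roman</code> is a hypothetical helper name, not part of any library.

```python
import re

# Assumed full pattern: thousands, hundreds, tens, ones, in that order.
ROMAN = re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def is_roman(text):
    """Return True if text matches the pattern (hypothetical helper)."""
    return ROMAN.search(text) is not None

valid = ['MDLV', 'MMDCLXVI', 'MMMMDCCCLXXXVIII', 'I', 'XLIV']
invalid = ['IIII', 'VV', 'IC', 'MCMLXXXX']

for text in valid + invalid:
    print(text, is_roman(text))
```

<p>Keep in mind that, like the compact patterns before it, this one also accepts the empty string, so a real validator would probably want to check for that separately.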
<p>If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back to your own regular expressions a few months later. I've done it, and it's not a pretty sight.
<p>Now let's explore an alternate syntax that can help keep your expressions maintainable.
<p>If you followed all that and understood it on the first try, you&#8217;re doing better than I did. Now imagine trying to understand someone else&#8217;s regular expressions, in the middle of a critical function of a large program. Or even imagine coming back to your own regular expressions a few months later. I&#8217;ve done it, and it&#8217;s not a pretty sight.
<p>Now let&#8217;s explore an alternate syntax that can help keep your expressions maintainable.
<h2 id=verbosere>Verbose Regular Expressions</h2>
<p>So far you've just been dealing with what I'll call &#8220;compact&#8221; regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee that you'll be able to understand it six months later. What you really need is inline documentation.
<p>So far you&#8217;ve just been dealing with what I&#8217;ll call &#8220;compact&#8221; regular expressions. As you&#8217;ve seen, they are difficult to read, and even if you figure out what one does, that&#8217;s no guarantee that you&#8217;ll be able to understand it six months later. What you really need is inline documentation.
<p>Python allows you to do this with something called <i>verbose regular expressions</i>. A verbose regular expression is different from a compact regular expression in two ways:
<ul>
<li>Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a backslash in front of it.)
<li>Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a <code>#</code> character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your source code, but it works the same way.
<li>Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They&#8217;re not matched at all. (If you want to match a space in a verbose regular expression, you&#8217;ll need to escape it by putting a backslash in front of it.)
<li>Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a <code>#</code> character and goes until the end of the line. In this case it&#8217;s a comment within a multi-line string instead of within your source code, but it works the same way.
</ul>
<p>This will be more clear with an example. Let's revisit the compact regular expression you've been working with, and make it a verbose regular expression. This example shows how.
<p>This will be more clear with an example. Let&#8217;s revisit the compact regular expression you&#8217;ve been working with, and make it a verbose regular expression. This example shows how.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>pattern = """
^ # beginning of string
@@ -277,14 +277,14 @@ characters. If you've used regular expressions in other languages (like Perl), t
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2463;</span></a></pre>
<ol>
<li>The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: <code>re.VERBOSE</code> is a constant defined in the <code>re</code> module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it's a lot more readable.
<li>The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: <code>re.VERBOSE</code> is a constant defined in the <code>re</code> module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it&#8217;s a lot more readable.
<li>This matches the start of the string, then one of a possible four <code>M</code>, then <code>CM</code>, then <code>L</code> and three of a possible three <code>X</code>, then <code>IX</code>, then the end of the string.
<li>This matches the start of the string, then four of a possible four <code>M</code>, then <code>D</code> and three of a possible three <code>C</code>, then <code>L</code> and three of a possible three <code>X</code>, then <code>V</code> and three of a possible three <code>I</code>, then the end of the string.
<li>This does not match. Why? Because it doesn't have the <code>re.VERBOSE</code> flag, so the <code>re.search</code> function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.
<li>This does not match. Why? Because it doesn&#8217;t have the <code>re.VERBOSE</code> flag, so the <code>re.search</code> function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can&#8217;t auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.
</ol>
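<p>As a self-contained sketch of the same idea, here is a verbose pattern compiled once with the flag. The comments inside the pattern are my own wording, and the four-<code>M</code> limit is an assumption carried over from the notes above.

```python
import re

verbose_pattern = """
    ^                   # beginning of string
    M{0,4}              # thousands: zero to four M
    (CM|CD|D?C{0,3})    # hundreds: 900, 400, or an optional D plus up to three C
    (XC|XL|L?X{0,3})    # tens: 90, 40, or an optional L plus up to three X
    (IX|IV|V?I{0,3})    # ones: 9, 4, or an optional V plus up to three I
    $                   # end of string
"""

# With re.VERBOSE, the whitespace and the comments are ignored.
with_flag = re.search(verbose_pattern, 'MCMLXXXIX', re.VERBOSE)

# Without the flag, the spaces, newlines, and hash marks are taken literally,
# so even a valid numeral does not match.
without_flag = re.search(verbose_pattern, 'MCMLXXXIX')

print(with_flag is not None)
print(without_flag is not None)
```

<p>If you use a verbose pattern more than once, passing the flag to <code>re.compile()</code> means you only have to remember it in one place.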
<h2 id=phonenumbers>Case study: parsing phone numbers</h2>
<p>So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression <em>does</em> match, you can pick out specific pieces of it. You can find out what matched where.
<p>This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.
<p>So far you&#8217;ve concentrated on matching whole patterns. Either the pattern matches, or it doesn&#8217;t. But regular expressions are much more powerful than that. When a regular expression <em>does</em> match, you can pick out specific pieces of it. You can find out what matched where.
<p>This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company&#8217;s database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.
<p>Here are the phone numbers I needed to be able to accept:
<ul>
<li><code>800-555-1212</code>
@@ -298,7 +298,7 @@ characters. If you've used regular expressions in other languages (like Perl), t
<li><code>work 1-(800) 555.1212 #1234</code>
</ul>
<p>Quite a variety! In each of these cases, I need to know that the area code was <code>800</code>, the trunk was <code>555</code>, and the rest of the phone number was <code>1212</code>. For those with an extension, I need to know that the extension was <code>1234</code>.
<p>Let's work through developing a solution for phone number parsing. This example shows the first step.
<p>Let&#8217;s work through developing a solution for phone number parsing. This example shows the first step.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2461;</span></a>
@@ -306,9 +306,9 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212-1234')</kbd> <span>&#x2462;</span></a>
<samp class=prompt>>>> </samp></pre>
<ol>
<li>Always read regular expressions from left to right. This one matches the beginning of the string, and then <code>(\d{3})</code>. What's <code>\d{3}</code>? Well, the <code>{3}</code> means &#8220;match exactly three numeric digits&#8221;; it's a variation on the <a href="#nmsyntax" title="7.4. Using the {n,m} Syntax"><code>{n,m}</code> syntax</a> you saw earlier. <code>\d</code> means &#8220;any numeric digit&#8221; (<code>0</code> through <code>9</code>). Putting it in parentheses means &#8220;match exactly three numeric digits, <em>and then remember them as a group that I can ask for later</em>&#8221;. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string.
<li>Always read regular expressions from left to right. This one matches the beginning of the string, and then <code>(\d{3})</code>. What&#8217;s <code>\d{3}</code>? Well, the <code>{3}</code> means &#8220;match exactly three numeric digits&#8221;; it&#8217;s a variation on the <a href="#nmsyntax" title="7.4. Using the {n,m} Syntax"><code>{n,m}</code> syntax</a> you saw earlier. <code>\d</code> means &#8220;any numeric digit&#8221; (<code>0</code> through <code>9</code>). Putting it in parentheses means &#8220;match exactly three numeric digits, <em>and then remember them as a group that I can ask for later</em>&#8221;. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string.
<li>To get access to the groups that the regular expression parser remembered along the way, use the <code>groups()</code> method on the object that the <code>search()</code> method returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits.
<li>This regular expression is not the final answer, because it doesn't handle a phone number with an extension on the end. For that, you'll need to expand the regular expression.
<li>This regular expression is not the final answer, because it doesn&#8217;t handle a phone number with an extension on the end. For that, you&#8217;ll need to expand the regular expression.
</ol>
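<p>In a script, you would normally check for <code>None</code> before calling <code>groups()</code>, since calling it on a failed search raises an exception. A minimal sketch, where <code>split_phone</code> is a hypothetical helper name:

```python
import re

phone_pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

def split_phone(text):
    """Return (area code, trunk, number), or None if text does not match."""
    match = phone_pattern.search(text)
    # search() returns None on failure, and None has no groups() method.
    return match.groups() if match else None

print(split_phone('800-555-1212'))
print(split_phone('800-555-1212-1234'))
```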
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')</kbd> <span>&#x2460;</span></a>
@@ -319,10 +319,10 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp></pre>
<ol>
<li>This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered group of four digits. What's new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string.
<li>This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered group of four digits. What&#8217;s new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string.
<li>The <code>groups()</code> method now returns a tuple of four elements, since the regular expression now defines four groups to remember.
<li>Unfortunately, this regular expression is not the final answer either, because it assumes that the different parts of the phone number are separated by hyphens. What if they're separated by spaces, or commas, or dots? You need a more general solution to match several different types of separators.
<li>Oops! Not only does this regular expression not do everything you want, it's actually a step backwards, because now you can't parse phone numbers <em>without</em> an extension. That's not what you wanted at all; if the extension is there, you want to know what it is, but if it's not there, you still want to know what the different parts of the main number are.
<li>Unfortunately, this regular expression is not the final answer either, because it assumes that the different parts of the phone number are separated by hyphens. What if they&#8217;re separated by spaces, or commas, or dots? You need a more general solution to match several different types of separators.
<li>Oops! Not only does this regular expression not do everything you want, it&#8217;s actually a step backwards, because now you can&#8217;t parse phone numbers <em>without</em> an extension. That&#8217;s not what you wanted at all; if the extension is there, you want to know what it is, but if it&#8217;s not there, you still want to know what the different parts of the main number are.
</ol>
<p>The next example shows the regular expression to handle separators between the different parts of the phone number.
<pre class=screen>
@@ -336,11 +336,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212')</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp></pre>
<ol>
<li>Hang on to your hat. You're matching the beginning of the string, then a group of three digits, then <code>\D+</code>. What the heck is that? Well, <code>\D</code> matches any character <em>except</em> a numeric digit, and <code>+</code> means &#8220;1 or more&#8221;. So <code>\D+</code> matches one or more characters that are not digits. This is what you're using instead of a literal hyphen, to try to match different separators.
<li>Hang on to your hat. You&#8217;re matching the beginning of the string, then a group of three digits, then <code>\D+</code>. What the heck is that? Well, <code>\D</code> matches any character <em>except</em> a numeric digit, and <code>+</code> means &#8220;1 or more&#8221;. So <code>\D+</code> matches one or more characters that are not digits. This is what you&#8217;re using instead of a literal hyphen, to try to match different separators.
<li>Using <code>\D+</code> instead of <code>-</code> means you can now match phone numbers where the parts are separated by spaces instead of hyphens.
<li>Of course, phone numbers separated by hyphens still work too.
<li>Unfortunately, this is still not the final answer, because it assumes that there is a separator at all. What if the phone number is entered without any spaces or hyphens at all?
<li>Oops! This still hasn't fixed the problem of requiring extensions. Now you have two problems, but you can solve both of them with the same technique.
<li>Oops! This still hasn&#8217;t fixed the problem of requiring extensions. Now you have two problems, but you can solve both of them with the same technique.
</ol>
<p>The next example shows the regular expression for handling phone numbers <em>without</em> separators.
<pre class=screen>
@@ -354,11 +354,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('(800)5551212 x1234')</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp></pre>
<ol>
<li>The only change you've made since that last step is changing all the <code>+</code> to <code>*</code>. Instead of <code>\D+</code> between the parts of the phone number, you now match on <code>\D*</code>. Remember that <code>+</code> means &#8220;1 or more&#8221;? Well, <code>*</code> means &#8220;zero or more&#8221;. So now you should be able to parse phone numbers even when there is no separator character at all.
<li>The only change you&#8217;ve made since that last step is changing all the <code>+</code> to <code>*</code>. Instead of <code>\D+</code> between the parts of the phone number, you now match on <code>\D*</code>. Remember that <code>+</code> means &#8220;1 or more&#8221;? Well, <code>*</code> means &#8220;zero or more&#8221;. So now you should be able to parse phone numbers even when there is no separator character at all.
<li>Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits (<code>800</code>), then zero non-numeric characters, then a remembered group of three digits (<code>555</code>), then zero non-numeric characters, then a remembered group of four digits (<code>1212</code>), then zero non-numeric characters, then a remembered group of an arbitrary number of digits (<code>1234</code>), then the end of the string.
<li>Other variations work now too: dots instead of hyphens, and both a space and an <code>x</code> before the extension.
<li>Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the <code>groups()</code> method still returns a tuple of four elements, but the fourth element is just an empty string.
<li>I hate to be the bearer of bad news, but you're not finished yet. What's the problem here? There's an extra character before the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No problem, you can use the same technique of &#8220;zero or more non-numeric characters&#8221; to skip over the leading characters before the area code.
<li>Finally, you&#8217;ve solved the other long-standing problem: extensions are optional again. If no extension is found, the <code>groups()</code> method still returns a tuple of four elements, but the fourth element is just an empty string.
<li>I hate to be the bearer of bad news, but you&#8217;re not finished yet. What&#8217;s the problem here? There&#8217;s an extra character before the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No problem, you can use the same technique of &#8220;zero or more non-numeric characters&#8221; to skip over the leading characters before the area code.
</ol>
<p>The next example shows how to handle leading characters in phone numbers.
<pre class=screen>
@@ -370,12 +370,12 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp></pre>
<ol>
<li>This is the same as in the previous example, except now you're matching <code>\D*</code>, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you're not remembering these non-numeric characters (they're not in parentheses). If you find them, you'll just skip over them and then start remembering the area code whenever you get to it.
<li>You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it's treated as a non-numeric separator and matched by the <code>\D*</code> after the first remembered group.)
<li>Just a sanity check to make sure you haven't broken anything that used to work. Since the leading characters are entirely optional, this matches the beginning of the string, then zero non-numeric characters, then a remembered group of three digits (<code>800</code>), then one non-numeric character (the hyphen), then a remembered group of three digits (<code>555</code>), then one non-numeric character (the hyphen), then a remembered group of four digits (<code>1212</code>), then zero non-numeric characters, then a remembered group of zero digits, then the end of the string.
<li>This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn't this phone number match? Because there's a <code>1</code> before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (<code>\D*</code>). Aargh.
<li>This is the same as in the previous example, except now you&#8217;re matching <code>\D*</code>, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you&#8217;re not remembering these non-numeric characters (they&#8217;re not in parentheses). If you find them, you&#8217;ll just skip over them and then start remembering the area code whenever you get to it.
<li>You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it&#8217;s treated as a non-numeric separator and matched by the <code>\D*</code> after the first remembered group.)
<li>Just a sanity check to make sure you haven&#8217;t broken anything that used to work. Since the leading characters are entirely optional, this matches the beginning of the string, then zero non-numeric characters, then a remembered group of three digits (<code>800</code>), then one non-numeric character (the hyphen), then a remembered group of three digits (<code>555</code>), then one non-numeric character (the hyphen), then a remembered group of four digits (<code>1212</code>), then zero non-numeric characters, then a remembered group of zero digits, then the end of the string.
<li>This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn&#8217;t this phone number match? Because there&#8217;s a <code>1</code> before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (<code>\D*</code>). Aargh.
</ol>
<p>Let's back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let's take a different approach: don't explicitly match the beginning of the string at all. This approach is shown in the next example.
<p>Let&#8217;s back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let&#8217;s take a different approach: don&#8217;t explicitly match the beginning of the string at all. This approach is shown in the next example.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234').groups()</kbd> <span>&#x2461;</span></a>
@@ -385,13 +385,13 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('80055512121234')</kbd> <span>&#x2463;</span></a>
<samp>('800', '555', '1212', '1234')</samp></pre>
<ol>
<li>Note the lack of <code>^</code> in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
<li>Note the lack of <code>^</code> in this regular expression. You are not matching the beginning of the string anymore. There&#8217;s nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
<li>Now you can successfully parse a phone number that includes leading characters and a leading digit, plus any number of any kind of separators around each part of the phone number.
<li>Sanity check. This still works.
<li>That still works too.
</ol>
<p>See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can you tell the difference between one and the next?
<p>While you still understand the final answer (and it is the final answer; if you've discovered a case it doesn't handle, I don't want to know about it), let's write it out as a verbose regular expression, before you forget why you made the choices you made.
<p>While you still understand the final answer (and it is the final answer; if you&#8217;ve discovered a case it doesn&#8217;t handle, I don&#8217;t want to know about it), let&#8217;s write it out as a verbose regular expression, before you forget why you made the choices you made.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
@@ -409,11 +409,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212')</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212', '')</samp></pre>
<ol>
<li>Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it's no surprise that it parses the same inputs.
<li>Final sanity check. Yes, this still works. You're done.
<li>Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it&#8217;s no surprise that it parses the same inputs.
<li>Final sanity check. Yes, this still works. You&#8217;re done.
</ol>
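<p>As a recap of the progression above, here is a minimal sketch that runs the final compact pattern against the inputs used in the earlier examples. The names <code>samples</code> and <code>results</code> are only illustrative; they are not part of the original examples.

```python
import re

# Final compact pattern from the last iteration: there is no leading ^,
# so the match may begin anywhere in the input string.
phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

# Illustrative inputs taken from the examples in this section.
samples = [
    'work 1-(800) 555.1212 #1234',
    '(800)5551212 x1234',
    '800-555-1212',
    '80055512121234',
]

# Map each input to the tuple of remembered groups.
results = {s: phonePattern.search(s).groups() for s in samples}

for s in samples:
    print(repr(s), '->', results[s])
```

<p>Each input yields a four-element tuple; when there is no extension, the fourth element is an empty string.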
<h2 id=summary>Summary</h2>
<p>This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely overwhelmed by them now, believe me, you ain't seen nothing yet.
<p>This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you&#8217;re completely overwhelmed by them now, believe me, you ain&#8217;t seen nothing yet.
<p>You should now be familiar with the following techniques:
<ul>
<li><code>^</code> matches the beginning of a string.
-4
View File
@@ -27,7 +27,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
def from_roman(s):
"""convert Roman numeral to integer"""
pass
+2 -6
View File
@@ -22,8 +22,8 @@ roman_numeral_map = (('M', 1000),
def to_roman(n):
"""convert integer to Roman numeral"""
if n > 3999:
raise OutOfRangeError("number out of range (must be less than 3999)")
# if n > 3999:
# raise OutOfRangeError("number out of range (must be less than 3999)")
result = ""
for numeral, integer in roman_numeral_map:
@@ -31,7 +31,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
def from_roman(s):
"""convert Roman numeral to integer"""
pass
-4
View File
@@ -31,7 +31,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
def from_roman(s):
"""convert Roman numeral to integer"""
pass
-4
View File
@@ -34,7 +34,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
def from_roman(s):
"""convert Roman numeral to integer"""
pass
+6 -12
View File
@@ -121,22 +121,16 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
<li>...mention why from module import * is only allowed at module level
</ol>
</ol>
<li>Unit testing
<li><a href=unit-testing.html>Unit testing</a>
<ol>
<li>Introduction to Roman numerals
<li>Diving in
<li>Introducing romantest.py
<li>Testing for success
<li>Testing for failure
<li>Testing for sanity
<li><a href=unit-testing.html#divingin>(Not) diving in</a>
<li><a href=unit-testing.html#romantest1><code>romantest1.py</code></a>
<li><a href=unit-testing.html#romantest2><code>romantest2.py</code></a>
<li>...
</ol>
<li>Test-first programming
<ol>
<li>roman.py, stage 1
<li>roman.py, stage 2
<li>roman.py, stage 3
<li>roman.py, stage 4
<li>roman.py, stage 5
<li>...
</ol>
<li>Refactoring your code
<ol>
+278
View File
@@ -0,0 +1,278 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Unit testing - Dive into Python 3</title>
<link rel=stylesheet type=text/css href=dip3.css>
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<style type=text/css>
body{counter-reset:h1 7}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse id=search><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=root value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Unit testing</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Certitude is not the test of certainty. We have been cocksure of many things that were not so. <span>&#x275E;</span><br>&mdash; <cite>Oliver Wendell Holmes, Jr.</cite>
</blockquote>
<ol>
<li><a href=#divingin>(Not) diving in</a>
<li><a href=#romantest1><code>romantest1.py</code></a>
<li><a href=#romantest2><code>romantest2.py</code></a>
<li>...
</ol>
<h2 id=divingin>(Not) diving in</h2>
<p class=fancy>In previous chapters, you &#8220;dived in&#8221; by immediately looking at code and trying to understand it as quickly as possible. Now that you have some Python under your belt, you're going to step back and look at the steps that happen <em>before</em> the code gets written.
<p>In this chapter, you're going to write, debug, and optimize a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in <a href="regular-expressions.html#romannumerals">&#8220;Case study: roman numerals&#8221;</a>. Now let's step back and consider what it would take to expand that into a two-way utility.
<p><a href="regular-expressions.html#romannumerals">The rules for Roman numerals</a> lead to a number of interesting observations:
<ol>
<li>There is only one correct way to represent a particular number as Roman numerals.
<li>The converse is also true: if a string of characters is a valid Roman numeral, it represents only one number (that is, it can only be read one way).
<li>There is a limited range of numbers that can be expressed as Roman numerals, specifically <code>1</code> through <code>3999</code>. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by <code>1000</code>, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from <code>1</code> to <code>3999</code>.)
<li>There is no way to represent <code>0</code> in Roman numerals.
<li>There is no way to represent negative numbers in Roman numerals.
<li>There is no way to represent fractions or non-integer numbers in Roman numerals.
</ol>
<p>Let's start mapping out what a <code>roman.py</code> module should do. It will have two main functions, <code>to_roman()</code> and <code>from_roman()</code>. The <code>to_roman()</code> function should take an integer from <code>1</code> to <code>3999</code> and return the Roman numeral representation as a string&hellip;</p>
<p>Stop right there. Now let's do something a little unexpected: write a test case that checks whether the <code>to_roman()</code> function does what you want it to. You read that right: you're going to write code that tests code that you haven't written yet.
<p>This is called <i>unit testing</i>. The set of two conversion functions &mdash; <code>to_roman()</code>, and later <code>from_roman()</code> &mdash; can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named <code>unittest</code> module.
<p>Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:
<ul>
<li>Before writing code, it forces you to detail your requirements in a useful fashion.
<li>While writing code, it keeps you from over-coding. When all the test cases pass, the function is complete.
<li>When refactoring code, it assures you that the new version behaves the same way as the old version.
<li>When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code. (&#8220;But <em>sir</em>, all the unit tests passed when I checked it in...&#8221;)
<li>When writing code in a team, it increases confidence that the code you're about to commit isn't going to break someone else's code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn't play well with others.)
</ul>
<h2 id=romantest1><code>romantest1.py</code></h2>
<p>A test case answers a single question about the code it is testing. A test case should be able to...
<ul>
<li>...run completely by itself, without any human input. Unit testing is about automation.
<li>...determine by itself whether the function it is testing has passed or failed, without a human interpreting the results.
<li>...run in isolation, separate from any other test cases (even if they test the same functions). Each test case is an island.
</ul>
<p>Given that, let's build a test case for the first requirement:
<ol>
<li>The <code>to_roman()</code> function should return the Roman numeral representation for all integers <code>1</code> to <code>3999</code>.
</ol>
<p>It is not immediately obvious how this code does&hellip; well, <em>anything</em>. It defines a class which has no <code>__init__()</code> method. The class <em>does</em> have another method, but it is never called. The entire script has a <code>__main__</code> block, but it doesn't reference the class or its method. But it does do something, I promise.
<p class=download>[<a href=romantest1.py>download <code>romantest1.py</code></a>]
<pre><code>import roman1
import unittest
<a>class KnownValues(unittest.TestCase): <span>&#x2460;</span></a>
known_values = ( (1, 'I'),
(2, 'II'),
(3, 'III'),
(4, 'IV'),
(5, 'V'),
(6, 'VI'),
(7, 'VII'),
(8, 'VIII'),
(9, 'IX'),
(10, 'X'),
(50, 'L'),
(100, 'C'),
(500, 'D'),
(1000, 'M'),
(31, 'XXXI'),
(148, 'CXLVIII'),
(294, 'CCXCIV'),
(312, 'CCCXII'),
(421, 'CDXXI'),
(528, 'DXXVIII'),
(621, 'DCXXI'),
(782, 'DCCLXXXII'),
(870, 'DCCCLXX'),
(941, 'CMXLI'),
(1043, 'MXLIII'),
(1110, 'MCX'),
(1226, 'MCCXXVI'),
(1301, 'MCCCI'),
(1485, 'MCDLXXXV'),
(1509, 'MDIX'),
(1607, 'MDCVII'),
(1754, 'MDCCLIV'),
(1832, 'MDCCCXXXII'),
(1993, 'MCMXCIII'),
(2074, 'MMLXXIV'),
(2152, 'MMCLII'),
(2212, 'MMCCXII'),
(2343, 'MMCCCXLIII'),
(2499, 'MMCDXCIX'),
(2574, 'MMDLXXIV'),
(2646, 'MMDCXLVI'),
(2723, 'MMDCCXXIII'),
(2892, 'MMDCCCXCII'),
(2975, 'MMCMLXXV'),
(3051, 'MMMLI'),
(3185, 'MMMCLXXXV'),
(3250, 'MMMCCL'),
(3313, 'MMMCCCXIII'),
(3408, 'MMMCDVIII'),
(3501, 'MMMDI'),
(3610, 'MMMDCX'),
(3743, 'MMMDCCXLIII'),
(3844, 'MMMDCCCXLIV'),
(3888, 'MMMDCCCLXXXVIII'),
(3940, 'MMMCMXL'),
<a> (3999, 'MMMCMXCIX')) <span>&#x2461;</span></a>
<a> def test_to_roman_known_values(self): <span>&#x2462;</span></a>
"""to_roman should give known result with known input"""
for integer, numeral in self.known_values:
<a> result = roman1.to_roman(integer) <span>&#x2463;</span></a>
<a> self.assertEqual(numeral, result) <span>&#x2464;</span></a>
if __name__ == "__main__":
unittest.main()</code></pre>
<ol>
<li>To write a test case, first subclass the <code>TestCase</code> class of the <code>unittest</code> module. This class provides many useful methods which you can use in your test case to test specific conditions.
<li>This is a tuple of integer/numeral pairs that I verified manually. It includes the lowest ten numbers, the highest number, every number that translates to a single-character Roman numeral, and a random sampling of other valid numbers. The point of a unit test is not to test every possible input, but to test a representative sample.
<li>Every individual test is its own method, which must take no parameters and return no value. If the method exits normally without raising an exception, the test is considered passed; if the method raises an exception, the test is considered failed.
<li>Here you call the actual <code>to_roman()</code> function. (Well, the function hasn't been written yet, but once it is, this is the line that will call it.) Notice that you have now defined the <acronym>API</acronym> for the <code>to_roman()</code> function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the <acronym>API</acronym> is different than that, this test is considered failed. Also notice that you are not trapping any exceptions when you call <code>to_roman()</code>. This is intentional. <code>to_roman()</code> shouldn't raise an exception when you call it with valid input, and these input values are all valid. If <code>to_roman()</code> raises an exception, this test is considered failed.
<li>Assuming the <code>to_roman()</code> function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the <em>right</em> value. This is a common question, and the <code>TestCase</code> class provides a method, <code>assertEqual</code>, to check whether two values are equal. If the result returned from <code>to_roman()</code> (<var>result</var>) does not match the known value you were expecting (<var>numeral</var>), <code>assertEqual</code> will raise an exception and the test will fail. If the two values are equal, <code>assertEqual</code> will do nothing. If every value returned from <code>to_roman()</code> matches the known value you expect, <code>assertEqual</code> never raises an exception, so <code>test_to_roman_known_values</code> eventually exits normally, which means <code>to_roman()</code> has passed this test.
</ol>
<p>Once you have a test case, you can start coding the <code>to_roman()</code> function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you've written any code, you're doing it wrong &mdash; your tests aren't testing your code at all! Write a test that fails, then code until it passes.
<pre><code># roman1.py
def to_roman(n):
"""convert integer to Roman numeral"""
<a> pass <span>&#x2460;</span></a></code></pre>
<ol>
<li>At this stage, you want to define the <acronym>API</acronym> of the <code>to_roman()</code> function, but you don't want to code it yet. (Your test needs to fail first.) To stub it out, use the Python reserved word <code>pass</code> [FIXME ref], which does precisely nothing.
</ol>
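<p>A quick sketch of what the stub actually does when called. The variable name <code>stub_result</code> is only illustrative.

```python
def to_roman(n):
    """convert integer to Roman numeral"""
    pass  # placeholder body: does nothing and falls off the end

# A function that ends without an explicit return statement returns None.
stub_result = to_roman(1)
print(stub_result)
```

<p>Because the stub returns <code>None</code> instead of a string, any comparison against a known numeral is bound to fail, which is exactly what you want at this stage.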
<p>Execute <code>romantest1.py</code> on the command line to run the test. If you call it with the <code>-v</code> command-line option, it will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 romantest1.py -v</kbd>
<samp><a>to_roman should give known result with known input ... FAIL <span>&#x2460;</span></a>
======================================================================
FAIL: to_roman should give known result with known input
----------------------------------------------------------------------
Traceback (most recent call last):
File "romantest1.py", line 73, in test_to_roman_known_values
self.assertEqual(numeral, result)
<a>AssertionError: 'I' != None <span>&#x2461;</span></a>
----------------------------------------------------------------------
<a>Ran 1 test in 0.016s <span>&#x2462;</span></a>
<a>FAILED (failures=1) <span>&#x2463;</span></a></samp></pre>
<ol>
<li>Running the script runs <code>unittest.main()</code>, which runs each test case. Each test case is a method within each class in <code>romantest1.py</code> that inherits from <code>unittest.TestCase</code>. For each test case, the <code>unittest</code> module will print out the <code>docstring</code> of the method and whether that test passed or failed. As expected, this test case fails.
<li>For each failed test case, <code>unittest</code> displays the trace information showing exactly what happened. In this case, the call to <code>assertEqual()</code> raised an <code>AssertionError</code> because it was expecting <code>to_roman(1)</code> to return <code>"I"</code>, but it didn't. (Since there was no explicit return statement, the function returned <code>None</code>, the Python null value.)
<li>After the detail of each test, <code>unittest</code> displays a summary of how many tests were performed and how long it took.
<li>Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass, <code>unittest</code> distinguishes between failures and errors. A failure is a call to an <code>assertXYZ</code> method, like <code>assertEqual</code> or <code>assertRaises</code>, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort of exception raised in the code you're testing or the unit test case itself.
</ol>
<p><em>Now</em>, finally, you can write the <code>to_roman()</code> function.
<p class=download>[<a href=roman1.py>download <code>roman1.py</code></a>]
<pre><code>roman_numeral_map = (('M', 1000),
('CM', 900),
('D', 500),
('CD', 400),
('C', 100),
('XC', 90),
('L', 50),
('XL', 40),
('X', 10),
('IX', 9),
('V', 5),
('IV', 4),
<a> ('I', 1)) <span>&#x2460;</span></a>
def to_roman(n):
"""convert integer to Roman numeral"""
result = ""
for numeral, integer in roman_numeral_map:
<a> while n >= integer: <span>&#x2461;</span></a>
result += numeral
n -= integer
return result</code></pre>
<ol>
<li><var>roman_numeral_map</var> is a tuple of tuples which defines three things: the character representations of the most basic Roman numerals; the order of the Roman numerals (in descending value order, from <code>M</code> all the way down to <code>I</code>); the value of each Roman numeral. Each inner tuple is a pair of <code>(<var>numeral</var>, <var>value</var>)</code>. It's not just single-character Roman numerals; it also defines two-character pairs like <code>CM</code> (&#8220;one hundred less than one thousand&#8221;). This makes the <code>to_roman()</code> function code simpler.
<li>Here's where the rich data structure of <var>roman_numeral_map</var> pays off, because you don't need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through <var>roman_numeral_map</var> looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
</ol>
<p>If you're still not clear how the <code>to_roman()</code> function works, add a <code>print()</code> call to the end of the <code>while</code> loop:
<pre><code>
        while n >= integer:
            result += numeral
            n -= integer
            print('subtracting {0} from input, adding {1} to output'.format(integer, numeral))</code></pre>
<p>With the debug <code>print()</code> call in place, the output looks like this:
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import roman1</kbd>
<samp class=prompt>>>> </samp><kbd>roman1.to_roman(1424)</kbd>
<samp>subtracting 1000 from input, adding M to output
subtracting 400 from input, adding CD to output
subtracting 10 from input, adding X to output
subtracting 10 from input, adding X to output
subtracting 4 from input, adding IV to output
'MCDXXIV'</samp></pre>
<p>So the <code>to_roman()</code> function appears to work, at least in this manual spot check. But will it pass the test case you wrote?
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 romantest1.py -v</kbd>
<samp>to_roman should give known result with known input ... ok
----------------------------------------------------------------------
Ran 1 test in 0.016s
OK</samp></pre>
<ol>
<li>Hooray! The <code>to_roman()</code> function passes the &#8220;known values&#8221; test case. It's not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (<code>3999</code>), and the input that produces the longest possible Roman numeral (<code>3888</code>). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
</ol>
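<p>For the edge cases mentioned above, a quick spot check might look like the following sketch. The function is reproduced from <code>roman1.py</code> so the snippet runs on its own; the <var>expected</var> dictionary is an illustration, not the contents of <code>romantest1.py</code>.

```python
roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def to_roman(n):
    """convert integer to Roman numeral"""
    result = ""
    for numeral, integer in roman_numeral_map:
        # Use each numeral as many times as it still fits into n.
        while n >= integer:
            result += numeral
            n -= integer
    return result


# Illustrative inputs: the largest value, the longest numeral,
# and the value traced by hand earlier.
expected = {3999: 'MMMCMXCIX',
            3888: 'MMMDCCCLXXXVIII',
            1424: 'MCDXXIV'}

for integer, numeral in expected.items():
    assert to_roman(integer) == numeral, (integer, to_roman(integer))
print('all spot checks passed')
```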
<p>&#8220;Good&#8221; input? Hmm. What about bad input?
<h2 id=romantest2><code>romantest2.py</code></h2>
<p>It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import roman1</kbd>
<a><samp class=prompt>>>> </samp><kbd>roman1.to_roman(4000)</kbd> <span>&#x2460;</span></a>
<samp>'MMMM'</samp>
<samp class=prompt>>>> </samp><kbd>roman1.to_roman(5000)</kbd>
<samp>'MMMMM'</samp>
<samp class=prompt>>>> </samp><kbd>roman1.to_roman(9999)</kbd>
<samp>'MMMMMMMMMCMXCIX'</samp></pre>
<ol>
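<li>That's definitely <em>not</em> what you wanted. The function is only meant to handle integers up to <code>3999</code>, but instead of failing on <code>4000</code>, it quietly returned a string of four <code>M</code>s as if nothing were wrong.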
</ol>
<p>The question to ask yourself is, &#8220;How can I express this as a testable requirement?&#8221; How's this for starters:
<blockquote>
<p>The <code>to_roman()</code> function should fail when given an integer greater than <code>3999</code>.
</blockquote>
<p>What would that test look like?
<p class=download>[<a href=romantest2.py>download <code>romantest2.py</code></a>]
<pre><code>class ToRomanBadInput(unittest.TestCase):
def test_too_large(self):
"""to_roman should fail with large input"""
self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)</code></pre>
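<p>To see how <code>assertRaises</code> behaves once such an exception exists, here is a self-contained sketch. Everything about <code>OutOfRangeError</code> and the range check below is an assumption made for illustration; the actual <code>roman2.py</code> may define them differently.

```python
import io
import unittest


class OutOfRangeError(ValueError):
    """Hypothetical exception; the real roman2 module may differ."""


roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def to_roman(n):
    """convert integer to Roman numeral (with an assumed range check)"""
    if n > 3999:
        raise OutOfRangeError('number out of range (must be less than 4000)')
    result = ''
    for numeral, integer in roman_numeral_map:
        while n >= integer:
            result += numeral
            n -= integer
    return result


class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        """to_roman should fail with large input"""
        # Pass the callable and its argument separately, so that
        # assertRaises can make the call and catch the exception itself.
        self.assertRaises(OutOfRangeError, to_roman, 4000)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.wasSuccessful())
```

<p>Note that the test never calls <code>to_roman(4000)</code> directly. Doing so would raise the exception before <code>assertRaises</code> had a chance to catch it.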
<!-- FIXME callouts -->
<p>...
<!--
For instance, the <code>testFromRomanCase</code> method (&#8220;<code>from_roman()</code> should only accept uppercase input&#8221;) was an error, because the call to <code>numeral.upper()</code> raised an <code>AttributeError</code> exception, because <code>to_roman()</code> was supposed to return a string but didn't. But <code>testZero</code> (&#8220;<code>to_roman()</code> should fail with 0 input&#8221;) was a failure, because the call to <code>from_roman()</code> did not raise the <code>InvalidRomanNumeral</code> exception that <code>assertRaises</code> was looking for.
-->
<!--
<li>For each failed test case, <code>unittest</code> displays the trace information showing exactly what happened. In this case, the call to <code>assertRaises</code> (also called <code>failUnlessRaises</code>) raised an <code>AssertionError</code> because it was expecting <code>to_roman()</code> to raise an <code>OutOfRangeError</code> and it didn't.
-->
<!--
<p>Given all of this, what would you expect out of a set of functions to convert to and from Roman numerals?
<ol>
<li><code>to_roman</code> should return the Roman numeral representation for all integers <code>1</code> to <code>3999</code>.
<li><code>to_roman</code> should fail when given an integer outside the range <code>1</code> to <code>3999</code>.
<li><code>to_roman</code> should fail when given a non-integer number.
<li><code>from_roman</code> should take a valid Roman numeral and return the number that it represents.
<li><code>from_roman</code> should fail when given an invalid Roman numeral.
<li>If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number
you started with. So <code>from_roman(to_roman(n)) == n</code> for all <var>n</var> in <code>1..3999</code>.
<li><code>to_roman</code> should always return a Roman numeral using uppercase letters.
<li><code>from_roman</code> should only accept uppercase Roman numerals (<i class=foreignphrase><acronym>i.e.</acronym></i> it should fail when given lowercase input).
</ol>
-->
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim, <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>CC-BY-SA-3.0</a>
<script type=text/javascript src=jquery.js></script>
<script type=text/javascript src=dip3.js></script>
+11 -7
@@ -40,7 +40,7 @@ body{counter-reset:h1 1}
</ol>
<h2 id=divingin>Diving in</h2>
<p class=fancy>You know how other books go on and on about programming fundamentals and finally work up to building something useful? Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<p class=download>[<a href=humansize.py>download</a>]</p>
<p class=download>[<a href=humansize.py>download <code>humansize.py</code></a>]</p>
<pre><code>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
@@ -70,11 +70,13 @@ if __name__ == "__main__":
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))</code></pre>
<p>Now let's run this program on the command line. On Windows, it will look something like this:
<pre class=screen><samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<pre class=screen>
<samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
<p>On Mac OS X or Linux, it would look something like this:
<pre class=screen><samp class=prompt>you@localhost:~$ </samp><kbd>python3 humansize.py</kbd>
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
<!-- FIXME: this would be a good place to explain what the program, you know, actually does -->
@@ -103,14 +105,14 @@ if __name__ == "__main__":
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
</dd>
<dt>weakly typed language</dt>
<dd>A language in which types are &#8220;automagically&#8221; coerced to other types as needed; the opposite of strongly typed. PHP is weakly typed. In PHP, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion. [FIXME double-check this]
<dd>A language in which types are &#8220;automagically&#8221; coerced to other types as needed; the opposite of strongly typed. <abbr>PHP</abbr> is weakly typed. In <abbr>PHP</abbr>, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion. [FIXME double-check this]
</dd>
</dl>
<p>So Python is both <em>dynamically typed</em> (because it doesn't use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
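<p>A short sketch of what &#8220;strongly typed&#8221; means in practice. The variable names are made up for illustration.

```python
implicit_error = None
try:
    # Python will not silently coerce the integer into a string.
    combined = '12' + 3
except TypeError as exc:
    implicit_error = type(exc).__name__

# With an explicit conversion, either interpretation works.
as_string = '12' + str(3)    # string concatenation
as_number = int('12') + 3    # integer addition

print(implicit_error, as_string, as_number)
```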
<p>If you have experience in other programming languages, this table may help you visualize how Python compares to them:
<table class=simple>
<tr><th></th><th>Statically typed</th><th>Dynamically typed</th></tr>
<tr><th>Weakly typed</th><td>C, Objective-C</td><td>JavaScript, Perl 5, PHP</td></tr>
<tr><th>Weakly typed</th><td>C, Objective-C</td><td>JavaScript, Perl 5, <abbr>PHP</abbr></td></tr>
<tr><th>Strongly typed</th><td>Pascal, Java</td><td>Python, Ruby</td></tr>
</table>
<h2 id=readability>Writing readable code</h2>
@@ -220,11 +222,13 @@ if __name__ == "__main__":
<p><span>&#x261E;</span>Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
</blockquote>
<p>So what makes this <code>if</code> statement special? Well, modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension.
<pre class=screen><samp class=prompt>>>> </samp><kbd>import humansize</kbd>
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import humansize</kbd>
<samp class=prompt>>>> </samp><kbd>humansize.__name__</kbd>
<samp>'humansize'</samp></pre>
<p>But you can also run the module directly as a standalone program, in which case <code>__name__</code> will be a special default value, <code>__main__</code>. Python will evaluate this <code>if</code> statement, find a true expression, and execute the <code>if</code> code block. In this case, that block prints two values.
<pre class=screen><samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<pre class=screen>
<samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
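<p>Both behaviors can be observed from a single script. The following is a sketch only: it writes a throwaway module named <code>demo_name_module.py</code> (a made-up name) to a temporary directory, imports it, and then runs the same file as if it were the main program using the standard <code>runpy</code> module.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

# A one-line module that records how it was loaded.
SOURCE = 'ROLE = "program" if __name__ == "__main__" else "module"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo_name_module.py'   # hypothetical file name
    path.write_text(SOURCE)

    # Imported: __name__ is the file name without directory or extension.
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module('demo_name_module')
    finally:
        sys.path.remove(tmp)
        sys.modules.pop('demo_name_module', None)
    imported = (module.__name__, module.ROLE)

    # Run as a standalone program: __name__ is '__main__'.
    script_globals = runpy.run_path(str(path), run_name='__main__')
    executed = (script_globals['__name__'], script_globals['ROLE'])

print(imported)
print(executed)
```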
<h2 id=furtherreading>Further reading</h2>