added asides, new styles

This commit is contained in:
Mark Pilgrim
2009-03-28 15:58:35 -05:00
parent fe57cb0215
commit 1a4ce72944
9 changed files with 127 additions and 83 deletions
+17 -11
View File
@@ -19,26 +19,27 @@ mark{background:#ff8;font-weight:bold}
<p><span>&#x275D;</span> Words, words. They&#8217;re all we have to go on. <span>&#x275E;</span><br>&mdash; <a href=http://www.imdb.com/title/tt0100519/quotes>Rosencrantz and Guildenstern are Dead</a>
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<p class=f>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the &#8220;one encoding to rule them all.&#8221; I&#8217;d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
<p>I&#8217;d also like a pony.
<p>A Unicode pony.
<p>A Unipony, as it were.
<p>I&#8217;ll settle for character encoding auto-detection.
<h2 id=faq.what>What is character encoding auto-detection?</h2>
<h2 id=faq.what>What is Character Encoding Auto-Detection?</h2>
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It&#8217;s like cracking a code when you don&#8217;t have the decryption key.
<h3 id=faq.impossible>Isn&#8217;t that impossible?</h3>
<h3 id=faq.impossible>Isn&#8217;t That Impossible?</h3>
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds &#8220;txzqJv 2!dasd0a QqdKjvz&#8221; will instantly recognize that that isn&#8217;t English (even though it is composed entirely of English letters). By studying lots of &#8220;typical&#8221; text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text&#8217;s language.
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
<h3 id=faq.who>Does such an algorithm exist?</h3>
<h3 id=faq.who>Does Such An Algorithm Exist?</h3>
<p>As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>Mozilla Firefox contains an encoding auto-detection library</a> which is open source. <a href=http://chardet.feedparser.org/>I ported the library to Python 2</a> and dubbed it the <code>chardet</code> module. This chapter will take you step-by-step through the process of porting the <code>chardet</code> module from Python 2 to Python 3.
<h2 id=divingin2>Introducing the <code>chardet</code> module</h2>
<h2 id=divingin2>Introducing The <code>chardet</code> Module</h2>
<p>[FIXME download link, possibly on chardet.feedparser.org, possibly local]
<p>Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself.
<aside>Encoding detection is really language detection in drag.</aside>
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
<ol>
@@ -48,18 +49,19 @@ mark{background:#ff8;font-weight:bold}
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn&#8217;t know a character encoding from a hole in the ground.
</ol>
<h3 id=how.bom><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></h3>
<h3 id=how.bom><code>UTF-n</code> With A <abbr title="Byte Order Mark">BOM</abbr></h3>
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
<h3 id=how.esc>Escaped encodings</h3>
<h3 id=how.esc>Escaped Encodings</h3>
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code>escprober.py</code>) and feeds it the text.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
<h3 id=how.mb>Multi-byte encodings</h3>
<h3 id=how.mb>Multi-Byte Encodings</h3>
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of &#8220;probers&#8221; for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
<p>The distribution analyzers (each defined in <code>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.
<h3 id=how.sb>Single-byte encodings</h3>
<h3 id=how.sb>Single-Byte Encodings</h3>
<aside>Seriously, where&#8217;s my Unicode pony?</aside>
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code>sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code>sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text actually stored &#8220;backwards&#8221; line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
@@ -567,8 +569,9 @@ RefactoringTool: Files that were modified:
RefactoringTool: test.py</samp></pre>
<p>[FIXME explain the difference in import syntax]
<p>Well, that wasn&#8217;t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it&#8217;ll work?
<h2 id=manual>Fixing what <code>2to3</code> can&#8217;t</h2>
<h2 id=manual>Fixing What <code>2to3</code> Can&#8217;t</h2>
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
<aside>You do have tests, right?</aside>
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it&#8217;s a good way to test our ported code to make sure there aren&#8217;t any bugs lurking anywhere.
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class=traceback>Traceback (most recent call last):
@@ -613,6 +616,7 @@ import sys</code></pre>
<p>There are variations of this problem scattered throughout the <code>chardet</code> library. In some places it&#8217;s "<code>import constants, sys</code>"; in other places, it&#8217;s "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
<p>Onward!
<h3 id=namefileisnotdefined>Name <var>'file'</var> is not defined</h3>
<aside>open() is the new file(). PapayaWhip is the new black.</aside>
<p>And here we go again, running <code>test.py</code> to try to execute our test cases&hellip;</p>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
@@ -654,6 +658,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
.
for line in open(f, 'rb'):
u.feed(line)</code></pre>
<aside>Not an array of characters, but an array of bytes.</aside>
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string &mdash; an array of Unicode characters &mdash; according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
@@ -776,6 +781,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
self._mInputState = eEscAscii
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
<aside>Each item in a string is a string. Each item in a byte array is an integer.</aside>
<p>This error doesn't occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd> <span>&#x2460;</span></a>
@@ -1115,7 +1121,7 @@ NameError: global name 'reduce' is not defined</samp></pre>
<mark> total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
<p>The <code>reduce()</code> function takes two arguments &mdash; a function and a list (strictly speaking, any iterable object will do) &mdash; and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
<p>This monstrosity was so common in Python 2 that Python 3 added a global <code>sum()</code> function.
<p>This monstrosity was so common that Python added a global <code>sum()</code> function.
<pre><code> def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
+16 -7
View File
@@ -62,11 +62,11 @@ abbr{font-variant:small-caps;text-transform:lowercase;letter-spacing:0.1em}
.note{margin:3.5em 4.94em}
.note span{display:block;float:left;font-size:xx-large;line-height:0.875;margin:0 0.22em 0 -1.22em}
.c,pre,.w,.w a,.d{line-height:2.154}
.f:first-letter{float:left;color:#ddd;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
h1,h2,h3,p,ul,ol{margin:1.75em 0;font-size:medium}
.f:first-letter{float:left;color:lightblue;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
p,ul,ol{margin:1.75em 0;font-size:medium}
/* basics */
html{background:#fff;color:#000}
html{background:#fff;color:#333}
body{margin:1.75em 28px}
form div{float:right}
.c{text-align:center;margin:2.154em 0}
@@ -74,7 +74,7 @@ form div{float:right}
/* links */
a{text-decoration:none;border-bottom:1px dotted}
a:hover{border-bottom:1px solid}
a:link,.w a{color:#26c}
a:link,.w a{color:steelblue}
a:visited{color:#93c}
.c a{color:inherit}
@@ -92,10 +92,19 @@ kbd{font-weight:bold}
li ol,.q{margin:0}
pre a,.w a,pre a:hover{border:0}
/* headers */
h1{background:PapayaWhip;width:100%} /* all hail PapayaWhip */
/* headers and pullquotes */
h1,h2,h3,aside{font-family:"Book Antiqua",Palatino,Georgia,serif}
h1,h2,h3{font-variant:small-caps}
h1,h2{letter-spacing:-1px}
h1,h1 code{font-size:xx-large}
h2,h2 code{font-size:x-large}
h3,h3 code{font-size:large}
h1{border-bottom:4px double;width:100%;margin:1em 0}
h1:before{content:"Chapter " counter(h1) ". "}
h1{counter-reset:h2}
h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
h2{counter-reset:h3}
h2{counter-reset:h3;border-top:1px dotted;margin-top:1.75em;padding-top:1.75em}
#toc + h2{border:0;margin:0;padding:0}
#toc + h2:before{content:""}
h3:before{counter-increment:h3;content:counter(h1) "." counter(h2) "." counter(h3) ". "}
aside{display:block;float:right;font-style:oblique;font-size:xx-large;width:25%;margin:1.75em 0 .75em 1.75em;background:steelblue;color:white;padding:1.75em;border:1px solid;-moz-border-radius:1em;-webkit-border-radius:1em;border-radius:1em}
+1 -1
View File
@@ -100,6 +100,6 @@ function showTOC() {
toc += '</ol>';
level -= 1;
}
toc = '<span>&#9662;</span> <a href="javascript:hideTOC()">hide table of contents</a>' + toc;
toc = '<span>&#9662;</span> <a href="javascript:hideTOC()">hide table of contents</a><ol start=0><li><a href=table-of-contents.html><span>&uarr;</span> Full table of contents</a></li>' + toc.substring(4);
$("#toc").html(toc);
}
+16 -11
View File
@@ -14,9 +14,10 @@ body{counter-reset:h1 11}
<p><span>&#x275D;</span> East is East, and West is West, and never the twain shall meet. <span>&#x275E;</span><br>&mdash; <a href=http://en.wikiquote.org/wiki/Rudyard_Kipling>Rudyard Kipling</a>
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<p class=f>Let&#8217;s talk about plural nouns. Also, functions that return other functions, advanced regular expressions, iterators, and generators. But first, let&#8217;s talk about how to make plural nouns. (If you haven&#8217;t read <a href=regular-expressions.html>the chapter on regular expressions</a>, now would be a good time. This chapter assumes you understand the basics of regular expressions, and quickly descends into more advanced uses.)
<p>English is a schizophrenic language that borrows from a lot of other languages, and the rules for making singular nouns into plural nouns are varied and complex. There are rules, and then there are exceptions to those rules, and then there are exceptions to the exceptions.
<aside>English is a schizophrenic language.</aside>
<p>If you grew up in an English-speaking country or learned English in a formal school setting, you&#8217;re probably familiar with the basic rules:
<ul>
<li>If a word ends in S, X, or Z, add ES. <i>Bass</i> becomes <i>basses</i>, <i>fax</i> becomes <i>faxes</i>, and <i>waltz</i> becomes <i>waltzes</i>.
@@ -27,7 +28,7 @@ body{counter-reset:h1 11}
<p>(I know, there are a lot of exceptions. <i>Man</i> becomes <i>men</i> and <i>woman</i> becomes <i>women</i>, but <i>human</i> becomes <i>humans</i>. <i>Mouse</i> becomes <i>mice</i> and <i>louse</i> becomes <i>lice</i>, but <i>house</i> becomes <i>houses</i>. <i>Knife</i> becomes <i>knives</i> and <i>wife</i> becomes <i>wives</i>, but <i>lowlife</i> becomes <i>lowlifes</i>. And don&#8217;t even get me started on words that are their own plural, like <i>sheep</i>, <i>deer</i>, and <i>haiku</i>.)
<p>Other languages, of course, are completely different.
<p>Let&#8217;s design a Python library that automatically pluralizes English nouns. We&#8217;ll start just these four rules, but keep in mind that you&#8217;ll inevitably need to add more.
<h2 id=i-know>I know, let&#8217;s use regular expressions!</h2>
<h2 id=i-know>I Know, Let&#8217;s Use Regular Expressions!</h2>
<p>So you&#8217;re looking at words, which, at least in English, means you&#8217;re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions!
<p class=d>[<a href=examples/plural1.py>download <code>plural1.py</code></a>]
<pre><code>import re
@@ -111,7 +112,7 @@ def plural(noun):
</ol>
<p>Regular expression substitutions are extremely powerful, and the <code>\1</code> syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn&#8217;t directly map to the way you first described the pluralizing rules. You originally laid out rules like &#8220;if the word ends in S, X, or Z, then add ES&#8221;. If you look at this function, you have two lines of code that say &#8220;if the word ends in S, X, or Z, then add ES&#8221;. It doesn&#8217;t get much more direct than that.
<h2 id=a-list-of-functions>A list of functions</h2>
<h2 id=a-list-of-functions>A List Of Functions</h2>
<p>Now you&#8217;re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let&#8217;s temporarily complicate part of the program so you can simplify another part.
@@ -159,6 +160,7 @@ def plural(noun):
<li>Since the rules have been broken out into a separate data structure, the new <code>plural()</code> function can be reduced to a few lines of code. Using a <code>for</code> loop, you can pull out the match and apply rules two at a time (one match, one apply) from the <var>rules</var> structure. On the first iteration of the <code>for</code> loop, <var>matches_rule</var> will get <code>match_sxz</code>, and <var>apply_rule</var> will get <code>apply_sxz</code>. On the second iteration (assuming you get that far), <var>matches_rule</var> will be assigned <code>match_h</code>, and <var>apply_rule</var> will be assigned <code>apply_h</code>. The function is guaranteed to return something eventually, because the final match rule (<code>match_default</code>) simply returns <code>True</code>, meaning the corresponding apply rule (<code>apply_default</code>) will always be applied.
</ol>
<aside>The &#8220;rules&#8221; variable is a list of functions.</aside>
<p>The reason this technique works is that <a href=your-first-python-program.html#everythingisanobject>everything in Python is an object</a>, including functions. The <var>rules</var> data structure contains functions &mdash; not names of functions, but actual function objects. When they get assigned in the <code>for</code> loop, then <var>matches_rule</var> and <var>apply_rule</var> are actual functions that you can call. On the first iteration of the <code>for</code> loop, this is equivalent to calling <code>matches_sxz(noun)</code>, and if it returns a match, calling <code>apply_sxz(noun)</code>.
<p>If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire <code>for</code> loop is equivalent to the following:
@@ -188,7 +190,7 @@ def plural(noun):
<p>But this is really just a stepping stone to the next section. Let&#8217;s move on&hellip;
<h2 id=a-list-of-patterns>A list of patterns</h2>
<h2 id=a-list-of-patterns>A List Of Patterns</h2>
<p>Defining separate named functions for each match and apply rule isn&#8217;t really necessary. You never call them directly; you add them to the <var>rules</var> list and call them through there. Furthermore, each function follows one of two patterns. All the match functions call <code>re.search()</code>, and all the apply functions call <code>re.sub()</code>. Let&#8217;s factor out the patterns so that defining new rules can be easier.
@@ -234,7 +236,7 @@ def build_match_and_apply_functions(pattern, search, replace):
<li>Since the <var>rules</var> list is the same as the previous example (really, it is), it should come as no surprise that the <code>plural()</code> function hasn&#8217;t changed at all. It&#8217;s completely generic; it takes a list of rule functions and calls them in order. It doesn&#8217;t care how the rules are defined. In the previous example, they were defined as seperate named functions. Now they are built dynamically by mapping the output of the <code>build_match_and_apply_functions()</code> function onto a list of raw strings. It doesn&#8217;t matter; the <code>plural</code> function still works the same way.
</ol>
<h2 id=a-file-of-patterns>A file of patterns</h2>
<h2 id=a-file-of-patterns>A File Of Patterns</h2>
<p>You&#8217;ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them.
@@ -325,7 +327,7 @@ def plural(noun):
<p>Since <code>make_counter</code> sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing <var>x</var> and spitting out values. But let&#8217;s look at more productive uses of generators instead.
<h3 id=a-fibonacci-generator>A Fibonacci generator</h3>
<h3 id=a-fibonacci-generator>A Fibonacci Generator</h3>
<p class=d>[<a href=examples/fibonacci.py>download <code>fibonacci.py</code></a>]
<pre><code>def fib(max):
@@ -339,6 +341,8 @@ def plural(noun):
<li><var>b</var> is the next number in the sequence, so assign that to <var>a</var>, but also calculate the next value (<code>a + b</code>) and assign that to <var>b</var> for later use. Note that this happens in parallel; if <var>a</var> is <code>3</code> and <var>b</var> is <code>5</code>, then <code>a, b = b, a + b</code> will set <var>a</var> to <code>5</code> (the previous value of <var>b</var>) and <var>b</var> to <code>8</code> (the sum of the previous values of <var>a</var> and <var>b</var>).
</ol>
<aside>&#8220;yield&#8221; pauses a function. &#8220;next()&#8221; resumes where it left off.</aside>
<p>So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with <code>for</code> loops.
<pre class=screen>
@@ -351,7 +355,7 @@ def plural(noun):
<li>Each time through the <code>for</code> loop, <var>n</var> gets a new value from the <code>yield</code> statement in <code>fib()</code>, and all you have to do is print it out. Once <code>fib()</code> runs out of numbers (<var>a</var> becomes bigger than <var>max</var>, which in this case is <code>1000</code>), then the <code>for</code> loop exits gracefully.
</ol>
<h3 id=a-plural-rule-generator>A plural rule generator</h3>
<h3 id=a-plural-rule-generator>A Plural Rule Generator</h3>
<p>Let&#8217;s go back to <code>plural5.py</code> and see how this version of the <code>plural()</code> function works.
@@ -381,7 +385,7 @@ def plural(noun):
<p>In truth, generators are special case of <i>iterators</i>. A function that <code>yield</code>s values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that.
<h3 id=a-fibonacci-iterator>A Fibonacci iterator</h3>
<h3 id=a-fibonacci-iterator>A Fibonacci Iterator</h3>
<p>Remember <a href=a-fibonacci-generator>the Fibonacci generator</a>? Here it is as a built-from-scratch iterator:
@@ -428,9 +432,10 @@ def plural(noun):
<li>How does the <code>for</code> loop know when to stop? I&#8217;m glad you asked! When <code>next(fib_iter)</code> raises a <code>StopIteration</code> exception, the <code>for</code> loop will swallow the exception and gracefully exit. (Any other exception will pass through and be raised as usual.) And where have you seen a <code>StopIteration</code> exception? In the <code>__next__()</code> method, of course!
</ul>
<h3 id=a-plural-rule-iterator>A plural rule iterator</h3>
<h3 id=a-plural-rule-iterator>A Plural Rule Iterator</h3>
<p>Now it&#8217;s time for the finale&hellip;
<aside>iter(f) calls f.__iter__<br>next(f) calls f.__next__</aside>
<p>Now it&#8217;s time for the finale.
<p class=d>[<a href=examples/plural6.py>download <code>plural6.py</code></a>]
<pre><code>class LazyRules:
@@ -558,7 +563,7 @@ rules = LazyRules()</code></pre>
<li><strong>Separation of code and data.</strong> All the patterns are stored in a separate file. Code is code, and data is data, and never the twain shall meet.
</ol>
<h2 id=furtherreading>Further reading</h2>
<h2 id=furtherreading>Further Reading</h2>
<ul>
<li><a href=http://www.python.org/dev/peps/pep-0234/>PEP 234: Iterators</a>
<li><a href=http://www.python.org/dev/peps/pep-0255/>PEP 255: Simple Generators</a>
+22 -17
View File
@@ -14,7 +14,7 @@ body{counter-reset:h1 2}
<p><span>&#x275D;</span> Wonder is the foundation of all philosophy, inquiry its progress, ignorance its end. <span>&#x275E;</span><br>&mdash; Michel de Montaigne
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<p class=f>Cast aside <a href=your-first-python-program.html>your first Python program</a> for just a minute, and let's talk about datatypes. In Python, <a href=your-first-python-program.html#datatypes>every variable has a datatype</a>, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps tracks of that internally.
<p>Python has many native datatypes. Here are the important ones:
<ol>
@@ -29,6 +29,7 @@ body{counter-reset:h1 2}
<p>Of course, there are a lot more types than these seven. <a href=your-first-python-program.html#everythingisanobject>Everything is an object</a> in Python, so there are types like <i>module</i>, <i>function</i>, <i>class</i>, <i>method</i>, <i>file</i>, and even <i>compiled code</i>. You've already seen some of these: <a href=your-first-python-program.html#runningscripts>modules have names</a>, <a href=your-first-python-program.html#docstrings>functions have <code>docstrings</code></a>, <i class=baa>&amp;</i>c. You'll learn about classes in [FIXME xref] and files in [FIXME xref].
<p>Strings and bytes are important enough &mdash; and complicated enough &mdash; that they get their own chapter. Let's look at the others first.
<h2 id=booleans>Booleans</h2>
<aside>You can use virtually any expression in a boolean context.</aside>
<p>Booleans are either true or false. Python has two constants, <code>True</code> and <code>False</code>, which can be used to assign boolean values directly. Expressions can also evaluate to a boolean value. In certain places (like <code>if</code> statements), Python expects an expression to evaluate to a boolean value. These places are called <i>boolean contexts</i>. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.)
<p>For example, take this snippet from <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
<pre><code>if size &lt; 0:
@@ -60,7 +61,7 @@ body{counter-reset:h1 2}
<li>Adding an <code>int</code> to an <code>int</code> yields an <code>int</code>.
<li>Adding an <code>int</code> to a <code>float</code> yields a <code>float</code>. Python coerces the <code>int</code> into a <code>float</code> to perform the addition, then returns a <code>float</code> as the result.
</ol>
<h3 id=number-coercion>Coercing integers to floats and vice-versa</h3>
<h3 id=number-coercion>Coercing Integers To Floats And Vice-Versa</h3>
<p>As you just saw, some operators (like addition) will coerce integers to floating point numbers as needed. You can also coerce them by yourself.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>float(2)</kbd> <span>&#x2460;</span></a>
@@ -86,7 +87,7 @@ body{counter-reset:h1 2}
<blockquote class="note compare python2">
<p><span>&#x261E;</span>Python 2 had separate types for <code>int</code> and <code>long</code>. The <code>int</code> datatype was limited by <code>sys.maxint</code>, which varied by platform but was usually <code>2<sup>32</sup>-1</code>. Python 3 has just one integer type, which behaves mostly like the old <code>long</code> type from Python 2. See <a href=http://www.python.org/dev/peps/pep-0237><abbr>PEP</abbr> 237</a> for details.
</blockquote>
<h3 id=common-numerical-operations>Common numerical operations</h3>
<h3 id=common-numerical-operations>Common Numerical Operations</h3>
<p>You can do all kinds of things with numbers.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>11 / 2</kbd> <span>&#x2460;</span></a>
@@ -145,7 +146,8 @@ body{counter-reset:h1 2}
<li>The <code>math</code> module has all the basic trigonometric functions, including <code>sin()</code>, <code>cos()</code>, <code>tan()</code>, and variants like <code>asin()</code>.
<li>Note, however, that Python does not have infinite precision. <code>tan(&pi; / 4)</code> should return <code>1.0</code>, not <code>0.99999999999999989</code>.
</ol>
<h3 id=numbers-in-a-boolean-context>Numbers in a boolean context</h3>
<h3 id=numbers-in-a-boolean-context>Numbers In A Boolean Context</h3>
<aside>Zero values are false, and non-zero values are true.</aside>
<p>You can use numbers <a href="#booleans">in a boolean context</a>, such as an <code>if</code> statement. Zero values are false, and non-zero values are true.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd> <span>&#x2460;</span></a>
@@ -183,7 +185,7 @@ body{counter-reset:h1 2}
<blockquote class="note compare java">
<p><span>&#x261E;</span>A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the <code>ArrayList</code> class, which can hold arbitrary objects and can expand dynamically as new items are added.
</blockquote>
<h3 id=creatinglists>Creating a list</h3>
<h3 id=creatinglists>Creating A List</h3>
<p>Creating a list is easy: use square brackets to wrap a comma-separated list of values.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>a_list = ['a', 'b', 'mpilgrim', 'z', 'example']</kbd> <span>&#x2460;</span></a>
@@ -204,7 +206,8 @@ body{counter-reset:h1 2}
<li>A negative index accesses items from the end of the list counting backwards. The last item of any non-empty list is always <code>a_list[-1]</code>.
<li>If the negative index is confusing to you, think of it this way: <code>a_list[-<var>n</var>] == a_list[len(a_list) - <var>n</var>]</code>. So in this list, <code>a_list[-3] == a_list[5 - 3] == a_list[2]</code>.
</ol>
<h3 id=slicinglists>Slicing a list</h3>
<h3 id=slicinglists>Slicing A List</h3>
<aside>a_list[0] is the first item of a_list.</aside>
<p>Once you've defined a list, you can get any part of it as a new list. This is called <i>slicing</i> the list.
<pre class=screen>
<samp class=p>>>> </samp><kbd>a_list</kbd>
@@ -229,7 +232,7 @@ body{counter-reset:h1 2}
<li>Similarly, if the right slice index is the length of the list, you can leave it out. So <code>a_list[3:]</code> is the same as <code>a_list[3:5]</code>, because this list has five items. There is a pleasing symmetry here. In this five-item list, <code>a_list[:3]</code> returns the first 3 items, and <code>a_list[3:]</code> returns the last two items. In fact, <code>a_list[:<var>n</var>]</code> will always return the first <var>n</var> items, and <code>a_list[<var>n</var>:]</code> will return the rest, regardless of the length of the list.
<li>If both slice indices are left out, all items of the list are included. But this is not the same as the original <var>a_list</var> variable. It is a new list that happens to have all the same items. <code>a_list[:]</code> is shorthand for making a complete copy of a list.
</ol>
<h3 id=extendinglists>Adding items to a list</h3>
<h3 id=extendinglists>Adding Items To A List</h3>
<p>There are four ways to add items to a list.
<pre class=screen>
<samp class=p>>>> </samp><kbd>a_list = ['a']</kbd>
@@ -265,7 +268,7 @@ body{counter-reset:h1 2}
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'c', 'd', 'e', 'f', ['g', 'h', 'i']]</samp>
<a><samp class=p>>>> </samp><kbd>len(a_list)</kbd> <span>&#x2463;</span></a>
<samp>4</samp>
<samp>7</samp>
<samp class=p>>>> </samp><kbd>a_list[-1]</kbd>
<samp>['g', 'h', 'i']</samp></pre>
<ol>
@@ -274,7 +277,7 @@ body{counter-reset:h1 2}
<li>On the other hand, the <code>append()</code> method takes any number of arguments, each of which can be any datatype. Here, you're calling the <code>append()</code> method with a single argument, a list of three items.
<li>If you start with a list of six items and append a list onto it, you end up with... a list of seven items. Why seven? Because the last item (which you just appended) <em>is itself a list</em>. Lists can contain any type of data, including other lists. That may be what you want, or it may not. But it's what you asked for, and it's what you got.
</ol>
<h3 id=searchinglists>Searching for values in a list</h3>
<h3 id=searchinglists>Searching For Values In A List</h3>
<pre class=screen>
<samp class=p>>>> </samp><kbd>a_list = ['a', 'b', 'new', 'mpilgrim', 'new']</kbd>
<a><samp class=p>>>> </samp><kbd>'mpilgrim' in a_list</kbd> <span>&#x2460;</span></a>
@@ -296,7 +299,8 @@ ValueError: list.index(x): x not in list</samp></pre>
<li>The <code>index()</code> method finds the <em>first</em> occurrence of a value in the list. In this case, <code>'new'</code> occurs twice in the list, in <code>a_list[2]</code> and <code>a_list[4]</code>, but the <code>index()</code> method will return only the index of the first occurrence.
<li>As you might <em>not</em> expect, if the value is not found in the list, Python raises an exception. This is notably different from most languages, which will return some invalid index (like <code>-1</code>). While this may seem annoying at first, I think you will come to appreciate it. It means your program will crash at the source of the problem instead of failing strangely and silently later.
</ol>
<h3 id=lists-in-a-boolean-context>Lists in a boolean context</h3>
<h3 id=lists-in-a-boolean-context>Lists In A Boolean Context</h3>
<aside>Empty lists are false; all other lists are true.</aside>
<p>You can also use a list in <a href=#booleans>a boolean context</a>, such as an <code>if</code> statement.
<pre class=screen>
<samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd>
@@ -322,7 +326,7 @@ ValueError: list.index(x): x not in list</samp></pre>
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>A dictionary in Python is like a hash in Perl 5. In Perl 5, variables that store hashes always start with a <code>%</code> character. In Python, variables can be named anything, and Python keeps track of the datatype internally.
</blockquote>
<h3 id=creating-dictionaries>Creating a dictionary</h3>
<h3 id=creating-dictionaries>Creating A Dictionary</h3>
<p>Creating a dictionary is easy. The syntax is similar to <a href=#sets>sets</a>, but instead of values, you have key-value pairs. Once you have a dictionary, you can look up values by their key.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>a_dict = {"server":"db.diveintopython3.org", "database":"mysql"}</kbd> <span>&#x2460;</span></a>
@@ -342,7 +346,7 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li><code>'database'</code> is a key, and its associated value, referenced by <code>a_dict["database"]</code>, is <code>'mysql'</code>.
<li>You can get values by key, but you can't get keys by value. So <code>a_dict["server"]</code> is <code>'db.diveintopython3.org'</code>, but <code>a_dict["db.diveintopython3.org"]</code> raises an exception, because <code>'db.diveintopython3.org'</code> is not a key.
</ol>
<h3 id=modifying-dictionaries>Modifying a dictionary</h3>
<h3 id=modifying-dictionaries>Modifying A Dictionary</h3>
<p>Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any time, or you can modify the value of an existing key. Continuing from the previous example:
<pre class=screen>
<samp class=p>>>> </samp><kbd>a_dict</kbd>
@@ -366,7 +370,7 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li>Assigning a value to an existing dictionary key simply replaces the old value with the new one.
<li>Will this change the value of the <code>user</code> key back to "mark"? No! Look at the key closely &mdash; that's a capital <kbd>U</kbd> in <kbd>"User"</kbd>. Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it's completely different.
</ol>
<h3 id=mixed-value-dictionaries>Mixed-value dictionaries</h3>
<h3 id=mixed-value-dictionaries>Mixed-Value Dictionaries</h3>
<p>Dictionaries aren't just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don't all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.
<p>In fact, you've already seen a dictionary with non-string keys and values, in <a href=your-first-python-program.html#divingin>your first Python program</a>.
<pre><code>SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),
@@ -389,8 +393,9 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li>Similarly, <code>1024</code> is a key in the <code>SUFFIXES</code> dictionary; its value is also a list of eight items.
<li>Since <code>SUFFIXES[1000]</code> is a list, you can address individual items in the list by their 0-based index.
</ol>
<h3 id=dictionaries-in-a-boolean-context>Dictionaries in a boolean context</h3>
<p>You can also use a list in <a href=#booleans>a boolean context</a>, such as an <code>if</code> statement.
<h3 id=dictionaries-in-a-boolean-context>Dictionaries In A Boolean Context</h3>
<aside>Empty dictionaries are false; all other dictionaries are true.</aside>
<p>You can also use a dictionary in <a href=#booleans>a boolean context</a>, such as an <code>if</code> statement.
<pre class=screen>
<samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd>
<samp class=p>... </samp><kbd> if anything:</kbd>
@@ -427,7 +432,7 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<samp class=p>>>> </samp><kbd>x == y</kbd>
<samp>True</samp>
</pre>
<h3 id=none-in-a-boolean-context><code>None</code> in a boolean context</h3>
<h3 id=none-in-a-boolean-context><code>None</code> In A Boolean Context</h3>
<p>In <a href=#booleans>a boolean context</a>, <code>None</code> is false and <code>not None</code> is true.
<pre class=screen>
<samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd>
@@ -440,7 +445,7 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<samp>no, it's false</samp>
<samp class=p>>>> </samp><kbd>is_it_true(not None)</kbd>
<samp>yes, it's true</samp></pre>
<h2 id=furtherreading>Further reading</h2>
<h2 id=furtherreading>Further Reading</h2>
<ul>
<li><a href="http://docs.python.org/dev/3.0/library/fractions.html">The <code>fractions</code> module</a>
<li><a href="http://docs.python.org/dev/3.0/library/math.html">The <code>math</code> module</a>
+13 -8
View File
@@ -14,14 +14,14 @@ body{counter-reset:h1 4}
<p><span>&#x275D;</span> Some people, when confronted with a problem, think &#8220;I know, I&#8217;ll use regular expressions.&#8221; Now they have two problems. <span>&#x275E;</span><br>&mdash; <a href=http://www.jwz.org/hacks/marginal.html>Jamie Zawinski</a>
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<p class=f>Every modern programming language has built-in functions for working with strings. In Python, strings have methods for searching and replacing: <code>index()</code>, <code>find()</code>, <code>split()</code>, <code>count()</code>, <code>replace()</code>, <i class=baa>&amp;</i>c. But these methods are limited to the simplest of cases. For example, the <code>index()</code> method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string <var>s</var>, you must call <code>s.lower()</code> or <code>s.upper()</code> and make sure your search strings are the appropriate case to match. The <code>replace()</code> and <code>split()</code> methods have the same limitations.
<p>If your goal can be accomplished with string methods, you should use them. They&#8217;re fast and simple and easy to read, and there&#8217;s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with <code>if</code> statements to handle special cases, or if you&#8217;re chaining calls to <code>split()</code> and <code>join()</code> to slice-and-dice your strings, you may need to move up to regular expressions.
<p>Regular expressions are a powerful and (mostly) standardized way of searching, replacing, and parsing text with complex patterns of characters. Although the regular expression syntax is tight and unlike normal code, the result can end up being <em>more</em> readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions, so you can include fine-grained documentation within them.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>If you&#8217;ve used regular expressions in other languages (like Perl 5), Python&#8217;s syntax will be very familiar. Read the summary of the <a href=http://docs.python.org/dev/library/re.html#module-contents><code>re</code> module</a> to get an overview of the available functions and their arguments.
</blockquote>
<h2 id=streetaddresses>Case study: street addresses</h2>
<h2 id=streetaddresses>Case Study: Street Addresses</h2>
<p>This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don&#8217;t just make this stuff up; it&#8217;s actually useful.) This example shows how I approached the problem.
<pre class=screen>
<samp class=p>>>> </samp><kbd>s = '100 NORTH MAIN ROAD'</kbd>
@@ -42,6 +42,7 @@ body{counter-reset:h1 4}
<li>It&#8217;s time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the <code>re</code> module.
<li>Take a look at the first parameter: <code>'ROAD$'</code>. This is a simple regular expression that matches <code>'ROAD'</code> only when it occurs at the end of a string. The <code>$</code> means &#8220;end of the string.&#8221; (There is a corresponding character, the caret <code>^</code>, which means &#8220;beginning of the string.&#8221;) Using the <code>re.sub</code> function, you search the string <var>s</var> for the regular expression <code>'ROAD$'</code> and replace it with <code>'RD.'</code>. This matches the <code>ROAD</code> at the end of the string <var>s</var>, but does <em>not</em> match the <code>ROAD</code> that&#8217;s part of the word <code>BROAD</code>, because that&#8217;s in the middle of <var>s</var>.
</ol>
<aside>^ matches the start of a string. $ matches the end of a string.</aside>
<p>Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching <code>'ROAD'</code> at the end of the address, was not good enough, because not all addresses included a street designation at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the street name was <code>'BROAD'</code>, then the regular expression would match <code>'ROAD'</code> at the end of the string as part of the word <code>'BROAD'</code>, which is not what I wanted.
<pre class=screen>
<samp class=p>>>> </samp><kbd>s = '100 BROAD'</kbd>
@@ -62,7 +63,7 @@ body{counter-reset:h1 4}
<li><em>*sigh*</em> Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word <code>'ROAD'</code> as a whole word by itself, but it wasn&#8217;t at the end, because the address had an apartment number after the street designation. Because <code>'ROAD'</code> isn&#8217;t at the very end of the string, it doesn&#8217;t match, so the entire call to <code>re.sub</code> ends up replacing nothing at all, and you get the original string back, which is not what you want.
<li>To solve this problem, I removed the <code>$</code> character and added another <code>\b</code>. Now the regular expression reads &#8220;match <code>'ROAD'</code> when it&#8217;s a whole word by itself anywhere in the string,&#8221; whether at the end, the beginning, or somewhere in the middle.
</ol>
<h2 id=romannumerals>Case study: Roman numerals</h2>
<h2 id=romannumerals>Case Study: Roman Numerals</h2>
<p>You&#8217;ve most likely seen Roman numerals, even if you didn&#8217;t recognize them. You may have seen them in copyrights of old movies and television shows (&#8220;Copyright <code>MCMXLVI</code>&#8221; instead of &#8220;Copyright <code>1946</code>&#8221;), or on the dedication walls of libraries or universities (&#8220;established <code>MDCCCLXXXVIII</code>&#8221; instead of &#8220;established <code>1888</code>&#8221;). You may also have seen them in outlines and bibliographical references. It&#8217;s a system of representing numbers that really does date back to the ancient Roman empire (hence the name).
<p>In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
<ul>
@@ -82,7 +83,7 @@ body{counter-reset:h1 4}
<li>The fives characters can not be repeated. The number <code>10</code> is always represented as <code>X</code>, never as <code>VV</code>. The number <code>100</code> is always <code>C</code>, never <code>LL</code>.
<li>Roman numerals are always written highest to lowest, and read left to right, so the order the of characters matters very much. <code>DC</code> is <code>600</code>; <code>CD</code> is a completely different number (<code>400</code>, <code>100</code> less than <code>500</code>). <code>CI</code> is <code>101</code>; <code>IC</code> is not even a valid Roman numeral (because you can&#8217;t subtract <code>1</code> directly from <code>100</code>; you would need to write it as <code>XCIX</code>, for <code>10</code> less than <code>100</code>, then <code>1</code> less than <code>10</code>).
</ul>
<h3 id=thousands>Checking for thousands</h3>
<h3 id=thousands>Checking For Thousands</h3>
<p>What would it take to validate that an arbitrary string is a valid Roman numeral? Let&#8217;s take it one digit at a time. Since Roman numerals are always written highest to lowest, let&#8217;s start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of <code>M</code> characters.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import re</kbd>
@@ -104,7 +105,8 @@ body{counter-reset:h1 4}
<li><code>'MMMM'</code> does not match. All three <code>M</code> characters match, but then the regular expression insists on the string ending (because of the <code>$</code> character), and the string doesn&#8217;t end yet (because of the fourth <code>M</code>). So <code>search()</code> returns <code>None</code>.
<li>Interestingly, an empty string also matches this regular expression, since all the <code>M</code> characters are optional.
</ol>
<h3 id=hundreds>Checking for hundreds</h3>
<h3 id=hundreds>Checking For Hundreds</h3>
<aside>? makes a pattern optional.</aside>
<p>The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed, depending on its value.
<ul>
<li><code>100 = C</code>
@@ -150,7 +152,8 @@ body{counter-reset:h1 4}
<li>Interestingly, an empty string still matches this pattern, because all the <code>M</code> characters are optional and ignored, and the empty string matches the <code>D?C?C?C?</code> pattern where all the characters are optional and ignored.
</ol>
<p>Whew! See how quickly regular expressions can get nasty? And you&#8217;ve only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they&#8217;re exactly the same pattern. But let&#8217;s look at another way to express the pattern.
<h2 id=nmsyntax>Using the <code>{n,m}</code> Syntax</h2>
<h2 id=nmsyntax>Using The <code>{n,m}</code> Syntax</h2>
<aside>{1,4} matches between 1 and 4 occurrences of a pattern.</aside>
<p>In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import re</kbd>
@@ -188,7 +191,7 @@ body{counter-reset:h1 4}
<li>This matches the start of the string, then three <code>M</code> out of a possible three, then the end of the string.
<li>This matches the start of the string, then three <code>M</code> out of a possible three, but then <em>does not match</em> the end of the string. The regular expression allows for up to only three <code>M</code> characters before the end of the string, but you have four, so the pattern does not match and returns <code>None</code>.
</ol>
<h3 id=tensandones>Checking for tens and ones</h3>
<h3 id=tensandones>Checking For Tens And Ones</h3>
<p>Now let&#8217;s expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
<pre class=screen>
<samp class=p>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'</kbd>
@@ -209,6 +212,7 @@ body{counter-reset:h1 4}
<li>This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then the optional <code>L</code> and all three optional <code>X</code> characters, then the end of the string. <code>MCMLXXX</code> is the Roman numeral representation of <code>1980</code>.
<li>This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then the optional <code>L</code> and all three optional <code>X</code> characters, then <em>fails to match</em> the end of the string because there is still one more <code>X</code> unaccounted for. So the entire pattern fails to match, and returns <code>None</code>. <code>MCMLXXXX</code> is not a valid Roman numeral.
</ol>
<aside>(A|B) matches either pattern A or pattern B.</aside>
<p>The expression for the ones place follows the same pattern. I&#8217;ll spare you the details and show you the end result.
<pre class=screen>
<samp class=p>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'</kbd>
@@ -264,7 +268,8 @@ body{counter-reset:h1 4}
<li>This matches the start of the string, then four of a possible four <code>M</code>, then <code>D</code> and three of a possible three <code>C</code>, then <code>L</code> and three of a possible three <code>X</code>, then <code>V</code> and three of a possible three <code>I</code>, then the end of the string.
<li>This does not match. Why? Because it doesn&#8217;t have the <code>re.VERBOSE</code> flag, so the <code>re.search</code> function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can&#8217;t auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.
</ol>
<h2 id=phonenumbers>Case study: parsing phone numbers</h2>
<h2 id=phonenumbers>Case study: Parsing Phone Numbers</h2>
<aside>\d matches any numeric digit (0&ndash;9). \D matches anything but digits.</aside>
<p>So far you&#8217;ve concentrated on matching whole patterns. Either the pattern matches, or it doesn&#8217;t. But regular expressions are much more powerful than that. When a regular expression <em>does</em> match, you can pick out specific pieces of it. You can find out what matched where.
<p>This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company&#8217;s database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.
<p>Here are the phone numbers I needed to be able to accept:
+17 -11
View File
@@ -16,13 +16,15 @@ body{counter-reset:h1 3}
My alphabet starts where your alphabet ends! <span>&#x275E;</span><br>&mdash; Dr. Seuss, On Beyond Zebra!
</blockquote>
<p id=toc>&nbsp;
<h2 id=boring-stuff>Some boring stuff you need to understand before you can dive in</h2>
<h2 id=boring-stuff>Some Boring Stuff You Need To Understand Before You Can Dive In</h2>
<p class=f>Did you know that the people of <a href="http://en.wikipedia.org/wiki/Bougainville_Province">Bougainville</a> have the smallest alphabet in the world? Their <a href="http://en.wikipedia.org/wiki/Rotokas_alphabet">Rotokas alphabet</a> is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters &mdash; 52 if you count uppercase and lowercase separately &mdash; plus a handful of <i class=baa>!@#$%&</i> punctuation marks.
<p>When people talk about &#8220;text,&#8221; they&#8217;re thinking of &#8220;characters and symbols on the computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <i>character encoding</i>. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes &mdash; a file, a web page, whatever &mdash; and claims it&#8217;s &#8220;text,&#8221; you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you&#8217;re left with the unenviable task of cracking the code yourself. Chances are you&#8217;ll get it wrong, and the result will be gibberish.
<aside>Everything you thought you knew about strings is wrong.</aside>
<p>Surely you&#8217;ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn&#8217;t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it&#8217;s merely annoying; in other languages, the result can be completely unreadable.
<p>There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding using the same numbers (0&ndash;255) to represent that language&#8217;s characters. For instance, you&#8217;re probably familiar with the <abbr>ASCII</abbr> encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital &#8220;A&#8221;, 97 is lowercase &#8220;a&#8221;, <i class=baa>&amp;</i>c.) English has a very simple alphabet, so it can be completely expressed in less than 128 numbers. For those of you who can count in base 2, that&#8217;s 7 out of the 8 bits in a byte.
@@ -97,8 +99,9 @@ La Pe&ntilde;a</pre>
</ol>
</div>
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<aside>Strings can be defined with either single or double quotes.</aside>
<p>Let's take another look at <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
@@ -135,7 +138,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li>There's a&hellip; whoa, what the heck is that?
</ol>
<h2 id=formatting-strings>Formatting strings</h2>
<h2 id=formatting-strings>Formatting Strings</h2>
<p>Python 3 supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with single placeholder.
@@ -149,7 +152,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li>There's a lot going on here. First, that's a method call on a string literal. <em>Strings are objects</em>, and objects have methods. Second, the whole expression evaluates to a string. Third, <code>{0}</code> and <code>{1}</code> are <i>replacement fields</i>, which are replaced by the arguments passed to the <code>format()</code> method.
</ol>
<h3 id=compound-field-names>Compound field names</h3>
<h3 id=compound-field-names>Compound Field Names</h3>
<p>The previous example shows the simplest case, where the replacement fields are simply integers. Integer replacement fields are treated as positional indices into the argument list of the <code>format()</code> method. That means that <code>{0}</code> is replaced by the first argument (<var>username</var> in this case), <code>{1}</code> is replaced by the second argument (<var>password</var>), <i class=baa>&amp;</i>c. You can have as many positional indices as you have arguments, and you can have as many arguments as you want. But replacement fields are much more powerful than that.
@@ -166,6 +169,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li>This looks complicated, but it's not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list which is the first argument passed to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces &mdash; including <code>1000</code>, the equals sign, and the spaces &mdash; is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
</ol>
<aside>{0} is replaced by the 1<sup>st</sup> format() argument. {1} is replaced by the 2<sup>nd</sup>.</aside>
<p>What this example shows is that <em>format specifers can access items and properties of data structures using (almost) Python syntax</em>. This is called <i>compound field names</i>. The following compound field names "just work":
<ul>
@@ -195,7 +199,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<lI><code>sys.modules["humansize"].SUFFIXES[1000][0]</code> is the first item of the list of <abbr>SI</abbr> suffixes: <code>'KB'</code>. Therefore, the complete replacement field <code>{0.modules[humansize].SUFFIXES[1000][0]}</code> is replaced by the two-character string <code>KB</code>.
</ul>
<h3 id=format-specifiers>Format specifiers</h3>
<h3 id=format-specifiers>Format Specifiers</h3>
<p>But wait! There's more! Let's take another look at that strange line of code from <code>humansize.py</code>:
@@ -216,7 +220,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<p>For all the gory details on format specifiers, consult the <a href="http://docs.python.org/dev/3.0/library/string.html#format-specification-mini-language">Format Specification Mini-Language</a> in the official Python documentation.
<h2 id=common-string-methods>Other common string methods</h2>
<h2 id=common-string-methods>Other Common String Methods</h2>
<p>Besides formatting, strings can do a number of other useful tricks.
@@ -241,7 +245,7 @@ experience of years.</samp>
<li>You can input multi-line strings in the Python interactive shell. Once you start a multi-line string with triple quotation marks, just hit <kbd>ENTER</kbd> and the interactive shell will prompt you to continue the string. Typing the closing triple quotation marks ends the string, and the next <kbd>ENTER</kbd> will execute the command (in this case, assigning the string to <var>s</var>).
<li>The <code>splitlines()</code> method takes one multi-line string and returns a list of strings, one for each line of the original. Note that the carriage returns at the end of each line are not included.
<li>The <code>lower()</code> method converts the entire string to lowercase. (Similarly, the <code>upper()</code> method converts a string to uppercase.)
<li>the <code>count()</code> method counts the number of occurrences of a substring. Yes, there really are six &#8220;f&#8221;s in that sentence!
<li>The <code>count()</code> method counts the number of occurrences of a substring. Yes, there really are six &#8220;f&#8221;s in that sentence!
</ol>
<!--
@@ -318,7 +322,7 @@ is an object. You might have thought I meant that string <em>variables</em> are
</div>
-->
<h2 id=string-module>The <code>string</code> module</h2>
<h2 id=string-module>The <code>string</code> Module</h2>
<p>[FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.]
@@ -326,9 +330,11 @@ is an object. You might have thought I meant that string <em>variables</em> are
<p>When I first learned Python, I expected <code>join</code> to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the <code>join</code> method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate <code>string</code> module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like <code>lower</code>, <code>upper</code>, and <code>split</code>. But many hard-core Python programmers objected to the new <code>join</code> method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old <code>string</code> module (which still has a lot of useful stuff in it). I use the new <code>join</code> method exclusively, but you will see code written either way, and if it really bothers you, you can use the old <code>string.join</code> function instead.
</div>
<h2 id=byte-arrays>Strings vs. bytes</h2>
<h2 id=byte-arrays>Strings vs. Bytes</h2>
<h2 id=py-encoding>Character encoding of Python source code</h2>
<p>FIXME
<h2 id=py-encoding>Character Encoding Of Python Source Code</h2>
<p>Python 3 assumes that your source code &mdash; <i>i.e.</i> each <code>.py</code> file &mdash; is encoded in <abbr>UTF-8</abbr>.
@@ -347,7 +353,7 @@ is an object. You might have thought I meant that string <em>variables</em> are
<p>For more information, consult <a href="http://www.python.org/dev/peps/pep-0263/"><abbr>PEP</abbr> 263: Defining Python Source Code Encodings</a>.
<h2 id=furtherreading>Further reading</h2>
<h2 id=furtherreading>Further Reading</h2>
<p>On Unicode in Python:
+7 -4
View File
@@ -14,7 +14,7 @@ body{counter-reset:h1 7}
<p><span>&#x275D;</span> Certitude is not the test of certainty. We have been cocksure of many things that were not so. <span>&#x275E;</span><br>&mdash; <a href=http://en.wikiquote.org/wiki/Oliver_Wendell_Holmes,_Jr.>Oliver Wendell Holmes, Jr.</a>
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>(Not) diving in</h2>
<h2 id=divingin>(Not) Diving In</h2>
<p class=f>How do you know that the code you wrote yesterday still works after the changes you made today? Every seasoned programmer has war stories of an &#8220;innocent&#8221; change that couldn't <em>possibly</em> have affected that other &#8220;unrelated&#8221; module&hellip; If this sounds familiar, this chapter is for you.
<p>In this chapter, you're going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in <a href="regular-expressions.html#romannumerals">&#8220;Case study: roman numerals&#8221;</a>. Now step back and consider what it would take to expand that into a two-way utility.
<p><a href="regular-expressions.html#romannumerals">The rules for Roman numerals</a> lead to a number of interesting observations:
@@ -37,7 +37,8 @@ body{counter-reset:h1 7}
<li>When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code. (&#8220;But <em>sir</em>, all the unit tests passed when I checked it in...&#8221;)
<li>When writing code in a team, it increases confidence that the code you're about to commit isn't going to break someone else's code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn't play well with others.)
</ul>
<h2 id=romantest1>A single question</h2>
<h2 id=romantest1>A Single Question</h2>
<aside>Every test is an island.</aside>
<p>A test case answers a single question about the code it is testing. A test case should be able to...
<ul>
<li>...run completely by itself, without any human input. Unit testing is about automation.
@@ -126,6 +127,7 @@ if __name__ == "__main__":
<li>Here you call the actual <code>to_roman()</code> function. (Well, the function hasn't be written yet, but once it is, this is the line that will call it.) Notice that you have now defined the <abbr>API</abbr> for the <code>to_roman()</code> function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the <abbr>API</abbr> is different than that, this test is considered failed. Also notice that you are not trapping any exceptions when you call <code>to_roman()</code>. This is intentional. <code>to_roman()</code> shouldn't raise an exception when you call it with valid input, and these input values are all valid. If <code>to_roman()</code> raises an exception, this test is considered failed.
<li>Assuming the <code>to_roman()</code> function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the <em>right</em> value. This is a common question, and the <code>TestCase</code> class provides a method, <code>assertEqual</code>, to check whether two values are equal. If the result returned from <code>to_roman()</code> (<var>result</var>) does not match the known value you were expecting (<var>numeral</var>), <code>assertEqual</code> will raise an exception and the test will fail. If the two values are equal, <code>assertEqual</code> will do nothing. If every value returned from <code>to_roman()</code> matches the known value you expect, <code>assertEqual</code> never raises an exception, so <code>testToRomanKnownValues</code> eventually exits normally, which means <code>to_roman()</code> has passed this test.
</ol>
<aside>Write a test that fails, then code until it passes.</aside>
<p>Once you have a test case, you can start coding the <code>to_roman()</code> function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you've written any code, you're doing it wrong &mdash; your tests aren't testing your code at all! Write a test that fails, then code until it passes.
<pre><code># roman1.py
@@ -215,7 +217,8 @@ OK</samp></pre>
<li>Hooray! The <code>to_roman()</code> function passes the &#8220;known values&#8221; test case. It's not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (<code>3999</code>), and the input that produces the longest possible Roman numeral (<code>3888</code>). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
</ol>
<p>&#8220;Good&#8221; input? Hmm. What about bad input?
<h2 id=romantest2>&#8220;Halt and catch fire&#8221;</h2>
<h2 id=romantest2>&#8220;Halt And Catch Fire&#8221;</h2>
<aside>The Pythonic way to halt and catch fire is to raise an exception.</aside>
<p>It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import roman1</kbd>
@@ -326,7 +329,7 @@ OK</samp></pre>
<ol>
<li>Hooray! Both tests pass. Because you worked iteratively, bouncing back and forth between testing and coding, you can be sure that the two lines of code you just wrote were the cause of that one test going from &#8220;fail&#8221; to &#8220;pass.&#8221; That kind of confidence doesn't come cheap, but it will pay for itself over the lifetime of your code.
</ol>
<h2 id=romantest3>More halting, more fire</h2>
<h2 id=romantest3>More Halting, More Fire</h2>
<p>...
<!--
For instance, the <code>testFromRomanCase</code> method (&#8220;<code>from_roman()</code> should only accept uppercase input&#8221;) was an error, because the call to <code>numeral.upper()</code> raised an <code>AttributeError</code> exception, because <code>to_roman()</code> was supposed to return a string but didn't. But <code>testZero</code> (&#8220;<code>to_roman()</code> should fail with 0 input&#8221;) was a failure, because the call to <code>from_roman()</code> did not raise the <code>InvalidRomanNumeral</code> exception that <code>assertRaises</code> was looking for.
+18 -13
View File
@@ -15,7 +15,7 @@ th{font-family:inherit !important}
<p><span>&#x275D;</span> Don&#8217;t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. <span>&#x275E;</span><br>&mdash; <a href=http://en.wikiquote.org/wiki/Buddhism>Ven. Henepola Gunaratana</a>
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<p class=f>Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
<pre><code>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
@@ -57,9 +57,10 @@ if __name__ == "__main__":
<samp>1.0 TB
931.3 GiB</samp></pre>
<p>FIXME: this would be a good place to explain what the program, you know, actually does.
<h2 id=declaringfunctions>Declaring functions</h2>
<h2 id=declaringfunctions>Declaring Functions</h2>
<p>Python has functions like most other languages, but it does not have separate header files like <abbr>C++</abbr> or <code>interface</code>/<code>implementation</code> sections like Pascal. When you need a function, just declare it, like this:
<pre><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):</code></pre>
<aside>When you need a function, just declare it.</aside>
<p>The keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments are separated with commas.
<p>Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.)
<blockquote class=note>
@@ -69,7 +70,7 @@ if __name__ == "__main__":
<blockquote class="note compare java">
<p><span>&#x261E;</span>In Java and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
</blockquote>
<h3 id=datatypes>How Python's datatypes compare to other programming languages</h3>
<h3 id=datatypes>How Python's Datatypes Compare to Other Programming Languages</h3>
<p>An erudite reader sent me this explanation of how Python compares to other programming languages:
<dl>
<dt>statically typed language</dt>
@@ -92,9 +93,9 @@ if __name__ == "__main__":
<tr><th>Weakly typed</th><td>C, Objective-C</td><td>JavaScript, Perl 5, <abbr>PHP</abbr></td></tr>
<tr><th>Strongly typed</th><td>Pascal, Java</td><td>Python, Ruby</td></tr>
</table>
<h2 id=readability>Writing readable code</h2>
<h2 id=readability>Writing Readable Code</h2>
<p>I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months.
<h3 id=docstrings>Documentation strings</h3>
<h3 id=docstrings>Documentation Strings</h3>
<p>You can document a Python function by giving it a documentation string (<code>docstring</code> for short). In this program, the <code>approximate_size</code> function has a <code>docstring</code>:
<pre><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):
"""Convert a file size to human-readable form.
@@ -107,6 +108,7 @@ if __name__ == "__main__":
Returns: string
"""</code></pre>
<aside>Every function deserves a decent docstring.</aside>
<p>Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a <code>docstring</code>.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>Triple quotes are also an easy way to define a string with both single and double quotes, like <code>qq/.../</code> in Perl 5.
@@ -115,11 +117,13 @@ if __name__ == "__main__":
<blockquote class=note>
<p><span>&#x261E;</span>Many Python <abbr>IDE</abbr>s use the <code>docstring</code> to provide context-sensitive documentation, so that when you type a function name, its <code>docstring</code> appears as a tooltip. This can be incredibly helpful, but it's only as good as the <code>docstring</code>s you write.
</blockquote>
<h3 id=functionannotations>Function annotations</h3>
<!--
<h3 id=functionannotations>Function Annotations</h3>
<p>FIXME
<h3 id=styleconventions>Style conventions</h3>
<h3 id=styleconventions>Style Conventions</h3>
<p>FIXME
<h2 id=everythingisanobject>Everything is an object</h2>
-->
<h2 id=everythingisanobject>Everything Is An Object</h2>
<p>In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object.
<p>Run the interactive Python shell and follow along:
<pre class=screen>
@@ -145,7 +149,7 @@ if __name__ == "__main__":
<blockquote class="note compare perl5">
<p><span>&#x261E;</span><code>import</code> in Python is like <code>require</code> in Perl. Once you <code>import</code> a Python module, you access its functions with <code><var>module</var>.<var>function</var></code>; once you <code>require</code> a Perl module, you access its functions with <code><var>module</var>::<var>function</var></code>.
</blockquote>
<h3 id=importsearchpath>The <code>import</code> search path</h3>
<h3 id=importsearchpath>The <code>import</code> Search Path</h3>
<p>Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in <code>sys.path</code>. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.)
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span></a>
@@ -160,11 +164,11 @@ if __name__ == "__main__":
<li>Actually, I lied; the truth is more complicated than that, because not all modules are stored as <code>.py</code> files. Some, like the <code>sys</code> module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The <code>sys</code> module is written in <abbr>C</abbr>.)
<li>You can add a new directory to Python's search path at runtime by appending the directory name to <code>sys.path</code>, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll learn more about <code>append()</code> and other list methods in [FIXME xref-was-#datatypes].)
</ol>
<h3 id=whatsanobject>What's an object?</h3>
<h3 id=whatsanobject>What's An Object?</h3>
<p>Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute <code>__doc__</code>, which returns the <var>docstring</var> defined in the function's source code. The <code>sys</code> module is an object which has (among other things) an attribute called <var>path</var>. And so forth.
<p>Still, this doesn't answer the more fundamental question: what is an object? Different programming languages define &#8220;object&#8221; in different ways. In some, it means that <em>all</em> objects <em>must</em> have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in [FIXME xref-was-#datatypes]), and not all objects are subclassable (more on this in [FIXME xref-was-#fileinfo]). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more in this in [FIXME xref-was-#apihelp]).
<p>This is so important that I'm going to repeat it in case you missed it the first few times: <em>everything in Python is an object</em>. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
<h2 id=indentingcode>Indenting code</h2>
<h2 id=indentingcode>Indenting Code</h2>
<p>Python functions have no explicit <code>begin</code> or <code>end</code>, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (<code>:</code>) and the indentation of the code itself.
<pre><code>
<a>def approximate_size(size, a_kilobyte_is_1024_bytes=True): <span>&#x2460;</span></a>
@@ -189,7 +193,8 @@ if __name__ == "__main__":
<blockquote class="note compare java">
<p><span>&#x261E;</span>Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. <abbr>C++</abbr> and Java use semicolons to separate statements and curly braces to separate code blocks.
</blockquote>
<h2 id=runningscripts>Running scripts</h2>
<h2 id=runningscripts>Running Scripts</h2>
<aside>Everything in Python is an object.</aside>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of <code>humansize.py</code>:
<pre><code>
if __name__ == "__main__":
@@ -208,7 +213,7 @@ if __name__ == "__main__":
<samp class=p>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
<h2 id=furtherreading>Further reading</h2>
<h2 id=furtherreading>Further Reading</h2>
<ul>
<li><a href=http://www.python.org/dev/peps/pep-0257/>PEP 257: Docstring Conventions</a> explains what distinguishes a good <code>docstring</code> from a great <code>docstring</code>.
<li><a href=http://docs.python.org/3.0/tutorial/controlflow.html#documentation-strings>Python Tutorial: Documentation Strings</a> also touches on the subject.