sick, can't sleep, may as well fiddle endlessly

Mark Pilgrim
2009-03-17 03:11:52 -04:00
parent 08be466e7b
commit 77654693cf
15 changed files with 1597 additions and 1246 deletions
+2 -4
@@ -1,18 +1,16 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>About the book - Dive Into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
h1:before{content:""}
</style>
</head>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>&#8227;</span>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>&#8227;</span>
<h1>About the book</h1>
<p>The content of <cite>Dive Into Python 3</cite> is licensed under the <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>Creative Commons Attribution-ShareAlike 3.0 Unported License</a>.
<p>The <code>chardet</code> library referenced in <a href=case-study-porting-chardet-to-python-3.html>Case study: porting <code>chardet</code> to Python 3</a> is licensed under the <abbr title="GNU Lesser General Public License">LGPL</abbr> 2.1 or later. All other example code is licensed under the <abbr>MIT</abbr> license. Full licensing terms are included in each source code file.
+71 -69
@@ -1,19 +1,21 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Case study: porting chardet to Python 3 - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
body{counter-reset:h1 20}
ins,del,mark{line-height:2.154;text-decoration:none;font-style:normal;display:inline-block;width:100%}
ins{background:#9f9}
del{background:#f87}
mark{background:#ff8;font-weight:bold}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#case-study-porting-chardet-to-python-3>Dive Into Python 3</a> <span>&#8227;</span>
<p class=s><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#case-study-porting-chardet-to-python-3>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Case study: porting <code>chardet</code> to Python 3</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Words, words. They&#8217;re all we have to go on. <span>&#x275E;</span><br>&mdash; <cite>Rosencrantz and Guildenstern are Dead</cite>
@@ -49,7 +51,7 @@ body{counter-reset:h1 20}
<li><a href=#summary>Summary</a>
</ol>
<h2 id=divingin>Diving in</h2>
<p class=fancy>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the &#8220;one encoding to rule them all.&#8221; I&#8217;d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
<p class=f>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the &#8220;one encoding to rule them all.&#8221; I&#8217;d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
<p>I&#8217;d also like a pony.
<p>A Unicode pony.
<p>A Unipony, as it were.
@@ -98,8 +100,8 @@ body{counter-reset:h1 20}
<p>We&#8217;re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy &mdash; a function was renamed or moved to a different module &mdash; but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we&#8217;ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you&#8217;ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>The main <code>chardet</code> package is split across several different files, all in the same directory. The <code>2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code>2to3</code> will convert each of the files in turn.
<p id=noscript>[The code examples will be easier to follow if you enable JavaScript, but whatever.]
<p class=skip><a href=#skip2to3output>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w chardet\</kbd>
<p class=s><a href=#skip2to3output>skip over this</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w chardet\</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
@@ -566,8 +568,8 @@ RefactoringTool: chardet\sjisprober.py
RefactoringTool: chardet\universaldetector.py
RefactoringTool: chardet\utf8prober.py</samp></pre>
<p id=skip2to3output>Now run the <code>2to3</code> script on the testing harness, <code>test.py</code>.
<p class=skip><a href=#skip2to3outputtest>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w test.py</kbd>
<p class=s><a href=#skip2to3outputtest>skip over this</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w test.py</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
@@ -602,8 +604,8 @@ RefactoringTool: test.py</samp></pre>
<h2 id=manual>Fixing what <code>2to3</code> can&#8217;t</h2>
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it&#8217;s a good way to test our ported code to make sure there aren&#8217;t any bugs lurking anywhere.
<p class=skip><a href=#skipinvalidsyntax>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skipinvalidsyntax>skip over this</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 1, in &lt;module>
from chardet.universaldetector import UniversalDetector
@@ -612,7 +614,7 @@ RefactoringTool: test.py</samp></pre>
^
SyntaxError: invalid syntax</samp></pre>
<p id=skipinvalidsyntax>Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can&#8217;t use it as a variable name. Let&#8217;s look at <code>constants.py</code> to see where it&#8217;s defined. Here&#8217;s the original version from <code>constants.py</code>, before the <code>2to3</code> script changed it:
<p class=skip><a href=#skipbuiltincode>skip over this</a>
<p class=s><a href=#skipbuiltincode>skip over this</a>
<pre><code>import __builtin__
if not hasattr(__builtin__, 'False'):
False = 0
@@ -629,8 +631,8 @@ else:
<p>Ah, wasn&#8217;t that satisfying? The code is shorter and more readable already.
<h3 id=nomodulenamedconstants>No module named <code>constants</code></h3>
<p>Time to run <code>test.py</code> again and see how far it gets.
<p class=skip><a href=#skipnomodulenamedconstants>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skipnomodulenamedconstants>skip over this</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 1, in &lt;module>
from chardet.universaldetector import UniversalDetector
@@ -649,8 +651,8 @@ import sys</code></pre>
<p>Onward!
<h3 id=namefileisnotdefined>Name <var>'file'</var> is not defined</h3>
<p>And here we go again, running <code>test.py</code> to try to execute our test cases&hellip;</p>
<p class=skip><a href=#skipnamefileisnotdefined>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skipnamefileisnotdefined>skip over this</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 9, in &lt;module>
@@ -662,8 +664,8 @@ NameError: name 'file' is not defined</samp></pre>
<p>And that&#8217;s all I have to say about that.
<h3 id=cantuseastringpattern>Can&#8217;t use a string pattern on a bytes-like object</h3>
<p>Now things are starting to get interesting. And by &#8220;interesting,&#8221; I mean &#8220;confusing as all hell.&#8221;
<p class=skip><a href=#skipcantuseastringpattern>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skipcantuseastringpattern>skip over this</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 10, in &lt;module>
@@ -673,14 +675,14 @@ NameError: name 'file' is not defined</samp></pre>
TypeError: can't use a string pattern on a bytes-like object</samp></pre>
<p id=skipcantuseastringpattern>
<p>To debug this, let&#8217;s see what <var>self._highBitDetector</var> is. It&#8217;s defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:
<p class=skip><a href=#skiphighbitdetectorcode>skip over this</a>
<p class=s><a href=#skiphighbitdetectorcode>skip over this</a>
<pre><code>class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128&ndash;255 (0x80&ndash;0xFF). Wait, that&#8217;s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128&ndash;255.
<p>And therein lies the problem.
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string &mdash; that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string &mdash; again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
<p class=skip><a href=#skipfeedhighbitdetectorcode>skip over this</a>
<p class=s><a href=#skipfeedhighbitdetectorcode>skip over this</a>
<pre><code>def feed(self, aBuf):
.
.
@@ -688,7 +690,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):</code></pre>
<p id=skipfeedhighbitdetectorcode>And what is <var>aBuf</var>? Let&#8217;s backtrack further to a place that calls <code>UniversalDetector.feed()</code>. One place that calls it is the test harness, <code>test.py</code>.
<p class=skip><a href=#skiptestharnessfeedcode>skip over this</a>
<p class=s><a href=#skiptestharnessfeedcode>skip over this</a>
<pre><code>u = UniversalDetector()
.
.
@@ -698,7 +700,7 @@ for line in open(f, 'rb'):
<p id=skiptestharnessfeedcode>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string &mdash; an array of Unicode characters &mdash; according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
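<p>If you want to see the difference between the two file modes for yourself, here is a minimal sketch. The scratch file and its contents are made up for illustration; they are not part of <code>chardet</code> or its test suite.

```python
import os
import tempfile

# Write three raw bytes (a UTF-8 byte order mark) plus a newline to a
# scratch file. The file and its contents are hypothetical.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\xEF\xBB\xBF\n')
    path = tmp.name

try:
    # 'rb': each line comes back exactly as stored, as a bytes object.
    with open(path, 'rb') as f:
        binary_lines = list(f)
    # 'r': each line is decoded into a str (here with an explicit encoding).
    with open(path, 'r', encoding='utf-8') as f:
        text_lines = list(f)
finally:
    os.remove(path)

binary_line_type = type(binary_lines[0])   # bytes
text_line_type = type(text_lines[0])       # str
```

<p>Same file, same loop, two different types. Only the mode flag changed.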
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
<p class=skip><a href=#skip-cant-use-a-string-pattern-solution>skip over this code listing</a>
<p class=s><a href=#skip-cant-use-a-string-pattern-solution>skip over this code listing</a>
<pre><code> class UniversalDetector:
def __init__(self):
<del>- self._highBitDetector = re.compile(r'[\x80-\xFF]')</del>
@@ -709,7 +711,7 @@ for line in open(f, 'rb'):
self._mCharSetProbers = []
self.reset()</code></pre>
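<p>Here is the same rule in isolation, as a small sketch you can run outside of <code>chardet</code>. The sample data and variable names are my own, and the exact wording of the <code>TypeError</code> message may differ between Python versions.

```python
import re

data = b'abc\xEF\xBB\xBFdef'   # a byte array, standing in for aBuf

str_pattern = re.compile(r'[\x80-\xFF]')     # pattern defined with a string
bytes_pattern = re.compile(b'[\x80-\xFF]')   # pattern defined with a byte array

# A string pattern refuses to search a byte array.
try:
    str_pattern.search(data)
    str_pattern_error = None
except TypeError as e:
    str_pattern_error = str(e)

# A byte array pattern searches it without complaint.
match = bytes_pattern.search(data)
first_high_byte_offset = match.start()
```

<p>The only difference between the two patterns is the prefix on the literal, but that prefix decides what kind of data the compiled pattern is allowed to search.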
<p id=skip-case-use-a-string-pattern-solution>Searching the entire codebase for other uses of the <code>re</code> module turns up two more instances, in <code>charsetprober.py</code>. Again, the code is defining regular expressions as strings but executing them on <var>aBuf</var>, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
<p class=skip><a href=#cantconvertbytesobject>skip over this code listing</a>
<p class=s><a href=#cantconvertbytesobject>skip over this code listing</a>
<pre><code> class CharSetProber:
.
.
@@ -726,8 +728,8 @@ for line in open(f, 'rb'):
<h3 id=cantconvertbytesobject>Can't convert <code>'bytes'</code> object to <code>str</code> implicitly</h3>
<p>Curiouser and curiouser&hellip;
<p class=skip><a href=#skipcantconvertbytesobject>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skipcantconvertbytesobject>skip over this</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 10, in &lt;module>
@@ -736,12 +738,12 @@ for line in open(f, 'rb'):
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<p id=skipcantconvertbytesobject>There's an unfortunate clash of coding style and Python interpreter here. The <code>TypeError</code> could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
<p class=skip><a href=#skip-split-conditional>skip over this code listing</a>
<p class=s><a href=#skip-split-conditional>skip over this code listing</a>
<pre><code>elif (self._mInputState == ePureAscii) and \
self._escDetector.search(self._mLastChar + aBuf):</code></pre>
<p id=skip-split-conditional>And re-run the test:</p>
<p class=skip><a href=#skip-cant-convert-bytes-object-2>skip over this command output listing</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skip-cant-convert-bytes-object-2>skip over this command output listing</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 10, in &lt;module>
@@ -751,7 +753,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<p id=skip-over-cant-convert-bytes-object-2>Aha! The problem was not in the first conditional (<code>self._mInputState == ePureAscii</code>) but in the second one. So what could cause a <code>TypeError</code> there? Perhaps you're thinking that the <code>search()</code> method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may <em>crash</em> if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the <code>search()</code> method. So the problem must be in that <code>+</code> operation, as it's trying to construct the value that it will eventually pass to the <code>search()</code> method.
<p>We know from <a href=#cantuseastringpattern>previous debugging</a> that <var>aBuf</var> is a byte array. So what is <code>self._mLastChar</code>? It's an instance variable, defined in the <code>reset()</code> method, which is actually called from the <code>__init__()</code> method.
<p class=skip><a href=#skip-mlastchar-declaration>skip over this code listing</a>
<p class=s><a href=#skip-mlastchar-declaration>skip over this code listing</a>
<pre><code>class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(b'[\x80-\xFF]')
@@ -769,7 +771,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<mark> self._mLastChar = ''</mark></code></pre>
<p id=skip-mlastchar-declaration>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can't concatenate a string to a byte array &mdash; not even a zero-length string.
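<p>You can reproduce this without <code>chardet</code> at all. The following is a minimal sketch with stand-in variable names of my own choosing; it only shows that the <code>+</code> operation fails before any function gets called with the result.

```python
last_char = ''            # a string, like the value assigned in reset()
a_buf = b'\xEF\xBB\xBF'   # a byte array, like aBuf

# Mixing the two types fails, even though the string is empty.
try:
    combined = last_char + a_buf
except TypeError:
    combined = None

# Starting from an empty byte array works.
fixed = b'' + a_buf
```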
<p>So what is <var>self._mLastChar</var> anyway? The answer is in the <code>feed()</code> method, just a few lines down from where the traceback occurred.
<p class=skip><a href=#skip-mlastchar-set>skip over this code listing</a>
<p class=s><a href=#skip-mlastchar-set>skip over this code listing</a>
<pre><code>if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
self._mInputState = eHighbyte
@@ -779,7 +781,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
<p>The calling function calls this <code>feed()</code> method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as <var>aBuf</var>), then stores the last byte in <var>self._mLastChar</var> in case it's needed during the next call. (In a multi-byte encoding, the <code>feed()</code> method might get called with half of a character, then called again with the other half.) But because <var>aBuf</var> is now a byte array instead of a string, <var>self._mLastChar</var> needs to be a byte array as well. Thus:
<p class=skip><a href=#skip-mlastchar-solution>skip over this code listing</a>
<p class=s><a href=#skip-mlastchar-solution>skip over this code listing</a>
<pre><code> def reset(self):
.
.
@@ -787,7 +789,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<del>- self._mLastChar = ''</del>
<ins>+ self._mLastChar = b''</ins></code></pre>
<p id=skip-mlastchar-solution>Searching the entire codebase for <code>"mLastChar"</code> turns up a similar problem in <code>mbcharsetprober.py</code>, but instead of tracking the last character, it tracks the last <em>two</em> characters. The <code>MultiByteCharSetProber</code> class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
<p class=skip><a href=#skip-mbcharsetprober>skip over this code listing</a>
<p class=s><a href=#skip-mbcharsetprober>skip over this code listing</a>
<pre><code>
class MultiByteCharSetProber(CharSetProber):
def __init__(self):
@@ -807,8 +809,8 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<ins>+ self._mLastChar = [0, 0]</ins></code></pre>
<h3 id=unsupportedoperandtypeforplus>Unsupported operand type(s) for +: <code>'int'</code> and <code>'bytes'</code></h3>
<p>I have good news, and I have bad news. The good news is we're making progress&hellip;
<p class=skip><a href=#skip-unsupported-operand-types>skip over this command listing</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skip-unsupported-operand-types>skip over this command listing</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 10, in &lt;module>
@@ -819,7 +821,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
<p id=skip-unsupported-operand-types>&hellip;The bad news is it doesn't always feel like progress.
<p>But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an <code>int</code> with a byte array (<code>bytes</code>). In fact, you just spent a lot of time <a href=#cantconvertbytesobject>ensuring that <var>self._mLastChar</var> was a byte array</a>. How did it turn into an <code>int</code>?
<p>The answer lies not in the previous lines of code, but in the following lines.
<p class=skip><a href=#skip-mlastchar-highlight>skip over this code listing</a>
<p class=s><a href=#skip-mlastchar-highlight>skip over this code listing</a>
<pre><code>if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
self._mInputState = eHighbyte
@@ -829,24 +831,24 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
<p id=skip-mlastchar-highlight>This error doesn't occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
<p class=skip><a href=#skip-mlastchar-interactive>skip over this interpreter listing</a>
<p class=s><a href=#skip-mlastchar-interactive>skip over this interpreter listing</a>
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>len(aBuf)</kbd>
<a><samp class=p>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>len(aBuf)</kbd>
<samp>3</samp>
<samp class=prompt>>>> </samp><kbd>mLastChar = aBuf[-1]</kbd>
<a><samp class=prompt>>>> </samp><kbd>mLastChar</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>mLastChar = aBuf[-1]</kbd>
<a><samp class=p>>>> </samp><kbd>mLastChar</kbd> <span>&#x2461;</span></a>
<samp>191</samp>
<a><samp class=prompt>>>> </samp><kbd>type(mLastChar)</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>type(mLastChar)</kbd> <span>&#x2462;</span></a>
<samp>&lt;class 'int'></samp>
<a><samp class=prompt>>>> </samp><kbd>mLastChar + aBuf</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>mLastChar + aBuf</kbd> <span>&#x2463;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in &lt;module>
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp>
<a><samp class=prompt>>>> </samp><kbd>mLastChar = aBuf[-1:]</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp><kbd>mLastChar</kbd>
<a><samp class=p>>>> </samp><kbd>mLastChar = aBuf[-1:]</kbd> <span>&#x2464;</span></a>
<samp class=p>>>> </samp><kbd>mLastChar</kbd>
<samp>b'\xbf'</samp>
<a><samp class=prompt>>>> </samp><kbd>mLastChar + aBuf</kbd> <span>&#x2465;</span></a>
<a><samp class=p>>>> </samp><kbd>mLastChar + aBuf</kbd> <span>&#x2465;</span></a>
<samp>b'\xbf\xef\xbb\xbf'</samp></pre>
<ol id=skip-mlastchar-interactive>
<li>Define a byte array of length 3.
@@ -864,8 +866,8 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp>
<ins>+ self._mLastChar = aBuf[-1:]</ins></code></pre>
<h3 id=ordexpectedstring><code>ord()</code> expected string of length 1, but <code>int</code> found</h3>
<p>Tired yet? You're almost there&hellip;
<p class=skip><a href=#skip-ord-expected-string>skip over this command output listing</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skip-ord-expected-string>skip over this command output listing</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml</samp>
<samp class=traceback>Traceback (most recent call last):
@@ -881,28 +883,28 @@ tests\Big5\0804.blogspot.com.xml</samp>
byteCls = self._mModel['classTable'][ord(c)]
TypeError: ord() expected string of length 1, but int found</samp></pre>
<p id=skip-ord-expected-string>OK, so <var>c</var> is an <code>int</code>, but the <code>ord()</code> function was expecting a 1-character string. Fair enough. Where is <var>c</var> defined?
<p class=skip><a href=#skip-next-state>skip over this code listing</a>
<p class=s><a href=#skip-next-state>skip over this code listing</a>
<pre><code># codingstatemachine.py
def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
byteCls = self._mModel['classTable'][ord(c)]</code></pre>
<p id=skip-next-state>That's no help; it's just passed into the function. Let's pop the stack.
<p class=skip><a href=#skip-utf8prober-feed>skip over this code listing</a>
<p class=s><a href=#skip-utf8prober-feed>skip over this code listing</a>
<pre><code># utf8prober.py
def feed(self, aBuf):
for c in aBuf:
codingState = self._mCodingSM.next_state(c)</code></pre>
<p id=skip-utf8prober-feed>And now we have the answer. Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That's what you get when you iterate over a string &mdash; all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there's no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
<p>Thus:
<p class=skip><a href=#skip-ordc-diff>skip over this code listing</a>
<p class=s><a href=#skip-ordc-diff>skip over this code listing</a>
<pre><code> def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
<del>- byteCls = self._mModel['classTable'][ord(c)]</del>
<ins>+ byteCls = self._mModel['classTable'][c]</ins></code></pre>
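<p>To convince yourself that iterating over a byte array really does yield integers, here is a small sketch. The <var>class_table</var> list is a hypothetical stand-in for the model's <code>'classTable'</code>, not the real data.

```python
a_buf = b'\xEF\xBB\xBF'

# Iterating over a byte array yields integers, one per byte.
byte_items = [c for c in a_buf]

# Iterating over a string yields 1-character strings.
text_items = [c for c in 'abc']

# Hypothetical lookup table with one entry per possible byte value.
class_table = list(range(256))

# Each integer can index the table directly; no ord() required.
byte_classes = [class_table[c] for c in a_buf]

# Calling ord() on an integer is what raises the TypeError.
try:
    ord(byte_items[0])
    ord_on_int_works = True
except TypeError:
    ord_on_int_works = False
```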
<p>Searching the entire codebase for instances of <code>"ord(c)"</code> uncovers similar problems in <code>sbcharsetprober.py</code>&hellip;
<p class=skip><a href=#skip-sbcharsetprober-code>skip over this code listing</a>
<p class=s><a href=#skip-sbcharsetprober-code>skip over this code listing</a>
<pre><code># sbcharsetprober.py
def feed(self, aBuf):
if not self._mModel['keepEnglishLetter']:
@@ -913,14 +915,14 @@ def feed(self, aBuf):
for c in aBuf:
<mark> order = self._mModel['charToOrderMap'][ord(c)]</mark></code></pre>
<p id=skip-sbcharsetprober-code>&hellip;and <code>latin1prober.py</code>&hellip;
<p class=skip><a href=#skip-latin1prober-code-2>skip over this code listing</a>
<p class=s><a href=#skip-latin1prober-code-2>skip over this code listing</a>
<pre><code># latin1prober.py
def feed(self, aBuf):
aBuf = self.filter_with_english_letters(aBuf)
for c in aBuf:
<mark> charClass = Latin1_CharToClass[ord(c)]</mark></code></pre>
<p id=skip-sbcharsetprober-code-2><var>c</var> is iterating over <var>aBuf</var>, which means it is an integer, not a 1-character string. The solution is the same: change <code>ord(c)</code> to just plain <code>c</code>.
<p class=skip><a href=#unorderabletypes>skip over this code listing</a>
<p class=s><a href=#unorderabletypes>skip over this code listing</a>
<pre><code> # sbcharsetprober.py
def feed(self, aBuf):
if not self._mModel['keepEnglishLetter']:
@@ -941,8 +943,8 @@ def feed(self, aBuf):
</code></pre>
<h3 id=unorderabletypes>Unorderable types: <code>int()</code> >= <code>str()</code></h3>
<p>Let's go again.
<p class=skip><a href=#skip-unorderable-types-screen>skip over this command output listing</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skip-unorderable-types-screen>skip over this command output listing</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml</samp>
<samp>Traceback (most recent call last):
@@ -961,7 +963,7 @@ tests\Big5\0804.blogspot.com.xml</samp>
TypeError: unorderable types: int() >= str()</samp></pre>
<p id=skip-unorderable-types-screen>Did you notice? This time around, the code passed the first test case (<code>tests\ascii\howto.diveintomark.org.xml</code>). You're making real progress here.
<p>So what's this all about? &#8220;Unorderable types&#8221;? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:
<p class=skip><a href=#skip-unorderable-types-1>skip over this code listing</a>
<p class=s><a href=#skip-unorderable-types-1>skip over this code listing</a>
<pre><code>class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
@@ -972,7 +974,7 @@ TypeError: unorderable types: int() >= str()</samp></pre>
else:
charLen = 1</code></pre>
<p id=skip-unorderable-types-1>And where does <var>aStr</var> come from? Let's pop the stack:
<p class=skip><a href=#skip-unorderable-types-2>skip over this code listing</a>
<p class=s><a href=#skip-unorderable-types-2>skip over this code listing</a>
<pre><code>def feed(self, aBuf, aLen):
.
.
@@ -983,7 +985,7 @@ TypeError: unorderable types: int() >= str()</samp></pre>
<p id=skip-unorderable-types-2>Oh look, it's our old friend, <var>aBuf</var>. As you might have guessed from every other issue we've encountered in this chapter, <var>aBuf</var> is a byte array. Here, the <code>feed()</code> method isn't just passing it on wholesale; it's slicing it. But as you saw <a href=#unsupportedoperandtypeforplus>earlier in this chapter</a>, slicing a byte array returns a byte array, so the <var>aStr</var> parameter that gets passed to the <code>get_order()</code> method is still a byte array.
<p>And what is this code trying to do with <var>aStr</var>? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because <var>aStr</var> and <var>aBuf</var> were strings, and <var>aStr[0]</var> would be a string, and you can compare two strings with ordering operators like <code>&gt;=</code> and <code>&lt;=</code>. But in Python 3, <var>aStr</var> and <var>aBuf</var> are byte arrays, <var>aStr[0]</var> is an integer, and you can't use ordering operators between an integer and a string without explicitly coercing one of them.
<p>In this case, there's no need to make the code more complicated by adding an explicit coercion. <var>aStr[0]</var> yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers.
<p class=skip><a href=#skip-unorderable-types-3>skip over this code listing</a>
<p class=s><a href=#skip-unorderable-types-3>skip over this code listing</a>
<pre><code> class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
@@ -1037,8 +1039,8 @@ TypeError: unorderable types: int() >= str()</samp></pre>
return -1, charLen</code></pre>
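<p>A small sketch of the rule at work, assuming a made-up two-byte value (<var>a_buf</var> is illustrative, not <code>chardet</code> test data). Indexing a <code>bytes</code> object gives an integer, slicing gives <code>bytes</code>, and only the integer form of the constants can be compared against the indexed value.

```python
# Illustrative only: `a_buf` is a made-up two-byte sequence.
a_buf = b"\x82\xa0"

first = a_buf[0]      # indexing bytes gives an int
piece = a_buf[0:1]    # slicing bytes gives bytes

# Comparing the int against integer constants works.
in_range = (first >= 0x81) and (first <= 0x9F)

# Comparing the int against a 1-character string raises TypeError.
try:
    first >= "\x81"
    str_compare_failed = False
except TypeError:
    str_compare_failed = True
```

<p>Writing the bounds as <code>0x81</code> and <code>0x9F</code> keeps them visually close to the original <code>'\x81'</code> and <code>'\x9F'</code> string constants.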
<p>Searching the entire codebase for occurrences of the <code>ord()</code> function uncovers the same problem in <code>chardistribution.py</code>:
<p class=skip><a href=#skip-unorderable-types-4>skip over this command output listing</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skip-unorderable-types-4>skip over this command output listing</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml</samp>
<samp class=traceback>Traceback (most recent call last):
@@ -1056,7 +1058,7 @@ tests\Big5\0804.blogspot.com.xml</samp>
if (aStr[0] >= '\x81') and (aStr[0] &lt;= '\x9F'):
TypeError: unorderable types: int() >= str()</samp></pre>
<p id=skip-unorderable-types-4>The fix is the same:
<p class=skip><a href=#reduceisnotdefined>skip over this code listing</a>
<p class=s><a href=#reduceisnotdefined>skip over this code listing</a>
<pre><code> class EUCTWDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
@@ -1163,8 +1165,8 @@ TypeError: unorderable types: int() >= str()</samp></pre>
return -1</code></pre>
<h3 id=reduceisnotdefined>Global name <code>'reduce'</code> is not defined</h3>
<p>Once more into the breach&hellip;
<p class=skip><a href=#skip-reduceisnotdefined-output>skip over this command output listing</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skip-reduceisnotdefined-output>skip over this command output listing</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml</samp>
<samp class=traceback>Traceback (most recent call last):
@@ -1177,14 +1179,14 @@ tests\Big5\0804.blogspot.com.xml</samp>
NameError: global name 'reduce' is not defined</samp></pre>
<p id=skip-reduceisnotdefined-output>According to the official <a href=http://docs.python.org/dev/3.0/whatsnew/3.0.html#builtins>What's New In Python 3.0</a> guide, the <code>reduce()</code> function has been moved out of the global namespace and into the <code>functools</code> module. Quoting the guide: &#8220;Use <code>functools.reduce()</code> if you really need it; however, 99 percent of the time an explicit <code>for</code> loop is more readable.&#8221;
<p>OK then, let's refactor it to use a <code>for</code> loop.
<p class=skip><a href=#skip-reduce-code>skip over this code listing</a>
<p class=s><a href=#skip-reduce-code>skip over this code listing</a>
<pre><code>def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
<mark> total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
<p>The <code>reduce()</code> function takes two arguments &mdash; a function and a list (strictly speaking, any iterable object will do) &mdash; and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. It looks much more readable as a <code>for</code> loop.
<p class=skip><a href=#skip-reduce-refactoring>skip over this code listing</a>
<p class=s><a href=#skip-reduce-refactoring>skip over this code listing</a>
<pre><code> def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
@@ -1194,8 +1196,8 @@ NameError: global name 'reduce' is not defined</samp></pre>
<ins>+ for frequency in self._mFreqCounter:</ins>
<ins>+ total += frequency</ins></code></pre>
<p id=skip-reduce-refactoring>I CAN HAZ TESTZ?
<p class=skip><a href=#skip-final-output>skip over this command output listing</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=s><a href=#skip-final-output>skip over this command output listing</a>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\blog.worren.net.xml Big5 with confidence 0.99
@@ -1239,6 +1241,6 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide
<li><em>You</em> need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
<li>Test cases are essential. Don't port anything without them. Don't even try. The <em>only</em> reason I have any confidence at all that <code>chardet</code> works in Python 3 is that I had a test suite that exercised every line of code in the entire library. I <em>never</em> would have found half of these problems with manual spot-checking.
</ol>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
+723 -723
File diff suppressed because it is too large
+21 -33
@@ -1,16 +1,16 @@
/* typography */
body,.widgets a{font:medium 'Gill Sans','Gill Sans MT',Corbel,Helvetica,Jara,'Nimbus Sans L',sans-serif;line-height:1.75;word-spacing:0.1em}
body,.w a{font:medium 'Gill Sans','Gill Sans MT',Corbel,Helvetica,Jara,'Nimbus Sans L',sans-serif;line-height:1.75;word-spacing:0.1em}
pre,kbd,code,samp{font-family:Consolas,'Andale Mono',Monaco,'Liberation Mono','Bitstream Vera Sans Mono','DejaVu Sans Mono',monospace;font-size:medium;line-height:1.75;word-spacing:0}
span,tr + tr th:first-child{font:medium 'Arial Unicode MS',FreeSerif,OpenSymbol,'DejaVu Sans',sans-serif}
span{font:medium 'Arial Unicode MS',FreeSerif,OpenSymbol,'DejaVu Sans',sans-serif}
pre span{font-family:'Arial Unicode MS','DejaVu Sans',FreeSerif,OpenSymbol,sans-serif}
.baa{font:oblique large Constantia,Baskerville,Palatino,'Palatino Linotype','URW Palladio L',serif}
abbr{font-variant:small-caps;text-transform:lowercase;letter-spacing:0.1em}
.q{margin:auto;text-align:right;font-style:oblique}
.q{text-align:right;font-style:oblique}
.q span{font-size:large}
.note{margin-left:4.94em}
.note span{display:block;float:left;font-size:xx-large;line-height:0.875;margin:0 0.22em 0 -1.22em}
.c,pre,.widgets,.widgets a,.download,ins,del,mark{line-height:2.154}
.fancy:first-letter{float:left;color:#ddd;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
.c,pre,.w,.w a,.download{line-height:2.154}
.f:first-letter{float:left;color:#ddd;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
h1,h2,h3,p,ul,ol{margin:1.75em 0;font-size:medium}
/* basics */
@@ -22,46 +22,34 @@ form div{float:right}
/* links */
a{background:transparent;text-decoration:none;border-bottom:1px dotted}
a:hover{border-bottom:1px solid}
a:link,.widgets a{color:#26c}
a:link,.w a{color:#26c}
a:visited{color:#93c}
.skip a,.skip a:hover,.skip a:visited{position:absolute;left:0px;top:-500px;width:1px;height:1px;overflow:hidden}
.skip a:active,.skip a:focus{position:static;width:auto;height:auto}
/* skip links */
.s a,.s a:hover,.s a:visited{position:absolute;left:0px;top:-500px;width:1px;height:1px;overflow:hidden}
.s a:active,.s a:focus{position:static;width:auto;height:auto}
/* code blocks */
pre{white-space:pre-wrap;padding-left:2.154em;border-left:1px solid #ddd}
.widgets{float:left}
.c,.widgets,.widgets a,.download{font-size:small}
.w{float:left}
.c,.w,.w a,.download{font-size:small}
.block,ol,p,blockquote,h1,h2,h3{clear:left}
pre a,.widgets a{padding:0.4375em 0}
.widgets a{text-decoration:underline}
kbd,mark{font-weight:bold}
.prompt{color:#667}
ins,del,mark{text-decoration:none;font-style:normal;display:inline-block;width:100%}
del{background:#f87}
ins{background:#9f9}
mark{background:#ff8}
pre a,.w a{padding:0.4375em 0}
.w a{text-decoration:underline}
kbd{font-weight:bold}
.p{color:#667}
/* tables */
table{width:100%;border-collapse:collapse}
th,td{width:45%;padding:0 0.5em;border:1px solid #bbb}
th{text-align:left;vertical-align:baseline}
td{vertical-align:top}
th:first-child{width:10%;text-align:center}
.simple th{font-family:inherit !important}
.hover{background:#eee;cursor:default}
/* hover effect for table rows, list items, and lines in code blocks */
.h{background:#eee;cursor:default}
/* overrides */
th,td,td pre,li ol{margin:0}
td pre{padding:0}
pre a,.widgets a,pre a:hover,td pre{border:0}
li ol,.q{margin:0}
pre a,.w a,pre a:hover{border:0}
/* headers */
h1,#noscript{background:PapayaWhip;width:100%}
h1,#noscript{background:PapayaWhip;width:100%} /* all hail PapayaWhip */
h1:before{content:"Chapter " counter(h1) ". "}
h1{counter-reset:h2}
h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
h2{counter-reset:h3}
h3:before{counter-increment:h3;content:counter(h1) "." counter(h2) "." counter(h3) ". "}
/* HTML 5 support */
article,aside,dialog,footer,header,section{display:block}
+9 -9
@@ -11,7 +11,7 @@ $(document).ready(function() {
for (var lang in LANGS) {
$("blockquote.compare").filter("blockquote." + lang).each(function(i) {
$(this).wrapInner('<div class="block"></div>');
$(this).prepend('<div class="widgets">[ <a href="#" onclick="toggleComparisonNotes(\'' + lang + '\');return false" class="toggle">hide ' + LANGS[lang] + ' notes</a> ]</div>');
$(this).prepend('<div class="w">[ <a href="#" onclick="toggleComparisonNotes(\'' + lang + '\');return false" class="toggle">hide ' + LANGS[lang] + ' notes</a> ]</div>');
});
}
*/
@@ -26,10 +26,10 @@ $(document).ready(function() {
$("pre.code, pre.screen").each(function(i) {
this.id = "autopre" + i;
$(this).wrapInner('<div class="block"></div>');
$(this).prepend('<div class="widgets">[<a class="toggle" href="javascript:toggleCodeBlock(\'' + this.id + '\')">' + HS['visible'] + '</a>] [<a href="javascript:plainTextOnClick(\'' + this.id + '\')">open in new window</a>]</div>');
$(this).prepend('<div class="w">[<a class="toggle" href="javascript:toggleCodeBlock(\'' + this.id + '\')">' + HS['visible'] + '</a>] [<a href="javascript:plainTextOnClick(\'' + this.id + '\')">open in new window</a>]</div>');
$(this).prev("p.download").each(function(i) {
$(this).next("pre").find("div.widgets").append(" " + $(this).html());
$(this).next("pre").find("div.w").append(" " + $(this).html());
this.parentNode.removeChild(this);
});
});
@@ -39,8 +39,8 @@ $(document).ready(function() {
$(this).find("a:not([href])").each(function(i) {
var a = $(this);
var li = a.parents("pre").next("ol").find("li:nth-child(" + (i+1) + ")");
li.add(a).hover(function() { a.addClass("hover"); li.addClass("hover"); },
function() { a.removeClass("hover"); li.removeClass("hover"); });
li.add(a).hover(function() { a.addClass("h"); li.addClass("h"); },
function() { a.removeClass("h"); li.removeClass("h"); });
});
});
@@ -50,8 +50,8 @@ $(document).ready(function() {
var tr = $(this);
var li = tr.parents("table").next("ol").find("li:nth-child(" + (i+1) + ")");
if (li.length > 0) {
li.add(tr).hover(function() { tr.addClass("hover"); li.addClass("hover"); },
function() { tr.removeClass("hover"); li.removeClass("hover"); });
li.add(tr).hover(function() { tr.addClass("h"); li.addClass("h"); },
function() { tr.removeClass("h"); li.removeClass("h"); });
}
});
});
@@ -63,7 +63,7 @@ $(document).ready(function() {
function toggleComparisonNotes(lang) {
// FIXME: save state in cookie, pass state to toggle(), reset text accordingly
$("blockquote." + lang + " div.block").toggle(false);
$("blockquote." + lang + " div.widgets a.toggle").text("show " + LANGS[lang] + " notes");
$("blockquote." + lang + " div.w a.toggle").text("show " + LANGS[lang] + " notes");
}
*/
@@ -75,7 +75,7 @@ function toggleCodeBlock(id) {
function plainTextOnClick(id) {
var clone = $("#" + id).clone();
clone.find("div.widgets, span").remove();
clone.find("div.w, span").remove();
var win = window.open("about:blank", "plaintext", "toolbar=0,scrollbars=1,location=0,statusbar=0,menubar=0,resizable=1,width=600,height=400,left=35,top=75");
win.document.open();
win.document.write('<pre>' + clone.html());
+362
@@ -20,3 +20,365 @@ for line in open(input_file).readlines():
else:
out.write(g)
out.close()
# Second pass: re-read the generated file and rewrite each character
# reference below to a shorter equivalent form, then write the result back.
out = open(output_file)
html = out.read()
out.close()
html = html.replace("&aring;", "&#229;")
html = html.replace("&#62;", "&gt;")
html = html.replace("&#x3e;", "&gt;")
html = html.replace("&#8835;", "&sup;")
html = html.replace("&#x2283;", "&sup;")
html = html.replace("&Ntilde;", "&#209;")
html = html.replace("&#x3d2;", "&#978;")
html = html.replace("&upsih;", "&#978;")
html = html.replace("&Yacute;", "&#221;")
html = html.replace("&Atilde;", "&#195;")
html = html.replace("&#x221a;", "&#8730;")
html = html.replace("&#x2297;", "&#8855;")
html = html.replace("&otimes;", "&#8855;")
html = html.replace("&aelig;", "&#230;")
html = html.replace("&#936;", "&Psi;")
html = html.replace("&#x3a8;", "&Psi;")
html = html.replace("&#x395;", "&#917;")
html = html.replace("&Epsilon;", "&#917;")
html = html.replace("&Icirc;", "&#206;")
html = html.replace("&Eacute;", "&#201;")
html = html.replace("&#x39b;", "&#923;")
html = html.replace("&Lambda;", "&#923;")
html = html.replace("&#x2033;", "&#8243;")
html = html.replace("&#x39a;", "&#922;")
html = html.replace("&Kappa;", "&#922;")
html = html.replace("&#x3c2;", "&#962;")
html = html.replace("&sigmaf;", "&#962;")
html = html.replace("&#8206;", "&lrm;")
html = html.replace("&#x200e;", "&lrm;")
html = html.replace("&cedil;", "&#184;")
html = html.replace("&#8194;", "&ensp;")
html = html.replace("&#x2002;", "&ensp;")
html = html.replace("&AElig;", "&#198;")
html = html.replace("&#x2032;", "&#8242;")
html = html.replace("&#932;", "&Tau;")
html = html.replace("&#x3a4;", "&Tau;")
html = html.replace("&#x2308;", "&#8968;")
html = html.replace("&#8659;", "&dArr;")
html = html.replace("&#x21d3;", "&dArr;")
html = html.replace("&#8805;", "&ge;")
html = html.replace("&#x2265;", "&ge;")
html = html.replace("&#8901;", "&sdot;")
html = html.replace("&#x22c5;", "&sdot;")
html = html.replace("&#x230a;", "&#8970;")
html = html.replace("&lfloor;", "&#8970;")
html = html.replace("&#8656;", "&lArr;")
html = html.replace("&#x21d0;", "&lArr;")
html = html.replace("&brvbar;", "&#166;")
html = html.replace("&Otilde;", "&#213;")
html = html.replace("&#x398;", "&#920;")
html = html.replace("&Theta;", "&#920;")
html = html.replace("&#928;", "&Pi;")
html = html.replace("&#x3a0;", "&Pi;")
html = html.replace("&#x152;", "&#338;")
html = html.replace("&OElig;", "&#338;")
html = html.replace("&#x160;", "&#352;")
html = html.replace("&Scaron;", "&#352;")
html = html.replace("&egrave;", "&#232;")
html = html.replace("&#8834;", "&sub;")
html = html.replace("&#x2282;", "&sub;")
html = html.replace("&iexcl;", "&#161;")
html = html.replace("&#8721;", "&sum;")
html = html.replace("&#x2211;", "&sum;")
html = html.replace("&ntilde;", "&#241;")
html = html.replace("&atilde;", "&#227;")
html = html.replace("&#x3b8;", "&#952;")
html = html.replace("&theta;", "&#952;")
html = html.replace("&#8836;", "&nsub;")
html = html.replace("&#x2284;", "&nsub;")
html = html.replace("&#8660;", "&hArr;")
html = html.replace("&#x21d4;", "&hArr;")
html = html.replace("&Oslash;", "&#216;")
html = html.replace("&THORN;", "&#222;")
html = html.replace("&#924;", "&Mu;")
html = html.replace("&#x39c;", "&Mu;")
html = html.replace("&#x2009;", "&#8201;")
html = html.replace("&thinsp;", "&#8201;")
html = html.replace("&ecirc;", "&#234;")
html = html.replace("&#x201e;", "&#8222;")
html = html.replace("&Aring;", "&#197;")
html = html.replace("&#x2207;", "&#8711;")
html = html.replace("&#x2030;", "&#8240;")
html = html.replace("&permil;", "&#8240;")
html = html.replace("&Ugrave;", "&#217;")
html = html.replace("&#951;", "&eta;")
html = html.replace("&#x3b7;", "&eta;")
html = html.replace("&Agrave;", "&#192;")
html = html.replace("&#x2200;", "&#8704;")
html = html.replace("&forall;", "&#8704;")
html = html.replace("&#240;", "&eth;")
html = html.replace("&#xf0;", "&eth;")
html = html.replace("&#x2309;", "&#8969;")
html = html.replace("&Egrave;", "&#200;")
html = html.replace("&divide;", "&#247;")
html = html.replace("&igrave;", "&#236;")
html = html.replace("&otilde;", "&#245;")
html = html.replace("&pound;", "&#163;")
html = html.replace("&#x2044;", "&#8260;")
html = html.replace("&#208;", "&ETH;")
html = html.replace("&#xd0;", "&ETH;")
html = html.replace("&#x2217;", "&#8727;")
html = html.replace("&lowast;", "&#8727;")
html = html.replace("&#967;", "&chi;")
html = html.replace("&#x3c7;", "&chi;")
html = html.replace("&Aacute;", "&#193;")
html = html.replace("&#x392;", "&#914;")
html = html.replace("&#8869;", "&perp;")
html = html.replace("&#x22a5;", "&perp;")
html = html.replace("&#x2234;", "&#8756;")
html = html.replace("&there4;", "&#8756;")
html = html.replace("&#960;", "&pi;")
html = html.replace("&#x3c0;", "&pi;")
html = html.replace("&#x2205;", "&#8709;")
html = html.replace("&#x2209;", "&#8713;")
html = html.replace("&icirc;", "&#238;")
html = html.replace("&#8226;", "&bull;")
html = html.replace("&#x2022;", "&bull;")
html = html.replace("&#x3c5;", "&#965;")
html = html.replace("&upsilon;", "&#965;")
html = html.replace("&Oacute;", "&#211;")
html = html.replace("&#x3ba;", "&#954;")
html = html.replace("&kappa;", "&#954;")
html = html.replace("&ccedil;", "&#231;")
html = html.replace("&#8745;", "&cap;")
html = html.replace("&#x2229;", "&cap;")
html = html.replace("&#956;", "&mu;")
html = html.replace("&#x3bc;", "&mu;")
html = html.replace("&#176;", "&deg;")
html = html.replace("&#xb0;", "&deg;")
html = html.replace("&#964;", "&tau;")
html = html.replace("&#x3c4;", "&tau;")
html = html.replace("&#8195;", "&emsp;")
html = html.replace("&#x2003;", "&emsp;")
html = html.replace("&#x2026;", "&#8230;")
html = html.replace("&hellip;", "&#8230;")
html = html.replace("&ucirc;", "&#251;")
html = html.replace("&ugrave;", "&#249;")
html = html.replace("&#8773;", "&cong;")
html = html.replace("&#x2245;", "&cong;")
html = html.replace("&#x399;", "&#921;")
html = html.replace("&#x22;", "&#34;")
html = html.replace("&quot;", "&#34;")
html = html.replace("&#8594;", "&rarr;")
html = html.replace("&#x2192;", "&rarr;")
html = html.replace("&#929;", "&Rho;")
html = html.replace("&#x3a1;", "&Rho;")
html = html.replace("&uacute;", "&#250;")
html = html.replace("&acirc;", "&#226;")
html = html.replace("&#8764;", "&sim;")
html = html.replace("&#x223c;", "&sim;")
html = html.replace("&#966;", "&phi;")
html = html.replace("&#x3c6;", "&phi;")
html = html.replace("&#x2666;", "&#9830;")
html = html.replace("&Ccedil;", "&#199;")
html = html.replace("&#919;", "&Eta;")
html = html.replace("&#x397;", "&Eta;")
html = html.replace("&#x393;", "&#915;")
html = html.replace("&Gamma;", "&#915;")
html = html.replace("&#8364;", "&euro;")
html = html.replace("&#x20ac;", "&euro;")
html = html.replace("&#x3d1;", "&#977;")
html = html.replace("&thetasym;", "&#977;")
html = html.replace("&#x201c;", "&#8220;")
html = html.replace("&#x2665;", "&#9829;")
html = html.replace("&hearts;", "&#9829;")
html = html.replace("&oacute;", "&#243;")
html = html.replace("&#8204;", "&zwnj;")
html = html.replace("&#x200c;", "&zwnj;")
html = html.replace("&#165;", "&yen;")
html = html.replace("&#xa5;", "&yen;")
html = html.replace("&ograve;", "&#242;")
html = html.replace("&#935;", "&Chi;")
html = html.replace("&#x3a7;", "&Chi;")
html = html.replace("&#x2122;", "&#8482;")
html = html.replace("&#958;", "&xi;")
html = html.replace("&#x3be;", "&xi;")
html = html.replace("&#x2dc;", "&#732;")
html = html.replace("&tilde;", "&#732;")
html = html.replace("&#x2039;", "&#8249;")
html = html.replace("&lsaquo;", "&#8249;")
html = html.replace("&#x153;", "&#339;")
html = html.replace("&oelig;", "&#339;")
html = html.replace("&#x2261;", "&#8801;")
html = html.replace("&#8804;", "&le;")
html = html.replace("&#x2264;", "&le;")
html = html.replace("&#8746;", "&cup;")
html = html.replace("&#x222a;", "&cup;")
html = html.replace("&#x178;", "&#376;")
html = html.replace("&#60;", "&lt;")
html = html.replace("&#x3c;", "&lt;")
html = html.replace("&#x3a5;", "&#933;")
html = html.replace("&Upsilon;", "&#933;")
html = html.replace("&#x2013;", "&#8211;")
html = html.replace("&yacute;", "&#253;")
html = html.replace("&#8476;", "&real;")
html = html.replace("&#x211c;", "&real;")
html = html.replace("&#968;", "&psi;")
html = html.replace("&#x3c8;", "&psi;")
html = html.replace("&#x203a;", "&#8250;")
html = html.replace("&rsaquo;", "&#8250;")
html = html.replace("&#8595;", "&darr;")
html = html.replace("&#x2193;", "&darr;")
html = html.replace("&#x391;", "&#913;")
html = html.replace("&Alpha;", "&#913;")
html = html.replace("&#172;", "&not;")
html = html.replace("&#xac;", "&not;")
html = html.replace("&#x26;", "&#38;")
html = html.replace("&oslash;", "&#248;")
html = html.replace("&acute;", "&#180;")
html = html.replace("&#8205;", "&zwj;")
html = html.replace("&#x200d;", "&zwj;")
html = html.replace("&laquo;", "&#171;")
html = html.replace("&#x201d;", "&#8221;")
html = html.replace("&Igrave;", "&#204;")
html = html.replace("&micro;", "&#181;")
html = html.replace("&#173;", "&shy;")
html = html.replace("&#xad;", "&shy;")
html = html.replace("&#8839;", "&supe;")
html = html.replace("&#x2287;", "&supe;")
html = html.replace("&szlig;", "&#223;")
html = html.replace("&#x2663;", "&#9827;")
html = html.replace("&agrave;", "&#224;")
html = html.replace("&Ocirc;", "&#212;")
html = html.replace("&#8596;", "&harr;")
html = html.replace("&#x2194;", "&harr;")
html = html.replace("&#8592;", "&larr;")
html = html.replace("&#x2190;", "&larr;")
html = html.replace("&frac12;", "&#189;")
html = html.replace("&#8733;", "&prop;")
html = html.replace("&#x221d;", "&prop;")
html = html.replace("&#x2c6;", "&#710;")
html = html.replace("&ocirc;", "&#244;")
html = html.replace("&#x2248;", "&#8776;")
html = html.replace("&#168;", "&uml;")
html = html.replace("&#xa8;", "&uml;")
html = html.replace("&#8719;", "&prod;")
html = html.replace("&#x220f;", "&prod;")
html = html.replace("&#174;", "&reg;")
html = html.replace("&#xae;", "&reg;")
html = html.replace("&#8207;", "&rlm;")
html = html.replace("&#x200f;", "&rlm;")
html = html.replace("&#x221e;", "&#8734;")
html = html.replace("&#x3a3;", "&#931;")
html = html.replace("&Sigma;", "&#931;")
html = html.replace("&#x2014;", "&#8212;")
html = html.replace("&#8593;", "&uarr;")
html = html.replace("&#x2191;", "&uarr;")
html = html.replace("&times;", "&#215;")
html = html.replace("&#8658;", "&rArr;")
html = html.replace("&#x21d2;", "&rArr;")
html = html.replace("&#8744;", "&or;")
html = html.replace("&#x2228;", "&or;")
html = html.replace("&#x3b3;", "&#947;")
html = html.replace("&gamma;", "&#947;")
html = html.replace("&#x3bb;", "&#955;")
html = html.replace("&lambda;", "&#955;")
html = html.replace("&#9002;", "&rang;")
html = html.replace("&#x232a;", "&rang;")
html = html.replace("&#x2020;", "&#8224;")
html = html.replace("&dagger;", "&#8224;")
html = html.replace("&#x2111;", "&#8465;")
html = html.replace("&#x2135;", "&#8501;")
html = html.replace("&alefsym;", "&#8501;")
html = html.replace("&#8838;", "&sube;")
html = html.replace("&#x2286;", "&sube;")
html = html.replace("&#x3b1;", "&#945;")
html = html.replace("&alpha;", "&#945;")
html = html.replace("&#925;", "&Nu;")
html = html.replace("&#x39d;", "&Nu;")
html = html.replace("&plusmn;", "&#177;")
html = html.replace("&frac34;", "&#190;")
html = html.replace("&#x203e;", "&#8254;")
html = html.replace("&#x394;", "&#916;")
html = html.replace("&Delta;", "&#916;")
html = html.replace("&#9674;", "&loz;")
html = html.replace("&#x25ca;", "&loz;")
html = html.replace("&#x3b9;", "&#953;")
html = html.replace("&iacute;", "&#237;")
html = html.replace("&#x3b5;", "&#949;")
html = html.replace("&epsilon;", "&#949;")
html = html.replace("&#x2118;", "&#8472;")
html = html.replace("&weierp;", "&#8472;")
html = html.replace("&#8706;", "&part;")
html = html.replace("&#x2202;", "&part;")
html = html.replace("&#x3b4;", "&#948;")
html = html.replace("&delta;", "&#948;")
html = html.replace("&#x3bf;", "&#959;")
html = html.replace("&omicron;", "&#959;")
html = html.replace("&#926;", "&Xi;")
html = html.replace("&#x39e;", "&Xi;")
html = html.replace("&#x2021;", "&#8225;")
html = html.replace("&Dagger;", "&#8225;")
html = html.replace("&Ograve;", "&#210;")
html = html.replace("&Ucirc;", "&#219;")
html = html.replace("&#x161;", "&#353;")
html = html.replace("&scaron;", "&#353;")
html = html.replace("&#x2018;", "&#8216;")
html = html.replace("&#8712;", "&isin;")
html = html.replace("&#x2208;", "&isin;")
html = html.replace("&#x396;", "&#918;")
html = html.replace("&#x2212;", "&#8722;")
html = html.replace("&#8743;", "&and;")
html = html.replace("&#x2227;", "&and;")
html = html.replace("&#8736;", "&ang;")
html = html.replace("&#x2220;", "&ang;")
html = html.replace("&curren;", "&#164;")
html = html.replace("&#8747;", "&int;")
html = html.replace("&#x222b;", "&int;")
html = html.replace("&#x230b;", "&#8971;")
html = html.replace("&rfloor;", "&#8971;")
html = html.replace("&#x21b5;", "&#8629;")
html = html.replace("&#x2203;", "&#8707;")
html = html.replace("&#x2295;", "&#8853;")
html = html.replace("&Acirc;", "&#194;")
html = html.replace("&#982;", "&piv;")
html = html.replace("&#x3d6;", "&piv;")
html = html.replace("&#8715;", "&ni;")
html = html.replace("&#x220b;", "&ni;")
html = html.replace("&#934;", "&Phi;")
html = html.replace("&#x3a6;", "&Phi;")
html = html.replace("&Iacute;", "&#205;")
html = html.replace("&Uacute;", "&#218;")
html = html.replace("&#x39f;", "&#927;")
html = html.replace("&Omicron;", "&#927;")
html = html.replace("&#8800;", "&ne;")
html = html.replace("&#x2260;", "&ne;")
html = html.replace("&iquest;", "&#191;")
html = html.replace("&#x201a;", "&#8218;")
html = html.replace("&Ecirc;", "&#202;")
html = html.replace("&#x3b6;", "&#950;")
html = html.replace("&#x3a9;", "&#937;")
html = html.replace("&Omega;", "&#937;")
html = html.replace("&#957;", "&nu;")
html = html.replace("&#x3bd;", "&nu;")
html = html.replace("&frac14;", "&#188;")
html = html.replace("&aacute;", "&#225;")
html = html.replace("&#8657;", "&uArr;")
html = html.replace("&#x21d1;", "&uArr;")
html = html.replace("&#x3b2;", "&#946;")
html = html.replace("&#x192;", "&#402;")
html = html.replace("&#961;", "&rho;")
html = html.replace("&#x3c1;", "&rho;")
html = html.replace("&eacute;", "&#233;")
html = html.replace("&#x3c9;", "&#969;")
html = html.replace("&omega;", "&#969;")
html = html.replace("&middot;", "&#183;")
html = html.replace("&#9001;", "&lang;")
html = html.replace("&#x2329;", "&lang;")
html = html.replace("&#x2660;", "&#9824;")
html = html.replace("&spades;", "&#9824;")
html = html.replace("&#x2019;", "&#8217;")
html = html.replace("&thorn;", "&#254;")
html = html.replace("&raquo;", "&#187;")
html = html.replace("&#x3c3;", "&#963;")
html = html.replace("&sigma;", "&#963;")
out = open(output_file, 'w')
out.write(html)
out.close()
+8 -10
@@ -1,5 +1,4 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Dive Into Python 3</title>
@@ -9,15 +8,14 @@
<link rel=stylesheet type=text/css href=dip3.css>
<style>
h1:before{content:""}
li:last-child{list-style:none;margin:0 0 0 -1.7em}
li:last-child:before{content:"A. \00a0 \00a0"}
li.todo{color:#ddd}
span{cursor:default}
#a{list-style:none;margin:0 0 0 -1.7em}
#a:before{content:"A. \00a0 \00a0"}
.todo{color:#ddd}
</style>
</head>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here:&#xa0;&#xa0;<span title="Ce n'est pas un point">&bull;</span>
<p>You are here:&nbsp;&nbsp;<span title="Ce n'est pas un point">&bull;</span>
<h1>Dive Into Python 3</h1>
@@ -47,15 +45,15 @@ span{cursor:default}
<li class=todo>Creating graphics with the Python Imaging Library
<li class=todo>Where to go from here
<li><a href=case-study-porting-chardet-to-python-3.html>Case study: porting <code>chardet</code> to Python 3</a>
<li><a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>
<li id=a><a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>
</ol>
<p>There is a <a href=http://hg.diveintopython3.org/>changelog</a>, a <a type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>feed</a>, and <a href="http://www.reddit.com/search?q=%22Dive+Into+Python+3%22&amp;sort=new">discussion on Reddit</a>. During development, you can download the book by cloning the Mercurial repository:
<pre><samp class=prompt>you@localhost:~$ </samp><kbd>hg clone http://hg.diveintopython3.org/ diveintopython3</kbd></pre>
<pre><samp class=p>you@localhost:~$ </samp><kbd>hg clone http://hg.diveintopython3.org/ diveintopython3</kbd></pre>
<p>The final version will be downloadable as <abbr>HTML</abbr> and <abbr>PDF</abbr>.
<p class=c>This site is optimized for Lynx just because fuck you.<br>I&#8217;m told it also looks good in graphical browsers.
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
+140 -142
@@ -1,19 +1,17 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Native datatypes - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
body{counter-reset:h1 2}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=root value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#native-datatypes>Dive Into Python 3</a> <span>&#8227;</span>
<p class=s><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=root value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#native-datatypes>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Native datatypes</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Wonder is the foundation of all philosophy, research its progress, ignorance its end. <span>&#x275E;</span><br>&mdash; <cite>Michel de Montaigne</cite>
@@ -61,7 +59,7 @@ body{counter-reset:h1 2}
<li><a href=#furtherreading>Further reading</a>
</ol>
<h2 id=divingin>Diving in</h2>
<p class=fancy>Cast aside <a href=your-first-python-program.html>your first Python program</a> for just a minute, and let's talk about datatypes. In Python, <a href=your-first-python-program.html#datatypes>every variable has a datatype</a>, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally.
<p class=f>Cast aside <a href=your-first-python-program.html>your first Python program</a> for just a minute, and let's talk about datatypes. In Python, <a href=your-first-python-program.html#datatypes>every variable has a datatype</a>, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally.
<p>Python has many native datatypes. Here are the important ones:
<ol>
<li><b>Booleans</b> are either <code>True</code> or <code>False</code>.
@@ -82,25 +80,25 @@ body{counter-reset:h1 2}
raise ValueError('number must be non-negative')</code></pre>
<p><var>size</var> is an integer, <code>0</code> is an integer, and <code>&lt;</code> is a numerical operator. The result of the expression <code>size &lt; 0</code> is always a boolean. You can test this yourself in the Python interactive shell:
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>size = 1</kbd>
<samp class=prompt>>>> </samp><kbd>size &lt; 0</kbd>
<samp class=p>>>> </samp><kbd>size = 1</kbd>
<samp class=p>>>> </samp><kbd>size &lt; 0</kbd>
<samp>False</samp>
<samp class=prompt>>>> </samp><kbd>size = 0</kbd>
<samp class=prompt>>>> </samp><kbd>size &lt; 0</kbd>
<samp class=p>>>> </samp><kbd>size = 0</kbd>
<samp class=p>>>> </samp><kbd>size &lt; 0</kbd>
<samp>False</samp>
<samp class=prompt>>>> </samp><kbd>size = -1</kbd>
<samp class=prompt>>>> </samp><kbd>size &lt; 0</kbd>
<samp class=p>>>> </samp><kbd>size = -1</kbd>
<samp class=p>>>> </samp><kbd>size &lt; 0</kbd>
<samp>True</samp></pre>
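<p>As a minimal sketch of the same idea outside the interactive shell, the comparison can be wrapped in a function and its result inspected. The name <code>is_negative</code> is hypothetical and used only for illustration.

```python
def is_negative(size):
    """Return the boolean produced by the comparison size < 0."""
    return size < 0


# Map each test value to the result of the comparison.
results = {size: is_negative(size) for size in (1, 0, -1)}

for size, answer in results.items():
    # type(answer) is bool for every value, never int.
    print(size, answer, type(answer))
```

<p>Whatever integer you pass in, the value that comes back is a <code>bool</code>.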
<h2 id=numbers>Numbers</h2>
<p>Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There's no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>type(1)</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>type(1)</kbd> <span>&#x2460;</span></a>
<samp>&lt;class 'int'></samp>
<a><samp class=prompt>>>> </samp><kbd>1 + 1</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>1 + 1</kbd> <span>&#x2461;</span></a>
<samp>2</samp>
<a><samp class=prompt>>>> </samp><kbd>1 + 1.0</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>1 + 1.0</kbd> <span>&#x2462;</span></a>
<samp>2.0</samp>
<samp class=prompt>>>> </samp><kbd>type(2.0)</kbd>
<samp class=p>>>> </samp><kbd>type(2.0)</kbd>
<samp>&lt;class 'float'></samp></pre>
<ol>
<li>You can use the <code>type()</code> function to check the type of any value or variable. As you might expect, <code>1</code> is an <code>int</code>.
@@ -110,17 +108,17 @@ body{counter-reset:h1 2}
<h3 id=number-coercion>Coercing integers to floats and vice-versa</h3>
<p>As you just saw, some operators (like addition) will coerce integers to floating point numbers as needed. You can also coerce them by yourself.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>float(2)</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>float(2)</kbd> <span>&#x2460;</span></a>
<samp>2.0</samp>
<a><samp class=prompt>>>> </samp><kbd>int(2.0)</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>int(2.0)</kbd> <span>&#x2461;</span></a>
<samp>2</samp>
<a><samp class=prompt>>>> </samp><kbd>int(2.5)</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>int(2.5)</kbd> <span>&#x2462;</span></a>
<samp>2</samp>
<a><samp class=prompt>>>> </samp><kbd>int(-2.5)</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>int(-2.5)</kbd> <span>&#x2463;</span></a>
<samp>-2</samp>
<a><samp class=prompt>>>> </samp><kbd>1.12345678901234567890</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>1.12345678901234567890</kbd> <span>&#x2464;</span></a>
<samp>1.1234567890123457</samp>
<a><samp class=prompt>>>> </samp><kbd>type(1000000000000000)</kbd> <span>&#x2465;</span></a>
<a><samp class=p>>>> </samp><kbd>type(1000000000000000)</kbd> <span>&#x2465;</span></a>
<samp>&lt;class 'int'></samp></pre>
<ol>
<li>You can explicitly coerce an <code>int</code> to a <code>float</code> by calling the <code>float()</code> function.
@@ -136,17 +134,17 @@ body{counter-reset:h1 2}
<h3 id=common-numerical-operations>Common numerical operations</h3>
<p>You can do all kinds of things with numbers.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>11 / 2</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>11 / 2</kbd> <span>&#x2460;</span></a>
<samp>5.5</samp>
<a><samp class=prompt>>>> </samp><kbd>11 // 2</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>11 // 2</kbd> <span>&#x2461;</span></a>
<samp>5</samp>
<a><samp class=prompt>>>> </samp><kbd>&minus;11 // 2</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>&minus;11 // 2</kbd> <span>&#x2462;</span></a>
<samp>&minus;6</samp>
<a><samp class=prompt>>>> </samp><kbd>11.0 // 2</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>11.0 // 2</kbd> <span>&#x2463;</span></a>
<samp>5.0</samp>
<a><samp class=prompt>>>> </samp><kbd>11 ** 2</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>11 ** 2</kbd> <span>&#x2464;</span></a>
<samp>121</samp>
<a><samp class=prompt>>>> </samp><kbd>11 % 2</kbd> <span>&#x2465;</span></a>
<a><samp class=p>>>> </samp><kbd>11 % 2</kbd> <span>&#x2465;</span></a>
<samp>1</samp>
</pre>
<ol>
@@ -163,13 +161,13 @@ body{counter-reset:h1 2}
<h3 id=fractions>Fractions</h3>
<p>Python isn't limited to integers and floating point numbers. It can also do all the fancy math you learned in high school and promptly forgot about.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>import fractions</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>x = fractions.Fraction(1, 3)</kbd> <span>&#x2461;</span></a>
<samp class=prompt>>>> </samp><kbd>x</kbd>
<a><samp class=p>>>> </samp><kbd>import fractions</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>x = fractions.Fraction(1, 3)</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>x</kbd>
<samp>Fraction(1, 3)</samp>
<a><samp class=prompt>>>> </samp><kbd>x * 2</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>x * 2</kbd> <span>&#x2462;</span></a>
<samp>Fraction(2, 3)</samp>
<a><samp class=prompt>>>> </samp><kbd>fractions.Fraction(6, 4)</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>fractions.Fraction(6, 4)</kbd> <span>&#x2463;</span></a>
<samp>Fraction(3, 2)</samp></pre>
<ol>
<li>To start using fractions, import the <code>fractions</code> module.
@@ -180,12 +178,12 @@ body{counter-reset:h1 2}
<h3 id=trig>Trigonometry</h3>
<p>You can also do basic trigonometry in Python.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import math</kbd>
<a><samp class=prompt>>>> </samp><kbd>math.pi</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>import math</kbd>
<a><samp class=p>>>> </samp><kbd>math.pi</kbd> <span>&#x2460;</span></a>
<samp>3.1415926535897931</samp>
<a><samp class=prompt>>>> </samp><kbd>math.sin(math.pi / 2)</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>math.sin(math.pi / 2)</kbd> <span>&#x2461;</span></a>
<samp>1.0</samp>
<a><samp class=prompt>>>> </samp><kbd>math.tan(math.pi / 4)</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>math.tan(math.pi / 4)</kbd> <span>&#x2462;</span></a>
<samp>0.99999999999999989</samp></pre>
<ol>
<li>The <code>math</code> module has a constant for &pi;, the ratio of a circle's circumference to its diameter.
@@ -195,26 +193,26 @@ body{counter-reset:h1 2}
<h3 id=numbers-in-a-boolean-context>Numbers in a boolean context</h3>
<p>You can use numbers <a href="#booleans">in a boolean context</a>, such as an <code>if</code> statement. Zero values are false, and non-zero values are true.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>def is_it_true(anything):</kbd> <span>&#x2460;</span></a>
<samp class=prompt>... </samp><kbd> if anything:</kbd>
<samp class=prompt>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=prompt>... </samp><kbd> else:</kbd>
<samp class=prompt>... </samp><kbd> print("no, it's false")</kbd>
<samp class=prompt>...</samp>
<a><samp class=prompt>>>> </samp><kbd>is_it_true(1)</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd> <span>&#x2460;</span></a>
<samp class=p>... </samp><kbd> if anything:</kbd>
<samp class=p>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=p>... </samp><kbd> else:</kbd>
<samp class=p>... </samp><kbd> print("no, it's false")</kbd>
<samp class=p>...</samp>
<a><samp class=p>>>> </samp><kbd>is_it_true(1)</kbd> <span>&#x2461;</span></a>
<samp>yes, it's true</samp>
<samp class=prompt>>>> </samp><kbd>is_it_true(-1)</kbd>
<samp class=p>>>> </samp><kbd>is_it_true(-1)</kbd>
<samp>yes, it's true</samp>
<samp class=prompt>>>> </samp><kbd>is_it_true(0)</kbd>
<samp class=p>>>> </samp><kbd>is_it_true(0)</kbd>
<samp>no, it's false</samp>
<a><samp class=prompt>>>> </samp><kbd>is_it_true(0.1)</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>is_it_true(0.1)</kbd> <span>&#x2462;</span></a>
<samp>yes, it's true</samp>
<samp class=prompt>>>> </samp><kbd>is_it_true(0.0)</kbd>
<samp class=p>>>> </samp><kbd>is_it_true(0.0)</kbd>
<samp>no, it's false</samp>
<samp class=prompt>>>> </samp><kbd>import fractions</kbd>
<a><samp class=prompt>>>> </samp><kbd>is_it_true(fractions.Fraction(1, 2))</kbd> <span>&#x2463;</span></a>
<samp class=p>>>> </samp><kbd>import fractions</kbd>
<a><samp class=p>>>> </samp><kbd>is_it_true(fractions.Fraction(1, 2))</kbd> <span>&#x2463;</span></a>
<samp>yes, it's true</samp>
<samp class=prompt>>>> </samp><kbd>is_it_true(fractions.Fraction(0, 1))</kbd>
<samp class=p>>>> </samp><kbd>is_it_true(fractions.Fraction(0, 1))</kbd>
<samp>no, it's false</samp></pre>
<ol>
<li>Did you know you can define your own functions in the Python interactive shell? Just press <kbd>ENTER</kbd> at the end of each line, and <kbd>ENTER</kbd> on a blank line to finish.
@@ -233,16 +231,16 @@ body{counter-reset:h1 2}
<h3 id=creatinglists>Creating a list</h3>
<p>Creating a list is easy: use square brackets to wrap a comma-separated list of values.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>a_list = ['a', 'b', 'mpilgrim', 'z', 'example']</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<a><samp class=p>>>> </samp><kbd>a_list = ['a', 'b', 'mpilgrim', 'z', 'example']</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'mpilgrim', 'z', 'example']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[0]</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[0]</kbd> <span>&#x2461;</span></a>
<samp>'a'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[4]</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[4]</kbd> <span>&#x2462;</span></a>
<samp>'example'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[-1]</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[-1]</kbd> <span>&#x2463;</span></a>
<samp>'example'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[-3]</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[-3]</kbd> <span>&#x2464;</span></a>
<samp>'mpilgrim'</samp></pre>
<ol>
<li>First, you define a list of five items. Note that they retain their original order. This is not an accident. A list is an ordered set of items.
@@ -254,19 +252,19 @@ body{counter-reset:h1 2}
<h3 id=slicinglists>Slicing a list</h3>
<p>Once you've defined a list, you can get any part of it as a new list. This is called <i>slicing</i> the list.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'mpilgrim', 'z', 'example']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[1:3]</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[1:3]</kbd> <span>&#x2460;</span></a>
<samp>['b', 'mpilgrim']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[1:-1]</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[1:-1]</kbd> <span>&#x2461;</span></a>
<samp>['b', 'mpilgrim', 'z']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[0:3]</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[0:3]</kbd> <span>&#x2462;</span></a>
<samp>['a', 'b', 'mpilgrim']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[:3]</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[:3]</kbd> <span>&#x2463;</span></a>
<samp>['a', 'b', 'mpilgrim']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[3:]</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[3:]</kbd> <span>&#x2464;</span></a>
<samp>['z', 'example']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[:]</kbd> <span>&#x2465;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list[:]</kbd> <span>&#x2465;</span></a>
<samp>['a', 'b', 'mpilgrim', 'z', 'example']</samp></pre>
<ol>
<li>You can get a part of a list, called a &#8220;slice&#8221;, by specifying two indices. The return value is a new list containing all the items of the list, in order, starting with the first slice index (in this case <code>a_list[1]</code>), up to but not including the second slice index (in this case <code>a_list[3]</code>).
@@ -279,18 +277,18 @@ body{counter-reset:h1 2}
<h3 id=extendinglists>Adding items to a list</h3>
<p>There are four ways to add items to a list.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>a_list = ['a']</kbd>
<a><samp class=prompt>>>> </samp><kbd>a_list = a_list + [2.0, 3]</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<samp class=p>>>> </samp><kbd>a_list = ['a']</kbd>
<a><samp class=p>>>> </samp><kbd>a_list = a_list + [2.0, 3]</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 2.0, 3]</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list.append(True)</kbd> <span>&#x2461;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<a><samp class=p>>>> </samp><kbd>a_list.append(True)</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 2.0, 3, True]</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list.extend(['four', 'e'])</kbd> <span>&#x2462;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<a><samp class=p>>>> </samp><kbd>a_list.extend(['four', 'e'])</kbd> <span>&#x2462;</span></a>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 2.0, 3, True, 'four', 'e']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list.insert(1, 'a')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<a><samp class=p>>>> </samp><kbd>a_list.insert(1, 'a')</kbd> <span>&#x2463;</span></a>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'a', 2.0, 3, True, 'four', 'e']</samp></pre>
<ol>
<li>The <code>+</code> operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don't all need to be the same type. Here we have a list containing a string, a floating point number, and an integer.
@@ -300,20 +298,20 @@ body{counter-reset:h1 2}
</ol>
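<p>One difference among the four is worth a sketch: the <code>+</code> operator builds a new list, while <code>append()</code>, <code>extend()</code>, and <code>insert()</code> change the existing list in place. The variable names <code>original</code> and <code>alias</code> below are illustrative only.

```python
original = ['a']
alias = original                 # a second name for the same list object

original = original + [2.0, 3]   # + builds a new list and rebinds the name
print(alias)                     # the first list is untouched
print(original)

original.append(True)            # in place: one item added at the end
original.extend(['four', 'e'])   # in place: each item added at the end
original.insert(1, 'a')          # in place: item placed at index 1
print(original)
```

<p>After the concatenation, <code>alias</code> still refers to the one-item list, because nothing modified it.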
<p>Let's look closer at the difference between <code>append()</code> and <code>extend()</code>.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>a_list = ['a', 'b', 'c']</kbd>
<a><samp class=prompt>>>> </samp><kbd>a_list.extend(['d', 'e', 'f'])</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<samp class=p>>>> </samp><kbd>a_list = ['a', 'b', 'c']</kbd>
<a><samp class=p>>>> </samp><kbd>a_list.extend(['d', 'e', 'f'])</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'c', 'd', 'e', 'f']</samp>
<a><samp class=prompt>>>> </samp><kbd>len(a_list)</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>len(a_list)</kbd> <span>&#x2461;</span></a>
<samp>6</samp>
<samp class=prompt>>>> </samp><kbd>a_list[-1]</kbd>
<samp class=p>>>> </samp><kbd>a_list[-1]</kbd>
<samp>'f'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list.append(['g', 'h', 'i'])</kbd> <span>&#x2462;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<a><samp class=p>>>> </samp><kbd>a_list.append(['g', 'h', 'i'])</kbd> <span>&#x2462;</span></a>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'c', 'd', 'e', 'f', ['g', 'h', 'i']]</samp>
<a><samp class=prompt>>>> </samp><kbd>len(a_list)</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>len(a_list)</kbd> <span>&#x2463;</span></a>
<samp>7</samp>
<samp class=prompt>>>> </samp><kbd>a_list[-1]</kbd>
<samp class=p>>>> </samp><kbd>a_list[-1]</kbd>
<samp>['g', 'h', 'i']</samp></pre>
<ol>
<li>The <code>extend()</code> method takes a single argument, which is always a list, and adds each of the items of that list to <var>a_list</var>.
@@ -323,16 +321,16 @@ body{counter-reset:h1 2}
</ol>
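<p>The length arithmetic can be checked with a short script. This is a sketch using a hypothetical list named <code>letters</code>, not code from the book's examples.

```python
letters = ['a', 'b', 'c']

letters.extend(['d', 'e', 'f'])   # three separate items are added
len_after_extend = len(letters)

letters.append(['g', 'h', 'i'])   # one item is added, and that item is a list
len_after_append = len(letters)

print(len_after_extend, len_after_append)
print(letters[-1])                # the nested list is the last item
```

<p>Extending a three-item list with three items gives six; appending a list to that adds exactly one more item, however long the appended list is.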
<h3 id=searchinglists>Searching for values in a list</h3>
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>a_list = ['a', 'b', 'new', 'mpilgrim', 'new']</kbd>
<a><samp class=prompt>>>> </samp><kbd>'mpilgrim' in a_list</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>a_list = ['a', 'b', 'new', 'mpilgrim', 'new']</kbd>
<a><samp class=p>>>> </samp><kbd>'mpilgrim' in a_list</kbd> <span>&#x2460;</span></a>
<samp>True</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list.index('mpilgrim')</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list.index('mpilgrim')</kbd> <span>&#x2461;</span></a>
<samp>3</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list.index('new')</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list.index('new')</kbd> <span>&#x2462;</span></a>
<samp>2</samp>
<a><samp class=prompt>>>> </samp><kbd>'c' in a_list</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>'c' in a_list</kbd> <span>&#x2463;</span></a>
<samp>False</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list.index('c')</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>a_list.index('c')</kbd> <span>&#x2464;</span></a>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
ValueError: list.index(x): x not in list</samp></pre>
@@ -346,15 +344,15 @@ ValueError: list.index(x): x not in list</samp></pre>
<h3 id=lists-in-a-boolean-context>Lists in a boolean context</h3>
<p>You can also use a list in <a href=#booleans>a boolean context</a>, such as an <code>if</code> statement.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>def is_it_true(anything):</kbd>
<samp class=prompt>... </samp><kbd> if anything:</kbd>
<samp class=prompt>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=prompt>... </samp><kbd> else:</kbd>
<samp class=prompt>... </samp><kbd> print("no, it's false")</kbd>
<samp class=prompt>...</samp>
<a><samp class=prompt>>>> </samp><kbd>is_it_true([])</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd>
<samp class=p>... </samp><kbd> if anything:</kbd>
<samp class=p>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=p>... </samp><kbd> else:</kbd>
<samp class=p>... </samp><kbd> print("no, it's false")</kbd>
<samp class=p>...</samp>
<a><samp class=p>>>> </samp><kbd>is_it_true([])</kbd> <span>&#x2461;</span></a>
<samp>no, it's false</samp>
<a><samp class=prompt>>>> </samp><kbd>is_it_true(['a'])</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>is_it_true(['a'])</kbd> <span>&#x2462;</span></a>
<samp>yes, it's true</samp></pre>
<ol>
<li>In a boolean context, an empty list is false.
@@ -372,14 +370,14 @@ ValueError: list.index(x): x not in list</samp></pre>
<h3 id=creating-dictionaries>Creating a dictionary</h3>
<p>Creating a dictionary is easy. The syntax is similar to <a href=#sets>sets</a>, but instead of values, you have key-value pairs. Once you have a dictionary, you can look up values by their key.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>a_dict = {"server":"db.diveintopython3.org", "database":"mysql"}</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<a><samp class=p>>>> </samp><kbd>a_dict = {"server":"db.diveintopython3.org", "database":"mysql"}</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'database': 'mysql'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["server"]</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>a_dict["server"]</kbd> <span>&#x2461;</span></a>
<samp>'db.diveintopython3.org'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["database"]</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>a_dict["database"]</kbd> <span>&#x2462;</span></a>
<samp>'mysql'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["db.diveintopython3.org"]</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>a_dict["db.diveintopython3.org"]</kbd> <span>&#x2463;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in &lt;module>
KeyError: 'db.diveintopython3.org'</samp></pre>
@@ -392,19 +390,19 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<h3 id=modifying-dictionaries>Modifying a dictionary</h3>
<p>Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any time, or you can modify the value of an existing key. Continuing from the previous example:
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<samp class=p>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'database': 'mysql'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["database"] = "blog"</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<a><samp class=p>>>> </samp><kbd>a_dict["database"] = "blog"</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'database': 'blog'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["user"] = "mark"</kbd> <span>&#x2461;</span></a>
<a><samp class=prompt>>>> </samp><kbd>a_dict</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>a_dict["user"] = "mark"</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>a_dict</kbd> <span>&#x2462;</span></a>
<samp>{'server': 'db.diveintopython3.org', 'user': 'mark', 'database': 'blog'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["user"] = "dora"</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<a><samp class=p>>>> </samp><kbd>a_dict["user"] = "dora"</kbd> <span>&#x2463;</span></a>
<samp class=p>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["User"] = "mark"</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<a><samp class=p>>>> </samp><kbd>a_dict["User"] = "mark"</kbd> <span>&#x2464;</span></a>
<samp class=p>>>> </samp><kbd>a_dict</kbd>
<samp>{'User': 'mark', 'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}</samp></pre>
<ol>
<li>You can not have duplicate keys in a dictionary. Assigning a value to an existing key will wipe out the old value.
@@ -420,15 +418,15 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}</code></pre>
<p>Let's tear that apart in the interactive shell.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],</kbd>
<samp class=prompt>... </samp><kbd> 1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}</kbd>
<a><samp class=prompt>>>> </samp><kbd>len(SUFFIXES)</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],</kbd>
<samp class=p>... </samp><kbd> 1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}</kbd>
<a><samp class=p>>>> </samp><kbd>len(SUFFIXES)</kbd> <span>&#x2460;</span></a>
<samp>2</samp>
<a><samp class=prompt>>>> </samp><kbd>SUFFIXES[1000]</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>SUFFIXES[1000]</kbd> <span>&#x2461;</span></a>
<samp>['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']</samp>
<a><samp class=prompt>>>> </samp><kbd>SUFFIXES[1024]</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>SUFFIXES[1024]</kbd> <span>&#x2462;</span></a>
<samp>['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']</samp>
<a><samp class=prompt>>>> </samp><kbd>SUFFIXES[1000][3]</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>SUFFIXES[1000][3]</kbd> <span>&#x2463;</span></a>
<samp>'TB'</samp></pre>
<ol>
<li>As with <a href=#lists>lists</a><!-- and <a href=#sets>sets</a>-->, the <code>len()</code> function gives you the number of items in a dictionary.
@@ -439,15 +437,15 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<h3 id=dictionaries-in-a-boolean-context>Dictionaries in a boolean context</h3>
<p>You can also use a dictionary in <a href=#booleans>a boolean context</a>, such as an <code>if</code> statement.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>def is_it_true(anything):</kbd>
<samp class=prompt>... </samp><kbd> if anything:</kbd>
<samp class=prompt>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=prompt>... </samp><kbd> else:</kbd>
<samp class=prompt>... </samp><kbd> print("no, it's false")</kbd>
<samp class=prompt>...</samp>
<a><samp class=prompt>>>> </samp><kbd>is_it_true({})</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd>
<samp class=p>... </samp><kbd> if anything:</kbd>
<samp class=p>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=p>... </samp><kbd> else:</kbd>
<samp class=p>... </samp><kbd> print("no, it's false")</kbd>
<samp class=p>...</samp>
<a><samp class=p>>>> </samp><kbd>is_it_true({})</kbd> <span>&#x2460;</span></a>
<samp>no, it's false</samp>
<a><samp class=prompt>>>> </samp><kbd>is_it_true({'a': 1})</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>is_it_true({'a': 1})</kbd> <span>&#x2461;</span></a>
<samp>yes, it's true</samp></pre>
<ol>
<li>In a boolean context, an empty dictionary is false.
@@ -457,35 +455,35 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<p><code>None</code> is a special constant in Python. It is a null value. <code>None</code> is not the same as <code>False</code>. <code>None</code> is not <code>0</code>. <code>None</code> is not an empty string. Comparing <code>None</code> to anything other than <code>None</code> will always return <code>False</code>.
<p><code>None</code> is the only null value. It has its own datatype (<code>NoneType</code>). You can assign <code>None</code> to any variable, but you can not create other <code>NoneType</code> objects. All variables whose value is <code>None</code> are equal to each other.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>type(None)</kbd>
<samp class=p>>>> </samp><kbd>type(None)</kbd>
<samp>&lt;class 'NoneType'></samp>
<samp class=prompt>>>> </samp><kbd>None == False</kbd>
<samp class=p>>>> </samp><kbd>None == False</kbd>
<samp>False</samp>
<samp class=prompt>>>> </samp><kbd>None == 0</kbd>
<samp class=p>>>> </samp><kbd>None == 0</kbd>
<samp>False</samp>
<samp class=prompt>>>> </samp><kbd>None == ''</kbd>
<samp class=p>>>> </samp><kbd>None == ''</kbd>
<samp>False</samp>
<samp class=prompt>>>> </samp><kbd>None == None</kbd>
<samp class=p>>>> </samp><kbd>None == None</kbd>
<samp>True</samp>
<samp class=prompt>>>> </samp><kbd>x = None</kbd>
<samp class=prompt>>>> </samp><kbd>x == None</kbd>
<samp class=p>>>> </samp><kbd>x = None</kbd>
<samp class=p>>>> </samp><kbd>x == None</kbd>
<samp>True</samp>
<samp class=prompt>>>> </samp><kbd>y = None</kbd>
<samp class=prompt>>>> </samp><kbd>x == y</kbd>
<samp class=p>>>> </samp><kbd>y = None</kbd>
<samp class=p>>>> </samp><kbd>x == y</kbd>
<samp>True</samp>
</pre>
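<p>A sketch of why the distinction matters: a function can return <code>None</code> to mean &#8220;no result&#8221;, which stays distinguishable from a legitimate result of <code>0</code>. The function <code>find_first_negative</code> is a hypothetical example, not part of the book's code.

```python
def find_first_negative(numbers):
    """Return the first negative number, or None if there is none."""
    for number in numbers:
        if number < 0:
            return number
    return None


found = find_first_negative([3, -7, 5])
missing = find_first_negative([3, 0, 5])

print(found)
print(missing == None)   # None compares equal only to None
print(missing == 0)      # "no result" is not the same as zero
print(missing is None)   # identity test; there is only one None object
```

<p>Because there is only one <code>None</code> object, many Python programmers test for it with <code>is None</code> rather than <code>== None</code>; both give the same answer here.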
<h3 id=none-in-a-boolean-context><code>None</code> in a boolean context</h3>
<p>In <a href=#booleans>a boolean context</a>, <code>None</code> is false and <code>not None</code> is true.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>def is_it_true(anything):</kbd>
<samp class=prompt>... </samp><kbd> if anything:</kbd>
<samp class=prompt>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=prompt>... </samp><kbd> else:</kbd>
<samp class=prompt>... </samp><kbd> print("no, it's false")</kbd>
<samp class=prompt>...</samp>
<samp class=prompt>>>> </samp><kbd>is_it_true(None)</kbd>
<samp class=p>>>> </samp><kbd>def is_it_true(anything):</kbd>
<samp class=p>... </samp><kbd> if anything:</kbd>
<samp class=p>... </samp><kbd> print("yes, it's true")</kbd>
<samp class=p>... </samp><kbd> else:</kbd>
<samp class=p>... </samp><kbd> print("no, it's false")</kbd>
<samp class=p>...</samp>
<samp class=p>>>> </samp><kbd>is_it_true(None)</kbd>
<samp>no, it's false</samp>
<samp class=prompt>>>> </samp><kbd>is_it_true(not None)</kbd>
<samp class=p>>>> </samp><kbd>is_it_true(not None)</kbd>
<samp>yes, it's true</samp></pre>
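<p>The boolean-context rules from this chapter can be collected in one place. The sketch below uses the built-in <code>bool()</code> function, which applies the same test an <code>if</code> statement does; the list name <code>values</code> is illustrative.

```python
import fractions

# One falsy and one truthy example for each datatype covered above.
values = [
    0, 1,
    0.0, 0.1,
    fractions.Fraction(0, 1), fractions.Fraction(1, 2),
    [], ['a'],
    {}, {'a': 1},
    None,
]

truthiness = [bool(value) for value in values]

for value, answer in zip(values, truthiness):
    print(repr(value), answer)
```

<p>Zero values, empty containers, and <code>None</code> come out false; everything else in the list comes out true.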
<h2 id=furtherreading>Further reading</h2>
<ul>
@@ -494,6 +492,6 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li><a href="http://www.python.org/dev/peps/pep-0237/"><abbr>PEP</abbr> 237: Unifying Long Integers and Integers</a>
<li><a href="http://www.python.org/dev/peps/pep-0238/"><abbr>PEP</abbr> 238: Changing the Division Operator</a>
</ul>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
+62 -56
View File
@@ -1,21 +1,27 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Porting code to Python 3 with 2to3 - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
h1:before{counter-increment:h1;content:"Appendix A. "}
h2:before{counter-increment:h2;content:"A." counter(h2) ". "}
h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
tr + tr th:first-child{font:medium 'Arial Unicode MS',FreeSerif,OpenSymbol,'DejaVu Sans',sans-serif}
table{width:100%;border-collapse:collapse}
th,td{width:45%;padding:0 0.5em;border:1px solid #bbb}
th{text-align:left;vertical-align:baseline}
td{vertical-align:top}
th:first-child{width:10%;text-align:center}
th,td,td pre{margin:0}
td pre{padding:0;border:0}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#porting-code-to-python-3-with-2to3>Dive Into Python 3</a> <span>&#8227;</span>
<p class=s><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#porting-code-to-python-3-with-2to3>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Porting code to Python 3 with <code>2to3</code></h1>
<blockquote class=q>
<p><span>&#x275D;</span> Life is pleasant. Death is peaceful. It&#8217;s the transition that&#8217;s troublesome. <span>&#x275E;</span><br>&mdash; Isaac Asimov (attributed)
@@ -79,11 +85,11 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
</ol>
</ol>
<h2 id=divingin>Diving in</h2>
<p class=fancy>Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. <a href=case-study-porting-chardet-to-python-3.html#running2to3>Case study: porting <code>chardet</code> to Python 3</a> describes how to run the <code>2to3</code> script, then shows some things it can't fix automatically. This appendix documents what it <em>can</em> fix automatically.
<p class=f>Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. <a href=case-study-porting-chardet-to-python-3.html#running2to3>Case study: porting <code>chardet</code> to Python 3</a> describes how to run the <code>2to3</code> script, then shows some things it can't fix automatically. This appendix documents what it <em>can</em> fix automatically.
<h2 id=print><code>print</code> statement</h2>
<p>In Python 2, <code>print</code> was a statement. Whatever you wanted to print simply followed the <code>print</code> keyword. In Python 3, <code>print()</code> is a function &mdash; whatever you want to print is passed to <code>print()</code> like any other function.
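<p>As a minimal sketch of what "like any other function" means, the snippet below passes keyword arguments to <code>print()</code>. The <code>buffer</code> and <code>captured</code> names are made up for this illustration; the buffer exists only so the output can be inspected.

```python
import io

# Hypothetical in-memory stream, used only to capture the output here.
buffer = io.StringIO()

# print() is an ordinary function: positional arguments are the values to
# print, and keyword arguments set the separator, line ending, and stream.
print("a", "b", "c", sep="-", end="!\n", file=buffer)

captured = buffer.getvalue()
```

<p>In Python 2, redirecting output required the special <code>print >>stream, value</code> form; here it is just another argument.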
<p id=noscript>[The code examples will be easier to follow if you enable Javascript, but whatever.]
<p class=skip><a href=#skipcompareprint>skip over this table</a>
<p class=s><a href=#skipcompareprint>skip over this table</a>
<table id=compareprint>
<tr>
<th class=notes>Notes</th>
@@ -115,7 +121,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
</ol>
<h2 id=unicodeliteral>Unicode string literals</h2>
<p>Python 2 had two string types: Unicode strings and non-Unicode strings. Python 3 has one string type: Unicode strings.
<p class=skip><a href=#skipcompareunicodeliteral>skip over this table</a>
<p class=s><a href=#skipcompareunicodeliteral>skip over this table</a>
<table id=compareunicodeliteral>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -134,7 +140,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
</ol>
<h2 id=unicode><code>unicode()</code> global function</h2>
<p>Python 2 had two global functions to coerce objects into strings: <code>unicode()</code> to coerce them into Unicode strings, and <code>str()</code> to coerce them into non-Unicode strings. Python 3 has only one string type, Unicode strings, so the <code>str()</code> function is all you need. (The <code>unicode()</code> function no longer exists.)
<p class=skip><a href=#skipcompareunicode>skip over this table</a>
<p class=s><a href=#skipcompareunicode>skip over this table</a>
<table id=compareunicode>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -148,7 +154,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
<h2 id=long><code>long</code> data type</h2>
<p>Python 2 had separate <code>int</code> and <code>long</code> types for non-floating-point numbers. An <code>int</code> could not be any larger than <a href=#renames><code>sys.maxint</code></a>, which varied by platform. Longs were defined by appending an <code>L</code> to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called <code>int</code>, which mostly behaves like the <code>long</code> type in Python 2. Since there are no longer two types, there is no need for special syntax to distinguish them.
<p>Further reading: <a href=http://www.python.org/dev/peps/pep-0237/><abbr>PEP</abbr> 237: Unifying Long Integers and Integers</a>.
<p class=skip><a href=#skipcomparelong>skip over this table</a>
<p class=s><a href=#skipcomparelong>skip over this table</a>
<table id=comparelong>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -179,7 +185,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
</ol>
<h2 id=ne>&lt;> comparison</h2>
<p>Python 2 supported <code>&lt;></code> as a synonym for <code>!=</code>, the not-equals comparison operator. Python 3 supports the <code>!=</code> operator, but not <code>&lt;></code>.
<p class=skip><a href=#skipcomparene>skip over this table</a>
<p class=s><a href=#skipcomparene>skip over this table</a>
<table id=comparene>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -198,7 +204,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
</ol>
<h2 id=has_key><code>has_key()</code> dictionary method</h2>
<p>In Python 2, dictionaries had a <code>has_key()</code> method to test whether the dictionary had a certain key. In Python 3, this method no longer exists. Instead, you need to use the <code>in</code> operator.
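<p>A short sketch of the replacement, using a made-up dictionary (<code>a_dictionary</code> and its keys are illustrative only):

```python
# Hypothetical sample data.
a_dictionary = {"alpha": 1, "beta": 2}

# Python 2 spelling: a_dictionary.has_key("alpha")
has_alpha = "alpha" in a_dictionary

# The negated test reads naturally with "not in".
missing_gamma = "gamma" not in a_dictionary
```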
<p class=skip><a href=#skipcomparehas_key>skip over this table</a>
<p class=s><a href=#skipcomparehas_key>skip over this table</a>
<table id=comparehas_key>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -229,7 +235,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
</ol>
<h2 id=dict>Dictionary methods that return lists</h2>
<p>In Python 2, many dictionary methods returned lists. The most frequently used methods were <code>keys()</code>, <code>items()</code>, and <code>values()</code>. In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing.
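<p>The difference can be sketched as follows. The dictionary contents are invented for illustration; the point is that a view tracks the dictionary but cannot be indexed until you copy it into a list.

```python
# Hypothetical sample data.
a_dictionary = {"alpha": 1, "beta": 2}

keys_view = a_dictionary.keys()

# A view is dynamic: it reflects changes made after it was created.
a_dictionary["gamma"] = 3
view_length = len(keys_view)

# A view does not support indexing.
try:
    keys_view[0]
    indexing_worked = True
except TypeError:
    indexing_worked = False

# Wrap the view in list() when you need individually addressable elements.
keys_list = list(a_dictionary.keys())
first_key = keys_list[0]
```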
<p class=skip><a href=#skipcomparedict>skip over this table</a>
<p class=s><a href=#skipcomparedict>skip over this table</a>
<table id=comparedict>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -262,7 +268,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
<p>Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical.
<h3 id=http><code>http</code></h3>
<p>In Python 3, several related <abbr>HTTP</abbr> modules have been combined into a single package, <code>http</code>.
<p class=skip><a href=#skipcompareimporthttp>skip over this table</a>
<p class=s><a href=#skipcompareimporthttp>skip over this table</a>
<table id=compareimporthttp>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -291,7 +297,7 @@ import CGIHttpServer</code></pre></td>
</ol>
<h3 id=urllib><code>urllib</code></h3>
<p>Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, <code>urllib</code>.
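<p>As a rough illustration of the combined package, the sketch below uses only the <code>urllib.parse</code> submodule, so it makes no network calls. The <abbr>URL</abbr> is a sample value chosen for this example.

```python
from urllib.parse import quote, urlencode, urlparse

# Sample URL, chosen only for illustration.
parts = urlparse("http://diveintopython3.org/table-of-contents.html?q=2to3#divingin")

host = parts.netloc
path = parts.path
fragment = parts.fragment

# Encoding helpers live in the same submodule.
quoted = quote("porting code")
query = urlencode({"q": "2to3"})
```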
<p class=skip><a href=#skipcompareimporturllib>skip over this table</a>
<p class=s><a href=#skipcompareimporturllib>skip over this table</a>
<table id=compareimporturllib>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -330,7 +336,7 @@ from urllib.error import HTTPError</code></pre></td></tr>
</ol>
<h3 id=dbm><code>dbm</code></h3>
<p>All the various <abbr>DBM</abbr> clones are now in a single package, <code>dbm</code>. If you need a specific variant like <abbr>GNU</abbr> <abbr>DBM</abbr>, you can import the appropriate module within the <code>dbm</code> package.
<p class=skip><a href=#skipcompareimportdbm>skip over this table</a>
<p class=s><a href=#skipcompareimportdbm>skip over this table</a>
<table id=compareimportdbm>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -356,7 +362,7 @@ import whichdb</code></pre></td>
<p id=skipcompareimportdbm>
<h3 id=xmlrpc><code>xmlrpc</code></h3>
<p><abbr>XML-RPC</abbr> is a lightweight method of performing remote procedure calls over <abbr>HTTP</abbr>. The <abbr>XML-RPC</abbr> client library and several <abbr>XML-RPC</abbr> server implementations are now combined in a single package, <code>xmlrpc</code>.
<p class=skip><a href=#skipcompareimportxmlrpc>skip over this table</a>
<p class=s><a href=#skipcompareimportxmlrpc>skip over this table</a>
<table id=compareimportxmlrpc>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -372,7 +378,7 @@ import SimpleXMLRPCServer</code></pre></td>
</table>
<p id=skipcompareimportxmlrpc>
<h3 id=othermodules>Other modules</h3>
<p class=skip><a href=#skipcompareimports>skip over this table</a>
<p class=s><a href=#skipcompareimports>skip over this table</a>
<table id=compareimports>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -426,7 +432,7 @@ except ImportError:
<h2 id=import>Relative imports within a package</h2>
<p>A package is a group of related modules that function as a single entity. In Python 2, when modules within a package need to reference each other, you use <code>import foo</code> or <code>from foo import Bar</code>. The Python 2 interpreter first searches within the current package to find <code>foo.py</code>, and then moves on to the other directories in the Python search path (<code>sys.path</code>). Python 3 works a bit differently. Instead of searching the current package, it goes directly to the Python search path. If you want one module within a package to import another module in the same package, you need to explicitly provide the relative path between the two modules.
<p>Suppose you had this package, with multiple files in the same directory:
<p class=skip><a href=#skippackageart>skip over this <abbr>ASCII</abbr> art</a>
<p class=s><a href=#skippackageart>skip over this <abbr>ASCII</abbr> art</a>
<pre>chardet/
|
+--__init__.py
@@ -437,7 +443,7 @@ except ImportError:
|
+--universaldetector.py</pre>
<p id=skippackageart>Now suppose that <code>universaldetector.py</code> needs to import the entire <code>constants.py</code> file and one class from <code>mbcharsetprober.py</code>. How do you do it?
<p class=skip><a href=#skipcompareimport>skip over this table</a>
<p class=s><a href=#skipcompareimport>skip over this table</a>
<table id=compareimport>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -456,7 +462,7 @@ except ImportError:
</ol>
<h2 id=next><code>next()</code> iterator method</h2>
<p>In Python 2, iterators had a <code>next()</code> method which returned the next item in the sequence. In Python 3, that method is named <code>__next__()</code>, and there is now also a global <code>next()</code> function that takes an iterator as an argument.
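<p>A minimal sketch of the global <code>next()</code> function, using a made-up two-item iterator:

```python
# Hypothetical iterator over two items.
an_iterator = iter(["a", "b"])

# Python 2 spelling: an_iterator.next()
first = next(an_iterator)
second = next(an_iterator)

# An optional second argument is returned instead of raising
# StopIteration once the iterator is exhausted.
exhausted = next(an_iterator, None)
```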
<p class=skip><a href=#skipcomparenext>skip over this table</a>
<p class=s><a href=#skipcomparenext>skip over this table</a>
<table id=comparenext>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -497,7 +503,7 @@ for an_iterator in a_sequence_of_iterators:
</ol>
<h2 id=filter><code>filter()</code> global function</h2>
<p>In Python 2, the <code>filter()</code> function returned a list, the result of filtering a sequence through a function that returned <code>True</code> or <code>False</code> for each item in the sequence. In Python 3, the <code>filter()</code> function returns an iterator, not a list.
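<p>To make the consequence concrete, here is a small sketch (the <code>is_even</code> helper is invented for this example). The result is not a list, and it can be consumed only once.

```python
def is_even(n):
    """Hypothetical predicate used only for this illustration."""
    return n % 2 == 0

filtered = filter(is_even, range(10))

# The result is an iterator, not a list.
is_list = isinstance(filtered, list)

# Wrap it in list() if you need a real list...
evens = list(filtered)

# ...but note the iterator is now exhausted.
leftover = list(filtered)
```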
<p class=skip><a href=#skipcomparefilter>skip over this table</a>
<p class=s><a href=#skipcomparefilter>skip over this table</a>
<table id=comparefilter>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -528,7 +534,7 @@ for an_iterator in a_sequence_of_iterators:
</ol>
<h2 id=map><code>map()</code> global function</h2>
<p>In much the same way as <a href=#filter><code>filter()</code></a>, the <code>map()</code> function now returns an iterator. (In Python 2, it returned a list.)
<p class=skip><a href=#skipcomparemap>skip over this table</a>
<p class=s><a href=#skipcomparemap>skip over this table</a>
<table id=comparemap>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -559,7 +565,7 @@ for an_iterator in a_sequence_of_iterators:
</ol>
<h2 id=reduce><code>reduce()</code> global function (3.1+)</h2>
<p>In Python 3, the <code>reduce()</code> function has been removed from the global namespace and placed in the <code>functools</code> module.
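<p>A brief sketch of the relocated function; the input list is arbitrary sample data.

```python
import operator
from functools import reduce

numbers = [1, 2, 3, 4]  # arbitrary sample data

# reduce() folds the sequence from the left: ((((0 + 1) + 2) + 3) + 4)
total = reduce(operator.add, numbers, 0)
product = reduce(operator.mul, numbers, 1)
```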
<p class=skip><a href=#skipcomparereduce>skip over this table</a>
<p class=s><a href=#skipcomparereduce>skip over this table</a>
<table id=comparereduce>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -575,7 +581,7 @@ reduce(a, b, c)</code></pre></td></tr>
</blockquote>
<h2 id=apply><code>apply()</code> global function</h2>
<p>Python 2 had a global function called <code>apply()</code>, which took a function <var>f</var> and a list <code>[a, b, c]</code> and returned <code>f(a, b, c)</code>. In Python 3, the <code>apply()</code> function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function's arguments.
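<p>The calling syntax can be sketched like this. The <code>describe()</code> function and its arguments are hypothetical; the asterisks are the part that replaces <code>apply()</code>.

```python
def describe(a, b, c=0):
    """Hypothetical function used only to show argument unpacking."""
    return "a={} b={} c={}".format(a, b, c)

positional = [1, 2]
keywords = {"c": 3}

# Python 2 spelling: apply(describe, positional, keywords)
# One asterisk unpacks a sequence into positional arguments;
# two asterisks unpack a dictionary into keyword arguments.
result = describe(*positional, **keywords)
```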
<p class=skip><a href=#skipcompareapply>skip over this table</a>
<p class=s><a href=#skipcompareapply>skip over this table</a>
<table id=compareapply>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -602,7 +608,7 @@ reduce(a, b, c)</code></pre></td></tr>
</ol>
<h2 id=intern><code>intern()</code> global function</h2>
<p>In Python 2, you could call the <code>intern()</code> function on a string to intern it as a performance optimization. In Python 3, the <code>intern()</code> function has been moved to the <code>sys</code> module.
<p class=skip><a href=#skipcompareintern>skip over this table</a>
<p class=s><a href=#skipcompareintern>skip over this table</a>
<table id=compareintern>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -615,7 +621,7 @@ reduce(a, b, c)</code></pre></td></tr>
<p id=skipcompareintern>
<h2 id=exec><code>exec</code> statement</h2>
<p>Just as <a href=#print>the <code>print</code> statement</a> became a function in Python 3, so too has the <code>exec</code> statement. The <code>exec()</code> function takes a string which contains arbitrary Python code and executes it as if it were just another statement or expression.
<p class=skip><a href=#skipcompareexec>skip over this table</a>
<p class=s><a href=#skipcompareexec>skip over this table</a>
<table id=compareexec>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -638,7 +644,7 @@ reduce(a, b, c)</code></pre></td></tr>
</ol>
<h2 id=execfile><code>execfile()</code> global function (3.1+)</h2>
<p>Like the old <a href=#exec><code>exec</code> statement</a>, the old <code>execfile()</code> function executed Python code. Where <code>exec</code> took a string, <code>execfile()</code> took a filename. In Python 3, the <code>execfile()</code> function has been eliminated. If you really need to take a file of Python code and execute it (but you're not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the global <code>compile()</code> function to force the Python interpreter to compile the code, and then calling the new <code>exec()</code> function.
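<p>The open, read, compile, execute sequence might look like the sketch below. The script name and its one-line contents are invented for this example, and the file is written to a temporary directory so the snippet is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_dir:
    # Hypothetical script, created here only so there is a file to run.
    script_path = os.path.join(temp_dir, "a_script.py")
    with open(script_path, "w", encoding="utf-8") as f:
        f.write("answer = 6 * 7\n")

    # Open the file and read its contents.
    with open(script_path, encoding="utf-8") as f:
        source = f.read()

    # Compile the source, then execute it in an explicit namespace.
    code_object = compile(source, script_path, "exec")
    namespace = {}
    exec(code_object, namespace)

answer = namespace["answer"]
```

<p>Passing the filename to <code>compile()</code> means tracebacks from the executed code point at the right file.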
<p class=skip><a href=#skipcompareexecfile>skip over this table</a>
<p class=s><a href=#skipcompareexecfile>skip over this table</a>
<table id=compareexecfile>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -653,7 +659,7 @@ reduce(a, b, c)</code></pre></td></tr>
</blockquote>
<h2 id=repr><code>repr</code> literals (backticks)</h2>
<p>In Python 2, there was a special syntax of wrapping any object in backticks (like <code>`x`</code>) to get a representation of the object. In Python 3, this capability still exists, but you can no longer use backticks to get it. Instead, use the global <code>repr()</code> function.
<p class=skip><a href=#skipcomparerepr>skip over this table</a>
<p class=s><a href=#skipcomparerepr>skip over this table</a>
<table id=comparerepr>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -672,7 +678,7 @@ reduce(a, b, c)</code></pre></td></tr>
</ol>
<h2 id=except><code>try...except</code> statement</h2>
<p>The syntax for catching exceptions has changed slightly between Python 2 and Python 3.
<p class=skip><a href=#skipcompareexcept>skip over this table</a>
<p class=s><a href=#skipcompareexcept>skip over this table</a>
<table id=compareexcept>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -720,7 +726,7 @@ except:
</blockquote>
<h2 id=raise><code>raise</code> statement</h2>
<p>The syntax for raising your own exceptions has changed slightly between Python 2 and Python 3.
<p class=skip><a href=#skipcompareraise>skip over this table</a>
<p class=s><a href=#skipcompareraise>skip over this table</a>
<table id=compareraise>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -747,7 +753,7 @@ except:
</ol>
<h2 id=throw><code>throw</code> method on generators</h2>
<p>In Python 2, generators have a <code>throw()</code> method. Calling <code>a_generator.throw()</code> raises an exception at the point where the generator was paused, then returns the next value yielded by the generator function. In Python 3, this functionality is still available, but the syntax is slightly different.
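<p>A small sketch of the behavior described above, written in the Python 3 form where you pass an exception instance. The <code>counter()</code> generator is hypothetical.

```python
def counter():
    """Hypothetical generator that resets to 100 when a ValueError is thrown in."""
    n = 0
    while True:
        try:
            yield n
            n += 1
        except ValueError:
            n = 100

a_generator = counter()
first = next(a_generator)

# The exception is raised at the paused yield; the generator handles it,
# loops around, and throw() returns the next value it yields.
resumed = a_generator.throw(ValueError("reset"))
after = next(a_generator)
```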
<p class=skip><a href=#skipcomparethrow>skip over this table</a>
<p class=s><a href=#skipcomparethrow>skip over this table</a>
<table id=comparethrow>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -770,7 +776,7 @@ except:
</ol>
<h2 id=xrange><code>xrange()</code> global function</h2>
<p>In Python 2, there were two ways to get a range of numbers: <code>range()</code>, which returned a list, and <code>xrange()</code>, which returned an iterator. In Python 3, <code>range()</code> returns an iterator, and <code>xrange()</code> doesn't exist.
<p class=skip><a href=#skipcomparexrange>skip over this table</a>
<p class=s><a href=#skipcomparexrange>skip over this table</a>
<table id=comparexrange>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -801,7 +807,7 @@ except:
</ol>
<h2 id=raw_input><code>raw_input()</code> and <code>input()</code> global functions</h2>
<p>Python 2 had two global functions for asking the user for input on the command line. The first, called <code>input()</code>, expected the user to enter a Python expression (and returned the result). The second, called <code>raw_input()</code>, just returned whatever the user typed. This was wildly confusing for beginners and widely regarded as a &#8220;wart&#8221; in the language. Python 3 excises this wart by renaming <code>raw_input()</code> to <code>input()</code>, so it works the way everyone naively expects it to work.
<p class=skip><a href=#skipcompareraw_input>skip over this table</a>
<p class=s><a href=#skipcompareraw_input>skip over this table</a>
<table id=compareraw_input>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -824,7 +830,7 @@ except:
</ol>
<h2 id=funcattrs><code>func_*</code> function attributes</h2>
<p>In Python 2, code within functions can access special attributes about the function itself. In Python 3, these special function attributes have been renamed for consistency with other attributes.
<p class=skip><a href=#skipcomparefuncattrs>skip over this table</a>
<p class=s><a href=#skipcomparefuncattrs>skip over this table</a>
<table id=comparefuncattrs>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -863,7 +869,7 @@ except:
</ol>
<h2 id=xreadlines><code>xreadlines()</code> I/O method</h2>
<p>In Python 2, file objects had an <code>xreadlines()</code> method which returned an iterator that would read the file one line at a time. This was useful in <code>for</code> loops, among other places. In fact, it was so useful, later versions of Python 2 added the capability to file objects themselves.
<p class=skip><a href=#skipcomparexreadlines>skip over this table</a>
<p class=s><a href=#skipcomparexreadlines>skip over this table</a>
<table id=comparexreadlines>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -883,7 +889,7 @@ except:
<p class=c><span style="font-size:56px;line-height:0.88">&#x2603;</span>
<h2 id=tuple_params><code>lambda</code> functions with multiple parameters</h2>
<p>In Python 2, you could define anonymous <code>lambda</code> functions which took multiple parameters by defining the function as taking a tuple with a specific number of items. In effect, Python 2 would &#8220;unpack&#8221; the tuple into named arguments, which you could then reference (by name) within the <code>lambda</code> function. In Python 3, you can still pass a tuple to a <code>lambda</code> function, but the Python interpreter will not unpack the tuple into named arguments. Instead, you will need to reference each argument by its positional index.
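<p>For instance, a two-item tuple parameter would be rewritten along these lines (the <code>add_pair</code> name is made up for illustration):

```python
# Python 2 spelling: lambda (x, y): x + y
# In Python 3 the tuple arrives as a single argument, so index into it.
add_pair = lambda pair: pair[0] + pair[1]

result = add_pair((2, 3))
```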
<p class=skip><a href=#skipcomparetuple_params>skip over this table</a>
<p class=s><a href=#skipcomparetuple_params>skip over this table</a>
<table id=comparetuple_params>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -906,7 +912,7 @@ except:
</ol>
<h2 id=methodattrs>Special method attributes</h2>
<p>In Python 2, class methods can reference the class object they are defined in, as well as the method object itself. <code>im_self</code> is the class instance object; <code>im_func</code> is the function object; <code>im_class</code> is the class of <code>im_self</code> (for bound methods) or the class that asked for the method (for unbound methods). In Python 3, these special method attributes have been renamed to follow the naming conventions of other attributes.
<p class=skip><a href=#skipcomparemethodattrs>skip over this table</a>
<p class=s><a href=#skipcomparemethodattrs>skip over this table</a>
<table id=comparemethodattrs>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -925,7 +931,7 @@ except:
<p id=skipcomparemethodattrs>
<h2 id=nonzero><code>__nonzero__</code> special class attribute</h2>
<p>In Python 2, you could build your own classes that could be used in a boolean context. For example, you could instantiate the class and then use the instance in an <code>if</code> statement. To do this, you defined a special <code>__nonzero__()</code> method which returned <code>True</code> or <code>False</code>, and it was called whenever the instance was used in a boolean context. In Python 3, you can still do this, but the name of the method has changed to <code>__bool__()</code>.
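<p>A minimal sketch of a class that defines <code>__bool__()</code>. The <code>Basket</code> class is hypothetical; it is true whenever it holds at least one item.

```python
class Basket:
    """Hypothetical container that is true when it holds any items."""

    def __init__(self, items):
        self.items = list(items)

    # Python 2 spelling: def __nonzero__(self)
    def __bool__(self):
        return len(self.items) > 0

empty_is_true = bool(Basket([]))
full_is_true = bool(Basket(["apple"]))
```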
<p class=skip><a href=#skipcomparenonzero>skip over this table</a>
<p class=s><a href=#skipcomparenonzero>skip over this table</a>
<table id=comparenonzero>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -950,7 +956,7 @@ except:
</ol>
<h2 id=numliterals>Octal literals</h2>
<p>The syntax for defining base 8 (octal) numbers has changed slightly between Python 2 and Python 3.
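<p>In short, Python 3 requires a <code>0o</code> prefix where Python 2 accepted a bare leading zero. A quick sketch, using a typical file-permission value as sample data:

```python
# Python 2 spelling: 0755
permissions = 0o755

# Converting to and from octal text.
as_text = oct(permissions)
parsed = int("755", 8)
```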
<p class=skip><a href=#skipcomparenumliterals>skip over this table</a>
<p class=s><a href=#skipcomparenumliterals>skip over this table</a>
<table id=comparenumliterals>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -963,7 +969,7 @@ except:
<p id=skipcomparenumliterals>
<h2 id=renames><code>sys.maxint</code></h2>
<p>Due to the <a href=#long>integration of the <code>long</code> and <code>int</code> types</a>, the <code>sys.maxint</code> constant is no longer accurate. Because the value may still be useful in determining platform-specific capabilities, it has been retained but renamed as <code>sys.maxsize</code>.
<p class=skip><a href=#skipcomparerenames>skip over this table</a>
<p class=s><a href=#skipcomparerenames>skip over this table</a>
<table id=comparerenames>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -982,7 +988,7 @@ except:
</ol>
<h2 id=callable><code>callable()</code> global function</h2>
<p>In Python 2, you could check whether an object was callable (like a function) with the global <code>callable()</code> function. In Python 3, this global function has been eliminated. To check whether an object is callable, check for the existence of the <code>__call__()</code> special method.
<p class=skip><a href=#skipcomparecallable>skip over this table</a>
<p class=s><a href=#skipcomparecallable>skip over this table</a>
<table id=comparecallable>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -995,7 +1001,7 @@ except:
<p id=skipcomparecallable>
<h2 id=zip><code>zip()</code> global function</h2>
<p>In Python 2, the global <code>zip()</code> function took any number of sequences and returned a list of tuples. The first tuple contained the first item from each sequence; the second tuple contained the second item from each sequence; and so on. In Python 3, <code>zip()</code> returns an iterator instead of a list.
<p class=skip><a href=#skipcomparezip>skip over this table</a>
<p class=s><a href=#skipcomparezip>skip over this table</a>
<table id=comparezip>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1014,7 +1020,7 @@ except:
</ol>
<h2 id=standarderror><code>StandardError</code> exception</h2>
<p>In Python 2, <code>StandardError</code> was the base class for all built-in exceptions other than <code>StopIteration</code>, <code>GeneratorExit</code>, <code>KeyboardInterrupt</code>, and <code>SystemExit</code>. In Python 3, <code>StandardError</code> has been eliminated; use <code>Exception</code> instead.
<p class=skip><a href=#skipcomparestandarderror>skip over this table</a>
<p class=s><a href=#skipcomparestandarderror>skip over this table</a>
<table id=comparestandarderror>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1030,7 +1036,7 @@ except:
<p id=skipcomparestandarderror>
<h2 id=types><code>types</code> module constants</h2>
<p>The <code>types</code> module contains a variety of constants to help you determine the type of an object. In Python 2, it contained constants for all primitive types like <code>dict</code> and <code>int</code>. In Python 3, these constants have been eliminated; just use the primitive type name instead.
<p class=skip><a href=#skipcomparetypes>skip over this table</a>
<p class=s><a href=#skipcomparetypes>skip over this table</a>
<table id=comparetypes>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1058,7 +1064,7 @@ except:
<p id=skipcomparetypes>
<h2 id=isinstance><code>isinstance()</code> global function (3.1+)</h2>
<p>The <code>isinstance()</code> function checks whether an object is an instance of a particular class or type. In Python 2, you could pass a tuple of types, and <code>isinstance()</code> would return <code>True</code> if the object was any of those types. In Python 3, you can still do this, but passing the same type twice is deprecated.
<p class=skip><a href=#skipcompareisinstance>skip over this table</a>
<p class=s><a href=#skipcompareisinstance>skip over this table</a>
<table id=compareisinstance>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1073,7 +1079,7 @@ except:
</blockquote>
<h2 id=basestring><code>basestring</code> datatype</h2>
<p>Python 2 had two string types: Unicode and non-Unicode. But there was also another type, <code>basestring</code>. It was an abstract type, a superclass for both the <code>str</code> and <code>unicode</code> types. It couldn't be called or instantiated directly, but you could pass it to the global <code>isinstance()</code> function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, so <code>basestring</code> has no reason to exist.
<p class=skip><a href=#skipcomparebasestring>skip over this table</a>
<p class=s><a href=#skipcomparebasestring>skip over this table</a>
<table id=comparebasestring>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1112,7 +1118,7 @@ except:
</ol>
<h2 id=sys_exc><code>sys.exc_type</code>, <code>sys.exc_value</code>, <code>sys.exc_traceback</code></h2>
<p>Python 2 had three variables in the <code>sys</code> module that you could access while an exception was being handled: <code>sys.exc_type</code>, <code>sys.exc_value</code>, <code>sys.exc_traceback</code>. (Actually, these date all the way back to Python 1.) Ever since Python 1.5, these variables have been deprecated in favor of <code>sys.exc_info()</code>, a function that returns a tuple containing all three values. In Python 3, these individual variables have finally gone away; you must use <code>sys.exc_info()</code>.
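<p>A short sketch of retrieving all three values at once, using a deliberately triggered <code>ZeroDivisionError</code> as the sample exception:

```python
import sys

try:
    1 / 0  # deliberately raise an exception for this illustration
except ZeroDivisionError:
    # One call returns the type, the value, and the traceback together.
    exc_type, exc_value, exc_traceback = sys.exc_info()

type_name = exc_type.__name__
```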
<p class=skip><a href=#skipcomparesys_exc>skip over this table</a>
<p class=s><a href=#skipcomparesys_exc>skip over this table</a>
<table id=comparesys_exc>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1131,7 +1137,7 @@ except:
<p id=skipcomparesys_exc>
<h2 id=paren>List comprehensions over tuples</h2>
<p>In Python 2, if you wanted to code a list comprehension that iterated over a tuple, you did not need to put parentheses around the tuple values. In Python 3, explicit parentheses are required.
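<p>For example, with arbitrary sample values:

```python
# Python 2 also accepted: [i * 2 for i in 1, 2, 3]
# Python 3 requires the parentheses around the tuple.
doubled = [i * 2 for i in (1, 2, 3)]
```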
<p class=skip><a href=#skipcompareparen>skip over this table</a>
<p class=s><a href=#skipcompareparen>skip over this table</a>
<table id=compareparen>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1144,7 +1150,7 @@ except:
<p id=skipcompareparen>
<h2 id=getcwdu><code>os.getcwdu()</code> function</h2>
<p>Python 2 had a function named <code>os.getcwd()</code>, which returned the current working directory as a (non-Unicode) string. Because modern file systems can handle directory names in any character encoding, Python 2.3 introduced <code>os.getcwdu()</code>. The <code>os.getcwdu()</code> function returned the current working directory as a Unicode string. In Python 3, there is only one string type (Unicode), so <code>os.getcwd()</code> is all you need.
<p class=skip><a href=#skipcomparegetcwdu>skip over this table</a>
<p class=s><a href=#skipcomparegetcwdu>skip over this table</a>
<table id=comparegetcwdu>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1157,7 +1163,7 @@ except:
<p id=skipcomparegetcwdu>
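<p>A quick check, assuming Python 3, that <code>os.getcwd()</code> already returns a Unicode string and that <code>os.getcwdu()</code> is gone. The variable names are only illustrative.

```python
import os

# In Python 3, getcwd() returns str, which is always Unicode.
cwd = os.getcwd()

# getcwdu() no longer exists; getcwdb() is available if you need bytes.
has_getcwdu = hasattr(os, "getcwdu")
cwd_bytes = os.getcwdb()

print(type(cwd).__name__, has_getcwdu, type(cwd_bytes).__name__)
```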
<h2 id=metaclass>Metaclasses</h2>
<p>In Python 2, you created a metaclass by defining a special class-level <code>__metaclass__</code> attribute. In Python 3, the class-level attribute has been eliminated; you pass a <code>metaclass</code> argument in the class declaration instead.
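<p>A minimal Python 3 sketch of the <code>metaclass</code> argument. The class names <code>Registry</code>, <code>Plugin</code>, and <code>Legacy</code> are hypothetical and exist only for this example.

```python
class Registry(type):
    """Hypothetical metaclass that records the name of every class it creates."""

    created = []

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.created.append(name)
        return cls


class Plugin(metaclass=Registry):
    """Python 3 spelling: the metaclass is named in the class declaration."""


class Legacy:
    # In Python 3 this is just an ordinary class attribute; it has no effect
    # on how the class is built.
    __metaclass__ = Registry


print(type(Plugin).__name__, type(Legacy).__name__, Registry.created)
```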
<p class=skip><a href=#skipcomparemetaclass>skip over this table</a>
<p class=s><a href=#skipcomparemetaclass>skip over this table</a>
<table id=comparemetaclass>
<tr><th>Notes</th>
<th>Python 2</th>
@@ -1190,7 +1196,7 @@ except:
<blockquote class=note>
<p><span>&#x261E;</span>The <code>2to3</code> script will not fix <code>set()</code> literals by default. To enable this fix, specify <kbd>-f set_literal</kbd> on the command line when you call <code>2to3</code>.
</blockquote>
<p class=skip><a href=#skipcompareset_literal>skip over this table</a>
<p class=s><a href=#skipcompareset_literal>skip over this table</a>
<table id=compareset_literal>
<tr><th>Notes</th>
<th>Before</th>
@@ -1212,7 +1218,7 @@ except:
<blockquote class=note>
<p><span>&#x261E;</span>The <code>2to3</code> script will not fix the <code>buffer()</code> function by default. To enable this fix, specify <kbd>-f buffer</kbd> on the command line when you call <code>2to3</code>.
</blockquote>
<p class=skip><a href=#skipcomparebuffer>skip over this table</a>
<p class=s><a href=#skipcomparebuffer>skip over this table</a>
<table id=comparebuffer>
<tr><th>Notes</th>
<th>Before</th>
@@ -1228,7 +1234,7 @@ except:
<blockquote class=note>
<p><span>&#x261E;</span>The <code>2to3</code> script will not fix whitespace around commas by default. To enable this fix, specify <kbd>-f wscomma</kbd> on the command line when you call <code>2to3</code>.
</blockquote>
<p class=skip><a href=#skipcomparewscomma>skip over this table</a>
<p class=s><a href=#skipcomparewscomma>skip over this table</a>
<table id=comparewscomma>
<tr><th>Notes</th>
<th>Before</th>
@@ -1247,7 +1253,7 @@ except:
<blockquote class=note>
<p><span>&#x261E;</span>The <code>2to3</code> script will not fix common idioms by default. To enable this fix, specify <kbd>-f idioms</kbd> on the command line when you call <code>2to3</code>.
</blockquote>
<p class=skip><a href=#skipcompareidioms>skip over this table</a>
<p class=s><a href=#skipcompareidioms>skip over this table</a>
<table id=compareidioms>
<tr><th>Notes</th>
<th>Before</th>
@@ -1273,6 +1279,6 @@ do_stuff(a_list)</code></pre></td></tr>
</table>
<p id=skipcompareidioms>
<p>FIXME: once the rest of the book is written, this appendix should contain copious links back to any chapter or section that touches on these features.
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
+19 -11
@@ -4,11 +4,13 @@
rm -rf build
mkdir build
cp *.py robots.txt *.js *.css build/
# minimize HTML (note: this script is quite fragile and relies on knowledge of how I write HTML)
for f in *.html; do
python htmlminimizer.py "$f" build/"$f"
done
# replace local jquery reference with Google API loader
# jQuery will be served by Google AJAX Libraries API
sed -i -e "s|jquery\.js|http://www.google.com/jsapi|g" build/*.html
sed -i -e "s|//google\.|google.|g" build/dip3.js
sed -i -e "s|//}.; /\* google\..*|});|g" build/dip3.js
@@ -18,16 +20,22 @@ revision=`hg log|grep changeset|cut -d":" -f3|head -1`
java -jar yuicompressor-2.4.2.jar build/dip3.js > build/$revision.js
java -jar yuicompressor-2.4.2.jar build/dip3.css > build/$revision.css
sed -i -e "s|;}|}|g" build/$revision.css
css=`cat build/$revision.css`
sed -i -e "s|dip3\.js|http://wearehugh.com/dip3/${revision}.js|g" build/*.html
#sed -i -e "s|dip3\.css|http://wearehugh.com/dip3/${revision}.css|g" build/*.html
sed -i -e "s|<link rel=stylesheet type=text/css href=dip3.css>|<style>${css}</style>|g" -e "s|</style><style>||g" build/*.html
sed -i -e "s|html5\.js|http://wearehugh.com/dip3/html5.js|g" build/*.html
sed -i -e "s|=http:|=|g" build/*.html
# set file permissions for public consumption
# put CSS inline
css=`cat build/$revision.css`
sed -i -e "s|<link rel=stylesheet type=text/css href=dip3.css>|<style>${css}</style>|g" -e "s|</style><style>||g" build/*.html
# JS will be served from a separate domain
sed -i -e "s|dip3\.js|http://wearehugh.com/dip3/${revision}.js|g" build/*.html
sed -i -e "s|html5\.js|http://wearehugh.com/dip3/html5.js|g" build/*.html
# minimize URLs
sed -i -e "s|=http:|=|g" build/*.html
sed -i -e "s|href=index.html|href=/|g" build/*.html
# set file permissions (hg resets these, don't know why)
chmod 644 build/*.html build/*.css build/*.js build/*.py build/*.txt
# and push to production
rsync -essh -avzP build/$revision.js build/html5.js diveintomark.org:~/web/wearehugh.com/dip3/
rsync -essh -avzP build/*.html build/*.py build/*.txt diveintomark.org:~/web/diveintopython3.org/
# ship it!
#rsync -essh -avzP build/$revision.js build/html5.js diveintomark.org:~/web/wearehugh.com/dip3/
#rsync -essh -avzP build/*.html build/*.py build/*.txt diveintomark.org:~/web/diveintopython3.org/
+101 -103
@@ -1,19 +1,17 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Regular expressions - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
body{counter-reset:h1 4}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=root value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#regular-expressions>Dive Into Python 3</a> <span>&#8227;</span>
<p class=s><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=root value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#regular-expressions>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Regular expressions</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Some people, when confronted with a problem, think &#8220;I know, I&#8217;ll use regular expressions.&#8221; Now they have two problems. <span>&#x275E;</span><br>&mdash; <cite>Jamie Zawinski</cite>
@@ -35,7 +33,7 @@ body{counter-reset:h1 4}
<li><a href=#summary>Summary</a>
</ol>
<h2 id=divingin>Diving in</h2>
<p class=fancy>Every modern programming language has built-in functions for working with strings. In Python, strings have methods for searching and replacing: <code>index()</code>, <code>find()</code>, <code>split()</code>, <code>count()</code>, <code>replace()</code>, <i class=baa>&amp;</i>c. But these methods are limited to the simplest of cases. For example, the <code>index()</code> method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string <var>s</var>, you must call <code>s.lower()</code> or <code>s.upper()</code> and make sure your search strings are the appropriate case to match. The <code>replace()</code> and <code>split()</code> methods have the same limitations.
<p class=f>Every modern programming language has built-in functions for working with strings. In Python, strings have methods for searching and replacing: <code>index()</code>, <code>find()</code>, <code>split()</code>, <code>count()</code>, <code>replace()</code>, <i class=baa>&amp;</i>c. But these methods are limited to the simplest of cases. For example, the <code>index()</code> method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string <var>s</var>, you must call <code>s.lower()</code> or <code>s.upper()</code> and make sure your search strings are the appropriate case to match. The <code>replace()</code> and <code>split()</code> methods have the same limitations.
<p>If your goal can be accomplished with string methods, you should use them. They&#8217;re fast and simple and easy to read, and there&#8217;s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with <code>if</code> statements to handle special cases, or if you&#8217;re chaining calls to <code>split()</code> and <code>join()</code> to slice-and-dice your strings, you may need to move up to regular expressions.
<p>Regular expressions are a powerful and (mostly) standardized way of searching, replacing, and parsing text with complex patterns of characters. Although the regular expression syntax is tight and unlike normal code, the result can end up being <em>more</em> readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions, so you can include fine-grained documentation within them.
<blockquote class="note compare perl5">
@@ -45,16 +43,16 @@ body{counter-reset:h1 4}
<p>This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don&#8217;t just make this stuff up; it&#8217;s actually useful.) This example shows how I approached the problem.
<p id=noscript>[The code examples will be easier to follow if you enable Javascript, but whatever.]
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>s = '100 NORTH MAIN ROAD'</kbd>
<a><samp class=prompt>>>> </samp><kbd>s.replace('ROAD', 'RD.')</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>s = '100 NORTH MAIN ROAD'</kbd>
<a><samp class=p>>>> </samp><kbd>s.replace('ROAD', 'RD.')</kbd> <span>&#x2460;</span></a>
<samp>'100 NORTH MAIN RD.'</samp>
<samp class=prompt>>>> </samp><kbd>s = '100 NORTH BROAD ROAD'</kbd>
<a><samp class=prompt>>>> </samp><kbd>s.replace('ROAD', 'RD.')</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>s = '100 NORTH BROAD ROAD'</kbd>
<a><samp class=p>>>> </samp><kbd>s.replace('ROAD', 'RD.')</kbd> <span>&#x2461;</span></a>
<samp>'100 NORTH BRD. RD.'</samp>
<a><samp class=prompt>>>> </samp><kbd>s[:-4] + s[-4:].replace('ROAD', 'RD.')</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>s[:-4] + s[-4:].replace('ROAD', 'RD.')</kbd> <span>&#x2462;</span></a>
<samp>'100 NORTH BROAD RD.'</samp>
<a><samp class=prompt>>>> </samp><kbd>import re</kbd> <span>&#x2463;</span></a>
<a><samp class=prompt>>>> </samp><kbd>re.sub('ROAD$', 'RD.', s)</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>import re</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>re.sub('ROAD$', 'RD.', s)</kbd> <span>&#x2464;</span></a>
<samp>'100 NORTH BROAD RD.'</samp></pre>
<ol>
<li>My goal is to standardize a street address so that <code>'ROAD'</code> is always abbreviated as <code>'RD.'</code>. At first glance, I thought this was simple enough that I could just use the string method <code>replace()</code>. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, <code>'ROAD'</code>, was a constant. And in this deceptively simple example, <code>s.replace()</code> does indeed work.
@@ -65,17 +63,17 @@ body{counter-reset:h1 4}
</ol>
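<p>The same comparison can be run as a script. This is only a sketch: the function name <code>abbreviate_road</code> is an assumption for illustration, and it covers just the end-of-string case discussed so far.

```python
import re


def abbreviate_road(address):
    """Replace 'ROAD' with 'RD.' only when it ends the address
    (hypothetical helper for illustration)."""
    return re.sub('ROAD$', 'RD.', address)


address = '100 NORTH BROAD ROAD'

# str.replace() rewrites every occurrence, including the one inside 'BROAD'.
naive = address.replace('ROAD', 'RD.')

# The $ anchor restricts the match to the end of the string.
anchored = abbreviate_road(address)

print(naive)
print(anchored)
```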
<p>Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching <code>'ROAD'</code> at the end of the address, was not good enough, because not all addresses included a street designation at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the street name was <code>'BROAD'</code>, then the regular expression would match <code>'ROAD'</code> at the end of the string as part of the word <code>'BROAD'</code>, which is not what I wanted.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>s = '100 BROAD'</kbd>
<samp class=prompt>>>> </samp><kbd>re.sub('ROAD$', 'RD.', s)</kbd>
<samp class=p>>>> </samp><kbd>s = '100 BROAD'</kbd>
<samp class=p>>>> </samp><kbd>re.sub('ROAD$', 'RD.', s)</kbd>
<samp>'100 BRD.'</samp>
<a><samp class=prompt>>>> </samp><kbd>re.sub('\\bROAD$', 'RD.', s)</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>re.sub('\\bROAD$', 'RD.', s)</kbd> <span>&#x2460;</span></a>
<samp>'100 BROAD'</samp>
<a><samp class=prompt>>>> </samp><kbd>re.sub(r'\bROAD$', 'RD.', s)</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>re.sub(r'\bROAD$', 'RD.', s)</kbd> <span>&#x2461;</span></a>
<samp>'100 BROAD'</samp>
<samp class=prompt>>>> </samp><kbd>s = '100 BROAD ROAD APT. 3'</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.sub(r'\bROAD$', 'RD.', s)</kbd> <span>&#x2462;</span></a>
<samp class=p>>>> </samp><kbd>s = '100 BROAD ROAD APT. 3'</kbd>
<a><samp class=p>>>> </samp><kbd>re.sub(r'\bROAD$', 'RD.', s)</kbd> <span>&#x2462;</span></a>
<samp>'100 BROAD ROAD APT. 3'</samp>
<a><samp class=prompt>>>> </samp><kbd>re.sub(r'\bROAD\b', 'RD.', s)</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>re.sub(r'\bROAD\b', 'RD.', s)</kbd> <span>&#x2463;</span></a>
<samp>'100 BROAD RD. APT. 3'</samp></pre>
<ol>
<li>What I <em>really</em> wanted was to match <code>'ROAD'</code> when it was at the end of the string <em>and</em> it was its own word (and not a part of some larger word). To express this in a regular expression, you use <code>\b</code>, which means &#8220;a word boundary must occur right here.&#8221; In Python, this is complicated by the fact that the <code>'\'</code> character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it&#8217;s a bug in syntax or a bug in your regular expression.
@@ -106,16 +104,16 @@ body{counter-reset:h1 4}
<h3 id=thousands>Checking for thousands</h3>
<p>What would it take to validate that an arbitrary string is a valid Roman numeral? Let&#8217;s take it one digit at a time. Since Roman numerals are always written highest to lowest, let&#8217;s start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of <code>M</code> characters.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import re</kbd>
<a><samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>import re</kbd>
<a><samp class=p>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2461;</span></a>
<samp>&lt;SRE_Match object at 0106FB58></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MM')</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MM')</kbd> <span>&#x2462;</span></a>
<samp>&lt;SRE_Match object at 0106C290></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMM')</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMM')</kbd> <span>&#x2463;</span></a>
<samp>&lt;SRE_Match object at 0106AA38></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMMM')</kbd> <span>&#x2464;</span></a>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, '')</kbd> <span>&#x2465;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMMM')</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, '')</kbd> <span>&#x2465;</span></a>
<samp>&lt;SRE_Match object at 0106F4A8></samp></pre>
<ol>
<li>This pattern has three parts. <code>^</code> matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the <code>M</code> characters were, which is not what you want. You want to make sure that the <code>M</code> characters, if they&#8217;re there, are at the beginning of the string. <code>M?</code> optionally matches a single <code>M</code> character. Since this is repeated three times, you&#8217;re matching anywhere from zero to three <code>M</code> characters in a row. And <code>$</code> matches the end of the string. When combined with the <code>^</code> character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the <code>M</code> characters.
@@ -151,16 +149,16 @@ body{counter-reset:h1 4}
</ul>
<p>This example shows how to validate the hundreds place of a Roman numeral.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import re</kbd>
<a><samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCM')</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>import re</kbd>
<a><samp class=p>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCM')</kbd> <span>&#x2461;</span></a>
<samp>&lt;SRE_Match object at 01070390></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MD')</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MD')</kbd> <span>&#x2462;</span></a>
<samp>&lt;SRE_Match object at 01073A50></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMMCCC')</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMMCCC')</kbd> <span>&#x2463;</span></a>
<samp>&lt;SRE_Match object at 010748A8></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCMC')</kbd> <span>&#x2464;</span></a>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, '')</kbd> <span>&#x2465;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCMC')</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, '')</kbd> <span>&#x2465;</span></a>
<samp>&lt;SRE_Match object at 01071D98></samp></pre>
<ol>
<li>This pattern starts out the same as the previous one, checking for the beginning of the string (<code>^</code>), then the thousands place (<code>M?M?M?</code>). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: <code>CM</code>, <code>CD</code>, and <code>D?C?C?C?</code> (which is an optional <code>D</code> followed by zero to three optional <code>C</code> characters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first one that matches, and ignores the rest.
@@ -174,18 +172,18 @@ body{counter-reset:h1 4}
<h2 id=nmsyntax>Using the <code>{n,m}</code> Syntax</h2>
<p>In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import re</kbd>
<samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>import re</kbd>
<samp class=p>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2460;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EE090></samp>
<samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MM')</kbd> <span>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MM')</kbd> <span>&#x2461;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMM')</kbd> <span>&#x2462;</span></a>
<samp class=p>>>> </samp><kbd>pattern = '^M?M?M?$'</kbd>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMM')</kbd> <span>&#x2462;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EE090></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMMM')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMMM')</kbd> <span>&#x2463;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>This matches the start of the string, and then the first optional <code>M</code>, but not the second and third <code>M</code> (but that&#8217;s okay because they&#8217;re optional), and then the end of the string.
<li>This matches the start of the string, and then the first and second optional <code>M</code>, but not the third <code>M</code> (but that&#8217;s okay because it&#8217;s optional), and then the end of the string.
@@ -193,15 +191,15 @@ body{counter-reset:h1 4}
<li>This matches the start of the string, and then all three optional <code>M</code>, but then does not match the end of the string (because there is still one unmatched <code>M</code>), so the pattern does not match and returns <code>None</code>.
</ol>
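<p>The four cases above can be checked in one pass. This is a sketch only; the names <code>THOUSANDS</code> and <code>is_valid_thousands</code> are assumptions made up for this example.

```python
import re

# Up to three optional M characters, anchored at both ends of the string.
THOUSANDS = re.compile('^M?M?M?$')


def is_valid_thousands(s):
    """Return True if s is zero to three M characters and nothing else."""
    # search() returns a match object on success and None on failure.
    return THOUSANDS.search(s) is not None


results = {s: is_valid_thousands(s) for s in ('', 'M', 'MM', 'MMM', 'MMMM')}
print(results)
```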
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>pattern = '^M{0,3}$'</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>pattern = '^M{0,3}$'</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2461;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MM')</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MM')</kbd> <span>&#x2462;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EE090></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMM')</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMM')</kbd> <span>&#x2463;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEDA8></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMMM')</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMMM')</kbd> <span>&#x2464;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>This pattern says: &#8220;Match the start of the string, then anywhere from zero to three <code>M</code> characters, then the end of the string.&#8221; The 0 and 3 can be any numbers; if you want to match at least one but no more than three <code>M</code> characters, you could say <code>M{1,3}</code>.
<li>This matches the start of the string, then one <code>M</code> out of a possible three, then the end of the string.
@@ -212,17 +210,17 @@ body{counter-reset:h1 4}
<h3 id=tensandones>Checking for tens and ones</h3>
<p>Now let&#8217;s expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCMXL')</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'</kbd>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCMXL')</kbd> <span>&#x2460;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCML')</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCML')</kbd> <span>&#x2461;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCMLX')</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCMLX')</kbd> <span>&#x2462;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCMLXXX')</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCMLXXX')</kbd> <span>&#x2463;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCMLXXXX')</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCMLXXXX')</kbd> <span>&#x2464;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then <code>XL</code>, then the end of the string. Remember, the <code>(A|B|C)</code> syntax means &#8220;match exactly one of A, B, or C&#8221;. You match <code>XL</code>, so you ignore the <code>XC</code> and <code>L?X?X?X?</code> choices, and then move on to the end of the string. <code>MCMXL</code> is the Roman numeral representation of <code>1940</code>.
<li>This matches the start of the string, then the first optional <code>M</code>, then <code>CM</code>, then <code>L?X?X?X?</code>. Of the <code>L?X?X?X?</code>, it matches the <code>L</code> and skips all three optional <code>X</code> characters. Then you move to the end of the string. <code>MCML</code> is the Roman numeral representation of <code>1950</code>.
@@ -232,17 +230,17 @@ body{counter-reset:h1 4}
</ol>
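<p>Here is the tens check as a short script, using the pattern from the example above. The names <code>TENS_PATTERN</code> and <code>matches</code> are assumptions for illustration.

```python
import re

# Thousands, hundreds, and tens places; the ones place is not covered yet.
TENS_PATTERN = re.compile('^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$')


def matches(numeral):
    """Return True if the numeral is accepted by the pattern so far."""
    return TENS_PATTERN.search(numeral) is not None


for numeral in ('MCMXL', 'MCML', 'MCMLX', 'MCMLXXX', 'MCMLXXXX'):
    # Only the last one fails: four X characters in a row are not allowed.
    print(numeral, matches(numeral))
```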
<p>The expression for the ones place follows the same pattern. I&#8217;ll spare you the details and show you the end result.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'</kbd>
<samp class=p>>>> </samp><kbd>pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'</kbd>
</pre><p>So what does that look like using this alternate <code>{n,m}</code> syntax? This example shows the new syntax.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MDLV')</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'</kbd>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MDLV')</kbd> <span>&#x2460;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMDCLXVI')</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMDCLXVI')</kbd> <span>&#x2461;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMMDCCCLXXXVIII')</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMMDCCCLXXXVIII')</kbd> <span>&#x2462;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'I')</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'I')</kbd> <span>&#x2463;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp></pre>
<ol>
<li>This matches the start of the string, then one of a possible three <code>M</code> characters, then <code>D?C{0,3}</code>. Of that, it matches the optional <code>D</code> and zero of three possible <code>C</code> characters. Moving on, it matches <code>L?X{0,3}</code> by matching the optional <code>L</code> and zero of three possible <code>X</code> characters. Then it matches <code>V?I{0,3}</code> by matching the optional <code>V</code> and zero of three possible <code>I</code> characters, and finally the end of the string. <code>MDLV</code> is the Roman numeral representation of <code>1555</code>.
@@ -261,7 +259,7 @@ body{counter-reset:h1 4}
</ul>
<p>This will be more clear with an example. Let&#8217;s revisit the compact regular expression you&#8217;ve been working with, and make it a verbose regular expression. This example shows how.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>pattern = """
<samp class=p>>>> </samp><kbd>pattern = """
^ # beginning of string
M{0,3} # thousands - 0 to 3 M's
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
@@ -272,13 +270,13 @@ body{counter-reset:h1 4}
# or 5-8 (V, followed by 0 to 3 I's)
$ # end of string
"""</kbd>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'M', re.VERBOSE)</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'M', re.VERBOSE)</kbd> <span>&#x2460;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MCMLXXXIX', re.VERBOSE)</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MCMLXXXIX', re.VERBOSE)</kbd> <span>&#x2461;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)</kbd> <span>&#x2462;</span></a>
<samp>&lt;_sre.SRE_Match object at 0x008EEB48></samp>
<a><samp class=prompt>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2463;</span></a></pre>
<a><samp class=p>>>> </samp><kbd>re.search(pattern, 'M')</kbd> <span>&#x2463;</span></a></pre>
<ol>
<li>The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: <code>re.VERBOSE</code> is a constant defined in the <code>re</code> module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it&#8217;s a lot more readable.
<li>This matches the start of the string, then one of a possible three <code>M</code>, then <code>CM</code>, then <code>L</code> and three of a possible three <code>X</code>, then <code>IX</code>, then the end of the string.
@@ -303,24 +301,24 @@ body{counter-reset:h1 4}
<p>Quite a variety! In each of these cases, I need to know that the area code was <code>800</code>, the trunk was <code>555</code>, and the rest of the phone number was <code>1212</code>. For those with an extension, I need to know that the extension was <code>1234</code>.
<p>Let&#8217;s work through developing a solution for phone number parsing. This example shows the first step.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212-1234')</kbd> <span>&#x2462;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212-1234')</kbd> <span>&#x2462;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>Always read regular expressions from left to right. This one matches the beginning of the string, and then <code>(\d{3})</code>. What&#8217;s <code>\d{3}</code>? Well, the <code>{3}</code> means &#8220;match exactly three numeric digits&#8221;; it&#8217;s a variation on the <a href="#re.nm" title="7.4. Using the {n,m} Syntax"><code>{n,m} syntax</code></a> you saw earlier. <code>\d</code> means &#8220;any numeric digit&#8221; (<code>0</code> through <code>9</code>). Putting it in parentheses means &#8220;match exactly three numeric digits, <em>and then remember them as a group that I can ask for later</em>&#8221;. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string.
<li>To get access to the groups that the regular expression parser remembered along the way, use the <code>groups()</code> method on the object that the <code>search()</code> method returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits.
<li>This regular expression is not the final answer, because it doesn&#8217;t handle a phone number with an extension on the end. For that, you&#8217;ll need to expand the regular expression.
</ol>
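<p>The failed search in the last step returns <code>None</code>, so calling <code>groups()</code> on it directly would raise an <code>AttributeError</code>. Here is a minimal sketch of guarding against that; the helper name <code>parse_phone</code> is an assumption made for illustration, not part of the chapter's code.

```python
import re

# Same first-step pattern as above: three hyphen-separated groups of digits.
phone_pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

def parse_phone(text):
    """Return the remembered groups as a tuple, or None if there is no match."""
    match = phone_pattern.search(text)
    # search() returns None on failure, so check before calling groups().
    return match.groups() if match else None

print(parse_phone('800-555-1212'))       # a tuple of three strings
print(parse_phone('800-555-1212-1234'))  # None: the extension is not handled yet
```

<p>Returning <code>None</code> instead of raising keeps the calling code simple while you are still experimenting with the pattern.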
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212-1234').groups()</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212-1234').groups()</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800 555 1212 1234')</kbd> <span>&#x2462;</span></a>
<samp class=prompt>>>> </samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800 555 1212 1234')</kbd> <span>&#x2462;</span></a>
<samp class=p>>>> </samp>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212')</kbd> <span>&#x2463;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered group of four digits. What&#8217;s new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string.
<li>The <code>groups()</code> method now returns a tuple of four elements, since the regular expression now defines four groups to remember.
@@ -329,15 +327,15 @@ body{counter-reset:h1 4}
</ol>
<p>The next example shows the regular expression to handle separators between the different parts of the phone number.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800 555 1212 1234').groups()</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800 555 1212 1234').groups()</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212-1234').groups()</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212-1234').groups()</kbd> <span>&#x2462;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('80055512121234')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212')</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('80055512121234')</kbd> <span>&#x2463;</span></a>
<samp class=p>>>> </samp>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212')</kbd> <span>&#x2464;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>Hang on to your hat. You&#8217;re matching the beginning of the string, then a group of three digits, then <code>\D+</code>. What the heck is that? Well, <code>\D</code> matches any character <em>except</em> a numeric digit, and <code>+</code> means &#8220;1 or more&#8221;. So <code>\D+</code> matches one or more characters that are not digits. This is what you&#8217;re using instead of a literal hyphen, to try to match different separators.
<li>Using <code>\D+</code> instead of <code>-</code> means you can now match phone numbers where the parts are separated by spaces instead of hyphens.
@@ -347,15 +345,15 @@ body{counter-reset:h1 4}
</ol>
<p>The next example shows the regular expression for handling phone numbers <em>without</em> separators.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('80055512121234').groups()</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('80055512121234').groups()</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800.555.1212 x1234').groups()</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800.555.1212 x1234').groups()</kbd> <span>&#x2462;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2463;</span></a>
<samp>('800', '555', '1212', '')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('(800)5551212 x1234')</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('(800)5551212 x1234')</kbd> <span>&#x2464;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>The only change you&#8217;ve made since that last step is changing all the <code>+</code> to <code>*</code>. Instead of <code>\D+</code> between the parts of the phone number, you now match on <code>\D*</code>. Remember that <code>+</code> means &#8220;1 or more&#8221;? Well, <code>*</code> means &#8220;zero or more&#8221;. So now you should be able to parse phone numbers even when there is no separator character at all.
<li>Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits (<code>800</code>), then zero non-numeric characters, then a remembered group of three digits (<code>555</code>), then zero non-numeric characters, then a remembered group of four digits (<code>1212</code>), then zero non-numeric characters, then a remembered group of an arbitrary number of digits (<code>1234</code>), then the end of the string.
@@ -365,13 +363,13 @@ body{counter-reset:h1 4}
</ol>
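<p>To see the effect of switching from <code>+</code> to <code>*</code> side by side, the two patterns can be run against the same inputs. This is only a sketch; the names <code>with_plus</code>, <code>with_star</code>, and <code>groups_or_none</code> are hypothetical.

```python
import re

# The previous step required at least one separator character (\D+);
# this step allows zero or more (\D*) and makes the extension optional (\d*).
with_plus = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
with_star = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

def groups_or_none(pattern, text):
    """Return the remembered groups, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.groups() if match else None

for sample in ('800 555 1212 1234', '80055512121234', '800-555-1212'):
    print(sample, groups_or_none(with_plus, sample), groups_or_none(with_star, sample))
```

<p>Only the first sample matches both patterns. The other two need the <code>*</code> version, because they have either no separators or no extension.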
<p>The next example shows how to handle leading characters in phone numbers.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('(800)5551212 ext. 1234').groups()</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('(800)5551212 ext. 1234').groups()</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2462;</span></a>
<samp>('800', '555', '1212', '')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234')</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp></pre>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234')</kbd> <span>&#x2463;</span></a>
<samp class=p>>>> </samp></pre>
<ol>
<li>This is the same as in the previous example, except now you&#8217;re matching <code>\D*</code>, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you&#8217;re not remembering these non-numeric characters (they&#8217;re not in parentheses). If you find them, you&#8217;ll just skip over them and then start remembering the area code whenever you get to it.
<li>You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it&#8217;s treated as a non-numeric separator and matched by the <code>\D*</code> after the first remembered group.)
@@ -380,12 +378,12 @@ body{counter-reset:h1 4}
</ol>
<p>Let&#8217;s back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let&#8217;s take a different approach: don&#8217;t explicitly match the beginning of the string at all. This approach is shown in the next example.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234').groups()</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234').groups()</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2462;</span></a>
<samp>('800', '555', '1212', '')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('80055512121234').groups()</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('80055512121234').groups()</kbd> <span>&#x2463;</span></a>
<samp>('800', '555', '1212', '1234')</samp></pre>
<ol>
<li>Note the lack of <code>^</code> in this regular expression. You are not matching the beginning of the string anymore. There&#8217;s nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
@@ -396,7 +394,7 @@ body{counter-reset:h1 4}
<p>See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can you tell the difference between one and the next?
<p>While you still understand the final answer (and it is the final answer; if you&#8217;ve discovered a case it doesn&#8217;t handle, I don&#8217;t want to know about it), let&#8217;s write it out as a verbose regular expression, before you forget why you made the choices you made.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>phonePattern = re.compile(r'''
<samp class=p>>>> </samp><kbd>phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D* # optional separator is any number of non-digits
@@ -407,9 +405,9 @@ body{counter-reset:h1 4}
(\d*) # extension is optional and can be any number of digits
$ # end of string
''', re.VERBOSE)</kbd>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234').groups()</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('work 1-(800) 555.1212 #1234').groups()</kbd> <span>&#x2460;</span></a>
<samp>('800', '555', '1212', '1234')</samp>
<a><samp class=prompt>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>phonePattern.search('800-555-1212').groups()</kbd> <span>&#x2461;</span></a>
<samp>('800', '555', '1212', '')</samp></pre>
<ol>
<li>Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it&#8217;s no surprise that it parses the same inputs.
@@ -432,6 +430,6 @@ body{counter-reset:h1 4}
<li><code>(x)</code> in general is a <em>remembered group</em>. You can get the value of what matched by using the <code>groups()</code> method of the object returned by <code>re.search</code>.
</ul>
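<p>As a recap, the remembered groups from the verbose phone number pattern can be given labels once they have been extracted. The sketch below assumes a hypothetical helper, <code>parse_phone</code>, and hypothetical label names; only the pattern itself comes from this chapter.

```python
import re

phone_pattern = re.compile(r'''
    (\d{3})   # area code is 3 digits
    \D*       # optional separator is any number of non-digits
    (\d{3})   # trunk is 3 digits
    \D*       # optional separator
    (\d{4})   # rest of number is 4 digits
    \D*       # optional separator
    (\d*)     # extension is optional and can be any number of digits
    $         # end of string
    ''', re.VERBOSE)

# Hypothetical labels for the four remembered groups, in order.
LABELS = ('area_code', 'trunk', 'rest', 'extension')

def parse_phone(text):
    """Return a dict of labelled groups, or None if no phone number is found."""
    match = phone_pattern.search(text)
    if match is None:
        return None
    return dict(zip(LABELS, match.groups()))

print(parse_phone('work 1-(800) 555.1212 #1234'))
print(parse_phone('800-555-1212'))
```

<p>When there is no extension, the last group is an empty string rather than <code>None</code>, because <code>\d*</code> is allowed to match zero digits.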
<p>Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than they solve.
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
+32 -34
@@ -1,19 +1,17 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Strings - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
body{counter-reset:h1 3}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#strings>Dive Into Python 3</a> <span>&#8227;</span>
<p class=s><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#strings>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Strings</h1>
<blockquote class=q>
<p><span>&#x275D;</span> I&#8217;m telling you this &#8217;cause you&#8217;re one of my friends.<br>
@@ -35,7 +33,7 @@ My alphabet starts where your alphabet ends! <span>&#x275E;</span><br>&mdash; <c
<li><a href=#furtherreading>Further reading</a>
</ol>
<h2 id=divingin>Diving in</h2>
<p class=fancy>Chinese has thousands of characters. The <a href="http://en.wikipedia.org/wiki/Rotokas_alphabet">Rotokas alphabet</a> of <a href="http://en.wikipedia.org/wiki/Bougainville_Province">Bougainville</a> is the smallest alphabet in the world, with just 12 letters. English has 26, plus a handful of punctuation marks. Python 3 can handle all of these languages, and more.
<p class=f>Chinese has thousands of characters. The <a href="http://en.wikipedia.org/wiki/Rotokas_alphabet">Rotokas alphabet</a> of <a href="http://en.wikipedia.org/wiki/Bougainville_Province">Bougainville</a> is the smallest alphabet in the world, with just 12 letters. English has 26, plus a handful of punctuation marks. Python 3 can handle all of these languages, and more.
<p>When people talk about &#8220;text,&#8221; they&#8217;re thinking of &#8220;characters and symbols on the computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <i>character encoding</i>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
@@ -91,21 +89,21 @@ FIXME: update for Python 3
<p>Python has had Unicode support throughout the language since version 2.0. The <abbr>XML</abbr> package uses Unicode to store all parsed <abbr>XML</abbr> data, but you can use Unicode anywhere.
<div class=example><h3>Example 9.13. Introducing Unicode</h3><pre class=screen>
<samp class=prompt>>>> </samp><kbd>s = u'Dive in'</kbd> <span>&#x2460;</span>
<samp class=prompt>>>> </samp><kbd>s</kbd>
<samp class=p>>>> </samp><kbd>s = u'Dive in'</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>s</kbd>
u'Dive in'
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>print s</kbd> <span>&#x2461;</span>
Dive in</pre><div class=calloutlist>
<ol>
<li>To create a Unicode string instead of a regular <abbr>ASCII</abbr> string, add the letter &#8220;<code>u</code>&#8221; before the string. Note that this particular string doesn't have any non-<abbr>ASCII</abbr> characters. That's fine; Unicode is a superset of <abbr>ASCII</abbr> (a very large superset at that), so any regular <abbr>ASCII</abbr> string can also be stored as Unicode.
<li>When printing a string, Python will attempt to convert it to your default encoding, which is usually <abbr>ASCII</abbr>. (More on this in a minute.) Since this Unicode string is made up of characters that are also <abbr>ASCII</abbr> characters, printing it has the same result as printing a normal <abbr>ASCII</abbr> string; the conversion is seamless, and if you didn't know that <var>s</var> was a Unicode string, you'd never notice the difference.
<div class=example><h3>Example 9.14. Storing non-<abbr>ASCII</abbr> characters</h3><pre class=screen>
<samp class=prompt>>>> </samp><kbd>s = u'La Pe\xf1a'</kbd> <span>&#x2460;</span>
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>s = u'La Pe\xf1a'</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>print s</kbd> <span>&#x2461;</span>
<samp class=traceback>Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
<samp class=prompt>>>> </samp><kbd>print s.encode('latin-1')</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>print s.encode('latin-1')</kbd> <span>&#x2462;</span>
La Pe&ntilde;a</pre><div class=calloutlist>
<ol>
<li>The real advantage of Unicode, of course, is its ability to store non-<abbr>ASCII</abbr> characters, like the Spanish &#8220;<code>&ntilde;</code>&#8221; (<code>n</code> with a tilde over it). The Unicode character code for the tilde-n is <code>0xf1</code> in hexadecimal (241 in decimal), which you can type like this: <code>\xf1</code>.
@@ -146,9 +144,9 @@ http://www.python.org/dev/peps/pep-3120/ - UTF-8 is now the default encoding (Py
to insert values into a string with the <code>%s</code> placeholder.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>k = "uid"</kbd>
<samp class=prompt>>>> </samp><kbd>v = "sa"</kbd>
<samp class=prompt>>>> </samp><kbd>"%s=%s" % (k, v)</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>k = "uid"</kbd>
<samp class=p>>>> </samp><kbd>v = "sa"</kbd>
<samp class=p>>>> </samp><kbd>"%s=%s" % (k, v)</kbd> <span>&#x2460;</span>
<samp>'uid=sa'</samp></pre>
<ol>
<li>The whole expression evaluates to a string. The first <code>%s</code> is replaced by the value of <var>k</var>; the second <code>%s</code> is replaced by the value of <var>v</var>. All other characters in the string (in this case, the equal sign) stay as they are.
@@ -160,16 +158,16 @@ http://www.python.org/dev/peps/pep-3120/ - UTF-8 is now the default encoding (Py
string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>uid = "sa"</kbd>
<samp class=prompt>>>> </samp><kbd>pwd = "secret"</kbd>
<samp class=prompt>>>> </samp><kbd>print pwd + " is not a good password for " + uid</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>uid = "sa"</kbd>
<samp class=p>>>> </samp><kbd>pwd = "secret"</kbd>
<samp class=p>>>> </samp><kbd>print pwd + " is not a good password for " + uid</kbd> <span>&#x2460;</span>
secret is not a good password for sa
<samp class=prompt>>>> </samp><kbd>print "%s is not a good password for %s" % (pwd, uid)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>print "%s is not a good password for %s" % (pwd, uid)</kbd> <span>&#x2461;</span>
secret is not a good password for sa
<samp class=prompt>>>> </samp><kbd>userCount = 6</kbd>
<samp class=prompt>>>> </samp><kbd>print "Users connected: %d" % (userCount, )</kbd> <span>&#x2462;</span> <span>&#x2463;</span>
<samp class=p>>>> </samp><kbd>userCount = 6</kbd>
<samp class=p>>>> </samp><kbd>print "Users connected: %d" % (userCount, )</kbd> <span>&#x2462;</span> <span>&#x2463;</span>
Users connected: 6
<samp class=prompt>>>> </samp><kbd>print "Users connected: " + userCount</kbd> <span>&#x2464;</span>
<samp class=p>>>> </samp><kbd>print "Users connected: " + userCount</kbd> <span>&#x2464;</span>
<samp class=traceback>Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects</samp></pre>
@@ -184,11 +182,11 @@ TypeError: cannot concatenate 'str' and 'int' objects</samp></pre>
<p>As with <code>printf</code> in <abbr>C</abbr>, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %f" % 50.4625</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>print "Today's stock price: %f" % 50.4625</kbd> <span>&#x2460;</span>
<samp>Today's stock price: 50.462500</samp>
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %.2f" % 50.4625</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>print "Today's stock price: %.2f" % 50.4625</kbd> <span>&#x2461;</span>
<samp>Today's stock price: 50.46</samp>
<samp class=prompt>>>> </samp><kbd>print "Change since yesterday: %+.2f" % 1.5</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>print "Change since yesterday: %+.2f" % 1.5</kbd> <span>&#x2462;</span>
<samp>Change since yesterday: +1.50</samp></pre>
<ol>
<li>The <code>%f</code> string formatting option treats the value as a floating-point number and prints it to six decimal places.
@@ -213,10 +211,10 @@ is an object. You might have thought I meant that string <em>variables</em> are
<!--<code>join</code> works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception.-->
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
<samp class=prompt>>>> </samp><kbd>["%s=%s" % (k, v) for k, v in params.items()]</kbd>
<samp class=p>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
<samp class=p>>>> </samp><kbd>["%s=%s" % (k, v) for k, v in params.items()]</kbd>
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
<samp class=prompt>>>> </samp><kbd>";".join(["%s=%s" % (k, v) for k, v in params.items()])</kbd>
<samp class=p>>>> </samp><kbd>";".join(["%s=%s" % (k, v) for k, v in params.items()])</kbd>
'server=mpilgrim;uid=sa;database=master;pwd=secret'</pre>
<p>This string is then returned from the <code>odbchelper</code> function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter.
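<p>The same build-then-join step can be sketched for Python 3. This assumes Python 3.7 or later, where dictionaries keep insertion order, so the pairs come out in the order the keys were written rather than the order shown in the older session above.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# Build one "key=value" string per dictionary item, then join them with semicolons.
pairs = ["%s=%s" % (k, v) for k, v in params.items()]
joined = ";".join(pairs)
print(joined)

# split() reverses the join; splitting each pair once on "=" recovers the dictionary.
restored = dict(pair.split("=", 1) for pair in joined.split(";"))
print(restored == params)
```

<p>Note that <code>join()</code> still does no type coercion: every element of the list must already be a string.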
@@ -224,13 +222,13 @@ is an object. You might have thought I meant that string <em>variables</em> are
<p>You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's called <code>split</code>.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</kbd>
<samp class=prompt>>>> </samp><kbd>s = ";".join(li)</kbd>
<samp class=prompt>>>> </samp><kbd>s</kbd>
<samp class=p>>>> </samp><kbd>li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</kbd>
<samp class=p>>>> </samp><kbd>s = ";".join(li)</kbd>
<samp class=p>>>> </samp><kbd>s</kbd>
'server=mpilgrim;uid=sa;database=master;pwd=secret'
<samp class=prompt>>>> </samp><kbd>s.split(";")</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>s.split(";")</kbd> <span>&#x2460;</span>
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
<samp class=prompt>>>> </samp><kbd>s.split(";", 1)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>s.split(";", 1)</kbd> <span>&#x2461;</span>
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']</pre>
<ol>
<li><code>split</code> reverses <code>join</code> by splitting a string into a multi-element list. Note that the delimiter (&#8220;<code>;</code>&#8221;) is stripped out completely; it does not appear in any of the elements of the returned list.
@@ -263,6 +261,6 @@ http://www.w3.org/People/Dürst/papers.html
http://rishida.net/scripts/chinese/
</pre>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
+3 -5
@@ -1,11 +1,9 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Table of contents - Dive Into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
h1:before{content:""}
@@ -15,8 +13,8 @@ ul{list-style:none;margin:0;padding:0}
ul li ol{margin:0;padding:0 0 0 2.5em}
</style>
</head>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> Dive Into Python 3 <span>&#8227;</span>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> Dive Into Python 3 <span>&#8227;</span>
<h1>Table of contents</h1>
<ol start=0>
<li>Installing Python
@@ -380,4 +378,4 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
<li>Dictionary comprehensions
<li>Views (several dictionary methods return them, they're dynamic, update when the dictionary changes, etc.)
</ul>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
+16 -18
@@ -1,19 +1,17 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Unit testing - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
body{counter-reset:h1 7}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=root value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#unit-testing>Dive Into Python 3</a> <span>&#8227;</span>
<p class=s><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=root value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#unit-testing>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Unit testing</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Certitude is not the test of certainty. We have been cocksure of many things that were not so. <span>&#x275E;</span><br>&mdash; <cite>Oliver Wendell Holmes, Jr.</cite>
@@ -26,7 +24,7 @@ body{counter-reset:h1 7}
<li>...
</ol>
<h2 id=divingin>(Not) diving in</h2>
<p class=fancy>How do you know that the code you wrote yesterday still works after the changes you made today? Every seasoned programmer has war stories of an &#8220;innocent&#8221; change that couldn't <em>possibly</em> have affected that other &#8220;unrelated&#8221; module&hellip; If this sounds familiar, this chapter is for you.
<p class=f>How do you know that the code you wrote yesterday still works after the changes you made today? Every seasoned programmer has war stories of an &#8220;innocent&#8221; change that couldn't <em>possibly</em> have affected that other &#8220;unrelated&#8221; module&hellip; If this sounds familiar, this chapter is for you.
<p>In this chapter, you're going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in <a href="regular-expressions.html#romannumerals">&#8220;Case study: roman numerals&#8221;</a>. Now step back and consider what it would take to expand that into a two-way utility.
<p><a href="regular-expressions.html#romannumerals">The rules for Roman numerals</a> lead to a number of interesting observations:
<ol>
@@ -149,7 +147,7 @@ function to_roman(n):
</ol>
<p>Execute <code>romantest1.py</code> on the command line to run the test. If you call it with the <code>-v</code> command-line option, it will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 romantest1.py -v</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest1.py -v</kbd>
<samp><a>to_roman should give known result with known input ... FAIL <span>&#x2460;</span></a>
======================================================================
@@ -206,8 +204,8 @@ while n >= integer:
print('subtracting {0} from input, adding {1} to output'.format(integer, numeral))</code></pre>
<p>With the debug <code>print()</code> statements, the output looks like this:
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import roman1</kbd>
<samp class=prompt>>>> </samp><kbd>roman1.to_roman(1424)</kbd>
<samp class=p>>>> </samp><kbd>import roman1</kbd>
<samp class=p>>>> </samp><kbd>roman1.to_roman(1424)</kbd>
<samp>subtracting 1000 from input, adding M to output
subtracting 400 from input, adding CD to output
subtracting 10 from input, adding X to output
@@ -216,7 +214,7 @@ subtracting 4 from input, adding IV to output
'MCDXXIV'</samp></pre>
<p>So the <code>to_roman()</code> function appears to work, at least in this manual spot check. But will it pass the test case you wrote?
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 romantest1.py -v</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest1.py -v</kbd>
<samp>to_roman should give known result with known input ... ok
----------------------------------------------------------------------
@@ -230,12 +228,12 @@ OK</samp></pre>
<h2 id=romantest2>&#8220;Halt and catch fire&#8221;</h2>
<p>It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import roman1</kbd>
<samp class=prompt>>>> </samp><kbd>roman1.to_roman(4000)</kbd>
<samp class=p>>>> </samp><kbd>import roman1</kbd>
<samp class=p>>>> </samp><kbd>roman1.to_roman(4000)</kbd>
<samp>'MMMM'</samp>
<samp class=prompt>>>> </samp><kbd>roman1.to_roman(5000)</kbd>
<samp class=p>>>> </samp><kbd>roman1.to_roman(5000)</kbd>
<samp>'MMMMM'</samp>
<a><samp class=prompt>>>> </samp><kbd>roman1.to_roman(9000)</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>roman1.to_roman(9000)</kbd> <span>&#x2460;</span></a>
<samp>'MMMMMMMMM'</samp></pre>
<ol>
<li>That's definitely not what you wanted &mdash; that's not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is <em>baaaaaaad</em>; if a program is going to fail, it is far better that it fail quickly and noisily. &#8220;Halt and catch fire,&#8221; as the saying goes. The Pythonic way to halt and catch fire is to raise an exception.
@@ -260,7 +258,7 @@ OK</samp></pre>
<p>Also note that you're passing the <code>to_roman()</code> function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned recently how handy it is that <a href="your-first-python-program.html#everythingisanobject">everything in Python is an object</a>?
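<p>As a rough sketch of what &#8220;passing the function itself&#8221; looks like in a test, consider the following. The <code>OutOfRangeError</code> class and the stubbed-out <code>to_roman()</code> body are assumptions made for illustration, not the chapter's actual code; only the range check is modeled.

```python
import unittest


class OutOfRangeError(ValueError):
    """Hypothetical exception for this sketch, raised for input that is too large."""


def to_roman(n):
    """Stand-in for the chapter's function; only the range check is sketched."""
    if n > 3999:
        raise OutOfRangeError('number out of range')
    return 'stub'  # the real conversion is omitted here


class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        """to_roman should fail with large input"""
        # Pass the exception class, the function object itself, and its argument.
        # assertRaises calls to_roman(4000) for you and checks what it raises.
        self.assertRaises(OutOfRangeError, to_roman, 4000)
```

<p>Writing <code>to_roman(4000)</code> inside the call would raise the exception before <code>assertRaises</code> ever got a chance to catch it, which is why the function and its argument are passed separately.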
<p>So what happens when you run the test suite with this new test?
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
<samp>to_roman should give known result with known input ... ok
<a>to_roman should fail with large input ... ERROR <span>&#x2460;</span></a>
@@ -289,7 +287,7 @@ FAILED (errors=1)</samp></pre>
</ol>
<p>Now run the test suite again.
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
<samp>to_roman should give known result with known input ... ok
<a>to_roman should fail with large input ... FAIL <span>&#x2460;</span></a>
@@ -327,7 +325,7 @@ FAILED (failures=1)</samp></pre>
</ol>
<p>Does this make the test pass? Let's find out.
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
<samp>to_roman should give known result with known input ... ok
<a>to_roman should fail with large input ... ok <span>&#x2460;</span></a>
@@ -364,6 +362,6 @@ For instance, the <code>testFromRomanCase</code> method (&#8220;<code>from_roman
<li><code>from_roman</code> should only accept uppercase Roman numerals (<i class=foreignphrase><abbr>i.e.</abbr></i> it should fail when given lowercase input).
</ol>
-->
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
+28 -29
View File
@@ -1,19 +1,18 @@
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>Your first Python program - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<link rel=stylesheet type=text/css href=dip3.css>
<style>
body{counter-reset:h1 1}
th{font-family:inherit !important}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#your-first-python-program>Dive Into Python 3</a> <span>&#8227;</span>
<p class=s><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#your-first-python-program>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Your first Python program</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Don&#8217;t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. <span>&#x275E;</span><br>&mdash; <cite>Ven. Henepola Gunaratana</cite>
@@ -40,9 +39,9 @@ body{counter-reset:h1 1}
<li><a href=#furtherreading>Further reading</a>
</ol>
<h2 id=divingin>Diving in</h2>
<p class=fancy>Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<p class=f>Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<p id=noscript>[The code examples will be easier to follow if you enable Javascript, but whatever.]
<p class=skip><a href=#skip-humansize-py>skip over this code listing</a>
<p class=s><a href=#skip-humansize-py>skip over this code listing</a>
<p class=download>[<a href=humansize.py>download <code>humansize.py</code></a>]
<pre><code>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
@@ -73,14 +72,14 @@ if __name__ == "__main__":
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))</code></pre>
<p id=skip-humansize-py>Now let's run this program on the command line. On Windows, it will look something like this:
<p class=skip><a href=#skip-humansize-screen>skip over this command output listing</a>
<p class=s><a href=#skip-humansize-screen>skip over this command output listing</a>
<pre class=screen>
<samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp class=p>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
<p>On Mac OS X or Linux, it would look something like this:
<pre class=screen>
<samp class=prompt>you@localhost:~$ </samp><kbd>python3 humansize.py</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>python3 humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
<p id=skip-humansize-screen>FIXME: this would be a good place to explain what the program, you know, actually does.
@@ -114,7 +113,7 @@ if __name__ == "__main__":
</dl>
<p>So Python is both <em>dynamically typed</em> (because it doesn't use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
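<p>A minimal sketch of both properties at once; the variable names here are made up for illustration:

```python
# Dynamically typed: no declaration; the name takes the type of whatever is assigned.
size = 4096
kind_before = type(size).__name__
size = '4096'
kind_after = type(size).__name__

# Strongly typed: Python will not silently coerce a string into a number.
try:
    total = size + 1
except TypeError:
    total = int(size) + 1  # the conversion has to be explicit

print(kind_before, kind_after, total)
```

<p>The same name holds an integer and then a string without complaint, but adding <code>1</code> to the string raises a <code>TypeError</code> until you convert it yourself.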
<p>If you have experience in other programming languages, this table may help you visualize how Python compares to them:
<table class=simple>
<table>
<tr><th></th><th>Statically typed</th><th>Dynamically typed</th></tr>
<tr><th>Weakly typed</th><td>C, Objective-C</td><td>JavaScript, Perl 5, <abbr>PHP</abbr></td></tr>
<tr><th>Strongly typed</th><td>Pascal, Java</td><td>Python, Ruby</td></tr>
@@ -123,7 +122,7 @@ if __name__ == "__main__":
<p>I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months.
<h3 id=docstrings>Documentation strings</h3>
<p>You can document a Python function by giving it a documentation string (<code>docstring</code> for short). In this program, the <code>approximate_size</code> function has a <code>docstring</code>:
<p class=skip><a href=#skip-approximate-size>skip over this code listing</a>
<p class=s><a href=#skip-approximate-size>skip over this code listing</a>
<pre><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):
"""Convert a file size to human-readable form.
@@ -150,12 +149,12 @@ if __name__ == "__main__":
<h2 id=everythingisanobject>Everything is an object</h2>
<p>In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object.
<p>Run the interactive Python shell and follow along:
<p class=skip><a href=#skip-everything-is-an-object-screen>skip over this interpreter listing</a>
<p class=s><a href=#skip-everything-is-an-object-screen>skip over this interpreter listing</a>
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>import humansize</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>print(humansize.approximate_size(4096, True))</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>import humansize</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>print(humansize.approximate_size(4096, True))</kbd> <span>&#x2461;</span></a>
<samp>4.0 KiB</samp>
<a><samp class=prompt>>>> </samp><kbd>print(humansize.approximate_size.__doc__)</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>print(humansize.approximate_size.__doc__)</kbd> <span>&#x2462;</span></a>
<samp>Convert a file size to human-readable form.
Keyword arguments:
@@ -176,14 +175,14 @@ if __name__ == "__main__":
</blockquote>
<h3 id=importsearchpath>The <code>import</code> search path</h3>
<p>Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in <code>sys.path</code>. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.)
<p class=skip><a href=#skip-import-search-path-screen>skip over this interpreter listing</a>
<p class=s><a href=#skip-import-search-path-screen>skip over this interpreter listing</a>
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>sys.path</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>sys.path</kbd> <span>&#x2461;</span></a>
<samp>['', '/usr/lib/python30.zip', '/usr/lib/python3.0', '/usr/lib/python3.0/plat-linux2@EXTRAMACHDEPPATH@', '/usr/lib/python3.0/lib-dynload', '/usr/lib/python3.0/dist-packages', '/usr/local/lib/python3.0/dist-packages']</samp>
<a><samp class=prompt>>>> </samp><kbd>sys</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>sys</kbd> <span>&#x2462;</span></a>
<samp>&lt;module 'sys' (built-in)></samp>
<a><samp class=prompt>>>> </samp><kbd>sys.path.append('/my/new/path')</kbd> <span>&#x2463;</span></a></pre>
<a><samp class=p>>>> </samp><kbd>sys.path.append('/my/new/path')</kbd> <span>&#x2463;</span></a></pre>
<ol id=skip-import-search-path-screen>
<li>Importing the <code>sys</code> module makes all of its functions and attributes available.
<li><code>sys.path</code> is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a <code>.py</code> file whose name matches what you're trying to import.
@@ -196,7 +195,7 @@ if __name__ == "__main__":
<p>This is so important that I'm going to repeat it in case you missed it the first few times: <em>everything in Python is an object</em>. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
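<p>One way to convince yourself of this is to treat a function and a module like any other value. The <code>approximate()</code> function below is a hypothetical stand-in, not the one from <code>humansize.py</code>:

```python
import sys


def approximate(size):
    """Hypothetical stand-in for a function with a docstring."""
    return size


# A function is an object: it can be bound to another name, stored in a list,
# and inspected for attributes at runtime. The same goes for a module.
alias = approximate
things = [approximate, sys, 'a string', [1, 2, 3]]

all_objects = all(isinstance(thing, object) for thing in things)
same_function = alias is approximate
doc = alias.__doc__
module_name = sys.__name__
```

<p>Nothing here calls <code>approximate()</code>; the code only passes the function around and reads its attributes, exactly as it does with the string, the list, and the <code>sys</code> module.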
<h2 id=indentingcode>Indenting code</h2>
<p>Python functions have no explicit <code>begin</code> or <code>end</code>, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (<code>:</code>) and the indentation of the code itself.
<p class=skip><a href=#skip-indenting-code>skip over this code listing</a>
<p class=s><a href=#skip-indenting-code>skip over this code listing</a>
<pre><code>
<a>def approximate_size(size, a_kilobyte_is_1024_bytes=True): <span>&#x2460;</span></a>
<a> if size &lt; 0: <span>&#x2461;</span></a>
@@ -222,7 +221,7 @@ if __name__ == "__main__":
</blockquote>
<h2 id=runningscripts>Running scripts</h2>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of <code>humansize.py</code>:
<p class=skip><a href=#skip-running-scripts>skip over this code listing</a>
<p class=s><a href=#skip-running-scripts>skip over this code listing</a>
<pre><code>
if __name__ == "__main__":
print(approximate_size(1000000000000, False))
@@ -231,15 +230,15 @@ if __name__ == "__main__":
<p><span>&#x261E;</span>Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
</blockquote>
<p>So what makes this <code>if</code> statement special? Well, modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension.
<p class=skip><a href=#skip-import-humansize>skip over this interpreter listing</a>
<p class=s><a href=#skip-import-humansize>skip over this interpreter listing</a>
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>import humansize</kbd>
<samp class=prompt>>>> </samp><kbd>humansize.__name__</kbd>
<samp class=p>>>> </samp><kbd>import humansize</kbd>
<samp class=p>>>> </samp><kbd>humansize.__name__</kbd>
<samp>'humansize'</samp></pre>
<p id=skip-import-humansize>But you can also run the module directly as a standalone program, in which case <code>__name__</code> will be a special default value, <code>__main__</code>. Python will evaluate this <code>if</code> statement, find a true expression, and execute the <code>if</code> code block. In this case, the block prints two values.
<p class=skip><a href=#furtherreading>skip over this command output listing</a>
<p class=s><a href=#furtherreading>skip over this command output listing</a>
<pre class=screen>
<samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp class=p>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
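<p>The two cases can be sketched side by side. The <code>describe_run()</code> helper is hypothetical and exists only to make the comparison visible:

```python
def describe_run(module_name):
    """Hypothetical helper: report how a module with this __name__ is being used."""
    if module_name == '__main__':
        return 'run directly as a script'
    return 'imported as module {0!r}'.format(module_name)


# __name__ is set by Python itself; nothing in this file assigns it.
if __name__ == '__main__':
    print(describe_run(__name__))
```

<p>Saved to a file and run from the command line, this prints a line; imported from another module, it prints nothing, because the <code>if</code> block is skipped.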
<h2 id=furtherreading>Further reading</h2>
@@ -249,6 +248,6 @@ if __name__ == "__main__":
<li><a href=http://www.python.org/dev/peps/pep-0008/>PEP 8: Style Guide for Python Code</a> discusses good indentation style.
<li><a href=http://docs.python.org/3.0/reference/><cite>Python Reference Manual</cite></a> explains what it means to say that <a href=http://docs.python.org/3.0/reference/datamodel.html#objects-values-and-types>everything in Python is an object</a>, because some people are pedantic and like to discuss that sort of thing at great length.
</ul>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &bull; <a href=about.html>open standards &bull; open content &bull; open source</a>
<script src=jquery.js></script>
<script src=dip3.js></script>