whats-new, more special-method-names, typography fiddling

Mark Pilgrim
2009-04-29 23:49:36 -04:00
parent 098df1da63
commit 4d69a47f98
14 changed files with 337 additions and 259 deletions
+4
@@ -20,6 +20,8 @@ body{counter-reset:h1 11}
<h2 id=divingin>Diving In</h2>
<p class=f>FIXME
<h2 id=ordereddict>Ordered Dictionary: Not An Oxymoron</h2>
<p class=d>[<a href=examples/ordereddict.py>download <code>ordereddict.py</code></a>]
<pre><code>import collections
import itertools
@@ -92,6 +94,8 @@ class OrderedDict(dict, collections.MutableMapping):
return all(p==q for p, q in itertools.zip_longest(self.items(), other.items()))
return dict.__eq__(self, other)</code></pre>
<h2 id=implementing-fractions>Implementing Fractions</h2>
<p class=nav><a rel=prev class=todo><span>&#x261C;</span></a> <a rel=next class=todo><span>&#x261E;</span></a>
<p class=c>&copy; 2001&ndash;9 <a href=about.html>Mark Pilgrim</a>
<script src=jquery.js></script>
+19 -19
@@ -17,7 +17,7 @@ body{counter-reset:h1 7}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
<p class=f>H<code>AWAII + IDAHO + IOWA + OHIO == STATES</code>. Or, to put it another way, <code>510199 + 98153 + 9301 + 3593 == 621246</code>. Am I speaking in tongues? No, it's just a puzzle.
<p class=f>H<code>AWAII + IDAHO + IOWA + OHIO == STATES</code>. Or, to put it another way, <code>510199 + 98153 + 9301 + 3593 == 621246</code>. Am I speaking in tongues? No, it&#8217;s just a puzzle.
<p>Let me spell it out for you.
@@ -38,7 +38,7 @@ E = 4</code></pre>
<p>The most well-known alphametic puzzle is <code>SEND + MORE = MONEY</code>.
<p>In this chapter, we'll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles <em>in just 14 lines of code</em>.
<p>In this chapter, we&#8217;ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles <em>in just 14 lines of code</em>.
<p class=d>[<a href=examples/alphametics.py>download <code>alphametics.py</code></a>]
<pre><code>import re
@@ -91,13 +91,13 @@ if __name__ == '__main__':
<a><samp class=p>>>> </samp><kbd>re.findall('[A-Z]+', 'SEND + MORE == MONEY')</kbd> <span>&#x2461;</span></a>
<samp>['SEND', 'MORE', 'MONEY']</samp></pre>
<ol>
<li>The <code>re</code> module is Python's implementation of <a href=regular-expressions.html>regular expressions</a>. It has a nifty function called <code>findall()</code> which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of numbers. The <code>findall()</code> function returns a list of all the substrings that matched the pattern.
<li>The <code>re</code> module is Python&#8217;s implementation of <a href=regular-expressions.html>regular expressions</a>. It has a nifty function called <code>findall()</code> which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of numbers. The <code>findall()</code> function returns a list of all the substrings that matched the pattern.
<li>Here the regular expression pattern matches sequences of letters. Again, the return value is a list, and each item in the list is a string that matched the regular expression pattern.
</ol>
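<p>A minimal sketch of the two kinds of pattern described above: one that matches runs of digits and one that matches runs of uppercase letters. The first input string is an assumption made up for this sketch; the second is the puzzle from the session above.

```python
import re

# Runs of digits. The input string here is made up for illustration.
numbers = re.findall('[0-9]+', '16 2-by-4s in rows of 8')
print(numbers)

# Runs of uppercase letters, which splits a puzzle into its words.
words = re.findall('[A-Z]+', 'SEND + MORE == MONEY')
print(words)
```

<p>Both calls return a plain list of strings, so the result can be fed straight into a comprehension or a <code>join()</code>.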
<h2 id=unique-items>Finding the unique items in a sequence</h2>
<p>Set comprehensions make it trivial to find the unique items in a sequence. [FIXME-not sure if I'm going to cover set comprehensions in an earlier chapter; if not, this is certainly an abrupt and inadequate introduction to the topic.]
<p>Set comprehensions make it trivial to find the unique items in a sequence. [FIXME-not sure if I&#8217;m going to cover set comprehensions in an earlier chapter; if not, this is certainly an abrupt and inadequate introduction to the topic.]
<pre class=screen>
<samp class=p>>>> </samp><kbd>a_list = ['a', 'c', 'b', 'a', 'd', 'b']</kbd>
@@ -112,7 +112,7 @@ if __name__ == '__main__':
<a><samp class=p>>>> </samp><kbd>{c for c in ''.join(words)}</kbd> <span>&#x2463;</span></a>
<samp>{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}</samp></pre>
<ol>
<li>Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a <code>for</code> loop. Take the first item from the list, put it in the set. Second. Third. Fourth &mdash; wait, that's in the set already, so it only gets listed once. Fifth. Sixth &mdash; again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn't even need to be sorted first.
<li>Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a <code>for</code> loop. Take the first item from the list, put it in the set. Second. Third. Fourth &mdash; wait, that&#8217;s in the set already, so it only gets listed once. Fifth. Sixth &mdash; again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn&#8217;t even need to be sorted first.
<li>The same technique works with strings, since a string is just a sequence of characters.
<li>Given a list of strings, <code>''.join(<var>a_list</var>)</code> concatenates all the strings together into one.
<li>So, given a list of strings, this set comprehension returns all the unique characters across all the strings, with no duplicates.
@@ -126,7 +126,7 @@ if __name__ == '__main__':
<h2 id=assert>Making assertions</h2>
<p>Like many programming languages, Python has an <code>assert</code> statement. Here's how it works.
<p>Like many programming languages, Python has an <code>assert</code> statement. Here&#8217;s how it works.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>assert 1 + 1 == 2</kbd> <span>&#x2460;</span></a>
@@ -172,9 +172,9 @@ AssertionError</samp></pre>
<h2 id=permutations>Calculating Permutations&hellip; The Lazy Way!</h2>
<p>First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you're doing. Here I'm talking about combinatorics, but if that doesn't mean anything to you, don't worry about it. As always, <a href="http://en.wikipedia.org/wiki/Permutation">Wikipedia is your friend</a>.)
<p>First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you&#8217;re doing. Here I&#8217;m talking about combinatorics, but if that doesn&#8217;t mean anything to you, don&#8217;t worry about it. As always, <a href="http://en.wikipedia.org/wiki/Permutation">Wikipedia is your friend</a>.)
<p>The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like "let's find the permutations of 3 different items taken 2 at a time," which means you have a sequence of 3 items and you want to find all the possible ordered pairs.
<p>The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like &#8220;let&#8217;s find the permutations of 3 different items taken 2 at a time,&#8221; which means you have a sequence of 3 items and you want to find all the possible ordered pairs.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>import itertools</kbd> <span>&#x2460;</span></a>
@@ -197,13 +197,13 @@ AssertionError</samp></pre>
StopIteration</samp></pre>
<ol>
<li>The <code>itertools</code> module has all kinds of fun stuff in it, including a <code>permutations()</code> function that does all the hard work of finding permutations.
<li>The <code>permutations()</code> function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a <code>for</code> loop or any old place that iterates. Here I'll step through the iterator manually to show all the values.
<li>The <code>permutations()</code> function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a <code>for</code> loop or any old place that iterates. Here I&#8217;ll step through the iterator manually to show all the values.
<li>The first permutation of <code>[1, 2, 3]</code> taken 2 at a time is <code>(1, 2)</code>.
<li>Note that permutations are ordered: <code>(2, 1)</code> is different than <code>(1, 2)</code>.
<li>That's it! Those are all the permutations of <code>[1, 2, 3]</code> taken 2 at a time. Pairs like <code>(1, 1)</code> and <code>(2, 2)</code> never show up, because they contain repeats so they aren't valid permutations. When there are no more permutations, the iterator raises a <code>StopIteration</code> exception.
<li>That&#8217;s it! Those are all the permutations of <code>[1, 2, 3]</code> taken 2 at a time. Pairs like <code>(1, 1)</code> and <code>(2, 2)</code> never show up, because they contain repeats so they aren&#8217;t valid permutations. When there are no more permutations, the iterator raises a <code>StopIteration</code> exception.
</ol>
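<p>The walk-through above can be condensed into a short script. This is only a sketch: the variable names <var>perms</var>, <var>first</var>, <var>rest</var> and <var>exhausted</var> are made up for illustration.

```python
import itertools

# Permutations of three items taken two at a time.
perms = itertools.permutations([1, 2, 3], 2)

# Step through the iterator by hand once, then drain the rest.
first = next(perms)
rest = list(perms)
print(first)
print(rest)

# Once every permutation has been produced, next() raises StopIteration.
try:
    next(perms)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)
```

<p>In everyday code you would rarely call <code>next()</code> yourself; a <code>for</code> loop handles the <code>StopIteration</code> exception for you.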
<p>The <code>permutations()</code> function doesn't have to take a list. It can take any sequence &mdash; even a string.
<p>The <code>permutations()</code> function doesn&#8217;t have to take a list. It can take any sequence &mdash; even a string.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import itertools</kbd>
@@ -245,7 +245,7 @@ StopIteration</samp>
<samp>[('A', 'B'), ('A', 'C'), ('B', 'C')]</samp></pre>
<ol>
<li>The <code>itertools.product()</code> function returns an iterator containing the Cartesian product of two sequences.
<li>The <code>itertools.combinations()</code> function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the <code>itertools.permutations()</code> function, except combinations don't include items that are duplicates of other items in a different order. So <code>itertools.permutations('ABC', 2)</code> will return both <code>('A', 'B')</code> and <code>('B', 'A')</code> (among others), but <code>itertools.combinations('ABC', 2)</code> will not return <code>('B', 'A')</code> because it is a duplicate of <code>('A', 'B')</code> in a different order.
<li>The <code>itertools.combinations()</code> function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the <code>itertools.permutations()</code> function, except combinations don&#8217;t include items that are duplicates of other items in a different order. So <code>itertools.permutations('ABC', 2)</code> will return both <code>('A', 'B')</code> and <code>('B', 'A')</code> (among others), but <code>itertools.combinations('ABC', 2)</code> will not return <code>('B', 'A')</code> because it is a duplicate of <code>('A', 'B')</code> in a different order.
</ol>
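<p>To see the difference between these functions side by side, here is a small sketch. The second argument to <code>product()</code> (<code>'123'</code>) is an assumption chosen for illustration.

```python
import itertools

# Cartesian product: every item of the first sequence paired with
# every item of the second. '123' is an arbitrary example input.
pairs = list(itertools.product('ABC', '123'))

# Combinations ignore order; permutations do not.
combos = list(itertools.combinations('ABC', 2))
perms = list(itertools.permutations('ABC', 2))

print(len(pairs))
print(combos)
print(('B', 'A') in perms, ('B', 'A') in combos)
```

<p>Three letters taken two at a time give six permutations but only three combinations, because each unordered pair is counted once.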
<p class=d>[<a href=examples/favorite-people.txt>download <code>favorite-people.txt</code></a>]
@@ -273,7 +273,7 @@ StopIteration</samp>
<li>But the <code>sorted()</code> function can also take a function as the <var>key</var> parameter, and it sorts by that key. In this case, the sort function is <code>len()</code>, so it sorts by <code>len(<var>each item</var>)</code>. Shorter names come first, then longer, then longest.
</ol>
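<p>A quick sketch of sorting with and without a <var>key</var> function. The list of names is a made-up stand-in, not the contents of <code>favorite-people.txt</code>.

```python
# A hypothetical list of names, used only for this sketch.
names = ['Wesley', 'Dora', 'Ethan', 'Alex']

# Default sort: alphabetical.
alphabetical = sorted(names)

# Sort by length instead. Names of equal length keep their original
# relative order, because Python's sort is stable.
by_length = sorted(names, key=len)

print(alphabetical)
print(by_length)
```

<p>Note that <code>len</code> is passed without parentheses: <code>sorted()</code> wants the function itself, and calls it once per item.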
<p>What does this have to do with the <code>itertools</code> module? I'm glad you asked.
<p>What does this have to do with the <code>itertools</code> module? I&#8217;m glad you asked.
<pre class=screen>
<p>&hellip;continuing from the previous interactive shell&hellip;
@@ -330,7 +330,7 @@ Wesley</samp></pre>
<li>On the other hand, the <code>itertools.zip_longest()</code> function stops at the end of the <em>longest</em> sequence, inserting <code>None</code> values for items past the end of the shorter sequences.
</ol>
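<p>The contrast between <code>zip()</code> and <code>itertools.zip_longest()</code> is easiest to see with two sequences of different lengths. The ranges below are assumptions picked for illustration.

```python
import itertools

# Two sequences of unequal length (3 items and 4 items).
short = range(0, 3)
long_ = range(10, 14)

# zip() stops when the shortest sequence runs out.
zipped = list(zip(short, long_))

# zip_longest() keeps going and pads the shorter one with None.
padded = list(itertools.zip_longest(short, long_))

print(zipped)
print(padded)
```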
<p>OK, that was all very interesting, but how does it relate to the alphametics solver? Here's how:
<p>OK, that was all very interesting, but how does it relate to the alphametics solver? Here&#8217;s how:
<pre class=screen>
<samp class=p>>>> </samp><kbd>characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')</kbd>
@@ -343,7 +343,7 @@ Wesley</samp></pre>
'N': '5', 'S': '1', 'R': '6', 'Y': '7'}</samp></pre>
<ol>
<li>Given a list of letters and a list of digits (each represented here as 1-character strings), the <code>zip</code> function will create a pairing of letters and digits, in order.
<li>Why is that cool? Because that data structure happens to be exactly the right structure to pass to the <code>dict()</code> function to create a dictionary that uses letters as keys and their associated digits as values. Although the printed representation of the dictionary lists the pairs in a different order (dictionaries have no "order" per se), you can see that each letter is associated with the digit, based on the ordering of the original <var>characters</var> and <var>guess</var> sequences.
<li>Why is that cool? Because that data structure happens to be exactly the right structure to pass to the <code>dict()</code> function to create a dictionary that uses letters as keys and their associated digits as values. Although the printed representation of the dictionary lists the pairs in a different order (dictionaries have no &#8220;order&#8221; per se), you can see that each letter is associated with the digit, based on the ordering of the original <var>characters</var> and <var>guess</var> sequences.
</ol>
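<p>Here is the same idea as a standalone sketch. The <var>guess</var> tuple is an assumption, chosen so that it agrees with the dictionary fragment printed above; <var>mapping</var> is a made-up name.

```python
characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')

# One candidate assignment of digits to letters (assumed for this sketch).
guess = ('1', '2', '0', '3', '4', '5', '6', '7')

# zip() pairs the n-th letter with the n-th digit;
# dict() turns those pairs into a letter-to-digit lookup table.
mapping = dict(zip(characters, guess))

print(mapping['S'], mapping['Y'])
print(len(mapping))
```

<p>The pairing depends only on position, so the order of <var>characters</var> and <var>guess</var> matters even if the printed dictionary happens to list its keys differently.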
<p>The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution.
@@ -355,7 +355,7 @@ for guess in itertools.permutations(digits, len(characters)):
...
<mark> equation = puzzle.translate(dict(zip(characters, guess)))</mark></code></pre>
<p>But what is this <code>translate()</code> method? Ah, now you're getting to the <em>really</em> fun part.
<p>But what is this <code>translate()</code> method? Ah, now you&#8217;re getting to the <em>really</em> fun part.
<h2 id=string-translate>A New Kind Of String Manipulation</h2>
@@ -411,9 +411,9 @@ for guess in itertools.permutations(digits, len(characters)):
<h2 id=furtherreading>Further Reading</h2>
<ul>
<li><a href="http://blip.tv/file/1947373/">Watch Raymond Hettinger's "Easy AI with Python" talk</a> at PyCon 2009
<li><a href="http://code.activestate.com/recipes/576615/">Recipe 576615: Alphametics solver</a>, Raymond Hettinger's original alphametics solver for Python 2
<li><a href="http://code.activestate.com/recipes/users/178123/">More of Raymond Hettinger's recipes</a> in the ActiveState Code repository
<li><a href="http://blip.tv/file/1947373/">Watch Raymond Hettinger&#8217;s "Easy AI with Python" talk</a> at PyCon 2009
<li><a href="http://code.activestate.com/recipes/576615/">Recipe 576615: Alphametics solver</a>, Raymond Hettinger&#8217;s original alphametics solver for Python 2
<li><a href="http://code.activestate.com/recipes/users/178123/">More of Raymond Hettinger&#8217;s recipes</a> in the ActiveState Code repository
<li><a href="http://en.wikipedia.org/wiki/Verbal_arithmetic">Alphametics on Wikipedia</a>
<li><a href="http://www.tkcs-collins.com/truman/alphamet/index.shtml">Alphametics Index</a>, including <a href="http://www.tkcs-collins.com/truman/alphamet/alphamet.shtml">lots of puzzles</a> and <a href="http://www.tkcs-collins.com/truman/alphamet/alpha_gen.shtml">a generator to make your own</a>
</ul>
+32 -32
@@ -614,7 +614,7 @@ ImportError: No module named constants</samp></pre>
<p>Needs to become two separate imports:
<pre><code>from . import constants
import sys</code></pre>
<p>There are variations of this problem scattered throughout the <code>chardet</code> library. In some places it&#8217;s "<code>import constants, sys</code>"; in other places, it&#8217;s "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
<p>There are variations of this problem scattered throughout the <code>chardet</code> library. In some places it&#8217;s &#8220;<code>import constants, sys</code>&#8221;; in other places, it&#8217;s &#8220;<code>import constants, re</code>&#8221;. The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
<p>Onward!
<h3 id=namefileisnotdefined>Name <var>'file'</var> is not defined</h3>
<aside>open() is the new file(). PapayaWhip is the new black.</aside>
@@ -697,7 +697,7 @@ for line in open(f, 'rb'):
File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<p>There's an unfortunate clash of coding style and Python interpreter here. The <code>TypeError</code> could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
<p>There&#8217;s an unfortunate clash of coding style and Python interpreter here. The <code>TypeError</code> could be anywhere on that line, but the traceback doesn&#8217;t tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
<pre><code>elif (self._mInputState == ePureAscii) and \
self._escDetector.search(self._mLastChar + aBuf):</code></pre>
<p>And re-run the test:</p>
@@ -709,8 +709,8 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<p>Aha! The problem was not in the first conditional (<code>self._mInputState == ePureAscii</code>) but in the second one. So what could cause a <code>TypeError</code> there? Perhaps you're thinking that the <code>search()</code> method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may <em>crash</em> if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the <code>search()</code> method. So the problem must be in that <code>+</code> operation, as it's trying to construct the value that it will eventually pass to the <code>search()</code> method.
<p>We know from <a href=#cantuseastringpattern>previous debugging</a> that <var>aBuf</var> is a byte array. So what is <code>self._mLastChar</code>? It's an instance variable, defined in the <code>reset()</code> method, which is actually called from the <code>__init__()</code> method.
<p>Aha! The problem was not in the first conditional (<code>self._mInputState == ePureAscii</code>) but in the second one. So what could cause a <code>TypeError</code> there? Perhaps you&#8217;re thinking that the <code>search()</code> method is expecting a value of a different type, but that wouldn&#8217;t generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may <em>crash</em> if you pass it a value of a different type than it&#8217;s expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the <code>search()</code> method. So the problem must be in that <code>+</code> operation, as it&#8217;s trying to construct the value that it will eventually pass to the <code>search()</code> method.
<p>We know from <a href=#cantuseastringpattern>previous debugging</a> that <var>aBuf</var> is a byte array. So what is <code>self._mLastChar</code>? It&#8217;s an instance variable, defined in the <code>reset()</code> method, which is actually called from the <code>__init__()</code> method.
<pre><code>class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(b'[\x80-\xFF]')
@@ -726,7 +726,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
self._mGotData = False
self._mInputState = ePureAscii
<mark> self._mLastChar = ''</mark></code></pre>
<p>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can't concatenate a string to a byte array &mdash; not even a zero-length string.
<p>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can&#8217;t concatenate a string to a byte array &mdash; not even a zero-length string.
<p>So what is <var>self._mLastChar</var> anyway? The answer is in the <code>feed()</code> method, just a few lines down from where the traceback occurred.
<pre><code>if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
@@ -736,14 +736,14 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
self._mInputState = eEscAscii
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
<p>The calling function calls this <code>feed()</code> method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as <var>aBuf</var>), then stores the last byte in <var>self._mLastChar</var> in case it's needed during the next call. (In a multi-byte encoding, the <code>feed()</code> method might get called with half of a character, then called again with the other half.) But because <var>aBuf</var> is now a byte array instead of a string, <var>self._mLastChar</var> needs to be a byte array as well. Thus:
<p>The calling function calls this <code>feed()</code> method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as <var>aBuf</var>), then stores the last byte in <var>self._mLastChar</var> in case it&#8217;s needed during the next call. (In a multi-byte encoding, the <code>feed()</code> method might get called with half of a character, then called again with the other half.) But because <var>aBuf</var> is now a byte array instead of a string, <var>self._mLastChar</var> needs to be a byte array as well. Thus:
<pre><code> def reset(self):
.
.
.
<del>- self._mLastChar = ''</del>
<ins>+ self._mLastChar = b''</ins></code></pre>
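<p>The mismatch behind this fix can be reproduced outside of <code>chardet</code>. This is a minimal sketch; <var>last_char</var>, <var>a_buf</var>, <var>mixed_ok</var> and <var>fixed</var> are stand-in names, not identifiers from the library.

```python
a_buf = b'\xEF\xBB\xBF'   # a byte array, like aBuf in feed()
last_char = ''            # a zero-length *string*, like the old reset()

# Concatenating a string with a byte array fails,
# even when the string is empty.
try:
    last_char + a_buf
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A zero-length byte array concatenates without complaint.
fixed = b'' + a_buf

print(mixed_ok)
print(fixed == a_buf)
```

<p>The exact wording of the <code>TypeError</code> message varies between Python 3 releases, so the sketch checks only the exception type.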
<p>Searching the entire codebase for <code>"mLastChar"</code> turns up a similar problem in <code>mbcharsetprober.py</code>, but instead of tracking the last character, it tracks the last <em>two</em> characters. The <code>MultiByteCharSetProber</code> class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
<p>Searching the entire codebase for &#8220;<code>mLastChar</code>&#8221; turns up a similar problem in <code>mbcharsetprober.py</code>, but instead of tracking the last character, it tracks the last <em>two</em> characters. The <code>MultiByteCharSetProber</code> class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
<pre><code>
class MultiByteCharSetProber(CharSetProber):
def __init__(self):
@@ -762,7 +762,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<del>- self._mLastChar = ['\x00', '\x00']</del>
<ins>+ self._mLastChar = [0, 0]</ins></code></pre>
<h3 id=unsupportedoperandtypeforplus>Unsupported operand type(s) for +: <code>'int'</code> and <code>'bytes'</code></h3>
<p>I have good news, and I have bad news. The good news is we're making progress&hellip;
<p>I have good news, and I have bad news. The good news is we&#8217;re making progress&hellip;
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class=traceback>Traceback (most recent call last):
@@ -771,8 +771,8 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
self._escDetector.search(self._mLastChar + aBuf):
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
<p>&hellip;The bad news is it doesn't always feel like progress.
<p>But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an <code>int</code> with a byte array (<code>bytes</code>). In fact, you just spent a lot of time <a href=#cantconvertbytesobject>ensuring that <var>self._mLastChar</var> was a byte array</a>. How did it turn into an <code>int</code>?
<p>&hellip;The bad news is it doesn&#8217;t always feel like progress.
<p>But this is progress! Really! Even though the traceback calls out the same line of code, it&#8217;s a different error than it used to be. Progress! So what&#8217;s the problem now? The last time I checked, this line of code didn&#8217;t try to concatenate an <code>int</code> with a byte array (<code>bytes</code>). In fact, you just spent a lot of time <a href=#cantconvertbytesobject>ensuring that <var>self._mLastChar</var> was a byte array</a>. How did it turn into an <code>int</code>?
<p>The answer lies not in the previous lines of code, but in the following lines.
<pre><code>if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
@@ -783,7 +783,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
<aside>Each item in a string is a string. Each item in a byte array is an integer.</aside>
<p>This error doesn't occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
<p>This error doesn&#8217;t occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what&#8217;s the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>len(aBuf)</kbd>
@@ -805,19 +805,19 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp>
<ol>
<li>Define a byte array of length 3.
<li>The last element of the byte array is 191.
<li>That's an integer.
<li>Concatenating an integer with a byte array doesn't work. You've now replicated the error you just found in <code>universaldetector.py</code>.
<li>Ah, here's the fix. Instead of taking the last element of the byte array, use <a href=native-datatypes.html#slicinglists>list slicing</a> to create a new byte array containing just the last element. That is, start with the last element and continue the slice until the end of the byte array. Now <var>mLastChar</var> is a byte array of length 1.
<li>That&#8217;s an integer.
<li>Concatenating an integer with a byte array doesn&#8217;t work. You&#8217;ve now replicated the error you just found in <code>universaldetector.py</code>.
<li>Ah, here&#8217;s the fix. Instead of taking the last element of the byte array, use <a href=native-datatypes.html#slicinglists>list slicing</a> to create a new byte array containing just the last element. That is, start with the last element and continue the slice until the end of the byte array. Now <var>mLastChar</var> is a byte array of length 1.
<li>Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.
</ol>
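<p>The same steps as a script rather than an interactive session. A sketch only; the names <var>last_item</var>, <var>last_slice</var>, <var>int_plus_bytes_ok</var> and <var>combined</var> are made up for illustration.

```python
a_buf = b'\xEF\xBB\xBF'

# Indexing a byte array gives an integer...
last_item = a_buf[-1]

# ...while slicing gives a new byte array, here of length 1.
last_slice = a_buf[-1:]

# An integer cannot be concatenated with a byte array.
try:
    last_item + a_buf
    int_plus_bytes_ok = True
except TypeError:
    int_plus_bytes_ok = False

# Two byte arrays concatenate into a longer byte array.
combined = last_slice + a_buf

print(last_item, type(last_item).__name__)
print(last_slice, len(combined))
print(int_plus_bytes_ok)
```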
<p>So, to ensure that the <code>feed()</code> method in <code>universaldetector.py</code> continues to work no matter how often it's called, you need to <a href=#cantconvertbytesobject>initialize <var>self._mLastChar</var> as a 0-length byte array</a>, then <em>make sure it stays a byte array</em>.
<p>So, to ensure that the <code>feed()</code> method in <code>universaldetector.py</code> continues to work no matter how often it&#8217;s called, you need to <a href=#cantconvertbytesobject>initialize <var>self._mLastChar</var> as a 0-length byte array</a>, then <em>make sure it stays a byte array</em>.
<pre><code> self._escDetector.search(self._mLastChar + aBuf):
self._mInputState = eEscAscii
<del>- self._mLastChar = aBuf[-1]</del>
<ins>+ self._mLastChar = aBuf[-1:]</ins></code></pre>
<h3 id=ordexpectedstring><code>ord()</code> expected string of length 1, but <code>int</code> found</h3>
<p>Tired yet? You're almost there&hellip;
<p>Tired yet? You&#8217;re almost there&hellip;
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml</samp>
@@ -839,19 +839,19 @@ def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
byteCls = self._mModel['classTable'][ord(c)]</code></pre>
<p>That's no help; it's just passed into the function. Let's pop the stack.
<p>That&#8217;s no help; it&#8217;s just passed into the function. Let&#8217;s pop the stack.
<pre><code># utf8prober.py
def feed(self, aBuf):
for c in aBuf:
codingState = self._mCodingSM.next_state(c)</code></pre>
<p>And now we have the answer. Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That's what you get when you iterate over a string &mdash; all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there's no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
<p>And now we have the answer. Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That&#8217;s what you get when you iterate over a string &mdash; all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there&#8217;s no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
<p>Thus:
<pre><code> def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
<del>- byteCls = self._mModel['classTable'][ord(c)]</del>
<ins>+ byteCls = self._mModel['classTable'][c]</ins></code></pre>
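<p>A small sketch of why <code>ord()</code> is no longer needed. The lookup table <var>class_table</var> is hypothetical and stands in for the model&#8217;s <code>'classTable'</code>; its contents are invented.

```python
a_buf = b'\xEF\xBB\xBF'

# Iterating over a byte array yields integers, not 1-character strings.
items = [c for c in a_buf]

# A hypothetical 256-entry lookup table; only one entry is non-zero.
class_table = [0] * 256
class_table[0xEF] = 1

# Each integer can index the table directly.
classes = [class_table[c] for c in a_buf]

# Calling ord() on an integer raises TypeError.
try:
    ord(a_buf[0])
    ord_ok = True
except TypeError:
    ord_ok = False

print(items)
print(classes)
print(ord_ok)
```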
<p>Searching the entire codebase for instances of <code>"ord(c)"</code> uncovers similar problems in <code>sbcharsetprober.py</code>&hellip;
<p>Searching the entire codebase for instances of &#8220;<code>ord(c)</code>&#8221; uncovers similar problems in <code>sbcharsetprober.py</code>&hellip;
<pre><code># sbcharsetprober.py
def feed(self, aBuf):
if not self._mModel['keepEnglishLetter']:
@@ -887,7 +887,7 @@ def feed(self, aBuf):
<ins>+ charClass = Latin1_CharToClass[c]</ins>
</code></pre>
<h3 id=unorderabletypes>Unorderable types: <code>int()</code> >= <code>str()</code></h3>
<p>Let's go again.
<p>Let&#8217;s go again.
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml</samp>
@@ -905,8 +905,8 @@ tests\Big5\0804.blogspot.com.xml</samp>
File "C:\home\chardet\chardet\jpcntx.py", line 176, in get_order
if ((aStr[0] >= '\x81') and (aStr[0] &lt;= '\x9F')) or \
TypeError: unorderable types: int() >= str()</samp></pre>
<p>Did you notice? This time around, the code passed the first test case (<code>tests\ascii\howto.diveintomark.org.xml</code>). You're making real progress here.
<p>So what's this all about? &#8220;Unorderable types&#8221;? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:
<p>Did you notice? This time around, the code passed the first test case (<code>tests\ascii\howto.diveintomark.org.xml</code>). You&#8217;re making real progress here.
<p>So what&#8217;s this all about? &#8220;Unorderable types&#8221;? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:
<pre><code>class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
@@ -916,7 +916,7 @@ TypeError: unorderable types: int() >= str()</samp></pre>
charLen = 2
else:
charLen = 1</code></pre>
<p>And where does <var>aStr</var> come from? Let's pop the stack:
<p>And where does <var>aStr</var> come from? Let&#8217;s pop the stack:
<pre><code>def feed(self, aBuf, aLen):
.
.
@@ -924,9 +924,9 @@ TypeError: unorderable types: int() >= str()</samp></pre>
i = self._mNeedToSkipCharNum
while i &lt; aLen:
<mark> order, charLen = self.get_order(aBuf[i:i+2])</mark></code></pre>
<p>Oh look, it's our old friend, <var>aBuf</var>. As you might have guessed from every other issue we've encountered in this chapter, <var>aBuf</var> is a byte array. Here, the <code>feed()</code> method isn't just passing it on wholesale; it's slicing it. But as you saw <a href=#unsupportedoperandtypeforplus>earlier in this chapter</a>, slicing a byte array returns a byte array, so the <var>aStr</var> parameter that gets passed to the <code>get_order()</code> method is still a byte array.
<p>And what is this code trying to do with <var>aStr</var>? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because <var>aStr</var> and <var>aBuf</var> were strings, and <var>aStr[0]</var> would be a string, and you can compare strings for inequality. But in Python 3, <var>aStr</var> and <var>aBuf</var> are byte arrays, <var>aStr[0]</var> is an integer, and you can't compare integers and strings for inequality without explicitly coercing one of them.
<p>In this case, there's no need to make the code more complicated by adding an explicit coercion. <var>aStr[0]</var> yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers.
<p>Oh look, it&#8217;s our old friend, <var>aBuf</var>. As you might have guessed from every other issue we&#8217;ve encountered in this chapter, <var>aBuf</var> is a byte array. Here, the <code>feed()</code> method isn&#8217;t just passing it on wholesale; it&#8217;s slicing it. But as you saw <a href=#unsupportedoperandtypeforplus>earlier in this chapter</a>, slicing a byte array returns a byte array, so the <var>aStr</var> parameter that gets passed to the <code>get_order()</code> method is still a byte array.
<p>And what is this code trying to do with <var>aStr</var>? It&#8217;s taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because <var>aStr</var> and <var>aBuf</var> were strings, and <var>aStr[0]</var> would be a string, and you can compare strings for inequality. But in Python 3, <var>aStr</var> and <var>aBuf</var> are byte arrays, <var>aStr[0]</var> is an integer, and you can&#8217;t compare integers and strings for inequality without explicitly coercing one of them.
<p>In this case, there&#8217;s no need to make the code more complicated by adding an explicit coercion. <var>aStr[0]</var> yields an integer; the things you&#8217;re comparing to are all constants. Let&#8217;s change them from 1-character strings to integers.
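<p>To make the type mismatch concrete first, here is a small sketch using made-up sample bytes (an assumption for illustration, not data from the test suite). Slicing <code>bytes</code> gives <code>bytes</code>, indexing gives an <code>int</code>, and the <code>int</code> compares cleanly against integer constants but not against 1-character strings.

```python
# Hypothetical sample bytes, chosen so the first byte falls in 0x81..0x9F.
buf = b"\x81\x40\xa0"

first_two = buf[0:2]   # slicing bytes returns bytes
first = first_two[0]   # indexing bytes returns an int

print(type(first_two).__name__, type(first).__name__)   # bytes int

# Ordering an int against a str raises TypeError in Python 3.
try:
    first >= "\x81"
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Comparing against integer constants works without any coercion.
in_range = 0x81 <= first <= 0x9F
print(comparison_failed, in_range)   # True True
```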
<pre><code> class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
@@ -1115,7 +1115,7 @@ tests\Big5\0804.blogspot.com.xml</samp>
File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
total = reduce(operator.add, self._mFreqCounter)
NameError: global name 'reduce' is not defined</samp></pre>
<p>According to the official <a href=http://docs.python.org/3.0/whatsnew/3.0.html#builtins>What's New In Python 3.0</a> guide, the <code>reduce()</code> function has been moved out of the global namespace and into the <code>functools</code> module. Quoting the guide: "Use <code>functools.reduce()</code> if you really need it; however, 99 percent of the time an explicit <code>for</code> loop is more readable." You can read more about the decision from Guido van Rossum's weblog: <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=98196">The fate of reduce() in Python 3000</a>.
<p>According to the official <a href=http://docs.python.org/3.0/whatsnew/3.0.html#builtins>What&#8217;s New In Python 3.0</a> guide, the <code>reduce()</code> function has been moved out of the global namespace and into the <code>functools</code> module. Quoting the guide: &#8220;Use <code>functools.reduce()</code> if you really need it; however, 99 percent of the time an explicit <code>for</code> loop is more readable.&#8221; You can read more about the decision from Guido van Rossum&#8217;s weblog: <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=98196">The fate of reduce() in Python 3000</a>.
<pre><code>def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
@@ -1129,7 +1129,7 @@ NameError: global name 'reduce' is not defined</samp></pre>
<del>- total = reduce(operator.add, self._mFreqCounter)</del>
<ins>+ total = sum(self._mFreqCounter)</ins></code></pre>
<p>Since you're no longer using the <code>operator</code> module, you can remove that <code>import</code> from the top of the file as well.
<p>Since you&#8217;re no longer using the <code>operator</code> module, you can remove that <code>import</code> from the top of the file as well.
<pre><code> from .charsetprober import CharSetProber
from . import constants
<del>- import operator</del></code></pre>
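<p>As a quick check that the replacement is equivalent, the sketch below compares the three spellings on a made-up list of counts (<var>freq_counter</var> is a hypothetical stand-in for <var>self._mFreqCounter</var>).

```python
import functools
import operator

# Hypothetical counts standing in for self._mFreqCounter.
freq_counter = [3, 0, 5, 2]

# Python 3 spelling of the original code: reduce() now lives in functools.
total_reduce = functools.reduce(operator.add, freq_counter)

# The built-in sum() does the same job with no imports.
total_sum = sum(freq_counter)

# The explicit loop that the "What's New" guide recommends for readability.
total_loop = 0
for count in freq_counter:
    total_loop += count

print(total_reduce, total_sum, total_loop)   # 10 10 10
```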
@@ -1172,11 +1172,11 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide
<h2 id=summary>Summary</h2>
<p>What have we learned?
<ol>
<li>Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There's no way around it. It's hard.
<li>The <a href=porting-code-to-python-3-with-2to3.html>automated <code>2to3</code> tool</a> is helpful as far as it goes, but it will only do the easy parts &mdash; function renames, module renames, syntax changes. It's an impressive piece of engineering, but in the end it's just an intelligent search-and-replace bot.
<li>The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the <code>chardet</code> library is to convert a stream of bytes into a string. But &#8220;a stream of bytes&#8221; comes up more often than you might think. Reading a file in &#8220;binary&#8221; mode? You'll get a stream of bytes. Fetching a web page? Calling a web <abbr>API</abbr>? They return a stream of bytes, too.
<li>Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There&#8217;s no way around it. It&#8217;s hard.
<li>The <a href=porting-code-to-python-3-with-2to3.html>automated <code>2to3</code> tool</a> is helpful as far as it goes, but it will only do the easy parts &mdash; function renames, module renames, syntax changes. It&#8217;s an impressive piece of engineering, but in the end it&#8217;s just an intelligent search-and-replace bot.
<li>The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the <code>chardet</code> library is to convert a stream of bytes into a string. But &#8220;a stream of bytes&#8221; comes up more often than you might think. Reading a file in &#8220;binary&#8221; mode? You&#8217;ll get a stream of bytes. Fetching a web page? Calling a web <abbr>API</abbr>? They return a stream of bytes, too.
<li><em>You</em> need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
<li>Test cases are essential. Don't port anything without them. Don't even try. The <em>only</em> reason I have any confidence at all that <code>chardet</code> works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I <em>never</em> would have found half of these problems with manual spot-checking.
<li>Test cases are essential. Don&#8217;t port anything without them. Don&#8217;t even try. The <em>only</em> reason I have any confidence at all that <code>chardet</code> works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I <em>never</em> would have found half of these problems with manual spot-checking.
</ol>
<p class=nav><a rel=prev class=todo><span>&#x261C;</span></a> <a rel=next href=porting-code-to-python-3-with-2to3.html title="onward to &#8220;Porting Code to Python 3 with 2to3&#8221;"><span>&#x261E;</span></a>
+2 -1
@@ -22,7 +22,8 @@ h1:before{content:""}
<p>You can see the <a href=table-of-contents.html>full table of contents</a> (<strong>not finalized</strong>), or read what I&#8217;ve written so far:</p>
<ol start=0>
<ol start=-1>
<li><a href=whats-new.html>What&#8217;s New in &#8220;Dive Into Python 3&#8221;</a>
<li class=todo>Installing Python
<li><a href=your-first-python-program.html>Your First Python Program</a>
<li><a href=native-datatypes.html>Native Datatypes</a>
+16 -47
@@ -40,33 +40,33 @@ body{counter-reset:h1 6}
self.a, self.b = self.b, self.a + self.b
return fib</code></pre>
<p>Let's take that one line at a time.
<p>Let&#8217;s take that one line at a time.
<pre><code>class Fib:</code></pre>
<p><code>class</code>? What's a class?
<p><code>class</code>? What&#8217;s a class?
<h2 id=defining-classes>Defining Classes</h2>
<p>Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you've defined.
<p>Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you&#8217;ve defined.
<p>Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word <code>class</code>, followed by the class name. Technically, that's all that's required, since a class doesn't need to inherit from any other class.
<p>Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word <code>class</code>, followed by the class name. Technically, that&#8217;s all that&#8217;s required, since a class doesn&#8217;t need to inherit from any other class.
<pre><code>
class PapayaWhip: <span>&#x2460;</span>
pass <span>&#x2461;</span></code></pre>
<ol>
<li>The name of this class is <code>PapayaWhip</code>, and it doesn't inherit from any other class. Class names are usually capitalized, <code>EachWordLikeThis</code>, but this is only a convention, not a requirement.
<li>The name of this class is <code>PapayaWhip</code>, and it doesn&#8217;t inherit from any other class. Class names are usually capitalized, <code>EachWordLikeThis</code>, but this is only a convention, not a requirement.
<li>You probably guessed this, but everything in a class is indented, just like the code within a function, <code>if</code> statement, <code>for</code> loop, or any other block of code. The first line not indented is outside the class.
</ol>
<p>This <code>PapayaWhip</code> class doesn't define any methods or attributes, but syntactically, there needs to be something in the definition, thus the <code>pass</code> statement. This is a Python reserved word that just means &#8220;move along, nothing to see here&#8221;. It's a statement that does nothing, and it's a good placeholder when you're stubbing out functions or classes.
<p>This <code>PapayaWhip</code> class doesn&#8217;t define any methods or attributes, but syntactically, there needs to be something in the definition, thus the <code>pass</code> statement. This is a Python reserved word that just means &#8220;move along, nothing to see here&#8221;. It&#8217;s a statement that does nothing, and it&#8217;s a good placeholder when you&#8217;re stubbing out functions or classes.
<blockquote class="note compare java">
<p><span>&#x261E;</span>The <code>pass</code> statement in Python is like an empty set of curly braces (<code>{}</code>) in Java or C.
</blockquote>
<p>Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Although it's not required, Python classes <em>can</em> have something similar to a constructor: the <code>__init__()</code> method.
<p>Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don&#8217;t have explicit constructors and destructors. Although it&#8217;s not required, Python classes <em>can</em> have something similar to a constructor: the <code>__init__()</code> method.
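<p>A short sketch of both cases: a class whose body is only <code>pass</code>, and a class with an <code>__init__()</code> method. The <code>Greeter</code> class and its <var>name</var> attribute are hypothetical names introduced here for illustration.

```python
class PapayaWhip:
    pass   # nothing to define, but the body cannot be empty


class Greeter:   # hypothetical class, not from the book's examples
    def __init__(self, name):
        # Runs right after the instance is created.
        self.name = name


p = PapayaWhip()        # instantiation works even with an empty class
g = Greeter("world")    # the argument is passed through to __init__()

print(type(p).__name__)   # PapayaWhip
print(g.name)             # world
```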
<h3 id=init-method>The <code>__init__()</code> Method</h3>
@@ -79,10 +79,10 @@ class Fib:
<a> def __init__(self, max): <span>&#x2461;</span></code></pre>
<ol>
<li>Classes can (and should) have <code>docstring</code>s too, just like modules and functions.
<li>The <code>__init__()</code> method is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It's tempting, because it looks like a constructor (by convention, the <code>__init__()</code> method is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the <code>__init__()</code> method is called, and you already have a valid reference to the new instance of the class.
<li>The <code>__init__()</code> method is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It&#8217;s tempting, because it looks like a constructor (by convention, the <code>__init__()</code> method is the first method defined for the class), acts like one (it&#8217;s the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the <code>__init__()</code> method is called, and you already have a valid reference to the new instance of the class.
</ol>
<p>The first argument of every class method, including the <code>__init__()</code> method, is always a reference to the current instance of the class. By convention, this argument is named <var>self</var>. This argument fills the role of the reserved word <code>this</code> in <abbr>C++</abbr> or Java, but <var>self</var> is not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything but <var>self</var>; this is a very strong convention.
<p>The first argument of every class method, including the <code>__init__()</code> method, is always a reference to the current instance of the class. By convention, this argument is named <var>self</var>. This argument fills the role of the reserved word <code>this</code> in <abbr>C++</abbr> or Java, but <var>self</var> is not a reserved word in Python, merely a naming convention. Nonetheless, please don&#8217;t call it anything but <var>self</var>; this is a very strong convention.
<p>In the <code>__init__()</code> method, <var>self</var> refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify <var>self</var> explicitly when defining the method, you do <em>not</em> specify it when calling the method; Python will add it for you automatically.
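<p>The sketch below illustrates how <var>self</var> is supplied, using a hypothetical <code>Counter</code> class. Calling a method on an instance and calling the same function through the class with the instance passed explicitly do the same thing.

```python
class Counter:   # hypothetical class for illustration
    def __init__(self, start):
        self.value = start   # self is the newly created instance

    def bump(self, amount):
        self.value += amount   # self is the instance the method was called on
        return self.value


c = Counter(10)

# Normal call: Python passes c as self automatically.
a = c.bump(5)

# Equivalent call through the class, passing the instance explicitly.
b = Counter.bump(c, 5)

print(a, b)   # 15 20
```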
@@ -99,10 +99,10 @@ class Fib:
<a><samp class=p>>>> </samp><kbd>fib.__doc__</kbd> <span>&#x2463;</span></a>
<samp>'iterator that yields numbers in the Fibanocci sequence'</samp></code></pre>
<ol>
<li>You are creating an instance of the <code>Fib</code> class (defined in the <code>fibonacci2</code> module) and assigning the newly created instance to the variable <var>fib</var>. You are passing one parameter, <code>100</code>, which will end up as the <var>max</var> argument in <code>Fib</code>'s <code>__init__()</code> method.
<li>You are creating an instance of the <code>Fib</code> class (defined in the <code>fibonacci2</code> module) and assigning the newly created instance to the variable <var>fib</var>. You are passing one parameter, <code>100</code>, which will end up as the <var>max</var> argument in <code>Fib</code>&#8217;s <code>__init__()</code> method.
<li><var>fib</var> is now an instance of the <code>Fib</code> class.
<li>Every class instance has a built-in attribute, <code>__class__</code>, which is the object's class. Java programmers may be familiar with the <code>Class</code> class, which contains methods like <code>getName</code> and <code>getSuperclass</code> to get metadata information about an object. In Python, this kind of metadata is available directly on the object itself through attributes like <code>__class__</code>, <code>__name__</code>, and <code>__bases__</code>.
<li>You can access the instance's <code>docstring</code> just as with a function or a module. All instances of a class share the same <code>docstring</code>.
<li>Every class instance has a built-in attribute, <code>__class__</code>, which is the object&#8217;s class. Java programmers may be familiar with the <code>Class</code> class, which contains methods like <code>getName</code> and <code>getSuperclass</code> to get metadata information about an object. In Python, this kind of metadata is available directly on the object itself through attributes like <code>__class__</code>, <code>__name__</code>, and <code>__bases__</code>.
<li>You can access the instance&#8217;s <code>docstring</code> just as with a function or a module. All instances of a class share the same <code>docstring</code>.
</ol>
<blockquote class="note compare java">
@@ -117,7 +117,7 @@ class Fib:
def __init__(self, max):
<a> self.max = max <span>&#x2460;</span></a></code></pre>
<ol>
<li>What is <var>self.max</var>? It's an instance variable. It is completely separate from <var>max</var>, which was passed into the <code>__init__()</code> method as an argument. <var>self.max</var> is &#8220;global&#8221; to the instance. That means that you can access it from other methods.
<li>What is <var>self.max</var>? It&#8217;s an instance variable. It is completely separate from <var>max</var>, which was passed into the <code>__init__()</code> method as an argument. <var>self.max</var> is &#8220;global&#8221; to the instance. That means that you can access it from other methods.
</ol>
<pre><code>class Fib:
@@ -147,7 +147,7 @@ class Fib:
<h2 id=a-fibonacci-iterator>A Fibonacci Iterator</h2>
<p><em>Now</em> you're ready to learn how to build an iterator. An iterator is just a class that defines an <code>__iter__()</code> method.
<p><em>Now</em> you&#8217;re ready to learn how to build an iterator. An iterator is just a class that defines an <code>__iter__()</code> method.
<p class=d>[<a href=examples/fibonacci2.py>download <code>fibonacci2.py</code></a>]
<pre><code><a>class Fib: <span>&#x2460;</span></a>
@@ -195,7 +195,7 @@ class Fib:
<h2 id=a-plural-rule-iterator>A Plural Rule Iterator</h2>
<aside>iter(f) calls f.__iter__<br>next(f) calls f.__next__</aside>
<p>Now it&#8217;s time for the finale. Let's rewrite the <a href=generators.html>plural rules generator</a> as an iterator.
<p>Now it&#8217;s time for the finale. Let&#8217;s rewrite the <a href=generators.html>plural rules generator</a> as an iterator.
<p class=d>[<a href=examples/plural6.py>download <code>plural6.py</code></a>]
<pre><code>class LazyRules:
@@ -246,7 +246,7 @@ rules = LazyRules()</code></pre>
<li>Also, this is a good place to initialize the cache, which you&#8217;ll use later as you read the patterns from the pattern file.
</ol>
<p>Before we continue, let's take a closer look at <var>rules_f</var>. It's not defined within the <code>__init__()</code> method. In fact, it's not defined within <em>any</em> method. It's defined at the class level. It's a <i>class variable</i>, and although you can access it just like an instance variable (<var>self.rules_f</var>), it is shared across all instances of the <code>LazyRules</code> class.
<p>Before we continue, let&#8217;s take a closer look at <var>rules_f</var>. It&#8217;s not defined within the <code>__init__()</code> method. In fact, it&#8217;s not defined within <em>any</em> method. It&#8217;s defined at the class level. It&#8217;s a <i>class variable</i>, and although you can access it just like an instance variable (<var>self.rules_f</var>), it is shared across all instances of the <code>LazyRules</code> class.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import plural6</kbd>
@@ -364,34 +364,3 @@ rules = LazyRules()</code></pre>
<p class=c>&copy; 2001&ndash;9 <a href=about.html>Mark Pilgrim</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
<!--
FIXME some good stuff here about calling ancestor's methods explicitly. need to find where to put it once we have an example of a class that inherits from something else.
<li>Some pseudo-object-oriented languages like Powerbuilder have a concept of &#8220;extending&#8221; constructors and other events, where the ancestor's method is called automatically before the descendant's method is executed. Python does not do this; you must always explicitly call the appropriate method in the ancestor class.
<li>I told you that this class acts like a dictionary, and here is the first sign of it. You're assigning the argument <var>filename</var> as the value of this object's <code>name</code> key.
<li>Note that the <code>__init__</code> method never returns a value.
<h3>5.3.2. Knowing When to Use <var>self</var> and <code>__init__</code></h3>
<p>When defining your class methods, you <em>must</em> explicitly list <var>self</var> as the first argument for each method, including <code>__init__</code>. When you call a method of an ancestor class from within your class, you <em>must</em> include the <var>self</var> argument. But when you call your class method from outside, you do not specify anything for the <var>self</var> argument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent,
but it may appear inconsistent because it relies on a distinction (between bound and unbound methods) that you don't know
about yet.
<p>Whew. I realize that's a lot to absorb, but you'll get the hang of it. All Python classes work the same way, so once you learn one, you've learned them all. If you forget everything else, remember this
one thing, because I promise it will trip you up:<table id="tip.initoptional" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>__init__</code> methods are optional, but when you define one, you must remember to explicitly call the ancestor's <code>__init__</code> method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor,
the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments.
<div class=itemizedlist>
<h3>Further Reading on Python Classes</h3>
<ul>
<li><a href="http://www.freenetpages.co.uk/hp/alan.gauld/" title="Python book for first-time programmers"><i class=citetitle>Learning to Program</i></a> has a gentler <a href="http://www.freenetpages.co.uk/hp/alan.gauld/tutclass.htm">introduction to classes</a>.
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class=citetitle>How to Think Like a Computer Scientist</i></a> shows how to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap12.htm">use classes to model compound datatypes</a>.
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> has an in-depth look at <a href="http://www.python.org/doc/current/tut/node11.html">classes, namespaces, and inheritance</a>.
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/242">common questions about classes</a>.
</ul>
-->
+23 -23
@@ -17,7 +17,7 @@ body{counter-reset:h1 2}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
<p class=f>Cast aside <a href=your-first-python-program.html>your first Python program</a> for just a minute, and let's talk about datatypes. In Python, <a href=your-first-python-program.html#datatypes>every variable has a datatype</a>, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally.
<p class=f>Cast aside <a href=your-first-python-program.html>your first Python program</a> for just a minute, and let&#8217;s talk about datatypes. In Python, <a href=your-first-python-program.html#datatypes>every variable has a datatype</a>, but you don&#8217;t need to declare it explicitly. Based on each variable&#8217;s original assignment, Python figures out what type it is and keeps track of that internally.
<p>Python has many native datatypes. Here are the important ones:
<ol>
<li><b>Booleans</b> are either <code>True</code> or <code>False</code>.
@@ -28,8 +28,8 @@ body{counter-reset:h1 2}
<li><b>Sets</b> are unordered bags of values.
<li><b>Dictionaries</b> are unordered bags of key-value pairs.
</ol>
<p>Of course, there are a lot more types than these seven. <a href=your-first-python-program.html#everythingisanobject>Everything is an object</a> in Python, so there are types like <i>module</i>, <i>function</i>, <i>class</i>, <i>method</i>, <i>file</i>, and even <i>compiled code</i>. You've already seen some of these: <a href=your-first-python-program.html#runningscripts>modules have names</a>, <a href=your-first-python-program.html#docstrings>functions have <code>docstrings</code></a>, <i class=baa>&amp;</i>c. You'll learn about classes in [FIXME xref] and files in [FIXME xref].
<p>Strings and bytes are important enough &mdash; and complicated enough &mdash; that they get their own chapter. Let's look at the others first.
<p>Of course, there are a lot more types than these seven. <a href=your-first-python-program.html#everythingisanobject>Everything is an object</a> in Python, so there are types like <i>module</i>, <i>function</i>, <i>class</i>, <i>method</i>, <i>file</i>, and even <i>compiled code</i>. You&#8217;ve already seen some of these: <a href=your-first-python-program.html#runningscripts>modules have names</a>, <a href=your-first-python-program.html#docstrings>functions have <code>docstrings</code></a>, <i class=baa>&amp;</i>c. You&#8217;ll learn about classes in [FIXME xref] and files in [FIXME xref].
<p>Strings and bytes are important enough &mdash; and complicated enough &mdash; that they get their own chapter. Let&#8217;s look at the others first.
<h2 id=booleans>Booleans</h2>
<aside>You can use virtually any expression in a boolean context.</aside>
<p>Booleans are either true or false. Python has two constants, <code>True</code> and <code>False</code>, which can be used to assign boolean values directly. Expressions can also evaluate to a boolean value. In certain places (like <code>if</code> statements), Python expects an expression to evaluate to a boolean value. These places are called <i>boolean contexts</i>. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.)
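<p>As a small preview of what a boolean context does, this sketch runs a handful of values through an <code>if</code> statement. The <code>describe()</code> helper is a hypothetical name used only here.

```python
def describe(value):
    # The if statement is a boolean context: Python decides whether
    # the value counts as true or false.
    if value:
        return "true"
    return "false"


samples = [0, 1, -1, 0.0, "", "a", [], [0], None]
results = {repr(v): describe(v) for v in samples}

for text, verdict in results.items():
    print(text, "is", verdict)
```

<p>Zero, empty containers, the empty string, and <code>None</code> come out false; everything else in this list comes out true.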
@@ -48,7 +48,7 @@ body{counter-reset:h1 2}
<samp class=p>>>> </samp><kbd>size &lt; 0</kbd>
<samp>True</samp></pre>
<h2 id=numbers>Numbers</h2>
<p>Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There's no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point.
<p>Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There&#8217;s no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>type(1)</kbd> <span>&#x2460;</span></a>
<samp>&lt;class 'int'></samp>
@@ -82,7 +82,7 @@ body{counter-reset:h1 2}
<li>You can explicitly coerce an <code>int</code> to a <code>float</code> by calling the <code>float()</code> function.
<li>Unsurprisingly, you can also coerce a <code>float</code> to an <code>int</code> by calling <code>int()</code>.
<li>The <code>int()</code> function will truncate, not round.
<li>The <code>int()</code> function truncates negative numbers towards <code>0</code>. It's a true truncate function, not a floor function.
<li>The <code>int()</code> function truncates negative numbers towards <code>0</code>. It&#8217;s a true truncate function, not a floor function.
<li>Floating point numbers are accurate to 15 decimal places.
<li>Integers can be arbitrarily large.
</ol>
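<p>The difference between truncating and flooring only shows up for negative numbers. A minimal sketch, using <code>math.floor()</code> from the standard library for comparison:

```python
import math

# int() drops the fractional part, moving towards 0 in both directions.
truncated_pos = int(2.9)
truncated_neg = int(-2.9)

# math.floor() always moves down, towards negative infinity.
floored_pos = math.floor(2.9)
floored_neg = math.floor(-2.9)

print(truncated_pos, floored_pos)   # 2 2
print(truncated_neg, floored_neg)   # -2 -3

# Coercion in the other direction keeps the value but changes the type.
as_float = float(2)
print(as_float, type(as_float).__name__)   # 2.0 float
```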
@@ -108,8 +108,8 @@ body{counter-reset:h1 2}
<ol>
<li>The <code>/</code> operator performs floating point division. It returns a <code>float</code> even if both the numerator and denominator are <code>int</code>s.
<li>The <code>//</code> operator performs a quirky kind of integer division. When the result is positive, you can think of it as truncating (not rounding) to <code>0</code> decimal places, but be careful with that.
<li>When integer-dividing negative numbers, the <code>//</code> operator rounds &#8220;up&#8221; to the nearest integer. Mathematically speaking, it's rounding &#8220;down&#8221; since <code>&minus;6</code> is less than <code>&minus;5</code>, but it could trip you up if you're expecting it to truncate to <code>&minus;5</code>.
<li>The <code>//</code> operator doesn't always return an integer. If either the numerator or denominator is a <code>float</code>, it will still round to the nearest integer, but the actual return value will be a <code>float</code>.
<li>When integer-dividing negative numbers, the <code>//</code> operator rounds &#8220;up&#8221; to the nearest integer. Mathematically speaking, it&#8217;s rounding &#8220;down&#8221; since <code>&minus;6</code> is less than <code>&minus;5</code>, but it could trip you up if you expecting it to truncate to <code>&minus;5</code>.
<li>The <code>//</code> operator doesn&#8217;t always return an integer. If either the numerator or denominator is a <code>float</code>, it will still round to the nearest integer, but the actual return value will be a <code>float</code>.
<li>The <code>**</code> operator means &#8220;raised to the power of.&#8221; <code>11<sup>2</sup></code> is <code>121</code>.
<li>The <code>%</code> operator gives the remainder after performing integer division. <code>11</code> divided by <code>2</code> is <code>5</code> with a remainder of <code>1</code>, so the result here is <code>1</code>.
</ol>
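<p>As a quick recap of the operators described above, here is a small sketch. The variable names are invented for this example.

```python
# Each operator from the notes above, applied to 11 and 2.
quotient = 11 / 2           # true division always returns a float
floored = 11 // 2           # integer division of positive ints
floored_neg = -11 // 2      # rounds toward negative infinity, not toward 0
floored_float = 11.0 // 2   # still a whole number, but typed as float
power = 11 ** 2             # raised to the power of
remainder = 11 % 2          # remainder after integer division

print(quotient, floored, floored_neg, floored_float, power, remainder)
```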
@@ -117,7 +117,7 @@ body{counter-reset:h1 2}
<p><span>&#x261E;</span>In Python 2, the <code>/</code> operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the <code>/</code> operator always means floating point division. See <a href=http://www.python.org/dev/peps/pep-0238/><abbr>PEP</abbr> 238</a> for details.
</blockquote>
<h3 id=fractions>Fractions</h3>
<p>Python isn't limited to integers and floating point numbers. It can also do all the fancy math you learned in high school and promptly forgot about.
<p>Python isn&#8217;t limited to integers and floating point numbers. It can also do all the fancy math you learned in high school and promptly forgot about.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>import fractions</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>x = fractions.Fraction(1, 3)</kbd> <span>&#x2461;</span></a>
@@ -144,7 +144,7 @@ body{counter-reset:h1 2}
<a><samp class=p>>>> </samp><kbd>math.tan(math.pi / 4)</kbd> <span>&#x2462;</span></a>
<samp>0.99999999999999989</samp></pre>
<ol>
<li>The <code>math</code> module has a constant for &pi;, the ratio of a circle's circumference to its diameter.
<li>The <code>math</code> module has a constant for &pi;, the ratio of a circle&#8217;s circumference to its diameter.
<li>The <code>math</code> module has all the basic trigonometric functions, including <code>sin()</code>, <code>cos()</code>, <code>tan()</code>, and variants like <code>asin()</code>.
<li>Note, however, that Python does not have infinite precision. <code>tan(&pi; / 4)</code> should return <code>1.0</code>, not <code>0.99999999999999989</code>.
</ol>
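<p>One common way to live with that limited precision is to compare floats with a tolerance instead of testing for exact equality. The sketch below assumes Python 3.5 or later, where <code>math.isclose()</code> is available; it is not part of the example above.

```python
import math

result = math.tan(math.pi / 4)

# Compare with a tolerance rather than with ==.
is_close = math.isclose(result, 1.0)

print(result, is_close)
```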
@@ -176,16 +176,16 @@ body{counter-reset:h1 2}
<ol>
<li>Did you know you can define your own functions in the Python interactive shell? Just press <kbd>ENTER</kbd> at the end of each line, and <kbd>ENTER</kbd> on a blank line to finish.
<li>In a boolean context, non-zero integers are true; <code>0</code> is false.
<li>Non-zero floating point numbers are true; <code>0.0</code> is false. Be careful with this one! If there's the slightest rounding error (not impossible, as you saw in the previous section) then Python will be testing <code>0.0000000000001</code> instead of <code>0</code> and will return <code>True</code>.
<li>Non-zero floating point numbers are true; <code>0.0</code> is false. Be careful with this one! If there&#8217;s the slightest rounding error (not impossible, as you saw in the previous section) then Python will be testing <code>0.0000000000001</code> instead of <code>0</code> and will return <code>True</code>.
<li>Fractions can also be used in a boolean context. <code>Fraction(0, n)</code> is false for all values of <var>n</var>. All other fractions are true.
</ol>
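<p>The truth-value rules above can be sketched without the interactive shell. The <code>describe()</code> helper here is hypothetical and only stands in for the function defined in the example.

```python
import fractions

def describe(anything):
    # Hypothetical helper: report how a value behaves in a boolean context.
    return 'true' if anything else 'false'

zero_int = describe(0)
zero_float = describe(0.0)
# A tiny rounding error is enough to make a float "true".
rounding_error = describe(0.1 + 0.2 - 0.3)
zero_fraction = describe(fractions.Fraction(0, 5))
third = describe(fractions.Fraction(1, 3))

print(zero_int, zero_float, rounding_error, zero_fraction, third)
```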
<h2 id=lists>Lists</h2>
<p>Lists are Python's workhorse datatype. When I say &#8220;list,&#8221; you might be thinking &#8220;array whose size I have to declare in advance, that can only contain items of the same type, <i class=baa>&amp;</i>c.&#8221; Don't think that. Lists are much cooler than that.
<p>Lists are Python&#8217;s workhorse datatype. When I say &#8220;list,&#8221; you might be thinking &#8220;array whose size I have to declare in advance, that can only contain items of the same type, <i class=baa>&amp;</i>c.&#8221; Don&#8217;t think that. Lists are much cooler than that.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>A list in Python is like an array in Perl 5. In Perl 5, variables that store arrays always start with the <code>@</code> character; in Python, variables can be named anything, and Python keeps track of the datatype internally.
</blockquote>
<blockquote class="note compare java">
<p><span>&#x261E;</span>A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the <code>ArrayList</code> class, which can hold arbitrary objects and can expand dynamically as new items are added.
<p><span>&#x261E;</span>A list in Python is much more than an array in Java (although it can be used as one if that&#8217;s really all you want out of life). A better analogy would be to the <code>ArrayList</code> class, which can hold arbitrary objects and can expand dynamically as new items are added.
</blockquote>
<h3 id=creatinglists>Creating A List</h3>
<p>Creating a list is easy: use square brackets to wrap a comma-separated list of values.
@@ -210,7 +210,7 @@ body{counter-reset:h1 2}
</ol>
<h3 id=slicinglists>Slicing A List</h3>
<aside>a_list[0] is the first item of a_list.</aside>
<p>Once you've defined a list, you can get any part of it as a new list. This is called <i>slicing</i> the list.
<p>Once you&#8217;ve defined a list, you can get any part of it as a new list. This is called <i>slicing</i> the list.
<pre class=screen>
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'mpilgrim', 'z', 'example']</samp>
@@ -228,7 +228,7 @@ body{counter-reset:h1 2}
['a', 'b', 'mpilgrim', 'z', 'example']</pre>
<ol>
<li>You can get a part of a list, called a &#8220;slice&#8221;, by specifying two indices. The return value is a new list containing all the items of the list, in order, starting with the first slice index (in this case <code>a_list[1]</code>), up to but not including the second slice index (in this case <code>a_list[3]</code>).
<li>Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first item you want, and the second slice index specifies the first item you don't want. The return value is everything in between.
<li>Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first item you want, and the second slice index specifies the first item you don&#8217;t want. The return value is everything in between.
<li>Lists are zero-based, so <code>a_list[0:3]</code> returns the first three items of the list, starting at <code>a_list[0]</code>, up to but not including <code>a_list[3]</code>.
<li>If the left slice index is <code>0</code>, you can leave it out, and <code>0</code> is implied. So <code>a_list[:3]</code> is the same as <code>a_list[0:3]</code>, because the starting <code>0</code> is implied.
<li>Similarly, if the right slice index is the length of the list, you can leave it out. So <code>a_list[3:]</code> is the same as <code>a_list[3:5]</code>, because this list has five items. There is a pleasing symmetry here. In this five-item list, <code>a_list[:3]</code> returns the first 3 items, and <code>a_list[3:]</code> returns the last two items. In fact, <code>a_list[:<var>n</var>]</code> will always return the first <var>n</var> items, and <code>a_list[<var>n</var>:]</code> will return the rest, regardless of the length of the list.
@@ -251,12 +251,12 @@ body{counter-reset:h1 2}
<samp class=p>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'a', 2.0, 3, True, 'four', 'e']</samp></pre>
<ol>
<li>The <code>+</code> operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don't all need to be the same type. Here we have a list containing a string, a floating point number, and an integer.
<li>The <code>+</code> operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don&#8217;t all need to be the same type. Here we have a list containing a string, a floating point number, and an integer.
<li>The <code>append()</code> method adds a single item to the end of the list. (Now we have <em>four</em> different datatypes in the list!)
<li>Lists are implemented as classes. &#8220;Creating&#8221; a list is really instantiating a class. As such, a list has methods that operate on it. The <code>extend()</code> method takes one argument, a list, and appends each of the items of the argument to the original list.
<li>The <code>insert()</code> method inserts a single item into a list. The first argument is the index of the first item in the list that will get bumped out of position. List items do not need to be unique; for example, there are now two separate items with the value <code>'a'</code>, <code>a_list[0]</code> and <code>a_list[1]</code>.
</ol>
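<p>The four operations described above can be replayed in a script. This sketch assumes the list starts out as <code>['a']</code>, which is consistent with the final output shown in the example.

```python
a_list = ['a']                  # assumed starting value
a_list = a_list + [2.0, 3]      # concatenation builds a new list
a_list.append(True)             # one item added to the end
a_list.extend(['four', 'e'])    # each item of the argument added to the end
a_list.insert(0, 'a')           # inserted at index 0; everything else shifts

print(a_list)
```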
<p>Let's look closer at the difference between <code>append()</code> and <code>extend()</code>.
<p>Let&#8217;s look closer at the difference between <code>append()</code> and <code>extend()</code>.
<pre class=screen>
<samp class=p>>>> </samp><kbd>a_list = ['a', 'b', 'c']</kbd>
<a><samp class=p>>>> </samp><kbd>a_list.extend(['d', 'e', 'f'])</kbd> <span>&#x2460;</span></a>
@@ -276,8 +276,8 @@ body{counter-reset:h1 2}
<ol>
<li>The <code>extend()</code> method takes a single argument, which is always a list, and adds each of the items of that list to <var>a_list</var>.
<li>If you start with a list of three items and extend it with a list of another three items, you end up with a list of six items.
<li>On the other hand, the <code>append()</code> method takes a single argument, which can be any datatype. Here, you're calling the <code>append()</code> method with a list of three items.
<li>If you start with a list of six items and append a list onto it, you end up with... a list of seven items. Why seven? Because the last item (which you just appended) <em>is itself a list</em>. Lists can contain any type of data, including other lists. That may be what you want, or it may not. But it's what you asked for, and it's what you got.
<li>On the other hand, the <code>append()</code> method takes a single argument, which can be any datatype. Here, you&#8217;re calling the <code>append()</code> method with a list of three items.
<li>If you start with a list of six items and append a list onto it, you end up with... a list of seven items. Why seven? Because the last item (which you just appended) <em>is itself a list</em>. Lists can contain any type of data, including other lists. That may be what you want, or it may not. But it&#8217;s what you asked for, and it&#8217;s what you got.
</ol>
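<p>The six-versus-seven item count described above is easy to confirm. A minimal sketch, where the appended values <code>'g'</code>, <code>'h'</code>, <code>'i'</code> are made up for illustration:

```python
a_list = ['a', 'b', 'c']

a_list.extend(['d', 'e', 'f'])
len_after_extend = len(a_list)   # three items plus three more

a_list.append(['g', 'h', 'i'])   # illustrative values
len_after_append = len(a_list)   # only one more item: the list itself
last_item = a_list[-1]

print(len_after_extend, len_after_append, last_item)
```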
<h3 id=searchinglists>Searching For Values In A List</h3>
<pre class=screen>
@@ -324,7 +324,7 @@ ValueError: list.index(x): x not in list</samp></pre>
<p>FIXME
-->
<h2 id=dictionaries>Dictionaries</h2>
<p>One of Python's most important datatypes is the dictionary, which defines one-to-one relationships between keys and values.
<p>One of Python&#8217;s most important datatypes is the dictionary, which defines one-to-one relationships between keys and values.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>A dictionary in Python is like a hash in Perl 5. In Perl 5, variables that store hashes always start with a <code>%</code> character. In Python, variables can be named anything, and Python keeps track of the datatype internally.
</blockquote>
@@ -346,7 +346,7 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li>First, you create a new dictionary with two items and assign it to the variable <var>a_dict</var>. Each item is a key-value pair, and the whole set of items is enclosed in curly braces.
<li><code>'server'</code> is a key, and its associated value, referenced by <code>a_dict["server"]</code>, is <code>'db.diveintopython3.org'</code>.
<li><code>'database'</code> is a key, and its associated value, referenced by <code>a_dict["database"]</code>, is <code>'mysql'</code>.
<li>You can get values by key, but you can't get keys by value. So <code>a_dict["server"]</code> is <code>'db.diveintopython3.org'</code>, but <code>a_dict["db.diveintopython3.org"]</code> raises an exception, because <code>'db.diveintopython3.org'</code> is not a key.
<li>You can get values by key, but you can&#8217;t get keys by value. So <code>a_dict["server"]</code> is <code>'db.diveintopython3.org'</code>, but <code>a_dict["db.diveintopython3.org"]</code> raises an exception, because <code>'db.diveintopython3.org'</code> is not a key.
</ol>
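<p>If you do need to go from a value back to its key, you have to search for it yourself. The sketch below reuses the dictionary from the example; the list comprehension is just one possible approach, not something the dictionary provides.

```python
a_dict = {'server': 'db.diveintopython3.org', 'database': 'mysql'}

value = a_dict['server']

# Looking up a value as if it were a key raises KeyError.
try:
    a_dict['db.diveintopython3.org']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A manual reverse lookup: collect every key whose value matches.
keys_for_mysql = [key for key, val in a_dict.items() if val == 'mysql']

print(value, lookup_failed, keys_for_mysql)
```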
<h3 id=modifying-dictionaries>Modifying A Dictionary</h3>
<p>Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any time, or you can modify the value of an existing key. Continuing from the previous example:
@@ -370,11 +370,11 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li>You can add new key-value pairs at any time. This syntax is identical to modifying existing values.
<li>The new dictionary item (key <code>'user'</code>, value <code>'mark'</code>) appears to be in the middle. In fact, it was just a coincidence that the items appeared to be in order in the first example; it is just as much a coincidence that they appear to be out of order now.
<li>Assigning a value to an existing dictionary key simply replaces the old value with the new one.
<li>Will this change the value of the <code>user</code> key back to "mark"? No! Look at the key closely &mdash; that's a capital <kbd>U</kbd> in <kbd>"User"</kbd>. Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it's completely different.
<li>Will this change the value of the <code>user</code> key back to "mark"? No! Look at the key closely &mdash; that&#8217;s a capital <kbd>U</kbd> in <kbd>"User"</kbd>. Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it&#8217;s completely different.
</ol>
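<p>The case-sensitivity trap described above can be reproduced in a few lines. The replacement values <code>'blog'</code> and <code>'dora'</code> are assumptions made for this sketch.

```python
a_dict = {'server': 'db.diveintopython3.org', 'database': 'mysql'}

a_dict['database'] = 'blog'   # replaces the value of an existing key
a_dict['user'] = 'mark'       # adds a new key-value pair
a_dict['user'] = 'dora'       # replaces it again
a_dict['User'] = 'mark'       # capital U: a different key altogether

print(len(a_dict), a_dict['user'], a_dict['User'])
```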
<h3 id=mixed-value-dictionaries>Mixed-Value Dictionaries</h3>
<p>Dictionaries aren't just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don't all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.
<p>In fact, you've already seen a dictionary with non-string keys and values, in <a href=your-first-python-program.html#divingin>your first Python program</a>.
<p>Dictionaries aren&#8217;t just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don&#8217;t all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.
<p>In fact, you&#8217;ve already seen a dictionary with non-string keys and values, in <a href=your-first-python-program.html#divingin>your first Python program</a>.
<pre><code>SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),
1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}</code></pre>
<p>Let's tear that apart in the interactive shell.
+21 -21
@@ -27,7 +27,7 @@ td pre{padding:0;border:0}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<p class=f>Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. <a href=case-study-porting-chardet-to-python-3.html#running2to3>Case study: porting <code>chardet</code> to Python 3</a> describes how to run the <code>2to3</code> script, then shows some things it can't fix automatically. This appendix documents what it <em>can</em> fix automatically.
<p class=f>Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. <a href=case-study-porting-chardet-to-python-3.html#running2to3>Case study: porting <code>chardet</code> to Python 3</a> describes how to run the <code>2to3</code> script, then shows some things it can&#8217;t fix automatically. This appendix documents what it <em>can</em> fix automatically.
<h2 id=print><code>print</code> statement</h2>
<p>In Python 2, <code>print</code> was a statement. Whatever you wanted to print simply followed the <code>print</code> keyword. In Python 3, <code>print()</code> is a function &mdash; whatever you want to print is passed to <code>print()</code> like any other function.
<table>
@@ -110,7 +110,7 @@ td pre{padding:0;border:0}
<ol>
<li>Base 10 long integer literals become base 10 integer literals.
<li>Base 16 long integer literals become base 16 integer literals.
<li>In Python 3, the old <code>long()</code> function no longer exists, since longs don't exist. To coerce a variable to an integer, use the <code>int()</code> function.
<li>In Python 3, the old <code>long()</code> function no longer exists, since longs don&#8217;t exist. To coerce a variable to an integer, use the <code>int()</code> function.
<li>To check whether a variable is an integer, get its type and compare it to <code>int</code>, not <code>long</code>.
<li>You can also use the <code>isinstance()</code> function to check data types; again, use <code>int</code>, not <code>long</code>, to check for integers.
</ol>
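<p>After conversion, code along these lines runs under Python 3. This is a minimal sketch with illustrative values, not output from <code>2to3</code>.

```python
x = 1000000000000        # no trailing L needed in Python 3
y = 0xFFFFFFFFFFFF       # hexadecimal literals work the same way

coerced = int('42')                  # int() replaces the old long()
exact_type_check = type(x) is int    # compare against int, not long
instance_check = isinstance(y, int)  # isinstance() also uses int

print(coerced, exact_type_check, instance_check)
```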
@@ -161,7 +161,7 @@ td pre{padding:0;border:0}
<li>Again with the parentheses, for the same reason.
</ol>
<h2 id=dict>Dictionary methods that return lists</h2>
<p>In Python 2, many dictionary methods returned lists. The most frequently used methods were <code>keys()</code>, <code>items()</code>, and <code>values()</code>. In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing.
<p>In Python 2, many dictionary methods returned lists. The most frequently used methods were <code>keys()</code>, <code>items()</code>, and <code>values()</code>. In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method&#8217;s return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing.
<table>
<tr><th>Notes
<th>Python 2
@@ -219,7 +219,7 @@ import CGIHttpServer</code></pre>
<li>The <code>http.server</code> module provides a basic <abbr>HTTP</abbr> server.
</ol>
<h3 id=urllib><code>urllib</code></h3>
<p>Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, <code>urllib</code>.
<p>Python 2 had a rat&#8217;s nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, <code>urllib</code>.
<table>
<tr><th>Notes
<th>Python 2
@@ -368,10 +368,10 @@ except ImportError:
</table>
<ol>
<li>When you need to import an entire module from elsewhere in your package, use the new <code>from . import</code> syntax. The period is actually a relative path from this file (<code>universaldetector.py</code>) to the file you want to import (<code>constants.py</code>). In this case, they are in the same directory, thus the single period. You can also import from the parent directory (<code>from .. import anothermodule</code>) or a subdirectory.
<li>To import a specific class or function from another module directly into your module's namespace, prefix the target module with a relative path, minus the trailing slash. In this case, <code>mbcharsetprober.py</code> is in the same directory as <code>universaldetector.py</code>, so the path is a single period. You can also import from the parent directory (<code>from ..anothermodule import AnotherClass</code>) or a subdirectory.
<li>To import a specific class or function from another module directly into your module&#8217;s namespace, prefix the target module with a relative path, minus the trailing slash. In this case, <code>mbcharsetprober.py</code> is in the same directory as <code>universaldetector.py</code>, so the path is a single period. You can also import from the parent directory (<code>from ..anothermodule import AnotherClass</code>) or a subdirectory.
</ol>
<h2 id=next><code>next()</code> iterator method</h2>
<p>In Python 2, iterators had a <code>next()</code> method which returned the next item in the sequence. That's still true in Python 3, but there is now also a global <code>next()</code> function that takes an iterator as an argument.
<p>In Python 2, iterators had a <code>next()</code> method which returned the next item in the sequence. That&#8217;s still true in Python 3, but there is now also a global <code>next()</code> function that takes an iterator as an argument.
<table>
<tr><th>Notes
<th>Python 2
@@ -403,11 +403,11 @@ for an_iterator in a_sequence_of_iterators:
an_iterator.__next__()</code></pre>
</table>
<ol>
<li>In the simplest case, instead of calling an iterator's <code>next()</code> method, you now pass the iterator itself to the global <code>next()</code> function.
<li>In the simplest case, instead of calling an iterator&#8217;s <code>next()</code> method, you now pass the iterator itself to the global <code>next()</code> function.
<li>If you have a function that returns an iterator, call the function and pass the result to the <code>next()</code> function. (The <code>2to3</code> script is smart enough to convert this properly.)
<li>If you define your own class and mean to use it as an iterator, define the <code>__next__()</code> special method.
<li>If you define your own class and just happen to have a method named <code>next()</code> that takes one or more arguments, <code>2to3</code> will not touch it. This class can not be used as an iterator, because its <code>next()</code> method takes arguments.
<li>This one is a bit tricky. If you have a local variable named <var>next</var>, then it takes precedence over the new global <code>next()</code> function. In this case, you need to call the iterator's special <code>__next__()</code> method to get the next item in the sequence. (Alternatively, you could also refactor the code so the local variable wasn't named <var>next</var>, but <code>2to3</code> will not do that for you automatically.)
<li>This one is a bit tricky. If you have a local variable named <var>next</var>, then it takes precedence over the new global <code>next()</code> function. In this case, you need to call the iterator&#8217;s special <code>__next__()</code> method to get the next item in the sequence. (Alternatively, you could also refactor the code so the local variable wasn&#8217;t named <var>next</var>, but <code>2to3</code> will not do that for you automatically.)
</ol>
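<p>To make the cases above concrete, here is a sketch of a Python 3 iterator class. The <code>Countdown</code> class and the <code>shadowing_example()</code> function are hypothetical names invented for this illustration.

```python
class Countdown:
    """Hypothetical iterator that counts down from a starting number."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        # The special method that the global next() function calls.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def shadowing_example(an_iterator):
    # A local variable named next hides the global next() function,
    # so the special method has to be called directly.
    next = 'shadowed'
    return an_iterator.__next__()


countdown = Countdown(3)
first = next(countdown)
second = shadowing_example(countdown)
remaining = list(countdown)

print(first, second, remaining)
```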
<h2 id=filter><code>filter()</code> global function</h2>
<p>In Python 2, the <code>filter()</code> function returned a list, the result of filtering a sequence through a function that returned <code>True</code> or <code>False</code> for each item in the sequence. In Python 3, the <code>filter()</code> function returns an iterator, not a list.
@@ -482,7 +482,7 @@ reduce(a, b, c)</code></pre>
<p><span>&#x261E;</span>The version of <code>2to3</code> that shipped with Python 3.0 would not fix the <code>reduce()</code> function automatically. The fix first appeared in the <code>2to3</code> script that shipped with Python 3.1.
</blockquote>
<h2 id=apply><code>apply()</code> global function</h2>
<p>Python 2 had a global function called <code>apply()</code>, which took a function <var>f</var> and a list <code>[a, b, c]</code> and returned <code>f(a, b, c)</code>. In Python 3, the <code>apply()</code> function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function's arguments.
<p>Python 2 had a global function called <code>apply()</code>, which took a function <var>f</var> and a list <code>[a, b, c]</code> and returned <code>f(a, b, c)</code>. In Python 3, the <code>apply()</code> function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function&#8217;s arguments.
<table>
<tr><th>Notes
<th>Python 2
@@ -538,7 +538,7 @@ reduce(a, b, c)</code></pre>
<li>Even fancier, the old <code>exec</code> statement could also take a local namespace (like the variables defined within a function). In Python 3, the <code>exec()</code> function can do that too.
</ol>
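<p>A minimal sketch of the <code>exec()</code> function with explicit namespaces follows. The dictionary names and the code strings are made up for the example.

```python
global_ns = {}
local_ns = {}

# With only a global namespace, assignments land in that dictionary.
exec('x = 6 * 7', global_ns)

# With both namespaces, names are read from the globals if needed,
# and assignments land in the local namespace.
exec('y = x + 1', global_ns, local_ns)

print(global_ns['x'], local_ns['y'])
```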
<h2 id=execfile><code>execfile</code> statement (3.1+)</h2>
<p>Like the old <a href=#exec><code>exec</code> statement</a>, the old <code>execfile</code> statement will execute strings as if they were Python code. Where <code>exec</code> took a string, <code>execfile</code> took a filename. In Python 3, the <code>execfile</code> statement has been eliminated. If you really need to take a file of Python code and execute it (but you're not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the global <code>compile()</code> function to force the Python interpreter to compile the code, and then calling the new <code>exec()</code> function.
<p>Like the old <a href=#exec><code>exec</code> statement</a>, the old <code>execfile</code> statement will execute strings as if they were Python code. Where <code>exec</code> took a string, <code>execfile</code> took a filename. In Python 3, the <code>execfile</code> statement has been eliminated. If you really need to take a file of Python code and execute it (but you&#8217;re not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the global <code>compile()</code> function to force the Python interpreter to compile the code, and then calling the new <code>exec()</code> function.
<table>
<tr><th>Notes
<th>Python 2
@@ -607,7 +607,7 @@ except:
<ol>
<li>Instead of a comma after the exception type, Python 3 uses a new keyword, <code>as</code>.
<li>The <code>as</code> keyword also works for catching multiple types of exceptions at once.
<li>If you catch an exception but don't actually care about accessing the exception object itself, the syntax is identical between Python 2 and Python 3.
<li>If you catch an exception but don&#8217;t actually care about accessing the exception object itself, the syntax is identical between Python 2 and Python 3.
<li>Similarly, if you use a fallback to catch <em>all</em> exceptions, the syntax is identical.
</ol>
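<p>Here is a small sketch of the <code>as</code> keyword catching more than one exception type. The <code>parse_int()</code> function is a hypothetical helper, not part of the examples in the table.

```python
def parse_int(text):
    # Hypothetical helper: convert text to an int, or describe the failure.
    try:
        return int(text)
    except (TypeError, ValueError) as e:
        # The exception object is bound to e by the as keyword.
        return 'error: ' + type(e).__name__


good = parse_int('42')
bad_value = parse_int('forty-two')
bad_type = parse_int(None)

print(good, bad_value, bad_type)
```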
<blockquote class=note>
@@ -660,7 +660,7 @@ except:
<li>Python 2 also supported throwing an exception with <em>only</em> a custom error message. Python 3 does not support this, and the <code>2to3</code> script will display a warning telling you that you will need to fix this code manually.
</ol>
<h2 id=xrange><code>xrange()</code> global function</h2>
<p>In Python 2, there were two ways to get a range of numbers: <code>range()</code>, which returned a list, and <code>xrange()</code>, which returned an iterator. In Python 3, <code>range()</code> returns an iterator, and <code>xrange()</code> doesn't exist.
<p>In Python 2, there were two ways to get a range of numbers: <code>range()</code>, which returned a list, and <code>xrange()</code>, which returned an iterator. In Python 3, <code>range()</code> returns an iterator, and <code>xrange()</code> doesn&#8217;t exist.
<table>
<tr><th>Notes
<th>Python 2
@@ -738,11 +738,11 @@ except:
<td><code>a_function.__code__</code>
</table>
<ol>
<li>The <code>__name__</code> attribute (previously <code>func_name</code>) contains the function's name.
<li>The <code>__doc__</code> attribute (previously <code>func_doc</code>) contains the <i>docstring</i> that you defined in the function's source code.
<li>The <code>__name__</code> attribute (previously <code>func_name</code>) contains the function&#8217;s name.
<li>The <code>__doc__</code> attribute (previously <code>func_doc</code>) contains the <i>docstring</i> that you defined in the function&#8217;s source code.
<li>The <code>__defaults__</code> attribute (previously <code>func_defaults</code>) is a tuple containing default argument values for those arguments that have default values.
<li>The <code>__dict__</code> attribute (previously <code>func_dict</code>) is the namespace supporting arbitrary function attributes.
<li>The <code>__closure__</code> attribute (previously <code>func_closure</code>) is a tuple of cells that contain bindings for the function's free variables.
<li>The <code>__closure__</code> attribute (previously <code>func_closure</code>) is a tuple of cells that contain bindings for the function&#8217;s free variables.
<li>The <code>__globals__</code> attribute (previously <code>func_globals</code>) is a reference to the global namespace of the module in which the function was defined.
<li>The <code>__code__</code> attribute (previously <code>func_code</code>) is a code object representing the compiled function body.
</ol>
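<p>The renamed attributes can be inspected on any function. In this sketch, <code>a_function()</code> and <code>make_adder()</code> are throwaway functions defined only for the demonstration.

```python
def a_function(x, y=2):
    """Add two numbers."""
    return x + y


def make_adder(n):
    # The inner function closes over n, so it gets a __closure__ tuple.
    def add(x):
        return x + n
    return add


name = a_function.__name__
doc = a_function.__doc__
defaults = a_function.__defaults__
a_function.note = 'arbitrary attribute'   # stored in __dict__
attributes = a_function.__dict__
arg_count = a_function.__code__.co_argcount

add_five = make_adder(5)
closed_over = add_five.__closure__[0].cell_contents

print(name, doc, defaults, attributes, arg_count, closed_over)
```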
@@ -934,7 +934,7 @@ except:
<p><span>&#x261E;</span>The version of <code>2to3</code> that shipped with Python 3.0 would not fix these cases of <code>isinstance()</code> automatically. The fix first appeared in the <code>2to3</code> script that shipped with Python 3.1.
</blockquote>
<h2 id=basestring><code>basestring</code> datatype</h2>
<p>Python 2 had two string types: Unicode and non-Unicode. But there was also another type, <code>basestring</code>. It was an abstract type, a superclass for both the <code>str</code> and <code>unicode</code> types. It couldn't be called or instantiated directly, but you could pass it to the global <code>isinstance()</code> function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, so <code>basestring</code> has no reason to exist.
<p>Python 2 had two string types: Unicode and non-Unicode. But there was also another type, <code>basestring</code>. It was an abstract type, a superclass for both the <code>str</code> and <code>unicode</code> types. It couldn&#8217;t be called or instantiated directly, but you could pass it to the global <code>isinstance()</code> function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, so <code>basestring</code> has no reason to exist.
<table>
<tr><th>Notes
<th>Python 2
@@ -966,7 +966,7 @@ except:
<li>Instead of <code>itertools.izip()</code>, just use the global <code>zip()</code> function.
<li>Instead of <code>itertools.imap()</code>, just use <code>map()</code>.
<li><code>itertools.ifilter()</code> becomes <code>filter()</code>.
<li>The <code>itertools</code> module still exists in Python 3, it just doesn't have the functions that have migrated to the global namespace. The <code>2to3</code> script is smart enough to remove the specific imports that no longer exist, while leaving other imports intact.
<li>The <code>itertools</code> module still exists in Python 3, it just doesn&#8217;t have the functions that have migrated to the global namespace. The <code>2to3</code> script is smart enough to remove the specific imports that no longer exist, while leaving other imports intact.
</ol>
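<p>The migration described above can be sketched as follows. This is a minimal illustration; the sample lists and variable names are made up for the example.

```python
import itertools

names = ["a", "b", "c"]
numbers = [1, 2, 3, 4]

# In Python 3 the built-in zip(), map() and filter() return lazy iterators,
# which is what itertools.izip(), imap() and ifilter() did in Python 2.
pairs = zip(names, numbers)
doubled = map(lambda n: n * 2, numbers)
evens = filter(lambda n: n % 2 == 0, numbers)

# Consume the iterators to see their contents.
pairs_list = list(pairs)
doubled_list = list(doubled)
evens_list = list(evens)

# Functions with no built-in equivalent still live in itertools.
padded = list(itertools.zip_longest(names, numbers))
```

<p>Note that <code>zip()</code> stops at the shortest input, while <code>itertools.zip_longest()</code> pads the shorter one with <code>None</code>.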
<h2 id=sys_exc><code>sys.exc_type</code>, <code>sys.exc_value</code>, <code>sys.exc_traceback</code></h2>
<p>Python 2 had three variables in the <code>sys</code> module that you could access while an exception was being handled: <code>sys.exc_type</code>, <code>sys.exc_value</code>, and <code>sys.exc_traceback</code>. (Actually, these date all the way back to Python 1.) Ever since Python 1.5, these variables have been deprecated in favor of <code>sys.exc_info()</code>, a function that returns a tuple containing all three values. In Python 3, these individual variables have finally gone away; you must use <code>sys.exc_info()</code>.
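<p>A minimal sketch of the Python 3 approach. The helper name <code>describe_current_exception()</code> is hypothetical, chosen only for this example.

```python
import sys

def describe_current_exception():
    # sys.exc_info() returns (type, value, traceback) for the exception
    # currently being handled, or (None, None, None) if there is none.
    exc_type, exc_value, exc_traceback = sys.exc_info()
    return exc_type, exc_value, exc_traceback

try:
    1 / 0
except ZeroDivisionError:
    # Replaces the old sys.exc_type, sys.exc_value, sys.exc_traceback.
    exc_type, exc_value, exc_traceback = describe_current_exception()
```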
@@ -1027,11 +1027,11 @@ except:
</table>
<ol>
<li>Declaring the metaclass in the class declaration worked in Python 2, and it still works the same in Python 3.
<li>Declaring the metaclass in a class attribute worked in Python 2, but doesn't work in Python 3.
<li>Declaring the metaclass in a class attribute worked in Python 2, but doesn&#8217;t work in Python 3.
<li>The <code>2to3</code> script is smart enough to construct a valid class declaration, even if the class is inherited from one or more base classes.
</ol>
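<p>To make the difference concrete, here is a small sketch. The <code>Registry</code> metaclass and the class names are hypothetical, invented for this illustration.

```python
class Registry(type):
    """Hypothetical metaclass that records every class it creates."""
    created = []

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.created.append(name)
        return cls

class Base:
    pass

# Python 3 syntax: the metaclass is a keyword argument in the class
# declaration, after any base classes.
class Plugin(Base, metaclass=Registry):
    pass

# In Python 3, a __metaclass__ class attribute is just an ordinary
# attribute; it does not change how the class is built.
class Ignored:
    __metaclass__ = Registry
```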
<h2 id=nitpick>Matters of style</h2>
<p>The rest of the &#8220;fixes&#8221; listed here aren't really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an <a href=http://www.python.org/dev/peps/pep-0008/>official Python style guide</a> which outlines &mdash; in excruciating detail &mdash; all sorts of nitpicky details that you almost certainly don't care about. And given that <code>2to3</code> provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs.
<p>The rest of the &#8220;fixes&#8221; listed here aren&#8217;t really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an <a href=http://www.python.org/dev/peps/pep-0008/>official Python style guide</a> which outlines &mdash; in excruciating detail &mdash; all sorts of nitpicky details that you almost certainly don&#8217;t care about. And given that <code>2to3</code> provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs.
<h3 id=set_literal><code>set()</code> literals (explicit)</h3>
<p>In Python 2, the only way to define a literal set in your code was to call <code>set(a_sequence)</code>. This still works in Python 3, but a clearer way of doing it is to use the new set literal notation: curly braces. (Dictionaries are also defined with curly braces, which makes sense once you think about it, because dictionaries are just sets of key-value pairs.)
<blockquote class=note>
@@ -1053,7 +1053,7 @@ except:
<td><code>{i for i in a_sequence}</code>
</table>
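<p>The three spellings build equal sets, as this short sketch shows. The sample data is made up.

```python
a_sequence = [1, 2, 2, 3]

from_call = set(a_sequence)                    # works in Python 2 and 3
from_literal = {1, 2, 3}                       # set literal
from_comprehension = {i for i in a_sequence}   # set comprehension

# Careful: empty braces create a dictionary, not a set.
empty_dict = {}
empty_set = set()
```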
<h3 id=buffer><code>buffer()</code> global function (explicit)</h3>
<p>Python objects implemented in C can export a &#8220;buffer interface,&#8221; which is a block of memory that is directly readable and writeable without copying. (That is exactly as powerful and scary as it sounds.) In Python 3, <code>buffer()</code> has been renamed to <code>memoryview()</code>. (It's a little more complicated than that, but you can almost certainly ignore the differences.)
<p>Python objects implemented in C can export a &#8220;buffer interface,&#8221; which is a block of memory that is directly readable and writeable without copying. (That is exactly as powerful and scary as it sounds.) In Python 3, <code>buffer()</code> has been renamed to <code>memoryview()</code>. (It&#8217;s a little more complicated than that, but you can almost certainly ignore the differences.)
<blockquote class=note>
<p><span>&#x261E;</span>The <code>2to3</code> script will not fix the <code>buffer()</code> function by default. To enable this fix, specify <kbd>-f buffer</kbd> on the command line when you call <code>2to3</code>.
</blockquote>
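<p>A minimal sketch of what &#8220;readable and writeable without copying&#8221; means in practice, assuming a reasonably recent Python 3 where indexing a <code>memoryview</code> of bytes yields integers.

```python
data = bytearray(b"hello")

# memoryview() exposes the bytearray's memory without copying it.
view = memoryview(data)
first_byte = view[0]          # an integer, the same as ord("h")

# Writing through the view changes the underlying object.
view[0] = ord("j")
changed = bytes(data)

view.release()                # let go of the buffer when finished
```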
@@ -1084,7 +1084,7 @@ except:
<td><code>{a: b}</code>
</table>
<h3 id=idioms>Common idioms (explicit)</h3>
<p>There were a number of common idioms built up in the Python community. Some, like the <code>while 1:</code> loop, date back to Python 1. (Python didn't have a true boolean type until version 2.3, so developers used <code>1</code> and <code>0</code> instead.) Modern Python programmers should train their brains to use modern versions of these idioms instead.
<p>There were a number of common idioms built up in the Python community. Some, like the <code>while 1:</code> loop, date back to Python 1. (Python didn&#8217;t have a true boolean type until version 2.3, so developers used <code>1</code> and <code>0</code> instead.) Modern Python programmers should train their brains to use modern versions of these idioms instead.
<blockquote class=note>
<p><span>&#x261E;</span>The <code>2to3</code> script will not fix common idioms by default. To enable this fix, specify <kbd>-f idioms</kbd> on the command line when you call <code>2to3</code>.
</blockquote>
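<p>For illustration, here are modern spellings of a few such idioms. The function names are hypothetical and the selection is only a sample, not a complete list of what the fixer rewrites.

```python
def first_power_of_two_above(limit):
    power = 1
    # Modern spelling of the old "while 1:" loop.
    while True:
        if power > limit:
            return power
        power *= 2

def is_text(value):
    # isinstance() instead of comparing type(value) == str
    return isinstance(value, str)

def sorted_copy(a_list):
    # sorted() instead of copying the list and calling .sort() on the copy
    return sorted(a_list)
```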
+21 -21
@@ -17,13 +17,13 @@ body{counter-reset:h1 10}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
<p class=f>Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by &#8220;bug&#8221;? A bug is a test case you haven't written yet.
<p class=f>Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by &#8220;bug&#8221;? A bug is a test case you haven&#8217;t written yet.
<pre class=screen><samp class=p>>>> </samp><kbd>import roman7</kbd>
<a><samp class=p>>>> </samp><kbd>roman7.from_roman("")</kbd> <span>&#x2460;</span></a>
<samp>0</samp></pre>
<ol>
<li>Remember in the [FIXME-xref] previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals? Well, it turns out that this is still true for the final version of the regular expression. And that's a bug; you want an empty string to raise an <code>InvalidRomanNumeralError</code> exception just like any other sequence of characters that don't represent a valid Roman numeral.
<li>Remember in the [FIXME-xref] previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals? Well, it turns out that this is still true for the final version of the regular expression. And that&#8217;s a bug; you want an empty string to raise an <code>InvalidRomanNumeralError</code> exception just like any other sequence of characters that don&#8217;t represent a valid Roman numeral.
</ol>
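<p>The reason an empty string slips through can be sketched like this. The pattern below is an assumption modeled on the 1..3999 version of the validator, and <code>is_valid_roman_numeral()</code> is a hypothetical helper, not the chapter&#8217;s code.

```python
import re

# Assumed pattern: every part is optional, so nothing is required to match.
roman_numeral_pattern = re.compile(r"""
    ^                   # beginning of string
    M{0,3}              # thousands: 0 to 3 Ms
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
    """, re.VERBOSE)

# The empty string satisfies the pattern, which is the bug.
empty_matches_pattern = roman_numeral_pattern.search("") is not None

def is_valid_roman_numeral(s):
    # An explicit check for the empty string closes the gap.
    if not s:
        return False
    return roman_numeral_pattern.search(s) is not None
```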
<p>After reproducing the bug, and before fixing it, you should write a test case that fails, thus illustrating the bug.
@@ -107,15 +107,15 @@ Ran 11 tests in 0.156s
<a><samp>OK</samp> <span>&#x2461;</span></a></pre>
<ol>
<li>The blank string test case now passes, so the bug is fixed.
<li>All the other test cases still pass, which means that this bug fix didn't break anything else. Stop coding.
<li>All the other test cases still pass, which means that this bug fix didn&#8217;t break anything else. Stop coding.
</ol>
<p>Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs will require complex test cases. In a testing-centric environment, it may <em>seem</em> like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case), then fix the bug itself. Then if the test case doesn't pass right away, you need to figure out whether the fix was wrong, or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can easily re-run <em>all</em> the test cases along with your new one, you are much less likely to break old code when fixing new code. Today's unit test is tomorrow's regression test.
<p>Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs will require complex test cases. In a testing-centric environment, it may <em>seem</em> like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case), then fix the bug itself. Then if the test case doesn&#8217;t pass right away, you need to figure out whether the fix was wrong, or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can easily re-run <em>all</em> the test cases along with your new one, you are much less likely to break old code when fixing new code. Today&#8217;s unit test is tomorrow&#8217;s regression test.
<h2 id=changing-requirements>Handling Changing Requirements</h2>
<p>Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible nasty things involving scissors and hot wax, requirements will change. Most customers don't know what they want until they see it, and even if they do, they aren't that good at articulating what they want precisely enough to be useful. And even if they do, they'll want more in the next release anyway. So be prepared to update your test cases as requirements change.
<p>Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible nasty things involving scissors and hot wax, requirements will change. Most customers don&#8217;t know what they want until they see it, and even if they do, they aren&#8217;t that good at articulating what they want precisely enough to be useful. And even if they do, they&#8217;ll want more in the next release anyway. So be prepared to update your test cases as requirements change.
<p>Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember [FIXME-xref] the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception to that rule by having 4 <code>M</code> characters in a row to represent <code>4000</code>. If you make this change, you'll be able to expand the range of convertible numbers from <code>1..3999</code> to <code>1..4999</code>. But first, you need to make some changes to your test cases.
<p>Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember [FIXME-xref] the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception to that rule by having 4 <code>M</code> characters in a row to represent <code>4000</code>. If you make this change, you&#8217;ll be able to expand the range of convertible numbers from <code>1..3999</code> to <code>1..4999</code>. But first, you need to make some changes to your test cases.
<p class=d>[<a href=examples/roman8.py>download <code>roman8.py</code></a>]
<pre><code>
@@ -157,7 +157,7 @@ class RoundtripCheck(unittest.TestCase):
result = roman8.from_roman(numeral)
self.assertEqual(integer, result)</code></pre>
<ol>
<li>The existing known values don't change (they're all still reasonable values to test), but you need to add a few more in the <code>4000</code> range. Here I've included <code>4000</code> (the shortest), <code>4500</code> (the second shortest), <code>4888</code> (the longest), and <code>4999</code> (the largest).
<li>The existing known values don&#8217;t change (they&#8217;re all still reasonable values to test), but you need to add a few more in the <code>4000</code> range. Here I&#8217;ve included <code>4000</code> (the shortest), <code>4500</code> (the second shortest), <code>4888</code> (the longest), and <code>4999</code> (the largest).
<li>The definition of &#8220;large input&#8221; has changed. This test used to call <code>to_roman()</code> with <code>4000</code> and expect an error; now that <code>4000-4999</code> are good values, you need to bump this up to <code>5000</code>.
<li>The definition of &#8220;too many repeated numerals&#8221; has also changed. This test used to call <code>from_roman()</code> with <code>'MMMM'</code> and expect an error; now that <code>MMMM</code> is considered a valid Roman numeral, you need to bump this up to <code>'MMMMM'</code>.
<li>The sanity check loops through every number in the range, from <code>1</code> to <code>3999</code>. Since the range has now expanded, this <code>for</code> loop needs to be updated as well to go up to <code>4999</code>.
@@ -220,7 +220,7 @@ FAILED (errors=3)</samp></pre>
<li>The roundtrip check will also fail as soon as it hits <code>4000</code>, because <code>to_roman()</code> still thinks this is out of range.
</ol>
<p>Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being tested is never &#8220;ahead&#8221; of the test cases. While it's behind, you still have some work to do, and as soon as it catches up to the test cases, you stop coding.)
<p>Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being tested is never &#8220;ahead&#8221; of the test cases. While it&#8217;s behind, you still have some work to do, and as soon as it catches up to the test cases, you stop coding.)
<p class=d>[<a href=examples/roman9.py>download <code>roman9.py</code></a>]
<pre><code>
@@ -255,11 +255,11 @@ def from_roman(s):
.
.</code></pre>
<ol>
<li>You don't need to make any changes to the <code>from_roman()</code> function at all. The only change is to <var>roman_numeral_pattern</var>. If you look closely, you'll notice that I changed the maximum number of optional <code>M</code> characters from <code>3</code> to <code>4</code> in the first section of the regular expression. This will allow the Roman numeral equivalents of <code>4999</code> instead of <code>3999</code>. The actual <code>from_roman()</code> function is completely generic; it just looks for repeated Roman numeral characters and adds them up, without caring how many times they repeat. The only reason it didn't handle <code>'MMMM'</code> before is that you explicitly stopped it with the regular expression pattern matching.
<li>The <code>to_roman()</code> function only needs one small change, in the range check. Where you used to check <code>0 &lt; n &lt; 4000</code>, you now check <code>0 &lt; n &lt; 5000</code>. And you change the error message that you <code>raise</code> to reflect the new acceptable range (<code>1..4999</code> instead of <code>1..3999</code>). You don't need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds <code>'M'</code> for each thousand that it finds; given <code>4000</code>, it will spit out <code>'MMMM'</code>. The only reason it didn't do this before is that you explicitly stopped it with the range check.)
<li>You don&#8217;t need to make any changes to the <code>from_roman()</code> function at all. The only change is to <var>roman_numeral_pattern</var>. If you look closely, you&#8217;ll notice that I changed the maximum number of optional <code>M</code> characters from <code>3</code> to <code>4</code> in the first section of the regular expression. This will allow the Roman numeral equivalents of <code>4999</code> instead of <code>3999</code>. The actual <code>from_roman()</code> function is completely generic; it just looks for repeated Roman numeral characters and adds them up, without caring how many times they repeat. The only reason it didn&#8217;t handle <code>'MMMM'</code> before is that you explicitly stopped it with the regular expression pattern matching.
<li>The <code>to_roman()</code> function only needs one small change, in the range check. Where you used to check <code>0 &lt; n &lt; 4000</code>, you now check <code>0 &lt; n &lt; 5000</code>. And you change the error message that you <code>raise</code> to reflect the new acceptable range (<code>1..4999</code> instead of <code>1..3999</code>). You don&#8217;t need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds <code>'M'</code> for each thousand that it finds; given <code>4000</code>, it will spit out <code>'MMMM'</code>. The only reason it didn&#8217;t do this before is that you explicitly stopped it with the range check.)
</ol>
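<p>The <code>to_roman()</code> change described above might look roughly like this. It is a sketch under the assumption that the function uses a greedy numeral map; <code>OutOfRangeError</code> is defined here only so the example is self-contained.

```python
class OutOfRangeError(ValueError):
    pass

roman_numeral_map = (("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
                     ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
                     ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1))

def to_roman(n):
    """Convert an integer in 1..4999 to a Roman numeral."""
    if not (0 < n < 5000):          # previously: 0 < n < 4000
        raise OutOfRangeError("number out of range (must be 1..4999)")
    result = ""
    for numeral, integer in roman_numeral_map:
        # Greedily append each numeral as many times as it fits.
        while n >= integer:
            result += numeral
            n -= integer
    return result
```

<p>Nothing in the loop limits how many times <code>'M'</code> is appended, so only the range check needs to change.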
<p>You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself.
<p>You may be skeptical that these two small changes are all that you need. Hey, don&#8217;t take my word for it; see for yourself.
<pre class=screen>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest9.py -v</kbd>
@@ -288,13 +288,13 @@ Ran 12 tests in 0.203s
<h2 id=refactoring>Refactoring</h2>
<p>The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even the feeling you get when someone else blames you for breaking their code and you can actually <em>prove</em> that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
<p>The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even the feeling you get when someone else blames you for breaking their code and you can actually <em>prove</em> that you didn&#8217;t. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
<p>Refactoring is the process of taking working code and making it work better. Usually, &#8220;better&#8221; means &#8220;faster&#8221;, although it can also mean &#8220;using less memory&#8221;, or &#8220;using less disk space&#8221;, or simply &#8220;more elegantly&#8221;. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any program.
<p>Here, &#8220;better&#8221; means both &#8220;faster&#8221; and &#8220;easier to maintain.&#8221; Specifically, the <code>from_roman()</code> function is slower and more complex than I'd like, because of that big nasty regular expression that you use to validate Roman numerals. Now, you might think, "Sure, the regular expression is big and hairy, but how else am I supposed to validate that an arbitrary string is a valid Roman numeral?"
<p>Here, &#8220;better&#8221; means both &#8220;faster&#8221; and &#8220;easier to maintain.&#8221; Specifically, the <code>from_roman()</code> function is slower and more complex than I&#8217;d like, because of that big nasty regular expression that you use to validate Roman numerals. Now, you might think, "Sure, the regular expression is big and hairy, but how else am I supposed to validate that an arbitrary string is a valid Roman numeral?"
<p>Answer: there's only 5000 of them; why don't you just build a lookup table? This idea gets even better when you realize that <em>you don't need to use regular expressions at all</em>. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers. By the time you need to check whether an arbitrary string is a valid Roman numeral, you will have collected all the valid Roman numerals. &#8220;Validating&#8221; is reduced to a single dictionary lookup.
<p>Answer: there&#8217;s only 5000 of them; why don&#8217;t you just build a lookup table? This idea gets even better when you realize that <em>you don&#8217;t need to use regular expressions at all</em>. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers. By the time you need to check whether an arbitrary string is a valid Roman numeral, you will have collected all the valid Roman numerals. &#8220;Validating&#8221; is reduced to a single dictionary lookup.
<p>And best of all, you already have a complete set of unit tests. You can change over half the code in the module, but the unit tests will stay the same. That means you can prove &mdash; to yourself and to others &mdash; that the new code works just as well as the original.
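<p>The lookup-table idea can be sketched as follows. This is a simplified illustration, not the chapter&#8217;s own implementation: the table-building loop and the use of <code>ValueError</code> are assumptions made to keep the example short.

```python
ROMAN_NUMERAL_MAP = (("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
                     ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
                     ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1))

def build_lookup_tables(limit=4999):
    to_table = [None]      # index 0 is unused; there is no Roman zero
    from_table = {}
    for integer in range(1, limit + 1):
        remainder, numeral = integer, ""
        for symbol, value in ROMAN_NUMERAL_MAP:
            while remainder >= value:
                numeral += symbol
                remainder -= value
        to_table.append(numeral)
        from_table[numeral] = integer   # build the reverse table as we go
    return to_table, from_table

# Runs once, when the module is first imported.
to_roman_table, from_roman_table = build_lookup_tables()

def to_roman(n):
    if not isinstance(n, int) or not (0 < n < 5000):
        raise ValueError("number out of range (must be 1..4999)")
    return to_roman_table[n]

def from_roman(s):
    # Validation is a single dictionary lookup; no regular expression.
    if s not in from_roman_table:
        raise ValueError("Invalid Roman numeral: {0!r}".format(s))
    return from_roman_table[s]
```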
@@ -357,13 +357,13 @@ def build_lookup_tables():
build_lookup_tables()</code></pre>
<p>Let's break that down into digestible pieces. Arguably, the most important line is the last one:
<p>Let&#8217;s break that down into digestible pieces. Arguably, the most important line is the last one:
<pre><code>build_lookup_tables()</code></pre>
<p>You will note that is a function call, but there's no <code>if</code> statement around it. This is not an <code>if __name__ == '__main__'</code> block; it gets called <em>when the module is imported</em>. (It is important to understand that modules are only imported once, then cached. If you import an already-imported module, it does nothing. So this code will only get called the first time you import this module.)
<p>You will note that is a function call, but there&#8217;s no <code>if</code> statement around it. This is not an <code>if __name__ == '__main__'</code> block; it gets called <em>when the module is imported</em>. (It is important to understand that modules are only imported once, then cached. If you import an already-imported module, it does nothing. So this code will only get called the first time you import this module.)
<p>So what does the <code>build_lookup_tables()</code> function do? I'm glad you asked.
<p>So what does the <code>build_lookup_tables()</code> function do? I&#8217;m glad you asked.
<pre><code><a>to_roman_table = [ None ]
from_roman_table = {}
@@ -438,7 +438,7 @@ to_roman should fail with 0 input ... ok
OK</samp></pre>
<ol>
<li>Not that you asked, but it's fast, too! Like, almost 10&times; as fast. Of course, it's not entirely a fair comparison, because this version takes longer to import (when it builds the lookup tables). But since the import is only done once, the startup cost is amortized over all the calls to the <code>to_roman()</code> and <code>from_roman()</code> functions. Since the tests make several thousand function calls (the roundtrip test alone makes 10,000), this savings adds up in a hurry!
<li>Not that you asked, but it&#8217;s fast, too! Like, almost 10&times; as fast. Of course, it&#8217;s not entirely a fair comparison, because this version takes longer to import (when it builds the lookup tables). But since the import is only done once, the startup cost is amortized over all the calls to the <code>to_roman()</code> and <code>from_roman()</code> functions. Since the tests make several thousand function calls (the roundtrip test alone makes 10,000), this savings adds up in a hurry!
</ol>
<p>The moral of the story?
@@ -451,9 +451,9 @@ OK</samp></pre>
<h2 id=summary>Summary</h2>
<p>Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you've seen it work, you'll wonder how you ever got along without it.
<p>Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you&#8217;ve seen it work, you&#8217;ll wonder how you ever got along without it.
<p>These few chapters have covered a lot of ground, and much of it wasn't even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts:
<p>These few chapters have covered a lot of ground, and much of it wasn&#8217;t even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts:
<ul>
<li>Designing test cases that are specific, automated, and independent
@@ -461,7 +461,7 @@ OK</samp></pre>
<li>Writing tests that test good input and check for proper results
<li>Writing tests that test bad input and check for proper failure responses
<li>Writing and updating test cases to reflect new requirements
<li>Refactoring mercilessly to improve performance, scalability, readability, maintainability, or whatever other -ility you're lacking
<li>Refactoring mercilessly to improve performance, scalability, readability, maintainability, or whatever other -ility you&#8217;re lacking
</ul>
<p class=c>&copy; 2001&ndash;9 <a href=about.html>Mark Pilgrim</a>
+65 -6
@@ -50,7 +50,8 @@ __ne__
__gt__ - covered in fractions.py
__ge__ - covered in fractions.py
__bool__ - covered in fractions.py
__cmp__ (*)
(__cmp__ is gone)
</pre>
<h2 id=custom-attributes>Custom Attributes</h2>
@@ -118,7 +119,15 @@ __reversed__ - covered in ordereddict.py
<h2 id=acts-like-number>Classes That Act Like Numbers</h2>
<p>FIXME binary operator intro
<p>Using the appropriate special methods, you can define your own classes that act like numbers. That is, you can add them, subtract them, and perform other mathematical operations on them. This is how <a href=advanced-classes.html#implementing-fractions>fractions are implemented</a> &mdash; the <code>Fraction</code> class implements these special methods, then you can do things like this:
<pre class=screen>
<samp class=p>>>> </samp><kbd>from fractions import Fraction</kbd>
<samp class=p>>>> </samp><kbd>x = Fraction(1, 3)</kbd>
<samp class=p>>>> </samp><kbd>x / 3</kbd>
<samp>Fraction(1, 9)</samp></pre>
<p>Here is the comprehensive list of special methods you need to implement a number-like class.
<table>
<tr><th>Notes
@@ -195,7 +204,24 @@ __xor__
__or__
-->
<p>FIXME explain circumstances under which reflected methods will be called. <!-- If <var>x</var> doesn't implement a given special method, or if it implements it but return <code>NotImplemented</code>, the Python interpreter will try a different approach &mdash; calling a special method on <var>y</var> instead of <var>x</var>.-->
<p>That&#8217;s all well and good if <var>x</var> is an instance of a class that implements those methods. But what if it doesn&#8217;t implement one of them? Or worse, what if it implements it, but it can&#8217;t handle certain kinds of arguments? For example:
<pre class=screen>
<samp class=p>>>> </samp><kbd>from fractions import Fraction</kbd>
<samp class=p>>>> </samp><kbd>x = Fraction(1, 3)</kbd>
<samp class=p>>>> </samp><kbd>1 / x</kbd>
<samp>Fraction(3, 1)</samp></pre>
<p>This is <em>not</em> a case of taking a <code>Fraction</code> and dividing it by an integer (as in the previous example). That case was straightforward: <code>x / 3</code> calls <code>x.__truediv__(3)</code>, and the <code>__truediv__()</code> method of the <code>Fraction</code> class handles all the math. But integers don&#8217;t &#8220;know&#8221; how to do arithmetic operations with fractions. So why does this example work?
<p>The answer lies in a second set of arithmetic special methods with <i>reflected operands</i>. Given an arithmetic operation that takes two operands (<i>e.g.</i> <code>x / y</code>), there are two ways to go about it:
<ol>
<li>Tell <var>x</var> to divide itself by <var>y</var>, or
<li>Tell <var>y</var> to divide itself into <var>x</var>
</ol>
<p>The set of special methods above takes the first approach: given <code>x / y</code>, they provide a way for <var>x</var> to say &#8220;I know how to divide myself by <var>y</var>.&#8221; The following set of special methods tackles the second approach: they provide a way for <var>y</var> to say &#8220;I know how to be the denominator and divide myself into <var>x</var>.&#8221;
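<p>Both approaches can be seen side by side in a small sketch. The <code>Ratio</code> class is hypothetical, a thin wrapper around a float made up for this illustration.

```python
class Ratio:
    """Hypothetical wrapper around a float, for illustration only."""

    def __init__(self, value):
        self.value = float(value)

    def __truediv__(self, other):
        # Called for ratio / other.
        if isinstance(other, (int, float)):
            return Ratio(self.value / other)
        return NotImplemented

    def __rtruediv__(self, other):
        # Called for other / ratio, after other.__truediv__(ratio)
        # has returned NotImplemented.
        if isinstance(other, (int, float)):
            return Ratio(other / self.value)
        return NotImplemented

x = Ratio(4)
forward = x / 2      # x.__truediv__(2)
reflected = 1 / x    # int can't do it, so Python calls x.__rtruediv__(1)
```

<p>Returning <code>NotImplemented</code> (rather than raising an exception) is what gives the other operand its chance.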
<table>
<tr><th>Notes
@@ -271,7 +297,7 @@ __rxor__
__ror__
-->
<p>FIXME explain in-place augmented assignments
<p>But wait! There&#8217;s more! If you&#8217;re doing &#8220;in-place&#8221; operations, like <code>x /= 3</code>, there are even more special methods you can define.
<table>
<tr><th>Notes
@@ -343,7 +369,17 @@ __ixor__
__ior__
-->
<p>FIXME unary operator intro
<p>Note: for the most part, the in-place operation methods are not required. If you don&#8217;t define an in-place method for a particular operation, Python will fall back to the regular and reflected methods. For example, to execute the expression <code>x /= y</code>, Python will:
<ol>
<li>Try calling <code>x.__itruediv__(<var>y</var>)</code>. If this method is defined and returns a value other than <code>NotImplemented</code>, we&#8217;re done.
<li>Try calling <code>x.__truediv__(<var>y</var>)</code>. If this method is defined and returns a value other than <code>NotImplemented</code>, the old value of <var>x</var> is discarded and replaced with the return value, just as if you had done <code> x = x / y</code> instead.
<li>Try calling <code>y.__rtruediv__(<var>x</var>)</code>. If this method is defined and returns a value other than <code>NotImplemented</code>, the old value of <var>x</var> is discarded and replaced with the return value.
</ol>
<p>So you only need to define in-place methods like the <code>__itruediv__()</code> method if you want to do some special optimization for in-place operations. Otherwise Python will essentially reformulate the in-place operation as a regular operation plus a variable assignment.
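<p>The fallback can be observed directly. In this sketch, the classes <code>Plain</code> and <code>InPlace</code> are hypothetical names invented for the example.

```python
class Plain:
    """Hypothetical class with only the regular division method."""

    def __init__(self, value):
        self.value = value

    def __truediv__(self, other):
        return Plain(self.value / other)

class InPlace(Plain):
    """Adds an in-place method that modifies the object itself."""

    def __itruediv__(self, other):
        self.value /= other
        return self          # in-place methods must return the result

a = Plain(8)
a_before = a
a /= 2                       # falls back to a = a.__truediv__(2)

b = InPlace(8)
b_before = b
b /= 2                       # calls b.__itruediv__(2)
```

<p>After the fallback, <var>a</var> names a brand-new object, while <var>b</var> is still the same object with an updated value.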
<p>There are also a few &#8220;unary&#8221; mathematical operations you can perform on number-like objects by themselves.
<table>
<tr><th>Notes
@@ -399,7 +435,7 @@ __ior__
<td><code>math.trunc(x)</code>
<td><code>x.__trunc__()</code>
<tr><th>
<td>???
<td>??? FIXME what the hell is this?
<td><code>???</code>
<td><code>x.__index__()</code>
</table>
@@ -439,6 +475,29 @@ __reduce_ex__ (*)
<pre>
__enter__ see http://docs.python.org/3.0/library/stdtypes.html#typecontextmanager
__exit__
relevant excerpt from io.py:
def __enter__(self) -> "IOBase": # That's a forward reference
"""Context management protocol. Returns self."""
self._checkClosed()
return self
def __exit__(self, *args) -> None:
"""Context management protocol. Calls close()"""
self.close()
relevant excerpt from http://www.python.org/doc/3.0/reference/datamodel.html#with-statement-context-managers
object.__enter__(self)
Enter the runtime context related to this object. The with statement will bind this method's return value to the target(s) specified in the as clause of the statement, if any.
object.__exit__(self, exc_type, exc_value, traceback)
Exit the runtime context related to this object. The parameters describe the exception that caused the context to be exited. If the context was exited without an exception, all three arguments will be None.
If an exception is supplied, and the method wishes to suppress the exception (i.e., prevent it from being propagated), it should return a true value. Otherwise, the exception will be processed normally upon exit from this method.
Note that __exit__() methods should not reraise the passed-in exception; this is the caller's responsibility.
</pre>
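<p>A minimal sketch of the protocol described in those notes, assuming a made-up <code>Resource</code> class (it is not part of <code>io.py</code> or any real library). <code>__enter__()</code> returns the object that gets bound by <code>as</code>, and <code>__exit__()</code> cleans up and returns a false value so that exceptions propagate normally.

```python
class Resource:
    """Hypothetical resource that tracks whether it has been closed."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        # the return value is bound to the target of the 'as' clause
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # called on the way out, whether or not an exception occurred
        self.close()
        # a false return value lets any exception propagate
        return False


with Resource() as r:
    inside = r.closed    # still open inside the block

print(inside, r.closed)  # False True
```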
<h2 id=esoterica>Really Esoteric Stuff</h2>
+21 -21
@@ -49,19 +49,19 @@ My alphabet starts where your alphabet ends! <span>&#x275E;</span><br>&mdash; Dr
<p><i>Enter Unicode.</i>
<p>Unicode is a system designed to represent <em>every</em> character from <em>every</em> language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0&ndash;4294967295. (That's 2<sup>32</sup>&minus;1.) Each 4-byte number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no &#8220;modes&#8221; to keep track of. <code>U+0041</code> is always <code>'A'</code>, even if your language doesn't have an <code>'A'</code> in it.
<p>Unicode is a system designed to represent <em>every</em> character from <em>every</em> language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0&ndash;4294967295. (That&#8217;s 2<sup>32</sup>&minus;1.) Each 4-byte number represents a unique character used in at least one of the world&#8217;s languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn&#8217;t be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no &#8220;modes&#8221; to keep track of. <code>U+0041</code> is always <code>'A'</code>, even if your language doesn&#8217;t have an <code>'A'</code> in it.
<p>On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more &#8220;mode switching&#8221; to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character<span title="interrobang!">&#8253;</span> That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it's wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.
<p>On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more &#8220;mode switching&#8221; to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character<span title="interrobang!">&#8253;</span> That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it&#8217;s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.
<p>There is a Unicode encoding that uses four bytes per character. It's called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the <var>Nth</var> character of a string in constant time, because the <var>Nth</var> character starts at the <var>4&times;Nth</var> byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character.
<p>There is a Unicode encoding that uses four bytes per character. It&#8217;s called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the <var>Nth</var> character of a string in constant time, because the <var>Nth</var> character starts at the <var>4&times;Nth</var> byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character.
<p>Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0&ndash;65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used &#8220;astral plane&#8221; Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don't). And you can still easily find the <var>Nth</var> character of a string in constant time, if you assume that the string doesn't include any astral plane characters, which is a good assumption right up until the moment that it's not.
<p>Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0&ndash;65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used &#8220;astral plane&#8221; Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don&#8217;t). And you can still easily find the <var>Nth</var> character of a string in constant time, if you assume that the string doesn&#8217;t include any astral plane characters, which is a good assumption right up until the moment that it&#8217;s not.
<p>But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character <code>U+4E2D</code> could be stored in UTF-16 as either <code>4E 2D</code> or <code>2D 4E</code>, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you're safe &mdash; different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you're going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence <code>4E 2D</code> means <code>U+4E2D</code> or <code>U+2D4E</code>.
<p>But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character <code>U+4E2D</code> could be stored in UTF-16 as either <code>4E 2D</code> or <code>2D 4E</code>, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you&#8217;re safe &mdash; different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you&#8217;re going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence <code>4E 2D</code> means <code>U+4E2D</code> or <code>U+2D4E</code>.
<p>To solve <em>this</em> problem, the multi-byte Unicode encodings define a &#8220;Byte Order Mark,&#8221; which is a special non-printable character that you can include at the beginning of your document to indicate what order your bytes are in. For UTF-16, the Byte Order Mark is <code>U+FEFF</code>. If you receive a UTF-16 document that starts with the bytes <code>FF FE</code>, you know the byte ordering is one way; if it starts with <code>FE FF</code>, you know the byte ordering is reversed.
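<p>You can watch the byte-order problem happen in the Python shell. This is only a sketch; it relies on the standard <code>utf-16-be</code> and <code>utf-16-le</code> codec names and the <code>codecs.BOM_UTF16_LE</code> constant, and the variable names are made up for the illustration.

```python
import codecs

zhong = '\u4e2d'

big = zhong.encode('utf-16-be')      # most significant byte first
little = zhong.encode('utf-16-le')   # least significant byte first
print(big.hex(), little.hex())       # 4e2d 2d4e

# the same two bytes decode to different characters in each byte order
print(hex(ord(b'\x4e\x2d'.decode('utf-16-be'))))   # 0x4e2d
print(hex(ord(b'\x4e\x2d'.decode('utf-16-le'))))   # 0x2d4e

# a Byte Order Mark at the start of the data settles the question
data = codecs.BOM_UTF16_LE + little
print(data.hex())                    # fffe2d4e
print(data.decode('utf-16') == zhong)   # True
```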
<p>Still, UTF-16 isn't exactly ideal, especially if you're dealing with a lot of <abbr>ASCII</abbr> characters. If you think about it, even a Chinese web page is going to contain a lot of <abbr>ASCII</abbr> characters &mdash; all the elements and attributes surrounding the printable Chinese characters. Being able to find the <var>Nth</var> character in O(1) time is nice, but there's still the nagging problem of those astral plane characters, which mean that you can't <em>guarantee</em> that every character is exactly two bytes, so you can't <em>really</em> find the <var>Nth</var> character in O(1) time unless you maintain a separate index. And boy, there sure is a lot of <abbr>ASCII</abbr> text in the world&hellip;
<p>Still, UTF-16 isn&#8217;t exactly ideal, especially if you&#8217;re dealing with a lot of <abbr>ASCII</abbr> characters. If you think about it, even a Chinese web page is going to contain a lot of <abbr>ASCII</abbr> characters &mdash; all the elements and attributes surrounding the printable Chinese characters. Being able to find the <var>Nth</var> character in O(1) time is nice, but there&#8217;s still the nagging problem of those astral plane characters, which mean that you can&#8217;t <em>guarantee</em> that every character is exactly two bytes, so you can&#8217;t <em>really</em> find the <var>Nth</var> character in O(1) time unless you maintain a separate index. And boy, there sure is a lot of <abbr>ASCII</abbr> text in the world&hellip;
<p>Other people pondered these questions, and they came up with a solution:
@@ -71,7 +71,7 @@ My alphabet starts where your alphabet ends! <span>&#x275E;</span><br>&mdash; Dr
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you'll have to trust me on this, because I'm not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you&#8217;ll have to trust me on this, because I&#8217;m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
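<p>A quick way to check those size claims yourself is to encode a few characters and count the bytes. The three sample characters below (an <abbr>ASCII</abbr> letter, an extended Latin letter, and a Chinese character) are chosen just for this illustration.

```python
samples = ['A', '\u00e9', '\u4e2d']
for ch in samples:
    # bytes needed for this one character in each encoding
    sizes = [len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-be', 'utf-32-be')]
    print('U+{0:04X}'.format(ord(ch)), sizes)
# U+0041 [1, 2, 4]
# U+00E9 [2, 2, 4]
# U+4E2D [3, 2, 4]

# the UTF-8 bytes are the same stream on any computer
utf8_bytes = 'A\u00e9\u4e2d'.encode('utf-8')
print(utf8_bytes)    # b'A\xc3\xa9\xe4\xb8\xad'
```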
<h2 id=divingin>Diving In</h2>
@@ -95,7 +95,7 @@ My alphabet starts where your alphabet ends! <span>&#x275E;</span><br>&mdash; Dr
<h2 id=formatting-strings>Formatting Strings</h2>
<aside>Strings can be defined with either single or double quotes.</aside>
<p>Let's take another look at <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
<p>Let&#8217;s take another look at <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
<pre><code>
@@ -127,8 +127,8 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li><code>'KB'</code>, <code>'MB'</code>, <code>'GB'</code>&hellip; those are each strings.
<li>Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start and end the string.
<li>These three-in-a-row quotes end the docstring.
<li>There's another string, being passed to the exception as a human-readable error message.
<li>There's a&hellip; whoa, what the heck is that?
<li>There&#8217;s another string, being passed to the exception as a human-readable error message.
<li>There&#8217;s a&hellip; whoa, what the heck is that?
</ol>
<p>Python 3 supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder.
@@ -140,7 +140,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<samp>"mark's password is PapayaWhip"</samp></pre>
<ol>
<li>No, my password is not really <kbd>PapayaWhip</kbd>.
<li>There's a lot going on here. First, that's a method call on a string literal. <em>Strings are objects</em>, and objects have methods. Second, the whole expression evaluates to a string. Third, <code>{0}</code> and <code>{1}</code> are <i>replacement fields</i>, which are replaced by the arguments passed to the <code>format()</code> method.
<li>There&#8217;s a lot going on here. First, that&#8217;s a method call on a string literal. <em>Strings are objects</em>, and objects have methods. Second, the whole expression evaluates to a string. Third, <code>{0}</code> and <code>{1}</code> are <i>replacement fields</i>, which are replaced by the arguments passed to the <code>format()</code> method.
</ol>
<h3 id=compound-field-names>Compound Field Names</h3>
@@ -156,8 +156,8 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<samp>'1000KB = 1MB'</samp>
</pre>
<ol>
<li>Rather than calling any function in the <code>humansize</code> module, you're just grabbing one of the data structures it defines: the list of "SI" (powers-of-1000) suffixes.
<li>This looks complicated, but it's not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list which is the first argument passed to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces &mdash; including <code>1000</code>, the equals sign, and the spaces &mdash; is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
<li>Rather than calling any function in the <code>humansize</code> module, you&#8217;re just grabbing one of the data structures it defines: the list of "SI" (powers-of-1000) suffixes.
<li>This looks complicated, but it&#8217;s not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list which is the first argument passed to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces &mdash; including <code>1000</code>, the equals sign, and the spaces &mdash; is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
</ol>
<aside>{0} is replaced by the 1<sup>st</sup> format() argument. {1} is replaced by the 2<sup>nd</sup>.</aside>
@@ -171,7 +171,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li><em>Any combination of the above</em>
</ul>
<p>Just to blow your mind, here's an example that combines all of the above:
<p>Just to blow your mind, here&#8217;s an example that combines all of the above:
<pre class=screen>
<samp class=p>>>> </samp><kbd>import humansize</kbd>
@@ -179,7 +179,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<samp class=p>>>> </samp><kbd>"1MB = 1000{0.modules[humansize].SUFFIXES[1000][0]}".format(sys)</kbd>
<samp>'1MB = 1000KB'</samp></pre>
<p>Here's how it works:
<p>Here&#8217;s how it works:
<ul>
<li>The <code>sys</code> module holds information about the currently running Python instance. Since you just imported it, you can pass the <code>sys</code> module itself as an argument to the <code>format()</code> method. So the replacement field <code>{0}</code> refers to the <code>sys</code> module.
@@ -192,12 +192,12 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<h3 id=format-specifiers>Format Specifiers</h3>
<p>But wait! There's more! Let's take another look at that strange line of code from <code>humansize.py</code>:
<p>But wait! There&#8217;s more! Let&#8217;s take another look at that strange line of code from <code>humansize.py</code>:
<pre><code>if size &lt; multiple:
return "{0:.1f} {1}".format(size, suffix)</code></pre>
<p><code>{1}</code> is replaced with the second argument passed to the <code>format()</code> method, which is <var>suffix</var>. But what is <code>{0:.1f}</code>? It's two things: <code>{0}</code>, which you recognize, and <code>:.1f</code>, which you don't. The second half (including and after the colon) defines the <i>format specifier</i>, which further refines how the replaced variable should be formatted.
<p><code>{1}</code> is replaced with the second argument passed to the <code>format()</code> method, which is <var>suffix</var>. But what is <code>{0:.1f}</code>? It&#8217;s two things: <code>{0}</code>, which you recognize, and <code>:.1f</code>, which you don&#8217;t. The second half (including and after the colon) defines the <i>format specifier</i>, which further refines how the replaced variable should be formatted.
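<p>To see the difference the format specifier makes, compare the same replacement field with and without it. The values of <var>size</var> and <var>suffix</var> here are made up for the illustration; they are not taken from <code>humansize.py</code>.

```python
size, suffix = 698.24, 'GB'

plain = "{0} {1}".format(size, suffix)
formatted = "{0:.1f} {1}".format(size, suffix)

print(plain)        # 698.24 GB
print(formatted)    # 698.2 GB

# .1f means fixed-point notation with one digit after the decimal point,
# even when the value is a whole number
print("{0:.1f}".format(1000))    # 1000.0
```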
<blockquote class="note compare clang">
<p><span>&#x261E;</span>Format specifiers allow you to munge the replacement text in a variety of useful ways, like the <code>printf()</code> function in C. You can add zero- or space-padding, align strings, control decimal precision, and even convert numbers to hexadecimal.
@@ -239,7 +239,7 @@ experience of years.</samp>
<li>The <code>count()</code> method counts the number of occurrences of a substring. Yes, there really are six &#8220;f&#8221;s in that sentence!
</ol>
<p>Here's another common case. Let's say you have a list of key-value pairs in the form <code><var>key1</var>=<var>value1</var>&amp;<var>key2</var>=<var>value2</var></code>, and you want to split them up and make a dictionary of the form <code>{key1: value1, key2: value2}</code>.
<p>Here&#8217;s another common case. Let&#8217;s say you have a list of key-value pairs in the form <code><var>key1</var>=<var>value1</var>&amp;<var>key2</var>=<var>value2</var></code>, and you want to split them up and make a dictionary of the form <code>{key1: value1, key2: value2}</code>.
<pre class=screen>
<samp class=p>>>> </samp><kbd>query = 'user=pilgrim&amp;database=master&amp;password=PapayaWhip'</kbd>
@@ -324,8 +324,8 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
<a><samp class=p>>>> </samp><kbd>s.count(by.decode('ascii'))</kbd> <span>&#x2462;</span></a>
<samp>1</samp></pre>
<ol>
<li>You can't concatenate bytes and strings. They are two different data types.
<li>You can't count the occurrences of bytes in a string, because there are no bytes in a string. A string is a sequence of characters. Perhaps you meant &#8220;count the occurrences of the string that you would get after decoding this sequence of bytes in a particular character encoding&#8221;? Well then, you'll need to say that explicitly. Python 3 won't implicitly convert bytes to strings or strings to bytes.
<li>You can&#8217;t concatenate bytes and strings. They are two different data types.
<li>You can&#8217;t count the occurrences of bytes in a string, because there are no bytes in a string. A string is a sequence of characters. Perhaps you meant &#8220;count the occurrences of the string that you would get after decoding this sequence of bytes in a particular character encoding&#8221;? Well then, you&#8217;ll need to say that explicitly. Python 3 won&#8217;t implicitly convert bytes to strings or strings to bytes.
<li>By an amazing coincidence, this line of code says &#8220;count the occurrences of the string that you would get after decoding this sequence of bytes in this particular character encoding.&#8221;
</ol>
@@ -393,7 +393,7 @@ FIXME: move this to the intro of the upcoming files chapter?
<ul>
<li><a href="http://docs.python.org/3.0/howto/unicode.html">Python Unicode HOWTO</a>
<li><a href="http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit">What's New In Python 3: Text vs. Data Instead Of Unicode vs. 8-bit</a>
<li><a href="http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit">What&#8217;s New In Python 3: Text vs. Data Instead Of Unicode vs. 8-bit</a>
</ul>
<p>On Unicode in general:
+2 -1
@@ -15,7 +15,8 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=25>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> Dive Into Python 3 <span>&#8227;</span>
<h1>Table of contents</h1>
<ol start=0>
<ol start=-1>
<li id=whats-new><a href=whats-new.html>What&#8217;s New In &#8220;Dive Into Python 3&#8221;</a>
<li>Installing Python
<ol>
<li>Python on Windows
+40 -40
@@ -17,18 +17,18 @@ body{counter-reset:h1 8}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>(Not) Diving In</h2>
<p class=f>In this chapter, you're going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in <a href="regular-expressions.html#romannumerals">&#8220;Case study: roman numerals&#8221;</a>. Now step back and consider what it would take to expand that into a two-way utility.
<p class=f>In this chapter, you&#8217;re going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in <a href="regular-expressions.html#romannumerals">&#8220;Case study: roman numerals&#8221;</a>. Now step back and consider what it would take to expand that into a two-way utility.
<p><a href="regular-expressions.html#romannumerals">The rules for Roman numerals</a> lead to a number of interesting observations:
<ol>
<li>There is only one correct way to represent a particular number as a Roman numeral.
<li>The converse is also true: if a string of characters is a valid Roman numeral, it represents only one number (that is, it can only be interpreted one way).
<li>There is a limited range of numbers that can be expressed as Roman numerals, specifically <code>1</code> through <code>3999</code>. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by <code>1000</code>, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from <code>1</code> to <code>3999</code>.)
<li>There is a limited range of numbers that can be expressed as Roman numerals, specifically <code>1</code> through <code>3999</code>. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by <code>1000</code>, but you&#8217;re not going to deal with that. For the purposes of this chapter, let&#8217;s stipulate that Roman numerals go from <code>1</code> to <code>3999</code>.)
<li>There is no way to represent <code>0</code> in Roman numerals.
<li>There is no way to represent negative numbers in Roman numerals.
<li>There is no way to represent fractions or non-integer numbers in Roman numerals.
</ol>
<p>Let's start mapping out what a <code>roman.py</code> module should do. It will have two main functions, <code>to_roman()</code> and <code>from_roman()</code>. The <code>to_roman()</code> function should take an integer from <code>1</code> to <code>3999</code> and return the Roman numeral representation as a string&hellip;</p>
<p>Stop right there. Now let's do something a little unexpected: write a test case that checks whether the <code>to_roman()</code> function does what you want it to. You read that right: you're going to write code that tests code that you haven't written yet.
<p>Let&#8217;s start mapping out what a <code>roman.py</code> module should do. It will have two main functions, <code>to_roman()</code> and <code>from_roman()</code>. The <code>to_roman()</code> function should take an integer from <code>1</code> to <code>3999</code> and return the Roman numeral representation as a string&hellip;</p>
<p>Stop right there. Now let&#8217;s do something a little unexpected: write a test case that checks whether the <code>to_roman()</code> function does what you want it to. You read that right: you&#8217;re going to write code that tests code that you haven&#8217;t written yet.
<p>This is called <i>unit testing</i>. The set of two conversion functions &mdash; <code>to_roman()</code>, and later <code>from_roman()</code> &mdash; can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named <code>unittest</code> module.
<p>Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:
<ul>
@@ -36,7 +36,7 @@ body{counter-reset:h1 8}
<li>While writing code, it keeps you from over-coding. When all the test cases pass, the function is complete.
<li>When refactoring code, it assures you that the new version behaves the same way as the old version.
<li>When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code. (&#8220;But <em>sir</em>, all the unit tests passed when I checked it in...&#8221;)
<li>When writing code in a team, it increases confidence that the code you're about to commit isn't going to break someone else's code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn't play well with others.)
<li>When writing code in a team, it increases confidence that the code you&#8217;re about to commit isn&#8217;t going to break someone else&#8217;s code, because you can run their unit tests first. (I&#8217;ve seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn&#8217;t play well with others.)
</ul>
<h2 id=romantest1>A Single Question</h2>
<aside>Every test is an island.</aside>
@@ -46,11 +46,11 @@ body{counter-reset:h1 8}
<li>...determine by itself whether the function it is testing has passed or failed, without a human interpreting the results.
<li>...run in isolation, separate from any other test cases (even if they test the same functions). Each test case is an island.
</ul>
<p>Given that, let's build a test case for the first requirement:
<p>Given that, let&#8217;s build a test case for the first requirement:
<ol>
<li>The <code>to_roman()</code> function should return the Roman numeral representation for all integers <code>1</code> to <code>3999</code>.
</ol>
<p>It is not immediately obvious how this code does&hellip; well, <em>anything</em>. It defines a class which has no <code>__init__()</code> method. The class <em>does</em> have another method, but it is never called. The entire script has a <code>__main__</code> block, but it doesn't reference the class or its method. But it does do something, I promise.
<p>It is not immediately obvious how this code does&hellip; well, <em>anything</em>. It defines a class which has no <code>__init__()</code> method. The class <em>does</em> have another method, but it is never called. The entire script has a <code>__main__</code> block, but it doesn&#8217;t reference the class or its method. But it does do something, I promise.
<p class=d>[<a href=examples/romantest1.py>download <code>romantest1.py</code></a>]
<pre><code>import roman1
import unittest
@@ -125,20 +125,20 @@ if __name__ == "__main__":
<li>To write a test case, first subclass the <code>TestCase</code> class of the <code>unittest</code> module. This class provides many useful methods which you can use in your test case to test specific conditions.
<li>This is a list of integer/numeral pairs that I verified manually. It includes the lowest ten numbers, the highest number, every number that translates to a single-character Roman numeral, and a random sampling of other valid numbers. The point of a unit test is not to test every possible input, but to test a representative sample.
<li>Every individual test is its own method, which must take no parameters and return no value. If the method exits normally without raising an exception, the test is considered passed; if the method raises an exception, the test is considered failed.
<li>Here you call the actual <code>to_roman()</code> function. (Well, the function hasn't been written yet, but once it is, this is the line that will call it.) Notice that you have now defined the <abbr>API</abbr> for the <code>to_roman()</code> function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the <abbr>API</abbr> is different than that, this test is considered failed. Also notice that you are not trapping any exceptions when you call <code>to_roman()</code>. This is intentional. <code>to_roman()</code> shouldn't raise an exception when you call it with valid input, and these input values are all valid. If <code>to_roman()</code> raises an exception, this test is considered failed.
<li>Here you call the actual <code>to_roman()</code> function. (Well, the function hasn&#8217;t been written yet, but once it is, this is the line that will call it.) Notice that you have now defined the <abbr>API</abbr> for the <code>to_roman()</code> function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the <abbr>API</abbr> is different than that, this test is considered failed. Also notice that you are not trapping any exceptions when you call <code>to_roman()</code>. This is intentional. <code>to_roman()</code> shouldn&#8217;t raise an exception when you call it with valid input, and these input values are all valid. If <code>to_roman()</code> raises an exception, this test is considered failed.
<li>Assuming the <code>to_roman()</code> function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the <em>right</em> value. This is a common question, and the <code>TestCase</code> class provides a method, <code>assertEqual</code>, to check whether two values are equal. If the result returned from <code>to_roman()</code> (<var>result</var>) does not match the known value you were expecting (<var>numeral</var>), <code>assertEqual</code> will raise an exception and the test will fail. If the two values are equal, <code>assertEqual</code> will do nothing. If every value returned from <code>to_roman()</code> matches the known value you expect, <code>assertEqual</code> never raises an exception, so <code>testToRomanKnownValues</code> eventually exits normally, which means <code>to_roman()</code> has passed this test.
</ol>
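<p>The pass-or-raise behavior of <code>assertEqual</code> described above can be seen in isolation. This is a minimal sketch, not part of <code>romantest1.py</code>; the <code>AssertEqualDemo</code> class and the <code>check()</code> helper are hypothetical names introduced only for illustration.

```python
import unittest


class AssertEqualDemo(unittest.TestCase):
    """Hypothetical throwaway test case, used only to borrow assertEqual."""

    def runTest(self):
        pass


def check(expected, actual):
    """Return 'pass' if assertEqual stays silent, 'fail' if it raises."""
    case = AssertEqualDemo()
    try:
        # When the two values are equal, assertEqual does nothing at all.
        case.assertEqual(expected, actual)
    except AssertionError:
        # When they differ, assertEqual raises AssertionError.
        return "fail"
    return "pass"


if __name__ == "__main__":
    print(check("I", "I"))    # equal values
    print(check("I", None))   # what an unwritten function would give you
```

<p>Inside a real test method you never catch the <code>AssertionError</code> yourself; letting it propagate is what marks the test as failed.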
<aside>Write a test that fails, then code until it passes.</aside>
<p>Once you have a test case, you can start coding the <code>to_roman()</code> function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you've written any code, you're doing it wrong &mdash; your tests aren't testing your code at all! Write a test that fails, then code until it passes.
<p>Once you have a test case, you can start coding the <code>to_roman()</code> function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you&#8217;ve written any code, you&#8217;re doing it wrong &mdash; your tests aren&#8217;t testing your code at all! Write a test that fails, then code until it passes.
<pre><code># roman1.py
def to_roman(n):
"""convert integer to Roman numeral"""
<a> pass <span>&#x2460;</span></a></code></pre>
<ol>
<li>At this stage, you want to define the <abbr>API</abbr> of the <code>to_roman()</code> function, but you don't want to code it yet. (Your test needs to fail first.) To stub it out, use the Python reserved word <code>pass</code> [FIXME ref], which does precisely nothing.
<li>At this stage, you want to define the <abbr>API</abbr> of the <code>to_roman()</code> function, but you don&#8217;t want to code it yet. (Your test needs to fail first.) To stub it out, use the Python reserved word <code>pass</code> [FIXME ref], which does precisely nothing.
</ol>
<p>Execute <code>romantest1.py</code> on the command line to run the test. If you call it with the <code>-v</code> command-line option, it will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:
<p>Execute <code>romantest1.py</code> on the command line to run the test. If you call it with the <code>-v</code> command-line option, it will give more verbose output so you can see exactly what&#8217;s going on as each test case runs. With any luck, your output should look like this:
<pre class=screen>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest1.py -v</kbd>
<samp><a>to_roman should give known result with known input ... FAIL <span>&#x2460;</span></a>
@@ -157,9 +157,9 @@ Traceback (most recent call last):
<a>FAILED (failures=1) <span>&#x2463;</span></a></samp></pre>
<ol>
<li>Running the script runs <code>unittest.main()</code>, which runs each test case. Each test case is a method of a class in <code>romantest1.py</code> that inherits from <code>unittest.TestCase</code>. For each test case, the <code>unittest</code> module will print out the <code>docstring</code> of the method and whether that test passed or failed. As expected, this test case fails.
<li>For each failed test case, <code>unittest</code> displays the trace information showing exactly what happened. In this case, the call to <code>assertEqual()</code> raised an <code>AssertionError</code> because it was expecting <code>to_roman(1)</code> to return <code>"I"</code>, but it didn't. (Since there was no explicit return statement, the function returned <code>None</code>, the Python null value.)
<li>For each failed test case, <code>unittest</code> displays the trace information showing exactly what happened. In this case, the call to <code>assertEqual()</code> raised an <code>AssertionError</code> because it was expecting <code>to_roman(1)</code> to return <code>"I"</code>, but it didn&#8217;t. (Since there was no explicit return statement, the function returned <code>None</code>, the Python null value.)
<li>After the detail of each test, <code>unittest</code> displays a summary of how many tests were performed and how long it took.
<li>Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass, <code>unittest</code> distinguishes between failures and errors. A failure is a call to an <code>assertXYZ</code> method, like <code>assertEqual</code> or <code>assertRaises</code>, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort of exception raised in the code you're testing or the unit test case itself.
<li>Overall, the unit test failed because at least one test case did not pass. When a test case doesn&#8217;t pass, <code>unittest</code> distinguishes between failures and errors. A failure is a call to an <code>assertXYZ</code> method, like <code>assertEqual</code> or <code>assertRaises</code>, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort of exception raised in the code you&#8217;re testing or the unit test case itself.
</ol>
<p><em>Now</em>, finally, you can write the <code>to_roman()</code> function.
<p class=d>[<a href=examples/roman1.py>download <code>roman1.py</code></a>]
@@ -186,10 +186,10 @@ def to_roman(n):
n -= integer
return result</code></pre>
<ol>
<li><var>roman_numeral_map</var> is a tuple of tuples which defines three things: the character representations of the most basic Roman numerals; the order of the Roman numerals (in descending value order, from <code>M</code> all the way down to <code>I</code>); the value of each Roman numeral. Each inner tuple is a pair of <code>(<var>numeral</var>, <var>value</var>)</code>. It's not just single-character Roman numerals; it also defines two-character pairs like <code>CM</code> (&#8220;one hundred less than one thousand&#8221;). This makes the <code>to_roman()</code> function code simpler.
<li>Here's where the rich data structure of <var>roman_numeral_map</var> pays off, because you don't need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through <var>roman_numeral_map</var> looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
<li><var>roman_numeral_map</var> is a tuple of tuples which defines three things: the character representations of the most basic Roman numerals; the order of the Roman numerals (in descending value order, from <code>M</code> all the way down to <code>I</code>); the value of each Roman numeral. Each inner tuple is a pair of <code>(<var>numeral</var>, <var>value</var>)</code>. It&#8217;s not just single-character Roman numerals; it also defines two-character pairs like <code>CM</code> (&#8220;one hundred less than one thousand&#8221;). This makes the <code>to_roman()</code> function code simpler.
<li>Here&#8217;s where the rich data structure of <var>roman_numeral_map</var> pays off, because you don&#8217;t need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through <var>roman_numeral_map</var> looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
</ol>
<p>If you're still not clear how the <code>to_roman()</code> function works, add a <code>print()</code> call to the end of the <code>while</code> loop:
<p>If you&#8217;re still not clear how the <code>to_roman()</code> function works, add a <code>print()</code> call to the end of the <code>while</code> loop:
<pre><code>
while n >= integer:
result += numeral
@@ -215,7 +215,7 @@ Ran 1 test in 0.016s
OK</samp></pre>
<ol>
<li>Hooray! The <code>to_roman()</code> function passes the &#8220;known values&#8221; test case. It's not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (<code>3999</code>), and the input that produces the longest possible Roman numeral (<code>3888</code>). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
<li>Hooray! The <code>to_roman()</code> function passes the &#8220;known values&#8221; test case. It&#8217;s not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (<code>3999</code>), and the input that produces the longest possible Roman numeral (<code>3888</code>). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
</ol>
<p>&#8220;Good&#8221; input? Hmm. What about bad input?
<h2 id=romantest2>&#8220;Halt And Catch Fire&#8221;</h2>
@@ -230,9 +230,9 @@ OK</samp></pre>
<a><samp class=p>>>> </samp><kbd>roman1.to_roman(9000)</kbd> <span>&#x2460;</span></a>
<samp>'MMMMMMMMM'</samp></pre>
<ol>
<li>That's definitely not what you wanted &mdash; that's not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is <em>baaaaaaad</em>; if a program is going to fail, it is far better that it fail quickly and noisily. &#8220;Halt and catch fire,&#8221; as the saying goes. The Pythonic way to halt and catch fire is to raise an exception.
<li>That&#8217;s definitely not what you wanted &mdash; that&#8217;s not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is <em>baaaaaaad</em>; if a program is going to fail, it is far better that it fail quickly and noisily. &#8220;Halt and catch fire,&#8221; as the saying goes. The Pythonic way to halt and catch fire is to raise an exception.
</ol>
<p>The question to ask yourself is, &#8220;How can I express this as a testable requirement?&#8221; How's this for starters:
<p>The question to ask yourself is, &#8220;How can I express this as a testable requirement?&#8221; How&#8217;s this for starters:
<blockquote>
<p>The <code>to_roman()</code> function should raise an <code>OutOfRangeError</code> when given an integer greater than <code>3999</code>.
</blockquote>
@@ -244,12 +244,12 @@ OK</samp></pre>
"""to_roman should fail with large input"""
<a> self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000) <span>&#x2462;</span></a></code></pre>
<ol>
<li>Like the previous test case, you create a class that inherits from <code>unittest.TestCase</code>. You can have more than one test per class (as you'll see later in this chapter), but I chose to create a new class here because this test is something different than the last one. We'll keep all the good input tests together in one class, and all the bad input tests together in another.
<li>Like the previous test case, you create a class that inherits from <code>unittest.TestCase</code>. You can have more than one test per class (as you&#8217;ll see later in this chapter), but I chose to create a new class here because this test is something different than the last one. We&#8217;ll keep all the good input tests together in one class, and all the bad input tests together in another.
<li>Like the previous test case, the test itself is a method of the class, with a name starting with <code>test</code>.
<li>The <code>unittest.TestCase</code> class provides the <code>assertRaises</code> method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments you're passing to that function. (If the function you're testing takes more than one argument, pass them all to <code>assertRaises</code>, in order, and it will pass them right along to the function you're testing.)
<li>The <code>unittest.TestCase</code> class provides the <code>assertRaises</code> method, which takes the following arguments: the exception you&#8217;re expecting, the function you&#8217;re testing, and the arguments you&#8217;re passing to that function. (If the function you&#8217;re testing takes more than one argument, pass them all to <code>assertRaises</code>, in order, and it will pass them right along to the function you&#8217;re testing.)
</ol>
<p>Pay close attention to this last line of code. Instead of calling <code>to_roman()</code> directly and manually checking that it raises a particular exception (by wrapping it in a <code>try...except</code> block [FIXME xref]), the <code>assertRaises</code> method has encapsulated all of that for us. All you do is tell it what exception you're expecting (<code>roman2.OutOfRangeError</code>), the function (<code>to_roman()</code>), and the function's arguments (<code>4000</code>). The <code>assertRaises</code> method takes care of calling <code>to_roman()</code> and checking that it raises <code>roman2.OutOfRangeError</code>.
<p>Also note that you're passing the <code>to_roman()</code> function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned recently how handy it is that <a href="your-first-python-program.html#everythingisanobject">everything in Python is an object</a>?
<p>Pay close attention to this last line of code. Instead of calling <code>to_roman()</code> directly and manually checking that it raises a particular exception (by wrapping it in a <code>try...except</code> block [FIXME xref]), the <code>assertRaises</code> method has encapsulated all of that for us. All you do is tell it what exception you&#8217;re expecting (<code>roman2.OutOfRangeError</code>), the function (<code>to_roman()</code>), and the function&#8217;s arguments (<code>4000</code>). The <code>assertRaises</code> method takes care of calling <code>to_roman()</code> and checking that it raises <code>roman2.OutOfRangeError</code>.
<p>Also note that you&#8217;re passing the <code>to_roman()</code> function itself as an argument; you&#8217;re not calling it, and you&#8217;re not passing the name of it as a string. Have I mentioned recently how handy it is that <a href="your-first-python-program.html#everythingisanobject">everything in Python is an object</a>?
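<p>To see what <code>assertRaises</code> saves you, here is a sketch that writes the same check twice: once by passing the function object, and once by hand with <code>try...except</code>. The <code>OutOfRangeError</code> class and the <code>to_roman()</code> body are stand-ins so the example runs on its own; they are not the contents of <code>roman2.py</code>, and the <code>test_too_large_by_hand()</code> method and <code>run_demo()</code> helper are hypothetical.

```python
import unittest


class OutOfRangeError(ValueError):
    """Stand-in exception for this sketch."""


def to_roman(n):
    """Stand-in: only the range check matters for this sketch."""
    if n > 3999:
        raise OutOfRangeError("number out of range")
    return "(numeral)"


class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        """to_roman should fail with large input"""
        # Pass the function itself plus its arguments; assertRaises makes the call.
        self.assertRaises(OutOfRangeError, to_roman, 4000)

    def test_too_large_by_hand(self):
        """the same check, spelled out manually"""
        try:
            to_roman(4000)
        except OutOfRangeError:
            pass  # the expected exception was raised
        else:
            self.fail("OutOfRangeError not raised")


def run_demo():
    """Run both tests and return the collected TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    print(run_demo().wasSuccessful())
```

<p>Writing <code>to_roman(4000)</code> as the second argument would call the function before <code>assertRaises</code> ever ran, which is why the function object and its arguments are passed separately.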
<p>So what happens when you run the test suite with this new test?
<pre class=screen>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
@@ -269,15 +269,15 @@ Ran 2 tests in 0.000s
FAILED (errors=1)</samp></pre>
<ol>
<li>You should have expected this to fail (since you haven't written any code to pass it yet), but... it didn't actually &#8220;fail,&#8221; it had an &#8220;error&#8221; instead. This is a subtle but important distinction. A unit test actually has <em>three</em> return values: pass, fail, and error. Pass, of course, means that the test passed &mdash; the code did what you expected. &#8220;Fail&#8221; is what the previous test case did (until you wrote code to make it pass) &mdash; it executed the code but the result was not what you expected. &#8220;Error&#8221; means that the code didn't even execute properly.
<li>Why didn't the code execute properly? The traceback gives the answer: the module you're testing doesn't have an exception called <code>OutOfRangeError</code>. Remember, you passed this exception to the <code>assertRaises()</code> method, because it's the exception you want the function to raise given an out-of-range input. But the exception doesn't exist, so the call to the <code>assertRaises()</code> method failed. It never got a chance to test the <code>to_roman()</code> function; it didn't get that far.
<li>You should have expected this to fail (since you haven&#8217;t written any code to pass it yet), but... it didn&#8217;t actually &#8220;fail,&#8221; it had an &#8220;error&#8221; instead. This is a subtle but important distinction. A unit test actually has <em>three</em> return values: pass, fail, and error. Pass, of course, means that the test passed &mdash; the code did what you expected. &#8220;Fail&#8221; is what the previous test case did (until you wrote code to make it pass) &mdash; it executed the code but the result was not what you expected. &#8220;Error&#8221; means that the code didn&#8217;t even execute properly.
<li>Why didn&#8217;t the code execute properly? The traceback gives the answer: the module you&#8217;re testing doesn&#8217;t have an exception called <code>OutOfRangeError</code>. Remember, you passed this exception to the <code>assertRaises()</code> method, because it&#8217;s the exception you want the function to raise given an out-of-range input. But the exception doesn&#8217;t exist, so the call to the <code>assertRaises()</code> method failed. It never got a chance to test the <code>to_roman()</code> function; it didn&#8217;t get that far.
</ol>
<p>To solve this problem, you need to define the <code>OutOfRangeError</code> exception in <code>roman2.py</code>.
<pre><code><a>class OutOfRangeError(ValueError): <span>&#x2460;</span></a>
<a> pass <span>&#x2461;</span></a></code></pre>
<ol>
<li>Exceptions are classes. An &#8220;out of range&#8221; error is a kind of value error &mdash; the argument value is out of its acceptable range. So this exception inherits from the built-in <code>ValueError</code> exception. This is not strictly necessary (it could just inherit from the base <code>Exception</code> class), but it feels right.
<li>Exceptions don't actually do anything, but you need at least one line of code to make a class. The <code>pass</code> statement does precisely nothing, but it's a line of Python code, so that makes it a class.
<li>Exceptions don&#8217;t actually do anything, but you need at least one line of code to make a class. The <code>pass</code> statement does precisely nothing, but it&#8217;s a line of Python code, so that makes it a class.
</ol>
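<p>One practical consequence of inheriting from <code>ValueError</code> is that callers who already catch <code>ValueError</code> will catch the new exception too. A small sketch; the <code>describe()</code> helper is a hypothetical name used only here.

```python
class OutOfRangeError(ValueError):
    pass


def describe(exc_class):
    """Raise exc_class, catch it as a ValueError, and report what was caught."""
    try:
        raise exc_class("number out of range")
    except ValueError as e:
        # The handler names ValueError, but the subclass is caught as well.
        return type(e).__name__ + ": " + str(e)


if __name__ == "__main__":
    print(issubclass(OutOfRangeError, ValueError))
    print(describe(OutOfRangeError))
```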
<p>Now run the test suite again.
<pre class=screen>
@@ -298,8 +298,8 @@ Ran 2 tests in 0.016s
FAILED (failures=1)</samp></pre>
<ol>
<li>The new test is still not passing, but it's not returning an error either. Instead, the test is failing. That's progress! It means the call to the <code>assertRaises()</code> method succeeded this time, and the unit test framework actually tested the <code>to_roman()</code> function.
<li>Of course, the <code>to_roman()</code> function isn't raising the <code>OutOfRangeError</code> exception you just defined, because you haven't told it to do that yet. That's excellent news! It means this is a valid test case &mdash; it fails before you write the code to make it pass.
<li>The new test is still not passing, but it&#8217;s not returning an error either. Instead, the test is failing. That&#8217;s progress! It means the call to the <code>assertRaises()</code> method succeeded this time, and the unit test framework actually tested the <code>to_roman()</code> function.
<li>Of course, the <code>to_roman()</code> function isn&#8217;t raising the <code>OutOfRangeError</code> exception you just defined, because you haven&#8217;t told it to do that yet. That&#8217;s excellent news! It means this is a valid test case &mdash; it fails before you write the code to make it pass.
</ol>
<p>Now you can write the code to make this test pass.
<p class=d>[<a href=examples/roman2.py>download <code>roman2.py</code></a>]
@@ -315,9 +315,9 @@ FAILED (failures=1)</samp></pre>
n -= integer
return result</code></pre>
<ol>
<li>This is straightforward: if the given input (<var>n</var>) is greater than <code>3999</code>, raise an <code>OutOfRangeError</code> exception. The unit test does not check the human-readable string that accompanies the exception, although you could write another test that did check it (but watch out for internationalization issues for strings that vary by the user's language or environment).
<li>This is straightforward: if the given input (<var>n</var>) is greater than <code>3999</code>, raise an <code>OutOfRangeError</code> exception. The unit test does not check the human-readable string that accompanies the exception, although you could write another test that did check it (but watch out for internationalization issues for strings that vary by the user&#8217;s language or environment).
</ol>
<p>Does this make the test pass? Let's find out.
<p>Does this make the test pass? Let&#8217;s find out.
<pre class=screen>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest2.py -v</kbd>
<samp>to_roman should give known result with known input ... ok
@@ -328,7 +328,7 @@ Ran 2 tests in 0.000s
OK</samp></pre>
<ol>
<li>Hooray! Both tests pass. Because you worked iteratively, bouncing back and forth between testing and coding, you can be sure that the two lines of code you just wrote were the cause of that one test going from &#8220;fail&#8221; to &#8220;pass.&#8221; That kind of confidence doesn't come cheap, but it will pay for itself over the lifetime of your code.
<li>Hooray! Both tests pass. Because you worked iteratively, bouncing back and forth between testing and coding, you can be sure that the two lines of code you just wrote were the cause of that one test going from &#8220;fail&#8221; to &#8220;pass.&#8221; That kind of confidence doesn&#8217;t come cheap, but it will pay for itself over the lifetime of your code.
</ol>
<h2 id=romantest3>More Halting, More Fire</h2>
@@ -342,7 +342,7 @@ OK</samp></pre>
<samp class=p>>>> </samp><kbd>roman2.to_roman(-1)</kbd>
<samp>''</samp></pre>
<p>Well <em>that's</em> not good. Let's add tests for each of these conditions.
<p>Well <em>that&#8217;s</em> not good. Let&#8217;s add tests for each of these conditions.
<p class=d>[<a href=examples/romantest3.py>download <code>romantest3.py</code></a>]
<pre><code>
@@ -359,8 +359,8 @@ class ToRomanBadInput(unittest.TestCase):
"""to_roman should fail with negative input"""
<a> self.assertRaises(roman3.OutOfRangeError, roman3.to_roman, -1) <span>&#x2462;</span></a></code></pre>
<ol>
<li>The <code>test_too_large()</code> method has not changed since the previous step. I'm including it here to show where the new code fits.
<li>Here's a new test: the <code>test_zero()</code> method. Like the <code>test_too_large()</code> method, it tells the <code>assertRaises()</code> method defined in <code>unittest.TestCase</code> to call our <code>to_roman()</code> function with a parameter of <code>0</code>, and check that it raises the appropriate exception, <code>OutOfRangeError</code>.
<li>The <code>test_too_large()</code> method has not changed since the previous step. I&#8217;m including it here to show where the new code fits.
<li>Here&#8217;s a new test: the <code>test_zero()</code> method. Like the <code>test_too_large()</code> method, it tells the <code>assertRaises()</code> method defined in <code>unittest.TestCase</code> to call our <code>to_roman()</code> function with a parameter of <code>0</code>, and check that it raises the appropriate exception, <code>OutOfRangeError</code>.
<li>The <code>test_negative()</code> method is almost identical, except it passes <code>-1</code> to the <code>to_roman()</code> function. If either of these new tests does <em>not</em> raise an <code>OutOfRangeError</code> (either because the function returns an actual value, or because it raises some other exception), the test is considered failed.
</ol>
@@ -394,7 +394,7 @@ Ran 4 tests in 0.000s
FAILED (failures=2)</samp></pre>
<p>Excellent. Both tests failed, as expected. Now let's switch over to the code and see what we can do to make them pass.
<p>Excellent. Both tests failed, as expected. Now let&#8217;s switch over to the code and see what we can do to make them pass.
<p class=d>[<a href=examples/roman3.py>download <code>roman3.py</code></a>]
<pre><code>def to_roman(n):
@@ -409,11 +409,11 @@ FAILED (failures=2)</samp></pre>
n -= integer
return result</code></pre>
<ol>
<li>This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to <code>if not ((0 &lt; n) and (n &lt; 4000))</code>, but it's much easier to read. This one line of code should catch inputs that are too large, negative, or zero.
<li>If you change your conditions, make sure to update your human-readable error strings to match. The <code>unittest</code> framework won't care, but it'll make it difficult to do manual debugging if your code is throwing incorrectly-described exceptions.
<li>This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to <code>if not ((0 &lt; n) and (n &lt; 4000))</code>, but it&#8217;s much easier to read. This one line of code should catch inputs that are too large, negative, or zero.
<li>If you change your conditions, make sure to update your human-readable error strings to match. The <code>unittest</code> framework won&#8217;t care, but it&#8217;ll make it difficult to do manual debugging if your code is throwing incorrectly-described exceptions.
</ol>
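<p>The equivalence between the chained comparison and its spelled-out form can be checked directly. This sketch uses two hypothetical helper functions and a handful of boundary values; it is independent of <code>roman3.py</code>.

```python
def in_range_chained(n):
    """Chained form: both comparisons in one expression."""
    return 0 < n < 4000


def in_range_spelled_out(n):
    """The same condition written as two comparisons joined by 'and'."""
    return (0 < n) and (n < 4000)


# Boundary values on both sides of the valid range.
samples = [-1, 0, 1, 3999, 4000]

if __name__ == "__main__":
    for n in samples:
        print(n, in_range_chained(n), in_range_spelled_out(n))
```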
<p>I could show you a whole series of unrelated examples to show that the multiple-comparisons-at-once shortcut works, but instead I'll just run the unit tests and prove it.
<p>I could show you a whole series of unrelated examples to show that the multiple-comparisons-at-once shortcut works, but instead I&#8217;ll just run the unit tests and prove it.
<pre class=screen>
<samp class=p>you@localhost:~$ </samp><kbd>python3 romantest3.py -v</kbd>
@@ -438,8 +438,8 @@ OK</samp></pre>
<a><samp class=p>>>> </samp><kbd>roman3.to_roman(1.5)</kbd> <span>&#x2461;</span></a>
<samp>'I'</samp></pre>
<ol>
<li>Oh, that's bad.
<li>Oh, that's even worse. Both of these cases should raise an exception. Instead, they give bogus results.
<li>Oh, that&#8217;s bad.
<li>Oh, that&#8217;s even worse. Both of these cases should raise an exception. Instead, they give bogus results.
</ol>
<p>Testing for non-integers is not difficult. First, define a <code>NonIntegerError</code> exception.
+44
@@ -0,0 +1,44 @@
<!DOCTYPE html>
<head>
<meta charset=utf-8>
<title>What's New In "Dive into Python 3"</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel=stylesheet type=text/css href=dip3.css>
<style>
body{counter-reset:h1 -1}
h3:before{content:""}
</style>
<link rel=stylesheet type=text/css media='only screen and (max-device-width: 480px)' href=mobile.css>
</head>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=25>&nbsp;<input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#whats-new>Dive Into Python 3</a> <span>&#8227;</span>
<p id=level>Difficulty level: <span title=advanced>&#x2666;&#x2666;&#x2666;&#x2666;&#x2662;</span>
<h1>What&#8217;s New In &#8220;Dive Into Python 3&#8221;</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Isn&#8217;t this where we came in? <span>&#x275E;</span><br>&mdash; Pink Floyd, The Wall
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin><i>a.k.a.</i> &#8220;the minus level&#8221;</h2>
<h3 id=divingin2><i>a.k.a.</i> I don&#8217;t want to read any more of this damn book than I absolutely have to</h3>
<p class=f>You read the original &#8220;<a href=http://diveintopython.org/>Dive Into Python</a>&#8221; and maybe even bought it on paper. (Thanks!) You already know Python 2 pretty well. You&#8217;re ready to take the plunge into Python 3. &hellip; If all of that is true, read on. (If none of that is true, you&#8217;d be better off <a href=your-first-python-program.html>starting at the beginning</a>.)
<p>Python 3 comes with a script called <code>2to3</code>. Learn it. Love it. Use it. <a href=porting-code-to-python-3-with-2to3.html>Porting Code to Python 3 with <code>2to3</code></a> is a reference of all the things that the <code>2to3</code> tool can fix automatically. Since a lot of those things are syntax changes, it&#8217;s a good starting point to learn about a lot of the syntax changes in Python 3. (<code>print</code> is now a function, <code>`x`</code> doesn&#8217;t work, <i class=baa>&amp;</i>c.)
<p><a href=case-study-porting-chardet-to-python-3.html>Case Study: Porting <code>chardet</code> to Python 3</a> documents my (ultimately successful) effort to port a non-trivial library from Python 2 to Python 3. It may help you; it may not. There&#8217;s a fairly steep learning curve, since you need to kind of understand the library first, so you can understand why it broke and how I fixed it. A lot of the breakage centers around strings. Speaking of which&hellip;
<p>Strings. Whew. Where to start. Python 2 had &#8220;strings&#8221; and &#8220;Unicode strings.&#8221; Python 3 has &#8220;bytes&#8221; and &#8220;strings.&#8221; That is, all strings are now Unicode strings, and if you want to deal with a bag of bytes, you use the new <code>bytes</code> type. Oh, and Python 3 will never implicitly convert between strings and bytes, so if you&#8217;re not sure which one you have, your code will almost certainly break. Read <a href=strings.html>the Strings chapter</a> for more details.
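<p>A short sketch of the strings-versus-bytes split, using only built-in methods. The sample text and the <code>try_to_mix()</code> helper are made up for illustration.

```python
text = "caf\u00e9"              # a string: four Unicode characters
data = text.encode("utf-8")     # explicit conversion from string to bytes


def try_to_mix():
    """Attempt to concatenate a string and bytes; report what happens."""
    try:
        return "abc" + b"def"
    except TypeError:
        # Python 3 refuses to convert between the two types implicitly.
        return "TypeError"


if __name__ == "__main__":
    print(type(text).__name__, len(text))
    print(type(data).__name__, len(data))
    print(data.decode("utf-8") == text)
    print(try_to_mix())
```

<p>The string has four characters but the UTF-8 encoding has five bytes, because the accented letter takes two bytes.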
<p>Even if you don&#8217;t care about Unicode, you&#8217;ll want to read about <a href=strings.html#formatting-strings>string formatting in Python 3</a>, which is completely different from Python 2.
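<p>As a small taste, and assuming the formatting style meant here is the <code>format()</code> method of strings, a sketch might look like this. The <code>describe_size()</code> helper is a hypothetical name.

```python
def describe_size(size, suffix):
    """Format a number to one decimal place, followed by a unit suffix."""
    # {0} and {1} refer to the positional arguments of format();
    # :.1f is a format specifier for one digit after the decimal point.
    return "{0:.1f} {1}".format(size, suffix)


if __name__ == "__main__":
    print(describe_size(1.0, "TB"))
    print(describe_size(931.3225746154785, "GiB"))
```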
<p>Iterators are everywhere in Python 3, and I understand them a lot better than I did five years ago when I wrote &#8220;Dive Into Python&#8221;. You need to understand them too, because lots of functions that used to return lists in Python 2 will now return iterators in Python 3. At a minimum, you should read <a href=iterators.html#a-fibonacci-iterator>the second half of the Iterators chapter</a> and <a href=advanced-iterators.html#generator-expressions>the second half of the Advanced Iterators chapter</a>.
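<p>For example, <code>map()</code> returns an iterator in Python 3 rather than a list. A minimal sketch with made-up values:

```python
squares = map(lambda x: x * x, [1, 2, 3])

is_list = isinstance(squares, list)   # False: it is an iterator, not a list
first = next(squares)                 # iterators produce one value at a time
rest = list(squares)                  # list() consumes whatever is left
leftover = list(squares)              # the iterator is now exhausted

if __name__ == "__main__":
    print(is_list, first, rest, leftover)
```

<p>If you need a real list, or need to walk the values more than once, wrap the call in <code>list()</code> right away.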
<p>By popular request, I&#8217;ve added an appendix on <a href=special-method-names.html>Special Method Names</a>, which is kind of like <a href="http://www.python.org/doc/3.0/reference/datamodel.html#special-method-names">the Python docs &#8220;Data Model&#8221; chapter</a> but with more snark.
<p>That&#8217;s it for now; the book&#8217;s not finished yet! The file I/O subsystem is totally different now; I hope to write about that soon. There are much better choices for XML processing now; I hope to write about that, too.
<!--<p class=nav><a rel=prev class=todo><span>&#x261C;</span></a> <a rel=next href=your-first-python-program.html><span>&#x261E;</span></a>-->
<p class=c>&copy; 2001&ndash;9 <a href=about.html>Mark Pilgrim</a>
<script src=jquery.js></script>
<script src=dip3.js></script>
+27 -27
@@ -20,7 +20,7 @@ th{text-align:left}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
<p class=f>Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<p class=f>Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let&#8217;s skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don&#8217;t worry about that, because you&#8217;re going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
<pre><code>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
@@ -50,7 +50,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
if __name__ == "__main__":
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))</code></pre>
<p>Now let's run this program on the command line. On Windows, it will look something like this:
<p>Now let&#8217;s run this program on the command line. On Windows, it will look something like this:
<pre class=screen>
<samp class=p>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
@@ -66,15 +66,15 @@ if __name__ == "__main__":
<pre><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):</code></pre>
<aside>When you need a function, just declare it.</aside>
<p>The keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments are separated with commas.
<p>Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.)
<p>Also note that the function doesn&#8217;t define a return datatype. Python functions do not specify the datatype of their return value; they don&#8217;t even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.)
<blockquote class=note>
<p><span>&#x261E;</span>In some languages, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's <code>None</code>), and all functions start with <code>def</code>.
<p><span>&#x261E;</span>In some languages, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it&#8217;s <code>None</code>), and all functions start with <code>def</code>.
</blockquote>
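<p>As a minimal sketch of the point about return values: the two functions below, <code>greet</code> and <code>double</code>, are hypothetical names made up for this illustration and are not part of <code>humansize.py</code>. One has no <code>return</code> statement and the other does.

```python
def greet(name):
    # No return statement here, so calling greet() evaluates to None.
    print('Hello, ' + name)

def double(n):
    # An explicit return statement hands a value back to the caller.
    return n * 2

result = greet('world')
assert result is None
assert double(21) == 42
```

<p>Both are declared with <code>def</code>; the only difference is whether a <code>return</code> statement is ever executed.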
<p>The <code>approximate_size</code> function takes the two arguments &mdash; <var>size</var> and <var>a_kilobyte_is_1024_bytes</var> &mdash; but neither argument specifies a datatype. (As you might guess from the <code>=True</code> syntax, the second argument is a boolean. You'll learn what that syntax does in [FIXME xref-was-#apihelper].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
<p>The <code>approximate_size</code> function takes two arguments &mdash; <var>size</var> and <var>a_kilobyte_is_1024_bytes</var> &mdash; but neither argument specifies a datatype. (As you might guess from the <code>=True</code> syntax, the second argument is a boolean. You&#8217;ll learn what that syntax does in [FIXME xref-was-#apihelper].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
<blockquote class="note compare java">
<p><span>&#x261E;</span>In Java and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
</blockquote>
<h3 id=datatypes>How Python's Datatypes Compare to Other Programming Languages</h3>
<h3 id=datatypes>How Python&#8217;s Datatypes Compare to Other Programming Languages</h3>
<p>An erudite reader sent me this explanation of how Python compares to other programming languages:
<dl>
<dt>statically typed language</dt>
@@ -84,13 +84,13 @@ if __name__ == "__main__":
<dd>A language in which types are discovered at execution time; the opposite of statically typed. JavaScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
</dd>
<dt>strongly typed language</dt>
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can&#8217;t treat it like a string without explicitly converting it.
</dd>
<dt>weakly typed language</dt>
<dd>A language in which types are &#8220;automagically&#8221; coerced to other types as needed; the opposite of strongly typed. <abbr>PHP</abbr> is weakly typed. In <abbr>PHP</abbr>, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion.
</dd>
</dl>
<p>So Python is both <em>dynamically typed</em> (because it doesn't use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
<p>So Python is both <em>dynamically typed</em> (because it doesn&#8217;t use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
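<p>Here is a small sketch of both properties at once. The variable names <code>value</code> and <code>total</code> are illustrative only. Rebinding a name to a different type needs no declaration (dynamic typing), but mixing a string and an integer without converting one of them raises a <code>TypeError</code> (strong typing).

```python
value = 12      # value currently refers to an integer
value = '12'    # rebinding it to a string is fine; nothing was declared

try:
    total = value + 3           # a str and an int are not coerced for you
except TypeError:
    total = int(value) + 3      # an explicit conversion is required

assert total == 15
```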
<p>If you have experience in other programming languages, this table may help you visualize how Python compares to them:
<table>
<tr><th><th>Statically typed<th>Dynamically typed
@@ -98,7 +98,7 @@ if __name__ == "__main__":
<tr><th>Strongly typed<td>Pascal, Java<td>Python, Ruby
</table>
<h2 id=readability>Writing Readable Code</h2>
<p>I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months.
<p>I won&#8217;t bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you&#8217;ve forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You&#8217;ll thank me in six months.
<h3 id=docstrings>Documentation Strings</h3>
<p>You can document a Python function by giving it a documentation string (<code>docstring</code> for short). In this program, the <code>approximate_size</code> function has a <code>docstring</code>:
<pre><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):
@@ -113,13 +113,13 @@ if __name__ == "__main__":
"""</code></pre>
<aside>Every function deserves a decent docstring.</aside>
<p>Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a <code>docstring</code>.
<p>Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you&#8217;ll see them most often used when defining a <code>docstring</code>.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>Triple quotes are also an easy way to define a string with both single and double quotes, like <code>qq/.../</code> in Perl 5.
</blockquote>
<p>Everything between the triple quotes is the function's <code>docstring</code>, which documents what the function does. A <code>docstring</code>, if it exists, must be the first thing defined in a function (that is, on the next line after the function declaration). You don't technically need to give your function a <code>docstring</code>, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the <code>docstring</code> is available at runtime as an attribute of the function.
<p>Everything between the triple quotes is the function&#8217;s <code>docstring</code>, which documents what the function does. A <code>docstring</code>, if it exists, must be the first thing defined in a function (that is, on the next line after the function declaration). You don&#8217;t technically need to give your function a <code>docstring</code>, but you always should. I know you&#8217;ve heard this in every programming class you&#8217;ve ever taken, but Python gives you an added incentive: the <code>docstring</code> is available at runtime as an attribute of the function.
<blockquote class=note>
<p><span>&#x261E;</span>Many Python <abbr>IDE</abbr>s use the <code>docstring</code> to provide context-sensitive documentation, so that when you type a function name, its <code>docstring</code> appears as a tooltip. This can be incredibly helpful, but it's only as good as the <code>docstring</code>s you write.
<p><span>&#x261E;</span>Many Python <abbr>IDE</abbr>s use the <code>docstring</code> to provide context-sensitive documentation, so that when you type a function name, its <code>docstring</code> appears as a tooltip. This can be incredibly helpful, but it&#8217;s only as good as the <code>docstring</code>s you write.
</blockquote>
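<p>To make the &#8220;available at runtime&#8221; claim concrete, here is a minimal sketch. The function <code>approximate_area</code> is a hypothetical stand-in, not something defined in <code>humansize.py</code>.

```python
def approximate_area(width, height):
    """Return the area of a rectangle.

    Hypothetical helper, used only to illustrate docstrings.
    """
    return width * height

# The docstring is an ordinary string stored on the function object.
first_line = approximate_area.__doc__.splitlines()[0]
assert first_line == 'Return the area of a rectangle.'
```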
<!--
<h3 id=functionannotations>Function Annotations</h3>
@@ -146,15 +146,15 @@ if __name__ == "__main__":
</samp></pre>
<ol>
<li>The first line imports the <code>humansize</code> program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in [FIXME xref].) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this to access functionality in other modules, and you can do it in the Python interactive shell too. This is an important concept, and you'll see a lot more of it throughout this book.
<li>When you want to use functions defined in imported modules, you need to include the module name. So you can't just say <code>approximate_size</code>; it must be <code>humansize.approximate_size</code>. If you've used classes in Java, this should feel vaguely familiar.
<li>Instead of calling the function as you would expect to, you asked for one of the function's attributes, <code>__doc__</code>.
<li>The first line imports the <code>humansize</code> program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You&#8217;ll see examples of multi-module Python programs in [FIXME xref].) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this to access functionality in other modules, and you can do it in the Python interactive shell too. This is an important concept, and you&#8217;ll see a lot more of it throughout this book.
<li>When you want to use functions defined in imported modules, you need to include the module name. So you can&#8217;t just say <code>approximate_size</code>; it must be <code>humansize.approximate_size</code>. If you&#8217;ve used classes in Java, this should feel vaguely familiar.
<li>Instead of calling the function as you would expect to, you asked for one of the function&#8217;s attributes, <code>__doc__</code>.
</ol>
<blockquote class="note compare perl5">
<p><span>&#x261E;</span><code>import</code> in Python is like <code>require</code> in Perl. Once you <code>import</code> a Python module, you access its functions with <code><var>module</var>.<var>function</var></code>; once you <code>require</code> a Perl module, you access its functions with <code><var>module</var>::<var>function</var></code>.
</blockquote>
<h3 id=importsearchpath>The <code>import</code> Search Path</h3>
<p>Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in <code>sys.path</code>. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.)
<p>Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in <code>sys.path</code>. This is just a list, and you can easily view it or modify it with standard list methods. (You&#8217;ll learn more about lists in the next chapter.)
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>sys.path</kbd> <span>&#x2461;</span></a>
@@ -164,14 +164,14 @@ if __name__ == "__main__":
<a><samp class=p>>>> </samp><kbd>sys.path.append('/my/new/path')</kbd> <span>&#x2463;</span></a></pre>
<ol>
<li>Importing the <code>sys</code> module makes all of its functions and attributes available.
<li><code>sys.path</code> is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a <code>.py</code> file whose name matches what you're trying to import.
<li><code>sys.path</code> is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you&#8217;re running, and where it was originally installed.) Python will look through these directories (in this order) for a <code>.py</code> file whose name matches what you&#8217;re trying to import.
<li>Actually, I lied; the truth is more complicated than that, because not all modules are stored as <code>.py</code> files. Some, like the <code>sys</code> module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The <code>sys</code> module is written in <abbr>C</abbr>.)
<li>You can add a new directory to Python's search path at runtime by appending the directory name to <code>sys.path</code>, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll learn more about <code>append()</code> and other list methods in [FIXME xref-was-#datatypes].)
<li>You can add a new directory to Python&#8217;s search path at runtime by appending the directory name to <code>sys.path</code>, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You&#8217;ll learn more about <code>append()</code> and other list methods in [FIXME xref-was-#datatypes].)
</ol>
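<p>The same steps can be sketched as a short script rather than an interactive session. The directory <code>/my/new/path</code> is the placeholder from the example above and does not need to exist; the variable names <code>extra_dir</code> and <code>position</code> are assumptions made for this illustration.

```python
import sys

extra_dir = '/my/new/path'   # placeholder directory; it need not exist

# sys.path is an ordinary list, so the usual list operations apply.
if extra_dir not in sys.path:
    sys.path.append(extra_dir)   # appended entries are searched last

position = sys.path.index(extra_dir)
assert sys.path[position] == extra_dir
```

<p>The membership check simply avoids adding the same directory twice if the script runs more than once in the same Python session.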
<h3 id=whatsanobject>What's An Object?</h3>
<p>Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute <code>__doc__</code>, which returns the <var>docstring</var> defined in the function's source code. The <code>sys</code> module is an object which has (among other things) an attribute called <var>path</var>. And so forth.
<p>Still, this doesn't answer the more fundamental question: what is an object? Different programming languages define &#8220;object&#8221; in different ways. In some, it means that <em>all</em> objects <em>must</em> have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in [FIXME xref-was-#datatypes]), and not all objects are subclassable (more on this in [FIXME xref-was-#fileinfo]). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more in this in [FIXME xref-was-#apihelp]).
<p>This is so important that I'm going to repeat it in case you missed it the first few times: <em>everything in Python is an object</em>. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
<h3 id=whatsanobject>What&#8217;s An Object?</h3>
<p>Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute <code>__doc__</code>, which returns the <var>docstring</var> defined in the function&#8217;s source code. The <code>sys</code> module is an object which has (among other things) an attribute called <var>path</var>. And so forth.
<p>Still, this doesn&#8217;t answer the more fundamental question: what is an object? Different programming languages define &#8220;object&#8221; in different ways. In some, it means that <em>all</em> objects <em>must</em> have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in [FIXME xref-was-#datatypes]), and not all objects are subclassable (more on this in [FIXME xref-was-#fileinfo]). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more on this in [FIXME xref-was-#apihelp]).
<p>This is so important that I&#8217;m going to repeat it in case you missed it the first few times: <em>everything in Python is an object</em>. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
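<p>A brief sketch of what &#8220;can be assigned to a variable or passed as an argument&#8221; looks like in practice. The functions <code>shout</code> and <code>apply</code> are hypothetical names invented for this example.

```python
import sys

def shout(text):
    """Return text in upper case (hypothetical example function)."""
    return text.upper()

def apply(func, argument):
    # Functions are objects, so one can be received as an argument and called.
    return func(argument)

alias = shout                   # assign a function to a variable
loud = apply(alias, 'hello')    # pass that function to another function
module_ref = sys                # a module can be assigned to a variable too

assert loud == 'HELLO'
assert module_ref.path is sys.path
```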
<h2 id=indentingcode>Indenting Code</h2>
<p>Python functions have no explicit <code>begin</code> or <code>end</code>, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (<code>:</code>) and the indentation of the code itself.
<pre><code>
@@ -187,13 +187,13 @@ if __name__ == "__main__":
raise ValueError('number too large')</code></pre>
<ol>
<li>Code blocks are defined by their indentation. By "code block," I mean functions, <code>if</code> statements, <code>for</code> loops, <code>while</code> loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. This means that whitespace is significant, and must be consistent. In this example, the function code is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not indented marks the end of the function.
<li>Code blocks are defined by their indentation. By "code block," I mean functions, <code>if</code> statements, <code>for</code> loops, <code>while</code> loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. This means that whitespace is significant, and must be consistent. In this example, the function code is indented four spaces. It doesn&#8217;t need to be four spaces, it just needs to be consistent. The first line that is not indented marks the end of the function.
<li>In Python, an <code>if</code> statement is followed by a code block. If the <code>if</code> expression evaluates to true, the indented block is executed, otherwise it falls to the <code>else</code> block (if any). (Note the lack of parentheses around the expression.)
<li>This line is inside the <code>if</code> code block. This <code>raise</code> statement will raise an exception (of type <code>ValueError</code>), but only if <code>size &lt; 0</code>.
<li>This is <em>not</em> the end of the function. Completely blank lines don't count. The function continues on the next line.
<li>This is <em>not</em> the end of the function. Completely blank lines don&#8217;t count. The function continues on the next line.
<li>The <code>for</code> loop also marks the start of a code block. Code blocks can contain multiple lines, as long as they are all indented the same amount. This <code>for</code> loop has three lines of code in it. There is no other special syntax for multi-line code blocks. Just indent and get on with your life.
</ol>
<p>After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read and understand other people's Python code.
<p>After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read and understand other people&#8217;s Python code.
<blockquote class="note compare java">
<p><span>&#x261E;</span>Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. <abbr>C++</abbr> and Java use semicolons to separate statements and curly braces to separate code blocks.
</blockquote>
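<p>As another hedged illustration of indentation-delimited blocks, here is a small function written for this purpose only. The name <code>classify</code> and its thresholds are assumptions and have nothing to do with <code>humansize.py</code>.

```python
def classify(size):
    if size < 0:
        # This line belongs to the if block because it is indented further.
        raise ValueError('number must be non-negative')

    # The blank line above does not end the function.
    labels = []
    for threshold in (10, 100, 1000):
        # Both lines below are part of the for block.
        if size < threshold:
            labels.append('under ' + str(threshold))
    return labels   # back at function level: the for block has ended

assert classify(50) == ['under 100', 'under 1000']
```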
@@ -205,9 +205,9 @@ if __name__ == "__main__":
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))</code></pre>
<blockquote class="note compare clang">
<p><span>&#x261E;</span>Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
<p><span>&#x261E;</span>Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not support in-line assignment, so there&#8217;s no chance of accidentally assigning the value you thought you were comparing.
</blockquote>
<p>So what makes this <code>if</code> statement special? Well, modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension.
<p>So what makes this <code>if</code> statement special? Well, modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module&#8217;s <code>__name__</code> depends on how you&#8217;re using the module. If you <code>import</code> the module, then <code>__name__</code> is the module&#8217;s filename, without a directory path or file extension.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import humansize</kbd>
<samp class=p>>>> </samp><kbd>humansize.__name__</kbd>