added note about list concatenation and memory usage. unrelatedly, added nonbreaking spaces around long dashes.

Mark Pilgrim
2009-06-26 00:41:29 -04:00
parent cb1b87b5b0
commit 28a13e1fbc
14 changed files with 75 additions and 74 deletions
+5 -5
@@ -119,7 +119,7 @@ if __name__ == '__main__':
<a><samp class=p>>>> </samp><kbd class=pp>{c for c in ''.join(words)}</kbd> <span class=u>&#x2463;</span></a>
<samp class=pp>{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}</samp></pre>
<ol>
<li>Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a <code>for</code> loop. Take the first item from the list, put it in the set. Second. Third. Fourth &mdash; wait, that&#8217;s in the set already, so it only gets listed once. Fifth. Sixth &mdash; again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn&#8217;t even need to be sorted first.
<li>Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a <code>for</code> loop. Take the first item from the list, put it in the set. Second. Third. Fourth&nbsp;&mdash;&nbsp;wait, that&#8217;s in the set already, so it only gets listed once. Fifth. Sixth&nbsp;&mdash;&nbsp;again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn&#8217;t even need to be sorted first.
<li>The same technique works with strings, since a string is just a sequence of characters.
<li>Given a list of strings, <code>''.join(<var>a_list</var>)</code> concatenates all the strings together into one.
<li>So, given a list of strings, this set comprehension returns all the unique characters across all the strings, with no duplicates.
@@ -228,7 +228,7 @@ StopIteration</samp></pre>
<li>That&#8217;s it! Those are all the permutations of <code>[1, 2, 3]</code> taken 2 at a time. Pairs like <code>(1, 1)</code> and <code>(2, 2)</code> never show up, because they contain repeats so they aren&#8217;t valid permutations. When there are no more permutations, the iterator raises a <code>StopIteration</code> exception.
</ol>
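<p>The behavior described in the list above can be reproduced with a short sketch. The variable names (<code>perms</code>, <code>first</code>, <code>remaining</code>, <code>exhausted</code>) are illustrative and do not come from the chapter.

```python
import itertools

# Permutations of [1, 2, 3] taken 2 at a time, consumed one at a time.
perms = itertools.permutations([1, 2, 3], 2)
first = next(perms)        # the first pair produced by the iterator
remaining = list(perms)    # drains the rest of the iterator

# The iterator is now exhausted, so another next() raises StopIteration.
try:
    next(perms)
    exhausted = False
except StopIteration:
    exhausted = True

print(first)
print(remaining)
print(exhausted)
```

<p>Pairs with a repeated item, such as <code>(1, 1)</code>, never appear in either <code>first</code> or <code>remaining</code>.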
<p>The <code>permutations()</code> function doesn&#8217;t have to take a list. It can take any sequence &mdash; even a string.
<p>The <code>permutations()</code> function doesn&#8217;t have to take a list. It can take any sequence&nbsp;&mdash;&nbsp;even a string.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>import itertools</kbd>
@@ -255,7 +255,7 @@ StopIteration</samp>
('C', 'A', 'B'), ('C', 'B', 'A')]</samp></pre>
<ol>
<li>A string is just a sequence of characters. For the purposes of finding permutations, the string <code>'ABC'</code> is equivalent to the list <code>['A', 'B', 'C']</code>.
<li>The first permutation of the 3 items <code>['A', 'B', 'C']</code>, taken 3 at a time, is <code>('A', 'B', 'C')</code>. There are five other permutations &mdash; the same three characters in every conceivable order.
<li>The first permutation of the 3 items <code>['A', 'B', 'C']</code>, taken 3 at a time, is <code>('A', 'B', 'C')</code>. There are five other permutations&nbsp;&mdash;&nbsp;the same three characters in every conceivable order.
<li>Since the <code>permutations()</code> function always returns an iterator, an easy way to debug permutations is to pass that iterator to the built-in <code>list()</code> function to see all the permutations immediately.
</ol>
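<p>A minimal sketch of the debugging trick from the last item: wrapping the iterator in <code>list()</code> shows every permutation at once, and a string behaves like a list of its characters. The names <code>from_string</code> and <code>from_list</code> are illustrative only.

```python
import itertools

# A string and the equivalent list of characters give the same permutations.
from_string = list(itertools.permutations('ABC', 3))
from_list = list(itertools.permutations(['A', 'B', 'C'], 3))

print(from_string)
print(from_string == from_list)
```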
@@ -397,7 +397,7 @@ for guess in itertools.permutations(digits, len(characters)):
<a><samp class=p>>>> </samp><kbd class=pp>'MARK'.translate(translation_table)</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>'MORK'</samp></pre>
<ol>
<li>String translation starts with a translation table, which is just a dictionary that maps one character to another. Actually, &#8220;character&#8221; is incorrect &mdash; the translation table really maps one <em>byte</em> to another.
<li>String translation starts with a translation table, which is just a dictionary that maps one character to another. Actually, &#8220;character&#8221; is incorrect&nbsp;&mdash;&nbsp;the translation table really maps one <em>byte</em> to another.
<li>Remember, bytes in Python 3 are integers. The <code>ord()</code> function returns the <abbr>ASCII</abbr> value of a character, which, in the case of A&ndash;Z, is always a byte from 65 to 90.
<li>The <code>translate()</code> method on a string takes a translation table and runs the string through it. That is, it replaces all occurrences of the keys of the translation table with the corresponding values. In this case, &#8220;translating&#8221; <code>MARK</code> to <code>MORK</code>.
</ol>
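<p>The steps above can be sketched as follows. The table shown here, mapping <code>A</code> to <code>O</code>, is an assumption chosen to match the <code>MARK</code>/<code>MORK</code> output; its keys and values are the integers returned by <code>ord()</code>.

```python
# Build a one-entry translation table: the integer for 'A' maps to the integer for 'O'.
translation_table = {ord('A'): ord('O')}

# Every 'A' in the string is replaced; other characters pass through unchanged.
result = 'MARK'.translate(translation_table)

print(translation_table)
print(result)
```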
@@ -512,7 +512,7 @@ NameError: name 'x' is not defined</samp>
NameError: name 'math' is not defined</samp></pre>
<ol>
<li>The second and third parameters passed to the <code>eval()</code> function act as the global and local namespaces for evaluating the expression. In this case, they are both empty, which means that when the string <code>"x * 5"</code> is evaluated, there is no reference to <var>x</var> in either the global or local namespace, so <code>eval()</code> throws an exception.
<li>You can selectively include specific values in the global namespace by listing them individually. Then those &mdash; and only those &mdash; variables will be available during evaluation.
<li>You can selectively include specific values in the global namespace by listing them individually. Then those&nbsp;&mdash;&nbsp;and only those&nbsp;&mdash;&nbsp;variables will be available during evaluation.
<li>Even though you just imported the <code>math</code> module, you didn&#8217;t include it in the namespace passed to the <code>eval()</code> function, so the evaluation failed.
</ol>
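<p>A hedged sketch of the three cases in the list above. The helper <code>try_eval()</code> is hypothetical; it exists only so the failures can be shown without stopping the script.

```python
import math

def try_eval(expression, global_ns, local_ns):
    """Evaluate the expression, or return 'NameError' if a name is missing."""
    try:
        return eval(expression, global_ns, local_ns)
    except NameError:
        return 'NameError'

x = 5

# Both namespaces are empty, so x cannot be found.
empty = try_eval("x * 5", {}, {})

# Only the names listed in the global namespace are visible.
with_x = try_eval("x * 5", {"x": x}, {})

# math was imported above, but it was not passed in, so the lookup fails.
without_math = try_eval("math.sqrt(x)", {"x": 16}, {})

# Passing the module explicitly makes it available.
with_math = try_eval("math.sqrt(x)", {"x": 16, "math": math}, {})

print(empty, with_x, without_math, with_math)
```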
+8 -8
@@ -77,7 +77,7 @@ del{background:#f87}
<p class=a>&#x2042;
<h2 id=running2to3>Running <code>2to3</code></h2>
<p>We&#8217;re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy &mdash; a function was renamed or moved to a different module &mdash; but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we&#8217;ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you&#8217;ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>We&#8217;re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy&nbsp;&mdash;&nbsp;a function was renamed or moved to a different module&nbsp;&mdash;&nbsp;but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we&#8217;ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you&#8217;ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>The main <code>chardet</code> package is split across several different files, all in the same directory. The <code>2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code>2to3</code> will convert each of the files in turn.
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w chardet\</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
@@ -616,7 +616,7 @@ else:
File "C:\home\chardet\chardet\universaldetector.py", line 29, in &lt;module>
import constants, sys
ImportError: No module named constants</samp></pre>
<p>What&#8217;s that you say? No module named <code>constants</code>? Of course there&#8217;s a module named <code>constants</code>. &hellip;Oh wait, no there isn&#8217;t. Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports &mdash; that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
<p>What&#8217;s that you say? No module named <code>constants</code>? Of course there&#8217;s a module named <code>constants</code>. &hellip;Oh wait, no there isn&#8217;t. Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports&nbsp;&mdash;&nbsp;that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
<pre><code class=pp>from . import constants</code></pre>
<p>But wait. Wasn&#8217;t the <code>2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code>constants</code> module within the library, and an absolute import of the <code>sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can&#8217;t, and the <code>2to3</code> script is not smart enough to split the import statement into two.
<p>The solution is to split the import statement manually. So this two-in-one import:
@@ -656,7 +656,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128&ndash;255 (0x80&ndash;0xFF). Wait, that&#8217;s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128&ndash;255.
<p>And therein lies the problem.
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string &mdash; that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string &mdash; again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string&nbsp;&mdash;&nbsp;that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string&nbsp;&mdash;&nbsp;again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
<pre><code class=pp>def feed(self, aBuf):
.
.
@@ -671,7 +671,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
for line in open(f, 'rb'):
u.feed(line)</code></pre>
<aside>Not an array of characters, but an array of bytes.</aside>
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string &mdash; an array of Unicode characters &mdash; according to the system default character encoding. (You could override the system encoding with another parameter to the <code>open()</code> function, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string&nbsp;&mdash;&nbsp;an array of Unicode characters&nbsp;&mdash;&nbsp;according to the system default character encoding. (You could override the system encoding with another parameter to the <code>open()</code> function, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
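<p>The difference between the two modes can be seen in a small sketch. The file name <code>sample.txt</code> and its contents are made up for illustration, and the encoding is passed explicitly rather than relying on the system default.

```python
import os
import tempfile

# Create a throwaway file holding one non-ASCII character, encoded as UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write('caf\u00e9\n'.encode('utf-8'))

# Binary mode: each line comes back exactly as stored, as bytes.
with open(path, 'rb') as f:
    binary_line = next(iter(f))

# Text mode: each line is decoded into a string of characters.
with open(path, 'r', encoding='utf-8') as f:
    text_line = next(iter(f))

print(type(binary_line), len(binary_line))
print(type(text_line), len(text_line))
```

<p>The byte version is one item longer than the text version, because the accented character takes two bytes in UTF-8.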
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
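<p>Before looking at the actual change, here is a standalone sketch of the rule, assuming a made-up byte array <code>data</code>. A pattern defined with a string fails against bytes, while the same pattern defined with a bytes literal succeeds.

```python
import re

data = b'abc\x80def'   # illustrative input with one high-bit byte

str_pattern = re.compile(r'[\x80-\xFF]')     # defined with a string
bytes_pattern = re.compile(b'[\x80-\xFF]')   # defined with a byte array

# Searching bytes with a string pattern raises TypeError.
try:
    str_pattern.search(data)
    str_error = None
except TypeError as e:
    str_error = str(e)

# Searching bytes with a bytes pattern works and finds the high-bit byte.
match = bytes_pattern.search(data)

print(str_error)
print(match.start(), match.group())
```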
<pre><code class=pp> class UniversalDetector:
@@ -737,7 +737,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
self._mGotData = False
self._mInputState = ePureAscii
<mark> self._mLastChar = ''</mark></code></pre>
<p>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can&#8217;t concatenate a string to a byte array &mdash; not even a zero-length string.
<p>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can&#8217;t concatenate a string to a byte array&nbsp;&mdash;&nbsp;not even a zero-length string.
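<p>A minimal sketch of the same failure outside the library. The names <code>last_char</code> and <code>buf</code> are stand-ins for <var>self._mLastChar</var> and <var>aBuf</var>.

```python
last_char = ''    # a zero-length string
buf = b'abc'      # a byte array

# Mixing the two types fails, even though the string is empty.
try:
    combined = last_char + buf
except TypeError:
    combined = None

# Starting from a zero-length byte array instead works.
fixed = b'' + buf

print(combined)
print(fixed)
```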
<p>So what is <var>self._mLastChar</var> anyway? The answer is in the <code>feed()</code> method, just a few lines down from where the traceback occurred.
<pre><code class=pp>if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
@@ -854,7 +854,7 @@ def next_state(self, c):
def feed(self, aBuf):
for c in aBuf:
codingState = self._mCodingSM.next_state(c)</code></pre>
<p>And now we have the answer. Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That&#8217;s what you get when you iterate over a string &mdash; all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there&#8217;s no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
<p>And now we have the answer. Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That&#8217;s what you get when you iterate over a string&nbsp;&mdash;&nbsp;all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there&#8217;s no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
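<p>The difference is easy to check in isolation. This sketch uses a made-up two-byte value and is not taken from the library.

```python
buf = b'AB'

# Iterating over a byte array yields integers.
from_bytes = [c for c in buf]

# Iterating over a string yields 1-character strings.
from_string = [c for c in 'AB']

# Calling ord() on an integer fails, because ord() expects a 1-character string.
try:
    ord(from_bytes[0])
    ord_failed = False
except TypeError:
    ord_failed = True

print(from_bytes, from_string, ord_failed)
```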
<p>Thus:
<pre><code class=pp> def next_state(self, c):
# for each byte we get its class
@@ -1131,7 +1131,7 @@ NameError: global name 'reduce' is not defined</samp></pre>
return 0.01
<mark> total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
<p>The <code>reduce()</code> function takes two arguments &mdash; a function and a list (strictly speaking, any iterable object will do) &mdash; and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
<p>The <code>reduce()</code> function takes two arguments&nbsp;&mdash;&nbsp;a function and a list (strictly speaking, any iterable object will do)&nbsp;&mdash;&nbsp;and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
<p>This monstrosity was so common that Python added a global <code>sum()</code> function.
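<p>A small sketch of the equivalence, assuming a made-up list in place of <var>self._mFreqCounter</var>. In Python 3, <code>reduce()</code> is no longer a global function; it is available as <code>functools.reduce()</code>.

```python
import functools
import operator

freq_counter = [3, 1, 4, 1, 5]   # hypothetical stand-in for the real counters

# The roundabout way: apply operator.add cumulatively across the list.
total_with_reduce = functools.reduce(operator.add, freq_counter)

# The direct way.
total_with_sum = sum(freq_counter)

print(total_with_reduce, total_with_sum)
```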
<pre><code class=pp> def get_confidence(self):
if self.get_state() == constants.eNotMe:
@@ -1185,7 +1185,7 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide
<p>What have we learned?
<ol>
<li>Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There&#8217;s no way around it. It&#8217;s hard.
<li>The <a href=porting-code-to-python-3-with-2to3.html>automated <code>2to3</code> tool</a> is helpful as far as it goes, but it will only do the easy parts &mdash; function renames, module renames, syntax changes. It&#8217;s an impressive piece of engineering, but in the end it&#8217;s just an intelligent search-and-replace bot.
<li>The <a href=porting-code-to-python-3-with-2to3.html>automated <code>2to3</code> tool</a> is helpful as far as it goes, but it will only do the easy parts&nbsp;&mdash;&nbsp;function renames, module renames, syntax changes. It&#8217;s an impressive piece of engineering, but in the end it&#8217;s just an intelligent search-and-replace bot.
<li>The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the <code>chardet</code> library is to convert a stream of bytes into a string. But &#8220;a stream of bytes&#8221; comes up more often than you might think. Reading a file in &#8220;binary&#8221; mode? You&#8217;ll get a stream of bytes. Fetching a web page? Calling a web <abbr>API</abbr>? They return a stream of bytes, too.
<li><em>You</em> need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
<li>Test cases are essential. Don&#8217;t port anything without them. Don&#8217;t even try. The <em>only</em> reason I have any confidence at all that <code>chardet</code> works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I <em>never</em> would have found half of these problems with manual spot-checking.
+1 -1
@@ -37,7 +37,7 @@ Classname Legend
.c = "centered" = centered footer text (also clears floats)
.a = "asterism" = section break
.v = "navigation" = prev/next navigation links (not breadcrumbs)
.u = "Unicode" = text contains Unicode characters (requires special font declaration)
.u = "Unicode" = text contains Unicode characters (requires special font declaration to accommodate *cough* a certain browser)
.nm = "no mobile" = hide this section on mobile devices
.nd = "no decoration" = hide the widgets on this code block
+3 -3
@@ -20,7 +20,7 @@ body{counter-reset:h1 5}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
<p class=f>For reasons passing all understanding, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, &#8220;borrows&#8221; is the wrong word; &#8220;pillages&#8221; is more like it. Or perhaps &#8220;assimilates&#8221; &mdash; like the Borg. Yes, I like that.
<p class=f>For reasons passing all understanding, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, &#8220;borrows&#8221; is the wrong word; &#8220;pillages&#8221; is more like it. Or perhaps &#8220;assimilates&#8221;&nbsp;&mdash;&nbsp;like the Borg. Yes, I like that.
<p class=c><code>We are the Borg. Your linguistic and etymological distinctiveness will be added to our own. Resistance is futile.</code>
<p>In this chapter, you&#8217;re going to learn about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. But first, let&#8217;s talk about how to make plural nouns. (If you haven&#8217;t read <a href=regular-expressions.html>the chapter on regular expressions</a>, now would be a good time. This chapter assumes you understand the basics of regular expressions, and it quickly descends into more advanced uses.)
<p>If you grew up in an English-speaking country or learned English in a formal school setting, you&#8217;re probably familiar with the basic rules:
@@ -170,7 +170,7 @@ def plural(noun):
</ol>
<aside>The &#8220;rules&#8221; variable is a list of functions.</aside>
<p>The reason this technique works is that <a href=your-first-python-program.html#everythingisanobject>everything in Python is an object</a>, including functions. The <var>rules</var> data structure contains functions &mdash; not names of functions, but actual function objects. When they get assigned in the <code>for</code> loop, then <var>matches_rule</var> and <var>apply_rule</var> are actual functions that you can call. On the first iteration of the <code>for</code> loop, this is equivalent to calling <code>matches_sxz(noun)</code>, and if it returns a match, calling <code>apply_sxz(noun)</code>.
<p>The reason this technique works is that <a href=your-first-python-program.html#everythingisanobject>everything in Python is an object</a>, including functions. The <var>rules</var> data structure contains functions&nbsp;&mdash;&nbsp;not names of functions, but actual function objects. When they get assigned in the <code>for</code> loop, then <var>matches_rule</var> and <var>apply_rule</var> are actual functions that you can call. On the first iteration of the <code>for</code> loop, this is equivalent to calling <code>matches_sxz(noun)</code>, and if it returns a match, calling <code>apply_sxz(noun)</code>.
<p>If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire <code>for</code> loop is equivalent to the following:
@@ -392,7 +392,7 @@ def plural(noun):
<p>What have you gained over stage 4? Startup time. In stage 4, when you imported the <code>plural4</code> module, it read the entire patterns file and built a list of all the possible rules, before you could even think about calling the <code>plural()</code> function. With generators, you can do everything lazily: you read the first rule and create functions and try them, and if that works you don&#8217;t ever read the rest of the file or create any other functions.
<p>What have you lost? Performance! Every time you call the <code>plural()</code> function, the <code>rules()</code> generator starts over from the beginning &mdash; which means re-opening the patterns file and reading from the beginning, one line at a time.
<p>What have you lost? Performance! Every time you call the <code>plural()</code> function, the <code>rules()</code> generator starts over from the beginning&nbsp;&mdash;&nbsp;which means re-opening the patterns file and reading from the beginning, one line at a time.
<p>What if you could have the best of both worlds: minimal startup cost (don&#8217;t execute any code on <code>import</code>), <em>and</em> maximum performance (don&#8217;t build the same functions over and over again). Oh, and you still want to keep the rules in a separate file (because code is code and data is data), just as long as you never have to read the same line twice.
+7 -7
@@ -23,7 +23,7 @@ mark{display:inline}
<h2 id=divingin>Diving In</h2>
<p class=f>HTTP web services are programmatic ways of sending and receiving data from remote servers using nothing but the operations of <abbr>HTTP</abbr>. If you want to get data from the server, use <abbr>HTTP</abbr> <code>GET</code>; if you want to send new data to the server, use <abbr>HTTP</abbr> <code>POST</code>. Some more advanced <abbr>HTTP</abbr> web service <abbr>API</abbr>s also define ways of creating, modifying, and deleting data, using <abbr>HTTP</abbr> <code>PUT</code> and <abbr>HTTP</abbr> <code>DELETE</code>. In other words, the &#8220;verbs&#8221; built into the <abbr>HTTP</abbr> protocol (<code>GET</code>, <code>POST</code>, <code>PUT</code>, and <code>DELETE</code>) can map directly to application-level operations for retrieving, creating, modifying, and deleting data.
<p>The main advantage of this approach is simplicity, and its simplicity has proven popular. Data &mdash; usually <abbr>XML</abbr> data &mdash; can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an <abbr>HTTP</abbr> library for downloading it. Debugging is also easier; because each resource in an <abbr>HTTP</abbr> web service has a unique address (in the form of a <abbr>URL</abbr>), you can load it in your web browser and immediately see the raw data.
<p>The main advantage of this approach is simplicity, and its simplicity has proven popular. Data&nbsp;&mdash;&nbsp;usually <abbr>XML</abbr> data&nbsp;&mdash;&nbsp;can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an <abbr>HTTP</abbr> library for downloading it. Debugging is also easier; because each resource in an <abbr>HTTP</abbr> web service has a unique address (in the form of a <abbr>URL</abbr>), you can load it in your web browser and immediately see the raw data.
<p>Examples of <abbr>HTTP</abbr> web services:
<ul>
@@ -52,7 +52,7 @@ mark{display:inline}
<h3 id=caching>Caching</h3>
<p>The most important thing to understand about any type of web service is that network access is incredibly expensive. I don&#8217;t mean &#8220;dollars and cents&#8221; expensive (although bandwidth ain&#8217;t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, <i>latency</i> (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack &mdash; there&#8217;s <a href=http://isc.sans.org/>never a dull moment</a> on the public internet, and there may be nothing you can do about it.
<p>The most important thing to understand about any type of web service is that network access is incredibly expensive. I don&#8217;t mean &#8220;dollars and cents&#8221; expensive (although bandwidth ain&#8217;t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, <i>latency</i> (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack&nbsp;&mdash;&nbsp;there&#8217;s <a href=http://isc.sans.org/>never a dull moment</a> on the public internet, and there may be nothing you can do about it.
<p><abbr>HTTP</abbr> is designed with caching in mind. There is an entire class of devices (called &#8220;caching proxies&#8221;) whose only job is to sit between you and the rest of the world and minimize network access. Your company or <abbr>ISP</abbr> almost certainly maintains caching proxies, even if you&#8217;re unaware of them. They work because caching is built into the <abbr>HTTP</abbr> protocol.
@@ -295,7 +295,7 @@ Content-Type: application/xml</samp>
<li>&hellip;the exact same 3070 bytes you downloaded last time.
</ol>
<p><abbr>HTTP</abbr> is designed to work better than this. <code>urllib</code> speaks <abbr>HTTP</abbr> like I speak Spanish &mdash; enough to get by in a jam, but not enough to hold a conversation. <abbr>HTTP</abbr> is a conversation. It&#8217;s time to upgrade to a library that speaks <abbr>HTTP</abbr> fluently.
<p><abbr>HTTP</abbr> is designed to work better than this. <code>urllib</code> speaks <abbr>HTTP</abbr> like I speak Spanish&nbsp;&mdash;&nbsp;enough to get by in a jam, but not enough to hold a conversation. <abbr>HTTP</abbr> is a conversation. It&#8217;s time to upgrade to a library that speaks <abbr>HTTP</abbr> fluently.
<p class=a>&#x2042;
@@ -363,9 +363,9 @@ Content-Type: application/xml</samp>
<li>Let&#8217;s turn on debugging and see <a href=#whats-on-the-wire>what&#8217;s on the wire</a>. This is the <code>httplib2</code> equivalent of turning on debugging in <code>http.client</code>. <code>httplib2</code> will print all the data being sent to the server and some key information being sent back.
<li>Create an <code>httplib2.Http</code> object with the same directory name as before.
<li>Request the same <abbr>URL</abbr> as before. <em>Nothing appears to happen.</em> More precisely, nothing gets sent to the server, and nothing gets returned from the server. There is absolutely no network activity whatsoever.
<li>Yet we did &#8220;receive&#8221; some data &mdash; in fact, we received all of it.
<li>Yet we did &#8220;receive&#8221; some data&nbsp;&mdash;&nbsp;in fact, we received all of it.
<li>We also &#8220;received&#8221; an <abbr>HTTP</abbr> status code indicating that the &#8220;request&#8221; was successful.
<li>Here&#8217;s the rub: this &#8220;response&#8221; was generated from <code>httplib2</code>&#8217;s local cache. That directory name you passed in when you created the <code>httplib2.Http</code> object &mdash; that directory holds <code>httplib2</code>&#8217;s cache of all the operations it&#8217;s ever performed.
<li>Here&#8217;s the rub: this &#8220;response&#8221; was generated from <code>httplib2</code>&#8217;s local cache. That directory name you passed in when you created the <code>httplib2.Http</code> object&nbsp;&mdash;&nbsp;that directory holds <code>httplib2</code>&#8217;s cache of all the operations it&#8217;s ever performed.
</ol>
<p>You previously requested the data at this <abbr>URL</abbr>. That request was successful (<code>status: 200</code>). That response included not only the feed data, but also a set of <a href=#caching>caching headers</a> that told anyone who was listening that they could cache this resource for up to 24 hours (<code>Cache-Control: max-age=86400</code>, which is 24 hours measured in seconds). <code>httplib2</code> understands and respects those caching headers, and it stored the previous response in the <code>.cache</code> directory (which you passed in when you created the <code>Http</code> object). That cache hasn&#8217;t expired yet, so the second time you request the data at this <abbr>URL</abbr>, <code>httplib2</code> simply returns the cached result without ever hitting the network.
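<p>The freshness arithmetic behind that paragraph can be sketched in a few lines. This is only an illustration of how a <code>max-age</code> value might be checked, not <code>httplib2</code>&#8217;s actual implementation; the function name <code>is_fresh()</code> and the timestamps are made up for the example.

```python
import time

def is_fresh(cache_control, stored_at, now=None):
    """Return True if a response cached at `stored_at` is still within max-age.

    Hypothetical helper: it only understands the max-age directive.
    """
    now = time.time() if now is None else now
    for directive in cache_control.split(','):
        name, _, value = directive.strip().partition('=')
        if name.lower() == 'max-age' and value.isdigit():
            return (now - stored_at) < int(value)
    return False  # no max-age directive: treated as stale in this sketch

stored_at = 1_000_000  # made-up timestamp of the first request, in seconds
print(is_fresh('max-age=86400', stored_at, now=stored_at + 3600))   # one hour later
print(is_fresh('max-age=86400', stored_at, now=stored_at + 90000))  # 25 hours later
```

<p>One hour after the first request the cached copy is still usable; 25 hours later it is not, and a real client would have to go back to the network.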
@@ -409,7 +409,7 @@ reply: 'HTTP/1.1 200 OK'
'content-type': 'application/xml'}</samp></pre>
<ol>
<li><code>httplib2</code> allows you to add arbitrary <abbr>HTTP</abbr> headers to any outgoing request. In order to bypass <em>all</em> caches (not just your local disk cache, but also any caching proxies between you and the remote server), add a <code>no-cache</code> header in the <var>headers</var> dictionary.
<li>Now you see <code>httplib2</code> initiating a network request. <code>httplib2</code> understands and respects caching headers <em>in both directions</em> &mdash; as part of the incoming response <em>and as part of the outgoing request</em>. It noticed that you added the <code>no-cache</code> header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data.
<li>Now you see <code>httplib2</code> initiating a network request. <code>httplib2</code> understands and respects caching headers <em>in both directions</em>&nbsp;&mdash;&nbsp;as part of the incoming response <em>and as part of the outgoing request</em>. It noticed that you added the <code>no-cache</code> header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data.
<li>This response was <em>not</em> generated from your local cache. You knew that, of course, because you saw the debugging information on the outgoing request. But it&#8217;s nice to have that programmatically verified.
<li>The request succeeded; you downloaded the entire feed again from the remote server. Of course, the server also sent back a full complement of <abbr>HTTP</abbr> headers along with the feed data. That includes caching headers, which <code>httplib2</code> uses to update its local cache, in the hopes of avoiding network access the <em>next</em> time you request this feed. Everything about <abbr>HTTP</abbr> caching is designed to maximize cache hits and minimize network access. Even though you bypassed the cache this time, the remote server would really appreciate it if you would cache the result for next time.
</ol>
@@ -477,7 +477,7 @@ user-agent: Python-httplib2/$Rev: 259 $'
<li><code>httplib2</code> also sends the <code>Last-Modified</code> validator back to the server in the <code>If-Modified-Since</code> header.
<li>The server looked at these validators, looked at the page you requested, and determined that the page has not changed since you last requested it, so it sends back a <code>304</code> status code <em>and no data</em>.
<li>Back on the client, <code>httplib2</code> notices the <code>304</code> status code and loads the content of the page from its cache.
<li>This might be a bit confusing. There are really <em>two</em> status codes &mdash; <code>304</code> (returned from the server this time, which caused <code>httplib2</code> to look in its cache), and <code>200</code> (returned from the server <em>last time</em>, and stored in <code>httplib2</code>&#8217;s cache along with the page data). <code>response.status</code> returns the status from the cache.
<li>This might be a bit confusing. There are really <em>two</em> status codes&nbsp;&mdash;&nbsp;<code>304</code> (returned from the server this time, which caused <code>httplib2</code> to look in its cache), and <code>200</code> (returned from the server <em>last time</em>, and stored in <code>httplib2</code>&#8217;s cache along with the page data). <code>response.status</code> returns the status from the cache.
<li>If you want the raw status code returned from the server, you can get that by looking in <code>response.dict</code>, which is a dictionary of the actual headers returned from the server.
<li>However, you still get the data in the <var>content</var> variable. Generally, you don&#8217;t need to know why a response was served from the cache. (You may not even care that it was served from the cache at all, and that&#8217;s fine too. <code>httplib2</code> is smart enough to let you act dumb.) By the time the <code>request()</code> method returns to the caller, <code>httplib2</code> has already updated its cache and returned the data to you.
</ol>
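<p>The two status codes are easier to keep apart with a toy model. The sketch below simulates both sides of the conversation in plain Python, with no network involved. The names <code>serve()</code> and <code>fetch()</code>, the cache layout, and the use of an <code>ETag</code>-style validator are all assumptions made for illustration; this is not how <code>httplib2</code> is written.

```python
def serve(request_headers, current_etag, body):
    """Hypothetical server: answer 304 with no data when the validator matches."""
    if request_headers.get('if-none-match') == current_etag:
        return 304, b''
    return 200, body

def fetch(cache, url, current_etag, body):
    """Hypothetical client: returns (status, content, raw_status_from_server)."""
    headers = {}
    if url in cache:
        # Send the stored validator back to the server
        headers['if-none-match'] = cache[url]['etag']
    raw_status, data = serve(headers, current_etag, body)
    if raw_status == 304:
        # Nothing changed: reuse the cached page and the cached status
        entry = cache[url]
        return entry['status'], entry['body'], raw_status
    cache[url] = {'etag': current_etag, 'status': raw_status, 'body': data}
    return raw_status, data, raw_status

cache = {}
first = fetch(cache, 'http://example.org/feed', '"abc"', b'<feed/>')
second = fetch(cache, 'http://example.org/feed', '"abc"', b'<feed/>')
print(first)   # full download
print(second)  # served from the cache after a 304
```

<p>The second call still hands back the page data and a <code>200</code>, while the raw <code>304</code> is kept alongside for anyone who cares.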
+5 -5
@@ -288,8 +288,8 @@ rules = LazyRules()</code></pre>
<a> return self <span class=u>&#x2462;</span></a>
</code></pre>
<ol>
<li>The <code>__iter__()</code> method will be called every time someone &mdash; say, a <code>for</code> loop &mdash; calls <code>iter(rules)</code>.
<li>This is the place to reset the counter that we&#8217;re going to use to retrieve items from the cache (that we haven&#8217;t built yet &mdash; patience, grasshopper).
<li>The <code>__iter__()</code> method will be called every time someone&nbsp;&mdash;&nbsp;say, a <code>for</code> loop&nbsp;&mdash;&nbsp;calls <code>iter(rules)</code>.
<li>This is the place to reset the counter that we&#8217;re going to use to retrieve items from the cache (that we haven&#8217;t built yet&nbsp;&mdash;&nbsp;patience, grasshopper).
<li>Finally, the <code>__iter__()</code> method returns <var>self</var>, which signals that this class will take care of returning its own values throughout an iteration.
</ol>
@@ -303,7 +303,7 @@ rules = LazyRules()</code></pre>
<a> self.cache.append(funcs) <span class=u>&#x2462;</span></a>
return funcs</code></pre>
<ol>
<li>The <code>__next__()</code> method gets called whenever someone &mdash; say, a <code>for</code> loop &mdash; calls <code>next(rules)</code>. This method will only make sense if we start at the end and work backwards. So let&#8217;s do that.
<li>The <code>__next__()</code> method gets called whenever someone&nbsp;&mdash;&nbsp;say, a <code>for</code> loop&nbsp;&mdash;&nbsp;calls <code>next(rules)</code>. This method will only make sense if we start at the end and work backwards. So let&#8217;s do that.
<li>The last part of this function should look familiar, at least. The <code>build_match_and_apply_functions()</code> function hasn&#8217;t changed; it&#8217;s the same as it ever was. <em>Each line of the pattern file will be read exactly once, as late as possible.</em>
<li>The only difference is that, before returning the match and apply functions (which are stored in the tuple <var>funcs</var>), we&#8217;re going to save them in <code>self.cache</code>. <em>Each match and apply function will be built exactly once, as late as possible, then cached.</em>
</ol>
@@ -341,7 +341,7 @@ rules = LazyRules()</code></pre>
.</code></pre>
<ol>
<li><code>self.cache</code> will be a list of the functions we need to match and apply individual rules. (At least <em>that</em> should sound familiar!) <code>self.cache_index</code> keeps track of which cached item we should return next. If we haven&#8217;t exhausted the cache yet (<i>i.e.</i> if the length of <code>self.cache</code> is greater than <code>self.cache_index</code>), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch.
<li>On the other hand, if we don&#8217;t get a hit from the cache, <em>and</em> the file object has been closed (which could happen, further down the method, as you saw in the previous code snippet), then there&#8217;s nothing more we can do. If the file is closed, it means we&#8217;ve exhausted it &mdash; we&#8217;ve already read through every line from the pattern file, and we&#8217;ve already built and cached the match and apply functions for each pattern. The file is exhausted; the cache is exhausted; I&#8217;m exhausted. Wait, what? Hang in there, we&#8217;re almost done.
<li>On the other hand, if we don&#8217;t get a hit from the cache, <em>and</em> the file object has been closed (which could happen, further down the method, as you saw in the previous code snippet), then there&#8217;s nothing more we can do. If the file is closed, it means we&#8217;ve exhausted it&nbsp;&mdash;&nbsp;we&#8217;ve already read through every line from the pattern file, and we&#8217;ve already built and cached the match and apply functions for each pattern. The file is exhausted; the cache is exhausted; I&#8217;m exhausted. Wait, what? Hang in there, we&#8217;re almost done.
</ol>
<p>Putting it all together, here&#8217;s what happens when:
@@ -352,7 +352,7 @@ rules = LazyRules()</code></pre>
<li>Let&#8217;s say, for the sake of argument, that the very first rule matched. If so, no further match and apply functions are built, and no further lines are read from the pattern file.
<li>Furthermore, for the sake of argument, suppose that the caller calls the <code>plural()</code> function <em>again</em> to pluralize a different word. The <code>for</code> loop in the <code>plural()</code> function will call <code>iter(rules)</code>, which will reset the cache index but will not reset the open file object.
<li>The first time through, the <code>for</code> loop will ask for a value from <var>rules</var>, which will invoke its <code>__next__()</code> method. This time, however, the cache is primed with a single pair of match and apply functions, corresponding to the patterns in the first line of the pattern file. Since they were built and cached in the course of pluralizing the previous word, they&#8217;re retrieved from the cache. The cache index increments, and the open file is never touched.
<li>Let&#8217;s say, for the sake of argument, that the first rule does <em>not</em> match this time around. So the <code>for</code> loop comes around again and asks for another value from <var>rules</var>. This invokes the <code>__next__()</code> method a second time. This time, the cache is exhausted &mdash; it only contained one item, and we&#8217;re asking for a second &mdash; so the <code>__next__()</code> method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them.
<li>Let&#8217;s say, for the sake of argument, that the first rule does <em>not</em> match this time around. So the <code>for</code> loop comes around again and asks for another value from <var>rules</var>. This invokes the <code>__next__()</code> method a second time. This time, the cache is exhausted&nbsp;&mdash;&nbsp;it only contained one item, and we&#8217;re asking for a second&nbsp;&mdash;&nbsp;so the <code>__next__()</code> method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them.
<li>This read-build-and-cache process will continue as long as the rules being read from the pattern file don&#8217;t match the word we&#8217;re trying to pluralize. If we do find a matching rule before the end of the file, we simply use it and stop, with the file still open. The file pointer will stay wherever we stopped reading, waiting for the next <code>readline()</code> command. In the meantime, the cache now has more items in it, and if we start all over again trying to pluralize a new word, each of those items in the cache will be tried before reading the next line from the pattern file.
</ul>
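<p>The read-once, cache-forever behavior walked through above can be tried out with a stripped-down iterator. This sketch follows the same shape as <code>LazyRules</code> but is not the book&#8217;s class: <code>LazyLines</code>, the <code>reads</code> counter, and the in-memory <code>io.StringIO</code> &#8220;file&#8221; are stand-ins invented for the demonstration, and it caches plain strings instead of match and apply functions.

```python
import io

class LazyLines:
    """Iterate over a file-like object, reading each line at most once."""

    def __init__(self, source):
        self.source = source
        self.cache = []
        self.reads = 0  # demonstration only: counts lines actually read

    def __iter__(self):
        self.cache_index = 0  # restart from the cache, but leave the file alone
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]  # cache hit
        if self.source.closed:
            raise StopIteration  # file and cache are both exhausted
        line = self.source.readline()
        if not line:
            self.source.close()
            raise StopIteration
        self.reads += 1
        item = line.strip()
        self.cache.append(item)
        return item

rules = LazyLines(io.StringIO('first\nsecond\nthird\n'))

for rule in rules:
    break  # pretend the very first rule matched
first_pass_reads = rules.reads

for rule in rules:
    if rule == 'second':
        break  # first item came from the cache, second from the file
second_pass_reads = rules.reads

print(first_pass_reads, second_pass_reads)
```

<p>After the first loop only one line has been read. The second loop reuses that cached line and reads exactly one more.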
+8 -7
@@ -32,7 +32,7 @@ body{counter-reset:h1 2}
<li><b>Dictionaries</b> are unordered bags of key-value pairs.
</ol>
<p>Of course, there are a lot more types than these seven. <a href=your-first-python-program.html#everythingisanobject>Everything is an object</a> in Python, so there are types like <i>module</i>, <i>function</i>, <i>class</i>, <i>method</i>, <i>file</i>, and even <i>compiled code</i>. You&#8217;ve already seen some of these: <a href=your-first-python-program.html#runningscripts>modules have names</a>, <a href=your-first-python-program.html#docstrings>functions have <code>docstrings</code></a>, <i class=baa>&amp;</i>c. You&#8217;ll learn about classes in [FIXME xref] and files in [FIXME xref].
<p>Strings and bytes are important enough &mdash; and complicated enough &mdash; that they get their own chapter. Let&#8217;s look at the others first.
<p>Strings and bytes are important enough&nbsp;&mdash;&nbsp;and complicated enough&nbsp;&mdash;&nbsp;that they get their own chapter. Let&#8217;s look at the others first.
<p class=a>&#x2042;
<h2 id=booleans>Booleans</h2>
@@ -272,19 +272,20 @@ ZeroDivisionError: Fraction(0, 0)</samp></pre>
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_list = ['a']</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_list = a_list + [2.0, 3]</kbd> <span class=u>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd class=pp>a_list</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_list</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>['a', 2.0, 3]</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_list.append(True)</kbd> <span class=u>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_list.append(True)</kbd> <span class=u>&#x2462;</span></a>
<samp class=p>>>> </samp><kbd class=pp>a_list</kbd>
<samp class=pp>['a', 2.0, 3, True]</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_list.extend(['four', '&Omega;'])</kbd> <span class=u>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_list.extend(['four', '&Omega;'])</kbd> <span class=u>&#x2463;</span></a>
<samp class=p>>>> </samp><kbd class=pp>a_list</kbd>
<samp class=pp>['a', 2.0, 3, True, 'four', '&Omega;']</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_list.insert(0, '&Omega;')</kbd> <span class=u>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_list.insert(0, '&Omega;')</kbd> <span class=u>&#x2464;</span></a>
<samp class=p>>>> </samp><kbd class=pp>a_list</kbd>
<samp class=pp>['&Omega;', 'a', 2.0, 3, True, 'four', '&Omega;']</samp></pre>
<ol>
<li>The <code>+</code> operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don&#8217;t all need to be the same type. Here we have a list containing a string, a floating point number, and an integer.
<li>The <code>+</code> operator concatenates lists to create a new list. A list can contain any number of items; there is no size limit (other than available memory). However, if memory is a concern, you should be aware that list concatenation creates a second list in memory. In this case, that new list is immediately assigned to the existing variable <var>a_list</var>. So this line of code is really a two-step process&nbsp;&mdash;&nbsp;concatenation then assignment&nbsp;&mdash;&nbsp;which can (temporarily) consume a lot of memory when you&#8217;re dealing with large lists.
<li>A list can contain items of any datatype, and the items in a single list don&#8217;t all need to be the same type. Here we have a list containing a string, a floating point number, and an integer.
<li>The <code>append()</code> method adds a single item to the end of the list. (Now we have <em>four</em> different datatypes in the list!)
<li>Lists are implemented as classes. &#8220;Creating&#8221; a list is really instantiating a class. As such, a list has methods that operate on it. The <code>extend()</code> method takes one argument, a list, and appends each of the items of the argument to the original list.
<li>The <code>insert()</code> method inserts a single item into a list. The first argument is the index of the first item in the list that will get bumped out of position. List items do not need to be unique; for example, there are now two separate items with the value <code>'&Omega;'</code>: the first item, <code>a_list[0]</code>, and the last item, <code>a_list[6]</code>.
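<p>The difference between concatenating and extending shows up as soon as a second name points at the same list. This is a small sketch; <var>alias</var>, <var>b_list</var>, and <var>b_alias</var> are names made up for the demonstration.

```python
a_list = ['a']
alias = a_list                 # a second name for the same list object
a_list = a_list + [2.0, 3]     # builds a new list, then rebinds the name a_list
rebound = a_list is not alias  # the old list still exists, untouched

b_list = ['a']
b_alias = b_list
b_list.extend([2.0, 3])        # grows the existing list in place
same_object = b_list is b_alias

print(alias, a_list, rebound)
print(b_alias, same_object)
```

<p>With <code>+</code>, the original one-item list lives on under its other name while the new list is built. With <code>extend()</code>, there is only ever one list.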
@@ -487,7 +488,7 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li>You can add new key-value pairs at any time. This syntax is identical to modifying existing values.
<li>The new dictionary item (key <code>'user'</code>, value <code>'mark'</code>) appears to be in the middle. In fact, it was just a coincidence that the items appeared to be in order in the first example; it is just as much a coincidence that they appear to be out of order now.
<li>Assigning a value to an existing dictionary key simply replaces the old value with the new one.
<li>Will this change the value of the <code>user</code> key back to "mark"? No! Look at the key closely &mdash; that&#8217;s a capital <kbd>U</kbd> in <kbd>"User"</kbd>. Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it&#8217;s completely different.
<li>Will this change the value of the <code>user</code> key back to "mark"? No! Look at the key closely&nbsp;&mdash;&nbsp;that&#8217;s a capital <kbd>U</kbd> in <kbd>"User"</kbd>. Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it&#8217;s completely different.
</ol>
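<p>Case sensitivity is easy to check for yourself. The dictionary below is a made-up stand-in for the one in the example above.

```python
a_dict = {'server': 'db.diveintopython3.org', 'user': 'mark'}
a_dict['user'] = 'dora'   # existing key: the old value is replaced
a_dict['User'] = 'mark'   # capital U: a brand new key-value pair

print(len(a_dict))
print(a_dict['user'], a_dict['User'])
```

<p>The dictionary ends up with three items, and <code>'user'</code> and <code>'User'</code> hold different values.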
<h3 id=mixed-value-dictionaries>Mixed-Value Dictionaries</h3>
<p>Dictionaries aren&#8217;t just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don&#8217;t all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.
+6 -6
@@ -32,7 +32,7 @@ td pre{padding:0;border:0}
<h2 id=divingin>Diving in</h2>
<p class=f>Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. <a href=case-study-porting-chardet-to-python-3.html#running2to3>Case study: porting <code>chardet</code> to Python 3</a> describes how to run the <code>2to3</code> script, then shows some things it can&#8217;t fix automatically. This appendix documents what it <em>can</em> fix automatically.
<h2 id=print><code>print</code> statement</h2>
<p>In Python 2, <code><dfn>print</dfn></code> was a statement. Whatever you wanted to print simply followed the <code>print</code> keyword. In Python 3, <code>print()</code> is a function &mdash; whatever you want to print is passed to <code>print()</code> like any other function.
<p>In Python 2, <code><dfn>print</dfn></code> was a statement. Whatever you wanted to print simply followed the <code>print</code> keyword. In Python 3, <code>print()</code> is a function&nbsp;&mdash;&nbsp;whatever you want to print is passed to <code>print()</code> like any other function.
<table>
<tr><th>Notes
<th>Python 2
@@ -58,7 +58,7 @@ td pre{padding:0;border:0}
<li>To print a single value, call <code>print()</code> with one argument.
<li>To print two values separated by a space, call <code>print()</code> with two arguments.
<li>This one is a little tricky. In Python 2, if you ended a <code>print</code> statement with a comma, it would print the values separated by spaces, then print a trailing space, then stop without printing a newline. In Python 3, the way to do this is to pass <code>end=' '</code> as a keyword argument to the <code>print()</code> function. The <code>end</code> argument defaults to <code>'\n'</code> (a newline), so overriding it will suppress the newline after printing the other arguments.
<li>In Python 2, you could redirect the output to a pipe &mdash; like <code>sys.stderr</code> &mdash; by using the <code>>>pipe_name</code> syntax. In Python 3, the way to do this is to pass the pipe in the <code>file</code> keyword argument. The <code>file</code> argument defaults to <code>sys.stdout</code> (standard out), so overriding it will output to a different pipe instead.
<li>In Python 2, you could redirect the output to a pipe&nbsp;&mdash;&nbsp;like <code>sys.stderr</code>&nbsp;&mdash;&nbsp;by using the <code>>>pipe_name</code> syntax. In Python 3, the way to do this is to pass the pipe in the <code>file</code> keyword argument. The <code>file</code> argument defaults to <code>sys.stdout</code> (standard out), so overriding it will output to a different pipe instead.
</ol>
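<p>Here is a short Python 3 sketch of the <code>end</code> and <code>file</code> keyword arguments in action. It writes to an in-memory <code>io.StringIO</code> buffer (an assumption made so the result can be inspected) and then to <code>sys.stderr</code>.

```python
import io
import sys

buffer = io.StringIO()
print('a', 'b', end=' ', file=buffer)  # values separated by a space, no newline at the end
print('c', file=buffer)                # default end='\n'
output = buffer.getvalue()

print(repr(output))
print('something went wrong', file=sys.stderr)  # goes to standard error instead of standard out
```

<p>The buffer ends up holding <code>'a b c\n'</code>: the first call left the cursor on the same line, and the second call finished it.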
<h2 id=unicodeliteral>Unicode string literals</h2>
<p>Python 2 had two string types: <dfn>Unicode</dfn> strings and non-Unicode strings. Python 3 has one string type: Unicode strings.
@@ -159,7 +159,7 @@ td pre{padding:0;border:0}
<ol>
<li>The simplest form.
<li>The <code>in</code> operator takes precedence over the <code>or</code> operator, so there is no need for parentheses here.
<li>On the other hand, you <em>do</em> need parentheses here, for the same reason &mdash; <code>in</code> takes precedence over <code>or</code>.
<li>On the other hand, you <em>do</em> need parentheses here, for the same reason&nbsp;&mdash;&nbsp;<code>in</code> takes precedence over <code>or</code>.
<li>The <code>+</code> operator takes precedence over the <code>in</code> operator, so this form technically doesn&#8217;t need parentheses, but <code>2to3</code> includes them anyway.
<li>This form definitely needs parentheses, since the <code>+</code> operator takes precedence over the <code>in</code> operator.
</ol>
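<p>Operator precedence is worth verifying rather than memorizing. In Python, <code>+</code> binds more tightly than <code>in</code>, which in turn binds more tightly than <code>or</code>. The variables below are made up for the demonstration.

```python
a_dictionary = {'x': 1, 'kx': 2}
x, y = 'x', 'y'

either = x in a_dictionary or y in a_dictionary  # (x in d) or (y in d)
grouped = (x or y) in a_dictionary               # parentheses needed: 'x' in d
ungrouped = x or y in a_dictionary               # x or (y in d), which is just 'x'

joined = 'k' + x in a_dictionary                 # ('k' + x) in d, no parentheses needed
total = 1 + (y in a_dictionary)                  # parentheses needed: 1 + False

print(either, grouped, ungrouped, joined, total)
```

<p>Note that <var>ungrouped</var> is the string <code>'x'</code>, not a boolean, which is exactly why the parentheses matter in the third form.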
@@ -252,7 +252,7 @@ from urllib.error import HTTPError</code></pre>
</table>
<ol>
<li>The old <code>urllib</code> module in Python 2 had a variety of functions, including <code>urlopen()</code> for fetching data and <code>splittype()</code>, <code>splithost()</code>, and <code>splituser()</code> for splitting a <abbr>URL</abbr> into its constituent parts. These functions have been reorganized more logically within the new <code>urllib</code> package. <code>2to3</code> will also change all calls to these functions so they use the new naming scheme.
<li>The old <code>urllib2</code> module in Python 2 has been folded into the <code>urllib</code> package in Python 3. All your <code>urllib2</code> favorites &mdash; the <code>build_opener()</code> method, <code>Request</code> objects, and <code>HTTPBasicAuthHandler</code> and friends &mdash; are still available.
<li>The old <code>urllib2</code> module in Python 2 has been folded into the <code>urllib</code> package in Python 3. All your <code>urllib2</code> favorites&nbsp;&mdash;&nbsp;the <code>build_opener()</code> method, <code>Request</code> objects, and <code>HTTPBasicAuthHandler</code> and friends&nbsp;&mdash;&nbsp;are still available.
<li>The <code>urllib.parse</code> module in Python 3 contains all the parsing functions from the old <code>urlparse</code> module in Python 2.
<li>The <code>urllib.robotparser</code> module parses <a href=http://www.robotstxt.org/><code>robots.txt</code> files</a>.
<li>The <code>FancyURLopener</code> class, which handles <abbr>HTTP</abbr> redirects and other status codes, is still available in the new <code>urllib.request</code> module. The <code>urlencode()</code> function has moved to <code>urllib.parse</code>.
@@ -567,7 +567,7 @@ reduce(a, b, c)</code></pre>
<td><code class=pp>repr('PapayaWhip' + repr(2))</code>
</table>
<ol>
<li>Remember, <var>x</var> can be anything &mdash; a class, a function, a module, a primitive data type, etc. The <code>repr()</code> function works on everything.
<li>Remember, <var>x</var> can be anything&nbsp;&mdash;&nbsp;a class, a function, a module, a primitive data type, etc. The <code>repr()</code> function works on everything.
<li>In Python 2, backticks could be nested, leading to this sort of confusing (but valid) expression. The <code>2to3</code> tool is smart enough to convert this into nested calls to <code>repr()</code>.
</ol>
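<p>The nested form is easier to follow when you evaluate it from the inside out. A quick Python 3 check:

```python
inner = repr(2)                       # the string '2'
nested = repr('PapayaWhip' + inner)   # repr() of the string 'PapayaWhip2'

print(inner)
print(nested)
```

<p>The outer call wraps the concatenated string in quotes, so <var>nested</var> is the 13-character string <code>'PapayaWhip2'</code>, quotes included.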
<h2 id=except><code>try...except</code> statement</h2>
@@ -1037,7 +1037,7 @@ except:
<li>The <code>2to3</code> script is smart enough to construct a valid class declaration, even if the class is inherited from one or more base classes.
</ol>
<h2 id=nitpick>Matters of style</h2>
<p>The rest of the &#8220;fixes&#8221; listed here aren&#8217;t really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an <a href=http://www.python.org/dev/peps/pep-0008/>official Python style guide</a> which outlines &mdash; in excruciating detail &mdash; all sorts of nitpicky details that you almost certainly don&#8217;t care about. And given that <code>2to3</code> provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs.
<p>The rest of the &#8220;fixes&#8221; listed here aren&#8217;t really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an <a href=http://www.python.org/dev/peps/pep-0008/>official Python style guide</a> which outlines&nbsp;&mdash;&nbsp;in excruciating detail&nbsp;&mdash;&nbsp;all sorts of nitpicky details that you almost certainly don&#8217;t care about. And given that <code>2to3</code> provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs.
<h3 id=set_literal><code>set()</code> literals (explicit)</h3>
<p>In Python 2, the only way to define a literal set in your code was to call <code>set(a_sequence)</code>. This still works in Python 3, but a clearer way of doing it is to use the new set literal notation: curly braces. (Dictionaries are also defined with curly braces, which makes sense once you think about it, because dictionaries are just sets of key-value pairs.)
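<p>A quick Python 3 comparison of the two spellings. It also shows one wrinkle: empty curly braces still mean an empty dictionary, so an empty set has to be written as <code>set()</code>.

```python
from_call = set([1, 2, 3])
from_literal = {1, 2, 3}
empty_braces = {}
empty_set = set()

print(from_call == from_literal)
print(type(empty_braces).__name__, type(empty_set).__name__)
```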
<blockquote class=note>
+2 -2
@@ -301,7 +301,7 @@ Ran 12 tests in 0.203s
<p>Answer: there&#8217;s only 5000 of them; why don&#8217;t you just build a lookup table? This idea gets even better when you realize that <em>you don&#8217;t need to use regular expressions at all</em>. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers. By the time you need to check whether an arbitrary string is a valid Roman numeral, you will have collected all the valid Roman numerals. &#8220;Validating&#8221; is reduced to a single dictionary lookup.
<p>And best of all, you already have a complete set of unit tests. You can change over half the code in the module, but the unit tests will stay the same. That means you can prove &mdash; to yourself and to others &mdash; that the new code works just as well as the original.
<p>And best of all, you already have a complete set of unit tests. You can change over half the code in the module, but the unit tests will stay the same. That means you can prove&nbsp;&mdash;&nbsp;to yourself and to others&nbsp;&mdash;&nbsp;that the new code works just as well as the original.
<p class=d>[<a href=examples/roman10.py>download <code>roman10.py</code></a>]
<pre><code class=pp>class OutOfRangeError(ValueError): pass
@@ -392,7 +392,7 @@ def build_lookup_tables():
<a> to_roman_table.append(roman_numeral) <span class=u>&#x2462;</span></a>
from_roman_table[roman_numeral] = integer</code></pre>
<ol>
<li>This is a clever bit of programming&hellip; perhaps too clever. The <code>to_roman()</code> function is defined above; it looks up values in the lookup table and returns them. But the <code>build_lookup_tables()</code> function redefines the <code>to_roman()</code> function to actually do work (like the previous examples did, before you added a lookup table). Within the <code>build_lookup_tables()</code> function, calling <code>to_roman()</code> will call this redefined version. Once the <code>build_lookup_tables()</code> function exits, the redefined version disappears &mdash; it is only defined in the local scope of the <code>build_lookup_tables()</code> function.
<li>This is a clever bit of programming&hellip; perhaps too clever. The <code>to_roman()</code> function is defined above; it looks up values in the lookup table and returns them. But the <code>build_lookup_tables()</code> function redefines the <code>to_roman()</code> function to actually do work (like the previous examples did, before you added a lookup table). Within the <code>build_lookup_tables()</code> function, calling <code>to_roman()</code> will call this redefined version. Once the <code>build_lookup_tables()</code> function exits, the redefined version disappears&nbsp;&mdash;&nbsp;it is only defined in the local scope of the <code>build_lookup_tables()</code> function.
<li>This line of code will call the redefined <code>to_roman()</code> function, which actually calculates the Roman numeral.
<li>Once you have the result (from the redefined <code>to_roman()</code> function), you add the integer and its Roman numeral equivalent to both lookup tables.
</ol>
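<p>To see the shape of the whole idea in one place, here is a simplified sketch. It is not the code from <code>roman10.py</code>: the local <code>to_roman()</code> helper below uses a plain subtract-and-append loop, and the contents of <var>roman_numeral_map</var> are written out here as an assumption about what the module defines.

```python
roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

to_roman_table = [None]  # index 0 is unused, so that index == integer
from_roman_table = {}

def build_lookup_tables():
    def to_roman(n):
        # Local helper: only visible inside build_lookup_tables()
        result = ''
        for numeral, integer in roman_numeral_map:
            while n >= integer:
                result += numeral
                n -= integer
        return result

    for integer in range(1, 5000):
        roman_numeral = to_roman(integer)
        to_roman_table.append(roman_numeral)       # integer -> numeral
        from_roman_table[roman_numeral] = integer  # numeral -> integer

build_lookup_tables()

print(to_roman_table[1994])
print(from_roman_table['MMXXIV'])
print('IIII' in from_roman_table)  # validation is a single dictionary lookup
```

<p>Once the tables are built, converting in either direction is a lookup, and an invalid numeral is simply a key that isn&#8217;t there.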
+6 -6
@@ -31,7 +31,7 @@ td a:link, td a:visited{border:0}
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<p class=f>We&#8217;ve already covered a few special method names elsewhere in this book &mdash; &#8220;magic&#8221; methods that Python invokes when you use certain syntax. Using special methods, your classes can act like sequences, like dictionaries, like functions, like iterators, or even like numbers! This appendix serves both as a reference for the special methods we&#8217;ve seen already and a brief introduction to some of the more esoteric ones.
<p class=f>We&#8217;ve already covered a few special method names elsewhere in this book&nbsp;&mdash;&nbsp;&#8220;magic&#8221; methods that Python invokes when you use certain syntax. Using special methods, your classes can act like sequences, like dictionaries, like functions, like iterators, or even like numbers! This appendix serves both as a reference for the special methods we&#8217;ve seen already and a brief introduction to some of the more esoteric ones.
<h2 id=basics>Basics</h2>
@@ -207,7 +207,7 @@ AttributeError</samp></pre>
<h2 id=acts-like-function>Classes That Act Like Functions</h2>
<p>You can make an instance of a class callable &mdash; exactly like a function is callable &mdash; by defining the <code>__call__()</code> method.
<p>You can make an instance of a class callable&nbsp;&mdash;&nbsp;exactly like a function is callable&nbsp;&mdash;&nbsp;by defining the <code>__call__()</code> method.
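<p>A minimal sketch of a callable instance. The <code>Multiplier</code> class is invented for this example.

```python
class Multiplier:
    """Instances remember a factor and can be called like a function."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        # Invoked by the syntax instance(value)
        return value * self.factor

double = Multiplier(2)
result = double(21)

print(result)
print(callable(double))
```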
<table>
<tr><th>Notes
@@ -255,7 +255,7 @@ bytes = zef_file.read(12)
<h2 id=acts-like-list>Classes That Act Like Sequences</h2>
<p>If your class acts as a container for a set of values &mdash; that is, if it makes sense to ask whether your class &#8220;contains&#8221; a value &mdash; then it should probably define the following special methods that make it act like a sequence.
<p>If your class acts as a container for a set of values&nbsp;&mdash;&nbsp;that is, if it makes sense to ask whether your class &#8220;contains&#8221; a value&nbsp;&mdash;&nbsp;then it should probably define the following special methods that make it act like a sequence.
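<p>A minimal sketch of a container class. <code>Deck</code> and its contents are made up; only two of the sequence-related special methods are shown.

```python
class Deck:
    """A tiny container that supports len() and the in operator."""

    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):
        # Invoked by len(deck)
        return len(self._cards)

    def __contains__(self, card):
        # Invoked by: card in deck
        return card in self._cards

deck = Deck(['ace', 'king'])
print(len(deck))
print('ace' in deck, 'two' in deck)
```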
<table>
<tr><th>Notes
@@ -358,7 +358,7 @@ class FieldStorage:
<h2 id=acts-like-number>Classes That Act Like Numbers</h2>
<p>Using the appropriate special methods, you can define your own classes that act like numbers. That is, you can add them, subtract them, and perform other mathematical operations on them. This is how <a href=advanced-classes.html#implementing-fractions><dfn>fractions</dfn> are implemented</a> &mdash; the <code><dfn>Fraction</dfn></code> class implements these special methods, then you can do things like this:
<p>Using the appropriate special methods, you can define your own classes that act like numbers. That is, you can add them, subtract them, and perform other mathematical operations on them. This is how <a href=advanced-classes.html#implementing-fractions><dfn>fractions</dfn> are implemented</a>&nbsp;&mdash;&nbsp;the <code><dfn>Fraction</dfn></code> class implements these special methods, then you can do things like this:
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>from fractions import Fraction</kbd>
@@ -635,7 +635,7 @@ class FieldStorage:
<h2 id=rich-comparisons>Classes That Can Be Compared</h2>
<p>I broke this section out from the previous one because comparisons are not strictly the purview of numbers. Many datatypes can be compared &mdash; strings, lists, even dictionaries. If you&#8217;re creating your own class and it makes sense to compare your objects to other objects, you can use the following special methods to implement comparisons.
<p>I broke this section out from the previous one because comparisons are not strictly the purview of numbers. Many datatypes can be compared&nbsp;&mdash;&nbsp;strings, lists, even dictionaries. If you&#8217;re creating your own class and it makes sense to compare your objects to other objects, you can use the following special methods to implement comparisons.
<table>
<tr><th>Notes
@@ -755,7 +755,7 @@ def __exit__(self, *args) -> None:
<a> self.close() <span class=u>&#x2462;</span></a></code></pre>
<ol>
<li>The file object defines both an <code>__enter__()</code> and an <code>__exit__()</code> method. The <code>__enter__()</code> method checks that the file is open; if it&#8217;s not, the <code>_checkClosed()</code> method raises an exception.
<li>The <code>__enter__()</code> method should almost always return <var>self</var> &mdash; this is the object that the <code>with</code> block will use to dispatch properties and methods.
<li>The <code>__enter__()</code> method should almost always return <var>self</var>&nbsp;&mdash;&nbsp;this is the object that the <code>with</code> block will use to dispatch properties and methods.
<li>After the <code>with</code> block, the file object automatically closes. How? In the <code>__exit__()</code> method, it calls <code>self.close()</code>.
</ol>
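<p>The same protocol can be sketched with a small class of your own. Everything here is hypothetical (the <code>ManagedResource</code> name, its <code>closed</code> flag, and the error message are made up for illustration); it only mirrors the shape of the file object described above and is not the actual <code>io</code> implementation.

```python
class ManagedResource:
    """Hypothetical resource that only tracks whether it has been closed."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        # Like the file object, refuse to enter a block once closed.
        if self.closed:
            raise ValueError('operation on closed resource')
        return self  # this is the object bound by "with ... as name"

    def __exit__(self, *args):
        # Runs when the with block ends, even if the block raised.
        self.close()

    def close(self):
        self.closed = True


with ManagedResource() as resource:
    open_inside = not resource.closed   # still open inside the block

closed_after = resource.closed          # __exit__() has called close()
```

<p>Returning <var>self</var> from <code>__enter__()</code> is what lets the name after <code>as</code> refer to the same object that <code>__exit__()</code> later cleans up.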
+9 -9
@@ -21,11 +21,11 @@ My alphabet starts where your alphabet ends! <span class=u>&#x275E;</span><br>&m
</blockquote>
<p id=toc>&nbsp;
<h2 id=boring-stuff>Some Boring Stuff You Need To Understand Before You Can Dive In</h2>
<p class=f>Did you know that the people of <a href=http://en.wikipedia.org/wiki/Bougainville_Province>Bougainville</a> have the smallest alphabet in the world? Their <a href=http://en.wikipedia.org/wiki/Rotokas_alphabet>Rotokas alphabet</a> is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters &mdash; 52 if you count uppercase and lowercase separately &mdash; plus a handful of <i class=baa>!@#$%&</i> punctuation marks.
<p class=f>Did you know that the people of <a href=http://en.wikipedia.org/wiki/Bougainville_Province>Bougainville</a> have the smallest alphabet in the world? Their <a href=http://en.wikipedia.org/wiki/Rotokas_alphabet>Rotokas alphabet</a> is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters&nbsp;&mdash;&nbsp;52 if you count uppercase and lowercase separately&nbsp;&mdash;&nbsp;plus a handful of <i class=baa>!@#$%&</i> punctuation marks.
<p>When people talk about &#8220;text,&#8221; they&#8217;re thinking of &#8220;characters and symbols on the computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <i>character encoding</i>. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes &mdash; a file, a web page, whatever &mdash; and claims it&#8217;s &#8220;text,&#8221; you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you&#8217;re left with the unenviable task of cracking the code yourself. Chances are you&#8217;ll get it wrong, and the result will be gibberish.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes&nbsp;&mdash;&nbsp;a file, a web page, whatever&nbsp;&mdash;&nbsp;and claims it&#8217;s &#8220;text,&#8221; you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you&#8217;re left with the unenviable task of cracking the code yourself. Chances are you&#8217;ll get it wrong, and the result will be gibberish.
<aside>Everything you thought you knew about strings is wrong.</aside>
@@ -61,11 +61,11 @@ My alphabet starts where your alphabet ends! <span class=u>&#x275E;</span><br>&m
<p>Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0&ndash;65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used &#8220;astral plane&#8221; Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don&#8217;t). And you can still easily find the <var>Nth</var> character of a string in constant time, if you assume that the string doesn&#8217;t include any astral plane characters, which is a good assumption right up until the moment that it&#8217;s not.
<p>But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character <code>U+4E2D</code> could be stored in UTF-16 as either <code>4E 2D</code> or <code>2D 4E</code>, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you&#8217;re safe &mdash; different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you&#8217;re going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence <code>4E 2D</code> means <code>U+4E2D</code> or <code>U+2D4E</code>.
<p>But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character <code>U+4E2D</code> could be stored in UTF-16 as either <code>4E 2D</code> or <code>2D 4E</code>, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you&#8217;re safe&nbsp;&mdash;&nbsp;different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you&#8217;re going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence <code>4E 2D</code> means <code>U+4E2D</code> or <code>U+2D4E</code>.
<p>To solve <em>this</em> problem, the multi-byte Unicode encodings define a &#8220;Byte Order Mark,&#8221; which is a special non-printable character that you can include at the beginning of your document to indicate what order your bytes are in. For UTF-16, the Byte Order Mark is <code>U+FEFF</code>. If you receive a UTF-16 document that starts with the bytes <code>FF FE</code>, you know the byte ordering is one way; if it starts with <code>FE FF</code>, you know the byte ordering is reversed.
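<p>You can watch the two byte orders, and the Byte Order Mark, from Python itself. This is a small sketch that assumes the standard codec names <code>utf-16-be</code>, <code>utf-16-le</code>, and <code>utf-16</code>; the variable names are just for illustration.

```python
import codecs

character = '\u4e2d'  # the character U+4E2D

# The explicit -be / -le codecs write a fixed byte order and no mark.
big_endian = character.encode('utf-16-be')     # bytes 4E 2D
little_endian = character.encode('utf-16-le')  # bytes 2D 4E

# The Byte Order Mark U+FEFF, written in each byte order.
bom_big = '\ufeff'.encode('utf-16-be')         # bytes FE FF
bom_little = '\ufeff'.encode('utf-16-le')      # bytes FF FE

# The plain 'utf-16' decoder reads the mark, picks the matching byte
# order, and drops the mark from the decoded string.
decoded_big = (bom_big + big_endian).decode('utf-16')
decoded_little = (bom_little + little_endian).decode('utf-16')
```

<p>Both decoded strings come back as the same single character, even though the underlying bytes are in opposite orders.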
<p>Still, UTF-16 isn&#8217;t exactly ideal, especially if you&#8217;re dealing with a lot of <abbr>ASCII</abbr> characters. If you think about it, even a Chinese web page is going to contain a lot of <abbr>ASCII</abbr> characters &mdash; all the elements and attributes surrounding the printable Chinese characters. Being able to find the <var>Nth</var> character in constant time is nice, but there&#8217;s still the nagging problem of those astral plane characters, which mean that you can&#8217;t <em>guarantee</em> that every character is exactly two bytes, so you can&#8217;t <em>really</em> find the <var>Nth</var> character in constant time unless you maintain a separate index. And boy, there sure is a lot of <abbr>ASCII</abbr> text in the world&hellip;
<p>Still, UTF-16 isn&#8217;t exactly ideal, especially if you&#8217;re dealing with a lot of <abbr>ASCII</abbr> characters. If you think about it, even a Chinese web page is going to contain a lot of <abbr>ASCII</abbr> characters&nbsp;&mdash;&nbsp;all the elements and attributes surrounding the printable Chinese characters. Being able to find the <var>Nth</var> character in constant time is nice, but there&#8217;s still the nagging problem of those astral plane characters, which mean that you can&#8217;t <em>guarantee</em> that every character is exactly two bytes, so you can&#8217;t <em>really</em> find the <var>Nth</var> character in constant time unless you maintain a separate index. And boy, there sure is a lot of <abbr>ASCII</abbr> text in the world&hellip;
<p>Other people pondered these questions, and they came up with a solution:
@@ -73,7 +73,7 @@ My alphabet starts where your alphabet ends! <span class=u>&#x275E;</span><br>&m
<p>UTF-8 is a <em>variable-length</em> encoding system for Unicode. That is, different characters take up a different number of bytes. For <abbr>ASCII</abbr> characters (A-Z, <i class=baa>&amp;</i>c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0&ndash;127) in UTF-8 are indistinguishable from <abbr>ASCII</abbr>. &#8220;Extended Latin&#8221; characters like &ntilde; and &ouml; end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like &#x4E2D; end up taking three bytes. The rarely-used &#8220;astral plane&#8221; characters take four bytes.
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation &mdash; that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation&nbsp;&mdash;&nbsp;that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you&#8217;ll have to trust me on this, because I&#8217;m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
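<p>The size trade-offs are easy to check by encoding one character of each kind and counting the bytes. The four sample characters below are arbitrary picks (an <abbr>ASCII</abbr> letter, an extended Latin letter, a Chinese character, and one astral plane character), chosen only to illustrate the point.

```python
samples = ['A', '\u00f1', '\u4e2d', '\U0001f600']

# Bytes per character in each encoding (the -le codecs add no Byte Order Mark).
utf8_sizes = [len(ch.encode('utf-8')) for ch in samples]
utf16_sizes = [len(ch.encode('utf-16-le')) for ch in samples]
utf32_sizes = [len(ch.encode('utf-32-le')) for ch in samples]

# For ASCII characters, UTF-8 produces the very same bytes as ASCII.
same_as_ascii = 'A'.encode('utf-8') == 'A'.encode('ascii')
```

<p>The UTF-8 sizes grow from one byte to four, the UTF-16 sizes are two bytes except for the astral plane character, and UTF-32 is a flat four bytes throughout.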
@@ -164,7 +164,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
</pre>
<ol>
<li>Rather than calling any function in the <code>humansize</code> module, you&#8217;re just grabbing one of the data structures it defines: the list of &#8220;SI&#8221; (powers-of-1000) suffixes.
<li>This looks complicated, but it&#8217;s not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list which is the first argument passed to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces &mdash; including <code>1000</code>, the equals sign, and the spaces &mdash; is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
<li>This looks complicated, but it&#8217;s not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list which is the first argument passed to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces&nbsp;&mdash;&nbsp;including <code>1000</code>, the equals sign, and the spaces&nbsp;&mdash;&nbsp;is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
</ol>
<aside>{0} is replaced by the 1<sup>st</sup> format() argument. {1} is replaced by the 2<sup>nd</sup>.</aside>
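<p>Here is the same replacement-field logic as a standalone sketch. The <var>si_suffixes</var> list is typed in by hand as a stand-in for the one defined in the <code>humansize</code> module, so that the snippet runs on its own.

```python
# Stand-in for the list of SI suffixes from the humansize module.
si_suffixes = ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']

# {0} is the first argument (the whole list); {0[0]} and {0[1]} index into it.
result = '1000{0[0]} = 1{0[1]}'.format(si_suffixes)

# With two arguments, {0} and {1} pick them out by position.
pair = '{0} {1}'.format('first', 'second')
```

<p>Only the text inside the curly braces is interpreted; the rest of the format string passes through unchanged.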
@@ -340,7 +340,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
<li>By an amazing coincidence, this line of code says &#8220;count the occurrences of the string that you would get after decoding this sequence of bytes in this particular character encoding.&#8221;
</ol>
<p>And here is the link between strings and bytes: <code>bytes</code> objects have a <code><dfn>decode</dfn>()</code> method that takes a character encoding and returns a string, and strings have an <code><dfn>encode</dfn>()</code> method that takes a character encoding and returns a <code>bytes</code> object. In the previous example, the decoding was relatively straightforward &mdash; converting a sequence of bytes in the <abbr>ASCII</abbr> encoding into a string of characters. But the same process works with any encoding that supports the characters of the string &mdash; even legacy (non-Unicode) encodings.
<p>And here is the link between strings and bytes: <code>bytes</code> objects have a <code><dfn>decode</dfn>()</code> method that takes a character encoding and returns a string, and strings have an <code><dfn>encode</dfn>()</code> method that takes a character encoding and returns a <code>bytes</code> object. In the previous example, the decoding was relatively straightforward&nbsp;&mdash;&nbsp;converting a sequence of bytes in the <abbr>ASCII</abbr> encoding into a string of characters. But the same process works with any encoding that supports the characters of the string&nbsp;&mdash;&nbsp;even legacy (non-Unicode) encodings.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>a_string = '深入 Python'</kbd> <span class=u>&#x2460;</span></a>
@@ -378,7 +378,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
<h2 id=py-encoding>Postscript: Character Encoding Of Python Source Code</h2>
<p>Python 3 assumes that your source code &mdash; <i>i.e.</i> each <code>.py</code> file &mdash; is encoded in UTF-8.
<p>Python 3 assumes that your source code&nbsp;&mdash;&nbsp;<i>i.e.</i> each <code>.py</code> file&nbsp;&mdash;&nbsp;is encoded in UTF-8.
<blockquote class='note compare python2'>
<p><span class=u>&#x261E;</span>In Python 2, the <dfn>default</dfn> encoding for <code>.py</code> files was <abbr>ASCII</abbr>. In Python 3, <a href=http://www.python.org/dev/peps/pep-3120/>the default encoding is UTF-8</a>.
@@ -425,7 +425,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
<p>On strings and string formatting:
<ul>
<li><a href=http://docs.python.org/3.0/library/string.html><code>string</code> &mdash; Common string operations</a>
<li><a href=http://docs.python.org/3.0/library/string.html><code>string</code>&nbsp;&mdash;&nbsp;Common string operations</a>
<li><a href=http://docs.python.org/3.0/library/string.html#formatstrings>Format String Syntax</a>
<li><a href=http://docs.python.org/3.0/library/string.html#format-specification-mini-language>Format Specification Mini-Language</a>
<li><a href=http://www.python.org/dev/peps/pep-3101/><abbr>PEP</abbr> 3101: Advanced String Formatting</a>
+6 -6
@@ -32,7 +32,7 @@ body{counter-reset:h1 8}
</ol>
<p>Let&#8217;s start mapping out what a <code>roman.py</code> module should do. It will have two main functions, <code>to_roman()</code> and <code>from_roman()</code>. The <code>to_roman()</code> function should take an integer from <code>1</code> to <code>3999</code> and return the Roman numeral representation as a string&hellip;
<p>Stop right there. Now let&#8217;s do something a little unexpected: write a test case that checks whether the <code>to_roman()</code> function does what you want it to. You read that right: you&#8217;re going to write code that tests code that you haven&#8217;t written yet.
<p>This is called <i>unit testing</i>. The set of two conversion functions &mdash; <code>to_roman()</code>, and later <code>from_roman()</code> &mdash; can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named <code>unittest</code> module.
<p>This is called <i>unit testing</i>. The set of two conversion functions&nbsp;&mdash;&nbsp;<code>to_roman()</code>, and later <code>from_roman()</code>&nbsp;&mdash;&nbsp;can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named <code>unittest</code> module.
<p>Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:
<ul>
<li>Before writing code, it forces you to detail your requirements in a useful fashion.
@@ -134,7 +134,7 @@ if __name__ == '__main__':
<li>Assuming the <code>to_roman()</code> function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the <em>right</em> value. This is a common question, and the <code>TestCase</code> class provides a method, <code>assertEqual</code>, to check whether two values are equal. If the result returned from <code>to_roman()</code> (<var>result</var>) does not match the known value you were expecting (<var>numeral</var>), <code>assertEqual</code> will raise an exception and the test will fail. If the two values are equal, <code>assertEqual</code> will do nothing. If every value returned from <code>to_roman()</code> matches the known value you expect, <code>assertEqual</code> never raises an exception, so <code>testToRomanKnownValues</code> eventually exits normally, which means <code>to_roman()</code> has passed this test.
</ol>
<aside>Write a test that fails, then code until it passes.</aside>
<p>Once you have a test case, you can start coding the <code>to_roman()</code> function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you&#8217;ve written any code, you&#8217;re doing it wrong &mdash; your tests aren&#8217;t testing your code at all! Write a test that fails, then code until it passes.
<p>Once you have a test case, you can start coding the <code>to_roman()</code> function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you&#8217;ve written any code, you&#8217;re doing it wrong&nbsp;&mdash;&nbsp;your tests aren&#8217;t testing your code at all! Write a test that fails, then code until it passes.
<pre><code class=pp># roman1.py
def to_roman(n):
@@ -237,7 +237,7 @@ OK</samp></pre>
<a><samp class=p>>>> </samp><kbd class=pp>roman1.to_roman(9000)</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>'MMMMMMMMM'</samp></pre>
<ol>
<li>That&#8217;s definitely not what you wanted &mdash; that&#8217;s not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is <em>baaaaaaad</em>; if a program is going to fail, it is far better that it fail quickly and noisily. &#8220;Halt and catch fire,&#8221; as the saying goes. The Pythonic way to halt and catch fire is to raise an exception.
<li>That&#8217;s definitely not what you wanted&nbsp;&mdash;&nbsp;that&#8217;s not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is <em>baaaaaaad</em>; if a program is going to fail, it is far better that it fail quickly and noisily. &#8220;Halt and catch fire,&#8221; as the saying goes. The Pythonic way to halt and catch fire is to raise an exception.
</ol>
<p>The question to ask yourself is, &#8220;How can I express this as a testable requirement?&#8221; How&#8217;s this for starters:
<blockquote>
@@ -275,14 +275,14 @@ Ran 2 tests in 0.000s
FAILED (errors=1)</samp></pre>
<ol>
<li>You should have expected this to fail (since you haven&#8217;t written any code to pass it yet), but... it didn&#8217;t actually &#8220;fail,&#8221; it had an &#8220;error&#8221; instead. This is a subtle but important distinction. A unit test actually has <em>three</em> return values: pass, fail, and error. Pass, of course, means that the test passed &mdash; the code did what you expected. &#8220;Fail&#8221; is what the previous test case did (until you wrote code to make it pass) &mdash; it executed the code but the result was not what you expected. &#8220;Error&#8221; means that the code didn&#8217;t even execute properly.
<li>You should have expected this to fail (since you haven&#8217;t written any code to pass it yet), but... it didn&#8217;t actually &#8220;fail,&#8221; it had an &#8220;error&#8221; instead. This is a subtle but important distinction. A unit test actually has <em>three</em> return values: pass, fail, and error. Pass, of course, means that the test passed&nbsp;&mdash;&nbsp;the code did what you expected. &#8220;Fail&#8221; is what the previous test case did (until you wrote code to make it pass)&nbsp;&mdash;&nbsp;it executed the code but the result was not what you expected. &#8220;Error&#8221; means that the code didn&#8217;t even execute properly.
<li>Why didn&#8217;t the code execute properly? The traceback gives the answer: the module you&#8217;re testing doesn&#8217;t have an exception called <code>OutOfRangeError</code>. Remember, you passed this exception to the <code>assertRaises()</code> method, because it&#8217;s the exception you want the function to raise given an out-of-range input. But the exception doesn&#8217;t exist, so the call to the <code>assertRaises()</code> method failed. It never got a chance to test the <code>to_roman()</code> function; it didn&#8217;t get that far.
</ol>
<p>To solve this problem, you need to define the <code>OutOfRangeError</code> exception in <code>roman2.py</code>.
<pre><code class=pp><a>class OutOfRangeError(ValueError): <span class=u>&#x2460;</span></a>
<a> pass <span class=u>&#x2461;</span></a></code></pre>
<ol>
<li>Exceptions are classes. An &#8220;out of range&#8221; error is a kind of value error &mdash; the argument value is out of its acceptable range. So this exception inherits from the built-in <code>ValueError</code> exception. This is not strictly necessary (it could just inherit from the base <code>Exception</code> class), but it feels right.
<li>Exceptions are classes. An &#8220;out of range&#8221; error is a kind of value error&nbsp;&mdash;&nbsp;the argument value is out of its acceptable range. So this exception inherits from the built-in <code>ValueError</code> exception. This is not strictly necessary (it could just inherit from the base <code>Exception</code> class), but it feels right.
<li>Exceptions don&#8217;t actually do anything, but you need at least one line of code to make a class. The <code>pass</code> statement does precisely nothing, but it&#8217;s a line of Python code, so that makes it a class.
</ol>
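<p>One practical consequence of inheriting from <code>ValueError</code> is that code which already catches <code>ValueError</code> will catch the new exception too. A minimal sketch, with a made-up error message:

```python
class OutOfRangeError(ValueError):
    pass


try:
    raise OutOfRangeError('number out of range')
except ValueError as error:
    # An OutOfRangeError is a ValueError, so this handler catches it.
    caught = error

is_subclass = issubclass(OutOfRangeError, ValueError)
```

<p>Had the class inherited directly from <code>Exception</code> instead, the <code>except ValueError</code> clause would not have caught it.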
<p>Now run the test suite again.
@@ -305,7 +305,7 @@ Ran 2 tests in 0.016s
FAILED (failures=1)</samp></pre>
<ol>
<li>The new test is still not passing, but it&#8217;s not returning an error either. Instead, the test is failing. That&#8217;s progress! It means the call to the <code>assertRaises()</code> method succeeded this time, and the unit test framework actually tested the <code>to_roman()</code> function.
<li>Of course, the <code>to_roman()</code> function isn&#8217;t raising the <code>OutOfRangeError</code> exception you just defined, because you haven&#8217;t told it to do that yet. That&#8217;s excellent news! It means this is a valid test case &mdash; it fails before you write the code to make it pass.
<li>Of course, the <code>to_roman()</code> function isn&#8217;t raising the <code>OutOfRangeError</code> exception you just defined, because you haven&#8217;t told it to do that yet. That&#8217;s excellent news! It means this is a valid test case&nbsp;&mdash;&nbsp;it fails before you write the code to make it pass.
</ol>
<p>Now you can write the code to make this test pass.
<p class=d>[<a href=examples/roman2.py>download <code>roman2.py</code></a>]
+7 -7
@@ -23,7 +23,7 @@ mark{display:inline}
<h2 id=divingin>Diving In</h2>
<p class=f>Most of the chapters in this book have centered around a piece of sample code. But <abbr>XML</abbr> isn&#8217;t about code; it&#8217;s about data. One common use of <abbr>XML</abbr> is &#8220;syndication feeds&#8221; that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by &#8220;subscribing&#8221; to its feed, and you can follow multiple blogs with a dedicated &#8220;<a href=http://en.wikipedia.org/wiki/List_of_feed_aggregators>feed aggregator</a>&#8221; like <a href=http://www.google.com/reader/>Google Reader</a>.
<p>Here, then, is the <abbr>XML</abbr> data we&#8217;ll be working with in this chapter. It&#8217;s a feed &mdash; specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
<p>Here, then, is the <abbr>XML</abbr> data we&#8217;ll be working with in this chapter. It&#8217;s a feed&nbsp;&mdash;&nbsp;specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
<pre><code class=pp>&lt;?xml version='1.0' encoding='utf-8'?>
@@ -320,9 +320,9 @@ mark{display:inline}
<samp class=pp>{}</samp></pre>
<ol>
<li>The <code>attrib</code> property is a dictionary of the element&#8217;s attributes. The original markup here was <code>&lt;feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'></code>. The <code>xml:</code> prefix refers to a built-in namespace that every <abbr>XML</abbr> document can use without declaring it.
<li>The fifth child &mdash; <code>[4]</code> in a <code>0</code>-based list &mdash; is the <code>link</code> element.
<li>The fifth child&nbsp;&mdash;&nbsp;<code>[4]</code> in a <code>0</code>-based list&nbsp;&mdash;&nbsp;is the <code>link</code> element.
<li>The <code>link</code> element has three attributes: <code>href</code>, <code>type</code>, and <code>rel</code>.
<li>The fourth child &mdash; <code>[3]</code> in a <code>0</code>-based list &mdash; is the <code>updated</code> element.
<li>The fourth child&nbsp;&mdash;&nbsp;<code>[3]</code> in a <code>0</code>-based list&nbsp;&mdash;&nbsp;is the <code>updated</code> element.
<li>The <code>updated</code> element has no attributes, so its <code>.attrib</code> is just an empty dictionary.
</ol>
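<p>To try this without the full feed, here is a sketch against a cut-down document. The three-child feed below is invented for illustration (so the child positions differ from the real <code>feed.xml</code>), and it uses the standard library <code>xml.etree.ElementTree</code> module.

```python
import xml.etree.ElementTree as etree

# A tiny, made-up Atom-like document; not the chapter's feed.xml.
document = """<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>example feed</title>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://example.org/'/>
</feed>"""

root = etree.fromstring(document)

# xmlns is a namespace declaration, not an attribute, so only xml:lang
# shows up here, with the built-in xml namespace spelled out in full.
root_attributes = root.attrib

updated_attributes = root[1].attrib   # second child: no attributes
link_attributes = root[2].attrib      # third child: href, type, and rel
```

<p>An element with no attributes still has an <code>attrib</code> dictionary; it is simply empty.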
@@ -348,7 +348,7 @@ mark{display:inline}
<samp class=pp>[]</samp></pre>
<ol>
<li>The <code>findall()</code> method finds child elements that match a specific query. (More on the query format in a minute.)
<li>Each element &mdash; including the root element, but also child elements &mdash; has a <code>findall()</code> method. It finds all matching elements among the element&#8217;s children. But why aren&#8217;t there any results? Although it may not be obvious, this particular query only searches the element&#8217;s children. Since the root <code>feed</code> element has no child named <code>feed</code>, this query returns an empty list.
<li>Each element&nbsp;&mdash;&nbsp;including the root element, but also child elements&nbsp;&mdash;&nbsp;has a <code>findall()</code> method. It finds all matching elements among the element&#8217;s children. But why aren&#8217;t there any results? Although it may not be obvious, this particular query only searches the element&#8217;s children. Since the root <code>feed</code> element has no child named <code>feed</code>, this query returns an empty list.
<li>This result may also surprise you. <a href=#divingin>There is an <code>author</code> element</a> in this document; in fact, there are three (one in each <code>entry</code>). But those <code>author</code> elements are not <em>direct children</em> of the root element; they are &#8220;grandchildren&#8221; (literally, a child element of a child element). If you want to look for <code>author</code> elements at any nesting level, you can do that, but the query format is slightly different.
</ol>
@@ -391,7 +391,7 @@ mark{display:inline}
'type': 'text/html',
'rel': 'alternate'}</samp></pre>
<ol>
<li>This query &mdash; <code>//{http://www.w3.org/2005/Atom}link</code> &mdash; is very similar to the previous examples, except for the two slashes at the beginning of the query. Those two slashes mean &#8220;don&#8217;t just look for direct children; I want <em>any</em> elements, regardless of nesting level.&#8221; So the result is a list of four <code>link</code> elements, not just one.
<li>This query&nbsp;&mdash;&nbsp;<code>//{http://www.w3.org/2005/Atom}link</code>&nbsp;&mdash;&nbsp;is very similar to the previous examples, except for the two slashes at the beginning of the query. Those two slashes mean &#8220;don&#8217;t just look for direct children; I want <em>any</em> elements, regardless of nesting level.&#8221; So the result is a list of four <code>link</code> elements, not just one.
<li>The first result <em>is</em> a direct child of the root element. As you can see from its attributes, this is the feed-level alternate link that points to the <abbr>HTML</abbr> version of the website that the feed describes.
<li>The other three results are each entry-level alternate links. Each <code>entry</code> has a single <code>link</code> child element, and because of the double slash at the beginning of the query, this query finds all of them.
</ol>
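<p>As a rough illustration of the difference between a direct-child query and an any-level query, here is a sketch against a made-up feed. Two things are assumptions: the <code>SAMPLE</code> document is hypothetical, and the query is written in the relative form <code>.//</code> and run on the root element, rather than in the <code>//</code> form shown above.

```python
import xml.etree.ElementTree as ET

# Assumption: a hypothetical feed with one feed-level link and three entries.
ATOM = '{http://www.w3.org/2005/Atom}'
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' type='text/html' href='http://example.org/'/>
  <entry><link rel='alternate' type='text/html' href='http://example.org/1'/></entry>
  <entry><link rel='alternate' type='text/html' href='http://example.org/2'/></entry>
  <entry><link rel='alternate' type='text/html' href='http://example.org/3'/></entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# Direct children only: just the feed-level link.
direct = root.findall(ATOM + 'link')

# Any nesting level: the feed-level link plus one link per entry.
anywhere = root.findall('.//' + ATOM + 'link')

# Results come back in document order, so the feed-level link is first.
hrefs = [link.attrib['href'] for link in anywhere]

print(len(direct), len(anywhere))
print(hrefs)
```

<p>Each result is an ordinary element, so its attributes are available through the same <code>attrib</code> dictionary as any other element.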
@@ -509,7 +509,7 @@ except ImportError:
<li>At any time, you can serialize any element (and its children) with the ElementTree <code>tostring()</code> function.
</ol>
<p>Was that serialization surprising to you? The way ElementTree serializes namespaced <abbr>XML</abbr> elements is technically accurate but not optimal. The sample <abbr>XML</abbr> document at the beginning of this chapter defined a <i>default namespace</i> (<code>xmlns='http://www.w3.org/2005/Atom'</code>). Defining a default namespace is useful for documents &mdash; like Atom feeds &mdash; where every element is in the same namespace, because you can declare the namespace once and declare each element with just its local name (<code>&lt;feed></code>, <code>&lt;link></code>, <code>&lt;entry></code>). There is no need to use any prefixes unless you want to declare elements from another namespace.
<p>Was that serialization surprising to you? The way ElementTree serializes namespaced <abbr>XML</abbr> elements is technically accurate but not optimal. The sample <abbr>XML</abbr> document at the beginning of this chapter defined a <i>default namespace</i> (<code>xmlns='http://www.w3.org/2005/Atom'</code>). Defining a default namespace is useful for documents&nbsp;&mdash;&nbsp;like Atom feeds&nbsp;&mdash;&nbsp;where every element is in the same namespace, because you can declare the namespace once and declare each element with just its local name (<code>&lt;feed></code>, <code>&lt;link></code>, <code>&lt;entry></code>). There is no need to use any prefixes unless you want to declare elements from another namespace.
<p>An <abbr>XML</abbr> parser won&#8217;t &#8220;see&#8221; any difference between an <abbr>XML</abbr> document with a default namespace and an <abbr>XML</abbr> document with a prefixed namespace. The resulting <abbr>DOM</abbr> of this serialization:
@@ -566,7 +566,7 @@ except ImportError:
<h2 id=xml-custom-parser>Parsing Broken XML</h2>
<p>The <abbr>XML</abbr> specification mandates that all conforming <abbr>XML</abbr> parsers employ &#8220;draconian error handling.&#8221; That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the <abbr>XML</abbr> document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like <abbr>HTML</abbr> &mdash; your browser doesn&#8217;t stop rendering a web page if you forget to close an <abbr>HTML</abbr> tag or escape an ampersand in an attribute value. (It is a common misconception that <abbr>HTML</abbr> has no defined error handling. <a href=http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#parsing><abbr>HTML</abbr> error handling</a> is actually quite well-defined, but it&#8217;s significantly more complicated than &#8220;halt and catch fire on first error.&#8221;)
<p>The <abbr>XML</abbr> specification mandates that all conforming <abbr>XML</abbr> parsers employ &#8220;draconian error handling.&#8221; That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the <abbr>XML</abbr> document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like <abbr>HTML</abbr>&nbsp;&mdash;&nbsp;your browser doesn&#8217;t stop rendering a web page if you forget to close an <abbr>HTML</abbr> tag or escape an ampersand in an attribute value. (It is a common misconception that <abbr>HTML</abbr> has no defined error handling. <a href=http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#parsing><abbr>HTML</abbr> error handling</a> is actually quite well-defined, but it&#8217;s significantly more complicated than &#8220;halt and catch fire on first error.&#8221;)
<p>Some people (myself included) believe that it was a mistake for the inventors of <abbr>XML</abbr> to mandate draconian error handling. Don&#8217;t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of &#8220;wellformedness&#8221; is trickier than it sounds, especially for <code>XML</code> documents (like Atom feeds) that are published on the web and served over <abbr>HTTP</abbr>. Despite the maturity of <abbr>XML</abbr>, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.
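<p>To see draconian error handling in action, here is a small sketch using the standard library parser. The three sample strings and the <code>results</code> dictionary are invented for illustration; the point is only that the parser raises <code>ParseError</code> at the first wellformedness error instead of guessing what you meant.

```python
import xml.etree.ElementTree as ET

# Assumption: three hypothetical snippets, two of them deliberately broken.
samples = {
    'mismatched tags': '<feed><title>dive into mark</feed>',
    'undefined entity': '<feed>&nbsp;</feed>',
    'well-formed': '<feed><title>dive into mark</title></feed>',
}

results = {}
for label, text in samples.items():
    try:
        ET.fromstring(text)
        results[label] = 'parsed'
    except ET.ParseError as err:
        # The parser gives up at the first error and reports where it was.
        results[label] = 'rejected'
        print(label, '->', err)

print(results)
```

<p>Note that there is no partial result to inspect for the broken snippets: the call either returns a complete element or raises.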
+2 -2
@@ -66,7 +66,7 @@ if __name__ == '__main__':
<p>What just happened? You executed your first Python program. You called the Python interpreter on the command line, and you passed the name of the script you wanted Python to execute. The script defines a single function, the <code>approximate_size()</code> function, which takes an exact file size in bytes and calculates a &#8220;pretty&#8221; (but approximate) size. (You&#8217;ve probably seen this in Windows Explorer, or the Mac OS X Finder, or Nautilus or Dolphin or Thunar on Linux. If you display a folder of documents as a multi-column list, it will display a table with the document icon, the document name, the size, type, last-modified date, and so on. If the folder contains a 1093-byte file named <code>TODO</code>, your file manager won&#8217;t display <code>TODO 1093 bytes</code>; it&#8217;ll say something like <code>TODO 1 KB</code> instead. That&#8217;s what the <code>approximate_size()</code> function does.)
<p>Look at the bottom of the script, and you&#8217;ll see two calls to <code>print(approximate_size(<var>arguments</var>))</code>. These are function calls &mdash; first calling the <code>approximate_size()</code> function and passing a number of arguments, then taking the return value and passing it straight on to the <code>print()</code> function. The <code>print()</code> function is built-in; you&#8217;ll never see an explicit declaration of it. You can just use it, anytime, anywhere. (There are lots of built-in functions, and lots more functions that are separated into <i>modules</i>. Patience, grasshopper.)
<p>Look at the bottom of the script, and you&#8217;ll see two calls to <code>print(approximate_size(<var>arguments</var>))</code>. These are function calls&nbsp;&mdash;&nbsp;first calling the <code>approximate_size()</code> function and passing a number of arguments, then taking the return value and passing it straight on to the <code>print()</code> function. The <code>print()</code> function is built-in; you&#8217;ll never see an explicit declaration of it. You can just use it, anytime, anywhere. (There are lots of built-in functions, and lots more functions that are separated into <i>modules</i>. Patience, grasshopper.)
<p>So why does running the script on the command line give you the same output every time? We&#8217;ll get to that. First, let&#8217;s look at that <code>approximate_size()</code> function.
@@ -81,7 +81,7 @@ if __name__ == '__main__':
<blockquote class=note>
<p><span class=u>&#x261E;</span>In some languages, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it&#8217;s <code>None</code>), and all functions start with <code>def</code>.
</blockquote>
<p>The <code>approximate_size()</code> function takes two arguments &mdash; <var>size</var> and <var>a_kilobyte_is_1024_bytes</var> &mdash; but neither argument specifies a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
<p>The <code>approximate_size()</code> function takes two arguments&nbsp;&mdash;&nbsp;<var>size</var> and <var>a_kilobyte_is_1024_bytes</var>&nbsp;&mdash;&nbsp;but neither argument specifies a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
<blockquote class='note compare java'>
<p><span class=u>&#x261E;</span>In Java and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
</blockquote>
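<p>Here is a short sketch of both points: functions declare no datatypes, and every function returns a value even when it has no <code>return</code> statement. The functions <code>double()</code> and <code>announce()</code> are hypothetical, made up for this illustration; they are not part of the script discussed above.

```python
# Assumption: two throwaway functions, invented for illustration.
def double(value):
    # No datatype is declared for value or for the return value.
    return value * 2

def announce(message):
    # No return statement, so the function returns None.
    print(message)

number = double(21)          # called with an integer
text = double('ab')          # same function, called with a string
nothing = announce('done')   # the return value is still a value: None

# Python tracks the type of each value internally.
print(type(number), type(text), nothing)
```

<p>The same <code>double()</code> function accepts an integer or a string because nothing in its declaration restricts the argument; the type belongs to the value that is passed in.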