mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 15:00:18 +00:00
IE fixes
@@ -45,7 +45,7 @@ E = 4</code></pre>
|
||||
<p>In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles <em>in just 14 lines of code</em>.
|
||||
|
||||
<p class=d>[<a href=examples/alphametics.py>download <code>alphametics.py</code></a>]
|
||||
<pre><code class=pp>import re
|
||||
<pre class=pp><code>import re
|
||||
import itertools
|
||||
|
||||
def solve(puzzle):
|
||||
@@ -150,7 +150,7 @@ if __name__ == '__main__':
|
||||
|
||||
<p>The alphametics solver uses this technique to get a list of all the unique characters in the puzzle.
|
||||
|
||||
<pre class=nd><code class=pp>unique_characters = set(''.join(words))</code></pre>
|
||||
<pre class='nd pp'><code>unique_characters = set(''.join(words))</code></pre>
|
||||
|
||||
<p>This list is later used to assign digits to characters as the solver iterates through the possible solutions.
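<p>As a minimal sketch of the same idea: joining the words into one string and passing it to <code>set()</code> leaves exactly one entry per distinct letter. The puzzle string and the way <var>words</var> is extracted here are illustrative assumptions, not necessarily the solver's exact code.

```python
import re

# Illustrative puzzle; any alphametic would do.
puzzle = 'SEND + MORE == MONEY'

# Assumption: the words are the runs of capital letters in the puzzle.
words = re.findall('[A-Z]+', puzzle)

# Joining the words gives one long string; set() keeps each letter once.
unique_characters = set(''.join(words))

print(sorted(unique_characters))
```

<p>Strictly speaking the result is a set rather than a list, so it has no defined order; sort it if you need a stable sequence.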
|
||||
|
||||
@@ -178,11 +178,11 @@ AssertionError: Only for very large values of 2</samp></pre>
|
||||
|
||||
<p>Therefore, this line of code:
|
||||
|
||||
<pre class=nd><code class=pp>assert len(unique_characters) <= 10, 'Too many letters'</code></pre>
|
||||
<pre class='nd pp'><code>assert len(unique_characters) <= 10, 'Too many letters'</code></pre>
|
||||
|
||||
<p>…is equivalent to this:
|
||||
|
||||
<pre class=nd><code class=pp>if len(unique_characters) > 10:
|
||||
<pre class='nd pp'><code>if len(unique_characters) > 10:
|
||||
raise AssertionError('Too many letters')</code></pre>
|
||||
|
||||
<p>The alphametics solver uses this exact <code>assert</code> statement to bail out early if the puzzle contains more than ten unique letters. Since each letter is assigned a unique digit, and there are only ten digits, a puzzle with more than ten unique letters cannot possibly have a solution.
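<p>A small sketch of that early bail-out in isolation. The helper name <code>check_letter_count</code> is hypothetical and exists only for this illustration.

```python
def check_letter_count(unique_characters):
    # Hypothetical helper, for illustration only.
    assert len(unique_characters) <= 10, 'Too many letters'
    return len(unique_characters)

# Eight distinct letters: the assertion passes.
print(check_letter_count(set('SENDMORY')))

# Eleven distinct letters: the assertion fails with the given message.
message = None
try:
    check_letter_count(set('ABCDEFGHIJK'))
except AssertionError as e:
    message = str(e)
    print('AssertionError:', message)
```

<p>Keep in mind that Python skips <code>assert</code> statements entirely when run with the <code>-O</code> flag, so an assertion suits sanity checks like this one better than validation you always need to happen.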
|
||||
@@ -217,7 +217,7 @@ AssertionError: Only for very large values of 2</samp></pre>
|
||||
|
||||
<p>Here’s another way to accomplish the same thing, using a <a href=generators.html>generator function</a>:
|
||||
|
||||
<pre class=nd><code class=pp>def ord_map(a_string):
|
||||
<pre class='nd pp'><code>def ord_map(a_string):
|
||||
for c in a_string:
|
||||
yield ord(c)
|
||||
|
||||
@@ -413,7 +413,7 @@ Wesley</samp></pre>
|
||||
|
||||
<p id=guess>The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution.
|
||||
|
||||
<pre class=nd><code class=pp>characters = tuple(ord(c) for c in sorted_characters)
|
||||
<pre class='nd pp'><code>characters = tuple(ord(c) for c in sorted_characters)
|
||||
digits = tuple(ord(c) for c in '0123456789')
|
||||
...
|
||||
for guess in itertools.permutations(digits, len(characters)):
|
||||
|
||||
@@ -228,7 +228,7 @@ RefactoringTool: test.py</samp></pre>
|
||||
|
||||
<p>Let’s take a peek in that <code>__init__.py</code> file.
|
||||
|
||||
<pre><code class=pp><a>def detect(aBuf): <span class=u>①</span></a>
|
||||
<pre class=pp><code><a>def detect(aBuf): <span class=u>①</span></a>
|
||||
<a> from . import universaldetector <span class=u>②</span></a>
|
||||
u = universaldetector.UniversalDetector()
|
||||
u.reset()
|
||||
@@ -242,7 +242,7 @@ RefactoringTool: test.py</samp></pre>
|
||||
|
||||
<p>The answer lies in that odd-looking <code>import</code> statement:
|
||||
|
||||
<pre class=nd><code class=pp>from . import universaldetector</code></pre>
|
||||
<pre class='nd pp'><code>from . import universaldetector</code></pre>
|
||||
|
||||
<p>Translated into English, that means “import the <code>universaldetector</code> module that’s in the same directory as I am,” where “I” is the <code>chardet/__init__.py</code> file. This is called a <i>relative import</i>. It’s a way for the files within a multi-file module to reference each other, without worrying about naming conflicts with other modules you may have installed in <a href=your-first-python-program.html#importsearchpath>your import search path</a>. This <code>import</code> statement will <em>only</em> look for the <code>universaldetector</code> module within the <code>chardet/</code> directory itself.
|
||||
|
||||
@@ -267,7 +267,7 @@ RefactoringTool: test.py</samp></pre>
|
||||
^
|
||||
SyntaxError: invalid syntax</samp></pre>
|
||||
<p>Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can’t use it as a variable name. Let’s look at <code>constants.py</code> to see where it’s defined. Here’s the original version from <code>constants.py</code>, before the <code>2to3</code> script changed it:
|
||||
<pre class=nd><code class=pp>import __builtin__
|
||||
<pre class='nd pp'><code>import __builtin__
|
||||
if not hasattr(__builtin__, 'False'):
|
||||
False = 0
|
||||
True = 1
|
||||
@@ -277,9 +277,9 @@ else:
|
||||
<p>This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3, Python had no built-in <code>bool</code> type. This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.
|
||||
<p>However, Python 3 will always have a <code>bool</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of <code>constants.True</code> and <code>constants.False</code> with <code>True</code> and <code>False</code>, respectively, then delete this dead code from <code>constants.py</code>.
|
||||
<p>So this line in <code>universaldetector.py</code>:
|
||||
<pre class=nd><code class=pp>self.done = constants.False</code></pre>
|
||||
<pre class='nd pp'><code>self.done = constants.False</code></pre>
|
||||
<p>Becomes
|
||||
<pre class=nd><code class=pp>self.done = False</code></pre>
|
||||
<pre class='nd pp'><code>self.done = False</code></pre>
|
||||
<p>Ah, wasn’t that satisfying? The code is shorter and more readable already.
|
||||
<h3 id=nomodulenamedconstants>No module named <code>constants</code></h3>
|
||||
<p>Time to run <code>test.py</code> again and see how far it gets.
|
||||
@@ -293,12 +293,12 @@ ImportError: No module named constants</samp></pre>
|
||||
<p>What’s that you say? No module named <code>constants</code>? Of course there’s a module named <code>constants</code>. It’s right there, in <code>chardet/constants.py</code>.
|
||||
|
||||
<p>Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports — that is, <a href=#multifile-modules>modules that import other modules within the same library</a> — but <em>the logic behind relative imports has changed in Python 3</em>. In Python 2, you could just <code>import constants</code> and it would look in the <code>chardet/</code> directory first. In Python 3, <a href=http://www.python.org/dev/peps/pep-0328/>all import statements are absolute by default</a>. If you want to do a relative import in Python 3, you need to be explicit about it:
|
||||
<pre class=nd><code class=pp>from . import constants</code></pre>
|
||||
<pre class='nd pp'><code>from . import constants</code></pre>
|
||||
<p>But wait. Wasn’t the <code>2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code>constants</code> module within the library, and an absolute import of the <code>sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the <code>2to3</code> script is not smart enough to split the import statement into two.
|
||||
<p>The solution is to split the import statement manually. So this two-in-one import:
|
||||
<pre class=nd><code class=pp>import constants, sys</code></pre>
|
||||
<pre class='nd pp'><code>import constants, sys</code></pre>
|
||||
<p>Needs to become two separate imports:
|
||||
<pre class=nd><code class=pp>from . import constants
|
||||
<pre class='nd pp'><code>from . import constants
|
||||
import sys</code></pre>
|
||||
<p>There are variations of this problem scattered throughout the <code>chardet</code> library. In some places it’s “<code>import constants, sys</code>”; in other places, it’s “<code>import constants, re</code>”. The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
|
||||
<p>Onward!
|
||||
@@ -313,7 +313,7 @@ import sys</code></pre>
|
||||
NameError: name 'file' is not defined</samp></pre>
|
||||
<p>This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global <code>file()</code> function was an alias for the <code>open()</code> function, which was the standard way of <a href=files.html#reading>opening text files for reading</a>. In Python 3, the global <code>file()</code> function no longer exists, but the <code>open()</code> function still exists.
|
||||
<p>Thus, the simplest solution to the problem of the missing <code>file()</code> is to call the <code>open()</code> function instead:
|
||||
<pre class=nd><code class=pp>for line in open(f, 'rb'):</code></pre>
|
||||
<pre class='nd pp'><code>for line in open(f, 'rb'):</code></pre>
|
||||
<p>And that’s all I have to say about that.
|
||||
<h3 id=cantuseastringpattern>Can’t use a string pattern on a bytes-like object</h3>
|
||||
<p>Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.”
|
||||
@@ -326,20 +326,20 @@ NameError: name 'file' is not defined</samp></pre>
|
||||
if self._highBitDetector.search(aBuf):
|
||||
TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
||||
<p>To debug this, let’s see what <var>self._highBitDetector</var> is. It’s defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:
|
||||
<pre class=nd><code class=pp>class UniversalDetector:
|
||||
<pre class='nd pp'><code>class UniversalDetector:
|
||||
def __init__(self):
|
||||
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
|
||||
<p>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128–255.
|
||||
<p>And therein lies the problem.
|
||||
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
|
||||
<pre class=nd><code class=pp>def feed(self, aBuf):
|
||||
<pre class='nd pp'><code>def feed(self, aBuf):
|
||||
.
|
||||
.
|
||||
.
|
||||
if self._mInputState == ePureAscii:
|
||||
if self._highBitDetector.search(aBuf):</code></pre>
|
||||
<p>And what is <var>aBuf</var>? Let’s backtrack further to a place that calls <code>UniversalDetector.feed()</code>. One place that calls it is the test harness, <code>test.py</code>.
|
||||
<pre class=nd><code class=pp>u = UniversalDetector()
|
||||
<pre class='nd pp'><code>u = UniversalDetector()
|
||||
.
|
||||
.
|
||||
.
|
||||
@@ -349,7 +349,7 @@ for line in open(f, 'rb'):
|
||||
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for “read”; OK, big deal, we’re reading the file. Ah, but <a href=files.html#binary><code>'b'</code> is for “binary.”</a> Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
|
||||
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
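<p>To see the difference between the two modes, here is a small sketch that reads the same file both ways. The temporary file and its contents are made up for the illustration.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location (illustrative data).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write('caf\u00e9')

# Text mode decodes the line into a string of characters.
with open(path, encoding='utf-8') as f:
    text_line = f.readline()

# Binary mode returns the raw bytes, undecoded.
with open(path, mode='rb') as f:
    byte_line = f.readline()

print(type(text_line), len(text_line))
print(type(byte_line), len(byte_line))
```

<p>The string has four characters, but the byte array has five bytes, because the accented letter takes two bytes in UTF-8.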
|
||||
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
|
||||
<pre class=nd><code class=pp> class UniversalDetector:
|
||||
<pre class='nd pp'><code> class UniversalDetector:
|
||||
def __init__(self):
|
||||
<del>- self._highBitDetector = re.compile(r'[\x80-\xFF]')</del>
|
||||
<del>- self._escDetector = re.compile(r'(\033|~{)')</del>
|
||||
@@ -359,7 +359,7 @@ for line in open(f, 'rb'):
|
||||
self._mCharSetProbers = []
|
||||
self.reset()</code></pre>
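<p>The rule can be checked on its own, outside of <code>chardet</code>. This sketch uses a made-up buffer; only the high-bit pattern is taken from the code above.

```python
import re

buf = b'plain ASCII, then a high byte: \xe9'

# A bytes pattern can search a bytes object.
high_bit = re.compile(b'[\x80-\xFF]')
match = high_bit.search(buf)
print(match.group())

# A str pattern on a bytes object raises TypeError.
error = None
try:
    re.compile('[\x80-\xFF]').search(buf)
except TypeError as e:
    error = str(e)
    print('TypeError:', error)
```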
|
||||
<p>Searching the entire codebase for other uses of the <code>re</code> module turns up two more instances, in <code>charsetprober.py</code>. Again, the code is defining regular expressions as strings but executing them on <var>aBuf</var>, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
|
||||
<pre class=nd><code class=pp> class CharSetProber:
|
||||
<pre class='nd pp'><code> class CharSetProber:
|
||||
.
|
||||
.
|
||||
.
|
||||
@@ -384,7 +384,7 @@ for line in open(f, 'rb'):
|
||||
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
|
||||
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
||||
<p>There’s an unfortunate clash of coding style and Python interpreter here. The <code>TypeError</code> could be anywhere on that line, but the traceback doesn’t tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
|
||||
<pre class=nd><code class=pp>elif (self._mInputState == ePureAscii) and \
|
||||
<pre class='nd pp'><code>elif (self._mInputState == ePureAscii) and \
|
||||
self._escDetector.search(self._mLastChar + aBuf):</code></pre>
|
||||
<p>And re-run the test:
|
||||
<pre class='nd screen'><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
||||
@@ -397,7 +397,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
||||
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
||||
<p>Aha! The problem was not in the first conditional (<code>self._mInputState == ePureAscii</code>) but in the second one. So what could cause a <code>TypeError</code> there? Perhaps you’re thinking that the <code>search()</code> method is expecting a value of a different type, but that wouldn’t generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may <em>crash</em> if you pass it a value of a different type than it’s expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the <code>search()</code> method. So the problem must be in that <code>+</code> operation, as it’s trying to construct the value that it will eventually pass to the <code>search()</code> method.
|
||||
<p>We know from <a href=#cantuseastringpattern>previous debugging</a> that <var>aBuf</var> is a byte array. So what is <code>self._mLastChar</code>? It’s an instance variable, defined in the <code>reset()</code> method, which is actually called from the <code>__init__()</code> method.
|
||||
<pre class=nd><code class=pp>class UniversalDetector:
|
||||
<pre class='nd pp'><code>class UniversalDetector:
|
||||
def __init__(self):
|
||||
self._highBitDetector = re.compile(b'[\x80-\xFF]')
|
||||
self._escDetector = re.compile(b'(\033|~{)')
|
||||
@@ -414,7 +414,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
||||
<mark> self._mLastChar = ''</mark></code></pre>
|
||||
<p>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can’t concatenate a string to a byte array — not even a zero-length string.
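<p>A quick sketch with a made-up buffer shows the same failure in isolation:

```python
a_buf = b'abc'

# Mixing str and bytes fails, even when the str is empty.
failed = False
try:
    '' + a_buf
except TypeError as e:
    failed = True
    print('TypeError:', e)

# Two bytes objects concatenate without trouble.
combined = b'' + a_buf
print(combined)
```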
|
||||
<p>So what is <var>self._mLastChar</var> anyway? It’s assigned in the <code>feed()</code> method, just a few lines down from where the traceback occurred.
|
||||
<pre class=nd><code class=pp>if self._mInputState == ePureAscii:
|
||||
<pre class='nd pp'><code>if self._mInputState == ePureAscii:
|
||||
if self._highBitDetector.search(aBuf):
|
||||
self._mInputState = eHighbyte
|
||||
elif (self._mInputState == ePureAscii) and \
|
||||
@@ -423,14 +423,14 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
||||
|
||||
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
|
||||
<p>The calling function calls this <code>feed()</code> method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as <var>aBuf</var>), then stores the last byte in <var>self._mLastChar</var> in case it’s needed during the next call. (In a multi-byte encoding, the <code>feed()</code> method might get called with half of a character, then called again with the other half.) But because <var>aBuf</var> is now a byte array instead of a string, <var>self._mLastChar</var> needs to be a byte array as well. Thus:
|
||||
<pre class=nd><code class=pp> def reset(self):
|
||||
<pre class='nd pp'><code> def reset(self):
|
||||
.
|
||||
.
|
||||
.
|
||||
<del>- self._mLastChar = ''</del>
|
||||
<ins>+ self._mLastChar = b''</ins></code></pre>
|
||||
<p>Searching the entire codebase for “<code>mLastChar</code>” turns up a similar problem in <code>mbcharsetprober.py</code>, but instead of tracking the last character, it tracks the last <em>two</em> characters. The <code>MultiByteCharSetProber</code> class uses a list of 1-character strings to track the last two characters. In Python 3, it needs to use a list of integers, because it’s not really tracking characters, it’s tracking bytes. (Bytes are just integers from <code>0-255</code>.)
|
||||
<pre class=nd><code class=pp> class MultiByteCharSetProber(CharSetProber):
|
||||
<pre class='nd pp'><code> class MultiByteCharSetProber(CharSetProber):
|
||||
def __init__(self):
|
||||
CharSetProber.__init__(self)
|
||||
self._mDistributionAnalyzer = None
|
||||
@@ -459,7 +459,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
|
||||
<p>…The bad news is it doesn’t always feel like progress.
|
||||
<p>But this is progress! Really! Even though the traceback calls out the same line of code, it’s a different error than it used to be. Progress! So what’s the problem now? The last time I checked, this line of code didn’t try to concatenate an <code>int</code> with a byte array (<code>bytes</code>). In fact, you just spent a lot of time <a href=#cantconvertbytesobject>ensuring that <var>self._mLastChar</var> was a byte array</a>. How did it turn into an <code>int</code>?
|
||||
<p>The answer lies not in the previous lines of code, but in the following lines.
|
||||
<pre class=nd><code class=pp>if self._mInputState == ePureAscii:
|
||||
<pre class='nd pp'><code>if self._mInputState == ePureAscii:
|
||||
if self._highBitDetector.search(aBuf):
|
||||
self._mInputState = eHighbyte
|
||||
elif (self._mInputState == ePureAscii) and \
|
||||
@@ -496,7 +496,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp>
|
||||
<li>Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.
|
||||
</ol>
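<p>The points above can be sketched in a few lines. The buffer contents are illustrative.

```python
a_buf = b'\x1b$B'

last_as_int = a_buf[-1]      # indexing yields an int
last_as_bytes = a_buf[-1:]   # slicing yields a bytes object of length 1

print(last_as_int, last_as_bytes)

# Only the sliced form can be concatenated with more bytes.
next_buf = b'xyz'
joined = last_as_bytes + next_buf
print(joined, len(joined))
```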
|
||||
<p>So, to ensure that the <code>feed()</code> method in <code>universaldetector.py</code> continues to work no matter how often it’s called, you need to <a href=#cantconvertbytesobject>initialize <var>self._mLastChar</var> as a 0-length byte array</a>, then <em>make sure it stays a byte array</em>.
|
||||
<pre class=nd><code class=pp> self._escDetector.search(self._mLastChar + aBuf):
|
||||
<pre class='nd pp'><code> self._escDetector.search(self._mLastChar + aBuf):
|
||||
self._mInputState = eEscAscii
|
||||
|
||||
<del>- self._mLastChar = aBuf[-1]</del>
|
||||
@@ -519,25 +519,25 @@ tests\Big5\0804.blogspot.com.xml</samp>
|
||||
byteCls = self._mModel['classTable'][ord(c)]
|
||||
TypeError: ord() expected string of length 1, but int found</samp></pre>
|
||||
<p>OK, so <var>c</var> is an <code>int</code>, but the <code>ord()</code> function was expecting a 1-character string. Fair enough. Where is <var>c</var> defined?
|
||||
<pre class=nd><code class=pp># codingstatemachine.py
|
||||
<pre class='nd pp'><code># codingstatemachine.py
|
||||
def next_state(self, c):
|
||||
# for each byte we get its class
|
||||
# if it is first byte, we also get byte length
|
||||
byteCls = self._mModel['classTable'][ord(c)]</code></pre>
|
||||
<p>That’s no help; it’s just passed into the function. Let’s pop the stack.
|
||||
<pre class=nd><code class=pp># utf8prober.py
|
||||
<pre class='nd pp'><code># utf8prober.py
|
||||
def feed(self, aBuf):
|
||||
for c in aBuf:
|
||||
codingState = self._mCodingSM.next_state(c)</code></pre>
|
||||
<p>Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That’s what you get when you iterate over a string — all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there’s no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
|
||||
<p>Thus:
|
||||
<pre class=nd><code class=pp> def next_state(self, c):
|
||||
<pre class='nd pp'><code> def next_state(self, c):
|
||||
# for each byte we get its class
|
||||
# if it is first byte, we also get byte length
|
||||
<del>- byteCls = self._mModel['classTable'][ord(c)]</del>
|
||||
<ins>+ byteCls = self._mModel['classTable'][c]</ins></code></pre>
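<p>A short sketch of why the <code>ord()</code> call has to go. The lookup table here is a hypothetical stand-in for the model's class table.

```python
a_buf = b'\xef\xbb\xbfhi'

# Hypothetical lookup table with one entry per possible byte value.
class_table = [0] * 256
class_table[0xEF] = 1

# Iterating over bytes yields ints, so they can index the table directly.
classes = [class_table[c] for c in a_buf]
print(classes)

# Calling ord() on one of those ints fails.
try:
    ord(a_buf[0])
except TypeError as e:
    print('TypeError:', e)
```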
|
||||
<p>Searching the entire codebase for instances of “<code>ord(c)</code>” uncovers similar problems in <code>sbcharsetprober.py</code>…
|
||||
<pre class=nd><code class=pp># sbcharsetprober.py
|
||||
<pre class='nd pp'><code># sbcharsetprober.py
|
||||
def feed(self, aBuf):
|
||||
if not self._mModel['keepEnglishLetter']:
|
||||
aBuf = self.filter_without_english_letters(aBuf)
|
||||
@@ -547,13 +547,13 @@ def feed(self, aBuf):
|
||||
for c in aBuf:
|
||||
<mark> order = self._mModel['charToOrderMap'][ord(c)]</mark></code></pre>
|
||||
<p>…and <code>latin1prober.py</code>…
|
||||
<pre class=nd><code class=pp># latin1prober.py
|
||||
<pre class='nd pp'><code># latin1prober.py
|
||||
def feed(self, aBuf):
|
||||
aBuf = self.filter_with_english_letters(aBuf)
|
||||
for c in aBuf:
|
||||
<mark> charClass = Latin1_CharToClass[ord(c)]</mark></code></pre>
|
||||
<p><var>c</var> is iterating over <var>aBuf</var>, which means it is an integer, not a 1-character string. The solution is the same: change <code>ord(c)</code> to just plain <code>c</code>.
|
||||
<pre class=nd><code class=pp> # sbcharsetprober.py
|
||||
<pre class='nd pp'><code> # sbcharsetprober.py
|
||||
def feed(self, aBuf):
|
||||
if not self._mModel['keepEnglishLetter']:
|
||||
aBuf = self.filter_without_english_letters(aBuf)
|
||||
@@ -591,7 +591,7 @@ tests\Big5\0804.blogspot.com.xml</samp>
|
||||
if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
|
||||
TypeError: unorderable types: int() >= str()</samp></pre>
|
||||
<p>So what’s this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:
|
||||
<pre class=nd><code class=pp>class SJISContextAnalysis(JapaneseContextAnalysis):
|
||||
<pre class='nd pp'><code>class SJISContextAnalysis(JapaneseContextAnalysis):
|
||||
def get_order(self, aStr):
|
||||
if not aStr: return -1, 1
|
||||
# find out current char's byte length
|
||||
@@ -601,7 +601,7 @@ TypeError: unorderable types: int() >= str()</samp></pre>
|
||||
else:
|
||||
charLen = 1</code></pre>
|
||||
<p>And where does <var>aStr</var> come from? Let’s pop the stack:
|
||||
<pre class=nd><code class=pp>def feed(self, aBuf, aLen):
|
||||
<pre class='nd pp'><code>def feed(self, aBuf, aLen):
|
||||
.
|
||||
.
|
||||
.
|
||||
@@ -611,7 +611,7 @@ TypeError: unorderable types: int() >= str()</samp></pre>
|
||||
<p>Oh look, it’s our old friend, <var>aBuf</var>. As you might have guessed from every other issue we’ve encountered in this chapter, <var>aBuf</var> is a byte array. Here, the <code>feed()</code> method isn’t just passing it on wholesale; it’s slicing it. But as you saw <a href=#unsupportedoperandtypeforplus>earlier in this chapter</a>, slicing a byte array returns a byte array, so the <var>aStr</var> parameter that gets passed to the <code>get_order()</code> method is still a byte array.
|
||||
<p>And what is this code trying to do with <var>aStr</var>? It’s taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because <var>aStr</var> and <var>aBuf</var> were strings, and <var>aStr[0]</var> would be a string, and you can compare strings for inequality. But in Python 3, <var>aStr</var> and <var>aBuf</var> are byte arrays, <var>aStr[0]</var> is an integer, and you can’t compare integers and strings for inequality without explicitly coercing one of them.
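<p>Here is a minimal sketch of that failure, with a made-up two-byte buffer, next to a comparison against integer constants:

```python
a_buf = b'\x82\xa0'   # illustrative two-byte sequence

# Comparing an int with a str raises TypeError in Python 3.
try:
    a_buf[0] >= '\x81'
except TypeError as e:
    print('TypeError:', e)

# Comparing against integer constants works.
in_range = 0x81 <= a_buf[0] <= 0x9F
print(in_range)
```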
|
||||
<p>In this case, there’s no need to make the code more complicated by adding an explicit coercion. <var>aStr[0]</var> yields an integer; the things you’re comparing to are all constants. Let’s change them from 1-character strings to integers. And while we’re at it, let’s change <var>aStr</var> to <var>aBuf</var>, since it’s not actually a string.
|
||||
<pre class=nd><code class=pp> class SJISContextAnalysis(JapaneseContextAnalysis):
|
||||
<pre class='nd pp'><code> class SJISContextAnalysis(JapaneseContextAnalysis):
|
||||
<del>- def get_order(self, aStr):</del>
|
||||
<del>- if not aStr: return -1, 1
|
||||
<ins>+ def get_order(self, aBuf):</ins>
|
||||
@@ -688,7 +688,7 @@ tests\Big5\0804.blogspot.com.xml</samp>
|
||||
if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
|
||||
TypeError: unorderable types: int() >= str()</samp></pre>
|
||||
<p>The fix is the same:
|
||||
<pre class=nd><code class=pp> class EUCTWDistributionAnalysis(CharDistributionAnalysis):
|
||||
<pre class='nd pp'><code> class EUCTWDistributionAnalysis(CharDistributionAnalysis):
|
||||
def __init__(self):
|
||||
CharDistributionAnalysis.__init__(self)
|
||||
self._mCharToFreqOrder = EUCTWCharToFreqOrder
|
||||
@@ -812,21 +812,21 @@ tests\Big5\0804.blogspot.com.xml</samp>
|
||||
total = reduce(operator.add, self._mFreqCounter)
|
||||
NameError: global name 'reduce' is not defined</samp></pre>
|
||||
<p>According to the official <a href=http://docs.python.org/3.0/whatsnew/3.0.html#builtins>What’s New In Python 3.0</a> guide, the <code>reduce()</code> function has been moved out of the global namespace and into the <code>functools</code> module. Quoting the guide: “Use <code>functools.reduce()</code> if you really need it; however, 99 percent of the time an explicit <code>for</code> loop is more readable.” You can read more about the decision from Guido van Rossum’s weblog: <a href='http://www.artima.com/weblogs/viewpost.jsp?thread=98196'>The fate of reduce() in Python 3000</a>.
|
||||
<pre class=nd><code class=pp>def get_confidence(self):
|
||||
<pre class='nd pp'><code>def get_confidence(self):
|
||||
if self.get_state() == constants.eNotMe:
|
||||
return 0.01
|
||||
|
||||
<mark> total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
|
||||
<p>The <code>reduce()</code> function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
|
||||
<p>This monstrosity was so common that Python added a global <code>sum()</code> function.
|
||||
<pre class=nd><code class=pp> def get_confidence(self):
|
||||
<pre class='nd pp'><code> def get_confidence(self):
|
||||
if self.get_state() == constants.eNotMe:
|
||||
return 0.01
|
||||
|
||||
<del>- total = reduce(operator.add, self._mFreqCounter)</del>
|
||||
<ins>+ total = sum(self._mFreqCounter)</ins></code></pre>
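<p>For comparison, here are both spellings side by side on an illustrative list of counts:

```python
import functools
import operator

freq_counter = [3, 0, 7, 2]   # illustrative counts

# The Python 3 home of reduce() is the functools module.
total_reduce = functools.reduce(operator.add, freq_counter)

# sum() gives the same answer with less ceremony.
total_sum = sum(freq_counter)

print(total_reduce, total_sum)
```

<p>One difference to be aware of: <code>sum()</code> of an empty list is <code>0</code>, while <code>functools.reduce()</code> without an initial value raises a <code>TypeError</code> on an empty list.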
|
||||
<p>Since you’re no longer using the <code>operator</code> module, you can remove that <code>import</code> from the top of the file as well.
|
||||
<pre class=nd><code class=pp> from .charsetprober import CharSetProber
|
||||
<pre class='nd pp'><code> from .charsetprober import CharSetProber
|
||||
from . import constants
|
||||
<del>- import operator</del></code></pre>
|
||||
<p>I CAN HAZ TESTZ?
|
||||
|
||||
@@ -200,7 +200,6 @@ a.hl:hover, h2[id]:hover a.hl, h3[id]:hover a.hl {
|
||||
/* code blocks */
|
||||
|
||||
pre {
|
||||
white-space: pre-wrap;
|
||||
padding-left: 2.154em;
|
||||
border-left: 1px solid #ddd;
|
||||
}
|
||||
@@ -323,10 +322,10 @@ aside a {
|
||||
border: 0;
|
||||
display: block;
|
||||
}
|
||||
.v a:first-child {
|
||||
.v a {
|
||||
float: left;
|
||||
}
|
||||
.v a:last-child {
|
||||
.v a + a {
|
||||
float: right;
|
||||
}
|
||||
.v span {
|
||||
|
||||
+6
-6
@@ -26,7 +26,7 @@ body{counter-reset:h1 11}
|
||||
|
||||
<p>Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:
|
||||
|
||||
<pre class=nd><code class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</code></pre>
|
||||
<pre class='nd pp'><code>a_file = open('examples/chinese.txt', encoding='utf-8')</code></pre>
|
||||
|
||||
<p>Python has a built-in <code>open()</code> function, which takes a filename as an argument. Here the filename is <code class=pp>'examples/chinese.txt'</code>. There are five interesting things about this filename:
|
||||
|
||||
@@ -207,7 +207,7 @@ ValueError: I/O operation on closed file.</samp>
|
||||
|
||||
<p>Python 2 had a solution for this: the <code>try..finally</code> block. That still works in Python 3, and you may see it in other people’s code or in older code that was <a href=case-study-porting-chardet-to-python-3.html>ported to Python 3</a>. But Python 3 also offers a cleaner solution: the <code>with</code> statement.
|
||||
|
||||
<pre class=nd><code class=pp>with open('examples/chinese.txt', encoding='utf-8') as a_file:
|
||||
<pre class='nd pp'><code>with open('examples/chinese.txt', encoding='utf-8') as a_file:
|
||||
a_file.seek(17)
|
||||
a_character = a_file.read(1)
|
||||
print(a_character)</code></pre>
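<p>To compare the two approaches directly, this sketch reads the same file once with <code>try..finally</code> and once with <code>with</code>. The temporary file is made up for the illustration.

```python
import os
import tempfile

# Create a small file to read (illustrative data).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write('hello')

# try..finally: close the file by hand, whatever happens.
a_file = open(path, encoding='utf-8')
try:
    first = a_file.read(1)
finally:
    a_file.close()

# with: the file is closed when the block exits.
with open(path, encoding='utf-8') as b_file:
    second = b_file.read(1)

print(first, second, a_file.closed, b_file.closed)
```

<p>Both files end up closed; the <code>with</code> version simply needs fewer lines to guarantee it.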
|
||||
@@ -235,7 +235,7 @@ ValueError: I/O operation on closed file.</samp>
|
||||
<p>So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful.
|
||||
|
||||
<p class=d>[<a href=examples/oneline.py>download <code>oneline.py</code></a>]
|
||||
<pre><code class=pp>line_number = 0
|
||||
<pre class=pp><code>line_number = 0
|
||||
<a>with open('examples/favorite-people.txt', encoding='utf-8') as a_file: <span class=u>①</span></a>
|
||||
<a> for a_line in a_file: <span class=u>②</span></a>
|
||||
line_number += 1
|
||||
@@ -450,7 +450,7 @@ IOError: not readable</samp></pre>
|
||||
<p>So <code>sys.stdout</code> and <code>sys.stderr</code> are file-like objects, albeit ones that only support writing. But they’re not constants; they’re variables. That means you can assign them a new value — another file object, or another file-like object — and redirect their output.
|
||||
|
||||
<p class=d>[<a href=examples/stdout.py>download <code>stdout.py</code></a>]
|
||||
<pre><code class=pp>import sys
|
||||
<pre class=pp><code>import sys
|
||||
|
||||
class RedirectStdoutTo:
|
||||
def __init__(self, out_new):
|
||||
@@ -479,7 +479,7 @@ C</samp>
|
||||
|
||||
<p>Let’s take the last part first.
|
||||
|
||||
<pre><code class=pp>
|
||||
<pre class=pp><code>
|
||||
<a>print('A') <span class=u>①</span></a>
|
||||
<a>with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file): <span class=u>②</span></a>
|
||||
<a> print('B') <span class=u>③</span></a>
|
||||
@@ -493,7 +493,7 @@ C</samp>
|
||||
|
||||
<p>Now take a look at the <code>RedirectStdoutTo</code> class. It is a custom context manager. Upon entering the context, it redirects <code>sys.stdout</code> to a given file-like object. Upon exiting the context, it restores <code>sys.stdout</code> to its original value.
|
||||
|
||||
<pre><code class=pp>class RedirectStdoutTo:
|
||||
<pre class=pp><code>class RedirectStdoutTo:
|
||||
<a> def __init__(self, out_new): <span class=u>①</span></a>
|
||||
self.out_new = out_new
|
||||
|
||||
|
||||
+12
-12
@@ -38,7 +38,7 @@ body{counter-reset:h1 6}
|
||||
<h2 id=i-know>I Know, Let’s Use Regular Expressions!</h2>
|
||||
<p>So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions!
|
||||
<p class=d>[<a href=examples/plural1.py>download <code>plural1.py</code></a>]
|
||||
<pre><code class=pp>import re
|
||||
<pre class=pp><code>import re
|
||||
|
||||
def plural(noun):
|
||||
<a> if re.search('[sxz]$', noun): <span class=u>①</span></a>
|
||||
@@ -74,7 +74,7 @@ def plural(noun):
|
||||
|
||||
<p>And now, back to the <code>plural()</code> function…
|
||||
|
||||
<pre><code class=pp>def plural(noun):
|
||||
<pre class=pp><code>def plural(noun):
|
||||
if re.search('[sxz]$', noun):
|
||||
<a> return re.sub('$', 'es', noun) <span class=u>①</span></a>
|
||||
<a> elif re.search('[^aeioudgkprt]h$', noun): <span class=u>②</span></a>
|
||||
@@ -126,7 +126,7 @@ def plural(noun):
|
||||
<p>Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part.
|
||||
|
||||
<p class=d>[<a href=examples/plural2.py>download <code>plural2.py</code></a>]
|
||||
<pre><code class=pp>import re
|
||||
<pre class=pp><code>import re
|
||||
|
||||
def match_sxz(noun):
|
||||
return re.search('[sxz]$', noun)
|
||||
@@ -174,7 +174,7 @@ def plural(noun):
|
||||
|
||||
<p>If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire <code>for</code> loop is equivalent to the following:
|
||||
|
||||
<pre class=nd><code class=pp>
|
||||
<pre class='nd pp'><code>
|
||||
def plural(noun):
|
||||
if match_sxz(noun):
|
||||
return apply_sxz(noun)
|
||||
@@ -206,7 +206,7 @@ def plural(noun):
|
||||
<p>Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the <var>rules</var> sequence and call them through there. Furthermore, each function follows one of two patterns. All the match functions call <code>re.search()</code>, and all the apply functions call <code>re.sub()</code>. Let’s factor out the patterns so that defining new rules can be easier.
|
||||
|
||||
<p class=d>[<a href=examples/plural3.py>download <code>plural3.py</code></a>]
|
||||
<pre><code class=pp>import re
|
||||
<pre class=pp><code>import re
|
||||
|
||||
def build_match_and_apply_functions(pattern, search, replace):
|
||||
<a> def matches_rule(word): <span class=u>①</span></a>
|
||||
@@ -222,7 +222,7 @@ def build_match_and_apply_functions(pattern, search, replace):
|
||||
|
||||
<p>If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it.
|
||||
|
||||
<pre><code class=pp><a>patterns = \ <span class=u>①</span></a>
|
||||
<pre class=pp><code><a>patterns = \ <span class=u>①</span></a>
|
||||
(
|
||||
('[sxz]$', '$', 'es'),
|
||||
('[^aeioudgkprt]h$', '$', 'es'),
|
||||
@@ -239,7 +239,7 @@ def build_match_and_apply_functions(pattern, search, replace):
|
||||
|
||||
<p>Rounding out this version of the script is the main entry point, the <code>plural()</code> function.
|
||||
|
||||
<pre><code class=pp>def plural(noun):
|
||||
<pre class=pp><code>def plural(noun):
|
||||
<a> for matches_rule, apply_rule in rules: <span class=u>①</span></a>
|
||||
if matches_rule(noun):
|
||||
return apply_rule(noun)</code></pre>
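The hunks above show only fragments of `plural3.py`. A self-contained sketch of the closure technique might look like the following; the function bodies are reconstructed from the surrounding description (`re.search()` for matching, `re.sub()` for applying) and the patterns are taken from the rules file shown in this diff, so treat it as an illustration rather than the exact file contents.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    # Both inner functions close over pattern, search and replace.
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

# (match pattern, search pattern, replacement), most specific first.
patterns = (
    ('[sxz]$', '$', 'es'),
    ('[^aeioudgkprt]h$', '$', 'es'),
    ('[^aeiou]y$', 'y$', 'ies'),
    ('$', '$', 's'),
)

rules = [build_match_and_apply_functions(pattern, search, replace)
         for (pattern, search, replace) in patterns]

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)
```

Because the last pattern, `$`, matches every string, `plural()` always returns a value for a string argument.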
|
||||
@@ -256,7 +256,7 @@ def build_match_and_apply_functions(pattern, search, replace):
|
||||
<p>First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it <code>plural4-rules.txt</code>.
|
||||
|
||||
<p class=d>[<a href=examples/plural4-rules.txt>download <code>plural4-rules.txt</code></a>]
|
||||
<pre class=nd><code class=pp>[sxz]$ $ es
|
||||
<pre class='nd pp'><code>[sxz]$ $ es
|
||||
[^aeioudgkprt]h$ $ es
|
||||
[^aeiou]y$ y$ ies
|
||||
$ $ s</code></pre>
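A small sketch of how such whitespace-delimited lines could be parsed. The helper name `parse_rules` is hypothetical, and an in-memory `io.StringIO` stands in for the real rules file.

```python
import io

# Same four rules as the file above, held in memory for illustration.
RULES_TEXT = """\
[sxz]$ $ es
[^aeioudgkprt]h$ $ es
[^aeiou]y$ y$ ies
$ $ s
"""

def parse_rules(lines):
    # Hypothetical helper: yields one (pattern, search, replace) tuple per line.
    for line in lines:
        # split(None, ...) splits on runs of whitespace and drops the newline.
        pattern, search, replace = line.split(None, 3)
        yield (pattern, search, replace)

parsed = list(parse_rules(io.StringIO(RULES_TEXT)))
```

Each line yields exactly three fields, so the tuple unpacking succeeds for every rule in the file.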
|
||||
@@ -264,7 +264,7 @@ $ $ s</code></pre>
|
||||
<p>Now let’s see how you can use this rules file.
|
||||
|
||||
<p class=d>[<a href=examples/plural4.py>download <code>plural4.py</code></a>]
|
||||
<pre><code class=pp>import re
|
||||
<pre class=pp><code>import re
|
||||
|
||||
<a>def build_match_and_apply_functions(pattern, search, replace): <span class=u>①</span></a>
|
||||
def matches_rule(word):
|
||||
@@ -295,7 +295,7 @@ rules = []
|
||||
<p>Wouldn’t it be grand to have a generic <code>plural()</code> function that parses the rules file? Get rules, check for a match, apply appropriate transformation, go to next rule. That’s all the <code>plural()</code> function has to do, and that’s all the <code>plural()</code> function should do.
|
||||
|
||||
<p class=d>[<a href=examples/plural5.py>download <code>plural5.py</code></a>]
|
||||
<pre class=nd><code class=pp>def rules(rules_filename):
|
||||
<pre class='nd pp'><code>def rules(rules_filename):
|
||||
with open(rules_filename, encoding='utf-8') as pattern_file:
|
||||
for line in pattern_file:
|
||||
pattern, search, replace = line.split(None, 3)
|
||||
@@ -343,7 +343,7 @@ def plural(noun, rules_filename='plural5-rules.txt'):
|
||||
<h3 id=a-fibonacci-generator>A Fibonacci Generator</h3>
|
||||
|
||||
<p class=d>[<a href=examples/fibonacci.py>download <code>fibonacci.py</code></a>]
|
||||
<pre><code class=pp>def fib(max):
|
||||
<pre class=pp><code>def fib(max):
|
||||
<a> a, b = 0, 1 <span class=u>①</span></a>
|
||||
while a < max:
|
||||
<a> yield a <span class=u>②</span></a>
|
||||
@@ -375,7 +375,7 @@ def plural(noun, rules_filename='plural5-rules.txt'):
|
||||
|
||||
<p>Let’s go back to <code>plural5.py</code> and see how this version of the <code>plural()</code> function works.
|
||||
|
||||
<pre><code class=pp>def rules(rules_filename):
|
||||
<pre class=pp><code>def rules(rules_filename):
|
||||
with open(rules_filename, encoding='utf-8') as pattern_file:
|
||||
for line in pattern_file:
|
||||
<a> pattern, search, replace = line.split(None, 3) <span class=u>①</span></a>
|
||||
|
||||
+13
-13
@@ -25,7 +25,7 @@ body{counter-reset:h1 7}
|
||||
<p>Remember <a href=generators.html#a-fibonacci-generator>the Fibonacci generator</a>? Here it is as a built-from-scratch iterator:
|
||||
|
||||
<p class=d>[<a href=examples/fibonacci2.py>download <code>fibonacci2.py</code></a>]
|
||||
<pre><code class=pp>class Fib:
|
||||
<pre class=pp><code>class Fib:
|
||||
'''iterator that yields numbers in the Fibonacci sequence'''
|
||||
|
||||
def __init__(self, max):
|
||||
@@ -45,7 +45,7 @@ body{counter-reset:h1 7}
|
||||
|
||||
<p>Let’s take that one line at a time.
|
||||
|
||||
<pre class=nd><code class=pp>class Fib:</code></pre>
|
||||
<pre class='nd pp'><code>class Fib:</code></pre>
|
||||
|
||||
<p><code>class</code>? What’s a class?
|
||||
|
||||
@@ -57,7 +57,7 @@ body{counter-reset:h1 7}
|
||||
|
||||
<p>Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word <code>class</code>, followed by the class name. Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class.
|
||||
|
||||
<pre><code class=pp><a>class PapayaWhip: <span class=u>①</span></a>
|
||||
<pre class=pp><code><a>class PapayaWhip: <span class=u>①</span></a>
|
||||
<a> pass <span class=u>②</span></a></code></pre>
|
||||
<ol>
|
||||
<li>The name of this class is <code>PapayaWhip</code>, and it doesn’t inherit from any other class. Class names are usually capitalized, <code>EachWordLikeThis</code>, but this is only a convention, not a requirement.
|
||||
@@ -76,7 +76,7 @@ body{counter-reset:h1 7}
|
||||
|
||||
<p>This example shows the initialization of the <code>Fib</code> class using the <code>__init__</code> method.
|
||||
|
||||
<pre><code class=pp>class Fib:
|
||||
<pre class=pp><code>class Fib:
|
||||
<a> '''iterator that yields numbers in the Fibonacci sequence''' <span class=u>①</span></a>
|
||||
|
||||
<a> def __init__(self, max): <span class=u>②</span></a></code></pre>
|
||||
@@ -120,14 +120,14 @@ body{counter-reset:h1 7}
|
||||
|
||||
<p>On to the next line:
|
||||
|
||||
<pre><code class=pp>class Fib:
|
||||
<pre class=pp><code>class Fib:
|
||||
def __init__(self, max):
|
||||
<a> self.max = max <span class=u>①</span></a></code></pre>
|
||||
<ol>
|
||||
<li>What is <var>self.max</var>? It’s an instance variable. It is completely separate from <var>max</var>, which was passed into the <code>__init__()</code> method as an argument. <var>self.max</var> is “global” to the instance. That means that you can access it from other methods.
|
||||
</ol>
|
||||
|
||||
<pre><code class=pp>class Fib:
|
||||
<pre class=pp><code>class Fib:
|
||||
def __init__(self, max):
|
||||
<a> self.max = max <span class=u>①</span></a>
|
||||
.
|
||||
@@ -163,7 +163,7 @@ All three of these class methods, <code>__init__</code>, <code>__iter__</code>,
|
||||
</aside>
|
||||
|
||||
<p class=d>[<a href=examples/fibonacci2.py>download <code>fibonacci2.py</code></a>]
|
||||
<pre><code class=pp><a>class Fib: <span class=u>①</span></a>
|
||||
<pre class=pp><code><a>class Fib: <span class=u>①</span></a>
|
||||
<a> def __init__(self, max): <span class=u>②</span></a>
|
||||
self.max = max
|
||||
|
||||
@@ -214,7 +214,7 @@ All three of these class methods, <code>__init__</code>, <code>__iter__</code>,
|
||||
<p>Now it’s time for the finale. Let’s rewrite the <a href=generators.html>plural rules generator</a> as an iterator.
|
||||
|
||||
<p class=d>[<a href=examples/plural6.py>download <code>plural6.py</code></a>]
|
||||
<pre><code class=pp>class LazyRules:
|
||||
<pre class=pp><code>class LazyRules:
|
||||
rules_filename = 'plural6-rules.txt'
|
||||
|
||||
def __init__(self):
|
||||
@@ -250,7 +250,7 @@ rules = LazyRules()</code></pre>
|
||||
|
||||
<p>Let’s take the class one bite at a time.
|
||||
|
||||
<pre><code class=pp>class LazyRules:
|
||||
<pre class=pp><code>class LazyRules:
|
||||
rules_filename = 'plural6-rules.txt'
|
||||
|
||||
def __init__(self):
|
||||
@@ -297,7 +297,7 @@ rules = LazyRules()</code></pre>
|
||||
|
||||
<p>And now back to our show.
|
||||
|
||||
<pre><code class=pp><a> def __iter__(self): <span class=u>①</span></a>
|
||||
<pre class=pp><code><a> def __iter__(self): <span class=u>①</span></a>
|
||||
<a> self.cache_index = 0 <span class=u>②</span></a>
|
||||
<a> return self <span class=u>③</span></a>
|
||||
</code></pre>
|
||||
@@ -307,7 +307,7 @@ rules = LazyRules()</code></pre>
|
||||
<li>Finally, the <code>__iter__()</code> method returns <var>self</var>, which signals that this class will take care of returning its own values throughout an iteration.
|
||||
</ol>
|
||||
|
||||
<pre><code class=pp><a> def __next__(self): <span class=u>①</span></a>
|
||||
<pre class=pp><code><a> def __next__(self): <span class=u>①</span></a>
|
||||
.
|
||||
.
|
||||
.
|
||||
@@ -324,7 +324,7 @@ rules = LazyRules()</code></pre>
|
||||
|
||||
<p>Moving backwards…
|
||||
|
||||
<pre><code class=pp> def __next__(self):
|
||||
<pre class=pp><code> def __next__(self):
|
||||
.
|
||||
.
|
||||
.
|
||||
@@ -343,7 +343,7 @@ rules = LazyRules()</code></pre>
|
||||
|
||||
<p>Moving backwards all the way to the start of the <code>__next__()</code> method…
|
||||
|
||||
<pre><code class=pp> def __next__(self):
|
||||
<pre class=pp><code> def __next__(self):
|
||||
self.cache_index += 1
|
||||
if len(self.cache) >= self.cache_index:
|
||||
<a> return self.cache[self.cache_index - 1] <span class=u>①</span></a>
|
||||
|
||||
+46
-11
@@ -11,11 +11,7 @@
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
//
|
||||
// Changes from upstream:
|
||||
// - use class=pp instead of class=prettyprint to declare blocks-to-colorize
|
||||
// - removed support for <xmp>
|
||||
// - added support for <kbd> and <samp>
|
||||
|
||||
|
||||
/**
|
||||
* @fileoverview
|
||||
@@ -36,6 +32,9 @@
|
||||
* <script type="text/javascript" src="/path/to/prettify.js"></script>
|
||||
* 2) define style rules. See the example page for examples.
|
||||
* 3) mark the <pre> and <code> tags in your source with class=pp.
|
||||
* You can also use the (html deprecated) <xmp> tag, but the pretty printer
|
||||
* needs to do more substantial DOM manipulations to support that, so some
|
||||
* css styles may not be preserved.
|
||||
* That's it. I wanted to keep the API as simple as possible, so there's no
|
||||
* need to specify which language the code is in.
|
||||
*
|
||||
@@ -269,6 +268,11 @@ window['_pr_isIE6'] = function () {
|
||||
.replace(pr_nbspEnt, ' ');
|
||||
}
|
||||
|
||||
/** is the given node's innerHTML normally unescaped? */
|
||||
function isRawContent(node) {
|
||||
return 'XMP' === node.tagName;
|
||||
}
|
||||
|
||||
function normalizedHtml(node, out) {
|
||||
switch (node.nodeType) {
|
||||
case 1: // an element
|
||||
@@ -541,6 +545,10 @@ window['_pr_isIE6'] = function () {
|
||||
|
||||
if (PR_innerHtmlWorks) {
|
||||
var content = node.innerHTML;
|
||||
// XMP tags contain unescaped entities so require special handling.
|
||||
if (isRawContent(node)) {
|
||||
content = textToHtml(content);
|
||||
}
|
||||
return content;
|
||||
}
|
||||
|
||||
@@ -603,13 +611,14 @@ window['_pr_isIE6'] = function () {
|
||||
'[^<]+' // A run of characters other than '<'
|
||||
+ '|<\!--[\\s\\S]*?--\>' // an HTML comment
|
||||
+ '|<!\\[CDATA\\[[\\s\\S]*?\\]\\]>' // a CDATA section
|
||||
+ '|</?[a-zA-Z][^>]*>' // a probable tag that should not be highlighted
|
||||
// a probable tag that should not be highlighted
|
||||
+ '|<\/?[a-zA-Z](?:[^>\"\']|\'[^\']*\'|\"[^\"]*\")*>'
|
||||
+ '|<', // A '<' that does not begin a larger chunk
|
||||
'g');
|
||||
var pr_commentPrefix = /^<\!--/;
|
||||
var pr_cdataPrefix = /^<\[CDATA\[/;
|
||||
var pr_cdataPrefix = /^<!\[CDATA\[/;
|
||||
var pr_brPrefix = /^<br\b/i;
|
||||
var pr_tagNameRe = /^<(\/?)([a-zA-Z]+)/;
|
||||
var pr_tagNameRe = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/;
|
||||
|
||||
/** split markup into chunks of html tags (style null) and
|
||||
* plain text (style {@link #PR_PLAIN}), converting tags which are
|
||||
@@ -1273,7 +1282,8 @@ window['_pr_isIE6'] = function () {
|
||||
document.getElementsByTagName('pre'),
|
||||
document.getElementsByTagName('code'),
|
||||
document.getElementsByTagName('kbd'),
|
||||
document.getElementsByTagName('samp') ];
|
||||
document.getElementsByTagName('samp'),
|
||||
document.getElementsByTagName('xmp') ];
|
||||
var elements = [];
|
||||
for (var i = 0; i < codeSegments.length; ++i) {
|
||||
for (var j = 0, n = codeSegments[i].length; j < n; ++j) {
|
||||
@@ -1311,7 +1321,8 @@ window['_pr_isIE6'] = function () {
|
||||
var nested = false;
|
||||
for (var p = cs.parentNode; p; p = p.parentNode) {
|
||||
if ((p.tagName === 'pre' || p.tagName === 'code' ||
|
||||
p.tagName === 'kbd' || p.tagName === 'samp') &&
|
||||
p.tagName === 'kbd' || p.tagName === 'samp' ||
|
||||
p.tagName === 'xmp') &&
|
||||
p.className && p.className.indexOf('pp') >= 0) {
|
||||
nested = true;
|
||||
break;
|
||||
@@ -1348,7 +1359,31 @@ window['_pr_isIE6'] = function () {
|
||||
var cs = prettyPrintingJob.sourceNode;
|
||||
|
||||
// push the prettified html back into the tag.
|
||||
cs.innerHTML = newContent;
|
||||
if (!isRawContent(cs)) {
|
||||
// just replace the old html with the new
|
||||
cs.innerHTML = newContent;
|
||||
} else {
|
||||
// we need to change the tag to a <pre> since <xmp>s do not allow
|
||||
// embedded tags such as the span tags used to attach styles to
|
||||
// sections of source code.
|
||||
var pre = document.createElement('PRE');
|
||||
for (var i = 0; i < cs.attributes.length; ++i) {
|
||||
var a = cs.attributes[i];
|
||||
if (a.specified) {
|
||||
var aname = a.name.toLowerCase();
|
||||
if (aname === 'class') {
|
||||
pre.className = a.value; // For IE 6
|
||||
} else {
|
||||
pre.setAttribute(a.name, a.value);
|
||||
}
|
||||
}
|
||||
}
|
||||
pre.innerHTML = newContent;
|
||||
|
||||
// remove the old
|
||||
cs.parentNode.replaceChild(pre, cs);
|
||||
cs = pre;
|
||||
}
|
||||
|
||||
// Replace <br>s with line-feeds so that copying and pasting works
|
||||
// on IE 6.
|
||||
|
||||
@@ -40,7 +40,7 @@ body{counter-reset:h1 2}
|
||||
<aside>You can use virtually any expression in a boolean context.</aside>
|
||||
<p>Booleans are either true or false. Python has two constants, cleverly named <code><dfn>True</dfn></code> and <code><dfn>False</dfn></code>, which can be used to assign <dfn>boolean</dfn> values directly. Expressions can also evaluate to a boolean value. In certain places (like <code>if</code> statements), Python expects an expression to evaluate to a boolean value. These places are called <i>boolean contexts</i>. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.)
|
||||
<p>For example, take this snippet from <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
|
||||
<pre class=nd><code class=pp>if size < 0:
|
||||
<pre class='nd pp'><code>if size < 0:
|
||||
raise ValueError('number must be non-negative')</code></pre>
|
||||
<p><var>size</var> is an integer, <code>0</code> is an integer, and <code><</code> is a numerical operator. The result of the expression <code>size < 0</code> is always a boolean. You can test this yourself in the Python interactive shell:
|
||||
<pre class='nd screen'>
|
||||
@@ -865,7 +865,7 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
|
||||
<h3 id=mixed-value-dictionaries>Mixed-Value Dictionaries</h3>
|
||||
<p>Dictionaries aren’t just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don’t all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.
|
||||
<p>In fact, you’ve already seen a dictionary with non-string keys and values, in <a href=your-first-python-program.html#divingin>your first Python program</a>.
|
||||
<pre class=nd><code class=pp>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
|
||||
<pre class='nd pp'><code>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
|
||||
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}</code></pre>
|
||||
<p>Let’s tear that apart in the interactive shell.
|
||||
<pre class=screen>
|
||||
|
||||
+3
-3
@@ -29,7 +29,7 @@ mark{display:inline}
|
||||
|
||||
<p>In this chapter, you’ll learn how the setup scripts for <code>chardet</code> and <code>httplib2</code> work, and you’ll step through the process of releasing your own Python software.
|
||||
|
||||
<pre><code class=pp># chardet's setup.py
|
||||
<pre class=pp><code># chardet's setup.py
|
||||
from distutils.core import setup
|
||||
setup(
|
||||
name = "chardet",
|
||||
@@ -157,7 +157,7 @@ chardet/
|
||||
|
||||
<p>The first line of every Distutils setup script is always the same:
|
||||
|
||||
<pre class=nd><code class=pp>from distutils.core import setup</code></pre>
|
||||
<pre class='nd pp'><code>from distutils.core import setup</code></pre>
|
||||
|
||||
<p>This imports the <code>setup()</code> function, which is the main entry point into Distutils. 95% of all Distutils setup scripts consist of a single call to <code>setup()</code> and nothing else. (I totally just made up that statistic, but if your Distutils setup script is doing more than calling the Distutils <code>setup()</code> function, you should have a good reason. Do you have a good reason? I didn’t think so.)
|
||||
|
||||
@@ -187,7 +187,7 @@ chardet/
|
||||
|
||||
<p>Now let’s look at the <code>chardet</code> setup script. It has all of these required and recommended parameters, plus one I haven’t mentioned yet: <code>packages</code>.
|
||||
|
||||
<pre class=nd><code class=pp>from distutils.core import setup
|
||||
<pre class='nd pp'><code>from distutils.core import setup
|
||||
setup(
|
||||
name = 'chardet',
|
||||
<mark>packages = ['chardet']</mark>,
|
||||
|
||||
@@ -246,7 +246,7 @@ td pre{padding:0;border:0}
|
||||
<td><code class=pp>import <dfn>cookielib</dfn></code>
|
||||
<td><code class=pp>import http.cookiejar</code>
|
||||
<tr><th>④
|
||||
<td><pre><code class=pp>import <dfn>BaseHTTPServer</dfn>
|
||||
<td><pre class=pp><code>import <dfn>BaseHTTPServer</dfn>
|
||||
import <dfn>SimpleHTTPServer</dfn>
|
||||
import <dfn>CGIHTTPServer</dfn></code></pre>
|
||||
<td><code class=pp>import http.server</code>
|
||||
@@ -280,14 +280,14 @@ import <dfn>CGIHttpServer</dfn></code></pre>
|
||||
<td><code class=pp>import <dfn>robotparser</dfn></code>
|
||||
<td><code class=pp>import urllib.robotparser</code>
|
||||
<tr><th>⑤
|
||||
<td><pre><code class=pp>from urllib import <dfn>FancyURLopener</dfn>
|
||||
<td><pre class=pp><code>from urllib import <dfn>FancyURLopener</dfn>
|
||||
from urllib import urlencode</code></pre>
|
||||
<td><pre><code class=pp>from urllib.request import FancyURLopener
|
||||
<td><pre class=pp><code>from urllib.request import FancyURLopener
|
||||
from urllib.parse import urlencode</code></pre>
|
||||
<tr><th>⑥
|
||||
<td><pre><code class=pp>from urllib2 import <dfn>Request</dfn>
|
||||
<td><pre class=pp><code>from urllib2 import <dfn>Request</dfn>
|
||||
from urllib2 import <dfn>HTTPError</dfn></code></pre>
|
||||
<td><pre><code class=pp>from urllib.request import Request
|
||||
<td><pre class=pp><code>from urllib.request import Request
|
||||
from urllib.error import HTTPError</code></pre>
|
||||
</table>
|
||||
|
||||
@@ -307,9 +307,9 @@ from urllib.error import HTTPError</code></pre>
|
||||
<th>Python 2
|
||||
<th>Python 3
|
||||
<tr><th>
|
||||
<td><pre><code class=pp>import urllib
|
||||
<td><pre class=pp><code>import urllib
|
||||
print urllib.urlopen('http://diveintopython3.org/').read()</code></pre>
|
||||
<td><pre><code class=pp>import urllib.request, urllib.parse, urllib.error
|
||||
<td><pre class=pp><code>import urllib.request, urllib.parse, urllib.error
|
||||
print(urllib.request.urlopen('http://diveintopython3.org/').read())</code></pre>
|
||||
</table>
|
||||
|
||||
@@ -334,7 +334,7 @@ print(urllib.request.urlopen('http://diveintopython3.org/').read())</code></pre>
|
||||
<td><code class=pp>import <dfn>dumbdbm</dfn></code>
|
||||
<td><code class=pp>import dbm.dumb</code>
|
||||
<tr><th>
|
||||
<td><pre><code class=pp>import <dfn>anydbm</dfn>
|
||||
<td><pre class=pp><code>import <dfn>anydbm</dfn>
|
||||
import whichdb</code></pre>
|
||||
<td><code class=pp>import dbm</code>
|
||||
</table>
|
||||
@@ -351,7 +351,7 @@ import whichdb</code></pre>
|
||||
<td><code class=pp>import <dfn>xmlrpclib</dfn></code>
|
||||
<td><code class=pp>import xmlrpc.client</code>
|
||||
<tr><th>
|
||||
<td><pre><code class=pp>import <dfn>DocXMLRPCServer</dfn>
|
||||
<td><pre class=pp><code>import <dfn>DocXMLRPCServer</dfn>
|
||||
import <dfn>SimpleXMLRPCServer</dfn></code></pre>
|
||||
<td><code class=pp>import xmlrpc.server</code>
|
||||
</table>
|
||||
@@ -363,13 +363,13 @@ import <dfn>SimpleXMLRPCServer</dfn></code></pre>
|
||||
<th>Python 2
|
||||
<th>Python 3
|
||||
<tr><th>①
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import <dfn>cStringIO</dfn> as <dfn>StringIO</dfn>
|
||||
except ImportError:
|
||||
import StringIO</code></pre>
|
||||
<td><code class=pp>import io</code>
|
||||
<tr><th>②
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import cPickle as pickle
|
||||
except ImportError:
|
||||
import pickle</code></pre>
|
||||
@@ -456,22 +456,22 @@ except ImportError:
|
||||
<td><code class=pp>a_function_that_returns_an_iterator().next()</code>
|
||||
<td><code class=pp>next(a_function_that_returns_an_iterator())</code>
|
||||
<tr><th>③
|
||||
<td><pre><code class=pp>class A:
|
||||
<td><pre class=pp><code>class A:
|
||||
def next(self):
|
||||
pass</code></pre>
|
||||
<td><pre><code class=pp>class A:
|
||||
<td><pre class=pp><code>class A:
|
||||
def __next__(self):
|
||||
pass</code></pre>
|
||||
<tr><th>④
|
||||
<td><pre><code class=pp>class A:
|
||||
<td><pre class=pp><code>class A:
|
||||
def next(self, x, y):
|
||||
pass</code></pre>
|
||||
<td><i>no change</i>
|
||||
<tr><th>⑤
|
||||
<td><pre><code class=pp>next = 42
|
||||
<td><pre class=pp><code>next = 42
|
||||
for an_iterator in a_sequence_of_iterators:
|
||||
an_iterator.next()</code></pre>
|
||||
<td><pre><code class=pp>next = 42
|
||||
<td><pre class=pp><code>next = 42
|
||||
for an_iterator in a_sequence_of_iterators:
|
||||
an_iterator.__next__()</code></pre>
|
||||
</table>
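A short Python 3 sketch of the renamed method: the class below (a hypothetical `Countdown` iterator, not from the book) defines `__next__()`, and callers use the built-in `next()` function rather than calling a `.next()` method.

```python
class Countdown:
    '''Hypothetical iterator that counts down from start to 1.'''

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):          # was next(self) in Python 2
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

an_iterator = Countdown(3)
first = next(an_iterator)        # built-in next() calls __next__()
remaining = list(an_iterator)    # consumes the rest of the iterator
```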
|
||||
@@ -560,7 +560,7 @@ for an_iterator in a_sequence_of_iterators:
|
||||
<th>Python 3
|
||||
<tr><th>
|
||||
<td><code class=pp>reduce(a, b, c)</code>
|
||||
<td><pre><code class=pp>from functools import reduce
|
||||
<td><pre class=pp><code>from functools import reduce
|
||||
reduce(a, b, c)</code></pre>
|
||||
</table>
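For illustration, a minimal Python 3 example of the imported function, where the three arguments are the function, the iterable, and an initial value (the specific values here are made up):

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4]

# reduce(function, iterable, initializer)
total = reduce(operator.add, numbers, 0)
product = reduce(lambda x, y: x * y, numbers, 1)
```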
|
||||
|
||||
@@ -674,31 +674,31 @@ reduce(a, b, c)</code></pre>
|
||||
<th>Python 2
|
||||
<th>Python 3
|
||||
<tr><th>①
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import mymodule
|
||||
<dfn>except</dfn> ImportError, e:
|
||||
pass</code></pre>
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import mymodule
|
||||
except ImportError as e:
|
||||
pass</code></pre>
|
||||
<tr><th>②
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import mymodule
|
||||
except (RuntimeError, ImportError), e:
|
||||
pass</code></pre>
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import mymodule
|
||||
except (RuntimeError, ImportError) as e:
|
||||
pass</code></pre>
|
||||
<tr><th>③
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import mymodule
|
||||
except ImportError:
|
||||
pass</code></pre>
|
||||
<td><i>no change</i>
|
||||
<tr><th>④
|
||||
<td><pre><code class=pp>try:
|
||||
<td><pre class=pp><code>try:
|
||||
import mymodule
|
||||
except:
|
||||
pass</code></pre>
|
||||
@@ -951,14 +951,14 @@ except:
|
||||
<th>Python 2
|
||||
<th>Python 3
|
||||
<tr><th>①
|
||||
<td><pre><code class=pp>class A:
|
||||
<td><pre class=pp><code>class A:
|
||||
def <dfn>__nonzero__</dfn>(self):
|
||||
pass</code></pre>
|
||||
<td><pre><code class=pp>class A:
|
||||
<td><pre class=pp><code>class A:
|
||||
def <dfn>__bool__</dfn>(self):
|
||||
pass</code></pre>
|
||||
<tr><th>②
|
||||
<td><pre><code class=pp>class A:
|
||||
<td><pre class=pp><code>class A:
|
||||
def __nonzero__(self, x, y):
|
||||
pass</code></pre>
|
||||
<td><i>no change</i>
|
||||
@@ -1233,18 +1233,18 @@ except:
|
||||
<th>Python 2
|
||||
<th>Python 3
|
||||
<tr><th>①
|
||||
<td><pre><code class=pp>class C(metaclass=PapayaMeta):
|
||||
<td><pre class=pp><code>class C(metaclass=PapayaMeta):
|
||||
pass</code></pre>
|
||||
<td><i>unchanged</i>
|
||||
<tr><th>②
|
||||
<td><pre><code class=pp>class Whip:
|
||||
<td><pre class=pp><code>class Whip:
|
||||
__metaclass__ = PapayaMeta</code></pre>
|
||||
<td><pre><code class=pp>class Whip(metaclass=PapayaMeta):
|
||||
<td><pre class=pp><code>class Whip(metaclass=PapayaMeta):
|
||||
pass</code></pre>
|
||||
<tr><th>③
|
||||
<td><pre><code class=pp>class C(Whipper, Beater):
|
||||
<td><pre class=pp><code>class C(Whipper, Beater):
|
||||
__metaclass__ = PapayaMeta</code></pre>
|
||||
<td><pre><code class=pp>class C(Whipper, Beater, metaclass=PapayaMeta):
|
||||
<td><pre class=pp><code>class C(Whipper, Beater, metaclass=PapayaMeta):
|
||||
pass</code></pre>
|
||||
</table>
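A hedged sketch of the Python 3 syntax in the table. `PapayaMeta`, `Whipper`, and `Beater` are given made-up bodies here purely so the example runs, and the `flavor` attribute is hypothetical.

```python
class PapayaMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        cls.flavor = 'papaya'    # hypothetical attribute added by the metaclass
        return cls

class Whipper:
    pass

class Beater:
    pass

# Python 3: the metaclass is a keyword argument in the class statement.
class C(Whipper, Beater, metaclass=PapayaMeta):
    pass

# In Python 3 the old attribute is just an ordinary class attribute;
# it does not change the class's metaclass.
class Old:
    __metaclass__ = PapayaMeta
```

The `Old` class shows why the conversion matters: code that still uses `__metaclass__` runs without error in Python 3 but silently gets the default metaclass.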
|
||||
|
||||
@@ -1335,9 +1335,9 @@ except:
|
||||
<th>After
|
||||
|
||||
<tr><th>
|
||||
<td><pre><code class=pp>while 1:
|
||||
<td><pre class=pp><code>while 1:
|
||||
do_stuff()</code></pre>
|
||||
<td><pre><code class=pp>while True:
|
||||
<td><pre class=pp><code>while True:
|
||||
do_stuff()</code></pre>
|
||||
<tr><th>
|
||||
<td><code class=pp>type(x) == T</code>
|
||||
@@ -1346,10 +1346,10 @@ except:
|
||||
<td><code class=pp>type(x) is T</code>
|
||||
<td><code class=pp>isinstance(x, T)</code>
|
||||
<tr><th>
|
||||
<td><pre><code class=pp>a_list = list(a_sequence)
|
||||
<td><pre class=pp><code>a_list = list(a_sequence)
|
||||
a_list.sort()
|
||||
do_stuff(a_list)</code></pre>
|
||||
<td><pre><code class=pp>a_list = sorted(a_sequence)
|
||||
<td><pre class=pp><code>a_list = sorted(a_sequence)
|
||||
do_stuff(a_list)</code></pre>
|
||||
</table>
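A brief sketch of why two of these idiom rewrites differ in behaviour as well as style; the classes and values are hypothetical.

```python
class Base:
    pass

class Derived(Base):
    pass

x = Derived()

exact_match = type(x) == Base      # compares the exact type only
is_instance = isinstance(x, Base)  # also accepts subclasses

a_sequence = (3, 1, 2)
a_list = sorted(a_sequence)        # builds a new sorted list in one step
```

Because `isinstance()` accepts subclasses, replacing `type(x) == T` with it can change results for subclass instances, which is worth keeping in mind when applying this rewrite automatically.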
|
||||
|
||||
|
||||
+8
-8
@@ -31,7 +31,7 @@ body{counter-reset:h1 10}
|
||||
|
||||
<p>After reproducing the bug, and before fixing it, you should write a test case that fails, thus illustrating the bug.
|
||||
|
||||
<pre><code class=pp>class FromRomanBadInput(unittest.TestCase):
|
||||
<pre class=pp><code>class FromRomanBadInput(unittest.TestCase):
|
||||
.
|
||||
.
|
||||
.
|
||||
@@ -72,7 +72,7 @@ FAILED (failures=1)</samp></pre>
|
||||
|
||||
<p><em>Now</em> you can fix the bug.
|
||||
|
||||
<pre><code class=pp>def from_roman(s):
|
||||
<pre class=pp><code>def from_roman(s):
|
||||
'''convert Roman numeral to integer'''
|
||||
<a> if not s: <span class=u>①</span></a>
|
||||
raise InvalidRomanNumeralError('Input can not be blank')
|
||||
@@ -124,7 +124,7 @@ Ran 11 tests in 0.156s
|
||||
<p>Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Normally, no character in a Roman numeral can be repeated more than three times in a row. But the Romans were willing to make an exception to that rule by having 4 <code>M</code> characters in a row to represent <code>4000</code>. If you make this change, you’ll be able to expand the range of convertible numbers from <code>1..3999</code> to <code>1..4999</code>. But first, you need to make some changes to your test cases.
|
||||
|
||||
<p class=d>[<a href=examples/roman8.py>download <code>roman8.py</code></a>]
|
||||
<pre><code class=pp>class KnownValues(unittest.TestCase):
|
||||
<pre class=pp><code>class KnownValues(unittest.TestCase):
|
||||
known_values = ( (1, 'I'),
|
||||
.
|
||||
.
|
||||
@@ -228,7 +228,7 @@ FAILED (errors=3)</samp></pre>
|
||||
<p>Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being tested is never “ahead” of the test cases. While it’s behind, you still have some work to do, and as soon as it catches up to the test cases, you stop coding.)
|
||||
|
||||
<p class=d>[<a href=examples/roman9.py>download <code>roman9.py</code></a>]
|
||||
<pre><code class=pp>roman_numeral_pattern = re.compile('''
|
||||
<pre class=pp><code>roman_numeral_pattern = re.compile('''
|
||||
^ # beginning of string
|
||||
<a> M{0,4} # thousands - 0 to 4 M's <span class=u>①</span></a>
|
||||
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
|
||||
@@ -305,7 +305,7 @@ Ran 12 tests in 0.203s
|
||||
<p>And best of all, you already have a complete set of unit tests. You can change over half the code in the module, but the unit tests will stay the same. That means you can prove — to yourself and to others — that the new code works just as well as the original.
|
||||
|
||||
<p class=d>[<a href=examples/roman10.py>download <code>roman10.py</code></a>]
|
||||
<pre><code class=pp>class OutOfRangeError(ValueError): pass
|
||||
<pre class=pp><code>class OutOfRangeError(ValueError): pass
|
||||
class NotIntegerError(ValueError): pass
|
||||
class InvalidRomanNumeralError(ValueError): pass
|
||||
|
||||
@@ -365,13 +365,13 @@ build_lookup_tables()</code></pre>
|
||||
|
||||
<p>Let’s break that down into digestible pieces. Arguably, the most important line is the last one:
|
||||
|
||||
<pre class=nd><code class=pp>build_lookup_tables()</code></pre>
|
||||
<pre class='nd pp'><code>build_lookup_tables()</code></pre>
|
||||
|
||||
<p>Note that this is a function call, but there’s no <code>if</code> statement around it. This is not an <code>if __name__ == '__main__'</code> block; it gets called <em>when the module is imported</em>. (It is important to understand that modules are only imported once, then cached. Importing an already-imported module does nothing. So this code will only get called the first time you import this module.)
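<p>To see the import cache at work, here is a minimal sketch. The module name <code>demo_tables</code> and its contents are hypothetical stand-ins written to a temporary directory; they are not the chapter’s <code>roman10.py</code>.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module source: calls a function at import time,
# with no "if __name__ == '__main__'" guard around the call.
MODULE_SOURCE = '''
calls = []

def build_lookup_tables():
    calls.append('built')

build_lookup_tables()
'''

def import_twice():
    '''Import the demo module twice and return both module objects.'''
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, 'demo_tables.py').write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            first = importlib.import_module('demo_tables')
            second = importlib.import_module('demo_tables')  # served from sys.modules
        finally:
            sys.path.remove(tmp)
            sys.modules.pop('demo_tables', None)
    return first, second

first, second = import_twice()
print(first is second)   # the same cached module object
print(first.calls)       # the top-level call ran only once
```

<p>The second import returns the cached module object, so the top-level call is not repeated.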
|
||||
|
||||
<p>So what does the <code>build_lookup_tables()</code> function do? I’m glad you asked.
|
||||
|
||||
<pre><code class=pp>to_roman_table = [ None ]
|
||||
<pre class=pp><code>to_roman_table = [ None ]
|
||||
from_roman_table = {}
|
||||
.
|
||||
.
|
||||
@@ -400,7 +400,7 @@ def build_lookup_tables():
|
||||
|
||||
<p>Once the lookup tables are built, the rest of the code is both easy and fast.
|
||||
|
||||
<pre><code class=pp>def to_roman(n):
|
||||
<pre class=pp><code>def to_roman(n):
|
||||
'''convert integer to Roman numeral'''
|
||||
if not (0 < n < 5000):
|
||||
raise OutOfRangeError('number out of range (must be 1..4999)')
|
||||
|
||||
+3
-3
@@ -189,7 +189,7 @@ NameError: name 'entry' is not defined</samp>
|
||||
|
||||
<p>FIXME
|
||||
|
||||
<pre><code class=pp># customserializer.py
|
||||
<pre class=pp><code># customserializer.py
|
||||
def to_json(python_object):
|
||||
if isinstance(python_object, bytes):
|
||||
return {'__class__': 'bytes',
|
||||
@@ -213,7 +213,7 @@ def to_json(python_object):
|
||||
<li>FIXME
|
||||
</ol>
|
||||
|
||||
<pre><code class=pp># customserializer.py
|
||||
<pre class=pp><code># customserializer.py
|
||||
def to_json(python_object):
|
||||
if isinstance(python_object, time.struct_time):
|
||||
return {'__class__': 'time.asctime',
|
||||
@@ -270,7 +270,7 @@ NameError: name 'entry' is not defined</samp>
|
||||
|
||||
<p>FIXME
|
||||
|
||||
<pre><code class=pp># customserializer.py
|
||||
<pre class=pp><code># customserializer.py
|
||||
def from_json(json_object):
|
||||
if '__class__' in json_object:
|
||||
if json_object['__class__'] == 'time.asctime':
|
||||
|
||||
@@ -220,7 +220,7 @@ AttributeError</samp></pre>
|
||||
|
||||
<p>The <a href=http://docs.python.org/3.1/library/zipfile.html><code>zipfile</code> module</a> uses this to define a class that can <dfn>decrypt</dfn> an <dfn>encrypted</dfn> <dfn>zip</dfn> file with a given password. The zip <dfn>decryption</dfn> algorithm requires you to store state during decryption. Defining the decryptor as a class allows you to maintain this state within a single instance of the decryptor class. The state is initialized in the <code>__init__()</code> method and updated as the file is <dfn>decrypted</dfn>. But since the class is also “callable” like a function, you can pass the instance as the first argument of the <code>map()</code> function, like so:
|
||||
|
||||
<pre><code class=pp># excerpt from zipfile.py
|
||||
<pre class=pp><code># excerpt from zipfile.py
|
||||
class _ZipDecrypter:
|
||||
.
|
||||
.
|
||||
@@ -272,7 +272,7 @@ bytes = zef_file.read(12)
|
||||
|
||||
<p id=acts-like-list-example>The <a href=http://docs.python.org/3.1/library/cgi.html><code>cgi</code> module</a> uses these methods in its <code>FieldStorage</code> class, which represents all of the form fields or query parameters submitted to a dynamic web page.
|
||||
|
||||
<pre><code class=pp># A script which responds to http://example.com/search?q=cgi
|
||||
<pre class=pp><code># A script which responds to http://example.com/search?q=cgi
|
||||
import cgi
|
||||
fs = cgi.FieldStorage()
|
||||
<a>if 'q' in fs: <span class=u>①</span></a>
|
||||
@@ -326,7 +326,7 @@ class FieldStorage:
|
||||
|
||||
<p>The <a href=#acts-like-list-example><code>FieldStorage</code> class</a> from the <a href=http://docs.python.org/3.1/library/cgi.html><code>cgi</code> module</a> also defines these special methods, which means you can do things like this:
|
||||
|
||||
<pre><code class=pp># A script which responds to http://example.com/search?q=cgi
|
||||
<pre class=pp><code># A script which responds to http://example.com/search?q=cgi
|
||||
import cgi
|
||||
fs = cgi.FieldStorage()
|
||||
if 'q' in fs:
|
||||
@@ -737,7 +737,7 @@ class FieldStorage:
|
||||
|
||||
<p>This is how the <a href=files.html#with><code>with <var>file</var></code> idiom</a> works.
|
||||
|
||||
<pre><code class=pp># excerpt from io.py:
|
||||
<pre class=pp><code># excerpt from io.py:
|
||||
def _checkClosed(self, msg=None):
|
||||
'''Internal: raise an ValueError if file is closed
|
||||
'''
|
||||
|
||||
+4
-4
@@ -106,7 +106,7 @@ My alphabet starts where your alphabet ends! <span class=u>❞</span><br>&m
|
||||
<p>Let’s take another look at <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
|
||||
|
||||
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
|
||||
<pre><code class=pp><a>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'], <span class=u>①</span></a>
|
||||
<pre class=pp><code><a>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'], <span class=u>①</span></a>
|
||||
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
|
||||
|
||||
def approximate_size(size, a_kilobyte_is_1024_bytes=True):
|
||||
@@ -201,7 +201,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
|
||||
|
||||
<p>But wait! There’s more! Let’s take another look at that strange line of code from <code>humansize.py</code>:
|
||||
|
||||
<pre class=nd><code class=pp>if size < multiple:
|
||||
<pre class='nd pp'><code>if size < multiple:
|
||||
return '{0:.1f} {1}'.format(size, suffix)</code></pre>
|
||||
|
||||
<p><code>{1}</code> is replaced with the second argument passed to the <code>format()</code> method, which is <var>suffix</var>. But what is <code>{0:.1f}</code>? It’s two things: <code>{0}</code>, which you recognize, and <code>:.1f</code>, which you don’t. The second half (including and after the colon) defines the <i>format specifier</i>, which further refines how the replaced variable should be formatted.
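<p>A quick sketch of the format specifier on its own. The sample numbers are made up for illustration.

```python
# '{0:.1f}' formats the first argument as a fixed-point number
# with one digit after the decimal point.
template = '{0:.1f} {1}'

one_decimal = template.format(698.24, 'GB')
three_decimals = '{0:.3f}'.format(698.24)

print(one_decimal)     # 698.2 GB
print(three_decimals)  # 698.240
```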
|
||||
@@ -412,11 +412,11 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
|
||||
|
||||
<p>If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a <code>.py</code> file to be windows-1252:
|
||||
|
||||
<pre class=nd><code class=pp># -*- coding: windows-1252 -*-</code></pre>
|
||||
<pre class='nd pp'><code># -*- coding: windows-1252 -*-</code></pre>
|
||||
|
||||
<p>Technically, the character encoding override can also be on the second line, if the first line is a <abbr>UNIX</abbr>-like hash-bang command.
|
||||
|
||||
<pre class=nd><code class=pp>#!/usr/bin/python3
|
||||
<pre class='nd pp'><code>#!/usr/bin/python3
|
||||
# -*- coding: windows-1252 -*-</code></pre>
|
||||
|
||||
<p>For more information, consult <a href=http://www.python.org/dev/peps/pep-0263/><abbr>PEP</abbr> 263: Defining Python Source Code Encodings</a>.
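<p>If you are curious how such a declaration is detected, the standard library’s <code>tokenize.detect_encoding()</code> function reads the first two lines of a source file looking for it. The sketch below feeds it an in-memory byte string, which is an assumption made for illustration instead of a real <code>.py</code> file on disk.

```python
import codecs
import io
import tokenize

# Hypothetical source file: hash-bang on line 1, encoding declaration on line 2.
# The byte 0xE9 is an e-acute in windows-1252, but is not valid UTF-8 by itself.
source = b"#!/usr/bin/python3\n# -*- coding: windows-1252 -*-\nprint('caf\xe9')\n"

encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
text = source.decode(encoding)

print(encoding)                       # the declared encoding name
print(codecs.lookup(encoding).name)   # Python's canonical codec name for it
```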
|
||||
|
||||
+23
-23
@@ -57,7 +57,7 @@ body{counter-reset:h1 9}
|
||||
</ol>
|
||||
<p>It is not immediately obvious how this code does… well, <em>anything</em>. It defines a class which has no <code>__init__()</code> method. The class <em>does</em> have another method, but it is never called. The entire script has a <code>__main__</code> block, but it doesn’t reference the class or its method. But it does do something, I promise.
|
||||
<p class=d>[<a href=examples/romantest1.py>download <code>romantest1.py</code></a>]
|
||||
<pre><code class=pp>import roman1
|
||||
<pre class=pp><code>import roman1
|
||||
import unittest
|
||||
|
||||
<a>class KnownValues(unittest.TestCase): <span class=u>①</span></a>
|
||||
@@ -135,7 +135,7 @@ if __name__ == '__main__':
|
||||
</ol>
|
||||
<aside>Write a test that fails, then code until it passes.</aside>
|
||||
<p>Once you have a test case, you can start coding the <code>to_roman()</code> function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you’ve written any code, your tests aren’t testing your code at all! Unit testing is a dance: tests lead, code follows. Write a test that fails, then code until it passes.
|
||||
<pre><code class=pp># roman1.py
|
||||
<pre class=pp><code># roman1.py
|
||||
|
||||
def to_roman(n):
|
||||
'''convert integer to Roman numeral'''
|
||||
@@ -170,7 +170,7 @@ Traceback (most recent call last):
|
||||
</ol>
|
||||
<p><em>Now</em>, finally, you can write the <code>to_roman()</code> function.
|
||||
<p class=d>[<a href=examples/roman1.py>download <code>roman1.py</code></a>]
|
||||
<pre><code class=pp>roman_numeral_map = (('M', 1000),
|
||||
<pre class=pp><code>roman_numeral_map = (('M', 1000),
|
||||
('CM', 900),
|
||||
('D', 500),
|
||||
('CD', 400),
|
||||
@@ -197,7 +197,7 @@ def to_roman(n):
|
||||
<li>Here’s where the rich data structure of <var>roman_numeral_map</var> pays off, because you don’t need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through <var>roman_numeral_map</var> looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
|
||||
</ol>
|
||||
<p>If you’re still not clear how the <code>to_roman()</code> function works, add a <code>print()</code> call to the end of the <code>while</code> loop:
|
||||
<pre class=nd><code class=pp>
|
||||
<pre class='nd pp'><code>
|
||||
while n >= integer:
|
||||
result += numeral
|
||||
n -= integer
|
||||
@@ -248,7 +248,7 @@ OK</samp></pre>
|
||||
</blockquote>
|
||||
<p>What would that test look like?
|
||||
<p class=d>[<a href=examples/romantest2.py>download <code>romantest2.py</code></a>]
|
||||
<pre><code class=pp><a>class ToRomanBadInput(unittest.TestCase): <span class=u>①</span></a>
|
||||
<pre class=pp><code><a>class ToRomanBadInput(unittest.TestCase): <span class=u>①</span></a>
|
||||
<a> def test_too_large(self): <span class=u>②</span></a>
|
||||
'''to_roman should fail with large input'''
|
||||
<a> self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000) <span class=u>③</span></a></code></pre>
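<p>Here is a self-contained sketch of the same kind of test that runs without the <code>roman2</code> module. The <code>OutOfRangeError</code> class and the stubbed <code>to_roman()</code> defined here are stand-ins for illustration only. The second test method shows the context-manager form of <code>assertRaises()</code>, which checks the same thing.

```python
import unittest

class OutOfRangeError(ValueError):
    '''Stand-in exception for this sketch.'''

def to_roman(n):
    '''Stub that only performs the range check.'''
    if n > 3999:
        raise OutOfRangeError('number out of range (must be less than 4000)')
    return 'stub'

class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        '''to_roman should fail with large input'''
        # Pass the callable and its arguments; assertRaises makes the call.
        self.assertRaises(OutOfRangeError, to_roman, 4000)

    def test_too_large_with_block(self):
        '''same check, written as a context manager'''
        with self.assertRaises(OutOfRangeError):
            to_roman(4000)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```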
|
||||
@@ -284,7 +284,7 @@ FAILED (errors=1)</samp></pre>
|
||||
<li>Why didn’t the code execute properly? The traceback tells all. The module you’re testing doesn’t have an exception called <code>OutOfRangeError</code>. Remember, you passed this exception to the <code>assertRaises()</code> method, because it’s the exception you want the function to raise given an out-of-range input. But the exception doesn’t exist, so the call to the <code>assertRaises()</code> method failed. It never got a chance to test the <code>to_roman()</code> function; it didn’t get that far.
|
||||
</ol>
|
||||
<p>To solve this problem, you need to define the <code>OutOfRangeError</code> exception in <code>roman2.py</code>.
|
||||
<pre><code class=pp><a>class OutOfRangeError(ValueError): <span class=u>①</span></a>
|
||||
<pre class=pp><code><a>class OutOfRangeError(ValueError): <span class=u>①</span></a>
|
||||
<a> pass <span class=u>②</span></a></code></pre>
|
||||
<ol>
|
||||
<li>Exceptions are classes. An “out of range” error is a kind of value error — the argument value is out of its acceptable range. So this exception inherits from the built-in <code>ValueError</code> exception. This is not strictly necessary (it could just inherit from the base <code>Exception</code> class), but it feels right.
|
||||
@@ -316,7 +316,7 @@ FAILED (failures=1)</samp></pre>
|
||||
</ol>
|
||||
<p>Now you can write the code to make this test pass.
|
||||
<p class=d>[<a href=examples/roman2.py>download <code>roman2.py</code></a>]
|
||||
<pre><code class=pp>def to_roman(n):
|
||||
<pre class=pp><code>def to_roman(n):
|
||||
'''convert integer to Roman numeral'''
|
||||
if n > 3999:
|
||||
<a> raise OutOfRangeError('number out of range (must be less than 4000)') <span class=u>①</span></a>
|
||||
@@ -362,7 +362,7 @@ OK</samp></pre>
|
||||
<p>Well <em>that’s</em> not good. Let’s add tests for each of these conditions.
|
||||
|
||||
<p class=d>[<a href=examples/romantest3.py>download <code>romantest3.py</code></a>]
|
||||
<pre><code class=pp>class ToRomanBadInput(unittest.TestCase):
|
||||
<pre class=pp><code>class ToRomanBadInput(unittest.TestCase):
|
||||
def test_too_large(self):
|
||||
'''to_roman should fail with large input'''
|
||||
<a> self.assertRaises(roman3.OutOfRangeError, roman3.to_roman, 4000) <span class=u>①</span></a>
|
||||
@@ -417,7 +417,7 @@ FAILED (failures=2)</samp></pre>
|
||||
<p>Excellent. Both tests failed, as expected. Now let’s switch over to the code and see what we can do to make them pass.
|
||||
|
||||
<p class=d>[<a href=examples/roman3.py>download <code>roman3.py</code></a>]
|
||||
<pre><code class=pp>def to_roman(n):
|
||||
<pre class=pp><code>def to_roman(n):
|
||||
'''convert integer to Roman numeral'''
|
||||
<a> if not (0 < n < 4000): <span class=u>①</span></a>
|
||||
<a> raise OutOfRangeError('number out of range (must be 1..3999)') <span class=u>②</span></a>
|
||||
@@ -470,13 +470,13 @@ OK</samp></pre>
|
||||
|
||||
<p>Testing for non-integers is not difficult. First, define a <code>NotIntegerError</code> exception.
|
||||
|
||||
<pre class=nd><code class=pp># roman4.py
|
||||
<pre class='nd pp'><code># roman4.py
|
||||
class OutOfRangeError(ValueError): pass
|
||||
<mark>class NotIntegerError(ValueError): pass</mark></code></pre>
|
||||
|
||||
<p>Next, write a test case that checks for the <code>NotIntegerError</code> exception.
|
||||
|
||||
<pre class=nd><code class=pp>class ToRomanBadInput(unittest.TestCase):
|
||||
<pre class='nd pp'><code>class ToRomanBadInput(unittest.TestCase):
|
||||
.
|
||||
.
|
||||
.
|
||||
@@ -514,7 +514,7 @@ FAILED (failures=1)</samp></pre>
|
||||
|
||||
<p>Write the code that makes the test pass.
|
||||
|
||||
<pre><code class=pp>def to_roman(n):
|
||||
<pre class=pp><code>def to_roman(n):
|
||||
'''convert integer to Roman numeral'''
|
||||
if not (0 < n < 4000):
|
||||
raise OutOfRangeError('number out of range (must be 1..3999)')
|
||||
@@ -564,7 +564,7 @@ OK</samp></pre>
|
||||
|
||||
<p>But first, the tests. We’ll need a “known values” test to spot-check for accuracy. Our test suite already contains <a href=#romantest1>a mapping of known values</a>; let’s reuse that.
|
||||
|
||||
<pre class=nd><code class=pp> def test_from_roman_known_values(self):
|
||||
<pre class='nd pp'><code> def test_from_roman_known_values(self):
|
||||
'''from_roman should give known result with known input'''
|
||||
for integer, numeral in self.known_values:
|
||||
result = roman5.from_roman(numeral)
|
||||
@@ -572,11 +572,11 @@ OK</samp></pre>
|
||||
|
||||
<p>There’s a pleasing symmetry here. The <code>to_roman()</code> and <code>from_roman()</code> functions are inverses of each other. The first converts integers to specially-formatted strings, the second converts specially-formatted strings to integers. In theory, we should be able to “round-trip” a number by passing it to the <code>to_roman()</code> function to get a string, then passing that string to the <code>from_roman()</code> function to get an integer, and end up with the same number.
|
||||
|
||||
<pre class=nd><code class=pp>n = from_roman(to_roman(n)) for all values of n</code></pre>
|
||||
<pre class='nd pp'><code>n = from_roman(to_roman(n)) for all values of n</code></pre>
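<p>As a rough sketch of that property, the following block checks it directly. The two functions are minimal, unvalidated stand-ins built on a numeral map like the one used for <code>to_roman()</code>; they do no input checking at all.

```python
roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    '''convert integer to Roman numeral (no range checking in this sketch)'''
    result = ''
    for numeral, integer in roman_numeral_map:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    '''convert Roman numeral to integer (no validation in this sketch)'''
    result = 0
    index = 0
    for numeral, integer in roman_numeral_map:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

# Collect every value in 1..3999 that fails to survive the round trip.
mismatches = [n for n in range(1, 4000) if from_roman(to_roman(n)) != n]
print(mismatches)   # an empty list means the round trip holds everywhere
```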
|
||||
|
||||
<p>In this case, “all values” means any integer in the range <code>1..3999</code>, since that is the valid range of inputs to the <code>to_roman()</code> function. We can express this symmetry in a test case that runs through all the values <code>1..3999</code>, calls <code>to_roman()</code>, calls <code>from_roman()</code>, and checks that the output is the same as the original input.
|
||||
|
||||
<pre class=nd><code class=pp>class RoundtripCheck(unittest.TestCase):
|
||||
<pre class='nd pp'><code>class RoundtripCheck(unittest.TestCase):
|
||||
def test_roundtrip(self):
|
||||
'''from_roman(to_roman(n))==n for all n'''
|
||||
for integer in range(1, 4000):
|
||||
@@ -614,7 +614,7 @@ FAILED (errors=2)</samp></pre>
|
||||
|
||||
<p>A quick stub function will solve that problem.
|
||||
|
||||
<pre class=nd><code class=pp># roman5.py
|
||||
<pre class='nd pp'><code># roman5.py
|
||||
def from_roman(s):
|
||||
'''convert Roman numeral to integer'''</code></pre>
|
||||
|
||||
@@ -650,7 +650,7 @@ FAILED (failures=2)</samp></pre>
|
||||
|
||||
<p>Now it’s time to write the <code>from_roman()</code> function.
|
||||
|
||||
<pre><code class=pp>def from_roman(s):
|
||||
<pre class=pp><code>def from_roman(s):
|
||||
"""convert Roman numeral to integer"""
|
||||
result = 0
|
||||
index = 0
|
||||
@@ -665,7 +665,7 @@ FAILED (failures=2)</samp></pre>
|
||||
|
||||
<p>If you’re not clear how <code>from_roman()</code> works, add a <code>print()</code> call to the end of the <code>while</code> loop:
|
||||
|
||||
<pre><code class=pp>def from_roman(s):
|
||||
<pre class=pp><code>def from_roman(s):
|
||||
"""convert Roman numeral to integer"""
|
||||
result = 0
|
||||
index = 0
|
||||
@@ -717,7 +717,7 @@ OK</samp></pre>
|
||||
|
||||
<p>Thus, one useful test would be to ensure that the <code>from_roman()</code> function fails when you pass it a string with too many repeated numerals. How many is “too many” depends on the numeral.
|
||||
|
||||
<pre class=nd><code class=pp>class FromRomanBadInput(unittest.TestCase):
|
||||
<pre class='nd pp'><code>class FromRomanBadInput(unittest.TestCase):
|
||||
def test_too_many_repeated_numerals(self):
|
||||
'''from_roman should fail with too many repeated numerals'''
|
||||
for s in ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'):
|
||||
@@ -725,14 +725,14 @@ OK</samp></pre>
|
||||
|
||||
<p>Another useful test would be to check that certain patterns aren’t repeated. For example, <code>IX</code> is <code>9</code>, but <code>IXIX</code> is never valid.
|
||||
|
||||
<pre class=nd><code class=pp> def test_repeated_pairs(self):
|
||||
<pre class='nd pp'><code> def test_repeated_pairs(self):
|
||||
'''from_roman should fail with repeated pairs of numerals'''
|
||||
for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'):
|
||||
self.assertRaises(roman6.InvalidRomanNumeralError, roman6.from_roman, s)</code></pre>
|
||||
|
||||
<p>A third test could check that numerals appear in the correct order, from highest to lowest value. For example, <code>CL</code> is <code>150</code>, but <code>LC</code> is never valid, because the numeral for <code>50</code> can never come before the numeral for <code>100</code>. This test includes a randomly chosen set of invalid antecedents: <code>I</code> before <code>M</code>, <code>V</code> before <code>X</code>, and so on.
|
||||
|
||||
<pre class=nd><code class=pp> def test_malformed_antecedents(self):
|
||||
<pre class='nd pp'><code> def test_malformed_antecedents(self):
|
||||
'''from_roman should fail with malformed antecedents'''
|
||||
for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV',
|
||||
'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'):
|
||||
@@ -740,7 +740,7 @@ OK</samp></pre>
|
||||
|
||||
<p>Each of these tests relies on the <code>from_roman()</code> function raising a new exception, <code>InvalidRomanNumeralError</code>, which we haven’t defined yet.
|
||||
|
||||
<pre class=nd><code class=pp># roman6.py
|
||||
<pre class='nd pp'><code># roman6.py
|
||||
class InvalidRomanNumeralError(ValueError): pass</code></pre>
|
||||
|
||||
<p>All three of these tests should fail, since the <code>from_roman()</code> function doesn’t currently have any validity checking. (If they don’t fail now, then what the heck are they testing?)
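<p>To make the failure concrete, here is a sketch of what an unvalidated conversion does with bad input. The <code>from_roman()</code> below is a minimal stand-in with no validity checking, assumed for illustration. It quietly returns a number for each malformed string instead of raising an exception.

```python
roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def from_roman(s):
    '''convert Roman numeral to integer, with no validity checking'''
    result = 0
    index = 0
    for numeral, integer in roman_numeral_map:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

too_many_repeats = from_roman('IIII')   # 4, although 'IIII' is invalid
repeated_pair = from_roman('IXIX')      # 18, although 'IXIX' is invalid
bad_order = from_roman('LC')            # 50: the trailing 'C' is silently ignored
print(too_many_repeats, repeated_pair, bad_order)
```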
|
||||
@@ -782,7 +782,7 @@ FAILED (failures=3)</samp></pre>
|
||||
|
||||
<p>Good deal. Now, all we need to do is add the <a href=regular-expressions.html#romannumerals>regular expression to test for valid Roman numerals</a> into the <code>from_roman()</code> function.
|
||||
|
||||
<pre class=nd><code class=pp>roman_numeral_pattern = re.compile('''
|
||||
<pre class='nd pp'><code>roman_numeral_pattern = re.compile('''
|
||||
^ # beginning of string
|
||||
M{0,3} # thousands - 0 to 3 Ms
|
||||
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 Cs),
|
||||
|
||||
@@ -26,7 +26,7 @@ mark{display:inline}
|
||||
<p>Here, then, is the <abbr>XML</abbr> data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
|
||||
|
||||
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
||||
<pre><code class=pp><?xml version='1.0' encoding='utf-8'?>
|
||||
<pre class=pp><code><?xml version='1.0' encoding='utf-8'?>
|
||||
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<title>dive into mark</title>
|
||||
<subtitle>currently between addictions</subtitle>
|
||||
@@ -99,7 +99,7 @@ mark{display:inline}
|
||||
|
||||
<p><abbr>XML</abbr> is a generalized way of describing hierarchical structured data. An <abbr>XML</abbr> <i>document</i> contains one or more <i>elements</i>, which are delimited by <i>start and end tags</i>. This is a complete (albeit boring) <abbr>XML</abbr> document:
|
||||
|
||||
<pre class=nd><code class=pp><a><foo> <span class=u>①</span></a>
|
||||
<pre class='nd pp'><code><a><foo> <span class=u>①</span></a>
|
||||
<a></foo> <span class=u>②</span></a></code></pre>
|
||||
<ol>
|
||||
<li>This is the <i>start tag</i> of the <code>foo</code> element.
|
||||
@@ -108,19 +108,19 @@ mark{display:inline}
|
||||
|
||||
<p>Elements can be <i>nested</i> to any depth. An element <code>bar</code> inside an element <code>foo</code> is said to be a <i>subelement</i> or <i>child</i> of <code>foo</code>.
|
||||
|
||||
<pre class=nd><code class=pp><foo>
|
||||
<pre class='nd pp'><code><foo>
|
||||
<mark><bar></bar></mark>
|
||||
</foo>
|
||||
</code></pre>
|
||||
|
||||
<p>The first element in every <abbr>XML</abbr> document is called the <i>root element</i>. An <abbr>XML</abbr> document can only have one root element. The following is <strong>not an <abbr>XML</abbr> document</strong>, because it has two root elements:
|
||||
|
||||
<pre class=nd><code class=pp><foo></foo>
|
||||
<pre class='nd pp'><code><foo></foo>
|
||||
<bar></bar></code></pre>
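<p>A parser will reject such input. The sketch below uses the standard library’s ElementTree parser; the helper name <code>is_well_formed()</code> is made up for this example.

```python
import xml.etree.ElementTree as etree

def is_well_formed(document):
    '''Return True if the parser accepts the string as one XML document.'''
    try:
        etree.fromstring(document)
    except etree.ParseError:
        return False
    return True

one_root = is_well_formed('<foo><bar></bar></foo>')
two_roots = is_well_formed('<foo></foo>\n<bar></bar>')
print(one_root, two_roots)   # True False
```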
|
||||
|
||||
<p>Elements can have <i>attributes</i>, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. <i>Attribute names</i> can not be repeated within an element. <i>Attribute values</i> must be quoted. You may use either single or double quotes.
|
||||
|
||||
<pre class=nd><code class=pp><a><foo <mark>lang='en'</mark>> <span class=u>①</span></a>
|
||||
<pre class='nd pp'><code><a><foo <mark>lang='en'</mark>> <span class=u>①</span></a>
|
||||
<a> <bar id='papayawhip' <mark>lang="fr"</mark>></bar> <span class=u>②</span></a>
|
||||
</foo>
|
||||
</code></pre>
|
||||
@@ -133,22 +133,22 @@ mark{display:inline}
|
||||
|
||||
<p>Elements can have <i>text content</i>.
|
||||
|
||||
<pre class=nd><code class=pp><foo lang='en'>
|
||||
<pre class='nd pp'><code><foo lang='en'>
|
||||
<bar lang='fr'><mark>PapayaWhip</mark></bar>
|
||||
</foo>
|
||||
</code></pre>
|
||||
|
||||
<p>Elements that contain no text and no children are <i>empty</i>.
|
||||
|
||||
<pre class=nd><code class=pp><foo></foo></code></pre>
|
||||
<pre class='nd pp'><code><foo></foo></code></pre>
|
||||
|
||||
<p>There is a shorthand for writing empty elements. By putting a <code>/</code> character in the start tag, you can skip the end tag altogether. The <abbr>XML</abbr> document in the previous example could be written like this instead:
|
||||
|
||||
<pre class=nd><code class=pp><foo<mark>/</mark>></code></pre>
|
||||
<pre class='nd pp'><code><foo<mark>/</mark>></code></pre>
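<p>As a small check, and assuming the standard library’s ElementTree parser, both spellings of an empty element parse to the same thing: same tag, no text, no children.

```python
import xml.etree.ElementTree as etree

long_form = etree.fromstring('<foo></foo>')
short_form = etree.fromstring('<foo/>')

same_tag = long_form.tag == short_form.tag
both_without_text = long_form.text is None and short_form.text is None
both_childless = len(long_form) == 0 and len(short_form) == 0

# Serializing either element gives identical bytes.
same_serialization = etree.tostring(long_form) == etree.tostring(short_form)
print(same_tag, both_without_text, both_childless, same_serialization)
```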
|
||||
|
||||
<p>Just as Python functions can be declared in different <i>modules</i>, <abbr>XML</abbr> elements can be declared in different <i>namespaces</i>. Namespaces usually look like URLs. You use an <code>xmlns</code> declaration to define a <i>default namespace</i>. A namespace declaration looks similar to an attribute, but it has a different purpose.
|
||||
|
||||
<pre class=nd><code class=pp><a><feed <mark>xmlns='http://www.w3.org/2005/Atom'</mark>> <span class=u>①</span></a>
|
||||
<pre class='nd pp'><code><a><feed <mark>xmlns='http://www.w3.org/2005/Atom'</mark>> <span class=u>①</span></a>
|
||||
<a> <title>dive into mark</title> <span class=u>②</span></a>
|
||||
</feed>
|
||||
</code></pre>
|
||||
@@ -159,7 +159,7 @@ mark{display:inline}
|
||||
|
||||
<p>You can also use an <code>xmlns:<var>prefix</var></code> declaration to define a namespace and associate it with a <i>prefix</i>. Then each element in that namespace must be explicitly declared with the prefix.
|
||||
|
||||
<pre class=nd><code class=pp><a><atom:feed <mark>xmlns:atom='http://www.w3.org/2005/Atom'</mark>> <span class=u>①</span></a>
|
||||
<pre class='nd pp'><code><a><atom:feed <mark>xmlns:atom='http://www.w3.org/2005/Atom'</mark>> <span class=u>①</span></a>
|
||||
<a> <atom:title>dive into mark</atom:title> <span class=u>②</span></a>
|
||||
</atom:feed></code></pre>
|
||||
<ol>
|
||||
@@ -171,7 +171,7 @@ mark{display:inline}
|
||||
|
||||
<p>Finally, <abbr>XML</abbr> documents can contain <a href=strings.html#one-ring-to-rule-them-all>character encoding information</a> on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the document can be parsed, <a href=http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info>Section F of the <abbr>XML</abbr> specification</a> details how to resolve this Catch-22.)
|
||||
|
||||
<pre class=nd><code class=pp><?xml version='1.0' <mark>encoding='utf-8'</mark>?></code></pre>
|
||||
<pre class='nd pp'><code><?xml version='1.0' <mark>encoding='utf-8'</mark>?></code></pre>
|
||||
|
||||
<p>And now you know just enough <abbr>XML</abbr> to be dangerous!
|
||||
|
||||
@@ -185,7 +185,7 @@ mark{display:inline}
|
||||
|
||||
<p>At the top level is the <i>root element</i>, which every Atom feed shares: the <code>feed</code> element in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
||||
|
||||
<pre><code class=pp><a><feed xmlns='http://www.w3.org/2005/Atom' <span class=u>①</span></a>
|
||||
<pre class=pp><code><a><feed xmlns='http://www.w3.org/2005/Atom' <span class=u>①</span></a>
|
||||
<a> xml:lang='en'> <span class=u>②</span></a></code></pre>
|
||||
<ol>
|
||||
<li><code>http://www.w3.org/2005/Atom</code> is the Atom namespace.
|
||||
@@ -194,7 +194,7 @@ mark{display:inline}
|
||||
|
||||
<p>An Atom feed contains several pieces of information about the feed itself. These are declared as children of the root-level <code>feed</code> element.
|
||||
|
||||
<pre><code class=pp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<pre class=pp><code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<a> <title>dive into mark</title> <span class=u>①</span></a>
|
||||
<a> <subtitle>currently between addictions</subtitle> <span class=u>②</span></a>
|
||||
<a> <id>tag:diveintomark.org,2001-07-29:/</id> <span class=u>③</span></a>
|
||||
@@ -216,7 +216,7 @@ mark{display:inline}
|
||||
|
||||
<p>After the feed-level metadata is the list of the most recent articles. An article looks like this:
|
||||
|
||||
<pre><code class=pp><entry>
|
||||
<pre class=pp><code><entry>
|
||||
<a> <author> <span class=u>①</span></a>
|
||||
<name>Mark</name>
|
||||
<uri>http://diveintomark.org/</uri>
|
||||
@@ -467,7 +467,7 @@ StopIteration</samp></pre>
|
||||
|
||||
<p>For large <abbr>XML</abbr> documents, <code>lxml</code> is significantly faster than the built-in ElementTree library. If you’re only using the ElementTree <abbr>API</abbr> and want to use the fastest available implementation, you can try to import <code>lxml</code> and fall back to the built-in ElementTree.
|
||||
|
||||
<pre class=nd><code class=pp>try:
|
||||
<pre class='nd pp'><code>try:
|
||||
from lxml import etree
|
||||
except ImportError:
|
||||
import xml.etree.ElementTree as etree</code></pre>
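<p>Code written against the shared <abbr>API</abbr> then runs the same way under either import. A minimal sketch, using a two-element feed made up for this example:

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Both implementations spell qualified names as '{namespace}localname'.
ATOM = '{http://www.w3.org/2005/Atom}'

feed = etree.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>"
    "<title>dive into mark</title>"
    "</feed>")

title = feed.find(ATOM + 'title').text
print(feed.tag)
print(title)
```

<p>Only calls that exist in both libraries (<code>fromstring()</code>, <code>find()</code>, <code>.tag</code>, <code>.text</code>) are used here; anything specific to <code>lxml</code> would break the fallback.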
|
||||
@@ -537,11 +537,11 @@ except ImportError:
|
||||
|
||||
<p>An <abbr>XML</abbr> parser won’t “see” any difference between an <abbr>XML</abbr> document with a default namespace and an <abbr>XML</abbr> document with a prefixed namespace. The resulting <abbr>DOM</abbr> of this serialization:
|
||||
|
||||
<pre class=nd><code class=pp><ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
<pre class='nd pp'><code><ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
|
||||
<p>is identical to the <abbr>DOM</abbr> of this serialization:
|
||||
|
||||
<pre class=nd><code class=pp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
<pre class='nd pp'><code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
|
||||
<p>The only practical difference is that the second serialization is several characters shorter. If we were to recast our entire sample feed with a <code>ns0:</code> prefix in every start and end tag, it would add 4 characters per start tag × 79 tags + 4 characters for the namespace declaration itself, for a total of 320 characters. Assuming <a href=strings.html#byte-arrays>UTF-8 encoding</a>, that’s 320 extra bytes. (After gzipping, the difference drops to 21 bytes, but still, 21 bytes is 21 bytes.) Maybe that doesn’t matter to you, but for something like an Atom feed, which may be downloaded several thousand times whenever it changes, saving a few bytes per request can quickly add up.
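<p>You can confirm the equivalence of the two serializations with a short sketch, here using the standard library’s ElementTree parser as an assumption. After parsing, the prefix is gone and only the namespace-qualified names remain.

```python
import xml.etree.ElementTree as etree

prefixed = etree.fromstring(
    "<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>")
default = etree.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>")

same_tag = prefixed.tag == default.tag
same_attributes = prefixed.attrib == default.attrib

print(prefixed.tag)                # {http://www.w3.org/2005/Atom}feed
print(same_tag, same_attributes)   # True True
```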
|
||||
|
||||
@@ -602,7 +602,7 @@ except ImportError:
|
||||
|
||||
<p>Here is a fragment of a broken <abbr>XML</abbr> document. I’ve highlighted the well-formedness error.
<pre class=nd><code class=pp><?xml version='1.0' encoding='utf-8'?>
<pre class='nd pp'><code><?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into <mark>&hellip;</mark></title>
...
@@ -26,7 +26,7 @@ mark{display:inline}
<h2 id=divingin>Diving In</h2>
<p class=f>Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let’s skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don’t worry about that, because you’re going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
<pre><code class=pp>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
<pre class=pp><code>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
def approximate_size(size, a_kilobyte_is_1024_bytes=True):
@@ -75,7 +75,7 @@ if __name__ == '__main__':
<h2 id=declaringfunctions>Declaring Functions</h2>
<p>Python has functions like most other languages, but it does not have separate header files like <abbr>C++</abbr> or <code>interface</code>/<code>implementation</code> sections like Pascal. When you need a function, just declare it, like this:
<pre class=nd><code class=pp>def approximate_size(size, a_kilobyte_is_1024_bytes=True):</code></pre>
<pre class='nd pp'><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):</code></pre>
<aside>When you need a function, just declare it.</aside>
<p>The keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments are separated with commas.
<p>Also note that the function doesn’t define a return datatype. Python functions do not specify the datatype of their return value; they don’t even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.)
@@ -93,13 +93,13 @@ if __name__ == '__main__':
<p>Let’s take another look at that <code>approximate_size()</code> function declaration:
<pre class=nd><code class=pp>def approximate_size(size, a_kilobyte_is_1024_bytes=True):</code></pre>
<pre class='nd pp'><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):</code></pre>
<p>The second argument, <var>a_kilobyte_is_1024_bytes</var>, specifies a default value of <code>True</code>. This means the argument is <i>optional</i>; you can call the function without it, and Python will act as if you had called it with <code>True</code> as a second parameter.
<p>Now look at the bottom of the script:
<pre><code class=pp>if __name__ == '__main__':
<pre class=pp><code>if __name__ == '__main__':
<a> print(approximate_size(1000000000000, False)) <span class=u>①</span></a>
<a> print(approximate_size(1000000000000)) <span class=u>②</span></a></code></pre>
<ol>
@@ -137,7 +137,7 @@ SyntaxError: non-keyword arg after keyword arg</samp></pre>
<p>I won’t bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (<i>i.e.</i> after you’ve forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You’ll thank me in six months.
<h3 id=docstrings>Documentation Strings</h3>
<p>You can document a Python function by giving it a documentation string (<code>docstring</code> for short). In this program, the <code>approximate_size()</code> function has a <code>docstring</code>:
<pre class=nd><code class=pp>def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<pre class='nd pp'><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):
'''Convert a file size to human-readable form.
Keyword arguments:
@@ -228,7 +228,7 @@ SyntaxError: non-keyword arg after keyword arg</samp></pre>
<h2 id=indentingcode>Indenting Code</h2>
<p>Python functions have no explicit <code>begin</code> or <code>end</code>, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (<code>:</code>) and the indentation of the code itself.
<pre><code class=pp><a>def approximate_size(size, a_kilobyte_is_1024_bytes=True): <span class=u>①</span></a>
<pre class=pp><code><a>def approximate_size(size, a_kilobyte_is_1024_bytes=True): <span class=u>①</span></a>
<a> if size < 0: <span class=u>②</span></a>
<a> raise ValueError('number must be non-negative') <span class=u>③</span></a>
<a> <span class=u>④</span></a>
@@ -272,7 +272,7 @@ SyntaxError: non-keyword arg after keyword arg</samp></pre>
<p>The <code>approximate_size()</code> function raises exceptions in two different cases: if the given <var>size</var> is larger than the function is designed to handle, or if it’s less than zero.
<pre class=nd><code class=pp>if size < 0:
<pre class='nd pp'><code>if size < 0:
raise ValueError('number must be non-negative')</code></pre>
<p>The syntax for raising an exception is simple enough. Use the <code>raise</code> statement, followed by the exception name, and an optional human-readable string for debugging purposes. The syntax is reminiscent of calling a function. (In reality, exceptions are implemented as classes, and this <code>raise</code> statement is actually creating an instance of the <code>ValueError</code> class and passing the string <code>'number must be non-negative'</code> to its initialization method. But <a href=iterators.html#defining-classes>we’re getting ahead of ourselves</a>!)
@@ -285,21 +285,21 @@ SyntaxError: non-keyword arg after keyword arg</samp></pre>
<p>One of Python’s built-in exceptions is <code>ImportError</code>, which is raised when you try to import a module and fail. This can happen for a variety of reasons, but the simplest case is when the module doesn’t exist in your <a href=#importsearchpath>import search path</a>. You can use this to include optional features in your program. For example, <a href=case-study-porting-chardet-to-python-3.html>the <code>chardet</code> library</a> provides character encoding auto-detection. Perhaps your program wants to use this library <em>if it exists</em>, but continue gracefully if the user hasn’t installed it. You can do this with a <code>try..except</code> block.
<pre class=nd><code class=pp><mark>try</mark>:
<pre class='nd pp'><code><mark>try</mark>:
import chardet
<mark>except</mark> ImportError:
chardet = None</code></pre>
<p>Later, you can check for the presence of the <code>chardet</code> module with a simple <code>if</code> statement:
<pre class=nd><code class=pp>if chardet:
<pre class='nd pp'><code>if chardet:
# do something
else:
# continue anyway</code></pre>
<p>Another common use of the <code>ImportError</code> exception is when two modules implement a common <abbr>API</abbr>, but one is more desirable than the other. (Maybe it’s faster, or it uses less memory.) You can try to import one module but fall back to a different module if the first import fails. For example, <a href=xml.html>the XML chapter</a> talks about two modules that implement a common <abbr>API</abbr>, called the <code>ElementTree</code> <abbr>API</abbr>. The first, <code>lxml</code>, is a third-party module that you need to download and install yourself. The second, <code>xml.etree.ElementTree</code>, is slower but is part of the Python 3 standard library.
<pre class=nd><code class=pp>try:
<pre class='nd pp'><code>try:
from lxml import etree
except ImportError:
import xml.etree.ElementTree as etree</code></pre>
@@ -310,7 +310,7 @@ except ImportError:
<p>Take another look at this line of code from the <code>approximate_size()</code> function:
<pre class=nd><code class=pp>multiple = 1024 if a_kilobyte_is_1024_bytes else 1000</code></pre>
<pre class='nd pp'><code>multiple = 1024 if a_kilobyte_is_1024_bytes else 1000</code></pre>
<p>You never declare the variable <var>multiple</var>, you just assign a value to it. That’s OK, because Python lets you do that. What Python will <em>not</em> let you do is reference a variable that has never been assigned a value. Trying to do so will raise a <code>NameError</code> exception.
<pre class='nd screen'>
@@ -353,7 +353,7 @@ NameError: name 'an_inteGer' is not defined</samp>
<h2 id=runningscripts>Running Scripts</h2>
<aside>Everything in Python is an object.</aside>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of <code>humansize.py</code>:
<pre class=nd><code class=pp>
<pre class='nd pp'><code>
if __name__ == '__main__':
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))</code></pre>