markup fiddling

2026-06-05 15:00:18 +00:00 · 2009-03-13 17:02:28 -04:00
parent a9b82eab12
commit bda59cfc55
11 changed files with 38 additions and 93 deletions
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>About the book - Dive Into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -23,6 +23,6 @@ h1:before{content:""}
 <li>Other Javascript and CSS resources are minimized by <a href=http://developer.yahoo.com/yui/compressor/>YUI Compressor</a>.
 <li>HTTP caching and other server-side options are optimized based on advice from <a href=http://developer.yahoo.com/yslow/>YSlow</a>.
 <li>The text uses Unicode characters in place of graphics wherever possible.
-<li>The entire book was lovingly hand-authored in HTML 5. View-source; I typed that.
+<li>The entire book was lovingly hand-authored in HTML 5 to avoid markup cruft.
 </ol>
 <p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Case study: porting chardet to Python 3 - Dive into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -704,7 +704,6 @@ for line in open(f, 'rb'):
 <p id=skiptestharnessfeedcode>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221;  Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.)  But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
 <p>What we need this regular expression to search is not an array of characters, but an array of bytes.
 <p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
-
 <p class=skip><a href=#skip-cant-use-a-string-pattern-solution>skip over this code listing</a>
 <pre><code>  class UniversalDetector:
      def __init__(self):
@@ -716,7 +715,6 @@ for line in open(f, 'rb'):
          self._mCharSetProbers = []
          self.reset()</code></pre>
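Reviewer note: the paragraph above says that patterns defined with byte arrays can search byte arrays, while patterns defined with strings cannot. A minimal sketch of that rule follows. The sample buffer and the variable names are made up for illustration, and the pattern is only a stand-in for whatever `self._highBitDetector` actually compiles.

```python
import re

# Hypothetical sample buffer: a UTF-8 byte order mark followed by ASCII text.
a_buf = b'\xef\xbb\xbfhello'

# A pattern defined with a bytes literal can search a bytes object.
high_bit_bytes = re.compile(b'[\x80-\xff]')
match = high_bit_bytes.search(a_buf)

# A pattern defined with a str compiles fine, but searching bytes with it
# raises TypeError in Python 3.
high_bit_str = re.compile('[\x80-\xff]')
try:
    high_bit_str.search(a_buf)
    str_pattern_error = None
except TypeError as exc:
    str_pattern_error = str(exc)
```

The first search finds the first high-bit byte. The second never runs the match at all, because the type check fails before any searching happens.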
 <p id=skip-case-use-a-string-pattern-solution>Searching the entire codebase for other uses of the <code>re</code> module turns up two more instances, in <code>charsetprober.py</code>. Again, the code is defining regular expressions as strings but executing them on <var>aBuf</var>, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
-
 <p class=skip><a href=#cantconvertbytesobject>skip over this code listing</a>
 <pre><code>  class CharSetProber:
      .
@@ -743,15 +741,11 @@ for line in open(f, 'rb'):
  File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
    elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
 TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
-
 <p id=skipcantconvertbytesobject>There's an unfortunate clash of coding style and Python interpreter here. The <code>TypeError</code> could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
-
 <p class=skip><a href=#skip-split-conditional>skip over this code listing</a>
 <pre><code>elif (self._mInputState == ePureAscii) and \
    self._escDetector.search(self._mLastChar + aBuf):</code></pre>
-
 <p id=skip-split-conditional>And re-run the test:</p>
-
 <p class=skip><a href=#skip-cant-convert-bytes-object-2>skip over this command output listing</a>
 <pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
 <samp>tests\ascii\howto.diveintomark.org.xml</samp>
@@ -761,11 +755,8 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
  File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
    self._escDetector.search(self._mLastChar + aBuf):
 TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
-
 <p id=skip-over-cant-convert-bytes-object-2>Aha! The problem was not in the first conditional (<code>self._mInputState == ePureAscii</code>) but in the second one. So what could cause a <code>TypeError</code> there? Perhaps you're thinking that the <code>search()</code> method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may <em>crash</em> if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the <code>search()</code> method. So the problem must be in that <code>+</code> operation, as it's trying to construct the value that it will eventually pass to the <code>search()</code> method.
-
 <p>We know from <a href=#cantuseastringpattern>previous debugging</a> that <var>aBuf</var> is a byte array. So what is <code>self._mLastChar</code>? It's an instance variable, defined in the <code>reset()</code> method, which is actually called from the <code>__init__()</code> method.
-
 <p class=skip><a href=#skip-mlastchar-declaration>skip over this code listing</a>
 <pre><code>class UniversalDetector:
    def __init__(self):
@@ -782,11 +773,8 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
        self._mGotData = False
        self._mInputState = ePureAscii
 <mark>        self._mLastChar = ''</mark></code></pre>
-
 <p id=skip-mlastchar-declaration>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can't concatenate a string to a byte array &mdash; not even a zero-length string.
-
 <p>So what is <var>self._mLastChar</var> anyway? The answer is in the <code>feed()</code> method, just a few lines down from where the traceback occurred.
-
 <p class=skip><a href=#skip-mlastchar-set>skip over this code listing</a>
 <pre><code>if self._mInputState == ePureAscii:
    if self._highBitDetector.search(aBuf):
@@ -796,9 +784,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
        self._mInputState = eEscAscii

 <mark>self._mLastChar = aBuf[-1]</mark></code></pre>
-
 <p>The calling function calls this <code>feed()</code> method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as <var>aBuf</var>), then stores the last byte in <var>self._mLastChar</var> in case it's needed during the next call. (In a multi-byte encoding, the <code>feed()</code> method might get called with half of a character, then called again with the other half.)  But because <var>aBuf</var> is now a byte array instead of a string, <var>self._mLastChar</var> needs to be a byte array as well. Thus:
-
 <p class=skip><a href=#skip-mlastchar-solution>skip over this code listing</a>
 <pre><code>  def reset(self):
      .
@@ -806,9 +792,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
      .
 <del>-     self._mLastChar = ''</del>
 <ins>+     self._mLastChar = b''</ins></code></pre>
-
 <p id=skip-mlastchar-solution>Searching the entire codebase for <code>"mLastChar"</code> turns up a similar problem in <code>mbcharsetprober.py</code>, but instead of tracking the last character, it tracks the last <em>two</em> characters. The <code>MultiByteCharSetProber</code> class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
-
 <p class=skip><a href=#skip-mbcharsetprober>skip over this code listing</a>
 <pre><code>
  class MultiByteCharSetProber(CharSetProber):
@@ -827,11 +811,8 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
              self._mDistributionAnalyzer.reset()
 <del>-         self._mLastChar = ['\x00', '\x00']</del>
 <ins>+         self._mLastChar = [0, 0]</ins></code></pre>
-
 <h3 id=unsupportedoperandtypeforplus>Unsupported operand type(s) for +: <code>'int'</code> and <code>'bytes'</code></h3>
-
 <p>I have good news, and I have bad news. The good news is we're making progress&hellip;
-
 <p class=skip><a href=#skip-unsupported-operand-types>skip over this command listing</a>
 <pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
 <samp>tests\ascii\howto.diveintomark.org.xml</samp>
@@ -841,13 +822,9 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
  File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
    self._escDetector.search(self._mLastChar + aBuf):
 TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
-
 <p id=skip-unsupported-operand-types>&hellip;The bad news is it doesn't always feel like progress.
-
 <p>But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an <code>int</code> with a byte array (<code>bytes</code>). In fact, you just spent a lot of time <a href=#cantconvertbytesobject>ensuring that <var>self._mLastChar</var> was a byte array</a>. How did it turn into an <code>int</code>?
-
 <p>The answer lies not in the previous lines of code, but in the following lines.
-
 <p class=skip><a href=#skip-mlastchar-highlight>skip over this code listing</a>
 <pre><code>if self._mInputState == ePureAscii:
    if self._highBitDetector.search(aBuf):
@@ -857,9 +834,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
        self._mInputState = eEscAscii

 <mark>self._mLastChar = aBuf[-1]</mark></code></pre>
-
 <p id=skip-mlastchar-highlight>This error doesn't occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
-
 <p class=skip><a href=#skip-mlastchar-interactive>skip over this interpreter listing</a>
 <pre class=screen>
 <a><samp class=prompt>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd>         <span>&#x2460;</span></a>
@@ -887,19 +862,14 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp>
 <li>Ah, here's the fix. Instead of taking the last element of the byte array, use <a href=native-datatypes.html#slicinglists>list slicing</a> to create a new byte array containing just the last element. That is, start with the last element and continue the slice until the end of the byte array. Now <var>mLastChar</var> is a byte array of length 1.
 <li>Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.
 </ol>
-
 <p>So, to ensure that the <code>feed()</code> method in <code>universaldetector.py</code> continues to work no matter how often it's called, you need to <a href=#cantconvertbytesobject>initialize <var>self._mLastChar</var> as a 0-length byte array</a>, then <em>make sure it stays a byte array</em>.
-
 <pre><code>              self._escDetector.search(self._mLastChar + aBuf):
          self._mInputState = eEscAscii

 <del>- self._mLastChar = aBuf[-1]</del>
 <ins>+ self._mLastChar = aBuf[-1:]</ins></code></pre>
-
 <h3 id=ordexpectedstring><code>ord()</code> expected string of length 1, but <code>int</code> found</h3>
-
 <p>Tired yet? You're almost there&hellip;
-
 <p class=skip><a href=#skip-ord-expected-string>skip over this command output listing</a>
 <pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
 <samp>tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
@@ -916,37 +886,28 @@ tests\Big5\0804.blogspot.com.xml</samp>
  File "C:\home\chardet\chardet\codingstatemachine.py", line 43, in next_state
    byteCls = self._mModel['classTable'][ord(c)]
 TypeError: ord() expected string of length 1, but int found</samp></pre>
-
 <p id=skip-ord-expected-string>OK, so <var>c</var> is an <code>int</code>, but the <code>ord()</code> function was expecting a 1-character string. Fair enough. Where is <var>c</var> defined?
-
 <p class=skip><a href=#skip-next-state>skip over this code listing</a>
 <pre><code># codingstatemachine.py
 def next_state(self, c):
    # for each byte we get its class
    # if it is first byte, we also get byte length
    byteCls = self._mModel['classTable'][ord(c)]</code></pre>
-
 <p id=skip-next-state>That's no help; it's just passed into the function. Let's pop the stack.
-
 <p class=skip><a href=#skip-utf8prober-feed>skip over this code listing</a>
 <pre><code># utf8prober.py
 def feed(self, aBuf):
    for c in aBuf:
        codingState = self._mCodingSM.next_state(c)</code></pre>
-
 <p id=skip-utf8prober-feed>And now we have the answer. Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That's what you get when you iterate over a string &mdash; all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there's no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
-
 <p>Thus:
-
 <p class=skip><a href=#skip-ordc-diff>skip over this code listing</a>
 <pre><code>  def next_state(self, c):
      # for each byte we get its class
      # if it is first byte, we also get byte length
 <del>-     byteCls = self._mModel['classTable'][ord(c)]</del>
 <ins>+     byteCls = self._mModel['classTable'][c]</ins></code></pre>
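Reviewer note: the fix above rests on the fact that iterating over a byte array in Python 3 yields integers, so `ord()` is both unnecessary and an error. A small sketch, using a made-up buffer and a made-up lookup table in place of `self._mModel['classTable']`:

```python
# Hypothetical three-byte buffer, not taken from the chardet test suite.
a_buf = b'\xe4\xbd\xa0'

# Iterating over bytes yields ints, one per byte.
items = [c for c in a_buf]

# Calling ord() on one of those ints raises TypeError.
try:
    ord(items[0])
    ord_error = None
except TypeError as exc:
    ord_error = str(exc)

# The ints can index a lookup table directly. This table is a stand-in
# (identity mapping) for the real class table.
class_table = list(range(256))
byte_classes = [class_table[c] for c in a_buf]
```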
-
 <p>Searching the entire codebase for instances of <code>"ord(c)"</code> uncovers similar problems in <code>sbcharsetprober.py</code>&hellip;
-
 <p class=skip><a href=#skip-sbcharsetprober-code>skip over this code listing</a>
 <pre><code># sbcharsetprober.py
 def feed(self, aBuf):
@@ -957,18 +918,14 @@ def feed(self, aBuf):
        return self.get_state()
    for c in aBuf:
 <mark>        order = self._mModel['charToOrderMap'][ord(c)]</mark></code></pre>
-
 <p id=skip-sbcharsetprober-code>&hellip;and <code>latin1prober.py</code>&hellip;
-
 <p class=skip><a href=#skip-latin1prober-code-2>skip over this code listing</a>
 <pre><code># latin1prober.py
 def feed(self, aBuf):
    aBuf = self.filter_with_english_letters(aBuf)
    for c in aBuf:
 <mark>        charClass = Latin1_CharToClass[ord(c)]</mark></code></pre>
-
 <p id=skip-sbcharsetprober-code-2><var>c</var> is iterating over <var>aBuf</var>, which means it is an integer, not a 1-character string.  The solution is the same: change <code>ord(c)</code> to just plain <code>c</code>.
-
 <p class=skip><a href=#unorderabletypes>skip over this code listing</a>
 <pre><code>  # sbcharsetprober.py
  def feed(self, aBuf):
@@ -988,11 +945,8 @@ def feed(self, aBuf):
 <del>-         charClass = Latin1_CharToClass[ord(c)]</del>
 <ins>+         charClass = Latin1_CharToClass[c]</ins>
 </code></pre>
-
 <h3 id=unorderabletypes>Unorderable types: <code>int()</code> >= <code>str()</code></h3>
-
 <p>Let's go again.
-
 <p class=skip><a href=#skip-unorderable-types-screen>skip over this command output listing</a>
 <pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
 <samp>tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
@@ -1011,11 +965,8 @@ tests\Big5\0804.blogspot.com.xml</samp>
  File "C:\home\chardet\chardet\jpcntx.py", line 176, in get_order
    if ((aStr[0] >= '\x81') and (aStr[0] &lt;= '\x9F')) or \
 TypeError: unorderable types: int() >= str()</samp></pre>
-
 <p id=skip-unorderable-types-screen>Did you notice? This time around, the code passed the first test case (<code>tests\ascii\howto.diveintomark.org.xml</code>). You're making real progress here.
-
 <p>So what's this all about? &#8220;Unorderable types&#8221;? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:
-
 <p class=skip><a href=#skip-unorderable-types-1>skip over this code listing</a>
 <pre><code>class SJISContextAnalysis(JapaneseContextAnalysis):
    def get_order(self, aStr):
@@ -1026,9 +977,7 @@ TypeError: unorderable types: int() >= str()</samp></pre>
            charLen = 2
        else:
            charLen = 1</code></pre>
-
 <p id=skip-unorderable-types-1>And where does <var>aStr</var> come from? Let's pop the stack:
-
 <p class=skip><a href=#skip-unorderable-types-2>skip over this code listing</a>
 <pre><code>def feed(self, aBuf, aLen):
    .
@@ -1037,13 +986,9 @@ TypeError: unorderable types: int() >= str()</samp></pre>
    i = self._mNeedToSkipCharNum
    while i &lt; aLen:
 <mark>        order, charLen = self.get_order(aBuf[i:i+2])</mark></code></pre>
-
 <p id=skip-unorderable-types-2>Oh look, it's our old friend, <var>aBuf</var>. As you might have guessed from every other issue we've encountered in this chapter, <var>aBuf</var> is a byte array. Here, the <code>feed()</code> method isn't just passing it on wholesale; it's slicing it. But as you saw <a href=#unsupportedoperandtypeforplus>earlier in this chapter</a>, slicing a byte array returns a byte array, so the <var>aStr</var> parameter that gets passed to the <code>get_order()</code> method is still a byte array.
-
 <p>And what is this code trying to do with <var>aStr</var>? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because <var>aStr</var> and <var>aBuf</var> were strings, and <var>aStr[0]</var> would be a string, and you can compare strings for inequality. But in Python 3, <var>aStr</var> and <var>aBuf</var> are byte arrays, <var>aStr[0]</var> is an integer, and you can't compare integers and strings for inequality without explicitly coercing one of them.
-
 <p>In this case, there's no need to make the code more complicated by adding an explicit coercion. <var>aStr[0]</var> yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers.
-
 <p class=skip><a href=#skip-unorderable-types-3>skip over this code listing</a>
 <pre><code>  class SJISContextAnalysis(JapaneseContextAnalysis):
      def get_order(self, aStr):
@@ -1097,9 +1042,7 @@ TypeError: unorderable types: int() >= str()</samp></pre>
 <ins>+               return aStr[1] - 0xA1, charLen</ins>

        return -1, charLen</code></pre>
-
 <p>Searching the entire codebase for occurrences of the <code>ord()</code> function uncovers the same problem in <code>chardistribution.py</code>:
-
 <p class=skip><a href=#skip-unorderable-types-4>skip over this command output listing</a>
 <pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
 <samp>tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
@@ -1118,9 +1061,7 @@ tests\Big5\0804.blogspot.com.xml</samp>
  File "C:\home\chardet\chardet\chardistribution.py", line 174, in get_order
    if (aStr[0] >= '\x81') and (aStr[0] &lt;= '\x9F'):
 TypeError: unorderable types: int() >= str()</samp></pre>
-
 <p id=skip-unorderable-types-4>The fix is the same:
-
 <p class=skip><a href=#reduceisnotdefined>skip over this code listing</a>
 <pre><code>  class EUCTWDistributionAnalysis(CharDistributionAnalysis):
      def __init__(self):
@@ -1226,11 +1167,8 @@ TypeError: unorderable types: int() >= str()</samp></pre>
 <ins>+             return 94 * (aStr[0] - 0xA1) + aStr[1] - 0xA1</ins>
          else:
              return -1</code></pre>
-
 <h3 id=reduceisnotdefined>Global name <code>'reduce'</code> is not defined</h3>
-
 <p>Once more into the breach&hellip;
-
 <p class=skip><a href=#skip-reduceisnotdefined-output>skip over this command output listing</a>
 <pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
 <samp>tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
@@ -1243,20 +1181,15 @@ tests\Big5\0804.blogspot.com.xml</samp>
  File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
    total = reduce(operator.add, self._mFreqCounter)
 NameError: global name 'reduce' is not defined</samp></pre>
-
 <p id=skip-reduceisnotdefined-output>According to the official <a href=http://docs.python.org/dev/3.0/whatsnew/3.0.html#builtins>What's New In Python 3.0</a> guide, the <code>reduce()</code> function has been moved out of the global namespace and into the <code>functools</code> module. Quoting the guide: "Use <code>functools.reduce()</code> if you really need it; however, 99 percent of the time an explicit <code>for</code> loop is more readable."
-
 <p>OK then, let's refactor it to use a <code>for</code> loop.
-
 <p class=skip><a href=#skip-reduce-code>skip over this code listing</a>
 <pre><code>def get_confidence(self):
    if self.get_state() == constants.eNotMe:
        return 0.01
  
 <mark>    total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
-
 <p>The <code>reduce()</code> function takes two arguments &mdash; a function and a list (strictly speaking, any iterable object will do) &mdash; and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. It looks much more readable as a <code>for</code> loop.
-
 <p class=skip><a href=#skip-reduce-refactoring>skip over this code listing</a>
 <pre><code>  def get_confidence(self):
      if self.get_state() == constants.eNotMe:
@@ -1266,9 +1199,7 @@ NameError: global name 'reduce' is not defined</samp></pre>
 <ins>+     total = 0</ins>
 <ins>+     for frequency in self._mFreqCounter:</ins>
 <ins>+         total += frequency</ins></code></pre>
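Reviewer note: the refactoring above replaces `reduce(operator.add, ...)` with an explicit loop. The sketch below checks that the two forms agree on a made-up frequency list. It also shows the built-in `sum()`, which is not what the chapter uses but is another common way to write the same total.

```python
import functools
import operator

# Hypothetical counter values standing in for self._mFreqCounter.
freq_counter = [0, 3, 5, 2]

# Python 3 spelling of the original call: reduce now lives in functools.
total_reduce = functools.reduce(operator.add, freq_counter)

# The explicit loop from the diff above.
total_loop = 0
for frequency in freq_counter:
    total_loop += frequency

# An alternative not used in the chapter: the built-in sum().
total_sum = sum(freq_counter)
```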
-
 <p id=skip-reduce-refactoring>I CAN HAZ TESTZ?
-
 <p class=skip><a href=#skip-final-output>skip over this command output listing</a>
 <pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
 <samp>tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
@@ -1304,13 +1235,9 @@ tests\EUC-JP\arclamp.jp.xml                                  EUC-JP with confide
 .
 .
 316 tests</samp></pre>
-
 <p id=skip-final-output>Holy crap, it actually works! <em><a href=http://www.hampsterdance.com/>/me does a little dance</a></em>
-
 <h2 id=summary>Summary</h2>
-
 <p>What have we learned?
-
 <ol>
 <li>Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There's no way around it. It's hard.
 <li>The <a href=porting-code-to-python-3-with-2to3.html>automated <code>2to3</code> tool</a> is helpful as far as it goes, but it will only do the easy parts &mdash; function renames, module renames, syntax changes. It's an impressive piece of engineering, but in the end it's just an intelligent search-and-replace bot.
@@ -1318,7 +1245,6 @@ tests\EUC-JP\arclamp.jp.xml                                  EUC-JP with confide
 <li><em>You</em> need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
 <li>Test cases are essential. Don't port anything without them. Don't even try. The <em>only</em> reason I have any confidence at all that <code>chardet</code> works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I <em>never</em> would have found half of these problems with manual spot-checking.
 </ol>
-
 <p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
 <script src=jquery.js></script>
 <script src=dip3.js></script>
@@ -0,0 +1,22 @@
+"""Quick-and-dirty HTML minimizer"""
+
+import sys
+
+input_file = sys.argv[1]
+output_file = sys.argv[2]
+in_pre = False
+out = open(output_file, 'w')
+for line in open(input_file).readlines():
+    g = line.strip()
+    if g.count('<pre'):
+        in_pre = True
+    if g.count('</pre'):
+        # this will break if you have a </pre> then <pre>
+        # on the same line, so don't do that
+        in_pre = False
+        g = line.rstrip()
+    if in_pre:
+        out.write(line)
+    else:
+        out.write(g)
+out.close()
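Reviewer note: the script above never closes its input file and does all its work at module level. One possible restructuring is sketched below. It is meant to keep the same line-by-line behavior, including the documented limitation about `</pre>` followed by `<pre>` on one line. The function names `minimize` and `minimize_file` are hypothetical and are not part of this commit.

```python
def minimize(lines):
    """Yield output chunks, mirroring the script's per-line logic."""
    in_pre = False
    for line in lines:
        stripped = line.strip()
        if '<pre' in stripped:
            in_pre = True
        if '</pre' in stripped:
            # Same caveat as the original: a </pre> followed by a <pre>
            # on the same line is not handled.
            in_pre = False
            stripped = line.rstrip()
        # Inside <pre>, keep the line verbatim; elsewhere drop the
        # surrounding whitespace, including the newline.
        yield line if in_pre else stripped


def minimize_file(input_file, output_file):
    """Read input_file, write the minimized markup to output_file."""
    with open(input_file) as src, open(output_file, 'w') as dst:
        dst.writelines(minimize(src))


# Hypothetical sample input, not taken from the book.
sample = [
    "<p>Hello\n",
    "  <li>item\n",
    "<pre><code>  x = 1\n",
    "      y = 2</code></pre>\n",
    "<p>Bye\n",
]
result = ''.join(minimize(sample))
```

As in the original, lines outside `<pre>` are joined with no separator at all, so the approach relies on the markup not needing whitespace between the end of one line and the start of the next.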
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Dive Into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -41,15 +41,8 @@ li.todo{background:white;color:gainsboro}
 <li><a href=case-study-porting-chardet-to-python-3.html>Case study: porting <code>chardet</code> to Python 3</a>
 <li><a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>
 </ol>
-<p>There is a <a href=http://hg.diveintopython3.org/>changelog</a>, a <a rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>feed</a>, and <a href=http://www.reddit.com/search?q=%22Dive+Into+Python+3%22&amp;sort=new>discussion on Reddit</a>. During development, you can download the book by cloning the Mercurial repository:
+<p>There is a <a href=http://hg.diveintopython3.org/>changelog</a>, a <a type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>feed</a>, and <a href=http://www.reddit.com/search?q=%22Dive+Into+Python+3%22&amp;sort=new>discussion on Reddit</a>. During development, you can download the book by cloning the Mercurial repository:
 <pre><samp class=prompt>you@localhost:~$ </samp><kbd>hg clone http://hg.diveintopython3.org/ diveintopython3</kbd></pre>
 <p>The final version will be downloadable as <abbr>HTML</abbr> and <abbr>PDF</abbr>.
 <p class=c>This site is optimized for Lynx just because fuck you.<br>I&#8217;m told it also looks good in graphical browsers.
 <p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim &#8226; <a href=about.html>open standards &#8226; open content &#8226; open source</a>
-<!--
-As I write this, the year is 2009, and the internet is STILL a battleground of so-called intellectual property disputes. Some people would have you believe that without proper financial incentives, music, literature, and software would disappear. After all, who would make music if they can't make money on it?  Who would write?  Who would program?
-
-I know the answer. The answer is that musicians will make music, not because they can make money, but because musicians are the people who can't not make music. Writers will write because they can't not write. Most of the people you think of as artists are really just showmen. They collect a paycheck and go home at 5 o'clock. That's not art, that's commerce.
-
-I've been programming since 1983 and releasing my code under Free Software licenses since 1993. I've been writing and publishing under Free Content licenses since 2000. I can't imagine not doing this. If you can imagine yourself not doing what you're doing, do something else. Do whatever it is you can't not do.
-->
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Native datatypes - Dive into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Porting code to Python 3 with 2to3 - Dive into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -3,7 +3,10 @@
 # make build directory and copy original files there for preflighting
 rm -rf build
 mkdir build
-cp *.html *.py *.txt .htaccess *.js *.css build/
+cp *.py robots.txt .htaccess *.js *.css build/
+for f in *.html; do
+  python htmlminimizer.py "$f" build/"$f"
+done

 # replace local jquery reference with Google API loader
 sed -i -e "s|jquery\.js|http://www.google.com/jsapi|g" build/*.html
@@ -17,6 +20,7 @@ java -jar yuicompressor-2.4.2.jar build/dip3.css > build/dip3.$revision.min.css
 sed -i -e "s|dip3\.js|http://wearehugh.com/dip3/dip3.${revision}.min.js|g" build/*.html
 sed -i -e "s|dip3\.css|http://wearehugh.com/dip3/dip3.${revision}.min.css|g" build/*.html
 sed -i -e "s|html5\.js|http://wearehugh.com/dip3/html5.js|g" build/*.html
+sed -i -e "s|=http:|=|g" build/*.html

 # set file permissions for public consumption
 chmod 644 build/*.html build/*.css build/*.js build/*.py build/*.txt build/.htaccess
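Reviewer note: the new `sed` line rewrites `=http:` to `=`, which turns an unquoted attribute value such as `src=http://host/path` into the scheme-relative `src=//host/path`. The sketch below mimics that substitution in Python to show what it does and does not match. The helper name is hypothetical, and the idea that this is why the commit unquotes `src=html5.js` is only a guess from reading the diff.

```python
def strip_http_scheme(html):
    """Mimic sed 's|=http:|=|g' on a string."""
    return html.replace('=http:', '=')


unquoted = '<script src=http://wearehugh.com/dip3/html5.js></script>'
quoted = '<script src="http://wearehugh.com/dip3/html5.js"></script>'

unquoted_result = strip_http_scheme(unquoted)
quoted_result = strip_http_scheme(quoted)
```

The quoted form is left untouched because a quote sits between `=` and `http:`. The substitution is purely textual, so it would also rewrite any other `=http:` in a page, for example inside a query string.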
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Regular expressions - Dive into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Table of contents - Dive Into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Unit testing - Dive into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
@@ -3,7 +3,7 @@
 <head>
 <meta charset=utf-8>
 <title>Your first Python program - Dive into Python 3</title>
-<!--[if IE]><script src="html5.js"></script><![endif]-->
+<!--[if IE]><script src=html5.js></script><![endif]-->
 <link rel=stylesheet type=text/css href=dip3.css>
 <link rel="shortcut icon" href=data:image/ico,>
 <link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>