clarifications [thanks G.P.]

Mark Pilgrim
2009-06-01 12:24:15 -07:00
parent bca614e2be
commit 7e2b1808a8
2 changed files with 6 additions and 6 deletions
+1 -1
@@ -107,7 +107,7 @@ class OrderedDict(dict, collections.MutableMapping):
<pre class=screen>
<samp class=p>>>> </samp><kbd>import ordereddict</kbd>
<samp class=p>>>> </samp><kbd>od = ordereddict.OrderedDict()</kbd>
-<samp class=p>>>> </samp><kbd>klass = od.__class__</kbd>
+<a><samp class=p>>>> </samp><kbd>klass = od.__class__</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>type(klass)</kbd>
<samp>&lt;class 'abc.ABCMeta'></samp>
<samp class=p>>>> </samp><kbd>klass.__name__</kbd>
+5 -5
@@ -600,8 +600,8 @@ if not hasattr(__builtin__, 'False'):
else:
False = __builtin__.False
True = __builtin__.True</code></pre>
-<p>This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in <code>Boolean</code> type. This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.
-<p>However, Python 3 will always have a <code>Boolean</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of <code>constants.True</code> and <code>constants.False</code> with <code>True</code> and <code>False</code>, respectively, then delete this dead code from <code>constants.py</code>.
+<p>This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in <code>bool</code> type. This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.
+<p>However, Python 3 will always have a <code>bool</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of <code>constants.True</code> and <code>constants.False</code> with <code>True</code> and <code>False</code>, respectively, then delete this dead code from <code>constants.py</code>.
<p>So this line in <code>universaldetector.py</code>:
<pre><code>self.done = constants.False</code></pre>
<p>Becomes
@@ -635,8 +635,8 @@ import sys</code></pre>
File "test.py", line 9, in &lt;module>
for line in file(f, 'rb'):
NameError: name 'file' is not defined</samp></pre>
-<p>This one surprised me, because I&#8217;ve been using this idiom as long as I can remember. In Python 2, the global <var>file()</var> function was an alias for <var>open()</var>, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code>io</code> module. [FIXME-LINK PEP 3116] I&#8217;ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <var>file()</var> function no longer exists. However, the <var>open()</var> function does still exist. (Technically, it&#8217;s an alias for <var>io.open()</var>, but never mind that right now.)
-<p>Thus, the simplest solution to the problem of the missing <var>file()</var> is to call <var>open()</var> instead:
+<p>This one surprised me, because I&#8217;ve been using this idiom as long as I can remember. In Python 2, the global <code>file()</code> function was an alias for the <code>open()</code> function, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code>io</code> module. [FIXME-LINK PEP 3116] I&#8217;ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <code>file()</code> function no longer exists. However, the <code>open()</code> function does still exist. (Technically, it&#8217;s an alias for <var>io.open()</var>, but never mind that right now.)
+<p>Thus, the simplest solution to the problem of the missing <code>file()</code> is to call the <code>open()</code> function instead:
<pre><code>for line in open(f, 'rb'):</code></pre>
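<p>As a minimal sketch of the same substitution outside the test script (the temporary file and all variable names here are made up for illustration), this Python 3 snippet reads a file in binary mode with <code>open()</code> and checks that <code>file()</code> is gone while <code>open()</code> and <code>io.open()</code> are the same object:

```python
import builtins
import io
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first line\nsecond line\n")
    path = tmp.name

try:
    # Python 3 has no global file(); open() takes the same arguments here.
    with open(path, "rb") as f:
        lines = [line for line in f]
finally:
    os.remove(path)

# The Python 2 builtin is gone, and open() is the same object as io.open().
file_exists = hasattr(builtins, "file")
open_is_io_open = open is io.open
```

<p>The <code>with</code> block is an addition of this sketch, not part of the original one-line fix; it only makes sure the file handle is closed before the temporary file is removed.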
<p>And that&#8217;s all I have to say about that.
<h3 id=cantuseastringpattern>Can&#8217;t use a string pattern on a bytes-like object</h3>
@@ -670,7 +670,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
for line in open(f, 'rb'):
u.feed(line)</code></pre>
<aside>Not an array of characters, but an array of bytes.</aside>
-<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string &mdash; an array of Unicode characters &mdash; according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
+<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string &mdash; an array of Unicode characters &mdash; according to the system default character encoding. (You could override the system encoding with another parameter to the <code>open()</code> function, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
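<p>The mismatch can be reproduced in isolation. This is only a sketch: <var>high_bit_pattern</var> is an assumed stand-in for the detector&#8217;s pre-compiled pattern, and the two sample lines are invented. A pattern compiled from a string searches a string without complaint, but raises <code>TypeError</code> when handed the kind of byte array that an <code>'rb'</code> read produces:

```python
import re

# Assumed stand-in for the detector's pattern, compiled from a str.
high_bit_pattern = re.compile(r"[\x80-\xFF]")

text_line = "caf\xe9\n"     # str: what a text-mode read would hand over
binary_line = b"caf\xe9\n"  # bytes: what an 'rb' read hands over

# A str pattern can search a str.
text_match = high_bit_pattern.search(text_line)

# The same str pattern refuses to search bytes.
try:
    high_bit_pattern.search(binary_line)
    error_message = None
except TypeError as exc:
    # The exact wording of the message varies between Python 3 versions.
    error_message = str(exc)
```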
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
<pre><code> class UniversalDetector: