xref to files chapter

This commit is contained in:
Mark Pilgrim
2009-07-16 12:35:19 -04:00
parent ca741fad6e
commit 49fae282ec
+1 -1
View File
@@ -670,7 +670,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
for line in open(f, 'rb'):
u.feed(line)</code></pre>
<aside>Not an array of characters, but an array of bytes.</aside>
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string&nbsp;&mdash;&nbsp;an array of Unicode characters&nbsp;&mdash;&nbsp;according to the system default character encoding. (You could override the system encoding with another parameter to the <code>open()</code> function, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <a href=files.html#binary><code>'b'</code> is for &#8220;binary.&#8221;</a> Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string&nbsp;&mdash;&nbsp;an array of Unicode characters&nbsp;&mdash;&nbsp;according to the system default character encoding. But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
<pre class=nd><code class=pp> class UniversalDetector: