started strings chapter, rewrote case-study intro, added some FIXMEs for obvious holes

This commit is contained in:
Mark Pilgrim
2009-03-15 23:49:11 -04:00
parent f727165f62
commit 08be466e7b
11 changed files with 417 additions and 424 deletions
+27 -33
View File
@@ -12,20 +12,18 @@ body{counter-reset:h1 20}
</style>
</head>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&#xa0;<input name=q size=31>&#xa0;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html#case-study-porting-chardet-to-python-3>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Case study: porting <code>chardet</code> to Python 3</h1>
<blockquote class=q>
<p><span>&#x275D;</span> Words, words. They&#8217;re all we have to go on. <span>&#x275E;</span><br>&mdash; <cite>Rosencrantz and Guildenstern are Dead</cite>
</blockquote>
<ol>
<li><a href=#divingin>What is character encoding?</a>
<li><a href=#divingin>Diving in</a>
<li><a href=#faq.what>What is character encoding auto-detection?</h2>
<ol>
<li><a href=#faq.what>What is character encoding auto-detection?</a>
<li><a href=#faq.impossible>Isn&#8217;t that impossible?</a>
<li><a href=#faq.who>Who wrote this detection algorithm?</a>
<li><a href=#faq.yippie>Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</a>
<li><a href=#faq.why>Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</a>
<li><a href=#faq.who>Does such an algorithm exist?</a>
</ol>
<li><a href=#divingin2>Diving in</a>
<ol>
@@ -50,31 +48,26 @@ body{counter-reset:h1 20}
</ol>
<li><a href=#summary>Summary</a>
</ol>
<h2 id=divingin>What is character encoding?</h2>
<p class=fancy>Usually, when people talk about &#8220;text,&#8221; they&#8217;re thinking of &#8220;characters and symbols on the computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it&#8217;s &#8220;text&#8221;, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
<h3 id=faq.what>What is character encoding auto-detection?</h3>
<h2 id=divingin>Diving in</h2>
<p class=fancy>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the &#8220;one encoding to rule them all.&#8221; I&#8217;d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
<p>I&#8217;d also like a pony.
<p>A Unicode pony.
<p>A Unipony, as it were.
<p>I&#8217;ll settle for character encoding auto-detection.
<h2 id=faq.what>What is character encoding auto-detection?</h2>
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It&#8217;s like cracking a code when you don&#8217;t have the decryption key.
<h3 id=faq.impossible>Isn&#8217;t that impossible?</h3>
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds &#8220;txzqJv 2!dasd0a QqdKjvz&#8221; will instantly recognize that that isn&#8217;t English (even though it is composed entirely of English letters). By studying lots of &#8220;typical&#8221; text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text&#8217;s language.
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
<h3 id=faq.who>Who wrote this detection algorithm?</h3>
<p>This library is a port of <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>the auto-detection code in Mozilla</a>. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors&#8217; comments, which are quite extensive and informative.
<p>You may also be interested in the research paper which led to the Mozilla implementation, <a href=http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>A composite approach to language/encoding detection</a>.
<h3 id=faq.yippie>Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</h3>
<p>Don&#8217;t do that. Virtually every format and protocol contains a method for specifying character encoding.
<ul>
<li><abbr>HTTP</abbr> can define a <code>charset</code> parameter in the <code>Content-type</code> header.
<li><abbr>HTML</abbr> documents can define a <code>&lt;meta http-equiv="content-type"&gt;</code> element in the <code>&lt;head&gt;</code> of a web page.
<li><abbr>XML</abbr> documents can define an <code>encoding</code> attribute in the <abbr>XML</abbr> prolog.
</ul>
<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over <abbr>HTTP</abbr>, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
<p>Despite the complexity, it&#8217;s worthwhile to follow standards and <a href=http://www.w3.org/2001/tag/doc/mime-respect>respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
<h3 id=faq.why>Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</h3>
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn&#8217;t work. There are also some poorly designed standards that have no way to specify encoding at all.
<p>If following the relevant standards gets you nowhere, <em>and</em> you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my <a href=http://feedparser.org/>Universal Feed Parser</a>, which calls this auto-detection library <a href=http://feedparser.org/docs/character-encoding.html>only after exhausting all other options</a>.
<h2 id=divingin2>Diving in</h2>
<p>This is a brief guide to navigating the code itself.
<h3 id=faq.who>Does such an algorithm exist?</h3>
<p>As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>Mozilla Firefox contains an encoding auto-detection library</a> which is open source. <a href=http://chardet.feedparser.org/>I ported the library to Python 2</a> and dubbed it the <code>chardet</code> module. This chapter will take you step-by-step through the process of porting the <code>chardet</code> module from Python 2 to Python 3.
<h2 id=divingin2>Introducing the <code>chardet</code> module</h2>
<p>[FIXME download link, possibly on chardet.feedparser.org, possibly local]
<p>Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself.
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
<ol>
@@ -98,11 +91,11 @@ body{counter-reset:h1 20}
<h3 id=how.sb>Single-byte encodings</h3>
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code>sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code>sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text actually stored "<span class=quote>backwards</span>" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text actually stored &#8220;backwards&#8221; line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
<h3 id=how.windows1252><code>windows-1252</code></h3>
<p>If <code>UniversalDetector</code> detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a <code>Latin1Prober</code> (defined in <code>latin1prober.py</code>) to try to detect English text in a <code>windows-1252</code> encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish <code>windows-1252</code> is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. <code>Latin1Prober</code> automatically reduces its confidence rating to allow more accurate probers to win if at all possible.
<h2 id=running2to3>Running <code>2to3</code></h2>
<p>We&#8217;re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different modules -- but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we&#8217;ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you&#8217;ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>We&#8217;re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy &mdash; a function was renamed or moved to a different modules &mdash; but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we&#8217;ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you&#8217;ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>The main <code>chardet</code> package is split across several different files, all in the same directory. The <code>2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code>2to3</code> will convert each of the files in turn.
<p id=noscript>[The code examples will be easier to follow if you enable Javascript, but whatever.]
<p class=skip><a href=#skip2to3output>skip over this</a>
@@ -604,7 +597,8 @@ RefactoringTool: Skipping implicit fixer: ws_comma
<ins>+print(count, 'tests')</ins>
RefactoringTool: Files that were modified:
RefactoringTool: test.py</samp></pre>
<p id=skip2to3outputtest>Well, that wasn&#8217;t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it&#8217;ll work?
<p id=skip2to3outputtest>[FIXME explain the difference in import syntax]
<p>Well, that wasn&#8217;t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it&#8217;ll work?
<h2 id=manual>Fixing what <code>2to3</code> can&#8217;t</h2>
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it&#8217;s a good way to test our ported code to make sure there aren&#8217;t any bugs lurking anywhere.
@@ -643,7 +637,7 @@ else:
File "C:\home\chardet\chardet\universaldetector.py", line 29, in &lt;module>
import constants, sys
ImportError: No module named constants</samp></pre>
<p id=skipnomodulenamedconstants>What&#8217;s that you say? No module named <code>constants</code>? Of course there&#8217;s a module named <code>constants</code>. &hellip;Oh wait, no there isn&#8217;t. Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
<p id=skipnomodulenamedconstants>What&#8217;s that you say? No module named <code>constants</code>? Of course there&#8217;s a module named <code>constants</code>. &hellip;Oh wait, no there isn&#8217;t. Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports &mdash; that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
<pre><code>from . import constants</code></pre>
<p>But wait. Wasn&#8217;t the <code>2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code>constants</code> module within the library, and an absolute import of the <code>sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can&#8217;t, and the <code>2to3</code> script is not smart enough to split the import statement into two.
<p>The solution is to split the import statement manually. So this two-in-one import:
@@ -685,7 +679,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128&ndash;255 (0x80&ndash;0xFF). Wait, that&#8217;s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128-255.
<p>And therein lies the problem.
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string &mdash; that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string &mdash; again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
<p class=skip><a href=#skipfeedhighbitdetectorcode>skip over this</a>
<pre><code>def feed(self, aBuf):
.
@@ -701,7 +695,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
.
for line in open(f, 'rb'):
u.feed(line)</code></pre>
<p id=skiptestharnessfeedcode>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
<p id=skiptestharnessfeedcode>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string &mdash; an array of Unicode characters &mdash; according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
<p class=skip><a href=#skip-cant-use-a-string-pattern-solution>skip over this code listing</a>