added asides, new styles

Mark Pilgrim
2009-03-28 15:58:35 -05:00
parent fe57cb0215
commit 1a4ce72944
9 changed files with 127 additions and 83 deletions
+17 -11
@@ -19,26 +19,27 @@ mark{background:#ff8;font-weight:bold}
<p><span>&#x275D;</span> Words, words. They&#8217;re all we have to go on. <span>&#x275E;</span><br>&mdash; <a href=http://www.imdb.com/title/tt0100519/quotes>Rosencrantz and Guildenstern are Dead</a>
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<p class=f>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the &#8220;one encoding to rule them all.&#8221; I&#8217;d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
<p>I&#8217;d also like a pony.
<p>A Unicode pony.
<p>A Unipony, as it were.
<p>I&#8217;ll settle for character encoding auto-detection.
<h2 id=faq.what>What is character encoding auto-detection?</h2>
<h2 id=faq.what>What is Character Encoding Auto-Detection?</h2>
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It&#8217;s like cracking a code when you don&#8217;t have the decryption key.
<h3 id=faq.impossible>Isn&#8217;t that impossible?</h3>
<h3 id=faq.impossible>Isn&#8217;t That Impossible?</h3>
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds &#8220;txzqJv 2!dasd0a QqdKjvz&#8221; will instantly recognize that that isn&#8217;t English (even though it is composed entirely of English letters). By studying lots of &#8220;typical&#8221; text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text&#8217;s language.
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
<h3 id=faq.who>Does such an algorithm exist?</h3>
<h3 id=faq.who>Does Such An Algorithm Exist?</h3>
<p>As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>Mozilla Firefox contains an encoding auto-detection library</a> which is open source. <a href=http://chardet.feedparser.org/>I ported the library to Python 2</a> and dubbed it the <code>chardet</code> module. This chapter will take you step-by-step through the process of porting the <code>chardet</code> module from Python 2 to Python 3.
<h2 id=divingin2>Introducing the <code>chardet</code> module</h2>
<h2 id=divingin2>Introducing The <code>chardet</code> Module</h2>
<p>[FIXME download link, possibly on chardet.feedparser.org, possibly local]
<p>Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself.
<aside>Encoding detection is really language detection in drag.</aside>
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
<ol>
@@ -48,18 +49,19 @@ mark{background:#ff8;font-weight:bold}
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn&#8217;t know a character encoding from a hole in the ground.
</ol>
<h3 id=how.bom><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></h3>
<h3 id=how.bom><code>UTF-n</code> With A <abbr title="Byte Order Mark">BOM</abbr></h3>
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
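<p>A minimal sketch of BOM sniffing, using the BOM constants from the standard <code>codecs</code> module. The names <code>BOMS</code> and <code>sniff_bom</code> are hypothetical and are not taken from <code>chardet</code>; this only illustrates the idea, and the real code in <code>UniversalDetector</code> may be organized differently.

```python
import codecs

# Check longer BOMs first: the UTF-32 LE BOM begins with the same
# two bytes as the UTF-16 LE BOM, so order matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

<p>Because a BOM is conclusive, a function like this can return as soon as it finds a match, with no statistical analysis at all.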
<h3 id=how.esc>Escaped encodings</h3>
<h3 id=how.esc>Escaped Encodings</h3>
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code>escprober.py</code>) and feeds it the text.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
<h3 id=how.mb>Multi-byte encodings</h3>
<h3 id=how.mb>Multi-Byte Encodings</h3>
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of &#8220;probers&#8221; for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
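<p>The &#8220;shell&#8221; behavior described above can be sketched with toy classes. Everything here (<code>ToyProber</code>, <code>ToyGroupProber</code>, the fixed confidence values) is a hypothetical stand-in, not the real <code>MBCSGroupProber</code>; it only shows how a group can drop a prober once it reports an illegal sequence and skip it on later calls.

```python
class ToyProber:
    """Hypothetical stand-in for one encoding-specific prober."""

    def __init__(self, name, illegal, confidence):
        self.name = name
        self.illegal = illegal        # a byte value this toy encoding forbids
        self.confidence = confidence  # fixed value, for illustration only
        self.active = True

    def feed(self, data):
        # Rule this encoding out as soon as a forbidden byte appears.
        if self.illegal in data:
            self.active = False


class ToyGroupProber:
    """Hypothetical group shell: fans data out to the probers still in play."""

    def __init__(self, probers):
        self.probers = list(probers)

    def feed(self, data):
        for prober in self.probers:
            if prober.active:  # dropped probers are skipped on later calls
                prober.feed(data)

    def best_guess(self):
        live = [p for p in self.probers if p.active]
        if not live:
            return None
        return max(live, key=lambda p: p.confidence).name
```

<p>Once a prober has been ruled out, later calls to <code>feed()</code> never touch it again, so the remaining work shrinks as more text is processed.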
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
<p>The distribution analyzers (each defined in <code>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
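<p>As a rough illustration of how those three inputs might combine into a confidence rating, here is one plausible formula. The function name, the constants <code>SURE_YES</code> and <code>SURE_NO</code>, and the exact arithmetic are assumptions made for this sketch; consult <code>chardistribution.py</code> for what the library actually does.

```python
SURE_YES = 0.99  # assumed ceiling: never claim absolute certainty
SURE_NO = 0.01   # assumed floor: never rule an encoding out entirely


def distribution_confidence(freq_chars, total_chars, typical_ratio):
    """Estimate confidence from character-frequency counts.

    freq_chars    -- characters seen that the language model calls frequent
    total_chars   -- all characters seen so far
    typical_ratio -- language-specific ratio of frequent to rare characters
    """
    if total_chars <= 0 or freq_chars <= 0:
        return SURE_NO
    rare_chars = total_chars - freq_chars
    if rare_chars == 0:
        return SURE_YES
    ratio = freq_chars / (rare_chars * typical_ratio)
    return min(ratio, SURE_YES)
```

<p>The more the observed mix of frequent and rare characters resembles typical text in the language, the closer the result gets to the ceiling.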
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.
<h3 id=how.sb>Single-byte encodings</h3>
<h3 id=how.sb>Single-Byte Encodings</h3>
<aside>Seriously, where&#8217;s my Unicode pony?</aside>
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code>sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code>sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
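<p>The idea of tallying 2-character sequences can be sketched in a few lines. This is a simplified illustration, not the logic of <code>SingleByteCharSetProber</code>: the function name, the use of a plain set as the &#8220;language model,&#8221; and the 0.01 and 0.99 bounds are all assumptions.

```python
def sequence_confidence(data, frequent_pairs, typical_ratio):
    """Score bytes by how many adjacent 2-byte sequences are 'frequent'.

    data           -- the text, as bytes
    frequent_pairs -- set of 2-byte sequences common in the language (toy model)
    typical_ratio  -- language-specific scaling factor
    """
    total_pairs = len(data) - 1
    if total_pairs <= 0:
        return 0.01
    # Slicing bytes yields bytes, so each pair can be looked up in the set.
    hits = sum(1 for i in range(total_pairs) if data[i:i + 2] in frequent_pairs)
    score = hits / total_pairs / typical_ratio
    return max(0.01, min(score, 0.99))
```

<p>A real language model is far larger than a handful of pairs, but the principle is the same: text in the right encoding produces many familiar sequences, and text in the wrong one produces few.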
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored &#8220;backwards&#8221; line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
@@ -567,8 +569,9 @@ RefactoringTool: Files that were modified:
RefactoringTool: test.py</samp></pre>
<p>[FIXME explain the difference in import syntax]
<p>Well, that wasn&#8217;t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it&#8217;ll work?
<h2 id=manual>Fixing what <code>2to3</code> can&#8217;t</h2>
<h2 id=manual>Fixing What <code>2to3</code> Can&#8217;t</h2>
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
<aside>You do have tests, right?</aside>
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it&#8217;s a good way to test our ported code to make sure there aren&#8217;t any bugs lurking anywhere.
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class=traceback>Traceback (most recent call last):
@@ -613,6 +616,7 @@ import sys</code></pre>
<p>There are variations of this problem scattered throughout the <code>chardet</code> library. In some places it&#8217;s "<code>import constants, sys</code>"; in other places, it&#8217;s "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
<p>Onward!
<h3 id=namefileisnotdefined>Name <var>'file'</var> is not defined</h3>
<aside>open() is the new file(). PapayaWhip is the new black.</aside>
<p>And here we go again, running <code>test.py</code> to try to execute our test cases&hellip;</p>
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
@@ -654,6 +658,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
.
for line in open(f, 'rb'):
u.feed(line)</code></pre>
<aside>Not an array of characters, but an array of bytes.</aside>
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string &mdash; an array of Unicode characters &mdash; according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit&hellip; characters. But we don&#8217;t have characters; we have bytes. Oops.
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
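<p>A small standalone demonstration of the difference, run under Python 3. The variable names are made up for this sketch and are not the ones used in <code>universaldetector.py</code>; only the <code>b</code> prefix on the pattern matters.

```python
import re

# A pattern defined with a string can only search strings.
high_bit_str = re.compile('[\x80-\xFF]')
# The same pattern defined with a byte array can only search byte arrays.
high_bit_bytes = re.compile(b'[\x80-\xFF]')

line = b'caf\xe9'  # bytes, as read from a file opened with 'rb'

# The bytes pattern finds the high-bit byte.
found = high_bit_bytes.search(line) is not None

# The string pattern refuses to search bytes at all.
try:
    high_bit_str.search(line)
    raised = False
except TypeError:
    raised = True
```

<p>Python 3 never silently converts between the two types, so the pattern and the thing being searched must agree.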
@@ -776,6 +781,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
self._mInputState = eEscAscii
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
<aside>Each item in a string is a string. Each item in a byte array is an integer.</aside>
<p>This error doesn't occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd> <span>&#x2460;</span></a>
@@ -1115,7 +1121,7 @@ NameError: global name 'reduce' is not defined</samp></pre>
<mark> total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
<p>The <code>reduce()</code> function takes two arguments &mdash; a function and a list (strictly speaking, any iterable object will do) &mdash; and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
<p>This monstrosity was so common in Python 2 that Python 3 added a global <code>sum()</code> function.
<p>This monstrosity was so common that Python added a global <code>sum()</code> function.
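<p>To see that the two spellings agree, here is a standalone comparison. The list <code>freq_counter</code> is a made-up stand-in for <var>self._mFreqCounter</var>; note that in Python 3, <code>reduce()</code> is no longer a global function and has to be imported from <code>functools</code>.

```python
import operator
from functools import reduce  # Python 3: reduce() lives in functools

freq_counter = [3, 1, 4, 1, 5]  # hypothetical stand-in for self._mFreqCounter

# The roundabout way: fold operator.add over the list.
total_with_reduce = reduce(operator.add, freq_counter)

# The direct way: let the built-in do the adding.
total_with_sum = sum(freq_counter)
```

<p>Both names end up bound to the same total, which is why swapping in <code>sum()</code> is a safe fix here.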
<pre><code> def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01