mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
added asides, new styles
@@ -19,26 +19,27 @@ mark{background:#ff8;font-weight:bold}
|
||||
<p><span>❝</span> Words, words. They’re all we have to go on. <span>❞</span><br>— <a href=http://www.imdb.com/title/tt0100519/quotes>Rosencrantz and Guildenstern are Dead</a>
|
||||
</blockquote>
|
||||
<p id=toc>
|
||||
<h2 id=divingin>Diving in</h2>
|
||||
<h2 id=divingin>Diving In</h2>
|
||||
<p class=f>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
|
||||
<p>I’d also like a pony.
|
||||
<p>A Unicode pony.
|
||||
<p>A Unipony, as it were.
|
||||
<p>I’ll settle for character encoding auto-detection.
|
||||
|
||||
<h2 id=faq.what>What is character encoding auto-detection?</h2>
|
||||
<h2 id=faq.what>What is Character Encoding Auto-Detection?</h2>
|
||||
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.
|
||||
|
||||
<h3 id=faq.impossible>Isn’t that impossible?</h3>
|
||||
<h3 id=faq.impossible>Isn’t That Impossible?</h3>
|
||||
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.
|
||||
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
|
||||
|
||||
<h3 id=faq.who>Does such an algorithm exist?</h3>
|
||||
<h3 id=faq.who>Does Such An Algorithm Exist?</h3>
|
||||
<p>As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>Mozilla Firefox contains an encoding auto-detection library</a> which is open source. <a href=http://chardet.feedparser.org/>I ported the library to Python 2</a> and dubbed it the <code>chardet</code> module. This chapter will take you step-by-step through the process of porting the <code>chardet</code> module from Python 2 to Python 3.
|
||||
|
||||
<h2 id=divingin2>Introducing the <code>chardet</code> module</h2>
|
||||
<h2 id=divingin2>Introducing The <code>chardet</code> Module</h2>
|
||||
<p>[FIXME download link, possibly on chardet.feedparser.org, possibly local]
|
||||
<p>Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself.
|
||||
<aside>Encoding detection is really language detection in drag.</aside>
|
||||
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that’s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
|
||||
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
|
||||
<ol>
|
||||
@@ -48,18 +49,19 @@ mark{background:#ff8;font-weight:bold}
|
||||
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
|
||||
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
|
||||
</ol>
|
||||
<h3 id=how.bom><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></h3>
|
||||
<h3 id=how.bom><code>UTF-n</code> With A <abbr title="Byte Order Mark">BOM</abbr></h3>
|
||||
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that’s what it’s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
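<p>As a rough illustration of the <abbr title="Byte Order Mark">BOM</abbr> check (a minimal sketch, not the code in <code>UniversalDetector</code>), the function below compares the start of a byte string against the BOM constants in the standard <code>codecs</code> module. The name <code>sniff_bom</code> and the returned labels are assumptions made for this example.

```python
import codecs

# Longest BOMs first: the UTF-32 LE mark begins with the same two bytes
# as the UTF-16 LE mark, so the order of these checks matters.
_BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

<p>Because the answer depends only on the first few bytes, a check like this can return before any of the statistical machinery described below gets involved.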
|
||||
<h3 id=how.esc>Escaped encodings</h3>
|
||||
<h3 id=how.esc>Escaped Encodings</h3>
|
||||
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code>escprober.py</code>) and feeds it the text.
|
||||
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
|
||||
<h3 id=how.mb>Multi-byte encodings</h3>
|
||||
<h3 id=how.mb>Multi-Byte Encodings</h3>
|
||||
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
|
||||
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
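<p>The “drop a prober once it sees an illegal byte sequence” idea can be tried out with nothing but the standard library. The sketch below is a stand-in, not <code>MBCSGroupProber</code>: it swaps the library’s state machines for Python’s own strict incremental decoders, and it only eliminates candidates without computing any confidence. The function name <code>surviving_encodings</code> is hypothetical.

```python
import codecs


def surviving_encodings(chunks, candidates):
    """Feed byte chunks to one strict decoder per candidate encoding.

    A candidate is dropped as soon as its decoder hits an illegal byte
    sequence; the names still standing are returned in their original order.
    """
    decoders = {name: codecs.getincrementaldecoder(name)("strict")
                for name in candidates}
    for chunk in chunks:
        # Iterate over a copy of the keys so entries can be deleted safely.
        for name in list(decoders):
            try:
                decoders[name].decode(chunk)
            except UnicodeDecodeError:
                del decoders[name]  # illegal sequence: skip on later chunks
    return list(decoders)


# A two-byte UTF-8 character split across two chunks, like two feed() calls.
remaining = surviving_encodings([b"caf\xc3", b"\xa9"], ["ascii", "utf-8"])
```

<p>An incremental decoder buffers an incomplete sequence at the end of a chunk, which is why the split character above does not count against UTF-8.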
|
||||
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
|
||||
<p>The distribution analyzers (each defined in <code>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
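<p>To make the three inputs concrete, here is one way such a confidence rating could be computed. This is a simplified sketch under stated assumptions, not the code in <code>chardistribution.py</code>: the function name, the parameter names, and the threshold and cap values are all invented for illustration.

```python
def distribution_confidence(freq_chars, total_chars, typical_ratio,
                            min_freq_chars=3, sure_yes=0.99, sure_no=0.01):
    """Hypothetical confidence score from character-frequency counts.

    freq_chars    -- how many characters seen were "frequently used" ones
    total_chars   -- how many characters were seen in total
    typical_ratio -- language-specific ratio of frequent to rare characters
    """
    # Too little evidence: refuse to commit either way.
    if total_chars <= 0 or freq_chars <= min_freq_chars:
        return sure_no
    rare_chars = total_chars - freq_chars
    if rare_chars == 0:
        return sure_yes
    # Compare the observed frequent/rare ratio with the typical one.
    ratio = freq_chars / (rare_chars * typical_ratio)
    return min(ratio, sure_yes)
```

<p>The cap below 1.0 reflects the fact that a statistical guess is never a certainty, however good the numbers look.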
|
||||
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.
|
||||
<h3 id=how.sb>Single-byte encodings</h3>
|
||||
<h3 id=how.sb>Single-Byte Encodings</h3>
|
||||
<aside>Seriously, where’s my Unicode pony?</aside>
|
||||
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code>sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
|
||||
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code>sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
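<p>The tallying step can be sketched in a few lines. This is a toy version, assuming a language model that is simply a set of “frequent” byte pairs; the real language models and the real confidence formula are more elaborate, and the names <code>bigram_score</code> and <code>frequent_pairs</code> are hypothetical.

```python
from collections import Counter


def bigram_score(data, frequent_pairs):
    """Return the fraction of adjacent byte pairs found in frequent_pairs.

    data           -- a bytes object in some single-byte encoding
    frequent_pairs -- a set of (int, int) tuples standing in for a model
    """
    # Iterating over bytes yields integers, so each pair is (int, int).
    pairs = Counter(zip(data, data[1:]))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    hits = sum(count for pair, count in pairs.items()
               if pair in frequent_pairs)
    return hits / total


# A two-entry toy "model" in which only "th" and "he" count as frequent.
toy_model = {(ord("t"), ord("h")), (ord("h"), ord("e"))}
score = bigram_score(b"the then", toy_model)
```

<p>Running the same bytes against several models and keeping the highest score is the essence of what a group prober does with its members.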
|
||||
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
|
||||
@@ -567,8 +569,9 @@ RefactoringTool: Files that were modified:
|
||||
RefactoringTool: test.py</samp></pre>
|
||||
<p>[FIXME explain the difference in import syntax]
|
||||
<p>Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
|
||||
<h2 id=manual>Fixing what <code>2to3</code> can’t</h2>
|
||||
<h2 id=manual>Fixing What <code>2to3</code> Can’t</h2>
|
||||
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
|
||||
<aside>You do have tests, right?</aside>
|
||||
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.
|
||||
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
||||
<samp class=traceback>Traceback (most recent call last):
|
||||
@@ -613,6 +616,7 @@ import sys</code></pre>
|
||||
<p>There are variations of this problem scattered throughout the <code>chardet</code> library. In some places it’s "<code>import constants, sys</code>"; in other places, it’s "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
|
||||
<p>Onward!
|
||||
<h3 id=namefileisnotdefined>Name <var>'file'</var> is not defined</h3>
|
||||
<aside>open() is the new file(). PapayaWhip is the new black.</aside>
|
||||
<p>And here we go again, running <code>test.py</code> to try to execute our test cases…</p>
|
||||
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
||||
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
|
||||
@@ -654,6 +658,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
||||
.
|
||||
for line in open(f, 'rb'):
|
||||
u.feed(line)</code></pre>
|
||||
<aside>Not an array of characters, but an array of bytes.</aside>
|
||||
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for “read”; OK, big deal, we’re reading the file. Ah, but <code>'b'</code> is for “binary.” Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
|
||||
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
|
||||
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
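<p>The rule is easy to confirm in isolation. In this small sketch the variable names are made up for the example, and the pattern is only an approximation of a “high-bit” detector; the point is what happens when the pattern type and the subject type disagree.

```python
import re

# The same character class, compiled once from bytes and once from a string.
high_bit_bytes = re.compile(b"[\x80-\xFF]")
high_bit_text = re.compile("[\x80-\xFF]")

# A line as it would arrive from a file opened in 'rb' mode.
line = "caf\u00e9".encode("utf-8")

# A bytes pattern can search bytes.
found = high_bit_bytes.search(line) is not None

# A string pattern cannot; Python 3 raises TypeError instead of guessing.
error_message = None
try:
    high_bit_text.search(line)
except TypeError as exc:
    error_message = str(exc)
```

<p>Here <code>found</code> ends up true, and <code>error_message</code> holds the same kind of complaint that appeared in the traceback.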
|
||||
@@ -776,6 +781,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
|
||||
self._mInputState = eEscAscii
|
||||
|
||||
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
|
||||
<aside>Each item in a string is a string. Each item in a byte array is an integer.</aside>
|
||||
<p>This error doesn’t occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what’s the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd> <span>①</span></a>
|
||||
@@ -1115,7 +1121,7 @@ NameError: global name 'reduce' is not defined</samp></pre>
|
||||
|
||||
<mark> total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
|
||||
<p>The <code>reduce()</code> function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
|
||||
<p>This monstrosity was so common in Python 2 that Python 3 added a global <code>sum()</code> function.
|
||||
<p>This monstrosity was so common that Python added a global <code>sum()</code> function.
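<p>For a quick check that the two spellings agree, here is a small comparison. The list is made-up sample data, not the real contents of <var>self._mFreqCounter</var>. Note the extra import: in Python 3, <code>reduce()</code> lives in the <code>functools</code> module rather than in the global namespace, which is what triggers the <code>NameError</code>.

```python
import operator
from functools import reduce  # no longer a built-in in Python 3

# Hypothetical frequency counts standing in for the prober's counter list.
freq_counter = [3, 1, 4, 1, 5]

# The roundabout way: apply operator.add cumulatively across the list.
total_with_reduce = reduce(operator.add, freq_counter)

# The direct way: the built-in sum() does the same job.
total_with_sum = sum(freq_counter)
```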
|
||||
<pre><code> def get_confidence(self):
|
||||
if self.get_state() == constants.eNotMe:
|
||||
return 0.01
|
||||
|
||||