markup fiddling (encodings are always wrapped in abbr)

This commit is contained in:
Mark Pilgrim
2009-09-26 00:12:49 -04:00
parent fc155d5fe6
commit 131638d9ea
4 changed files with 21 additions and 21 deletions
+9 -9
View File
@@ -47,20 +47,20 @@ del{background:#f87}
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
<ol>
<li><code>UTF-n</code> with a Byte Order Mark (<abbr>BOM</abbr>). This includes <code>UTF-8</code>, both Big-Endian and Little-Endian variants of <code>UTF-16</code>, and all 4 byte-order variants of <code>UTF-32</code>.
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <code>ISO-2022-JP</code> (Japanese) and <code>HZ-GB-2312</code> (Chinese).
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <code>Big5</code> (Chinese), <code>SHIFT_JIS</code> (Japanese), <code>EUC-KR</code> (Korean), and <code>UTF-8</code> without a <abbr>BOM</abbr>.
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn&#8217;t know a character encoding from a hole in the ground.
<li><abbr>UTF-n</abbr> with a Byte Order Mark (<abbr>BOM</abbr>). This includes <abbr>UTF-8</abbr>, both Big-Endian and Little-Endian variants of <abbr>UTF-16</abbr>, and all 4 byte-order variants of <abbr>UTF-32</abbr>.
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <abbr>ISO-2022-JP</abbr> (Japanese) and <abbr>HZ-GB-2312</abbr> (Chinese).
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <abbr>Big5</abbr> (Chinese), <abbr>SHIFT_JIS</abbr> (Japanese), <abbr>EUC-KR</abbr> (Korean), and <abbr>UTF-8</abbr> without a <abbr>BOM</abbr>.
<li>Single-byte encodings, where each character is represented by one byte. Examples: <abbr>KOI8-R</abbr> (Russian), <abbr>windows-1255</abbr> (Hebrew), and <abbr>TIS-620</abbr> (Thai).
<li><abbr>windows-1252</abbr>, which is used primarily on Microsoft Windows by middle managers who wouldn&#8217;t know a character encoding from a hole in the ground.
</ol>
<h3 id=how.bom><code>UTF-n</code> With A <abbr>BOM</abbr></h3>
<p>If the text starts with a <abbr>BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr>BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
<h3 id=how.bom><abbr>UTF-n</abbr> With A <abbr>BOM</abbr></h3>
<p>If the text starts with a <abbr>BOM</abbr>, we can reasonably assume that the text is encoded in <abbr>UTF-8</abbr>, <abbr>UTF-16</abbr>, or <abbr>UTF-32</abbr>. (The <abbr>BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
<h3 id=how.esc>Escaped Encodings</h3>
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code>escprober.py</code>) and feeds it the text.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <abbr>HZ-GB-2312</abbr>, <abbr>ISO-2022-CN</abbr>, <abbr>ISO-2022-JP</abbr>, and <abbr>ISO-2022-KR</abbr> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
<h3 id=how.mb>Multi-Byte Encodings</h3>
<p>Assuming no <abbr>BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of &#8220;probers&#8221; for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <abbr>Big5</abbr>, <abbr>GB2312</abbr>, <abbr>EUC-TW</abbr>, <abbr>EUC-KR</abbr>, <abbr>EUC-JP</abbr>, <abbr>SHIFT_JIS</abbr>, and <abbr>UTF-8</abbr>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
<p>The distribution analyzers (each defined in <code>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.