updated TOC

2026-06-05 23:10:17 +00:00 · 2009-01-30 00:28:56 -05:00
parent 891e357a2a
commit 504e9cbdb1
2 changed files with 73 additions and 480 deletions
@@ -41,9 +41,9 @@ body{counter-reset:h1 19}

 <h2 id="faq">Introducing <code class="filename">chardet</code>: a mini-FAQ</h2>

-<p class="fancy">When you think of “text”, you probably think of “characters and symbols I see on my computer screen”.  But computers don't deal in characters and symbols; they deal in bits and bytes.  Every piece of text you've ever seen on a computer screen is actually stored in a particular <em>character encoding</em>.  There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.  Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
+<p class="fancy">When you think of "text", you probably think of "characters and symbols I see on my computer screen".  But computers don't deal in characters and symbols; they deal in bits and bytes.  Every piece of text you've ever seen on a computer screen is actually stored in a particular <em>character encoding</em>.  There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.  Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

-<p>In reality, it's more complicated than that.  Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk.  So you can think of the character encoding as a kind of decryption key for the text.  Whenever someone gives you a sequence of bytes and claims it's “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
+<p>In reality, it's more complicated than that.  Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk.  So you can think of the character encoding as a kind of decryption key for the text.  Whenever someone gives you a sequence of bytes and claims it's "text", you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
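<p>A minimal sketch of the "decryption key" idea, using only Python's built-in codecs. The helper name <code>decode_with</code> and the sample bytes are illustrative assumptions, not part of <code class="filename">chardet</code>.

```python
def decode_with(raw: bytes, encodings):
    """Decode the same bytes under each named encoding."""
    return {name: raw.decode(name) for name in encodings}


# Two bytes, read with two different "keys".
raw = b"\xe9\xe8"
views = decode_with(raw, ["latin-1", "cp1251"])
# views["latin-1"] holds two accented Latin letters (U+00E9, U+00E8).
# views["cp1251"] holds two Cyrillic letters (U+0439, U+0438).

# The reverse direction: one character, different bytes per encoding.
e_acute = "\u00e9"
as_latin1 = e_acute.encode("latin-1")  # b'\xe9'
as_utf8 = e_acute.encode("utf-8")      # b'\xc3\xa9'
```

<p>Neither decoding raises an error, so the bytes alone do not reveal which reading was intended.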

 <h3 id="faq.what">What is character encoding auto-detection?</h3>

@@ -51,7 +51,7 @@ body{counter-reset:h1 19}

 <h3 id="faq.impossible">Isn't that impossible?</h3>

-<p>In general, yes.  However, some encodings are optimized for specific languages, and languages are not random.  Some character sequences pop up all the time, while other sequences make no sense.  A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters).  By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
+<p>In general, yes.  However, some encodings are optimized for specific languages, and languages are not random.  Some character sequences pop up all the time, while other sequences make no sense.  A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters).  By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
 <p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.

 <h3 id="faq.who">Who wrote this detection algorithm?</h3>
@@ -108,7 +108,7 @@ body{counter-reset:h1 19}

 <h3 id="how.mb">Multi-byte encodings</h3>

-<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters.  If so, it creates a series of “<span class="quote">probers</span>” for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
+<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters.  If so, it creates a series of "<span class="quote">probers</span>" for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.

 <p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code class="filename">mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>.  <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results.  If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober).  If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
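<p>The elimination step can be sketched roughly as follows. This is not the <code>MBCSGroupProber</code> implementation: it substitutes Python's incremental decoders for the per-encoding probers, it does no confidence scoring, and the names <code>GroupProberSketch</code>, <code>has_high_bit</code>, and <code>CANDIDATES</code> are hypothetical. <code>EUC-TW</code> is left out because Python ships no codec for it.

```python
import codecs

# Stand-ins for the encoding-specific probers (assumption: codec names
# from Python's standard library; EUC-TW has no built-in codec).
CANDIDATES = ["big5", "gb2312", "euc_kr", "euc_jp", "shift_jis", "utf_8"]


def has_high_bit(data: bytes) -> bool:
    """True if any byte has its top bit set (value 0x80 or above)."""
    return any(b >= 0x80 for b in data)


class GroupProberSketch:
    """Feeds each chunk to every surviving candidate; drops any that fail."""

    def __init__(self, encodings=CANDIDATES):
        self.active = {
            name: codecs.getincrementaldecoder(name)() for name in encodings
        }

    def feed(self, chunk: bytes) -> None:
        # Iterate over a copy of the keys so entries can be deleted.
        for name in list(self.active):
            try:
                self.active[name].decode(chunk)
            except UnicodeDecodeError:
                # Illegal byte sequence: skip this candidate from now on.
                del self.active[name]

    def survivors(self):
        return sorted(self.active)
```

<p>Once a candidate is dropped, later calls to <code>feed()</code> never consult it again, which mirrors the behavior described above.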

@@ -124,7 +124,7 @@ body{counter-reset:h1 19}

 <p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results.  These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code class="filename">sbcharsetprober.py</code>), which takes a language model as an argument.  The language model defines how frequently different 2-character sequences appear in typical text.  <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences.  Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
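<p>A toy version of this 2-character tally, as a sketch only. The bigram set, the <code>TYPICAL_RATIO</code> value, and the function name are made-up assumptions; the real language models are far larger, and this simplified formula ignores the total character count.

```python
# Hypothetical miniature "language model": a handful of common English
# bigrams and an assumed typical hit ratio for ordinary English text.
ENGLISH_BIGRAMS = frozenset(
    {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}
)
TYPICAL_RATIO = 0.2


def bigram_confidence(text, common_bigrams=ENGLISH_BIGRAMS,
                      typical_ratio=TYPICAL_RATIO):
    """Share of 2-character sequences found in the model, scaled and capped."""
    text = text.lower()
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in common_bigrams)
    confidence = (hits / len(pairs)) / typical_ratio
    return min(confidence, 0.99)  # never claim certainty


english = bigram_confidence("the rain in the north then and there")
gibberish = bigram_confidence("txzqJv 2!dasd0a QqdKjvz")
```

<p>With this toy model the English sentence scores above zero, while the newspaper gibberish from the FAQ contains none of the listed bigrams and scores exactly zero.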

-<p>Hebrew is handled as a special case.  If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code class="filename">hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored “<span class="quote">backwards</span>” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client).  Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
+<p>Hebrew is handled as a special case.  If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code class="filename">hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored "<span class="quote">backwards</span>" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client).  Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
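<p>The direction guess can be illustrated with the five Hebrew letters that have a distinct word-final form. This is a simplified sketch, not the <code>HebrewProber</code> logic: it works on already-decoded text in storage order, counts only final forms, and the function name is hypothetical.

```python
# Final kaf, final mem, final nun, final pe, final tsadi.
FINAL_LETTERS = frozenset("\u05da\u05dd\u05df\u05e3\u05e5")


def guess_hebrew_encoding(text: str) -> str:
    """Guess storage direction from where word-final letter forms appear."""
    logical = visual = 0
    for word in text.split():
        if len(word) < 2:
            continue
        if word[-1] in FINAL_LETTERS:
            logical += 1  # final form stored last: reading order
        if word[0] in FINAL_LETTERS:
            visual += 1   # final form stored first: reversed line
    return "ISO-8859-8" if visual > logical else "windows-1255"


# "shalom olam" in reading (logical) order; both words end in final mem.
logical_text = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
# The same line stored "backwards", as Visual Hebrew would store it.
visual_text = logical_text[::-1]
```

<p>Here <code>guess_hebrew_encoding(logical_text)</code> returns <code>windows-1255</code> and <code>guess_hebrew_encoding(visual_text)</code> returns <code>ISO-8859-8</code>; a tie falls back to the logical guess, which is an arbitrary choice made for this sketch.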

 <h3 id="how.windows1252"><code>windows-1252</code></h3>