markup fiddling (encodings are always wrapped in abbr)

This commit is contained in:
Mark Pilgrim
2009-09-26 00:12:49 -04:00
parent fc155d5fe6
commit 131638d9ea
4 changed files with 21 additions and 21 deletions
+9 -9
View File
@@ -47,20 +47,20 @@ del{background:#f87}
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
<ol>
<li><code>UTF-n</code> with a Byte Order Mark (<abbr>BOM</abbr>). This includes <code>UTF-8</code>, both Big-Endian and Little-Endian variants of <code>UTF-16</code>, and all 4 byte-order variants of <code>UTF-32</code>.
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <code>ISO-2022-JP</code> (Japanese) and <code>HZ-GB-2312</code> (Chinese).
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <code>Big5</code> (Chinese), <code>SHIFT_JIS</code> (Japanese), <code>EUC-KR</code> (Korean), and <code>UTF-8</code> without a <abbr>BOM</abbr>.
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn&#8217;t know a character encoding from a hole in the ground.
<li><abbr>UTF-n</abbr> with a Byte Order Mark (<abbr>BOM</abbr>). This includes <abbr>UTF-8</abbr>, both Big-Endian and Little-Endian variants of <abbr>UTF-16</abbr>, and all 4 byte-order variants of <abbr>UTF-32</abbr>.
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <abbr>ISO-2022-JP</abbr> (Japanese) and <abbr>HZ-GB-2312</abbr> (Chinese).
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <abbr>Big5</abbr> (Chinese), <abbr>SHIFT_JIS</abbr> (Japanese), <abbr>EUC-KR</abbr> (Korean), and <abbr>UTF-8</abbr> without a <abbr>BOM</abbr>.
<li>Single-byte encodings, where each character is represented by one byte. Examples: <abbr>KOI8-R</abbr> (Russian), <abbr>windows-1255</abbr> (Hebrew), and <abbr>TIS-620</abbr> (Thai).
<li><abbr>windows-1252</abbr>, which is used primarily on Microsoft Windows by middle managers who wouldn&#8217;t know a character encoding from a hole in the ground.
</ol>
<h3 id=how.bom><code>UTF-n</code> With A <abbr>BOM</abbr></h3>
<p>If the text starts with a <abbr>BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr>BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
<h3 id=how.bom><abbr>UTF-n</abbr> With A <abbr>BOM</abbr></h3>
<p>If the text starts with a <abbr>BOM</abbr>, we can reasonably assume that the text is encoded in <abbr>UTF-8</abbr>, <abbr>UTF-16</abbr>, or <abbr>UTF-32</abbr>. (The <abbr>BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
<h3 id=how.esc>Escaped Encodings</h3>
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code>escprober.py</code>) and feeds it the text.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <abbr>HZ-GB-2312</abbr>, <abbr>ISO-2022-CN</abbr>, <abbr>ISO-2022-JP</abbr>, and <abbr>ISO-2022-KR</abbr> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
<h3 id=how.mb>Multi-Byte Encodings</h3>
<p>Assuming no <abbr>BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of &#8220;probers&#8221; for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <abbr>Big5</abbr>, <abbr>GB2312</abbr>, <abbr>EUC-TW</abbr>, <abbr>EUC-KR</abbr>, <abbr>EUC-JP</abbr>, <abbr>SHIFT_JIS</abbr>, and <abbr>UTF-8</abbr>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
<p>The distribution analyzers (each defined in <code>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.
+2 -2
View File
@@ -57,7 +57,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
<p>What just happened? You didn&#8217;t specify a character encoding, so Python is forced to use the default encoding. What&#8217;s the default encoding? If you look closely at the traceback, you can see that it&#8217;s dying in <code>cp1252.py</code>, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn&#8217;t support the characters that are in this file, so the read fails with an ugly <code>UnicodeDecodeError</code>.
<p>But wait, it&#8217;s worse than that! The default encoding is <em>platform-dependent</em>, so this code <em>might</em> work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
<p>But wait, it&#8217;s worse than that! The default encoding is <em>platform-dependent</em>, so this code <em>might</em> work on your computer (if your default encoding is <abbr>UTF-8</abbr>), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
<blockquote class=note>
<p><span class=u>&#x261E;</span>If you need to get the default character encoding, import the <code>locale</code> module and call <code>locale.getpreferredencoding()</code>. On my Windows laptop, it returns <code>'cp1252'</code>, but on my Linux box upstairs, it returns <code>'UTF8'</code>. I can&#8217;t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it&#8217;s so important to specify the encoding every time you open a file.
@@ -141,7 +141,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
<li>Now you&#8217;re on the 20<sup>th</sup> byte.
</ol>
<p>Do you see it yet? The <code>seek()</code> and <code>tell()</code> methods always count <em>bytes</em>, but since you opened this file as text, the <code>read()</code> method counts <em>characters</em>. Chinese characters <a href=strings.html#boring-stuff>require multiple bytes to encode in UTF-8</a>. The English characters in the file only require one byte each, so you might be misled into thinking that the <code>seek()</code> and <code>read()</code> methods are counting the same thing. But that&#8217;s only true for some characters.
<p>Do you see it yet? The <code>seek()</code> and <code>tell()</code> methods always count <em>bytes</em>, but since you opened this file as text, the <code>read()</code> method counts <em>characters</em>. Chinese characters <a href=strings.html#boring-stuff>require multiple bytes to encode in <abbr>UTF-8</abbr></a>. The English characters in the file only require one byte each, so you might be misled into thinking that the <code>seek()</code> and <code>read()</code> methods are counting the same thing. But that&#8217;s only true for some characters.
<p>But wait, it gets worse!
+4 -4
View File
@@ -310,7 +310,7 @@ def protocol_version(file_object):
<p>Second, as with any text-based format, there is the issue of whitespace. <abbr>JSON</abbr> allows arbitrary amounts of whitespace (spaces, tabs, carriage returns, and line feeds) between values. This whitespace is &#8220;insignificant,&#8221; which means that <abbr>JSON</abbr> encoders can add as much or as little whitespace as they like, and <abbr>JSON</abbr> decoders are required to ignore the whitespace between values. This allows you to &#8220;pretty-print&#8221; your <abbr>JSON</abbr> data, nicely nesting values within values at different indentation levels so you can read it in a standard browser or text editor. Python&#8217;s <code>json</code> module has options for pretty-printing during encoding.
<p>Third, there&#8217;s the perennial problem of character encoding. <abbr>JSON</abbr> encodes values as plain text, but as you know, <a href=strings.html>there ain&#8217;t no such thing as &#8220;plain text.&#8221;</a> <abbr>JSON</abbr> must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and <a href=http://www.ietf.org/rfc/rfc4627.txt>section 3 of RFC 4627</a> defines how to tell which encoding is being used.
<p>Third, there&#8217;s the perennial problem of character encoding. <abbr>JSON</abbr> encodes values as plain text, but as you know, <a href=strings.html>there ain&#8217;t no such thing as &#8220;plain text.&#8221;</a> <abbr>JSON</abbr> must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, <abbr>UTF-8</abbr>), and <a href=http://www.ietf.org/rfc/rfc4627.txt>section 3 of RFC 4627</a> defines how to tell which encoding is being used.
<p class=a>&#x2042;
@@ -332,7 +332,7 @@ def protocol_version(file_object):
<a><samp class=p>... </samp><kbd class=pp> json.dump(basic_entry, f)</kbd> <span class=u>&#x2462;</span></a></pre>
<ol>
<li>We&#8217;re going to create a new data structure instead of re-using the existing <var>entry</var> data structure. Later in this chapter, we&#8217;ll see what happens when we try to encode the more complex data structure in <abbr>JSON</abbr>.
<li><abbr>JSON</abbr> is a text-based format, which means you need to open this file in text mode and specify a character encoding. You can never go wrong with UTF-8.
<li><abbr>JSON</abbr> is a text-based format, which means you need to open this file in text mode and specify a character encoding. You can never go wrong with <abbr>UTF-8</abbr>.
<li>Like the <code>pickle</code> module, the <code>json</code> module defines a <code>dump()</code> function which takes a Python data structure and a writeable stream object. The <code>dump()</code> function serializes the Python data structure and writes it to the stream object. Doing this inside a <code>with</code> statement will ensure that the file is closed properly when we&#8217;re done.
</ol>
@@ -445,7 +445,7 @@ def protocol_version(file_object):
<mark>TypeError: b'\xDE\xD5\xB4\xF8' is not JSON serializable</mark></samp></pre>
<ol>
<li>OK, it&#8217;s time to revisit the <var>entry</var> data structure. This has it all: a boolean value, a <code>None</code> value, a string, a tuple of strings, a <code>bytes</code> object, and a <code>time</code> structure.
<li>I know I&#8217;ve said it before, but it&#8217;s worth repeating: <abbr>JSON</abbr> is a text-based format. Always open <abbr>JSON</abbr> files in text mode with a UTF-8 character encoding.
<li>I know I&#8217;ve said it before, but it&#8217;s worth repeating: <abbr>JSON</abbr> is a text-based format. Always open <abbr>JSON</abbr> files in text mode with a <abbr>UTF-8</abbr> character encoding.
<li>Well <em>that&#8217;s</em> not good. What happened?
</ol>
@@ -490,7 +490,7 @@ def protocol_version(file_object):
TypeError: time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1) is not JSON serializable</samp></pre>
<ol>
<li>The <code>customserializer</code> module is where you just defined the <code>to_json()</code> function in the previous example.
<li>Text mode, UTF-8 encoding, yadda yadda. (You&#8217;ll forget! I forget sometimes! And everything will work right up until the moment that it fails, and then it will fail most spectacularly.)
<li>Text mode, <abbr>UTF-8</abbr> encoding, yadda yadda. (You&#8217;ll forget! I forget sometimes! And everything will work right up until the moment that it fails, and then it will fail most spectacularly.)
<li>This is the important bit: to hook your custom conversion function into the <code>json.dump()</code> function, pass your function into the <code>json.dump()</code> function in the <var>default</var> parameter. (Hooray, <a href=your-first-python-program.html#everythingisanobject>everything in Python is an object</a>!)
<li>OK, so it didn&#8217;t actually work. But take a look at the exception. The <code>json.dump()</code> function is no longer complaining about being unable to serialize the <code>bytes</code> object. Now it&#8217;s complaining about a completely different object: the <code>time.struct_time</code> object.
</ol>
+6 -6
View File
@@ -69,17 +69,17 @@ My alphabet starts where your alphabet ends! <span class=u>&#x275E;</span><br>&m
<p class=xxxl>UTF-8
<p>UTF-8 is a <em>variable-length</em> encoding system for Unicode. That is, different characters take up a different number of bytes. For <abbr>ASCII</abbr> characters (A-Z, <i class=baa>&amp;</i>c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0&ndash;127) in UTF-8 are indistinguishable from <abbr>ASCII</abbr>. &#8220;Extended Latin&#8221; characters like &ntilde; and &ouml; end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like &#x4E2D; end up taking three bytes. The rarely-used &#8220;astral plane&#8221; characters take four bytes.
<p>UTF-8 is a <em>variable-length</em> encoding system for Unicode. That is, different characters take up a different number of bytes. For <abbr>ASCII</abbr> characters (A-Z, <i class=baa>&amp;</i>c.) <abbr>UTF-8</abbr> uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0&ndash;127) in <abbr>UTF-8</abbr> are indistinguishable from <abbr>ASCII</abbr>. &#8220;Extended Latin&#8221; characters like &ntilde; and &ouml; end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like &#x4E2D; end up taking three bytes. The rarely-used &#8220;astral plane&#8221; characters take four bytes.
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation&nbsp;&mdash;&nbsp;that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you&#8217;ll have to trust me on this, because I&#8217;m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you&#8217;ll have to trust me on this, because I&#8217;m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in <abbr>UTF-8</abbr> uses the exact same stream of bytes on any computer.
<p class=a>&#x2042;
<h2 id=divingin>Diving In</h2>
<p>In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. &#8220;Is this string UTF-8?&#8221; is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
<p>In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in <abbr>UTF-8</abbr>, or a Python string encoded as CP-1252. &#8220;Is this string <abbr>UTF-8</abbr>?&#8221; is an invalid question. <abbr>UTF-8</abbr> is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>s = '深入 Python'</kbd> <span class=u>&#x2460;</span></a>
@@ -392,7 +392,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
<samp class=pp>True</samp></pre>
<ol>
<li>This is a string. It has nine characters.
<li>This is a <code>bytes</code> object. It has 13 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in UTF-8.
<li>This is a <code>bytes</code> object. It has 13 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in <abbr>UTF-8</abbr>.
<li>This is a <code>bytes</code> object. It has 11 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in <a href=http://en.wikipedia.org/wiki/GB_18030>GB18030</a>.
<li>This is a <code>bytes</code> object. It has 11 bytes. It is an <em>entirely different sequence of bytes</em> that you get when you take <var>a_string</var> and encode it in <a href=http://en.wikipedia.org/wiki/Big5>Big5</a>.
<li>This is a string. It has nine characters. It is the sequence of characters you get when you take <var>by</var> and decode it using the Big5 encoding algorithm. It is identical to the original string.
@@ -402,10 +402,10 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
<h2 id=py-encoding>Postscript: Character Encoding Of Python Source Code</h2>
<p>Python 3 assumes that your source code&nbsp;&mdash;&nbsp;<i>i.e.</i> each <code>.py</code> file&nbsp;&mdash;&nbsp;is encoded in UTF-8.
<p>Python 3 assumes that your source code&nbsp;&mdash;&nbsp;<i>i.e.</i> each <code>.py</code> file&nbsp;&mdash;&nbsp;is encoded in <abbr>UTF-8</abbr>.
<blockquote class='note compare python2'>
<p><span class=u>&#x261E;</span>In Python 2, the <dfn>default</dfn> encoding for <code>.py</code> files was <abbr>ASCII</abbr>. In Python 3, <a href=http://www.python.org/dev/peps/pep-3120/>the default encoding is UTF-8</a>.
<p><span class=u>&#x261E;</span>In Python 2, the <dfn>default</dfn> encoding for <code>.py</code> files was <abbr>ASCII</abbr>. In Python 3, <a href=http://www.python.org/dev/peps/pep-3120/>the default encoding is <abbr>UTF-8</abbr></a>.
</blockquote>
<p>If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a <code>.py</code> file to be windows-1252: