mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 15:00:18 +00:00
markup fiddling (encodings are always wrapped in abbr)
This commit is contained in:
@@ -47,20 +47,20 @@ del{background:#f87}
|
||||
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that’s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
|
||||
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
|
||||
<ol>
|
||||
<li><code>UTF-n</code> with a Byte Order Mark (<abbr>BOM</abbr>). This includes <code>UTF-8</code>, both Big-Endian and Little-Endian variants of <code>UTF-16</code>, and all 4 byte-order variants of <code>UTF-32</code>.
|
||||
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <code>ISO-2022-JP</code> (Japanese) and <code>HZ-GB-2312</code> (Chinese).
|
||||
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <code>Big5</code> (Chinese), <code>SHIFT_JIS</code> (Japanese), <code>EUC-KR</code> (Korean), and <code>UTF-8</code> without a <abbr>BOM</abbr>.
|
||||
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
|
||||
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
|
||||
<li><abbr>UTF-n</abbr> with a Byte Order Mark (<abbr>BOM</abbr>). This includes <abbr>UTF-8</abbr>, both Big-Endian and Little-Endian variants of <abbr>UTF-16</abbr>, and all 4 byte-order variants of <abbr>UTF-32</abbr>.
|
||||
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <abbr>ISO-2022-JP</abbr> (Japanese) and <abbr>HZ-GB-2312</abbr> (Chinese).
|
||||
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <abbr>Big5</abbr> (Chinese), <abbr>SHIFT_JIS</abbr> (Japanese), <abbr>EUC-KR</abbr> (Korean), and <abbr>UTF-8</abbr> without a <abbr>BOM</abbr>.
|
||||
<li>Single-byte encodings, where each character is represented by one byte. Examples: <abbr>KOI8-R</abbr> (Russian), <abbr>windows-1255</abbr> (Hebrew), and <abbr>TIS-620</abbr> (Thai).
|
||||
<li><abbr>windows-1252</abbr>, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
|
||||
</ol>
|
||||
<h3 id=how.bom><code>UTF-n</code> With A <abbr>BOM</abbr></h3>
|
||||
<p>If the text starts with a <abbr>BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr>BOM</abbr> will tell us exactly which one; that’s what it’s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
|
||||
<h3 id=how.bom><abbr>UTF-n</abbr> With A <abbr>BOM</abbr></h3>
|
||||
<p>If the text starts with a <abbr>BOM</abbr>, we can reasonably assume that the text is encoded in <abbr>UTF-8</abbr>, <abbr>UTF-16</abbr>, or <abbr>UTF-32</abbr>. (The <abbr>BOM</abbr> will tell us exactly which one; that’s what it’s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
|
||||
<h3 id=how.esc>Escaped Encodings</h3>
|
||||
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code>escprober.py</code>) and feeds it the text.
|
||||
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
|
||||
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <abbr>HZ-GB-2312</abbr>, <abbr>ISO-2022-CN</abbr>, <abbr>ISO-2022-JP</abbr>, and <abbr>ISO-2022-KR</abbr> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
|
||||
<h3 id=how.mb>Multi-Byte Encodings</h3>
|
||||
<p>Assuming no <abbr>BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
|
||||
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
|
||||
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <abbr>Big5</abbr>, <abbr>GB2312</abbr>, <abbr>EUC-TW</abbr>, <abbr>EUC-KR</abbr>, <abbr>EUC-JP</abbr>, <abbr>SHIFT_JIS</abbr>, and <abbr>UTF-8</abbr>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
|
||||
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
|
||||
<p>The distribution analyzers (each defined in <code>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
|
||||
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.
|
||||
|
||||
+2
-2
@@ -57,7 +57,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
|
||||
|
||||
<p>What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in <code>cp1252.py</code>, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly <code>UnicodeDecodeError</code>.
|
||||
|
||||
<p>But wait, it’s worse than that! The default encoding is <em>platform-dependent</em>, so this code <em>might</em> work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
|
||||
<p>But wait, it’s worse than that! The default encoding is <em>platform-dependent</em>, so this code <em>might</em> work on your computer (if your default encoding is <abbr>UTF-8</abbr>), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
|
||||
|
||||
<blockquote class=note>
|
||||
<p><span class=u>☞</span>If you need to get the default character encoding, import the <code>locale</code> module and call <code>locale.getpreferredencoding()</code>. On my Windows laptop, it returns <code>'cp1252'</code>, but on my Linux box upstairs, it returns <code>'UTF8'</code>. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file.
|
||||
@@ -141,7 +141,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
|
||||
<li>Now you’re on the 20<sup>th</sup> byte.
|
||||
</ol>
|
||||
|
||||
<p>Do you see it yet? The <code>seek()</code> and <code>tell()</code> methods always count <em>bytes</em>, but since you opened this file as text, the <code>read()</code> method counts <em>characters</em>. Chinese characters <a href=strings.html#boring-stuff>require multiple bytes to encode in UTF-8</a>. The English characters in the file only require one byte each, so you might be misled into thinking that the <code>seek()</code> and <code>read()</code> methods are counting the same thing. But that’s only true for some characters.
|
||||
<p>Do you see it yet? The <code>seek()</code> and <code>tell()</code> methods always count <em>bytes</em>, but since you opened this file as text, the <code>read()</code> method counts <em>characters</em>. Chinese characters <a href=strings.html#boring-stuff>require multiple bytes to encode in <abbr>UTF-8</abbr></a>. The English characters in the file only require one byte each, so you might be misled into thinking that the <code>seek()</code> and <code>read()</code> methods are counting the same thing. But that’s only true for some characters.
|
||||
|
||||
<p>But wait, it gets worse!
|
||||
|
||||
|
||||
+4
-4
@@ -310,7 +310,7 @@ def protocol_version(file_object):
|
||||
|
||||
<p>Second, as with any text-based format, there is the issue of whitespace. <abbr>JSON</abbr> allows arbitrary amounts of whitespace (spaces, tabs, carriage returns, and line feeds) between values. This whitespace is “insignificant,” which means that <abbr>JSON</abbr> encoders can add as much or as little whitespace as they like, and <abbr>JSON</abbr> decoders are required to ignore the whitespace between values. This allows you to “pretty-print” your <abbr>JSON</abbr> data, nicely nesting values within values at different indentation levels so you can read it in a standard browser or text editor. Python’s <code>json</code> module has options for pretty-printing during encoding.
|
||||
|
||||
<p>Third, there’s the perennial problem of character encoding. <abbr>JSON</abbr> encodes values as plain text, but as you know, <a href=strings.html>there ain’t no such thing as “plain text.”</a> <abbr>JSON</abbr> must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and <a href=http://www.ietf.org/rfc/rfc4627.txt>section 3 of RFC 4627</a> defines how to tell which encoding is being used.
|
||||
<p>Third, there’s the perennial problem of character encoding. <abbr>JSON</abbr> encodes values as plain text, but as you know, <a href=strings.html>there ain’t no such thing as “plain text.”</a> <abbr>JSON</abbr> must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, <abbr>UTF-8</abbr>), and <a href=http://www.ietf.org/rfc/rfc4627.txt>section 3 of RFC 4627</a> defines how to tell which encoding is being used.
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
@@ -332,7 +332,7 @@ def protocol_version(file_object):
|
||||
<a><samp class=p>... </samp><kbd class=pp> json.dump(basic_entry, f)</kbd> <span class=u>③</span></a></pre>
|
||||
<ol>
|
||||
<li>We’re going to create a new data structure instead of re-using the existing <var>entry</var> data structure. Later in this chapter, we’ll see what happens when we try to encode the more complex data structure in <abbr>JSON</abbr>.
|
||||
<li><abbr>JSON</abbr> is a text-based format, which means you need to open this file in text mode and specify a character encoding. You can never go wrong with UTF-8.
|
||||
<li><abbr>JSON</abbr> is a text-based format, which means you need to open this file in text mode and specify a character encoding. You can never go wrong with <abbr>UTF-8</abbr>.
|
||||
<li>Like the <code>pickle</code> module, the <code>json</code> module defines a <code>dump()</code> function which takes a Python data structure and a writeable stream object. The <code>dump()</code> function serializes the Python data structure and writes it to the stream object. Doing this inside a <code>with</code> statement will ensure that the file is closed properly when we’re done.
|
||||
</ol>
|
||||
|
||||
@@ -445,7 +445,7 @@ def protocol_version(file_object):
|
||||
<mark>TypeError: b'\xDE\xD5\xB4\xF8' is not JSON serializable</mark></samp></pre>
|
||||
<ol>
|
||||
<li>OK, it’s time to revisit the <var>entry</var> data structure. This has it all: a boolean value, a <code>None</code> value, a string, a tuple of strings, a <code>bytes</code> object, and a <code>time</code> structure.
|
||||
<li>I know I’ve said it before, but it’s worth repeating: <abbr>JSON</abbr> is a text-based format. Always open <abbr>JSON</abbr> files in text mode with a UTF-8 character encoding.
|
||||
<li>I know I’ve said it before, but it’s worth repeating: <abbr>JSON</abbr> is a text-based format. Always open <abbr>JSON</abbr> files in text mode with a <abbr>UTF-8</abbr> character encoding.
|
||||
<li>Well <em>that’s</em> not good. What happened?
|
||||
</ol>
|
||||
|
||||
@@ -490,7 +490,7 @@ def protocol_version(file_object):
|
||||
TypeError: time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1) is not JSON serializable</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>customserializer</code> module is where you just defined the <code>to_json()</code> function in the previous example.
|
||||
<li>Text mode, UTF-8 encoding, yadda yadda. (You’ll forget! I forget sometimes! And everything will work right up until the moment that it fails, and then it will fail most spectacularly.)
|
||||
<li>Text mode, <abbr>UTF-8</abbr> encoding, yadda yadda. (You’ll forget! I forget sometimes! And everything will work right up until the moment that it fails, and then it will fail most spectacularly.)
|
||||
<li>This is the important bit: to hook your custom conversion function into the <code>json.dump()</code> function, pass your function into the <code>json.dump()</code> function in the <var>default</var> parameter. (Hooray, <a href=your-first-python-program.html#everythingisanobject>everything in Python is an object</a>!)
|
||||
<li>OK, so it didn’t actually work. But take a look at the exception. The <code>json.dump()</code> function is no longer complaining about being unable to serialize the <code>bytes</code> object. Now it’s complaining about a completely different object: the <code>time.struct_time</code> object.
|
||||
</ol>
|
||||
|
||||
+6
-6
@@ -69,17 +69,17 @@ My alphabet starts where your alphabet ends! <span class=u>❞</span><br>&m
|
||||
|
||||
<p class=xxxl>UTF-8
|
||||
|
||||
<p>UTF-8 is a <em>variable-length</em> encoding system for Unicode. That is, different characters take up a different number of bytes. For <abbr>ASCII</abbr> characters (A-Z, <i class=baa>&</i>c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from <abbr>ASCII</abbr>. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes.
|
||||
<p>UTF-8 is a <em>variable-length</em> encoding system for Unicode. That is, different characters take up a different number of bytes. For <abbr>ASCII</abbr> characters (A-Z, <i class=baa>&</i>c.) <abbr>UTF-8</abbr> uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in <abbr>UTF-8</abbr> are indistinguishable from <abbr>ASCII</abbr>. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes.
|
||||
|
||||
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
|
||||
|
||||
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
|
||||
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in <abbr>UTF-8</abbr> uses the exact same stream of bytes on any computer.
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id=divingin>Diving In</h2>
|
||||
|
||||
<p>In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
|
||||
<p>In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in <abbr>UTF-8</abbr>, or a Python string encoded as CP-1252. “Is this string <abbr>UTF-8</abbr>?” is an invalid question. <abbr>UTF-8</abbr> is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd class=pp>s = '深入 Python'</kbd> <span class=u>①</span></a>
|
||||
@@ -392,7 +392,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
|
||||
<samp class=pp>True</samp></pre>
|
||||
<ol>
|
||||
<li>This is a string. It has nine characters.
|
||||
<li>This is a <code>bytes</code> object. It has 13 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in UTF-8.
|
||||
<li>This is a <code>bytes</code> object. It has 13 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in <abbr>UTF-8</abbr>.
|
||||
<li>This is a <code>bytes</code> object. It has 11 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in <a href=http://en.wikipedia.org/wiki/GB_18030>GB18030</a>.
|
||||
<li>This is a <code>bytes</code> object. It has 11 bytes. It is an <em>entirely different sequence of bytes</em> that you get when you take <var>a_string</var> and encode it in <a href=http://en.wikipedia.org/wiki/Big5>Big5</a>.
|
||||
<li>This is a string. It has nine characters. It is the sequence of characters you get when you take <var>by</var> and decode it using the Big5 encoding algorithm. It is identical to the original string.
|
||||
@@ -402,10 +402,10 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp>
|
||||
|
||||
<h2 id=py-encoding>Postscript: Character Encoding Of Python Source Code</h2>
|
||||
|
||||
<p>Python 3 assumes that your source code — <i>i.e.</i> each <code>.py</code> file — is encoded in UTF-8.
|
||||
<p>Python 3 assumes that your source code — <i>i.e.</i> each <code>.py</code> file — is encoded in <abbr>UTF-8</abbr>.
|
||||
|
||||
<blockquote class='note compare python2'>
|
||||
<p><span class=u>☞</span>In Python 2, the <dfn>default</dfn> encoding for <code>.py</code> files was <abbr>ASCII</abbr>. In Python 3, <a href=http://www.python.org/dev/peps/pep-3120/>the default encoding is UTF-8</a>.
|
||||
<p><span class=u>☞</span>In Python 2, the <dfn>default</dfn> encoding for <code>.py</code> files was <abbr>ASCII</abbr>. In Python 3, <a href=http://www.python.org/dev/peps/pep-3120/>the default encoding is <abbr>UTF-8</abbr></a>.
|
||||
</blockquote>
|
||||
|
||||
<p>If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a <code>.py</code> file to be windows-1252:
|
||||
|
||||
Reference in New Issue
Block a user