From 131638d9ea46455dce37a3e2845392e8398f5318 Mon Sep 17 00:00:00 2001 From: Mark Pilgrim Date: Sat, 26 Sep 2009 00:12:49 -0400 Subject: [PATCH] markup fiddling (encodings are always wrapped in abbr) --- case-study-porting-chardet-to-python-3.html | 18 +++++++++--------- files.html | 4 ++-- serializing.html | 8 ++++---- strings.html | 12 ++++++------ 4 files changed, 21 insertions(+), 21 deletions(-) diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index 5fc87e2..5d0ac3d 100755 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -47,20 +47,20 @@ del{background:#f87}

The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)

There are 5 categories of encodings that UniversalDetector handles:

    -
  1. UTF-n with a Byte Order Mark (BOM). This includes UTF-8, both Big-Endian and Little-Endian variants of UTF-16, and all 4 byte-order variants of UTF-32. -
  2. Escaped encodings, which are entirely 7-bit ASCII compatible, where non-ASCII characters start with an escape sequence. Examples: ISO-2022-JP (Japanese) and HZ-GB-2312 (Chinese). -
  3. Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: Big5 (Chinese), SHIFT_JIS (Japanese), EUC-KR (Korean), and UTF-8 without a BOM. -
  4. Single-byte encodings, where each character is represented by one byte. Examples: KOI8-R (Russian), windows-1255 (Hebrew), and TIS-620 (Thai). -
  5. windows-1252, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground. +
  6. UTF-n with a Byte Order Mark (BOM). This includes UTF-8, both Big-Endian and Little-Endian variants of UTF-16, and all 4 byte-order variants of UTF-32. +
  7. Escaped encodings, which are entirely 7-bit ASCII compatible, where non-ASCII characters start with an escape sequence. Examples: ISO-2022-JP (Japanese) and HZ-GB-2312 (Chinese). +
  8. Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: Big5 (Chinese), SHIFT_JIS (Japanese), EUC-KR (Korean), and UTF-8 without a BOM. +
  9. Single-byte encodings, where each character is represented by one byte. Examples: KOI8-R (Russian), windows-1255 (Hebrew), and TIS-620 (Thai). +
  10. windows-1252, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
-

UTF-n With A BOM

-

If the text starts with a BOM, we can reasonably assume that the text is encoded in UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing. +

UTF-n With A BOM

+

If the text starts with a BOM, we can reasonably assume that the text is encoded in UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing.

Escaped Encodings

If the text contains a recognizable escape sequence that might indicate an escaped encoding, UniversalDetector creates an EscCharSetProber (defined in escprober.py) and feeds it the text. -

EscCharSetProber creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in escsm.py). EscCharSetProber feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, EscCharSetProber immediately returns the positive result to UniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines. +

EscCharSetProber creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in escsm.py). EscCharSetProber feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, EscCharSetProber immediately returns the positive result to UniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.

Multi-Byte Encodings

Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252. -

The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller. +

The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller.

Most of the multi-byte encoding probers are inherited from MultiByteCharSetProber (defined in mbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and let MultiByteCharSetProber do the rest of the work. MultiByteCharSetProber runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, MultiByteCharSetProber feeds the text to an encoding-specific distribution analyzer.

The distribution analyzers (each defined in chardistribution.py) use language-specific models of which characters are used most frequently. Once MultiByteCharSetProber has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, MultiByteCharSetProber returns the result to MBCSGroupProber, which returns it to UniversalDetector, which returns it to the caller.

The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between EUC-JP and SHIFT_JIS, so the SJISProber (defined in sjisprober.py) also uses 2-character distribution analysis. SJISContextAnalysis and EUCJPContextAnalysis (both defined in jpcntx.py and both inheriting from a common JapaneseContextAnalysis class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to SJISProber, which checks both analyzers and returns the higher confidence level to MBCSGroupProber. diff --git a/files.html b/files.html index cf482d2..2d8089e 100644 --- a/files.html +++ b/files.html @@ -57,7 +57,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara

What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError. -

But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252). +

But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).

If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file. @@ -141,7 +141,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara

  • Now you’re on the 20th byte. -

    Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that the seek() and read() methods are counting the same thing. But that’s only true for some characters. +

    Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that the seek() and read() methods are counting the same thing. But that’s only true for some characters.

    But wait, it gets worse! diff --git a/serializing.html b/serializing.html index e81a674..9c1b86b 100644 --- a/serializing.html +++ b/serializing.html @@ -310,7 +310,7 @@ def protocol_version(file_object):

    Second, as with any text-based format, there is the issue of whitespace. JSON allows arbitrary amounts of whitespace (spaces, tabs, carriage returns, and line feeds) between values. This whitespace is “insignificant,” which means that JSON encoders can add as much or as little whitespace as they like, and JSON decoders are required to ignore the whitespace between values. This allows you to “pretty-print” your JSON data, nicely nesting values within values at different indentation levels so you can read it in a standard browser or text editor. Python’s json module has options for pretty-printing during encoding. -

    Third, there’s the perennial problem of character encoding. JSON encodes values as plain text, but as you know, there ain’t no such thing as “plain text.” JSON must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and section 3 of RFC 4627 defines how to tell which encoding is being used. +

    Third, there’s the perennial problem of character encoding. JSON encodes values as plain text, but as you know, there ain’t no such thing as “plain text.” JSON must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and section 3 of RFC 4627 defines how to tell which encoding is being used.

    ⁂ @@ -332,7 +332,7 @@ def protocol_version(file_object): ... json.dump(basic_entry, f)

    1. We’re going to create a new data structure instead of re-using the existing entry data structure. Later in this chapter, we’ll see what happens when we try to encode the more complex data structure in JSON. -
    2. JSON is a text-based format, which means you need to open this file in text mode and specify a character encoding. You can never go wrong with UTF-8. +
    3. JSON is a text-based format, which means you need to open this file in text mode and specify a character encoding. You can never go wrong with UTF-8.
    4. Like the pickle module, the json module defines a dump() function which takes a Python data structure and a writeable stream object. The dump() function serializes the Python data structure and writes it to the stream object. Doing this inside a with statement will ensure that the file is closed properly when we’re done.
    @@ -445,7 +445,7 @@ def protocol_version(file_object): TypeError: b'\xDE\xD5\xB4\xF8' is not JSON serializable
    1. OK, it’s time to revisit the entry data structure. This has it all: a boolean value, a None value, a string, a tuple of strings, a bytes object, and a time structure. -
    2. I know I’ve said it before, but it’s worth repeating: JSON is a text-based format. Always open JSON files in text mode with a UTF-8 character encoding. +
    3. I know I’ve said it before, but it’s worth repeating: JSON is a text-based format. Always open JSON files in text mode with a UTF-8 character encoding.
    4. Well that’s not good. What happened?
    @@ -490,7 +490,7 @@ def protocol_version(file_object): TypeError: time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1) is not JSON serializable
    1. The customserializer module is where you just defined the to_json() function in the previous example. -
    2. Text mode, UTF-8 encoding, yadda yadda. (You’ll forget! I forget sometimes! And everything will work right up until the moment that it fails, and then it will fail most spectacularly.) +
    3. Text mode, UTF-8 encoding, yadda yadda. (You’ll forget! I forget sometimes! And everything will work right up until the moment that it fails, and then it will fail most spectacularly.)
    4. This is the important bit: to hook your custom conversion function into the json.dump() function, pass your function into the json.dump() function in the default parameter. (Hooray, everything in Python is an object!)
    5. OK, so it didn’t actually work. But take a look at the exception. The json.dump() function is no longer complaining about being unable to serialize the bytes object. Now it’s complaining about a completely different object: the time.struct_time object.
    diff --git a/strings.html b/strings.html index ed43bd8..903c234 100755 --- a/strings.html +++ b/strings.html @@ -69,17 +69,17 @@ My alphabet starts where your alphabet ends!
    &m

    UTF-8 -

    UTF-8 is a variable-length encoding system for Unicode. That is, different characters take up a different number of bytes. For ASCII characters (A-Z, &c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from ASCII. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes. +

    UTF-8 is a variable-length encoding system for Unicode. That is, different characters take up a different number of bytes. For ASCII characters (A-Z, &c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from ASCII. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes.

    Disadvantages: because each character can take a different number of bytes, finding the Nth character is an O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters. -

    Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer. +

    Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.

    Diving In

    -

    In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions. +

    In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.

     >>> s = '深入 Python'    
    @@ -392,7 +392,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
     True
    1. This is a string. It has nine characters. -
    2. This is a bytes object. It has 13 bytes. It is the sequence of bytes you get when you take a_string and encode it in UTF-8. +
    3. This is a bytes object. It has 13 bytes. It is the sequence of bytes you get when you take a_string and encode it in UTF-8.
    4. This is a bytes object. It has 11 bytes. It is the sequence of bytes you get when you take a_string and encode it in GB18030.
    5. This is a bytes object. It has 11 bytes. It is an entirely different sequence of bytes that you get when you take a_string and encode it in Big5.
    6. This is a string. It has nine characters. It is the sequence of characters you get when you take by and decode it using the Big5 encoding algorithm. It is identical to the original string. @@ -402,10 +402,10 @@ TypeError: Can't convert 'bytes' object to str implicitly

      Postscript: Character Encoding Of Python Source Code

      -

      Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8. +

      Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8.

      -

      In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8. +

      In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8.

      If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a .py file to be windows-1252: