mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
various fixes in strings chapter [thanks A.H.]
This commit is contained in:
+3
-3
@@ -41,7 +41,7 @@ My alphabet starts where your alphabet ends! <span>❞</span><br>— Dr
|
||||
|
||||
<p>Now think about trying to store multiple pieces of text in the same place, like in the same database table that holds all the email you’ve ever received. You still need to store the character encoding alongside each piece of text so you can display it properly. Think that’s hard? Try searching your email database, which means converting between multiple encodings on the fly. Doesn’t that sound fun?
|
||||
|
||||
<p>Now think about the possibility of multilingual documents, where characters from several languages are next to each other in the same document. (Hint: programs that tried to do this typically used escape codes to switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means this character; poof, now you’re in Mac Greek mode, so 241 means some other character.) And of course you’ll want to search <em>those</em> documents, too.
|
||||
<p>Now think about the possibility of multilingual documents, where characters from several languages are next to each other in the same document. (Hint: programs that tried to do this typically used escape codes to switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means Я; poof, now you’re in Mac Greek mode, so 241 means ώ.) And of course you’ll want to search <em>those</em> documents, too.
|
||||
|
||||
<p>Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such thing as “plain text.”
|
||||
|
||||
@@ -61,7 +61,7 @@ My alphabet starts where your alphabet ends! <span>❞</span><br>— Dr
|
||||
|
||||
<p>To solve <em>this</em> problem, the multi-byte Unicode encodings define a “Byte Order Mark,” which is a special non-printable character that you can include at the beginning of your document to indicate what order your bytes are in. For UTF-16, the Byte Order Mark is <code>U+FEFF</code>. If you receive a UTF-16 document that starts with the bytes <code>FF FE</code>, you know the byte ordering is one way; if it starts with <code>FE FF</code>, you know the byte ordering is reversed.
|
||||
|
||||
<p>Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of <abbr>ASCII</abbr> characters. If you think about it, even a Chinese web page is going to contain a lot of <abbr>ASCII</abbr> characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the <var>Nth</var> character in O(1) time is nice, but there’s still the nagging problem of those astral plane characters, which mean that you can’t <em>guarantee</em> that every character is exactly two bytes, so you can’t <em>really</em> find the <var>Nth</var> character in O(1) time unless you maintain a separate index. And boy, there sure is a lot of <abbr>ASCII</abbr> text in the world…
|
||||
<p>Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of <abbr>ASCII</abbr> characters. If you think about it, even a Chinese web page is going to contain a lot of <abbr>ASCII</abbr> characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the <var>Nth</var> character in constant time is nice, but there’s still the nagging problem of those astral plane characters, which mean that you can’t <em>guarantee</em> that every character is exactly two bytes, so you can’t <em>really</em> find the <var>Nth</var> character in constant time unless you maintain a separate index. And boy, there sure is a lot of <abbr>ASCII</abbr> text in the world…
|
||||
|
||||
<p>Other people pondered these questions, and they came up with a solution:
|
||||
|
||||
@@ -69,7 +69,7 @@ My alphabet starts where your alphabet ends! <span>❞</span><br>— Dr
|
||||
|
||||
<p>UTF-8 is a <em>variable-length</em> encoding system for Unicode. That is, different characters take up a different number of bytes. For <abbr>ASCII</abbr> characters (A-Z, <i class=baa>&</i>c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from <abbr>ASCII</abbr>. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes.
|
||||
|
||||
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
|
||||
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
|
||||
|
||||
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user