From b0f4e357fe016ff7aca8be96cb6aa7bdda8f97b3 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Thu, 18 Jun 2009 00:06:53 -0400
Subject: [PATCH] clarifications and typos [h/t G.M.]

---
 http-web-services.html | 7 ++++---
 strings.html | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/http-web-services.html b/http-web-services.html
index 6400848..4543c83 100644
--- a/http-web-services.html
+++ b/http-web-services.html
@@ -611,6 +611,7 @@ user-agent: Python-httplib2/$Rev: 259 $'
 # continued from the previous example
 >>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml')
 >>> response2.fromcache
+True
 >>> content2 == content
 True
@@ -811,10 +812,10 @@ user-agent: Python-httplib2/$Rev: 259 $
 >>> resp.status
 200
-“Delete this status message, please.”
+“Delete this status message.”
 “I’m sorry, Dave, I’m afraid I can’t do that.”
-“Delete this status message, please…
-…here’s my username and password.”
+“Unauthorized? Hmmph. Delete this status message, please…
+…and here’s my username and password.”
 “Consider it done!”
diff --git a/strings.html b/strings.html
index b04b450..ea2a36a 100644
--- a/strings.html
+++ b/strings.html
@@ -55,7 +55,7 @@ My alphabet starts where your alphabet ends!
&m

 Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world’s languages. (Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn’t be sufficient.) Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn’t have an 'A' in it.

-On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character? That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it’s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.
+On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character? That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it’s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.

There is a Unicode encoding that uses four bytes per character. It’s called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character.
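
The constant-time lookup described above can be sketched in a few lines of Python 3. This is only an illustration, not part of the patch: the helper name `nth_char_utf32` and the sample string are made up for the example, and it assumes big-endian UTF-32 with no byte order mark.

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th (0-based) character of UTF-32-BE encoded data.

    Hypothetical helper: every character occupies exactly 4 bytes,
    so character n always starts at byte offset 4 * n.
    """
    start = 4 * n
    code_point = int.from_bytes(data[start:start + 4], "big")
    return chr(code_point)


text = "A深入 Python"                 # sample string, chosen for illustration
encoded = text.encode("utf-32-be")    # big-endian, no byte order mark

# The disadvantage: four bytes for every character, even plain 'A'.
assert len(encoded) == 4 * len(text)

# The advantage: jump straight to any character without scanning.
print(nth_char_utf32(encoded, 0))     # A
print(nth_char_utf32(encoded, 1))     # 深
```

The slice and the integer conversion do not depend on how long the string is, which is the whole appeal; the `len(encoded)` check shows the price paid for it.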