clarifications and typos [h/t G.M.]

Mark Pilgrim
2009-06-18 00:06:53 -04:00
parent 0053b56c01
commit b0f4e357fe
2 changed files with 5 additions and 4 deletions
+4 -3
@@ -611,6 +611,7 @@ user-agent: Python-httplib2/$Rev: 259 $'
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml')</kbd> <span class=u>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>response2.fromcache</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>True</samp>
<a><samp class=p>>>> </samp><kbd class=pp>content2 == content</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>True</samp>
</pre>
@@ -811,10 +812,10 @@ user-agent: Python-httplib2/$Rev: 259 $
<samp class=p>>>> </samp><kbd class=pp>resp.status</kbd>
<samp class=pp>200</samp></pre>
<ol>
-<li>&#8220;Delete this status message, please.&#8221;
+<li>&#8220;Delete this status message.&#8221;
<li>&#8220;I&#8217;m sorry, Dave, I&#8217;m afraid I can&#8217;t do that.&#8221;
-<li>&#8220;Delete this status message, please&hellip;
-<li>&hellip;here&#8217;s my username and password.&#8221;
+<li>&#8220;Unauthorized<span class=u title='interrobang!'>&#8253;</span> Hmmph. Delete this status message, <em>please</em>&hellip;
+<li>&hellip;and here&#8217;s my username and password.&#8221;
<li>&#8220;Consider it done!&#8221;
</ol>
+1 -1
@@ -55,7 +55,7 @@ My alphabet starts where your alphabet ends! <span class=u>&#x275E;</span><br>&m
<p>Unicode is a system designed to represent <em>every</em> character from <em>every</em> language. Unicode represents each letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world&#8217;s languages. (Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn&#8217;t be sufficient.) Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no &#8220;modes&#8221; to keep track of. <code>U+0041</code> is always <code>'A'</code>, even if your language doesn&#8217;t have an <code>'A'</code> in it.
-<p>On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more &#8220;mode switching&#8221; to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character<span title='interrobang!'>&#8253;</span> That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it&#8217;s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.
+<p>On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more &#8220;mode switching&#8221; to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character<span class=u title='interrobang!'>&#8253;</span> That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it&#8217;s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.
<p>There is a Unicode encoding that uses four bytes per character. It&#8217;s called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the <var>Nth</var> character of a string in constant time, because the <var>Nth</var> character starts at the <var>4&times;Nth</var> byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character.
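
Reviewer's note, not part of the commit: the constant-time lookup described in the UTF-32 paragraph can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions. The sample string and the helper name `nth_char` are hypothetical, and the `utf-32-le` codec is used so that no byte order mark shifts the offsets.

```python
# Illustrative sketch only; `nth_char` and the sample text are hypothetical.
text = "A\u00e9\u4e2d\U0001F600"        # 'A', e-acute, a CJK ideograph, an emoji

# 'utf-32-le' writes exactly 4 bytes per character and no byte order mark.
encoded = text.encode("utf-32-le")

def nth_char(data, n):
    """Return the Nth character (0-based) of UTF-32-LE bytes by slicing."""
    chunk = data[4 * n : 4 * n + 4]     # the Nth character starts at byte 4*N
    return chunk.decode("utf-32-le")

print(len(text), len(encoded))          # 4 characters take 16 bytes
print(nth_char(encoded, 2) == text[2])  # the slice recovers the same character
```

The same slice works for any index without scanning the preceding characters, which is the advantage the paragraph describes; the cost is the fixed four bytes per character.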