added asides, new styles

Mark Pilgrim
2009-03-28 15:58:35 -05:00
parent fe57cb0215
commit 1a4ce72944
9 changed files with 127 additions and 83 deletions
+17 -11
@@ -16,13 +16,15 @@ body{counter-reset:h1 3}
My alphabet starts where your alphabet ends! <span>&#x275E;</span><br>&mdash; Dr. Seuss, On Beyond Zebra!
</blockquote>
<p id=toc>&nbsp;
<h2 id=boring-stuff>Some boring stuff you need to understand before you can dive in</h2>
<h2 id=boring-stuff>Some Boring Stuff You Need To Understand Before You Can Dive In</h2>
<p class=f>Did you know that the people of <a href="http://en.wikipedia.org/wiki/Bougainville_Province">Bougainville</a> have the smallest alphabet in the world? Their <a href="http://en.wikipedia.org/wiki/Rotokas_alphabet">Rotokas alphabet</a> is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters &mdash; 52 if you count uppercase and lowercase separately &mdash; plus a handful of <i class=baa>!@#$%&</i> punctuation marks.
<p>When people talk about &#8220;text,&#8221; they&#8217;re thinking of &#8220;characters and symbols on the computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <i>character encoding</i>. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes &mdash; a file, a web page, whatever &mdash; and claims it&#8217;s &#8220;text,&#8221; you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you&#8217;re left with the unenviable task of cracking the code yourself. Chances are you&#8217;ll get it wrong, and the result will be gibberish.
<aside>Everything you thought you knew about strings is wrong.</aside>
<p>Surely you&#8217;ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn&#8217;t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it&#8217;s merely annoying; in other languages, the result can be completely unreadable.
<p>There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0&ndash;255) to represent that language&#8217;s characters. For instance, you&#8217;re probably familiar with the <abbr>ASCII</abbr> encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital &#8220;A&#8221;, 97 is lowercase &#8220;a&#8221;, <i class=baa>&amp;</i>c.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that&#8217;s 7 out of the 8 bits in a byte.
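<p>As a rough illustration of the mapping between characters and stored numbers, here is a minimal sketch using Python 3 built-ins. The sample word and the choice of Latin-1 as the &#8220;wrong key&#8221; are assumptions made for this example.

```python
# ASCII maps English characters to numbers from 0 to 127.
capital_a = ord('A')      # 65
lowercase_a = chr(97)     # 'a'

# The same text becomes different byte sequences under different encodings.
word = 'Pe\u00f1a'        # illustrative sample containing one non-ASCII character
utf8_bytes = word.encode('utf-8')
latin1_bytes = word.encode('latin-1')

# Decoding with the right key recovers the text;
# decoding with the wrong key produces gibberish instead of an error.
right = utf8_bytes.decode('utf-8')
wrong = utf8_bytes.decode('latin-1')

print(capital_a, lowercase_a)
print(utf8_bytes, latin1_bytes)
print(right, wrong)
```

<p>Note that the wrong decoding does not fail; it silently yields a different string, which is why a mis-declared encoding shows up as stray characters on the page.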
@@ -97,8 +99,9 @@ La Pe&ntilde;a</pre>
</ol>
</div>
<h2 id=divingin>Diving in</h2>
<h2 id=divingin>Diving In</h2>
<aside>Strings can be defined with either single or double quotes.</aside>
<p>Let's take another look at <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
@@ -135,7 +138,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li>There's a&hellip; whoa, what the heck is that?
</ol>
<h2 id=formatting-strings>Formatting strings</h2>
<h2 id=formatting-strings>Formatting Strings</h2>
<p>Python 3 supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder.
@@ -149,7 +152,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li>There's a lot going on here. First, that's a method call on a string literal. <em>Strings are objects</em>, and objects have methods. Second, the whole expression evaluates to a string. Third, <code>{0}</code> and <code>{1}</code> are <i>replacement fields</i>, which are replaced by the arguments passed to the <code>format()</code> method.
</ol>
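<p>The three points above can be sketched in a few lines. The values of <var>username</var> and <var>password</var> here are hypothetical, chosen only for illustration.

```python
username = 'mark'          # hypothetical value
password = 'PapayaWhip'    # hypothetical value

# A method call on a string literal; the whole expression evaluates to a new string.
# {0} is replaced by the first argument, {1} by the second.
message = "{0}'s password is {1}".format(username, password)
print(message)
```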
<h3 id=compound-field-names>Compound field names</h3>
<h3 id=compound-field-names>Compound Field Names</h3>
<p>The previous example shows the simplest case, where the replacement fields are simply integers. Integer replacement fields are treated as positional indices into the argument list of the <code>format()</code> method. That means that <code>{0}</code> is replaced by the first argument (<var>username</var> in this case), <code>{1}</code> is replaced by the second argument (<var>password</var>), <i class=baa>&amp;</i>c. You can have as many positional indices as you have arguments, and you can have as many arguments as you want. But replacement fields are much more powerful than that.
@@ -166,6 +169,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li>This looks complicated, but it's not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list which is the first argument passed to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces &mdash; including <code>1000</code>, the equals sign, and the spaces &mdash; is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
</ol>
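<p>A minimal sketch of the list-indexing case just described. The text only confirms the first two items of <var>si_suffixes</var>; the remaining items are assumed for this example.

```python
# Only 'KB' and 'MB' are confirmed by the text; the rest are assumed.
si_suffixes = ['KB', 'MB', 'GB', 'TB']

# {0[0]} is the first item of the first argument, {0[1]} is the second item.
result = '1000{0[0]} = 1{0[1]}'.format(si_suffixes)
print(result)
```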
<aside>{0} is replaced by the 1<sup>st</sup> format() argument. {1} is replaced by the 2<sup>nd</sup>.</aside>
<p>What this example shows is that <em>format specifiers can access items and properties of data structures using (almost) Python syntax</em>. These are called <i>compound field names</i>. The following compound field names "just work":
<ul>
@@ -195,7 +199,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<li><code>sys.modules["humansize"].SUFFIXES[1000][0]</code> is the first item of the list of <abbr>SI</abbr> suffixes: <code>'KB'</code>. Therefore, the complete replacement field <code>{0.modules[humansize].SUFFIXES[1000][0]}</code> is replaced by the two-character string <code>KB</code>.
</ul>
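<p>To try the same chain of lookups without importing <code>humansize</code>, here is a sketch that uses a stand-in object. The names <var>fake_module</var> and <var>modules</var> are hypothetical, and the suffix lists are abbreviated.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the humansize module, with abbreviated suffix lists.
fake_module = SimpleNamespace(SUFFIXES={1000: ['KB', 'MB'], 1024: ['KiB', 'MiB']})
modules = {'humansize': fake_module}   # plays the role of sys.modules

# Inside a replacement field, dictionary keys are written without quotes,
# and an all-digit key such as 1000 is treated as an integer.
result = '{0[humansize].SUFFIXES[1000][0]}'.format(modules)
print(result)
```

<p>The missing quotes around <code>humansize</code> are the &#8220;almost&#8221; in &#8220;(almost) Python syntax&#8221;.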
<h3 id=format-specifiers>Format specifiers</h3>
<h3 id=format-specifiers>Format Specifiers</h3>
<p>But wait! There's more! Let's take another look at that strange line of code from <code>humansize.py</code>:
@@ -216,7 +220,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<p>For all the gory details on format specifiers, consult the <a href="http://docs.python.org/dev/3.0/library/string.html#format-specification-mini-language">Format Specification Mini-Language</a> in the official Python documentation.
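<p>As a small sketch of a format specifier in action, assuming a size of <code>698.24</code> and a suffix of <code>'GB'</code> (both values are illustrative):

```python
# Within a replacement field, a colon introduces the format specifier.
# .1f means: fixed-point notation, rounded to one decimal place.
size = 698.24      # illustrative value
suffix = 'GB'      # illustrative value
formatted = '{0:.1f} {1}'.format(size, suffix)
print(formatted)
```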
<h2 id=common-string-methods>Other common string methods</h2>
<h2 id=common-string-methods>Other Common String Methods</h2>
<p>Besides formatting, strings can do a number of other useful tricks.
@@ -241,7 +245,7 @@ experience of years.</samp>
<li>You can input multi-line strings in the Python interactive shell. Once you start a multi-line string with triple quotation marks, just hit <kbd>ENTER</kbd> and the interactive shell will prompt you to continue the string. Typing the closing triple quotation marks ends the string, and the next <kbd>ENTER</kbd> will execute the command (in this case, assigning the string to <var>s</var>).
<li>The <code>splitlines()</code> method takes one multi-line string and returns a list of strings, one for each line of the original. Note that the carriage returns at the end of each line are not included.
<li>The <code>lower()</code> method converts the entire string to lowercase. (Similarly, the <code>upper()</code> method converts a string to uppercase.)
<li>the <code>count()</code> method counts the number of occurrences of a substring. Yes, there really are six &#8220;f&#8221;s in that sentence!
<li>The <code>count()</code> method counts the number of occurrences of a substring. Yes, there really are six &#8220;f&#8221;s in that sentence!
</ol>
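<p>The methods above can be tried together in a short script. The line breaks in this sample string are an assumption for illustration and may differ from the example in the chapter.

```python
# A multi-line string, delimited by triple quotation marks.
s = '''Finished files are the result
of years of scientific study
combined with the experience
of years.'''

lines = s.splitlines()         # one string per line, line endings stripped
lowered = s.lower()            # the entire string in lowercase
f_count = lowered.count('f')   # occurrences of the substring 'f'

print(lines)
print(f_count)
```

<p>Counting on the original string instead of the lowercased one misses the capital &#8220;F&#8221;, because <code>count()</code> is case-sensitive.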
<!--
@@ -318,7 +322,7 @@ is an object. You might have thought I meant that string <em>variables</em> are
</div>
-->
<h2 id=string-module>The <code>string</code> module</h2>
<h2 id=string-module>The <code>string</code> Module</h2>
<p>[FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.]
@@ -326,9 +330,11 @@ is an object. You might have thought I meant that string <em>variables</em> are
<p>When I first learned Python, I expected <code>join</code> to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the <code>join</code> method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate <code>string</code> module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like <code>lower</code>, <code>upper</code>, and <code>split</code>. But many hard-core Python programmers objected to the new <code>join</code> method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old <code>string</code> module (which still has a lot of useful stuff in it). I use the <code>join</code> method exclusively, but you will see older code written either way. The old <code>string.join</code> function was removed in Python 3, so there the method is the only option.
</div>
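<p>For reference, a minimal sketch of <code>join</code> as a string method, called on the delimiter rather than on the list. The list contents are hypothetical.

```python
params = ['server=mpilgrim', 'uid=sa', 'database=master']  # hypothetical values

# join is a method of the delimiter string; the list is its argument.
joined = ';'.join(params)

# split on the same delimiter reverses the operation.
parts = joined.split(';')

print(joined)
print(parts)
```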
<h2 id=byte-arrays>Strings vs. bytes</h2>
<h2 id=byte-arrays>Strings vs. Bytes</h2>
<h2 id=py-encoding>Character encoding of Python source code</h2>
<p>FIXME
<h2 id=py-encoding>Character Encoding Of Python Source Code</h2>
<p>Python 3 assumes that your source code &mdash; <i>i.e.</i> each <code>.py</code> file &mdash; is encoded in <abbr>UTF-8</abbr>.
@@ -347,7 +353,7 @@ is an object. You might have thought I meant that string <em>variables</em> are
<p>For more information, consult <a href="http://www.python.org/dev/peps/pep-0263/"><abbr>PEP</abbr> 263: Defining Python Source Code Encodings</a>.
<h2 id=furtherreading>Further reading</h2>
<h2 id=furtherreading>Further Reading</h2>
<p>On Unicode in Python: