From e871e69319857862b093f958334a6901fa2199e4 Mon Sep 17 00:00:00 2001 From: Mark Pilgrim Date: Sat, 21 Mar 2009 13:09:06 -0400 Subject: [PATCH] a little further along in strings chapter --- about.html | 1 + dip3.css | 1 + dip3.js | 6 +- strings.html | 163 +++++++++++++++++++++++++++++++++++---------------- 4 files changed, 118 insertions(+), 53 deletions(-) diff --git a/about.html b/about.html index 43131d0..5a265ea 100644 --- a/about.html +++ b/about.html @@ -21,4 +21,5 @@ h1:before{content:""}
  • The text uses Unicode characters in place of graphics wherever possible.
  • The entire book was lovingly hand-authored in HTML 5 to avoid markup cruft. +

    Send corrections and feedback to mark@diveintomark.org.

    © 2001–9 Mark Pilgrim diff --git a/dip3.css b/dip3.css index 90b847e..9e70225 100644 --- a/dip3.css +++ b/dip3.css @@ -42,6 +42,7 @@ kbd{font-weight:bold} /* overrides */ li ol,.q{margin:0} pre a,.w a,pre a:hover{border:0} +.s{text-decoration:line-through} /* headers */ h1,#noscript{background:PapayaWhip;width:100%} /* all hail PapayaWhip */ diff --git a/dip3.js b/dip3.js index e8918df..bac6331 100644 --- a/dip3.js +++ b/dip3.js @@ -84,7 +84,8 @@ function plainTextOnClick(id) { } function hideTOC() { - $("#toc").html(' show table of contents'); + var toc = ' show table of contents'; + $("#toc").html(toc); } function showTOC() { @@ -104,5 +105,6 @@ function showTOC() { toc += ''; level -= 1; } - $("#toc").html(' hide table of contents' + toc); + toc = ' hide table of contents' + toc; + $("#toc").html(toc); } diff --git a/strings.html b/strings.html index 8fdc99d..af3e7b0 100644 --- a/strings.html +++ b/strings.html @@ -15,8 +15,8 @@ body{counter-reset:h1 3} My alphabet starts where your alphabet ends!
    Dr. Seuss, On Beyond Zebra!

      -

    Diving in

    -

    Did you know that the people of Bougainville have the smallest alphabet in the world? Their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course has 26, plus a handful of !@#$%& punctuation marks. Python 3 can handle all of these languages, and more. +

    Some boring stuff you need to understand before you can dive in

    +

    Did you know that the people of Bougainville have the smallest alphabet in the world? Their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and lowercase separately — plus a handful of !@#$%& punctuation marks.

    When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. @@ -24,13 +24,13 @@ My alphabet starts where your alphabet ends!
    Surely you’ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable. -

    There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding using the same numbers (0–255) to represent that language’s characters. For instance, you’re probably familiar with the ASCII encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in less than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte. +

    There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s characters. For instance, you’re probably familiar with the ASCII encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, &c.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte. -

    Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks, like the ñ character in Spanish. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), and so on. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte. +

    Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks, like the ñ character in Spanish. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), &c. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
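
    The numbers quoted above can be checked from a Python 3 prompt. This is a small sketch using the built-in ord() function and the standard "ascii" and "cp1252" codecs; the variable names are only for illustration.

```python
# ASCII: each English character maps to one number below 128.
capital_a = ord("A")
lowercase_a = ord("a")
print(capital_a, lowercase_a)

# CP-1252 keeps the ASCII range and adds accented letters in 128-255.
# Each character still fits in a single byte.
n_tilde = "ñ".encode("cp1252")
u_umlaut = "ü".encode("cp1252")
print(list(n_tilde), list(u_umlaut))

# ASCII itself has no number for these characters at all.
try:
    "ñ".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
print(ascii_can_encode)
```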

    Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they require multiple-byte character sets. That is, each “character” is represented by a two-byte number from 0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings, namely that they each use the same numbers to mean different things. It’s just that the range of numbers is broader, because there are many more characters to represent. -
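
    To see "the same numbers mean different things" in practice, here is a minimal sketch that decodes one byte with three single-byte codecs, then encodes one Chinese character with two multi-byte codecs. The particular codecs (cp1252, cp1251, koi8_r, gb2312, big5) are just examples picked from the ones Python ships with.

```python
# One byte, three single-byte encodings, three different characters.
raw = bytes([241])
as_western = raw.decode("cp1252")   # Western European reading
as_cp1251 = raw.decode("cp1251")    # a Cyrillic reading
as_koi8 = raw.decode("koi8_r")      # another Cyrillic reading
print(as_western, as_cp1251, as_koi8)

# One character, two multi-byte encodings, two different byte pairs.
as_gb2312 = "中".encode("gb2312")
as_big5 = "中".encode("big5")
print(list(as_gb2312), list(as_big5))
```

    Nothing in the bytes themselves says which reading is correct; that information has to travel separately.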

    That was mostly OK in a non-networked world, where “text” was something you typed yourself and occasionally printed. There wasn’t much “plain text” — your word processor had its own format with stored character encoding information, rich styling, and so on. Word processors were customized for each language, so they automatically used the most appropriate character encoding in the Russian edition and in the English edition and in the Spanish edition. People who read these documents were using the same word processing program as the original author, so everything worked, more or less. +

    That was mostly OK in a non-networked world, where “text” was something you typed yourself and occasionally printed. There wasn’t much “plain text”. Source code was ASCII, and everyone else used word processors, which defined their own (non-text) formats that tracked character encoding information along with rich styling, &c. People read these documents with the same word processing program as the original author, so everything worked, more or less.

    Now think about the rise of global networks like email and the web. Lots of “plain text” flying around the globe, being authored on one computer, transmitted through a second computer, and received and displayed by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no! What to do? Well, systems had to be designed to carry encoding information along with every piece of “plain text.” Remember, it’s the decryption key that maps computer-readable numbers to human-readable characters. A missing decryption key means garbled text, gibberish, or worse. @@ -40,22 +40,21 @@ My alphabet starts where your alphabet ends!
    Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such thing as “plain text.” -


    - -

    Nothing below this line is really done yet. Thanks for reading this far! Stop now! -

    Unicode

    Enter Unicode.

    Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That's 2³²−1.) Each 4-byte number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn't have an 'A' in it. -
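
    The one-number-per-character idea can be tried directly in Python 3, where ord() gives a character's number and chr() goes the other way. A short sketch; the code_point_label helper is hypothetical and only formats the number in the usual U+XXXX notation.

```python
def code_point_label(ch):
    """Format a single character's Unicode number as U+XXXX."""
    return "U+{0:04X}".format(ord(ch))

# The mapping works in both directions and never depends on a "mode".
letter_from_number = chr(0x41)
number_from_letter = ord("A")

for ch in "Añ中":
    print(code_point_label(ch), ch)
```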

    Right away, problems leap out at you. 4 bytes? For every single character That's seems awfully wasteful, especially for English and Spanish, which need less than 256 numbers to express every possible character. [FIXME incomplete paragraph] +

    Right away, problems leap out at you. 4 bytes? For every single character? That seems awfully wasteful, especially for English and Spanish, which need fewer than 256 numbers to express every possible character. [FIXME incomplete paragraph]
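
    The size concern is easy to measure. This sketch counts the bytes one character takes under three of Python's built-in Unicode codecs; encoded_sizes is a hypothetical helper, and the "-be" codec variants are used so that no byte-order mark is added to the count.

```python
CODECS = ("utf-32-be", "utf-16-be", "utf-8")

def encoded_sizes(ch):
    """Return the number of bytes ch occupies under each codec."""
    return {name: len(ch.encode(name)) for name in CODECS}

for ch in ("A", "ñ", "中"):
    print(ch, encoded_sizes(ch))
```

    A fixed 4-byte form spends four bytes even on "A", while a variable-length form spends one.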

    Of course, there is still the matter of all those legacy encoding systems. [FIXME incomplete paragraph]

    [FIXME stuff about UTF-32, UTF-16, and finally UTF-8] - -

    Specifying character encoding in .py files

    - - - -

    [FIXME this appears to be mostly the same in Python 3, except the default encoding is now UTF-8, not ASCII.] - -

    If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:

    
    -#!/usr/bin/env python
    -# -*- coding: UTF-8 -*-
    - -

    [FIXME maybe some examples here] + raise ValueError('number too large') +

      +
    1. 'KB', 'MB', 'GB'… those are each strings. Python strings can be defined with either single quotes (') or double quotes ("). +
    2. Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start and end the string. +
    3. These three-in-a-row quotes end the docstring. +
    4. There's another string, being passed to the exception as a human-readable error message. +
    5. There's a… whoa, what the heck is that? +

    Formatting strings

    [FIXME this is all completely different in Python 3. Cover the new way, then maybe show some examples from the old way? Or maybe not. Hey, maybe just point to the original "Dive Into Python".] -

    Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is - to insert values into a string with the %s placeholder. +

    Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder.

    ->>> k = "uid"
    ->>> v = "sa"
    ->>> "%s=%s" % (k, v) 
    -'uid=sa'
    +>>> username = "mark" +>>> password = "PapayaWhip" +>>> "{0}'s password is {1}".format(username, password) +"mark's password is PapayaWhip"
      -
    1. The whole expression evaluates to a string. The first %s is replaced by the value of k; the second %s is replaced by the value of v. All other characters in the string (in this case, the equal sign) stay as they are. +
    2. No, my password is not really PapayaWhip. +
    3. There's a lot going on here. First, that's a method call on a string literal. Strings are objects, and objects have methods. Second, the whole expression evaluates to a string. Third, {0} and {1} are format specifiers, which are replaced by the arguments passed to the format() method.
    +

    The previous example shows the simplest case, where the format specifiers are simply integers. Integer format specifiers are treated as positional indices into the argument list of the format() method. That means that {0} is replaced by the first argument (username in this case), {1} is replaced by the second argument (password), &c. You can have as many positional indices as you have arguments, and you can have as many arguments as you want. But format specifiers are much more powerful than that. + +
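
    Because the integers are indices rather than a running count, they can appear in any order and more than once. A small sketch reusing the username and password values from the example above; the second template string is made up for illustration.

```python
username = "mark"
password = "PapayaWhip"

# Indices in argument order.
in_order = "{0}'s password is {1}".format(username, password)

# Indices reordered and repeated: {0} is always the first argument.
reordered = "{1} belongs to {0}; {0} should change it".format(username, password)

print(in_order)
print(reordered)
```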

    +>>> import humansize
    +>>> 
    +
    +

    Note that (k, v) is a tuple. I told you they were good for something.

    You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that @@ -176,11 +198,17 @@ TypeError: cannot concatenate 'str' and 'int' objects

  • The ".2" modifier of the %f option rounds the value to two decimal places.
  • You can even combine modifiers. Adding the + modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding the value to exactly two decimal places. + -
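
    The two modifiers can be tried on any float. This is a sketch with made-up values; the last line shows the same modifiers in the format() style, which is an assumption about how this section might carry over rather than something the text has covered yet.

```python
value = 50.4689

# ".2" keeps two decimal places; the value is rounded, not cut off.
two_places = "%.2f" % value

# "+" adds an explicit sign; ".2" still applies and pads short values.
signed = "%+.2f" % value
signed_negative = "%+.2f" % -1.5

# The same modifiers inside a format() replacement field.
new_style = "{0:+.2f}".format(value)

print(two_places, signed, signed_negative, new_style)
```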

    Common string operations

    +

    Common string methods

    + +

    [FIXME is it worth keeping this section on joining lists / splitting strings? All the examples are from an old code sample that isn't used at all anymore.] +

    You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object.

    Here is an example of joining a list from the buildConnectionString function: @@ -219,30 +247,63 @@ is an object. You might have thought I meant that string variables are +

    + +

    Common string operations

    The string module

    [FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.] +

    When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead. +
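
    Here is what the method form looks like in practice: join is called on the delimiter, and split undoes it. A minimal sketch; the key=value strings are invented sample data, not taken from any current code in the book.

```python
# A list of key=value strings (sample data for illustration only).
params = ["server=mpilgrim", "uid=sa", "database=master"]

# join is a method of the delimiter string and takes the list as its argument.
joined = ";".join(params)
print(joined)

# split is the inverse: it turns the string back into a list.
round_trip = joined.split(";")
print(round_trip)
```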

    Strings vs. bytes

    +

    Character encoding of Python source code

    + +

    Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8. + +

    +

    In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8. +

    + +

    If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a .py file to be windows-1252: + +

    # -*- coding: windows-1252 -*-
    + +

    Technically, the character encoding override can also be on the second line, if the first line is a UNIX-like hash-bang command. + +

    #!/usr/bin/python3
    +# -*- coding: windows-1252 -*-
    + +

    For more information, consult PEP 263: Defining Python Source Code Encodings. +
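
    One way to check which encoding Python would pick for a given file is the standard library's tokenize.detect_encoding() function, which reads at most the first two lines. A sketch under that assumption; declared_encoding is a hypothetical helper, and the byte strings stand in for the contents of .py files.

```python
import codecs
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the canonical codec name Python would use for this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    # Normalize aliases such as "windows-1252" to the codec's own name.
    return codecs.lookup(encoding).name

first_line = b"# -*- coding: windows-1252 -*-\nprint('hi')\n"
second_line = b"#!/usr/bin/python3\n# -*- coding: windows-1252 -*-\nprint('hi')\n"
no_declaration = b"print('hi')\n"

for source in (first_line, second_line, no_declaration):
    print(declared_encoding(source))
```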

    Further reading

    -

    FIXME proper links +

    On Unicode in Python: -

    -http://docs.python.org/dev/3.0/howto/unicode.html - Unicode HOWTO
    -http://docs.python.org/dev/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit - changes in Python 3
    -http://blog.whatwg.org/the-road-to-html-5-character-encoding
    -http://www.joelonsoftware.com/articles/Unicode.html
    -http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
    -http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings
    -http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
    -http://www.w3.org/People/Dürst/papers.html
    -http://rishida.net/scripts/chinese/
    -
    + + +

    On Unicode in general: + +

    + +

    On character encoding in other formats: + +

    © 2001–9 Mark Pilgrim