diff --git a/examples/chinese.txt b/examples/chinese.txt new file mode 100644 index 0000000..834030e --- /dev/null +++ b/examples/chinese.txt @@ -0,0 +1 @@ +Dive Into Python 是为有经验的程序员编写的一本 Python 书。 diff --git a/strings.html b/strings.html index 58b0b1c..247d711 100644 --- a/strings.html +++ b/strings.html @@ -50,58 +50,49 @@ My alphabet starts where your alphabet ends!
— Dr

Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That's 2³²−1.) Each 4-byte number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn't have an 'A' in it. -

Right away, the obvious question should leap out at you. Four bytes? For every single character That seems awfully wasteful, especially for languages like English and Spanish, which need less than 256 numbers to express every possible character. [FIXME incomplete paragraph] +

On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character? That seems awfully wasteful, especially for languages like English and Spanish, which need fewer than 256 numbers (a single byte) to express every possible character. In fact, it's wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character. -

Of course, there is still the matter of all those legacy encoding systems. [FIXME incomplete paragraph] +

There is a Unicode encoding that uses four bytes per character. It's called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character. -
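Here is a minimal sketch of that constant-time lookup, assuming Python 3 and its built-in 'utf-32-be' codec (big-endian UTF-32 with no byte order mark). The sample text and the nth_char helper are made up for illustration.

```python
# Illustrative sample text; any string would do.
text = 'A中ñ'

# 'utf-32-be' writes each character as exactly four bytes, big-endian, no BOM.
encoded = text.encode('utf-32-be')


def nth_char(data, n):
    """Hypothetical helper: return character n (0-based) by slicing 4 bytes."""
    return data[4 * n:4 * n + 4].decode('utf-32-be')


print(len(text), len(encoded))   # 3 characters, 12 bytes
print(nth_char(encoded, 1))      # the character at index 1
```

The slice never has to scan the earlier characters, which is the whole appeal. The price is the 12 bytes for a three-character string.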

[FIXME stuff about UTF-32, UTF-16, and finally UTF-8] +

Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used “astral plane” Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don't). And you can still easily find the Nth character of a string in constant time, if you assume that the string doesn't include any astral plane characters, which is a good assumption right up until the moment that it's not. -

[FIXME FIXME FIXME, damn it!] +

But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store the bytes of a multi-byte number in different orders. That means that the character U+4E2D could be stored in UTF-16 as either 4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you're safe; different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you're going to need a way to indicate which order your bytes are stored in. Otherwise, the receiving system has no way of knowing whether the two-byte sequence 4E 2D means U+4E2D or U+2D4E. -
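To see the ambiguity for yourself, here is a small sketch, assuming Python 3 and its 'utf-16-be' and 'utf-16-le' codecs (which fix the byte order and add no marker). The variable names are only for illustration.

```python
ch = '\u4e2d'  # the character 中, U+4E2D

big = ch.encode('utf-16-be')      # big-endian byte order
little = ch.encode('utf-16-le')   # little-endian byte order
print(big.hex(), little.hex())

# Read the big-endian bytes with the wrong assumption about byte order
# and you silently get a different character.
misread = big.decode('utf-16-le')
print(hex(ord(misread)))
```

Nothing raises an error in the last step. The bytes are perfectly valid either way, which is exactly why the receiver needs to be told the order.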

-

UTF-8 uses the same characters as 7-bit ASCII for 0 through 127 +

To solve this problem, the multi-byte Unicode encodings define a “Byte Order Mark,” which is a special non-printable character that you can include at the beginning of your document to indicate what order your bytes are in. For UTF-16, the Byte Order Mark is U+FEFF. If you receive a UTF-16 document that starts with the bytes FF FE, you know the byte ordering is one way; if it starts with FE FF, you know the byte ordering is reversed. +
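As a rough illustration, assuming Python 3: the standard codecs module exposes the two UTF-16 byte order marks as constants, and the generic 'utf-16' codec reads the mark when decoding. The variable names here are invented for the sketch.

```python
import codecs

text = '中'

# Build the same document in both byte orders, each with its mark in front.
little = codecs.BOM_UTF16_LE + text.encode('utf-16-le')   # starts FF FE
big = codecs.BOM_UTF16_BE + text.encode('utf-16-be')      # starts FE FF

# The generic 'utf-16' decoder inspects the mark and picks the right order.
from_little = little.decode('utf-16')
from_big = big.decode('utf-16')
print(from_little == from_big == text)
```

Both byte streams decode to the same string, and the mark itself does not show up in the result.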

Still, UTF-16 isn't exactly ideal, especially if you're dealing with a lot of ASCII characters. If you think about it, even a Chinese web page is going to contain a lot of ASCII characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the Nth character in O(1) time is nice, but there's still the nagging problem of those astral plane characters, which mean that you can't guarantee that every character is exactly two bytes, so you can't really find the Nth character in O(1) time unless you maintain a separate index. And boy, there sure is a lot of ASCII text in the world… +

Other people pondered these questions, and they came up with a solution: +

UTF-8 -

When dealing with Unicode data, you may at some point need to convert the data back into one of these other legacy encoding -systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding -scheme, or to print it to a non-Unicode-aware terminal or printer. +

UTF-8 is a variable-length encoding system for Unicode. That is, different characters take up a different number of bytes. For ASCII characters (A-Z, &c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from ASCII. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes. +

Disadvantages: because each character can take a different number of bytes, finding the Nth character is an O(N) operation. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters. - - -FIXME: update for Python 3 - -

Python has had Unicode support throughout the language since version 2.0. The XML package uses Unicode to store all parsed XML data, but you can use Unicode anywhere. -

->>> s = u'Dive in'            
->>> s
-u'Dive in'
->>> print s 
-Dive in
-
    -
  1. To create a Unicode string instead of a regular ASCII string, add the letter “u” before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; Unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as Unicode. -
  2. When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this Unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a Unicode string, you'd never notice the difference. -
-
->>> s = u'La Pe\xf1a'         
->>> print s 
-Traceback (innermost last):
-  File "<interactive input>", line 1, in ?
-UnicodeError: ASCII encoding error: ordinal not in range(128)
->>> print s.encode('latin-1') 
-La Peña
-
    -
  1. The real advantage of Unicode, of course, is its ability to store non-ASCII characters, like the Spanish “ñ” (n with a tilde over it). The Unicode character code for the tilde-n is 0xf1 in hexadecimal (241 in decimal), which you can type like this: \xf1. -
  2. Remember I said that the print function attempts to convert a Unicode string to ASCII so it can print it? Well, that's not going to work here, because your Unicode string contains non-ASCII characters, so Python raises a UnicodeError error. -
  3. Here's where the conversion-from-Unicode-to-other-encoding-schemes comes in. s is a Unicode string, but print can only print a regular string. To solve this problem, you call the encode method, available on every Unicode string, to convert the Unicode string to a regular string in the given encoding scheme, - which you pass as a parameter. In this case, you're using latin-1 (also known as iso-8859-1), which includes the tilde-n (whereas the default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127). -
-
+

Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you'll have to trust me on this, because I'm not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
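A quick sketch of those byte counts, assuming Python 3. The sample characters are my own picks: 'A' for ASCII, 'ñ' for extended Latin, '中' for Chinese, and U+1F40D (a snake emoji) standing in for an astral plane character.

```python
samples = ['A', 'ñ', '中', '\U0001F40D']

# Number of bytes each character needs in UTF-8.
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)

# For ASCII characters, the UTF-8 bytes are the same bytes ASCII would use.
print('A'.encode('utf-8') == 'A'.encode('ascii'))

# Compare the total size of the same text in the three encodings
# (the '-le' codecs are used so no byte order mark is counted).
text = ''.join(samples)
sizes = {name: len(text.encode(name))
         for name in ('utf-8', 'utf-16-le', 'utf-32-le')}
print(sizes)
```

Notice that there is no 'utf-8-be' or 'utf-8-le' to choose between; the byte stream is the same everywhere.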

Diving In

+

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. "Is this string UTF-8?" is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions. + +
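A minimal sketch of that round trip, assuming Python 3's str.encode() and bytes.decode() methods. The sample string is just an illustration.

```python
s = '深入 Python'                 # a string: 9 characters

as_utf8 = s.encode('utf-8')       # characters -> bytes, using UTF-8
as_utf16 = s.encode('utf-16-le')  # same characters, different bytes

print(type(s), len(s))
print(type(as_utf8), len(as_utf8))
print(type(as_utf16), len(as_utf16))

# bytes -> characters: you have to say which encoding the bytes are in.
round_trip = as_utf8.decode('utf-8')
print(round_trip == s)
```

The string has one length; the byte sequences have different lengths depending on the encoding you asked for. That is why "how many bytes is this string?" has no answer until you name an encoding.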

+>>> s = '深入 Python'    
+>>> len(s)               
+9
+>>> s[0]                 
+'深'
+>>> s + ' 3'             
+'深入 Python 3'
+
    +
  1. To create a string, enclose it in quotes. Python strings can be defined with either single quotes (') or double quotes ("). +
  2. The built-in len() function returns the length of the string, i.e. the number of characters. This is the same function you use to find the length of a list. A string is like a list of characters. +
  3. Just like getting individual items out of a list, you can get individual characters out of a string using index notation. +
  4. Just like lists, you can concatenate strings using the + operator. +
+ +

Formatting Strings

+

Let's take another look at humansize.py: @@ -132,15 +123,13 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True): raise ValueError('number too large')

    -
  1. 'KB', 'MB', 'GB'… those are each strings. Python strings can be defined with either single quotes (') or double quotes ("). +
  2. 'KB', 'MB', 'GB'… those are each strings.
  3. Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start and end the string.
  4. These three-in-a-row quotes end the docstring.
  5. There's another string, being passed to the exception as a human-readable error message.
  6. There's a… whoa, what the heck is that?
-

Formatting Strings

-

Python 3 supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder.

@@ -249,98 +238,90 @@ experience of years.
 
  • The count() method counts the number of occurrences of a substring. Yes, there really are six “f”s in that sentence! -
  +>>> query = 'user=pilgrim&database=master&password=PapayaWhip'
  +>>> a_list = query.split('&')
  +>>> a_list
  +['user=pilgrim', 'database=master', 'password=PapayaWhip']
  +>>> a_list_of_lists = [v.split('=', 1) for v in a_list]
  +>>> a_list_of_lists
  +[['user', 'pilgrim'], ['database', 'master'], ['password', 'PapayaWhip']]
  +>>> a_dict = dict(a_list_of_lists)
  +>>> a_dict
  +{'password': 'PapayaWhip', 'user': 'pilgrim', 'database': 'master'}
  • - - - - -

    The string Module

    - -

    [FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.] - -

    -

    When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead. -

    -

    Strings vs. Bytes

    -

    FIXME +

    Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a string. An immutable sequence of numbers-between-0-and-255 is called a bytes object. -

    Character Encoding Of Python Source Code

    +
    +>>> by = b'abcd\x65'  
    +>>> by
    +b'abcde'
    +>>> type(by)          
    +<class 'bytes'>
    +>>> len(by)           
    +5
    +>>> by += b'\xff'     
    +>>> by
    +b'abcde\xff'
    +>>> len(by)           
    +6
    +>>> by[0]             
    +97
    +>>> by[0] = 102       
    +Traceback (most recent call last):
    +  File "<stdin>", line 1, in <module>
    +TypeError: 'bytes' object does not support item assignment
    +
      +
    1. To define a bytes object, use the b'' “byte literal” syntax. Each byte within the byte literal can be an ASCII character or an encoded hexadecimal number from \x00 to \xff (0–255). +
    2. The type of a bytes object is bytes. +
    3. Just like lists and strings, you can get the length of a bytes object with the built-in len() function. +
    4. Just like lists and strings, you can use the + operator to concatenate bytes objects. The result is a new bytes object. +
    5. Concatenating a 5-byte bytes object and a 1-byte bytes object gives you a 6-byte bytes object. +
    6. Just like lists and strings, you can use index notation to get individual bytes in a bytes object. The items of a string are strings; the items of a bytes object are integers. Specifically, integers between 0–255. +
    7. A bytes object is immutable; you cannot assign individual bytes. If you need to change individual bytes, you can either use slicing methods (which work the same as strings) and concatenation operators (which also work the same as strings), or you can convert the bytes object into a bytearray object. +
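The slicing-and-concatenation route mentioned in the last item might look like this. It is only a sketch, assuming Python 3, and replace_byte is a hypothetical helper name, not something built in.

```python
by = b'abcde'


def replace_byte(data, index, value):
    """Hypothetical helper: build a NEW bytes object with one byte swapped."""
    # bytes([value]) makes a 1-byte bytes object from an integer 0-255.
    return data[:index] + bytes([value]) + data[index + 1:]


changed = replace_byte(by, 0, 102)   # 102 is the byte value of 'f'
print(changed)
print(by)                            # the original is untouched
```

Every "change" creates a new object, which is fine for the occasional edit. If you are rewriting lots of bytes in place, a bytearray is probably the better fit.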
    -

    Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8. +

    +>>> by = b'abcd\x65'
    +>>> barr = bytearray(by)  
    +>>> barr
    +bytearray(b'abcde')
    +>>> len(barr)             
    +5
    +>>> barr[0] = 102         
    +>>> barr
    +bytearray(b'fbcde')
    +
      +
    1. To convert a bytes object into a mutable bytearray object, use the built-in bytearray() function. +
    2. All the methods and operations you can do on a bytes object, you can do on a bytearray object too. +
    3. The one difference is that, with the bytearray object, you can assign individual bytes using index notation. The assigned value must be an integer between 0–255. +
    + +

    OK, so a string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? The answer is that it decodes the bytes according to a specific character encoding algorithm, and returns a sequence of Unicode characters, otherwise known as a string. + +
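Here is one way to watch that happen, as a sketch under a few assumptions: Python 3, a throwaway temporary directory instead of a real project path, and the one-line sample text from examples/chinese.txt written out as UTF-8.

```python
import os
import tempfile

line = 'Dive Into Python 是为有经验的程序员编写的一本 Python 书。'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'chinese.txt')

    # On disk there are only bytes, so write the UTF-8 encoded bytes.
    with open(path, 'wb') as f:
        f.write(line.encode('utf-8'))

    # Binary mode hands back the raw bytes, no decoding.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode decodes the bytes using the encoding you name.
    with open(path, encoding='utf-8') as f:
        text = f.read()

print(type(raw), len(raw))
print(type(text), len(text))
print(raw.decode('utf-8') == text)
```

Passing encoding='utf-8' explicitly is a deliberate choice in this sketch. If you leave it out, Python picks a default that can vary from one system to another.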

    FIXME examples/chinese.txt + + +

    Postscript: Character Encoding Of Python Source Code

    + +

    Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8.

    -

    In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8. +

    In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8.

    If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a .py file to be windows-1252: