diff --git a/strings.html b/strings.html
index afaa490..3fb7320 100644
--- a/strings.html
+++ b/strings.html
@@ -307,13 +307,64 @@ TypeError: 'bytes' object does not support item assignment
  • The one difference is that, with the bytearray object, you can assign individual bytes using index notation. The assigned value must be an integer in the range 0–255. -
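
    The difference between the two types can be sketched as follows. This is a minimal illustration, not part of the original example: the sample value b'abc' and the names barr, bytes_are_mutable, and out_of_range_rejected are assumptions made for the sketch.

```python
by = b'abc'
barr = bytearray(by)      # a mutable copy of the same bytes

# Item assignment works on a bytearray; the value is an integer, not a character.
barr[0] = 102             # 102 is the ASCII code for 'f'

# The original bytes object is immutable, so the same assignment fails.
try:
    by[0] = 102
    bytes_are_mutable = True
except TypeError:
    bytes_are_mutable = False

# Values outside the range 0-255 are rejected.
try:
    barr[0] = 256
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(barr)               # bytearray(b'fbc')
```

    Note that the bytes object is left unchanged; only the bytearray copy is modified.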

    OK, so a string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? The answer is that it decodes the bytes according to a specific character encoding algorithm, and returns a sequence of Unicode characters, otherwise known as a string. +

    The one thing you can never do is mix bytes and strings. + +

    +>>> by = b'd'
    +>>> s = 'abcde'
    +>>> by + s                       
    +Traceback (most recent call last):
    +  File "<stdin>", line 1, in <module>
    +TypeError: can't concat bytes to str
    +>>> s.count(by)                  
    +Traceback (most recent call last):
    +  File "<stdin>", line 1, in <module>
    +TypeError: Can't convert 'bytes' object to str implicitly
    +>>> s.count(by.decode('ascii'))  
    +1
    +
      +
    1. You can't concatenate bytes and strings. They are two different data types. +
    2. You can't count the occurrences of bytes in a string, because there are no bytes in a string. A string is a sequence of characters. Perhaps you meant “count the occurrences of the string that you would get after decoding this sequence of bytes in a particular character encoding”? Well then, you'll need to say that explicitly. Python 3 won't implicitly convert bytes to strings or strings to bytes. +
    3. By an amazing coincidence, this line of code says “count the occurrences of the string that you would get after decoding this sequence of bytes in this particular character encoding.” +
    + +
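
    Written as a script rather than an interactive session, the same rule looks roughly like this. It is a sketch under the assumption that ASCII is the right encoding for both values; the names as_text, as_bytes, and mixing_allowed are introduced here for illustration only.

```python
by = b'd'
s = 'abcde'

# Option 1: decode the bytes, then work purely with strings.
as_text = s + by.decode('ascii')

# Option 2: encode the string, then work purely with bytes.
as_bytes = s.encode('ascii') + by

# Combining the two types without an explicit conversion raises TypeError.
try:
    by + s
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(as_text)                        # abcded
print(as_bytes)                       # b'abcded'
print(s.count(by.decode('ascii')))    # 1
```

    Either direction works; what matters is that the conversion is explicit and names an encoding.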

    And here is the link between strings and bytes: bytes objects have a decode() method that takes a character encoding and returns a string, and strings have an encode() method that takes a character encoding and returns a bytes object. In the previous example, the decoding was relatively straightforward: converting a sequence of bytes in the ASCII encoding into a string of characters. But the same process works with any encoding that supports the characters of the string, even legacy (non-Unicode) encodings. + +

    +>>> a_string = '深入 Python'         
    +>>> len(a_string)
    +9
    +>>> by = a_string.encode('utf-8')    
    +>>> by
    +b'\xe6\xb7\xb1\xe5\x85\xa5 Python'
    +>>> len(by)
    +13
    +>>> by = a_string.encode('gb18030')  
    +>>> by
    +b'\xc9\xee\xc8\xeb Python'
    +>>> len(by)
    +11
    +>>> by = a_string.encode('big5')     
    +>>> by
    +b'\xb2`\xa4J Python'
    +>>> len(by)
    +11
    +>>> roundtrip = by.decode('big5')    
    +>>> roundtrip
    +'深入 Python'
    +>>> a_string == roundtrip
    +True
    +
      +
    1. This is a string. It has nine characters. +
    2. This is a bytes object. It has 13 bytes. It is the sequence of bytes you get when you take a_string and encode it in UTF-8. +
    3. This is a bytes object. It has 11 bytes. It is the sequence of bytes you get when you encode a_string in the GB18030 encoding. +
    4. This is a bytes object. It has 11 bytes. It is an entirely different sequence of bytes that you get by encoding a_string with the Big5 encoding algorithm. +
    5. This is a string. It has nine characters. It is the sequence of characters you get when you take by and decode it using the Big5 encoding algorithm. It is identical to the original string. +
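
    The phrase “any encoding that supports the characters of the string” matters in practice. The sketch below repeats the round trip for all three encodings and then tries two things that should not work: encoding to ASCII, and decoding UTF-8 bytes as Big5. The names lengths, ascii_supported, and wrong are assumptions made for this illustration; the mismatched decode may either raise an error or return different text, so the code handles both outcomes.

```python
a_string = '深入 Python'

# Round-trip through each encoding that can represent these characters.
lengths = {}
for encoding in ('utf-8', 'gb18030', 'big5'):
    encoded = a_string.encode(encoding)
    assert encoded.decode(encoding) == a_string   # decoding restores the original
    lengths[encoding] = len(encoded)

# ASCII cannot represent the two Chinese characters, so encoding fails.
try:
    a_string.encode('ascii')
    ascii_supported = True
except UnicodeEncodeError:
    ascii_supported = False

# Decoding with a different encoding than the one used to encode
# does not give the original string back.
utf8_bytes = a_string.encode('utf-8')
try:
    wrong = utf8_bytes.decode('big5')
except UnicodeDecodeError:
    wrong = None

print(lengths)
print(ascii_supported)
print(wrong == a_string)
```

    The byte counts differ by encoding, but the decoded string is the same nine characters each time, provided the bytes are decoded with the encoding that produced them.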
    -

    FIXME examples/chinese.txt

    Postscript: Character Encoding Of Python Source Code