diff --git a/strings.html b/strings.html index afaa490..3fb7320 100644 --- a/strings.html +++ b/strings.html @@ -307,13 +307,64 @@ TypeError: 'bytes' object does not support item assignment
bytearray object, you can assign individual bytes using index notation. The assigned value must be an integer between 0–255.
-OK, so a string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? The answer is that it decodes the bytes according to a specific character encoding algorithm, and returns a sequence of Unicode characters, otherwise known as a string. +
The one thing you can never do is mix bytes and strings. + +
+>>> by = b'd' +>>> s = 'abcde' +>>> by + s ① +Traceback (most recent call last): + File "<stdin>", line 1, in <module> +TypeError: can't concat bytes to str +>>> s.count(by) ② +Traceback (most recent call last): + File "<stdin>", line 1, in <module> +TypeError: Can't convert 'bytes' object to str implicitly +>>> s.count(by.decode('ascii')) ③ +1+
And here is the link between strings and bytes: bytes objects have a decode() method that takes a character encoding and returns a string, and strings have an encode() method that takes a character encoding and returns a bytes object. In the previous example, the decoding was relatively straightforward: it converted a sequence of bytes in the ASCII encoding into a string of characters. But the same process works with any encoding that supports the characters of the string, even legacy (non-Unicode) encodings.
+
+
+>>> a_string = '深入 Python' ① +>>> len(a_string) +9 +>>> by = a_string.encode('utf-8') ② +>>> by +b'\xe6\xb7\xb1\xe5\x85\xa5 Python' +>>> len(by) +13 +>>> by = a_string.encode('gb18030') ③ +>>> by +b'\xc9\xee\xc8\xeb Python' +>>> len(by) +11 +>>> by = a_string.encode('big5') ④ +>>> by +b'\xb2`\xa4J Python' +>>> len(by) +11 +>>> roundtrip = by.decode('big5') ⑤ +>>> roundtrip +'深入 Python' +>>> a_string == roundtrip +True+
bytes object. It has 13 bytes. It is the sequence of bytes you get when you take a_string and encode it in UTF-8.
+bytes object. It has 11 bytes. It is the sequence of bytes you get when you encode a_string in the GB18030 encoding.
+bytes object. It has 11 bytes. It is an entirely different sequence of bytes that you get by encoding a_string with the Big5 encoding algorithm.
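The byte counts above, and the caveat that an encoding must support the characters of the string, can be checked with a short script. This is only a minimal sketch: the helper name encoded_sizes is hypothetical and does not appear in the original text, and 'ascii' is added purely to show what happens when an encoding cannot represent the string.

```python
# Hypothetical helper for illustration; not part of the original chapter.
def encoded_sizes(text, encodings):
    """Map each encoding name to the byte length of text in that encoding,
    or to None when the encoding cannot represent every character."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            # The encoding has no byte sequence for at least one character.
            sizes[name] = None
    return sizes


a_string = '深入 Python'
sizes = encoded_sizes(a_string, ['utf-8', 'gb18030', 'big5', 'ascii'])
print(sizes)

# Decoding with the same encoding that produced the bytes recovers the string.
roundtrip = a_string.encode('big5').decode('big5')
print(roundtrip == a_string)
```

The same nine-character string occupies a different number of bytes in each encoding, and ASCII cannot encode it at all, which is why the encoding always has to be named explicitly when moving between strings and bytes.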
+FIXME examples/chinese.txt