diff --git a/files.html b/files.html
index 9decc06..2390083 100644
--- a/files.html
+++ b/files.html
@@ -33,10 +33,11 @@ open(..., 'r', encoding='...')
-Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm, and returns a sequence of Unicode characters, otherwise known as a string.
+Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
-# on Windows...
+# This example was created on Windows. Other platforms may
+# behave differently, for reasons outlined below.
>>> file = open('examples/chinese.txt')
>>> a_string = file.read()
Traceback (most recent call last):
@@ -47,10 +48,14 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
>>>
-
+But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252). + +
+☞If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file. + +
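The advice above can be sketched in a few lines. This is a minimal illustration, not part of the patch itself: the helper name read_text, the temporary file, and the choice of UTF-8 are assumptions made for the example. It prints whatever default encoding your platform reports, then writes and reads a file with the encoding spelled out so the result does not depend on that default.

```python
import locale
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Read a text file with an explicit encoding (hypothetical helper)."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read()


# The platform default; the value varies by OS and regional settings.
default_encoding = locale.getpreferredencoding()
print(default_encoding)

# Write a small sample file as UTF-8, then read it back as UTF-8.
# Naming the encoding on both sides keeps the round trip portable.
sample = '深入 Python'
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    round_trip = read_text(path)
finally:
    os.remove(path)
```

Calling open(path) without the encoding argument on the same file may or may not fail, depending on what locale.getpreferredencoding() returned on that machine.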