From 24135b8bdef7d66c00353b24662aa2133217daf6 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Thu, 16 Jul 2009 13:21:46 -0400
Subject: [PATCH] finished section on character encoding

---
 files.html | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/files.html b/files.html
index 9decc06..2390083 100644
--- a/files.html
+++ b/files.html
@@ -33,10 +33,11 @@ open(..., 'r', encoding='...')

Character Encoding Rears Its Ugly Head

-Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm, and returns a sequence of Unicode characters, otherwise known as a string.
+Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
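
To make the decoding step concrete, here is a minimal sketch. The sample text and the variable names are made up for illustration; they are not the contents of chinese.txt. The same bytes produce different strings, or no string at all, depending on which encoding is used to decode them.

```python
# Illustrative sample only; not the contents of examples/chinese.txt.
raw_bytes = '深入 Python'.encode('utf-8')   # stands in for the bytes on disk

# Decoding with the encoding that produced the bytes
# recovers the original characters.
as_utf8 = raw_bytes.decode('utf-8')

# Decoding the same bytes as Latin-1 "succeeds" but yields the wrong
# characters: every byte maps to some character, just not the right one.
as_latin1 = raw_bytes.decode('latin-1')

# Decoding as ASCII fails outright, because bytes above 0x7F
# have no meaning in ASCII.
try:
    raw_bytes.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(raw_bytes), len(as_utf8), len(as_latin1), ascii_failed)
```

The two successful decodes have different lengths, which is the point: bytes only become characters once you pick an encoding.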

-# on Windows...
+# This example was created on Windows. Other platforms may
+# behave differently, for reasons outlined below.
 >>> file = open('examples/chinese.txt')
 >>> a_string = file.read()
 Traceback (most recent call last):
@@ -47,10 +48,14 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
 >>>
 
- +

But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252). + +

+

If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file. + +
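
Putting the two pieces of advice together might look like the sketch below. The scratch file and its contents are hypothetical, created only so the example is self-contained, and the value printed for the default encoding will vary from machine to machine.

```python
import locale
import os
import tempfile

# The platform default; depends on the OS and its regional settings.
default_encoding = locale.getpreferredencoding()
print(default_encoding)

# Hypothetical scratch file, written as UTF-8 for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('深入 Python\n')

# Naming the encoding explicitly makes the read independent of the
# platform default, so the result is the same on every machine.
with open(path, encoding='utf-8') as f:
    a_string = f.read()

print(len(a_string))
```

Because both calls to open() name the encoding, the default reported by locale.getpreferredencoding() never comes into play.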

File Objects