From 6216bf3cacda1befe80dd46949eed403c6d3d08c Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Sat, 18 Jul 2009 13:13:15 -0400
Subject: [PATCH] finished #read section

---
 files.html | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/files.html b/files.html
index b041677..e7d08c3 100644
--- a/files.html
+++ b/files.html
@@ -126,7 +126,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
  • 16 + 1 + 1 = … 20? -

    FIXME +

    Let’s see that again.

     # continued from the previous example
    @@ -137,12 +137,14 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
     >>> a_file.tell()                      
     20
      -
    1. FIXME -
    2. -
    3. +
    4. Move to the 17th byte. +
    5. Read one character. +
    6. Now you’re on the 20th byte.
    -

    FIXME +

    Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file require only one byte each, so you might be misled into thinking that the seek() and read() methods are counting the same thing. But that’s only true for some characters. + +
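    The byte-versus-character mismatch can be checked without opening a file at all. The following is a minimal sketch, not part of the original chapter: the helper name byte_layout and the sample string are assumptions chosen to match the offsets in the example above (17 one-byte characters followed by one three-byte character).

```python
def byte_layout(text, encoding="utf-8"):
    """Return (character, starting byte offset, byte width) for each character.

    Hypothetical helper for illustration; it is not part of the io module.
    """
    layout = []
    offset = 0
    for ch in text:
        width = len(ch.encode(encoding))  # bytes needed to encode this one character
        layout.append((ch, offset, width))
        offset += width
    return layout


# Assumed sample: the first 18 characters of the file used in the example.
sample = "Dive Into Python 是"

print(len(sample))                  # 18 characters, which is what read() counts
print(len(sample.encode("utf-8")))  # 20 bytes, which is what seek() and tell() count
print(byte_layout(sample)[-1])      # ('是', 17, 3)
```

    The last tuple shows why the position jumps from 17 to 20 after reading a single character: the character starts at byte offset 17 and is three bytes wide.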

    But wait, it gets worse!

     >>> a_file.seek(18)                         
    @@ -155,8 +157,8 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
         (result, consumed) = self._buffer_decode(data, self.errors, final)
     UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
      -
    1. FIXME -
    2. +
    3. Move to the 18th byte and try to read one character. +
    4. Why does this fail? Because there isn’t a character at the 18th byte. The nearest character starts at the 17th byte (and goes for three bytes). Trying to read a character from the middle will fail with a UnicodeDecodeError.
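    The same failure can be reproduced on a plain bytes object, which makes it easier to see which byte the decoder trips over. This is a hedged sketch under the same assumed sample string as the example above; try_decode and is_continuation_byte are hypothetical helper names, not functions from the standard library.

```python
def is_continuation_byte(byte):
    """True if the byte has the bit pattern 10xxxxxx, which UTF-8 reserves
    for the second and later bytes of a multi-byte character."""
    return (byte & 0b11000000) == 0b10000000


def try_decode(data, offset, encoding="utf-8"):
    """Decode data starting at a byte offset; return None if decoding fails."""
    try:
        return data[offset:].decode(encoding)
    except UnicodeDecodeError:
        return None


# Assumed sample: 17 one-byte characters followed by one three-byte character.
data = "Dive Into Python 是".encode("utf-8")

print(try_decode(data, 17))  # 是   (offset 17 is the start of the character)
print(try_decode(data, 18))  # None (offset 18 lands in the middle of it)
print(hex(data[18]), is_continuation_byte(data[18]))  # 0x98 True
```

    The byte at offset 18 is 0x98, the same byte named in the traceback. It is a continuation byte, so no character can start there.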

    Closing Files