diff --git a/files.html b/files.html index b041677..e7d08c3 100644 --- a/files.html +++ b/files.html @@ -126,7 +126,7 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
  • 16 + 1 + 1 = … 20? -

    FIXME +

    Let’s see that again.

     # continued from the previous example
    @@ -137,12 +137,14 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
     >>> a_file.tell()                      
     20
      -
    1. FIXME -
    2. -
    3. +
    4. Move to the 17th byte. +
    5. Read one character. +
    6. Now you’re on the 20th byte.
    -

    FIXME +

    Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that seek() and read() are counting the same thing. But that’s only true for some characters. + +
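    To watch the two kinds of counting disagree, here is a minimal sketch you can run on its own. It assumes a sample string shaped like the file in this chapter (17 one-byte ASCII characters, then Chinese characters), written to a temporary file whose name is purely illustrative.

```python
import os
import tempfile

# Sample text assumed to resemble the chapter's file: 17 one-byte ASCII
# characters, then Chinese characters that take three bytes each in UTF-8.
text = "Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chinese.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)

    with open(path, encoding="utf-8") as a_file:
        a_file.seek(17)            # a byte offset, not a character offset
        one_char = a_file.read(1)  # one character, however many bytes it takes
        position = a_file.tell()   # back to counting bytes

# How many bytes did that single character occupy on disk?
bytes_used = len(one_char.encode("utf-8"))
print(bytes_used, position)
```

    On CPython this should print 3 20: one character read, three bytes consumed. The Python documentation describes the text-mode tell() value as an opaque number, so treat the match with a plain byte offset as something that happens to hold for UTF-8 here, not as a guarantee.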

    But wait, it gets worse!

     >>> a_file.seek(18)                         
    @@ -155,8 +157,8 @@ UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: chara
         (result, consumed) = self._buffer_decode(data, self.errors, final)
     UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
      -
    1. FIXME -
    2. +
    3. Move to the 18th byte and try to read one character. +
    4. Why does this fail? Because there isn’t a character at the 18th byte. The nearest character starts at the 17th byte (and spans three bytes). Trying to read a character from the middle of those three bytes will fail with a UnicodeDecodeError.
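    The failure, and one way to find where the character really starts, can be sketched as follows. This is an illustration under assumptions, not the chapter's own code: the sample text and file name are hypothetical, and the loop relies on the UTF-8 rule that continuation bytes always look like 10xxxxxx in binary.

```python
import os
import tempfile

# Same assumed sample text: byte 17 is the first byte of a 3-byte character.
text = "Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chinese.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)

    # Seek into the middle of a multi-byte character, then try to read.
    with open(path, encoding="utf-8") as a_file:
        a_file.seek(18)
        try:
            a_file.read(1)
            error = None
        except UnicodeDecodeError as e:
            error = e

    # Binary mode hands back raw bytes, so nothing needs decoding here.
    with open(path, "rb") as raw_file:
        raw = raw_file.read()

# Walk backwards past continuation bytes (0b10xxxxxx) to the lead byte.
start = 18
while start > 0 and (raw[start] & 0xC0) == 0x80:
    start -= 1

print(type(error).__name__, hex(error.object[error.start]), start)
```

    The byte the decoder trips over is 0x98, the same byte named in the traceback above, and backing up lands on offset 17. Newer Python versions word the error message differently, but the exception type is still UnicodeDecodeError.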

    Closing Files