
Difficulty level: ♦♦♦♢♢

Files

FIXME
— FIXME

 

Diving In

FIXME

Reading From Text Files

Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:

a_file = open('examples/chinese.txt', encoding='utf-8')

Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. There are four interesting things about this filename:

  1. It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
  2. The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
  3. The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to what, you might ask? Patience, grasshopper.
  4. It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-ASCII pathnames.

But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.

Character Encoding Rears Its Ugly Head

Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
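To make the bytes-versus-characters distinction concrete, here is a minimal sketch that encodes a short string to bytes and decodes it back. The string and variable names are illustrative only; they are not part of the chapter's example files.

```python
# A string is a sequence of Unicode characters.
text = 'Python 书'

# Encoding turns characters into bytes; bytes are what a file on disk holds.
raw = text.encode('utf-8')

# Decoding with the same encoding turns the bytes back into characters.
decoded = raw.decode('utf-8')

print(len(text))        # number of characters
print(len(raw))         # number of bytes (the Chinese character needs three)
print(decoded == text)  # the round trip is lossless
```

The character count and the byte count differ because UTF-8 uses more than one byte for any character outside ASCII.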

# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
>>> file = open('examples/chinese.txt')
>>> a_string = file.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
>>> 

What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError.
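You can reproduce the same failure without touching a file at all. The sketch below assumes the file holds the sentence that appears later in this chapter, encodes it as UTF-8 (standing in for the bytes on disk), and then tries to decode those bytes as CP-1252. The variable names are illustrative.

```python
# The sentence the example file is assumed to contain.
text = 'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'

# Stand-in for the bytes stored on disk.
raw = text.encode('utf-8')

try:
    raw.decode('cp1252')
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the codec could not map.
    failed_at = exc.start
    bad_byte = raw[failed_at]
    print('decode failed at byte', failed_at, hex(bad_byte))
else:
    failed_at = None
    bad_byte = None
```

The position and the offending byte should line up with the traceback above, since CP-1252 has no character assigned to 0x8f.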

But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).

If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file.
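A minimal sketch of that check follows. The value printed depends entirely on the machine running it, so no particular output is shown.

```python
import locale

# Ask the platform which encoding it would use if you don't specify one.
default_encoding = locale.getpreferredencoding()
print(default_encoding)

# Whatever that prints, passing encoding='utf-8' (or whichever encoding the
# file really uses) to open() removes the dependency on this value.
```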

File Objects

So far, all we know is that Python has a built-in function called open(). The open() function returns a file object, which has methods and attributes for getting information about and manipulating the file.

>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.name                                              
'examples/chinese.txt'
>>> a_file.encoding                                          
'utf-8'
>>> a_file.mode                                              
'r'
  1. The name attribute reflects the name you passed in to the open() function when you opened the file. It is not normalized to an absolute pathname.
  2. Likewise, the encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify the encoding when you opened the file (bad developer!) then the encoding attribute will reflect locale.getpreferredencoding().
  3. The mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r', which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).

The documentation for the open() function lists all the possible file modes.

Reading Data From A Text File

After you open a file for reading, you’ll probably want to read from it at some point.

>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.read()
'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
  1. FIXME

FIXME

>>> a_file.seek(0)
0
>>> a_file.read(16)
'Dive Into Python'
>>> a_file.read(1)
' '
>>> a_file.read(1)
'是'
>>> a_file.tell()
20
  1. FIXME

FIXME

>>> a_file.seek(17)
17
>>> a_file.read(1)
'是'
>>> a_file.tell()
20
  1. FIXME

FIXME

>>> a_file.seek(18)
18
>>> a_file.read(1)
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    a_file.read(1)
  File "C:\Python31\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
  1. FIXME
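The transcripts above suggest that read() counts characters while seek() and tell() count bytes. Here is a minimal sketch of that arithmetic. It works on an in-memory string and its UTF-8 bytes rather than on the file itself, and the variable names are illustrative.

```python
# The first 18 characters of the example sentence.
prefix = 'Dive Into Python 是'
raw = prefix.encode('utf-8')

n_chars = len(prefix)  # what read() counts
n_bytes = len(raw)     # what a byte position counts
print(n_chars, n_bytes)

# Bytes 17 through 19 form one complete character.
whole_char = raw[17:20].decode('utf-8')

# Starting at byte 18 lands in the middle of that character,
# so the decoder has nothing valid to work with.
try:
    raw[18:20].decode('utf-8')
    mid_char_ok = True
except UnicodeDecodeError:
    mid_char_ok = False

print(whole_char, mid_char_ok)
```

Sixteen ASCII characters, one space, and one three-byte character add up to 18 characters but 20 bytes, which is why a byte position can land inside a character.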

Closing Files

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them.

FIXME checking if a file is closed
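A minimal sketch of closing a file and checking whether it is closed. It creates a throwaway temporary file so it does not depend on the chapter's example files; the path and variable names are illustrative.

```python
import os
import tempfile

# Create an empty temporary file to experiment with.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

a_file = open(path, encoding='utf-8')
closed_before = a_file.closed  # the file is still open here

a_file.close()
closed_after = a_file.closed   # now it is closed

# Reading from a closed file object raises ValueError.
try:
    a_file.read()
    read_failed = False
except ValueError:
    read_failed = True

print(closed_before, closed_after, read_failed)

os.remove(path)  # clean up the temporary file
```

Closing a file does not destroy the file object; the closed attribute simply reports its state, and most other operations on it will raise an exception.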

Using The with Statement

FIXME "with open(...) as file" pattern
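As a rough sketch of the pattern named above: the with statement closes the file for you when the block ends, even if an exception is raised inside it. The temporary directory, filename, and text are placeholders chosen for this illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')

    # Open for writing; the file is closed when this block exits.
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('test succeeded')
    closed_after_write = a_file.closed

    # Open for reading; again, no explicit close() call is needed.
    with open(path, encoding='utf-8') as a_file:
        contents = a_file.read()
    closed_after_read = a_file.closed

print(contents, closed_after_write, closed_after_read)
```

The variable a_file still exists after each block, but the file it refers to has already been closed.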

Reading Data One Line At A Time

FIXME

FIXME what's a "line"? (line endings discussion, universal line endings, etc.)
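A minimal sketch of reading a line at a time, assuming the default text-mode behavior in which Python treats '\n', '\r\n', and '\r' all as line endings and hands each line back ending in '\n'. The file is a temporary one written as raw bytes so the mixed line endings reach the disk unchanged; all names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'lines.txt')

    # Three lines, each with a different line ending.
    with open(path, mode='wb') as raw_file:
        raw_file.write(b'first\r\nsecond\rthird\n')

    lines = []
    # A file object opened in text mode is iterable, one line per step.
    with open(path, encoding='utf-8') as a_file:
        for line_number, a_line in enumerate(a_file, start=1):
            lines.append(a_line)
            print(line_number, repr(a_line))
```

Each line keeps its trailing newline, so call rstrip() on it if you want the text alone.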

Writing to Text Files

FIXME

Character Encoding Again

FIXME

Write A Little, Write A Lot

FIXME write(), writelines(), .writable()
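Pending the full discussion, here is a rough sketch of the three names mentioned above. It writes to a temporary file, so the path and the text are placeholders and not part of the chapter's examples.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'out.txt')

    # Mode 'w' creates the file, or overwrites it if it already exists.
    with open(path, mode='w', encoding='utf-8') as a_file:
        can_write = a_file.writable()
        chars_written = a_file.write('one\n')    # returns the character count
        a_file.writelines(['two\n', 'three\n'])  # adds no newlines of its own

    # Mode 'a' appends to the end of the existing file.
    with open(path, mode='a', encoding='utf-8') as a_file:
        a_file.write('four\n')

    # A file opened with the default mode 'r' is not writable.
    with open(path, encoding='utf-8') as a_file:
        can_write_when_reading = a_file.writable()
        contents = a_file.read()

print(can_write, chars_written, can_write_when_reading)
print(repr(contents))
```

Note that writelines() writes the strings exactly as given, so each one must carry its own line ending.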

Handling I/O Errors

FIXME

Binary Files

FIXME

>>> image = open('examples/beauregard-100x100.jpg', 'rb')
>>> image
<io.BufferedReader object at 0x00C7A390>
>>> image.mode
'rb'
>>> image.name
'examples/beauregard-100x100.jpg'
>>> image.tell()
0
>>> data = image.read(3)
>>> data
b'\xff\xd8\xff'
>>> image.tell()
3
>>> image.seek(0)
0
>>> data = image.read()
>>> len(data)
3150

File-like Objects

FIXME

Standard Input, Output, and Error

FIXME

Further Reading

FIXME

© 2001–9 Mark Pilgrim