You are here: Home Dive Into Python 3

Difficulty level: ♦♦♦♢♢

Files

FIXME
— FIXME

 

Diving In

FIXME

Reading From Text Files

Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:

a_file = open('examples/chinese.txt', encoding='utf-8')

Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. There are five interesting things about this filename:

  1. It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
  2. The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
  3. The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to what, you might ask? Patience, grasshopper.
  4. It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-ASCII pathnames.
  5. It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it.

But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.

Character Encoding Rears Its Ugly Head

Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).

# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
>>> file = open('examples/chinese.txt')
>>> a_string = file.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
>>> 

What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError.

But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).

If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file.

File Objects

So far, all we know is that Python has a built-in function called open(). The open() function returns a file object, which has methods and attributes for getting information about and manipulating the file.

>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.name                                              
'examples/chinese.txt'
>>> a_file.encoding                                          
'utf-8'
>>> a_file.mode                                              
'r'
  1. The name attribute reflects the name you passed in to the open() function when you opened the file. It is not normalized to an absolute pathname.
  2. Likewise, encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify the encoding when you opened the file (bad developer!) then the encoding attribute will reflect locale.getpreferredencoding().
  3. The mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r', which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).

The documentation for the open() function lists all the possible file modes.

Reading Data From A Text File

After you open a file for reading, you’ll probably want to read from it at some point.

>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.read()                                            
'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
>>> a_file.read()                                            
''
  1. Once you open a file (with the correct encoding), reading from it is just a matter of calling the file object’s read() method. The result is a string.
  2. Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider reading past end-of-file to be an error; it simply returns an empty string.

What if you want to re-read a file?

# continued from the previous example
>>> a_file.read()                      
''
>>> a_file.seek(0)                     
0
>>> a_file.read(16)                    
'Dive Into Python'
>>> a_file.read(1)                     
' '
>>> a_file.read(1)
'是'
>>> a_file.tell()                      
20
  1. Since you’re still at the end of the file, further calls to the file object’s read() method simply return an empty string.
  2. The seek() method moves to a specific byte position in a file.
  3. The read() method can take an optional parameter, the number of characters to read.
  4. If you like, you can even read one character at a time.
  5. 16 + 1 + 1 = … 20?

Let’s see that again.

# continued from the previous example
>>> a_file.seek(17)                    
17
>>> a_file.read(1)                     
'是'
>>> a_file.tell()                      
20
  1. Move to the 17th byte.
  2. Read one character.
  3. Now you’re on the 20th byte.

Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that they’re counting the same thing. But that’s only true for some characters.

But wait, it gets worse!

>>> a_file.seek(18)                         
18
>>> a_file.read(1)                          
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    a_file.read(1)
  File "C:\Python31\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
  1. Move to the 18th byte and try to read one character.
  2. Why does this fail? Because there isn’t a character at the 18th byte. The nearest character starts at the 17th byte (and goes for three bytes). Trying to read a character from the middle will fail with a UnicodeDecodeError.

Closing Files

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them.

# continued from the previous example
>>> a_file.close()

Well that was anticlimactic.

The file object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s not terribly useful.

# continued from the previous example
>>> a_file.read()                           
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    a_file.read()
ValueError: I/O operation on closed file.
>>> a_file.seek(0)                          
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    a_file.seek(0)
ValueError: I/O operation on closed file.
>>> a_file.tell()                           
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    a_file.tell()
ValueError: I/O operation on closed file.
>>> a_file.close()                          
>>> a_file.closed                           
True
  1. You can’t read from a closed file; that raises an IOError exception.
  2. You can’t seek in a closed file either.
  3. There’s no current position in a closed file, so the tell() method also fails.
  4. Perhaps surprisingly, calling the close() method on a file object whose file has been closed does not raise an exception. It’s just a no-op.
  5. Closed file objects do have one useful attribute: the closed attribute will confirm that the file is closed.

Using The with Statement

FIXME "with open(...) as file" pattern

Reading Data One Line At A Time

A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line.

Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and and it will all Just Work.

If you need fine-grained control over what’s considered a line ending, you can pass the optional newline parameter to the open() function. See the open() function documentation for all the gory details.

So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful.

[download oneline.py]

line_number = 0
with open('examples/favorite-people.txt', encoding='utf-8') as a_file:  
    for a_line in a_file:                                               
        line_number += 1
        print('{} {}'.format(line_number, a_line.rstrip()))             
  1. Using the with pattern, you safely open the file and let Python close it for you.
  2. This is it: to read a file one line at a time, use a for loop. That’s it. Besides having explicit methods like read(), the file object is also an iterator which spits out a single line every time you ask for a value.
  3. Using the format() string method, you can print out the line number and the line itself. (The a_line variable contains the complete line, carriage returns and all. The rstrip() string method removes the trailing whitespace, including the carriage return characters.)
you@localhost:~/diveintopython3$ python3 examples/oneline.py
1 Dora
2 Ethan
3 Wesley
4 John
5 Anne
6 Mike
7 Chris
8 Sarah
9 Alex
10 Lizzie

Writing to Text Files

FIXME

Character Encoding Again

FIXME

Write A Little, Write A Lot

FIXME write(), writelines(), .writeable

Handling I/O Errors

FIXME

Binary Files

FIXME

>>> image = open('examples/beauregard-100x100.jpg', 'rb')
>>> image
<io.BufferedReader object at 0x00C7A390>
>>> image.mode
'rb'
>>> image.name
'examples/beauregard-100x100.jpg'
>>> image
<io.BufferedReader object at 0x00C7A390>
>>> image.tell()
0
>>> data = image.read(3)
>>> data
b'\xff\xd8\xff'
>>> image.tell()
3
>>> image.seek(0)
0
>>> data = image.read()
>>> len(data)
3150

File-like Objects

FIXME

Standard Input, Output, and Error

FIXME

Further Reading

FIXME

© 2001–9 Mark Pilgrim