Difficulty level: ♦♦♦♢♢
❝ FIXME ❞
— FIXME
FIXME
Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:
a_file = open('examples/chinese.txt', encoding='utf-8')
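Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. The most interesting thing about this filename is that it’s not just the name of a file; it’s a directory path and a filename combined. A file-opening function could have taken two arguments, a directory path and a filename, but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.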
But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.
Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
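To make the difference between bytes and characters concrete, here is a small sketch that encodes a short string by hand instead of reading it from disk. The sample text and the variable names are made up for illustration; they are not part of the example files.

```python
# A made-up sample string: seven ASCII characters plus one Chinese character.
text = 'Python 是'

# Encoding turns characters into bytes; decoding reverses the process.
raw = text.encode('utf-8')
print(len(text))   # number of characters
print(len(raw))    # number of bytes (the Chinese character needs three)

# Decoding with the same encoding recovers the original string.
round_trip = raw.decode('utf-8')
print(round_trip == text)

# Decoding with an encoding that cannot represent these bytes fails.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```

Reading a text file does the same decoding step for you, which is why Python needs to know which encoding to use.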
# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
>>> file = open('examples/chinese.txt')
>>> a_string = file.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
>>>
What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError.
But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
☞ If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file.
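If you want to check your own machine, a minimal sketch looks like this. The variable name is arbitrary, and the value printed depends entirely on your platform and settings, so no particular output is shown.

```python
import locale

# Ask Python which encoding it would pick for text files by default.
default_encoding = locale.getpreferredencoding()
print(default_encoding)
```

Whatever this prints, passing encoding='utf-8' (or whichever encoding the file really uses) to open() makes the result irrelevant.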
So far, all we know is that Python has a built-in function called open(). The open() function returns a file object, which has methods and attributes for getting information about and manipulating the file.
>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.name ①
'examples/chinese.txt'
>>> a_file.encoding ②
'utf-8'
>>> a_file.mode ③
'r'
① The name attribute reflects the name you passed in to the open() function when you opened the file. It is not normalized to an absolute pathname.
② The encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify the encoding when you opened the file (bad developer!) then the encoding attribute will reflect locale.getpreferredencoding().
③ The mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r', which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).
☞ The documentation for the open() function lists all the possible file modes.
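As a rough illustration of how the mode parameter changes what you get back, the following sketch opens one throwaway file in four different modes. The filename modes.txt and all variable names are hypothetical, and the file lives in a temporary directory so nothing real is touched.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes.txt')   # hypothetical filename

    # 'w' creates (or truncates) the file for writing in text mode.
    writer = open(path, mode='w', encoding='utf-8')
    writer_mode = writer.mode
    writer.write('first line\n')
    writer.close()

    # 'a' appends to the end of an existing file.
    appender = open(path, mode='a', encoding='utf-8')
    appender_mode = appender.mode
    appender.write('second line\n')
    appender.close()

    # 'r' is the default: read-only, text mode. read() returns a string.
    reader = open(path, encoding='utf-8')
    reader_mode = reader.mode
    contents = reader.read()
    reader.close()

    # 'rb' is binary mode: read() returns bytes and no encoding is involved.
    binary_reader = open(path, mode='rb')
    binary_mode = binary_reader.mode
    first_bytes = binary_reader.read(5)
    binary_reader.close()

print(writer_mode, appender_mode, reader_mode, binary_mode)
```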
After you open a file for reading, you’ll probably want to read from it at some point.
>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.read() ①
'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
>>> a_file.read() ②
''
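① Once you open a file with the correct encoding, reading from it is just a matter of calling the file object’s read() method. The result is a string.
② Reading the file again does not raise an exception. At the end of the file, the read() method simply returns an empty string.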
What if you want to re-read a file?
# continued from the previous example
>>> a_file.read()      ①
''
>>> a_file.seek(0)     ②
0
>>> a_file.read(16)    ③
'Dive Into Python'
>>> a_file.read(1)     ④
' '
>>> a_file.read(1)
'是'
>>> a_file.tell()      ⑤
20
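① Since you’re still at the end of the file, further calls to the file object’s read() method simply return an empty string.
② The seek() method moves to a specific byte position in a file.
③ The read() method can take an optional parameter, the number of characters to read.
④ If you like, you can even read one character at a time.
⑤ The tell() method reports the current position: 20, even though only 16 + 1 + 1 = 18 characters have been read.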
Let’s see that again.
# continued from the previous example
>>> a_file.seek(17)    ①
17
>>> a_file.read(1)     ②
'是'
>>> a_file.tell()      ③
20
Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that they’re counting the same thing. But that’s only true for some characters.
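You can check the arithmetic without touching a file at all. The sketch below uses a hypothetical helper, byte_offset(), which is not part of Python or of this chapter’s examples; it simply encodes everything before a given character and counts the bytes.

```python
# The first few characters of the sample sentence used in this chapter.
text = 'Dive Into Python 是为'

def byte_offset(s, char_index, encoding='utf-8'):
    """Return how many bytes precede the character at char_index."""
    return len(s[:char_index].encode(encoding))

# The first Chinese character is character number 17 (counting from 0),
# and every character before it is plain ASCII, so it also starts at byte 17.
offset_of_first_chinese = byte_offset(text, 17)

# One character later, the byte position has jumped by three.
offset_after_first_chinese = byte_offset(text, 18)

print(offset_of_first_chinese, offset_after_first_chinese)
```

The two counts agree only as long as every character so far fits in a single byte.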
But wait, it gets worse!
>>> a_file.seek(18)    ①
18
>>> a_file.read(1)     ②
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    a_file.read(1)
  File "C:\Python31\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
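① Move to byte 18 and try to read one character.
② This fails because no character starts at byte 18. The nearest character starts at byte 17 and is three bytes long, so trying to read a character from the middle of it fails with a UnicodeDecodeError.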
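One way to stay out of trouble in text mode is to seek only to positions that tell() has already given you (or to 0), rather than computing byte offsets yourself. Here is a minimal sketch of that idea, using a hypothetical sample.txt in a temporary directory instead of the chapter’s example file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical file

    # Create a small UTF-8 file to experiment with.
    out = open(path, mode='w', encoding='utf-8')
    out.write('Dive Into Python 是为\n')
    out.close()

    a_file = open(path, encoding='utf-8')
    a_file.read(17)               # skip over 'Dive Into Python '
    bookmark = a_file.tell()      # remember where the next character starts
    first_pass = a_file.read(1)   # read one character
    a_file.seek(bookmark)         # safe: this position came from tell()
    second_pass = a_file.read(1)  # read the same character again
    a_file.close()

print(first_pass, second_pass)
```

Because the bookmark always lands on a character boundary, the second read never starts in the middle of a multi-byte character.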
Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them.
# continued from the previous example
>>> a_file.close()
Well that was anticlimactic.
The file object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s not terribly useful.
# continued from the previous example
>>> a_file.read()      ①
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    a_file.read()
ValueError: I/O operation on closed file.
>>> a_file.seek(0)     ②
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    a_file.seek(0)
ValueError: I/O operation on closed file.
>>> a_file.tell()      ③
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    a_file.tell()
ValueError: I/O operation on closed file.
>>> a_file.close()     ④
>>> a_file.closed      ⑤
True
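① You can’t read from a closed file; that raises a ValueError exception.
② You can’t seek in a closed file either.
③ There’s no current position in a closed file, so the tell() method also fails.
④ Perhaps surprisingly, calling the close() method on a file object whose file has been closed does not raise an exception. It’s just a no-op.
⑤ Closed file objects do have one useful attribute: the closed attribute will confirm that the file is closed.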
with Statement
FIXME "with open(...) as file" pattern
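As a minimal sketch of the "with open(...) as file" pattern, the example below shows that a file opened in a with block is closed as soon as the block ends, even if the block raises an exception. The filename with-demo.txt and the variable names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'with-demo.txt')   # hypothetical file

    # The file stays open inside the block and is closed on the way out.
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('hello\n')
        closed_inside = a_file.closed
    closed_after = a_file.closed

    # The file is closed even when the block is left because of an error.
    try:
        with open(path, encoding='utf-8') as a_file:
            raise RuntimeError('something went wrong')
    except RuntimeError:
        closed_after_error = a_file.closed

print(closed_inside, closed_after, closed_after_error)
```

You never call close() yourself here; leaving the block does it for you.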
A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line.
Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and it will all Just Work.
☞ If you need fine-grained control over what’s considered a line ending, you can pass the optional newline parameter to the open() function. See the open() function documentation for all the gory details.
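To see what the default handling does, the sketch below writes a small file that deliberately mixes all three line-ending conventions, then reads it back twice: once with the defaults and once with newline='', which leaves the line endings untranslated. The filename and variable names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mixed-endings.txt')   # hypothetical file

    # Write raw bytes so that nothing translates the line endings for us.
    with open(path, mode='wb') as out:
        out.write(b'one\r\ntwo\rthree\n')

    # Default behavior: every kind of line ending is translated to '\n'.
    with open(path, encoding='utf-8') as a_file:
        translated = a_file.readlines()

    # newline='' still recognizes all three endings but leaves them alone.
    with open(path, encoding='utf-8', newline='') as a_file:
        untranslated = a_file.readlines()

print(translated)
print(untranslated)
```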
So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful.
line_number = 0
with open('examples/favorite-people.txt', encoding='utf-8') as a_file:  ①
    for a_line in a_file:                                               ②
        line_number += 1
        print('{} {}'.format(line_number, a_line.rstrip()))             ③
① Using the with pattern, you safely open the file and let Python close it for you.
② To read a file one line at a time, use a for loop. That’s it. Besides having explicit methods like read(), the file object is also an iterator which spits out a single line every time you ask for a value.
③ Using the format() string method, you can print out the line number and the line itself. (The a_line variable contains the complete line, including the line ending. The rstrip() string method removes the trailing whitespace, including that line ending.)
you@localhost:~/diveintopython3$ python3 examples/oneline.py
1 Dora
2 Ethan
3 Wesley
4 John
5 Anne
6 Mike
7 Chris
8 Sarah
9 Alex
10 Lizzie
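If you would rather not maintain the counter by hand, the built-in enumerate() function can do the counting. This is only an alternative sketch, not the script shown above; the numbered_lines() helper and the people.txt file are hypothetical.

```python
import os
import tempfile

def numbered_lines(path, encoding='utf-8'):
    """Return a list of 'N text' strings, one for each line of the file."""
    result = []
    with open(path, encoding=encoding) as a_file:
        # enumerate() pairs each line with a counter that starts at 1.
        for line_number, a_line in enumerate(a_file, start=1):
            result.append('{} {}'.format(line_number, a_line.rstrip()))
    return result

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'people.txt')   # hypothetical file
    with open(path, mode='w', encoding='utf-8') as out:
        out.write('Dora\nEthan\nWesley\n')
    lines = numbered_lines(path)

print('\n'.join(lines))
```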
FIXME
FIXME
FIXME write(), writelines(), .writeable
FIXME
FIXME
>>> image = open('examples/beauregard-100x100.jpg', 'rb')
>>> image
<io.BufferedReader object at 0x00C7A390>
>>> image.mode
'rb'
>>> image.name
'examples/beauregard-100x100.jpg'
>>> image
<io.BufferedReader object at 0x00C7A390>
>>> image.tell()
0
>>> data = image.read(3)
>>> data
b'\xff\xd8\xff'
>>> image.tell()
3
>>> image.seek(0)
0
>>> data = image.read()
>>> len(data)
3150
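Because a file opened in binary mode hands you raw bytes, you can compare them directly against a bytes literal. As a small sketch, the hypothetical starts_like_jpeg() function below checks whether a file begins with the two bytes b'\xff\xd8', the marker that opens every JPEG file. The test files are made up and are not real images.

```python
import os
import tempfile

def starts_like_jpeg(path):
    """Return True if the file begins with the JPEG start-of-image marker."""
    # Binary mode: no encoding argument, and read() returns bytes.
    with open(path, mode='rb') as image:
        return image.read(2) == b'\xff\xd8'

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two made-up files: one with a JPEG-style header, one with plain text.
    fake_jpeg = os.path.join(tmp_dir, 'fake.jpg')
    with open(fake_jpeg, mode='wb') as out:
        out.write(b'\xff\xd8\xff\xe0' + b'\x00' * 16)

    plain_text = os.path.join(tmp_dir, 'plain.txt')
    with open(plain_text, mode='wb') as out:
        out.write(b'Dive Into Python')

    jpeg_result = starts_like_jpeg(fake_jpeg)
    text_result = starts_like_jpeg(plain_text)

print(jpeg_result, text_result)
```

A matching header only suggests that the file is a JPEG; it does not prove the rest of the file is valid.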
FIXME
FIXME
FIXME
© 2001–9 Mark Pilgrim