Difficulty level: ♦♦♦♢♢
Python has a built-in function, open(), for opening a file on disk. The open() function returns a file object, which has methods and attributes for getting information about and manipulating the file.
>>> image = open('examples/beauregard-100x100.jpg', 'rb')
>>> image
<io.BufferedReader object at 0x00C7A390>
>>> image.mode
'rb'
>>> image.name
'examples/beauregard-100x100.jpg'
>>>
>>> f = open("/music/_singles/kairo.mp3", "rb") ①
>>> f ②
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.mode ③
'rb'
>>> f.name ④
'/music/_singles/kairo.mp3'
- The open function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. (print open.__doc__ displays a great explanation of all the possible modes.)
- The open function returns an object (by now, this should not surprise you). A file object has several useful attributes.
- The mode attribute of a file object tells you in which mode the file was opened.
- The name attribute of a file object tells you the name of the file that the file object has open.
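The two attributes described above can be checked against any file you open. The following is a minimal sketch in Python 3 syntax; the temporary file and the names demo_path, mode, and name are assumptions made for illustration, so the example does not depend on any file from this chapter.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical stand-in for a real image).
fd, demo_path = tempfile.mkstemp(suffix=".bin")
os.write(fd, b"\xff\xd8\xff\xe0")
os.close(fd)

image = open(demo_path, "rb")   # second parameter: read, binary
mode = image.mode               # the mode string the file was opened with
name = image.name               # the filename that was passed to open()
image.close()
os.remove(demo_path)

print(mode)
print(name == demo_path)
```

Only the filename is required; leaving out the mode would open the file for reading in text mode instead.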
6.2.1. Reading Files
After you open a file, the first thing you'll want to do is read from it, as shown in the next example.
Example 6.4. Reading a File
>>> image
<io.BufferedReader object at 0x00C7A390>
>>> image.tell()
0
>>> data = image.read(3)
>>> data
b'\xff\xd8\xff'
>>> image.tell()
3
>>> image.seek(0)
0
>>> data = image.read()
>>> len(data)
3150
>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.tell() ①
0
>>> f.seek(-128, 2) ②
>>> f.tell() ③
7542909
>>> tagData = f.read(128) ④
>>> tagData
'TAGKAIRO****THE BEST GOA ***DJ MARY-JANE***
Rave Mix 2000http://mp3.com/DJMARYJANE \037'
>>> f.tell() ⑤
7543037
- A file object maintains state about the file it has open. The tell method of a file object tells you your current position in the open file. Since you haven't done anything with this file yet, the current position is 0, which is the beginning of the file.
- The seek method of a file object moves to another position in the open file. The second parameter specifies what the first one means: 0 means move to an absolute position (counting from the start of the file), 1 means move to a relative position (counting from the current position), and 2 means move to a position relative to the end of the file. Since the MP3 tags you're looking for are stored at the end of the file, you use 2 and tell the file object to move to a position 128 bytes from the end of the file.
- The tell method confirms that the current file position has moved.
- The read method reads a specified number of bytes from the open file and returns a string with the data that was read. The optional parameter specifies the maximum number of bytes to read. If no parameter is specified, read will read until the end of the file. (You could have simply said read() here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data is assigned to the tagData variable, and the current position is updated based on how many bytes were read.
- The tell method confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position has been incremented by 128.
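The arithmetic in these steps is easy to verify on a file whose size you control. This is a sketch in Python 3 syntax, assuming a temporary file of 1128 bytes as a stand-in for the MP3 file; the payload contents and variable names are made up for illustration.

```python
import os
import tempfile

# 1000 filler bytes followed by a 128-byte "tag" (hypothetical data).
payload = b"\x00" * 1000 + b"TAG" + b"x" * 125

f = tempfile.TemporaryFile()        # opened in binary read/write mode
f.write(payload)
f.seek(0)                           # back to the start, absolute position

start = f.tell()                    # 0
f.seek(-128, os.SEEK_END)           # os.SEEK_END is the constant 2
before = f.tell()                   # 1128 - 128
tag_data = f.read(128)              # the last 128 bytes
after = f.tell()                    # position advanced by the bytes read
f.close()

print(start, before, after)         # 0 1000 1128
```

The constants os.SEEK_SET, os.SEEK_CUR, and os.SEEK_END are the named equivalents of 0, 1, and 2.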
6.2.2. Closing Files
Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.
Example 6.5. Closing a File
>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.closed ①
False
>>> f.close() ②
>>> f
<closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.closed ③
True
>>> f.seek(0) ④
Traceback (innermost last):
File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.tell()
Traceback (innermost last):
File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.read()
Traceback (innermost last):
File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.close() ⑤
- The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False).
- To close a file, call the close method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) that the system hadn't gotten around to actually writing yet, and releases the system resources.
- The closed attribute confirms that the file is closed.
- Just because a file is closed doesn't mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; they all raise an exception.
- Calling close on a file object whose file is already closed does not raise an exception; it fails silently.
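The same behavior can be observed without an interactive session. This is a small sketch in Python 3 syntax, assuming a temporary file as the file being closed; the variable names are illustrative only.

```python
import tempfile

f = tempfile.TemporaryFile()
closed_before = f.closed        # False: the file is still open
f.close()
closed_after = f.closed         # True: the object survives, the file does not

try:
    f.read()                    # any I/O on a closed file raises ValueError
except ValueError as err:
    error_message = str(err)
else:
    error_message = None

f.close()                       # closing again is harmless; no exception
print(closed_before, closed_after, error_message is not None)
```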
6.2.3. Handling I/O Errors
Now you've seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle errors.
Example 6.6. File Objects in MP3FileInfo
try:                                 ①
    fsock = open(filename, "rb", 0)  ②
    try:
        fsock.seek(-128, 2)          ③
        tagdata = fsock.read(128)    ④
    finally:                         ⑤
        fsock.close()
    .
    .
    .
except IOError:                      ⑥
    pass
- Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a try...except block. (Hey, isn't standardized indentation great? This is where you start to appreciate it.)
- The open function may raise an IOError. (Maybe the file doesn't exist.)
- The seek method may raise an IOError. (Maybe the file is smaller than 128 bytes.)
- The read method may raise an IOError. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.)
- This is new: a try...finally block. Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods. That's what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before.
- At last, you handle your IOError exception. This could be the IOError exception raised by the call to open, seek, or read. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember, pass is a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the next line of code after the try...except block.
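The nested try...finally inside try...except can be packaged as a small function. This is a sketch in Python 3 syntax, not the code from fileinfo.py: the function name read_tail and its None return value are assumptions made for illustration. In Python 3, IOError is an alias of OSError, so the sketch catches OSError.

```python
import os

def read_tail(filename, size=128):
    """Return the last `size` bytes of filename, or None if any I/O step fails."""
    try:
        fsock = open(filename, "rb")           # may fail: file does not exist
        try:
            fsock.seek(-size, os.SEEK_END)     # may fail: file shorter than size
            return fsock.read(size)            # may fail: read error
        finally:
            fsock.close()                      # runs on success and on failure
    except OSError:
        return None                            # handled by deliberately giving up
```

The finally clause only exists once open has succeeded, which is why it sits inside the outer try rather than alongside it.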
6.2.4. Writing to Files
As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes:
- "Append" mode will add data to the end of the file.
- "Write" mode will overwrite the file.
Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open it and start writing.
Example 6.7. Writing to Files
>>> logfile = open('test.log', 'w') ①
>>> logfile.write('test succeeded') ②
>>> logfile.close()
>>> print file('test.log').read() ③
test succeeded
>>> logfile = open('test.log', 'a') ④
>>> logfile.write('line 2')
>>> logfile.close()
>>> print file('test.log').read() ⑤
test succeededline 2
- You start boldly by either creating the new file test.log or overwriting the existing file, and opening it for writing. (The second parameter "w" means open the file for writing.) Yes, that's as dangerous as it sounds. I hope you didn't care about the previous contents of that file, because they're gone now.
- You can add data to the newly opened file with the write method of the file object returned by open.
- file is a synonym for open. This one-liner opens the file, reads its contents, and prints them.
- You happen to know that test.log exists (since you just finished writing to it), so you can open it and append to it. (The "a" parameter means open the file for appending.) Actually you could do this even if the file didn't exist, because opening the file for appending will create the file if necessary. But appending will never harm the existing contents of the file.
- As you can see, both the original line you wrote and the second line you appended are now in test.log. Also note that line breaks are not included. Since you didn't write them explicitly to the file either time, the file doesn't include them. You can write a line break with the "\n" character. Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line.
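To keep the two writes on separate lines, end each string with "\n" yourself. This is a sketch in Python 3 syntax, assuming a scratch directory created with tempfile so that no existing test.log is overwritten; log_path and contents are illustrative names.

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "test.log")

logfile = open(log_path, "w")        # creates the file, or truncates an old one
logfile.write("test succeeded\n")    # write() adds nothing; the "\n" is explicit
logfile.close()

logfile = open(log_path, "a")        # appends; would also create a missing file
logfile.write("line 2\n")
logfile.close()

logfile = open(log_path)             # default mode: read, text
contents = logfile.read()
logfile.close()
print(contents, end="")
```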
Further Reading on File Handling
- Python Tutorial discusses reading and writing files, including how to read a file one line at a time into a list.
- eff-bot discusses efficiency and performance of various ways of reading a file.
- Python Knowledge Base answers common questions about files.
- Python Library Reference summarizes all the file object methods.
10.1. Abstracting input sources
One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object.
Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close it when they're done. But they don't. Instead, they take a file-like object.
In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left off and returns the next chunk of data.
This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply calls the object's read method, the function can handle any kind of input source without specific code to handle each kind.
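A function written this way needs nothing from its argument except read. The sketch below uses Python 3 syntax; read_chunks is a hypothetical helper, and io.StringIO (an in-memory text stream from the standard library) stands in for an input source that is not a file on disk.

```python
import io

def read_chunks(source, size=4):
    """Collect everything from `source` by calling only source.read(size)."""
    chunks = []
    while True:
        chunk = source.read(size)   # picks up where the last call left off
        if not chunk:               # an empty result means nothing is left
            break
        chunks.append(chunk)
    return chunks

chunks = read_chunks(io.StringIO("0123456789"))
print(chunks)   # ['0123', '4567', '89']
```

Passing an object returned by open would work the same way, because read_chunks never asks where the data comes from.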
In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object.
Example 10.1. Parsing XML from a file
>>> from xml.dom import minidom
>>> fsock = open('binary.xml') ①
>>> xmldoc = minidom.parse(fsock) ②
>>> fsock.close() ③
>>> print xmldoc.toxml() ④
<?xml version="1.0" ?>
<grammar>
<ref id="bit">
<p>0</p>
<p>1</p>
</ref>
<ref id="byte">
<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>
- First, you open the file on disk. This gives you a file object.
- You pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk.
- Be sure to call the close method of the file object after you're done with it. minidom.parse will not do this for you.
- Calling the toxml() method on the returned XML document prints out the entire thing.
Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar, and how easy, it is to parse an XML document straight from the Internet.
Example 10.2. Parsing XML from a URL
>>> import urllib
>>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') ①
>>> xmldoc = minidom.parse(usock) ②
>>> usock.close() ③
>>> print xmldoc.toxml() ④
<?xml version="1.0" ?>
<rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>Slashdot</title>
<link>http://slashdot.org/</link>
<description>News for nerds, stuff that matters</description>
</channel>
<image>
<title>Slashdot</title>
<url>http://images.slashdot.org/topics/topicslashdot.gif</url>
<link>http://slashdot.org/</link>
</image>
<item>
<title>To HDTV or Not to HDTV?</title>
<link>http://slashdot.org/article.pl?sid=01/12/28/0421241</link>
</item>
[...snip...]
- As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the HTML source of the web page.
- Now you pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns. The fact that this XML data is now coming straight from a web page is completely irrelevant. minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.
- As soon as you're done with it, be sure to close the file-like object that urlopen gives you.
- By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site.
Example 10.3. Parsing XML from a string (the easy but inflexible way)
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> xmldoc = minidom.parseString(contents) ①
>>> print xmldoc.toxml()
<?xml version="1.0" ?>
<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
- minidom has a method, parseString, which takes an entire XML document as a string and parses it. You can use this instead of minidom.parse if you know you already have your entire XML document in a string.
OK, so you can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying.
If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO.
Example 10.4. Introducing StringIO
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> import StringIO
>>> ssock = StringIO.StringIO(contents) ①
>>> ssock.read() ②
"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> ssock.read() ③
''
>>> ssock.seek(0) ④
>>> ssock.read(15) ⑤
'<grammar><ref i'
>>> ssock.read(15)
"d='bit'><p>0</p"
>>> ssock.read()
'><p>1</p></ref></grammar>'
>>> ssock.close() ⑥
- The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object. The StringIO class takes the string as a parameter when creating an instance.
- Now you have a file-like object, and you can do all sorts of file-like things with it. Like read, which returns the original string.
- Calling read again returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any more without explicitly seeking to the beginning of the file. The StringIO object works the same way.
- You can explicitly seek to the beginning of the string, just like seeking through a file, by using the seek method of the StringIO object.
- You can also read the string in chunks, by passing a size parameter to the read method.
- At any time, read will return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term file-like object.
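In Python 3 the same class lives in the io module rather than in a module of its own. The following sketch repeats the session above as a script under that assumption; the variable names are illustrative.

```python
import io

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
ssock = io.StringIO(contents)   # io.StringIO replaces StringIO.StringIO

everything = ssock.read()       # the whole string
nothing = ssock.read()          # '' because the position is now at the end
ssock.seek(0)                   # rewind to the beginning
first = ssock.read(15)          # first 15 characters
second = ssock.read(15)         # next 15 characters
rest = ssock.read()             # whatever has not been read yet
ssock.close()

print(first + second + rest == contents)
```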
Example 10.5. Parsing XML from a string (the file-like object way)
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
>>> ssock = StringIO.StringIO(contents)
>>> xmldoc = minidom.parse(ssock) ①
>>> ssock.close()
>>> print xmldoc.toxml()
<?xml version="1.0" ?>
<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
- Now you can pass the file-like object (really a StringIO) to minidom.parse, which will call the object's read method and happily parse away, never knowing that its input came from a hard-coded string.
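A Python 3 version of the same idea might look like the sketch below. It assumes io.BytesIO as the in-memory file-like object and encodes the string to UTF-8 bytes first, a choice made here to mirror what a file opened in binary mode would provide; the names root_name and ref_id are illustrative.

```python
import io
from xml.dom import minidom

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

ssock = io.BytesIO(contents.encode("utf-8"))   # bytes in memory, file-like
xmldoc = minidom.parse(ssock)                  # parse only calls ssock.read
ssock.close()

root_name = xmldoc.documentElement.tagName
ref_id = xmldoc.getElementsByTagName("ref")[0].getAttribute("id")
print(root_name, ref_id)   # grammar bit
```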
So now you know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use urlopen to get a file-like object; for a local file, you use open; and for a string, you use StringIO. Now let's take it one step further and generalize these differences as well.
Example 10.6. openAnything
def openAnything(source): ①
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source) ②
    except (IOError, OSError):
        pass
    # try to open with native open function (if source is pathname)
    try:
        return open(source) ③
    except (IOError, OSError):
        pass
    # treat source as string
    import StringIO
    return StringIO.StringIO(str(source)) ④
- The openAnything function takes a single parameter, source, and returns a file-like object. source is a string of some sort; it can either be a URL (like 'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like 'binary.xml'), or a string that contains actual XML data to be parsed.
- First, you see if source is a URL. You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL. This is actually elegant in the sense that, if urllib ever supports new types of URLs in the future, you will also support them without recoding. If urllib is able to open source, then the return kicks you out of the function immediately and the following try statements never execute.
- On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between different platforms, so you'd probably get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors.
- By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIO to create a file-like object out of it and return that. (In fact, since you're using the str function, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its __str__ special method.)
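The same three-step fallback can be sketched in Python 3 syntax. This is an adaptation, not the book's function: open_anything is a hypothetical name, urllib.request.urlopen replaces urllib.urlopen, and the sketch returns binary streams throughout (io.BytesIO for the string case) as a simplifying assumption. In Python 3, urlopen raises ValueError for a string with no URL scheme, so that exception is caught alongside OSError.

```python
import io
import urllib.request

def open_anything(source):
    """Return a binary file-like object for a URL, a pathname, or literal data."""
    # 1. Try source as a URL. A string with no scheme raises ValueError;
    #    a URL that cannot be opened raises URLError, a subclass of OSError.
    try:
        return urllib.request.urlopen(source)
    except (ValueError, OSError):
        pass
    # 2. Try source as a path to a file on disk.
    try:
        return open(source, "rb")
    except (ValueError, OSError):
        pass
    # 3. Nothing else worked: treat source as the data itself.
    return io.BytesIO(str(source).encode("utf-8"))
```

Because every branch returns an object with read and close, the caller can treat the result uniformly, for example by handing it to minidom.parse and closing it afterwards.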
Now you can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it.
Example 10.7. Using openAnything in kgp.py
class KantGenerator:
    def _load(self, source):
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc
10.2. Standard input, output, and error
UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.
Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.)
Example 10.8. Introducing stdout and stderr
>>> for i in range(3):
...     print 'Dive in' ①
Dive in
Dive in
Dive in
>>> import sys
>>> for i in range(3):
...     sys.stdout.write('Dive in') ②
Dive inDive inDive in
>>> for i in range(3):
...     sys.stderr.write('Dive in') ③
Dive inDive inDive in
- As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in range function to build simple counter loops that repeat something a set number of times.
- stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what the print function really does; it adds a newline to the end of the string you're printing, and calls sys.stdout.write.
- In the simplest case, stdout and stderr send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like stdout, stderr does not add newlines for you; if you want them, add them yourself.
stdout and stderr are both file-like objects, like the ones you discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output.
Example 10.9. Redirecting output
[you@localhost kgp]$ python stdout.py
Dive in
[you@localhost kgp]$ cat out.log
This message will be logged instead of displayed
(On Windows, you can use type instead of cat to display the contents of a file.)
If you have not already done so, you can download this and other examples used in this book.
#stdout.py
import sys
print 'Dive in' ①
saveout = sys.stdout ②
fsock = open('out.log', 'w') ③
sys.stdout = fsock ④
print 'This message will be logged instead of displayed' ⑤
sys.stdout = saveout ⑥
fsock.close() ⑦
- This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command line).
- Always save stdout before redirecting it, so you can set it back to normal later.
- Open a file for writing. If the file doesn't exist, it will be created. If the file does exist, it will be overwritten.
- Redirect all further output to the new file you just opened.
- This will be “printed” to the log file only; it will not be visible in the IDE window or on the screen.
- Set stdout back to the way it was before you mucked with it.
- Close the log file.
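The save, redirect, restore sequence can be made a little safer by restoring stdout in a finally clause. This is a sketch in Python 3 syntax, assuming an io.StringIO buffer in place of out.log so the result can be inspected directly; the names buffer and logged are illustrative.

```python
import io
import sys

buffer = io.StringIO()          # in-memory stand-in for the log file
saveout = sys.stdout            # always save stdout before redirecting it
sys.stdout = buffer             # redirect all further output
try:
    print("This message will be logged instead of displayed")
finally:
    sys.stdout = saveout        # restored even if the print had raised

logged = buffer.getvalue()
buffer.close()
print("Dive in")                # back on the real stdout
```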
Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout.
Example 10.10. Redirecting error information
[you@localhost kgp]$ python stderr.py
[you@localhost kgp]$ cat error.log
Traceback (most recent call last):
  File "stderr.py", line 5, in ?
    raise Exception, 'this error will be logged'
Exception: this error will be logged
If you have not already done so, you can download this and other examples used in this book.
#stderr.py
import sys
fsock = open('error.log', 'w') ①
sys.stderr = fsock ②
raise Exception, 'this error will be logged' ③ ④
- Open the log file where you want to store debugging information.
- Redirect standard error by assigning the file object of the newly-opened log file to stderr.
- Raise an exception. Note from the screen output that this does not print anything on screen. All the normal traceback information has been written to error.log.
- Also note that you're not explicitly closing your log file, nor are you setting stderr back to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for you, and it doesn't make any difference that stderr is never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important for stdout, if you expect to go do other stuff within the same script afterwards.
Since it is so common to write error messages to standard error, there is a shorthand syntax that can be used instead of going through the hassle of redirecting it outright.
Example 10.11. Printing to stderr
>>> print 'entering function'
entering function
>>> import sys
>>> print >> sys.stderr, 'entering function' ①
entering function
- This shorthand syntax of the print statement can be used to write to any open file, or file-like object. In this case, you can redirect a single print statement to stderr without affecting subsequent print statements.
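In Python 3, where print is a function, the equivalent of this shorthand is the file keyword argument. The sketch below assumes a hypothetical helper named log and uses io.StringIO to show that any file-like object can be the target.

```python
import io
import sys

def log(message, stream=None):
    """Print message to stream (sys.stderr by default); sys.stdout is untouched."""
    if stream is None:
        stream = sys.stderr
    print(message, file=stream)

log("entering function")              # goes to standard error

capture = io.StringIO()
log("entering function", capture)     # goes to an in-memory file-like object
captured = capture.getvalue()
```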
Standard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some previous program. This will likely not make much sense to classic Mac OS users, or even Windows users unless you were ever fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting one program's output to the next program's input.
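A program that sits in the middle of such a chain only needs to read from one file-like object and write to another. This is a sketch in Python 3 syntax; the function name shout and the upper-casing behavior are made up for illustration, and io.StringIO objects stand in for the two ends of the pipe so the example runs without a shell.

```python
import io

def shout(instream, outstream):
    """Copy instream to outstream line by line, upper-casing the text."""
    for line in instream:
        outstream.write(line.upper())

# Stand-ins for the pipe ends. In a real script the call would be
# shout(sys.stdin, sys.stdout), and the shell would connect the pipes.
fake_in = io.StringIO("dive in\ndive in\n")
fake_out = io.StringIO()
shout(fake_in, fake_out)
result = fake_out.getvalue()
print(result, end="")
```

Because the function takes its streams as parameters, the same code serves both the piped case and the in-memory case.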
© 2001–9 Mark Pilgrim