<!DOCTYPE html>
<head>
<meta charset=utf-8>
<title>Files - Dive into Python 3</title>
<!--[if IE]><script src=j/html5.js></script><![endif]-->
<link rel=stylesheet href=dip3.css>
<style>
body{counter-reset:h1 11}
</style>
<link rel=stylesheet type=text/css media='only screen and (max-device-width: 480px)' href=mobile.css>
<link rel=stylesheet media=print href=print.css>
<meta name=viewport content='initial-scale=1.0'>
</head>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=25>&nbsp;<input type=submit name=sa value=Search></div></form>
<p id=level>Difficulty level: <span class=u title=intermediate>&#x2666;&#x2666;&#x2666;&#x2662;&#x2662;</span>
<h1>Files</h1>
<blockquote class=q>
<p><span class=u>&#x275D;</span> A nine mile walk is no joke, especially in the rain. <span class=u>&#x275E;</span><br>&mdash; Harry Kemelman, <cite>The Nine Mile Walk</cite>
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
<p class=f>FIXME
<h2 id=reading>Reading From Text Files</h2>
<p>Before you can read from a file, you need to open it. Opening a file in Python couldn&#8217;t be easier:
<pre class=nd><code class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</code></pre>
<p>Python has a built-in <code>open()</code> function, which takes a filename as an argument. Here the filename is <code class=pp>'examples/chinese.txt'</code>. There are five interesting things about this filename:
<ol>
<li>It&#8217;s not just the name of a file; it&#8217;s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments&nbsp;&mdash;&nbsp;a directory path and a filename&nbsp;&mdash;&nbsp;but the <code>open()</code> function only takes one. In Python, whenever you need a &#8220;filename,&#8221; you can include some or all of a directory path as well.
<li>The directory path uses a forward slash, but I didn&#8217;t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
<li>The directory path does not begin with a slash or a drive letter, so it is called a <i>relative path</i>. Relative to what, you might ask? Patience, grasshopper.
<li>It&#8217;s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-<abbr>ASCII</abbr> pathnames.
<li>It doesn&#8217;t need to be on your local disk. You might have a network drive mounted. That &#8220;file&#8221; might be a figment of <a href=http://en.wikipedia.org/wiki/Filesystem_in_Userspace>an entirely virtual filesystem</a>. If your computer considers it a file and can access it as a file, Python can open it.
</ol>
<p>But that call to the <code>open()</code> function didn&#8217;t stop at the filename. There&#8217;s another argument, called <code>encoding</code>. Oh dear, <a href=strings.html#boring-stuff>that sounds dreadfully familiar</a>.
<h3 id=encoding>Character Encoding Rears Its Ugly Head</h3>
<p>Bytes are bytes; <a href=strings.html#byte-arrays>characters are an abstraction</a>. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a &#8220;text file&#8221; from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
<pre>
# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
<samp class=p>>>> </samp><kbd class=pp>file = open('examples/chinese.txt')</kbd>
<samp class=p>>>> </samp><kbd class=pp>a_string = file.read()</kbd>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in &lt;module>
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to &lt;undefined></samp>
<samp class=p>>>> </samp></pre>
<p>What just happened? You didn&#8217;t specify a character encoding, so Python is forced to use the default encoding. What&#8217;s the default encoding? If you look closely at the traceback, you can see that it&#8217;s dying in <code>cp1252.py</code>, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn&#8217;t support the characters that are in this file, so the read fails with an ugly <code>UnicodeDecodeError</code>.
<p>But wait, it&#8217;s worse than that! The default encoding is <em>platform-dependent</em>, so this code <em>might</em> work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
<blockquote class=note>
<p><span class=u>&#x261E;</span>If you need to get the default character encoding, import the <code>locale</code> module and call <code>locale.getpreferredencoding()</code>. On my Windows laptop, it returns <code>'cp1252'</code>, but on my Linux box upstairs, it returns <code>'UTF8'</code>. I can&#8217;t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it&#8217;s so important to specify the encoding every time you open a file.
</blockquote>
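<p>You can reproduce the failure above on any platform by doing the decoding by hand. The following is a minimal sketch, not part of the chapter&#8217;s example files: it encodes the first few words of the sample sentence as UTF-8, then decodes those bytes once as UTF-8 and once as CP-1252. The variable names are made up for illustration.

```python
import locale

# The start of the sample sentence, encoded as UTF-8 bytes.
data = 'Dive Into Python 是为有经验'.encode('utf-8')

# Decoding with the encoding that produced the bytes round-trips cleanly.
text = data.decode('utf-8')

# Decoding the same bytes as CP-1252 fails, because byte 0x8f has no
# character assigned to it in that encoding.
try:
    data.decode('cp1252')
    cp1252_failed = False
except UnicodeDecodeError as e:
    cp1252_failed = True
    bad_position = e.start  # index of the offending byte

# The default encoding varies by platform and regional settings.
print(locale.getpreferredencoding())
print(len(data), len(text), cp1252_failed)
```

<p>The offending byte is the last byte of a three-byte Chinese character, which is why the error position lands in the middle of the text rather than at the first non-<abbr>ASCII</abbr> byte.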
<h3 id=file-objects>File Objects</h3>
<p>So far, all we know is that Python has a built-in function called <code>open()</code>. The <code>open()</code> function returns a <i>file object</i>, which has methods and attributes for getting information about and manipulating the file.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.name</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>'examples/chinese.txt'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.encoding</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>'utf-8'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.mode</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>'r'</samp></pre>
<ol>
<li>The <code>name</code> attribute reflects the name you passed in to the <code>open()</code> function when you opened the file. It is not normalized to an absolute pathname.
<li>Likewise, the <code>encoding</code> attribute reflects the encoding you passed in to the <code>open()</code> function. If you didn&#8217;t specify the encoding when you opened the file (bad developer!) then the <code>encoding</code> attribute will reflect <code>locale.getpreferredencoding()</code>.
<li>The <code>mode</code> attribute tells you in which mode the file was opened. You can pass an optional <var>mode</var> parameter to the <code>open()</code> function. You didn&#8217;t specify a mode when you opened this file, so Python defaults to <code>'r'</code>, which means &#8220;open for reading only, in text mode.&#8221; As you&#8217;ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).
</ol>
<blockquote class=note>
<p><span class=u>&#x261E;</span>The <a href=http://docs.python.org/3.1/library/io.html#module-interface>documentation for the <code>open()</code> function</a> lists all the possible file modes.
</blockquote>
<h3 id=read>Reading Data From A Text File</h3>
<p>After you open a file for reading, you&#8217;ll probably want to read from it at some point.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>''</samp></pre>
<ol>
<li>Once you open a file (with the correct encoding), reading from it is just a matter of calling the file object&#8217;s <code>read()</code> method. The result is a string.
<li>Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider reading past end-of-file to be an error; it simply returns an empty string.
</ol>
<p>What if you want to re-read a file?
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>''</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>0</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(16)</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>'Dive Into Python'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>&#x2463;</span></a>
<samp class=pp>' '</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd>
<samp class=pp>'是'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>&#x2464;</span></a>
<samp class=pp>20</samp></pre>
<ol>
<li>Since you&#8217;re still at the end of the file, further calls to the file object&#8217;s <code>read()</code> method simply return an empty string.
<li>The <code>seek()</code> method moves to a specific byte position in a file.
<li>The <code>read()</code> method can take an optional parameter, the number of characters to read.
<li>If you like, you can even read one character at a time.
<li>16 + 1 + 1 = &hellip; 20?
</ol>
<p>Let&#8217;s try that again.
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(17)</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>17</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>'是'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>20</samp></pre>
<ol>
<li>Move to the 17<sup>th</sup> byte.
<li>Read one character.
<li>Now you&#8217;re on the 20<sup>th</sup> byte.
</ol>
<p>Do you see it yet? The <code>seek()</code> and <code>tell()</code> methods always count <em>bytes</em>, but since you opened this file as text, the <code>read()</code> method counts <em>characters</em>. Chinese characters <a href=strings.html#boring-stuff>require multiple bytes to encode in UTF-8</a>. The English characters in the file only require one byte each, so you might be misled into thinking that the <code>seek()</code> and <code>read()</code> methods are counting the same thing. But that&#8217;s only true for some characters.
<p>But wait, it gets worse!
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(18)</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>18</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>&#x2461;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;pyshell#12>", line 1, in &lt;module>
a_file.read(1)
File "C:\Python31\lib\codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte</samp></pre>
<ol>
<li>Move to the 18<sup>th</sup> byte and try to read one character.
<li>Why does this fail? Because there isn&#8217;t a character at the 18<sup>th</sup> byte. The nearest character starts at the 17<sup>th</sup> byte (and goes for three bytes). Trying to read a character from the middle will fail with a <code>UnicodeDecodeError</code>.
</ol>
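<p>The arithmetic behind these examples can be checked without opening a file at all. This sketch (with illustrative variable names) compares the character count and the UTF-8 byte count of the text read so far, then tries to decode starting from byte 17 and from byte 18.

```python
text = 'Dive Into Python 是'
data = text.encode('utf-8')

char_count = len(text)  # what read() counts in text mode
byte_count = len(data)  # what the byte offsets in these examples count

# The Chinese character occupies three bytes, starting at offset 17.
encoded_char = data[17:20]

# Decoding from offset 17 starts on a character boundary and works.
from_17 = data[17:].decode('utf-8')

# Decoding from offset 18 starts in the middle of a character and fails.
try:
    data[18:].decode('utf-8')
    middle_failed = False
except UnicodeDecodeError:
    middle_failed = True

print(char_count, byte_count, encoded_char, middle_failed)
```

<p>Sixteen <abbr>ASCII</abbr> characters, one space, and one three-byte character add up to 18 characters but 20 bytes.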
<h3 id=close>Closing Files</h3>
<p>Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It&#8217;s important to close files as soon as you&#8217;re finished with them.
<pre class='nd screen'>
# continued from the previous example
<samp class=p>>>> </samp><kbd class=pp>a_file.close()</kbd></pre>
<p>Well <em>that</em> was anticlimactic.
<p>The file object <var>a_file</var> still exists; calling its <code>close()</code> method doesn&#8217;t destroy the object itself. But it&#8217;s not terribly useful.
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>&#x2460;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;pyshell#24>", line 1, in &lt;module>
a_file.read()
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>&#x2461;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;pyshell#25>", line 1, in &lt;module>
a_file.seek(0)
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>&#x2462;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;pyshell#26>", line 1, in &lt;module>
a_file.tell()
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.close()</kbd> <span class=u>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.closed</kbd> <span class=u>&#x2464;</span></a>
<samp class=pp>True</samp></pre>
<ol>
<li>You can&#8217;t read from a closed file; that raises a <code>ValueError</code> exception.
<li>You can&#8217;t seek in a closed file either.
<li>There&#8217;s no current position in a closed file, so the <code>tell()</code> method also fails.
<li>Perhaps surprisingly, calling the <code>close()</code> method on a file object whose file has been closed does <em>not</em> raise an exception. It&#8217;s just a no-op.
<li>Closed file objects do have one useful attribute: the <code>closed</code> attribute will confirm that the file is closed.
</ol>
<h3 id=with>Closing Files Automatically</h3>
<p>File objects have an explicit <code>close()</code> method, but what happens if your code has a bug and crashes before you call <code>close()</code>? That file could theoretically stay open for much longer than necessary. While you&#8217;re debugging on your local computer, that&#8217;s not a big deal. On a production server, maybe it is.
<p>Python 2 had a solution for this: the <code>try..finally</code> block. That still works in Python 3, and you may see it in other people&#8217;s code or in older code that was <a href=case-study-porting-chardet-to-python-3.html>ported to Python 3</a>. But Python 2.6 introduced a cleaner solution, which is now the preferred solution in Python 3: the <code>with</code> statement.
<pre class=nd><code class=pp>with open('examples/chinese.txt', encoding='utf-8') as a_file:
    a_file.seek(17)
    a_character = a_file.read(1)
    print(a_character)</code></pre>
<p>This code calls <code>open()</code>, but it never calls <code>a_file.close()</code>. The <code>with</code> statement starts a code block, like an <code>if</code> statement or a <code>for</code> loop. Inside this code block, you can use the variable <var>a_file</var> as the file object returned from the call to <code>open()</code>. All the regular file object methods are available&nbsp;&mdash;&nbsp;<code>seek()</code>, <code>read()</code>, whatever you need. When the <code>with</code> block ends, <em>Python calls <code>a_file.close()</code> automatically</em>.
<p>Here&#8217;s the kicker: no matter how or when you exit the <code>with</code> block, Python will close that file&hellip; even if you &#8220;exit&#8221; it via an unhandled exception. That&#8217;s right, even if your code raises an exception and your entire program comes to a screeching halt, that file will get closed. Guaranteed.
<blockquote class=note>
<p><span class=u>&#x261E;</span>In technical terms, the <code>with</code> statement creates a <dfn>runtime context</dfn>. In these examples, the file object acts as a <dfn>context manager</dfn>. Python creates the file object <var>a_file</var> and tells it that it is entering a runtime context. When the <code>with</code> code block is completed, Python tells the file object that it is exiting the runtime context, and the file object calls its own <code>close()</code> method. See <a href=special-method-names.html#context-managers>Appendix B, &#8220;Context Managers&#8221;</a> for details.
</blockquote>
<p>There&#8217;s nothing file-specific about the <code>with</code> statement; it&#8217;s just a generic framework for creating runtime contexts and telling objects that they&#8217;re entering and exiting a runtime context. If the object in question is a file object, then it does useful file-like things (like closing the file automatically). But that behavior is defined in the file object, not in the <code>with</code> statement. There are lots of other ways to use context managers that have nothing to do with files. You can even create your own, as you&#8217;ll see later in this chapter.
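<p>For comparison, here is a rough sketch of the older <code>try..finally</code> pattern next to the <code>with</code> statement. It writes a scratch file in a temporary directory so it can run anywhere; the file name, variable names, and the simulated error are all made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical scratch file so the sketch is self-contained.
    path = os.path.join(tmp_dir, 'sample.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write('sample text\n')

    # The older pattern: close the file explicitly in a finally clause.
    old_style = open(path, encoding='utf-8')
    try:
        first = old_style.read()
    finally:
        old_style.close()

    # The with statement closes the file even when the block raises.
    try:
        with open(path, encoding='utf-8') as new_style:
            raise RuntimeError('simulated bug')
    except RuntimeError:
        pass

print(old_style.closed, new_style.closed)
```

<p>Both file objects report that they are closed, even though the second block never finished normally.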
<h3 id=for>Reading Data One Line At A Time</h3>
<p>A &#8220;line&#8221; of a text file is just what you think it is&nbsp;&mdash;&nbsp;you type a few words and press <kbd>ENTER</kbd>, and now you&#8217;re on a new line. A line of text is a sequence of characters delimited by&hellip; what exactly? Well, it&#8217;s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line.
<p>Now breathe a sigh of relief, because <em>Python handles line endings automatically</em> by default. If you say, &#8220;I want to read this text file one line at a time,&#8221; Python will figure out which kind of line ending the text file uses and it will all Just Work.
<blockquote class=note>
<p><span class=u>&#x261E;</span>If you need fine-grained control over what&#8217;s considered a line ending, you can pass the optional <code>newline</code> parameter to the <code>open()</code> function. See <a href=http://docs.python.org/3.1/library/io.html#module-interface>the <code>open()</code> function documentation</a> for all the gory details.
</blockquote>
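<p>To make the effect of the <code>newline</code> parameter concrete, this sketch writes a small file whose three lines each use a different line-ending convention, then reads it back twice: once with the default setting and once with <code>newline=''</code>. The file name and contents are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mixed.txt')

    # Write raw bytes so each line ends with a different convention.
    with open(path, mode='wb') as f:
        f.write(b'one\r\ntwo\rthree\n')

    # Default (newline=None): every line ending is translated to '\n'.
    with open(path, encoding='utf-8') as f:
        translated = list(f)

    # newline='': lines are still split, but endings are left untouched.
    with open(path, encoding='utf-8', newline='') as f:
        untranslated = list(f)

print(translated)
print(untranslated)
```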
<p>So, how do you actually do it? Read a file one line at a time, that is. It&#8217;s so simple, it&#8217;s beautiful.
<p class=d>[<a href=examples/oneline.py>download <code>oneline.py</code></a>]
<pre><code class=pp>line_number = 0
<a>with open('examples/favorite-people.txt', encoding='utf-8') as a_file: <span class=u>&#x2460;</span></a>
<a>    for a_line in a_file: <span class=u>&#x2461;</span></a>
        line_number += 1
<a>        print('{} {}'.format(line_number, a_line.rstrip())) <span class=u>&#x2462;</span></a></code></pre>
<ol>
<li>Using <a href=#with>the <code>with</code> pattern</a>, you safely open the file and let Python close it for you.
<li>To read a file one line at a time, use a <code>for</code> loop. That&#8217;s it. Besides having explicit methods like <code>read()</code>, <em>the file object is also an <a href=iterators.html>iterator</a></em> which spits out a single line every time you ask for a value.
<li>Using <a href=strings.html#formatting-strings>the <code>format()</code> string method</a>, you can print out the line number and the line itself. (The <var>a_line</var> variable contains the complete line, line ending and all. The <code>rstrip()</code> string method removes the trailing whitespace, including the line-ending characters.)
</ol>
<pre class=screen>
<samp class=p>you@localhost:~/diveintopython3$ </samp><kbd class=pp>python3 examples/oneline.py</kbd>
<samp>1 Dora
2 Ethan
3 Wesley
4 John
5 Anne
6 Mike
7 Chris
8 Sarah
9 Alex
10 Lizzie</samp></pre>
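<p>As a possible variation, the manual counter can be replaced with the built-in <code>enumerate()</code> function, which pairs each line with a running number. This sketch is an alternative, not the code in <code>oneline.py</code>; it writes its own short list of names to a temporary file so that it runs without the example files.

```python
import os
import tempfile

names = ['Dora', 'Ethan', 'Wesley']

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for examples/favorite-people.txt.
    path = os.path.join(tmp_dir, 'people.txt')
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('\n'.join(names) + '\n')

    numbered = []
    with open(path, encoding='utf-8') as a_file:
        # enumerate() yields (number, line) pairs, counting from 1 here.
        for line_number, a_line in enumerate(a_file, start=1):
            numbered.append('{} {}'.format(line_number, a_line.rstrip()))

print('\n'.join(numbered))
```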
<p class=a>&#x2042;
<h2 id=writing>Writing to Text Files</h2>
<p>You can write to files in much the same way that you read from them. First you open a file and get a file object, then you use methods on the file object to write data to the file, then you close the file.
<p>To open a file for writing, use the <code>open()</code> function and specify the write mode. There are two file modes for writing:
<ul>
<li>&#8220;Write&#8221; mode will overwrite the file. Pass <code>mode='w'</code> to the <code>open()</code> function.
<li>&#8220;Append&#8221; mode will add data to the end of the file. Pass <code>mode='a'</code> to the <code>open()</code> function.
</ul>
<p>Either mode will create the file automatically if it doesn&#8217;t already exist, so there&#8217;s never a need for any sort of fiddly &#8220;if the file doesn&#8217;t exist yet, create a new empty file just so you can open it for the first time&#8221; function. Just open a file and start writing.
<p>You should always close a file as soon as you&#8217;re done writing to it, to release the file handle and ensure that the data is actually written to disk. As with reading data from a file, you can call the file object&#8217;s <code>close()</code> method, or you can use the <code>with</code> statement and let Python close the file for you. I bet you can guess which technique I recommend.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>with open('test.log', mode='w', encoding='utf-8') as a_file:</kbd> <span class=u>&#x2460;</span></a>
<a><samp class=p>... </samp><kbd class=pp>    a_file.write('test succeeded')</kbd> <span class=u>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd class=pp>with open('test.log', encoding='utf-8') as a_file:</kbd>
<samp class=p>... </samp><kbd class=pp>    print(a_file.read())</kbd>
<samp class=pp>test succeeded</samp>
<a><samp class=p>>>> </samp><kbd class=pp>with open('test.log', mode='a', encoding='utf-8') as a_file:</kbd> <span class=u>&#x2462;</span></a>
<samp class=p>... </samp><kbd class=pp>    a_file.write('and again')</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('test.log', encoding='utf-8') as a_file:</kbd>
<samp class=p>... </samp><kbd class=pp>    print(a_file.read())</kbd>
<a><samp class=pp>test succeededand again</samp> <span class=u>&#x2463;</span></a></pre>
<ol>
<li>You start boldly by creating the new file <code>test.log</code> (or overwriting the existing file), and opening the file for writing. The <code>mode='w'</code> parameter means open the file for writing. Yes, that&#8217;s all as dangerous as it sounds. I hope you didn&#8217;t care about the previous contents of that file (if any), because that data is gone now.
<li>You can add data to the newly opened file with the <code>write()</code> method of the file object returned by the <code>open()</code> function. After the <code>with</code> block ends, Python automatically closes the file.
<li>That was so fun, let&#8217;s do it again. But this time, with <code>mode='a'</code> to append to the file instead of overwriting it. Appending will <em>never</em> harm the existing contents of the file.
<li>Both the original line you wrote and the second line you appended are now in the file <code>test.log</code>. Also note that line breaks are not included. Since you didn&#8217;t write them explicitly to the file either time, the file doesn&#8217;t include them. You can write a line break with the <code>'\n'</code> (newline) character. Since you didn&#8217;t do this, everything you wrote to the file ended up on one line.
</ol>
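<p>If you do want each message on its own line, write the <code>'\n'</code> yourself. Here is a small sketch of the same write-then-append sequence with explicit newlines; it uses a temporary directory (an assumption for the sake of a self-contained example) instead of writing <code>test.log</code> into your current directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.log')

    # Write mode creates the file (or overwrites an existing one).
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('test succeeded\n')

    # Append mode adds to the end without touching what is already there.
    with open(path, mode='a', encoding='utf-8') as a_file:
        a_file.write('and again\n')

    with open(path, encoding='utf-8') as a_file:
        lines = a_file.read().splitlines()

print(lines)
```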
<h3 id=encoding-again>Character Encoding Again</h3>
<p>Did you notice the <code>encoding</code> parameter that got passed in to the <code>open()</code> function while you were <a href=#writing>opening a file for writing</a>? It&#8217;s important; don&#8217;t ever leave it out! As you saw in the beginning of this chapter, files don&#8217;t contain <i>strings</i>, they contain <i>bytes</i>. Reading a &#8220;string&#8221; from a text file only works because you told Python what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the same problem in reverse. You can&#8217;t write characters to a file; <a href=strings.html#byte-arrays>characters are an abstraction</a>. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way to be sure it&#8217;s performing the correct conversion is to specify the <code>encoding</code> parameter when you open the file for writing.
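<p>To see why the encoding matters on the way out, this sketch writes the same one-character string to two files with two different encodings, then reads each file back as raw bytes (binary mode is covered in the next section). The file names are invented, and UTF-16 is chosen only as a contrasting example.

```python
import os
import tempfile

text = '是'

with tempfile.TemporaryDirectory() as tmp_dir:
    utf8_path = os.path.join(tmp_dir, 'utf8.txt')
    utf16_path = os.path.join(tmp_dir, 'utf16.txt')

    # Same string, two different encodings.
    with open(utf8_path, mode='w', encoding='utf-8') as a_file:
        a_file.write(text)
    with open(utf16_path, mode='w', encoding='utf-16-le') as a_file:
        a_file.write(text)

    # Read the raw bytes back to see what actually landed on disk.
    with open(utf8_path, mode='rb') as a_file:
        utf8_bytes = a_file.read()
    with open(utf16_path, mode='rb') as a_file:
        utf16_bytes = a_file.read()

print(utf8_bytes, utf16_bytes)
```

<p>One character, two completely different byte sequences. A program that reads either file has to know which encoding was used to write it.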
<p class=a>&#x2042;
<h2 id=binary>Binary Files</h2>
<p class=ss><img src=examples/beauregard.jpg alt='my dog Beauregard' width=100 height=100>
<p>Not all files contain text. Some of them contain pictures of my dog.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>an_image = open('examples/beauregard.jpg', mode='rb')</kbd> <span class=u>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.mode</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>'rb'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.name</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>'examples/beauregard.jpg'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.encoding</kbd> <span class=u>&#x2463;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in &lt;module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'</samp></pre>
<ol>
<li>Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the <code>mode</code> parameter contains a <code>'b'</code>.
<li>The file object you get from opening a file in binary mode has many of the same attributes, including <code>mode</code>, which reflects the <code>mode</code> parameter you passed into the <code>open()</code> function.
<li>File objects for binary files also have a <code>name</code> attribute, just like file objects for text files.
<li>Here&#8217;s one difference, though: the file object for a binary file has no <code>encoding</code> attribute. That makes sense, right? You&#8217;re reading (or writing) bytes, not strings, so there&#8217;s no conversion for Python to do. What you get out of a binary file is exactly what you put into it, no conversion necessary.
</ol>
<p>Did I mention you&#8217;re reading bytes? Oh yes you are.
<pre class=screen>
# continued from the previous example
<samp class=p>>>> </samp><kbd class=pp>an_image.tell()</kbd>
<samp class=pp>0</samp>
<a><samp class=p>>>> </samp><kbd class=pp>data = an_image.read(3)</kbd> <span class=u>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd class=pp>data</kbd>
<samp class=pp>b'\xff\xd8\xff'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>type(data)</kbd> <span class=u>&#x2461;</span></a>
<samp class=pp>&lt;class 'bytes'></samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.tell()</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>3</samp>
<samp class=p>>>> </samp><kbd class=pp>an_image.seek(0)</kbd>
<samp class=pp>0</samp>
<samp class=p>>>> </samp><kbd class=pp>data = an_image.read()</kbd>
<samp class=p>>>> </samp><kbd class=pp>len(data)</kbd>
<samp class=pp>3150</samp></pre>
<ol>
<li>Like text files, you can read binary files a little bit at a time. But there&#8217;s a crucial difference&hellip;
<li>&hellip;you&#8217;re reading bytes, not strings. Since you opened the file in binary mode, the <code>read()</code> method takes <em>the number of bytes to read</em>, not the number of characters.
<li>That means that there&#8217;s never <a href=#read>an unexpected mismatch</a> between the number you passed into the <code>read()</code> method and the position index you get out of the <code>tell()</code> method. The <code>read()</code> method reads bytes, and the <code>seek()</code> and <code>tell()</code> methods track the number of bytes read. For binary files, they&#8217;ll always agree.
</ol>
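<p>If you don&#8217;t have a picture of my dog handy, you can try the same steps on a few bytes of your own. This sketch writes a made-up six-byte payload to a temporary file and reads it back in binary mode; the payload merely begins with the same three bytes as the image above and is not a valid image.

```python
import os
import tempfile

# Made-up payload; only the first three bytes match the example above.
payload = bytes([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'payload.bin')
    with open(path, mode='wb') as a_file:
        a_file.write(payload)

    with open(path, mode='rb') as a_file:
        head = a_file.read(3)      # three bytes, not three characters
        position = a_file.tell()   # agrees with the number of bytes read
        rest = a_file.read()       # everything that is left

print(type(head), head, position, len(rest))
```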
<p class=a>&#x2042;
<h2 id=file-like-objects>File-like Objects</h2>
<p>One of Python&#8217;s greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the <dfn>file-like object</dfn>.
<p>Your functions which require an input source could simply take a filename as a string, go open the file for reading, read it, and close it when they&#8217;re done. But they shouldn&#8217;t. Instead, they should take a <em>file-like object</em>.
<p>In the simplest case, a <em>file-like object</em> is any object with a <code>read()</code> method with an optional <var>size</var> parameter, which returns a string. When called with no <var>size</var> parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a <var>size</var> parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data.
<p>You know, like a real file object. The difference is that you&#8217;re not limiting yourself to real files. The input source that&#8217;s being &#8220;read&#8221; could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a file-like object and simply call the object&#8217;s <code>read()</code> method, you can handle any input source that acts like a file, without specific code to handle each kind of input.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_string = 'PapayaWhip is the new black.'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>import io</kbd> <span class=u>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_file = io.StringIO(a_string)</kbd> <span class=u>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>'PapayaWhip is the new black.'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>&#x2463;</span></a>
<samp class=pp>''</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>&#x2464;</span></a>
<samp class=pp>0</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(10)</kbd> <span class=u>&#x2465;</span></a>
<samp class=pp>'PapayaWhip'</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd>
<samp class=pp>10</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.seek(18)</kbd>
<samp class=pp>18</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd>
<samp class=pp>'new black.'</samp></pre>
<ol>
<li>The <code>io</code> module contains the definition of the <code>StringIO</code> class that you can use to treat a string in memory as a file.
<li>To create a file-like object out of a string, create an instance of the <code>io.StringIO()</code> class and pass it the string you want to use as your &#8220;file&#8221; data. Now you have a file-like object, and you can do all sorts of file-like things with it.
<li>Calling the <code>read()</code> method &#8220;reads&#8221; the entire &#8220;file,&#8221; which in the case of a <code>StringIO</code> object simply returns the original string.
<li>Just like a real file, calling the <code>read()</code> method again returns an empty string.
<li>You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the <code>seek()</code> method of the <code>StringIO</code> object.
<li>You can also read the string in chunks, by passing a <var>size</var> parameter to the <code>read()</code> method.
</ol>
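<p>Here is one way the advice above might look in practice. The two functions below are hypothetical helpers, not part of any library: each takes a file-like object and relies only on its <code>read()</code> method, so they would work equally well with a real file or with a <code>StringIO</code> object.

```python
import io

def count_words(source):
    """Count whitespace-separated words from any object with read()."""
    return len(source.read().split())

def read_in_chunks(source, size):
    """Collect chunks of at most `size` characters until read() is empty."""
    chunks = []
    while True:
        chunk = source.read(size)
        if not chunk:  # an empty string means end of input
            break
        chunks.append(chunk)
    return chunks

a_string = 'PapayaWhip is the new black.'
word_count = count_words(io.StringIO(a_string))
chunks = read_in_chunks(io.StringIO(a_string), 10)

print(word_count)
print(chunks)
```

<p>Neither function knows or cares where the data comes from, which also makes them easy to test without creating files on disk.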
<h3 id=gzip>Handling Compressed Files</h3>
<p>The Python standard library contains modules that support reading and writing compressed files. There are a number of different compression schemes; the most popular for single files are <a href=http://docs.python.org/3.1/library/gzip.html>gzip</a> and <a href=http://docs.python.org/3.1/library/bz2.html>bzip2</a>. (You may have also encountered <a href=http://docs.python.org/3.1/library/zipfile.html>PKZIP archives</a> and <a href=http://docs.python.org/3.1/library/tarfile.html>GNU Tar archives</a>. Python has modules for those, too.)
<p>The <code>gzip</code> module lets you create a file-like object for reading or writing a gzip-compressed file. The file-like object it gives you supports the <code>read()</code> method (if you opened it for reading) or the <code>write()</code> method (if you opened it for writing). That means you can use the methods you&#8217;ve already learned for regular files to <em>directly read or write a gzip-compressed file</em>, without creating a temporary file to store the decompressed data.
<p>As an added bonus, it supports the <code>with</code> statement too, so you can let Python automatically close your gzip-compressed file when you&#8217;re done with it.
<pre class='nd screen'>
<samp class=p>you@localhost:~$ </samp><kbd>python3</kbd>
<samp class=p>>>> </samp><kbd class=pp>import gzip</kbd>
<samp class=p>>>> </samp><kbd class=pp>with gzip.open('out.log.gz', mode='wb') as z_file:</kbd>
<samp class=p>... </samp><kbd class=pp>    z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd class=pp>exit()</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>ls -l out.log.gz</kbd>
<samp>-rw-r--r-- 1 mark mark 79 2009-07-19 14:29 out.log.gz</samp>
<samp class=p>you@localhost:~$ </samp><kbd>gunzip out.log.gz</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>cat out.log</kbd>
<samp>A nine mile walk is no joke, especially in the rain.</samp></pre>
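<p>The session above only writes. Reading a gzip-compressed file back from Python might look like the following sketch, which assumes a temporary directory (so it cleans up after itself) rather than the current directory used above.

```python
import gzip
import os
import tempfile

text = 'A nine mile walk is no joke, especially in the rain.'

# Work in a temporary directory so nothing is left behind afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'out.log.gz')

    # Binary mode: encode the string to bytes before writing.
    with gzip.open(path, mode='wb') as z_file:
        z_file.write(text.encode('utf-8'))

    # Reading decompresses on the fly and returns bytes, so decode them.
    with gzip.open(path, mode='rb') as z_file:
        round_trip = z_file.read().decode('utf-8')

print(round_trip == text)
```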
<p class=a>&#x2042;
<h2 id=stdio>Standard Input, Output, and Error</h2>
<p>Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.
<p>Standard output and standard error (commonly abbreviated <code>stdout</code> and <code>stderr</code>) are pipes that are built into every <abbr>UNIX</abbr>-like system, including Mac OS X and Linux. When you call the <code>print()</code> function, the thing you&#8217;re printing is sent to the <code>stdout</code> pipe. When your program crashes and prints out a traceback, it goes to the <code>stderr</code> pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. (In the graphical Python Shell, the <code>stdout</code> and <code>stderr</code> pipes default to your &#8220;Interactive Window&#8221;.)
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    print('PapayaWhip')</kbd> <span class=u>&#x2460;</span></a>
<samp>PapayaWhip
PapayaWhip
PapayaWhip</samp>
<samp class=p>>>> </samp><kbd class=pp>import sys</kbd>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    l = sys.stdout.write('is the')</kbd> <span class=u>&#x2461;</span></a>
<samp>is theis theis the</samp>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    l = sys.stderr.write('new black')</kbd> <span class=u>&#x2462;</span></a>
<samp>new blacknew blacknew black</samp></pre>
<ol>
<li>The <code>print()</code> function, in a loop. Nothing surprising here.
<li><code>stdout</code> is defined in the <code>sys</code> module, and it is a <a href=#file-like-objects>file-like object</a>. Calling its <code>write()</code> method will print out whatever string you give it. In fact, this is what the <code>print()</code> function really does; it adds a newline to the end of the string you&#8217;re printing, and calls <code>sys.stdout.write</code>. (The <code>write()</code> method returns the number of characters written; assigning that value to a variable keeps the interactive shell from echoing it.)
<li>In the simplest case, <code>sys.stdout</code> and <code>sys.stderr</code> send their output to the same place: the Python <abbr>IDE</abbr> (if you&#8217;re in one), or the terminal (if you&#8217;re running Python from the command line). Like standard output, standard error does not add newlines for you. If you want newlines, you&#8217;ll need to write newline characters yourself.
</ol>
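<p>To see the difference in line endings between <code>print()</code> and <code>write()</code> without relying on what a terminal happens to display, here is a small sketch that sends both to in-memory <code>StringIO</code> objects and compares the results. The variable names are made up for this illustration.

```python
import io
import sys

# Two in-memory "files" stand in for the terminal so the results
# can be compared directly.
via_print = io.StringIO()
via_write = io.StringIO()

for i in range(3):
    print('PapayaWhip', file=via_print)    # print() appends '\n' by default
    via_write.write('PapayaWhip' + '\n')   # write() adds nothing on its own

print(via_print.getvalue() == via_write.getvalue())

# The same file= parameter sends a message to standard error instead.
print('something went wrong', file=sys.stderr)
```

<p>The <code>file</code> parameter of <code>print()</code> accepts any object with a <code>write()</code> method, so it is often a convenient way to write a line to <code>sys.stderr</code>.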
<p><code>sys.stdout</code> and <code>sys.stderr</code> are file-like objects, but they are write-only. Attempting to call their <code>read()</code> method will always raise an <code>IOError</code>.
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>import sys</kbd>
<samp class=p>>>> </samp><kbd class=pp>sys.stdout.read()</kbd>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
IOError: not readable</samp></pre>
<h3 id=redirect>Redirecting Standard Output</h3>
<p>So <code>sys.stdout</code> and <code>sys.stderr</code> are file-like objects, albeit ones that only support writing. But they&#8217;re not constants; they&#8217;re variables. That means you can assign them a new value&nbsp;&mdash;&nbsp;another file object, or another file-like object&nbsp;&mdash;&nbsp;and redirect their output.
<p class=d>[<a href=examples/stdout.py>download <code>stdout.py</code></a>]
<pre><code class=pp>import sys

class RedirectStdoutTo:
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        sys.stdout = self.out_old

print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')</code></pre>
<p>Check this out:
<pre class='nd screen'>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>python3 stdout.py</kbd>
<samp>A
C</samp>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>cat out.log</kbd>
<samp>B</samp></pre>
<p>Let&#8217;s take the last part first.
<pre><code class=pp>
<a>print('A') <span class=u>&#x2460;</span></a>
<a>with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file): <span class=u>&#x2461;</span></a>
<a>    print('B') <span class=u>&#x2462;</span></a>
<a>print('C') <span class=u>&#x2463;</span></a></code></pre>
<ol>
<li>This will print to the <abbr>IDE</abbr> &#8220;Interactive Window&#8221; (or the terminal, if running the script from the command line).
<li>This is <a href=#with>a <code>with</code> statement</a>, which you&#8217;ve seen before. But unlike all previous examples, this one doesn&#8217;t stop at <code>as a_file</code>. Instead, there&#8217;s a comma and another function call. The <code>with</code> statement can actually take <em>a comma-separated list of contexts</em>. The first is a context you&#8217;ve seen several times already: it opens a file, assigns the file object to <var>a_file</var>, and closes the file automatically when the context ends. The second context is a custom-built context that redirects <code>sys.stdout</code> to the file object that was created in the first context.
<li>Because this call to the <code>print()</code> function is executed within the contexts created by the <code>with</code> statement, it will not print to the screen; it will write to the file <code>out.log</code>.
<li>The <code>with</code> code block is over. Python has told each context manager to do whatever it is they do upon exiting a context. The first context closed the file; the second context changed <code>sys.stdout</code> back to its original value. That means that this call to the <code>print()</code> function will once again print to the screen.
</ol>
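<p>The order in which a comma-separated list of contexts is entered and exited can be made visible with a small tracing sketch. The <code>Trace</code> class and the <var>events</var> list are hypothetical names used only for this illustration.

```python
events = []

class Trace:
    '''Hypothetical context manager that records when it is entered and exited.'''
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        events.append('enter ' + self.name)
        return self

    def __exit__(self, *args):
        events.append('exit ' + self.name)

# Contexts are entered left to right and exited in the reverse order,
# just as if the second with statement were nested inside the first.
with Trace('first'), Trace('second'):
    events.append('body')

print(events)
```

<p>In the <code>stdout.py</code> script, this ordering means <code>sys.stdout</code> is restored before the file is closed, so nothing is ever printed to a closed file.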
<p>Now take a look at the <code>RedirectStdoutTo</code> class. It is a custom context manager. Upon entering the context, it redirects <code>sys.stdout</code> to a given file-like object. Upon exiting the context, it restores <code>sys.stdout</code> to its original value.
<pre><code class=pp>class RedirectStdoutTo:
<a>    def __init__(self, out_new): <span class=u>&#x2460;</span></a>
        self.out_new = out_new

<a>    def __enter__(self): <span class=u>&#x2461;</span></a>
        self.out_old = sys.stdout
        sys.stdout = self.out_new

<a>    def __exit__(self, *args): <span class=u>&#x2462;</span></a>
        sys.stdout = self.out_old</code></pre>
<ol>
<li>The <code>__init__()</code> method is called immediately after an instance is created. It takes one parameter, the file-like object that you want to use as standard output for the life of the context. This method just saves the file-like object in an instance variable so other methods can use it later.
<li>The <code>__enter__()</code> method is a <a href=iterators.html#a-fibonacci-iterator>special class method</a>; Python calls it when entering a context (<i>i.e.</i> at the beginning of the <code>with</code> statement). This method saves the current value of <code>sys.stdout</code> in <var>self.out_old</var>, then redirects standard output by assigning <var>self.out_new</var> to <var>sys.stdout</var>.
<li>The <code>__exit__()</code> method is another special class method; Python calls it when exiting the context (<i>i.e.</i> at the end of the <code>with</code> statement). This method restores standard output to its original value by assigning the saved <var>self.out_old</var> value to <var>sys.stdout</var>.
</ol>
<p>Redirecting standard error works exactly the same way, using <code>sys.stderr</code> instead of <code>sys.stdout</code>.
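<p>A minimal sketch of that idea follows. The <code>RedirectStderrTo</code> class is a hypothetical counterpart to <code>RedirectStdoutTo</code>, written for this illustration, and it captures the output in a <code>StringIO</code> object instead of a file on disk.

```python
import io
import sys

class RedirectStderrTo:
    '''Hypothetical counterpart to RedirectStdoutTo, for sys.stderr.'''
    def __init__(self, err_new):
        self.err_new = err_new

    def __enter__(self):
        # Remember the current stream, then swap in the replacement.
        self.err_old = sys.stderr
        sys.stderr = self.err_new

    def __exit__(self, *args):
        # Put the original stream back, even if the block raised.
        sys.stderr = self.err_old

original_stderr = sys.stderr
captured = io.StringIO()
with RedirectStderrTo(captured):
    print('B', file=sys.stderr)   # goes into the StringIO object
restored = sys.stderr is original_stderr

print(captured.getvalue() == 'B\n', restored)
```

<p>Later Python releases also ship <code>contextlib.redirect_stdout()</code> and <code>contextlib.redirect_stderr()</code> in the standard library, which do much the same job without a hand-written class.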
<p class=a>&#x2042;
<h2 id=furtherreading>Further Reading</h2>
<ul>
<li><a href=http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files>Reading and writing files</a> in the Python.org tutorial
<li><a href=http://docs.python.org/3.1/library/io.html><code>io</code> module</a>
<li><a href=http://docs.python.org/3.1/library/stdtypes.html#file-objects>File objects</a>
<li><a href=http://docs.python.org/3.1/library/stdtypes.html#context-manager-types>Context manager types</a>
<li><a href=http://docs.python.org/3.1/library/sys.html#sys.stdout><code>sys.stdout</code> and <code>sys.stderr</code></a>
<li><a href=http://en.wikipedia.org/wiki/Filesystem_in_Userspace><abbr>FUSE</abbr> on Wikipedia</a>
</ul>
<p class=v><a href=refactoring.html rel=prev title='back to &#8220;Refactoring&#8221;'><span class=u>&#x261C;</span></a> <a href=xml.html rel=next title='onward to &#8220;XML&#8221;'><span class=u>&#x261E;</span></a>
<p class=c>&copy; 2001&ndash;9 <a href=about.html>Mark Pilgrim</a>
<script src=j/jquery.js></script>
<script src=j/prettify.js></script>
<script src=j/dip3.js></script>