mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 15:00:18 +00:00
521 lines
42 KiB
HTML
<!DOCTYPE html>
|
|
<head>
|
|
<meta charset=utf-8>
|
|
<title>Files - Dive into Python 3</title>
|
|
<!--[if IE]><script src=j/html5.js></script><![endif]-->
|
|
<link rel=stylesheet href=dip3.css>
|
|
<style>
|
|
body{counter-reset:h1 12}
|
|
</style>
|
|
<link rel=stylesheet type=text/css media='only screen and (max-device-width: 480px)' href=mobile.css>
|
|
<link rel=stylesheet media=print href=print.css>
|
|
<meta name=viewport content='initial-scale=1.0'>
|
|
</head>
|
|
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=25> <input type=submit name=sa value=Search></div></form>
|
|
<p>You are here: <a href=index.html>Home</a> <span class=u>‣</span> <a href=table-of-contents.html#files>Dive Into Python 3</a> <span class=u>‣</span>
|
|
<p id=level>Difficulty level: <span class=u title=intermediate>♦♦♦♢♢</span>
|
|
<h1>Files</h1>
|
|
<blockquote class=q>
|
|
<p><span class=u>❝</span> FIXME <span class=u>❞</span><br>— FIXME
|
|
</blockquote>
|
|
<p id=toc>
|
|
<h2 id=divingin>Diving In</h2>
|
|
<p class=f>FIXME
|
|
|
|
<h2 id=reading>Reading From Text Files</h2>
|
|
|
|
<p>Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:
|
|
|
|
<pre class=nd><code class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</code></pre>
|
|
|
|
<p>Python has a built-in <code>open()</code> function, which takes a filename as an argument. Here the filename is <code class=pp>'examples/chinese.txt'</code>. There are five interesting things about this filename:
|
|
|
|
<ol>
|
|
<li>It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the <code>open()</code> function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
|
|
<li>The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
|
|
<li>The directory path does not begin with a slash or a drive letter, so it is called a <i>relative path</i>. Relative to what, you might ask? Patience, grasshopper.
|
|
<li>It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-<abbr>ASCII</abbr> pathnames.
|
|
<li>It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of <a href=http://en.wikipedia.org/wiki/Filesystem_in_Userspace>an entirely virtual filesystem</a>. If your computer considers it a file and can access it as a file, Python can open it.
|
|
</ol>
|
|
|
|
<p>But that call to the <code>open()</code> function didn’t stop at the filename. There’s another argument, called <code>encoding</code>. Oh dear, <a href=strings.html#boring-stuff>that sounds dreadfully familiar</a>.
|
|
|
|
<h3 id=encoding>Character Encoding Rears Its Ugly Head</h3>
|
|
|
|
<p>Bytes are bytes; <a href=strings.html#byte-arrays>characters are an abstraction</a>. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
|
|
|
|
<pre>
# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
<samp class=p>>>> </samp><kbd class=pp>file = open('examples/chinese.txt')</kbd>
<samp class=p>>>> </samp><kbd class=pp>a_string = file.read()</kbd>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to &lt;undefined></samp>
<samp class=p>>>> </samp></pre>
|
|
|
|
<p>What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in <code>cp1252.py</code>, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly <code>UnicodeDecodeError</code>.
|
|
|
|
<p>But wait, it’s worse than that! The default encoding is <em>platform-dependent</em>, so this code <em>might</em> work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
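<p>Here is a minimal sketch of the same failure, done entirely in memory so it depends on neither <code>examples/chinese.txt</code> nor your platform’s default encoding. The one-character sample string and the variable names are assumptions made for illustration; the only calls used are <code>str.encode()</code> and <code>bytes.decode()</code>.

```python
# One Chinese character from the sample sentence; in UTF-8 it takes three bytes,
# and the last of them is 0x8f, the byte named in the traceback above.
a_string = '经'
raw_bytes = a_string.encode('utf-8')
print(raw_bytes)                       # the bytes a UTF-8 file would hold on disk

# Decoding with the encoding that produced the bytes round-trips cleanly.
print(raw_bytes.decode('utf-8') == a_string)

# Decoding the same bytes as CP-1252 fails, because byte 0x8f has no
# character assigned in that encoding.
try:
    raw_bytes.decode('cp1252')
    decode_failed = False
except UnicodeDecodeError as e:
    decode_failed = True
    print('decode failed:', e.reason)
```

<p>The bytes never change; only the rule used to interpret them does. That is why the same file can read fine on one machine and blow up on another.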
|
|
|
|
<blockquote class=note>
|
|
<p><span class=u>☞</span>If you need to get the default character encoding, import the <code>locale</code> module and call <code>locale.getpreferredencoding()</code>. On my Windows laptop, it returns <code>'cp1252'</code>, but on my Linux box upstairs, it returns <code>'UTF8'</code>. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file.
|
|
|
|
</blockquote>
|
|
|
|
<h3 id=file-objects>File Objects</h3>
|
|
|
|
<p>So far, all we know is that Python has a built-in function called <code>open()</code>. The <code>open()</code> function returns a <i>file object</i>, which has methods and attributes for getting information about and manipulating the file.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</kbd>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.name</kbd> <span class=u>①</span></a>
|
|
<samp class=pp>'examples/chinese.txt'</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.encoding</kbd> <span class=u>②</span></a>
|
|
<samp class=pp>'utf-8'</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.mode</kbd> <span class=u>③</span></a>
|
|
<samp class=pp>'r'</samp></pre>
|
|
<ol>
|
|
<li>The <code>name</code> attribute reflects the name you passed in to the <code>open()</code> function when you opened the file. It is not normalized to an absolute pathname.
|
|
<li>Likewise, the <code>encoding</code> attribute reflects the encoding you passed in to the <code>open()</code> function. If you didn’t specify the encoding when you opened the file (bad developer!) then the <code>encoding</code> attribute will reflect <code>locale.getpreferredencoding()</code>.
|
|
<li>The <code>mode</code> attribute tells you in which mode the file was opened. You can pass an optional <var>mode</var> parameter to the <code>open()</code> function. You didn’t specify a mode when you opened this file, so Python defaults to <code>'r'</code>, which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).
|
|
</ol>
|
|
|
|
<blockquote class=note>
|
|
<p><span class=u>☞</span>The <a href=http://docs.python.org/3.1/library/io.html#module-interface>documentation for the <code>open()</code> function</a> lists all the possible file modes.
|
|
</blockquote>
|
|
|
|
<h3 id=read>Reading Data From A Text File</h3>
|
|
|
|
<p>After you open a file for reading, you’ll probably want to read from it at some point.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</kbd>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>①</span></a>
|
|
<samp class=pp>'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>②</span></a>
|
|
<samp class=pp>''</samp></pre>
|
|
<ol>
|
|
<li>Once you open a file (with the correct encoding), reading from it is just a matter of calling the file object’s <code>read()</code> method. The result is a string.
|
|
<li>Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider reading past end-of-file to be an error; it simply returns an empty string.
|
|
</ol>
|
|
|
|
<p>What if you want to re-read a file?
|
|
|
|
<pre class=screen>
|
|
# continued from the previous example
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>①</span></a>
|
|
<samp class=pp>''</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>②</span></a>
|
|
<samp class=pp>0</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(16)</kbd> <span class=u>③</span></a>
|
|
<samp class=pp>'Dive Into Python'</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>④</span></a>
|
|
<samp class=pp>' '</samp>
|
|
<samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd>
|
|
<samp class=pp>'是'</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>⑤</span></a>
|
|
<samp class=pp>20</samp></pre>
|
|
<ol>
|
|
<li>Since you’re still at the end of the file, further calls to the file object’s <code>read()</code> method simply return an empty string.
|
|
<li>The <code>seek()</code> method moves to a specific byte position in a file.
|
|
<li>The <code>read()</code> method can take an optional parameter, the number of characters to read.
|
|
<li>If you like, you can even read one character at a time.
|
|
<li>16 + 1 + 1 = … 20?
|
|
</ol>
|
|
|
|
<p>Let’s try that again.
|
|
|
|
<pre class=screen>
|
|
# continued from the previous example
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(17)</kbd> <span class=u>①</span></a>
|
|
<samp class=pp>17</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>②</span></a>
|
|
<samp class=pp>'是'</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>③</span></a>
|
|
<samp class=pp>20</samp></pre>
|
|
<ol>
|
|
<li>Move to the 17<sup>th</sup> byte.
|
|
<li>Read one character.
|
|
<li>Now you’re on the 20<sup>th</sup> byte.
|
|
</ol>
|
|
|
|
<p>Do you see it yet? The <code>seek()</code> and <code>tell()</code> methods always count <em>bytes</em>, but since you opened this file as text, the <code>read()</code> method counts <em>characters</em>. Chinese characters <a href=strings.html#boring-stuff>require multiple bytes to encode in UTF-8</a>. The English characters in the file only require one byte each, so you might be misled into thinking that the <code>seek()</code> and <code>read()</code> methods are counting the same thing. But that’s only true for some characters.
|
|
|
|
<p>But wait, it gets worse!
|
|
|
|
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(18)</kbd> <span class=u>①</span></a>
<samp class=pp>18</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>②</span></a>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;pyshell#12>", line 1, in &lt;module>
    a_file.read(1)
  File "C:\Python31\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte</samp></pre>
|
|
<ol>
|
|
<li>Move to the 18<sup>th</sup> byte and try to read one character.
|
|
<li>Why does this fail? Because there isn’t a character at the 18<sup>th</sup> byte. The nearest character starts at the 17<sup>th</sup> byte (and goes for three bytes). Trying to read a character from the middle will fail with a <code>UnicodeDecodeError</code>.
|
|
</ol>
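<p>The byte-versus-character arithmetic can be checked without a file at all. The sketch below is an illustration under assumptions: it uses a shortened copy of the sample sentence, and it computes byte offsets by encoding string prefixes rather than by calling <code>seek()</code> and <code>tell()</code> on a real file object.

```python
text = 'Dive Into Python 是为'
data = text.encode('utf-8')

# Byte offset at which each character starts, plus the end-of-data offset.
offsets = [len(text[:i].encode('utf-8')) for i in range(len(text) + 1)]
print(len(text), len(data))    # 19 characters, 23 bytes
print(offsets[16:])            # [16, 17, 20, 23]

# Offset 17 is a character boundary, so decoding from there works.
print(data[17:].decode('utf-8'))

# Offset 18 lands inside the three-byte sequence for '是', so decoding fails.
try:
    data[18:].decode('utf-8')
    mid_character_ok = True
except UnicodeDecodeError:
    mid_character_ok = False
```

<p>The offsets jump from 17 to 20 to 23 because each Chinese character occupies three bytes. Only those boundary values are safe starting points for decoding.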
|
|
|
|
<h3 id=close>Closing Files</h3>
|
|
|
|
<p>Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them.
|
|
|
|
<pre class='nd screen'>
|
|
# continued from the previous example
|
|
<samp class=p>>>> </samp><kbd class=pp>a_file.close()</kbd></pre>
|
|
|
|
<p>Well <em>that</em> was anticlimactic.
|
|
|
|
<p>The file object <var>a_file</var> still exists; calling its <code>close()</code> method doesn’t destroy the object itself. But it’s not terribly useful.
|
|
|
|
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>①</span></a>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;pyshell#24>", line 1, in &lt;module>
    a_file.read()
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>②</span></a>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;pyshell#25>", line 1, in &lt;module>
    a_file.seek(0)
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>③</span></a>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;pyshell#26>", line 1, in &lt;module>
    a_file.tell()
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.close()</kbd> <span class=u>④</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.closed</kbd> <span class=u>⑤</span></a>
<samp class=pp>True</samp></pre>
|
|
<ol>
|
|
<li>You can’t read from a closed file; that raises a <code>ValueError</code> exception.
|
|
<li>You can’t seek in a closed file either.
|
|
<li>There’s no current position in a closed file, so the <code>tell()</code> method also fails.
|
|
<li>Perhaps surprisingly, calling the <code>close()</code> method on a file object whose file has been closed does <em>not</em> raise an exception. It’s just a no-op.
|
|
<li>Closed file objects do have one useful attribute: the <code>closed</code> attribute will confirm that the file is closed.
|
|
</ol>
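<p>If you would rather handle these errors in a script than read tracebacks, something like the following sketch works. It assumes a throwaway file created in a temporary directory (the name <code>scratch.txt</code> is made up for this example) and catches the <code>ValueError</code> shown in the tracebacks above.

```python
import os
import tempfile

# Create a small scratch file so the example does not need examples/chinese.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scratch.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write('PapayaWhip')

    a_file = open(path, encoding='utf-8')
    a_file.close()

    # Every one of these operations is refused once the file is closed.
    errors = []
    for operation in (a_file.read, lambda: a_file.seek(0), a_file.tell):
        try:
            operation()
        except ValueError as e:
            errors.append(str(e))

    a_file.close()           # closing a second time is a harmless no-op
    print(errors)
    print(a_file.closed)
```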
|
|
|
|
<h3 id=with>Closing Files Automatically</h3>
|
|
|
|
<p>File objects have an explicit <code>close()</code> method, but what happens if your code has a bug and crashes before you call <code>close()</code>? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is.
|
|
|
|
<p>Python 2 had a solution for this: the <code>try..finally</code> block. That still works in Python 3, and you may see it in other people’s code or in older code that was <a href=case-study-porting-chardet-to-python-3.html>ported to Python 3</a>. But Python 2.6 introduced a cleaner solution, which is now the preferred one in Python 3: the <code>with</code> statement.
|
|
|
|
<pre class=nd><code class=pp>with open('examples/chinese.txt', encoding='utf-8') as a_file:
    a_file.seek(17)
    a_character = a_file.read(1)
    print(a_character)</code></pre>
|
|
|
|
<p>This code calls <code>open()</code>, but it never calls <code>a_file.close()</code>. The <code>with</code> statement starts a code block, like an <code>if</code> statement or a <code>for</code> loop. Inside this code block, you can use the variable <var>a_file</var> as the file object returned from the call to <code>open()</code>. All the regular file object methods are available — <code>seek()</code>, <code>read()</code>, whatever you need. When the <code>with</code> block ends, <em>Python calls <code>a_file.close()</code> automatically</em>.
|
|
|
|
<p>Here’s the kicker: no matter how or when you exit the <code>with</code> block, Python will close that file… even if you “exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire program comes to a screeching halt, that file will get closed. Guaranteed.
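<p>To make the comparison concrete, here is a sketch of both patterns side by side. The scratch file, its name, and the deliberately raised <code>RuntimeError</code> are all assumptions invented for this example; the point is only that both file objects end up closed.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scratch.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write('PapayaWhip')

    # The older pattern: close() goes in a finally clause so it always runs.
    old_style = open(path, encoding='utf-8')
    try:
        first = old_style.read(6)
    finally:
        old_style.close()

    # The with statement does the same bookkeeping, even when the block raises.
    try:
        with open(path, encoding='utf-8') as new_style:
            new_style.read(6)
            raise RuntimeError('simulated bug inside the with block')
    except RuntimeError:
        pass

    print(first, old_style.closed, new_style.closed)
```

<p>Both approaches close the file. The <code>with</code> version simply leaves less room to forget the <code>finally</code> clause.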
|
|
|
|
<blockquote class=note>
|
|
<p><span class=u>☞</span>In technical terms, the <code>with</code> statement creates a <dfn>runtime context</dfn>. In these examples, the file object acts as a <dfn>context manager</dfn>. Python creates the file object <var>a_file</var> and tells it that it is entering a runtime context. When the <code>with</code> code block is completed, Python tells the file object that it is exiting the runtime context, and the file object calls its own <code>close()</code> method. See <a href=special-method-names.html#context-managers>Appendix B, “Context Managers”</a> for details.
|
|
</blockquote>
|
|
|
|
<p>There’s nothing file-specific about the <code>with</code> statement; it’s just a generic framework for creating runtime contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a file object, then it does useful file-like things (like closing the file automatically). But that behavior is defined in the file object, not in the <code>with</code> statement. There are lots of other ways to use context managers that have nothing to do with files. You can even create your own, as you’ll see later in this chapter.
|
|
|
|
<h3 id=for>Reading Data One Line At A Time</h3>
|
|
|
|
<p>A “line” of a text file is just what you think it is — you type a few words and press <kbd>ENTER</kbd>, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line.
|
|
|
|
<p>Now breathe a sigh of relief, because <em>Python handles line endings automatically</em> by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and it will all Just Work.
|
|
|
|
<blockquote class=note>
|
|
<p><span class=u>☞</span>If you need fine-grained control over what’s considered a line ending, you can pass the optional <code>newline</code> parameter to the <code>open()</code> function. See <a href=http://docs.python.org/3.1/library/io.html#module-interface>the <code>open()</code> function documentation</a> for all the gory details.
|
|
</blockquote>
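<p>As a rough illustration of what the <code>newline</code> parameter changes, the sketch below writes a file containing all three line-ending conventions and reads it back twice. The file name and its contents are assumptions made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mixed-endings.txt')
    # Write raw bytes so the three line-ending conventions land on disk unchanged.
    with open(path, mode='wb') as f:
        f.write(b'old mac\rwindows\r\nunix\n')

    # Default (newline=None): every line ending is translated to '\n'.
    with open(path, encoding='utf-8') as f:
        translated = list(f)

    # newline='': lines are still split on any ending, but nothing is translated.
    with open(path, encoding='utf-8', newline='') as f:
        untranslated = list(f)

print(translated)
print(untranslated)
```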
|
|
|
|
<p>So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful.
|
|
|
|
<p class=d>[<a href=examples/oneline.py>download <code>oneline.py</code></a>]
|
|
<pre><code class=pp>line_number = 0
<a>with open('examples/favorite-people.txt', encoding='utf-8') as a_file: <span class=u>①</span></a>
<a>    for a_line in a_file: <span class=u>②</span></a>
        line_number += 1
<a>        print('{} {}'.format(line_number, a_line.rstrip())) <span class=u>③</span></a></code></pre>
|
|
<ol>
|
|
<li>Using <a href=#with>the <code>with</code> pattern</a>, you safely open the file and let Python close it for you.
|
|
<li>To read a file one line at a time, use a <code>for</code> loop. That’s it. Besides having explicit methods like <code>read()</code>, <em>the file object is also an <a href=iterators.html>iterator</a></em> which spits out a single line every time you ask for a value.
|
|
<li>Using <a href=strings.html#formatting-strings>the <code>format()</code> string method</a>, you can print out the line number and the line itself. (The <var>a_line</var> variable contains the complete line, line ending and all. The <code>rstrip()</code> string method removes the trailing whitespace, including the line ending.)
|
|
</ol>
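<p>One possible variation, not part of <code>oneline.py</code>: the built-in <code>enumerate()</code> function can keep the counter for you. In this sketch an <code>io.StringIO</code> object holding three made-up names stands in for <code>favorite-people.txt</code>, so it runs without the examples directory.

```python
import io

# io.StringIO stands in for favorite-people.txt so the sketch runs anywhere.
a_file = io.StringIO('Dora\nEthan\nWesley\n')

numbered = []
# enumerate() pairs each line with a counter that starts at 1.
for line_number, a_line in enumerate(a_file, start=1):
    numbered.append('{} {}'.format(line_number, a_line.rstrip()))

print('\n'.join(numbered))
```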
|
|
|
|
<pre class=screen>
|
|
<samp class=p>you@localhost:~/diveintopython3$ </samp><kbd class=pp>python3 examples/oneline.py</kbd>
|
|
<samp>1 Dora
|
|
2 Ethan
|
|
3 Wesley
|
|
4 John
|
|
5 Anne
|
|
6 Mike
|
|
7 Chris
|
|
8 Sarah
|
|
9 Alex
|
|
10 Lizzie</samp></pre>
|
|
|
|
<h2 id=writing>Writing to Text Files</h2>
|
|
|
|
<p>You can write to files in much the same way that you read from them. First you open a file and get a file object, then you use methods on the file object to write data to the file, then you close the file.
|
|
|
|
<p>To open a file for writing, use the <code>open()</code> function and specify the write mode. There are two file modes for writing:
|
|
|
|
<ul>
|
|
<li>“Write” mode will overwrite the file. Pass <code>mode='w'</code> to the <code>open()</code> function.
|
|
<li>“Append” mode will add data to the end of the file. Pass <code>mode='a'</code> to the <code>open()</code> function.
|
|
</ul>
|
|
|
|
<p>Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing.
|
|
|
|
<p>You should always close a file as soon as you’re done writing to it, to release the file handle and ensure that the data is actually written to disk. As with reading data from a file, you can call the file object’s <code>close()</code> method, or you can use the <code>with</code> statement and let Python close the file for you. I bet you can guess which technique I recommend.
|
|
|
|
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>with open('test.log', mode='w', encoding='utf-8') as a_file:</kbd> <span class=u>①</span></a>
<a><samp class=p>... </samp><kbd class=pp>    a_file.write('test succeeded')</kbd> <span class=u>②</span></a>
<samp class=p>>>> </samp><kbd class=pp>with open('test.log', encoding='utf-8') as a_file:</kbd>
<samp class=p>... </samp><kbd class=pp>    print(a_file.read())</kbd>
<samp class=pp>test succeeded</samp>
<a><samp class=p>>>> </samp><kbd class=pp>with open('test.log', mode='a', encoding='utf-8') as a_file:</kbd> <span class=u>③</span></a>
<samp class=p>... </samp><kbd class=pp>    a_file.write('and again')</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('test.log', encoding='utf-8') as a_file:</kbd>
<samp class=p>... </samp><kbd class=pp>    print(a_file.read())</kbd>
<a><samp class=pp>test succeededand again</samp> <span class=u>④</span></a></pre>
|
|
<ol>
|
|
<li>You start boldly by creating the new file <code>test.log</code> (or overwriting the existing file), and opening the file for writing. The <code>mode='w'</code> parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I hope you didn’t care about the previous contents of that file (if any), because that data is gone now.
|
|
<li>You can add data to the newly opened file with the <code>write</code> method of the file object returned by the <code>open()</code> function. After the <code>with</code> block ends, Python automatically closes the file.
|
|
<li>That was so fun, let’s do it again. But this time, with <code>mode='a'</code> to append to the file instead of overwriting it. Appending will <em>never</em> harm the existing contents of the file.
|
|
<li>Both the original line you wrote and the second line you appended are now in the file <code>test.log</code>. Also note that line breaks are not included. Since you didn’t write them explicitly to the file either time, the file doesn’t include them. You can write a line break with the <code>'\n'</code> character. Since you didn’t do this, everything you wrote to the file ended up on one line.
|
|
</ol>
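<p>If you do want each write on its own line, add the <code>'\n'</code> yourself. The following sketch repeats the session above with explicit line breaks; it writes <code>test.log</code> into a temporary directory, which is an assumption made so the example cleans up after itself.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.log')

    # Write mode creates (or overwrites) the file.
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('test succeeded\n')    # '\n' ends the line explicitly

    # Append mode adds to the end without touching what is already there.
    with open(path, mode='a', encoding='utf-8') as a_file:
        a_file.write('and again\n')

    with open(path, encoding='utf-8') as a_file:
        lines = a_file.read().splitlines()

print(lines)
```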
|
|
|
|
<h3 id=encoding-again>Character Encoding Again</h3>
|
|
|
|
<p>Did you notice the <code>encoding</code> parameter that got passed in to the <code>open()</code> function while you were <a href=#writing>opening a file for writing</a>? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain <i>strings</i>, they contain <i>bytes</i>. Reading a “string” from a text file only works because you told Python what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the same problem in reverse. You can’t write characters to a file; <a href=strings.html#byte-arrays>characters are an abstraction</a>. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way to be sure it’s performing the correct conversion is to specify the <code>encoding</code> parameter when you open the file for writing.
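<p>To see that the encoding really does decide which bytes land on disk, consider this sketch. It writes the same one-character string with two different encodings and measures the resulting files; the choice of <code>'utf-16-le'</code> as the second encoding and the file names are assumptions picked for illustration.

```python
import os
import tempfile

a_string = '是'
sizes = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    for encoding in ('utf-8', 'utf-16-le'):
        path = os.path.join(tmp_dir, 'sample-' + encoding + '.txt')
        with open(path, mode='w', encoding=encoding) as a_file:
            a_file.write(a_string)
        # Read the file back in binary mode to count the bytes that were stored.
        with open(path, mode='rb') as a_file:
            sizes[encoding] = len(a_file.read())

print(sizes)
```

<p>One character, two different byte counts. Without the <code>encoding</code> parameter, which of these you get would depend on the machine running the code.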
|
|
|
|
<h3 id=write>Write A Little, Write A Lot</h3>
|
|
|
|
<p>FIXME write(), writelines(), .writeable
|
|
|
|
<h2 id=binary>Binary Files</h2>
|
|
|
|
<p>FIXME
|
|
|
|
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>an_image = open('examples/beauregard-100x100.jpg', mode='rb')</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.mode</kbd> <span class=u>②</span></a>
<samp class=pp>'rb'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.name</kbd> <span class=u>③</span></a>
<samp class=pp>'examples/beauregard-100x100.jpg'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.encoding</kbd> <span class=u>④</span></a>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'</samp></pre>
|
|
<ol>
|
|
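<li>Opening a file in binary mode looks almost the same as opening it in text mode. The only difference is that the <var>mode</var> parameter contains a <code>'b'</code> character.
<li>The <code>mode</code> attribute reflects the <var>mode</var> parameter you passed to the <code>open()</code> function.
<li>The <code>name</code> attribute reflects the filename you passed in, just as it does for a text file.
<li>Here is one difference, though: a file object opened in binary mode has no <code>encoding</code> attribute. You are reading bytes, not strings, so there is no conversion for Python to do.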
|
|
</ol>
|
|
|
|
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>an_image.tell()</kbd> <span class=u>①</span></a>
<samp class=pp>0</samp>
<a><samp class=p>>>> </samp><kbd class=pp>data = an_image.read(3)</kbd> <span class=u>②</span></a>
<samp class=p>>>> </samp><kbd class=pp>data</kbd>
<samp class=pp>b'\xff\xd8\xff'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>type(data)</kbd> <span class=u>③</span></a>
<samp class=pp>&lt;class 'bytes'></samp>
<samp class=p>>>> </samp><kbd class=pp>an_image.tell()</kbd>
<samp class=pp>3</samp>
<samp class=p>>>> </samp><kbd class=pp>an_image.seek(0)</kbd>
<samp class=pp>0</samp>
<samp class=p>>>> </samp><kbd class=pp>data = an_image.read()</kbd>
<samp class=p>>>> </samp><kbd class=pp>len(data)</kbd>
<samp class=pp>3150</samp></pre>
|
|
<ol>
|
|
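<li>Like a text file, a binary file has a current position, which the <code>tell()</code> method reports. You just opened the file, so you are at position 0.
<li>The <code>read()</code> method takes the number of <em>bytes</em> to read, not the number of characters.
<li>That means the <code>read()</code> method returns a <code>bytes</code> object, not a string. Since <code>seek()</code> and <code>tell()</code> count bytes too, there is no mismatch here: you read 3 bytes, and <code>tell()</code> reports position 3.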
|
|
</ol>
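<p>You can try the same steps without the sample image. The sketch below fabricates a 13-byte stand-in file (it is <em>not</em> a real JPEG; only its first three bytes match the ones shown above) and reads it back in binary mode. The file name and contents are assumptions made for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fake.jpg')
    # Not a real image: just the three-byte prefix shown above plus ten zero bytes.
    with open(path, mode='wb') as f:
        f.write(b'\xff\xd8\xff' + bytes(10))

    with open(path, mode='rb') as an_image:
        header = an_image.read(3)           # three bytes, not three characters
        position = an_image.tell()          # position is counted in bytes
        an_image.seek(0)
        everything = an_image.read()
        has_encoding = hasattr(an_image, 'encoding')

print(header, type(header).__name__, position, len(everything), has_encoding)
```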
|
|
|
|
<h2 id=file-like-objects>File-like Objects</h2>
|
|
|
|
<p>One of Python’s greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the <dfn>file-like object</dfn>.
|
|
|
|
<p>Your functions which require an input source could simply take a filename as a string, go open the file for reading, read it, and close it when they’re done. But they shouldn’t. Instead, they should take a <em>file-like object</em>.
|
|
|
|
<p>In the simplest case, a <em>file-like object</em> is any object with a <code>read()</code> method with an optional <var>size</var> parameter, which returns a string. When called with no <var>size</var> parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a <var>size</var> parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data.
|
|
|
|
<p>You know, like a real file object. The difference is that you’re not limiting yourself to real files. The input source that’s being “read” could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a file-like object and simply call the object’s <code>read()</code> method, you can handle any input source that acts like a file, without specific code to handle each kind of input.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd class=pp>a_string = 'PapayaWhip is the new black.'</kbd>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>import io</kbd> <span class=u>①</span></a>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file = io.StringIO(a_string)</kbd> <span class=u>②</span></a>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>③</span></a>
|
|
<samp class=pp>'PapayaWhip is the new black.'</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>④</span></a>
|
|
<samp class=pp>''</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>⑤</span></a>
|
|
<samp class=pp>0</samp>
|
|
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(10)</kbd> <span class=u>⑥</span></a>
|
|
<samp class=pp>'PapayaWhip'</samp>
|
|
<samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd>
|
|
<samp class=pp>10</samp>
|
|
<samp class=p>>>> </samp><kbd class=pp>a_file.seek(18)</kbd>
|
|
<samp class=pp>18</samp>
|
|
<samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd>
|
|
<samp class=pp>'new black.'</samp></pre>
|
|
<ol>
|
|
<li>The <code>io</code> module contains the definition of the <code>StringIO</code> class that you can use to treat a string in memory as a file.
|
|
<li>To create a file-like object out of a string, create an instance of the <code>io.StringIO()</code> class and pass it the string you want to use as your “file” data. Now you have a file-like object, and you can do all sorts of file-like things with it.
|
|
<li>Calling the <code>read()</code> method “reads” the entire “file,” which in the case of a <code>StringIO</code> object simply returns the original string.
|
|
<li>Just like a real file, calling the <code>read()</code> method again returns an empty string.
|
|
<li>You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the <code>seek()</code> method of the <code>StringIO</code> object.
|
|
<li>You can also read the string in chunks, by passing a <var>size</var> parameter to the <code>read()</code> method.
|
|
</ol>
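<p>Here is a sketch of why this matters for your own functions. The <code>count_words()</code> function is hypothetical, invented for this example; it only ever calls <code>read()</code>, so it neither knows nor cares whether its input is a <code>StringIO</code> object or a real file.

```python
import io
import os
import tempfile

def count_words(source):
    """Count whitespace-separated words from any object with a read() method."""
    return len(source.read().split())

# An in-memory string wrapped in io.StringIO...
from_memory = count_words(io.StringIO('PapayaWhip is the new black.'))

# ...and a real file on disk go through exactly the same code path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'papaya.txt')
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('PapayaWhip is the new black.')
    with open(path, encoding='utf-8') as a_file:
        from_disk = count_words(a_file)

print(from_memory, from_disk)
```

<p>Had <code>count_words()</code> taken a filename instead, the in-memory case would have required writing the string to a temporary file first.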
|
|
|
|
<h3 id=gzip>Handling Compressed Files</h3>
|
|
|
|
<p>The Python standard library contains modules that support reading and writing compressed files. There are a number of different compression schemes; the most popular for single files are <a href=http://docs.python.org/3.1/library/gzip.html>gzip</a> and <a href=http://docs.python.org/3.1/library/bz2.html>bzip2</a>. (You may have also encountered <a href=http://docs.python.org/3.1/library/zipfile.html>PKZIP archives</a> and <a href=http://docs.python.org/3.1/library/tarfile.html>GNU Tar archives</a>. Python has modules for those, too.)
|
|
|
|
<p>The <code>gzip</code> module lets you create a file-like object for reading or writing a gzip-compressed file. The file-like object it gives you supports the <code>read()</code> method (if you opened it for reading) or the <code>write()</code> method (if you opened it for writing). That means you can use the methods you’ve already learned for regular files to <em>directly read or write a gzip-compressed file</em>, without creating a temporary file to store the decompressed data.
|
|
|
|
<p>As an added bonus, it supports the <code>with</code> statement too, so you can let Python automatically close your gzip-compressed file when you’re done with it.
|
|
|
|
<pre class='nd screen'>
<samp class=p>you@localhost:~$ </samp><kbd>python3</kbd>

<samp class=p>>>> </samp><kbd class=pp>import gzip</kbd>
<samp class=p>>>> </samp><kbd class=pp>with gzip.open('out.log.gz', mode='wb') as z_file:</kbd>
<samp class=p>... </samp><kbd class=pp>    z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd class=pp>exit()</kbd>

<samp class=p>you@localhost:~$ </samp><kbd>ls -l out.log.gz</kbd>
<samp>-rw-r--r-- 1 mark mark 79 2009-07-19 14:29 out.log.gz</samp>
<samp class=p>you@localhost:~$ </samp><kbd>gunzip out.log.gz</kbd>
<samp class=p>you@localhost:~$ </samp><kbd>cat out.log</kbd>
<samp>A nine mile walk is no joke, especially in the rain.</samp></pre>
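
<p>The session above only writes a compressed file. As a minimal sketch of the other direction, the following reads the same sentence back without ever calling <code>gunzip</code>. The temporary directory and the variable names are assumptions made for this illustration; the text mode (<code>'rt'</code>) of <code>gzip.open()</code> requires Python 3.3 or later.

```python
import gzip
import os
import tempfile

# Hypothetical location: a throwaway directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'out.log.gz')

text = 'A nine mile walk is no joke, especially in the rain.'

# Write compressed bytes, as in the session above.
with gzip.open(path, mode='wb') as z_file:
    z_file.write(text.encode('utf-8'))

# Binary mode: read() returns the decompressed bytes, which you decode yourself.
with gzip.open(path, mode='rb') as z_file:
    decoded = z_file.read().decode('utf-8')

# Text mode: the file-like object decodes for you, just like open().
with gzip.open(path, mode='rt', encoding='utf-8') as z_file:
    first_line = z_file.readline()

print(decoded == text)
print(first_line)
```

<p>Either way, the decompressed data goes straight into memory; no temporary uncompressed file is created.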
<h2 id=stdio>Standard Input, Output, and Error</h2>

<p>Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.

<p>Standard output and standard error (commonly abbreviated <code>stdout</code> and <code>stderr</code>) are pipes that are built into every <abbr>UNIX</abbr>-like system, including Mac OS X and Linux. When you call the <code>print()</code> function, the thing you’re printing is sent to the <code>stdout</code> pipe. When your program crashes and prints out a traceback, it goes to the <code>stderr</code> pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. (In the graphical Python Shell, the <code>stdout</code> and <code>stderr</code> pipes default to your “Interactive Window”.)

<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    print('PapayaWhip')</kbd>         <span class=u>①</span></a>
<samp>PapayaWhip
PapayaWhip
PapayaWhip</samp>
<samp class=p>>>> </samp><kbd class=pp>import sys</kbd>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    sys.stdout.write('is the')</kbd>  <span class=u>②</span></a>
<samp>is theis theis the</samp>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    sys.stderr.write('new black')</kbd>  <span class=u>③</span></a>
<samp>new blacknew blacknew black</samp></pre>
<ol>
<li>The <code>print()</code> function, in a loop. Nothing surprising here.
<li><code>stdout</code> is defined in the <code>sys</code> module, and it is a <a href=#file-like-objects>file-like object</a>. Calling its <code>write()</code> method will print out whatever string you give it. In fact, this is what the <code>print()</code> function really does: it adds a newline to the end of the string you’re printing, and calls <code>sys.stdout.write()</code>.
<li>In the simplest case, <code>sys.stdout</code> and <code>sys.stderr</code> send their output to the same place: the Python <abbr>IDE</abbr> (if you’re in one), or the terminal (if you’re running Python from the command line). Like standard output, standard error does not add newlines for you. If you want newlines, you’ll need to write newline characters yourself.
</ol>
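
<p>To make the difference between <code>print()</code> and <code>write()</code> visible, here is a small sketch that sends both to an in-memory <code>io.StringIO</code> object instead of the terminal, so the exact characters can be inspected. The variable names are made up for this illustration; the <code>file</code> and <code>end</code> parameters are standard parameters of <code>print()</code>.

```python
import io
import sys

buffer = io.StringIO()

# print() writes to any file-like object you pass as file=,
# and appends a newline unless you override end=.
print('PapayaWhip', file=buffer)
print('is the', end=' ', file=buffer)

# write() adds nothing; the newline here is supplied by hand.
buffer.write('new black\n')

captured = buffer.getvalue()
print(repr(captured))

# The same parameter lets you send a message to standard error.
print('something went wrong', file=sys.stderr)
```

<p>The <code>repr()</code> of the captured text shows exactly one newline after <code>PapayaWhip</code> (added by <code>print()</code>) and one at the very end (written explicitly).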
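<p><code>sys.stdout</code> and <code>sys.stderr</code> are file-like objects, but they are write-only. Attempting to call their <code>read()</code> method will raise an <code>IOError</code>. (Current versions of Python report this error as <code>io.UnsupportedOperation</code>, which is a subclass of <code>OSError</code>; <code>IOError</code> is now an alias of <code>OSError</code>.)
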
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>import sys</kbd>
<samp class=p>>>> </samp><kbd class=pp>sys.stdout.read()</kbd>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
IOError: not readable</samp></pre>
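
<p>If your code might be handed a write-only stream, you can ask before reading, or catch the error. The sketch below uses an ordinary file opened in write mode as a stand-in for <code>sys.stdout</code>, since some environments replace <code>sys.stdout</code> with their own object; the file name is hypothetical.

```python
import io
import os
import tempfile

# Hypothetical file in a throwaway directory, opened for writing only.
path = os.path.join(tempfile.mkdtemp(), 'write_only.txt')

error = None
with open(path, mode='w', encoding='utf-8') as a_file:
    can_write = a_file.writable()
    can_read = a_file.readable()
    try:
        a_file.read()
    except io.UnsupportedOperation as e:
        # Keep a reference; the name e is cleared when the except block ends.
        error = e

print(can_write, can_read)
print(type(error).__name__, error)
print(isinstance(error, OSError))
```

<p>Because <code>io.UnsupportedOperation</code> inherits from <code>OSError</code>, code written as <code>except IOError:</code> still catches it.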
<h3 id=redirect>Redirecting Standard Output</h3>

<p>So <code>sys.stdout</code> and <code>sys.stderr</code> are file-like objects, albeit ones that only support writing. But they’re not constants; they’re variables. That means you can assign them a new value — another file object, or another file-like object — and redirect their output.

<p class=d>[<a href=examples/stdout.py>download <code>stdout.py</code></a>]
<pre><code class=pp>import sys

class RedirectStdoutTo:
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        sys.stdout = self.out_old

print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')</code></pre>

<p>Check this out:

<pre class='nd screen'>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>python3 stdout.py</kbd>
<samp>A
C</samp>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>cat out.log</kbd>
<samp>B</samp></pre>

<p>Let’s take the last part first.

<pre><code class=pp>
<a>print('A')  <span class=u>①</span></a>
<a>with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):  <span class=u>②</span></a>
<a>    print('B')  <span class=u>③</span></a>
<a>print('C')  <span class=u>④</span></a></code></pre>
<ol>
<li>This will print to the <abbr>IDE</abbr> “Interactive Window” (or the terminal, if running the script from the command line).
<li>This is <a href=#with>a <code>with</code> statement</a>, which you’ve seen before. But unlike all the previous examples, this one doesn’t stop at <code>as a_file</code>. Instead, there’s a comma and another expression, one that creates a <code>RedirectStdoutTo</code> object. The <code>with</code> statement can actually take <em>a comma-separated list of contexts</em> (in Python 3.1 and later), and the list behaves like a series of nested <code>with</code> blocks. The first is a context you’ve seen several times already: it opens a file, assigns the file object to <var>a_file</var>, and closes the file automatically when the context ends. The second context is a custom-built context that redirects <code>sys.stdout</code> to the file object that was created in the first context.
<li>Because this call to the <code>print()</code> function is executed within the contexts created by the <code>with</code> statement, it will not print to the screen; it will write to the file <code>out.log</code>.
<li>The <code>with</code> code block is over. Python has told each context manager to do whatever it does upon exiting a context, in the reverse of the order in which they were entered. The second context changed <code>sys.stdout</code> back to its original value; then the first context closed the file. That means that this call to the <code>print()</code> function will once again print to the screen.
</ol>
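
<p>The comma-separated form can be checked against the nested form with a small experiment. <code>Trace</code> is a hypothetical context manager invented for this sketch; all it does is record when it is entered and exited.

```python
class Trace:
    '''Hypothetical context manager that logs its own enter and exit.'''

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __enter__(self):
        self.log.append('enter ' + self.name)
        return self

    def __exit__(self, *args):
        self.log.append('exit ' + self.name)

# Two contexts in one with statement, separated by a comma.
log = []
with Trace('first', log), Trace('second', log):
    log.append('body')

# The same two contexts, written as nested with statements.
nested = []
with Trace('first', nested):
    with Trace('second', nested):
        nested.append('body')

print(log)
print(log == nested)
```

<p>Both versions record the same sequence: the contexts are entered left to right and exited right to left. In <code>stdout.py</code>, that ordering means <code>sys.stdout</code> is restored before the file it pointed to is closed.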
<p>Now take a look at the <code>RedirectStdoutTo</code> class. It is a custom context manager. Upon entering the context, it redirects <code>sys.stdout</code> to a given file-like object. Upon exiting the context, it restores <code>sys.stdout</code> to its original value.

<pre><code class=pp>class RedirectStdoutTo:
<a>    def __init__(self, out_new):    <span class=u>①</span></a>
        self.out_new = out_new

<a>    def __enter__(self):            <span class=u>②</span></a>
        self.out_old = sys.stdout
        sys.stdout = self.out_new

<a>    def __exit__(self, *args):      <span class=u>③</span></a>
        sys.stdout = self.out_old</code></pre>
<ol>
<li>The <code>__init__()</code> method is called immediately after an instance is created. It takes one parameter, the file-like object that you want to use as standard output for the life of the context. This method just saves the file-like object in an instance variable so other methods can use it later.
<li>The <code>__enter__()</code> method is a <a href=iterators.html#a-fibonacci-iterator>special class method</a>; Python calls it when entering a context (<i>i.e.</i> at the beginning of the <code>with</code> statement). This method saves the current value of <code>sys.stdout</code> in <var>self.out_old</var>, then redirects standard output by assigning <var>self.out_new</var> to <var>sys.stdout</var>.
<li>The <code>__exit__()</code> method is another special class method; Python calls it when exiting the context (<i>i.e.</i> at the end of the <code>with</code> statement). This method restores standard output to its original value by assigning the saved <var>self.out_old</var> value to <var>sys.stdout</var>.
</ol>
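
<p>One property worth checking is that <code>__exit__()</code> also runs when the <code>with</code> block ends because of an exception. The sketch below repeats the <code>RedirectStdoutTo</code> class so that it runs on its own, redirects into an in-memory <code>io.StringIO</code> object (an assumption made to keep the example free of real files), and raises an error on purpose.

```python
import io
import sys

class RedirectStdoutTo:
    # Same class as in stdout.py, repeated so this sketch is self-contained.
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        sys.stdout = self.out_old

original = sys.stdout
buffer = io.StringIO()

try:
    with RedirectStdoutTo(buffer):
        print('B')
        raise ValueError('deliberate failure inside the context')
except ValueError:
    pass

# __exit__() ran before the exception left the with block.
restored = sys.stdout is original
print(restored)
print(repr(buffer.getvalue()))
```

<p>Because <code>__exit__()</code> returns <code>None</code>, the exception is not swallowed; it continues to propagate after standard output has been restored.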
<p>Redirecting standard error works exactly the same way, using <code>sys.stderr</code> instead of <code>sys.stdout</code>.
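
<p>Newer versions of Python ship ready-made equivalents of this class in the <code>contextlib</code> module: <code>redirect_stdout()</code> (Python 3.4 and later) and <code>redirect_stderr()</code> (Python 3.5 and later). They postdate the Python 3.1 documentation this chapter links to, so treat the following as a sketch for current interpreters; the variable names and the message text are made up.

```python
import contextlib
import io
import sys

out = io.StringIO()
err = io.StringIO()

# Two standard-library context managers in one with statement.
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    print('B')
    # sys.stderr is looked up when this line runs, so it is already redirected.
    print('warning: low on papaya', file=sys.stderr)

print(repr(out.getvalue()))
print(repr(err.getvalue()))
```

<p>Writing the class by hand, as above, is still the best way to see what these helpers do internally.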
<h2 id=furtherreading>Further Reading</h2>

<ul>
<li><a href=http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files>Reading and writing files</a> in the Python.org tutorial
<li><a href=http://docs.python.org/3.1/library/io.html><code>io</code> module</a>
<li><a href=http://docs.python.org/3.1/library/sys.html#sys.stdout><code>sys.stdout</code> and <code>sys.stderr</code></a>
<li><a href=http://en.wikipedia.org/wiki/Filesystem_in_Userspace><abbr>FUSE</abbr> on Wikipedia</a>
</ul>

<p class=v><a href=advanced-classes.html rel=prev title='back to “Advanced Classes”'><span class=u>☜</span></a> <a href=xml.html rel=next title='onward to “XML”'><span class=u>☞</span></a>

<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
<script src=j/jquery.js></script>
<script src=j/prettify.js></script>
<script src=j/dip3.js></script>