more HTTP chapter

Mark Pilgrim
2009-05-31 22:12:37 -07:00
parent 421c370591
commit 49b763e3f5
+37 -587
@@ -253,599 +253,49 @@ Content-Type: application/xml</samp>
<p>But wait, it gets worse! To see just how inefficient this code is, let&#8217;s request the same feed a second time.
<pre class=screen>
FIXME
</pre>
# continued from the <a href=#whats-on-the-wire>previous example</a>
<samp class=p>>>> </samp><kbd>response2 = urlopen('http://diveintopython3.org/examples/feed.xml')</kbd>
<samp>send: b'GET /examples/feed.xml HTTP/1.1
Host: diveintopython3.org
Accept-Encoding: identity
User-Agent: Python-urllib/3.0
Connection: close'
reply: 'HTTP/1.1 200 OK'
&hellip;further debugging information omitted&hellip;</samp></pre>
<!--
<p class=a>&#x2042;
<p>Notice anything peculiar about this request? It hasn&#8217;t changed! It&#8217;s exactly the same as the first request. No sign of <a href=#last-modified><code>If-Modified-Since</code> headers</a>. No sign of <a href=#etags><code>If-None-Match</code> headers</a>. No respect for the caching headers. Still no compression.
<h2 id="oa.useragent">11.5. Setting the <code>User-Agent</code></h2>
<p>The first step to improving your <abbr>HTTP</abbr> web services client is to identify yourself properly with a <code>User-Agent</code>. To do that, you need to move beyond the basic <code>urllib</code> and dive into <code>urllib2</code>.
<div class=example><h3>Example 11.4. Introducing <code>urllib2</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import httplib</kbd>
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>import urllib2</kbd>
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>feeddata = opener.open(request).read()</kbd> <span>&#x2463;</span>
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 14 Apr 2004 23:23:12 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Content-Type: application/atom+xml
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT
header: ETag: "e8284-68e0-4de30f80"
header: Accept-Ranges: bytes
header: Content-Length: 26848
header: Connection: close
</pre>
<p>And what happens when you do the same thing twice? You get the same response. Twice.
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd>print(response2.headers.as_string())</kbd> <span>&#x2460;</span></a>
<samp>Date: Mon, 01 Jun 2009 03:58:00 GMT
Server: Apache
Last-Modified: Sun, 31 May 2009 22:51:11 GMT
ETag: "bfe-255ef5c0"
Accept-Ranges: bytes
Content-Length: 3070
Cache-Control: max-age=86400
Expires: Tue, 02 Jun 2009 03:58:00 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml</samp>
<samp class=p>>>> </samp><kbd>data2 = response2.read()</kbd>
<a><samp class=p>>>> </samp><kbd>len(data2)</kbd> <span>&#x2461;</span></a>
<samp>3070</samp>
<a><samp class=p>>>> </samp><kbd>data2 == data</kbd> <span>&#x2462;</span></a>
<samp>True</samp></pre>
<ol>
<li>If you still have your Python <abbr>IDE</abbr> open from the previous section&#8217;s example, you can skip this, but this turns on <a href="#oa.debug" title="11.4. Debugging HTTP web services"><abbr>HTTP</abbr> debugging</a> so you can see what you&#8217;re actually sending over the wire, and what gets sent back.
<li>Fetching an <abbr>HTTP</abbr> resource with <code>urllib2</code> is a three-step process, for good reasons that will become clear shortly. The first step is to create a <code>Request</code> object, which takes the <abbr>URL</abbr> of the resource you&#8217;ll eventually get around to retrieving. Note that this step doesn&#8217;t actually
retrieve anything yet.
<li>The second step is to build a <abbr>URL</abbr> opener. This can take any number of handlers, which control how responses are handled.
But you can also build an opener without any custom handlers, which is what you&#8217;re doing here. You&#8217;ll see how to define
and use custom handlers later in this chapter when you explore redirects.
<li>The final step is to tell the opener to open the <abbr>URL</abbr>, using the <code>Request</code> object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the
resource and stores the returned data in <var>feeddata</var>.
<div class=example><h3>Example 11.5. Adding headers with the <code>Request</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>request</kbd> <span>&#x2460;</span>
&lt;urllib2.Request instance at 0x00250AA8>
<samp class=p>>>> </samp><kbd>request.get_full_url()</kbd>
'http://diveintomark.org/xml/atom.xml'
<samp class=p>>>> </samp><kbd>request.add_header('User-Agent',</kbd>
<samp class=p>... </samp><kbd>'OpenAnything/1.0 +http://diveintopython3.org/')</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>feeddata = opener.open(request).read()</kbd> <span>&#x2462;</span>
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: OpenAnything/1.0 +http://diveintopython3.org/ <span>&#x2463;</span>
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 14 Apr 2004 23:45:17 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Content-Type: application/atom+xml
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT
header: ETag: "e8284-68e0-4de30f80"
header: Accept-Ranges: bytes
header: Content-Length: 26848
header: Connection: close
</pre>
<ol>
<li>You&#8217;re continuing from the previous example; you&#8217;ve already created a <code>Request</code> object with the <abbr>URL</abbr> you want to access.
<li>Using the <code>add_header</code> method on the <code>Request</code> object, you can add arbitrary <abbr>HTTP</abbr> headers to the request. The first argument is the header, the second is the value you&#8217;re
providing for that header. Convention dictates that a <code>User-Agent</code> should be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form,
and you&#8217;ll see a lot of variations in the wild, but somewhere it should include a <abbr>URL</abbr> of your application. The <code>User-Agent</code> is usually logged by the server along with other details of your request, and including a <abbr>URL</abbr> of your application allows
server administrators looking through their access logs to contact you if something is wrong.
<li>The <var>opener</var> object you created before can be reused too, and it will retrieve the same feed again, but with your custom <code>User-Agent</code> header.
<li>And here&#8217;s you sending your custom <code>User-Agent</code>, in place of the generic one that Python sends by default. If you look closely, you&#8217;ll notice that you defined a <code>User-Agent</code> header, but you actually sent a <code>User-agent</code> header. See the difference? <code>urllib2</code> changed the case so that only the first letter was capitalized. It doesn&#8217;t really matter; <abbr>HTTP</abbr> specifies that header field
names are completely case-insensitive.
<li>The server is still sending the same array of &#8220;smart&#8221; headers: <code>Cache-Control</code> and <code>Expires</code> to allow caching, <code>Last-Modified</code> and <code>ETag</code> to enable &#8220;not-modified&#8221; tracking. Even the <code>Vary: Accept-Encoding</code> header hints that the server would support compression, if only you would bloody well ask for it. But you&#8217;re not listening.
<li>Once again, fetching this data downloads the whole 3070 bytes&hellip;
<li>&hellip;the exact same 3070 bytes you downloaded last time.
</ol>
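<p>The <code>User-Agent</code> convention described above can be sketched without touching the network. This is a minimal sketch, assuming Python 3, where <code>urllib.request</code> takes the place of <code>urllib2</code>; the helper name <code>build_request</code> is hypothetical, and the application name and <abbr>URL</abbr> are the illustrative ones used in this chapter.

```python
import urllib.request

# Illustrative application name, version, and contact URL (format: name/version +URL).
USER_AGENT = "OpenAnything/1.0 +http://diveintopython3.org/"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies the client. Nothing is fetched yet."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


request = build_request("http://diveintopython3.org/examples/feed.xml")

# add_header() stores the header name with only its first letter capitalized,
# so the header is looked up as 'User-agent'.
print(request.header_items())
```

<p>Because the header name is stored as <code>User-agent</code>, that is the spelling to pass to <code>request.get_header()</code> when you want to read it back.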
<p><abbr>HTTP</abbr> is designed to work better than this. <code>urllib</code> speaks <abbr>HTTP</abbr> like I speak Spanish &mdash; enough to get by in a jam, but not enough to hold a conversation. <abbr>HTTP</abbr> is a conversation. It&#8217;s time to upgrade to a library that speaks <abbr>HTTP</abbr> fluently.
<p class=a>&#x2042;
<h2 id="oa.etags">11.6. Handling <code>Last-Modified</code> and <code>ETag</code></h2>
<p>Now that you know how to add custom <abbr>HTTP</abbr> headers to your web service requests, let&#8217;s look at adding support for <code>Last-Modified</code> and <code>ETag</code> headers.
<p>These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can
turn it off by setting <code>httplib.HTTPConnection.debuglevel = 0</code>. Or you can just leave debugging on, if that helps you.
<div class=example><h3 id="oa.etags.example.1">Example 11.6. Testing <code>Last-Modified</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import urllib2</kbd>
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd>
<samp class=p>>>> </samp><kbd>firstdatastream = opener.open(request)</kbd>
<samp class=p>>>> </samp><kbd>firstdatastream.headers.dict</kbd> <span>&#x2460;</span>
<samp>{'date': 'Thu, 15 Apr 2004 20:42:41 GMT',
'server': 'Apache/2.0.49 (Debian GNU/Linux)',
'content-type': 'application/atom+xml',
'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT',
'etag': '"e842a-3e53-55d97640"',
'content-length': '15955',
'accept-ranges': 'bytes',
'connection': 'close'}</samp>
<samp class=p>>>> </samp><kbd>request.add_header('If-Modified-Since',</kbd>
<samp class=p>... </samp><kbd>firstdatastream.headers.get('Last-Modified'))</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>seconddatastream = opener.open(request)</kbd> <span>&#x2462;</span>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
File "c:\python23\lib\urllib2.py", line 326, in open
'_open', req)
File "c:\python23\lib\urllib2.py", line 306, in _call_chain
result = func(*args)
File "c:\python23\lib\urllib2.py", line 901, in http_open
return self.do_open(httplib.HTTP, req)
File "c:\python23\lib\urllib2.py", line 895, in do_open
return self.parent.error('http', req, fp, code, msg, hdrs)
File "c:\python23\lib\urllib2.py", line 352, in error
return self._call_chain(*args)
File "c:\python23\lib\urllib2.py", line 306, in _call_chain
result = func(*args)
File "c:\python23\lib\urllib2.py", line 412, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 304: Not Modified</samp>
</pre>
<ol>
<li>Remember all those <abbr>HTTP</abbr> headers you saw printed out when you turned on debugging? This is how you can get access to them
programmatically: <var>firstdatastream.headers</var> is <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">an object that acts like a dictionary</a> and allows you to get any of the individual headers returned from the <abbr>HTTP</abbr> server.
<li>On the second request, you add the <code>If-Modified-Since</code> header with the last-modified date from the first request. If the data hasn&#8217;t changed, the server should return a <code>304</code> status code.
<li>Sure enough, the data hasn&#8217;t changed. You can see from the traceback that <code>urllib2</code> throws a special exception, <code>HTTPError</code>, in response to the <code>304</code> status code. This is a little unusual, and not entirely helpful. After all, it&#8217;s not an error; you specifically asked the
server not to send you any data if it hadn&#8217;t changed, and the data didn&#8217;t change, so the server told you it wasn&#8217;t sending
you any data. That&#8217;s not an error; that&#8217;s exactly what you were hoping for.
<p><code>urllib2</code> also raises an <code>HTTPError</code> exception for conditions that you would think of as errors, such as <code>404</code> (page not found). In fact, it will raise <code>HTTPError</code> for <em>any</em> status code other than <code>200</code> (OK), <code>301</code> (permanent redirect), or <code>302</code> (temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without
throwing an exception. To do that, you&#8217;ll need to define a custom <abbr>URL</abbr> handler.
<div class=example><h3>Example 11.7. Defining URL handlers</h3>
<p>This custom <abbr>URL</abbr> handler is part of <code>openanything.py</code>.
<pre><code>
import urllib2

class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    <span>&#x2460;</span>
    def http_error_default(self, req, fp, code, msg, headers): <span>&#x2461;</span>
        result = urllib2.HTTPError(
            req.get_full_url(), code, msg, headers, fp)
        result.status = code                                   <span>&#x2462;</span>
        return result
</code></pre>
<ol>
<li><code>urllib2</code> is designed around <abbr>URL</abbr> handlers. Each handler is just a class that can define any number of methods. When something happens
&mdash; like an <abbr>HTTP</abbr> error, or even a <code>304</code> code &mdash; <code>urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code>urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
<li><code>urllib2</code> searches through the defined handlers and calls the <code>http_error_default</code> method when it encounters a <code>304</code> status code from the server. By defining a custom error handler, you can prevent <code>urllib2</code> from raising an exception. Instead, you create the <code>HTTPError</code> object, but return it instead of raising it.
<li>This is the key part: before returning, you save the status code returned by the <abbr>HTTP</abbr> server. This will allow you easy access
to it from the calling program.
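<p>For comparison, here is a minimal sketch of the same idea under Python 3, assuming <code>urllib.request</code> and <code>urllib.error</code> stand in for <code>urllib2</code>. The class name <code>ReturnErrorHandler</code> is hypothetical and is not part of <code>openanything.py</code>. The sketch does not assign a <code>status</code> attribute, on the assumption that recent Python 3 releases already expose the status as <code>.code</code> and may treat <code>.status</code> as read-only. The handler is exercised directly with a made-up <code>304</code> reply, so no network access is needed.

```python
import email.message
import io
import urllib.error
import urllib.request


class ReturnErrorHandler(urllib.request.HTTPDefaultErrorHandler):
    """Return HTTPError objects to the caller instead of raising them."""

    def http_error_default(self, req, fp, code, msg, hdrs):
        # HTTPError is both an exception and a response-like object,
        # so it can be handed back with the status in its .code attribute.
        return urllib.error.HTTPError(req.full_url, code, msg, hdrs, fp)


# Call the handler by hand with a fabricated 304 reply (illustration only).
handler = ReturnErrorHandler()
request = urllib.request.Request("http://diveintomark.org/xml/atom.xml")
result = handler.http_error_default(
    request, io.BytesIO(b""), 304, "Not Modified", email.message.Message())

print(result.code)    # 304
print(result.read())  # b''
```

<p>In real use you would pass an instance to <code>urllib.request.build_opener()</code>, the Python 3 counterpart of the opener-building step shown in this chapter.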
<div class=example><h3>Example 11.8. Using custom URL handlers</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>request.headers</kbd> <span>&#x2460;</span>
{'If-modified-since': 'Thu, 15 Apr 2004 19:45:21 GMT'}
<samp class=p>>>> </samp><kbd>import openanything</kbd>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener(</kbd>
<samp class=p>... </samp><kbd>openanything.DefaultErrorHandler())</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>seconddatastream = opener.open(request)</kbd>
<samp class=p>>>> </samp><kbd>seconddatastream.status</kbd> <span>&#x2462;</span>
304
<samp class=p>>>> </samp><kbd>seconddatastream.read()</kbd> <span>&#x2463;</span>
''
</pre>
<ol>
<li>You&#8217;re continuing the previous example, so the <code>Request</code> object is already set up, and you&#8217;ve already added the <code>If-Modified-Since</code> header.
<li>This is the key: now that you&#8217;ve defined your custom <abbr>URL</abbr> handler, you need to tell <code>urllib2</code> to use it. Remember how I said that <code>urllib2</code> broke up the process of accessing an <abbr>HTTP</abbr> resource into three steps, and for good reason? This is why building the <abbr>URL</abbr> opener
is its own step, because you can build it with your own custom <abbr>URL</abbr> handlers that override <code>urllib2</code>&#8217;s default behavior.
<li>Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use <var>seconddatastream.headers.dict</var> to access them), also contains the <abbr>HTTP</abbr> status code. In this case, as you expected, the status is <code>304</code>, meaning this data hasn&#8217;t changed since the last time you asked for it.
<li>Note that when the server sends back a <code>304</code> status code, it doesn&#8217;t re-send the data. That&#8217;s the whole point: to save bandwidth by not re-downloading data that hasn&#8217;t
changed. So if you actually want that data, you&#8217;ll need to cache it locally the first time you get it.
<p>Handling <code>ETag</code> works much the same way, but instead of checking for <code>Last-Modified</code> and sending <code>If-Modified-Since</code>, you check for <code>ETag</code> and send <code>If-None-Match</code>. Let&#8217;s start with a fresh <abbr>IDE</abbr> session.
<div class=example><h3 id="oa.etags.example">Example 11.9. Supporting <code>ETag</code>/<code>If-None-Match</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import urllib2, openanything</kbd>
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener(</kbd>
<samp class=p>... </samp><kbd>openanything.DefaultErrorHandler())</kbd>
<samp class=p>>>> </samp><kbd>firstdatastream = opener.open(request)</kbd>
<samp class=p>>>> </samp><kbd>firstdatastream.headers.get('ETag')</kbd> <span>&#x2460;</span>
'"e842a-3e53-55d97640"'
<samp class=p>>>> </samp><kbd>firstdata = firstdatastream.read()</kbd>
<samp class=p>>>> </samp><kbd>print firstdata</kbd> <span>&#x2461;</span>
<samp>&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:lang="en">
&lt;title mode="escaped">dive into mark&lt;/title>
&lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
&hellip;
</samp>
<samp class=p>>>> </samp><kbd>request.add_header('If-None-Match',</kbd>
<samp class=p>... </samp><kbd>firstdatastream.headers.get('ETag'))</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>seconddatastream = opener.open(request)</kbd>
<samp class=p>>>> </samp><kbd>seconddatastream.status</kbd> <span>&#x2463;</span>
304
<samp class=p>>>> </samp><kbd>seconddatastream.read()</kbd> <span>&#x2464;</span>
''
</pre>
<ol>
<li>Using the <var>firstdatastream.headers</var> pseudo-dictionary, you can get the <code>ETag</code> returned from the server. (What happens if the server didn&#8217;t send back an <code>ETag</code>? Then this line would return <code>None</code>.)
<li>OK, you got the data.
<li>Now set up the second call by setting the <code>If-None-Match</code> header to the <code>ETag</code> you got from the first call.
<li>The second call succeeds quietly (without throwing an exception), and once again you see that the server has sent back a <code>304</code> status code. Based on the <code>ETag</code> you sent the second time, it knows that the data hasn&#8217;t changed.
<li>Regardless of whether the <code>304</code> is triggered by <code>Last-Modified</code> date checking or <code>ETag</code> hash matching, you&#8217;ll never get the data along with the <code>304</code>. That&#8217;s the whole point.
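<p>Since a server may send <code>Last-Modified</code>, <code>ETag</code>, both, or neither, the two techniques above can be combined defensively. The following is a minimal sketch, assuming Python 3&#8217;s <code>urllib.request</code>; the function name <code>build_conditional_request</code> is hypothetical, and <var>previous_headers</var> is assumed to be any mapping with a <code>get()</code> method, such as the headers saved from an earlier response.

```python
import urllib.request


def build_conditional_request(url, previous_headers=None):
    """Build a Request carrying whichever validators the last response had."""
    request = urllib.request.Request(url)
    if previous_headers is None:
        return request

    last_modified = previous_headers.get("Last-Modified")
    etag = previous_headers.get("ETag")

    # Send only the validators the server actually gave us last time.
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


# Header values copied from the examples in this section.
previous = {
    "Last-Modified": "Thu, 15 Apr 2004 19:45:21 GMT",
    "ETag": '"e842a-3e53-55d97640"',
}
request = build_conditional_request(
    "http://diveintomark.org/xml/atom.xml", previous)
print(request.header_items())
```

<p>If the earlier response carried neither header, the function returns a plain request and the server simply sends the full data again.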
<table id="tip.etag.vs.lastmodified" class=note border="0" summary="">
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In these examples, the <abbr>HTTP</abbr> server has supported both <code>Last-Modified</code> and <code>ETag</code> headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively
in case a server only supports one or the other, or neither.
<p class=a>&#x2042;
<h2 id="oa.redirect">11.7. Handling redirects</h2>
<p>You can support permanent and temporary redirects using a different kind of custom <abbr>URL</abbr> handler.
<p>First, let&#8217;s see why a redirect handler is necessary in the first place.
<div class=example><h3>Example 11.10. Accessing web services without a redirect handler</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import urllib2, httplib</kbd>
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>request = urllib2.Request(</kbd>
<samp class=p>... </samp><kbd>'http://diveintomark.org/redir/example301.xml')</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd>
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
<samp>connect: (diveintomark.org, 80)
send: '
GET /redir/example301.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'</samp> <span>&#x2462;</span>
<samp>header: Date: Thu, 15 Apr 2004 22:06:25 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Location: http://diveintomark.org/xml/atom.xml</samp> <span>&#x2463;</span>
<samp>header: Content-Length: 338
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0</samp> <span>&#x2464;</span>
<samp>Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:06:25 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Content-Length: 15955
header: Connection: close
header: Content-Type: application/atom+xml</samp>
<samp class=p>>>> </samp><kbd>f.url</kbd> <span>&#x2465;</span>
'http://diveintomark.org/xml/atom.xml'
<samp class=p>>>> </samp><kbd>f.headers.dict</kbd>
<samp>{'content-length': '15955',
'accept-ranges': 'bytes',
'server': 'Apache/2.0.49 (Debian GNU/Linux)',
'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT',
'connection': 'close',
'etag': '"e842a-3e53-55d97640"',
'date': 'Thu, 15 Apr 2004 22:06:25 GMT',
'content-type': 'application/atom+xml'}</samp>
<samp class=p>>>> </samp><kbd>f.status</kbd>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
AttributeError: addinfourl instance has no attribute 'status'</samp>
</pre>
<ol>
<li>You&#8217;ll be better able to see what&#8217;s happening if you turn on debugging.
<li>This is a <abbr>URL</abbr> which I have set up to permanently redirect to my Atom feed at <code>http://diveintomark.org/xml/atom.xml</code>.
<li>Sure enough, when you try to download the data at that address, the server sends back a <code>301</code> status code, telling you that the resource has moved permanently.
<li>The server also sends back a <code>Location</code> header that gives the new address of this data.
<li><code>urllib2</code> notices the redirect status code and automatically tries to retrieve the data at the new location specified in the <code>Location</code> header.
<li>The object you get back from the <var>opener</var> contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent
address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary
or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at
the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from
now on.
<p>This is suboptimal, but easy to fix. <code>urllib2</code> doesn&#8217;t behave exactly as you want it to when it encounters a <code>301</code> or <code>302</code>, so let&#8217;s override its behavior. How? With a custom <abbr>URL</abbr> handler, <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag">just like you did to handle <code>304</code> codes</a>.
<div class=example><h3>Example 11.11. Defining the redirect handler</h3>
<p>This class is defined in <code>openanything.py</code>.
<pre><code>
import urllib2

class SmartRedirectHandler(urllib2.HTTPRedirectHandler):     <span>&#x2460;</span>
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301( <span>&#x2461;</span>
            self, req, fp, code, msg, headers)
        result.status = code                                 <span>&#x2462;</span>
        return result

    def http_error_302(self, req, fp, code, msg, headers):   <span>&#x2463;</span>
        result = urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
        result.status = code
        return result
</code></pre>
<ol>
<li>Redirect behavior is defined in <code>urllib2</code> in a class called <code>HTTPRedirectHandler</code>. You don&#8217;t want to override that behavior completely, just extend it a little, so you&#8217;ll subclass <code>HTTPRedirectHandler</code> and call the ancestor class to do all the hard work.
<li>When it encounters a <code>301</code> status code from the server, <code>urllib2</code> will search through its handlers and call the <code>http_error_301</code> method. The first thing ours does is just call the <code>http_error_301</code> method in the ancestor, which handles the grunt work of looking for the <code>Location</code> header and following the redirect to the new address.
<li>Here&#8217;s the key: before you return, you store the status code (<code>301</code>), so that the calling program can access it later.
<li>Temporary redirects (status code <code>302</code>) work the same way: override the <code>http_error_302</code> method, call the ancestor, and save the status code before returning.
<p>So what has this bought us? You can now build a <abbr>URL</abbr> opener with the custom redirect handler, and it will still automatically
follow redirects, but now it will also expose the redirect status code.
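<p>A comparable handler can be sketched for Python 3, assuming <code>urllib.request.HTTPRedirectHandler</code> as the ancestor class. This is a sketch under assumptions, not the code from <code>openanything.py</code>: the attribute name <code>redirect_status</code> is hypothetical, chosen because Python 3 response objects already carry a <code>status</code> attribute holding the status of the final response.

```python
import urllib.request


class SmartRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember which status triggered them."""

    def _follow(self, ancestor_method, req, fp, code, msg, headers):
        # Let the ancestor find the Location header and fetch the new address.
        result = ancestor_method(req, fp, code, msg, headers)
        if result is not None:
            # redirect_status is a hypothetical attribute name (an assumption).
            result.redirect_status = code
        return result

    def http_error_301(self, req, fp, code, msg, headers):
        return self._follow(
            super().http_error_301, req, fp, code, msg, headers)

    def http_error_302(self, req, fp, code, msg, headers):
        return self._follow(
            super().http_error_302, req, fp, code, msg, headers)


# Building the opener with an instance replaces the default redirect handler.
opener = urllib.request.build_opener(SmartRedirectHandler())
```

<p>After <code>opener.open(request)</code> follows a redirect, the returned object would carry <code>redirect_status</code> alongside the final <abbr>URL</abbr>; responses that were not redirected would not have the attribute at all, so read it with <code>getattr(response, 'redirect_status', None)</code>.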
<div class=example><h3>Example 11.12. Using the redirect handler to detect permanent redirects</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/redir/example301.xml')</kbd>
<samp class=p>>>> </samp><kbd>import openanything, httplib</kbd>
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener(</kbd>
<samp class=p>... </samp><kbd>openanything.SmartRedirectHandler())</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
<samp>connect: (diveintomark.org, 80)
send: 'GET /redir/example301.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'</samp> <span>&#x2461;</span>
<samp>header: Date: Thu, 15 Apr 2004 22:13:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Location: http://diveintomark.org/xml/atom.xml
header: Content-Length: 338
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:13:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Content-Length: 15955
header: Connection: close
header: Content-Type: application/atom+xml
</samp>
<samp class=p>>>> </samp><kbd>f.status</kbd> <span>&#x2462;</span>
301
<samp class=p>>>> </samp><kbd>f.url</kbd>
'http://diveintomark.org/xml/atom.xml'
</pre>
<ol>
<li>First, build a <abbr>URL</abbr> opener with the redirect handler you just defined.
<li>You sent off a request, and you got a <code>301</code> status code in response. At this point, the <code>http_error_301</code> method gets called. You call the ancestor method, which follows the redirect and sends a request at the new location (<code>http://diveintomark.org/xml/atom.xml</code>).
<li>This is the payoff: now, not only do you have access to the new <abbr>URL</abbr>, but you have access to the redirect status code, so you
can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location
(<code>http://diveintomark.org/xml/atom.xml</code>, as specified in <var>f.url</var>). If you had stored the location in a configuration file or a database, you need to update that so you don&#8217;t keep pounding
the server with requests at the old address. It&#8217;s time to update your address book.
<p>The same redirect handler can also tell you that you <em>shouldn&#8217;t</em> update your address book.
<div class=example><h3>Example 11.13. Using the redirect handler to detect temporary redirects</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>request = urllib2.Request(</kbd>
<samp class=p>... </samp><kbd>'http://diveintomark.org/redir/example302.xml')</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
<samp>connect: (diveintomark.org, 80)
send: '
GET /redir/example302.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 302 Found\r\n'</samp> <span>&#x2461;</span>
<samp>header: Date: Thu, 15 Apr 2004 22:18:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Location: http://diveintomark.org/xml/atom.xml
header: Content-Length: 314
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0</samp> <span>&#x2462;</span>
<samp>Host: diveintomark.org
User-agent: Python-urllib/2.1
'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:18:21 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Content-Length: 15955
header: Connection: close
header: Content-Type: application/atom+xml</samp>
<samp class=p>>>> </samp><kbd>f.status</kbd> <span>&#x2463;</span>
302
<samp class=p>>>> </samp><kbd>f.url</kbd>
'http://diveintomark.org/xml/atom.xml'
</pre>
<ol>
<li>This is a sample <abbr>URL</abbr> I&#8217;ve set up that is configured to tell clients to <em>temporarily</em> redirect to <code>http://diveintomark.org/xml/atom.xml</code>.
<li>The server sends back a <code>302</code> status code, indicating a temporary redirect. The temporary new location of the data is given in the <code>Location</code> header.
<li><code>urllib2</code> calls your <code>http_error_302</code> method, which calls the ancestor method of the same name in <code>urllib2.HTTPRedirectHandler</code>, which follows the redirect to the new location. Then your <code>http_error_302</code> method stores the status code (<code>302</code>) so the calling application can get it later.
<li>And here you are, having successfully followed the redirect to <code>http://diveintomark.org/xml/atom.xml</code>. <var>f.status</var> tells you that this was a temporary redirect, which means that you should continue to request data from the original address
(<code>http://diveintomark.org/redir/example302.xml</code>). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It&#8217;s not for you
to say. The server said this redirect was only temporary, so you should respect that. And now you&#8217;re exposing enough information
that the calling application can respect that.
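<p>The &#8220;address book&#8221; rule from the last two examples fits in a few lines. This is a minimal sketch; the function name <code>url_for_next_request</code> and its parameters are hypothetical, and it assumes the caller has already captured the redirect status with a handler like the one above.

```python
def url_for_next_request(requested_url, final_url, redirect_status=None):
    """Pick the address to use next time, based on how the server redirected."""
    if redirect_status == 301:
        # Permanent redirect: update the address book to the new location.
        return final_url
    # Temporary redirect (302) or no redirect at all: keep the original address.
    return requested_url


print(url_for_next_request(
    "http://diveintomark.org/redir/example301.xml",
    "http://diveintomark.org/xml/atom.xml",
    301))
print(url_for_next_request(
    "http://diveintomark.org/redir/example302.xml",
    "http://diveintomark.org/xml/atom.xml",
    302))
```

<p>The first call prints the new address and the second prints the original one, matching the behavior the two examples recommend.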
<p class=a>&#x2042;
<h2 id="oa.gzip">11.8. Handling compressed data</h2>
<p>The last important <abbr>HTTP</abbr> feature you want to support is compression. Many web services have the ability to send data compressed,
which can cut down the amount of data sent over the wire by 60% or more. This is especially true of <abbr>XML</abbr> web services, since
<abbr>XML</abbr> data compresses very well.
<p>Servers won&#8217;t give you compressed data unless you tell them you can handle it.
<div class=example><h3>Example 11.14. Telling the server you would like compressed data</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import urllib2, httplib</kbd>
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd>
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd>
<samp class=p>>>> </samp><kbd>request.add_header('Accept-encoding', 'gzip')</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd>
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
<samp>connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
Accept-encoding: gzip</samp> <span>&#x2461;</span>
<samp>'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:24:39 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Vary: Accept-Encoding
header: Content-Encoding: gzip</samp> <span>&#x2462;</span>
<samp>header: Content-Length: 6289</samp> <span>&#x2463;</span>
<samp>header: Connection: close
header: Content-Type: application/atom+xml</samp>
</pre>
<ol>
<li>This is the key: once you&#8217;ve created your <code>Request</code> object, add an <code>Accept-encoding</code> header to tell the server you can accept gzip-encoded data. <code>gzip</code> is the name of the compression algorithm you&#8217;re using. <abbr>HTTP</abbr> defines other content codings too (<code>compress</code> and <code>deflate</code>), but <code>gzip</code> is the one that the vast majority of web servers support.
<li>There&#8217;s your header going across the wire.
<li>And here&#8217;s what the server sends back: the <code>Content-Encoding: gzip</code> header means that the data you&#8217;re about to receive has been gzip-compressed.
<li>The <code>Content-Length</code> header is the length of the compressed data, not the uncompressed data. As you&#8217;ll see in a minute, the actual length of
the uncompressed data was 15955, so gzip compression cut your bandwidth by over 60%!
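<p>Here is a rough Python 3 sketch of the same idea, assuming <code>urllib.request</code> in place of <code>urllib2</code>. It only builds the request and never sends it, so it shows the header you would put on the wire and then redoes the bandwidth arithmetic with the two lengths quoted above.

```python
import urllib.request

# Build the request without sending it; nothing here touches the network.
request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')
request.add_header('Accept-encoding', 'gzip')

# header_items() lists the headers that have been set on the request so far.
sent_headers = dict(request.header_items())
print(sent_headers)

# Savings for the response shown above: 6289 compressed bytes, 15955 uncompressed.
compressed_length = 6289
uncompressed_length = 15955
savings = 1 - compressed_length / uncompressed_length
print('{:.1%}'.format(savings))
```

<p>The numbers are the ones from this example only; other feeds will compress better or worse.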
<div class=example><h3>Example 11.15. Decompressing the data</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>compresseddata = f.read()</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>len(compresseddata)</kbd>
6289
<samp class=p>>>> </samp><kbd>import StringIO</kbd>
<samp class=p>>>> </samp><kbd>compressedstream = StringIO.StringIO(compresseddata)</kbd> <span>&#x2461;</span>
<samp class=p>>>> </samp><kbd>import gzip</kbd>
<samp class=p>>>> </samp><kbd>gzipper = gzip.GzipFile(fileobj=compressedstream)</kbd> <span>&#x2462;</span>
<samp class=p>>>> </samp><kbd>data = gzipper.read()</kbd> <span>&#x2463;</span>
<samp class=p>>>> </samp><kbd>print data</kbd> <span>&#x2464;</span>
<samp>&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:lang="en">
&lt;title mode="escaped">dive into mark&lt;/title>
&lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
&hellip;
</samp>
<samp class=p>>>> </samp><kbd>len(data)</kbd>
15955
</pre>
<ol>
<li>Continuing from the previous example, <var>f</var> is the file-like object returned from the <abbr>URL</abbr> opener. Using its <code>read()</code> method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first
step towards getting the data you really want.
<li>OK, this step is a bit of a messy workaround. Python has a <code>gzip</code> module, which reads (and writes) gzip-compressed files on disk. But you don&#8217;t have a file on disk; you have a gzip-compressed
buffer in memory, and you don&#8217;t want to write out a temporary file just so you can uncompress it. So what you&#8217;re going to
do is create a file-like object out of the in-memory data (<var>compresseddata</var>), using the <code>StringIO</code> module. You first saw the <code>StringIO</code> module in <a href="#kgp.openanything.stringio.example" title="Example 10.4. Introducing StringIO">the previous chapter</a>, but now you&#8217;ve found another use for it.
<li>Now you can create an instance of <code>GzipFile</code>, and tell it that its &#8220;file&#8221; is the file-like object <var>compressedstream</var>.
<li>This is the line that does all the actual work: &#8220;reading&#8221; from <code>GzipFile</code> will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. <var>gzipper</var> is a file-like object which represents a gzip-compressed file. That &#8220;file&#8221; is not a real file on disk, though; <var>gzipper</var> is really just &#8220;reading&#8221; from the file-like object you created with <code>StringIO</code> to wrap the compressed data, which is only in memory in the variable <var>compresseddata</var>. And where did that compressed data come from? You originally downloaded it from a remote <abbr>HTTP</abbr> server by &#8220;reading&#8221; from the file-like object you built with <code>urllib2.build_opener</code>. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it.
<li>Look ma, real data. (15955 bytes of it, in fact.)
<p>&#8220;But wait!&#8221; I hear you cry. &#8220;This could be even easier!&#8221; I know what you&#8217;re thinking. You&#8217;re thinking that <code>opener.open</code> returns a file-like object, so why not cut out the <code>StringIO</code> middleman and just pass <var>f</var> directly to <code>GzipFile</code>? OK, maybe you weren&#8217;t thinking that, but don&#8217;t worry about it, because it doesn&#8217;t work.
<div class=example><h3>Example 11.16. Decompressing the data directly from the server</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd><span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>f.headers.get('Content-Encoding')</kbd> <span>&#x2461;</span>
'gzip'
<samp class=p>>>> </samp><kbd>data = gzip.GzipFile(fileobj=f).read()</kbd> <span>&#x2462;</span>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in ?
File "c:\python23\lib\gzip.py", line 217, in read
self._read(readsize)
File "c:\python23\lib\gzip.py", line 252, in _read
pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'</samp>
</pre>
<ol>
<li>Continuing from the previous example, you already have a <code>Request</code> object set up with an <code>Accept-encoding: gzip</code> header.
<li>Simply opening the request will get you the headers (though not download any data yet). As you can see from the returned
<code>Content-Encoding</code> header, this data has been sent gzip-compressed.
<li>Since <code>opener.open</code> returns a file-like object, and you know from the headers that when you read it, you&#8217;re going to get gzip-compressed data,
why not simply pass that file-like object directly to <code>GzipFile</code>? As you &#8220;read&#8221; from the <code>GzipFile</code> instance, it will &#8220;read&#8221; compressed data from the remote <abbr>HTTP</abbr> server and decompress it on the fly. It&#8217;s a good idea, but unfortunately it doesn&#8217;t
work. Because of the way <code>GzipFile</code> reads its data, it needs to save its position (that is the <code>tell()</code> call in the traceback) and move forwards and backwards through the compressed file. This doesn&#8217;t work when the &#8220;file&#8221; is a stream of bytes coming from a remote server; all you can do with it is read bytes in order, not move back and
forth through the data stream. So the inelegant hack of using <code>StringIO</code> is the best solution: download the compressed data, create a file-like object out of it with <code>StringIO</code>, and then decompress the data from that.
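<p>The download-then-decompress approach can be sketched in Python 3 as well, assuming <code>io.BytesIO</code> plays the role that <code>StringIO</code> plays above. To keep the sketch runnable without a network connection, the &#8220;downloaded&#8221; bytes are a stand-in compressed locally; they are not the real feed.

```python
import gzip
import io

# Stand-in for the bytes you would get from f.read(); compressed locally
# so the example runs without a network connection.
original = b'<?xml version="1.0" encoding="iso-8859-1"?><feed/>' * 50
compresseddata = gzip.compress(original)

# Wrap the in-memory bytes in a file-like object, then let GzipFile read from it.
compressedstream = io.BytesIO(compresseddata)
with gzip.GzipFile(fileobj=compressedstream) as gzipper:
    data = gzipper.read()

print(len(compresseddata), len(data))
```

<p>The in-memory buffer supports <code>tell()</code> and <code>seek()</code>, so <code>GzipFile</code> can move around in it however it likes.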
<p class=a>&#x2042;
<h2 id="oa.alltogether">11.9. Putting it all together</h2>
<p>You&#8217;ve seen all the pieces for building an intelligent <abbr>HTTP</abbr> web services client. Now let&#8217;s see how they all fit together.
<div class=example><h3>Example 11.17. The <code>openAnything</code> function</h3>
<p>This function is defined in <code>openanything.py</code>.
<pre><code>
def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
    # non-HTTP code omitted for brevity
    if urlparse.urlparse(source)[0] == 'http':                                       <span>&#x2460;</span>
        # open URL with urllib2
        request = urllib2.Request(source)
        request.add_header('User-Agent', agent)                                      <span>&#x2461;</span>
        if etag:
            request.add_header('If-None-Match', etag)                                <span>&#x2462;</span>
        if lastmodified:
            request.add_header('If-Modified-Since', lastmodified)                    <span>&#x2463;</span>
        request.add_header('Accept-encoding', 'gzip')                                <span>&#x2464;</span>
        opener = urllib2.build_opener(SmartRedirectHandler(), DefaultErrorHandler()) <span>&#x2465;</span>
        return opener.open(request)                                                  <span>&#x2466;</span>
</code></pre>
<ol>
<li><code>urlparse</code> is a handy utility module for, you guessed it, parsing <abbr>URL</abbr>s. Its primary function, also called <code>urlparse</code>, takes a <abbr>URL</abbr> and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier).
Of these, the only thing you care about is the scheme, to make sure that you&#8217;re dealing with an <abbr>HTTP</abbr> <abbr>URL</abbr> (which <code>urllib2</code> can handle).
<li>You identify yourself to the <abbr>HTTP</abbr> server with the <code>User-Agent</code> passed in by the calling function. If no <code>User-Agent</code> was specified, you use a default one defined earlier in the <code>openanything.py</code> module. You never use the default one defined by <code>urllib2</code>.
<li>If an <code>ETag</code> hash was given, send it in the <code>If-None-Match</code> header.
<li>If a last-modified date was given, send it in the <code>If-Modified-Since</code> header.
<li>Tell the server you would like compressed data if possible.
<li>Build a <abbr>URL</abbr> opener that uses <em>both</em> of the custom <abbr>URL</abbr> handlers: <code>SmartRedirectHandler</code> for handling <code>301</code> and <code>302</code> redirects, and <code>DefaultErrorHandler</code> for handling <code>304</code>, <code>404</code>, and other error conditions gracefully.
<li>That&#8217;s it! Open the <abbr>URL</abbr> and return a file-like object to the caller.
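<p>The header-building half of this function can be tried out in Python 3 without opening a connection. The sketch below assumes <code>urllib.request</code> and <code>urllib.parse</code> in place of <code>urllib2</code> and <code>urlparse</code>; <code>build_request</code> is a hypothetical name, the <code>USER_AGENT</code> value is a placeholder, and the custom handlers are left out entirely.

```python
import urllib.parse
import urllib.request

USER_AGENT = 'MyHTTPWebServicesApp/1.0'  # placeholder default for this sketch


def build_request(source, etag=None, lastmodified=None, agent=USER_AGENT):
    """Return a Request with caching and compression headers, or None for non-HTTP sources."""
    if urllib.parse.urlparse(source).scheme != 'http':
        return None
    request = urllib.request.Request(source)
    request.add_header('User-Agent', agent)
    if etag:
        request.add_header('If-None-Match', etag)
    if lastmodified:
        request.add_header('If-Modified-Since', lastmodified)
    request.add_header('Accept-encoding', 'gzip')
    return request


request = build_request('http://diveintomark.org/xml/atom.xml',
                        etag='"e842a-3e53-55d97640"',
                        lastmodified='Thu, 15 Apr 2004 19:45:21 GMT')
# Request stores header names with only the first letter capitalized.
for name, value in sorted(request.header_items()):
    print(name, value, sep=': ')
```

<p>Separating &#8220;build the request&#8221; from &#8220;open the request&#8221; is a design choice of this sketch, not of the original function; it just makes the headers easy to inspect.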
<div class=example><h3>Example 11.18. The <code>fetch</code> function</h3>
<p>This function is defined in <code>openanything.py</code>.
<pre><code>
def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
    '''Fetch data and metadata from a URL, file, stream, or string'''
    result = {}
    f = openAnything(source, etag, last_modified, agent)                <span>&#x2460;</span>
    result['data'] = f.read()                                           <span>&#x2461;</span>
    if hasattr(f, 'headers'):
        # save ETag, if the server sent one
        result['etag'] = f.headers.get('ETag')                          <span>&#x2462;</span>
        # save Last-Modified header, if the server sent one
        result['lastmodified'] = f.headers.get('Last-Modified')         <span>&#x2463;</span>
        if f.headers.get('content-encoding', '') == 'gzip':             <span>&#x2464;</span>
            # data came back gzip-compressed, decompress it
            result['data'] = gzip.GzipFile(fileobj=StringIO(result['data'])).read()
    if hasattr(f, 'url'):                                               <span>&#x2465;</span>
        result['url'] = f.url
        result['status'] = 200
    if hasattr(f, 'status'):                                            <span>&#x2466;</span>
        result['status'] = f.status
    f.close()
    return result
</code></pre>
<ol>
<li>First, you call the <code>openAnything</code> function with a <abbr>URL</abbr>, <code>ETag</code> hash, <code>Last-Modified</code> date, and <code>User-Agent</code>.
<li>Read the actual data returned from the server. This may be compressed; if so, you&#8217;ll decompress it later.
<li>Save the <code>ETag</code> hash returned from the server, so the calling application can pass it back to you next time, and you can pass it on to <code>openAnything</code>, which can stick it in the <code>If-None-Match</code> header and send it to the remote server.
<li>Save the <code>Last-Modified</code> date too.
<li>If the server says that it sent compressed data, decompress it.
<li>If you got a <abbr>URL</abbr> back from the server, save it, and assume that the status code is <code>200</code> until you find out otherwise.
<li>If one of the custom <abbr>URL</abbr> handlers captured a status code, then save that too.
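<p>To see the bookkeeping on its own, here is a Python 3 sketch that builds the same kind of result dictionary from pieces you supply by hand. Everything in it is an assumption made for illustration: <code>summarize_response</code> is a hypothetical helper, the headers are a plain (case-sensitive) dictionary rather than a real headers object, and the body is compressed locally instead of being downloaded.

```python
import gzip


def summarize_response(body, headers, url=None, status=200):
    """Build a result dictionary from raw response pieces (hypothetical helper)."""
    result = {'data': body}
    # Save the ETag and Last-Modified headers, if the server sent them.
    result['etag'] = headers.get('ETag')
    result['lastmodified'] = headers.get('Last-Modified')
    if headers.get('Content-Encoding', '') == 'gzip':
        # The server said the body is gzip-compressed, so decompress it.
        result['data'] = gzip.decompress(body)
    if url is not None:
        result['url'] = url
    result['status'] = status
    return result


# Simulated response: a gzip-compressed body plus headers a server might send.
feed = b'<feed version="0.3">...</feed>'
headers = {'ETag': '"e842a-3e53-55d97640"', 'Content-Encoding': 'gzip'}
result = summarize_response(gzip.compress(feed), headers,
                            url='http://diveintomark.org/xml/atom.xml')
print(result['status'], result['etag'], result['data'])
```

<p>The caller gets back uncompressed data either way, which is the whole point: compression stays an implementation detail.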
<div class=example><h3>Example 11.19. Using <code>openanything.py</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import openanything</kbd>
<samp class=p>>>> </samp><kbd>useragent = 'MyHTTPWebServicesApp/1.0'</kbd>
<samp class=p>>>> </samp><kbd>url = 'http://diveintopython3.org/redir/example301.xml'</kbd>
<samp class=p>>>> </samp><kbd>params = openanything.fetch(url, agent=useragent)</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>params</kbd> <span>&#x2461;</span>
<samp>{'url': 'http://diveintomark.org/xml/atom.xml',
'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT',
'etag': '"e842a-3e53-55d97640"',
'status': 301,
'data': '&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
&hellip;
'}</samp>
<samp class=p>>>> </samp><kbd>if params['status'] == 301:</kbd> <span>&#x2462;</span>
<samp class=p>... </samp><kbd>    url = params['url']</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd>newparams = openanything.fetch(</kbd>
<samp class=p>... </samp><kbd>    url, params['etag'], params['lastmodified'], useragent)</kbd> <span>&#x2463;</span>
<samp class=p>>>> </samp><kbd>newparams</kbd>
<samp>{'url': 'http://diveintomark.org/xml/atom.xml',
'lastmodified': None,
'etag': '"e842a-3e53-55d97640"',
'status': 304,
'data': ''}</samp> <span>&#x2464;</span>
</pre>
<ol>
<li>The very first time you fetch a resource, you don&#8217;t have an <code>ETag</code> hash or <code>Last-Modified</code> date, so you&#8217;ll leave those out. (They&#8217;re <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional parameters</a>.)
<li>What you get back is a dictionary of several useful headers, the <abbr>HTTP</abbr> status code, and the actual data returned from the server.
<code>openanything</code> handles the gzip compression internally; you don&#8217;t care about that at this level.
<li>If you ever get a <code>301</code> status code, that&#8217;s a permanent redirect, and you need to update your <abbr>URL</abbr> to the new address.
<li>The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) <abbr>URL</abbr>, the
<code>ETag</code> from the last time, the <code>Last-Modified</code> date from the last time, and of course your <code>User-Agent</code>.
<li>What you get back is again a dictionary, but the data hasn&#8217;t changed, so all you got was a <code>304</code> status code and no data.
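<p>What the calling application does with these dictionaries is up to you, but one possible shape is sketched below. <code>update_cache</code> is a hypothetical helper, not part of <code>openanything.py</code>, and the dictionaries are hand-written stand-ins modeled on the output above (with a shortened placeholder for the feed data), so nothing is fetched.

```python
def update_cache(cache, result):
    """Merge a fetch()-style result dictionary into a cached copy (hypothetical helper)."""
    updated = dict(cache)
    if result.get('status') == 301:
        # Permanent redirect: remember the new address for next time.
        updated['url'] = result['url']
    if result.get('status') == 304:
        # Not modified: keep the data, ETag, and Last-Modified date already cached.
        return updated
    for key in ('data', 'etag', 'lastmodified'):
        updated[key] = result.get(key)
    return updated


cache = {'url': 'http://diveintopython3.org/redir/example301.xml'}
first = {'url': 'http://diveintomark.org/xml/atom.xml',
         'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT',
         'etag': '"e842a-3e53-55d97640"',
         'status': 301,
         'data': '<feed version="0.3">...</feed>'}  # placeholder data
second = {'url': 'http://diveintomark.org/xml/atom.xml',
          'lastmodified': None,
          'etag': '"e842a-3e53-55d97640"',
          'status': 304,
          'data': ''}

cache = update_cache(cache, first)
cache = update_cache(cache, second)
print(cache['url'], cache['lastmodified'], len(cache['data']))
```

<p>Note that the <code>304</code> response leaves the cached data alone; overwriting it with the empty string would throw away exactly what the caching headers were meant to save.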
<p class=a>&#x2042;
<h2 id="oa.summary">11.10. Summary</h2>
<p>The <code>openanything.py</code> module and its functions should now make perfect sense.
<p>There are five important features of <abbr>HTTP</abbr> web services that every client should support:
<div class=itemizedlist>
<ul>
<li>Identifying your application <a href="#oa.useragent" title="11.5. Setting the User-Agent">by setting a proper <code>User-Agent</code></a>.
<li>Handling <a href="#oa.redirect" title="11.7. Handling redirects">permanent redirects properly</a>.
<li>Supporting <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag"><code>Last-Modified</code> date checking</a> to avoid re-downloading data that hasn&#8217;t changed.
<li>Supporting <a href="#oa.etags.example" title="Example 11.9. Supporting ETag/If-None-Match"><code>ETag</code> hashes</a> to avoid re-downloading data that hasn&#8217;t changed.
<li>Supporting <a href="#oa.gzip" title="11.8. Handling compressed data">gzip compression</a> to reduce bandwidth even when data <em>has</em> changed.
</ul>
</div>
<p class=a>&#x2042;
-->
<h2 id=beyond-get>Beyond GET</h2>
<p>FIXME