mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
more HTTP chapter
+37
-587
@@ -253,599 +253,49 @@ Content-Type: application/xml</samp>
|
||||
<p>But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time.
|
||||
|
||||
<pre class=screen>
|
||||
FIXME
|
||||
</pre>
|
||||
# continued from the <a href=#whats-on-the-wire>previous example</a>
|
||||
<samp class=p>>>> </samp><kbd>response2 = urlopen('http://diveintopython3.org/examples/feed.xml')</kbd>
|
||||
<samp>send: b'GET /examples/feed.xml HTTP/1.1
|
||||
Host: diveintopython3.org
|
||||
Accept-Encoding: identity
|
||||
User-Agent: Python-urllib/3.0'
|
||||
Connection: close
|
||||
reply: 'HTTP/1.1 200 OK'
|
||||
…further debugging information omitted…</samp></pre>
|
||||
|
||||
<!--
|
||||
<p class=a>⁂
|
||||
<p>Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No sign of <a href=#last-modified><code>If-Modified-Since</code> headers</a>. No sign of <a href=#etags><code>If-None-Match</code> headers</a>. No respect for the caching headers. Still no compression.
|
||||
|
||||
<h2 id="oa.useragent">11.5. Setting the <code>User-Agent</code></h2>
|
||||
<p>The first step to improving your <abbr>HTTP</abbr> web services client is to identify yourself properly with a <code>User-Agent</code>. To do that, you need to move beyond the basic <code>urllib</code> and dive into <code>urllib2</code>.
|
||||
<div class=example><h3>Example 11.4. Introducing <code>urllib2</code></h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import httplib</kbd>
|
||||
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd> <span>①</span>
|
||||
<samp class=p>>>> </samp><kbd>import urllib2</kbd>
|
||||
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd> <span>②</span>
|
||||
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd> <span>③</span>
|
||||
<samp class=p>>>> </samp><kbd>feeddata = opener.open(request).read()</kbd> <span>④</span>
|
||||
connect: (diveintomark.org, 80)
|
||||
send: '
|
||||
GET /xml/atom.xml HTTP/1.0
|
||||
Host: diveintomark.org
|
||||
User-agent: Python-urllib/2.1
|
||||
'
|
||||
reply: 'HTTP/1.1 200 OK\r\n'
|
||||
header: Date: Wed, 14 Apr 2004 23:23:12 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Content-Type: application/atom+xml
|
||||
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT
|
||||
header: ETag: "e8284-68e0-4de30f80"
|
||||
header: Accept-Ranges: bytes
|
||||
header: Content-Length: 26848
|
||||
header: Connection: close
|
||||
</pre>
|
||||
<p>And what happens when you do the same thing twice? You get the same response. Twice.
|
||||
|
||||
<pre class=screen>
|
||||
# continued from the previous example
|
||||
<a><samp class=p>>>> </samp><kbd>print(response2.headers.as_string())</kbd> <span>①</span></a>
|
||||
<samp>Date: Mon, 01 Jun 2009 03:58:00 GMT
|
||||
Server: Apache
|
||||
Last-Modified: Sun, 31 May 2009 22:51:11 GMT
|
||||
ETag: "bfe-255ef5c0"
|
||||
Accept-Ranges: bytes
|
||||
Content-Length: 3070
|
||||
Cache-Control: max-age=86400
|
||||
Expires: Tue, 02 Jun 2009 03:58:00 GMT
|
||||
Vary: Accept-Encoding
|
||||
Connection: close
|
||||
Content-Type: application/xml</samp>
|
||||
<samp class=p>>>> </samp><kbd>data2 = response2.read()</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>len(data2)</kbd> <span>②</span></a>
|
||||
<samp>3070</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>data2 == data</kbd> <span>③</span></a>
|
||||
<samp>True</samp></pre>
|
||||
<ol>
|
||||
<li>If you still have your Python <abbr>IDE</abbr> open from the previous section’s example, you can skip this, but this turns on <a href="#oa.debug" title="11.4. Debugging HTTP web services"><abbr>HTTP</abbr> debugging</a> so you can see what you’re actually sending over the wire, and what gets sent back.
|
||||
<li>Fetching an <abbr>HTTP</abbr> resource with <code>urllib2</code> is a three-step process, for good reasons that will become clear shortly. The first step is to create a <code>Request</code> object, which takes the <abbr>URL</abbr> of the resource you’ll eventually get around to retrieving. Note that this step doesn’t actually retrieve anything yet.
|
||||
<li>The second step is to build a <abbr>URL</abbr> opener. This can take any number of handlers, which control how responses are handled. But you can also build an opener without any custom handlers, which is what you’re doing here. You’ll see how to define and use custom handlers later in this chapter when you explore redirects.
|
||||
<li>The final step is to tell the opener to open the <abbr>URL</abbr>, using the <code>Request</code> object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the resource and stores the returned data in <var>feeddata</var>.
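<p>The three steps above can be sketched in Python 3, where the <code>urllib2</code> API lives in <code>urllib.request</code>. This is a minimal sketch rather than the chapter’s own listing; the final step is left commented out so that nothing is actually sent over the network.

```python
import urllib.request

# Step 1: describe the request. Nothing is sent yet.
request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')

# Step 2: build an opener that uses only the default handlers.
opener = urllib.request.build_opener()

# Step 3 (commented out so the sketch runs offline):
# feeddata = opener.open(request).read()

print(request.full_url)      # the URL the request will eventually fetch
print(request.get_method())  # the HTTP method that will be used
```

<p>Keeping the opener separate from the request is what later lets you swap in custom handlers without touching the code that describes what to fetch.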
|
||||
<div class=example><h3>Example 11.5. Adding headers with the <code>Request</code></h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>request</kbd> <span>①</span>
|
||||
&lt;urllib2.Request instance at 0x00250AA8&gt;
|
||||
<samp class=p>>>> </samp><kbd>request.get_full_url()</kbd>
|
||||
'http://diveintomark.org/xml/atom.xml'
|
||||
<samp class=p>>>> </samp><kbd>request.add_header('User-Agent',</kbd>
|
||||
<samp class=p>... </samp><kbd>'OpenAnything/1.0 +http://diveintopython3.org/')</kbd> <span>②</span>
|
||||
<samp class=p>>>> </samp><kbd>feeddata = opener.open(request).read()</kbd> <span>③</span>
|
||||
connect: (diveintomark.org, 80)
|
||||
send: '
|
||||
GET /xml/atom.xml HTTP/1.0
|
||||
Host: diveintomark.org
|
||||
User-agent: OpenAnything/1.0 +http://diveintopython3.org/ <span>④</span>
|
||||
'
|
||||
reply: 'HTTP/1.1 200 OK\r\n'
|
||||
header: Date: Wed, 14 Apr 2004 23:45:17 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Content-Type: application/atom+xml
|
||||
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT
|
||||
header: ETag: "e8284-68e0-4de30f80"
|
||||
header: Accept-Ranges: bytes
|
||||
header: Content-Length: 26848
|
||||
header: Connection: close
|
||||
</pre>
|
||||
<ol>
|
||||
<li>You’re continuing from the previous example; you’ve already created a <code>Request</code> object with the <abbr>URL</abbr> you want to access.
|
||||
<li>Using the <code>add_header</code> method on the <code>Request</code> object, you can add arbitrary <abbr>HTTP</abbr> headers to the request. The first argument is the header, the second is the value you’re providing for that header. Convention dictates that a <code>User-Agent</code> should be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form, and you’ll see a lot of variations in the wild, but somewhere it should include a <abbr>URL</abbr> of your application. The <code>User-Agent</code> is usually logged by the server along with other details of your request, and including a <abbr>URL</abbr> of your application allows server administrators looking through their access logs to contact you if something is wrong.
|
||||
<li>The <var>opener</var> object you created before can be reused too, and it will retrieve the same feed again, but with your custom <code>User-Agent</code> header.
|
||||
<li>And here you are, sending your custom <code>User-Agent</code> in place of the generic one that Python sends by default. If you look closely, you’ll notice that you defined a <code>User-Agent</code> header, but you actually sent a <code>User-agent</code> header. See the difference? <code>urllib2</code> changed the case so that only the first letter was capitalized. It doesn’t really matter; <abbr>HTTP</abbr> specifies that header field names are completely case-insensitive.
|
||||
<li>The server is still sending the same array of “smart” headers: <code>Cache-Control</code> and <code>Expires</code> to allow caching, <code>Last-Modified</code> and <code>ETag</code> to enable “not-modified” tracking. Even the <code>Vary: Accept-Encoding</code> header hints that the server would support compression, if only you would bloody well ask for it. But you’re not listening.
|
||||
<li>Once again, fetching this data downloads the whole 3070 bytes…
|
||||
<li>…the exact same 3070 bytes you downloaded last time.
|
||||
</ol>
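<p>The header-name capitalization described above can be checked without touching the network. The sketch below assumes Python 3’s <code>urllib.request</code>, whose <code>Request.add_header</code> stores names the same way; it only inspects the <code>Request</code> object and never opens it.

```python
import urllib.request

request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')
request.add_header('User-Agent',
                   'OpenAnything/1.0 +http://diveintopython3.org/')

# add_header() stores the name with only its first letter capitalized.
print(request.headers)

# Lookups on the Request object use that stored spelling.
print(request.get_header('User-agent'))
```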
|
||||
|
||||
<p><abbr>HTTP</abbr> is designed to work better than this. <code>urllib</code> speaks <abbr>HTTP</abbr> like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. <abbr>HTTP</abbr> is a conversation. It’s time to upgrade to a library that speaks <abbr>HTTP</abbr> fluently.
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.etags">11.6. Handling <code>Last-Modified</code> and <code>ETag</code></h2>
|
||||
<p>Now that you know how to add custom <abbr>HTTP</abbr> headers to your web service requests, let’s look at adding support for <code>Last-Modified</code> and <code>ETag</code> headers.
|
||||
<p>These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can turn it off by setting <code>httplib.HTTPConnection.debuglevel = 0</code>. Or you can just leave debugging on, if that helps you.
|
||||
<div class=example><h3 id="oa.etags.example.1">Example 11.6. Testing <code>Last-Modified</code></h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import urllib2</kbd>
|
||||
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd>
|
||||
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd>
|
||||
<samp class=p>>>> </samp><kbd>firstdatastream = opener.open(request)</kbd>
|
||||
<samp class=p>>>> </samp><kbd>firstdatastream.headers.dict</kbd> <span>①</span>
|
||||
<samp>{'date': 'Thu, 15 Apr 2004 20:42:41 GMT',
|
||||
'server': 'Apache/2.0.49 (Debian GNU/Linux)',
|
||||
'content-type': 'application/atom+xml',
|
||||
'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT',
|
||||
'etag': '"e842a-3e53-55d97640"',
|
||||
'content-length': '15955',
|
||||
'accept-ranges': 'bytes',
|
||||
'connection': 'close'}</samp>
|
||||
<samp class=p>>>> </samp><kbd>request.add_header('If-Modified-Since',</kbd>
|
||||
<samp class=p>... </samp><kbd>firstdatastream.headers.get('Last-Modified'))</kbd> <span>②</span>
|
||||
<samp class=p>>>> </samp><kbd>seconddatastream = opener.open(request)</kbd> <span>③</span>
|
||||
<samp class=traceback>Traceback (most recent call last):
|
||||
  File "&lt;stdin&gt;", line 1, in ?
|
||||
File "c:\python23\lib\urllib2.py", line 326, in open
|
||||
'_open', req)
|
||||
File "c:\python23\lib\urllib2.py", line 306, in _call_chain
|
||||
result = func(*args)
|
||||
File "c:\python23\lib\urllib2.py", line 901, in http_open
|
||||
return self.do_open(httplib.HTTP, req)
|
||||
File "c:\python23\lib\urllib2.py", line 895, in do_open
|
||||
return self.parent.error('http', req, fp, code, msg, hdrs)
|
||||
File "c:\python23\lib\urllib2.py", line 352, in error
|
||||
return self._call_chain(*args)
|
||||
File "c:\python23\lib\urllib2.py", line 306, in _call_chain
|
||||
result = func(*args)
|
||||
File "c:\python23\lib\urllib2.py", line 412, in http_error_default
|
||||
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
|
||||
urllib2.HTTPError: HTTP Error 304: Not Modified</samp>
|
||||
</pre>
|
||||
<ol>
|
||||
<li>Remember all those <abbr>HTTP</abbr> headers you saw printed out when you turned on debugging? This is how you can get access to them programmatically: <var>firstdatastream.headers</var> is <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">an object that acts like a dictionary</a> and allows you to get any of the individual headers returned from the <abbr>HTTP</abbr> server.
|
||||
<li>On the second request, you add the <code>If-Modified-Since</code> header with the last-modified date from the first request. If the data hasn’t changed, the server should return a <code>304</code> status code.
|
||||
<li>Sure enough, the data hasn’t changed. You can see from the traceback that <code>urllib2</code> throws a special exception, <code>HTTPError</code>, in response to the <code>304</code> status code. This is a little unusual, and not entirely helpful. After all, it’s not an error; you specifically asked the server not to send you any data if it hadn’t changed, and the data didn’t change, so the server told you it wasn’t sending you any data. That’s not an error; that’s exactly what you were hoping for.
|
||||
<p><code>urllib2</code> also raises an <code>HTTPError</code> exception for conditions that you would think of as errors, such as <code>404</code> (page not found). In fact, it will raise <code>HTTPError</code> for <em>any</em> status code other than <code>200</code> (OK), <code>301</code> (permanent redirect), or <code>302</code> (temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without throwing an exception. To do that, you’ll need to define a custom <abbr>URL</abbr> handler.
|
||||
<div class=example><h3>Example 11.7. Defining URL handlers</h3>
|
||||
<p>This custom <abbr>URL</abbr> handler is part of <code>openanything.py</code>.
|
||||
<pre><code>
class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    <span>①</span>
    def http_error_default(self, req, fp, code, msg, headers): <span>②</span>
        result = urllib2.HTTPError(
            req.get_full_url(), code, msg, headers, fp)
        result.status = code                                   <span>③</span>
        return result
</code></pre>
|
||||
<ol>
|
||||
<li><code>urllib2</code> is designed around <abbr>URL</abbr> handlers. Each handler is just a class that can define any number of methods. When something happens (like an <abbr>HTTP</abbr> error, or even a <code>304</code> code), <code>urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code>urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
|
||||
<li><code>urllib2</code> searches through the defined handlers and calls the <code>http_error_default</code> method when it encounters a <code>304</code> status code from the server. By defining a custom error handler, you can prevent <code>urllib2</code> from raising an exception. Instead, you create the <code>HTTPError</code> object, but return it instead of raising it.
|
||||
<li>This is the key part: before returning, you save the status code returned by the <abbr>HTTP</abbr> server. This will allow you easy access to it from the calling program.
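<p>For comparison, here is a hedged Python 3 sketch of the same idea, assuming <code>urllib.request</code> and <code>urllib.error</code>. Rather than storing a separate <code>status</code> attribute as the <code>urllib2</code> version does, it relies on the <code>code</code> attribute that <code>HTTPError</code> sets itself. The handler is called directly with a made-up <code>304</code> response so the example runs offline; in real use you would pass an instance to <code>build_opener</code>.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


class DefaultErrorHandler(urllib.request.HTTPDefaultErrorHandler):
    def http_error_default(self, req, fp, code, msg, hdrs):
        # Build the HTTPError but hand it back instead of raising it.
        # Its .code attribute already carries the status code.
        return urllib.error.HTTPError(req.full_url, code, msg, hdrs, fp)


# Offline demonstration: invoke the handler with a fabricated 304 reply.
request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')
result = DefaultErrorHandler().http_error_default(
    request, io.BytesIO(b''), 304, 'Not Modified', Message())
print(result.code)
```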
|
||||
<div class=example><h3>Example 11.8. Using custom URL handlers</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>request.headers</kbd> <span>①</span>
|
||||
{'If-modified-since': 'Thu, 15 Apr 2004 19:45:21 GMT'}
|
||||
<samp class=p>>>> </samp><kbd>import openanything</kbd>
|
||||
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener(</kbd>
|
||||
<samp class=p>... </samp><kbd>openanything.DefaultErrorHandler())</kbd> <span>②</span>
|
||||
<samp class=p>>>> </samp><kbd>seconddatastream = opener.open(request)</kbd>
|
||||
<samp class=p>>>> </samp><kbd>seconddatastream.status</kbd> <span>③</span>
|
||||
304
|
||||
<samp class=p>>>> </samp><kbd>seconddatastream.read()</kbd> <span>④</span>
|
||||
''
|
||||
</pre>
|
||||
<ol>
|
||||
<li>You’re continuing the previous example, so the <code>Request</code> object is already set up, and you’ve already added the <code>If-Modified-Since</code> header.
|
||||
<li>This is the key: now that you’ve defined your custom <abbr>URL</abbr> handler, you need to tell <code>urllib2</code> to use it. Remember how I said that <code>urllib2</code> broke up the process of accessing an <abbr>HTTP</abbr> resource into three steps, and for good reason? This is why building the <abbr>URL</abbr> opener is its own step, because you can build it with your own custom <abbr>URL</abbr> handlers that override <code>urllib2</code>’s default behavior.
|
||||
<li>Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use <var>seconddatastream.headers.dict</var> to access them), also contains the <abbr>HTTP</abbr> status code. In this case, as you expected, the status is <code>304</code>, meaning this data hasn’t changed since the last time you asked for it.
|
||||
<li>Note that when the server sends back a <code>304</code> status code, it doesn’t re-send the data. That’s the whole point: to save bandwidth by not re-downloading data that hasn’t changed. So if you actually want that data, you’ll need to cache it locally the first time you get it.
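<p>The caching point above amounts to a small decision in the calling code. The following is only an illustrative sketch: <code>body_for</code> and the sample byte strings are hypothetical names and data, not part of <code>openanything.py</code>.

```python
def body_for(status, fresh_body, cached_body):
    """Return the bytes the caller should use after a conditional request."""
    if status == 304:
        # Not modified: the server sent no body, so fall back to the cache.
        return cached_body
    return fresh_body


cached = b'<feed>cached copy</feed>'          # saved from the first download
print(body_for(304, b'', cached))             # reuses the cached copy
print(body_for(200, b'<feed>new copy</feed>', cached))  # uses the new body
```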
|
||||
<p>Handling <code>ETag</code> works much the same way, but instead of checking for <code>Last-Modified</code> and sending <code>If-Modified-Since</code>, you check for <code>ETag</code> and send <code>If-None-Match</code>. Let’s start with a fresh <abbr>IDE</abbr> session.
|
||||
<div class=example><h3 id="oa.etags.example">Example 11.9. Supporting <code>ETag</code>/<code>If-None-Match</code></h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import urllib2, openanything</kbd>
|
||||
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd>
|
||||
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener(</kbd>
|
||||
<samp class=p>... </samp><kbd>openanything.DefaultErrorHandler())</kbd>
|
||||
<samp class=p>>>> </samp><kbd>firstdatastream = opener.open(request)</kbd>
|
||||
<samp class=p>>>> </samp><kbd>firstdatastream.headers.get('ETag')</kbd> <span>①</span>
|
||||
'"e842a-3e53-55d97640"'
|
||||
<samp class=p>>>> </samp><kbd>firstdata = firstdatastream.read()</kbd>
|
||||
<samp class=p>>>> </samp><kbd>print firstdata</kbd> <span>②</span>
|
||||
<samp>&lt;?xml version="1.0" encoding="iso-8859-1"?&gt;
&lt;feed version="0.3"
  xmlns="http://purl.org/atom/ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xml:lang="en"&gt;
  &lt;title mode="escaped"&gt;dive into mark&lt;/title&gt;
  &lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/&gt;
…
</samp>
|
||||
<samp class=p>>>> </samp><kbd>request.add_header('If-None-Match',</kbd>
|
||||
<samp class=p>... </samp><kbd>firstdatastream.headers.get('ETag'))</kbd> <span>③</span>
|
||||
<samp class=p>>>> </samp><kbd>seconddatastream = opener.open(request)</kbd>
|
||||
<samp class=p>>>> </samp><kbd>seconddatastream.status</kbd> <span>④</span>
|
||||
304
|
||||
<samp class=p>>>> </samp><kbd>seconddatastream.read()</kbd> <span>⑤</span>
|
||||
''
|
||||
</pre>
|
||||
<ol>
|
||||
<li>Using the <var>firstdatastream.headers</var> pseudo-dictionary, you can get the <code>ETag</code> returned from the server. (What happens if the server didn’t send back an <code>ETag</code>? Then this line would return <code>None</code>.)
|
||||
<li>OK, you got the data.
|
||||
<li>Now set up the second call by setting the <code>If-None-Match</code> header to the <code>ETag</code> you got from the first call.
|
||||
<li>The second call succeeds quietly (without throwing an exception), and once again you see that the server has sent back a <code>304</code> status code. Based on the <code>ETag</code> you sent the second time, it knows that the data hasn’t changed.
|
||||
<li>Regardless of whether the <code>304</code> is triggered by <code>Last-Modified</code> date checking or <code>ETag</code> hash matching, you’ll never get the data along with the <code>304</code>. That’s the whole point.
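<p>Since a server may offer both validators, only one, or neither, the conditional headers are best added only when the matching value is present. The sketch below assumes Python 3’s <code>urllib.request</code>; <code>add_validators</code> is a hypothetical helper, and a plain dictionary stands in for the headers of an earlier response.

```python
import urllib.request


def add_validators(request, cached_headers):
    """Copy whichever validators the earlier response offered onto request."""
    etag = cached_headers.get('ETag')
    last_modified = cached_headers.get('Last-Modified')
    if etag:
        request.add_header('If-None-Match', etag)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    return request


# Here the earlier response carried an ETag but no Last-Modified header.
request = add_validators(
    urllib.request.Request('http://diveintomark.org/xml/atom.xml'),
    {'ETag': '"e842a-3e53-55d97640"'})
print(request.header_items())
```

<p>A request built from a response with neither header simply goes out unconditionally, which is the safe fallback.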
|
||||
<table id="tip.etag.vs.lastmodified" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In these examples, the <abbr>HTTP</abbr> server has supported both <code>Last-Modified</code> and <code>ETag</code> headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively in case a server only supports one or the other, or neither.
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.redirect">11.7. Handling redirects</h2>
|
||||
<p>You can support permanent and temporary redirects using a different kind of custom <abbr>URL</abbr> handler.
|
||||
<p>First, let’s see why a redirect handler is necessary in the first place.
|
||||
<div class=example><h3>Example 11.10. Accessing web services without a redirect handler</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import urllib2, httplib</kbd>
|
||||
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd> <span>①</span>
|
||||
<samp class=p>>>> </samp><kbd>request = urllib2.Request(</kbd>
|
||||
<samp class=p>... </samp><kbd>'http://diveintomark.org/redir/example301.xml')</kbd> <span>②</span>
|
||||
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd>
|
||||
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
|
||||
<samp>connect: (diveintomark.org, 80)
|
||||
send: '
|
||||
GET /redir/example301.xml HTTP/1.0
|
||||
Host: diveintomark.org
|
||||
User-agent: Python-urllib/2.1
|
||||
'
|
||||
reply: 'HTTP/1.1 301 Moved Permanently\r\n'</samp> <span>③</span>
|
||||
<samp>header: Date: Thu, 15 Apr 2004 22:06:25 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Location: http://diveintomark.org/xml/atom.xml</samp> <span>④</span>
|
||||
<samp>header: Content-Length: 338
|
||||
header: Connection: close
|
||||
header: Content-Type: text/html; charset=iso-8859-1
|
||||
connect: (diveintomark.org, 80)
|
||||
send: '
|
||||
GET /xml/atom.xml HTTP/1.0</samp> <span>⑤</span>
|
||||
<samp>Host: diveintomark.org
|
||||
User-agent: Python-urllib/2.1
|
||||
'
|
||||
reply: 'HTTP/1.1 200 OK\r\n'
|
||||
header: Date: Thu, 15 Apr 2004 22:06:25 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
|
||||
header: ETag: "e842a-3e53-55d97640"
|
||||
header: Accept-Ranges: bytes
|
||||
header: Content-Length: 15955
|
||||
header: Connection: close
|
||||
header: Content-Type: application/atom+xml</samp>
|
||||
<samp class=p>>>> </samp><kbd>f.url</kbd> <span>⑥</span>
|
||||
'http://diveintomark.org/xml/atom.xml'
|
||||
<samp class=p>>>> </samp><kbd>f.headers.dict</kbd>
|
||||
<samp>{'content-length': '15955',
|
||||
'accept-ranges': 'bytes',
|
||||
'server': 'Apache/2.0.49 (Debian GNU/Linux)',
|
||||
'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT',
|
||||
'connection': 'close',
|
||||
'etag': '"e842a-3e53-55d97640"',
|
||||
'date': 'Thu, 15 Apr 2004 22:06:25 GMT',
|
||||
'content-type': 'application/atom+xml'}</samp>
|
||||
<samp class=p>>>> </samp><kbd>f.status</kbd>
|
||||
<samp class=traceback>Traceback (most recent call last):
|
||||
  File "&lt;stdin&gt;", line 1, in ?
|
||||
AttributeError: addinfourl instance has no attribute 'status'</samp>
|
||||
</pre>
|
||||
<ol>
|
||||
<li>You’ll be better able to see what’s happening if you turn on debugging.
|
||||
<li>This is a <abbr>URL</abbr> which I have set up to permanently redirect to my Atom feed at <code>http://diveintomark.org/xml/atom.xml</code>.
|
||||
<li>Sure enough, when you try to download the data at that address, the server sends back a <code>301</code> status code, telling you that the resource has moved permanently.
|
||||
<li>The server also sends back a <code>Location</code> header that gives the new address of this data.
|
||||
<li><code>urllib2</code> notices the redirect status code and automatically tries to retrieve the data at the new location specified in the <code>Location</code> header.
|
||||
<li>The object you get back from the <var>opener</var> contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from now on.
|
||||
<p>This is suboptimal, but easy to fix. <code>urllib2</code> doesn’t behave exactly as you want it to when it encounters a <code>301</code> or <code>302</code>, so let’s override its behavior. How? With a custom <abbr>URL</abbr> handler, <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag">just like you did to handle <code>304</code> codes</a>.
|
||||
<div class=example><h3>Example 11.11. Defining the redirect handler</h3>
|
||||
<p>This class is defined in <code>openanything.py</code>.
|
||||
<pre><code>
class SmartRedirectHandler(urllib2.HTTPRedirectHandler):     <span>①</span>
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301( <span>②</span>
            self, req, fp, code, msg, headers)
        result.status = code                                 <span>③</span>
        return result

    def http_error_302(self, req, fp, code, msg, headers):   <span>④</span>
        result = urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
        result.status = code
        return result
</code></pre>
|
||||
<ol>
|
||||
<li>Redirect behavior is defined in <code>urllib2</code> in a class called <code>HTTPRedirectHandler</code>. You don’t want to completely override the behavior, you just want to extend it a little, so you’ll subclass <code>HTTPRedirectHandler</code> so you can call the ancestor class to do all the hard work.
|
||||
<li>When it encounters a <code>301</code> status code from the server, <code>urllib2</code> will search through its handlers and call the <code>http_error_301</code> method. The first thing ours does is just call the <code>http_error_301</code> method in the ancestor, which handles the grunt work of looking for the <code>Location</code> header and following the redirect to the new address.
|
||||
<li>Here’s the key: before you return, you store the status code (<code>301</code>), so that the calling program can access it later.
|
||||
<li>Temporary redirects (status code <code>302</code>) work the same way: override the <code>http_error_302</code> method, call the ancestor, and save the status code before returning.
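<p>A Python 3 counterpart of the redirect handler described above might look like the following sketch, assuming <code>urllib.request</code>. The attribute name <code>redirect_status</code> is a hypothetical choice made here so that the final response’s own status is left untouched. The demonstration only builds the opener and checks which redirect handler it ends up with; it does not follow a real redirect.

```python
import urllib.request


class SmartRedirectHandler(urllib.request.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        # Let the stock handler follow the redirect, then tag the result.
        result = super().http_error_301(req, fp, code, msg, headers)
        if result is not None:
            result.redirect_status = code  # hypothetical attribute name
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        result = super().http_error_302(req, fp, code, msg, headers)
        if result is not None:
            result.redirect_status = code
        return result


# Passing an instance replaces the default redirect handler in the opener.
opener = urllib.request.build_opener(SmartRedirectHandler())
redirect_handlers = [h for h in opener.handlers
                     if isinstance(h, urllib.request.HTTPRedirectHandler)]
print([type(h).__name__ for h in redirect_handlers])
```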
|
||||
<p>So what has this bought us? You can now build a <abbr>URL</abbr> opener with the custom redirect handler, and it will still automatically follow redirects, but now it will also expose the redirect status code.
|
||||
<div class=example><h3>Example 11.12. Using the redirect handler to detect permanent redirects</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/redir/example301.xml')</kbd>
|
||||
<samp class=p>>>> </samp><kbd>import openanything, httplib</kbd>
|
||||
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd>
|
||||
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener(</kbd>
|
||||
<samp class=p>... </samp><kbd>openanything.SmartRedirectHandler())</kbd> <span>①</span>
|
||||
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
|
||||
<samp>connect: (diveintomark.org, 80)
|
||||
send: 'GET /redir/example301.xml HTTP/1.0
|
||||
Host: diveintomark.org
|
||||
User-agent: Python-urllib/2.1
|
||||
'
|
||||
reply: 'HTTP/1.1 301 Moved Permanently\r\n'</samp> <span>②</span>
|
||||
<samp>header: Date: Thu, 15 Apr 2004 22:13:21 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Location: http://diveintomark.org/xml/atom.xml
|
||||
header: Content-Length: 338
|
||||
header: Connection: close
|
||||
header: Content-Type: text/html; charset=iso-8859-1
|
||||
connect: (diveintomark.org, 80)
|
||||
send: '
|
||||
GET /xml/atom.xml HTTP/1.0
|
||||
Host: diveintomark.org
|
||||
User-agent: Python-urllib/2.1
|
||||
'
|
||||
reply: 'HTTP/1.1 200 OK\r\n'
|
||||
header: Date: Thu, 15 Apr 2004 22:13:21 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
|
||||
header: ETag: "e842a-3e53-55d97640"
|
||||
header: Accept-Ranges: bytes
|
||||
header: Content-Length: 15955
|
||||
header: Connection: close
|
||||
header: Content-Type: application/atom+xml
|
||||
</samp>
|
||||
<samp class=p>>>> </samp><kbd>f.status</kbd> <span>③</span>
|
||||
301
|
||||
<samp class=p>>>> </samp><kbd>f.url</kbd>
|
||||
'http://diveintomark.org/xml/atom.xml'
|
||||
</pre>
|
||||
<ol>
|
||||
<li>First, build a <abbr>URL</abbr> opener with the redirect handler you just defined.
|
||||
<li>You sent off a request, and you got a <code>301</code> status code in response. At this point, the <code>http_error_301</code> method gets called. You call the ancestor method, which follows the redirect and sends a request at the new location (<code>http://diveintomark.org/xml/atom.xml</code>).
|
||||
<li>This is the payoff: now, not only do you have access to the new <abbr>URL</abbr>, but you have access to the redirect status code, so you can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location (<code>http://diveintomark.org/xml/atom.xml</code>, as specified in <var>f.url</var>). If you had stored the location in a configuration file or a database, you need to update that so you don’t keep pounding the server with requests at the old address. It’s time to update your address book.
|
||||
<p>The same redirect handler can also tell you that you <em>shouldn’t</em> update your address book.
|
||||
<div class=example><h3>Example 11.13. Using the redirect handler to detect temporary redirects</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>request = urllib2.Request(</kbd>
|
||||
<samp class=p>... </samp><kbd>'http://diveintomark.org/redir/example302.xml')</kbd> <span>①</span>
|
||||
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
|
||||
<samp>connect: (diveintomark.org, 80)
|
||||
send: '
|
||||
GET /redir/example302.xml HTTP/1.0
|
||||
Host: diveintomark.org
|
||||
User-agent: Python-urllib/2.1
|
||||
'
|
||||
reply: 'HTTP/1.1 302 Found\r\n'</samp> <span>②</span>
|
||||
<samp>header: Date: Thu, 15 Apr 2004 22:18:21 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Location: http://diveintomark.org/xml/atom.xml
|
||||
header: Content-Length: 314
|
||||
header: Connection: close
|
||||
header: Content-Type: text/html; charset=iso-8859-1
|
||||
connect: (diveintomark.org, 80)
|
||||
send: '
|
||||
GET /xml/atom.xml HTTP/1.0</samp> <span>③</span>
|
||||
<samp>Host: diveintomark.org
|
||||
User-agent: Python-urllib/2.1
|
||||
'
|
||||
reply: 'HTTP/1.1 200 OK\r\n'
|
||||
header: Date: Thu, 15 Apr 2004 22:18:21 GMT
|
||||
header: Server: Apache/2.0.49 (Debian GNU/Linux)
|
||||
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
|
||||
header: ETag: "e842a-3e53-55d97640"
|
||||
header: Accept-Ranges: bytes
|
||||
header: Content-Length: 15955
|
||||
header: Connection: close
|
||||
header: Content-Type: application/atom+xml</samp>
|
||||
<samp class=p>>>> </samp><kbd>f.status</kbd> <span>④</span>
|
||||
302
|
||||
<samp class=p>>>> </samp><kbd>f.url</kbd>
|
||||
'http://diveintomark.org/xml/atom.xml'
|
||||
</pre>
|
||||
<ol>
|
||||
<li>This is a sample <abbr>URL</abbr> I’ve set up that is configured to tell clients to <em>temporarily</em> redirect to <code>http://diveintomark.org/xml/atom.xml</code>.
|
||||
<li>The server sends back a <code>302</code> status code, indicating a temporary redirect. The temporary new location of the data is given in the <code>Location</code> header.
|
||||
<li><code>urllib2</code> calls your <code>http_error_302</code> method, which calls the ancestor method of the same name in <code>urllib2.HTTPRedirectHandler</code>, which follows the redirect to the new location. Then your <code>http_error_302</code> method stores the status code (<code>302</code>) so the calling application can get it later.
|
||||
<li>And here you are, having successfully followed the redirect to <code>http://diveintomark.org/xml/atom.xml</code>. <var>f.status</var> tells you that this was a temporary redirect, which means that you should continue to request data from the original address (<code>http://diveintomark.org/redir/example302.xml</code>). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It’s not for you to say. The server said this redirect was only temporary, so you should respect that. And now you’re exposing enough information that the calling application can respect that.
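<p>For readers on Python 3, where <code>urllib2</code> became <code>urllib.request</code>, the same idea might look roughly like the sketch below. It is an illustration rather than this chapter's own handler: the class reuses the name <code>SmartRedirectHandler</code> for continuity, <code>redirect_status</code> is a hypothetical attribute name chosen so the response's own <code>status</code> is left alone, and nothing is sent over the network.

```python
import urllib.request


class SmartRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember which status code caused them."""

    def _follow_and_record(self, parent_method, req, fp, code, msg, headers):
        # Let the stock handler follow the redirect to the new location.
        result = parent_method(req, fp, code, msg, headers)
        if result is not None:
            # redirect_status is a hypothetical attribute name (an assumption).
            result.redirect_status = code
        return result

    def http_error_301(self, req, fp, code, msg, headers):
        return self._follow_and_record(
            super().http_error_301, req, fp, code, msg, headers)

    def http_error_302(self, req, fp, code, msg, headers):
        return self._follow_and_record(
            super().http_error_302, req, fp, code, msg, headers)


# Building the opener does not touch the network; opener.open(url) would.
opener = urllib.request.build_opener(SmartRedirectHandler())
```

<p>A caller could then check <code>getattr(response, 'redirect_status', None)</code> after <code>opener.open(...)</code>: <code>302</code> would mean "keep using the original address", while <code>301</code> would mean "update it".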
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.gzip">11.8. Handling compressed data</h2>
|
||||
<p>The last important <abbr>HTTP</abbr> feature you want to support is compression. Many web services have the ability to send data compressed, which can cut down the amount of data sent over the wire by 60% or more. This is especially true of <abbr>XML</abbr> web services, since <abbr>XML</abbr> data compresses very well.
|
||||
<p>Servers won’t give you compressed data unless you tell them you can handle it.
|
||||
<div class=example><h3>Example 11.14. Telling the server you would like compressed data</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import urllib2, httplib</kbd>
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd>
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/xml/atom.xml')</kbd>
<samp class=p>>>> </samp><kbd>request.add_header('Accept-encoding', 'gzip')</kbd> <span>①</span>
<samp class=p>>>> </samp><kbd>opener = urllib2.build_opener()</kbd>
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd>
<samp>connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
Accept-encoding: gzip</samp> <span>②</span>
<samp>'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:24:39 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Vary: Accept-Encoding
header: Content-Encoding: gzip</samp> <span>③</span>
<samp>header: Content-Length: 6289</samp> <span>④</span>
<samp>header: Connection: close
header: Content-Type: application/atom+xml</samp>
</pre>
|
||||
<ol>
|
||||
<li>This is the key: once you’ve created your <code>Request</code> object, add an <code>Accept-encoding</code> header to tell the server you can accept gzip-encoded data. <code>gzip</code> is the name of the compression algorithm you’re using. In theory there could be other compression algorithms, but <code>gzip</code> is the compression algorithm used by 99% of web servers.
|
||||
<li>There’s your header going across the wire.
|
||||
<li>And here’s what the server sends back: the <code>Content-Encoding: gzip</code> header means that the data you’re about to receive has been gzip-compressed.
|
||||
<li>The <code>Content-Length</code> header is the length of the compressed data, not the uncompressed data. As you’ll see in a minute, the actual length of the uncompressed data was 15955, so gzip compression cut your bandwidth by over 60%!
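<p>As a rough Python 3 sketch of the same request (using <code>urllib.request</code> in place of <code>urllib2</code>, which is an assumption about how this example would be ported), you can build the request and inspect its headers without sending anything. The arithmetic at the end simply checks the "over 60%" figure against the two lengths quoted in this example.

```python
import urllib.request

# Build the request without sending it; the URL is the one from the example.
request = urllib.request.Request(
    'http://diveintomark.org/xml/atom.xml',
    headers={'Accept-Encoding': 'gzip'},
)

# urllib.request normalizes header names with str.capitalize().
print(request.header_items())

# Sizes reported by the server and by len(data) in this chapter's examples.
compressed_length = 6289
uncompressed_length = 15955
saving = 1 - compressed_length / uncompressed_length
print(f'{saving:.1%} fewer bytes on the wire')  # 60.6% fewer bytes on the wire
```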
|
||||
<div class=example><h3>Example 11.15. Decompressing the data</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>compresseddata = f.read()</kbd> <span>①</span>
<samp class=p>>>> </samp><kbd>len(compresseddata)</kbd>
6289
<samp class=p>>>> </samp><kbd>import StringIO</kbd>
<samp class=p>>>> </samp><kbd>compressedstream = StringIO.StringIO(compresseddata)</kbd> <span>②</span>
<samp class=p>>>> </samp><kbd>import gzip</kbd>
<samp class=p>>>> </samp><kbd>gzipper = gzip.GzipFile(fileobj=compressedstream)</kbd> <span>③</span>
<samp class=p>>>> </samp><kbd>data = gzipper.read()</kbd> <span>④</span>
<samp class=p>>>> </samp><kbd>print data</kbd> <span>⑤</span>
<samp>&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
  xmlns="http://purl.org/atom/ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xml:lang="en">
  &lt;title mode="escaped">dive into mark&lt;/title>
  &lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
…
</samp>
<samp class=p>>>> </samp><kbd>len(data)</kbd>
15955
</pre>
|
||||
<ol>
|
||||
<li>Continuing from the previous example, <var>f</var> is the file-like object returned from the <abbr>URL</abbr> opener. Using its <code>read()</code> method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first step towards getting the data you really want.
<li>OK, this step is a bit of a messy workaround. Python has a <code>gzip</code> module, which reads (and writes) gzip-compressed files on disk. But you don’t have a file on disk; you have a gzip-compressed buffer in memory, and you don’t want to write out a temporary file just so you can uncompress it. So what you’re going to do is create a file-like object out of the in-memory data (<var>compresseddata</var>), using the <code>StringIO</code> module. You first saw the <code>StringIO</code> module in <a href="#kgp.openanything.stringio.example" title="Example 10.4. Introducing StringIO">the previous chapter</a>, but now you’ve found another use for it.
|
||||
<li>Now you can create an instance of <code>GzipFile</code>, and tell it that its “file” is the file-like object <var>compressedstream</var>.
|
||||
<li>This is the line that does all the actual work: “reading” from <code>GzipFile</code> will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. <var>gzipper</var> is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; <var>gzipper</var> is really just “reading” from the file-like object you created with <code>StringIO</code> to wrap the compressed data, which is only in memory in the variable <var>compresseddata</var>. And where did that compressed data come from? You originally downloaded it from a remote <abbr>HTTP</abbr> server by “reading” from the file-like object returned by <code>opener.open</code>. And amazingly, this all just works. No step in the chain has any idea that the previous step is faking it.
|
||||
<li>Look ma, real data. (15955 bytes of it, in fact.)<p>“But wait!” I hear you cry. “This could be even easier!” I know what you’re thinking. You’re thinking that <var>opener.open</var> returns a file-like object, so why not cut out the <code>StringIO</code> middleman and just pass <var>f</var> directly to <code>GzipFile</code>? OK, maybe you weren’t thinking that, but don’t worry about it, because it doesn’t work.
|
||||
<div class=example><h3>Example 11.16. Decompressing the data directly from the server</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd> <span>①</span>
<samp class=p>>>> </samp><kbd>f.headers.get('Content-Encoding')</kbd> <span>②</span>
'gzip'
<samp class=p>>>> </samp><kbd>data = gzip.GzipFile(fileobj=f).read()</kbd> <span>③</span>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in ?
  File "c:\python23\lib\gzip.py", line 217, in read
    self._read(readsize)
  File "c:\python23\lib\gzip.py", line 252, in _read
    pos = self.fileobj.tell()   # Save current position
AttributeError: addinfourl instance has no attribute 'tell'</samp>
</pre>
|
||||
<ol>
|
||||
<li>Continuing from the previous example, you already have a <code>Request</code> object set up with an <code>Accept-encoding: gzip</code> header.
|
||||
<li>Simply opening the request will get you the headers (though not download any data yet). As you can see from the returned <code>Content-Encoding</code> header, this data has been sent gzip-compressed.
<li>Since <code>opener.open</code> returns a file-like object, and you know from the headers that when you read it, you’re going to get gzip-compressed data, why not simply pass that file-like object directly to <code>GzipFile</code>? As you “read” from the <code>GzipFile</code> instance, it will “read” compressed data from the remote <abbr>HTTP</abbr> server and decompress it on the fly. It’s a good idea, but unfortunately it doesn’t work. Because of the way gzip compression works, <code>GzipFile</code> needs to save its position and move forwards and backwards through the compressed file. This doesn’t work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream. So the inelegant hack of using <code>StringIO</code> is the best solution: download the compressed data, create a file-like object out of it with <code>StringIO</code>, and then decompress the data from that.
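<p>The download-then-wrap-then-decompress pattern can be tried without a server. The sketch below is written for Python 3, where the in-memory buffer for bytes is <code>io.BytesIO</code> rather than <code>StringIO</code>; the payload is a made-up stand-in for what <code>f.read()</code> would return, compressed locally just to have something to unpack.

```python
import gzip
import io

# Stand-in for the bytes that f.read() would return from the server.
original = b'<?xml version="1.0"?><feed>dive into mark</feed>' * 50
compresseddata = gzip.compress(original)

# Wrap the in-memory bytes in a seekable file-like object, then decompress.
compressedstream = io.BytesIO(compresseddata)
with gzip.GzipFile(fileobj=compressedstream) as gzipper:
    data = gzipper.read()

# When the whole body is already in memory, this shortcut gives the same bytes.
assert gzip.decompress(compresseddata) == data

print(len(compresseddata), len(data))
```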
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.alltogether">11.9. Putting it all together</h2>
|
||||
<p>You’ve seen all the pieces for building an intelligent <abbr>HTTP</abbr> web services client. Now let’s see how they all fit together.
|
||||
<div class=example><h3>Example 11.17. The <code>openanything</code> function</h3>
|
||||
<p>This function is defined in <code>openanything.py</code>.
|
||||
<pre><code>
def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
    # non-HTTP code omitted for brevity
    if urlparse.urlparse(source)[0] == 'http':                                       <span>①</span>
        # open URL with urllib2
        request = urllib2.Request(source)
        request.add_header('User-Agent', agent)                                      <span>②</span>
        if etag:
            request.add_header('If-None-Match', etag)                                <span>③</span>
        if lastmodified:
            request.add_header('If-Modified-Since', lastmodified)                    <span>④</span>
        request.add_header('Accept-encoding', 'gzip')                                <span>⑤</span>
        opener = urllib2.build_opener(SmartRedirectHandler(), DefaultErrorHandler()) <span>⑥</span>
        return opener.open(request)                                                  <span>⑦</span>
</code></pre>
|
||||
<ol>
|
||||
<li><code>urlparse</code> is a handy utility module for, you guessed it, parsing <abbr>URL</abbr>s. Its primary function, also called <code>urlparse</code>, takes a <abbr>URL</abbr> and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier). Of these, the only thing you care about is the scheme, to make sure that you’re dealing with an <abbr>HTTP</abbr> <abbr>URL</abbr> (which <code>urllib2</code> can handle).
|
||||
<li>You identify yourself to the <abbr>HTTP</abbr> server with the <code>User-Agent</code> passed in by the calling function. If no <code>User-Agent</code> was specified, you use a default one defined earlier in the <code>openanything.py</code> module. You never use the default one defined by <code>urllib2</code>.
|
||||
<li>If an <code>ETag</code> hash was given, send it in the <code>If-None-Match</code> header.
|
||||
<li>If a last-modified date was given, send it in the <code>If-Modified-Since</code> header.
|
||||
<li>Tell the server you would like compressed data if possible.
|
||||
<li>Build a <abbr>URL</abbr> opener that uses <em>both</em> of the custom <abbr>URL</abbr> handlers: <code>SmartRedirectHandler</code> for handling <code>301</code> and <code>302</code> redirects, and <code>DefaultErrorHandler</code> for handling <code>304</code>, <code>404</code>, and other error conditions gracefully.
|
||||
<li>That’s it! Open the <abbr>URL</abbr> and return a file-like object to the caller.
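<p>The header-building half of this function is easy to experiment with on its own. What follows is a minimal Python 3 sketch, not the chapter's <code>openanything.py</code>: <code>build_request</code> and the default <code>USER_AGENT</code> value are hypothetical names introduced here, accepting <code>https</code> as well as <code>http</code> is an added assumption, and the request is only constructed, never opened.

```python
import urllib.parse
import urllib.request

USER_AGENT = 'MyHTTPWebServicesApp/1.0'  # hypothetical default (an assumption)


def build_request(source, etag=None, lastmodified=None, agent=USER_AGENT):
    """Return a Request carrying the caching and compression headers, unsent."""
    if urllib.parse.urlparse(source).scheme not in ('http', 'https'):
        raise ValueError('not an HTTP URL: %r' % (source,))
    request = urllib.request.Request(source)
    request.add_header('User-Agent', agent)
    if etag:
        request.add_header('If-None-Match', etag)
    if lastmodified:
        request.add_header('If-Modified-Since', lastmodified)
    request.add_header('Accept-Encoding', 'gzip')
    return request


request = build_request(
    'http://diveintomark.org/xml/atom.xml',
    etag='"e842a-3e53-55d97640"',
    lastmodified='Thu, 15 Apr 2004 19:45:21 GMT',
)
for name, value in sorted(request.header_items()):
    print(f'{name}: {value}')
```

<p>Keeping request construction separate from <code>opener.open</code> makes the conditional headers testable without a live server.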
|
||||
<div class=example><h3>Example 11.18. The <code>fetch</code> function</h3>
|
||||
<p>This function is defined in <code>openanything.py</code>.
|
||||
<pre><code>
def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
    '''Fetch data and metadata from a URL, file, stream, or string'''
    result = {}
    f = openAnything(source, etag, last_modified, agent)                <span>①</span>
    result['data'] = f.read()                                           <span>②</span>
    if hasattr(f, 'headers'):
        # save ETag, if the server sent one
        result['etag'] = f.headers.get('ETag')                          <span>③</span>
        # save Last-Modified header, if the server sent one
        result['lastmodified'] = f.headers.get('Last-Modified')         <span>④</span>
        if f.headers.get('content-encoding', '') == 'gzip':             <span>⑤</span>
            # data came back gzip-compressed, decompress it
            result['data'] = gzip.GzipFile(fileobj=StringIO(result['data'])).read()
    if hasattr(f, 'url'):                                               <span>⑥</span>
        result['url'] = f.url
        result['status'] = 200
    if hasattr(f, 'status'):                                            <span>⑦</span>
        result['status'] = f.status
    f.close()
    return result
</code></pre>
|
||||
<ol>
|
||||
<li>First, you call the <code>openAnything</code> function with a <abbr>URL</abbr>, <code>ETag</code> hash, <code>Last-Modified</code> date, and <code>User-Agent</code>.
|
||||
<li>Read the actual data returned from the server. This may be compressed; if so, you’ll decompress it later.
|
||||
<li>Save the <code>ETag</code> hash returned from the server, so the calling application can pass it back to you next time, and you can pass it on to <code>openAnything</code>, which can stick it in the <code>If-None-Match</code> header and send it to the remote server.
|
||||
<li>Save the <code>Last-Modified</code> date too.
|
||||
<li>If the server says that it sent compressed data, decompress it.
|
||||
<li>If you got a <abbr>URL</abbr> back from the server, save it, and assume that the status code is <code>200</code> until you find out otherwise.
|
||||
<li>If one of the custom <abbr>URL</abbr> handlers captured a status code, then save that too.
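<p>The post-processing steps can be exercised offline as well. The Python 3 sketch below uses a hypothetical helper, <code>summarize</code>, and a simulated response (an <code>email.message.Message</code> standing in for the server's headers, plus a locally compressed body); it mirrors the shape of the result dictionary but is not the chapter's <code>fetch</code> function.

```python
import gzip
from email.message import Message


def summarize(body, headers, url=None, status=200):
    """Build a fetch-style result dict from raw bytes and response headers."""
    result = {
        'data': body,
        'etag': headers.get('ETag'),
        'lastmodified': headers.get('Last-Modified'),
        'url': url,
        'status': status,
    }
    if headers.get('Content-Encoding', '') == 'gzip':
        # The body arrived gzip-compressed, so decompress it in memory.
        result['data'] = gzip.decompress(body)
    return result


# Simulated response: header names are matched case-insensitively.
headers = Message()
headers['ETag'] = '"e842a-3e53-55d97640"'
headers['content-encoding'] = 'gzip'
result = summarize(gzip.compress(b'<feed/>'), headers,
                   url='http://diveintomark.org/xml/atom.xml')
print(result['data'], result['etag'], result['lastmodified'])
```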
|
||||
<div class=example><h3>Example 11.19. Using <code>openanything.py</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import openanything</kbd>
<samp class=p>>>> </samp><kbd>useragent = 'MyHTTPWebServicesApp/1.0'</kbd>
<samp class=p>>>> </samp><kbd>url = 'http://diveintopython3.org/redir/example301.xml'</kbd>
<samp class=p>>>> </samp><kbd>params = openanything.fetch(url, agent=useragent)</kbd> <span>①</span>
<samp class=p>>>> </samp><kbd>params</kbd> <span>②</span>
<samp>{'url': 'http://diveintomark.org/xml/atom.xml',
'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT',
'etag': '"e842a-3e53-55d97640"',
'status': 301,
'data': '&lt;?xml version="1.0" encoding="iso-8859-1"?>
&lt;feed version="0.3"
…
'}</samp>
<samp class=p>>>> </samp><kbd>if params['status'] == 301:</kbd> <span>③</span>
<samp class=p>... </samp><kbd>    url = params['url']</kbd>
<samp class=p>>>> </samp><kbd>newparams = openanything.fetch(</kbd>
<samp class=p>... </samp><kbd>    url, params['etag'], params['lastmodified'], useragent)</kbd> <span>④</span>
<samp class=p>>>> </samp><kbd>newparams</kbd>
<samp>{'url': 'http://diveintomark.org/xml/atom.xml',
'lastmodified': None,
'etag': '"e842a-3e53-55d97640"',
'status': 304,
'data': ''}</samp> <span>⑤</span>
</pre>
|
||||
<ol>
|
||||
<li>The very first time you fetch a resource, you don’t have an <code>ETag</code> hash or <code>Last-Modified</code> date, so you’ll leave those out. (They’re <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional parameters</a>.)
|
||||
<li>What you get back is a dictionary of several useful headers, the <abbr>HTTP</abbr> status code, and the actual data returned from the server. <code>openanything</code> handles the gzip compression internally; you don’t care about that at this level.
|
||||
<li>If you ever get a <code>301</code> status code, that’s a permanent redirect, and you need to update your <abbr>URL</abbr> to the new address.
|
||||
<li>The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) <abbr>URL</abbr>, the <code>ETag</code> from the last time, the <code>Last-Modified</code> date from the last time, and of course your <code>User-Agent</code>.
|
||||
<li>What you get back is again a dictionary, but the data hasn’t changed, so all you got was a <code>304</code> status code and no data.
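<p>One way a calling application might act on these dictionaries is sketched below. The helper name <code>remember</code> and the <code>cache</code> layout are hypothetical, and the two result dictionaries are copied from the session above with the feed body shortened to a stand-in string; the point is only to show the <code>301</code> and <code>304</code> branches side by side.

```python
def remember(cache, result):
    """Fold one fetch-style result dict into the caller's saved state."""
    if result['status'] == 301:
        # Permanent redirect: use the new address from now on.
        cache['url'] = result['url']
    if result['status'] == 304:
        # Not modified: keep the data and validators already saved.
        return cache
    cache['data'] = result['data']
    cache['etag'] = result.get('etag')
    cache['lastmodified'] = result.get('lastmodified')
    return cache


cache = {'url': 'http://diveintopython3.org/redir/example301.xml'}
first = {'url': 'http://diveintomark.org/xml/atom.xml',
         'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT',
         'etag': '"e842a-3e53-55d97640"',
         'status': 301,
         'data': '<?xml version="1.0" encoding="iso-8859-1"?>'}  # shortened stand-in
second = {'url': 'http://diveintomark.org/xml/atom.xml',
          'lastmodified': None,
          'etag': '"e842a-3e53-55d97640"',
          'status': 304,
          'data': ''}
remember(cache, first)
remember(cache, second)
print(cache['url'], cache['lastmodified'])
```

<p>Returning early on <code>304</code> matters here: the second response carries an empty body and no <code>Last-Modified</code> value, so overwriting the cache with it would throw away the data you already have.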
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.summary">11.10. Summary</h2>
|
||||
<p>The <code>openanything.py</code> module and its functions should now make perfect sense.
|
||||
<p>There are 5 important features of <abbr>HTTP</abbr> web services that every client should support:
|
||||
<div class=itemizedlist>
|
||||
<ul>
|
||||
<li>Identifying your application <a href="#oa.useragent" title="11.5. Setting the User-Agent">by setting a proper <code>User-Agent</code></a>.
|
||||
|
||||
<li>Handling <a href="#oa.redirect" title="11.7. Handling redirects">permanent redirects properly</a>.
|
||||
|
||||
<li>Supporting <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag"><code>Last-Modified</code> date checking</a> to avoid re-downloading data that hasn’t changed.
|
||||
|
||||
<li>Supporting <a href="#oa.etags.example" title="Example 11.9. Supporting ETag/If-None-Match"><code>ETag</code> hashes</a> to avoid re-downloading data that hasn’t changed.
|
||||
|
||||
<li>Supporting <a href="#oa.gzip" title="11.8. Handling compressed data">gzip compression</a> to reduce bandwidth even when data <em>has</em> changed.
|
||||
|
||||
</ul>
|
||||
|
||||
<p class=a>⁂
|
||||
-->
|
||||
|
||||
<h2 id=beyond-get>Beyond GET</h2>
|
||||
|
||||
<p>FIXME
|
||||
|
||||