more work in HTTP chapter

Mark Pilgrim
2009-05-31 13:14:47 -07:00
parent 2e5e078de0
commit 05911f25af
2 changed files with 193 additions and 92 deletions
+15
@@ -0,0 +1,15 @@
AddType application/xml .xml
AddType text/plain .py
AddDefaultCharset utf-8
ExpiresActive On
ExpiresDefault "access plus 1 day"
FileETag MTime Size
SetOutputFilter DEFLATE
Header unset Vary
Header add Vary Accept-Encoding
+178 -92
@@ -17,7 +17,7 @@ mark{display:inline}
<p id=level>Difficulty level: <span title=advanced>&#x2666;&#x2666;&#x2666;&#x2666;&#x2662;</span>
<h1>HTTP Web Services</h1>
<blockquote class=q>
<p><span>&#x275D;</span> FIXME <span>&#x275E;</span><br>&mdash; FIXME
<p><span>&#x275D;</span> A ruffled mind makes a restless pillow. <span>&#x275E;</span><br>&mdash; Charlotte Bront&euml;
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
@@ -52,43 +52,109 @@ mark{display:inline}
<h3 id=caching>Caching</h3>
<p>FIXME
<p>The most important thing to understand about any type of web service is that network access is incredibly expensive. I don&#8217;t mean &#8220;dollars and cents&#8221; expensive (although bandwidth ain&#8217;t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even the fastest broadband connection is orders of magnitude slower than your local network, which in turn is orders of magnitude slower than your local disk.
<p><abbr>HTTP</abbr> is designed with caching in mind. There is an entire class of devices (called &#8220;caching proxies&#8221;) whose only job is to sit between you and the rest of the world and minimize network access. Your company or <abbr>ISP</abbr> almost certainly maintains caching proxies, even if you&#8217;re unaware of them. They work because caching is built into the <abbr>HTTP</abbr> protocol.
<p>Here&#8217;s a concrete example of how caching works. You visit <a href=http://diveintomark.org/><code>diveintomark.org</code></a> in your browser. That page includes a background image, <a href=http://wearehugh.com/m.jpg><code>wearehugh.com/m.jpg</code></a>. When your browser downloads that image, the server includes the following <abbr>HTTP</abbr> headers:
<pre><code>HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
<mark>Cache-Control: max-age=31536000, public</mark>
<mark>Expires: Mon, 31 May 2010 17:14:04 GMT</mark>
Connection: close
Content-Type: image/jpeg</code></pre>
<p>The <code>Cache-Control</code> and <code>Expires</code> headers tell your browser (and any caching proxies between you and the server) that this image can be cached for up to a year. <em>A year!</em> And if, in the next year, you visit another page which also includes a link to this image, your browser will load the image from its cache <em>without generating any network activity whatsoever</em>.
<p>But wait, it gets better. Let&#8217;s say your browser purges the image from your local cache for some reason. Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the <abbr>HTTP</abbr> headers said that this data could be cached by public caching proxies (by virtue of that <code>public</code> keyword in the <code>Cache-Control</code> header). Caching proxies are designed to have tons of storage space, probably far more than your local browser has allocated.
<p>If your company or <abbr>ISP</abbr> maintains a caching proxy, the proxy may still have the image cached. When you visit <code>diveintomark.org</code> again, your browser will look in its local cache for the image, but it won&#8217;t find it, so it will make a network request to try to download it from the remote server. But if the caching proxy still has a copy of the image, it will intercept that request and serve the image from <em>its</em> cache. That means that your request will never reach the remote server; in fact, it will never leave your company&#8217;s network. That makes for a faster download (fewer network hops) and saves your company money (less data being downloaded from the outside world).
<p><abbr>HTTP</abbr> caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be.
<p>Python&#8217;s <abbr>HTTP</abbr> libraries do not support caching, but <code>httplib2</code> does.
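<p>To make the arithmetic behind those caching headers concrete, here is a minimal sketch of how a client might work out when a cached copy goes stale. The <code>fresh_until()</code> helper and the plain-dict representation of the headers are assumptions made for illustration; they are not part of any library discussed here.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def fresh_until(headers):
    """Return the moment a cached response goes stale, or None if unknown.

    `headers` is a plain dict of response headers (an assumption made for
    this sketch; real HTTP libraries use case-insensitive mappings).
    """
    date = parsedate_to_datetime(headers["Date"])
    # When both are present, Cache-Control: max-age wins over Expires.
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return date + timedelta(seconds=int(value))
    if "Expires" in headers:
        return parsedate_to_datetime(headers["Expires"])
    return None


# The headers sent along with the background image shown above.
image_headers = {
    "Date": "Sun, 31 May 2009 17:14:04 GMT",
    "Cache-Control": "max-age=31536000, public",
    "Expires": "Mon, 31 May 2010 17:14:04 GMT",
}
expiry = fresh_until(image_headers)
print(expiry.isoformat())
```

<p>For this response the two headers agree: 31536000 seconds is exactly 365 days, which lands on the same moment the <code>Expires</code> header names.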
<h3 id=last-modified>Last-Modified Checking</h3>
<p>Some data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the home page of Google.com may not change for days or even weeks (and then only when they put up a special holiday logo or advertise a new service). Web services are no different. The server knows when the data you&#8217;re requesting last changed, and <abbr>HTTP</abbr> provides a way for the server to include this last-modified date each time you request the data.
<p>Some data never changes, while other data changes all the time. In between, there is a vast field of data that <em>might</em> have changed, but hasn&#8217;t. CNN.com&#8217;s feed is updated every few minutes, but my weblog&#8217;s feed may not change for days or weeks at a time. In the latter case, I don&#8217;t want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they&#8217;re respecting my cache headers which said &#8220;don&#8217;t bother checking this feed for weeks&#8221;). On the other hand, I don&#8217;t want clients downloading my entire feed once an hour if it hasn&#8217;t changed!
<p>If you ask for the same data a second (or third or fourth) time, you can tell the server the last-modified date that you got last time. You send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn&#8217;t changed since then, the server sends back a special <abbr>HTTP</abbr> status code <code>304</code>, which means &#8220;this data hasn&#8217;t changed since the last time you asked for it.&#8221; Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn&#8217;t re-send the data</em>. All you get is the status code. So you don&#8217;t need to download the same data over and over again if it hasn&#8217;t changed; the server assumes you have the data <a href=#caching>cached locally</a>.
<p><abbr>HTTP</abbr> has a solution to this, too. When you request data for the first time, the server can send back a <code>Last-Modified</code> header. This is exactly what it sounds like: the date that the data was changed. That background image referenced from <code>diveintomark.org</code> included a <code>Last-Modified</code> header.
<p>All modern web browsers support last-modified date checking. If you&#8217;ve ever visited a page, re-visited the same page a day later and found that it hadn&#8217;t changed, and wondered why it loaded so quickly the second time &mdash; this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically sent the last-modified date it got from the server the first time. The server simply says <code>304: Not Modified</code>, so your browser knows to load the page from its cache. Web services work the same way.
<pre><code>HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
<mark>Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT</mark>
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg
</code></pre>
<p>Python&#8217;s URL libraries have no built-in support for last-modified date checking, but <code>httplib2</code> does.
<p>When you request the same data a second (or third or fourth) time, you can send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn&#8217;t changed since then, the server sends back a special <abbr>HTTP</abbr> <code>304</code> status code, which means &#8220;this data hasn&#8217;t changed since the last time you asked for it.&#8221; You can test this on the command line, using <a href=http://curl.haxx.se/>curl</a>:
<h3 id=etag>ETags</h3>
<pre class=screen>
<samp class=p>you@localhost:~$ </samp><kbd>curl -I <mark>-H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT"</mark> http://wearehugh.com/m.jpg</kbd>
<samp>HTTP/1.1 304 Not Modified
Date: Sun, 31 May 2009 18:04:39 GMT
Server: Apache
Connection: close
ETag: "3075-ddc8d800"
Expires: Mon, 31 May 2010 18:04:39 GMT
Cache-Control: max-age=31536000, public</samp></pre>
<p>ETags are an alternate way to accomplish the same thing as the <a href=#last-modified>last-modified date checking</a>. With ETags, the server sends a hash code in an <code>ETag</code> header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) The second time you request the same data, you include the ETag hash in an <code>If-None-Match</code> header of your request. If the data hasn&#8217;t changed, the server will send you back a <code>304</code> status code. As with the last-modified date checking, the server sends back <em>only</em> the <code>304</code> status code; it doesn&#8217;t send you the same data a second time. By including the ETag hash in your second request, you&#8217;re telling the server that there&#8217;s no need to re-send the same data if it still matches this hash, since <a href=#caching>you still have the data from the last time</a>.
<p>Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn&#8217;t re-send the data</em>. All you get is the status code. Even after your cached copy has expired, last-modified checking ensures that you won&#8217;t download the same data twice if it hasn&#8217;t changed. (As an extra bonus, this <code>304</code> response also includes caching headers. Proxies will keep a copy of data even after it officially &#8220;expires,&#8221; in the hopes that the data hasn&#8217;t <em>really</em> changed and the next request responds with a <code>304</code> status code and updated cache information.)
<p>Python&#8217;s URL libraries have no built-in support for ETags, but <code>httplib2</code> does.
<p>Python&#8217;s <abbr>HTTP</abbr> libraries do not support last-modified date checking, but <code>httplib2</code> does.
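<p>The same conditional request that <code>curl</code> made above can be sketched by hand with <code>urllib.request</code>. The two helper functions below are hypothetical names invented for this illustration. The sketch relies on <code>urlopen()</code> raising <code>urllib.error.HTTPError</code> for a <code>304</code> response, and it leaves storing the data and the date between requests up to you.

```python
import urllib.error
import urllib.request


def conditional_request(url, last_modified=None):
    """Build a GET request that asks for the data only if it has changed."""
    request = urllib.request.Request(url)
    if last_modified:
        # Echo back the exact Last-Modified string from the earlier response.
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, last_modified):
    """Return the new data, or None if the server answered 304 Not Modified."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_modified)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # nothing to download; keep using the saved copy
        raise


# Building the request does not touch the network; only urlopen() does.
request = conditional_request(
    "http://wearehugh.com/m.jpg",
    last_modified="Fri, 22 Aug 2008 04:28:16 GMT",
)
```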
<h3 id=etags>ETags</h3>
<p>ETags are an alternate way to accomplish the same thing as the <a href=#last-modified>last-modified checking</a>. With ETags, the server sends a hash code in an <code>ETag</code> header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from <code>diveintomark.org</code> had an <code>ETag</code> header.
<pre><code>HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
<mark>ETag: "3075-ddc8d800"</mark>
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg
</code></pre>
<p>The second time you request the same data, you include the ETag hash in an <code>If-None-Match</code> header of your request. If the data hasn&#8217;t changed, the server will send you back a <code>304</code> status code. As with the last-modified date checking, the server sends back <em>only</em> the <code>304</code> status code; it doesn&#8217;t send you the same data a second time. By including the ETag hash in your second request, you&#8217;re telling the server that there&#8217;s no need to re-send the same data if it still matches this hash, since <a href=#caching>you still have the data from the last time</a>.
<p>Python&#8217;s <abbr>HTTP</abbr> libraries do not support ETags, but <code>httplib2</code> does.
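<p>To see the exchange from the server&#8217;s side, here is a rough sketch of the decision it makes. Hashing the body with SHA-256 is purely an assumption for illustration (servers are free to build the value from the file&#8217;s modification time and size instead), and <code>make_etag()</code> and <code>respond()</code> are hypothetical names.

```python
import hashlib


def make_etag(body):
    """Derive a quoted ETag from the response body (one of many possible schemes)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body, if_none_match=None):
    """Return (status, payload) the way a server might for a conditional GET."""
    if if_none_match == make_etag(body):
        return 304, b""  # the client's copy is current; send no data
    return 200, body


feed = b"<feed>hello</feed>"
first = respond(feed)                              # no ETag yet: full response
repeat = respond(feed, make_etag(feed))            # unchanged data: 304, empty body
changed = respond(feed + b"!", make_etag(feed))    # data changed: full response again
```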
<h3 id=compression>Compression</h3>
<p>When you talk about <abbr>HTTP</abbr> web services, you&#8217;re almost always talking about moving text-based data back and forth over the wire. Maybe it&#8217;s <abbr>XML</abbr>; maybe it&#8217;s <abbr>JSON</abbr>. Regardless of the format, text compresses well. When you request a resource over <abbr>HTTP</abbr>, you can ask the server to send it in compressed format. You include the <code>Accept-encoding</code> header in your request, and if the server supports compression, it will send you back compressed data and mark it with a <code>Content-encoding</code> header.
<p>When you talk about <abbr>HTTP</abbr> web services, you&#8217;re almost always talking about moving text-based data back and forth over the wire. Maybe it&#8217;s <abbr>XML</abbr>, maybe it&#8217;s <abbr>JSON</abbr>, maybe it&#8217;s just <a href=strings.html#boring-stuff title="there ain&#8217;t no such thing as plain text">plain text</a>. Regardless of the format, text compresses well. The example feed in <a href=xml.html>the XML chapter</a> is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That&#8217;s about 31% of the original size!
<p><abbr>HTTP</abbr> supports several compression algorithms. The two most common types are <a href=http://www.ietf.org/rfc/rfc1952.txt>gzip</a> and <a href=http://www.ietf.org/rfc/rfc1951.txt>deflate</a>.
<p><abbr>HTTP</abbr> supports several compression algorithms. The two most common types are <a href=http://www.ietf.org/rfc/rfc1952.txt>gzip</a> and <a href=http://www.ietf.org/rfc/rfc1951.txt>deflate</a>. When you request a resource over <abbr>HTTP</abbr>, you can ask the server to send it in compressed format. You include an <code>Accept-encoding</code> header in your request, and if the server supports compression, it will send you back compressed data with a <code>Content-encoding</code> header that tells you which compression algorithm it used. Then it&#8217;s up to you to decompress the data.
<p>Python&#8217;s URL libraries have no built-in support for compression, but <code>httplib2</code> does.
<p>Python&#8217;s <abbr>HTTP</abbr> libraries do not support compression, but <code>httplib2</code> does.
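<p>If you did have to decompress a response yourself, the work might look something like this sketch. The <code>decode_body()</code> helper is a hypothetical name, and the fallback for raw deflate streams is an assumption about what some servers send; the sample text is made up and only stands in for a real feed.

```python
import gzip
import zlib


def decode_body(body, content_encoding=None):
    """Decompress a response body according to its Content-Encoding header."""
    encoding = (content_encoding or "identity").lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # deflate data in a zlib wrapper
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate stream
    return body  # "identity" or no header: the data was sent as-is


# Repetitive markup, like a feed, shrinks a great deal under gzip.
text = ("<entry><title>Dive into history</title></entry>\n" * 50).encode("utf-8")
compressed = gzip.compress(text)
print(len(text), len(compressed))
```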
<h3 id=redirects>Redirects</h3>
<p><a href=http://www.w3.org/Provider/Style/URI>Cool URIs don&#8217;t change</a>, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; <code>http://www.example.com/index.xml</code> becomes <code>http://server-farm-1.example.com/index.xml</code>.
<p><a href=http://www.w3.org/Provider/Style/URI>Cool URIs don&#8217;t change</a>, but many <abbr>URI</abbr>s are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; <code>http://www.example.com/index.xml</code> becomes <code>http://server-farm-1.example.com/index.xml</code>.
<p>Every time you request any kind of resource from an <abbr>HTTP</abbr> server, the server includes a status code in its response. Status code <code>200</code> means &#8220;everything&#8217;s normal, here&#8217;s the page you asked for&#8221;. Status code <code>404</code> means &#8220;page not found&#8221;. (You&#8217;ve probably seen 404 errors while browsing the web.) Status codes in the 300&#8217;s indicate some form of redirection.
<p><abbr>HTTP</abbr> has several different ways of signifying that a resource has moved. The two most common techniques are status codes <code>302</code> and <code>301</code>. Status code <code>302</code> is a <i>temporary redirect</i>; it means &#8220;oops, that got moved over here temporarily&#8221; (and then gives the temporary address in a <code>Location:</code> header). Status code <code>301</code> is a <i>permanent redirect</i>; it means &#8220;oops, that got moved permanently&#8221; (and then gives the new address in a <code>Location:</code> header). If you get a <code>302</code> status code and a new address, the <abbr>HTTP</abbr> specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you&#8217;re supposed to use the new address from then on.
<p><abbr>HTTP</abbr> has several different ways of signifying that a resource has moved. The two most common techniques are status codes <code>302</code> and <code>301</code>. Status code <code>302</code> is a <i>temporary redirect</i>; it means &#8220;oops, that got moved over here temporarily&#8221; (and then gives the temporary address in a <code>Location</code> header). Status code <code>301</code> is a <i>permanent redirect</i>; it means &#8220;oops, that got moved permanently&#8221; (and then gives the new address in a <code>Location</code> header). If you get a <code>302</code> status code and a new address, the <abbr>HTTP</abbr> specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you&#8217;re supposed to use the new address from then on.
<p>The <code>urllib</code> module will automatically &#8220;follow&#8221; redirects when it receives the appropriate status code from the <abbr>HTTP</abbr> server, but unfortunately, it doesn&#8217;t tell you that it did so. You&#8217;ll end up getting data you asked for, but you&#8217;ll never know that the underlying library &#8220;helpfully&#8221; followed a redirect for you. So you&#8217;ll continue pounding away at the old address, and each time you&#8217;ll get redirected to the new address. That&#8217;s two round trips instead of one, which is bad for the service operator and bad for you.
<p>The <code>urllib.request</code> module automatically &#8220;follows&#8221; redirects when it receives the appropriate status code from the <abbr>HTTP</abbr> server, but it doesn&#8217;t tell you that it did so. You&#8217;ll end up getting the data you asked for, but you&#8217;ll never know that the underlying library &#8220;helpfully&#8221; followed a redirect for you. So you&#8217;ll continue pounding away at the old address, and each time you&#8217;ll get redirected to the new address, and each time the <code>urllib.request</code> module will &#8220;helpfully&#8221; follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for you.
<p><code>httplib2</code> handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
<p><code>httplib2</code> handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected <abbr>URL</abbr>s before requesting them.
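<p>The bookkeeping involved in respecting permanent redirects can be sketched in a few lines. This is not how <code>httplib2</code> is implemented; the <code>RedirectMemo</code> class is a hypothetical illustration of the rule described above: remember <code>301</code> responses, forget <code>302</code> responses.

```python
class RedirectMemo:
    """Remember permanent redirects so the old address is not requested again."""

    def __init__(self):
        self.permanent = {}  # old URL -> new URL, learned from 301 responses

    def rewrite(self, url):
        """Return the address to request, following any remembered 301s."""
        seen = set()
        while url in self.permanent and url not in seen:
            seen.add(url)  # guards against a redirect loop
            url = self.permanent[url]
        return url

    def record(self, url, status, location):
        """Note a redirect response; only a 301 changes future requests."""
        if status == 301:
            self.permanent[url] = location


memo = RedirectMemo()
memo.record("http://example.com/index.xml", 301, "http://example.com/xml/atom.xml")
memo.record("http://example.com/tmp.xml", 302, "http://example.com/elsewhere.xml")
```

<p>After this, <code>memo.rewrite()</code> maps the first address straight to its new home, while the temporarily redirected address is returned unchanged and will be retried next time.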
<!--
<h3><code>User-Agent</code></h3>
@@ -99,12 +165,12 @@ mark{display:inline}
<p>Note that [FIXME-href] our little one-line script to download an Atom feed did not support any of these <abbr>HTTP</abbr> features. Let&#8217;s see how you can improve it.
<p class=a>&#x2042;
-->
<p class=a>&#x2042;
<h2 id=dont-try-this-at-home>How Not To Fetch Data Over HTTP</h2>
<p>Let&#8217;s say you want to download a resource over <abbr>HTTP</abbr>, such as <a href=xml.html>an Atom feed</a>. Being a feed, you&#8217;re not just going to download it once; you&#8217;re going to download it over and over again. Let&#8217;s do it the quick-and-dirty way first, and then see how you can do better.
<p>Let&#8217;s say you want to download a resource over <abbr>HTTP</abbr>, such as <a href=xml.html>an Atom feed</a>. Being a feed, you&#8217;re not just going to download it once; you&#8217;re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let&#8217;s do it the quick-and-dirty way first, and then see how you can do better.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import urllib.request</kbd>
<a><samp class=p>>>> </samp><kbd>data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read()</kbd> <span>&#x2460;</span></a>
@@ -122,53 +188,73 @@ mark{display:inline}
<li>Downloading anything over <abbr>HTTP</abbr> is incredibly easy in Python; in fact, it&#8217;s a one-liner. The <code>urllib.request</code> module has a handy <code>urlopen()</code> function that takes the address of the page you want, and returns a file-like object that you can just <code>read()</code> from to get the full contents of the page. It just can&#8217;t get any easier.
</ol>
<p>So what&#8217;s wrong with this? Well, for a quick one-off during testing or development, there&#8217;s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis &mdash; and remember, you said you were planning on retrieving this syndicated feed once an hour &mdash; then you&#8217;re being inefficient, and you&#8217;re being rude.
<p>So what&#8217;s wrong with this? For a quick one-off during testing or development, there&#8217;s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (<i>e.g.</i> requesting this feed once an hour), then you&#8217;re being inefficient, and you&#8217;re being rude.
<p class=a>&#x2042;
<!--
<h2 id="oa.debug">11.4. Debugging HTTP web services</h2>
<p>First, let&#8217;s turn on the debugging features of Python&#8217;s <abbr>HTTP</abbr> library and see what&#8217;s being sent over the wire. This will be useful throughout the chapter, as you add more and
more features.
<div class=example><h3>Example 11.3. Debugging HTTP</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import httplib</kbd>
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd> <span>&#x2460;</span>
<samp class=p>>>> </samp><kbd>import urllib</kbd>
<samp class=p>>>> </samp><kbd>feeddata = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()</kbd>
connect: (diveintomark.org, 80) <span>&#x2461;</span>
send: '
GET /xml/atom.xml HTTP/1.0 <span>&#x2462;</span>
Host: diveintomark.org <span>&#x2463;</span>
User-agent: Python-urllib/1.15 <span>&#x2464;</span>
'
reply: 'HTTP/1.1 200 OK\r\n' <span>&#x2465;</span>
header: Date: Wed, 14 Apr 2004 22:27:30 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Content-Type: application/atom+xml
header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT <span>&#x2466;</span>
header: ETag: "e8284-68e0-4de30f80" <span>&#x2467;</span>
header: Accept-Ranges: bytes
header: Content-Length: 26848
header: Connection: close
</pre>
<ol>
<li><code>urllib</code> relies on another standard Python library, <code>httplib</code>. Normally you don&#8217;t need to <code>import httplib</code> directly (<code>urllib</code> does that automatically), but you will here so you can set the debugging flag on the <code>HTTPConnection</code> class that <code>urllib</code> uses internally to connect to the <abbr>HTTP</abbr> server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there&#8217;s no particular standard for naming them or turning them on; you need to read
the documentation of each library to see if such a feature is available.
<li>Now that the debugging flag is set, information on the <abbr>HTTP</abbr> request and response is printed out in real time. The first
thing it tells you is that you&#8217;re connecting to the server <code>diveintomark.org</code> on port 80, which is the standard port for <abbr>HTTP</abbr>.
<li>When you request the Atom feed, <code>urllib</code> sends three lines to the server. The first line specifies the <abbr>HTTP</abbr> verb you&#8217;re using, and the path of the resource (minus
the domain name). All the requests in this chapter will use <code>GET</code>, but in the next chapter on <abbr>SOAP</abbr>, you&#8217;ll see that it uses <code>POST</code> for everything. The basic syntax is the same, regardless of the verb.
<li>The second line is the <code>Host</code> header, which specifies the domain name of the service you&#8217;re accessing. This is important, because a single <abbr>HTTP</abbr> server
can host multiple separate domains. My server currently hosts 12 domains; other servers can host hundreds or even thousands.
<li>The third line is the <code>User-Agent</code> header. What you see here is the generic <code>User-Agent</code> that the <code>urllib</code> library adds by default. In the next section, you&#8217;ll see how to customize this to be more specific.
<li>The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the <var>feeddata</var> variable). The status code here is <code>200</code>, meaning &#8220;everything&#8217;s normal, here&#8217;s the data you requested&#8221;. The server also tells you the date it responded to your request, some information about the server itself, and the content
type of the data it&#8217;s giving you. Depending on your application, this might be useful, or not. It&#8217;s certainly reassuring
that you thought you were asking for an Atom feed, and lo and behold, you&#8217;re getting an Atom feed (<code>application/atom+xml</code>, which is the registered content type for Atom feeds).
<li>The server tells you when this Atom feed was last modified (in this case, about 13 minutes ago). You can send this date back
to the server the next time you request the same feed, and the server can do last-modified checking.
<li>The server also tells you that this Atom feed has an ETag hash of <code>"e8284-68e0-4de30f80"</code>. The hash doesn&#8217;t mean anything by itself; there&#8217;s nothing you can do with it, except send it back to the server the next
time you request this same feed. Then the server can use it to tell you if the data has changed or not.
<h2 id=whats-on-the-wire>What&#8217;s On The Wire?</h2>
<p>To see why this is inefficient and rude, let&#8217;s turn on the debugging features of Python&#8217;s <abbr>HTTP</abbr> library and see what&#8217;s being sent &#8220;on the wire.&#8221;
<pre class=screen>
<samp class=p>>>> </samp><kbd>from http.client import HTTPConnection</kbd>
<a><samp class=p>>>> </samp><kbd>HTTPConnection.debuglevel = 1</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>from urllib.request import urlopen</kbd>
<a><samp class=p>>>> </samp><kbd>response = urlopen('http://diveintopython3.org/examples/feed.xml')</kbd> <span>&#x2461;</span></a>
<samp><a>send: b'GET /examples/feed.xml HTTP/1.1 <span>&#x2462;</span></a>
<a>Host: diveintopython3.org <span>&#x2463;</span></a>
<a>Accept-Encoding: identity <span>&#x2464;</span></a>
<a>User-Agent: Python-urllib/3.0' <span>&#x2465;</span></a>
Connection: close
reply: 'HTTP/1.1 200 OK'
&hellip;further debugging information omitted&hellip;</samp></pre>
<ol>
<li>As I mentioned at the beginning of the chapter, <code>urllib.request</code> relies on another standard Python library, <code>http.client</code>. Normally you don&#8217;t need to touch <code>http.client</code> directly. (The <code>urllib.request</code> module imports it automatically.) But we import it here so we can toggle the debugging flag on the <code>HTTPConnection</code> class that <code>urllib.request</code> uses to connect to the <abbr>HTTP</abbr> server.
<li>Now that the debugging flag is set, information on the <abbr>HTTP</abbr> request and response is printed out in real time. As you can see, when you request the Atom feed, the <code>urllib.request</code> module sends five lines to the server.
<li>The first line specifies the <abbr>HTTP</abbr> verb you&#8217;re using, and the path of the resource (minus the domain name).
<li>The second line specifies the domain name from which we&#8217;re requesting this feed.
<li>The third line specifies the compression algorithms that the client supports. As I mentioned earlier, <a href=#compression><code>urllib.request</code> does not support compression</a> by default.
<li>The fourth line specifies the name of the library that is making the request. By default, this is <code>Python-urllib</code> plus a version number. Both <code>urllib.request</code> and <code>httplib2</code> support changing the user agent; you&#8217;ll see how to do this later in this chapter. [FIXME really?]
</ol>
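<p>As a rough sketch of changing the user agent with <code>urllib.request</code>, you can pass your own <code>User-Agent</code> header when building a <code>Request</code> object. The agent string and the contact address in it are made-up placeholders, not anything this chapter prescribes.

```python
import urllib.request

# Headers given on the Request take the place of the library's default User-Agent.
request = urllib.request.Request(
    "http://diveintopython3.org/examples/feed.xml",
    headers={"User-Agent": "ExampleFeedReader/0.1 (+http://example.com/about)"},
)

# Nothing has been sent yet; passing `request` to urlopen() would make the call.
print(request.header_items())
```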
<p>Now let&#8217;s look at what the server sent back in its response.
<pre class=screen>
# continued from previous example
<a><samp class=p>>>> </samp><kbd>print(response.headers.as_string())</kbd> <span>&#x2460;</span></a>
<samp><a>Date: Sun, 31 May 2009 19:23:06 GMT <span>&#x2461;</span></a>
Server: Apache
<a>Last-Modified: Sun, 31 May 2009 06:39:55 GMT <span>&#x2462;</span></a>
<a>ETag: "bfe-93d9c4c0" <span>&#x2463;</span></a>
Accept-Ranges: bytes
<a>Content-Length: 3070 <span>&#x2464;</span></a>
<a>Cache-Control: max-age=86400 <span>&#x2465;</span></a>
Expires: Mon, 01 Jun 2009 19:23:06 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml</samp>
<a><samp class=p>>>> </samp><kbd>data = response.read()</kbd> <span>&#x2466;</span></a>
<samp class=p>>>> </samp><kbd>len(data)</kbd>
<samp>3070</samp></pre>
<ol>
<li>The <var>response</var> returned from the <code>urllib.request.urlopen()</code> function contains all the <abbr>HTTP</abbr> headers the server sent back. It also contains methods to download the actual data; we&#8217;ll get to that in a minute.
<li>The server tells you when it handled your request.
<li>This response includes a <a href=#last-modified><code>Last-Modified</code></a> header.
<li>This response includes an <a href=#etags><code>ETag</code></a> header.
<li>The data is 3070 bytes long. Notice what <em>isn&#8217;t</em> here: a <code>Content-encoding</code> header. Your request stated that you only accept uncompressed data (<code>Accept-encoding: identity</code>), and sure enough, this response contains uncompressed data.
<li>This response includes caching headers that state that this feed can be cached for up to 24 hours (86400 seconds).
<li>And finally, download the actual data by calling <code>response.read()</code>. As you can tell from the <code>len()</code> function, this downloads all 3070 bytes at once.
</ol>
<p>As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports <a href=#compression>gzip compression</a>, but <abbr>HTTP</abbr> compression is opt-in. We didn&#8217;t ask for it, so we didn&#8217;t get it. That means we&#8217;re downloading 3070 bytes when we could have just downloaded 941. Bad dog, no biscuit.
<p>But wait, it gets worse! To see just how inefficient this code is, let&#8217;s request the same feed a second time.
<pre class=screen>
FIXME
</pre>
<!--
<p class=a>&#x2042;
<h2 id="oa.useragent">11.5. Setting the <code>User-Agent</code></h2>
@@ -198,12 +284,12 @@ header: Connection: close
</pre>
<ol>
<li>If you still have your Python <abbr>IDE</abbr> open from the previous section&#8217;s example, you can skip this, but this turns on <a href="#oa.debug" title="11.4. Debugging HTTP web services"><abbr>HTTP</abbr> debugging</a> so you can see what you&#8217;re actually sending over the wire, and what gets sent back.
<li>Fetching an <abbr>HTTP</abbr> resource with <code>urllib2</code> is a three-step process, for good reasons that will become clear shortly. The first step is to create a <code>Request</code> object, which takes the URL of the resource you&#8217;ll eventually get around to retrieving. Note that this step doesn&#8217;t actually
<li>Fetching an <abbr>HTTP</abbr> resource with <code>urllib2</code> is a three-step process, for good reasons that will become clear shortly. The first step is to create a <code>Request</code> object, which takes the <abbr>URL</abbr> of the resource you&#8217;ll eventually get around to retrieving. Note that this step doesn&#8217;t actually
retrieve anything yet.
<li>The second step is to build a URL opener. This can take any number of handlers, which control how responses are handled.
<li>The second step is to build a <abbr>URL</abbr> opener. This can take any number of handlers, which control how responses are handled.
But you can also build an opener without any custom handlers, which is what you&#8217;re doing here. You&#8217;ll see how to define
and use custom handlers later in this chapter when you explore redirects.
<li>The final step is to tell the opener to open the URL, using the <code>Request</code> object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the
<li>The final step is to tell the opener to open the <abbr>URL</abbr>, using the <code>Request</code> object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the
resource and stores the returned data in <var>feeddata</var>.
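<p>The three steps above can be sketched in Python 3, where the pieces of <code>urllib2</code> live in <code>urllib.request</code>. This is an assumption about how the example ports, not the code from this section; the final step is left commented out so that nothing goes over the wire.

```python
import urllib.request

# Step 1: create a Request object. Nothing is retrieved yet.
request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')

# Step 2: build an opener. No custom handlers are passed here,
# so it uses only the default ones.
opener = urllib.request.build_opener()

# Step 3 would actually retrieve the resource:
# feeddata = opener.open(request).read()

print(request.full_url, request.get_method())
```

<p>Keeping the opener separate from the request is what later lets you swap in custom handlers without touching the request itself.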
<div class=example><h3>Example 11.5. Adding headers with the <code>Request</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>request</kbd> <span>&#x2460;</span>
@@ -230,10 +316,10 @@ header: Content-Length: 26848
header: Connection: close
</pre>
<ol>
<li>You&#8217;re continuing from the previous example; you&#8217;ve already created a <code>Request</code> object with the URL you want to access.
<li>You&#8217;re continuing from the previous example; you&#8217;ve already created a <code>Request</code> object with the <abbr>URL</abbr> you want to access.
<li>Using the <code>add_header</code> method on the <code>Request</code> object, you can add arbitrary <abbr>HTTP</abbr> headers to the request. The first argument is the header, the second is the value you&#8217;re
providing for that header. Convention dictates that a <code>User-Agent</code> should be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form,
and you&#8217;ll see a lot of variations in the wild, but somewhere it should include a URL of your application. The <code>User-Agent</code> is usually logged by the server along with other details of your request, and including a URL of your application allows
and you&#8217;ll see a lot of variations in the wild, but somewhere it should include a <abbr>URL</abbr> of your application. The <code>User-Agent</code> is usually logged by the server along with other details of your request, and including a <abbr>URL</abbr> of your application allows
server administrators looking through their access logs to contact you if something is wrong.
<li>The <var>opener</var> object you created before can be reused too, and it will retrieve the same feed again, but with your custom <code>User-Agent</code> header.
<li>And here&#8217;s you sending your custom <code>User-Agent</code>, in place of the generic one that Python sends by default. If you look closely, you&#8217;ll notice that you defined a <code>User-Agent</code> header, but you actually sent a <code>User-agent</code> header. See the difference? <code>urllib2</code> changed the case so that only the first letter was capitalized. It doesn&#8217;t really matter; <abbr>HTTP</abbr> specifies that header field
@@ -288,9 +374,9 @@ urllib2.HTTPError: HTTP Error 304: Not Modified</span>
server not to send you any data if it hadn&#8217;t changed, and the data didn&#8217;t change, so the server told you it wasn&#8217;t sending
you any data. That&#8217;s not an error; that&#8217;s exactly what you were hoping for.
<p><code>urllib2</code> also raises an <code>HTTPError</code> exception for conditions that you would think of as errors, such as <code>404</code> (page not found). In fact, it will raise <code>HTTPError</code> for <em>any</em> status code other than <code>200</code> (OK), <code>301</code> (permanent redirect), or <code>302</code> (temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without
throwing an exception. To do that, you&#8217;ll need to define a custom URL handler.
throwing an exception. To do that, you&#8217;ll need to define a custom <abbr>URL</abbr> handler.
<div class=example><h3>Example 11.7. Defining URL handlers</h3>
<p>This custom URL handler is part of <code>openanything.py</code>.
<p>This custom <abbr>URL</abbr> handler is part of <code>openanything.py</code>.
<pre><code>
class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>&#x2460;</span>
def http_error_default(self, req, fp, code, msg, headers): <span>&#x2461;</span>
@@ -300,7 +386,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>&#x2460;</s
return result
</pre>
<ol>
<li><code>urllib2</code> is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens
<li><code>urllib2</code> is designed around <abbr>URL</abbr> handlers. Each handler is just a class that can define any number of methods. When something happens
&mdash; like an <abbr>HTTP</abbr> error, or even a <code>304</code> code &mdash; <code>urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code>urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
<li><code>urllib2</code> searches through the defined handlers and calls the <code>http_error_default</code> method when it encounters a <code>304</code> status code from the server. By defining a custom error handler, you can prevent <code>urllib2</code> from raising an exception. Instead, you create the <code>HTTPError</code> object, but return it instead of raising it.
<li>This is the key part: before returning, you save the status code returned by the <abbr>HTTP</abbr> server. This will allow you easy access
@@ -319,8 +405,8 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>&#x2460;</s
</pre>
<ol>
<li>You&#8217;re continuing the previous example, so the <code>Request</code> object is already set up, and you&#8217;ve already added the <code>If-Modified-Since</code> header.
<li>This is the key: now that you&#8217;ve defined your custom URL handler, you need to tell <code>urllib2</code> to use it. Remember how I said that <code>urllib2</code> broke up the process of accessing an <abbr>HTTP</abbr> resource into three steps, and for good reason? This is why building the URL opener
is its own step, because you can build it with your own custom URL handlers that override <code>urllib2</code>&#8217;s default behavior.
<li>This is the key: now that you&#8217;ve defined your custom <abbr>URL</abbr> handler, you need to tell <code>urllib2</code> to use it. Remember how I said that <code>urllib2</code> broke up the process of accessing an <abbr>HTTP</abbr> resource into three steps, and for good reason? This is why building the <abbr>URL</abbr> opener
is its own step, because you can build it with your own custom <abbr>URL</abbr> handlers that override <code>urllib2</code>&#8217;s default behavior.
<li>Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use <var>seconddatastream.headers.dict</var> to access them), also contains the <abbr>HTTP</abbr> status code. In this case, as you expected, the status is <code>304</code>, meaning this data hasn&#8217;t changed since the last time you asked for it.
<li>Note that when the server sends back a <code>304</code> status code, it doesn&#8217;t re-send the data. That&#8217;s the whole point: to save bandwidth by not re-downloading data that hasn&#8217;t
changed. So if you actually want that data, you&#8217;ll need to cache it locally the first time you get it.
@@ -366,7 +452,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>&#x2460;</s
<p class=a>&#x2042;
<h2 id="oa.redirect">11.7. Handling redirects</h2>
<p>You can support permanent and temporary redirects using a different kind of custom URL handler.
<p>You can support permanent and temporary redirects using a different kind of custom <abbr>URL</abbr> handler.
<p>First, let&#8217;s see why a redirect handler is necessary in the first place.
<div class=example><h3>Example 11.10. Accessing web services without a redirect handler</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import urllib2, httplib</kbd>
@@ -421,16 +507,16 @@ AttributeError: addinfourl instance has no attribute 'status'</span>
</pre>
<ol>
<li>You&#8217;ll be better able to see what&#8217;s happening if you turn on debugging.
<li>This is a URL which I have set up to permanently redirect to my Atom feed at <code>http://diveintomark.org/xml/atom.xml</code>.
<li>This is a <abbr>URL</abbr> which I have set up to permanently redirect to my Atom feed at <code>http://diveintomark.org/xml/atom.xml</code>.
<li>Sure enough, when you try to download the data at that address, the server sends back a <code>301</code> status code, telling you that the resource has moved permanently.
<li>The server also sends back a <code>Location:</code> header that gives the new address of this data.
<li><code>urllib2</code> notices the redirect status code and automatically tries to retrieve the data at the new location specified in the <code>Location:</code> header.
<li>The server also sends back a <code>Location</code> header that gives the new address of this data.
<li><code>urllib2</code> notices the redirect status code and automatically tries to retrieve the data at the new location specified in the <code>Location</code> header.
<li>The object you get back from the <var>opener</var> contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent
address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary
or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at
the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from
now on.
<p>This is suboptimal, but easy to fix. <code>urllib2</code> doesn&#8217;t behave exactly as you want it to when it encounters a <code>301</code> or <code>302</code>, so let&#8217;s override its behavior. How? With a custom URL handler, <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag">just like you did to handle <code>304</code> codes</a>.
<p>This is suboptimal, but easy to fix. <code>urllib2</code> doesn&#8217;t behave exactly as you want it to when it encounters a <code>301</code> or <code>302</code>, so let&#8217;s override its behavior. How? With a custom <abbr>URL</abbr> handler, <a href="#oa.etags" title="11.6. Handling Last-Modified and ETag">just like you did to handle <code>304</code> codes</a>.
<div class=example><h3>Example 11.11. Defining the redirect handler</h3>
<p>This class is defined in <code>openanything.py</code>.
<pre><code>
@@ -449,10 +535,10 @@ class SmartRedirectHandler(urllib2.HTTPRedirectHandler): <span>&#x2460;</spa
</pre>
<ol>
<li>Redirect behavior is defined in <code>urllib2</code> in a class called <code>HTTPRedirectHandler</code>. You don&#8217;t want to completely override the behavior; you just want to extend it a little. So you&#8217;ll subclass <code>HTTPRedirectHandler</code> and call the ancestor class to do all the hard work.
<li>When it encounters a <code>301</code> status code from the server, <code>urllib2</code> will search through its handlers and call the <code>http_error_301</code> method. The first thing ours does is just call the <code>http_error_301</code> method in the ancestor, which handles the grunt work of looking for the <code>Location:</code> header and following the redirect to the new address.
<li>When it encounters a <code>301</code> status code from the server, <code>urllib2</code> will search through its handlers and call the <code>http_error_301</code> method. The first thing ours does is just call the <code>http_error_301</code> method in the ancestor, which handles the grunt work of looking for the <code>Location</code> header and following the redirect to the new address.
<li>Here&#8217;s the key: before you return, you store the status code (<code>301</code>), so that the calling program can access it later.
<li>Temporary redirects (status code <code>302</code>) work the same way: override the <code>http_error_302</code> method, call the ancestor, and save the status code before returning.
<p>So what has this bought us? You can now build a URL opener with the custom redirect handler, and it will still automatically
<p>So what has this bought us? You can now build a <abbr>URL</abbr> opener with the custom redirect handler, and it will still automatically
follow redirects, but now it will also expose the redirect status code.
<div class=example><h3>Example 11.12. Using the redirect handler to detect permanent redirects</h3><pre class=screen>
<samp class=p>>>> </samp><kbd>request = urllib2.Request('http://diveintomark.org/redir/example301.xml')</kbd>
@@ -495,9 +581,9 @@ header: Content-Type: application/atom+xml
'http://diveintomark.org/xml/atom.xml'
</pre>
<ol>
<li>First, build a URL opener with the redirect handler you just defined.
<li>First, build a <abbr>URL</abbr> opener with the redirect handler you just defined.
<li>You sent off a request, and you got a <code>301</code> status code in response. At this point, the <code>http_error_301</code> method gets called. You call the ancestor method, which follows the redirect and sends a request at the new location (<code>http://diveintomark.org/xml/atom.xml</code>).
<li>This is the payoff: now, not only do you have access to the new URL, but you have access to the redirect status code, so you
<li>This is the payoff: now, not only do you have access to the new <abbr>URL</abbr>, but you have access to the redirect status code, so you
can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location
(<code>http://diveintomark.org/xml/atom.xml</code>, as specified in <var>f.url</var>). If you had stored the location in a configuration file or a database, you need to update that so you don&#8217;t keep pounding
the server with requests at the old address. It&#8217;s time to update your address book.
@@ -540,8 +626,8 @@ header: Content-Type: application/atom+xml</samp>
http://diveintomark.org/xml/atom.xml
</pre>
<ol>
<li>This is a sample URL I&#8217;ve set up that is configured to tell clients to <em>temporarily</em> redirect to <code>http://diveintomark.org/xml/atom.xml</code>.
<li>The server sends back a <code>302</code> status code, indicating a temporary redirect. The temporary new location of the data is given in the <code>Location:</code> header.
<li>This is a sample <abbr>URL</abbr> I&#8217;ve set up that is configured to tell clients to <em>temporarily</em> redirect to <code>http://diveintomark.org/xml/atom.xml</code>.
<li>The server sends back a <code>302</code> status code, indicating a temporary redirect. The temporary new location of the data is given in the <code>Location</code> header.
<li><code>urllib2</code> calls your <code>http_error_302</code> method, which calls the ancestor method of the same name in <code>urllib2.HTTPRedirectHandler</code>, which follows the redirect to the new location. Then your <code>http_error_302</code> method stores the status code (<code>302</code>) so the calling application can get it later.
<li>And here you are, having successfully followed the redirect to <code>http://diveintomark.org/xml/atom.xml</code>. <var>f.status</var> tells you that this was a temporary redirect, which means that you should continue to request data from the original address
(<code>http://diveintomark.org/redir/example302.xml</code>). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It&#8217;s not for you
@@ -610,7 +696,7 @@ header: Content-Type: application/atom+xml</span>
15955
</pre>
<ol>
<li>Continuing from the previous example, <var>f</var> is the file-like object returned from the URL opener. Using its <code>read()</code> method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first
<li>Continuing from the previous example, <var>f</var> is the file-like object returned from the <abbr>URL</abbr> opener. Using its <code>read()</code> method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first
step towards getting the data you really want.
<li>OK, this step is a bit of a messy workaround. Python has a <code>gzip</code> module, which reads (and actually writes) gzip-compressed files on disk. But you don&#8217;t have a file on disk, you have a gzip-compressed
buffer in memory, and you don&#8217;t want to write out a temporary file just so you can uncompress it. So what you&#8217;re going to
@@ -662,14 +748,14 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
return opener.open(request) <span>&#x2466;</span>
</pre>
<ol>
<li><code>urlparse</code> is a handy utility module for, you guessed it, parsing URLs. Its primary function, also called <code>urlparse</code>, takes a URL and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier).
Of these, the only thing you care about is the scheme, to make sure that you&#8217;re dealing with an <abbr>HTTP</abbr> URL (which <code>urllib2</code> can handle).
<li><code>urlparse</code> is a handy utility module for, you guessed it, parsing <abbr>URL</abbr>s. Its primary function, also called <code>urlparse</code>, takes a <abbr>URL</abbr> and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier).
Of these, the only thing you care about is the scheme, to make sure that you&#8217;re dealing with an <abbr>HTTP</abbr> <abbr>URL</abbr> (which <code>urllib2</code> can handle).
<li>You identify yourself to the <abbr>HTTP</abbr> server with the <code>User-Agent</code> passed in by the calling function. If no <code>User-Agent</code> was specified, you use a default one defined earlier in the <code>openanything.py</code> module. You never use the default one defined by <code>urllib2</code>.
<li>If an <code>ETag</code> hash was given, send it in the <code>If-None-Match</code> header.
<li>If a last-modified date was given, send it in the <code>If-Modified-Since</code> header.
<li>Tell the server you would like compressed data if possible.
<li>Build a URL opener that uses <em>both</em> of the custom URL handlers: <code>SmartRedirectHandler</code> for handling <code>301</code> and <code>302</code> redirects, and <code>DefaultErrorHandler</code> for handling <code>304</code>, <code>404</code>, and other error conditions gracefully.
<li>That&#8217;s it! Open the URL and return a file-like object to the caller.
<li>Build a <abbr>URL</abbr> opener that uses <em>both</em> of the custom <abbr>URL</abbr> handlers: <code>SmartRedirectHandler</code> for handling <code>301</code> and <code>302</code> redirects, and <code>DefaultErrorHandler</code> for handling <code>304</code>, <code>404</code>, and other error conditions gracefully.
<li>That&#8217;s it! Open the <abbr>URL</abbr> and return a file-like object to the caller.
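<p>The request-building steps above can be sketched in Python 3 without touching the network. The function name <code>build_request</code> and the default <code>USER_AGENT</code> value are assumptions made for illustration, and accepting <code>https</code> alongside <code>http</code> is a choice of this sketch rather than something the original function does.

```python
import urllib.parse
import urllib.request

USER_AGENT = 'MyHTTPWebServicesApp/1.0'  # hypothetical default


def build_request(source, etag=None, lastmodified=None, agent=USER_AGENT):
    """Build (but do not send) a conditional request that accepts gzip."""
    # Only the scheme matters here; assumption: https is acceptable too.
    scheme = urllib.parse.urlparse(source).scheme
    if scheme not in ('http', 'https'):
        raise ValueError('not an HTTP URL: %r' % source)
    request = urllib.request.Request(source)
    request.add_header('User-Agent', agent)
    if etag:
        request.add_header('If-None-Match', etag)
    if lastmodified:
        request.add_header('If-Modified-Since', lastmodified)
    request.add_header('Accept-encoding', 'gzip')
    return request


request = build_request('http://diveintomark.org/xml/atom.xml',
                        etag='"bfe-93d9c4c0"')
print(sorted(request.header_items()))
```

<p>Note that <code>Request</code> stores header names with only the first letter capitalized, so <code>If-None-Match</code> comes back as <code>If-none-match</code>. That is harmless, since <abbr>HTTP</abbr> header field names are case-insensitive.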
<div class=example><h3>Example 11.18. The <code>fetch</code> function</h3>
<p>This function is defined in <code>openanything.py</code>.
<pre><code>
@@ -695,13 +781,13 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
return result
</pre>
<ol>
<li>First, you call the <code>openAnything</code> function with a URL, <code>ETag</code> hash, <code>Last-Modified</code> date, and <code>User-Agent</code>.
<li>First, you call the <code>openAnything</code> function with a <abbr>URL</abbr>, <code>ETag</code> hash, <code>Last-Modified</code> date, and <code>User-Agent</code>.
<li>Read the actual data returned from the server. This may be compressed; if so, you&#8217;ll decompress it later.
<li>Save the <code>ETag</code> hash returned from the server, so the calling application can pass it back to you next time, and you can pass it on to <code>openAnything</code>, which can stick it in the <code>If-None-Match</code> header and send it to the remote server.
<li>Save the <code>Last-Modified</code> date too.
<li>If the server says that it sent compressed data, decompress it.
<li>If you got a URL back from the server, save it, and assume that the status code is <code>200</code> until you find out otherwise.
<li>If one of the custom URL handlers captured a status code, then save that too.
<li>If you got a <abbr>URL</abbr> back from the server, save it, and assume that the status code is <code>200</code> until you find out otherwise.
<li>If one of the custom <abbr>URL</abbr> handlers captured a status code, then save that too.
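<p>The decompression step is simpler in Python 3, because <code>gzip.decompress()</code> works directly on an in-memory buffer. A minimal sketch, assuming the caller passes in the raw body and the value of the <code>Content-Encoding</code> header; <code>decode_body</code> is a hypothetical helper, not part of <code>openanything.py</code>.

```python
import gzip


def decode_body(data, content_encoding=''):
    """Return the body as plain bytes, gunzipping it only if the server
    said it sent gzip-compressed data."""
    if content_encoding.lower() == 'gzip':
        return gzip.decompress(data)
    return data


body = b'<feed/>'
# A gzip-encoded body is decompressed; anything else passes through unchanged.
print(decode_body(gzip.compress(body), 'gzip') == body)
print(decode_body(body) == body)
```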
<div class=example><h3>Example 11.19. Using <code>openanything.py</code></h3><pre class=screen>
<samp class=p>>>> </samp><kbd>import openanything</kbd>
<samp class=p>>>> </samp><kbd>useragent = 'MyHTTPWebServicesApp/1.0'</kbd>
@@ -731,8 +817,8 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
<li>The very first time you fetch a resource, you don&#8217;t have an <code>ETag</code> hash or <code>Last-Modified</code> date, so you&#8217;ll leave those out. (They&#8217;re <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional parameters</a>.)
<li>What you get back is a dictionary of several useful headers, the <abbr>HTTP</abbr> status code, and the actual data returned from the server.
<code>openanything</code> handles the gzip compression internally; you don&#8217;t care about that at this level.
<li>If you ever get a <code>301</code> status code, that&#8217;s a permanent redirect, and you need to update your URL to the new address.
<li>The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) URL, the
<li>If you ever get a <code>301</code> status code, that&#8217;s a permanent redirect, and you need to update your <abbr>URL</abbr> to the new address.
<li>The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) <abbr>URL</abbr>, the
<code>ETag</code> from the last time, the <code>Last-Modified</code> date from the last time, and of course your <code>User-Agent</code>.
<li>What you get back is again a dictionary, but the data hasn&#8217;t changed, so all you got was a <code>304</code> status code and no data.