mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
f6961f7e7e
--HG-- rename : html5.js => j/html5.js
633 lines
41 KiB
HTML
<!DOCTYPE html>
|
|
<head>
|
|
<meta charset=utf-8>
|
|
<title>HTTP Web Services - Dive into Python 3</title>
|
|
<!--[if IE]><script src=j/html5.js></script><![endif]-->
|
|
<link rel=stylesheet href=dip3.css>
|
|
<style>
|
|
body{counter-reset:h1 15}
|
|
mark{display:inline}
|
|
</style>
|
|
<link rel=stylesheet media='only screen and (max-device-width: 480px)' href=mobile.css>
|
|
<link rel=stylesheet media=print href=print.css>
|
|
<meta name=viewport content='initial-scale=1.0'>
|
|
</head>
|
|
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=25> <input type=submit name=root value=Search></div></form>
|
|
<p>You are here: <a href=index.html>Home</a> <span>‣</span> <a href=table-of-contents.html#http-web-services>Dive Into Python 3</a> <span>‣</span>
|
|
<p id=level>Difficulty level: <span title=advanced>♦♦♦♦♢</span>
|
|
<h1>HTTP Web Services</h1>
|
|
<blockquote class=q>
|
|
<p><span>❝</span> A ruffled mind makes a restless pillow. <span>❞</span><br>— Charlotte Brontë
|
|
</blockquote>
|
|
<p id=toc>
|
|
<h2 id=divingin>Diving In</h2>
|
|
<p class=f>HTTP web services are programmatic ways of sending and receiving data from remote servers using nothing but the operations of <abbr>HTTP</abbr>. If you want to get data from the server, use <abbr>HTTP</abbr> <code>GET</code>; if you want to send new data to the server, use <abbr>HTTP</abbr> <code>POST</code>. Some more advanced <abbr>HTTP</abbr> web service <abbr>API</abbr>s also define ways of modifying existing data and deleting data, using <abbr>HTTP</abbr> <code>PUT</code> and <abbr>HTTP</abbr> <code>DELETE</code>. In other words, the “verbs” built into the <abbr>HTTP</abbr> protocol (<code>GET</code>, <code>POST</code>, <code>PUT</code>, and <code>DELETE</code>) map directly to application-level operations for retrieving, creating, modifying, and deleting data.
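<p>As a rough illustration of that mapping, the sketch below builds (but never sends) one request per verb with <code>urllib.request.Request</code>. The <code>example.com</code> address, the <code>CRUD_VERBS</code> table, and the <code>build_request()</code> helper are made up for this example; a real web service defines its own <abbr>URL</abbr>s and may not support every verb.

```python
import urllib.request

# Hypothetical table pairing each application-level operation with an HTTP verb.
CRUD_VERBS = {
    'retrieve': 'GET',
    'create': 'POST',
    'modify': 'PUT',
    'delete': 'DELETE',
}

def build_request(operation, url, body=None):
    """Build (without sending) a request whose verb matches the operation."""
    return urllib.request.Request(url, data=body, method=CRUD_VERBS[operation])

# Placeholder URL; nothing is sent over the network here.
built = {op: build_request(op, 'http://example.com/items/1') for op in CRUD_VERBS}
for op, req in built.items():
    print(op, '->', req.get_method(), req.full_url)
```

<p>Passing any of these objects to <code>urllib.request.urlopen()</code> would actually send the request; the sketch stops short of that so it can run without a network connection.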
|
|
|
|
<p>The main advantage of this approach is simplicity, and its simplicity has proven popular. Data (usually <abbr>XML</abbr> data) can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an <abbr>HTTP</abbr> library for downloading it. Debugging is also easier; because each “call” to the web service has a unique <abbr>URL</abbr>, you can load it in your web browser and immediately see the raw data.
|
|
|
|
<p>Examples of <abbr>HTTP</abbr> web services:
|
|
<ul>
|
|
<li><a href=http://code.google.com/apis/gdata/>Google Data <abbr>API</abbr>s</a> allow you to interact with a wide variety of Google services, including <a href=http://www.blogger.com/>Blogger</a> and <a href=http://www.youtube.com/>YouTube</a>.
|
|
<li><a href=http://www.flickr.com/services/api/>Flickr Services</a> allow you to upload and download photos from <a href=http://www.flickr.com/>Flickr</a>.
|
|
<li><a href=http://apiwiki.twitter.com/>Twitter <abbr>API</abbr></a> allows you to publish status updates on <a href=http://twitter.com/>Twitter</a>.
|
|
<li><a href="http://www.programmableweb.com/apis/directory/1?sort=mashups">…and many more</a>
|
|
</ul>
|
|
|
|
<p>Python 3 comes with two different libraries for interacting with <abbr>HTTP</abbr> web services:
|
|
|
|
<ul>
|
|
<li><a href=http://docs.python.org/3.0/library/http.client.html><code>http.client</code></a> is a low-level library that implements <a href=http://www.w3.org/Protocols/rfc2616/rfc2616.html><abbr>RFC</abbr> 2616</a>, the <abbr>HTTP</abbr> protocol.
|
|
<li><a href=http://docs.python.org/3.0/library/urllib.request.html><code>urllib.request</code></a> is an abstraction layer built on top of <code>http.client</code>. It provides a standard <abbr>API</abbr> for accessing both <abbr>HTTP</abbr> and <abbr>FTP</abbr> servers, automatically follows <abbr>HTTP</abbr> redirects, and handles some common forms of <abbr>HTTP</abbr> authentication.
|
|
</ul>
|
|
|
|
<p>So which one should you use? Neither of them. Instead, you should use <a href=http://code.google.com/p/httplib2/><code>httplib2</code></a>, an open source third-party library that implements <abbr>HTTP</abbr> more fully than <code>http.client</code> but provides a better abstraction than <code>urllib.request</code>.
|
|
|
|
<p>To understand why <code>httplib2</code> is the right choice, you first need to understand <abbr>HTTP</abbr>.
|
|
|
|
<p class=a>⁂
|
|
|
|
<h2 id=http-features>Features of HTTP</h2>
|
|
|
|
<p>There are five important features which all <abbr>HTTP</abbr> clients should support.
|
|
|
|
<h3 id=caching>Caching</h3>
|
|
|
|
<p>The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even the fastest broadband connection is orders of magnitude slower than your local network, which in turn is orders of magnitude slower than your local disk.
|
|
|
|
<p><abbr>HTTP</abbr> is designed with caching in mind. There is an entire class of devices (called “caching proxies”) whose only job is to sit between you and the rest of the world and minimize network access. Your company or <abbr>ISP</abbr> almost certainly maintains caching proxies, even if you’re unaware of them. They work because caching is built into the <abbr>HTTP</abbr> protocol.
|
|
|
|
<p>Here’s a concrete example of how caching works. You visit <a href=http://diveintomark.org/><code>diveintomark.org</code></a> in your browser. That page includes a background image, <a href=http://wearehugh.com/m.jpg><code>wearehugh.com/m.jpg</code></a>. When your browser downloads that image, the server includes the following <abbr>HTTP</abbr> headers:
|
|
|
|
<pre><code>HTTP/1.1 200 OK
|
|
Date: Sun, 31 May 2009 17:14:04 GMT
|
|
Server: Apache
|
|
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
|
|
ETag: "3075-ddc8d800"
|
|
Accept-Ranges: bytes
|
|
Content-Length: 12405
|
|
<mark>Cache-Control: max-age=31536000, public</mark>
|
|
<mark>Expires: Mon, 31 May 2010 17:14:04 GMT</mark>
|
|
Connection: close
|
|
Content-Type: image/jpeg</code></pre>
|
|
|
|
<p>The <code>Cache-Control</code> and <code>Expires</code> headers tell your browser (and any caching proxies between you and the server) that this image can be cached for up to a year. <em>A year!</em> And if, in the next year, you visit another page which also includes a link to this image, your browser will load the image from its cache <em>without generating any network activity whatsoever</em>.
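<p>To check the “year” claim, here is a small sketch that reads the two caching headers from the response above and works out the lifetime each one implies. The <code>max_age_seconds()</code> helper is hypothetical and only understands the simple <code>max-age=N</code> form; a real client has to handle many more <code>Cache-Control</code> directives.

```python
from email.utils import parsedate_to_datetime

def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    for directive in cache_control.split(','):
        name, _, value = directive.strip().partition('=')
        if name.lower() == 'max-age' and value.isdigit():
            return int(value)
    return None

# Header values copied from the response shown above.
headers = {
    'Date': 'Sun, 31 May 2009 17:14:04 GMT',
    'Cache-Control': 'max-age=31536000, public',
    'Expires': 'Mon, 31 May 2010 17:14:04 GMT',
}

max_age = max_age_seconds(headers['Cache-Control'])
lifetime = parsedate_to_datetime(headers['Expires']) - parsedate_to_datetime(headers['Date'])

print(max_age // 86400, 'days according to max-age')
print(lifetime.days, 'days according to Expires')
```

<p>Both headers describe the same 365-day lifetime; <code>max-age</code> expresses it as a number of seconds, while <code>Expires</code> expresses it as an absolute date.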
|
|
|
|
<p>But wait, it gets better. Let’s say your browser purges the image from your local cache for some reason. Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the <abbr>HTTP</abbr> headers said that this data could be cached by public caching proxies (by virtue of that <code>public</code> keyword in the <code>Cache-Control</code> header). Caching proxies are designed to have tons of storage space, probably far more than your local browser has allocated.
|
|
|
|
<p>If your company or <abbr>ISP</abbr> maintains a caching proxy, the proxy may still have the image cached. When you visit <code>diveintomark.org</code> again, your browser will look in its local cache for the image, but it won’t find it, so it will make a network request to try to download it from the remote server. But if the caching proxy still has a copy of the image, it will intercept that request and serve the image from <em>its</em> cache. That means that your request will never reach the remote server; in fact, it will never leave your company’s network. That makes for a faster download (fewer network hops) and saves your company money (less data being downloaded from the outside world).
|
|
|
|
<p><abbr>HTTP</abbr> caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be.
|
|
|
|
<p>Python’s <abbr>HTTP</abbr> libraries do not support caching, but <code>httplib2</code> does.
|
|
|
|
<h3 id=last-modified>Last-Modified Checking</h3>
|
|
|
|
<p>Some data never changes, while other data changes all the time. In between, there is a vast field of data that <em>might</em> have changed, but hasn’t. CNN.com’s feed is updated every few minutes, but my weblog’s feed may not change for days or weeks at a time. In the latter case, I don’t want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they’re respecting my cache headers which said “don’t bother checking this feed for weeks”). On the other hand, I don’t want clients downloading my entire feed once an hour if it hasn’t changed!
|
|
|
|
<p><abbr>HTTP</abbr> has a solution to this, too. When you request data for the first time, the server can send back a <code>Last-Modified</code> header. This is exactly what it sounds like: the date that the data was last changed. That background image referenced from <code>diveintomark.org</code> included a <code>Last-Modified</code> header.
|
|
|
|
<pre><code>HTTP/1.1 200 OK
|
|
Date: Sun, 31 May 2009 17:14:04 GMT
|
|
Server: Apache
|
|
<mark>Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT</mark>
|
|
ETag: "3075-ddc8d800"
|
|
Accept-Ranges: bytes
|
|
Content-Length: 12405
|
|
Cache-Control: max-age=31536000, public
|
|
Expires: Mon, 31 May 2010 17:14:04 GMT
|
|
Connection: close
|
|
Content-Type: image/jpeg
|
|
</code></pre>
|
|
|
|
<p>When you request the same data a second (or third or fourth) time, you can send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special <abbr>HTTP</abbr> <code>304</code> status code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the command line, using <a href=http://curl.haxx.se/>curl</a>:
|
|
|
|
<pre class=screen>
|
|
<samp class=p>you@localhost:~$ </samp><kbd>curl -I <mark>-H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT"</mark> http://wearehugh.com/m.jpg</kbd>
|
|
<samp>HTTP/1.1 304 Not Modified
|
|
Date: Sun, 31 May 2009 18:04:39 GMT
|
|
Server: Apache
|
|
Connection: close
|
|
ETag: "3075-ddc8d800"
|
|
Expires: Mon, 31 May 2010 18:04:39 GMT
|
|
Cache-Control: max-age=31536000, public</samp></pre>
|
|
|
|
<p>Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn’t re-send the data</em>. All you get is the status code. Even after your cached copy has expired, last-modified checking ensures that you won’t download the same data twice if it hasn’t changed. (As an extra bonus, this <code>304</code> response also includes caching headers. Proxies will keep a copy of data even after it officially “expires,” in the hopes that the data hasn’t <em>really</em> changed and the next request responds with a <code>304</code> status code and updated cache information.)
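<p>The decision the server makes can be modeled in a few lines. This is a simplified sketch, not how any particular server is implemented: <code>conditional_status()</code> is a hypothetical helper that compares the two dates and returns the status code the server would send.

```python
from email.utils import parsedate_to_datetime

def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is still current, otherwise 200."""
    if if_modified_since is None:
        return 200  # no conditional header: always send the full data
    changed_at = parsedate_to_datetime(last_modified)
    client_has = parsedate_to_datetime(if_modified_since)
    return 304 if changed_at <= client_has else 200

# Last-Modified value from the image response shown earlier.
LAST_MODIFIED = 'Fri, 22 Aug 2008 04:28:16 GMT'

first_visit = conditional_status(LAST_MODIFIED)
same_date = conditional_status(LAST_MODIFIED, 'Fri, 22 Aug 2008 04:28:16 GMT')
stale_copy = conditional_status(LAST_MODIFIED, 'Thu, 21 Aug 2008 00:00:00 GMT')

print(first_visit, same_date, stale_copy)
```

<p>Only the middle case, where the client sends back the date it was given, lets the server skip re-sending the data.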
|
|
|
|
<p>Python’s <abbr>HTTP</abbr> libraries do not support last-modified date checking, but <code>httplib2</code> does.
|
|
|
|
<h3 id=etags>ETags</h3>
|
|
|
|
<p>ETags are an alternate way to accomplish the same thing as <a href=#last-modified>last-modified checking</a>. With ETags, the server sends a hash code in an <code>ETag</code> header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from <code>diveintomark.org</code> had an <code>ETag</code> header.
|
|
|
|
<pre><code>HTTP/1.1 200 OK
|
|
Date: Sun, 31 May 2009 17:14:04 GMT
|
|
Server: Apache
|
|
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
|
|
<mark>ETag: "3075-ddc8d800"</mark>
|
|
Accept-Ranges: bytes
|
|
Content-Length: 12405
|
|
Cache-Control: max-age=31536000, public
|
|
Expires: Mon, 31 May 2010 17:14:04 GMT
|
|
Connection: close
|
|
Content-Type: image/jpeg
|
|
</code></pre>
|
|
|
|
<p>The second time you request the same data, you include the ETag hash in an <code>If-None-Match</code> header of your request. If the data hasn’t changed, the server will send you back a <code>304</code> status code. As with the last-modified date checking, the server sends back <em>only</em> the <code>304</code> status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since <a href=#caching>you still have the data from the last time</a>.
|
|
|
|
<p>Again with the <kbd>curl</kbd>:
|
|
|
|
<pre class=screen>
|
|
<a><samp class=p>you@localhost:~$ </samp><kbd>curl -I <mark>-H "If-None-Match: \"3075-ddc8d800\""</mark> http://wearehugh.com/m.jpg</kbd> <span>①</span></a>
|
|
<samp>HTTP/1.1 304 Not Modified
|
|
Date: Sun, 31 May 2009 18:04:39 GMT
|
|
Server: Apache
|
|
Connection: close
|
|
ETag: "3075-ddc8d800"
|
|
Expires: Mon, 31 May 2010 18:04:39 GMT
|
|
Cache-Control: max-age=31536000, public</samp></pre>
|
|
<ol>
|
|
<li>ETags are commonly enclosed in quotation marks, but <em>the quotation marks are part of the value</em>. They are not delimiters; the only delimiter in the <code>ETag</code> header is the colon between <code>ETag</code> and <code>"3075-ddc8d800"</code>. That means you need to send the quotation marks back to the server in the <code>If-None-Match</code> header.
|
|
</ol>
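<p>The following sketch shows why the quotation marks matter. Since the server may compute an ETag any way it likes, <code>make_etag()</code> is just one made-up scheme (a truncated SHA-256 hash), and <code>status_for()</code> is a hypothetical stand-in for the server’s comparison. It handles a single exact ETag only, which is a simplification of what <code>If-None-Match</code> allows.

```python
import hashlib

def make_etag(body):
    """One possible ETag scheme: a quoted, truncated hash of the body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def status_for(body, if_none_match=None):
    """Return 304 when the client's ETag matches the current body, else 200."""
    return 304 if if_none_match == make_etag(body) else 200

image = b'pretend these bytes are m.jpg'
etag = make_etag(image)

no_etag = status_for(image)                        # first request, nothing to send
with_quotes = status_for(image, etag)              # value sent back verbatim
without_quotes = status_for(image, etag.strip('"'))  # quotes dropped by mistake
after_change = status_for(image + b'!', etag)      # data changed on the server

print(no_etag, with_quotes, without_quotes, after_change)
```

<p>Only the request that echoes the ETag exactly as received, quotation marks and all, gets the <code>304</code>.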
|
|
|
|
<p>Python’s <abbr>HTTP</abbr> libraries do not support ETags, but <code>httplib2</code> does.
|
|
|
|
<h3 id=compression>Compression</h3>
|
|
|
|
<p>When you talk about <abbr>HTTP</abbr> web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s <abbr>XML</abbr>, maybe it’s <abbr>JSON</abbr>, maybe it’s just <a href=strings.html#boring-stuff title="there ain’t no such thing as plain text">plain text</a>. Regardless of the format, text compresses well. The example feed in <a href=xml.html>the XML chapter</a> is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That’s about 31% of the original size!
|
|
|
|
<p><abbr>HTTP</abbr> supports several compression algorithms. The two most common types are <a href=http://www.ietf.org/rfc/rfc1952.txt>gzip</a> and <a href=http://www.ietf.org/rfc/rfc1951.txt>deflate</a>. When you request a resource over <abbr>HTTP</abbr>, you can ask the server to send it in compressed format. You include an <code>Accept-encoding</code> header in your request that lists which compression algorithms you support. If the server supports any of the same algorithms, it will send you back compressed data (with a <code>Content-encoding</code> header that tells you which algorithm it used). Then it’s up to you to decompress the data.
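<p>“Up to you to decompress” might look something like the sketch below. The <code>decode_body()</code> helper is hypothetical, and the repetitive <abbr>XML</abbr> is made-up stand-in data, so the sizes it prints are not the feed’s real numbers. It also assumes a <code>deflate</code> body arrives in zlib format, which is not true of every server.

```python
import gzip
import zlib

def decode_body(body, content_encoding=None):
    """Undo the compression named by a Content-Encoding header value."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    if content_encoding == 'deflate':
        # Assumes zlib-wrapped data; some servers send raw deflate instead.
        return zlib.decompress(body)
    return body  # 'identity' or no header: data was sent as-is

# Made-up, repetitive XML standing in for a feed.
original = b'<entry><title>dive into mark</title></entry>\n' * 50
compressed = gzip.compress(original)

print(len(original), 'bytes uncompressed')
print(len(compressed), 'bytes after gzip')
```

<p>A client would send <code>Accept-Encoding: gzip, deflate</code> with its request, then pass the response body and the value of the <code>Content-Encoding</code> response header to a function like this one.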
|
|
|
|
<p>Python’s <abbr>HTTP</abbr> libraries do not support compression, but <code>httplib2</code> does.
|
|
|
|
<h3 id=redirects>Redirects</h3>
|
|
|
|
<p><a href=http://www.w3.org/Provider/Style/URI>Cool <abbr>URI</abbr>s don’t change</a>, but many <abbr>URI</abbr>s are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; <code>http://www.example.com/index.xml</code> becomes <code>http://server-farm-1.example.com/index.xml</code>.
|
|
|
|
<p>Every time you request any kind of resource from an <abbr>HTTP</abbr> server, the server includes a status code in its response. Status code <code>200</code> means “everything’s normal, here’s the page you asked for”. Status code <code>404</code> means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.
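<p>Python’s standard library knows the official name of each status code. As a quick sketch, the hypothetical <code>describe()</code> helper below looks up the reason phrase with <code>http.HTTPStatus</code> and reports whether the code falls in the 300’s.

```python
from http import HTTPStatus

def describe(code):
    """Return the standard reason phrase and whether the code is in the 300's."""
    status = HTTPStatus(code)
    return status.phrase, 300 <= code < 400

for code in (200, 301, 302, 304, 404):
    phrase, in_300s = describe(code)
    print(code, phrase, '(3xx)' if in_300s else '')
```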
|
|
|
|
<p><abbr>HTTP</abbr> has several different ways of signifying that a resource has moved. The two most common techniques are status codes <code>302</code> and <code>301</code>. Status code <code>302</code> is a <i>temporary redirect</i>; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a <code>Location</code> header). Status code <code>301</code> is a <i>permanent redirect</i>; it means “oops, that got moved permanently” (and then gives the new address in a <code>Location</code> header). If you get a <code>302</code> status code and a new address, the <abbr>HTTP</abbr> specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you’re supposed to use the new address from then on.
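<p>The difference between the two codes comes down to what the client remembers afterwards. Here is a minimal sketch of that bookkeeping; the <code>RedirectTracker</code> class is invented for this example and is not part of any library.

```python
class RedirectTracker:
    """Remember permanent (301) redirects and ignore temporary (302) ones."""

    def __init__(self):
        self.permanent = {}

    def record(self, status, old_url, new_url):
        # Only a 301 changes where future requests should go.
        if status == 301:
            self.permanent[old_url] = new_url

    def resolve(self, url):
        # Follow chains of recorded 301s, guarding against loops.
        seen = set()
        while url in self.permanent and url not in seen:
            seen.add(url)
            url = self.permanent[url]
        return url

tracker = RedirectTracker()
tracker.record(301, 'http://example.com/index.xml', 'http://example.com/xml/atom.xml')
tracker.record(302, 'http://www.example.com/index.xml', 'http://server-farm-1.example.com/index.xml')

moved_for_good = tracker.resolve('http://example.com/index.xml')
moved_for_now = tracker.resolve('http://www.example.com/index.xml')
print(moved_for_good)
print(moved_for_now)
```

<p>The next request for the permanently moved feed goes straight to the new address, while the temporarily moved one is still requested at its old address.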
|
|
|
|
<p>The <code>urllib.request</code> module automatically “follows” redirects when it receives the appropriate status code from the <abbr>HTTP</abbr> server, but it doesn’t tell you that it did so. You’ll end up getting the data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address, and each time the <code>urllib.request</code> module will “helpfully” follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for you.
|
|
|
|
<p><code>httplib2</code> handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected <abbr>URL</abbr>s before requesting them.
|
|
|
|
<!--
|
|
<h3><code>User-Agent</code></h3>
|
|
|
|
<p>The <code>User-Agent</code> is simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web service over <abbr>HTTP</abbr>. When the client requests a resource, it should always announce who it is, as specifically as possible. This helps the server-side administrator figure out who to contact when things go fantastically wrong.
|
|
|
|
<p>By default, Python sends a generic <code>User-Agent</code>: <code>Python-urllib/1.15</code>. In the next section, you’ll see how to change this to something more specific.
|
|
|
|
<p>Note that [FIXME-href] our little one-line script to download an Atom feed did not support any of these <abbr>HTTP</abbr> features. Let’s see how you can improve it.
|
|
|
|
-->
|
|
<p class=a>⁂
|
|
|
|
<h2 id=dont-try-this-at-home>How Not To Fetch Data Over HTTP</h2>
|
|
|
|
<p>Let’s say you want to download a resource over <abbr>HTTP</abbr>, such as <a href=xml.html>an Atom feed</a>. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better.
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>import urllib.request</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read()</kbd> <span>①</span></a>
|
|
<samp class=p>>>> </samp><kbd>print(data)</kbd>
|
|
<samp><?xml version="1.0" encoding="utf-8"?>
|
|
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
|
<title>dive into mark</title>
|
|
<subtitle>currently between addictions</subtitle>
|
|
<id>tag:diveintomark.org,2001-07-29:/</id>
|
|
<updated>2009-03-27T21:56:07Z</updated>
|
|
<link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
|
|
…
|
|
</samp></pre>
|
|
<ol>
|
|
<li>Downloading anything over <abbr>HTTP</abbr> is incredibly easy in Python; in fact, it’s a one-liner. The <code>urllib.request</code> module has a handy <code>urlopen()</code> function that takes the address of the page you want, and returns a file-like object that you can just <code>read()</code> from to get the full contents of the page. It just can’t get any easier.
|
|
</ol>
|
|
|
|
<p>So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (<i>e.g.</i> requesting this feed once an hour), then you’re being inefficient, and you’re being rude.
|
|
|
|
<p class=a>⁂
|
|
|
|
<h2 id=whats-on-the-wire>What’s On The Wire?</h2>
|
|
|
|
<p>To see why this is inefficient and rude, let’s turn on the debugging features of Python’s <abbr>HTTP</abbr> library and see what’s being sent “on the wire.”
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>from http.client import HTTPConnection</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>HTTPConnection.debuglevel = 1</kbd> <span>①</span></a>
|
|
<samp class=p>>>> </samp><kbd>from urllib.request import urlopen</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>response = urlopen('http://diveintopython3.org/examples/feed.xml')</kbd> <span>②</span></a>
|
|
<samp><a>send: b'GET /examples/feed.xml HTTP/1.1 <span>③</span></a>
|
|
<a>Host: diveintopython3.org <span>④</span></a>
|
|
<a>Accept-Encoding: identity <span>⑤</span></a>
|
|
<a>User-Agent: Python-urllib/3.0' <span>⑥</span></a>
|
|
Connection: close
|
|
reply: 'HTTP/1.1 200 OK'
|
|
…further debugging information omitted…</samp></pre>
|
|
<ol>
|
|
<li>As I mentioned at the beginning of the chapter, <code>urllib.request</code> relies on another standard Python library, <code>http.client</code>. Normally you don’t need to touch <code>http.client</code> directly. (The <code>urllib.request</code> module imports it automatically.) But we import it here so we can toggle the debugging flag on the <code>HTTPConnection</code> class that <code>urllib.request</code> uses to connect to the <abbr>HTTP</abbr> server.
|
|
<li>Now that the debugging flag is set, information on the <abbr>HTTP</abbr> request and response is printed out in real time. As you can see, when you request the Atom feed, the <code>urllib.request</code> module sends five lines to the server.
|
|
<li>The first line specifies the <abbr>HTTP</abbr> verb you’re using, and the path of the resource (minus the domain name).
|
|
<li>The second line specifies the domain name from which we’re requesting this feed.
|
|
<li>The third line specifies the compression algorithms that the client supports. As I mentioned earlier, <a href=#compression><code>urllib.request</code> does not support compression</a> by default.
|
|
<li>The fourth line specifies the name of the library that is making the request. By default, this is <code>Python-urllib</code> plus a version number. Both <code>urllib.request</code> and <code>httplib2</code> support changing the user agent; you’ll see how to do this later in this chapter.
|
|
</ol>
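<p>The request headers described above are defaults, not fixed values. As a sketch, you can build a <code>urllib.request.Request</code> object with your own headers and inspect it before anything is sent. The agent name <code>my-feed-reader/1.0</code> is made up for this example.

```python
import urllib.request

req = urllib.request.Request(
    'http://diveintopython3.org/examples/feed.xml',
    headers={
        'User-Agent': 'my-feed-reader/1.0',   # made-up name for illustration
        'Accept-Encoding': 'gzip',
    },
)

# Request normalizes header names with str.capitalize(), hence 'User-agent'.
user_agent = req.get_header('User-agent')
accept_encoding = req.get_header('Accept-encoding')
print(user_agent)
print(accept_encoding)
```

<p>Keep in mind that asking for <code>gzip</code> this way is only half the job: <code>urllib.request</code> will hand you the compressed bytes as-is, and decompressing them is up to you.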
|
|
|
|
<p>Now let’s look at what the server sent back in its response.
|
|
|
|
<pre class=screen>
|
|
# continued from previous example
|
|
<a><samp class=p>>>> </samp><kbd>print(response.headers.as_string())</kbd> <span>①</span></a>
|
|
<samp><a>Date: Sun, 31 May 2009 19:23:06 GMT <span>②</span></a>
|
|
Server: Apache
|
|
<a>Last-Modified: Sun, 31 May 2009 06:39:55 GMT <span>③</span></a>
|
|
<a>ETag: "bfe-93d9c4c0" <span>④</span></a>
|
|
Accept-Ranges: bytes
|
|
<a>Content-Length: 3070 <span>⑤</span></a>
|
|
<a>Cache-Control: max-age=86400 <span>⑥</span></a>
|
|
Expires: Mon, 01 Jun 2009 19:23:06 GMT
|
|
Vary: Accept-Encoding
|
|
Connection: close
|
|
Content-Type: application/xml</samp>
|
|
<a><samp class=p>>>> </samp><kbd>data = response.read()</kbd> <span>⑦</span></a>
|
|
<samp class=p>>>> </samp><kbd>len(data)</kbd>
|
|
<samp>3070</samp></pre>
|
|
<ol>
|
|
<li>The <var>response</var> returned from the <code>urllib.request.urlopen()</code> function contains all the <abbr>HTTP</abbr> headers the server sent back. It also contains methods to download the actual data; we’ll get to that in a minute.
|
|
<li>The server tells you when it handled your request.
|
|
<li>This response includes a <a href=#last-modified><code>Last-Modified</code></a> header.
|
|
<li>This response includes an <a href=#etags><code>ETag</code></a> header.
|
|
<li>The data is 3070 bytes long. Notice what <em>isn’t</em> here: a <code>Content-encoding</code> header. Your request stated that you only accept uncompressed data (<code>Accept-encoding: identity</code>), and sure enough, this response contains uncompressed data.
|
|
<li>This response includes caching headers that state that this feed can be cached for up to 24 hours (86400 seconds).
|
|
<li>And finally, download the actual data by calling <code>response.read()</code>. As you can tell from the <code>len()</code> function, this downloads all 3070 bytes at once.
|
|
</ol>
|
|
|
|
<p>As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports <a href=#compression>gzip compression</a>, but <abbr>HTTP</abbr> compression is opt-in. We didn’t ask for it, so we didn’t get it. That means we’re downloading 3070 bytes when we could have just downloaded 941. Bad dog, no biscuit.
|
|
|
|
<p>But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time.
|
|
|
|
<pre class=screen>
|
|
# continued from the <a href=#whats-on-the-wire>previous example</a>
|
|
<samp class=p>>>> </samp><kbd>response2 = urlopen('http://diveintopython3.org/examples/feed.xml')</kbd>
|
|
<samp>send: b'GET /examples/feed.xml HTTP/1.1
|
|
Host: diveintopython3.org
|
|
Accept-Encoding: identity
|
|
User-Agent: Python-urllib/3.0'
|
|
Connection: close
|
|
reply: 'HTTP/1.1 200 OK'
|
|
…further debugging information omitted…</samp></pre>
|
|
|
|
<p>Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No sign of <a href=#last-modified><code>If-Modified-Since</code> headers</a>. No sign of <a href=#etags><code>If-None-Match</code> headers</a>. No respect for the caching headers. Still no compression.
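<p>For comparison, here is roughly what a more polite second request would need to carry. This is only a sketch: <code>conditional_headers()</code> is a hypothetical helper, and the dictionary holds values copied from the first response shown earlier.

```python
def conditional_headers(previous):
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if 'ETag' in previous:
        headers['If-None-Match'] = previous['ETag']  # quotation marks included
    if 'Last-Modified' in previous:
        headers['If-Modified-Since'] = previous['Last-Modified']
    return headers

# Values copied from the headers of the first response.
first_response = {
    'Last-Modified': 'Sun, 31 May 2009 06:39:55 GMT',
    'ETag': '"bfe-93d9c4c0"',
    'Cache-Control': 'max-age=86400',
}

second_request = conditional_headers(first_response)
for name, value in second_request.items():
    print(name + ':', value)
```

<p>With <code>urllib.request</code>, remembering the first response and attaching these headers to the second request is your job; nothing in the library does it for you.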
|
|
|
|
<p>And what happens when you do the same thing twice? You get the same response. Twice.
|
|
|
|
<pre class=screen>
|
|
# continued from the previous example
|
|
<a><samp class=p>>>> </samp><kbd>print(response2.headers.as_string())</kbd> <span>①</span></a>
|
|
<samp>Date: Mon, 01 Jun 2009 03:58:00 GMT
|
|
Server: Apache
|
|
Last-Modified: Sun, 31 May 2009 22:51:11 GMT
|
|
ETag: "bfe-255ef5c0"
|
|
Accept-Ranges: bytes
|
|
Content-Length: 3070
|
|
Cache-Control: max-age=86400
|
|
Expires: Tue, 02 Jun 2009 03:58:00 GMT
|
|
Vary: Accept-Encoding
|
|
Connection: close
|
|
Content-Type: application/xml</samp>
|
|
<samp class=p>>>> </samp><kbd>data2 = response2.read()</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>len(data2)</kbd> <span>②</span></a>
|
|
<samp>3070</samp>
|
|
<a><samp class=p>>>> </samp><kbd>data2 == data</kbd> <span>③</span></a>
|
|
<samp>True</samp></pre>
|
|
<ol>
|
|
<li>The server is still sending the same array of “smart” headers: <code>Cache-Control</code> and <code>Expires</code> to allow caching, <code>Last-Modified</code> and <code>ETag</code> to enable “not-modified” tracking. Even the <code>Vary: Accept-Encoding</code> header hints that the server would support compression, if only you would ask for it. But you didn’t.
|
|
<li>Once again, fetching this data downloads the whole 3070 bytes…
|
|
<li>…the exact same 3070 bytes you downloaded last time.
|
|
</ol>
|
|
|
|
<p><abbr>HTTP</abbr> is designed to work better than this. <code>urllib</code> speaks <abbr>HTTP</abbr> like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. <abbr>HTTP</abbr> is a conversation. It’s time to upgrade to a library that speaks <abbr>HTTP</abbr> fluently.
|
|
|
|
<p class=a>⁂
|
|
|
|
<h2 id=introducing-httplib2>Introducing <code>httplib2</code></h2>
|
|
|
|
<p>To use <code>httplib2</code>, create an instance of the <code>httplib2.Http</code> class.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>import httplib2</kbd>
|
|
<samp class=p>>>> </samp><kbd>h = httplib2.Http('.cache')</kbd>
|
|
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/examples/feed.xml')</kbd>
|
|
<samp class=p>>>> </samp><kbd>response.status</kbd>
|
|
<samp>200</samp>
|
|
<samp class=p>>>> </samp><kbd>content[:52]</kbd>
|
|
<samp>b'<?xml version="1.0" encoding="utf-8"?>\r\n<feed xmlns='</samp>
|
|
<samp class=p>>>> </samp><kbd>len(content)</kbd>
|
|
<samp>3070</samp></pre>
|
|
<ol>
|
|
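<li>The argument to <code>httplib2.Http</code> is the name of a directory (<code>.cache</code> here) that <code>httplib2</code> uses for its cache. The <code>request()</code> method returns two values: a response object that holds the status code and headers, and the content, which is a <code>bytes</code> object rather than a string.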
|
|
</ol>
|
|
|
|
<h3 id=httplib2-caching>How <code>httplib2</code> Handles Caching</h3>
|
|
|
|
<p>FIXME
|
|
|
|
<pre class=screen>
|
|
# continued from previous example
|
|
<samp class=p>>>> </samp><kbd>response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml')</kbd>
|
|
<samp class=p>>>> </samp><kbd>response2.status</kbd>
|
|
<samp>200</samp>
|
|
<samp class=p>>>> </samp><kbd>content2[:52]</kbd>
|
|
<samp>b'<?xml version="1.0" encoding="utf-8"?>\r\n<feed xmlns='</samp>
|
|
<samp class=p>>>> </samp><kbd>len(content2)</kbd>
|
|
<samp>3070</samp></pre>
|
|
<ol>
|
|
<li>FIXME
|
|
</ol>
|
|
|
|
<pre class=screen>
|
|
# NOT continued from previous example!
|
|
# Please exit out of the interactive shell
|
|
# and launch a new one.
|
|
<samp class=p>>>> </samp><kbd>import httplib2</kbd>
|
|
<samp class=p>>>> </samp><kbd>httplib2.debuglevel = 1</kbd>
|
|
<samp class=p>>>> </samp><kbd>h = httplib2.Http('.cache')</kbd>
|
|
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/examples/feed.xml')</kbd>
|
|
<samp class=p>>>> </samp><kbd>len(content)</kbd>
|
|
<samp>3070</samp>
|
|
<samp class=p>>>> </samp><kbd>response.status</kbd>
|
|
<samp>200</samp>
|
|
<samp class=p>>>> </samp><kbd>response.fromcache</kbd>
|
|
<samp>True</samp></pre>
|
|
<ol>
|
|
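<li>Even though debugging is turned on, nothing is printed, because <code>httplib2</code> sent nothing over the network. It found the feed in the <code>.cache</code> directory left behind by the earlier session. The status is still <code>200</code>, but <code>response.fromcache</code> reveals where the data really came from.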
|
|
</ol>
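<p>The cache survives a restart of the interactive shell because it lives on disk, not in memory. The sketch below illustrates that idea with a toy cache; the <code>DiskCache</code> class is invented for this example and says nothing about the file format <code>httplib2</code> actually uses. It also skips everything that makes a real <abbr>HTTP</abbr> cache correct, such as storing headers and checking freshness.

```python
import hashlib
import os
import tempfile

class DiskCache:
    """Toy on-disk cache: one file per URL, named by a hash of the URL."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, url):
        name = hashlib.sha256(url.encode('utf-8')).hexdigest()
        return os.path.join(self.directory, name)

    def get(self, url):
        try:
            with open(self._path(url), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None  # nothing cached for this URL

    def set(self, url, content):
        with open(self._path(url), 'wb') as f:
            f.write(content)

# A throwaway directory stands in for the '.cache' directory.
cache_dir = os.path.join(tempfile.mkdtemp(), '.cache')
url = 'http://diveintopython3.org/examples/feed.xml'

DiskCache(cache_dir).set(url, b'<feed/>')   # "first session" stores the body
restored = DiskCache(cache_dir).get(url)    # a brand-new object still finds it
print(restored)
```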
|
|
|
|
<pre class=screen>
# continued from previous example
<samp class=p>>>> </samp><kbd>response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',</kbd>
<samp class=p>... </samp><kbd>    headers={'cache-control':'no-cache'})</kbd>
<samp>connect: (diveintopython3.org, 80)
send: b'GET /examples/feed.xml HTTP/1.1
Host: diveintopython3.org
user-agent: Python-httplib2/$Rev: 259 $
accept-encoding: deflate, gzip
cache-control: no-cache'
reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…</samp>
<samp class=p>>>> </samp><kbd>response2.status</kbd>
<samp>200</samp>
<samp class=p>>>> </samp><kbd>response2.fromcache</kbd>
<samp>False</samp>
<samp class=p>>>> </samp><kbd>print(dict(response2.items()))</kbd>
<samp>{'status': '200',
 'content-length': '3070',
 'content-location': 'http://diveintopython3.org/examples/feed.xml',
 'accept-ranges': 'bytes',
 'expires': 'Wed, 03 Jun 2009 00:40:26 GMT',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 'last-modified': 'Sun, 31 May 2009 22:51:11 GMT',
 'connection': 'close',
 '-content-encoding': 'gzip',
 'etag': '"bfe-255ef5c0"',
 'cache-control': 'max-age=86400',
 'date': 'Tue, 02 Jun 2009 00:40:26 GMT',
 'content-type': 'application/xml'}</samp></pre>
<ol>
<li>FIXME
</ol>
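
<p>The response above carries <code>cache-control: max-age=86400</code> and a <code>date</code> header, which together say how long a cached copy may be reused. The following is a minimal sketch of that freshness check, using only the standard library. The function names <code>max_age_seconds()</code> and <code>is_fresh()</code> are made up for this illustration, and the logic is far simpler than a real caching client such as <code>httplib2</code> (it ignores <code>expires</code>, <code>no-cache</code>, and every other directive).

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    for directive in cache_control.split(','):
        name, _, value = directive.strip().partition('=')
        if name.lower() == 'max-age' and value.isdigit():
            return int(value)
    return None


def is_fresh(headers, now):
    """Decide whether a cached response may be reused without a request.

    Simplified assumption: only 'cache-control: max-age' and 'date' count.
    """
    max_age = max_age_seconds(headers.get('cache-control', ''))
    if max_age is None or 'date' not in headers:
        return False
    served_at = parsedate_to_datetime(headers['date'])
    return now - served_at < timedelta(seconds=max_age)


# Header values copied from the response shown above.
cached = {'cache-control': 'max-age=86400',
          'date': 'Tue, 02 Jun 2009 00:40:26 GMT'}

one_hour_later = datetime(2009, 6, 2, 1, 40, 26, tzinfo=timezone.utc)
two_days_later = datetime(2009, 6, 4, 0, 40, 26, tzinfo=timezone.utc)

print(is_fresh(cached, one_hour_later))
print(is_fresh(cached, two_days_later))
```

<p>Under these assumptions the copy counts as fresh one hour after it was served and stale two days later. During that first day, a request header such as <code>cache-control: no-cache</code> is what forces the trip back to the server.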
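
<h3 id=httplib2-etags>How <code>httplib2</code> Handles <code>Last-Modified</code> and <code>ETag</code> Headers</h3>

<p>The <code>Last-Modified</code> and <code>ETag</code> headers let a client check whether its cached copy of a resource is still current. In the example below, the first response carries both headers. On the second request, <code>httplib2</code> sends them back as <code>if-modified-since</code> and <code>if-none-match</code>, the server answers <code>304 Not Modified</code>, and <code>content</code> still holds all 6657 bytes, taken from the cache.
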
<pre class=screen>
<samp class=p>>>> </samp><kbd>import httplib2</kbd>
<samp class=p>>>> </samp><kbd>httplib2.debuglevel = 1</kbd>
<samp class=p>>>> </samp><kbd>h = httplib2.Http('.cache')</kbd>
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/')</kbd>
<samp>connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'</samp>
<samp class=p>>>> </samp><kbd>print(dict(response.items()))</kbd>
<samp>{'-content-encoding': 'gzip',
 'accept-ranges': 'bytes',
 'connection': 'close',
 'content-length': '6657',
 'content-location': 'http://diveintopython3.org/',
 'content-type': 'text/html',
 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
 'etag': '"7f806d-1a01-9fb97900"',
 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
 'server': 'Apache',
 'status': '200',
 'vary': 'Accept-Encoding,User-Agent'}</samp>
<samp class=p>>>> </samp><kbd>len(content)</kbd>
<samp>6657</samp>
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/')</kbd>
<samp>connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
if-none-match: "7f806d-1a01-9fb97900"
if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 304 Not Modified'</samp>
<samp class=p>>>> </samp><kbd>len(content)</kbd>
<samp>6657</samp></pre>
<ol>
<li>FIXME
</ol>
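
<p>The second request above differs from the first only in its two validator headers. As a rough sketch of the idea (this is not <code>httplib2</code>'s own code, and <code>conditional_headers()</code> and <code>resolve()</code> are hypothetical names), a caching client could derive those headers from the stored response and then choose which body to return:

```python
def conditional_headers(cached_headers):
    """Build the validator headers for revalidating a cached response."""
    headers = {}
    if 'etag' in cached_headers:
        headers['if-none-match'] = cached_headers['etag']
    if 'last-modified' in cached_headers:
        headers['if-modified-since'] = cached_headers['last-modified']
    return headers


def resolve(status, new_body, cached_body):
    """Pick the body to hand back: 304 means the cached copy is still good."""
    return cached_body if status == 304 else new_body


# Values copied from the first response in the example above.
cached_headers = {'etag': '"7f806d-1a01-9fb97900"',
                  'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT'}

request_headers = conditional_headers(cached_headers)
for name, value in sorted(request_headers.items()):
    print('{0}: {1}'.format(name, value))
```

<p>A <code>304</code> reply has no body of its own, so the client's job is to return the bytes it already has. That is consistent with <code>len(content)</code> reporting 6657 after both requests.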
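
<h3 id=httplib2-compression>How <code>httplib2</code> Handles Compression</h3>

<p>Compression is handled automatically. <code>httplib2</code> adds an <code>accept-encoding: deflate, gzip</code> header to the request, and when the server replies with gzip-compressed data, the body is decompressed before it is returned in <code>content</code>. The <code>'-content-encoding': 'gzip'</code> entry in the response records that the data was compressed in transit.
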
<pre class=screen>
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/')</kbd>
<samp>connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
<mark>accept-encoding: deflate, gzip</mark>
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'</samp>
<samp class=p>>>> </samp><kbd>print(dict(response.items()))</kbd>
<samp>{<mark>'-content-encoding': 'gzip',</mark>
 'accept-ranges': 'bytes',
 'connection': 'close',
 'content-length': '6657',
 'content-location': 'http://diveintopython3.org/',
 'content-type': 'text/html',
 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
 'etag': '"7f806d-1a01-9fb97900"',
 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
 'server': 'Apache',
 'status': '200',
 'vary': 'Accept-Encoding,User-Agent'}</samp></pre>
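
<p>To make the mechanics concrete, here is a small standard-library sketch of what a client has to do when a server answers with <code>content-encoding: gzip</code>. The helper name <code>decode_body()</code> and the sample page are invented for this illustration, and the sketch handles only gzip, not deflate.

```python
import gzip


def decode_body(headers, body):
    """Return the decompressed body if the server gzip-compressed it."""
    if headers.get('content-encoding') == 'gzip':
        return gzip.decompress(body)
    return body


# A made-up page, repetitive enough to compress well.
page = b'<p>Dive Into Python 3</p>' * 200

# What a server would put on the wire for a client that accepts gzip.
wire_body = gzip.compress(page)

restored = decode_body({'content-encoding': 'gzip'}, wire_body)
print(len(wire_body) < len(page))
print(restored == page)
```

<p>The compressed body is much smaller than the original, and the caller gets the original bytes back without having to know that compression was involved.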
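
<h3 id=httplib2-redirects>How <code>httplib2</code> Handles Redirects</h3>

<p><code>httplib2</code> follows redirects automatically. In the first example below, the server answers a request for <code>feed-302.xml</code> with <code>302 Found</code>, a temporary redirect. <code>httplib2</code> immediately requests the new address, and the <code>content-location</code> entry of the response gives the <abbr>URL</abbr> that was finally retrieved. Because the redirect is temporary, the next request for <code>feed-302.xml</code> goes back to the original <abbr>URL</abbr>. In the second example, <code>301 Moved Permanently</code> is a permanent redirect, and the repeat request is answered from the cache.
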
<pre class=screen>
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/examples/feed-302.xml')</kbd>
<samp>connect: (diveintopython3.org, 80)
send: b'GET /examples/feed-302.xml HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
<mark>reply: 'HTTP/1.1 302 Found'</mark>
<mark>send: b'GET /examples/feed.xml HTTP/1.1</mark>
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'</samp>
<samp class=p>>>> </samp><kbd>print(dict(response.items()))</kbd>
<samp>{'status': '200',
 'content-length': '3070',
<mark> 'content-location': 'http://diveintopython3.org/examples/feed.xml',</mark>
 'accept-ranges': 'bytes',
 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT',
 'connection': 'close',
 '-content-encoding': 'gzip',
 'etag': '"bfe-4cbbf5c0"',
 'cache-control': 'max-age=86400',
 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
 'content-type': 'application/xml'}</samp>
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/examples/feed-302.xml')</kbd>
<samp>connect: (diveintopython3.org, 80)
send: b'GET /examples/feed-302.xml HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 302 Found'</samp></pre>
<ol>
<li>FIXME
</ol>

<pre class=screen>
<samp class=p>>>> </samp><kbd>response, content = h.request('http://diveintopython3.org/examples/feed-301.xml')</kbd>
<samp>connect: (diveintopython3.org, 80)
send: b'GET /examples/feed-301.xml HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 301 Moved Permanently'</samp>
<samp class=p>>>> </samp><kbd>print(dict(response.items()))</kbd>
<samp>{'status': '200',
 'content-length': '3070',
 'content-location': 'http://diveintopython3.org/examples/feed.xml',
 'accept-ranges': 'bytes',
 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT',
 'connection': 'close',
 '-content-encoding': 'gzip',
 'etag': '"bfe-4cbbf5c0"',
 'cache-control': 'max-age=86400',
 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
 'content-type': 'application/xml'}</samp>
<samp class=p>>>> </samp><kbd>response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml')</kbd>
<samp class=p>>>> </samp><kbd>response2.fromcache</kbd>
<samp>True</samp></pre>
<ol>
<li>FIXME
</ol>
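
<p>The difference between the two redirect types can be modelled in a few lines. The sketch below is a toy, not a description of <code>httplib2</code> internals: <code>FAKE_SERVER</code> and <code>RedirectingClient</code> are hypothetical names, there is no network access, and there is no response cache (so the final <abbr>URL</abbr> is always requested). It only shows the bookkeeping: a permanent redirect is remembered, a temporary one is not.

```python
# Hypothetical stand-in for a server: path -> (status, location-or-body).
FAKE_SERVER = {
    '/examples/feed-302.xml': (302, '/examples/feed.xml'),
    '/examples/feed-301.xml': (301, '/examples/feed.xml'),
    '/examples/feed.xml': (200, '<feed/>'),
}


class RedirectingClient:
    """Follows redirects and remembers only the permanent (301) ones."""

    def __init__(self, server):
        self.server = server
        self.permanent = {}  # remembered 301 redirects: old path -> new path
        self.requests = []   # every path actually requested, in order

    def get(self, path, max_hops=5):
        # Skip straight to the new address if a 301 was seen earlier.
        path = self.permanent.get(path, path)
        for _ in range(max_hops):
            self.requests.append(path)
            status, payload = self.server[path]
            if status == 301:
                self.permanent[path] = payload
            if status in (301, 302):
                path = payload
                continue
            return status, payload
        raise RuntimeError('too many redirects')


client = RedirectingClient(FAKE_SERVER)

# Temporary redirect: the old address is requested both times.
client.get('/examples/feed-302.xml')
client.get('/examples/feed-302.xml')
temporary_log = list(client.requests)

# Permanent redirect: the old address is requested only once.
client.requests.clear()
client.get('/examples/feed-301.xml')
client.get('/examples/feed-301.xml')
permanent_log = list(client.requests)

print(temporary_log)
print(permanent_log)
```

<p>In this model, the second request for <code>feed-302.xml</code> still contacts the old address, just as in the transcript above, while the second request for <code>feed-301.xml</code> never does.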
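
<p class=a>⁂

<h2 id=beyond-get>Beyond HTTP GET</h2>

<p><abbr>HTTP</abbr> web services are not limited to <code>GET</code>. To send data to the server, pass <code>"POST"</code> as the method and the <abbr>URL</abbr>-encoded data as the body of the request. The example below posts a status update to Twitter, which requires a username and password, supplied through <code>add_credentials()</code>.
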
<pre>
<samp class=p>>>> </samp><kbd>import httplib2</kbd>
<samp class=p>>>> </samp><kbd>from urllib.parse import urlencode</kbd>
<samp class=p>>>> </samp><kbd>h = httplib2.Http('.cache')</kbd>
<samp class=p>>>> </samp><kbd>data = {"status": "Test update from Python 3"}</kbd>
<samp class=p>>>> </samp><kbd>h.add_credentials("diveintomark", "<var>MY_SECRET_PASSWORD</var>")</kbd>
<samp class=p>>>> </samp><kbd>resp, content = h.request("http://twitter.com/statuses/update.xml", "POST", urlencode(data))</kbd>
<samp class=p>>>> </samp><kbd>resp.status</kbd>
<samp>200</samp>
<samp class=p>>>> </samp><kbd>from xml.etree import ElementTree as etree</kbd>
<samp class=p>>>> </samp><kbd>tree = etree.fromstring(content)</kbd>
<samp class=p>>>> </samp><kbd>print(etree.tostring(tree, encoding='unicode'))</kbd>
<samp>&lt;status>
  &lt;created_at>Sat May 30 19:11:38 +0000 2009&lt;/created_at>
  &lt;id>1973974228&lt;/id>
  &lt;text>Test update from Python 3&lt;/text>
  &lt;source>web&lt;/source>
  &lt;truncated>false&lt;/truncated>
  &lt;in_reply_to_status_id />
  &lt;in_reply_to_user_id />
  &lt;favorited>false&lt;/favorited>
  &lt;in_reply_to_screen_name />
  &lt;user>
    &lt;id>8294212&lt;/id>
    &lt;name>Mark Pilgrim&lt;/name>
    &lt;screen_name>diveintomark&lt;/screen_name>
    &lt;location>Apex, NC&lt;/location>
    &lt;description>Like a fine spice&lt;/description>
    &lt;profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/72859681/beau_normal.jpg&lt;/profile_image_url>
    &lt;url>http://diveintomark.org/&lt;/url>
    &lt;protected>false&lt;/protected>
    &lt;followers_count>2565&lt;/followers_count>
    &lt;profile_background_color>FFFFFF&lt;/profile_background_color>
    &lt;profile_text_color>333333&lt;/profile_text_color>
    &lt;profile_link_color>333333&lt;/profile_link_color>
    &lt;profile_sidebar_fill_color>ffffff&lt;/profile_sidebar_fill_color>
    &lt;profile_sidebar_border_color>333333&lt;/profile_sidebar_border_color>
    &lt;friends_count>44&lt;/friends_count>
    &lt;created_at>Sun Aug 19 23:58:36 +0000 2007&lt;/created_at>
    &lt;favourites_count>71&lt;/favourites_count>
    &lt;utc_offset>-18000&lt;/utc_offset>
    &lt;time_zone>Eastern Time (US &amp; Canada)&lt;/time_zone>
    &lt;profile_background_image_url>http://static.twitter.com/images/themes/theme1/bg.gif&lt;/profile_background_image_url>
    &lt;profile_background_tile>false&lt;/profile_background_tile>
    &lt;statuses_count>527&lt;/statuses_count>
    &lt;notifications>false&lt;/notifications>
    &lt;following>false&lt;/following>
  &lt;/user>
&lt;/status></samp></pre>
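
<p>The third argument to <code>h.request()</code> above is the body of the <code>POST</code>, produced by <code>urlencode()</code>. This short standard-library sketch shows what that body looks like for the same dictionary, and that the encoding can be reversed with <code>parse_qs()</code>. Nothing here touches the network.

```python
from urllib.parse import parse_qs, urlencode

# The same payload as in the example above.
data = {"status": "Test update from Python 3"}

# urlencode() turns the dictionary into a form-encoded string:
# spaces become '+', and key/value pairs are joined with '='.
body = urlencode(data)
print(body)

# parse_qs() reverses the encoding; each key maps to a list of values.
decoded = parse_qs(body)
print(decoded)
```

<p>A body in this format is what web forms send, and servers commonly expect it to be labelled with a <code>Content-Type</code> of <code>application/x-www-form-urlencoded</code>.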
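
<p>FIXME

<p class=a>⁂

<h2 id=beyond-post>Beyond HTTP POST</h2>

<p><abbr>HTTP</abbr> has more verbs than <code>GET</code> and <code>POST</code>. The example below removes the status message that was just posted by sending a <code>DELETE</code> request to a <abbr>URL</abbr> built from the message's <code>id</code>.
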
<pre class=screen>
# continued from the previous example
<samp class=p>>>> </samp><kbd>tree.findtext("id")</kbd>
<samp>'1973974228'</samp>
<samp class=p>>>> </samp><kbd>resp, delete_content = h.request("http://twitter.com/statuses/destroy/{0}.xml".format(tree.findtext("id")), "DELETE")</kbd>
<samp class=p>>>> </samp><kbd>resp.status</kbd>
<samp>200</samp></pre>
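
<p>The <abbr>URL</abbr> for the <code>DELETE</code> request is assembled from the <abbr>XML</abbr> returned by the earlier <code>POST</code>. The sketch below repeats just that step offline, using a heavily trimmed copy of the response shown earlier (the trimming is an assumption made to keep the example short).

```python
from xml.etree import ElementTree as etree

# A trimmed copy of the XML returned by the POST in the earlier example.
content = b"""<status>
  <id>1973974228</id>
  <text>Test update from Python 3</text>
  <user>
    <id>8294212</id>
  </user>
</status>"""

tree = etree.fromstring(content)

# findtext("id") looks only at direct children of <status>,
# so it returns the status id, not the user's id nested below.
status_id = tree.findtext("id")
user_id = tree.findtext("user/id")

delete_url = "http://twitter.com/statuses/destroy/{0}.xml".format(status_id)
print(delete_url)
```

<p>Because <code>findtext()</code> takes a path relative to the element it is called on, the nested <code>id</code> inside <code>user</code> does not get in the way.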

<p class=a>⁂

<h2 id=furtherreading>Further Reading</h2>

<ul>
<li><a href=http://code.google.com/p/httplib2/><code>httplib2</code></a>
<li><a href=http://www.xml.com/pub/a/2006/02/01/doing-http-caching-right-introducing-httplib2.html>Doing <abbr>HTTP</abbr> Caching Right: Introducing <code>httplib2</code></a>
<li><a href=http://www.xml.com/pub/a/2006/03/29/httplib2-http-persistence-and-authentication.html><code>httplib2</code>: <abbr>HTTP</abbr> Persistence and Authentication</a>
<li><a href=http://apiwiki.twitter.com/>Twitter <abbr>API</abbr> reference</a>
<li><a href=http://www.mnot.net/cache_docs/><abbr>HTTP</abbr> Caching Tutorial</a> by Mark Nottingham
<li><a href=http://code.google.com/p/doctype/wiki/ArticleHttpCaching>How to control caching with <abbr>HTTP</abbr> headers</a> on Google Doctype
</ul>

<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
<script src=jquery.js></script>
<script src=dip3.js></script>