mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
css fiddling
@@ -102,9 +102,10 @@ abbr {
}
.f:first-letter {
float: left;
color: lightblue;
color: lightsteelblue;
padding: 0.11em 4px 0 0;
font: normal 4em/0.68 serif;
text-shadow: steelblue 1px 1px 1px;
}
p, ul, ol {
margin: 1.75em 0;
@@ -130,7 +131,7 @@ body {
.a {
font-size: xx-large;
line-height: .875;
color: #444;
color: #82b445;
}
form div, #level {
float: right;
@@ -152,7 +153,7 @@ a:link, .w a {
color: steelblue;
}
a:visited {
color: darkorchid;
color: #b44582;
}
.c a {
color: inherit;
@@ -267,7 +268,9 @@ aside {
-webkit-border-radius: 1em;
border-radius: 1em;
}

#level span {
color: #82b445;
}
/* previous/next navigation links */

.nav a {
+97
-99
@@ -21,9 +21,9 @@ mark{display:inline}
|
||||
</blockquote>
|
||||
<p id=toc>
|
||||
<h2 id=divingin>Diving In</h2>
|
||||
<p class=f>HTTP web services are programmatic ways of sending and receiving data from remote servers using the operations of <abbr>HTTP</abbr> directly. If you want to get data from the server, use a straight <abbr>HTTP</abbr> GET; if you want to send new data to the server, use <abbr>HTTP</abbr> POST. (Some more advanced <abbr>HTTP</abbr> web service APIs also define ways of modifying existing data and deleting data, using <abbr>HTTP</abbr> PUT and <abbr>HTTP</abbr> DELETE.) In other words, the “verbs” built into the <abbr>HTTP</abbr> protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for receiving, sending, modifying, and deleting data.
|
||||
<p class=f>HTTP web services are programmatic ways of sending and receiving data from remote servers using nothing but the operations of <abbr>HTTP</abbr>. If you want to get data from the server, use <abbr>HTTP</abbr> <code>GET</code>; if you want to send new data to the server, use <abbr>HTTP</abbr> <code>POST</code>. Some more advanced <abbr>HTTP</abbr> web service <abbr>API</abbr>s also define ways of modifying existing data and deleting data, using <abbr>HTTP</abbr> <code>PUT</code> and <abbr>HTTP</abbr> <code>DELETE</code>. In other words, the “verbs” built into the <abbr>HTTP</abbr> protocol (<code>GET</code>, <code>POST</code>, <code>PUT</code>, and <code>DELETE</code>) map directly to application-level operations for retrieving, creating, modifying, and deleting data.
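<p>As a rough sketch of how the four verbs line up with application-level operations, the standard library’s <code>urllib.request.Request</code> accepts a <code>method</code> argument. The <code>OPERATIONS</code> table, the <code>describe()</code> helper, and the example <abbr>URL</abbr> are all hypothetical names chosen for illustration, and nothing is actually sent over the wire.

```python
import urllib.request

# Hypothetical table: HTTP verb -> application-level operation.
OPERATIONS = {
    'GET': 'retrieve',
    'POST': 'create',
    'PUT': 'modify',
    'DELETE': 'delete',
}

def describe(method, url='http://example.com/items/1'):
    """Build (but do not send) a request and describe what it would do."""
    request = urllib.request.Request(url, method=method)
    verb = request.get_method()
    return '{0} {1} -> {2}'.format(verb, request.full_url, OPERATIONS[verb])

for verb in OPERATIONS:
    print(describe(verb))
```

<p>Building the request and sending it are separate steps, which is why the sketch can run without a network connection.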
|
||||
|
||||
<p>The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data -- usually XML data -- can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an <abbr>HTTP</abbr> library for downloading it. Debugging is also easier; because each “call” to the web service had a unique <abbr>URL</abbr>, you can load it in your web browser and immediately see the raw data.
|
||||
<p>The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data — usually <abbr>XML</abbr> data — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an <abbr>HTTP</abbr> library for downloading it. Debugging is also easier; because each “call” to the web service had a unique <abbr>URL</abbr>, you can load it in your web browser and immediately see the raw data.
|
||||
|
||||
<p>Examples of <abbr>HTTP</abbr> web services:
|
||||
<ul>
|
||||
@@ -36,19 +36,75 @@ mark{display:inline}
|
||||
<p>Python 3 comes with two different libraries for interacting with <abbr>HTTP</abbr> web services:
|
||||
|
||||
<ul>
|
||||
<li><a href=http://docs.python.org/3.0/library/http.client.html><code>http.client</code></a> is a low-level library that implements <a href=http://www.w3.org/Protocols/rfc2616/rfc2616.html>RFC 2616</a>, the <abbr>HTTP</abbr> protocol.
|
||||
<li><a href=http://docs.python.org/3.0/library/http.client.html><code>http.client</code></a> is a low-level library that implements <a href=http://www.w3.org/Protocols/rfc2616/rfc2616.html><abbr>RFC</abbr> 2616</a>, the <abbr>HTTP</abbr> protocol.
|
||||
<li><a href=http://docs.python.org/3.0/library/urllib.request.html><code>urllib.request</code></a> is an abstraction layer built on top of <code>http.client</code>. It provides a standard <abbr>API</abbr> for accessing both <abbr>HTTP</abbr> and <abbr>FTP</abbr> servers, automatically follows <abbr>HTTP</abbr> redirects, and handles some common forms of <abbr>HTTP</abbr> authentication.
|
||||
</ul>
|
||||
|
||||
<p>Which one should you use? Neither of them. Instead, you should use <a href=http://code.google.com/p/httplib2/><code>httplib2</code></a>, an open source third-party library that implements <abbr>HTTP</abbr> more fully than <code>http.client</code> but provides a better abstraction than <code>urllib.request</code>.
|
||||
<p>So which one should you use? Neither of them. Instead, you should use <a href=http://code.google.com/p/httplib2/><code>httplib2</code></a>, an open source third-party library that implements <abbr>HTTP</abbr> more fully than <code>http.client</code> but provides a better abstraction than <code>urllib.request</code>.
|
||||
|
||||
<p>To understand why <code>httplib2</code> is the right choice, you first need to understand <abbr>HTTP</abbr>.
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id=http-features>Features of HTTP</h2>
|
||||
|
||||
<p>There are five important features which all <abbr>HTTP</abbr> clients should support.
|
||||
|
||||
<h3 id=caching>Caching</h3>
|
||||
|
||||
<p>FIXME
|
||||
|
||||
<h3 id=last-modified>Last-Modified Checking</h3>
|
||||
|
||||
<p>Some data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the home page of Google.com may not change for days or even weeks (and then only when they put up a special holiday logo or advertise a new service). Web services are no different. The server knows when the data you’re requesting last changed, and <abbr>HTTP</abbr> provides a way for the server to include this last-modified date each time you request the data.
|
||||
|
||||
<p>If you ask for the same data a second (or third or fourth) time, you can tell the server the last-modified date that you got last time. You send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special <abbr>HTTP</abbr> status code <code>304</code>, which means “this data hasn’t changed since the last time you asked for it.” Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn’t re-send the data</em>. All you get is the status code. So you don’t need to download the same data over and over again if it hasn’t changed; the server assumes you have the data <a href=#caching>cached locally</a>.
|
||||
|
||||
<p>All modern web browsers support last-modified date checking. If you’ve ever visited a page, re-visited the same page a day later and found that it hadn’t changed, and wondered why it loaded so quickly the second time — this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically sent the last-modified date it got from the server the first time. The server simply says <code>304: Not Modified</code>, so your browser knows to load the page from its cache. Web services work the same way.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for last-modified date checking, but <code>httplib2</code> does.
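<p>To make the exchange concrete, here is a minimal sketch of both sides of last-modified checking, written with the standard library only. The helper names <code>conditional_request()</code> and <code>status_for()</code> are hypothetical, the server side is simulated rather than real, and this is not how <code>httplib2</code> implements it.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_request(url, last_modified=None):
    """Client side: build a request that carries the date we saw last time."""
    request = urllib.request.Request(url)
    if last_modified is not None:
        request.add_header('If-Modified-Since', last_modified)
    return request

def status_for(last_changed, if_modified_since=None):
    """Simulated server side: 304 if nothing changed since the given date."""
    if if_modified_since is not None:
        if last_changed <= parsedate_to_datetime(if_modified_since):
            return 304  # Not Modified: no body is sent
    return 200          # OK: the full data is sent

# Hypothetical date, as it would have arrived in a Last-Modified header.
seen = 'Sat, 06 Jun 2009 10:00:00 GMT'
request = conditional_request('http://example.com/feed.xml', seen)
unchanged = datetime(2009, 6, 1, tzinfo=timezone.utc)
print(status_for(unchanged, seen))
```

<p>The client never interprets the date; it simply echoes back the string the server gave it.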
|
||||
|
||||
<h3 id=etag>ETags</h3>
|
||||
|
||||
<p>ETags are an alternate way to accomplish the same thing as the <a href=#last-modified>last-modified date checking</a>. With ETags, the server sends a hash code in an <code>ETag</code> header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) The second time you request the same data, you include the ETag hash in an <code>If-None-Match</code> header of your request. If the data hasn’t changed, the server will send you back a <code>304</code> status code. As with the last-modified date checking, the server sends back <em>only</em> the <code>304</code> status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since <a href=#caching>you still have the data from the last time</a>.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for ETags, but <code>httplib2</code> does.
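<p>The ETag round trip can be sketched with a simulated server. Hashing the body with SHA-1 is only one possible choice, picked here as an assumption (the server is free to compute the tag however it likes), and <code>make_etag()</code> and <code>respond()</code> are hypothetical helper names.

```python
import hashlib

def make_etag(body):
    """One possible ETag: a quoted hash of the response body (an assumption)."""
    return '"{0}"'.format(hashlib.sha1(body).hexdigest())

def respond(body, if_none_match=None):
    """Simulated server: return (status, headers, data) for a request."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, {'ETag': etag}, b''   # unchanged: status code only
    return 200, {'ETag': etag}, body      # new or changed: send the data

# First request: no ETag yet, so the full data comes back.
status, headers, data = respond(b'<feed/>')
# Second request: echo the ETag back in If-None-Match.
status_again, _, data_again = respond(b'<feed/>', headers['ETag'])
print(status, status_again, len(data_again))
```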
|
||||
|
||||
<h3 id=compression>Compression</h3>
|
||||
|
||||
<p>When you talk about <abbr>HTTP</abbr> web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s <abbr>XML</abbr>; maybe it’s <abbr>JSON</abbr>. Regardless of the format, text compresses well. When you request a resource over <abbr>HTTP</abbr>, you can ask the server to send it in compressed format. You include the <code>Accept-Encoding</code> header in your request, and if the server supports compression, it will send you back compressed data and mark it with a <code>Content-Encoding</code> header.
|
||||
|
||||
<p><abbr>HTTP</abbr> supports several compression algorithms. The two most common types are <a href=http://www.ietf.org/rfc/rfc1952.txt>gzip</a> and <a href=http://www.ietf.org/rfc/rfc1951.txt>deflate</a>.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for compression, but <code>httplib2</code> does.
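<p>Doing this by hand takes two steps: ask for compressed data, then decode whatever comes back. The sketch below uses only the standard library; <code>compressed_request()</code> and <code>decode_body()</code> are hypothetical helpers, and the deflate branch assumes the server wraps the data in the zlib format, which not every server does.

```python
import gzip
import zlib
import urllib.request

def compressed_request(url):
    """Build a request that tells the server we can handle compression."""
    request = urllib.request.Request(url)
    request.add_header('Accept-Encoding', 'gzip, deflate')
    return request

def decode_body(body, content_encoding=None):
    """Undo the compression named in the Content-Encoding header, if any."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    if content_encoding == 'deflate':
        # Assumption: zlib-wrapped deflate data.
        return zlib.decompress(body)
    return body  # the server sent the data uncompressed

# Simulate a compressible text response instead of touching the network.
original = b'<entry>text compresses well</entry>' * 50
on_the_wire = gzip.compress(original)
print(len(original), len(on_the_wire))
```

<p>A server that ignores the <code>Accept-Encoding</code> header sends no <code>Content-Encoding</code> header, so the last branch returns the body untouched.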
|
||||
|
||||
<h3 id=redirects>Redirects</h3>
|
||||
|
||||
<p><a href=http://www.w3.org/Provider/Style/URI>Cool URIs don’t change</a>, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; <code>http://www.example.com/index.xml</code> becomes <code>http://server-farm-1.example.com/index.xml</code>.
|
||||
|
||||
<p>Every time you request any kind of resource from an <abbr>HTTP</abbr> server, the server includes a status code in its response. Status code <code>200</code> means “everything’s normal, here’s the page you asked for”. Status code <code>404</code> means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.
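<p>The standard library knows the registered status codes, so a small lookup can illustrate the ranges described above. The <code>describe_status()</code> helper and its class labels are hypothetical, written for this sketch.

```python
from http import HTTPStatus

# Hypothetical labels for the first digit of a status code.
STATUS_CLASSES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def describe_status(code):
    """Return the code, its standard phrase, and its class."""
    phrase = HTTPStatus(code).phrase
    return '{0} {1} ({2})'.format(code, phrase, STATUS_CLASSES[code // 100])

for code in (200, 301, 302, 304, 404):
    print(describe_status(code))
```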
|
||||
|
||||
<p><abbr>HTTP</abbr> has several different ways of signifying that a resource has moved. The two most common techniques are status codes <code>302</code> and <code>301</code>. Status code <code>302</code> is a <i>temporary redirect</i>; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a <code>Location:</code> header). Status code <code>301</code> is a <i>permanent redirect</i>; it means “oops, that got moved permanently” (and then gives the new address in a <code>Location:</code> header). If you get a <code>302</code> status code and a new address, the <abbr>HTTP</abbr> specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you’re supposed to use the new address from then on.
|
||||
|
||||
<p>The <code>urllib</code> module will automatically “follow” redirects when it receives the appropriate status code from the <abbr>HTTP</abbr> server, but unfortunately, it doesn’t tell you that it did so. You’ll end up getting the data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address. That’s two round trips instead of one, which is bad for the service operator and bad for you.
|
||||
|
||||
<p><code>httplib2</code> handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
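<p>The bookkeeping behind “remember permanent redirects, retry temporary ones” is small enough to sketch. The <code>RedirectMemory</code> class below is hypothetical and only illustrates the rule from the <abbr>HTTP</abbr> specification; it is not <code>httplib2</code>’s actual implementation.

```python
class RedirectMemory:
    """Remember permanent (301) redirects; forget temporary (302) ones."""

    def __init__(self):
        self._permanent = {}

    def record(self, status, old_url, new_url):
        """Call this after a response that carried a Location header."""
        if status == 301:
            self._permanent[old_url] = new_url
        # A 302 is deliberately not stored: next time, retry the old address.

    def resolve(self, url):
        """Rewrite a URL before requesting it, following known 301 chains."""
        seen = set()
        while url in self._permanent and url not in seen:
            seen.add(url)  # guards against redirect loops
            url = self._permanent[url]
        return url

memory = RedirectMemory()
memory.record(301, 'http://example.com/index.xml',
              'http://example.com/xml/atom.xml')
memory.record(302, 'http://example.com/temp.xml',
              'http://example.com/elsewhere.xml')
print(memory.resolve('http://example.com/index.xml'))
print(memory.resolve('http://example.com/temp.xml'))
```

<p>Resolving the address before each request is what saves the second round trip described above.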
|
||||
|
||||
<!--
|
||||
<h3><code>User-Agent</code></h3>
|
||||
|
||||
<p>The <code>User-Agent</code> is simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web service over <abbr>HTTP</abbr>. When the client requests a resource, it should always announce who it is, as specifically as possible. This helps the server-side administrator figure out who to contact when things go fantastically wrong.
|
||||
|
||||
<p>By default, Python sends a generic <code>User-Agent</code>: <code>Python-urllib/1.15</code>. In the next section, you’ll see how to change this to something more specific.
|
||||
|
||||
<p>Note that [FIXME-href] our little one-line script to download an Atom feed did not support any of these <abbr>HTTP</abbr> features. Let’s see how you can improve it.
|
||||
|
||||
<p class=a>⁂
|
||||
-->
|
||||
|
||||
<h2 id=dont-try-this-at-home>How Not To Fetch Data Over HTTP</h2>
|
||||
|
||||
<p>Let’s say you want to download a resource over HTTP, such as <a href=xml.html>an Atom feed</a>. But you don’t just want to download it once; you want to download it over and over again, every hour, to get the latest news from the site that’s offering the news feed. Let’s do it the quick-and-dirty way first, and then see how you can do better.
|
||||
<p>Let’s say you want to download a resource over <abbr>HTTP</abbr>, such as <a href=xml.html>an Atom feed</a>. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. Let’s do it the quick-and-dirty way first, and then see how you can do better.
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import urllib.request</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read()</kbd> <span>①</span></a>
|
||||
@@ -63,74 +119,16 @@ mark{display:inline}
|
||||
…
|
||||
</samp></pre>
|
||||
<ol>
|
||||
<li>Downloading anything over HTTP is incredibly easy in Python; in fact, it’s a one-liner. The <code>urllib.request</code> module has a handy <code>urlopen()</code> function that takes the address of the page you want, and returns a file-like object that you can just <code>read()</code> from to get the full contents of the page. It just can’t get any easier.
|
||||
<li>Downloading anything over <abbr>HTTP</abbr> is incredibly easy in Python; in fact, it’s a one-liner. The <code>urllib.request</code> module has a handy <code>urlopen()</code> function that takes the address of the page you want, and returns a file-like object that you can just <code>read()</code> from to get the full contents of the page. It just can’t get any easier.
|
||||
</ol>
|
||||
|
||||
<p>So what’s wrong with this? Well, for a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis -- and remember, you said you were planning on retrieving this syndicated feed once an hour -- then you’re being inefficient, and you’re being rude.
|
||||
|
||||
<p>Let’s talk about some of the basic features of HTTP.
|
||||
<p>So what’s wrong with this? Well, for a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis — and remember, you said you were planning on retrieving this syndicated feed once an hour — then you’re being inefficient, and you’re being rude.
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id=http-features>Features of HTTP</h2>
|
||||
|
||||
<p>There are five important features which all HTTP clients should support.
|
||||
|
||||
<h3 id=caching>Caching</h3>
|
||||
|
||||
<p>FIXME
|
||||
|
||||
<h3 id=last-modified>Last-Modified Checking</h3>
|
||||
|
||||
<p>Some data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the home page of Google.com may not change for days or even weeks (and then only when they put up a special holiday logo or advertise a new service). Web services are no different. The server knows when the data you’re requesting last changed, and HTTP provides a way for the server to include this last-modified date each time you request the data.
|
||||
|
||||
<p>If you ask for the same data a second (or third or fourth) time, you can tell the server the last-modified date that you got last time. You send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special HTTP status code <code>304</code>, which means “this data hasn’t changed since the last time you asked for it.” Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn’t re-send the data</em>. All you get is the status code. So you don’t need to download the same data over and over again if it hasn’t changed; the server assumes you have the data <a href=#caching>cached locally</a>.
|
||||
|
||||
<p>All modern web browsers support last-modified date checking. If you’ve ever visited a page, re-visited the same page a day later and found that it hadn’t changed, and wondered why it loaded so quickly the second time — this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically sent the last-modified date it got from the server the first time. The server simply says <code>304: Not Modified</code>, so your browser knows to load the page from its cache. Web services work the same way.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for last-modified date checking, but <code>httplib2</code> does.
|
||||
|
||||
<h3 id=etag>ETags</h3>
|
||||
|
||||
<p>ETags are an alternate way to accomplish the same thing as the <a href=#last-modified>last-modified date checking</a>. With ETags, the server sends a hash code in an <code>ETag</code> header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) The second time you request the same data, you include the ETag hash in an <code>If-None-Match</code> header of your request. If the data hasn’t changed, the server will send you back a <code>304</code> status code. As with the last-modified date checking, the server sends back <em>only</em> the <code>304</code> status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since <a href=#caching>you still have the data from the last time</a>.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for ETags, but <code>httplib2</code> does.
|
||||
|
||||
<h3 id=compression>Compression</h3>
|
||||
|
||||
<p>When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s <abbr>XML</abbr>; maybe it’s <abbr>JSON</abbr>. Regardless of the format, text compresses well. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include the <code>Accept-Encoding</code> header in your request, and if the server supports compression, it will send you back compressed data and mark it with a <code>Content-Encoding</code> header.
|
||||
|
||||
<p>HTTP supports several compression algorithms. The two most common types are <a href=http://www.ietf.org/rfc/rfc1952.txt>gzip</a> and <a href=http://www.ietf.org/rfc/rfc1951.txt>deflate</a>.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for compression, but <code>httplib2</code> does.
|
||||
|
||||
<h3 id=redirects>Redirects</h3>
|
||||
|
||||
<p><a href=http://www.w3.org/Provider/Style/URI>Cool URIs don’t change</a>, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; <code>http://www.example.com/index.xml</code> becomes <code>http://server-farm-1.example.com/index.xml</code>.
|
||||
|
||||
<p>Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code <code>200</code> means “everything’s normal, here’s the page you asked for”. Status code <code>404</code> means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.
|
||||
|
||||
<p>HTTP has several different ways of signifying that a resource has moved. The two most common techniques are status codes <code>302</code> and <code>301</code>. Status code <code>302</code> is a <i>temporary redirect</i>; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a <code>Location:</code> header). Status code <code>301</code> is a <i>permanent redirect</i>; it means “oops, that got moved permanently” (and then gives the new address in a <code>Location:</code> header). If you get a <code>302</code> status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you’re supposed to use the new address from then on.
|
||||
|
||||
<p>The <code>urllib</code> module will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn’t tell you that it did so. You’ll end up getting the data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address. That’s two round trips instead of one, which is bad for the service operator and bad for you.
|
||||
|
||||
<p><code>httplib2</code> handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
|
||||
|
||||
<!--
|
||||
<h3><code>User-Agent</code></h3>
|
||||
|
||||
<p>The <code>User-Agent</code> is simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web service over HTTP. When the client requests a resource, it should always announce who it is, as specifically as possible. This helps the server-side administrator figure out who to contact when things go fantastically wrong.
|
||||
|
||||
<p>By default, Python sends a generic <code>User-Agent</code>: <code>Python-urllib/1.15</code>. In the next section, you’ll see how to change this to something more specific.
|
||||
|
||||
<p>Note that [FIXME-href] our little one-line script to download an Atom feed did not support any of these HTTP features. Let’s see how you can improve it.
|
||||
|
||||
<p class=a>⁂
|
||||
-->
|
||||
|
||||
<!--
|
||||
<h2 id="oa.debug">11.4. Debugging HTTP web services</h2>
|
||||
<p>First, let’s turn on the debugging features of Python’s HTTP library and see what’s being sent over the wire. This will be useful throughout the chapter, as you add more and
|
||||
<p>First, let’s turn on the debugging features of Python’s <abbr>HTTP</abbr> library and see what’s being sent over the wire. This will be useful throughout the chapter, as you add more and
|
||||
more features.
|
||||
<div class=example><h3>Example 11.3. Debugging HTTP</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import httplib</kbd>
|
||||
@@ -154,13 +152,13 @@ header: Content-Length: 26848
|
||||
header: Connection: close
|
||||
</pre>
|
||||
<ol>
|
||||
<li><code>urllib</code> relies on another standard Python library, <code>httplib</code>. Normally you don’t need to <code>import httplib</code> directly (<code>urllib</code> does that automatically), but you will here so you can set the debugging flag on the <code>HTTPConnection</code> class that <code>urllib</code> uses internally to connect to the HTTP server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there’s no particular standard for naming them or turning them on; you need to read
|
||||
<li><code>urllib</code> relies on another standard Python library, <code>httplib</code>. Normally you don’t need to <code>import httplib</code> directly (<code>urllib</code> does that automatically), but you will here so you can set the debugging flag on the <code>HTTPConnection</code> class that <code>urllib</code> uses internally to connect to the <abbr>HTTP</abbr> server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there’s no particular standard for naming them or turning them on; you need to read
|
||||
the documentation of each library to see if such a feature is available.
|
||||
<li>Now that the debugging flag is set, information on the HTTP request and response is printed out in real time. The first
|
||||
thing it tells you is that you’re connecting to the server <code>diveintomark.org</code> on port 80, which is the standard port for HTTP.
|
||||
<li>When you request the Atom feed, <code>urllib</code> sends three lines to the server. The first line specifies the HTTP verb you’re using, and the path of the resource (minus
|
||||
<li>Now that the debugging flag is set, information on the <abbr>HTTP</abbr> request and response is printed out in real time. The first
|
||||
thing it tells you is that you’re connecting to the server <code>diveintomark.org</code> on port 80, which is the standard port for <abbr>HTTP</abbr>.
|
||||
<li>When you request the Atom feed, <code>urllib</code> sends three lines to the server. The first line specifies the <abbr>HTTP</abbr> verb you’re using, and the path of the resource (minus
|
||||
the domain name). All the requests in this chapter will use <code>GET</code>, but in the next chapter on <abbr>SOAP</abbr>, you’ll see that it uses <code>POST</code> for everything. The basic syntax is the same, regardless of the verb.
|
||||
<li>The second line is the <code>Host</code> header, which specifies the domain name of the service you’re accessing. This is important, because a single HTTP server
|
||||
<li>The second line is the <code>Host</code> header, which specifies the domain name of the service you’re accessing. This is important, because a single <abbr>HTTP</abbr> server
|
||||
can host multiple separate domains. My server currently hosts 12 domains; other servers can host hundreds or even thousands.
|
||||
<li>The third line is the <code>User-Agent</code> header. What you see here is the generic <code>User-Agent</code> that the <code>urllib</code> library adds by default. In the next section, you’ll see how to customize this to be more specific.
|
||||
<li>The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the <var>feeddata</var> variable). The status code here is <code>200</code>, meaning “everything’s normal, here’s the data you requested”. The server also tells you the date it responded to your request, some information about the server itself, and the content
|
||||
@@ -174,7 +172,7 @@ header: Connection: close
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.useragent">11.5. Setting the <code>User-Agent</code></h2>
|
||||
<p>The first step to improving your HTTP web services client is to identify yourself properly with a <code>User-Agent</code>. To do that, you need to move beyond the basic <code>urllib</code> and dive into <code>urllib2</code>.
|
||||
<p>The first step to improving your <abbr>HTTP</abbr> web services client is to identify yourself properly with a <code>User-Agent</code>. To do that, you need to move beyond the basic <code>urllib</code> and dive into <code>urllib2</code>.
|
||||
<div class=example><h3>Example 11.4. Introducing <code>urllib2</code></h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import httplib</kbd>
|
||||
<samp class=p>>>> </samp><kbd>httplib.HTTPConnection.debuglevel = 1</kbd> <span>①</span>
|
||||
@@ -199,8 +197,8 @@ header: Content-Length: 26848
|
||||
header: Connection: close
|
||||
</pre>
|
||||
<ol>
|
||||
<li>If you still have your Python <abbr>IDE</abbr> open from the previous section’s example, you can skip this, but this turns on <a href="#oa.debug" title="11.4. Debugging HTTP web services">HTTP debugging</a> so you can see what you’re actually sending over the wire, and what gets sent back.
|
||||
<li>Fetching an HTTP resource with <code>urllib2</code> is a three-step process, for good reasons that will become clear shortly. The first step is to create a <code>Request</code> object, which takes the URL of the resource you’ll eventually get around to retrieving. Note that this step doesn’t actually
|
||||
<li>If you still have your Python <abbr>IDE</abbr> open from the previous section’s example, you can skip this, but this turns on <a href="#oa.debug" title="11.4. Debugging HTTP web services"><abbr>HTTP</abbr> debugging</a> so you can see what you’re actually sending over the wire, and what gets sent back.
|
||||
<li>Fetching an <abbr>HTTP</abbr> resource with <code>urllib2</code> is a three-step process, for good reasons that will become clear shortly. The first step is to create a <code>Request</code> object, which takes the URL of the resource you’ll eventually get around to retrieving. Note that this step doesn’t actually
|
||||
retrieve anything yet.
|
||||
<li>The second step is to build a URL opener. This can take any number of handlers, which control how responses are handled.
|
||||
But you can also build an opener without any custom handlers, which is what you’re doing here. You’ll see how to define
|
||||
@@ -233,18 +231,18 @@ header: Connection: close
|
||||
</pre>
|
||||
<ol>
|
||||
<li>You’re continuing from the previous example; you’ve already created a <code>Request</code> object with the URL you want to access.
|
||||
<li>Using the <code>add_header</code> method on the <code>Request</code> object, you can add arbitrary HTTP headers to the request. The first argument is the header, the second is the value you’re
|
||||
<li>Using the <code>add_header</code> method on the <code>Request</code> object, you can add arbitrary <abbr>HTTP</abbr> headers to the request. The first argument is the header, the second is the value you’re
|
||||
providing for that header. Convention dictates that a <code>User-Agent</code> should be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form,
|
||||
and you’ll see a lot of variations in the wild, but somewhere it should include a URL of your application. The <code>User-Agent</code> is usually logged by the server along with other details of your request, and including a URL of your application allows
|
||||
server administrators looking through their access logs to contact you if something is wrong.
|
||||
<li>The <var>opener</var> object you created before can be reused too, and it will retrieve the same feed again, but with your custom <code>User-Agent</code> header.
|
||||
<li>And here’s you sending your custom <code>User-Agent</code>, in place of the generic one that Python sends by default. If you look closely, you’ll notice that you defined a <code>User-Agent</code> header, but you actually sent a <code>User-agent</code> header. See the difference? <code>urllib2</code> changed the case so that only the first letter was capitalized. It doesn’t really matter; HTTP specifies that header field
|
||||
<li>And here’s you sending your custom <code>User-Agent</code>, in place of the generic one that Python sends by default. If you look closely, you’ll notice that you defined a <code>User-Agent</code> header, but you actually sent a <code>User-agent</code> header. See the difference? <code>urllib2</code> changed the case so that only the first letter was capitalized. It doesn’t really matter; <abbr>HTTP</abbr> specifies that header field
|
||||
names are completely case-insensitive.
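<p>The header-case behaviour described above can be checked without touching the network. This is a minimal sketch that assumes the Python 3 port of the same API (<code>urllib.request</code> in place of <code>urllib2</code>); the application name and its URL are made up for illustration.

```python
import urllib.request

# Python 3 stand-in for urllib2.Request; nothing is retrieved at this point.
request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')

# Hypothetical application name, version, and contact URL.
request.add_header('User-Agent', 'ExampleApp/1.0 +http://example.org/')

# The library stores the header name with only its first letter capitalized.
stored_names = [name for name, value in request.header_items()]
stored_value = request.get_header('User-agent')
```

<p>Looking the header up under the name you originally typed (<code>User-Agent</code>) finds nothing on the <code>Request</code> object, because the stored key is <code>User-agent</code>. On the wire the difference is harmless, since header field names are case-insensitive.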
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.etags">11.6. Handling <code>Last-Modified</code> and <code>ETag</code></h2>
|
||||
<p>Now that you know how to add custom HTTP headers to your web service requests, let’s look at adding support for <code>Last-Modified</code> and <code>ETag</code> headers.
|
||||
<p>Now that you know how to add custom <abbr>HTTP</abbr> headers to your web service requests, let’s look at adding support for <code>Last-Modified</code> and <code>ETag</code> headers.
|
||||
<p>These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can
|
||||
turn it off by setting <code>httplib.HTTPConnection.debuglevel = 0</code>. Or you can just leave debugging on, if that helps you.
|
||||
<div class=example><h3 id="oa.etags.example.1">Example 11.6. Testing <code>Last-Modified</code></h3><pre class=screen>
|
||||
@@ -283,8 +281,8 @@ turn it off by setting <code>httplib.HTTPConnection.debuglevel = 0</code>. Or yo
|
||||
urllib2.HTTPError: HTTP Error 304: Not Modified</span>
|
||||
</pre>
|
||||
<ol>
|
||||
<li>Remember all those HTTP headers you saw printed out when you turned on debugging? This is how you can get access to them
|
||||
programmatically: <var>firstdatastream.headers</var> is <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">an object that acts like a dictionary</a> and allows you to get any of the individual headers returned from the HTTP server.
|
||||
<li>Remember all those <abbr>HTTP</abbr> headers you saw printed out when you turned on debugging? This is how you can get access to them
|
||||
programmatically: <var>firstdatastream.headers</var> is <a href="#fileinfo.userdict" title="5.5. Exploring UserDict: A Wrapper Class">an object that acts like a dictionary</a> and allows you to get any of the individual headers returned from the <abbr>HTTP</abbr> server.
|
||||
<li>On the second request, you add the <code>If-Modified-Since</code> header with the last-modified date from the first request. If the data hasn’t changed, the server should return a <code>304</code> status code.
|
||||
<li>Sure enough, the data hasn’t changed. You can see from the traceback that <code>urllib2</code> throws a special exception, <code>HTTPError</code>, in response to the <code>304</code> status code. This is a little unusual, and not entirely helpful. After all, it’s not an error; you specifically asked the
|
||||
server not to send you any data if it hadn’t changed, and the data didn’t change, so the server told you it wasn’t sending
|
||||
@@ -303,9 +301,9 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>①</s
|
||||
</pre>
|
||||
<ol>
|
||||
<li><code>urllib2</code> is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens
|
||||
— like an HTTP error, or even a <code>304</code> code — <code>urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code>urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
|
||||
— like an <abbr>HTTP</abbr> error, or even a <code>304</code> code — <code>urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code>urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
|
||||
<li><code>urllib2</code> searches through the defined handlers and calls the <code>http_error_default</code> method when it encounters a <code>304</code> status code from the server. By defining a custom error handler, you can prevent <code>urllib2</code> from raising an exception. Instead, you create the <code>HTTPError</code> object, but return it instead of raising it.
|
||||
<li>This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access
|
||||
<li>This is the key part: before returning, you save the status code returned by the <abbr>HTTP</abbr> server. This will allow you easy access
|
||||
to it from the calling program.
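<p>As a rough Python 3 sketch of the same idea (assuming <code>urllib.request</code> and <code>urllib.error</code> as stand-ins for <code>urllib2</code>), the handler below builds the <code>HTTPError</code> and returns it instead of raising it. In this port the error object already carries the status in its <code>code</code> attribute, so the sketch reads that rather than saving a separate copy. The handler is called by hand here so that the example runs offline; normally the opener calls it for you.

```python
import email.message
import io
import urllib.error
import urllib.request

class DefaultErrorHandler(urllib.request.HTTPDefaultErrorHandler):
    def http_error_default(self, req, fp, code, msg, headers):
        # Build the error object, but hand it back instead of raising it.
        return urllib.error.HTTPError(req.get_full_url(), code, msg, headers, fp)

# Building an opener with the custom handler does not contact any server.
opener = urllib.request.build_opener(DefaultErrorHandler())

# Offline illustration: invoke the handler directly with a fake 304 response.
request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')
result = DefaultErrorHandler().http_error_default(
    request, io.BytesIO(b''), 304, 'Not Modified', email.message.Message())
status = result.code
```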
|
||||
<div class=example><h3>Example 11.8. Using custom URL handlers</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>request.headers</kbd> <span>①</span>
|
||||
@@ -321,9 +319,9 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>①</s
|
||||
</pre>
|
||||
<ol>
|
||||
<li>You’re continuing the previous example, so the <code>Request</code> object is already set up, and you’ve already added the <code>If-Modified-Since</code> header.
|
||||
<li>This is the key: now that you’ve defined your custom URL handler, you need to tell <code>urllib2</code> to use it. Remember how I said that <code>urllib2</code> broke up the process of accessing an HTTP resource into three steps, and for good reason? This is why building the URL opener
|
||||
<li>This is the key: now that you’ve defined your custom URL handler, you need to tell <code>urllib2</code> to use it. Remember how I said that <code>urllib2</code> broke up the process of accessing an <abbr>HTTP</abbr> resource into three steps, and for good reason? This is why building the URL opener
|
||||
is its own step, because you can build it with your own custom URL handlers that override <code>urllib2</code>’s default behavior.
|
||||
<li>Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use <var>seconddatastream.headers.dict</var> to access them), also contains the HTTP status code. In this case, as you expected, the status is <code>304</code>, meaning this data hasn’t changed since the last time you asked for it.
|
||||
<li>Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use <var>seconddatastream.headers.dict</var> to access them), also contains the <abbr>HTTP</abbr> status code. In this case, as you expected, the status is <code>304</code>, meaning this data hasn’t changed since the last time you asked for it.
|
||||
<li>Note that when the server sends back a <code>304</code> status code, it doesn’t re-send the data. That’s the whole point: to save bandwidth by not re-downloading data that hasn’t
|
||||
changed. So if you actually want that data, you’ll need to cache it locally the first time you get it.
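<p>One way to picture the local cache is a dictionary keyed by URL. This is only a sketch: the <code>remember</code> helper, the cache layout, and the sample date and body are all hypothetical, and a real client would probably persist the cache to disk.

```python
URL = 'http://diveintomark.org/xml/atom.xml'

def remember(cache, url, status, headers, body):
    """Return the current body for url, falling back to the cache on a 304."""
    if status == 304:
        # The server sent no data; reuse what was stored the first time.
        # (This raises KeyError if nothing was ever cached for this URL.)
        return cache[url]['body']
    # Fresh data: store it together with the validator for the next request.
    cache[url] = {'body': body, 'last_modified': headers.get('Last-Modified')}
    return body

cache = {}
# First fetch: a 200 with data (placeholder date and body).
first = remember(cache, URL, 200,
                 {'Last-Modified': 'Thu, 15 Apr 2004 19:45:21 GMT'}, b'<feed/>')
# Second fetch: a 304 with an empty body, so the cached copy is returned.
second = remember(cache, URL, 304, {}, b'')
```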
|
||||
<p>Handling <code>ETag</code> works much the same way, but instead of checking for <code>Last-Modified</code> and sending <code>If-Modified-Since</code>, you check for <code>ETag</code> and send <code>If-None-Match</code>. Let’s start with a fresh <abbr>IDE</abbr> session.
|
||||
@@ -362,7 +360,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>①</s
|
||||
<li>Regardless of whether the <code>304</code> is triggered by <code>Last-Modified</code> date checking or <code>ETag</code> hash matching, you’ll never get the data along with the <code>304</code>. That’s the whole point.
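<p>Since a server may send an <code>ETag</code>, a <code>Last-Modified</code> date, both, or neither, a client can build its conditional headers from whatever the previous response actually contained. The helper name below is hypothetical and the validator values are placeholders; this is a sketch of the defensive approach, not a prescribed implementation.

```python
def conditional_headers(previous_headers):
    """Build If-None-Match / If-Modified-Since from an earlier response."""
    headers = {}
    etag = previous_headers.get('ETag')
    last_modified = previous_headers.get('Last-Modified')
    if etag:
        headers['If-None-Match'] = etag
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    return headers

# Placeholder values standing in for headers from a first request.
both = conditional_headers({'ETag': '"abc123"',
                            'Last-Modified': 'Thu, 15 Apr 2004 19:45:21 GMT'})
only_etag = conditional_headers({'ETag': '"abc123"'})
neither = conditional_headers({})
```

<p>When neither validator is available, the request simply goes out unconditionally and the server returns the full data.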
|
||||
<table id="tip.etag.vs.lastmodified" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In these examples, the HTTP server has supported both <code>Last-Modified</code> and <code>ETag</code> headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In these examples, the <abbr>HTTP</abbr> server has supported both <code>Last-Modified</code> and <code>ETag</code> headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively
|
||||
in case a server only supports one or the other, or neither.
|
||||
|
||||
<p class=a>⁂
|
||||
@@ -553,9 +551,9 @@ http://diveintomark.org/xml/atom.xml
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.gzip">11.8. Handling compressed data</h2>
|
||||
<p>The last important HTTP feature you want to support is compression. Many web services have the ability to send data compressed,
|
||||
which can cut down the amount of data sent over the wire by 60% or more. This is especially true of XML web services, since
|
||||
XML data compresses very well.
|
||||
<p>The last important <abbr>HTTP</abbr> feature you want to support is compression. Many web services have the ability to send data compressed,
|
||||
which can cut down the amount of data sent over the wire by 60% or more. This is especially true of <abbr>XML</abbr> web services, since
|
||||
<abbr>XML</abbr> data compresses very well.
|
||||
<p>Servers won’t give you compressed data unless you tell them you can handle it.
|
||||
<div class=example><h3>Example 11.14. Telling the server you would like compressed data</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import urllib2, httplib</kbd>
|
||||
@@ -618,7 +616,7 @@ header: Content-Type: application/atom+xml</span>
|
||||
buffer in memory, and you don’t want to write out a temporary file just so you can uncompress it. So what you’re going to
|
||||
do is create a file-like object out of the in-memory data (<var>compresseddata</var>), using the <code>StringIO</code> module. You first saw the <code>StringIO</code> module in <a href="#kgp.openanything.stringio.example" title="Example 10.4. Introducing StringIO">the previous chapter</a>, but now you’ve found another use for it.
|
||||
<li>Now you can create an instance of <code>GzipFile</code>, and tell it that its “file” is the file-like object <var>compressedstream</var>.
|
||||
<li>This is the line that does all the actual work: “reading” from <code>GzipFile</code> will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. <var>gzipper</var> is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; <var>gzipper</var> is really just “reading” from the file-like object you created with <code>StringIO</code> to wrap the compressed data, which is only in memory in the variable <var>compresseddata</var>. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built with <code>urllib2.build_opener</code>. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it.
|
||||
<li>This is the line that does all the actual work: “reading” from <code>GzipFile</code> will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. <var>gzipper</var> is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; <var>gzipper</var> is really just “reading” from the file-like object you created with <code>StringIO</code> to wrap the compressed data, which is only in memory in the variable <var>compresseddata</var>. And where did that compressed data come from? You originally downloaded it from a remote <abbr>HTTP</abbr> server by “reading” from the file-like object you built with <code>urllib2.build_opener</code>. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it.
|
||||
<li>Look ma, real data. (15955 bytes of it, in fact.)<p>“But wait!” I hear you cry. “This could be even easier!” I know what you’re thinking. You’re thinking that <var>opener.open</var> returns a file-like object, so why not cut out the <code>StringIO</code> middleman and just pass <var>f</var> directly to <code>GzipFile</code>? OK, maybe you weren’t thinking that, but don’t worry about it, because it doesn’t work.
|
||||
<div class=example><h3>Example 11.16. Decompressing the data directly from the server</h3><pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>f = opener.open(request)</kbd><span>①</span>
|
||||
@@ -638,14 +636,14 @@ AttributeError: addinfourl instance has no attribute 'tell'</span>
|
||||
<li>Simply opening the request will get you the headers (though not download any data yet). As you can see from the returned
|
||||
<code>Content-Encoding</code> header, this data has been sent gzip-compressed.
|
||||
<li>Since <code>opener.open</code> returns a file-like object, and you know from the headers that when you read it, you’re going to get gzip-compressed data,
|
||||
why not simply pass that file-like object directly to <code>GzipFile</code>? As you “read” from the <code>GzipFile</code> instance, it will “read” compressed data from the remote HTTP server and decompress it on the fly. It’s a good idea, but unfortunately it doesn’t
|
||||
why not simply pass that file-like object directly to <code>GzipFile</code>? As you “read” from the <code>GzipFile</code> instance, it will “read” compressed data from the remote <abbr>HTTP</abbr> server and decompress it on the fly. It’s a good idea, but unfortunately it doesn’t
|
||||
work. Because of the way gzip compression works, <code>GzipFile</code> needs to save its position and move forwards and backwards through the compressed file. This doesn’t work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and
|
||||
forth through the data stream. So the inelegant hack of using <code>StringIO</code> is the best solution: download the compressed data, create a file-like object out of it with <code>StringIO</code>, and then decompress the data from that.
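<p>The download-then-decompress chain can be tried offline. The sketch below assumes Python 3, where <code>io.BytesIO</code> plays the role of <code>StringIO</code> for binary data, and it fabricates the compressed bytes locally instead of reading them from a server.

```python
import gzip
import io

# Stand-in for the bytes you would get from f.read() on a gzip-encoded response.
original = b'<feed>example entry</feed>' * 100
compresseddata = gzip.compress(original)

# Wrap the in-memory bytes in a seekable file-like object.
compressedstream = io.BytesIO(compresseddata)

# GzipFile treats that object as its "file" and decompresses as it reads.
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()
```

<p>Because the in-memory object supports moving back and forth through the data, <code>GzipFile</code> can work with it where the raw network stream could not.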
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id="oa.alltogether">11.9. Putting it all together</h2>
|
||||
<p>You’ve seen all the pieces for building an intelligent HTTP web services client. Now let’s see how they all fit together.
|
||||
<p>You’ve seen all the pieces for building an intelligent <abbr>HTTP</abbr> web services client. Now let’s see how they all fit together.
|
||||
<div class=example><h3>Example 11.17. The <code>openanything</code> function</h3>
|
||||
<p>This function is defined in <code>openanything.py</code>.
|
||||
<pre><code>
|
||||
@@ -665,8 +663,8 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
|
||||
</pre>
|
||||
<ol>
|
||||
<li><code>urlparse</code> is a handy utility module for, you guessed it, parsing URLs. Its primary function, also called <code>urlparse</code>, takes a URL and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier).
|
||||
Of these, the only thing you care about is the scheme, to make sure that you’re dealing with an HTTP URL (which <code>urllib2</code> can handle).
|
||||
<li>You identify yourself to the HTTP server with the <code>User-Agent</code> passed in by the calling function. If no <code>User-Agent</code> was specified, you use a default one defined earlier in the <code>openanything.py</code> module. You never use the default one defined by <code>urllib2</code>.
|
||||
Of these, the only thing you care about is the scheme, to make sure that you’re dealing with an <abbr>HTTP</abbr> URL (which <code>urllib2</code> can handle).
|
||||
<li>You identify yourself to the <abbr>HTTP</abbr> server with the <code>User-Agent</code> passed in by the calling function. If no <code>User-Agent</code> was specified, you use a default one defined earlier in the <code>openanything.py</code> module. You never use the default one defined by <code>urllib2</code>.
|
||||
<li>If an <code>ETag</code> hash was given, send it in the <code>If-None-Match</code> header.
|
||||
<li>If a last-modified date was given, send it in the <code>If-Modified-Since</code> header.
|
||||
<li>Tell the server you would like compressed data if possible.
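<p>The request-building steps listed so far can be sketched in isolation. This is not the book’s <code>openAnything</code> function: <code>build_request</code> and the default <code>USER_AGENT</code> value are hypothetical, and the code assumes the Python 3 modules <code>urllib.parse</code> and <code>urllib.request</code>. Nothing is sent over the network.

```python
import urllib.parse
import urllib.request

USER_AGENT = 'ExampleApp/1.0 +http://example.org/'  # hypothetical default

def build_request(source, etag=None, lastmodified=None, agent=USER_AGENT):
    """Return a Request carrying the headers described above, or None."""
    # Only the scheme matters here: is this an HTTP URL at all?
    if urllib.parse.urlparse(source).scheme != 'http':
        return None
    request = urllib.request.Request(source)
    request.add_header('User-Agent', agent)
    if etag:
        request.add_header('If-None-Match', etag)
    if lastmodified:
        request.add_header('If-Modified-Since', lastmodified)
    # Ask for compressed data if the server can provide it.
    request.add_header('Accept-encoding', 'gzip')
    return request

request = build_request('http://diveintomark.org/xml/atom.xml', etag='"abc123"')
not_http = build_request('/tmp/atom.xml')
```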
|
||||
@@ -731,7 +729,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
|
||||
</pre>
|
||||
<ol>
|
||||
<li>The very first time you fetch a resource, you don’t have an <code>ETag</code> hash or <code>Last-Modified</code> date, so you’ll leave those out. (They’re <a href="#apihelper.optional" title="4.2. Using Optional and Named Arguments">optional parameters</a>.)
|
||||
<li>What you get back is a dictionary of several useful headers, the HTTP status code, and the actual data returned from the server.
|
||||
<li>What you get back is a dictionary of several useful headers, the <abbr>HTTP</abbr> status code, and the actual data returned from the server.
|
||||
<code>openanything</code> handles the gzip compression internally; you don’t care about that at this level.
|
||||
<li>If you ever get a <code>301</code> status code, that’s a permanent redirect, and you need to update your URL to the new address.
|
||||
<li>The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) URL, the
|
||||
@@ -742,7 +740,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
|
||||
|
||||
<h2 id="oa.summary">11.10. Summary</h2>
|
||||
<p>The <code>openanything.py</code> module and its functions should now make perfect sense.
|
||||
<p>There are 5 important features of HTTP web services that every client should support:
|
||||
<p>There are 5 important features of <abbr>HTTP</abbr> web services that every client should support:
|
||||
<div class=itemizedlist>
|
||||
<ul>
|
||||
<li>Identifying your application <a href="#oa.useragent" title="11.5. Setting the User-Agent">by setting a proper <code>User-Agent</code></a>.
|
||||
@@ -756,11 +754,11 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
|
||||
<li>Supporting <a href="#oa.gzip" title="11.8. Handling compressed data">gzip compression</a> to reduce bandwidth even when data <em>has</em> changed.
|
||||
|
||||
</ul>
|
||||
-->
|
||||
|
||||
<p class=a>⁂
|
||||
-->
|
||||
|
||||
<h2 id=beyond-get>Going Beyond GET</h2>
|
||||
<h2 id=beyond-get>Beyond GET</h2>
|
||||
|
||||
<p>FIXME
|
||||
|
||||
@@ -820,7 +818,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id=beyond-post>Going Beyond POST</h2>
|
||||
<h2 id=beyond-post>Beyond POST</h2>
|
||||
|
||||
<p>FIXME
|
||||
|
||||
@@ -841,8 +839,8 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
|
||||
<li><a href=http://www.xml.com/pub/a/2006/02/01/doing-http-caching-right-introducing-httplib2.html>Doing <abbr>HTTP</abbr> Caching Right: Introducing <code>httplib2</code></a>
|
||||
<li><a href=http://www.xml.com/pub/a/2006/03/29/httplib2-http-persistence-and-authentication.html><code>httplib2</code>: <abbr>HTTP</abbr> Persistence and Authentication</a>
|
||||
<li><a href=http://apiwiki.twitter.com/>Twitter <abbr>API</abbr> reference</a>
|
||||
<li><a href=http://www.mnot.net/cache_docs/>HTTP Caching Tutorial</a> by Mark Nottingham
|
||||
<li><a href=http://code.google.com/p/doctype/wiki/ArticleHttpCaching>How to control caching with HTTP headers</a> on Google Doctype
|
||||
<li><a href=http://www.mnot.net/cache_docs/><abbr>HTTP</abbr> Caching Tutorial</a> by Mark Nottingham
|
||||
<li><a href=http://code.google.com/p/doctype/wiki/ArticleHttpCaching>How to control caching with <abbr>HTTP</abbr> headers</a> on Google Doctype
|
||||
</ul>
|
||||
|
||||
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
|
||||
|
||||