mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
features of HTTP section
+54
-50
@@ -47,10 +47,11 @@ mark{display:inline}
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id=dont-try-this-at-home>How Not To Fetch Data Over HTTP</h2>
|
||||
|
||||
<p>Let’s say you want to download a resource over HTTP, such as <a href=xml.html>an Atom feed</a>. But you don’t just want to download it once; you want to download it over and over again, every hour, to get the latest news from the site that’s offering the news feed. Let’s do it the quick-and-dirty way first, and then see how you can do better.
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import urllib.request</kbd>
|
||||
<samp class=p>>>> </samp><kbd>data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read()</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read()</kbd> <span>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>print(data)</kbd>
|
||||
<samp><?xml version="1.0" encoding="utf-8"?>
|
||||
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
||||
@@ -73,60 +74,61 @@ mark{display:inline}
|
||||
|
||||
<h2 id=http-features>Features of HTTP</h2>
|
||||
|
||||
<p>There are five important features which all HTTP clients should support.
|
||||
|
||||
<h3 id=caching>Caching</h3>
|
||||
|
||||
<p>FIXME
|
||||
|
||||
<h3 id=last-modified>Last-Modified Checking</h3>
|
||||
|
||||
<p>Some data changes all the time. The home page of CNN.com updates every few minutes. On the other hand, the home page of Google.com may not change for days or even weeks (and then only when they put up a special holiday logo or advertise a new service). Web services are no different. The server knows when the data you’re requesting last changed, and HTTP provides a way for the server to include this last-modified date each time you request the data.
|
||||
|
||||
<p>If you ask for the same data a second (or third or fourth) time, you can tell the server the last-modified date that you got last time. You send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special HTTP status code <code>304</code>, which means “this data hasn’t changed since the last time you asked for it.” Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn’t re-send the data</em>. All you get is the status code. So you don’t need to download the same data over and over again if it hasn’t changed; the server assumes you have the data <a href=#caching>cached locally</a>.
|
||||
|
||||
<p>All modern web browsers support last-modified date checking. If you’ve ever visited a page, re-visited the same page a day later and found that it hadn’t changed, and wondered why it loaded so quickly the second time — this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically sent the last-modified date it got from the server the first time. The server simply says <code>304: Not Modified</code>, so your browser knows to load the page from its cache. Web services work the same way.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for last-modified date checking, but <code>httplib2</code> does.
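<p>The exchange described above can also be done by hand with <code>urllib.request</code>. The following is a minimal sketch, not a replacement for <code>httplib2</code>; the function names <code>build_conditional_request</code> and <code>fetch_if_modified</code> are hypothetical, and the caller is assumed to keep the previous body and date somewhere itself.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Return a Request that carries If-Modified-Since when a date is known."""
    request = urllib.request.Request(url)
    if last_modified:
        # Send back, unchanged, the Last-Modified value from the previous response.
        request.add_header('If-Modified-Since', last_modified)
    return request


def fetch_if_modified(url, last_modified=None):
    """Return (status, body, last_modified); body is None on a 304."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (response.status, response.read(),
                    response.headers.get('Last-Modified'))
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: the server sent no body, so reuse the cached copy.
            return 304, None, last_modified
        raise
```

<p>Note that <code>urllib.request</code> reports a <code>304</code> by raising <code>HTTPError</code>, which is why the sketch catches the exception instead of checking a return value.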
|
||||
|
||||
<h3 id=etag>ETags</h3>
|
||||
|
||||
<p>ETags are an alternative way to accomplish the same thing as <a href=#last-modified>last-modified date checking</a>. With ETags, the server sends a hash code in an <code>ETag</code> header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) The second time you request the same data, you include the ETag hash in an <code>If-None-Match</code> header of your request. If the data hasn’t changed, the server will send you back a <code>304</code> status code. As with last-modified date checking, the server sends back <em>only</em> the <code>304</code> status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since <a href=#caching>you still have the data from the last time</a>.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for ETags, but <code>httplib2</code> does.
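<p>To make the server’s side of this concrete, here is a small sketch of the decision a server might make. It assumes the ETag is a SHA-1 hash of the body, which is only one possible scheme, and the names <code>make_etag</code> and <code>respond</code> are hypothetical. Real servers also handle lists of tags, weak validators, and <code>*</code> in <code>If-None-Match</code>; this sketch ignores all of that.

```python
import hashlib


def make_etag(body):
    """Derive a quoted ETag from the response body (one possible scheme)."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'


def respond(body, if_none_match=None):
    """Return (status, headers, payload) as a server might for a GET."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's copy is current: send the status code and no data.
        return 304, {'ETag': etag}, b''
    # First request, or the data changed: send everything plus the new ETag.
    return 200, {'ETag': etag}, body
```

<p>A client would store the <code>ETag</code> value from the first response and pass it back as <code>if_none_match</code> on the next request.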
|
||||
|
||||
<h3 id=compression>Compression</h3>
|
||||
|
||||
<p>When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s <abbr>XML</abbr>; maybe it’s <abbr>JSON</abbr>. Regardless of the format, text compresses well. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include the <code>Accept-Encoding</code> header in your request, and if the server supports compression, it will send you back compressed data and mark it with a <code>Content-Encoding</code> header.
|
||||
|
||||
<p>HTTP supports several compression algorithms. The two most common types are <a href=http://www.ietf.org/rfc/rfc1952.txt>gzip</a> and <a href=http://www.ietf.org/rfc/rfc1951.txt>deflate</a>.
|
||||
|
||||
<p>Python’s URL libraries have no built-in support for compression, but <code>httplib2</code> does.
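<p>Doing this by hand is possible with the standard library’s <code>gzip</code> module. The sketch below handles gzip only and passes anything else through untouched; the function names are hypothetical, and it assumes the server labels compressed responses with a plain <code>gzip</code> value in <code>Content-Encoding</code>.

```python
import gzip
import urllib.request


def build_gzip_request(url):
    """Return a Request that tells the server gzip responses are welcome."""
    request = urllib.request.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    return request


def decode_body(raw, content_encoding):
    """Decompress raw bytes if the server marked them as gzip-compressed."""
    if content_encoding and content_encoding.strip().lower() == 'gzip':
        return gzip.decompress(raw)
    # No Content-Encoding (or one this sketch does not handle): return as-is.
    return raw


def fetch_compressed(url):
    """Fetch a URL, asking for gzip and decompressing when it is used."""
    with urllib.request.urlopen(build_gzip_request(url)) as response:
        return decode_body(response.read(),
                           response.headers.get('Content-Encoding'))
```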
|
||||
|
||||
<h3 id=redirects>Redirects</h3>
|
||||
|
||||
<p><a href=http://www.w3.org/Provider/Style/URI>Cool URIs don’t change</a>, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; <code>http://www.example.com/index.xml</code> becomes <code>http://server-farm-1.example.com/index.xml</code>.
|
||||
|
||||
<p>Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code <code>200</code> means “everything’s normal, here’s the page you asked for”. Status code <code>404</code> means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300s indicate some form of redirection.
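<p>Python’s standard library knows these codes by name. As a small illustration, the sketch below uses <code>http.HTTPStatus</code> to look up the official phrase for a code and groups codes by their first digit; the helper names <code>status_class</code> and <code>describe</code> are hypothetical.

```python
from http import HTTPStatus

# The first digit of a status code determines its general class.
STATUS_CLASSES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}


def status_class(code):
    """Return the general class of a status code from its first digit."""
    return STATUS_CLASSES[code // 100]


def describe(code):
    """Return a short human-readable description of a status code."""
    return f'{code} {HTTPStatus(code).phrase} ({status_class(code)})'
```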
|
||||
|
||||
<p>HTTP has several different ways of signifying that a resource has moved. The two most common techniques are status codes <code>302</code> and <code>301</code>. Status code <code>302</code> is a <i>temporary redirect</i>; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a <code>Location:</code> header). Status code <code>301</code> is a <i>permanent redirect</i>; it means “oops, that got moved permanently” (and then gives the new address in a <code>Location:</code> header). If you get a <code>302</code> status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you’re supposed to use the new address from then on.
|
||||
|
||||
<p>The <code>urllib</code> module will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn’t tell you that it did so. You’ll end up getting the data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address. That’s two round trips instead of one, which is bad for the service operator and bad for you.
|
||||
|
||||
<p><code>httplib2</code> handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
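<p>The idea of remembering permanent redirects can be sketched in a few lines. This is an illustration of the rule described above (remember <code>301</code>, forget <code>302</code>), not how <code>httplib2</code> is actually implemented; the <code>RedirectMemory</code> class is hypothetical and keeps its table in memory only.

```python
class RedirectMemory:
    """Remember permanent (301) redirects and rewrite URLs before requesting."""

    def __init__(self):
        self._permanent = {}

    def record(self, status, old_url, new_url):
        """Store a redirect only if the server said it was permanent."""
        if status == 301:
            self._permanent[old_url] = new_url
        # A 302 is temporary, so the old address should be retried next time.

    def resolve(self, url):
        """Follow remembered redirects, stopping if they loop back."""
        seen = set()
        while url in self._permanent and url not in seen:
            seen.add(url)
            url = self._permanent[url]
        return url
```

<p>A client would call <code>resolve()</code> before each request and <code>record()</code> after each redirect response, which saves the extra round trip on every later request to a permanently moved address.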
|
||||
|
||||
<!--
|
||||
<p>There are five important features of HTTP which you should support.
|
||||
<h3>11.3.1. <code>User-Agent</code></h3>
|
||||
<p>The <code>User-Agent</code> is simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web
|
||||
service over HTTP. When the client requests a resource, it should always announce who it is, as specifically as possible.
|
||||
This allows the server-side administrator to get in touch with the client-side developer if anything is going fantastically
|
||||
wrong.
|
||||
<h3><code>User-Agent</code></h3>
|
||||
|
||||
<p>The <code>User-Agent</code> is simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web service over HTTP. When the client requests a resource, it should always announce who it is, as specifically as possible. This helps the server-side administrator figure out who to contact when things go fantastically wrong.
|
||||
|
||||
<p>By default, Python sends a generic <code>User-Agent</code>: <code>Python-urllib/1.15</code>. In the next section, you’ll see how to change this to something more specific.
|
||||
<h3>11.3.2. Redirects</h3>
|
||||
<p>Sometimes resources move around. Web sites get reorganized, pages move to new addresses. Even web services can reorganize.
|
||||
A syndicated feed at <code>http://example.com/index.xml</code> might be moved to <code>http://example.com/xml/atom.xml</code>. Or an entire domain might move, as an organization expands and reorganizes; for instance, <code>http://www.example.com/index.xml</code> might be redirected to <code>http://server-farm-1.example.com/index.xml</code>.
|
||||
<p>Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status
|
||||
code <code>200</code> means “everything’s normal, here’s the page you asked for”. Status code <code>404</code> means “page not found”. (You’ve probably seen 404 errors while browsing the web.)
|
||||
<p>HTTP has two different ways of signifying that a resource has moved. Status code <code>302</code> is a <em>temporary redirect</em>; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a <code>Location:</code> header). Status code <code>301</code> is a <em>permanent redirect</em>; it means “oops, that got moved permanently” (and then gives the new address in a <code>Location:</code> header). If you get a <code>302</code> status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but
|
||||
the next time you want to access the same resource, you should retry the old address. But if you get a <code>301</code> status code and a new address, you’re supposed to use the new address from then on.
|
||||
<p><code>urllib.urlopen</code> will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn’t tell you when
|
||||
it does so. You’ll end up getting data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to
|
||||
the new address. That’s two round trips instead of one: not very efficient! Later in this chapter, you’ll see how to work
|
||||
around this so you can deal with permanent redirects properly and efficiently.
|
||||
<h3>11.3.3. <code>Last-Modified</code>/<code>If-Modified-Since</code></h3>
|
||||
<p>Some data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the
|
||||
home page of Google.com only changes once every few weeks (when they put up a special holiday logo, or advertise a new service).
|
||||
Web services are no different; usually the server knows when the data you requested last changed, and HTTP provides a way
|
||||
for the server to include this last-modified date along with the data you requested.
|
||||
<p>If you ask for the same data a second time (or third, or fourth), you can tell the server the last-modified date that you
|
||||
got last time: you send an <code>If-Modified-Since</code> header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the
|
||||
server sends back a special HTTP status code <code>304</code>, which means “this data hasn’t changed since the last time you asked for it”. Why is this an improvement? Because when the server sends a <code>304</code>, <em>it doesn’t re-send the data</em>. All you get is the status code. So you don’t need to download the same data over and over again if it hasn’t changed;
|
||||
the server assumes you have the data cached locally.
|
||||
<p>All modern web browsers support last-modified date checking. If you’ve ever visited a page, re-visited the same page a day
|
||||
later and found that it hadn’t changed, and wondered why it loaded so quickly the second time -- this could be why. Your
|
||||
web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically
|
||||
sent the last-modified date it got from the server the first time. The server simply says <code>304: Not Modified</code>, so your browser knows to load the page from its cache. Web services can be this smart too.
|
||||
<p>Python’s URL library has no built-in support for last-modified date checking, but since you can add arbitrary headers to each request
|
||||
and read arbitrary headers in each response, you can add support for it yourself.
|
||||
<h3>11.3.4. <code>ETag</code>/<code>If-None-Match</code></h3>
|
||||
<p>ETags are an alternate way to accomplish the same thing as the last-modified date checking: don’t re-download data that hasn’t
|
||||
changed. The way it works is, the server sends some sort of hash of the data (in an <code>ETag</code> header) along with the data you requested. Exactly how this hash is determined is entirely up to the server. The second
|
||||
time you request the same data, you include the ETag hash in an <code>If-None-Match:</code> header, and if the data hasn’t changed, the server will send you back a <code>304</code> status code. As with the last-modified date checking, the server <em>just</em> sends the <code>304</code>; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the
|
||||
server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the
|
||||
last time.
|
||||
<p>Python’s URL library has no built-in support for ETags, but you’ll see how to add it later in this chapter.
|
||||
<h3>11.3.5. Compression</h3>
|
||||
<p>The last important HTTP feature is gzip compression. When you talk about HTTP web services, you’re almost always talking
|
||||
about moving XML back and forth over the wire. XML is text, and quite verbose text at that, and text generally compresses
|
||||
well. When you request a resource over HTTP, you can ask the server that, if it has any new data to send you, to please send
|
||||
it in compressed format. You include the <code>Accept-encoding: gzip</code> header in your request, and if the server supports compression, it will send you back gzip-compressed data and mark it with
|
||||
a <code>Content-encoding: gzip</code> header.
|
||||
<p>Python’s URL library has no built-in support for gzip compression per se, but you can add arbitrary headers to the request. And
|
||||
Python comes with a separate <code>gzip</code> module, which has functions you can use to decompress the data yourself.
|
||||
<p>Note that <a href="#oa.review" title="11.2. How not to fetch data over HTTP">our little one-line script</a> to download a syndicated feed did not support any of these HTTP features. Let’s see how you can improve it.
|
||||
|
||||
<p>Note that [FIXME-href] our little one-line script to download an Atom feed did not support any of these HTTP features. Let’s see how you can improve it.
|
||||
|
||||
<p class=a>⁂
|
||||
-->
|
||||
|
||||
<!--
|
||||
<h2 id="oa.debug">11.4. Debugging HTTP web services</h2>
|
||||
<p>First, let’s turn on the debugging features of Python’s HTTP library and see what’s being sent over the wire. This will be useful throughout the chapter, as you add more and
|
||||
more features.
|
||||
@@ -301,7 +303,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): <span>①</s
|
||||
</pre>
|
||||
<ol>
|
||||
<li><code>urllib2</code> is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens
|
||||
-- like an HTTP error, or even a <code>304</code> code -- <code>urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code>urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
|
||||
— like an HTTP error, or even a <code>304</code> code — <code>urllib2</code> introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in <a href="#kgp" title="Chapter 9. XML Processing">Chapter 9, <i>XML Processing</i></a> to define handlers for different node types, but <code>urllib2</code> is more flexible, and introspects over as many handlers as are defined for the current request.
|
||||
<li><code>urllib2</code> searches through the defined handlers and calls the <code>http_error_default</code> method when it encounters a <code>304</code> status code from the server. By defining a custom error handler, you can prevent <code>urllib2</code> from raising an exception. Instead, you create the <code>HTTPError</code> object, but return it instead of raising it.
|
||||
<li>This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access
|
||||
to it from the calling program.
|
||||
@@ -839,6 +841,8 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
|
||||
<li><a href=http://www.xml.com/pub/a/2006/02/01/doing-http-caching-right-introducing-httplib2.html>Doing <abbr>HTTP</abbr> Caching Right: Introducing <code>httplib2</code></a>
|
||||
<li><a href=http://www.xml.com/pub/a/2006/03/29/httplib2-http-persistence-and-authentication.html><code>httplib2</code>: <abbr>HTTP</abbr> Persistence and Authentication</a>
|
||||
<li><a href=http://apiwiki.twitter.com/>Twitter <abbr>API</abbr> reference</a>
|
||||
<li><a href=http://www.mnot.net/cache_docs/>HTTP Caching Tutorial</a> by Mark Nottingham
|
||||
<li><a href=http://code.google.com/p/doctype/wiki/ArticleHttpCaching>How to control caching with HTTP headers</a> on Google Doctype
|
||||
</ul>
|
||||
|
||||
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
|
||||
|
||||
@@ -16,7 +16,12 @@ for f in *.html; do
|
||||
python3 htmlminimizer.py "$f" build/"$f"
|
||||
done
|
||||
|
||||
# add evil tracking code
|
||||
# build sitemap
|
||||
ls build/*.html | sed -e "s|build/|http://diveintopython3.org/|g" > build/sitemap.txt
|
||||
|
||||
echo "adding evil tracking code"
|
||||
|
||||
# add Google Analytics script
|
||||
for f in build/*.html; do
|
||||
cat "$f" ga.js > build/tmp
|
||||
mv build/tmp "$f"
|
||||
|
||||