You are here: Home ‣ Dive Into Python 3 ‣

Difficulty level: ♦♦♦♦♢

HTTP Web Services

❝ A ruffled mind makes a restless pillow. ❞
— Charlotte Brontë

Diving In

HTTP web services are programmatic ways of sending and receiving data from remote servers using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET; if you want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using HTTP PUT and HTTP DELETE. In other words, the “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data.

The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML data — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each “call” to the web service had a unique URL, you can load it in your web browser and immediately see the raw data.

Examples of HTTP web services:

Google Data APIs allow you to interact with a wide variety of Google services, including Blogger and YouTube.
Flickr Services allow you to upload and download photos from Flickr.
Twitter API allows you to publish status updates on Twitter.
…and many more

Python 3 comes with two different libraries for interacting with HTTP web services:

http.client is a low-level library that implements RFC 2616, the HTTP protocol.
urllib.request is an abstraction layer built on top of http.client. It provides a standard API for accessing both HTTP and FTP servers, automatically follows HTTP redirects, and handles some common forms of HTTP authentication.

So which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction that urllib.request.

To understand why httplib2 is the right choice, you first need to understand HTTP.

⁂

Features of HTTP

There are five important features which all HTTP clients should support.

Caching

The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinary long time to open a connection, send a request, and retrieve a response from a remote server. Even the fastest broadband connection is orders of magnitude slower than your local network, which in turn is orders of magnitude slower than you local disk.

HTTP is designed with caching in mind. There is an entire class of devices (called “caching proxies”) whose only job is to sit between you and the rest of the world and minimize network access. Your company or ISP almost certainly maintains caching proxies, even if you’re unaware of them. They work because caching built into the HTTP protocol.

Here’s a concrete example of how caching works. You visit diveintomark.org in your browser. That page includes a background image, wearehugh.com/m.jpg. When your browser downloads that image, the server includes the following HTTP headers:

HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg

The Cache-Control and Expires headers tell your browser (and any caching proxies between you and the server) that this image can be cached for up to a year. A year! And if, in the next year, you visit another page which also includes a link to this image, your browser will load the image from its cache without generating any network activity whatsoever.

But wait, it gets better. Let’s say your browser purges the image from your local cache for some reason. Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the HTTP headers said that this data could be cached by public caching proxies (by virtue of that public keyword in the Cache-Control header). Caching proxies are designed to have tons of storage space, probably far more than your local browser has allocated.

If your company or ISP maintain a caching proxy, the proxy may still have the image cached. When you visit diveintomark.org again, your browser will look in its local cache for the image, but it won’t find it, so it will make a network request to try to download it from the remote server. But if the caching proxy still has a copy of the image, it will intercept that request and serve the image from its cache. That means that your request will never reach the remote server; in fact, it will never leave your company’s network. That makes for a faster download (fewer network hops) and saves your company money (less data being downloaded from the outside world).

HTTP caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be.

Python’s HTTP libraries do not support caching, but httplib2 does.

Last-Modified Checking

Some data never changes, while other data changes all the time. In between, there is a vast field of data that might have changed, but hasn’t. CNN.com’s feed is updated every few minutes, but my weblog’s feed may not change for days or weeks at a time. In the latter case, I don’t want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they’re respecting my cache headers which said “don’t bother checking this feed for weeks”). On the other hand, I don’t want clients downloading my entire feed once an hour if it hasn’t changed!

HTTP has a solution to this, too. When you request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was changed. That background image referenced from diveintomark.org included a Last-Modified header.

HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg

When you request the same data a second (or third or fourth) time, you can send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special HTTP 304 status code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the command line, using curl:

you@localhost:~$ curl -I -H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT" http://wearehugh.com/m.jpg
HTTP/1.1 304 Not Modified
Date: Sun, 31 May 2009 18:04:39 GMT
Server: Apache
Connection: close
ETag: "3075-ddc8d800"
Expires: Mon, 31 May 2010 18:04:39 GMT
Cache-Control: max-age=31536000, public

Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is the status code. Even after your cached copy has expired, last-modified checking ensures that you won’t download the same data twice if it hasn’t changed. (As an extra bonus, this 304 response also includes caching headers. Proxies will keep a copy of data even after it officially “expires,” in the hopes that the data hasn’t really changed and the next request responds with a 304 status code and updated cache information.)

Python’s HTTP libraries do not support last-modified date checking, but httplib2 does.

ETags

ETags are an alternate way to accomplish the same thing as the last-modified checking. With Etags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from diveintomark.org had an ETag header.

HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg

The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn’t changed, the server will send you back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the last time.

Again with the curl:

you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg  ①
HTTP/1.1 304 Not Modified
Date: Sun, 31 May 2009 18:04:39 GMT
Server: Apache
Connection: close
ETag: "3075-ddc8d800"
Expires: Mon, 31 May 2010 18:04:39 GMT
Cache-Control: max-age=31536000, public

ETags are commonly enclosed in quotation marks, but the quotation marks are part of the value. They are not delimiters; the only delimiter in the ETag header is the colon between ETag and "3075-ddc8d800". That means you need to send the quotation marks back to the server in the If-None-Match header.

Python’s HTTP libraries do not support ETags, but httplib2 does.

Compression

When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s XML, maybe it’s JSON, maybe it’s just plain text. Regardless of the format, text compresses well. The example feed in the XML chapter is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That’s just 30% of the original size!

HTTP supports several compression algorithms. The two most common types are gzip and deflate. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include an Accept-encoding header in your request that lists which compression algorithms you support. If the server supports any of the same algorithms, it will send you back compressed data (with a Content-encoding header that tells you which algorithm it used). Then it’s up to you to decompress the data.

Python’s HTTP libraries do not support compression, but httplib2 does.

Redirects

Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml.

Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything’s normal, here’s the page you asked for”. Status code 404 means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.

HTTP has several different ways of signifying that a resource has moved. The two most common techiques are status codes 302 and 301. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you’re supposed to use the new address from then on.

The urllib.request module automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but it doesn’t tell you that it did so. You’ll end up getting data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address, and each time the urllib.request module will “helpfully” follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for you.

httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.

⁂

How Not To Fetch Data Over HTTP

Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better.

>>> import urllib.request
>>> data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read()  ①
>>> print(data)
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  …

Downloading anything over HTTP is incredibly easy in Python; in fact, it’s a one-liner. The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.

So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude.

⁂

What’s On The Wire?

To see why this is inefficient and rude, let’s turn on the debugging features of Python’s HTTP library and see what’s being sent “on the wire.”

>>> from http.client import HTTPConnection
>>> HTTPConnection.debuglevel = 1                                       ①
>>> from urllib.request import urlopen
>>> response = urlopen('http://diveintopython3.org/examples/feed.xml')  ②
send: b'GET /examples/feed.xml HTTP/1.1                                 ③
Host: diveintopython3.org                                               ④
Accept-Encoding: identity                                               ⑤
User-Agent: Python-urllib/3.0'                                          ⑥
Connection: close
reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…

As I mentioned at the beginning of the chapter, urllib.request relies on another standard Python library, http.client. Normally you don’t need to touch http.client directly. (The urllib.request module imports it automatically.) But we import it here so we can toggle the debugging flag on the HTTPConnection class that urllib.request uses to connect to the HTTP server.
Now that the debugging flag is set, information on the the HTTP request and response is printed out in real time. As you can see, when you request the Atom feed, the urllib.request module sends five lines to the server.
The first line specifies the HTTP verb you’re using, and the path of the resource (minus the domain name).
The second line specifies the domain name from which we’re requesting this feed.
The third line specifies the compression algorithms that the client supports. As I mentioned earlier, urllib.request does not support compression by default.
The fourth line specifies the name of the library that is making the request. By default, this is Python-urllib plus a version number. Both urllib.request and httplib2 support changing the user agent; you’ll see how to do this later in this chapter. [FIXME really?]

Now let’s look at what the server sent back in its response.

# continued from previous example
>>> print(response.headers.as_string())        ①
Date: Sun, 31 May 2009 19:23:06 GMT            ②
Server: Apache
Last-Modified: Sun, 31 May 2009 06:39:55 GMT   ③
ETag: "bfe-93d9c4c0"                           ④
Accept-Ranges: bytes
Content-Length: 3070                           ⑤
Cache-Control: max-age=86400                   ⑥
Expires: Mon, 01 Jun 2009 19:23:06 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml
>>> data = response.read()                     ⑦
>>> len(data)
3070

The response returned from the urllib.request.urlopen() function contains all the HTTP headers the server sent back. It also contains methods to download the actual data; we’ll get to that in a minute.
The server tells you when it handled your request.
This response includes a Last-Modified header.
This response includes an ETag header.
The data is 3070 bytes long. Notice what isn’t here: a Content-encoding header. Your request stated that you only accept uncompressed data (Accept-encoding: identity), and sure enough, this response contains uncompressed data.
This response includes caching headers that state that this feed can be cached for up to 24 hours (86400 seconds).
And finally, download the actual data by calling response.read(). As you can tell from the len() function, this downloads all 3070 bytes at once.

As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports gzip compression, but HTTP compression is opt-in. We didn’t ask for it, so we didn’t get it. That means we’re downloading 3070 bytes when we could have just downloaded 941. Bad dog, no biscuit.

But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time.

# continued from the previous example
>>> response2 = urlopen('http://diveintopython3.org/examples/feed.xml')
send: b'GET /examples/feed.xml HTTP/1.1
Host: diveintopython3.org
Accept-Encoding: identity
User-Agent: Python-urllib/3.0'
Connection: close
reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…

Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No sign of If-Modified-Since headers. No sign of If-None-Match headers. No respect for the caching headers. Still no compression.

And what happens when you do the same thing twice? You get the same response. Twice.

# continued from the previous example
>>> print(response2.headers.as_string())     ①
Date: Mon, 01 Jun 2009 03:58:00 GMT
Server: Apache
Last-Modified: Sun, 31 May 2009 22:51:11 GMT
ETag: "bfe-255ef5c0"
Accept-Ranges: bytes
Content-Length: 3070
Cache-Control: max-age=86400
Expires: Tue, 02 Jun 2009 03:58:00 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml
>>> data2 = response2.read()
>>> len(data2)                               ②
3070
>>> data2 == data                            ③
True

The server is still sending the same array of “smart” headers: Cache-Control and Expires to allow caching, Last-Modified and ETag to enable “not-modified” tracking. Even the Vary: Accept-Encoding header hints that the server would support compression, if only you would ask for it. But you didn’t.
Once again, fetching this data downloads the whole 3070 bytes…
…the exact same 3070 bytes you downloaded last time.

HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that speaks HTTP fluently.

⁂

Introducing `httplib2`

To use httplib2, create an instance of the httplib2.Http class.

>>> import httplib2
>>> h = httplib2.Http('.cache')                                                    ①
>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')  ②
>>> response.status                                                                ③
200
>>> content[:52]                                                                   ④
b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
>>> len(content)
3070

The primary interface to httplib2 is the Http object. For reasons you’ll see in the next section, you should always pass a directory name when you create an Http object. The directory does not need to exist; httplib2 will create it if necessary.
Once you have an Http object, retrieving data is as simple as calling the request() method with the address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll see how to issue other HTTP requests, like POST.)
The request() method returns two values. The first is an httplib2.Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request was successful.
The content variable contains the actual data that was returned by the HTTP server. The data is returned as bytes, not str. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it.

☞You probably only need one httplib2.Http object. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use the Http object and just call the request() method twice.

How `httplib2` Handles Caching

Remember in the previous section when I said you should always create an httplib2.Http object with a directory name? Caching is the reason.

# continued from the previous example
>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml')  ①
>>> response2.status                                                                 ②
200
>>> content2[:52]                                                                    ③
b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
>>> len(content2)
3070

This shouldn’t be terribly surprising. It’s the same thing you did last time, except you’re putting the result into two new variables.
The HTTP status is once again 200, just like last time.
The downloaded content is the same as last time, too.

So… who cares? Quit your Python interactive shell and relaunch it with a new session, and I’ll show you.

# NOT continued from previous example!
# Please exit out of the interactive shell
# and launch a new one.
>>> import httplib2
>>> httplib2.debuglevel = 1                                                        ①
>>> h = httplib2.Http('.cache')                                                    ②
>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')  ③
>>> len(content)                                                                   ④
3070
>>> response.status                                                                ⑤
200
>>> response.fromcache                                                             ⑥
True

Let’s turn on debugging and see what’s on the wire. This is the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back.
Create an httplib2.Http object with the same directory name as before.
Request the same URL as before. Nothing appears to happen. More precisely, nothing gets sent to the server, and nothing gets returned from the server. There is absolutely no network activity whatsoever.
Yet we did “receive” some data — in fact, we received all of it.
We also “received” an HTTP status code indicating that the “request” was successful.
Here’s the rub: this “response” was generated from httplib2’s local cache. That directory name you passed in when you created the httplib2.Http object — that directory holds httplib2’s cache of all the operations it’s ever performed.

You previously requested the data at this URL. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 24 hours (Cache-Control: max-age=86400, which is 24 hours measured in seconds). httplib2 understand and respects those caching headers, and it stored the previous response in the .cache directory (which you passed in when you create the Http object). That cache hasn’t expired yet, so the second time you request the data at this URL, httplib2 simply returns the cached result without ever hitting the network.

I say “simply,” but obviously there is a lot of complexity hidden behind that simplicity. httplib2 handles HTTP caching automatically and by default. If for some reason you need to know whether a response came from the cache, you can check response.fromcache. Otherwise, it Just Works.

Now, suppose you have data cached, but you want to bypass the cache and re-request it from the remote server. Browsers sometimes do this if the user specifically requests it. For example, pressing F5 refreshes the current page, but pressing Ctrl+F5 bypasses the cache and re-requests the current page from the remote server. You might think “oh, I’ll just delete the data from my local cache, then request it again.” You could do that, but remember that there may be more parties involved than just you and the remote server. What about those intermediate proxy servers? They’re completely beyond your control, and they may still have that data cached, and will happily return it to you because (as far as they are concerned) their cache is still valid.

Instead of manipulating your local cache and hoping for the best, you should use the features of HTTP to ensure that your request actually reaches the remote server.

# continued from the previous example
>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',
...     headers={'cache-control':'no-cache'})  ①
connect: (diveintopython3.org, 80)             ②
send: b'GET /examples/feed.xml HTTP/1.1
Host: diveintopython3.org
user-agent: Python-httplib2/$Rev: 259 $
accept-encoding: deflate, gzip
cache-control: no-cache'
reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…
>>> response2.status
200
>>> response2.fromcache                        ③
False
>>> print(dict(response2.items()))             ④
{'status': '200',
 'content-length': '3070',
 'content-location': 'http://diveintopython3.org/examples/feed.xml',
 'accept-ranges': 'bytes',
 'expires': 'Wed, 03 Jun 2009 00:40:26 GMT',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 'last-modified': 'Sun, 31 May 2009 22:51:11 GMT',
 'connection': 'close',
 '-content-encoding': 'gzip',
 'etag': '"bfe-255ef5c0"',
 'cache-control': 'max-age=86400',
 'date': 'Tue, 02 Jun 2009 00:40:26 GMT',
 'content-type': 'application/xml'}

httplib2 allows you to add arbitrary HTTP headers to any outgoing request. In order to bypass all caches (not just your local disk cache, but also any caching proxies between you and the remote server), add a no-cache header in the headers dictionary.
Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions — as part of the incoming response and as part of the outgoing request. It noticed that you added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data.
This response was not generated from your local cache. You knew that, of course, because you saw the debugging information on the outgoing request. But it’s nice to have that programmatically verified.
The request succeeded; you downloaded the entire feed again from the remote server. Of course, the server also sent back a full complement of HTTP headers along with the feed data. That includes caching headers, which httplib2 uses to update its local cache, in the hopes of avoiding network access the next time you request this feed. Everything about HTTP caching is designed to maximize cache hits and minimize network access. Even though you bypassed the cache this time, the remote server would really appreciate it if you would cache the result for next time.

How `httplib2` Handles `Last-Modified` and `ETag` Headers

The Cache-Control and Expires caching headers are called freshness indicators. They tell caches in no uncertain terms that you can completely avoid all network access until the cache expires. And that’s exactly the behavior you saw in the previous section: given a strong validator, httplib2 does not generate a single byte of network activity to serve up cached data (unless you explicitly bypass the cache, of course).

But what about the case where the data might have changed, but hasn’t? HTTP defines Last-Modified and Etag headers for this purpose. These headers are called validators. If the local cache is no longer fresh, a client can send the validators with the next request to see if the data has actually changed. If the data hasn’t changed, the server sends back a 304 status code and no data. So there’s still a round-trip over the network, but you end up downloading fewer bytes.

>>> import httplib2
>>> httplib2.debuglevel = 1
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://diveintopython3.org/')  ①
connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'
>>> print(dict(response.items()))                                 ②
{'-content-encoding': 'gzip',
 'accept-ranges': 'bytes',
 'connection': 'close',
 'content-length': '6657',
 'content-location': 'http://diveintopython3.org/',
 'content-type': 'text/html',
 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
 'etag': '"7f806d-1a01-9fb97900"',
 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
 'server': 'Apache',
 'status': '304',
 'vary': 'Accept-Encoding,User-Agent'}
>>> len(content)                                                  ③
6657

Instead of the feed, this time we’re going to download the site’s home page, which is HTML. Since this is the first time you’lve ever requested this page, httplib2 has little to work with, and it sends out a minimum of headers with the request.
The response contains a multitude of HTTP headers… but no caching information. However, it does include both an ETag and Last-Modified header.
At the time I constructed this example, this page was 6657 bytes. It’s probably changed since then, but don’t worry about it.

# continued from the previous example
>>> response, content = h.request('http://diveintopython3.org/')  ①
connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
if-none-match: "7f806d-1a01-9fb97900"                             ②
if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT                  ③
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 304 Not Modified'                                ④
>>> response.fromcache                                            ⑤
True
>>> response.status                                               ⑥
200
>>> response.dict['status']                                       ⑦
'304'
>>> len(content)                                                  ⑧
6657

You request the same page again, with the same Http object (and the same local cache).
httplib2 sends the ETag validator back to the server in the If-None-Match header.
httplib2 also sends the Last-Modified validator back to the server in the If-Modified-Since header.
The server looked at these validators, looked at the page you requested, and determined that the page has not changed since you last requested it, so it sends back a 304 status code and no data.
Back on the client, httplib2 notices the 304 status code and loads the content of the page from its cache.
This might be a bit confusing. There are really two status codes — 304 (returned from the server this time, which caused httplib2 to look in its cache), and 200 (returned from the server last time, and stored in httplib2’s cache along with the page data). response.status returns the status from the cache.
If you want the raw status code returned from the server, you can get that by looking in response.dict, which is a dictionary of the actual headers returned from the server.
However, you still get the data in the content variable. Generally, you don’t need to know why a response was served from the cache. (You may not even care that it was served from the cache at all, and that’s fine too. httplib2 is smart enough to let you act dumb.) By the time the request() method returns to the caller, httplib2 has already updated its cache and returned the data to you.

How `http2lib` Handles Compression

HTTP supports two types of compression. httplib2 supports both of them.

>>> response, content = h.request('http://diveintopython3.org/')
connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip                          ①
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'
>>> print(dict(response.items()))
{'-content-encoding': 'gzip',                           ②
 'accept-ranges': 'bytes',
 'connection': 'close',
 'content-length': '6657',
 'content-location': 'http://diveintopython3.org/',
 'content-type': 'text/html',
 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
 'etag': '"7f806d-1a01-9fb97900"',
 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
 'server': 'Apache',
 'status': '304',
 'vary': 'Accept-Encoding,User-Agent'}

Every time httplib2 sends a request, it includes an Accept-Encoding header to tell the server that it can handle either deflate or gzip compression.
In this case, the server has responded with a gzip-compressed payload. By the time the request() method returns, httplib2 has already decompressed the body of the response and placed it in the content variable. If you’re curious about whether or not the response was compressed, you can check the response dictionary; otherwise, don’t worry about it.

How `httplib2` Handles Redirects

FIXME

>>> response, content = h.request('http://diveintopython3.org/examples/feed-302.xml')
connect: (diveintopython3.org, 80)
send: b'GET /examples/feed-302.xml HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 302 Found'
send: b'GET /examples/feed.xml HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'
>>> print(dict(response.items()))
{'status': '200',
 'content-length': '3070',
 'content-location': 'http://diveintopython3.org/examples/feed.xml',
 'accept-ranges': 'bytes',
 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT',
 'connection': 'close',
 '-content-encoding': 'gzip',
 'etag': '"bfe-4cbbf5c0"',
 'cache-control': 'max-age=86400',
 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
 'content-type': 'application/xml'}
>>> response, content = h.request('http://diveintopython3.org/examples/feed-302.xml')
connect: (diveintopython3.org, 80)
send: b'GET /examples/feed-302.xml HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 302 Found'

FIXME

>>> response, content = h.request('http://diveintopython3.org/examples/feed-301.xml')
connect: (diveintopython3.org, 80)
send: b'GET /examples/feed-301.xml HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 301 Moved Permanently'
>>> print(dict(response.items()))
{'status': '200',
 'content-length': '3070',
 'content-location': 'http://diveintopython3.org/examples/feed.xml',
 'accept-ranges': 'bytes',
 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
 'vary': 'Accept-Encoding',
 'server': 'Apache',
 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT',
 'connection': 'close',
 '-content-encoding': 'gzip',
 'etag': '"bfe-4cbbf5c0"',
 'cache-control': 'max-age=86400',
 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
 'content-type': 'application/xml'}
>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml')
>>> response2.fromcache
True

FIXME

⁂

Beyond HTTP GET

FIXME

>>> import httplib2
>>> from urllib.parse import urlencode
>>> h = httplib2.Http('.cache')
>>> data = {'status': 'Test update from Python 3'}
>>> h.add_credentials('diveintomark', 'MY_SECRET_PASSWORD')
>>> resp, content = h.request('http://twitter.com/statuses/update.xml', 'POST', urlencode(data))
>>> resp.status
200
>>> from xml.etree import ElementTree as etree
>>> tree = etree.fromstring(content)
>>> print(etree.tostring(tree))
<status>
  <created_at>Sat May 30 19:11:38 +0000 2009</created_at>
  <id>1973974228</id>
  <text>Test update from Python 3</text>
  <source>web</source>
  <truncated>false</truncated>
  <in_reply_to_status_id />
  <in_reply_to_user_id />
  <favorited>false</favorited>
  <in_reply_to_screen_name />
  <user>
    <id>8294212</id>
    <name>Mark Pilgrim</name>
    <screen_name>diveintomark</screen_name>
    <location>Apex, NC</location>
    <description>Like a fine spice</description>
    <profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/72859681/beau_normal.jpg</profile_image_url>

    <url>http://diveintomark.org/</url>
    <protected>false</protected>
    <followers_count>2565</followers_count>
    <profile_background_color>FFFFFF</profile_background_color>
    <profile_text_color>333333</profile_text_color>
    <profile_link_color>333333</profile_link_color>
    <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
    <profile_sidebar_border_color>333333</profile_sidebar_border_color>
    <friends_count>44</friends_count>
    <created_at>Sun Aug 19 23:58:36 +0000 2007</created_at>
    <favourites_count>71</favourites_count>
    <utc_offset>-18000</utc_offset>
    <time_zone>Eastern Time (US & Canada)</time_zone>
    <profile_background_image_url>http://static.twitter.com/images/themes/theme1/bg.gif</profile_background_image_url>
    <profile_background_tile>false</profile_background_tile>
    <statuses_count>527</statuses_count>
    <notifications>false</notifications>
    <following>false</following>
  </user>
</status>

FIXME

⁂

Beyond HTTP POST

FIXME

# continued from the previous example
>>> tree.findtext('id')
'1973974228'
>>> resp, delete_content = h.request('http://twitter.com/statuses/destroy/{0}.xml'.format(tree.findtext('id')), 'DELETE')
>>> resp.status
200

⁂

HTTP Web Services

Diving In

Features of HTTP

Caching

Last-Modified Checking

ETags

Compression

Redirects

How Not To Fetch Data Over HTTP

What’s On The Wire?

Introducing httplib2

How httplib2 Handles Caching

How httplib2 Handles Last-Modified and ETag Headers

How http2lib Handles Compression

How httplib2 Handles Redirects

Beyond HTTP GET

Beyond HTTP POST

Further Reading

Introducing `httplib2`

How `httplib2` Handles Caching

How `httplib2` Handles `Last-Modified` and `ETag` Headers

How `http2lib` Handles Compression

How `httplib2` Handles Redirects