diff --git a/dip3.css b/dip3.css
index 5a3b6c1..4a7a553 100644
--- a/dip3.css
+++ b/dip3.css
@@ -102,9 +102,10 @@ abbr {
 }
 .f:first-letter {
   float: left;
-  color: lightblue;
+  color: lightsteelblue;
   padding: 0.11em 4px 0 0;
   font: normal 4em/0.68 serif;
+  text-shadow: steelblue 1px 1px 1px;
 }
 p, ul, ol {
   margin: 1.75em 0;
@@ -130,7 +131,7 @@ body {
 .a {
   font-size: xx-large;
   line-height: .875;
-  color: #444;
+  color: #82b445;
 }
 form div, #level {
   float: right;
@@ -152,7 +153,7 @@ a:link, .w a {
   color: steelblue;
 }
 a:visited {
-  color: darkorchid;
+  color: #b44582;
 }
 .c a {
   color: inherit;
@@ -267,7 +268,9 @@ aside {
   -webkit-border-radius: 1em;
   border-radius: 1em;
 }
-
+#level span {
+  color: #82b445;
+}
 /* previous/next navigation links */
 .nav a {
diff --git a/http-web-services.html b/http-web-services.html
index a7e972e..e93cfa6 100644
--- a/http-web-services.html
+++ b/http-web-services.html
@@ -21,9 +21,9 @@ mark{display:inline}
HTTP web services are programmatic ways of sending and receiving data from remote servers using the operations of HTTP directly. If you want to get data from the server, use a straight HTTP GET; if you want to send new data to the server, use HTTP POST. (Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using HTTP PUT and HTTP DELETE.) In other words, the “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for receiving, sending, modifying, and deleting data. +
HTTP web services are programmatic ways of sending and receiving data from remote servers using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET; if you want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using HTTP PUT and HTTP DELETE. In other words, the “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data.
-
The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data -- usually XML data -- can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each “call” to the web service had a unique URL, you can load it in your web browser and immediately see the raw data. +
The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data — usually XML data — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each “call” to the web service has a unique URL, you can load it in your web browser and immediately see the raw data.
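To make the verb-to-operation mapping concrete, here is a minimal sketch that builds (but never sends) one request per verb with urllib.request. The example.com endpoint, the build_request() helper, and the request bodies are hypothetical, invented for illustration; only the Request class and its method argument come from the standard library.

```python
import urllib.request

# Hypothetical endpoint, used only for illustration.
BASE = 'http://example.com/api/notes'


def build_request(verb, url, body=None):
    """Return an unsent Request object that uses the given HTTP verb."""
    return urllib.request.Request(url, data=body, method=verb)


requests = [
    build_request('GET', BASE + '/1'),               # retrieve existing data
    build_request('POST', BASE, b'text=hello'),      # create new data
    build_request('PUT', BASE + '/1', b'text=bye'),  # modify existing data
    build_request('DELETE', BASE + '/1'),            # delete data
]

# Nothing goes over the wire here; we only inspect what would be sent.
for request in requests:
    print(request.get_method(), request.full_url)
```

Passing any of these objects to urllib.request.urlopen() would actually send it, so that step is left out on purpose.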
Examples of HTTP web services:
Python 3 comes with two different libraries for interacting with HTTP web services:
http.client is a low-level library that implements RFC 2616, the HTTP protocol.
+http.client is a low-level library that implements RFC 2616, the HTTP protocol.
urllib.request is an abstraction layer built on top of http.client. It provides a standard API for accessing both HTTP and FTP servers, automatically follows HTTP redirects, and handles some common forms of HTTP authentication.
Which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction that urllib.request.
+
So which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction than urllib.request.
To understand why httplib2 is the right choice, you first need to understand HTTP.
⁂ +
There are five important features which all HTTP clients should support. + +
FIXME + +
Some data changes all the time. The home page of CNN.com updates every few minutes. On the other hand, the home page of Google.com may not change for days or even weeks (and then only when they put up a special holiday logo or advertise a new service). Web services are no different. The server knows when the data you’re requesting last changed, and HTTP provides a way for the server to include this last-modified date each time you request the data. + +
If you ask for the same data a second (or third or fourth) time, you can tell the server the last-modified date that you got last time. You send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special HTTP status code 304, which means “this data hasn’t changed since the last time you asked for it.” Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is the status code. So you don’t need to download the same data over and over again if it hasn’t changed; the server assumes you have the data cached locally.
+
+
All modern web browsers support last-modified date checking. If you’ve ever visited a page, re-visited the same page a day later and found that it hadn’t changed, and wondered why it loaded so quickly the second time — this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically sent the last-modified date it got from the server the first time. The server simply says 304: Not Modified, so your browser knows to load the page from its cache. Web services work the same way.
+
+
Python’s URL libraries have no built-in support for last-modified date checking, but httplib2 does.
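If you had to do last-modified checking by hand with urllib.request, it might look something like the sketch below. The function names if_modified_since_headers() and fetch_if_changed() are made up for this example, and the caller is assumed to have saved the Last-Modified value from an earlier response. Note that urlopen() reports a 304 by raising HTTPError, so the "nothing changed" case is handled in the except clause.

```python
import urllib.error
import urllib.request


def if_modified_since_headers(last_modified=None):
    """Build request headers from a Last-Modified value saved earlier."""
    headers = {}
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    return headers


def fetch_if_changed(url, last_modified=None):
    """Return (data, last_modified); data is None if the server said 304."""
    request = urllib.request.Request(
        url, headers=if_modified_since_headers(last_modified))
    try:
        with urllib.request.urlopen(request) as response:
            # Fresh copy: hand back the body and the new date to remember.
            return response.read(), response.headers.get('Last-Modified')
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: the copy you already have is still good.
            return None, last_modified
        raise


# Inspect the headers a second request would carry (no network involved).
print(if_modified_since_headers('Sat, 01 Jan 2000 00:00:00 GMT'))
```

You would call fetch_if_changed() once with no date, store the date it returns, and pass that date back in on every later call.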
+
+
ETags are an alternate way to accomplish the same thing as last-modified date checking. With ETags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn’t changed, the server will send you back a 304 status code. As with last-modified date checking, the server sends back only the 304 status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the last time.
+
+
Python’s URL libraries have no built-in support for ETags, but httplib2 does.
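To see the ETag exchange from the server’s side, here is a toy sketch. It assumes the server derives the ETag from a SHA-256 hash of the body, which is only one possible scheme (real servers are free to do something else), and make_etag() and respond() are hypothetical names invented for this example.

```python
import hashlib


def make_etag(body):
    """Derive a quoted ETag from the body (one possible scheme, not the only one)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body, if_none_match=None):
    """Return (status, headers, payload) the way a server might."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's copy still matches: send the status code and no data.
        return 304, {'ETag': etag}, b''
    return 200, {'ETag': etag}, body


feed = b'<feed>version 1</feed>'

# First request: no ETag yet, so the full body comes back.
status, headers, payload = respond(feed)

# Second request: same data, so the saved ETag earns a bodiless 304.
status2, _, payload2 = respond(feed, headers['ETag'])

# Third request: the data changed, so the old ETag no longer matches.
status3, _, payload3 = respond(b'<feed>version 2</feed>', headers['ETag'])

print(status, status2, status3)
```

The client’s only job is to remember the ETag value it was given and send it back unchanged in If-None-Match.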
+
+
When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s XML; maybe it’s JSON. Regardless of the format, text compresses well. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include the Accept-encoding header in your request, and if the server supports compression, it will send you back compressed data and mark it with a Content-encoding header.
+
+
HTTP supports several compression algorithms. The two most common types are gzip and deflate. + +
Python’s URL libraries have no built-in support for compression, but httplib2 does.
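Doing compression by hand means two steps: ask for it, then undo it. The sketch below assumes you already have the raw response body and the value of its Content-encoding header; decode_body() and REQUEST_HEADERS are hypothetical names for this example. The fallback for raw deflate streams is there because some servers are known to send deflate data without the zlib wrapper.

```python
import gzip
import zlib

# Headers you would send with the request to ask for compressed data.
REQUEST_HEADERS = {'Accept-Encoding': 'gzip, deflate'}


def decode_body(body, content_encoding):
    """Undo the compression named in the response's Content-encoding header."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        try:
            # The usual case: deflate data inside a zlib wrapper.
            return zlib.decompress(body)
        except zlib.error:
            # Fallback for servers that send a raw deflate stream.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No header (or an encoding we didn't ask for): leave the body untouched.
    return body


original = b'<feed>' + b'text compresses well ' * 50 + b'</feed>'
compressed = gzip.compress(original)
print(len(original), 'bytes uncompressed,', len(compressed), 'bytes gzipped')
```

Always check the Content-encoding header before decompressing; a server is allowed to ignore your Accept-encoding header and send plain data.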
+
+
Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized; pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml.
+
+
Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything’s normal, here’s the page you asked for”. Status code 404 means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.
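If you want to poke at status codes from Python, the standard library’s http.HTTPStatus enumeration knows the official phrase for each one. The describe() helper below is a hypothetical name for this sketch, and the category labels are informal shorthand for the first digit of the code.

```python
from http import HTTPStatus

# Informal labels for each family of status codes, keyed by first digit.
FAMILIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}


def describe(code):
    """Return a one-line description such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)
    return '{} {} ({})'.format(code, status.phrase, FAMILIES[code // 100])


for code in (200, 301, 302, 304, 404):
    print(describe(code))
```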
+
+
HTTP has several different ways of signifying that a resource has moved. The two most common techniques are status codes 302 and 301. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location: header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location: header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you’re supposed to use the new address from then on.
+
+
The urllib.request module will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn’t tell you that it did so. You’ll end up getting the data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address. That’s two round trips instead of one, which is bad for the service operator and bad for you.
+
+
httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
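The idea of remembering permanent redirects is simple enough to sketch in a few lines. This is not httplib2’s actual implementation, just an illustration of the rule: a 301 updates your address book, a 302 does not. The RedirectMemory class and the example URLs are hypothetical.

```python
class RedirectMemory:
    """Remember permanent (301) redirects and rewrite URLs before requesting."""

    def __init__(self):
        self._moved = {}

    def record(self, url, status, location):
        """Note a redirect you just received; only 301s are remembered."""
        if status == 301:
            self._moved[url] = location

    def resolve(self, url):
        """Return the address to actually request for this URL."""
        seen = set()
        # Follow chains of remembered redirects, guarding against loops.
        while url in self._moved and url not in seen:
            seen.add(url)
            url = self._moved[url]
        return url


memory = RedirectMemory()
old = 'http://example.com/index.xml'

# A temporary redirect: use the new address once, then retry the old one.
memory.record(old, 302, 'http://example.com/tmp/index.xml')
after_temporary = memory.resolve(old)

# A permanent redirect: use the new address from then on.
memory.record(old, 301, 'http://example.com/xml/atom.xml')
after_permanent = memory.resolve(old)

print(after_temporary)
print(after_permanent)
```

A real client would also persist this mapping between runs; otherwise the first request after every restart goes back to the old address.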
+
+
+
Let’s say you want to download a resource over HTTP, such as an Atom feed. But you don’t just want to download it once; you want to download it over and over again, every hour, to get the latest news from the site that’s offering the news feed. Let’s do it the quick-and-dirty way first, and then see how you can do better. +
Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. Let’s do it the quick-and-dirty way first, and then see how you can do better.
>>> import urllib.request
>>> data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read() ①
@@ -63,74 +119,16 @@ mark{display:inline}
…
urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.
+The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.
So what’s wrong with this? Well, for a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis -- and remember, you said you were planning on retrieving this syndicated feed once an hour -- then you’re being inefficient, and you’re being rude. - -
Let’s talk about some of the basic features of HTTP. +
So what’s wrong with this? Well, for a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis — and remember, you said you were planning on retrieving this syndicated feed once an hour — then you’re being inefficient, and you’re being rude.
⁂ -
There are five important features which all HTTP clients should support. - -
FIXME - -
Some data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the home page of Google.com may not change for days or even weeks (and then only when they put up a special holiday logo or advertise a new service). Web services are no different. The server knows when the data you’re requesting last changed, and HTTP provides a way for the server to include this last-modified date each time you request the data. - -
If you ask for the same data a second (or third or fourth) time, you can tell the server the last-modified date that you got last time. You send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special HTTP status code 304, which means “this data hasn’t changed since the last time you asked for it.” Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is the status code. So you don’t need to download the same data over and over again if it hasn’t changed; the server assumes you have the data cached locally.
-
-
All modern web browsers support last-modified date checking. If you’ve ever visited a page, re-visited the same page a day later and found that it hadn’t changed, and wondered why it loaded so quickly the second time — this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically sent the last-modified date it got from the server the first time. The server simply says 304: Not Modified, so your browser knows to load the page from its cache. Web services work the same way.
-
-
Python’s URL libraries have no built-in support for last-modified date checking, but httplib2 does.
-
-
ETags are an alternate way to accomplish the same thing as the last-modified date checking. With Etags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn’t changed, the server will send you back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the last time.
-
-
Python’s URL libraries have no built-in support for ETags, but httplib2 does.
-
-
When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s XML; maybe it’s JSON. Regardless of the format, text compresses well. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include the Accept-encoding header in your request, and if the server supports compression, it will send you back compressed data and mark it with a Content-encoding header.
-
-
HTTP supports several compression algorithms. The two most common types are gzip and deflate. - -
Python’s URL libraries have no built-in support for compression, but httplib2 does.
-
-
Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml.
-
-
Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything’s normal, here’s the page you asked for”. Status code 404 means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.
-
-
HTTP has several different ways of signifying that a resource has moved. The two most common techiques are status codes 302 and 301. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location: header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location: header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you’re supposed to use the new address from then on.
-
-
The urllib module will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn’t tell you that it did so. You’ll end up getting data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address. That’s two round trips instead of one, which is bad for the service operator and bad for you.
-
-
httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
-
-
-
⁂ +--> -
FIXME @@ -820,7 +818,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
⁂ -
FIXME @@ -841,8 +839,8 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
httplib2
httplib2: HTTP Persistence and Authentication
© 2001–9 Mark Pilgrim