diff --git a/http-web-services.html b/http-web-services.html index 1152e34..d78d61b 100755 --- a/http-web-services.html +++ b/http-web-services.html @@ -178,7 +178,7 @@ Cache-Control: max-age=31536000, public
Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better.
>>> import urllib.request ->>> a_url = 'http://diveintopython3.org/examples/feed.xml' +>>> a_url = 'http://diveintopython3.org/examples/feed.xml' >>> data = urllib.request.urlopen(a_url).read() ① >>> type(data) ② <class 'bytes'> @@ -194,7 +194,7 @@ Cache-Control: max-age=31536000, public
urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.
-urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want a string, you’ll have to convert it yourself.
+urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it to a string.
So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude. @@ -323,13 +323,38 @@ Content-Type: application/xml
httplib2 is the Http object. For reasons you’ll see in the next section, you should always pass a directory name when you create an Http object. The directory does not need to exist; httplib2 will create it if necessary.
Http object, retrieving data is as simple as calling the request() method with the address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll see how to issue other HTTP requests, like POST.)
request() method returns two values. The first is an httplib2.Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request was successful.
-bytes, not str. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it.
+bytes object, not a string. If you want it as a string, you’ll need to determine the character encoding and convert it yourself.
+☞You probably only need one
httplib2.Httpobject. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use theHttpobject and just call therequest()method twice.
httplib2 Returns Bytes Instead of StringsBytes. Strings. What a pain. Why can’t httplib2 “just” do the conversion for you? Well, it’s complicated, because the rules for determining the character encoding are specific to what kind of resource you’re requesting. How could httplib2 know what kind of resource you’re requesting? It’s usually listed in the Content-Type HTTP header, but that’s an optional feature of HTTP and not all HTTP servers include it. If that header is not included in the HTTP response, it’s left up to the client to guess. (This is commonly called “content sniffing,” and it’s never perfect.)
+
+
If you know what sort of resource you’re expecting (an XML document in this case), perhaps you could “just” pass the returned bytes object to the xml.etree.ElementTree.parse() function. That’ll work as long as the XML document includes information on its own character encoding (as this one does), but that’s an optional feature and not all XML documents do that. If an XML document doesn’t include encoding information, the client is supposed to look at the enclosing transport — i.e. the Content-Type HTTP header, which can include a charset parameter.
+
+
But it’s worse than that. Now character encoding information can be in two places: within the XML document itself, and within the Content-Type HTTP header. If the information is in both places, which one wins? According to RFC 3023 (I swear I am not making this up), if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is
+
+
charset parameter of the Content-Type HTTP header, or
+encoding attribute of the XML declaration within the document, or
+utf-8
+On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is
+
+
Content-Type HTTP header, or
+us-ascii
+And that’s just for XML documents. For HTML documents, web browsers have constructed such byzantine rules for content-sniffing [PDF] that we’re still trying to figure them all out. + +
“Patches welcome.” +
httplib2 Handles CachingRemember in the previous section when I said you should always create an httplib2.Http object with a directory name? Caching is the reason.
@@ -640,7 +665,7 @@ user-agent: Python-httplib2/$Rev: 259 $'
Identi.ca REST API Method: statuses/update
Updates the authenticating user’s status. Requires thestatusparameter specified below. Request must be aPOST. - +
- URL
https://identi.ca/api/statuses/update.format