diff --git a/http-web-services.html b/http-web-services.html
index 1152e34..d78d61b 100755
--- a/http-web-services.html
+++ b/http-web-services.html
@@ -178,7 +178,7 @@ Cache-Control: max-age=31536000, public

Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better.

 >>> import urllib.request
->>> a_url = 'http://diveintopython3.org/examples/feed.xml'
+>>> a_url = 'http://diveintopython3.org/examples/feed.xml'
 >>> data = urllib.request.urlopen(a_url).read()  
 >>> type(data)                                   
 <class 'bytes'>
@@ -194,7 +194,7 @@ Cache-Control: max-age=31536000, public
  1. Downloading anything over HTTP is incredibly easy in Python; in fact, it’s a one-liner. The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.
- 2. The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want a string, you’ll have to convert it yourself.
+ 2. The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it to a string.
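The conversion described in the revised item can be sketched roughly as follows. This is a minimal illustration, not the chapter's own code: `decode_body` and its `fallback` parameter are hypothetical names, and falling back to UTF-8 when no charset is given is an assumption of this sketch rather than a rule of HTTP.

```python
from email.message import Message


def decode_body(data, content_type, fallback="utf-8"):
    """Decode response bytes using the charset named in a Content-Type value.

    `fallback` is an assumption made for this sketch, not something HTTP
    guarantees.
    """
    header = Message()
    if content_type:
        header["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None.
    charset = header.get_content_charset() or fallback
    return data.decode(charset)


# Hypothetical response pieces. With urllib.request you could pass
# response.read() and response.headers.get("Content-Type") instead.
body = "<title>café</title>".encode("iso-8859-1")
text = decode_body(body, "application/xml; charset=ISO-8859-1")
print(text)
```

The bytes stay bytes until a charset is chosen explicitly, which is the point of the revised wording.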

So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude.
@@ -323,13 +323,38 @@ Content-Type: application/xml

  • The primary interface to httplib2 is the Http object. For reasons you’ll see in the next section, you should always pass a directory name when you create an Http object. The directory does not need to exist; httplib2 will create it if necessary.
  • Once you have an Http object, retrieving data is as simple as calling the request() method with the address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll see how to issue other HTTP requests, like POST.)
  • The request() method returns two values. The first is an httplib2.Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request was successful.
- • The content variable contains the actual data that was returned by the HTTP server. The data is returned as bytes, not str. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it.
+ • The content variable contains the actual data that was returned by the HTTP server. The data is returned as a bytes object, not a string. If you want it as a string, you’ll need to determine the character encoding and convert it yourself.

    You probably only need one httplib2.Http object. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use the Http object and just call the request() method twice.

+    A Short Digression To Explain Why httplib2 Returns Bytes Instead of Strings
+
+    Bytes. Strings. What a pain. Why can’t httplib2 “just” do the conversion for you? Well, it’s complicated, because the rules for determining the character encoding are specific to what kind of resource you’re requesting. How could httplib2 know what kind of resource you’re requesting? It’s usually listed in the Content-Type HTTP header, but that’s an optional feature of HTTP and not all HTTP servers include it. If that header is not included in the HTTP response, it’s left up to the client to guess. (This is commonly called “content sniffing,” and it’s never perfect.)
+
+    If you know what sort of resource you’re expecting (an XML document in this case), perhaps you could “just” pass the returned bytes object to the xml.etree.ElementTree.parse() function. That’ll work as long as the XML document includes information on its own character encoding (as this one does), but that’s an optional feature and not all XML documents do that. If an XML document doesn’t include encoding information, the client is supposed to look at the enclosing transport, i.e. the Content-Type HTTP header, which can include a charset parameter.
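A small sketch of handing bytes straight to ElementTree, assuming a made-up payload in place of a real server response. One caveat on the wording above: parse() expects a filename or file object, so the sketch uses fromstring() for raw bytes and wraps the bytes in io.BytesIO for parse().

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical payload standing in for the bytes a server returned.
xml_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<title>café</title>"
).encode("iso-8859-1")

# fromstring() accepts bytes and honors the encoding named in the
# XML declaration.
root = ET.fromstring(xml_bytes)

# parse() wants a filename or file object, so wrap the bytes first.
tree = ET.parse(io.BytesIO(xml_bytes))

print(root.text, tree.getroot().text)
```

This only works because the document declares its own encoding. Without the declaration, the parser has no way to know the bytes are ISO-8859-1.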

+    But it’s worse than that. Now character encoding information can be in two places: within the XML document itself, and within the Content-Type HTTP header. If the information is in both places, which one wins? According to RFC 3023 (I swear I am not making this up), if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is
+
+    1. the encoding given in the charset parameter of the Content-Type HTTP header, or
+    2. the encoding given in the encoding attribute of the XML declaration within the document, or
+    3. utf-8

+    On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is
+
+    1. the encoding given in the charset parameter of the Content-Type HTTP header, or
+    2. us-ascii
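The two precedence lists above could be sketched as a single function. This is an illustration of the rules as summarized here, not code from httplib2 or a full RFC 3023 implementation: `rfc3023_encoding` is a hypothetical name, and the regular expression for the XML declaration is a simplification that only handles ASCII-compatible encodings.

```python
import re
from email.message import Message

# Simplified matcher for the encoding attribute of an XML declaration.
# Assumption: the document starts with ASCII-compatible bytes.
XML_DECL_ENCODING = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

APPLICATION_XML_TYPES = {
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
}
TEXT_XML_TYPES = {"text/xml", "text/xml-external-parsed-entity"}


def rfc3023_encoding(content_type, body):
    """Pick an encoding from a Content-Type value and the XML body bytes.

    Returns None when the media type is not one of the XML types that
    the rules above cover.
    """
    header = Message()
    if content_type:
        header["Content-Type"] = content_type
    media_type = header.get_content_type()
    charset = header.get_content_charset()

    match = XML_DECL_ENCODING.match(body)
    declared = match.group(1).decode("ascii").lower() if match else None

    if media_type in APPLICATION_XML_TYPES or (
        media_type.startswith("application/") and media_type.endswith("+xml")
    ):
        # Header charset, then the XML declaration, then utf-8.
        return charset or declared or "utf-8"
    if media_type in TEXT_XML_TYPES or (
        media_type.startswith("text/") and media_type.endswith("+xml")
    ):
        # The XML declaration is ignored for text/* types.
        return charset or "us-ascii"
    return None


doc = b'<?xml version="1.0" encoding="utf-8"?><feed/>'
print(rfc3023_encoding("application/atom+xml", doc))
print(rfc3023_encoding("text/xml", doc))
```

The same document yields different answers depending only on the media type, which is the trap the digression is warning about.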

+    And that’s just for XML documents. For HTML documents, web browsers have constructed such byzantine rules for content-sniffing [PDF] that we’re still trying to figure them all out.
+
+    “Patches welcome.”

    How httplib2 Handles Caching

    Remember in the previous section when I said you should always create an httplib2.Http object with a directory name? Caching is the reason.
@@ -640,7 +665,7 @@ user-agent: Python-httplib2/$Rev: 259 $'

    Identi.ca REST API Method: statuses/update
    Updates the authenticating user’s status. Requires the status parameter specified below. Request must be a POST. - +

    URL
    https://identi.ca/api/statuses/update.format