diff --git a/http-web-services.html b/http-web-services.html
index 06e1b5a..cdeb6a7 100755
--- a/http-web-services.html
+++ b/http-web-services.html
@@ -178,7 +178,10 @@ Cache-Control: max-age=31536000, public
Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better.
>>> import urllib.request
->>> data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read() ①
+>>> a_url = 'http://diveintopython3.org/examples/feed.xml'
+>>> data = urllib.request.urlopen(a_url).read() ①
+>>> type(data) ②
+<class 'bytes'>
>>> print(data)
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
@@ -191,6 +194,7 @@ Cache-Control: max-age=31536000, public
The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.
+The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want a string, you’ll have to convert it yourself.
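As a minimal sketch of that "convert it yourself" step: the snippet below decodes a bytes object into a string. To keep it runnable without network access, `data` is a hard-coded stand-in for what `urlopen(a_url).read()` would return, and the choice of UTF-8 is an assumption taken from the XML declaration shown in the feed, not something the HTTP layer guarantees.

```python
# Stand-in for the bytes that urlopen(a_url).read() would return;
# hard-coded here so the sketch runs without a network connection.
data = (b"<?xml version='1.0' encoding='utf-8'?>\n"
        b"<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>")

# Assumption: the payload is UTF-8, as its own XML declaration claims.
text = data.decode('utf-8')

print(type(data))   # <class 'bytes'>
print(type(text))   # <class 'str'>
```

If the declared encoding were something else, you would pass that name to decode() instead; decoding with the wrong encoding may raise UnicodeDecodeError or silently produce the wrong characters.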
So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude.