From 28a13e1fbc4cfecdf9c891728c70a1d0f08e73b0 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Fri, 26 Jun 2009 00:41:29 -0400
Subject: [PATCH] added note about list concatenation and memory usage. unrelatedly, added nonbreaking spaces around long dashes.

---
 advanced-iterators.html                     | 10 +++++-----
 case-study-porting-chardet-to-python-3.html | 16 ++++++++--------
 dip3.css                                    |  2 +-
 generators.html                             |  6 +++---
 http-web-services.html                      | 14 +++++++-------
 iterators.html                              | 10 +++++-----
 native-datatypes.html                       | 15 ++++++++-------
 porting-code-to-python-3-with-2to3.html     | 12 ++++++------
 refactoring.html                            |  4 ++--
 special-method-names.html                   | 12 ++++++------
 strings.html                                | 18 +++++++++---------
 unit-testing.html                           | 12 ++++++------
 xml.html                                    | 14 +++++++-------
 your-first-python-program.html              |  4 ++--
 14 files changed, 75 insertions(+), 74 deletions(-)

diff --git a/advanced-iterators.html b/advanced-iterators.html
index aa13534..f299960 100644
--- a/advanced-iterators.html
+++ b/advanced-iterators.html
@@ -119,7 +119,7 @@ if __name__ == '__main__':
 >>> {c for c in ''.join(words)}
 {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
    -
  1. Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth — wait, that’s in the set already, so it only gets listed once. Fifth. Sixth — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first. +
  2. Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth — wait, that’s in the set already, so it only gets listed once. Fifth. Sixth — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first.
  3. The same technique works with strings, since a string is just a sequence of characters.
  4. Given a list of strings, ''.join(a_list) concatenates all the strings together into one.
  5. So, given a list of strings, this set comprehension returns all the unique characters across all the strings, with no duplicates. @@ -228,7 +228,7 @@ StopIteration
  6. That’s it! Those are all the permutations of [1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2) never show up, because they contain repeats so they aren’t valid permutations. When there are no more permutations, the iterator raises a StopIteration exception.
-

The permutations() function doesn’t have to take a list. It can take any sequence — even a string. +

The permutations() function doesn’t have to take a list. It can take any sequence — even a string.

 >>> import itertools
@@ -255,7 +255,7 @@ StopIteration
  ('C', 'A', 'B'), ('C', 'B', 'A')]
  1. A string is just a sequence of characters. For the purposes of finding permutations, the string 'ABC' is equivalent to the list ['A', 'B', 'C']. -
  2. The first permutation of the 3 items ['A', 'B', 'C'], taken 3 at a time, is ('A', 'B', 'C'). There are five other permutations — the same three characters in every conceivable order. +
  3. The first permutation of the 3 items ['A', 'B', 'C'], taken 3 at a time, is ('A', 'B', 'C'). There are five other permutations — the same three characters in every conceivable order.
  4. Since the permutations() function always returns an iterator, an easy way to debug permutations is to pass that iterator to the built-in list() function to see all the permutations immediately.
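
The points above can be checked with a short sketch. It uses only `itertools.permutations()` from the standard library; the variable names (`perms`, `first`, `remaining`, `all_pairs`) are illustrative and do not come from the chapter.

```python
import itertools

# A string is treated as a sequence of its characters,
# so 'ABC' behaves like ['A', 'B', 'C'] here.
perms = itertools.permutations('ABC', 3)

# permutations() returns an iterator; next() pulls one item at a time.
first = next(perms)

# Passing the iterator to list() collects whatever is left in one go.
remaining = list(perms)

# Taken 2 at a time, pairs with repeated items never appear.
all_pairs = list(itertools.permutations([1, 2, 3], 2))
```

Wrapping the iterator in `list()` is convenient for debugging, but it consumes the iterator, so build a fresh one if the permutations are needed again.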
@@ -397,7 +397,7 @@ for guess in itertools.permutations(digits, len(characters)): >>> 'MARK'.translate(translation_table) 'MORK'
    -
  1. String translation starts with a translation table, which is just a dictionary that maps one character to another. Actually, “character” is incorrect — the translation table really maps one byte to another. +
  2. String translation starts with a translation table, which is just a dictionary that maps one character to another. Actually, “character” is incorrect — the translation table really maps one byte to another.
  3. Remember, bytes in Python 3 are integers. The ord() function returns the ASCII value of a character, which, in the case of A–Z, is always a byte from 65 to 90.
  4. The translate() method on a string takes a translation table and runs the string through it. That is, it replaces all occurrences of the keys of the translation table with the corresponding values. In this case, “translating” MARK to MORK.
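
As a minimal sketch of the steps above: the translation table is an ordinary dictionary whose keys and values are the integers returned by `ord()`. The name `translation_table` follows the hunk context; `result` is illustrative.

```python
# Map the integer value of 'A' to the integer value of 'O'.
translation_table = {ord('A'): ord('O')}

# translate() replaces every occurrence of a key with its value;
# characters that are not in the table pass through unchanged.
result = 'MARK'.translate(translation_table)
```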
@@ -512,7 +512,7 @@ NameError: name 'x' is not defined NameError: name 'math' is not defined
  1. The second and third parameters passed to the eval() function act as the global and local namespaces for evaluating the expression. In this case, they are both empty, which means that when the string "x * 5" is evaluated, there is no reference to x in either the global or local namespace, so eval() throws an exception. -
  2. You can selectively include specific values in the global namespace by listing them individually. Then those — and only those — variables will be available during evaluation. +
  3. You can selectively include specific values in the global namespace by listing them individually. Then those — and only those — variables will be available during evaluation.
  4. Even though you just imported the math module, you didn’t include it in the namespace passed to the eval() function, so the evaluation failed.
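
The namespace behavior described above can be sketched as follows. `try_eval` is a hypothetical helper written for this illustration (it is not part of the chapter); it simply reports a `NameError` instead of letting it propagate.

```python
import math

def try_eval(expression, global_ns, local_ns):
    """Evaluate an expression; return 'NameError' if a name is missing."""
    try:
        return eval(expression, global_ns, local_ns)
    except NameError as e:
        return type(e).__name__

x = 5

# Both namespaces are empty, so x is not visible to the expression.
empty_result = try_eval("x * 5", {}, {})

# Only the names listed in the dictionary are available.
listed_result = try_eval("x * 5", {"x": x}, {})

# math was imported in this module, but it was not passed in.
missing_math = try_eval("math.sqrt(x)", {"x": 16}, {})

# Passing the module explicitly makes it available.
with_math = try_eval("math.sqrt(x)", {"x": 16, "math": math}, {})
```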
diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index 8b68547..3d906d3 100644 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -77,7 +77,7 @@ del{background:#f87}

Running 2to3

-

We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy — a function was renamed or moved to a different module — but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic. +

We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy — a function was renamed or moved to a different module — but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.

The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.

C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\
 RefactoringTool: Skipping implicit fixer: buffer
@@ -616,7 +616,7 @@ else:
   File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
     import constants, sys
 ImportError: No module named constants
-

What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead: +

What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:

from . import constants

But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.

The solution is to split the import statement manually. So this two-in-one import: @@ -656,7 +656,7 @@ TypeError: can't use a string pattern on a bytes-like object self._highBitDetector = re.compile(r'[\x80-\xFF]')

This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128–255.

And therein lies the problem. -

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py: +

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:

def feed(self, aBuf):
     .
     .
@@ -671,7 +671,7 @@ TypeError: can't use a string pattern on a bytes-like object
for line in open(f, 'rb'): u.feed(line) -

And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to the open() function, but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops. +

And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to the open() function, but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
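
A standalone sketch of the same situation, independent of chardet: the temporary file, its contents, and the variable names are assumptions made for this illustration. It shows what each open mode yields and what happens when a string pattern meets a byte array.

```python
import os
import re
import tempfile

# Write a small UTF-8 file containing one non-ASCII character.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as f:
    f.write('caf\u00e9\n'.encode('utf-8'))
    path = f.name

try:
    # Text mode decodes each line into a string.
    with open(path, 'r', encoding='utf-8') as text_file:
        text_line = text_file.readline()
    # Binary mode returns each line exactly as stored, as bytes.
    with open(path, 'rb') as binary_file:
        binary_line = binary_file.readline()
finally:
    os.remove(path)

# A pattern defined with a string can only search strings.
high_bit_detector = re.compile(r'[\x80-\xFF]')
try:
    high_bit_detector.search(binary_line)
    error_message = None
except TypeError as e:
    error_message = str(e)
```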

What we need this regular expression to search is not an array of characters, but an array of bytes.

Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)

  class UniversalDetector:
@@ -737,7 +737,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mGotData = False self._mInputState = ePureAscii self._mLastChar = '' -

And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can’t concatenate a string to a byte array — not even a zero-length string. +

And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can’t concatenate a string to a byte array — not even a zero-length string.
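
The failure is easy to reproduce in isolation. In this sketch `a_buf` and `last_char` are stand-ins for the attributes named above, with made-up values.

```python
a_buf = b'\xef\xbb\xbf'   # a byte array, as read from a binary file
last_char = ''            # a zero-length string

# Concatenation fails in either order, even though the string is empty.
errors = []
for left, right in ((last_char, a_buf), (a_buf, last_char)):
    try:
        left + right
    except TypeError as e:
        errors.append(type(e).__name__)
```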

So what is self._mLastChar anyway? The answer is in the feed() method, just a few lines down from where the traceback occurred.

if self._mInputState == ePureAscii:
     if self._highBitDetector.search(aBuf):
@@ -854,7 +854,7 @@ def next_state(self, c):
 def feed(self, aBuf):
     for c in aBuf:
         codingState = self._mCodingSM.next_state(c)
-

And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That’s what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there’s no need to call the ord() function because c is already an int! +

And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That’s what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there’s no need to call the ord() function because c is already an int!
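
A quick illustration of the difference, using made-up values rather than chardet’s data:

```python
a_string = 'ABC'
a_buf = b'ABC'

# Iterating over a string yields 1-character strings.
chars = [c for c in a_string]

# Iterating over a byte array yields integers.
byte_values = [c for c in a_buf]

# ord() on each character gives the same integers the bytes already are.
ords = [ord(c) for c in a_string]
```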

Thus:

  def next_state(self, c):
       # for each byte we get its class
@@ -1131,7 +1131,7 @@ NameError: global name 'reduce' is not defined
return 0.01 total = reduce(operator.add, self._mFreqCounter) -

The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. +

The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.

This monstrosity was so common that Python added a global sum() function.
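
A standalone comparison, with an invented list standing in for `self._mFreqCounter`. In Python 3, `reduce()` is still available from the `functools` module, which is what this sketch imports.

```python
import functools
import operator

freq_counter = [3, 1, 4, 1, 5]  # made-up sample data

# The roundabout way: apply operator.add cumulatively across the list.
total_reduce = functools.reduce(operator.add, freq_counter)

# The direct way: the built-in sum() function.
total_sum = sum(freq_counter)
```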

  def get_confidence(self):
       if self.get_state() == constants.eNotMe:
@@ -1185,7 +1185,7 @@ tests\EUC-JP\arclamp.jp.xml                                  EUC-JP with confide
 

What have we learned?

  1. Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There’s no way around it. It’s hard. -
  2. The automated 2to3 tool is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It’s an impressive piece of engineering, but in the end it’s just an intelligent search-and-replace bot. +
  3. The automated 2to3 tool is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It’s an impressive piece of engineering, but in the end it’s just an intelligent search-and-replace bot.
  4. The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the chardet library is to convert a stream of bytes into a string. But “a stream of bytes” comes up more often than you might think. Reading a file in “binary” mode? You’ll get a stream of bytes. Fetching a web page? Calling a web API? They return a stream of bytes, too.
  5. You need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
  6. Test cases are essential. Don’t port anything without them. Don’t even try. The only reason I have any confidence at all that chardet works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I never would have found half of these problems with manual spot-checking. diff --git a/dip3.css b/dip3.css index b13d916..4aeb36b 100644 --- a/dip3.css +++ b/dip3.css @@ -37,7 +37,7 @@ Classname Legend .c = "centered" = centered footer text (also clears floats) .a = "asterism" = section break .v = "navigation" = prev/next navigation links (not breadcrumbs) -.u = "Unicode" = text contains Unicode characters (requires special font declaration) +.u = "Unicode" = text contains Unicode characters (requires special font declaration to accommodate *cough* a certain browser) .nm = "no mobile" = hide this section on mobile devices .nd = "no decoration" = hide the widgets on this code block diff --git a/generators.html b/generators.html index 7fd1521..85c263d 100644 --- a/generators.html +++ b/generators.html @@ -20,7 +20,7 @@ body{counter-reset:h1 5}

     

    Diving In

    -

    For reasons passing all understanding, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, “borrows” is the wrong word; “pillages” is more like it. Or perhaps “assimilates” — like the Borg. Yes, I like that. +

    For reasons passing all understanding, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, “borrows” is the wrong word; “pillages” is more like it. Or perhaps “assimilates” — like the Borg. Yes, I like that.

    We are the Borg. Your linguistic and etymological distinctiveness will be added to our own. Resistance is futile.

    In this chapter, you’re going to learn about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. But first, let’s talk about how to make plural nouns. (If you haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and it quickly descends into more advanced uses.)

    If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably familiar with the basic rules: @@ -170,7 +170,7 @@ def plural(noun):

-

The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions — not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling matches_sxz(noun), and if it returns a match, calling apply_sxz(noun). +

The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions — not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling matches_sxz(noun), and if it returns a match, calling apply_sxz(noun).
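
A minimal sketch of the idea, assuming simplified rule functions: the names `matches_sxz`, `apply_sxz`, `rules`, and `plural` come from the text, but the function bodies and the two-rule table below are stand-ins written for this illustration, not the chapter’s code.

```python
import re

def matches_sxz(noun):
    # Assumed rule: nouns ending in s, x, or z.
    return re.search('[sxz]$', noun)

def apply_sxz(noun):
    return noun + 'es'

def matches_default(noun):
    # Assumed fallback rule that matches everything.
    return True

def apply_default(noun):
    return noun + 's'

# The structure holds function objects, not function names.
rules = ((matches_sxz, apply_sxz),
         (matches_default, apply_default))

def plural(noun):
    # Each loop variable is bound to an actual, callable function.
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)
```

Because the rules are data, adding a rule means adding a pair of functions to `rules`; `plural()` itself does not change.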

If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following: @@ -392,7 +392,7 @@ def plural(noun):

What have you gained over stage 4? Startup time. In stage 4, when you imported the plural4 module, it read the entire patterns file and built a list of all the possible rules, before you could even think about calling the plural() function. With generators, you can do everything lazily: you read the first rule and create functions and try them, and if that works you don’t ever read the rest of the file or create any other functions. -

What have you lost? Performance! Every time you call the plural() function, the rules() generator starts over from the beginning — which means re-opening the patterns file and reading from the beginning, one line at a time. +

What have you lost? Performance! Every time you call the plural() function, the rules() generator starts over from the beginning — which means re-opening the patterns file and reading from the beginning, one line at a time.

What if you could have the best of both worlds: minimal startup cost (don’t execute any code on import), and maximum performance (don’t build the same functions over and over again). Oh, and you still want to keep the rules in a separate file (because code is code and data is data), just as long as you never have to read the same line twice. diff --git a/http-web-services.html b/http-web-services.html index 4543c83..ff8f360 100644 --- a/http-web-services.html +++ b/http-web-services.html @@ -23,7 +23,7 @@ mark{display:inline}

Diving In

HTTP web services are programmatic ways of sending and receiving data from remote servers using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET; if you want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also define ways of creating, modifying, and deleting data, using HTTP PUT and HTTP DELETE. In other words, the “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) can map directly to application-level operations for retrieving, creating, modifying, and deleting data. -

The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML data — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each resource in an HTTP web service has a unique address (in the form of a URL), you can load it in your web browser and immediately see the raw data. +

The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML data — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each resource in an HTTP web service has a unique address (in the form of a URL), you can load it in your web browser and immediately see the raw data.

Examples of HTTP web services:

    @@ -52,7 +52,7 @@ mark{display:inline}

    Caching

    -

    The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, latency (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack — there’s never a dull moment on the public internet, and there may be nothing you can do about it. +

    The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, latency (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack — there’s never a dull moment on the public internet, and there may be nothing you can do about it.

    HTTP is designed with caching in mind. There is an entire class of devices (called “caching proxies”) whose only job is to sit between you and the rest of the world and minimize network access. Your company or ISP almost certainly maintains caching proxies, even if you’re unaware of them. They work because caching is built into the HTTP protocol. @@ -295,7 +295,7 @@ Content-Type: application/xml

  • …the exact same 3070 bytes you downloaded last time. -

    HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that speaks HTTP fluently. +

    HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that speaks HTTP fluently.

    ⁂ @@ -363,9 +363,9 @@ Content-Type: application/xml

  • Let’s turn on debugging and see what’s on the wire. This is the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back.
  • Create an httplib2.Http object with the same directory name as before.
  • Request the same URL as before. Nothing appears to happen. More precisely, nothing gets sent to the server, and nothing gets returned from the server. There is absolutely no network activity whatsoever. -
  • Yet we did “receive” some data — in fact, we received all of it. +
  • Yet we did “receive” some data — in fact, we received all of it.
  • We also “received” an HTTP status code indicating that the “request” was successful. -
  • Here’s the rub: this “response” was generated from httplib2’s local cache. That directory name you passed in when you created the httplib2.Http object — that directory holds httplib2’s cache of all the operations it’s ever performed. +
  • Here’s the rub: this “response” was generated from httplib2’s local cache. That directory name you passed in when you created the httplib2.Http object — that directory holds httplib2’s cache of all the operations it’s ever performed.

    You previously requested the data at this URL. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 24 hours (Cache-Control: max-age=86400, which is 24 hours measured in seconds). httplib2 understands and respects those caching headers, and it stored the previous response in the .cache directory (which you passed in when you created the Http object). That cache hasn’t expired yet, so the second time you request the data at this URL, httplib2 simply returns the cached result without ever hitting the network. @@ -409,7 +409,7 @@ reply: 'HTTP/1.1 200 OK' 'content-type': 'application/xml'}

  1. httplib2 allows you to add arbitrary HTTP headers to any outgoing request. In order to bypass all caches (not just your local disk cache, but also any caching proxies between you and the remote server), add a no-cache header in the headers dictionary. -
  2. Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions — as part of the incoming response and as part of the outgoing request. It noticed that you added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data. +
  3. Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions — as part of the incoming response and as part of the outgoing request. It noticed that you added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data.
  4. This response was not generated from your local cache. You knew that, of course, because you saw the debugging information on the outgoing request. But it’s nice to have that programmatically verified.
  5. The request succeeded; you downloaded the entire feed again from the remote server. Of course, the server also sent back a full complement of HTTP headers along with the feed data. That includes caching headers, which httplib2 uses to update its local cache, in the hopes of avoiding network access the next time you request this feed. Everything about HTTP caching is designed to maximize cache hits and minimize network access. Even though you bypassed the cache this time, the remote server would really appreciate it if you would cache the result for next time.
@@ -477,7 +477,7 @@ user-agent: Python-httplib2/$Rev: 259 $'
  • httplib2 also sends the Last-Modified validator back to the server in the If-Modified-Since header.
  • The server looked at these validators, looked at the page you requested, and determined that the page has not changed since you last requested it, so it sends back a 304 status code and no data.
  • Back on the client, httplib2 notices the 304 status code and loads the content of the page from its cache. -
  • This might be a bit confusing. There are really two status codes — 304 (returned from the server this time, which caused httplib2 to look in its cache), and 200 (returned from the server last time, and stored in httplib2’s cache along with the page data). response.status returns the status from the cache. +
  • This might be a bit confusing. There are really two status codes — 304 (returned from the server this time, which caused httplib2 to look in its cache), and 200 (returned from the server last time, and stored in httplib2’s cache along with the page data). response.status returns the status from the cache.
  • If you want the raw status code returned from the server, you can get that by looking in response.dict, which is a dictionary of the actual headers returned from the server.
  • However, you still get the data in the content variable. Generally, you don’t need to know why a response was served from the cache. (You may not even care that it was served from the cache at all, and that’s fine too. httplib2 is smart enough to let you act dumb.) By the time the request() method returns to the caller, httplib2 has already updated its cache and returned the data to you. diff --git a/iterators.html b/iterators.html index 1faccaf..9a01917 100644 --- a/iterators.html +++ b/iterators.html @@ -288,8 +288,8 @@ rules = LazyRules() return self
      -
    1. The __iter__() method will be called every time someone — say, a for loop — calls iter(rules). -
    2. This is the place to reset the counter that we’re going to use to retrieve items from the cache (that we haven’t built yet — patience, grasshopper). +
    3. The __iter__() method will be called every time someone — say, a for loop — calls iter(rules). +
    4. This is the place to reset the counter that we’re going to use to retrieve items from the cache (that we haven’t built yet — patience, grasshopper).
    5. Finally, the __iter__() method returns self, which signals that this class will take care of returning its own values throughout an iteration.
    @@ -303,7 +303,7 @@ rules = LazyRules() self.cache.append(funcs) return funcs
      -
    1. The __next__() method gets called whenever someone — say, a for loop — calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that. +
    2. The __next__() method gets called whenever someone — say, a for loop — calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that.
    3. The last part of this function should look familiar, at least. The build_match_and_apply_functions() function hasn’t changed; it’s the same as it ever was. Each line of the pattern file will be read exactly once, as late as possible.
    4. The only difference is that, before returning the match and apply functions (which are stored in the tuple funcs), we’re going to save them in self.cache. Each match and apply function will be built exactly once, as late as possible, then cached.
    @@ -341,7 +341,7 @@ rules = LazyRules() .
    1. self.cache will be a list of the functions we need to match and apply individual rules. (At least that should sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t exhausted the cache yet (i.e. if the length of self.cache is greater than self.cache_index), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch. -
    2. On the other hand, if we don’t get a hit from the cache, and the file object has been closed (which could happen, further down the method, as you saw in the previous code snippet), then there’s nothing more we can do. If the file is closed, it means we’ve exhausted it — we’ve already read through every line from the pattern file, and we’ve already built and cached the match and apply functions for each pattern. The file is exhausted; the cache is exhausted; I’m exhausted. Wait, what? Hang in there, we’re almost done. +
    3. On the other hand, if we don’t get a hit from the cache, and the file object has been closed (which could happen, further down the method, as you saw in the previous code snippet), then there’s nothing more we can do. If the file is closed, it means we’ve exhausted it — we’ve already read through every line from the pattern file, and we’ve already built and cached the match and apply functions for each pattern. The file is exhausted; the cache is exhausted; I’m exhausted. Wait, what? Hang in there, we’re almost done.
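    To see the cache and the open file working together, here is a minimal sketch of the class described above. It is not the chapter's exact listing: it reads its patterns from an in-memory io.StringIO rather than the real pattern file (an assumption, so the example runs on its own), and the four rule lines, the plural() function, and the build_match_and_apply_functions() helper are reconstructed for illustration.

```python
import io
import re

def build_match_and_apply_functions(pattern, search, replace):
    # Illustrative helper: returns a (match, apply) pair for one rule.
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

class LazyRules:
    def __init__(self, pattern_file):
        # Assumption: any open file-like object; the book opens a real file.
        self.pattern_file = pattern_file
        self.cache = []

    def __iter__(self):
        # Reset the cache index, but leave the file position alone.
        self.cache_index = 0
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            # Cache hit: reuse functions built during an earlier iteration.
            return self.cache[self.cache_index - 1]
        if self.pattern_file.closed:
            # File exhausted earlier and cache exhausted now: nothing left.
            raise StopIteration
        line = self.pattern_file.readline()
        if not line:
            self.pattern_file.close()
            raise StopIteration
        pattern, search, replace = line.split(None, 3)
        funcs = build_match_and_apply_functions(pattern, search, replace)
        self.cache.append(funcs)
        return funcs

def plural(noun, rules):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)
    raise ValueError('no matching rule found')

# Hypothetical rule lines standing in for the pattern file.
rules = LazyRules(io.StringIO(
    '[sxz]$ $ es\n'
    '[^aeioudgkprt]h$ $ es\n'
    '(qu|[^aeiou])y$ y$ ies\n'
    '$ $ s\n'
))

print(plural('fax', rules), len(rules.cache))    # first rule read and cached
print(plural('coach', rules), len(rules.cache))  # one cache hit, one new line read
```

    After the two calls, the cache holds exactly two pairs of functions and the file is still open, waiting for the next readline().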

    Putting it all together, here’s what happens when: @@ -352,7 +352,7 @@ rules = LazyRules()

  • Let’s say, for the sake of argument, that the very first rule matched. If so, no further match and apply functions are built, and no further lines are read from the pattern file.
  • Furthermore, for the sake of argument, suppose that the caller calls the plural() function again to pluralize a different word. The for loop in the plural() function will call iter(rules), which will reset the cache index but will not reset the open file object.
  • The first time through, the for loop will ask for a value from rules, which will invoke its __next__() method. This time, however, the cache is primed with a single pair of match and apply functions, corresponding to the patterns in the first line of the pattern file. Since they were built and cached in the course of pluralizing the previous word, they’re retrieved from the cache. The cache index increments, and the open file is never touched. -
  • Let’s say, for the sake of argument, that the first rule does not match this time around. So the for loop comes around again and asks for another value from rules. This invokes the __next__() method a second time. This time, the cache is exhausted — it only contained one item, and we’re asking for a second — so the __next__() method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them. +
  • Let’s say, for the sake of argument, that the first rule does not match this time around. So the for loop comes around again and asks for another value from rules. This invokes the __next__() method a second time. This time, the cache is exhausted — it only contained one item, and we’re asking for a second — so the __next__() method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them.
  • This read-build-and-cache process will continue as long as the rules being read from the pattern file don’t match the word we’re trying to pluralize. If we do find a matching rule before the end of the file, we simply use it and stop, with the file still open. The file pointer will stay wherever we stopped reading, waiting for the next readline() command. In the meantime, the cache now has more items in it, and if we start all over again trying to pluralize a new word, each of those items in the cache will be tried before reading the next line from the pattern file. diff --git a/native-datatypes.html b/native-datatypes.html index b6251a1..932f461 100644 --- a/native-datatypes.html +++ b/native-datatypes.html @@ -32,7 +32,7 @@ body{counter-reset:h1 2}
  • Dictionaries are unordered bags of key-value pairs.

    Of course, there are a lot more types than these seven. Everything is an object in Python, so there are types like module, function, class, method, file, and even compiled code. You’ve already seen some of these: modules have names, functions have docstrings, &c. You’ll learn about classes in [FIXME xref] and files in [FIXME xref]. -

    Strings and bytes are important enough — and complicated enough — that they get their own chapter. Let’s look at the others first. +

    Strings and bytes are important enough — and complicated enough — that they get their own chapter. Let’s look at the others first.

    Booleans

    @@ -272,19 +272,20 @@ ZeroDivisionError: Fraction(0, 0)
     >>> a_list = ['a']
     >>> a_list = a_list + [2.0, 3]    
    ->>> a_list
    +>>> a_list                        
     ['a', 2.0, 3]
    ->>> a_list.append(True)           
    +>>> a_list.append(True)           
     >>> a_list
     ['a', 2.0, 3, True]
    ->>> a_list.extend(['four', 'Ω'])  
    +>>> a_list.extend(['four', 'Ω'])  
     >>> a_list
     ['a', 2.0, 3, True, 'four', 'Ω']
    ->>> a_list.insert(0, 'Ω')         
    +>>> a_list.insert(0, 'Ω')         
     >>> a_list
     ['Ω', 'a', 2.0, 3, True, 'four', 'Ω']
      -
    1. The + operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don’t all need to be the same type. Here we have a list containing a string, a floating point number, and an integer. +
    2. The + operator concatenates lists to create a new list. A list can contain any number of items; there is no size limit (other than available memory). However, if memory is a concern, you should be aware that list concatenation creates a second list in memory. In this case, that new list is immediately assigned to the existing variable a_list. So this line of code is really a two-step process — concatenation then assignment — which can (temporarily) consume a lot of memory when you’re dealing with large lists. +
    3. A list can contain items of any datatype, and the items in a single list don’t all need to be the same type. Here we have a list containing a string, a floating point number, and an integer.
    4. The append() method adds a single item to the end of the list. (Now we have four different datatypes in the list!)
    5. Lists are implemented as classes. “Creating” a list is really instantiating a class. As such, a list has methods that operate on it. The extend() method takes one argument, a list, and appends each of the items of the argument to the original list.
    6. The insert() method inserts a single item into a list. The first argument is the index of the first item in the list that will get bumped out of position. List items do not need to be unique; for example, there are now two separate items with the value 'Ω': the first item, a_list[0], and the last item, a_list[6]. @@ -487,7 +488,7 @@ KeyError: 'db.diveintopython3.org'
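    If you want to see the difference between concatenation and the in-place methods for yourself, the following sketch keeps a second name pointing at each list. The variable names original, same, and b_list are made up for this illustration.

```python
a_list = ['a']
original = a_list               # a second name for the same list object

a_list = a_list + [2.0, 3]      # concatenation builds a brand-new list...
print(a_list is original)       # ...so the two names no longer share an object
print(original)                 # the old list was never modified

b_list = ['a']
same = b_list
b_list.extend([2.0, 3])         # extend() grows the existing list in place
b_list.append(True)             # so does append()
b_list.insert(0, 'Ω')           # and insert()
print(b_list is same)           # still one list, no temporary copy
```

    With large lists, the in-place methods avoid holding both the old list and the new one in memory at the same time.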
    7. You can add new key-value pairs at any time. This syntax is identical to modifying existing values.
    8. The new dictionary item (key 'user', value 'mark') appears to be in the middle. In fact, it was just a coincidence that the items appeared to be in order in the first example; it is just as much a coincidence that they appear to be out of order now.
    9. Assigning a value to an existing dictionary key simply replaces the old value with the new one. -
    10. Will this change the value of the user key back to "mark"? No! Look at the key closely — that’s a capital U in "User". Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it’s completely different. +
    11. Will this change the value of the user key back to "mark"? No! Look at the key closely — that’s a capital U in "User". Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it’s completely different.

    Mixed-Value Dictionaries

    Dictionaries aren’t just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don’t all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary. diff --git a/porting-code-to-python-3-with-2to3.html b/porting-code-to-python-3-with-2to3.html index d48fc42..5290ddf 100644 --- a/porting-code-to-python-3-with-2to3.html +++ b/porting-code-to-python-3-with-2to3.html @@ -32,7 +32,7 @@ td pre{padding:0;border:0}

    Diving in

    Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. Case study: porting chardet to Python 3 describes how to run the 2to3 script, then shows some things it can’t fix automatically. This appendix documents what it can fix automatically.

    print statement

    -

    In Python 2, print was a statement. Whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function — whatever you want to print is passed to print() like any other function. +

    In Python 2, print was a statement. Whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function — whatever you want to print is passed to print() like any other function.
    Notes Python 2 @@ -58,7 +58,7 @@ td pre{padding:0;border:0}
  • To print a single value, call print() with one argument.
  • To print two values separated by a space, call print() with two arguments.
  • This one is a little tricky. In Python 2, if you ended a print statement with a comma, it would print the values separated by spaces, then print a trailing space, then stop without printing a newline. In Python 3, the way to do this is to pass end=' ' as a keyword argument to the print() function. The end argument defaults to '\n' (a newline), so overriding it will suppress the newline after printing the other arguments. -
  • In Python 2, you could redirect the output to a pipe — like sys.stderr — by using the >>pipe_name syntax. In Python 3, the way to do this is to pass the pipe in the file keyword argument. The file argument defaults to sys.stdout (standard out), so overriding it will output to a different pipe instead. +
  • In Python 2, you could redirect the output to a pipe — like sys.stderr — by using the >>pipe_name syntax. In Python 3, the way to do this is to pass the pipe in the file keyword argument. The file argument defaults to sys.stdout (standard out), so overriding it will output to a different pipe instead.
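    Here is a short sketch of the Python 3 forms side by side. To keep the output inspectable, most calls write into an io.StringIO buffer, which is an assumption of this example; any object with a write() method can be passed as file.

```python
import io
import sys

buffer = io.StringIO()

print(1, file=buffer)              # one value, followed by a newline
print(1, 2, file=buffer)           # two values separated by a space
print(1, 2, end=' ', file=buffer)  # trailing space instead of a newline
print('done', file=buffer)         # continues on the same line

print('an error message', file=sys.stderr)  # replaces print >>sys.stderr

captured = buffer.getvalue()
```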

    Unicode string literals

    Python 2 had two string types: Unicode strings and non-Unicode strings. Python 3 has one string type: Unicode strings. @@ -159,7 +159,7 @@ td pre{padding:0;border:0}

    1. The simplest form.
    2. The in operator takes precedence over the or operator, so there is no need for parentheses here. -
    3. On the other hand, you do need parentheses here, for the same reason — in takes precedence over or. +
    4. On the other hand, you do need parentheses here, for the same reason — in takes precedence over or.
    5. The + operator takes precedence over the in operator, so this form technically doesn’t need parentheses, but 2to3 includes them anyway.
    6. This form definitely needs parentheses, since the + operator takes precedence over the in operator.
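    You can check how Python groups these operators with a few throwaway expressions. In Python's grammar, + binds more tightly than in, and in binds more tightly than or. The dictionary and variable values below are invented for the demonstration.

```python
a_dictionary = {'y': 1, 'ab': 2}
x, y = 'a', 'y'

# `in` binds tighter than `or`, so these two expressions are the same test.
without_parens = x in a_dictionary or y in a_dictionary
with_parens = (x in a_dictionary) or (y in a_dictionary)

# Grouping `x or y` first changes the meaning: only the key 'a' is tested.
grouped_or = (x or y) in a_dictionary

# `+` binds tighter than `in`, so the concatenation happens first.
x, y = 'a', 'b'
concat_first = x + y in a_dictionary
concat_explicit = (x + y) in a_dictionary

print(without_parens, with_parens, grouped_or, concat_first, concat_explicit)
```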
    @@ -252,7 +252,7 @@ from urllib.error import HTTPError
    1. The old urllib module in Python 2 had a variety of functions, including urlopen() for fetching data and splittype(), splithost(), and splituser() for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new urllib package. 2to3 will also change all calls to these functions so they use the new naming scheme. -
    2. The old urllib2 module in Python 2 has been folded into the urllib package in Python 3. All your urllib2 favorites — the build_opener() method, Request objects, and HTTPBasicAuthHandler and friends — are still available. +
    3. The old urllib2 module in Python 2 has been folded into the urllib package in Python 3. All your urllib2 favorites — the build_opener() method, Request objects, and HTTPBasicAuthHandler and friends — are still available.
    4. The urllib.parse module in Python 3 contains all the parsing functions from the old urlparse module in Python 2.
    5. The urllib.robotparser module parses robots.txt files.
    6. The FancyURLopener class, which handles HTTP redirects and other status codes, is still available in the new urllib.request module. The urlencode() function has moved to urllib.parse. @@ -567,7 +567,7 @@ reduce(a, b, c) repr('PapayaWhip' + repr(2))
        -
      1. Remember, x can be anything — a class, a function, a module, a primitive data type, etc. The repr() function works on everything. +
      2. Remember, x can be anything — a class, a function, a module, a primitive data type, etc. The repr() function works on everything.
      3. In Python 2, backticks could be nested, leading to this sort of confusing (but valid) expression. The 2to3 tool is smart enough to convert this into nested calls to repr().

      try...except statement

      @@ -1037,7 +1037,7 @@ except:
    7. The 2to3 script is smart enough to construct a valid class declaration, even if the class is inherited from one or more base classes.

    Matters of style

    -

    The rest of the “fixes” listed here aren’t really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an official Python style guide which outlines — in excruciating detail — all sorts of nitpicky details that you almost certainly don’t care about. And given that 2to3 provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs. +

    The rest of the “fixes” listed here aren’t really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an official Python style guide which outlines — in excruciating detail — all sorts of nitpicky details that you almost certainly don’t care about. And given that 2to3 provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs.

    set() literals (explicit)

    In Python 2, the only way to define a literal set in your code was to call set(a_sequence). This still works in Python 3, but a clearer way of doing it is to use the new set literal notation: curly braces. (Dictionaries are also defined with curly braces, which makes sense once you think about it, because dictionaries are just sets of key-value pairs.)

    diff --git a/refactoring.html b/refactoring.html index 1d40a56..8b72d17 100644 --- a/refactoring.html +++ b/refactoring.html @@ -301,7 +301,7 @@ Ran 12 tests in 0.203s

    Answer: there’s only 5000 of them; why don’t you just build a lookup table? This idea gets even better when you realize that you don’t need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers. By the time you need to check whether an arbitrary string is a valid Roman numeral, you will have collected all the valid Roman numerals. “Validating” is reduced to a single dictionary lookup. -

    And best of all, you already have a complete set of unit tests. You can change over half the code in the module, but the unit tests will stay the same. That means you can prove — to yourself and to others — that the new code works just as well as the original. +

    And best of all, you already have a complete set of unit tests. You can change over half the code in the module, but the unit tests will stay the same. That means you can prove — to yourself and to others — that the new code works just as well as the original.

    [download roman10.py]

    class OutOfRangeError(ValueError): pass
    @@ -392,7 +392,7 @@ def build_lookup_tables():
             to_roman_table.append(roman_numeral)       
             from_roman_table[roman_numeral] = integer
      -
    1. This is a clever bit of programming… perhaps too clever. The to_roman() function is defined above; it looks up values in the lookup table and returns them. But the build_lookup_tables() function redefines the to_roman() function to actually do work (like the previous examples did, before you added a lookup table). Within the build_lookup_tables() function, calling to_roman() will call this redefined version. Once the build_lookup_tables() function exits, the redefined version disappears — it is only defined in the local scope of the build_lookup_tables() function. +
    2. This is a clever bit of programming… perhaps too clever. The to_roman() function is defined above; it looks up values in the lookup table and returns them. But the build_lookup_tables() function redefines the to_roman() function to actually do work (like the previous examples did, before you added a lookup table). Within the build_lookup_tables() function, calling to_roman() will call this redefined version. Once the build_lookup_tables() function exits, the redefined version disappears — it is only defined in the local scope of the build_lookup_tables() function.
    3. This line of code will call the redefined to_roman() function, which actually calculates the Roman numeral.
    4. Once you have the result (from the redefined to_roman() function), you add the integer and its Roman numeral equivalent to both lookup tables.
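    The notes above can be sketched as a small, self-contained module. This is a reconstruction for illustration, not the exact roman10.py listing: the range check raises a plain ValueError, and roman_numeral_map is assumed to be the usual tuple of numeral and value pairs.

```python
roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

to_roman_table = [None]   # index 0 is a placeholder; there is no numeral for 0
from_roman_table = {}

def to_roman(n):
    """Module-level version: nothing but a table lookup."""
    if not (0 < n < 5000):
        raise ValueError('number out of range (must be 1..4999)')
    return to_roman_table[n]

def from_roman(s):
    """Validation and conversion are a single dictionary lookup."""
    if s not in from_roman_table:
        raise ValueError('Invalid Roman numeral: {0}'.format(s))
    return from_roman_table[s]

def build_lookup_tables():
    def to_roman(n):
        # Local redefinition that does the real work. It shadows the
        # module-level to_roman() only inside build_lookup_tables().
        result = ''
        for numeral, integer in roman_numeral_map:
            if n >= integer:
                result = numeral
                n -= integer
                break
        if n > 0:
            # The remainder is smaller than n, so it is already in the table.
            result += to_roman_table[n]
        return result

    for integer in range(1, 5000):
        roman_numeral = to_roman(integer)          # calls the local version
        to_roman_table.append(roman_numeral)
        from_roman_table[roman_numeral] = integer

build_lookup_tables()
print(to_roman(1994), from_roman('MCMXCIV'))
```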
    diff --git a/special-method-names.html b/special-method-names.html index fcbf27f..51a4a3d 100644 --- a/special-method-names.html +++ b/special-method-names.html @@ -31,7 +31,7 @@ td a:link, td a:visited{border:0}

     

    Diving in

    -

    We’ve already covered a few special method names elsewhere in this book — “magic” methods that Python invokes when you use certain syntax. Using special methods, your classes can act like sequences, like dictionaries, like functions, like iterators, or even like numbers! This appendix serves both as a reference for the special methods we’ve seen already and a brief introduction to some of the more esoteric ones. +

    We’ve already covered a few special method names elsewhere in this book — “magic” methods that Python invokes when you use certain syntax. Using special methods, your classes can act like sequences, like dictionaries, like functions, like iterators, or even like numbers! This appendix serves both as a reference for the special methods we’ve seen already and a brief introduction to some of the more esoteric ones.

    Basics

    @@ -207,7 +207,7 @@ AttributeError

    Classes That Act Like Functions

    -

    You can make an instance of a class callable — exactly like a function is callable — by defining the __call__() method. +

    You can make an instance of a class callable — exactly like a function is callable — by defining the __call__() method.
    Notes @@ -255,7 +255,7 @@ bytes = zef_file.read(12)

    Classes That Act Like Sequences

    -

    If your class acts as a container for a set of values — that is, if it makes sense to ask whether your class “contains” a value — then it should probably define the following special methods that make it act like a sequence. +

    If your class acts as a container for a set of values — that is, if it makes sense to ask whether your class “contains” a value — then it should probably define the following special methods that make it act like a sequence.
    Notes @@ -358,7 +358,7 @@ class FieldStorage:

    Classes That Act Like Numbers

    -

    Using the appropriate special methods, you can define your own classes that act like numbers. That is, you can add them, subtract them, and perform other mathematical operations on them. This is how fractions are implemented — the Fraction class implements these special methods, then you can do things like this: +

    Using the appropriate special methods, you can define your own classes that act like numbers. That is, you can add them, subtract them, and perform other mathematical operations on them. This is how fractions are implemented — the Fraction class implements these special methods, then you can do things like this:

     >>> from fractions import Fraction
    @@ -635,7 +635,7 @@ class FieldStorage:
     
     

    Classes That Can Be Compared

    -

    I broke this section out from the previous one because comparisons are not strictly the purview of numbers. Many datatypes can be compared — strings, lists, even dictionaries. If you’re creating your own class and it makes sense to compare your objects to other objects, you can use the following special methods to implement comparisons. +

    I broke this section out from the previous one because comparisons are not strictly the purview of numbers. Many datatypes can be compared — strings, lists, even dictionaries. If you’re creating your own class and it makes sense to compare your objects to other objects, you can use the following special methods to implement comparisons.
    Notes @@ -755,7 +755,7 @@ def __exit__(self, *args) -> None: self.close()
    1. The file object defines both an __enter__() and an __exit__() method. The __enter__() method checks that the file is open; if it’s not, the _checkClosed() method raises an exception. -
    2. The __enter__() method should almost always return self — this is the object that the with block will use to dispatch properties and methods. +
    3. The __enter__() method should almost always return self — this is the object that the with block will use to dispatch properties and methods.
    4. After the with block, the file object automatically closes. How? In the __exit__() method, it calls self.close().
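    The same pattern works for your own classes. The sketch below uses a made-up ManagedResource class as a stand-in for the file object. It mirrors the three notes above but is not the io module's actual code.

```python
class ManagedResource:
    """Hypothetical file-like object that can be used in a with block."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        # Refuse to enter the block if the resource is already closed.
        if self.closed:
            raise ValueError('operation on closed resource')
        return self            # the object bound by "as" in the with statement

    def __exit__(self, *args):
        # Called when the block ends, whether or not an exception occurred.
        self.close()

with ManagedResource() as resource:
    open_inside_block = not resource.closed

closed_after_block = resource.closed
```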
    diff --git a/strings.html b/strings.html index ea2a36a..d2f9cd1 100644 --- a/strings.html +++ b/strings.html @@ -21,11 +21,11 @@ My alphabet starts where your alphabet ends!
    &m

     

    Some Boring Stuff You Need To Understand Before You Can Dive In

    -

    Did you know that the people of Bougainville have the smallest alphabet in the world? Their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and lowercase separately — plus a handful of !@#$%& punctuation marks. +

    Did you know that the people of Bougainville have the smallest alphabet in the world? Their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and lowercase separately — plus a handful of !@#$%& punctuation marks.

    When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. -

    In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish. +

    In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish.

    @@ -61,11 +61,11 @@ My alphabet starts where your alphabet ends!
    &m

    Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used “astral plane” Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don’t). And you can still easily find the Nth character of a string in constant time, if you assume that the string doesn’t include any astral plane characters, which is a good assumption right up until the moment that it’s not. -

    But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character U+4E2D could be stored in UTF-16 as either 4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you’re safe — different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you’re going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence 4E 2D means U+4E2D or U+2D4E. +

    But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character U+4E2D could be stored in UTF-16 as either 4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you’re safe — different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you’re going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence 4E 2D means U+4E2D or U+2D4E.

    To solve this problem, the multi-byte Unicode encodings define a “Byte Order Mark,” which is a special non-printable character that you can include at the beginning of your document to indicate what order your bytes are in. For UTF-16, the Byte Order Mark is U+FEFF. If you receive a UTF-16 document that starts with the bytes FF FE, you know the byte ordering is one way; if it starts with FE FF, you know the byte ordering is reversed. -

    Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of ASCII characters. If you think about it, even a Chinese web page is going to contain a lot of ASCII characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the Nth character in constant time is nice, but there’s still the nagging problem of those astral plane characters, which mean that you can’t guarantee that every character is exactly two bytes, so you can’t really find the Nth character in constant time unless you maintain a separate index. And boy, there sure is a lot of ASCII text in the world… +

    Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of ASCII characters. If you think about it, even a Chinese web page is going to contain a lot of ASCII characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the Nth character in constant time is nice, but there’s still the nagging problem of those astral plane characters, which mean that you can’t guarantee that every character is exactly two bytes, so you can’t really find the Nth character in constant time unless you maintain a separate index. And boy, there sure is a lot of ASCII text in the world…
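    You can watch the byte-order problem happen from Python itself. This sketch uses only the standard codecs module and the built-in codec names 'utf-16-be', 'utf-16-le', and 'utf-16'. The variable names are invented for the example.

```python
import codecs

zhong = '\u4e2d'   # the character 中, U+4E2D

big_endian = zhong.encode('utf-16-be')
little_endian = zhong.encode('utf-16-le')
print(big_endian.hex(), little_endian.hex())

# The Byte Order Mark is U+FEFF, written in the document's own byte order.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())

# The generic 'utf-16' codec reads the mark, picks the order, and drops it.
from_be = (codecs.BOM_UTF16_BE + big_endian).decode('utf-16')
from_le = (codecs.BOM_UTF16_LE + little_endian).decode('utf-16')

# Without a mark, a wrong guess silently produces a different character.
wrong_guess = big_endian.decode('utf-16-le')
```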

    Other people pondered these questions, and they came up with a solution: @@ -73,7 +73,7 @@ My alphabet starts where your alphabet ends!
    &m

    UTF-8 is a variable-length encoding system for Unicode. That is, different characters take up a different number of bytes. For ASCII characters (A-Z, &c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from ASCII. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes. -

    Disadvantages: because each character can take a different number of bytes, finding the Nth character is an O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters. +

    Disadvantages: because each character can take a different number of bytes, finding the Nth character is an O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
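The byte counts quoted above for UTF-8 can be checked directly. This is a small sketch, not part of the original chapter; the sample characters are chosen for illustration, and U+1D11E (a musical G clef) stands in for the “astral plane” characters.

```python
# One sample per width class: ASCII, extended Latin, Chinese, astral plane.
samples = ['A', '\u00f1', '\u4e2d', '\U0001D11E']

utf8_widths = [len(s.encode('utf-8')) for s in samples]
print(utf8_widths)    # [1, 2, 3, 4]

# For comparison, UTF-16 needs two bytes for most characters
# but four for the astral plane character.
utf16_widths = [len(s.encode('utf-16-le')) for s in samples]
print(utf16_widths)   # [2, 2, 2, 4]

# ASCII characters produce the exact same byte in both encodings.
print('A'.encode('utf-8') == 'A'.encode('ascii'))  # True
```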

    Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
    @@ -164,7 +164,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):

    1. Rather than calling any function in the humansize module, you’re just grabbing one of the data structures it defines: the list of “SI” (powers-of-1000) suffixes. -
    2. This looks complicated, but it’s not. {0} would refer to the first argument passed to the format() method, si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same list: 'MB'. Everything outside the curly braces — including 1000, the equals sign, and the spaces — is untouched. The final result is the string '1000KB = 1MB'. +
    3. This looks complicated, but it’s not. {0} would refer to the first argument passed to the format() method, si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same list: 'MB'. Everything outside the curly braces — including 1000, the equals sign, and the spaces — is untouched. The final result is the string '1000KB = 1MB'.
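The format specifier walked through above can be tried on its own. In this sketch the list literal is a stand-in for the suffix list that the humansize module defines, and the template string is reconstructed from the result quoted in the text, so treat both as assumptions.

```python
# Stand-in for the "SI" suffix list from the humansize module (assumed contents).
si_suffixes = ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']

# {0} is the first argument to format(); {0[0]} and {0[1]} index into it.
result = '1000{0[0]} = 1{0[1]}'.format(si_suffixes)
print(result)  # 1000KB = 1MB
```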
    @@ -340,7 +340,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
  • By an amazing coincidence, this line of code says “count the occurrences of the string that you would get after decoding this sequence of bytes in this particular character encoding.” -

    And here is the link between strings and bytes: bytes objects have a decode() method that takes a character encoding and returns a string, and strings have an encode() method that takes a character encoding and returns a bytes object. In the previous example, the decoding was relatively straightforward — converting a sequence of bytes in the ASCII encoding into a string of characters. But the same process works with any encoding that supports the characters of the string — even legacy (non-Unicode) encodings. +

    And here is the link between strings and bytes: bytes objects have a decode() method that takes a character encoding and returns a string, and strings have an encode() method that takes a character encoding and returns a bytes object. In the previous example, the decoding was relatively straightforward — converting a sequence of bytes in the ASCII encoding into a string of characters. But the same process works with any encoding that supports the characters of the string — even legacy (non-Unicode) encodings.
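As a rough sketch of the round trip described above, the following loop encodes one string in three encodings and decodes it back again. The choice of encodings is illustrative; `'gb18030'` and `'big5'` are legacy codecs that ship with Python, and `'ascii'` is included to show what happens when an encoding does not support the characters of the string.

```python
a_string = '深入 Python'

lengths = {}
for encoding in ('utf-8', 'gb18030', 'big5'):
    by = a_string.encode(encoding)      # str -> bytes
    roundtrip = by.decode(encoding)     # bytes -> str
    lengths[encoding] = len(by)
    print(encoding, len(by), roundtrip == a_string)

# An encoding that lacks the Chinese characters refuses to encode them.
try:
    a_string.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```

The string has nine characters in every case; only the number of bytes changes with the encoding.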

     >>> a_string = '深入 Python'         
    @@ -378,7 +378,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
     
     

    Postscript: Character Encoding Of Python Source Code

    -

    Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8. +

    Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8.

    In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8.
    @@ -425,7 +425,7 @@ TypeError: Can't convert 'bytes' object to str implicitly

    On strings and string formatting:

      -
    • string — Common string operations +
    • string — Common string operations
    • Format String Syntax
    • Format Specification Mini-Language
    • PEP 3101: Advanced String Formatting
diff --git a/unit-testing.html b/unit-testing.html
index ba6e822..8a39f49 100644
--- a/unit-testing.html
+++ b/unit-testing.html
@@ -32,7 +32,7 @@ body{counter-reset:h1 8}

      Let’s start mapping out what a roman.py module should do. It will have two main functions, to_roman() and from_roman(). The to_roman() function should take an integer from 1 to 3999 and return the Roman numeral representation as a string…

      Stop right there. Now let’s do something a little unexpected: write a test case that checks whether the to_roman() function does what you want it to. You read that right: you’re going to write code that tests code that you haven’t written yet. -

      This is called unit testing. The set of two conversion functions — to_roman(), and later from_roman() — can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named unittest module. +

      This is called unit testing. The set of two conversion functions — to_roman(), and later from_roman() — can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named unittest module.

      Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:

      • Before writing code, it forces you to detail your requirements in a useful fashion.
      @@ -134,7 +134,7 @@ if __name__ == '__main__':
      • Assuming the to_roman() function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the right value. This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal. If the result returned from to_roman() (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail. If the two values are equal, assertEqual will do nothing. If every value returned from to_roman() matches the known value you expect, assertEqual never raises an exception, so testToRomanKnownValues eventually exits normally, which means to_roman() has passed this test. -

        Once you have a test case, you can start coding the to_roman() function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you’ve written any code, you’re doing it wrong — your tests aren’t testing your code at all! Write a test that fails, then code until it passes. +

        Once you have a test case, you can start coding the to_roman() function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you’ve written any code, you’re doing it wrong — your tests aren’t testing your code at all! Write a test that fails, then code until it passes.

        # roman1.py
         
         def to_roman(n):
        @@ -237,7 +237,7 @@ OK
        >>> roman1.to_roman(9000)
        'MMMMMMMMM'
      -
    1. That’s definitely not what you wanted — that’s not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is baaaaaaad; if a program is going to fail, it is far better that it fail quickly and noisily. “Halt and catch fire,” as the saying goes. The Pythonic way to halt and catch fire is to raise an exception. +
    2. That’s definitely not what you wanted — that’s not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is baaaaaaad; if a program is going to fail, it is far better that it fail quickly and noisily. “Halt and catch fire,” as the saying goes. The Pythonic way to halt and catch fire is to raise an exception.

    The question to ask yourself is, “How can I express this as a testable requirement?” How’s this for starters:

    @@ -275,14 +275,14 @@ Ran 2 tests in 0.000s FAILED (errors=1)
      -
    1. You should have expected this to fail (since you haven’t written any code to pass it yet), but... it didn’t actually “fail,” it had an “error” instead. This is a subtle but important distinction. A unit test actually has three return values: pass, fail, and error. Pass, of course, means that the test passed — the code did what you expected. “Fail” is what the previous test case did (until you wrote code to make it pass) — it executed the code but the result was not what you expected. “Error” means that the code didn’t even execute properly. +
    2. You should have expected this to fail (since you haven’t written any code to pass it yet), but... it didn’t actually “fail,” it had an “error” instead. This is a subtle but important distinction. A unit test actually has three return values: pass, fail, and error. Pass, of course, means that the test passed — the code did what you expected. “Fail” is what the previous test case did (until you wrote code to make it pass) — it executed the code but the result was not what you expected. “Error” means that the code didn’t even execute properly.
    3. Why didn’t the code execute properly? The traceback gives the answer: the module you’re testing doesn’t have an exception called OutOfRangeError. Remember, you passed this exception to the assertRaises() method, because it’s the exception you want the function to raise given an out-of-range input. But the exception doesn’t exist, so the call to the assertRaises() method failed. It never got a chance to test the to_roman() function; it didn’t get that far.
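The difference between a failure and an error can be reproduced in a few lines. This is a self-contained sketch, not the book’s test suite: `roman_stub` is a hypothetical stand-in for a module that defines to_roman() but no OutOfRangeError, and the tests run through a plain `unittest.TestResult` so the counts can be inspected.

```python
import types
import unittest

# Hypothetical stand-in for a module that has to_roman() but no OutOfRangeError.
roman_stub = types.SimpleNamespace(to_roman=lambda n: '')

def run_demo():
    """Run three tiny tests and return the collected TestResult."""
    class Demo(unittest.TestCase):
        def test_passes(self):
            self.assertEqual(1 + 1, 2)

        def test_fails(self):
            # The code runs, but the result is not what was expected: a failure.
            self.assertEqual(roman_stub.to_roman(1), 'I')

        def test_errors(self):
            # Looking up the missing exception raises AttributeError before
            # assertRaises() is ever called: an error, not a failure.
            self.assertRaises(roman_stub.OutOfRangeError, roman_stub.to_roman, 4000)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    result = unittest.TestResult()
    suite.run(result)
    return result

result = run_demo()
print(result.testsRun, len(result.failures), len(result.errors))  # 3 1 1
```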

    To solve this problem, you need to define the OutOfRangeError exception in roman2.py.

    class OutOfRangeError(ValueError):  
         pass                            
      -
    1. Exceptions are classes. An “out of range” error is a kind of value error — the argument value is out of its acceptable range. So this exception inherits from the built-in ValueError exception. This is not strictly necessary (it could just inherit from the base Exception class), but it feels right. +
    2. Exceptions are classes. An “out of range” error is a kind of value error — the argument value is out of its acceptable range. So this exception inherits from the built-in ValueError exception. This is not strictly necessary (it could just inherit from the base Exception class), but it feels right.
    3. Exceptions don’t actually do anything, but you need at least one line of code to make a class. The pass statement does precisely nothing, but it’s a line of Python code, so that makes it a class.
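One practical consequence of inheriting from ValueError is worth seeing in a quick sketch: any caller that already catches ValueError will also catch the new exception. The error message below is invented for illustration.

```python
class OutOfRangeError(ValueError):
    pass

print(issubclass(OutOfRangeError, ValueError))  # True

# Code written to handle ValueError handles the subclass too.
try:
    raise OutOfRangeError('number out of range')  # illustrative message
except ValueError as e:
    caught = e

print(type(caught).__name__)  # OutOfRangeError
```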

    Now run the test suite again.
    @@ -305,7 +305,7 @@ Ran 2 tests in 0.016s FAILED (failures=1)

    1. The new test is still not passing, but it’s not returning an error either. Instead, the test is failing. That’s progress! It means the call to the assertRaises() method succeeded this time, and the unit test framework actually tested the to_roman() function. -
    2. Of course, the to_roman() function isn’t raising the OutOfRangeError exception you just defined, because you haven’t told it to do that yet. That’s excellent news! It means this is a valid test case — it fails before you write the code to make it pass. +
    3. Of course, the to_roman() function isn’t raising the OutOfRangeError exception you just defined, because you haven’t told it to do that yet. That’s excellent news! It means this is a valid test case — it fails before you write the code to make it pass.

    Now you can write the code to make this test pass.

    [download roman2.py]
diff --git a/xml.html b/xml.html
index 7ea9e3a..e96c73b 100644
--- a/xml.html
+++ b/xml.html
@@ -23,7 +23,7 @@ mark{display:inline}

    Diving In

    Most of the chapters in this book have centered around a piece of sample code. But XML isn’t about code; it’s about data. One common use of XML is “syndication feeds” that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by “subscribing” to its feed, and you can follow multiple blogs with a dedicated “feed aggregator” like Google Reader. -

    Here, then, is the XML data we’ll be working with in this chapter. It’s a feed — specifically, an Atom syndication feed. +

    Here, then, is the XML data we’ll be working with in this chapter. It’s a feed — specifically, an Atom syndication feed.

    [download feed.xml]

    <?xml version='1.0' encoding='utf-8'?>
    @@ -320,9 +320,9 @@ mark{display:inline}
     {}
    1. The attrib property is a dictionary of the element’s attributes. The original markup here was <feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>. The xml: prefix refers to a built-in namespace that every XML document can use without declaring it. -
    2. The fifth child — [4] in a 0-based list — is the link element. +
    3. The fifth child — [4] in a 0-based list — is the link element.
    4. The link element has three attributes: href, type, and rel. -
    5. The fourth child — [3] in a 0-based list — is the updated element. +
    6. The fourth child — [3] in a 0-based list — is the updated element.
    7. The updated element has no attributes, so its .attrib is just an empty dictionary.
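The attrib behavior described above can be sketched with a much smaller document than the chapter’s feed.xml. The inline feed below is made up for illustration (so the child positions differ from the ones in the list above); only the `xml.etree.ElementTree` calls are real.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up Atom-style document (not the book's feed.xml).
doc = """<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>example feed</title>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://example.org/'/>
</feed>"""

root = ET.fromstring(doc)

# The xmlns declaration is not an attribute; xml:lang is, under the
# built-in XML namespace.
print(root.attrib)

link = root[2]                 # third child in this small document
print(sorted(link.attrib))     # ['href', 'rel', 'type']

updated = root[1]
print(updated.attrib)          # {}
```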
    @@ -348,7 +348,7 @@ mark{display:inline} []
    1. The findall() method finds child elements that match a specific query. (More on the query format in a minute.) -
    2. Each element — including the root element, but also child elements — has a findall() method. It finds all matching elements among the element’s children. But why aren’t there any results? Although it may not be obvious, this particular query only searches the element’s children. Since the root feed element has no child named feed, this query returns an empty list. +
    3. Each element — including the root element, but also child elements — has a findall() method. It finds all matching elements among the element’s children. But why aren’t there any results? Although it may not be obvious, this particular query only searches the element’s children. Since the root feed element has no child named feed, this query returns an empty list.
    4. This result may also surprise you. There is an author element in this document; in fact, there are three (one in each entry). But those author elements are not direct children of the root element; they are “grandchildren” (literally, a child element of a child element). If you want to look for author elements at any nesting level, you can do that, but the query format is slightly different.
    @@ -391,7 +391,7 @@ mark{display:inline} 'type': 'text/html', 'rel': 'alternate'}
      -
    1. This query — //{http://www.w3.org/2005/Atom}link — is very similar to the previous examples, except for the two slashes at the beginning of the query. Those two slashes mean “don’t just look for direct children; I want any elements, regardless of nesting level.” So the result is a list of four link elements, not just one. +
    2. This query — //{http://www.w3.org/2005/Atom}link — is very similar to the previous examples, except for the two slashes at the beginning of the query. Those two slashes mean “don’t just look for direct children; I want any elements, regardless of nesting level.” So the result is a list of four link elements, not just one.
    3. The first result is a direct child of the root element. As you can see from its attributes, this is the feed-level alternate link that points to the HTML version of the website that the feed describes.
    4. The other three results are each entry-level alternate links. Each entry has a single link child element, and because of the double slash at the beginning of the query, this query finds all of them.
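Here is a rough sketch of the difference between searching direct children and searching every nesting level, again on a made-up feed rather than the chapter’s feed.xml. One assumption to flag: the sketch spells the descendant query with a leading `.//`, because current ElementTree releases reject a bare leading `//` when findall() is called on an element.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# A made-up feed with one feed-level link and two entries.
doc = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' type='text/html' href='http://example.org/'/>
  <entry>
    <author><name>A</name></author>
    <link rel='alternate' type='text/html' href='http://example.org/1'/>
  </entry>
  <entry>
    <author><name>B</name></author>
    <link rel='alternate' type='text/html' href='http://example.org/2'/>
  </entry>
</feed>"""

root = ET.fromstring(doc)

# Direct children only: one link, no author, and no feed inside feed.
direct_links = root.findall(ATOM + 'link')
direct_authors = root.findall(ATOM + 'author')
print(len(direct_links), len(direct_authors), root.findall(ATOM + 'feed'))  # 1 0 []

# Any nesting level: the feed-level link plus one per entry.
all_links = root.findall('.//' + ATOM + 'link')
all_authors = root.findall('.//' + ATOM + 'author')
print(len(all_links), len(all_authors))  # 3 2
```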
    @@ -509,7 +509,7 @@ except ImportError:
  • At any time, you can serialize any element (and its children) with the ElementTree tostring() function. -

    Was that serialization surprising to you? The way ElementTree serializes namespaced XML elements is technically accurate but not optimal. The sample XML document at the beginning of this chapter defined a default namespace (xmlns='http://www.w3.org/2005/Atom'). Defining a default namespace is useful for documents — like Atom feeds — where every element is in the same namespace, because you can declare the namespace once and declare each element with just its local name (<feed>, <link>, <entry>). There is no need to use any prefixes unless you want to declare elements from another namespace. +

    Was that serialization surprising to you? The way ElementTree serializes namespaced XML elements is technically accurate but not optimal. The sample XML document at the beginning of this chapter defined a default namespace (xmlns='http://www.w3.org/2005/Atom'). Defining a default namespace is useful for documents — like Atom feeds — where every element is in the same namespace, because you can declare the namespace once and declare each element with just its local name (<feed>, <link>, <entry>). There is no need to use any prefixes unless you want to declare elements from another namespace.
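To see the “technically accurate but not optimal” serialization for yourself, here is a minimal sketch on a one-element document. The document is invented for illustration, and the exact prefix that ElementTree picks is its own business; in a fresh interpreter it is `ns0`.

```python
import xml.etree.ElementTree as ET

# Made-up document that uses a default namespace, as Atom feeds do.
doc = "<feed xmlns='http://www.w3.org/2005/Atom'><title>t</title></feed>"
root = ET.fromstring(doc)

# The default namespace comes back as an explicit, generated prefix.
serialized = ET.tostring(root, encoding='unicode')
print(serialized)

# Parsing the prefixed form gives the same namespaced tags as the original.
reparsed = ET.fromstring(serialized)
same_tags = [el.tag for el in root.iter()] == [el.tag for el in reparsed.iter()]
print(same_tags)  # True
```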

    An XML parser won’t “see” any difference between an XML document with a default namespace and an XML document with a prefixed namespace. The resulting DOM of this serialization:
    @@ -566,7 +566,7 @@ except ImportError:

    Parsing Broken XML

    -

    The XML specification mandates that all conforming XML parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the XML document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like HTML — your browser doesn’t stop rendering a web page if you forget to close an HTML tag or escape an ampersand in an attribute value. (It is a common misconception that HTML has no defined error handling. HTML error handling is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”) +

    The XML specification mandates that all conforming XML parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the XML document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like HTML — your browser doesn’t stop rendering a web page if you forget to close an HTML tag or escape an ampersand in an attribute value. (It is a common misconception that HTML has no defined error handling. HTML error handling is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)
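Draconian error handling is easy to provoke. The sketch below feeds two deliberately broken, made-up documents to the standard library’s ElementTree parser and records the `ParseError` each one triggers; the exact wording of the messages is left to the parser.

```python
import xml.etree.ElementTree as ET

# Two made-up documents, each with a single wellformedness error.
broken_documents = {
    'mismatched tags': '<feed><title>dive into mark</feed>',
    'undefined entity': '<feed><title>dive into&hellip;</title></feed>',
}

errors = {}
for label, text in broken_documents.items():
    try:
        ET.fromstring(text)
    except ET.ParseError as e:
        # The parser stops at the first error; no partial tree is returned.
        errors[label] = str(e)
        print(label, '->', e)
```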

    Some people (myself included) believe that it was a mistake for the inventors of XML to mandate draconian error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of “wellformedness” is trickier than it sounds, especially for XML documents (like Atom feeds) that are published on the web and served over HTTP. Despite the maturity of XML, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.
diff --git a/your-first-python-program.html b/your-first-python-program.html
index f387107..aba01bb 100644
--- a/your-first-python-program.html
+++ b/your-first-python-program.html
@@ -66,7 +66,7 @@ if __name__ == '__main__':

    What just happened? You executed your first Python program. You called the Python interpreter on the command line, and you passed the name of the script you wanted Python to execute. The script defines a single function, the approximate_size() function, which takes an exact file size in bytes and calculates a “pretty” (but approximate) size. (You’ve probably seen this in Windows Explorer, or the Mac OS X Finder, or Nautilus or Dolphin or Thunar on Linux. If you display a folder of documents as a multi-column list, it will display a table with the document icon, the document name, the size, type, last-modified date, and so on. If the folder contains a 1093-byte file named TODO, your file manager won’t display TODO 1093 bytes; it’ll say something like TODO 1 KB instead. That’s what the approximate_size() function does.) -

    Look at the bottom of the script, and you’ll see two calls to print(approximate_size(arguments)). These are function calls — first calling the approximate_size() function and passing a number of arguments, then taking the return value and passing it straight on to the print() function. The print() function is built-in; you’ll never see an explicit declaration of it. You can just use it, anytime, anywhere. (There are lots of built-in functions, and lots more functions that are separated into modules. Patience, grasshopper.) +

    Look at the bottom of the script, and you’ll see two calls to print(approximate_size(arguments)). These are function calls — first calling the approximate_size() function and passing a number of arguments, then taking the return value and passing it straight on to the print() function. The print() function is built-in; you’ll never see an explicit declaration of it. You can just use it, anytime, anywhere. (There are lots of built-in functions, and lots more functions that are separated into modules. Patience, grasshopper.)

    So why does running the script on the command line give you the same output every time? We’ll get to that. First, let’s look at that approximate_size() function.
    @@ -81,7 +81,7 @@ if __name__ == '__main__':

    In some languages, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it’s None), and all functions start with def.
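A two-line experiment confirms that a function with no return statement still returns a value. The function name here is hypothetical, made up for this sketch.

```python
def greet(name):
    # No return statement anywhere in this function.
    print('hello', name)

value = greet('world')
print(value is None)  # True
```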

    -

    The approximate_size() function takes the two arguments — size and a_kilobyte_is_1024_bytes — but neither argument specifies a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally. +

    The approximate_size() function takes the two arguments — size and a_kilobyte_is_1024_bytes — but neither argument specifies a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.

    In Java and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
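To see Python keeping track of datatypes on its own, consider this small sketch. The `describe()` function is hypothetical; it borrows the argument names of approximate_size() only to show that nothing in the declaration mentions a type.

```python
def describe(size, a_kilobyte_is_1024_bytes=True):
    # No types in the signature; Python tracks them from the values passed in.
    return type(size).__name__, type(a_kilobyte_is_1024_bytes).__name__

print(describe(1000000000000, False))   # ('int', 'bool')
print(describe(4096.0))                 # ('float', 'bool')

# The same name can refer to values of different types over time.
x = 1
first = type(x).__name__
x = 'one'
second = type(x).__name__
print(first, second)  # int str
```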