diff --git a/about.html b/about.html index 0ebe1d0..ff3daca 100644 --- a/about.html +++ b/about.html @@ -1,18 +1,16 @@ - About the book - Dive Into Python 3 - -
 
-
 
+


About the book

The content of Dive Into Python 3 is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

The chardet library referenced in Case study: porting chardet to Python 3 is licensed under the LGPL 2.1 or later. All other example code is licensed under the MIT license. Full licensing terms are included in each source code file. diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index 33aac87..d24d831 100644 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -1,19 +1,21 @@ - Case study: porting chardet to Python 3 - Dive into Python 3 - -

skip to main content -

  
-

skip to main content +

  
+


Case study: porting chardet to Python 3

Words, words. They’re all we have to go on.
Rosencrantz and Guildenstern are Dead @@ -49,7 +51,7 @@ body{counter-reset:h1 20}

  • Summary

    Diving in

    -

    Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In Chapter 3, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings. +

    Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In Chapter 3, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.

    I’d also like a pony.

    A Unicode pony.

    A Unipony, as it were. @@ -98,8 +100,8 @@ body{counter-reset:h1 20}

    We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy (a function was renamed or moved to a different module), but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.

    The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.

    [The code examples will be easier to follow if you enable Javascript, but whatever.] -

    skip over this -

    C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\
    +

    skip over this +

    C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\
     RefactoringTool: Skipping implicit fixer: buffer
     RefactoringTool: Skipping implicit fixer: idioms
     RefactoringTool: Skipping implicit fixer: set_literal
    @@ -566,8 +568,8 @@ RefactoringTool: chardet\sjisprober.py
     RefactoringTool: chardet\universaldetector.py
     RefactoringTool: chardet\utf8prober.py

    Now run the 2to3 script on the testing harness, test.py. -

    skip over this -

    C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py
    +

    skip over this +

    C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py
     RefactoringTool: Skipping implicit fixer: buffer
     RefactoringTool: Skipping implicit fixer: idioms
     RefactoringTool: Skipping implicit fixer: set_literal
    @@ -602,8 +604,8 @@ RefactoringTool: test.py

    Fixing what 2to3 can’t

    False is invalid syntax

    Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere. -

    skip over this -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this +

    C:\home\chardet> python test.py tests\*\*
     Traceback (most recent call last):
       File "test.py", line 1, in <module>
         from chardet.universaldetector import UniversalDetector
    @@ -612,7 +614,7 @@ RefactoringTool: test.py
    ^ SyntaxError: invalid syntax

    Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it: -

    skip over this +

    skip over this

    import __builtin__
     if not hasattr(__builtin__, 'False'):
         False = 0
    @@ -629,8 +631,8 @@ else:
     

    Ah, wasn’t that satisfying? The code is shorter and more readable already.
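
    If you want to confirm the reserved-word claim for yourself, here is a minimal sketch using the standard keyword module. It is an illustration only, not part of chardet; the '<demo>' filename and the assignment_allowed variable are made up for the example.

```python
import keyword

# In Python 3, False is a reserved word.
print(keyword.iskeyword('False'))   # True

# Assigning to a reserved word is rejected when the code is compiled,
# which is why the problem surfaced as a SyntaxError at import time.
try:
    compile('False = 0', '<demo>', 'exec')
    assignment_allowed = True
except SyntaxError:
    assignment_allowed = False

print(assignment_allowed)
```

    Because the failure happens at compile time, no amount of hasattr() checking inside the module can rescue the assignment; the line has to go.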

    No module named constants

    Time to run test.py again and see how far it gets. -

    skip over this -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this +

    C:\home\chardet> python test.py tests\*\*
     Traceback (most recent call last):
       File "test.py", line 1, in <module>
         from chardet.universaldetector import UniversalDetector
    @@ -649,8 +651,8 @@ import sys

    Onward!

    Name 'file' is not defined

    And here we go again, running test.py to try to execute our test cases…

    -

    skip over this -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
       File "test.py", line 9, in <module>
    @@ -662,8 +664,8 @@ NameError: name 'file' is not defined

    And that’s all I have to say about that.

    Can’t use a string pattern on a bytes-like object

    Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.” -

    skip over this -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
       File "test.py", line 10, in <module>
    @@ -673,14 +675,14 @@ NameError: name 'file' is not defined
    TypeError: can't use a string pattern on a bytes-like object

    To debug this, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the UniversalDetector class: -

    skip over this +

    skip over this

    class UniversalDetector:
         def __init__(self):
             self._highBitDetector = re.compile(r'[\x80-\xFF]')

    This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128–255.

    And therein lies the problem.

    In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py: -

    skip over this +

    skip over this

    def feed(self, aBuf):
         .
         .
    @@ -688,7 +690,7 @@ TypeError: can't use a string pattern on a bytes-like object
    if self._mInputState == ePureAscii: if self._highBitDetector.search(aBuf):

    And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py. -

    skip over this +

    skip over this

    u = UniversalDetector()
     .
     .
    @@ -698,7 +700,7 @@ for line in open(f, 'rb'):
     

    And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
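
    The difference between the two modes is easy to see on a scratch file. This is a rough sketch, not code from test.py: the temporary file, the three sample bytes, and the choice of latin-1 as the explicit encoding are all assumptions made for the demonstration.

```python
import os
import tempfile

# Write three raw bytes to a scratch file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'\xef\xbb\xbf')
    path = f.name

try:
    with open(path, 'rb') as binary_file:
        raw = binary_file.read()     # bytes, exactly as stored on disk
    with open(path, 'r', encoding='latin-1') as text_file:
        text = text_file.read()      # str, decoded with the given encoding
finally:
    os.remove(path)

print(type(raw), type(text))
```

    Same file, same three bytes on disk, but one read hands you bytes and the other hands you a string of characters.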

    What we need this regular expression to search is not an array of characters, but an array of bytes.

    Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.) -

    skip over this code listing +

    skip over this code listing

      class UniversalDetector:
           def __init__(self):
     -         self._highBitDetector = re.compile(b'[\x80-\xFF]')
    @@ -709,7 +711,7 @@ for line in open(f, 'rb'):
               self._mCharSetProbers = []
               self.reset()
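
    Here is the same rule in isolation, as a small sketch. The sample data and the variable names are invented for the example; only the two patterns come from the listing above.

```python
import re

data = b'caf\xc3\xa9'   # bytes, as they might come from a file opened with 'rb'

str_pattern = re.compile(r'[\x80-\xFF]')     # pattern defined with a string
bytes_pattern = re.compile(b'[\x80-\xFF]')   # pattern defined with bytes

try:
    str_pattern.search(data)
    str_error = None
except TypeError as e:
    str_error = e        # a string pattern can't search a bytes object

match = bytes_pattern.search(data)
print(match.start())     # index of the first byte >= 0x80
```

    The string pattern fails before it looks at a single byte; the bytes pattern finds the first high-bit byte without complaint.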

    Searching the entire codebase for other uses of the re module turns up two more instances, in charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays. -

    skip over this code listing +

    skip over this code listing

      class CharSetProber:
           .
           .
    @@ -726,8 +728,8 @@ for line in open(f, 'rb'):
             
     

    Can't convert 'bytes' object to str implicitly

    Curiouser and curiouser… -

    skip over this -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
       File "test.py", line 10, in <module>
    @@ -736,12 +738,12 @@ for line in open(f, 'rb'):
         elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
     TypeError: Can't convert 'bytes' object to str implicitly

    There's an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this: -

    skip over this code listing +

    skip over this code listing

    elif (self._mInputState == ePureAscii) and \
         self._escDetector.search(self._mLastChar + aBuf):

    And re-run the test:

    -

    skip over this command output listing -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this command output listing +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
       File "test.py", line 10, in <module>
    @@ -751,7 +753,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
    TypeError: Can't convert 'bytes' object to str implicitly

    Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you're thinking that the search() method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it's trying to construct the value that it will eventually pass to the search() method.

    We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It's an instance variable, defined in the reset() method, which is actually called from the __init__() method. -

    skip over this code listing +

    skip over this code listing

    class UniversalDetector:
         def __init__(self):
             self._highBitDetector = re.compile(b'[\x80-\xFF]')
    @@ -769,7 +771,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
    self._mLastChar = ''

    And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can't concatenate a string to a byte array — not even a zero-length string.
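
    You can reproduce the failure in a few lines. This sketch uses stand-in variable names (last_char, buf) rather than the real instance attributes.

```python
last_char = ''           # str, like the value set in reset()
buf = b'\xef\xbb\xbf'    # bytes, like the value passed to feed()

try:
    combined = last_char + buf
except TypeError:
    combined = None      # str + bytes is refused, even when the str is empty

# Starting from an empty bytes object instead, concatenation works.
fixed = b'' + buf
```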

    So what is self._mLastChar anyway? The answer is in the feed() method, just a few lines down from where the traceback occurred. -

    skip over this code listing +

    skip over this code listing

    if self._mInputState == ePureAscii:
         if self._highBitDetector.search(aBuf):
             self._mInputState = eHighbyte
    @@ -779,7 +781,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
    self._mLastChar = aBuf[-1]

    The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it's needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus: -

    skip over this code listing +

    skip over this code listing

      def reset(self):
           .
           .
    @@ -787,7 +789,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
    - self._mLastChar = '' + self._mLastChar = b''

    Searching the entire codebase for "mLastChar" turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers. -

    skip over this code listing +

    skip over this code listing

    
       class MultiByteCharSetProber(CharSetProber):
           def __init__(self):
    @@ -807,8 +809,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
    + self._mLastChar = [0, 0]

    Unsupported operand type(s) for +: 'int' and 'bytes'

    I have good news, and I have bad news. The good news is we're making progress… -

    skip over this command listing -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this command listing +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
       File "test.py", line 10, in <module>
    @@ -819,7 +821,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'

    …The bad news is it doesn't always feel like progress.

    But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int?

    The answer lies not in the previous lines of code, but in the following lines. -

    skip over this code listing +

    skip over this code listing

    if self._mInputState == ePureAscii:
         if self._highBitDetector.search(aBuf):
             self._mInputState = eHighbyte
    @@ -829,24 +831,24 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
    self._mLastChar = aBuf[-1]

    This error doesn't occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell: -

    skip over this interpreter listing +

    skip over this interpreter listing

    ->>> aBuf = b'\xEF\xBB\xBF'         
    ->>> len(aBuf)
    +>>> aBuf = b'\xEF\xBB\xBF'         
    +>>> len(aBuf)
     3
    ->>> mLastChar = aBuf[-1]
    ->>> mLastChar                      
    +>>> mLastChar = aBuf[-1]
    +>>> mLastChar                      
     191
    ->>> type(mLastChar)                
    +>>> type(mLastChar)                
     <class 'int'>
    ->>> mLastChar + aBuf               
    +>>> mLastChar + aBuf               
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
    ->>> mLastChar = aBuf[-1:]          
    ->>> mLastChar
    +>>> mLastChar = aBuf[-1:]          
    +>>> mLastChar
     b'\xbf'
    ->>> mLastChar + aBuf               
    +>>> mLastChar + aBuf               
     b'\xbf\xef\xbb\xbf'
    1. Define a byte array of length 3. @@ -864,8 +866,8 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes' + self._mLastChar = aBuf[-1:]
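
    Putting the two fixes together, the carry-over logic might look something like this. LastByteTracker is a hypothetical stand-in written for this illustration, not a class in chardet; it keeps only the part of feed() that remembers the last byte between calls.

```python
class LastByteTracker:
    """Hypothetical stand-in for the last-byte bookkeeping in feed()."""

    def __init__(self):
        self.last = b''              # empty bytes, not an empty str

    def feed(self, buf):
        joined = self.last + buf     # bytes + bytes on every call
        self.last = buf[-1:]         # a slice, so the result stays bytes
        return joined


tracker = LastByteTracker()
first = tracker.feed(b'\xef\xbb')    # nothing carried over yet
second = tracker.feed(b'\xbf')       # previous last byte is prepended
```

    Had feed() stored buf[-1] instead of buf[-1:], the second call would have tried to add an int to a byte array and failed exactly as shown in the traceback.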

    ord() expected string of length 1, but int found

    Tired yet? You're almost there… -

    skip over this command output listing -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this command output listing +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -881,28 +883,28 @@ tests\Big5\0804.blogspot.com.xml
         byteCls = self._mModel['classTable'][ord(c)]
     TypeError: ord() expected string of length 1, but int found

    OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c defined? -

    skip over this code listing +

    skip over this code listing

    # codingstatemachine.py
     def next_state(self, c):
         # for each byte we get its class
         # if it is first byte, we also get byte length
         byteCls = self._mModel['classTable'][ord(c)]

    That's no help; it's just passed into the function. Let's pop the stack. -

    skip over this code listing +

    skip over this code listing

    # utf8prober.py
     def feed(self, aBuf):
         for c in aBuf:
             codingState = self._mCodingSM.next_state(c)

    And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That's what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there's no need to call the ord() function because c is already an int!

    Thus: -

    skip over this code listing +

    skip over this code listing

      def next_state(self, c):
           # for each byte we get its class
           # if it is first byte, we also get byte length
     -     byteCls = self._mModel['classTable'][ord(c)]
     +     byteCls = self._mModel['classTable'][c]
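
    A short sketch of why ord() is no longer needed. The class_table list here is a made-up lookup table, standing in for the model tables in chardet.

```python
buf = b'\xef\xbb\xbf'

# Iterating over bytes yields ints in the range 0-255, not 1-character strings.
values = [c for c in buf]
print(values)            # [239, 187, 191]

# Hypothetical lookup table, indexed directly by byte value.
class_table = [0] * 256
class_table[0xEF] = 1

byte_classes = [class_table[c] for c in buf]   # no ord() needed
```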

    Searching the entire codebase for instances of "ord(c)" uncovers similar problems in sbcharsetprober.py… -

    skip over this code listing +

    skip over this code listing

    # sbcharsetprober.py
     def feed(self, aBuf):
         if not self._mModel['keepEnglishLetter']:
    @@ -913,14 +915,14 @@ def feed(self, aBuf):
         for c in aBuf:
             order = self._mModel['charToOrderMap'][ord(c)]

    …and latin1prober.py… -

    skip over this code listing +

    skip over this code listing

    # latin1prober.py
     def feed(self, aBuf):
         aBuf = self.filter_with_english_letters(aBuf)
         for c in aBuf:
             charClass = Latin1_CharToClass[ord(c)]

    c is iterating over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c. -

    skip over this code listing +

    skip over this code listing

      # sbcharsetprober.py
       def feed(self, aBuf):
           if not self._mModel['keepEnglishLetter']:
    @@ -941,8 +943,8 @@ def feed(self, aBuf):
     

    Unorderable types: int() >= str()

    Let's go again. -

    skip over this command output listing -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this command output listing +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -961,7 +963,7 @@ tests\Big5\0804.blogspot.com.xml
     TypeError: unorderable types: int() >= str()

    Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You're making real progress here.

    So what's this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code: -

    skip over this code listing +

    skip over this code listing

    class SJISContextAnalysis(JapaneseContextAnalysis):
         def get_order(self, aStr):
             if not aStr: return -1, 1
    @@ -972,7 +974,7 @@ TypeError: unorderable types: int() >= str()
    else: charLen = 1

    And where does aStr come from? Let's pop the stack: -

    skip over this code listing +

    skip over this code listing

    def feed(self, aBuf, aLen):
         .
         .
    @@ -983,7 +985,7 @@ TypeError: unorderable types: int() >= str()

    Oh look, it's our old friend, aBuf. As you might have guessed from every other issue we've encountered in this chapter, aBuf is a byte array. Here, the feed() method isn't just passing it on wholesale; it's slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array.

    And what is this code trying to do with aStr? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can't compare integers and strings for inequality without explicitly coercing one of them.

    In this case, there's no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers. -

    skip over this code listing +

    skip over this code listing

      class SJISContextAnalysis(JapaneseContextAnalysis):
           def get_order(self, aStr):
               if not aStr: return -1, 1
    @@ -1037,8 +1039,8 @@ TypeError: unorderable types: int() >= str()
    return -1, charLen
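
    The same idea in miniature: compare the integer you get from indexing against integer constants. The helper name lead_byte_in_range and the sample data are assumptions for this sketch; the 0x81 to 0x9F range is borrowed from the comparison shown in the traceback in this section.

```python
def lead_byte_in_range(data):
    """Hypothetical helper: is the first byte between 0x81 and 0x9F inclusive?"""
    if not data:
        return False
    return 0x81 <= data[0] <= 0x9F   # int compared with int constants


data = b'\x85\x40'

# The Python 2 style comparison, int against str, fails in Python 3.
try:
    data[0] >= '\x81'
    mixed_ok = True
except TypeError:
    mixed_ok = False

in_range = lead_byte_in_range(data)
```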

    Searching the entire codebase for occurrences of the ord() function uncovers the same problem in chardistribution.py: -

    skip over this command output listing -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this command output listing +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -1056,7 +1058,7 @@ tests\Big5\0804.blogspot.com.xml
         if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
     TypeError: unorderable types: int() >= str()

    The fix is the same: -

    skip over this code listing +

    skip over this code listing

      class EUCTWDistributionAnalysis(CharDistributionAnalysis):
           def __init__(self):
               CharDistributionAnalysis.__init__(self)
    @@ -1163,8 +1165,8 @@ TypeError: unorderable types: int() >= str()
    return -1

    Global name 'reduce' is not defined

    Once more into the breach… -

    skip over this command output listing -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this command output listing +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -1177,14 +1179,14 @@ tests\Big5\0804.blogspot.com.xml
     NameError: global name 'reduce' is not defined

    According to the official What's New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: "Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable."

    OK then, let's refactor it to use a for loop. -

    skip over this code listing +

    skip over this code listing

    def get_confidence(self):
         if self.get_state() == constants.eNotMe:
             return 0.01
       
         total = reduce(operator.add, self._mFreqCounter)

    The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. It looks much more readable as a for loop. -

    skip over this code listing +

    skip over this code listing

      def get_confidence(self):
           if self.get_state() == constants.eNotMe:
               return 0.01
    @@ -1194,8 +1196,8 @@ NameError: global name 'reduce' is not defined
    + for frequency in self._mFreqCounter: + total += frequency
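
    For comparison, here are three ways to add up a list in Python 3, sketched on made-up sample data. The built-in sum() is an alternative that the text above does not use; it is included only to show that it gives the same result.

```python
import functools
import operator

freq_counter = [3, 0, 7, 2]      # made-up sample data

# Python 3 spelling of the original call.
total_reduce = functools.reduce(operator.add, freq_counter)

# Explicit loop, in the style of the refactored code.
total_loop = 0
for frequency in freq_counter:
    total_loop += frequency

# Built-in sum(), an alternative not used in the text.
total_sum = sum(freq_counter)
```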

    I CAN HAZ TESTZ? -

    skip over this command output listing -

    C:\home\chardet> python test.py tests\*\*
    +

    skip over this command output listing +

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml                             Big5 with confidence 0.99
     tests\Big5\blog.worren.net.xml                               Big5 with confidence 0.99
    @@ -1239,6 +1241,6 @@ tests\EUC-JP\arclamp.jp.xml                                  EUC-JP with confide
     
  • You need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
  • Test cases are essential. Don't port anything without them. Don't even try. The only reason I have any confidence at all that chardet works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I never would have found half of these problems with manual spot-checking. -

    © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

    © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/dip2 b/dip2 index f7572d8..3c911c4 100644 --- a/dip2 +++ b/dip2 @@ -254,7 +254,7 @@ several months behind in updating their ActivePython installer when new version PythonWin 2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)] on win32. Portions Copyright 1994-2001 Mark Hammond (mhammond@skippinet.com.au) - see 'Help/About PythonWin' for further copyright information. ->>> +>>>

  • Procedure 1.2. Option 2: Installing Python from Python.org

      @@ -289,7 +289,7 @@ Type "copyright", "credits" or "license()" for more information. **************************************************************** IDLE 1.0 ->>> +>>>

    1.3. Python on Mac OS X

    On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it.

    Mac OS X 10.2 and later comes with a command-line version of Python preinstalled. If you are comfortable with the command line, you can use this version for the first third of the book. However, @@ -316,12 +316,12 @@ interactive shell.

    Try it out:

     Welcome to Darwin!
    -[localhost:~] you% python
    +[localhost:~] you% python
     Python 2.2 (#1, 07/14/02, 23:25:09)
     [GCC Apple cpp-precomp 6.14] on darwin
     Type "help", "copyright", "credits", or "license" for more information.
    ->>> [press Ctrl+D to get back to the command prompt]
    -[localhost:~] you% 
    +>>> [press Ctrl+D to get back to the command prompt]
    +[localhost:~] you% 
     

    Procedure 1.4. Installing the Latest Version of Python on Mac OS X

    Follow these steps to download and install the latest version of Python: @@ -358,21 +358,21 @@ Window->Python Interactive (Cmd-0). The opening window [GCC 3.1 20020420 (prerelease)] Type "copyright", "credits" or "license" for more information. MacPython IDE 1.0.1 ->>> +>>>

    Note that once you install the latest version, the pre-installed version is still present. If you are running scripts from the command line, you need to be aware which version of Python you are using.

    Example 1.1. Two versions of Python

    -[localhost:~] you% python
    +[localhost:~] you% python
     Python 2.2 (#1, 07/14/02, 23:25:09)
     [GCC Apple cpp-precomp 6.14] on darwin
     Type "help", "copyright", "credits", or "license" for more information.
    ->>> [press Ctrl+D to get back to the command prompt]
    -[localhost:~] you% /usr/local/bin/python
    +>>> [press Ctrl+D to get back to the command prompt]
    +[localhost:~] you% /usr/local/bin/python
     Python 2.3 (#2, Jul 30 2003, 11:45:28)
     [GCC 3.1 20020420 (prerelease)] on darwin
     Type "help", "copyright", "credits", or "license" for more information.
    ->>> [press Ctrl+D to get back to the command prompt]
    -[localhost:~] you% 
    +>>> [press Ctrl+D to get back to the command prompt]
    +[localhost:~] you% 
     

    1.4. Python on Mac OS 9

    Mac OS 9 does not come with any version of Python, but installation is very simple, and there is only one choice.

    @@ -407,34 +407,34 @@ Window->Python Interactive (Cmd-0). You'll see a scree [GCC 3.1 20020420 (prerelease)] Type "copyright", "credits" or "license" for more information. MacPython IDE 1.0.1 ->>> +>>>

    1.5. Python on RedHat Linux

    Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built binary packages are available for most popular Linux distributions. Or you can always compile from source.

    Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the rpms/ directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here:

    Example 1.2. Installing on RedHat Linux 9

    -localhost:~$ su -
    -Password: [enter your root password]
    -[root@localhost root]# wget http://python.org/ftp/python/2.3/rpms/redhat-9/python2.3-2.3-5pydotorg.i386.rpm
    +localhost:~$ su -
    +Password: [enter your root password]
    +[root@localhost root]# wget http://python.org/ftp/python/2.3/rpms/redhat-9/python2.3-2.3-5pydotorg.i386.rpm
     Resolving python.org... done.
     Connecting to python.org[194.109.137.226]:80... connected.
     HTTP request sent, awaiting response... 200 OK
     Length: 7,495,111 [application/octet-stream]
     ...
    -[root@localhost root]# rpm -Uvh python2.3-2.3-5pydotorg.i386.rpm
    +[root@localhost root]# rpm -Uvh python2.3-2.3-5pydotorg.i386.rpm
     Preparing...                ########################################### [100%]
        1:python2.3              ########################################### [100%]
    -[root@localhost root]# python          
    +[root@localhost root]# python          
     Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
     [GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2
     Type "help", "copyright", "credits", or "license" for more information.
    ->>> [press Ctrl+D to exit]
    -[root@localhost root]# python2.3       
    +>>> [press Ctrl+D to exit]
    +[root@localhost root]# python2.3       
     Python 2.3 (#1, Sep 12 2003, 10:53:56)
     [GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
     Type "help", "copyright", "credits", or "license" for more information.
    ->>> [press Ctrl+D to exit]
    -[root@localhost root]# which python2.3 
    +>>> [press Ctrl+D to exit]
    +[root@localhost root]# which python2.3 
     /usr/bin/python2.3
     
      @@ -444,9 +444,9 @@ Type "help", "copyright", "credits", or "license" for more information.

      1.6. Python on Debian GNU/Linux

      If you are lucky enough to be running Debian GNU/Linux, you install Python through the apt-get command.

      Example 1.3. Installing on Debian GNU/Linux

      -localhost:~$ su -
      -Password: [enter your root password]
      -localhost:~# apt-get install python
      +localhost:~$ su -
      +Password: [enter your root password]
      +localhost:~# apt-get install python
       Reading Package Lists... Done
       Building Dependency Tree... Done
       The following extra packages will be installed:
      @@ -458,7 +458,7 @@ The following NEW packages will be installed:
       0 upgraded, 2 newly installed, 0 to remove and 3 not upgraded.
       Need to get 0B/2880kB of archives.
       After unpacking 9351kB of additional disk space will be used.
      -Do you want to continue? [Y/n] Y
      +Do you want to continue? [Y/n] Y
       Selecting previously deselected package python2.3.
       (Reading database ... 22848 files and directories currently installed.)
       Unpacking python2.3 (from .../python2.3_2.3.1-1_i386.deb) ...
      @@ -468,32 +468,32 @@ Setting up python (2.3.1-1) ...
       Setting up python2.3 (2.3.1-1) ...
       Compiling python modules in /usr/lib/python2.3 ...
       Compiling optimized python modules in /usr/lib/python2.3 ...
      -localhost:~# exit
      +localhost:~# exit
       logout
      -localhost:~$ python
      +localhost:~$ python
       Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
       [GCC 3.3.2 20030908 (Debian prerelease)] on linux2
       Type "help", "copyright", "credits" or "license" for more information.
      ->>> [press Ctrl+D to exit]
      +>>> [press Ctrl+D to exit]
       

      1.7. Python Installation from Source

      If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the .tgz file, and then do the usual configure, make, make install dance.

      Example 1.4. Installing from source

      -localhost:~$ su -
      -Password: [enter your root password]
      -localhost:~# wget http://www.python.org/ftp/python/2.3/Python-2.3.tgz
      +localhost:~$ su -
      +Password: [enter your root password]
      +localhost:~# wget http://www.python.org/ftp/python/2.3/Python-2.3.tgz
       Resolving www.python.org... done.
       Connecting to www.python.org[194.109.137.226]:80... connected.
       HTTP request sent, awaiting response... 200 OK
       Length: 8,436,880 [application/x-tar]
       ...
      -localhost:~# tar xfz Python-2.3.tgz
      -localhost:~# cd Python-2.3
      -localhost:~/Python-2.3# ./configure
      +localhost:~# tar xfz Python-2.3.tgz
      +localhost:~# cd Python-2.3
      +localhost:~/Python-2.3# ./configure
       checking MACHDEP... linux2
       checking EXTRAPLATDIR...
       checking for --without-gcc... no
       ...
      -localhost:~/Python-2.3# make
      +localhost:~/Python-2.3# make
       gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
       -I. -I./Include  -DPy_BUILD_CORE -o Modules/python.o Modules/python.c
       gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
      @@ -501,19 +501,19 @@ gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
       gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
       -I. -I./Include  -DPy_BUILD_CORE -o Parser/grammar1.o Parser/grammar1.c
       ...
      -localhost:~/Python-2.3# make install
      +localhost:~/Python-2.3# make install
       /usr/bin/install -c python /usr/local/bin/python2.3
       ...
      -localhost:~/Python-2.3# exit
      +localhost:~/Python-2.3# exit
       logout
      -localhost:~$ which python
      +localhost:~$ which python
       /usr/local/bin/python
      -localhost:~$ python
      +localhost:~$ python
       Python 2.3.1 (#2, Sep 24 2003, 11:39:14)
       [GCC 3.3.2 20030908 (Debian prerelease)] on linux2
       Type "help", "copyright", "credits" or "license" for more information.
      ->>> [press Ctrl+D to get back to the command prompt]
      -localhost:~$ 
      +>>> [press Ctrl+D to get back to the command prompt]
      +localhost:~$ 
       

      1.8. The Interactive Shell

      Now that you have Python installed, what's this interactive shell thing you're running?

      It's like this: Python leads a double life. It's an interpreter for scripts that you can run from the command line or run like applications, by @@ -521,13 +521,13 @@ double-clicking the scripts. But it's also an interactive shell that can evaluat This is extremely useful for debugging, quick hacking, and testing. I even know some people who use the Python interactive shell in lieu of a calculator!

      Launch the Python interactive shell in whatever way works on your platform, and let's dive in with the steps shown here:

      Example 1.5. First Steps in the Interactive Shell

      ->>> 1 + 1               
      +>>> 1 + 1               
       2
      ->>> print 'hello world' 
      +>>> print 'hello world' 
       hello world
      ->>> x = 1               
      ->>> y = 2
      ->>> x + y
      +>>> x = 1               
      +>>> y = 2
      +>>> x + y
       3
       
        @@ -575,8 +575,8 @@ if __name__ == "__main__":

        Some quick observations before you get to the

        Note: Like C, Python uses == for comparison and = for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.

        So why is this particular if statement a trick? Modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone program, in which case __name__ will be a special default value, __main__. -

        >>> import odbchelper
        ->>> odbchelper.__name__
        +
        >>> import odbchelper
        +>>> odbchelper.__name__
         'odbchelper'

        Knowing this, you can design a test suite for your module within the module itself by putting it in this if statement. When you run the module directly, __name__ is __main__, so the test suite executes. When you import the module, __name__ is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating them into a larger program. @@ -620,35 +620,35 @@ if __name__ == "__main__": a matter of style.

        Third, you never declared the variable myParams, you just assigned a value to it. This is like VBScript without the option explicit option. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception.

        3.4.1. Referencing Variables

        -

        Example 3.18. Referencing an Unbound Variable

        >>> x
        +

        Example 3.18. Referencing an Unbound Variable

        >>> x
         Traceback (innermost last):
           File "<interactive input>", line 1, in ?
         NameError: There is no variable named 'x'
        ->>> x = 1
        ->>> x
        +>>> x = 1
        +>>> x
         1

        You will thank Python for this one day.
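Because the failure is an ordinary exception, a program can also catch it. The following is a minimal sketch written for a current Python 3 interpreter (the "except ... as" form is newer than the Python version used in the sessions above); the names lookup_before_assignment and unbound_name are made up for illustration.

```python
def lookup_before_assignment():
    """Reference a name before it is bound and report what happens."""
    try:
        unbound_name  # never assigned anywhere, so this raises NameError
    except NameError as err:
        return "NameError: %s" % err
    return "no error"

result = lookup_before_assignment()
print(result)
```

The exact wording of the message varies between Python versions, but the exception type is NameError in all of them.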

        3.4.2. Assigning Multiple Values at Once

        One of the cooler programming shortcuts in Python is using sequences to assign multiple values at once. -

        Example 3.19. Assigning multiple values at once

        >>> v = ('a', 'b', 'e')
        ->>> (x, y, z) = v     
        ->>> x
        +

        Example 3.19. Assigning multiple values at once

        >>> v = ('a', 'b', 'e')
        +>>> (x, y, z) = v     
        +>>> x
         'a'
        ->>> y
        +>>> y
         'b'
        ->>> z
        +>>> z
         'e'
        1. v is a tuple of three elements, and (x, y, z) is a tuple of three variables. Assigning one to the other assigns each of the values of v to each of the variables, in order.
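A short sketch of two consequences of this rule, written for a current Python 3 interpreter: the same mechanism swaps two values without a temporary variable, and the number of names on the left must match the number of values on the right. The names p, q, and mismatch are only for illustration.

```python
# Unpack a tuple into three names in one statement.
v = ('a', 'b', 'e')
(x, y, z) = v

# The right-hand side is evaluated first, so this swaps x and y
# without needing a temporary variable.
x, y = y, x

# Two names cannot receive three values; Python raises ValueError.
try:
    (p, q) = v
except ValueError as err:
    mismatch = str(err)

print(x, y, z)
print(mismatch)
```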

          This has all sorts of uses. I often want to assign names to a range of values. In C, you would use enum and manually list each constant and its associated value, which seems especially tedious when the values are consecutive. In Python, you can use the built-in range function with multi-variable assignment to quickly assign consecutive values. -

          Example 3.20. Assigning Consecutive Values

          >>> range(7)              
          +

          Example 3.20. Assigning Consecutive Values

          >>> range(7)              
           [0, 1, 2, 3, 4, 5, 6]
          ->>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7) 
          ->>> MONDAY                
          +>>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7) 
          +>>> MONDAY                
           0
          ->>> TUESDAY
          +>>> TUESDAY
           1
          ->>> SUNDAY
          +>>> SUNDAY
           6
          1. The built-in range function returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting @@ -668,13 +668,13 @@ NameError: There is no variable named 'x'

            3.6. Mapping Lists

            One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each of the elements of the list. -

            Example 3.24. Introducing List Comprehensions

            >>> li = [1, 9, 8, 4]
            ->>> [elem*2 for elem in li]      
            +

            Example 3.24. Introducing List Comprehensions

            >>> li = [1, 9, 8, 4]
            +>>> [elem*2 for elem in li]      
             [2, 18, 16, 8]
            ->>> li         
            +>>> li         
             [1, 9, 8, 4]
            ->>> li = [elem*2 for elem in li] 
            ->>> li
            +>>> li = [elem*2 for elem in li] 
            +>>> li
             [2, 18, 16, 8]
            1. To make sense of this, look at it from right to left. li is the list you're mapping. Python loops through li one element at a time, temporarily assigning the value of each element to the variable elem. Python then applies the function elem*2 and appends that result to the returned list. @@ -683,12 +683,12 @@ NameError: There is no variable named 'x'
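It can help to see the list comprehension next to the explicit loop it replaces. This is only a sketch of the equivalence, written for a current Python 3 interpreter; the names doubled and doubled_loop are made up for illustration.

```python
li = [1, 9, 8, 4]

# The list comprehension from the example.
doubled = [elem * 2 for elem in li]

# The same mapping written as an explicit loop.
doubled_loop = []
for elem in li:
    doubled_loop.append(elem * 2)

print(doubled)
print(doubled_loop)
print(li)  # the original list is unchanged
```

Both forms build a new list and leave li alone; the comprehension just says it in one line.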

              Here are the list comprehensions in the buildConnectionString function that you declared in Chapter 2:

              
               ["%s=%s" % (k, v) for k, v in params.items()]

              First, notice that you're calling the items function of the params dictionary. This function returns a list of tuples of all the data in the dictionary. -

              Example 3.25. The keys, values, and items Functions

              >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
              ->>> params.keys()   
              +

              Example 3.25. The keys, values, and items Functions

              >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
              +>>> params.keys()   
               ['server', 'uid', 'database', 'pwd']
              ->>> params.values() 
              +>>> params.values() 
               ['mpilgrim', 'sa', 'master', 'secret']
              ->>> params.items()  
              +>>> params.items()  
               [('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]
              1. The keys method of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined @@ -697,14 +697,14 @@ NameError: There is no variable named 'x'
              2. The items method returns a list of tuples of the form (key, value). The list contains all the data in the dictionary.
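The sketch below pulls these three methods together. It is written for a current Python 3 interpreter, where keys, values, and items return view objects rather than lists, so each is wrapped in list(). Joining the pairs with a semicolon and sorting them first are assumptions made for this illustration; they are not taken from the buildConnectionString source.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# In Python 3 these methods return views; list() gives the list behavior
# described above.
keys = list(params.keys())
values = list(params.values())
items = list(params.items())

# Each item is a (key, value) tuple, so keys and values line up pairwise.
pairs = ["%s=%s" % (k, v) for k, v in items]

# Sort before joining so the result does not depend on dictionary order.
# The ";" separator is an assumption for this sketch.
connection_string = ";".join(sorted(pairs))
print(connection_string)
```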

                Now let's see what buildConnectionString does. It takes a list, params.items(), and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements as params.items(), but each element in the new list will be a string that contains both a key and its associated value from the params dictionary. -

                Example 3.26. List Comprehensions in buildConnectionString, Step by Step

                >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
                ->>> params.items()
                +

                Example 3.26. List Comprehensions in buildConnectionString, Step by Step

                >>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
                +>>> params.items()
                 [('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]
                ->>> [k for k, v in params.items()]                
                +>>> [k for k, v in params.items()]                
                 ['server', 'uid', 'database', 'pwd']
                ->>> [v for k, v in params.items()]                
                +>>> [v for k, v in params.items()]                
                 ['mpilgrim', 'sa', 'master', 'secret']
                ->>> ["%s=%s" % (k, v) for k, v in params.items()] 
                +>>> ["%s=%s" % (k, v) for k, v in params.items()] 
                 ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
                1. Note that you're using two variables to iterate through the params.items() list. This is another use of multi-variable assignment. The first element of params.items() is ('server', 'mpilgrim'), so in the first iteration of the list comprehension, k will get 'server' and v will get 'mpilgrim'. In this case, you're ignoring the value of v and only including the value of k in the returned list, so this list comprehension ends up being equivalent to params.keys(). @@ -789,9 +789,9 @@ if __name__ == "__main__": if statements use == for comparison, and parentheses are not required.

                  The info function is designed to be used by you, the programmer, while working in the Python IDE. It takes any object that has functions or methods (like a module, which has functions, or a list, which has methods) and prints out the functions and their docstrings. -

                  Example 4.2. Sample Usage of apihelper.py

                  >>> from apihelper import info
                  ->>> li = []
                  ->>> info(li)
                  +

                  Example 4.2. Sample Usage of apihelper.py

                  >>> from apihelper import info
                  +>>> li = []
                  +>>> info(li)
                   append     L.append(object) -- append object to end
                   count      L.count(value) -> integer -- return number of occurrences of value
                   extend     L.extend(list) -- extend list by appending list elements
                  @@ -801,12 +801,12 @@ pop        L.pop([index]) -> item -- remove and return item at index (default la
                   remove     L.remove(value) -- remove first occurrence of value
                   reverse    L.reverse() -- reverse *IN PLACE*
                   sort       L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1

                  By default the output is formatted to be easy to read. Multi-line docstrings are collapsed into a single long line, but this option can be changed by specifying 0 for the collapse argument. If the function names are longer than 10 characters, you can specify a larger value for the spacing argument to make the output easier to read. -

                  Example 4.3. Advanced Usage of apihelper.py

                  >>> import odbchelper
                  ->>> info(odbchelper)
                  +

                  Example 4.3. Advanced Usage of apihelper.py

                  >>> import odbchelper
                  +>>> info(odbchelper)
                   buildConnectionString Build a connection string from a dictionary Returns string.
                  ->>> info(odbchelper, 30)
                  +>>> info(odbchelper, 30)
                   buildConnectionString          Build a connection string from a dictionary Returns string.
                  ->>> info(odbchelper, 30, 0)
                  +>>> info(odbchelper, 30, 0)
                   buildConnectionString          Build a connection string from a dictionary
                       
                       Returns string.
                  @@ -846,16 +846,16 @@ time, you'll call functions the “normal” way, but you always have th
                      cough, Visual Basic).
                   

                  4.3.1. The type Function

                  The type function returns the datatype of any arbitrary object. The possible types are listed in the types module. This is useful for helper functions that can handle several types of data. -

                  Example 4.5. Introducing type

                  >>> type(1)           
                  +

                  Example 4.5. Introducing type

                  >>> type(1)           
                   <type 'int'>
                  ->>> li = []
                  ->>> type(li)          
                  +>>> li = []
                  +>>> type(li)          
                   <type 'list'>
                  ->>> import odbchelper
                  ->>> type(odbchelper)  
                  +>>> import odbchelper
                  +>>> type(odbchelper)  
                   <type 'module'>
                  ->>> import types      
                  ->>> type(odbchelper) == types.ModuleType
                  +>>> import types      
                  +>>> type(odbchelper) == types.ModuleType
                   True
                  1. type takes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions, @@ -866,17 +866,17 @@ True

                  4.3.2. The str Function

                  The str function coerces data into a string. Every datatype can be coerced into a string.

                  Example 4.6. Introducing str

                  ->>> str(1)          
                  +>>> str(1)          
                   '1'
                  ->>> horsemen = ['war', 'pestilence', 'famine']
                  ->>> horsemen
                  +>>> horsemen = ['war', 'pestilence', 'famine']
                  +>>> horsemen
                   ['war', 'pestilence', 'famine']
                  ->>> horsemen.append('Powerbuilder')
                  ->>> str(horsemen)   
                  +>>> horsemen.append('Powerbuilder')
                  +>>> str(horsemen)   
                   "['war', 'pestilence', 'famine', 'Powerbuilder']"
                  ->>> str(odbchelper) 
                  +>>> str(odbchelper) 
                   "<module 'odbchelper' from 'c:\\docbook\\dip\\py\\odbchelper.py'>"
                  ->>> str(None)       
                  +>>> str(None)       
                   'None'
                  1. For simple datatypes like integers, you would expect str to work, because almost every language has a function to convert an integer to a string. @@ -886,15 +886,15 @@ True
                2. A subtle but important behavior of str is that it works on None, the Python null value. It returns the string 'None'. You'll use this to your advantage in the info function, as you'll see shortly.
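Here is one way that behavior can be useful, as a hedged sketch written for a current Python 3 interpreter. The functions documented, undocumented, and doc_summary are hypothetical and are not taken from apihelper.py; the point is only that a missing docstring is None, and str turns it into text that string methods can handle.

```python
def documented():
    """Say hello."""

def undocumented():
    pass

def doc_summary(func):
    # func.__doc__ is None when there is no docstring; str() turns that
    # into the string 'None', so split() and join() never see None itself.
    return " ".join(str(func.__doc__).split())

print(doc_summary(documented))
print(doc_summary(undocumented))
```

Without the str call, the second lookup would fail, because None has no split method.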

                  At the heart of the info function is the powerful dir function. dir returns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much anything. -

                  Example 4.7. Introducing dir

                  >>> li = []
                  ->>> dir(li)           
                  +

                  Example 4.7. Introducing dir

                  >>> li = []
                  +>>> dir(li)           
                   ['append', 'count', 'extend', 'index', 'insert',
                   'pop', 'remove', 'reverse', 'sort']
                  ->>> d = {}
                  ->>> dir(d)            
                  +>>> d = {}
                  +>>> dir(d)            
                   ['clear', 'copy', 'get', 'has_key', 'items', 'keys', 'setdefault', 'update', 'values']
                  ->>> import odbchelper
                  ->>> dir(odbchelper)   
                  +>>> import odbchelper
                  +>>> dir(odbchelper)   
                   ['__builtins__', '__doc__', '__file__', '__name__', 'buildConnectionString']
                  1. li is a list, so dir(li) returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not @@ -903,16 +903,16 @@ True
                3. This is where it really gets interesting. odbchelper is a module, so dir(odbchelper) returns a list of all kinds of stuff defined in the module, including built-in attributes, like __name__, __doc__, and whatever other attributes and methods you define. In this case, odbchelper has only one user-defined method, the buildConnectionString function described in Chapter 2.
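On newer Python versions, dir also reports the double-underscore special methods of lists and dictionaries, so the output is longer than what is shown above. The sketch below, written for a current Python 3 interpreter, filters those out; public_names is a hypothetical helper, and the standard string module stands in for odbchelper so the example is self-contained.

```python
import string

def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

list_names = public_names([])
module_names = public_names(string)

print(list_names)
print("punctuation" in module_names)
```

The result is still a list of strings, not the methods themselves.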

                  Finally, the callable function takes any object and returns True if the object can be called, or False otherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.)

                  Example 4.8. Introducing callable

                  ->>> import string
                  ->>> string.punctuation           
                  +>>> import string
                  +>>> string.punctuation           
                   '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
                  ->>> string.join
                  +>>> string.join
                   <function join at 00C55A7C>
                  ->>> callable(string.punctuation) 
                  +>>> callable(string.punctuation) 
                   False
                  ->>> callable(string.join)        
                  +>>> callable(string.join)        
                   True
                  ->>> print string.join.__doc__    
                  +>>> print string.join.__doc__    
                   join(list [,sep]) -> string
                   
                       Return a string composed of the words in list, with
                  @@ -932,9 +932,9 @@ True
                   

                  The advantage of thinking like this is that you can access all the built-in functions and attributes as a group by getting information about the __builtin__ module. And guess what, you already have a function for that: the info function from apihelper. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the built-in error classes, like AttributeError, should already look familiar.) -

                  Example 4.9. Built-in Attributes and Functions

                  >>> from apihelper import info
                  ->>> import __builtin__
                  ->>> info(__builtin__, 20)
                  +

                  Example 4.9. Built-in Attributes and Functions

                  >>> from apihelper import info
                  +>>> import __builtin__
                  +>>> info(__builtin__, 20)
                   ArithmeticError      Base class for arithmetic errors.
                   AssertionError       Assertion failed.
                   AttributeError       Attribute not found.
                  @@ -957,17 +957,17 @@ IOError              I/O operation failed.
                   

                  4.4. Getting Object References With getattr

                  You already know that Python functions are objects. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the getattr function. -

                  Example 4.10. Introducing getattr

                  >>> li = ["Larry", "Curly"]
                  ->>> li.pop     
                  +

                  Example 4.10. Introducing getattr

                  >>> li = ["Larry", "Curly"]
                  +>>> li.pop     
                   <built-in method pop of list object at 010DF884>
                  ->>> getattr(li, "pop")           
                  +>>> getattr(li, "pop")           
                   <built-in method pop of list object at 010DF884>
                  ->>> getattr(li, "append")("Moe") 
                  ->>> li
                  +>>> getattr(li, "append")("Moe") 
                  +>>> li
                   ['Larry', 'Curly', 'Moe']
                  ->>> getattr({}, "clear")         
                  +>>> getattr({}, "clear")         
                   <built-in method clear of dictionary object at 00F113D4>
                  ->>> getattr((), "pop")           
                  +>>> getattr((), "pop")           
                   Traceback (innermost last):
                     File "<interactive input>", line 1, in ?
                   AttributeError: 'tuple' object has no attribute 'pop'
                  @@ -980,21 +980,21 @@ AttributeError: 'tuple' object has no attribute 'pop'
                  In theory, getattr would work on tuples, except that tuples have no methods, so getattr will raise an exception no matter what attribute name you give.
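The pattern can be wrapped in a small helper that picks a method by its string name at run-time. This is a minimal sketch written for a current Python 3 interpreter; call_by_name is a hypothetical function, not something from the book.

```python
def call_by_name(obj, method_name, *args):
    """Look up a method by its string name and call it with args."""
    method = getattr(obj, method_name)  # raises AttributeError if missing
    return method(*args)

stooges = ["Larry", "Curly"]
call_by_name(stooges, "append", "Moe")
print(stooges)

# A tuple has no pop method, so the lookup itself fails.
try:
    call_by_name((), "pop")
except AttributeError as err:
    print("AttributeError: %s" % err)
```

The caller only needs to know the method name as data, which is what makes this useful when the name is not known until the program is running.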

                  4.4.1. getattr with Modules

                  getattr isn't just for built-in datatypes. It also works on modules. -

                  Example 4.11. The getattr Function in apihelper.py

                  >>> import odbchelper
                  ->>> odbchelper.buildConnectionString             
                  +

                  Example 4.11. The getattr Function in apihelper.py

                  >>> import odbchelper
                  +>>> odbchelper.buildConnectionString             
                   <function buildConnectionString at 00D18DD4>
                  ->>> getattr(odbchelper, "buildConnectionString") 
                  +>>> getattr(odbchelper, "buildConnectionString") 
                   <function buildConnectionString at 00D18DD4>
                  ->>> object = odbchelper
                  ->>> method = "buildConnectionString"
                  ->>> getattr(object, method)    
                  +>>> object = odbchelper
                  +>>> method = "buildConnectionString"
                  +>>> getattr(object, method)    
                   <function buildConnectionString at 00D18DD4>
                  ->>> type(getattr(object, method))                
                  +>>> type(getattr(object, method))                
                   <type 'function'>
                  ->>> import types
                  ->>> type(getattr(object, method)) == types.FunctionType
                  +>>> import types
                  +>>> type(getattr(object, method)) == types.FunctionType
                   True
                  ->>> callable(getattr(object, method))            
                  +>>> callable(getattr(object, method))            
                   True
                  1. This returns a reference to the buildConnectionString function in the odbchelper module, which you studied in Chapter 2, Your First Python Program. (The hex address you see is specific to my machine; your output will be different.) @@ -1040,12 +1040,12 @@ def output(data, format="text"):

                    Here is the list filtering syntax:

                    
                     [mapping-expression for element in source-list if filter-expression]

                    This is an extension of the list comprehensions that you know and love. The first two thirds are the same; the last part, starting with the if, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be almost anything). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored, so they are never put through the mapping expression and are not included in the output list. -

                    Example 4.14. Introducing List Filtering

                    >>> li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]
                    ->>> [elem for elem in li if len(elem) > 1]       
                    +

                    Example 4.14. Introducing List Filtering

                    >>> li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]
                    +>>> [elem for elem in li if len(elem) > 1]       
                     ['mpilgrim', 'foo']
                    ->>> [elem for elem in li if elem != "b"]         
                    +>>> [elem for elem in li if elem != "b"]         
                     ['a', 'mpilgrim', 'foo', 'c', 'd', 'd']
                    ->>> [elem for elem in li if li.count(elem) == 1] 
                    +>>> [elem for elem in li if li.count(elem) == 1] 
                     ['a', 'mpilgrim', 'foo', 'c']
                    1. The mapping expression here is simple (it just returns the value of each element), so concentrate on the filter expression. @@ -1074,11 +1074,11 @@ the pop method of a list) and user-defined (like the buildCon
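To make the order of operations concrete, the sketch below shows a filter next to its explicit loop, then combines a filter with a non-trivial mapping expression. It is written for a current Python 3 interpreter; the variable names are made up for illustration.

```python
li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"]

# Filtering with a list comprehension.
long_names = [elem for elem in li if len(elem) > 1]

# The equivalent explicit loop: test first, then keep the element.
long_names_loop = []
for elem in li:
    if len(elem) > 1:
        long_names_loop.append(elem)

# Mapping and filtering combined: upper() only runs on elements
# that pass the filter.
upper_unique = [elem.upper() for elem in li if li.count(elem) == 1]

print(long_names)
print(upper_unique)
```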

                      4.6. The Peculiar Nature of and and or

                      In Python, and and or perform boolean logic as you would expect, but they do not return boolean values; instead, they return one of the actual values they are comparing. -

                      Example 4.15. Introducing and

                      >>> 'a' and 'b'         
                      +

                      Example 4.15. Introducing and

                      >>> 'a' and 'b'         
                       'b'
                      ->>> '' and 'b'          
                      +>>> '' and 'b'          
                       ''
                      ->>> 'a' and 'b' and 'c' 
                      +>>> 'a' and 'b' and 'c' 
                       'c'
                      1. When using and, values are evaluated in a boolean context from left to right. 0, '', [], (), {}, and None are false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are @@ -1086,16 +1086,16 @@ the pop method of a list) and user-defined (like the buildCon learn all about classes and special methods in Chapter 5. If all values are true in a boolean context, and returns the last value. In this case, and evaluates 'a', which is true, then 'b', which is true, and returns 'b'.
                      2. If any value is false in a boolean context, and returns the first false value. In this case, '' is the first false value.
                      3. All values are true, so and returns the last value, 'c'. -

                        Example 4.16. Introducing or

                        >>> 'a' or 'b'          
                        +

                        Example 4.16. Introducing or

                        >>> 'a' or 'b'          
                         'a'
                        ->>> '' or 'b'           
                        +>>> '' or 'b'           
                         'b'
                        ->>> '' or [] or {}      
                        +>>> '' or [] or {}      
                         {}
                        ->>> def sidefx():
                        -...    print "in sidefx()"
                        -...    return 1
                        ->>> 'a' or sidefx()     
                        +>>> def sidefx():
                        +...    print "in sidefx()"
                        +...    return 1
                        +>>> 'a' or sidefx()     
                         'a'
                        1. When using or, values are evaluated in a boolean context from left to right, just like and. If any value is true, or returns that value immediately. In this case, 'a' is the first true value. @@ -1105,11 +1105,11 @@ the pop method of a list) and user-defined (like the buildCon is important if some values can have side effects. Here, the function sidefx is never called, because or evaluates 'a', which is true, and returns 'a' immediately.
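To watch the short-circuit behavior happen, you can record whether the function was actually called. This is a small sketch in Python 3 syntax (an assumption), and the `calls` list is a hypothetical helper added only to make the side effect visible.

```python
calls = []  # hypothetical log of side effects

def sidefx():
    calls.append("in sidefx()")
    return 1

first = 'a' or sidefx()    # 'a' is true, so sidefx is never called
second = '' and sidefx()   # '' is false, so sidefx is never called
third = '' or sidefx()     # '' is false, so or moves on and calls sidefx

print(first, repr(second), third, calls)
```

Only the third expression reaches sidefx, so the log ends up with a single entry.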

                          If you're a C hacker, you are certainly familiar with the bool ? a : b expression, which evaluates to a if bool is true, and b otherwise. Because of the way and and or work in Python, you can accomplish the same thing.

                          4.6.1. Using the and-or Trick

                          -

                          Example 4.17. Introducing the and-or Trick

                          >>> a = "first"
                          ->>> b = "second"
                          ->>> 1 and a or b 
                          +

                          Example 4.17. Introducing the and-or Trick

                          >>> a = "first"
                          +>>> b = "second"
                          +>>> 1 and a or b 
                           'first'
                          ->>> 0 and a or b 
                          +>>> 0 and a or b 
                           'second'
                           
                            @@ -1117,17 +1117,17 @@ the pop method of a list) and user-defined (like the buildCon
                          1. 0 and 'first' evaluates to 0, and then 0 or 'second' evaluates to 'second'.

                            However, since this Python expression is simply boolean logic, and not a special construct of the language, there is one extremely important difference between this and-or trick in Python and the bool ? a : b syntax in C. If the value of a is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?) -

                            Example 4.18. When the and-or Trick Fails

                            >>> a = ""
                            ->>> b = "second"
                            ->>> 1 and a or b         
                            +

                            Example 4.18. When the and-or Trick Fails

                            >>> a = ""
                            +>>> b = "second"
                            +>>> 1 and a or b         
                             'second'
                            1. Since a is an empty string, which Python considers false in a boolean context, 1 and '' evaluates to '', and then '' or 'second' evaluates to 'second'. Oops! That's not what you wanted.

                              The and-or trick, bool and a or b, will not work like the C expression bool ? a : b when a is false in a boolean context.

                              The real trick behind the and-or trick, then, is to make sure that the value of a is never false. One common way of doing this is to turn a into [a] and b into [b], then take the first element of the returned list, which will be either a or b. -

                              Example 4.19. Using the and-or Trick Safely

                              >>> a = ""
                              ->>> b = "second"
                              ->>> (1 and [a] or [b])[0] 
                              +

                              Example 4.19. Using the and-or Trick Safely

                              >>> a = ""
                              +>>> b = "second"
                              +>>> (1 and [a] or [b])[0] 
                               ''
                              1. Since [a] is a non-empty list, it is never false. Even if a is 0 or '' or some other false value, the list [a] is true because it has one element. @@ -1142,15 +1142,15 @@ the pop method of a list) and user-defined (like the buildCon
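Here are the three variants side by side, as a rough sketch in Python 3 syntax (an assumption). The function names are hypothetical wrappers added for comparison. The last one uses the conditional expression `a if cond else b`, which Python has offered since version 2.5 and which sidesteps the problem; it is shown as an alternative, not as something this chapter relies on.

```python
def and_or(cond, a, b):
    # The simple trick: breaks when a is false in a boolean context.
    return cond and a or b

def safe_and_or(cond, a, b):
    # The safe trick: [a] is a one-element list, so it is always true.
    return (cond and [a] or [b])[0]

def conditional(cond, a, b):
    # Conditional expression (Python 2.5 and later).
    return a if cond else b

print(repr(and_or(1, "", "second")))       # the failure case
print(repr(safe_and_or(1, "", "second")))
print(repr(conditional(1, "", "second")))
```

One caveat about the wrappers: function arguments are evaluated before the call, so unlike the inline expressions, these functions always evaluate both a and b.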

                                4.7. Using lambda Functions

                                Python supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called lambda functions can be used anywhere a function is required. -

                                Example 4.20. Introducing lambda Functions

                                >>> def f(x):
                                -...    return x*2
                                -...    
                                ->>> f(3)
                                +

                                Example 4.20. Introducing lambda Functions

                                >>> def f(x):
                                +...    return x*2
                                +...    
                                +>>> f(3)
                                 6
                                ->>> g = lambda x: x*2  
                                ->>> g(3)
                                +>>> g = lambda x: x*2  
                                +>>> g(3)
                                 6
                                ->>> (lambda x: x*2)(3) 
                                +>>> (lambda x: x*2)(3) 
                                 6
                                1. This is a lambda function that accomplishes the same thing as the normal function above it. Note the abbreviated syntax here: there are no @@ -1170,13 +1170,13 @@ a lambda function; if you need something more complex, define a nor

                                  Here are the lambda functions in apihelper.py:

                                  
                                       processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

                                  Notice that this uses the simple form of the and-or trick, which is okay, because a lambda function is always true in a boolean context. (That doesn't mean that a lambda function can't return a false value. The function is always true; its return value could be anything.)
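To see both branches of that expression at work, you can wrap it in a small factory function. This is only a sketch in Python 3 syntax (an assumption); `make_process_func` is a hypothetical name, and the expression inside it is the one from apihelper.py.

```python
def make_process_func(collapse):
    # Pick one of two lambda functions. Both lambdas are true in a
    # boolean context, so the simple and-or trick is safe here.
    return collapse and (lambda s: " ".join(s.split())) or (lambda s: s)

text = "this   is\na\ttest"
collapsing = make_process_func(1)
passthrough = make_process_func(0)

print(repr(collapsing(text)))
print(repr(passthrough(text)))
```

Either way, the caller ends up holding a function that takes one string and returns one string, so the rest of the code doesn't need to know which one was chosen.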

                                  Also notice that you're using the split function with no arguments. You've already seen it used with one or two arguments, but without any arguments it splits on whitespace. -

                                  Example 4.21. split With No Arguments

                                  >>> s = "this   is\na\ttest"  
                                  ->>> print s
                                  +

                                  Example 4.21. split With No Arguments

                                  >>> s = "this   is\na\ttest"  
                                  +>>> print s
                                   this   is
                                   a	test
                                  ->>> print s.split()           
                                  +>>> print s.split()           
                                   ['this', 'is', 'a', 'test']
                                  ->>> print " ".join(s.split()) 
                                  +>>> print " ".join(s.split()) 
                                   this is a test
                                  1. This is a multiline string, defined by escape characters instead of triple quotes. \n is a newline, and \t is a tab character. @@ -1212,12 +1212,12 @@ a test square brackets.

                                    Now, let's take it from the end and work backwards. The

                                    
                                     for method in methodList

                                    shows that this is a list comprehension. As you know, methodList is a list of all the methods you care about in object. So you're looping through that list with method. -

                                    Example 4.22. Getting a docstring Dynamically

                                    >>> import odbchelper
                                    ->>> object = odbchelper 
                                    ->>> method = 'buildConnectionString'      
                                    ->>> getattr(object, method)               
                                    +

                                    Example 4.22. Getting a docstring Dynamically

                                    >>> import odbchelper
                                    +>>> object = odbchelper 
                                    +>>> method = 'buildConnectionString'      
                                    +>>> getattr(object, method)               
                                     <function buildConnectionString at 010D6D74>
                                    ->>> print getattr(object, method).__doc__ 
                                    +>>> print getattr(object, method).__doc__ 
                                     Build a connection string from a dictionary of parameters.
                                     
                                         Returns string.
                                    @@ -1227,13 +1227,13 @@ for method in methodList

                                    shows that this is a Using the getattr function, you're getting a reference to the method function in the object module.

                                  2. Now, printing the actual docstring of the method is easy.
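The same lookup works on any module, not just odbchelper. As a sketch in Python 3 syntax (an assumption), here it is against the standard library's textwrap module, chosen only because it is always available; the variable names mirror the example above.

```python
import textwrap

module = textwrap        # stands in for the object you are inspecting
method = "dedent"        # the name of the function, as a string

func = getattr(module, method)   # a reference to textwrap.dedent
doc = func.__doc__               # its docstring

print(func is textwrap.dedent)
print(func("    indented"))
```

Because getattr takes the name as a string, the name can come from a list you built at run time, which is exactly what looping over methodList does.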

                                    The next piece of the puzzle is the use of str around the docstring. As you may recall, str is a built-in function that coerces data into a string. But a docstring is always a string, so why bother with the str function? The answer is that not every function has a docstring, and if it doesn't, its __doc__ attribute is None. -

                                    Example 4.23. Why Use str on a docstring?

                                    >>> >>> def foo(): print 2
                                    ->>> >>> foo()
                                    +

                                    Example 4.23. Why Use str on a docstring?

                                    >>> >>> def foo(): print 2
                                    +>>> >>> foo()
                                     2
                                    ->>> >>> foo.__doc__     
                                    ->>> foo.__doc__ == None 
                                    +>>> >>> foo.__doc__     
                                    +>>> foo.__doc__ == None 
                                     True
                                    ->>> str(foo.__doc__)    
                                    +>>> str(foo.__doc__)    
                                     'None'
                                     
                                      @@ -1245,17 +1245,17 @@ True
        NoteIn SQL, you must use IS NULL instead of = NULL to compare a null value. In Python, you can use either == None or is None, but is None is faster.

        Now that you are guaranteed to have a string, you can pass the string to processFunc, which you have already defined as a function that either does or doesn't collapse whitespace. Now you see why it was important to use str to convert a None value into a string representation. processFunc is assuming a string argument and calling its split method, which would crash if you passed it None because None doesn't have a split method.
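A short sketch makes the crash, and the fix, concrete. It is written in Python 3 syntax (an assumption), and `collapse_whitespace` is a hypothetical stand-in for the whitespace-collapsing version of processFunc.

```python
def undocumented():
    pass  # no docstring, so __doc__ is None

def collapse_whitespace(s):
    # Stand-in for processFunc: assumes s has a split method.
    return " ".join(s.split())

doc = undocumented.__doc__

try:
    collapse_whitespace(doc)
    crashed = False
except AttributeError:
    # None has no split method.
    crashed = True

safe = collapse_whitespace(str(doc))
print(crashed, repr(safe))
```

The str call turns None into the string 'None', which is not pretty, but it is a string, and that is all the processing function asks for.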

        Stepping back even further, you see that you're using string formatting again to concatenate the return value of processFunc with the return value of method's ljust method. This is a new string method that you haven't seen before. -

        Example 4.24. Introducing ljust

        >>> s = 'buildConnectionString'
        ->>> s.ljust(30) 
        +

        Example 4.24. Introducing ljust

        >>> s = 'buildConnectionString'
        +>>> s.ljust(30) 
         'buildConnectionString         '
        ->>> s.ljust(20) 
        +>>> s.ljust(20) 
         'buildConnectionString'
        1. ljust pads the string with spaces to the given length. This is what the info function uses to make two columns of output and line up all the docstrings in the second column.
        2. If the given length is smaller than the length of the string, ljust will simply return the string unchanged. It never truncates the string.
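Here is how that padding lines up a column, as a minimal sketch in Python 3 syntax (an assumption) with a made-up list of names.

```python
names = ["append", "count", "buildConnectionString"]

# Pad every name to 10 characters. Names longer than 10 are
# returned unchanged, never truncated.
padded = [name.ljust(10) for name in names]

for name in padded:
    print(repr(name), len(name))
```

The long name pokes out past the column, which is the price of ljust never truncating.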

          You're almost finished. Given the padded method name from the ljust method and the (possibly collapsed) docstring from the call to processFunc, you concatenate the two and get a single string. Since you're mapping methodList, you end up with a list of strings. Using the join method of the string "\n", you join this list into a single string, with each element of the list on a separate line, and print the result. -

          Example 4.25. Printing a List

          >>> li = ['a', 'b', 'c']
          ->>> print "\n".join(li) 
          +

          Example 4.25. Printing a List

          >>> li = ['a', 'b', 'c']
          +>>> print "\n".join(li) 
           a
           b
           c
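Putting the pieces together, the whole pipeline fits in a few lines. The following is a rough sketch in Python 3 syntax, not the apihelper.py source: `summarize` and `Sample` are hypothetical names, and skipping names that start with an underscore is an assumption made to keep the output short.

```python
def summarize(obj, spacing=10, collapse=True):
    """Return one line per public method: padded name, then docstring."""
    process = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
    method_names = [name for name in dir(obj)
                    if not name.startswith("_")
                    and callable(getattr(obj, name))]
    lines = ["%s %s" % (name.ljust(spacing),
                        process(str(getattr(obj, name).__doc__)))
             for name in method_names]
    return "\n".join(lines)

class Sample:
    def ping(self):
        """Send a
        ping."""

    def pong(self):
        pass  # no docstring

report = summarize(Sample())
print(report)
```

Every step from this section shows up once: getattr by name, str around the docstring, the and-or choice of processing function, ljust for the column, and "\n".join to produce a single string.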
          @@ -1282,9 +1282,9 @@ def info(object, spacing=10, collapse=1): if __name__ == "__main__": print info.__doc__
          -

          Here is the output of apihelper.py:

          >>> from apihelper import info
          ->>> li = []
          ->>> info(li)
          +

          Here is the output of apihelper.py:

          >>> from apihelper import info
          +>>> li = []
          +>>> info(li)
           append     L.append(object) -- append object to end
           count      L.count(value) -> integer -- return number of occurrences of value
           extend     L.extend(list) -- extend list by appending list elements
          @@ -1461,15 +1461,15 @@ can import individual items or use from module import *
           
          Notefrom module import * in Python is like import module.* in Java; import module in Python is like import module in Java. -

          Example 5.2. import module vs. from module import

          >>> import types
          ->>> types.FunctionType             
          +

          Example 5.2. import module vs. from module import

          >>> import types
          +>>> types.FunctionType             
           <type 'function'>
          ->>> FunctionType 
          +>>> FunctionType 
           Traceback (innermost last):
             File "<interactive input>", line 1, in ?
           NameError: There is no variable named 'FunctionType'
          ->>> from types import FunctionType 
          ->>> FunctionType 
          +>>> from types import FunctionType 
          +>>> FunctionType 
           <type 'function'>
          1. The types module contains no methods; it just has attributes for each Python object type. Note that the attribute, FunctionType, must be qualified by the module name, types. @@ -1586,13 +1586,13 @@ class FileInfo(UserDict):
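The difference between the two import styles comes down to which names land in your namespace. A small sketch in Python 3 syntax (an assumption), where the flag `available_before` is a hypothetical name used to record what happened:

```python
import types

try:
    FunctionType            # not imported into this namespace yet
    available_before = True
except NameError:
    available_before = False

from types import FunctionType   # now the bare name is available

def f():
    pass

print(available_before)
print(FunctionType is types.FunctionType)
print(isinstance(f, FunctionType))
```

Both names refer to the same object; the import style only changes how you spell it.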

            5.4. Instantiating Classes

            Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__ method defines. The return value will be the newly created object. -

            Example 5.7. Creating a FileInfo Instance

            >>> import fileinfo
            ->>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3") 
            ->>> f.__class__    
            +

            Example 5.7. Creating a FileInfo Instance

            >>> import fileinfo
            +>>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3") 
            +>>> f.__class__    
             <class fileinfo.FileInfo at 010EC204>
            ->>> f.__doc__      
            +>>> f.__doc__      
             'store file metadata'
            ->>> f              
            +>>> f              
             {'name': '/music/_singles/kairo.mp3'}
            1. You are creating an instance of the FileInfo class (defined in the fileinfo module) and assigning the newly created instance to the variable f. You are passing one parameter, /music/_singles/kairo.mp3, which will end up as the filename argument in FileInfo's __init__ method. @@ -1606,11 +1606,11 @@ class FileInfo(UserDict):

              5.4.1. Garbage Collection

              If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances, because they are freed automatically when the variables assigned to them go out of scope. Memory leaks are rare in Python. -

              Example 5.8. Trying to Implement a Memory Leak

              >>> def leakmem():
              -...    f = fileinfo.FileInfo('/music/_singles/kairo.mp3') 
              -...    
              ->>> for i in range(100):
              -...    leakmem()      
              +

              Example 5.8. Trying to Implement a Memory Leak

              >>> def leakmem():
              +...    f = fileinfo.FileInfo('/music/_singles/kairo.mp3') 
              +...    
              +>>> for i in range(100):
              +...    leakmem()      
              1. Every time the leakmem function is called, you are creating an instance of FileInfo and assigning it to the variable f, which is a local variable within the function. Then the function ends without ever freeing f, so you would expect a memory leak, but you would be wrong. When the function ends, the local variable f goes out of scope. At this point, there are no longer any references to the newly created instance of FileInfo (since you never assigned it to anything other than f), so Python destroys the instance for us.
              2. No matter how many times you call the leakmem function, it will never leak memory, because every time, Python will destroy the newly created FileInfo instance before returning from leakmem. @@ -1716,12 +1716,12 @@ there are a lot of things you can do with dictionaries besides call methods on t provide a way to map non-method-calling syntax into method calls.
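You don't have to take this on faith. A weak reference lets you observe an object without keeping it alive, so you can check afterward whether the instance is gone. This is a sketch in Python 3 syntax under several assumptions: the `FileInfo` class here is a bare stand-in rather than the one in fileinfo.py, the `observed` list is a hypothetical helper, and the explicit gc.collect() call is there only so the check does not depend on CPython's reference counting.

```python
import gc
import weakref

class FileInfo:
    """Stand-in for fileinfo.FileInfo, for illustration only."""
    def __init__(self, filename=None):
        self.name = filename

observed = []  # weak references; these do not keep instances alive

def leakmem():
    f = FileInfo('/music/_singles/kairo.mp3')
    observed.append(weakref.ref(f))
    # f goes out of scope when the function returns

for i in range(100):
    leakmem()

gc.collect()
alive = [ref for ref in observed if ref() is not None]
print(len(observed), len(alive))
```

A dead weak reference returns None when called, so an empty `alive` list means none of the 100 instances survived.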

                5.6.1. Getting and Setting Items

                Example 5.12. The __getitem__ Special Method

                
                -    def __getitem__(self, key): return self.data[key]
                >>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3")
                ->>> f
                +    def __getitem__(self, key): return self.data[key]
                >>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3")
                +>>> f
                 {'name':'/music/_singles/kairo.mp3'}
                ->>> f.__getitem__("name") 
                +>>> f.__getitem__("name") 
                 '/music/_singles/kairo.mp3'
                ->>> f["name"]             
                +>>> f["name"]             
                 '/music/_singles/kairo.mp3'
                1. The __getitem__ special method looks simple enough. Like the normal methods clear, keys, and values, it just redirects to the dictionary to return its value. But how does it get called? Well, you can call __getitem__ directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works. The right way @@ -1729,13 +1729,13 @@ provide a way to map non-method-calling syntax into method calls.
                2. This looks just like the syntax you would use to get a dictionary value, and in fact it returns the value you would expect. But here's the missing link: under the covers, Python has converted this syntax to the method call f.__getitem__("name"). That's why __getitem__ is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax.

                  Of course, Python has a __setitem__ special method to go along with __getitem__, as shown in the next example.

                  Example 5.13. The __setitem__ Special Method

                  
                  -    def __setitem__(self, key, item): self.data[key] = item
                  >>> f
                  +    def __setitem__(self, key, item): self.data[key] = item
                  >>> f
                   {'name':'/music/_singles/kairo.mp3'}
                  ->>> f.__setitem__("genre", 31) 
                  ->>> f
                  +>>> f.__setitem__("genre", 31) 
                  +>>> f
                   {'name':'/music/_singles/kairo.mp3', 'genre':31}
                  ->>> f["genre"] = 32            
                  ->>> f
                  +>>> f["genre"] = 32            
                  +>>> f
                   {'name':'/music/_singles/kairo.mp3', 'genre':32}
                  1. Like the __getitem__ method, __setitem__ simply redirects to the real dictionary self.data to do its work. And like __getitem__, you wouldn't ordinarily call it directly like this; Python calls __setitem__ for you when you use the right syntax. @@ -1761,17 +1761,17 @@ provide a way to map non-method-calling syntax into method calls.
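To try both special methods without the rest of fileinfo.py, a stripped-down wrapper class is enough. This is a minimal sketch in Python 3 syntax (an assumption); the class below is a hypothetical simplification that stores everything in self.data and does not inherit from UserDict.

```python
class FileInfo:
    """Simplified stand-in: a thin wrapper around a dictionary."""
    def __init__(self, filename=None):
        self.data = {"name": filename}

    def __getitem__(self, key):
        return self.data[key]

    def __setitem__(self, key, item):
        self.data[key] = item

f = FileInfo("/music/_singles/kairo.mp3")
f["genre"] = 32                 # Python calls f.__setitem__("genre", 32)
print(f["name"])                # Python calls f.__getitem__("name")
print(f.data)
```

The square-bracket syntax never touches self.data directly; it always goes through the two methods, which is what lets a subclass intercept the assignment.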
                    NoteWhen accessing data attributes within a class, you need to qualify the attribute name: self.attribute. When calling other methods within a class, you need to qualify the method name: self.method. -

                    Example 5.15. Setting an MP3FileInfo's name

                    >>> import fileinfo
                    ->>> mp3file = fileinfo.MP3FileInfo() 
                    ->>> mp3file
                    +

                    Example 5.15. Setting an MP3FileInfo's name

                    >>> import fileinfo
                    +>>> mp3file = fileinfo.MP3FileInfo() 
                    +>>> mp3file
                     {'name':None}
                    ->>> mp3file["name"] = "/music/_singles/kairo.mp3"      
                    ->>> mp3file
                    +>>> mp3file["name"] = "/music/_singles/kairo.mp3"      
                    +>>> mp3file
                     {'album': 'Rave Mix', 'artist': '***DJ MARY-JANE***', 'genre': 31,
                     'title': 'KAIRO****THE BEST GOA', 'name': '/music/_singles/kairo.mp3',
                     'year': '2000', 'comment': 'http://mp3.com/DJMARYJANE'}
                    ->>> mp3file["name"] = "/music/_singles/sidewinder.mp3" 
                    ->>> mp3file
                    +>>> mp3file["name"] = "/music/_singles/sidewinder.mp3" 
                    +>>> mp3file
                     {'album': '', 'artist': 'The Cynic Project', 'genre': 18, 'title': 'Sidewinder', 
                     'name': '/music/_singles/sidewinder.mp3', 'year': '2000', 
                     'comment': 'http://mp3.com/cynicproject'}
                    @@ -1832,18 +1832,18 @@ class MP3FileInfo(FileInfo): "album" : ( 63, 93, stripnulls), "year" : ( 93, 97, stripnulls), "comment" : ( 97, 126, stripnulls), -"genre" : (127, 128, ord)}
                    >>> import fileinfo
                    ->>> fileinfo.MP3FileInfo            
                    +"genre"   : (127, 128, ord)}
                    >>> import fileinfo
                    +>>> fileinfo.MP3FileInfo            
                     <class fileinfo.MP3FileInfo at 01257FDC>
                    ->>> fileinfo.MP3FileInfo.tagDataMap 
                    +>>> fileinfo.MP3FileInfo.tagDataMap 
                     {'title': (3, 33, <function stripnulls at 0260C8D4>), 
                     'genre': (127, 128, <built-in function ord>), 
                     'artist': (33, 63, <function stripnulls at 0260C8D4>), 
                     'year': (93, 97, <function stripnulls at 0260C8D4>), 
                     'comment': (97, 126, <function stripnulls at 0260C8D4>), 
                     'album': (63, 93, <function stripnulls at 0260C8D4>)}
                    ->>> m = fileinfo.MP3FileInfo()      
                    ->>> m.tagDataMap
                    +>>> m = fileinfo.MP3FileInfo()      
                    +>>> m.tagDataMap
                     {'title': (3, 33, <function stripnulls at 0260C8D4>), 
                     'genre': (127, 128, <built-in function ord>), 
                     'artist': (33, 63, <function stripnulls at 0260C8D4>), 
                    @@ -1861,26 +1861,26 @@ class MP3FileInfo(FileInfo):
                     
                    NoteThere are no constants in Python. Everything can be changed if you try hard enough. This fits with one of the core principles of Python: bad behavior should be discouraged but not banned. If you really want to change the value of None, you can do it, but don't come running to me when your code is impossible to debug. -

                    Example 5.18. Modifying Class Attributes

                    >>> class counter:
                    -...    count = 0   
                    -...    def __init__(self):
                    -...        self.__class__.count += 1 
                    -...    
                    ->>> counter
                    +

                    Example 5.18. Modifying Class Attributes

                    >>> class counter:
                    +...    count = 0   
                    +...    def __init__(self):
                    +...        self.__class__.count += 1 
                    +...    
                    +>>> counter
                     <class __main__.counter at 010EAECC>
                    ->>> counter.count   
                    +>>> counter.count   
                     0
                    ->>> c = counter()
                    ->>> c.count         
                    +>>> c = counter()
                    +>>> c.count         
                     1
                    ->>> counter.count
                    +>>> counter.count
                     1
                    ->>> d = counter()   
                    ->>> d.count
                    +>>> d = counter()   
                    +>>> d.count
                     2
                    ->>> c.count
                    +>>> c.count
                     2
                    ->>> counter.count
                    +>>> counter.count
                     2
                    1. count is a class attribute of the counter class. @@ -1907,9 +1907,9 @@ call it directly (even from outside the fileinfo module) if you had
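The same experiment as a script, sketched in Python 3 syntax (an assumption). The class is renamed `Counter` here purely as a matter of style, and `before` is a hypothetical variable that captures the value prior to any instantiation.

```python
class Counter:
    count = 0                        # class attribute, shared by all instances

    def __init__(self):
        self.__class__.count += 1    # modify the class, not the instance

before = Counter.count
c = Counter()
d = Counter()

print(before, c.count, d.count, Counter.count)
```

Neither c nor d has a count of its own; looking up c.count falls through to the class, which is why both report the same number.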
                    NoteIn Python, all special methods (like __setitem__) and built-in attributes (like __doc__) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and attributes this way, because it will only confuse you (and others) later. -

                    Example 5.19. Trying to Call a Private Method

                    >>> import fileinfo
                    ->>> m = fileinfo.MP3FileInfo()
                    ->>> m.__parse("/music/_singles/kairo.mp3") 
                    +

                    Example 5.19. Trying to Call a Private Method

                    >>> import fileinfo
                    +>>> m = fileinfo.MP3FileInfo()
                    +>>> m.__parse("/music/_singles/kairo.mp3") 
                     Traceback (innermost last):
                       File "<interactive input>", line 1, in ?
                     AttributeError: 'MP3FileInfo' instance has no attribute '__parse'
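You can reproduce the error with a toy class. The sketch below uses Python 3 syntax (an assumption), and this `MP3FileInfo` is a hypothetical stand-in with a made-up `load` method. It also shows the mechanism behind the privacy: Python renames a double-underscore method to `_ClassName__method`, so the mangled name is still reachable if you insist.

```python
class MP3FileInfo:
    """Stand-in class, for illustration only."""
    def __parse(self, filename):
        return "parsed " + filename

    def load(self, filename):
        # Inside the class, the private name works as written.
        return self.__parse(filename)

m = MP3FileInfo()

try:
    m.__parse("/music/_singles/kairo.mp3")
    blocked = False
except AttributeError:
    blocked = True

print(blocked)
print(m.load("kairo.mp3"))
print(m._MP3FileInfo__parse("kairo.mp3"))   # works, but please don't
```

Private methods are private by convention and by renaming, not by enforcement.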
                    @@ -1969,15 +1969,15 @@ way back to the default behavior built in to Python, which is to spit out some d many times, an exception is something you can anticipate. If you're opening a file, it might not exist. If you're connecting to a database, it might be unavailable, or you might not have the correct security credentials to access it. If you know a line of code may raise an exception, you should handle the exception using a try...except block. -

                    Example 6.1. Opening a Non-Existent File

                    >>> fsock = open("/notthere", "r")      
                    +

                    Example 6.1. Opening a Non-Existent File

                    >>> fsock = open("/notthere", "r")      
                     Traceback (innermost last):
                       File "<interactive input>", line 1, in ?
                     IOError: [Errno 2] No such file or directory: '/notthere'
                    ->>> try:
                    -...    fsock = open("/notthere")       
                    -... except IOError:   
                    -...    print "The file does not exist, exiting gracefully"
                    -... print "This line will always print" 
                    +>>> try:
                    +...    fsock = open("/notthere")       
                    +... except IOError:   
                    +...    print "The file does not exist, exiting gracefully"
                    +... print "This line will always print" 
                     The file does not exist, exiting gracefully
                     This line will always print
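In a script, the same pattern usually lives inside a function. Here is a minimal sketch in Python 3 syntax (an assumption; in Python 3, IOError is an alias for OSError). The function name `read_or_none` is hypothetical, and the missing path is built inside a temporary directory so the example does not depend on what exists on your disk.

```python
import os
import tempfile

def read_or_none(path):
    """Return the file's contents, or None if it cannot be opened."""
    try:
        fsock = open(path)
    except IOError:
        return None
    else:
        # Only runs if open succeeded.
        with fsock:
            return fsock.read()

with tempfile.TemporaryDirectory() as tmpdir:
    missing = os.path.join(tmpdir, "notthere")
    result = read_or_none(missing)

print(result)
```

The else clause keeps the code that uses the file out of the try block, so an unrelated error while reading is not mistaken for a failure to open.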
                      @@ -2041,12 +2041,12 @@ exceptions, errors occur immediately, and you can handle them in a standard way

                      6.2. Working with File Objects

                      Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file. -

                      Example 6.3. Opening a File

                      >>> f = open("/music/_singles/kairo.mp3", "rb") 
                      ->>> f       
                      +

                      Example 6.3. Opening a File

                      >>> f = open("/music/_singles/kairo.mp3", "rb") 
                      +>>> f       
                       <open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                      ->>> f.mode  
                      +>>> f.mode  
                       'rb'
                      ->>> f.name  
                      +>>> f.name  
                       '/music/_singles/kairo.mp3'
                      1. The open function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, @@ -2058,18 +2058,18 @@ exceptions, errors occur immediately, and you can handle them in a standard way

                        6.2.1. Reading Files

                        After you open a file, the first thing you'll want to do is read from it, as shown in the next example.

                        Example 6.4. Reading a File

                        ->>> f
                        +>>> f
                         <open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                        ->>> f.tell()              
                        +>>> f.tell()              
                         0
                        ->>> f.seek(-128, 2)       
                        ->>> f.tell()              
                        +>>> f.seek(-128, 2)       
                        +>>> f.tell()              
                         7542909
                        ->>> tagData = f.read(128) 
                        ->>> tagData
                        +>>> tagData = f.read(128) 
                        +>>> tagData
                         'TAGKAIRO****THE BEST GOA         ***DJ MARY-JANE***            
                         Rave Mix    2000http://mp3.com/DJMARYJANE     \037'
                        ->>> f.tell()              
                        +>>> f.tell()              
                         7543037
                        1. A file object maintains state about the file it has open. The tell method of a file object tells you your current position in the open file. Since you haven't done anything with this file @@ -2086,28 +2086,28 @@ Rave Mix 2000http://mp3.com/DJMARYJANE \037'
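The position tracking described above can be sketched on a throwaway file. This is a minimal illustration in Python 3 syntax, not the book's own session: the temporary file, its 1000 filler bytes, and the fake 128-byte tag are all assumptions made so the sketch runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for an MP3 file: 1000 filler bytes plus a 128-byte tag.
payload = b"\x00" * 1000 + b"TAG" + b"x" * 125

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

f = open(path, "rb")
start = f.tell()              # nothing read yet, so the position is 0
f.seek(-128, os.SEEK_END)     # os.SEEK_END is 2: offset is relative to the end
before_read = f.tell()        # 128 bytes before the end of the file
tag_data = f.read(128)        # reading advances the position
after_read = f.tell()         # now at the end of the file
f.close()
os.remove(path)
```

The position moves by exactly the number of bytes read, which is why the second tell is 128 higher than the first one after the seek.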

                          Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.

                          Example 6.5. Closing a File

                          ->>> f
                          +>>> f
                           <open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                          ->>> f.closed       
                          +>>> f.closed       
                           False
                          ->>> f.close()      
                          ->>> f
                          +>>> f.close()      
                          +>>> f
                           <closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                          ->>> f.closed       
                          +>>> f.closed       
                           True
                          ->>> f.seek(0)      
                          +>>> f.seek(0)      
                           Traceback (innermost last):
                             File "<interactive input>", line 1, in ?
                           ValueError: I/O operation on closed file
                          ->>> f.tell()
                          +>>> f.tell()
                           Traceback (innermost last):
                             File "<interactive input>", line 1, in ?
                           ValueError: I/O operation on closed file
                          ->>> f.read()
                          +>>> f.read()
                           Traceback (innermost last):
                             File "<interactive input>", line 1, in ?
                           ValueError: I/O operation on closed file
                          ->>> f.close()      
                          +>>> f.close()
                      1. The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False).
                      2. To close a file, call the close method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) @@ -2151,15 +2151,15 @@ ValueError: I/O operation on closed file "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open it and start writing.

                        Example 6.7. Writing to Files

                        ->>> logfile = open('test.log', 'w') 
                        ->>> logfile.write('test succeeded') 
                        ->>> logfile.close()
                        ->>> print file('test.log').read()   
                        +>>> logfile = open('test.log', 'w') 
                        +>>> logfile.write('test succeeded') 
                        +>>> logfile.close()
                        +>>> print file('test.log').read()   
                         test succeeded
                        ->>> logfile = open('test.log', 'a') 
                        ->>> logfile.write('line 2')
                        ->>> logfile.close()
                        ->>> print file('test.log').read()   
                        +>>> logfile = open('test.log', 'a') 
                        +>>> logfile.write('line 2')
                        +>>> logfile.close()
                        +>>> print file('test.log').read()   
                         test succeededline 2
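The same write-then-append sequence can be sketched in Python 3 syntax. The temporary directory is an assumption so the sketch does not touch your working directory, and the with statement is a swapped-in convenience that closes each file automatically; it is not what the session above uses.

```python
import os
import tempfile

# Hypothetical location for the log file.
log_path = os.path.join(tempfile.mkdtemp(), "test.log")

# Mode "w" creates the file if needed and overwrites any existing content.
with open(log_path, "w") as logfile:
    logfile.write("test succeeded")
with open(log_path) as logfile:
    first = logfile.read()

# Mode "a" appends to the end; no newline is added for you.
with open(log_path, "a") as logfile:
    logfile.write("line 2")
with open(log_path) as logfile:
    second = logfile.read()
```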
                         
                          @@ -2187,13 +2187,13 @@ test succeededline 2

                          Like most other languages, Python has for loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often.

                          Most other languages don't have a powerful list datatype like Python, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters or other iterable entities. But in Python, a for loop simply iterates over a list, the same way list comprehensions work. -

                          Example 6.8. Introducing the for Loop

                          >>> li = ['a', 'b', 'e']
                          ->>> for s in li:         
                          -...    print s          
                          +

                          Example 6.8. Introducing the for Loop

                          >>> li = ['a', 'b', 'e']
                          +>>> for s in li:         
                          +...    print s          
                           a
                           b
                           e
                          ->>> print "\n".join(li)  
                          +>>> print "\n".join(li)  
                           a
                           b
                           e
                          @@ -2203,16 +2203,16 @@ e
                        1. This is the reason you haven't seen the for loop yet: you haven't needed it yet. It's amazing how often you use for loops in other languages when all you really want is a join or a list comprehension.
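To make the comparison concrete, here is a minimal sketch in Python 3 syntax that builds the same string three ways. The variable names are illustrative only.

```python
li = ['a', 'b', 'e']

# The explicit loop, as you might write it in another language.
looped = ""
for s in li:
    looped += s + "\n"
looped = looped.rstrip("\n")

# The join does the same work in one expression.
joined = "\n".join(li)

# A list comprehension covers the "transform every element" case.
upper = [s.upper() for s in li]
```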

                          Doing a “normal” (by Visual Basic standards) counter for loop is also simple.

                          Example 6.9. Simple Counters

                          ->>> for i in range(5):             
                          -...    print i
                          +>>> for i in range(5):             
                          +...    print i
                           0
                           1
                           2
                           3
                           4
                          ->>> li = ['a', 'b', 'c', 'd', 'e']
                          ->>> for i in range(len(li)):       
                          -...    print li[i]
                          +>>> li = ['a', 'b', 'c', 'd', 'e']
                          +>>> for i in range(len(li)):       
                          +...    print li[i]
                           a
                           b
                           c
                          @@ -2225,17 +2225,17 @@ e
                           
                        2. Don't ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example.
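A short sketch of the difference, in Python 3 syntax. The enumerate call is an addition not shown in the text above; it is one common way to get a counter when you genuinely need the index as well as the item.

```python
li = ['a', 'b', 'c', 'd', 'e']

# Index-based access: works, but it is the style the callout warns against.
by_index = []
for i in range(len(li)):
    by_index.append(li[i])

# Direct iteration: same result, no counter to manage.
direct = []
for item in li:
    direct.append(item)

# If the position is actually needed, enumerate supplies it alongside the item.
numbered = []
for i, item in enumerate(li):
    numbered.append("%d: %s" % (i, item))
```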

                          for loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a for loop to iterate through a dictionary.

                          Example 6.10. Iterating Through a Dictionary

                          ->>> import os
                          ->>> for k, v in os.environ.items():       
                          -...    print "%s=%s" % (k, v)
                          +>>> import os
                          +>>> for k, v in os.environ.items():       
                          +...    print "%s=%s" % (k, v)
                           USERPROFILE=C:\Documents and Settings\mpilgrim
                           OS=Windows_NT
                           COMPUTERNAME=MPILGRIM
                           USERNAME=mpilgrim
                           
                           [...snip...]
                          ->>> print "\n".join(["%s=%s" % (k, v)
                          -...    for k, v in os.environ.items()]) 
                          +>>> print "\n".join(["%s=%s" % (k, v)
                          +...    for k, v in os.environ.items()]) 
                           USERPROFILE=C:\Documents and Settings\mpilgrim
                           OS=Windows_NT
                           COMPUTERNAME=MPILGRIM
                          @@ -2271,8 +2271,8 @@ USERNAME=mpilgrim
                           
                        3. Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy. You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self. After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like.
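The slice-and-post-process loop can be sketched as follows, in Python 3 syntax. Everything here is an assumption for illustration: the map contents follow the standard ID3v1 field offsets, strip_nulls is a hypothetical post-processing function, the sample tag bytes are made up, and a plain dictionary stands in for the pseudo-dictionary self.

```python
def strip_nulls(data):
    """Hypothetical parseFunc: decode bytes, drop NUL padding and whitespace."""
    return data.decode("latin-1").replace("\x00", " ").strip()

# Hypothetical map: tag name -> (start, end, parseFunc).
TAG_DATA_MAP = {
    "title": (3, 33, strip_nulls),
    "artist": (33, 63, strip_nulls),
    "album": (63, 93, strip_nulls),
    "year": (93, 97, strip_nulls),
}

def parse_tag(tagdata):
    """Return a dict of tag values, or an empty dict if there is no tag."""
    info = {}
    if tagdata[:3] == b"TAG":
        for tag, (start, end, parse_func) in TAG_DATA_MAP.items():
            # Slice out this field, post-process it, store it under its name.
            info[tag] = parse_func(tagdata[start:end])
    return info

# Made-up 128-byte tag: marker, three 30-byte fields, a year, then padding.
tagdata = (b"TAG"
           + b"KAIRO".ljust(30, b"\x00")
           + b"DJ MARY-JANE".ljust(30, b"\x00")
           + b"Rave Mix".ljust(30, b"\x00")
           + b"2000"
           + b"\x00" * 31)
```

Keeping the offsets in a map means that supporting another field is a one-line change to the data rather than another block of slicing code.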

                          6.4. Using sys.modules

                          Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules. -

                          Example 6.12. Introducing sys.modules

                          >>> import sys        
                          ->>> print '\n'.join(sys.modules.keys()) 
                          +

                          Example 6.12. Introducing sys.modules

                          >>> import sys        
                          +>>> print '\n'.join(sys.modules.keys()) 
                           win32api
                           os.path
                           os
                          @@ -2290,8 +2290,8 @@ stat
                        4. The sys module contains system-level information, such as the version of Python you're running (sys.version or sys.version_info), and system-level options such as the maximum allowed recursion depth (sys.getrecursionlimit() and sys.setrecursionlimit()).
                        5. sys.modules is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported. Python preloads some modules on startup, and if you're using a Python IDE, sys.modules contains all the modules imported by all the programs you've run within the IDE.

                          This example demonstrates how to use sys.modules. -

                          Example 6.13. Using sys.modules

                          >>> import fileinfo         
                          ->>> print '\n'.join(sys.modules.keys())
                          +

                          Example 6.13. Using sys.modules

                          >>> import fileinfo         
                          +>>> print '\n'.join(sys.modules.keys())
                           win32api
                           os.path
                           os
                          @@ -2306,18 +2306,18 @@ site
                           signal
                           UserDict
                           stat
                          ->>> fileinfo
                          +>>> fileinfo
                           <module 'fileinfo' from 'fileinfo.pyc'>
                          ->>> sys.modules["fileinfo"] 
                          +>>> sys.modules["fileinfo"] 
                           <module 'fileinfo' from 'fileinfo.pyc'>
                          1. As new modules are imported, they are added to sys.modules. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in sys.modules, so importing the second time is simply a dictionary lookup.
                          2. Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the sys.modules dictionary.
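Both points can be checked with a few lines. This sketch uses Python 3 syntax and the standard json module as a stand-in for fileinfo, which is an assumption made so the code runs without the book's sample files.

```python
import sys
import json

first = json

# A repeated import finds the cached entry in sys.modules and rebinds the name.
import json

# Look the module up by its name, given as a string.
by_name = sys.modules["json"]
same_object = by_name is first
```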

                            The next example shows how to use the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined. -

                            Example 6.14. The __module__ Class Attribute

                            >>> from fileinfo import MP3FileInfo
                            ->>> MP3FileInfo.__module__              
                            +

                            Example 6.14. The __module__ Class Attribute

                            >>> from fileinfo import MP3FileInfo
                            +>>> MP3FileInfo.__module__              
                             'fileinfo'
                            ->>> sys.modules[MP3FileInfo.__module__] 
                            +>>> sys.modules[MP3FileInfo.__module__] 
                             <module 'fileinfo' from 'fileinfo.pyc'>
                            1. Every Python class has a built-in class attribute __module__, which is the name of the module in which the class is defined. @@ -2346,14 +2346,14 @@ stat
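Here is the same lookup as a runnable sketch in Python 3 syntax. The standard library class Fraction stands in for MP3FileInfo; that substitution is an assumption so the code does not depend on fileinfo being importable.

```python
import sys
from fractions import Fraction

# The class records the name of the module that defines it.
module_name = Fraction.__module__

# That name is a key in sys.modules, which yields the module object itself.
module = sys.modules[module_name]

# Going back through the module reaches the very same class object.
same_class = getattr(module, "Fraction") is Fraction
```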

                              The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory.

                              Example 6.16. Constructing Pathnames

                              ->>> import os
                              ->>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")  
                              +>>> import os
                              +>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")  
                               'c:\\music\\ap\\mahadeva.mp3'
                              ->>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   
                              +>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   
                               'c:\\music\\ap\\mahadeva.mp3'
                              ->>> os.path.expanduser("~")       
                              +>>> os.path.expanduser("~")       
                               'c:\\Documents and Settings\\mpilgrim\\My Documents'
                              ->>> os.path.join(os.path.expanduser("~"), "Python") 
                              +>>> os.path.join(os.path.expanduser("~"), "Python") 
                               'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
                              1. os.path is a reference to a module -- which module depends on your platform. Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module. @@ -2364,17 +2364,17 @@ stat
                              2. expanduser will expand a pathname that uses ~ to represent the current user's home directory. This works on any platform where users have a home directory, like Windows, UNIX, and Mac OS X; it has no effect on Mac OS.
                              3. Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory. -
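A platform-neutral sketch of the same calls, in Python 3 syntax. The directory and file names are illustrative, and the results are compared using os.sep rather than hard-coded backslashes so the sketch behaves the same on Windows and UNIX.

```python
import os

# join adds a separator only when the first part does not already end in one.
with_sep = os.path.join("music" + os.sep, "mahadeva.mp3")
without_sep = os.path.join("music", "mahadeva.mp3")

# expanduser replaces ~ with the current user's home directory.
home = os.path.expanduser("~")

# Combine the two to build a path under the home directory.
python_dir = os.path.join(home, "Python")
```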

                                Example 6.17. Splitting Pathnames

                                >>> os.path.split("c:\\music\\ap\\mahadeva.mp3")      
                                +

                                Example 6.17. Splitting Pathnames

                                >>> os.path.split("c:\\music\\ap\\mahadeva.mp3")      
                                 ('c:\\music\\ap', 'mahadeva.mp3')
                                ->>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") 
                                ->>> filepath      
                                +>>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") 
                                +>>> filepath      
                                 'c:\\music\\ap'
                                ->>> filename      
                                +>>> filename      
                                 'mahadeva.mp3'
                                ->>> (shortname, extension) = os.path.splitext(filename)                 
                                ->>> shortname
                                +>>> (shortname, extension) = os.path.splitext(filename)                 
                                +>>> shortname
                                 'mahadeva'
                                ->>> extension
                                +>>> extension
                                 '.mp3'
                                1. The split function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use @@ -2384,23 +2384,23 @@ stat
                                2. The second variable, filename, receives the value of the second element of the tuple returned from split, the filename.
                                3. os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique to assign each of them to separate variables. -
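The two splitting functions can be sketched without Windows-specific paths, in Python 3 syntax. The path is built with os.path.join first, which is an assumption made so the separator matches whatever platform runs the code.

```python
import os

full_path = os.path.join("music", "ap", "mahadeva.mp3")

# split returns (directory, filename); tuple assignment unpacks both at once.
(filepath, filename) = os.path.split(full_path)

# splitext returns (name without extension, extension including the dot).
(shortname, extension) = os.path.splitext(filename)
```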

                                  Example 6.18. Listing Directories

                                  >>> os.listdir("c:\\music\\_singles\\")              
                                  +

                                  Example 6.18. Listing Directories

                                  >>> os.listdir("c:\\music\\_singles\\")              
                                   ['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
                                   'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3', 
                                   'spinning.mp3']
                                  ->>> dirname = "c:\\"
                                  ->>> os.listdir(dirname)            
                                  +>>> dirname = "c:\\"
                                  +>>> os.listdir(dirname)            
                                   ['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
                                   'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
                                   'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
                                   'Program Files', 'Python20', 'RECYCLER',
                                   'System Volume Information', 'TEMP', 'WINNT']
                                  ->>> [f for f in os.listdir(dirname)
                                  -...    if os.path.isfile(os.path.join(dirname, f))] 
                                  +>>> [f for f in os.listdir(dirname)
                                  +...    if os.path.isfile(os.path.join(dirname, f))] 
                                   ['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
                                   'NTDETECT.COM', 'ntldr', 'pagefile.sys']
                                  ->>> [f for f in os.listdir(dirname)
                                  -...    if os.path.isdir(os.path.join(dirname, f))]  
                                  +>>> [f for f in os.listdir(dirname)
                                  +...    if os.path.isdir(os.path.join(dirname, f))]  
                                   ['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
                                   'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
                                   'System Volume Information', 'TEMP', 'WINNT']
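The same filtering can be tried on a scratch directory. This is a sketch in Python 3 syntax; the temporary directory and the three entries created in it are assumptions, and the results are sorted only because os.listdir makes no promise about order.

```python
import os
import tempfile

# Hypothetical directory with one subdirectory and two empty files.
dirname = tempfile.mkdtemp()
os.mkdir(os.path.join(dirname, "Music"))
for name in ("boot.ini", "kairo.mp3"):
    open(os.path.join(dirname, name), "w").close()

# listdir returns bare names, files and directories mixed together.
entries = sorted(os.listdir(dirname))

# isfile and isdir need a full path, so join each name back onto dirname.
files = sorted(f for f in os.listdir(dirname)
               if os.path.isfile(os.path.join(dirname, f)))
dirs = sorted(f for f in os.listdir(dirname)
              if os.path.isdir(os.path.join(dirname, f)))
```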
                                  @@ -2431,22 +2431,22 @@ def listDirectory(directory, fileExtList):

                                  There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line.

                                  Example 6.20. Listing Directories with glob

                                  ->>> os.listdir("c:\\music\\_singles\\")               
                                  +>>> os.listdir("c:\\music\\_singles\\")               
                                   ['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
                                   'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
                                   'spinning.mp3']
                                  ->>> import glob
                                  ->>> glob.glob('c:\\music\\_singles\\*.mp3')           
                                  +>>> import glob
                                  +>>> glob.glob('c:\\music\\_singles\\*.mp3')           
                                   ['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
                                   'c:\\music\\_singles\\hellraiser.mp3',
                                   'c:\\music\\_singles\\kairo.mp3',
                                   'c:\\music\\_singles\\long_way_home1.mp3',
                                   'c:\\music\\_singles\\sidewinder.mp3',
                                   'c:\\music\\_singles\\spinning.mp3']
                                  ->>> glob.glob('c:\\music\\_singles\\s*.mp3')          
                                  +>>> glob.glob('c:\\music\\_singles\\s*.mp3')          
                                   ['c:\\music\\_singles\\sidewinder.mp3',
                                   'c:\\music\\_singles\\spinning.mp3']
                                  ->>> glob.glob('c:\\music\\*\\*.mp3')
                                  +>>> glob.glob('c:\\music\\*\\*.mp3')
                                   
                                  1. As you saw earlier, os.listdir simply takes a directory path and lists all files and directories in that directory. @@ -2874,7 +2874,7 @@ the SGMLParser class and defining unknown_starttag,

                                    Example 8.4. Sample test of sgmllib.py

                                    Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython3.org/.)

                                    -c:\python23\lib> type "c:\downloads\diveintopython3\html\toc\index.html"
                                    +c:\python23\lib> type "c:\downloads\diveintopython3\html\toc\index.html"
                                     
                                     <!DOCTYPE html
                                       PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
                                    @@ -2887,7 +2887,7 @@ the SGMLParser class and defining unknown_starttag, 

                                    Running this through the test suite of sgmllib.py yields this output:

                                    -c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython3\html\toc\index.html"
                                    +c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython3\html\toc\index.html"
                                     data: '\n\n'
                                     start tag: <html >
                                     data: '\n   '
                                    @@ -2922,11 +2922,11 @@ data: '\n      '
                                     

                                    To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.

                                    The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.

                                    Example 8.5. Introducing urllib

                                    ->>> import urllib   
                                    ->>> sock = urllib.urlopen("http://diveintopython3.org/") 
                                    ->>> htmlSource = sock.read()          
                                    ->>> sock.close()    
                                    ->>> print htmlSource
                                    +>>> import urllib   
                                    +>>> sock = urllib.urlopen("http://diveintopython3.org/") 
                                    +>>> htmlSource = sock.read()          
                                    +>>> sock.close()    
                                    +>>> print htmlSource
                                     <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
                                           <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
                                        <title>Dive Into Python</title>
                                    @@ -2969,13 +2969,13 @@ class URLLister(SGMLParser):
                                     
                                  2. You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension.
                                  3. String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
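The sgmllib module used in this chapter is not available in Python 3, so the sketch below swaps in html.parser.HTMLParser from the standard library. It is not the book's URLLister; it only illustrates the same two points: filtering the attribute list with a comprehension, and relying on the parser to lowercase attribute names. The sample HTML string is made up.

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "a":
            href = [v for k, v in attrs if k == "href"]
            if href:
                self.urls.extend(href)

parser = URLLister()
# Uppercase tag and attribute names still match the lowercase comparisons.
parser.feed('<A HREF="toc/index.html">Contents</A><a name="top">Top</a>')
parser.close()
```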

                                    Example 8.7. Using urllister.py

                                    ->>> import urllib, urllister
                                    ->>> usock = urllib.urlopen("http://diveintopython3.org/")
                                    ->>> parser = urllister.URLLister()
                                    ->>> parser.feed(usock.read())         
                                    ->>> usock.close()   
                                    ->>> parser.close()  
                                    ->>> for url in parser.urls: print url 
                                    +>>> import urllib, urllister
                                    +>>> usock = urllib.urlopen("http://diveintopython3.org/")
                                    +>>> parser = urllister.URLLister()
                                    +>>> parser.feed(usock.read())         
                                    +>>> usock.close()   
                                    +>>> parser.close()  
                                    +>>> for url in parser.urls: print url 
                                     toc/index.html
                                     #download
                                     #languages
                                    @@ -3094,13 +3094,13 @@ module, which holds built-in functions and exceptions.
                                     
                                     
                    Important: Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or lambda function, Python will search for that variable in the current (nested or lambda) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or lambda) function's namespace, then in the parent function's namespace, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:
                    
                     from __future__ import nested_scopes
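The nested-scope search order can be seen in a small sketch. It assumes Python 2.2 or later (it is written to run on Python 3, where nested scopes are always on), and the function names are illustrative.

```python
x = "module"

def outer():
    x = "outer"
    def inner():
        # Not local to inner, so the parent function's namespace is searched next.
        return x
    return inner()

def outer_without_local():
    def inner():
        # Not in inner or its parent, so the module's namespace is used.
        return x
    return inner()
```

Under the pre-2.2 rules described above, the first inner would have skipped the parent function and found the module-level x instead.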

                    Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function. -

                    Example 8.10. Introducing locals

                    >>> def foo(arg): 
                    -...    x = 1
                    -...    print locals()
                    -...    
                    ->>> foo(7)        
                    +

                    Example 8.10. Introducing locals

                    >>> def foo(arg): 
                    +...    x = 1
                    +...    print locals()
                    +...    
                    +>>> foo(7)        
                     {'arg': 7, 'x': 1}
                    ->>> foo('bar')    
                    +>>> foo('bar')    
                     {'arg': 'bar', 'x': 1}
                    1. The function foo has two variables in its local namespace: arg, whose value is passed in to the function, and x, which is defined within the function. @@ -3121,7 +3121,7 @@ if __name__ == "__main__":
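A version of foo that returns its local namespace instead of printing it makes the result easy to inspect. This is a sketch in Python 3 syntax; copying with dict() is an added precaution so the caller gets a plain snapshot.

```python
def foo(arg):
    x = 1
    # locals() maps each local variable name to its current value.
    return dict(locals())

with_int = foo(7)
with_str = foo('bar')
```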
                      1. Just so you don't get intimidated, remember that you've seen all this before. The globals function returns a dictionary, and you're iterating through the dictionary using the items method and multi-variable assignment. The only thing new here is the globals function.

                        Now running the script from the command line gives this output (note that your output may be slightly different, depending - on your platform and where you installed Python):

                        c:\docbook\dip\py> python BaseHTMLProcessor.py
                        
                        +   on your platform and where you installed Python):
                        c:\docbook\dip\py> python BaseHTMLProcessor.py
                        
                         SGMLParser = sgmllib.SGMLParser                
                         htmlentitydefs = <module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'> 
                         BaseHTMLProcessor = __main__.BaseHTMLProcessor 
                        @@ -3166,12 +3166,12 @@ values are being inserted. You can't simply scan through the string in one pass
                         constantly switching between reading the string and reading the tuple of values.
                         

                        There is an alternative form of string formatting that uses dictionaries instead of tuples of values.

                        Example 8.13. Introducing dictionary-based string formatting

                        ->>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
                        ->>> "%(pwd)s" % params
                        +>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
                        +>>> "%(pwd)s" % params
                         'secret'
                        ->>> "%(pwd)s is not a good password for %(uid)s" % params 
                        +>>> "%(pwd)s is not a good password for %(uid)s" % params 
                         'secret is not a good password for sa'
                        ->>> "%(database)s of mind, %(database)s of body" % params 
                        +>>> "%(database)s of mind, %(database)s of body" % params 
                         'master of mind, master of body'
                        1. Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker. @@ -3221,23 +3221,23 @@ meaningful keys and values already. Like Example 8.16. Quoting attribute values
                          ->>> htmlSource = """        
                          -...    <html>
                          -...    <head>
                          -...    <title>Test page</title>
                          -...    </head>
                          -...    <body>
                          -...    <ul>
                          -...    <li><a href=index.html>Home</a></li>
                          -...    <li><a href=toc.html>Table of contents</a></li>
                          -...    <li><a href=history.html>Revision history</a></li>
                          -...    </body>
                          -...    </html>
                          -...    """
                          ->>> from BaseHTMLProcessor import BaseHTMLProcessor
                          ->>> parser = BaseHTMLProcessor()
                          ->>> parser.feed(htmlSource) 
                          ->>> print parser.output()   
                          +>>> htmlSource = """        
                          +...    <html>
                          +...    <head>
                          +...    <title>Test page</title>
                          +...    </head>
                          +...    <body>
                          +...    <ul>
                          +...    <li><a href=index.html>Home</a></li>
                          +...    <li><a href=toc.html>Table of contents</a></li>
                          +...    <li><a href=history.html>Revision history</a></li>
                          +...    </body>
                          +...    </html>
                          +...    """
                          +>>> from BaseHTMLProcessor import BaseHTMLProcessor
                          +>>> parser = BaseHTMLProcessor()
                          +>>> parser.feed(htmlSource) 
                          +>>> print parser.output()   
                           <html>
                           <head>
                           <title>Test page</title>
                          @@ -3712,7 +3712,7 @@ def openAnything(source):
                               import StringIO     
                               return StringIO.StringIO(str(source)) 
                           

                          Run the program kgp.py by itself, and it will parse the default XML-based grammar, in kant.xml, and print several paragraphs worth of philosophy in the style of Immanuel Kant. -

                          Example 9.3. Sample output of kgp.py

                          [you@localhost kgp]$ python kgp.py
                          +

                          Example 9.3. Sample output of kgp.py

                          [you@localhost kgp]$ python kgp.py
                                As is shown in the writings of Hume, our a priori concepts, in
                           reference to ends, abstract from all content of knowledge; in the study
                           of space, the discipline of human reason, in accordance with the
                          @@ -3753,17 +3753,17 @@ But all of it is in the style of Immanuel Kant.
                           

                          The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous example was derived from the grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be completely different. -

                          Example 9.4. Simpler output from kgp.py

                          [you@localhost kgp]$ python kgp.py -g binary.xml
                          +

                          Example 9.4. Simpler output from kgp.py

                          [you@localhost kgp]$ python kgp.py -g binary.xml
                           00101001
                          -[you@localhost kgp]$ python kgp.py -g binary.xml
                          +[you@localhost kgp]$ python kgp.py -g binary.xml
                           10110100

                          You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is that the grammar file defines the structure of the output, and the kgp.py program reads through the grammar and makes random decisions about which words to plug in where.

                          9.2. Packages

                          Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages.

                          Example 9.5. Loading an XML document (a sneak peek)

                          ->>> from xml.dom import minidom 
                          ->>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')
                          +>>> from xml.dom import minidom
                          +>>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')
                          1. This is a syntax you haven't seen before. It looks almost like the from module import you know and love, but the "." gives it away as something above and beyond a simple import. In fact, xml is what is known as a package, dom is a nested package within xml, and minidom is a module within xml.dom.

                            That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than @@ -3782,21 +3782,21 @@ just .py files, like always, except that they're in a subdirectory +--parsers/ xml.parsers package (used internally)

                          So when you say from xml.dom import minidom, Python figures out that that means “look in the xml directory for a dom directory, and look in that for the minidom module, and import it as minidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import specific classes or functions from a module contained within a package. You can also import the package itself as a module. The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing. -

                          Example 9.7. Packages are modules, too

                          >>> from xml.dom import minidom         
                          ->>> minidom
                          +

                          Example 9.7. Packages are modules, too

                          >>> from xml.dom import minidom         
                          +>>> minidom
                           <module 'xml.dom.minidom' from 'C:\Python21\lib\xml\dom\minidom.pyc'>
                          ->>> minidom.Element
                          +>>> minidom.Element
                           <class xml.dom.minidom.Element at 01095744>
                          ->>> from xml.dom.minidom import Element 
                          ->>> Element
                          +>>> from xml.dom.minidom import Element 
                          +>>> Element
                           <class xml.dom.minidom.Element at 01095744>
                          ->>> minidom.Element
                          +>>> minidom.Element
                           <class xml.dom.minidom.Element at 01095744>
                          ->>> from xml import dom                 
                          ->>> dom
                          +>>> from xml import dom                 
                          +>>> dom
                           <module 'xml.dom' from 'C:\Python21\lib\xml\dom\__init__.pyc'>
                          ->>> import xml        
                          ->>> xml
                          +>>> import xml        
                          +>>> xml
                           <module 'xml' from 'C:\Python21\lib\xml\__init__.pyc'>
                          1. Here you're importing a module (minidom) from a nested package (xml.dom). The result is that minidom is imported into your namespace, and in order to reference classes within the minidom module (like Element), you need to preface them with the module name. @@ -3817,11 +3817,11 @@ package architecture. It's one of the many things Python is good at, so take adv
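The import forms in Example 9.7 can be checked with a short sketch. It assumes Python 3 syntax (print as a function), which is an assumption on my part: the transcript above shows Python 2.1 output. Only standard library names are used.

```python
import xml
from xml import dom
from xml.dom import minidom
from xml.dom.minidom import Element

# A package, a nested package, and a plain module are all module objects.
for imported in (xml, dom, minidom):
    print(type(imported).__name__, imported.__name__)

# Importing the class directly and reaching it through its module
# both give you the very same object.
same_class = Element is minidom.Element
print(same_class)
```

Whichever form you pick is a matter of taste; the sketch only shows that they all lead to the same objects.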

                            9.3. Parsing XML

                            As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you.

                            Example 9.8. Loading an XML document (for real this time)

                            ->>> from xml.dom import minidom      
                            ->>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')  
                            ->>> xmldoc         
                            +>>> from xml.dom import minidom      
                            +>>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')  
                            +>>> xmldoc         
                             <xml.dom.minidom.Document instance at 010BE87C>
                            ->>> print xmldoc.toxml()             
                            +>>> print xmldoc.toxml()             
                             <?xml version="1.0" ?>
                             <grammar>
                             <ref id="bit">
                            @@ -3841,11 +3841,11 @@ package architecture. It's one of the many things Python is good at, so take adv
                             
                          2. toxml is a method of the Node class (and is therefore available on the Document object you got from minidom.parse). toxml returns the XML that this Node represents, as a string. For the Document node, that string is the entire XML document.

                            Now that you have an XML document in memory, you can start traversing through it.

                            Example 9.9. Getting child nodes

                            ->>> xmldoc.childNodes    
                            +>>> xmldoc.childNodes    
                             [<DOM Element: grammar at 17538908>]
                            ->>> xmldoc.childNodes[0] 
                            +>>> xmldoc.childNodes[0] 
                             <DOM Element: grammar at 17538908>
                            ->>> xmldoc.firstChild    
                            +>>> xmldoc.firstChild    
                             <DOM Element: grammar at 17538908>
                            1. Every Node has a childNodes attribute, which is a list of the Node objects. A Document always has only one child node, the root element of the XML document (in this case, the grammar element). @@ -3853,8 +3853,8 @@ package architecture. It's one of the many things Python is good at, so take adv going on here; this is just a regular Python list of regular Python objects.
                            2. Since getting the first child node of a node is a useful and common activity, the Node class has a firstChild attribute, which is synonymous with childNodes[0]. (There is also a lastChild attribute, which is synonymous with childNodes[-1].)
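Here is a minimal sketch of the same traversal, assuming Python 3. To stay self-contained it parses an inline copy of a small grammar with minidom.parseString instead of reading binary.xml from disk; the BINARY_XML name is just for illustration.

```python
from xml.dom import minidom

# Inline stand-in for the grammar file, so no file on disk is needed.
BINARY_XML = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
</grammar>"""

xmldoc = minidom.parseString(BINARY_XML)

children = xmldoc.childNodes   # the Document's direct children
root = xmldoc.firstChild       # shorthand for children[0]

print(len(children))
print(root.tagName)
print(root is children[0])
print(xmldoc.lastChild is children[-1])
```

Because this sample has no DOCTYPE or comments outside the root element, the Document has exactly one child, and firstChild and lastChild are the same node.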

                              Example 9.10. toxml works on any node

                              ->>> grammarNode = xmldoc.firstChild
                              ->>> print grammarNode.toxml() 
                              +>>> grammarNode = xmldoc.firstChild
                              +>>> print grammarNode.toxml() 
                               <grammar>
                               <ref id="bit">
                                 <p>0</p>
                              @@ -3868,24 +3868,24 @@ package architecture. It's one of the many things Python is good at, so take adv
                               
                              1. Since the toxml method is defined in the Node class, it is available on any XML node, not just the Document element.

                                Example 9.11. Child nodes can be text

                                ->>> grammarNode.childNodes
                                +>>> grammarNode.childNodes
                                 [<DOM Text node "\n">, <DOM Element: ref at 17533332>, \
                                 <DOM Text node "\n">, <DOM Element: ref at 17549660>, <DOM Text node "\n">]
                                ->>> print grammarNode.firstChild.toxml()    
                                +>>> print grammarNode.firstChild.toxml()    
                                 
                                 
                                 
                                ->>> print grammarNode.childNodes[1].toxml() 
                                +>>> print grammarNode.childNodes[1].toxml() 
                                 <ref id="bit">
                                   <p>0</p>
                                   <p>1</p>
                                 </ref>
                                ->>> print grammarNode.childNodes[3].toxml() 
                                +>>> print grammarNode.childNodes[3].toxml() 
                                 <ref id="byte">
                                   <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                 <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                 </ref>
                                ->>> print grammarNode.lastChild.toxml()     
                                +>>> print grammarNode.lastChild.toxml()     
                                 
                                 
                                 
                                @@ -3896,23 +3896,23 @@ package architecture. It's one of the many things Python is good at, so take adv
                              2. The fourth child is an Element object representing the second ref element.
                              3. The last child is a Text object representing the carriage return after the '</ref>' end tag and before the '</grammar>' end tag.
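The mix of Text and Element children is easy to see with a sketch like the following (Python 3 assumed, with an inline and slightly shortened copy of the grammar rather than the real file). The names kinds and element_children are hypothetical.

```python
from xml.dom import minidom, Node

# Shortened inline grammar; the real 'byte' ref holds eight xref elements.
BINARY_XML = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/></p>
</ref>
</grammar>"""

grammar_node = minidom.parseString(BINARY_XML).firstChild

# Label each child: whitespace between tags shows up as a Text node.
kinds = ["text" if child.nodeType == Node.TEXT_NODE else child.tagName
         for child in grammar_node.childNodes]
print(kinds)

# Keep only the Element children if the whitespace is not interesting.
element_children = [child for child in grammar_node.childNodes
                    if child.nodeType == Node.ELEMENT_NODE]
print([child.getAttribute("id") for child in element_children])

# Drill down to the text inside the first <p> of the first ref.
first_bit = element_children[0].getElementsByTagName("p")[0].firstChild.data
print(first_bit)
```

Filtering on nodeType is one way to skip the whitespace; indexing with childNodes[1] and childNodes[3], as the example above does, works too as long as you know the exact layout of the file.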

                                Example 9.12. Drilling down all the way to text

                                ->>> grammarNode
                                +>>> grammarNode
                                 <DOM Element: grammar at 19167148>
                                ->>> refNode = grammarNode.childNodes[1] 
                                ->>> refNode
                                +>>> refNode = grammarNode.childNodes[1] 
                                +>>> refNode
                                 <DOM Element: ref at 17987740>
                                ->>> refNode.childNodes
                                +>>> refNode.childNodes
                                 [<DOM Text node "\n">, <DOM Text node "  ">, <DOM Element: p at 19315844>, \
                                 <DOM Text node "\n">, <DOM Text node "  ">, \
                                 <DOM Element: p at 19462036>, <DOM Text node "\n">]
                                ->>> pNode = refNode.childNodes[2]
                                ->>> pNode
                                +>>> pNode = refNode.childNodes[2]
                                +>>> pNode
                                 <DOM Element: p at 19315844>
                                ->>> print pNode.toxml()                 
                                +>>> print pNode.toxml()                 
                                 <p>0</p>
                                ->>> pNode.firstChild  
                                +>>> pNode.firstChild  
                                 <DOM Text node "0">
                                ->>> pNode.firstChild.data               
                                +>>> pNode.firstChild.data               
                                 u'0'
                                1. As you saw in the previous example, the first ref element is grammarNode.childNodes[1], since childNodes[0] is a Text node for the carriage return. @@ -3949,11 +3949,11 @@ sys.setdefaultencoding('iso-8859-1') (as long as import can find it), but it usually goes in the site-packages directory within your Python lib directory.
                                2. setdefaultencoding function sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.
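For comparison, here is a small sketch of how this looks under Python 3, where there is no sitecustomize.py step to perform: every str is already Unicode, and conversions to bytes are spelled out. This describes Python 3 only and is not a port of the author's setup.

```python
import sys

# In Python 3 the default encoding is fixed; there is nothing to configure.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# sys.setdefaultencoding no longer exists in Python 3.
can_change = hasattr(sys, "setdefaultencoding")
print(can_change)

s = 'La Pe\xf1a'                     # no u prefix needed; str is Unicode
encoded = s.encode('iso-8859-1')     # going to bytes is always explicit
print(encoded)
print(encoded.decode('iso-8859-1') == s)
```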

                                  Example 9.16. Effects of setting the default encoding

                                  ->>> import sys
                                  ->>> sys.getdefaultencoding() 
                                  +>>> import sys
                                  +>>> sys.getdefaultencoding() 
                                   'iso-8859-1'
                                  ->>> s = u'La Pe\xf1a'
                                  ->>> print s
                                  +>>> s = u'La Pe\xf1a'
                                  +>>> print s
                                   La Peña
                                  1. This example assumes that you have made the changes listed in the previous example to your sitecustomize.py file, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even @@ -3989,17 +3989,17 @@ La Peña
                              </ref> </grammar>
                          3. It has two refs, 'bit' and 'byte'. A bit is either a '0' or '1', and a byte is 8 bits.

                            Example 9.21. Introducing getElementsByTagName

                            ->>> from xml.dom import minidom
                            ->>> xmldoc = minidom.parse('binary.xml')
                            ->>> reflist = xmldoc.getElementsByTagName('ref') 
                            ->>> reflist
                            +>>> from xml.dom import minidom
                            +>>> xmldoc = minidom.parse('binary.xml')
                            +>>> reflist = xmldoc.getElementsByTagName('ref') 
                            +>>> reflist
                             [<DOM Element: ref at 136138108>, <DOM Element: ref at 136144292>]
                            ->>> print reflist[0].toxml()
                            +>>> print reflist[0].toxml()
                             <ref id="bit">
                               <p>0</p>
                               <p>1</p>
                             </ref>
                            ->>> print reflist[1].toxml()
                            +>>> print reflist[1].toxml()
                             <ref id="byte">
                               <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                             <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                            @@ -4008,32 +4008,32 @@ La Peña
                            1. getElementsByTagName takes one argument, the name of the element you wish to find. It returns a list of Element objects, corresponding to the XML elements that have that name. In this case, you find two ref elements.

                              Example 9.22. Every element is searchable

                              ->>> firstref = reflist[0]    
                              ->>> print firstref.toxml()
                              +>>> firstref = reflist[0]    
                              +>>> print firstref.toxml()
                               <ref id="bit">
                                 <p>0</p>
                                 <p>1</p>
                               </ref>
                              ->>> plist = firstref.getElementsByTagName("p") 
                              ->>> plist
                              +>>> plist = firstref.getElementsByTagName("p") 
                              +>>> plist
                               [<DOM Element: p at 136140116>, <DOM Element: p at 136142172>]
                              ->>> print plist[0].toxml()   
                              +>>> print plist[0].toxml()   
                               <p>0</p>
                              ->>> print plist[1].toxml()
                              +>>> print plist[1].toxml()
                               <p>1</p>
                              1. Continuing from the previous example, the first object in your reflist is the 'bit' ref element.
                              2. You can use the same getElementsByTagName method on this Element to find all the <p> elements within the 'bit' ref element.
                              3. Just as before, the getElementsByTagName method returns a list of all the elements it found. In this case, you have two, one for each bit.
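As a quick sketch (Python 3 assumed, inline grammar instead of binary.xml), you can call getElementsByTagName on a single element and, for contrast, on the whole document:

```python
from xml.dom import minidom

BINARY_XML = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""

xmldoc = minidom.parseString(BINARY_XML)

reflist = xmldoc.getElementsByTagName('ref')   # both ref elements
first_ref = reflist[0]                         # the 'bit' ref

# Searching from an element only looks inside that element...
bits = first_ref.getElementsByTagName('p')
# ...while searching from the document looks everywhere.
all_p = xmldoc.getElementsByTagName('p')

print(len(reflist), len(bits), len(all_p))
print([p.toxml() for p in bits])
```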

                                Example 9.23. Searching is actually recursive

                                ->>> plist = xmldoc.getElementsByTagName("p") 
                                ->>> plist
                                +>>> plist = xmldoc.getElementsByTagName("p") 
                                +>>> plist
                                 [<DOM Element: p at 136140116>, <DOM Element: p at 136142172>, <DOM Element: p at 136146124>]
                                ->>> plist[0].toxml()       
                                +>>> plist[0].toxml()       
                                 '<p>0</p>'
                                ->>> plist[1].toxml()
                                +>>> plist[1].toxml()
                                 '<p>1</p>'
                                ->>> plist[2].toxml()       
                                +>>> plist[2].toxml()       
                                 '<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                 <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>'
                                  @@ -4048,21 +4048,21 @@ La Peña
                    Note: This section may be a little confusing, because of some overlapping terminology. Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents. I told you it was confusing. I am open to suggestions on how to distinguish these more clearly.

                    Example 9.24. Accessing element attributes

                    ->>> xmldoc = minidom.parse('binary.xml')
                    ->>> reflist = xmldoc.getElementsByTagName('ref')
                    ->>> bitref = reflist[0]
                    ->>> print bitref.toxml()
                    +>>> xmldoc = minidom.parse('binary.xml')
                    +>>> reflist = xmldoc.getElementsByTagName('ref')
                    +>>> bitref = reflist[0]
                    +>>> print bitref.toxml()
                     <ref id="bit">
                       <p>0</p>
                       <p>1</p>
                     </ref>
                    ->>> bitref.attributes          
                    +>>> bitref.attributes          
                     <xml.dom.minidom.NamedNodeMap instance at 0x81e0c9c>
                    ->>> bitref.attributes.keys()    
                    +>>> bitref.attributes.keys()    
                     [u'id']
                    ->>> bitref.attributes.values() 
                    +>>> bitref.attributes.values() 
                     [<xml.dom.minidom.Attr instance at 0x81d5044>]
                    ->>> bitref.attributes["id"]    
                    +>>> bitref.attributes["id"]    
                     <xml.dom.minidom.Attr instance at 0x81d5044>
                    1. Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it. @@ -4072,12 +4072,12 @@ La Peña
                    2. Still treating the NamedNodeMap as a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been paying extra-close attention will already know how the NamedNodeMap class accomplishes this neat trick: by defining a __getitem__ special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.)
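The dictionary-like behavior can be sketched like this, assuming Python 3 (where keys() may hand back a view rather than a list, hence the list() call). The getAttribute shortcut at the end is a standard Element method that the text above does not use; it is included only for comparison.

```python
from xml.dom import minidom

xmldoc = minidom.parseString(
    "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>")
bitref = xmldoc.getElementsByTagName('ref')[0]

attrs = bitref.attributes        # a NamedNodeMap, used like a dictionary
names = list(attrs.keys())       # names of the XML attributes
id_attr = attrs['id']            # an Attr object, looked up by name

print(names)
print(id_attr.name, id_attr.value)

# Shortcut when all you want is the value as a string.
print(bitref.getAttribute('id'))
```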

                      Example 9.25. Accessing individual attributes

                      ->>> a = bitref.attributes["id"]
                      ->>> a
                      +>>> a = bitref.attributes["id"]
                      +>>> a
                       <xml.dom.minidom.Attr instance at 0x81d5044>
                      ->>> a.name  
                      +>>> a.name  
                       u'id'
                      ->>> a.value 
                      +>>> a.value 
                       u'bit'
                      1. The Attr object completely represents a single XML attribute of a single XML element. The name of the attribute (the same name as you used to find this object in the bitref.attributes NamedNodeMap pseudo-dictionary) is stored in a.name. @@ -4116,11 +4116,11 @@ disk, a web page, even a hard-coded string. As long as you pass a file-like obje calls the object's read method, the function can handle any kind of input source without specific code to handle each kind.

                        In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object.

                        Example 10.1. Parsing XML from a file

                        ->>> from xml.dom import minidom
                        ->>> fsock = open('binary.xml')    
                        ->>> xmldoc = minidom.parse(fsock) 
                        ->>> fsock.close()                 
                        ->>> print xmldoc.toxml()          
                        +>>> from xml.dom import minidom
                        +>>> fsock = open('binary.xml')    
                        +>>> xmldoc = minidom.parse(fsock) 
                        +>>> fsock.close()                 
                        +>>> print xmldoc.toxml()          
                         <?xml version="1.0" ?>
                         <grammar>
                         <ref id="bit">
                        @@ -4140,11 +4140,11 @@ calls the object's read method, the function can handle any kind of
                         

                        Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet.

                        Example 10.2. Parsing XML from a URL

                        ->>> import urllib
                        ->>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') 
                        ->>> xmldoc = minidom.parse(usock)            
                        ->>> usock.close()          
                        ->>> print xmldoc.toxml()   
                        +>>> import urllib
                        +>>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') 
                        +>>> xmldoc = minidom.parse(usock)            
                        +>>> usock.close()          
                        +>>> print xmldoc.toxml()   
                         <?xml version="1.0" ?>
                         <rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/"
                          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
                        @@ -4173,9 +4173,9 @@ just going to be parsing a local file, you can pass the filename and minid
                         
                      2. As soon as you're done with it, be sure to close the file-like object that urlopen gives you.
                      3. By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site.

                        Example 10.3. Parsing XML from a string (the easy but inflexible way)

                        ->>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                        ->>> xmldoc = minidom.parseString(contents) 
                        ->>> print xmldoc.toxml()
                        +>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                        +>>> xmldoc = minidom.parseString(contents) 
                        +>>> print xmldoc.toxml()
                         <?xml version="1.0" ?>
                         <grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
                          @@ -4184,21 +4184,21 @@ just going to be parsing a local file, you can pass the filename and minid file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying.

                          If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO.

                          Example 10.4. Introducing StringIO

                          ->>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                          ->>> import StringIO
                          ->>> ssock = StringIO.StringIO(contents)   
                          ->>> ssock.read()        
                          +>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                          +>>> import StringIO
                          +>>> ssock = StringIO.StringIO(contents)   
                          +>>> ssock.read()        
                           "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                          ->>> ssock.read()        
                          +>>> ssock.read()        
                           ''
                          ->>> ssock.seek(0)       
                          ->>> ssock.read(15)      
                          +>>> ssock.seek(0)       
                          +>>> ssock.read(15)      
                           '<grammar><ref i'
                          ->>> ssock.read(15)
                          +>>> ssock.read(15)
                           "d='bit'><p>0</p"
                          ->>> ssock.read()
                          +>>> ssock.read()
                           '><p>1</p></ref></grammar>'
                          ->>> ssock.close()       
                          +>>> ssock.close()
                        1. The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object. The StringIO class takes the string as a parameter when creating an instance.
                        2. Now you have a file-like object, and you can do all sorts of file-like things with it. Like read, which returns the original string. @@ -4209,11 +4209,11 @@ file, a URL, or a string, you'll need special logic to check whethe
                        3. At any time, read will return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term file-like object.
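In Python 3 the StringIO class lives in the io module rather than in a module of its own. A minimal sketch of the same steps, under that assumption, with the result handed to minidom.parse at the end:

```python
import io
from xml.dom import minidom

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

ssock = io.StringIO(contents)   # wrap the string in a file-like object
whole = ssock.read()            # reads everything
empty = ssock.read()            # nothing left, so an empty string

ssock.seek(0)                   # rewind to the beginning
first_chunk = ssock.read(15)    # read in chunks, like a real file
rest = ssock.read()             # whatever has not been read yet

ssock.seek(0)
xmldoc = minidom.parse(ssock)   # minidom.parse only needs a read method
ssock.close()

root_name = xmldoc.documentElement.tagName
print(first_chunk)
print(root_name)
```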

                          Example 10.5. Parsing XML from a string (the file-like object way)

                          ->>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                          ->>> ssock = StringIO.StringIO(contents)
                          ->>> xmldoc = minidom.parse(ssock) 
                          ->>> ssock.close()
                          ->>> print xmldoc.toxml()
                          +>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                          +>>> ssock = StringIO.StringIO(contents)
                          +>>> xmldoc = minidom.parse(ssock) 
                          +>>> ssock.close()
                          +>>> print xmldoc.toxml()
                           <?xml version="1.0" ?>
                           <grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
                            @@ -4257,17 +4257,17 @@ class KantGenerator: prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.)

                            Example 10.8. Introducing stdout and stderr

                            ->>> for i in range(3):
                            -...    print 'Dive in'             
                            +>>> for i in range(3):
                            +...    print 'Dive in'             
                             Dive in
                             Dive in
                             Dive in
                            ->>> import sys
                            ->>> for i in range(3):
                            -...    sys.stdout.write('Dive in') 
                            +>>> import sys
                            +>>> for i in range(3):
                            +...    sys.stdout.write('Dive in') 
                             Dive inDive inDive in
                            ->>> for i in range(3):
                            -...    sys.stderr.write('Dive in') 
                            +>>> for i in range(3):
                            +...    sys.stderr.write('Dive in') 
                             Dive inDive inDive in
                            1. As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in range function to build simple counter loops that repeat something a set number of times. @@ -4275,9 +4275,9 @@ Dive inDive inDive in
                      4. In the simplest case, stdout and stderr send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like stdout, stderr does not add carriage returns for you; if you want them, add them yourself.

                        stdout and stderr are both file-like objects, like the ones discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output.
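As a rough sketch of what "assigning a file-like object" means in practice, the snippet below points sys.stdout at an in-memory buffer and then restores it. It assumes Python 3 (print as a function, io.StringIO); capture_stdout and say are hypothetical helper names, not part of the book's example scripts.

```python
import io
import sys

def capture_stdout(func):
    """Run func() with sys.stdout temporarily replaced by an in-memory buffer."""
    saved = sys.stdout            # remember the real stdout
    buffer = io.StringIO()        # any object with a write method will do
    sys.stdout = buffer           # redirect: print now writes to the buffer
    try:
        func()
    finally:
        sys.stdout = saved        # always restore stdout, even on error
    return buffer.getvalue()

def say():
    print('This message will be logged instead of displayed')

logged = capture_stdout(say)      # nothing appears on screen
```

Restoring the saved object in a finally clause matters: if the redirected code raises an exception, later output would otherwise keep going to the buffer.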

                        Example 10.9. Redirecting output

                        -[you@localhost kgp]$ python stdout.py
                        +[you@localhost kgp]$ python stdout.py
                         Dive in
                        -[you@localhost kgp]$ cat out.log
                        +[you@localhost kgp]$ cat out.log
                         This message will be logged instead of displayed

                        (On Windows, you can use type instead of cat to display the contents of a file.)

                        If you have not already done so, you can download this and other examples used in this book.

                        
                        @@ -4302,8 +4302,8 @@ fsock.close()        
                         
                      5. Close the log file.

                        Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout.

                        Example 10.10. Redirecting error information

                        -[you@localhost kgp]$ python stderr.py
                        -[you@localhost kgp]$ cat error.log
                        +[you@localhost kgp]$ python stderr.py
                        +[you@localhost kgp]$ cat error.log
                         Traceback (most recent call last):
                           File "stderr.py", line 5, in ?
                             raise Exception, 'this error will be logged'
                        @@ -4324,10 +4324,10 @@ raise Exception, 'this error will be logged'  

                        Since it is so common to write error messages to standard error, there is a shorthand syntax that can be used instead of going through the hassle of redirecting it outright.
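If you are on Python 3, where print is a function, the equivalent shorthand is the file argument. A minimal sketch under that assumption follows; log is a hypothetical helper name, and the buffer is there only so the result can be inspected.

```python
import io
import sys

def log(message, stream=None):
    """Print message to the given stream, defaulting to standard error."""
    if stream is None:
        stream = sys.stderr           # looked up at call time, so redirection still works
    print(message, file=stream)       # print accepts any object with a write method

log('entering function')              # goes to stderr, not stdout

buffer = io.StringIO()
log('entering function', stream=buffer)
captured = buffer.getvalue()          # what would have been written to the stream
```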

                        Example 10.11. Printing to stderr

                        ->>> print 'entering function'
                        +>>> print 'entering function'
                         entering function
                        ->>> import sys
                        ->>> print >> sys.stderr, 'entering function' 
                        +>>> import sys
                        +>>> print >> sys.stderr, 'entering function' 
                         entering function
                         
                          @@ -4338,9 +4338,9 @@ becomes the input for the next program in the chain. The first program simply ou special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting one program's output to the next program's input.

                          Example 10.12. Chaining commands

                          -[you@localhost kgp]$ python kgp.py -g binary.xml         
                          +[you@localhost kgp]$ python kgp.py -g binary.xml         
                           01100111
                          -[you@localhost kgp]$ cat binary.xml    
                          +[you@localhost kgp]$ cat binary.xml    
                           <?xml version="1.0"?>
                           <!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
                           <grammar>
                          @@ -4353,7 +4353,7 @@ one program's output to the next program's input.
                           <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                           </ref>
                           </grammar>
                          -[you@localhost kgp]$ cat binary.xml | python kgp.py -g -  
                          +[you@localhost kgp]$ cat binary.xml | python kgp.py -g -  
                           10110001
                          1. As you saw in Section 9.1, “Diving in”, this will print a string of eight random bits, 0 or 1. @@ -4421,13 +4421,13 @@ def openAnything(source):
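One way to picture how a script can treat "-" as "read from standard input" is the simplified sketch below. It is not the book's openAnything function: open_source is a hypothetical name, the stdin parameter exists only to make the sketch easy to try out, and the code assumes Python 3.

```python
import io
import sys

def open_source(source, stdin=None):
    """Return a file-like object for '-', a filename, or a literal string."""
    if source == '-':
        # A lone dash conventionally means "use standard input".
        return sys.stdin if stdin is None else stdin
    try:
        return open(source)           # try it as a path on disk
    except (OSError, ValueError):
        return io.StringIO(source)    # fall back to treating it as data

fake_stdin = io.StringIO("<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>")
stream = open_source('-', stdin=fake_stdin)
piped = stream.read()
```

Whatever comes back has a read method, so the caller never needs to know where the data originated.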

                            10.5. Creating separate handlers by node type

                            The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a Document object. The Document then contains one or more Element objects (for actual XML tags), each of which may contain other Element objects, Text objects (for bits of text), or Comment objects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type.
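A dispatcher of this kind can be sketched by looking up a method whose name is built from the node's class name. This is an illustration rather than the kgp.py code itself: NodeDispatcher and the parse_ prefix are assumptions made for the sketch, and it collects text instead of generating output.

```python
from xml.dom import minidom

class NodeDispatcher:
    """Walk a parsed document, routing each node to a handler by class name."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # 'Document', 'Element', 'Text', or 'Comment' picks the method to call.
        handler = getattr(self, 'parse_' + node.__class__.__name__)
        handler(node)

    def parse_Document(self, node):
        self.parse(node.documentElement)     # start at the root element

    def parse_Element(self, node):
        for child in node.childNodes:        # recurse into child nodes
            self.parse(child)

    def parse_Text(self, node):
        self.pieces.append(node.data)        # keep the text content

    def parse_Comment(self, node):
        pass                                 # comments are ignored

xmldoc = minidom.parseString('<ref id="bit"><p>0</p><!-- note --><p>1</p></ref>')
dispatcher = NodeDispatcher()
dispatcher.parse(xmldoc)
text = ''.join(dispatcher.pieces)
```

Adding support for another node type means adding one more method; the parse method itself never changes.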

                            Example 10.17. Class names of parsed XML objects

                            ->>> from xml.dom import minidom
                            ->>> xmldoc = minidom.parse('kant.xml') 
                            ->>> xmldoc
                            +>>> from xml.dom import minidom
                            +>>> xmldoc = minidom.parse('kant.xml') 
                            +>>> xmldoc
                             <xml.dom.minidom.Document instance at 0x01359DE8>
                            ->>> xmldoc.__class__ 
                            +>>> xmldoc.__class__ 
                             <class xml.dom.minidom.Document at 0x01105D40>
                            ->>> xmldoc.__class__.__name__          
                            +>>> xmldoc.__class__.__name__          
                             'Document'
                            1. Assume for a moment that kant.xml is in the current directory. @@ -4491,16 +4491,16 @@ for arg in sys.argv:
                              1. Each command-line argument passed to the program will be in sys.argv, which is just a list. Here you are printing each argument on a separate line.

                                Example 10.21. The contents of sys.argv

                                -[you@localhost py]$ python argecho.py             
                                +[you@localhost py]$ python argecho.py             
                                 argecho.py
                                -[you@localhost py]$ python argecho.py abc def     
                                +[you@localhost py]$ python argecho.py abc def     
                                 argecho.py
                                 abc
                                 def
                                -[you@localhost py]$ python argecho.py --help      
                                +[you@localhost py]$ python argecho.py --help      
                                 argecho.py
                                 --help
                                -[you@localhost py]$ python argecho.py -m kant.xml 
                                +[you@localhost py]$ python argecho.py -m kant.xml 
                                 argecho.py
                                 -m
                                 kant.xml
                                @@ -4814,9 +4814,9 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT): it once; you want to download it over and over again, every hour, to get the latest news from the site that's offering the news feed. Let's do it the quick-and-dirty way first, and then see how you can do better.

                                Example 11.2. Downloading a feed the quick-and-dirty way

                                ->>> import urllib
                                ->>> data = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()    
                                ->>> print data
                                +>>> import urllib
                                +>>> data = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()    
                                +>>> print data
                                 <?xml version="1.0" encoding="iso-8859-1"?>
                                 <feed version="0.3"
                                   xmlns="http://purl.org/atom/ns#"
                                @@ -4888,10 +4888,10 @@ Python comes with a separate gzip module, which has functions you c
                                 

                                First, let's turn on the debugging features of Python's HTTP library and see what's being sent over the wire. This will be useful throughout the chapter, as you add more and more features.

                                Example 11.3. Debugging HTTP

                                ->>> import httplib
                                ->>> httplib.HTTPConnection.debuglevel = 1             
                                ->>> import urllib
                                ->>> feeddata = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()
                                +>>> import httplib
                                +>>> httplib.HTTPConnection.debuglevel = 1             
                                +>>> import urllib
                                +>>> feeddata = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()
                                 connect: (diveintomark.org, 80)     
                                 send: '
                                 GET /xml/atom.xml HTTP/1.0          
                                @@ -4928,12 +4928,12 @@ header: Connection: close
                                 

                                11.5. Setting the User-Agent

                                The first step to improving your HTTP web services client is to identify yourself properly with a User-Agent. To do that, you need to move beyond the basic urllib and dive into urllib2.
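For readers on Python 3, the same idea can be sketched with urllib.request, which is where the Request class and build_opener live in that version (an assumption about your environment, since this chapter uses urllib2). The sketch only builds the request and does not touch the network.

```python
import urllib.request  # assumption: Python 3 counterpart of urllib2

request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')
request.add_header('User-Agent',
                   'OpenAnything/1.0 +http://diveintopython3.org/')

full_url = request.get_full_url()
# Request stores header names capitalized, so look it up as 'User-agent'.
user_agent = request.get_header('User-agent')

# Opening the request would send it over the wire:
# opener = urllib.request.build_opener()
# feeddata = opener.open(request).read()
```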

                                Example 11.4. Introducing urllib2

                                ->>> import httplib
                                ->>> httplib.HTTPConnection.debuglevel = 1           
                                ->>> import urllib2
                                ->>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml') 
                                ->>> opener = urllib2.build_opener()                 
                                ->>> feeddata = opener.open(request).read()          
                                +>>> import httplib
                                +>>> httplib.HTTPConnection.debuglevel = 1           
                                +>>> import urllib2
                                +>>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml') 
                                +>>> opener = urllib2.build_opener()                 
                                +>>> feeddata = opener.open(request).read()          
                                 connect: (diveintomark.org, 80)
                                 send: '
                                 GET /xml/atom.xml HTTP/1.0
                                @@ -4960,13 +4960,13 @@ header: Connection: close
                                 
                              2. The final step is to tell the opener to open the URL, using the Request object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the resource and stores the returned data in feeddata.

                                Example 11.5. Adding headers with the Request

                                ->>> request            
                                +>>> request            
                                 <urllib2.Request instance at 0x00250AA8>
                                ->>> request.get_full_url()
                                +>>> request.get_full_url()
                                 'http://diveintomark.org/xml/atom.xml'
                                ->>> request.add_header('User-Agent',
                                -...    'OpenAnything/1.0 +http://diveintopython3.org/')    
                                ->>> feeddata = opener.open(request).read()                 
                                +>>> request.add_header('User-Agent',
                                +...    'OpenAnything/1.0 +http://diveintopython3.org/')    
                                +>>> feeddata = opener.open(request).read()                 
                                 connect: (diveintomark.org, 80)
                                 send: '
                                 GET /xml/atom.xml HTTP/1.0
                                @@ -4997,11 +4997,11 @@ header: Connection: close
                                 

                                These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can turn it off by setting httplib.HTTPConnection.debuglevel = 0. Or you can just leave debugging on, if that helps you.

                                Example 11.6. Testing Last-Modified

                                ->>> import urllib2
                                ->>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
                                ->>> opener = urllib2.build_opener()
                                ->>> firstdatastream = opener.open(request)
                                ->>> firstdatastream.headers.dict     
                                +>>> import urllib2
                                +>>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
                                +>>> opener = urllib2.build_opener()
                                +>>> firstdatastream = opener.open(request)
                                +>>> firstdatastream.headers.dict     
                                 {'date': 'Thu, 15 Apr 2004 20:42:41 GMT', 
                                  'server': 'Apache/2.0.49 (Debian GNU/Linux)', 
                                  'content-type': 'application/atom+xml',
                                @@ -5010,9 +5010,9 @@ turn it off by setting httplib.HTTPConnection.debuglevel = 0. Or yo
                                  'content-length': '15955', 
                                  'accept-ranges': 'bytes', 
                                  'connection': 'close'}
                                ->>> request.add_header('If-Modified-Since',
                                -...    firstdatastream.headers.get('Last-Modified'))  
                                ->>> seconddatastream = opener.open(request)            
                                +>>> request.add_header('If-Modified-Since',
                                +...    firstdatastream.headers.get('Last-Modified'))  
                                +>>> seconddatastream = opener.open(request)            
                                 Traceback (most recent call last):
                                   File "<stdin>", line 1, in ?
                                   File "c:\python23\lib\urllib2.py", line 326, in open
                                @@ -5057,15 +5057,15 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    ①This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access
                                             to it from the calling program.
                                 

                                Example 11.8. Using custom URL handlers

                                ->>> request.headers         
                                +>>> request.headers         
                                 {'If-modified-since': 'Thu, 15 Apr 2004 19:45:21 GMT'}
                                ->>> import openanything
                                ->>> opener = urllib2.build_opener(
                                -...    openanything.DefaultErrorHandler())   
                                ->>> seconddatastream = opener.open(request)
                                ->>> seconddatastream.status 
                                +>>> import openanything
                                +>>> opener = urllib2.build_opener(
                                +...    openanything.DefaultErrorHandler())   
                                +>>> seconddatastream = opener.open(request)
                                +>>> seconddatastream.status 
                                 304
                                ->>> seconddatastream.read() 
                                +>>> seconddatastream.read() 
                                 ''
                                 
                                  @@ -5077,15 +5077,15 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): ①Handling ETag works much the same way, but instead of checking for Last-Modified and sending If-Modified-Since, you check for ETag and send If-None-Match. Let's start with a fresh IDE session.
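The two validators pair up the same way: Last-Modified goes back as If-Modified-Since, and ETag goes back as If-None-Match. A small sketch of building such a request, assuming Python 3's urllib.request, might look like this; build_conditional_request is a hypothetical helper, and no request is actually sent.

```python
import urllib.request  # assumption: Python 3 counterpart of urllib2

def build_conditional_request(url, etag=None, last_modified=None):
    """Build a Request that asks the server to skip unchanged data."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header('If-None-Match', etag)               # pairs with ETag
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)  # pairs with Last-Modified
    return request

request = build_conditional_request(
    'http://diveintomark.org/xml/atom.xml',
    etag='"e842a-3e53-55d97640"',
    last_modified='Thu, 15 Apr 2004 19:45:21 GMT')
headers = dict(request.header_items())   # header names come back capitalized
```

If the server decides nothing has changed, it can answer with a 304 and an empty body instead of the full feed.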

                                  Example 11.9. Supporting ETag/If-None-Match

                                  ->>> import urllib2, openanything
                                  ->>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
                                  ->>> opener = urllib2.build_opener(
                                  -...    openanything.DefaultErrorHandler())
                                  ->>> firstdatastream = opener.open(request)
                                  ->>> firstdatastream.headers.get('ETag')        
                                  +>>> import urllib2, openanything
                                  +>>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
                                  +>>> opener = urllib2.build_opener(
                                  +...    openanything.DefaultErrorHandler())
                                  +>>> firstdatastream = opener.open(request)
                                  +>>> firstdatastream.headers.get('ETag')        
                                   '"e842a-3e53-55d97640"'
                                  ->>> firstdata = firstdatastream.read()
                                  ->>> print firstdata          
                                  +>>> firstdata = firstdatastream.read()
                                  +>>> print firstdata          
                                   <?xml version="1.0" encoding="iso-8859-1"?>
                                   <feed version="0.3"
                                     xmlns="http://purl.org/atom/ns#"
                                  @@ -5094,12 +5094,12 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):    ①dive into mark</title>
                                     <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
                                     <-- rest of feed omitted for brevity -->
                                  ->>> request.add_header('If-None-Match',
                                  -...    firstdatastream.headers.get('ETag'))   
                                  ->>> seconddatastream = opener.open(request)
                                  ->>> seconddatastream.status  
                                  +>>> request.add_header('If-None-Match',
                                  +...    firstdatastream.headers.get('ETag'))   
                                  +>>> seconddatastream = opener.open(request)
                                  +>>> seconddatastream.status  
                                   304
                                  ->>> seconddatastream.read()  
                                  +>>> seconddatastream.read()  
                                   ''
                                   
                                    @@ -5116,12 +5116,12 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): ①You can support permanent and temporary redirects using a different kind of custom URL handler.

                                    First, let's see why a redirect handler is necessary in the first place.

                                    Example 11.10. Accessing web services without a redirect handler

                                    ->>> import urllib2, httplib
                                    ->>> httplib.HTTPConnection.debuglevel = 1           
                                    ->>> request = urllib2.Request(
                                    -...    'http://diveintomark.org/redir/example301.xml') 
                                    ->>> opener = urllib2.build_opener()
                                    ->>> f = opener.open(request)
                                    +>>> import urllib2, httplib
                                    +>>> httplib.HTTPConnection.debuglevel = 1           
                                    +>>> request = urllib2.Request(
                                    +...    'http://diveintomark.org/redir/example301.xml') 
                                    +>>> opener = urllib2.build_opener()
                                    +>>> f = opener.open(request)
                                     connect: (diveintomark.org, 80)
                                     send: '
                                     GET /redir/example301.xml HTTP/1.0
                                    @@ -5150,9 +5150,9 @@ header: Accept-Ranges: bytes
                                     header: Content-Length: 15955
                                     header: Connection: close
                                     header: Content-Type: application/atom+xml
                                    ->>> f.url           
                                    +>>> f.url           
                                     'http://diveintomark.org/xml/atom.xml'
                                    ->>> f.headers.dict
                                    +>>> f.headers.dict
                                     {'content-length': '15955', 
                                     'accept-ranges': 'bytes', 
                                     'server': 'Apache/2.0.49 (Debian GNU/Linux)', 
                                    @@ -5161,7 +5161,7 @@ header: Content-Type: application/atom+xml
                                     'etag': '"e842a-3e53-55d97640"', 
                                     'date': 'Thu, 15 Apr 2004 22:06:25 GMT', 
                                     'content-type': 'application/atom+xml'}
                                    ->>> f.status
                                    +>>> f.status
                                     Traceback (most recent call last):
                                       File "<stdin>", line 1, in ?
                                     AttributeError: addinfourl instance has no attribute 'status'
                                    @@ -5202,12 +5202,12 @@ class SmartRedirectHandler(urllib2.HTTPRedirectHandler):     ①So what has this bought us?  You can now build a URL opener with the custom redirect handler, and it will still automatically
                                     follow redirects, but now it will also expose the redirect status code.
                                     

                                    Example 11.12. Using the redirect handler to detect permanent redirects

                                    ->>> request = urllib2.Request('http://diveintomark.org/redir/example301.xml')
                                    ->>> import openanything, httplib
                                    ->>> httplib.HTTPConnection.debuglevel = 1
                                    ->>> opener = urllib2.build_opener(
                                    -...    openanything.SmartRedirectHandler())           
                                    ->>> f = opener.open(request)
                                    +>>> request = urllib2.Request('http://diveintomark.org/redir/example301.xml')
                                    +>>> import openanything, httplib
                                    +>>> httplib.HTTPConnection.debuglevel = 1
                                    +>>> opener = urllib2.build_opener(
                                    +...    openanything.SmartRedirectHandler())           
                                    +>>> f = opener.open(request)
                                     connect: (diveintomark.org, 80)
                                     send: 'GET /redir/example301.xml HTTP/1.0
                                     Host: diveintomark.org
                                    @@ -5236,9 +5236,9 @@ header: Content-Length: 15955
                                     header: Connection: close
                                     header: Content-Type: application/atom+xml
                                     
                                    ->>> f.status       
                                    +>>> f.status       
                                     301
                                    ->>> f.url
                                    +>>> f.url
                                     'http://diveintomark.org/xml/atom.xml'
                                     
                                      @@ -5250,9 +5250,9 @@ header: Content-Type: application/atom+xml the server with requests at the old address. It's time to update your address book.

                                      The same redirect handler can also tell you that you shouldn't update your address book.

                                      Example 11.13. Using the redirect handler to detect temporary redirects

                                      ->>> request = urllib2.Request(
                                      -...    'http://diveintomark.org/redir/example302.xml')   
                                      ->>> f = opener.open(request)
                                      +>>> request = urllib2.Request(
                                      +...    'http://diveintomark.org/redir/example302.xml')   
                                      +>>> f = opener.open(request)
                                       connect: (diveintomark.org, 80)
                                       send: '
                                       GET /redir/example302.xml HTTP/1.0
                                      @@ -5281,9 +5281,9 @@ header: Accept-Ranges: bytes
                                       header: Content-Length: 15955
                                       header: Connection: close
                                       header: Content-Type: application/atom+xml
                                      ->>> f.status          
                                      +>>> f.status          
                                       302
                                      ->>> f.url
                                      +>>> f.url
                                       'http://diveintomark.org/xml/atom.xml'
                                       
                                        @@ -5300,12 +5300,12 @@ http://diveintomark.org/xml/atom.xml XML data compresses very well.

                                        Servers won't give you compressed data unless you tell them you can handle it.

                                        Example 11.14. Telling the server you would like compressed data

                                        ->>> import urllib2, httplib
                                        ->>> httplib.HTTPConnection.debuglevel = 1
                                        ->>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
                                        ->>> request.add_header('Accept-encoding', 'gzip')        
                                        ->>> opener = urllib2.build_opener()
                                        ->>> f = opener.open(request)
                                        +>>> import urllib2, httplib
                                        +>>> httplib.HTTPConnection.debuglevel = 1
                                        +>>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
                                        +>>> request.add_header('Accept-encoding', 'gzip')        
                                        +>>> opener = urllib2.build_opener()
                                        +>>> f = opener.open(request)
                                         connect: (diveintomark.org, 80)
                                         send: '
                                         GET /xml/atom.xml HTTP/1.0
                                        @@ -5332,15 +5332,15 @@ header: Content-Type: application/atom+xml
                                         
                                      1. The Content-Length header is the length of the compressed data, not the uncompressed data. As you'll see in a minute, the actual length of the uncompressed data was 15955, so gzip compression cut your bandwidth by over 60%!

                                        Example 11.15. Decompressing the data

                                        ->>> compresseddata = f.read()            
                                        ->>> len(compresseddata)
                                        +>>> compresseddata = f.read()            
                                        +>>> len(compresseddata)
                                         6289
                                        ->>> import StringIO
                                        ->>> compressedstream = StringIO.StringIO(compresseddata)   
                                        ->>> import gzip
                                        ->>> gzipper = gzip.GzipFile(fileobj=compressedstream)      
                                        ->>> data = gzipper.read()                
                                        ->>> print data         
                                        +>>> import StringIO
                                        +>>> compressedstream = StringIO.StringIO(compresseddata)   
                                        +>>> import gzip
                                        +>>> gzipper = gzip.GzipFile(fileobj=compressedstream)      
                                        +>>> data = gzipper.read()                
                                        +>>> print data         
                                         <?xml version="1.0" encoding="iso-8859-1"?>
                                         <feed version="0.3"
                                           xmlns="http://purl.org/atom/ns#"
                                        @@ -5349,7 +5349,7 @@ header: Content-Type: application/atom+xml
                                           <title mode="escaped">dive into mark</title>
                                           <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
                                           <-- rest of feed omitted for brevity -->
                                        ->>> len(data)
                                        +>>> len(data)
                                         15955
                                         
                                          @@ -5362,10 +5362,10 @@ header: Content-Type: application/atom+xml
                                        1. This is the line that does all the actual work: “reading” from GzipFile will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. gzipper is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; gzipper is really just “reading” from the file-like object you created with StringIO to wrap the compressed data, which is only in memory in the variable compresseddata. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built with urllib2.build_opener. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it.
                                        2. Look ma, real data. (15955 bytes of it, in fact.)
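
For readers on a current interpreter, here is a minimal sketch of the same chain, assuming Python 3, where io.BytesIO stands in for StringIO.StringIO when the data is bytes. To stay self-contained it compresses a made-up payload locally instead of downloading the feed, so the sizes it prints will not be 6289 and 15955.

```python
import gzip
import io

# Hypothetical stand-in for the feed; the real example downloads it over HTTP.
original = b"<feed>" + b"<title>dive into mark</title>" * 200 + b"</feed>"

# Stand-in for compresseddata = f.read() on a gzip-encoded response.
compresseddata = gzip.compress(original)

# Wrap the in-memory bytes in a file-like object, then let GzipFile "read" from it.
compressedstream = io.BytesIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()  # reading is the step that actually decompresses

print(len(compresseddata), len(data))
```

The shape is the same as in the example above: GzipFile only needs something with a read method, and it neither knows nor cares that the "file" lives in memory.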

                                          “But wait!” I hear you cry. “This could be even easier!” I know what you're thinking. You're thinking that opener.open returns a file-like object, so why not cut out the StringIO middleman and just pass f directly to GzipFile? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work.

                                          Example 11.16. Decompressing the data directly from the server

                                          ->>> f = opener.open(request)
                                          ->>> f.headers.get('Content-Encoding')         
                                          +>>> f = opener.open(request)
                                          +>>> f.headers.get('Content-Encoding')         
                                           'gzip'
                                          ->>> data = gzip.GzipFile(fileobj=f).read()    
                                          +>>> data = gzip.GzipFile(fileobj=f).read()    
                                           Traceback (most recent call last):
                                             File "<stdin>", line 1, in ?
                                             File "c:\python23\lib\gzip.py", line 217, in read
                                          @@ -5443,11 +5443,11 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
                                           
                                        3. If you got a URL back from the server, save it, and assume that the status code is 200 until you find out otherwise.
                                        4. If one of the custom URL handlers captured a status code, then save that too.

                                          Example 11.19. Using openanything.py

                                          ->>> import openanything
                                          ->>> useragent = 'MyHTTPWebServicesApp/1.0'
                                          ->>> url = 'http://diveintopython3.org/redir/example301.xml'
                                          ->>> params = openanything.fetch(url, agent=useragent)              
                                          ->>> params   
                                          +>>> import openanything
                                          +>>> useragent = 'MyHTTPWebServicesApp/1.0'
                                          +>>> url = 'http://diveintopython3.org/redir/example301.xml'
                                          +>>> params = openanything.fetch(url, agent=useragent)              
                                          +>>> params   
                                           {'url': 'http://diveintomark.org/xml/atom.xml', 
                                           'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT', 
                                           'etag': '"e842a-3e53-55d97640"', 
                                          @@ -5455,11 +5455,11 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
                                           'data': '<?xml version="1.0" encoding="iso-8859-1"?>
                                           <feed version="0.3"
                                           <-- rest of data omitted for brevity -->'}
                                          ->>> if params['status'] == 301:
                                          -...    url = params['url']
                                          ->>> newparams = openanything.fetch(
                                          -...    url, params['etag'], params['lastmodified'], useragent)    
                                          ->>> newparams
                                          +>>> if params['status'] == 301:
                                          +...    url = params['url']
                                          +>>> newparams = openanything.fetch(
                                          +...    url, params['etag'], params['lastmodified'], useragent)    
                                          +>>> newparams
                                           {'url': 'http://diveintomark.org/xml/atom.xml', 
                                           'lastmodified': None, 
                                           'etag': '"e842a-3e53-55d97640"', 
                                          @@ -5890,8 +5890,8 @@ def from_roman(s):
                                                       result += numeral
                                                       n -= integer
                                                       print 'subtracting', integer, 'from input, adding', numeral, 'to output'
                                          ->>> import roman2
                                          ->>> roman2.to_roman(1424)
                                          +>>> import roman2
                                          +>>> roman2.to_roman(1424)
                                           subtracting 1000 from input, adding M to output
                                           subtracting 400 from input, adding CD to output
                                           subtracting 10 from input, adding X to output
                                          @@ -6069,14 +6069,14 @@ def from_roman(s):
                                           
                                        5. This is the non-integer check. Non-integers cannot be converted to Roman numerals.
                                        6. The rest of the function is unchanged.
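
The two checks described above can be sketched as follows. This is not the book's roman3.py (that listing is not shown in this excerpt); it is a minimal reconstruction, assuming Python 3 syntax, with exception names borrowed from the traceback below and a numeral table that is my own assumption.

```python
class OutOfRangeError(ValueError):
    pass

class NotIntegerError(ValueError):
    pass

# Assumed lookup table, ordered from largest value to smallest.
ROMAN_NUMERAL_MAP = (
    ('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
    ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
    ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1),
)

def to_roman(n):
    """Convert an integer in 1..3999 to a Roman numeral string."""
    # Range check: reject anything outside 1..3999.
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    # Non-integer check: int(1.5) is 1, which differs from 1.5.
    if int(n) != n:
        raise NotIntegerError("non-integers cannot be converted")
    result = ''
    for numeral, integer in ROMAN_NUMERAL_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result

print(to_roman(1424))
```

Note the order of the checks: the range test runs first, so 4000.5 is reported as out of range rather than as a non-integer.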

                                          Example 14.7. Watching to_roman() handle bad input

                                          ->>> import roman3
                                          ->>> roman3.to_roman(4000)
                                          +>>> import roman3
                                          +>>> roman3.to_roman(4000)
                                           Traceback (most recent call last):
                                             File "<interactive input>", line 1, in ?
                                             File "roman3.py", line 27, in to_roman
                                               raise OutOfRangeError, "number out of range (must be 1..3999)"
                                           OutOfRangeError: number out of range (must be 1..3999)
                                          ->>> roman3.to_roman(1.5)
                                          +>>> roman3.to_roman(1.5)
                                           Traceback (most recent call last):
                                             File "<interactive input>", line 1, in ?
                                             File "roman3.py", line 29, in to_roman
                                          @@ -6214,8 +6214,8 @@ def from_roman(s):
                                                       result += integer
                                                       index += len(numeral)
                                                       print 'found', numeral, 'of length', len(numeral), ', adding', integer
                                          ->>> import roman4
                                          ->>> roman4.from_roman('MCMLXXII')
                                          +>>> import roman4
                                          +>>> roman4.from_roman('MCMLXXII')
                                           found M , of length 1, adding 1000
                                           found CM , of length 2, adding 900
                                           found L , of length 1, adding 50
                                          @@ -6394,8 +6394,8 @@ OK     

                                          Chapter 15. Refactoring

                                          15.1. Handling bugs

                                          Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven't written yet. -

                                          Example 15.1. The bug

                                          >>> import roman5
                                          ->>> roman5.from_roman("") 
                                          +

                                          Example 15.1. The bug

                                          >>> import roman5
                                          +>>> roman5.from_roman("") 
                                           0
                                          1. Remember in the previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals? @@ -6816,16 +6816,16 @@ program. It's probably not worth trying to do away with the regular expression altogether (it would be difficult, and it might not end up any faster), but you can speed up the function by precompiling the regular expression.

                                            Example 15.10. Compiling regular expressions

                                            ->>> import re
                                            ->>> pattern = '^M?M?M?$'
                                            ->>> re.search(pattern, 'M')               
                                            +>>> import re
                                            +>>> pattern = '^M?M?M?$'
                                            +>>> re.search(pattern, 'M')               
                                             <SRE_Match object at 01090490>
                                            ->>> compiledPattern = re.compile(pattern) 
                                            ->>> compiledPattern
                                            +>>> compiledPattern = re.compile(pattern) 
                                            +>>> compiledPattern
                                             <SRE_Pattern object at 00F06E28>
                                            ->>> dir(compiledPattern)
                                            +>>> dir(compiledPattern)
                                             ['findall', 'match', 'scanner', 'search', 'split', 'sub', 'subn']
                                            ->>> compiledPattern.search('M')           
                                            +>>> compiledPattern.search('M')           
                                             <SRE_Match object at 01104928>
                                            1. This is the syntax you've seen before: re.search takes a regular expression as a string (pattern) and a string to match against it ('M'). If the pattern matches, the function returns a match object which can be queried to find out exactly what matched and @@ -7134,7 +7134,7 @@ if __name__ == "__main__":

                                          Running this script in the same directory as the rest of the example scripts that come with this book will find all the unit tests, named moduletest.py, run them as a single test, and pass or fail them all at once.

                                          Example 16.2. Sample output of regression.py

                                          -[you@localhost py]$ python regression.py -v
                                          +[you@localhost py]$ python regression.py -v
                                           help should fail with no object ... ok           
                                           help should return known result for apihelper ... ok
                                           help should honor collapse argument ... ok
                                          @@ -7195,16 +7195,16 @@ print 'full path =', os.path.abspath(pathname) 
                                          os.path.abspath is the key here. It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.

                                          os.path.abspath deserves further explanation. It is very flexible; it can take any kind of pathname.

                                          Example 16.4. Further explanation of os.path.abspath

                                          ->>> import os
                                          ->>> os.getcwd()      
                                          +>>> import os
                                          +>>> os.getcwd()      
                                           /home/you
                                          ->>> os.path.abspath('')                
                                          +>>> os.path.abspath('')                
                                           /home/you
                                          ->>> os.path.abspath('.ssh')            
                                          +>>> os.path.abspath('.ssh')            
                                           /home/you/.ssh
                                          ->>> os.path.abspath('/home/you/.ssh') 
                                          +>>> os.path.abspath('/home/you/.ssh') 
                                           /home/you/.ssh
                                          ->>> os.path.abspath('.ssh/../foo/')    
                                          +>>> os.path.abspath('.ssh/../foo/')    
                                           /home/you/foo
                                          1. os.getcwd() returns the current working directory. @@ -7220,16 +7220,16 @@ print 'full path =', os.path.abspath(pathname)
                                        7. Note
                    os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without turning it into a full pathname, use os.path.normpath instead.
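
To see the difference between the two functions side by side, here is a small sketch. The relative path is the one from the note above; the absolute result depends on whatever directory you happen to run it from, so no specific output is claimed for it.

```python
import os.path

relative = 'bin/../local/bin'

# normpath simplifies the path but leaves it relative.
normalized = os.path.normpath(relative)

# abspath simplifies it and anchors it to the current working directory.
absolute = os.path.abspath(relative)

print(normalized)
print(absolute)
```

In other words, abspath does everything normpath does and then makes the result fully qualified.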

                    Example 16.5. Sample output from fullpath.py

                    -[you@localhost py]$ python /home/you/diveintopython3/common/py/fullpath.py 
                    +[you@localhost py]$ python /home/you/diveintopython3/common/py/fullpath.py 
                     sys.argv[0] = /home/you/diveintopython3/common/py/fullpath.py
                     path = /home/you/diveintopython3/common/py
                     full path = /home/you/diveintopython3/common/py
                    -[you@localhost diveintopython3]$ python common/py/fullpath.py               
                    +[you@localhost diveintopython3]$ python common/py/fullpath.py               
                     sys.argv[0] = common/py/fullpath.py
                     path = common/py
                     full path = /home/you/diveintopython3/common/py
                    -[you@localhost diveintopython3]$ cd common/py
                    -[you@localhost py]$ python fullpath.py 
                    +[you@localhost diveintopython3]$ cd common/py
                    +[you@localhost py]$ python fullpath.py 
                     sys.argv[0] = fullpath.py
                     path = 
                     full path = /home/you/diveintopython3/common/py
                    @@ -7264,19 +7264,19 @@ def regressionTest(): [7] The function passed as the first argument to filter must itself take one argument, and the list that filter returns will contain all the elements from the list passed to filter for which the function passed to filter returns true.

                    Got all that? It's not as difficult as it sounds.

                    Example 16.7. Introducing filter

                    ->>> def odd(n):                 
                    -...    return n % 2
                    -...    
                    ->>> li = [1, 2, 3, 5, 9, 10, 256, -3]
                    ->>> filter(odd, li)             
                    +>>> def odd(n):                 
                    +...    return n % 2
                    +...    
                    +>>> li = [1, 2, 3, 5, 9, 10, 256, -3]
                    +>>> filter(odd, li)             
                     [1, 3, 5, 9, -3]
                    ->>> [e for e in li if odd(e)]   
                    ->>> filteredList = []
                    ->>> for n in li:                
                    -...    if odd(n):
                    -...        filteredList.append(n)
                    -...    
                    ->>> filteredList
                    +>>> [e for e in li if odd(e)]   
                    +>>> filteredList = []
                    +>>> for n in li:                
                    +...    if odd(n):
                    +...        filteredList.append(n)
                    +...    
                    +>>> filteredList
                     [1, 3, 5, 9, -3]
                    1. odd uses the built-in modulo operator “%” to return 1 (which is true) if n is odd and 0 (which is false) if n is even. @@ -7307,19 +7307,19 @@ There is discussion that map and filter might be depre

                      16.4. Mapping lists revisited

                      You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in map function. It works much the same way as the filter function.

                      Example 16.10. Introducing map

                      ->>> def double(n):
                      -...    return n*2
                      -...    
                      ->>> li = [1, 2, 3, 5, 9, 10, 256, -3]
                      ->>> map(double, li)     
                      +>>> def double(n):
                      +...    return n*2
                      +...    
                      +>>> li = [1, 2, 3, 5, 9, 10, 256, -3]
                      +>>> map(double, li)     
                       [2, 4, 6, 10, 18, 20, 512, -6]
                      ->>> [double(n) for n in li]               
                      +>>> [double(n) for n in li]               
                       [2, 4, 6, 10, 18, 20, 512, -6]
                      ->>> newlist = []
                      ->>> for n in li:        
                      -...    newlist.append(double(n))
                      -...    
                      ->>> newlist
                      +>>> newlist = []
                      +>>> for n in li:        
                      +...    newlist.append(double(n))
                      +...    
                      +>>> newlist
                       [2, 4, 6, 10, 18, 20, 512, -6]
                      1. map takes a function and a list[8] and returns a new list by calling the function with each element of the list in order. In this case, the function simply @@ -7327,8 +7327,8 @@ There is discussion that map and filter might be depre
                      2. You could accomplish the same thing with a list comprehension. List comprehensions were first introduced in Python 2.0; map has been around forever.
                      3. You could, if you insist on thinking like a Visual Basic programmer, use a for loop to accomplish the same thing.
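
One caveat if you try this on a newer interpreter: the session above is Python 2, where map returns a list. In Python 3, map returns a lazy iterator, so you need to wrap it in list() to see the values. A minimal sketch, assuming Python 3:

```python
def double(n):
    return n * 2

li = [1, 2, 3, 5, 9, 10, 256, -3]

# In Python 3, map() is lazy; list() forces it to produce every element.
mapped = list(map(double, li))

# The list comprehension builds the same list directly.
comprehended = [double(n) for n in li]

print(mapped)
print(mapped == comprehended)
```

The list comprehension behaves the same way in both versions, which is one practical reason to prefer it.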

                        Example 16.11. map with lists of mixed datatypes

                        ->>> li = [5, 'a', (2, 'b')]
                        ->>> map(double, li)     
                        +>>> li = [5, 'a', (2, 'b')]
                        +>>> map(double, li)     
                         [10, 'aa', (2, 'b', 2, 'b')]
                        1. As a side note, I'd like to point out that map works just as well with lists of mixed datatypes, as long as the function you're using correctly handles each type. In this @@ -7379,14 +7379,14 @@ import sys, os, re, unittest
                        2. This imports four modules at once: sys (for system functions and access to the command line parameters), os (for operating system functions like directory listings), re (for regular expressions), and unittest (for unit testing).

                          Now let's do the same thing, but with dynamic imports.

                          Example 16.14. Importing modules dynamically

                          ->>> sys = __import__('sys')           
                          ->>> os = __import__('os')
                          ->>> re = __import__('re')
                          ->>> unittest = __import__('unittest')
                          ->>> sys             
                          ->>> <module 'sys' (built-in)>
                          ->>> os
                          ->>> <module 'os' from '/usr/local/lib/python2.2/os.pyc'>
                          +>>> sys = __import__('sys')           
                          +>>> os = __import__('os')
                          +>>> re = __import__('re')
                          +>>> unittest = __import__('unittest')
                          +>>> sys             
                          +<module 'sys' (built-in)>
                          +>>> os
                          +<module 'os' from '/usr/local/lib/python2.2/os.pyc'>
                           
                          1. The built-in __import__ function accomplishes the same goal as using the import statement, but it's an actual function, and it takes a string as an argument. @@ -7395,19 +7395,19 @@ import sys, os, re, unittest but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.

                            Example 16.15. Importing a list of modules dynamically

                            ->>> moduleNames = ['sys', 'os', 're', 'unittest'] 
                            ->>> moduleNames
                            +>>> moduleNames = ['sys', 'os', 're', 'unittest'] 
                            +>>> moduleNames
                             ['sys', 'os', 're', 'unittest']
                            ->>> modules = map(__import__, moduleNames)        
                            ->>> modules   
                            +>>> modules = map(__import__, moduleNames)        
                            +>>> modules   
                             [<module 'sys' (built-in)>,
                             <module 'os' from 'c:\Python22\lib\os.pyc'>,
                             <module 're' from 'c:\Python22\lib\re.pyc'>,
                             <module 'unittest' from 'c:\Python22\lib\unittest.pyc'>]
                            ->>> modules[0].version          
                            +>>> modules[0].version          
                             '2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
                            ->>> import sys
                            ->>> sys.version
                            +>>> import sys
                            +>>> sys.version
                             '2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
                             
                              @@ -7434,10 +7434,10 @@ load = unittest.defaultTestLoader.loadTestsFromModule return unittest.TestSuite(map(load, modules))

                              Let's look at it line by line, interactively. Assume that the current directory is c:\diveintopython3\py, which contains the examples that come with this book, including this chapter's script. As you saw in Section 16.2, “Finding the path”, the script directory will end up in the path variable, so let's hard-code that and go from there.

                              Example 16.17. Step 1: Get all the files

                              ->>> import sys, os, re, unittest
                              ->>> path = r'c:\diveintopython3\py'
                              ->>> files = os.listdir(path)             
                              ->>> files 
                              +>>> import sys, os, re, unittest
                              +>>> path = r'c:\diveintopython3\py'
                              +>>> files = os.listdir(path)             
                              +>>> files 
                               ['BaseHTMLProcessor.py', 'LICENSE.txt', 'apihelper.py', 'apihelpertest.py',
                               'argecho.py', 'autosize.py', 'builddialectexamples.py', 'dialect.py',
                               'fileinfo.py', 'fullpath.py', 'kgptest.py', 'makerealworddoc.py',
                              @@ -7450,9 +7450,9 @@ return unittest.TestSuite(map(load, modules))
                               
                            1. files is a list of all the files and directories in the script's directory. (If you've been running some of the examples already, you may see some .pyc files in there as well.)

                              Example 16.18. Step 2: Filter to find the files you care about

                              ->>> test = re.compile("test\.py$", re.IGNORECASE)           
                              ->>> files = filter(test.search, files)    
                              ->>> files               
                              +>>> test = re.compile("test\.py$", re.IGNORECASE)           
                              +>>> files = filter(test.search, files)    
                              +>>> files               
                               ['apihelpertest.py', 'kgptest.py', 'odbchelpertest.py', 'pluraltest.py', 'romantest.py']
                               
                                @@ -7461,13 +7461,13 @@ return unittest.TestSuite(map(load, modules)) to find the ones that match the regular expression.
                              1. And you're left with the list of unit testing scripts, because they were the only ones named SOMETHINGtest.py.

                                Example 16.19. Step 3: Map filenames to module names

                                ->>> filenameToModuleName = lambda f: os.path.splitext(f)[0] 
                                ->>> filenameToModuleName('romantest.py')  
                                +>>> filenameToModuleName = lambda f: os.path.splitext(f)[0] 
                                +>>> filenameToModuleName('romantest.py')  
                                 'romantest'
                                ->>> filenameToModuleName('odchelpertest.py')
                                +>>> filenameToModuleName('odbchelpertest.py')
                                 'odbchelpertest'
                                ->>> moduleNames = map(filenameToModuleName, files)          
                                ->>> moduleNames         
                                +>>> moduleNames = map(filenameToModuleName, files)          
                                +>>> moduleNames         
                                 ['apihelpertest', 'kgptest', 'odbchelpertest', 'pluraltest', 'romantest']
                                 
                                  @@ -7477,14 +7477,14 @@ return unittest.TestSuite(map(load, modules))
                                1. Now you can apply this function to each file in the list of unit test files, using map.
                                2. And the result is just what you wanted: a list of module names, as strings.
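
Steps 1 through 3 fit together as a short pipeline. The sketch below assumes Python 3 and uses list comprehensions where the text uses filter and map; the filename list is a shortened, hard-coded stand-in for what os.listdir would return.

```python
import os.path
import re

# Hypothetical directory listing; the real script gets this from os.listdir(path).
files = ['BaseHTMLProcessor.py', 'LICENSE.txt', 'apihelper.py',
         'apihelpertest.py', 'kgptest.py', 'roman.py', 'romantest.py']

# Step 2: keep only the names ending in "test.py" (case-insensitive).
test = re.compile(r"test\.py$", re.IGNORECASE)
test_files = [f for f in files if test.search(f)]

# Step 3: strip the extension to turn each filename into a module name.
module_names = [os.path.splitext(f)[0] for f in test_files]

print(module_names)
```

The raw string r"test\.py$" is a small deliberate change: it keeps the backslash from being treated as a string escape.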

                                  Example 16.20. Step 4: Mapping module names to modules

                                  ->>> modules = map(__import__, moduleNames)
                                  ->>> modules             
                                  +>>> modules = map(__import__, moduleNames)
                                  +>>> modules             
                                   [<module 'apihelpertest' from 'apihelpertest.py'>,
                                   <module 'kgptest' from 'kgptest.py'>,
                                   <module 'odbchelpertest' from 'odbchelpertest.py'>,
                                   <module 'pluraltest' from 'pluraltest.py'>,
                                   <module 'romantest' from 'romantest.py'>]
                                  ->>> modules[-1]         
                                  +>>> modules[-1]         
                                   <module 'romantest' from 'romantest.py'>
                                   
                                    @@ -7493,8 +7493,8 @@ return unittest.TestSuite(map(load, modules))
                                  1. modules is now a list of modules, fully accessible like any other module.
                                  2. The last module in the list is the romantest module, just as if you had said import romantest.
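
If you want to try this without the book's test scripts on hand, here is a sketch that imports a few standard library modules by name instead. It assumes Python 3 and uses importlib.import_module, which the standard library provides as a friendlier front end to __import__; swapping it in is my choice, not something the text above does.

```python
import importlib

# Standard library names stand in for the book's *test modules.
module_names = ['sys', 'os', 're', 'unittest']

# Each string becomes a real module object.
modules = [importlib.import_module(name) for name in module_names]

# The last element is the same object a plain "import unittest" would give you.
import unittest
print(modules[-1] is unittest)
```

Because Python caches imported modules in sys.modules, importing by string and importing by statement hand you the very same object.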

                                    Example 16.21. Step 5: Loading the modules into a test suite

                                    ->>> load = unittest.defaultTestLoader.loadTestsFromModule  
                                    ->>> map(load, modules)   
                                    +>>> load = unittest.defaultTestLoader.loadTestsFromModule  
                                    +>>> map(load, modules)   
                                     [<unittest.TestSuite tests=[
                                       <unittest.TestSuite tests=[<apihelpertest.BadInput testMethod=testNoObject>]>,
                                       <unittest.TestSuite tests=[<apihelpertest.KnownValues testMethod=testApiHelper>]>,
                                    @@ -7504,7 +7504,7 @@ return unittest.TestSuite(map(load, modules))
                                         ...
                                       ]
                                     ]
                                    ->>> unittest.TestSuite(map(load, modules)) 
                                    +>>> unittest.TestSuite(map(load, modules)) 
                                     
                                    1. These are real module objects. Not only can you access them like any other module, instantiate classes and call functions, @@ -7588,14 +7588,14 @@ def plural(noun):
                                    2. OK, this is a regular expression, but it uses a syntax you didn't see in Chapter 7, Regular Expressions. The square brackets mean “match exactly one of these characters”. So [sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. So you're checking to see if noun ends with s, x, or z.
                                    3. This re.sub function performs regular expression-based string substitutions. Let's look at it in more detail.

                                      Example 17.2. Introducing re.sub

                                      ->>> import re
                                      ->>> re.search('[abc]', 'Mark')   
                                      +>>> import re
                                      +>>> re.search('[abc]', 'Mark')   
                                       <_sre.SRE_Match object at 0x001C1FA8>
                                      ->>> re.sub('[abc]', 'o', 'Mark') 
                                      +>>> re.sub('[abc]', 'o', 'Mark') 
                                       'Mork'
                                      ->>> re.sub('[abc]', 'o', 'rock') 
                                      +>>> re.sub('[abc]', 'o', 'rock') 
                                       'rook'
                                      ->>> re.sub('[abc]', 'o', 'caps') 
                                      +>>> re.sub('[abc]', 'o', 'caps') 
                                       'oops'
                                       
                                        @@ -7621,26 +7621,26 @@ def plural(noun):
                                      1. Look closely, this is another new variation. The ^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. You're looking for words that end in H where the H can be heard.
                                      2. Same pattern here: match words that end in Y, where the character before the Y is not a, e, i, o, or u. You're looking for words that end in a Y that sounds like I.

                                        Example 17.4. More on negation regular expressions

                                        ->>> import re
                                        ->>> re.search('[^aeiou]y$', 'vacancy') 
                                        +>>> import re
                                        +>>> re.search('[^aeiou]y$', 'vacancy') 
                                         <_sre.SRE_Match object at 0x001C1FA8>
                                        ->>> re.search('[^aeiou]y$', 'boy')     
                                        ->>> 
                                        ->>> re.search('[^aeiou]y$', 'day')
                                        ->>> 
                                        ->>> re.search('[^aeiou]y$', 'pita')    
                                        ->>> 
                                        +>>> re.search('[^aeiou]y$', 'boy')     
                                        +>>> 
                                        +>>> re.search('[^aeiou]y$', 'day')
                                        +>>> 
                                        +>>> re.search('[^aeiou]y$', 'pita')    
                                        +>>> 
                                         
                                        1. vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
                                        2. boy does not match, because it ends in oy, and you specifically said that the character before the y could not be o. day does not match, because it ends in ay.
                                        3. pita does not match, because it does not end in y.
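The negated character class above is enough to drive a small pluralization rule. A minimal sketch in Python 3 syntax; the helper name pluralize_y is hypothetical and is not taken from the book's code:

```python
import re

def pluralize_y(noun):
    """Hypothetical helper: consonant + y becomes 'ies', anything else gets 's'."""
    if re.search('[^aeiou]y$', noun):
        # The character before the final y is not a, e, i, o, or u,
        # so replace the trailing y with 'ies'.
        return re.sub('y$', 'ies', noun)
    # No match (for example 'boy', 'day', 'pita'): just append 's'.
    return noun + 's'

for word in ('vacancy', 'boy', 'day', 'pita'):
    print(word, '->', pluralize_y(word))
```

The search and the substitution are kept as two separate steps here so that each one mirrors one of the interactive examples.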

                                          Example 17.5. More on re.sub

                                          ->>> re.sub('y$', 'ies', 'vacancy')              
                                          +>>> re.sub('y$', 'ies', 'vacancy')              
                                           'vacancies'
                                          ->>> re.sub('y$', 'ies', 'agency')
                                          +>>> re.sub('y$', 'ies', 'agency')
                                           'agencies'
                                          ->>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') 
                                          +>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') 
                                           'vacancies'
                                           
                                            @@ -7826,12 +7826,12 @@ def buildMatchAndApplyFunctions((pattern, search, replace)): ①
                                            pattern, search, and replace. Confused yet? Let's see it in action.

                                            Example 17.14. Expanding tuples when calling functions

                                            ->>> def foo((a, b, c)):
                                            -...    print c
                                            -...    print b
                                            -...    print a
                                            ->>> parameters = ('apple', 'bear', 'catnap')
                                            ->>> foo(parameters) 
                                            +>>> def foo((a, b, c)):
                                            +...    print c
                                            +...    print b
                                            +...    print a
                                            +>>> parameters = ('apple', 'bear', 'catnap')
                                            +>>> foo(parameters) 
                                             catnap
                                             bear
                                             apple
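Tuple parameters such as def foo((a, b, c)) were removed in Python 3 (PEP 3113), so the definition above is a syntax error there. A minimal sketch of one equivalent, assuming the function may take a single argument and unpack it itself; returning the values instead of printing them inside the function is a choice made here for illustration:

```python
def foo(params):
    # Python 3 no longer unpacks tuples in the parameter list,
    # so unpack explicitly on the first line of the body.
    a, b, c = params
    return c, b, a

parameters = ('apple', 'bear', 'catnap')
for item in foo(parameters):
    print(item)
```

The call site is unchanged: you still pass one tuple, and the function still sees three names.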
                                            @@ -7902,23 +7902,23 @@ def plural(noun, language='en'):
                                                     if result: return result      
                                             

                                            This uses a technique called generators, which I'm not even going to try to explain until you look at a simpler example first.

                                            Example 17.18. Introducing generators

                                            ->>> def make_counter(x):
                                            -...    print 'entering make_counter'
                                            -...    while 1:
                                            -...        yield x               
                                            -...        print 'incrementing x'
                                            -...        x = x + 1
                                            -...    
                                            ->>> counter = make_counter(2) 
                                            ->>> counter 
                                            +>>> def make_counter(x):
                                            +...    print 'entering make_counter'
                                            +...    while 1:
                                            +...        yield x               
                                            +...        print 'incrementing x'
                                            +...        x = x + 1
                                            +...    
                                            +>>> counter = make_counter(2) 
                                            +>>> counter 
                                             <generator object at 0x001C9C10>
                                            ->>> counter.next()            
                                            +>>> counter.next()            
                                             entering make_counter
                                             2
                                            ->>> counter.next()            
                                            +>>> counter.next()            
                                             incrementing x
                                             3
                                            ->>> counter.next()            
                                            +>>> counter.next()            
                                             incrementing x
                                             4
                                             
                                            @@ -7947,8 +7947,8 @@ def fibonacci(max):

                                            So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops.

                                            Example 17.20. Generators in for loops

                                            ->>> for n in fibonacci(1000): 
                                            -...    print n,              
                                            +>>> for n in fibonacci(1000): 
                                            +...    print n,              
                                             0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
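In Python 3 syntax the generator examples change in two small ways: print is a function, and a generator is advanced with the built-in next() rather than a .next() method. A minimal sketch; the body of fibonacci is not shown in this excerpt, so the definition below is an assumption:

```python
def fibonacci(max):
    # Assumed definition: yield successive Fibonacci numbers below max.
    a, b = 0, 1
    while a < max:
        yield a
        a, b = b, a + b

gen = fibonacci(1000)
print(next(gen))   # advance the generator by hand
print(next(gen))

# A for loop calls next() for you and stops when the generator is exhausted.
for n in fibonacci(1000):
    print(n, end=' ')
print()
```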
                                             
                                              @@ -8142,12 +8142,12 @@ in your timing framework will irreparably skew your results.

                                              Example 18.2. Introducing timeit

                                              If you have not already done so, you can download this and other examples used in this book.

                                              ->>> import timeit
                                              ->>> t = timeit.Timer("soundex.soundex('Pilgrim')",
                                              -...    "import soundex")   
                                              ->>> t.timeit()              
                                              +>>> import timeit
                                              +>>> t = timeit.Timer("soundex.soundex('Pilgrim')",
                                              +...    "import soundex")   
                                              +>>> t.timeit()              
                                               8.21683733547
                                              ->>> t.repeat(3, 2000000)    
                                              +>>> t.repeat(3, 2000000)    
                                               [16.48319309109, 16.46128984923, 16.44203948912]
                                               
                                                @@ -8167,7 +8167,7 @@ or in the Python interpreter; they took longer because of those pesky background
                                                have too much variability to trust the results. Otherwise, take the minimum time and discard the rest.

                                                Python has a handy min function that takes a list and returns the smallest value:

                                                ->>> min(t.repeat(3, 1000000))
                                                +>>> min(t.repeat(3, 1000000))
                                                 8.22203948912
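The same take-the-minimum pattern, as a self-contained sketch in Python 3 syntax. The soundex module is not available here, so a stand-in statement is timed instead, and the repeat and number values are arbitrary small choices:

```python
import timeit

# Stand-in workload (an assumption); substitute the statement and
# setup string you actually want to measure.
t = timeit.Timer("'-'.join(str(n) for n in range(100))")

# Three trials of 1000 calls each; slower trials mostly reflect
# background noise, so keep only the fastest one.
timings = t.repeat(repeat=3, number=1000)
best = min(timings)
print(best)
```

The printed value depends on the machine, so no expected output is shown.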
                                                 

                                                The timeit module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out the hotshot module.

                                                18.3. Optimizing Regular Expressions

                                                @@ -8193,7 +8193,7 @@ if __name__ == '__main__':
                                                         print name.ljust(15), soundex(name), min(t.repeat())
                            2. So how does soundex1a.py perform with this regular expression?

                              -C:\samples\soundex\stage1>python soundex1a.py
                              +C:\samples\soundex\stage1>python soundex1a.py
                               Woo             W000 19.3356647283
                               Pilgrim         P426 24.0772053431
                               Flingjingwaller F452 35.0463220884
                              @@ -8209,7 +8209,7 @@ of different cases.
                                       return "0000"
                               

                              timeit says soundex1b.py is slightly faster than soundex1a.py, but nothing to get terribly excited about:

                              -C:\samples\soundex\stage1>python soundex1b.py
                              +C:\samples\soundex\stage1>python soundex1b.py
                               Woo             W000 17.1361133887
                               Pilgrim         P426 21.8201693232
                               Flingjingwaller F452 32.7262294509
                              @@ -8222,7 +8222,7 @@ def soundex(source):
                                       return "0000"
                               

                              Using a compiled regular expression in soundex1c.py is significantly faster:

                              -C:\samples\soundex\stage1>python soundex1c.py
                              +C:\samples\soundex\stage1>python soundex1c.py
                               Woo             W000 14.5348347346
                               Pilgrim         P426 19.2784703084
                               Flingjingwaller F452 30.0893873383
                              @@ -8237,7 +8237,7 @@ character, and do away with regular expressions altogether?
                                           return "0000"
                               

                              It turns out that this technique in soundex1d.py is not faster than using a compiled regular expression (although it is faster than using a non-compiled regular expression):

                              -C:\samples\soundex\stage1>python soundex1d.py
                              +C:\samples\soundex\stage1>python soundex1d.py
                               Woo             W000 15.4065058548
                               Pilgrim         P426 22.2753567842
                               Flingjingwaller F452 37.5845122774
                              @@ -8252,7 +8252,7 @@ The method is called isalpha(), and it checks whether a string cont
                                       return "0000"
                               

                              How much did we gain by using this specific method in soundex1e.py? Quite a bit.

                              -C:\samples\soundex\stage1>python soundex1e.py
                              +C:\samples\soundex\stage1>python soundex1e.py
                               Woo             W000 13.5069504644
                               Pilgrim         P426 18.2199394057
                               Flingjingwaller F452 28.9975225902
                              @@ -8352,7 +8352,7 @@ def soundex(source):
                                       digits += charToSoundex[s]
                               

                              You timed soundex1c.py already; this is how it performs:

                              -C:\samples\soundex\stage1>python soundex1c.py
                              +C:\samples\soundex\stage1>python soundex1c.py
                               Woo             W000 14.5341678901
                               Pilgrim         P426 19.2650071448
                               Flingjingwaller F452 30.1003563302
                              @@ -8368,7 +8368,7 @@ def soundex(source):
                                   digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
                               

                              Surprisingly, soundex2a.py is not faster:

                              -C:\samples\soundex\stage2>python soundex2a.py
                              +C:\samples\soundex\stage2>python soundex2a.py
                               Woo             W000 15.0097526362
                               Pilgrim         P426 19.254806407
                               Flingjingwaller F452 29.3790847719
                              @@ -8379,7 +8379,7 @@ Flingjingwaller F452 29.3790847719
                                   digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
                               

                              Using a list comprehension in soundex2b.py is faster than using map and lambda in soundex2a.py, but still not faster than the original code (incrementally building a string in soundex1c.py):

                              -C:\samples\soundex\stage2>python soundex2b.py
                              +C:\samples\soundex\stage2>python soundex2b.py
                               Woo             W000 13.4221324219
                               Pilgrim         P426 16.4901234654
                               Flingjingwaller F452 25.8186157738
                              @@ -8398,7 +8398,7 @@ to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized d
                               string method translate, which translates each character into the corresponding digit, according to the matrix defined by string.maketrans.
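In Python 3 the translation table is built with the static method str.maketrans rather than string.maketrans. A minimal sketch, assuming the letter-to-digit assignments below (the standard Soundex groups, with 9 as a placeholder for letters that are dropped later); the table used by the actual script is not shown in this excerpt:

```python
import string

# Assumed mapping: one digit per letter A..Z; 9 marks letters to discard.
DIGITS = "91239129922455912623919292"

# Map both uppercase and lowercase letters to the same digits.
table = str.maketrans(string.ascii_uppercase + string.ascii_lowercase,
                      DIGITS * 2)

# Translate everything after the first letter in a single call.
print("Pilgrim"[1:].translate(table))
```

The whole string is converted in one call, with no explicit Python-level loop over the characters.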
                               

                              timeit shows that soundex2c.py is significantly faster than defining a dictionary and looping through the input and building the output incrementally:

                              -C:\samples\soundex\stage2>python soundex2c.py
                              +C:\samples\soundex\stage2>python soundex2c.py
                               Woo             W000 11.437645008
                               Pilgrim         P426 13.2825062962
                               Flingjingwaller F452 18.5570110168
                              @@ -8440,7 +8440,7 @@ if __name__ == '__main__':
                                           digits2 += d
                               

                              Here are the performance results for soundex2c.py:

                              -C:\samples\soundex\stage2>python soundex2c.py
                              +C:\samples\soundex\stage2>python soundex2c.py
                               Woo             W000 12.6070768771
                               Pilgrim         P426 14.4033353401
                               Flingjingwaller F452 19.7774882003
                              @@ -8456,7 +8456,7 @@ variable, and checking that instead?
                                           last_digit = d
                               

                              soundex3a.py does not run any faster than soundex2c.py, and may even be slightly slower (although it's not enough of a difference to say for sure):

                              -C:\samples\soundex\stage3>python soundex3a.py
                              +C:\samples\soundex\stage3>python soundex3a.py
                               Woo             W000 11.5346048171
                               Pilgrim         P426 13.3950636184
                               Flingjingwaller F452 18.6108927252
                              @@ -8473,7 +8473,7 @@ from the previous character. That will give you a list of characters, and you ca
                                    if i == 0 or digits[i-1] != digits[i]])
                               

                              Is this faster? In a word, no.

                              -C:\samples\soundex\stage3>python soundex3b.py
                              +C:\samples\soundex\stage3>python soundex3b.py
                               Woo             W000 14.2245271396
                               Pilgrim         P426 17.8337165757
                               Flingjingwaller F452 25.9954005327
                              @@ -8491,7 +8491,7 @@ within a single list?
                                   digits2 = "".join(digits)
                               

                              Is this faster than soundex3a.py or soundex3b.py? No, in fact it's the slowest method yet:

                              -C:\samples\soundex\stage3>python soundex3c.py
                              +C:\samples\soundex\stage3>python soundex3c.py
                               Woo             W000 14.1662554878
                               Pilgrim         P426 16.0397885765
                               Flingjingwaller F452 22.1789341942
                              @@ -8534,7 +8534,7 @@ if __name__ == '__main__':
                                   return digits3[:4]
                               

                              These are the results for soundex2c.py:

                              -C:\samples\soundex\stage2>python soundex2c.py
                              +C:\samples\soundex\stage2>python soundex2c.py
                               Woo             W000 12.6070768771
                               Pilgrim         P426 14.4033353401
                               Flingjingwaller F452 19.7774882003
                              @@ -8546,7 +8546,7 @@ Flingjingwaller F452 19.7774882003
                                           digits3 += d
                               

                              Is soundex4a.py faster? Yes it is:

                              -C:\samples\soundex\stage4>python soundex4a.py
                              +C:\samples\soundex\stage4>python soundex4a.py
                               Woo             W000 6.62865531792
                               Pilgrim         P426 9.02247576158
                               Flingjingwaller F452 13.6328416042
                              @@ -8555,7 +8555,7 @@ Flingjingwaller F452 13.6328416042
                                   digits3 = digits2.replace('9', '')
                               

                              Is soundex4b.py faster? That's an interesting question. It depends on the input:

                              -C:\samples\soundex\stage4>python soundex4b.py
                              +C:\samples\soundex\stage4>python soundex4b.py
                               Woo             W000 6.75477414029
                               Pilgrim         P426 7.56652144337
                               Flingjingwaller F452 10.8727729362
                              @@ -8572,7 +8572,7 @@ we already have at least one character (the initial letter, which is passed unch
                               exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution.
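The pad-then-slice idea can be sketched on its own, in Python 3 syntax. The function name is hypothetical: append three zeros unconditionally, then keep the first four characters, so no loop is needed to pad short results.

```python
def pad_to_four(code):
    # Hypothetical helper: always add '000', then truncate to four characters.
    # Short codes get padded; long codes get cut; exact ones pass through.
    return (code + '000')[:4]

for code in ('W', 'P4265', 'F452'):
    print(code, '->', pad_to_four(code))
```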
                               

                              How much speed do we gain in soundex4c.py by dropping the while loop? It's significant:

                              -C:\samples\soundex\stage4>python soundex4c.py
                              +C:\samples\soundex\stage4>python soundex4c.py
                               Woo             W000 4.89129791636
                               Pilgrim         P426 7.30642134685
                               Flingjingwaller F452 10.689832367
                              @@ -8582,7 +8582,7 @@ one line. Take a look at soundex/stage4/soundex4d.py:
                                   return (digits2.replace('9', '') + '000')[:4]
                               

                              Putting all this code on one line in soundex4d.py is barely faster than soundex4c.py:

                              -C:\samples\soundex\stage4>python soundex4d.py
                              +C:\samples\soundex\stage4>python soundex4d.py
                               Woo             W000 4.93624105857
                               Pilgrim         P426 7.19747593619
                               Flingjingwaller F452 10.5490700634
                              diff --git a/dip3.css b/dip3.css
                              index c5dfd70..8ba2637 100644
                              --- a/dip3.css
                              +++ b/dip3.css
                              @@ -1,16 +1,16 @@
                               /* typography */
                              -body,.widgets a{font:medium 'Gill Sans','Gill Sans MT',Corbel,Helvetica,Jara,'Nimbus Sans L',sans-serif;line-height:1.75;word-spacing:0.1em}
                              +body,.w a{font:medium 'Gill Sans','Gill Sans MT',Corbel,Helvetica,Jara,'Nimbus Sans L',sans-serif;line-height:1.75;word-spacing:0.1em}
                               pre,kbd,code,samp{font-family:Consolas,'Andale Mono',Monaco,'Liberation Mono','Bitstream Vera Sans Mono','DejaVu Sans Mono',monospace;font-size:medium;line-height:1.75;word-spacing:0}
                              -span,tr + tr th:first-child{font:medium 'Arial Unicode MS',FreeSerif,OpenSymbol,'DejaVu Sans',sans-serif}
                              +span{font:medium 'Arial Unicode MS',FreeSerif,OpenSymbol,'DejaVu Sans',sans-serif}
                               pre span{font-family:'Arial Unicode MS','DejaVu Sans',FreeSerif,OpenSymbol,sans-serif}
                               .baa{font:oblique large Constantia,Baskerville,Palatino,'Palatino Linotype','URW Palladio L',serif}
                               abbr{font-variant:small-caps;text-transform:lowercase;letter-spacing:0.1em}
                              -.q{margin:auto;text-align:right;font-style:oblique}
                              +.q{text-align:right;font-style:oblique}
                               .q span{font-size:large}
                               .note{margin-left:4.94em}
                               .note span{display:block;float:left;font-size:xx-large;line-height:0.875;margin:0 0.22em 0 -1.22em}
                              -.c,pre,.widgets,.widgets a,.download,ins,del,mark{line-height:2.154}
                              -.fancy:first-letter{float:left;color:#ddd;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
                              +.c,pre,.w,.w a,.download{line-height:2.154}
                              +.f:first-letter{float:left;color:#ddd;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
                               h1,h2,h3,p,ul,ol{margin:1.75em 0;font-size:medium}
                               
                               /* basics */
                              @@ -22,46 +22,34 @@ form div{float:right}
                               /* links */
                               a{background:transparent;text-decoration:none;border-bottom:1px dotted}
                               a:hover{border-bottom:1px solid}
                              -a:link,.widgets a{color:#26c}
                              +a:link,.w a{color:#26c}
                               a:visited{color:#93c}
                              -.skip a,.skip a:hover,.skip a:visited{position:absolute;left:0px;top:-500px;width:1px;height:1px;overflow:hidden}
                              -.skip a:active,.skip a:focus{position:static;width:auto;height:auto}
                              +
                              +/* skip links */
                              +.s a,.s a:hover,.s a:visited{position:absolute;left:0px;top:-500px;width:1px;height:1px;overflow:hidden}
                              +.s a:active,.s a:focus{position:static;width:auto;height:auto}
                               
                               /* code blocks */
                               pre{white-space:pre-wrap;padding-left:2.154em;border-left:1px solid #ddd}
                              -.widgets{float:left}
                              -.c,.widgets,.widgets a,.download{font-size:small}
                              +.w{float:left}
                              +.c,.w,.w a,.download{font-size:small}
                               .block,ol,p,blockquote,h1,h2,h3{clear:left}
                              -pre a,.widgets a{padding:0.4375em 0}
                              -.widgets a{text-decoration:underline}
                              -kbd,mark{font-weight:bold}
                              -.prompt{color:#667}
                              -ins,del,mark{text-decoration:none;font-style:normal;display:inline-block;width:100%}
                              -del{background:#f87}
                              -ins{background:#9f9}
                              -mark{background:#ff8}
                              +pre a,.w a{padding:0.4375em 0}
                              +.w a{text-decoration:underline}
                              +kbd{font-weight:bold}
                              +.p{color:#667}
                               
                              -/* tables */
                              -table{width:100%;border-collapse:collapse}
                              -th,td{width:45%;padding:0 0.5em;border:1px solid #bbb}
                              -th{text-align:left;vertical-align:baseline}
                              -td{vertical-align:top}
                              -th:first-child{width:10%;text-align:center}
                              -.simple th{font-family:inherit !important}
                              -.hover{background:#eee;cursor:default}
                              +/* hover effect for table rows, list items, and lines in code blocks */
                              +.h{background:#eee;cursor:default}
                               
                               /* overrides */
                              -th,td,td pre,li ol{margin:0}
                              -td pre{padding:0}
                              -pre a,.widgets a,pre a:hover,td pre{border:0}
                              +li ol,.q{margin:0}
                              +pre a,.w a,pre a:hover{border:0}
                               
                               /* headers */
                              -h1,#noscript{background:PapayaWhip;width:100%}
                              +h1,#noscript{background:PapayaWhip;width:100%} /* all hail PapayaWhip */
                               h1:before{content:"Chapter " counter(h1) ". "}
                               h1{counter-reset:h2}
                               h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
                               h2{counter-reset:h3}
                               h3:before{counter-increment:h3;content:counter(h1) "." counter(h2) "." counter(h3) ". "}
                              -
                              -/* HTML 5 support */
                              -article,aside,dialog,footer,header,section{display:block}
                              \ No newline at end of file
                              diff --git a/dip3.js b/dip3.js
                              index c4b6ec0..ebc61a0 100644
                              --- a/dip3.js
                              +++ b/dip3.js
                              @@ -11,7 +11,7 @@ $(document).ready(function() {
                                 for (var lang in LANGS) {
                                   $("blockquote.compare").filter("blockquote." + lang).each(function(i) {
                                     $(this).wrapInner('
');
                              -      $(this).prepend('');
                              +      $(this).prepend('');
                                   });
                                 }
                                 */
                              @@ -26,10 +26,10 @@ $(document).ready(function() {
                                 $("pre.code, pre.screen").each(function(i) {
                                   this.id = "autopre" + i;
                                   $(this).wrapInner('
');
                              -    $(this).prepend('');
                              +    $(this).prepend('');
                                   $(this).prev("p.download").each(function(i) {
                              -      $(this).next("pre").find("div.widgets").append(" " + $(this).html());
                              +      $(this).next("pre").find("div.w").append(" " + $(this).html());
                                     this.parentNode.removeChild(this);
                                   });
                                 });
                              @@ -39,8 +39,8 @@ $(document).ready(function() {
                                   $(this).find("a:not([href])").each(function(i) {
                                     var a = $(this);
                                     var li = a.parents("pre").next("ol").find("li:nth-child(" + (i+1) + ")");
                              -      li.add(a).hover(function() { a.addClass("hover"); li.addClass("hover"); },
                              -                      function() { a.removeClass("hover"); li.removeClass("hover"); });
                              +      li.add(a).hover(function() { a.addClass("h"); li.addClass("h"); },
                              +                      function() { a.removeClass("h"); li.removeClass("h"); });
                                   });
                                 });
                              @@ -50,8 +50,8 @@ $(document).ready(function() {
                                     var tr = $(this);
                                     var li = tr.parents("table").next("ol").find("li:nth-child(" + (i+1) + ")");
                                     if (li.length > 0) {
                              -        li.add(tr).hover(function() { tr.addClass("hover"); li.addClass("hover"); },
                              -                         function() { tr.removeClass("hover"); li.removeClass("hover"); });
                              +        li.add(tr).hover(function() { tr.addClass("h"); li.addClass("h"); },
                              +                         function() { tr.removeClass("h"); li.removeClass("h"); });
                                     }
                                   });
                                 });
                              @@ -63,7 +63,7 @@ $(document).ready(function() {
                               function toggleComparisonNotes(lang) {
                                   // FIXME: save state in cookie, pass state to toggle(), reset text accordingly
                                   $("blockquote." + lang + " div.block").toggle(false);
                              -    $("blockquote." + lang + " div.widgets a.toggle").text("show " + LANGS[lang] + " notes");
                              +    $("blockquote." + lang + " div.w a.toggle").text("show " + LANGS[lang] + " notes");
                               }
                               */
                              @@ -75,7 +75,7 @@ function toggleCodeBlock(id) {
                               function plainTextOnClick(id) {
                                   var clone = $("#" + id).clone();
                              -    clone.find("div.widgets, span").remove();
                              +    clone.find("div.w, span").remove();
                                   var win = window.open("about:blank", "plaintext", "toolbar=0,scrollbars=1,location=0,statusbar=0,menubar=0,resizable=1,width=600,height=400,left=35,top=75");
                                   win.document.open();
                                   win.document.write('
                              ' + clone.html());
                              diff --git a/htmlminimizer.py b/htmlminimizer.py
                              index 6017266..670b556 100644
                              --- a/htmlminimizer.py
                              +++ b/htmlminimizer.py
                              @@ -20,3 +20,365 @@ for line in open(input_file).readlines():
                                   else:
                                       out.write(g)
                               out.close()
                              +
                              +out = open(output_file)
                              +html = out.read()
                              +out.close()
                              +html = html.replace("å", "å")
                              +html = html.replace(">", ">")
                              +html = html.replace(">", ">")
                              +html = html.replace("⊃", "⊃")
                              +html = html.replace("⊃", "⊃")
                              +html = html.replace("Ñ", "Ñ")
                              +html = html.replace("ϒ", "ϒ")
                              +html = html.replace("ϒ", "ϒ")
                              +html = html.replace("Ý", "Ý")
                              +html = html.replace("Ã", "Ã")
                              +html = html.replace("√", "√")
                              +html = html.replace("⊗", "⊗")
                              +html = html.replace("⊗", "⊗")
                              +html = html.replace("æ", "æ")
                              +html = html.replace("Ψ", "Ψ")
                              +html = html.replace("Ψ", "Ψ")
                              +html = html.replace("Ε", "Ε")
                              +html = html.replace("Ε", "Ε")
                              +html = html.replace("Î", "Î")
                              +html = html.replace("É", "É")
                              +html = html.replace("Λ", "Λ")
                              +html = html.replace("Λ", "Λ")
                              +html = html.replace("″", "″")
                              +html = html.replace("Κ", "Κ")
                              +html = html.replace("Κ", "Κ")
                              +html = html.replace("ς", "ς")
                              +html = html.replace("ς", "ς")
                              +html = html.replace("‎", "‎")
                              +html = html.replace("‎", "‎")
                              +html = html.replace("¸", "¸")
                              +html = html.replace(" ", " ")
                              +html = html.replace(" ", " ")
                              +html = html.replace("Æ", "Æ")
                              +html = html.replace("′", "′")
                              +html = html.replace("Τ", "Τ")
                              +html = html.replace("Τ", "Τ")
                              +html = html.replace("⌈", "⌈")
                              +html = html.replace("⇓", "⇓")
                              +html = html.replace("⇓", "⇓")
                              +html = html.replace("≥", "≥")
                              +html = html.replace("≥", "≥")
                              +html = html.replace("⋅", "⋅")
                              +html = html.replace("⋅", "⋅")
                              +html = html.replace("⌊", "⌊")
                              +html = html.replace("⌊", "⌊")
                              +html = html.replace("⇐", "⇐")
                              +html = html.replace("⇐", "⇐")
                              +html = html.replace("¦", "¦")
                              +html = html.replace("Õ", "Õ")
                              +html = html.replace("Θ", "Θ")
                              +html = html.replace("Θ", "Θ")
                              +html = html.replace("Π", "Π")
                              +html = html.replace("Π", "Π")
                              +html = html.replace("Œ", "Œ")
                              +html = html.replace("Œ", "Œ")
                              +html = html.replace("Š", "Š")
                              +html = html.replace("Š", "Š")
                              +html = html.replace("è", "è")
                              +html = html.replace("⊂", "⊂")
                              +html = html.replace("⊂", "⊂")
                              +html = html.replace("¡", "¡")
                              +html = html.replace("∑", "∑")
                              +html = html.replace("∑", "∑")
                              +html = html.replace("ñ", "ñ")
                              +html = html.replace("ã", "ã")
                              +html = html.replace("θ", "θ")
                              +html = html.replace("θ", "θ")
                              +html = html.replace("⊄", "⊄")
                              +html = html.replace("⊄", "⊄")
                              +html = html.replace("⇔", "⇔")
                              +html = html.replace("⇔", "⇔")
                              +html = html.replace("Ø", "Ø")
                              +html = html.replace("Þ", "Þ")
                              +html = html.replace("Μ", "Μ")
                              +html = html.replace("Μ", "Μ")
                              +html = html.replace(" ", " ")
                              +html = html.replace(" ", " ")
                              +html = html.replace("ê", "ê")
                              +html = html.replace("„", "„")
                              +html = html.replace("Å", "Å")
                              +html = html.replace("∇", "∇")
                              +html = html.replace("‰", "‰")
                              +html = html.replace("‰", "‰")
                              +html = html.replace("Ù", "Ù")
                              +html = html.replace("η", "η")
                              +html = html.replace("η", "η")
                              +html = html.replace("À", "À")
                              +html = html.replace("∀", "∀")
                              +html = html.replace("∀", "∀")
                              +html = html.replace("ð", "ð")
                              +html = html.replace("ð", "ð")
                              +html = html.replace("⌉", "⌉")
                              +html = html.replace("È", "È")
                              +html = html.replace("÷", "÷")
                              +html = html.replace("ì", "ì")
                              +html = html.replace("õ", "õ")
                              +html = html.replace("£", "£")
                              +html = html.replace("⁄", "⁄")
                              +html = html.replace("Ð", "Ð")
                              +html = html.replace("Ð", "Ð")
                              +html = html.replace("∗", "∗")
                              +html = html.replace("∗", "∗")
                              +html = html.replace("χ", "χ")
                              +html = html.replace("χ", "χ")
                              +html = html.replace("Á", "Á")
                              +html = html.replace("Β", "Β")
                              +html = html.replace("⊥", "⊥")
                              +html = html.replace("⊥", "⊥")
                              +html = html.replace("∴", "∴")
                              +html = html.replace("∴", "∴")
                              +html = html.replace("π", "π")
                              +html = html.replace("π", "π")
                              +html = html.replace("∅", "∅")
                              +html = html.replace("∉", "∉")
                              +html = html.replace("î", "î")
                              +html = html.replace("•", "•")
                              +html = html.replace("•", "•")
                              +html = html.replace("υ", "υ")
                              +html = html.replace("υ", "υ")
                              +html = html.replace("Ó", "Ó")
                              +html = html.replace("κ", "κ")
                              +html = html.replace("κ", "κ")
                              +html = html.replace("ç", "ç")
                              +html = html.replace("∩", "∩")
                              +html = html.replace("∩", "∩")
                              +html = html.replace("μ", "μ")
                              +html = html.replace("μ", "μ")
                              +html = html.replace("°", "°")
                              +html = html.replace("°", "°")
                              +html = html.replace("τ", "τ")
                              +html = html.replace("τ", "τ")
                              +html = html.replace(" ", " ")
                              +html = html.replace(" ", " ")
                              +html = html.replace("…", "…")
                              +html = html.replace("…", "…")
                              +html = html.replace("û", "û")
                              +html = html.replace("ù", "ù")
                              +html = html.replace("≅", "≅")
                              +html = html.replace("≅", "≅")
                              +html = html.replace("Ι", "Ι")
                              +html = html.replace(""", """)
                              +html = html.replace(""", """)
                              +html = html.replace("→", "→")
                              +html = html.replace("→", "→")
                              +html = html.replace("Ρ", "Ρ")
                              +html = html.replace("Ρ", "Ρ")
                              +html = html.replace("ú", "ú")
                              +html = html.replace("â", "â")
                              +html = html.replace("∼", "∼")
                              +html = html.replace("∼", "∼")
                              +html = html.replace("φ", "φ")
                              +html = html.replace("φ", "φ")
                              +html = html.replace("♦", "♦")
                              +html = html.replace("Ç", "Ç")
                              +html = html.replace("Η", "Η")
                              +html = html.replace("Η", "Η")
                              +html = html.replace("Γ", "Γ")
                              +html = html.replace("Γ", "Γ")
                              +html = html.replace("€", "€")
                              +html = html.replace("€", "€")
                              +html = html.replace("ϑ", "ϑ")
                              +html = html.replace("ϑ", "ϑ")
                              +html = html.replace("“", "“")
                              +html = html.replace("♥", "♥")
                              +html = html.replace("♥", "♥")
                              +html = html.replace("ó", "ó")
                              +html = html.replace("‌", "‌")
                              +html = html.replace("‌", "‌")
                              +html = html.replace("¥", "¥")
                              +html = html.replace("¥", "¥")
                              +html = html.replace("ò", "ò")
                              +html = html.replace("Χ", "Χ")
                              +html = html.replace("Χ", "Χ")
                              +html = html.replace("™", "™")
                              +html = html.replace("ξ", "ξ")
                              +html = html.replace("ξ", "ξ")
                              +html = html.replace("˜", "˜")
                              +html = html.replace("˜", "˜")
                              +html = html.replace("‹", "‹")
                              +html = html.replace("‹", "‹")
                              +html = html.replace("œ", "œ")
                              +html = html.replace("œ", "œ")
                              +html = html.replace("≡", "≡")
                              +html = html.replace("≤", "≤")
                              +html = html.replace("≤", "≤")
                              +html = html.replace("∪", "∪")
                              +html = html.replace("∪", "∪")
                              +html = html.replace("Ÿ", "Ÿ")
                              +html = html.replace("<", "<")
                              +html = html.replace("<", "<")
                              +html = html.replace("Υ", "Υ")
                              +html = html.replace("Υ", "Υ")
                              +html = html.replace("–", "–")
                              +html = html.replace("ý", "ý")
                              +html = html.replace("ℜ", "ℜ")
                              +html = html.replace("ℜ", "ℜ")
                              +html = html.replace("ψ", "ψ")
                              +html = html.replace("ψ", "ψ")
                              +html = html.replace("›", "›")
                              +html = html.replace("›", "›")
                              +html = html.replace("↓", "↓")
                              +html = html.replace("↓", "↓")
                              +html = html.replace("Α", "Α")
                              +html = html.replace("Α", "Α")
                              +html = html.replace("¬", "¬")
                              +html = html.replace("¬", "¬")
                              +html = html.replace("&", "&")
                              +html = html.replace("ø", "ø")
                              +html = html.replace("´", "´")
                              +html = html.replace("‍", "‍")
                              +html = html.replace("‍", "‍")
                              +html = html.replace("«", "«")
                              +html = html.replace("”", "”")
                              +html = html.replace("Ì", "Ì")
                              +html = html.replace("µ", "µ")
                              +html = html.replace("­", "­")
                              +html = html.replace("­", "­")
                              +html = html.replace("⊇", "⊇")
                              +html = html.replace("⊇", "⊇")
                              +html = html.replace("ß", "ß")
                              +html = html.replace("♣", "♣")
                              +html = html.replace("à", "à")
                              +html = html.replace("Ô", "Ô")
                              +html = html.replace("↔", "↔")
                              +html = html.replace("↔", "↔")
                              +html = html.replace("←", "←")
                              +html = html.replace("←", "←")
                              +html = html.replace("½", "½")
                              +html = html.replace("∝", "∝")
                              +html = html.replace("∝", "∝")
                              +html = html.replace("ˆ", "ˆ")
                              +html = html.replace("ô", "ô")
                              +html = html.replace("≈", "≈")
                              +html = html.replace("¨", "¨")
                              +html = html.replace("¨", "¨")
                              +html = html.replace("∏", "∏")
                              +html = html.replace("∏", "∏")
                              +html = html.replace("®", "®")
                              +html = html.replace("®", "®")
                              +html = html.replace("‏", "‏")
                              +html = html.replace("‏", "‏")
                              +html = html.replace("∞", "∞")
                              +html = html.replace("Σ", "Σ")
                              +html = html.replace("Σ", "Σ")
                              +html = html.replace("—", "—")
                              +html = html.replace("↑", "↑")
                              +html = html.replace("↑", "↑")
                              +html = html.replace("×", "×")
                              +html = html.replace("⇒", "⇒")
                              +html = html.replace("⇒", "⇒")
                              +html = html.replace("∨", "∨")
                              +html = html.replace("∨", "∨")
                              +html = html.replace("γ", "γ")
                              +html = html.replace("γ", "γ")
                              +html = html.replace("λ", "λ")
                              +html = html.replace("λ", "λ")
                              +html = html.replace("〉", "⟩")
                              +html = html.replace("〉", "⟩")
                              +html = html.replace("†", "†")
                              +html = html.replace("†", "†")
                              +html = html.replace("ℑ", "ℑ")
                              +html = html.replace("ℵ", "ℵ")
                              +html = html.replace("ℵ", "ℵ")
                              +html = html.replace("⊆", "⊆")
                              +html = html.replace("⊆", "⊆")
                              +html = html.replace("α", "α")
                              +html = html.replace("α", "α")
                              +html = html.replace("Ν", "Ν")
                              +html = html.replace("Ν", "Ν")
                              +html = html.replace("±", "±")
                              +html = html.replace("¾", "¾")
                              +html = html.replace("‾", "‾")
                              +html = html.replace("Δ", "Δ")
                              +html = html.replace("Δ", "Δ")
                              +html = html.replace("◊", "◊")
                              +html = html.replace("◊", "◊")
                              +html = html.replace("ι", "ι")
                              +html = html.replace("í", "í")
                              +html = html.replace("ε", "ε")
                              +html = html.replace("ε", "ε")
                              +html = html.replace("℘", "℘")
                              +html = html.replace("℘", "℘")
                              +html = html.replace("∂", "∂")
                              +html = html.replace("∂", "∂")
                              +html = html.replace("δ", "δ")
                              +html = html.replace("δ", "δ")
                              +html = html.replace("ο", "ο")
                              +html = html.replace("ο", "ο")
                              +html = html.replace("Ξ", "Ξ")
                              +html = html.replace("Ξ", "Ξ")
                              +html = html.replace("‡", "‡")
                              +html = html.replace("‡", "‡")
                              +html = html.replace("Ò", "Ò")
                              +html = html.replace("Û", "Û")
                              +html = html.replace("š", "š")
                              +html = html.replace("š", "š")
                              +html = html.replace("‘", "‘")
                              +html = html.replace("∈", "∈")
                              +html = html.replace("∈", "∈")
                              +html = html.replace("Ζ", "Ζ")
                              +html = html.replace("−", "−")
                              +html = html.replace("∧", "∧")
                              +html = html.replace("∧", "∧")
                              +html = html.replace("∠", "∠")
                              +html = html.replace("∠", "∠")
                              +html = html.replace("¤", "¤")
                              +html = html.replace("∫", "∫")
                              +html = html.replace("∫", "∫")
                              +html = html.replace("⌋", "⌋")
                              +html = html.replace("⌋", "⌋")
                              +html = html.replace("↵", "↵")
                              +html = html.replace("∃", "∃")
                              +html = html.replace("⊕", "⊕")
                              +html = html.replace("Â", "Â")
                              +html = html.replace("ϖ", "ϖ")
                              +html = html.replace("ϖ", "ϖ")
                              +html = html.replace("∋", "∋")
                              +html = html.replace("∋", "∋")
                              +html = html.replace("Φ", "Φ")
                              +html = html.replace("Φ", "Φ")
                              +html = html.replace("Í", "Í")
                              +html = html.replace("Ú", "Ú")
                              +html = html.replace("Ο", "Ο")
                              +html = html.replace("Ο", "Ο")
                              +html = html.replace("≠", "≠")
                              +html = html.replace("≠", "≠")
                              +html = html.replace("¿", "¿")
                              +html = html.replace("‚", "‚")
                              +html = html.replace("Ê", "Ê")
                              +html = html.replace("ζ", "ζ")
                              +html = html.replace("Ω", "Ω")
                              +html = html.replace("Ω", "Ω")
                              +html = html.replace("ν", "ν")
                              +html = html.replace("ν", "ν")
                              +html = html.replace("¼", "¼")
                              +html = html.replace("á", "á")
                              +html = html.replace("⇑", "⇑")
                              +html = html.replace("⇑", "⇑")
                              +html = html.replace("β", "β")
                              +html = html.replace("ƒ", "ƒ")
                              +html = html.replace("ρ", "ρ")
                              +html = html.replace("ρ", "ρ")
                              +html = html.replace("é", "é")
                              +html = html.replace("ω", "ω")
                              +html = html.replace("ω", "ω")
                              +html = html.replace("·", "·")
                              +html = html.replace("〈", "⟨")
                              +html = html.replace("〈", "⟨")
                              +html = html.replace("♠", "♠")
                              +html = html.replace("♠", "♠")
                              +html = html.replace("’", "’")
                              +html = html.replace("þ", "þ")
                              +html = html.replace("»", "»")
                              +html = html.replace("σ", "σ")
                              +html = html.replace("σ", "σ")
                              +out = open(output_file, 'w')
                              +out.write(html)
                              +out.close()
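                              Review note: the chain of `html.replace(...)` calls above could probably be table-driven. The sketch below is only an illustration, and it rests on an assumption: that the intent is to rewrite named character references as numeric ones. The rendered diff shows both arguments of each call as the same character, so the actual direction of the mapping cannot be confirmed from this page. `named_to_numeric` and `ENTITY_RE` are hypothetical names, not part of `htmlminimizer.py`; `html.entities.name2codepoint` is the standard-library table of HTML 4 entity names.

```python
import re
from html.entities import name2codepoint

# Matches a named character reference such as "&aring;".
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Rewrite known named entities as decimal character references.

    Assumption: named -> numeric is the intended direction; the original
    script's real mapping is not recoverable from the rendered diff.
    """
    def substitute(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            # Unknown entity name: leave the text exactly as it was.
            return match.group(0)
        return "&#{0};".format(codepoint)

    return ENTITY_RE.sub(substitute, text)


if __name__ == "__main__":
    print(named_to_numeric("caf&eacute; &amp; &bogus;"))
```

                              A single regular-expression pass like this also avoids re-scanning the whole document once per entity, which the chained `replace` calls do.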
                              diff --git a/index.html b/index.html
                              index aec2105..8164e27 100644
                              --- a/index.html
                              +++ b/index.html
                              @@ -1,5 +1,4 @@
                               
                              -
                               
                               
                               Dive Into Python 3
                              @@ -9,15 +8,14 @@
                               
                               
                               
                              -
                               
                              +
                               
                              -

                              You are here:  

                              Dive Into Python 3

                              @@ -47,15 +45,15 @@ span{cursor:default}
                            3. Creating graphics with the Python Imaging Library
                            4. Where to go from here
                            5. Case study: porting chardet to Python 3 -
                            6. Porting code to Python 3 with 2to3 +
                            7. Porting code to Python 3 with 2to3

                            There is a changelog, a feed, and discussion on Reddit. During development, you can download the book by cloning the Mercurial repository: -

                            you@localhost:~$ hg clone http://hg.diveintopython3.org/ diveintopython3
                            +
                            you@localhost:~$ hg clone http://hg.diveintopython3.org/ diveintopython3

                            The final version will be downloadable as HTML and PDF.

                            This site is optimized for Lynx just because fuck you.
                            I’m told it also looks good in graphical browsers. -

                            © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                            © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source
                            diff --git a/native-datatypes.html b/native-datatypes.html
                            index 16269be..1b74f7d 100644
                            --- a/native-datatypes.html
                            +++ b/native-datatypes.html
                            @@ -1,19 +1,17 @@
                            - Native datatypes - Dive into Python 3 - -

                            skip to main content -

                              
                            -

                            skip to main content +

                              
                            +

                            You are here: Home Dive Into Python 3

                            Native datatypes

                            Wonder is the foundation of all philosophy, research its progress, ignorance its end.
                            Michel de Montaigne
                            @@ -61,7 +59,7 @@ body{counter-reset:h1 2}

                          2. Further reading

                          Diving in

                          -

                          Cast aside your first Python program for just a minute, and let's talk about datatypes. In Python, every variable has a datatype, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally. +

                          Cast aside your first Python program for just a minute, and let's talk about datatypes. In Python, every variable has a datatype, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally.

                          Python has many native datatypes. Here are the important ones:

                          1. Booleans are either True or False.
                          @@ -82,25 +80,25 @@ body{counter-reset:h1 2}
                          raise ValueError('number must be non-negative')

                            size is an integer, 0 is an integer, and < is a numerical operator. The result of the expression size < 0 is always a boolean. You can test this yourself in the Python interactive shell:

                            ->>> size = 1
                            ->>> size < 0
                            +>>> size = 1
                            +>>> size < 0
                             False
                            ->>> size = 0
                            ->>> size < 0
                            +>>> size = 0
                            +>>> size < 0
                             False
                            ->>> size = -1
                            ->>> size < 0
                            +>>> size = -1
                            +>>> size < 0
                             True

                            Numbers

                            Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There's no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point.

                            ->>> type(1)                 
                            +>>> type(1)                 
                             <class 'int'>
                            ->>> 1 + 1                   
                            +>>> 1 + 1                   
                             2
                            ->>> 1 + 1.0                 
                            +>>> 1 + 1.0                 
                             2.0
                            ->>> type(2.0)
                            +>>> type(2.0)
                             <class 'float'>
                            1. You can use the type() function to check the type of any value or variable. As you might expect, 1 is an int.
                            @@ -110,17 +108,17 @@ body{counter-reset:h1 2}

                              Coercing integers to floats and vice-versa

                              As you just saw, some operators (like addition) will coerce integers to floating point numbers as needed. You can also coerce them by yourself.

                              ->>> float(2)                
                              +>>> float(2)                
                               2.0
                              ->>> int(2.0)                
                              +>>> int(2.0)                
                               2
                              ->>> int(2.5)                
                              +>>> int(2.5)                
                               2
                              ->>> int(-2.5)               
                              +>>> int(-2.5)               
                               -2
                              ->>> 1.12345678901234567890  
                              +>>> 1.12345678901234567890  
                               1.1234567890123457
                              ->>> type(1000000000000000)  
                              +>>> type(1000000000000000)  
                               <class 'int'>
                              1. You can explicitly coerce an int to a float by calling the float() function.
                              @@ -136,17 +134,17 @@ body{counter-reset:h1 2}

                                Common numerical operations

                                You can do all kinds of things with numbers.

                                ->>> 11 / 2      
                                +>>> 11 / 2      
                                 5.5
                                ->>> 11 // 2     
                                +>>> 11 // 2     
                                 5
                                ->>> −11 // 2    
                                +>>> −11 // 2    
                                 −6
                                ->>> 11.0 // 2   
                                +>>> 11.0 // 2   
                                 5.0
                                ->>> 11 ** 2     
                                +>>> 11 ** 2     
                                 121
                                ->>> 11 % 2      
                                +>>> 11 % 2      
                                 1
                                 
                                  @@ -163,13 +161,13 @@ body{counter-reset:h1 2}

                                  Fractions

                                  Python isn't limited to integers and floating point numbers. It can also do all the fancy math you learned in high school and promptly forgot about.

                                  ->>> import fractions              
                                  ->>> x = fractions.Fraction(1, 3)  
                                  ->>> x
                                  +>>> import fractions              
                                  +>>> x = fractions.Fraction(1, 3)  
                                  +>>> x
                                   Fraction(1, 3)
                                  ->>> x * 2                         
                                  +>>> x * 2                         
                                   Fraction(2, 3)
                                  ->>> fractions.Fraction(6, 4)      
                                  +>>> fractions.Fraction(6, 4)      
                                   Fraction(3, 2)
                                  1. To start using fractions, import the fractions module. @@ -180,12 +178,12 @@ body{counter-reset:h1 2}

                                    Trigonometry

                                    You can also do basic trigonometry in Python.

                                    ->>> import math
                                    ->>> math.pi                
                                    +>>> import math
                                    +>>> math.pi                
                                     3.1415926535897931
                                    ->>> math.sin(math.pi / 2)  
                                    +>>> math.sin(math.pi / 2)  
                                     1.0
                                    ->>> math.tan(math.pi / 4)  
                                    +>>> math.tan(math.pi / 4)  
                                     0.99999999999999989
                                    1. The math module has a constant for π, the ratio of a circle's circumference to its diameter. @@ -195,26 +193,26 @@ body{counter-reset:h1 2}

                                      Numbers in a boolean context

                                      You can use numbers in a boolean context, such as an if statement. Zero values are false, and non-zero values are true.

                                      ->>> def is_it_true(anything):             
                                      -...   if anything:
                                      -...     print("yes, it's true")
                                      -...   else:
                                      -...     print("no, it's false")
                                      -...
                                      ->>> is_it_true(1)                         
                                      +>>> def is_it_true(anything):             
                                      +...   if anything:
                                      +...     print("yes, it's true")
                                      +...   else:
                                      +...     print("no, it's false")
                                      +...
                                      +>>> is_it_true(1)                         
                                       yes, it's true
                                      ->>> is_it_true(-1)
                                      +>>> is_it_true(-1)
                                       yes, it's true
                                      ->>> is_it_true(0)
                                      +>>> is_it_true(0)
                                       no, it's false
                                      ->>> is_it_true(0.1)                       
                                      +>>> is_it_true(0.1)                       
                                       yes, it's true
                                      ->>> is_it_true(0.0)
                                      +>>> is_it_true(0.0)
                                       no, it's false
                                      ->>> import fractions
                                      ->>> is_it_true(fractions.Fraction(1, 2))  
                                      +>>> import fractions
                                      +>>> is_it_true(fractions.Fraction(1, 2))  
                                       yes, it's true
                                      ->>> is_it_true(fractions.Fraction(0, 1))
                                      +>>> is_it_true(fractions.Fraction(0, 1))
                                       no, it's false
                                      1. Did you know you can define your own functions in the Python interactive shell? Just press ENTER at the end of each line, and ENTER on a blank line to finish. @@ -233,16 +231,16 @@ body{counter-reset:h1 2}

                                        Creating a list

                                        Creating a list is easy: use square brackets to wrap a comma-separated list of values.

                                        ->>> a_list = ['a', 'b', 'mpilgrim', 'z', 'example']  
                                        ->>> a_list
                                        +>>> a_list = ['a', 'b', 'mpilgrim', 'z', 'example']  
                                        +>>> a_list
                                         ['a', 'b', 'mpilgrim', 'z', 'example']
                                        ->>> a_list[0]                                        
                                        +>>> a_list[0]                                        
                                         'a'
                                        ->>> a_list[4]                                        
                                        +>>> a_list[4]                                        
                                         'example'
                                        ->>> a_list[-1]                                       
                                        +>>> a_list[-1]                                       
                                         'example'
                                        ->>> a_list[-3]                                       
                                        +>>> a_list[-3]                                       
                                         'mpilgrim'
                                        1. First, you define a list of five items. Note that they retain their original order. This is not an accident. A list is an ordered set of items. @@ -254,19 +252,19 @@ body{counter-reset:h1 2}

                                          Slicing a list

                                          Once you've defined a list, you can get any part of it as a new list. This is called slicing the list.

                                          ->>> a_list
                                          +>>> a_list
                                           ['a', 'b', 'mpilgrim', 'z', 'example']
                                          ->>> a_list[1:3]            
                                          +>>> a_list[1:3]            
                                           ['b', 'mpilgrim']
                                          ->>> a_list[1:-1]           
                                          +>>> a_list[1:-1]           
                                           ['b', 'mpilgrim', 'z']
                                          ->>> a_list[0:3]            
                                          +>>> a_list[0:3]            
                                           ['a', 'b', 'mpilgrim']
                                          ->>> a_list[:3]             
                                          +>>> a_list[:3]             
                                           ['a', 'b', 'mpilgrim']
                                          ->>> a_list[3:]             
                                          +>>> a_list[3:]             
                                           ['z', 'example']
                                          ->>> a_list[:]              
                                          +>>> a_list[:]              
                                           ['a', 'b', 'mpilgrim', 'z', 'example']
                                          1. You can get a part of a list, called a “slice”, by specifying two indices. The return value is a new list containing all the items of the list, in order, starting with the first slice index (in this case a_list[1]), up to but not including the second slice index (in this case a_list[3]). @@ -279,18 +277,18 @@ body{counter-reset:h1 2}

                                            Adding items to a list

                                            There are four ways to add items to a list.

                                            ->>> a_list = ['a']
                                            ->>> a_list = a_list + [2.0, 3]    
                                            ->>> a_list
                                            +>>> a_list = ['a']
                                            +>>> a_list = a_list + [2.0, 3]    
                                            +>>> a_list
                                             ['a', 2.0, 3]
                                            ->>> a_list.append(True)           
                                            ->>> a_list
                                            +>>> a_list.append(True)           
                                            +>>> a_list
                                             ['a', 2.0, 3, True]
                                            ->>> a_list.extend(['four', 'e'])  
                                            ->>> a_list
                                            +>>> a_list.extend(['four', 'e'])  
                                            +>>> a_list
                                             ['a', 2.0, 3, True, 'four', 'e']
                                            ->>> a_list.insert(1, 'a')         
                                            ->>> a_list
                                            +>>> a_list.insert(1, 'a')         
                                            +>>> a_list
                                             ['a', 'a', 2.0, 3, True, 'four', 'e']
                                            1. The + operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don't all need to be the same type. Here we have a list containing a string, a floating point number, and an integer. @@ -300,20 +298,20 @@ body{counter-reset:h1 2}

                                            Let's look closer at the difference between append() and extend().

                                            ->>> a_list = ['a', 'b', 'c']
                                            ->>> a_list.extend(['d', 'e', 'f'])  
                                            ->>> a_list
                                            +>>> a_list = ['a', 'b', 'c']
                                            +>>> a_list.extend(['d', 'e', 'f'])  
                                            +>>> a_list
                                             ['a', 'b', 'c', 'd', 'e', 'f']
                                            ->>> len(a_list)                     
                                            +>>> len(a_list)                     
                                             6
                                            ->>> a_list[-1]
                                            +>>> a_list[-1]
                                             'f'
                                            ->>> a_list.append(['g', 'h', 'i'])  
                                            ->>> a_list
                                            +>>> a_list.append(['g', 'h', 'i'])  
                                            +>>> a_list
                                             ['a', 'b', 'c', 'd', 'e', 'f', ['g', 'h', 'i']]
                                            ->>> len(a_list)                     
                                            +>>> len(a_list)                     
                                             7
                                            ->>> a_list[-1]
                                            +>>> a_list[-1]
                                             ['g', 'h', 'i']
                                            1. The extend() method takes a single argument, which is always a list, and adds each of the items of that list to a_list. @@ -323,16 +321,16 @@ body{counter-reset:h1 2}

                                            Searching for values in a list

                                            ->>> a_list = ['a', 'b', 'new', 'mpilgrim', 'new']
                                            ->>> 'mpilgrim' in a_list      
                                            +>>> a_list = ['a', 'b', 'new', 'mpilgrim', 'new']
                                            +>>> 'mpilgrim' in a_list      
                                             True
                                            ->>> a_list.index('mpilgrim')  
                                            +>>> a_list.index('mpilgrim')  
                                             3
                                            ->>> a_list.index('new')       
                                            +>>> a_list.index('new')       
                                             2
                                            ->>> 'c' in a_list             
                                            +>>> 'c' in a_list             
                                             False
                                            ->>> a_list.index('c')         
                                            +>>> a_list.index('c')         
                                             Traceback (innermost last):
                                               File "<interactive input>", line 1, in ?
                                             ValueError: list.index(x): x not in list
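If you would rather not deal with the exception at every call site, one option is to wrap the lookup. This is only a minimal sketch: find_index() is a hypothetical helper made up for illustration, not a list method.

```python
def find_index(items, value):
    """Return the index of the first occurrence of value, or None if absent.

    Hypothetical helper for illustration; it simply wraps list.index().
    """
    try:
        return items.index(value)
    except ValueError:
        # list.index() raises ValueError when the value is not in the list
        return None

a_list = ['a', 'b', 'new', 'mpilgrim', 'new']
first_new = find_index(a_list, 'new')   # 2, the first of the two occurrences
missing = find_index(a_list, 'c')       # None instead of an exception
```

Returning None rather than −1 is a deliberate choice here, since −1 is a valid list index in Python.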
                                            @@ -346,15 +344,15 @@ ValueError: list.index(x): x not in list

                                            Lists in a boolean context

                                            You can also use a list in a boolean context, such as an if statement.

                                            ->>> def is_it_true(anything):
                                            -...   if anything:
                                            -...     print("yes, it's true")
                                            -...   else:
                                            -...     print("no, it's false")
                                            -...
                                            ->>> is_it_true([])             
                                            +>>> def is_it_true(anything):
                                            +...   if anything:
                                            +...     print("yes, it's true")
                                            +...   else:
                                            +...     print("no, it's false")
                                            +...
                                            +>>> is_it_true([])             
                                             no, it's false
                                            ->>> is_it_true(['a'])          
                                            +>>> is_it_true(['a'])          
                                             yes, it's true
                                            1. In a boolean context, an empty list is false. @@ -372,14 +370,14 @@ ValueError: list.index(x): x not in list

                                              Creating a dictionary

                                              Creating a dictionary is easy. The syntax is similar to sets, but instead of values, you have key-value pairs. Once you have a dictionary, you can look up values by their key.

                                              ->>> a_dict = {"server":"db.diveintopython3.org", "database":"mysql"}  
                                              ->>> a_dict
                                              +>>> a_dict = {"server":"db.diveintopython3.org", "database":"mysql"}  
                                              +>>> a_dict
                                               {'server': 'db.diveintopython3.org', 'database': 'mysql'}
                                              ->>> a_dict["server"]                                                  
                                              +>>> a_dict["server"]                                                  
                                               'db.diveintopython3.org'
                                              ->>> a_dict["database"]                                                
                                              +>>> a_dict["database"]                                                
                                               'mysql'
                                              ->>> a_dict["db.diveintopython3.org"]                                  
                                              +>>> a_dict["db.diveintopython3.org"]                                  
                                               Traceback (most recent call last):
                                                 File "<stdin>", line 1, in <module>
                                               KeyError: 'db.diveintopython3.org'
                                              @@ -392,19 +390,19 @@ KeyError: 'db.diveintopython3.org'

                                              Modifying a dictionary

                                              Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any time, or you can modify the value of an existing key. Continuing from the previous example:

                                              ->>> a_dict
                                              +>>> a_dict
                                               {'server': 'db.diveintopython3.org', 'database': 'mysql'}
                                              ->>> a_dict["database"] = "blog"  
                                              ->>> a_dict
                                              +>>> a_dict["database"] = "blog"  
                                              +>>> a_dict
                                               {'server': 'db.diveintopython3.org', 'database': 'blog'}
                                              ->>> a_dict["user"] = "mark"      
                                              ->>> a_dict                       
                                              +>>> a_dict["user"] = "mark"      
                                              +>>> a_dict                       
                                               {'server': 'db.diveintopython3.org', 'user': 'mark', 'database': 'blog'}
                                              ->>> a_dict["user"] = "dora"      
                                              ->>> a_dict
                                              +>>> a_dict["user"] = "dora"      
                                              +>>> a_dict
                                               {'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}
                                              ->>> a_dict["User"] = "mark"      
                                              ->>> a_dict
                                              +>>> a_dict["User"] = "mark"      
                                              +>>> a_dict
                                               {'User': 'mark', 'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}
                                              1. You can not have duplicate keys in a dictionary. Assigning a value to an existing key will wipe out the old value. @@ -420,15 +418,15 @@ KeyError: 'db.diveintopython3.org' 1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}

                                                Let's tear that apart in the interactive shell.

                                                ->>> SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
                                                -...             1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
                                                ->>> len(SUFFIXES)      
                                                +>>> SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
                                                +...             1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
                                                +>>> len(SUFFIXES)      
                                                 2
                                                ->>> SUFFIXES[1000]     
                                                +>>> SUFFIXES[1000]     
                                                 ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']
                                                ->>> SUFFIXES[1024]     
                                                +>>> SUFFIXES[1024]     
                                                 ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']
                                                ->>> SUFFIXES[1000][3]  
                                                +>>> SUFFIXES[1000][3]  
                                                 'TB'
                                                1. As with lists, the len() function gives you the number of items in a dictionary. @@ -439,15 +437,15 @@ KeyError: 'db.diveintopython3.org'

                                                  Dictionaries in a boolean context

                                                  You can also use a dictionary in a boolean context, such as an if statement.

                                                  ->>> def is_it_true(anything):
                                                  -...   if anything:
                                                  -...     print("yes, it's true")
                                                  -...   else:
                                                  -...     print("no, it's false")
                                                  -...
                                                  ->>> is_it_true({})             
                                                  +>>> def is_it_true(anything):
                                                  +...   if anything:
                                                  +...     print("yes, it's true")
                                                  +...   else:
                                                  +...     print("no, it's false")
                                                  +...
                                                  +>>> is_it_true({})             
                                                   no, it's false
                                                  ->>> is_it_true({'a': 1})       
                                                  +>>> is_it_true({'a': 1})       
                                                   yes, it's true
                                                  1. In a boolean context, an empty dictionary is false. @@ -457,35 +455,35 @@ KeyError: 'db.diveintopython3.org'

                                                    None is a special constant in Python. It is a null value. None is not the same as False. None is not 0. None is not an empty string. Comparing None to anything other than None will always return False.

                                                    None is the only null value. It has its own datatype (NoneType). You can assign None to any variable, but you can not create other NoneType objects. All variables whose value is None are equal to each other.

                                                    ->>> type(None)
                                                    +>>> type(None)
                                                     <class 'NoneType'>
                                                    ->>> None == False
                                                    +>>> None == False
                                                     False
                                                    ->>> None == 0
                                                    +>>> None == 0
                                                     False
                                                    ->>> None == ''
                                                    +>>> None == ''
                                                     False
                                                    ->>> None == None
                                                    +>>> None == None
                                                     True
                                                    ->>> x = None
                                                    ->>> x == None
                                                    +>>> x = None
                                                    +>>> x == None
                                                     True
                                                    ->>> y = None
                                                    ->>> x == y
                                                    +>>> y = None
                                                    +>>> x == y
                                                     True
                                                     

                                                    None in a boolean context

                                                    In a boolean context, None is false and not None is true.

                                                    ->>> def is_it_true(anything):
                                                    -...   if anything:
                                                    -...     print("yes, it's true")
                                                    -...   else:
                                                    -...     print("no, it's false")
                                                    -...
                                                    ->>> is_it_true(None)
                                                    +>>> def is_it_true(anything):
                                                    +...   if anything:
                                                    +...     print("yes, it's true")
                                                    +...   else:
                                                    +...     print("no, it's false")
                                                    +...
                                                    +>>> is_it_true(None)
                                                     no, it's false
                                                    ->>> is_it_true(not None)
                                                    +>>> is_it_true(not None)
                                                     yes, it's true
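Because None is false in a boolean context, a plain truth test cannot tell None apart from 0 or an empty string. Here is a small sketch of testing for None by identity instead. The describe() function is a made-up helper for illustration only.

```python
def describe(value):
    """Classify a value as None, some other false value, or a true value.

    Hypothetical helper for illustration.
    """
    if value is None:          # identity test: there is only one None object
        return "None"
    if not value:              # false in a boolean context, but not None
        return "false, but not None"
    return "true"

results = [describe(v) for v in (None, False, 0, '', [], 1)]
```

Only the first item in that tuple is reported as None; the next four are false without being None, and the last is true.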

                                                    Further reading

                                                    -

                                                    © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                                                    © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/porting-code-to-python-3-with-2to3.html b/porting-code-to-python-3-with-2to3.html index 8a16d9c..1da85f9 100644 --- a/porting-code-to-python-3-with-2to3.html +++ b/porting-code-to-python-3-with-2to3.html @@ -1,21 +1,27 @@ - Porting code to Python 3 with 2to3 - Dive into Python 3 - -

                                                    skip to main content -

                                                      
                                                    -

                                                    skip to main content +

                                                      
                                                    +

                                                    You are here: Home Dive Into Python 3

                                                    Porting code to Python 3 with 2to3

                                                    Life is pleasant. Death is peaceful. It’s the transition that’s troublesome.
                                                    — Isaac Asimov (attributed) @@ -79,11 +85,11 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}

                                                Diving in

                                                -

                                                Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. Case study: porting chardet to Python 3 describes how to run the 2to3 script, then shows some things it can't fix automatically. This appendix documents what it can fix automatically. +

                                                Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. Case study: porting chardet to Python 3 describes how to run the 2to3 script, then shows some things it can't fix automatically. This appendix documents what it can fix automatically.

                                                print statement

                                                In Python 2, print was a statement. Whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function — whatever you want to print is passed to print() like any other function.
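Since print() is an ordinary function in Python 3, it accepts keyword arguments like any other function. The following is a short sketch using the standard sep, end, and file parameters. The buffer variable is an assumption added for illustration, so the output can be inspected instead of going to the screen.

```python
import io

# Capture the output in memory; file= defaults to sys.stdout otherwise.
buffer = io.StringIO()

# sep goes between the arguments, end replaces the default trailing newline.
print("a", "b", "c", sep="-", end="!\n", file=buffer)

output = buffer.getvalue()   # 'a-b-c!\n'
```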

                                                [The code examples will be easier to follow if you enable Javascript, but whatever.] -

                                                skip over this table +

                                                skip over this table @@ -115,7 +121,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}

                                                Unicode string literals

                                                Python 2 had two string types: Unicode strings and non-Unicode strings. Python 3 has one string type: Unicode strings. -

                                                skip over this table +

                                                skip over this table

                                                Notes
                                                @@ -134,7 +140,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}

                                                unicode() global function

                                                Python 2 had two global functions to coerce objects into strings: unicode() to coerce them into Unicode strings, and str() to coerce them into non-Unicode strings. Python 3 has only one string type, Unicode strings, so the str() function is all you need. (The unicode() function no longer exists.) -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -148,7 +154,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}

                                                long data type

                                                Python 2 had separate int and long types for non-floating-point numbers. An int could not be any larger than sys.maxint, which varied by platform. Longs were defined by appending an L to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called int, which mostly behaves like the long type in Python 2. Since there are no longer two types, there is no need for special syntax to distinguish them.
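A quick sketch of what that means in Python 3: integers simply keep growing, with no suffix required. sys.maxsize is used here only as a conveniently large number; it is not a limit on int.

```python
import sys

big = sys.maxsize + 1      # would have been promoted to a long in Python 2
bigger = 2 ** 100          # no trailing L needed

print(type(big))           # <class 'int'>
print(bigger)              # 1267650600228229401496703205376
```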

                                                Further reading: PEP 237: Unifying Long Integers and Integers. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -179,7 +185,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
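As a rough sketch of the unified integer type described above (the variable names are mine, purely for illustration), a Python 3 int can grow past sys.maxsize with no L suffix:

```python
import sys

# One step past the largest "native" size is still an ordinary int.
beyond_native = sys.maxsize + 1

# Arbitrarily large values need no special syntax either.
huge = 2 ** 100

# Collect the types to show that only int is involved.
kinds = {type(beyond_native), type(huge)}
```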

                                                <> comparison

                                                Python 2 supported <> as a synonym for !=, the not-equals comparison operator. Python 3 supports the != operator, but not <>. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -198,7 +204,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}

                                                has_key() dictionary method

                                                In Python 2, dictionaries had a has_key() method to test whether the dictionary had a certain key. In Python 3, this method no longer exists. Instead, you need to use the in operator. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -229,7 +235,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
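A minimal sketch of the has_key() replacement described above, using a made-up dictionary:

```python
# Hypothetical sample data, only for illustration.
a_dictionary = {"author": "Mark", "title": "Dive Into Python 3"}

# Python 2: a_dictionary.has_key("author")
has_author = "author" in a_dictionary

# Python 2: not a_dictionary.has_key("isbn")
lacks_isbn = "isbn" not in a_dictionary

# The old method really is gone in Python 3.
method_removed = not hasattr(a_dictionary, "has_key")
```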

                                                Dictionary methods that return lists

                                                In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(), items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -262,7 +268,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
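To illustrate the difference between a view and a list described above, here is a small sketch (the dictionary contents are invented). Wrapping the view in list() gives you back the indexable snapshot that Python 2 returned:

```python
a_dictionary = {"a": 1, "b": 2}

# keys() returns a dynamic view, not a list.
keys_view = a_dictionary.keys()

# Views do not support indexing, so this raises TypeError.
try:
    keys_view[0]
    indexable = True
except TypeError:
    indexable = False

# list() takes a snapshot that can be indexed.
keys_list = list(keys_view)

# The view tracks later changes to the dictionary; the snapshot does not.
a_dictionary["c"] = 3
view_after_change = list(keys_view)
```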

                                                Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical.

                                                http

                                                In Python 3, several related HTTP modules have been combined into a single package, http. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -291,7 +297,7 @@ import CGIHttpServer

                                                urllib

                                                Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -330,7 +336,7 @@ from urllib.error import HTTPError
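As a sketch of where things landed in the reorganized urllib package described above, parsing and encoding helpers live in urllib.parse and the exceptions in urllib.error. The URL is a placeholder, and nothing here touches the network:

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote, urlencode, urlparse

# Split a (placeholder) URL into its components.
parts = urlparse("http://example.com/search?q=python#top")

# Build a query string from a dictionary.
query = urlencode({"q": "dive into python", "page": 2})

# Percent-encode a path fragment; "/" is left alone by default.
escaped = quote("a b/c")
```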

                                                dbm

                                                All the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM, you can import the appropriate module within the dbm package. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -356,7 +362,7 @@ import whichdb

                                                xmlrpc

                                                XML-RPC is a lightweight method of performing remote procedure calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -372,7 +378,7 @@ import SimpleXMLRPCServer
                                                Notes Python 2

                                                Other modules

                                                -

                                                skip over this table +

                                                skip over this table @@ -426,7 +432,7 @@ except ImportError:

                                                Relative imports within a package

                                                A package is a group of related modules that function as a single entity. In Python 2, when modules within a package need to reference each other, you use import foo or from foo import Bar. The Python 2 interpreter first searches within the current package to find foo.py, and then moves on to the other directories in the Python search path (sys.path). Python 3 works a bit differently. Instead of searching the current package, it goes directly to the Python search path. If you want one module within a package to import another module in the same package, you need to explicitly provide the relative path between the two modules.

                                                Suppose you had this package, with multiple files in the same directory: -

                                                skip over this ASCII art +

                                                skip over this ASCII art

                                                chardet/
                                                 |
                                                 +--__init__.py
                                                @@ -437,7 +443,7 @@ except ImportError:
                                                 |
                                                 +--universaldetector.py

                                                Now suppose that universaldetector.py needs to import the entire constants.py file and one class from mbcharsetprober.py. How do you do it? -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -456,7 +462,7 @@ except ImportError:
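The explicit relative imports described above can be tried without touching a real package. This sketch builds a throwaway package in a temporary directory; the package name demo_chardet and the file contents are my own stand-ins, not the real chardet sources:

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Lay out a tiny package that mirrors the directory tree above.
    package_dir = Path(tmp) / "demo_chardet"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "constants.py").write_text("DETECTING = 0\n")
    (package_dir / "mbcharsetprober.py").write_text(
        "class MultiByteCharSetProber:\n    pass\n"
    )
    # The leading dot means "look in the current package".
    (package_dir / "universaldetector.py").write_text(
        "from . import constants\n"
        "from .mbcharsetprober import MultiByteCharSetProber\n"
    )

    sys.path.insert(0, tmp)
    try:
        detector = importlib.import_module("demo_chardet.universaldetector")
    finally:
        sys.path.remove(tmp)

# Names that the relative imports brought into the module.
imported_names = sorted(n for n in vars(detector) if not n.startswith("__"))
```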

                                                next() iterator method

                                                In Python 2, iterators had a next() method which returned the next item in the sequence. In Python 3, that method has been renamed to __next__(), and there is a global next() function that takes an iterator as an argument and calls its __next__() method for you. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -497,7 +503,7 @@ for an_iterator in a_sequence_of_iterators:
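A short sketch of the global next() function mentioned above. In Python 3 the underlying special method is spelled __next__(), and next() also accepts an optional default for exhausted iterators:

```python
an_iterator = iter(["a", "b"])

# Preferred spelling: the global function.
first = next(an_iterator)

# What the global function calls under the hood.
second = an_iterator.__next__()

# With a default, an exhausted iterator returns it instead of raising StopIteration.
exhausted = next(an_iterator, None)

# Python 3 iterators have no method literally named next().
has_old_method = hasattr(an_iterator, "next")
```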

                                                filter() global function

                                                In Python 2, the filter() function returned a list, the result of filtering a sequence through a function that returned True or False for each item in the sequence. In Python 3, the filter() function returns an iterator, not a list. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -528,7 +534,7 @@ for an_iterator in a_sequence_of_iterators:
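To make the filter() change described above concrete, here is a minimal sketch. Wrap the call in list() when you need the Python 2 behavior:

```python
# filter() now returns a lazy iterator, not a list.
evens = filter(lambda n: n % 2 == 0, range(10))
is_list = isinstance(evens, list)

# list() consumes the iterator and produces the list Python 2 returned.
evens_list = list(evens)

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(evens)
```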

                                                map() global function

                                                In much the same way as filter(), the map() function now returns an iterator. (In Python 2, it returned a list.) -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -559,7 +565,7 @@ for an_iterator in a_sequence_of_iterators:
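The same pattern applies to map(), as this small sketch shows. A list comprehension is an equivalent spelling when you want a list right away:

```python
# map() returns a lazy iterator in Python 3.
squares = map(lambda n: n * n, [1, 2, 3])
is_list = isinstance(squares, list)

# Force it into a list to get the Python 2 result.
squares_list = list(squares)

# An equivalent list comprehension.
squares_again = [n * n for n in [1, 2, 3]]
```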

                                                reduce() global function (3.1+)

                                                In Python 3, the reduce() function has been removed from the global namespace and placed in the functools module. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -575,7 +581,7 @@ reduce(a, b, c)
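A minimal sketch of the relocated reduce() function described above, summing a made-up list:

```python
import builtins
import operator
from functools import reduce

# reduce() folds a sequence into a single value, left to right.
total = reduce(operator.add, [1, 2, 3, 4], 0)

# It is no longer available as a global (built-in) name.
is_builtin = hasattr(builtins, "reduce")
```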

                                                apply() global function

                                                Python 2 had a global function called apply(), which took a function f and a list [a, b, c] and returned f(a, b, c). In Python 3, the apply() function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function's arguments. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -602,7 +608,7 @@ reduce(a, b, c)
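The calling syntax that replaces apply(), described above, might look like this. The describe() function and its arguments are hypothetical, chosen only to show the asterisk forms:

```python
def describe(name, version, sep=" "):
    """Join a name and a version number with a separator."""
    return name + sep + str(version)

args = ["Python", 3]
kwargs = {"sep": "-"}

# Python 2: apply(describe, args)
plain = describe(*args)

# Python 2: apply(describe, args, kwargs)
with_keywords = describe(*args, **kwargs)
```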

                                                intern() global function

                                                In Python 2, you could call the intern() function on a string to intern it as a performance optimization. In Python 3, the intern() function has been moved to the sys module. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -615,7 +621,7 @@ reduce(a, b, c)

                                                exec statement

                                                Just as the print statement became a function in Python 3, so too has the exec statement. The exec() function takes a string which contains arbitrary Python code and executes it as if it were just another statement or expression. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -638,7 +644,7 @@ reduce(a, b, c)
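As a sketch of the exec() function described above, passing an explicit namespace dictionary keeps the executed code from touching your own variables (the code string is a toy example):

```python
# The executed code stores its names in this dictionary.
namespace = {}

# Python 2: exec "x = 6 * 7" in namespace
exec("x = 6 * 7", namespace)

answer = namespace["x"]
```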

                                                execfile() function (3.1+)

                                                Like the old exec statement, the old execfile() function executed Python code. Where exec took a string, execfile() took a filename. In Python 3, execfile() has been eliminated. If you really need to take a file of Python code and execute it (but you're not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the global compile() function to force the Python interpreter to compile the code, and then calling the new exec() function. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -653,7 +659,7 @@ reduce(a, b, c)
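The open, read, compile, exec sequence described above can be sketched like this. The script is a temporary file I create on the spot, so the path and its contents are placeholders:

```python
import os
import tempfile

# Write a throwaway script to disk so there is something to execute.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as script:
    script.write("greeting = 'hello'\n")
    script_path = script.name

try:
    # Open the file, read its contents, and compile them.
    with open(script_path, encoding="utf-8") as source_file:
        code = compile(source_file.read(), script_path, "exec")

    # Run the compiled code in its own namespace.
    namespace = {}
    exec(code, namespace)
finally:
    os.remove(script_path)

greeting = namespace["greeting"]
```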

                                                repr literals (backticks)

                                                In Python 2, there was a special syntax of wrapping any object in backticks (like `x`) to get a representation of the object. In Python 3, this capability still exists, but you can no longer use backticks to get it. Instead, use the global repr() function. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -672,7 +678,7 @@ reduce(a, b, c)
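A quick sketch of the repr() function standing in for backticks, as described above (the sample values are invented):

```python
# Python 2: `'PapayaWhip'`
string_repr = repr("PapayaWhip")

# Python 2: `[1, 'two']`
list_repr = repr([1, "two"])
```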

                                                try...except statement

                                                The syntax for catching exceptions has changed slightly between Python 2 and Python 3. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -720,7 +726,7 @@ except:
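In short, Python 3 uses the as keyword where Python 2 allowed a comma, and it requires parentheses around multiple exception types. A minimal sketch, with a hypothetical safe_divide() helper:

```python
def safe_divide(a, b):
    """Divide a by b, reporting the exception type on failure."""
    try:
        return a / b
    # Python 2 also allowed: except (ZeroDivisionError, TypeError), error:
    except (ZeroDivisionError, TypeError) as error:
        return "failed: " + type(error).__name__

ok = safe_divide(6, 3)
by_zero = safe_divide(1, 0)
bad_type = safe_divide(1, "x")
```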

                                                raise statement

                                                The syntax for raising your own exceptions has changed slightly between Python 2 and Python 3. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -747,7 +753,7 @@ except:
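For the raise statement, the Python 3 form calls the exception class with its message instead of separating the two with a comma. A small sketch with a made-up exception class:

```python
class MyException(Exception):
    """Hypothetical exception type used only for this example."""

def fail():
    # Python 2 also allowed: raise MyException, "error message"
    raise MyException("error message")

try:
    fail()
except MyException as error:
    message = str(error)
```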

                                                throw method on generators

                                                In Python 2, generators have a throw() method. Calling a_generator.throw() raises an exception at the point where the generator was paused, then returns the next value yielded by the generator function. In Python 3, this functionality is still available, but the syntax is slightly different. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -770,7 +776,7 @@ except:
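Here is a sketch of throw() in action, passing an exception instance the Python 3 way. The counter() generator is invented for the example. It catches the thrown exception at the paused yield and keeps going:

```python
def counter():
    """Yield increasing numbers; a thrown ValueError jumps the count to 100."""
    value = 0
    while True:
        try:
            yield value
            value += 1
        except ValueError:
            value = 100

a_generator = counter()

# Run to the first yield.
first = next(a_generator)

# Raise ValueError at the paused yield; throw() returns the next yielded value.
resumed = a_generator.throw(ValueError("reset"))

# The generator carries on normally afterwards.
after = next(a_generator)
```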

                                                xrange() global function

                                                In Python 2, there were two ways to get a range of numbers: range(), which returned a list, and xrange(), which returned a lazy object that produced its numbers on demand. In Python 3, range() returns a lazy object of that kind, and xrange() doesn't exist. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -801,7 +807,7 @@ except:
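A brief sketch of the Python 3 range() described above. It no longer builds a list up front, so wrap it in list() when you need one:

```python
numbers = range(10)

# Not a list any more, but still sized and indexable.
is_list = isinstance(numbers, list)
length = len(numbers)
third = numbers[2]

# Membership tests work without materializing a billion numbers.
found = 5 in range(0, 10 ** 9, 5)

# Python 2's range(3) is spelled list(range(3)) in Python 3.
as_list = list(range(3))
```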

                                                raw_input() and input() global functions

                                                Python 2 had two global functions for asking the user for input on the command line. The first, called input(), expected the user to enter a Python expression (and returned the result). The second, called raw_input(), just returned whatever the user typed. This was wildly confusing for beginners and widely regarded as a “wart” in the language. Python 3 excises this wart by renaming raw_input() to input(), so it works the way everyone naively expects it to work. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -824,7 +830,7 @@ except:

                                                func_* function attributes

                                                In Python 2, code within functions can access special attributes about the function itself. In Python 3, these special function attributes have been renamed for consistency with other attributes. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -863,7 +869,7 @@ except:
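A sketch of a few of the renamed function attributes described above, using a toy function:

```python
def a_function(x, y=2):
    """Add two numbers."""
    return x + y

# Python 2: a_function.func_name
name = a_function.__name__

# Python 2: a_function.func_doc
doc = a_function.__doc__

# Python 2: a_function.func_defaults
defaults = a_function.__defaults__

# Python 2: a_function.func_code
argument_count = a_function.__code__.co_argcount

# The old spellings no longer exist.
has_old_name = hasattr(a_function, "func_name")
```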

                                                xreadlines() I/O method

                                                In Python 2, file objects had an xreadlines() method which returned an iterator that would read the file one line at a time. This was useful in for loops, among other places. In fact, it was so useful, later versions of Python 2 added the capability to file objects themselves. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -883,7 +889,7 @@ except:
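In Python 3 you iterate over the file object directly instead of calling xreadlines(). In this sketch an in-memory io.StringIO stands in for a real file:

```python
import io

# Stand-in for a file opened in text mode.
a_file = io.StringIO("one\ntwo\n")

# Python 2: for line in a_file.xreadlines():
lines = [line.rstrip("\n") for line in a_file]

# The old method is gone.
has_xreadlines = hasattr(a_file, "xreadlines")
```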

                                                lambda functions with multiple parameters

                                                In Python 2, you could define anonymous lambda functions which took multiple parameters by defining the function as taking a tuple with a specific number of items. In effect, Python 2 would “unpack” the tuple into named arguments, which you could then reference (by name) within the lambda function. In Python 3, you can still pass a tuple to a lambda function, but the Python interpreter will not unpack the tuple into named arguments. Instead, you will need to reference each argument by its positional index. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -906,7 +912,7 @@ except:
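A minimal sketch of the positional-index style described above. As an alternative, you can declare separate parameters and unpack the tuple at the call site:

```python
# Python 2: lambda (x, y): x + y
add_pair = lambda pair: pair[0] + pair[1]
total = add_pair((1, 2))

# Alternative: two ordinary parameters, unpacked by the caller.
add_two = lambda x, y: x + y
total_unpacked = add_two(*(1, 2))
```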

                                                Special method attributes

                                                In Python 2, class methods can reference the class object they are defined in, as well as the method object itself. im_self is the class instance object; im_func is the function object; im_class is the class of im_self (for bound methods) or the class that asked for the method (for unbound methods). In Python 3, these special method attributes have been renamed to follow the naming conventions of other attributes. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -925,7 +931,7 @@ except:
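The renamed method attributes can be sketched as follows. The Greeter class is a hypothetical example:

```python
class Greeter:
    def greet(self):
        return "hi"

an_instance = Greeter()
bound_method = an_instance.greet

# Python 2: bound_method.im_self
same_instance = bound_method.__self__ is an_instance

# Python 2: bound_method.im_func
same_function = bound_method.__func__ is Greeter.greet

# Python 2: bound_method.im_class
same_class = bound_method.__self__.__class__ is Greeter
```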

                                                __nonzero__ special class attribute

                                                In Python 2, you could build your own classes that could be used in a boolean context. For example, you could instantiate the class and then use the instance in an if statement. To do this, you defined a special __nonzero__() method which returned True or False, and it was called whenever the instance was used in a boolean context. In Python 3, you can still do this, but the name of the method has changed to __bool__(). -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -950,7 +956,7 @@ except:
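A small sketch of the __bool__() special method described above. The Basket class is made up for the example:

```python
class Basket:
    """A container that is 'true' only when it holds something."""

    def __init__(self, items):
        self.items = list(items)

    # Python 2: def __nonzero__(self):
    def __bool__(self):
        return len(self.items) > 0

empty_is_true = bool(Basket([]))
full_is_true = bool(Basket(["apple"]))

# The same method drives if statements.
verdict = "has items" if Basket(["pear"]) else "empty"
```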

                                                Octal literals

                                                The syntax for defining base 8 (octal) numbers has changed slightly between Python 2 and Python 3. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -963,7 +969,7 @@ except:
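In Python 3, octal literals take a 0o prefix instead of a bare leading zero. A quick sketch:

```python
# Python 2: x = 0755
x = 0o755

# Converting to and from octal strings.
from_string = int("755", 8)
back_to_octal = oct(x)
```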

                                                sys.maxint

                                                Due to the integration of the long and int types, the sys.maxint constant is no longer accurate. Because the value may still be useful in determining platform-specific capabilities, it has been retained but renamed as sys.maxsize. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -982,7 +988,7 @@ except:

                                                callable() global function

                                                In Python 2, you could check whether an object was callable (like a function) with the global callable() function. In Python 3.0 and 3.1, this global function was eliminated (it returned in Python 3.2). To check whether an object is callable without it, check for the existence of the __call__() special method. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -995,7 +1001,7 @@ except:

                                                zip() global function

                                                In Python 2, the global zip() function took any number of sequences and returned a list of tuples. The first tuple contained the first item from each sequence; the second tuple contained the second item from each sequence; and so on. In Python 3, zip() returns an iterator instead of a list. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1014,7 +1020,7 @@ except:
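A sketch of the zip() change described above. As with filter() and map(), list() restores the old list result:

```python
# zip() returns a lazy iterator of tuples.
pairs = zip("abc", [1, 2, 3])
is_list = isinstance(pairs, list)

# Python 2's zip(a, b) is list(zip(a, b)) in Python 3.
pairs_list = list(pairs)

# The iterator is exhausted after one pass.
leftover = list(pairs)
```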

                                                StandardError exception

                                                In Python 2, StandardError was the base class for all built-in exceptions other than StopIteration, GeneratorExit, KeyboardInterrupt, and SystemExit. In Python 3, StandardError has been eliminated; use Exception instead. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1030,7 +1036,7 @@ except:

                                                types module constants

                                                The types module contains a variety of constants to help you determine the type of an object. In Python 2, it contained constants for all primitive types like dict and int. In Python 3, these constants have been eliminated; just use the primitive type name instead. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1058,7 +1064,7 @@ except:

                                                isinstance() global function (3.1+)

                                                The isinstance() function checks whether an object is an instance of a particular class or type. In Python 2, you could pass a tuple of types, and isinstance() would return True if the object was any of those types. In Python 3, you can still do this, but passing the same type twice is deprecated. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1073,7 +1079,7 @@ except:

                                                basestring datatype

                                                Python 2 had two string types: Unicode and non-Unicode. But there was also another type, basestring. It was an abstract type, a superclass for both the str and unicode types. It couldn't be called or instantiated directly, but you could pass it to the global isinstance() function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, so basestring has no reason to exist. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1112,7 +1118,7 @@ except:
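With basestring gone, the string check described above uses str directly. A minimal sketch:

```python
# Python 2: isinstance(x, basestring)
text_is_string = isinstance("PapayaWhip", str)

# Bytes are a separate type and are not strings.
bytes_is_string = isinstance(b"PapayaWhip", str)
```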

                                                sys.exc_type, sys.exc_value, sys.exc_traceback

                                                Python 2 had three variables in the sys module that you could access while an exception was being handled: sys.exc_type, sys.exc_value, sys.exc_traceback. (Actually, these date all the way back to Python 1.) Ever since Python 1.5, these variables have been deprecated in favor of sys.exc_info(), a function that returns a tuple containing all three values. In Python 3, these individual variables have finally gone away; you must use sys.exc_info(). -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1131,7 +1137,7 @@ except:
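A sketch of pulling the three values out of sys.exc_info() while an exception is being handled (the failing int() call is just a convenient way to raise one):

```python
import sys

try:
    int("not a number")
except ValueError:
    # Python 2: sys.exc_type, sys.exc_value, sys.exc_traceback
    exc_type, exc_value, exc_traceback = sys.exc_info()

type_name = exc_type.__name__
has_traceback = exc_traceback is not None

# The old module-level variables are gone.
has_old_variable = hasattr(sys, "exc_type")
```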

                                                List comprehensions over tuples

                                                In Python 2, if you wanted to code a list comprehension that iterated over a tuple, you did not need to put parentheses around the tuple values. In Python 3, explicit parentheses are required. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1144,7 +1150,7 @@ except:
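A one-line sketch of the parenthesized form described above:

```python
# Python 2 also allowed: [i * 2 for i in 1, 2, 3]
doubled = [i * 2 for i in (1, 2, 3)]
```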

                                                os.getcwdu() function

                                                Python 2 had a function named os.getcwd(), which returned the current working directory as a (non-Unicode) string. Because modern file systems can handle directory names in any character encoding, Python 2.3 introduced os.getcwdu(). The os.getcwdu() function returned the current working directory as a Unicode string. In Python 3, there is only one string type (Unicode), so os.getcwd() is all you need. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1157,7 +1163,7 @@ except:

                                                Metaclasses

                                                In Python 2, you could create metaclasses by defining a special __metaclass__ attribute, either at class level or at module level. In Python 3, the __metaclass__ attribute has been eliminated; instead, you pass the metaclass argument in the class declaration. -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1190,7 +1196,7 @@ except:
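In Python 3, the metaclass is given as a keyword argument in the class declaration. A sketch, with a hypothetical Registry metaclass that records the name of every class it creates:

```python
class Registry(type):
    """Hypothetical metaclass that remembers the classes it builds."""

    classes = []

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.classes.append(name)
        return cls

# Python 2 spelled this as a class-level attribute: __metaclass__ = Registry
class Plugin(metaclass=Registry):
    pass

plugin_type = type(Plugin)
registered = list(Registry.classes)
```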

                                                The 2to3 script will not fix set() literals by default. To enable this fix, specify -f set_literal on the command line when you call 2to3.

                                                -

                                                skip over this table +

                                                skip over this table

                                                Notes Python 2
                                                @@ -1212,7 +1218,7 @@ except:

                                                The 2to3 script will not fix the buffer() function by default. To enable this fix, specify -f buffer on the command line when you call 2to3.

                                                -

                                                skip over this table +

                                                skip over this table

                                                Notes Before
                                                @@ -1228,7 +1234,7 @@ except:

                                                The 2to3 script will not fix whitespace around commas by default. To enable this fix, specify -f wscomma on the command line when you call 2to3.

                                                -

                                                skip over this table +

                                                skip over this table

                                                Notes Before
                                                @@ -1247,7 +1253,7 @@ except:

                                                The 2to3 script will not fix common idioms by default. To enable this fix, specify -f idioms on the command line when you call 2to3.

                                                -

                                                skip over this table +

                                                skip over this table

                                                Notes Before
                                                @@ -1273,6 +1279,6 @@ do_stuff(a_list)
                                                Notes Before

                                                FIXME: once the rest of the book is written, this appendix should contain copious links back to any chapter or section that touches on these features. -

                                                © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                                                © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/publish b/publish index 30c2c1d..e2da038 100644 --- a/publish +++ b/publish @@ -4,11 +4,13 @@ rm -rf build mkdir build cp *.py robots.txt *.js *.css build/ + +# minimize HTML (note: this script is quite fragile and relies on knowledge of how I write HTML) for f in *.html; do python htmlminimizer.py "$f" build/"$f" done -# replace local jquery reference with Google API loader +# jQuery will be served by Google AJAX Libraries API sed -i -e "s|jquery\.js|http://www.google.com/jsapi|g" build/*.html sed -i -e "s|//google\.|google.|g" build/dip3.js sed -i -e "s|//}.; /\* google\..*|});|g" build/dip3.js @@ -18,16 +20,22 @@ revision=`hg log|grep changeset|cut -d":" -f3|head -1` java -jar yuicompressor-2.4.2.jar build/dip3.js > build/$revision.js java -jar yuicompressor-2.4.2.jar build/dip3.css > build/$revision.css sed -i -e "s|;}|}|g" build/$revision.css -css=`cat build/$revision.css` -sed -i -e "s|dip3\.js|http://wearehugh.com/dip3/${revision}.js|g" build/*.html -#sed -i -e "s|dip3\.css|http://wearehugh.com/dip3/${revision}.css|g" build/*.html -sed -i -e "s|||g" -e "s||g" -e "s| -

                                                skip to main content -

                                                  
                                                -

                                                skip to main content +

                                                  
                                                +

                                                You are here: Home Dive Into Python 3

                                                Regular expressions

                                                Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
                                                Jamie Zawinski @@ -35,7 +33,7 @@ body{counter-reset:h1 4}

                                              2. Summary

                                              Diving in

                                              -

                                              Every modern programming language has built-in functions for working with strings. In Python, strings have methods for searching and replacing: index(), find(), split(), count(), replace(), &c. But these methods are limited to the simplest of cases. For example, the index() method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace() and split() methods have the same limitations. +

                                              Every modern programming language has built-in functions for working with strings. In Python, strings have methods for searching and replacing: index(), find(), split(), count(), replace(), &c. But these methods are limited to the simplest of cases. For example, the index() method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace() and split() methods have the same limitations.

                                              If your goal can be accomplished with string methods, you should use them. They’re fast and simple and easy to read, and there’s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you’re chaining calls to split() and join() to slice-and-dice your strings, you may need to move up to regular expressions.

                                              Regular expressions are a powerful and (mostly) standardized way of searching, replacing, and parsing text with complex patterns of characters. Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions, so you can include fine-grained documentation within them.
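To make the case-sensitivity point concrete, here is a small sketch that compares the two approaches. The address string is made up for illustration; re.IGNORECASE is the standard flag in the re module for case-insensitive matching.

```python
import re

address = '100 North Main Road'   # hypothetical mixed-case input

# String-method approach: normalise the case of the haystack yourself,
# and make sure the search string is in the matching case.
found_with_methods = 'road' in address.lower()

# Regular-expression approach: let the re.IGNORECASE flag handle case.
found_with_regex = re.search('road', address, re.IGNORECASE) is not None

print(found_with_methods, found_with_regex)
```

Both approaches find the word here. The string-method version is perfectly fine for a one-off check; the flag starts to pay off once the pattern itself gets more complicated.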

                                              @@ -45,16 +43,16 @@ body{counter-reset:h1 4}

                                              This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don’t just make this stuff up; it’s actually useful.) This example shows how I approached the problem.

                                              [The code examples will be easier to follow if you enable Javascript, but whatever.]

                                              ->>> s = '100 NORTH MAIN ROAD'
                                              ->>> s.replace('ROAD', 'RD.')                
                                              +>>> s = '100 NORTH MAIN ROAD'
                                              +>>> s.replace('ROAD', 'RD.')                
                                               '100 NORTH MAIN RD.'
                                              ->>> s = '100 NORTH BROAD ROAD'
                                              ->>> s.replace('ROAD', 'RD.')                
                                              +>>> s = '100 NORTH BROAD ROAD'
                                              +>>> s.replace('ROAD', 'RD.')                
                                               '100 NORTH BRD. RD.'
                                              ->>> s[:-4] + s[-4:].replace('ROAD', 'RD.')  
                                              +>>> s[:-4] + s[-4:].replace('ROAD', 'RD.')  
                                               '100 NORTH BROAD RD.'
                                              ->>> import re                               
                                              ->>> re.sub('ROAD$', 'RD.', s)               
                                              +>>> import re                               
                                              +>>> re.sub('ROAD$', 'RD.', s)               
                                               '100 NORTH BROAD RD.'
                                              1. My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I thought this was simple enough that I could just use the string method replace(). After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a constant. And in this deceptively simple example, s.replace() does indeed work. @@ -65,17 +63,17 @@ body{counter-reset:h1 4}

                                              Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD' at the end of the address, was not good enough, because not all addresses included a street designation at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of the word 'BROAD', which is not what I wanted.

                                              ->>> s = '100 BROAD'
                                              ->>> re.sub('ROAD$', 'RD.', s)
                                              +>>> s = '100 BROAD'
                                              +>>> re.sub('ROAD$', 'RD.', s)
                                               '100 BRD.'
                                              ->>> re.sub('\\bROAD$', 'RD.', s)   
                                              +>>> re.sub('\\bROAD$', 'RD.', s)   
                                               '100 BROAD'
                                              ->>> re.sub(r'\bROAD$', 'RD.', s)   
                                              +>>> re.sub(r'\bROAD$', 'RD.', s)   
                                               '100 BROAD'
                                              ->>> s = '100 BROAD ROAD APT. 3'
                                              ->>> re.sub(r'\bROAD$', 'RD.', s)   
                                              +>>> s = '100 BROAD ROAD APT. 3'
                                              +>>> re.sub(r'\bROAD$', 'RD.', s)   
                                               '100 BROAD ROAD APT. 3'
                                              ->>> re.sub(r'\bROAD\b', 'RD.', s)  
                                              +>>> re.sub(r'\bROAD\b', 'RD.', s)  
                                               '100 BROAD RD. APT. 3'
                                              1. What I really wanted was to match 'ROAD' when it was at the end of the string and it was its own word (and not a part of some larger word). To express this in a regular expression, you use \b, which means “a word boundary must occur right here.” In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it’s a bug in syntax or a bug in your regular expression. @@ -106,16 +104,16 @@ body{counter-reset:h1 4}
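If the escaping rules feel abstract, this short sketch may help. It only assumes the re module; the variable names are mine. It shows that the doubled-backslash string and the raw string are the same pattern, and that a single '\b' in an ordinary string is not a word boundary at all but a backspace character.

```python
import re

s = '100 BROAD ROAD APT. 3'

escaped = '\\bROAD\\b'   # ordinary string: every backslash must be doubled
raw = r'\bROAD\b'        # raw string: backslashes are taken literally

# Both spellings produce exactly the same pattern text.
same_pattern = (escaped == raw)

# Without the r prefix or the doubling, '\b' is a one-character backspace.
backspace = '\b'

result = re.sub(raw, 'RD.', s)
print(same_pattern, len(backspace), result)
```

The word 'BROAD' is left alone because there is no word boundary between its 'B' and 'R'.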

                                                Checking for thousands

                                                What would it take to validate that an arbitrary string is a valid Roman numeral? Let’s take it one digit at a time. Since Roman numerals are always written highest to lowest, let’s start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of M characters.

                                                ->>> import re
                                                ->>> pattern = '^M?M?M?$'        
                                                ->>> re.search(pattern, 'M')     
                                                +>>> import re
                                                +>>> pattern = '^M?M?M?$'        
                                                +>>> re.search(pattern, 'M')     
                                                 <SRE_Match object at 0106FB58>
                                                ->>> re.search(pattern, 'MM')    
                                                +>>> re.search(pattern, 'MM')    
                                                 <SRE_Match object at 0106C290>
                                                ->>> re.search(pattern, 'MMM')   
                                                +>>> re.search(pattern, 'MMM')   
                                                 <SRE_Match object at 0106AA38>
                                                ->>> re.search(pattern, 'MMMM')  
                                                ->>> re.search(pattern, '')      
                                                +>>> re.search(pattern, 'MMMM')  
                                                +>>> re.search(pattern, '')      
                                                 <SRE_Match object at 0106F4A8>
                                                1. This pattern has three parts. ^ matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the M characters were, which is not what you want. You want to make sure that the M characters, if they’re there, are at the beginning of the string. M? optionally matches a single M character. Since this is repeated three times, you’re matching anywhere from zero to three M characters in a row. And $ matches the end of the string. When combined with the ^ character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the M characters. @@ -151,16 +149,16 @@ body{counter-reset:h1 4}
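If you wanted to wrap this check in a function, it might look something like the sketch below. The name valid_thousands is hypothetical, not something from the book's code; the pattern is the one from the example above.

```python
import re

# Same pattern as in the interactive session, compiled once for reuse.
THOUSANDS = re.compile('^M?M?M?$')

def valid_thousands(s):
    """Return True if s consists of zero to three M characters and nothing else."""
    return THOUSANDS.search(s) is not None

for candidate in ['', 'M', 'MM', 'MMM', 'MMMM']:
    print(repr(candidate), valid_thousands(candidate))
```

Note that the empty string passes, because all three M characters are optional.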

                                                  This example shows how to validate the hundreds place of a Roman numeral.

                                                  ->>> import re
                                                  ->>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'  
                                                  ->>> re.search(pattern, 'MCM')             
                                                  +>>> import re
                                                  +>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'  
                                                  +>>> re.search(pattern, 'MCM')             
                                                   <SRE_Match object at 01070390>
                                                  ->>> re.search(pattern, 'MD')              
                                                  +>>> re.search(pattern, 'MD')              
                                                   <SRE_Match object at 01073A50>
                                                  ->>> re.search(pattern, 'MMMCCC')          
                                                  +>>> re.search(pattern, 'MMMCCC')          
                                                   <SRE_Match object at 010748A8>
                                                  ->>> re.search(pattern, 'MCMC')            
                                                  ->>> re.search(pattern, '')                
                                                  +>>> re.search(pattern, 'MCMC')            
                                                  +>>> re.search(pattern, '')                
                                                   <SRE_Match object at 01071D98>
                                                  1. This pattern starts out the same as the previous one, checking for the beginning of the string (^), then the thousands place (M?M?M?). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: CM, CD, and D?C?C?C? (which is an optional D followed by zero to three optional C characters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first one that matches, and ignores the rest. @@ -174,18 +172,18 @@ body{counter-reset:h1 4}
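One way to see which of the three alternatives was taken is to ask the match object for the parenthesized group. This is only an illustrative sketch; the hundreds dictionary is my own name, and group(1) is the standard way to retrieve the first group from a match.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'

hundreds = {}
for candidate in ['MCM', 'MD', 'MMMCCC', 'MCMC', '']:
    match = re.search(pattern, candidate)
    # group(1) is whatever the parenthesized alternation matched;
    # None is stored when the whole pattern failed to match.
    hundreds[candidate] = match.group(1) if match else None

print(hundreds)
```

'MCMC' fails because, once CM has been matched, the trailing C has nowhere to go before the end of the string.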

                                                    Using the {n,m} Syntax

                                                    In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.

                                                    ->>> import re
                                                    ->>> pattern = '^M?M?M?$'
                                                    ->>> re.search(pattern, 'M')     
                                                    +>>> import re
                                                    +>>> pattern = '^M?M?M?$'
                                                    +>>> re.search(pattern, 'M')     
                                                     <_sre.SRE_Match object at 0x008EE090>
                                                    ->>> pattern = '^M?M?M?$'
                                                    ->>> re.search(pattern, 'MM')    
                                                    +>>> pattern = '^M?M?M?$'
                                                    +>>> re.search(pattern, 'MM')    
                                                     <_sre.SRE_Match object at 0x008EEB48>
                                                    ->>> pattern = '^M?M?M?$'
                                                    ->>> re.search(pattern, 'MMM')   
                                                    +>>> pattern = '^M?M?M?$'
                                                    +>>> re.search(pattern, 'MMM')   
                                                     <_sre.SRE_Match object at 0x008EE090>
                                                    ->>> re.search(pattern, 'MMMM')  
                                                    ->>> 
                                                    +>>> re.search(pattern, 'MMMM')
                                                    +>>>
                                                    1. This matches the start of the string, and then the first optional M, but not the second and third M (but that’s okay because they’re optional), and then the end of the string.
                                                    2. This matches the start of the string, and then the first and second optional M, but not the third M (but that’s okay because it’s optional), and then the end of the string. @@ -193,15 +191,15 @@ body{counter-reset:h1 4}
                                                    3. This matches the start of the string, and then all three optional M, but then does not match the end of the string (because there is still one unmatched M), so the pattern does not match and returns None.
                                                    ->>> pattern = '^M{0,3}$'        
                                                    ->>> re.search(pattern, 'M')     
                                                    +>>> pattern = '^M{0,3}$'        
                                                    +>>> re.search(pattern, 'M')     
                                                     <_sre.SRE_Match object at 0x008EEB48>
                                                    ->>> re.search(pattern, 'MM')    
                                                    +>>> re.search(pattern, 'MM')    
                                                     <_sre.SRE_Match object at 0x008EE090>
                                                    ->>> re.search(pattern, 'MMM')   
                                                    +>>> re.search(pattern, 'MMM')   
                                                     <_sre.SRE_Match object at 0x008EEDA8>
                                                    ->>> re.search(pattern, 'MMMM')  
                                                    ->>> 
                                                    +>>> re.search(pattern, 'MMMM')
                                                    +>>>
                                                    1. This pattern says: “Match the start of the string, then anywhere from zero to three M characters, then the end of the string.” The 0 and 3 can be any numbers; if you want to match at least one but no more than three M characters, you could say M{1,3}.
                                                    2. This matches the start of the string, then one M out of a possible three, then the end of the string. @@ -212,17 +210,17 @@ body{counter-reset:h1 4}
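As a sanity check, you can verify that the two spellings agree on a handful of inputs. This is a sketch, not a proof; the variable names are mine, and the inputs are just runs of zero to five M characters.

```python
import re

old_style = re.compile('^M?M?M?$')    # three optional M characters
new_style = re.compile('^M{0,3}$')    # the same thing with {n,m}

candidates = ['M' * n for n in range(6)]   # '', 'M', ..., 'MMMMM'

# The two patterns agree if they match (or fail) on exactly the same inputs.
agree = all(
    (old_style.search(c) is None) == (new_style.search(c) is None)
    for c in candidates
)

# With a lower bound of 1, the empty string no longer matches.
at_least_one = re.compile('^M{1,3}$')

print(agree, at_least_one.search('') is None)
```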

                                                      Checking for tens and ones

                                                      Now let’s expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.

                                                      ->>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
                                                      ->>> re.search(pattern, 'MCMXL')     
                                                      +>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
                                                      +>>> re.search(pattern, 'MCMXL')     
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      ->>> re.search(pattern, 'MCML')      
                                                      +>>> re.search(pattern, 'MCML')      
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      ->>> re.search(pattern, 'MCMLX')     
                                                      +>>> re.search(pattern, 'MCMLX')     
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      ->>> re.search(pattern, 'MCMLXXX')   
                                                      +>>> re.search(pattern, 'MCMLXXX')   
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      ->>> re.search(pattern, 'MCMLXXXX')  
                                                      ->>> 
                                                      +>>> re.search(pattern, 'MCMLXXXX')
                                                      +>>>
                                                      1. This matches the start of the string, then the first optional M, then CM, then XL, then the end of the string. Remember, the (A|B|C) syntax means “match exactly one of A, B, or C”. You match XL, so you ignore the XC and L?X?X?X? choices, and then move on to the end of the string. MCMXL is the Roman numeral representation of 1940.
                                                      2. This matches the start of the string, then the first optional M, then CM, then L?X?X?X?. Of the L?X?X?X?, it matches the L and skips all three optional X characters. Then you move to the end of the string. MCML is the Roman numeral representation of 1950. @@ -232,17 +230,17 @@ body{counter-reset:h1 4}

                                                      The expression for the ones place follows the same pattern. I’ll spare you the details and show you the end result.

                                                      ->>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
                                                      +>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
                                                       

                                                      So what does that look like using this alternate {n,m} syntax? This example shows the new syntax.

                                                      ->>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
                                                      ->>> re.search(pattern, 'MDLV')              
                                                      +>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
                                                      +>>> re.search(pattern, 'MDLV')              
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      ->>> re.search(pattern, 'MMDCLXVI')          
                                                      +>>> re.search(pattern, 'MMDCLXVI')          
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      ->>> re.search(pattern, 'MMMDCCCLXXXVIII')
                                                      +>>> re.search(pattern, 'MMMDCCCLXXXVIII')
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      ->>> re.search(pattern, 'I')                 
                                                      +>>> re.search(pattern, 'I')                 
                                                       <_sre.SRE_Match object at 0x008EEB48>
                                                      1. This matches the start of the string, then one of a possible three M characters, then D?C{0,3}. Of that, it matches the optional D and zero of three possible C characters. Moving on, it matches L?X{0,3} by matching the optional L and zero of three possible X characters. Then it matches V?I{0,3} by matching the optional V and zero of three possible I characters, and finally the end of the string. MDLV is the Roman numeral representation of 1555. @@ -261,7 +259,7 @@ body{counter-reset:h1 4}
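Putting the whole pattern behind a function makes it easy to try a batch of candidates. This is a minimal sketch; is_roman_numeral is a hypothetical name of my own, and the pattern is the {n,m} version from the example above.

```python
import re

ROMAN = re.compile('^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def is_roman_numeral(s):
    """Return True if s matches the Roman numeral pattern (0 to 3 M's, etc.)."""
    return ROMAN.search(s) is not None

for candidate in ['MDLV', 'MMDCLXVI', 'I', 'MCMLXXXX', 'MMMMDCCCLXXXVIII']:
    print(candidate, is_roman_numeral(candidate))
```

Because M{0,3} allows at most three M characters, a string that starts with four of them is rejected. The empty string still passes, since every part of the pattern is optional; a real validator would probably want to rule that out separately.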

                                                        This will be more clear with an example. Let’s revisit the compact regular expression you’ve been working with, and make it a verbose regular expression. This example shows how.

                                                        ->>> pattern = """
                                                        +>>> pattern = """
                                                             ^                   # beginning of string
                                                             M{0,3}              # thousands - 0 to 3 M's
                                                             (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                                                        @@ -272,13 +270,13 @@ body{counter-reset:h1 4}
                                                                                 #        or 5-8 (V, followed by 0 to 3 I's)
                                                             $                   # end of string
                                                             """
                                                        ->>> re.search(pattern, 'M', re.VERBOSE)                 
                                                        +>>> re.search(pattern, 'M', re.VERBOSE)                 
                                                         <_sre.SRE_Match object at 0x008EEB48>
                                                        ->>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)         
                                                        +>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)         
                                                         <_sre.SRE_Match object at 0x008EEB48>
                                                        ->>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)
                                                        +>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)
                                                         <_sre.SRE_Match object at 0x008EEB48>
                                                        ->>> re.search(pattern, 'M')                             
                                                        +>>> re.search(pattern, 'M')
                                                        1. The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it’s a lot more readable.
                                                        2. This matches the start of the string, then one of a possible three M, then CM, then L and three of a possible three X, then IX, then the end of the string. @@ -303,24 +301,24 @@ body{counter-reset:h1 4}
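Since forgetting the flag is the easy mistake here, one option is to compile the pattern once with re.VERBOSE so the flag travels with the pattern object. The sketch below is an illustration under that assumption; the comments in the pattern are abbreviated, and the names roman, with_flag and without_flag are mine.

```python
import re

pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
    """

# Compile once with re.VERBOSE; later searches need no extra argument.
roman = re.compile(pattern, re.VERBOSE)

with_flag = roman.search('MCMLXXXIX')

# Same pattern text without the flag: the whitespace and comments are now
# treated as literal characters, so nothing matches.
without_flag = re.search(pattern, 'MCMLXXXIX')

print(with_flag is not None, without_flag is None)
```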

                                                          Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and the rest of the phone number was 1212. For those with an extension, I need to know that the extension was 1234.

                                                          Let’s work through developing a solution for phone number parsing. This example shows the first step.

                                                          ->>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')  
                                                          ->>> phonePattern.search('800-555-1212').groups()             
                                                          +>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')  
                                                          +>>> phonePattern.search('800-555-1212').groups()             
                                                           ('800', '555', '1212')
                                                          ->>> phonePattern.search('800-555-1212-1234')                 
                                                          ->>> 
                                                          +>>> phonePattern.search('800-555-1212-1234')
                                                          +>>>
                                                          1. Always read regular expressions from left to right. This one matches the beginning of the string, and then (\d{3}). What’s \d{3}? Well, the {3} means “match exactly three numeric digits”; it’s a variation on the {n,m} syntax you saw earlier. \d means “any numeric digit” (0 through 9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string.
                                                          2. To get access to the groups that the regular expression parser remembered along the way, use the groups() method on the object that the search() method returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits.
                                                          3. This regular expression is not the final answer, because it doesn’t handle a phone number with an extension on the end. For that, you’ll need to expand the regular expression.
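One practical wrinkle: search() returns None when nothing matches, and None has no groups() method, so chaining the two calls blows up on bad input. A defensive wrapper might look like this sketch; parse_phone is a hypothetical helper name, and the pattern is the first-step pattern from above.

```python
import re

phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

def parse_phone(number):
    """Return (area code, trunk, rest) as strings, or None if there is no match."""
    match = phonePattern.search(number)
    if match is None:
        # Calling .groups() on None would raise AttributeError.
        return None
    return match.groups()

print(parse_phone('800-555-1212'))
print(parse_phone('800-555-1212-1234'))
```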
                                                          ->>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')  
                                                          ->>> phonePattern.search('800-555-1212-1234').groups()              
                                                          +>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')  
                                                          +>>> phonePattern.search('800-555-1212-1234').groups()              
                                                           ('800', '555', '1212', '1234')
                                                          ->>> phonePattern.search('800 555 1212 1234')                       
                                                          ->>> 
                                                          ->>> phonePattern.search('800-555-1212')                            
                                                          ->>> 
                                                          +>>> phonePattern.search('800 555 1212 1234')
                                                          +>>>
                                                          +>>> phonePattern.search('800-555-1212')
                                                          +>>>
                                                          1. This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered group of four digits. What’s new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string.
                                                          2. The groups() method now returns a tuple of four elements, since the regular expression now defines four groups to remember. @@ -329,15 +327,15 @@ body{counter-reset:h1 4}

                                                          The next example shows the regular expression to handle separators between the different parts of the phone number.

                                                          ->>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')  
                                                          ->>> phonePattern.search('800 555 1212 1234').groups()  
                                                          +>>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')  
                                                          +>>> phonePattern.search('800 555 1212 1234').groups()  
                                                           ('800', '555', '1212', '1234')
                                                          ->>> phonePattern.search('800-555-1212-1234').groups()  
                                                          +>>> phonePattern.search('800-555-1212-1234').groups()  
                                                           ('800', '555', '1212', '1234')
                                                          ->>> phonePattern.search('80055512121234')              
                                                          ->>> 
                                                          ->>> phonePattern.search('800-555-1212')                
                                                          ->>> 
                                                          +>>> phonePattern.search('80055512121234') +>>> +>>> phonePattern.search('800-555-1212') +>>>
                                                          1. Hang on to your hat. You’re matching the beginning of the string, then a group of three digits, then \D+. What the heck is that? Well, \D matches any character except a numeric digit, and + means “1 or more”. So \D+ matches one or more characters that are not digits. This is what you’re using instead of a literal hyphen, to try to match different separators.
                                                          2. Using \D+ instead of - means you can now match phone numbers where the parts are separated by spaces instead of hyphens. @@ -347,15 +345,15 @@ body{counter-reset:h1 4}
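
One practical wrinkle in the failing cases above: search() returns None when nothing matches, so chaining .groups() onto a failed search raises AttributeError. The following is a minimal sketch of a guard, assuming Python 3. The helper name parse_phone is hypothetical and is not part of the chapter's code.

```python
import re

# Same pattern as the example above: \D+ demands at least one non-digit separator.
phone_pattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')

def parse_phone(text):
    """Return the four captured groups, or None if the pattern does not match."""
    match = phone_pattern.search(text)
    return match.groups() if match else None

print(parse_phone('800 555 1212 1234'))   # separated by spaces: matches
print(parse_phone('800-555-1212-1234'))   # separated by hyphens: matches
print(parse_phone('80055512121234'))      # no separators at all: None
print(parse_phone('800-555-1212'))        # no extension: None
```

Returning None keeps the "no match" case explicit for the caller instead of surfacing it as an exception.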

                                                          The next example shows the regular expression for handling phone numbers without separators.

                                                          ->>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
                                                          ->>> phonePattern.search('80055512121234').groups()      
                                                          +>>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
                                                          +>>> phonePattern.search('80055512121234').groups()      
                                                           ('800', '555', '1212', '1234')
                                                          ->>> phonePattern.search('800.555.1212 x1234').groups()  
                                                          +>>> phonePattern.search('800.555.1212 x1234').groups()  
                                                           ('800', '555', '1212', '1234')
                                                          ->>> phonePattern.search('800-555-1212').groups()        
                                                          +>>> phonePattern.search('800-555-1212').groups()        
                                                           ('800', '555', '1212', '')
                                                          ->>> phonePattern.search('(800)5551212 x1234')           
                                                          ->>> 
                                                          +>>> phonePattern.search('(800)5551212 x1234') +>>>
                                                          1. The only change you’ve made since that last step is changing all the + to *. Instead of \D+ between the parts of the phone number, you now match on \D*. Remember that + means “1 or more”? Well, * means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all.
                                                          2. Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits (800), then zero non-numeric characters, then a remembered group of three digits (555), then zero non-numeric characters, then a remembered group of four digits (1212), then zero non-numeric characters, then a remembered group of an arbitrary number of digits (1234), then the end of the string. @@ -365,13 +363,13 @@ body{counter-reset:h1 4}

                                                          The next example shows how to handle leading characters in phone numbers.

                                                          ->>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
                                                          ->>> phonePattern.search('(800)5551212 ext. 1234').groups()                  
                                                          +>>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
                                                          +>>> phonePattern.search('(800)5551212 ext. 1234').groups()                  
                                                           ('800', '555', '1212', '1234')
                                                          ->>> phonePattern.search('800-555-1212').groups()                            
                                                          +>>> phonePattern.search('800-555-1212').groups()                            
                                                           ('800', '555', '1212', '')
                                                          ->>> phonePattern.search('work 1-(800) 555.1212 #1234')                      
                                                          ->>> 
                                                          +>>> phonePattern.search('work 1-(800) 555.1212 #1234') +>>>
                                                          1. This is the same as in the previous example, except now you’re matching \D*, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you’re not remembering these non-numeric characters (they’re not in parentheses). If you find them, you’ll just skip over them and then start remembering the area code whenever you get to it.
                                                          2. You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it’s treated as a non-numeric separator and matched by the \D* after the first remembered group.) @@ -380,12 +378,12 @@ body{counter-reset:h1 4}

                                                          Let’s back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let’s take a different approach: don’t explicitly match the beginning of the string at all. This approach is shown in the next example.

                                                          ->>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
                                                          ->>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()         
                                                          +>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
                                                          +>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()         
                                                           ('800', '555', '1212', '1234')
                                                          ->>> phonePattern.search('800-555-1212').groups()
                                                          +>>> phonePattern.search('800-555-1212').groups()
                                                           ('800', '555', '1212', '')
                                                          ->>> phonePattern.search('80055512121234').groups()
                                                          +>>> phonePattern.search('80055512121234').groups()
                                                           ('800', '555', '1212', '1234')
                                                          1. Note the lack of ^ in this regular expression. You are not matching the beginning of the string anymore. There’s nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there. @@ -396,7 +394,7 @@ body{counter-reset:h1 4}
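
To see the engine "figuring out where the input string starts to match," it can help to compare the anchored and unanchored patterns side by side. This is a small sketch, assuming Python 3. The variable names anchored and unanchored are illustrative only.

```python
import re

# The previous iteration: anchored at the start, skipping leading non-digits.
anchored = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
# The current iteration: no ^, so the match may begin anywhere in the string.
unanchored = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

text = 'work 1-(800) 555.1212 #1234'

print(anchored.search(text))   # None: the stray leading "1" blocks the match
match = unanchored.search(text)
print(match.start())           # index at which the area code was found
print(match.groups())
```

The start() method of the match object reports where the match began, which makes the skipped prefix visible.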

                                                            See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can you tell the difference between one and the next?

                                                            While you still understand the final answer (and it is the final answer; if you’ve discovered a case it doesn’t handle, I don’t want to know about it), let’s write it out as a verbose regular expression, before you forget why you made the choices you made.

                                                            ->>> phonePattern = re.compile(r'''
                                                            +>>> phonePattern = re.compile(r'''
                                                                             # don't match beginning of string, number can start anywhere
                                                                 (\d{3})     # area code is 3 digits (e.g. '800')
                                                                 \D*         # optional separator is any number of non-digits
                                                            @@ -407,9 +405,9 @@ body{counter-reset:h1 4}
                                                                 (\d*)       # extension is optional and can be any number of digits
                                                                 $           # end of string
                                                                 ''', re.VERBOSE)
                                                            ->>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()  
                                                            +>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()  
                                                             ('800', '555', '1212', '1234')
                                                            ->>> phonePattern.search('800-555-1212').groups()
                                                            +>>> phonePattern.search('800-555-1212').groups()
                                                             ('800', '555', '1212', '')
                                                            1. Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it’s no surprise that it parses the same inputs. @@ -432,6 +430,6 @@ body{counter-reset:h1 4}
                                                            2. (x) in general is a remembered group. You can get the value of what matched by using the groups() method of the object returned by re.search.
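
As a recap of remembered groups and groups(), here is a sketch that checks a compact phone number pattern against a verbose spelling of the same expression, assuming Python 3. The comments on the middle groups are assumed wording, since the verbose listing is only partly visible in this excerpt.

```python
import re

compact = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

verbose = re.compile(r'''
    (\d{3})     # area code: exactly three digits
    \D*         # optional separator: any run of non-digits
    (\d{3})     # next three digits
    \D*         # optional separator
    (\d{4})     # last four digits of the main number
    \D*         # optional separator
    (\d*)       # optional extension: zero or more digits
    $           # end of string
    ''', re.VERBOSE)

samples = ['work 1-(800) 555.1212 #1234', '800-555-1212', '80055512121234']
for sample in samples:
    # Both spellings should capture exactly the same groups.
    assert compact.search(sample).groups() == verbose.search(sample).groups()
    print(sample, '->', verbose.search(sample).groups())
```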

                                                              Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than they solve. -

                                                              © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                                                              © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/strings.html b/strings.html index 61d88cf..983bc4b 100644 --- a/strings.html +++ b/strings.html @@ -1,19 +1,17 @@ - Strings - Dive into Python 3 - -

                                                              skip to main content -

                                                                
                                                              -

                                                              skip to main content +

                                                                
                                                              +


                                                              Strings

                                                              I’m telling you this ’cause you’re one of my friends.
                                                              @@ -35,7 +33,7 @@ My alphabet starts where your alphabet ends!
                                                              Further reading

                                                            Diving in

                                                            -

                                                            Chinese has thousands of characters. The Rotokas alphabet of Bougainville is the smallest alphabet in the world, with just 12 letters. English has 26, plus a handful of punctuation marks. Python 3 can handle all of these languages, and more. +

                                                            Chinese has thousands of characters. The Rotokas alphabet of Bougainville is the smallest alphabet in the world, with just 12 letters. English has 26, plus a handful of punctuation marks. Python 3 can handle all of these languages, and more.

                                                            When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. @@ -91,21 +89,21 @@ FIXME: update for Python 3

                                                            Python has had Unicode support throughout the language since version 2.0. The XML package uses Unicode to store all parsed XML data, but you can use Unicode anywhere.

                                                            Example 9.13. Introducing Unicode

                                                            ->>> s = u'Dive in'            
                                                            ->>> s
                                                            +>>> s = u'Dive in'            
                                                            +>>> s
                                                             u'Dive in'
                                                            ->>> print s 
                                                            +>>> print s 
                                                             Dive in
                                                            1. To create a Unicode string instead of a regular ASCII string, add the letter “u” before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; Unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as Unicode.
                                                            2. When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this Unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a Unicode string, you'd never notice the difference.
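
The session above is Python 2 syntax (the chapter is flagged for a Python 3 update). As a hedged sketch of the same idea in Python 3, where every str is already Unicode and no u prefix is required:

```python
s = 'Dive in'                      # a Python 3 str is Unicode by default
ascii_bytes = s.encode('ascii')
utf8_bytes = s.encode('utf-8')
print(ascii_bytes == utf8_bytes)   # ASCII-only text encodes identically in both

accented = 'La Pe\xf1a'            # contains one non-ASCII character
print(accented.encode('utf-8'))
try:
    accented.encode('ascii')       # ASCII has no code for this character
except UnicodeEncodeError as error:
    print('cannot encode as ASCII:', error.reason)
```

In Python 3 the conversion to bytes is always explicit, so the failure shows up at the encode() call rather than at print time.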

                                                              Example 9.14. Storing non-ASCII characters

                                                              ->>> s = u'La Pe\xf1a'         
                                                              ->>> print s 
                                                              +>>> s = u'La Pe\xf1a'         
                                                              +>>> print s 
                                                               Traceback (innermost last):
                                                                 File "<interactive input>", line 1, in ?
                                                               UnicodeError: ASCII encoding error: ordinal not in range(128)
                                                              ->>> print s.encode('latin-1') 
                                                              +>>> print s.encode('latin-1') 
                                                               La Peña
                                                              1. The real advantage of Unicode, of course, is its ability to store non-ASCII characters, like the Spanish “ñ” (n with a tilde over it). The Unicode character code for the tilde-n is 0xf1 in hexadecimal (241 in decimal), which you can type like this: \xf1. @@ -146,9 +144,9 @@ http://www.python.org/dev/peps/pep-3120/ - UTF-8 is now the default encoding (Py to insert values into a string with the %s placeholder.
                                                                ->>> k = "uid"
                                                                ->>> v = "sa"
                                                                ->>> "%s=%s" % (k, v) 
                                                                +>>> k = "uid"
                                                                +>>> v = "sa"
                                                                +>>> "%s=%s" % (k, v) 
                                                                 'uid=sa'
                                                                1. The whole expression evaluates to a string. The first %s is replaced by the value of k; the second %s is replaced by the value of v. All other characters in the string (in this case, the equal sign) stay as they are. @@ -160,16 +158,16 @@ http://www.python.org/dev/peps/pep-3120/ - UTF-8 is now the default encoding (Py string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
                                                                  ->>> uid = "sa"
                                                                  ->>> pwd = "secret"
                                                                  ->>> print pwd + " is not a good password for " + uid      
                                                                  +>>> uid = "sa"
                                                                  +>>> pwd = "secret"
                                                                  +>>> print pwd + " is not a good password for " + uid      
                                                                   secret is not a good password for sa
                                                                  ->>> print "%s is not a good password for %s" % (pwd, uid) 
                                                                  +>>> print "%s is not a good password for %s" % (pwd, uid) 
                                                                   secret is not a good password for sa
                                                                  ->>> userCount = 6
                                                                  ->>> print "Users connected: %d" % (userCount, )            
                                                                  +>>> userCount = 6
                                                                  +>>> print "Users connected: %d" % (userCount, )            
                                                                   Users connected: 6
                                                                  ->>> print "Users connected: " + userCount                 
                                                                  +>>> print "Users connected: " + userCount                 
                                                                   Traceback (innermost last):
                                                                     File "<interactive input>", line 1, in ?
                                                                   TypeError: cannot concatenate 'str' and 'int' objects
                                                                  @@ -184,11 +182,11 @@ TypeError: cannot concatenate 'str' and 'int' objects
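
The same contrast between formatting and concatenation can be sketched in Python 3, where print is a function. The exact wording of the error message varies between Python versions, so this sketch only relies on the exception type.

```python
user_count = 6

# Formatting coerces the integer into the string for you.
print("Users connected: %d" % (user_count,))

# Concatenation does not: str + int raises TypeError.
try:
    print("Users connected: " + user_count)
except TypeError as error:
    print("TypeError:", error)

# An explicit conversion makes concatenation work.
print("Users connected: " + str(user_count))
```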

                                                                  As with printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.

                                                                  ->>> print "Today's stock price: %f" % 50.4625   
                                                                  +>>> print "Today's stock price: %f" % 50.4625   
                                                                   Today's stock price: 50.462500
                                                                  ->>> print "Today's stock price: %.2f" % 50.4625 
                                                                  +>>> print "Today's stock price: %.2f" % 50.4625 
                                                                   Today's stock price: 50.46
                                                                  ->>> print "Change since yesterday: %+.2f" % 1.5 
                                                                  +>>> print "Change since yesterday: %+.2f" % 1.5 
                                                                   Change since yesterday: +1.50
                                                                  1. The %f string formatting option treats the value as a decimal, and prints it to six decimal places. @@ -213,10 +211,10 @@ is an object. You might have thought I meant that string variables are
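
For reference, here is a sketch of the same three format specifiers under Python 3, plus the equivalent str.format() spelling that this book uses elsewhere. The variable names price and change are illustrative.

```python
price = 50.4625
change = 1.5

print("Today's stock price: %f" % price)          # six decimal places by default
print("Today's stock price: %.2f" % price)        # rounded to two decimal places
print("Change since yesterday: %+.2f" % change)   # + forces an explicit sign

# The same precision modifier works with str.format().
print("Today's stock price: {0:.2f}".format(price))
```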
                                                                    ->>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
                                                                    ->>> ["%s=%s" % (k, v) for k, v in params.items()]
                                                                    +>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
                                                                    +>>> ["%s=%s" % (k, v) for k, v in params.items()]
                                                                     ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
                                                                    ->>> ";".join(["%s=%s" % (k, v) for k, v in params.items()])
                                                                    +>>> ";".join(["%s=%s" % (k, v) for k, v in params.items()])
                                                                     'server=mpilgrim;uid=sa;database=master;pwd=secret'

                                                                    This string is then returned from the odbchelper function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter. @@ -224,13 +222,13 @@ is an object. You might have thought I meant that string variables are
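
As a sketch under Python 3 assumptions, the join step can be paired with its inverse to confirm that nothing is lost. Note that dictionaries preserve insertion order from Python 3.7 onward, so the order of the joined pairs may differ from the older output shown above.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# Build "key=value" strings, then glue them together with semicolons.
pairs = ["%s=%s" % (k, v) for k, v in params.items()]
joined = ";".join(pairs)
print(joined)

# Reverse the process: split on ";" and then on the first "=" of each pair.
restored = dict(pair.split("=", 1) for pair in joined.split(";"))
print(restored == params)
```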

                                                                    You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's called split.

                                                                    ->>> li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
                                                                    ->>> s = ";".join(li)
                                                                    ->>> s
                                                                    +>>> li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
                                                                    +>>> s = ";".join(li)
                                                                    +>>> s
                                                                     'server=mpilgrim;uid=sa;database=master;pwd=secret'
                                                                    ->>> s.split(";")    
                                                                    +>>> s.split(";")    
                                                                     ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
                                                                    ->>> s.split(";", 1) 
                                                                    +>>> s.split(";", 1) 
                                                                     ['server=mpilgrim', 'uid=sa;database=master;pwd=secret']
                                                                    1. split reverses join by splitting a string into a multi-element list. Note that the delimiter (“;”) is stripped out completely; it does not appear in any of the elements of the returned list. @@ -263,6 +261,6 @@ http://www.w3.org/People/Dürst/papers.html http://rishida.net/scripts/chinese/ -

                                                                      © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                                                                      © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/table-of-contents.html b/table-of-contents.html index 68a1665..fefaf3d 100644 --- a/table-of-contents.html +++ b/table-of-contents.html @@ -1,11 +1,9 @@ - Table of contents - Dive Into Python 3 - -

                                                                       
                                                                      -
                                                                       
                                                                      +


                                                                      Table of contents

                                                                      1. Installing Python @@ -380,4 +378,4 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
                                                                      2. Dictionary comprehensions
                                                                      3. Views (several dictionary methods return them, they're dynamic, update when the dictionary changes, etc.) -

                                                                        © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                                                                        © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/unit-testing.html b/unit-testing.html index 22afcba..17d1388 100644 --- a/unit-testing.html +++ b/unit-testing.html @@ -1,19 +1,17 @@ - Unit testing - Dive into Python 3 - -

                                                                        skip to main content -

                                                                          
                                                                        -

                                                                        skip to main content +

                                                                          
                                                                        +


                                                                        Unit testing

                                                                        Certitude is not the test of certainty. We have been cocksure of many things that were not so.
                                                                        Oliver Wendell Holmes, Jr. @@ -26,7 +24,7 @@ body{counter-reset:h1 7}

                                                                      4. ...

                                                                      (Not) diving in

                                                                      -

                                                                      How do you know that the code you wrote yesterday still works after the changes you made today? Every seasoned programmer has war stories of an “innocent” change that couldn't possibly have affected that other “unrelated” module… If this sounds familiar, this chapter is for you. +

                                                                      How do you know that the code you wrote yesterday still works after the changes you made today? Every seasoned programmer has war stories of an “innocent” change that couldn't possibly have affected that other “unrelated” module… If this sounds familiar, this chapter is for you.

                                                                      In this chapter, you're going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in “Case study: roman numerals”. Now step back and consider what it would take to expand that into a two-way utility.

                                                                      The rules for Roman numerals lead to a number of interesting observations:

                                                                        @@ -149,7 +147,7 @@ function to_roman(n):

                                                                      Execute romantest1.py on the command line to run the test. If you call it with the -v command-line option, it will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:

                                                                      -you@localhost:~$ python3 romantest1.py -v
                                                                      +you@localhost:~$ python3 romantest1.py -v
                                                                       to_roman should give known result with known input ... FAIL            
                                                                       
                                                                       ======================================================================
                                                                      @@ -206,8 +204,8 @@ while n >= integer:
                                                                           print('subtracting {0} from input, adding {1} to output'.format(integer, numeral))

                                                                      With the debug print() statements, the output looks like this:

                                                                      ->>> import roman1
                                                                      ->>> roman1.to_roman(1424)
                                                                      +>>> import roman1
                                                                      +>>> roman1.to_roman(1424)
                                                                       subtracting 1000 from input, adding M to output
                                                                       subtracting 400 from input, adding CD to output
                                                                       subtracting 10 from input, adding X to output
                                                                      @@ -216,7 +214,7 @@ subtracting 4 from input, adding IV to output
                                                                       'MCDXXIV'
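
The debug output above is consistent with a greedy subtraction loop. The following is a sketch of such a loop, not necessarily the exact contents of roman1.py. The table name ROMAN_NUMERAL_MAP is an assumption, and the debug print() calls are omitted.

```python
# Numeral/value pairs, largest first, including the subtractive forms.
ROMAN_NUMERAL_MAP = (
    ('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
    ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
    ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1),
)

def to_roman(n):
    """Convert a positive integer to a Roman numeral by greedy subtraction."""
    result = ''
    for numeral, integer in ROMAN_NUMERAL_MAP:
        # Take the largest value that still fits, as many times as it fits.
        while n >= integer:
            result += numeral
            n -= integer
    return result

print(to_roman(1424))
```

Because the table lists values from largest to smallest, each pass subtracts the biggest value that still fits, which is the sequence the debug output traces.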

                                                                      So the to_roman() function appears to work, at least in this manual spot check. But will it pass the test case you wrote?

                                                                      -you@localhost:~$ python3 romantest1.py -v
                                                                      +you@localhost:~$ python3 romantest1.py -v
                                                                       to_roman should give known result with known input ... ok
                                                                       
                                                                       ----------------------------------------------------------------------
                                                                      @@ -230,12 +228,12 @@ OK

                                                                      “Halt and catch fire”

                                                                      It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.

                                                                      ->>> import roman1
                                                                      ->>> roman1.to_roman(4000)
                                                                      +>>> import roman1
                                                                      +>>> roman1.to_roman(4000)
                                                                       'MMMM'
                                                                      ->>> roman1.to_roman(5000)
                                                                      +>>> roman1.to_roman(5000)
                                                                       'MMMMM'
                                                                      ->>> roman1.to_roman(9000)  
                                                                      +>>> roman1.to_roman(9000)  
                                                                       'MMMMMMMMM'
                                                                      1. That's definitely not what you wanted — that's not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is baaaaaaad; if a program is going to fail, it is far better that it fail quickly and noisily. “Halt and catch fire,” as the saying goes. The Pythonic way to halt and catch fire is to raise an exception. @@ -260,7 +258,7 @@ OK

                                                                        Also note that you're passing the to_roman() function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned recently how handy it is that everything in Python is an object?
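As a minimal sketch of that point, the test below hands assertRaises the function object and its argument separately. The OutOfRangeError class and the stand-in to_roman() are assumptions made so the example is self-contained; they are not the chapter's actual roman2 module.

```python
import unittest


class OutOfRangeError(ValueError):
    """Hypothetical exception for input that cannot be converted."""


def to_roman(n):
    """Stand-in for the chapter's to_roman(); only the range check matters here."""
    if n > 3999:
        raise OutOfRangeError('number out of range (must be less than 4000)')
    return 'M' * (n // 1000)  # deliberately incomplete placeholder


class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        """to_roman should fail with large input"""
        # Pass the function itself, then its arguments. assertRaises makes
        # the call to_roman(4000) on your behalf and checks the exception.
        self.assertRaises(OutOfRangeError, to_roman, 4000)


# Run the test case programmatically instead of from the command line.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Writing to_roman(4000) inside the assertRaises call would raise the exception before assertRaises ever ran, which is why the function and its argument are passed separately.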

                                                                        So what happens when you run the test suite with this new test?

                                                                        -you@localhost:~$ python3 romantest2.py -v
                                                                        +you@localhost:~$ python3 romantest2.py -v
                                                                         to_roman should give known result with known input ... ok
                                                                         to_roman should fail with large input ... ERROR                         
                                                                         
                                                                        @@ -289,7 +287,7 @@ FAILED (errors=1)

                                                                      Now run the test suite again.

                                                                      -you@localhost:~$ python3 romantest2.py -v
                                                                      +you@localhost:~$ python3 romantest2.py -v
                                                                       to_roman should give known result with known input ... ok
                                                                       to_roman should fail with large input ... FAIL                          
                                                                       
                                                                      @@ -327,7 +325,7 @@ FAILED (failures=1)

                                                                    Does this make the test pass? Let's find out.

                                                                    -you@localhost:~$ python3 romantest2.py -v
                                                                    +you@localhost:~$ python3 romantest2.py -v
                                                                     to_roman should give known result with known input ... ok
                                                                     to_roman should fail with large input ... ok                            
                                                                     
                                                                    @@ -364,6 +362,6 @@ For instance, the testFromRomanCase method (“from_roman
                                                                     
                                                                  2. from_roman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
                                                                  --> -

                                                                  © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                                                                  © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/your-first-python-program.html b/your-first-python-program.html index 3f52e24..c2ec774 100644 --- a/your-first-python-program.html +++ b/your-first-python-program.html @@ -1,19 +1,18 @@ - Your first Python program - Dive into Python 3 - -

                                                                  skip to main content -

                                                                    
                                                                  -

                                                                  skip to main content +

                                                                    
                                                                  +

                                                                  You are here: Home Dive Into Python 3

                                                                  Your first Python program

                                                                  Don’t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate.
                                                                  Ven. Henepola Gunararatana @@ -40,9 +39,9 @@ body{counter-reset:h1 1}

                                                                2. Further reading

                                                                Diving in

                                                                -

                                                                Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it. +

                                                                Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.

                                                                [The code examples will be easier to follow if you enable Javascript, but whatever.] -

                                                                skip over this code listing +

                                                                skip over this code listing

                                                                [download humansize.py]

                                                                SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
                                                                             1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
                                                                @@ -73,14 +72,14 @@ if __name__ == "__main__":
                                                                     print(approximate_size(1000000000000, False))
                                                                     print(approximate_size(1000000000000))

                                                                Now let's run this program on the command line. On Windows, it will look something like this: -

                                                                skip over this command output listing +

                                                                skip over this command output listing

                                                                -c:\home\diveintopython3> c:\python30\python.exe humansize.py
                                                                +c:\home\diveintopython3> c:\python30\python.exe humansize.py
                                                                 1.0 TB
                                                                 931.3 GiB

                                                                On Mac OS X or Linux, it would look something like this:

                                                                -you@localhost:~$ python3 humansize.py
                                                                +you@localhost:~$ python3 humansize.py
                                                                 1.0 TB
                                                                 931.3 GiB

                                                                FIXME: this would be a good place to explain what the program, you know, actually does. @@ -114,7 +113,7 @@ if __name__ == "__main__":

                                                                So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters).
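A small sketch of both properties, using throwaway variable names chosen for this illustration: a name can be rebound to a value of another type without any declaration, but mixing a string and an integer is refused rather than silently coerced.

```python
value = 1        # no declaration: the name simply refers to an int
value = 'one'    # the same name may later refer to a str (dynamic typing)

try:
    outcome = 'count: ' + 1   # str + int is not coerced (strong typing)
except TypeError as error:
    outcome = 'TypeError: {0}'.format(error)

print(type(value).__name__)
print(outcome)
```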

                                                                If you have experience in other programming languages, this table may help you visualize how Python compares to them: - +
                                                                @@ -123,7 +122,7 @@ if __name__ == "__main__":

                                                                I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months.

                                                                Documentation strings

                                                                You can document a Python function by giving it a documentation string (docstring for short). In this program, the approximate_size function has a docstring: -

                                                                skip over this code listing +

                                                                skip over this code listing

                                                                def approximate_size(size, a_kilobyte_is_1024_bytes=True):
                                                                     """Convert a file size to human-readable form.
                                                                 
                                                                @@ -150,12 +149,12 @@ if __name__ == "__main__":
                                                                 

                                                                Everything is an object

                                                                In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object.

                                                                Run the interactive Python shell and follow along: -

                                                                skip over this interpreter listing +

                                                                skip over this interpreter listing

                                                                ->>> import humansize                               
                                                                ->>> print(humansize.approximate_size(4096, True))  
                                                                +>>> import humansize                               
                                                                +>>> print(humansize.approximate_size(4096, True))  
                                                                 4.0 KiB
                                                                ->>> print(humansize.approximate_size.__doc__)      
                                                                +>>> print(humansize.approximate_size.__doc__)      
                                                                 Convert a file size to human-readable form.
                                                                 
                                                                     Keyword arguments:
                                                                @@ -176,14 +175,14 @@ if __name__ == "__main__":
                                                                 
                                                                 

                                                                The import search path

                                                                Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in sys.path. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.) -

                                                                skip over this interpreter listing +

                                                                skip over this interpreter listing

                                                                ->>> import sys                       
                                                                ->>> sys.path                         
                                                                +>>> import sys                       
                                                                +>>> sys.path                         
                                                                 ['', '/usr/lib/python30.zip', '/usr/lib/python3.0', '/usr/lib/python3.0/plat-linux2@EXTRAMACHDEPPATH@', '/usr/lib/python3.0/lib-dynload', '/usr/lib/python3.0/dist-packages', '/usr/local/lib/python3.0/dist-packages']
                                                                ->>> sys                              
                                                                +>>> sys                              
                                                                 <module 'sys' (built-in)>
                                                                ->>> sys.path.append('/my/new/path')  
                                                                +>>> sys.path.append('/my/new/path')
                                                                1. Importing the sys module makes all of its functions and attributes available.
                                                                2. sys.path is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a .py file whose name matches what you're trying to import. @@ -196,7 +195,7 @@ if __name__ == "__main__":

                                                                  This is so important that I'm going to repeat it in case you missed it the first few times: everything in Python is an object. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
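To make that concrete, here is a short sketch in which the double() function is a made-up example. It puts a string, a list, a function, and a module in the same list and treats them uniformly.

```python
import sys


def double(x):
    """Return twice the argument."""
    return 2 * x


things = ['a string', [1, 2, 3], double, sys]
for thing in things:
    # Each value has a type and can be stored and passed like any other.
    print(type(thing).__name__, isinstance(thing, object))

alias = double                     # bind the function object to a second name
print(alias(21), alias.__doc__)    # call it and read its docstring attribute
print(isinstance(sys.path, list))  # the search path is an ordinary list
```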

                                                                  Indenting code

                                                                  Python functions have no explicit begin or end, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (:) and the indentation of the code itself. -

                                                                  skip over this code listing +

                                                                  skip over this code listing

                                                                  
                                                                   def approximate_size(size, a_kilobyte_is_1024_bytes=True):  
                                                                       if size < 0:                                            
                                                                  @@ -222,7 +221,7 @@ if __name__ == "__main__":
                                                                   
                                                                   

                                                                  Running scripts

                                                                  Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of humansize.py: -

                                                                  skip over this code listing +

                                                                  skip over this code listing

                                                                  
                                                                   if __name__ == "__main__":
                                                                       print(approximate_size(1000000000000, False))
                                                                  @@ -231,15 +230,15 @@ if __name__ == "__main__":
                                                                   

                                                                  Like C, Python uses == for comparison and = for assignment. Unlike C, Python does not allow = inside an expression, so there's no chance of accidentally assigning the value you thought you were comparing.

                                                                  So what makes this if statement special? Well, modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension. -

                                                                  skip over this interpreter listing +

                                                                  skip over this interpreter listing

                                                                  ->>> import humansize
                                                                  ->>> humansize.__name__
                                                                  +>>> import humansize
                                                                  +>>> humansize.__name__
                                                                   'humansize'

                                                                  But you can also run the module directly as a standalone program, in which case __name__ will be a special default value, __main__. Python will evaluate this if statement, find a true expression, and execute the if code block, which in this case prints two values. -

                                                                  skip over this command output listing +

                                                                  skip over this command output listing

                                                                  -c:\home\diveintopython3> c:\python30\python.exe humansize.py
                                                                  +c:\home\diveintopython3> c:\python30\python.exe humansize.py
                                                                   1.0 TB
                                                                   931.3 GiB
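The two cases can also be seen side by side without leaving one script. This sketch writes a throwaway module to a temporary directory (the file name demo_module.py is invented for the example) and uses the standard runpy module to execute it under each name. This only simulates an import and a command-line run, but the value of __name__ is set the same way.

```python
import os
import runpy
import tempfile

# A one-line module that records the name it was executed under.
source = 'who_am_i = __name__\n'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo_module.py')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(source)
    # run_path() executes the file and returns its global variables.
    as_import = runpy.run_path(path, run_name='demo_module')
    as_script = runpy.run_path(path, run_name='__main__')

print(as_import['who_am_i'])
print(as_script['who_am_i'])
```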

                                                                  Further reading

                                                                  @@ -249,6 +248,6 @@ if __name__ == "__main__":
                                                                3. PEP 8: Style Guide for Python Code discusses good indentation style.
                                                                4. Python Reference Manual explains what it means to say that everything in Python is an object, because some people are pedantic and like to discuss that sort of thing at great length. -

                                                                  © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source +

                                                                  © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source

                                                                                   Statically typed    Dynamically typed
                                                                Weakly typed       C, Objective-C      JavaScript, Perl 5, PHP
                                                                Strongly typed     Pascal, Java        Python, Ruby