From 4d69a47f987be52557afbc902b98716b7614c656 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Wed, 29 Apr 2009 23:49:36 -0400
Subject: [PATCH] whats-new, more special-method-names, typography fiddling

---
 advanced-classes.html                       |  4 ++
 advanced-iterators.html                     | 38 +++++-----
 case-study-porting-chardet-to-python-3.html | 64 ++++++++---------
 index.html                                  |  3 +-
 iterators.html                              | 63 +++++-----------
 native-datatypes.html                       | 46 ++++++------
 porting-code-to-python-3-with-2to3.html     | 42 +++++------
 refactoring.html                            | 42 +++++------
 special-method-names.html                   | 71 ++++++++++++++++--
 strings.html                                | 42 +++++------
 table-of-contents.html                      |  3 +-
 unit-testing.html                           | 80 ++++++++++-----------
 whats-new.html                              | 44 ++++++++++++
 your-first-python-program.html              | 54 +++++++-------
 14 files changed, 337 insertions(+), 259 deletions(-)
 create mode 100644 whats-new.html

diff --git a/advanced-classes.html b/advanced-classes.html
index b078fe8..e96eae1 100644
--- a/advanced-classes.html
+++ b/advanced-classes.html
@@ -20,6 +20,8 @@ body{counter-reset:h1 11}

Diving In

FIXME +

Ordered Dictionary: Not An Oxymoron

+

[download ordereddict.py]

import collections
 import itertools
@@ -92,6 +94,8 @@ class OrderedDict(dict, collections.MutableMapping):
             return all(p==q for p, q in itertools.zip_longest(self.items(), other.items()))
         return dict.__eq__(self, other)
+

Implementing Fractions

+

© 2001–9 Mark Pilgrim

diff --git a/advanced-iterators.html b/advanced-iterators.html
index a119e7d..3d5027d 100644
--- a/advanced-iterators.html
+++ b/advanced-iterators.html
@@ -17,7 +17,7 @@ body{counter-reset:h1 7}

 

Diving In

-

HAWAII + IDAHO + IOWA + OHIO == STATES. Or, to put it another way, 510199 + 98153 + 9301 + 3593 == 621246. Am I speaking in tongues? No, it's just a puzzle. +

HAWAII + IDAHO + IOWA + OHIO == STATES. Or, to put it another way, 510199 + 98153 + 9301 + 3593 == 621246. Am I speaking in tongues? No, it’s just a puzzle.

Let me spell it out for you. @@ -38,7 +38,7 @@ E = 4

The most well-known alphametic puzzle is SEND + MORE = MONEY. -

In this chapter, we'll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code. +

In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code.

[download alphametics.py]

import re
@@ -91,13 +91,13 @@ if __name__ == '__main__':
 >>> re.findall('[A-Z]+', 'SEND + MORE == MONEY')     
 ['SEND', 'MORE', 'MONEY']
    -
  1. The re module is Python's implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of numbers. The findall() function returns a list of all the substrings that matched the pattern. +
  2. The re module is Python’s implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of numbers. The findall() function returns a list of all the substrings that matched the pattern.
  3. Here the regular expression pattern matches sequences of letters. Again, the return value is a list, and each item in the list is a string that matched the regular expression pattern.

Finding the unique items in a sequence

-

Set comprehensions make it trivial to find the unique items in a sequence. [FIXME-not sure if I'm going to cover set comprehensions in an earlier chapter; if not, this is certainly an abrupt and inadequate introduction to the topic.] +

Set comprehensions make it trivial to find the unique items in a sequence. [FIXME-not sure if I’m going to cover set comprehensions in an earlier chapter; if not, this is certainly an abrupt and inadequate introduction to the topic.]

 >>> a_list = ['a', 'c', 'b', 'a', 'd', 'b']
@@ -112,7 +112,7 @@ if __name__ == '__main__':
 >>> {c for c in ''.join(words)}              
 {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
    -
  1. Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth — wait, that's in the set already, so it only gets listed once. Fifth. Sixth — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn't even need to be sorted first. +
  2. Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth — wait, that’s in the set already, so it only gets listed once. Fifth. Sixth — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first.
  3. The same technique works with strings, since a string is just a sequence of characters.
  4. Given a list of strings, ''.join(a_list) concatenates all the strings together into one.
  5. So, given a list of strings, this set comprehension returns all the unique characters across all the strings, with no duplicates. @@ -126,7 +126,7 @@ if __name__ == '__main__':
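The steps above can be tried outside the interactive shell. This is only a minimal sketch; the `words` list is an assumption chosen to match the SEND + MORE == MONEY puzzle, not a value taken from alphametics.py.

```python
# Hypothetical word list, chosen to match the SEND + MORE == MONEY puzzle.
words = ['SEND', 'MORE', 'MONEY']

# Join the words into one string, then collect each character exactly once.
unique_characters = {c for c in ''.join(words)}

# Sets are unordered, so sort before printing to get a stable display.
print(sorted(unique_characters))
# ['D', 'E', 'M', 'N', 'O', 'R', 'S', 'Y']
```

The order in which a set prints is not guaranteed, which is why the sketch sorts before printing.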

    Making assertions

    -

    Like many programming languages, Python has an assert statement. Here's how it works. +

    Like many programming languages, Python has an assert statement. Here’s how it works.

     >>> assert 1 + 1 == 2  
    @@ -172,9 +172,9 @@ AssertionError

    Calculating Permutations… The Lazy Way!

    -

    First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you're doing. Here I'm talking about combinatorics, but if that doesn't mean anything to you, don't worry about it. As always, Wikipedia is your friend.) +

    First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you’re doing. Here I’m talking about combinatorics, but if that doesn’t mean anything to you, don’t worry about it. As always, Wikipedia is your friend.) -

    The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like "let's find the permutations of 3 different items taken 2 at a time," which means you have a sequence of 3 items and you want to find all the possible ordered pairs. +

    The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like “let’s find the permutations of 3 different items taken 2 at a time,” which means you have a sequence of 3 items and you want to find all the possible ordered pairs.

     >>> import itertools                              
    @@ -197,13 +197,13 @@ AssertionError
    StopIteration
    1. The itertools module has all kinds of fun stuff in it, including a permutations() function that does all the hard work of finding permutations. -
    2. The permutations() function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a for loop or any old place that iterates. Here I'll step through the iterator manually to show all the values. +
    3. The permutations() function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a for loop or any old place that iterates. Here I’ll step through the iterator manually to show all the values.
    4. The first permutation of [1, 2, 3] taken 2 at a time is (1, 2).
    5. Note that permutations are ordered: (2, 1) is different than (1, 2). -
    6. That's it! Those are all the permutations of [1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2) never show up, because they contain repeats so they aren't valid permutations. When there are no more permutations, the iterator raises a StopIteration exception. +
    7. That’s it! Those are all the permutations of [1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2) never show up, because they contain repeats so they aren’t valid permutations. When there are no more permutations, the iterator raises a StopIteration exception.
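As a script rather than a shell session, the same walk through the iterator might look like the sketch below. The variable names are mine, not the book's.

```python
import itertools

# Permutations of three items taken two at a time.
perms = itertools.permutations([1, 2, 3], 2)

first = next(perms)        # step through manually, one item at a time
remaining = list(perms)    # or drain whatever is left into a list
print(first, remaining)

# Once the iterator is drained, asking for more raises StopIteration.
exhausted = False
try:
    next(perms)
except StopIteration:
    exhausted = True
print(exhausted)
```

Draining the iterator with list() is handy for small inputs; for anything large, a for loop avoids holding every permutation in memory at once.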
    -

    The permutations() function doesn't have to take a list. It can take any sequence — even a string. +

    The permutations() function doesn’t have to take a list. It can take any sequence — even a string.

     >>> import itertools
    @@ -245,7 +245,7 @@ StopIteration
     [('A', 'B'), ('A', 'C'), ('B', 'C')]
    1. The itertools.product() function returns an iterator containing the Cartesian product of two sequences. -
    2. The itertools.combinations() function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the itertools.permutations() function, except combinations don't include items that are duplicates of other items in a different order. So itertools.permutations('ABC', 2) will return both ('A', 'B') and ('B', 'A') (among others), but itertools.combinations('ABC', 2) will not return ('B', 'A') because it is a duplicate of ('A', 'B') in a different order. +
    3. The itertools.combinations() function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the itertools.permutations() function, except combinations don’t include items that are duplicates of other items in a different order. So itertools.permutations('ABC', 2) will return both ('A', 'B') and ('B', 'A') (among others), but itertools.combinations('ABC', 2) will not return ('B', 'A') because it is a duplicate of ('A', 'B') in a different order.
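The difference between the two functions is easy to check directly. A small sketch, using the same 'ABC' input as the explanation above:

```python
import itertools

# Ordered pairs: ('A', 'B') and ('B', 'A') both appear.
perm_pairs = list(itertools.permutations('ABC', 2))

# Unordered pairs: each selection of two letters appears once.
comb_pairs = list(itertools.combinations('ABC', 2))

print(len(perm_pairs), len(comb_pairs))   # 6 3
print(('B', 'A') in perm_pairs)           # True
print(('B', 'A') in comb_pairs)           # False
```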

    [download favorite-people.txt] @@ -273,7 +273,7 @@ StopIteration

  6. But the sorted() function can also take a function as the key parameter, and it sorts by that key. In this case, the sort function is len(), so it sorts by len(each item). Shorter names come first, then longer, then longest.
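Here is a small sketch of sorting by a key function. The names in the list are placeholders I made up; they are not the contents of favorite-people.txt.

```python
# Hypothetical names, standing in for the lines of a text file.
names = ['Wesley', 'Dora', 'Chris', 'Alex', 'Lizzie']

# Default sort: alphabetical.
print(sorted(names))
# ['Alex', 'Chris', 'Dora', 'Lizzie', 'Wesley']

# Sort by len(each item). The sort is stable, so names of equal
# length keep the order they had in the original list.
by_length = sorted(names, key=len)
print(by_length)
# ['Dora', 'Alex', 'Chris', 'Wesley', 'Lizzie']
```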
-

What does this have to do with the itertools module? I'm glad you asked. +

What does this have to do with the itertools module? I’m glad you asked.

 

…continuing from the previous interactive shell… @@ -330,7 +330,7 @@ Wesley

  • On the other hand, the itertools.zip_longest() function stops at the end of the longest sequence, inserting None values for items past the end of the shorter sequences. -
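To see the two behaviors side by side, here is a minimal sketch. The two ranges are arbitrary inputs of different lengths that I picked for illustration.

```python
import itertools

short = range(0, 3)     # 0, 1, 2
longer = range(10, 15)  # 10, 11, 12, 13, 14

# zip() stops at the end of the shortest sequence.
zipped = list(zip(short, longer))
print(zipped)
# [(0, 10), (1, 11), (2, 12)]

# zip_longest() keeps going, padding the shorter sequence with None.
padded = list(itertools.zip_longest(short, longer))
print(padded)
# [(0, 10), (1, 11), (2, 12), (None, 13), (None, 14)]
```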

    OK, that was all very interesting, but how does it relate to the alphametics solver? Here's how: +

    OK, that was all very interesting, but how does it relate to the alphametics solver? Here’s how:

     >>> characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
    @@ -343,7 +343,7 @@ Wesley
    'N': '5', 'S': '1', 'R': '6', 'Y': '7'}
    1. Given a list of letters and a list of digits (each represented here as 1-character strings), the zip function will create a pairing of letters and digits, in order. -
    2. Why is that cool? Because that data structure happens to be exactly the right structure to pass to the dict() function to create a dictionary that uses letters as keys and their associated digits as values. Although the printed representation of the dictionary lists the pairs in a different order (dictionaries have no "order" per se), you can see that each letter is associated with the digit, based on the ordering of the original characters and guess sequences. +
    3. Why is that cool? Because that data structure happens to be exactly the right structure to pass to the dict() function to create a dictionary that uses letters as keys and their associated digits as values. Although the printed representation of the dictionary lists the pairs in a different order (dictionaries have no “order” per se), you can see that each letter is associated with the digit, based on the ordering of the original characters and guess sequences.
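The pairing step can be sketched as a short script. The `guess` tuple below is an assumption: it is one assignment of digits that agrees with the part of the dictionary shown above, not necessarily the exact value used in the shell session.

```python
characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
guess = ('1', '2', '0', '3', '4', '5', '6', '7')   # assumed digits

# zip() pairs the first letter with the first digit, and so on.
pairs = list(zip(characters, guess))
print(pairs[:2])
# [('S', '1'), ('M', '2')]

# dict() accepts any iterable of (key, value) pairs.
mapping = dict(zip(characters, guess))
print(mapping['N'], mapping['Y'])
# 5 7
```

On Python 3.7 and later, dictionaries keep insertion order, so printing `mapping` there would list the letters in the same order as `characters`. The lookup behavior is the same either way.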

    The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution. @@ -355,7 +355,7 @@ for guess in itertools.permutations(digits, len(characters)): ... equation = puzzle.translate(dict(zip(characters, guess))) -

    But what is this translate() method? Ah, now you're getting to the really fun part. +

    But what is this translate() method? Ah, now you’re getting to the really fun part.

    A New Kind Of String Manipulation

    @@ -411,9 +411,9 @@ for guess in itertools.permutations(digits, len(characters)):

    Further Reading

diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index acf3a63..02af7ff 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -614,7 +614,7 @@ ImportError: No module named constants

    Needs to become two separate imports:

    from . import constants
     import sys
    -

    There are variations of this problem scattered throughout the chardet library. In some places it’s "import constants, sys"; in other places, it’s "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import. +

    There are variations of this problem scattered throughout the chardet library. In some places it’s “import constants, sys”; in other places, it’s “import constants, re”. The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.

    Onward!

    Name 'file' is not defined

    @@ -697,7 +697,7 @@ for line in open(f, 'rb'): File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf): TypeError: Can't convert 'bytes' object to str implicitly -

    There's an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this: +

    There’s an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn’t tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:

    elif (self._mInputState == ePureAscii) and \
         self._escDetector.search(self._mLastChar + aBuf):

    And re-run the test:

    @@ -709,8 +709,8 @@ TypeError: Can't convert 'bytes' object to str implicitly File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed self._escDetector.search(self._mLastChar + aBuf): TypeError: Can't convert 'bytes' object to str implicitly -

    Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you're thinking that the search() method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it's trying to construct the value that it will eventually pass to the search() method. -

    We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It's an instance variable, defined in the reset() method, which is actually called from the __init__() method. +

    Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you’re thinking that the search() method is expecting a value of a different type, but that wouldn’t generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it’s expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it’s trying to construct the value that it will eventually pass to the search() method. +

    We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It’s an instance variable, defined in the reset() method, which is actually called from the __init__() method.

    class UniversalDetector:
         def __init__(self):
             self._highBitDetector = re.compile(b'[\x80-\xFF]')
    @@ -726,7 +726,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
    self._mGotData = False self._mInputState = ePureAscii self._mLastChar = '' -

    And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can't concatenate a string to a byte array — not even a zero-length string. +

    And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can’t concatenate a string to a byte array — not even a zero-length string.

    So what is self._mLastChar anyway? The answer is in the feed() method, just a few lines down from where the traceback occurred.

    if self._mInputState == ePureAscii:
         if self._highBitDetector.search(aBuf):
    @@ -736,14 +736,14 @@ TypeError: Can't convert 'bytes' object to str implicitly
    self._mInputState = eEscAscii self._mLastChar = aBuf[-1] -

    The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it's needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus: +

    The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it’s needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus:

      def reset(self):
           .
           .
           .
     -     self._mLastChar = ''
     +     self._mLastChar = b''
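The rule behind this one-character fix can be checked in isolation. A minimal sketch; `a_buf` is a stand-in for the bytes passed to feed(), and the exact wording of the error message varies between Python 3 versions, so the sketch does not print it.

```python
a_buf = b'\xEF\xBB\xBF'

# A string and a bytes object cannot be concatenated,
# not even when the string is empty.
mixed_failed = False
try:
    '' + a_buf
except TypeError:
    mixed_failed = True
print(mixed_failed)   # True

# Starting from a zero-length bytes object works fine.
combined = b'' + a_buf
print(combined == a_buf)   # True
```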
    -

    Searching the entire codebase for "mLastChar" turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers. +

    Searching the entire codebase for “mLastChar” turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.

    
       class MultiByteCharSetProber(CharSetProber):
           def __init__(self):
    @@ -762,7 +762,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
    - self._mLastChar = ['\x00', '\x00'] + self._mLastChar = [0, 0]

    Unsupported operand type(s) for +: 'int' and 'bytes'

    -

    I have good news, and I have bad news. The good news is we're making progress… +

    I have good news, and I have bad news. The good news is we’re making progress…

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
    @@ -771,8 +771,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
    File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed self._escDetector.search(self._mLastChar + aBuf): TypeError: unsupported operand type(s) for +: 'int' and 'bytes' -

    …The bad news is it doesn't always feel like progress. -

    But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int? +

    …The bad news is it doesn’t always feel like progress. +

    But this is progress! Really! Even though the traceback calls out the same line of code, it’s a different error than it used to be. Progress! So what’s the problem now? The last time I checked, this line of code didn’t try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int?

    The answer lies not in the previous lines of code, but in the following lines.

    if self._mInputState == ePureAscii:
         if self._highBitDetector.search(aBuf):
    @@ -783,7 +783,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
    self._mLastChar = aBuf[-1] -

    This error doesn't occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell: +

    This error doesn’t occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what’s the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:

     >>> aBuf = b'\xEF\xBB\xBF'         
     >>> len(aBuf)
    @@ -805,19 +805,19 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
     
    1. Define a byte array of length 3.
    2. The last element of the byte array is 191. -
    3. That's an integer. -
    4. Concatenating an integer with a byte array doesn't work. You've now replicated the error you just found in universaldetector.py. -
    5. Ah, here's the fix. Instead of taking the last element of the byte array, use list slicing to create a new byte array containing just the last element. That is, start with the last element and continue the slice until the end of the byte array. Now mLastChar is a byte array of length 1. +
    6. That’s an integer. +
    7. Concatenating an integer with a byte array doesn’t work. You’ve now replicated the error you just found in universaldetector.py. +
    8. Ah, here’s the fix. Instead of taking the last element of the byte array, use list slicing to create a new byte array containing just the last element. That is, start with the last element and continue the slice until the end of the byte array. Now mLastChar is a byte array of length 1.
    9. Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.
    -

    So, to ensure that the feed() method in universaldetector.py continues to work no matter how often it's called, you need to initialize self._mLastChar as a 0-length byte array, then make sure it stays a byte array. +

    So, to ensure that the feed() method in universaldetector.py continues to work no matter how often it’s called, you need to initialize self._mLastChar as a 0-length byte array, then make sure it stays a byte array.

                  self._escDetector.search(self._mLastChar + aBuf):
               self._mInputState = eEscAscii
     
     - self._mLastChar = aBuf[-1]
     + self._mLastChar = aBuf[-1:]
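To see why the slice matters across repeated calls, here is a stripped-down sketch of the same pattern. The `LastByteTracker` class is hypothetical; it only imitates the bookkeeping in feed() and is not code from chardet.

```python
class LastByteTracker:
    """Hypothetical stand-in for the last-byte bookkeeping in feed()."""

    def __init__(self):
        # Start with zero-length bytes, not a zero-length string.
        self.last_char = b''

    def feed(self, a_buf):
        # Prepend the byte remembered from the previous call.
        window = self.last_char + a_buf
        # Slicing keeps the type as bytes; a_buf[-1] would give an int.
        self.last_char = a_buf[-1:]
        return window


tracker = LastByteTracker()
first_window = tracker.feed(b'\xEF\xBB')
second_window = tracker.feed(b'\xBF')
print(first_window, second_window)
```

A side benefit of slicing: if `a_buf` is ever empty, `a_buf[-1:]` is simply `b''`, whereas `a_buf[-1]` would raise an IndexError.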

    ord() expected string of length 1, but int found

    -

    Tired yet? You're almost there… +

    Tired yet? You’re almost there…

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
    @@ -839,19 +839,19 @@ def next_state(self, c):
         # for each byte we get its class
         # if it is first byte, we also get byte length
         byteCls = self._mModel['classTable'][ord(c)]
    -

    That's no help; it's just passed into the function. Let's pop the stack. +

    That’s no help; it’s just passed into the function. Let’s pop the stack.

    # utf8prober.py
     def feed(self, aBuf):
         for c in aBuf:
             codingState = self._mCodingSM.next_state(c)
    -

    And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That's what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there's no need to call the ord() function because c is already an int! +

    And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That’s what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there’s no need to call the ord() function because c is already an int!

    Thus:

      def next_state(self, c):
           # for each byte we get its class
           # if it is first byte, we also get byte length
     -     byteCls = self._mModel['classTable'][ord(c)]
     +     byteCls = self._mModel['classTable'][c]
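The behavior behind this change is easy to confirm on its own. In the sketch below, `class_table` is a made-up lookup table, not the real 'classTable' from chardet.

```python
a_buf = b'\xEF\xBB\xBF'

# Iterating over bytes yields integers, one per byte.
byte_values = [c for c in a_buf]
print(byte_values)
# [239, 187, 191]

# So each value can index a table directly, with no ord() call.
class_table = [0] * 256      # hypothetical lookup table
class_table[0xEF] = 7
classes = [class_table[c] for c in a_buf]
print(classes)
# [7, 0, 0]

# Calling ord() on an int is an error in Python 3.
ord_failed = False
try:
    ord(a_buf[0])
except TypeError:
    ord_failed = True
print(ord_failed)   # True
```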
    -

    Searching the entire codebase for instances of "ord(c)" uncovers similar problems in sbcharsetprober.py… +

    Searching the entire codebase for instances of “ord(c)” uncovers similar problems in sbcharsetprober.py

    # sbcharsetprober.py
     def feed(self, aBuf):
         if not self._mModel['keepEnglishLetter']:
    @@ -887,7 +887,7 @@ def feed(self, aBuf):
     +         charClass = Latin1_CharToClass[c]
     

    Unorderable types: int() >= str()

    -

    Let's go again. +

    Let’s go again.

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
    @@ -905,8 +905,8 @@ tests\Big5\0804.blogspot.com.xml
       File "C:\home\chardet\chardet\jpcntx.py", line 176, in get_order
         if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
     TypeError: unorderable types: int() >= str()
    -

    Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You're making real progress here. -

    So what's this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code: +

    Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You’re making real progress here. +

    So what’s this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:

    class SJISContextAnalysis(JapaneseContextAnalysis):
         def get_order(self, aStr):
             if not aStr: return -1, 1
    @@ -916,7 +916,7 @@ TypeError: unorderable types: int() >= str()
    charLen = 2 else: charLen = 1
    -

    And where does aStr come from? Let's pop the stack: +

    And where does aStr come from? Let’s pop the stack:

    def feed(self, aBuf, aLen):
         .
         .
    @@ -924,9 +924,9 @@ TypeError: unorderable types: int() >= str()
    i = self._mNeedToSkipCharNum while i < aLen: order, charLen = self.get_order(aBuf[i:i+2]) -

    Oh look, it's our old friend, aBuf. As you might have guessed from every other issue we've encountered in this chapter, aBuf is a byte array. Here, the feed() method isn't just passing it on wholesale; it's slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array. -

    And what is this code trying to do with aStr? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can't compare integers and strings for inequality without explicitly coercing one of them. -

    In this case, there's no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers. +

    Oh look, it’s our old friend, aBuf. As you might have guessed from every other issue we’ve encountered in this chapter, aBuf is a byte array. Here, the feed() method isn’t just passing it on wholesale; it’s slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array. +

    And what is this code trying to do with aStr? It’s taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can’t compare integers and strings for inequality without explicitly coercing one of them. +

    In this case, there’s no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you’re comparing to are all constants. Let’s change them from 1-character strings to integers.

      class SJISContextAnalysis(JapaneseContextAnalysis):
           def get_order(self, aStr):
               if not aStr: return -1, 1
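The comparison change described above can be tried in isolation. The helper below is hypothetical; it mirrors only the first range check from get_order() (0x81 through 0x9F), with the constants written as integers.

```python
def first_byte_in_range(a_str):
    """Hypothetical helper: is the first byte between 0x81 and 0x9F?"""
    if not a_str:
        return False
    # a_str[0] is an int, so compare it against integer constants.
    return 0x81 <= a_str[0] <= 0x9F


print(first_byte_in_range(b'\x82\xa0'))   # True
print(first_byte_in_range(b'A'))          # False

# Comparing the int against a 1-character string fails in Python 3.
compare_failed = False
try:
    b'\x82\xa0'[0] >= '\x81'
except TypeError:
    compare_failed = True
print(compare_failed)   # True
```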
    @@ -1115,7 +1115,7 @@ tests\Big5\0804.blogspot.com.xml
       File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
         total = reduce(operator.add, self._mFreqCounter)
     NameError: global name 'reduce' is not defined
    -

    According to the official What's New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: "Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable." You can read more about the decision from Guido van Rossum's weblog: The fate of reduce() in Python 3000. +

    According to the official What’s New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: “Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.” You can read more about the decision from Guido van Rossum’s weblog: The fate of reduce() in Python 3000.
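    As a quick sanity check before rewriting the call, here is a small sketch showing that the two spellings agree. The list freq_counter is a hypothetical stand-in for self._mFreqCounter.

```python
import functools
import operator

# Hypothetical frequency counters standing in for self._mFreqCounter.
freq_counter = [3, 0, 7, 2]

# The Python 2 idiom, with reduce() imported from its new home in functools.
total_reduce = functools.reduce(operator.add, freq_counter)

# The simpler built-in that does the same job for addition.
total_sum = sum(freq_counter)
```

    Both give the same total, and sum() needs neither functools nor operator.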

    def get_confidence(self):
         if self.get_state() == constants.eNotMe:
             return 0.01
    @@ -1129,7 +1129,7 @@ NameError: global name 'reduce' is not defined
    - total = reduce(operator.add, self._mFreqCounter) + total = sum(self._mFreqCounter) -

    Since you're no longer using the operator module, you can remove that import from the top of the file as well. +

    Since you’re no longer using the operator module, you can remove that import from the top of the file as well.

      from .charsetprober import CharSetProber
       from . import constants
     - import operator
    @@ -1172,11 +1172,11 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide

    Summary

    What have we learned?

      -
    1. Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There's no way around it. It's hard. -
    2. The automated 2to3 tool is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It's an impressive piece of engineering, but in the end it's just an intelligent search-and-replace bot. -
    3. The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the chardet library is to convert a stream of bytes into a string. But “a stream of bytes” comes up more often than you might think. Reading a file in “binary” mode? You'll get a stream of bytes. Fetching a web page? Calling a web API? They return a stream of bytes, too. +
    4. Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There’s no way around it. It’s hard. +
    5. The automated 2to3 tool is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It’s an impressive piece of engineering, but in the end it’s just an intelligent search-and-replace bot. +
    6. The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the chardet library is to convert a stream of bytes into a string. But “a stream of bytes” comes up more often than you might think. Reading a file in “binary” mode? You’ll get a stream of bytes. Fetching a web page? Calling a web API? They return a stream of bytes, too.
    7. You need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere. -
    8. Test cases are essential. Don't port anything without them. Don't even try. The only reason I have any confidence at all that chardet works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I never would have found half of these problems with manual spot-checking. +
    9. Test cases are essential. Don’t port anything without them. Don’t even try. The only reason I have any confidence at all that chardet works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I never would have found half of these problems with manual spot-checking.

    You can see the full table of contents (not finalized), or read what I’ve written so far:

    -
      +
        +
      1. What’s New in “Dive Into Python 3”
      2. Installing Python
      3. Your First Python Program
      4. Native Datatypes diff --git a/iterators.html b/iterators.html index 080fccf..6b114c1 100644 --- a/iterators.html +++ b/iterators.html @@ -40,33 +40,33 @@ body{counter-reset:h1 6} self.a, self.b = self.b, self.a + self.b return fib -

        Let's take that one line at a time. +

        Let’s take that one line at a time.

        class Fib:
        -

        class? What's a class? +

        class? What’s a class?

        Defining Classes

        -

        Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you've defined. +

        Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you’ve defined. -

        Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that's all that's required, since a class doesn't need to inherit from any other class. +

        Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class.

        
         class PapayaWhip:  
             pass           
          -
        1. The name of this class is PapayaWhip, and it doesn't inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement. +
        2. The name of this class is PapayaWhip, and it doesn’t inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement.
        3. You probably guessed this, but everything in a class is indented, just like the code within a function, if statement, for loop, or any other block of code. The first line not indented is outside the class.
        -

        This PapayaWhip class doesn't define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It's a statement that does nothing, and it's a good placeholder when you're stubbing out functions or classes. +

        This PapayaWhip class doesn’t define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It’s a statement that does nothing, and it’s a good placeholder when you’re stubbing out functions or classes.

        The pass statement in Python is like an empty set of curly braces ({}) in Java or C.
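        Here is a minimal sketch of pass used as a placeholder in both a class and a function. The function name not_written_yet is made up for illustration.

```python
class PapayaWhip:
    pass  # the class body has to contain something; pass does nothing


def not_written_yet():
    pass  # hypothetical stub, to be filled in later


# Both definitions are perfectly usable as they stand.
whip = PapayaWhip()
result = not_written_yet()  # a function that only passes returns None
```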

        -

        Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Although it's not required, Python classes can have something similar to a constructor: the __init__() method. +

        Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don’t have explicit constructors and destructors. Although it’s not required, Python classes can have something similar to a constructor: the __init__() method.

        The __init__() Method

        @@ -79,10 +79,10 @@ class Fib: def __init__(self, max):
        1. Classes can (and should) have docstrings too, just like modules and functions. -
        2. The __init__() method is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It's tempting, because it looks like a constructor (by convention, the __init__() method is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the __init__() method is called, and you already have a valid reference to the new instance of the class. +
        3. The __init__() method is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It’s tempting, because it looks like a constructor (by convention, the __init__() method is the first method defined for the class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the __init__() method is called, and you already have a valid reference to the new instance of the class.
        -

        The first argument of every class method, including the __init__() method, is always a reference to the current instance of the class. By convention, this argument is named self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything but self; this is a very strong convention. +

        The first argument of every class method, including the __init__() method, is always a reference to the current instance of the class. By convention, this argument is named self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don’t call it anything but self; this is a very strong convention.

        In the __init__() method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically. @@ -99,10 +99,10 @@ class Fib: >>> fib.__doc__ 'iterator that yields numbers in the Fibanocci sequence'

          -
        1. You are creating an instance of the Fib class (defined in the fibonacci2 module) and assigning the newly created instance to the variable fib. You are passing one parameter, 100, which will end up as the max argument in Fib's __init__() method. +
        2. You are creating an instance of the Fib class (defined in the fibonacci2 module) and assigning the newly created instance to the variable fib. You are passing one parameter, 100, which will end up as the max argument in Fib’s __init__() method.
        3. fib is now an instance of the Fib class. -
        4. Every class instance has a built-in attribute, __class__, which is the object's class. Java programmers may be familiar with the Class class, which contains methods like getName and getSuperclass to get metadata information about an object. In Python, this kind of metadata is available directly on the object itself through attributes like __class__, __name__, and __bases__. -
        5. You can access the instance's docstring just as with a function or a module. All instances of a class share the same docstring. +
        6. Every class instance has a built-in attribute, __class__, which is the object’s class. Java programmers may be familiar with the Class class, which contains methods like getName and getSuperclass to get metadata information about an object. In Python, this kind of metadata is available directly on the object itself through attributes like __class__, __name__, and __bases__. +
        7. You can access the instance’s docstring just as with a function or a module. All instances of a class share the same docstring.
        @@ -117,7 +117,7 @@ class Fib: def __init__(self, max): self.max = max
          -
        1. What is self.max? It's an instance variable. It is completely separate from max, which was passed into the __init__() method as an argument. self.max is “global” to the instance. That means that you can access it from other methods. +
        2. What is self.max? It’s an instance variable. It is completely separate from max, which was passed into the __init__() method as an argument. self.max is “global” to the instance. That means that you can access it from other methods.
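        To see what “global to the instance” means in practice, consider this small sketch. The Limit class and its allows() method are hypothetical, invented only to show one method reading a value that __init__() stored.

```python
class Limit:
    def __init__(self, max):
        # Copy the argument onto the instance so other methods can use it.
        self.max = max

    def allows(self, value):
        # self.max here is the value stored by __init__(), not the argument.
        return value <= self.max


# Each instance keeps its own, separate self.max.
small = Limit(10)
large = Limit(1000)
```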
        class Fib:
        @@ -147,7 +147,7 @@ class Fib:
         
         

        A Fibonacci Iterator

        -

        Now you're ready to learn how to build an iterator. An iterator is just a class that defines an __iter__() method. +

        Now you’re ready to learn how to build an iterator. An iterator is just a class that defines an __iter__() method.
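        Before looking at the Fibonacci version, here is a deliberately tiny sketch of the same idea. The Countdown class is hypothetical. Note that besides __iter__(), the object a for loop actually steps through also needs a __next__() method, which signals the end by raising StopIteration.

```python
class Countdown:
    """Hypothetical iterator that yields start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # The object is its own iterator, so it just returns itself.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # tells the for loop to stop
        value = self.current
        self.current -= 1
        return value


values = list(Countdown(3))
```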

        [download fibonacci2.py]

        class Fib:                                        
        @@ -195,7 +195,7 @@ class Fib:
         

        A Plural Rule Iterator

        -

        Now it’s time for the finale. Let's rewrite the plural rules generator as an iterator. +

        Now it’s time for the finale. Let’s rewrite the plural rules generator as an iterator.

        [download plural6.py]

        class LazyRules:
        @@ -246,7 +246,7 @@ rules = LazyRules()
      5. Also, this is a good place to initialize the cache, which you’ll use later as you read the patterns from the pattern file.
      -

      Before we continue, let's take a closer look at rules_f. It's not defined within the __init__() method. In fact, it's not defined within any method. It's defined at the class level. It's a class variable, and although you can access it just like an instance variable (self.rules_f), it is shared across all instances of the LazyRules class. +

      Before we continue, let’s take a closer look at rules_f. It’s not defined within the __init__() method. In fact, it’s not defined within any method. It’s defined at the class level. It’s a class variable, and although you can access it just like an instance variable (self.rules_f), it is shared across all instances of the LazyRules class.

       >>> import plural6
      @@ -364,34 +364,3 @@ rules = LazyRules()

      © 2001–9 Mark Pilgrim - - diff --git a/native-datatypes.html b/native-datatypes.html index a19541e..c219056 100644 --- a/native-datatypes.html +++ b/native-datatypes.html @@ -17,7 +17,7 @@ body{counter-reset:h1 2}

       

      Diving In

      -

      Cast aside your first Python program for just a minute, and let's talk about datatypes. In Python, every variable has a datatype, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally. +

      Cast aside your first Python program for just a minute, and let’s talk about datatypes. In Python, every variable has a datatype, but you don’t need to declare it explicitly. Based on each variable’s original assignment, Python figures out what type it is and keeps track of that internally.

      Python has many native datatypes. Here are the important ones:

      1. Booleans are either True or False. @@ -28,8 +28,8 @@ body{counter-reset:h1 2}
      2. Sets are unordered bags of values.
      3. Dictionaries are unordered bags of key-value pairs.
      -

      Of course, there are a lot more types than these seven. Everything is an object in Python, so there are types like module, function, class, method, file, and even compiled code. You've already seen some of these: modules have names, functions have docstrings, &c. You'll learn about classes in [FIXME xref] and files in [FIXME xref]. -

      Strings and bytes are important enough — and complicated enough — that they get their own chapter. Let's look at the others first. +

      Of course, there are a lot more types than these seven. Everything is an object in Python, so there are types like module, function, class, method, file, and even compiled code. You’ve already seen some of these: modules have names, functions have docstrings, &c. You’ll learn about classes in [FIXME xref] and files in [FIXME xref]. +

      Strings and bytes are important enough — and complicated enough — that they get their own chapter. Let’s look at the others first.

      Booleans

      Booleans are either true or false. Python has two constants, True and False, which can be used to assign boolean values directly. Expressions can also evaluate to a boolean value. In certain places (like if statements), Python expects an expression to evaluate to a boolean value. These places are called boolean contexts. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.) @@ -48,7 +48,7 @@ body{counter-reset:h1 2} >>> size < 0 True

      Numbers

      -

      Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There's no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point. +

      Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There’s no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point.

       >>> type(1)                 
       <class 'int'>
      @@ -82,7 +82,7 @@ body{counter-reset:h1 2}
       
    1. You can explicitly coerce an int to a float by calling the float() function.
    2. Unsurprisingly, you can also coerce a float to an int by calling int().
    3. The int() function will truncate, not round. -
    4. The int() function truncates negative numbers towards 0. It's a true truncate function, not a floor function. +
    5. The int() function truncates negative numbers towards 0. It’s a true truncate function, not a floor function.
    6. Floating point numbers are accurate to 15 decimal places.
    7. Integers can be arbitrarily large.
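    The coercion rules above can be checked with a short sketch. The particular numbers are arbitrary, and math.floor() is included only to contrast flooring with truncation.

```python
import math

as_float = float(2)            # int to float
truncated = int(2.9)           # truncates, does not round
neg_truncated = int(-2.9)      # truncates towards 0
neg_floored = math.floor(-2.9) # floors towards negative infinity
big = 2 ** 100                 # still an int, however large it gets
```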
    @@ -108,8 +108,8 @@ body{counter-reset:h1 2}
    1. The / operator performs floating point division. It returns a float even if both the numerator and denominator are ints.
    2. The // operator performs a quirky kind of integer division. When the result is positive, you can think of it as truncating (not rounding) to 0 decimal places, but be careful with that. -
    3. When integer-dividing negative numbers, the // operator rounds “up” to the nearest integer. Mathematically speaking, it's rounding “down” since −6 is less than −5, but it could trip you up if you were expecting it to truncate to −5. -
    4. The // operator doesn't always return an integer. If either the numerator or denominator is a float, it will still round to the nearest integer, but the actual return value will be a float. +
    5. When integer-dividing negative numbers, the // operator rounds “up” to the nearest integer. Mathematically speaking, it’s rounding “down” since −6 is less than −5, but it could trip you up if you were expecting it to truncate to −5. +
    6. The // operator doesn’t always return an integer. If either the numerator or denominator is a float, it will still round to the nearest integer, but the actual return value will be a float.
    7. The ** operator means “raised to the power of.” 11² is 121.
    8. The % operator gives the remainder after performing integer division. 11 divided by 2 is 5 with a remainder of 1, so the result here is 1.
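    Put side by side, the operators described above behave like this. It is only a sketch that reuses the numbers 11 and 2 from the notes.

```python
true_division = 11 / 2      # always a float, even with two ints
floor_division = 11 // 2    # integer result for two ints
negative_floor = -11 // 2   # rounds down, away from zero
float_floor = 11.0 // 2     # floored, but the result is a float
power = 11 ** 2
remainder = 11 % 2
```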
    @@ -117,7 +117,7 @@ body{counter-reset:h1 2}

    In Python 2, the / operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the / operator always means floating point division. See PEP 238 for details.

    Fractions

    -

    Python isn't limited to integers and floating point numbers. It can also do all the fancy math you learned in high school and promptly forgot about. +

    Python isn’t limited to integers and floating point numbers. It can also do all the fancy math you learned in high school and promptly forgot about.

     >>> import fractions              
     >>> x = fractions.Fraction(1, 3)  
    @@ -144,7 +144,7 @@ body{counter-reset:h1 2}
     >>> math.tan(math.pi / 4)  
     0.99999999999999989
      -
    1. The math module has a constant for π, the ratio of a circle's circumference to its diameter. +
    2. The math module has a constant for π, the ratio of a circle’s circumference to its diameter.
    3. The math module has all the basic trigonometric functions, including sin(), cos(), tan(), and variants like asin().
    4. Note, however, that Python does not have infinite precision. tan(π / 4) should return 1.0, not 0.99999999999999989.
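    One way to live with that limited precision is to compare floating point results with a tolerance rather than with ==. The sketch below assumes Python 3.5 or later, where math.isclose() is available.

```python
import math

result = math.tan(math.pi / 4)

# Compare with a tolerance instead of testing for exact equality.
close_enough = math.isclose(result, 1.0)
```

    By default math.isclose() uses a relative tolerance of 1e-09, which is far looser than the error seen here.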
    @@ -176,16 +176,16 @@ body{counter-reset:h1 2}
    1. Did you know you can define your own functions in the Python interactive shell? Just press ENTER at the end of each line, and ENTER on a blank line to finish.
    2. In a boolean context, non-zero integers are true; 0 is false. -
    3. Non-zero floating point numbers are true; 0.0 is false. Be careful with this one! If there's the slightest rounding error (not impossible, as you saw in the previous section) then Python will be testing 0.0000000000001 instead of 0 and will return True. +
    4. Non-zero floating point numbers are true; 0.0 is false. Be careful with this one! If there’s the slightest rounding error (not impossible, as you saw in the previous section) then Python will be testing 0.0000000000001 instead of 0 and will return True.
    5. Fractions can also be used in a boolean context. Fraction(0, n) is false for all values of n. All other fractions are true.
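    The rounding pitfall is easy to reproduce. In this sketch the helper is_it_true() is just a thin, hypothetical wrapper around bool(), and the sum 0.1 + 0.2 - 0.3 is a standard example of a result that should be zero but is not.

```python
from fractions import Fraction


def is_it_true(anything):
    # Hypothetical helper: report the value's truth in a boolean context.
    return bool(anything)


zero_int = is_it_true(0)
zero_float = is_it_true(0.0)
rounding_error = is_it_true(0.1 + 0.2 - 0.3)  # tiny, but not exactly zero
zero_fraction = is_it_true(Fraction(0, 5))
one_third = is_it_true(Fraction(1, 3))
```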

    Lists

    -

    Lists are Python's workhorse datatype. When I say “list,” you might be thinking “array whose size I have to declare in advance, that can only contain items of the same type, &c.” Don't think that. Lists are much cooler than that. +

    Lists are Python’s workhorse datatype. When I say “list,” you might be thinking “array whose size I have to declare in advance, that can only contain items of the same type, &c.” Don’t think that. Lists are much cooler than that.

    A list in Python is like an array in Perl 5. In Perl 5, variables that store arrays always start with the @ character; in Python, variables can be named anything, and Python keeps track of the datatype internally.

    -

    A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the ArrayList class, which can hold arbitrary objects and can expand dynamically as new items are added. +

    A list in Python is much more than an array in Java (although it can be used as one if that’s really all you want out of life). A better analogy would be to the ArrayList class, which can hold arbitrary objects and can expand dynamically as new items are added.

    Creating A List

    Creating a list is easy: use square brackets to wrap a comma-separated list of values. @@ -210,7 +210,7 @@ body{counter-reset:h1 2}

    Slicing A List

    -

    Once you've defined a list, you can get any part of it as a new list. This is called slicing the list. +

    Once you’ve defined a list, you can get any part of it as a new list. This is called slicing the list.

     >>> a_list
     ['a', 'b', 'mpilgrim', 'z', 'example']
    @@ -228,7 +228,7 @@ body{counter-reset:h1 2}
     ['a', 'b', 'mpilgrim', 'z', 'example']
    1. You can get a part of a list, called a “slice”, by specifying two indices. The return value is a new list containing all the items of the list, in order, starting with the first slice index (in this case a_list[1]), up to but not including the second slice index (in this case a_list[3]). -
    2. Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first item you want, and the second slice index specifies the first item you don't want. The return value is everything in between. +
    3. Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first item you want, and the second slice index specifies the first item you don’t want. The return value is everything in between.
    4. Lists are zero-based, so a_list[0:3] returns the first three items of the list, starting at a_list[0], up to but not including a_list[3].
    5. If the left slice index is 0, you can leave it out, and 0 is implied. So a_list[:3] is the same as a_list[0:3], because the starting 0 is implied.
    6. Similarly, if the right slice index is the length of the list, you can leave it out. So a_list[3:] is the same as a_list[3:5], because this list has five items. There is a pleasing symmetry here. In this five-item list, a_list[:3] returns the first 3 items, and a_list[3:] returns the last two items. In fact, a_list[:n] will always return the first n items, and a_list[n:] will return the rest, regardless of the length of the list. @@ -251,12 +251,12 @@ body{counter-reset:h1 2} >>> a_list ['a', 'a', 2.0, 3, True, 'four', 'e']
        -
      1. The + operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don't all need to be the same type. Here we have a list containing a string, a floating point number, and an integer. +
      2. The + operator concatenates lists. A list can contain any number of items; there is no size limit (other than available memory). A list can contain items of any datatype; they don’t all need to be the same type. Here we have a list containing a string, a floating point number, and an integer.
      3. The append() method adds a single item to the end of the list. (Now we have four different datatypes in the list!)
      4. Lists are implemented as classes. “Creating” a list is really instantiating a class. As such, a list has methods that operate on it. The extend() method takes one argument, a list, and appends each of the items of the argument to the original list.
      5. The insert() method inserts a single item into a list. The first argument is the index of the first item in the list that will get bumped out of position. List items do not need to be unique; for example, there are now two separate items with the value 'a', a_list[0] and a_list[1].
      -

      Let's look closer at the difference between append() and extend(). +

      Let’s look closer at the difference between append() and extend().

       >>> a_list = ['a', 'b', 'c']
       >>> a_list.extend(['d', 'e', 'f'])  
      @@ -276,8 +276,8 @@ body{counter-reset:h1 2}
       
      1. The extend() method takes a single argument, which is always a list, and adds each of the items of that list to a_list.
      2. If you start with a list of three items and extend it with a list of another three items, you end up with a list of six items. -
      3. On the other hand, the append() method takes a single argument, which can be any datatype. Here, that argument is a list of three items. -
      4. If you start with a list of six items and append a list onto it, you end up with... a list of seven items. Why seven? Because the last item (which you just appended) is itself a list. Lists can contain any type of data, including other lists. That may be what you want, or it may not. But it's what you asked for, and it's what you got. +
      5. On the other hand, the append() method takes a single argument, which can be any datatype. Here, that argument is a list of three items. +
      6. If you start with a list of six items and append a list onto it, you end up with... a list of seven items. Why seven? Because the last item (which you just appended) is itself a list. Lists can contain any type of data, including other lists. That may be what you want, or it may not. But it’s what you asked for, and it’s what you got.

      Searching For Values In A List

      @@ -324,7 +324,7 @@ ValueError: list.index(x): x not in list

      FIXME -->

      Dictionaries

      -

      One of Python's most important datatypes is the dictionary, which defines one-to-one relationships between keys and values. +

      One of Python’s most important datatypes is the dictionary, which defines one-to-one relationships between keys and values.

      A dictionary in Python is like a hash in Perl 5. In Perl 5, variables that store hashes always start with a % character. In Python, variables can be named anything, and Python keeps track of the datatype internally.

      @@ -346,7 +346,7 @@ KeyError: 'db.diveintopython3.org'
    7. First, you create a new dictionary with two items and assign it to the variable a_dict. Each item is a key-value pair, and the whole set of items is enclosed in curly braces.
    8. 'server' is a key, and its associated value, referenced by a_dict["server"], is 'db.diveintopython3.org'.
    9. 'database' is a key, and its associated value, referenced by a_dict["database"], is 'mysql'. -
    10. You can get values by key, but you can't get keys by value. So a_dict["server"] is 'db.diveintopython3.org', but a_dict["db.diveintopython3.org"] raises an exception, because 'db.diveintopython3.org' is not a key. +
    11. You can get values by key, but you can’t get keys by value. So a_dict["server"] is 'db.diveintopython3.org', but a_dict["db.diveintopython3.org"] raises an exception, because 'db.diveintopython3.org' is not a key.
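    If you really do need to go from a value back to its key, you have to search for it yourself. The sketch below reuses the dictionary from the example; the list comprehension is one possible approach, not something the dictionary offers directly.

```python
a_dict = {"server": "db.diveintopython3.org", "database": "mysql"}

# Looking up by key works.
value = a_dict["server"]

# Looking up by value raises KeyError, because the value is not a key.
try:
    a_dict["db.diveintopython3.org"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A manual reverse lookup: scan every pair and collect the matching keys.
keys_for_mysql = [key for key, val in a_dict.items() if val == "mysql"]
```

    The result is a list because several keys may share the same value.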

    Modifying A Dictionary

    Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any time, or you can modify the value of an existing key. Continuing from the previous example: @@ -370,11 +370,11 @@ KeyError: 'db.diveintopython3.org'

  • You can add new key-value pairs at any time. This syntax is identical to modifying existing values.
  • The new dictionary item (key 'user', value 'mark') appears to be in the middle. In fact, it was just a coincidence that the items appeared to be in order in the first example; it is just as much a coincidence that they appear to be out of order now.
  • Assigning a value to an existing dictionary key simply replaces the old value with the new one. -
  • Will this change the value of the user key back to "mark"? No! Look at the key closely — that's a capital U in "User". Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it's completely different. +
  • Will this change the value of the user key back to "mark"? No! Look at the key closely — that’s a capital U in "User". Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it’s completely different.

    Mixed-Value Dictionaries

    -

    Dictionaries aren't just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don't all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary. -

    In fact, you've already seen a dictionary with non-string keys and values, in your first Python program. +

    Dictionaries aren’t just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don’t all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary. +

    In fact, you’ve already seen a dictionary with non-string keys and values, in your first Python program.

    SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),
                 1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}

    Let's tear that apart in the interactive shell. diff --git a/porting-code-to-python-3-with-2to3.html b/porting-code-to-python-3-with-2to3.html index 42fe31e..7a7d8a5 100644 --- a/porting-code-to-python-3-with-2to3.html +++ b/porting-code-to-python-3-with-2to3.html @@ -27,7 +27,7 @@ td pre{padding:0;border:0}

     

    Diving in

    -

    Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. Case study: porting chardet to Python 3 describes how to run the 2to3 script, then shows some things it can't fix automatically. This appendix documents what it can fix automatically. +

    Virtually all Python 2 programs will need at least some tweaking to run properly under Python 3. To help with this transition, Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. Case study: porting chardet to Python 3 describes how to run the 2to3 script, then shows some things it can’t fix automatically. This appendix documents what it can fix automatically.

    print statement

    In Python 2, print was a statement. Whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function — whatever you want to print is passed to print() like any other function. @@ -110,7 +110,7 @@ td pre{padding:0;border:0}
    1. Base 10 long integer literals become base 10 integer literals.
    2. Base 16 long integer literals become base 16 integer literals. -
    3. In Python 3, the old long() function no longer exists, since longs don't exist. To coerce a variable to an integer, use the int() function. +
    4. In Python 3, the old long() function no longer exists, since longs don’t exist. To coerce a variable to an integer, use the int() function.
    5. To check whether a variable is an integer, get its type and compare it to int, not long.
    6. You can also use the isinstance() function to check data types; again, use int, not long, to check for integers.
    @@ -161,7 +161,7 @@ td pre{padding:0;border:0}
  • Again with the parentheses, for the same reason.

    Dictionary methods that return lists

    -

    In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(), items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing. +

    In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(), items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method’s return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing.
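A small sketch of both situations in Python 3 may help. The dictionary contents and variable names here are made up for illustration.

```python
a_dictionary = {"a": 1, "b": 2}

# Passing a view straight to something that iterates works fine.
total = sum(a_dictionary.values())

# Indexing a view does not work; views are not lists.
keys_view = a_dictionary.keys()
try:
    first_key = keys_view[0]
except TypeError:
    first_key = None

# If you need individually addressable elements, build a real list.
keys_list = list(a_dictionary.keys())

# Views are dynamic: they reflect later changes to the dictionary,
# while the list is a snapshot taken at the time it was built.
a_dictionary["c"] = 3
print(len(keys_view), len(keys_list))
```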

  • Notes Python 2 @@ -219,7 +219,7 @@ import CGIHttpServer
  • The http.server module provides a basic HTTP server.

    urllib

    -

    Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib. +

    Python 2 had a rat’s nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib.
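As a rough illustration of the combined package, the sketch below uses only the parsing and encoding parts of urllib, so it runs without network access. The example URL path and query values are assumptions, not taken from the book.

```python
# Parsing and encoding live in urllib.parse; fetching lives in
# urllib.request, and its exceptions in urllib.error.
from urllib.parse import urlparse, urlencode
import urllib.request
import urllib.error

# Hypothetical URL, used only to show how a URL is split into parts.
parts = urlparse("http://diveintopython3.org/examples/feed.xml?page=2")
print(parts.netloc, parts.path, parts.query)

# Encode a dictionary as a query string.
query = urlencode({"q": "dive in", "page": 2})
print(query)
```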
    Notes Python 2 @@ -368,10 +368,10 @@ except ImportError:

    1. When you need to import an entire module from elsewhere in your package, use the new from . import syntax. The period is actually a relative path from this file (universaldetector.py) to the file you want to import (constants.py). In this case, they are in the same directory, thus the single period. You can also import from the parent directory (from .. import anothermodule) or a subdirectory. -
    2. To import a specific class or function from another module directly into your module's namespace, prefix the target module with a relative path, minus the trailing slash. In this case, mbcharsetprober.py is in the same directory as universaldetector.py, so the path is a single period. You can also import form the parent directory (from ..anothermodule import AnotherClass) or a subdirectory. +
    3. To import a specific class or function from another module directly into your module’s namespace, prefix the target module with a relative path, minus the trailing slash. In this case, mbcharsetprober.py is in the same directory as universaldetector.py, so the path is a single period. You can also import from the parent directory (from ..anothermodule import AnotherClass) or a subdirectory.

    next() iterator method

    -

    In Python 2, iterators had a next() method which returned the next item in the sequence. That's still true in Python 3, but there is now also a global next() function that takes an iterator as an argument. +

    In Python 2, iterators had a next() method which returned the next item in the sequence. That’s still true in Python 3, but there is now also a global next() function that takes an iterator as an argument.
    Notes Python 2 @@ -403,11 +403,11 @@ for an_iterator in a_sequence_of_iterators: an_iterator.__next__()

      -
    1. In the simplest case, instead of calling an iterator's next() method, you now pass the iterator itself to the global next() function. +
    2. In the simplest case, instead of calling an iterator’s next() method, you now pass the iterator itself to the global next() function.
    3. If you have a function that returns an iterator, call the function and pass the result to the next() function. (The 2to3 script is smart enough to convert this properly.)
    4. If you define your own class and mean to use it as an iterator, define the __next__() special method.
    5. If you define your own class and just happen to have a method named next() that takes one or more arguments, 2to3 will not touch it. This class can not be used as an iterator, because its next() method takes arguments. -
    6. This one is a bit tricky. If you have a local variable named next, then it takes precedence over the new global next() function. In this case, you need to call the iterator's special __next()__ method to get the next item in the sequence. (Alternatively, you could also refactor the code so the local variable wasn't named next, but 2to3 will not do that for you automatically.) +
    7. This one is a bit tricky. If you have a local variable named next, then it takes precedence over the new global next() function. In this case, you need to call the iterator’s special __next__() method to get the next item in the sequence. (Alternatively, you could also refactor the code so the local variable wasn’t named next, but 2to3 will not do that for you automatically.)
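The cases in the list above can be tried out with a toy iterator. The Countdown class and the shadowing_example() function are hypothetical names invented for this sketch.

```python
class Countdown:
    """A hand-written iterator: defines __next__(), not next()."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


an_iterator = Countdown(3)

# Simplest case: pass the iterator to the global next() function.
first = next(an_iterator)


def shadowing_example(iterator):
    # A local variable named next hides the global next() function,
    # so the special method has to be called directly.
    next = "just a local variable"
    return iterator.__next__()


second = shadowing_example(an_iterator)
print(first, second)
```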

    filter() global function

    In Python 2, the filter() function returned a list, the result of filtering a sequence through a function that returned True or False for each item in the sequence. In Python 3, the filter() function returns an iterator, not a list. @@ -482,7 +482,7 @@ reduce(a, b, c)

    The version of 2to3 that shipped with Python 3.0 would not fix the reduce() function automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.

    apply() global function

    -

    Python 2 had a global function called apply(), which took a function f and a list [a, b, c] and returned f(a, b, c). In Python 3, the apply() function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function's arguments. +

    Python 2 had a global function called apply(), which took a function f and a list [a, b, c] and returned f(a, b, c). In Python 3, the apply() function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function’s arguments.
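In Python 3 the calling syntax looks like the sketch below. The volume() function and its arguments are made up for illustration; the asterisk forms are standard Python 3 syntax.

```python
def volume(length, width, height):
    return length * width * height


dimensions = [2, 3, 4]

# Python 2: apply(volume, dimensions)
# Python 3: unpack the list into positional arguments with *.
result = volume(*dimensions)

# A dictionary of named arguments can be unpacked with ** as well.
options = {"height": 4}
result_with_keywords = volume(2, 3, **options)

print(result, result_with_keywords)
```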
    Notes Python 2 @@ -538,7 +538,7 @@ reduce(a, b, c)
  • Even fancier, the old exec statement could also take a local namespace (like the variables defined within a function). In Python 3, the exec() function can do that too.

    execfile statement (3.1+)

    -

    Like the old exec statement, the old execfile statement will execute strings as if they were Python code. Where exec took a string, execfile took a filename. In Python 3, the execfile statement has been eliminated. If you really need to take a file of Python code and execute it (but you're not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the global compile() function to force the Python interpreter to compile the code, and then call the new exec() function. +

    Like the old exec statement, the old execfile statement will execute strings as if they were Python code. Where exec took a string, execfile took a filename. In Python 3, the execfile statement has been eliminated. If you really need to take a file of Python code and execute it (but you’re not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the global compile() function to force the Python interpreter to compile the code, and then calling the new exec() function.
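Those steps (open, read, compile, exec) might look like the following sketch. To keep it self-contained, it first writes a throwaway script to a temporary file; the file contents and the namespace dictionary are assumptions for the example.

```python
import os
import tempfile

# Create a small throwaway script to stand in for "a file of Python code".
with tempfile.NamedTemporaryFile(
    "w", suffix=".py", delete=False, encoding="utf-8"
) as temp_file:
    temp_file.write("answer = 6 * 7\n")
    script_path = temp_file.name

# The replacement for execfile: open, read, compile, then exec.
namespace = {}
with open(script_path, encoding="utf-8") as source_file:
    code_object = compile(source_file.read(), script_path, "exec")
exec(code_object, namespace)

os.remove(script_path)
print(namespace["answer"])
```

Running the code in an explicit namespace dictionary keeps the executed file from overwriting names in the calling module.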
    Notes Python 2 @@ -607,7 +607,7 @@ except:
    1. Instead of a comma after the exception type, Python 3 uses a new keyword, as.
    2. The as keyword also works for catching multiple types of exceptions at once. -
    3. If you catch an exception but don't actually care about accessing the exception object itself, the syntax is identical between Python 2 and Python 3. +
    4. If you catch an exception but don’t actually care about accessing the exception object itself, the syntax is identical between Python 2 and Python 3.
    5. Similarly, if you use a fallback to catch all exceptions, the syntax is identical.
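The Python 3 forms described in the list above can be seen together in a short sketch. The two helper functions are hypothetical and exist only to trigger the exceptions.

```python
def safe_divide(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError as e:      # Python 2: except ZeroDivisionError, e:
        return "error: {0}".format(e)


def parse_number(text):
    try:
        return int(text)
    except (ValueError, TypeError) as e:  # several exception types at once
        return None


def quiet_lookup(a_dict, key):
    try:
        return a_dict[key]
    except KeyError:                    # no "as": same syntax in Python 2 and 3
        return None


print(safe_divide(1, 0), parse_number("abc"), quiet_lookup({}, "missing"))
```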
    @@ -660,7 +660,7 @@ except:
  • Python 2 also supported throwing an exception with only a custom error message. Python 3 does not support this, and the 2to3 script will display a warning telling you that you will need to fix this code manually.

    xrange() global function

    -

    In Python 2, there were two ways to get a range of numbers: range(), which returned a list, and xrange(), which returned an iterator. In Python 3, range() returns an iterator, and xrange() doesn't exist. +

    In Python 2, there were two ways to get a range of numbers: range(), which returned a list, and xrange(), which returned an iterator. In Python 3, range() returns an iterator, and xrange() doesn’t exist.
    Notes Python 2 @@ -738,11 +738,11 @@ except: a_function.__code__

      -
    1. The __name__ attribute (previously func_name) contains the function's name. -
    2. The __doc__ attribute (previously func_doc) contains the docstring that you defined in the function's source code. +
    3. The __name__ attribute (previously func_name) contains the function’s name. +
    4. The __doc__ attribute (previously func_doc) contains the docstring that you defined in the function’s source code.
    5. The __defaults__ attribute (previously func_defaults) is a tuple containing default argument values for those arguments that have default values.
    6. The __dict__ attribute (previously func_dict) is the namespace supporting arbitrary function attributes. -
    7. The __closure__ attribute (previously func_closure) is a tuple of cells that contain bindings for the function's free variables. +
    8. The __closure__ attribute (previously func_closure) is a tuple of cells that contain bindings for the function’s free variables.
    9. The __globals__ attribute (previously func_globals) is a reference to the global namespace of the module in which the function was defined.
    10. The __code__ attribute (previously func_code) is a code object representing the compiled function body.
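To see the renamed attributes on a live function, here is a minimal sketch. The make_greeter() and greet() functions are invented for the example; a closure is used so that __closure__ has something in it.

```python
def make_greeter(greeting):
    def greet(name, punctuation="!"):
        """Return a greeting for name."""
        return greeting + ", " + name + punctuation
    return greet


greet = make_greeter("Hello")
greet.language = "en"   # arbitrary function attribute, stored in __dict__

print(greet.__name__)                       # previously func_name
print(greet.__doc__)                        # previously func_doc
print(greet.__defaults__)                   # previously func_defaults
print(greet.__dict__)                       # previously func_dict
print(greet.__closure__[0].cell_contents)   # previously func_closure
print(greet.__code__.co_argcount)           # previously func_code
```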
    @@ -934,7 +934,7 @@ except:

    The version of 2to3 that shipped with Python 3.0 would not fix these cases of isinstance() automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.

  • basestring datatype

    -

    Python 2 had two string types: Unicode and non-Unicode. But there was also another type, basestring. It was an abstract type, a superclass for both the str and unicode types. It couldn't be called or instantiated directly, but you could pass it to the global isinstance() function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, so basestring has no reason to exist. +

    Python 2 had two string types: Unicode and non-Unicode. But there was also another type, basestring. It was an abstract type, a superclass for both the str and unicode types. It couldn’t be called or instantiated directly, but you could pass it to the global isinstance() function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, so basestring has no reason to exist.
    Notes Python 2 @@ -966,7 +966,7 @@ except:
  • Instead of itertools.izip(), just use the global zip() function.
  • Instead of itertools.imap(), just use map().
  • itertools.ifilter() becomes filter(). -
  • The itertools module still exists in Python 3, it just doesn't have the functions that have migrated to the global namespace. The 2to3 script is smart enough to remove the specific imports that no longer exist, while leaving other imports intact. +
  • The itertools module still exists in Python 3, it just doesn’t have the functions that have migrated to the global namespace. The 2to3 script is smart enough to remove the specific imports that no longer exist, while leaving other imports intact.

    sys.exc_type, sys.exc_value, sys.exc_traceback

    Python 2 had three variables in the sys module that you could access while an exception was being handled: sys.exc_type, sys.exc_value, sys.exc_traceback. (Actually, these date all the way back to Python 1.) Ever since Python 1.5, these variables have been deprecated in favor of sys.exc_info(), a function that returns a tuple containing all three values. In Python 3, these individual variables have finally gone away; you must use sys.exc_info(). @@ -1027,11 +1027,11 @@ except:

    1. Declaring the metaclass in the class declaration worked in Python 2, and it still works the same in Python 3. -
    2. Declaring the metaclass in a class attribute worked in Python 2, but doesn't work in Python 3. +
    3. Declaring the metaclass in a class attribute worked in Python 2, but doesn’t work in Python 3.
    4. The 2to3 script is smart enough to construct a valid class declaration, even if the class is inherited from one or more base classes.
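The Python 3 declaration, including the case with a base class, can be sketched as follows. RegistryMeta, Base, and Plugin are hypothetical names used only for this example.

```python
class RegistryMeta(type):
    """A toy metaclass that records the name of every class it creates."""

    registry = []

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.registry.append(name)
        return cls


class Base:
    pass


# Python 3: the metaclass goes in the class declaration, after any bases.
class Plugin(Base, metaclass=RegistryMeta):
    pass


# Python 2 style: in Python 3 this is just an ordinary class attribute
# and has no effect on how the class is created.
class NotReallyMeta:
    __metaclass__ = RegistryMeta


print(type(Plugin), type(NotReallyMeta), RegistryMeta.registry)
```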

    Matters of style

    -

    The rest of the “fixes” listed here aren't really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an official Python style guide which outlines — in excruciating detail — all sorts of nitpicky details that you almost certainly don't care about. And given that 2to3 provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs. +

    The rest of the “fixes” listed here aren’t really fixes per se. That is, the things they change are matters of style, not substance. They work just as well in Python 3 as they do in Python 2, but the developers of Python have a vested interest in making Python code as uniform as possible. To that end, there is an official Python style guide which outlines — in excruciating detail — all sorts of nitpicky details that you almost certainly don’t care about. And given that 2to3 provides such a great infrastructure for converting Python code from one thing to another, the authors took it upon themselves to add a few optional features to improve the readability of your Python programs.

    set() literals (explicit)

    In Python 2, the only way to define a literal set in your code was to call set(a_sequence). This still works in Python 3, but a clearer way of doing it is to use the new set literal notation: curly braces. (Dictionaries are also defined with curly braces, which makes sense once you think about it, because dictionaries are just sets of key-value pairs.)
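A quick comparison of the two spellings, with made-up values. One wrinkle worth noting: empty curly braces still mean an empty dictionary, so an empty set has to be written as set().

```python
from_call = set([1, 2, 3])     # works in Python 2 and Python 3
from_literal = {1, 2, 3}       # set literal notation

empty_set = set()              # there is no literal for an empty set
empty_dict = {}                # empty braces are a dictionary

print(from_call == from_literal, type(empty_set), type(empty_dict))
```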

    @@ -1053,7 +1053,7 @@ except:
    {i for i in a_sequence}

    buffer() global function (explicit)

    -

    Python objects implemented in C can export a “buffer interface,” which is a block of memory that is directly readable and writeable without copying. (That is exactly as powerful and scary as it sounds.) In Python 3, buffer() has been renamed to memoryview(). (It's a little more complicated than that, but you can almost certainly ignore the differences.) +

    Python objects implemented in C can export a “buffer interface,” which is a block of memory that is directly readable and writeable without copying. (That is exactly as powerful and scary as it sounds.) In Python 3, buffer() has been renamed to memoryview(). (It’s a little more complicated than that, but you can almost certainly ignore the differences.)

    The 2to3 script will not fix the buffer() function by default. To enable this fix, specify -f buffer on the command line when you call 2to3.
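For a feel of what "readable and writeable without copying" means, here is a small sketch using memoryview() on a bytearray. The byte values are arbitrary.

```python
data = bytearray(b"hello")
view = memoryview(data)

# Writing through the view changes the underlying object directly;
# no copy of the bytes is made.
view[0] = ord("j")

# Slicing a view gives another view onto the same memory.
tail = view[1:]
tail[0] = ord("a")

print(data)
```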

    @@ -1084,7 +1084,7 @@ except:
  • {a: b}

    Common idioms (explicit)

    -

    There were a number of common idioms built up in the Python community. Some, like the while 1: loop, date back to Python 1. (Python didn't have a true boolean type until version 2.3, so developers used 1 and 0 instead.) Modern Python programmers should train their brains to use modern versions of these idioms instead. +

    There were a number of common idioms built up in the Python community. Some, like the while 1: loop, date back to Python 1. (Python didn’t have a true boolean type until version 2.3, so developers used 1 and 0 instead.) Modern Python programmers should train their brains to use modern versions of these idioms instead.

    The 2to3 script will not fix common idioms by default. To enable this fix, specify -f idioms on the command line when you call 2to3.

    diff --git a/refactoring.html b/refactoring.html index 031e383..ef7d02f 100644 --- a/refactoring.html +++ b/refactoring.html @@ -17,13 +17,13 @@ body{counter-reset:h1 10}

     

    Diving In

    -

    Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven't written yet. +

    Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven’t written yet.

    >>> import roman7
     >>> roman7.from_roman("") 
     0
      -
    1. Remember in the [FIXME-xref] previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals? Well, it turns out that this is still true for the final version of the regular expression. And that's a bug; you want an empty string to raise an InvalidRomanNumeralError exception just like any other sequence of characters that don't represent a valid Roman numeral. +
    2. Remember in the [FIXME-xref] previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals? Well, it turns out that this is still true for the final version of the regular expression. And that’s a bug; you want an empty string to raise an InvalidRomanNumeralError exception just like any other sequence of characters that don’t represent a valid Roman numeral.

    After reproducing the bug, and before fixing it, you should write a test case that fails, thus illustrating the bug. @@ -107,15 +107,15 @@ Ran 11 tests in 0.156s OK

    1. The blank string test case now passes, so the bug is fixed. -
    2. All the other test cases still pass, which means that this bug fix didn't break anything else. Stop coding. +
    3. All the other test cases still pass, which means that this bug fix didn’t break anything else. Stop coding.
    -

    Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs will require complex test cases. In a testing-centric environment, it may seem like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case), then fix the bug itself. Then if the test case doesn't pass right away, you need to figure out whether the fix was wrong, or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can easily re-run all the test cases along with your new one, you are much less likely to break old code when fixing new code. Today's unit test is tomorrow's regression test. +

    Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs will require complex test cases. In a testing-centric environment, it may seem like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case), then fix the bug itself. Then if the test case doesn’t pass right away, you need to figure out whether the fix was wrong, or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can easily re-run all the test cases along with your new one, you are much less likely to break old code when fixing new code. Today’s unit test is tomorrow’s regression test.

    Handling Changing Requirements

    -

    Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible nasty things involving scissors and hot wax, requirements will change. Most customers don't know what they want until they see it, and even if they do, they aren't that good at articulating what they want precisely enough to be useful. And even if they do, they'll want more in the next release anyway. So be prepared to update your test cases as requirements change. +

    Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible nasty things involving scissors and hot wax, requirements will change. Most customers don’t know what they want until they see it, and even if they do, they aren’t that good at articulating what they want precisely enough to be useful. And even if they do, they’ll want more in the next release anyway. So be prepared to update your test cases as requirements change. -

    Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember [FIXME-xref] the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception to that rule by having 4 M characters in a row to represent 4000. If you make this change, you'll be able to expand the range of convertible numbers from 1..3999 to 1..4999. But first, you need to make some changes to your test cases. +

    Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember [FIXME-xref] the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception to that rule by having 4 M characters in a row to represent 4000. If you make this change, you’ll be able to expand the range of convertible numbers from 1..3999 to 1..4999. But first, you need to make some changes to your test cases.

    [download roman8.py]

    
    @@ -157,7 +157,7 @@ class RoundtripCheck(unittest.TestCase):
                 result = roman8.from_roman(numeral)
                 self.assertEqual(integer, result)
      -
    1. The existing known values don't change (they're all still reasonable values to test), but you need to add a few more in the 4000 range. Here I've included 4000 (the shortest), 4500 (the second shortest), 4888 (the longest), and 4999 (the largest). +
    2. The existing known values don’t change (they’re all still reasonable values to test), but you need to add a few more in the 4000 range. Here I’ve included 4000 (the shortest), 4500 (the second shortest), 4888 (the longest), and 4999 (the largest).
    3. The definition of “large input” has changed. This test used to call to_roman() with 4000 and expect an error; now that 4000-4999 are good values, you need to bump this up to 5000.
    4. The definition of “too many repeated numerals” has also changed. This test used to call from_roman() with 'MMMM' and expect an error; now that MMMM is considered a valid Roman numeral, you need to bump this up to 'MMMMM'.
    5. The sanity check loops through every number in the range, from 1 to 3999. Since the range has now expanded, this for loop needs to be updated as well to go up to 4999. @@ -220,7 +220,7 @@ FAILED (errors=3)
    6. The roundtrip check will also fail as soon as it hits 4000, because to_roman() still thinks this is out of range.
    -

    Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being tested is never “ahead” of the test cases. While it's behind, you still have some work to do, and as soon as it catches up to the test cases, you stop coding.) +

    Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being tested is never “ahead” of the test cases. While it’s behind, you still have some work to do, and as soon as it catches up to the test cases, you stop coding.)

    [download roman9.py]

    
    @@ -255,11 +255,11 @@ def from_roman(s):
         .
         .
      -
    1. You don't need to make any changes to the from_roman() function at all. The only change is to roman_numeral_pattern. If you look closely, you'll notice that I changed the maximum number of optional M characters from 3 to 4 in the first section of the regular expression. This will allow the Roman numeral equivalents of 4999 instead of 3999. The actual from_roman() function is completely generic; it just looks for repeated Roman numeral characters and adds them up, without caring how many times they repeat. The only reason it didn't handle 'MMMM' before is that you explicitly stopped it with the regular expression pattern matching. -
    2. The to_roman() function only needs one small change, in the range check. Where you used to check 0 < n < 4000, you now check 0 < n < 5000. And you change the error message that you raise to reflect the new acceptable range (1..4999 instead of 1..3999). You don't need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds 'M' for each thousand that it finds; given 4000, it will spit out 'MMMM'. The only reason it didn't do this before is that you explicitly stopped it with the range check.) +
    3. You don’t need to make any changes to the from_roman() function at all. The only change is to roman_numeral_pattern. If you look closely, you’ll notice that I changed the maximum number of optional M characters from 3 to 4 in the first section of the regular expression. This will allow the Roman numeral equivalents of 4999 instead of 3999. The actual from_roman() function is completely generic; it just looks for repeated Roman numeral characters and adds them up, without caring how many times they repeat. The only reason it didn’t handle 'MMMM' before is that you explicitly stopped it with the regular expression pattern matching. +
    4. The to_roman() function only needs one small change, in the range check. Where you used to check 0 < n < 4000, you now check 0 < n < 5000. And you change the error message that you raise to reflect the new acceptable range (1..4999 instead of 1..3999). You don’t need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds 'M' for each thousand that it finds; given 4000, it will spit out 'MMMM'. The only reason it didn’t do this before is that you explicitly stopped it with the range check.)
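The two changes described above can be sketched in isolation. This is a reconstruction under assumptions, not the actual roman9.py listing: the OutOfRangeError name, the roman_numeral_map table, and the exact layout of the pattern are guesses; only the M{0,4} quantifier, the 0 < n < 5000 check, and the 1..4999 message follow directly from the text.

```python
import re


class OutOfRangeError(ValueError):
    """Assumed exception name for out-of-range input."""


# Assumed lookup table, ordered from largest to smallest value.
roman_numeral_map = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)

roman_numeral_pattern = re.compile(r"""
    ^                   # beginning of string
    M{0,4}              # thousands: 0 to 4 Ms (was 0 to 3)
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
    """, re.VERBOSE)


def to_roman(n):
    """Convert an integer in 1..4999 to a Roman numeral."""
    if not (0 < n < 5000):  # was 0 < n < 4000
        raise OutOfRangeError("number out of range (must be 1..4999)")
    result = ""
    for numeral, integer in roman_numeral_map:
        while n >= integer:
            result += numeral
            n -= integer
    return result


print(to_roman(4000), to_roman(4999))
```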
    -

    You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself. +

    You may be skeptical that these two small changes are all that you need. Hey, don’t take my word for it; see for yourself.

     you@localhost:~$ python3 romantest9.py -v
    @@ -288,13 +288,13 @@ Ran 12 tests in 0.203s
     
     

    Refactoring

    -

    The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly. +

    The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn’t. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.

    Refactoring is the process of taking working code and making it work better. Usually, “better” means “faster”, although it can also mean “using less memory”, or “using less disk space”, or simply “more elegantly”. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any program. -

    Here, “better” means both “faster” and “easier to maintain.” Specifically, the from_roman() function is slower and more complex than I'd like, because of that big nasty regular expression that you use to validate Roman numerals. Now, you might think, "Sure, the regular expression is big and hairy, but how else am I supposed to validate that an arbitrary string is a valid a Roman numeral?" +

    Here, “better” means both “faster” and “easier to maintain.” Specifically, the from_roman() function is slower and more complex than I’d like, because of that big nasty regular expression that you use to validate Roman numerals. Now, you might think, “Sure, the regular expression is big and hairy, but how else am I supposed to validate that an arbitrary string is a valid Roman numeral?” -

    Answer: there's only 5000 of them; why don't you just build a lookup table? This idea gets even better when you realize that you don't need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers. By the time you need to check whether an arbitrary string is a valid Roman numeral, you will have collected all the valid Roman numerals. “Validating” is reduced to a single dictionary lookup. +

    Answer: there’s only 5000 of them; why don’t you just build a lookup table? This idea gets even better when you realize that you don’t need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers. By the time you need to check whether an arbitrary string is a valid Roman numeral, you will have collected all the valid Roman numerals. “Validating” is reduced to a single dictionary lookup.
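To make the idea concrete, here is a minimal sketch of the two lookup tables. It is not the chapter's actual roman.py listing: the helper _compute_roman() and the is_valid_roman() wrapper are hypothetical names made up for this illustration, and the 1..4999 range is taken from the requirement described earlier.

```python
# Numeral/value pairs, largest first, including the subtractive forms.
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

to_roman_table = [None]   # index 0 is unused: there is no Roman numeral for 0
from_roman_table = {}     # maps each valid numeral back to its integer


def _compute_roman(n):
    """Hypothetical helper: convert one integer using the greedy algorithm."""
    result = ''
    for numeral, value in ROMAN_NUMERAL_MAP:
        while n >= value:
            result += numeral
            n -= value
    return result


def build_lookup_tables():
    """Fill both tables in a single pass over 1..4999."""
    for n in range(1, 5000):
        numeral = _compute_roman(n)
        to_roman_table.append(numeral)
        from_roman_table[numeral] = n


build_lookup_tables()


def is_valid_roman(s):
    """Hypothetical wrapper: validation is one dictionary lookup."""
    return s in from_roman_table


print(to_roman_table[1999])
print(from_roman_table['MCMXCIX'])
print(is_valid_roman('IIII'))
```

Because every valid numeral is generated from an integer, anything missing from from_roman_table is invalid by construction, which is why no regular expression is needed.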

    And best of all, you already have a complete set of unit tests. You can change over half the code in the module, but the unit tests will stay the same. That means you can prove — to yourself and to others — that the new code works just as well as the original. @@ -357,13 +357,13 @@ def build_lookup_tables(): build_lookup_tables()

    -

    Let's break that down into digestable pieces. Arguably, the most important line is the last one: +

    Let’s break that down into digestible pieces. Arguably, the most important line is the last one:

    build_lookup_tables()
    -

    You will note that is a function call, but there's no if statement around it. This is not an if __name__ == '__main__' block; it gets called when the module is imported. (It is important to understand that modules are only imported once, then cached. If you import an already-imported module, it does nothing. So this code will only get called the first time you import this module.) +

    You will note that this is a function call, but there’s no if statement around it. This is not an if __name__ == '__main__' block; it gets called when the module is imported. (It is important to understand that modules are only imported once, then cached. If you import an already-imported module, it does nothing. So this code will only get called the first time you import this module.) -

    So what does the build_lookup_tables() function do? I'm glad you asked. +

    So what does the build_lookup_tables() function do? I’m glad you asked.

    to_roman_table = [ None ]
     from_roman_table = {}
    @@ -438,7 +438,7 @@ to_roman should fail with 0 input ... ok
     
     OK
      -
    1. Not that you asked, but it's fast, too! Like, almost 10× as fast. Of course, it's not entirely a fair comparison, because this version takes longer to import (when it builds the lookup tables). But since the import is only done once, the startup cost is amortized over all the calls to the to_roman() and from_roman() functions. Since the tests make several thousand function calls (the roundtrip test alone makes 10,000), this savings adds up in a hurry! +
    2. Not that you asked, but it’s fast, too! Like, almost 10× as fast. Of course, it’s not entirely a fair comparison, because this version takes longer to import (when it builds the lookup tables). But since the import is only done once, the startup cost is amortized over all the calls to the to_roman() and from_roman() functions. Since the tests make several thousand function calls (the roundtrip test alone makes 10,000), this savings adds up in a hurry!

    The moral of the story? @@ -451,9 +451,9 @@ OK

    Summary

    -

    Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you've seen it work, you'll wonder how you ever got along without it. +

    Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you’ve seen it work, you’ll wonder how you ever got along without it. -

    These few chapters have covered a lot of ground, and much of it wasn't even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts: +

    These few chapters have covered a lot of ground, and much of it wasn’t even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts:

    • Designing test cases that are specific, automated, and independent @@ -461,7 +461,7 @@ OK
    • Writing tests that test good input and check for proper results
    • Writing tests that test bad input and check for proper failure responses
    • Writing and updating test cases to reflect new requirements -
    • Refactoring mercilessly to improve performance, scalability, readability, maintainability, or whatever other -ility you're lacking +
    • Refactoring mercilessly to improve performance, scalability, readability, maintainability, or whatever other -ility you’re lacking

    © 2001–9 Mark Pilgrim diff --git a/special-method-names.html b/special-method-names.html index 31bb0fb..b19fa86 100644 --- a/special-method-names.html +++ b/special-method-names.html @@ -50,7 +50,8 @@ __ne__ __gt__ - covered in fractions.py __ge__ - covered in fractions.py __bool__ - covered in fractions.py -__cmp__ (*) + +(__cmp__ is gone)

    Custom Attributes

    @@ -118,7 +119,15 @@ __reversed__ - covered in ordereddict.py

    Classes That Act Like Numbers

    -

    FIXME binary operator intro +

    Using the appropriate special methods, you can define your own classes that act like numbers. That is, you can add them, subtract them, and perform other mathematical operations on them. This is how fractions are implemented — the Fraction class implements these special methods, then you can do things like this: + +

    +>>> from fractions import Fraction
    +>>> x = Fraction(1, 3)
    +>>> x / 3
    +Fraction(1, 9)
    + +

    Here is the comprehensive list of special methods you need to implement a number-like class.
    Notes @@ -195,7 +204,24 @@ __xor__ __or__ --> -

    FIXME explain circumstances under which reflected methods will be called. +

    That’s all well and good if x is an instance of a class that implements those methods. But what if it doesn’t implement one of them? Or worse, what if it implements it, but it can’t handle certain kinds of arguments? For example: + +

    +>>> from fractions import Fraction
    +>>> x = Fraction(1, 3)
    +>>> 1 / x
    +Fraction(3, 1)
    + +

    This is not a case of taking a Fraction and dividing it by an integer (as in the previous example). That case was straightforward: x / 3 calls x.__truediv__(3), and the __truediv__() method of the Fraction class handles all the math. But integers don’t “know” how to do arithmetic operations with fractions. So why does this example work? + +

    The answer lies in a second set of arithmetic special methods with reflected operands. Given an arithmetic operation that takes two operands (e.g. x / y), there are two ways to go about it: + +

      +
    1. Tell x to divide itself by y, or +
    2. Tell y to divide itself into x +
    + +

    The set of special methods above takes the first approach: given x / y, they provide a way for x to say “I know how to divide myself by y.” The following set of special methods tackles the second approach: they provide a way for y to say “I know how to be the denominator and divide myself into x.”
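A small sketch may help here. The Scalar class below is hypothetical (it is not part of the standard library or of this book's examples); it only exists to show which method Python calls for each operand order.

```python
class Scalar:
    """Hypothetical number-like class wrapping a single float."""

    def __init__(self, value):
        self.value = value

    def __truediv__(self, other):
        # Called for  self / other
        if isinstance(other, Scalar):
            return Scalar(self.value / other.value)
        if isinstance(other, (int, float)):
            return Scalar(self.value / other)
        return NotImplemented

    def __rtruediv__(self, other):
        # Called for  other / self, but only after other's own
        # __truediv__() has returned NotImplemented.
        if isinstance(other, (int, float)):
            return Scalar(other / self.value)
        return NotImplemented

    def __repr__(self):
        return 'Scalar({0!r})'.format(self.value)


x = Scalar(4.0)
half = x / 2        # handled by Scalar.__truediv__
inverse = 1 / x     # int gives up, so Scalar.__rtruediv__ handles it
print(half, inverse)
```

Returning NotImplemented (rather than raising an exception) is what gives the other operand its chance to handle the operation.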
    Notes @@ -271,7 +297,7 @@ __rxor__ __ror__ --> -

    FIXME explain in-place augmented assignments +

    But wait! There’s more! If you’re doing “in-place” operations, like x /= 3, there are even more special methods you can define.
    Notes @@ -343,7 +369,17 @@ __ixor__ __ior__ --> -

    FIXME unary operator intro +

    Note: for the most part, the in-place operation methods are not required. If you don’t define an in-place method for a particular operation, Python will fall back to the regular methods. For example, to execute the expression x /= y, Python will: + +

      +
    1. Try calling x.__itruediv__(y). If this method is defined and returns a value other than NotImplemented, we’re done. +
    2. Try calling x.__truediv__(y). If this method is defined and returns a value other than NotImplemented, the old value of x is discarded and replaced with the return value, just as if you had done x = x / y instead. +
    3. Try calling y.__rtruediv__(x). If this method is defined and returns a value other than NotImplemented, the old value of x is discarded and replaced with the return value. +
    + +

    So you only need to define in-place methods like the __itruediv__() method if you want to do some special optimization for in-place operations. Otherwise Python will essentially reformulate the in-place operation as a regular operation plus a variable assignment. +
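The difference is easiest to see by checking object identity before and after the operation. This is only a sketch; Plain and InPlace are hypothetical classes invented for the illustration.

```python
class Plain:
    """Hypothetical class that defines no in-place method."""

    def __init__(self, value):
        self.value = value

    def __truediv__(self, other):
        # Builds and returns a brand-new object.
        return Plain(self.value / other)


class InPlace(Plain):
    """Hypothetical subclass that adds an in-place method."""

    def __itruediv__(self, other):
        self.value /= other   # modify the existing object
        return self           # an in-place method must return the result


a = Plain(9.0)
a_before = a
a /= 3    # no __itruediv__(), so this behaves like  a = a / 3

b = InPlace(9.0)
b_before = b
b /= 3    # __itruediv__() updates b itself

print(a is a_before, b is b_before)
```

In the first case the name a ends up bound to a new object and the old one is left untouched; in the second case the same object is reused, which is the kind of optimization an in-place method makes possible.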

    There are also a few “unary” mathematical operations you can perform on number-like objects by themselves.
    Notes @@ -399,7 +435,7 @@ __ior__ math.trunc(x) x.__trunc__()
    -??? +??? FIXME what the hell is this? ??? x.__index__()
    @@ -439,6 +475,29 @@ __reduce_ex__ (*)

     __enter__ see http://docs.python.org/3.0/library/stdtypes.html#typecontextmanager
     __exit__
    +
    +relevant excerpt from io.py:
    +
    +    def __enter__(self) -> "IOBase":  # That's a forward reference
    +        """Context management protocol.  Returns self."""
    +        self._checkClosed()
    +        return self
    +
    +    def __exit__(self, *args) -> None:
    +        """Context management protocol.  Calls close()"""
    +        self.close()
    +
    +relevant excerpt from http://www.python.org/doc/3.0/reference/datamodel.html#with-statement-context-managers
    +
    +object.__enter__(self)
    +  Enter the runtime context related to this object. The with statement will bind this method’s return value to the target(s) specified in the as clause of the statement, if any.
    +object.__exit__(self, exc_type, exc_value, traceback)
    +  Exit the runtime context related to this object. The parameters describe the exception that caused the context to be exited. If the context was exited without an exception, all three arguments will be None.
    +
    +If an exception is supplied, and the method wishes to suppress the exception (i.e., prevent it from being propagated), it should return a true value. Otherwise, the exception will be processed normally upon exit from this method.
    +
    +Note that __exit__() methods should not reraise the passed-in exception; this is the caller’s responsibility.
    +
     

    Really Esoteric Stuff

    diff --git a/strings.html b/strings.html index 46c14bc..c8aa5d4 100644 --- a/strings.html +++ b/strings.html @@ -49,19 +49,19 @@ My alphabet starts where your alphabet ends!
    — Dr

    Enter Unicode. -

    Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That's 232−1.) Each 4-byte number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn't have an 'A' in it. +

    Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That’s 2^32 − 1.) Each 4-byte number represents a unique character used in at least one of the world’s languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn’t be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn’t have an 'A' in it. -

    On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it's wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character. +

    On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character? That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it’s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character. -

    There is a Unicode encoding that uses four bytes per character. It's called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character. +

    There is a Unicode encoding that uses four bytes per character. It’s called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character. -

    Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used “astral plane” Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don't). And you can still easily find the Nth character of a string in constant time, if you assume that the string doesn't include any astral plane characters, which is a good assumption right up until the moment that it's not. +

    Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used “astral plane” Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don’t). And you can still easily find the Nth character of a string in constant time, if you assume that the string doesn’t include any astral plane characters, which is a good assumption right up until the moment that it’s not. -

    But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character U+4E2D could be stored in UTF-16 as either 4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you're safe — different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you're going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence 4E 2D means U+4E2D or U+2D4E. +

    But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character U+4E2D could be stored in UTF-16 as either 4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you’re safe — different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you’re going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence 4E 2D means U+4E2D or U+2D4E.
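You can watch the ambiguity happen from Python itself. This is a small sketch using the standard 'utf-16-be' and 'utf-16-le' codecs, which force a specific byte order instead of using the machine's native one; the variable names are just for illustration.

```python
ch = '\u4e2d'   # the character U+4E2D

# The same character, stored in the two possible byte orders.
big = ch.encode('utf-16-be')
little = ch.encode('utf-16-le')
print(big.hex(), little.hex())

# The same two bytes, read under the two possible assumptions.
raw = b'\x4e\x2d'
as_big = raw.decode('utf-16-be')
as_little = raw.decode('utf-16-le')
print(hex(ord(as_big)), hex(ord(as_little)))
```

Nothing in the bytes themselves says which reading is the right one, and that is exactly the problem.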

    To solve this problem, the multi-byte Unicode encodings define a “Byte Order Mark,” which is a special non-printable character that you can include at the beginning of your document to indicate what order your bytes are in. For UTF-16, the Byte Order Mark is U+FEFF. If you receive a UTF-16 document that starts with the bytes FF FE, you know the byte ordering is one way; if it starts with FE FF, you know the byte ordering is reversed. -

    Still, UTF-16 isn't exactly ideal, especially if you're dealing with a lot of ASCII characters. If you think about it, even a Chinese web page is going to contain a lot of ASCII characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the Nth character in O(1) time is nice, but there's still the nagging problem of those astral plane characters, which mean that you can't guarantee that every character is exactly two bytes, so you can't really find the Nth character in O(1) time unless you maintain a separate index. And boy, there sure is a lot of ASCII text in the world… +

    Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of ASCII characters. If you think about it, even a Chinese web page is going to contain a lot of ASCII characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the Nth character in O(1) time is nice, but there’s still the nagging problem of those astral plane characters, which mean that you can’t guarantee that every character is exactly two bytes, so you can’t really find the Nth character in O(1) time unless you maintain a separate index. And boy, there sure is a lot of ASCII text in the world…

    Other people pondered these questions, and they came up with a solution: @@ -71,7 +71,7 @@ My alphabet starts where your alphabet ends!
    — Dr

    Disadvantages: because each character can take a different number of bytes, finding the Nth character is an O(N) operation. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters. -

    Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you'll have to trust me on this, because I'm not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer. +

    Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
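If you would rather measure than trust me, here is a quick sketch that counts bytes per character in each encoding. The three sample characters are my own picks, and the '-be' codec variants are used so that no Byte Order Mark is added to the counts.

```python
samples = {
    'ASCII letter': 'A',
    'accented Latin letter': '\u00e9',   # é
    'Chinese character': '\u4e2d',       # 中
}

sizes = {}
for label, ch in samples.items():
    # Number of bytes needed for one character in each encoding.
    sizes[label] = {enc: len(ch.encode(enc))
                    for enc in ('utf-8', 'utf-16-be', 'utf-32-be')}
    print(label, sizes[label])
```

The ASCII letter costs one byte in UTF-8, the accented letter costs two (the same as UTF-16), and the Chinese character costs three, which is still less than the four bytes UTF-32 charges for everything.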

    Diving In

    @@ -95,7 +95,7 @@ My alphabet starts where your alphabet ends!
    — Dr

    Formatting Strings

    -

    Let's take another look at humansize.py: +

    Let’s take another look at humansize.py:

    [download humansize.py]

    
    @@ -127,8 +127,8 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
     
  • 'KB', 'MB', 'GB'… those are each strings.
  • Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start and end the string.
  • These three-in-a-row quotes end the docstring. -
  • There's another string, being passed to the exception as a human-readable error message. -
  • There's a… whoa, what the heck is that? +
  • There’s another string, being passed to the exception as a human-readable error message. +
  • There’s a… whoa, what the heck is that?

    Python 3 supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder. @@ -140,7 +140,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True): "mark's password is PapayaWhip"

    1. No, my password is not really PapayaWhip. -
    2. There's a lot going on here. First, that's a method call on a string literal. Strings are objects, and objects have methods. Second, the whole expression evaluates to a string. Third, {0} and {1} are replacement fields, which are replaced by the arguments passed to the format() method. +
    3. There’s a lot going on here. First, that’s a method call on a string literal. Strings are objects, and objects have methods. Second, the whole expression evaluates to a string. Third, {0} and {1} are replacement fields, which are replaced by the arguments passed to the format() method.

    Compound Field Names

    @@ -156,8 +156,8 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True): '1000KB = 1MB'
      -
    1. Rather than calling any function in the humansize module, you're just grabbing one of the data structures it defines: the list of "SI" (powers-of-1000) suffixes. -
    2. This looks complicated, but it's not. {0} would refer to the first argument passed to the format() method, si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same list: 'MB'. Everything outside the curly braces — including 1000, the equals sign, and the spaces — is untouched. The final result is the string '1000KB = 1MB'. +
    3. Rather than calling any function in the humansize module, you’re just grabbing one of the data structures it defines: the list of "SI" (powers-of-1000) suffixes. +
    4. This looks complicated, but it’s not. {0} would refer to the first argument passed to the format() method, si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same list: 'MB'. Everything outside the curly braces — including 1000, the equals sign, and the spaces — is untouched. The final result is the string '1000KB = 1MB'.
    @@ -171,7 +171,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
  • Any combination of the above -

    Just to blow your mind, here's an example that combines all of the above: +

    Just to blow your mind, here’s an example that combines all of the above:

     >>> import humansize
    @@ -179,7 +179,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
     >>> "1MB = 1000{0.modules[humansize].SUFFIXES[1000][0]}".format(sys)
     '1MB = 1000KB'
    -

    Here's how it works: +

    Here’s how it works:

    • The sys module holds information about the currently running Python instance. Since you just imported it, you can pass the sys module itself as an argument to the format() method. So the replacement field {0} refers to the sys module. @@ -192,12 +192,12 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):

      Format Specifiers

      -

      But wait! There's more! Let's take another look at that strange line of code from humansize.py: +

      But wait! There’s more! Let’s take another look at that strange line of code from humansize.py:

      if size < multiple:
           return "{0:.1f} {1}".format(size, suffix)
      -

      {1} is replaced with the second argument passed to the format() method, which is suffix. But what is {0:.1f}? It's two things: {0}, which you recognize, and :.1f, which you don't. The second half (including and after the colon) defines the format specifier, which further refines how the replaced variable should be formatted. +

      {1} is replaced with the second argument passed to the format() method, which is suffix. But what is {0:.1f}? It’s two things: {0}, which you recognize, and :.1f, which you don’t. The second half (including and after the colon) defines the format specifier, which further refines how the replaced variable should be formatted.
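Here is a short sketch of that format specifier at work. The values are made up for the illustration; only the "{0:.1f} {1}" template comes from humansize.py.

```python
size = 1536 / 1024    # 1.5
suffix = 'KB'

# .1f means "fixed-point, one digit after the decimal point".
label = "{0:.1f} {1}".format(size, suffix)

# The value is rounded to fit, and padded with a zero if it needs to be.
rounded = "{0:.1f}".format(698.24)
padded = "{0:.1f}".format(2)

print(label, rounded, padded)
```

The specifier only changes how the value is displayed in the resulting string; size itself is still the same float.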

      Format specifiers allow you to munge the replacement text in a variety of useful ways, like the printf() function in C. You can add zero- or space-padding, align strings, control decimal precision, and even convert numbers to hexadecimal. @@ -239,7 +239,7 @@ experience of years.

    • The count() method counts the number of occurrences of a substring. Yes, there really are six “f”s in that sentence! -

      Here's another common case. Let's say you have a list of key-value pairs in the form key1=value1&key2=value2, and you want to split them up and make a dictionary of the form {key1: value1, key2: value2}. +

      Here’s another common case. Let’s say you have a list of key-value pairs in the form key1=value1&key2=value2, and you want to split them up and make a dictionary of the form {key1: value1, key2: value2}.

       >>> query = 'user=pilgrim&database=master&password=PapayaWhip'
      @@ -324,8 +324,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
       >>> s.count(by.decode('ascii'))  
       1
        -
      1. You can't concatenate bytes and strings. They are two different data types. -
      2. You can't count the occurrences of bytes in a string, because there are no bytes in a string. A string is a sequence of characters. Perhaps you meant “count the occurrences of the string that you would get after decoding this sequence of bytes in a particular character encoding”? Well then, you'll need to say that explicitly. Python 3 won't implicitly convert bytes to strings or strings to bytes. +
      3. You can’t concatenate bytes and strings. They are two different data types. +
      4. You can’t count the occurrences of bytes in a string, because there are no bytes in a string. A string is a sequence of characters. Perhaps you meant “count the occurrences of the string that you would get after decoding this sequence of bytes in a particular character encoding”? Well then, you’ll need to say that explicitly. Python 3 won’t implicitly convert bytes to strings or strings to bytes.
      5. By an amazing coincidence, this line of code says “count the occurrences of the string that you would get after decoding this sequence of bytes in this particular character encoding.”
      @@ -393,7 +393,7 @@ FIXME: move this to the intro of the upcoming files chapter?

      On Unicode in general: diff --git a/table-of-contents.html b/table-of-contents.html index afb6b0e..db1b74c 100644 --- a/table-of-contents.html +++ b/table-of-contents.html @@ -15,7 +15,8 @@ ul li ol{margin:0;padding:0 0 0 2.5em}

       

      You are here: Home Dive Into Python 3

      Table of contents

      -
        +
          +
        1. What’s New In “Dive Into Python 3”
        2. Installing Python
          1. Python on Windows diff --git a/unit-testing.html b/unit-testing.html index 7dfb3d6..4716051 100644 --- a/unit-testing.html +++ b/unit-testing.html @@ -17,18 +17,18 @@ body{counter-reset:h1 8}
    •  

      (Not) Diving In

      -

      In this chapter, you're going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in “Case study: roman numerals”. Now step back and consider what it would take to expand that into a two-way utility. +

      In this chapter, you’re going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in “Case study: roman numerals”. Now step back and consider what it would take to expand that into a two-way utility.

      The rules for Roman numerals lead to a number of interesting observations:

      1. There is only one correct way to represent a particular number as a Roman numeral.
      2. The converse is also true: if a string of characters is a valid Roman numeral, it represents only one number (that is, it can only be interpreted one way). -
      3. There is a limited range of numbers that can be expressed as Roman numerals, specifically 1 through 3999. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by 1000, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from 1 to 3999.) +
      4. There is a limited range of numbers that can be expressed as Roman numerals, specifically 1 through 3999. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by 1000, but you’re not going to deal with that. For the purposes of this chapter, let’s stipulate that Roman numerals go from 1 to 3999.)
      5. There is no way to represent 0 in Roman numerals.
      6. There is no way to represent negative numbers in Roman numerals.
      7. There is no way to represent fractions or non-integer numbers in Roman numerals.
      -

      Let's start mapping out what a roman.py module should do. It will have two main functions, to_roman() and from_roman(). The to_roman() function should take an integer from 1 to 3999 and return the Roman numeral representation as a string…

      -

      Stop right there. Now let's do something a little unexpected: write a test case that checks whether the to_roman() function does what you want it to. You read that right: you're going to write code that tests code that you haven't written yet. +

      Let’s start mapping out what a roman.py module should do. It will have two main functions, to_roman() and from_roman(). The to_roman() function should take an integer from 1 to 3999 and return the Roman numeral representation as a string…

      +

      Stop right there. Now let’s do something a little unexpected: write a test case that checks whether the to_roman() function does what you want it to. You read that right: you’re going to write code that tests code that you haven’t written yet.

      This is called unit testing. The set of two conversion functions — to_roman(), and later from_roman() — can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named unittest module.

      Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:

        @@ -36,7 +36,7 @@ body{counter-reset:h1 8}
      • While writing code, it keeps you from over-coding. When all the test cases pass, the function is complete.
      • When refactoring code, it assures you that the new version behaves the same way as the old version.
      • When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code. (“But sir, all the unit tests passed when I checked it in...”) -
      • When writing code in a team, it increases confidence that the code you're about to commit isn't going to break someone else's code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn't play well with others.) +
      • When writing code in a team, it increases confidence that the code you’re about to commit isn’t going to break someone else’s code, because you can run their unit tests first. (I’ve seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn’t play well with others.)

      A Single Question

      @@ -46,11 +46,11 @@ body{counter-reset:h1 8}
    • ...determine by itself whether the function it is testing has passed or failed, without a human interpreting the results.
    • ...run in isolation, separate from any other test cases (even if they test the same functions). Each test case is an island.
    -

    Given that, let's build a test case for the first requirement: +

    Given that, let’s build a test case for the first requirement:

    1. The to_roman() function should return the Roman numeral representation for all integers 1 to 3999.
    -

    It is not immediately obvious how this code does… well, anything. It defines a class which has no __init__() method. The class does have another method, but it is never called. The entire script has a __main__ block, but it doesn't reference the class or its method. But it does do something, I promise. +

    It is not immediately obvious how this code does… well, anything. It defines a class which has no __init__() method. The class does have another method, but it is never called. The entire script has a __main__ block, but it doesn’t reference the class or its method. But it does do something, I promise.

    [download romantest1.py]

    import roman1
     import unittest
    @@ -125,20 +125,20 @@ if __name__ == "__main__":
     
  • To write a test case, first subclass the TestCase class of the unittest module. This class provides many useful methods which you can use in your test case to test specific conditions.
  • This is a list of integer/numeral pairs that I verified manually. It includes the lowest ten numbers, the highest number, every number that translates to a single-character Roman numeral, and a random sampling of other valid numbers. The point of a unit test is not to test every possible input, but to test a representative sample.
  • Every individual test is its own method, which must take no parameters and return no value. If the method exits normally without raising an exception, the test is considered passed; if the method raises an exception, the test is considered failed. -
  • Here you call the actual to_roman() function. (Well, the function hasn't be written yet, but once it is, this is the line that will call it.) Notice that you have now defined the API for the to_roman() function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the API is different than that, this test is considered failed. Also notice that you are not trapping any exceptions when you call to_roman(). This is intentional. to_roman() shouldn't raise an exception when you call it with valid input, and these input values are all valid. If to_roman() raises an exception, this test is considered failed. +
  • Here you call the actual to_roman() function. (Well, the function hasn’t been written yet, but once it is, this is the line that will call it.) Notice that you have now defined the API for the to_roman() function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the API is different than that, this test is considered failed. Also notice that you are not trapping any exceptions when you call to_roman(). This is intentional. to_roman() shouldn’t raise an exception when you call it with valid input, and these input values are all valid. If to_roman() raises an exception, this test is considered failed.
  • Assuming the to_roman() function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the right value. This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal. If the result returned from to_roman() (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail. If the two values are equal, assertEqual will do nothing. If every value returned from to_roman() matches the known value you expect, assertEqual never raises an exception, so testToRomanKnownValues eventually exits normally, which means to_roman() has passed this test. -

    Once you have a test case, you can start coding the to_roman() function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you've written any code, you're doing it wrong — your tests aren't testing your code at all! Write a test that fails, then code until it passes. +

    Once you have a test case, you can start coding the to_roman() function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you’ve written any code, you’re doing it wrong — your tests aren’t testing your code at all! Write a test that fails, then code until it passes.

    # roman1.py
     
     def to_roman(n):
         """convert integer to Roman numeral"""
         pass                                   
      -
    1. At this stage, you want to define the API of the to_roman() function, but you don't want to code it yet. (Your test needs to fail first.) To stub it out, use the Python reserved word pass [FIXME ref], which does precisely nothing. +
    2. At this stage, you want to define the API of the to_roman() function, but you don’t want to code it yet. (Your test needs to fail first.) To stub it out, use the Python reserved word pass [FIXME ref], which does precisely nothing.
    -

    Execute romantest1.py on the command line to run the test. If you call it with the -v command-line option, it will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this: +

    Execute romantest1.py on the command line to run the test. If you call it with the -v command-line option, it will give more verbose output so you can see exactly what’s going on as each test case runs. With any luck, your output should look like this:

     you@localhost:~$ python3 romantest1.py -v
     to_roman should give known result with known input ... FAIL            
    @@ -157,9 +157,9 @@ Traceback (most recent call last):
     FAILED (failures=1)                                                    
    1. Running the script runs unittest.main(), which runs each test case. Each test case is a method within each class in romantest1.py that inherits from unittest.TestCase. For each test case, the unittest module will print out the docstring of the method and whether that test passed or failed. As expected, this test case fails. -
    2. For each failed test case, unittest displays the trace information showing exactly what happened. In this case, the call to assertEqual() raised an AssertionError because it was expecting to_roman(1) to return "I", but it didn't. (Since there was no explicit return statement, the function returned None, the Python null value.) +
    3. For each failed test case, unittest displays the trace information showing exactly what happened. In this case, the call to assertEqual() raised an AssertionError because it was expecting to_roman(1) to return "I", but it didn’t. (Since there was no explicit return statement, the function returned None, the Python null value.)
    4. After the detail of each test, unittest displays a summary of how many tests were performed and how long it took. -
    5. Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass, unittest distinguishes between failures and errors. A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort of exception raised in the code you're testing or the unit test case itself. +
    6. Overall, the unit test failed because at least one test case did not pass. When a test case doesn’t pass, unittest distinguishes between failures and errors. A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort of exception raised in the code you’re testing or the unit test case itself.
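The difference between a failure and an error can be seen in a small, self-contained sketch. The class and method names below are hypothetical (they are not part of romantest1.py), and the tests are run programmatically rather than through unittest.main() so that the two counts can be inspected directly.

```python
import unittest


def run_demo():
    """Run one failing and one erroring test; return (failures, errors)."""

    class FailureVersusError(unittest.TestCase):
        def test_failure(self):
            """a false assertion is reported as a failure"""
            # Same situation as the stubbed-out to_roman(): None != "I"
            self.assertEqual(None, "I")

        def test_error(self):
            """any other exception is reported as an error"""
            raise RuntimeError("something unexpected happened")

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FailureVersusError)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.failures), len(result.errors)


if __name__ == "__main__":
    print(run_demo())
```

Run this way, the result object should record one failure (from the assertEqual call) and one error (from the RuntimeError).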

    Now, finally, you can write the to_roman() function.

    [download roman1.py] @@ -186,10 +186,10 @@ def to_roman(n): n -= integer return result

    1. -
    2. roman_numeral_map is a tuple of tuples which defines three things: the character representations of the most basic Roman numerals; the order of the Roman numerals (in descending value order, from M all the way down to I); the value of each Roman numeral. Each inner tuple is a pair of (numeral, value). It's not just single-character Roman numerals; it also defines two-character pairs like CM (“one hundred less than one thousand”). This makes the to_roman() function code simpler. -
    3. Here's where the rich data structure of roman_numeral_map pays off, because you don't need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through roman_numeral_map looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat. +
    4. roman_numeral_map is a tuple of tuples which defines three things: the character representations of the most basic Roman numerals; the order of the Roman numerals (in descending value order, from M all the way down to I); the value of each Roman numeral. Each inner tuple is a pair of (numeral, value). It’s not just single-character Roman numerals; it also defines two-character pairs like CM (“one hundred less than one thousand”). This makes the to_roman() function code simpler. +
    5. Here’s where the rich data structure of roman_numeral_map pays off, because you don’t need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through roman_numeral_map looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
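The two notes above can be put together in a short sketch. This is reconstructed from the description and the visible fragments of the loop, so treat it as an approximation; the downloadable roman1.py is the authoritative version.

```python
# (numeral, value) pairs in descending value order, including the
# two-character pairs that encode the subtraction rule.
roman_numeral_map = (('M', 1000),
                     ('CM', 900),
                     ('D', 500),
                     ('CD', 400),
                     ('C', 100),
                     ('XC', 90),
                     ('L', 50),
                     ('XL', 40),
                     ('X', 10),
                     ('IX', 9),
                     ('V', 5),
                     ('IV', 4),
                     ('I', 1))


def to_roman(n):
    """convert integer to Roman numeral (no input validation yet)"""
    result = ''
    for numeral, integer in roman_numeral_map:
        # Take as many copies of this numeral as still fit into n
        while n >= integer:
            result += numeral
            n -= integer
    return result


if __name__ == "__main__":
    for value in (1, 1994, 3888, 3999):
        print(value, to_roman(value))
```

Because the map is ordered from largest to smallest, the first entry that fits is always the largest one that fits, which is why no special case for subtraction is needed.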
    -

    If you're still not clear how the to_roman() function works, add a print() call to the end of the while loop: +

    If you’re still not clear how the to_roman() function works, add a print() call to the end of the while loop:

    
     while n >= integer:
         result += numeral
    @@ -215,7 +215,7 @@ Ran 1 test in 0.016s
     
     OK
      -
    1. Hooray! The to_roman() function passes the “known values” test case. It's not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it. +
    2. Hooray! The to_roman() function passes the “known values” test case. It’s not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.

    “Good” input? Hmm. What about bad input?

    “Halt And Catch Fire”

    @@ -230,9 +230,9 @@ OK >>> roman1.to_roman(9000) 'MMMMMMMMM'
      -
    1. That's definitely not what you wanted — that's not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is baaaaaaad; if a program is going to fail, it is far better that it fail quickly and noisily. “Halt and catch fire,” as the saying goes. The Pythonic way to halt and catch fire is to raise an exception. +
    2. That’s definitely not what you wanted — that’s not even a valid Roman numeral! In fact, each of these numbers is outside the range of acceptable input, but the function returns a bogus value anyway. Silently returning bad values is baaaaaaad; if a program is going to fail, it is far better that it fail quickly and noisily. “Halt and catch fire,” as the saying goes. The Pythonic way to halt and catch fire is to raise an exception.
    -

    The question to ask yourself is, “How can I express this as a testable requirement?” How's this for starters: +

    The question to ask yourself is, “How can I express this as a testable requirement?” How’s this for starters:

    The to_roman() function should raise an OutOfRangeError when given an integer greater than 3999.

    @@ -244,12 +244,12 @@ OK """to_roman should fail with large input""" self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)
      -
    1. Like the previous test case, you create a class that inherits from unittest.TestCase. You can have more than one test per class (as you'll see later in this chapter), but I chose to create a new class here because this test is something different than the last one. We'll keep all the good input tests together in one class, and all the bad input tests together in another. +
    2. Like the previous test case, you create a class that inherits from unittest.TestCase. You can have more than one test per class (as you’ll see later in this chapter), but I chose to create a new class here because this test is something different than the last one. We’ll keep all the good input tests together in one class, and all the bad input tests together in another.
    3. Like the previous test case, the test itself is a method of the class, with a name starting with test. -
    4. The unittest.TestCase class provides the assertRaises method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments you're passing to that function. (If the function you're testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you're testing.) +
    5. The unittest.TestCase class provides the assertRaises method, which takes the following arguments: the exception you’re expecting, the function you’re testing, and the arguments you’re passing to that function. (If the function you’re testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you’re testing.)
    -

    Pay close attention to this last line of code. Instead of calling to_roman() directly and manually checking that it raises a particular exception (by wrapping it in a try...except block [FIXME xref]), the assertRaises method has encapsulated all of that for us. All you do is tell it what exception you're expecting (roman2.OutOfRangeError), the function (to_roman()), and the function's arguments (4000). The assertRaises method takes care of calling to_roman() and checking that it raises roman2.OutOfRangeError. -

    Also note that you're passing the to_roman() function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned recently how handy it is that everything in Python is an object? +

    Pay close attention to this last line of code. Instead of calling to_roman() directly and manually checking that it raises a particular exception (by wrapping it in a try...except block [FIXME xref]), the assertRaises method has encapsulated all of that for us. All you do is tell it what exception you’re expecting (roman2.OutOfRangeError), the function (to_roman()), and the function’s arguments (4000). The assertRaises method takes care of calling to_roman() and checking that it raises roman2.OutOfRangeError. +

    Also note that you’re passing the to_roman() function itself as an argument; you’re not calling it, and you’re not passing the name of it as a string. Have I mentioned recently how handy it is that everything in Python is an object?
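To make the comparison concrete, here is a sketch that does the check both ways. OutOfRangeError and to_roman() below are minimal stand-ins defined inside the example (assumptions, not the real roman2 module), and raises_by_hand() is a hypothetical helper showing the try...except logic that assertRaises wraps up for you.

```python
import unittest


class OutOfRangeError(ValueError):
    """Stand-in for roman2.OutOfRangeError."""


def to_roman(n):
    """Stand-in for roman2.to_roman(); only the range check is sketched."""
    if n > 3999:
        raise OutOfRangeError("number out of range (must be less than 4000)")
    return "(numeral)"  # placeholder, not a real conversion


def raises_by_hand(func, *args):
    """The manual version: call func(*args) and report whether it raised."""
    try:
        func(*args)
    except OutOfRangeError:
        return True
    return False


def check_with_assert_raises():
    """The unittest version: pass the function object, then its arguments."""
    tc = unittest.TestCase()
    tc.assertRaises(OutOfRangeError, to_roman, 4000)
    # assertRaises can also be used as a context manager
    with tc.assertRaises(OutOfRangeError):
        to_roman(4000)
    return True


if __name__ == "__main__":
    print(raises_by_hand(to_roman, 4000), check_with_assert_raises())
```

In both forms the function is handed over uncalled (to_roman, not to_roman(4000)), so the framework decides when to call it and can catch whatever it raises.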

    So what happens when you run the test suite with this new test?

     you@localhost:~$ python3 romantest2.py -v
    @@ -269,15 +269,15 @@ Ran 2 tests in 0.000s
     
     FAILED (errors=1)
      -
    1. You should have expected this to fail (since you haven't written any code to pass it yet), but... it didn't actually “fail,” it had an “error” instead. This is a subtle but important distinction. A unit test actually has three return values: pass, fail, and error. Pass, of course, means that the test passed — the code did what you expected. “Fail” is what the previous test case did (until you wrote code to make it pass) — it executed the code but the result was not what you expected. “Error” means that the code didn't even execute properly. -
    2. Why didn't the code execute properly? The traceback gives the answer: the module you're testing doesn't have an exception called OutOfRangeError. Remember, you passed this exception to the assertRaises() method, because it's the exception you want the function to raise given an out-of-range input. But the exception doesn't exist, so the call to the assertRaises() method failed. It never got a chance to test the to_roman() function; it didn't get that far. +
    3. You should have expected this to fail (since you haven’t written any code to pass it yet), but... it didn’t actually “fail,” it had an “error” instead. This is a subtle but important distinction. A unit test actually has three return values: pass, fail, and error. Pass, of course, means that the test passed — the code did what you expected. “Fail” is what the previous test case did (until you wrote code to make it pass) — it executed the code but the result was not what you expected. “Error” means that the code didn’t even execute properly. +
    4. Why didn’t the code execute properly? The traceback gives the answer: the module you’re testing doesn’t have an exception called OutOfRangeError. Remember, you passed this exception to the assertRaises() method, because it’s the exception you want the function to raise given an out-of-range input. But the exception doesn’t exist, so the call to the assertRaises() method failed. It never got a chance to test the to_roman() function; it didn’t get that far.

    To solve this problem, you need to define the OutOfRangeError exception in roman2.py.

    class OutOfRangeError(ValueError):  
         pass                            
    1. Exceptions are classes. An “out of range” error is a kind of value error — the argument value is out of its acceptable range. So this exception inherits from the built-in ValueError exception. This is not strictly necessary (it could just inherit from the base Exception class), but it feels right. -
    2. Exceptions don't actually do anything, but you need at least one line of code to make a class. Calling pass does precisely nothing, but it's a line of Python code, so that makes it a class. +
    3. Exceptions don’t actually do anything, but you need at least one line of code to make a class. The pass statement does precisely nothing, but it’s a line of Python code, so that makes it a class.
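One practical consequence of inheriting from ValueError can be sketched as follows: code that already catches ValueError will also catch the new exception. The helper function name here is hypothetical and exists only for the demonstration.

```python
class OutOfRangeError(ValueError):
    pass


def catch_as_value_error():
    """Raise OutOfRangeError and catch it through its parent class."""
    try:
        raise OutOfRangeError("number out of range")
    except ValueError as e:
        # The handler names ValueError, but the object is still
        # an instance of the more specific subclass.
        return type(e).__name__


if __name__ == "__main__":
    print(issubclass(OutOfRangeError, ValueError))
    print(catch_as_value_error())
```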

    Now run the test suite again.

    @@ -298,8 +298,8 @@ Ran 2 tests in 0.016s
     
     FAILED (failures=1)
      -
    1. The new test is still not passing, but it's not returning an error either. Instead, the test is failing. That's progress! It means the call to the assertRaises() method succeeded this time, and the unit test framework actually tested the to_roman() function. -
    2. Of course, the to_roman() function isn't raising the OutOfRangeError exception you just defined, because you haven't told it to do that yet. That's excellent news! It means this is a valid test case — it fails before you write the code to make it pass. +
    3. The new test is still not passing, but it’s not returning an error either. Instead, the test is failing. That’s progress! It means the call to the assertRaises() method succeeded this time, and the unit test framework actually tested the to_roman() function. +
    4. Of course, the to_roman() function isn’t raising the OutOfRangeError exception you just defined, because you haven’t told it to do that yet. That’s excellent news! It means this is a valid test case — it fails before you write the code to make it pass.

    Now you can write the code to make this test pass.

    [download roman2.py] @@ -315,9 +315,9 @@ FAILED (failures=1) n -= integer return result

      -
    1. This is straightforward: if the given input (n) is greater than 3999, raise an OutOfRangeError exception. The unit test does not check the human-readable string that accompanies the exception, although you could write another test that did check it (but watch out for internationalization issues for strings that vary by the user's language or environment). +
    2. This is straightforward: if the given input (n) is greater than 3999, raise an OutOfRangeError exception. The unit test does not check the human-readable string that accompanies the exception, although you could write another test that did check it (but watch out for internationalization issues for strings that vary by the user’s language or environment).
    -

    Does this make the test pass? Let's find out. +

    Does this make the test pass? Let’s find out.

     you@localhost:~$ python3 romantest2.py -v
     to_roman should give known result with known input ... ok
    @@ -328,7 +328,7 @@ Ran 2 tests in 0.000s
     
     OK
      -
    1. Hooray! Both tests pass. Because you worked iteratively, bouncing back and forth between testing and coding, you can be sure that the two lines of code you just wrote were the cause of that one test going from “fail” to “pass.” That kind of confidence doesn't come cheap, but it will pay for itself over the lifetime of your code. +
    2. Hooray! Both tests pass. Because you worked iteratively, bouncing back and forth between testing and coding, you can be sure that the two lines of code you just wrote were the cause of that one test going from “fail” to “pass.” That kind of confidence doesn’t come cheap, but it will pay for itself over the lifetime of your code.

    More Halting, More Fire

    @@ -342,7 +342,7 @@ OK >>> roman2.to_roman(-1) '' -

    Well that's not good. Let's add tests for each of these conditions. +

    Well that’s not good. Let’s add tests for each of these conditions.

    [download romantest3.py]

    
    @@ -359,8 +359,8 @@ class ToRomanBadInput(unittest.TestCase):
             """to_roman should fail with negative input"""
             self.assertRaises(roman3.OutOfRangeError, roman3.to_roman, -1)    
      -
    1. The test_too_large() method has not changed since the previous step. I'm including it here to show where the new code fits. -
    2. Here's a new test: the test_zero() method. Like the test_too_large() method, it tells the assertRaises() method defined in unittest.TestCase to call our to_roman() function with a parameter of 0, and check that it raises the appropriate exception, OutOfRangeError. +
    3. The test_too_large() method has not changed since the previous step. I’m including it here to show where the new code fits. +
    4. Here’s a new test: the test_zero() method. Like the test_too_large() method, it tells the assertRaises() method defined in unittest.TestCase to call our to_roman() function with a parameter of 0, and check that it raises the appropriate exception, OutOfRangeError.
    5. The test_negative() method is almost identical, except it passes -1 to the to_roman() function. If either of these new tests does not raise an OutOfRangeError (either because the function returns an actual value, or because it raises some other exception), the test is considered failed.
    @@ -394,7 +394,7 @@ Ran 4 tests in 0.000s FAILED (failures=2) -

    Excellent. Both tests failed, as expected. Now let's switch over to the code and see what we can do to make them pass. +

    Excellent. Both tests failed, as expected. Now let’s switch over to the code and see what we can do to make them pass.

    [download roman3.py]

    def to_roman(n):
    @@ -409,11 +409,11 @@ FAILED (failures=2)
    n -= integer return result
      -
    1. This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to if not ((0 < n) and (n < 4000)), but it's much easier to read. This one line of code should catch inputs that are too large, negative, or zero. -
    2. If you change your conditions, make sure to update your human-readable error strings to match. The unittest framework won't care, but it'll make it difficult to do manual debugging if your code is throwing incorrectly-described exceptions. +
    3. This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to if not ((0 < n) and (n < 4000)), but it’s much easier to read. This one line of code should catch inputs that are too large, negative, or zero. +
    4. If you change your conditions, make sure to update your human-readable error strings to match. The unittest framework won’t care, but it’ll make it difficult to do manual debugging if your code is throwing incorrectly-described exceptions.
    -

    I could show you a whole series of unrelated examples to show that the multiple-comparisons-at-once shortcut works, but instead I'll just run the unit tests and prove it. +

    I could show you a whole series of unrelated examples to show that the multiple-comparisons-at-once shortcut works, but instead I’ll just run the unit tests and prove it.

     you@localhost:~$ python3 romantest3.py -v
    @@ -438,8 +438,8 @@ OK
    >>> roman3.to_roman(1.5) 'I'
      -
    1. Oh, that's bad. -
    2. Oh, that's even worse. Both of these cases should raise an exception. Instead, they give bogus results. +
    3. Oh, that’s bad. +
    4. Oh, that’s even worse. Both of these cases should raise an exception. Instead, they give bogus results.

    Testing for non-integers is not difficult. First, define a NonIntegerError exception. diff --git a/whats-new.html b/whats-new.html new file mode 100644 index 0000000..897b7fc --- /dev/null +++ b/whats-new.html @@ -0,0 +1,44 @@ + + + +What's New In "Dive into Python 3" + + + + + +

      
    +

    You are here: Home Dive Into Python 3 +

    Difficulty level: ♦♦♦♦♢ +

    What’s New In “Dive Into Python 3”

    +
    +

    Isn’t this where we came in?
    — Pink Floyd, The Wall +

    +

      +

    a.k.a. “the minus level”

    + +

    a.k.a. I don’t want to read any more of this damn book than I absolutely have to

    + +

    You read the original “Dive Into Python” and maybe even bought it on paper. (Thanks!) You already know Python 2 pretty well. You’re ready to take the plunge into Python 3. … If all of that is true, read on. (If none of that is true, you’d be better off starting at the beginning.) + +

    Python 3 comes with a script called 2to3. Learn it. Love it. Use it. Porting Code to Python 3 with 2to3 is a reference of all the things that the 2to3 tool can fix automatically. Since a lot of those things are syntax changes, it’s a good starting point to learn about a lot of the syntax changes in Python 3. (print is now a function, `x` doesn’t work, &c.) + +
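As a quick illustration of the two syntax changes mentioned in passing, here is a small sketch. The function name is hypothetical, and the output is written to an in-memory buffer only so that it can be inspected.

```python
import io


def demo_print_and_repr():
    """Show print() as a function and repr() in place of backticks."""
    buffer = io.StringIO()
    # print is an ordinary function, so separators, line endings and the
    # destination are passed as keyword arguments.
    print("spam", "eggs", sep="-", end="!\n", file=buffer)
    # Python 2's `x` was shorthand for repr(x); in Python 3 only repr() remains.
    return buffer.getvalue(), repr("x")


if __name__ == "__main__":
    print(demo_print_and_repr())
```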

    Case Study: Porting chardet to Python 3 documents my (ultimately successful) effort to port a non-trivial library from Python 2 to Python 3. It may help you; it may not. There’s a fairly steep learning curve, since you need to kind of understand the library first, so you can understand why it broke and how I fixed it. A lot of the breakage centers around strings. Speaking of which… + +

    Strings. Whew. Where to start. Python 2 had “strings” and “Unicode strings.” Python 3 has “bytes” and “strings.” That is, all strings are now Unicode strings, and if you want to deal with a bag of bytes, you use the new bytes type. Oh, and Python 3 will never implicitly convert between strings and bytes, so if you’re not sure which one you have, your code will almost certainly break. Read the Strings chapter for more details. + +
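A minimal sketch of what “never implicitly convert” means in practice. The function name and the sample text are made up for the example; encode() and decode() are the explicit conversions between the two types.

```python
def demo_str_and_bytes():
    """Show that str and bytes do not mix, and how to convert explicitly."""
    text = "café"                  # a string: 4 Unicode characters
    data = text.encode("utf-8")    # bytes: 5 bytes, since é takes two in UTF-8

    try:
        text + data                # mixing the two types is an error
        mixed_ok = True
    except TypeError:
        mixed_ok = False

    round_trip = data.decode("utf-8")
    return len(text), len(data), mixed_ok, round_trip == text


if __name__ == "__main__":
    print(demo_str_and_bytes())
```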

    Even if you don’t care about Unicode, you’ll want to read about string formatting in Python 3, which is completely different from Python 2. + +

    Iterators are everywhere in Python 3, and I understand them a lot better than I did five years ago when I wrote “Dive Into Python”. You need to understand them too, because lots of functions that used to return lists in Python 2 will now return iterators in Python 3. At a minimum, you should read the second half of the Iterators chapter and the second half of the Advanced Iterators chapter. + +
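Here is a short sketch of why the change from lists to iterators matters. It uses zip() and map(), two built-ins that returned lists in Python 2; the function name is hypothetical.

```python
def demo_iterators():
    """Show that zip() and map() return one-shot iterators, not lists."""
    pairs = zip("abc", [1, 2, 3])
    is_list = isinstance(pairs, list)

    first_pass = list(pairs)     # consumes the iterator
    second_pass = list(pairs)    # nothing is left the second time

    # map() is lazy too; wrap it in list() when a real list is needed
    squares = list(map(lambda x: x * x, range(4)))
    return is_list, first_pass, second_pass, squares


if __name__ == "__main__":
    print(demo_iterators())
```

Code that indexes the result or loops over it twice is the kind that tends to break when ported, and wrapping the call in list() is the usual fix.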

    By popular request, I’ve added an appendix on Special Method Names, which is kind of like the Python docs “Data Model” chapter but with more snark. + +

    That’s it for now; the book’s not finished yet! The file I/O subsystem is totally different now; I hope to write about that soon. There are much better choices for XML processing now; I hope to write about that, too. + + +

    © 2001–9 Mark Pilgrim + +
    diff --git a/your-first-python-program.html b/your-first-python-program.html
    index 8fa5140..671e529 100644
    --- a/your-first-python-program.html
    +++ b/your-first-python-program.html
    @@ -20,7 +20,7 @@ th{text-align:left}

     

    Diving In

    -

    Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it. +

    Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let’s skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don’t worry about that, because you’re going to dissect it line by line. But read through it first and see what, if anything, you can make of it.

    [download humansize.py]

    SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
                 1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
    @@ -50,7 +50,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
     if __name__ == "__main__":
         print(approximate_size(1000000000000, False))
         print(approximate_size(1000000000000))
    -

    Now let's run this program on the command line. On Windows, it will look something like this: +

    Now let’s run this program on the command line. On Windows, it will look something like this:

     c:\home\diveintopython3> c:\python30\python.exe humansize.py
     1.0 TB
    @@ -66,15 +66,15 @@ if __name__ == "__main__":
     
    def approximate_size(size, a_kilobyte_is_1024_bytes=True):

    The keyword def starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments are separated with commas. -

    Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a return statement, it will return that value, otherwise it will return None, the Python null value.) +

    Also note that the function doesn’t define a return datatype. Python functions do not specify the datatype of their return value; they don’t even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a return statement, it will return that value, otherwise it will return None, the Python null value.)
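To see the "every function returns a value" rule in action, here is a minimal sketch. The two functions are hypothetical and are not part of humansize.py.

```python
def no_return():
    # No return statement at all.
    pass

def maybe_return(flag):
    # Returns a string only when flag is true.
    if flag:
        return "value"

print(no_return())           # None
print(maybe_return(True))    # value
print(maybe_return(False))   # None
```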

    -

    In some languages, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's None), and all functions start with def. +

    In some languages, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it’s None), and all functions start with def.

    -

    The approximate_size function takes the two arguments — size and a_kilobyte_is_1024_bytes — but neither argument specifies a datatype. (As you might guess from the =True syntax, the second argument is a boolean. You'll learn what that syntax does in [FIXME xref-was-#apihelper].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally. +

    The approximate_size function takes the two arguments — size and a_kilobyte_is_1024_bytes — but neither argument specifies a datatype. (As you might guess from the =True syntax, the second argument is a boolean. You’ll learn what that syntax does in [FIXME xref-was-#apihelper].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.

    In Java and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
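A small sketch of that internal bookkeeping, using the built-in `type()` function. The variable name is arbitrary.

```python
value = 42                            # no declaration; Python records that this is an int
first_type = type(value).__name__

value = "forty-two"                   # the same name can later refer to a str
second_type = type(value).__name__

print(first_type)    # int
print(second_type)   # str
```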

    -

    How Python's Datatypes Compare to Other Programming Languages

    +

    How Python’s Datatypes Compare to Other Programming Languages

    An erudite reader sent me this explanation of how Python compares to other programming languages:

    statically typed language
    @@ -84,13 +84,13 @@ if __name__ == "__main__":
    A language in which types are discovered at execution time; the opposite of statically typed. JavaScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
    strongly typed language
    -
    A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it. +
    A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can’t treat it like a string without explicitly converting it.
    weakly typed language
    A language in which types are “automagically” coerced to other types as needed; the opposite of strongly typed. PHP is weakly typed. In PHP, you can concatenate the string '12' and the integer 3 to get the string '123', then treat that as the integer 123, all without any explicit conversion.
    -

    So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters). +

    So Python is both dynamically typed (because it doesn’t use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters).
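Here is roughly what "it actually matters" means in practice, reusing the '12' and 3 from the PHP example above. The variable names are made up for this sketch.

```python
count = 3

try:
    result = "12" + count          # Python will not coerce the integer for you
except TypeError:
    result = None

explicit = "12" + str(count)       # an explicit conversion is fine
as_number = int(explicit)          # and so is converting back

print(result)      # None
print(explicit)    # 123
print(as_number)   # 123
```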

    If you have experience in other programming languages, this table may help you visualize how Python compares to them:
                      Statically typed    Dynamically typed
    @@ -98,7 +98,7 @@ if __name__ == "__main__":
    Strongly typed    Pascal, Java        Python, Ruby

    Writing Readable Code

    -

    I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months. +

    I won’t bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you’ve forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You’ll thank me in six months.

    Documentation Strings

    You can document a Python function by giving it a documentation string (docstring for short). In this program, the approximate_size function has a docstring:

    def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    @@ -113,13 +113,13 @@ if __name__ == "__main__":
     
         """
    -

    Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a docstring. +

    Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you’ll see them most often used when defining a docstring.

    Triple quotes are also an easy way to define a string with both single and double quotes, like qq/.../ in Perl 5.
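For example, here is a short sketch of a triple-quoted string that spans two lines and contains both kinds of quote characters. The text itself is invented for illustration.

```python
message = """She said, "It's done."
Then she left."""

lines = message.splitlines()

print(len(lines))   # 2
print(lines[0])     # She said, "It's done."
```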

    -

    Everything between the triple quotes is the function's docstring, which documents what the function does. A docstring, if it exists, must be the first thing defined in a function (that is, on the next line after the function declaration). You don't technically need to give your function a docstring, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the docstring is available at runtime as an attribute of the function. +

    Everything between the triple quotes is the function’s docstring, which documents what the function does. A docstring, if it exists, must be the first thing defined in a function (that is, on the next line after the function declaration). You don’t technically need to give your function a docstring, but you always should. I know you’ve heard this in every programming class you’ve ever taken, but Python gives you an added incentive: the docstring is available at runtime as an attribute of the function.
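A minimal sketch of reading a docstring at runtime through the function's `__doc__` attribute. The function here is a hypothetical stand-in, not the real approximate_size.

```python
def describe_size(size):
    """Return the size as a string (illustrative stub)."""
    return str(size)

doc = describe_size.__doc__

print(doc)   # Return the size as a string (illustrative stub).
```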

    -

    Many Python IDEs use the docstring to provide context-sensitive documentation, so that when you type a function name, its docstring appears as a tooltip. This can be incredibly helpful, but it's only as good as the docstrings you write. +

    Many Python IDEs use the docstring to provide context-sensitive documentation, so that when you type a function name, its docstring appears as a tooltip. This can be incredibly helpful, but it’s only as good as the docstrings you write.