diff --git a/advanced-iterators.html b/advanced-iterators.html index 2427f4a..17a4770 100755 --- a/advanced-iterators.html +++ b/advanced-iterators.html @@ -150,7 +150,7 @@ if __name__ == '__main__':

The alphametics solver uses this technique to get a list of all the unique characters in the puzzle. -

unique_characters = {c for c in ''.join(words)}
+
unique_characters = {c for c in ''.join(words)}

This list is later used to assign digits to characters as the solver iterates through the possible solutions. @@ -173,11 +173,11 @@ AssertionError

Therefore, this line of code: -

assert len(unique_characters) <= 10
+
assert len(unique_characters) <= 10

…is equivalent to… -

if len(unique_characters) > 10:
+
if len(unique_characters) > 10:
     raise AssertionError

But a bit easier to read and write. @@ -210,7 +210,7 @@ AssertionError

Here’s another way to accomplish the same thing, using a generator function: -

def ord_map(a_string):
+
def ord_map(a_string):
     for c in a_string:
         yield ord(c)
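
To convince yourself that the generator function and a generator expression really do the same job, you can compare their output. This is only a small sketch; the input string `'ABC'` and the variable names are chosen for illustration.

```python
def ord_map(a_string):
    # Generator function: produces one code point at a time, on demand.
    for c in a_string:
        yield ord(c)

# The generator expression that does the same work inline.
gen_expr = (ord(c) for c in 'ABC')

from_function = list(ord_map('ABC'))
from_expression = list(gen_expr)

print(from_function)                     # [65, 66, 67]
print(from_function == from_expression)  # True
```

Both forms are lazy: nothing is computed until something (here, `list()`) asks for the values.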
 
@@ -398,7 +398,7 @@ Wesley

The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution. -

characters = tuple(ord(c) for c in sorted_characters)
+
characters = tuple(ord(c) for c in sorted_characters)
 digits = tuple(ord(c) for c in '0123456789')
 ...
 for guess in itertools.permutations(digits, len(characters)):
@@ -454,7 +454,7 @@ for guess in itertools.permutations(digits, len(characters)):
 
 

This is the final piece of the puzzle (or rather, the final piece of the puzzle solver). After all that fancy string manipulation, we’re left with a string like '9567 + 1085 == 10652'. But that’s a string, and what good is a string? Enter eval(), the universal Python evaluation tool. -

+
 >>> eval('1 + 1 == 2')
 True
 >>> eval('1 + 1 == 3')
@@ -464,7 +464,7 @@ for guess in itertools.permutations(digits, len(characters)):
 
 

But wait, there’s more! The eval() function isn’t limited to boolean expressions. It can handle any Python expression and returns any datatype. -

+
 >>> eval('"A" + "B"')
 'AB'
 >>> eval('"MARK".translate({65: 79})')
diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
old mode 100644
new mode 100755
index 561ceb8..33cd361
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -545,7 +545,7 @@ RefactoringTool: chardet\sjisprober.py
 RefactoringTool: chardet\universaldetector.py
 RefactoringTool: chardet\utf8prober.py

Now run the 2to3 script on the testing harness, test.py. -

C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py
+
C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py
 RefactoringTool: Skipping implicit fixer: buffer
 RefactoringTool: Skipping implicit fixer: idioms
 RefactoringTool: Skipping implicit fixer: set_literal
@@ -583,7 +583,7 @@ RefactoringTool: test.py

False is invalid syntax

Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere. -

C:\home\chardet> python test.py tests\*\*
+
C:\home\chardet> python test.py tests\*\*
 Traceback (most recent call last):
   File "test.py", line 1, in <module>
     from chardet.universaldetector import UniversalDetector
@@ -592,7 +592,7 @@ RefactoringTool: test.py
^ SyntaxError: invalid syntax

Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it: -

import __builtin__
+
import __builtin__
 if not hasattr(__builtin__, 'False'):
     False = 0
     True = 1
@@ -602,13 +602,13 @@ else:
 

This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in bool type. This code detects the absence of the built-in constants True and False, and defines them if necessary.

However, Python 3 will always have a bool type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py.

So this line in universaldetector.py: -

self.done = constants.False
+
self.done = constants.False

Becomes -

self.done = False
+
self.done = False

Ah, wasn’t that satisfying? The code is shorter and more readable already.

No module named constants

Time to run test.py again and see how far it gets. -

C:\home\chardet> python test.py tests\*\*
+
C:\home\chardet> python test.py tests\*\*
 Traceback (most recent call last):
   File "test.py", line 1, in <module>
     from chardet.universaldetector import UniversalDetector
@@ -616,12 +616,12 @@ else:
     import constants, sys
 ImportError: No module named constants

What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead: -

from . import constants
+
from . import constants

But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.

The solution is to split the import statement manually. So this two-in-one import: -

import constants, sys
+
import constants, sys

Needs to become two separate imports: -

from . import constants
+
from . import constants
 import sys

There are variations of this problem scattered throughout the chardet library. In some places it’s “import constants, sys”; in other places, it’s “import constants, re”. The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.

FIXME-xref to as-yet-unwritten PEP 8 style section (which says you should put all imports on their own line) @@ -629,7 +629,7 @@ import sys

Name 'file' is not defined

And here we go again, running test.py to try to execute our test cases… -

C:\home\chardet> python test.py tests\*\*
+
C:\home\chardet> python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
   File "test.py", line 9, in <module>
@@ -637,11 +637,11 @@ import sys
NameError: name 'file' is not defined

This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for the open() function, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.)

Thus, the simplest solution to the problem of the missing file() is to call the open() function instead: -

for line in open(f, 'rb'):
+
for line in open(f, 'rb'):

And that’s all I have to say about that.
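
If you want to check these claims for yourself, here is a minimal sketch you can run under Python 3. The temporary file and its contents are made up for the demonstration; they are not part of chardet or its test suite.

```python
import builtins
import io
import os
import tempfile

# In Python 3 the built-in open() is the same object as io.open(),
# and there is no longer a built-in named file.
print(open is io.open)            # True
print(hasattr(builtins, 'file'))  # False

# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile('wb', suffix='.xml', delete=False) as tmp:
    tmp.write(b'<feed>\n</feed>\n')
    path = tmp.name

try:
    # Same idiom as the fixed line in test.py: iterate over a file
    # opened in binary mode.
    with open(path, 'rb') as f:
        lines = [line for line in f]
finally:
    os.remove(path)

print(lines)  # [b'<feed>\n', b'</feed>\n']
```

Notice that each line comes back as a `bytes` object, not a string. That detail matters a great deal in the next section.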

Can’t use a string pattern on a bytes-like object

Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.” -

C:\home\chardet> python test.py tests\*\*
+
C:\home\chardet> python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
   File "test.py", line 10, in <module>
@@ -650,20 +650,20 @@ NameError: name 'file' is not defined
if self._highBitDetector.search(aBuf): TypeError: can't use a string pattern on a bytes-like object

To debug this, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the UniversalDetector class: -

class UniversalDetector:
+
class UniversalDetector:
     def __init__(self):
         self._highBitDetector = re.compile(r'[\x80-\xFF]')

This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128–255.

And therein lies the problem.

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py: -

def feed(self, aBuf):
+
def feed(self, aBuf):
     .
     .
     .
     if self._mInputState == ePureAscii:
         if self._highBitDetector.search(aBuf):

And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py. -

u = UniversalDetector()
+
u = UniversalDetector()
 .
 .
 .
@@ -673,7 +673,7 @@ for line in open(f, 'rb'):
 

And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to the open() function, but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.

What we need this regular expression to search is not an array of characters, but an array of bytes.

Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.) -

  class UniversalDetector:
+
  class UniversalDetector:
       def __init__(self):
 -         self._highBitDetector = re.compile(r'[\x80-\xFF]')
 -         self._escDetector = re.compile(r'(\033|~{)')
@@ -683,7 +683,7 @@ for line in open(f, 'rb'):
           self._mCharSetProbers = []
           self.reset()
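
Here is the same problem and the same fix in isolation. The sample data is invented for illustration; the two patterns are the string and byte array versions of the high-bit detector discussed above.

```python
import re

# A line as it would come from a file opened with 'rb': an array of bytes.
data = b'caf\xc3\xa9\n'

str_pattern = re.compile(r'[\x80-\xFF]')    # defined with a string
bytes_pattern = re.compile(b'[\x80-\xFF]')  # defined with a byte array

try:
    str_pattern.search(data)
    error = None
except TypeError as e:
    # A string pattern refuses to search a byte array.
    error = e

match = bytes_pattern.search(data)

print(type(error).__name__)                  # TypeError
print(match.group())                         # b'\xc3'
print(bytes_pattern.search(b'plain ascii'))  # None
```

The byte array pattern finds the first high-bit byte, and it correctly finds nothing in pure ASCII input.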

Searching the entire codebase for other uses of the re module turns up two more instances, in charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays. -

  class CharSetProber:
+
  class CharSetProber:
       .
       .
       .
@@ -699,7 +699,7 @@ for line in open(f, 'rb'):
         
 

Can't convert 'bytes' object to str implicitly

Curiouser and curiouser… -

C:\home\chardet> python test.py tests\*\*
+
C:\home\chardet> python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
   File "test.py", line 10, in <module>
@@ -708,10 +708,10 @@ for line in open(f, 'rb'):
     elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
 TypeError: Can't convert 'bytes' object to str implicitly

There’s an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn’t tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this: -

elif (self._mInputState == ePureAscii) and \
+
elif (self._mInputState == ePureAscii) and \
     self._escDetector.search(self._mLastChar + aBuf):

And re-run the test: -

C:\home\chardet> python test.py tests\*\*
+
C:\home\chardet> python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
   File "test.py", line 10, in <module>
@@ -721,7 +721,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
TypeError: Can't convert 'bytes' object to str implicitly

Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you’re thinking that the search() method is expecting a value of a different type, but that wouldn’t generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it’s expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it’s trying to construct the value that it will eventually pass to the search() method.

We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It’s an instance variable, defined in the reset() method, which is actually called from the __init__() method. -

class UniversalDetector:
+
class UniversalDetector:
     def __init__(self):
         self._highBitDetector = re.compile(b'[\x80-\xFF]')
         self._escDetector = re.compile(b'(\033|~{)')
@@ -738,7 +738,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mLastChar = ''

And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can’t concatenate a string to a byte array — not even a zero-length string.
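
You can reproduce this outside of chardet in a few lines. The names `a_buf` and `last_char` here are stand-ins for `aBuf` and `self._mLastChar`, and the bytes themselves are arbitrary.

```python
a_buf = b'\x1b$B'  # arbitrary bytes, standing in for aBuf

try:
    last_char = ''     # a zero-length string
    last_char + a_buf  # string + byte array
    failed = False
except TypeError:
    failed = True

# A zero-length byte array concatenates without complaint.
combined = b'' + a_buf

print(failed)             # True
print(combined == a_buf)  # True
```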

So what is self._mLastChar anyway? The answer is in the feed() method, just a few lines down from where the traceback occurred. -

if self._mInputState == ePureAscii:
+
if self._mInputState == ePureAscii:
     if self._highBitDetector.search(aBuf):
         self._mInputState = eHighbyte
     elif (self._mInputState == ePureAscii) and \
@@ -747,14 +747,14 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mLastChar = aBuf[-1]

The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it’s needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus: -

  def reset(self):
+
  def reset(self):
       .
       .
       .
 -     self._mLastChar = ''
 +     self._mLastChar = b''

Searching the entire codebase for “mLastChar” turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers. -

  class MultiByteCharSetProber(CharSetProber):
+
  class MultiByteCharSetProber(CharSetProber):
       def __init__(self):
           CharSetProber.__init__(self)
           self._mDistributionAnalyzer = None
@@ -772,7 +772,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
+ self._mLastChar = [0, 0]

Unsupported operand type(s) for +: 'int' and 'bytes'

I have good news, and I have bad news. The good news is we’re making progress… -

C:\home\chardet> python test.py tests\*\*
+
C:\home\chardet> python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
   File "test.py", line 10, in <module>
@@ -783,7 +783,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'

…The bad news is it doesn’t always feel like progress.

But this is progress! Really! Even though the traceback calls out the same line of code, it’s a different error than it used to be. Progress! So what’s the problem now? The last time I checked, this line of code didn’t try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int?

The answer lies not in the previous lines of code, but in the following lines. -

if self._mInputState == ePureAscii:
+
if self._mInputState == ePureAscii:
     if self._highBitDetector.search(aBuf):
         self._mInputState = eHighbyte
     elif (self._mInputState == ePureAscii) and \
@@ -820,14 +820,14 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
 
  • Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.

    So, to ensure that the feed() method in universaldetector.py continues to work no matter how often it’s called, you need to initialize self._mLastChar as a 0-length byte array, then make sure it stays a byte array. -

                  self._escDetector.search(self._mLastChar + aBuf):
    +
                  self._escDetector.search(self._mLastChar + aBuf):
               self._mInputState = eEscAscii
     
     - self._mLastChar = aBuf[-1]
     + self._mLastChar = aBuf[-1:]
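
The difference between indexing and slicing a byte array is easy to see on its own. The function `feed_chunks()` below is a hypothetical stand-in for the way `feed()` carries the last byte from one call to the next; it is not chardet code.

```python
a_buf = b'abc'
print(a_buf[-1])   # 99    (indexing gives an integer)
print(a_buf[-1:])  # b'c'  (slicing gives a byte array of length 1)

def feed_chunks(chunks):
    """Prepend the last byte of the previous chunk to each chunk."""
    last = b''  # start with a 0-length byte array
    joined = []
    for chunk in chunks:
        joined.append(last + chunk)  # byte array + byte array
        last = chunk[-1:]            # slice, so it stays a byte array
    return joined

print(feed_chunks([b'ab', b'cd', b'ef']))  # [b'ab', b'bcd', b'def']
```

If the last line of the loop used `chunk[-1]` instead, the second pass would try to add an integer to a byte array and fail with the same TypeError you saw above.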

    ord() expected string of length 1, but int found

    Tired yet? You’re almost there… -

    C:\home\chardet> python test.py tests\*\*
    +
    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -843,25 +843,25 @@ tests\Big5\0804.blogspot.com.xml
         byteCls = self._mModel['classTable'][ord(c)]
     TypeError: ord() expected string of length 1, but int found

    OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c defined? -

    # codingstatemachine.py
    +
    # codingstatemachine.py
     def next_state(self, c):
         # for each byte we get its class
         # if it is first byte, we also get byte length
         byteCls = self._mModel['classTable'][ord(c)]

    That’s no help; it’s just passed into the function. Let’s pop the stack. -

    # utf8prober.py
    +
    # utf8prober.py
     def feed(self, aBuf):
         for c in aBuf:
             codingState = self._mCodingSM.next_state(c)

    And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That’s what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there’s no need to call the ord() function because c is already an int!

    Thus: -

      def next_state(self, c):
    +
      def next_state(self, c):
           # for each byte we get its class
           # if it is first byte, we also get byte length
     -     byteCls = self._mModel['classTable'][ord(c)]
     +     byteCls = self._mModel['classTable'][c]
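
A quick sketch shows the difference between iterating over a string and iterating over a byte array. The lookup table `class_table` is made up for this example; it only imitates the idea of a table indexed by byte value.

```python
text_items = [c for c in 'AB']   # iterating over a string
byte_items = [c for c in b'AB']  # iterating over a byte array

print(text_items)  # ['A', 'B']
print(byte_items)  # [65, 66]

# Hypothetical table indexed by byte value, 256 entries long.
class_table = [0] * 256
class_table[65] = 1

# Each c is already an integer, so it can index the table directly.
classes = [class_table[c] for c in b'AB']
print(classes)     # [1, 0]

try:
    ord(65)        # calling ord() on an integer
    ord_failed = False
except TypeError:
    ord_failed = True
print(ord_failed)  # True
```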

    Searching the entire codebase for instances of “ord(c)” uncovers similar problems in sbcharsetprober.py… -

    # sbcharsetprober.py
    +
    # sbcharsetprober.py
     def feed(self, aBuf):
         if not self._mModel['keepEnglishLetter']:
             aBuf = self.filter_without_english_letters(aBuf)
    @@ -871,13 +871,13 @@ def feed(self, aBuf):
         for c in aBuf:
             order = self._mModel['charToOrderMap'][ord(c)]

    …and latin1prober.py… -

    # latin1prober.py
    +
    # latin1prober.py
     def feed(self, aBuf):
         aBuf = self.filter_with_english_letters(aBuf)
         for c in aBuf:
             charClass = Latin1_CharToClass[ord(c)]

    c is iterating over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c. -

      # sbcharsetprober.py
    +
      # sbcharsetprober.py
       def feed(self, aBuf):
           if not self._mModel['keepEnglishLetter']:
               aBuf = self.filter_without_english_letters(aBuf)
    @@ -897,7 +897,7 @@ def feed(self, aBuf):
     

    Unorderable types: int() >= str()

    Let’s go again. -

    C:\home\chardet> python test.py tests\*\*
    +
    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -916,7 +916,7 @@ tests\Big5\0804.blogspot.com.xml
     TypeError: unorderable types: int() >= str()

    Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You’re making real progress here.

    So what’s this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code: -

    class SJISContextAnalysis(JapaneseContextAnalysis):
    +
    class SJISContextAnalysis(JapaneseContextAnalysis):
         def get_order(self, aStr):
             if not aStr: return -1, 1
             # find out current char's byte length
    @@ -926,7 +926,7 @@ TypeError: unorderable types: int() >= str()
    else: charLen = 1

    And where does aStr come from? Let’s pop the stack: -

    def feed(self, aBuf, aLen):
    +
    def feed(self, aBuf, aLen):
         .
         .
         .
    @@ -936,7 +936,7 @@ TypeError: unorderable types: int() >= str()

    Oh look, it’s our old friend, aBuf. As you might have guessed from every other issue we’ve encountered in this chapter, aBuf is a byte array. Here, the feed() method isn’t just passing it on wholesale; it’s slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array.

    And what is this code trying to do with aStr? It’s taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can’t compare integers and strings for inequality without explicitly coercing one of them.

    In this case, there’s no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you’re comparing to are all constants. Let’s change them from 1-character strings to integers. -

      class SJISContextAnalysis(JapaneseContextAnalysis):
    +
      class SJISContextAnalysis(JapaneseContextAnalysis):
           def get_order(self, aStr):
               if not aStr: return -1, 1
               # find out current char's byte length
    @@ -989,7 +989,7 @@ TypeError: unorderable types: int() >= str()
    return -1, charLen
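
Stripped of the surrounding class, the problem and the fix look like this. The two bytes in `a_str` are arbitrary sample data, and the bounds are the 0x81 and 0x9F constants from the first comparison in the traceback.

```python
a_str = b'\x82\xa0'  # arbitrary sample bytes
first = a_str[0]     # an integer, 130 (0x82)

try:
    first >= '\x81'  # integer compared to a 1-character string
    comparison_failed = False
except TypeError:
    comparison_failed = True

# With integer constants, the comparison works.
in_range = 0x81 <= first <= 0x9F

print(comparison_failed)  # True
print(in_range)           # True
```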

    Searching the entire codebase for occurrences of the ord() function uncovers the same problem in chardistribution.py: -

    C:\home\chardet> python test.py tests\*\*
    +
    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -1007,7 +1007,7 @@ tests\Big5\0804.blogspot.com.xml
         if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
     TypeError: unorderable types: int() >= str()

    The fix is the same: -

      class EUCTWDistributionAnalysis(CharDistributionAnalysis):
    +
      class EUCTWDistributionAnalysis(CharDistributionAnalysis):
           def __init__(self):
               CharDistributionAnalysis.__init__(self)
               self._mCharToFreqOrder = EUCTWCharToFreqOrder
    @@ -1113,7 +1113,7 @@ TypeError: unorderable types: int() >= str()
    return -1

    Global name 'reduce' is not defined

    Once more into the breach… -

    C:\home\chardet> python test.py tests\*\*
    +
    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml
     Traceback (most recent call last):
    @@ -1125,25 +1125,25 @@ tests\Big5\0804.blogspot.com.xml
         total = reduce(operator.add, self._mFreqCounter)
     NameError: global name 'reduce' is not defined

    According to the official What’s New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: “Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.” You can read more about the decision from Guido van Rossum’s weblog: The fate of reduce() in Python 3000. -

    def get_confidence(self):
    +
    def get_confidence(self):
         if self.get_state() == constants.eNotMe:
             return 0.01
       
         total = reduce(operator.add, self._mFreqCounter)

    The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.

    This monstrosity was so common that Python added a global sum() function. -

      def get_confidence(self):
    +
      def get_confidence(self):
           if self.get_state() == constants.eNotMe:
               return 0.01
       
     -     total = reduce(operator.add, self._mFreqCounter)
     +     total = sum(self._mFreqCounter)
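
If you want to see that the two spellings agree, try them side by side. The list `freq_counter` is invented sample data standing in for `self._mFreqCounter`.

```python
import functools
import operator

freq_counter = [3, 0, 5, 2]  # hypothetical frequency counts

total_reduce = functools.reduce(operator.add, freq_counter)
total_sum = sum(freq_counter)

print(total_reduce)  # 10
print(total_sum)     # 10

# One difference worth knowing: sum() of an empty list is 0, while
# reduce() with no initial value raises TypeError on an empty list.
print(sum([]))       # 0
```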

    Since you’re no longer using the operator module, you can remove that import from the top of the file as well. -

      from .charsetprober import CharSetProber
    +
      from .charsetprober import CharSetProber
       from . import constants
     - import operator

    I CAN HAZ TESTZ? -

    C:\home\chardet> python test.py tests\*\*
    +
    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
     tests\Big5\0804.blogspot.com.xml                             Big5 with confidence 0.99
     tests\Big5\blog.worren.net.xml                               Big5 with confidence 0.99
    diff --git a/generators.html b/generators.html
    old mode 100644
    new mode 100755
    index f2b5b10..9e5bb5f
    --- a/generators.html
    +++ b/generators.html
    @@ -174,7 +174,7 @@ def plural(noun):
     
     

    If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following: -

    
    +
    
     def plural(noun):
         if match_sxz(noun):
             return apply_sxz(noun)
    @@ -256,7 +256,7 @@ def build_match_and_apply_functions(pattern, search, replace):
     

    First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it plural4-rules.txt.

    [download plural4-rules.txt] -

    [sxz]$               $    es
    +
    [sxz]$               $    es
     [^aeioudgkprt]h$     $    es
     [^aeiou]y$          y$    ies
     $                    $    s
    @@ -295,7 +295,7 @@ rules = []

    Wouldn’t it be grand to have a generic plural() function that parses the rules file? Get rules, check for a match, apply appropriate transformation, go to next rule. That’s all the plural() function has to do, and that’s all the plural() function should do.

    [download plural5.py] -

    def rules():
    +
    def rules():
         with open('plural5-rules.txt') as pattern_file:
             for line in pattern_file:
                 pattern, search, replace = line.split(None, 3)
    diff --git a/http-web-services.html b/http-web-services.html
    old mode 100644
    new mode 100755
    index ff8f360..06e1b5a
    --- a/http-web-services.html
    +++ b/http-web-services.html
    @@ -58,7 +58,7 @@ mark{display:inline}
     
     

    Here’s a concrete example of how caching works. You visit diveintomark.org in your browser. That page includes a background image, wearehugh.com/m.jpg. When your browser downloads that image, the server includes the following HTTP headers: -

    HTTP/1.1 200 OK
    +
    HTTP/1.1 200 OK
     Date: Sun, 31 May 2009 17:14:04 GMT
     Server: Apache
     Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
    @@ -86,7 +86,7 @@ Content-Type: image/jpeg

    HTTP has a solution to this, too. When you request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was changed. That background image referenced from diveintomark.org included a Last-Modified header. -

    HTTP/1.1 200 OK
    +
    HTTP/1.1 200 OK
     Date: Sun, 31 May 2009 17:14:04 GMT
     Server: Apache
     Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
    @@ -101,7 +101,7 @@ Content-Type: image/jpeg
     
     

    When you request the same data a second (or third or fourth) time, you can send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data hasn’t changed since then, the server sends back a special HTTP 304 status code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the command line, using curl: -

    +
     you@localhost:~$ curl -I -H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT" http://wearehugh.com/m.jpg
     HTTP/1.1 304 Not Modified
     Date: Sun, 31 May 2009 18:04:39 GMT
    @@ -119,7 +119,7 @@ Cache-Control: max-age=31536000, public

    ETags are an alternate way to accomplish the same thing as the last-modified checking. With ETags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from diveintomark.org had an ETag header. -

    HTTP/1.1 200 OK
    +
    HTTP/1.1 200 OK
     Date: Sun, 31 May 2009 17:14:04 GMT
     Server: Apache
     Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
    @@ -136,7 +136,7 @@ The second time you request the same data, you include the ETag hash in an Again with the curl:
     
    -
    +
     you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg  
     HTTP/1.1 304 Not Modified
     Date: Sun, 31 May 2009 18:04:39 GMT
    @@ -176,7 +176,7 @@ Cache-Control: max-age=31536000, public

    How Not To Fetch Data Over HTTP

    Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better. -

    +
     >>> import urllib.request
     >>> data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read()  
     >>> print(data)
    @@ -255,7 +255,7 @@ Content-Type: application/xml
     
     

    But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time. -

    +
     # continued from the previous example
     >>> response2 = urlopen('http://diveintopython3.org/examples/feed.xml')
     send: b'GET /examples/feed.xml HTTP/1.1
    diff --git a/installing-python.html b/installing-python.html
    old mode 100644
    new mode 100755
    index b24ebfd..cb6f6ea
    --- a/installing-python.html
    +++ b/installing-python.html
    @@ -36,7 +36,7 @@ h2,.i>li{clear:both}
     
     

    Once you’re at a command line prompt, just type python3 (all lowercase, no spaces) and see what happens. On my home Linux system, Python 3 is already installed, and this command gets me into the Python interactive shell. -

    +
     mark@atlantis:~$ python3
     Python 3.0.1+ (r301:69556, Apr 15 2009, 17:25:52)
     [GCC 4.3.3] on linux2
    @@ -47,7 +47,7 @@ Type "help", "copyright", "credits" or "license" for more information.
     
     

    My web hosting provider also runs Linux and provides command-line access, but my server does not have Python 3 installed. (Boo!) -

    +
     mark@manganese:~$ python3
     bash: python3: command not found
    @@ -274,7 +274,7 @@ Type "help", "copyright", "credits" or "license" for more information.

    First things first. The Python Shell itself is an amazing interactive playground. Throughout this book, you’ll see examples like this: -

    +
     >>> 1 + 1
     2
    @@ -286,14 +286,14 @@ Type "help", "copyright", "credits" or "license" for more information.

    Let’s try another one. -

    +
     >>> print('Hello world!')
     Hello world!
     

    Pretty simple, no? But there’s lots more you can do in the Python shell. If you ever get stuck — you can’t remember a command, or you can’t remember the proper arguments to pass a certain function — you can get interactive help in the Python Shell. Just type help and press ENTER. -

    +
     >>> help
     Type help() for interactive help, or help(object) for help about object.
    @@ -301,7 +301,7 @@ Type "help", "copyright", "credits" or "license" for more information.

    To enter the interactive help mode, type help() and press ENTER. -

    +
     >>> help()
     Welcome to Python 3.0!  This is the online help utility.
     
    diff --git a/iterators.html b/iterators.html
    index 7fb466a..7488376 100755
    --- a/iterators.html
    +++ b/iterators.html
    @@ -45,7 +45,7 @@ body{counter-reset:h1 6}
     
     

    Let’s take that one line at a time. -

    class Fib:
    +
    class Fib:

    class? What’s a class?
    @@ -143,7 +143,7 @@ body{counter-reset:h1 6}

    Instance variables are specific to one instance of a class. For example, if you create two Fib instances with different maximum values, they will each remember their own values. -

    +
     >>> import fibonacci2
     >>> fib1 = fibonacci2.Fib(100)
     >>> fib2 = fibonacci2.Fib(200)
    @@ -189,7 +189,7 @@ All three of these class methods, __init__, __iter__,
     
     

    Thoroughly confused yet? Excellent. Let’s see how to call this iterator: -

    +
     >>> from fibonacci2 import Fib
     >>> for n in Fib(1000):
     ...     print(n, end=' ')
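
For readers who do not have `fibonacci2.py` at hand, here is a minimal sketch of an iterator class with the three methods named above (`__init__`, `__iter__`, `__next__`). The method bodies are an assumption, since this diff does not show them; they are written to be consistent with the `Fib(100)`, `Fib(200)`, and `Fib(1000)` calls in the examples.

```python
class Fib:
    """Iterator that yields Fibonacci numbers up to a maximum value."""

    def __init__(self, max):
        # Instance variable: each Fib instance remembers its own maximum
        self.max = max

    def __iter__(self):
        # Called once when iteration starts; resets the running state
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        # Called once per loop pass; StopIteration ends the for loop cleanly
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


fib1 = Fib(100)
fib2 = Fib(200)
numbers = list(Fib(1000))
```

Here `fib1.max` and `fib2.max` stay independent of each other, and `numbers` holds every Fibonacci number from 0 through 987.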
    diff --git a/native-datatypes.html b/native-datatypes.html
    old mode 100644
    new mode 100755
    index b5a54c1..3f8cf55
    --- a/native-datatypes.html
    +++ b/native-datatypes.html
    @@ -39,10 +39,10 @@ body{counter-reset:h1 2}
     
     

    Booleans are either true or false. Python has two constants, cleverly named True and False, which can be used to assign boolean values directly. Expressions can also evaluate to a boolean value. In certain places (like if statements), Python expects an expression to evaluate to a boolean value. These places are called boolean contexts. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.)

    For example, take this snippet from humansize.py: -

    if size < 0:
    +
    if size < 0:
         raise ValueError('number must be non-negative')

    size is an integer, 0 is an integer, and < is a numerical operator. The result of the expression size < 0 is always a boolean. You can test this yourself in the Python interactive shell: -

    +
     >>> size = 1
     >>> size < 0
     False
    @@ -53,7 +53,7 @@ body{counter-reset:h1 2}
     >>> size < 0
     True

    Due to some legacy issues left over from Python 2, booleans can be treated as numbers. True is 1; False is 0. -

    +
     >>> True + True
     2
     >>> True + False
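
One practical consequence of booleans behaving like numbers, shown as a small sketch: because True counts as 1, summing a series of comparisons counts how many of them were true. The `sizes` list is made-up sample data.

```python
# Made-up sample data for illustration
sizes = [512, -1, 2048, 0, -7]

# Each comparison yields True or False; sum() adds them up as 1s and 0s
negative_count = sum(size < 0 for size in sizes)

# bool is a subclass of int, which is why the arithmetic works at all
bool_is_int = isinstance(True, int)
```

With this data, `negative_count` is 2 and `bool_is_int` is True.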
    @@ -741,7 +741,7 @@ KeyError: 'db.diveintopython3.org'

    Mixed-Value Dictionaries

    Dictionaries aren’t just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don’t all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.

    In fact, you’ve already seen a dictionary with non-string keys and values, in your first Python program. -

    SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
    +
    SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
                 1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

    Let's tear that apart in the interactive shell.

    @@ -787,7 +787,7 @@ KeyError: 'db.diveintopython3.org'

    None

    None is a special constant in Python. It is a null value. None is not the same as False. None is not 0. None is not an empty string. Comparing None to anything other than None will always return False.

    None is the only null value. It has its own datatype (NoneType). You can assign None to any variable, but you can not create other NoneType objects. All variables whose value is None are equal to each other. -

    +
     >>> type(None)
     <class 'NoneType'>
     >>> None == False
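
A short sketch of how these rules play out in code. The function name `describe` is hypothetical. Since there is only one None object, the usual way to test for it is the identity operator `is`, which cannot be confused with the other false values listed above.

```python
def describe(value):
    """Tell None apart from values that are merely false."""
    if value is None:
        # Identity test: only the single None object gets here
        return 'null'
    if not value:
        # False, 0 and '' are false in a boolean context, but are not None
        return 'false but not null'
    return 'true'


results = [describe(v) for v in (None, False, 0, '', 'text')]
```

The first entry of `results` is 'null', the next three are 'false but not null', and the last is 'true'.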
    @@ -807,7 +807,7 @@ KeyError: 'db.diveintopython3.org'

    None In A Boolean Context

    In a boolean context, None is false and not None is true. -

    +
     >>> def is_it_true(anything):
     ...   if anything:
     ...     print("yes, it's true")
    diff --git a/refactoring.html b/refactoring.html
    old mode 100644
    new mode 100755
    index 8b72d17..d3b3c5d
    --- a/refactoring.html
    +++ b/refactoring.html
    @@ -43,7 +43,7 @@ body{counter-reset:h1 10}
     
     
     

    Since your code has a bug, and you now have a test case that tests this bug, the test case will fail: -

    +
     you@localhost:~$ python3 romantest8.py -v
     from_roman should fail with blank string ... FAIL
     from_roman should fail with malformed antecedents ... ok
    @@ -264,7 +264,7 @@ def from_roman(s):
     
     

    You may be skeptical that these two small changes are all that you need. Hey, don’t take my word for it; see for yourself. -

    +
     you@localhost:~$ python3 romantest9.py -v
     from_roman should fail with blank string ... ok
     from_roman should fail with malformed antecedents ... ok
    @@ -364,7 +364,7 @@ build_lookup_tables()

    Let’s break that down into digestible pieces. Arguably, the most important line is the last one: -

    build_lookup_tables()
    +
    build_lookup_tables()

    You will note that is a function call, but there’s no if statement around it. This is not an if __name__ == '__main__' block; it gets called when the module is imported. (It is important to understand that modules are only imported once, then cached. If you import an already-imported module, it does nothing. So this code will only get called the first time you import this module.)
    diff --git a/regular-expressions.html b/regular-expressions.html
    old mode 100644
    new mode 100755
    index e2ab8c7..9886796
    --- a/regular-expressions.html
    +++ b/regular-expressions.html
    @@ -225,7 +225,7 @@ body{counter-reset:h1 4}

    The expression for the ones place follows the same pattern. I’ll spare you the details and show you the end result. -

    +
     >>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
     

    So what does that look like using this alternate {n,m} syntax? This example shows the new syntax.

    diff --git a/strings.html b/strings.html
    old mode 100644
    new mode 100755
    index 6dd4d7a..b1a256f
    --- a/strings.html
    +++ b/strings.html
    @@ -180,7 +180,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
     
     

    Just to blow your mind, here’s an example that combines all of the above: -

    +
     >>> import humansize
     >>> import sys
     >>> '1MB = 1000{0.modules[humansize].SUFFIXES[1000][0]}'.format(sys)
    @@ -201,7 +201,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
     
     

    But wait! There’s more! Let’s take another look at that strange line of code from humansize.py: -

    if size < multiple:
    +
    if size < multiple:
         return '{0:.1f} {1}'.format(size, suffix)

    {1} is replaced with the second argument passed to the format() method, which is suffix. But what is {0:.1f}? It’s two things: {0}, which you recognize, and :.1f, which you don’t. The second half (including and after the colon) defines the format specifier, which further refines how the replaced variable should be formatted.
    @@ -212,7 +212,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):

    Within a replacement field, a colon (:) marks the start of the format specifier. The format specifier “.1” means “round to the nearest tenth” (i.e. display only one digit after the decimal point). The format specifier “f” means “fixed-point number” (as opposed to exponential notation or some other decimal representation). Thus, given a size of 698.25 and suffix of 'GB', the formatted string would be '698.3 GB', because 698.25 gets rounded to one decimal place, then the suffix is appended after the number. -

    +
     >>> '{0:.1f} {1}'.format(698.25, 'GB')
     '698.3 GB'
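
A few more format specifiers, as a small sketch to experiment with. The helper name `format_size` and the sample values are assumptions chosen for illustration; only the `'{0:.1f} {1}'` template comes from humansize.py.

```python
def format_size(size, suffix):
    """Apply the same template that humansize.py uses."""
    return '{0:.1f} {1}'.format(size, suffix)


one_decimal = format_size(698.24, 'GB')      # '.1f': one digit after the point
padded = format_size(1.0, 'TB')              # fixed-point keeps the trailing zero
three_decimals = '{0:.3f}'.format(2.0)       # '.3f': three digits after the point
exponential = '{0:e}'.format(1500.0)         # 'e': exponential notation instead
```

These produce '698.2 GB', '1.0 TB', '2.000', and '1.500000e+03' respectively.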
    @@ -414,11 +414,11 @@ TypeError: Can't convert 'bytes' object to str implicitly

    If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a .py file to be windows-1252: -

    # -*- coding: windows-1252 -*-
    +
    # -*- coding: windows-1252 -*-

    Technically, the character encoding override can also be on the second line, if the first line is a UNIX-like hash-bang command. -

    #!/usr/bin/python3
    +
    #!/usr/bin/python3
     # -*- coding: windows-1252 -*-

    For more information, consult PEP 263: Defining Python Source Code Encodings.
    diff --git a/table-of-contents.html b/table-of-contents.html
    index 49548e4..cb51e32 100755
    --- a/table-of-contents.html
    +++ b/table-of-contents.html
    @@ -68,16 +68,14 @@ ul li ol{margin:0;padding:0 0 0 2.5em}

  • Searching for values in a list
  • Lists in a boolean context - +
      +
    1. Creating A Set +
    2. Modifying A Set +
    3. Removing Items From A Set +
    4. Common Set Operations +
    5. Sets In A Boolean Context +
  • Dictionaries
    1. Creating a dictionary @@ -92,21 +90,23 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
    2. Further reading
  • Strings -
      -
    1. Diving in -
    2. Unicode
        -
      1. How strings are stored in memory -
      2. Converting between different character encodings -
      3. Specifying character encoding in .py files +
      4. Some Boring Stuff You Need To Understand Before You Can Dive In +
      5. Unicode +
      6. Diving In +
      7. Formatting Strings +
          +
        1. Compound Field Names +
        2. Format Specifiers +
        +
      8. Other Common String Methods +
          +
        1. Slicing A String +
        +
      9. Strings vs. Bytes +
      10. Postscript: Character Encoding Of Python Source Code +
      11. Further Reading
      -
    3. Strings in Python 3 -
    4. Common string operations -
    5. Formatting strings -
    6. The string module -
    7. Strings vs. bytes -
    8. Further reading -
  • Regular expressions
    1. Diving in @@ -361,22 +361,4 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
    2. Really Esoteric Stuff
    -

    Orphans (not sure where these belong yet): -

      -
    • Tuples -
    • List comprehensions -
    • Set comprehensions -
    • Dictionary comprehensions -
    • Views (several dictionary methods return them, they're dynamic, update when the dictionary changes, etc.) -
    • Function annotations -
    • PEP 8 style conventions -
    • Decorators -
        -
      1. @unittest.skipUnless(sys.platform.startswith("win"), "requires Windows") -
      -
    • Importing modules -
        -
      1. ...mention why from module import * is only allowed at module level -
      -

    © 2001–9 Mark Pilgrim
    diff --git a/unit-testing.html b/unit-testing.html
    old mode 100644
    new mode 100755
    index 8a39f49..1cc4a02
    --- a/unit-testing.html
    +++ b/unit-testing.html
    @@ -195,13 +195,13 @@ def to_roman(n):

  • Here’s where the rich data structure of roman_numeral_map pays off, because you don’t need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through roman_numeral_map looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.

    If you’re still not clear how the to_roman() function works, add a print() call to the end of the while loop: -

    
    +
    
     while n >= integer:
         result += numeral
         n -= integer
         print('subtracting {0} from input, adding {1} to output'.format(integer, numeral))

    With the debug print() statements, the output looks like this: -

    +
     >>> import roman1
     >>> roman1.to_roman(1424)
     subtracting 1000 from input, adding M to output
    @@ -211,7 +211,7 @@ subtracting 10 from input, adding X to output
     subtracting 4 from input, adding IV to output
     'MCDXXIV'
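
To follow the trace above end to end, here is a self-contained sketch of the conversion loop without the debug print() call. The contents of `roman_numeral_map` are an assumption (this diff shows only its name): a tuple of numeral/value pairs ordered from largest to smallest, with the subtractive forms such as CM and IV listed as entries in their own right.

```python
# Assumed layout: (numeral, value) pairs, largest value first
roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def to_roman(n):
    """Convert an integer to a Roman numeral (no input checking yet)."""
    result = ''
    for numeral, integer in roman_numeral_map:
        # Take this numeral as many times as it still fits in what is left
        while n >= integer:
            result += numeral
            n -= integer
    return result


converted = to_roman(1424)
```

`converted` is 'MCDXXIV', built from the same five steps as the trace: M, CD, X, X, IV. Because CD and IV are ordinary entries in the map, the loop needs no special case for the subtraction rule.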

    So the to_roman() function appears to work, at least in this manual spot check. But will it pass the test case you wrote? -

    +
     you@localhost:~$ python3 romantest1.py -v
     to_roman should give known result with known input ... ok
     
    @@ -343,7 +343,7 @@ OK

    Along with testing numbers that are too large, you need to test numbers that are too small. As we noted in our functional requirements, Roman numerals cannot express 0 or negative numbers. -

    +
     >>> import roman2
     >>> roman2.to_roman(0)
     ''
    @@ -373,7 +373,7 @@ OK

    Now check that the tests fail: -

    +
     you@localhost:~$ python3 romantest3.py -v
     to_roman should give known result with known input ... ok
     to_roman should fail with negative input ... FAIL
    @@ -422,7 +422,7 @@ FAILED (failures=2)

    I could show you a whole series of unrelated examples to show that the multiple-comparisons-at-once shortcut works, but instead I’ll just run the unit tests and prove it. -

    +
     you@localhost:~$ python3 romantest3.py -v
     to_roman should give known result with known input ... ok
     to_roman should fail with negative input ... ok
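
For reference, the multiple-comparisons-at-once shortcut looks like this in isolation. This is a sketch rather than the roman3.py source: the function name `check_range` and the error message are hypothetical, and the upper bound of 4000 is an assumption about the range the tests enforce.

```python
class OutOfRangeError(ValueError):
    pass


def check_range(n):
    """Raise OutOfRangeError unless n lies strictly between 0 and 4000."""
    # 0 < n < 4000 means the same as (0 < n) and (n < 4000)
    if not (0 < n < 4000):
        raise OutOfRangeError('number out of range (must be 1..3999)')
    return n


def is_rejected(n):
    """Report whether check_range() refuses n."""
    try:
        check_range(n)
    except OutOfRangeError:
        return True
    return False


rejected = [n for n in (-1, 0, 1, 3999, 4000) if is_rejected(n)]
```

`rejected` ends up as `[-1, 0, 4000]`, so a single chained comparison covers the negative, zero, and too-large cases.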
    @@ -453,13 +453,13 @@ OK

    Testing for non-integers is not difficult. First, define a NotIntegerError exception. -

    # roman4.py
    +
    # roman4.py
     class OutOfRangeError(ValueError): pass
     class NotIntegerError(ValueError): pass

    Next, write a test case that checks for the NotIntegerError exception. -

    class ToRomanBadInput(unittest.TestCase):
    +
    class ToRomanBadInput(unittest.TestCase):
         .
         .
         .
    @@ -469,7 +469,7 @@ class OutOfRangeError(ValueError): pass
     
     

    Now check that the test fails properly. -

    +
     you@localhost:~$ python3 romantest4.py -v
     to_roman should give known result with known input ... ok
     to_roman should fail with negative input ... ok
    @@ -512,7 +512,7 @@ FAILED (failures=1)

    Finally, check that the code does indeed make the test pass. -

    +
     you@localhost:~$ python3 romantest4.py -v
     to_roman should give known result with known input ... ok
     to_roman should fail with negative input ... ok
    diff --git a/xml.html b/xml.html
    index e96c73b..b2ecc48 100755
    --- a/xml.html
    +++ b/xml.html
    @@ -443,7 +443,7 @@ StopIteration

    For large XML documents, lxml is significantly faster than the built-in ElementTree library. If you’re only using the ElementTree API and want to use the fastest available implementation, you can try to import lxml and fall back to the built-in ElementTree. -

    try:
    +
    try:
         from lxml import etree
     except ImportError:
         import xml.etree.ElementTree as etree
    @@ -582,7 +582,7 @@ except ImportError:

    That’s an error, because the &hellip; entity is not defined in XML. (It is defined in HTML.) If you try to parse this broken feed with the default settings, lxml will choke on the undefined entity. -

    +
     >>> import lxml.etree
     >>> tree = lxml.etree.parse('examples/feed-broken.xml')
     Traceback (most recent call last):
    diff --git a/your-first-python-program.html b/your-first-python-program.html
    index 2ef8e07..f85ee53 100755
    --- a/your-first-python-program.html
    +++ b/your-first-python-program.html
    @@ -137,7 +137,7 @@ SyntaxError: non-keyword arg after keyword arg

    I won’t bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you’ve forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You’ll thank me in six months.

    Documentation Strings

    You can document a Python function by giving it a documentation string (docstring for short). In this program, the approximate_size() function has a docstring: -

    def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    +
    def approximate_size(size, a_kilobyte_is_1024_bytes=True):
         '''Convert a file size to human-readable form.
     
         Keyword arguments: