diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index b9877b4..c9911f1 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -9,7 +9,11 @@ body{counter-reset:h1 19}
-

Case study: porting chardet to Python 3

+

Case study: porting chardet to Python 3

+ +
+

Words, words. They're all we have to go on.
Rosencrantz and Guildenstern are Dead +

  1. Introducing chardet: a mini-FAQ @@ -41,7 +45,7 @@ body{counter-reset:h1 19}

    Introducing chardet: a mini-FAQ

    -

    When you think of "text", you probably think of "characters and symbols I see on my computer screen". But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. +

    When you think of "text," you probably think of "characters and symbols I see on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

    In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text", you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
    @@ -136,7 +140,7 @@ body{counter-reset:h1 19}

    The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn. -

    +

    skip over this

    C:\home\chardet>python c:\Python30\Tools\Scripts\2to3.py -w chardet\
     RefactoringTool: Skipping implicit fixer: buffer
     RefactoringTool: Skipping implicit fixer: idioms
    @@ -606,7 +610,7 @@ RefactoringTool: chardet\utf8prober.py

    Now run the 2to3 script on the testing harness, test.py. -

    +

    skip over this

    C:\home\chardet>python c:\Python30\Tools\Scripts\2to3.py -w test.py
     RefactoringTool: Skipping implicit fixer: buffer
     RefactoringTool: Skipping implicit fixer: idioms
    @@ -646,7 +650,7 @@ RefactoringTool: test.py

    Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it's a good way to test our ported code to make sure there aren't any bugs lurking anywhere. -

    +

    skip over this

    C:\home\chardet>python test.py tests\*\*
     Traceback (most recent call last):
       File "test.py", line 1, in <module>
    @@ -658,7 +662,7 @@ SyntaxError: invalid syntax

    Hmm, a small snag. In Python 3, False is a reserved word, so you can't use it as a variable name. Let's look at constants.py to see where it's defined. Here's the original version from constants.py, before the 2to3 script changed it: -

    +

    skip over this

    import __builtin__
     if not hasattr(__builtin__, 'False'):
         False = 0
    @@ -685,7 +689,7 @@ else:
     
     

    Time to run test.py again and see how far it gets. -

    +

    skip over this

    C:\home\chardet>python test.py tests\*\*
     Traceback (most recent call last):
       File "test.py", line 1, in <module>
    @@ -717,7 +721,7 @@ import sys

    FIXME intro -

    +

    skip over this

    C:\home\chardet>python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
    @@ -737,7 +741,7 @@ NameError: name 'file' is not defined

    FIXME intro -

    +

    skip over this

    C:\home\chardet>python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
    @@ -751,7 +755,7 @@ TypeError: can't use a string pattern on a bytes-like object

    First, let's see what self._highBitDetector is. It's defined in the __init__ method of the UniversalDetector class: -

    +

    skip over this

    class UniversalDetector:
         def __init__(self):
             self._highBitDetector = re.compile(r'[\x80-\xFF]')
    @@ -762,7 +766,7 @@ TypeError: can't use a string pattern on a bytes-like object

    In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we're searching is not a string, it's a byte array. Looking at the traceback, this error occurred in universaldetector.py: -

    +

    skip over this

    def feed(self, aBuf):
         .
         .
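The string-versus-bytes mismatch described above can be reproduced in isolation. This is a minimal sketch, not the chapter's own fix: `sample` is a hypothetical stand-in for `aBuf`, and compiling the pattern from a bytes literal is shown only as one possible direction.

```python
import re

# A pattern compiled from a str can only search str objects.
str_pattern = re.compile(r'[\x80-\xFF]')
# A pattern compiled from a bytes literal can search bytes objects.
bytes_pattern = re.compile(b'[\x80-\xFF]')

# Hypothetical stand-in for aBuf: four bytes, the last one has its high bit set.
sample = 'caf\u00e9'.encode('latin-1')

try:
    str_pattern.search(sample)
    str_search_failed = False
except TypeError:
    # Python 3 refuses to apply a string pattern to a bytes-like object.
    str_search_failed = True

high_bit_match = bytes_pattern.search(sample)
print(str_search_failed, high_bit_match is not None)
```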
    @@ -772,7 +776,7 @@ TypeError: can't use a string pattern on a bytes-like object

    And what is aBuf? Let's backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py. -

    +

    skip over this

    u = UniversalDetector()
     .
     .
    @@ -804,7 +808,7 @@ for line in open(f, 'rb'):
     
     

    Curiouser and curiouser... -

    +

    skip over this

    C:\home\chardet>python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
    diff --git a/dip3.css b/dip3.css
    index f182006..ae2f8ed 100644
    --- a/dip3.css
    +++ b/dip3.css
    @@ -25,12 +25,14 @@ figure{display:block;text-align:center;margin:1.75em 0}
     figure img{display:block;margin:0 auto}
     section,article,footer{display:block}
     var{font-family:monospace;font-style:normal}
    -a.skip{font-size:small;display:block;margin:auto;text-align:right;border:0}
    +.skip a,.skip a:hover,.skip a:visited{position:absolute;left:0px;top:-500px;width:1px;height:1px;overflow:hidden}
    +.skip a:active,.skip a:focus{position:static;width:auto;height:auto}
     table{width:100%;border-collapse:collapse}
     th{text-align:left;padding:0 0.5em;vertical-align:baseline;border:1px dotted}
     th,td{width:45%;vertical-align:top}
     th:first-child{width:10%;text-align:center}
    -.q span,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style:normal}
    +.q span,.note p:first-child,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style:normal}
    +.note p:first-child{float:left;font-size:xx-large;line-height:0.875em;margin:0 0.22em 0 0}
     .q span{font-size:large}
     td{border:1px dotted;padding:0 0.5em}
     body{counter-reset:h1}
    diff --git a/porting-code-to-python-3-with-2to3.html b/porting-code-to-python-3-with-2to3.html
    index 4df7f63..4c70791 100644
    --- a/porting-code-to-python-3-with-2to3.html
    +++ b/porting-code-to-python-3-with-2to3.html
    @@ -122,7 +122,7 @@ for (var i = arTables.length - 1; i >= 0; i--) {
     
     

    In Python 2, print was a statement -- whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function -- whatever you want to print is passed to print() like any other function. -
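As a small illustration of what "like any other function" means in practice, the sketch below passes keyword arguments to `print()` and captures the output in an in-memory buffer. The values printed are arbitrary placeholders.

```python
import io

# print() is an ordinary function, so it accepts keyword arguments.
# Here the output goes to an in-memory buffer instead of the screen.
buffer = io.StringIO()
print('chardet', 3, sep='-', end='!\n', file=buffer)
printed = buffer.getvalue()

# The Python 2 statement form, `print 'chardet', 3`, is a SyntaxError in Python 3.
print(repr(printed))
```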

    +

    skip over this table @@ -168,7 +168,7 @@ for (var i = arTables.length - 1; i >= 0; i--) {

    Python 2 supported <> as a synonym for !=, the not-equals comparison operator. Python 3 supports the != operator, but not <>. -

    +

    skip over this table

    Notes
    @@ -196,7 +196,7 @@ for (var i = arTables.length - 1; i >= 0; i--) {

    In Python 2, dictionaries had a has_key() method to test whether the dictionary had a certain key. In Python 3, this method no longer exists. Instead, you need to use the in operator. -
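A quick sketch of the replacement, using a hypothetical dictionary of encoding names:

```python
# Hypothetical sample data; the keys and values are placeholders.
a_dictionary = {'ascii': 1, 'utf-8': 2}

# Python 3 dictionaries no longer have a has_key() method.
has_method = hasattr(a_dictionary, 'has_key')

# The in operator tests for a key directly.
found = 'utf-8' in a_dictionary
missing = 'koi8-r' in a_dictionary
print(has_method, found, missing)
```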

    +

    skip over this table

    Notes
    @@ -242,7 +242,7 @@ for (var i = arTables.length - 1; i >= 0; i--) {

    In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(), items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing. -
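The difference is easy to see with a small dictionary. The sketch below (sample data is hypothetical) shows that a view cannot be indexed, that wrapping it in `list()` gives back an indexable list, and that the view tracks later changes to the dictionary.

```python
a_dictionary = {'a': 1, 'b': 2}  # hypothetical sample data
keys_view = a_dictionary.keys()

try:
    keys_view[0]
    indexing_failed = False
except TypeError:
    # Views do not support indexing.
    indexing_failed = True

# Wrapping the view in list() produces a real, indexable list.
first_key = list(keys_view)[0]

# The view is dynamic: it reflects keys added after it was created.
a_dictionary['c'] = 3
view_length = len(keys_view)
print(indexing_failed, first_key, view_length)
```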

    +

    skip over this table

    Notes
    @@ -288,13 +288,11 @@ for (var i = arTables.length - 1; i >= 0; i--) {

    Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical. -

    FIXME: once the rest of the book is written, these should link back to the chapters and sections that explain these modules. -

    http package

    In Python 3, several related HTTP modules have been combined into a single package, http. -

    +

    skip over this table

    Notes
    @@ -336,7 +334,7 @@ import CGIHttpServer

    Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib. -
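For orientation, here is a sketch of where a few commonly used names live in the Python 3 `urllib` package. The URL and query values are placeholders, and nothing here touches the network.

```python
# Parsing and encoding helpers live in urllib.parse.
from urllib.parse import urlparse, urlencode, quote
# Exceptions live in urllib.error.
from urllib.error import HTTPError, URLError
# Fetching functions such as urlopen() live in urllib.request.
import urllib.request

parts = urlparse('http://example.com/path/page.html?q=1')  # placeholder URL
query = urlencode({'name': 'chardet', 'v': 3})
escaped = quote('a b')
print(parts.netloc, parts.path, parts.query, query, escaped)
```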

    +

    skip over this table

    Notes
    @@ -392,7 +390,7 @@ from urllib.error import HTTPError

    All the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM, you can import the appropriate module within the dbm package. -

    +

    skip over this table

    Notes
    @@ -433,7 +431,7 @@ import whichdb

    XML-RPC is a lightweight method of performing remote procedure calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc. -

    +

    skip over this table

    Notes
    @@ -457,7 +455,7 @@ import SimpleXMLRPCServer

    Other modules

    -

    +

    skip over this table

    Notes
    @@ -536,7 +534,7 @@ except ImportError:

    Suppose you had this package, with multiple files in the same directory: -

    +

    skip over this ASCII art

    chardet/
     |
     +--__init__.py
    @@ -549,7 +547,7 @@ except ImportError:
     
     

    Now suppose that universaldetector.py needs to import the entire constants.py file and one class from mbcharsetprober.py. How do you do it? -

    +

    skip over this table

    Notes
    @@ -575,9 +573,9 @@ except ImportError:

    filter() global function

    -

    FIXME intro +

    In Python 2, the filter() function returned a list, the result of "filtering" a sequence through a function that returned True or False for each item in the sequence. In Python 3, the filter() function returns an iterator, not a list. -

    +

    skip over this table

    Notes
    @@ -601,7 +599,7 @@ except ImportError: - + @@ -612,18 +610,18 @@ except ImportError:
    Notes
    for i in filter(None, a_sequence)for i in filter(None, a_sequence): no change

      -
    1. ... -
    2. ... -
    3. ... -
    4. ... -
    5. ... +
    6. In the most basic case, 2to3 will wrap a call to filter() with a call to list(), which simply iterates through its argument and returns a real list. +
    7. However, if the call to filter() is already wrapped in list(), 2to3 will do nothing, since the fact that filter() is returning an iterator is irrelevant. +
    8. For the special syntax of filter(None, ...), 2to3 will transform the call into a semantically equivalent list comprehension. +
    9. In contexts like for loops, which iterate through the entire sequence anyway, no changes are necessary. +
    10. Again, no changes are necessary, because the list comprehension will iterate through the entire sequence, and it can do that just as well if filter() returns an iterator as if it returns a list.
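The behavior behind these notes can be checked directly in Python 3. In this sketch, `is_odd` and `a_sequence` are hypothetical names chosen for illustration, and the list comprehension shows the general shape of the `filter(None, ...)` rewrite rather than the exact text 2to3 emits.

```python
def is_odd(n):
    return n % 2 == 1

a_sequence = [0, 1, 2, 3, 4, 5]  # hypothetical sample data

# filter() returns an iterator, not a list.
filtered = filter(is_odd, a_sequence)
is_list = isinstance(filtered, list)

# Wrapping the call in list() recovers the Python 2 behavior.
as_list = list(filtered)

# filter(None, ...) keeps the truthy items, the same as this comprehension.
truthy_via_filter = list(filter(None, a_sequence))
truthy_via_comprehension = [x for x in a_sequence if x]
print(is_list, as_list, truthy_via_filter)
```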

    map() global function

    -

    FIXME intro +

    In much the same way as filter(), the map() function now returns an iterator. (In Python 2, it returned a list.) -

    +

    skip over this table @@ -648,28 +646,33 @@ except ImportError: - + - +
    Notes
    for i in map(a_function, a_sequence):unchangedno change
    [i for i in map(a_function, a_sequence)]unchangedno change

      -
    1. ... -
    2. ... -
    3. ... -
    4. ... -
    5. ... +
    6. As with filter(), in the most basic case, 2to3 will wrap a call to map() with a call to list(). +
    7. For the special syntax of map(None, ...), which applies the identity function, 2to3 will convert the call to an equivalent call to list(). +
    8. If the first argument to map() is a lambda function, 2to3 will convert it to an equivalent list comprehension. +
    9. In contexts like for loops, which iterate through the entire sequence anyway, no changes are necessary. +
    10. Again, no changes are necessary, because the list comprehension will iterate through the entire sequence, and it can do that just as well if map() returns an iterator as if it returns a list.
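A short sketch of the same points for `map()`, using hypothetical sample data. It also shows one practical consequence of getting an iterator back: it can only be consumed once.

```python
a_sequence = [1, 2, 3]  # hypothetical sample data

# map() returns an iterator, not a list.
mapped = map(str, a_sequence)
is_list = isinstance(mapped, list)

# list() walks the iterator and builds a real list.
as_list = list(mapped)

# The iterator is now used up, so a second pass yields nothing.
exhausted = list(mapped)

# A map() call with a lambda and a list comprehension give the same result.
via_map = list(map(lambda x: x * 2, a_sequence))
via_comprehension = [x * 2 for x in a_sequence]
print(is_list, as_list, exhausted, via_map)
```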

    reduce() global function (3.1+)

    -

    FIXME intro +

    In Python 3, the reduce() function has been removed from the global namespace and placed in the functools module. -

    +

    +

    +

    The version of 2to3 that shipped with Python 3.0 would not fix this case automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1. +

    + +

    skip over this table @@ -677,22 +680,20 @@ except ImportError: - +
    NotesPython 3
    reduce(a, b, c)
    from functools import reduce
     reduce(a, b, c)
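To confirm the new location, this sketch imports `reduce()` from the `functools` module and checks that the name is no longer a builtin. The arguments are arbitrary: a function, a sequence, and an initial value, mirroring the three-argument call in the table.

```python
import builtins
import operator
from functools import reduce

# reduce() is no longer available as a global builtin in Python 3.
in_builtins = hasattr(builtins, 'reduce')

# Same three-argument form as reduce(a, b, c): function, sequence, initial value.
total = reduce(operator.add, [1, 2, 3, 4], 10)
print(in_builtins, total)
```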
    -

      -
    1. ... -
    +

    apply() global function

    FIXME intro -

    +

    skip over this table @@ -732,7 +733,7 @@ reduce(a, b, c)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -754,7 +755,7 @@ reduce(a, b, c)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -788,7 +789,7 @@ reduce(a, b, c)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -810,7 +811,7 @@ reduce(a, b, c)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -844,7 +845,7 @@ reduce(a, b, c)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -879,7 +880,7 @@ except (RuntimeError, ImportError) as e: import mymodule except ImportError: pass - + @@ -887,7 +888,7 @@ except ImportError: import mymodule except: pass - +
    Notesunchangedno change
    unchangedno change
    @@ -902,7 +903,7 @@ except:

    FIXME intro -

    +

    skip over this table @@ -936,7 +937,7 @@ except:

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -946,7 +947,7 @@ except: - + @@ -970,7 +971,7 @@ except:

    FIXME intro -

    +

    skip over this table

    Notes
    aGenerator.throw(MyException)unchangedno change
    @@ -1016,7 +1017,7 @@ except:

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1041,12 +1042,12 @@ except: - + - +
    Notes
    for i in range(10):unchangedno change
    sum(range(10))unchangedno change
    @@ -1062,7 +1063,7 @@ except:

    FIXME intro -

    +

    skip over this table @@ -1102,7 +1103,7 @@ except:

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1160,7 +1161,7 @@ except:

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1175,7 +1176,7 @@ except: - +
    Notes
    for line in a_file.xreadlines(5):unchangedno change
    @@ -1188,7 +1189,7 @@ except:

    FIXME intro -

    +

    skip over this table @@ -1222,7 +1223,7 @@ except:

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1256,7 +1257,7 @@ except:

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1287,7 +1288,7 @@ except: - + @@ -1312,7 +1313,7 @@ for an_iterator in a_sequence_of_iterators:

    FIXME intro -

    +

    skip over this table

    Notes
    class A:
         def next(self, x, y):
             pass
    unchangedno change
    @@ -1333,7 +1334,7 @@ for an_iterator in a_sequence_of_iterators: - +
    Notes
    class A:
         def __nonzero__(self, x, y):
             pass
    unchangedno change
    @@ -1346,7 +1347,7 @@ for an_iterator in a_sequence_of_iterators:

    FIXME intro -

    +

    skip over this table @@ -1374,7 +1375,7 @@ for an_iterator in a_sequence_of_iterators:

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1404,7 +1405,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1426,7 +1427,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1454,7 +1455,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1476,7 +1477,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1491,7 +1492,7 @@ a_function(sys.maxsize) - +
    Notes
    d.join(zip(a, b, c))unchangedno change
    @@ -1504,7 +1505,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table @@ -1532,7 +1533,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1584,7 +1585,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1606,7 +1607,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1673,7 +1674,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1707,7 +1708,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1729,7 +1730,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1751,7 +1752,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1783,7 +1784,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1817,7 +1818,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1839,7 +1840,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1867,7 +1868,7 @@ a_function(sys.maxsize)

    FIXME intro -

    +

    skip over this table

    Notes
    @@ -1908,6 +1909,8 @@ do_stuff(a_list)
  2. ... +

    FIXME: once the rest of the book is written, this appendix should contain copious links back to any chapter or section that touches on these features. +

  3. Notes