diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 5c4788f..a3c50ea 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -25,6 +25,7 @@

The chardet library is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.

+

C:\home\chardet>python c:\Python30\Tools\Scripts\2to3.py -w chardet\
 RefactoringTool: Skipping implicit fixer: buffer
 RefactoringTool: Skipping implicit fixer: idioms
@@ -492,8 +493,9 @@ RefactoringTool: chardet\sjisprober.py
 RefactoringTool: chardet\universaldetector.py
 RefactoringTool: chardet\utf8prober.py
-

Now run the 2to3 script on the testing harness, test.py.

+

C:\home\chardet>python c:\Python30\Tools\Scripts\2to3.py -w test.py
 RefactoringTool: Skipping implicit fixer: buffer
 RefactoringTool: Skipping implicit fixer: idioms
@@ -525,7 +527,7 @@ RefactoringTool: Skipping implicit fixer: ws_comma
 RefactoringTool: Files that were modified:
 RefactoringTool: test.py
-

Well, that wasn't so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it'll work?

@@ -533,6 +535,7 @@ RefactoringTool: test.py

Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it's a good way to test our ported code to make sure there aren't any bugs lurking anywhere.

+

C:\home\chardet>python test.py tests\*\*
 Traceback (most recent call last):
   File "test.py", line 1, in <module>
@@ -542,8 +545,9 @@ RefactoringTool: test.py
       ^
 SyntaxError: invalid syntax
-

Hmm, a small snag. In Python 3, False is a reserved word, so you can't use it as a variable name. Let's look at constants.py to see where it's defined. Here's the original version from constants.py, before the 2to3 script changed it:

+

import __builtin__
 if not hasattr(__builtin__, 'False'):
     False = 0
@@ -552,7 +556,7 @@ else:
     False = __builtin__.False
     True = __builtin__.True
-

This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary.

However, Python 3 will always have a Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of "constants.True" and "constants.False" with "True" and "False", respectively, then delete this dead code from constants.py.
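A quick check, runnable under Python 3, illustrates why the old shim cannot survive the port: True and False are keywords, so assigning to them is rejected at compile time. This is only a sketch; the helper name is_assignable is hypothetical and not part of chardet.

```python
import keyword


def is_assignable(name):
    """Return True if `name = 0` compiles as a statement (hypothetical helper)."""
    try:
        compile(f"{name} = 0", "<sketch>", "exec")
    except SyntaxError:
        return False
    return True


# In Python 3, True and False are keywords, not ordinary built-in names.
print(keyword.iskeyword("True"), keyword.iskeyword("False"))

# Assigning to a keyword fails at compile time; an ordinary name is fine.
print(is_assignable("False"), is_assignable("false_value"))
```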

@@ -572,6 +576,7 @@ else:

Time to run test.py again and see how far it gets.

+

C:\home\chardet>python test.py tests\*\*
 Traceback (most recent call last):
   File "test.py", line 1, in <module>
@@ -580,7 +585,7 @@ else:
     import constants, sys
 ImportError: No module named constants
-

What's that you say? No module named constants? Of course there's a module named constants. ... Oh wait, no there isn't. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:

from . import constants
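To see the explicit relative import working outside chardet, here is a minimal sketch that builds a throwaway package in a temporary directory and imports it. The names demo_pkg, detector, and DEBUG are hypothetical stand-ins, not chardet's real layout.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk. demo_pkg, constants.DEBUG, and
# detector are hypothetical stand-ins, not chardet's real modules.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)

sources = {
    "__init__.py": "",
    "constants.py": "DEBUG = 0\n",
    # Explicit relative import: "." means "the package this module lives in".
    "detector.py": "from . import constants\nvalue = constants.DEBUG\n",
}
for filename, body in sources.items():
    with open(os.path.join(pkg_dir, filename), "w") as f:
        f.write(body)

# Make the temporary directory importable, then load the submodule.
sys.path.insert(0, root)
importlib.invalidate_caches()
detector = importlib.import_module("demo_pkg.detector")
print(detector.value)
```

The relative form only works for modules that are imported as part of a package; a file run directly as a script has no parent package for "." to refer to.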
@@ -603,6 +608,9 @@ import sys

Name 'file' is not defined

+

FIXME intro

+
+

C:\home\chardet>python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
@@ -610,7 +618,7 @@ import sys
     for line in file(f, 'rb'):
 NameError: name 'file' is not defined
-

This one surprised me, because I've been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I'll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it's an alias for io.open(), but never mind that right now.)

Thus, the simplest solution to the problem of the missing file() is to call open() instead:
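As a self-contained sketch of that substitution (the sample file and its contents are made up for illustration, and the with block is an addition of mine, not part of the original harness):

```python
import os
import tempfile

# Write a small sample file so the sketch is self-contained.
# (The path and contents are made up for illustration.)
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "wb") as f:
    f.write(b"<a>one</a>\n<b>two</b>\n")

# Python 2 idiom:  for line in file(path, 'rb'): ...
# Python 3:        call open() with the same arguments.
lines = []
with open(path, "rb") as f:
    for line in f:
        lines.append(line)

print(lines)
```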

@@ -624,6 +632,7 @@ NameError: name 'file' is not defined

FIXME intro

+

C:\home\chardet>python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
@@ -633,20 +642,22 @@ NameError: name 'file' is not defined
     if self._highBitDetector.search(aBuf):
 TypeError: can't use a string pattern on a bytes-like object
-

Now things are starting to get interesting. And by "interesting," I mean "confusing as all hell."

First, let's see what self._highBitDetector is. It's defined in the __init__ method of the UniversalDetector class:

+

class UniversalDetector:
     def __init__(self):
         self._highBitDetector = re.compile(r'[\x80-\xFF]')
-

This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that's not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.

And therein lies the problem.

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we're searching is not a string, it's a byte array. Looking at the traceback, this error occurred in universaldetector.py:

+

def feed(self, aBuf):
     .
     .
@@ -654,8 +665,9 @@ TypeError: can't use a string pattern on a bytes-like object
     if self._mInputState == ePureAscii:
         if self._highBitDetector.search(aBuf):
-

And what is aBuf? Let's backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.

+

u = UniversalDetector()
 .
 .
@@ -663,7 +675,7 @@ TypeError: can't use a string pattern on a bytes-like object
 for line in open(f, 'rb'):
     u.feed(line)
-

And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for "read"; OK, big deal, we're reading the file. Ah, but 'b' is for "binary." Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don't have characters; we have bytes. Oops.
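The difference between the two modes is easy to see on a tiny file. This is a sketch with made-up contents; the encoding is passed explicitly so the result does not depend on the system default.

```python
import os
import tempfile

# A made-up two-line file containing one non-ASCII character
# (e-acute, stored as two bytes in UTF-8).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"caf\xc3\xa9\nplain\n")

# Binary mode: each line is a bytes object, exactly as stored on disk.
with open(path, "rb") as f:
    raw_lines = list(f)

# Text mode: each line is decoded into a str (an array of characters).
with open(path, "r", encoding="utf-8") as f:
    text_lines = list(f)

print(type(raw_lines[0]), len(raw_lines[0]))
print(type(text_lines[0]), len(text_lines[0]))
```

The first line is six bytes long in binary mode but only five characters long in text mode, because the two bytes of the accented letter decode to a single character.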

What we need this regular expression to search is not an array of characters, but an array of bytes.
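One way to get there, sketched on a made-up sample line rather than on chardet itself, is to compile the pattern from a bytes literal instead of a string literal:

```python
import re

line = b"caf\xc3\xa9\n"  # made-up sample: bytes, as read in 'rb' mode

# A str pattern cannot search bytes; Python 3 raises TypeError.
str_pattern = re.compile(r"[\x80-\xFF]")
try:
    str_pattern.search(line)
    str_pattern_failed = False
except TypeError:
    str_pattern_failed = True

# A bytes pattern (note the b prefix) matches raw byte values instead.
bytes_pattern = re.compile(b"[\x80-\xFF]")
match = bytes_pattern.search(line)

print(str_pattern_failed, match.group())
```

The rule cuts both ways: a bytes pattern can only search bytes, and a str pattern can only search str.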

@@ -689,6 +701,7 @@ for line in open(f, 'rb'):

Curiouser and curiouser...

+

C:\home\chardet>python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
 Traceback (most recent call last):
@@ -698,6 +711,7 @@ for line in open(f, 'rb'):
     elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
 TypeError: Can't convert 'bytes' object to str implicitly
+

...

diff --git a/dip3.css b/dip3.css
index 4ccd528..9047996 100644
--- a/dip3.css
+++ b/dip3.css
@@ -33,3 +33,4 @@ h1{counter-reset:h2}
 h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
 h2{counter-reset:h3}
 h3:before{counter-increment:h3;content:counter(h1) "." counter(h2) "." counter(h3) ". "}
+a.skip{font-size:small;display:block;margin:auto;text-align:center;border:0}
\ No newline at end of file