diff --git a/about.html b/about.html
index e654473..e423ada 100644
--- a/about.html
+++ b/about.html
@@ -3,7 +3,7 @@
© 2001–4, 2009 ℳark Pilgrim
diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index e07716a..ca91a1f 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -3,7 +3,7 @@
And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
What we need this regular expression to search is not an array of characters, but an array of bytes.
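To see the difference between the two file modes for yourself, here is a minimal sketch. The scratch file and its contents are invented for this illustration; they are not part of chardet or its test suite.

```python
import os
import tempfile

# Create a throwaway file containing a UTF-8 byte order mark and some text.
# (The file contents are made up for this illustration.)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfhello\n')
    path = tmp.name

try:
    # 'rb': each line comes back as bytes, exactly as stored on disk.
    with open(path, 'rb') as f:
        binary_line = f.readline()

    # 'r': each line is decoded into a string using the given encoding.
    with open(path, 'r', encoding='utf-8') as f:
        text_line = f.readline()
finally:
    os.remove(path)

print(type(binary_line))  # <class 'bytes'>
print(type(text_line))    # <class 'str'>
```

Same file, same line, two different types. Which one you get depends entirely on the mode you pass to open().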
Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.) -
class UniversalDetector:
def __init__(self):
@@ -716,7 +715,6 @@ for line in open(f, 'rb'):
self._mCharSetProbers = []
self.reset()
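Here is a small standalone sketch of the rule, outside of chardet. The variable names and the sample buffer are assumptions made up for the example; only the character class mirrors the idea of a "high-bit" detector.

```python
import re

# A pattern defined with a string can only search strings.
str_pattern = re.compile(r'[\x80-\xFF]')
# A pattern defined with a byte array can only search byte arrays.
bytes_pattern = re.compile(b'[\x80-\xFF]')

a_buf = b'caf\xc3\xa9'  # sample input, invented for this sketch

# The bytes pattern finds the first high-bit byte.
match = bytes_pattern.search(a_buf)
print(match.group())  # b'\xc3'

# The string pattern refuses to search a byte array at all.
str_error = None
try:
    str_pattern.search(a_buf)
except TypeError as exc:
    str_error = exc
print(type(str_error).__name__)  # TypeError
```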
Searching the entire codebase for other uses of the re module turns up two more instances, in charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
-
class CharSetProber:
.
@@ -743,15 +741,11 @@ for line in open(f, 'rb'):
File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly
-
There's an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
-
elif (self._mInputState == ePureAscii) and \
self._escDetector.search(self._mLastChar + aBuf):
-
And re-run the test:
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
@@ -761,11 +755,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
  File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
    self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly
-
Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you're thinking that the search() method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it's trying to construct the value that it will eventually pass to the search() method.
-
We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It's an instance variable, defined in the reset() method, which is actually called from the __init__() method.
-
class UniversalDetector:
def __init__(self):
@@ -782,11 +773,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mGotData = False
self._mInputState = ePureAscii
self._mLastChar = ''
-
And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can't concatenate a string to a byte array — not even a zero-length string. -
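You can reproduce this outside of chardet in a few lines. This is only a sketch; the names last_char and a_buf stand in for self._mLastChar and aBuf, and the exact wording of the error message varies between Python 3 versions.

```python
last_char = ''            # a zero-length string, as in the original reset()
a_buf = b'\xef\xbb\xbf'   # a byte array, as read from a file opened with 'rb'

# Mixing the two types fails, even though the string is empty.
concat_failed = False
try:
    combined = last_char + a_buf
except TypeError:
    concat_failed = True
print(concat_failed)  # True

# A zero-length byte array concatenates without complaint.
fixed = b'' + a_buf
print(fixed == a_buf)  # True
```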
So what is self._mLastChar anyway? The answer is in the feed() method, just a few lines down from where the traceback occurred.
-
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
@@ -796,9 +784,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mInputState = eEscAscii
self._mLastChar = aBuf[-1]
-
The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it's needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus:
-
def reset(self):
.
@@ -806,9 +792,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
.
Searching the entire codebase for "mLastChar" turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
-
class MultiByteCharSetProber(CharSetProber):
@@ -827,11 +811,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mDistributionAnalyzer.reset()
'int' and 'bytes'
I have good news, and I have bad news. The good news is we're making progress…
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
@@ -841,13 +822,9 @@ TypeError: Can't convert 'bytes' object to str implicitly
  File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
    self._escDetector.search(self._mLastChar + aBuf):
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
-
…The bad news is it doesn't always feel like progress. -
But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int?
-
The answer lies not in the previous lines of code, but in the following lines. -
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
@@ -857,9 +834,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
self._mInputState = eEscAscii
self._mLastChar = aBuf[-1]
-
This error doesn't occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
-
>>> aBuf = b'\xEF\xBB\xBF' ① @@ -887,19 +862,14 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
So, to ensure that the feed() method in universaldetector.py continues to work no matter how often it's called, you need to initialize self._mLastChar as a 0-length byte array, then make sure it stays a byte array.
-
self._escDetector.search(self._mLastChar + aBuf):
self._mInputState = eEscAscii
- self._mLastChar = aBuf[-1]
+ self._mLastChar = aBuf[-1:]
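To convince yourself that slicing is the right fix, here is a stripped-down sketch. LastByteTracker is a hypothetical stand-in invented for this example; it is not chardet's UniversalDetector, and it keeps only the "remember the last byte" behavior.

```python
a_buf = b'\xef\xbb\xbf'

last_item = a_buf[-1]    # indexing a byte array yields an int
last_slice = a_buf[-1:]  # slicing a byte array yields a byte array
print(last_item)   # 191
print(last_slice)  # b'\xbf'


class LastByteTracker:
    """Hypothetical stand-in for the last-byte bookkeeping in feed()."""

    def __init__(self):
        self.last_char = b''  # start as a 0-length byte array

    def feed(self, a_buf):
        # Safe on every call, because last_char is always a byte array.
        window = self.last_char + a_buf
        self.last_char = a_buf[-1:]  # slice, so it stays a byte array
        return window


tracker = LastByteTracker()
first = tracker.feed(b'ab')
second = tracker.feed(b'cd')
print(first)   # b'ab'
print(second)  # b'bcd'
```

A slice also behaves well on an empty buffer: b''[-1:] is just b'', whereas b''[-1] would raise an IndexError.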
-
ord() expected string of length 1, but int found
Tired yet? You're almost there…
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
@@ -916,37 +886,28 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\codingstatemachine.py", line 43, in next_state
byteCls = self._mModel['classTable'][ord(c)]
TypeError: ord() expected string of length 1, but int found
-
OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c defined?
-
# codingstatemachine.py
def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
byteCls = self._mModel['classTable'][ord(c)]
-
That's no help; it's just passed into the function. Let's pop the stack. -
# utf8prober.py
def feed(self, aBuf):
for c in aBuf:
codingState = self._mCodingSM.next_state(c)
-
And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That's what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there's no need to call the ord() function because c is already an int!
-
Thus: -
def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
- byteCls = self._mModel['classTable'][ord(c)]
+ byteCls = self._mModel['classTable'][c]
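A short sketch shows what iteration over a byte array gives you. The class_table here is a made-up lookup table for illustration, not chardet's actual model data.

```python
a_buf = b'\xe4\xbd\xa0'

# Iterating over a byte array yields integers, one per byte.
items = [c for c in a_buf]
print(items)  # [228, 189, 160]

# Hypothetical lookup table: class 0 for low bytes, class 1 for high bytes.
class_table = [0] * 128 + [1] * 128

# Each c is already an int, so it can index the table directly.
byte_classes = [class_table[c] for c in a_buf]
print(byte_classes)  # [1, 1, 1]

# Calling ord() on an int is what triggered the TypeError.
ord_failed = False
try:
    ord(a_buf[0])
except TypeError:
    ord_failed = True
print(ord_failed)  # True
```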
-
Searching the entire codebase for instances of "ord(c)" uncovers similar problems in sbcharsetprober.py…
-
# sbcharsetprober.py
def feed(self, aBuf):
@@ -957,18 +918,14 @@ def feed(self, aBuf):
return self.get_state()
for c in aBuf:
order = self._mModel['charToOrderMap'][ord(c)]
-
…and latin1prober.py…
-
# latin1prober.py
def feed(self, aBuf):
aBuf = self.filter_with_english_letters(aBuf)
for c in aBuf:
charClass = Latin1_CharToClass[ord(c)]
-
c is iterating over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c.
-
# sbcharsetprober.py
def feed(self, aBuf):
@@ -988,11 +945,8 @@ def feed(self, aBuf):
- charClass = Latin1_CharToClass[ord(c)]
+ charClass = Latin1_CharToClass[c]
-
int() >= str()
Let's go again.
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
@@ -1011,11 +965,8 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\jpcntx.py", line 176, in get_order
if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
TypeError: unorderable types: int() >= str()
-
Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You're making real progress here.
-
So what's this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code: -
class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
@@ -1026,9 +977,7 @@ TypeError: unorderable types: int() >= str()
charLen = 2
else:
charLen = 1
-
And where does aStr come from? Let's pop the stack: -
def feed(self, aBuf, aLen):
.
@@ -1037,13 +986,9 @@ TypeError: unorderable types: int() >= str()
i = self._mNeedToSkipCharNum
while i < aLen:
order, charLen = self.get_order(aBuf[i:i+2])
-
Oh look, it's our old friend, aBuf. As you might have guessed from every other issue we've encountered in this chapter, aBuf is a byte array. Here, the feed() method isn't just passing it on wholesale; it's slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array.
-
And what is this code trying to do with aStr? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can't compare integers and strings for inequality without explicitly coercing one of them. -
In this case, there's no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers. -
class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
@@ -1097,9 +1042,7 @@ TypeError: unorderable types: int() >= str()
+ return aStr[1] - 0xA1, charLen
return -1, charLen
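The same idea in isolation, as a sketch. The helper first_byte_in_range() is hypothetical and checks only the single 0x81 to 0x9F range that appears in the traceback; the real get_order() method does more than this.

```python
def first_byte_in_range(a_str):
    """Hypothetical helper: is the first byte between 0x81 and 0x9F?"""
    # a_str[0] is an int, so compare it against integer constants.
    return 0x81 <= a_str[0] <= 0x9F


print(first_byte_in_range(b'\x82\xa0'))  # True
print(first_byte_in_range(b'A'))         # False

# The Python 2 style comparison against a 1-character string now fails.
compare_failed = False
try:
    b'\x82\xa0'[0] >= '\x81'
except TypeError:
    compare_failed = True
print(compare_failed)  # True
```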
-
Searching the entire codebase for occurrences of the ord() function uncovers the same problem in chardistribution.py:
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
@@ -1118,9 +1061,7 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\chardistribution.py", line 174, in get_order
if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
TypeError: unorderable types: int() >= str()
-
The fix is the same: -
class EUCTWDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
@@ -1226,11 +1167,8 @@ TypeError: unorderable types: int() >= str()
+ return 94 * (aStr[0] - 0xA1) + aStr[1] - 0xA1
else:
return -1
-
'reduce' is not defined
Once more into the breach…
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
@@ -1243,20 +1181,15 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
total = reduce(operator.add, self._mFreqCounter)
NameError: global name 'reduce' is not defined
-
According to the official What's New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: "Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable."
-
OK then, let's refactor it to use a for loop.
-
def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
total = reduce(operator.add, self._mFreqCounter)
-
The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. It looks much more readable as a for loop.
-
def get_confidence(self):
if self.get_state() == constants.eNotMe:
@@ -1266,9 +1199,7 @@ NameError: global name 'reduce' is not defined
+ total = 0
+ for frequency in self._mFreqCounter:
+ total += frequency
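If you want to check that the loop really is equivalent, here is a quick sketch with made-up frequency counts. It also shows the built-in sum() function, which is not what the chapter uses but gives the same answer for a list of numbers.

```python
import functools
import operator

freq_counter = [3, 0, 5, 2]  # sample data, invented for this sketch

# The Python 3 home of reduce(), if you really need it.
total_reduce = functools.reduce(operator.add, freq_counter)

# The explicit for loop from the fix.
total_loop = 0
for frequency in freq_counter:
    total_loop += frequency

# An alternative the chapter does not use: the built-in sum().
total_sum = sum(freq_counter)

print(total_reduce, total_loop, total_sum)  # 10 10 10
```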
-
I CAN HAZ TESTZ? -
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
@@ -1304,13 +1235,9 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide
.
.
316 tests
-
Holy crap, it actually works! /me does a little dance -
What have we learned? -
2to3 tool is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It's an impressive piece of engineering, but in the end it's just an intelligent search-and-replace bot.
@@ -1318,7 +1245,6 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide
chardet works in Python 3 is because I had a test suite that exercised every line of code in the entire library. I never would have found half of these problems with manual spot-checking.
© 2001–4, 2009 ℳark Pilgrim • open standards • open content • open source
diff --git a/htmlminimizer.py b/htmlminimizer.py
new file mode 100644
index 0000000..6017266
--- /dev/null
+++ b/htmlminimizer.py
@@ -0,0 +1,22 @@
+"""Quick-and-dirty HTML minimizer"""
+
+import sys
+
+input_file = sys.argv[1]
+output_file = sys.argv[2]
+in_pre = False
+out = open(output_file, 'w')
+for line in open(input_file).readlines():
+    g = line.strip()
+    if g.count('
then
+        # on the same line, so don't do that
+        in_pre = False
+        g = line.rstrip()
+    if in_pre:
+        out.write(line)
+    else:
+        out.write(g)
+out.close()
diff --git a/index.html b/index.html
index e3b3d78..46ff1e2 100644
--- a/index.html
+++ b/index.html
@@ -3,7 +3,7 @@
Dive Into Python 3
-
+
@@ -41,15 +41,8 @@ li.todo{background:white;color:gainsboro}
chardet to Python 3
2to3
-There is a changelog, a feed, and discussion on Reddit. During development, you can download the book by cloning the Mercurial repository:
+There is a changelog, a feed, and discussion on Reddit. During development, you can download the book by cloning the Mercurial repository:
you@localhost:~$ hg clone http://hg.diveintopython3.org/ diveintopython3
The final version will be downloadable as HTML and PDF.
This site is optimized for Lynx just because fuck you.
I’m told it also looks good in graphical browsers.
© 2001–4, 2009 ℳark Pilgrim • open standards • open content • open source
-
diff --git a/native-datatypes.html b/native-datatypes.html
index 0c937e5..76f3950 100644
--- a/native-datatypes.html
+++ b/native-datatypes.html
@@ -3,7 +3,7 @@