diff --git a/about.html b/about.html
new file mode 100644
index 0000000..0a60fdc
--- /dev/null
+++ b/about.html
@@ -0,0 +1,27 @@
+About the book - Dive Into Python 3
+

About the book

+

The content of Dive Into Python 3 is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. +

The chardet library referenced in Case study: porting chardet to Python 3 is licensed under the LGPL 2.1 or later. All other example code is licensed under the MIT license. Full licensing terms are included in each source code file. +

The dynamic highlighting effects in the online edition are built on top of jQuery, which is dual-licensed under the MIT and GPL licenses. +

The online edition loads as quickly as it does because +

    +
  1. jQuery is served by Google AJAX Libraries API. +
  2. Other Javascript and CSS resources are minimized by YUI Compressor. +
  3. HTTP caching and other server-side options are optimized based on advice from YSlow. +
  4. The entire book was lovingly hand-authored in HTML 5. View-source; I typed that. +
+

© 2001–4, 2009 Mark Pilgrim

diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 1d78229..44e5462 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -19,7 +19,7 @@ body{counter-reset:h1 20}

Words, words. They’re all we have to go on.
Rosencrantz and Guildenstern are Dead

    -
  1. Introducing chardet +
  2. What is character encoding?
    1. What is character encoding auto-detection?
    2. Isn’t that impossible? @@ -35,19 +35,22 @@ body{counter-reset:h1 20}
    3. Single-byte encodings
    4. windows-1252
    -
  3. Running 2to3 -
  4. Fixing what 2to3 can’t +
  5. Running 2to3 +
  6. Fixing what 2to3 can’t
    1. False is invalid syntax -
    2. No module named constants -
    3. Name 'file' is not defined +
    4. No module named constants +
    5. Name 'file' is not defined
    6. Can’t use a string pattern on a bytes-like object -
    7. Can’t convert 'bytes' object to str implicitly -
    8. TypeError: unsupported operand type(s) for +: 'int' and 'bytes' -
    9. TypeError: ord() expected string of length 1, but int found +
    10. Can’t convert 'bytes' object to str implicitly +
    11. Unsupported operand type(s) for +: 'int' and 'bytes' +
    12. ord() expected string of length 1, but int found +
    13. Unorderable types: int() >= str() +
    14. Global name 'reduce' is not defined
    +
  7. Summary
-

Introducing chardet: a mini-FAQ

+

What is character encoding?

Usually, when people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
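
To make the "decryption key" idea concrete, here is a minimal sketch (not taken from chardet) that stores one character under three encodings and then decodes the bytes with the wrong key. The sample character and the three encodings are arbitrary choices for illustration.

```python
# One character, three encodings: the bytes that get stored differ each time.
text = "é"
encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "cp437")}
for name, raw in encoded.items():
    print(name, raw)

# Decoding with the wrong "key" does not raise here; it quietly yields the wrong text.
garbled = encoded["utf-8"].decode("latin-1")
print(garbled)
```

The last line is the reason you need to know the encoding up front: nothing fails, the text is simply wrong.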

What is character encoding auto-detection?

@@ -72,7 +75,7 @@ body{counter-reset:h1 20}

If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options.

Diving in

This is a brief guide to navigating the code itself. -

The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.) +

The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)

There are 5 categories of encodings that UniversalDetector handles:

  1. UTF-n with a BOM. This includes UTF-8, both BE and LE variants of UTF-16, and all 4 byte-order variants of UTF-32. @@ -84,23 +87,23 @@ body{counter-reset:h1 20}

    UTF-n with a BOM

    If the text starts with a BOM, we can reasonably assume that the text is encoded in UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing.
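
A BOM check of this kind can be sketched with the constants in the standard codecs module. This is an illustration only, not the code in universaldetector.py: the function name sniff_bom and the returned labels are made up for this example, and it covers only the two common UTF-32 byte orders rather than all four.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data):
    """Return a label for the BOM that starts `data`, or None if there is none."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(codecs.BOM_UTF32_LE + b"h\x00\x00\x00"))
print(sniff_bom(b"hello"))
```

The ordering of the table matters: test UTF-16 LE before UTF-32 LE and every UTF-32 LE file would be misreported.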

    Escaped encodings

    -

    If the text contains a recognizable escape sequence that might indicate an escaped encoding, UniversalDetector creates an EscCharSetProber (defined in escprober.py) and feeds it the text. -

    EscCharSetProber creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in escsm.py). EscCharSetProber feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, EscCharSetProber immediately returns the positive result to UniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines. +

    If the text contains a recognizable escape sequence that might indicate an escaped encoding, UniversalDetector creates an EscCharSetProber (defined in escprober.py) and feeds it the text. +

    EscCharSetProber creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in escsm.py). EscCharSetProber feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, EscCharSetProber immediately returns the positive result to UniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.

    Multi-byte encodings

    Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252. -

    The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller. -

    Most of the multi-byte encoding probers are inherited from MultiByteCharSetProber (defined in mbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and let MultiByteCharSetProber do the rest of the work. MultiByteCharSetProber runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, MultiByteCharSetProber feeds the text to an encoding-specific distribution analyzer. -

    The distribution analyzers (each defined in chardistribution.py) use language-specific models of which characters are used most frequently. Once MultiByteCharSetProber has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, MultiByteCharSetProber returns the result to MBCSGroupProber, which returns it to UniversalDetector, which returns it to the caller. -

    The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between EUC-JP and SHIFT_JIS, so the SJISProber (defined in sjisprober.py) also uses 2-character distribution analysis. SJISContextAnalysis and EUCJPContextAnalysis (both defined in jpcntx.py and both inheriting from a common JapaneseContextAnalysis class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to SJISProber, which checks both analyzers and returns the higher confidence level to MBCSGroupProber. +

    The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller. +

    Most of the multi-byte encoding probers are inherited from MultiByteCharSetProber (defined in mbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and let MultiByteCharSetProber do the rest of the work. MultiByteCharSetProber runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, MultiByteCharSetProber feeds the text to an encoding-specific distribution analyzer. +

    The distribution analyzers (each defined in chardistribution.py) use language-specific models of which characters are used most frequently. Once MultiByteCharSetProber has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, MultiByteCharSetProber returns the result to MBCSGroupProber, which returns it to UniversalDetector, which returns it to the caller. +

    The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between EUC-JP and SHIFT_JIS, so the SJISProber (defined in sjisprober.py) also uses 2-character distribution analysis. SJISContextAnalysis and EUCJPContextAnalysis (both defined in jpcntx.py and both inheriting from a common JapaneseContextAnalysis class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to SJISProber, which checks both analyzers and returns the higher confidence level to MBCSGroupProber.
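
The "shell that manages a group of other probers" pattern can be sketched in a few lines. This is a toy under stated assumptions, not chardet's implementation: the class names DecodeProber and GroupProber are hypothetical, and the standard library's incremental decoders stand in for chardet's state machines and distribution analyzers. It shows only the drop-on-illegal-sequence behavior, not confidence ratings.

```python
import codecs


class DecodeProber:
    """Hypothetical prober: stays active while everything fed so far decodes cleanly."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.active = True
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def feed(self, chunk):
        try:
            self._decoder.decode(chunk)
        except UnicodeDecodeError:
            self.active = False  # illegal byte sequence: drop out for good


class GroupProber:
    """Hypothetical shell that feeds each chunk to every prober still in the running."""

    def __init__(self, probers):
        self.probers = list(probers)

    def feed(self, chunk):
        for prober in self.probers:
            if prober.active:  # dropped probers are skipped on later calls
                prober.feed(chunk)

    def candidates(self):
        return [p.encoding for p in self.probers if p.active]


group = GroupProber([DecodeProber("ascii"), DecodeProber("utf-8")])
group.feed(b"plain text, ")
print(group.candidates())
group.feed(b"caf\xc3")  # first half of a two-byte UTF-8 sequence
group.feed(b"\xa9")     # second half arrives in the next call
print(group.candidates())
```

Feeding the text in pieces is the point: a prober that was ruled out by an earlier chunk never sees the later ones.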

    Single-byte encodings

    -

    The single-byte encoding prober, SBCSGroupProber (defined in sbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: windows-1251, KOI8-R, ISO-8859-5, MacCyrillic, IBM855, and IBM866 (Russian); ISO-8859-7 and windows-1253 (Greek); ISO-8859-5 and windows-1251 (Bulgarian); ISO-8859-2 and windows-1250 (Hungarian); TIS-620 (Thai); windows-1255 and ISO-8859-8 (Hebrew). -

    SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio. -

    Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text actually stored "backwards" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew). +

    The single-byte encoding prober, SBCSGroupProber (defined in sbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: windows-1251, KOI8-R, ISO-8859-5, MacCyrillic, IBM855, and IBM866 (Russian); ISO-8859-7 and windows-1253 (Greek); ISO-8859-5 and windows-1251 (Bulgarian); ISO-8859-2 and windows-1250 (Hungarian); TIS-620 (Thai); windows-1255 and ISO-8859-8 (Hebrew). +

    SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio. +

    Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored "backwards" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
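
The 2-character tallying that SingleByteCharSetProber performs can be illustrated with a toy. Everything here is an assumption made for the example: the function name pair_confidence, the five-pair "language model", and the plain hits-over-total ratio. The real probers use much larger models and a different confidence calculation.

```python
from collections import Counter

# Toy "language model": a handful of byte pairs assumed to be common (invented for this sketch).
FREQUENT_PAIRS = {b"th", b"he", b"in", b"er", b"an"}


def pair_confidence(data, frequent_pairs=FREQUENT_PAIRS):
    """Return the share of adjacent byte pairs in `data` that the model considers frequent."""
    pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    hits = sum(count for pair, count in pairs.items() if pair in frequent_pairs)
    return hits / total


print(pair_confidence(b"the"))
print(pair_confidence(b"xyz"))
```

A group prober would run one such tally per encoding-and-language model and prefer the model with the highest score.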

    windows-1252

    -

    If UniversalDetector detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a Latin1Prober (defined in latin1prober.py) to try to detect English text in a windows-1252 encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish windows-1252 is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. Latin1Prober automatically reduces its confidence rating to allow more accurate probers to win if at all possible. -

    Running 2to3

    -

    We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different modules -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic. -

    The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn. +

    If UniversalDetector detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a Latin1Prober (defined in latin1prober.py) to try to detect English text in a windows-1252 encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish windows-1252 is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. Latin1Prober automatically reduces its confidence rating to allow more accurate probers to win if at all possible. +

    Running 2to3

    +

    We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic. +

    The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.

    C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\
    @@ -569,7 +572,7 @@ RefactoringTool: chardet\sbcsgroupprober.py
     RefactoringTool: chardet\sjisprober.py
     RefactoringTool: chardet\universaldetector.py
     RefactoringTool: chardet\utf8prober.py
    -

    Now run the 2to3 script on the testing harness, test.py. +

    Now run the 2to3 script on the testing harness, test.py.

    C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py
     RefactoringTool: Skipping implicit fixer: buffer
    @@ -602,7 +605,7 @@ RefactoringTool: Skipping implicit fixer: ws_comma
     RefactoringTool: Files that were modified:
     RefactoringTool: test.py

    Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work? -

    Fixing what 2to3 can’t

    +

    Fixing what 2to3 can’t

    False is invalid syntax

    Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.

    @@ -614,7 +617,7 @@ RefactoringTool: test.py
         self.done = constants.False
         ^
     SyntaxError: invalid syntax
    -

    Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it: +

    Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it:

    import __builtin__
     if not hasattr(__builtin__, 'False'):
    @@ -624,14 +627,14 @@ else:
         False = __builtin__.False
         True = __builtin__.True

    This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary. -

    However, Python 3 will always have a Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py. -

    So this line in universaldetector.py: +

    However, Python 3 will always have a Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py. +

    So this line in universaldetector.py:

    self.done = constants.False

    Becomes

    self.done = False

    Ah, wasn’t that satisfying? The code is shorter and more readable already. -
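
If you want to convince yourself that the problem is the parser and not the constants module, a quick check (a sketch, separate from the chardet code; the helper name parses is made up) is to ask Python 3 whether False is a keyword and whether each version of the line even compiles:

```python
import keyword

# In Python 3, True and False are keywords, not ordinary names.
is_reserved = keyword.iskeyword("False")
print(is_reserved)


def parses(source):
    """Return True if the snippet compiles as Python 3 source."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


old_line_ok = parses("self.done = constants.False")  # the line 2to3 left behind
new_line_ok = parses("self.done = False")            # the fixed line
print(old_line_ok, new_line_ok)
```

Compiling does not look up any names, so the second line parses even though self is not defined anywhere in the snippet.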

    No module named constants

    -

    Time to run test.py again and see how far it gets. +

    No module named constants

    +

    Time to run test.py again and see how far it gets.

    C:\home\chardet> python test.py tests\*\*
     Traceback (most recent call last):
    @@ -640,17 +643,17 @@ else:
       File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
         import constants, sys
     ImportError: No module named constants
    -

    What’s that you say? No module named constants? Of course there’s a module named constants. ... Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead: +

    What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:

    from . import constants
    -

    But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two. +

    But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.

    The solution is to split the import statement manually. So this two-in-one import:

    import constants, sys

    Needs to become two separate imports:

    from . import constants
     import sys
    -

    There are variations of this problem scattered throughout the chardet library. In some places it’s "import constants, sys"; in other places, it’s "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import. +

    There are variations of this problem scattered throughout the chardet library. In some places it’s "import constants, sys"; in other places, it’s "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
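
Here is a small, self-contained way to watch the split import work. It builds a throwaway package in a temporary directory; the package name demo_pkg, its module names, and the THRESHOLD constant are all invented for this sketch and have nothing to do with chardet's real layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"  # hypothetical package, created only for this demo
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "constants.py").write_text("THRESHOLD = 0.95\n")
    (pkg / "detector.py").write_text(
        "from . import constants\n"  # relative import of a sibling module
        "import sys\n"               # absolute import from the standard library
        "value = constants.THRESHOLD\n"
    )

    sys.path.insert(0, tmp)
    try:
        detector = importlib.import_module("demo_pkg.detector")
    finally:
        sys.path.remove(tmp)

print(detector.value)
```

Change the first line of detector.py back to "import constants, sys" and the import fails under Python 3, because constants is then looked up as a top-level module.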

    Onward! -

    Name 'file' is not defined

    +

    Name 'file' is not defined

    And here we go again, running test.py to try to execute our test cases…

    C:\home\chardet> python test.py tests\*\*
    @@ -659,7 +662,7 @@ import sys
       File "test.py", line 9, in <module>
         for line in file(f, 'rb'):
     NameError: name 'file' is not defined
    -

    This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.) +

    This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.)
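
You can check both claims from Python 3 itself. This is just a quick sanity check, not part of the port; the variable names are arbitrary.

```python
import builtins
import io

has_file = hasattr(builtins, "file")  # the Python 2 built-in is gone
has_open = hasattr(builtins, "open")  # open() is still there
same_function = io.open is open       # and io.open is the very same object

print(has_file, has_open, same_function)
```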

    Thus, the simplest solution to the problem of the missing file() is to call open() instead:

    for line in open(f, 'rb'):

    And that’s all I have to say about that.

    @@ -682,7 +685,7 @@ TypeError: can't use a string pattern on a bytes-like object
    self._highBitDetector = re.compile(r'[\x80-\xFF]')

    This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128–255.

    And therein lies the problem. -

    In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py: +

    In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:

    def feed(self, aBuf):
         .
    @@ -690,7 +693,7 @@ TypeError: can't use a string pattern on a bytes-like object
         .
         if self._mInputState == ePureAscii:
             if self._highBitDetector.search(aBuf):
    -

    And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py. +

    And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.

    u = UniversalDetector()
     .
    @@ -698,18 +701,39 @@ TypeError: can't use a string pattern on a bytes-like object
     .
     for line in open(f, 'rb'):
         u.feed(line)
    -

    And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don’t have characters; we have bytes. Oops. +

    And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
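
The difference the 'b' flag makes is easy to see on a scratch file. This sketch writes one line to a temporary file and reads it back both ways; the file contents and variable names are made up, and the text-mode read passes an explicit encoding so the result does not depend on the system default.

```python
import os
import tempfile

# Write one UTF-8 encoded line to a scratch file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as handle:
    handle.write("caf\u00e9\n".encode("utf-8"))
    path = handle.name

try:
    with open(path, "rb") as binary_file:  # binary mode: each line is a bytes object
        raw_line = next(iter(binary_file))
    with open(path, "r", encoding="utf-8") as text_file:  # text mode: each line is a str
        text_line = next(iter(text_file))
finally:
    os.remove(path)

print(type(raw_line), raw_line)
print(type(text_line), text_line)
```

The binary read hands back six bytes, the text read hands back five characters, and only the first is what test.py passes to UniversalDetector.feed().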

    What we need this regular expression to search is not an array of characters, but an array of bytes. -

    Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this: -

    self._highBitDetector = re.compile(r'[\x80-\xFF]')
    -

    We now have this: -

    self._highBitDetector = re.compile(b'[\x80-\xFF]')
    -

    There is one other case of this same problem, on the very next line: -

    self._escDetector = re.compile(r'(\033|~{)')
    -

    Again, this is going to be used to search a byte array (the same aBuf variable, in fact), so the regular expression pattern needs to be defined as a byte array: -

    self._escDetector = re.compile(b'(\033|~{)')
    -

    Can't convert 'bytes' object to str implicitly

    -

    Curiouser and curiouser... +

    Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.) + +

    skip over this code listing +

      class UniversalDetector:
    +      def __init__(self):
    +-         self._highBitDetector = re.compile(r'[\x80-\xFF]')
    +-         self._escDetector = re.compile(r'(\033|~{)')
    ++         self._highBitDetector = re.compile(b'[\x80-\xFF]')
    ++         self._escDetector = re.compile(b'(\033|~{)')
    +          self._mEscCharSetProber = None
    +          self._mCharSetProbers = []
    +          self.reset()
    +
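The rule that string patterns search strings and byte patterns search byte arrays can be checked in isolation. This is a small sketch with a made-up buffer, not code from chardet; the variable names are assumptions.

```python
import re

# The same character class, once as a str pattern and once as a bytes pattern.
str_pattern = re.compile(r'[\x80-\xFF]')
bytes_pattern = re.compile(b'[\x80-\xFF]')

# Hypothetical buffer: three ASCII bytes followed by one high-bit byte.
data = b'caf\xe9'

# A str pattern refuses to search a bytes object.
try:
    str_pattern.search(data)
    str_on_bytes_error = None
except TypeError as e:
    str_on_bytes_error = e

# A bytes pattern searches it without complaint.
match = bytes_pattern.search(data)
print(match.start(), match.group())
```

The TypeError raised by the first search is the same family of error as the "Can't use a string pattern on a bytes-like object" failure discussed above.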

    Searching the entire codebase for other uses of the re module turns up two more instances, in charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays. + +

    skip over this code listing +

      class CharSetProber:
    +      .
    +      .
    +      .
    +      def filter_high_bit_only(self, aBuf):
    +-         aBuf = re.sub(r'([\x00-\x7F])+', ' ', aBuf)
    ++         aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
    +          return aBuf
    +    
    +      def filter_without_english_letters(self, aBuf):
    +-         aBuf = re.sub(r'([A-Za-z])+', ' ', aBuf)
    ++         aBuf = re.sub(b'([A-Za-z])+', b' ', aBuf)
    +          return aBuf
    + +
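Note that the replacement argument changed along with the pattern. A minimal sketch of the same substitution, run on a made-up buffer (the name buf and its contents are assumptions):

```python
import re

# Hypothetical buffer: ASCII text around the three bytes of one UTF-8 character.
buf = b'abc\xe4\xb8\xadxyz'

# Pattern, replacement, and input are all bytes, so the result is bytes too.
# Each run of 7-bit bytes collapses into a single space.
high_bit_only = re.sub(b'([\x00-\x7F])+', b' ', buf)
print(high_bit_only)
```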

    Can't convert 'bytes' object to str implicitly

    +

    Curiouser and curiouser…

    skip over this

    C:\home\chardet> python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
    @@ -740,7 +764,7 @@ TypeError: Can't convert 'bytes' object to str implicitly

    Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you're thinking that the search() method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it's trying to construct the value that it will eventually pass to the search() method. -

    We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It's an instance variable, defined in the reset() method, which is actually called from the __init__() method. +

    We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It's an instance variable, defined in the reset() method, which is actually called from the __init__() method.

    skip over this code listing

    class UniversalDetector:
    @@ -775,6 +799,7 @@ TypeError: Can't convert 'bytes' object to str implicitly

    The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it's needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus: +

    skip over this code listing

      def reset(self):
           .
           .
    @@ -782,7 +807,28 @@ TypeError: Can't convert 'bytes' object to str implicitly
    - self._mLastChar = '' + self._mLastChar = b'' -
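The failing + operation is easy to reproduce on its own. In this sketch the variable names are stand-ins for self._mLastChar and aBuf, and the buffer contents are invented. Recent Python 3 releases word the error message differently from the traceback shown in this chapter, but it is still a TypeError.

```python
# Stand-ins for self._mLastChar (before the fix) and aBuf.
last_char_str = ''
a_buf = b'\xef\xbb\xbf'

# str + bytes is never allowed in Python 3, even when the str is empty.
try:
    last_char_str + a_buf
    concat_error = None
except TypeError as e:
    concat_error = e

# After the fix, both operands are bytes and concatenation works.
last_char_bytes = b''
combined = last_char_bytes + a_buf
print(combined)
```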

    TypeError: unsupported operand type(s) for +: 'int' and 'bytes'

    +

    Searching the entire codebase for "mLastChar" turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers. + +

    skip over this code listing +

    
    +  class MultiByteCharSetProber(CharSetProber):
    +      def __init__(self):
    +          CharSetProber.__init__(self)
    +          self._mDistributionAnalyzer = None
    +          self._mCodingSM = None
    +-         self._mLastChar = ['\x00', '\x00']
    ++         self._mLastChar = [0, 0]
    +
    +      def reset(self):
    +          CharSetProber.reset(self)
    +          if self._mCodingSM:
    +              self._mCodingSM.reset()
    +          if self._mDistributionAnalyzer:
    +              self._mDistributionAnalyzer.reset()
    +-         self._mLastChar = ['\x00', '\x00']
    ++         self._mLastChar = [0, 0]
    + +

    Unsupported operand type(s) for +: 'int' and 'bytes'

    I have good news, and I have bad news. The good news is we're making progress… @@ -798,7 +844,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'

    …The bad news is it doesn't always feel like progress. -

    But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int? +

    But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int?

    The answer lies not in the previous lines of code, but in the following lines. @@ -834,7 +880,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes' >>> mLastChar + aBuf b'\xbf\xef\xbb\xbf'

      -
    1. Define a byte array of 3 bytes. +
    2. Define a byte array of length 3.
    3. The last element of the byte array is 191.
    4. That's an integer.
    5. Concatenating an integer with a byte array doesn't work. You've now replicated the error you just found in universaldetector.py. @@ -850,7 +896,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes' - self._mLastChar = aBuf[-1] + self._mLastChar = aBuf[-1:] -
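The difference between indexing and slicing is the whole fix, so it is worth seeing side by side. A short sketch using the same three bytes as the interactive session above; the variable names are assumptions.

```python
a_buf = b'\xef\xbb\xbf'

# Indexing a byte array yields an int.
last_item = a_buf[-1]

# Slicing a byte array yields a byte array, even when the slice has length 1.
last_slice = a_buf[-1:]

# Only the slice can be concatenated with another byte array.
combined = last_slice + a_buf
print(last_item, last_slice, combined)
```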

      TypeError: ord() expected string of length 1, but int found

      +

      ord() expected string of length 1, but int found

      Tired yet? You're almost there… @@ -871,7 +917,7 @@ tests\Big5\0804.blogspot.com.xml byteCls = self._mModel['classTable'][ord(c)] TypeError: ord() expected string of length 1, but int found -

      FIXME +

      OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c defined?

      skip over this code listing

      # codingstatemachine.py
      @@ -880,7 +926,7 @@ def next_state(self, c):
           # if it is first byte, we also get byte length
           byteCls = self._mModel['classTable'][ord(c)]
      -

      FIXME [aBuf is a byte array, so c is an int, not a 1-character string. IOW, there's no need to call the ord() function because c is already an int!] +

      That's no help; it's just passed into the function. Let's pop the stack.

      skip over this code listing

      # utf8prober.py
      @@ -888,11 +934,64 @@ def feed(self, aBuf):
           for c in aBuf:
               codingState = self._mCodingSM.next_state(c)
      -

      FIXME [wrapup or deleteme] +

      And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That's what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there's no need to call the ord() function because c is already an int! -

      TypeError: unorderable types: int() >= str()

      +

      Thus: -

      FIXME [let's go again] +

      skip over this code listing +

        def next_state(self, c):
      +      # for each byte we get its class
      +      # if it is first byte, we also get byte length
      +-     byteCls = self._mModel['classTable'][ord(c)]
      ++     byteCls = self._mModel['classTable'][c]
      + +
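Here is the same point as a standalone sketch. The buffer and the class_table list are invented stand-ins for aBuf and self._mModel['classTable']; the real table in chardet holds different values.

```python
# Hypothetical buffer: two ASCII bytes and one high-bit byte.
a_buf = b'AB\xe9'

# Iterating over a byte array yields ints, one per byte.
elements = [c for c in a_buf]

# Hypothetical 256-entry table: class 0 for 7-bit bytes, class 1 for the rest.
class_table = [0] * 128 + [1] * 128

# Each int can index the table directly; no ord() call is needed.
classes = [class_table[c] for c in a_buf]

# Calling ord() on an int is exactly the error from the traceback.
try:
    ord(a_buf[0])
    ord_error = None
except TypeError as e:
    ord_error = e

print(elements, classes)
```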

      Searching the entire codebase for instances of "ord(c)" uncovers similar problems in sbcharsetprober.py… + +

      skip over this code listing +

      # sbcharsetprober.py
      +def feed(self, aBuf):
      +    if not self._mModel['keepEnglishLetter']:
      +        aBuf = self.filter_without_english_letters(aBuf)
      +    aLen = len(aBuf)
      +    if not aLen:
      +        return self.get_state()
      +    for c in aBuf:
      +        order = self._mModel['charToOrderMap'][ord(c)]
      + +

      …and latin1prober.py… + +

      skip over this code listing +

      # latin1prober.py
      +def feed(self, aBuf):
      +    aBuf = self.filter_with_english_letters(aBuf)
      +    for c in aBuf:
      +        charClass = Latin1_CharToClass[ord(c)]
      + +

      c is iterating over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c. + +

      skip over this code listing +

        # sbcharsetprober.py
      +  def feed(self, aBuf):
      +      if not self._mModel['keepEnglishLetter']:
      +          aBuf = self.filter_without_english_letters(aBuf)
      +      aLen = len(aBuf)
      +      if not aLen:
      +          return self.get_state()
      +      for c in aBuf:
      +-         order = self._mModel['charToOrderMap'][ord(c)]
      ++         order = self._mModel['charToOrderMap'][c]
      +
      +  # latin1prober.py
      +  def feed(self, aBuf):
      +      aBuf = self.filter_with_english_letters(aBuf)
      +      for c in aBuf:
      +-         charClass = Latin1_CharToClass[ord(c)]
      ++         charClass = Latin1_CharToClass[c]
      +
      + +

      Unorderable types: int() >= str()

      + +

      Let's go again.

      skip over this command output listing

      C:\home\chardet> python test.py tests\*\*
      @@ -913,8 +1012,313 @@ tests\Big5\0804.blogspot.com.xml
           if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
       TypeError: unorderable types: int() >= str()
      -

      FIXME +

      Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You're making real progress here. -

      © 2001–4, 2009 Mark Pilgrim, CC-BY-SA-3.0 +

      So what's this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code: + +

      skip over this code listing +

      class SJISContextAnalysis(JapaneseContextAnalysis):
      +    def get_order(self, aStr):
      +        if not aStr: return -1, 1
      +        # find out current char's byte length
      +        if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
      +           ((aStr[0] >= '\xE0') and (aStr[0] <= '\xFC')):
      +            charLen = 2
      +        else:
      +            charLen = 1
      + +

      And where does aStr come from? Let's pop the stack: + +

      skip over this code listing +

      def feed(self, aBuf, aLen):
      +    .
      +    .
      +    .
      +    i = self._mNeedToSkipCharNum
      +    while i < aLen:
      +        order, charLen = self.get_order(aBuf[i:i+2])
      + +

      Oh look, it's our old friend, aBuf. As you might have guessed from every other issue we've encountered in this chapter, aBuf is a byte array. Here, the feed() method isn't just passing it on wholesale; it's slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array. + +

      And what is this code trying to do with aStr? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can't compare integers and strings for inequality without explicitly coercing one of them. + +

      In this case, there's no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers. + +

      skip over this code listing +

        class SJISContextAnalysis(JapaneseContextAnalysis):
      +      def get_order(self, aStr):
      +          if not aStr: return -1, 1
      +          # find out current char's byte length
      +-         if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
      +-            ((aStr[0] >= '\xE0') and (aStr[0] <= '\xFC')):
      ++         if ((aStr[0] >= 0x81) and (aStr[0] <= 0x9F)) or \
      ++            ((aStr[0] >= 0xE0) and (aStr[0] <= 0xFC)):
      +              charLen = 2
      +          else:
      +              charLen = 1
      +
      +          # return its order if it is hiragana
      +          if len(aStr) > 1:
      +-             if (aStr[0] == '\202') and \
      +-                (aStr[1] >= '\x9F') and \
      +-                (aStr[1] <= '\xF1'):
      +-                return ord(aStr[1]) - 0x9F, charLen
      ++             if (aStr[0] == 0x82) and \
      ++                (aStr[1] >= 0x9F) and \
      ++                (aStr[1] <= 0xF1):
      ++                return aStr[1] - 0x9F, charLen
      +
      +          return -1, charLen
      +
      +  class EUCJPContextAnalysis(JapaneseContextAnalysis):
      +      def get_order(self, aStr):
      +          if not aStr: return -1, 1
      +          # find out current char's byte length
      +-         if (aStr[0] == '\x8E') or \
      +-           ((aStr[0] >= '\xA1') and (aStr[0] <= '\xFE')):
      ++         if (aStr[0] == 0x8E) or \
      ++           ((aStr[0] >= 0xA1) and (aStr[0] <= 0xFE)):
      +              charLen = 2
      +-         elif aStr[0] == '\x8F':
      ++         elif aStr[0] == 0x8F:
      +              charLen = 3
      +          else:
      +              charLen = 1
      +
      +        # return its order if it is hiragana
      +        if len(aStr) > 1:
      +-           if (aStr[0] == '\xA4') and \
      +-              (aStr[1] >= '\xA1') and \
      +-              (aStr[1] <= '\xF3'):
      +-                 return ord(aStr[1]) - 0xA1, charLen
      ++           if (aStr[0] == 0xA4) and \
      ++              (aStr[1] >= 0xA1) and \
      ++              (aStr[1] <= 0xF3):
      ++               return aStr[1] - 0xA1, charLen
      +
      +        return -1, charLen
      + +
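When converting these constants by hand, it helps to check each one in the interactive shell. The sketch below uses an invented two-byte buffer and assumed variable names. It also shows that '\202' is an octal escape, so its integer value is 0x82 (130).

```python
# Hypothetical two-byte buffer, used here only as sample data.
a_str = b'\x82\xa0'
first = a_str[0]  # indexing a byte array yields an int

# Ordering an int against a 1-character str raises TypeError in Python 3.
try:
    first >= '\x81'
    comparison_error = None
except TypeError as e:
    comparison_error = e

# Comparing against integer constants works.
in_range = 0x81 <= first <= 0x9F

# '\202' is an octal escape: 0o202 == 130 == 0x82.
octal_value = ord('\202')
print(in_range, hex(octal_value))
```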

      Searching the entire codebase for occurrences of the ord() function uncovers the same problem in chardistribution.py: + +

      skip over this command output listing +

      C:\home\chardet> python test.py tests\*\*
      +tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
      +tests\Big5\0804.blogspot.com.xml
      +Traceback (most recent call last):
      +  File "test.py", line 10, in <module>
      +    u.feed(line)
      +  File "C:\home\chardet\chardet\universaldetector.py", line 117, in feed
      +    if prober.feed(aBuf) == constants.eFoundIt:
      +  File "C:\home\chardet\chardet\charsetgroupprober.py", line 60, in feed
      +    st = prober.feed(aBuf)
      +  File "C:\home\chardet\chardet\sjisprober.py", line 72, in feed
      +    self._mDistributionAnalyzer.feed(aBuf[i - 1 : i + 1], charLen)
      +  File "C:\home\chardet\chardet\chardistribution.py", line 56, in feed
      +    order = self.get_order(aStr)
      +  File "C:\home\chardet\chardet\chardistribution.py", line 174, in get_order
      +    if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
      +TypeError: unorderable types: int() >= str()
      + +

      The fix is the same: + +

      skip over this code listing +

        class EUCTWDistributionAnalysis(CharDistributionAnalysis):
      +      def __init__(self):
      +          CharDistributionAnalysis.__init__(self)
      +          self._mCharToFreqOrder = EUCTWCharToFreqOrder
      +          self._mTableSize = EUCTW_TABLE_SIZE
      +          self._mTypicalDistributionRatio = EUCTW_TYPICAL_DISTRIBUTION_RATIO
      +
      +      def get_order(self, aStr):
      +-         if aStr[0] >= '\xC4':
      +-             return 94 * (ord(aStr[0]) - 0xC4) + ord(aStr[1]) - 0xA1
      ++         if aStr[0] >= 0xC4:
      ++             return 94 * (aStr[0] - 0xC4) + aStr[1] - 0xA1
      +          else:
      +              return -1
      +
      +  class EUCKRDistributionAnalysis(CharDistributionAnalysis):
      +      def __init__(self):
      +          CharDistributionAnalysis.__init__(self)
      +          self._mCharToFreqOrder = EUCKRCharToFreqOrder
      +          self._mTableSize = EUCKR_TABLE_SIZE
      +          self._mTypicalDistributionRatio = EUCKR_TYPICAL_DISTRIBUTION_RATIO
      +
      +      def get_order(self, aStr):
      +-         if aStr[0] >= '\xB0':
      +-             return 94 * (ord(aStr[0]) - 0xB0) + ord(aStr[1]) - 0xA1
      ++         if aStr[0] >= 0xB0:
      ++             return 94 * (aStr[0] - 0xB0) + aStr[1] - 0xA1
      +          else:
      +              return -1;
      +
      +  class GB2312DistributionAnalysis(CharDistributionAnalysis):
      +      def __init__(self):
      +          CharDistributionAnalysis.__init__(self)
      +          self._mCharToFreqOrder = GB2312CharToFreqOrder
      +          self._mTableSize = GB2312_TABLE_SIZE
      +          self._mTypicalDistributionRatio = GB2312_TYPICAL_DISTRIBUTION_RATIO
      +
      +      def get_order(self, aStr):
      +-         if (aStr[0] >= '\xB0') and (aStr[1] >= '\xA1'):
      +-             return 94 * (ord(aStr[0]) - 0xB0) + ord(aStr[1]) - 0xA1
      ++         if (aStr[0] >= 0xB0) and (aStr[1] >= 0xA1):
      ++             return 94 * (aStr[0] - 0xB0) + aStr[1] - 0xA1
      +          else:
      +              return -1;
      +
      +  class Big5DistributionAnalysis(CharDistributionAnalysis):
      +      def __init__(self):
      +          CharDistributionAnalysis.__init__(self)
      +          self._mCharToFreqOrder = Big5CharToFreqOrder
      +          self._mTableSize = BIG5_TABLE_SIZE
      +          self._mTypicalDistributionRatio = BIG5_TYPICAL_DISTRIBUTION_RATIO
      +
      +      def get_order(self, aStr):
      +-         if aStr[0] >= '\xA4':
      +-             if aStr[1] >= '\xA1':
      +-                 return 157 * (ord(aStr[0]) - 0xA4) + ord(aStr[1]) - 0xA1 + 63
      ++         if aStr[0] >= 0xA4:
      ++             if aStr[1] >= 0xA1:
      ++                 return 157 * (aStr[0] - 0xA4) + aStr[1] - 0xA1 + 63
      +              else:
      +-                 return 157 * (ord(aStr[0]) - 0xA4) + ord(aStr[1]) - 0x40
      ++                 return 157 * (aStr[0] - 0xA4) + aStr[1] - 0x40
      +          else:
      +              return -1
      +
      +  class SJISDistributionAnalysis(CharDistributionAnalysis):
      +      def __init__(self):
      +          CharDistributionAnalysis.__init__(self)
      +          self._mCharToFreqOrder = JISCharToFreqOrder
      +          self._mTableSize = JIS_TABLE_SIZE
      +          self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
      +
      +      def get_order(self, aStr):
      +-         if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
      +-             order = 188 * (ord(aStr[0]) - 0x81)
      +-         elif (aStr[0] >= '\xE0') and (aStr[0] <= '\xEF'):
      +-             order = 188 * (ord(aStr[0]) - 0xE0 + 31)
      ++         if (aStr[0] >= 0x81) and (aStr[0] <= 0x9F):
      ++             order = 188 * (aStr[0] - 0x81)
      ++         elif (aStr[0] >= 0xE0) and (aStr[0] <= 0xEF):
      ++             order = 188 * (aStr[0] - 0xE0 + 31)
      +          else:
      +              return -1;
      +-         order = order + ord(aStr[1]) - 0x40
      +-         if aStr[1] > '\x7F':
      ++         order = order + aStr[1] - 0x40
      ++         if aStr[1] > 0x7F:
      +              order = -1
      +          return order
      +
      +  class EUCJPDistributionAnalysis(CharDistributionAnalysis):
      +      def __init__(self):
      +          CharDistributionAnalysis.__init__(self)
      +          self._mCharToFreqOrder = JISCharToFreqOrder
      +          self._mTableSize = JIS_TABLE_SIZE
      +          self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
      +
      +      def get_order(self, aStr):
      +-         if aStr[0] >= '\xA0':
      +-             return 94 * (ord(aStr[0]) - 0xA1) + ord(aStr[1]) - 0xA1
      ++         if aStr[0] >= 0xA0:
      ++             return 94 * (aStr[0] - 0xA1) + aStr[1] - 0xA1
      +          else:
      +              return -1
      + +

      Global name 'reduce' is not defined

      + +

      Once more into the breach… + +

      skip over this command output listing +

      C:\home\chardet> python test.py tests\*\*
      +tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
      +tests\Big5\0804.blogspot.com.xml
      +Traceback (most recent call last):
      +  File "test.py", line 12, in <module>
      +    u.close()
      +  File "C:\home\chardet\chardet\universaldetector.py", line 141, in close
      +    proberConfidence = prober.get_confidence()
      +  File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
      +    total = reduce(operator.add, self._mFreqCounter)
      +NameError: global name 'reduce' is not defined
      + +

      According to the official What's New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: "Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable." + +

      OK then, let's refactor it to use a for loop. + +

      skip over this code listing +

      def get_confidence(self):
      +    if self.get_state() == constants.eNotMe:
      +        return 0.01
      +  
      +    total = reduce(operator.add, self._mFreqCounter)
      + +

      The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. It looks much more readable as a for loop. + +

      skip over this code listing +

        def get_confidence(self):
      +      if self.get_state() == constants.eNotMe:
      +          return 0.01
      +  
      +-     total = reduce(operator.add, self._mFreqCounter)
      ++     total = 0
      ++     for frequency in self._mFreqCounter:
      ++         total += frequency
      + +
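To confirm that the refactoring does not change the result, here is a small comparison on a made-up list; freq_counter stands in for self._mFreqCounter. The built-in sum() is shown as a third option, which is my suggestion rather than something the chapter uses.

```python
import functools
import operator

# Hypothetical frequency counts standing in for self._mFreqCounter.
freq_counter = [3, 0, 7, 1]

# The Python 2 idiom, still available from the functools module.
total_reduce = functools.reduce(operator.add, freq_counter)

# The explicit for loop used in the ported code.
total_loop = 0
for frequency in freq_counter:
    total_loop += frequency

# For plain addition, the built-in sum() gives the same result.
total_sum = sum(freq_counter)

print(total_reduce, total_loop, total_sum)
```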

      I CAN HAZ TESTZ? + +

      skip over this command output listing +

      C:\home\chardet> python test.py tests\*\*
      +tests\ascii\howto.diveintomark.org.xml                       ascii with confidence 1.0
      +tests\Big5\0804.blogspot.com.xml                             Big5 with confidence 0.99
      +tests\Big5\blog.worren.net.xml                               Big5 with confidence 0.99
      +tests\Big5\carbonxiv.blogspot.com.xml                        Big5 with confidence 0.99
      +tests\Big5\catshadow.blogspot.com.xml                        Big5 with confidence 0.99
      +tests\Big5\coolloud.org.tw.xml                               Big5 with confidence 0.99
      +tests\Big5\digitalwall.com.xml                               Big5 with confidence 0.99
      +tests\Big5\ebao.us.xml                                       Big5 with confidence 0.99
      +tests\Big5\fudesign.blogspot.com.xml                         Big5 with confidence 0.99
      +tests\Big5\kafkatseng.blogspot.com.xml                       Big5 with confidence 0.99
      +tests\Big5\ke207.blogspot.com.xml                            Big5 with confidence 0.99
      +tests\Big5\leavesth.blogspot.com.xml                         Big5 with confidence 0.99
      +tests\Big5\letterlego.blogspot.com.xml                       Big5 with confidence 0.99
      +tests\Big5\linyijen.blogspot.com.xml                         Big5 with confidence 0.99
      +tests\Big5\marilynwu.blogspot.com.xml                        Big5 with confidence 0.99
      +tests\Big5\myblog.pchome.com.tw.xml                          Big5 with confidence 0.99
      +tests\Big5\oui-design.com.xml                                Big5 with confidence 0.99
      +tests\Big5\sanwenji.blogspot.com.xml                         Big5 with confidence 0.99
      +tests\Big5\sinica.edu.tw.xml                                 Big5 with confidence 0.99
      +tests\Big5\sylvia1976.blogspot.com.xml                       Big5 with confidence 0.99
      +tests\Big5\tlkkuo.blogspot.com.xml                           Big5 with confidence 0.99
      +tests\Big5\tw.blog.xubg.com.xml                              Big5 with confidence 0.99
      +tests\Big5\unoriginalblog.com.xml                            Big5 with confidence 0.99
      +tests\Big5\upsaid.com.xml                                    Big5 with confidence 0.99
      +tests\Big5\willythecop.blogspot.com.xml                      Big5 with confidence 0.99
      +tests\Big5\ytc.blogspot.com.xml                              Big5 with confidence 0.99
      +tests\EUC-JP\aivy.co.jp.xml                                  EUC-JP with confidence 0.99
      +tests\EUC-JP\akaname.main.jp.xml                             EUC-JP with confidence 0.99
      +tests\EUC-JP\arclamp.jp.xml                                  EUC-JP with confidence 0.99
      +.
      +.
      +.
      +316 tests
      + +

      Holy crap, it actually works! /me does a little dance + +

      Summary

      + +

      What have we learned? + +

        +
      1. Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There's no way around it. It's hard. +
      2. The automated 2to3 tool is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It's an impressive piece of engineering, but in the end it's just an intelligent search-and-replace bot. +
      3. The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the chardet library is to convert a stream of bytes into a string. But “a stream of bytes” comes up more often than you might think. Reading a file in “binary” mode? You'll get a stream of bytes. Fetching a web page? Calling a web API? They return a stream of bytes, too. +
      4. You need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere. +
      5. Test cases are essential. Don't port anything without them. Don't even try. The only reason I have any confidence at all that chardet works in Python 3 is that I had a test suite that exercised every line of code in the entire library. I never would have found half of these problems with manual spot-checking. +
      + +

      © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/chardet/chardet/latin1prober.py b/chardet/chardet/latin1prober.py index 1ab0f0e..7296fb9 100644 --- a/chardet/chardet/latin1prober.py +++ b/chardet/chardet/latin1prober.py @@ -123,7 +123,7 @@ class Latin1Prober(CharSetProber): if self.get_state() == constants.eNotMe: return 0.01 - total = 0.0 + total = 0 for frequency in self._mFreqCounter: total += frequency if total < 0.01: diff --git a/chardet/python3-conversion-notes.txt b/chardet/python3-conversion-notes.txt index 5f74a32..0e0346f 100644 --- a/chardet/python3-conversion-notes.txt +++ b/chardet/python3-conversion-notes.txt @@ -1,30 +1,34 @@ * python 2to3.py -w test.py (the -w flag makes a backup then overwrites the original file) * python 2to3.py -w chardet/ directory (passing a directory acts on all .py files in the directory) + * global search-and-replace constants.False --> False, constants.True --> True (unnecessary, Python3 always defines a Boolean type) * constants.py: remove code for defining True and False + * universaldetector.py, charsetgroupprober.py, charsetprober.py, escprober.py, eucjpprober.py, mbcharsetprober.py, sbcharsetprober.py, sbcsgroupprober.py, sjisprober.py, utf8prober.py: manually fix import statements that 2to3 missed old: import constants, sys new: from . 
import constants import sys + * test.py: change file() to open() + * universaldetector.py: change r'' strings to b'' byte arrays in self._highBitDetector, self._escDetector regular expressions +* charsetprober.py: change regular expression-based replace to use b'' byte arrays instead of strings * universaldetector.py: change self._mLastChar from a '' string to a b'' byte array -* universaldetector.py: getting a single element from a byte array yields an integer, not a byte, so change syntax to make sure we self._mLastChar is always a byte +* mbcharsetprober.py: change self._mLastChar from a list of two 1-character strings to a list of two ints + +* universaldetector.py: getting a single element from a byte array yields an integer, not a byte, so change syntax to make sure self._mLastChar is always a byte old: self._mLastChar = aBuf[-1] new: self._mLastChar = aBuf[-1:] -- jpcntx.py, chardistribution.py: change 1-character strings to ints and hex ints, since we're just comparing ints to ints anyway -- jpcntx.py, chardistribution.py: change ord(aBuf[0]) to aBuf[0] since it's already an int (iterating through a byte array) -- jpcntx.py, chardistribution.py (editorial): global search-and-replace "aStr" --> "aBuf" to make it clear that we're passing around a byte array - sbcharsetprober.py, latin1prober.py: change ord(c) to c since it's already an int (iterating through a byte array) -- (not sure where this fits) mbcharsetprober.py: change self._mLastChar from a list of two 1-character strings to a list of two ints - -- (not sure where this fits) charsetprober.py: change regular expression-based replace to use b'' byte arrays instead of strings +* jpcntx.py, chardistribution.py: change 1-character strings to ints and hex ints, since we're just comparing ints to ints anyway +* jpcntx.py, chardistribution.py: change ord(aBuf[0]) to aBuf[0] since it's already an int (iterating through a byte array) +X jpcntx.py, chardistribution.py (editorial): global search-and-replace 
"aStr" --> "aBuf" to make it clear that we're passing around a byte array - latin1prober.py: refactor reduce(operator.add, ...) to use a for loop instead diff --git a/dip3.css b/dip3.css index 1ddd8cc..64d0177 100644 --- a/dip3.css +++ b/dip3.css @@ -34,7 +34,7 @@ pre{white-space:pre-wrap;padding-left:2.154em;line-height:2.154;border-left:1px pre a,.widgets a{padding:0.4375em 0;border:0} .widgets a{text-decoration:underline} pre a:hover{border:0} -kbd{font-weight:bold} +kbd,mark{font-weight:bold} .prompt{color:#667} ins,del,mark{text-decoration:none;font-style:normal;display:inline-block;width:100%;line-height:2.154} del{background:salmon} diff --git a/index.html b/index.html index 505a0ad..e3b3d78 100644 --- a/index.html +++ b/index.html @@ -45,7 +45,7 @@ li.todo{background:white;color:gainsboro}

      you@localhost:~$ hg clone http://hg.diveintopython3.org/ diveintopython3

      The final version will be downloadable as HTML and PDF.

      This site is optimized for Lynx just because fuck you.
      I’m told it also looks good in graphical browsers. -

      © 2001–4, 2009 Mark Pilgrim, CC-BY-SA-3.0 +

      © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source -

      © 2001–4, 2009 Mark Pilgrim, CC-BY-SA-3.0 +

      © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source diff --git a/your-first-python-program.html b/your-first-python-program.html index 9513797..a9d4cce 100644 --- a/your-first-python-program.html +++ b/your-first-python-program.html @@ -249,6 +249,6 @@ if __name__ == "__main__":

    6. PEP 8: Style Guide for Python Code discusses good indentation style.
    7. Python Reference Manual explains what it means to say that everything in Python is an object, because some people are pedantic and like to discuss that sort of thing at great length. -

      © 2001–4, 2009 Mark Pilgrim, CC-BY-SA-3.0 +

      © 2001–4, 2009 Mark Pilgrim • open standards • open content • open source