diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index a2ca212..8b726b5 100644 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -11,35 +11,132 @@ body{counter-reset:h1 19}

Case study: porting chardet to Python 3

-
    -
  1. Diving in -
  2. Running 2to3 -
  3. False is invalid syntax -
  4. No module named constants -
  5. Name 'file' is not defined -
  6. Can't use a string pattern on a bytes-like object -
  7. Can't convert 'bytes' object to str implicitly +
      +
    1. Introducing chardet: a mini-FAQ +
        +
      1. What is character encoding auto-detection? +
      2. Isn't that impossible? +
      3. Who wrote this detection algorithm? +
      4. Yippie! Screw the standards, I'll just auto-detect everything! +
      5. Why bother with auto-detection if it's slow, inaccurate, and non-standard? +
      +
    2. Diving in +
        +
      1. UTF-n with a BOM +
      2. Escaped encodings +
      3. Multi-byte encodings +
      4. Single-byte encodings +
      5. windows-1252 +
      +
    3. Running 2to3 +
    4. Fixing what 2to3 can't +
        +
      1. False is invalid syntax +
      2. No module named constants +
      3. Name 'file' is not defined +
      4. Can't use a string pattern on a bytes-like object +
      5. Can't convert 'bytes' object to str implicitly +
    -
    +

    Introducing chardet: a mini-FAQ

    -

    Diving in

    +

    When you think of “text”, you probably think of “characters and symbols I see on my computer screen”. But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. -

    FIXME intro

    +

    In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever). -

    ...

    +

    What is character encoding auto-detection?

    -
    +

    It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It's like cracking a code when you don't have the decryption key. -

    +

    Isn't that impossible?

    -

    Running 2to3

    +

    In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language. +

    In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings. -

    We're going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script to help with this, called 2to3. 2to3 takes your actual Python 2 source code as input, and auto-converts as much as it can to Python 3. [FIXME reference 2to3 chapter once it's done]

    +

    Who wrote this detection algorithm?

    -

    The chardet library is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.

    +

    This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors' comments, which are quite extensive and informative. -

    +

    You may also be interested in the research paper which led to the Mozilla implementation, A composite approach to language/encoding detection. + +

    Yippie! Screw the standards, I'll just auto-detect everything!

    + +

    Don't do that. Virtually every format and protocol contains a method for specifying character encoding. + +

    + +

    If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.) + +

    Despite the complexity, it's worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards. + +

    Why bother with auto-detection if it's slow, inaccurate, and non-standard?

    + +

    Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn't work. There are also some poorly designed standards that have no way to specify encoding at all. + +

    If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options. + +

    Diving in

    + +

    This is a brief guide to navigating the code itself. + +

    The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that's really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.) + +

    There are 5 categories of encodings that UniversalDetector handles: + +

      +
    1. UTF-n with a BOM. This includes UTF-8, both BE and LE variants of UTF-16, and all 4 byte-order variants of UTF-32. +
    2. Escaped encodings, which are entirely 7-bit ASCII compatible, where non-ASCII characters start with an escape sequence. Examples: ISO-2022-JP (Japanese) and HZ-GB-2312 (Chinese). +
    3. Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: Big5 (Chinese), SHIFT_JIS (Japanese), EUC-KR (Korean), and UTF-8 without a BOM. +
    4. Single-byte encodings, where each character is represented by one byte. Examples: KOI8-R (Russian), windows-1255 (Hebrew), and TIS-620 (Thai). +
    5. windows-1252, which is used primarily on Microsoft Windows by middle managers who wouldn't know a character encoding from a hole in the ground. +
    + +

    UTF-n with a BOM

    + +

    If the text starts with a BOM, we can reasonably assume that the text is encoded in UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that's what it's for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing. + +
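
To make the BOM check concrete, here is a minimal sketch. It is not the code inside UniversalDetector; the function name sniff_bom and the table layout are made up for illustration, and it covers only the two common UTF-32 byte orders rather than all four. The BOM constants themselves come from the standard codecs module.

```python
import codecs

# Hypothetical lookup table (not chardet's own). UTF-32 entries come first
# because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE
# BOM, so testing UTF-16 first would misreport UTF-32 text.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


utf8_guess = sniff_bom(codecs.BOM_UTF8 + b"hello")
no_bom_guess = sniff_bom(b"hello")
```

The ordering of the table is the only subtle part: a prefix test has to try the longest candidates first.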

    Escaped encodings

    + +

    If the text contains a recognizable escape sequence that might indicate an escaped encoding, UniversalDetector creates an EscCharSetProber (defined in escprober.py) and feeds it the text. + +

    EscCharSetProber creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in escsm.py). EscCharSetProber feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, EscCharSetProber immediately returns the positive result to UniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines. + +

    Multi-byte encodings

    + +

    Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252. + +

    The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller. + +

    Most of the multi-byte encoding probers are inherited from MultiByteCharSetProber (defined in mbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and let MultiByteCharSetProber do the rest of the work. MultiByteCharSetProber runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, MultiByteCharSetProber feeds the text to an encoding-specific distribution analyzer. + +

    The distribution analyzers (each defined in chardistribution.py) use language-specific models of which characters are used most frequently. Once MultiByteCharSetProber has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, MultiByteCharSetProber returns the result to MBCSGroupProber, which returns it to UniversalDetector, which returns it to the caller. + +
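
The three ingredients named above (frequently-used characters, total characters, and a distribution ratio) can be combined in many ways. The sketch below shows one plausible combination. The function name, the thresholds, and the exact formula are assumptions made for illustration, not a copy of what chardistribution.py does.

```python
def distribution_confidence(frequent, total, typical_ratio, min_frequent=3):
    """Toy confidence score in the range 0.01 to 0.99.

    frequent      -- how many characters seen so far are "frequently used"
    total         -- how many characters have been seen in all
    typical_ratio -- assumed language-specific ratio of frequent to
                     infrequent characters in typical text
    min_frequent  -- hypothetical minimum before the score means anything
    """
    if total <= 0 or frequent <= min_frequent:
        return 0.01  # not enough data to say anything
    if frequent == total:
        return 0.99  # every character was a frequent one
    # Compare the observed frequent-to-infrequent ratio with the typical one.
    observed = frequent / ((total - frequent) * typical_ratio)
    return min(observed, 0.99)


mostly_frequent = distribution_confidence(90, 100, 0.75)
mostly_rare = distribution_confidence(10, 100, 0.75)
```

Whatever the real formula looks like, the shape is the same: text dominated by a language's common characters scores high, and text dominated by rare ones scores low.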

    The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between EUC-JP and SHIFT_JIS, so the SJISProber (defined in sjisprober.py) also uses 2-character distribution analysis. SJISContextAnalysis and EUCJPContextAnalysis (both defined in jpcntx.py and both inheriting from a common JapaneseContextAnalysis class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to SJISProber, which checks both analyzers and returns the higher confidence level to MBCSGroupProber. + +

    Single-byte encodings

    + +

    The single-byte encoding prober, SBCSGroupProber (defined in sbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: windows-1251, KOI8-R, ISO-8859-5, MacCyrillic, IBM855, and IBM866 (Russian); ISO-8859-7 and windows-1253 (Greek); ISO-8859-5 and windows-1251 (Bulgarian); ISO-8859-2 and windows-1250 (Hungarian); TIS-620 (Thai); windows-1255 and ISO-8859-8 (Hebrew). + +

    SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio. + +
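
Here is a toy version of the 2-character tally. The "language model" is just a hand-picked set of byte pairs, and the score is a plain fraction; both are assumptions for illustration. The real language models are far larger, and the real calculation also involves the distribution ratio, which this sketch leaves out.

```python
def bigram_confidence(data, frequent_pairs):
    """Return the fraction of adjacent byte pairs found in frequent_pairs."""
    total = 0
    hits = 0
    # Iterating over a bytes object in Python 3 yields integers, so each
    # pair is a tuple of two ints.
    for pair in zip(data, data[1:]):
        total += 1
        if pair in frequent_pairs:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical model: a few pairs that are common in English text.
english_pairs = {(p[0], p[1]) for p in (b"th", b"he", b"in", b"er", b"an")}

likely = bigram_confidence(b"the other thing", english_pairs)
unlikely = bigram_confidence(b"txzqJv 2!dasd0a", english_pairs)
```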

    Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew). + +

    windows-1252

    + +

    If UniversalDetector detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a Latin1Prober (defined in latin1prober.py) to try to detect English text in a windows-1252 encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish windows-1252 is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. Latin1Prober automatically reduces its confidence rating to allow more accurate probers to win if at all possible. + +

    Running 2to3

    + +

    We're going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we'll start by running 2to3 on the chardet package, but as you'll see, there will still be a lot of work to do after the automated tools have performed their magic. + +

    The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn. + +

    C:\home\chardet>python c:\Python30\Tools\Scripts\2to3.py -w chardet\
     RefactoringTool: Skipping implicit fixer: buffer
     RefactoringTool: Skipping implicit fixer: idioms
    @@ -507,9 +604,9 @@ RefactoringTool: chardet\sjisprober.py
     RefactoringTool: chardet\universaldetector.py
     RefactoringTool: chardet\utf8prober.py
    -

    Now run the 2to3 script on the testing harness, test.py.

    +

    Now run the 2to3 script on the testing harness, test.py. -

    +

    C:\home\chardet>python c:\Python30\Tools\Scripts\2to3.py -w test.py
     RefactoringTool: Skipping implicit fixer: buffer
     RefactoringTool: Skipping implicit fixer: idioms
    @@ -541,15 +638,15 @@ RefactoringTool: Skipping implicit fixer: ws_comma
     RefactoringTool: Files that were modified:
     RefactoringTool: test.py
    -

    Well, that wasn't so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it'll work?

    -
    +

    Well, that wasn't so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it'll work? -

    -

    False is invalid syntax

    +

    Fixing what 2to3 can't

    -

    Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it's a good way to test our ported code to make sure there aren't any bugs lurking anywhere.

    +

    False is invalid syntax

    -

    +

    Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it's a good way to test our ported code to make sure there aren't any bugs lurking anywhere. + +

    C:\home\chardet>python test.py tests\*\*
     Traceback (most recent call last):
       File "test.py", line 1, in <module>
    @@ -559,9 +656,9 @@ RefactoringTool: test.py
         ^
     SyntaxError: invalid syntax
    -

    Hmm, a small snag. In Python 3, False is a reserved word, so you can't use it as a variable name. Let's look at constants.py to see where it's defined. Here's the original version from constants.py, before the 2to3 script changed it:

    +

    Hmm, a small snag. In Python 3, False is a reserved word, so you can't use it as a variable name. Let's look at constants.py to see where it's defined. Here's the original version from constants.py, before the 2to3 script changed it: -

    +

    import __builtin__
     if not hasattr(__builtin__, 'False'):
         False = 0
    @@ -570,27 +667,25 @@ else:
         False = __builtin__.False
         True = __builtin__.True
    -

    This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary.

    +

    This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary. -

    However, Python 3 will always have a Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of "constants.True" and "constants.False" with "True" and "False", respectively, then delete this dead code from constants.py.

    +

    However, Python 3 will always have a Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of "constants.True" and "constants.False" with "True" and "False", respectively, then delete this dead code from constants.py. -

    So this line in universaldetector.py:

    +

    So this line in universaldetector.py:

    self.done = constants.False
    -

    Becomes

    +

    Becomes

    self.done = False
    -

    Ah, wasn't that satisfying? The code is shorter and more readable already.

    -
    +

    Ah, wasn't that satisfying? The code is shorter and more readable already. -
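
If you want to see the reserved-word problem in isolation, the following sketch feeds the offending assignment to the compiler directly. The variable names are made up for illustration; keyword.iskeyword() and compile() are standard library calls.

```python
import keyword

# In Python 3, False and True are keywords, not ordinary names.
false_is_keyword = keyword.iskeyword("False")

# The Python 2 idiom from constants.py, reduced to its essence.
source = "False = 0"
try:
    compile(source, "<example>", "exec")
    compiled = True
except SyntaxError:
    # Assigning to a keyword is rejected before any code runs.
    compiled = False
```

Because the failure happens at compile time, merely importing a module that contains such an assignment is enough to trigger it, which is why the test harness fell over so early.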

    -

    No module named constants

    +

    No module named constants

    -

    Time to run test.py again and see how far it gets.

    +

    Time to run test.py again and see how far it gets. -

    +

    C:\home\chardet>python test.py tests\*\*
     Traceback (most recent call last):
       File "test.py", line 1, in <module>
    @@ -599,32 +694,30 @@ else:
         import constants, sys
     ImportError: No module named constants
    -

    What's that you say? No module named constants? Of course there's a module named constants. ... Oh wait, no there isn't. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:

    +

    What's that you say? No module named constants? Of course there's a module named constants. ... Oh wait, no there isn't. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:

    from . import constants
    -

    But wait. Wasn't the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can't, and the 2to3 script is not smart enough to split the import statement into two.

    +

    But wait. Wasn't the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can't, and the 2to3 script is not smart enough to split the import statement into two. -

    The solution is to split the import statement manually. So this two-in-one import:

    +

    The solution is to split the import statement manually. So this two-in-one import:

    import constants, sys
    -

    Needs to become two separate imports:

    +

    Needs to become two separate imports:

    from . import constants
     import sys
    -

    There are variations of this problem scattered throughout the chardet library. In some places it's "import constants, sys"; in other places, it's "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.

    +

    There are variations of this problem scattered throughout the chardet library. In some places it's "import constants, sys"; in other places, it's "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import. -
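
The difference between the two import styles is easy to reproduce with a throwaway package. Everything in this sketch is hypothetical: the package is called demo_pkg, and its module is called demo_constants (rather than constants) so it cannot collide with anything already installed on your system.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    package_dir = os.path.join(root, "demo_pkg")
    os.mkdir(package_dir)
    # Hypothetical package contents, written out as source text.
    files = {
        "__init__.py": "",
        "demo_constants.py": "VALUE = 1\n",
        # Python 2 style: sibling module and stdlib module on one line.
        "absolute.py": "import demo_constants, sys\n",
        # Python 3 style: split into a relative and an absolute import.
        "relative.py": "from . import demo_constants\nimport sys\n",
    }
    for name, body in files.items():
        with open(os.path.join(package_dir, name), "w") as handle:
            handle.write(body)

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        try:
            importlib.import_module("demo_pkg.absolute")
            absolute_worked = True
        except ImportError:
            # The sibling module is not visible as a top-level module.
            absolute_worked = False
        relative = importlib.import_module("demo_pkg.relative")
    finally:
        sys.path.remove(root)
```

The two-in-one import fails and the split version succeeds, even though both files sit right next to demo_constants.py.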

    Onward!

    -
    +

    Onward! -

    -

    Name 'file' is not defined

    +

    Name 'file' is not defined

    -

    FIXME intro

    +

    FIXME intro -

    +

    C:\home\chardet>python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
    @@ -632,21 +725,19 @@ import sys
         for line in file(f, 'rb'):
     NameError: name 'file' is not defined
    -

    This one surprised me, because I've been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I'll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it's an alias for io.open(), but never mind that right now.)

    +

    This one surprised me, because I've been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I'll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it's an alias for io.open(), but never mind that right now.) -
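
You can confirm both claims from the interpreter. This is just a quick check using the standard builtins and io modules; the two variable names are made up for illustration.

```python
import builtins
import io

# Python 3 has no global file() function any more...
has_file_builtin = hasattr(builtins, "file")

# ...but open() is still there, and it is the same object as io.open().
open_is_io_open = open is io.open
```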

    Thus, the simplest solution to the problem of the missing file() is to call open() instead:

    +

    Thus, the simplest solution to the problem of the missing file() is to call open() instead:

    for line in open(f, 'rb'):
    -

    And that's all I have to say about that.

    -
    +

    And that's all I have to say about that. -

    -

    Can't use a string pattern on a bytes-like object

    +

    Can't use a string pattern on a bytes-like object

    -

    FIXME intro

    +

    FIXME intro -

    +

    C:\home\chardet>python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
    @@ -656,22 +747,22 @@ NameError: name 'file' is not defined
         if self._highBitDetector.search(aBuf):
     TypeError: can't use a string pattern on a bytes-like object
    -

    Now things are starting to get interesting. And by "interesting," I mean "confusing as all hell."

    +

    Now things are starting to get interesting. And by "interesting," I mean "confusing as all hell." -

    First, let's see what self._highBitDetector is. It's defined in the __init__ method of the UniversalDetector class:

    +

    First, let's see what self._highBitDetector is. It's defined in the __init__ method of the UniversalDetector class: -

    +

    class UniversalDetector:
         def __init__(self):
             self._highBitDetector = re.compile(r'[\x80-\xFF]')
    -

    This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that's not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.

    +

    This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that's not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255. -

    And therein lies the problem.

    +

    And therein lies the problem. -

    In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we're searching is not a string, it's a byte array. Looking at the traceback, this error occurred in universaldetector.py:

    +

    In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we're searching is not a string, it's a byte array. Looking at the traceback, this error occurred in universaldetector.py: -

    +

    def feed(self, aBuf):
         .
         .
    @@ -679,9 +770,9 @@ TypeError: can't use a string pattern on a bytes-like object
         if self._mInputState == ePureAscii:
             if self._highBitDetector.search(aBuf):
    -

    And what is aBuf? Let's backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.

    +

    And what is aBuf? Let's backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py. -

    +

    u = UniversalDetector()
     .
     .
    @@ -689,33 +780,31 @@ TypeError: can't use a string pattern on a bytes-like object
     for line in open(f, 'rb'):
         u.feed(line)
    -

    And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for "read"; OK, big deal, we're reading the file. Ah, but 'b' is for "bytes." Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don't have characters; we have bytes. Oops.

    +

    And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for "read"; OK, big deal, we're reading the file. Ah, but 'b' is for "binary." Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don't have characters; we have bytes. Oops. -
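
The difference the 'b' flag makes is easy to see with a small experiment. This sketch writes a one-line temporary file (the file name and its contents are made up) and reads it back both ways. The text-mode read passes encoding explicitly, which is the "another parameter to open()" mentioned above, so the result does not depend on your system default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    # Store one line containing a non-ASCII character, encoded as UTF-8.
    with open(path, "wb") as handle:
        handle.write("caf\u00e9\n".encode("utf-8"))

    # Binary mode: the line comes back exactly as stored, as bytes.
    with open(path, "rb") as handle:
        binary_line = handle.readline()

    # Text mode: the line is decoded into a string of characters.
    with open(path, "r", encoding="utf-8") as handle:
        text_line = handle.readline()
```

The same line is six bytes long in binary mode but only five characters long in text mode, because the accented letter takes two bytes in UTF-8.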

    What we need this regular expression to search is not an array of characters, but an array of bytes.

    +

    What we need this regular expression to search is not an array of characters, but an array of bytes. -

    Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this:

    +

    Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this:

    self._highBitDetector = re.compile(r'[\x80-\xFF]')
    -

    We now have this:

    +

    We now have this:

    self._highBitDetector = re.compile(b'[\x80-\xFF]')
    -

    There is one other case of this same problem, on the very next line:

    +

    There is one other case of this same problem, on the very next line:

    self._escDetector = re.compile(r'(\033|~{)')
    -

    Again, this is going to be used to search a byte array (the same aBuf variable, in fact), so the regular expression pattern needs to be defined as a byte array:

    +

    Again, this is going to be used to search a byte array (the same aBuf variable, in fact), so the regular expression pattern needs to be defined as a byte array:

    self._escDetector = re.compile(b'(\033|~{)')
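
Here is the whole problem and its fix in miniature, outside of chardet. The sample line and the variable names are made up; the two patterns are the ones shown above.

```python
import re

# A line of bytes containing one high-bit byte (0xE9).
line = b"caf\xe9 au lait"

string_pattern = re.compile(r'[\x80-\xFF]')  # defined with a string
bytes_pattern = re.compile(b'[\x80-\xFF]')   # defined with a byte array

try:
    string_pattern.search(line)
    string_pattern_error = None
except TypeError as error:
    # A string pattern refuses to search a bytes object.
    string_pattern_error = error

# A bytes pattern searches bytes without complaint.
match = bytes_pattern.search(line)
ascii_match = bytes_pattern.search(b"plain ascii")
```

The bytes pattern finds the high-bit byte, and returns None for pure ASCII input, which is exactly the distinction the detector needs.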
    -
    -
    -

    Can't convert 'bytes' object to str implicitly

    +

    Can't convert 'bytes' object to str implicitly

    -

    Curiouser and curiouser...

    +

    Curiouser and curiouser... -

    +

    C:\home\chardet>python test.py tests\*\*
     tests\ascii\howto.diveintomark.org.xml
     Traceback (most recent call last):
    @@ -725,11 +814,10 @@ for line in open(f, 'rb'):
         elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
     TypeError: Can't convert 'bytes' object to str implicitly
    -

    ...

    -
    +

    ...

    diff --git a/porting-code-to-python-3-with-2to3.html b/porting-code-to-python-3-with-2to3.html index 143a245..4df7f63 100644 --- a/porting-code-to-python-3-with-2to3.html +++ b/porting-code-to-python-3-with-2to3.html @@ -51,84 +51,78 @@ for (var i = arTables.length - 1; i >= 0; i--) {

    Porting code to Python 3 with 2to3

    - -
      -
    1. Diving in
    2. -
    3. print statement
    4. -
    5. <> comparison
    6. -
    7. has_key() dictionary method
    8. -
    9. Dictionary methods that return lists
    10. +
        +
      1. Diving in +
      2. print statement +
      3. <> comparison +
      4. has_key() dictionary method +
      5. Dictionary methods that return lists
      6. Modules that have been renamed or reorganized
          -
        1. http package
        2. -
        3. urllib package
        4. -
        5. dbm package
        6. -
        7. xmlrpc package
        8. -
        9. Other modules
        10. +
        11. http package +
        12. urllib package +
        13. dbm package +
        14. xmlrpc package +
        15. Other modules
        -
      7. -
      8. Relative imports within a package
      9. -
      10. filter() global function
      11. -
      12. map() global function
      13. -
      14. reduce() global function (3.1+)
      15. -
      16. apply() global function
      17. -
      18. intern() global function
      19. -
      20. exec statement
      21. -
      22. execfile statement (3.1+)
      23. -
      24. repr literals (backticks)
      25. -
      26. try...except statement
      27. -
      28. raise statement
      29. -
      30. throw statement
      31. -
      32. long data type
      33. -
      34. xrange() global function
      35. -
      36. raw_input() and input() global functions
      37. -
      38. func_* function attributes
      39. -
      40. xreadlines() I/O method
      41. -
      42. lambda functions with multiple parameters
      43. -
      44. Special method attributes
      45. -
      46. next() iterator method
      47. -
      48. __nonzero__ special class attribute
      49. -
      50. Number literals
      51. -
      52. sys.maxint
      53. -
      54. unicode() global function
      55. -
      56. Unicode string literals
      57. -
      58. callable() global function
      59. -
      60. zip() global function
      61. -
      62. StandardError() exception
      63. -
      64. types module constants
      65. -
      66. isinstance global function (3.1+)
      67. -
      68. basestring datatype
      69. -
      70. itertools module
      71. -
      72. sys.exc_type, sys.exc_value, sys.exc_traceback
      73. -
      74. List comprehensions over tuples
      75. -
      76. os.getcwdu() function
      77. -
      78. Metaclasses
      79. -
      80. set() literals
      81. -
      82. buffer() global function
      83. -
      84. Whitespace around commas
      85. -
      86. Common idioms
      87. + +
      88. Relative imports within a package +
      89. filter() global function +
      90. map() global function +
      91. reduce() global function (3.1+) +
      92. apply() global function +
      93. intern() global function +
      94. exec statement +
      95. execfile statement (3.1+) +
      96. repr literals (backticks) +
      97. try...except statement +
      98. raise statement +
      99. throw statement +
      100. long data type +
      101. xrange() global function +
      102. raw_input() and input() global functions +
      103. func_* function attributes +
      104. xreadlines() I/O method +
      105. lambda functions with multiple parameters +
      106. Special method attributes +
      107. next() iterator method +
      108. __nonzero__ special class attribute +
      109. Number literals +
      110. sys.maxint +
      111. unicode() global function +
      112. Unicode string literals +
      113. callable() global function +
      114. zip() global function +
      115. StandardError() exception +
      116. types module constants +
      117. isinstance global function (3.1+) +
      118. basestring datatype +
      119. itertools module +
      120. sys.exc_type, sys.exc_value, sys.exc_traceback +
      121. List comprehensions over tuples +
      122. os.getcwdu() function +
      123. Metaclasses +
      124. set() literals +
      125. buffer() global function +
      126. Whitespace around commas +
      127. Common idioms
      -

      Diving in

      -

      FIXME intro

      +

      FIXME intro -

      ...

      +

      ... -

      - -

      print statement

      -

      In Python 2, print was a statement -- whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function -- whatever you want to print is passed to print() like any other function.

      +

      In Python 2, print was a statement -- whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function -- whatever you want to print is passed to print() like any other function. -

      +

      @@ -163,21 +157,18 @@ for (var i = arTables.length - 1; i >= 0; i--) {
      Notes

        -
      1. To print a blank line, call print() without any arguments.
      2. -
      3. To print a single value, call print() with one argument
      4. -
      5. To print two values separated by a space, call print() with two arguments.
      6. -
      7. This one is a little tricky. In Python 2, if you ended a print statement with a comma, it would print the values separated by spaces, then print a trailing space, then stop without printing a carriage return. In Python 3, the way to do this is to pass end=' ' as a keyword argument to the print() function. The end argument defaults to '\n' (a carriage return), so overriding it will suppress the carriage return after printing the other arguments.
      8. -
      9. In Python 2, you could redirect the output to a pipe -- like sys.stderr -- by using the >>pipe_name syntax. In Python 3, the way to do this is to pass the pipe in the file keyword argument. The file argument defaults to sys.stdout (standard out), so overriding it will output to a different pipe instead.
      10. +
      11. To print a blank line, call print() without any arguments. +
      12. To print a single value, call print() with one argument. +
      13. To print two values separated by a space, call print() with two arguments. +
      14. This one is a little tricky. In Python 2, if you ended a print statement with a comma, it would print the values separated by spaces, then print a trailing space, then stop without printing a newline. In Python 3, the way to do this is to pass end=' ' as a keyword argument to the print() function. The end argument defaults to '\n' (a newline), so overriding it will suppress the newline after printing the other arguments. +
      15. In Python 2, you could redirect the output to a pipe -- like sys.stderr -- by using the >>pipe_name syntax. In Python 3, the way to do this is to pass the pipe in the file keyword argument. The file argument defaults to sys.stdout (standard out), so overriding it will output to a different pipe instead.
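
The five notes above can be sketched as follows. The comparison table itself is not shown here, so the literal values are illustrative, and the in-memory `buffer` is an assumption added only so the output can be inspected.

```python
import io
import sys

buffer = io.StringIO()              # hypothetical in-memory target for the demo

print(file=buffer)                  # 1. no values: a blank line
print(1, file=buffer)               # 2. a single value
print(1, 2, file=buffer)            # 3. two values, separated by a space
print(1, 2, end=' ', file=buffer)   # 4. end=' ' replaces the default '\n'
print(3, file=buffer)               #    so this continues on the same line
print('warning', file=sys.stderr)   # 5. file= redirects away from sys.stdout

captured = buffer.getvalue()
```

After this runs, `captured` holds four lines: an empty one, `1`, `1 2`, and `1 2 3`.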
      -
      - -

      <> comparison

      -

      Python 2 supported <> as a synonym for !=, the not-equals comparison operator. Python 3 supports the != operator, but not <>.

      +

      Python 2 supported <> as a synonym for !=, the not-equals comparison operator. Python 3 supports the != operator, but not <>. -

      +

      @@ -197,18 +188,15 @@ for (var i = arTables.length - 1; i >= 0; i--) {
      Notes

        -
      1. A simple comparison.
      2. -
      3. A more complex comparison between three values.
      4. +
      5. A simple comparison. +
      6. A more complex comparison between three values.
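
A short sketch of both cases, with illustrative values. Note that a chained comparison only compares neighbors, so the first and last values are never compared with each other.

```python
a, b, c = 1, 2, 1

simple = a != b            # a simple comparison
chained = a != b != c      # same as (a != b) and (b != c)

# Python 3 rejects the old spelling outright.
try:
    compile('1 <> 2', '<demo>', 'eval')
    diamond_rejected = False
except SyntaxError:
    diamond_rejected = True
```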
      -
      - -

      has_key() dictionary method

      -

      In Python 2, dictionaries had a has_key() method to test whether the dictionary had a certain key. In Python 3, this method no longer exists. Instead, you need to use the in operator.

      +

      In Python 2, dictionaries had a has_key() method to test whether the dictionary had a certain key. In Python 3, this method no longer exists. Instead, you need to use the in operator. -

      +

      @@ -243,21 +231,18 @@ for (var i = arTables.length - 1; i >= 0; i--) {
      Notes

        -
      1. The simplest form.
      2. -
      3. The or operator takes precedence over the in operator, so there is no need for parentheses here.
      4. -
      5. On the other hand, you do need parentheses here, for the same reason -- or takes precedence over in.
      6. -
      7. The in operator takes precedence over the + operator, so this form needs parentheses too.
      8. -
      9. Again with the parentheses, for the same reason.
      10. +
      11. The simplest form. +
      12. The in operator takes precedence over the or operator, so there is no need for parentheses here. +
      13. On the other hand, you do need parentheses here, for the same reason -- in takes precedence over or. +
      14. The + operator takes precedence over the in operator. A sum on the left-hand side of in does not strictly need parentheses, but they make the grouping explicit. +
      15. Again with the parentheses, for the same reason.
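
Because the table of has_key() forms is not shown here, the expressions below are illustrative rather than the table's own rows. They sketch how `in` groups next to `or` and `+` in Python 3: `+` binds tighter than `in`, which binds tighter than `or`.

```python
a_dictionary = {'a': 1, 'ab': 2}
x, y, n = 'a', 'b', 1

simple = x in a_dictionary                        # the simplest form
either = x in a_dictionary or y in a_dictionary   # (x in d) or (y in d)
first_truthy = (x or y) in a_dictionary           # parentheses needed: tests 'a'
concatenated = (x + y) in a_dictionary            # tests 'ab'
without_parens = x + y in a_dictionary            # same grouping as the line above
added = n + (y in a_dictionary)                   # parentheses needed: 1 + False

has_old_method = hasattr(a_dictionary, 'has_key')  # the method is gone in Python 3
```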
      -
      - -

      Dictionary methods that return lists

      -

      In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(), items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing.

      +

      In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(), items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing. -

      +

      @@ -292,28 +277,24 @@ for (var i = arTables.length - 1; i >= 0; i--) {
      Notes

        -
      1. 2to3 errs on the side of safety, converting the return value from keys() to a static list with the list() function. This will always work, but it will be less efficient than using a view. You should examine the converted code to see if a list is absolutely necessary, or if a view would do.
      2. -
      3. Another view-to-list conversion, with the items() method. 2to3 will do the same thing with the values() method.
      4. -
      5. Python 3 does not support the iterkeys() method anymore. Use keys(), and if necessary, convert the view to an iterator with the iter() function.
      6. -
      7. 2to3 recognizes when the iterkeys() method is used inside a list comprehension, and converts it to the keys() method (without wrapping it in an extra call to iter()). This works because views are iterable.
      8. -
      9. 2to3 recognizes that the keys() method is immediately passed to a function which iterates through an entire sequence, so there is no need to convert the return value to a list first. The min() function will happily iterate through the view instead. This applies to min(), max(), sum(), list(), tuple(), set(), sorted(), any(), and all().
      10. +
      11. 2to3 errs on the side of safety, converting the return value from keys() to a static list with the list() function. This will always work, but it will be less efficient than using a view. You should examine the converted code to see if a list is absolutely necessary, or if a view would do. +
      12. Another view-to-list conversion, with the items() method. 2to3 will do the same thing with the values() method. +
      13. Python 3 does not support the iterkeys() method anymore. Use keys(), and if necessary, convert the view to an iterator with the iter() function. +
      14. 2to3 recognizes when the iterkeys() method is used inside a list comprehension, and converts it to the keys() method (without wrapping it in an extra call to iter()). This works because views are iterable. +
      15. 2to3 recognizes that the keys() method is immediately passed to a function which iterates through an entire sequence, so there is no need to convert the return value to a list first. The min() function will happily iterate through the view instead. This applies to min(), max(), sum(), list(), tuple(), set(), sorted(), any(), and all().
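
A minimal sketch of the difference between a view and a list, using an illustrative dictionary. It relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7.

```python
a_dictionary = {'a': 1, 'b': 2}

keys_view = a_dictionary.keys()          # a dynamic view
keys_list = list(a_dictionary.keys())    # a static snapshot, as 2to3 produces

a_dictionary['c'] = 3                    # the view sees this; the list does not

# Views are iterable but do not support indexing.
try:
    keys_view[0]
    indexing_error = None
except TypeError as e:
    indexing_error = e

key_iterator = iter(a_dictionary.keys())  # replacement for iterkeys()
first_key = next(key_iterator)

smallest_key = min(a_dictionary.keys())   # min() iterates the view directly
```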
      -
      - -

      Modules that have been renamed or reorganized

      -

      Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical.

      +

      Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical. -

      FIXME: once the rest of the book is written, these should link back to the chapters and sections that explain these modules.

      +

      FIXME: once the rest of the book is written, these should link back to the chapters and sections that explain these modules. -

      http package

      -

      In Python 3, several related HTTP modules have been combined into a single package, http.

      +

      In Python 3, several related HTTP modules have been combined into a single package, http. -

      +

      @@ -345,20 +326,17 @@ import CGIHttpServer
      Notes

        -
      1. The http.client module implements a low-level library that can request HTTP resources and interpret HTTP responses.
      2. -
      3. The http.cookies module provides a Pythonic interface to "cookies" that are sent in a Set-Cookie: HTTP header.
      4. -
      5. The http.cookiejar module manipulates the actual files on disk that popular web browsers use to store cookies.
      6. -
      7. The http.server module provides a basic HTTP server.
      8. +
      9. The http.client module implements a low-level library that can request HTTP resources and interpret HTTP responses. +
      10. The http.cookies module provides a Pythonic interface to "cookies" that are sent in a Set-Cookie: HTTP header. +
      11. The http.cookiejar module manipulates the actual files on disk that popular web browsers use to store cookies. +
      12. The http.server module provides a basic HTTP server.
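
As a quick orientation, the four modules can be imported and exercised without touching the network. The cookie name and value are illustrative.

```python
import http.client
import http.cookies
import http.cookiejar
import http.server

# http.client: status code table used when interpreting responses.
ok_reason = http.client.responses[200]

# http.cookies: build the text of a Set-Cookie header.
cookie = http.cookies.SimpleCookie()
cookie['session'] = 'abc123'
header_line = cookie.output()

# http.cookiejar: an empty, in-memory cookie jar.
jar = http.cookiejar.CookieJar()

# http.server: the base class for request handlers.
handler_class = http.server.BaseHTTPRequestHandler
```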
      -
      - -

      urllib package

      -

      Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib.

      +

      Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib. -

      +

      @@ -402,22 +380,19 @@ from urllib.error import HTTPError
      Notes

        -
      1. The old urllib module in Python 2 had a variety of functions, including urlopen() for fetching data and splittype(), splithost(), and splituser() for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new urllib package. 2to3 will also change all calls to these functions so they use the new naming scheme.
      2. -
      3. The old urllib2 module in Python 2 has been folded into into the urllib package in Python 3. All your urllib2 favorites -- the build_opener() method, Request objects, and HTTPBasicAuthHandler and friends -- are still available.
      4. -
      5. The urllib.parse module in Python 3 contains all the parsing functions from the old urlparse module in Python 2.
      6. -
      7. The urllib.robotparser module parses robots.txt files.
      8. -
      9. The FancyURLopener class, which handles HTTP redirects and other status codes, is still available in the new urllib.request module. The urlencode function has moved to urllib.parse.
      10. -
      11. The Request object is still available in urllib.request, but constants like HTTPError have been moved to urllib.error.
      12. +
      13. The old urllib module in Python 2 had a variety of functions, including urlopen() for fetching data and splittype(), splithost(), and splituser() for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new urllib package. 2to3 will also change all calls to these functions so they use the new naming scheme. +
      14. The old urllib2 module in Python 2 has been folded into the urllib package in Python 3. All your urllib2 favorites -- the build_opener() method, Request objects, and HTTPBasicAuthHandler and friends -- are still available. +
      15. The urllib.parse module in Python 3 contains all the parsing functions from the old urlparse module in Python 2. +
      16. The urllib.robotparser module parses robots.txt files. +
      17. The FancyURLopener class, which handles HTTP redirects and other status codes, is still available in the new urllib.request module. The urlencode function has moved to urllib.parse. +
      18. The Request object is still available in urllib.request, but exceptions like HTTPError have been moved to urllib.error.
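
A small sketch of where the commonly used names now live. Nothing is fetched; the URL is a placeholder.

```python
import urllib.robotparser
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode, urlparse
from urllib.request import Request

# urllib.parse: split a URL and build a query string.
parts = urlparse('http://example.org/search?q=chardet')
query = urlencode({'q': 'python 3', 'page': 2})

# urllib.request: describe a request without sending it.
request = Request('http://example.org/?' + query)

# urllib.error: HTTPError is an exception class, a subclass of URLError.
http_error_is_url_error = issubclass(HTTPError, URLError)

# urllib.robotparser: parser for robots.txt files.
robots = urllib.robotparser.RobotFileParser()
```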
      -
      - -

      dbm package

      -

      All the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM, you can import the appropriate module within the dbm package.

      +

      All the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM, you can import the appropriate module within the dbm package. -
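
For example, a specific variant is imported as a submodule of dbm. This sketch uses dbm.dumb, the pure-Python variant, because GNU DBM (dbm.gnu) is not available on every installation. The file name and key are illustrative.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo')          # hypothetical database name

    # 'c' creates the database if it does not exist yet.
    with dbm.dumb.open(path, 'c') as db:
        db[b'greeting'] = b'hello'

    # Reopen read-only and fetch the value back.
    with dbm.dumb.open(path, 'r') as db:
        stored = db[b'greeting']
```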

      +

      @@ -452,16 +427,13 @@ import whichdb
      Notes
      -

      +

      -

      - -

      xmlrpc package

      -

      XML-RPC is a lightweight method of performing remote RPC calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc.

      +

      XML-RPC is a lightweight method of performing remote procedure calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc. -
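
A minimal sketch of the new layout: the client library is xmlrpc.client and the server classes live in xmlrpc.server. The method name `add` is illustrative, and the call is only encoded and decoded locally, never sent.

```python
import xmlrpc.client
import xmlrpc.server

# Encode a method call as XML-RPC, then decode it again.
payload = xmlrpc.client.dumps((2, 3), methodname='add')
params, method = xmlrpc.client.loads(payload)

# The server side of the package.
server_class = xmlrpc.server.SimpleXMLRPCServer
```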

      +

      @@ -481,14 +453,11 @@ import SimpleXMLRPCServer
      Notes
      -

      +

      -

      - -

      Other modules

      -

      +

      @@ -549,27 +518,38 @@ except ImportError:
      Notes

        -
      1. A common idiom in Python 2 was to try to import cStringIO as StringIO, and if that failed, to import StringIO instead. Do not do this in Python 3; the io module does it for you. It will find the fastest implementation available and use it automatically.
      2. -
      3. A similar idiom was used to import the fastest pickle implementation. Do not do this in Python 3; the pickle module does it for you.
      4. -
      5. The builtins module contains the "global" functions, classes, and constants used throughout the Python language. Redefining a function in the builtins module will redefine the "global" function everywhere. That is exactly as powerful and scary as it sounds.
      6. -
      7. The copyreg module adds pickle support for custom types defined in C.
      8. -
      9. The queue module implements a multi-producer, multi-consumer queue.
      10. -
      11. The socketserver module provides generic base classes for implementing different kinds of socket servers.
      12. -
      13. The configparser module parses INI-style configuration files.
      14. -
      15. The reprlib module reimplements the built-in repr() function, but with limits on how many values are represented.
      16. -
      17. The subprocess module allows you to spawn processes, connect to their pipes, and obtain their return codes.
      18. +
      19. A common idiom in Python 2 was to try to import cStringIO as StringIO, and if that failed, to import StringIO instead. Do not do this in Python 3; the io module does it for you. It will find the fastest implementation available and use it automatically. +
      20. A similar idiom was used to import the fastest pickle implementation. Do not do this in Python 3; the pickle module does it for you. +
      21. The builtins module contains the "global" functions, classes, and constants used throughout the Python language. Redefining a function in the builtins module will redefine the "global" function everywhere. That is exactly as powerful and scary as it sounds. +
      22. The copyreg module adds pickle support for custom types defined in C. +
      23. The queue module implements a multi-producer, multi-consumer queue. +
      24. The socketserver module provides generic base classes for implementing different kinds of socket servers. +
      25. The configparser module parses INI-style configuration files. +
      26. The reprlib module reimplements the built-in repr() function, but with limits on how many values are represented. +
      27. The subprocess module allows you to spawn processes, connect to their pipes, and obtain their return codes.
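
The modules in these notes can be tried out in a few lines each. This is an illustrative tour with made-up values, not the contents of the comparison table.

```python
import builtins
import configparser
import io
import pickle
import queue
import reprlib
import subprocess
import sys

# io replaces the cStringIO / StringIO import dance.
text_buffer = io.StringIO()
text_buffer.write('hello')

# pickle picks its fastest implementation by itself.
round_trip = pickle.loads(pickle.dumps({'a': 1}))

# builtins holds the "global" functions.
same_len = builtins.len is len

# queue: a multi-producer, multi-consumer queue.
jobs = queue.Queue()
jobs.put('job')
next_job = jobs.get()

# configparser: INI-style configuration.
parser = configparser.ConfigParser()
parser.read_string('[server]\nport = 8080\n')
port = parser.getint('server', 'port')

# reprlib: repr() with a limit on how many values are shown.
short = reprlib.repr(list(range(100)))

# subprocess: spawn a process and collect its output and return code.
result = subprocess.run([sys.executable, '-c', 'print("hi")'],
                        capture_output=True, text=True)
```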
      -
      -
      - -

      Relative imports within a package

      -

      FIXME intro

      +

      A package is a group of related modules that function as a single entity. In Python 2, when modules within a package need to reference each other, you use import foo or from foo import Bar. The Python 2 interpreter first searches within the current package to find foo.py, and then moves on to the other directories in the Python search path (sys.path). Python 3 works a bit differently. Instead of searching the current package, it goes directly to the Python search path. If you want one module within a package to import another module in the same package, you need to explicitly provide the relative path between the two modules. -

      +

      Suppose you had this package, with multiple files in the same directory: + +

      +

      chardet/
      +|
      ++--__init__.py
      +|
      ++--constants.py
      +|
      ++--mbcharsetprober.py
      +|
      ++--universaldetector.py
      + +

      Now suppose that universaldetector.py needs to import the entire constants.py file and one class from mbcharsetprober.py. How do you do it? + +

      @@ -578,23 +558,26 @@ except ImportError: - - + + + + + + +
      Notes
      FIXME | FIXME
      import constants | from . import constants
      from mbcharsetprober import MultiByteCharSetProber | from .mbcharsetprober import MultiByteCharSetProber

        -
      1. ...
      2. +
      3. When you need to import an entire module from elsewhere in your package, use the new from . import syntax. The period is actually a relative path from this file (universaldetector.py) to the file you want to import (constants.py). In this case, they are in the same directory, thus the single period. You can also import from the parent directory (from .. import anothermodule) or a subdirectory. +
      4. To import a specific class or function from another module directly into your module's namespace, prefix the target module with a relative path, minus the trailing slash. In this case, mbcharsetprober.py is in the same directory as universaldetector.py, so the path is a single period. You can also import from the parent directory (from ..anothermodule import AnotherClass) or a subdirectory.
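
To see both relative import forms working, the sketch below builds a throwaway package in a temporary directory and imports it. The package name `demo_pkg` and the file contents are hypothetical; only the file names and the two import lines mirror the chardet layout above.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-ins for the real chardet files.
files = {
    '__init__.py': '',
    'constants.py': 'DEBUG = False\n',
    'mbcharsetprober.py': 'class MultiByteCharSetProber:\n    pass\n',
    'universaldetector.py': (
        'from . import constants\n'
        'from .mbcharsetprober import MultiByteCharSetProber\n'
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    package_dir = os.path.join(tmp, 'demo_pkg')
    os.mkdir(package_dir)
    for name, source in files.items():
        with open(os.path.join(package_dir, name), 'w') as f:
            f.write(source)

    # Make the temporary directory importable just long enough to load it.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        detector = importlib.import_module('demo_pkg.universaldetector')
    finally:
        sys.path.remove(tmp)

debug_flag = detector.constants.DEBUG
prober_name = detector.MultiByteCharSetProber.__name__
```

Changing the first import line back to a plain `import constants` makes the import fail in Python 3, because the search starts at sys.path rather than in the current package.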
      -
      - -

      filter() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -629,21 +612,18 @@ except ImportError:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. +
      11. ... +
      12. ... +
      13. ... +
      14. ... +
      15. ...
      -
      - -

      map() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -678,21 +658,18 @@ except ImportError:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. +
      11. ... +
      12. ... +
      13. ... +
      14. ... +
      15. ...
      -
      - -

      reduce() global function (3.1+)

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -708,17 +685,14 @@ reduce(a, b, c)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      apply() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -748,20 +722,17 @@ reduce(a, b, c)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. +
      9. ... +
      10. ... +
      11. ... +
      12. ...
      -
      - -

      intern() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -776,17 +747,14 @@ reduce(a, b, c)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      exec statement

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -811,19 +779,16 @@ reduce(a, b, c)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -

      execfile statement (3.1+)

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -838,17 +803,14 @@ reduce(a, b, c)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      repr literals (backticks)

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -873,19 +835,16 @@ reduce(a, b, c)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -

      try...except statement

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -933,20 +892,17 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. +
      9. ... +
      10. ... +
      11. ... +
      12. ...
      -
      - -

      raise statement

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -971,19 +927,16 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -

      throw statement

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1008,19 +961,16 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -

      long data type

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1055,21 +1005,18 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. +
      11. ... +
      12. ... +
      13. ... +
      14. ... +
      15. ...
      -
      - -

      xrange() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1104,21 +1051,18 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. +
      11. ... +
      12. ... +
      13. ... +
      14. ... +
      15. ...
      -
      - -

      raw_input() and input() global functions

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1148,20 +1092,17 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. +
      9. ... +
      10. ... +
      11. ... +
      12. ...
      -
      - -

      func_* function attributes

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1206,23 +1147,20 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. -
      11. ...
      12. -
      13. ...
      14. +
      15. ... +
      16. ... +
      17. ... +
      18. ... +
      19. ... +
      20. ... +
      21. ...
      -
      - -

      xreadlines() I/O method

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1242,18 +1180,15 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -

      lambda functions with multiple parameters

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1278,19 +1213,16 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -

      Special method attributes

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1315,19 +1247,16 @@ except:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -

      next() iterator method

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1372,21 +1301,18 @@ for an_iterator in a_sequence_of_iterators:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. +
      11. ... +
      12. ... +
      13. ... +
      14. ... +
      15. ...
      -
      - -

      __nonzero__ special class attribute

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1412,18 +1338,15 @@ for an_iterator in a_sequence_of_iterators:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -

      Number literals

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1443,18 +1366,15 @@ for an_iterator in a_sequence_of_iterators:
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -

      sys.maxint

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1476,18 +1396,15 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -

      unicode() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1502,17 +1419,14 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      Unicode string literals

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1532,18 +1446,15 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -

      callable() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1558,17 +1469,14 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      zip() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1588,18 +1496,15 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -
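For the `zip()` section above, a minimal sketch assuming Python 3, where `zip()` returns a one-shot iterator instead of a list. Code that needs a list wraps the call in `list()`. The variable names are illustrative only.

```python
names = ["a", "b", "c"]
numbers = [1, 2, 3]

pairs = zip(names, numbers)           # an iterator, not a list
as_list = list(zip(names, numbers))   # wrap in list() when a list is needed

first_pass = list(pairs)
second_pass = list(pairs)             # the iterator is already exhausted
```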

      StandardError() exception

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1619,18 +1524,15 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -

      types module constants

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1670,22 +1572,19 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. -
      11. ...
      12. +
      13. ... +
      14. ... +
      15. ... +
      16. ... +
      17. ... +
      18. ...
      -
      - -

      isinstance global function (3.1+)

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1700,17 +1599,14 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      basestring datatype

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1725,15 +1621,12 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -
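A small sketch for the `basestring` section above, assuming Python 3: there is one text type, `str`, so checks written against `basestring` become checks against `str`, and bytes are a separate type. The function name `is_text` is hypothetical.

```python
def is_text(value):
    """Python 2 code would typically test isinstance(value, basestring)."""
    return isinstance(value, str)


try:
    basestring
    basestring_defined = True
except NameError:
    basestring_defined = False  # the name does not exist in Python 3
```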

      itertools module

      -

      FIXME intro

      +

      FIXME intro

      @@ -1769,21 +1662,18 @@ a_function(sys.maxsize)

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. -
      9. ...
      10. +
      11. ... +
      12. ... +
      13. ... +
      14. ... +
      15. ...
      -
      - -
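For the `itertools` section above, a hedged sketch assuming Python 3: the lazy variants `izip`, `imap`, and `ifilter` are gone because the built-in `zip`, `map`, and `filter` are already lazy, and `ifilterfalse` is now `itertools.filterfalse`. The sample data is illustrative only.

```python
import itertools

numbers = [1, 2, 3, 4]

doubled = list(map(lambda n: n * 2, numbers))                       # was itertools.imap
evens = list(filter(lambda n: n % 2 == 0, numbers))                 # was itertools.ifilter
odds = list(itertools.filterfalse(lambda n: n % 2 == 0, numbers))   # was ifilterfalse

# None of the Python 2 names survive in the Python 3 module.
leftover = [
    name
    for name in ("izip", "imap", "ifilter", "ifilterfalse")
    if hasattr(itertools, name)
]
```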

      sys.exc_type, sys.exc_value, sys.exc_traceback

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1808,19 +1698,16 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -
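As a sketch for the section above, assuming Python 3: the module attributes `sys.exc_type`, `sys.exc_value`, and `sys.exc_traceback` are gone, and `sys.exc_info()` returns the same three pieces of information as a tuple. The failing `int()` call is just a convenient way to raise an exception.

```python
import sys

try:
    int("not a number")
except ValueError:
    # One call replaces the three separate Python 2 attributes.
    exc_type, exc_value, exc_tb = sys.exc_info()

old_attribute_exists = hasattr(sys, "exc_type")
```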

      List comprehensions over tuples

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1835,17 +1722,14 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -
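For the list comprehension section above, a minimal sketch assuming Python 3: iterating over a bare, unparenthesized tuple inside a comprehension is a syntax error, so the tuple needs explicit parentheses. The helper `compiles` is hypothetical and only reports whether a snippet parses.

```python
def compiles(source):
    """Return True if `source` parses as an expression on this interpreter."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


old_style_ok = compiles("[n for n in 1, 2, 3]")    # accepted by Python 2
new_style_ok = compiles("[n for n in (1, 2, 3)]")  # required in Python 3
values = [n for n in (1, 2, 3)]
```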

      os.getcwdu() function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1860,17 +1744,14 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      Metaclasses

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1894,18 +1775,15 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -
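A hedged sketch for the metaclasses section above, assuming Python 3: the metaclass is passed as a keyword in the class statement, and a `__metaclass__` attribute in the class body is treated as an ordinary attribute with no special effect. `Registry`, `Plugin`, and `Legacy` are hypothetical names.

```python
class Registry(type):
    """Hypothetical metaclass that records the name of each class it creates."""

    classes = []

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.classes.append(name)
        return cls


class Plugin(metaclass=Registry):  # Python 3 spelling
    pass


class Legacy:
    __metaclass__ = Registry       # Python 2 spelling; ignored by Python 3
```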

      set() literals

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1930,19 +1808,16 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. +
      7. ... +
      8. ... +
      9. ...
      -
      - -
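For the `set()` literals section above, a short sketch assuming Python 3: braces build a set directly, set comprehensions are available, and the empty set still has to be written `set()` because `{}` is an empty dictionary. The variable names are illustrative only.

```python
from_list = set([1, 2, 3])            # still valid, but builds a list first
literal = {1, 2, 3}                   # set literal
squares = {n * n for n in range(4)}   # set comprehension

empty_set = set()
empty_braces = {}                     # a dict, not a set
```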

      buffer() global function

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1957,17 +1832,14 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. +
      3. ...
      -
      - -

      Whitespace around commas

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -1987,18 +1859,15 @@ a_function(sys.maxsize)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. +
      5. ... +
      6. ...
      -
      - -

      Common idioms

      -

      FIXME intro

      +

      FIXME intro -

      +

      @@ -2033,16 +1902,14 @@ do_stuff(a_list)
      Notes

        -
      1. ...
      2. -
      3. ...
      4. -
      5. ...
      6. -
      7. ...
      8. +
      9. ... +
      10. ... +
      11. ... +
      12. ...
      -
      -