From c3cbdf035d820de08dcd24522377102a5ee093c5 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Fri, 30 Jan 2009 19:46:43 -0500
Subject: [PATCH] several more 2to3 sections completed
---
case-study-porting-chardet-to-python-3.html | 84 +++---
dip3.css | 7 +-
index.html | 12 +-
porting-code-to-python-3-with-2to3.html | 314 ++++++++++----------
4 files changed, 211 insertions(+), 206 deletions(-)
diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index c9911f1..315919d 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -12,17 +12,17 @@ body{counter-reset:h1 19}
- ❝ Words, words. They're all we have to go on. ❞
+ ❝ Words, words. They’re all we have to go on. ❞
- When you think of "text," you probably think of "characters and symbols I see on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
+ When you think of “text,” you probably think of “characters and symbols I see on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
- In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text", you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
+ In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
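As a small illustration of the paragraph above (a sketch, not code from chardet; the sample character and the two encodings are arbitrary choices), the same character is stored as different bytes under different encodings, and decoding with the wrong encoding yields the wrong characters:

```python
# Illustration only: the sample text and encodings are arbitrary choices.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")         # stored as two bytes
latin1_bytes = text.encode("iso-8859-1")  # stored as one byte

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'

# Decoding with the wrong "key" silently produces different characters.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)       # two characters instead of one
```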
- It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It's like cracking a code when you don't have the decryption key.
+ It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.
- In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
+ In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.
In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
- This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors' comments, which are quite extensive and informative.
+ This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors’ comments, which are quite extensive and informative.
You may also be interested in the research paper which led to the Mozilla implementation, A composite approach to language/encoding detection.
- Don't do that. Virtually every format and protocol contains a method for specifying character encoding.
+ Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.)
- Despite the complexity, it's worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
+ Despite the complexity, it’s worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
- Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn't work. There are also some poorly designed standards that have no way to specify encoding at all.
+ Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.
If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options.
@@ -88,7 +88,7 @@ body{counter-reset:h1 19}
This is a brief guide to navigating the code itself.
- The main entry point for the detection algorithm is
+ The main entry point for the detection algorithm is
There are 5 categories of encodings that
- If the text starts with a BOM, we can reasonably assume that the text is encoded in
+ If the text starts with a BOM, we can reasonably assume that the text is encoded in
- Assuming no BOM,
+ Assuming no BOM,
The multi-byte encoding prober,
- We're going to migrate the
+ We’re going to migrate the
The main
- Well, that wasn't so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it'll work?
+ Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
- Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it's a good way to test our ported code to make sure there aren't any bugs lurking anywhere.
+ Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.
- Hmm, a small snag. In Python 3,
+ Hmm, a small snag. In Python 3,
This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in
- However, Python 3 will always have a
+ However, Python 3 will always have a
So this line in
- Ah, wasn't that satisfying? The code is shorter and more readable already.
+ Ah, wasn’t that satisfying? The code is shorter and more readable already.
- What's that you say? No module named
+ What’s that you say? No module named
- But wait. Wasn't the
+ But wait. Wasn’t the
The solution is to split the import statement manually. So this two-in-one import:
@@ -713,7 +713,7 @@ ImportError: No module named constants
- There are variations of this problem scattered throughout the
+ There are variations of this problem scattered throughout the
Onward!
@@ -729,15 +729,15 @@ import sys
for line in file(f, 'rb'):
NameError: name 'file' is not defined
- This one surprised me, because I've been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the
+ This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the
Thus, the simplest solution to the problem of the missing file() is to call open() instead:
- And that's all I have to say about that.
+ And that’s all I have to say about that.
- FIXME intro
@@ -751,20 +751,20 @@ NameError: name 'file' is not defined
if self._highBitDetector.search(aBuf):
TypeError: can't use a string pattern on a bytes-like object
- Now things are starting to get interesting. And by "interesting," I mean "confusing as all hell."
+ Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.”
- First, let's see what self._highBitDetector is. It's defined in the __init__ method of the UniversalDetector class:
+ First, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the UniversalDetector class:
- This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that's not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.
+ This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.
And therein lies the problem.
- In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (
+ In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (
- And what is aBuf? Let's backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness,
+ And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness,
- And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file:
+ And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file:
What we need this regular expression to search is not an array of characters, but an array of bytes.
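The mismatch described here can be reproduced in isolation. The following is a minimal sketch, not the library's actual code: the sample line is made up, and compiling the pattern from a bytes literal is shown as one plausible way to make the search work on bytes.

```python
import re

# A made-up line, shaped like what a file opened with 'rb' would yield.
line = b"caf\xc3\xa9\n"

# A pattern compiled from a str can only search str objects.
str_pattern = re.compile(r"[\x80-\xFF]")
try:
    str_pattern.search(line)
    raised = False
except TypeError:
    raised = True
print(raised)

# One possible fix: compile the pattern from a bytes literal instead,
# so that it matches bytes in the range 0x80-0xFF.
bytes_pattern = re.compile(b"[\x80-\xFF]")
match = bytes_pattern.search(line)
print(match.group())
```

The bytes pattern finds the first high-bit byte in the line; a pure ASCII line would produce no match at all.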
@@ -821,7 +821,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
...
Case study: porting
chardet to Python 3
-
— Rosencrantz and Guildenstern are Dead
+
— Rosencrantz and Guildenstern are Dead
chardet: a mini-FAQ
@@ -33,40 +33,40 @@ body{counter-reset:h1 19}
windows-1252
2to3
-2to3 can't
+2to3 can’t
Introducing
-chardet: a mini-FAQWhat is character encoding auto-detection?
-Isn't that impossible?
+Isn’t that impossible?
-Who wrote this detection algorithm?
-Yippie! Screw the standards, I'll just auto-detect everything!
+Yippie! Screw the standards, I’ll just auto-detect everything!
-
charset parameter in the Content-type header.
@@ -76,11 +76,11 @@ body{counter-reset:h1 19}
-Why bother with auto-detection if it's slow, inaccurate, and non-standard?
+Why bother with auto-detection if it’s slow, inaccurate, and non-standard?
-universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that's really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)
+universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)
UniversalDetector handles:
@@ -97,12 +97,12 @@ body{counter-reset:h1 19}
ISO-2022-JP (Japanese) and HZ-GB-2312 (Chinese).
Big5 (Chinese), SHIFT_JIS (Japanese), EUC-KR (Korean), and UTF-8 without a BOM.
KOI8-R (Russian), windows-1255 (Hebrew), and TIS-620 (Thai).
-windows-1252, which is used primarily on Microsoft Windows by middle managers who wouldn't know a character encoding from a hole in the ground.
+windows-1252, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
UTF-n with a BOM
-UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that's what it's for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing.
+UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing.
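To make the BOM check concrete, here is a rough sketch using the BOM constants from the standard codecs module. The function name sniff_bom and the returned labels are hypothetical; this is not how UniversalDetector itself is written.

```python
import codecs

# Checked longest-first: the UTF-32 LE BOM begins with the same two
# bytes as the UTF-16 LE BOM, so the order of this list matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None.

    Hypothetical helper for illustration only.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"hello"))
```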
Escaped encodings
@@ -112,7 +112,7 @@ body{counter-reset:h1 19}
Multi-byte encodings
-UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of "probers" for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.
+UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.
MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller.
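The elimination behaviour described above can be sketched with toy classes. Everything here is hypothetical (ToyProber, ToyGroupProber, and the strict-decode rule are stand-ins, not the real probers), and the sketch covers only the "drop a prober once it sees an illegal byte sequence" part, not confidence reporting.

```python
class ToyProber:
    """Hypothetical stand-in for one encoding-specific prober."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.active = True

    def feed(self, data):
        # Toy rule: a failed strict decode counts as an illegal sequence.
        try:
            data.decode(self.encoding)
        except UnicodeDecodeError:
            return False
        return True


class ToyGroupProber:
    """Hypothetical shell that manages a group of ToyProber objects."""

    def __init__(self, probers):
        self.probers = probers

    def feed(self, data):
        for prober in self.probers:
            if not prober.active:
                continue  # dropped on an earlier call, so skip it
            if not prober.feed(data):
                prober.active = False
        # Report which candidate encodings are still in the running.
        return [p.encoding for p in self.probers if p.active]


group = ToyGroupProber([ToyProber("ascii"), ToyProber("utf-8")])
first = group.feed(b"hello ")
second = group.feed(b"caf\xc3\xa9")
third = group.feed(b" again")
print(first, second, third)
```

Once the ASCII prober is dropped on the second call, it stays dropped on the third, even though that chunk is pure ASCII.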
@@ -136,7 +136,7 @@ body{counter-reset:h1 19}
Running 2to3
-chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different modules -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we'll start by running 2to3 on the chardet package, but as you'll see, there will still be a lot of work to do after the automated tools have performed their magic.
+chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.
@@ -642,13 +642,13 @@ RefactoringTool: Skipping implicit fixer: ws_comma
RefactoringTool: Files that were modified:
RefactoringTool: test.py
-Fixing what 2to3 can't
+Fixing what 2to3 can’t
-False is invalid syntax
C:\home\chardet>python test.py tests\*\*
@@ -660,7 +660,7 @@ RefactoringTool: test.py
^
SyntaxError: invalid syntax
-False is a reserved word, so you can't use it as a variable name. Let's look at constants.py to see where it's defined. Here's the original version from constants.py, before the 2to3 script changed it:
+False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it:
-import __builtin__
@@ -673,7 +673,7 @@ else:
Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary.
-Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of "constants.True" and "constants.False" with "True" and "False", respectively, then delete this dead code from constants.py.
+Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py.
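A quick way to see why the old spelling cannot survive in Python 3 is to ask the compiler directly. This is only an illustrative sketch; the variable name done and the "<example>" filename are made up.

```python
def is_valid_python3(source):
    """Return True if source compiles under the running interpreter."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Python 2 style: False accessed as an attribute of the constants module.
old_style = is_valid_python3("done = constants.False")

# After the fix: the built-in constant is used directly.
new_style = is_valid_python3("done = False")

print(old_style, new_style)
```

Because compile() only parses the source, the missing constants module does not matter here; the first line is rejected purely on syntax.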
universaldetector.py:
@@ -683,7 +683,7 @@ else:
-self.done = FalseNo module named
@@ -698,11 +698,11 @@ else:
import constants, sys
ImportError: No module named constants
constants
-constants? Of course there's a module named constants. ... Oh wait, no there isn't. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
+constants? Of course there’s a module named constants. ... Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
-from . import constants
-2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can't, and the 2to3 script is not smart enough to split the import statement into two.
+2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.
-from . import constants
import sys
-chardet library. In some places it's "import constants, sys"; in other places, it's "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
+chardet library. In some places it’s "import constants, sys"; in other places, it’s "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
io module. [FIXME-LINK PEP 3116] I'll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it's an alias for io.open(), but never mind that right now.)
+io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.)
-for line in open(f, 'rb'):
-Can't use a string pattern on a bytes-like object
+Can’t use a string pattern on a bytes-like object
-class UniversalDetector:
    def __init__(self):
        self._highBitDetector = re.compile(r'[\x80-\xFF]')
-u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we're searching is not a string, it's a byte array. Looking at the traceback, this error occurred in universaldetector.py:
+u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
-def feed(self, aBuf):
@@ -774,7 +774,7 @@ TypeError: can't use a string pattern on a bytes-like object
-test.py.
+test.py.
for line in open(f, 'rb'):
u.feed(line)
-u = UniversalDetector()
@@ -784,7 +784,7 @@ TypeError: can't use a string pattern on a bytes-like object
-'rb'. 'r' is for "read"; OK, big deal, we're reading the file. Ah, but 'b' is for "bytes." Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don't have characters; we have bytes. Oops.
+'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don’t have characters; we have bytes. Oops.
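The difference the 'b' flag makes is easy to check on a throwaway file. This sketch writes its own temporary file so it is self-contained; the file contents are made up, and the text-mode read passes an explicit encoding rather than relying on the system default, so that the result does not vary between machines.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on test data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("caf\u00e9\n".encode("utf-8"))
    path = tmp.name

# Binary mode: each line comes back exactly as stored, as bytes.
with open(path, "rb") as f:
    binary_lines = [line for line in f]

# Text mode: each line is decoded into a str using the given encoding.
with open(path, "r", encoding="utf-8") as f:
    text_lines = [line for line in f]

os.remove(path)

print(type(binary_lines[0]).__name__, binary_lines[0])
print(type(text_lines[0]).__name__, len(text_lines[0]))
```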
-Dive Into Python 3 will cover Python 3 and its differences from Python 2. Compared to the original Dive Into Python, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final book will be published on paper by Apress. The book will remain online under the CC-BY-3.0 license.
-Below is the draft table of contents. It is not finalized. Only a few chapters have been written so far. The rest is just stubs and random notes to myself.
-Yes, that is PapayaWhip. All hail PapayaWhip.
+Dive Into Python 3 will cover Python 3 and its differences from Python 2. Compared to the original Dive Into Python, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final book will be published on paper by Apress. The book will remain online under the CC-BY-3.0 license.
+Below is the draft table of contents. It is not finalized. Only a few chapters have been written so far. The rest is just stubs and random notes to myself.
+Yes, that is PapayaWhip. All hail PapayaWhip.
-Tentative because most of these have not been ported to Python 3 yet.
+Tentative because most of these have not been ported to Python 3 yet.