diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index 09ff8ed..5bf0cb8 100644 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -13,7 +13,6 @@ del{background:#f87} mark{background:#ff8;font-weight:bold} -
chardet to Python 3
We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy (a function was renamed or moved to a different module), but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.
C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\ RefactoringTool: Skipping implicit fixer: buffer RefactoringTool: Skipping implicit fixer: idioms @@ -567,8 +565,7 @@ RefactoringTool: chardet\sbcsgroupprober.py RefactoringTool: chardet\sjisprober.py RefactoringTool: chardet\universaldetector.py RefactoringTool: chardet\utf8prober.py-
Now run the 2to3 script on the testing harness, test.py.
-
Now run the 2to3 script on the testing harness, test.py.
C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py RefactoringTool: Skipping implicit fixer: buffer RefactoringTool: Skipping implicit fixer: idioms @@ -599,12 +596,11 @@ RefactoringTool: Skipping implicit fixer: ws_comma +print(count, 'tests') RefactoringTool: Files that were modified: RefactoringTool: test.py-
[FIXME explain the difference in import syntax] +
[FIXME explain the difference in import syntax]
Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
2to3 can’t
False is invalid syntax
Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere. -
C:\home\chardet> python test.py tests\*\* Traceback (most recent call last): File "test.py", line 1, in <module> @@ -613,8 +609,7 @@ RefactoringTool: test.pyself.done = constants.False ^ SyntaxError: invalid syntax -
Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it:
-
Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it:
import __builtin__
if not hasattr(__builtin__, 'False'):
False = 0
@@ -622,7 +617,7 @@ if not hasattr(__builtin__, 'False'):
else:
False = __builtin__.False
True = __builtin__.True
-This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary.
+
This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary.
However, Python 3 will always have a Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py.
So this line in universaldetector.py:
self.done = constants.False
@@ -631,7 +626,6 @@ else:
Ah, wasn’t that satisfying? The code is shorter and more readable already.
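As a small sketch of why the old compatibility code cannot survive in Python 3, the snippet below compiles two lines modeled on the ones discussed above and records the resulting errors. The variable name `self_done` and the `<demo>` filename are illustrative, not part of chardet.

```python
import keyword

# Two statements modeled on the old code: assigning to False, and
# reading an attribute that is literally named False.
snippets = ["False = 0", "self_done = constants.False"]

errors = {}
for source in snippets:
    try:
        # compile() only parses the code; nothing is executed.
        compile(source, "<demo>", "exec")
    except SyntaxError as exc:
        errors[source] = exc

# False is a keyword in Python 3, so both snippets fail at parse time.
false_is_keyword = keyword.iskeyword("False")
```

Both failures happen while parsing, before any code runs, which is why the traceback pointed at a syntax error and not a runtime error.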
constants
Time to run test.py again and see how far it gets.
-
C:\home\chardet> python test.py tests\*\*
Traceback (most recent call last):
File "test.py", line 1, in <module>
@@ -639,7 +633,7 @@ else:
File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
import constants, sys
ImportError: No module named constants
-What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
+
What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
from . import constants
But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.
The solution is to split the import statement manually. So this two-in-one import: @@ -651,20 +645,18 @@ import sys
Onward!
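To make the difference concrete, here is a minimal sketch that builds a throwaway package on disk and imports it both ways. The names `demo_pkg`, `demo_constants`, `broken`, and `detector` are hypothetical stand-ins for the chardet modules, and the value assigned to `eNotMe` is made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "demo_constants.py").write_text("eNotMe = 1\n")
    # Python 2 style: sibling module and stdlib module in one statement.
    (pkg / "broken.py").write_text("import demo_constants, sys\n")
    # Python 3 style: explicit relative import, split from the absolute one.
    (pkg / "detector.py").write_text(
        "from . import demo_constants\n"
        "import sys\n"
        "value = demo_constants.eNotMe\n"
    )

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        try:
            importlib.import_module("demo_pkg.broken")
            broken_error = None
        except ImportError as exc:
            # The sibling module is not visible as a top-level name.
            broken_error = exc
        detector = importlib.import_module("demo_pkg.detector")
        imported_value = detector.value
    finally:
        sys.path.remove(tmp)
```

The combined import fails because Python 3 looks for `demo_constants` only as a top-level module; the split version succeeds because the leading dot tells Python to look inside the current package.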
And here we go again, running test.py to try to execute our test cases…
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
File "test.py", line 9, in <module>
for line in file(f, 'rb'):
NameError: name 'file' is not defined
-This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.)
+
This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.)
Thus, the simplest solution to the problem of the missing file() is to call open() instead:
for line in open(f, 'rb'):
And that’s all I have to say about that.
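For readers who want to check this themselves, the following sketch confirms that Python 3 has no global `file` and that `open` is the same object as `io.open`, then reads a temporary file line by line. The file contents are invented for the example.

```python
import builtins
import io
import os
import tempfile

# Python 3 has no global file(); open() is the same object as io.open().
has_file_builtin = hasattr(builtins, "file")
open_is_io_open = open is io.open

# Write a small sample file, then read it back line by line in binary mode,
# the same way the test harness does.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"first line\nsecond line\n")
    with open(path, "rb") as handle:
        lines = [line for line in handle]
finally:
    os.remove(path)
```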
Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.” -
C:\home\chardet> python test.py tests\*\* tests\ascii\howto.diveintomark.org.xml Traceback (most recent call last): @@ -673,34 +665,29 @@ NameError: name 'file' is not definedFile "C:\home\chardet\chardet\universaldetector.py", line 98, in feed if self._highBitDetector.search(aBuf): TypeError: can't use a string pattern on a bytes-like object -
To debug this, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the UniversalDetector class: -
class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(r'[\x80-\xFF]')
-This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255. +
This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.
And therein lies the problem.
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:
-
def feed(self, aBuf):
.
.
.
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
-And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.
-
And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.
u = UniversalDetector()
.
.
.
for line in open(f, 'rb'):
u.feed(line)
-And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
+
And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
What we need this regular expression to search is not an array of characters, but an array of bytes.
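The effect of the 'b' flag is easy to see in isolation. This sketch writes a few UTF-8 bytes to a temporary file and reads the same line back in binary mode and in text mode; the sample text and the explicit `encoding="utf-8"` argument are choices made for the example, not something the test harness does.

```python
import os
import tempfile

raw = "café\n".encode("utf-8")  # 6 bytes: the é takes two bytes in UTF-8

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(raw)
    # Binary mode: each line comes back untouched, as bytes.
    with open(path, "rb") as handle:
        binary_line = next(iter(handle))
    # Text mode: each line is decoded into a string of characters.
    with open(path, "r", encoding="utf-8", newline="") as handle:
        text_line = next(iter(handle))
finally:
    os.remove(path)
```

The binary line has six elements and the text line has five, because the two bytes that encode é collapse into a single character once decoded.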
Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.) -
class UniversalDetector:
def __init__(self):
- self._highBitDetector = re.compile(b'[\x80-\xFF]')
@@ -710,8 +697,7 @@ for line in open(f, 'rb'):
self._mEscCharSetProber = None
self._mCharSetProbers = []
self.reset()
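A standalone sketch of the same idea, outside the UniversalDetector class: one pattern compiled from a string, one compiled from a byte array, both applied to the same bytes. The sample data is invented.

```python
import re

high_bit_str = re.compile(r'[\x80-\xFF]')    # string pattern
high_bit_bytes = re.compile(b'[\x80-\xFF]')  # bytes pattern

data = b'caf\xc3\xa9'  # bytes containing two high-bit values

# A string pattern refuses to search bytes.
try:
    high_bit_str.search(data)
    str_pattern_error = None
except TypeError as exc:
    str_pattern_error = exc

# A bytes pattern searches bytes without complaint.
match = high_bit_bytes.search(data)
ascii_match = high_bit_bytes.search(b'plain ascii')
```

The bytes pattern finds the first high-bit byte at offset 3 and reports no match at all for pure ASCII input, which is the behavior the detector relies on.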
-Searching the entire codebase for other uses of the re module turns up two more instances, in charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
-
Searching the entire codebase for other uses of the re module turns up two more instances, in charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
class CharSetProber:
.
.
@@ -728,7 +714,6 @@ for line in open(f, 'rb'):
Can't convert 'bytes' object to str implicitly
Curiouser and curiouser…
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
@@ -737,12 +722,10 @@ for line in open(f, 'rb'):
File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly
-There's an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
-
There's an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn't tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
elif (self._mInputState == ePureAscii) and \
self._escDetector.search(self._mLastChar + aBuf):
-And re-run the test:
+
And re-run the test:
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
@@ -751,9 +734,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly
-Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you're thinking that the search() method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it's trying to construct the value that it will eventually pass to the search() method.
+
Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you're thinking that the search() method is expecting a value of a different type, but that wouldn't generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it's expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it's trying to construct the value that it will eventually pass to the search() method.
We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It's an instance variable, defined in the reset() method, which is actually called from the __init__() method.
-
class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(b'[\x80-\xFF]')
@@ -769,9 +751,8 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mGotData = False
self._mInputState = ePureAscii
self._mLastChar = ''
-And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can't concatenate a string to a byte array — not even a zero-length string. +
And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can't concatenate a string to a byte array — not even a zero-length string.
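The failure can be reproduced in a few lines. In this sketch, `last_char` and `a_buf` are stand-ins for `self._mLastChar` and `aBuf`, and the byte values are arbitrary.

```python
last_char = ''            # a zero-length string, as in the old reset()
a_buf = b'\xef\xbb\xbf'   # a byte array, as read from a file in 'rb' mode

# Mixing the two types fails, even though the string is empty.
try:
    combined = last_char + a_buf
    concat_error = None
except TypeError as exc:
    concat_error = exc

# With a zero-length byte array, the concatenation works.
fixed = b'' + a_buf
```

The exact wording of the TypeError message varies between Python 3 releases, so the sketch checks only the exception type.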
So what is self._mLastChar anyway? The answer is in the feed() method, just a few lines down from where the traceback occurred.
-
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
self._mInputState = eHighbyte
@@ -781,15 +762,13 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mLastChar = aBuf[-1]
The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it's needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus:
-
def reset(self):
.
.
.
- self._mLastChar = ''
+ self._mLastChar = b''
-Searching the entire codebase for "mLastChar" turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
-
Searching the entire codebase for "mLastChar" turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
class MultiByteCharSetProber(CharSetProber):
def __init__(self):
@@ -809,7 +788,6 @@ TypeError: Can't convert 'bytes' object to str implicitly
+ self._mLastChar = [0, 0]
'int' and 'bytes'
I have good news, and I have bad news. The good news is we're making progress… -
C:\home\chardet> python test.py tests\*\* tests\ascii\howto.diveintomark.org.xml Traceback (most recent call last): @@ -818,10 +796,9 @@ TypeError: Can't convert 'bytes' object to str implicitlyFile "C:\home\chardet\chardet\universaldetector.py", line 101, in feed self._escDetector.search(self._mLastChar + aBuf): TypeError: unsupported operand type(s) for +: 'int' and 'bytes' -
…The bad news is it doesn't always feel like progress. +
…The bad news is it doesn't always feel like progress.
But this is progress! Really! Even though the traceback calls out the same line of code, it's a different error than it used to be. Progress! So what's the problem now? The last time I checked, this line of code didn't try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int?
The answer lies not in the previous lines of code, but in the following lines. -
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
self._mInputState = eHighbyte
@@ -830,8 +807,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
self._mInputState = eEscAscii
self._mLastChar = aBuf[-1]
-This error doesn't occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
-
This error doesn't occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
>>> aBuf = b'\xEF\xBB\xBF' ① >>> len(aBuf) @@ -850,7 +826,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes' b'\xbf' >>> mLastChar + aBuf ⑥ b'\xbf\xef\xbb\xbf'-
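The same distinction can be written as a short script. This is a sketch with illustrative variable names; it shows that indexing a byte array gives an integer while a one-element slice gives a byte array that can be concatenated.

```python
a_buf = b'\xEF\xBB\xBF'

last_by_index = a_buf[-1]    # an int: 191
last_by_slice = a_buf[-1:]   # a bytes object of length 1: b'\xbf'

# An int cannot be concatenated with bytes...
try:
    last_by_index + a_buf
    index_error = None
except TypeError as exc:
    index_error = exc

# ...but a one-byte slice can.
rejoined = last_by_slice + a_buf
```

Slicing with `[-1:]` is one way to keep the last byte as a byte array; wrapping the integer as `bytes([last_by_index])` would give the same value.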
ord() expected string of length 1, but int found
Tired yet? You're almost there… -
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml
@@ -882,29 +857,25 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\codingstatemachine.py", line 43, in next_state
byteCls = self._mModel['classTable'][ord(c)]
TypeError: ord() expected string of length 1, but int found
-OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c defined?
-
OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c defined?
# codingstatemachine.py
def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
byteCls = self._mModel['classTable'][ord(c)]
-That's no help; it's just passed into the function. Let's pop the stack. -
That's no help; it's just passed into the function. Let's pop the stack.
# utf8prober.py
def feed(self, aBuf):
for c in aBuf:
codingState = self._mCodingSM.next_state(c)
-And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That's what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there's no need to call the ord() function because c is already an int!
+
And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That's what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there's no need to call the ord() function because c is already an int!
Thus: -
def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
- byteCls = self._mModel['classTable'][ord(c)]
+ byteCls = self._mModel['classTable'][c]
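A short sketch of what iteration yields in each case; the sample values are arbitrary and the variable names are illustrative.

```python
a_buf = b'AZ\xe9'

# Iterating over bytes yields integers directly.
byte_values = [c for c in a_buf]

# Iterating over a string yields 1-character strings, which need ord().
char_values = [ord(c) for c in 'AZ\xe9']

# Calling ord() on something that is already an int is an error.
try:
    ord(a_buf[0])
    ord_error = None
except TypeError as exc:
    ord_error = exc
```

Both lists hold the same numbers, which is why dropping the `ord()` call leaves the table lookup unchanged.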
Searching the entire codebase for instances of "ord(c)" uncovers similar problems in sbcharsetprober.py…
-
# sbcharsetprober.py
def feed(self, aBuf):
if not self._mModel['keepEnglishLetter']:
@@ -914,15 +885,13 @@ def feed(self, aBuf):
return self.get_state()
for c in aBuf:
order = self._mModel['charToOrderMap'][ord(c)]
-…and latin1prober.py…
-
…and latin1prober.py…
# latin1prober.py
def feed(self, aBuf):
aBuf = self.filter_with_english_letters(aBuf)
for c in aBuf:
charClass = Latin1_CharToClass[ord(c)]
-c is iterating over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c.
-
c is iterating over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c.
# sbcharsetprober.py
def feed(self, aBuf):
if not self._mModel['keepEnglishLetter']:
@@ -943,7 +912,6 @@ def feed(self, aBuf):
int() >= str()
Let's go again. -
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml
@@ -961,9 +929,8 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\jpcntx.py", line 176, in get_order
if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
TypeError: unorderable types: int() >= str()
-Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You're making real progress here.
+
Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You're making real progress here.
So what's this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code: -
class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
@@ -973,8 +940,7 @@ TypeError: unorderable types: int() >= str()
charLen = 2
else:
charLen = 1
-And where does aStr come from? Let's pop the stack: -
And where does aStr come from? Let's pop the stack:
def feed(self, aBuf, aLen):
.
.
@@ -982,10 +948,9 @@ TypeError: unorderable types: int() >= str()
i = self._mNeedToSkipCharNum
while i < aLen:
order, charLen = self.get_order(aBuf[i:i+2])
-Oh look, it's our old friend, aBuf. As you might have guessed from every other issue we've encountered in this chapter, aBuf is a byte array. Here, the feed() method isn't just passing it on wholesale; it's slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array.
+
Oh look, it's our old friend, aBuf. As you might have guessed from every other issue we've encountered in this chapter, aBuf is a byte array. Here, the feed() method isn't just passing it on wholesale; it's slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array.
And what is this code trying to do with aStr? It's taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can't compare integers and strings for inequality without explicitly coercing one of them.
In this case, there's no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you're comparing to are all constants. Let's change them from 1-character strings to integers. -
class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
@@ -1039,7 +1004,6 @@ TypeError: unorderable types: int() >= str()
return -1, charLen
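The comparison problem can be reproduced outside the class. In this sketch, `a_str` stands in for the two-byte slice passed to get_order(); the byte values are chosen for the example.

```python
a_str = b'\x82\xa0'   # a two-byte slice
first = a_str[0]      # an int: 130

# Ordering an int against a 1-character string is not allowed in Python 3.
try:
    first >= '\x81'
    compare_error = None
except TypeError as exc:
    compare_error = exc

# Comparing against integer constants works.
in_range = (first >= 0x81) and (first <= 0x9F)
```

Writing the constants as hexadecimal integers keeps them visually close to the original `'\x81'` and `'\x9F'` literals.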
Searching the entire codebase for occurrences of the ord() function uncovers the same problem in chardistribution.py:
-
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml
@@ -1057,8 +1021,7 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\chardistribution.py", line 174, in get_order
if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
TypeError: unorderable types: int() >= str()
-The fix is the same: -
The fix is the same:
class EUCTWDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
@@ -1165,7 +1128,6 @@ TypeError: unorderable types: int() >= str()
return -1
'reduce' is not defined
Once more into the breach… -
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml
@@ -1177,16 +1139,14 @@ tests\Big5\0804.blogspot.com.xml
File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
total = reduce(operator.add, self._mFreqCounter)
NameError: global name 'reduce' is not defined
-According to the official What's New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: "Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable."
+
According to the official What's New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: "Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable."
OK then, let's refactor it to use a for loop.
-
def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
total = reduce(operator.add, self._mFreqCounter)
The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. It looks much more readable as a for loop.
-
def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
@@ -1195,8 +1155,7 @@ NameError: global name 'reduce' is not defined
+ total = 0
+ for frequency in self._mFreqCounter:
+ total += frequency
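As a quick check that the refactoring preserves behavior, this sketch computes the same total three ways. The list `freq_counter` is a made-up stand-in for `self._mFreqCounter`, and the use of the built-in `sum()` is an extra alternative that the text above does not mention.

```python
import functools
import operator

freq_counter = [3, 0, 7, 12]  # hypothetical frequency counts

# The Python 2 idiom, now reachable through functools.
total_reduce = functools.reduce(operator.add, freq_counter)

# The explicit loop used in the refactored code.
total_loop = 0
for frequency in freq_counter:
    total_loop += frequency

# The built-in sum() does the same job for plain addition.
total_sum = sum(freq_counter)
```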
-I CAN HAZ TESTZ? -
I CAN HAZ TESTZ?
C:\home\chardet> python test.py tests\*\* tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0 tests\Big5\0804.blogspot.com.xml Big5 with confidence 0.99 @@ -1231,7 +1190,7 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide . . 316 tests-
Holy crap, it actually works! /me does a little dance +
Holy crap, it actually works! /me does a little dance
What have we learned?
2to3
print statement
In Python 2, print was a statement. Whatever you wanted to print simply followed the print keyword. In Python 3, print() is a function: whatever you want to print is passed to print() like any other function.
| Notes | Python 2 | @@ -112,7 +110,7 @@ td pre{padding:0;border:0}print >>sys.stderr, 1, 2, 3 |
print(1, 2, 3, file=sys.stderr) |
|---|
print() without any arguments.
print() with one argument
print() with two arguments.
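The following sketch exercises the Python 3 forms of print() and captures the output in an in-memory buffer so it can be inspected. Writing to `io.StringIO` instead of `sys.stderr` is a choice made for the example.

```python
import io

buffer = io.StringIO()

print(file=buffer)                      # no arguments: a blank line
print(1, file=buffer)                   # one argument
print(1, 2, file=buffer)                # two arguments, separated by a space
print(1, 2, end=' ', file=buffer)       # suppress the trailing newline
print(1, 2, 3, sep=', ', file=buffer)   # custom separator

output = buffer.getvalue()
```

The `file=` keyword replaces the Python 2 `print >>stream` syntax, and `end=' '` replaces the trailing comma that used to suppress the newline.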
@@ -121,8 +119,7 @@ td pre{padding:0;border:0}
Python 2 had two string types: Unicode strings and non-Unicode strings. Python 3 has one string type: Unicode strings. -
| Notes | Python 2 | Python 3 | @@ -134,14 +131,13 @@ td pre{padding:0;border:0}ur"PapayaWhip\foo" |
r"PapayaWhip\foo" |
|---|
unicode() global function
Python 2 had two global functions to coerce objects into strings: unicode() to coerce them into Unicode strings, and str() to coerce them into non-Unicode strings. Python 3 has only one string type, Unicode strings, so the str() function is all you need. (The unicode() function no longer exists.)
-
| Notes | Python 2 | Python 3 | @@ -150,12 +146,10 @@ td pre{padding:0;border:0}unicode(anything) |
str(anything) |
|---|
long data type
Python 2 had separate int and long types for non-floating-point numbers. An int could not be any larger than sys.maxint, which varied by platform. Longs were defined by appending an L to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called int, which mostly behaves like the long type in Python 2. Since there are no longer two types, there is no need for special syntax to distinguish them.
Further reading: PEP 237: Unifying Long Integers and Integers. -
| Notes | Python 2 | Python 3 | @@ -176,7 +170,7 @@ td pre{padding:0;border:0}isinstance(x, long) |
isinstance(x, int) |
|---|
long() function no longer exists, since longs don't exist. To coerce a variable to an integer, use the int() function.
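A minimal illustration of the unified integer type, assuming nothing beyond the built-in `int`. The variable names are made up for this example.

```python
big = 2 ** 100                         # would have been a long in Python 2
small = 7                              # would have been a plain int

same_type = type(big) is type(small)   # both are simply int now
coerced = int("12345678901234567890")  # int() handles values of any size
is_int = isinstance(coerced, int)      # replaces isinstance(x, long)
```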
@@ -185,8 +179,7 @@ td pre{padding:0;border:0}
Python 2 supported <> as a synonym for !=, the not-equals comparison operator. Python 3 supports the != operator, but not <>.
-
| Notes | Python 2 | Python 3 | @@ -198,14 +191,13 @@ td pre{padding:0;border:0}if x <> y <> z: |
if x != y != z: |
|---|
has_key() dictionary methodIn Python 2, dictionaries had a has_key() method to test whether the dictionary had a certain key. In Python 3, this method no longer exists. Instead, you need to use the in operator.
-
| Notes | Python 2 | Python 3 | @@ -226,7 +218,7 @@ td pre{padding:0;border:0}x + a_dictionary.has_key(y) |
x + (y in a_dictionary) |
|---|
in operator takes precedence over the or operator, so there is no need for parentheses here.
in takes precedence over or.
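Here is a small sketch of how Python groups these expressions. Membership tests with `in` are evaluated before `or`, while `+` is evaluated before `in`. The dictionary contents are invented for the example.

```python
a_dictionary = {"x": 1}
x, y = "x", "y"

# Parsed as (x in a_dictionary) or (y in a_dictionary)
unparenthesized = x in a_dictionary or y in a_dictionary
grouped = (x in a_dictionary) or (y in a_dictionary)

# A different expression entirely: "a_dictionary or y" is evaluated first
different = x in (a_dictionary or y)

# + binds more tightly than in, so these parentheses are required
total = 1 + (y in a_dictionary)
```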
@@ -235,8 +227,7 @@ td pre{padding:0;border:0}
In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(), items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is not a problem. If the method's return value is immediately passed to another function that iterates through the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it matters a great deal. If you were expecting a complete list with individually addressable elements, your code will choke, because views do not support indexing.
-
| Notes | Python 2 | Python 3 | @@ -257,7 +248,7 @@ td pre{padding:0;border:0}min(a_dictionary.keys()) |
no change |
|---|
2to3 errs on the side of safety, converting the return value from keys() to a static list with the list() function. This will always work, but it will be less efficient than using a view. You should examine the converted code to see if a list is absolutely necessary, or if a view would do.
items() method. 2to3 will do the same thing with the values() method.
iterkeys() method anymore. Use keys(), and if necessary, convert the view to an iterator with the iter() function.
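The difference between a view and a list is easiest to see in a few lines. This is only a sketch with a made-up dictionary: the view can be iterated and stays in sync with the dictionary, but it cannot be indexed.

```python
a_dictionary = {"a": 1, "b": 2}

keys_view = a_dictionary.keys()
smallest = min(keys_view)            # iterating over a view works fine

try:
    keys_view[0]                     # views do not support indexing
    indexable = True
except TypeError:
    indexable = False

keys_list = list(keys_view)          # a static snapshot, like Python 2's keys()

a_dictionary["c"] = 3
view_length = len(keys_view)         # the view sees the new key
list_length = len(keys_list)         # the snapshot does not

key_iterator = iter(a_dictionary.keys())   # replaces iterkeys()
first_key = next(key_iterator)
```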
@@ -268,8 +259,7 @@ td pre{padding:0;border:0}
Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical.
httpIn Python 3, several related HTTP modules have been combined into a single package, http.
-
| Notes | Python 2 | Python 3 | @@ -289,7 +279,7 @@ import SimpleHTTPServer import CGIHttpServerimport http.server |
|---|
http.client module implements a low-level library that can request HTTP resources and interpret HTTP responses.
http.cookies module provides a Pythonic interface to browser cookies that are sent in a Set-Cookie: HTTP header.
http.cookiejar module manipulates the actual files on disk that popular web browsers use to store cookies.
@@ -297,8 +287,7 @@ import CGIHttpServer
urllibPython 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib.
-
| Notes | Python 2 | Python 3 | @@ -326,7 +315,7 @@ from urllib2 import HTTPError |
|---|
urllib module in Python 2 had a variety of functions, including urlopen() for fetching data and splittype(), splithost(), and splituser() for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new urllib package. 2to3 will also change all calls to these functions so they use the new naming scheme.
urllib2 module in Python 2 has been folded into the urllib package in Python 3. All your urllib2 favorites (the build_opener() method, Request objects, and HTTPBasicAuthHandler and friends) are still available.
urllib.parse module in Python 3 contains all the parsing functions from the old urlparse module in Python 2.
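To give a feel for the reorganized package, here is a short sketch that touches `urllib.parse` and `urllib.request` without going anywhere near the network. The URL and query values are placeholders.

```python
from urllib.parse import urlparse, urlencode, quote
from urllib.request import Request

# Parsing functions that lived in Python 2's urlparse module
parts = urlparse("http://example.com:8080/path/page.html?q=1#top")

# Encoding helpers that lived in Python 2's urllib module
query = urlencode({"q": "papaya whip"})
escaped = quote("papaya whip")

# Request objects came from urllib2; building one does not open a connection
request = Request("http://example.com/?" + query)
```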
@@ -336,8 +325,7 @@ from urllib.error import HTTPError
dbmAll the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM, you can import the appropriate module within the dbm package.
-
| Notes | Python 2 | Python 3 | @@ -359,11 +347,9 @@ from urllib.error import HTTPErrorimport dbm |
|---|
xmlrpc
XML-RPC is a lightweight method of performing remote procedure calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc.
-
| Notes | Python 2 | Python 3 | @@ -376,10 +362,8 @@ import whichdb import SimpleXMLRPCServerimport xmlrpc.server |
|---|
| Notes | Python 2 | Python 3 | @@ -418,7 +402,7 @@ except ImportError:import commands |
import subprocess |
|---|
cStringIO as StringIO, and if that failed, to import StringIO instead. Do not do this in Python 3; the io module does it for you. It will find the fastest implementation available and use it automatically.
pickle module does it for you.
builtins module contains the global functions, classes, and constants used throughout the Python language. Redefining a function in the builtins module will redefine the global function everywhere. That is exactly as powerful and scary as it sounds.
@@ -432,7 +416,6 @@ except ImportError:
A package is a group of related modules that function as a single entity. In Python 2, when modules within a package need to reference each other, you use import foo or from foo import Bar. The Python 2 interpreter first searches within the current package to find foo.py, and then moves on to the other directories in the Python search path (sys.path). Python 3 works a bit differently. Instead of searching the current package, it goes directly to the Python search path. If you want one module within a package to import another module in the same package, you need to explicitly provide the relative path between the two modules.
Suppose you had this package, with multiple files in the same directory: -
chardet/ | +--__init__.py @@ -442,9 +425,8 @@ except ImportError: +--mbcharsetprober.py | +--universaldetector.py-
Now suppose that universaldetector.py needs to import the entire constants.py file and one class from mbcharsetprober.py. How do you do it?
-
| Notes | Python 2 | Python 3 | @@ -456,14 +438,13 @@ except ImportError:from mbcharsetprober import MultiByteCharSetProber |
from .mbcharsetprober import MultiByteCharSetProber |
|---|
from . import syntax. The period is actually a relative path from this file (universaldetector.py) to the file you want to import (constants.py). In this case, they are in the same directory, thus the single period. You can also import from the parent directory (from .. import anothermodule) or a subdirectory.
mbcharsetprober.py is in the same directory as universaldetector.py, so the path is a single period. You can also import from the parent directory (from ..anothermodule import AnotherClass) or a subdirectory.
next() iterator methodIn Python 2, iterators had a next() method which returned the next item in the sequence. That's still true in Python 3, but there is now also a global next() function that takes an iterator as an argument.
-
| Notes | Python 2 | Python 3 | @@ -494,7 +475,7 @@ for an_iterator in a_sequence_of_iterators: for an_iterator in a_sequence_of_iterators: an_iterator.__next__()
|---|
next() method, you now pass the iterator itself to the global next() function.
next() function. (The 2to3 script is smart enough to convert this properly.)
__next__() special method.
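The following sketch shows both sides of the change: a class that defines the `__next__()` special method, and calling code that uses the global `next()` function. The `Countdown` class is invented for this example.

```python
class Countdown:
    """A tiny iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):                  # was next() in Python 2
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


countdown = Countdown(3)
first = next(countdown)                  # the global function calls __next__()
second = countdown.__next__()            # legal, but next(countdown) reads better
remaining = list(countdown)              # consumes whatever is left
default = next(countdown, None)          # a default avoids StopIteration
```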
@@ -503,8 +484,7 @@ for an_iterator in a_sequence_of_iterators:
filter() global functionIn Python 2, the filter() function returned a list, the result of filtering a sequence through a function that returned True or False for each item in the sequence. In Python 3, the filter() function returns an iterator, not a list.
-
| Notes | Python 2 | Python 3 | @@ -525,7 +505,7 @@ for an_iterator in a_sequence_of_iterators:[i for i in filter(a_function, a_sequence)] |
no change |
|---|
2to3 will wrap a call to filter() with a call to list(), which simply iterates through its argument and returns a real list.
filter() is already wrapped in list(), 2to3 will do nothing, since the fact that filter() is returning an iterator is irrelevant.
filter(None, ...), 2to3 will transform the call into a semantically equivalent list comprehension.
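A quick sketch of why the difference matters. The iterator that `filter()` returns can only be consumed once, which is why wrapping it in `list()` is the safe conversion. The `is_even` function and the sample data are made up.

```python
def is_even(n):
    return n % 2 == 0

a_sequence = [0, 1, 2, 3, 4, 5]

lazy = filter(is_even, a_sequence)
is_list = isinstance(lazy, list)     # filter() no longer returns a list

evens = list(lazy)                   # consumes the iterator
exhausted = list(lazy)               # nothing left the second time

# The list comprehension equivalent of filter(None, a_sequence)
truthy = [item for item in a_sequence if item]
```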
@@ -534,8 +514,7 @@ for an_iterator in a_sequence_of_iterators:
map() global functionIn much the same way as filter(), the map() function now returns an iterator. (In Python 2, it returned a list.)
-
| Notes | Python 2 | Python 3 | @@ -556,7 +535,7 @@ for an_iterator in a_sequence_of_iterators:[i for i in map(a_function, a_sequence)] |
no change |
|---|
filter(), in the most basic case, 2to3 will wrap a call to map() with a call to list().
map(None, ...), the identity function, 2to3 will convert it to an equivalent call to list().
map() is a lambda function, 2to3 will convert it to an equivalent list comprehension.
@@ -565,8 +544,7 @@ for an_iterator in a_sequence_of_iterators:
reduce() global function (3.1+)In Python 3, the reduce() function has been removed from the global namespace and placed in the functools module.
-
| Notes | Python 2 | Python 3 | @@ -576,13 +554,12 @@ for an_iterator in a_sequence_of_iterators: |
|---|
+☞The version of
2to3that shipped with Python 3.0 would not fix thereduce()function automatically. The fix first appeared in the2to3script that shipped with Python 3.1.
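For reference, this is roughly what code looks like once `reduce()` is imported from `functools`. The sample lists are arbitrary.

```python
from functools import reduce
import operator

total = reduce(operator.add, [1, 2, 3, 4], 0)        # ((((0+1)+2)+3)+4)
product = reduce(lambda a, b: a * b, [1, 2, 3, 4])   # ((1*2)*3)*4
```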
apply()global functionPython 2 had a global function called
apply(), which took a function f and a list[a, b, c]and returnedf(a, b, c). In Python 3, theapply()function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function's arguments. -+
-
Notes Python 2 Python 3 @@ -600,7 +577,7 @@ reduce(a, b, c)apply(aModule.a_function, a_list_of_args)aModule.a_function(*a_list_of_args)+
- In the simplest form, you can call a function with a list of arguments (an actual list like
[a, b, c]) by prepending the list with an asterisk (*). This is exactly equivalent to the oldapply()function in Python 2.- In Python 2, the
apply()function could actually take three parameters: a function, a list of arguments, and a dictionary of named arguments. In Python 3, you can accomplish the same thing by prepending the list of arguments with an asterisk (*) and the dictionary of named arguments with two asterisks (**).- The
+operator, used here for list concatenation, takes precedence over the*operator, so there is no need for extra parentheses arounda_list_of_args + z. @@ -608,8 +585,7 @@ reduce(a, b, c)
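The three cases above can be sketched with a throwaway function. Here `a_function` and the argument values are placeholders; only the `*` and `**` calling syntax is the point.

```python
def a_function(a, b, c=0, d=0):
    return (a, b, c, d)

a_list_of_args = [1, 2]
a_dictionary_of_named_args = {"c": 3, "d": 4}
z = [30]

# apply(a_function, a_list_of_args)
positional = a_function(*a_list_of_args)

# apply(a_function, a_list_of_args, a_dictionary_of_named_args)
both = a_function(*a_list_of_args, **a_dictionary_of_named_args)

# apply(a_function, a_list_of_args + z): the lists are joined before unpacking
concatenated = a_function(*a_list_of_args + z)
```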
intern()global functionIn Python 2, you could call the
intern()function on a string to intern it as a performance optimization. In Python 3, theintern()function has been moved to thesysmodule. -+
-
Notes Python 2 Python 3 @@ -618,11 +594,9 @@ reduce(a, b, c)intern(aString)sys.intern(aString)
exec statement
Just as the print statement became a function in Python 3, so too has the exec statement. The exec() function takes a string which contains arbitrary Python code and executes it as if it were just another statement or expression. -+
-
Notes Python 2 Python 3 @@ -637,15 +611,14 @@ reduce(a, b, c)exec codeString in a_global_namespace, a_local_namespaceexec(codeString, a_global_namespace, a_local_namespace)+
- In the simplest form, the
2to3script simply encloses the code-as-a-string in parentheses, sinceexec()is now a function instead of a statement.- The old
execstatement could take a namespace, a private environment of globals in which the code-as-a-string would be executed. Python 3 can also do this; just pass the namespace as the second argument to theexec()function.- Even fancier, the old
execstatement could also take a local namespace (like the variables defined within a function). In Python 3, theexec()function can do that too.
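Here is a small sketch of `exec()` with explicit namespaces. The code string and the namespace contents are invented; the point is where the assigned name ends up.

```python
code_string = "result = x * 2"

# One namespace: names are read from it and written back into it
a_global_namespace = {"x": 21}
exec(code_string, a_global_namespace)
global_result = a_global_namespace["result"]

# Separate globals and locals: x is read from globals,
# and the assignment lands in the local namespace
a_local_namespace = {}
exec(code_string, {"x": 5}, a_local_namespace)
local_result = a_local_namespace["result"]
```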
execfilestatement (3.1+)Like the old
execstatement, the oldexecfilestatement will execute strings as if they were Python code. Whereexectook a string,execfiletook a filename. In Python 3, theexecfilestatement has been eliminated. If you really need to take a file of Python code and execute it (but you're not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the globalcompile()function to force the Python interpreter to compile the code, and then call the newexec()function. -+
-
Notes Python 2 Python 3 @@ -654,13 +627,12 @@ reduce(a, b, c)execfile("a_filename")exec(compile(open("a_filename").read(), "a_filename", "exec"))+☞The version of
2to3that shipped with Python 3.0 would not fix theexecfilestatement automatically. The fix first appeared in the2to3script that shipped with Python 3.1.
reprliterals (backticks)In Python 2, there was a special syntax of wrapping any object in backticks (like
`x`) to get a representation of the object. In Python 3, this capability still exists, but you can no longer use backticks to get it. Instead, use the globalrepr()function. -+
-
Notes Python 2 Python 3 @@ -672,14 +644,13 @@ reduce(a, b, c)`"PapayaWhip" + `2``repr("PapayaWhip" + repr(2))+
- Remember, x can be anything — a class, a function, a module, a primitive data type, etc. The
repr()function works on everything.- In Python 2, backticks could be nested, leading to this sort of confusing (but valid) expression. The
2to3tool is smart enough to convert this into nested calls torepr().
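The nested case is easier to follow when you evaluate it step by step. This sketch just runs the converted expression from the table.

```python
inner = repr(2)                              # the string "2"
nested = repr("PapayaWhip" + repr(2))        # repr of the string "PapayaWhip2"
of_a_list = repr([1, "two"])                 # repr() works on any object
```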
try...exceptstatementThe syntax for catching exceptions has changed slightly between Python 2 and Python 3. -
+
-
Notes Python 2 Python 3 @@ -715,7 +686,7 @@ except: passno change +
- Instead of a comma after the exception type, Python 3 uses a new keyword,
as.- The
askeyword also works for catching multiple types of exceptions at once.- If you catch an exception but don't actually care about accessing the exception object itself, the syntax is identical between Python 2 and Python 3. @@ -726,8 +697,7 @@ except:
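A minimal sketch of the three forms side by side. The `describe_failure` helper is hypothetical and exists only to exercise each `except` clause.

```python
def describe_failure(a_callable):
    """Call a_callable and report which kind of exception, if any, it raised."""
    try:
        a_callable()
    except (KeyError, IndexError) as e:      # several types at once, bound with "as"
        return "lookup failed: " + type(e).__name__
    except ZeroDivisionError:                # no exception object needed: unchanged syntax
        return "division failed"
    return "ok"


key_result = describe_failure(lambda: {}["missing"])
index_result = describe_failure(lambda: [][0])
division_result = describe_failure(lambda: 1 / 0)
ok_result = describe_failure(lambda: None)
```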
raisestatementThe syntax for raising your own exceptions has changed slightly between Python 2 and Python 3. -
+
-
Notes Python 2 Python 3 @@ -745,7 +715,7 @@ except:raise "error message"unsupported +
- In the simplest form, raising an exception without a custom error message, the syntax is unchanged.
- The change becomes noticeable when you want to raise an exception with a custom error message. Python 2 separated the exception class and the message with a comma; Python 3 passes the error message as a parameter.
- Python 2 supported a more complex syntax to raise an exception with a custom traceback (stack trace). You can do this in Python 3 as well, but the syntax is quite different. @@ -753,8 +723,7 @@ except:
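As a sketch of the Python 3 forms, the following raises an exception with a custom message and attaches an existing traceback using the exception's `with_traceback()` method. `MyException` and the helper function are made up for the example.

```python
import sys


class MyException(Exception):
    pass


def reraise_as_my_exception():
    try:
        1 / 0
    except ZeroDivisionError:
        a_traceback = sys.exc_info()[2]
        # Python 2: raise MyException, "error message", a_traceback
        raise MyException("error message").with_traceback(a_traceback)


try:
    reraise_as_my_exception()
except MyException as e:
    message = str(e)
    has_traceback = e.__traceback__ is not None
```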
throwmethod on generatorsIn Python 2, generators have a
throw()method. Callinga_generator.throw()raises an exception at the point where the generator was paused, then returns the next value yielded by the generator function. In Python 3, this functionality is still available, but the syntax is slightly different. -+
-
Notes Python 2 Python 3 @@ -769,15 +738,14 @@ except:a_generator.throw("error message")unsupported +
- In the simplest form, a generator throws an exception without a custom error message. In this case, the syntax has not changed between Python 2 and Python 3.
- If the generator throws an exception with a custom error message, you need to pass the error string to the exception when you create it.
- Python 2 also supported throwing an exception with only a custom error message. Python 3 does not support this, and the
2to3script will display a warning telling you that you will need to fix this code manually.
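To make the behavior concrete, here is a sketch of a generator that catches the exception thrown into it and yields a value in response. The generator and exception class are illustrative only.

```python
class MyException(Exception):
    pass


def a_generator_function():
    while True:
        try:
            yield "running"
        except MyException as e:
            # The exception appears at the paused yield
            yield "caught: " + str(e)


a_generator = a_generator_function()
first = next(a_generator)                                  # run to the first yield
thrown = a_generator.throw(MyException("error message"))   # returns the next yielded value
```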
xrange()global functionIn Python 2, there were two ways to get a range of numbers:
range(), which returned a list, andxrange(), which returned an iterator. In Python 3,range()returns an iterator, andxrange()doesn't exist. -+
-
Notes Python 2 Python 3 @@ -798,7 +766,7 @@ except:sum(range(10))no change +
- In the simplest case, the
2to3script will simply convertxrange()torange().- If your Python 2 code used
range(), the2to3script does not know whether you needed a list, or whether an iterator would do. It errs on the side of caution and coerces the return value into a list by calling thelist()function.- If the
xrange()function was inside a list comprehension, there is no need to coerce the result to a list, since the list comprehension will work just fine with an iterator. @@ -807,8 +775,7 @@ except:
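A short sketch of the three situations described above, using arbitrary small ranges.

```python
numbers = range(10)                      # what xrange(10) becomes
is_list = isinstance(numbers, list)      # range() no longer builds a list

as_list = list(range(10))                # the cautious conversion of Python 2's range(10)
squares = [i * i for i in range(4)]      # a list comprehension needs no list()
total = sum(range(10))                   # neither does anything that just iterates
```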
raw_input()andinput()global functionsPython 2 had two global functions for asking the user for input on the command line. The first, called
input(), expected the user to enter a Python expression (and returned the result). The second, calledraw_input(), just returned whatever the user typed. This was wildly confusing for beginners and widely regarded as a “wart” in the language. Python 3 excises this wart by renamingraw_input()toinput(), so it works the way everyone naively expects it to work. -+
-
Notes Python 2 Python 3 @@ -823,15 +790,14 @@ except:input()eval(input())+
- In the simplest form,
raw_input()becomesinput().- In Python 2, the
raw_input()function could take a prompt as a parameter. This has been retained in Python 3.- If you actually need to ask the user for a Python expression to evaluate, use the
input()function and pass the result toeval().
func_*function attributesIn Python 2, code within functions can access special attributes about the function itself. In Python 3, these special function attributes have been renamed for consistency with other attributes. -
+
-
Notes Python 2 Python 3 @@ -858,7 +824,7 @@ except:a_function.func_codea_function.__code__+
- The
__name__attribute (previouslyfunc_name) contains the function's name.- The
__doc__attribute (previouslyfunc_doc) contains the docstring that you defined in the function's source code.- The
__defaults__attribute (previouslyfunc_defaults) is a tuple containing default argument values for those arguments that have default values. @@ -869,8 +835,7 @@ except:
xreadlines()I/O methodIn Python 2, file objects had an
xreadlines()method which returned an iterator that would read the file one line at a time. This was useful inforloops, among other places. In fact, it was so useful, later versions of Python 2 added the capability to file objects themselves. -+
-
Notes Python 2 Python 3 @@ -882,15 +847,14 @@ except:for line in a_file.xreadlines(5):no change +
- If you used to call
xreadlines()with no arguments,2to3will convert it to just the file object. In Python 3, this will accomplish the same thing: read the file one line at a time and execute the body of theforloop.- If you used to call
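xreadlines() with an argument (the number of lines to read at a time), 2to3 will not change it, but the code will fail in Python 3 with an AttributeError, because file objects no longer have an xreadlines() method. You will need to fix this code manually.☃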
lambdafunctions with multiple parametersIn Python 2, you could define anonymous
lambdafunctions which took multiple parameters by defining the function as taking a tuple with a specific number of items. In effect, Python 2 would “unpack” the tuple into named arguments, which you could then reference (by name) within thelambdafunction. In Python 3, you can still pass a tuple to alambdafunction, but the Python interpreter will not unpack the tuple into named arguments. Instead, you will need to reference each argument by its positional index. -+
-
Notes Python 2 Python 3 @@ -905,15 +869,14 @@ except:lambda (x, (y, z)): x + y + zlambda x_y_z: x_y_z[0] + x_y_z[1][0] + x_y_z[1][1]+
- If you had defined a
lambdafunction that took a tuple of one item, in Python 3 that would become alambdawith references to x1[0]. The name x1 is autogenerated by the2to3script, based on the named arguments in the original tuple.- A
lambdafunction with a two-item tuple (x, y) gets converted to x_y with positional arguments x_y[0] and x_y[1].- The
2to3script can even handlelambdafunctions with nested tuples of named arguments. The resulting Python 3 code is a bit unreadable, but it works the same as the old code did in Python 2.Special method attributes
In Python 2, class methods can reference the class object they are defined in, as well as the method object itself.
im_self is the class instance object; im_func is the function object; im_class is the class of im_self (for bound methods) or the class that asked for the method (for unbound methods). In Python 3, these special method attributes have been renamed to follow the naming conventions of other attributes. -+
-
Notes Python 2 Python 3 @@ -928,11 +891,9 @@ except:aClassInstance.aClassMethod.im_classaClassInstance.aClassMethod.self.__class__
__nonzero__special class attributeIn Python 2, you could build your own classes that could be used in a boolean context. For example, you could instantiate the class and then use the instance in an
ifstatement. To do this, you defined a special__nonzero__()method which returnedTrueorFalse, and it was called whenever the instance was used in a boolean context. In Python 3, you can still do this, but the name of the method has changed to__bool__(). -+
-
Notes Python 2 Python 3 @@ -950,14 +911,13 @@ except: passno change +
- Instead of
__nonzero__(), Python 3 calls the__bool__()method when evaluating an instance in a boolean context.- However, if you have a
__nonzero__()method that takes arguments, the2to3tool will assume that you were using it for some other purpose, and it will not make any changes.Octal literals
The syntax for defining base 8 (octal) numbers has changed slightly between Python 2 and Python 3. -
+
-
Notes Python 2 Python 3 @@ -966,11 +926,9 @@ except:x = 0755x = 0o755
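A quick check of the new octal syntax, using the same value as the table. The related `int()` and `oct()` calls are included only to show that they agree with the literal.

```python
x = 0o755                    # Python 2 spelled this 0755

parsed = int("755", 8)       # parse a string of octal digits
formatted = oct(x)           # the string form uses the new prefix too
```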
sys.maxintDue to the integration of the
longandinttypes, thesys.maxintconstant is no longer accurate. Because the value may still be useful in determining platform-specific capabilities, it has been retained but renamed assys.maxsize. -+
-
Notes Python 2 Python 3 @@ -982,14 +940,13 @@ except:a_function(sys.maxint)a_function(sys.maxsize)+
maxintbecomesmaxsize.- Any usage of
sys.maxintbecomessys.maxsize.
callable()global functionIn Python 2, you could check whether an object was callable (like a function) with the global
callable()function. In Python 3, this global function has been eliminated. To check whether an object is callable, check for the existence of the__call__()special method. -+
-
Notes Python 2 Python 3 @@ -998,11 +955,9 @@ except:callable(anything)hasattr(anything, "__call__")
zip()global functionIn Python 2, the global
zip()function took any number of sequences and returned a list of tuples. The first tuple contained the first item from each sequence; the second tuple contained the second item from each sequence; and so on. In Python 3,zip()returns an iterator instead of a list. -+
-
Notes Python 2 Python 3 @@ -1014,14 +969,13 @@ except:d.join(zip(a, b, c))no change +
- In the simplest form, you can get the old behavior of the
zip()function by wrapping the return value in a call tolist(), which will run through the iterator thatzip()returns and return a real list of the results.- In contexts that already iterate through all the items of a sequence (such as this call to the
join()method), the iterator thatzip()returns will work just fine. The2to3script is smart enough to detect these cases and make no change to your code.
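Here is a sketch of `zip()` in both kinds of context: one where you need a real list, and one where the consumer iterates anyway. The sequences are arbitrary sample data.

```python
a = [1, 2, 3]
b = "xyz"
c = (True, False, True)

zipped = zip(a, b, c)
is_list = isinstance(zipped, list)   # zip() no longer returns a list
as_list = list(zipped)               # the old behavior, on request

pairs = dict(zip(b, a))              # dict() iterates for itself, so no list() needed
```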
StandardErrorexceptionIn Python 2,
StandardErrorwas the base class for all built-in exceptions other thanStopIteration,GeneratorExit,KeyboardInterrupt, andSystemExit. In Python 3,StandardErrorhas been eliminated; useExceptioninstead. -+
-
Notes Python 2 Python 3 @@ -1033,11 +987,9 @@ except:x = StandardError(a, b, c)x = Exception(a, b, c)
typesmodule constantsThe
typesmodule contains a variety of constants to help you determine the type of an object. In Python 2, it contained constants for all primitive types likedictandint. In Python 3, these constants have been eliminated; just use the primitive type name instead. -+
-
Notes Python 2 Python 3 @@ -1061,11 +1013,9 @@ except:types.NoneTypetype(None)
isinstance()global function (3.1+)The
isinstance()function checks whether an object is an instance of a particular class or type. In Python 2, you could pass a tuple of types, andisinstance()would returnTrueif the object was any of those types. In Python 3, you can still do this, but passing the same type twice is deprecated. -+
-
Notes Python 2 Python 3 @@ -1074,13 +1024,12 @@ except:isinstance(x, (int, float, int))isinstance(x, (int, float))+☞The version of
2to3that shipped with Python 3.0 would not fix these cases ofisinstance()automatically. The fix first appeared in the2to3script that shipped with Python 3.1.
basestringdatatypePython 2 had two string types: Unicode and non-Unicode. But there was also another type,
basestring. It was an abstract type, a superclass for both thestrandunicodetypes. It couldn't be called or instantiated directly, but you could pass it to the globalisinstance()function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, sobasestringhas no reason to exist. -+
-
Notes Python 2 Python 3 @@ -1089,10 +1038,9 @@ except:isinstance(x, basestring)isinstance(x, str)
itertoolsmodulePython 2.3 introduced the
itertoolsmodule, which defined variants of the globalzip(),map(), andfilter()functions that returned iterators instead of lists. In Python 3, those global functions return iterators, so those functions in theitertoolsmodule have been eliminated. -+
-
Notes Python 2 Python 3 @@ -1110,7 +1058,7 @@ except:from itertools import imap, izip, foofrom itertools import foo+
- Instead of
itertools.izip(), just use the globalzip()function.- Instead of
itertools.imap(), just usemap().itertools.ifilter()becomesfilter(). @@ -1118,8 +1066,7 @@ except:
sys.exc_type,sys.exc_value,sys.exc_tracebackPython 2 had three variables in the
sys module that you could access while an exception was being handled: sys.exc_type, sys.exc_value, sys.exc_traceback. (Actually, these date all the way back to Python 1.) Ever since Python 1.5, these variables have been deprecated in favor of sys.exc_info(), a function that returns a tuple containing all three values. In Python 3, these individual variables have finally gone away; you must use sys.exc_info(). -+
-
Notes Python 2 Python 3 @@ -1134,11 +1081,9 @@ except:sys.exc_tracebacksys.exc_info()[2]
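In practice you usually unpack all three values at once. A minimal sketch, with a deliberately raised `ValueError` standing in for a real failure:

```python
import sys

try:
    raise ValueError("bad value")
except ValueError:
    # Index 0, 1, and 2 replace sys.exc_type, sys.exc_value, and sys.exc_traceback
    exc_type, exc_value, exc_traceback = sys.exc_info()
```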
List comprehensions over tuples
In Python 2, if you wanted to code a list comprehension that iterated over a tuple, you did not need to put parentheses around the tuple values. In Python 3, explicit parentheses are required. -
+
-
Notes Python 2 Python 3 @@ -1147,11 +1092,9 @@ except:[i for i in 1, 2][i for i in (1, 2)]
os.getcwdu()functionPython 2 had a function named
os.getcwd(), which returned the current working directory as a (non-Unicode) string. Because modern file systems can handle directory names in any character encoding, Python 2.3 introducedos.getcwdu(). Theos.getcwdu()function returned the current working directory as a Unicode string. In Python 3, there is only one string type (Unicode), soos.getcwd()is all you need. -+
-
Notes Python 2 Python 3 @@ -1160,11 +1103,9 @@ except:os.getcwdu()os.getcwd()
Metaclasses
In Python 2, you could create metaclasses either by defining the
metaclassargument in the class declaration, or by defining a special class-level__metaclass__attribute. In Python 3, the class-level attribute has been eliminated. -+
-
Notes Python 2 Python 3 @@ -1184,7 +1125,7 @@ except:class C(Whipper, Beater, metaclass=PapayaMeta): pass+
- Declaring the metaclass in the class declaration worked in Python 2, and it still works the same in Python 3.
- Declaring the metaclass in a class attribute worked in Python 2, but doesn't work in Python 3.
- The
2to3script is smart enough to construct a valid class declaration, even if the class is inherited from one or more base classes. @@ -1196,8 +1137,7 @@ except:-☞The
2to3script will not fixset()literals by default. To enable this fix, specify -f set_literal on the command line when you call2to3.+
-
Notes Before After @@ -1212,14 +1152,12 @@ except:set([i for i in a_sequence]){i for i in a_sequence}
buffer()global function (explicit)Python objects implemented in C can export a “buffer interface,” which is a block of memory that is directly readable and writeable without copying. (That is exactly as powerful and scary as it sounds.) In Python 3,
buffer()has been renamed tomemoryview(). (It's a little more complicated than that, but you can almost certainly ignore the differences.)-☞The
2to3script will not fix thebuffer()function by default. To enable this fix, specify -f buffer on the command line when you call2to3.+
-
Notes Before After @@ -1228,14 +1166,12 @@ except:x = buffer(y)x = memoryview(y)
Whitespace around commas (explicit)
Despite being draconian about whitespace for indenting and outdenting, Python is actually quite liberal about whitespace in other areas. Within lists, tuples, sets, and dictionaries, whitespace can appear before and after commas with no ill effects. However, the Python style guide states that commas should be preceded by zero spaces and followed by one. Although this is purely an aesthetic issue (the code works either way, in both Python 2 and Python 3), the
2to3script can optionally fix this for you.-☞The
2to3 script will not fix whitespace around commas by default. To enable this fix, specify -f ws_comma on the command line when you call 2to3.+
-
Notes Before After @@ -1247,14 +1183,12 @@ except:{a :b}{a: b}
Common idioms (explicit)
There were a number of common idioms built up in the Python community. Some, like the
while 1:loop, date back to Python 1. (Python didn't have a true boolean type until version 2.3, so developers used1and0instead.) Modern Python programmers should train their brains to use modern versions of these idioms instead.-☞The
2to3script will not fix common idioms by default. To enable this fix, specify -f idioms on the command line when you call2to3.+
-
Notes Before After @@ -1277,7 +1211,6 @@ do_stuff(a_list)a_list = sorted(a_sequence) do_stuff(a_list)
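Two of these idioms in their modern form, as a rough sketch. The sample data and the `ticks` list are made up for the example.

```python
a_sequence = [3, 1, 2]

# Instead of copying the sequence to a list and calling sort() on it
a_list = sorted(a_sequence)          # returns a new list, leaves a_sequence alone

# Instead of "while 1:"
countdown = 3
ticks = []
while True:
    if countdown == 0:
        break
    ticks.append(countdown)
    countdown -= 1
```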
FIXME: once the rest of the book is written, this appendix should contain copious links back to any chapter or section that touches on these features.
© 2001–9 ℳark Pilgrim diff --git a/regular-expressions.html b/regular-expressions.html index c363623..7616d3d 100644 --- a/regular-expressions.html +++ b/regular-expressions.html @@ -9,7 +9,6 @@ body{counter-reset:h1 4} -
You are here: Home ‣ Dive Into Python 3 ‣
Regular expressions
diff --git a/strings.html b/strings.html index 4b1c2a0..db5c795 100644 --- a/strings.html +++ b/strings.html @@ -9,7 +9,6 @@ body{counter-reset:h1 3} -You are here: Home ‣ Dive Into Python 3 ‣
Strings
diff --git a/unit-testing.html b/unit-testing.html index 53fa82d..71cdb21 100644 --- a/unit-testing.html +++ b/unit-testing.html @@ -9,7 +9,6 @@ body{counter-reset:h1 7} -You are here: Home ‣ Dive Into Python 3 ‣
Unit testing
diff --git a/your-first-python-program.html b/your-first-python-program.html index 1143c59..407ef44 100644 --- a/your-first-python-program.html +++ b/your-first-python-program.html @@ -10,7 +10,6 @@ body{counter-reset:h1 1} th{font-family:inherit !important} -You are here: Home ‣ Dive Into Python 3 ‣
Your first Python program
@@ -41,7 +40,6 @@ th{font-family:inherit !important}Diving in
Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
[The code examples will be easier to follow if you enable Javascript, but whatever.] -
-SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'], 1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']} @@ -71,8 +69,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True): if __name__ == "__main__": print(approximate_size(1000000000000, False)) print(approximate_size(1000000000000))Now let's run this program on the command line. On Windows, it will look something like this: -
skip over this command output listing +
Now let's run this program on the command line. On Windows, it will look something like this:
c:\home\diveintopython3> c:\python30\python.exe humansize.py 1.0 TB @@ -82,7 +79,7 @@ if __name__ == "__main__": you@localhost:~$ python3 humansize.py 1.0 TB 931.3 GiB-FIXME: this would be a good place to explain what the program, you know, actually does. +
FIXME: this would be a good place to explain what the program, you know, actually does.
Declaring functions
Python has functions like most other languages, but it does not have separate header files like C++ or
interface/implementation sections like Pascal. When you need a function, just declare it, like this:@@ -122,7 +119,6 @@ if __name__ == "__main__":def approximate_size(size, a_kilobyte_is_1024_bytes=True):I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months.
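The diff shows only fragments of approximate_size, so here is a minimal sketch of how such a declaration might look in full. The signature, the SUFFIXES table, the formatted return, and the final raise come from the listings in this chapter; the choice of multiple, the loop, and the message of the first ValueError are assumptions filled in for illustration, not necessarily the author's exact code.

```python
# Suffix tables copied from the listing at the top of the chapter.
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}


def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    """Convert a file size to human-readable form."""
    if size < 0:
        # Assumed message; the diff does not show this line.
        raise ValueError('number must be non-negative')

    # Assumed: pick the divisor from the optional second argument.
    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    for suffix in SUFFIXES[multiple]:
        size /= multiple
        if size < multiple:
            return '{0:.1f} {1}'.format(size, suffix)

    raise ValueError('number too large')


# The second argument has a default, so it can be omitted,
# passed by position, or passed by name.
decimal_result = approximate_size(1000000000000, False)
binary_result = approximate_size(1000000000000)
keyword_result = approximate_size(4096, a_kilobyte_is_1024_bytes=False)
print(decimal_result)   # 1.0 TB
print(binary_result)    # 931.3 GiB
print(keyword_result)
```

The first two calls are the same ones the program makes when run from the command line, so their results should match the command output shown earlier.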
Documentation strings
You can document a Python function by giving it a documentation string (
docstring for short). In this program, the approximate_size function has a docstring: --def approximate_size(size, a_kilobyte_is_1024_bytes=True): """Convert a file size to human-readable form. @@ -134,7 +130,7 @@ if __name__ == "__main__": Returns: string """Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a
docstring. +Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a
docstring.@@ -149,7 +145,6 @@ if __name__ == "__main__":☞Triple quotes are also an easy way to define a string with both single and double quotes, like
qq/.../ in Perl 5.
Everything is an object
In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object.
Run the interactive Python shell and follow along: -
skip over this interpreter listing
>>> import humansize ① >>> print(humansize.approximate_size(4096, True)) ② @@ -165,7 +160,7 @@ if __name__ == "__main__": Returns: string-+
- The first line imports the
humansize program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in [FIXME xref].) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this to access functionality in other modules, and you can do it in the Python interactive shell too. This is an important concept, and you'll see a lot more of it throughout this book.
- When you want to use functions defined in imported modules, you need to include the module name. So you can't just say
approximate_size; it must be humansize.approximate_size. If you've used classes in Java, this should feel vaguely familiar.
- Instead of calling the function as you would expect to, you asked for one of the function's attributes,
__doc__. @@ -175,7 +170,6 @@ if __name__ == "__main__":
The import search path
Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in
sys.path. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.) -skip over this interpreter listing
>>> import sys ① >>> sys.path ② @@ -183,7 +177,7 @@ if __name__ == "__main__": >>> sys ③ <module 'sys' (built-in)> >>> sys.path.append('/my/new/path') ④-+
- Importing the
sys module makes all of its functions and attributes available.
- sys.path is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a .py file whose name matches what you're trying to import.
- Actually, I lied; the truth is more complicated than that, because not all modules are stored as
.py files. Some, like the sys module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The sys module is written in C.) @@ -195,7 +189,6 @@ if __name__ == "__main__":
This is so important that I'm going to repeat it in case you missed it the first few times: everything in Python is an object. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
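To make the "everything is an object" claim a little more concrete, the following sketch pokes at a function, a module, and sys.path from an ordinary script rather than the interactive shell. The greet function is a hypothetical helper invented for this example; sys.builtin_module_names is a standard attribute of the sys module.

```python
import sys


def greet(name):
    """Return a short greeting (hypothetical helper for illustration)."""
    return 'Hello, ' + name


# sys.path is an ordinary list, so the standard list methods work on it.
path_is_list = isinstance(sys.path, list)
sys.path.append('/my/new/path')    # same call as in the interpreter listing
appended = sys.path[-1]
sys.path.remove('/my/new/path')    # undo the change so the sketch has no side effects

# Functions and modules are objects, and objects have attributes.
func_doc = greet.__doc__           # the function's docstring
module_name = sys.__name__         # the module's name
is_builtin = 'sys' in sys.builtin_module_names

print(path_is_list, appended)
print(func_doc)
print(module_name, is_builtin)
```

Because sys.path is just a list, anything you append stays in effect only for the life of the running interpreter.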
Indenting code
Python functions have no explicit
begin or end, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (:) and the indentation of the code itself. --def approximate_size(size, a_kilobyte_is_1024_bytes=True): ① if size < 0: ② @@ -208,7 +201,7 @@ if __name__ == "__main__": return "{0:.1f} {1}".format(size, suffix) raise ValueError('number too large')+
- Code blocks are defined by their indentation. By "code block," I mean functions,
if statements, for loops, while loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. This means that whitespace is significant, and must be consistent. In this example, the function code is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not indented marks the end of the function.
- In Python, an
if statement is followed by a code block. If the if expression evaluates to true, the indented block is executed, otherwise it falls to the else block (if any). (Note the lack of parentheses around the expression.)
- This line is inside the
if code block. This raise statement will raise an exception (of type ValueError), but only if size < 0. @@ -221,22 +214,19 @@ if __name__ == "__main__":
Running scripts
Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of
humansize.py: --if __name__ == "__main__": print(approximate_size(1000000000000, False)) print(approximate_size(1000000000000))+☞Like C, Python uses
== for comparison and = for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
So what makes this
if statement special? Well, modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension. -skip over this interpreter listing
>>> import humansize >>> humansize.__name__ 'humansize'-But you can also run the module directly as a standalone program, in which case
__name__ will be a special default value, __main__. Python will evaluate this if statement, find a true expression, and execute the if code block. In this case, to print two values. -skip over this command output listing +
But you can also run the module directly as a standalone program, in which case
__name__ will be a special default value, __main__. Python will evaluate this if statement, find a true expression, and execute the if code block. In this case, to print two values.
c:\home\diveintopython3> c:\python30\python.exe humansize.py 1.0 TB