diff --git a/about.html b/about.html
index a66eea9..b5607b6 100644
--- a/about.html
+++ b/about.html
@@ -11,7 +11,7 @@ h1:before{content:''}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
The text of Dive Into Python 3 is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
The chardet library referenced in Case study: porting chardet to Python 3 is licensed under the LGPL 2.1 or later. The alphametics solver referenced in Advanced Iterators is based on Raymond Hettinger's solver for Python 2, which he has graciously relicensed under the MIT license so I could port it to Python 3. Advanced Classes and Special Method Names contain snippets of code from the Python standard library which are released under the Python Software Foundation License version 2. All other example code is my original work and is licensed under the MIT license. Full licensing terms are included in each source code file.
diff --git a/advanced-classes.html b/advanced-classes.html
index eff6d0a..7b5f163 100644
--- a/advanced-classes.html
+++ b/advanced-classes.html
@@ -12,11 +12,11 @@ body{counter-reset:h1 11}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♦♢
-❝ FIXME ❞
— FIXME
+❝ FIXME ❞
— FIXME
[FIXME here's why ordered dicts are useful: http://www.gossamer-threads.com/lists/python/dev/656556 ]
import collections
+import collections
import itertools
class OrderedDict(dict, collections.MutableMapping):
@@ -107,7 +107,7 @@ class OrderedDict(dict, collections.MutableMapping):
>>> import ordereddict
>>> od = ordereddict.OrderedDict()
->>> klass = od.__class__ ①
+>>> klass = od.__class__ ①
>>> type(klass)
<class 'abc.ABCMeta'>
>>> klass.__name__
@@ -163,7 +163,8 @@ class OrderedDict(dict, collections.MutableMapping):
Implementing Fractions
-© 2001–9 Mark Pilgrim
+
diff --git a/advanced-iterators.html b/advanced-iterators.html
index 2a2729b..10b2529 100644
--- a/advanced-iterators.html
+++ b/advanced-iterators.html
@@ -12,11 +12,11 @@ body{counter-reset:h1 7}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♦♢
Advanced Iterators
-❝ Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum. ❞
— Augustus De Morgan
+
❝ Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum. ❞
— Augustus De Morgan
Diving In
@@ -44,7 +44,7 @@ E = 4
In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code.
import re
+import re
import itertools
def solve(puzzle):
@@ -91,9 +91,9 @@ if __name__ == '__main__':
>>> import re
->>> re.findall('[0-9]+', '16 2-by-4s in rows of 8') ①
+>>> re.findall('[0-9]+', '16 2-by-4s in rows of 8') ①
['16', '2', '4', '8']
->>> re.findall('[A-Z]+', 'SEND + MORE == MONEY') ②
+>>> re.findall('[A-Z]+', 'SEND + MORE == MONEY') ②
['SEND', 'MORE', 'MONEY']
- The
re module is Python’s implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of numbers. The findall() function returns a list of all the substrings that matched the pattern.
@@ -108,15 +108,15 @@ if __name__ == '__main__':
>>> a_list = ['a', 'c', 'b', 'a', 'd', 'b']
->>> {c for c in a_list} ①
+>>> {c for c in a_list} ①
{'a', 'c', 'b', 'd'}
>>> a_string = 'EAST IS EAST'
->>> {c for c in a_string} ②
+>>> {c for c in a_string} ②
{'A', ' ', 'E', 'I', 'S', 'T'}
>>> words = ['SEND', 'MORE', 'MONEY']
->>> ''.join(words) ③
+>>> ''.join(words) ③
'SENDMOREMONEY'
->>> {c for c in ''.join(words)} ④
+>>> {c for c in ''.join(words)} ④
{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
- Given a list of several strings, a set comprehension with the identity function will return a set of unique strings from the list. This makes sense if you think of it like a
for loop. Take the first item from the list, put it in the set. Second. Third. Fourth — wait, that’s in the set already, so it only gets listed once. Fifth. Sixth — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first.
@@ -127,7 +127,7 @@ if __name__ == '__main__':
The alphametics solver uses this technique to build a set of all the unique characters in the puzzle.
-
unique_characters = {c for c in ''.join(words)}
+unique_characters = {c for c in ''.join(words)}
This set is later used to assign digits to characters as the solver iterates through the possible solutions.
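As a minimal sketch of the technique described above, the unique letters of a puzzle can be collected in two steps. The puzzle string and the `puzzle` and `words` names are illustrative and are not quoted from the solver's source.

```python
import re

# Illustrative puzzle; any alphametic string would do.
puzzle = 'SEND + MORE == MONEY'

# Pull out the words, then collect each distinct letter exactly once.
words = re.findall('[A-Z]+', puzzle)
unique_characters = {c for c in ''.join(words)}

# A set has no defined order, so sort it before displaying it.
print(sorted(unique_characters))
```

Because the result is a set, its display order can differ from run to run; only its membership is meaningful.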
@@ -138,8 +138,8 @@ if __name__ == '__main__':
Like many programming languages, Python has an assert statement. Here’s how it works.
->>> assert 1 + 1 == 2 ①
->>> assert 1 + 1 == 3 ②
+>>> assert 1 + 1 == 2 ①
+>>> assert 1 + 1 == 3 ②
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
@@ -150,11 +150,11 @@ AssertionError
Therefore, this line of code:
-
assert len(unique_characters) <= 10
+assert len(unique_characters) <= 10
…is equivalent to…
-
if len(unique_characters) > 10:
+if len(unique_characters) > 10:
    raise AssertionError
But a bit easier to read and write.
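The equivalence can be checked directly. This is a sketch with hypothetical helper names (`check_with_assert`, `check_with_if`, `raises_assertion`), and it assumes Python is not started with `-O`, which strips `assert` statements.

```python
def check_with_assert(unique_characters):
    # Raises AssertionError when there are more than 10 distinct letters.
    assert len(unique_characters) <= 10


def check_with_if(unique_characters):
    # The longhand form of the same check.
    if len(unique_characters) > 10:
        raise AssertionError


def raises_assertion(check, value):
    # Hypothetical helper: report whether check(value) raises AssertionError.
    try:
        check(value)
    except AssertionError:
        return True
    return False


small = set('SENDMORY')     # 8 letters: passes both checks
large = set('ABCDEFGHIJK')  # 11 letters: fails both checks
for check in (check_with_assert, check_with_if):
    print(check.__name__, raises_assertion(check, small), raises_assertion(check, large))
```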
@@ -169,14 +169,14 @@ AssertionError
>>> unique_characters = {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
->>> gen = (ord(c) for c in unique_characters) ①
->>> gen ②
+>>> gen = (ord(c) for c in unique_characters) ①
+>>> gen ②
<generator object <genexpr> at 0x00BADC10>
->>> next(gen) ③
+>>> next(gen) ③
69
>>> next(gen)
68
->>> tuple(ord(c) for c in unique_characters) ④
+>>> tuple(ord(c) for c in unique_characters) ④
(69, 68, 77, 79, 78, 83, 82, 89)
- A generator expression is like an anonymous function that yields values. The expression itself looks like a list comprehension [FIXME have we introduced this yet?], but it’s wrapped in parentheses instead of square brackets.
@@ -187,7 +187,7 @@ AssertionError
Here’s another way to accomplish the same thing, using a generator function:
-
def ord_map(a_string):
+def ord_map(a_string):
    for c in a_string:
        yield ord(c)
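To see that the generator expression and the generator function are interchangeable, here is a small sketch that runs both over the same set. The names `from_expression` and `from_function` are illustrative.

```python
unique_characters = {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}


def ord_map(a_string):
    # Yield the ordinal of each character, one at a time.
    for c in a_string:
        yield ord(c)


from_expression = (ord(c) for c in unique_characters)
from_function = ord_map(unique_characters)

# Both produce the same values; sorting removes any dependence on set order.
print(sorted(from_expression) == sorted(from_function))

# A generator can be consumed only once: a second pass yields nothing.
print(list(from_expression))
```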
@@ -202,21 +202,21 @@ gen = ord_map(unique_characters)
The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like “let’s find the permutations of 3 different items taken 2 at a time,” which means you have a sequence of 3 items and you want to find all the possible ordered pairs.
->>> import itertools ①
->>> perms = itertools.permutations([1, 2, 3], 2) ②
->>> next(perms) ③
+>>> import itertools ①
+>>> perms = itertools.permutations([1, 2, 3], 2) ②
+>>> next(perms) ③
(1, 2)
>>> next(perms)
(1, 3)
>>> next(perms)
-(2, 1) ④
+(2, 1) ④
>>> next(perms)
(2, 3)
>>> next(perms)
(3, 1)
>>> next(perms)
(3, 2)
->>> next(perms) ⑤
+>>> next(perms) ⑤
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
@@ -232,9 +232,9 @@ StopIteration
>>> import itertools
->>> perms = itertools.permutations('ABC', 3) ①
+>>> perms = itertools.permutations('ABC', 3) ①
>>> next(perms)
-('A', 'B', 'C') ②
+('A', 'B', 'C') ②
>>> next(perms)
('A', 'C', 'B')
>>> next(perms)
@@ -249,7 +249,7 @@ StopIteration
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
->>> list(itertools.permutations('ABC', 3)) ③
+>>> list(itertools.permutations('ABC', 3)) ③
[('A', 'B', 'C'), ('A', 'C', 'B'),
('B', 'A', 'C'), ('B', 'C', 'A'),
('C', 'A', 'B'), ('C', 'B', 'A')]
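As a rough cross-check on these sessions, the number of tuples that `itertools.permutations()` yields can be compared with the usual formula n! / (n − r)!. The `count_permutations` helper is hypothetical and exists only for this sketch.

```python
import itertools
import math


def count_permutations(n, r):
    # Number of ordered arrangements of r items drawn from n without repetition.
    return math.factorial(n) // math.factorial(n - r)


items = 'ABC'
for r in range(1, len(items) + 1):
    generated = list(itertools.permutations(items, r))
    # The iterator yields exactly as many tuples as the formula predicts.
    print(r, len(generated), count_permutations(len(items), r))

# Eight distinct letters mapped onto ten digits, as in SEND + MORE == MONEY.
print(count_permutations(10, 8))
```

The last figure suggests why a lazy iterator matters here: with eight letters there are 1,814,400 candidate assignments, and they never need to exist in memory at the same time.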
@@ -264,11 +264,11 @@ StopIteration
itertools Module
>>> import itertools
->>> list(itertools.product('ABC', '123')) ①
+>>> list(itertools.product('ABC', '123')) ①
[('A', '1'), ('A', '2'), ('A', '3'),
('B', '1'), ('B', '2'), ('B', '3'),
('C', '1'), ('C', '2'), ('C', '3')]
->>> list(itertools.combinations('ABC', 2)) ②
+>>> list(itertools.combinations('ABC', 2)) ②
[('A', 'B'), ('A', 'C'), ('B', 'C')]
The itertools.product() function returns an iterator containing the Cartesian product of two sequences.
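One way to read the Cartesian product is as a pair of nested loops. The sketch below uses illustrative variable names and checks that reading against the function's output.

```python
import itertools

letters, digits = 'ABC', '123'

pairs = list(itertools.product(letters, digits))

# The same pairs, in the same order, written as nested loops.
nested = [(letter, digit) for letter in letters for digit in digits]

print(pairs == nested)
print(len(pairs) == len(letters) * len(digits))
```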
@@ -277,19 +277,19 @@ StopIteration
[download favorite-people.txt]
->>> names = list(open('examples/favorite-people.txt')) ①
+>>> names = list(open('examples/favorite-people.txt')) ①
>>> names
['Dora\n', 'Ethan\n', 'Wesley\n', 'John\n', 'Anne\n', 'Mike\n', 'Chris\n', 'Sarah\n', 'Alex\n', 'Lizzie\n']
->>> names = [name.rstrip() for name in names] ②
+>>> names = [name.rstrip() for name in names] ②
>>> names
['Dora', 'Ethan', 'Wesley', 'John', 'Anne', 'Mike', 'Chris', 'Sarah', 'Alex', 'Lizzie']
->>> names = sorted(names) ③
+>>> names = sorted(names) ③
>>> names
['Alex', 'Anne', 'Chris', 'Dora', 'Ethan', 'John', 'Lizzie', 'Mike', 'Sarah', 'Wesley']
->>> names = sorted(names, key=len) ④
+>>> names = sorted(names, key=len) ④
>>> names
['Alex', 'Anne', 'Dora', 'John', 'Mike', 'Chris', 'Ethan', 'Sarah', 'Lizzie', 'Wesley']
@@ -305,7 +305,7 @@ StopIteration
…continuing from the previous interactive shell…
>>> import itertools
->>> groups = itertools.groupby(names, len) ①
+>>> groups = itertools.groupby(names, len) ①
>>> groups
<itertools.groupby object at 0x00BB20C0>
>>> list(groups)
@@ -313,7 +313,7 @@ StopIteration
(5, <itertools._grouper object at 0x00BB4050>),
(6, <itertools._grouper object at 0x00BB4030>)]
>>> groups = itertools.groupby(names, len)
->>> for name_length, name_iter in groups: ②
+>>> for name_length, name_iter in groups: ②
...     print('Names with {0:d} letters:'.format(name_length))
...     for name in name_iter:
...         print(name)
@@ -342,13 +342,13 @@ Wesley
[0, 1, 2]
>>> list(range(10, 13))
[10, 11, 12]
->>> list(itertools.chain(range(0, 3), range(10, 13))) ①
+>>> list(itertools.chain(range(0, 3), range(10, 13))) ①
[0, 1, 2, 10, 11, 12]
->>> list(zip(range(0, 3), range(10, 13))) ②
+>>> list(zip(range(0, 3), range(10, 13))) ②
[(0, 10), (1, 11), (2, 12)]
->>> list(zip(range(0, 3), range(10, 14))) ③
+>>> list(zip(range(0, 3), range(10, 14))) ③
[(0, 10), (1, 11), (2, 12)]
->>> list(itertools.zip_longest(range(0, 3), range(10, 14))) ④
+>>> list(itertools.zip_longest(range(0, 3), range(10, 14))) ④
[(0, 10), (1, 11), (2, 12), (None, 13)]
The itertools.chain() function takes two iterators and returns an iterator that contains all the items from the first iterator, followed by all the items from the second iterator. (Actually, it can take any number of iterators, and it chains them all in the order they were passed to the function.)
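The parenthetical about "any number of iterators" is easy to verify. This sketch chains three iterables of different types; the sample values are made up for illustration.

```python
import itertools

# Three iterables of different kinds, chained in the order given.
combined = list(itertools.chain(range(0, 3), 'ab', [10, 11]))
print(combined)

# With no arguments at all, the chain is simply empty.
print(list(itertools.chain()))
```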
@@ -362,10 +362,10 @@ Wesley
>>> characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
>>> guess = ('1', '2', '0', '3', '4', '5', '6', '7')
->>> tuple(zip(characters, guess)) ①
+>>> tuple(zip(characters, guess)) ①
(('S', '1'), ('M', '2'), ('E', '0'), ('D', '3'),
('O', '4'), ('N', '5'), ('R', '6'), ('Y', '7'))
->>> dict(zip(characters, guess)) ②
+>>> dict(zip(characters, guess)) ②
{'E': '0', 'D': '3', 'M': '2', 'O': '4',
'N': '5', 'S': '1', 'R': '6', 'Y': '7'}
The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution.
-
characters = tuple(ord(c) for c in sorted_characters)
+characters = tuple(ord(c) for c in sorted_characters)
digits = tuple(ord(c) for c in '0123456789')
...
for guess in itertools.permutations(digits, len(characters)):
@@ -391,10 +391,10 @@ for guess in itertools.permutations(digits, len(characters)):
Python strings have many methods. You learned about some of those methods in the Strings chapter: lower(), count(), and format(). Now I want to introduce you to a powerful but little-known string manipulation technique: the translate() method.
->>> translation_table = {ord('A'): ord('O')} ①
->>> translation_table ②
+>>> translation_table = {ord('A'): ord('O')} ①
+>>> translation_table ②
{65: 79}
->>> 'MARK'.translate(translation_table) ③
+>>> 'MARK'.translate(translation_table) ③
'MORK'
- String translation starts with a translation table, which is just a dictionary that maps one character to another. Strictly speaking, “character” is imprecise: in Python 3 the table maps one Unicode code point (an integer, as returned by ord()) to another.
@@ -405,16 +405,16 @@ for guess in itertools.permutations(digits, len(characters)):
What does this have to do with solving alphametic puzzles? As it turns out, everything.
->>> characters = tuple(ord(c) for c in 'SMEDONRY') ①
+>>> characters = tuple(ord(c) for c in 'SMEDONRY') ①
>>> characters
(83, 77, 69, 68, 79, 78, 82, 89)
->>> guess = tuple(ord(c) for c in '91570682') ②
+>>> guess = tuple(ord(c) for c in '91570682') ②
>>> guess
(57, 49, 53, 55, 48, 54, 56, 50)
->>> translation_table = dict(zip(characters, guess)) ③
+>>> translation_table = dict(zip(characters, guess)) ③
>>> translation_table
{68: 55, 69: 53, 77: 49, 78: 54, 79: 48, 82: 56, 83: 57, 89: 50}
->>> 'SEND + MORE == MONEY'.translate(translation_table) ④
+>>> 'SEND + MORE == MONEY'.translate(translation_table) ④
'9567 + 1085 == 10652'
- Using a generator expression, we quickly compute the ordinal (code point) value of each character in a string. characters is an example of the value of sorted_characters in the
alphametics.solve() function.
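Put together, the steps in this session amount to the following sketch. The `table` and `equation` names are illustrative. The comparison with `str.maketrans()` is an addition that does not appear in the chapter; it is a standard-library shortcut that builds the same kind of table from two equal-length strings.

```python
puzzle = 'SEND + MORE == MONEY'

# Ordinals of the puzzle's letters and of one candidate digit assignment.
characters = tuple(ord(c) for c in 'SMEDONRY')
guess = tuple(ord(c) for c in '91570682')

# Pair each letter's ordinal with a digit's ordinal.
table = dict(zip(characters, guess))

# str.maketrans() builds an equivalent ordinal-to-ordinal dictionary.
print(table == str.maketrans('SMEDONRY', '91570682'))

# Replace every letter with its digit; other characters pass through unchanged.
equation = puzzle.translate(table)
print(equation)
```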
@@ -455,12 +455,12 @@ for guess in itertools.permutations(digits, len(characters)):
>>> x = 5
->>> eval("x * 5") ①
+>>> eval("x * 5") ①
25
->>> eval("pow(x, 2)") ②
+>>> eval("pow(x, 2)") ②
25
>>> import math
->>> eval("math.sqrt(x)") ③
+>>> eval("math.sqrt(x)") ③
2.2360679774997898
- The expression that
eval() takes can reference global variables defined outside the eval(). If called within a function, it can reference local variables too.
@@ -472,11 +472,11 @@ for guess in itertools.permutations(digits, len(characters)):
>>> import subprocess
->>> eval("subprocess.getoutput('ls ~')") ①
+>>> eval("subprocess.getoutput('ls ~')") ①
'Desktop Library Pictures \
Documents Movies Public \
Music Sites'
->>> eval("subprocess.getoutput('rm -rf /')") ②
+>>> eval("subprocess.getoutput('rm -rf /')") ②
- The
subprocess module allows you to run arbitrary shell commands and get the result as a Python string.
- Don’t do this.
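For a harmless illustration of how `eval()` resolves names, the sketch below evaluates one expression at module level and one inside a function. The `area` function and its `radius` parameter are hypothetical.

```python
import math

x = 5  # a module-level (global) name


def area(radius):
    # Inside a function, eval() can see the local name radius
    # as well as global names such as math.
    return eval("math.pi * radius ** 2")


print(eval("x * 5"))
print(area(2))
```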
@@ -485,7 +485,7 @@ for guess in itertools.permutations(digits, len(characters)):
It’s even worse than that, because there’s a global __import__() function that takes a module name as a string, imports the module, and returns a reference to it. Combined with the power of eval(), you can construct a single expression that will wipe out all your files:
->>> eval("__import__('subprocess').getoutput('rm -rf /')") ①
+>>> eval("__import__('subprocess').getoutput('rm -rf /')") ①
>>> x = 5
->>> eval("x * 5", {}, {}) ①
+>>> eval("x * 5", {}, {}) ①
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name 'x' is not defined
->>> eval("x * 5", {"x": x}, {}) ②
+>>> eval("x * 5", {"x": x}, {}) ②
>>> import math
->>> eval("math.sqrt(x)", {"x": x}, {}) ②
+>>> eval("math.sqrt(x)", {"x": x}, {}) ②
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
@@ -519,9 +519,9 @@ NameError: name 'math' is not defined
Gee, that was easy. Lemme make an alphametics web service now!
->>> eval("pow(5, 2)", {}, {}) ①
+>>> eval("pow(5, 2)", {}, {}) ①
25
->>> eval("__import__('math').sqrt(5)", {}, {}) ②
+>>> eval("__import__('math').sqrt(5)", {}, {}) ②
2.2360679774997898
pow(5, 2) works, because 5 and 2 are literals, and pow() is a built-in function.
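The namespace behavior in these sessions can be reproduced with harmless expressions only. The `try_eval` wrapper is a hypothetical helper for this sketch; it turns a `NameError` into a string so that all four cases can be printed side by side.

```python
def try_eval(expression, global_ns, local_ns):
    # Hypothetical helper: evaluate, but report a NameError instead of raising it.
    try:
        return eval(expression, global_ns, local_ns)
    except NameError:
        return 'NameError'


x = 5

# Empty namespaces hide x...
print(try_eval("x * 5", {}, {}))
# ...unless it is passed in explicitly.
print(try_eval("x * 5", {"x": x}, {}))
# Built-in functions are still reachable with empty namespaces,
print(try_eval("pow(5, 2)", {}, {}))
# and so is __import__(), which is why empty dictionaries are not a sandbox.
print(try_eval("__import__('math').sqrt(25)", {}, {}))
```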
@@ -531,7 +531,7 @@ NameError: name 'math' is not defined
Yeah, that means you can still do nasty things, even if you explicitly set the global and local namespaces to empty dictionaries when calling eval():
->>> eval("__import__('subprocess').getoutput('rm -rf /')", {}, {}) ①
+>>> eval("__import__('subprocess').getoutput('rm -rf /')", {}, {}) ①
>>> eval("__import__('math').sqrt(5)",
-... {"__builtins__":None}, {}) ①
+... {"__builtins__":None}, {}) ①
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 1, in <module>
NameError: name '__import__' is not defined
>>> eval("__import__('subprocess').getoutput('rm -rf /')",
-... {"__builtins__":None}, {}) ②
+... {"__builtins__":None}, {}) ②
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
@@ -591,9 +591,10 @@ NameError: name '__import__' is not defined
Many, many thanks to Raymond Hettinger for agreeing to relicense his code so I could port it to Python 3 and use it as the basis for this chapter.
-© 2001–9 Mark Pilgrim
+
diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index ded5b85..9601890 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -15,11 +15,11 @@ del{background:#f87}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♦♦
chardet to Python 3
-❝ Words, words. They’re all we have to go on. ❞
— Rosencrantz and Guildenstern are Dead
+❝ Words, words. They’re all we have to go on. ❞
— Rosencrantz and Guildenstern are Dead
Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3 script changed it:
-
import __builtin__
+import __builtin__
if not hasattr(__builtin__, 'False'):
    False = 0
    True = 1
@@ -603,9 +603,9 @@ else:
This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in bool type. This code detects the absence of the built-in constants True and False, and defines them if necessary.
However, Python 3 will always have a bool type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py.
So this line in universaldetector.py:
-
self.done = constants.False
+self.done = constants.False
Becomes
-
self.done = False
+self.done = False
Ah, wasn’t that satisfying? The code is shorter and more readable already.
No module named constants
Time to run test.py again and see how far it gets.
@@ -617,12 +617,12 @@ else:
import constants, sys
ImportError: No module named constants
What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
-
from . import constants
+from . import constants
But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.
The solution is to split the import statement manually. So this two-in-one import:
-
import constants, sys
+import constants, sys
Needs to become two separate imports:
-
from . import constants
+from . import constants
import sys
There are variations of this problem scattered throughout the chardet library. In some places it’s “import constants, sys”; in other places, it’s “import constants, re”. The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
FIXME-xref to as-yet-unwritten PEP 8 style section (which says you should put all imports on their own line)
@@ -638,7 +638,7 @@ import sys
NameError: name 'file' is not defined
This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for the open() function, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.)
Thus, the simplest solution to the problem of the missing file() is to call the open() function instead:
-
for line in open(f, 'rb'):
+for line in open(f, 'rb'):
And that’s all I have to say about that.
Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.”
@@ -651,20 +651,20 @@ NameError: name 'file' is not defined
if self._highBitDetector.search(aBuf):
TypeError: can't use a string pattern on a bytes-like object
To debug this, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the UniversalDetector class:
-
class UniversalDetector:
+class UniversalDetector:
    def __init__(self):
        self._highBitDetector = re.compile(r'[\x80-\xFF]')
This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.
And therein lies the problem.
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:
-
def feed(self, aBuf):
+def feed(self, aBuf):
.
.
.
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.
-
u = UniversalDetector()
+u = UniversalDetector()
.
.
.
@@ -674,7 +674,7 @@ for line in open(f, 'rb'):
And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to the open() function, but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
What we need this regular expression to search is not an array of characters, but an array of bytes.
Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
-
class UniversalDetector:
+ class UniversalDetector:
def __init__(self):
- self._highBitDetector = re.compile(r'[\x80-\xFF]')
- self._escDetector = re.compile(r'(\033|~{)')
@@ -684,7 +684,7 @@ for line in open(f, 'rb'):
self._mCharSetProbers = []
self.reset()
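The difference between the two kinds of pattern can be reproduced outside chardet. In this sketch the sample `data` value is made up, and the two pattern names are illustrative.

```python
import re

# Made-up sample: ASCII bytes plus one high-bit byte (0xE9).
data = b'caf\xe9 au lait'

str_pattern = re.compile(r'[\x80-\xFF]')    # defined with a string
bytes_pattern = re.compile(b'[\x80-\xFF]')  # defined with a byte array

# A string pattern refuses to search bytes.
try:
    str_pattern.search(data)
except TypeError as error:
    print('TypeError:', error)

# A bytes pattern searches bytes without complaint.
match = bytes_pattern.search(data)
print(match.group(), match.start())
```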
Searching the entire codebase for other uses of the re module turns up two more instances, in charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
-
class CharSetProber:
+ class CharSetProber:
.
.
.
@@ -709,7 +709,7 @@ for line in open(f, 'rb'):
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly
There’s an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere on that line, but the traceback doesn’t tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
-
elif (self._mInputState == ePureAscii) and \
+elif (self._mInputState == ePureAscii) and \
self._escDetector.search(self._mLastChar + aBuf):
And re-run the test:
C:\home\chardet> python test.py tests\*\*
@@ -722,7 +722,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
TypeError: Can't convert 'bytes' object to str implicitly
Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second one. So what could cause a TypeError there? Perhaps you’re thinking that the search() method is expecting a value of a different type, but that wouldn’t generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value of a different type than it’s expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the search() method. So the problem must be in that + operation, as it’s trying to construct the value that it will eventually pass to the search() method.
We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It’s an instance variable, defined in the reset() method, which is actually called from the __init__() method.
-
class UniversalDetector:
+class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(b'[\x80-\xFF]')
self._escDetector = re.compile(b'(\033|~{)')
@@ -739,7 +739,7 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mLastChar = ''
And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And you can’t concatenate a string to a byte array — not even a zero-length string.
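A short check of that claim, using a hypothetical `concat_error` helper that reports which exception, if any, the `+` operation raises:

```python
def concat_error(left, right):
    # Hypothetical helper: return the exception name raised by left + right,
    # or None if the concatenation succeeds.
    try:
        left + right
    except TypeError as error:
        return type(error).__name__
    return None


# Even an empty string cannot be concatenated with bytes...
print(concat_error('', b'abc'))
# ...but an empty byte array can.
print(concat_error(b'', b'abc'))
```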
So what is self._mLastChar anyway? The answer is in the feed() method, just a few lines down from where the traceback occurred.
-
if self._mInputState == ePureAscii:
+if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
self._mInputState = eHighbyte
elif (self._mInputState == ePureAscii) and \
@@ -748,15 +748,14 @@ TypeError: Can't convert 'bytes' object to str implicitly
self._mLastChar = aBuf[-1]
The calling function calls this feed() method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case it’s needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a character, then called again with the other half.) But because aBuf is now a byte array instead of a string, self._mLastChar needs to be a byte array as well. Thus:
-
def reset(self):
+ def reset(self):
.
.
.
- self._mLastChar = ''
+ self._mLastChar = b''
Searching the entire codebase for “mLastChar” turns up a similar problem in mbcharsetprober.py, but instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
-
- class MultiByteCharSetProber(CharSetProber):
+ class MultiByteCharSetProber(CharSetProber):
def __init__(self):
CharSetProber.__init__(self)
self._mDistributionAnalyzer = None
@@ -785,7 +784,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
…The bad news is it doesn’t always feel like progress.
But this is progress! Really! Even though the traceback calls out the same line of code, it’s a different error than it used to be. Progress! So what’s the problem now? The last time I checked, this line of code didn’t try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that self._mLastChar was a byte array. How did it turn into an int?
The answer lies not in the previous lines of code, but in the following lines.
-
if self._mInputState == ePureAscii:
+if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
self._mInputState = eHighbyte
elif (self._mInputState == ePureAscii) and \
@@ -796,22 +795,22 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
This error doesn’t occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what’s the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
->>> aBuf = b'\xEF\xBB\xBF' ①
+>>> aBuf = b'\xEF\xBB\xBF' ①
>>> len(aBuf)
3
>>> mLastChar = aBuf[-1]
->>> mLastChar ②
+>>> mLastChar ②
191
->>> type(mLastChar) ③
+>>> type(mLastChar) ③
<class 'int'>
->>> mLastChar + aBuf ④
+>>> mLastChar + aBuf ④
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
->>> mLastChar = aBuf[-1:] ⑤
+>>> mLastChar = aBuf[-1:] ⑤
>>> mLastChar
b'\xbf'
->>> mLastChar + aBuf ⑥
+>>> mLastChar + aBuf ⑥
b'\xbf\xef\xbb\xbf'
- Define a byte array of length 3.
@@ -822,7 +821,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
- Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.
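The session above suggests a pattern for carrying the last byte from one call to the next. The `LastByteTracker` class below is a hypothetical stand-in written for this sketch. It is not chardet's code, and it models only the bookkeeping.

```python
class LastByteTracker:
    """Hypothetical stand-in for the last-byte bookkeeping; not chardet's code."""

    def __init__(self):
        self.last = b''   # start with a zero-length byte array
        self.seen = []    # what each call would hand to the regular expression

    def feed(self, buf):
        # Prepend the byte remembered from the previous call.
        self.seen.append(self.last + buf)
        # Slice rather than index, so the remembered value stays a byte array.
        self.last = buf[-1:]


tracker = LastByteTracker()
for chunk in (b'\xef\xbb', b'\xbfabc'):
    tracker.feed(chunk)

print(tracker.seen)
print(tracker.last)
```

Slicing has one more convenience here: `buf[-1:]` on an empty byte array returns `b''`, whereas `buf[-1]` would raise an `IndexError`.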
So, to ensure that the feed() method in universaldetector.py continues to work no matter how often it’s called, you need to initialize self._mLastChar as a 0-length byte array, then make sure it stays a byte array.
-
self._escDetector.search(self._mLastChar + aBuf):
+ self._escDetector.search(self._mLastChar + aBuf):
self._mInputState = eEscAscii
- self._mLastChar = aBuf[-1]
@@ -845,25 +844,25 @@ tests\Big5\0804.blogspot.com.xml
byteCls = self._mModel['classTable'][ord(c)]
TypeError: ord() expected string of length 1, but int found
OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c defined?
-
# codingstatemachine.py
+# codingstatemachine.py
def next_state(self, c):
    # for each byte we get its class
    # if it is first byte, we also get byte length
    byteCls = self._mModel['classTable'][ord(c)]
That’s no help; it’s just passed into the function. Let’s pop the stack.
-
# utf8prober.py
+# utf8prober.py
def feed(self, aBuf):
    for c in aBuf:
        codingState = self._mCodingSM.next_state(c)
And now we have the answer. Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That’s what you get when you iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a 1-character string. In other words, there’s no need to call the ord() function because c is already an int!
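The difference is easy to see side by side. In this sketch, `a_string` and `a_buf` are illustrative values.

```python
a_string = 'abc'
a_buf = b'abc'

# Iterating over a string yields 1-character strings...
print([type(c).__name__ for c in a_string])
# ...while iterating over a byte array yields integers.
print(list(a_buf))

# The integers are exactly what ord() would have returned for each character.
print([ord(c) for c in a_string] == list(a_buf))

# Calling ord() on one of those integers is the error from the traceback.
try:
    ord(a_buf[0])
except TypeError as error:
    print('TypeError:', error)
```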
Thus:
-
def next_state(self, c):
+ def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
- byteCls = self._mModel['classTable'][ord(c)]
+ byteCls = self._mModel['classTable'][c]
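To see why `ord(c)` fails and plain `c` works, here is a small sketch that iterates over a byte array. The `class_table` list is a hypothetical stand-in for the model's real lookup table:

```python
a_buf = b'\xef\xbb\xbf'

# Iterating over a bytes object yields ints, not 1-character strings.
items = [c for c in a_buf]

# Hypothetical lookup table, standing in for self._mModel['classTable'].
class_table = list(range(256))
classes = [class_table[c] for c in a_buf]   # index directly with the int

# Calling ord() on an int raises TypeError, as in the traceback above.
try:
    ord(a_buf[0])
    ord_failed = False
except TypeError:
    ord_failed = True
```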
Searching the entire codebase for instances of “ord(c)” uncovers similar problems in sbcharsetprober.py…
-
# sbcharsetprober.py
+# sbcharsetprober.py
def feed(self, aBuf):
if not self._mModel['keepEnglishLetter']:
aBuf = self.filter_without_english_letters(aBuf)
@@ -873,13 +872,13 @@ def feed(self, aBuf):
for c in aBuf:
order = self._mModel['charToOrderMap'][ord(c)]
…and latin1prober.py…
-
# latin1prober.py
+# latin1prober.py
def feed(self, aBuf):
aBuf = self.filter_with_english_letters(aBuf)
for c in aBuf:
charClass = Latin1_CharToClass[ord(c)]
c is iterating over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c.
-
# sbcharsetprober.py
+ # sbcharsetprober.py
def feed(self, aBuf):
if not self._mModel['keepEnglishLetter']:
aBuf = self.filter_without_english_letters(aBuf)
@@ -918,7 +917,7 @@ tests\Big5\0804.blogspot.com.xml
TypeError: unorderable types: int() >= str()
Did you notice? This time around, the code passed the first test case (tests\ascii\howto.diveintomark.org.xml). You’re making real progress here.
So what’s this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:
-
class SJISContextAnalysis(JapaneseContextAnalysis):
+class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
# find out current char's byte length
@@ -928,7 +927,7 @@ TypeError: unorderable types: int() >= str()
else:
charLen = 1
And where does aStr come from? Let’s pop the stack:
-
def feed(self, aBuf, aLen):
+def feed(self, aBuf, aLen):
.
.
.
@@ -938,7 +937,7 @@ TypeError: unorderable types: int() >= str()
Oh look, it’s our old friend, aBuf. As you might have guessed from every other issue we’ve encountered in this chapter, aBuf is a byte array. Here, the feed() method isn’t just passing it on wholesale; it’s slicing it. But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array.
And what is this code trying to do with aStr? It’s taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays, aStr[0] is an integer, and you can’t compare integers and strings for inequality without explicitly coercing one of them.
In this case, there’s no need to make the code more complicated by adding an explicit coercion. aStr[0] yields an integer; the things you’re comparing to are all constants. Let’s change them from 1-character strings to integers.
-
class SJISContextAnalysis(JapaneseContextAnalysis):
+ class SJISContextAnalysis(JapaneseContextAnalysis):
def get_order(self, aStr):
if not aStr: return -1, 1
# find out current char's byte length
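The comparison problem can be reproduced outside chardet. This sketch uses a made-up two-byte sample (`a_str` here is illustrative data, not taken from the test suite) and the `\x81` and `\x9F` bounds that appear in the tracebacks:

```python
# Hypothetical sample: two bytes whose first byte falls inside 0x81..0x9F.
a_str = b'\x82\xa0'
first = a_str[0]            # an int in Python 3

# Comparing an int to a 1-character string raises TypeError.
try:
    first >= '\x81'
    compare_failed = False
except TypeError:
    compare_failed = True

# Comparing against integer constants works without any coercion.
in_range = 0x81 <= first <= 0x9F
```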
@@ -1009,7 +1008,7 @@ tests\Big5\0804.blogspot.com.xml
if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
TypeError: unorderable types: int() >= str()
The fix is the same:
-
class EUCTWDistributionAnalysis(CharDistributionAnalysis):
+ class EUCTWDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
self._mCharToFreqOrder = EUCTWCharToFreqOrder
@@ -1127,21 +1126,21 @@ tests\Big5\0804.blogspot.com.xml
total = reduce(operator.add, self._mFreqCounter)
NameError: global name 'reduce' is not defined
According to the official What’s New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: “Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.” You can read more about the decision from Guido van Rossum’s weblog: The fate of reduce() in Python 3000.
-
def get_confidence(self):
+def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
total = reduce(operator.add, self._mFreqCounter)
The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
This monstrosity was so common that Python added a global sum() function.
-
def get_confidence(self):
+ def get_confidence(self):
if self.get_state() == constants.eNotMe:
return 0.01
- total = reduce(operator.add, self._mFreqCounter)
+ total = sum(self._mFreqCounter)
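A quick check that the two spellings agree, using a made-up list in place of `self._mFreqCounter`:

```python
import functools
import operator

# Hypothetical frequency counts, standing in for self._mFreqCounter.
freq_counter = [3, 0, 7, 12]

# The Python 3 home of reduce() is the functools module.
total_with_reduce = functools.reduce(operator.add, freq_counter)

# The built-in sum() does the same job more readably.
total_with_sum = sum(freq_counter)
```

Both give the same total, so swapping one for the other does not change the confidence calculation.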
Since you’re no longer using the operator module, you can remove that import from the top of the file as well.
-
from .charsetprober import CharSetProber
+ from .charsetprober import CharSetProber
from . import constants
- import operator
I CAN HAZ TESTZ?
@@ -1192,7 +1191,8 @@ tests\EUC-JP\arclamp.jp.xml EUC-JP with confide
- Test cases are essential. Don’t port anything without them. Don’t even try. The only reason I have any confidence at all that
chardet works in Python 3 is that I had a test suite that exercised every line of code in the entire library. I never would have found half of these problems with manual spot-checking.
© 2001–9 Mark Pilgrim + diff --git a/dip3.css b/dip3.css index c5326e8..0dfb0e0 100644 --- a/dip3.css +++ b/dip3.css @@ -37,6 +37,7 @@ Classname Legend .c = "centered" = centered footer text (also clears floats) .a = "asterism" = section break .v = "navigation" = prev/next navigation links (not breadcrumbs) +.u = "Unicode" = text contains Unicode characters (requires special font declaration) .nm = "no mobile" = hide this section on mobile devices .nd = "no decoration" = hide the widgets on this code block @@ -53,6 +54,7 @@ Acknowledgements & Inspirations "Use the Best Available Ampersand" ....................... http://simplebits.com/notebook/2008/08/14/ampersands.html "Unicode Support in HTML, Fonts, and Web Browsers" ....... http://alanwood.net/unicode/ "Punctuation" ............................................ http://en.wikipedia.org/wiki/Punctuation +"Google Code Prettify" ................................... http://code.google.com/p/google-code-prettify/ */ /* typography */ @@ -61,15 +63,15 @@ body, .w a { font: medium/1.75 'Gill Sans', 'Gill Sans MT', Corbel, Helvetica, 'Nimbus Sans L', sans-serif; word-spacing: 0.1em; } -pre, kbd, samp, code, var, .b { +pre, kbd, samp, code, var, .b, pre span { font: small/2.154 Consolas, 'Andale Mono', Monaco, 'Liberation Mono', 'Bitstream Vera Sans Mono', 'DejaVu Sans Mono', monospace; word-spacing: 0; } -span { - font: medium 'Arial Unicode MS', FreeSerif, OpenSymbol, 'DejaVu Sans', sans-serif; +span.u { + font: medium/1.75 'Arial Unicode MS', FreeSerif, OpenSymbol, 'DejaVu Sans', sans-serif; } -pre span, .a { - font-family: 'Arial Unicode MS', 'DejaVu Sans', FreeSerif, OpenSymbol, sans-serif; +pre span.u, pre span.u span, .a { + font: medium/1.75 'Arial Unicode MS', 'DejaVu Sans', FreeSerif, OpenSymbol, sans-serif; } .baa { font: oblique large Constantia, Baskerville, Palatino, 'Palatino Linotype', 'URW Palladio L', serif; @@ -201,7 +203,7 @@ li ol, .q { code, var, samp { line-height:inherit !important; } -pre a, .w a, 
pre a:hover { +pre a, td code a, .w a, pre a:hover { border: 0; } @@ -271,6 +273,7 @@ aside { #level span { color: #82b445; } + /* previous/next navigation links */ .v a { @@ -290,3 +293,17 @@ aside { margin: 0; text-shadow: gainsboro 3px 3px 3px; } + +/* syntax highlighting */ + +.str { color: #080; } +.kwd { color: #008; } +.com { color: #800; } +.typ { color: #606; } +.lit { color: #066; } +.pun { color: #660; } +.pln { color: #000; } +.tag { color: #008; } +.atn { color: #606; } +.atv { color: #080; } +.dec { color: #606; } diff --git a/files.html b/files.html index c54b07d..0041b23 100644 --- a/files.html +++ b/files.html @@ -12,11 +12,11 @@ body{counter-reset:h1 12}
You are here: Home ‣ Dive Into Python 3 ‣ +
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♢♢
-❝ FIXME ❞
— FIXME +❝ FIXME ❞
— FIXME
© 2001–9 Mark Pilgrim + diff --git a/generators.html b/generators.html index 22d02b3..73e3a9a 100644 --- a/generators.html +++ b/generators.html @@ -12,11 +12,11 @@ body{counter-reset:h1 5}
-You are here: Home ‣ Dive Into Python 3 ‣ +
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♢♢
-❝ My spelling is Wobbly. It’s good spelling but it Wobbles, and the letters get in the wrong places. ❞
— Winnie-the-Pooh +❝ My spelling is Wobbly. It’s good spelling but it Wobbles, and the letters get in the wrong places. ❞
— Winnie-the-Pooh
So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions!
import re
+import re
def plural(noun):
- if re.search('[sxz]$', noun): ①
- return re.sub('$', 'es', noun) ②
+ if re.search('[sxz]$', noun): ①
+ return re.sub('$', 'es', noun) ②
elif re.search('[^aeioudgkprt]h$', noun):
return re.sub('$', 'es', noun)
elif re.search('[^aeiou]y$', noun):
@@ -57,13 +57,13 @@ def plural(noun):
Let’s look at regular expression substitutions in more detail.
>>> import re
->>> re.search('[abc]', 'Mark') ①
+>>> re.search('[abc]', 'Mark') ①
<_sre.SRE_Match object at 0x001C1FA8>
->>> re.sub('[abc]', 'o', 'Mark') ②
+>>> re.sub('[abc]', 'o', 'Mark') ②
'Mork'
->>> re.sub('[abc]', 'o', 'rock') ③
+>>> re.sub('[abc]', 'o', 'rock') ③
'rook'
->>> re.sub('[abc]', 'o', 'caps') ④
+>>> re.sub('[abc]', 'o', 'caps') ④
'oops'
- Does the string
Mark contain a, b, or c? Yes, it contains a.
@@ -74,11 +74,11 @@ def plural(noun):
And now, back to the plural() function…
-
def plural(noun):
+def plural(noun):
if re.search('[sxz]$', noun):
- return re.sub('$', 'es', noun) ①
- elif re.search('[^aeioudgkprt]h$', noun): ②
- return re.sub('$', 'es', noun) ③
+ return re.sub('$', 'es', noun) ①
+ elif re.search('[^aeioudgkprt]h$', noun): ②
+ return re.sub('$', 'es', noun) ③
elif re.search('[^aeiou]y$', noun):
return re.sub('y$', 'ies', noun)
else:
@@ -93,13 +93,13 @@ def plural(noun):
>>> import re
->>> re.search('[^aeiou]y$', 'vacancy') ①
+>>> re.search('[^aeiou]y$', 'vacancy') ①
<_sre.SRE_Match object at 0x001C1FA8>
->>> re.search('[^aeiou]y$', 'boy') ②
+>>> re.search('[^aeiou]y$', 'boy') ②
>>>
>>> re.search('[^aeiou]y$', 'day')
>>>
->>> re.search('[^aeiou]y$', 'pita') ③
+>>> re.search('[^aeiou]y$', 'pita') ③
>>>
vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
@@ -107,11 +107,11 @@ def plural(noun):
pita does not match, because it does not end in y.
->>> re.sub('y$', 'ies', 'vacancy') ①
+>>> re.sub('y$', 'ies', 'vacancy') ①
'vacancies'
>>> re.sub('y$', 'ies', 'agency')
'agencies'
->>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') ②
+>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') ②
'vacancies'
- This regular expression turns
vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub.
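The two substitutions behave differently on a word like boy, which is why the `re.search` guard matters for the first one. A small sketch:

```python
import re

# Unguarded: replaces any trailing y, even after a vowel.
unguarded = re.sub('y$', 'ies', 'boy')

# With a remembered group, the pattern only matches a consonant + y,
# so a word ending in vowel + y is left alone.
grouped_boy = re.sub('([^aeiou])y$', r'\1ies', 'boy')
grouped_vacancy = re.sub('([^aeiou])y$', r'\1ies', 'vacancy')
```

The grouped form folds the check and the replacement into one expression; the two-step form used in `plural()` keeps the rule and the transformation separate.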
@@ -126,7 +126,7 @@ def plural(noun):
Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part.
import re
+import re
def match_sxz(noun):
return re.search('[sxz]$', noun)
@@ -140,10 +140,10 @@ def match_h(noun):
def apply_h(noun):
return re.sub('$', 'es', noun)
-def match_y(noun): ①
+def match_y(noun): ①
return re.search('[^aeiou]y$', noun)
-def apply_y(noun): ②
+def apply_y(noun): ②
return re.sub('y$', 'ies', noun)
def match_default(noun):
@@ -152,14 +152,14 @@ def match_default(noun):
def apply_default(noun):
return noun + 's'
-rules = [[match_sxz, apply_sxz], ③
+rules = [[match_sxz, apply_sxz], ③
[match_h, apply_h],
[match_y, apply_y],
[match_default, apply_default]
]
def plural(noun):
- for matches_rule, apply_rule in rules: ④
+ for matches_rule, apply_rule in rules: ④
if matches_rule(noun):
return apply_rule(noun)
@@ -174,7 +174,7 @@ def plural(noun):
If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following:
-
+
def plural(noun):
if match_sxz(noun):
return apply_sxz(noun)
@@ -206,14 +206,14 @@ def plural(noun):
Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the rules list and call them through there. Furthermore, each function follows one of two patterns. All the match functions call re.search(), and all the apply functions call re.sub(). Let’s factor out the patterns so that defining new rules can be easier.
import re
+import re
def build_match_and_apply_functions(pattern, search, replace):
- def matches_rule(word): ①
+ def matches_rule(word): ①
return re.search(pattern, word)
- def apply_rule(word): ②
+ def apply_rule(word): ②
return re.sub(search, replace, word)
- return [matches_rule, apply_rule] ③
+ return [matches_rule, apply_rule] ③
build_match_and_apply_functions() is a function that builds other functions dynamically. It takes pattern, search and replace, then defines a matches_rule() function which calls re.search() with the pattern that was passed to the build_match_and_apply_functions() function, and the word that was passed to the matches_rule() function you’re building. Whoa.
- Building the apply function works the same way. The apply function is a function that takes one parameter, and calls
re.sub() with the search and replace parameters that were passed to the build_match_and_apply_functions() function, and the word that was passed to the apply_rule() function you’re building. This technique of using the values of outside parameters within a dynamic function is called closures. You’re essentially defining constants within the apply function you’re building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function.
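One way to convince yourself that the built functions really remember their outer parameters is to inspect them. This sketch repeats the function from above and then looks at the closure; the `__code__.co_freevars` and `__closure__` attributes are standard CPython function attributes, and the variable names `matches_y`, `apply_y`, and `captured` are illustrative:

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return [matches_rule, apply_rule]

matches_y, apply_y = build_match_and_apply_functions('[^aeiou]y$', 'y$', 'ies')

# The apply function takes one argument but acts on two remembered values.
result = apply_y('vacancy')

# Map each captured variable name to the value stored in the closure.
captured = dict(zip(apply_y.__code__.co_freevars,
                    (cell.cell_contents for cell in apply_y.__closure__)))
```

`captured` holds the `search` and `replace` values that were passed in when the function was built.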
@@ -222,15 +222,14 @@ def build_match_and_apply_functions(pattern, search, replace):
If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it.
-
-patterns = \ ①
+patterns = \ ①
[
['[sxz]$', '$', 'es'],
['[^aeioudgkprt]h$', '$', 'es'],
['(qu|[^aeiou])y$', 'y$', 'ies'],
['$', '$', 's']
]
-rules = [build_match_and_apply_functions(pattern, search, replace) ②
+rules = [build_match_and_apply_functions(pattern, search, replace) ②
for (pattern, search, replace) in patterns]
- Our pluralization rules are now defined as a list of lists of strings (not functions). The first string in each group is the regular expression pattern that you would use in
re.search() to see if this rule matches. The second and third strings in each group are the search and replace expressions you would use in re.sub() to actually apply the rule to turn a noun into its plural.
@@ -239,8 +238,8 @@ def build_match_and_apply_functions(pattern, search, replace):
Rounding out this version of the script is the main entry point, the plural() function.
-
def plural(noun):
- for matches_rule, apply_rule in rules: ①
+def plural(noun):
+ for matches_rule, apply_rule in rules: ①
if matches_rule(noun):
return apply_rule(noun)
@@ -256,7 +255,7 @@ def build_match_and_apply_functions(pattern, search, replace):
First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it plural4-rules.txt.
[download plural4-rules.txt]
-
[sxz]$ $ es
+[sxz]$ $ es
[^aeioudgkprt]h$ $ es
[^aeiou]y$ y$ ies
$ $ s
@@ -266,9 +265,9 @@ $ $ s
[FIXME: now that this chapter comes before the I/O chapter, need to at least mention what open() does]
[FIXME: try/finally -> with]
import re
+import re
-def build_match_and_apply_functions(pattern, search, replace): ①
+def build_match_and_apply_functions(pattern, search, replace): ①
def matches_rule(word):
return re.search(pattern, word)
def apply_rule(word):
@@ -276,14 +275,14 @@ $ $ s
return [matches_rule, apply_rule]
rules = []
-pattern_file = open('plural4-rules.txt') ②
+pattern_file = open('plural4-rules.txt') ②
try:
- for line in pattern_file: ③
- pattern, search, replace = line.split(None, 3) ④
- rules.append(build_match_and_apply_functions( ⑤
+ for line in pattern_file: ③
+ pattern, search, replace = line.split(None, 3) ④
+ rules.append(build_match_and_apply_functions( ⑤
pattern, search, replace))
finally:
- pattern_file.close() ⑥
+ pattern_file.close() ⑥
- The
build_match_and_apply_functions() function has not changed. You’re still using closures to build two functions dynamically that use variables defined in the outer function.
- Open the file that contains the pattern strings.
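The `try`/`finally` pattern above can also be written with a `with` statement, which closes the file automatically when the block ends. This is only a sketch of that alternative, not the chapter's listing; to stay self-contained it writes the rules to a temporary file first (the temporary file and the `rules_path` name are assumptions made for the example):

```python
import os
import re
import tempfile

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return [matches_rule, apply_rule]

# Write the same four rules to a temporary file so the sketch runs anywhere.
rules_text = ('[sxz]$ $ es\n'
              '[^aeioudgkprt]h$ $ es\n'
              '[^aeiou]y$ y$ ies\n'
              '$ $ s\n')
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(rules_text)
    rules_path = tmp.name

rules = []
# open() returns a file object; iterating over it yields one line at a time.
with open(rules_path, encoding='utf-8') as pattern_file:
    for line in pattern_file:
        pattern, search, replace = line.split(None, 3)
        rules.append(build_match_and_apply_functions(pattern, search, replace))
os.remove(rules_path)

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)
```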
@@ -301,7 +300,7 @@ finally:
Wouldn’t it be grand to have a generic plural() function that parses the rules file? Get rules, check for a match, apply appropriate transformation, go to next rule. That’s all the plural() function has to do, and that’s all the plural() function should do.
def rules():
+def rules():
for line in open('plural5-rules.txt'):
pattern, search, replace = line.split(None, 3)
yield build_match_and_apply_functions(pattern, search, replace)
@@ -317,20 +316,20 @@ def plural(noun):
>>> def make_counter(x):
... print('entering make_counter')
... while True:
-... yield x ①
+... yield x ①
... print('incrementing x')
... x = x + 1
...
->>> counter = make_counter(2) ②
->>> counter ③
+>>> counter = make_counter(2) ②
+>>> counter ③
<generator object at 0x001C9C10>
->>> next(counter) ④
+>>> next(counter) ④
entering make_counter
2
->>> next(counter) ⑤
+>>> next(counter) ⑤
incrementing x
3
->>> next(counter) ⑥
+>>> next(counter) ⑥
incrementing x
4
@@ -347,11 +346,11 @@ def plural(noun):
A Fibonacci Generator
def fib(max):
- a, b = 0, 1 ①
+def fib(max):
+ a, b = 0, 1 ①
while a < max:
- yield a ②
- a, b = b, a + b ③
+ yield a ②
+ a, b = b, a + b ③
- The Fibonacci sequence is a sequence of numbers where each number is the sum of the two numbers before it. It starts with
0 and 1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1.
- a is the current number in the sequence, so yield it.
@@ -364,8 +363,8 @@ def plural(noun):
>>> from fibonacci import fib
->>> for n in fib(1000): ①
-... print(n, end=' ') ②
+>>> for n in fib(1000): ①
+... print(n, end=' ') ②
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
- You can use a generator like
fib() in a for loop directly. The for loop will automatically call the next() function to get values from the fib() generator and assign them to the for loop index variable (n).
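The same generator can be driven by hand with `next()` or consumed all at once. A short sketch that repeats `fib()` so it runs on its own:

```python
def fib(max):
    a, b = 0, 1
    while a < max:
        yield a
        a, b = b, a + b

# Step through the generator manually.
gen = fib(1000)
first_three = [next(gen), next(gen), next(gen)]

# Passing a fresh generator to list() collects every value it yields.
all_values = list(fib(1000))
```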
@@ -376,13 +375,13 @@ def plural(noun):
Let’s go back to plural5.py and see how this version of the plural() function works.
-
def rules():
+def rules():
for line in open('plural5-rules.txt'):
- pattern, search, replace = line.split(None, 3) ②
- yield build_match_and_apply_functions(pattern, search, replace) ③
+ pattern, search, replace = line.split(None, 3) ②
+ yield build_match_and_apply_functions(pattern, search, replace) ③
def plural(noun):
- for matches_rule, apply_rule in rules(): ④
+ for matches_rule, apply_rule in rules(): ④
if matches_rule(noun):
return apply_rule(noun)
@@ -406,8 +405,9 @@ def plural(noun):
- PEP 255: Simple Generators
-
© 2001–9 Mark Pilgrim
+
diff --git a/http-web-services.html b/http-web-services.html
index 107ddb6..36b45b2 100644
--- a/http-web-services.html
+++ b/http-web-services.html
@@ -13,11 +13,11 @@ mark{display:inline}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♦♢
HTTP Web Services
-❝ A ruffled mind makes a restless pillow. ❞
— Charlotte Brontë
+
❝ A ruffled mind makes a restless pillow. ❞
— Charlotte Brontë
Diving In
@@ -137,7 +137,7 @@ The second time you request the same data, you include the ETag hash in an Again with the curl:
-you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg ①
+you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg ①
HTTP/1.1 304 Not Modified
Date: Sun, 31 May 2009 18:04:39 GMT
Server: Apache
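For comparison, here is roughly how the same conditional request could be prepared with `urllib.request` from the standard library. This sketch only builds the request object and inspects it; it does not contact the server, and the variable name `req` is illustrative:

```python
import urllib.request

# Build (but do not send) a HEAD request carrying the ETag from the curl example.
req = urllib.request.Request(
    'http://wearehugh.com/m.jpg',
    headers={'If-None-Match': '"3075-ddc8d800"'},
    method='HEAD',
)

# urllib stores header names capitalized, so look the header up that way.
etag_header = req.get_header('If-none-match')
http_method = req.get_method()
```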
@@ -188,7 +188,7 @@ Cache-Control: max-age=31536000, public
Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better.
>>> import urllib.request
->>> data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read() ①
+>>> data = urllib.request.urlopen('http://diveintopython3.org/examples/feed.xml').read() ①
>>> print(data)
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
@@ -213,13 +213,13 @@ Cache-Control: max-age=31536000, public
>>> from http.client import HTTPConnection
->>> HTTPConnection.debuglevel = 1 ①
+>>> HTTPConnection.debuglevel = 1 ①
>>> from urllib.request import urlopen
->>> response = urlopen('http://diveintopython3.org/examples/feed.xml') ②
-send: b'GET /examples/feed.xml HTTP/1.1 ③
-Host: diveintopython3.org ④
-Accept-Encoding: identity ⑤
-User-Agent: Python-urllib/3.0' ⑥
+>>> response = urlopen('http://diveintopython3.org/examples/feed.xml') ②
+send: b'GET /examples/feed.xml HTTP/1.1 ③
+Host: diveintopython3.org ④
+Accept-Encoding: identity ⑤
+User-Agent: Python-urllib/3.0' ⑥
Connection: close
reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…
@@ -236,19 +236,19 @@ reply: 'HTTP/1.1 200 OK'
# continued from previous example
->>> print(response.headers.as_string()) ①
-Date: Sun, 31 May 2009 19:23:06 GMT ②
+>>> print(response.headers.as_string()) ①
+Date: Sun, 31 May 2009 19:23:06 GMT ②
Server: Apache
-Last-Modified: Sun, 31 May 2009 06:39:55 GMT ③
-ETag: "bfe-93d9c4c0" ④
+Last-Modified: Sun, 31 May 2009 06:39:55 GMT ③
+ETag: "bfe-93d9c4c0" ④
Accept-Ranges: bytes
-Content-Length: 3070 ⑤
-Cache-Control: max-age=86400 ⑥
+Content-Length: 3070 ⑤
+Cache-Control: max-age=86400 ⑥
Expires: Mon, 01 Jun 2009 19:23:06 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml
->>> data = response.read() ⑦
+>>> data = response.read() ⑦
>>> len(data)
3070
@@ -282,7 +282,7 @@ reply: 'HTTP/1.1 200 OK'
# continued from the previous example
->>> print(response2.headers.as_string()) ①
+>>> print(response2.headers.as_string()) ①
Date: Mon, 01 Jun 2009 03:58:00 GMT
Server: Apache
Last-Modified: Sun, 31 May 2009 22:51:11 GMT
@@ -295,9 +295,9 @@ Vary: Accept-Encoding
Connection: close
Content-Type: application/xml
>>> data2 = response2.read()
->>> len(data2) ②
+>>> len(data2) ②
3070
->>> data2 == data ③
+>>> data2 == data ③
True
- The server is still sending the same array of “smart” headers:
Cache-Control and Expires to allow caching, Last-Modified and ETag to enable “not-modified” tracking. Even the Vary: Accept-Encoding header hints that the server would support compression, if only you would ask for it. But you didn’t.
@@ -315,11 +315,11 @@ Content-Type: application/xml
>>> import httplib2
->>> h = httplib2.Http('.cache') ①
->>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ②
->>> response.status ③
+>>> h = httplib2.Http('.cache') ①
+>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ②
+>>> response.status ③
200
->>> content[:52] ④
+>>> content[:52] ④
b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
>>> len(content)
3070
@@ -331,7 +331,7 @@ Content-Type: application/xml
-☞You probably only need one httplib2.Http object. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use the Http object and just call the request() method twice.
+
☞You probably only need one httplib2.Http object. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use the Http object and just call the request() method twice.
How httplib2 Handles Caching
@@ -340,10 +340,10 @@ Content-Type: application/xml
# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml') ①
->>> response2.status ②
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml') ①
+>>> response2.status ②
200
->>> content2[:52] ③
+>>> content2[:52] ③
b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
>>> len(content2)
3070
@@ -360,14 +360,14 @@ Content-Type: application/xml
# Please exit out of the interactive shell
# and launch a new one.
>>> import httplib2
->>> httplib2.debuglevel = 1 ①
->>> h = httplib2.Http('.cache') ②
->>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ③
->>> len(content) ④
+>>> httplib2.debuglevel = 1 ①
+>>> h = httplib2.Http('.cache') ②
+>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ③
+>>> len(content) ④
3070
->>> response.status ⑤
+>>> response.status ⑤
200
->>> response.fromcache ⑥
+>>> response.fromcache ⑥
True
- Let’s turn on debugging and see what’s on the wire. This is the
httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back.
@@ -389,8 +389,8 @@ Content-Type: application/xml
# continued from the previous example
>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',
-... headers={'cache-control':'no-cache'}) ①
-connect: (diveintopython3.org, 80) ②
+... headers={'cache-control':'no-cache'}) ①
+connect: (diveintopython3.org, 80) ②
send: b'GET /examples/feed.xml HTTP/1.1
Host: diveintopython3.org
user-agent: Python-httplib2/$Rev: 259 $
@@ -400,9 +400,9 @@ reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…
>>> response2.status
200
->>> response2.fromcache ③
+>>> response2.fromcache ③
False
->>> print(dict(response2.items())) ④
+>>> print(dict(response2.items())) ④
{'status': '200',
'content-length': '3070',
'content-location': 'http://diveintopython3.org/examples/feed.xml',
@@ -434,14 +434,14 @@ reply: 'HTTP/1.1 200 OK'
>>> import httplib2
>>> httplib2.debuglevel = 1
>>> h = httplib2.Http('.cache')
->>> response, content = h.request('http://diveintopython3.org/') ①
+>>> response, content = h.request('http://diveintopython3.org/') ①
connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'
->>> print(dict(response.items())) ②
+>>> print(dict(response.items())) ②
{'-content-encoding': 'gzip',
'accept-ranges': 'bytes',
'connection': 'close',
@@ -454,7 +454,7 @@ reply: 'HTTP/1.1 200 OK'
'server': 'Apache',
'status': '304',
'vary': 'Accept-Encoding,User-Agent'}
->>> len(content) ③
+>>> len(content) ③
6657
- Instead of the feed, this time we’re going to download the site’s home page, which is HTML. Since this is the first time you’ve ever requested this page,
httplib2 has little to work with, and it sends out a minimum of headers with the request.
@@ -464,22 +464,22 @@ reply: 'HTTP/1.1 200 OK'
# continued from the previous example
->>> response, content = h.request('http://diveintopython3.org/') ①
+>>> response, content = h.request('http://diveintopython3.org/') ①
connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
-if-none-match: "7f806d-1a01-9fb97900" ②
-if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT ③
+if-none-match: "7f806d-1a01-9fb97900" ②
+if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT ③
accept-encoding: deflate, gzip
user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 304 Not Modified' ④
->>> response.fromcache ⑤
+reply: 'HTTP/1.1 304 Not Modified' ④
+>>> response.fromcache ⑤
True
->>> response.status ⑥
+>>> response.status ⑥
200
->>> response.dict['status'] ⑦
+>>> response.dict['status'] ⑦
'304'
->>> len(content) ⑧
+>>> len(content) ⑧
6657
- You request the same page again, with the same
Http object (and the same local cache).
@@ -501,11 +501,11 @@ user-agent: Python-httplib2/$Rev: 259 $'
connect: (diveintopython3.org, 80)
send: b'GET / HTTP/1.1
Host: diveintopython3.org
-accept-encoding: deflate, gzip ①
+accept-encoding: deflate, gzip ①
user-agent: Python-httplib2/$Rev: 259 $'
reply: 'HTTP/1.1 200 OK'
>>> print(dict(response.items()))
-{'-content-encoding': 'gzip', ②
+{'-content-encoding': 'gzip', ②
'accept-ranges': 'bytes',
'connection': 'close',
'content-length': '6657',
@@ -681,7 +681,8 @@ reply: 'HTTP/1.1 301 Moved Permanently'
- How to control caching with HTTP headers on Google Doctype
-
© 2001–9 Mark Pilgrim
+
diff --git a/iterators-and-generators.html b/iterators-and-generators.html
index e79d2b2..da6adde 100644
--- a/iterators-and-generators.html
+++ b/iterators-and-generators.html
@@ -13,10 +13,10 @@ h1:before{counter-increment:h1;content:''}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
Secret Leftover Page
-❝ You step in the stream / but the water has moved on. / This page is not here. ❞
— 404 Not Found haiku
+
❝ You step in the stream / but the water has moved on. / This page is not here. ❞
— 404 Not Found haiku
Huh?
diff --git a/iterators.html b/iterators.html
index 018af2a..f2554c9 100644
--- a/iterators.html
+++ b/iterators.html
@@ -12,11 +12,11 @@ body{counter-reset:h1 6}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♢♢
Iterators
-❝ East is East, and West is West, and never the twain shall meet. ❞
— Rudyard Kipling
+
❝ East is East, and West is West, and never the twain shall meet. ❞
— Rudyard Kipling
Diving In
@@ -25,7 +25,7 @@ body{counter-reset:h1 6}
Remember the Fibonacci generator? Here it is as a built-from-scratch iterator:
class Fib:
+class Fib:
'''iterator that yields numbers in the Fibonacci sequence'''
def __init__(self, max):
@@ -45,7 +45,7 @@ body{counter-reset:h1 6}
Let’s take that one line at a time.
-
class Fib:
+class Fib:
class? What’s a class?
@@ -57,9 +57,8 @@ body{counter-reset:h1 6}
Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class.
-
-class PapayaWhip: ①
- pass ②
+class PapayaWhip: ①
+ pass ②
- The name of this class is
PapayaWhip, and it doesn’t inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement.
- You probably guessed this, but everything in a class is indented, just like the code within a function,
if statement, for loop, or any other block of code. The first line not indented is outside the class.
@@ -68,7 +67,7 @@ class PapayaWhip: ①
This PapayaWhip class doesn’t define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It’s a statement that does nothing, and it’s a good placeholder when you’re stubbing out functions or classes.
-☞The pass statement in Python is like an empty set of curly braces ({}) in Java or C.
+
☞The pass statement in Python is like an empty set of curly braces ({}) in Java or C.
Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don’t have explicit constructors and destructors. Although it’s not required, Python classes can have something similar to a constructor: the __init__() method.
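As a rough illustration of the points above, a class or function whose body is nothing but pass is complete and usable. The function name not_written_yet is made up for this sketch and does not appear in the book's example files.

```python
class PapayaWhip:
    pass  # placeholder body; the class defines no methods or attributes


def not_written_yet():
    pass  # stubbed-out function: callable, does nothing, returns None


whip = PapayaWhip()          # works even though no __init__() is defined
result = not_written_yet()
print(type(whip).__name__)   # PapayaWhip
print(result)                # None
```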
@@ -77,11 +76,10 @@ class PapayaWhip: ①
This example shows the initialization of the Fib class using the __init__ method.
-
-class Fib:
- '''iterator that yields numbers in the Fibonacci sequence''' ①
+class Fib:
+ '''iterator that yields numbers in the Fibonacci sequence''' ①
- def __init__(self, max): ②
+ def __init__(self, max): ②
- Classes can (and should) have
docstrings too, just like modules and functions.
- The
__init__() method is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It’s tempting, because it looks like a constructor (by convention, the __init__() method is the first method defined for the class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the __init__() method is called, and you already have a valid reference to the new instance of the class.
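One way to see that the object already exists when __init__() runs is to record the order of the two steps. This is only a sketch: the Traced class and the events list are hypothetical names, and overriding __new__() is done here purely to make the construction step visible.

```python
events = []


class Traced:
    '''hypothetical class that records the order of its creation steps'''

    def __new__(cls, *args):
        events.append('__new__')       # the object is constructed here
        return super().__new__(cls)

    def __init__(self, max):
        events.append('__init__')      # self already exists at this point
        self.max = max


t = Traced(100)
print(events)   # ['__new__', '__init__']
print(t.max)    # 100
```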
@@ -98,12 +96,12 @@ class Fib:
Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__() method requires. The return value will be the newly created object.
>>> import fibonacci2
->>> fib = fibonacci2.Fib(100) ①
->>> fib ②
+>>> fib = fibonacci2.Fib(100) ①
+>>> fib ②
<fibonacci2.Fib object at 0x00DB8810>
->>> fib.__class__ ③
+>>> fib.__class__ ③
<class 'fibonacci2.Fib'>
->>> fib.__doc__ ④
+>>> fib.__doc__ ④
'iterator that yields numbers in the Fibonacci sequence'
- You are creating an instance of the
Fib class (defined in the fibonacci2 module) and assigning the newly created instance to the variable fib. You are passing one parameter, 100, which will end up as the max argument in Fib’s __init__() method.
@@ -113,7 +111,7 @@ class Fib:
-☞In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit new operator as in C++ or Java.
+
☞In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit new operator as in C++ or Java.
⁂
@@ -122,22 +120,22 @@ class Fib:
On to the next line:
-
class Fib:
+class Fib:
def __init__(self, max):
- self.max = max ①
+ self.max = max ①
- What is self.max? It’s an instance variable. It is completely separate from max, which was passed into the
__init__() method as an argument. self.max is “global” to the instance. That means that you can access it from other methods.
-class Fib:
+class Fib:
def __init__(self, max):
- self.max = max ①
+ self.max = max ①
.
.
.
def __next__(self):
fib = self.a
- if fib > self.max: ②
+ if fib > self.max: ②
- self.max is defined in the
__init__() method…
- …and referenced in the
__next__() method.
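The same pattern can be sketched with a smaller class. Counter is a hypothetical example, not part of the fibonacci2 module; it only shows that a value saved on self in __init__() is visible to other methods, and that each instance keeps its own copy.

```python
class Counter:
    '''hypothetical class, used only to illustrate instance variables'''

    def __init__(self, max):
        self.max = max      # stored on the instance, outlives __init__()
        self.count = 0

    def bump(self):
        # a different method reads the value saved by __init__()
        if self.count < self.max:
            self.count += 1
        return self.count


small, large = Counter(1), Counter(3)
small_counts = [small.bump() for _ in range(3)]
large_counts = [large.bump() for _ in range(3)]
print(small_counts)  # [1, 1, 1]
print(large_counts)  # [1, 2, 3]
```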
@@ -161,20 +159,20 @@ class Fib:
Now you’re ready to learn how to build an iterator. An iterator is just a class that defines an __iter__() method and a __next__() method.
class Fib: ①
- def __init__(self, max): ②
+class Fib: ①
+ def __init__(self, max): ②
self.max = max
- def __iter__(self): ③
+ def __iter__(self): ③
self.a, self.b = 0, 1
return self
- def __next__(self): ④
+ def __next__(self): ④
fib = self.a
if fib > self.max:
- raise StopIteration ⑤
+ raise StopIteration ⑤
self.a, self.b = self.b, self.a + self.b
- return fib ⑥
+ return fib ⑥
- To build an iterator from scratch,
fib needs to be a class, not a function.
- “Calling”
Fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later.
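To make the protocol concrete, here is the Fib class from the listing above, followed by a hand-written loop that approximates what a for loop does with it. The while loop and the values list are additions for illustration only.

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a, self.b = 0, 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


# Roughly what "for n in Fib(100)" does behind the scenes.
iterator = iter(Fib(100))       # calls Fib.__iter__()
values = []
while True:
    try:
        values.append(next(iterator))   # calls Fib.__next__()
    except StopIteration:
        break                           # the loop ends quietly

print(values)
```

With a maximum of 100 the loop collects every Fibonacci number up to 89 and stops when the next value, 144, would exceed the limit.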
@@ -211,7 +209,7 @@ class Fib:
Now it’s time for the finale. Let’s rewrite the plural rules generator as an iterator.
class LazyRules:
+class LazyRules:
rules_filename = 'plural6-rules.txt'
def __init__(self):
@@ -247,12 +245,12 @@ rules = LazyRules()
Let’s take the class one bite at a time.
-
class LazyRules:
+class LazyRules:
rules_filename = 'plural6-rules.txt'
- def __init__(self): ①
- self.pattern_file = open(self.rules_filename) ③
- self.cache = [] ②
+ def __init__(self): ①
+ self.pattern_file = open(self.rules_filename) ③
+ self.cache = [] ②
- The
__init__() method is only going to be called once, when you instantiate the class and assign it to rules.
- Since this is only going to get called once, it’s the perfect place to open the pattern file. You’ll read it later; no point doing more than you absolutely have to until absolutely necessary!
@@ -265,16 +263,16 @@ rules = LazyRules()
>>> import plural6
>>> r1 = plural6.LazyRules()
>>> r2 = plural6.LazyRules()
->>> r1.rules_filename ①
+>>> r1.rules_filename ①
'plural6-rules.txt'
>>> r2.rules_filename
'plural6-rules.txt'
->>> r1.__class__.rules_filename ②
+>>> r1.__class__.rules_filename ②
'plural6-rules.txt'
->>> r1.__class__.rules_filename = 'papayawhip.txt' ③
+>>> r1.__class__.rules_filename = 'papayawhip.txt' ③
>>> r1.rules_filename
'papayawhip.txt'
->>> r2.rules_filename ④
+>>> r2.rules_filename ④
'papayawhip.txt'
- FIXME
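The session above can be reproduced without the plural6 module. In this sketch LazyRulesSketch is a stand-in class that opens no file, and the file name r1-override.txt is invented. The last step goes one assumption further than the session: it shows that assigning through an instance shadows the class attribute for that instance only.

```python
class LazyRulesSketch:
    '''stand-in for LazyRules; it opens no file'''
    rules_filename = 'plural6-rules.txt'    # class attribute


r1 = LazyRulesSketch()
r2 = LazyRulesSketch()

# Changing the attribute on the class is visible through every instance.
LazyRulesSketch.rules_filename = 'papayawhip.txt'
shared = (r1.rules_filename, r2.rules_filename)

# Assigning through one instance creates an instance attribute that
# shadows the class attribute for that instance only.
r1.rules_filename = 'r1-override.txt'
after_override = (r1.rules_filename, r2.rules_filename)

print(shared)
print(after_override)
```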
@@ -285,9 +283,9 @@ rules = LazyRules()
And now back to our show.
-
def __iter__(self): ①
- self.cache_index = 0 ②
- return self ③
+ def __iter__(self): ①
+ self.cache_index = 0 ②
+ return self ③
- The
__iter__() method will be called every time someone — say, a for loop — calls iter(rules).
@@ -295,14 +293,14 @@ rules = LazyRules()
- Finally, the
__iter__() method returns self, which signals that this class will take care of returning its own values throughout an iteration.
- def __next__(self): ①
+ def __next__(self): ①
.
.
.
pattern, search, replace = line.split(None, 3)
- funcs = build_match_and_apply_functions( ②
+ funcs = build_match_and_apply_functions( ②
pattern, search, replace)
- self.cache.append(funcs) ③
+ self.cache.append(funcs) ③
return funcs
- The
__next__() method gets called whenever someone — say, a for loop — calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that.
@@ -312,32 +310,32 @@ rules = LazyRules()
Moving backwards…
-
def __next__(self):
+ def __next__(self):
.
.
.
- line = self.pattern_file.readline() ①
- if not line: ②
+ line = self.pattern_file.readline() ①
+ if not line: ②
self.pattern_file.close()
- raise StopIteration ③
+ raise StopIteration ③
.
.
.
- A bit of advanced file trickery here. The
readline() method (note: singular, not the plural readlines()) reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all the way down…)
- If there was a line for
readline() to read, line will not be an empty string. Even if the file contained a blank line, line would end up as the one-character string '\n' (a newline). If line is really an empty string, that means there are no more lines to read from the file.
- - When we reach the end of the file, we should close the file and raise the magic
StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. (♫ The party’s over… ♫)
+ - When we reach the end of the file, we should close the file and raise the magic
StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. (♫ The party’s over… ♫)
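The end-of-file test can be tried in isolation. In this sketch io.StringIO stands in for a real file opened with open(), and the three-line sample text is made up; the point is that a blank line comes back as '\n' while the end of the file comes back as an empty string.

```python
import io

# io.StringIO stands in for a real file opened with open().
pattern_file = io.StringIO('first\n\nthird\n')

lines = []
while True:
    line = pattern_file.readline()
    if not line:            # '' means end of file, not a blank line
        pattern_file.close()
        break
    lines.append(line)

print(lines)                # ['first\n', '\n', 'third\n']
print(pattern_file.closed)  # True
```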
Moving backwards all the way to the start of the __next__() method…
-
def __next__(self):
+ def __next__(self):
self.cache_index += 1
if len(self.cache) >= self.cache_index:
- return self.cache[self.cache_index - 1] ①
+ return self.cache[self.cache_index - 1] ①
if self.pattern_file.closed:
- raise StopIteration ②
+ raise StopIteration ②
.
.
.
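Putting the pieces together, the caching behaviour can be sketched with a simplified iterator. CachedLines is a hypothetical class that caches raw lines rather than match and apply functions, and it reads from an io.StringIO object instead of the rules file, so it approximates the structure of LazyRules rather than reproducing it.

```python
import io


class CachedLines:
    '''hypothetical iterator: reads lines lazily, replays them from a cache'''

    def __init__(self, source):
        self.source = source    # any object with readline(), close(), closed
        self.cache = []

    def __iter__(self):
        self.cache_index = 0
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]   # already read
        if self.source.closed:
            raise StopIteration                       # nothing left to read
        line = self.source.readline()
        if not line:
            self.source.close()
            raise StopIteration
        self.cache.append(line)
        return line


lines = CachedLines(io.StringIO('a\nb\n'))
first_pass = list(lines)     # reads from the source and fills the cache
second_pass = list(lines)    # served entirely from the cache
print(first_pass, second_pass)
```

The second pass never touches the closed source: __iter__() resets cache_index, so __next__() finds every item in the cache and only then notices that the source is closed.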
@@ -374,8 +372,9 @@ rules = LazyRules()
- PEP 255: Simple Generators
-
© 2001–9 Mark Pilgrim
+
diff --git a/j/dip3.js b/j/dip3.js
index cb23697..789a221 100644
--- a/j/dip3.js
+++ b/j/dip3.js
@@ -29,7 +29,8 @@ POSSIBILITY OF SUCH DAMAGE.
var HS = {'visible': 'hide', 'hidden': 'show'};
$(document).ready(function() {
hideTOC();
-
+ prettyPrint();
+
/* "hide", "open in new window", and (optionally) "download" widgets on code & screen blocks */
$("pre > code").each(function(i) {
var pre = $(this.parentNode);
@@ -90,6 +91,7 @@ $(document).ready(function() {
}
});
});
+
}); /* document.ready */
function toggleCodeBlock(id) {
@@ -100,7 +102,7 @@ function toggleCodeBlock(id) {
function plainTextOnClick(id) {
var clone = $("#" + id).clone();
- clone.find("div.w, span").remove();
+ clone.find("div.w, span.u").remove();
var win = window.open("about:blank", "plaintext", "toolbar=0,scrollbars=1,location=0,statusbar=0,menubar=0,resizable=1,width=600,height=400,left=35,top=75");
win.document.open();
win.document.write('
' + clone.html());
diff --git a/j/prettify.js b/j/prettify.js
new file mode 100644
index 0000000..80a5554
--- /dev/null
+++ b/j/prettify.js
@@ -0,0 +1,1427 @@
+// Copyright (C) 2006 Google Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+// Changes from upstream:
+// - use class=pp instead of class=prettyprint to declare blocks-to-colorize
+
+
+/**
+ * @fileoverview
+ * some functions for browser-side pretty printing of code contained in html.
+ *
+ * The lexer should work on a number of languages including C and friends,
+ * Java, Python, Bash, SQL, HTML, XML, CSS, Javascript, and Makefiles.
+ * It works passably on Ruby, PHP and Awk and a decent subset of Perl, but,
+ * because of commenting conventions, doesn't work on Smalltalk, Lisp-like, or
+ * CAML-like languages.
+ *
+ * If there's a language not mentioned here, then I don't know it, and don't
+ * know whether it works. If it has a C-like, Bash-like, or XML-like syntax
+ * then it should work passably.
+ *
+ * Usage:
+ * 1) include this source file in an html page via
+ *
+ * 2) define style rules. See the example page for examples.
+ * 3) mark the <pre> and <code> tags in your source with class=pp.
+ * You can also use the (html deprecated) <xmp> tag, but the pretty printer
+ * needs to do more substantial DOM manipulations to support that, so some
+ * css styles may not be preserved.
+ * That's it. I wanted to keep the API as simple as possible, so there's no
+ * need to specify which language the code is in.
+ *
+ * Change log:
+ * cbeust, 2006/08/22
+ * Java annotations (start with "@") are now captured as literals ("lit")
+ */
+
+// JSLint declarations
+/*global console, document, navigator, setTimeout, window */
+
+/**
+ * Split {@code prettyPrint} into multiple timeouts so as not to interfere with
+ * UI events.
+ * If set to {@code false}, {@code prettyPrint()} is synchronous.
+ */
+window['PR_SHOULD_USE_CONTINUATION'] = true;
+
+/** the number of characters between tab columns */
+window['PR_TAB_WIDTH'] = 8;
+
+/** Walks the DOM returning a properly escaped version of innerHTML.
+ * @param {Node} node
+ * @param {Array.<string>} out output buffer that receives chunks of HTML.
+ */
+window['PR_normalizedHtml']
+
+/** Contains functions for creating and registering new language handlers.
+ * @type {Object}
+ */
+ = window['PR']
+
+/** Pretty print a chunk of code.
+ *
+ * @param {string} sourceCodeHtml code as html
+ * @return {string} code as html, but prettier
+ */
+ = window['prettyPrintOne']
+/** Find all the {@code <pre>} and {@code <code>} tags in the DOM with
+ * {@code class=pp} and prettify them.
+ * @param {Function?} opt_whenDone if specified, called when the last entry
+ * has been finished.
+ */
+ = window['prettyPrint'] = void 0;
+
+/** browser detection. @extern */
+window['_pr_isIE6'] = function () {
+ var isIE6 = navigator && navigator.userAgent &&
+ /\bMSIE 6\./.test(navigator.userAgent);
+ window['_pr_isIE6'] = function () { return isIE6; };
+ return isIE6;
+};
+
+
+(function () {
+ // Keyword lists for various languages.
+ var FLOW_CONTROL_KEYWORDS =
+ "break continue do else for if return while ";
+ var C_KEYWORDS = FLOW_CONTROL_KEYWORDS + "auto case char const default " +
+ "double enum extern float goto int long register short signed sizeof " +
+ "static struct switch typedef union unsigned void volatile ";
+ var COMMON_KEYWORDS = C_KEYWORDS + "catch class delete false import " +
+ "new operator private protected public this throw true try ";
+ var CPP_KEYWORDS = COMMON_KEYWORDS + "alignof align_union asm axiom bool " +
+ "concept concept_map const_cast constexpr decltype " +
+ "dynamic_cast explicit export friend inline late_check " +
+ "mutable namespace nullptr reinterpret_cast static_assert static_cast " +
+ "template typeid typename typeof using virtual wchar_t where ";
+ var JAVA_KEYWORDS = COMMON_KEYWORDS +
+ "boolean byte extends final finally implements import instanceof null " +
+ "native package strictfp super synchronized throws transient ";
+ var CSHARP_KEYWORDS = JAVA_KEYWORDS +
+ "as base by checked decimal delegate descending event " +
+ "fixed foreach from group implicit in interface internal into is lock " +
+ "object out override orderby params partial readonly ref sbyte sealed " +
+ "stackalloc string select uint ulong unchecked unsafe ushort var ";
+ var JSCRIPT_KEYWORDS = COMMON_KEYWORDS +
+ "debugger eval export function get null set undefined var with " +
+ "Infinity NaN ";
+ var PERL_KEYWORDS = "caller delete die do dump elsif eval exit foreach for " +
+ "goto if import last local my next no our print package redo require " +
+ "sub undef unless until use wantarray while BEGIN END ";
+ var PYTHON_KEYWORDS = FLOW_CONTROL_KEYWORDS + "and as assert class def del " +
+ "elif except exec finally from global import in is lambda " +
+ "nonlocal not or pass print raise try with yield " +
+ "False True None ";
+ var RUBY_KEYWORDS = FLOW_CONTROL_KEYWORDS + "alias and begin case class def" +
+ " defined elsif end ensure false in module next nil not or redo rescue " +
+ "retry self super then true undef unless until when yield BEGIN END ";
+ var SH_KEYWORDS = FLOW_CONTROL_KEYWORDS + "case done elif esac eval fi " +
+ "function in local set then until ";
+ var ALL_KEYWORDS = (
+ CPP_KEYWORDS + CSHARP_KEYWORDS + JSCRIPT_KEYWORDS + PERL_KEYWORDS +
+ PYTHON_KEYWORDS + RUBY_KEYWORDS + SH_KEYWORDS);
+
+ // token style names. correspond to css classes
+ /** token style for a string literal */
+ var PR_STRING = 'str';
+ /** token style for a keyword */
+ var PR_KEYWORD = 'kwd';
+ /** token style for a comment */
+ var PR_COMMENT = 'com';
+ /** token style for a type */
+ var PR_TYPE = 'typ';
+ /** token style for a literal value. e.g. 1, null, true. */
+ var PR_LITERAL = 'lit';
+ /** token style for a punctuation string. */
+ var PR_PUNCTUATION = 'pun';
+ /** token style for plain text. */
+ var PR_PLAIN = 'pln';
+
+ /** token style for an sgml tag. */
+ var PR_TAG = 'tag';
+ /** token style for a markup declaration such as a DOCTYPE. */
+ var PR_DECLARATION = 'dec';
+ /** token style for embedded source. */
+ var PR_SOURCE = 'src';
+ /** token style for an sgml attribute name. */
+ var PR_ATTRIB_NAME = 'atn';
+ /** token style for an sgml attribute value. */
+ var PR_ATTRIB_VALUE = 'atv';
+
+ /**
+ * A class that indicates a section of markup that is not code, e.g. to allow
+ * embedding of line numbers within code listings.
+ */
+ var PR_NOCODE = 'nocode';
+
+ /** A set of tokens that can precede a regular expression literal in
+ * javascript.
+ * http://www.mozilla.org/js/language/js20/rationale/syntax.html has the full
+ * list, but I've removed ones that might be problematic when seen in
+ * languages that don't support regular expression literals.
+ *
+ * Specifically, I've removed any keywords that can't precede a regexp
+ * literal in a syntactically legal javascript program, and I've removed the
+ * "in" keyword since it's not a keyword in many languages, and might be used
+ * as a count of inches.
+ *
+ * The link above does not accurately describe EcmaScript rules since
+ * it fails to distinguish between (a=++/b/i) and (a++/b/i) but it works
+ * very well in practice.
+ *
+ * @private
+ */
+ var REGEXP_PRECEDER_PATTERN = function () {
+ var preceders = [
+ "!", "!=", "!==", "#", "%", "%=", "&", "&&", "&&=",
+ "&=", "(", "*", "*=", /* "+", */ "+=", ",", /* "-", */ "-=",
+ "->", /*".", "..", "...", handled below */ "/", "/=", ":", "::", ";",
+ "<", "<<", "<<=", "<=", "=", "==", "===", ">",
+ ">=", ">>", ">>=", ">>>", ">>>=", "?", "@", "[",
+ "^", "^=", "^^", "^^=", "{", "|", "|=", "||",
+ "||=", "~" /* handles =~ and !~ */,
+ "break", "case", "continue", "delete",
+ "do", "else", "finally", "instanceof",
+ "return", "throw", "try", "typeof"
+ ];
+ var pattern = '(?:^^|[+-]';
+ for (var i = 0; i < preceders.length; ++i) {
+ pattern += '|' + preceders[i].replace(/([^=<>:&a-z])/g, '\\$1');
+ }
+ pattern += ')\\s*'; // matches at end, and matches empty string
+ return pattern;
+ // CAVEAT: this does not properly handle the case where a regular
+ // expression immediately follows another since a regular expression may
+ // have flags for case-sensitivity and the like. Having regexp tokens
+ // adjacent is not valid in any language I'm aware of, so I'm punting.
+ // TODO: maybe style special characters inside a regexp as punctuation.
+ }();
+
+ // Define regexps here so that the interpreter doesn't have to create an
+ // object each time the function containing them is called.
+ // The language spec requires a new object created even if you don't access
+ // the $1 members.
+ var pr_amp = /&/g;
+ var pr_lt = /</g;
+ var pr_gt = />/g;
+ var pr_quot = /\"/g;
+ /** like textToHtml but escapes double quotes to be attribute safe. */
+ function attribToHtml(str) {
+ return str.replace(pr_amp, '&amp;')
+ .replace(pr_lt, '&lt;')
+ .replace(pr_gt, '&gt;')
+ .replace(pr_quot, '&quot;');
+ }
+
+ /** escapes html special characters to html. */
+ function textToHtml(str) {
+ return str.replace(pr_amp, '&amp;')
+ .replace(pr_lt, '&lt;')
+ .replace(pr_gt, '&gt;');
+ }
+
+
+ var pr_ltEnt = /&lt;/g;
+ var pr_gtEnt = /&gt;/g;
+ var pr_aposEnt = /&apos;/g;
+ var pr_quotEnt = /&quot;/g;
+ var pr_ampEnt = /&amp;/g;
+ var pr_nbspEnt = /&nbsp;/g;
+ /** unescapes html to plain text. */
+ function htmlToText(html) {
+ var pos = html.indexOf('&');
+ if (pos < 0) { return html; }
+ // Handle numeric entities specially. We can't use functional substitution
+ // since that doesn't work in older versions of Safari.
+ // These should be rare since most browsers convert them to normal chars.
+ for (--pos; (pos = html.indexOf('&#', pos + 1)) >= 0;) {
+ var end = html.indexOf(';', pos);
+ if (end >= 0) {
+ var num = html.substring(pos + 3, end);
+ var radix = 10;
+ if (num && num.charAt(0) === 'x') {
+ num = num.substring(1);
+ radix = 16;
+ }
+ var codePoint = parseInt(num, radix);
+ if (!isNaN(codePoint)) {
+ html = (html.substring(0, pos) + String.fromCharCode(codePoint) +
+ html.substring(end + 1));
+ }
+ }
+ }
+
+ return html.replace(pr_ltEnt, '<')
+ .replace(pr_gtEnt, '>')
+ .replace(pr_aposEnt, "'")
+ .replace(pr_quotEnt, '"')
+ .replace(pr_ampEnt, '&')
+ .replace(pr_nbspEnt, ' ');
+ }
+
+ /** is the given node's innerHTML normally unescaped? */
+ function isRawContent(node) {
+ return 'XMP' === node.tagName;
+ }
+
+ function normalizedHtml(node, out) {
+ switch (node.nodeType) {
+ case 1: // an element
+ var name = node.tagName.toLowerCase();
+ out.push('<', name);
+ for (var i = 0; i < node.attributes.length; ++i) {
+ var attr = node.attributes[i];
+ if (!attr.specified) { continue; }
+ out.push(' ');
+ normalizedHtml(attr, out);
+ }
+ out.push('>');
+ for (var child = node.firstChild; child; child = child.nextSibling) {
+ normalizedHtml(child, out);
+ }
+ if (node.firstChild || !/^(?:br|link|img)$/.test(name)) {
+ out.push('<\/', name, '>');
+ }
+ break;
+ case 2: // an attribute
+ out.push(node.name.toLowerCase(), '="', attribToHtml(node.value), '"');
+ break;
+ case 3: case 4: // text
+ out.push(textToHtml(node.nodeValue));
+ break;
+ }
+ }
+
+ /**
+ * Given a group of {@link RegExp}s, returns a {@code RegExp} that globally
+ * matches the union of the sets of strings matched by the input RegExps.
+ * Since it matches globally, if the input strings have a start-of-input
+ * anchor (/^.../), it is ignored for the purposes of unioning.
+ * @param {Array.<RegExp>} regexs non multiline, non-global regexs.
+ * @return {RegExp} a global regex.
+ */
+ function combinePrefixPatterns(regexs) {
+ var capturedGroupIndex = 0;
+
+ var needToFoldCase = false;
+ var ignoreCase = false;
+ for (var i = 0, n = regexs.length; i < n; ++i) {
+ var regex = regexs[i];
+ if (regex.ignoreCase) {
+ ignoreCase = true;
+ } else if (/[a-z]/i.test(regex.source.replace(
+ /\\u[0-9a-f]{4}|\\x[0-9a-f]{2}|\\[^ux]/gi, ''))) {
+ needToFoldCase = true;
+ ignoreCase = false;
+ break;
+ }
+ }
+
+ function decodeEscape(charsetPart) {
+ if (charsetPart.charAt(0) !== '\\') { return charsetPart.charCodeAt(0); }
+ switch (charsetPart.charAt(1)) {
+ case 'b': return 8;
+ case 't': return 9;
+ case 'n': return 0xa;
+ case 'v': return 0xb;
+ case 'f': return 0xc;
+ case 'r': return 0xd;
+ case 'u': case 'x':
+ return parseInt(charsetPart.substring(2), 16)
+ || charsetPart.charCodeAt(1);
+ case '0': case '1': case '2': case '3': case '4':
+ case '5': case '6': case '7':
+ return parseInt(charsetPart.substring(1), 8);
+ default: return charsetPart.charCodeAt(1);
+ }
+ }
+
+ function encodeEscape(charCode) {
+ if (charCode < 0x20) {
+ return (charCode < 0x10 ? '\\x0' : '\\x') + charCode.toString(16);
+ }
+ var ch = String.fromCharCode(charCode);
+ if (ch === '\\' || ch === '-' || ch === '[' || ch === ']') {
+ ch = '\\' + ch;
+ }
+ return ch;
+ }
+
+ function caseFoldCharset(charSet) {
+ var charsetParts = charSet.substring(1, charSet.length - 1).match(
+ new RegExp(
+ '\\\\u[0-9A-Fa-f]{4}'
+ + '|\\\\x[0-9A-Fa-f]{2}'
+ + '|\\\\[0-3][0-7]{0,2}'
+ + '|\\\\[0-7]{1,2}'
+ + '|\\\\[\\s\\S]'
+ + '|-'
+ + '|[^-\\\\]',
+ 'g'));
+ var groups = [];
+ var ranges = [];
+ var inverse = charsetParts[0] === '^';
+ for (var i = inverse ? 1 : 0, n = charsetParts.length; i < n; ++i) {
+ var p = charsetParts[i];
+ switch (p) {
+ case '\\B': case '\\b':
+ case '\\D': case '\\d':
+ case '\\S': case '\\s':
+ case '\\W': case '\\w':
+ groups.push(p);
+ continue;
+ }
+ var start = decodeEscape(p);
+ var end;
+ if (i + 2 < n && '-' === charsetParts[i + 1]) {
+ end = decodeEscape(charsetParts[i + 2]);
+ i += 2;
+ } else {
+ end = start;
+ }
+ ranges.push([start, end]);
+ // If the range might intersect letters, then expand it.
+ if (!(end < 65 || start > 122)) {
+ if (!(end < 65 || start > 90)) {
+ ranges.push([Math.max(65, start) | 32, Math.min(end, 90) | 32]);
+ }
+ if (!(end < 97 || start > 122)) {
+ ranges.push([Math.max(97, start) & ~32, Math.min(end, 122) & ~32]);
+ }
+ }
+ }
+
+ // [[1, 10], [3, 4], [8, 12], [14, 14], [16, 16], [17, 17]]
+ // -> [[1, 12], [14, 14], [16, 17]]
+ ranges.sort(function (a, b) { return (a[0] - b[0]) || (b[1] - a[1]); });
+ var consolidatedRanges = [];
+ var lastRange = [NaN, NaN];
+ for (var i = 0; i < ranges.length; ++i) {
+ var range = ranges[i];
+ if (range[0] <= lastRange[1] + 1) {
+ lastRange[1] = Math.max(lastRange[1], range[1]);
+ } else {
+ consolidatedRanges.push(lastRange = range);
+ }
+ }
+
+ var out = ['['];
+ if (inverse) { out.push('^'); }
+ out.push.apply(out, groups);
+ for (var i = 0; i < consolidatedRanges.length; ++i) {
+ var range = consolidatedRanges[i];
+ out.push(encodeEscape(range[0]));
+ if (range[1] > range[0]) {
+ if (range[1] + 1 > range[0]) { out.push('-'); }
+ out.push(encodeEscape(range[1]));
+ }
+ }
+ out.push(']');
+ return out.join('');
+ }
+
+ function allowAnywhereFoldCaseAndRenumberGroups(regex) {
+ // Split into character sets, escape sequences, punctuation strings
+ // like ('(', '(?:', ')', '^'), and runs of characters that do not
+ // include any of the above.
+ var parts = regex.source.match(
+ new RegExp(
+ '(?:'
+ + '\\[(?:[^\\x5C\\x5D]|\\\\[\\s\\S])*\\]' // a character set
+ + '|\\\\u[A-Fa-f0-9]{4}' // a unicode escape
+ + '|\\\\x[A-Fa-f0-9]{2}' // a hex escape
+ + '|\\\\[0-9]+' // a back-reference or octal escape
+ + '|\\\\[^ux0-9]' // other escape sequence
+ + '|\\(\\?[:!=]' // start of a non-capturing group
+ + '|[\\(\\)\\^]' // start/end of a group, or line start
+ + '|[^\\x5B\\x5C\\(\\)\\^]+' // run of other characters
+ + ')',
+ 'g'));
+ var n = parts.length;
+
+ // Maps captured group numbers to the number they will occupy in
+ // the output or to -1 if that has not been determined, or to
+ // undefined if they need not be capturing in the output.
+ var capturedGroups = [];
+
+ // Walk over and identify back references to build the capturedGroups
+ // mapping.
+ var groupIndex;
+ for (var i = 0, groupIndex = 0; i < n; ++i) {
+ var p = parts[i];
+ if (p === '(') {
+ // groups are 1-indexed, so max group index is count of '('
+ ++groupIndex;
+ } else if ('\\' === p.charAt(0)) {
+ var decimalValue = +p.substring(1);
+ if (decimalValue && decimalValue <= groupIndex) {
+ capturedGroups[decimalValue] = -1;
+ }
+ }
+ }
+
+ // Renumber groups and reduce capturing groups to non-capturing groups
+ // where possible.
+ for (var i = 1; i < capturedGroups.length; ++i) {
+ if (-1 === capturedGroups[i]) {
+ capturedGroups[i] = ++capturedGroupIndex;
+ }
+ }
+ for (var i = 0, groupIndex = 0; i < n; ++i) {
+ var p = parts[i];
+ if (p === '(') {
+ ++groupIndex;
+ if (capturedGroups[groupIndex] === undefined) {
+ parts[i] = '(?:';
+ }
+ } else if ('\\' === p.charAt(0)) {
+ var decimalValue = +p.substring(1);
+ if (decimalValue && decimalValue <= groupIndex) {
+ parts[i] = '\\' + capturedGroups[decimalValue];
+ }
+ }
+ }
+
+ // Remove any prefix anchors so that the output will match anywhere.
+ // ^^ really does mean an anchored match though.
+ for (var i = 0, groupIndex = 0; i < n; ++i) {
+ if ('^' === parts[i] && '^' !== parts[i + 1]) { parts[i] = ''; }
+ }
+
+ // Expand letters to groups to handle mixing of case-sensitive and
+ // case-insensitive patterns if necessary.
+ if (regex.ignoreCase && needToFoldCase) {
+ for (var i = 0; i < n; ++i) {
+ var p = parts[i];
+ var ch0 = p.charAt(0);
+ if (p.length >= 2 && ch0 === '[') {
+ parts[i] = caseFoldCharset(p);
+ } else if (ch0 !== '\\') {
+ // TODO: handle letters in numeric escapes.
+ parts[i] = p.replace(
+ /[a-zA-Z]/g,
+ function (ch) {
+ var cc = ch.charCodeAt(0);
+ return '[' + String.fromCharCode(cc & ~32, cc | 32) + ']';
+ });
+ }
+ }
+ }
+
+ return parts.join('');
+ }
+
+ var rewritten = [];
+ for (var i = 0, n = regexs.length; i < n; ++i) {
+ var regex = regexs[i];
+ if (regex.global || regex.multiline) { throw new Error('' + regex); }
+ rewritten.push(
+ '(?:' + allowAnywhereFoldCaseAndRenumberGroups(regex) + ')');
+ }
+
+ return new RegExp(rewritten.join('|'), ignoreCase ? 'gi' : 'g');
+ }
+
+ var PR_innerHtmlWorks = null;
+ function getInnerHtml(node) {
+ // inner html is hopelessly broken in Safari 2.0.4 when the content is
+ // an html description of well formed XML and the containing tag is a PRE
+ // tag, so we detect that case and emulate innerHTML.
+ if (null === PR_innerHtmlWorks) {
+ var testNode = document.createElement('PRE');
+ testNode.appendChild(
+ document.createTextNode('\n '));
+ PR_innerHtmlWorks = !/= 0; nSpaces -= SPACES.length) {
+ out.push(SPACES.substring(0, nSpaces));
+ }
+ pos = i + 1;
+ break;
+ case '\n':
+ charInLine = 0;
+ break;
+ default:
+ ++charInLine;
+ }
+ }
+ if (!out) { return plainText; }
+ out.push(plainText.substring(pos));
+ return out.join('');
+ };
+ }
+
+ var pr_chunkPattern = new RegExp(
+ '[^<]+' // A run of characters other than '<'
+ + '|<\!--[\\s\\S]*?--\>' // an HTML comment
+ '|<!\\[CDATA\\[[\\s\\S]*?\\]\\]>' // a CDATA section
+ '|<\/?[a-zA-Z][^>]*>' // a probable tag that should not be highlighted
+ + '|<', // A '<' that does not begin a larger chunk
+ 'g');
+ var pr_commentPrefix = /^<\!--/;
+ var pr_cdataPrefix = /^<\[CDATA\[/;
+ var pr_brPrefix = /^
) into their textual equivalent.
+ *
+ * @param {string} s html where whitespace is considered significant.
+ * @return {Object} source code and extracted tags.
+ * @private
+ */
+ function extractTags(s) {
+ // since the pattern has the 'g' modifier and defines no capturing groups,
+ // this will return a list of all chunks which we then classify and wrap as
+ // PR_Tokens
+ var matches = s.match(pr_chunkPattern);
+ var sourceBuf = [];
+ var sourceBufLen = 0;
+ var extractedTags = [];
+ if (matches) {
+ for (var i = 0, n = matches.length; i < n; ++i) {
+ var match = matches[i];
+ if (match.length > 1 && match.charAt(0) === '<') {
+ if (pr_commentPrefix.test(match)) { continue; }
+ if (pr_cdataPrefix.test(match)) {
+ // strip CDATA prefix and suffix. Don't unescape since it's CDATA
+ sourceBuf.push(match.substring(9, match.length - 3));
+ sourceBufLen += match.length - 12;
+ } else if (pr_brPrefix.test(match)) {
+ // <br> tags are lexically significant so convert them to text.
+ // This is undone later.
+ sourceBuf.push('\n');
+ ++sourceBufLen;
+ } else {
+ if (match.indexOf(PR_NOCODE) >= 0 && isNoCodeTag(match)) {
+ // A <span class="nocode"> will start a section that should be
+ // ignored. Continue walking the list until we see a matching end
+ // tag.
+ var name = match.match(pr_tagNameRe)[2];
+ var depth = 1;
+ var j;
+ end_tag_loop:
+ for (j = i + 1; j < n; ++j) {
+ var name2 = matches[j].match(pr_tagNameRe);
+ if (name2 && name2[2] === name) {
+ if (name2[1] === '/') {
+ if (--depth === 0) { break end_tag_loop; }
+ } else {
+ ++depth;
+ }
+ }
+ }
+ if (j < n) {
+ extractedTags.push(
+ sourceBufLen, matches.slice(i, j + 1).join(''));
+ i = j;
+ } else { // Ignore unclosed sections.
+ extractedTags.push(sourceBufLen, match);
+ }
+ } else {
+ extractedTags.push(sourceBufLen, match);
+ }
+ }
+ } else {
+ var literalText = htmlToText(match);
+ sourceBuf.push(literalText);
+ sourceBufLen += literalText.length;
+ }
+ }
+ }
+ return { source: sourceBuf.join(''), tags: extractedTags };
+ }
+
+ /** True if the given tag contains a class attribute with the nocode class. */
+ function isNoCodeTag(tag) {
+ return !!tag
+ // First canonicalize the representation of attributes
+ .replace(/\s(\w+)\s*=\s*(?:\"([^\"]*)\"|'([^\']*)'|(\S+))/g,
+ ' $1="$2$3$4"')
+ // Then look for the attribute we want.
+ .match(/[cC][lL][aA][sS][sS]=\"[^\"]*\bnocode\b/);
+ }
+
+ /**
+ * Apply the given language handler to sourceCode and add the resulting
+ * decorations to out.
+ * @param {number} basePos the index of sourceCode within the chunk of source
+ * whose decorations are already present on out.
+ */
+ function appendDecorations(basePos, sourceCode, langHandler, out) {
+ if (!sourceCode) { return; }
+ var job = {
+ source: sourceCode,
+ basePos: basePos
+ };
+ langHandler(job);
+ out.push.apply(out, job.decorations);
+ }
+
+ /** Given triples of [style, pattern, context] returns a lexing function,
+ * The lexing function interprets the patterns to find token boundaries and
+ * returns a decoration list of the form
+ * [index_0, style_0, index_1, style_1, ..., index_n, style_n]
+ * where index_n is an index into the sourceCode, and style_n is a style
+ * constant like PR_PLAIN. index_n-1 <= index_n, and style_n-1 applies to
+ * all characters in sourceCode[index_n-1:index_n].
+ *
+ * The stylePatterns is a list whose elements have the form
+ * [style : string, pattern : RegExp, DEPRECATED, shortcut : string].
+ *
+ * Style is a style constant like PR_PLAIN, or can be a string of the
+ * form 'lang-FOO', where FOO is a language extension describing the
+ * language of the portion of the token in $1 after pattern executes.
+ * E.g., if style is 'lang-lisp', and group 1 contains the text
+ * '(hello (world))', then that portion of the token will be passed to the
+ * registered lisp handler for formatting.
+ * The text before and after group 1 will be restyled using this decorator
+ * so decorators should take care that this doesn't result in infinite
+ * recursion. For example, the HTML lexer rule for SCRIPT elements looks
+ * something like ['lang-js', /<[s]cript>(.+?)<\/script>/]. This may match
+ * '
+
diff --git a/porting-code-to-python-3-with-2to3.html b/porting-code-to-python-3-with-2to3.html
index 40a8356..259354d 100644
--- a/porting-code-to-python-3-with-2to3.html
+++ b/porting-code-to-python-3-with-2to3.html
@@ -22,11 +22,11 @@ td pre{padding:0;border:0}
-You are here: Home ‣ Dive Into Python 3 ‣
+
You are here: Home ‣ Dive Into Python 3 ‣
Difficulty level: ♦♦♦♦♦
Porting Code to Python 3 with 2to3
-❝ Life is pleasant. Death is peaceful. It’s the transition that’s troublesome. ❞
— Isaac Asimov (attributed)
+
❝ Life is pleasant. Death is peaceful. It’s the transition that’s troublesome. ❞
— Isaac Asimov (attributed)
Diving in
@@ -38,20 +38,20 @@ td pre{padding:0;border:0}
Python 2
Python 3
①
- print
-print()
+print
+print()
②
- print 1
-print(1)
+print 1
+print(1)
③
- print 1, 2
-print(1, 2)
+print 1, 2
+print(1, 2)
④
- print 1, 2,
-print(1, 2, end=' ')
+print 1, 2,
+print(1, 2, end=' ')
⑤
- print >>sys.stderr, 1, 2, 3
-print(1, 2, 3, file=sys.stderr)
+print >>sys.stderr, 1, 2, 3
+print(1, 2, 3, file=sys.stderr)
- To print a blank line, call
print() without any arguments.
@@ -67,11 +67,11 @@ td pre{padding:0;border:0}
Python 2
Python 3
①
- u'PapayaWhip'
-'PapayaWhip'
+u'PapayaWhip'
+'PapayaWhip'
②
- ur'PapayaWhip\foo'
-r'PapayaWhip\foo'
+ur'PapayaWhip\foo'
+r'PapayaWhip\foo'
- Unicode string literals are simply converted into string literals, which, in Python 3, are always Unicode.
@@ -84,8 +84,8 @@ td pre{padding:0;border:0}
Python 2
Python 3
- unicode(anything)
-str(anything)
+unicode(anything)
+str(anything)
long data type
Python 2 had separate int and long types for non-floating-point numbers. An int could not be any larger than sys.maxint, which varied by platform. Longs were defined by appending an L to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called int, which mostly behaves like the long type in Python 2. Since there are no longer two types, there is no need for special syntax to distinguish them.
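The unified integer type can be seen in a short Python 3 sketch. The variable names `big` and `doubled` are illustrative only and do not come from the book; the point is that a value larger than `sys.maxsize` is still a plain `int` and needs no `L` suffix.

```python
import sys

# Python 3 has one integer type, so values can grow past sys.maxsize
# without any special literal syntax.
big = sys.maxsize + 1      # would have been a long in Python 2
doubled = big * 2

print(type(big).__name__)      # int
print(doubled > sys.maxsize)   # True
```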
@@ -95,20 +95,20 @@ td pre{padding:0;border:0}
Python 2
Python 3
①
- x = 1000000000000L
-x = 1000000000000
+x = 1000000000000L
+x = 1000000000000
②
- x = 0xFFFFFFFFFFFFL
-x = 0xFFFFFFFFFFFF
+x = 0xFFFFFFFFFFFFL
+x = 0xFFFFFFFFFFFF
③
- long(x)
-int(x)
+long(x)
+int(x)
④
- type(x) is long
-type(x) is int
+type(x) is long
+type(x) is int
⑤
- isinstance(x, long)
-isinstance(x, int)
+isinstance(x, long)
+isinstance(x, int)
- Base 10 long integer literals become base 10 integer literals.
@@ -124,11 +124,11 @@ td pre{padding:0;border:0}
Python 2
Python 3
①
- if x <> y:
-if x != y:
+if x <> y:
+if x != y:
②
- if x <> y <> z:
-if x != y != z:
+if x <> y <> z:
+if x != y != z:
- A simple comparison.
@@ -141,20 +141,20 @@ td pre{padding:0;border:0}
Python 2
Python 3
①
- a_dictionary.has_key('PapayaWhip')
-'PapayaWhip' in a_dictionary
+a_dictionary.has_key('PapayaWhip')
+'PapayaWhip' in a_dictionary
②
-a_dictionary.has_key(x) or a_dictionary.has_key(y)
-x in a_dictionary or y in a_dictionary
+a_dictionary.has_key(x) or a_dictionary.has_key(y)
+x in a_dictionary or y in a_dictionary
③
- a_dictionary.has_key(x or y)
-(x or y) in a_dictionary
+a_dictionary.has_key(x or y)
+(x or y) in a_dictionary
④
- a_dictionary.has_key(x + y)
-(x + y) in a_dictionary
+a_dictionary.has_key(x + y)
+(x + y) in a_dictionary
⑤
- x + a_dictionary.has_key(y)
-x + (y in a_dictionary)
+x + a_dictionary.has_key(y)
+x + (y in a_dictionary)
- The simplest form.
@@ -170,19 +170,19 @@ td pre{padding:0;border:0}
Python 2
Python 3
①
- a_dictionary.keys()
-list(a_dictionary.keys())
+a_dictionary.keys()
+list(a_dictionary.keys())
②
- a_dictionary.items()
-list(a_dictionary.items())
+a_dictionary.items()
+list(a_dictionary.items())
③
- a_dictionary.iterkeys()
-iter(a_dictionary.keys())
+a_dictionary.iterkeys()
+iter(a_dictionary.keys())
④
- [i for i in a_dictionary.iterkeys()]
-[i for i in a_dictionary.keys()]
+[i for i in a_dictionary.iterkeys()]
+[i for i in a_dictionary.keys()]
⑤
- min(a_dictionary.keys())
+min(a_dictionary.keys())
no change
@@ -201,19 +201,19 @@ td pre{padding:0;border:0}
Python 2
Python 3
①
- import httplib
-import http.client
+import httplib
+import http.client
②
- import Cookie
-import http.cookies
+import Cookie
+import http.cookies
③
- import cookielib
-import http.cookiejar
+import cookielib
+import http.cookiejar
④
- import BaseHTTPServer
+import BaseHTTPServer
import SimpleHTTPServer
import CGIHTTPServer
-import http.server
+import http.server
- The
http.client module implements a low-level library that can request HTTP resources and interpret HTTP responses.
@@ -228,26 +228,26 @@ import CGIHttpServer
Python 2
Python 3
①
- import urllib
-import urllib.request, urllib.parse, urllib.error
+import urllib
+import urllib.request, urllib.parse, urllib.error
②
- import urllib2
-import urllib.request, urllib.error
+import urllib2
+import urllib.request, urllib.error
③
- import urlparse
-import urllib.parse
+import urlparse
+import urllib.parse
④
- import robotparser
-import urllib.robotparser
+import robotparser
+import urllib.robotparser
⑤
- from urllib import FancyURLopener
+from urllib import FancyURLopener
from urllib import urlencode
-from urllib.request import FancyURLopener
+from urllib.request import FancyURLopener
from urllib.parse import urlencode
⑥
- from urllib2 import Request
+from urllib2 import Request
from urllib2 import HTTPError
-from urllib.request import Request
+from urllib.request import Request
from urllib.error import HTTPError
@@ -265,21 +265,21 @@ from urllib.error import HTTPError
Python 2
Python 3
- import dbm
-import dbm.ndbm
+import dbm
+import dbm.ndbm
- import gdbm
-import dbm.gnu
+import gdbm
+import dbm.gnu
- import dbhash
-import dbm.bsd
+import dbhash
+import dbm.bsd
- import dumbdbm
-import dbm.dumb
+import dumbdbm
+import dbm.dumb
- import anydbm
+import anydbm
import whichdb
-import dbm
+import dbm
xmlrpc
XML-RPC is a lightweight method of performing remote procedure calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc.
@@ -288,12 +288,12 @@ import whichdb
Python 2
Python 3
- import xmlrpclib
-import xmlrpc.client
+import xmlrpclib
+import xmlrpc.client
- import DocXMLRPCServer
+import DocXMLRPCServer
import SimpleXMLRPCServer
-import xmlrpc.server
+import xmlrpc.server
Other modules
@@ -301,38 +301,38 @@ import SimpleXMLRPCServer
Python 2
Python 3
①
- try:
+try:
import cStringIO as StringIO
except ImportError:
import StringIO
-import io
+import io
②
- try:
+try:
import cPickle as pickle
except ImportError:
import pickle
-import pickle
+import pickle
③
- import __builtin__
-import builtins
+import __builtin__
+import builtins
④
- import copy_reg
-import copyreg
+import copy_reg
+import copyreg
⑤
- import Queue
-import queue
+import Queue
+import queue
⑥
- import SocketServer
-import socketserver
+import SocketServer
+import socketserver
⑦
- import ConfigParser
-import configparser
+import ConfigParser
+import configparser
⑧
- import repr
-import reprlib
+import repr
+import reprlib
⑨
- import commands
-import subprocess
+import commands
+import subprocess
- A common idiom in Python 2 was to try to import
cStringIO as StringIO, and if that failed, to import StringIO instead. Do not do this in Python 3; the io module does it for you. It will find the fastest implementation available and use it automatically.
@@ -363,11 +363,11 @@ except ImportError:
Python 2
Python 3
①
- import constants
-from . import constants
+import constants
+from . import constants
②
- from mbcharsetprober import MultiByteCharSetProber
-from .mbcharsetprober import MultiByteCharsetProber
+from mbcharsetprober import MultiByteCharSetProber
+from .mbcharsetprober import MultiByteCharSetProber
- When you need to import an entire module from elsewhere in your package, use the new
from . import syntax. The period is actually a relative path from this file (universaldetector.py) to the file you want to import (constants.py). In this case, they are in the same directory, thus the single period. You can also import from the parent directory (from .. import anothermodule) or a subdirectory.
@@ -380,28 +380,28 @@ except ImportError:
Python 2
Python 3
①
- anIterator.next()
-next(anIterator)
+anIterator.next()
+next(anIterator)
②
- a_function_that_returns_an_iterator().next()
-next(a_function_that_returns_an_iterator())
+a_function_that_returns_an_iterator().next()
+next(a_function_that_returns_an_iterator())
③
- class A:
+class A:
def next(self):
pass
-class A:
+class A:
def __next__(self):
pass
④
- class A:
+class A:
def next(self, x, y):
pass
no change
⑤
- next = 42
+next = 42
for an_iterator in a_sequence_of_iterators:
an_iterator.next()
-next = 42
+next = 42
for an_iterator in a_sequence_of_iterators:
an_iterator.__next__()
@@ -419,19 +419,19 @@ for an_iterator in a_sequence_of_iterators:
Python 2
Python 3
①
- filter(a_function, a_sequence)
-list(filter(a_function, a_sequence))
+filter(a_function, a_sequence)
+list(filter(a_function, a_sequence))
②
- list(filter(a_function, a_sequence))
+list(filter(a_function, a_sequence))
no change
③
- filter(None, a_sequence)
-[i for i in a_sequence if i]
+filter(None, a_sequence)
+[i for i in a_sequence if i]
④
- for i in filter(None, a_sequence):
+for i in filter(None, a_sequence):
no change
⑤
- [i for i in filter(a_function, a_sequence)]
+[i for i in filter(a_function, a_sequence)]
no change
@@ -448,19 +448,19 @@ for an_iterator in a_sequence_of_iterators:
Python 2
Python 3
①
- map(a_function, 'PapayaWhip')
-list(map(a_function, 'PapayaWhip'))
+map(a_function, 'PapayaWhip')
+list(map(a_function, 'PapayaWhip'))
②
- map(None, 'PapayaWhip')
-list('PapayaWhip')
+map(None, 'PapayaWhip')
+list('PapayaWhip')
③
- map(lambda x: x+1, range(42))
-[x+1 for x in range(42)]
+map(lambda x: x+1, range(42))
+[x+1 for x in range(42)]
④
- for i in map(a_function, a_sequence):
+for i in map(a_function, a_sequence):
no change
⑤
- [i for i in map(a_function, a_sequence)]
+[i for i in map(a_function, a_sequence)]
no change
@@ -477,12 +477,12 @@ for an_iterator in a_sequence_of_iterators:
Python 2
Python 3
- reduce(a, b, c)
-from functtools import reduce
+reduce(a, b, c)
+from functools import reduce
reduce(a, b, c)
-☞The version of 2to3 that shipped with Python 3.0 would not fix the reduce() function automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.
+
☞The version of 2to3 that shipped with Python 3.0 would not fix the reduce() function automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.
apply() global function
Python 2 had a global function called apply(), which took a function f and a list [a, b, c] and returned f(a, b, c). In Python 3, the apply() function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function’s arguments.
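As a rough illustration of the calling syntax described here, the sketch below uses a hypothetical function `describe` (not part of the book's examples) and unpacks a list and a dictionary into its arguments.

```python
def describe(name, colour, size="small"):
    """Hypothetical helper used only for this illustration."""
    return "{0} is {1} and {2}".format(name, colour, size)

args = ["papaya", "orange"]
named_args = {"size": "large"}

# Python 2 would have spelled this apply(describe, args, named_args).
result = describe(*args, **named_args)
print(result)   # papaya is orange and large
```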
@@ -491,17 +491,17 @@ reduce(a, b, c)
Python 2
Python 3
①
- apply(a_function, a_list_of_args)
-a_function(*a_list_of_args)
+apply(a_function, a_list_of_args)
+a_function(*a_list_of_args)
②
- apply(a_function, a_list_of_args, a_dictionary_of_named_args)
-a_function(*a_list_of_args, **a_dictionary_of_named_args)
+apply(a_function, a_list_of_args, a_dictionary_of_named_args)
+a_function(*a_list_of_args, **a_dictionary_of_named_args)
③
- apply(a_function, a_list_of_args + z)
-a_function(*a_list_of_args + z)
+apply(a_function, a_list_of_args + z)
+a_function(*a_list_of_args + z)
④
- apply(aModule.a_function, a_list_of_args)
-aModule.a_function(*a_list_of_args)
+apply(aModule.a_function, a_list_of_args)
+aModule.a_function(*a_list_of_args)
- In the simplest form, you can call a function with a list of arguments (an actual list like
[a, b, c]) by prepending the list with an asterisk (*). This is exactly equivalent to the old apply() function in Python 2.
@@ -516,8 +516,8 @@ reduce(a, b, c)
Python 2
Python 3
- intern(aString)
-sys.intern(aString)
+intern(aString)
+sys.intern(aString)
exec statement
Just as the print statement became a function in Python 3, so too has the exec statement. The exec() function takes a string which contains arbitrary Python code and executes it as if it were just another statement or expression.
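A minimal sketch of the function form, assuming a made-up code string and an empty dictionary used as the global namespace:

```python
code_string = "answer = 6 * 7"

# Names assigned by the executed code land in the dictionary
# passed as the global namespace.
namespace = {}
exec(code_string, namespace)

print(namespace["answer"])   # 42
```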
@@ -526,14 +526,14 @@ reduce(a, b, c)
Python 2
Python 3
①
- exec codeString
-exec(codeString)
+exec codeString
+exec(codeString)
②
- exec codeString in a_global_namespace
-exec(codeString, a_global_namespace)
+exec codeString in a_global_namespace
+exec(codeString, a_global_namespace)
③
- exec codeString in a_global_namespace, a_local_namespace
-exec(codeString, a_global_namespace, a_local_namespace)
+exec codeString in a_global_namespace, a_local_namespace
+exec(codeString, a_global_namespace, a_local_namespace)
- In the simplest form, the
2to3 script simply encloses the code-as-a-string in parentheses, since exec() is now a function instead of a statement.
@@ -547,11 +547,11 @@ reduce(a, b, c)
Python 2
Python 3
- execfile('a_filename')
-exec(compile(open('a_filename').read(), 'a_filename', 'exec'))
+execfile('a_filename')
+exec(compile(open('a_filename').read(), 'a_filename', 'exec'))
-☞The version of 2to3 that shipped with Python 3.0 would not fix the execfile statement automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.
+
☞The version of 2to3 that shipped with Python 3.0 would not fix the execfile() function automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.
repr literals (backticks)
In Python 2, there was a special syntax of wrapping any object in backticks (like `x`) to get a representation of the object. In Python 3, this capability still exists, but you can no longer use backticks to get it. Instead, use the global repr() function.
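The replacement for backticks might look like the following sketch; the variable names are illustrative. Nested backticks become nested `repr()` calls.

```python
x = "PapayaWhip"

plain = repr(x)                          # Python 2: `x`
nested = repr("PapayaWhip" + repr(2))    # Python 2: `'PapayaWhip' + `2``

print(plain)    # 'PapayaWhip'
print(nested)   # 'PapayaWhip2'
```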
@@ -560,11 +560,11 @@ reduce(a, b, c)
Python 2
Python 3
①
- `x`
-repr(x)
+`x`
+repr(x)
②
- `'PapayaWhip' + `2``
-repr('PapayaWhip' + repr(2))
+`'PapayaWhip' + `2``
+repr('PapayaWhip' + repr(2))
- Remember, x can be anything — a class, a function, a module, a primitive data type, etc. The
repr() function works on everything.
@@ -577,31 +577,31 @@ reduce(a, b, c)
Python 2
Python 3
①
- try:
+try:
import mymodule
except ImportError, e:
pass
-try:
+try:
import mymodule
except ImportError as e:
pass
②
- try:
+try:
import mymodule
except (RuntimeError, ImportError), e:
pass
-try:
+try:
import mymodule
except (RuntimeError, ImportError) as e:
pass
③
- try:
+try:
import mymodule
except ImportError:
pass
no change
④
- try:
+try:
import mymodule
except:
pass
@@ -614,7 +614,7 @@ except:
- Similarly, if you use a fallback to catch all exceptions, the syntax is identical.
-☞You should never use a fallback to catch all exceptions when importing modules (or most other times). Doing so will catch things like KeyboardInterrupt (if the user pressed Ctrl-C to interrupt the program) and can make it more difficult to debug errors.
+
☞You should never use a fallback to catch all exceptions when importing modules (or most other times). Doing so will catch things like KeyboardInterrupt (if the user pressed Ctrl-C to interrupt the program) and can make it more difficult to debug errors.
raise statement
The syntax for raising your own exceptions has changed slightly between Python 2 and Python 3.
@@ -623,16 +623,16 @@ except:
Python 2
Python 3
①
- raise MyException
+raise MyException
unchanged
②
- raise MyException, 'error message'
-raise MyException('error message')
+raise MyException, 'error message'
+raise MyException('error message')
③
- raise MyException, 'error message', a_traceback
-raise MyException('error message').with_traceback(a_traceback)
+raise MyException, 'error message', a_traceback
+raise MyException('error message').with_traceback(a_traceback)
④
- raise 'error message'
+raise 'error message'
unsupported
@@ -648,13 +648,13 @@ except:
Python 2
Python 3
①
- a_generator.throw(MyException)
+a_generator.throw(MyException)
no change
②
- a_generator.throw(MyException, 'error message')
-a_generator.throw(MyException('error message'))
+a_generator.throw(MyException, 'error message')
+a_generator.throw(MyException('error message'))
③
- a_generator.throw('error message')
+a_generator.throw('error message')
unsupported
@@ -669,19 +669,19 @@ except:
Python 2
Python 3
①
- xrange(10)
-range(10)
+xrange(10)
+range(10)
②
- a_list = range(10)
-a_list = list(range(10))
+a_list = range(10)
+a_list = list(range(10))
③
- [i for i in xrange(10)]
-[i for i in range(10)]
+[i for i in xrange(10)]
+[i for i in range(10)]
④
- for i in range(10):
+for i in range(10):
no change
⑤
- sum(range(10))
+sum(range(10))
no change
@@ -698,14 +698,14 @@ except:
Python 2
Python 3
①
- raw_input()
-input()
+raw_input()
+input()
②
- raw_input('prompt')
-input('prompt')
+raw_input('prompt')
+input('prompt')
③
- input()
-eval(input())
+input()
+eval(input())
- In the simplest form,
raw_input() becomes input().
@@ -719,26 +719,26 @@ except:
Python 2
Python 3
①
- a_function.func_name
-a_function.__name__
+a_function.func_name
+a_function.__name__
②
- a_function.func_doc
-a_function.__doc__
+a_function.func_doc
+a_function.__doc__
③
- a_function.func_defaults
-a_function.__defaults__
+a_function.func_defaults
+a_function.__defaults__
④
- a_function.func_dict
-a_function.__dict__
+a_function.func_dict
+a_function.__dict__
⑤
- a_function.func_closure
-a_function.__closure__
+a_function.func_closure
+a_function.__closure__
⑥
- a_function.func_globals
-a_function.__globals__
+a_function.func_globals
+a_function.__globals__
⑦
- a_function.func_code
-a_function.__code__
+a_function.func_code
+a_function.__code__
- The
__name__ attribute (previously func_name) contains the function’s name.
@@ -756,10 +756,10 @@ except:
Python 2
Python 3
①
- for line in a_file.xreadlines():
-for line in a_file:
+for line in a_file.xreadlines():
+for line in a_file:
②
- for line in a_file.xreadlines(5):
+for line in a_file.xreadlines(5):
no change
@@ -774,16 +774,16 @@ except:
Python 2
Python 3
①
- lambda (x,): x + f(x)
-lambda x1: x1[0] + f(x1[0])
+lambda (x,): x + f(x)
+lambda x1: x1[0] + f(x1[0])
②
- lambda (x, y): x + f(y)
-lambda x_y: x_y[0] + f(x_y[1])
+lambda (x, y): x + f(y)
+lambda x_y: x_y[0] + f(x_y[1])
③
- lambda (x, (y, z)): x + y + z
-lambda x_y_z: x_y_z[0] + x_y_z[1][0] + x_y_z[1][1]
+lambda (x, (y, z)): x + y + z
+lambda x_y_z: x_y_z[0] + x_y_z[1][0] + x_y_z[1][1]
④
- lambda x, y, z: x + y + z
+lambda x, y, z: x + y + z
unchanged
@@ -799,14 +799,14 @@ except:
Python 2
Python 3
- aClassInstance.aClassMethod.im_func
-aClassInstance.aClassMethod.__func__
+aClassInstance.aClassMethod.im_func
+aClassInstance.aClassMethod.__func__
- aClassInstance.aClassMethod.im_self
-aClassInstance.aClassMethod.__self__
+aClassInstance.aClassMethod.im_self
+aClassInstance.aClassMethod.__self__
- aClassInstance.aClassMethod.im_class
-aClassInstance.aClassMethod.__self__.__class__
+aClassInstance.aClassMethod.im_class
+aClassInstance.aClassMethod.__self__.__class__
__nonzero__ special method
In Python 2, you could build your own classes that could be used in a boolean context. For example, you could instantiate the class and then use the instance in an if statement. To do this, you defined a special __nonzero__() method which returned True or False, and it was called whenever the instance was used in a boolean context. In Python 3, you can still do this, but the name of the method has changed to __bool__().
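A small sketch of the renamed method, using a hypothetical `Basket` class that is truthy only when it holds at least one item:

```python
class Basket:
    """Hypothetical container used only for this illustration."""

    def __init__(self, items):
        self.items = list(items)

    def __bool__(self):
        # Called whenever an instance is used in a boolean context,
        # such as an if statement. Python 2 called __nonzero__ instead.
        return len(self.items) > 0


empty = Basket([])
full = Basket(["papaya"])

if not empty and full:
    print("only the full basket is truthy")
```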
@@ -815,14 +815,14 @@ except:
Python 2
Python 3
①
- class A:
+class A:
def __nonzero__(self):
pass
-class A:
+class A:
def __bool__(self):
pass
②
- class A:
+class A:
def __nonzero__(self, x, y):
pass
no change
@@ -838,8 +838,8 @@ except:
Python 2
Python 3
- x = 0755
-x = 0o755
+x = 0755
+x = 0o755
sys.maxint
Due to the integration of the long and int types, the sys.maxint constant is no longer accurate. Because the value may still be useful in determining platform-specific capabilities, it has been retained but renamed as sys.maxsize.
@@ -848,11 +848,11 @@ except:
Python 2
Python 3
①
- from sys import maxint
-from sys import maxsize
+from sys import maxint
+from sys import maxsize
②
- a_function(sys.maxint)
-a_function(sys.maxsize)
+a_function(sys.maxint)
+a_function(sys.maxsize)
maxint becomes maxsize.
@@ -865,8 +865,8 @@ except:
Python 2
Python 3
- callable(anything)
-hasattr(anything, '__call__')
+callable(anything)
+hasattr(anything, '__call__')
zip() global function
In Python 2, the global zip() function took any number of sequences and returned a list of tuples. The first tuple contained the first item from each sequence; the second tuple contained the second item from each sequence; and so on. In Python 3, zip() returns an iterator instead of a list.
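The difference shows up as soon as the result is consumed. In this sketch (the sequence names are illustrative), the iterator yields one tuple at a time, and wrapping it in `list()` collects whatever is left.

```python
letters = ["a", "b", "c"]
numbers = [1, 2, 3]

pairs = zip(letters, numbers)   # an iterator, not a list

first = next(pairs)             # consumes the first tuple
rest = list(pairs)              # collects the remaining tuples

print(first)   # ('a', 1)
print(rest)    # [('b', 2), ('c', 3)]
```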
@@ -875,10 +875,10 @@ except:
Python 2
Python 3
①
- zip(a, b, c)
-list(zip(a, b, c))
+zip(a, b, c)
+list(zip(a, b, c))
②
- d.join(zip(a, b, c))
+d.join(zip(a, b, c))
no change
@@ -892,11 +892,11 @@ except:
Python 2
Python 3
- x = StandardError()
-x = Exception()
+x = StandardError()
+x = Exception()
- x = StandardError(a, b, c)
-x = Exception(a, b, c)
+x = StandardError(a, b, c)
+x = Exception(a, b, c)
types module constants
The types module contains a variety of constants to help you determine the type of an object. In Python 2, it contained constants for all primitive types like dict and int. In Python 3, these constants have been eliminated; just use the primitive type name instead.
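As a rough sketch of using the primitive type names directly, the sample values below (chosen only for illustration) are checked against the built-in types that replace the old constants.

```python
samples = [b"raw bytes", {"k": "v"}, 42, [1, 2], None]

# The built-in names stand in for types.StringType, types.DictType,
# types.IntType, types.ListType, and types.NoneType respectively.
expected_types = [bytes, dict, int, list, type(None)]

matches = [type(s) is t for s, t in zip(samples, expected_types)]
print(all(matches))   # True
```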
@@ -905,26 +905,26 @@ except:
Python 2
Python 3
- types.StringType
-bytes
+types.StringType
+bytes
- types.DictType
-dict
+types.DictType
+dict
- types.IntType
-int
+types.IntType
+int
- types.LongType
-int
+types.LongType
+int
- types.ListType
-list
+types.ListType
+list
- types.NoneType
-type(None)
+types.NoneType
+type(None)
-☞types.StringType gets mapped to bytes instead of str because a Python 2 “string” (not a Unicode string, just a regular string) is really just a sequence of bytes in a particular character encoding.
+
☞types.StringType gets mapped to bytes instead of str because a Python 2 “string” (not a Unicode string, just a regular string) is really just a sequence of bytes in a particular character encoding.
isinstance() global function (3.1+)
The isinstance() function checks whether an object is an instance of a particular class or type. In Python 2, you could pass a tuple of types, and isinstance() would return True if the object was any of those types. In Python 3, you can still do this, but passing the same type twice is deprecated.
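A minimal sketch of the tuple form, with a hypothetical helper `is_number` that lists each type once:

```python
def is_number(x):
    """Hypothetical helper: True for ints and floats."""
    # Each type appears once in the tuple.
    return isinstance(x, (int, float))


print(is_number(3))      # True
print(is_number(2.5))    # True
print(is_number("3"))    # False
```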
@@ -933,11 +933,11 @@ except:
Python 2
Python 3
- isinstance(x, (int, float, int))
-isinstance(x, (int, float))
+isinstance(x, (int, float, int))
+isinstance(x, (int, float))
-☞The version of 2to3 that shipped with Python 3.0 would not fix these cases of isinstance() automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.
+
☞The version of 2to3 that shipped with Python 3.0 would not fix these cases of isinstance() automatically. The fix first appeared in the 2to3 script that shipped with Python 3.1.
basestring datatype
Python 2 had two string types: Unicode and non-Unicode. But there was also another type, basestring. It was an abstract type, a superclass for both the str and unicode types. It couldn’t be called or instantiated directly, but you could pass it to the global isinstance() function to check whether an object was either a Unicode or non-Unicode string. In Python 3, there is only one string type, so basestring has no reason to exist.
@@ -946,8 +946,8 @@ except:
Python 2
Python 3
- isinstance(x, basestring)
-isinstance(x, str)
+isinstance(x, basestring)
+isinstance(x, str)
itertools module
Python 2.3 introduced the itertools module, which defined variants of the global zip(), map(), and filter() functions that returned iterators instead of lists. In Python 3, those global functions return iterators, so those functions in the itertools module have been eliminated.
@@ -956,17 +956,17 @@ except:
Python 2
Python 3
①
- itertools.izip(a, b)
-zip(a, b)
+itertools.izip(a, b)
+zip(a, b)
②
- itertools.imap(a, b)
-map(a, b)
+itertools.imap(a, b)
+map(a, b)
③
- itertools.ifilter(a, b)
-filter(a, b)
+itertools.ifilter(a, b)
+filter(a, b)
④
- from itertools import imap, izip, foo
-from itertools import foo
+from itertools import imap, izip, foo
+from itertools import foo
- Instead of
itertools.izip(), just use the global zip() function.
@@ -981,14 +981,14 @@ except:
Python 2
Python 3
- sys.exc_type
-sys.exc_info()[0]
+sys.exc_type
+sys.exc_info()[0]
- sys.exc_value
-sys.exc_info()[1]
+sys.exc_value
+sys.exc_info()[1]
- sys.exc_traceback
-sys.exc_info()[2]
+sys.exc_traceback
+sys.exc_info()[2]
List comprehensions over tuples
In Python 2, if you wanted to code a list comprehension that iterated over a tuple, you did not need to put parentheses around the tuple values. In Python 3, explicit parentheses are required.
@@ -997,8 +997,8 @@ except:
Python 2
Python 3
- [i for i in 1, 2]
-[i for i in (1, 2)]
+[i for i in 1, 2]
+[i for i in (1, 2)]
os.getcwdu() function
Python 2 had a function named os.getcwd(), which returned the current working directory as a (non-Unicode) string. Because modern file systems can handle directory names in any character encoding, Python 2.3 introduced os.getcwdu(). The os.getcwdu() function returned the current working directory as a Unicode string. In Python 3, there is only one string type (Unicode), so os.getcwd() is all you need.
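In Python 3 this can be checked directly. The sketch below also calls `os.getcwdb()`, which returns the same directory as bytes; it is mentioned here only for contrast and is not part of the 2to3 fix.

```python
import os

cwd = os.getcwd()          # always a (Unicode) str in Python 3
cwd_bytes = os.getcwdb()   # bytes variant, shown for contrast

print(type(cwd).__name__)        # str
print(type(cwd_bytes).__name__)  # bytes
```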
@@ -1007,8 +1007,8 @@ except:
Python 2
Python 3
- os.getcwdu()
-os.getcwd()
+os.getcwdu()
+os.getcwd()
Metaclasses
In Python 2, you could create metaclasses either by defining the metaclass argument in the class declaration, or by defining a special class-level __metaclass__ attribute. In Python 3, the class-level attribute has been eliminated.
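A short sketch of the surviving form, reusing the names `PapayaMeta` and `Whip` from the table. The `flavour` attribute is an assumption added only to make the metaclass's effect visible.

```python
class PapayaMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        cls.flavour = "papaya"   # illustrative attribute, not from the book
        return cls


# The metaclass is named in the class declaration itself.
class Whip(metaclass=PapayaMeta):
    pass


print(type(Whip).__name__)   # PapayaMeta
print(Whip.flavour)          # papaya
```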
@@ -1017,18 +1017,18 @@ except:
Python 2
Python 3
①
- class C(metaclass=PapayaMeta):
+class C(metaclass=PapayaMeta):
pass
unchanged
②
- class Whip:
+class Whip:
__metaclass__ = PapayaMeta
-class Whip(metaclass=PapayaMeta):
+class Whip(metaclass=PapayaMeta):
pass
③
- class C(Whipper, Beater):
+class C(Whipper, Beater):
__metaclass__ = PapayaMeta
-class C(Whipper, Beater, metaclass=PapayaMeta):
+class C(Whipper, Beater, metaclass=PapayaMeta):
pass
@@ -1041,7 +1041,7 @@ except:
set() literals (explicit)
In Python 2, the only way to define a literal set in your code was to call set(a_sequence). This still works in Python 3, but a clearer way of doing it is to use the new set literal notation: curly braces. (Dictionaries are also defined with curly braces, which makes sense once you think about it, because dictionaries are just sets of key-value pairs.)
-☞The 2to3 script will not fix set() literals by default. To enable this fix, specify -f set_literal on the command line when you call 2to3.
+
☞The 2to3 script will not fix set() literals by default. To enable this fix, specify -f set_literal on the command line when you call 2to3.
Notes
@@ -1049,19 +1049,19 @@ except:
After
- set([1, 2, 3])
-{1, 2, 3}
+set([1, 2, 3])
+{1, 2, 3}
- set((1, 2, 3))
-{1, 2, 3}
+set((1, 2, 3))
+{1, 2, 3}
- set([i for i in a_sequence])
-{i for i in a_sequence}
+set([i for i in a_sequence])
+{i for i in a_sequence}
buffer() global function (explicit)
Python objects implemented in C can export a “buffer interface,” which allows other Python code to directly read and write a block of memory. (That is exactly as powerful and scary as it sounds.) In Python 3, buffer() has been renamed to memoryview(). (It’s a little more complicated than that, but you can almost certainly ignore the differences.)
-☞The 2to3 script will not fix the buffer() function by default. To enable this fix, specify -f buffer on the command line when you call 2to3.
+
☞The 2to3 script will not fix the buffer() function by default. To enable this fix, specify -f buffer on the command line when you call 2to3.
Notes
@@ -1069,13 +1069,13 @@ except:
After
- x = buffer(y)
-x = memoryview(y)
+x = buffer(y)
+x = memoryview(y)
Whitespace around commas (explicit)
Despite being draconian about whitespace for indenting and outdenting, Python is actually quite liberal about whitespace in other areas. Within lists, tuples, sets, and dictionaries, whitespace can appear before and after commas with no ill effects. However, the Python style guide states that commas should be preceded by zero spaces and followed by one. Although this is purely an aesthetic issue (the code works either way, in both Python 2 and Python 3), the 2to3 script can optionally fix this for you.
-☞The 2to3 script will not fix whitespace around commas by default. To enable this fix, specify -f wscomma on the command line when you call 2to3.
+
☞The 2to3 script will not fix whitespace around commas by default. To enable this fix, specify -f wscomma on the command line when you call 2to3.
Notes
@@ -1083,16 +1083,16 @@ except:
After
- a ,b
-a, b
+a ,b
+a, b
- {a :b}
-{a: b}
+{a :b}
+{a: b}
Common idioms (explicit)
There were a number of common idioms built up in the Python community. Some, like the while 1: loop, date back to Python 1. (Python didn’t have a true boolean type until version 2.3, so developers used 1 and 0 instead.) Modern Python programmers should train their brains to use modern versions of these idioms instead.
-☞The 2to3 script will not fix common idioms by default. To enable this fix, specify -f idioms on the command line when you call 2to3.
+
☞The 2to3 script will not fix common idioms by default. To enable this fix, specify -f idioms on the command line when you call 2to3.
Notes
@@ -1100,26 +1100,27 @@ except:
After
- while 1:
+while 1:
do_stuff()
-while True:
+while True:
do_stuff()
- type(x) == T
-isinstance(x, T)
+type(x) == T
+isinstance(x, T)
- type(x) is T
-isinstance(x, T)
+type(x) is T
+isinstance(x, T)
-