diff --git a/.htaccess b/.htaccess
index 28a220d..b901de6 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
-FileETag MTime Size
-
-SetEnv dont-vary
+FileETag MTime Size
+
+SetEnv dont-vary
diff --git a/advanced-iterators.html b/advanced-iterators.html
index fb099d8..ee15c6d 100755
--- a/advanced-iterators.html
+++ b/advanced-iterators.html
@@ -1,647 +1,647 @@
-
-
-
You are here: Home ‣ Dive Into Python 3 ‣ -
Difficulty level: ♦♦♦♦♢ -
--❝ Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum. ❞
— Augustus De Morgan -
-
Just as regular expressions put strings on steroids, the itertools module puts iterators on steroids. But first, I want to show you a classic puzzle.
-
-
HAWAII + IDAHO + IOWA + OHIO == STATES
-510199 + 98153 + 9301 + 3593 == 621246
-
-H = 5
-A = 1
-W = 0
-I = 9
-D = 8
-O = 3
-S = 6
-T = 2
-E = 4
-
-Puzzles like this are called cryptarithms or alphametics. The letters spell out actual words, but if you replace each letter with a digit from 0–9, it also “spells” an arithmetic equation. The trick is to figure out which letter maps to each digit. All the occurrences of each letter must map to the same digit, no digit can be repeated, and no “word” can start with the digit 0.
-
-
-
-
In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code. - -
import re
-import itertools
-
-def solve(puzzle):
-    words = re.findall('[A-Z]+', puzzle.upper())
-    unique_characters = set(''.join(words))
-    assert len(unique_characters) <= 10, 'Too many letters'
-    first_letters = {word[0] for word in words}
-    n = len(first_letters)
-    sorted_characters = ''.join(first_letters) + \
-        ''.join(unique_characters - first_letters)
-    characters = tuple(ord(c) for c in sorted_characters)
-    digits = tuple(ord(c) for c in '0123456789')
-    zero = digits[0]
-    for guess in itertools.permutations(digits, len(characters)):
-        if zero not in guess[:n]:
-            equation = puzzle.translate(dict(zip(characters, guess)))
-            if eval(equation):
-                return equation
-
-if __name__ == '__main__':
-    import sys
-    for puzzle in sys.argv[1:]:
-        print(puzzle)
-        solution = solve(puzzle)
-        if solution:
-            print(solution)
-
-You can run the program from the command line. On Linux, it would look like this. (These may take some time, depending on the speed of your computer, and there is no progress bar. Just be patient!) - -
-you@localhost:~/diveintopython3/examples$ python3 alphametics.py "HAWAII + IDAHO + IOWA + OHIO == STATES"
-HAWAII + IDAHO + IOWA + OHIO == STATES
-510199 + 98153 + 9301 + 3593 == 621246
-you@localhost:~/diveintopython3/examples$ python3 alphametics.py "I + LOVE + YOU == DORA"
-I + LOVE + YOU == DORA
-1 + 2784 + 975 == 3760
-you@localhost:~/diveintopython3/examples$ python3 alphametics.py "SEND + MORE == MONEY"
-SEND + MORE == MONEY
-9567 + 1085 == 10652
-
-
⁂ - -
The first thing this alphametics solver does is find all the letters (A–Z) in the puzzle. - -
->>> import re
->>> re.findall('[0-9]+', '16 2-by-4s in rows of 8') ①
-['16', '2', '4', '8']
->>> re.findall('[A-Z]+', 'SEND + MORE == MONEY') ②
-['SEND', 'MORE', 'MONEY']
-re module is Python’s implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of numbers. The findall() function returns a list of all the substrings that matched the pattern.
-Here’s another example that will stretch your brain a little. - -
->>> re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
-[' sixth s', " sheikh's s", " sheep's s"]
-
-
-
-Surprised? The regular expression looks for a space, an s, and then the shortest possible series of any character (.*?), then a space, then another s. Well, looking at that input string, I see five matches:
-
-
The sixth sick sheikh's sixth sheep's sick.
-The sixth sick sheikh's sixth sheep's sick.
-The sixth sick sheikh's sixth sheep's sick.
-The sixth sick sheikh's sixth sheep's sick.
-The sixth sick sheikh's sixth sheep's sick.
-But the re.findall() function only returned three matches. Specifically, it returned the first, the third, and the fifth. Why is that? Because it doesn’t return overlapping matches. The first match overlaps with the second, so the first is returned and the second is skipped. Then the third overlaps with the fourth, so the third is returned and the fourth is skipped. Finally, the fifth is returned. Three matches, not five.
-
-
This has nothing to do with the alphametics solver; I just thought it was interesting. - -
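Still, if you ever do want the overlapping matches, one common workaround is to wrap the pattern in a lookahead that contains a capturing group. The sketch below is an illustration added here, not part of the alphametics solver, and the variable names are made up for the example.

```python
import re

tongue_twister = "The sixth sick sheikh's sixth sheep's sick."

# Plain findall() consumes each match, so overlapping candidates are skipped.
non_overlapping = re.findall(' s.*? s', tongue_twister)

# A lookahead (?=...) matches without consuming any characters, so the
# search can restart one character later and pick up the overlapping matches.
# findall() returns the text captured by the group inside the lookahead.
overlapping = re.findall('(?=( s.*? s))', tongue_twister)

print(non_overlapping)
print(overlapping)
```

The first list holds the three matches shown earlier; the second holds all five.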
⁂ - -
Sets make it trivial to find the unique items in a sequence. - -
->>> a_list = ['The', 'sixth', 'sick', "sheik's", 'sixth', "sheep's", 'sick']
->>> set(a_list) ①
-{'sixth', 'The', "sheep's", 'sick', "sheik's"}
->>> a_string = 'EAST IS EAST'
->>> set(a_string) ②
-{'A', ' ', 'E', 'I', 'S', 'T'}
->>> words = ['SEND', 'MORE', 'MONEY']
->>> ''.join(words) ③
-'SENDMOREMONEY'
->>> set(''.join(words)) ④
-{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
-
set() function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth. Fifth — wait, that’s in the set already, so it only gets listed once, because Python sets don’t allow duplicates. Sixth. Seventh — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first.
-''.join(a_list) concatenates all the strings together into one.
-The alphametics solver uses this technique to build a set of all the unique characters in the puzzle. - -
unique_characters = set(''.join(words))
-
-This set is later used to assign digits to characters as the solver iterates through the possible solutions. - -
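The solver's source also subtracts one set from another (unique_characters - first_letters) so that the leading letter of each word can be handled separately. Here is a small sketch of that step on its own. The name other_letters is made up for this example, and sorted() is only there to make the printed order predictable, since sets are unordered.

```python
words = ['SEND', 'MORE', 'MONEY']

unique_characters = set(''.join(words))
first_letters = {word[0] for word in words}   # a set comprehension

# Set difference: every letter that never starts a word.
other_letters = unique_characters - first_letters

print(sorted(first_letters))
print(sorted(other_letters))
```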
⁂ - -
Like many programming languages, Python has an assert statement. Here’s how it works.
-
-
->>> assert 1 + 1 == 2 ①
->>> assert 1 + 1 == 3 ②
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
-AssertionError
->>> assert 2 + 2 == 5, "Only for very large values of 2" ③
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
-AssertionError: Only for very large values of 2
-
assert statement is followed by any valid Python expression. In this case, the expression 1 + 1 == 2 evaluates to True, so the assert statement does nothing.
-False, the assert statement will raise an AssertionError.
-AssertionError is raised.
-Therefore, this line of code: - -
assert len(unique_characters) <= 10, 'Too many letters'
-
-…is equivalent to this: - -
if len(unique_characters) > 10:
-    raise AssertionError('Too many letters')
-
-The alphametics solver uses this exact assert statement to bail out early if the puzzle contains more than ten unique letters. Since each letter is assigned a unique digit, and there are only ten digits, a puzzle with more than ten unique letters cannot possibly have a solution.
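One caveat worth knowing: Python skips assert statements entirely when it is started with the -O option, so the explicit if/raise form is the one that always runs. A minimal sketch, using a hypothetical helper name check_letter_count that is not part of the solver:

```python
def check_letter_count(unique_characters):
    # Explicit form of the solver's assert. Unlike assert, this check
    # still runs when Python is started with -O.
    if len(unique_characters) > 10:
        raise AssertionError('Too many letters')

check_letter_count(set('SENDMOREMONEY'))      # 8 unique letters: passes silently

try:
    check_letter_count(set('ABCDEFGHIJK'))    # 11 unique letters
except AssertionError as error:
    message = str(error)
    print(message)
```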
-
-
⁂ - -
A generator expression is like a generator function without the function. - -
->>> unique_characters = {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
->>> gen = (ord(c) for c in unique_characters) ①
->>> gen ②
-<generator object <genexpr> at 0x00BADC10>
->>> next(gen) ③
-69
->>> next(gen)
-68
->>> tuple(ord(c) for c in unique_characters) ④
-(69, 68, 77, 79, 78, 83, 82, 89)
-next(gen) returns the next value from the iterator.
-tuple(), list(), or set(). In these cases, you don’t need an extra set of parentheses — just pass the “bare” expression ord(c) for c in unique_characters to the tuple() function, and Python figures out that it’s a generator expression.
-
-☞ Using a generator expression instead of a list comprehension can save both CPU and RAM. If you’re building a list just to throw it away (e.g. passing it to tuple() or set()), use a generator expression instead!
-
Here’s another way to accomplish the same thing, using a generator function: - -
def ord_map(a_string):
-    for c in a_string:
-        yield ord(c)
-
-gen = ord_map(unique_characters)
-
-The generator expression is more compact but functionally equivalent. - -
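To see the equivalence for yourself, you can run both forms over the same input and compare the results. This sketch feeds them a string rather than a set, purely so that the iteration order is predictable; the names letters, from_expression, and from_function are made up for the example.

```python
def ord_map(a_string):
    for c in a_string:
        yield ord(c)

letters = 'SMEDONRY'   # a string, so the order of iteration is fixed

from_expression = tuple(ord(c) for c in letters)
from_function = tuple(ord_map(letters))

print(from_expression)
print(from_expression == from_function)
```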
⁂ - -
First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you’re doing. Here I’m talking about combinatorics, but if that doesn’t mean anything to you, don’t worry about it. As always, Wikipedia is your friend.) - -
The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like “let’s find the permutations of 3 different items taken 2 at a time,” which means you have a sequence of 3 items and you want to find all the possible ordered pairs. - -
->>> import itertools ①
->>> perms = itertools.permutations([1, 2, 3], 2) ②
->>> next(perms) ③
-(1, 2)
->>> next(perms)
-(1, 3)
->>> next(perms)
-(2, 1) ④
->>> next(perms)
-(2, 3)
->>> next(perms)
-(3, 1)
->>> next(perms)
-(3, 2)
->>> next(perms) ⑤
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
-StopIteration
-
itertools module has all kinds of fun stuff in it, including a permutations() function that does all the hard work of finding permutations.
-permutations() function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a for loop or any old place that iterates. Here I’ll step through the iterator manually to show all the values.
-[1, 2, 3] taken 2 at a time is (1, 2).
-(2, 1) is different than (1, 2).
-[1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2) never show up, because they contain repeats so they aren’t valid permutations. When there are no more permutations, the iterator raises a StopIteration exception.
-The permutations() function doesn’t have to take a list. It can take any sequence — even a string.
-
-
->>> import itertools
->>> perms = itertools.permutations('ABC', 3) ①
->>> next(perms)
-('A', 'B', 'C') ②
->>> next(perms)
-('A', 'C', 'B')
->>> next(perms)
-('B', 'A', 'C')
->>> next(perms)
-('B', 'C', 'A')
->>> next(perms)
-('C', 'A', 'B')
->>> next(perms)
-('C', 'B', 'A')
->>> next(perms)
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
-StopIteration
->>> list(itertools.permutations('ABC', 3)) ③
-[('A', 'B', 'C'), ('A', 'C', 'B'),
- ('B', 'A', 'C'), ('B', 'C', 'A'),
- ('C', 'A', 'B'), ('C', 'B', 'A')]
-'ABC' is equivalent to the list ['A', 'B', 'C'].
-['A', 'B', 'C'], taken 3 at a time, is ('A', 'B', 'C'). There are five other permutations — the same three characters in every conceivable order.
-permutations() function always returns an iterator, an easy way to debug permutations is to pass that iterator to the built-in list() function to see all the permutations immediately.
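How many permutations are there? For n items taken r at a time, the count is n! / (n - r)!. The quick check below is an addition for illustration, using only math.factorial() from the standard library; count_permutations is a hypothetical helper name.

```python
import itertools
import math

def count_permutations(n, r):
    # n! / (n - r)!  is the number of ordered selections without repeats.
    return math.factorial(n) // math.factorial(n - r)

# Cross-check the formula against itertools on a small case.
small = sum(1 for _ in itertools.permutations('ABCDE', 3))
print(small, count_permutations(5, 3))

# SEND + MORE == MONEY has 8 unique letters, and there are 10 digits.
print(count_permutations(10, 8))
```

That last number is the most guesses the solver could ever step through for an eight-letter puzzle, which goes some way toward explaining why you were asked to be patient.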
-⁂ - -
itertools Module
->>> import itertools
->>> list(itertools.product('ABC', '123')) ①
-[('A', '1'), ('A', '2'), ('A', '3'),
- ('B', '1'), ('B', '2'), ('B', '3'),
- ('C', '1'), ('C', '2'), ('C', '3')]
->>> list(itertools.combinations('ABC', 2)) ②
-[('A', 'B'), ('A', 'C'), ('B', 'C')]
-itertools.product() function returns an iterator containing the Cartesian product of two sequences.
-itertools.combinations() function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the itertools.permutations() function, except combinations don’t include items that are duplicates of other items in a different order. So itertools.permutations('ABC', 2) will return both ('A', 'B') and ('B', 'A') (among others), but itertools.combinations('ABC', 2) will not return ('B', 'A') because it is a duplicate of ('A', 'B') in a different order.
-[download favorite-people.txt]
-
->>> names = list(open('examples/favorite-people.txt', encoding='utf-8')) ①
->>> names
-['Dora\n', 'Ethan\n', 'Wesley\n', 'John\n', 'Anne\n',
-'Mike\n', 'Chris\n', 'Sarah\n', 'Alex\n', 'Lizzie\n']
->>> names = [name.rstrip() for name in names] ②
->>> names
-['Dora', 'Ethan', 'Wesley', 'John', 'Anne',
-'Mike', 'Chris', 'Sarah', 'Alex', 'Lizzie']
->>> names = sorted(names) ③
->>> names
-['Alex', 'Anne', 'Chris', 'Dora', 'Ethan',
-'John', 'Lizzie', 'Mike', 'Sarah', 'Wesley']
->>> names = sorted(names, key=len) ④
->>> names
-['Alex', 'Anne', 'Dora', 'John', 'Mike',
-'Chris', 'Ethan', 'Sarah', 'Lizzie', 'Wesley']
-list(open(filename)) idiom also includes the newline characters at the end of each line. This list comprehension uses the rstrip() string method to strip trailing whitespace from each line. (Strings also have an lstrip() method to strip leading whitespace, and a strip() method which strips both.)
-sorted() function takes a list and returns it sorted. By default, it sorts alphabetically.
-sorted() function can also take a function as the key parameter, and it sorts by that key. In this case, the sort function is len(), so it sorts by len(each item). Shorter names come first, then longer, then longest.
-What does this have to do with the itertools module? I’m glad you asked.
-
-
-…continuing from the previous interactive shell…
->>> import itertools
->>> groups = itertools.groupby(names, len) ①
->>> groups
-<itertools.groupby object at 0x00BB20C0>
->>> list(groups)
-[(4, <itertools._grouper object at 0x00BA8BF0>),
- (5, <itertools._grouper object at 0x00BB4050>),
- (6, <itertools._grouper object at 0x00BB4030>)]
->>> groups = itertools.groupby(names, len) ②
->>> for name_length, name_iter in groups: ③
-...     print('Names with {0:d} letters:'.format(name_length))
-...     for name in name_iter:
-...         print(name)
-...
-Names with 4 letters:
-Alex
-Anne
-Dora
-John
-Mike
-Names with 5 letters:
-Chris
-Ethan
-Sarah
-Names with 6 letters:
-Lizzie
-Wesley
-
itertools.groupby() function takes a sequence and a key function, and returns an iterator that generates pairs. Each pair contains the result of key_function(each item) and another iterator containing all the items that shared that key result.
-list() function “exhausted” the iterator, i.e. you’ve already generated every item in the iterator to make the list. There’s no “reset” button on an iterator; you can’t just start over once you’ve exhausted it. If you want to loop through it again (say, in the upcoming for loop), you need to call itertools.groupby() again to create a new iterator.
-itertools.groupby(names, len) will put all the 4-letter names in one iterator, all the 5-letter names in another iterator, and so on. The groupby() function is completely generic; it could group strings by first letter, numbers by their number of factors, or any other key function you can think of.
-
-☞ The itertools.groupby() function only works if the input sequence is already sorted by the grouping function. In the example above, you grouped a list of names by the len() function. That only worked because the input list was already sorted by length.
-
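To see what goes wrong with unsorted input, here is a small sketch (the shortened names list and the variable names are made up for this example). The groupby() function starts a new group whenever the key changes from one item to the next, so equal keys that are not adjacent end up in separate groups.

```python
import itertools

names = ['Dora', 'Ethan', 'Wesley', 'John']   # not sorted by length

# The two 4-letter names are not adjacent, so the key 4 shows up twice.
unsorted_keys = [key for key, group in itertools.groupby(names, len)]

# Sorting by the same key function first gives one group per length.
by_length = sorted(names, key=len)
sorted_groups = [(key, list(group))
                 for key, group in itertools.groupby(by_length, len)]

print(unsorted_keys)
print(sorted_groups)
```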
Are you watching closely? -
->>> list(range(0, 3))
-[0, 1, 2]
->>> list(range(10, 13))
-[10, 11, 12]
->>> list(itertools.chain(range(0, 3), range(10, 13))) ①
-[0, 1, 2, 10, 11, 12]
->>> list(zip(range(0, 3), range(10, 13))) ②
-[(0, 10), (1, 11), (2, 12)]
->>> list(zip(range(0, 3), range(10, 14))) ③
-[(0, 10), (1, 11), (2, 12)]
->>> list(itertools.zip_longest(range(0, 3), range(10, 14))) ④
-[(0, 10), (1, 11), (2, 12), (None, 13)]
-
itertools.chain() function takes two iterators and returns an iterator that contains all the items from the first iterator, followed by all the items from the second iterator. (Actually, it can take any number of iterators, and it chains them all in the order they were passed to the function.)
-zip() function does something prosaic that turns out to be extremely useful: it takes any number of sequences and returns an iterator which returns tuples of the first items of each sequence, then the second items of each, then the third, and so on.
-zip() function stops at the end of the shortest sequence. range(10, 14) has 4 items (10, 11, 12, and 13), but range(0, 3) only has 3, so the zip() function returns an iterator of 3 items.
-itertools.zip_longest() function stops at the end of the longest sequence, inserting None values for items past the end of the shorter sequences.
-OK, that was all very interesting, but how does it relate to the alphametics solver? Here’s how: - -
->>> characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
->>> guess = ('1', '2', '0', '3', '4', '5', '6', '7')
->>> tuple(zip(characters, guess)) ①
-(('S', '1'), ('M', '2'), ('E', '0'), ('D', '3'),
- ('O', '4'), ('N', '5'), ('R', '6'), ('Y', '7'))
->>> dict(zip(characters, guess)) ②
-{'E': '0', 'D': '3', 'M': '2', 'O': '4',
- 'N': '5', 'S': '1', 'R': '6', 'Y': '7'}
-zip function will create a pairing of letters and digits, in order.
-dict() function to create a dictionary that uses letters as keys and their associated digits as values. (This isn’t the only way to do it, of course. You could use a dictionary comprehension to create the dictionary directly.) Although the printed representation of the dictionary lists the pairs in a different order (dictionaries have no “order” per se), you can see that each letter is associated with the digit, based on the ordering of the original characters and guess sequences.
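For comparison, the dictionary comprehension mentioned above might look like the sketch below. The names from_zip and from_comprehension are made up for the example; both forms build the same dictionary.

```python
characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
guess = ('1', '2', '0', '3', '4', '5', '6', '7')

from_zip = dict(zip(characters, guess))
from_comprehension = {letter: digit for letter, digit in zip(characters, guess)}

print(from_comprehension == from_zip)
print(from_comprehension['S'], from_comprehension['E'])
```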
-The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution. - -
characters = tuple(ord(c) for c in sorted_characters)
-digits = tuple(ord(c) for c in '0123456789')
-...
-for guess in itertools.permutations(digits, len(characters)):
-    ...
-    equation = puzzle.translate(dict(zip(characters, guess)))
-
-But what is this translate() method? Ah, now you’re getting to the really fun part.
-
-
⁂ - -
Python strings have many methods. You learned about some of those methods in the Strings chapter: lower(), count(), and format(). Now I want to introduce you to a powerful but little-known string manipulation technique: the translate() method.
-
-
->>> translation_table = {ord('A'): ord('O')} ①
->>> translation_table ②
-{65: 79}
->>> 'MARK'.translate(translation_table) ③
-'MORK'
-ord() function returns the Unicode code point of a character, which, in the case of A–Z, is always an integer from 65 to 90 (the same as its ASCII value).
-translate() method on a string takes a translation table and runs the string through it. That is, it replaces all occurrences of the keys of the translation table with the corresponding values. In this case, “translating” MARK to MORK.
-What does this have to do with solving alphametic puzzles? As it turns out, everything. - -
->>> characters = tuple(ord(c) for c in 'SMEDONRY') ①
->>> characters
-(83, 77, 69, 68, 79, 78, 82, 89)
->>> guess = tuple(ord(c) for c in '91570682') ②
->>> guess
-(57, 49, 53, 55, 48, 54, 56, 50)
->>> translation_table = dict(zip(characters, guess)) ③
->>> translation_table
-{68: 55, 69: 53, 77: 49, 78: 54, 79: 48, 82: 56, 83: 57, 89: 50}
->>> 'SEND + MORE == MONEY'.translate(translation_table) ④
-'9567 + 1085 == 10652'
-
alphametics.solve() function.
-itertools.permutations() function in the alphametics.solve() function.
-alphametics.solve() function does inside the for loop.
-translate() method of the original puzzle string. This converts each letter in the string to the corresponding digit (based on the letters in characters and the digits in guess). The result is a valid Python expression, as a string.
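The standard library also has a helper for building this kind of table: the static method str.maketrans(). The solver does not use it; this is just a sketch showing that, given two strings of equal length, it produces the same dictionary of integers as the zip() approach. The names by_hand and by_helper are made up for the example.

```python
letters = 'SMEDONRY'
digits = '91570682'

# The approach used in this chapter: zip two sequences of code points.
by_hand = dict(zip((ord(c) for c in letters), (ord(c) for c in digits)))

# str.maketrans() pairs up the characters of two equal-length strings.
by_helper = str.maketrans(letters, digits)

print(by_hand == by_helper)
print('SEND + MORE == MONEY'.translate(by_helper))
```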
-That’s pretty impressive. But what can you do with a string that happens to be a valid Python expression? - -
⁂ - -
This is the final piece of the puzzle (or rather, the final piece of the puzzle solver). After all that fancy string manipulation, we’re left with a string like '9567 + 1085 == 10652'. But that’s a string, and what good is a string? Enter eval(), the universal Python evaluation tool.
-
-
->>> eval('1 + 1 == 2')
-True
->>> eval('1 + 1 == 3')
-False
->>> eval('9567 + 1085 == 10652')
-True
-
-But wait, there’s more! The eval() function isn’t limited to boolean expressions. It can handle any Python expression and returns any datatype.
-
-
->>> eval('"A" + "B"')
-'AB'
->>> eval('"MARK".translate({65: 79})')
-'MORK'
->>> eval('"AAAAA".count("A")')
-5
->>> eval('["*"] * 5')
-['*', '*', '*', '*', '*']
-
-But wait, that’s not all! - -
->>> x = 5
->>> eval("x * 5") ①
-25
->>> eval("pow(x, 2)") ②
-25
->>> import math
->>> eval("math.sqrt(x)") ③
-2.2360679774997898
-eval() takes can reference global variables defined outside the eval(). If called within a function, it can reference local variables too.
-Hey, wait a minute… - -
->>> import subprocess
->>> eval("subprocess.getoutput('ls ~')") ①
-'Desktop Library Pictures \
- Documents Movies Public \
- Music Sites'
->>> eval("subprocess.getoutput('rm /some/random/file')") ②
-subprocess module allows you to run arbitrary shell commands and get the result as a Python string.
-It’s even worse than that, because there’s a global __import__() function that takes a module name as a string, imports the module, and returns a reference to it. Combined with the power of eval(), you can construct a single expression that will wipe out all your files:
-
-
->>> eval("__import__('subprocess').getoutput('rm /some/random/file')") ①
-'rm -rf ~'. Actually there wouldn’t be any output, but you wouldn’t have any files left either.
-eval() is EVIL - -
Well, the evil part is evaluating arbitrary expressions from untrusted sources. You should only use eval() on trusted input. Of course, the trick is figuring out what’s “trusted.” But here’s something I know for certain: you should NOT take this alphametics solver and put it on the internet as a fun little web service. Don’t make the mistake of thinking, “Gosh, the function does a lot of string manipulation before getting a string to evaluate; I can’t imagine how someone could exploit that.” Someone WILL figure out how to sneak nasty executable code past all that string manipulation (stranger things have happened), and then you can kiss your server goodbye.
-
-
But surely there’s some way to evaluate expressions safely? To put eval() in a sandbox where it can’t access or harm the outside world? Well, yes and no.
-
-
->>> x = 5
->>> eval("x * 5", {}, {}) ①
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- File "<string>", line 1, in <module>
-NameError: name 'x' is not defined
->>> eval("x * 5", {"x": x}, {}) ②
->>> import math
->>> eval("math.sqrt(x)", {"x": x}, {}) ③
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- File "<string>", line 1, in <module>
-NameError: name 'math' is not defined
-eval() function act as the global and local namespaces for evaluating the expression. In this case, they are both empty, which means that when the string "x * 5" is evaluated, there is no reference to x in either the global or local namespace, so eval() throws an exception.
-math module, you didn’t include it in the namespace passed to the eval() function, so the evaluation failed.
-Gee, that was easy. Lemme make an alphametics web service now! - -
->>> eval("pow(5, 2)", {}, {}) ①
-25
->>> eval("__import__('math').sqrt(5)", {}, {}) ②
-2.2360679774997898
-pow(5, 2) works, because 5 and 2 are literals, and pow() is a built-in function.
-__import__() function is also a built-in function, so it works too.
-Yeah, that means you can still do nasty things, even if you explicitly set the global and local namespaces to empty dictionaries when calling eval():
-
-
>>> eval("__import__('subprocess').getoutput('rm /some/random/file')", {}, {})
-
-Oops. I’m glad I didn’t make that alphametics web service. Is there any way to use eval() safely? Well, yes and no.
-
-
->>> eval("__import__('math').sqrt(5)",
-... {"__builtins__":None}, {}) ①
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- File "<string>", line 1, in <module>
-NameError: name '__import__' is not defined
->>> eval("__import__('subprocess').getoutput('rm -rf /')",
-... {"__builtins__":None}, {}) ②
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- File "<string>", line 1, in <module>
-NameError: name '__import__' is not defined
-"__builtins__" to None, the Python null value. Internally, the “built-in” functions are contained within a pseudo-module called "__builtins__". This pseudo-module (i.e. the set of built-in functions) is made available to evaluated expressions unless you explicitly override it.
-__builtins__. Not __builtin__, __built-ins__, or some other variation that will work just fine but expose you to catastrophic risks.
-So eval() is safe now? Well, yes and no.
-
-
->>> eval("2 ** 2147483647",
-... {"__builtins__":None}, {}) ①
-
-__builtins__, you can still launch a denial-of-service attack. For example, trying to raise 2 to the 2147483647th power will spike your server’s CPU utilization to 100% for quite some time. (If you’re trying this in the interactive shell, press Ctrl-C a few times to break out of it.) Technically this expression will return a value eventually, but in the meantime your server will be doing a whole lot of nothing.
-In the end, it is possible to safely evaluate untrusted Python expressions, for some definition of “safe” that turns out not to be terribly useful in real life. It’s fine if you’re just playing around, and it’s fine if you only ever pass it trusted input. But anything else is just asking for trouble. - -
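If all you need is to check sums like the ones in this chapter, you can sidestep eval() altogether. The function below is a sketch added for illustration, not Raymond Hettinger's code. The name check_equation is hypothetical, and it assumes the translated string has the form 'number + number + ... == number' with nothing else in it.

```python
def check_equation(equation):
    # Split 'a + b + ... == total' into its two sides.
    left, right = equation.split('==')
    # int() only parses integers, so anything else raises ValueError
    # instead of being executed.
    terms = [int(term) for term in left.split('+')]
    return sum(terms) == int(right)

print(check_equation('9567 + 1085 == 10652'))
print(check_equation('1 + 1 == 3'))
```

This trades generality for safety: it handles addition and nothing else, but a string containing code makes int() raise an exception rather than run.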
⁂ - -
To recap: this program solves alphametic puzzles by brute force, i.e. through an exhaustive search of all possible solutions. To do this, it… - -
re.findall() function
-set() function
-assert statement
-itertools.permutations() function
-translate() string method
-eval() function
-True
-…in just 14 lines of code. - -
⁂ - -
itertools module
-itertools — Iterator functions for efficient looping
-Many thanks to Raymond Hettinger for agreeing to relicense his code so I could port it to Python 3 and use it as the basis for this chapter. - -
© 2001–10 Mark Pilgrim - - - + + +
You are here: Home ‣ Dive Into Python 3 ‣ +
Difficulty level: ♦♦♦♦♢ +
++❝ Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum. ❞
— Augustus De Morgan +
+
Just as regular expressions put strings on steroids, the itertools module puts iterators on steroids. But first, I want to show you a classic puzzle.
+
+
HAWAII + IDAHO + IOWA + OHIO == STATES
+510199 + 98153 + 9301 + 3593 == 621246
+
+H = 5
+A = 1
+W = 0
+I = 9
+D = 8
+O = 3
+S = 6
+T = 2
+E = 4
+
+Puzzles like this are called cryptarithms or alphametics. The letters spell out actual words, but if you replace each letter with a digit from 0–9, it also “spells” an arithmetic equation. The trick is to figure out which letter maps to each digit. All the occurrences of each letter must map to the same digit, no digit can be repeated, and no “word” can start with the digit 0.
+
+
+
+
In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code. + +
import re
+import itertools
+
+def solve(puzzle):
+    words = re.findall('[A-Z]+', puzzle.upper())
+    unique_characters = set(''.join(words))
+    assert len(unique_characters) <= 10, 'Too many letters'
+    first_letters = {word[0] for word in words}
+    n = len(first_letters)
+    sorted_characters = ''.join(first_letters) + \
+        ''.join(unique_characters - first_letters)
+    characters = tuple(ord(c) for c in sorted_characters)
+    digits = tuple(ord(c) for c in '0123456789')
+    zero = digits[0]
+    for guess in itertools.permutations(digits, len(characters)):
+        if zero not in guess[:n]:
+            equation = puzzle.translate(dict(zip(characters, guess)))
+            if eval(equation):
+                return equation
+
+if __name__ == '__main__':
+    import sys
+    for puzzle in sys.argv[1:]:
+        print(puzzle)
+        solution = solve(puzzle)
+        if solution:
+            print(solution)
+
+You can run the program from the command line. On Linux, it would look like this. (These may take some time, depending on the speed of your computer, and there is no progress bar. Just be patient!) + +
+you@localhost:~/diveintopython3/examples$ python3 alphametics.py "HAWAII + IDAHO + IOWA + OHIO == STATES"
+HAWAII + IDAHO + IOWA + OHIO == STATES
+510199 + 98153 + 9301 + 3593 == 621246
+you@localhost:~/diveintopython3/examples$ python3 alphametics.py "I + LOVE + YOU == DORA"
+I + LOVE + YOU == DORA
+1 + 2784 + 975 == 3760
+you@localhost:~/diveintopython3/examples$ python3 alphametics.py "SEND + MORE == MONEY"
+SEND + MORE == MONEY
+9567 + 1085 == 10652
+
+
⁂ + +
The first thing this alphametics solver does is find all the letters (A–Z) in the puzzle. + +
+>>> import re
+>>> re.findall('[0-9]+', '16 2-by-4s in rows of 8') ①
+['16', '2', '4', '8']
+>>> re.findall('[A-Z]+', 'SEND + MORE == MONEY') ②
+['SEND', 'MORE', 'MONEY']
+re module is Python’s implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of numbers. The findall() function returns a list of all the substrings that matched the pattern.
+Here’s another example that will stretch your brain a little. + +
+>>> re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
+[' sixth s', " sheikh's s", " sheep's s"]
+
+
+
+Surprised? The regular expression looks for a space, an s, and then the shortest possible series of any character (.*?), then a space, then another s. Well, looking at that input string, I see five matches:
+
+
The sixth sick sheikh's sixth sheep's sick.
+The sixth sick sheikh's sixth sheep's sick.
+The sixth sick sheikh's sixth sheep's sick.
+The sixth sick sheikh's sixth sheep's sick.
+The sixth sick sheikh's sixth sheep's sick.
+But the re.findall() function only returned three matches. Specifically, it returned the first, the third, and the fifth. Why is that? Because it doesn’t return overlapping matches. The first match overlaps with the second, so the first is returned and the second is skipped. Then the third overlaps with the fourth, so the third is returned and the fourth is skipped. Finally, the fifth is returned. Three matches, not five.
+
+
This has nothing to do with the alphametics solver; I just thought it was interesting. + +
⁂ + +
Sets make it trivial to find the unique items in a sequence. + +
+>>> a_list = ['The', 'sixth', 'sick', "sheik's", 'sixth', "sheep's", 'sick']
+>>> set(a_list) ①
+{'sixth', 'The', "sheep's", 'sick', "sheik's"}
+>>> a_string = 'EAST IS EAST'
+>>> set(a_string) ②
+{'A', ' ', 'E', 'I', 'S', 'T'}
+>>> words = ['SEND', 'MORE', 'MONEY']
+>>> ''.join(words) ③
+'SENDMOREMONEY'
+>>> set(''.join(words)) ④
+{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
+
set() function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth. Fifth — wait, that’s in the set already, so it only gets listed once, because Python sets don’t allow duplicates. Sixth. Seventh — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first.
+''.join(a_list) concatenates all the strings together into one.
+The alphametics solver uses this technique to build a set of all the unique characters in the puzzle. + +
unique_characters = set(''.join(words))
+
+This set is later used to assign digits to characters as the solver iterates through the possible solutions. + +
⁂ + +
Like many programming languages, Python has an assert statement. Here’s how it works.
+
+
+>>> assert 1 + 1 == 2 ① +>>> assert 1 + 1 == 3 ② +Traceback (most recent call last): + File "<stdin>", line 1, in <module> +AssertionError +>>> assert 2 + 2 == 5, "Only for very large values of 2" ③ +Traceback (most recent call last): + File "<stdin>", line 1, in <module> +AssertionError: Only for very large values of 2+
assert statement is followed by any valid Python expression. In this case, the expression 1 + 1 == 2 evaluates to True, so the assert statement does nothing.
+False, the assert statement will raise an AssertionError.
+AssertionError is raised.
+Therefore, this line of code: + +
assert len(unique_characters) <= 10, 'Too many letters'
+
+…is equivalent to this: + +
if len(unique_characters) > 10:
+ raise AssertionError('Too many letters')
+
+The alphametics solver uses this exact assert statement to bail out early if the puzzle contains more than ten unique letters. Since each letter is assigned a unique digit, and there are only ten digits, a puzzle with more than ten unique letters can not possibly have a solution.
+
+
⁂ + +
A generator expression is like a generator function without the function. + +
+>>> unique_characters = {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
+>>> gen = (ord(c) for c in unique_characters) ①
+>>> gen ②
+<generator object <genexpr> at 0x00BADC10>
+>>> next(gen) ③
+69
+>>> next(gen)
+68
+>>> tuple(ord(c) for c in unique_characters) ④
+(69, 68, 77, 79, 78, 83, 82, 89)
+next(gen) returns the next value from the iterator.
+tuple(), list(), or set(). In these cases, you don’t need an extra set of parentheses — just pass the “bare” expression ord(c) for c in unique_characters to the tuple() function, and Python figures out that it’s a generator expression.
+++ +☞Using a generator expression instead of a list comprehension can save both CPU and RAM. If you’re building an list just to throw it away (e.g. passing it to
tuple()orset()), use a generator expression instead! +
Here’s another way to accomplish the same thing, using a generator function: + +
def ord_map(a_string):
+ for c in a_string:
+ yield ord(c)
+
+gen = ord_map(unique_characters)
+
+The generator expression is more compact but functionally equivalent. + +
⁂ + +
First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you’re doing. Here I’m talking about combinatorics, but if that doesn’t mean anything to you, don’t worry about it. As always, Wikipedia is your friend.) + +
The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like “let’s find the permutations of 3 different items taken 2 at a time,” which means you have a sequence of 3 items and you want to find all the possible ordered pairs. + +
+>>> import itertools ① +>>> perms = itertools.permutations([1, 2, 3], 2) ② +>>> next(perms) ③ +(1, 2) +>>> next(perms) +(1, 3) +>>> next(perms) +(2, 1) ④ +>>> next(perms) +(2, 3) +>>> next(perms) +(3, 1) +>>> next(perms) +(3, 2) +>>> next(perms) ⑤ +Traceback (most recent call last): + File "<stdin>", line 1, in <module> +StopIteration+
itertools module has all kinds of fun stuff in it, including a permutations() function that does all the hard work of finding permutations.
+permutations() function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a for loop or any old place that iterates. Here I’ll step through the iterator manually to show all the values.
+[1, 2, 3] taken 2 at a time is (1, 2).
+(2, 1) is different than (1, 2).
+[1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2) never show up, because they contain repeats so they aren’t valid permutations. When there are no more permutations, the iterator raises a StopIteration exception.
+The permutations() function doesn’t have to take a list. It can take any sequence — even a string.
+
+
+>>> import itertools
+>>> perms = itertools.permutations('ABC', 3) ①
+>>> next(perms)
+('A', 'B', 'C') ②
+>>> next(perms)
+('A', 'C', 'B')
+>>> next(perms)
+('B', 'A', 'C')
+>>> next(perms)
+('B', 'C', 'A')
+>>> next(perms)
+('C', 'A', 'B')
+>>> next(perms)
+('C', 'B', 'A')
+>>> next(perms)
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+StopIteration
+>>> list(itertools.permutations('ABC', 3)) ③
+[('A', 'B', 'C'), ('A', 'C', 'B'),
+ ('B', 'A', 'C'), ('B', 'C', 'A'),
+ ('C', 'A', 'B'), ('C', 'B', 'A')]
+'ABC' is equivalent to the list ['A', 'B', 'C'].
+['A', 'B', 'C'], taken 3 at a time, is ('A', 'B', 'C'). There are five other permutations — the same three characters in every conceivable order.
+permutations() function always returns an iterator, an easy way to debug permutations is to pass that iterator to the built-in list() function to see all the permutations immediately.
+⁂ + +
itertools Module
+>>> import itertools
+>>> list(itertools.product('ABC', '123')) ①
+[('A', '1'), ('A', '2'), ('A', '3'),
+ ('B', '1'), ('B', '2'), ('B', '3'),
+ ('C', '1'), ('C', '2'), ('C', '3')]
+>>> list(itertools.combinations('ABC', 2)) ②
+[('A', 'B'), ('A', 'C'), ('B', 'C')]
+itertools.product() function returns an iterator containing the Cartesian product of two sequences.
+itertools.combinations() function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the itertools.permutations() function, except combinations don’t include items that are duplicates of other items in a different order. So itertools.permutations('ABC', 2) will return both ('A', 'B') and ('B', 'A') (among others), but itertools.combinations('ABC', 2) will not return ('B', 'A') because it is a duplicate of ('A', 'B') in a different order.
+[download favorite-people.txt]
+
+>>> names = list(open('examples/favorite-people.txt', encoding='utf-8')) ①
+>>> names
+['Dora\n', 'Ethan\n', 'Wesley\n', 'John\n', 'Anne\n',
+'Mike\n', 'Chris\n', 'Sarah\n', 'Alex\n', 'Lizzie\n']
+>>> names = [name.rstrip() for name in names] ②
+>>> names
+['Dora', 'Ethan', 'Wesley', 'John', 'Anne',
+'Mike', 'Chris', 'Sarah', 'Alex', 'Lizzie']
+>>> names = sorted(names) ③
+>>> names
+['Alex', 'Anne', 'Chris', 'Dora', 'Ethan',
+'John', 'Lizzie', 'Mike', 'Sarah', 'Wesley']
+>>> names = sorted(names, key=len) ④
+>>> names
+['Alex', 'Anne', 'Dora', 'John', 'Mike',
+'Chris', 'Ethan', 'Sarah', 'Lizzie', 'Wesley']
+list(open(filename)) idiom also includes the carriage returns at the end of each line. This list comprehension uses the rstrip() string method to strip trailing whitespace from each line. (Strings also have an lstrip() method to strip leading whitespace, and a strip() method which strips both.)
+sorted() function takes a list and returns it sorted. By default, it sorts alphabetically.
+sorted() function can also take a function as the key parameter, and it sorts by that key. In this case, the sort function is len(), so it sorts by len(each item). Shorter names come first, then longer, then longest.
+What does this have to do with the itertools module? I’m glad you asked.
+
+
+…continuing from the previous interactive shell… +>>> import itertools +>>> groups = itertools.groupby(names, len) ① +>>> groups +<itertools.groupby object at 0x00BB20C0> +>>> list(groups) +[(4, <itertools._grouper object at 0x00BA8BF0>), + (5, <itertools._grouper object at 0x00BB4050>), + (6, <itertools._grouper object at 0x00BB4030>)] +>>> groups = itertools.groupby(names, len) ② +>>> for name_length, name_iter in groups: ③ +... print('Names with {0:d} letters:'.format(name_length)) +... for name in name_iter: +... print(name) +... +Names with 4 letters: +Alex +Anne +Dora +John +Mike +Names with 5 letters: +Chris +Ethan +Sarah +Names with 6 letters: +Lizzie +Wesley+
itertools.groupby() function takes a sequence and a key function, and returns an iterator that generates pairs. Each pair contains the result of key_function(each item) and another iterator containing all the items that shared that key result.
+list() function “exhausted” the iterator, i.e. you’ve already generated every item in the iterator to make the list. There’s no “reset” button on an iterator; you can’t just start over once you’ve exhausted it. If you want to loop through it again (say, in the upcoming for loop), you need to call itertools.groupby() again to create a new iterator.
+itertools.groupby(names, len) will put all the 4-letter names in one iterator, all the 5-letter names in another iterator, and so on. The groupby() function is completely generic; it could group strings by first letter, numbers by their number of factors, or any other key function you can think of.
+++ +☞The
itertools.groupby()function only works if the input sequence is already sorted by the grouping function. In the example above, you grouped a list of names by thelen()function. That only worked because the input list was already sorted by length. +
Are you watching closely? +
+>>> list(range(0, 3)) +[0, 1, 2] +>>> list(range(10, 13)) +[10, 11, 12] +>>> list(itertools.chain(range(0, 3), range(10, 13))) ① +[0, 1, 2, 10, 11, 12] +>>> list(zip(range(0, 3), range(10, 13))) ② +[(0, 10), (1, 11), (2, 12)] +>>> list(zip(range(0, 3), range(10, 14))) ③ +[(0, 10), (1, 11), (2, 12)] +>>> list(itertools.zip_longest(range(0, 3), range(10, 14))) ④ +[(0, 10), (1, 11), (2, 12), (None, 13)]+
itertools.chain() function takes two iterators and returns an iterator that contains all the items from the first iterator, followed by all the items from the second iterator. (Actually, it can take any number of iterators, and it chains them all in the order they were passed to the function.)
+zip() function does something prosaic that turns out to be extremely useful: it takes any number of sequences and returns an iterator which returns tuples of the first items of each sequence, then the second items of each, then the third, and so on.
+zip() function stops at the end of the shortest sequence. range(10, 14) has 4 items (10, 11, 12, and 13), but range(0, 3) only has 3, so the zip() function returns an iterator of 3 items.
+itertools.zip_longest() function stops at the end of the longest sequence, inserting None values for items past the end of the shorter sequences.
+OK, that was all very interesting, but how does it relate to the alphametics solver? Here’s how: + +
+>>> characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
+>>> guess = ('1', '2', '0', '3', '4', '5', '6', '7')
+>>> tuple(zip(characters, guess)) ①
+(('S', '1'), ('M', '2'), ('E', '0'), ('D', '3'),
+ ('O', '4'), ('N', '5'), ('R', '6'), ('Y', '7'))
+>>> dict(zip(characters, guess)) ②
+{'E': '0', 'D': '3', 'M': '2', 'O': '4',
+ 'N': '5', 'S': '1', 'R': '6', 'Y': '7'}
+zip function will create a pairing of letters and digits, in order.
+dict() function to create a dictionary that uses letters as keys and their associated digits as values. (This isn’t the only way to do it, of course. You could use a dictionary comprehension to create the dictionary directly.) Although the printed representation of the dictionary lists the pairs in a different order (dictionaries have no “order” per se), you can see that each letter is associated with the digit, based on the ordering of the original characters and guess sequences.
+The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution. + +
characters = tuple(ord(c) for c in sorted_characters)
+digits = tuple(ord(c) for c in '0123456789')
+...
+for guess in itertools.permutations(digits, len(characters)):
+ ...
+ equation = puzzle.translate(dict(zip(characters, guess)))
+
+But what is this translate() method? Ah, now you’re getting to the really fun part.
+
+
⁂ + +
Python strings have many methods. You learned about some of those methods in the Strings chapter: lower(), count(), and format(). Now I want to introduce you to a powerful but little-known string manipulation technique: the translate() method.
+
+
+>>> translation_table = {ord('A'): ord('O')} ①
+>>> translation_table ②
+{65: 79}
+>>> 'MARK'.translate(translation_table) ③
+'MORK'
+ord() function returns the ASCII value of a character, which, in the case of A–Z, is always a byte from 65 to 90.
+translate() method on a string takes a translation table and runs the string through it. That is, it replaces all occurrences of the keys of the translation table with the corresponding values. In this case, “translating” MARK to MORK.
+What does this have to do with solving alphametic puzzles? As it turns out, everything. + +
+>>> characters = tuple(ord(c) for c in 'SMEDONRY') ① +>>> characters +(83, 77, 69, 68, 79, 78, 82, 89) +>>> guess = tuple(ord(c) for c in '91570682') ② +>>> guess +(57, 49, 53, 55, 48, 54, 56, 50) +>>> translation_table = dict(zip(characters, guess)) ③ +>>> translation_table +{68: 55, 69: 53, 77: 49, 78: 54, 79: 48, 82: 56, 83: 57, 89: 50} +>>> 'SEND + MORE == MONEY'.translate(translation_table) ④ +'9567 + 1085 == 10652'+
alphametics.solve() function.
+itertools.permutations() function in the alphametics.solve() function.
+alphametics.solve() function does inside the for loop.
+translate() method of the original puzzle string. This converts each letter in the string to the corresponding digit (based on the letters in characters and the digits in guess). The result is a valid Python expression, as a string.
+That’s pretty impressive. But what can you do with a string that happens to be a valid Python expression? + +
⁂ + +
This is the final piece of the puzzle (or rather, the final piece of the puzzle solver). After all that fancy string manipulation, we’re left with a string like '9567 + 1085 == 10652'. But that’s a string, and what good is a string? Enter eval(), the universal Python evaluation tool.
+
+
+>>> eval('1 + 1 == 2')
+True
+>>> eval('1 + 1 == 3')
+False
+>>> eval('9567 + 1085 == 10652')
+True
+
+But wait, there’s more! The eval() function isn’t limited to boolean expressions. It can handle any Python expression and returns any datatype.
+
+
+>>> eval('"A" + "B"')
+'AB'
+>>> eval('"MARK".translate({65: 79})')
+'MORK'
+>>> eval('"AAAAA".count("A")')
+5
+>>> eval('["*"] * 5')
+['*', '*', '*', '*', '*']
+
+But wait, that’s not all! + +
+>>> x = 5
+>>> eval("x * 5") ①
+25
+>>> eval("pow(x, 2)") ②
+25
+>>> import math
+>>> eval("math.sqrt(x)") ③
+2.2360679774997898
+eval() takes can reference global variables defined outside the eval(). If called within a function, it can reference local variables too.
+Hey, wait a minute… + +
+>>> import subprocess
+>>> eval("subprocess.getoutput('ls ~')") ①
+'Desktop Library Pictures \
+ Documents Movies Public \
+ Music Sites'
+>>> eval("subprocess.getoutput('rm /some/random/file')") ②
+subprocess module allows you to run arbitrary shell commands and get the result as a Python string.
+It’s even worse than that, because there’s a global __import__() function that takes a module name as a string, imports the module, and returns a reference to it. Combined with the power of eval(), you can construct a single expression that will wipe out all your files:
+
+
+>>> eval("__import__('subprocess').getoutput('rm /some/random/file')") ①
+'rm -rf ~'. Actually there wouldn’t be any output, but you wouldn’t have any files left either.
+eval() is EVIL + +
Well, the evil part is evaluating arbitrary expressions from untrusted sources. You should only use eval() on trusted input. Of course, the trick is figuring out what’s “trusted.” But here’s something I know for certain: you should NOT take this alphametics solver and put it on the internet as a fun little web service. Don’t make the mistake of thinking, “Gosh, the function does a lot of string manipulation before getting a string to evaluate; I can’t imagine how someone could exploit that.” Someone WILL figure out how to sneak nasty executable code past all that string manipulation (stranger things have happened), and then you can kiss your server goodbye.
+
+
But surely there’s some way to evaluate expressions safely? To put eval() in a sandbox where it can’t access or harm the outside world? Well, yes and no.
+
+
+>>> x = 5
+>>> eval("x * 5", {}, {}) ①
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ File "<string>", line 1, in <module>
+NameError: name 'x' is not defined
+>>> eval("x * 5", {"x": x}, {}) ②
+>>> import math
+>>> eval("math.sqrt(x)", {"x": x}, {}) ③
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ File "<string>", line 1, in <module>
+NameError: name 'math' is not defined
+eval() function act as the global and local namespaces for evaluating the expression. In this case, they are both empty, which means that when the string "x * 5" is evaluated, there is no reference to x in either the global or local namespace, so eval() throws an exception.
+math module, you didn’t include it in the namespace passed to the eval() function, so the evaluation failed.
+Gee, that was easy. Lemme make an alphametics web service now! + +
+>>> eval("pow(5, 2)", {}, {}) ①
+25
+>>> eval("__import__('math').sqrt(5)", {}, {}) ②
+2.2360679774997898
+pow(5, 2) works, because 5 and 2 are literals, and pow() is a built-in function.
+__import__() function is also a built-in function, so it works too.
+Yeah, that means you can still do nasty things, even if you explicitly set the global and local namespaces to empty dictionaries when calling eval():
+
+
>>> eval("__import__('subprocess').getoutput('rm /some/random/file')", {}, {})
+
+Oops. I’m glad I didn’t make that alphametics web service. Is there any way to use eval() safely? Well, yes and no.
+
+
+>>> eval("__import__('math').sqrt(5)",
+... {"__builtins__":None}, {}) ①
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ File "<string>", line 1, in <module>
+NameError: name '__import__' is not defined
+>>> eval("__import__('subprocess').getoutput('rm -rf /')",
+... {"__builtins__":None}, {}) ②
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ File "<string>", line 1, in <module>
+NameError: name '__import__' is not defined
+"__builtins__" to None, the Python null value. Internally, the “built-in” functions are contained within a pseudo-module called "__builtins__". This pseudo-module (i.e. the set of built-in functions) is made available to evaluated expressions unless you explicitly override it.
+__builtins__. Not __builtin__, __built-ins__, or some other variation that will work just fine but expose you to catastrophic risks.
+So eval() is safe now? Well, yes and no.
+
+
+>>> eval("2 ** 2147483647",
+... {"__builtins__":None}, {}) ①
+
+__builtins__, you can still launch a denial-of-service attack. For example, trying to raise 2 to the 2147483647th power will spike your server’s CPU utilization to 100% for quite some time. (If you’re trying this in the interactive shell, press Ctrl-C a few times to break out of it.) Technically this expression will return a value eventually, but in the meantime your server will be doing a whole lot of nothing.
+In the end, it is possible to safely evaluate untrusted Python expressions, for some definition of “safe” that turns out not to be terribly useful in real life. It’s fine if you’re just playing around, and it’s fine if you only ever pass it trusted input. But anything else is just asking for trouble. + +
⁂ + +
To recap: this program solves alphametic puzzles by brute force, i.e. through an exhaustive search of all possible solutions. To do this, it… + +
re.findall() function
+set() function
+assert statement
+itertools.permutations() function
+translate() string method
+eval() function
+True
+…in just 14 lines of code. + +
⁂ + +
itertools module
+itertools — Iterator functions for efficient looping
+Many thanks to Raymond Hettinger for agreeing to relicense his code so I could port it to Python 3 and use it as the basis for this chapter. + +
© 2001–10 Mark Pilgrim + + + diff --git a/colophon.html b/colophon.html index 0784e32..aa1df0d 100644 --- a/colophon.html +++ b/colophon.html @@ -1,87 +1,87 @@ - - - -
You are here: Home ‣ Dive Into Python 3 ‣ -
--❝ Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.
(I would have written a shorter letter, but I did not have the time.) ❞
— Blaise Pascal -
-
This book, like all books, was a labor of love. Oh sure, I got paid the medium-sized bucks for it, but nobody writes technical books for the money. And since this book is available on the web as well as on paper, I spent a lot of time fiddling with webby stuff when I should have been writing. - -
-
-
The online edition loads as efficiently as possible. Efficiency never happens by accident; I spent many hours making it so. Perhaps too many hours. Yes, almost certainly too many hours. Never underestimate the depths to which a procrastinating writer will sink. - -
I won’t bore you with all the details. Wait, yes — I will bore you with all the details. But here’s the short version. - -
⁂ - -
vertical rhythm, best available ampersand, curly quotes/apostrophes, other stuff from webtypography.net - -
⁂ - -
Unicode, callouts, font-family issues on Windows - -
⁂ - -
"Dive Into History 2009 edition", minimizing CSS + JS + HTML, inline CSS, optimizing images - -
⁂ - -
Quotes, constrained writing(?), PapayaWhip - -
⁂ - -
© 2001–10 Mark Pilgrim - - - + + + +
You are here: Home ‣ Dive Into Python 3 ‣ +
++❝ Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.
(I would have written a shorter letter, but I did not have the time.) ❞
— Blaise Pascal +
+
This book, like all books, was a labor of love. Oh sure, I got paid the medium-sized bucks for it, but nobody writes technical books for the money. And since this book is available on the web as well as on paper, I spent a lot of time fiddling with webby stuff when I should have been writing. + +
+
+
The online edition loads as efficiently as possible. Efficiency never happens by accident; I spent many hours making it so. Perhaps too many hours. Yes, almost certainly too many hours. Never underestimate the depths to which a procrastinating writer will sink. + +
I won’t bore you with all the details. Wait, yes — I will bore you with all the details. But here’s the short version. + +
⁂ + +
vertical rhythm, best available ampersand, curly quotes/apostrophes, other stuff from webtypography.net + +
⁂ + +
Unicode, callouts, font-family issues on Windows + +
⁂ + +
"Dive Into History 2009 edition", minimizing CSS + JS + HTML, inline CSS, optimizing images + +
⁂ + +
Quotes, constrained writing(?), PapayaWhip + +
⁂ + +
© 2001–10 Mark Pilgrim + + + diff --git a/files.html b/files.html index 474a5f2..f3edefc 100644 --- a/files.html +++ b/files.html @@ -1,607 +1,607 @@ - - -
You are here: Home ‣ Dive Into Python 3 ‣ -
Difficulty level: ♦♦♦♢♢ -
--❝ A nine mile walk is no joke, especially in the rain. ❞
— Harry Kemelman, The Nine Mile Walk -
-
My Windows laptop had 38,493 files before I installed a single application. Installing Python 3 added almost 3,000 files to that total. Files are the primary storage paradigm of every major operating system; the concept is so ingrained that most people would have trouble imagining an alternative. Your computer is, metaphorically speaking, drowning in files. - -
Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier: - -
a_file = open('examples/chinese.txt', encoding='utf-8')
-
-Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. There are five interesting things about this filename:
-
-
open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
-But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.
-
-
Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string). - -
-# This example was created on Windows. Other platforms may
-# behave differently, for reasons outlined below.
->>> file = open('examples/chinese.txt')
->>> a_string = file.read()
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
- return codecs.charmap_decode(input,self.errors,decoding_table)[0]
-UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
->>>
-
-
-
-What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError.
-
-
But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252). - -
-- -☞If you need to get the default character encoding, import the
localemodule and calllocale.getpreferredencoding(). On my Windows laptop, it returns'cp1252', but on my Linux box upstairs, it returns'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file. - -
So far, all we know is that Python has a built-in function called open(). The open() function returns a stream object, which has methods and attributes for getting information about and manipulating a stream of characters.
-
-
->>> a_file = open('examples/chinese.txt', encoding='utf-8')
->>> a_file.name ①
-'examples/chinese.txt'
->>> a_file.encoding ②
-'utf-8'
->>> a_file.mode ③
-'r'
-name attribute reflects the name you passed in to the open() function when you opened the file. It is not normalized to an absolute pathname.
-encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify the encoding when you opened the file (bad developer!) then the encoding attribute will reflect locale.getpreferredencoding().
-mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r', which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).
--- -☞The documentation for the
open()function lists all the possible file modes. -
After you open a file for reading, you’ll probably want to read from it at some point. - -
->>> a_file = open('examples/chinese.txt', encoding='utf-8')
->>> a_file.read() ①
-'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
->>> a_file.read() ②
-''
-read() method. The result is a string.
-What if you want to re-read a file? - -
-# continued from the previous example ->>> a_file.read() ① -'' ->>> a_file.seek(0) ② -0 ->>> a_file.read(16) ③ -'Dive Into Python' ->>> a_file.read(1) ④ -' ' ->>> a_file.read(1) -'是' ->>> a_file.tell() ⑤ -20-
read() method simply return an empty string.
-seek() method moves to a specific byte position in a file.
-read() method can take an optional parameter, the number of characters to read.
-Let’s try that again. - -
-# continued from the previous example ->>> a_file.seek(17) ① -17 ->>> a_file.read(1) ② -'是' ->>> a_file.tell() ③ -20-
Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that the seek() and read() methods are counting the same thing. But that’s only true for some characters.
-
-
But wait, it gets worse! - -
->>> a_file.seek(18) ① -18 ->>> a_file.read(1) ② -Traceback (most recent call last): - File "<pyshell#12>", line 1, in <module> - a_file.read(1) - File "C:\Python31\lib\codecs.py", line 300, in decode - (result, consumed) = self._buffer_decode(data, self.errors, final) -UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte-
UnicodeDecodeError.
-Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them. - -
-# continued from the previous example ->>> a_file.close()- -
Well that was anticlimactic. - -
The stream object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s not terribly useful.
-
-
-# continued from the previous example ->>> a_file.read() ① -Traceback (most recent call last): - File "<pyshell#24>", line 1, in <module> - a_file.read() -ValueError: I/O operation on closed file. ->>> a_file.seek(0) ② -Traceback (most recent call last): - File "<pyshell#25>", line 1, in <module> - a_file.seek(0) -ValueError: I/O operation on closed file. ->>> a_file.tell() ③ -Traceback (most recent call last): - File "<pyshell#26>", line 1, in <module> - a_file.tell() -ValueError: I/O operation on closed file. ->>> a_file.close() ④ ->>> a_file.closed ⑤ -True-
IOError exception.
-tell() method also fails.
-close() method on a stream object whose file has been closed does not raise an exception. It’s just a no-op.
-closed attribute will confirm that the file is closed.
-Stream objects have an explicit close() method, but what happens if your code has a bug and crashes before you call close()? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is.
-
-
Python 2 had a solution for this: the try..finally block. That still works in Python 3, and you may see it in other people’s code or in older code that was ported to Python 3. But Python 2.5 introduced a cleaner solution, which is now the preferred solution in Python 3: the with statement.
-
-
with open('examples/chinese.txt', encoding='utf-8') as a_file:
- a_file.seek(17)
- a_character = a_file.read(1)
- print(a_character)
-
-This code calls open(), but it never calls a_file.close(). The with statement starts a code block, like an if statement or a for loop. Inside this code block, you can use the variable a_file as the stream object returned from the call to open(). All the regular stream object methods are available — seek(), read(), whatever you need. When the with block ends, Python calls a_file.close() automatically.
-
-
Here’s the kicker: no matter how or when you exit the with block, Python will close that file… even if you “exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire program comes to a screeching halt, that file will get closed. Guaranteed.
-
-
-- -☞In technical terms, the
with statement creates a runtime context. In these examples, the stream object acts as a context manager. Python creates the stream object a_file and tells it that it is entering a runtime context. When the with code block is completed, Python tells the stream object that it is exiting the runtime context, and the stream object calls its own close() method. See Appendix B, “Classes That Can Be Used in a with Block” for details. -
There’s nothing file-specific about the with statement; it’s just a generic framework for creating runtime contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a stream object, then it does useful file-like things (like closing the file automatically). But that behavior is defined in the stream object, not in the with statement. There are lots of other ways to use context managers that have nothing to do with files. You can even create your own, as you’ll see later in this chapter.
-
-
A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line. - -
Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and it will all Just Work. - -
-- -☞If you need fine-grained control over what’s considered a line ending, you can pass the optional
newline parameter to the open() function. See the open() function documentation for all the gory details. -
So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful. - -
line_number = 0
-with open('examples/favorite-people.txt', encoding='utf-8') as a_file: ①
-    for a_line in a_file: ②
-        line_number += 1
-        print('{:>4} {}'.format(line_number, a_line.rstrip())) ③
-with pattern, you safely open the file and let Python close it for you.
-for loop. That’s it. Besides having explicit methods like read(), the stream object is also an iterator which spits out a single line every time you ask for a value.
-format() string method, you can print out the line number and the line itself. The format specifier {:>4} means “print this argument right-justified within 4 spaces.” The a_line variable contains the complete line, carriage returns and all. The rstrip() string method removes the trailing whitespace, including the carriage return characters.
-
-you@localhost:~/diveintopython3$ python3 examples/oneline.py
-   1 Dora
-   2 Ethan
-   3 Wesley
-   4 John
-   5 Anne
-   6 Mike
-   7 Chris
-   8 Sarah
-   9 Alex
-  10 Lizzie
-
-
-- -Did you get this error? -
-you@localhost:~/diveintopython3$ python3 examples/oneline.py
-Traceback (most recent call last):
-  File "examples/oneline.py", line 4, in <module>
-    print('{:>4} {}'.format(line_number, a_line.rstrip()))
-ValueError: zero length field name in format
-If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
-
Python 3.0 supported string formatting, but only with explicitly numbered format specifiers. Python 3.1 allows you to omit the argument indexes in your format specifiers. Here is the Python 3.0-compatible version for comparison: -
-print('{0:>4} {1}'.format(line_number, a_line.rstrip()))
⁂ - -
You can write to files in much the same way that you read from them. First you open a file and get a stream object, then you use methods on the stream object to write data to the file, then you close the file. - -
To open a file for writing, use the open() function and specify the write mode. There are two file modes for writing:
-
-
mode='w' to the open() function.
-mode='a' to the open() function.
-Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing. - -
You should always close a file as soon as you’re done writing to it, to release the file handle and ensure that the data is actually written to disk. As with reading data from a file, you can call the stream object’s close() method, or you can use the with statement and let Python close the file for you. I bet you can guess which technique I recommend.
-
-
->>> with open('test.log', mode='w', encoding='utf-8') as a_file: ①
-... a_file.write('test succeeded') ②
->>> with open('test.log', encoding='utf-8') as a_file:
-... print(a_file.read())
-test succeeded
->>> with open('test.log', mode='a', encoding='utf-8') as a_file: ③
-... a_file.write('and again')
->>> with open('test.log', encoding='utf-8') as a_file:
-... print(a_file.read())
-test succeededand again ④
-test.log (or overwriting the existing file), and opening the file for writing. The mode='w' parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I hope you didn’t care about the previous contents of that file (if any), because that data is gone now.
-write() method of the stream object returned by the open() function. After the with block ends, Python automatically closes the file.
-mode='a' to append to the file instead of overwriting it. Appending will never harm the existing contents of the file.
-test.log. Also note that neither carriage returns nor line feeds are included. Since you didn’t write them explicitly to the file either time, the file doesn’t include them. You can write a carriage return with the '\r' character, and/or a line feed with the '\n' character. Since you didn’t do either, everything you wrote to the file ended up on one line.
-Did you notice the encoding parameter that got passed in to the open() function while you were opening a file for writing? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain strings, they contain bytes. Reading a “string” from a text file only works because you told Python what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the same problem in reverse. You can’t write characters to a file; characters are an abstraction. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way to be sure it’s performing the correct conversion is to specify the encoding parameter when you open the file for writing.
-
-
⁂ - -
-
-
Not all files contain text. Some of them contain pictures of my dog. - -
->>> an_image = open('examples/beauregard.jpg', mode='rb') ①
->>> an_image.mode ②
-'rb'
->>> an_image.name ③
-'examples/beauregard.jpg'
->>> an_image.encoding ④
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
-AttributeError: '_io.BufferedReader' object has no attribute 'encoding'
-mode parameter contains a 'b' character.
-mode, which reflects the mode parameter you passed into the open() function.
-name attribute, just like text stream objects.
-encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do. What you get out of a binary file is exactly what you put into it, no conversion necessary.
-Did I mention you’re reading bytes? Oh yes you are. - -
-# continued from the previous example
->>> an_image.tell()
-0
->>> data = an_image.read(3) ①
->>> data
-b'\xff\xd8\xff'
->>> type(data) ②
-<class 'bytes'>
->>> an_image.tell() ③
-3
->>> an_image.seek(0)
-0
->>> data = an_image.read()
->>> len(data)
-3150
-
read() method takes the number of bytes to read, not the number of characters.
-read() method and the position index you get out of the tell() method. The read() method reads bytes, and the seek() and tell() methods track the number of bytes read. For binary files, they’ll always agree.
-⁂ - -
Imagine you’re writing a library, and one of your library functions is going to read some data from a file. The function could simply take a filename as a string, go open the file for reading, read it, and close it before exiting. But you shouldn’t do that. Instead, your API should take an arbitrary stream object. - -
In the simplest case, a stream object is anything with a read() method which takes an optional size parameter and returns a string. When called with no size parameter, the read() method should read everything there is to read from the input source and return all the data as a single value. When called with a size parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data.
-
-
That sounds exactly like the stream object you get from opening a real file. The difference is that you’re not limiting yourself to real files. The input source that’s being “read” could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a stream object and simply call the object’s read() method, you can handle any input source that acts like a file, without specific code to handle each kind of input.
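As a minimal sketch of that advice, the function below accepts any object with a read() method instead of a filename. The name count_words is hypothetical and is not part of this chapter's examples; io.StringIO from the standard library stands in for a real file.

```python
import io

def count_words(stream):
    """Count whitespace-separated words from any object that has read()."""
    # The function never opens or closes anything; the caller owns the stream.
    return len(stream.read().split())

# An in-memory "file" works just as well as one opened with open().
in_memory = io.StringIO('PapayaWhip is the new black.')
word_count = count_words(in_memory)
print(word_count)
```

Because the function only calls read(), the same code should also work with a stream returned by open() or gzip.open() in text mode.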
-
-
->>> a_string = 'PapayaWhip is the new black.'
->>> import io ①
->>> a_file = io.StringIO(a_string) ②
->>> a_file.read() ③
-'PapayaWhip is the new black.'
->>> a_file.read() ④
-''
->>> a_file.seek(0) ⑤
-0
->>> a_file.read(10) ⑥
-'PapayaWhip'
->>> a_file.tell()
-10
->>> a_file.seek(18)
-18
->>> a_file.read()
-'new black.'
-
io module defines the StringIO class that you can use to treat a string in memory as a file.
-io.StringIO() class and pass it the string you want to use as your “file” data. Now you have a stream object, and you can do all sorts of stream-like things with it.
-read() method “reads” the entire “file,” which in the case of a StringIO object simply returns the original string.
-read() method again returns an empty string.
-seek() method of the StringIO object.
-read() method.
--- -☞
io.StringIO lets you treat a string as a text file. There’s also an io.BytesIO class, which lets you treat a byte array as a binary file. -
The Python standard library contains modules that support reading and writing compressed files. There are a number of different compression schemes; the two most popular on non-Windows systems are gzip and bzip2. (You may have also encountered PKZIP archives and GNU Tar archives. Python has modules for those, too.) - -
The gzip module lets you create a stream object for reading or writing a gzip-compressed file. The stream object it gives you supports the read() method (if you opened it for reading) or the write() method (if you opened it for writing). That means you can use the methods you’ve already learned for regular files to directly read or write a gzip-compressed file, without creating a temporary file to store the decompressed data.
-
-
As an added bonus, it supports the with statement too, so you can let Python automatically close your gzip-compressed file when you’re done with it.
-
-
-you@localhost:~$ python3
-
->>> import gzip
->>> with gzip.open('out.log.gz', mode='wb') as z_file: ①
-... z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
-...
->>> exit()
-
-you@localhost:~$ ls -l out.log.gz ②
--rw-r--r-- 1 mark mark 79 2009-07-19 14:29 out.log.gz
-you@localhost:~$ gunzip out.log.gz ③
-you@localhost:~$ cat out.log ④
-A nine mile walk is no joke, especially in the rain.
-'b' character in the mode argument.)
-gunzip command (pronounced “gee-unzip”) decompresses the file and stores the contents in a new file named the same as the compressed file but without the .gz file extension.
-cat command displays the contents of a file. This file contains the string you originally wrote directly to the compressed file out.log.gz from within the Python Shell.
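The shell session above decompresses the file with gunzip. As a rough sketch, the same round trip can be done from Python alone; the temporary directory used here is an assumption made so the example does not touch your working directory.

```python
import gzip
import os
import tempfile

# Write to a throwaway directory rather than the current one (an assumption
# of this sketch, not something the chapter's example does).
path = os.path.join(tempfile.mkdtemp(), 'out.log.gz')
message = 'A nine mile walk is no joke, especially in the rain.'

# Binary mode: gzip streams deal in bytes, so encode before writing.
with gzip.open(path, mode='wb') as z_file:
    z_file.write(message.encode('utf-8'))

# Reading gives bytes back; decode them with the same encoding.
with gzip.open(path, mode='rb') as z_file:
    recovered = z_file.read().decode('utf-8')

print(recovered == message)
```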
--- -Did you get this error? -
->>> with gzip.open('out.log.gz', mode='wb') as z_file:
-...     z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
-...
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-AttributeError: 'GzipFile' object has no attribute '__exit__'
-If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
-
Python 3.0 had a
gzip module, but it did not support using a gzipped-file object as a context manager. Python 3.1 added the ability to use gzipped-file objects in a with statement. -
⁂ - -
Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. - -
Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX-like system, including Mac OS X and Linux. When you call the print() function, the thing you’re printing is sent to the stdout pipe. When your program crashes and prints out a traceback, it goes to the stderr pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. In the graphical Python Shell, the stdout and stderr pipes default to your “Interactive Window”.
-
-
->>> for i in range(3):
-... print('PapayaWhip') ①
-PapayaWhip
-PapayaWhip
-PapayaWhip
->>> import sys
->>> for i in range(3):
-... sys.stdout.write('is the') ②
-is theis theis the
->>> for i in range(3):
-... sys.stderr.write('new black') ③
-new blacknew blacknew black
-print() function, in a loop. Nothing surprising here.
-stdout is defined in the sys module, and it is a stream object. Calling its write() method will print out whatever string you give it. In fact, this is what the print function really does; it adds a newline to the end of the string you’re printing, and calls sys.stdout.write.
-sys.stdout and sys.stderr send their output to the same place: the Python IDE (if you’re in one), or the terminal (if you’re running Python from the command line). Like standard output, standard error does not add newlines for you. If you want newlines, you’ll need to write newline characters.
-sys.stdout and sys.stderr are stream objects, but they are write-only. Attempting to call their read() method will always raise an IOError.
-
-
->>> import sys
->>> sys.stdout.read()
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-IOError: not readable
-
-
sys.stdout and sys.stderr are stream objects, albeit ones that only support writing. But they’re not constants; they’re variables. That means you can assign them a new value — any other stream object — to redirect their output.
-
-
import sys
-
-class RedirectStdoutTo:
-    def __init__(self, out_new):
-        self.out_new = out_new
-
-    def __enter__(self):
-        self.out_old = sys.stdout
-        sys.stdout = self.out_new
-
-    def __exit__(self, *args):
-        sys.stdout = self.out_old
-
-print('A')
-with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
-    print('B')
-print('C')
-
-Check this out: - -
-you@localhost:~/diveintopython3/examples$ python3 stdout.py
-A
-C
-you@localhost:~/diveintopython3/examples$ cat out.log
-B
-
-
-- -Did you get this error? -
-you@localhost:~/diveintopython3/examples$ python3 stdout.py - File "stdout.py", line 15 - with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file): - ^ -SyntaxError: invalid syntax-If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. -
Python 3.0 supported the
with statement, but each statement could only use one context manager. Python 3.1 allows you to chain multiple context managers in a single with statement. -
Let’s take the last part first. - -
print('A')
-with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
-    print('B')
-print('C')
-
-That’s a complicated with statement. Let me rewrite it as something more recognizable.
-
-
with open('out.log', mode='w', encoding='utf-8') as a_file:
-    with RedirectStdoutTo(a_file):
-        print('B')
-
-As the rewrite shows, you actually have two with statements, one nested within the scope of the other. The “outer” with statement should be familiar by now: it opens a UTF-8-encoded text file named out.log for writing and assigns the stream object to a variable named a_file. But that’s not the only thing odd here.
-
with RedirectStdoutTo(a_file):
-
-Where’s the as clause? The with statement doesn’t actually require one. Just like you can call a function and ignore its return value, you can have a with statement that doesn’t assign the with context to a variable. In this case, you’re only interested in the side effects of the RedirectStdoutTo context.
-
-
What are those side effects? Take a look inside the RedirectStdoutTo class. This class is a custom context manager. Any class can be a context manager by defining two special methods: __enter__() and __exit__().
-
-
class RedirectStdoutTo:
-    def __init__(self, out_new): ①
-        self.out_new = out_new
-
-    def __enter__(self): ②
-        self.out_old = sys.stdout
-        sys.stdout = self.out_new
-
-    def __exit__(self, *args): ③
-        sys.stdout = self.out_old
-__init__() method is called immediately after an instance is created. It takes one parameter, the stream object that you want to use as standard output for the life of the context. This method just saves the stream object in an instance variable so other methods can use it later.
-__enter__() method is a special method; Python calls it when entering a context (i.e. at the beginning of the with statement). This method saves the current value of sys.stdout in self.out_old, then redirects standard output by assigning self.out_new to sys.stdout.
-__exit__() method is another special method; Python calls it when exiting the context (i.e. at the end of the with statement). This method restores standard output to its original value by assigning the saved self.out_old value to sys.stdout.
-Putting it all together: - -
-print('A') ①
-with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file): ②
-    print('B') ③
-print('C') ④
-with statement takes a comma-separated list of contexts. The comma-separated list acts like a series of nested with blocks. The first context listed is the “outer” block; the last one listed is the “inner” block. The first context opens a file; the second context redirects sys.stdout to the stream object that was created in the first context.
-print() function is executed with the context created by the with statement, it will not print to the screen; it will write to the file out.log.
-with code block is over. Python has told each context manager to do whatever it is they do upon exiting a context. The context managers form a last-in-first-out stack. Upon exiting, the second context changed sys.stdout back to its original value, then the first context closed the file named out.log. Since standard output has been restored to its original value, calling the print() function will once again print to the screen.
-Redirecting standard error works exactly the same way, using sys.stderr instead of sys.stdout.
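A minimal sketch of that variation follows. The class name RedirectStderrTo is hypothetical, written by analogy with RedirectStdoutTo, and io.StringIO is used as the target so the captured text can be inspected.

```python
import io
import sys

class RedirectStderrTo:
    """Hypothetical counterpart to RedirectStdoutTo, for sys.stderr."""
    def __init__(self, err_new):
        self.err_new = err_new

    def __enter__(self):
        # Remember the current stream, then swap in the replacement.
        self.err_old = sys.stderr
        sys.stderr = self.err_new

    def __exit__(self, *args):
        # Restore the original stream, even if the block raised.
        sys.stderr = self.err_old

captured = io.StringIO()
original_stderr = sys.stderr
with RedirectStderrTo(captured):
    print('new black', file=sys.stderr)

print(captured.getvalue() == 'new black\n', sys.stderr is original_stderr)
```

Newer Python releases also provide contextlib.redirect_stdout and contextlib.redirect_stderr in the standard library, which cover the same ground without a hand-written class.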
-
-
⁂ - -
io module
-sys.stdout and sys.stderr
-© 2001–10 Mark Pilgrim - - - + + +
You are here: Home ‣ Dive Into Python 3 ‣ +
Difficulty level: ♦♦♦♢♢ +
++❝ A nine mile walk is no joke, especially in the rain. ❞
— Harry Kemelman, The Nine Mile Walk +
+
My Windows laptop had 38,493 files before I installed a single application. Installing Python 3 added almost 3,000 files to that total. Files are the primary storage paradigm of every major operating system; the concept is so ingrained that most people would have trouble imagining an alternative. Your computer is, metaphorically speaking, drowning in files. + +
Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier: + +
a_file = open('examples/chinese.txt', encoding='utf-8')
+
+Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. There are five interesting things about this filename:
+
+
open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
+But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar.
+
+
Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string). + +
+# This example was created on Windows. Other platforms may
+# behave differently, for reasons outlined below.
+>>> file = open('examples/chinese.txt')
+>>> a_string = file.read()
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
+ return codecs.charmap_decode(input,self.errors,decoding_table)[0]
+UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
+>>>
+
+
+
+What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError.
+
+
But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252). + +
++ +☞If you need to get the default character encoding, import the
locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file. + +
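A small sketch of that check, followed by a round trip that does not depend on the result. The printed default will differ between machines, so no particular value is assumed; the file name encoding-demo.txt is hypothetical and lives in a temporary directory.

```python
import locale
import os
import tempfile

# The platform default; this varies from machine to machine.
default_encoding = locale.getpreferredencoding()
print('default encoding:', default_encoding)

# Passing encoding explicitly makes the round trip independent of that default.
path = os.path.join(tempfile.mkdtemp(), 'encoding-demo.txt')  # hypothetical file
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write('是为有经验的程序员')
with open(path, encoding='utf-8') as a_file:
    round_trip = a_file.read()

print(round_trip == '是为有经验的程序员')
```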
So far, all we know is that Python has a built-in function called open(). The open() function returns a stream object, which has methods and attributes for getting information about and manipulating a stream of characters.
+
+
+>>> a_file = open('examples/chinese.txt', encoding='utf-8')
+>>> a_file.name ①
+'examples/chinese.txt'
+>>> a_file.encoding ②
+'utf-8'
+>>> a_file.mode ③
+'r'
+name attribute reflects the name you passed in to the open() function when you opened the file. It is not normalized to an absolute pathname.
+encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify the encoding when you opened the file (bad developer!) then the encoding attribute will reflect locale.getpreferredencoding().
+mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r', which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).
+++ +☞The documentation for the
open() function lists all the possible file modes. +
After you open a file for reading, you’ll probably want to read from it at some point. + +
+>>> a_file = open('examples/chinese.txt', encoding='utf-8')
+>>> a_file.read() ①
+'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
+>>> a_file.read() ②
+''
+read() method. The result is a string.
+What if you want to re-read a file? + +
+# continued from the previous example
+>>> a_file.read() ①
+''
+>>> a_file.seek(0) ②
+0
+>>> a_file.read(16) ③
+'Dive Into Python'
+>>> a_file.read(1) ④
+' '
+>>> a_file.read(1)
+'是'
+>>> a_file.tell() ⑤
+20
+
read() method simply return an empty string.
+seek() method moves to a specific byte position in a file.
+read() method can take an optional parameter, the number of characters to read.
+Let’s try that again. + +
+# continued from the previous example
+>>> a_file.seek(17) ①
+17
+>>> a_file.read(1) ②
+'是'
+>>> a_file.tell() ③
+20
+
Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that the seek() and read() methods are counting the same thing. But that’s only true for some characters.
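The mismatch can be reproduced without the example file. This sketch writes a short string of its own (an assumption; it is not the contents of examples/chinese.txt) to a temporary file and compares the character count with the byte count.

```python
import os
import tempfile

text = 'Dive Into Python 是为'
path = os.path.join(tempfile.mkdtemp(), 'chinese-demo.txt')  # hypothetical file
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write(text)

with open(path, encoding='utf-8') as a_file:
    a_file.seek(17)              # byte offset of the first Chinese character
    character = a_file.read(1)   # reads one character...
    position = a_file.tell()     # ...but moves three bytes further along

character_count = len(text)
byte_count = len(text.encode('utf-8'))
print(character_count, byte_count, position)
```

The 17 ASCII characters take one byte each and the two Chinese characters take three bytes each, which is why the two counts disagree.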
+
+
But wait, it gets worse! + +
+>>> a_file.seek(18) ①
+18
+>>> a_file.read(1) ②
+Traceback (most recent call last):
+  File "<pyshell#12>", line 1, in <module>
+    a_file.read(1)
+  File "C:\Python31\lib\codecs.py", line 300, in decode
+    (result, consumed) = self._buffer_decode(data, self.errors, final)
+UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
+
UnicodeDecodeError.
+Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them. + +
+# continued from the previous example
+>>> a_file.close()
+
+
Well that was anticlimactic. + +
The stream object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s not terribly useful.
+
+
+# continued from the previous example
+>>> a_file.read() ①
+Traceback (most recent call last):
+  File "<pyshell#24>", line 1, in <module>
+    a_file.read()
+ValueError: I/O operation on closed file.
+>>> a_file.seek(0) ②
+Traceback (most recent call last):
+  File "<pyshell#25>", line 1, in <module>
+    a_file.seek(0)
+ValueError: I/O operation on closed file.
+>>> a_file.tell() ③
+Traceback (most recent call last):
+  File "<pyshell#26>", line 1, in <module>
+    a_file.tell()
+ValueError: I/O operation on closed file.
+>>> a_file.close() ④
+>>> a_file.closed ⑤
+True
+
ValueError exception.
+tell() method also fails.
+close() method on a stream object whose file has been closed does not raise an exception. It’s just a no-op.
+closed attribute will confirm that the file is closed.
+Stream objects have an explicit close() method, but what happens if your code has a bug and crashes before you call close()? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is.
+
+
Python 2 had a solution for this: the try..finally block. That still works in Python 3, and you may see it in other people’s code or in older code that was ported to Python 3. But Python 2.5 introduced a cleaner solution, which is now the preferred solution in Python 3: the with statement.
+
+
with open('examples/chinese.txt', encoding='utf-8') as a_file:
+    a_file.seek(17)
+    a_character = a_file.read(1)
+    print(a_character)
+
+This code calls open(), but it never calls a_file.close(). The with statement starts a code block, like an if statement or a for loop. Inside this code block, you can use the variable a_file as the stream object returned from the call to open(). All the regular stream object methods are available — seek(), read(), whatever you need. When the with block ends, Python calls a_file.close() automatically.
+
+
Here’s the kicker: no matter how or when you exit the with block, Python will close that file… even if you “exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire program comes to a screeching halt, that file will get closed. Guaranteed.
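To make the guarantee concrete, here is a sketch that puts the older try..finally spelling next to the with statement and then raises an exception on purpose inside a with block. The file name with-demo.txt and the RuntimeError are assumptions chosen for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with-demo.txt')  # hypothetical file
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write('PapayaWhip')

# The older try..finally spelling: close() runs whether or not read() raises.
old_style = open(path, encoding='utf-8')
try:
    old_style.read()
finally:
    old_style.close()

# The with statement gives the same guarantee, even when the block raises.
try:
    with open(path, encoding='utf-8') as new_style:
        raise RuntimeError('simulated bug inside the with block')
except RuntimeError:
    pass

print(old_style.closed, new_style.closed)
```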
+
+
++ +☞In technical terms, the
with statement creates a runtime context. In these examples, the stream object acts as a context manager. Python creates the stream object a_file and tells it that it is entering a runtime context. When the with code block is completed, Python tells the stream object that it is exiting the runtime context, and the stream object calls its own close() method. See Appendix B, “Classes That Can Be Used in a with Block” for details. +
There’s nothing file-specific about the with statement; it’s just a generic framework for creating runtime contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a stream object, then it does useful file-like things (like closing the file automatically). But that behavior is defined in the stream object, not in the with statement. There are lots of other ways to use context managers that have nothing to do with files. You can even create your own, as you’ll see later in this chapter.
+
+
A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line. + +
Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and it will all Just Work. + +
++ +☞If you need fine-grained control over what’s considered a line ending, you can pass the optional
newline parameter to the open() function. See the open() function documentation for all the gory details. +
So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful. + +
line_number = 0
+with open('examples/favorite-people.txt', encoding='utf-8') as a_file: ①
+    for a_line in a_file: ②
+        line_number += 1
+        print('{:>4} {}'.format(line_number, a_line.rstrip())) ③
+with pattern, you safely open the file and let Python close it for you.
+for loop. That’s it. Besides having explicit methods like read(), the stream object is also an iterator which spits out a single line every time you ask for a value.
+format() string method, you can print out the line number and the line itself. The format specifier {:>4} means “print this argument right-justified within 4 spaces.” The a_line variable contains the complete line, carriage returns and all. The rstrip() string method removes the trailing whitespace, including the carriage return characters.
+
+you@localhost:~/diveintopython3$ python3 examples/oneline.py
+   1 Dora
+   2 Ethan
+   3 Wesley
+   4 John
+   5 Anne
+   6 Mike
+   7 Chris
+   8 Sarah
+   9 Alex
+  10 Lizzie
+
+
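The manual counter is not the only way to number lines. As an alternative sketch (not the approach used in oneline.py), the built-in enumerate() function can supply the line number; io.StringIO stands in for examples/favorite-people.txt here, with only the first three names.

```python
import io

# An in-memory stand-in for the real file, so the sketch runs anywhere.
a_file = io.StringIO('Dora\nEthan\nWesley\n')

numbered = []
# start=1 makes the count begin at 1 instead of 0.
for line_number, a_line in enumerate(a_file, start=1):
    numbered.append('{:>4} {}'.format(line_number, a_line.rstrip()))

print('\n'.join(numbered))
```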
++ +Did you get this error? +
+you@localhost:~/diveintopython3$ python3 examples/oneline.py
+Traceback (most recent call last):
+  File "examples/oneline.py", line 4, in <module>
+    print('{:>4} {}'.format(line_number, a_line.rstrip()))
+ValueError: zero length field name in format
+If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
+
Python 3.0 supported string formatting, but only with explicitly numbered format specifiers. Python 3.1 allows you to omit the argument indexes in your format specifiers. Here is the Python 3.0-compatible version for comparison: +
+print('{0:>4} {1}'.format(line_number, a_line.rstrip()))
⁂ + +
You can write to files in much the same way that you read from them. First you open a file and get a stream object, then you use methods on the stream object to write data to the file, then you close the file. + +
To open a file for writing, use the open() function and specify the write mode. There are two file modes for writing:
+
+To overwrite the file, pass mode='w' to the open() function.
+To append to the end of the file, pass mode='a' to the open() function.
+Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing. + +
You should always close a file as soon as you’re done writing to it, to release the file handle and ensure that the data is actually written to disk. As with reading data from a file, you can call the stream object’s close() method, or you can use the with statement and let Python close the file for you. I bet you can guess which technique I recommend.
+
+
+>>> with open('test.log', mode='w', encoding='utf-8') as a_file: ①
+... a_file.write('test succeeded') ②
+>>> with open('test.log', encoding='utf-8') as a_file:
+... print(a_file.read())
+test succeeded
+>>> with open('test.log', mode='a', encoding='utf-8') as a_file: ③
+... a_file.write('and again')
+>>> with open('test.log', encoding='utf-8') as a_file:
+... print(a_file.read())
+test succeededand again ④
+This creates the new file test.log (or overwrites the existing file) and opens it for writing. The mode='w' parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I hope you didn’t care about the previous contents of that file (if any), because that data is gone now.
+You can add data to the newly opened file with the write() method of the stream object returned by the open() function. After the with block ends, Python automatically closes the file.
+This time, open the file with mode='a' to append to the file instead of overwriting it. Appending will never harm the existing contents of the file.
+Both strings you wrote are now in the file test.log. Also note that neither carriage returns nor line feeds are included. Since you didn’t write them explicitly to the file either time, the file doesn’t include them. You can write a carriage return with the '\r' character, and/or a line feed with the '\n' character. Since you didn’t do either, everything you wrote to the file ended up on one line.
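To keep the two writes on separate lines, you have to write the line feeds yourself. Here is a small sketch of that; the write_then_append() helper and the temporary directory are assumptions made for illustration, so the example never touches a real test.log.

```python
import os
import tempfile

def write_then_append(path):
    """Write one line, append another, and return the file's contents."""
    # mode='w' creates the file, or overwrites it if it already exists
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('test succeeded\n')
    # mode='a' adds to the end without touching what is already there
    with open(path, mode='a', encoding='utf-8') as a_file:
        a_file.write('and again\n')
    with open(path, encoding='utf-8') as a_file:
        return a_file.read()

# A throwaway directory keeps the sketch from clobbering anything you care about
with tempfile.TemporaryDirectory() as temp_dir:
    contents = write_then_append(os.path.join(temp_dir, 'test.log'))
print(contents)
```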
+Did you notice the encoding parameter that got passed in to the open() function while you were opening a file for writing? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain strings, they contain bytes. Reading a “string” from a text file only works because you told Python what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the same problem in reverse. You can’t write characters to a file; characters are an abstraction. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way to be sure it’s performing the correct conversion is to specify the encoding parameter when you open the file for writing.
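To see why the encoding matters when writing, you can write the same string twice and compare the bytes that land on disk. This is only a sketch: the bytes_on_disk() helper, the sample string, and the choice of Latin-1 as the second encoding are all assumptions made for illustration.

```python
import os
import tempfile

def bytes_on_disk(text, encoding):
    """Write text with the given encoding and return the raw bytes that were stored."""
    with tempfile.TemporaryDirectory() as temp_dir:
        path = os.path.join(temp_dir, 'sample.txt')
        with open(path, mode='w', encoding=encoding) as a_file:
            a_file.write(text)
        # Binary mode shows the bytes exactly as they were written
        with open(path, mode='rb') as a_file:
            return a_file.read()

text = 'caf\u00e9'   # four characters; the last one is e with an acute accent
utf8_bytes = bytes_on_disk(text, 'utf-8')
latin1_bytes = bytes_on_disk(text, 'latin-1')
print(len(text), len(utf8_bytes), len(latin1_bytes))
```

Same four characters, two different byte sequences. Without an explicit encoding parameter, which sequence you get depends on the platform default.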
+
+
⁂ + +
+
+
Not all files contain text. Some of them contain pictures of my dog. + +
+>>> an_image = open('examples/beauregard.jpg', mode='rb') ①
+>>> an_image.mode ②
+'rb'
+>>> an_image.name ③
+'examples/beauregard.jpg'
+>>> an_image.encoding ④
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+AttributeError: '_io.BufferedReader' object has no attribute 'encoding'
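+Opening a file in binary mode looks almost the same as opening it in text mode; the only difference is that the mode parameter contains a 'b' character.
+A binary stream object has many of the same attributes as a text stream object, including mode, which reflects the mode parameter you passed into the open() function.
+Binary stream objects also have a name attribute, just like text stream objects.
+One difference: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do. What you get out of a binary file is exactly what you put into it, no conversion necessary.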
+Did I mention you’re reading bytes? Oh yes you are. + +
+# continued from the previous example
+>>> an_image.tell()
+0
+>>> data = an_image.read(3) ①
+>>> data
+b'\xff\xd8\xff'
+>>> type(data) ②
+<class 'bytes'>
+>>> an_image.tell() ③
+3
+>>> an_image.seek(0)
+0
+>>> data = an_image.read()
+>>> len(data)
+3150
+Since you opened the file in binary mode, the read() method takes the number of bytes to read, not the number of characters.
+That means there is never a mismatch between the number you passed into the read() method and the position index you get out of the tell() method. The read() method reads bytes, and the seek() and tell() methods track the number of bytes read. For binary files, they’ll always agree.
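The difference between counting characters and counting bytes is easiest to see by reading the same file both ways. A minimal sketch follows; the read_both_ways() helper, the file name, and the sample text are assumptions made for illustration.

```python
import os
import tempfile

def read_both_ways(text, size):
    """Write text as UTF-8, then read `size` units back in text mode and in binary mode."""
    with tempfile.TemporaryDirectory() as temp_dir:
        path = os.path.join(temp_dir, 'sample.txt')
        with open(path, mode='w', encoding='utf-8') as a_file:
            a_file.write(text)
        with open(path, encoding='utf-8') as a_file:
            as_text = a_file.read(size)       # size counts characters
        with open(path, mode='rb') as a_file:
            as_bytes = a_file.read(size)      # size counts bytes
            position = a_file.tell()          # so tell() agrees with size
    return as_text, as_bytes, position

as_text, as_bytes, position = read_both_ways('caf\u00e9 au lait', 4)
print(repr(as_text), as_bytes, position)
```

In text mode, four characters come back. In binary mode, four bytes come back, and the fourth byte is only the first half of the accented character.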
+⁂ + +
Imagine you’re writing a library, and one of your library functions is going to read some data from a file. The function could simply take a filename as a string, go open the file for reading, read it, and close it before exiting. But you shouldn’t do that. Instead, your API should take an arbitrary stream object. + +
In the simplest case, a stream object is anything with a read() method which takes an optional size parameter and returns a string. When called with no size parameter, the read() method should read everything there is to read from the input source and return all the data as a single value. When called with a size parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data.
+
+
That sounds exactly like the stream object you get from opening a real file. The difference is that you’re not limiting yourself to real files. The input source that’s being “read” could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a stream object and simply call the object’s read() method, you can handle any input source that acts like a file, without specific code to handle each kind of input.
+
+
+>>> a_string = 'PapayaWhip is the new black.'
+>>> import io ①
+>>> a_file = io.StringIO(a_string) ②
+>>> a_file.read() ③
+'PapayaWhip is the new black.'
+>>> a_file.read() ④
+''
+>>> a_file.seek(0) ⑤
+0
+>>> a_file.read(10) ⑥
+'PapayaWhip'
+>>> a_file.tell()
+10
+>>> a_file.seek(18)
+18
+>>> a_file.read()
+'new black.'
+The io module defines the StringIO class that you can use to treat a string in memory as a file.
+To create a stream object out of a string, create an instance of the io.StringIO() class and pass it the string you want to use as your “file” data. Now you have a stream object, and you can do all sorts of stream-like things with it.
+Calling the read() method “reads” the entire “file,” which in the case of a StringIO object simply returns the original string.
+Just like a real file, calling the read() method again returns an empty string.
+You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the seek() method of the StringIO object.
+You can also read the string in chunks, by passing a size parameter to the read() method.
+++ +☞
io.StringIO lets you treat a string as a text file. There’s also an io.BytesIO class, which lets you treat a byte array as a binary file. +
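Here is a rough sketch of what “take an arbitrary stream object” can look like in a library function. The count_words() function is hypothetical; the only thing it assumes about its argument is a read() method.

```python
import io

def count_words(stream):
    """Count whitespace-separated words in anything that has a read() method."""
    # No open(), no filename: the caller decides where the data comes from
    return len(stream.read().split())

text_count = count_words(io.StringIO('PapayaWhip is the new black.'))
byte_count = count_words(io.BytesIO(b'PapayaWhip is the new black.'))
print(text_count, byte_count)
```

The same function would also accept the stream object returned by open(), in text mode or binary mode, without any extra code.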
The Python standard library contains modules that support reading and writing compressed files. There are a number of different compression schemes; the two most popular on non-Windows systems are gzip and bzip2. (You may have also encountered PKZIP archives and GNU Tar archives. Python has modules for those, too.) + +
The gzip module lets you create a stream object for reading or writing a gzip-compressed file. The stream object it gives you supports the read() method (if you opened it for reading) or the write() method (if you opened it for writing). That means you can use the methods you’ve already learned for regular files to directly read or write a gzip-compressed file, without creating a temporary file to store the decompressed data.
+
+
As an added bonus, it supports the with statement too, so you can let Python automatically close your gzip-compressed file when you’re done with it.
+
+
+you@localhost:~$ python3
+
+>>> import gzip
+>>> with gzip.open('out.log.gz', mode='wb') as z_file: ①
+... z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
+...
+>>> exit()
+
+you@localhost:~$ ls -l out.log.gz ②
+-rw-r--r-- 1 mark mark 79 2009-07-19 14:29 out.log.gz
+you@localhost:~$ gunzip out.log.gz ③
+you@localhost:~$ cat out.log ④
+A nine mile walk is no joke, especially in the rain.
+The gzipped file is opened in binary mode. (Note the 'b' character in the mode argument.)
+The gunzip command (pronounced “gee-unzip”) decompresses the file and stores the contents in a new file named the same as the compressed file but without the .gz file extension.
+The cat command displays the contents of a file. This file contains the string you originally wrote directly to the compressed file out.log.gz from within the Python Shell.
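You don’t need the gunzip command to get the data back; the gzip module can read the compressed file too. A minimal round-trip sketch, where the gzip_round_trip() helper and the temporary directory are assumptions made for illustration:

```python
import gzip
import os
import tempfile

def gzip_round_trip(text):
    """Compress text into a .gz file, then read it back without a temporary plain file."""
    with tempfile.TemporaryDirectory() as temp_dir:
        path = os.path.join(temp_dir, 'out.log.gz')
        # Binary mode: encode the string to bytes before writing
        with gzip.open(path, mode='wb') as z_file:
            z_file.write(text.encode('utf-8'))
        # Reading decompresses on the fly and returns bytes
        with gzip.open(path, mode='rb') as z_file:
            return z_file.read().decode('utf-8')

original = 'A nine mile walk is no joke, especially in the rain.'
restored = gzip_round_trip(original)
print(restored)
```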
+++ +Did you get this error? +
+>>> with gzip.open('out.log.gz', mode='wb') as z_file:
+...     z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
+...
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+AttributeError: 'GzipFile' object has no attribute '__exit__'
+If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. +
Python 3.0 had a
gzip module, but it did not support using a gzipped-file object as a context manager. Python 3.1 added the ability to use gzipped-file objects in a with statement. +
⁂ + +
Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. + +
Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX-like system, including Mac OS X and Linux. When you call the print() function, the thing you’re printing is sent to the stdout pipe. When your program crashes and prints out a traceback, it goes to the stderr pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. In the graphical Python Shell, the stdout and stderr pipes default to your “Interactive Window”.
+
+
+>>> for i in range(3):
+... print('PapayaWhip') ①
+PapayaWhip
+PapayaWhip
+PapayaWhip
+>>> import sys
+>>> for i in range(3):
+... sys.stdout.write('is the') ②
+is theis theis the
+>>> for i in range(3):
+... sys.stderr.write('new black') ③
+new blacknew blacknew black
+The print() function, in a loop. Nothing surprising here.
+stdout is defined in the sys module, and it is a stream object. Calling its write() function will print out whatever string you give it. In fact, this is what the print function really does; it adds a carriage return to the end of the string you’re printing, and calls sys.stdout.write.
+sys.stdout and sys.stderr send their output to the same place: the Python IDE (if you’re in one), or the terminal (if you’re running Python from the command line). Like standard output, standard error does not add carriage returns for you. If you want carriage returns, you’ll need to write carriage return characters.
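The relationship between print() and write() is easier to inspect when the output goes somewhere you can look at afterwards. In this sketch an io.StringIO buffer stands in for the terminal (that substitution is an assumption made for illustration); the file and end parameters of print() are part of the standard built-in.

```python
import io

buffer = io.StringIO()
print('PapayaWhip', file=buffer)          # print() appends '\n' unless told otherwise
buffer.write('is the')                    # write() adds nothing to what you give it
print('new black', end='', file=buffer)   # end='' suppresses the trailing newline
result = buffer.getvalue()
print(repr(result))
```

Passing file=sys.stderr to print() sends a line to standard error the same way.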
+sys.stdout and sys.stderr are stream objects, but they are write-only. Attempting to call their read() method will always raise an IOError.
+
+
+>>> import sys
+>>> sys.stdout.read()
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+IOError: not readable
+ +
sys.stdout and sys.stderr are stream objects, albeit ones that only support writing. But they’re not constants; they’re variables. That means you can assign them a new value — any other stream object — to redirect their output.
+
+
import sys
+
+class RedirectStdoutTo:
+ def __init__(self, out_new):
+ self.out_new = out_new
+
+ def __enter__(self):
+ self.out_old = sys.stdout
+ sys.stdout = self.out_new
+
+ def __exit__(self, *args):
+ sys.stdout = self.out_old
+
+print('A')
+with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
+ print('B')
+print('C')
+
+Check this out: + +
+you@localhost:~/diveintopython3/examples$ python3 stdout.py
+A
+C
+you@localhost:~/diveintopython3/examples$ cat out.log
+B
+ +
++ +Did you get this error? +
+you@localhost:~/diveintopython3/examples$ python3 stdout.py
+  File "stdout.py", line 15
+    with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
+                                                              ^
+SyntaxError: invalid syntax
+If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. +
Python 3.0 supported the
with statement, but each statement could only use one context manager. Python 3.1 allows you to chain multiple context managers in a single with statement. +
Let’s take the last part first. + +
print('A')
+with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
+ print('B')
+print('C')
+
+That’s a complicated with statement. Let me rewrite it as something more recognizable.
+
+
with open('out.log', mode='w', encoding='utf-8') as a_file:
+ with RedirectStdoutTo(a_file):
+ print('B')
+
+As the rewrite shows, you actually have two with statements, one nested within the scope of the other. The “outer” with statement should be familiar by now: it opens a UTF-8-encoded text file named out.log for writing and assigns the stream object to a variable named a_file. But that’s not the only thing odd here.
+
with RedirectStdoutTo(a_file):
+
+Where’s the as clause? The with statement doesn’t actually require one. Just like you can call a function and ignore its return value, you can have a with statement that doesn’t assign the with context to a variable. In this case, you’re only interested in the side effects of the RedirectStdoutTo context.
+
+
What are those side effects? Take a look inside the RedirectStdoutTo class. This class is a custom context manager. Any class can be a context manager by defining two special methods: __enter__() and __exit__().
+
+
class RedirectStdoutTo:
+ def __init__(self, out_new): ①
+ self.out_new = out_new
+
+ def __enter__(self): ②
+ self.out_old = sys.stdout
+ sys.stdout = self.out_new
+
+ def __exit__(self, *args): ③
+ sys.stdout = self.out_old
+The __init__() method is called immediately after an instance is created. It takes one parameter, the stream object that you want to use as standard output for the life of the context. This method just saves the stream object in an instance variable so other methods can use it later.
+The __enter__() method is a special method; Python calls it when entering a context (i.e. at the beginning of the with statement). This method saves the current value of sys.stdout in self.out_old, then redirects standard output by assigning self.out_new to sys.stdout.
+The __exit__() method is another special method; Python calls it when exiting the context (i.e. at the end of the with statement). This method restores standard output to its original value by assigning the saved self.out_old value to sys.stdout.
+Putting it all together: + +
+print('A') ①
+with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file): ②
+ print('B') ③
+print('C') ④
+This with statement takes a comma-separated list of contexts. The comma-separated list acts like a series of nested with blocks. The first context listed is the “outer” block; the last one listed is the “inner” block. The first context opens a file; the second context redirects sys.stdout to the stream object that was created in the first context.
+Because this print() function is executed within the context created by the with statement, it will not print to the screen; it will write to the file out.log.
+The with code block is over. Python has told each context manager to do whatever it is they do upon exiting a context. The context managers form a last-in-first-out stack. Upon exiting, the second context changed sys.stdout back to its original value, then the first context closed the file named out.log. Since standard output has been restored to its original value, calling the print() function will once again print to the screen.
+Redirecting standard error works exactly the same way, using sys.stderr instead of sys.stdout.
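Later versions of Python ship ready-made context managers for this in the contextlib module: redirect_stdout() (added in Python 3.4) and redirect_stderr() (added in Python 3.5). The sketch below is not the chapter’s stdout.py; it assumes one of those newer versions and uses io.StringIO buffers in place of out.log.

```python
import contextlib
import io
import sys

captured_out = io.StringIO()
captured_err = io.StringIO()

print('A')
# Two context managers in one with statement, just like the example above
with contextlib.redirect_stdout(captured_out), contextlib.redirect_stderr(captured_err):
    print('B')                        # goes to captured_out, not the screen
    print('oops', file=sys.stderr)    # sys.stderr now points at captured_err
print('C')

print(repr(captured_out.getvalue()), repr(captured_err.getvalue()))
```

Like RedirectStdoutTo, both restore the original stream when the with block ends.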
+
+
⁂ + +
io module
+sys.stdout and sys.stderr
+© 2001–10 Mark Pilgrim + + + diff --git a/generators.html b/generators.html index 1f965fe..1b01e47 100755 --- a/generators.html +++ b/generators.html @@ -1,418 +1,418 @@ - - -
Difficulty level: ♦♦♦♢♢ -
--❝ My spelling is Wobbly. It’s good spelling but it Wobbles, and the letters get in the wrong places. ❞
— Winnie-the-Pooh -
-
Having grown up the son of a librarian and an English major, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, “borrows” is the wrong word; “pillages” is more like it. Or perhaps “assimilates” — like the Borg. Yes, I like that. -
We are the Borg. Your linguistic and etymological distinctiveness will be added to our own. Resistance is futile.
-
In this chapter, you’re going to learn about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. But first, let’s talk about how to make plural nouns. (If you haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and it quickly descends into more advanced uses.) -
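If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably familiar with the basic rules: -
If a word ends in S, X, or Z, add ES. -
If a word ends in an H that can be heard, add ES. -
If a word ends in a Y that sounds like I, change the Y to IES. -
If all else fails, just add S. -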
(I know, there are a lot of exceptions. Man becomes men and woman becomes women, but human becomes humans. Mouse becomes mice and louse becomes lice, but house becomes houses. Knife becomes knives and wife becomes wives, but lowlife becomes lowlifes. And don’t even get me started on words that are their own plural, like sheep, deer, and haiku.) -
Other languages, of course, are completely different. -
Let’s design a Python library that automatically pluralizes English nouns. We’ll start with just these four rules, but keep in mind that you’ll inevitably need to add more. -
⁂ - -
So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions! -
import re
-
-def plural(noun):
- if re.search('[sxz]$', noun): ①
- return re.sub('$', 'es', noun) ②
- elif re.search('[^aeioudgkprt]h$', noun):
- return re.sub('$', 'es', noun)
- elif re.search('[^aeiou]y$', noun):
- return re.sub('y$', 'ies', noun)
- else:
- return noun + 's'
-[sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. Combined, this regular expression tests whether noun ends with s, x, or z.
-The re.sub() function performs regular expression-based string substitutions.
-Let’s look at regular expression substitutions in more detail. -
->>> import re
->>> re.search('[abc]', 'Mark') ①
-<_sre.SRE_Match object at 0x001C1FA8>
->>> re.sub('[abc]', 'o', 'Mark') ②
-'Mork'
->>> re.sub('[abc]', 'o', 'rock') ③
-'rook'
->>> re.sub('[abc]', 'o', 'caps') ④
-'oops'
-Does the string Mark contain a, b, or c? Yes, it contains a.
-OK, now find a, b, or c, and replace it with o. Mark becomes Mork.
-The same function turns rock into rook.
-You might think this would turn caps into oaps, but it doesn’t. re.sub replaces all of the matches, not just the first one. So this regular expression turns caps into oops, because both the c and the a get turned into o.
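If you ever do want only the first match replaced, re.sub() takes an optional count argument. A short sketch contrasting the two behaviors (the variable names are just for illustration):

```python
import re

# Default: every match is replaced
all_matches = re.sub('[abc]', 'o', 'caps')

# count=1: stop after the first replacement
first_only = re.sub('[abc]', 'o', 'caps', count=1)

print(all_matches, first_only)
```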
-And now, back to the plural() function…
-
-
def plural(noun):
- if re.search('[sxz]$', noun):
- return re.sub('$', 'es', noun) ①
- elif re.search('[^aeioudgkprt]h$', noun): ②
- return re.sub('$', 'es', noun)
- elif re.search('[^aeiou]y$', noun): ③
- return re.sub('y$', 'ies', noun)
- else:
- return noun + 's'
-Here, you’re replacing the end of the string (matched by $) with the string es. In other words, adding es to the string. You could accomplish the same thing with string concatenation, for example noun + 'es', but I chose to use regular expressions for each rule, for reasons that will become clear later in the chapter.
-The ^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. You’re looking for words that end in H where the H can be heard.
-Same pattern here: match words that end in Y, where the character before the Y is not a, e, i, o, or u. You’re looking for words that end in Y that sounds like I.
-Let’s look at negation regular expressions in more detail. - -
->>> import re
->>> re.search('[^aeiou]y$', 'vacancy') ①
-<_sre.SRE_Match object at 0x001C1FA8>
->>> re.search('[^aeiou]y$', 'boy') ②
->>>
->>> re.search('[^aeiou]y$', 'day')
->>>
->>> re.search('[^aeiou]y$', 'pita') ③
->>>
-vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
-boy does not match, because it ends in oy, and you specifically said that the character before the y could not be o. day does not match, because it ends in ay.
-pita does not match, because it does not end in y.
-
->>> re.sub('y$', 'ies', 'vacancy') ①
-'vacancies'
->>> re.sub('y$', 'ies', 'agency')
-'agencies'
->>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') ②
-'vacancies'
-This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub.
-It is also possible to combine the search and the substitution into a single regular expression, using a remembered group to capture the character before the y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered? put it right here.” In this case, you remember the c before the y; when you do the substitution, you substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.)
-Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn’t directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. If you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn’t get much more direct than that.
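For comparison, here is a sketch of the Y rule written as one combined regular expression. The pluralize_y() function is hypothetical and covers only this single rule; it relies on the fact that re.sub() returns the string unchanged when the pattern doesn’t match.

```python
import re

def pluralize_y(noun):
    """Apply only the Y rule, matching and substituting in one step."""
    # ([^aeiou]) remembers the consonant before the y; \1 puts it back
    return re.sub('([^aeiou])y$', r'\1ies', noun)

results = {noun: pluralize_y(noun) for noun in ('vacancy', 'agency', 'boy')}
print(results)
```

Compact, but you can no longer tell from the return value whether the rule applied, which is one more reason to keep the search and the substitution separate.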
-
-
⁂ - -
Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part. - -
import re
-
-def match_sxz(noun):
- return re.search('[sxz]$', noun)
-
-def apply_sxz(noun):
- return re.sub('$', 'es', noun)
-
-def match_h(noun):
- return re.search('[^aeioudgkprt]h$', noun)
-
-def apply_h(noun):
- return re.sub('$', 'es', noun)
-
-def match_y(noun): ①
- return re.search('[^aeiou]y$', noun)
-
-def apply_y(noun): ②
- return re.sub('y$', 'ies', noun)
-
-def match_default(noun):
- return True
-
-def apply_default(noun):
- return noun + 's'
-
-rules = ((match_sxz, apply_sxz), ③
- (match_h, apply_h),
- (match_y, apply_y),
- (match_default, apply_default)
- )
-
-def plural(noun):
- for matches_rule, apply_rule in rules: ④
- if matches_rule(noun):
- return apply_rule(noun)
-Now, each match rule is its own function which returns the results of calling the re.search() function.
-Each apply rule is also its own function which calls the re.sub() function to apply the appropriate pluralization rule.
-Instead of having one function (plural()) with multiple rules, you have the rules data structure, which is a sequence of pairs of functions.
-Since the rules have been broken out into a separate data structure, the new plural() function can be reduced to a few lines of code. Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules structure. On the first iteration of the for loop, matches_rule will get match_sxz, and apply_rule will get apply_sxz. On the second iteration (assuming you get that far), matches_rule will be assigned match_h, and apply_rule will be assigned apply_h. The function is guaranteed to return something eventually, because the final match rule (match_default) simply returns True, meaning the corresponding apply rule (apply_default) will always be applied.
-The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions: not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling match_sxz(noun), and if it returns a match, calling apply_sxz(noun).
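A quick way to convince yourself that the tuple holds real function objects is to unpack one pair by hand. This sketch reuses match_sxz() and apply_sxz() from the listing above; the rule variable is introduced only for illustration.

```python
import re

def match_sxz(noun):
    return re.search('[sxz]$', noun)

def apply_sxz(noun):
    return re.sub('$', 'es', noun)

# The tuple stores the function objects themselves, not their names
rule = (match_sxz, apply_sxz)
matches_rule, apply_rule = rule      # the same unpacking the for loop performs

same_object = matches_rule is match_sxz
if matches_rule('fax'):
    pluralized = apply_rule('fax')
print(same_object, pluralized)
```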
-
-
If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following:
-
-
-def plural(noun):
- if match_sxz(noun):
- return apply_sxz(noun)
- if match_h(noun):
- return apply_h(noun)
- if match_y(noun):
- return apply_y(noun)
- if match_default(noun):
- return apply_default(noun)
-
-The benefit here is that the plural() function is now simplified. It takes a sequence of rules, defined elsewhere, and iterates through them in a generic fashion.
-
-
The rules could be defined anywhere, in any way. The plural() function doesn’t care.
-
-
Now, was adding this level of abstraction worth it? Well, not yet. Let’s consider what it would take to add a new rule to the function. In the first example, it would require adding an if statement to the plural() function. In this second example, it would require adding two functions, match_foo() and apply_foo(), and then updating the rules sequence to specify where in the order the new match and apply functions should be called relative to the other rules.
-
-
But this is really just a stepping stone to the next section. Let’s move on… - -
⁂ - -
Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the rules sequence and call them through there. Furthermore, each function follows one of two patterns. All the match functions call re.search(), and all the apply functions call re.sub(). Let’s factor out the patterns so that defining new rules can be easier.
-
-
import re
-
-def build_match_and_apply_functions(pattern, search, replace):
- def matches_rule(word): ①
- return re.search(pattern, word)
- def apply_rule(word): ②
- return re.sub(search, replace, word)
- return (matches_rule, apply_rule) ③
-build_match_and_apply_functions() is a function that builds other functions dynamically. It takes pattern, search and replace, then defines a matches_rule() function which calls re.search() with the pattern that was passed to the build_match_and_apply_functions() function, and the word that was passed to the matches_rule() function you’re building. Whoa.
-The apply function is built the same way: it takes one parameter and calls re.sub() with the search and replace parameters that were passed to the build_match_and_apply_functions() function, and the word that was passed to the apply_rule() function you’re building. This technique of using the values of outside parameters within a dynamic function is called closures. You’re essentially defining constants within the apply function you’re building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function.
-Finally, the build_match_and_apply_functions() function returns a tuple of two values: the two functions you just created. The constants you defined within those functions (pattern within the matches_rule() function, and search and replace within the apply_rule() function) stay with those functions, even after you return from build_match_and_apply_functions(). That’s insanely cool.
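One way to watch the closure at work is to build two pairs of functions and check that each pair remembers its own values. The build_match_and_apply_functions() function below is copied from the listing above; the variable names for the returned functions are assumptions made for illustration.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

matches_y, apply_y = build_match_and_apply_functions('[^aeiou]y$', 'y$', 'ies')
matches_s, apply_s = build_match_and_apply_functions('[sxz]$', '$', 'es')

# Each pair still knows the strings it was built with
print(apply_y('vacancy'), apply_s('fax'))

# The remembered values are stored in cells attached to the function object
remembered = sorted(cell.cell_contents for cell in apply_y.__closure__)
print(remembered)
```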
-If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it. - -
patterns = \ ①
- (
- ('[sxz]$', '$', 'es'),
- ('[^aeioudgkprt]h$', '$', 'es'),
- ('(qu|[^aeiou])y$', 'y$', 'ies'),
- ('$', '$', 's') ②
- )
-rules = [build_match_and_apply_functions(pattern, search, replace) ③
- for (pattern, search, replace) in patterns]
-The pluralization rules are now defined as a tuple of tuples of strings (not functions). The first string in each group is the regular expression pattern that you would use in re.search() to see if this rule matches. The second and third strings in each group are the search and replace expressions you would use in re.sub() to actually apply the rule to turn a noun into its plural.
-There’s a slight change here, in the fallback rule. In the previous example, the match_default() function simply returned True, meaning that if none of the more specific rules matched, the code would simply add an s to the end of the given word. This example does something functionally equivalent. The final regular expression asks whether the word has an end ($ matches the end of a string). Of course, every string has an end, even an empty string, so this expression always matches. Thus, it serves the same purpose as the match_default() function that always returned True: it ensures that if no more specific rule matches, the code adds an s to the end of the given word.
-This line takes the sequence of strings in patterns and turns it into a sequence of functions by “mapping” the strings to the build_match_and_apply_functions() function. That is, it takes each triplet of strings and calls the build_match_and_apply_functions() function with those three strings as arguments. The build_match_and_apply_functions() function returns a tuple of two functions. This means that rules ends up being functionally equivalent to the previous example: a list of tuples, where each tuple is a pair of functions. The first function is the match function that calls re.search(), and the second function is the apply function that calls re.sub().
-Rounding out this version of the script is the main entry point, the plural() function.
-
-
def plural(noun):
- for matches_rule, apply_rule in rules: ①
- if matches_rule(noun):
- return apply_rule(noun)
-The plural() function hasn’t changed at all. It’s completely generic; it takes a list of rule functions and calls them in order. It doesn’t care how the rules are defined. In the previous example, they were defined as separate named functions. Now they are built dynamically by mapping the output of the build_match_and_apply_functions() function onto a list of raw strings. It doesn’t matter; the plural() function still works the same way.
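Assembled into one runnable piece, this version of the script can be sketched as follows. The code is stitched together from the three listings above; the sample nouns at the end are assumptions chosen to exercise each rule once.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

patterns = (
    ('[sxz]$',           '$',  'es'),
    ('[^aeioudgkprt]h$', '$',  'es'),
    ('(qu|[^aeiou])y$',  'y$', 'ies'),
    ('$',                '$',  's'),    # fallback: always matches
)
rules = [build_match_and_apply_functions(pattern, search, replace)
         for (pattern, search, replace) in patterns]

def plural(noun):
    # The first rule whose match function succeeds decides the result
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

for noun in ('fax', 'coach', 'cheetah', 'vacancy', 'soliloquy', 'day'):
    print(noun, '->', plural(noun))
```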
-⁂ - -
You’ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them. - -
First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it plural4-rules.txt.
-
-
-
[sxz]$ $ es
-[^aeioudgkprt]h$ $ es
-[^aeiou]y$ y$ ies
-$ $ s
-
-Now let’s see how you can use this rules file. - -
import re
-
-def build_match_and_apply_functions(pattern, search, replace): ①
- def matches_rule(word):
- return re.search(pattern, word)
- def apply_rule(word):
- return re.sub(search, replace, word)
- return (matches_rule, apply_rule)
-
-rules = []
-with open('plural4-rules.txt', encoding='utf-8') as pattern_file: ②
- for line in pattern_file: ③
- pattern, search, replace = line.split(None, 3) ④
- rules.append(build_match_and_apply_functions( ⑤
- pattern, search, replace))
-The build_match_and_apply_functions() function has not changed. You’re still using closures to build two functions dynamically that use variables defined in the outer function.
-The global open() function opens a file and returns a file object. In this case, the file we’re opening contains the pattern strings for pluralizing nouns. The with statement creates what’s called a context: when the with block ends, Python will automatically close the file, even if an exception is raised inside the with block. You’ll learn more about with blocks and file objects in the Files chapter.
-The for line in <fileobject> idiom reads data from the open file, one line at a time, and assigns the text to the line variable. You’ll learn more about reading from files in the Files chapter.
-Each line in the file has three values separated by whitespace. To split them apart, use the split() string method. The first argument to the split() method is None, which means “split on any whitespace (tabs or spaces, it makes no difference).” The second argument is 3, which means “split on whitespace 3 times, then leave the rest of the line alone.” A line like [sxz]$ $ es will be broken up into the list ['[sxz]$', '$', 'es'], which means that pattern will get '[sxz]$', search will get '$', and replace will get 'es'. That’s a lot of power in one little line of code.
-Finally, you pass pattern, search, and replace to the build_match_and_apply_functions() function, which returns a tuple of functions. You append this tuple to the rules list, and rules ends up storing the list of match and apply functions that the plural() function expects.
-The improvement here is that you’ve completely separated the pluralization rules into an external file, so it can be maintained separately from the code that uses it. Code is code, data is data, and life is good. - -
⁂ - -
Wouldn’t it be grand to have a generic plural() function that parses the rules file? Get rules, check for a match, apply appropriate transformation, go to next rule. That’s all the plural() function has to do, and that’s all the plural() function should do.
-
-
def rules(rules_filename):
- with open(rules_filename, encoding='utf-8') as pattern_file:
- for line in pattern_file:
- pattern, search, replace = line.split(None, 3)
- yield build_match_and_apply_functions(pattern, search, replace)
-
-def plural(noun, rules_filename='plural5-rules.txt'):
- for matches_rule, apply_rule in rules(rules_filename):
- if matches_rule(noun):
- return apply_rule(noun)
- raise ValueError('no matching rule for {0}'.format(noun))
-
-How the heck does that work? Let’s look at an interactive example first. - -
->>> def make_counter(x):
-... print('entering make_counter')
-... while True:
-... yield x ①
-... print('incrementing x')
-... x = x + 1
-...
->>> counter = make_counter(2) ②
->>> counter ③
-<generator object at 0x001C9C10>
->>> next(counter) ④
-entering make_counter
-2
->>> next(counter) ⑤
-incrementing x
-3
->>> next(counter) ⑥
-incrementing x
-4
-The presence of the yield keyword in make_counter means that this is not a normal function. It is a special kind of function which generates values one at a time. You can think of it as a resumable function. Calling it will return a generator that can be used to generate successive values of x.
-To create an instance of the make_counter generator, just call it like any other function. Note that this does not actually execute the function code. You can tell this because the first line of the make_counter() function calls print(), but nothing has been printed yet.
-The make_counter() function returns a generator object.
-The next() function takes a generator object and returns its next value. The first time you call next() with the counter generator, it executes the code in make_counter() up to the first yield statement, then returns the value that was yielded. In this case, that will be 2, because you originally created the generator by calling make_counter(2).
-Calling next() again with the same generator object resumes exactly where it left off and continues until it hits the next yield statement. All variables, local state, etc. are saved on yield and restored on next(). The next line of code waiting to be executed calls print(), which prints incrementing x. After that comes the statement x = x + 1. Then it loops through the while loop again, and the first thing it hits is the statement yield x, which saves the state of everything and returns the current value of x (now 3).
-The next time you call next(counter), you do all the same things again, but this time x is 4.
-Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values. But let’s look at more productive uses of generators instead.
-
-
def fib(max):
- a, b = 0, 1 ①
- while a < max:
- yield a ②
- a, b = b, a + b ③
-1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1.
-a + b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a + b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b).
-So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops.
-
-
->>> from fibonacci import fib
->>> for n in fib(1000):      ①
-...     print(n, end=' ')   ②
-0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
->>> list(fib(1000))         ③
-[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
-You can use a generator like fib() in a for loop directly. The for loop will automatically call the next() function to get values from the fib() generator and assign them to the for loop index variable (n).
-Each time through the for loop, n gets a new value from the yield statement in fib(), and all you have to do is print it out. Once fib() runs out of numbers (a is no longer less than max, which in this case is 1000), the for loop exits gracefully.
-You can also pass a generator to the list() function, and it will iterate through the entire generator (just like the for loop in the previous example) and return a list of all the values.
-Let’s go back to plural5.py and see how this version of the plural() function works.
-
-
def rules(rules_filename):
- with open(rules_filename, encoding='utf-8') as pattern_file:
- for line in pattern_file:
- pattern, search, replace = line.split(None, 3) ①
- yield build_match_and_apply_functions(pattern, search, replace) ②
-
-def plural(noun, rules_filename='plural5-rules.txt'):
- for matches_rule, apply_rule in rules(rules_filename): ③
- if matches_rule(noun):
- return apply_rule(noun)
- raise ValueError('no matching rule for {0}'.format(noun))
-line.split(None, 3) to get the three “columns” and assign them to three local variables.
-build_match_and_apply_functions(), which is identical to the previous examples. In other words, rules() is a generator that spits out match and apply functions on demand.
-rules() is a generator, you can use it directly in a for loop. The first time through the for loop, you will call the rules() function, which will open the pattern file, read the first line, dynamically build a match function and an apply function from the patterns on that line, and yield the dynamically built functions. The second time through the for loop, you will pick up exactly where you left off in rules() (which was in the middle of the for line in pattern_file loop). The first thing it will do is read the next line of the file (which is still open), dynamically build another match and apply function based on the patterns on that line in the file, and yield the two functions.
-What have you gained over stage 4? Startup time. In stage 4, when you imported the plural4 module, it read the entire patterns file and built a list of all the possible rules, before you could even think about calling the plural() function. With generators, you can do everything lazily: you read the first rule and create functions and try them, and if that works you don’t ever read the rest of the file or create any other functions.
-
-
What have you lost? Performance! Every time you call the plural() function, the rules() generator starts over from the beginning — which means re-opening the patterns file and reading from the beginning, one line at a time.
-
-
What if you could have the best of both worlds: minimal startup cost (don’t execute any code on import), and maximum performance (don’t build the same functions over and over again). Oh, and you still want to keep the rules in a separate file (because code is code and data is data), just as long as you never have to read the same line twice.
-
-
To do that, you’ll need to build your own iterator. But before you do that, you need to learn about Python classes. - -
⁂ - -
© 2001–10 Mark Pilgrim - - - + + +
Difficulty level: ♦♦♦♢♢ +
++❝ My spelling is Wobbly. It’s good spelling but it Wobbles, and the letters get in the wrong places. ❞
— Winnie-the-Pooh +
+
Having grown up the son of a librarian and an English major, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, “borrows” is the wrong word; “pillages” is more like it. Or perhaps “assimilates” — like the Borg. Yes, I like that. +
We are the Borg. Your linguistic and etymological distinctiveness will be added to our own. Resistance is futile.
+
In this chapter, you’re going to learn about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. But first, let’s talk about how to make plural nouns. (If you haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and it quickly descends into more advanced uses.) +
If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably familiar with the basic rules: +
(I know, there are a lot of exceptions. Man becomes men and woman becomes women, but human becomes humans. Mouse becomes mice and louse becomes lice, but house becomes houses. Knife becomes knives and wife becomes wives, but lowlife becomes lowlifes. And don’t even get me started on words that are their own plural, like sheep, deer, and haiku.) +
Other languages, of course, are completely different. +
Let’s design a Python library that automatically pluralizes English nouns. We’ll start with just these four rules, but keep in mind that you’ll inevitably need to add more. +
⁂ + +
So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions! +
import re
+
+def plural(noun):
+ if re.search('[sxz]$', noun): ①
+ return re.sub('$', 'es', noun) ②
+ elif re.search('[^aeioudgkprt]h$', noun):
+ return re.sub('$', 'es', noun)
+ elif re.search('[^aeiou]y$', noun):
+ return re.sub('y$', 'ies', noun)
+ else:
+ return noun + 's'
+[sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. Combined, this regular expression tests whether noun ends with s, x, or z.
+re.sub() function performs regular expression-based string substitutions.
+Let’s look at regular expression substitutions in more detail. +
+>>> import re
+>>> re.search('[abc]', 'Mark') ①
+<_sre.SRE_Match object at 0x001C1FA8>
+>>> re.sub('[abc]', 'o', 'Mark') ②
+'Mork'
+>>> re.sub('[abc]', 'o', 'rock') ③
+'rook'
+>>> re.sub('[abc]', 'o', 'caps') ④
+'oops'
+Mark contain a, b, or c? Yes, it contains a.
+a, b, or c, and replace it with o. Mark becomes Mork.
+rock into rook.
+caps into oaps, but it doesn’t. re.sub replaces all of the matches, not just the first one. So this regular expression turns caps into oops, because both the c and the a get turned into o.
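If you ever do want the “first match only” behavior that the last example might have led you to expect, re.sub() accepts an optional count argument. Here is a minimal sketch; the variable names are mine, not part of the chapter’s code.

```python
import re

# By default re.sub() replaces every match, so both the c and the a change.
replaced_all = re.sub('[abc]', 'o', 'caps')

# count=1 limits the substitution to the first match only.
replaced_first = re.sub('[abc]', 'o', 'caps', count=1)

print(replaced_all)    # oops
print(replaced_first)  # oaps
```

The plural() function never needs count, because every search pattern it uses is anchored to the end of the string with $.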
+And now, back to the plural() function…
+
+
def plural(noun):
+ if re.search('[sxz]$', noun):
+ return re.sub('$', 'es', noun) ①
+ elif re.search('[^aeioudgkprt]h$', noun): ②
+ return re.sub('$', 'es', noun)
+ elif re.search('[^aeiou]y$', noun): ③
+ return re.sub('y$', 'ies', noun)
+ else:
+ return noun + 's'
+$) with the string es. In other words, adding es to the string. You could accomplish the same thing with string concatenation, for example noun + 'es', but I chose to use regular expressions for each rule, for reasons that will become clear later in the chapter.
+^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. You’re looking for words that end in H where the H can be heard.
+a, e, i, o, or u. You’re looking for words that end in Y that sounds like I.
+Let’s look at negation regular expressions in more detail. + +
+>>> import re
+>>> re.search('[^aeiou]y$', 'vacancy') ①
+<_sre.SRE_Match object at 0x001C1FA8>
+>>> re.search('[^aeiou]y$', 'boy') ②
+>>>
+>>> re.search('[^aeiou]y$', 'day')
+>>>
+>>> re.search('[^aeiou]y$', 'pita') ③
+>>>
+vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
+boy does not match, because it ends in oy, and you specifically said that the character before the y could not be o. day does not match, because it ends in ay.
+pita does not match, because it does not end in y.
+
+>>> re.sub('y$', 'ies', 'vacancy') ①
+'vacancies'
+>>> re.sub('y$', 'ies', 'agency')
+'agencies'
+>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') ②
+'vacancies'
+vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub.
+y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered? put it right here.” In this case, you remember the c before the y; when you do the substitution, you substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.)
+Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn’t directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. If you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn’t get much more direct than that.
+
+
⁂ + +
Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part. + +
import re
+
+def match_sxz(noun):
+ return re.search('[sxz]$', noun)
+
+def apply_sxz(noun):
+ return re.sub('$', 'es', noun)
+
+def match_h(noun):
+ return re.search('[^aeioudgkprt]h$', noun)
+
+def apply_h(noun):
+ return re.sub('$', 'es', noun)
+
+def match_y(noun): ①
+ return re.search('[^aeiou]y$', noun)
+
+def apply_y(noun): ②
+ return re.sub('y$', 'ies', noun)
+
+def match_default(noun):
+ return True
+
+def apply_default(noun):
+ return noun + 's'
+
+rules = ((match_sxz, apply_sxz), ③
+ (match_h, apply_h),
+ (match_y, apply_y),
+ (match_default, apply_default)
+ )
+
+def plural(noun):
+ for matches_rule, apply_rule in rules: ④
+ if matches_rule(noun):
+ return apply_rule(noun)
+re.search() function.
+re.sub() function to apply the appropriate pluralization rule.
+plural()) with multiple rules, you have the rules data structure, which is a sequence of pairs of functions.
+plural() function can be reduced to a few lines of code. Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules structure. On the first iteration of the for loop, matches_rule will get match_sxz, and apply_rule will get apply_sxz. On the second iteration (assuming you get that far), matches_rule will be assigned match_h, and apply_rule will be assigned apply_h. The function is guaranteed to return something eventually, because the final match rule (match_default) simply returns True, meaning the corresponding apply rule (apply_default) will always be applied.
+The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions: not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling match_sxz(noun), and if it returns a match, calling apply_sxz(noun).
+
+
If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following:
+
+
+def plural(noun):
+ if match_sxz(noun):
+ return apply_sxz(noun)
+ if match_h(noun):
+ return apply_h(noun)
+ if match_y(noun):
+ return apply_y(noun)
+ if match_default(noun):
+ return apply_default(noun)
+
+The benefit here is that the plural() function is now simplified. It takes a sequence of rules, defined elsewhere, and iterates through them in a generic fashion.
+
+
The rules could be defined anywhere, in any way. The plural() function doesn’t care.
+
+
Now, was adding this level of abstraction worth it? Well, not yet. Let’s consider what it would take to add a new rule to the function. In the first example, it would require adding an if statement to the plural() function. In this second example, it would require adding two functions, match_foo() and apply_foo(), and then updating the rules sequence to specify where in the order the new match and apply functions should be called relative to the other rules.
+
+
But this is really just a stepping stone to the next section. Let’s move on… + +
⁂ + +
Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the rules sequence and call them through there. Furthermore, each function follows one of two patterns. All the match functions call re.search(), and all the apply functions call re.sub(). Let’s factor out the patterns so that defining new rules can be easier.
+
+
import re
+
+def build_match_and_apply_functions(pattern, search, replace):
+ def matches_rule(word): ①
+ return re.search(pattern, word)
+ def apply_rule(word): ②
+ return re.sub(search, replace, word)
+ return (matches_rule, apply_rule) ③
+build_match_and_apply_functions() is a function that builds other functions dynamically. It takes pattern, search and replace, then defines a matches_rule() function which calls re.search() with the pattern that was passed to the build_match_and_apply_functions() function, and the word that was passed to the matches_rule() function you’re building. Whoa.
+re.sub() with the search and replace parameters that were passed to the build_match_and_apply_functions() function, and the word that was passed to the apply_rule() function you’re building. This technique of using the values of outside parameters within a dynamic function is called closures. You’re essentially defining constants within the apply function you’re building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function.
+build_match_and_apply_functions() function returns a tuple of two values: the two functions you just created. The constants you defined within those functions (pattern within the matches_rule() function, and search and replace within the apply_rule() function) stay with those functions, even after you return from build_match_and_apply_functions(). That’s insanely cool.
+If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it. + +
patterns = \ ①
+ (
+ ('[sxz]$', '$', 'es'),
+ ('[^aeioudgkprt]h$', '$', 'es'),
+ ('(qu|[^aeiou])y$', 'y$', 'ies'),
+ ('$', '$', 's') ②
+ )
+rules = [build_match_and_apply_functions(pattern, search, replace) ③
+ for (pattern, search, replace) in patterns]
+re.search() to see if this rule matches. The second and third strings in each group are the search and replace expressions you would use in re.sub() to actually apply the rule to turn a noun into its plural.
+match_default() function simply returned True, meaning that if none of the more specific rules matched, the code would simply add an s to the end of the given word. This example does something functionally equivalent. The final regular expression asks whether the word has an end ($ matches the end of a string). Of course, every string has an end, even an empty string, so this expression always matches. Thus, it serves the same purpose as the match_default() function that always returned True: it ensures that if no more specific rule matches, the code adds an s to the end of the given word.
+build_match_and_apply_functions() function. That is, it takes each triplet of strings and calls the build_match_and_apply_functions() function with those three strings as arguments. The build_match_and_apply_functions() function returns a tuple of two functions. This means that rules ends up being functionally equivalent to the previous example: a list of tuples, where each tuple is a pair of functions. The first function is the match function that calls re.search(), and the second function is the apply function that calls re.sub().
+Rounding out this version of the script is the main entry point, the plural() function.
+
+
def plural(noun):
+ for matches_rule, apply_rule in rules: ①
+ if matches_rule(noun):
+ return apply_rule(noun)
+plural() function hasn’t changed at all. It’s completely generic; it takes a list of rule functions and calls them in order. It doesn’t care how the rules are defined. In the previous example, they were defined as separate named functions. Now they are built dynamically by mapping the output of the build_match_and_apply_functions() function onto a list of raw strings. It doesn’t matter; the plural() function still works the same way.
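To convince yourself that the pieces fit together, you can stitch this stage into one runnable script. The sketch below copies the three fragments shown above unchanged; only the sample_nouns list and the plurals dictionary at the bottom are my additions, chosen so that each rule fires at least once.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

patterns = (
    ('[sxz]$', '$', 'es'),
    ('[^aeioudgkprt]h$', '$', 'es'),
    ('(qu|[^aeiou])y$', 'y$', 'ies'),
    ('$', '$', 's'),
)
rules = [build_match_and_apply_functions(pattern, search, replace)
         for (pattern, search, replace) in patterns]

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

# Hypothetical sample words, one or two per rule.
sample_nouns = ['fax', 'coach', 'vacancy', 'soliloquy', 'day', 'cat']
plurals = {noun: plural(noun) for noun in sample_nouns}
for noun, result in plurals.items():
    print(noun, '->', result)
```

Each closure remembers the pattern, search, and replace strings it was built with, which is why the loop inside plural() needs nothing but the noun.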
+⁂ + +
You’ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them. + +
First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it plural4-rules.txt.
+
+
[download plural4-rules.txt]
+
[sxz]$ $ es
+[^aeioudgkprt]h$ $ es
+[^aeiou]y$ y$ ies
+$ $ s
+
+Now let’s see how you can use this rules file. + +
import re
+
+def build_match_and_apply_functions(pattern, search, replace): ①
+ def matches_rule(word):
+ return re.search(pattern, word)
+ def apply_rule(word):
+ return re.sub(search, replace, word)
+ return (matches_rule, apply_rule)
+
+rules = []
+with open('plural4-rules.txt', encoding='utf-8') as pattern_file: ②
+ for line in pattern_file: ③
+ pattern, search, replace = line.split(None, 3) ④
+ rules.append(build_match_and_apply_functions( ⑤
+ pattern, search, replace))
+The build_match_and_apply_functions() function has not changed. You’re still using closures to build two functions dynamically that use variables defined in the outer function.
+The open() function opens a file and returns a file object. In this case, the file we’re opening contains the pattern strings for pluralizing nouns. The with statement creates what’s called a context: when the with block ends, Python will automatically close the file, even if an exception is raised inside the with block. You’ll learn more about with blocks and file objects in the Files chapter.
+The for line in <fileobject> idiom reads data from the open file, one line at a time, and assigns the text to the line variable. You’ll learn more about reading from files in the Files chapter.
+To separate the three columns on each line, use the split() string method. The first argument to the split() method is None, which means “split on any whitespace (tabs or spaces, it makes no difference).” The second argument is 3, which means “split on whitespace at most 3 times, then leave the rest of the line alone.” A line like [sxz]$ $ es will be broken up into the list ['[sxz]$', '$', 'es'], which means that pattern will get '[sxz]$', search will get '$', and replace will get 'es'. That’s a lot of power in one little line of code.
+Finally, you pass pattern, search, and replace to the build_match_and_apply_functions() function, which returns a tuple of functions. You append this tuple to the rules list, and rules ends up storing the list of match and apply functions that the plural() function expects.
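The behavior of line.split(None, 3) is easy to check in isolation. This sketch uses io.StringIO as a stand-in for the rules file (an assumption made only so the example runs without a file on disk); the rules_text and parsed names are mine.

```python
import io

# Two sample rule lines: one separated by spaces, one by tabs.
rules_text = '[sxz]$ $ es\n[^aeiou]y$\ty$\ties\n'

parsed = []
for line in io.StringIO(rules_text):
    # None = split on any run of whitespace; 3 = at most three splits.
    parsed.append(line.split(None, 3))

print(parsed)
# [['[sxz]$', '$', 'es'], ['[^aeiou]y$', 'y$', 'ies']]

# With a smaller limit, the unsplit remainder keeps its trailing newline.
remainder_kept = '[sxz]$ $ es\n'.split(None, 2)
print(remainder_kept)
# ['[sxz]$', '$', 'es\n']
```

Note that a limit of 3 allows up to four items, so a line with a fourth column would make the three-way assignment raise a ValueError. For a rules file with exactly three columns, the trailing newline is simply discarded.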
+The improvement here is that you’ve completely separated the pluralization rules into an external file, so it can be maintained separately from the code that uses it. Code is code, data is data, and life is good. + +
⁂ + +
Wouldn’t it be grand to have a generic plural() function that parses the rules file? Get rules, check for a match, apply appropriate transformation, go to next rule. That’s all the plural() function has to do, and that’s all the plural() function should do.
+
+
def rules(rules_filename):
+ with open(rules_filename, encoding='utf-8') as pattern_file:
+ for line in pattern_file:
+ pattern, search, replace = line.split(None, 3)
+ yield build_match_and_apply_functions(pattern, search, replace)
+
+def plural(noun, rules_filename='plural5-rules.txt'):
+ for matches_rule, apply_rule in rules(rules_filename):
+ if matches_rule(noun):
+ return apply_rule(noun)
+ raise ValueError('no matching rule for {0}'.format(noun))
+
+How the heck does that work? Let’s look at an interactive example first. + +
+>>> def make_counter(x):
+... print('entering make_counter')
+... while True:
+... yield x ①
+... print('incrementing x')
+... x = x + 1
+...
+>>> counter = make_counter(2) ②
+>>> counter ③
+<generator object at 0x001C9C10>
+>>> next(counter) ④
+entering make_counter
+2
+>>> next(counter) ⑤
+incrementing x
+3
+>>> next(counter) ⑥
+incrementing x
+4
+The presence of the yield keyword in make_counter means that this is not a normal function. It is a special kind of function which generates values one at a time. You can think of it as a resumable function. Calling it will return a generator that can be used to generate successive values of x.
+To create an instance of the make_counter generator, just call it like any other function. Note that this does not actually execute the function code. You can tell this because the first line of the make_counter() function calls print(), but nothing has been printed yet.
+The make_counter() function returns a generator object.
+The next() function takes a generator object and returns its next value. The first time you call next() with the counter generator, it executes the code in make_counter() up to the first yield statement, then returns the value that was yielded. In this case, that will be 2, because you originally created the generator by calling make_counter(2).
+Calling next() again with the same generator object resumes exactly where it left off and continues until it hits the next yield statement. All variables, local state, etc. are saved on yield and restored on next(). The next line of code waiting to be executed calls print(), which prints incrementing x. After that comes the statement x = x + 1. Then it loops through the while loop again, and the first thing it hits is the statement yield x, which saves the state of everything and returns the current value of x (now 3).
+The next time you call next(counter), you do all the same things again, but this time x is 4.
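The “saved on yield, restored on next()” behavior is per generator object, which you can see by running two counters side by side. This is a sketch under my own assumptions: make_counter is repeated here without its print() calls, and the names first, second, and values are mine. itertools.islice() is one standard-library way to take a bounded number of values from a generator that would otherwise never stop.

```python
import itertools

def make_counter(x):
    # Same generator as in the interactive session, minus the print() calls.
    while True:
        yield x
        x = x + 1

first = make_counter(2)
second = make_counter(10)

# Each generator object keeps its own paused state.
values = [next(first), next(first), next(second), next(first)]
print(values)      # [2, 3, 10, 4]

# Take just five values from a fresh, otherwise endless counter.
first_five = list(itertools.islice(make_counter(2), 5))
print(first_five)  # [2, 3, 4, 5, 6]
```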
+Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values. But let’s look at more productive uses of generators instead.
+
+
def fib(max):
+ a, b = 0, 1 ①
+ while a < max:
+ yield a ②
+ a, b = b, a + b ③
+1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1.
+a + b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a + b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b).
+So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops.
+
+
+>>> from fibonacci import fib
+>>> for n in fib(1000):      ①
+...     print(n, end=' ')   ②
+0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
+>>> list(fib(1000))         ③
+[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
+You can use a generator like fib() in a for loop directly. The for loop will automatically call the next() function to get values from the fib() generator and assign them to the for loop index variable (n).
+Each time through the for loop, n gets a new value from the yield statement in fib(), and all you have to do is print it out. Once fib() runs out of numbers (a is no longer less than max, which in this case is 1000), the for loop exits gracefully.
+You can also pass a generator to the list() function, and it will iterate through the entire generator (just like the for loop in the previous example) and return a list of all the values.
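What does “exits gracefully” mean in practice? When a generator function returns, the next call to next() raises StopIteration, and the for loop (or list()) catches that for you. The sketch below does the same work by hand; fib() is copied from above, while numbers and seen are names I made up for the illustration.

```python
def fib(max):
    a, b = 0, 1
    while a < max:
        yield a
        a, b = b, a + b

numbers = fib(3)
seen = []
while True:
    try:
        seen.append(next(numbers))
    except StopIteration:
        # Raised once the generator function returns;
        # a for loop handles this exception silently.
        break

print(seen)  # [0, 1, 1, 2]
```

You would rarely write this loop yourself, but it explains why a plain next() call on an exhausted generator fails instead of returning None.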
+Let’s go back to plural5.py and see how this version of the plural() function works.
+
+
def rules(rules_filename):
+ with open(rules_filename, encoding='utf-8') as pattern_file:
+ for line in pattern_file:
+ pattern, search, replace = line.split(None, 3) ①
+ yield build_match_and_apply_functions(pattern, search, replace) ②
+
+def plural(noun, rules_filename='plural5-rules.txt'):
+ for matches_rule, apply_rule in rules(rules_filename): ③
+ if matches_rule(noun):
+ return apply_rule(noun)
+ raise ValueError('no matching rule for {0}'.format(noun))
+line.split(None, 3) to get the three “columns” and assign them to three local variables.
+build_match_and_apply_functions(), which is identical to the previous examples. In other words, rules() is a generator that spits out match and apply functions on demand.
+rules() is a generator, you can use it directly in a for loop. The first time through the for loop, you will call the rules() function, which will open the pattern file, read the first line, dynamically build a match function and an apply function from the patterns on that line, and yield the dynamically built functions. The second time through the for loop, you will pick up exactly where you left off in rules() (which was in the middle of the for line in pattern_file loop). The first thing it will do is read the next line of the file (which is still open), dynamically build another match and apply function based on the patterns on that line in the file, and yield the two functions.
+What have you gained over stage 4? Startup time. In stage 4, when you imported the plural4 module, it read the entire patterns file and built a list of all the possible rules, before you could even think about calling the plural() function. With generators, you can do everything lazily: you read the first rule and create functions and try them, and if that works you don’t ever read the rest of the file or create any other functions.
+
+
What have you lost? Performance! Every time you call the plural() function, the rules() generator starts over from the beginning — which means re-opening the patterns file and reading from the beginning, one line at a time.
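You can make that cost visible by counting how many lines the generator reads. The sketch below is an illustration under stated assumptions, not part of plural5.py: the lines_read counter, the temporary rules file, and its path are all hypothetical additions, and plural() takes the filename as a required argument so nothing depends on your working directory.

```python
import os
import re
import tempfile

lines_read = 0  # hypothetical counter, added only to expose the re-reading

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

def rules(rules_filename):
    global lines_read
    with open(rules_filename, encoding='utf-8') as pattern_file:
        for line in pattern_file:
            lines_read += 1
            pattern, search, replace = line.split(None, 3)
            yield build_match_and_apply_functions(pattern, search, replace)

def plural(noun, rules_filename):
    for matches_rule, apply_rule in rules(rules_filename):
        if matches_rule(noun):
            return apply_rule(noun)
    raise ValueError('no matching rule for {0}'.format(noun))

rules_text = '[sxz]$ $ es\n[^aeioudgkprt]h$ $ es\n[^aeiou]y$ y$ ies\n$ $ s\n'
with tempfile.TemporaryDirectory() as tmp_dir:
    rules_path = os.path.join(tmp_dir, 'rules.txt')
    with open(rules_path, 'w', encoding='utf-8') as rules_file:
        rules_file.write(rules_text)
    results = [plural('fax', rules_path),
               plural('cat', rules_path),
               plural('cat', rules_path)]

print(results)     # ['faxes', 'cats', 'cats']
print(lines_read)  # 9
```

The first call stops after one line because fax matches the first rule. Each call with cat reads all four lines again, so three calls cost nine reads from a four-line file.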
+
+
What if you could have the best of both worlds: minimal startup cost (don’t execute any code on import), and maximum performance (don’t build the same functions over and over again). Oh, and you still want to keep the rules in a separate file (because code is code and data is data), just as long as you never have to read the same line twice.
+
+
To do that, you’ll need to build your own iterator. But before you do that, you need to learn about Python classes. + +
⁂ + +
© 2001–10 Mark Pilgrim + + + diff --git a/http-web-services.html b/http-web-services.html index 6518ab4..435d631 100755 --- a/http-web-services.html +++ b/http-web-services.html @@ -1,1003 +1,1003 @@ - - -
Difficulty level: ♦♦♦♦♢ -
--❝ A ruffled mind makes a restless pillow. ❞
— Charlotte Brontë -
-
Philosophically, I can describe HTTP web services in 12 words: exchanging data with remote servers using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET. If you want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also allow creating, modifying, and deleting data, using HTTP PUT and HTTP DELETE. That’s it. No registries, no envelopes, no wrappers, no tunneling. The “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data.
-
-
The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML or JSON — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each resource in an HTTP web service has a unique address (in the form of a URL), you can load it in your web browser and immediately see the raw data. - -
Examples of HTTP web services: -
Python 3 comes with two different libraries for interacting with HTTP web services: - -
http.client is a low-level library that implements RFC 2616, the HTTP protocol.
-urllib.request is an abstraction layer built on top of http.client. It provides a standard API for accessing both HTTP and FTP servers, automatically follows HTTP redirects, and handles some common forms of HTTP authentication.
-So which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction than urllib.request.
-
-
To understand why httplib2 is the right choice, you first need to understand HTTP.
-
-
⁂ - -
There are five important features which all HTTP clients should support. - -
The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, latency (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack: there’s never a dull moment on the public internet, and there may be nothing you can do about it. - - - -
HTTP is designed with caching in mind. There is an entire class of devices (called “caching proxies”) whose only job is to sit between you and the rest of the world and minimize network access. Your company or ISP almost certainly maintains caching proxies, even if you’re unaware of them. They work because caching is built into the HTTP protocol. - -
Here’s a concrete example of how caching works. You visit diveintomark.org in your browser. That page includes a background image, wearehugh.com/m.jpg. When your browser downloads that image, the server includes the following HTTP headers:
-
-
HTTP/1.1 200 OK
-Date: Sun, 31 May 2009 17:14:04 GMT
-Server: Apache
-Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
-ETag: "3075-ddc8d800"
-Accept-Ranges: bytes
-Content-Length: 12405
-Cache-Control: max-age=31536000, public
-Expires: Mon, 31 May 2010 17:14:04 GMT
-Connection: close
-Content-Type: image/jpeg
-
-The Cache-Control and Expires headers tell your browser (and any caching proxies between you and the server) that this image can be cached for up to a year. A year! And if, in the next year, you visit another page which also includes a link to this image, your browser will load the image from its cache without generating any network activity whatsoever.
-
-
But wait, it gets better. Let’s say your browser purges the image from your local cache for some reason. Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the HTTP headers said that this data could be cached by public caching proxies. (Technically, the important thing is what the headers don’t say; the Cache-Control header doesn’t have the private keyword, so this data is cacheable by default.) Caching proxies are designed to have tons of storage space, probably far more than your local browser has allocated.
-
-
If your company or ISP maintains a caching proxy, the proxy may still have the image cached. When you visit diveintomark.org again, your browser will look in its local cache for the image, but it won’t find it, so it will make a network request to try to download it from the remote server. But if the caching proxy still has a copy of the image, it will intercept that request and serve the image from its cache. That means that your request will never reach the remote server; in fact, it will never leave your company’s network. That makes for a faster download (fewer network hops) and saves your company money (less data being downloaded from the outside world).
-
-
HTTP caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be. - -
Python’s HTTP libraries do not support caching, but httplib2 does.
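To make the freshness arithmetic concrete, here is a minimal sketch (not how any particular browser or proxy does it) that works out when the image above stops being fresh. The function name `fresh_until` is made up for illustration, and the sketch assumes `max-age` takes precedence over `Expires` when both are present.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

def fresh_until(headers):
    """Return the moment a cached response stops being fresh, or None."""
    date = parsedate_to_datetime(headers["Date"])
    # Prefer Cache-Control: max-age, counted in seconds from the Date header.
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return date + timedelta(seconds=int(value))
    # Fall back to the absolute Expires timestamp, if there is one.
    if "Expires" in headers:
        return parsedate_to_datetime(headers["Expires"])
    return None

# Headers copied from the image response shown earlier.
headers = {
    "Date": "Sun, 31 May 2009 17:14:04 GMT",
    "Cache-Control": "max-age=31536000, public",
    "Expires": "Mon, 31 May 2010 17:14:04 GMT",
}
expiry = fresh_until(headers)
print(expiry.isoformat())
```

Here the two headers agree: 31536000 seconds is 365 days, which lands on the same instant as the `Expires` value.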
-
-
Some data never changes, while other data changes all the time. In between, there is a vast field of data that might have changed, but hasn’t. CNN.com’s feed is updated every few minutes, but my weblog’s feed may not change for days or weeks at a time. In the latter case, I don’t want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they’re respecting my cache headers which said “don’t bother checking this feed for weeks”). On the other hand, I don’t want clients downloading my entire feed once an hour if it hasn’t changed! - - - -
HTTP has a solution to this, too. When you request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was changed. That background image referenced from diveintomark.org included a Last-Modified header.
-
-
HTTP/1.1 200 OK
-Date: Sun, 31 May 2009 17:14:04 GMT
-Server: Apache
-Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
-ETag: "3075-ddc8d800"
-Accept-Ranges: bytes
-Content-Length: 12405
-Cache-Control: max-age=31536000, public
-Expires: Mon, 31 May 2010 17:14:04 GMT
-Connection: close
-Content-Type: image/jpeg
-
-
-When you request the same data a second (or third or fourth) time, you can send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data has changed since then, then the server ignores the If-Modified-Since header and just gives you the new data with a 200 status code. But if the data hasn’t changed since then, the server sends back a special HTTP 304 status code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the command line, using curl:
-
-
-you@localhost:~$ curl -I -H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT" http://wearehugh.com/m.jpg
-HTTP/1.1 304 Not Modified
-Date: Sun, 31 May 2009 18:04:39 GMT
-Server: Apache
-Connection: close
-ETag: "3075-ddc8d800"
-Expires: Mon, 31 May 2010 18:04:39 GMT
-Cache-Control: max-age=31536000, public
-
-
Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is the status code. Even after your cached copy has expired, last-modified checking ensures that you won’t download the same data twice if it hasn’t changed. (As an extra bonus, this 304 response also includes caching headers. Proxies will keep a copy of data even after it officially “expires,” in the hopes that the data hasn’t really changed and the next request responds with a 304 status code and updated cache information.)
-
-
Python’s HTTP libraries do not support last-modified date checking, but httplib2 does.
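Seen from the server’s side, the decision described above boils down to one date comparison. The following is a simplified sketch under that assumption; `status_for` is a hypothetical helper, not part of any real server.

```python
from email.utils import parsedate_to_datetime

def status_for(if_modified_since, last_modified):
    """Choose between 200 and 304 by comparing two HTTP dates."""
    if if_modified_since is None:
        return 200          # unconditional request: send everything
    asked = parsedate_to_datetime(if_modified_since)
    changed = parsedate_to_datetime(last_modified)
    # Not modified since the date the client already knows about.
    return 304 if changed <= asked else 200

last_modified = "Fri, 22 Aug 2008 04:28:16 GMT"
print(status_for(None, last_modified))
print(status_for(last_modified, last_modified))
print(status_for("Thu, 21 Aug 2008 00:00:00 GMT", last_modified))
```

The second call mirrors the curl example: the client sends back the exact date it received, and the answer is 304.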
-
-
ETags are an alternate way to accomplish the same thing as last-modified checking. With ETags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from diveintomark.org had an ETag header.
-
-
HTTP/1.1 200 OK
-Date: Sun, 31 May 2009 17:14:04 GMT
-Server: Apache
-Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
-ETag: "3075-ddc8d800"
-Accept-Ranges: bytes
-Content-Length: 12405
-Cache-Control: max-age=31536000, public
-Expires: Mon, 31 May 2010 17:14:04 GMT
-Connection: close
-Content-Type: image/jpeg
-
-
-
-
-The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn’t changed, the server will send you back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the last time.
-
-
Again with the curl: - -
-you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg ①
-HTTP/1.1 304 Not Modified
-Date: Sun, 31 May 2009 18:04:39 GMT
-Server: Apache
-Connection: close
-ETag: "3075-ddc8d800"
-Expires: Mon, 31 May 2010 18:04:39 GMT
-Cache-Control: max-age=31536000, public
The quotation marks are part of the ETag value, so send them back to the server in the If-None-Match header.
-Python’s HTTP libraries do not support ETags, but httplib2 does.
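If you wanted to send validators by hand with the standard library, a sketch might look like the following. It only builds the request object and never touches the network; `conditional_request` is a made-up helper name. As far as I know, `urlopen()` reports a 304 by raising `urllib.error.HTTPError`, so real code would also need to catch that.

```python
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that carries whatever validators you saved."""
    request = urllib.request.Request(url)
    if etag:
        # The quotation marks are part of the ETag value.
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request

request = conditional_request(
    "http://wearehugh.com/m.jpg",
    etag='"3075-ddc8d800"',
    last_modified="Fri, 22 Aug 2008 04:28:16 GMT",
)
print(request.header_items())
```

Note that `urllib.request` normalizes header names with `str.capitalize()`, so you look them up as `If-none-match` and `If-modified-since`.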
-
-
When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s XML, maybe it’s JSON, maybe it’s just plain text. Regardless of the format, text compresses well. The example feed in the XML chapter is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That’s just 30% of the original size! - -
HTTP supports several compression algorithms. The two most common types are gzip and deflate. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include an Accept-encoding header in your request that lists which compression algorithms you support. If the server supports any of the same algorithms, it will send you back compressed data (with a Content-encoding header that tells you which algorithm it used). Then it’s up to you to decompress the data.
-
-
-- -☞Important tip for server-side developers: make sure that the compressed version of a resource has a different ETag than the uncompressed version. Otherwise, caching proxies will get confused and may serve the compressed version to clients that can’t handle it. Read the discussion of Apache bug 39727 for more details on this subtle issue. -
Python’s HTTP libraries do not support compression, but httplib2 does.
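Here is a rough sketch of the client’s half of that bargain, using the standard `gzip` module on made-up sample data (not the real feed). `decode_body` is a hypothetical helper that only handles gzip; a fuller client would handle deflate as well.

```python
import gzip

def decode_body(body, content_encoding=None):
    """Undo the Content-Encoding the server says it applied."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body             # identity, or no header at all

# Repetitive text, like most XML, compresses well.
original = b"<entry><title>example</title></entry>\n" * 50
compressed = gzip.compress(original)
print(len(original), len(compressed))

restored = decode_body(compressed, "gzip")
```

The point is that decompression is the client’s job: you only call it when the response’s `Content-encoding` header tells you to.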
-
-
Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml.
-
-
-
-
Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything’s normal, here’s the page you asked for”. Status code 404 means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.
-
-
HTTP has several different ways of signifying that a resource has moved. The two most common techniques are status codes 302 and 301. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you’re supposed to use the new address from then on.
-
-
The urllib.request module automatically “follows” redirects when it receives the appropriate status code from the HTTP server, but it doesn’t tell you that it did so. You’ll end up getting data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address, and each time the urllib.request module will “helpfully” follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for you.
-
-
httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
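The bookkeeping involved can be sketched in a few lines. This is an illustration of the idea only, not httplib2’s actual implementation; the class name `RedirectMemory` and its methods are invented for the example.

```python
class RedirectMemory:
    """Remember permanent (301) redirects; forget temporary (302) ones."""

    def __init__(self):
        self._permanent = {}    # old URL -> new URL

    def record(self, status, old_url, new_url):
        if status == 301:
            self._permanent[old_url] = new_url
        # A 302 is deliberately not stored: retry the old address next time.

    def resolve(self, url):
        """Rewrite a URL before requesting it, following stored 301s."""
        seen = set()
        while url in self._permanent and url not in seen:
            seen.add(url)       # guard against redirect loops
            url = self._permanent[url]
        return url

memory = RedirectMemory()
old = "http://example.com/index.xml"
memory.record(302, old, "http://example.com/tmp/index.xml")
after_temporary = memory.resolve(old)
memory.record(301, old, "http://example.com/xml/atom.xml")
after_permanent = memory.resolve(old)
print(after_temporary, after_permanent)
```

After the 302 you still ask for the old address; after the 301 every later request goes straight to the new one, saving a round trip.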
-
-
⁂ - -
Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better. -
->>> import urllib.request
->>> a_url = 'http://diveintopython3.org/examples/feed.xml'
->>> data = urllib.request.urlopen(a_url).read() ①
->>> type(data) ②
-<class 'bytes'>
->>> print(data)
-<?xml version='1.0' encoding='utf-8'?>
-<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
-  <title>dive into mark</title>
-  <subtitle>currently between addictions</subtitle>
-  <id>tag:diveintomark.org,2001-07-29:/</id>
-  <updated>2009-03-27T21:56:07Z</updated>
-  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
-  …
-
-
The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.
-The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it to a string.
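As a small sketch of that last step, here is one way to turn bytes into a string when the server does declare a `charset` in its `Content-Type` header. The helper name `decode_body` and the UTF-8 fallback are assumptions made for the example; choosing a sensible default is exactly the hard part.

```python
from email.message import Message

def decode_body(data, content_type, default="utf-8"):
    """Decode bytes using the charset parameter of a Content-Type value."""
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    charset = header.get_content_charset() or default
    return data.decode(charset)

data = b"<?xml version='1.0' encoding='utf-8'?>"
text = decode_body(data, "application/xml; charset=utf-8")
print(type(text))
```

With a real response from `urlopen()`, the `headers` attribute already offers the same `get_content_charset()` method, so you would not need to build a `Message` yourself.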
-So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude. - -
⁂ - -
To see why this is inefficient and rude, let’s turn on the debugging features of Python’s HTTP library and see what’s being sent “on the wire” (i.e. over the network). - -
->>> from http.client import HTTPConnection
->>> HTTPConnection.debuglevel = 1 ①
->>> from urllib.request import urlopen
->>> response = urlopen('http://diveintopython3.org/examples/feed.xml') ②
-send: b'GET /examples/feed.xml HTTP/1.1 ③
-Host: diveintopython3.org ④
-Accept-Encoding: identity ⑤
-User-Agent: Python-urllib/3.1' ⑥
-Connection: close
-reply: 'HTTP/1.1 200 OK'
-…further debugging information omitted…
-
urllib.request relies on another standard Python library, http.client. Normally you don’t need to touch http.client directly. (The urllib.request module imports it automatically.) But we import it here so we can toggle the debugging flag on the HTTPConnection class that urllib.request uses to connect to the HTTP server.
-The urllib.request module sends five lines to the server.
-urllib.request does not support compression by default.
-By default, the user agent is Python-urllib plus a version number. Both urllib.request and httplib2 support changing the user agent, simply by adding a User-Agent header to the request (which will override the default value).
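For instance, overriding the defaults with `urllib.request` could look like this sketch. It only constructs the request; nothing is sent. The user agent string `example-feed-reader/0.1` is invented for illustration.

```python
import urllib.request

# Headers passed here replace the defaults urllib.request would send.
request = urllib.request.Request(
    "http://diveintopython3.org/examples/feed.xml",
    headers={
        "User-Agent": "example-feed-reader/0.1",   # hypothetical name
        "Accept-Encoding": "gzip",
    },
)
print(request.header_items())
```

Be aware that asking for gzip this way is only half the job: `urllib.request` will hand you the compressed bytes as-is, and decompressing them is up to you.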
-Now let’s look at what the server sent back in its response. - -
-# continued from previous example
->>> print(response.headers.as_string()) ①
-Date: Sun, 31 May 2009 19:23:06 GMT ②
-Server: Apache
-Last-Modified: Sun, 31 May 2009 06:39:55 GMT ③
-ETag: "bfe-93d9c4c0" ④
-Accept-Ranges: bytes
-Content-Length: 3070 ⑤
-Cache-Control: max-age=86400 ⑥
-Expires: Mon, 01 Jun 2009 19:23:06 GMT
-Vary: Accept-Encoding
-Connection: close
-Content-Type: application/xml
->>> data = response.read() ⑦
->>> len(data)
-3070
-
The response returned from the urllib.request.urlopen() function contains all the HTTP headers the server sent back. It also contains methods to download the actual data; we’ll get to that in a minute.
-This response includes a Last-Modified header.
-This response includes an ETag header.
-There is no Content-encoding header. Your request stated that you only accept uncompressed data (Accept-encoding: identity), and sure enough, this response contains uncompressed data.
-You can download the actual data by calling response.read(). As you can tell from the len() function, this downloads all 3070 bytes at once.
-As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports gzip compression, but HTTP compression is opt-in. We didn’t ask for it, so we didn’t get it. That means we’re downloading 3070 bytes when we could have just downloaded 941. Bad dog, no biscuit. - -
But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time. - -
-# continued from the previous example
->>> response2 = urlopen('http://diveintopython3.org/examples/feed.xml')
-send: b'GET /examples/feed.xml HTTP/1.1
-Host: diveintopython3.org
-Accept-Encoding: identity
-User-Agent: Python-urllib/3.1'
-Connection: close
-reply: 'HTTP/1.1 200 OK'
-…further debugging information omitted…
-
-Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No sign of If-Modified-Since headers. No sign of If-None-Match headers. No respect for the caching headers. Still no compression.
-
-
And what happens when you do the same thing twice? You get the same response. Twice. - -
-# continued from the previous example
->>> print(response2.headers.as_string()) ①
-Date: Mon, 01 Jun 2009 03:58:00 GMT
-Server: Apache
-Last-Modified: Sun, 31 May 2009 22:51:11 GMT
-ETag: "bfe-255ef5c0"
-Accept-Ranges: bytes
-Content-Length: 3070
-Cache-Control: max-age=86400
-Expires: Tue, 02 Jun 2009 03:58:00 GMT
-Vary: Accept-Encoding
-Connection: close
-Content-Type: application/xml
->>> data2 = response2.read()
->>> len(data2) ②
-3070
->>> data2 == data ③
-True
-
The server is still sending the same set of headers: Cache-Control and Expires to allow caching, Last-Modified and ETag to enable “not-modified” tracking. Even the Vary: Accept-Encoding header hints that the server would support compression, if only you would ask for it. But you didn’t.
-HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that speaks HTTP fluently.
-
-
⁂ - -
httplib2Before you can use httplib2, you’ll need to install it. Visit code.google.com/p/httplib2/ and download the latest version. httplib2 is available for Python 2.x and Python 3.x; make sure you get the Python 3 version, named something like httplib2-python3-0.5.0.zip.
-
-
Unzip the archive, open a terminal window, and go to the newly created httplib2 directory. On Windows, open the Start menu, select Run..., type cmd.exe and press ENTER.
-
-
-c:\Users\pilgrim\Downloads> dir
- Volume in drive C has no label.
- Volume Serial Number is DED5-B4F8
-
- Directory of c:\Users\pilgrim\Downloads
-
-07/28/2009 12:36 PM <DIR> .
-07/28/2009 12:36 PM <DIR> ..
-07/28/2009 12:36 PM <DIR> httplib2-python3-0.5.0
-07/28/2009 12:33 PM 18,997 httplib2-python3-0.5.0.zip
- 1 File(s) 18,997 bytes
- 3 Dir(s) 61,496,684,544 bytes free
-
-c:\Users\pilgrim\Downloads> cd httplib2-python3-0.5.0
-c:\Users\pilgrim\Downloads\httplib2-python3-0.5.0> c:\python31\python.exe setup.py install
-running install
-running build
-running build_py
-running install_lib
-creating c:\python31\Lib\site-packages\httplib2
-copying build\lib\httplib2\iri2uri.py -> c:\python31\Lib\site-packages\httplib2
-copying build\lib\httplib2\__init__.py -> c:\python31\Lib\site-packages\httplib2
-byte-compiling c:\python31\Lib\site-packages\httplib2\iri2uri.py to iri2uri.pyc
-byte-compiling c:\python31\Lib\site-packages\httplib2\__init__.py to __init__.pyc
-running install_egg_info
-Writing c:\python31\Lib\site-packages\httplib2-python3_0.5.0-py3.1.egg-info
-
-
On Mac OS X, run the Terminal.app application in your /Applications/Utilities/ folder. On Linux, run the Terminal application, which is usually in your Applications menu under Accessories or System.
-
-
-you@localhost:~/Desktop$ unzip httplib2-python3-0.5.0.zip
-Archive: httplib2-python3-0.5.0.zip
- inflating: httplib2-python3-0.5.0/README
- inflating: httplib2-python3-0.5.0/setup.py
- inflating: httplib2-python3-0.5.0/PKG-INFO
- inflating: httplib2-python3-0.5.0/httplib2/__init__.py
- inflating: httplib2-python3-0.5.0/httplib2/iri2uri.py
-you@localhost:~/Desktop$ cd httplib2-python3-0.5.0/
-you@localhost:~/Desktop/httplib2-python3-0.5.0$ sudo python3 setup.py install
-running install
-running build
-running build_py
-creating build
-creating build/lib.linux-x86_64-3.1
-creating build/lib.linux-x86_64-3.1/httplib2
-copying httplib2/iri2uri.py -> build/lib.linux-x86_64-3.1/httplib2
-copying httplib2/__init__.py -> build/lib.linux-x86_64-3.1/httplib2
-running install_lib
-creating /usr/local/lib/python3.1/dist-packages/httplib2
-copying build/lib.linux-x86_64-3.1/httplib2/iri2uri.py -> /usr/local/lib/python3.1/dist-packages/httplib2
-copying build/lib.linux-x86_64-3.1/httplib2/__init__.py -> /usr/local/lib/python3.1/dist-packages/httplib2
-byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/iri2uri.py to iri2uri.pyc
-byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/__init__.py to __init__.pyc
-running install_egg_info
-Writing /usr/local/lib/python3.1/dist-packages/httplib2-python3_0.5.0.egg-info
-
-
To use httplib2, create an instance of the httplib2.Http class.
-
-
->>> import httplib2
->>> h = httplib2.Http('.cache') ①
->>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ②
->>> response.status ③
-200
->>> content[:52] ④
-b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
->>> len(content)
-3070
The primary interface to httplib2 is the Http object. For reasons you’ll see in the next section, you should always pass a directory name when you create an Http object. The directory does not need to exist; httplib2 will create it if necessary.
-Once you have an Http object, retrieving data is as simple as calling the request() method with the address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll see how to issue other HTTP requests, like POST.)
-The request() method returns two values. The first is an httplib2.Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request was successful.
-The second value is the content, returned as a bytes object, not a string. If you want it as a string, you’ll need to determine the character encoding and convert it yourself.
--- -☞You probably only need one
httplib2.Httpobject. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use theHttpobject and just call therequest()method twice. -
httplib2 Returns Bytes Instead of StringsBytes. Strings. What a pain. Why can’t httplib2 “just” do the conversion for you? Well, it’s complicated, because the rules for determining the character encoding are specific to what kind of resource you’re requesting. How could httplib2 know what kind of resource you’re requesting? It’s usually listed in the Content-Type HTTP header, but that’s an optional feature of HTTP and not all HTTP servers include it. If that header is not included in the HTTP response, it’s left up to the client to guess. (This is commonly called “content sniffing,” and it’s never perfect.)
-
-
If you know what sort of resource you’re expecting (an XML document in this case), perhaps you could “just” pass the returned bytes object to the xml.etree.ElementTree.parse() function. That’ll work as long as the XML document includes information on its own character encoding (as this one does), but that’s an optional feature and not all XML documents do that. If an XML document doesn’t include encoding information, the client is supposed to look at the enclosing transport — i.e. the Content-Type HTTP header, which can include a charset parameter.
-
-
But it’s worse than that. Now character encoding information can be in two places: within the XML document itself, and within the Content-Type HTTP header. If the information is in both places, which one wins? According to RFC 3023 (I swear I am not making this up), if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is
-
-
the encoding given in the charset parameter of the Content-Type HTTP header, or
-the encoding given in the encoding attribute of the XML declaration within the document, or
-utf-8
-On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is
-
-
the encoding given in the charset parameter of the Content-Type HTTP header, or
-us-ascii
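The two sets of rules above can be written down as a short function. Treat this as a sketch rather than a complete RFC 3023 implementation: `xml_encoding` is a made-up name, the regular expression only looks for a simple `encoding` attribute at the very start of the document, and the final fallback to utf-8 for the `application/*` family is my assumption, based on UTF-8 being XML’s own default.

```python
import re
from email.message import Message

APPLICATION_XML = ("application/xml", "application/xml-dtd",
                   "application/xml-external-parsed-entity")
TEXT_XML = ("text/xml", "text/xml-external-parsed-entity")

def xml_encoding(content_type, body):
    """Pick an encoding for an XML body, given its Content-Type value."""
    header = Message()
    header["Content-Type"] = content_type
    media_type = header.get_content_type()      # lowercased, no parameters
    charset = header.get_content_charset()      # None if absent
    declared = re.match(
        rb"""<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""", body)

    if media_type in APPLICATION_XML or (
            media_type.startswith("application/") and media_type.endswith("+xml")):
        if charset:
            return charset
        if declared:
            return declared.group(1).decode("ascii").lower()
        return "utf-8"                          # assumed default, see above
    if media_type in TEXT_XML or (
            media_type.startswith("text/") and media_type.endswith("+xml")):
        # The XML declaration is ignored completely for text/* types.
        return charset or "us-ascii"
    return None                                 # not an XML media type

doc = b"<?xml version='1.0' encoding='iso-8859-1'?><feed/>"
print(xml_encoding("application/xml", doc))
print(xml_encoding("text/xml", doc))
```

The same document comes out as iso-8859-1 under `application/xml` but us-ascii under `text/xml`, which is the whole absurdity in two lines.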
-And that’s just for XML documents. For HTML documents, web browsers have constructed such byzantine rules for content-sniffing [PDF] that we’re still trying to figure them all out. - -
“Patches welcome.” - -
httplib2 Handles CachingRemember in the previous section when I said you should always create an httplib2.Http object with a directory name? Caching is the reason.
-
-
-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml') ①
->>> response2.status ②
-200
->>> content2[:52] ③
-b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
->>> len(content2)
-3070
-The status is once again 200, just like last time.
-So… who cares? Quit your Python interactive shell and relaunch it with a new session, and I’ll show you. - -
-# NOT continued from previous example!
-# Please exit out of the interactive shell
-# and launch a new one.
->>> import httplib2
->>> httplib2.debuglevel = 1 ①
->>> h = httplib2.Http('.cache') ②
->>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ③
->>> len(content) ④
-3070
->>> response.status ⑤
-200
->>> response.fromcache ⑥
-True
-
This is the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back.
-Create an httplib2.Http object with the same directory name as before.
-This response came from httplib2’s local cache. The directory you named when you created the httplib2.Http object holds httplib2’s cache of all the operations it’s ever performed.
--- -☞If you want to turn on
httplib2debugging, you need to set a module-level constant (httplib2.debuglevel), then create a newhttplib2.Httpobject. If you want to turn off debugging, you need to change the same module-level constant, then create a newhttplib2.Httpobject. -
You previously requested the data at this URL. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 24 hours (Cache-Control: max-age=86400, which is 24 hours measured in seconds). httplib2 understands and respects those caching headers, and it stored the previous response in the .cache directory (which you passed in when you created the Http object). That cache hasn’t expired yet, so the second time you request the data at this URL, httplib2 simply returns the cached result without ever hitting the network.
-
-
I say “simply,” but obviously there is a lot of complexity hidden behind that simplicity. httplib2 handles HTTP caching automatically and by default. If for some reason you need to know whether a response came from the cache, you can check response.fromcache. Otherwise, it Just Works.
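To get a feel for what is hidden behind that simplicity, here is a toy in-memory cache. It is a sketch of the idea only and bears no relation to httplib2’s real code: `TinyCache`, the fake fetch function, and the fake clock are all invented so the example runs without a network.

```python
import time

class TinyCache:
    """Serve a stored body while it is fresh; otherwise fetch again."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch         # callable: url -> (max_age_seconds, body)
        self._clock = clock
        self._entries = {}          # url -> (expires_at, body)

    def get(self, url):
        """Return (body, fromcache)."""
        now = self._clock()
        entry = self._entries.get(url)
        if entry and now < entry[0]:
            return entry[1], True   # still fresh: no network activity
        max_age, body = self._fetch(url)
        self._entries[url] = (now + max_age, body)
        return body, False

calls = []
def fake_fetch(url):
    calls.append(url)               # count simulated network requests
    return 86400, b"feed data"      # cacheable for 24 hours

now = [0]
cache = TinyCache(fake_fetch, clock=lambda: now[0])
url = "http://diveintopython3.org/examples/feed.xml"
first = cache.get(url)
second = cache.get(url)
now[0] = 86401                      # a day and a second later
third = cache.get(url)
print(first, second, third, len(calls))
```

Three requests, two network fetches: the second call is answered from the cache, and the third goes back to the “server” because the entry has expired.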
-
-
Now, suppose you have data cached, but you want to bypass the cache and re-request it from the remote server. Browsers sometimes do this if the user specifically requests it. For example, pressing F5 refreshes the current page, but pressing Ctrl+F5 bypasses the cache and re-requests the current page from the remote server. You might think “oh, I’ll just delete the data from my local cache, then request it again.” You could do that, but remember that there may be more parties involved than just you and the remote server. What about those intermediate proxy servers? They’re completely beyond your control, and they may still have that data cached, and will happily return it to you because (as far as they are concerned) their cache is still valid. - -
Instead of manipulating your local cache and hoping for the best, you should use the features of HTTP to ensure that your request actually reaches the remote server. - -
-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',
-... headers={'cache-control':'no-cache'}) ①
-connect: (diveintopython3.org, 80) ②
-send: b'GET /examples/feed.xml HTTP/1.1
-Host: diveintopython3.org
-user-agent: Python-httplib2/$Rev: 259 $
-accept-encoding: deflate, gzip
-cache-control: no-cache'
-reply: 'HTTP/1.1 200 OK'
-…further debugging information omitted…
->>> response2.status
-200
->>> response2.fromcache ③
-False
->>> print(dict(response2.items())) ④
-{'status': '200',
- 'content-length': '3070',
- 'content-location': 'http://diveintopython3.org/examples/feed.xml',
- 'accept-ranges': 'bytes',
- 'expires': 'Wed, 03 Jun 2009 00:40:26 GMT',
- 'vary': 'Accept-Encoding',
- 'server': 'Apache',
- 'last-modified': 'Sun, 31 May 2009 22:51:11 GMT',
- 'connection': 'close',
- '-content-encoding': 'gzip',
- 'etag': '"bfe-255ef5c0"',
- 'cache-control': 'max-age=86400',
- 'date': 'Tue, 02 Jun 2009 00:40:26 GMT',
- 'content-type': 'application/xml'}
-httplib2 allows you to add arbitrary HTTP headers to any outgoing request. In order to bypass all caches (not just your local disk cache, but also any caching proxies between you and the remote server), add a no-cache header in the headers dictionary.
-Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions: as part of the incoming response and as part of the outgoing request. It noticed that you added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data.
-The response includes caching headers, which httplib2 uses to update its local cache, in the hopes of avoiding network access the next time you request this feed. Everything about HTTP caching is designed to maximize cache hits and minimize network access. Even though you bypassed the cache this time, the remote server would really appreciate it if you would cache the result for next time.
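The check a cache performs on the outgoing request can be sketched as follows. This is an illustration, not httplib2’s code; `may_serve_from_cache` is a hypothetical function that looks only at the `no-cache` directive.

```python
def may_serve_from_cache(request_headers):
    """Return False when the caller asked to bypass caches with no-cache."""
    for name, value in request_headers.items():
        if name.lower() == "cache-control":     # header names ignore case
            directives = [d.strip().lower() for d in value.split(",")]
            if "no-cache" in directives:
                return False
    return True

print(may_serve_from_cache({}))
print(may_serve_from_cache({"cache-control": "no-cache"}))
```

Because the header travels with the request, every cache along the way can apply the same test, which is why this works where deleting your local copy would not.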
-httplib2 Handles Last-Modified and ETag HeadersThe Cache-Control and Expires caching headers are called freshness indicators. They tell caches in no uncertain terms that you can completely avoid all network access until the cache expires. And that’s exactly the behavior you saw in the previous section: given a freshness indicator, httplib2 does not generate a single byte of network activity to serve up cached data (unless you explicitly bypass the cache, of course).
-
-
But what about the case where the data might have changed, but hasn’t? HTTP defines Last-Modified and Etag headers for this purpose. These headers are called validators. If the local cache is no longer fresh, a client can send the validators with the next request to see if the data has actually changed. If the data hasn’t changed, the server sends back a 304 status code and no data. So there’s still a round-trip over the network, but you end up downloading fewer bytes.
-
-
->>> import httplib2
->>> httplib2.debuglevel = 1
->>> h = httplib2.Http('.cache')
->>> response, content = h.request('http://diveintopython3.org/') ①
-connect: (diveintopython3.org, 80)
-send: b'GET / HTTP/1.1
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 200 OK'
->>> print(dict(response.items())) ②
-{'-content-encoding': 'gzip',
- 'accept-ranges': 'bytes',
- 'connection': 'close',
- 'content-length': '6657',
- 'content-location': 'http://diveintopython3.org/',
- 'content-type': 'text/html',
- 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
- 'etag': '"7f806d-1a01-9fb97900"',
- 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
- 'server': 'Apache',
- 'status': '200',
- 'vary': 'Accept-Encoding,User-Agent'}
->>> len(content) ③
-6657
-httplib2 has little to work with, and it sends out a minimum of headers with the request.
-ETag and Last-Modified header.
-
-# continued from the previous example
->>> response, content = h.request('http://diveintopython3.org/') ①
-connect: (diveintopython3.org, 80)
-send: b'GET / HTTP/1.1
-Host: diveintopython3.org
-if-none-match: "7f806d-1a01-9fb97900" ②
-if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT ③
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 304 Not Modified' ④
->>> response.fromcache ⑤
-True
->>> response.status ⑥
-200
->>> response.dict['status'] ⑦
-'304'
->>> len(content) ⑧
-6657
-Http object (and the same local cache).
-httplib2 sends the ETag validator back to the server in the If-None-Match header.
-httplib2 also sends the Last-Modified validator back to the server in the If-Modified-Since header.
-304 status code and no data.
-httplib2 notices the 304 status code and loads the content of the page from its cache.
-304 (returned from the server this time, which caused httplib2 to look in its cache), and 200 (returned from the server last time, and stored in httplib2’s cache along with the page data). response.status returns the status from the cache.
-response.dict, which is a dictionary of the actual headers returned from the server.
-httplib2 is smart enough to let you act dumb.) By the time the request() method returns to the caller, httplib2 has already updated its cache and returned the data to you.
-http2lib Handles CompressionHTTP supports several types of compression; the two most common types are gzip and deflate. httplib2 supports both of these.
-
-
->>> response, content = h.request('http://diveintopython3.org/')
-connect: (diveintopython3.org, 80)
-send: b'GET / HTTP/1.1
-Host: diveintopython3.org
-accept-encoding: deflate, gzip ①
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 200 OK'
->>> print(dict(response.items()))
-{'-content-encoding': 'gzip', ②
- 'accept-ranges': 'bytes',
- 'connection': 'close',
- 'content-length': '6657',
- 'content-location': 'http://diveintopython3.org/',
- 'content-type': 'text/html',
- 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
- 'etag': '"7f806d-1a01-9fb97900"',
- 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
- 'server': 'Apache',
- 'status': '304',
- 'vary': 'Accept-Encoding,User-Agent'}
-httplib2 sends a request, it includes an Accept-Encoding header to tell the server that it can handle either deflate or gzip compression.
-request() method returns, httplib2 has already decompressed the body of the response and placed it in the content variable. If you’re curious about whether or not the response was compressed, you can check response['-content-encoding']; otherwise, don’t worry about it.
-httplib2 Handles RedirectsHTTP defines two kinds of redirects: temporary and permanent. There’s nothing special to do with temporary redirects except follow them, which httplib2 does automatically.
-
-
->>> import httplib2
->>> httplib2.debuglevel = 1
->>> h = httplib2.Http('.cache')
->>> response, content = h.request('http://diveintopython3.org/examples/feed-302.xml') ①
-connect: (diveintopython3.org, 80)
-send: b'GET /examples/feed-302.xml HTTP/1.1 ②
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 302 Found' ③
-send: b'GET /examples/feed.xml HTTP/1.1 ④
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 200 OK'
-302 Found. Not shown here, this response also includes a Location header that points to the real URL.
-httplib2 immediately turns around and “follows” the redirect by issuing another request for the URL given in the Location header: http://diveintopython3.org/examples/feed.xml
-“Following” a redirect is nothing more than this example shows. httplib2 sends a request for the URL you asked for. The server comes back with a response that says “No no, look over there instead.” httplib2 sends another request for the new URL.
-
-
-# continued from the previous example ->>> response ① -{'status': '200', - 'content-length': '3070', - 'content-location': 'http://diveintopython3.org/examples/feed.xml', ② - 'accept-ranges': 'bytes', - 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT', - 'vary': 'Accept-Encoding', - 'server': 'Apache', - 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT', - 'connection': 'close', - '-content-encoding': 'gzip', ③ - 'etag': '"bfe-4cbbf5c0"', - 'cache-control': 'max-age=86400', ④ - 'date': 'Wed, 03 Jun 2009 02:21:41 GMT', - 'content-type': 'application/xml'}-
request() method is the response from the final URL.
-httplib2 adds the final URL to the response dictionary, as content-location. This is not a header that came from the server; it’s specific to httplib2.
-The response you get back gives you information about the final URL. What if you want more information about the intermediate URLs, the ones that eventually redirected to the final URL? httplib2 lets you do that, too.
-
-
-# continued from the previous example ->>> response.previous ① -{'status': '302', - 'content-length': '228', - 'content-location': 'http://diveintopython3.org/examples/feed-302.xml', - 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT', - 'server': 'Apache', - 'connection': 'close', - 'location': 'http://diveintopython3.org/examples/feed.xml', - 'cache-control': 'max-age=86400', - 'date': 'Wed, 03 Jun 2009 02:21:41 GMT', - 'content-type': 'text/html; charset=iso-8859-1'} ->>> type(response) ② -<class 'httplib2.Response'> ->>> type(response.previous) -<class 'httplib2.Response'> ->>> response.previous.previous ③ ->>>-
httplib2 followed to get to the current response object.
-httplib2.Response objects.
-None.
-What happens if you request the same URL again? - -
-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-302.xml') ①
-connect: (diveintopython3.org, 80)
-send: b'GET /examples/feed-302.xml HTTP/1.1 ②
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 302 Found' ③
->>> content2 == content ④
-True
-httplib2.Http object (and therefore the same cache).
-302 response was not cached, so httplib2 sends another request for the same URL.
-302. But notice what didn’t happen: there wasn’t ever a second request for the final URL, http://diveintopython3.org/examples/feed.xml. That response was cached (remember the Cache-Control header that you saw in the previous example). Once httplib2 received the 302 Found code, it checked its cache before issuing another request. The cache contained a fresh copy of http://diveintopython3.org/examples/feed.xml, so there was no need to re-request it.
-request() method returns, it has read the feed data from the cache and returned it. Of course, it’s the same as the data you received last time.
-In other words, you don’t have to do anything special for temporary redirects. httplib2 will follow them automatically, and the fact that one URL redirects to another has no bearing on httplib2’s support for compression, caching, ETags, or any of the other features of HTTP.
-
-
Permanent redirects are just as simple. - -
-# continued from the previous example
->>> response, content = h.request('http://diveintopython3.org/examples/feed-301.xml') ①
-connect: (diveintopython3.org, 80)
-send: b'GET /examples/feed-301.xml HTTP/1.1
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 301 Moved Permanently' ②
->>> response.fromcache ③
-True
-http://diveintopython3.org/examples/feed.xml.
-301. But again, notice what didn’t happen: there was no request to the redirect URL. Why not? Because it’s already cached locally.
-httplib2 “followed” the redirect right into its cache.
-But wait! There’s more! - -
-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml') ①
->>> response2.fromcache ②
-True
->>> content2 == content ③
-True
-
-httplib2 follows a permanent redirect, all further requests for that URL will transparently be rewritten to the target URL without hitting the network for the original URL. Remember, debugging is still turned on, yet there is no output of network activity whatsoever.
-HTTP. It works. - -
⁂ - -
HTTP web services are not limited to GET requests. What if you want to create something new? Whenever you post a comment on a discussion forum, update your weblog, publish your status on a microblogging service like Twitter or Identi.ca, you’re probably already using HTTP POST.
-
-
Both Twitter and Identi.ca both offer a simple HTTP-based API for publishing and updating your status in 140 characters or less. Let’s look at Identi.ca’s API documentation for updating your status: - -
-- -Identi.ca REST API Method: statuses/update
-Updates the authenticating user’s status. Requires thestatusparameter specified below. Request must be aPOST. - --
-- URL -
https://identi.ca/api/statuses/update.format-- Formats -
xml,json,rss,atom-- HTTP Method(s) -
POST-- Requires Authentication -
- true -
- Parameters -
status. Required. The text of your status update. URL-encode as necessary. -
How does this work? To publish a new message on Identi.ca, you need to issue an HTTP POST request to http://identi.ca/api/statuses/update.format. (The format bit is not part of the URL; you replace it with the data format you want the server to return in response to your request. So if you want a response in XML, you would post the request to https://identi.ca/api/statuses/update.xml.) The request needs to include a parameter called status, which contains the text of your status update. And the request needs to be authenticated.
-
-
Authenticated? Sure. To update your status on Identi.ca, you need to prove who you are. Identi.ca is not a wiki; only you can update your own status. Identi.ca uses HTTP Basic Authentication (a.k.a. RFC 2617) over SSL to provide secure but easy-to-use authentication. httplib2 supports both SSL and HTTP Basic Authentication, so this part is easy.
-
-
A POST request is different from a GET request, because it includes a payload. The payload is the data you want to send to the server. The one piece of data that this API method requires is status, and it should be URL-encoded. This is a very simple serialization format that takes a set of key-value pairs (i.e. a dictionary) and transforms it into a string.
-
-
->>> from urllib.parse import urlencode ① ->>> data = {'status': 'Test update from Python 3'} ② ->>> urlencode(data) ③ -'status=Test+update+from+Python+3'-
urllib.parse.urlencode().
-status, whose value is the text of a single status update.
-POST request.
-- -
->>> from urllib.parse import urlencode
->>> import httplib2
->>> httplib2.debuglevel = 1
->>> h = httplib2.Http('.cache')
->>> data = {'status': 'Test update from Python 3'}
->>> h.add_credentials('diveintomark', 'MY_SECRET_PASSWORD', 'identi.ca') ①
->>> resp, content = h.request('https://identi.ca/api/statuses/update.xml',
-... 'POST', ②
-... urlencode(data), ③
-... headers={'Content-Type': 'application/x-www-form-urlencoded'}) ④
-httplib2 handles authentication. Store your username and password with the add_credentials() method. When httplib2 tries to issue the request, the server will respond with a 401 Unauthorized status code, and it will list which authentication methods it supports (in the WWW-Authenticate header). httplib2 will automatically construct an Authorization header and re-request the URL.
-POST.
--- -☞The third parameter to the
add_credentials()method is the domain in which the credentials are valid. You should always specify this! If you leave out the domain and later reuse thehttplib2.Httpobject on a different authenticated site,httplib2might end up leaking one site’s username and password to the other site. -
This is what goes over the wire: - -
-# continued from the previous example -send: b'POST /api/statuses/update.xml HTTP/1.1 -Host: identi.ca -Accept-Encoding: identity -Content-Length: 32 -content-type: application/x-www-form-urlencoded -user-agent: Python-httplib2/$Rev: 259 $ - -status=Test+update+from+Python+3' -reply: 'HTTP/1.1 401 Unauthorized' ① -send: b'POST /api/statuses/update.xml HTTP/1.1 ② -Host: identi.ca -Accept-Encoding: identity -Content-Length: 32 -content-type: application/x-www-form-urlencoded -authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2 ③ -user-agent: Python-httplib2/$Rev: 259 $ - -status=Test+update+from+Python+3' -reply: 'HTTP/1.1 200 OK' ④-
401 Unauthorized status code. httplib2 will never send authentication headers unless the server explicitly asks for them. This is how the server asks for them.
-httplib2 immediately turns around and requests the same URL a second time.
-add_credentials() method.
-What does the server send back after a successful request? That depends entirely on the web service API. In some protocols (like the Atom Publishing Protocol), the server sends back a 201 Created status code and the location of the newly created resource in the Location header. Identi.ca sends back a 200 OK and an XML document containing information about the newly created resource.
-
-
-# continued from the previous example
->>> print(content.decode('utf-8')) ①
-<?xml version="1.0" encoding="UTF-8"?>
-<status>
- <text>Test update from Python 3</text> ②
- <truncated>false</truncated>
- <created_at>Wed Jun 10 03:53:46 +0000 2009</created_at>
- <in_reply_to_status_id></in_reply_to_status_id>
- <source>api</source>
- <id>5131472</id> ③
- <in_reply_to_user_id></in_reply_to_user_id>
- <in_reply_to_screen_name></in_reply_to_screen_name>
- <favorited>false</favorited>
- <user>
- <id>3212</id>
- <name>Mark Pilgrim</name>
- <screen_name>diveintomark</screen_name>
- <location>27502, US</location>
- <description>tech writer, husband, father</description>
- <profile_image_url>http://avatar.identi.ca/3212-48-20081216000626.png</profile_image_url>
- <url>http://diveintomark.org/</url>
- <protected>false</protected>
- <followers_count>329</followers_count>
- <profile_background_color></profile_background_color>
- <profile_text_color></profile_text_color>
- <profile_link_color></profile_link_color>
- <profile_sidebar_fill_color></profile_sidebar_fill_color>
- <profile_sidebar_border_color></profile_sidebar_border_color>
- <friends_count>2</friends_count>
- <created_at>Wed Jul 02 22:03:58 +0000 2008</created_at>
- <favourites_count>30768</favourites_count>
- <utc_offset>0</utc_offset>
- <time_zone>UTC</time_zone>
- <profile_background_image_url></profile_background_image_url>
- <profile_background_tile>false</profile_background_tile>
- <statuses_count>122</statuses_count>
- <following>false</following>
- <notifications>false</notifications>
-</user>
-</status>
-httplib2 is always bytes, not a string. To convert it to a string, you need to decode it using the proper character encoding. Identi.ca’s API always returns results in UTF-8, so that part is easy.
-And here it is: - -
-
-
⁂ - -
HTTP isn’t limited to GET and POST. Those are certainly the most common types of requests, especially in web browsers. But web service APIs can go beyond GET and POST, and httplib2 is ready.
-
-
-# continued from the previous example ->>> from xml.etree import ElementTree as etree ->>> tree = etree.fromstring(content) ① ->>> status_id = tree.findtext('id') ② ->>> status_id -'5131472' ->>> url = 'https://identi.ca/api/statuses/destroy/{0}.xml'.format(status_id) ③ ->>> resp, deleted_content = h.request(url, 'DELETE') ④-
findtext() method finds the first instance of the given expression and extracts its text content. In this case, we’re just looking for an <id> element.
-<id> element, we can construct a URL to delete the status message we just published.
-DELETE request to that URL.
-This is what goes over the wire: - -
-send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1 ① -Host: identi.ca -Accept-Encoding: identity -user-agent: Python-httplib2/$Rev: 259 $ - -' -reply: 'HTTP/1.1 401 Unauthorized' ② -send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1 ③ -Host: identi.ca -Accept-Encoding: identity -authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2 ④ -user-agent: Python-httplib2/$Rev: 259 $ - -' -reply: 'HTTP/1.1 200 OK' ⑤ ->>> resp.status -200-
And just like that, poof, it’s gone. - -
-
-
⁂ - -
httplib2:
-
-
httplib2 project page
-httplib2 code examples
-httplib2
-httplib2: HTTP Persistence and Authentication
-HTTP caching: - -
RFCs: - -
© 2001–10 Mark Pilgrim - - - + + +
You are here: Home ‣ Dive Into Python 3 ‣ +
Difficulty level: ♦♦♦♦♢ +
++❝ A ruffled mind makes a restless pillow. ❞
— Charlotte Brontë +
+
Philosophically, I can describe HTTP web services in 12 words: exchanging data with remote servers using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET. If you want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also allow creating, modifying, and deleting data, using HTTP PUT and HTTP DELETE. That’s it. No registries, no envelopes, no wrappers, no tunneling. The “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data.
+
+
The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML or JSON — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each resource in an HTTP web service has a unique address (in the form of a URL), you can load it in your web browser and immediately see the raw data. + +
Examples of HTTP web services: +
Python 3 comes with two different libraries for interacting with HTTP web services: + +
http.client is a low-level library that implements RFC 2616, the HTTP protocol.
+urllib.request is an abstraction layer built on top of http.client. It provides a standard API for accessing both HTTP and FTP servers, automatically follows HTTP redirects, and handles some common forms of HTTP authentication.
+So which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction than urllib.request.
+
+
To understand why httplib2 is the right choice, you first need to understand HTTP.
+
+
⁂ + +
There are five important features which all HTTP clients should support. + +
The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinary long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, latency (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack — there’s never a dull moment on the public internet, and there may be nothing you can do about it. + + + +
HTTP is designed with caching in mind. There is an entire class of devices (called “caching proxies”) whose only job is to sit between you and the rest of the world and minimize network access. Your company or ISP almost certainly maintains caching proxies, even if you’re unaware of them. They work because caching built into the HTTP protocol. + +
Here’s a concrete example of how caching works. You visit diveintomark.org in your browser. That page includes a background image, wearehugh.com/m.jpg. When your browser downloads that image, the server includes the following HTTP headers:
+
+
HTTP/1.1 200 OK
+Date: Sun, 31 May 2009 17:14:04 GMT
+Server: Apache
+Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
+ETag: "3075-ddc8d800"
+Accept-Ranges: bytes
+Content-Length: 12405
+Cache-Control: max-age=31536000, public
+Expires: Mon, 31 May 2010 17:14:04 GMT
+Connection: close
+Content-Type: image/jpeg
+
+The Cache-Control and Expires headers tell your browser (and any caching proxies between you and the server) that this image can be cached for up to a year. A year! And if, in the next year, you visit another page which also includes a link to this image, your browser will load the image from its cache without generating any network activity whatsoever.
+
+
But wait, it gets better. Let’s say your browser purges the image from your local cache for some reason. Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the HTTP headers said that this data could be cached by public caching proxies. (Technically, the important thing is what the headers don’t say; the Cache-Control header doesn’t have the private keyword, so this data is cacheable by default.) Caching proxies are designed to have tons of storage space, probably far more than your local browser has allocated.
+
+
If your company or ISP maintain a caching proxy, the proxy may still have the image cached. When you visit diveintomark.org again, your browser will look in its local cache for the image, but it won’t find it, so it will make a network request to try to download it from the remote server. But if the caching proxy still has a copy of the image, it will intercept that request and serve the image from its cache. That means that your request will never reach the remote server; in fact, it will never leave your company’s network. That makes for a faster download (fewer network hops) and saves your company money (less data being downloaded from the outside world).
+
+
HTTP caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be. + +
Python’s HTTP libraries do not support caching, but httplib2 does.
+
+
Some data never changes, while other data changes all the time. In between, there is a vast field of data that might have changed, but hasn’t. CNN.com’s feed is updated every few minutes, but my weblog’s feed may not change for days or weeks at a time. In the latter case, I don’t want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they’re respecting my cache headers which said “don’t bother checking this feed for weeks”). On the other hand, I don’t want clients downloading my entire feed once an hour if it hasn’t changed! + + + +
HTTP has a solution to this, too. When you request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was changed. That background image referenced from diveintomark.org included a Last-Modified header.
+
+
HTTP/1.1 200 OK
+Date: Sun, 31 May 2009 17:14:04 GMT
+Server: Apache
+Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
+ETag: "3075-ddc8d800"
+Accept-Ranges: bytes
+Content-Length: 12405
+Cache-Control: max-age=31536000, public
+Expires: Mon, 31 May 2010 17:14:04 GMT
+Connection: close
+Content-Type: image/jpeg
+
+
+When you request the same data a second (or third or fourth) time, you can send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data has changed since then, then the server ignores the If-Modified-Since header and just gives you the new data with a 200 status code. But if the data hasn’t changed since then, the server sends back a special HTTP 304 status code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the command line, using curl:
+
+
+you@localhost:~$ curl -I -H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT" http://wearehugh.com/m.jpg +HTTP/1.1 304 Not Modified +Date: Sun, 31 May 2009 18:04:39 GMT +Server: Apache +Connection: close +ETag: "3075-ddc8d800" +Expires: Mon, 31 May 2010 18:04:39 GMT +Cache-Control: max-age=31536000, public+ +
Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is the status code. Even after your cached copy has expired, last-modified checking ensures that you won’t download the same data twice if it hasn’t changed. (As an extra bonus, this 304 response also includes caching headers. Proxies will keep a copy of data even after it officially “expires,” in the hopes that the data hasn’t really changed and the next request responds with a 304 status code and updated cache information.)
+
+
Python’s HTTP libraries do not support last-modified date checking, but httplib2 does.
+
+
ETags are an alternate way to accomplish the same thing as the last-modified checking. With Etags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from diveintomark.org had an ETag header.
+
+
HTTP/1.1 200 OK
+Date: Sun, 31 May 2009 17:14:04 GMT
+Server: Apache
+Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
+ETag: "3075-ddc8d800"
+Accept-Ranges: bytes
+Content-Length: 12405
+Cache-Control: max-age=31536000, public
+Expires: Mon, 31 May 2010 17:14:04 GMT
+Connection: close
+Content-Type: image/jpeg
+
+
+
+
+The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn’t changed, the server will send you back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the last time.
+
+
Again with the curl: + +
+you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg ①
+HTTP/1.1 304 Not Modified
+Date: Sun, 31 May 2009 18:04:39 GMT
+Server: Apache
+Connection: close
+ETag: "3075-ddc8d800"
+Expires: Mon, 31 May 2010 18:04:39 GMT
+Cache-Control: max-age=31536000, public
+If-None-Match header.
+Python’s HTTP libraries do not support ETags, but httplib2 does.
+
+
When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s XML, maybe it’s JSON, maybe it’s just plain text. Regardless of the format, text compresses well. The example feed in the XML chapter is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That’s just 30% of the original size! + +
HTTP supports several compression algorithms. The two most common types are gzip and deflate. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include an Accept-encoding header in your request that lists which compression algorithms you support. If the server supports any of the same algorithms, it will send you back compressed data (with a Content-encoding header that tells you which algorithm it used). Then it’s up to you to decompress the data.
+
+
++ +☞Important tip for server-side developers: make sure that the compressed version of a resource has a different Etag than the uncompressed version. Otherwise, caching proxies will get confused and may serve the compressed version to clients that can’t handle it. Read the discussion of Apache bug 39727 for more details on this subtle issue. +
Python’s HTTP libraries do not support compression, but httplib2 does.
+
+
Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml.
+
+
+
+
Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything’s normal, here’s the page you asked for”. Status code 404 means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection.
+
+
HTTP has several different ways of signifying that a resource has moved. The two most common techiques are status codes 302 and 301. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you’re supposed to use the new address from then on.
+
+
The urllib.request module automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but it doesn’t tell you that it did so. You’ll end up getting data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address, and each time the urllib.request module will “helpfully” follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for you.
+
+
httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
+
+
⁂ + +
Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better. +
+>>> import urllib.request +>>> a_url = 'http://diveintopython3.org/examples/feed.xml' +>>> data = urllib.request.urlopen(a_url).read() ① +>>> type(data) ② +<class 'bytes'> +>>> print(data) +<?xml version='1.0' encoding='utf-8'?> +<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'> + <title>dive into mark</title> + <subtitle>currently between addictions</subtitle> + <id>tag:diveintomark.org,2001-07-29:/</id> + <updated>2009-03-27T21:56:07Z</updated> + <link rel='alternate' type='text/html' href='http://diveintomark.org/'/> + … ++
The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier.
+The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it to a string.
+So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude. + +
⁂ + +
To see why this is inefficient and rude, let’s turn on the debugging features of Python’s HTTP library and see what’s being sent “on the wire” (i.e. over the network). + +
+>>> from http.client import HTTPConnection +>>> HTTPConnection.debuglevel = 1 ① +>>> from urllib.request import urlopen +>>> response = urlopen('http://diveintopython3.org/examples/feed.xml') ② +send: b'GET /examples/feed.xml HTTP/1.1 ③ +Host: diveintopython3.org ④ +Accept-Encoding: identity ⑤ +User-Agent: Python-urllib/3.1' ⑥ +Connection: close +reply: 'HTTP/1.1 200 OK' +…further debugging information omitted…+
urllib.request relies on another standard Python library, http.client. Normally you don’t need to touch http.client directly. (The urllib.request module imports it automatically.) But we import it here so we can toggle the debugging flag on the HTTPConnection class that urllib.request uses to connect to the HTTP server.
+The urllib.request module sends five lines to the server.
+The Accept-Encoding: identity line asks for uncompressed data; urllib.request does not support compression by default.
+By default, the User-Agent header is Python-urllib plus a version number. Both urllib.request and httplib2 support changing the user agent, simply by adding a User-Agent header to the request (which will override the default value).
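Here is a small sketch of overriding the user agent with urllib.request. The USER_AGENT string and the build_request() helper are invented for this example; pick a value that identifies your own program.

```python
import urllib.request

# Hypothetical user agent string, used only for illustration.
USER_AGENT = 'my-feed-reader/1.0 (+http://example.org/about)'

def build_request(url, user_agent=USER_AGENT):
    # Headers set on the Request take precedence over urllib's defaults.
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

req = build_request('http://diveintopython3.org/examples/feed.xml')
# Request stores header names in capitalized form, hence 'User-agent'.
print(req.get_header('User-agent'))
# Passing req to urllib.request.urlopen() would send it with this header.
```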
+Now let’s look at what the server sent back in its response. + +
+# continued from previous example +>>> print(response.headers.as_string()) ① +Date: Sun, 31 May 2009 19:23:06 GMT ② +Server: Apache +Last-Modified: Sun, 31 May 2009 06:39:55 GMT ③ +ETag: "bfe-93d9c4c0" ④ +Accept-Ranges: bytes +Content-Length: 3070 ⑤ +Cache-Control: max-age=86400 ⑥ +Expires: Mon, 01 Jun 2009 19:23:06 GMT +Vary: Accept-Encoding +Connection: close +Content-Type: application/xml +>>> data = response.read() ⑦ +>>> len(data) +3070+
The response returned from the urllib.request.urlopen() function contains all the HTTP headers the server sent back. It also contains methods to download the actual data; we’ll get to that in a minute.
+This response includes a Last-Modified header.
+This response includes an ETag header.
+The response has no Content-Encoding header. Your request stated that you only accept uncompressed data (Accept-Encoding: identity), and sure enough, this response contains uncompressed data.
+You can download the actual data by calling response.read(). As you can tell from the len() function, this downloads all 3070 bytes at once.
+As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports gzip compression, but HTTP compression is opt-in. We didn’t ask for it, so we didn’t get it. That means we’re downloading 3070 bytes when we could have just downloaded 941. Bad dog, no biscuit. + +
But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time. + +
+# continued from the previous example
+>>> response2 = urlopen('http://diveintopython3.org/examples/feed.xml')
+send: b'GET /examples/feed.xml HTTP/1.1
+Host: diveintopython3.org
+Accept-Encoding: identity
+User-Agent: Python-urllib/3.1'
+Connection: close
+reply: 'HTTP/1.1 200 OK'
+…further debugging information omitted…
+
+Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No sign of If-Modified-Since headers. No sign of If-None-Match headers. No respect for the caching headers. Still no compression.
+
+
And what happens when you do the same thing twice? You get the same response. Twice. + +
+# continued from the previous example +>>> print(response2.headers.as_string()) ① +Date: Mon, 01 Jun 2009 03:58:00 GMT +Server: Apache +Last-Modified: Sun, 31 May 2009 22:51:11 GMT +ETag: "bfe-255ef5c0" +Accept-Ranges: bytes +Content-Length: 3070 +Cache-Control: max-age=86400 +Expires: Tue, 02 Jun 2009 03:58:00 GMT +Vary: Accept-Encoding +Connection: close +Content-Type: application/xml +>>> data2 = response2.read() +>>> len(data2) ② +3070 +>>> data2 == data ③ +True+
The server sent back the same set of caching-related headers: Cache-Control and Expires to allow caching, Last-Modified and ETag to enable “not-modified” tracking. Even the Vary: Accept-Encoding header hints that the server would support compression, if only you would ask for it. But you didn’t.
+HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that speaks HTTP fluently.
+
+
⁂ + +
httplib2
+
+Before you can use httplib2, you’ll need to install it. Visit code.google.com/p/httplib2/ and download the latest version. httplib2 is available for Python 2.x and Python 3.x; make sure you get the Python 3 version, named something like httplib2-python3-0.5.0.zip.
+
+
Unzip the archive, open a terminal window, and go to the newly created httplib2 directory. On Windows, open the Start menu, select Run..., type cmd.exe and press ENTER.
+
+
+c:\Users\pilgrim\Downloads> dir + Volume in drive C has no label. + Volume Serial Number is DED5-B4F8 + + Directory of c:\Users\pilgrim\Downloads + +07/28/2009 12:36 PM <DIR> . +07/28/2009 12:36 PM <DIR> .. +07/28/2009 12:36 PM <DIR> httplib2-python3-0.5.0 +07/28/2009 12:33 PM 18,997 httplib2-python3-0.5.0.zip + 1 File(s) 18,997 bytes + 3 Dir(s) 61,496,684,544 bytes free + +c:\Users\pilgrim\Downloads> cd httplib2-python3-0.5.0 +c:\Users\pilgrim\Downloads\httplib2-python3-0.5.0> c:\python31\python.exe setup.py install +running install +running build +running build_py +running install_lib +creating c:\python31\Lib\site-packages\httplib2 +copying build\lib\httplib2\iri2uri.py -> c:\python31\Lib\site-packages\httplib2 +copying build\lib\httplib2\__init__.py -> c:\python31\Lib\site-packages\httplib2 +byte-compiling c:\python31\Lib\site-packages\httplib2\iri2uri.py to iri2uri.pyc +byte-compiling c:\python31\Lib\site-packages\httplib2\__init__.py to __init__.pyc +running install_egg_info +Writing c:\python31\Lib\site-packages\httplib2-python3_0.5.0-py3.1.egg-info+ +
On Mac OS X, run the Terminal.app application in your /Applications/Utilities/ folder. On Linux, run the Terminal application, which is usually in your Applications menu under Accessories or System.
+
+
+you@localhost:~/Desktop$ unzip httplib2-python3-0.5.0.zip +Archive: httplib2-python3-0.5.0.zip + inflating: httplib2-python3-0.5.0/README + inflating: httplib2-python3-0.5.0/setup.py + inflating: httplib2-python3-0.5.0/PKG-INFO + inflating: httplib2-python3-0.5.0/httplib2/__init__.py + inflating: httplib2-python3-0.5.0/httplib2/iri2uri.py +you@localhost:~/Desktop$ cd httplib2-python3-0.5.0/ +you@localhost:~/Desktop/httplib2-python3-0.5.0$ sudo python3 setup.py install +running install +running build +running build_py +creating build +creating build/lib.linux-x86_64-3.1 +creating build/lib.linux-x86_64-3.1/httplib2 +copying httplib2/iri2uri.py -> build/lib.linux-x86_64-3.1/httplib2 +copying httplib2/__init__.py -> build/lib.linux-x86_64-3.1/httplib2 +running install_lib +creating /usr/local/lib/python3.1/dist-packages/httplib2 +copying build/lib.linux-x86_64-3.1/httplib2/iri2uri.py -> /usr/local/lib/python3.1/dist-packages/httplib2 +copying build/lib.linux-x86_64-3.1/httplib2/__init__.py -> /usr/local/lib/python3.1/dist-packages/httplib2 +byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/iri2uri.py to iri2uri.pyc +byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/__init__.py to __init__.pyc +running install_egg_info +Writing /usr/local/lib/python3.1/dist-packages/httplib2-python3_0.5.0.egg-info+ +
To use httplib2, create an instance of the httplib2.Http class.
+
+
+>>> import httplib2
+>>> h = httplib2.Http('.cache') ①
+>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ②
+>>> response.status ③
+200
+>>> content[:52] ④
+b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
+>>> len(content)
+3070
+The primary interface to httplib2 is the Http object. For reasons you’ll see in the next section, you should always pass a directory name when you create an Http object. The directory does not need to exist; httplib2 will create it if necessary.
+Once you have an Http object, retrieving data is as simple as calling the request() method with the address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll see how to issue other HTTP requests, like POST.)
+The request() method returns two values. The first is an httplib2.Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request was successful.
+The second value is the actual data that the server returned, as a bytes object, not a string. If you want it as a string, you’ll need to determine the character encoding and convert it yourself.
+++ +☞You probably only need one
httplib2.Http object. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use the Http object and just call the request() method twice. +
httplib2 Returns Bytes Instead of Strings
+
+Bytes. Strings. What a pain. Why can’t httplib2 “just” do the conversion for you? Well, it’s complicated, because the rules for determining the character encoding are specific to what kind of resource you’re requesting. How could httplib2 know what kind of resource you’re requesting? It’s usually listed in the Content-Type HTTP header, but that’s an optional feature of HTTP and not all HTTP servers include it. If that header is not included in the HTTP response, it’s left up to the client to guess. (This is commonly called “content sniffing,” and it’s never perfect.)
+
+
If you know what sort of resource you’re expecting (an XML document in this case), perhaps you could “just” pass the returned bytes object to the xml.etree.ElementTree.parse() function. That’ll work as long as the XML document includes information on its own character encoding (as this one does), but that’s an optional feature and not all XML documents do that. If an XML document doesn’t include encoding information, the client is supposed to look at the enclosing transport — i.e. the Content-Type HTTP header, which can include a charset parameter.
+
+
But it’s worse than that. Now character encoding information can be in two places: within the XML document itself, and within the Content-Type HTTP header. If the information is in both places, which one wins? According to RFC 3023 (I swear I am not making this up), if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is
+
+
the encoding given in the charset parameter of the Content-Type HTTP header, or
+the encoding given in the encoding attribute of the XML declaration within the document, or
+UTF-8
+On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is
+
+
the encoding given in the charset parameter of the Content-Type HTTP header, or
+us-ascii
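To make the two precedence rules concrete, here is a rough sketch. The xml_encoding() and split_content_type() helpers are hypothetical names invented for this example. The sketch assumes the media type really is some flavor of XML, falls back to UTF-8 in the application/* case, ignores byte order marks, and only finds an encoding attribute in documents whose XML declaration is ASCII-compatible.

```python
import re
from email.message import Message

# Matches the encoding attribute of an XML declaration at the start of the data.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def split_content_type(content_type):
    """Return (media type, charset or None) from a Content-Type value."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_type(), msg.get_content_charset()

def xml_encoding(data, content_type):
    """Pick an encoding for XML bytes, following the rules described above."""
    media_type, charset = split_content_type(content_type)
    if media_type.startswith('text/'):
        # text/xml and friends: the XML declaration is ignored completely.
        return charset or 'us-ascii'
    # application/xml and friends: header first, then the declaration.
    match = _XML_DECL.match(data)
    declared = match.group(1).decode('ascii').lower() if match else None
    return charset or declared or 'utf-8'

doc = b"<?xml version='1.0' encoding='iso-8859-1'?><feed/>"
print(xml_encoding(doc, 'application/xml'))
print(xml_encoding(doc, 'application/atom+xml; charset=utf-8'))
print(xml_encoding(doc, 'text/xml'))
```

The same bytes come out with three different encodings depending on nothing but the Content-Type header, which is exactly why a general-purpose HTTP library hands you bytes and stays out of it.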
+And that’s just for XML documents. For HTML documents, web browsers have constructed such byzantine rules for content-sniffing [PDF] that we’re still trying to figure them all out. + +
“Patches welcome.” + +
httplib2 Handles Caching
+
+Remember in the previous section when I said you should always create an httplib2.Http object with a directory name? Caching is the reason.
+
+
+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml') ①
+>>> response2.status ②
+200
+>>> content2[:52] ③
+b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
+>>> len(content2)
+3070
+The HTTP status is once again 200, just like last time.
+So… who cares? Quit your Python interactive shell and relaunch it with a new session, and I’ll show you. + +
+# NOT continued from previous example! +# Please exit out of the interactive shell +# and launch a new one. +>>> import httplib2 +>>> httplib2.debuglevel = 1 ① +>>> h = httplib2.Http('.cache') ② +>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml') ③ +>>> len(content) ④ +3070 +>>> response.status ⑤ +200 +>>> response.fromcache ⑥ +True+
Setting httplib2.debuglevel is the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back.
+Create an httplib2.Http object with the same directory name as before.
+The response came from httplib2’s local cache. The directory name you passed in when you created the httplib2.Http object is where httplib2 keeps its cache of all the operations it’s ever performed.
+++ +☞If you want to turn on
httplib2 debugging, you need to set a module-level constant (httplib2.debuglevel), then create a new httplib2.Http object. If you want to turn off debugging, you need to change the same module-level constant, then create a new httplib2.Http object. +
You previously requested the data at this URL. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 24 hours (Cache-Control: max-age=86400, which is 24 hours measured in seconds). httplib2 understands and respects those caching headers, and it stored the previous response in the .cache directory (which you passed in when you created the Http object). That cache hasn’t expired yet, so the second time you request the data at this URL, httplib2 simply returns the cached result without ever hitting the network.
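The freshness arithmetic is simple enough to sketch by hand. This is an illustration only, not how httplib2 does it internally; the function names are invented here, and a real cache would also consider the Expires and Age headers and other Cache-Control directives.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def max_age_seconds(cache_control):
    """Extract max-age from a Cache-Control value, or None if absent."""
    match = re.search(r'max-age\s*=\s*(\d+)', cache_control or '', re.IGNORECASE)
    return int(match.group(1)) if match else None

def fresh_until(date_header, cache_control):
    """Moment at which a response stops being fresh, or None if unknown."""
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return None
    return parsedate_to_datetime(date_header) + timedelta(seconds=max_age)

def is_fresh(date_header, cache_control, now):
    expiry = fresh_until(date_header, cache_control)
    return expiry is not None and now < expiry

# Header values taken from the first urllib response earlier in this chapter.
date_header = 'Sun, 31 May 2009 19:23:06 GMT'
cache_control = 'max-age=86400'
print(fresh_until(date_header, cache_control))

an_hour_later = datetime(2009, 5, 31, 20, 23, 6, tzinfo=timezone.utc)
two_days_later = datetime(2009, 6, 2, 19, 23, 6, tzinfo=timezone.utc)
print(is_fresh(date_header, cache_control, an_hour_later))   # True
print(is_fresh(date_header, cache_control, two_days_later))  # False
```

The computed expiry lands on the same moment as the Expires header in that response, which is the server saying the same thing twice.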
+
+
I say “simply,” but obviously there is a lot of complexity hidden behind that simplicity. httplib2 handles HTTP caching automatically and by default. If for some reason you need to know whether a response came from the cache, you can check response.fromcache. Otherwise, it Just Works.
+
+
Now, suppose you have data cached, but you want to bypass the cache and re-request it from the remote server. Browsers sometimes do this if the user specifically requests it. For example, pressing F5 refreshes the current page, but pressing Ctrl+F5 bypasses the cache and re-requests the current page from the remote server. You might think “oh, I’ll just delete the data from my local cache, then request it again.” You could do that, but remember that there may be more parties involved than just you and the remote server. What about those intermediate proxy servers? They’re completely beyond your control, and they may still have that data cached, and will happily return it to you because (as far as they are concerned) their cache is still valid. + +
Instead of manipulating your local cache and hoping for the best, you should use the features of HTTP to ensure that your request actually reaches the remote server. + +
+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',
+... headers={'cache-control':'no-cache'}) ①
+connect: (diveintopython3.org, 80) ②
+send: b'GET /examples/feed.xml HTTP/1.1
+Host: diveintopython3.org
+user-agent: Python-httplib2/$Rev: 259 $
+accept-encoding: deflate, gzip
+cache-control: no-cache'
+reply: 'HTTP/1.1 200 OK'
+…further debugging information omitted…
+>>> response2.status
+200
+>>> response2.fromcache ③
+False
+>>> print(dict(response2.items())) ④
+{'status': '200',
+ 'content-length': '3070',
+ 'content-location': 'http://diveintopython3.org/examples/feed.xml',
+ 'accept-ranges': 'bytes',
+ 'expires': 'Wed, 03 Jun 2009 00:40:26 GMT',
+ 'vary': 'Accept-Encoding',
+ 'server': 'Apache',
+ 'last-modified': 'Sun, 31 May 2009 22:51:11 GMT',
+ 'connection': 'close',
+ '-content-encoding': 'gzip',
+ 'etag': '"bfe-255ef5c0"',
+ 'cache-control': 'max-age=86400',
+ 'date': 'Tue, 02 Jun 2009 00:40:26 GMT',
+ 'content-type': 'application/xml'}
+httplib2 allows you to add arbitrary HTTP headers to any outgoing request. In order to bypass all caches (not just your local disk cache, but also any caching proxies between you and the remote server), add a no-cache header in the headers dictionary.
+Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions: as part of the incoming response and as part of the outgoing request. It noticed that you added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data.
+The response includes caching headers that httplib2 uses to update its local cache, in the hopes of avoiding network access the next time you request this feed. Everything about HTTP caching is designed to maximize cache hits and minimize network access. Even though you bypassed the cache this time, the remote server would really appreciate it if you would cache the result for next time.
+httplib2 Handles Last-Modified and ETag Headers
+
+The Cache-Control and Expires caching headers are called freshness indicators. They tell caches in no uncertain terms that you can completely avoid all network access until the cache expires. And that’s exactly the behavior you saw in the previous section: given a freshness indicator, httplib2 does not generate a single byte of network activity to serve up cached data (unless you explicitly bypass the cache, of course).
+
+
But what about the case where the data might have changed, but hasn’t? HTTP defines Last-Modified and ETag headers for this purpose. These headers are called validators. If the local cache is no longer fresh, a client can send the validators with the next request to see if the data has actually changed. If the data hasn’t changed, the server sends back a 304 status code and no data. So there’s still a round-trip over the network, but you end up downloading fewer bytes.
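If you were stuck with urllib.request, you could send the validators yourself. The following is a minimal sketch under that assumption: conditional_request() and fetch_if_changed() are names invented for this example, and you would still have to store the ETag, the Last-Modified date, and the cached body somewhere on your own.

```python
import urllib.error
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that carries whatever validators we have."""
    headers = {}
    if etag:
        headers['If-None-Match'] = etag
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    return urllib.request.Request(url, headers=headers)

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (status, body); body is None when the server answers 304."""
    req = conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 Not Modified as an HTTPError
            return 304, None
        raise

# Validator values copied from the response headers shown in this section.
req = conditional_request('http://diveintopython3.org/',
                          etag='"7f806d-1a01-9fb97900"',
                          last_modified='Tue, 02 Jun 2009 02:51:48 GMT')
for name, value in sorted(req.header_items()):
    print(name, value, sep=': ')
```

Only fetch_if_changed() touches the network; building the request does not. That is a fair amount of plumbing for something a fluent HTTP library does without being asked.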
+
+
+>>> import httplib2
+>>> httplib2.debuglevel = 1
+>>> h = httplib2.Http('.cache')
+>>> response, content = h.request('http://diveintopython3.org/') ①
+connect: (diveintopython3.org, 80)
+send: b'GET / HTTP/1.1
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 200 OK'
+>>> print(dict(response.items())) ②
+{'-content-encoding': 'gzip',
+ 'accept-ranges': 'bytes',
+ 'connection': 'close',
+ 'content-length': '6657',
+ 'content-location': 'http://diveintopython3.org/',
+ 'content-type': 'text/html',
+ 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
+ 'etag': '"7f806d-1a01-9fb97900"',
+ 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
+ 'server': 'Apache',
+ 'status': '200',
+ 'vary': 'Accept-Encoding,User-Agent'}
+>>> len(content) ③
+6657
+httplib2 has little to work with, and it sends out a minimum of headers with the request.
+The response includes both an ETag and a Last-Modified header.
+
+# continued from the previous example
+>>> response, content = h.request('http://diveintopython3.org/') ①
+connect: (diveintopython3.org, 80)
+send: b'GET / HTTP/1.1
+Host: diveintopython3.org
+if-none-match: "7f806d-1a01-9fb97900" ②
+if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT ③
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 304 Not Modified' ④
+>>> response.fromcache ⑤
+True
+>>> response.status ⑥
+200
+>>> response.dict['status'] ⑦
+'304'
+>>> len(content) ⑧
+6657
+You request the same page again, with the same Http object (and the same local cache).
+httplib2 sends the ETag validator back to the server in the If-None-Match header.
+httplib2 also sends the Last-Modified validator back to the server in the If-Modified-Since header.
+The server sends back a 304 status code and no data.
+httplib2 notices the 304 status code and loads the content of the page from its cache.
+There are actually two status codes: 304 (returned from the server this time, which caused httplib2 to look in its cache) and 200 (returned from the server last time, and stored in httplib2’s cache along with the page data). response.status returns the status from the cache.
+The raw status returned from the server is available in response.dict, which is a dictionary of the actual headers returned from the server.
+httplib2 is smart enough to let you act dumb. By the time the request() method returns to the caller, httplib2 has already updated its cache and returned the data to you.
+httplib2 Handles Compression
+
+HTTP supports several types of compression; the two most common types are gzip and deflate. httplib2 supports both of these.
+
+
+>>> response, content = h.request('http://diveintopython3.org/')
+connect: (diveintopython3.org, 80)
+send: b'GET / HTTP/1.1
+Host: diveintopython3.org
+accept-encoding: deflate, gzip ①
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 200 OK'
+>>> print(dict(response.items()))
+{'-content-encoding': 'gzip', ②
+ 'accept-ranges': 'bytes',
+ 'connection': 'close',
+ 'content-length': '6657',
+ 'content-location': 'http://diveintopython3.org/',
+ 'content-type': 'text/html',
+ 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
+ 'etag': '"7f806d-1a01-9fb97900"',
+ 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
+ 'server': 'Apache',
+ 'status': '304',
+ 'vary': 'Accept-Encoding,User-Agent'}
+Every time httplib2 sends a request, it includes an Accept-Encoding header to tell the server that it can handle either deflate or gzip compression.
+By the time the request() method returns, httplib2 has already decompressed the body of the response and placed it in the content variable. If you’re curious about whether or not the response was compressed, you can check response['-content-encoding']; otherwise, don’t worry about it.
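For a sense of what is being done on your behalf, here is a sketch of decompressing a response body by hand with the standard library. The decode_body() helper is hypothetical, and it assumes that “deflate” means zlib-wrapped data, which is what the HTTP specification calls for (some servers send raw deflate streams instead, which this sketch does not handle).

```python
import gzip
import zlib

def decode_body(body, content_encoding=None):
    """Undo the Content-Encoding applied to a response body."""
    encoding = (content_encoding or 'identity').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        return zlib.decompress(body)
    if encoding == 'identity':
        return body
    raise ValueError('unsupported Content-Encoding: %r' % content_encoding)

# Made-up, repetitive markup standing in for a feed.
original = b'<entry><title>dive into mark</title></entry>\n' * 50
compressed = gzip.compress(original)

print(len(original), len(compressed))
print(decode_body(compressed, 'gzip') == original)  # True
```

Markup is repetitive, so it compresses well, which is why asking for compression is worth the one extra request header.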
+httplib2 Handles Redirects
+
+HTTP defines two kinds of redirects: temporary and permanent. There’s nothing special to do with temporary redirects except follow them, which httplib2 does automatically.
+
+
+>>> import httplib2
+>>> httplib2.debuglevel = 1
+>>> h = httplib2.Http('.cache')
+>>> response, content = h.request('http://diveintopython3.org/examples/feed-302.xml') ①
+connect: (diveintopython3.org, 80)
+send: b'GET /examples/feed-302.xml HTTP/1.1 ②
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 302 Found' ③
+send: b'GET /examples/feed.xml HTTP/1.1 ④
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 200 OK'
+The server responds with 302 Found. Not shown here, this response also includes a Location header that points to the real URL.
+httplib2 immediately turns around and “follows” the redirect by issuing another request for the URL given in the Location header: http://diveintopython3.org/examples/feed.xml
+“Following” a redirect is nothing more than this example shows. httplib2 sends a request for the URL you asked for. The server comes back with a response that says “No no, look over there instead.” httplib2 sends another request for the new URL.
+
+
+# continued from the previous example +>>> response ① +{'status': '200', + 'content-length': '3070', + 'content-location': 'http://diveintopython3.org/examples/feed.xml', ② + 'accept-ranges': 'bytes', + 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT', + 'vary': 'Accept-Encoding', + 'server': 'Apache', + 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT', + 'connection': 'close', + '-content-encoding': 'gzip', ③ + 'etag': '"bfe-4cbbf5c0"', + 'cache-control': 'max-age=86400', ④ + 'date': 'Wed, 03 Jun 2009 02:21:41 GMT', + 'content-type': 'application/xml'}+
The response you get back from the request() method is the response from the final URL.
+httplib2 adds the final URL to the response dictionary, as content-location. This is not a header that came from the server; it’s specific to httplib2.
+The response you get back gives you information about the final URL. What if you want more information about the intermediate URLs, the ones that eventually redirected to the final URL? httplib2 lets you do that, too.
+
+
+# continued from the previous example +>>> response.previous ① +{'status': '302', + 'content-length': '228', + 'content-location': 'http://diveintopython3.org/examples/feed-302.xml', + 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT', + 'server': 'Apache', + 'connection': 'close', + 'location': 'http://diveintopython3.org/examples/feed.xml', + 'cache-control': 'max-age=86400', + 'date': 'Wed, 03 Jun 2009 02:21:41 GMT', + 'content-type': 'text/html; charset=iso-8859-1'} +>>> type(response) ② +<class 'httplib2.Response'> +>>> type(response.previous) +<class 'httplib2.Response'> +>>> response.previous.previous ③ +>>>+
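The response.previous attribute holds the response that httplib2 followed to get to the current response object.
+Both the current response and the previous response are httplib2.Response objects.
+The previous response has no previous response of its own, so response.previous.previous is None.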
+What happens if you request the same URL again? + +
+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-302.xml') ①
+connect: (diveintopython3.org, 80)
+send: b'GET /examples/feed-302.xml HTTP/1.1 ②
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 302 Found' ③
+>>> content2 == content ④
+True
+Same URL, same httplib2.Http object (and therefore the same cache).
+The 302 response was not cached, so httplib2 sends another request for the same URL.
+Once again, the server responds with a 302. But notice what didn’t happen: there wasn’t ever a second request for the final URL, http://diveintopython3.org/examples/feed.xml. That response was cached (remember the Cache-Control header that you saw in the previous example). Once httplib2 received the 302 Found code, it checked its cache before issuing another request. The cache contained a fresh copy of http://diveintopython3.org/examples/feed.xml, so there was no need to re-request it.
+By the time the request() method returns, it has read the feed data from the cache and returned it. Of course, it’s the same as the data you received last time.
+In other words, you don’t have to do anything special for temporary redirects. httplib2 will follow them automatically, and the fact that one URL redirects to another has no bearing on httplib2’s support for compression, caching, ETags, or any of the other features of HTTP.
+
+
Permanent redirects are just as simple. + +
+# continued from the previous example
+>>> response, content = h.request('http://diveintopython3.org/examples/feed-301.xml') ①
+connect: (diveintopython3.org, 80)
+send: b'GET /examples/feed-301.xml HTTP/1.1
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 301 Moved Permanently' ②
+>>> response.fromcache ③
+True
+This URL permanently redirects to http://diveintopython3.org/examples/feed.xml.
+The server responds with a 301. But again, notice what didn’t happen: there was no request to the redirect URL. Why not? Because it’s already cached locally.
+httplib2 “followed” the redirect right into its cache.
+But wait! There’s more! + +
+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml') ①
+>>> response2.fromcache ②
+True
+>>> content2 == content ③
+True
+
+Once httplib2 follows a permanent redirect, all further requests for that URL will transparently be rewritten to the target URL without hitting the network for the original URL. Remember, debugging is still turned on, yet there is no output of network activity whatsoever.
+HTTP. It works. + +
⁂ + +
HTTP web services are not limited to GET requests. What if you want to create something new? Whenever you post a comment on a discussion forum, update your weblog, publish your status on a microblogging service like Twitter or Identi.ca, you’re probably already using HTTP POST.
+
+
Twitter and Identi.ca both offer a simple HTTP-based API for publishing and updating your status in 140 characters or less. Let’s look at Identi.ca’s API documentation for updating your status: + +
+Identi.ca REST API Method: statuses/update
+Updates the authenticating user’s status. Requires the status parameter specified below. Request must be a POST.
+
+- URL: https://identi.ca/api/statuses/update.format
+- Formats: xml, json, rss, atom
+- HTTP Method(s): POST
+- Requires Authentication: true
+- Parameters: status. Required. The text of your status update. URL-encode as necessary.
+
How does this work? To publish a new message on Identi.ca, you need to issue an HTTP POST request to https://identi.ca/api/statuses/update.format. (The format bit is not part of the URL; you replace it with the data format you want the server to return in response to your request. So if you want a response in XML, you would post the request to https://identi.ca/api/statuses/update.xml.) The request needs to include a parameter called status, which contains the text of your status update. And the request needs to be authenticated.
+
+
Authenticated? Sure. To update your status on Identi.ca, you need to prove who you are. Identi.ca is not a wiki; only you can update your own status. Identi.ca uses HTTP Basic Authentication (a.k.a. RFC 2617) over SSL to provide secure but easy-to-use authentication. httplib2 supports both SSL and HTTP Basic Authentication, so this part is easy.
+
+
A POST request is different from a GET request, because it includes a payload. The payload is the data you want to send to the server. The one piece of data that this API method requires is status, and it should be URL-encoded. This is a very simple serialization format that takes a set of key-value pairs (i.e. a dictionary) and transforms it into a string.
+
+
+>>> from urllib.parse import urlencode ① +>>> data = {'status': 'Test update from Python 3'} ② +>>> urlencode(data) ③ +'status=Test+update+from+Python+3'+
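Python comes with a function for URL-encoding a dictionary: urllib.parse.urlencode().
+The Identi.ca API expects a single parameter, status, whose value is the text of a single status update.
+The URL-encoded string is the payload that will be sent to the server in your HTTP POST request.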
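URL-encoding earns its keep once the text contains characters that mean something special in a query string. Here is a short sketch with a made-up status message, round-tripped through urllib.parse.parse_qs() to show that nothing is lost.

```python
from urllib.parse import parse_qs, urlencode

# A made-up status containing characters that must be escaped.
data = {'status': 'Tea & biscuits at 4pm?'}

payload = urlencode(data)
print(payload)            # status=Tea+%26+biscuits+at+4pm%3F

# Decoding gives back the original text; parse_qs returns a list per key.
print(parse_qs(payload))  # {'status': ['Tea & biscuits at 4pm?']}

# The payload is plain ASCII, so its length in bytes is what a client
# would report in the Content-Length header.
body = payload.encode('ascii')
print(len(body))
```

Without the escaping, the ampersand would be read as the start of a second parameter and the server would see a truncated status.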
++ +
+>>> from urllib.parse import urlencode
+>>> import httplib2
+>>> httplib2.debuglevel = 1
+>>> h = httplib2.Http('.cache')
+>>> data = {'status': 'Test update from Python 3'}
+>>> h.add_credentials('diveintomark', 'MY_SECRET_PASSWORD', 'identi.ca') ①
+>>> resp, content = h.request('https://identi.ca/api/statuses/update.xml',
+... 'POST', ②
+... urlencode(data), ③
+... headers={'Content-Type': 'application/x-www-form-urlencoded'}) ④
+This is how httplib2 handles authentication. Store your username and password with the add_credentials() method. When httplib2 tries to issue the request, the server will respond with a 401 Unauthorized status code, and it will list which authentication methods it supports (in the WWW-Authenticate header). httplib2 will automatically construct an Authorization header and re-request the URL.
+The second parameter is the type of HTTP request, in this case POST.
+++ +☞The third parameter to the
add_credentials() method is the domain in which the credentials are valid. You should always specify this! If you leave out the domain and later reuse the httplib2.Http object on a different authenticated site, httplib2 might end up leaking one site’s username and password to the other site. +
This is what goes over the wire: + +
+# continued from the previous example +send: b'POST /api/statuses/update.xml HTTP/1.1 +Host: identi.ca +Accept-Encoding: identity +Content-Length: 32 +content-type: application/x-www-form-urlencoded +user-agent: Python-httplib2/$Rev: 259 $ + +status=Test+update+from+Python+3' +reply: 'HTTP/1.1 401 Unauthorized' ① +send: b'POST /api/statuses/update.xml HTTP/1.1 ② +Host: identi.ca +Accept-Encoding: identity +Content-Length: 32 +content-type: application/x-www-form-urlencoded +authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2 ③ +user-agent: Python-httplib2/$Rev: 259 $ + +status=Test+update+from+Python+3' +reply: 'HTTP/1.1 200 OK' ④+
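The server answers the first request with a 401 Unauthorized status code. httplib2 will never send authentication headers unless the server explicitly asks for them. This is how the server asks for them.
+httplib2 immediately turns around and requests the same URL a second time.
+This time, the request includes an Authorization header built from the username and password that you stored with the add_credentials() method.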
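The value after the word Basic is not really a hash. Under HTTP Basic authentication it is just the string username:password run through Base64, which anyone can reverse. Here is a minimal sketch of that construction. The helper name basic_auth_header() is hypothetical (httplib2 does this work internally), and the credentials are the well-known sample pair from the HTTP authentication RFCs, not real ones.

```python
import base64

def basic_auth_header(username, password):
    """Return the value of an Authorization header for HTTP Basic auth.

    Hypothetical helper for illustration only.
    """
    credentials = '{0}:{1}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

header = basic_auth_header('Aladdin', 'open sesame')
print(header)     # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# Base64 is an encoding, not encryption: decoding it is trivial.
recovered = base64.b64decode(header.split(' ', 1)[1]).decode('utf-8')
print(recovered)  # Aladdin:open sesame
```

This is why Basic authentication should only ever travel over an encrypted connection, like the https:// URL used in this example.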
+What does the server send back after a successful request? That depends entirely on the web service API. In some protocols (like the Atom Publishing Protocol), the server sends back a 201 Created status code and the location of the newly created resource in the Location header. Identi.ca sends back a 200 OK and an XML document containing information about the newly created resource.
+
+
+# continued from the previous example
+>>> print(content.decode('utf-8')) ①
+<?xml version="1.0" encoding="UTF-8"?>
+<status>
+ <text>Test update from Python 3</text> ②
+ <truncated>false</truncated>
+ <created_at>Wed Jun 10 03:53:46 +0000 2009</created_at>
+ <in_reply_to_status_id></in_reply_to_status_id>
+ <source>api</source>
+ <id>5131472</id> ③
+ <in_reply_to_user_id></in_reply_to_user_id>
+ <in_reply_to_screen_name></in_reply_to_screen_name>
+ <favorited>false</favorited>
+ <user>
+ <id>3212</id>
+ <name>Mark Pilgrim</name>
+ <screen_name>diveintomark</screen_name>
+ <location>27502, US</location>
+ <description>tech writer, husband, father</description>
+ <profile_image_url>http://avatar.identi.ca/3212-48-20081216000626.png</profile_image_url>
+ <url>http://diveintomark.org/</url>
+ <protected>false</protected>
+ <followers_count>329</followers_count>
+ <profile_background_color></profile_background_color>
+ <profile_text_color></profile_text_color>
+ <profile_link_color></profile_link_color>
+ <profile_sidebar_fill_color></profile_sidebar_fill_color>
+ <profile_sidebar_border_color></profile_sidebar_border_color>
+ <friends_count>2</friends_count>
+ <created_at>Wed Jul 02 22:03:58 +0000 2008</created_at>
+ <favourites_count>30768</favourites_count>
+ <utc_offset>0</utc_offset>
+ <time_zone>UTC</time_zone>
+ <profile_background_image_url></profile_background_image_url>
+ <profile_background_tile>false</profile_background_tile>
+ <statuses_count>122</statuses_count>
+ <following>false</following>
+ <notifications>false</notifications>
+</user>
+</status>
+Remember, the data returned by httplib2 is always bytes, not a string. To convert it to a string, you need to decode it using the proper character encoding. Identi.ca’s API always returns results in UTF-8, so that part is easy.
+And here it is: + +
+
+
⁂ + +
HTTP isn’t limited to GET and POST. Those are certainly the most common types of requests, especially in web browsers. But web service APIs can go beyond GET and POST, and httplib2 is ready.
+
+
+# continued from the previous example
+>>> from xml.etree import ElementTree as etree
+>>> tree = etree.fromstring(content)  ①
+>>> status_id = tree.findtext('id')  ②
+>>> status_id
+'5131472'
+>>> url = 'https://identi.ca/api/statuses/destroy/{0}.xml'.format(status_id)  ③
+>>> resp, deleted_content = h.request(url, 'DELETE')  ④
+
The findtext() method finds the first instance of the given expression and extracts its text content. In this case, we’re just looking for an <id> element.
+Based on the text content of the <id> element, we can construct a URL to delete the status message we just published.
+To delete the message, you simply issue an HTTP DELETE request to that URL.
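You can try the parsing steps offline. This sketch feeds ElementTree a trimmed-down copy of the response shown earlier (as bytes, just as httplib2 would return it) and builds the same URL. Nothing is sent over the network, and the shortened XML document is an assumption made for brevity.

```python
from xml.etree import ElementTree as etree

# A trimmed copy of the server response, as bytes.
content = b'''<?xml version="1.0" encoding="UTF-8"?>
<status>
  <text>Test update from Python 3</text>
  <id>5131472</id>
  <user>
    <id>3212</id>
    <screen_name>diveintomark</screen_name>
  </user>
</status>'''

tree = etree.fromstring(content)

# 'id' matches only direct children of <status>, so the <id>
# nested inside <user> is not picked up by mistake.
status_id = tree.findtext('id')
print(status_id)               # 5131472

# A path expression reaches the nested element when you want it.
print(tree.findtext('user/id'))  # 3212

url = 'https://identi.ca/api/statuses/destroy/{0}.xml'.format(status_id)
print(url)
```

Note that findtext() returns None when nothing matches, so real code should check the result before formatting it into a URL.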
+This is what goes over the wire: + +
+send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1  ①
+Host: identi.ca
+Accept-Encoding: identity
+user-agent: Python-httplib2/$Rev: 259 $
+
+'
+reply: 'HTTP/1.1 401 Unauthorized'  ②
+send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1  ③
+Host: identi.ca
+Accept-Encoding: identity
+authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2  ④
+user-agent: Python-httplib2/$Rev: 259 $
+
+'
+reply: 'HTTP/1.1 200 OK'  ⑤
+>>> resp.status
+200
+
And just like that, poof, it’s gone. + +
+
+
⁂ + +
httplib2:
+
+
httplib2 project page
+httplib2 code examples
+httplib2
+httplib2: HTTP Persistence and Authentication
+HTTP caching: + +
RFCs: + +
© 2001–10 Mark Pilgrim
+
+
+
diff --git a/installing-python.html b/installing-python.html
index 4793167..59df064 100755
--- a/installing-python.html
+++ b/installing-python.html
@@ -1,364 +1,364 @@
-
-
-
Difficulty level: ♦♢♢♢♢ -
--❝ Tempora mutantur nos et mutamur in illis. (Times change, and we change with them.) ❞
— ancient Roman proverb -
-
Before you can start programming in Python 3, you need to install it. Or do you? - -
If you're using an account on a hosted server, your ISP may have already installed Python 3. If you’re running Linux at home, you may already have Python 3, too. Most popular GNU/Linux distributions come with Python 2 in the default installation; a small but growing number of distributions also include Python 3. Mac OS X includes a command-line version of Python 2, but as of this writing it does not include Python 3. Microsoft Windows does not come with any version of Python. But don’t despair! You can point-and-click your way through installing Python, regardless of what operating system you have. - -
The easiest way to check for Python 3 on your Linux or Mac OS X system is to get to a command line. On Linux, look in your Applications menu for a program called Terminal. (It may be in a submenu like Accessories or System.) On Mac OS X, there is an application called Terminal.app in your /Applications/Utilities/ folder.
-
-
Once you’re at a command line prompt, just type python3 (all lowercase, no spaces) and see what happens. On my home Linux system, Python 3 is already installed, and this command gets me into the Python interactive shell. - -
-mark@atlantis:~$ python3
-Python 3.0.1+ (r301:69556, Apr 15 2009, 17:25:52)
-[GCC 4.3.3] on linux2
-Type "help", "copyright", "credits" or "license" for more information.
->>>
-
-
(Type exit() and press ENTER to exit the Python interactive shell.) - -
My web hosting provider also runs Linux and provides command-line access, but my server does not have Python 3 installed. (Boo!) - -
-mark@manganese:~$ python3
-bash: python3: command not found
-
-
So back to the question that started this section, “Which Python is right for you?” Whichever one runs on the computer you already have. - -
[Read on for Windows instructions, or skip to Installing on Mac OS X, Installing on Ubuntu Linux, or Installing on Other Platforms.] - -
⁂ - -
Windows comes in two architectures these days: 32-bit and 64-bit. Of course, there are lots of different versions of Windows — XP, Vista, Windows 7 — but Python runs on all of them. The more important distinction is 32-bit v. 64-bit. If you have no idea what architecture you’re running, it’s probably 32-bit. - -
Visit python.org/download/ and download the appropriate Python 3 Windows installer for your architecture. Your choices will look something like this:
-
-
I don’t want to include direct download links here, because minor updates of Python happen all the time and I don’t want to be responsible for you missing important updates. You should always install the most recent version of Python 3.x unless you have some esoteric reason not to. - -
-
Once your download is complete, double-click the .msi file. Windows will pop up a security alert, since you’re about to be running executable code. The official Python installer is digitally signed by the Python Software Foundation, the non-profit corporation that oversees Python development. Don’t accept imitations!
-
Click the Run button to launch the Python 3 installer.
-
-
-
The first question the installer will ask you is whether you want to install Python 3 for all users or just for you. The default choice is “install for all users,” which is the best choice unless you have a good reason to choose otherwise. (One possible reason why you would want to “install just for me” is that you are installing Python on your company’s computer and you don’t have administrative rights on your Windows account. But then, why are you installing Python without permission from your company’s Windows administrator? Don’t get me in trouble here!) -
Click the Next button to accept your choice of installation type.
-
-
-
Next, the installer will prompt you to choose a destination directory. The default for all versions of Python 3.1.x is C:\Python31\, which should work well for most users unless you have a specific reason to change it. If you maintain a separate drive letter for installing applications, you can browse to it using the embedded controls, or simply type the pathname in the box below. You are not limited to installing Python on the C: drive; you can install it on any drive, in any folder.
-
Click the Next button to accept your choice of destination directory.
-
-
-
The next page looks complicated, but it’s not really. As with many installers, you have the option not to install every single component of Python 3. If disk space is especially tight, you can exclude certain components. -
.py files) and run them. Recommended but not required. (This option doesn’t require any disk space, so there is little point in excluding it.)
-docs.python.org. Recommended if you are on dialup or have limited Internet access.
-2to3.py script which you’ll learn about later in this book. Required if you want to learn about migrating existing Python 2 code to Python 3. If you have no existing Python 2 code, you can skip this option.
-
-
If you’re unsure how much disk space you have, click the Disk Usage button. The installer will list your drive letters, compute how much space is available on each drive, and calculate how much would be left after installation.
-
Click the OK button to return to the “Customizing Python” page.
-
-
-
If you decide to exclude an option, select the drop-down button before the option and select “Entire feature will be unavailable.” For example, excluding the test suite will save you a whopping 7908KB of disk space. -
Click the Next button to accept your choice of options.
-
-
-
The installer will copy all the necessary files to your chosen destination directory. (This happens so quickly, I had to try it three times to even get a screenshot of it!) - -
-
Click the Finish button to exit the installer.
-
-
-
In your Start menu, there should be a new item called Python 3.1. Within that, there is a program called IDLE. Select this item to run the interactive Python Shell.
-
-
[Skip to using the Python Shell] - -
⁂ - -
All modern Macintosh computers use the Intel chip (like most Windows PCs). Older Macs used PowerPC chips. You don’t need to understand the difference, because there’s just one Mac Python installer for all Macs. - -
Visit python.org/download/ and download the Mac installer. It will be called something like Python 3.1 Mac Installer Disk Image, although the version number may vary. Be sure to download version 3.x, not 2.x.
-
-
-
Your browser should automatically mount the disk image and open a Finder window to show you the contents. (If this doesn’t happen, you’ll need to find the disk image in your downloads folder and double-click to mount it. It will be named something like python-3.1.dmg.) The disk image contains a number of text files (Build.txt, License.txt, ReadMe.txt), and the actual installer package, Python.mpkg.
-
Double-click the Python.mpkg installer package to launch the Mac Python installer.
-
-
-
The first page of the installer gives a brief description of Python itself, then refers you to the ReadMe.txt file (which you didn’t read, did you?) for more details.
-
Click the Continue button to move along.
-
-
-
The next page actually contains some important information: Python requires Mac OS X 10.3 or later. If you are still running Mac OS X 10.2, you should really upgrade. Apple no longer provides security updates for your operating system, and your computer is probably at risk if you ever go online. Also, you can’t run Python 3. -
Click the Continue button to advance.
-
-
-
Like all good installers, the Python installer displays the software license agreement. Python is open source, and its license is approved by the Open Source Initiative. Python has had a number of owners and sponsors throughout its history, each of which has left its mark on the software license. But the end result is this: Python is open source, and you may use it on any platform, for any purpose, without fee or obligation of reciprocity. -
Click the Continue button once again.
-
-
-
Due to quirks in the standard Apple installer framework, you must “agree” to the software license in order to complete the installation. Since Python is open source, you are really “agreeing” that the license is granting you additional rights, rather than taking them away. -
Click the Agree button to continue.
-
-
-
The next screen allows you to change your install location. You must install Python on your boot drive, but due to limitations of the installer, it does not enforce this. In truth, I have never had the need to change the install location. -
From this screen, you can also customize the installation to exclude certain features. If you want to do this, click the Customize button; otherwise click the Install button.
-
-
-
If you choose a Custom Install, the installer will present you with the following list of features: -
python3 application. I strongly recommend keeping this option, too.
-docs.python.org. Recommended if you are on dialup or have limited Internet access.
-Terminal.app) to ensure that this version of Python is on the search path of your shell. You probably don’t need to change this.
-Click the Install button to continue.
-
-
-
Because it installs system-wide frameworks and binaries in /usr/local/bin/, the installer will ask you for an administrative password. There is no way to install Mac Python without administrator privileges.
-
Click the OK button to begin the installation.
-
-
-
The installer will display a progress meter while it installs the features you’ve selected. - -
-
Assuming all went well, the installer will give you a big green checkmark to tell you that the installation completed successfully. -
Click the Close button to exit the installer.
-
-
-
Assuming you didn’t change the install location, you can find the newly installed files in the Python 3.1 folder within your /Applications folder. The most important piece is IDLE, the graphical Python Shell.
-
Double-click IDLE to launch the Python Shell. - -
-
The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. - -
[Skip to using the Python Shell] - -
⁂ - -
Modern Linux distributions are backed by vast repositories of precompiled applications, ready to install. The exact details vary by distribution. In Ubuntu Linux, the easiest way to install Python 3 is through the Add/Remove application in your Applications menu.
-
-
-
When you first launch the Add/Remove application, it will show you a list of preselected applications in different categories. Some are already installed; most are not. Because the repository contains over 10,000 applications, there are different filters you can apply to see small parts of the repository. The default filter is “Canonical-maintained applications,” which is a small subset of the total number of applications that are officially supported by Canonical, the company that creates and maintains Ubuntu Linux.
-
-
-
Python 3 is not maintained by Canonical, so the first step is to drop down this filter menu and select “All Open Source applications.” - -
-
Once you’ve widened the filter to include all open source applications, use the Search box immediately after the filter menu to search for Python 3. - -
-
Now the list of applications narrows to just those matching Python 3. You’re going to check two packages. The first is Python (v3.0). This contains the Python interpreter itself.
-
-
The second package you want is immediately above: IDLE (using Python-3.0). This is a graphical Python Shell that you will use throughout this book.
-
After you’ve checked those two packages, click the Apply Changes button to continue.
-
-
-
The package manager will ask you to confirm that you want to add both IDLE (using Python-3.0) and Python (v3.0).
-
Click the Apply button to continue.
-
-
-
The package manager will show you a progress meter while it downloads the necessary packages from Canonical’s Internet repository. - -
-
Once the packages are downloaded, the package manager will automatically begin installing them. - -
-
If all went well, the package manager will confirm that both packages were successfully installed. From here, you can double-click IDLE to launch the Python Shell, or click the Close button to exit the package manager.
-
You can always relaunch the Python Shell by going to your Applications menu, then the Programming submenu, and selecting IDLE.
-
-
-
The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. - -
[Skip to using the Python Shell] - -
⁂ - -
Python 3 is available on a number of different platforms. In particular, it is available in virtually every Linux, BSD, and Solaris-based distribution. For example, RedHat Linux uses the yum package manager; FreeBSD has its ports and packages collection; Solaris has pkgadd and friends. A quick web search for Python 3 + your operating system will tell you whether a Python 3 package is available, and how to install it.
-
-
⁂ - -
The Python Shell is where you can explore Python syntax, get interactive help on commands, and debug short programs. The graphical Python Shell (named IDLE) also contains a decent text editor that supports Python syntax coloring and integrates with the Python Shell. If you don’t already have a favorite text editor, you should give IDLE a try. - -
First things first. The Python Shell itself is an amazing interactive playground. Throughout this book, you’ll see examples like this: - -
->>> 1 + 1
-2
-
-
The three angle brackets, >>>, denote the Python Shell prompt. Don’t type that part. That’s just to let you know that this example is meant to be followed in the Python Shell. - -
1 + 1 is the part you type. You can type any valid Python expression or command in the Python Shell. Don’t be shy; it won’t bite! The worst that will happen is you’ll get an error message. Commands get executed immediately (once you press ENTER); expressions get evaluated immediately, and the Python Shell prints out the result. - -
2 is the result of evaluating this expression. As it happens, 1 + 1 is a valid Python expression. The result, of course, is 2. - -
Let’s try another one. - -
->>> print('Hello world!')
-Hello world!
-
-
-Pretty simple, no? But there’s lots more you can do in the Python shell. If you ever get stuck — you can’t remember a command, or you can’t remember the proper arguments to pass a certain function — you can get interactive help in the Python Shell. Just type help and press ENTER. - -
->>> help
-Type help() for interactive help, or help(object) for help about object.
-
-
There are two modes of help. You can get help about a single object, which just prints out the documentation and returns you to the Python Shell prompt. You can also enter help mode, where instead of evaluating Python expressions, you just type keywords or command names and it will print out whatever it knows about that command. - -
To enter the interactive help mode, type help() and press ENTER. - -
->>> help()
-Welcome to Python 3.0! This is the online help utility.
-
-If this is your first time using Python, you should definitely check out
-the tutorial on the Internet at http://docs.python.org/tutorial/.
-
-Enter the name of any module, keyword, or topic to get help on writing
-Python programs and using Python modules. To quit this help utility and
-return to the interpreter, just type "quit".
-
-To get a list of available modules, keywords, or topics, type "modules",
-"keywords", or "topics". Each module also comes with a one-line summary
-of what it does; to list the modules whose summaries contain a given word
-such as "spam", type "modules spam".
-
-help>
-
-
Note how the prompt changes from >>> to help>. This reminds you that you’re in the interactive help mode. Now you can enter any keyword, command, module name, function name — pretty much anything Python understands — and read documentation on it. - -
-help> print  ①
-Help on built-in function print in module builtins:
-
-print(...)
-    print(value, ..., sep=' ', end='\n', file=sys.stdout)
-
-    Prints the values to a stream, or to sys.stdout by default.
-    Optional keyword arguments:
-    file: a file-like object (stream); defaults to the current sys.stdout.
-    sep: string inserted between values, default a space.
-    end: string appended after the last value, default a newline.
-
-help> PapayaWhip  ②
-no Python documentation found for 'PapayaWhip'
-
-help> quit  ③
-
-You are now leaving help and returning to the Python interpreter.
-If you want to ask for help on a particular object directly from the
-interpreter, you can type "help(object)". Executing "help('string')"
-has the same effect as typing a particular string at the help> prompt.
->>>  ④
-
print() function, just type print and press ENTER. The interactive help mode will display something akin to a man page: the function name, a brief synopsis, the function’s arguments and their default values, and so on. If the documentation seems opaque to you, don’t panic. You’ll learn more about all these concepts in the next few chapters.
-IDLE, the graphical Python Shell, also includes a Python-aware text editor. - -
⁂ - -
IDLE is not the only game in town when it comes to writing programs in Python. While it’s useful to get started with learning the language itself, many developers prefer other text editors or Integrated Development Environments (IDEs). I won’t cover them here, but the Python community maintains a list of Python-aware editors that covers a wide range of supported platforms and software licenses. - -
You might also want to check out the list of Python-aware IDEs, although few of them support Python 3 yet. One that does is PyDev, a plugin for Eclipse that turns Eclipse into a full-fledged Python IDE. Both Eclipse and PyDev are cross-platform and open source. - -
On the commercial front, there is ActiveState’s Komodo IDE. It has per-user licensing, but students can get a discount, and a free time-limited trial version is available. - -
I’ve been programming in Python for nine years, and I edit my Python programs in GNU Emacs and debug them in the command-line Python Shell. There’s no right or wrong way to develop in Python. Find a way that works for you! - -
© 2001–10 Mark Pilgrim - - - + + +
Difficulty level: ♦♢♢♢♢ +
++❝ Tempora mutantur nos et mutamur in illis. (Times change, and we change with them.) ❞
— ancient Roman proverb +
+
Before you can start programming in Python 3, you need to install it. Or do you? + +
If you're using an account on a hosted server, your ISP may have already installed Python 3. If you’re running Linux at home, you may already have Python 3, too. Most popular GNU/Linux distributions come with Python 2 in the default installation; a small but growing number of distributions also include Python 3. Mac OS X includes a command-line version of Python 2, but as of this writing it does not include Python 3. Microsoft Windows does not come with any version of Python. But don’t despair! You can point-and-click your way through installing Python, regardless of what operating system you have. + +
The easiest way to check for Python 3 on your Linux or Mac OS X system is to get to a command line. On Linux, look in your Applications menu for a program called Terminal. (It may be in a submenu like Accessories or System.) On Mac OS X, there is an application called Terminal.app in your /Applications/Utilities/ folder.
+
+
Once you’re at a command line prompt, just type python3 (all lowercase, no spaces) and see what happens. On my home Linux system, Python 3 is already installed, and this command gets me into the Python interactive shell. + +
+mark@atlantis:~$ python3
+Python 3.0.1+ (r301:69556, Apr 15 2009, 17:25:52)
+[GCC 4.3.3] on linux2
+Type "help", "copyright", "credits" or "license" for more information.
+>>>
+
+
(Type exit() and press ENTER to exit the Python interactive shell.) + +
My web hosting provider also runs Linux and provides command-line access, but my server does not have Python 3 installed. (Boo!) + +
+mark@manganese:~$ python3
+bash: python3: command not found
+
+
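If you already have some Python 3 interpreter running, you can ask the same question from inside Python. This is only a sketch: it assumes Python 3.3 or later, where shutil.which() is available, and the variable names are illustrative.

```python
import shutil
import sys

# Look for an executable named python3 on the shell's search path,
# much as the shell does when you type the command.
python3_path = shutil.which('python3')
if python3_path is None:
    print('python3: command not found')
else:
    print('python3 found at', python3_path)

# The interpreter running this script can also report its own version.
print('Running Python {0}.{1}.{2}'.format(*sys.version_info[:3]))
is_python3 = sys.version_info[0] >= 3
```

The first check tells you what typing python3 at the command line would find; the second tells you which version is actually executing the script, which is not necessarily the same interpreter.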
So back to the question that started this section, “Which Python is right for you?” Whichever one runs on the computer you already have. + +
[Read on for Windows instructions, or skip to Installing on Mac OS X, Installing on Ubuntu Linux, or Installing on Other Platforms.] + +
⁂ + +
Windows comes in two architectures these days: 32-bit and 64-bit. Of course, there are lots of different versions of Windows — XP, Vista, Windows 7 — but Python runs on all of them. The more important distinction is 32-bit v. 64-bit. If you have no idea what architecture you’re running, it’s probably 32-bit. + +
Visit python.org/download/ and download the appropriate Python 3 Windows installer for your architecture. Your choices will look something like this:
+
+
I don’t want to include direct download links here, because minor updates of Python happen all the time and I don’t want to be responsible for you missing important updates. You should always install the most recent version of Python 3.x unless you have some esoteric reason not to. + +
+
Once your download is complete, double-click the .msi file. Windows will pop up a security alert, since you’re about to be running executable code. The official Python installer is digitally signed by the Python Software Foundation, the non-profit corporation that oversees Python development. Don’t accept imitations!
+
Click the Run button to launch the Python 3 installer.
+
+
+
The first question the installer will ask you is whether you want to install Python 3 for all users or just for you. The default choice is “install for all users,” which is the best choice unless you have a good reason to choose otherwise. (One possible reason why you would want to “install just for me” is that you are installing Python on your company’s computer and you don’t have administrative rights on your Windows account. But then, why are you installing Python without permission from your company’s Windows administrator? Don’t get me in trouble here!) +
Click the Next button to accept your choice of installation type.
+
+
+
Next, the installer will prompt you to choose a destination directory. The default for all versions of Python 3.1.x is C:\Python31\, which should work well for most users unless you have a specific reason to change it. If you maintain a separate drive letter for installing applications, you can browse to it using the embedded controls, or simply type the pathname in the box below. You are not limited to installing Python on the C: drive; you can install it on any drive, in any folder.
+
Click the Next button to accept your choice of destination directory.
+
+
+
The next page looks complicated, but it’s not really. As with many installers, you have the option not to install every single component of Python 3. If disk space is especially tight, you can exclude certain components. +
.py files) and run them. Recommended but not required. (This option doesn’t require any disk space, so there is little point in excluding it.)
+docs.python.org. Recommended if you are on dialup or have limited Internet access.
+2to3.py script which you’ll learn about later in this book. Required if you want to learn about migrating existing Python 2 code to Python 3. If you have no existing Python 2 code, you can skip this option.
+
+
If you’re unsure how much disk space you have, click the Disk Usage button. The installer will list your drive letters, compute how much space is available on each drive, and calculate how much would be left after installation.
+
Click the OK button to return to the “Customizing Python” page.
+
+
+
If you decide to exclude an option, select the drop-down button before the option and select “Entire feature will be unavailable.” For example, excluding the test suite will save you a whopping 7908KB of disk space. +
Click the Next button to accept your choice of options.
+
+
+
The installer will copy all the necessary files to your chosen destination directory. (This happens so quickly, I had to try it three times to even get a screenshot of it!) + +
+
Click the Finish button to exit the installer.
+
+
+
In your Start menu, there should be a new item called Python 3.1. Within that, there is a program called IDLE. Select this item to run the interactive Python Shell.
+
+
[Skip to using the Python Shell] + +
⁂ + +
All modern Macintosh computers use the Intel chip (like most Windows PCs). Older Macs used PowerPC chips. You don’t need to understand the difference, because there’s just one Mac Python installer for all Macs. + +
Visit python.org/download/ and download the Mac installer. It will be called something like Python 3.1 Mac Installer Disk Image, although the version number may vary. Be sure to download version 3.x, not 2.x.
+
+
+
Your browser should automatically mount the disk image and open a Finder window to show you the contents. (If this doesn’t happen, you’ll need to find the disk image in your downloads folder and double-click to mount it. It will be named something like python-3.1.dmg.) The disk image contains a number of text files (Build.txt, License.txt, ReadMe.txt), and the actual installer package, Python.mpkg.
+
Double-click the Python.mpkg installer package to launch the Mac Python installer.
+
+
+
The first page of the installer gives a brief description of Python itself, then refers you to the ReadMe.txt file (which you didn’t read, did you?) for more details.
+
Click the Continue button to move along.
+
+
+
The next page actually contains some important information: Python requires Mac OS X 10.3 or later. If you are still running Mac OS X 10.2, you should really upgrade. Apple no longer provides security updates for your operating system, and your computer is probably at risk if you ever go online. Also, you can’t run Python 3. +
Click the Continue button to advance.
+
+
+
Like all good installers, the Python installer displays the software license agreement. Python is open source, and its license is approved by the Open Source Initiative. Python has had a number of owners and sponsors throughout its history, each of which has left its mark on the software license. But the end result is this: Python is open source, and you may use it on any platform, for any purpose, without fee or obligation of reciprocity. +
Click the Continue button once again.
+
+
+
Due to quirks in the standard Apple installer framework, you must “agree” to the software license in order to complete the installation. Since Python is open source, you are really “agreeing” that the license is granting you additional rights, rather than taking them away. +
Click the Agree button to continue.
+
+
+
The next screen allows you to change your install location. You must install Python on your boot drive, but due to limitations of the installer, it does not enforce this. In truth, I have never had the need to change the install location. +
From this screen, you can also customize the installation to exclude certain features. If you want to do this, click the Customize button; otherwise click the Install button.
+
+
+
If you choose a Custom Install, the installer will present you with the following list of features: +
python3 application. I strongly recommend keeping this option, too.
+docs.python.org. Recommended if you are on dialup or have limited Internet access.
+Terminal.app) to ensure that this version of Python is on the search path of your shell. You probably don’t need to change this.
+Click the Install button to continue.
+
+
+
Because it installs system-wide frameworks and binaries in /usr/local/bin/, the installer will ask you for an administrative password. There is no way to install Mac Python without administrator privileges.
+
Click the OK button to begin the installation.
+
+
+
The installer will display a progress meter while it installs the features you’ve selected. + +
+
Assuming all went well, the installer will give you a big green checkmark to tell you that the installation completed successfully. +
Click the Close button to exit the installer.
+
+
+
Assuming you didn’t change the install location, you can find the newly installed files in the Python 3.1 folder within your /Applications folder. The most important piece is IDLE, the graphical Python Shell.
+
Double-click IDLE to launch the Python Shell. + +
+
The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. + +
⁂ + +
Modern Linux distributions are backed by vast repositories of precompiled applications, ready to install. The exact details vary by distribution. In Ubuntu Linux, the easiest way to install Python 3 is through the Add/Remove application in your Applications menu.
+
+
+
When you first launch the Add/Remove application, it will show you a list of preselected applications in different categories. Some are already installed; most are not. Because the repository contains over 10,000 applications, there are different filters you can apply to see small parts of the repository. The default filter is “Canonical-maintained applications,” a small subset of the repository: only the applications that are officially supported by Canonical, the company that creates and maintains Ubuntu Linux.
+
+
+
Python 3 is not maintained by Canonical, so the first step is to drop down this filter menu and select “All Open Source applications.” + +
+
Once you’ve widened the filter to include all open source applications, use the Search box immediately after the filter menu to search for Python 3. + +
+
Now the list of applications narrows to just those matching Python 3. You’re going to check two packages. The first is Python (v3.0). This contains the Python interpreter itself.
+
+
The second package you want is immediately above: IDLE (using Python-3.0). This is a graphical Python Shell that you will use throughout this book.
+
After you’ve checked those two packages, click the Apply Changes button to continue.
+
+
+
The package manager will ask you to confirm that you want to add both IDLE (using Python-3.0) and Python (v3.0).
+
Click the Apply button to continue.
+
+
+
The package manager will show you a progress meter while it downloads the necessary packages from Canonical’s Internet repository. + +
+
Once the packages are downloaded, the package manager will automatically begin installing them. + +
+
If all went well, the package manager will confirm that both packages were successfully installed. From here, you can double-click IDLE to launch the Python Shell, or click the Close button to exit the package manager.
+
You can always relaunch the Python Shell by going to your Applications menu, then the Programming submenu, and selecting IDLE.
+
+
+
The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. + +
⁂ + +
Python 3 is available on a number of different platforms. In particular, it is available in virtually every Linux, BSD, and Solaris-based distribution. For example, Red Hat Linux uses the yum package manager; FreeBSD has its ports and packages collection; Solaris has pkgadd and friends. A quick web search for Python 3 + your operating system will tell you whether a Python 3 package is available, and how to install it.
+
+
⁂ + +
The Python Shell is where you can explore Python syntax, get interactive help on commands, and debug short programs. The graphical Python Shell (named IDLE) also contains a decent text editor that supports Python syntax coloring and integrates with the Python Shell. If you don’t already have a favorite text editor, you should give IDLE a try. + +
First things first. The Python Shell itself is an amazing interactive playground. Throughout this book, you’ll see examples like this: + +
+>>> 1 + 1 +2+ +
The three angle brackets, >>>, denote the Python Shell prompt. Don’t type that part. That’s just to let you know that this example is meant to be followed in the Python Shell. + +
1 + 1 is the part you type. You can type any valid Python expression or command in the Python Shell. Don’t be shy; it won’t bite! The worst that will happen is you’ll get an error message. Commands get executed immediately (once you press ENTER); expressions get evaluated immediately, and the Python Shell prints out the result. + +
2 is the result of evaluating this expression. As it happens, 1 + 1 is a valid Python expression. The result, of course, is 2. + +
Let’s try another one. + +
+>>> print('Hello world!')
+Hello world!
+
+
+Pretty simple, no? But there’s lots more you can do in the Python Shell. If you ever get stuck (you can’t remember a command, or you can’t remember the proper arguments to pass to a certain function), you can get interactive help in the Python Shell. Just type help and press ENTER. + +
+>>> help +Type help() for interactive help, or help(object) for help about object.+ +
There are two modes of help. You can get help about a single object, which just prints out the documentation and returns you to the Python Shell prompt. You can also enter help mode, where instead of evaluating Python expressions, you just type keywords or command names and it will print out whatever it knows about that command. + +
To enter the interactive help mode, type help() and press ENTER. + +
+>>> help() +Welcome to Python 3.0! This is the online help utility. + +If this is your first time using Python, you should definitely check out +the tutorial on the Internet at http://docs.python.org/tutorial/. + +Enter the name of any module, keyword, or topic to get help on writing +Python programs and using Python modules. To quit this help utility and +return to the interpreter, just type "quit". + +To get a list of available modules, keywords, or topics, type "modules", +"keywords", or "topics". Each module also comes with a one-line summary +of what it does; to list the modules whose summaries contain a given word +such as "spam", type "modules spam". + +help>+ +
Note how the prompt changes from >>> to help>. This reminds you that you’re in the interactive help mode. Now you can enter any keyword, command, module name, function name — pretty much anything Python understands — and read documentation on it. + +
+help> print ① +Help on built-in function print in module builtins: + +print(...) + print(value, ..., sep=' ', end='\n', file=sys.stdout) + + Prints the values to a stream, or to sys.stdout by default. + Optional keyword arguments: + file: a file-like object (stream); defaults to the current sys.stdout. + sep: string inserted between values, default a space. + end: string appended after the last value, default a newline. + +help> PapayaWhip ② +no Python documentation found for 'PapayaWhip' + +help> quit ③ + +You are now leaving help and returning to the Python interpreter. +If you want to ask for help on a particular object directly from the +interpreter, you can type "help(object)". Executing "help('string')" +has the same effect as typing a particular string at the help> prompt. +>>> ④+
To get documentation on the print() function, just type print and press ENTER. The interactive help mode will display something akin to a man page: the function name, a brief synopsis, the function’s arguments and their default values, and so on. If the documentation seems opaque to you, don’t panic. You’ll learn more about all these concepts in the next few chapters.
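The documentation shown in help mode comes from the object’s docstring, so it can also be read from ordinary code. A minimal sketch, assuming only the standard inspect module (the variable name doc is illustrative):

```python
import inspect

# help(print) pages documentation to the screen; inspect.getdoc()
# returns the underlying docstring as a plain string instead.
doc = inspect.getdoc(print)

# Show just the first line of the documentation.
print(doc.splitlines()[0])
```

This is handy when you want to search or log documentation rather than read it interactively.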
+IDLE, the graphical Python Shell, also includes a Python-aware text editor. + +
⁂ + +
IDLE is not the only game in town when it comes to writing programs in Python. While it’s useful to get started with learning the language itself, many developers prefer other text editors or Integrated Development Environments (IDEs). I won’t cover them here, but the Python community maintains a list of Python-aware editors that covers a wide range of supported platforms and software licenses. + +
You might also want to check out the list of Python-aware IDEs, although few of them support Python 3 yet. One that does is PyDev, a plugin for Eclipse that turns Eclipse into a full-fledged Python IDE. Both Eclipse and PyDev are cross-platform and open source. + +
On the commercial front, there is ActiveState’s Komodo IDE. It has per-user licensing, but students can get a discount, and a free time-limited trial version is available. + +
I’ve been programming in Python for nine years, and I edit my Python programs in GNU Emacs and debug them in the command-line Python Shell. There’s no right or wrong way to develop in Python. Find a way that works for you! + +
© 2001–10 Mark Pilgrim + + + diff --git a/iterators.html b/iterators.html index 4b4a3f5..8da4842 100755 --- a/iterators.html +++ b/iterators.html @@ -1,394 +1,394 @@ - - -
Difficulty level: ♦♦♦♢♢ -
--❝ East is East, and West is West, and never the twain shall meet. ❞
— Rudyard Kipling -
-
Iterators are the “secret sauce” of Python 3. They’re everywhere, underlying everything, always just out of sight. Comprehensions are just a simple form of iterators. Generators are just a simple form of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that.
-
-
Remember the Fibonacci generator? Here it is as a built-from-scratch iterator: - -
class Fib:
- '''iterator that yields numbers in the Fibonacci sequence'''
-
- def __init__(self, max):
- self.max = max
-
- def __iter__(self):
- self.a = 0
- self.b = 1
- return self
-
- def __next__(self):
- fib = self.a
- if fib > self.max:
- raise StopIteration
- self.a, self.b = self.b, self.a + self.b
- return fib
-
-Let’s take that one line at a time. - -
class Fib:
-
-class? What’s a class?
-
-
⁂ - -
Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you’ve defined. - -
Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class.
-
-
class PapayaWhip: ①
- pass ②
-The name of this class is PapayaWhip, and it doesn’t inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement.
-Everything in a class is indented, just like the code within an if statement, for loop, or any other block of code. The first line not indented is outside the class.
-This PapayaWhip class doesn’t define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It’s a statement that does nothing, and it’s a good placeholder when you’re stubbing out functions or classes.
-
-
-- -☞The
pass statement in Python is like an empty set of curly braces ({}) in Java or C. -
Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don’t have explicit constructors and destructors. Although it’s not required, Python classes can have something similar to a constructor: the __init__() method.
-
-
The __init__() Method
This example shows the initialization of the Fib class using the __init__() method.
-
-
class Fib:
- '''iterator that yields numbers in the Fibonacci sequence''' ①
-
- def __init__(self, max): ②
-Classes can have docstrings too, just like modules and functions.
-The __init__() method is called immediately after an instance of the class is created. It would be tempting, but technically incorrect, to call this the “constructor” of the class. It’s tempting, because it looks like a C++ constructor (by convention, the __init__() method is the first method defined for the class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the __init__() method is called, and you already have a valid reference to the new instance of the class.
-The first argument of every class method, including the __init__() method, is always a reference to the current instance of the class. By convention, this argument is named self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don’t call it anything but self; this is a very strong convention.
-
-
In the __init__() method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically.
-
-
⁂ - -
Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__() method requires. The return value will be the newly created object.
-
->>> import fibonacci2 ->>> fib = fibonacci2.Fib(100) ① ->>> fib ② -<fibonacci2.Fib object at 0x00DB8810> ->>> fib.__class__ ③ -<class 'fibonacci2.Fib'> ->>> fib.__doc__ ④ -'iterator that yields numbers in the Fibonacci sequence'-
You are creating an instance of the Fib class (defined in the fibonacci2 module) and assigning the newly created instance to the variable fib. You are passing one parameter, 100, which will end up as the max argument in Fib’s __init__() method.
-fib is now an instance of the Fib class.
-Every class instance has a built-in attribute, __class__, which is the object’s class. Java programmers may be familiar with the Class class, which contains methods like getName() and getSuperclass() to get metadata information about an object. In Python, this kind of metadata is available through attributes, but the idea is the same.
-You can access the instance’s docstring just as with a function or a module. All instances of a class share the same docstring.
--- -☞In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit
new operator like there is in C++ or Java. -
⁂ - -
On to the next line: - -
class Fib:
- def __init__(self, max):
- self.max = max ①
-self.max is an instance variable, separate from max, which was passed into the __init__() method as an argument. self.max is “global” to the instance. That means that you can access it from other methods.
-class Fib:
- def __init__(self, max):
- self.max = max ①
- .
- .
- .
- def __next__(self):
- fib = self.a
- if fib > self.max: ②
-self.max is defined in the __init__() method…
-…and referenced in the __next__() method.
-Instance variables are specific to one instance of a class. For example, if you create two Fib instances with different maximum values, they will each remember their own values.
-
-
->>> import fibonacci2 ->>> fib1 = fibonacci2.Fib(100) ->>> fib2 = fibonacci2.Fib(200) ->>> fib1.max -100 ->>> fib2.max -200- -
⁂ - -
Now you’re ready to learn how to build an iterator. An iterator is just a class that defines an __iter__() method and a __next__() method.
-
-
-
-
class Fib: ①
- def __init__(self, max): ②
- self.max = max
-
- def __iter__(self): ③
- self.a = 0
- self.b = 1
- return self
-
- def __next__(self): ④
- fib = self.a
- if fib > self.max:
- raise StopIteration ⑤
- self.a, self.b = self.b, self.a + self.b
- return fib ⑥
-fib needs to be a class, not a function.
-Calling Fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later.
-The __iter__() method is called whenever someone calls iter(fib). (As you’ll see in a minute, a for loop will call this automatically, but you can also call it yourself manually.) After performing beginning-of-iteration initialization (in this case, resetting self.a and self.b, our two counters), the __iter__() method can return any object that implements a __next__() method. In this case (and in most cases), __iter__() simply returns self, since this class implements its own __next__() method.
-The __next__() method is called whenever someone calls next() on an iterator (here, the object that __iter__() returned). That will make more sense in a minute.
-When the __next__() method raises a StopIteration exception, this signals to the caller that the iteration is exhausted. Unlike most exceptions, this is not an error; it’s a normal condition that just means that the iterator has no more values to generate. If the caller is a for loop, it will notice this StopIteration exception and gracefully exit the loop. (In other words, it will swallow the exception.) This little bit of magic is actually the key to using iterators in for loops.
-To produce the next value, the __next__() method simply returns the value. Do not use yield here; that’s a bit of syntactic sugar that only applies when you’re using generators. Here you’re creating your own iterator from scratch; use return instead.
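The note that __iter__() may return any object with a __next__() method can be illustrated with a sketch in which the two roles live in separate classes. The names Countdown and CountdownIterator are hypothetical and are not part of the book’s code:

```python
class Countdown:
    '''iterable whose __iter__() hands out a separate iterator object'''

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # A fresh iterator per call, so iterations do not share state.
        return CountdownIterator(self.start)


class CountdownIterator:
    '''iterator that counts down from its starting value to 1'''

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


countdown = Countdown(3)
first = iter(countdown)
second = iter(countdown)
head = next(first)            # advances only the first iterator
rest_of_first = list(first)
all_of_second = list(second)
```

Unlike Fib, where __iter__() returns self and resets the counters, each call to iter(countdown) here yields an independent iterator. Which design fits depends on whether overlapping iterations over the same object matter to you.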
-Thoroughly confused yet? Excellent. Let’s see how to call this iterator: - -
->>> from fibonacci2 import Fib ->>> for n in Fib(1000): -... print(n, end=' ') -0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987- -
Why, it’s exactly the same! Byte for byte identical to how you called Fibonacci-as-a-generator (modulo one capital letter). But how? - -
There’s a bit of magic involved in for loops. Here’s what happens:
-
-
The for loop calls Fib(1000), as shown. This returns an instance of the Fib class. Call this fib_inst.
-The for loop then calls iter(fib_inst), which returns an iterator object. Call this fib_iter. In this case, fib_iter == fib_inst, because the __iter__() method returns self, but the for loop doesn’t know (or care) about that.
-To loop through the iterator, the for loop calls next(fib_iter), which calls the __next__() method on the fib_iter object, which does the next-Fibonacci-number calculations and returns a value. The for loop takes this value and assigns it to n, then executes the body of the for loop for that value of n.
-How does the for loop know when to stop? I’m glad you asked! When next(fib_iter) raises a StopIteration exception, the for loop will swallow the exception and gracefully exit. (Any other exception will pass through and be raised as usual.) And where have you seen a StopIteration exception? In the __next__() method, of course!
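The steps above can be sketched by writing out by hand roughly what the for loop does. This is an illustration of the protocol, not how the interpreter is implemented; the Fib class is repeated so the block runs on its own, and the list named values is an assumption added to collect the results:

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


fib_inst = Fib(1000)            # step 1: create the instance
fib_iter = iter(fib_inst)       # step 2: ask it for an iterator
values = []
while True:
    try:
        n = next(fib_iter)      # step 3: fetch the next value
    except StopIteration:       # step 4: normal end of iteration
        break
    values.append(n)            # stands in for the loop body

print(values)
```

Here fib_iter is the same object as fib_inst, because __iter__() returns self.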
-⁂ - -
Now it’s time for the finale. Let’s rewrite the plural rules generator as an iterator. - -
class LazyRules:
- rules_filename = 'plural6-rules.txt'
-
- def __init__(self):
- self.pattern_file = open(self.rules_filename, encoding='utf-8')
- self.cache = []
-
- def __iter__(self):
- self.cache_index = 0
- return self
-
- def __next__(self):
- self.cache_index += 1
- if len(self.cache) >= self.cache_index:
- return self.cache[self.cache_index - 1]
-
- if self.pattern_file.closed:
- raise StopIteration
-
- line = self.pattern_file.readline()
- if not line:
- self.pattern_file.close()
- raise StopIteration
-
- pattern, search, replace = line.split(None, 3)
- funcs = build_match_and_apply_functions(
- pattern, search, replace)
- self.cache.append(funcs)
- return funcs
-
-rules = LazyRules()
-
-So this is a class that implements __iter__() and __next__(), so it can be used as an iterator. Then, you instantiate the class and assign it to rules. This happens just once, on import.
-
-
Let’s take the class one bite at a time. - -
class LazyRules:
- rules_filename = 'plural6-rules.txt'
-
- def __init__(self):
- self.pattern_file = open(self.rules_filename, encoding='utf-8') ①
- self.cache = [] ②
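-When we instantiate the LazyRules class, open the pattern file but don’t read anything from it. (That comes later.)
-After opening the pattern file, initialize the cache. You’ll use this cache later (in the __next__() method) as you read lines from the pattern file.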
-Before we continue, let’s take a closer look at rules_filename. It’s not defined within the __iter__() method. In fact, it’s not defined within any method. It’s defined at the class level. It’s a class variable, and although you can access it just like an instance variable (self.rules_filename), it is shared across all instances of the LazyRules class.
-
-
->>> import plural6 ->>> r1 = plural6.LazyRules() ->>> r2 = plural6.LazyRules() ->>> r1.rules_filename ① -'plural6-rules.txt' ->>> r2.rules_filename -'plural6-rules.txt' ->>> r2.rules_filename = 'r2-override.txt' ② ->>> r2.rules_filename -'r2-override.txt' ->>> r1.rules_filename -'plural6-rules.txt' ->>> r2.__class__.rules_filename ③ -'plural6-rules.txt' ->>> r2.__class__.rules_filename = 'papayawhip.txt' ④ ->>> r1.rules_filename -'papayawhip.txt' ->>> r2.rules_filename ⑤ -'r2-override.txt'-
-To read or change the attribute on the class rather than on one instance, use the special __class__ attribute to access the class itself.
-And now back to our show. - -
def __iter__(self): ①
- self.cache_index = 0
- return self ②
-
-The __iter__() method will be called every time someone (say, a for loop) calls iter(rules).
-The one thing that every __iter__() method must do is return an iterator. In this case, it returns self, which signals that this class defines a __next__() method which will take care of returning values throughout the iteration.
- def __next__(self): ①
- .
- .
- .
- pattern, search, replace = line.split(None, 3)
- funcs = build_match_and_apply_functions( ②
- pattern, search, replace)
- self.cache.append(funcs) ③
- return funcs
-The __next__() method gets called whenever someone (say, a for loop) calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that.
-The build_match_and_apply_functions() function hasn’t changed; it’s the same as it ever was.
-Before returning the match and apply functions (stored in funcs), the method saves them in self.cache.
-Moving backwards… - -
def __next__(self):
- .
- .
- .
- line = self.pattern_file.readline() ①
- if not line: ②
- self.pattern_file.close()
- raise StopIteration ③
- .
- .
- .
-The readline() method (note: singular, not the plural readlines()) reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all the way down…)
-If there was a line for readline() to read, line will not be an empty string. Even if the file contained a blank line, line would end up as the one-character string '\n' (a newline). If line is really an empty string, that means there are no more lines to read from the file.
-When we reach the end of the file, we close the file and raise the StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. (♫ The party’s over… ♫)
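The difference between a blank line and end-of-file is easy to check. A minimal sketch, using an in-memory io.StringIO as a stand-in for the pattern file (the sample text and the names source and lines are illustrative):

```python
import io

# Three lines; the middle one is blank.
source = io.StringIO('first\n\nlast\n')

lines = []
while True:
    line = source.readline()
    if not line:          # '' means end of file, not a blank line
        break
    lines.append(line)    # a blank line arrives as '\n', which is truthy

source.close()
print(lines)
```

Because '\n' is a non-empty string, the test if not line only fires at the true end of the file.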
-Moving backwards all the way to the start of the __next__() method…
-
-
def __next__(self):
- self.cache_index += 1
- if len(self.cache) >= self.cache_index:
- return self.cache[self.cache_index - 1] ①
-
- if self.pattern_file.closed:
- raise StopIteration ②
- .
- .
- .
-self.cache will be a list of the functions we need to match and apply individual rules. (At least that should sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t exhausted the cache yet (i.e. if the length of self.cache is greater than or equal to self.cache_index, which was just incremented), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch.
-Putting it all together, here’s what happens when: - -
On import, the module creates a single instance of the LazyRules class, called rules, which opens the pattern file but does not read from it.
-Later, you call the plural() function again to pluralize a different word. The for loop in the plural() function will call iter(rules), which will reset the cache index but will not reset the open file object.
-The for loop will ask for a value from rules, which will invoke its __next__() method. This time, however, the cache is primed with a single pair of match and apply functions, corresponding to the patterns in the first line of the pattern file. Since they were built and cached in the course of pluralizing the previous word, they’re retrieved from the cache. The cache index increments, and the open file is never touched.
-The for loop comes around again and asks for another value from rules. This invokes the __next__() method a second time. This time, the cache is exhausted (it only contained one item, and we’re asking for a second), so the __next__() method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them.
-The file pointer stays wherever we stopped reading, waiting for the next readline() command. In the meantime, the cache now has more items in it, and if we start all over again trying to pluralize a new word, each of those items in the cache will be tried before reading the next line from the pattern file.
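The sequence above can be traced with a simplified stand-in for LazyRules. In this sketch the class name CachedLines, the reads counter, and the in-memory source are all assumptions made for illustration; it caches stripped lines rather than match and apply functions, but the caching logic follows the same shape:

```python
import io


class CachedLines:
    '''hypothetical stand-in for LazyRules that caches lines, not functions'''

    def __init__(self, source):
        self.source = source
        self.cache = []
        self.reads = 0            # counts readline() calls, for illustration

    def __iter__(self):
        self.cache_index = 0      # reset the cache position, not the file
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]   # cache hit
        if self.source.closed:
            raise StopIteration
        line = self.source.readline()
        self.reads += 1
        if not line:
            self.source.close()
            raise StopIteration
        item = line.strip()
        self.cache.append(item)
        return item


rules = CachedLines(io.StringIO('a\nb\nc\n'))

for item in rules:                # first word: the first rule matches
    break
reads_after_first = rules.reads

second_pass = []
for item in rules:                # second word: needs the second rule
    second_pass.append(item)
    if item == 'b':
        break
reads_after_second = rules.reads

full = list(rules)                # runs to the end and closes the source
again = list(rules)               # served entirely from the cache
```

The second pass gets 'a' from the cache and reads only 'b' from the source, which is the behavior the walkthrough describes.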
-We have achieved pluralization nirvana. - -
-The only thing that happens on import is instantiating a single class and opening a file (but not reading from it).
--- -☞Is this really nirvana? Well, yes and no. Here’s something to consider with the
LazyRules example: the pattern file is opened (during __init__()), and it remains open until the final rule is reached. Python will eventually close the file when it exits, or after the last instance of the LazyRules class is destroyed, but still, that could be a long time. If this class is part of a long-running Python process, the Python interpreter may never exit, and the LazyRules object may never get destroyed. -There are ways around this. Instead of opening the file during
__init__() and leaving it open while you read rules one line at a time, you could open the file, read all the rules, and immediately close the file. Or you could open the file, read one rule, save the file position with the tell() method, close the file, and later re-open it and use the seek() method to continue reading where you left off. Or you could not worry about it and just leave the file open, like this example code does. Programming is design, and design is all about trade-offs and constraints. Leaving a file open too long might be a problem; making your code more complicated might be a problem. Which one is the bigger problem depends on your development team, your application, and your runtime environment. -
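One of the alternatives mentioned above, re-opening the file and using tell() and seek() to resume, might look like the following sketch. The class name ReopeningReader and its method read_next_line() are hypothetical, and a temporary file stands in for the pattern file:

```python
import os
import tempfile


class ReopeningReader:
    '''hypothetical helper: reads one line per call, closing the file in between'''

    def __init__(self, filename):
        self.filename = filename
        self.position = 0                  # where the previous read stopped

    def read_next_line(self):
        with open(self.filename, encoding='utf-8') as f:
            f.seek(self.position)          # resume where we left off
            line = f.readline()
            self.position = f.tell()       # remember the new position
        return line                        # file is closed again here


# Create a small stand-in for the pattern file.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('first rule\nsecond rule\n')

reader = ReopeningReader(tmp.name)
lines = [reader.read_next_line() for _ in range(3)]
os.remove(tmp.name)
```

The file is never held open between calls, at the cost of an open and a seek for every line. That is the trade-off the note describes: less resource usage, more code.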
⁂ - -
© 2001–10 Mark Pilgrim - - - + + +
Difficulty level: ♦♦♦♢♢ +
++❝ East is East, and West is West, and never the twain shall meet. ❞
— Rudyard Kipling +
+
Iterators are the “secret sauce” of Python 3. They’re everywhere, underlying everything, always just out of sight. Comprehensions are just a simple form of iterators. Generators are just a simple form of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that.
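As a reminder of the compact form, here is a sketch of a Fibonacci generator function. The name fib_generator is illustrative, and the <= comparison is an assumption chosen so that it stops at the same point as the class-based version in this chapter:

```python
def fib_generator(max):
    '''generator that yields Fibonacci numbers up to max'''
    a, b = 0, 1
    while a <= max:
        yield a
        a, b = b, a + b


gen = fib_generator(100)
values = list(gen)
print(values)
```

Calling the function returns a generator object, which is already an iterator: it has __iter__() and __next__() methods that Python builds for you.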
+
+
Remember the Fibonacci generator? Here it is as a built-from-scratch iterator: + +
class Fib:
+ '''iterator that yields numbers in the Fibonacci sequence'''
+
+ def __init__(self, max):
+ self.max = max
+
+ def __iter__(self):
+ self.a = 0
+ self.b = 1
+ return self
+
+ def __next__(self):
+ fib = self.a
+ if fib > self.max:
+ raise StopIteration
+ self.a, self.b = self.b, self.a + self.b
+ return fib
+
+Let’s take that one line at a time. + +
class Fib:
+
+class? What’s a class?
+
+
⁂ + +
Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you’ve defined. + +
Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class.
+
+
class PapayaWhip: ①
+ pass ②
+The name of this class is PapayaWhip, and it doesn’t inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement.
+Everything in a class is indented, just like the code within an if statement, for loop, or any other block of code. The first line not indented is outside the class.
+This PapayaWhip class doesn’t define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It’s a statement that does nothing, and it’s a good placeholder when you’re stubbing out functions or classes.
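A short sketch of pass used as a placeholder, both in a class and in a function. The function name not_written_yet and the variable whip are made up for illustration:

```python
class PapayaWhip:
    pass                      # a complete, if empty, class definition


def not_written_yet():
    pass                      # stub: the body can be filled in later


whip = PapayaWhip()           # even an empty class can be instantiated
result = not_written_yet()    # a function that only passes returns None
```

Both definitions are syntactically complete, so the rest of a program can import and call them while the real implementation is still being written.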
+
+
++ +☞The
pass statement in Python is like an empty set of curly braces ({}) in Java or C. +
Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don’t have explicit constructors and destructors. Although it’s not required, Python classes can have something similar to a constructor: the __init__() method.
+
+
The __init__() Method
This example shows the initialization of the Fib class using the __init__() method.
+
+
class Fib:
+ '''iterator that yields numbers in the Fibonacci sequence''' ①
+
+ def __init__(self, max): ②
+Classes can have docstrings too, just like modules and functions.
+The __init__() method is called immediately after an instance of the class is created. It would be tempting, but technically incorrect, to call this the “constructor” of the class. It’s tempting, because it looks like a C++ constructor (by convention, the __init__() method is the first method defined for the class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the __init__() method is called, and you already have a valid reference to the new instance of the class.
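The claim that the object already exists when __init__() runs can be observed with the __new__() method, which Python calls first to create the instance. This is a sketch for illustration only; the class Tracked and the list events are hypothetical, and most classes never need to define __new__():

```python
events = []


class Tracked:
    def __new__(cls, *args):
        events.append('new')            # the instance is created here
        return super().__new__(cls)

    def __init__(self, value):
        events.append('init')           # self already exists at this point
        self.value = value


t = Tracked(42)
print(events)
```

The instance returned by __new__() is the same object that __init__() receives as self.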
+The first argument of every class method, including the __init__() method, is always a reference to the current instance of the class. By convention, this argument is named self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don’t call it anything but self; this is a very strong convention.
+
+
In the __init__() method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically.
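A small sketch of what “Python will add it for you” means in practice. The Counter class is a made-up example; the two calls at the end do the same kind of work, one with self supplied automatically and one with it spelled out:

```python
class Counter:
    def __init__(self, start):
        self.count = start

    def add(self, amount):
        self.count += amount
        return self.count


c = Counter(10)
via_instance = c.add(5)           # Python passes c as self
via_class = Counter.add(c, 5)     # the same call with self written explicitly
```

You would rarely write the second form, but it shows that self is an ordinary argument that Python fills in when you call a method on an instance.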
+
+
⁂ + +
Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__() method requires. The return value will be the newly created object.
+
+>>> import fibonacci2 +>>> fib = fibonacci2.Fib(100) ① +>>> fib ② +<fibonacci2.Fib object at 0x00DB8810> +>>> fib.__class__ ③ +<class 'fibonacci2.Fib'> +>>> fib.__doc__ ④ +'iterator that yields numbers in the Fibonacci sequence'+
You are creating an instance of the Fib class (defined in the fibonacci2 module) and assigning the newly created instance to the variable fib. You are passing one parameter, 100, which will end up as the max argument in Fib’s __init__() method.
+fib is now an instance of the Fib class.
+Every class instance has a built-in attribute, __class__, which is the object’s class. Java programmers may be familiar with the Class class, which contains methods like getName() and getSuperclass() to get metadata information about an object. In Python, this kind of metadata is available through attributes, but the idea is the same.
+You can access the instance’s docstring just as with a function or a module. All instances of a class share the same docstring.
+++ +☞In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit
new operator like there is in C++ or Java. +
⁂ + +
On to the next line: + +
class Fib:
+ def __init__(self, max):
+ self.max = max ①
+self.max is an instance variable. It is completely separate from max, which was passed into the __init__() method as an argument. self.max is “global” to the instance. That means that you can access it from other methods.
+class Fib:
+ def __init__(self, max):
+ self.max = max ①
+ .
+ .
+ .
+ def __next__(self):
+ fib = self.a
+ if fib > self.max: ②
+self.max is defined in the __init__() method…
+…and referenced in the __next__() method.
+Instance variables are specific to one instance of a class. For example, if you create two Fib instances with different maximum values, they will each remember their own values.
+
+
+>>> import fibonacci2
+>>> fib1 = fibonacci2.Fib(100)
+>>> fib2 = fibonacci2.Fib(200)
+>>> fib1.max
+100
+>>> fib2.max
+200
+
+
⁂ + +
Now you’re ready to learn how to build an iterator. An iterator is just a class that defines an __iter__() method and a __next__() method.
+
+
+
+
class Fib: ①
+ def __init__(self, max): ②
+ self.max = max
+
+ def __iter__(self): ③
+ self.a = 0
+ self.b = 1
+ return self
+
+ def __next__(self): ④
+ fib = self.a
+ if fib > self.max:
+ raise StopIteration ⑤
+ self.a, self.b = self.b, self.a + self.b
+ return fib ⑥
+To build an iterator from scratch, Fib needs to be a class, not a function.
+Calling Fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later.
+The __iter__() method is called whenever someone calls iter(fib). (As you’ll see in a minute, a for loop will call this automatically, but you can also call it yourself manually.) After performing beginning-of-iteration initialization (in this case, resetting self.a and self.b, our two counters), the __iter__() method can return any object that implements a __next__() method. In this case (and in most cases), __iter__() simply returns self, since this class implements its own __next__() method.
+The __next__() method is called whenever someone calls next() on an iterator of an instance of a class. That will make more sense in a minute.
+When the __next__() method raises a StopIteration exception, this signals to the caller that the iteration is exhausted. Unlike most exceptions, this is not an error; it’s a normal condition that just means that the iterator has no more values to generate. If the caller is a for loop, it will notice this StopIteration exception and gracefully exit the loop. (In other words, it will swallow the exception.) This little bit of magic is actually the key to using iterators in for loops.
+To produce the next value, an iterator’s __next__() method simply returns the value. Do not use yield here; that’s a bit of syntactic sugar that only applies when you’re using generators. Here you’re creating your own iterator from scratch; use return instead.
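For comparison, here is a sketch of the two forms side by side. The Fib class is the one from the listing above. The fib_generator() function is an assumption written for this comparison, with its loop condition chosen to match the class, so it may differ slightly from the generator shown earlier in the book:

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib              # a plain return, once per call


def fib_generator(max):
    # Generator version: Python supplies __iter__() and __next__() for you,
    # and falling off the end of the function raises StopIteration.
    a, b = 0, 1
    while a <= max:
        yield a
        a, b = b, a + b


from_class = list(Fib(100))
from_generator = list(fib_generator(100))
```

Both lists hold the same numbers. The class spells out by hand the bookkeeping that yield does for you.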
+Thoroughly confused yet? Excellent. Let’s see how to call this iterator: + +
+>>> from fibonacci2 import Fib
+>>> for n in Fib(1000):
+...     print(n, end=' ')
+0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
+
+
Why, it’s exactly the same! Byte for byte identical to how you called Fibonacci-as-a-generator (modulo one capital letter). But how? + +
There’s a bit of magic involved in for loops. Here’s what happens:
+
+
The for loop calls Fib(1000), as shown. This returns an instance of the Fib class. Call this fib_inst.
+Behind the scenes, the for loop calls iter(fib_inst), which returns an iterator object. Call this fib_iter. In this case, fib_iter == fib_inst, because the __iter__() method returns self, but the for loop doesn’t know (or care) about that.
+To loop through the iterator, the for loop calls next(fib_iter), which calls the __next__() method on the fib_iter object, which does the next-Fibonacci-number calculations and returns a value. The for loop takes this value and assigns it to n, then executes the body of the for loop for that value of n.
+How does the for loop know when to stop? I’m glad you asked! When next(fib_iter) raises a StopIteration exception, the for loop will swallow the exception and gracefully exit. (Any other exception will pass through and be raised as usual.) And where have you seen a StopIteration exception? In the __next__() method, of course!
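The steps above can be written out by hand. This is only a sketch of what the for loop does, not how Python implements it internally. The names fib_inst and fib_iter come from the steps above, and the collected list is an assumption that stands in for the body of the loop:

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


fib_inst = Fib(1000)
fib_iter = iter(fib_inst)           # calls fib_inst.__iter__()
same_object = fib_iter is fib_inst  # __iter__() returned self

collected = []
while True:
    try:
        n = next(fib_iter)          # calls fib_iter.__next__()
    except StopIteration:
        break                       # swallow the exception and leave the loop
    collected.append(n)             # stands in for the body of the for loop
```

Any exception other than StopIteration is not caught by the try block, so it would propagate exactly as it does out of a real for loop.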
+⁂ + +
Now it’s time for the finale. Let’s rewrite the plural rules generator as an iterator. + +
class LazyRules:
+ rules_filename = 'plural6-rules.txt'
+
+ def __init__(self):
+ self.pattern_file = open(self.rules_filename, encoding='utf-8')
+ self.cache = []
+
+ def __iter__(self):
+ self.cache_index = 0
+ return self
+
+ def __next__(self):
+ self.cache_index += 1
+ if len(self.cache) >= self.cache_index:
+ return self.cache[self.cache_index - 1]
+
+ if self.pattern_file.closed:
+ raise StopIteration
+
+ line = self.pattern_file.readline()
+ if not line:
+ self.pattern_file.close()
+ raise StopIteration
+
+ pattern, search, replace = line.split(None, 3)
+ funcs = build_match_and_apply_functions(
+ pattern, search, replace)
+ self.cache.append(funcs)
+ return funcs
+
+rules = LazyRules()
+
+So this is a class that implements __iter__() and __next__(), so it can be used as an iterator. Then, you instantiate the class and assign it to rules. This happens just once, on import.
+
+
Let’s take the class one bite at a time. + +
class LazyRules:
+ rules_filename = 'plural6-rules.txt'
+
+ def __init__(self):
+ self.pattern_file = open(self.rules_filename, encoding='utf-8') ①
+ self.cache = [] ②
When you instantiate the LazyRules class, open the pattern file but don’t read anything from it. (That comes later.)
+Then initialize the cache. You’ll use it later (in the __next__() method) as you read lines from the pattern file.
+Before we continue, let’s take a closer look at rules_filename. It’s not defined within the __init__() method. In fact, it’s not defined within any method. It’s defined at the class level. It’s a class variable, and although you can access it just like an instance variable (self.rules_filename), it is shared across all instances of the LazyRules class.
+
+
+>>> import plural6
+>>> r1 = plural6.LazyRules()
+>>> r2 = plural6.LazyRules()
+>>> r1.rules_filename  ①
+'plural6-rules.txt'
+>>> r2.rules_filename
+'plural6-rules.txt'
+>>> r2.rules_filename = 'r2-override.txt'  ②
+>>> r2.rules_filename
+'r2-override.txt'
+>>> r1.rules_filename
+'plural6-rules.txt'
+>>> r2.__class__.rules_filename  ③
+'plural6-rules.txt'
+>>> r2.__class__.rules_filename = 'papayawhip.txt'  ④
+>>> r1.rules_filename
+'papayawhip.txt'
+>>> r2.rules_filename  ⑤
+'r2-override.txt'
+
To reach the class attribute rather than an individual instance’s attribute, use the special __class__ attribute to access the class itself.
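The same behavior can be checked without the plural6 module. In this sketch the Rules class is a hypothetical stand-in for LazyRules with the file handling left out. It shows that assigning through an instance creates a separate instance attribute, while assigning through the class changes the shared value:

```python
class Rules:
    """Hypothetical stand-in for LazyRules, without the file handling."""
    rules_filename = 'plural6-rules.txt'    # class variable


r1 = Rules()
r2 = Rules()

# Assigning through an instance creates an instance attribute on r2 only;
# it shadows the class variable rather than changing it.
r2.rules_filename = 'r2-override.txt'
r2_has_own = 'rules_filename' in vars(r2)
r1_has_own = 'rules_filename' in vars(r1)

# Assigning through the class changes the shared value, which r1 still sees.
Rules.rules_filename = 'papayawhip.txt'
seen_by_r1 = r1.rules_filename
seen_by_r2 = r2.rules_filename
```

After this runs, r1 sees the new class-level value and r2 keeps its own override, matching the interactive session above.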
+And now back to our show. + +
def __iter__(self): ①
+ self.cache_index = 0
+ return self ②
+
+The __iter__() method will be called every time someone (say, a for loop) calls iter(rules).
+The one thing that every __iter__() method must do is return an iterator. In this case, it returns self, which signals that this class defines a __next__() method which will take care of returning values throughout the iteration.
+ def __next__(self): ①
+ .
+ .
+ .
+ pattern, search, replace = line.split(None, 3)
+ funcs = build_match_and_apply_functions( ②
+ pattern, search, replace)
+ self.cache.append(funcs) ③
+ return funcs
+The __next__() method gets called whenever someone (say, a for loop) calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that.
+The build_match_and_apply_functions() function hasn’t changed; it’s the same as it ever was.
+Before returning the match and apply functions (stored in the tuple funcs), save them in self.cache.
+Moving backwards… + +
def __next__(self):
+ .
+ .
+ .
+ line = self.pattern_file.readline() ①
+ if not line: ②
+ self.pattern_file.close()
+ raise StopIteration ③
+ .
+ .
+ .
+The readline() method (note: singular, not the plural readlines()) reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all the way down…)
+If there was a line for readline() to read, line will not be an empty string. Even if the file contained a blank line, line would end up as the one-character string '\n' (a newline). If line is really an empty string, that means there are no more lines to read from the file.
+At the end of the file, close the file and raise a StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. (♫ The party’s over… ♫)
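You can see the difference between a blank line and the end of the file without touching the real pattern file. This sketch uses io.StringIO as a stand-in for an open text file, and the contents are made up: one rule, one blank line, then nothing.

```python
import io

# io.StringIO behaves like an open text file for this purpose.
pattern_file = io.StringIO('[sxz]$ $ es\n\n')

first = pattern_file.readline()     # the rule, including its trailing newline
second = pattern_file.readline()    # the blank line: '\n', which is truthy
third = pattern_file.readline()     # end of file: '', which is falsy
```

That is why the test "if not line" is safe: only a true end of file produces an empty string.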
+Moving backwards all the way to the start of the __next__() method…
+
+
def __next__(self):
+ self.cache_index += 1
+ if len(self.cache) >= self.cache_index:
+ return self.cache[self.cache_index - 1] ①
+
+ if self.pattern_file.closed:
+ raise StopIteration ②
+ .
+ .
+ .
+self.cache will be a list of the functions we need to match and apply individual rules. (At least that should sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t exhausted the cache yet (i.e. if the length of self.cache is greater than or equal to self.cache_index, which has just been incremented), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch.
+Putting it all together, here’s what happens when: + +
When the module is imported, it creates a single instance of the LazyRules class, called rules, which opens the pattern file but does not read from it.
+Suppose you call the plural() function again to pluralize a different word. The for loop in the plural() function will call iter(rules), which will reset the cache index but will not reset the open file object.
+Once again, the for loop will ask for a value from rules, which will invoke its __next__() method. This time, however, the cache is primed with a single pair of match and apply functions, corresponding to the patterns in the first line of the pattern file. Since they were built and cached in the course of pluralizing the previous word, they’re retrieved from the cache. The cache index increments, and the open file is never touched.
+If that rule does not match, the for loop comes around again and asks for another value from rules. This invokes the __next__() method a second time. This time, the cache is exhausted (it only contained one item, and we’re asking for a second), so the __next__() method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them.
+readline() command. In the meantime, the cache now has more items in it, and if we start all over again trying to pluralize a new word, each of those items in the cache will be tried before reading the next line from the pattern file.
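The whole read-build-and-cache cycle can be watched in miniature. The LazyLines class below is a hypothetical, simplified cousin of LazyRules: it caches upper-cased lines instead of match and apply functions, it reads from an io.StringIO object instead of plural6-rules.txt, and its reads counter is an addition that exists only to show how often the file is touched.

```python
import io


class LazyLines:
    """Hypothetical, simplified cousin of LazyRules."""

    def __init__(self, source):
        self.source = source
        self.cache = []
        self.reads = 0              # counts how often the file is touched

    def __iter__(self):
        self.cache_index = 0        # reset the cache index, not the file
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]     # cache hit

        if self.source.closed:
            raise StopIteration

        line = self.source.readline()
        self.reads += 1
        if not line:
            self.source.close()
            raise StopIteration

        item = line.strip().upper()
        self.cache.append(item)
        return item


lines = LazyLines(io.StringIO('one\ntwo\nthree\n'))

first_pass = []
for item in lines:
    first_pass.append(item)
    if item == 'TWO':
        break                       # stop early, as plural() does on a match
reads_after_first = lines.reads

second_pass = list(lines)           # 'ONE' and 'TWO' come from the cache
reads_after_second = lines.reads
```

The first pass reads two lines and stops. The second pass replays those two from the cache, then reads the third line and hits the end of the file. Any later pass is served entirely from the cache.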
+We have achieved pluralization nirvana. + +
+The only thing that happens on import is instantiating a single class and opening a file (but not reading from it).
+++ +☞Is this really nirvana? Well, yes and no. Here’s something to consider with the LazyRules example: the pattern file is opened (during __init__()), and it remains open until the final rule is reached. Python will eventually close the file when it exits, or after the last instance of the LazyRules class is destroyed, but still, that could be a long time. If this class is part of a long-running Python process, the Python interpreter may never exit, and the LazyRules object may never get destroyed.
+There are ways around this. Instead of opening the file during __init__() and leaving it open while you read rules one line at a time, you could open the file, read all the rules, and immediately close the file. Or you could open the file, read one rule, save the file position with the tell() method, close the file, and later re-open it and use the seek() method to continue reading where you left off. Or you could not worry about it and just leave the file open, like this example code does. Programming is design, and design is all about trade-offs and constraints. Leaving a file open too long might be a problem; making your code more complicated might be a problem. Which one is the bigger problem depends on your development team, your application, and your runtime environment. +
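Here is one way the tell() and seek() alternative might look. This is a sketch under stated assumptions, not part of the plural6 module: the read_one_rule() helper is hypothetical, and a throwaway temporary file with two sample rules stands in for the real pattern file.

```python
import os
import tempfile


def read_one_rule(path, position=0):
    """Open path, read a single line starting at position, then close.

    Returns (line, next_position). An empty line means end of file.
    """
    with open(path, encoding='utf-8') as pattern_file:
        pattern_file.seek(position)
        line = pattern_file.readline()
        return line, pattern_file.tell()


# A throwaway two-line file, standing in for the real pattern file.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='utf-8') as f:
    f.write('[sxz]$ $ es\n[^aeioudgkprt]h$ $ es\n')

rule1, pos = read_one_rule(path)          # file is closed again on return
rule2, pos = read_one_rule(path, pos)     # resumes where rule1 left off
rule3, pos = read_one_rule(path, pos)     # '' once the file is exhausted
os.remove(path)
```

The trade-off is visible even at this size: the file is never left open, but every rule now costs an open, a seek, and a close.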
⁂ + +
© 2001–10 Mark Pilgrim
+
+
+
diff --git a/j/.htaccess b/j/.htaccess
index 35a1445..3c593e3 100644
--- a/j/.htaccess
+++ b/j/.htaccess
@@ -1,4 +1,4 @@
-FileETag MTime Size
-
-ExpiresActive On
-ExpiresDefault "access plus 1 year"
+FileETag MTime Size
+
+ExpiresActive On
+ExpiresDefault "access plus 1 year"
diff --git a/j/html5.js b/j/html5.js
index e973e7f..6457708 100644
--- a/j/html5.js
+++ b/j/html5.js
@@ -1 +1,3 @@
-(function(){var e="abbr,article,aside,audio,bb,canvas,datagrid,datalist,details,dialog,figure,footer,header,mark,menu,meter,nav,output,progress,section,time,video".split(','),i=e.length;while(i--){document.createElement(e[i])}})()
\ No newline at end of file
+/*@cc_on@if(@_jscript_version<9)(function(p,e){function q(a,b){if(g[a])g[a].styleSheet.cssText+=b;else{var c=r[l],d=e[j]("style");d.media=a;c.insertBefore(d,c[l]);g[a]=d;q(a,b)}}function s(a,b){for(var c=new RegExp("\\b("+m+")\\b(?!.*[;}])","gi"),d=function(k){return".iepp_"+k},h=-1;++h You are here: Home ‣ Dive Into Python 3 ‣
- Difficulty level: ♦♦♦♦♢
- ❝ You’ll find the shame is like the pain; you only feel it once. ❞
- Real artists ship. Or so says Steve Jobs. Do you want to release a Python script, library, framework, or application? Excellent. The world needs more Python code. Python 3 comes with a packaging framework called Distutils. Distutils is many things: a build tool (for you), an installation tool (for your users), a package metadata format (for search engines), and more. It integrates with the Python Package Index (“PyPI”), a central repository for open source Python libraries.
-
- All of these facets of Distutils center around the setup script, traditionally called In this chapter, you’ll learn how the setup scripts for ☞ ⁂
-
- Releasing your first Python package is a daunting process. (Releasing your second one is a little easier.) Distutils tries to automate as much of it as possible, but there are some things you simply must do yourself.
-
- ⁂
-
- To start packaging your Python software, you need to get your files and directories in order. The The ⁂
-
- The Distutils setup script is a Python script. In theory, it can do anything Python can do. In practice, it should do as little as possible, in as standard a way as possible. Setup scripts should be boring. The more exotic your installation process is, the more exotic your bug reports will be.
-
- The first line of every Distutils setup script is always the same:
-
- This imports the The The following named arguments are required:
-
- Although not required, I recommend that you also include the following in your setup script:
-
- ☞Setup script metadata is defined in PEP 314.
- Now let’s look at the The ⁂
-
- The Python Package Index (“PyPI”) contains thousands of Python libraries. Proper classification metadata will allow people to find yours more easily. PyPI lets you browse packages by classifier. You can even select multiple classifiers to narrow your search. Classifiers are not invisible metadata that you can just ignore!
-
- To classify your software, pass a Classifiers are optional. You can write a Distutils setup script without any classifiers at all. Don’t do that. You should always include at least these classifiers:
-
- I also recommend that you include the following classifiers:
-
- By way of example, here are the classifiers for Django, a production-ready, cross-platform, BSD-licensed web application framework that runs on your web server. (Django is not yet compatible with Python 3, so the Here are the classifiers for And here are the classifiers for By default, Distutils will include the following files in your release package:
-
- That will cover all the files in the A manifest file is a text file called This is the entire manifest file for the All manifest commands preserve the directory structure that you set up in your project directory. That ☞Manifest files have their own unique format. See Specifying the files to distribute and the manifest template commands for details.
- To reiterate: you only need to create a manifest file if you want to include files that Distutils doesn’t include by default. If you do need a manifest file, it should only include the files and directories that Distutils wouldn’t otherwise find on its own.
-
- There’s a lot to keep track of. Distutils comes with a built-in validation command that checks that all the required metadata is present in your setup script. For example, if you forget to include the Once you include a ⁂
-
- Distutils supports building multiple types of release packages. At a minimum, you should build a “source distribution” that contains your source code, your Distutils setup script, your “read me” file, and whatever additional files you want to include. To build a source distribution, pass the Several things to note here:
-
- ⁂
-
- In my opinion, every Python library deserves a graphical installer for Windows users. It’s easy to make (even if you don’t run Windows yourself), and Windows users appreciate it.
-
- Distutils can create a graphical Windows installer for you, by passing the Distutils can help you build installable packages for Linux users. In my opinion, this probably isn’t worth your time. If you want your software distributed for Linux, your time would be better spent working with community members who specialize in packaging software for major Linux distributions.
-
- For example, my The Linux packages that Distutils builds offer none of these advantages. Your time is better spent elsewhere.
-
- ⁂
-
- Uploading software to the Python Package Index is a three step process.
-
- To register yourself, go to the PyPI user registration page. Enter your desired username and password, provide a valid email address, and click the Now you need to register your software with PyPI and upload it. You can do this all in one step.
-
- Congratulations, you now have your own page on the Python Package Index! The address is If you want to release a new version, just update your ⁂
-
- Distutils is not the be-all and end-all of Python packaging, but as of this writing (August 2009), it’s the only packaging framework that works in Python 3. There are a number of other frameworks for Python 2; some focus on installation, others on testing and deployment. Some or all of these may end up being ported to Python 3 in the future.
-
- These frameworks focus on installation:
-
- These focus on testing and deployment:
-
- ⁂
-
- On Distutils:
-
- On other packaging frameworks:
-
- © 2001–10 Mark Pilgrim
-
-
-
+
+
+ You are here: Home ‣ Dive Into Python 3 ‣
+ Difficulty level: ♦♦♦♦♢
+ ❝ You’ll find the shame is like the pain; you only feel it once. ❞
+ Real artists ship. Or so says Steve Jobs. Do you want to release a Python script, library, framework, or application? Excellent. The world needs more Python code. Python 3 comes with a packaging framework called Distutils. Distutils is many things: a build tool (for you), an installation tool (for your users), a package metadata format (for search engines), and more. It integrates with the Python Package Index (“PyPI”), a central repository for open source Python libraries.
+
+ All of these facets of Distutils center around the setup script, traditionally called In this chapter, you’ll learn how the setup scripts for ☞ ⁂
+
+ Releasing your first Python package is a daunting process. (Releasing your second one is a little easier.) Distutils tries to automate as much of it as possible, but there are some things you simply must do yourself.
+
+ ⁂
+
+ To start packaging your Python software, you need to get your files and directories in order. The The ⁂
+
+ The Distutils setup script is a Python script. In theory, it can do anything Python can do. In practice, it should do as little as possible, in as standard a way as possible. Setup scripts should be boring. The more exotic your installation process is, the more exotic your bug reports will be.
+
+ The first line of every Distutils setup script is always the same:
+
+ This imports the The The following named arguments are required:
+
+ Although not required, I recommend that you also include the following in your setup script:
+
+ ☞Setup script metadata is defined in PEP 314.
+ Now let’s look at the The ⁂
+
+ The Python Package Index (“PyPI”) contains thousands of Python libraries. Proper classification metadata will allow people to find yours more easily. PyPI lets you browse packages by classifier. You can even select multiple classifiers to narrow your search. Classifiers are not invisible metadata that you can just ignore!
+
+ To classify your software, pass a Classifiers are optional. You can write a Distutils setup script without any classifiers at all. Don’t do that. You should always include at least these classifiers:
+
+ I also recommend that you include the following classifiers:
+
+ By way of example, here are the classifiers for Django, a production-ready, cross-platform, BSD-licensed web application framework that runs on your web server. (Django is not yet compatible with Python 3, so the Here are the classifiers for And here are the classifiers for By default, Distutils will include the following files in your release package:
+
+ That will cover all the files in the A manifest file is a text file called This is the entire manifest file for the All manifest commands preserve the directory structure that you set up in your project directory. That ☞Manifest files have their own unique format. See Specifying the files to distribute and the manifest template commands for details.
+ To reiterate: you only need to create a manifest file if you want to include files that Distutils doesn’t include by default. If you do need a manifest file, it should only include the files and directories that Distutils wouldn’t otherwise find on its own.
+
+ There’s a lot to keep track of. Distutils comes with a built-in validation command that checks that all the required metadata is present in your setup script. For example, if you forget to include the Once you include a ⁂
+
+ Distutils supports building multiple types of release packages. At a minimum, you should build a “source distribution” that contains your source code, your Distutils setup script, your “read me” file, and whatever additional files you want to include. To build a source distribution, pass the Several things to note here:
+
+ ⁂
+
+ In my opinion, every Python library deserves a graphical installer for Windows users. It’s easy to make (even if you don’t run Windows yourself), and Windows users appreciate it.
+
+ Distutils can create a graphical Windows installer for you, by passing the Distutils can help you build installable packages for Linux users. In my opinion, this probably isn’t worth your time. If you want your software distributed for Linux, your time would be better spent working with community members who specialize in packaging software for major Linux distributions.
+
+ For example, my The Linux packages that Distutils builds offer none of these advantages. Your time is better spent elsewhere.
+
+ ⁂
+
+ Uploading software to the Python Package Index is a three step process.
+
+ To register yourself, go to the PyPI user registration page. Enter your desired username and password, provide a valid email address, and click the Now you need to register your software with PyPI and upload it. You can do this all in one step.
+
+ Congratulations, you now have your own page on the Python Package Index! The address is If you want to release a new version, just update your ⁂
+
+ Distutils is not the be-all and end-all of Python packaging, but as of this writing (August 2009), it’s the only packaging framework that works in Python 3. There are a number of other frameworks for Python 2; some focus on installation, others on testing and deployment. Some or all of these may end up being ported to Python 3 in the future.
+
+ These frameworks focus on installation:
+
+ These focus on testing and deployment:
+
+ ⁂
+
+ On Distutils:
+
+ On other packaging frameworks:
+
+ © 2001–10 Mark Pilgrim
+
+
+
diff --git a/prince.css b/prince.css
index 5dbf409..5fa3299 100644
--- a/prince.css
+++ b/prince.css
@@ -1,59 +1,59 @@
-/*
-
-"Dive Into Python 3" Prince stylesheet
-
-Copyright (c) 2009, Mark Pilgrim, All rights reserved.
-
-Redistribution and use in source and binary forms, with or without modification,
-are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice,
- this list of conditions and the following disclaimer.
-* Redistributions in binary form must reproduce the above copyright notice,
- this list of conditions and the following disclaimer in the documentation
- and/or other materials provided with the distribution.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS'
-AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
-ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
-LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
-CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
-SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
-INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
-CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
-ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGE.
-*/
-
-/* some Prince-specific rules to generate a nicer PDF */
-/* see http://www.princexml.com/ */
-
-@page {
- size: US-Letter;
- margin: 30pt;
- padding: 0;
- @bottom-center {
- font: 12pt/1.75 'Gill Sans', 'Gill Sans MT', Helvetica, Corbel, 'Nimbus Sans L', sans-serif;
- content: counter(page);
- }
-}
-pre {
- page-break-inside: avoid;
-}
-h1 {
- page-break-before: always;
- prince-bookmark-level: 1;
-}
-h2 {
- prince-bookmark-level: 2;
-}
-h3 {
- prince-bookmark-level: 3;
-}
-ul, ol {
- margin: 1.75em 20pt;
-}
-abbr {
- text-decoration: none;
-}
+/*
+
+"Dive Into Python 3" Prince stylesheet
+
+Copyright (c) 2009, Mark Pilgrim, All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+* Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS'
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+*/
+
+/* some Prince-specific rules to generate a nicer PDF */
+/* see http://www.princexml.com/ */
+
+@page {
+ size: US-Letter;
+ margin: 30pt;
+ padding: 0;
+ @bottom-center {
+ font: 12pt/1.75 'Gill Sans', 'Gill Sans MT', Helvetica, Corbel, 'Nimbus Sans L', sans-serif;
+ content: counter(page);
+ }
+}
+pre {
+ page-break-inside: avoid;
+}
+h1 {
+ page-break-before: always;
+ prince-bookmark-level: 1;
+}
+h2 {
+ prince-bookmark-level: 2;
+}
+h3 {
+ prince-bookmark-level: 3;
+}
+ul, ol {
+ margin: 1.75em 20pt;
+}
+abbr {
+ text-decoration: none;
+}
diff --git a/publish b/publish
index f471782..8d39edc 100755
--- a/publish
+++ b/publish
@@ -2,7 +2,6 @@
die () {
echo "$1" >/dev/stderr
- [ -n "$(which Snarl_CMD 2>/dev/null)" ] && Snarl_CMD snShowMessage 10 "Dive Into Python 3" "$1." "C:\Users\pilgrim\site-lisp\todochiku-icons\alert.png"
exit 1
}
@@ -119,9 +118,9 @@ java -jar util/yuicompressor-2.4.2.jar build/dip3.css > build/$revision.css && \
echo "inlining CSS, minimizing URLs, adding evil tracking code"
ga=`cat j/ga.js`
for f in build/*.html; do
- css=`python2.6 util/lesscss.py "$f" "build/$revision.css"` || die "Failed to remove unused CSS"
- mobilecss=`python2.6 util/lesscss.py "$f" "build/m-$revision.css"` || die "Failed to remove unused CSS"
- printcss=`python2.6 util/lesscss.py "$f" "build/p-$revision.css"` || die "Failed to remove unused CSS"
+ css=`python2.5 util/lesscss.py "$f" "build/$revision.css"` || die "Failed to remove unused CSS"
+ mobilecss=`python2.5 util/lesscss.py "$f" "build/m-$revision.css"` || die "Failed to remove unused CSS"
+ printcss=`python2.5 util/lesscss.py "$f" "build/p-$revision.css"` || die "Failed to remove unused CSS"
sed -i -e "s|||g" -e "s|||g" -e "s|||g" -e "s||${ga}|g" "$f" || die "Failed to inline CSS"
done
@@ -130,7 +129,7 @@ chmod 755 build/examples build/j build/i build/d && \
chmod 644 build/*.html build/*.css build/*.txt build/*.zip build/examples/* build/examples/.htaccess build/j/* build/j/.htaccess build/i/* build/i/.htaccess build/d/.htaccess build/.htaccess || die "Failed to reset file permissions"
# ship it!
-#die "Aborting without publishing"
+die "Aborting without publishing"
echo -n "publishing"
rsync -essh -a build/d/.htaccess build/*.zip diveintomark.org:~/web/diveintopython3.org/d/ && \
echo -n "." && \
@@ -140,5 +139,3 @@ rsync -essh -a build/d/.htaccess build/*.zip diveintomark.org:~/web/diveintopyth
echo -n "." && \
rsync -essh -a build/examples build/*.txt build/*.html build/.htaccess diveintomark.org:~/web/diveintopython3.org/ && \
echo "." || die "Failed to publish to remote server"
-
-[ -n "$(which Snarl_CMD 2>/dev/null)" ] && Snarl_CMD snShowMessage 10 "Dive Into Python 3" "Published." "C:\Users\pilgrim\site-lisp\todochiku-icons\clean.png"
diff --git a/table-of-contents.html b/table-of-contents.html
index 93772a7..5d9db97 100755
--- a/table-of-contents.html
+++ b/table-of-contents.html
@@ -1,446 +1,446 @@
-
-
- You are here: Home ‣ Dive Into Python 3 ‣
- © 2001–10 Mark Pilgrim
-
+
+
+ You are here: Home ‣ Dive Into Python 3 ‣
+ © 2001–10 Mark Pilgrim
+
diff --git a/util/lesscss.py b/util/lesscss.py
index 9342d22..c39249c 100755
--- a/util/lesscss.py
+++ b/util/lesscss.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python2.6
+#!/usr/bin/python2.5
from pyquery import PyQuery as pq
import glob
@@ -12,10 +12,7 @@ SELECTOR_EXCEPTIONS = ('.w', '.b', '.str', '.kwd', '.com', '.typ', '.lit', '.pun
filename = sys.argv[1]
cssfilename = sys.argv[2]
pqd = pq(filename=filename)
-
-with open(filename, 'rb') as fopen:
- raw_data = fopen.read()
-
+raw_data = open(filename, 'rb').read()
if raw_data.count('Packaging Python Libraries
-
-
-
— Marquise de Merteuil, Dangerous Liaisons
-Diving In
-setup.py. In fact, you’ve already seen several Distutils setup scripts in this book. You used Distutils to install httplib2 in HTTP Web Services and again to install chardet in Case Study: Porting chardet to Python 3.
-
-chardet and httplib2 work, and you’ll step through the process of releasing your own Python software.
-
-
-
-# chardet's setup.py
-from distutils.core import setup
-setup(
- name = "chardet",
- packages = ["chardet"],
- version = "1.0.2",
- description = "Universal encoding detector",
- author = "Mark Pilgrim",
- author_email = "mark@diveintomark.org",
- url = "http://chardet.feedparser.org/",
- download_url = "http://chardet.feedparser.org/download/python3-chardet-1.0.1.tgz",
- keywords = ["encoding", "i18n", "xml"],
- classifiers = [
- "Programming Language :: Python",
- "Programming Language :: Python :: 3",
- "Development Status :: 4 - Beta",
- "Environment :: Other Environment",
- "Intended Audience :: Developers",
- "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
- "Operating System :: OS Independent",
- "Topic :: Software Development :: Libraries :: Python Modules",
- "Topic :: Text Processing :: Linguistic",
- ],
- long_description = """\
-Universal character encoding detector
--------------------------------------
-
-Detects
- - ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
- - Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
- - EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
- - EUC-KR, ISO-2022-KR (Korean)
- - KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
- - ISO-8859-2, windows-1250 (Hungarian)
- - ISO-8859-5, windows-1251 (Bulgarian)
- - windows-1252 (English)
- - ISO-8859-7, windows-1253 (Greek)
- - ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
- - TIS-620 (Thai)
-
-This version requires Python 3 or later; a Python 2 version is available separately.
-"""
-)
-
-
-chardet and httplib2 are open source, but there’s no requirement that you release your own Python libraries under any particular license. The process described in this chapter will work for any Python software, regardless of license.
-Things Distutils Can’t Do For You
-
-
-
-
-
-
-Directory Structure
-
-httplib2 directory looks like this:
-
-
-httplib2/ ①
-|
-+--README.txt ②
-|
-+--setup.py ③
-|
-+--httplib2/ ④
- |
- +--__init__.py
- |
- +--iri2uri.py
-
-
-
-.txt extension, and it should use Windows-style carriage returns. Just because you use a fancy text editor that runs from the command line and includes its own macro language, that doesn’t mean you need to make life difficult for your users. (Your users use Notepad. Sad but true.) Even if you’re on Linux or Mac OS X, your fancy text editor undoubtedly has an option to save files with Windows-style carriage returns.
-setup.py unless you have a good reason not to. You do not have a good reason not to.
-.py file, you should put it in the root directory along with your “read me” file and your setup script. But httplib2 is not a single .py file; it’s a multi-file module. But that’s OK! Just put the httplib2 directory in the root directory, so you have an __init__.py file within an httplib2/ directory within the httplib2/ root directory. That’s not a problem; in fact, it will simplify your packaging process.
-chardet directory looks slightly different. Like httplib2, it’s a multi-file module, so there’s a chardet/ directory within the chardet/ root directory. In addition to the README.txt file, chardet has HTML-formatted documentation in the docs/ directory. The docs/ directory contains several .html and .css files and an images/ subdirectory, which contains several .png and .gif files. (This will be important later.) Also, in keeping with the convention for (L)GPL-licensed software, it has a separate file called COPYING.txt which contains the complete text of the LGPL.
-
-
-
-
-chardet/
-|
-+--COPYING.txt
-|
-+--setup.py
-|
-+--README.txt
-|
-+--docs/
-| |
-| +--index.html
-| |
-| +--usage.html
-| |
-| +--images/ ...
-|
-+--chardet/
- |
- +--__init__.py
- |
- +--big5freq.py
- |
- +--...
-Writing Your Setup Script
-
-
-
-from distutils.core import setup
-setup() function, which is the main entry point into Distutils. 95% of all Distutils setup scripts consist of a single call to setup() and nothing else. (I totally just made up that statistic, but if your Distutils setup script is doing more than calling the Distutils setup() function, you should have a good reason. Do you have a good reason? I didn’t think so.)
-
-setup() function can take dozens of parameters. For the sanity of everyone involved, you must use named arguments for every parameter. This is not merely a convention; it’s a hard requirement. Your setup script will crash if you try to call the setup() function with non-named arguments.
-
-
-
-
-
-
-
-
-
-
-chardet setup script. It has all of these required and recommended parameters, plus one I haven’t mentioned yet: packages.
-
-
-
-from distutils.core import setup
-setup(
- name = 'chardet',
- packages = ['chardet'],
- version = '1.0.2',
- description = 'Universal encoding detector',
- author='Mark Pilgrim',
- ...
-)
-packages parameter highlights an unfortunate vocabulary overlap in the distribution process. We’ve been talking about the “package” as the thing you’re building (and potentially listing in The Python “Package” Index). But that’s not what this packages parameter refers to. It refers to the fact that the chardet module is a multi-file module, sometimes known as… a “package.” The packages parameter tells Distutils to include the chardet/ directory, its __init__.py file, and all the other .py files that constitute the chardet module. That’s kind of important; all this happy talk about documentation and metadata is irrelevant if you forget to include the actual code!
-
-Classifying Your Package
-
-classifiers parameter to the Distutils setup() function. The classifiers parameter is a list of strings. These strings are not freeform. All classifier strings should come from this list on PyPI.
-
-
-
-
-"Programming Language :: Python" and "Programming Language :: Python :: 3". If you do not include these, your package will not show up in this list of Python 3-compatible libraries, which is linked from the sidebar of every single page of pypi.python.org.
-"Operating System :: OS Independent". Multiple Operating System classifiers are only necessary if your software requires specific support for each platform. (This is not common.)
-
-
-
-Developers, End Users/Desktop, Science/Research, and System Administrators.
-Framework classifier. If not, omit it.
-Examples of Good Package Classifiers
-
-Programming Language :: Python :: 3 classifier is not listed.)
-
-
-
-Programming Language :: Python
-License :: OSI Approved :: BSD License
-Operating System :: OS Independent
-Development Status :: 5 - Production/Stable
-Environment :: Web Environment
-Framework :: Django
-Intended Audience :: Developers
-Topic :: Internet :: WWW/HTTP
-Topic :: Internet :: WWW/HTTP :: Dynamic Content
-Topic :: Internet :: WWW/HTTP :: WSGI
-Topic :: Software Development :: Libraries :: Python Modules
-chardet, the character encoding detection library covered in Case Study: Porting chardet to Python 3. chardet is beta quality, cross-platform, Python 3-compatible, LGPL-licensed, and intended for developers to integrate into their own products.
-
-
-
-Programming Language :: Python
-Programming Language :: Python :: 3
-License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
-Operating System :: OS Independent
-Development Status :: 4 - Beta
-Environment :: Other Environment
-Intended Audience :: Developers
-Topic :: Text Processing :: Linguistic
-Topic :: Software Development :: Libraries :: Python Modules
-httplib2, the HTTP module I mentioned at the beginning of this chapter. httplib2 is beta quality, cross-platform, MIT-licensed, and intended for Python developers.
-
-
-
-Programming Language :: Python
-Programming Language :: Python :: 3
-License :: OSI Approved :: MIT License
-Operating System :: OS Independent
-Development Status :: 4 - Beta
-Environment :: Web Environment
-Intended Audience :: Developers
-Topic :: Internet :: WWW/HTTP
-Topic :: Software Development :: Libraries :: Python Modules
-Specifying Additional Files With A Manifest
-
-
-
-
-README.txt
-setup.py
-.py files needed by the multi-file modules listed in the packages parameter
-.py files listed in the py_modules parameter
-httplib2 project. But for the chardet project, we also want to include the COPYING.txt license file and the entire docs/ directory that contains images and HTML files. To tell Distutils to include these additional files and directories when it builds the chardet release package, you need a manifest file.
-
-MANIFEST.in. Place it in the project’s root directory, next to README.txt and setup.py. Manifest files are not Python scripts; they are text files that contain a series of “commands” in a Distutils-defined format. Manifest commands allow you to include or exclude specific files and directories.
-
-chardet project:
-
-
-include COPYING.txt ①
-recursive-include docs *.html *.css *.png *.gif ②
-
-
-COPYING.txt file from the project’s root directory.
-recursive-include command takes a directory name and one or more filenames. The filenames aren’t limited to specific files; they can include wildcards. This line means “See that docs/ directory in the project’s root directory? Look in there (recursively) for .html, .css, .png, and .gif files. I want all of them in my release package.”
-recursive-include command is not going to put a bunch of .html and .png files in the root directory of the release package. It’s going to maintain the existing docs/ directory structure, but only include those files inside that directory that match the given wildcards. (I didn’t mention it earlier, but the chardet documentation is actually written in XML and converted to HTML by a separate script. I don’t want to include the XML files in the release package, just the HTML and the images.)
-
-
-
-
-Checking Your Setup Script for Errors
-
-version parameter, Distutils will remind you.
-
-
-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
-running check
-warning: check: missing required meta-data: version
-
-version parameter (and all the other required bits of metadata), the check command will look like this:
-
-
-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
-running check
-
-Creating a Source Distribution
-
-sdist command to your Distutils setup script.
-
-
-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py sdist
-running sdist
-running check
-reading manifest template 'MANIFEST.in'
-writing manifest file 'MANIFEST'
-creating chardet-1.0.2
-creating chardet-1.0.2\chardet
-creating chardet-1.0.2\docs
-creating chardet-1.0.2\docs\images
-copying files to chardet-1.0.2...
-copying COPYING -> chardet-1.0.2
-copying README.txt -> chardet-1.0.2
-copying setup.py -> chardet-1.0.2
-copying chardet\__init__.py -> chardet-1.0.2\chardet
-copying chardet\big5freq.py -> chardet-1.0.2\chardet
-...
-copying chardet\universaldetector.py -> chardet-1.0.2\chardet
-copying chardet\utf8prober.py -> chardet-1.0.2\chardet
-copying docs\faq.html -> chardet-1.0.2\docs
-copying docs\history.html -> chardet-1.0.2\docs
-copying docs\how-it-works.html -> chardet-1.0.2\docs
-copying docs\index.html -> chardet-1.0.2\docs
-copying docs\license.html -> chardet-1.0.2\docs
-copying docs\supported-encodings.html -> chardet-1.0.2\docs
-copying docs\usage.html -> chardet-1.0.2\docs
-copying docs\images\caution.png -> chardet-1.0.2\docs\images
-copying docs\images\important.png -> chardet-1.0.2\docs\images
-copying docs\images\note.png -> chardet-1.0.2\docs\images
-copying docs\images\permalink.gif -> chardet-1.0.2\docs\images
-copying docs\images\tip.png -> chardet-1.0.2\docs\images
-copying docs\images\warning.png -> chardet-1.0.2\docs\images
-creating dist
-creating 'dist\chardet-1.0.2.zip' and adding 'chardet-1.0.2' to it
-adding 'chardet-1.0.2\COPYING'
-adding 'chardet-1.0.2\PKG-INFO'
-adding 'chardet-1.0.2\README.txt'
-adding 'chardet-1.0.2\setup.py'
-adding 'chardet-1.0.2\chardet\big5freq.py'
-adding 'chardet-1.0.2\chardet\big5prober.py'
-...
-adding 'chardet-1.0.2\chardet\universaldetector.py'
-adding 'chardet-1.0.2\chardet\utf8prober.py'
-adding 'chardet-1.0.2\chardet\__init__.py'
-adding 'chardet-1.0.2\docs\faq.html'
-adding 'chardet-1.0.2\docs\history.html'
-adding 'chardet-1.0.2\docs\how-it-works.html'
-adding 'chardet-1.0.2\docs\index.html'
-adding 'chardet-1.0.2\docs\license.html'
-adding 'chardet-1.0.2\docs\supported-encodings.html'
-adding 'chardet-1.0.2\docs\usage.html'
-adding 'chardet-1.0.2\docs\images\caution.png'
-adding 'chardet-1.0.2\docs\images\important.png'
-adding 'chardet-1.0.2\docs\images\note.png'
-adding 'chardet-1.0.2\docs\images\permalink.gif'
-adding 'chardet-1.0.2\docs\images\tip.png'
-adding 'chardet-1.0.2\docs\images\warning.png'
-removing 'chardet-1.0.2' (and everything under it)
-
-
-
-
-MANIFEST.in).
-COPYING.txt and the HTML and image files in the docs/ directory.
-dist/ directory. Within the dist/ directory is the .zip file that you can distribute.
-
-c:\Users\pilgrim\chardet> dir dist
- Volume in drive C has no label.
- Volume Serial Number is DED5-B4F8
-
- Directory of c:\Users\pilgrim\chardet\dist
-
-07/30/2009 06:29 PM <DIR> .
-07/30/2009 06:29 PM <DIR> ..
-07/30/2009 06:29 PM 206,440 chardet-1.0.2.zip
- 1 File(s) 206,440 bytes
- 2 Dir(s) 61,424,635,904 bytes free
-
-Creating a Graphical Installer
-
-bdist_wininst command to your Distutils setup script.
-
-
-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py bdist_wininst
-running bdist_wininst
-running build
-running build_py
-creating build
-creating build\lib
-creating build\lib\chardet
-copying chardet\big5freq.py -> build\lib\chardet
-copying chardet\big5prober.py -> build\lib\chardet
-...
-copying chardet\universaldetector.py -> build\lib\chardet
-copying chardet\utf8prober.py -> build\lib\chardet
-copying chardet\__init__.py -> build\lib\chardet
-installing to build\bdist.win32\wininst
-running install_lib
-creating build\bdist.win32
-creating build\bdist.win32\wininst
-creating build\bdist.win32\wininst\PURELIB
-creating build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\big5freq.py -> build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\big5prober.py -> build\bdist.win32\wininst\PURELIB\chardet
-...
-copying build\lib\chardet\universaldetector.py -> build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\utf8prober.py -> build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\__init__.py -> build\bdist.win32\wininst\PURELIB\chardet
-running install_egg_info
-Writing build\bdist.win32\wininst\PURELIB\chardet-1.0.2-py3.1.egg-info
-creating 'c:\users\pilgrim\appdata\local\temp\tmp2f4h7e.zip' and adding '.' to it
-adding 'PURELIB\chardet-1.0.2-py3.1.egg-info'
-adding 'PURELIB\chardet\big5freq.py'
-adding 'PURELIB\chardet\big5prober.py'
-...
-adding 'PURELIB\chardet\universaldetector.py'
-adding 'PURELIB\chardet\utf8prober.py'
-adding 'PURELIB\chardet\__init__.py'
-removing 'build\bdist.win32\wininst' (and everything under it)
-c:\Users\pilgrim\chardet> dir dist
- Volume in drive C has no label.
- Volume Serial Number is AADE-E29F
-
- Directory of c:\Users\pilgrim\chardet\dist
-
-07/30/2009 10:14 PM <DIR> .
-07/30/2009 10:14 PM <DIR> ..
-07/30/2009 10:14 PM 371,236 chardet-1.0.2.win32.exe
-07/30/2009 06:29 PM 206,440 chardet-1.0.2.zip
- 2 File(s) 577,676 bytes
- 2 Dir(s) 61,424,070,656 bytes free
-
-Building Installable Packages for Other Operating Systems
-
-chardet library is in the Debian GNU/Linux repositories (and therefore in the Ubuntu repositories as well). I had nothing to do with this; the packages just showed up there one day. The Debian community has their own policies for packaging Python libraries, and the Debian python-chardet package is designed to follow these conventions. And since the package lives in Debian’s repositories, Debian users will receive security updates and/or new versions, depending on the system-wide settings they’ve chosen to manage their own computers.
-
-Adding Your Software to The Python Package Index
-
-
-
-
-setup.py sdist and setup.py bdist_*
-Register button. (If you have a PGP or GPG key, you can also provide that. If you don’t have one or don’t know what that means, don’t worry about it.) Check your email; within a few minutes, you should receive a message from PyPI with a validation link. Click the link to complete the registration process.
-
-
-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload ①
-running register
-We need to know who you are, so please choose either:
- 1. use your existing login,
- 2. register as a new user,
- 3. have the server generate a new password for you (and email it to you), or
- 4. quit
-Your selection [default 1]: 1 ②
-Username: MarkPilgrim ③
-Password:
-Registering chardet to http://pypi.python.org/pypi ④
-Server response (200): OK
-running sdist ⑤
-... output trimmed for brevity ...
-running bdist_wininst ⑥
-... output trimmed for brevity ...
-running upload ⑦
-Submitting dist\chardet-1.0.2.zip to http://pypi.python.org/pypi
-Server response (200): OK
-Submitting dist\chardet-1.0.2.win32.exe to http://pypi.python.org/pypi
-Server response (200): OK
-I can store your PyPI login so future submissions will be faster.
-(the login will be stored in c:\home\.pypirc)
-Save your login (y/N)?n ⑧
-
-
-
-setup.py parameters. Next, it builds a source distribution (sdist) and a Windows installer (bdist_wininst), then uploads them to PyPI (upload).
-http://pypi.python.org/pypi/NAME, where NAME is the string you passed in the name parameter in your setup.py file.
-
-setup.py with the new version number, then run the same upload command again:
-
-
-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload
-
-
-The Many Possible Futures of Python Packaging
-
-
-
-
-
-
-
-virtualenv
-zc.buildout
-py2exe
-Further Reading
-
-
-
-
-setup() function
-site-packages directory
-
-
-
-Packaging Python Libraries
+
+
+
— Marquise de Merteuil, Dangerous Liaisons
+Diving In
+setup.py. In fact, you’ve already seen several Distutils setup scripts in this book. You used Distutils to install httplib2 in HTTP Web Services and again to install chardet in Case Study: Porting chardet to Python 3.
+
+chardet and httplib2 work, and you’ll step through the process of releasing your own Python software.
+
+
+
+# chardet's setup.py
+from distutils.core import setup
+setup(
+ name = "chardet",
+ packages = ["chardet"],
+ version = "1.0.2",
+ description = "Universal encoding detector",
+ author = "Mark Pilgrim",
+ author_email = "mark@diveintomark.org",
+ url = "http://chardet.feedparser.org/",
+ download_url = "http://chardet.feedparser.org/download/python3-chardet-1.0.1.tgz",
+ keywords = ["encoding", "i18n", "xml"],
+ classifiers = [
+ "Programming Language :: Python",
+ "Programming Language :: Python :: 3",
+ "Development Status :: 4 - Beta",
+ "Environment :: Other Environment",
+ "Intended Audience :: Developers",
+ "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
+ "Operating System :: OS Independent",
+ "Topic :: Software Development :: Libraries :: Python Modules",
+ "Topic :: Text Processing :: Linguistic",
+ ],
+ long_description = """\
+Universal character encoding detector
+-------------------------------------
+
+Detects
+ - ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
+ - Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
+ - EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
+ - EUC-KR, ISO-2022-KR (Korean)
+ - KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
+ - ISO-8859-2, windows-1250 (Hungarian)
+ - ISO-8859-5, windows-1251 (Bulgarian)
+ - windows-1252 (English)
+ - ISO-8859-7, windows-1253 (Greek)
+ - ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
+ - TIS-620 (Thai)
+
+This version requires Python 3 or later; a Python 2 version is available separately.
+"""
+)
+
+
+chardet and httplib2 are open source, but there’s no requirement that you release your own Python libraries under any particular license. The process described in this chapter will work for any Python software, regardless of license.
+Things Distutils Can’t Do For You
+
+
+
+
+
+
+Directory Structure
+
+httplib2 directory looks like this:
+
+
+httplib2/ ①
+|
++--README.txt ②
+|
++--setup.py ③
+|
++--httplib2/ ④
+ |
+ +--__init__.py
+ |
+ +--iri2uri.py
+
+
+
+.txt extension, and it should use Windows-style carriage returns. Just because you use a fancy text editor that runs from the command line and includes its own macro language, that doesn’t mean you need to make life difficult for your users. (Your users use Notepad. Sad but true.) Even if you’re on Linux or Mac OS X, your fancy text editor undoubtedly has an option to save files with Windows-style carriage returns.
+setup.py unless you have a good reason not to. You do not have a good reason not to.
+.py file, you should put it in the root directory along with your “read me” file and your setup script. But httplib2 is not a single .py file; it’s a multi-file module. But that’s OK! Just put the httplib2 directory in the root directory, so you have an __init__.py file within an httplib2/ directory within the httplib2/ root directory. That’s not a problem; in fact, it will simplify your packaging process.
+chardet directory looks slightly different. Like httplib2, it’s a multi-file module, so there’s a chardet/ directory within the chardet/ root directory. In addition to the README.txt file, chardet has HTML-formatted documentation in the docs/ directory. The docs/ directory contains several .html and .css files and an images/ subdirectory, which contains several .png and .gif files. (This will be important later.) Also, in keeping with the convention for (L)GPL-licensed software, it has a separate file called COPYING.txt which contains the complete text of the LGPL.
+
+
+
+
+chardet/
+|
++--COPYING.txt
+|
++--setup.py
+|
++--README.txt
+|
++--docs/
+| |
+| +--index.html
+| |
+| +--usage.html
+| |
+| +--images/ ...
+|
++--chardet/
+ |
+ +--__init__.py
+ |
+ +--big5freq.py
+ |
+ +--...
+Writing Your Setup Script
+
+
+
+from distutils.core import setup
+setup() function, which is the main entry point into Distutils. 95% of all Distutils setup scripts consist of a single call to setup() and nothing else. (I totally just made up that statistic, but if your Distutils setup script is doing more than calling the Distutils setup() function, you should have a good reason. Do you have a good reason? I didn’t think so.)
+
+setup() function can take dozens of parameters. For the sanity of everyone involved, you must use named arguments for every parameter. This is not merely a convention; it’s a hard requirement. Your setup script will crash if you try to call the setup() function with non-named arguments.
+
+
+
+
+
+
+
+
+
+
+chardet setup script. It has all of these required and recommended parameters, plus one I haven’t mentioned yet: packages.
+
+
+
+from distutils.core import setup
+setup(
+ name = 'chardet',
+ packages = ['chardet'],
+ version = '1.0.2',
+ description = 'Universal encoding detector',
+ author='Mark Pilgrim',
+ ...
+)
+packages parameter highlights an unfortunate vocabulary overlap in the distribution process. We’ve been talking about the “package” as the thing you’re building (and potentially listing in The Python “Package” Index). But that’s not what this packages parameter refers to. It refers to the fact that the chardet module is a multi-file module, sometimes known as… a “package.” The packages parameter tells Distutils to include the chardet/ directory, its __init__.py file, and all the other .py files that constitute the chardet module. That’s kind of important; all this happy talk about documentation and metadata is irrelevant if you forget to include the actual code!
+
+Classifying Your Package
+
+classifiers parameter to the Distutils setup() function. The classifiers parameter is a list of strings. These strings are not freeform. All classifier strings should come from this list on PyPI.
+
+
+
+
+"Programming Language :: Python" and "Programming Language :: Python :: 3". If you do not include these, your package will not show up in this list of Python 3-compatible libraries, which is linked from the sidebar of every single page of pypi.python.org.
+"Operating System :: OS Independent". Multiple Operating System classifiers are only necessary if your software requires specific support for each platform. (This is not common.)
+
+
+
+Developers, End Users/Desktop, Science/Research, and System Administrators.
+Framework classifier. If not, omit it.
+Examples of Good Package Classifiers
+
+Programming Language :: Python :: 3 classifier is not listed.)
+
+
+
+Programming Language :: Python
+License :: OSI Approved :: BSD License
+Operating System :: OS Independent
+Development Status :: 5 - Production/Stable
+Environment :: Web Environment
+Framework :: Django
+Intended Audience :: Developers
+Topic :: Internet :: WWW/HTTP
+Topic :: Internet :: WWW/HTTP :: Dynamic Content
+Topic :: Internet :: WWW/HTTP :: WSGI
+Topic :: Software Development :: Libraries :: Python Modules
+chardet, the character encoding detection library covered in Case Study: Porting chardet to Python 3. chardet is beta quality, cross-platform, Python 3-compatible, LGPL-licensed, and intended for developers to integrate into their own products.
+
+
+
+Programming Language :: Python
+Programming Language :: Python :: 3
+License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
+Operating System :: OS Independent
+Development Status :: 4 - Beta
+Environment :: Other Environment
+Intended Audience :: Developers
+Topic :: Text Processing :: Linguistic
+Topic :: Software Development :: Libraries :: Python Modules
+httplib2, the HTTP module I mentioned at the beginning of this chapter. httplib2 is beta quality, cross-platform, MIT-licensed, and intended for Python developers.
+
+
+
+Programming Language :: Python
+Programming Language :: Python :: 3
+License :: OSI Approved :: MIT License
+Operating System :: OS Independent
+Development Status :: 4 - Beta
+Environment :: Web Environment
+Intended Audience :: Developers
+Topic :: Internet :: WWW/HTTP
+Topic :: Software Development :: Libraries :: Python Modules
+Specifying Additional Files With A Manifest
+
+
+
+
+README.txt
+setup.py
+.py files needed by the multi-file modules listed in the packages parameter
+.py files listed in the py_modules parameter
+httplib2 project. But for the chardet project, we also want to include the COPYING.txt license file and the entire docs/ directory that contains images and HTML files. To tell Distutils to include these additional files and directories when it builds the chardet release package, you need a manifest file.
+
+MANIFEST.in. Place it in the project’s root directory, next to README.txt and setup.py. Manifest files are not Python scripts; they are text files that contain a series of “commands” in a Distutils-defined format. Manifest commands allow you to include or exclude specific files and directories.
+
+chardet project:
+
+
+include COPYING.txt ①
+recursive-include docs *.html *.css *.png *.gif ②
+
+
+COPYING.txt file from the project’s root directory.
+recursive-include command takes a directory name and one or more filenames. The filenames aren’t limited to specific files; they can include wildcards. This line means “See that docs/ directory in the project’s root directory? Look in there (recursively) for .html, .css, .png, and .gif files. I want all of them in my release package.”
+recursive-include command is not going to put a bunch of .html and .png files in the root directory of the release package. It’s going to maintain the existing docs/ directory structure, but only include those files inside that directory that match the given wildcards. (I didn’t mention it earlier, but the chardet documentation is actually written in XML and converted to HTML by a separate script. I don’t want to include the XML files in the release package, just the HTML and the images.)
+
+
+
+
+Checking Your Setup Script for Errors
+
+version parameter, Distutils will remind you.
+
+
+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
+running check
+warning: check: missing required meta-data: version
+
+version parameter (and all the other required bits of metadata), the check command will look like this:
+
+
+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
+running check
+
+Creating a Source Distribution
+
+sdist command to your Distutils setup script.
+
+
+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py sdist
+running sdist
+running check
+reading manifest template 'MANIFEST.in'
+writing manifest file 'MANIFEST'
+creating chardet-1.0.2
+creating chardet-1.0.2\chardet
+creating chardet-1.0.2\docs
+creating chardet-1.0.2\docs\images
+copying files to chardet-1.0.2...
+copying COPYING -> chardet-1.0.2
+copying README.txt -> chardet-1.0.2
+copying setup.py -> chardet-1.0.2
+copying chardet\__init__.py -> chardet-1.0.2\chardet
+copying chardet\big5freq.py -> chardet-1.0.2\chardet
+...
+copying chardet\universaldetector.py -> chardet-1.0.2\chardet
+copying chardet\utf8prober.py -> chardet-1.0.2\chardet
+copying docs\faq.html -> chardet-1.0.2\docs
+copying docs\history.html -> chardet-1.0.2\docs
+copying docs\how-it-works.html -> chardet-1.0.2\docs
+copying docs\index.html -> chardet-1.0.2\docs
+copying docs\license.html -> chardet-1.0.2\docs
+copying docs\supported-encodings.html -> chardet-1.0.2\docs
+copying docs\usage.html -> chardet-1.0.2\docs
+copying docs\images\caution.png -> chardet-1.0.2\docs\images
+copying docs\images\important.png -> chardet-1.0.2\docs\images
+copying docs\images\note.png -> chardet-1.0.2\docs\images
+copying docs\images\permalink.gif -> chardet-1.0.2\docs\images
+copying docs\images\tip.png -> chardet-1.0.2\docs\images
+copying docs\images\warning.png -> chardet-1.0.2\docs\images
+creating dist
+creating 'dist\chardet-1.0.2.zip' and adding 'chardet-1.0.2' to it
+adding 'chardet-1.0.2\COPYING'
+adding 'chardet-1.0.2\PKG-INFO'
+adding 'chardet-1.0.2\README.txt'
+adding 'chardet-1.0.2\setup.py'
+adding 'chardet-1.0.2\chardet\big5freq.py'
+adding 'chardet-1.0.2\chardet\big5prober.py'
+...
+adding 'chardet-1.0.2\chardet\universaldetector.py'
+adding 'chardet-1.0.2\chardet\utf8prober.py'
+adding 'chardet-1.0.2\chardet\__init__.py'
+adding 'chardet-1.0.2\docs\faq.html'
+adding 'chardet-1.0.2\docs\history.html'
+adding 'chardet-1.0.2\docs\how-it-works.html'
+adding 'chardet-1.0.2\docs\index.html'
+adding 'chardet-1.0.2\docs\license.html'
+adding 'chardet-1.0.2\docs\supported-encodings.html'
+adding 'chardet-1.0.2\docs\usage.html'
+adding 'chardet-1.0.2\docs\images\caution.png'
+adding 'chardet-1.0.2\docs\images\important.png'
+adding 'chardet-1.0.2\docs\images\note.png'
+adding 'chardet-1.0.2\docs\images\permalink.gif'
+adding 'chardet-1.0.2\docs\images\tip.png'
+adding 'chardet-1.0.2\docs\images\warning.png'
+removing 'chardet-1.0.2' (and everything under it)
+
+
+
+
+MANIFEST.in).
+COPYING.txt and the HTML and image files in the docs/ directory.
+dist/ directory. Within the dist/ directory is the .zip file that you can distribute.
+
+c:\Users\pilgrim\chardet> dir dist
+ Volume in drive C has no label.
+ Volume Serial Number is DED5-B4F8
+
+ Directory of c:\Users\pilgrim\chardet\dist
+
+07/30/2009 06:29 PM <DIR> .
+07/30/2009 06:29 PM <DIR> ..
+07/30/2009 06:29 PM 206,440 chardet-1.0.2.zip
+ 1 File(s) 206,440 bytes
+ 2 Dir(s) 61,424,635,904 bytes free
+
+Creating a Graphical Installer
+
+bdist_wininst command to your Distutils setup script.
+
+
+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py bdist_wininst
+running bdist_wininst
+running build
+running build_py
+creating build
+creating build\lib
+creating build\lib\chardet
+copying chardet\big5freq.py -> build\lib\chardet
+copying chardet\big5prober.py -> build\lib\chardet
+...
+copying chardet\universaldetector.py -> build\lib\chardet
+copying chardet\utf8prober.py -> build\lib\chardet
+copying chardet\__init__.py -> build\lib\chardet
+installing to build\bdist.win32\wininst
+running install_lib
+creating build\bdist.win32
+creating build\bdist.win32\wininst
+creating build\bdist.win32\wininst\PURELIB
+creating build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\big5freq.py -> build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\big5prober.py -> build\bdist.win32\wininst\PURELIB\chardet
+...
+copying build\lib\chardet\universaldetector.py -> build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\utf8prober.py -> build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\__init__.py -> build\bdist.win32\wininst\PURELIB\chardet
+running install_egg_info
+Writing build\bdist.win32\wininst\PURELIB\chardet-1.0.2-py3.1.egg-info
+creating 'c:\users\pilgrim\appdata\local\temp\tmp2f4h7e.zip' and adding '.' to it
+adding 'PURELIB\chardet-1.0.2-py3.1.egg-info'
+adding 'PURELIB\chardet\big5freq.py'
+adding 'PURELIB\chardet\big5prober.py'
+...
+adding 'PURELIB\chardet\universaldetector.py'
+adding 'PURELIB\chardet\utf8prober.py'
+adding 'PURELIB\chardet\__init__.py'
+removing 'build\bdist.win32\wininst' (and everything under it)
+c:\Users\pilgrim\chardet> dir dist
+ Volume in drive C has no label.
+ Volume Serial Number is AADE-E29F
+
+ Directory of c:\Users\pilgrim\chardet\dist
+
+07/30/2009 10:14 PM <DIR> .
+07/30/2009 10:14 PM <DIR> ..
+07/30/2009 10:14 PM 371,236 chardet-1.0.2.win32.exe
+07/30/2009 06:29 PM 206,440 chardet-1.0.2.zip
+ 2 File(s) 577,676 bytes
+ 2 Dir(s) 61,424,070,656 bytes free
+
+Building Installable Packages for Other Operating Systems
+
+chardet library is in the Debian GNU/Linux repositories (and therefore in the Ubuntu repositories as well). I had nothing to do with this; the packages just showed up there one day. The Debian community has their own policies for packaging Python libraries, and the Debian python-chardet package is designed to follow these conventions. And since the package lives in Debian’s repositories, Debian users will receive security updates and/or new versions, depending on the system-wide settings they’ve chosen to manage their own computers.
+
+Adding Your Software to The Python Package Index
+
+
+
+
+setup.py sdist and setup.py bdist_*
+Register button. (If you have a PGP or GPG key, you can also provide that. If you don’t have one or don’t know what that means, don’t worry about it.) Check your email; within a few minutes, you should receive a message from PyPI with a validation link. Click the link to complete the registration process.
+
+
+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload ①
+running register
+We need to know who you are, so please choose either:
+ 1. use your existing login,
+ 2. register as a new user,
+ 3. have the server generate a new password for you (and email it to you), or
+ 4. quit
+Your selection [default 1]: 1 ②
+Username: MarkPilgrim ③
+Password:
+Registering chardet to http://pypi.python.org/pypi ④
+Server response (200): OK
+running sdist ⑤
+... output trimmed for brevity ...
+running bdist_wininst ⑥
+... output trimmed for brevity ...
+running upload ⑦
+Submitting dist\chardet-1.0.2.zip to http://pypi.python.org/pypi
+Server response (200): OK
+Submitting dist\chardet-1.0.2.win32.exe to http://pypi.python.org/pypi
+Server response (200): OK
+I can store your PyPI login so future submissions will be faster.
+(the login will be stored in c:\home\.pypirc)
+Save your login (y/N)?n ⑧
+
+
+
+setup.py parameters. Next, it builds a source distribution (sdist) and a Windows installer (bdist_wininst), then uploads them to PyPI (upload).
+http://pypi.python.org/pypi/NAME, where NAME is the string you passed in the name parameter in your setup.py file.
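
The URL pattern above is simple enough to sketch in a couple of lines. This is only an illustration of the naming rule described in the text; the helper `pypi_project_url` is hypothetical and not part of Distutils or PyPI.

```python
# Base URL as quoted in the text above.
PYPI_BASE = "http://pypi.python.org/pypi/"


def pypi_project_url(name):
    """Build the project page URL from the `name` passed to setup()."""
    return PYPI_BASE + name


url = pypi_project_url("chardet")
print(url)  # http://pypi.python.org/pypi/chardet
```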
+
+setup.py with the new version number, then run the same upload command again:
+
+
+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload
+
+
+The Many Possible Futures of Python Packaging
+
+
+
+
+
+
+
+virtualenv
+zc.buildout
+py2exe
+Further Reading
+
+
+
+
+setup() function
+site-packages directory
+
+
+
+Table of Contents
-
-
-
-
-
-
-None
-
-
-
-itertools Module
-
-
-chardet to Python 3
-
-
-chardet Module
-
-2to3
-2to3 Can’t
-
-
-False is invalid syntax
-constants
-'bytes' object to str implicitly
-'int' and 'bytes'
-ord() expected string of length 1, but int found
-int() >= str()
-'reduce' is not defined
-
-
-2to3
-
-
-print statement
-unicode() global function
-long data type
-has_key() dictionary method
-
-
-http
-urllib
-dbm
-xmlrpc
-next() iterator method
-filter() global function
-map() global function
-reduce() global function
-apply() global function
-intern() global function
-exec statement
-execfile statement
-repr literals (backticks)
-try...except statement
-raise statement
-throw method on generators
-xrange() global function
-raw_input() and input() global functions
-func_* function attributes
-xreadlines() I/O method
-lambda functions that take a tuple instead of multiple parameters
-__nonzero__ special method
-sys.maxint
-callable() global function
-zip() global function
-StandardError exception
-types module constants
-isinstance() global function
-basestring datatype
-itertools module
-sys.exc_type, sys.exc_value, sys.exc_traceback
-os.getcwdu() function
-
-
-with Block
-Table of Contents
+
+
+
+
+
+
+None
+
+
+
+itertools Module
+
+
+chardet to Python 3
+
+
+chardet Module
+
+2to3
+2to3 Can’t
+
+
+False is invalid syntax
+constants
+'bytes' object to str implicitly
+'int' and 'bytes'
+ord() expected string of length 1, but int found
+int() >= str()
+'reduce' is not defined
+
+
+2to3
+
+
+print statement
+unicode() global function
+long data type
+has_key() dictionary method
+
+
+http
+urllib
+dbm
+xmlrpc
+next() iterator method
+filter() global function
+map() global function
+reduce() global function
+apply() global function
+intern() global function
+exec statement
+execfile statement
+repr literals (backticks)
+try...except statement
+raise statement
+throw method on generators
+xrange() global function
+raw_input() and input() global functions
+func_* function attributes
+xreadlines() I/O method
+lambda functions that take a tuple instead of multiple parameters
+__nonzero__ special method
+sys.maxint
+callable() global function
+zip() global function
+StandardError exception
+types module constants
+isinstance() global function
+basestring datatype
+itertools module
+sys.exc_type, sys.exc_value, sys.exc_traceback
+os.getcwdu() function
+
+
+with Block
+