diff --git a/.htaccess b/.htaccess
index 28a220d..b901de6 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
-FileETag MTime Size
-
-SetEnv dont-vary
+FileETag MTime Size
+
+SetEnv dont-vary
diff --git a/advanced-iterators.html b/advanced-iterators.html
index fb099d8..ee15c6d 100755
--- a/advanced-iterators.html
+++ b/advanced-iterators.html
@@ -1,647 +1,647 @@
-
-
-Advanced Iterators - Dive Into Python 3
-
-
-
-
-
-
-
  
-

You are here: Home Dive Into Python 3 -

Difficulty level: ♦♦♦♦♢ -

Advanced Iterators

-
-

Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum.
— Augustus De Morgan -

-

  -

Diving In

-

Just as regular expressions put strings on steroids, the itertools module puts iterators on steroids. But first, I want to show you a classic puzzle. - -

HAWAII + IDAHO + IOWA + OHIO == STATES
-510199 + 98153 + 9301 + 3593 == 621246
-
-H = 5
-A = 1
-W = 0
-I = 9
-D = 8
-O = 3
-S = 6
-T = 2
-E = 4
- -

Puzzles like this are called cryptarithms or alphametics. The letters spell out actual words, but if you replace each letter with a digit from 0–9, it also “spells” an arithmetic equation. The trick is to figure out which letter maps to each digit. All the occurrences of each letter must map to the same digit, no digit can be repeated, and no “word” can start with the digit 0. - -

- -

In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code. - -

[download alphametics.py] -

import re
-import itertools
-
-def solve(puzzle):
-    words = re.findall('[A-Z]+', puzzle.upper())
-    unique_characters = set(''.join(words))
-    assert len(unique_characters) <= 10, 'Too many letters'
-    first_letters = {word[0] for word in words}
-    n = len(first_letters)
-    sorted_characters = ''.join(first_letters) + \
-        ''.join(unique_characters - first_letters)
-    characters = tuple(ord(c) for c in sorted_characters)
-    digits = tuple(ord(c) for c in '0123456789')
-    zero = digits[0]
-    for guess in itertools.permutations(digits, len(characters)):
-        if zero not in guess[:n]:
-            equation = puzzle.translate(dict(zip(characters, guess)))
-            if eval(equation):
-                return equation
-
-if __name__ == '__main__':
-    import sys
-    for puzzle in sys.argv[1:]:
-        print(puzzle)
-        solution = solve(puzzle)
-        if solution:
-            print(solution)
- -

You can run the program from the command line. On Linux, it would look like this. (These may take some time, depending on the speed of your computer, and there is no progress bar. Just be patient!) - -

-you@localhost:~/diveintopython3/examples$ python3 alphametics.py "HAWAII + IDAHO + IOWA + OHIO == STATES"
-HAWAII + IDAHO + IOWA + OHIO == STATES
-510199 + 98153 + 9301 + 3593 == 621246
-you@localhost:~/diveintopython3/examples$ python3 alphametics.py "I + LOVE + YOU == DORA"
-I + LOVE + YOU == DORA
-1 + 2784 + 975 == 3760
-you@localhost:~/diveintopython3/examples$ python3 alphametics.py "SEND + MORE == MONEY"
-SEND + MORE == MONEY
-9567 + 1085 == 10652
- -

⁂ - -

Finding all occurrences of a pattern

- -

The first thing this alphametics solver does is find all the letters (A–Z) in the puzzle. - -

->>> import re
->>> re.findall('[0-9]+', '16 2-by-4s in rows of 8')  
-['16', '2', '4', '8']
->>> re.findall('[A-Z]+', 'SEND + MORE == MONEY')     
-['SEND', 'MORE', 'MONEY']
-
    -
  1. The re module is Python’s implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of digits. The findall() function returns a list of all the substrings that matched the pattern. -
  2. Here the regular expression pattern matches sequences of letters. Again, the return value is a list, and each item in the list is a string that matched the regular expression pattern. -
- -

Here’s another example that will stretch your brain a little. - -

->>> re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
-[' sixth s', " sheikh's s", " sheep's s"]
- - - -

Surprised? The regular expression looks for a space, an s, and then the shortest possible series of any character (.*?), then a space, then another s. Well, looking at that input string, I see five matches: - -

    -
  1. The sixth sick sheikh's sixth sheep's sick. -
  2. The sixth sick sheikh's sixth sheep's sick. -
  3. The sixth sick sheikh's sixth sheep's sick. -
  4. The sixth sick sheikh's sixth sheep's sick. -
  5. The sixth sick sheikh's sixth sheep's sick. -
- -

But the re.findall() function only returned three matches. Specifically, it returned the first, the third, and the fifth. Why is that? Because it doesn’t return overlapping matches. The first match overlaps with the second, so the first is returned and the second is skipped. Then the third overlaps with the fourth, so the third is returned and the fourth is skipped. Finally, the fifth is returned. Three matches, not five. - -
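If you do want the overlapping matches, one common workaround is to wrap the pattern in a zero-width lookahead, so that each match consumes no characters. This is only a sketch (the variable names are illustrative, not taken from the solver), but it recovers all five matches listed above:

```python
import re

tongue_twister = "The sixth sick sheikh's sixth sheep's sick."

# Plain findall() skips any match that overlaps an earlier one.
non_overlapping = re.findall(' s.*? s', tongue_twister)

# A lookahead (?=...) matches without consuming characters, so the
# search resumes one character later and overlapping hits are kept.
# The inner group captures the text the lookahead saw.
overlapping = re.findall('(?=( s.*? s))', tongue_twister)

print(len(non_overlapping))  # 3
print(len(overlapping))      # 5
```

With a single capturing group in the pattern, findall() returns the captured text rather than the (empty) overall match.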

This has nothing to do with the alphametics solver; I just thought it was interesting. - -

⁂ - -

Finding the unique items in a sequence

- -

Sets make it trivial to find the unique items in a sequence. - -

->>> a_list = ['The', 'sixth', 'sick', "sheik's", 'sixth', "sheep's", 'sick']
->>> set(a_list)                      
-{'sixth', 'The', "sheep's", 'sick', "sheik's"}
->>> a_string = 'EAST IS EAST'
->>> set(a_string)                    
-{'A', ' ', 'E', 'I', 'S', 'T'}
->>> words = ['SEND', 'MORE', 'MONEY']
->>> ''.join(words)                   
-'SENDMOREMONEY'
->>> set(''.join(words))              
-{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
-
    -
  1. Given a list of several strings, the set() function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth. Fifth — wait, that’s in the set already, so it only gets listed once, because Python sets don’t allow duplicates. Sixth. Seventh — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first. -
  2. The same technique works with strings, since a string is just a sequence of characters. -
  3. Given a list of strings, ''.join(a_list) concatenates all the strings together into one. -
  4. So, given a list of strings, this line of code returns all the unique characters across all the strings, with no duplicates. -
- -

The alphametics solver uses this technique to build a set of all the unique characters in the puzzle. - -

unique_characters = set(''.join(words))
- -

This set is later used to assign digits to characters as the solver iterates through the possible solutions. - -

⁂ - -

Making assertions

- -

Like many programming languages, Python has an assert statement. Here’s how it works. - -

->>> assert 1 + 1 == 2                                     
->>> assert 1 + 1 == 3                                     
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-AssertionError
->>> assert 2 + 2 == 5, "Only for very large values of 2"  
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-AssertionError: Only for very large values of 2
-
    -
  1. The assert statement is followed by any valid Python expression. In this case, the expression 1 + 1 == 2 evaluates to True, so the assert statement does nothing. -
  2. However, if the Python expression evaluates to False, the assert statement will raise an AssertionError. -
  3. You can also include a human-readable message that is printed if the AssertionError is raised. -
- -

Therefore, this line of code: - -

assert len(unique_characters) <= 10, 'Too many letters'
- -

…is equivalent to this: - -

if len(unique_characters) > 10:
-    raise AssertionError('Too many letters')
- -

The alphametics solver uses this exact assert statement to bail out early if the puzzle contains more than ten unique letters. Since each letter is assigned a unique digit, and there are only ten digits, a puzzle with more than ten unique letters cannot possibly have a solution. - -
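One caveat worth knowing: Python drops assert statements entirely when it is run with the -O option, so an assert is a sanity check rather than input validation. If the check must always run, an explicit raise does the job. The helper below is a hypothetical sketch, not part of alphametics.py:

```python
import re

def check_letter_count(puzzle, limit=10):
    """Raise ValueError if the puzzle uses more than `limit` distinct letters.

    Hypothetical helper; unlike assert, this check still runs under python -O.
    """
    words = re.findall('[A-Z]+', puzzle.upper())
    unique_characters = set(''.join(words))
    if len(unique_characters) > limit:
        raise ValueError('Too many letters')
    return unique_characters

letters = check_letter_count('SEND + MORE == MONEY')
print(len(letters))  # 8
```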

⁂ - -

Generator expressions

- -

A generator expression is like a generator function without the function. - -

->>> unique_characters = {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
->>> gen = (ord(c) for c in unique_characters)  
->>> gen                                        
-<generator object <genexpr> at 0x00BADC10>
->>> next(gen)                                  
-69
->>> next(gen)
-68
->>> tuple(ord(c) for c in unique_characters)   
-(69, 68, 77, 79, 78, 83, 82, 89)
-
    -
  1. A generator expression is like an anonymous function that yields values. The expression itself looks like a list comprehension, but it’s wrapped in parentheses instead of square brackets. -
  2. The generator expression returns… an iterator. -
  3. Calling next(gen) returns the next value from the iterator. -
  4. If you like, you can iterate through all the possible values and return a tuple, list, or set, by passing the generator expression to tuple(), list(), or set(). In these cases, you don’t need an extra set of parentheses — just pass the “bare” expression ord(c) for c in unique_characters to the tuple() function, and Python figures out that it’s a generator expression. -
- -
-

Using a generator expression instead of a list comprehension can save both CPU and RAM. If you’re building a list just to throw it away (e.g. passing it to tuple() or set()), use a generator expression instead! -

- -

Here’s another way to accomplish the same thing, using a generator function: - -

def ord_map(a_string):
-    for c in a_string:
-        yield ord(c)
-
-gen = ord_map(unique_characters)
- -

The generator expression is more compact but functionally equivalent. - -
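To see that the two forms behave alike, you can drive both by hand. This sketch uses a string rather than a set so the order is predictable; ord_map() is the function defined above, repeated here so the snippet runs on its own:

```python
def ord_map(a_string):
    for c in a_string:
        yield ord(c)

letters = 'SEND'

from_function = ord_map(letters)
from_expression = (ord(c) for c in letters)

# Both are lazy: nothing is computed until a value is requested.
print(next(from_function), next(from_expression))  # 83 83

# Consuming the rest yields the same remaining values from each.
print(list(from_function) == list(from_expression))  # True
```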

⁂ - -

Calculating Permutations… The Lazy Way!

- -

First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you’re doing. Here I’m talking about combinatorics, but if that doesn’t mean anything to you, don’t worry about it. As always, Wikipedia is your friend.) - -

The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to arrange them into smaller ordered lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like “let’s find the permutations of 3 different items taken 2 at a time,” which means you have a sequence of 3 items and you want to find all the possible ordered pairs. - -

->>> import itertools                              
->>> perms = itertools.permutations([1, 2, 3], 2)  
->>> next(perms)                                   
-(1, 2)
->>> next(perms)
-(1, 3)
->>> next(perms)
-(2, 1)                                            
->>> next(perms)
-(2, 3)
->>> next(perms)
-(3, 1)
->>> next(perms)
-(3, 2)
->>> next(perms)                                   
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-StopIteration
-
    -
  1. The itertools module has all kinds of fun stuff in it, including a permutations() function that does all the hard work of finding permutations. -
  2. The permutations() function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a for loop or any old place that iterates. Here I’ll step through the iterator manually to show all the values. -
  3. The first permutation of [1, 2, 3] taken 2 at a time is (1, 2). -
  4. Note that permutations are ordered: (2, 1) is different than (1, 2). -
  5. That’s it! Those are all the permutations of [1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2) never show up, because they contain repeats so they aren’t valid permutations. When there are no more permutations, the iterator raises a StopIteration exception. -
- - - -

The permutations() function doesn’t have to take a list. It can take any sequence — even a string. - -

->>> import itertools
->>> perms = itertools.permutations('ABC', 3)  
->>> next(perms)
-('A', 'B', 'C')                               
->>> next(perms)
-('A', 'C', 'B')
->>> next(perms)
-('B', 'A', 'C')
->>> next(perms)
-('B', 'C', 'A')
->>> next(perms)
-('C', 'A', 'B')
->>> next(perms)
-('C', 'B', 'A')
->>> next(perms)
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-StopIteration
->>> list(itertools.permutations('ABC', 3))    
-[('A', 'B', 'C'), ('A', 'C', 'B'),
- ('B', 'A', 'C'), ('B', 'C', 'A'),
- ('C', 'A', 'B'), ('C', 'B', 'A')]
-
    -
  1. A string is just a sequence of characters. For the purposes of finding permutations, the string 'ABC' is equivalent to the list ['A', 'B', 'C']. -
  2. The first permutation of the 3 items ['A', 'B', 'C'], taken 3 at a time, is ('A', 'B', 'C'). There are five other permutations — the same three characters in every conceivable order. -
  3. Since the permutations() function always returns an iterator, an easy way to debug permutations is to pass that iterator to the built-in list() function to see all the permutations immediately. -
- -
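The number of permutations of n items taken r at a time is n! / (n - r)!, which is why brute force gets slow quickly. As a rough illustration (count_permutations() is a made-up helper, and the figures are plain arithmetic, not measurements of the solver):

```python
import itertools
import math

def count_permutations(n, r):
    """Number of ordered arrangements of r items chosen from n."""
    return math.factorial(n) // math.factorial(n - r)

# Matches the six pairs stepped through above.
print(count_permutations(3, 2))                          # 6
print(len(list(itertools.permutations([1, 2, 3], 2))))   # 6

# Ten digits assigned to the eight letters of SEND + MORE == MONEY.
print(count_permutations(10, 8))                         # 1814400
```

So a puzzle with eight distinct letters can require up to about 1.8 million guesses.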

⁂ - -

Other Fun Stuff in the itertools Module

-
->>> import itertools
->>> list(itertools.product('ABC', '123'))   
-[('A', '1'), ('A', '2'), ('A', '3'), 
- ('B', '1'), ('B', '2'), ('B', '3'), 
- ('C', '1'), ('C', '2'), ('C', '3')]
->>> list(itertools.combinations('ABC', 2))  
-[('A', 'B'), ('A', 'C'), ('B', 'C')]
-
    -
  1. The itertools.product() function returns an iterator containing the Cartesian product of two sequences. -
  2. The itertools.combinations() function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the itertools.permutations() function, except combinations don’t include items that are duplicates of other items in a different order. So itertools.permutations('ABC', 2) will return both ('A', 'B') and ('B', 'A') (among others), but itertools.combinations('ABC', 2) will not return ('B', 'A') because it is a duplicate of ('A', 'B') in a different order. -
- -

[download favorite-people.txt] -

->>> names = list(open('examples/favorite-people.txt', encoding='utf-8'))  
->>> names
-['Dora\n', 'Ethan\n', 'Wesley\n', 'John\n', 'Anne\n',
-'Mike\n', 'Chris\n', 'Sarah\n', 'Alex\n', 'Lizzie\n']
->>> names = [name.rstrip() for name in names]                             
->>> names
-['Dora', 'Ethan', 'Wesley', 'John', 'Anne',
-'Mike', 'Chris', 'Sarah', 'Alex', 'Lizzie']
->>> names = sorted(names)                                                 
->>> names
-['Alex', 'Anne', 'Chris', 'Dora', 'Ethan',
-'John', 'Lizzie', 'Mike', 'Sarah', 'Wesley']
->>> names = sorted(names, key=len)                                        
->>> names
-['Alex', 'Anne', 'Dora', 'John', 'Mike',
-'Chris', 'Ethan', 'Sarah', 'Lizzie', 'Wesley']
-
    -
  1. This idiom returns a list of the lines in a text file. -
  2. Unfortunately (for this example), the list(open(filename)) idiom also includes the newline characters at the end of each line. This list comprehension uses the rstrip() string method to strip trailing whitespace from each line. (Strings also have an lstrip() method to strip leading whitespace, and a strip() method which strips both.) -
  3. The sorted() function takes a list and returns it sorted. By default, it sorts alphabetically. -
  4. But the sorted() function can also take a function as the key parameter, and it sorts by that key. In this case, the sort function is len(), so it sorts by len(each item). Shorter names come first, then longer, then longest. -
- -

What does this have to do with the itertools module? I’m glad you asked. - -

-…continuing from the previous interactive shell…
->>> import itertools
->>> groups = itertools.groupby(names, len)  
->>> groups
-<itertools.groupby object at 0x00BB20C0>
->>> list(groups)
-[(4, <itertools._grouper object at 0x00BA8BF0>),
- (5, <itertools._grouper object at 0x00BB4050>),
- (6, <itertools._grouper object at 0x00BB4030>)]
->>> groups = itertools.groupby(names, len)   
->>> for name_length, name_iter in groups:    
-...     print('Names with {0:d} letters:'.format(name_length))
-...     for name in name_iter:
-...         print(name)
-... 
-Names with 4 letters:
-Alex
-Anne
-Dora
-John
-Mike
-Names with 5 letters:
-Chris
-Ethan
-Sarah
-Names with 6 letters:
-Lizzie
-Wesley
-
    -
  1. The itertools.groupby() function takes a sequence and a key function, and returns an iterator that generates pairs. Each pair contains the result of key_function(each item) and another iterator containing all the items that shared that key result. -
  2. Calling the list() function “exhausted” the iterator, i.e. you’ve already generated every item in the iterator to make the list. There’s no “reset” button on an iterator; you can’t just start over once you’ve exhausted it. If you want to loop through it again (say, in the upcoming for loop), you need to call itertools.groupby() again to create a new iterator. -
  3. In this example, given a list of names already sorted by length, itertools.groupby(names, len) will put all the 4-letter names in one iterator, all the 5-letter names in another iterator, and so on. The groupby() function is completely generic; it could group strings by first letter, numbers by their number of factors, or any other key function you can think of. -
- - -
-

The itertools.groupby() function only works if the input sequence is already sorted by the grouping function. In the example above, you grouped a list of names by the len() function. That only worked because the input list was already sorted by length. -

- -
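Here is a small sketch of what goes wrong when the input is not sorted, and how sorting by the same key function fixes it. The shortened list of names is just for illustration:

```python
import itertools

names = ['Dora', 'Ethan', 'John', 'Chris', 'Anne']

# Unsorted input: a new group starts every time the key changes,
# so the same length shows up more than once.
unsorted_keys = [key for key, group in itertools.groupby(names, len)]
print(unsorted_keys)  # [4, 5, 4, 5, 4]

# Sorting by the same key function first gives one group per length.
grouped = {key: list(group)
           for key, group in itertools.groupby(sorted(names, key=len), len)}
print(grouped)  # {4: ['Dora', 'John', 'Anne'], 5: ['Ethan', 'Chris']}
```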

Are you watching closely? -

->>> list(range(0, 3))
-[0, 1, 2]
->>> list(range(10, 13))
-[10, 11, 12]
->>> list(itertools.chain(range(0, 3), range(10, 13)))        
-[0, 1, 2, 10, 11, 12]
->>> list(zip(range(0, 3), range(10, 13)))                    
-[(0, 10), (1, 11), (2, 12)]
->>> list(zip(range(0, 3), range(10, 14)))                    
-[(0, 10), (1, 11), (2, 12)]
->>> list(itertools.zip_longest(range(0, 3), range(10, 14)))  
-[(0, 10), (1, 11), (2, 12), (None, 13)]
-
    -
  1. The itertools.chain() function takes two iterators and returns an iterator that contains all the items from the first iterator, followed by all the items from the second iterator. (Actually, it can take any number of iterators, and it chains them all in the order they were passed to the function.) -
  2. The zip() function does something prosaic that turns out to be extremely useful: it takes any number of sequences and returns an iterator which returns tuples of the first items of each sequence, then the second items of each, then the third, and so on. -
  3. The zip() function stops at the end of the shortest sequence. range(10, 14) has 4 items (10, 11, 12, and 13), but range(0, 3) only has 3, so the zip() function returns an iterator of 3 items. -
  4. On the other hand, the itertools.zip_longest() function stops at the end of the longest sequence, inserting None values for items past the end of the shorter sequences. -
- -
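If None is not a convenient placeholder, itertools.zip_longest() also accepts a fillvalue argument. A minimal sketch, using -1 as an arbitrary filler:

```python
import itertools

pairs = list(itertools.zip_longest(range(0, 3), range(10, 14), fillvalue=-1))
print(pairs)  # [(0, 10), (1, 11), (2, 12), (-1, 13)]

# chain() is not limited to two iterables.
chained = list(itertools.chain(range(0, 2), 'AB', [9]))
print(chained)  # [0, 1, 'A', 'B', 9]
```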

OK, that was all very interesting, but how does it relate to the alphametics solver? Here’s how: - -

->>> characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
->>> guess = ('1', '2', '0', '3', '4', '5', '6', '7')
->>> tuple(zip(characters, guess))  
-(('S', '1'), ('M', '2'), ('E', '0'), ('D', '3'),
- ('O', '4'), ('N', '5'), ('R', '6'), ('Y', '7'))
->>> dict(zip(characters, guess))   
-{'E': '0', 'D': '3', 'M': '2', 'O': '4',
- 'N': '5', 'S': '1', 'R': '6', 'Y': '7'}
-
    -
  1. Given a list of letters and a list of digits (each represented here as 1-character strings), the zip function will create a pairing of letters and digits, in order. -
  2. Why is that cool? Because that data structure happens to be exactly the right structure to pass to the dict() function to create a dictionary that uses letters as keys and their associated digits as values. (This isn’t the only way to do it, of course. You could use a dictionary comprehension to create the dictionary directly.) Although the printed representation of the dictionary lists the pairs in a different order (dictionaries have no “order” per se), you can see that each letter is associated with the digit, based on the ordering of the original characters and guess sequences. -
- -
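For comparison, here is the dictionary comprehension mentioned above, next to the dict(zip(...)) form. Both build the same mapping; which one reads better is mostly a matter of taste:

```python
characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
guess = ('1', '2', '0', '3', '4', '5', '6', '7')

via_dict = dict(zip(characters, guess))
via_comprehension = {letter: digit for letter, digit in zip(characters, guess)}

print(via_dict == via_comprehension)  # True
print(via_dict['S'], via_dict['Y'])   # 1 7
```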

The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution. - -

characters = tuple(ord(c) for c in sorted_characters)
-digits = tuple(ord(c) for c in '0123456789')
-...
-for guess in itertools.permutations(digits, len(characters)):
-    ...
-    equation = puzzle.translate(dict(zip(characters, guess)))
- -

But what is this translate() method? Ah, now you’re getting to the really fun part. - -

⁂ - -

A New Kind Of String Manipulation

- -

Python strings have many methods. You learned about some of those methods in the Strings chapter: lower(), count(), and format(). Now I want to introduce you to a powerful but little-known string manipulation technique: the translate() method. - -

->>> translation_table = {ord('A'): ord('O')}  
->>> translation_table                         
-{65: 79}
->>> 'MARK'.translate(translation_table)       
-'MORK'
-
    -
  1. String translation starts with a translation table, which is just a dictionary that maps one character to another. Actually, “character” is not quite right: the translation table maps one Unicode code point (an integer) to another. -
  2. The ord() function returns the Unicode code point of a character, which, in the case of A–Z, is always an integer from 65 to 90 (the same as its ASCII value). -
  3. The translate() method on a string takes a translation table and runs the string through it. That is, it replaces all occurrences of the keys of the translation table with the corresponding values. In this case, “translating” MARK to MORK. -
- - - -
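You don’t have to build the table by hand with ord(). The built-in str.maketrans() method produces the same kind of dictionary from a pair of equal-length strings. A short sketch:

```python
table = str.maketrans('A', 'O')
print(table)                    # {65: 79}
print('MARK'.translate(table))  # MORK

# Mapping a code point to None deletes that character.
print('MARK'.translate({ord('A'): None}))  # MRK
```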

What does this have to do with solving alphametic puzzles? As it turns out, everything. - -

->>> characters = tuple(ord(c) for c in 'SMEDONRY')       
->>> characters
-(83, 77, 69, 68, 79, 78, 82, 89)
->>> guess = tuple(ord(c) for c in '91570682')            
->>> guess
-(57, 49, 53, 55, 48, 54, 56, 50)
->>> translation_table = dict(zip(characters, guess))     
->>> translation_table
-{68: 55, 69: 53, 77: 49, 78: 54, 79: 48, 82: 56, 83: 57, 89: 50}
->>> 'SEND + MORE == MONEY'.translate(translation_table)  
-'9567 + 1085 == 10652'
-
    -
  1. Using a generator expression, we quickly compute the code point for each character in a string. characters is an example of the value of characters in the alphametics.solve() function, which is computed from sorted_characters. -
  2. Using another generator expression, we quickly compute the code point for each digit in this string. The result, guess, is of the form returned by the itertools.permutations() function in the alphametics.solve() function. -
  3. This translation table is generated by zipping characters and guess together and building a dictionary from the resulting sequence of pairs. This is exactly what the alphametics.solve() function does inside the for loop. -
  4. Finally, we pass this translation table to the translate() method of the original puzzle string. This converts each letter in the string to the corresponding digit (based on the letters in characters and the digits in guess). The result is a valid Python expression, as a string. -
- -

That’s pretty impressive. But what can you do with a string that happens to be a valid Python expression? - -

⁂ - -

Evaluating Arbitrary Strings As Python Expressions

- -

This is the final piece of the puzzle (or rather, the final piece of the puzzle solver). After all that fancy string manipulation, we’re left with a string like '9567 + 1085 == 10652'. But that’s a string, and what good is a string? Enter eval(), the universal Python evaluation tool. - -

->>> eval('1 + 1 == 2')
-True
->>> eval('1 + 1 == 3')
-False
->>> eval('9567 + 1085 == 10652')
-True
- -

But wait, there’s more! The eval() function isn’t limited to boolean expressions. It can handle any Python expression and returns any datatype. - -

->>> eval('"A" + "B"')
-'AB'
->>> eval('"MARK".translate({65: 79})')
-'MORK'
->>> eval('"AAAAA".count("A")')
-5
->>> eval('["*"] * 5')
-['*', '*', '*', '*', '*']
- -

But wait, that’s not all! - -

->>> x = 5
->>> eval("x * 5")         
-25
->>> eval("pow(x, 2)")     
-25
->>> import math
->>> eval("math.sqrt(x)")  
-2.2360679774997898
-
    -
  1. The expression that eval() takes can reference global variables defined outside the eval(). If called within a function, it can reference local variables too. -
  2. And functions. -
  3. And modules. -
- -

Hey, wait a minute… - -

->>> import subprocess
->>> eval("subprocess.getoutput('ls ~')")                  
-'Desktop         Library         Pictures \
- Documents       Movies          Public   \
- Music           Sites'
->>> eval("subprocess.getoutput('rm /some/random/file')")  
-
    -
  1. The subprocess module allows you to run arbitrary shell commands and get the result as a Python string. -
  2. Arbitrary shell commands can have permanent consequences. -
- -

It’s even worse than that, because there’s a global __import__() function that takes a module name as a string, imports the module, and returns a reference to it. Combined with the power of eval(), you can construct a single expression that will wipe out all your files: - -

->>> eval("__import__('subprocess').getoutput('rm /some/random/file')")  
-
    -
  1. Now imagine the output of 'rm -rf ~'. Actually there wouldn’t be any output, but you wouldn’t have any files left either. -
- -

eval() is EVIL - -

Well, the evil part is evaluating arbitrary expressions from untrusted sources. You should only use eval() on trusted input. Of course, the trick is figuring out what’s “trusted.” But here’s something I know for certain: you should NOT take this alphametics solver and put it on the internet as a fun little web service. Don’t make the mistake of thinking, “Gosh, the function does a lot of string manipulation before getting a string to evaluate; I can’t imagine how someone could exploit that.” Someone WILL figure out how to sneak nasty executable code past all that string manipulation (stranger things have happened), and then you can kiss your server goodbye. - -

But surely there’s some way to evaluate expressions safely? To put eval() in a sandbox where it can’t access or harm the outside world? Well, yes and no. - -

->>> x = 5
->>> eval("x * 5", {}, {})               
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "<string>", line 1, in <module>
-NameError: name 'x' is not defined
->>> eval("x * 5", {"x": x}, {})         
-25
->>> import math
->>> eval("math.sqrt(x)", {"x": x}, {})  
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "<string>", line 1, in <module>
-NameError: name 'math' is not defined
-
    -
  1. The second and third parameters passed to the eval() function act as the global and local namespaces for evaluating the expression. In this case, they are both empty, which means that when the string "x * 5" is evaluated, there is no reference to x in either the global or local namespace, so eval() throws an exception. -
  2. You can selectively include specific values in the global namespace by listing them individually. Then those — and only those — variables will be available during evaluation. -
  3. Even though you just imported the math module, you didn’t include it in the namespace passed to the eval() function, so the evaluation failed. -
- -

Gee, that was easy. Lemme make an alphametics web service now! - -

->>> eval("pow(5, 2)", {}, {})                   
-25
->>> eval("__import__('math').sqrt(5)", {}, {})  
-2.2360679774997898
-
    -
  1. Even though you’ve passed empty dictionaries for the global and local namespaces, all of Python’s built-in functions are still available during evaluation. So pow(5, 2) works, because 5 and 2 are literals, and pow() is a built-in function. -
  2. Unfortunately (and if you don’t see why it’s unfortunate, read on), the __import__() function is also a built-in function, so it works too. -
- -

Yeah, that means you can still do nasty things, even if you explicitly set the global and local namespaces to empty dictionaries when calling eval(): - -

>>> eval("__import__('subprocess').getoutput('rm /some/random/file')", {}, {})
- -

Oops. I’m glad I didn’t make that alphametics web service. Is there any way to use eval() safely? Well, yes and no. - -

->>> eval("__import__('math').sqrt(5)",
-...     {"__builtins__":None}, {})          
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "<string>", line 1, in <module>
-NameError: name '__import__' is not defined
->>> eval("__import__('subprocess').getoutput('rm -rf /')",
-...     {"__builtins__":None}, {})          
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "<string>", line 1, in <module>
-NameError: name '__import__' is not defined
-
    -
  1. To evaluate untrusted expressions more safely, you need to define a global namespace dictionary that maps "__builtins__" to None, the Python null value. Internally, the “built-in” functions are contained within a pseudo-module called "__builtins__". This pseudo-module (i.e. the set of built-in functions) is made available to evaluated expressions unless you explicitly override it. -
  2. Be sure you’ve overridden __builtins__. Not __builtin__, __built-ins__, or some other variation that will work just fine but expose you to catastrophic risks. -
- -

So eval() is safe now? Well, yes and no. - -

->>> eval("2 ** 2147483647",
-...     {"__builtins__":None}, {})          
-
-
    -
  1. Even without access to __builtins__, you can still launch a denial-of-service attack. For example, trying to raise 2 to the 2147483647th power will spike your server’s CPU utilization to 100% for quite some time. (If you’re trying this in the interactive shell, press Ctrl-C a few times to break out of it.) Technically this expression will return a value eventually, but in the meantime your server will be doing a whole lot of nothing. -
- -

In the end, it is possible to safely evaluate untrusted Python expressions, for some definition of “safe” that turns out not to be terribly useful in real life. It’s fine if you’re just playing around, and it’s fine if you only ever pass it trusted input. But anything else is just asking for trouble. - -
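If all you need is to test an equation like the ones this solver produces, one way to sidestep eval() entirely is to parse the numbers yourself. The function below is a hypothetical sketch, not part of alphametics.py, and it assumes the equation contains only digits, + signs on the left-hand side, and a single ==:

```python
def equation_holds(equation):
    """Check a digit-only equation such as '9567 + 1085 == 10652'.

    Assumes the only operators are '+' (left side) and a single '=='.
    int() rejects anything that is not a plain number, so no code from
    the input string is ever executed.
    """
    left, right = equation.split('==')
    total = sum(int(term) for term in left.split('+'))
    return total == int(right)

print(equation_holds('9567 + 1085 == 10652'))  # True
print(equation_holds('1 + 1 == 3'))            # False
```

Anything that is not a number, such as a call to __import__(), makes int() raise a ValueError instead of running.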

⁂ - -

Putting It All Together

- -

To recap: this program solves alphametic puzzles by brute force, i.e. through an exhaustive search of all possible solutions. To do this, it… - -

    -
  1. Finds all the letters in the puzzle with the re.findall() function -
  2. Finds all the unique letters in the puzzle with sets and the set() function -
  3. Checks if there are more than 10 unique letters (meaning the puzzle is definitely unsolvable) with an assert statement -
  4. Converts the letters to their integer code points with a generator expression -
  5. Calculates all the possible solutions with the itertools.permutations() function -
  6. Converts each possible solution to a Python expression with the translate() string method -
  7. Tests each possible solution by evaluating the Python expression with the eval() function -
  8. Returns the first solution that evaluates to True -
- -

…in just 14 lines of code. - -

⁂ - -

Further Reading

- - - -

Many thanks to Raymond Hettinger for agreeing to relicense his code so I could port it to Python 3 and use it as the basis for this chapter. - -

- -

© 2001–10 Mark Pilgrim - - - + + +Advanced Iterators - Dive Into Python 3 + + + + + + +

  
+


Difficulty level: ♦♦♦♦♢ +

Advanced Iterators

+
+

Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum.
— Augustus De Morgan +

+

  +

Diving In

+

Just as regular expressions put strings on steroids, the itertools module puts iterators on steroids. But first, I want to show you a classic puzzle. + +

HAWAII + IDAHO + IOWA + OHIO == STATES
+510199 + 98153 + 9301 + 3593 == 621246
+
+H = 5
+A = 1
+W = 0
+I = 9
+D = 8
+O = 3
+S = 6
+T = 2
+E = 4
+ +

Puzzles like this are called cryptarithms or alphametics. The letters spell out actual words, but if you replace each letter with a digit from 0–9, it also “spells” an arithmetic equation. The trick is to figure out which letter maps to each digit. All the occurrences of each letter must map to the same digit, no digit can be repeated, and no “word” can start with the digit 0. + +
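As a quick sanity check, here is a minimal sketch that tests the solution shown above against those three rules, using plain arithmetic. The names `mapping`, `to_number`, and `words` are illustrative and do not appear in the program discussed in this chapter.

```python
# Letter-to-digit mapping copied from the solved puzzle above.
mapping = {'H': '5', 'A': '1', 'W': '0', 'I': '9', 'D': '8',
           'O': '3', 'S': '6', 'T': '2', 'E': '4'}

def to_number(word):
    """Replace each letter with its digit and read the result as an integer."""
    return int(''.join(mapping[letter] for letter in word))

words = ['HAWAII', 'IDAHO', 'IOWA', 'OHIO']

# Rule 1 holds by construction: a dict maps each letter to exactly one digit.
# Rule 2: no digit is shared by two different letters.
digits_are_unique = len(set(mapping.values())) == len(mapping)
# Rule 3: no word starts with the digit 0.
no_leading_zero = all(mapping[word[0]] != '0' for word in words + ['STATES'])
# The arithmetic itself: the four words must add up to STATES.
sums_match = sum(to_number(word) for word in words) == to_number('STATES')

print(digits_are_unique, no_leading_zero, sums_match)
```

All three checks should come out true for this mapping.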

+ +

In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code. + +

[download alphametics.py] +

import re
+import itertools
+
+def solve(puzzle):
+    words = re.findall('[A-Z]+', puzzle.upper())
+    unique_characters = set(''.join(words))
+    assert len(unique_characters) <= 10, 'Too many letters'
+    first_letters = {word[0] for word in words}
+    n = len(first_letters)
+    sorted_characters = ''.join(first_letters) + \
+        ''.join(unique_characters - first_letters)
+    characters = tuple(ord(c) for c in sorted_characters)
+    digits = tuple(ord(c) for c in '0123456789')
+    zero = digits[0]
+    for guess in itertools.permutations(digits, len(characters)):
+        if zero not in guess[:n]:
+            equation = puzzle.translate(dict(zip(characters, guess)))
+            if eval(equation):
+                return equation
+
+if __name__ == '__main__':
+    import sys
+    for puzzle in sys.argv[1:]:
+        print(puzzle)
+        solution = solve(puzzle)
+        if solution:
+            print(solution)
+ +

You can run the program from the command line. On Linux, it would look like this. (These may take some time, depending on the speed of your computer, and there is no progress bar. Just be patient!) + +

+you@localhost:~/diveintopython3/examples$ python3 alphametics.py "HAWAII + IDAHO + IOWA + OHIO == STATES"
+HAWAII + IDAHO + IOWA + OHIO == STATES
+510199 + 98153 + 9301 + 3593 == 621246
+you@localhost:~/diveintopython3/examples$ python3 alphametics.py "I + LOVE + YOU == DORA"
+I + LOVE + YOU == DORA
+1 + 2784 + 975 == 3760
+you@localhost:~/diveintopython3/examples$ python3 alphametics.py "SEND + MORE == MONEY"
+SEND + MORE == MONEY
+9567 + 1085 == 10652
+ +

⁂ + +

Finding all occurrences of a pattern

+ +

The first thing this alphametics solver does is find all the letters (A–Z) in the puzzle. + +

+>>> import re
+>>> re.findall('[0-9]+', '16 2-by-4s in rows of 8')  
+['16', '2', '4', '8']
+>>> re.findall('[A-Z]+', 'SEND + MORE == MONEY')     
+['SEND', 'MORE', 'MONEY']
+
    +
  1. The re module is Python’s implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the string. In this case, the pattern matches sequences of digits. The findall() function returns a list of all the substrings that matched the pattern. +
  2. Here the regular expression pattern matches sequences of letters. Again, the return value is a list, and each item in the list is a string that matched the regular expression pattern. +
+ +

Here’s another example that will stretch your brain a little. + +

+>>> re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
+[' sixth s', " sheikh's s", " sheep's s"]
+ + + +

Surprised? The regular expression looks for a space, an s, and then the shortest possible series of any character (.*?), then a space, then another s. Well, looking at that input string, I see five matches: + +

    +
  1. The sixth sick sheikh's sixth sheep's sick. +
  2. The sixth sick sheikh's sixth sheep's sick. +
  3. The sixth sick sheikh's sixth sheep's sick. +
  4. The sixth sick sheikh's sixth sheep's sick. +
  5. The sixth sick sheikh's sixth sheep's sick. +
+ +

But the re.findall() function only returned three matches. Specifically, it returned the first, the third, and the fifth. Why is that? Because it doesn’t return overlapping matches. The first match overlaps with the second, so the first is returned and the second is skipped. Then the third overlaps with the fourth, so the third is returned and the fourth is skipped. Finally, the fifth is returned. Three matches, not five. + +
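If you ever do want the overlapping matches, one common workaround is to wrap the pattern in a zero-width lookahead so that each match consumes no characters. This is a sketch of that technique, not something the alphametics solver uses; the variable names are illustrative.

```python
import re

text = "The sixth sick sheikh's sixth sheep's sick."

# Plain findall(): each match consumes its characters, so overlaps are skipped.
non_overlapping = re.findall(r' s.*? s', text)

# A lookahead (?=...) matches without consuming anything, so the search
# resumes one character later and can find matches that overlap.
# The inner parentheses capture the text that the lookahead saw.
overlapping = [m.group(1) for m in re.finditer(r'(?=( s.*? s))', text)]

print(len(non_overlapping), len(overlapping))
```

The second list should contain all five of the matches listed above, in order.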

This has nothing to do with the alphametics solver; I just thought it was interesting. + +

⁂ + +

Finding the unique items in a sequence

+ +

Sets make it trivial to find the unique items in a sequence. + +

+>>> a_list = ['The', 'sixth', 'sick', "sheik's", 'sixth', "sheep's", 'sick']
+>>> set(a_list)                      
+{'sixth', 'The', "sheep's", 'sick', "sheik's"}
+>>> a_string = 'EAST IS EAST'
+>>> set(a_string)                    
+{'A', ' ', 'E', 'I', 'S', 'T'}
+>>> words = ['SEND', 'MORE', 'MONEY']
+>>> ''.join(words)                   
+'SENDMOREMONEY'
+>>> set(''.join(words))              
+{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
+
    +
  1. Given a list of several strings, the set() function will return a set of unique strings from the list. This makes sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third. Fourth. Fifth — wait, that’s in the set already, so it only gets listed once, because Python sets don’t allow duplicates. Sixth. Seventh — again, a duplicate, so it only gets listed once. The end result? All the unique items in the original list, without any duplicates. The original list doesn’t even need to be sorted first. +
  2. The same technique works with strings, since a string is just a sequence of characters. +
  3. Given a list of strings, ''.join(a_list) concatenates all the strings together into one. +
  4. So, given a list of strings, this line of code returns all the unique characters across all the strings, with no duplicates. +
+ +

The alphametics solver uses this technique to build a set of all the unique characters in the puzzle. + +

unique_characters = set(''.join(words))
+ +

This set is later used to assign digits to characters as the solver iterates through the possible solutions. + +
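The solver also builds a second set, first_letters, with a set comprehension, and then subtracts it from unique_characters. A minimal sketch of those two steps on their own, using the words from the example above (the name `other_letters` is illustrative; the program computes the difference inline):

```python
words = ['SEND', 'MORE', 'MONEY']

# Every distinct letter in the puzzle.
unique_characters = set(''.join(words))

# Set comprehension: the first letter of each word, with duplicates dropped.
first_letters = {word[0] for word in words}

# Set difference: letters that never start a word.
other_letters = unique_characters - first_letters

print(sorted(first_letters), sorted(other_letters))
```

MORE and MONEY both start with M, so first_letters ends up with two items, not three.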

⁂ + +

Making assertions

+ +

Like many programming languages, Python has an assert statement. Here’s how it works. + +

+>>> assert 1 + 1 == 2                                     
+>>> assert 1 + 1 == 3                                     
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+AssertionError
+>>> assert 2 + 2 == 5, "Only for very large values of 2"  
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+AssertionError: Only for very large values of 2
+
    +
  1. The assert statement is followed by any valid Python expression. In this case, the expression 1 + 1 == 2 evaluates to True, so the assert statement does nothing. +
  2. However, if the Python expression evaluates to False, the assert statement will raise an AssertionError. +
  3. You can also include a human-readable message that is printed if the AssertionError is raised. +
+ +

Therefore, this line of code: + +

assert len(unique_characters) <= 10, 'Too many letters'
+ +

…is equivalent to this: + +

if len(unique_characters) > 10:
+    raise AssertionError('Too many letters')
+ +

The alphametics solver uses this exact assert statement to bail out early if the puzzle contains more than ten unique letters. Since each letter is assigned a unique digit, and there are only ten digits, a puzzle with more than ten unique letters cannot possibly have a solution. + +
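One caveat worth knowing: Python skips assert statements entirely when it is started with the -O option, so an assert is a debugging aid rather than a dependable input check. If the check has to survive optimization, raising an exception explicitly is one option. The sketch below is only an assumption about how you might write that; the function name `check_letter_count` is hypothetical and is not part of the solver.

```python
def check_letter_count(unique_characters, limit=10):
    """Raise ValueError if there are more unique letters than available digits."""
    if len(unique_characters) > limit:
        raise ValueError('Too many letters')

# 8 unique letters: passes silently.
check_letter_count(set('SENDMOREMONEY'))

# 11 unique letters: raises, even under python -O.
try:
    check_letter_count(set('ABCDEFGHIJK'))
except ValueError as error:
    message = str(error)
    print(message)
```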

⁂ + +

Generator expressions

+ +

A generator expression is like a generator function without the function. + +

+>>> unique_characters = {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
+>>> gen = (ord(c) for c in unique_characters)  
+>>> gen                                        
+<generator object <genexpr> at 0x00BADC10>
+>>> next(gen)                                  
+69
+>>> next(gen)
+68
+>>> tuple(ord(c) for c in unique_characters)   
+(69, 68, 77, 79, 78, 83, 82, 89)
+
    +
  1. A generator expression is like an anonymous function that yields values. The expression itself looks like a list comprehension, but it’s wrapped in parentheses instead of square brackets. +
  2. The generator expression returns… an iterator. +
  3. Calling next(gen) returns the next value from the iterator. +
  4. If you like, you can iterate through all the possible values and return a tuple, list, or set, by passing the generator expression to tuple(), list(), or set(). In these cases, you don’t need an extra set of parentheses — just pass the “bare” expression ord(c) for c in unique_characters to the tuple() function, and Python figures out that it’s a generator expression. +
+ +
+

Using a generator expression instead of a list comprehension can save both CPU and RAM. If you’re building a list just to throw it away (e.g. passing it to tuple() or set()), use a generator expression instead! +

+ +

Here’s another way to accomplish the same thing, using a generator function: + +

def ord_map(a_string):
+    for c in a_string:
+        yield ord(c)
+
+gen = ord_map(unique_characters)
+ +

The generator expression is more compact but functionally equivalent. + +
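Either way, what you get back is a one-shot iterator: values are produced on demand, and once they have been consumed they are gone. A small sketch of that behavior, using a string instead of a set so the order is predictable (the variable names are illustrative):

```python
letters = 'EDMONSRY'

gen = (ord(c) for c in letters)   # nothing has been computed yet

first = next(gen)    # computes only the first value
rest = list(gen)     # consumes everything that is left
again = list(gen)    # the generator is exhausted, so this is empty

print(first, rest, again)
```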

⁂ + +

Calculating Permutations… The Lazy Way!

+ +

First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually several definitions, depending on what kind of math you’re doing. Here I’m talking about combinatorics, but if that doesn’t mean anything to you, don’t worry about it. As always, Wikipedia is your friend.) + +

The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated. Mathematicians say things like “let’s find the permutations of 3 different items taken 2 at a time,” which means you have a sequence of 3 items and you want to find all the possible ordered pairs. + +

+>>> import itertools                              
+>>> perms = itertools.permutations([1, 2, 3], 2)  
+>>> next(perms)                                   
+(1, 2)
+>>> next(perms)
+(1, 3)
+>>> next(perms)
+(2, 1)                                            
+>>> next(perms)
+(2, 3)
+>>> next(perms)
+(3, 1)
+>>> next(perms)
+(3, 2)
+>>> next(perms)                                   
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+StopIteration
+
    +
  1. The itertools module has all kinds of fun stuff in it, including a permutations() function that does all the hard work of finding permutations. +
  2. The permutations() function takes a sequence (here a list of three integers) and a number, which is the number of items you want in each smaller group. The function returns an iterator, which you can use in a for loop or any old place that iterates. Here I’ll step through the iterator manually to show all the values. +
  3. The first permutation of [1, 2, 3] taken 2 at a time is (1, 2). +
  4. Note that permutations are ordered: (2, 1) is different than (1, 2). +
  5. That’s it! Those are all the permutations of [1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2) never show up, because they contain repeats so they aren’t valid permutations. When there are no more permutations, the iterator raises a StopIteration exception. +
+ + + +

The permutations() function doesn’t have to take a list. It can take any sequence — even a string. + +

+>>> import itertools
+>>> perms = itertools.permutations('ABC', 3)  
+>>> next(perms)
+('A', 'B', 'C')                               
+>>> next(perms)
+('A', 'C', 'B')
+>>> next(perms)
+('B', 'A', 'C')
+>>> next(perms)
+('B', 'C', 'A')
+>>> next(perms)
+('C', 'A', 'B')
+>>> next(perms)
+('C', 'B', 'A')
+>>> next(perms)
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+StopIteration
+>>> list(itertools.permutations('ABC', 3))    
+[('A', 'B', 'C'), ('A', 'C', 'B'),
+ ('B', 'A', 'C'), ('B', 'C', 'A'),
+ ('C', 'A', 'B'), ('C', 'B', 'A')]
+
    +
  1. A string is just a sequence of characters. For the purposes of finding permutations, the string 'ABC' is equivalent to the list ['A', 'B', 'C']. +
  2. The first permutation of the 3 items ['A', 'B', 'C'], taken 3 at a time, is ('A', 'B', 'C'). There are five other permutations — the same three characters in every conceivable order. +
  3. Since the permutations() function always returns an iterator, an easy way to debug permutations is to pass that iterator to the built-in list() function to see all the permutations immediately. +
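How many permutations are there? For n items taken k at a time, the count is n! / (n − k)!. The sketch below checks that formula against itertools.permutations() for the small example, then applies it to a puzzle with 8 unique letters and 10 digits. The helper name `count_permutations` is illustrative.

```python
import itertools
import math

def count_permutations(n, k):
    """Number of ordered selections of k items out of n, without repeats."""
    return math.factorial(n) // math.factorial(n - k)

# The formula agrees with what the iterator produces for [1, 2, 3] taken 2 at a time.
small = len(list(itertools.permutations([1, 2, 3], 2)))

# 10 digits assigned to 8 letters, as in SEND + MORE == MONEY.
guesses = count_permutations(10, 8)

print(small, count_permutations(3, 2), guesses)
```

That works out to 1,814,400 candidate guesses for an 8-letter puzzle, which is why the brute-force solver can take a while.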
+ +

⁂ + +

Other Fun Stuff in the itertools Module

+
+>>> import itertools
+>>> list(itertools.product('ABC', '123'))   
+[('A', '1'), ('A', '2'), ('A', '3'), 
+ ('B', '1'), ('B', '2'), ('B', '3'), 
+ ('C', '1'), ('C', '2'), ('C', '3')]
+>>> list(itertools.combinations('ABC', 2))  
+[('A', 'B'), ('A', 'C'), ('B', 'C')]
+
    +
  1. The itertools.product() function returns an iterator containing the Cartesian product of two sequences. +
  2. The itertools.combinations() function returns an iterator containing all the possible combinations of the given sequence of the given length. This is like the itertools.permutations() function, except combinations don’t include items that are duplicates of other items in a different order. So itertools.permutations('ABC', 2) will return both ('A', 'B') and ('B', 'A') (among others), but itertools.combinations('ABC', 2) will not return ('B', 'A') because it is a duplicate of ('A', 'B') in a different order. +
+ +

[download favorite-people.txt] +

+>>> names = list(open('examples/favorite-people.txt', encoding='utf-8'))  
+>>> names
+['Dora\n', 'Ethan\n', 'Wesley\n', 'John\n', 'Anne\n',
+'Mike\n', 'Chris\n', 'Sarah\n', 'Alex\n', 'Lizzie\n']
+>>> names = [name.rstrip() for name in names]                             
+>>> names
+['Dora', 'Ethan', 'Wesley', 'John', 'Anne',
+'Mike', 'Chris', 'Sarah', 'Alex', 'Lizzie']
+>>> names = sorted(names)                                                 
+>>> names
+['Alex', 'Anne', 'Chris', 'Dora', 'Ethan',
+'John', 'Lizzie', 'Mike', 'Sarah', 'Wesley']
+>>> names = sorted(names, key=len)                                        
+>>> names
+['Alex', 'Anne', 'Dora', 'John', 'Mike',
+'Chris', 'Ethan', 'Sarah', 'Lizzie', 'Wesley']
+
    +
  1. This idiom returns a list of the lines in a text file. +
  2. Unfortunately (for this example), the list(open(filename)) idiom also includes the newline characters at the end of each line. This list comprehension uses the rstrip() string method to strip trailing whitespace from each line. (Strings also have an lstrip() method to strip leading whitespace, and a strip() method which strips both.) +
  3. The sorted() function takes a list and returns a new, sorted list. By default, it sorts alphabetically. +
  4. But the sorted() function can also take a function as the key parameter, and it sorts by that key. In this case, the sort function is len(), so it sorts by len(each item). Shorter names come first, then longer, then longest. +
+ +

What does this have to do with the itertools module? I’m glad you asked. + +

+…continuing from the previous interactive shell…
+>>> import itertools
+>>> groups = itertools.groupby(names, len)  
+>>> groups
+<itertools.groupby object at 0x00BB20C0>
+>>> list(groups)
+[(4, <itertools._grouper object at 0x00BA8BF0>),
+ (5, <itertools._grouper object at 0x00BB4050>),
+ (6, <itertools._grouper object at 0x00BB4030>)]
+>>> groups = itertools.groupby(names, len)   
+>>> for name_length, name_iter in groups:    
+...     print('Names with {0:d} letters:'.format(name_length))
+...     for name in name_iter:
+...         print(name)
+... 
+Names with 4 letters:
+Alex
+Anne
+Dora
+John
+Mike
+Names with 5 letters:
+Chris
+Ethan
+Sarah
+Names with 6 letters:
+Lizzie
+Wesley
+
    +
  1. The itertools.groupby() function takes a sequence and a key function, and returns an iterator that generates pairs. Each pair contains the result of key_function(each item) and another iterator containing all the items that shared that key result. +
  2. Calling the list() function “exhausted” the iterator, i.e. you’ve already generated every item in the iterator to make the list. There’s no “reset” button on an iterator; you can’t just start over once you’ve exhausted it. If you want to loop through it again (say, in the upcoming for loop), you need to call itertools.groupby() again to create a new iterator. +
  3. In this example, given a list of names already sorted by length, itertools.groupby(names, len) will put all the 4-letter names in one iterator, all the 5-letter names in another iterator, and so on. The groupby() function is completely generic; it could group strings by first letter, numbers by their number of factors, or any other key function you can think of. +
+ + +
+

The itertools.groupby() function only works if the input sequence is already sorted by the grouping function. In the example above, you grouped a list of names by the len() function. That only worked because the input list was already sorted by length. +
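To see why, here is a small sketch that groups the same names twice, once unsorted and once sorted by length. The groupby() function only merges neighboring items with equal keys, so the unsorted input produces fragmented groups. The list of names here is a shortened, illustrative version of the one above.

```python
import itertools

names = ['Alex', 'Chris', 'Anne', 'Ethan', 'Dora']

# Unsorted input: a new group starts every time the key changes.
fragmented = [(length, list(group))
              for length, group in itertools.groupby(names, len)]

# Sorted by the same key first: one group per length.
grouped = [(length, list(group))
           for length, group in itertools.groupby(sorted(names, key=len), len)]

print(fragmented)
print(grouped)
```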

+ +

Are you watching closely? +

+>>> list(range(0, 3))
+[0, 1, 2]
+>>> list(range(10, 13))
+[10, 11, 12]
+>>> list(itertools.chain(range(0, 3), range(10, 13)))        
+[0, 1, 2, 10, 11, 12]
+>>> list(zip(range(0, 3), range(10, 13)))                    
+[(0, 10), (1, 11), (2, 12)]
+>>> list(zip(range(0, 3), range(10, 14)))                    
+[(0, 10), (1, 11), (2, 12)]
+>>> list(itertools.zip_longest(range(0, 3), range(10, 14)))  
+[(0, 10), (1, 11), (2, 12), (None, 13)]
+
    +
  1. The itertools.chain() function takes two iterators and returns an iterator that contains all the items from the first iterator, followed by all the items from the second iterator. (Actually, it can take any number of iterators, and it chains them all in the order they were passed to the function.) +
  2. The zip() function does something prosaic that turns out to be extremely useful: it takes any number of sequences and returns an iterator which returns tuples of the first items of each sequence, then the second items of each, then the third, and so on. +
  3. The zip() function stops at the end of the shortest sequence. range(10, 14) has 4 items (10, 11, 12, and 13), but range(0, 3) only has 3, so the zip() function returns an iterator of 3 items. +
  4. On the other hand, the itertools.zip_longest() function stops at the end of the longest sequence, inserting None values for items past the end of the shorter sequences. +
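If None is not a convenient placeholder, itertools.zip_longest() also accepts a fillvalue keyword argument. A minimal sketch, where the choice of -1 as the filler is arbitrary:

```python
import itertools

# The shorter sequence is padded with fillvalue instead of None.
padded = list(itertools.zip_longest(range(0, 3), range(10, 14), fillvalue=-1))

print(padded)
```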
+ +

OK, that was all very interesting, but how does it relate to the alphametics solver? Here’s how: + +

+>>> characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
+>>> guess = ('1', '2', '0', '3', '4', '5', '6', '7')
+>>> tuple(zip(characters, guess))  
+(('S', '1'), ('M', '2'), ('E', '0'), ('D', '3'),
+ ('O', '4'), ('N', '5'), ('R', '6'), ('Y', '7'))
+>>> dict(zip(characters, guess))   
+{'E': '0', 'D': '3', 'M': '2', 'O': '4',
+ 'N': '5', 'S': '1', 'R': '6', 'Y': '7'}
+
    +
  1. Given a tuple of letters and a tuple of digits (each represented here as 1-character strings), the zip() function will create a pairing of letters and digits, in order. +
  2. Why is that cool? Because that data structure happens to be exactly the right structure to pass to the dict() function to create a dictionary that uses letters as keys and their associated digits as values. (This isn’t the only way to do it, of course. You could use a dictionary comprehension to create the dictionary directly.) Although the printed representation of the dictionary lists the pairs in a different order (the Python version used here does not preserve insertion order; Python 3.7 and later do), you can see that each letter is associated with the right digit, based on the ordering of the original characters and guess sequences. +
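For comparison, here is a sketch of the dictionary comprehension mentioned above, next to the dict(zip(...)) form. Both build the same dictionary; the variable names `via_zip` and `via_comprehension` are illustrative.

```python
characters = ('S', 'M', 'E', 'D', 'O', 'N', 'R', 'Y')
guess = ('1', '2', '0', '3', '4', '5', '6', '7')

# Pass the pairs straight to dict()...
via_zip = dict(zip(characters, guess))

# ...or unpack each pair in a dictionary comprehension.
via_comprehension = {letter: digit for letter, digit in zip(characters, guess)}

print(via_zip == via_comprehension)
```

The comprehension is handy when you need to transform the keys or values on the way in; otherwise dict(zip(...)) is shorter.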
+ +

The alphametics solver uses this technique to create a dictionary that maps letters in the puzzle to digits in the solution, for each possible solution. + +

characters = tuple(ord(c) for c in sorted_characters)
+digits = tuple(ord(c) for c in '0123456789')
+...
+for guess in itertools.permutations(digits, len(characters)):
+    ...
+    equation = puzzle.translate(dict(zip(characters, guess)))
+ +

But what is this translate() method? Ah, now you’re getting to the really fun part. + +

⁂ + +

A New Kind Of String Manipulation

+ +

Python strings have many methods. You learned about some of those methods in the Strings chapter: lower(), count(), and format(). Now I want to introduce you to a powerful but little-known string manipulation technique: the translate() method. + +

+>>> translation_table = {ord('A'): ord('O')}  
+>>> translation_table                         
+{65: 79}
+>>> 'MARK'.translate(translation_table)       
+'MORK'
+
    +
  1. String translation starts with a translation table, which is just a dictionary that maps one character to another. More precisely, the keys and values are integers: the table maps one Unicode code point to another. +
  2. The ord() function returns the Unicode code point of a character as an integer. For A–Z, the code point is the same as the ASCII value, an integer from 65 to 90. +
  3. The translate() method on a string takes a translation table and runs the string through it. That is, it replaces all occurrences of the keys of the translation table with the corresponding values. In this case, “translating” MARK to MORK. +
+ + + +

What does this have to do with solving alphametic puzzles? As it turns out, everything. + +

+>>> characters = tuple(ord(c) for c in 'SMEDONRY')       
+>>> characters
+(83, 77, 69, 68, 79, 78, 82, 89)
+>>> guess = tuple(ord(c) for c in '91570682')            
+>>> guess
+(57, 49, 53, 55, 48, 54, 56, 50)
+>>> translation_table = dict(zip(characters, guess))     
+>>> translation_table
+{68: 55, 69: 53, 77: 49, 78: 54, 79: 48, 82: 56, 83: 57, 89: 50}
+>>> 'SEND + MORE == MONEY'.translate(translation_table)  
+'9567 + 1085 == 10652'
+
    +
  1. Using a generator expression, we quickly compute the code point of each character in a string. characters is an example of the value of characters in the alphametics.solve() function. +
  2. Using another generator expression, we quickly compute the code point of each digit in this string. The result, guess, is of the form returned by the itertools.permutations() function in the alphametics.solve() function. +
  3. This translation table is generated by zipping characters and guess together and building a dictionary from the resulting sequence of pairs. This is exactly what the alphametics.solve() function does inside the for loop. +
  4. Finally, we pass this translation table to the translate() method of the original puzzle string. This converts each letter in the string to the corresponding digit (based on the letters in characters and the digits in guess). The result is a valid Python expression, as a string. +
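Python also has a built-in shortcut for building this kind of table: the static method str.maketrans(), which takes two equal-length strings and returns the same integer-to-integer dictionary. The solver does not use it (it needs to build the table from tuples of integers), so treat this as an alternative sketch rather than what the program does.

```python
# Build the table by hand, the way the walkthrough above does it.
characters = tuple(ord(c) for c in 'SMEDONRY')
guess = tuple(ord(c) for c in '91570682')
by_hand = dict(zip(characters, guess))

# Build the same table with str.maketrans().
shortcut = str.maketrans('SMEDONRY', '91570682')

equation = 'SEND + MORE == MONEY'.translate(shortcut)

print(by_hand == shortcut, equation)
```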
+ +

That’s pretty impressive. But what can you do with a string that happens to be a valid Python expression? + +

⁂ + +

Evaluating Arbitrary Strings As Python Expressions

+ +

This is the final piece of the puzzle (or rather, the final piece of the puzzle solver). After all that fancy string manipulation, we’re left with a string like '9567 + 1085 == 10652'. But that’s a string, and what good is a string? Enter eval(), the universal Python evaluation tool. + +

+>>> eval('1 + 1 == 2')
+True
+>>> eval('1 + 1 == 3')
+False
+>>> eval('9567 + 1085 == 10652')
+True
+ +

But wait, there’s more! The eval() function isn’t limited to boolean expressions. It can handle any Python expression and returns any datatype. + +

+>>> eval('"A" + "B"')
+'AB'
+>>> eval('"MARK".translate({65: 79})')
+'MORK'
+>>> eval('"AAAAA".count("A")')
+5
+>>> eval('["*"] * 5')
+['*', '*', '*', '*', '*']
+ +

But wait, that’s not all! + +

+>>> x = 5
+>>> eval("x * 5")         
+25
+>>> eval("pow(x, 2)")     
+25
+>>> import math
+>>> eval("math.sqrt(x)")  
+2.2360679774997898
+
    +
  1. The expression that eval() takes can reference global variables defined outside the eval(). If called within a function, it can reference local variables too. +
  2. And functions. +
  3. And modules. +
+ +

Hey, wait a minute… + +

+>>> import subprocess
+>>> eval("subprocess.getoutput('ls ~')")                  
+'Desktop         Library         Pictures \
+ Documents       Movies          Public   \
+ Music           Sites'
+>>> eval("subprocess.getoutput('rm /some/random/file')")  
+
    +
  1. The subprocess module allows you to run arbitrary shell commands and get the result as a Python string. +
  2. Arbitrary shell commands can have permanent consequences. +
+ +

It’s even worse than that, because there’s a global __import__() function that takes a module name as a string, imports the module, and returns a reference to it. Combined with the power of eval(), you can construct a single expression that will wipe out all your files: + +

+>>> eval("__import__('subprocess').getoutput('rm /some/random/file')")  
+
    +
  1. Now imagine the output of 'rm -rf ~'. Actually there wouldn’t be any output, but you wouldn’t have any files left either. +
+ +

eval() is EVIL + +

Well, the evil part is evaluating arbitrary expressions from untrusted sources. You should only use eval() on trusted input. Of course, the trick is figuring out what’s “trusted.” But here’s something I know for certain: you should NOT take this alphametics solver and put it on the internet as a fun little web service. Don’t make the mistake of thinking, “Gosh, the function does a lot of string manipulation before getting a string to evaluate; I can’t imagine how someone could exploit that.” Someone WILL figure out how to sneak nasty executable code past all that string manipulation (stranger things have happened), and then you can kiss your server goodbye. + +

But surely there’s some way to evaluate expressions safely? To put eval() in a sandbox where it can’t access or harm the outside world? Well, yes and no. + +

+>>> x = 5
+>>> eval("x * 5", {}, {})               
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+  File "<string>", line 1, in <module>
+NameError: name 'x' is not defined
+>>> eval("x * 5", {"x": x}, {})         
+25
+>>> import math
+>>> eval("math.sqrt(x)", {"x": x}, {})  
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+  File "<string>", line 1, in <module>
+NameError: name 'math' is not defined
+
    +
  1. The second and third parameters passed to the eval() function act as the global and local namespaces for evaluating the expression. In this case, they are both empty, which means that when the string "x * 5" is evaluated, there is no reference to x in either the global or local namespace, so eval() raises a NameError. +
  2. You can selectively include specific values in the global namespace by listing them individually. Then those — and only those — variables will be available during evaluation. +
  3. Even though you just imported the math module, you didn’t include it in the namespace passed to the eval() function, so the evaluation failed. +
+ +

Gee, that was easy. Lemme make an alphametics web service now! + +

+>>> eval("pow(5, 2)", {}, {})                   
+25
+>>> eval("__import__('math').sqrt(5)", {}, {})  
+2.2360679774997898
+
    +
  1. Even though you’ve passed empty dictionaries for the global and local namespaces, all of Python’s built-in functions are still available during evaluation. So pow(5, 2) works, because 5 and 2 are literals, and pow() is a built-in function. +
  2. Unfortunately (and if you don’t see why it’s unfortunate, read on), the __import__() function is also a built-in function, so it works too. +
+ +

Yeah, that means you can still do nasty things, even if you explicitly set the global and local namespaces to empty dictionaries when calling eval(): + +

>>> eval("__import__('subprocess').getoutput('rm /some/random/file')", {}, {})
+ +

Oops. I’m glad I didn’t make that alphametics web service. Is there any way to use eval() safely? Well, yes and no. + +

+>>> eval("__import__('math').sqrt(5)",
+...     {"__builtins__":None}, {})          
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+  File "<string>", line 1, in <module>
+NameError: name '__import__' is not defined
+>>> eval("__import__('subprocess').getoutput('rm -rf /')",
+...     {"__builtins__":None}, {})          
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+  File "<string>", line 1, in <module>
+NameError: name '__import__' is not defined
+
    +
  1. To evaluate untrusted expressions safely, you need to define a global namespace dictionary that maps "__builtins__" to None, the Python null value. Internally, the “built-in” functions are contained within a pseudo-module called "__builtins__". This pseudo-module (i.e. the set of built-in functions) is made available to evaluated expressions unless you explicitly override it. +
  2. Be sure you’ve overridden __builtins__. Not __builtin__, __built-ins__, or some other variation that will work just fine but expose you to catastrophic risks. +
+ +

So eval() is safe now? Well, yes and no. + +

+>>> eval("2 ** 2147483647",
+...     {"__builtins__":None}, {})          
+
+
    +
  1. Even without access to __builtins__, you can still launch a denial-of-service attack. For example, trying to raise 2 to the 2147483647th power will spike your server’s CPU utilization to 100% for quite some time. (If you’re trying this in the interactive shell, press Ctrl-C a few times to break out of it.) Technically this expression will return a value eventually, but in the meantime your server will be doing a whole lot of nothing. +
+ +

In the end, it is possible to safely evaluate untrusted Python expressions, for some definition of “safe” that turns out not to be terribly useful in real life. It’s fine if you’re just playing around, and it’s fine if you only ever pass it trusted input. But anything else is just asking for trouble. + +
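If you did need to check translated equations from untrusted input, one way to sidestep eval() altogether is to parse the string yourself. The sketch below is a different technique from the one the solver uses, and it assumes the equation contains only non-negative integers, + signs, and a single ==. The function name `check_equation` is hypothetical.

```python
def check_equation(equation):
    """Return True if 'a + b + ... == c' holds, using int() instead of eval()."""
    left, right = equation.split('==')
    # int() tolerates surrounding whitespace and rejects anything that is not
    # a number, so arbitrary code in the string raises ValueError instead of running.
    return sum(int(term) for term in left.split('+')) == int(right)

good = check_equation('9567 + 1085 == 10652')
bad = check_equation('1 + 1 == 3')

try:
    check_equation("__import__('subprocess').getoutput('ls') == 1")
    rejected = False
except ValueError:
    rejected = True

print(good, bad, rejected)
```

The trade-off is generality: this handles only the one shape of expression it was written for, whereas eval() handles any Python expression.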

⁂ + +

Putting It All Together

+ +

To recap: this program solves alphametic puzzles by brute force, i.e. through an exhaustive search of all possible solutions. To do this, it… + +

    +
  1. Finds all the letters in the puzzle with the re.findall() function +
  2. Finds all the unique letters in the puzzle with sets and the set() function +
  3. Checks if there are more than 10 unique letters (meaning the puzzle is definitely unsolvable) with an assert statement +
  4. Converts the letters to their ASCII equivalents with a generator expression +
  5. Calculates all the possible solutions with the itertools.permutations() function +
  6. Converts each possible solution to a Python expression with the translate() string method +
  7. Tests each possible solution by evaluating the Python expression with the eval() function +
  8. Returns the first solution that evaluates to True +
+ +

…in just 14 lines of code. + +
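One detail of the listing that the recap skips is the test `zero not in guess[:n]`. Because the first letters of the words are placed at the front of sorted_characters, the first n positions of each guess hold the digits assigned to those first letters, so any guess with a zero there can be discarded before it is translated and evaluated. Here is a minimal sketch of that filter, using 1-character strings instead of code points for readability and only three digits to keep the output small. The name `has_leading_zero` is illustrative.

```python
import itertools

first_letters = 'SM'        # letters that start a word in SEND + MORE == MONEY
n = len(first_letters)      # the first n slots of each guess belong to them

def has_leading_zero(guess, n, zero='0'):
    """True if any of the first n positions of the guess is the digit zero."""
    return zero in guess[:n]

# All orderings of three digits; keep those with no zero in the first n slots.
kept = [guess for guess in itertools.permutations('012', 3)
        if not has_leading_zero(guess, n)]

print(kept)
```

Of the six permutations, only the two that put the zero in the last position survive.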

⁂ + +

Further Reading

+ + + +

Many thanks to Raymond Hettinger for agreeing to relicense his code so I could port it to Python 3 and use it as the basis for this chapter. + +

+ +

© 2001–10 Mark Pilgrim + + + diff --git a/colophon.html b/colophon.html index 0784e32..aa1df0d 100644 --- a/colophon.html +++ b/colophon.html @@ -1,87 +1,87 @@ - - - -Colophon - Dive Into Python 3 - - - - - -

  
-


Colophon

-
-

Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.
(I would have written a shorter letter, but I did not have the time.)
Blaise Pascal -

-

  -

Diving In

-

This book, like all books, was a labor of love. Oh sure, I got paid the medium-sized bucks for it, but nobody writes technical books for the money. And since this book is available on the web as well as on paper, I spent a lot of time fiddling with webby stuff when I should have been writing. - -

[typewriter] - -

The online edition loads as efficiently as possible. Efficiency never happens by accident; I spent many hours making it so. Perhaps too many hours. Yes, almost certainly too many hours. Never underestimate the depths to which a procrastinating writer will sink. - -

I won’t bore you with all the details. Wait, yes — I will bore you with all the details. But here’s the short version. - -

    -
  1. HTML is minimized, then served compressed. -
  2. Scripts and stylesheets are minimized by YUI Compressor (and also served compressed). -
  3. Scripts are combined to reduce HTTP requests. -
  4. Stylesheets are combined and inlined to reduce HTTP requests. -
  5. Unused CSS selectors and properties are removed on a page-by-page basis with a little help from pyquery. -
  6. HTTP caching and other server-side options are optimized based on advice from YSlow and Page Speed. -
  7. Pages use Unicode characters in place of images wherever possible. -
  8. Images are optimized with OptiPNG. -
  9. The entire book was lovingly hand-authored in HTML 5 to avoid markup cruft. -
- -

⁂ - -

Typography

- -

vertical rhythm, best available ampersand, curly quotes/apostrophes, other stuff from webtypography.net - -

⁂ - -

Graphics

- -

Unicode, callouts, font-family issues on Windows - -

⁂ - -

Performance

- -

"Dive Into History 2009 edition", minimizing CSS + JS + HTML, inline CSS, optimizing images - -

⁂ - -

Fun stuff

- -

Quotes, constrained writing(?), PapayaWhip - -

⁂ - -

Further Reading

- - - -

© 2001–10 Mark Pilgrim - - - + + + +Colophon - Dive Into Python 3 + + + + + +

  
+


Colophon

+
+

Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.
(I would have written a shorter letter, but I did not have the time.)
Blaise Pascal +

+

  +

Diving In

+

This book, like all books, was a labor of love. Oh sure, I got paid the medium-sized bucks for it, but nobody writes technical books for the money. And since this book is available on the web as well as on paper, I spent a lot of time fiddling with webby stuff when I should have been writing. + +

[typewriter] + +

The online edition loads as efficiently as possible. Efficiency never happens by accident; I spent many hours making it so. Perhaps too many hours. Yes, almost certainly too many hours. Never underestimate the depths to which a procrastinating writer will sink. + +

I won’t bore you with all the details. Wait, yes — I will bore you with all the details. But here’s the short version. + +

    +
  1. HTML is minimized, then served compressed. +
  2. Scripts and stylesheets are minimized by YUI Compressor (and also served compressed). +
  3. Scripts are combined to reduce HTTP requests. +
  4. Stylesheets are combined and inlined to reduce HTTP requests. +
  5. Unused CSS selectors and properties are removed on a page-by-page basis with a little help from pyquery. +
  6. HTTP caching and other server-side options are optimized based on advice from YSlow and Page Speed. +
  7. Pages use Unicode characters in place of images wherever possible. +
  8. Images are optimized with OptiPNG. +
  9. The entire book was lovingly hand-authored in HTML 5 to avoid markup cruft. +
+ +

⁂ + +

Typography

+ +

vertical rhythm, best available ampersand, curly quotes/apostrophes, other stuff from webtypography.net + +

⁂ + +

Graphics

+ +

Unicode, callouts, font-family issues on Windows + +

⁂ + +

Performance

+ +

"Dive Into History 2009 edition", minimizing CSS + JS + HTML, inline CSS, optimizing images + +

⁂ + +

Fun stuff

+ +

Quotes, constrained writing(?), PapayaWhip + +

⁂ + +

Further Reading

+ + + +

© 2001–10 Mark Pilgrim + + + diff --git a/files.html b/files.html index 474a5f2..f3edefc 100644 --- a/files.html +++ b/files.html @@ -1,607 +1,607 @@ - - -Files - Dive Into Python 3 - - - - - - -

  
-


Difficulty level: ♦♦♦♢♢ -

Files

-
-

A nine mile walk is no joke, especially in the rain.
— Harry Kemelman, The Nine Mile Walk -

-

  -

Diving In

-

My Windows laptop had 38,493 files before I installed a single application. Installing Python 3 added almost 3,000 files to that total. Files are the primary storage paradigm of every major operating system; the concept is so ingrained that most people would have trouble imagining an alternative. Your computer is, metaphorically speaking, drowning in files. - -

Reading From Text Files

- -

Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier: - -

a_file = open('examples/chinese.txt', encoding='utf-8')
- -

Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. There are five interesting things about this filename: - -

    -
  1. It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well. -
  2. The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows. -
  3. The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to what, you might ask? Patience, grasshopper. -
  4. It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-ASCII pathnames. -
  5. It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it. -
- -

But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar. - -

Character Encoding Rears Its Ugly Head

- -

Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string). - -
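You can watch this conversion happen without touching a file at all. This short sketch uses the `encode()` and `decode()` methods directly; the variable names are made up for illustration.

```python
character = '是'

# Encoding turns a string (characters) into bytes.
encoded = character.encode('utf-8')
print(len(character), len(encoded))   # 1 3
print(encoded)                        # b'\xe6\x98\xaf'

# Decoding with the same encoding gets the original string back.
decoded = encoded.decode('utf-8')
print(decoded == character)           # True

# Decoding with the wrong encoding can fail outright.
# Byte 0x8f has no meaning in CP-1252.
try:
    b'\x8f'.decode('cp1252')
    wrong_encoding_failed = False
except UnicodeDecodeError:
    wrong_encoding_failed = True
print(wrong_encoding_failed)          # True
```

Opening a file in text mode does the decoding step for you on every read, which is why Python needs to know the encoding.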

-# This example was created on Windows. Other platforms may
-# behave differently, for reasons outlined below.
->>> file = open('examples/chinese.txt')
->>> a_string = file.read()
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
-    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
-UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
->>> 
- - - -

What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError. - -

But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252). - -

-

If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file. - -

- -

Stream Objects

- -

So far, all we know is that Python has a built-in function called open(). The open() function returns a stream object, which has methods and attributes for getting information about and manipulating a stream of characters. - -

->>> a_file = open('examples/chinese.txt', encoding='utf-8')
->>> a_file.name                                              
-'examples/chinese.txt'
->>> a_file.encoding                                          
-'utf-8'
->>> a_file.mode                                              
-'r'
-
    -
  1. The name attribute reflects the name you passed in to the open() function when you opened the file. It is not normalized to an absolute pathname. -
  2. Likewise, the encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify the encoding when you opened the file (bad developer!) then the encoding attribute will reflect locale.getpreferredencoding(). -
  3. The mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r', which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings). -
- -
-

The documentation for the open() function lists all the possible file modes. -

- -

Reading Data From A Text File

- -

After you open a file for reading, you’ll probably want to read from it at some point. - -

->>> a_file = open('examples/chinese.txt', encoding='utf-8')
->>> a_file.read()                                            
-'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
->>> a_file.read()                                            
-''
-
    -
  1. Once you open a file (with the correct encoding), reading from it is just a matter of calling the stream object’s read() method. The result is a string. -
  2. Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider reading past end-of-file to be an error; it simply returns an empty string. -
- - - -

What if you want to re-read a file? - -

-# continued from the previous example
->>> a_file.read()                      
-''
->>> a_file.seek(0)                     
-0
->>> a_file.read(16)                    
-'Dive Into Python'
->>> a_file.read(1)                     
-' '
->>> a_file.read(1)
-'是'
->>> a_file.tell()                      
-20
-
    -
  1. Since you’re still at the end of the file, further calls to the stream object’s read() method simply return an empty string. -
  2. The seek() method moves to a specific byte position in a file. -
  3. The read() method can take an optional parameter, the number of characters to read. -
  4. If you like, you can even read one character at a time. -
  5. 16 + 1 + 1 = … 20? -
- -

Let’s try that again. - -

-# continued from the previous example
->>> a_file.seek(17)                    
-17
->>> a_file.read(1)                     
-'是'
->>> a_file.tell()                      
-20
-
    -
  1. Move to the 17th byte. -
  2. Read one character. -
  3. Now you’re on the 20th byte. -
- -

Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that the seek() and read() methods are counting the same thing. But that’s only true for some characters. - -
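The arithmetic is easy to check by hand. This sketch counts the same prefix of the line both ways, once in characters and once in UTF-8 bytes. It works on an in-memory string rather than the real examples/chinese.txt, which is an assumption made to keep it self-contained.

```python
line = 'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'

# Everything read so far in the example: 16 + 1 + 1 characters.
prefix = line[:18]
print(prefix)                          # Dive Into Python 是

characters_read = len(prefix)
bytes_read = len(prefix.encode('utf-8'))
print(characters_read)                 # 18
print(bytes_read)                      # 20
```

Seventeen single-byte characters plus one three-byte character makes 20 bytes, which is the position tell() reported.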

But wait, it gets worse! - -

->>> a_file.seek(18)                         
-18
->>> a_file.read(1)                          
-Traceback (most recent call last):
-  File "<pyshell#12>", line 1, in <module>
-    a_file.read(1)
-  File "C:\Python31\lib\codecs.py", line 300, in decode
-    (result, consumed) = self._buffer_decode(data, self.errors, final)
-UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
-
    -
  1. Move to the 18th byte and try to read one character. -
  2. Why does this fail? Because there isn’t a character at the 18th byte. The nearest character starts at the 17th byte (and goes for three bytes). Trying to read a character from the middle will fail with a UnicodeDecodeError. -
- -

Closing Files

- -

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them. - -

-# continued from the previous example
->>> a_file.close()
- -

Well that was anticlimactic. - -

The stream object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s not terribly useful. - -

-# continued from the previous example
->>> a_file.read()                           
-Traceback (most recent call last):
-  File "<pyshell#24>", line 1, in <module>
-    a_file.read()
-ValueError: I/O operation on closed file.
->>> a_file.seek(0)                          
-Traceback (most recent call last):
-  File "<pyshell#25>", line 1, in <module>
-    a_file.seek(0)
-ValueError: I/O operation on closed file.
->>> a_file.tell()                           
-Traceback (most recent call last):
-  File "<pyshell#26>", line 1, in <module>
-    a_file.tell()
-ValueError: I/O operation on closed file.
->>> a_file.close()                          
->>> a_file.closed                           
-True
-
    -
  1. You can’t read from a closed file; that raises a ValueError exception. -
  2. You can’t seek in a closed file either. -
  3. There’s no current position in a closed file, so the tell() method also fails. -
  4. Perhaps surprisingly, calling the close() method on a stream object whose file has been closed does not raise an exception. It’s just a no-op. -
  5. Closed stream objects do have one useful attribute: the closed attribute will confirm that the file is closed. -
- -

Closing Files Automatically

- - - -

Stream objects have an explicit close() method, but what happens if your code has a bug and crashes before you call close()? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is. - -

Python 2 had a solution for this: the try..finally block. That still works in Python 3, and you may see it in other people’s code or in older code that was ported to Python 3. But Python 2.5 introduced a cleaner solution, which is now the preferred solution in Python 3: the with statement. - -

with open('examples/chinese.txt', encoding='utf-8') as a_file:
-    a_file.seek(17)
-    a_character = a_file.read(1)
-    print(a_character)
- -

This code calls open(), but it never calls a_file.close(). The with statement starts a code block, like an if statement or a for loop. Inside this code block, you can use the variable a_file as the stream object returned from the call to open(). All the regular stream object methods are available — seek(), read(), whatever you need. When the with block ends, Python calls a_file.close() automatically. - -

Here’s the kicker: no matter how or when you exit the with block, Python will close that file… even if you “exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire program comes to a screeching halt, that file will get closed. Guaranteed. - -
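To see the guarantee in action, here is a small sketch that compares the older try..finally pattern with the with statement, and then exits a with block by way of an exception. The temporary file and the simulated RuntimeError are assumptions for the sake of a runnable example.

```python
import os
import tempfile

# Create a throwaway file to read from.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, mode='w', encoding='utf-8') as scratch:
    scratch.write('hello')

# The older pattern: close the file yourself in a finally clause.
old_style = open(path, encoding='utf-8')
try:
    data = old_style.read()
finally:
    old_style.close()

# The with statement closes the file even when the block raises.
try:
    with open(path, encoding='utf-8') as new_style:
        raise RuntimeError('simulated bug')
except RuntimeError:
    pass

print(old_style.closed, new_style.closed)   # True True
```

Both files end up closed, but the with version needs no cleanup code of its own.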

-

In technical terms, the with statement creates a runtime context. In these examples, the stream object acts as a context manager. Python creates the stream object a_file and tells it that it is entering a runtime context. When the with code block is completed, Python tells the stream object that it is exiting the runtime context, and the stream object calls its own close() method. See Appendix B, “Classes That Can Be Used in a with Block” for details. -

- -

There’s nothing file-specific about the with statement; it’s just a generic framework for creating runtime contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a stream object, then it does useful file-like things (like closing the file automatically). But that behavior is defined in the stream object, not in the with statement. There are lots of other ways to use context managers that have nothing to do with files. You can even create your own, as you’ll see later in this chapter. - -
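As a rough illustration of how little the with statement cares about files, here is a sketch of a context manager that does nothing but record when it is entered and exited. The `Tracker` class is hypothetical; the only real requirement is the pair of special methods `__enter__()` and `__exit__()`.

```python
class Tracker:
    """Hypothetical context manager that logs entering and exiting."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self            # this is what "as" binds to

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append('exit')
        return False           # do not swallow exceptions


with Tracker() as tracker:
    tracker.events.append('inside the block')

print(tracker.events)   # ['enter', 'inside the block', 'exit']
```

A stream object follows the same protocol; its `__exit__()` method just happens to close the file.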

Reading Data One Line At A Time

- -

A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line. - -

Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and it will all Just Work. - -

-

If you need fine-grained control over what’s considered a line ending, you can pass the optional newline parameter to the open() function. See the open() function documentation for all the gory details. -
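Here is a minimal sketch of the difference the newline parameter makes when reading. It writes a temporary file with mixed line endings (an assumption made purely for demonstration), then reads it back twice.

```python
import os
import tempfile

# Write raw bytes so the line endings land on disk exactly as given.
path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')
with open(path, mode='wb') as raw_file:
    raw_file.write(b'one\r\ntwo\n')

# Default behavior: every kind of line ending is translated to '\n'.
with open(path, encoding='utf-8') as a_file:
    translated = a_file.read()

# newline='' still recognizes the line endings but leaves them untouched.
with open(path, encoding='utf-8', newline='') as a_file:
    untranslated = a_file.read()

print(repr(translated))     # 'one\ntwo\n'
print(repr(untranslated))   # 'one\r\ntwo\n'
```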

- -

So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful. - -

[download oneline.py] -

line_number = 0
-with open('examples/favorite-people.txt', encoding='utf-8') as a_file:  
-    for a_line in a_file:                                               
-        line_number += 1
-        print('{:>4} {}'.format(line_number, a_line.rstrip()))          
-
    -
  1. Using the with pattern, you safely open the file and let Python close it for you. -
  2. To read a file one line at a time, use a for loop. That’s it. Besides having explicit methods like read(), the stream object is also an iterator which spits out a single line every time you ask for a value. -
  3. Using the format() string method, you can print out the line number and the line itself. The format specifier {:>4} means “print this argument right-justified within 4 spaces.” The a_line variable contains the complete line, line-ending characters and all. The rstrip() string method removes the trailing whitespace, including the line-ending characters. -
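If keeping a separate counter feels clumsy, the built-in enumerate() function can do the counting. This is a variation on the program above, not a replacement for it; it reads from an in-memory io.StringIO object instead of examples/favorite-people.txt so that it runs anywhere.

```python
import io

# Stand-in for the real file; any stream object iterates line by line.
a_file = io.StringIO('Dora\nEthan\nWesley\n')

formatted = []
for line_number, a_line in enumerate(a_file, start=1):
    formatted.append('{:>4} {}'.format(line_number, a_line.rstrip()))

print('\n'.join(formatted))
#    1 Dora
#    2 Ethan
#    3 Wesley
```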
- -
-you@localhost:~/diveintopython3$ python3 examples/oneline.py
-   1 Dora
-   2 Ethan
-   3 Wesley
-   4 John
-   5 Anne
-   6 Mike
-   7 Chris
-   8 Sarah
-   9 Alex
-  10 Lizzie
- -
-

Did you get this error? -

-you@localhost:~/diveintopython3$ python3 examples/oneline.py
-Traceback (most recent call last):
-  File "examples/oneline.py", line 4, in <module>
-    print('{:>4} {}'.format(line_number, a_line.rstrip()))
-ValueError: zero length field name in format
-

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. -

Python 3.0 supported string formatting, but only with explicitly numbered format specifiers. Python 3.1 allows you to omit the argument indexes in your format specifiers. Here is the Python 3.0-compatible version for comparison: -

print('{0:>4} {1}'.format(line_number, a_line.rstrip()))
-
- -

⁂ - -

Writing to Text Files

- - - -

You can write to files in much the same way that you read from them. First you open a file and get a stream object, then you use methods on the stream object to write data to the file, then you close the file. - -

To open a file for writing, use the open() function and specify the write mode. There are two file modes for writing: - -

- -

Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing. - -

You should always close a file as soon as you’re done writing to it, to release the file handle and ensure that the data is actually written to disk. As with reading data from a file, you can call the stream object’s close() method, or you can use the with statement and let Python close the file for you. I bet you can guess which technique I recommend. - -

->>> with open('test.log', mode='w', encoding='utf-8') as a_file:  
-...     a_file.write('test succeeded')                            
->>> with open('test.log', encoding='utf-8') as a_file:
-...     print(a_file.read())                              
-test succeeded
->>> with open('test.log', mode='a', encoding='utf-8') as a_file:  
-...     a_file.write('and again')
->>> with open('test.log', encoding='utf-8') as a_file:
-...     print(a_file.read())                              
-test succeededand again                                           
-
    -
  1. You start boldly by creating the new file test.log (or overwriting the existing file), and opening the file for writing. The mode='w' parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I hope you didn’t care about the previous contents of that file (if any), because that data is gone now. -
  2. You can add data to the newly opened file with the write() method of the stream object returned by the open() function. After the with block ends, Python automatically closes the file. -
  3. That was so fun, let’s do it again. But this time, with mode='a' to append to the file instead of overwriting it. Appending will never harm the existing contents of the file. -
  4. Both the original line you wrote and the second line you appended are now in the file test.log. Also note that neither carriage returns nor line feeds are included. Since you didn’t write them explicitly to the file either time, the file doesn’t include them. You can write a carriage return with the '\r' character, and/or a line feed with the '\n' character. Since you didn’t do either, everything you wrote to the file ended up on one line. -
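If you do want each write to land on its own line, write the line feed yourself. A minimal sketch, using a temporary directory so it doesn’t clobber any real test.log:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.log')

# mode='w' creates (or overwrites) the file.
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write('test succeeded\n')

# mode='a' appends to whatever is already there.
with open(path, mode='a', encoding='utf-8') as a_file:
    a_file.write('and again\n')

with open(path, encoding='utf-8') as a_file:
    lines = a_file.read().splitlines()

print(lines)   # ['test succeeded', 'and again']
```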
- -

Character Encoding Again

- -

Did you notice the encoding parameter that got passed in to the open() function while you were opening a file for writing? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain strings, they contain bytes. Reading a “string” from a text file only works because you told Python what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the same problem in reverse. You can’t write characters to a file; characters are an abstraction. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way to be sure it’s performing the correct conversion is to specify the encoding parameter when you open the file for writing. - -

⁂ - -

Binary Files

- -

my dog Beauregard - -

Not all files contain text. Some of them contain pictures of my dog. - -

->>> an_image = open('examples/beauregard.jpg', mode='rb')                
->>> an_image.mode                                                        
-'rb'
->>> an_image.name                                                        
-'examples/beauregard.jpg'
->>> an_image.encoding                                                    
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-AttributeError: '_io.BufferedReader' object has no attribute 'encoding'
-
    -
  1. Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the mode parameter contains a 'b' character. -
  2. The stream object you get from opening a file in binary mode has many of the same attributes, including mode, which reflects the mode parameter you passed into the open() function. -
  3. Binary stream objects also have a name attribute, just like text stream objects. -
  4. Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do. What you get out of a binary file is exactly what you put into it, no conversion necessary. -
- -

Did I mention you’re reading bytes? Oh yes you are. - -

-# continued from the previous example
->>> an_image.tell()
-0
->>> data = an_image.read(3)  
->>> data
-b'\xff\xd8\xff'
->>> type(data)               
-<class 'bytes'>
->>> an_image.tell()          
-3
->>> an_image.seek(0)
-0
->>> data = an_image.read()
->>> len(data)
-3150
-
    -
  1. Like text files, you can read binary files a little bit at a time. But there’s a crucial difference… -
  2. …you’re reading bytes, not strings. Since you opened the file in binary mode, the read() method takes the number of bytes to read, not the number of characters. -
  3. That means that there’s never an unexpected mismatch between the number you passed into the read() method and the position index you get out of the tell() method. The read() method reads bytes, and the seek() and tell() methods track the number of bytes read. For binary files, they’ll always agree. -
- -

⁂ - -

Stream Objects From Non-File Sources

- - - -

Imagine you’re writing a library, and one of your library functions is going to read some data from a file. The function could simply take a filename as a string, go open the file for reading, read it, and close it before exiting. But you shouldn’t do that. Instead, your API should take an arbitrary stream object. - -

In the simplest case, a stream object is anything with a read() method which takes an optional size parameter and returns a string. When called with no size parameter, the read() method should read everything there is to read from the input source and return all the data as a single value. When called with a size parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data. - -

That sounds exactly like the stream object you get from opening a real file. The difference is that you’re not limiting yourself to real files. The input source that’s being “read” could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a stream object and simply call the object’s read() method, you can handle any input source that acts like a file, without specific code to handle each kind of input. - -
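Here is a sketch of what such an API might look like. The `count_words()` function is hypothetical; the point is that it only ever calls read(), so it cannot tell a string in memory from a file on disk.

```python
import io
import os
import tempfile

def count_words(stream):
    """Hypothetical library function: works on anything with a read() method."""
    return len(stream.read().split())

text = 'PapayaWhip is the new black.'

# A string in memory, wrapped in a stream object.
from_memory = count_words(io.StringIO(text))

# A real file on disk (a temporary one, for the sake of the example).
path = os.path.join(tempfile.mkdtemp(), 'papayawhip.txt')
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write(text)
with open(path, encoding='utf-8') as a_file:
    from_disk = count_words(a_file)

print(from_memory, from_disk)   # 5 5
```

Callers decide where the data comes from and how it is opened; the function stays the same.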

->>> a_string = 'PapayaWhip is the new black.'
->>> import io                                  
->>> a_file = io.StringIO(a_string)             
->>> a_file.read()                              
-'PapayaWhip is the new black.'
->>> a_file.read()                              
-''
->>> a_file.seek(0)                             
-0
->>> a_file.read(10)                            
-'PapayaWhip'
->>> a_file.tell()                       
-10
->>> a_file.seek(18)
-18
->>> a_file.read()
-'new black.'
-
    -
  1. The io module defines the StringIO class that you can use to treat a string in memory as a file. -
  2. To create a stream object out of a string, create an instance of the io.StringIO() class and pass it the string you want to use as your “file” data. Now you have a stream object, and you can do all sorts of stream-like things with it. -
  3. Calling the read() method “reads” the entire “file,” which in the case of a StringIO object simply returns the original string. -
  4. Just like a real file, calling the read() method again returns an empty string. -
  5. You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the seek() method of the StringIO object. -
  6. You can also read the string in chunks, by passing a size parameter to the read() method. -
- -
-

io.StringIO lets you treat a string as a text file. There’s also an io.BytesIO class, which lets you treat a byte array as a binary file. -
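A quick sketch of the binary counterpart. The four bytes here are arbitrary sample data, not the contents of any real image.

```python
import io

a_binary_file = io.BytesIO(b'\xff\xd8\xff\xe0')

first_three = a_binary_file.read(3)
position = a_binary_file.tell()
the_rest = a_binary_file.read()

print(first_three)   # b'\xff\xd8\xff'
print(position)      # 3
print(the_rest)      # b'\xe0'
```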

- -

Handling Compressed Files

- -

The Python standard library contains modules that support reading and writing compressed files. There are a number of different compression schemes; the two most popular on non-Windows systems are gzip and bzip2. (You may have also encountered PKZIP archives and GNU Tar archives. Python has modules for those, too.) - -

The gzip module lets you create a stream object for reading or writing a gzip-compressed file. The stream object it gives you supports the read() method (if you opened it for reading) or the write() method (if you opened it for writing). That means you can use the methods you’ve already learned for regular files to directly read or write a gzip-compressed file, without creating a temporary file to store the decompressed data. - -

As an added bonus, it supports the with statement too, so you can let Python automatically close your gzip-compressed file when you’re done with it. - -
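A minimal round trip, staying entirely inside Python: write a compressed file, then read it back with the same module. The temporary directory is an assumption to keep the sketch self-contained.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'out.log.gz')
message = 'A nine mile walk is no joke, especially in the rain.'

# Binary mode: the gzip stream wants bytes, so encode the string first.
with gzip.open(path, mode='wb') as z_file:
    z_file.write(message.encode('utf-8'))

# Reading gives bytes back; decode them to get the string.
with gzip.open(path, mode='rb') as z_file:
    round_trip = z_file.read().decode('utf-8')

print(round_trip == message)   # True
```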

-you@localhost:~$ python3
-
->>> import gzip
->>> with gzip.open('out.log.gz', mode='wb') as z_file:                                      
-...   z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
-... 
->>> exit()
-
-you@localhost:~$ ls -l out.log.gz                                                           
--rw-r--r--  1 mark mark    79 2009-07-19 14:29 out.log.gz
-you@localhost:~$ gunzip out.log.gz                                                          
-you@localhost:~$ cat out.log                                                                
-A nine mile walk is no joke, especially in the rain.
-
    -
  1. You should always open gzipped files in binary mode. (Note the 'b' character in the mode argument.) -
  2. I constructed this example on Linux. If you’re not familiar with the command line, this command is showing the “long listing” of the gzip-compressed file you just created in the Python Shell. This listing shows that the file exists (good), and that it is 79 bytes long. That’s actually larger than the string you started with! The gzip file format includes a fixed-length header that contains some metadata about the file, so it’s inefficient for extremely small files. -
  3. The gunzip command (pronounced “gee-unzip”) decompresses the file and stores the contents in a new file named the same as the compressed file but without the .gz file extension. -
  4. The cat command displays the contents of a file. This file contains the string you originally wrote directly to the compressed file out.log.gz from within the Python Shell. -
- -
-

Did you get this error? -

->>> with gzip.open('out.log.gz', mode='wb') as z_file:
-...         z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
-... 
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
-AttributeError: 'GzipFile' object has no attribute '__exit__'
-

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. -

Python 3.0 had a gzip module, but it did not support using a gzipped-file object as a context manager. Python 3.1 added the ability to use gzipped-file objects in a with statement. -

- -

⁂ - -

Standard Input, Output, and Error

- - - -

Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. - -

Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX-like system, including Mac OS X and Linux. When you call the print() function, the thing you’re printing is sent to the stdout pipe. When your program crashes and prints out a traceback, it goes to the stderr pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. In the graphical Python Shell, the stdout and stderr pipes default to your “Interactive Window”. - -

->>> for i in range(3):
-...     print('PapayaWhip')        
-PapayaWhip
-PapayaWhip
-PapayaWhip
->>> import sys
->>> for i in range(3):
-... sys.stdout.write('is the')     
-is theis theis the
->>> for i in range(3):
-... sys.stderr.write('new black')  
-new blacknew blacknew black
-
    -
  1. The print() function, in a loop. Nothing surprising here. -
  2. stdout is defined in the sys module, and it is a stream object. Calling its write() method will print out whatever string you give it. In fact, this is what the print function really does; it adds a newline to the end of the string you’re printing, and calls sys.stdout.write. -
  3. In the simplest case, sys.stdout and sys.stderr send their output to the same place: the Python IDE (if you’re in one), or the terminal (if you’re running Python from the command line). Like standard output, standard error does not add newlines for you. If you want newlines, you’ll need to write newline characters yourself. -
- -
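To get one item per line when writing to the streams directly, you supply the line ending yourself. A small sketch of that, using only sys.stdout and sys.stderr as described above:

```python
import sys

# write() does not append a line ending, so add '\n' explicitly.
for i in range(3):
    sys.stdout.write('is the\n')

# write() returns the number of characters written.
chars_written = sys.stdout.write('PapayaWhip\n')

# Standard error works the same way.
sys.stderr.write('new black\n')
```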

sys.stdout and sys.stderr are stream objects, but they are write-only. Attempting to call their read() method will always raise an IOError. - -

->>> import sys
->>> sys.stdout.read()
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-IOError: not readable
- -

Redirecting Standard Output

- -

sys.stdout and sys.stderr are stream objects, albeit ones that only support writing. But they’re not constants; they’re variables. That means you can assign them a new value — any other stream object — to redirect their output. - -

[download stdout.py] -

import sys
-
-class RedirectStdoutTo:
-    def __init__(self, out_new):
-        self.out_new = out_new
-
-    def __enter__(self):
-        self.out_old = sys.stdout
-        sys.stdout = self.out_new
-
-    def __exit__(self, *args):
-        sys.stdout = self.out_old
-
-print('A')
-with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
-    print('B')
-print('C')
- -

Check this out: - -

-you@localhost:~/diveintopython3/examples$ python3 stdout.py
-A
-C
-you@localhost:~/diveintopython3/examples$ cat out.log
-B
- -
-

Did you get this error? -

-you@localhost:~/diveintopython3/examples$ python3 stdout.py
-  File "stdout.py", line 15
-    with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
-                                                              ^
-SyntaxError: invalid syntax
-

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. -

Python 3.0 supported the with statement, but each statement can only use one context manager. Python 3.1 allows you to chain multiple context managers in a single with statement. -

- -

Let’s take the last part first. - -

print('A')
-with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
-    print('B')
-print('C')
- -

That’s a complicated with statement. Let me rewrite it as something more recognizable. - -

with open('out.log', mode='w', encoding='utf-8') as a_file:
-    with RedirectStdoutTo(a_file):
-        print('B')
- -

As the rewrite shows, you actually have two with statements, one nested within the scope of the other. The “outer” with statement should be familiar by now: it opens a UTF-8-encoded text file named out.log for writing and assigns the stream object to a variable named a_file. But that’s not the only thing odd here. -

with RedirectStdoutTo(a_file):
- -

Where’s the as clause? The with statement doesn’t actually require one. Just like you can call a function and ignore its return value, you can have a with statement that doesn’t assign the with context to a variable. In this case, you’re only interested in the side effects of the RedirectStdoutTo context. - -

What are those side effects? Take a look inside the RedirectStdoutTo class. This class is a custom context manager. Any class can be a context manager by defining two special methods: __enter__() and __exit__(). - -

class RedirectStdoutTo:
-    def __init__(self, out_new):    
-        self.out_new = out_new
-
-    def __enter__(self):            
-        self.out_old = sys.stdout
-        sys.stdout = self.out_new
-
-    def __exit__(self, *args):      
-        sys.stdout = self.out_old
-
    -
  1. The __init__() method is called immediately after an instance is created. It takes one parameter, the stream object that you want to use as standard output for the life of the context. This method just saves the stream object in an instance variable so other methods can use it later. -
  2. The __enter__() method is a special class method; Python calls it when entering a context (i.e. at the beginning of the with statement). This method saves the current value of sys.stdout in self.out_old, then redirects standard output by assigning self.out_new to sys.stdout. -
  3. The __exit__() method is another special class method; Python calls it when exiting the context (i.e. at the end of the with statement). This method restores standard output to its original value by assigning the saved self.out_old value to sys.stdout. -
- -

Putting it all together: - -


-print('A')                                                                             
-with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):  
-    print('B')                                                                         
-print('C')                                                                             
-
    -
  1. This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command line). -
  2. This with statement takes a comma-separated list of contexts. The comma-separated list acts like a series of nested with blocks. The first context listed is the “outer” block; the last one listed is the “inner” block. The first context opens a file; the second context redirects sys.stdout to the stream object that was created in the first context. -
  3. Because this print() function is executed with the context created by the with statement, it will not print to the screen; it will write to the file out.log. -
  4. The with code block is over. Python has told each context manager to do whatever it is they do upon exiting a context. The context managers form a last-in-first-out stack. Upon exiting, the second context changed sys.stdout back to its original value, then the first context closed the file named out.log. Since standard output has been restored to its original value, calling the print() function will once again print to the screen. -
- -

Redirecting standard error works exactly the same way, using sys.stderr instead of sys.stdout. - -
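Here is what that might look like. This is a sketch, not part of stdout.py: the class name RedirectStderrTo is made up for illustration, and it captures into an in-memory io.StringIO object instead of a file so you can inspect the result.

```python
import io
import sys

class RedirectStderrTo:
    """Hypothetical counterpart to RedirectStdoutTo, for sys.stderr."""
    def __init__(self, err_new):
        self.err_new = err_new

    def __enter__(self):
        self.err_old = sys.stderr
        sys.stderr = self.err_new

    def __exit__(self, *args):
        sys.stderr = self.err_old

buffer = io.StringIO()
original = sys.stderr

with RedirectStderrTo(buffer):
    print('B', file=sys.stderr)   # lands in the buffer, not on the screen

captured = buffer.getvalue()
print(repr(captured))
```

Later versions of Python ship ready-made equivalents in the contextlib module (redirect_stdout and redirect_stderr), so you may not need to write your own.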

⁂ - -

Further Reading

- - - -

- -

© 2001–10 Mark Pilgrim - - - + + +Files - Dive Into Python 3 + + + + + + +

  
+


Difficulty level: ♦♦♦♢♢ +

Files

+
+

A nine mile walk is no joke, especially in the rain.
— Harry Kemelman, The Nine Mile Walk +

+

  +

Diving In

+

My Windows laptop had 38,493 files before I installed a single application. Installing Python 3 added almost 3,000 files to that total. Files are the primary storage paradigm of every major operating system; the concept is so ingrained that most people would have trouble imagining an alternative. Your computer is, metaphorically speaking, drowning in files. + +

Reading From Text Files

+ +

Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier: + +

a_file = open('examples/chinese.txt', encoding='utf-8')
+ +

Python has a built-in open() function, which takes a filename as an argument. Here the filename is 'examples/chinese.txt'. There are five interesting things about this filename: + +

    +
  1. It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the open() function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well. +
  2. The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows. +
  3. The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to what, you might ask? Patience, grasshopper. +
  4. It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-ASCII pathnames. +
  5. It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it. +
+ +

But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding. Oh dear, that sounds dreadfully familiar. + +

Character Encoding Rears Its Ugly Head

+ +

Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string). + +

+# This example was created on Windows. Other platforms may
+# behave differently, for reasons outlined below.
+>>> file = open('examples/chinese.txt')
+>>> a_string = file.read()
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
+    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
+UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
+>>> 
+ + + +

What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError. + +

But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252). + +

+

If you need to get the default character encoding, import the locale module and call locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file. + +

+ +
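A minimal sketch of checking the default and then sidestepping it. The scratch file is an assumption standing in for examples/chinese.txt, and the printed default will vary from machine to machine, so no output is shown for it.

```python
import locale
import os
import tempfile

# The platform default; expect different answers on different computers.
default_encoding = locale.getpreferredencoding()
print(default_encoding)

# Hypothetical scratch file standing in for examples/chinese.txt.
path = os.path.join(tempfile.mkdtemp(), 'chinese.txt')
text = 'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'

# Naming the encoding both times makes the round trip independent of the default.
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write(text)
with open(path, encoding='utf-8') as a_file:
    round_trip = a_file.read()

print(round_trip == text)
```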

Stream Objects

+ +

So far, all we know is that Python has a built-in function called open(). The open() function returns a stream object, which has methods and attributes for getting information about and manipulating a stream of characters. + +

+>>> a_file = open('examples/chinese.txt', encoding='utf-8')
+>>> a_file.name                                              
+'examples/chinese.txt'
+>>> a_file.encoding                                          
+'utf-8'
+>>> a_file.mode                                              
+'r'
+
    +
  1. The name attribute reflects the name you passed in to the open() function when you opened the file. It is not normalized to an absolute pathname. +
  2. Likewise, the encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify the encoding when you opened the file (bad developer!) then the encoding attribute will reflect locale.getpreferredencoding(). +
  3. The mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r', which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings). +
+ +
+

The documentation for the open() function lists all the possible file modes. +

+ +

Reading Data From A Text File

+ +

After you open a file for reading, you’ll probably want to read from it at some point. + +

+>>> a_file = open('examples/chinese.txt', encoding='utf-8')
+>>> a_file.read()                                            
+'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
+>>> a_file.read()                                            
+''
+
    +
  1. Once you open a file (with the correct encoding), reading from it is just a matter of calling the stream object’s read() method. The result is a string. +
  2. Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider reading past end-of-file to be an error; it simply returns an empty string. +
+ + + +

What if you want to re-read a file? + +

+# continued from the previous example
+>>> a_file.read()                      
+''
+>>> a_file.seek(0)                     
+0
+>>> a_file.read(16)                    
+'Dive Into Python'
+>>> a_file.read(1)                     
+' '
+>>> a_file.read(1)
+'是'
+>>> a_file.tell()                      
+20
+
    +
  1. Since you’re still at the end of the file, further calls to the stream object’s read() method simply return an empty string. +
  2. The seek() method moves to a specific byte position in a file. +
  3. The read() method can take an optional parameter, the number of characters to read. +
  4. If you like, you can even read one character at a time. +
  5. 16 + 1 + 1 = … 20? +
+ +

Let’s try that again. + +

+# continued from the previous example
+>>> a_file.seek(17)                    
+17
+>>> a_file.read(1)                     
+'是'
+>>> a_file.tell()                      
+20
+
    +
  1. Move to the 17th byte. +
  2. Read one character. +
  3. Now you’re on the 20th byte. +
+ +

Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8. The English characters in the file only require one byte each, so you might be misled into thinking that the seek() and read() methods are counting the same thing. But that’s only true for some characters. + +
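You can work out the arithmetic without touching the file at all. The helper below is a sketch (byte_offset is a made-up name, not a standard function): it encodes everything before a given character and counts the bytes.

```python
text = 'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'

def byte_offset(a_string, char_index, encoding='utf-8'):
    """Return the byte position at which character number char_index starts."""
    return len(a_string[:char_index].encode(encoding))

print(byte_offset(text, 17))   # where the first Chinese character starts
print(byte_offset(text, 18))   # where the character after it starts
```

The first 17 characters are plain ASCII, one byte each. The 18th character takes three bytes in UTF-8, which is how 16 + 1 + 1 characters add up to byte position 20.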

But wait, it gets worse! + +

+>>> a_file.seek(18)                         
+18
+>>> a_file.read(1)                          
+Traceback (most recent call last):
+  File "<pyshell#12>", line 1, in <module>
+    a_file.read(1)
+  File "C:\Python31\lib\codecs.py", line 300, in decode
+    (result, consumed) = self._buffer_decode(data, self.errors, final)
+UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
+
    +
  1. Move to the 18th byte and try to read one character. +
  2. Why does this fail? Because there isn’t a character at the 18th byte. The nearest character starts at the 17th byte (and goes for three bytes). Trying to read a character from the middle will fail with a UnicodeDecodeError. +
+ +
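The same failure can be reproduced in memory, which makes it easier to see why it happens. A small sketch, assuming nothing beyond the built-in str and bytes types:

```python
encoded = '是'.encode('utf-8')    # three bytes: b'\xe6\x98\xaf'
print(encoded)

try:
    # Skip the first byte and start decoding in the middle of the character.
    encoded[1:].decode('utf-8')
except UnicodeDecodeError as error:
    failed = True
    print('cannot decode byte', hex(error.object[error.start]))
else:
    failed = False
```

Byte 0x98 is a continuation byte; it only makes sense after the lead byte 0xe6, so the decoder has nowhere to start.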

Closing Files

+ +

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them. + +

+# continued from the previous example
+>>> a_file.close()
+ +

Well that was anticlimactic. + +

The stream object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s not terribly useful. + +

+# continued from the previous example
+>>> a_file.read()                           
+Traceback (most recent call last):
+  File "<pyshell#24>", line 1, in <module>
+    a_file.read()
+ValueError: I/O operation on closed file.
+>>> a_file.seek(0)                          
+Traceback (most recent call last):
+  File "<pyshell#25>", line 1, in <module>
+    a_file.seek(0)
+ValueError: I/O operation on closed file.
+>>> a_file.tell()                           
+Traceback (most recent call last):
+  File "<pyshell#26>", line 1, in <module>
+    a_file.tell()
+ValueError: I/O operation on closed file.
+>>> a_file.close()                          
+>>> a_file.closed                           
+True
+
    +
  1. You can’t read from a closed file; that raises a ValueError exception. +
  2. You can’t seek in a closed file either. +
  3. There’s no current position in a closed file, so the tell() method also fails. +
  4. Perhaps surprisingly, calling the close() method on a stream object whose file has been closed does not raise an exception. It’s just a no-op. +
  5. Closed stream objects do have one useful attribute: the closed attribute will confirm that the file is closed. +
+ +
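If your code might be handed a stream object that someone else already closed, you can guard against it. A sketch under the assumption of a throwaway scratch file (the path is illustrative):

```python
import os
import tempfile

# Hypothetical scratch file so there is something to open and close.
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write('test')

a_file = open(path, encoding='utf-8')
a_file.close()

try:
    a_file.read()
except ValueError as error:
    # Same exception type as in the tracebacks above.
    message = str(error)
    print(message)

a_file.close()            # closing twice is harmless
print(a_file.closed)      # the one attribute that is still useful
```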

Closing Files Automatically

+ + + +

Stream objects have an explicit close() method, but what happens if your code has a bug and crashes before you call close()? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is. + +

Python 2 had a solution for this: the try..finally block. That still works in Python 3, and you may see it in other people’s code or in older code that was ported to Python 3. But Python 2.5 introduced a cleaner solution, which is now the preferred solution in Python 3: the with statement. + +

with open('examples/chinese.txt', encoding='utf-8') as a_file:
+    a_file.seek(17)
+    a_character = a_file.read(1)
+    print(a_character)
+ +

This code calls open(), but it never calls a_file.close(). The with statement starts a code block, like an if statement or a for loop. Inside this code block, you can use the variable a_file as the stream object returned from the call to open(). All the regular stream object methods are available — seek(), read(), whatever you need. When the with block ends, Python calls a_file.close() automatically. + +

Here’s the kicker: no matter how or when you exit the with block, Python will close that file… even if you “exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire program comes to a screeching halt, that file will get closed. Guaranteed. + +
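You don’t have to take that on faith. The sketch below raises an exception on purpose inside a with block, then shows the older try..finally pattern for comparison. The scratch file and the RuntimeError are assumptions made up for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file to open.
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write('x')

# The with statement closes the file even though the block blows up.
try:
    with open(path, encoding='utf-8') as a_file:
        raise RuntimeError('simulated bug inside the with block')
except RuntimeError:
    pass
print(a_file.closed)

# The try..finally equivalent: more typing, same guarantee.
b_file = open(path, encoding='utf-8')
try:
    data = b_file.read()
finally:
    b_file.close()
print(b_file.closed)
```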

+

In technical terms, the with statement creates a runtime context. In these examples, the stream object acts as a context manager. Python creates the stream object a_file and tells it that it is entering a runtime context. When the with code block is completed, Python tells the stream object that it is exiting the runtime context, and the stream object calls its own close() method. See Appendix B, “Classes That Can Be Used in a with Block” for details. +

+ +

There’s nothing file-specific about the with statement; it’s just a generic framework for creating runtime contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a stream object, then it does useful file-like things (like closing the file automatically). But that behavior is defined in the stream object, not in the with statement. There are lots of other ways to use context managers that have nothing to do with files. You can even create your own, as you’ll see later in this chapter. + +

Reading Data One Line At A Time

+ +

A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line. + +

Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and it will all Just Work. + +

+

If you need fine-grained control over what’s considered a line ending, you can pass the optional newline parameter to the open() function. See the open() function documentation for all the gory details. +

+ +

So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful. + +

[download oneline.py] +

line_number = 0
+with open('examples/favorite-people.txt', encoding='utf-8') as a_file:  
+    for a_line in a_file:                                               
+        line_number += 1
+        print('{:>4} {}'.format(line_number, a_line.rstrip()))          
+
    +
  1. Using the with pattern, you safely open the file and let Python close it for you. +
  2. To read a file one line at a time, use a for loop. That’s it. Besides having explicit methods like read(), the stream object is also an iterator which spits out a single line every time you ask for a value. +
  3. Using the format() string method, you can print out the line number and the line itself. The format specifier {:>4} means “print this argument right-justified within 4 spaces.” The a_line variable contains the complete line, carriage returns and all. The rstrip() string method removes the trailing whitespace, including the carriage return characters. +
+ +
+you@localhost:~/diveintopython3$ python3 examples/oneline.py
+   1 Dora
+   2 Ethan
+   3 Wesley
+   4 John
+   5 Anne
+   6 Mike
+   7 Chris
+   8 Sarah
+   9 Alex
+  10 Lizzie
+ +
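As a variation, the built-in enumerate() function can do the counting for you. This is an alternative sketch, not the code in oneline.py; it writes its own three-line scratch file so it runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for examples/favorite-people.txt.
path = os.path.join(tempfile.mkdtemp(), 'favorite-people.txt')
with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write('Dora\nEthan\nWesley\n')

numbered_lines = []
with open(path, encoding='utf-8') as a_file:
    # enumerate() pairs each line with a counter, starting at 1 here.
    for line_number, a_line in enumerate(a_file, start=1):
        numbered_lines.append('{:>4} {}'.format(line_number, a_line.rstrip()))

print('\n'.join(numbered_lines))
```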
+

Did you get this error? +

+you@localhost:~/diveintopython3$ python3 examples/oneline.py
+Traceback (most recent call last):
+  File "examples/oneline.py", line 4, in <module>
+    print('{:>4} {}'.format(line_number, a_line.rstrip()))
+ValueError: zero length field name in format
+

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. +

Python 3.0 supported string formatting, but only with explicitly numbered format specifiers. Python 3.1 allows you to omit the argument indexes in your format specifiers. Here is the Python 3.0-compatible version for comparison: +

print('{0:>4} {1}'.format(line_number, a_line.rstrip()))
+
+ +

⁂ + +

Writing to Text Files

+ + + +

You can write to files in much the same way that you read from them. First you open a file and get a stream object, then you use methods on the stream object to write data to the file, then you close the file. + +

To open a file for writing, use the open() function and specify the write mode. There are two file modes for writing: mode='w' overwrites the file, and mode='a' appends data to the end of it. + +

+ +

Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing. + +

You should always close a file as soon as you’re done writing to it, to release the file handle and ensure that the data is actually written to disk. As with reading data from a file, you can call the stream object’s close() method, or you can use the with statement and let Python close the file for you. I bet you can guess which technique I recommend. + +

+>>> with open('test.log', mode='w', encoding='utf-8') as a_file:  
+...     a_file.write('test succeeded')                            
+>>> with open('test.log', encoding='utf-8') as a_file:
+...     print(a_file.read())                              
+test succeeded
+>>> with open('test.log', mode='a', encoding='utf-8') as a_file:  
+...     a_file.write('and again')
+>>> with open('test.log', encoding='utf-8') as a_file:
+...     print(a_file.read())                              
+test succeededand again                                           
+
    +
  1. You start boldly by creating the new file test.log (or overwriting the existing file), and opening the file for writing. The mode='w' parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I hope you didn’t care about the previous contents of that file (if any), because that data is gone now. +
  2. You can add data to the newly opened file with the write() method of the stream object returned by the open() function. After the with block ends, Python automatically closes the file. +
  3. That was so fun, let’s do it again. But this time, with mode='a' to append to the file instead of overwriting it. Appending will never harm the existing contents of the file. +
  4. Both the original line you wrote and the second line you appended are now in the file test.log. Also note that neither carriage returns nor line feeds are included. Since you didn’t write them explicitly to the file either time, the file doesn’t include them. You can write a carriage return with the '\r' character, and/or a line feed with the '\n' character. Since you didn’t do either, everything you wrote to the file ended up on one line. +
+ +
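If you want each write to land on its own line, include the line feed yourself. A sketch of the same two writes with '\n' added, using a scratch directory so your real test.log is left alone (the path is an assumption):

```python
import os
import tempfile

# Hypothetical scratch location for the log file.
path = os.path.join(tempfile.mkdtemp(), 'test.log')

with open(path, mode='w', encoding='utf-8') as a_file:
    a_file.write('test succeeded\n')     # explicit line feed

with open(path, mode='a', encoding='utf-8') as a_file:
    a_file.write('and again\n')

with open(path, encoding='utf-8') as a_file:
    lines = a_file.read().splitlines()

print(lines)
```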

Character Encoding Again

+ +

Did you notice the encoding parameter that got passed in to the open() function while you were opening a file for writing? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain strings, they contain bytes. Reading a “string” from a text file only works because you told Python what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the same problem in reverse. You can’t write characters to a file; characters are an abstraction. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way to be sure it’s performing the correct conversion is to specify the encoding parameter when you open the file for writing. + +
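To see that the encoding really changes what ends up on disk, write the same one-character string twice and compare file sizes. A sketch, assuming a scratch directory and two encodings picked for illustration:

```python
import os
import tempfile

a_string = '是'
directory = tempfile.mkdtemp()    # hypothetical scratch directory
sizes = {}

for encoding in ('utf-8', 'utf-16-le'):
    path = os.path.join(directory, encoding + '.txt')
    with open(path, mode='w', encoding=encoding) as a_file:
        a_file.write(a_string)
    # Same character, different number of bytes on disk.
    sizes[encoding] = os.path.getsize(path)

print(sizes)
```

One character, two different byte sequences. Read either file back with the wrong encoding and you get garbage or an exception.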

⁂ + +

Binary Files

+ +

my dog Beauregard + +

Not all files contain text. Some of them contain pictures of my dog. + +

+>>> an_image = open('examples/beauregard.jpg', mode='rb')                
+>>> an_image.mode                                                        
+'rb'
+>>> an_image.name                                                        
+'examples/beauregard.jpg'
+>>> an_image.encoding                                                    
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+AttributeError: '_io.BufferedReader' object has no attribute 'encoding'
+
    +
  1. Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the mode parameter contains a 'b' character. +
  2. The stream object you get from opening a file in binary mode has many of the same attributes, including mode, which reflects the mode parameter you passed into the open() function. +
  3. Binary stream objects also have a name attribute, just like text stream objects. +
  4. Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do. What you get out of a binary file is exactly what you put into it, no conversion necessary. +
+ +

Did I mention you’re reading bytes? Oh yes you are. + +

+# continued from the previous example
+>>> an_image.tell()
+0
+>>> data = an_image.read(3)  
+>>> data
+b'\xff\xd8\xff'
+>>> type(data)               
+<class 'bytes'>
+>>> an_image.tell()          
+3
+>>> an_image.seek(0)
+0
+>>> data = an_image.read()
+>>> len(data)
+3150
+
    +
  1. Like text files, you can read binary files a little bit at a time. But there’s a crucial difference… +
  2. …you’re reading bytes, not strings. Since you opened the file in binary mode, the read() method takes the number of bytes to read, not the number of characters. +
  3. That means that there’s never an unexpected mismatch between the number you passed into the read() method and the position index you get out of the tell() method. The read() method reads bytes, and the seek() and tell() methods track the number of bytes read. For binary files, they’ll always agree. +
+ +
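You can try this without a picture of anyone’s dog. The sketch below fabricates a small binary file (three marker bytes followed by ten zero bytes, an arbitrary choice) and reads it back a piece at a time.

```python
import os
import tempfile

# Hypothetical scratch file with made-up binary content.
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with open(path, mode='wb') as a_file:
    a_file.write(b'\xff\xd8\xff' + bytes(10))

with open(path, mode='rb') as a_file:
    header = a_file.read(3)       # three bytes, not three characters
    position = a_file.tell()      # agrees with the number of bytes read
    rest = a_file.read()

print(header, type(header), position, len(rest))
```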

⁂ + +

Stream Objects From Non-File Sources

+ + + +

Imagine you’re writing a library, and one of your library functions is going to read some data from a file. The function could simply take a filename as a string, go open the file for reading, read it, and close it before exiting. But you shouldn’t do that. Instead, your API should take an arbitrary stream object. + +

In the simplest case, a stream object is anything with a read() method which takes an optional size parameter and returns a string. When called with no size parameter, the read() method should read everything there is to read from the input source and return all the data as a single value. When called with a size parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data. + +

That sounds exactly like the stream object you get from opening a real file. The difference is that you’re not limiting yourself to real files. The input source that’s being “read” could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a stream object and simply call the object’s read() method, you can handle any input source that acts like a file, without specific code to handle each kind of input. + +

+>>> a_string = 'PapayaWhip is the new black.'
+>>> import io                                  
+>>> a_file = io.StringIO(a_string)             
+>>> a_file.read()                              
+'PapayaWhip is the new black.'
+>>> a_file.read()                              
+''
+>>> a_file.seek(0)                             
+0
+>>> a_file.read(10)                            
+'PapayaWhip'
+>>> a_file.tell()                       
+10
+>>> a_file.seek(18)
+18
+>>> a_file.read()
+'new black.'
+
    +
  1. The io module defines the StringIO class that you can use to treat a string in memory as a file. +
  2. To create a stream object out of a string, create an instance of the io.StringIO() class and pass it the string you want to use as your “file” data. Now you have a stream object, and you can do all sorts of stream-like things with it. +
  3. Calling the read() method “reads” the entire “file,” which in the case of a StringIO object simply returns the original string. +
  4. Just like a real file, calling the read() method again returns an empty string. +
  5. You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the seek() method of the StringIO object. +
  6. You can also read the string in chunks, by passing a size parameter to the read() method. +
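To make the "take a stream object, not a filename" advice concrete, here is a minimal sketch of a library-style function that relies on nothing but the stream's read() method. The name count_words is hypothetical, chosen only for illustration.

```python
import io

def count_words(stream):
    """Count whitespace-separated words in any object with a read() method."""
    # read() with no size argument returns everything left in the stream
    return len(stream.read().split())

# An in-memory string stands in for a real file here
a_file = io.StringIO('PapayaWhip is the new black.')
word_count = count_words(a_file)
print(word_count)   # 5
```

The same function would work unchanged on a file opened with open(), because it never asks where the data comes from.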
+ +
+

io.StringIO lets you treat a string as a text file. There’s also an io.BytesIO class, which lets you treat a byte array as a binary file. +
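As a quick sketch of the binary counterpart, the same read(), seek(), and tell() calls work on an io.BytesIO object. The sample data simply reuses the string from the example above, as bytes.

```python
import io

# Bytes in memory, treated as a binary file
a_file = io.BytesIO(b'PapayaWhip is the new black.')
first_chunk = a_file.read(10)   # read() returns bytes, not str
position = a_file.tell()        # positions are counted in bytes
a_file.seek(18)
rest = a_file.read()
print(first_chunk, position, rest)
```

Here first_chunk is b'PapayaWhip', position is 10, and rest is b'new black.'.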

+ +

Handling Compressed Files

+ +

The Python standard library contains modules that support reading and writing compressed files. There are a number of different compression schemes; the two most popular on non-Windows systems are gzip and bzip2. (You may have also encountered PKZIP archives and GNU Tar archives. Python has modules for those, too.) + +

The gzip module lets you create a stream object for reading or writing a gzip-compressed file. The stream object it gives you supports the read() method (if you opened it for reading) or the write() method (if you opened it for writing). That means you can use the methods you’ve already learned for regular files to directly read or write a gzip-compressed file, without creating a temporary file to store the decompressed data. + +

As an added bonus, it supports the with statement too, so you can let Python automatically close your gzip-compressed file when you’re done with it. + +

+you@localhost:~$ python3
+
+>>> import gzip
+>>> with gzip.open('out.log.gz', mode='wb') as z_file:                                      
+...   z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
+... 
+>>> exit()
+
+you@localhost:~$ ls -l out.log.gz                                                           
+-rw-r--r--  1 mark mark    79 2009-07-19 14:29 out.log.gz
+you@localhost:~$ gunzip out.log.gz                                                          
+you@localhost:~$ cat out.log                                                                
+A nine mile walk is no joke, especially in the rain.
+
    +
  1. You should always open gzipped files in binary mode. (Note the 'b' character in the mode argument.) +
  2. I constructed this example on Linux. If you’re not familiar with the command line, this command is showing the “long listing” of the gzip-compressed file you just created in the Python Shell. This listing shows that the file exists (good), and that it is 79 bytes long. That’s actually larger than the string you started with! The gzip file format includes a fixed-length header that contains some metadata about the file, so it’s inefficient for extremely small files. +
  3. The gunzip command (pronounced “gee-unzip”) decompresses the file and stores the contents in a new file named the same as the compressed file but without the .gz file extension. +
  4. The cat command displays the contents of a file. This file contains the string you originally wrote directly to the compressed file out.log.gz from within the Python Shell. +
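You don't need gunzip to get the data back. As a sketch, the same gzip.open() function reads the compressed file when you open it with mode='rb'. The temporary directory is an assumption made here so the example doesn't touch your real files.

```python
import gzip
import os
import tempfile

text = 'A nine mile walk is no joke, especially in the rain.'

# A temporary directory keeps this sketch away from real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'out.log.gz')
    with gzip.open(path, mode='wb') as z_file:
        z_file.write(text.encode('utf-8'))
    # Reading mirrors writing: binary mode in, then decode the bytes
    with gzip.open(path, mode='rb') as z_file:
        round_trip = z_file.read().decode('utf-8')

print(round_trip == text)   # True
```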
+ +
+

Did you get this error? +

+>>> with gzip.open('out.log.gz', mode='wb') as z_file:
+...         z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
+... 
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+AttributeError: 'GzipFile' object has no attribute '__exit__'
+

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. +

Python 3.0 had a gzip module, but it did not support using a gzipped-file object as a context manager. Python 3.1 added the ability to use gzipped-file objects in a with statement. +

+ +

⁂ + +

Standard Input, Output, and Error

+ + + +

Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. + +

Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX-like system, including Mac OS X and Linux. When you call the print() function, the thing you’re printing is sent to the stdout pipe. When your program crashes and prints out a traceback, it goes to the stderr pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. In the graphical Python Shell, the stdout and stderr pipes default to your “Interactive Window”. + +

+>>> for i in range(3):
+...     print('PapayaWhip')        
+PapayaWhip
+PapayaWhip
+PapayaWhip
+>>> import sys
+>>> for i in range(3):
+...     sys.stdout.write('is the')
+is theis theis the
+>>> for i in range(3):
+...     sys.stderr.write('new black')
+new blacknew blacknew black
+
    +
  1. The print() function, in a loop. Nothing surprising here. +
  2. stdout is defined in the sys module, and it is a stream object. Calling its write() method will print out whatever string you give it. In fact, this is what the print() function really does; it adds a newline to the end of the string you’re printing, and calls sys.stdout.write(). +
  3. In the simplest case, sys.stdout and sys.stderr send their output to the same place: the Python IDE (if you’re in one), or the terminal (if you’re running Python from the command line). Like standard output, standard error does not add newlines for you. If you want newlines, you’ll need to write newline characters ('\n') yourself. +
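To see exactly what print() adds on top of write(), you can point both at an in-memory stream and inspect the result. This is a small sketch; buffer is just an illustrative name.

```python
import io
import sys

buffer = io.StringIO()

# print() accepts any stream with a write() method through its file argument
print('PapayaWhip', file=buffer)
# write() adds nothing, so the line ending has to be written explicitly
buffer.write('is the new black\n')

captured = buffer.getvalue()
sys.stdout.write(captured)
```

The captured value is 'PapayaWhip\nis the new black\n': print() supplied the first line ending, and the second had to be spelled out.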
+ +

sys.stdout and sys.stderr are stream objects, but they are write-only. Attempting to call their read() method will always raise an IOError. + +

+>>> import sys
+>>> sys.stdout.read()
+Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+IOError: not readable
+ +

Redirecting Standard Output

+ +

sys.stdout and sys.stderr are stream objects, albeit ones that only support writing. But they’re not constants; they’re variables. That means you can assign them a new value — any other stream object — to redirect their output. + +

[download stdout.py] +

import sys
+
+class RedirectStdoutTo:
+    def __init__(self, out_new):
+        self.out_new = out_new
+
+    def __enter__(self):
+        self.out_old = sys.stdout
+        sys.stdout = self.out_new
+
+    def __exit__(self, *args):
+        sys.stdout = self.out_old
+
+print('A')
+with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
+    print('B')
+print('C')
+ +

Check this out: + +

+you@localhost:~/diveintopython3/examples$ python3 stdout.py
+A
+C
+you@localhost:~/diveintopython3/examples$ cat out.log
+B
+ +
+

Did you get this error? +

+you@localhost:~/diveintopython3/examples$ python3 stdout.py
+  File "stdout.py", line 15
+    with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
+                                                              ^
+SyntaxError: invalid syntax
+

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1. +

Python 3.0 supported the with statement, but each statement can only use one context manager. Python 3.1 allows you to chain multiple context managers in a single with statement. +

+ +

Let’s take the last part first. + +

print('A')
+with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
+    print('B')
+print('C')
+ +

That’s a complicated with statement. Let me rewrite it as something more recognizable. + +

with open('out.log', mode='w', encoding='utf-8') as a_file:
+    with RedirectStdoutTo(a_file):
+        print('B')
+ +

As the rewrite shows, you actually have two with statements, one nested within the scope of the other. The “outer” with statement should be familiar by now: it opens a UTF-8-encoded text file named out.log for writing and assigns the stream object to a variable named a_file. The “inner” with statement is the odd one. +

with RedirectStdoutTo(a_file):
+ +

Where’s the as clause? The with statement doesn’t actually require one. Just like you can call a function and ignore its return value, you can have a with statement that doesn’t assign the with context to a variable. In this case, you’re only interested in the side effects of the RedirectStdoutTo context. + +

What are those side effects? Take a look inside the RedirectStdoutTo class. This class is a custom context manager. Any class can be a context manager by defining two special methods: __enter__() and __exit__(). + +

class RedirectStdoutTo:
+    def __init__(self, out_new):    
+        self.out_new = out_new
+
+    def __enter__(self):            
+        self.out_old = sys.stdout
+        sys.stdout = self.out_new
+
+    def __exit__(self, *args):      
+        sys.stdout = self.out_old
+
    +
  1. The __init__() method is called immediately after an instance is created. It takes one parameter, the stream object that you want to use as standard output for the life of the context. This method just saves the stream object in an instance variable so other methods can use it later. +
  2. The __enter__() method is a special method; Python calls it when entering a context (i.e. at the beginning of the with statement). This method saves the current value of sys.stdout in self.out_old, then redirects standard output by assigning self.out_new to sys.stdout. +
  3. The __exit__() method is another special method; Python calls it when exiting the context (i.e. at the end of the with statement). This method restores standard output to its original value by assigning the saved self.out_old value to sys.stdout. +
+ +

Putting it all together: + +


+print('A')                                                                             
+with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):  
+    print('B')                                                                         
+print('C')                                                                             
+
    +
  1. This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command line). +
  2. This with statement takes a comma-separated list of contexts. The comma-separated list acts like a series of nested with blocks. The first context listed is the “outer” block; the last one listed is the “inner” block. The first context opens a file; the second context redirects sys.stdout to the stream object that was created in the first context. +
  3. Because this print() function is executed within the context created by the with statement, it will not print to the screen; it will write to the file out.log. +
  4. The with code block is over. Python has told each context manager to do whatever it is they do upon exiting a context. The context managers form a last-in-first-out stack. Upon exiting, the second context changed sys.stdout back to its original value, then the first context closed the file named out.log. Since standard output has been restored to its original value, calling the print() function will once again print to the screen. +
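The last-in-first-out behavior is easy to check for yourself. In this sketch, Tracer is a hypothetical context manager that does nothing except record when it is entered and exited.

```python
events = []

class Tracer:
    """Hypothetical context manager that logs its own entry and exit."""
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        events.append('enter ' + self.name)

    def __exit__(self, *args):
        events.append('exit ' + self.name)

with Tracer('outer'), Tracer('inner'):
    events.append('body')

print(events)
# ['enter outer', 'enter inner', 'body', 'exit inner', 'exit outer']
```

The context listed last is entered last and exited first, which is why the file is still open when standard output is restored.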
+ +

Redirecting standard error works exactly the same way, using sys.stderr instead of sys.stdout. + +
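Here is a sketch of that variation. RedirectStderrTo is a hypothetical class that mirrors RedirectStdoutTo line for line. For comparison, the sketch also uses contextlib.redirect_stderr() from the standard library, which is available in Python 3.5 and later (its sibling contextlib.redirect_stdout() arrived in 3.4).

```python
import contextlib
import io
import sys

class RedirectStderrTo:
    """Same pattern as RedirectStdoutTo, applied to sys.stderr."""
    def __init__(self, err_new):
        self.err_new = err_new

    def __enter__(self):
        self.err_old = sys.stderr
        sys.stderr = self.err_new

    def __exit__(self, *args):
        sys.stderr = self.err_old

buffer = io.StringIO()
with RedirectStderrTo(buffer):
    print('new black', file=sys.stderr)

# The standard library version, for Python 3.5 and later
buffer2 = io.StringIO()
with contextlib.redirect_stderr(buffer2):
    print('new black', file=sys.stderr)

print(buffer.getvalue() == buffer2.getvalue())   # True
```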

⁂ + +

Further Reading

+ + + +

+ +

© 2001–10 Mark Pilgrim + + + diff --git a/generators.html b/generators.html index 1f965fe..1b01e47 100755 --- a/generators.html +++ b/generators.html @@ -1,418 +1,418 @@ - - -Closures & Generators - Dive Into Python 3 - - - - - - -

  
-


Difficulty level: ♦♦♦♢♢ -

Closures & Generators

-
-

My spelling is Wobbly. It’s good spelling but it Wobbles, and the letters get in the wrong places.
— Winnie-the-Pooh -

-

  -

Diving In

-

Having grown up the son of a librarian and an English major, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, “borrows” is the wrong word; “pillages” is more like it. Or perhaps “assimilates” — like the Borg. Yes, I like that. -

We are the Borg. Your linguistic and etymological distinctiveness will be added to our own. Resistance is futile. -

In this chapter, you’re going to learn about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. But first, let’s talk about how to make plural nouns. (If you haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and it quickly descends into more advanced uses.) -

If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably familiar with the basic rules: -

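  1. If a word ends in S, X, or Z, add ES. -
  2. If a word ends in an H that can be heard, add ES; otherwise, just add S. -
  3. If a word ends in a Y that sounds like I, change the Y to IES; if the Y follows a vowel, just add S. So vacancy becomes vacancies, but day becomes days. -
  4. If all else fails, just add S. -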

(I know, there are a lot of exceptions. Man becomes men and woman becomes women, but human becomes humans. Mouse becomes mice and louse becomes lice, but house becomes houses. Knife becomes knives and wife becomes wives, but lowlife becomes lowlifes. And don’t even get me started on words that are their own plural, like sheep, deer, and haiku.) -

Other languages, of course, are completely different. -

Let’s design a Python library that automatically pluralizes English nouns. We’ll start with just these four rules, but keep in mind that you’ll inevitably need to add more. -

⁂ - -

I Know, Let’s Use Regular Expressions!

-

So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions! -

[download plural1.py] -

import re
-
-def plural(noun):          
-    if re.search('[sxz]$', noun):             
-        return re.sub('$', 'es', noun)        
-    elif re.search('[^aeioudgkprt]h$', noun):
-        return re.sub('$', 'es', noun)       
-    elif re.search('[^aeiou]y$', noun):      
-        return re.sub('y$', 'ies', noun)     
-    else:
-        return noun + 's'
-
    -
  1. This is a regular expression, but it uses a syntax you didn’t see in Regular Expressions. The square brackets mean “match exactly one of these characters.” So [sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. Combined, this regular expression tests whether noun ends with s, x, or z. -
  2. This re.sub() function performs regular expression-based string substitutions. -
- -

Let’s look at regular expression substitutions in more detail. -

->>> import re
->>> re.search('[abc]', 'Mark')    
-<_sre.SRE_Match object at 0x001C1FA8>
->>> re.sub('[abc]', 'o', 'Mark')  
-'Mork'
->>> re.sub('[abc]', 'o', 'rock')  
-'rook'
->>> re.sub('[abc]', 'o', 'caps')  
-'oops'
-
    -
  1. Does the string Mark contain a, b, or c? Yes, it contains a. -
  2. OK, now find a, b, or c, and replace it with o. Mark becomes Mork. -
  3. The same function turns rock into rook. -
  4. You might think this would turn caps into oaps, but it doesn’t. re.sub replaces all of the matches, not just the first one. So this regular expression turns caps into oops, because both the c and the a get turned into o. -
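If you really did want only the first match replaced, re.sub() takes an optional count argument. A short sketch:

```python
import re

replace_all = re.sub('[abc]', 'o', 'caps')             # every match
replace_first = re.sub('[abc]', 'o', 'caps', count=1)  # first match only
print(replace_all, replace_first)   # oops oaps
```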
- -

And now, back to the plural() function… - -

def plural(noun):          
-    if re.search('[sxz]$', noun):            
-        return re.sub('$', 'es', noun)         
-    elif re.search('[^aeioudgkprt]h$', noun):  
-        return re.sub('$', 'es', noun)
-    elif re.search('[^aeiou]y$', noun):        
-        return re.sub('y$', 'ies', noun)     
-    else:
-        return noun + 's'
-
    -
  1. Here, you’re replacing the end of the string (matched by $) with the string es. In other words, adding es to the string. You could accomplish the same thing with string concatenation, for example noun + 'es', but I chose to use regular expressions for each rule, for reasons that will become clear later in the chapter. -
  2. Look closely, this is another new variation. The ^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. You’re looking for words that end in H where the H can be heard. -
  3. Same pattern here: match words that end in Y, where the character before the Y is not a, e, i, o, or u. You’re looking for words that end in Y that sounds like I. -
- -

Let’s look at negation regular expressions in more detail. - -

->>> import re
->>> re.search('[^aeiou]y$', 'vacancy')  
-<_sre.SRE_Match object at 0x001C1FA8>
->>> re.search('[^aeiou]y$', 'boy')      
->>> 
->>> re.search('[^aeiou]y$', 'day')
->>> 
->>> re.search('[^aeiou]y$', 'pita')     
->>> 
-
    -
  1. vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u. -
  2. boy does not match, because it ends in oy, and you specifically said that the character before the y could not be o. day does not match, because it ends in ay. -
  3. pita does not match, because it does not end in y. -
-
->>> re.sub('y$', 'ies', 'vacancy')               
-'vacancies'
->>> re.sub('y$', 'ies', 'agency')
-'agencies'
->>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy')  
-'vacancies'
-
    -
  1. This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub. -
  2. Just in passing, I want to point out that it is possible to combine these two regular expressions (one to find out if the rule applies, and another to actually apply it) into a single regular expression. Here’s what that would look like. Most of it should look familiar: you’re using a remembered group, which you learned in Case study: Parsing Phone Numbers. The group is used to remember the character before the letter y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered? put it right here.” In this case, you remember the c before the y; when you do the substitution, you substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.) -
-

Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn’t directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. If you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn’t get much more direct than that. - -
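Before moving on, it's worth running the function against a few nouns. This sketch repeats the plural() function from plural1.py so it can run on its own; the sample words are chosen here for illustration.

```python
import re

def plural(noun):
    if re.search('[sxz]$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeioudgkprt]h$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'

# One or two sample nouns per rule
results = {noun: plural(noun) for noun in
           ('fax', 'coach', 'cheetah', 'vacancy', 'day', 'rock')}
print(results)
```

This gives faxes, coaches, cheetahs, vacancies, days, and rocks. It also turns man into mans, which is exactly the kind of exception these four rules don't cover.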

⁂ - -

A List Of Functions

- -

Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part. - -

[download plural2.py] -

import re
-
-def match_sxz(noun):
-    return re.search('[sxz]$', noun)
-
-def apply_sxz(noun):
-    return re.sub('$', 'es', noun)
-
-def match_h(noun):
-    return re.search('[^aeioudgkprt]h$', noun)
-
-def apply_h(noun):
-    return re.sub('$', 'es', noun)
-
-def match_y(noun):                             
-    return re.search('[^aeiou]y$', noun)
-        
-def apply_y(noun):                             
-    return re.sub('y$', 'ies', noun)
-
-def match_default(noun):
-    return True
-
-def apply_default(noun):
-    return noun + 's'
-
-rules = ((match_sxz, apply_sxz),               
-         (match_h, apply_h),
-         (match_y, apply_y),
-         (match_default, apply_default)
-         )
-
-def plural(noun):           
-    for matches_rule, apply_rule in rules:       
-        if matches_rule(noun):
-            return apply_rule(noun)
-
    -
  1. Now, each match rule is its own function which returns the results of calling the re.search() function. -
  2. Each apply rule is also its own function which calls the re.sub() function to apply the appropriate pluralization rule. -
  3. Instead of having one function (plural()) with multiple rules, you have the rules data structure, which is a sequence of pairs of functions. -
  4. Since the rules have been broken out into a separate data structure, the new plural() function can be reduced to a few lines of code. Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules structure. On the first iteration of the for loop, matches_rule will get match_sxz, and apply_rule will get apply_sxz. On the second iteration (assuming you get that far), matches_rule will be assigned match_h, and apply_rule will be assigned apply_h. The function is guaranteed to return something eventually, because the final match rule (match_default) simply returns True, meaning the corresponding apply rule (apply_default) will always be applied. -
- - -

The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions — not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling match_sxz(noun), and if it returns a match, calling apply_sxz(noun). -

If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following: - -


-def plural(noun):
-    if match_sxz(noun):
-        return apply_sxz(noun)
-    if match_h(noun):
-        return apply_h(noun)
-    if match_y(noun):
-        return apply_y(noun)
-    if match_default(noun):
-        return apply_default(noun)
- -

The benefit here is that the plural() function is now simplified. It takes a sequence of rules, defined elsewhere, and iterates through them in a generic fashion. - -

    -
  1. Get a match rule -
  2. Does it match? Then call the apply rule and return the result. -
  3. No match? Go to step 1. -
- -

The rules could be defined anywhere, in any way. The plural() function doesn’t care. - -

Now, was adding this level of abstraction worth it? Well, not yet. Let’s consider what it would take to add a new rule to the function. In the first example, it would require adding an if statement to the plural() function. In this second example, it would require adding two functions, match_foo() and apply_foo(), and then updating the rules sequence to specify where in the order the new match and apply functions should be called relative to the other rules. - -

But this is really just a stepping stone to the next section. Let’s move on… - -

⁂ - -

A List Of Patterns

- -

Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the rules sequence and call them through there. Furthermore, each function follows one of two patterns. All the match functions call re.search(), and all the apply functions call re.sub(). Let’s factor out the patterns so that defining new rules can be easier. - -

[download plural3.py] -

import re
-
-def build_match_and_apply_functions(pattern, search, replace):
-    def matches_rule(word):                                     
-        return re.search(pattern, word)
-    def apply_rule(word):                                       
-        return re.sub(search, replace, word)
-    return (matches_rule, apply_rule)                           
-
    -
  1. build_match_and_apply_functions() is a function that builds other functions dynamically. It takes pattern, search and replace, then defines a matches_rule() function which calls re.search() with the pattern that was passed to the build_match_and_apply_functions() function, and the word that was passed to the matches_rule() function you’re building. Whoa. -
  2. Building the apply function works the same way. The apply function is a function that takes one parameter, and calls re.sub() with the search and replace parameters that were passed to the build_match_and_apply_functions() function, and the word that was passed to the apply_rule() function you’re building. This technique of using the values of outside parameters within a dynamic function is called a closure. You’re essentially defining constants within the apply function you’re building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function. -
  3. Finally, the build_match_and_apply_functions() function returns a tuple of two values: the two functions you just created. The constants you defined within those functions (pattern within the matches_rule() function, and search and replace within the apply_rule() function) stay with those functions, even after you return from build_match_and_apply_functions(). That’s insanely cool. -
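A small experiment shows that each pair of functions really does carry its own values around. This sketch repeats build_match_and_apply_functions() so it runs on its own. The variable names for the two pairs are chosen for illustration, and the last line peeks at the __closure__ attribute, which is where CPython stores the remembered values.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

# Two independent pairs, each remembering its own arguments
match_y, apply_y = build_match_and_apply_functions('[^aeiou]y$', 'y$', 'ies')
match_sxz, apply_sxz = build_match_and_apply_functions('[sxz]$', '$', 'es')

print(apply_y('vacancy'))      # vacancies
print(apply_sxz('fax'))        # faxes
print(bool(match_y('fax')))    # False

# The remembered pattern lives on with the function object
print(match_y.__closure__[0].cell_contents)   # [^aeiou]y$
```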
- -

If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it. - -

patterns = \                                                        
-  (
-    ('[sxz]$',           '$',  'es'),
-    ('[^aeioudgkprt]h$', '$',  'es'),
-    ('(qu|[^aeiou])y$',  'y$', 'ies'),
-    ('$',                '$',  's')                                 
-  )
-rules = [build_match_and_apply_functions(pattern, search, replace)  
-         for (pattern, search, replace) in patterns]
-
    -
  1. Our pluralization “rules” are now defined as a tuple of tuples of strings (not functions). The first string in each group is the regular expression pattern that you would use in re.search() to see if this rule matches. The second and third strings in each group are the search and replace expressions you would use in re.sub() to actually apply the rule to turn a noun into its plural. -
  2. There’s a slight change here, in the fallback rule. In the previous example, the match_default() function simply returned True, meaning that if none of the more specific rules matched, the code would simply add an s to the end of the given word. This example does something functionally equivalent. The final regular expression asks whether the word has an end ($ matches the end of a string). Of course, every string has an end, even an empty string, so this expression always matches. Thus, it serves the same purpose as the match_default() function that always returned True: it ensures that if no more specific rule matches, the code adds an s to the end of the given word. -
  3. This line is magic. It takes the sequence of strings in patterns and turns them into a sequence of functions. How? By “mapping” the strings to the build_match_and_apply_functions() function. That is, it takes each triplet of strings and calls the build_match_and_apply_functions() function with those three strings as arguments. The build_match_and_apply_functions() function returns a tuple of two functions. This means that rules ends up being functionally equivalent to the previous example: a list of tuples, where each tuple is a pair of functions. The first function is the match function that calls re.search(), and the second function is the apply function that calls re.sub(). -
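If the list comprehension feels too magical, it may help to see it unrolled. The following sketch builds the same rules list with an ordinary for loop, then tries out one of the resulting pairs. The test word is chosen here for illustration.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

patterns = (
    ('[sxz]$',           '$',  'es'),
    ('[^aeioudgkprt]h$', '$',  'es'),
    ('(qu|[^aeiou])y$',  'y$', 'ies'),
    ('$',                '$',  's'),
)

# The list comprehension, unrolled into an ordinary loop
rules = []
for pattern, search, replace in patterns:
    rules.append(build_match_and_apply_functions(pattern, search, replace))

# Each entry is a (match function, apply function) pair
matches_rule, apply_rule = rules[2]
if matches_rule('soliloquy'):
    print(apply_rule('soliloquy'))   # soliloquies
```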
- -

Rounding out this version of the script is the main entry point, the plural() function. - -

def plural(noun):
-    for matches_rule, apply_rule in rules:  
-        if matches_rule(noun):
-            return apply_rule(noun)
-
    -
  1. Since the rules list is the same as the previous example (really, it is), it should come as no surprise that the plural() function hasn’t changed at all. It’s completely generic; it takes a list of rule functions and calls them in order. It doesn’t care how the rules are defined. In the previous example, they were defined as separate named functions. Now they are built dynamically by mapping the output of the build_match_and_apply_functions() function onto a list of raw strings. It doesn’t matter; the plural() function still works the same way. -
- -

⁂ - -

A File Of Patterns

- -

You’ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them. - -

First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it plural4-rules.txt. - -

[download plural4-rules.txt] -

[sxz]$               $    es
-[^aeioudgkprt]h$     $    es
-[^aeiou]y$          y$    ies
-$                    $    s
- -

Now let’s see how you can use this rules file. - -

[download plural4.py] -

import re
-
-def build_match_and_apply_functions(pattern, search, replace):  
-    def matches_rule(word):
-        return re.search(pattern, word)
-    def apply_rule(word):
-        return re.sub(search, replace, word)
-    return (matches_rule, apply_rule)
-
-rules = []
-with open('plural4-rules.txt', encoding='utf-8') as pattern_file:  
-    for line in pattern_file:                                      
-        pattern, search, replace = line.split(None, 3)             
-        rules.append(build_match_and_apply_functions(              
-                pattern, search, replace))
-
    -
  1. The build_match_and_apply_functions() function has not changed. You’re still using closures to build two functions dynamically that use variables defined in the outer function. -
  2. The global open() function opens a file and returns a file object. In this case, the file we’re opening contains the pattern strings for pluralizing nouns. The with statement creates what’s called a context: when the with block ends, Python will automatically close the file, even if an exception is raised inside the with block. You’ll learn more about with blocks and file objects in the Files chapter. -
  3. The for line in <fileobject> idiom reads data from the open file, one line at a time, and assigns the text to the line variable. You’ll learn more about reading from files in the Files chapter. -
  4. Each line in the file really has three values, but they’re separated by whitespace (tabs or spaces, it makes no difference). To split it out, use the split() string method. The first argument to the split() method is None, which means “split on any run of whitespace.” The second argument is 3, which means “split on whitespace at most 3 times, then leave the rest of the line alone.” A line like [sxz]$ $ es will be broken up into the list ['[sxz]$', '$', 'es'], which means that pattern will get '[sxz]$', search will get '$', and replace will get 'es'. That’s a lot of power in one little line of code. -
  5. Finally, you pass pattern, search, and replace to the build_match_and_apply_functions() function, which returns a tuple of functions. You append this tuple to the rules list, and rules ends up storing the list of match and apply functions that the plural() function expects. -
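Because the loop only needs something it can iterate over line by line, you can try the parsing logic without a real file. In this sketch, load_rules is a hypothetical helper, and skipping blank lines is an assumption added here; the plural4.py code above does not do that.

```python
import io
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

def load_rules(pattern_file):
    """Build rule functions from any stream of whitespace-delimited lines."""
    rules = []
    for line in pattern_file:
        fields = line.split()
        if not fields:
            continue   # assumption: blank lines are allowed and ignored
        pattern, search, replace = fields
        rules.append(build_match_and_apply_functions(pattern, search, replace))
    return rules

# An in-memory stream stands in for plural4-rules.txt
sample = io.StringIO('[sxz]$    $    es\n\n[^aeiou]y$    y$    ies\n$    $    s\n')
rules = load_rules(sample)
print(len(rules))   # 3

# What split(None, 3) does to a single line of the rules file
fields = '[sxz]$               $    es\n'.split(None, 3)
print(fields)       # ['[sxz]$', '$', 'es']
```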
- -

The improvement here is that you’ve completely separated the pluralization rules into an external file, so it can be maintained separately from the code that uses it. Code is code, data is data, and life is good. - -

⁂ - -

Generators

- -

Wouldn’t it be grand to have a generic plural() function that parses the rules file? Get rules, check for a match, apply appropriate transformation, go to next rule. That’s all the plural() function has to do, and that’s all the plural() function should do. - -

[download plural5.py] -

def rules(rules_filename):
-    with open(rules_filename, encoding='utf-8') as pattern_file:
-        for line in pattern_file:
-            pattern, search, replace = line.split(None, 3)
-            yield build_match_and_apply_functions(pattern, search, replace)
-
-def plural(noun, rules_filename='plural5-rules.txt'):
-    for matches_rule, apply_rule in rules(rules_filename):
-        if matches_rule(noun):
-            return apply_rule(noun)
-    raise ValueError('no matching rule for {0}'.format(noun))
- -

How the heck does that work? Let’s look at an interactive example first. - -

->>> def make_counter(x):
-...     print('entering make_counter')
-...     while True:
-...         yield x                    
-...         print('incrementing x')
-...         x = x + 1
-... 
->>> counter = make_counter(2)          
->>> counter                            
-<generator object at 0x001C9C10>
->>> next(counter)                      
-entering make_counter
-2
->>> next(counter)                      
-incrementing x
-3
->>> next(counter)                      
-incrementing x
-4
-
    -
  1. The presence of the yield keyword in make_counter means that this is not a normal function. It is a special kind of function which generates values one at a time. You can think of it as a resumable function. Calling it will return a generator that can be used to generate successive values of x. -
  2. To create an instance of the make_counter generator, just call it like any other function. Note that this does not actually execute the function code. You can tell this because the first line of the make_counter() function calls print(), but nothing has been printed yet. -
  3. The make_counter() function returns a generator object. -
  4. The next() function takes a generator object and returns its next value. The first time you call next() with the counter generator, it executes the code in make_counter() up to the first yield statement, then returns the value that was yielded. In this case, that will be 2, because you originally created the generator by calling make_counter(2). -
  5. Repeatedly calling next() with the same generator object resumes exactly where it left off and continues until it hits the next yield statement. All variables, local state, &c. are saved on yield and restored on next(). The next line of code waiting to be executed calls print(), which prints incrementing x. After that, it executes the statement x = x + 1. Then it loops through the while loop again, and the first thing it hits is the statement yield x, which saves the state of everything and returns the current value of x (now 3). -
  6. The next time you call next(counter), you do all the same things again, but this time x is now 4. -
- -

Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values. But let’s look at more productive uses of generators instead. - -

A Fibonacci Generator

- - - -

[download fibonacci.py] -

def fib(max):
-    a, b = 0, 1          
-    while a < max:
-        yield a          
-        a, b = b, a + b  
-
    -
  1. The Fibonacci sequence is a sequence of numbers where each number is the sum of the two numbers before it. It starts with 0 and 1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1. -
  2. a is the current number in the sequence, so yield it. -
  3. b is the next number in the sequence, so assign that to a, but also calculate the next value (a + b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a + b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b). -
- -

So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops. - -

->>> from fibonacci import fib
->>> for n in fib(1000):      
-...     print(n, end=' ')    
-0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
->>> list(fib(1000))          
-[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
-
    -
  1. You can use a generator like fib() in a for loop directly. The for loop will automatically call the next() function to get values from the fib() generator and assign them to the for loop index variable (n). -
  2. Each time through the for loop, n gets a new value from the yield statement in fib(), and all you have to do is print it out. Once fib() runs out of numbers (a is no longer less than max, which in this case is 1000), then the for loop exits gracefully. -
  3. This is a useful idiom: pass a generator to the list() function, and it will iterate through the entire generator (just like the for loop in the previous example) and return a list of all the values. -
- -

A Plural Rule Generator

- -

Let’s go back to plural5.py and see how this version of the plural() function works. - -

def rules(rules_filename):
-    with open(rules_filename, encoding='utf-8') as pattern_file:
-        for line in pattern_file:
-            pattern, search, replace = line.split(None, 3)                   
-            yield build_match_and_apply_functions(pattern, search, replace)  
-
-def plural(noun, rules_filename='plural5-rules.txt'):
-    for matches_rule, apply_rule in rules(rules_filename):                   
-        if matches_rule(noun):
-            return apply_rule(noun)
-    raise ValueError('no matching rule for {0}'.format(noun))
-
    -
  1. No magic here. Remember that the lines of the rules file have three values separated by whitespace, so you use line.split(None, 3) to get the three “columns” and assign them to three local variables. -
  2. And then you yield. What do you yield? Two functions, built dynamically with your old friend, build_match_and_apply_functions(), which is identical to the previous examples. In other words, rules() is a generator that spits out match and apply functions on demand. -
  3. Since rules() is a generator, you can use it directly in a for loop. The first time through the for loop, you will call the rules() function, which will open the pattern file, read the first line, dynamically build a match function and an apply function from the patterns on that line, and yield the dynamically built functions. The second time through the for loop, you will pick up exactly where you left off in rules() (which was in the middle of the for line in pattern_file loop). The first thing it will do is read the next line of the file (which is still open), dynamically build another match and apply function based on the patterns on that line in the file, and yield the two functions. -
- -

What have you gained over stage 4? Startup time. In stage 4, when you imported the plural4 module, it read the entire patterns file and built a list of all the possible rules, before you could even think about calling the plural() function. With generators, you can do everything lazily: you read the first rule and create functions and try them, and if that works you don’t ever read the rest of the file or create any other functions. - -

What have you lost? Performance! Every time you call the plural() function, the rules() generator starts over from the beginning — which means re-opening the patterns file and reading from the beginning, one line at a time. - -

What if you could have the best of both worlds: minimal startup cost (don’t execute any code on import), and maximum performance (don’t build the same functions over and over again). Oh, and you still want to keep the rules in a separate file (because code is code and data is data), just as long as you never have to read the same line twice. - -

To do that, you’ll need to build your own iterator. But before you do that, you need to learn about Python classes. - -

⁂ - -

Further Reading

- - -

- -

© 2001–10 Mark Pilgrim - - - + + +Closures & Generators - Dive Into Python 3 + + + + + + +

  
+

You are here: Home Dive Into Python 3 +

Difficulty level: ♦♦♦♢♢ +

Closures & Generators

+
+

My spelling is Wobbly. It’s good spelling but it Wobbles, and the letters get in the wrong places.
— Winnie-the-Pooh +

+

  +

Diving In

+

Having grown up the son of a librarian and an English major, I have always been fascinated by languages. Not programming languages. Well yes, programming languages, but also natural languages. Take English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to name a few). Actually, “borrows” is the wrong word; “pillages” is more like it. Or perhaps “assimilates” — like the Borg. Yes, I like that. +

We are the Borg. Your linguistic and etymological distinctiveness will be added to our own. Resistance is futile. +

In this chapter, you’re going to learn about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. But first, let’s talk about how to make plural nouns. (If you haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and it quickly descends into more advanced uses.) +

If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably familiar with the basic rules: +

+

(I know, there are a lot of exceptions. Man becomes men and woman becomes women, but human becomes humans. Mouse becomes mice and louse becomes lice, but house becomes houses. Knife becomes knives and wife becomes wives, but lowlife becomes lowlifes. And don’t even get me started on words that are their own plural, like sheep, deer, and haiku.) +

Other languages, of course, are completely different. +

Let’s design a Python library that automatically pluralizes English nouns. We’ll start with just these four rules, but keep in mind that you’ll inevitably need to add more. +

⁂ + +

I Know, Let’s Use Regular Expressions!

+

So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions! +

[download plural1.py] +

import re
+
+def plural(noun):          
+    if re.search('[sxz]$', noun):             
+        return re.sub('$', 'es', noun)        
+    elif re.search('[^aeioudgkprt]h$', noun):
+        return re.sub('$', 'es', noun)       
+    elif re.search('[^aeiou]y$', noun):      
+        return re.sub('y$', 'ies', noun)     
+    else:
+        return noun + 's'
+
    +
  1. This is a regular expression, but it uses a syntax you didn’t see in Regular Expressions. The square brackets mean “match exactly one of these characters.” So [sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. Combined, this regular expression tests whether noun ends with s, x, or z. +
  2. This re.sub() function performs regular expression-based string substitutions. +
+ +

Let’s look at regular expression substitutions in more detail. +

+>>> import re
+>>> re.search('[abc]', 'Mark')    
+<_sre.SRE_Match object at 0x001C1FA8>
+>>> re.sub('[abc]', 'o', 'Mark')  
+'Mork'
+>>> re.sub('[abc]', 'o', 'rock')  
+'rook'
+>>> re.sub('[abc]', 'o', 'caps')  
+'oops'
+
    +
  1. Does the string Mark contain a, b, or c? Yes, it contains a. +
  2. OK, now find a, b, or c, and replace it with o. Mark becomes Mork. +
  3. The same function turns rock into rook. +
  4. You might think this would turn caps into oaps, but it doesn’t. re.sub replaces all of the matches, not just the first one. So this regular expression turns caps into oops, because both the c and the a get turned into o. +
+ +
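If you want to check that last point for yourself, here is a minimal sketch using only the standard re module. The count argument is a real parameter of re.sub(), but it is not used anywhere else in this chapter; it appears here only for contrast.

```python
import re

# By default, re.sub() replaces every non-overlapping match.
all_replaced = re.sub('[abc]', 'o', 'caps')

# Passing count=1 limits the substitution to the first match only.
first_only = re.sub('[abc]', 'o', 'caps', count=1)

print(all_replaced)  # oops
print(first_only)    # oaps
```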

And now, back to the plural() function… + +

def plural(noun):          
+    if re.search('[sxz]$', noun):            
+        return re.sub('$', 'es', noun)         
+    elif re.search('[^aeioudgkprt]h$', noun):  
+        return re.sub('$', 'es', noun)
+    elif re.search('[^aeiou]y$', noun):        
+        return re.sub('y$', 'ies', noun)     
+    else:
+        return noun + 's'
+
    +
  1. Here, you’re replacing the end of the string (matched by $) with the string es. In other words, adding es to the string. You could accomplish the same thing with string concatenation, for example noun + 'es', but I chose to use regular expressions for each rule, for reasons that will become clear later in the chapter. +
  2. Look closely, this is another new variation. The ^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. You’re looking for words that end in H where the H can be heard. +
  3. Same pattern here: match words that end in Y, where the character before the Y is not a, e, i, o, or u. You’re looking for words that end in Y that sounds like I. +
+ +

Let’s look at negation regular expressions in more detail. + +

+>>> import re
+>>> re.search('[^aeiou]y$', 'vacancy')  
+<_sre.SRE_Match object at 0x001C1FA8>
+>>> re.search('[^aeiou]y$', 'boy')      
+>>> 
+>>> re.search('[^aeiou]y$', 'day')
+>>> 
+>>> re.search('[^aeiou]y$', 'pita')     
+>>> 
+
    +
  1. vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u. +
  2. boy does not match, because it ends in oy, and you specifically said that the character before the y could not be o. day does not match, because it ends in ay. +
  3. pita does not match, because it does not end in y. +
+
+>>> re.sub('y$', 'ies', 'vacancy')               
+'vacancies'
+>>> re.sub('y$', 'ies', 'agency')
+'agencies'
+>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy')  
+'vacancies'
+
    +
  1. This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub. +
  2. Just in passing, I want to point out that it is possible to combine these two regular expressions (one to find out if the rule applies, and another to actually apply it) into a single regular expression. Here’s what that would look like. Most of it should look familiar: you’re using a remembered group, which you learned in Case study: Parsing Phone Numbers. The group is used to remember the character before the letter y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered? put it right here.” In this case, you remember the c before the y; when you do the substitution, you substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.) +
+

Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn’t directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. If you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn’t get much more direct than that. + +
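For comparison, here is a rough sketch of the combined form wrapped in a function. The name plural_y() is hypothetical and is not part of plural1.py; the point is only that a single re.sub() call can both test and apply the rule, because a string that does not match comes back unchanged.

```python
import re

def plural_y(noun):
    # Hypothetical helper: one call both checks and applies the Y rule.
    # \1 puts back whatever character the group ([^aeiou]) remembered.
    return re.sub('([^aeiou])y$', r'\1ies', noun)

print(plural_y('vacancy'))  # vacancies
print(plural_y('agency'))   # agencies
print(plural_y('boy'))      # boy (no match, so nothing is replaced)
```

Notice that the caller can no longer tell whether the rule applied without comparing the result to the input, which is one more reason to keep the match step and the apply step separate.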

⁂ + +

A List Of Functions

+ +

Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part. + +

[download plural2.py] +

import re
+
+def match_sxz(noun):
+    return re.search('[sxz]$', noun)
+
+def apply_sxz(noun):
+    return re.sub('$', 'es', noun)
+
+def match_h(noun):
+    return re.search('[^aeioudgkprt]h$', noun)
+
+def apply_h(noun):
+    return re.sub('$', 'es', noun)
+
+def match_y(noun):                             
+    return re.search('[^aeiou]y$', noun)
+        
+def apply_y(noun):                             
+    return re.sub('y$', 'ies', noun)
+
+def match_default(noun):
+    return True
+
+def apply_default(noun):
+    return noun + 's'
+
+rules = ((match_sxz, apply_sxz),               
+         (match_h, apply_h),
+         (match_y, apply_y),
+         (match_default, apply_default)
+         )
+
+def plural(noun):           
+    for matches_rule, apply_rule in rules:       
+        if matches_rule(noun):
+            return apply_rule(noun)
+
    +
  1. Now, each match rule is its own function which returns the results of calling the re.search() function. +
  2. Each apply rule is also its own function which calls the re.sub() function to apply the appropriate pluralization rule. +
  3. Instead of having one function (plural()) with multiple rules, you have the rules data structure, which is a sequence of pairs of functions. +
  4. Since the rules have been broken out into a separate data structure, the new plural() function can be reduced to a few lines of code. Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules structure. On the first iteration of the for loop, matches_rule will get match_sxz, and apply_rule will get apply_sxz. On the second iteration (assuming you get that far), matches_rule will be assigned match_h, and apply_rule will be assigned apply_h. The function is guaranteed to return something eventually, because the final match rule (match_default) simply returns True, meaning the corresponding apply rule (apply_default) will always be applied. +
+ + +

The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions — not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling match_sxz(noun), and if it returns a match, calling apply_sxz(noun). + +

If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following: + +


+def plural(noun):
+    if match_sxz(noun):
+        return apply_sxz(noun)
+    if match_h(noun):
+        return apply_h(noun)
+    if match_y(noun):
+        return apply_y(noun)
+    if match_default(noun):
+        return apply_default(noun)
+ +

The benefit here is that the plural() function is now simplified. It takes a sequence of rules, defined elsewhere, and iterates through them in a generic fashion. + +

    +
  1. Get a match rule +
  2. Does it match? Then call the apply rule and return the result. +
  3. No match? Go to step 1. +
+ +

The rules could be defined anywhere, in any way. The plural() function doesn’t care. + +

Now, was adding this level of abstraction worth it? Well, not yet. Let’s consider what it would take to add a new rule to the function. In the first example, it would require adding an if statement to the plural() function. In this second example, it would require adding two functions, match_foo() and apply_foo(), and then updating the rules sequence to specify where in the order the new match and apply functions should be called relative to the other rules. + +

But this is really just a stepping stone to the next section. Let’s move on… + +

⁂ + +

A List Of Patterns

+ +

Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the rules sequence and call them through there. Furthermore, each function follows one of two patterns. All the match functions call re.search(), and all the apply functions call re.sub(). Let’s factor out the patterns so that defining new rules can be easier. + +

[download plural3.py] +

import re
+
+def build_match_and_apply_functions(pattern, search, replace):
+    def matches_rule(word):                                     
+        return re.search(pattern, word)
+    def apply_rule(word):                                       
+        return re.sub(search, replace, word)
+    return (matches_rule, apply_rule)                           
+
    +
  1. build_match_and_apply_functions() is a function that builds other functions dynamically. It takes pattern, search and replace, then defines a matches_rule() function which calls re.search() with the pattern that was passed to the build_match_and_apply_functions() function, and the word that was passed to the matches_rule() function you’re building. Whoa. +
  2. Building the apply function works the same way. The apply function is a function that takes one parameter, and calls re.sub() with the search and replace parameters that were passed to the build_match_and_apply_functions() function, and the word that was passed to the apply_rule() function you’re building. This technique of using the values of outside parameters within a dynamic function is called closures. You’re essentially defining constants within the apply function you’re building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function. +
  3. Finally, the build_match_and_apply_functions() function returns a tuple of two values: the two functions you just created. The constants you defined within those functions (pattern within the matches_rule() function, and search and replace within the apply_rule() function) stay with those functions, even after you return from build_match_and_apply_functions(). That’s insanely cool. +
+ +

If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it. + +

patterns = \                                                        
+  (
+    ('[sxz]$',           '$',  'es'),
+    ('[^aeioudgkprt]h$', '$',  'es'),
+    ('(qu|[^aeiou])y$',  'y$', 'ies'),
+    ('$',                '$',  's')                                 
+  )
+rules = [build_match_and_apply_functions(pattern, search, replace)  
+         for (pattern, search, replace) in patterns]
+
    +
  1. Our pluralization “rules” are now defined as a tuple of tuples of strings (not functions). The first string in each group is the regular expression pattern that you would use in re.search() to see if this rule matches. The second and third strings in each group are the search and replace expressions you would use in re.sub() to actually apply the rule to turn a noun into its plural. +
  2. There’s a slight change here, in the fallback rule. In the previous example, the match_default() function simply returned True, meaning that if none of the more specific rules matched, the code would simply add an s to the end of the given word. This example does something functionally equivalent. The final regular expression asks whether the word has an end ($ matches the end of a string). Of course, every string has an end, even an empty string, so this expression always matches. Thus, it serves the same purpose as the match_default() function that always returned True: it ensures that if no more specific rule matches, the code adds an s to the end of the given word. +
  3. This line is magic. It takes the sequence of strings in patterns and turns them into a sequence of functions. How? By “mapping” the strings to the build_match_and_apply_functions() function. That is, it takes each triplet of strings and calls the build_match_and_apply_functions() function with those three strings as arguments. The build_match_and_apply_functions() function returns a tuple of two functions. This means that rules ends up being functionally equivalent to the previous example: a list of tuples, where each tuple is a pair of functions. The first function is the match function that calls re.search(), and the second function is the apply function that calls re.sub(). +
+ +

Rounding out this version of the script is the main entry point, the plural() function. + +

def plural(noun):
+    for matches_rule, apply_rule in rules:  
+        if matches_rule(noun):
+            return apply_rule(noun)
+
    +
  1. Since the rules list is the same as the previous example (really, it is), it should come as no surprise that the plural() function hasn’t changed at all. It’s completely generic; it takes a list of rule functions and calls them in order. It doesn’t care how the rules are defined. In the previous example, they were defined as separate named functions. Now they are built dynamically by calling the build_match_and_apply_functions() function on each triplet in a list of pattern strings. It doesn’t matter; the plural() function still works the same way. +
+ +
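Here are the pieces of this section assembled into one runnable sketch, so you can poke at the closures yourself. The code is the same as the listings above; the only additions are the sample nouns and a peek at the __closure__ attribute, which is a standard attribute of Python function objects but is not something the chapter itself relies on.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

patterns = (
    ('[sxz]$',           '$',  'es'),
    ('[^aeioudgkprt]h$', '$',  'es'),
    ('(qu|[^aeiou])y$',  'y$', 'ies'),
    ('$',                '$',  's'),
)
rules = [build_match_and_apply_functions(pattern, search, replace)
         for (pattern, search, replace) in patterns]

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

# The first match function still remembers the pattern it was built
# with, long after build_match_and_apply_functions() has returned.
first_match, first_apply = rules[0]
remembered = first_match.__closure__[0].cell_contents
print(remembered)  # [sxz]$

for noun in ('fax', 'coach', 'vacancy', 'day'):
    print(noun, '->', plural(noun))
```

The sample nouns were picked so that each one falls through to a different rule.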

⁂ + +

A File Of Patterns

+ +

You’ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them. + +

First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it plural4-rules.txt. + +

[download plural4-rules.txt] +

[sxz]$               $    es
+[^aeioudgkprt]h$     $    es
+[^aeiou]y$          y$    ies
+$                    $    s
+ +

Now let’s see how you can use this rules file. + +

[download plural4.py] +

import re
+
+def build_match_and_apply_functions(pattern, search, replace):  
+    def matches_rule(word):
+        return re.search(pattern, word)
+    def apply_rule(word):
+        return re.sub(search, replace, word)
+    return (matches_rule, apply_rule)
+
+rules = []
+with open('plural4-rules.txt', encoding='utf-8') as pattern_file:  
+    for line in pattern_file:                                      
+        pattern, search, replace = line.split(None, 3)             
+        rules.append(build_match_and_apply_functions(              
+                pattern, search, replace))
+
    +
  1. The build_match_and_apply_functions() function has not changed. You’re still using closures to build two functions dynamically that use variables defined in the outer function. +
  2. The global open() function opens a file and returns a file object. In this case, the file we’re opening contains the pattern strings for pluralizing nouns. The with statement creates what’s called a context: when the with block ends, Python will automatically close the file, even if an exception is raised inside the with block. You’ll learn more about with blocks and file objects in the Files chapter. +
  3. The for line in <fileobject> idiom reads data from the open file, one line at a time, and assigns the text to the line variable. You’ll learn more about reading from files in the Files chapter. +
  4. Each line in the file really has three values, but they’re separated by whitespace (tabs or spaces, it makes no difference). To split it out, use the split() string method. The first argument to the split() method is None, which means “split on any whitespace (tabs or spaces, it makes no difference).” The second argument is 3, which means “split on whitespace 3 times, then leave the rest of the line alone.” A line like [sxz]$ $ es will be broken up into the list ['[sxz]$', '$', 'es'], which means that pattern will get '[sxz]$', search will get '$', and replace will get 'es'. That’s a lot of power in one little line of code. +
  5. Finally, you pass pattern, search, and replace to the build_match_and_apply_functions() function, which returns a tuple of functions. You append this tuple to the rules list, and rules ends up storing the list of match and apply functions that the plural() function expects. +
+ +
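To see what split(None, 3) does to a single line of the rules file, here is a small sketch. The first string mimics a line from plural4-rules.txt, trailing newline included; the second, with a fourth column, is a made-up line that does not appear in the real file.

```python
line = '[sxz]$               $    es\n'

# sep=None splits on runs of any whitespace. The trailing newline is
# whitespace too, so it does not end up in the last field.
fields = line.split(None, 3)
print(fields)  # ['[sxz]$', '$', 'es']

# maxsplit=3 allows up to four pieces. A hypothetical line with a
# fourth column would therefore not fit into three variables.
extra = '[sxz]$  $  es  comment\n'.split(None, 3)
print(len(extra))  # 4
```

In other words, the three-name assignment works because every line in the rules file has exactly three columns, not because split() caps the result at three items.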

The improvement here is that you’ve completely separated the pluralization rules into an external file, so it can be maintained separately from the code that uses it. Code is code, data is data, and life is good. + +

⁂ + +

Generators

+ +

Wouldn’t it be grand to have a generic plural() function that parses the rules file? Get rules, check for a match, apply appropriate transformation, go to next rule. That’s all the plural() function has to do, and that’s all the plural() function should do. + +

[download plural5.py] +

def rules(rules_filename):
+    with open(rules_filename, encoding='utf-8') as pattern_file:
+        for line in pattern_file:
+            pattern, search, replace = line.split(None, 3)
+            yield build_match_and_apply_functions(pattern, search, replace)
+
+def plural(noun, rules_filename='plural5-rules.txt'):
+    for matches_rule, apply_rule in rules(rules_filename):
+        if matches_rule(noun):
+            return apply_rule(noun)
+    raise ValueError('no matching rule for {0}'.format(noun))
+ +

How the heck does that work? Let’s look at an interactive example first. + +

+>>> def make_counter(x):
+...     print('entering make_counter')
+...     while True:
+...         yield x                    
+...         print('incrementing x')
+...         x = x + 1
+... 
+>>> counter = make_counter(2)          
+>>> counter                            
+<generator object at 0x001C9C10>
+>>> next(counter)                      
+entering make_counter
+2
+>>> next(counter)                      
+incrementing x
+3
+>>> next(counter)                      
+incrementing x
+4
+
    +
  1. The presence of the yield keyword in make_counter means that this is not a normal function. It is a special kind of function which generates values one at a time. You can think of it as a resumable function. Calling it will return a generator that can be used to generate successive values of x. +
  2. To create an instance of the make_counter generator, just call it like any other function. Note that this does not actually execute the function code. You can tell this because the first line of the make_counter() function calls print(), but nothing has been printed yet. +
  3. The make_counter() function returns a generator object. +
  4. The next() function takes a generator object and returns its next value. The first time you call next() with the counter generator, it executes the code in make_counter() up to the first yield statement, then returns the value that was yielded. In this case, that will be 2, because you originally created the generator by calling make_counter(2). +
  5. Repeatedly calling next() with the same generator object resumes exactly where it left off and continues until it hits the next yield statement. All variables, local state, &c. are saved on yield and restored on next(). The next line of code waiting to be executed calls print(), which prints incrementing x. After that, it executes the statement x = x + 1. Then it loops through the while loop again, and the first thing it hits is the statement yield x, which saves the state of everything and returns the current value of x (now 3). +
  6. The next time you call next(counter), you do all the same things again, but this time x is now 4. +
+ +

Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values. But let’s look at more productive uses of generators instead. + +
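An infinite generator is harmless as long as whoever consumes it knows when to stop asking. As a sketch, here is a quieter cousin of make_counter() (the name count_from() is made up for this example) being cut off after three values with itertools.islice() from the standard library.

```python
import itertools

def count_from(x):
    # Same shape as make_counter(), minus the print() calls.
    while True:
        yield x
        x = x + 1

# islice() stops asking for values after three, so the infinite
# loop inside the generator never gets the chance to run away.
first_three = list(itertools.islice(count_from(2), 3))
print(first_three)  # [2, 3, 4]
```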

A Fibonacci Generator

+ + + +

[download fibonacci.py] +

def fib(max):
+    a, b = 0, 1          
+    while a < max:
+        yield a          
+        a, b = b, a + b  
+
    +
  1. The Fibonacci sequence is a sequence of numbers where each number is the sum of the two numbers before it. It starts with 0 and 1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1. +
  2. a is the current number in the sequence, so yield it. +
  3. b is the next number in the sequence, so assign that to a, but also calculate the next value (a + b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a + b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b). +
+ +
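The parallel assignment in step 3 is easy to get wrong when it is split into two statements. A minimal sketch contrasting the two forms follows; the helper names step_parallel and step_sequential are illustrative only and are not part of fibonacci.py.

```python
def step_parallel(a, b):
    # The whole right-hand side is evaluated before either name is rebound.
    a, b = b, a + b
    return a, b


def step_sequential(a, b):
    # Written as two statements, a is overwritten before a + b is computed.
    a = b
    b = a + b
    return a, b


print(step_parallel(3, 5))    # (5, 8)
print(step_sequential(3, 5))  # (5, 10)
```

The sequential version adds the new value of a to b, which is why it drifts away from the Fibonacci sequence.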

So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops. + +

+>>> from fibonacci import fib
+>>> for n in fib(1000):      
+...     print(n, end=' ')    
+0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
+>>> list(fib(1000))          
+[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
+
    +
  1. You can use a generator like fib() in a for loop directly. The for loop will automatically call the next() function to get values from the fib() generator and assign them to the for loop index variable (n). +
  2. Each time through the for loop, n gets a new value from the yield statement in fib(), and all you have to do is print it out. Once fib() runs out of numbers (a is no longer less than max, which in this case is 1000), the for loop exits gracefully. +
  3. This is a useful idiom: pass a generator to the list() function, and it will iterate through the entire generator (just like the for loop in the previous example) and return a list of all the values. +
+ +
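To make the "for loop calls next() for you" point concrete, here is a rough sketch of what the loop does behind the scenes. The collect() helper is a made-up name for illustration; the real for statement is implemented inside the interpreter, not in Python code like this.

```python
def fib(max):
    a, b = 0, 1
    while a < max:
        yield a
        a, b = b, a + b


def collect(generator):
    """Approximate a for loop: call next() until StopIteration is raised."""
    values = []
    iterator = iter(generator)
    while True:
        try:
            n = next(iterator)
        except StopIteration:
            # The generator function returned, so there is nothing left.
            break
        values.append(n)
    return values


print(collect(fib(20)))   # [0, 1, 1, 2, 3, 5, 8, 13]
```

When the while loop inside fib() ends, the function returns, Python raises StopIteration on the pending next() call, and the loop stops quietly.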

A Plural Rule Generator

+ +

Let’s go back to plural5.py and see how this version of the plural() function works. + +

def rules(rules_filename):
+    with open(rules_filename, encoding='utf-8') as pattern_file:
+        for line in pattern_file:
+            pattern, search, replace = line.split(None, 3)                   
+            yield build_match_and_apply_functions(pattern, search, replace)  
+
+def plural(noun, rules_filename='plural5-rules.txt'):
+    for matches_rule, apply_rule in rules(rules_filename):                   
+        if matches_rule(noun):
+            return apply_rule(noun)
+    raise ValueError('no matching rule for {0}'.format(noun))
+
    +
  1. No magic here. Remember that the lines of the rules file have three values separated by whitespace, so you use line.split(None, 3) to get the three “columns” and assign them to three local variables. +
  2. And then you yield. What do you yield? Two functions, built dynamically with your old friend, build_match_and_apply_functions(), which is identical to the previous examples. In other words, rules() is a generator that spits out match and apply functions on demand. +
  3. Since rules() is a generator, you can use it directly in a for loop. The first time through the for loop, you will call the rules() function, which will open the pattern file, read the first line, dynamically build a match function and an apply function from the patterns on that line, and yield the dynamically built functions. The second time through the for loop, you will pick up exactly where you left off in rules() (which was in the middle of the for line in pattern_file loop). The first thing it will do is read the next line of the file (which is still open), dynamically build another match and apply function based on the patterns on that line in the file, and yield the two functions. +
+ +

What have you gained over stage 4? Startup time. In stage 4, when you imported the plural4 module, it read the entire patterns file and built a list of all the possible rules, before you could even think about calling the plural() function. With generators, you can do everything lazily: you read the first rule and create functions and try them, and if that works you don’t ever read the rest of the file or create any other functions. + +

What have you lost? Performance! Every time you call the plural() function, the rules() generator starts over from the beginning — which means re-opening the patterns file and reading from the beginning, one line at a time. + +
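The trade-off can be observed directly by counting how many rule lines get parsed. The sketch below is an assumption-laden stand-in for plural5.py: it reads rules from an in-memory string (RULES_TEXT, a hypothetical sample) instead of plural5-rules.txt, build_match_and_apply_functions() is reconstructed in the spirit of the earlier stages, and the stats counter is added purely for illustration.

```python
import io
import re

# Hypothetical sample rules; the chapter reads these from plural5-rules.txt.
RULES_TEXT = """\
[sxz]$ $ es
[^aeioudgkprt]h$ $ es
[^aeiou]y$ y$ ies
$ $ s
"""

stats = {'lines_read': 0}  # illustration only: counts parsed rule lines


def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)

    def apply_rule(word):
        return re.sub(search, replace, word)

    return (matches_rule, apply_rule)


def rules(text):
    for line in io.StringIO(text):
        stats['lines_read'] += 1
        pattern, search, replace = line.split(None, 3)
        yield build_match_and_apply_functions(pattern, search, replace)


def plural(noun):
    # Each call creates a fresh generator, so reading starts from line one.
    for matches_rule, apply_rule in rules(RULES_TEXT):
        if matches_rule(noun):
            return apply_rule(noun)
    raise ValueError('no matching rule for {0}'.format(noun))


print(plural('fax'), stats['lines_read'])      # faxes 1
print(plural('vacancy'), stats['lines_read'])  # vacancies 4
```

The first call stops after one line (the gain), but the second call re-reads the first two lines before reaching the rule it needs (the loss).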

What if you could have the best of both worlds: minimal startup cost (don’t execute any code on import), and maximum performance (don’t build the same functions over and over again)? Oh, and you still want to keep the rules in a separate file (because code is code and data is data), just as long as you never have to read the same line twice. + +

To do that, you’ll need to build your own iterator. But before you do that, you need to learn about Python classes. + +

⁂ + +

Further Reading

+ + +

+ +

© 2001–10 Mark Pilgrim + + + diff --git a/http-web-services.html b/http-web-services.html index 6518ab4..435d631 100755 --- a/http-web-services.html +++ b/http-web-services.html @@ -1,1003 +1,1003 @@ - - -HTTP Web Services - Dive Into Python 3 - - - - - - -

  
-


Difficulty level: ♦♦♦♦♢ -

HTTP Web Services

-
-

A ruffled mind makes a restless pillow.
— Charlotte Brontë -

-

  -

Diving In

-

Philosophically, I can describe HTTP web services in 12 words: exchanging data with remote servers using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET. If you want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also allow creating, modifying, and deleting data, using HTTP PUT and HTTP DELETE. That’s it. No registries, no envelopes, no wrappers, no tunneling. The “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data. - -

The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML or JSON — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each resource in an HTTP web service has a unique address (in the form of a URL), you can load it in your web browser and immediately see the raw data. - -

Examples of HTTP web services: -

- -

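Python 3 comes with two different libraries for interacting with HTTP web services: http.client, a low-level library that implements the HTTP protocol, and urllib.request, an abstraction layer built on top of http.client. - -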

- -

So which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction than urllib.request. - -

To understand why httplib2 is the right choice, you first need to understand HTTP. - -

⁂ - -

Features of HTTP

- -

There are five important features which all HTTP clients should support. - -

Caching

- -

The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, latency (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack; there’s never a dull moment on the public internet, and there may be nothing you can do about it. - -

- -

HTTP is designed with caching in mind. There is an entire class of devices (called “caching proxies”) whose only job is to sit between you and the rest of the world and minimize network access. Your company or ISP almost certainly maintains caching proxies, even if you’re unaware of them. They work because caching is built into the HTTP protocol. - -

Here’s a concrete example of how caching works. You visit diveintomark.org in your browser. That page includes a background image, wearehugh.com/m.jpg. When your browser downloads that image, the server includes the following HTTP headers: - -

HTTP/1.1 200 OK
-Date: Sun, 31 May 2009 17:14:04 GMT
-Server: Apache
-Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
-ETag: "3075-ddc8d800"
-Accept-Ranges: bytes
-Content-Length: 12405
-Cache-Control: max-age=31536000, public
-Expires: Mon, 31 May 2010 17:14:04 GMT
-Connection: close
-Content-Type: image/jpeg
- -

The Cache-Control and Expires headers tell your browser (and any caching proxies between you and the server) that this image can be cached for up to a year. A year! And if, in the next year, you visit another page which also includes a link to this image, your browser will load the image from its cache without generating any network activity whatsoever. - -
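The "up to a year" claim can be checked against the headers themselves. The following sketch parses the max-age directive and confirms that Date plus max-age lands exactly on the Expires value; the max_age() helper is a simplified illustration, not a complete Cache-Control parser.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def max_age(cache_control):
    """Return the max-age value in seconds, or None if it is absent."""
    for directive in cache_control.split(','):
        name, _, value = directive.strip().partition('=')
        if name.lower() == 'max-age':
            return int(value)
    return None


# Header values copied from the response shown above.
headers = {
    'Date': 'Sun, 31 May 2009 17:14:04 GMT',
    'Cache-Control': 'max-age=31536000, public',
    'Expires': 'Mon, 31 May 2010 17:14:04 GMT',
}

seconds = max_age(headers['Cache-Control'])
fresh_until = parsedate_to_datetime(headers['Date']) + timedelta(seconds=seconds)

print(seconds // 86400)                                          # 365
print(fresh_until == parsedate_to_datetime(headers['Expires']))  # True
```

Both headers say the same thing in two different ways: Expires as an absolute date, max-age as a number of seconds from the time of the response.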

But wait, it gets better. Let’s say your browser purges the image from your local cache for some reason. Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the HTTP headers said that this data could be cached by public caching proxies. (Technically, the important thing is what the headers don’t say; the Cache-Control header doesn’t have the private keyword, so this data is cacheable by default.) Caching proxies are designed to have tons of storage space, probably far more than your local browser has allocated. - -

If your company or ISP maintains a caching proxy, the proxy may still have the image cached. When you visit diveintomark.org again, your browser will look in its local cache for the image, but it won’t find it, so it will make a network request to try to download it from the remote server. But if the caching proxy still has a copy of the image, it will intercept that request and serve the image from its cache. That means that your request will never reach the remote server; in fact, it will never leave your company’s network. That makes for a faster download (fewer network hops) and saves your company money (less data being downloaded from the outside world). - -

HTTP caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be. - -

Python’s HTTP libraries do not support caching, but httplib2 does. - -

Last-Modified Checking

- -

Some data never changes, while other data changes all the time. In between, there is a vast field of data that might have changed, but hasn’t. CNN.com’s feed is updated every few minutes, but my weblog’s feed may not change for days or weeks at a time. In the latter case, I don’t want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they’re respecting my cache headers which said “don’t bother checking this feed for weeks”). On the other hand, I don’t want clients downloading my entire feed once an hour if it hasn’t changed! - -

- -

HTTP has a solution to this, too. When you request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was last changed. That background image referenced from diveintomark.org included a Last-Modified header. - -

HTTP/1.1 200 OK
-Date: Sun, 31 May 2009 17:14:04 GMT
-Server: Apache
-Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
-ETag: "3075-ddc8d800"
-Accept-Ranges: bytes
-Content-Length: 12405
-Cache-Control: max-age=31536000, public
-Expires: Mon, 31 May 2010 17:14:04 GMT
-Connection: close
-Content-Type: image/jpeg
-
- -

When you request the same data a second (or third or fourth) time, you can send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data has changed since then, then the server ignores the If-Modified-Since header and just gives you the new data with a 200 status code. But if the data hasn’t changed since then, the server sends back a special HTTP 304 status code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the command line, using curl: - -

-you@localhost:~$ curl -I -H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT" http://wearehugh.com/m.jpg
-HTTP/1.1 304 Not Modified
-Date: Sun, 31 May 2009 18:04:39 GMT
-Server: Apache
-Connection: close
-ETag: "3075-ddc8d800"
-Expires: Mon, 31 May 2010 18:04:39 GMT
-Cache-Control: max-age=31536000, public
- -

Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is the status code. Even after your cached copy has expired, last-modified checking ensures that you won’t download the same data twice if it hasn’t changed. (As an extra bonus, this 304 response also includes caching headers. Proxies will keep a copy of data even after it officially “expires,” in the hopes that the data hasn’t really changed and the next request responds with a 304 status code and updated cache information.) - -
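The server's side of this exchange boils down to a date comparison. Here is a minimal sketch of that decision, assuming a hypothetical status_for() helper; real servers such as Apache apply additional rules that are not modeled here.

```python
from email.utils import parsedate_to_datetime


def status_for(last_modified, if_modified_since=None):
    """Return 304 when the client's copy is still current, otherwise 200."""
    if if_modified_since is None:
        # First request: the client has nothing cached, so send everything.
        return 200
    modified = parsedate_to_datetime(last_modified)
    cached = parsedate_to_datetime(if_modified_since)
    return 304 if modified <= cached else 200


last_modified = 'Fri, 22 Aug 2008 04:28:16 GMT'

print(status_for(last_modified))                                   # 200
print(status_for(last_modified, 'Fri, 22 Aug 2008 04:28:16 GMT'))  # 304
print(status_for(last_modified, 'Thu, 21 Aug 2008 00:00:00 GMT'))  # 200
```

Only the 200 case carries a response body; the 304 case is the cheap one.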

Python’s HTTP libraries do not support last-modified date checking, but httplib2 does. - -

ETag Checking

- -

ETags are an alternate way to accomplish the same thing as the last-modified checking. With ETags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from diveintomark.org had an ETag header. - -

HTTP/1.1 200 OK
-Date: Sun, 31 May 2009 17:14:04 GMT
-Server: Apache
-Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
-ETag: "3075-ddc8d800"
-Accept-Ranges: bytes
-Content-Length: 12405
-Cache-Control: max-age=31536000, public
-Expires: Mon, 31 May 2010 17:14:04 GMT
-Connection: close
-Content-Type: image/jpeg
-
- - - -

The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn’t changed, the server will send you back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the last time. - -

Again with the curl: - -

-you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg  
-HTTP/1.1 304 Not Modified
-Date: Sun, 31 May 2009 18:04:39 GMT
-Server: Apache
-Connection: close
-ETag: "3075-ddc8d800"
-Expires: Mon, 31 May 2010 18:04:39 GMT
-Cache-Control: max-age=31536000, public
-
    -
  1. ETags are commonly enclosed in quotation marks, but the quotation marks are part of the value. That means you need to send the quotation marks back to the server in the If-None-Match header. -
- -

Python’s HTTP libraries do not support ETags, but httplib2 does. - -
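The point about quotation marks is easy to demonstrate. This simplified sketch compares an If-None-Match value against the current ETag; etag_matches() is a hypothetical helper that ignores weak validators (the W/ prefix) and ETags containing commas, both of which a real server has to handle.

```python
def etag_matches(if_none_match, current_etag):
    """Return True when any tag listed in If-None-Match equals current_etag."""
    if if_none_match.strip() == '*':
        return True
    candidates = [tag.strip() for tag in if_none_match.split(',')]
    return current_etag in candidates


current = '"3075-ddc8d800"'   # the quotation marks are part of the value

print(etag_matches('"3075-ddc8d800"', current))                  # True
print(etag_matches('3075-ddc8d800', current))                    # False
print(etag_matches('"bfe-93d9c4c0", "3075-ddc8d800"', current))  # True
```

A match means the server can answer with 304; a mismatch (including one caused by dropped quotation marks) means a full 200 response.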

Compression

- -

When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s XML, maybe it’s JSON, maybe it’s just plain text. Regardless of the format, text compresses well. The example feed in the XML chapter is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That’s roughly 31% of the original size! - -

HTTP supports several compression algorithms. The two most common types are gzip and deflate. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include an Accept-Encoding header in your request that lists which compression algorithms you support. If the server supports any of the same algorithms, it will send you back compressed data (with a Content-Encoding header that tells you which algorithm it used). Then it’s up to you to decompress the data. - -
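Here is a small sketch of both halves of that exchange, using the standard gzip module. The document below is made-up repetitive XML, not the chapter's feed, so the sizes it prints will differ from the 3070 and 941 quoted above; decode_body() is a hypothetical helper name.

```python
import gzip

# Made-up sample data; repetitive markup like this compresses very well.
entry = ("<entry><title>Dive into history</title>"
         "<summary>An entire chapter on one page.</summary></entry>\n")
document = ("<feed xmlns='http://www.w3.org/2005/Atom'>\n"
            + entry * 40 + "</feed>\n").encode('utf-8')


def decode_body(content_encoding, body):
    """Undo the compression named in a Content-Encoding header value."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    return body  # no header, or 'identity': the bytes are already usable


compressed = gzip.compress(document)   # what a server might put on the wire

print(len(document), len(compressed))
print(decode_body('gzip', compressed) == document)   # True
```

A client that sends the Accept-Encoding header takes on the job that decode_body() does here.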

-

Important tip for server-side developers: make sure that the compressed version of a resource has a different ETag than the uncompressed version. Otherwise, caching proxies will get confused and may serve the compressed version to clients that can’t handle it. Read the discussion of Apache bug 39727 for more details on this subtle issue. -

- -

Python’s HTTP libraries do not support compression, but httplib2 does. - -

Redirects

- -

Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml. - -

- -

Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything’s normal, here’s the page you asked for”. Status code 404 means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection. - -

HTTP has several different ways of signifying that a resource has moved. The two most common techniques are status codes 302 and 301. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you’re supposed to use the new address from then on. - -

The urllib.request module automatically “follows” redirects when it receives the appropriate status code from the HTTP server, but it doesn’t tell you that it did so. You’ll end up getting the data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address, and each time the urllib.request module will “helpfully” follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for you. - -

httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them. - -
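The difference between the two status codes comes down to what a client remembers. The sketch below is a toy model of that bookkeeping, not httplib2's actual implementation; the RedirectMemory class and its method names are invented for illustration.

```python
class RedirectMemory:
    """Remembers permanent (301) redirects; temporary (302) ones are not stored."""

    def __init__(self):
        self.permanent = {}

    def record(self, url, status, location):
        # Only a 301 changes where future requests should go.
        if status == 301:
            self.permanent[url] = location

    def resolve(self, url):
        # Rewrite the URL before requesting it, guarding against loops.
        seen = set()
        while url in self.permanent and url not in seen:
            seen.add(url)
            url = self.permanent[url]
        return url


memory = RedirectMemory()
memory.record('http://example.com/index.xml', 301,
              'http://example.com/xml/atom.xml')
memory.record('http://example.com/news.xml', 302,
              'http://example.com/tmp/news.xml')

print(memory.resolve('http://example.com/index.xml'))  # http://example.com/xml/atom.xml
print(memory.resolve('http://example.com/news.xml'))   # http://example.com/news.xml
```

With this kind of memory, the second request for a permanently moved resource costs one round trip instead of two.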

⁂ - -

How Not To Fetch Data Over HTTP

- -

Let’s say you want to download a resource over HTTP, such as an Atom feed. Because it’s a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better. -

->>> import urllib.request
->>> a_url = 'http://diveintopython3.org/examples/feed.xml'
->>> data = urllib.request.urlopen(a_url).read()  
->>> type(data)                                   
-<class 'bytes'>
->>> print(data)
-<?xml version='1.0' encoding='utf-8'?>
-<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
-  <title>dive into mark</title>
-  <subtitle>currently between addictions</subtitle>
-  <id>tag:diveintomark.org,2001-07-29:/</id>
-  <updated>2009-03-27T21:56:07Z</updated>
-  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
-  …
-
-
    -
  1. Downloading anything over HTTP is incredibly easy in Python; in fact, it’s a one-liner. The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier. -
  2. The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it to a string. -
- -
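The explicit conversion mentioned in the second note looks like this. The bytes below are a made-up stand-in for a downloaded document, and the sketch assumes you already know the encoding is UTF-8; working out the right encoding in general is a separate problem.

```python
# Made-up bytes standing in for the result of urlopen(a_url).read().
data = b"<?xml version='1.0' encoding='utf-8'?>\n<title>caf\xc3\xa9</title>"

print(type(data))            # <class 'bytes'>

text = data.decode('utf-8')  # explicit conversion; nothing is guessed for you
print(type(text))            # <class 'str'>

# The accented character takes two bytes but is a single character.
print(len(data) - len(text))  # 1
```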

So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude. - -

⁂ - -

What’s On The Wire?

- -

To see why this is inefficient and rude, let’s turn on the debugging features of Python’s HTTP library and see what’s being sent “on the wire” (i.e. over the network). - -

->>> from http.client import HTTPConnection
->>> HTTPConnection.debuglevel = 1                                       
->>> from urllib.request import urlopen
->>> response = urlopen('http://diveintopython3.org/examples/feed.xml')  
-send: b'GET /examples/feed.xml HTTP/1.1                                 
-Host: diveintopython3.org                                               
-Accept-Encoding: identity                                               
-User-Agent: Python-urllib/3.1'                                          
-Connection: close
-reply: 'HTTP/1.1 200 OK'
-…further debugging information omitted…
-
    -
  1. As I mentioned at the beginning of the chapter, urllib.request relies on another standard Python library, http.client. Normally you don’t need to touch http.client directly. (The urllib.request module imports it automatically.) But we import it here so we can toggle the debugging flag on the HTTPConnection class that urllib.request uses to connect to the HTTP server. -
  2. Now that the debugging flag is set, information on the HTTP request and response is printed out in real time. As you can see, when you request the Atom feed, the urllib.request module sends five lines to the server. -
  3. The first line specifies the HTTP verb you’re using, and the path of the resource (minus the domain name). -
  4. The second line specifies the domain name from which we’re requesting this feed. -
  5. The third line specifies the compression algorithms that the client supports. As I mentioned earlier, urllib.request does not support compression by default. -
  6. The fourth line specifies the name of the library that is making the request. By default, this is Python-urllib plus a version number. Both urllib.request and httplib2 support changing the user agent, simply by adding a User-Agent header to the request (which will override the default value). -
- - - -
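As a sketch of the header override mentioned in the last note, the following builds a urllib.request.Request with custom headers and inspects it without sending anything over the network. The user agent string my-feed-reader/1.0 is a hypothetical name.

```python
import urllib.request

request = urllib.request.Request(
    'http://diveintopython3.org/examples/feed.xml',
    headers={
        'User-Agent': 'my-feed-reader/1.0',   # hypothetical client name
        'Accept-Encoding': 'gzip',
    },
)

# Request normalizes header names with str.capitalize(), hence 'User-agent'.
for name, value in sorted(request.header_items()):
    print('{0}: {1}'.format(name, value))
# Accept-encoding: gzip
# User-agent: my-feed-reader/1.0

print(request.get_method())   # GET
```

Note that asking for gzip this way is only half the job: if the server honors the header, urllib.request hands back the compressed bytes and leaves the decompression to you.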

Now let’s look at what the server sent back in its response. - -

-# continued from previous example
->>> print(response.headers.as_string())        
-Date: Sun, 31 May 2009 19:23:06 GMT            
-Server: Apache
-Last-Modified: Sun, 31 May 2009 06:39:55 GMT   
-ETag: "bfe-93d9c4c0"                           
-Accept-Ranges: bytes
-Content-Length: 3070                           
-Cache-Control: max-age=86400                   
-Expires: Mon, 01 Jun 2009 19:23:06 GMT
-Vary: Accept-Encoding
-Connection: close
-Content-Type: application/xml
->>> data = response.read()                     
->>> len(data)
-3070
-
    -
  1. The response returned from the urllib.request.urlopen() function contains all the HTTP headers the server sent back. It also contains methods to download the actual data; we’ll get to that in a minute. -
  2. The server tells you when it handled your request. -
  3. This response includes a Last-Modified header. -
  4. This response includes an ETag header. -
  5. The data is 3070 bytes long. Notice what isn’t here: a Content-Encoding header. Your request stated that you only accept uncompressed data (Accept-Encoding: identity), and sure enough, this response contains uncompressed data. -
  6. This response includes caching headers that state that this feed can be cached for up to 24 hours (86400 seconds). -
  7. And finally, download the actual data by calling response.read(). As you can tell from the len() function, this downloads all 3070 bytes at once. -
- -

As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports gzip compression, but HTTP compression is opt-in. We didn’t ask for it, so we didn’t get it. That means we’re downloading 3070 bytes when we could have just downloaded 941. Bad dog, no biscuit. - -

But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time. - -

-# continued from the previous example
->>> response2 = urlopen('http://diveintopython3.org/examples/feed.xml')
-send: b'GET /examples/feed.xml HTTP/1.1
-Host: diveintopython3.org
-Accept-Encoding: identity
-User-Agent: Python-urllib/3.1'
-Connection: close
-reply: 'HTTP/1.1 200 OK'
-…further debugging information omitted…
- -

Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No sign of If-Modified-Since headers. No sign of If-None-Match headers. No respect for the caching headers. Still no compression. - -

And what happens when you do the same thing twice? You get the same response. Twice. - -

-# continued from the previous example
->>> print(response2.headers.as_string())     
-Date: Mon, 01 Jun 2009 03:58:00 GMT
-Server: Apache
-Last-Modified: Sun, 31 May 2009 22:51:11 GMT
-ETag: "bfe-255ef5c0"
-Accept-Ranges: bytes
-Content-Length: 3070
-Cache-Control: max-age=86400
-Expires: Tue, 02 Jun 2009 03:58:00 GMT
-Vary: Accept-Encoding
-Connection: close
-Content-Type: application/xml
->>> data2 = response2.read()
->>> len(data2)                               
-3070
->>> data2 == data                            
-True
-
    -
  1. The server is still sending the same array of “smart” headers: Cache-Control and Expires to allow caching, Last-Modified and ETag to enable “not-modified” tracking. Even the Vary: Accept-Encoding header hints that the server would support compression, if only you would ask for it. But you didn’t. -
  2. Once again, fetching this data downloads the whole 3070 bytes… -
  3. …the exact same 3070 bytes you downloaded last time. -
- -

HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that speaks HTTP fluently. - -

⁂ - -

Introducing httplib2

- -

Before you can use httplib2, you’ll need to install it. Visit code.google.com/p/httplib2/ and download the latest version. httplib2 is available for Python 2.x and Python 3.x; make sure you get the Python 3 version, named something like httplib2-python3-0.5.0.zip. - -

Unzip the archive, open a terminal window, and go to the newly created httplib2 directory. On Windows, open the Start menu, select Run..., type cmd.exe and press ENTER. - -

-c:\Users\pilgrim\Downloads> dir
- Volume in drive C has no label.
- Volume Serial Number is DED5-B4F8
-
- Directory of c:\Users\pilgrim\Downloads
-
-07/28/2009  12:36 PM    <DIR>          .
-07/28/2009  12:36 PM    <DIR>          ..
-07/28/2009  12:36 PM    <DIR>          httplib2-python3-0.5.0
-07/28/2009  12:33 PM            18,997 httplib2-python3-0.5.0.zip
-               1 File(s)         18,997 bytes
-               3 Dir(s)  61,496,684,544 bytes free
-
-c:\Users\pilgrim\Downloads> cd httplib2-python3-0.5.0
-c:\Users\pilgrim\Downloads\httplib2-python3-0.5.0> c:\python31\python.exe setup.py install
-running install
-running build
-running build_py
-running install_lib
-creating c:\python31\Lib\site-packages\httplib2
-copying build\lib\httplib2\iri2uri.py -> c:\python31\Lib\site-packages\httplib2
-copying build\lib\httplib2\__init__.py -> c:\python31\Lib\site-packages\httplib2
-byte-compiling c:\python31\Lib\site-packages\httplib2\iri2uri.py to iri2uri.pyc
-byte-compiling c:\python31\Lib\site-packages\httplib2\__init__.py to __init__.pyc
-running install_egg_info
-Writing c:\python31\Lib\site-packages\httplib2-python3_0.5.0-py3.1.egg-info
- -

On Mac OS X, run the Terminal.app application in your /Applications/Utilities/ folder. On Linux, run the Terminal application, which is usually in your Applications menu under Accessories or System. - -

-you@localhost:~/Desktop$ unzip httplib2-python3-0.5.0.zip
-Archive:  httplib2-python3-0.5.0.zip
-  inflating: httplib2-python3-0.5.0/README
-  inflating: httplib2-python3-0.5.0/setup.py
-  inflating: httplib2-python3-0.5.0/PKG-INFO
-  inflating: httplib2-python3-0.5.0/httplib2/__init__.py
-  inflating: httplib2-python3-0.5.0/httplib2/iri2uri.py
-you@localhost:~/Desktop$ cd httplib2-python3-0.5.0/
-you@localhost:~/Desktop/httplib2-python3-0.5.0$ sudo python3 setup.py install
-running install
-running build
-running build_py
-creating build
-creating build/lib.linux-x86_64-3.1
-creating build/lib.linux-x86_64-3.1/httplib2
-copying httplib2/iri2uri.py -> build/lib.linux-x86_64-3.1/httplib2
-copying httplib2/__init__.py -> build/lib.linux-x86_64-3.1/httplib2
-running install_lib
-creating /usr/local/lib/python3.1/dist-packages/httplib2
-copying build/lib.linux-x86_64-3.1/httplib2/iri2uri.py -> /usr/local/lib/python3.1/dist-packages/httplib2
-copying build/lib.linux-x86_64-3.1/httplib2/__init__.py -> /usr/local/lib/python3.1/dist-packages/httplib2
-byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/iri2uri.py to iri2uri.pyc
-byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/__init__.py to __init__.pyc
-running install_egg_info
-Writing /usr/local/lib/python3.1/dist-packages/httplib2-python3_0.5.0.egg-info
- -

To use httplib2, create an instance of the httplib2.Http class. - -

->>> import httplib2
->>> h = httplib2.Http('.cache')                                                    
->>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')  
->>> response.status                                                                
-200
->>> content[:52]                                                                   
-b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
->>> len(content)
-3070
-
    -
  1. The primary interface to httplib2 is the Http object. For reasons you’ll see in the next section, you should always pass a directory name when you create an Http object. The directory does not need to exist; httplib2 will create it if necessary. -
  2. Once you have an Http object, retrieving data is as simple as calling the request() method with the address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll see how to issue other HTTP requests, like POST.) -
  3. The request() method returns two values. The first is an httplib2.Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request was successful. -
  4. The content variable contains the actual data that was returned by the HTTP server. The data is returned as a bytes object, not a string. If you want it as a string, you’ll need to determine the character encoding and convert it yourself. -
- -
-

You probably only need one httplib2.Http object. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use the Http object and just call the request() method twice. -

- -

A Short Digression To Explain Why httplib2 Returns Bytes Instead of Strings

- -

Bytes. Strings. What a pain. Why can’t httplib2 “just” do the conversion for you? Well, it’s complicated, because the rules for determining the character encoding are specific to what kind of resource you’re requesting. How could httplib2 know what kind of resource you’re requesting? It’s usually listed in the Content-Type HTTP header, but that’s an optional feature of HTTP and not all HTTP servers include it. If that header is not included in the HTTP response, it’s left up to the client to guess. (This is commonly called “content sniffing,” and it’s never perfect.) - -

If you know what sort of resource you’re expecting (an XML document in this case), perhaps you could “just” pass the returned bytes object to the xml.etree.ElementTree.fromstring() function. That’ll work as long as the XML document includes information on its own character encoding (as this one does), but that’s an optional feature and not all XML documents do that. If an XML document doesn’t include encoding information, the client is supposed to look at the enclosing transport — i.e. the Content-Type HTTP header, which can include a charset parameter. - -
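
Here is a minimal sketch of handing bytes straight to the XML parser. The content value is a hypothetical stand-in for what h.request() returns, trimmed to a single element. It uses fromstring(), which accepts a bytes object directly and reads the encoding from the XML declaration.

```python
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes returned by h.request().
content = (b"<?xml version='1.0' encoding='utf-8'?>\r\n"
           b"<feed xmlns='http://www.w3.org/2005/Atom'>"
           b"<title>dive into mark</title></feed>")

# The parser reads the encoding from the XML declaration,
# so no explicit decode() call is needed here.
root = etree.fromstring(content)

# Element names are qualified with the namespace in curly braces.
title = root.findtext('{http://www.w3.org/2005/Atom}title')
print(title)
```

This only works because the document declares its own encoding. Without the declaration, the parser has to fall back on a default.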

[I support RFC 3023 t-shirt] - -

But it’s worse than that. Now character encoding information can be in two places: within the XML document itself, and within the Content-Type HTTP header. If the information is in both places, which one wins? According to RFC 3023 (I swear I am not making this up), if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is - -

    -
  1. the encoding given in the charset parameter of the Content-Type HTTP header, or -
  2. the encoding given in the encoding attribute of the XML declaration within the document, or -
  3. UTF-8 -
- -

On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is - -

    -
  1. the encoding given in the charset parameter of the Content-Type HTTP header, or -
  2. us-ascii -
- -
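
The two sets of rules above can be sketched as one small function. This is an illustration, not code from httplib2: the name xml_encoding() is made up, it assumes the media type is already known to be an XML type, and it assumes the start of the document is ASCII-compatible (so it does not handle a UTF-16 byte order mark).

```python
import re

def xml_encoding(content_type, body):
    """Pick an encoding for an XML response, following the rules above."""
    media_type, _, params = content_type.partition(';')
    media_type = media_type.strip().lower()

    # Rule 1 for both families: an explicit charset parameter wins.
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', params, re.IGNORECASE)
    if match:
        return match.group(1)

    # text/* types ignore the XML declaration and fall back to us-ascii.
    if media_type.startswith('text/'):
        return 'us-ascii'

    # application/* types consult the XML declaration, then default to UTF-8.
    declared = re.match(rb'<\?xml[^>]*encoding=[\'"]([A-Za-z0-9._-]+)[\'"]', body)
    if declared:
        return declared.group(1).decode('ascii')
    return 'utf-8'

doc = b"<?xml version='1.0' encoding='iso-8859-1'?><feed/>"
print(xml_encoding('application/atom+xml', doc))
print(xml_encoding('text/xml', doc))
print(xml_encoding('application/xml; charset=koi8-r', doc))
```

The same bytes produce three different answers depending on the Content-Type header alone, which is why httplib2 leaves the decision to you.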

And that’s just for XML documents. For HTML documents, web browsers have constructed such byzantine rules for content-sniffing [PDF] that we’re still trying to figure them all out. - -

“Patches welcome.” - -

How httplib2 Handles Caching

- -

Remember in the previous section when I said you should always create an httplib2.Http object with a directory name? Caching is the reason. - -

-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml')  
->>> response2.status                                                                 
-200
->>> content2[:52]                                                                    
-b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
->>> len(content2)
-3070
-
    -
  1. This shouldn’t be terribly surprising. It’s the same thing you did last time, except you’re putting the result into two new variables. -
  2. The HTTP status is once again 200, just like last time. -
  3. The downloaded content is the same as last time, too. -
- -

So… who cares? Quit your Python interactive shell and relaunch it with a new session, and I’ll show you. - -

-# NOT continued from previous example!
-# Please exit out of the interactive shell
-# and launch a new one.
->>> import httplib2
->>> httplib2.debuglevel = 1                                                        
->>> h = httplib2.Http('.cache')                                                    
->>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')  
->>> len(content)                                                                   
-3070
->>> response.status                                                                
-200
->>> response.fromcache                                                             
-True
-
    -
  1. Let’s turn on debugging and see what’s on the wire. This is the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back. -
  2. Create an httplib2.Http object with the same directory name as before. -
  3. Request the same URL as before. Nothing appears to happen. More precisely, nothing gets sent to the server, and nothing gets returned from the server. There is absolutely no network activity whatsoever. -
  4. Yet we did “receive” some data — in fact, we received all of it. -
  5. We also “received” an HTTP status code indicating that the “request” was successful. -
  6. Here’s the rub: this “response” was generated from httplib2’s local cache. That directory name you passed in when you created the httplib2.Http object — that directory holds httplib2’s cache of all the operations it’s ever performed. -
- - - -
-

If you want to turn on httplib2 debugging, you need to set a module-level constant (httplib2.debuglevel), then create a new httplib2.Http object. If you want to turn off debugging, you need to change the same module-level constant, then create a new httplib2.Http object. -

- -

You previously requested the data at this URL. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 24 hours (Cache-Control: max-age=86400, which is 24 hours measured in seconds). httplib2 understands and respects those caching headers, and it stored the previous response in the .cache directory (which you passed in when you created the Http object). That cache hasn’t expired yet, so the second time you request the data at this URL, httplib2 simply returns the cached result without ever hitting the network. - -

I say “simply,” but obviously there is a lot of complexity hidden behind that simplicity. httplib2 handles HTTP caching automatically and by default. If for some reason you need to know whether a response came from the cache, you can check response.fromcache. Otherwise, it Just Works. - -
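
To make the freshness check concrete, here is a rough sketch of the decision a cache has to make. It is not httplib2’s actual code: the is_fresh() function is hypothetical, and it looks only at max-age and the Date header, ignoring Expires, Age, and the other directives that a real cache has to consider.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import re

def is_fresh(headers, now):
    """Return True if a cached response may be reused without network access."""
    match = re.search(r'max-age=(\d+)', headers.get('cache-control', ''))
    if match is None or 'date' not in headers:
        return False
    # Age of the cached copy, measured from the server's Date header.
    age = now - parsedate_to_datetime(headers['date'])
    return age < timedelta(seconds=int(match.group(1)))

cached = {'cache-control': 'max-age=86400',
          'date': 'Tue, 02 Jun 2009 00:40:26 GMT'}

an_hour_later = datetime(2009, 6, 2, 1, 40, 26, tzinfo=timezone.utc)
two_days_later = datetime(2009, 6, 4, 0, 40, 26, tzinfo=timezone.utc)

print(is_fresh(cached, an_hour_later))    # True: still inside the 24-hour window
print(is_fresh(cached, two_days_later))   # False: the cached copy has expired
```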

Now, suppose you have data cached, but you want to bypass the cache and re-request it from the remote server. Browsers sometimes do this if the user specifically requests it. For example, pressing F5 refreshes the current page, but pressing Ctrl+F5 bypasses the cache and re-requests the current page from the remote server. You might think “oh, I’ll just delete the data from my local cache, then request it again.” You could do that, but remember that there may be more parties involved than just you and the remote server. What about those intermediate proxy servers? They’re completely beyond your control, and they may still have that data cached, and will happily return it to you because (as far as they are concerned) their cache is still valid. - -

Instead of manipulating your local cache and hoping for the best, you should use the features of HTTP to ensure that your request actually reaches the remote server. - -

-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',
-...     headers={'cache-control':'no-cache'})  
-connect: (diveintopython3.org, 80)             
-send: b'GET /examples/feed.xml HTTP/1.1
-Host: diveintopython3.org
-user-agent: Python-httplib2/$Rev: 259 $
-accept-encoding: deflate, gzip
-cache-control: no-cache'
-reply: 'HTTP/1.1 200 OK'
-…further debugging information omitted…
->>> response2.status
-200
->>> response2.fromcache                        
-False
->>> print(dict(response2.items()))             
-{'status': '200',
- 'content-length': '3070',
- 'content-location': 'http://diveintopython3.org/examples/feed.xml',
- 'accept-ranges': 'bytes',
- 'expires': 'Wed, 03 Jun 2009 00:40:26 GMT',
- 'vary': 'Accept-Encoding',
- 'server': 'Apache',
- 'last-modified': 'Sun, 31 May 2009 22:51:11 GMT',
- 'connection': 'close',
- '-content-encoding': 'gzip',
- 'etag': '"bfe-255ef5c0"',
- 'cache-control': 'max-age=86400',
- 'date': 'Tue, 02 Jun 2009 00:40:26 GMT',
- 'content-type': 'application/xml'}
-
    -
  1. httplib2 allows you to add arbitrary HTTP headers to any outgoing request. In order to bypass all caches (not just your local disk cache, but also any caching proxies between you and the remote server), add a cache-control header with the value no-cache to the headers dictionary. -
  2. Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions — as part of the incoming response and as part of the outgoing request. It noticed that you added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data. -
  3. This response was not generated from your local cache. You knew that, of course, because you saw the debugging information on the outgoing request. But it’s nice to have that programmatically verified. -
  4. The request succeeded; you downloaded the entire feed again from the remote server. Of course, the server also sent back a full complement of HTTP headers along with the feed data. That includes caching headers, which httplib2 uses to update its local cache, in the hopes of avoiding network access the next time you request this feed. Everything about HTTP caching is designed to maximize cache hits and minimize network access. Even though you bypassed the cache this time, the remote server would really appreciate it if you would cache the result for next time. -
- -

How httplib2 Handles Last-Modified and ETag Headers

- -

The Cache-Control and Expires caching headers are called freshness indicators. They tell caches in no uncertain terms that you can completely avoid all network access until the cache expires. And that’s exactly the behavior you saw in the previous section: given a freshness indicator, httplib2 does not generate a single byte of network activity to serve up cached data (unless you explicitly bypass the cache, of course). - -

But what about the case where the data might have changed, but hasn’t? HTTP defines Last-Modified and ETag headers for this purpose. These headers are called validators. If the local cache is no longer fresh, a client can send the validators with the next request to see if the data has actually changed. If the data hasn’t changed, the server sends back a 304 status code and no data. So there’s still a round-trip over the network, but you end up downloading fewer bytes. - -

->>> import httplib2
->>> httplib2.debuglevel = 1
->>> h = httplib2.Http('.cache')
->>> response, content = h.request('http://diveintopython3.org/')  
-connect: (diveintopython3.org, 80)
-send: b'GET / HTTP/1.1
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 200 OK'
->>> print(dict(response.items()))                                 
-{'-content-encoding': 'gzip',
- 'accept-ranges': 'bytes',
- 'connection': 'close',
- 'content-length': '6657',
- 'content-location': 'http://diveintopython3.org/',
- 'content-type': 'text/html',
- 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
- 'etag': '"7f806d-1a01-9fb97900"',
- 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
- 'server': 'Apache',
- 'status': '200',
- 'vary': 'Accept-Encoding,User-Agent'}
->>> len(content)                                                  
-6657
-
    -
  1. Instead of the feed, this time we’re going to download the site’s home page, which is HTML. Since this is the first time you’ve ever requested this page, httplib2 has little to work with, and it sends out a minimum of headers with the request. -
  2. The response contains a multitude of HTTP headers… but no caching information. However, it does include both an ETag and Last-Modified header. -
  3. At the time I constructed this example, this page was 6657 bytes. It’s probably changed since then, but don’t worry about it. -
- -
-# continued from the previous example
->>> response, content = h.request('http://diveintopython3.org/')  
-connect: (diveintopython3.org, 80)
-send: b'GET / HTTP/1.1
-Host: diveintopython3.org
-if-none-match: "7f806d-1a01-9fb97900"                             
-if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT                  
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 304 Not Modified'                                
->>> response.fromcache                                            
-True
->>> response.status                                               
-200
->>> response.dict['status']                                       
-'304'
->>> len(content)                                                  
-6657
-
    -
  1. You request the same page again, with the same Http object (and the same local cache). -
  2. httplib2 sends the ETag validator back to the server in the If-None-Match header. -
  3. httplib2 also sends the Last-Modified validator back to the server in the If-Modified-Since header. -
  4. The server looked at these validators, looked at the page you requested, and determined that the page has not changed since you last requested it, so it sends back a 304 status code and no data. -
  5. Back on the client, httplib2 notices the 304 status code and loads the content of the page from its cache. -
  6. This might be a bit confusing. There are really two status codes — 304 (returned from the server this time, which caused httplib2 to look in its cache), and 200 (returned from the server last time, and stored in httplib2’s cache along with the page data). response.status returns the status from the cache. -
  7. If you want the raw status code returned from the server, you can get that by looking in response.dict, which is a dictionary of the actual headers returned from the server. -
  8. However, you still get the data in the content variable. Generally, you don’t need to know why a response was served from the cache. (You may not even care that it was served from the cache at all, and that’s fine too. httplib2 is smart enough to let you act dumb.) By the time the request() method returns to the caller, httplib2 has already updated its cache and returned the data to you. -
- -
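
The validator exchange above can be sketched in a few lines. Both function names are hypothetical and this is not how httplib2 is written internally. It only shows the mapping from cached response headers to conditional request headers, and what a client does with a 304.

```python
def conditional_headers(cached_headers):
    """Build request headers that ask the server: has this changed?"""
    headers = {}
    if 'etag' in cached_headers:
        headers['if-none-match'] = cached_headers['etag']
    if 'last-modified' in cached_headers:
        headers['if-modified-since'] = cached_headers['last-modified']
    return headers

def resolve_body(status, body, cached_body):
    """A 304 carries no data, so fall back on the cached copy."""
    return cached_body if status == 304 else body

# Header values taken from the first response in the example above.
cached = {'etag': '"7f806d-1a01-9fb97900"',
          'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT'}

print(conditional_headers(cached))
print(resolve_body(304, b'', b'<html>cached page</html>'))
```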

How httplib2 Handles Compression

- - - -

HTTP supports several types of compression; the two most common types are gzip and deflate. httplib2 supports both of these. - -

->>> response, content = h.request('http://diveintopython3.org/')
-connect: (diveintopython3.org, 80)
-send: b'GET / HTTP/1.1
-Host: diveintopython3.org
-accept-encoding: deflate, gzip                          
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 200 OK'
->>> print(dict(response.items()))
-{'-content-encoding': 'gzip',                           
- 'accept-ranges': 'bytes',
- 'connection': 'close',
- 'content-length': '6657',
- 'content-location': 'http://diveintopython3.org/',
- 'content-type': 'text/html',
- 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
- 'etag': '"7f806d-1a01-9fb97900"',
- 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
- 'server': 'Apache',
- 'status': '304',
- 'vary': 'Accept-Encoding,User-Agent'}
-
    -
  1. Every time httplib2 sends a request, it includes an Accept-Encoding header to tell the server that it can handle either deflate or gzip compression. -
  2. In this case, the server has responded with a gzip-compressed payload. By the time the request() method returns, httplib2 has already decompressed the body of the response and placed it in the content variable. If you’re curious about whether or not the response was compressed, you can check response['-content-encoding']; otherwise, don’t worry about it. -
- -
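
If you are curious what “already decompressed” involves, here is a sketch using only the standard library. The decode_body() function is hypothetical. It assumes that a deflate body is zlib-wrapped (some servers send raw deflate streams instead, which this does not handle).

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Undo the compression named in the Content-Encoding header."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    if content_encoding == 'deflate':
        return zlib.decompress(body)
    return body  # no compression, or one this sketch does not know about

page = b'<html>' + b'dive into python 3 ' * 200 + b'</html>'
on_the_wire = gzip.compress(page)

print(len(page), len(on_the_wire))          # repetitive text compresses well
print(decode_body(on_the_wire, 'gzip') == page)
```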

How httplib2 Handles Redirects

- -

HTTP defines two kinds of redirects: temporary and permanent. There’s nothing special to do with temporary redirects except follow them, which httplib2 does automatically. - -

->>> import httplib2
->>> httplib2.debuglevel = 1
->>> h = httplib2.Http('.cache')
->>> response, content = h.request('http://diveintopython3.org/examples/feed-302.xml')  
-connect: (diveintopython3.org, 80)
-send: b'GET /examples/feed-302.xml HTTP/1.1                                            
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 302 Found'                                                            
-send: b'GET /examples/feed.xml HTTP/1.1                                                
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 200 OK'
-
    -
  1. There is no feed at this URL. I’ve set up my server to issue a temporary redirect to the correct address. -
  2. There’s the request. -
  3. And there’s the response: 302 Found. Not shown here, this response also includes a Location header that points to the real URL. -
  4. httplib2 immediately turns around and “follows” the redirect by issuing another request for the URL given in the Location header: http://diveintopython3.org/examples/feed.xml -
- -

“Following” a redirect is nothing more than this example shows. httplib2 sends a request for the URL you asked for. The server comes back with a response that says “No no, look over there instead.” httplib2 sends another request for the new URL. - -

-# continued from the previous example
->>> response                                                          
-{'status': '200',
- 'content-length': '3070',
- 'content-location': 'http://diveintopython3.org/examples/feed.xml',  
- 'accept-ranges': 'bytes',
- 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
- 'vary': 'Accept-Encoding',
- 'server': 'Apache',
- 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT',
- 'connection': 'close',
- '-content-encoding': 'gzip',                                         
- 'etag': '"bfe-4cbbf5c0"',
- 'cache-control': 'max-age=86400',                                    
- 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
- 'content-type': 'application/xml'}
-
    -
  1. The response you get back from this single call to the request() method is the response from the final URL. -
  2. httplib2 adds the final URL to the response dictionary, as content-location. This is not a header that came from the server; it’s specific to httplib2. -
  3. Apropos of nothing, this feed is compressed. -
  4. And cacheable. (This is important, as you’ll see in a minute.) -
- -

The response you get back gives you information about the final URL. What if you want more information about the intermediate URLs, the ones that eventually redirected to the final URL? httplib2 lets you do that, too. - -

-# continued from the previous example
->>> response.previous                                                     
-{'status': '302',
- 'content-length': '228',
- 'content-location': 'http://diveintopython3.org/examples/feed-302.xml',
- 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
- 'server': 'Apache',
- 'connection': 'close',
- 'location': 'http://diveintopython3.org/examples/feed.xml',
- 'cache-control': 'max-age=86400',
- 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
- 'content-type': 'text/html; charset=iso-8859-1'}
->>> type(response)                                                        
-<class 'httplib2.Response'>
->>> type(response.previous)
-<class 'httplib2.Response'>
->>> response.previous.previous                                            
->>>
-
    -
  1. The response.previous attribute holds a reference to the previous response object that httplib2 followed to get to the current response object. -
  2. Both response and response.previous are httplib2.Response objects. -
  3. That means you can check response.previous.previous to follow the redirect chain backwards even further. (Scenario: one URL redirects to a second URL which redirects to a third URL. It could happen!) In this case, we’ve already reached the beginning of the redirect chain, so the attribute is None. -
- -
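
Walking the chain by hand gets tedious once there are more than two hops. Here is a small helper, sketched with stand-in objects instead of real httplib2.Response instances. The redirect_chain() name is made up, and it relies only on the behavior shown above: each response has a previous attribute, which is None at the start of the chain.

```python
from types import SimpleNamespace

def redirect_chain(response):
    """Return every response in the chain, first hop first."""
    chain = []
    while response is not None:
        chain.append(response)
        response = response.previous
    chain.reverse()
    return chain

# Stand-ins for the two responses in the example above.
first = SimpleNamespace(status=302, previous=None)
final = SimpleNamespace(status=200, previous=first)

print([r.status for r in redirect_chain(final)])   # [302, 200]
```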

What happens if you request the same URL again? - -

-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-302.xml')  
-connect: (diveintopython3.org, 80)
-send: b'GET /examples/feed-302.xml HTTP/1.1                                              
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 302 Found'                                                              
->>> content2 == content                                                                  
-True
-
    -
  1. Same URL, same httplib2.Http object (and therefore the same cache). -
  2. The 302 response was not cached, so httplib2 sends another request for the same URL. -
  3. Once again, the server responds with a 302. But notice what didn’t happen: there wasn’t ever a second request for the final URL, http://diveintopython3.org/examples/feed.xml. That response was cached (remember the Cache-Control header that you saw in the previous example). Once httplib2 received the 302 Found code, it checked its cache before issuing another request. The cache contained a fresh copy of http://diveintopython3.org/examples/feed.xml, so there was no need to re-request it. -
  4. By the time the request() method returns, it has read the feed data from the cache and returned it. Of course, it’s the same as the data you received last time. -
- -

In other words, you don’t have to do anything special for temporary redirects. httplib2 will follow them automatically, and the fact that one URL redirects to another has no bearing on httplib2’s support for compression, caching, ETags, or any of the other features of HTTP. - -

Permanent redirects are just as simple. - -

-# continued from the previous example
->>> response, content = h.request('http://diveintopython3.org/examples/feed-301.xml')  
-connect: (diveintopython3.org, 80)
-send: b'GET /examples/feed-301.xml HTTP/1.1
-Host: diveintopython3.org
-accept-encoding: deflate, gzip
-user-agent: Python-httplib2/$Rev: 259 $'
-reply: 'HTTP/1.1 301 Moved Permanently'                                                
->>> response.fromcache                                                                 
-True
-
    -
  1. Once again, this URL doesn’t really exist. I’ve set up my server to issue a permanent redirect to http://diveintopython3.org/examples/feed.xml. -
  2. And here it is: status code 301. But again, notice what didn’t happen: there was no request to the redirect URL. Why not? Because it’s already cached locally. -
  3. httplib2 “followed” the redirect right into its cache. -
- -

But wait! There’s more! - -

-# continued from the previous example
->>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml')  
->>> response2.fromcache                                                                  
-True
->>> content2 == content                                                                  
-True
-
-
    -
  1. Here’s the difference between temporary and permanent redirects: once httplib2 follows a permanent redirect, all further requests for that URL will transparently be rewritten to the target URL without hitting the network for the original URL. Remember, debugging is still turned on, yet there is no output of network activity whatsoever. -
  2. Yep, this response was retrieved from the local cache. -
  3. Yep, you got the entire feed (from the cache). -
- -
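
To see why the two kinds of redirect behave differently, consider this toy model. It is emphatically not httplib2’s implementation: the RedirectMemory class is invented for illustration. It remembers only permanent (301) redirects, so a temporary (302) redirect has to be asked about again every time.

```python
class RedirectMemory:
    """Toy model: remember permanent redirects, forget temporary ones."""

    def __init__(self):
        self._permanent = {}

    def record(self, url, status, location):
        if status == 301:
            self._permanent[url] = location

    def resolve(self, url):
        """Rewrite a URL to its remembered target without any network access."""
        seen = set()
        while url in self._permanent and url not in seen:
            seen.add(url)          # guard against redirect loops
            url = self._permanent[url]
        return url

memory = RedirectMemory()
feed = 'http://diveintopython3.org/examples/feed.xml'
memory.record('http://diveintopython3.org/examples/feed-301.xml', 301, feed)
memory.record('http://diveintopython3.org/examples/feed-302.xml', 302, feed)

print(memory.resolve('http://diveintopython3.org/examples/feed-301.xml'))
print(memory.resolve('http://diveintopython3.org/examples/feed-302.xml'))
```

The first lookup is rewritten to the feed URL. The second comes back unchanged, which is why the 302 example above still hit the network for the original URL.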

HTTP. It works. - -

⁂ - -

Beyond HTTP GET

- -

HTTP web services are not limited to GET requests. What if you want to create something new? Whenever you post a comment on a discussion forum, update your weblog, publish your status on a microblogging service like Twitter or Identi.ca, you’re probably already using HTTP POST. - -

Twitter and Identi.ca both offer a simple HTTP-based API for publishing and updating your status in 140 characters or less. Let’s look at Identi.ca’s API documentation for updating your status: - -

-

Identi.ca REST API Method: statuses/update
-Updates the authenticating user’s status. Requires the status parameter specified below. Request must be a POST. - -

-
URL -
https://identi.ca/api/statuses/update.format -
Formats -
xml, json, rss, atom -
HTTP Method(s) -
POST -
Requires Authentication -
true -
Parameters -
status. Required. The text of your status update. URL-encode as necessary. -
-
- -

How does this work? To publish a new message on Identi.ca, you need to issue an HTTP POST request to https://identi.ca/api/statuses/update.format. (The format bit is not part of the URL; you replace it with the data format you want the server to return in response to your request. So if you want a response in XML, you would post the request to https://identi.ca/api/statuses/update.xml.) The request needs to include a parameter called status, which contains the text of your status update. And the request needs to be authenticated. - -

Authenticated? Sure. To update your status on Identi.ca, you need to prove who you are. Identi.ca is not a wiki; only you can update your own status. Identi.ca uses HTTP Basic Authentication (a.k.a. RFC 2617) over SSL to provide secure but easy-to-use authentication. httplib2 supports both SSL and HTTP Basic Authentication, so this part is easy. - -

A POST request is different from a GET request, because it includes a payload. The payload is the data you want to send to the server. The one piece of data that this API method requires is status, and it should be URL-encoded. This is a very simple serialization format that takes a set of key-value pairs (i.e. a dictionary) and transforms it into a string. - -

->>> from urllib.parse import urlencode              
->>> data = {'status': 'Test update from Python 3'}  
->>> urlencode(data)                                 
-'status=Test+update+from+Python+3'
-
    -
  1. Python comes with a utility function to URL-encode a dictionary: urllib.parse.urlencode(). -
  2. This is the sort of dictionary that the Identi.ca API is looking for. It contains one key, status, whose value is the text of a single status update. -
  3. This is what the URL-encoded string looks like. This is the payload that will be sent “on the wire” to the Identi.ca API server in your HTTP POST request. -
- -
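
Plain words and spaces are the easy case. Here is a short sketch of what happens with characters that have special meaning in URL-encoded data, using urllib.parse.parse_qs() to show that the encoding is reversible. The second status message is made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

payload = urlencode({'status': 'Test update from Python 3'})
print(payload)        # status=Test+update+from+Python+3
print(len(payload))   # 32, the number a client would send as Content-Length

# Ampersands, percent signs, and quotes are escaped so they
# can't be mistaken for separators between key-value pairs.
tricky = urlencode({'status': 'Q&A: 100% "done"?'})
print(tricky)

# parse_qs() reverses the encoding; each key maps to a list of values.
print(parse_qs(tricky))
```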

- -

->>> from urllib.parse import urlencode
->>> import httplib2
->>> httplib2.debuglevel = 1
->>> h = httplib2.Http('.cache')
->>> data = {'status': 'Test update from Python 3'}
->>> h.add_credentials('diveintomark', 'MY_SECRET_PASSWORD', 'identi.ca')    
->>> resp, content = h.request('https://identi.ca/api/statuses/update.xml',
-...     'POST',                                                             
-...     urlencode(data),                                                    
-...     headers={'Content-Type': 'application/x-www-form-urlencoded'})      
-
    -
  1. This is how httplib2 handles authentication. Store your username and password with the add_credentials() method. When httplib2 tries to issue the request, the server will respond with a 401 Unauthorized status code, and it will list which authentication methods it supports (in the WWW-Authenticate header). httplib2 will automatically construct an Authorization header and re-request the URL. -
  2. The second parameter is the type of HTTP request, in this case POST. -
  3. The third parameter is the payload to send to the server. We’re sending the URL-encoded dictionary with a status message. -
  4. Finally, we need to tell the server that the payload is URL-encoded data. -
- -
-

The third parameter to the add_credentials() method is the domain in which the credentials are valid. You should always specify this! If you leave out the domain and later reuse the httplib2.Http object on a different authenticated site, httplib2 might end up leaking one site’s username and password to the other site. -

- -

This is what goes over the wire: - -

-# continued from the previous example
-send: b'POST /api/statuses/update.xml HTTP/1.1
-Host: identi.ca
-Accept-Encoding: identity
-Content-Length: 32
-content-type: application/x-www-form-urlencoded
-user-agent: Python-httplib2/$Rev: 259 $
-
-status=Test+update+from+Python+3'
-reply: 'HTTP/1.1 401 Unauthorized'                        
-send: b'POST /api/statuses/update.xml HTTP/1.1            
-Host: identi.ca
-Accept-Encoding: identity
-Content-Length: 32
-content-type: application/x-www-form-urlencoded
-authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2  
-user-agent: Python-httplib2/$Rev: 259 $
-
-status=Test+update+from+Python+3'
-reply: 'HTTP/1.1 200 OK'                                  
-
    -
  1. After the first request, the server responds with a 401 Unauthorized status code. httplib2 will never send authentication headers unless the server explicitly asks for them. This is how the server asks for them. -
  2. httplib2 immediately turns around and requests the same URL a second time. -
  3. This time, it includes the username and password that you added with the add_credentials() method. -
  4. It worked! -
- -
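
The value that httplib2 puts in the Authorization header is less mysterious than it looks. For HTTP Basic Authentication, it is the username and password joined by a colon and then Base64-encoded. The basic_auth_header() function below is a hypothetical helper, not part of httplib2, and the credentials are the well-known example from RFC 2617.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an Authorization header for HTTP Basic Authentication."""
    credentials = '{0}:{1}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

header = basic_auth_header('Aladdin', 'open sesame')
print(header)   # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# Base64 is an encoding, not encryption: anyone can reverse it.
print(base64.b64decode(header.split(' ', 1)[1]))   # b'Aladdin:open sesame'
```

Because the encoding is trivially reversible, Basic Authentication is only as safe as the connection it travels over. That is why this API requires SSL.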

What does the server send back after a successful request? That depends entirely on the web service API. In some protocols (like the Atom Publishing Protocol), the server sends back a 201 Created status code and the location of the newly created resource in the Location header. Identi.ca sends back a 200 OK and an XML document containing information about the newly created resource. - -

-# continued from the previous example
->>> print(content.decode('utf-8'))                             
-<?xml version="1.0" encoding="UTF-8"?>
-<status>
- <text>Test update from Python 3</text>                        
- <truncated>false</truncated>
- <created_at>Wed Jun 10 03:53:46 +0000 2009</created_at>
- <in_reply_to_status_id></in_reply_to_status_id>
- <source>api</source>
- <id>5131472</id>                                              
- <in_reply_to_user_id></in_reply_to_user_id>
- <in_reply_to_screen_name></in_reply_to_screen_name>
- <favorited>false</favorited>
- <user>
-  <id>3212</id>
-  <name>Mark Pilgrim</name>
-  <screen_name>diveintomark</screen_name>
-  <location>27502, US</location>
-  <description>tech writer, husband, father</description>
-  <profile_image_url>http://avatar.identi.ca/3212-48-20081216000626.png</profile_image_url>
-  <url>http://diveintomark.org/</url>
-  <protected>false</protected>
-  <followers_count>329</followers_count>
-  <profile_background_color></profile_background_color>
-  <profile_text_color></profile_text_color>
-  <profile_link_color></profile_link_color>
-  <profile_sidebar_fill_color></profile_sidebar_fill_color>
-  <profile_sidebar_border_color></profile_sidebar_border_color>
-  <friends_count>2</friends_count>
-  <created_at>Wed Jul 02 22:03:58 +0000 2008</created_at>
-  <favourites_count>30768</favourites_count>
-  <utc_offset>0</utc_offset>
-  <time_zone>UTC</time_zone>
-  <profile_background_image_url></profile_background_image_url>
-  <profile_background_tile>false</profile_background_tile>
-  <statuses_count>122</statuses_count>
-  <following>false</following>
-  <notifications>false</notifications>
-</user>
-</status>
-
    -
  1. Remember, the data returned by httplib2 is always bytes, not a string. To convert it to a string, you need to decode it using the proper character encoding. Identi.ca’s API always returns results in UTF-8, so that part is easy. -
  2. There’s the text of the status message we just published. -
  3. There’s the unique identifier for the new status message. Identi.ca uses this to construct a URL for viewing the message on the web. -
- -

And here it is: - -

screenshot showing published status message on Identi.ca - -

⁂ - -

Beyond HTTP POST

- -

HTTP isn’t limited to GET and POST. Those are certainly the most common types of requests, especially in web browsers. But web service APIs can go beyond GET and POST, and httplib2 is ready. - -

-# continued from the previous example
->>> from xml.etree import ElementTree as etree
->>> tree = etree.fromstring(content)                                          
->>> status_id = tree.findtext('id')                                           
->>> status_id
-'5131472'
->>> url = 'https://identi.ca/api/statuses/destroy/{0}.xml'.format(status_id)  
->>> resp, deleted_content = h.request(url, 'DELETE')                          
-
    -
  1. The server returned XML, right? You know how to parse XML. -
  2. The findtext() method finds the first instance of the given expression and extracts its text content. In this case, we’re just looking for an <id> element. -
  3. Based on the text content of the <id> element, we can construct a URL to delete the status message we just published. -
  4. To delete a message, you simply issue an HTTP DELETE request to that URL. -
- -

This is what goes over the wire: - -

-send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1      
-Host: identi.ca
-Accept-Encoding: identity
-user-agent: Python-httplib2/$Rev: 259 $
-
-'
-reply: 'HTTP/1.1 401 Unauthorized'                             
-send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1      
-Host: identi.ca
-Accept-Encoding: identity
-authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2       
-user-agent: Python-httplib2/$Rev: 259 $
-
-'
-reply: 'HTTP/1.1 200 OK'                                       
->>> resp.status
-200
-
    -
  1. “Delete this status message.” -
  2. “I’m sorry, Dave, I’m afraid I can’t do that.” -
  3. “Unauthorized?! Hmmph. Delete this status message, please… -
  4. …and here’s my username and password.” -
  5. “Consider it done!” -
- -

And just like that, poof, it’s gone. - -

screenshot showing deleted message on Identi.ca - -

⁂ - -

Further Reading

- -

httplib2: - -

- -

HTTP caching: - -

- -

RFCs: - -

- -

-

© 2001–10 Mark Pilgrim - - - + + +HTTP Web Services - Dive Into Python 3 + + + + + + +

  
+

You are here: Home Dive Into Python 3 +

Difficulty level: ♦♦♦♦♢ +

HTTP Web Services

+
+

A ruffled mind makes a restless pillow.
— Charlotte Brontë +

+

  +

Diving In

+

Philosophically, I can describe HTTP web services in 12 words: exchanging data with remote servers using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET. If you want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also allow creating, modifying, and deleting data, using HTTP PUT and HTTP DELETE. That’s it. No registries, no envelopes, no wrappers, no tunneling. The “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data. + +

The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML or JSON — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging is also easier; because each resource in an HTTP web service has a unique address (in the form of a URL), you can load it in your web browser and immediately see the raw data. + +

Examples of HTTP web services: +

+ +

Python 3 comes with two different libraries for interacting with HTTP web services: + +

+ +

So which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction than urllib.request. + +

To understand why httplib2 is the right choice, you first need to understand HTTP. + +

⁂ + +

Features of HTTP

+ +

There are five important features which all HTTP clients should support. + +

Caching

+ +

The most important thing to understand about any type of web service is that network access is incredibly expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an extraordinarily long time to open a connection, send a request, and retrieve a response from a remote server. Even on the fastest broadband connection, latency (the time it takes to send a request and start retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is dropped, an intermediate proxy is under attack: there’s never a dull moment on the public internet, and there may be nothing you can do about it. + +

+ +

HTTP is designed with caching in mind. There is an entire class of devices (called “caching proxies”) whose only job is to sit between you and the rest of the world and minimize network access. Your company or ISP almost certainly maintains caching proxies, even if you’re unaware of them. They work because caching is built into the HTTP protocol. + +

Here’s a concrete example of how caching works. You visit diveintomark.org in your browser. That page includes a background image, wearehugh.com/m.jpg. When your browser downloads that image, the server includes the following HTTP headers: + +

HTTP/1.1 200 OK
+Date: Sun, 31 May 2009 17:14:04 GMT
+Server: Apache
+Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
+ETag: "3075-ddc8d800"
+Accept-Ranges: bytes
+Content-Length: 12405
+Cache-Control: max-age=31536000, public
+Expires: Mon, 31 May 2010 17:14:04 GMT
+Connection: close
+Content-Type: image/jpeg
+ +

The Cache-Control and Expires headers tell your browser (and any caching proxies between you and the server) that this image can be cached for up to a year. A year! And if, in the next year, you visit another page which also includes a link to this image, your browser will load the image from its cache without generating any network activity whatsoever. + +

But wait, it gets better. Let’s say your browser purges the image from your local cache for some reason. Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the HTTP headers said that this data could be cached by public caching proxies. (Technically, the important thing is what the headers don’t say; the Cache-Control header doesn’t have the private keyword, so this data is cacheable by default.) Caching proxies are designed to have tons of storage space, probably far more than your local browser has allocated. + +

If your company or ISP maintains a caching proxy, the proxy may still have the image cached. When you visit diveintomark.org again, your browser will look in its local cache for the image, but it won’t find it, so it will make a network request to try to download it from the remote server. But if the caching proxy still has a copy of the image, it will intercept that request and serve the image from its cache. That means that your request will never reach the remote server; in fact, it will never leave your company’s network. That makes for a faster download (fewer network hops) and saves your company money (less data being downloaded from the outside world). + +

HTTP caching only works when everybody does their part. On one side, servers need to send the correct headers in their response. On the other side, clients need to understand and respect those headers before they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as the servers and clients allow them to be. + +
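
To make the arithmetic behind those caching headers concrete, here is a minimal sketch of how a client might work out how long a response stays fresh. The helper name `freshness_lifetime` is hypothetical, and the logic is a simplification that looks only at `max-age`, `Expires` and `Date`; a real cache has more rules to follow.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

def freshness_lifetime(headers):
    """Return how long a response may be cached, as a timedelta (or None).

    Hypothetical helper: a simplified reading of the caching headers.
    """
    # A max-age directive in Cache-Control takes precedence over Expires.
    for directive in headers.get('Cache-Control', '').split(','):
        name, _, value = directive.strip().partition('=')
        if name.lower() == 'max-age' and value.isdigit():
            return timedelta(seconds=int(value))
    # Otherwise fall back to the gap between Expires and Date.
    if 'Expires' in headers and 'Date' in headers:
        return (parsedate_to_datetime(headers['Expires'])
                - parsedate_to_datetime(headers['Date']))
    return None

# Header values copied from the image response shown earlier.
headers = {
    'Date': 'Sun, 31 May 2009 17:14:04 GMT',
    'Cache-Control': 'max-age=31536000, public',
    'Expires': 'Mon, 31 May 2010 17:14:04 GMT',
}
lifetime = freshness_lifetime(headers)
print(lifetime.days)  # 365
```

Both headers agree here: 31536000 seconds and the gap between `Date` and `Expires` each come to 365 days.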

Python’s HTTP libraries do not support caching, but httplib2 does. + +

Last-Modified Checking

+ +

Some data never changes, while other data changes all the time. In between, there is a vast field of data that might have changed, but hasn’t. CNN.com’s feed is updated every few minutes, but my weblog’s feed may not change for days or weeks at a time. In the latter case, I don’t want to tell clients to cache my feed for weeks at a time, because then when I do actually post something, people may not read it for weeks (because they’re respecting my cache headers which said “don’t bother checking this feed for weeks”). On the other hand, I don’t want clients downloading my entire feed once an hour if it hasn’t changed! + +

+ +

HTTP has a solution to this, too. When you request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was last changed. That background image referenced from diveintomark.org included a Last-Modified header. + +

HTTP/1.1 200 OK
+Date: Sun, 31 May 2009 17:14:04 GMT
+Server: Apache
+Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
+ETag: "3075-ddc8d800"
+Accept-Ranges: bytes
+Content-Length: 12405
+Cache-Control: max-age=31536000, public
+Expires: Mon, 31 May 2010 17:14:04 GMT
+Connection: close
+Content-Type: image/jpeg
+
+ +

When you request the same data a second (or third or fourth) time, you can send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data has changed since then, the server ignores the If-Modified-Since header and just gives you the new data with a 200 status code. But if the data hasn’t changed since then, the server sends back a special HTTP 304 status code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the command line, using curl: + +

+you@localhost:~$ curl -I -H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT" http://wearehugh.com/m.jpg
+HTTP/1.1 304 Not Modified
+Date: Sun, 31 May 2009 18:04:39 GMT
+Server: Apache
+Connection: close
+ETag: "3075-ddc8d800"
+Expires: Mon, 31 May 2010 18:04:39 GMT
+Cache-Control: max-age=31536000, public
+ +

Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is the status code. Even after your cached copy has expired, last-modified checking ensures that you won’t download the same data twice if it hasn’t changed. (As an extra bonus, this 304 response also includes caching headers. Proxies will keep a copy of data even after it officially “expires,” in the hopes that the data hasn’t really changed and the next request responds with a 304 status code and updated cache information.) + +
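
The server’s side of that exchange boils down to a date comparison. The sketch below is a hypothetical `status_for` function, not code from any real server, and it assumes both dates arrive in the standard HTTP date format.

```python
from email.utils import parsedate_to_datetime

def status_for(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is still current, else 200.

    Hypothetical helper illustrating last-modified checking.
    """
    if if_modified_since is None:
        # No conditional header: always send the full response.
        return 200
    changed = parsedate_to_datetime(last_modified)
    seen = parsedate_to_datetime(if_modified_since)
    # Only re-send the data if it changed after the client last saw it.
    return 200 if changed > seen else 304

print(status_for('Fri, 22 Aug 2008 04:28:16 GMT',
                 'Fri, 22 Aug 2008 04:28:16 GMT'))  # 304
```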

Python’s HTTP libraries do not support last-modified date checking, but httplib2 does. + +

ETag Checking

+ +

ETags are an alternate way to accomplish the same thing as last-modified checking. With ETags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is determined is entirely up to the server. The only requirement is that it changes when the data changes.) That background image referenced from diveintomark.org had an ETag header. + +

HTTP/1.1 200 OK
+Date: Sun, 31 May 2009 17:14:04 GMT
+Server: Apache
+Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
+ETag: "3075-ddc8d800"
+Accept-Ranges: bytes
+Content-Length: 12405
+Cache-Control: max-age=31536000, public
+Expires: Mon, 31 May 2010 17:14:04 GMT
+Connection: close
+Content-Type: image/jpeg
+
+ + + +

The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn’t changed, the server will send you back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn’t send you the same data a second time. By including the ETag hash in your second request, you’re telling the server that there’s no need to re-send the same data if it still matches this hash, since you still have the data from the last time. + +

Again with the curl: + +

+you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg  
+HTTP/1.1 304 Not Modified
+Date: Sun, 31 May 2009 18:04:39 GMT
+Server: Apache
+Connection: close
+ETag: "3075-ddc8d800"
+Expires: Mon, 31 May 2010 18:04:39 GMT
+Cache-Control: max-age=31536000, public
+
    +
  1. ETags are commonly enclosed in quotation marks, but the quotation marks are part of the value. That means you need to send the quotation marks back to the server in the If-None-Match header. +
+ +
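
On the client side, both kinds of validation amount to echoing back values from the previous response. Here is a minimal sketch, assuming the earlier response headers were saved in a plain dictionary; `conditional_headers` is a hypothetical name.

```python
def conditional_headers(cached):
    """Build request headers that let the server answer with a 304.

    `cached` is assumed to be a dict of headers saved from an earlier response.
    """
    headers = {}
    if 'ETag' in cached:
        # Send the value back verbatim, quotation marks included.
        headers['If-None-Match'] = cached['ETag']
    if 'Last-Modified' in cached:
        headers['If-Modified-Since'] = cached['Last-Modified']
    return headers

cached = {
    'ETag': '"3075-ddc8d800"',
    'Last-Modified': 'Fri, 22 Aug 2008 04:28:16 GMT',
}
print(conditional_headers(cached))
```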

Python’s HTTP libraries do not support ETags, but httplib2 does. + +

Compression

+ +

When you talk about HTTP web services, you’re almost always talking about moving text-based data back and forth over the wire. Maybe it’s XML, maybe it’s JSON, maybe it’s just plain text. Regardless of the format, text compresses well. The example feed in the XML chapter is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That’s just 30% of the original size! + +

HTTP supports several compression algorithms. The two most common types are gzip and deflate. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include an Accept-encoding header in your request that lists which compression algorithms you support. If the server supports any of the same algorithms, it will send you back compressed data (with a Content-encoding header that tells you which algorithm it used). Then it’s up to you to decompress the data. + +
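
The “up to you to decompress” step can be sketched with the standard library. The `decode_body` helper is hypothetical, and it assumes that a deflate body arrives in the zlib-wrapped form; servers that send raw deflate streams would need different handling.

```python
import gzip
import zlib

def decode_body(body, content_encoding=None):
    """Undo the compression named in a Content-Encoding header.

    Hypothetical helper covering only gzip, deflate and identity.
    """
    encoding = (content_encoding or 'identity').lower()
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        # Assumption: the body is zlib-wrapped, not a raw deflate stream.
        return zlib.decompress(body)
    if encoding == 'identity':
        return body
    raise ValueError('unsupported Content-Encoding: ' + encoding)

# Repetitive, text-based data compresses well.
original = b'<entry><title>dive into mark</title></entry>\n' * 50
compressed = gzip.compress(original)
print(len(original), len(compressed))
print(decode_body(compressed, 'gzip') == original)  # True
```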

+

Important tip for server-side developers: make sure that the compressed version of a resource has a different ETag than the uncompressed version. Otherwise, caching proxies will get confused and may serve the compressed version to clients that can’t handle it. Read the discussion of Apache bug 39727 for more details on this subtle issue. +

+ +

Python’s HTTP libraries do not support compression, but httplib2 does. + +

Redirects

+ +

Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml. + +

+ +

Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything’s normal, here’s the page you asked for”. Status code 404 means “page not found”. (You’ve probably seen 404 errors while browsing the web.) Status codes in the 300’s indicate some form of redirection. + +

HTTP has several different ways of signifying that a resource has moved. The two most common techniques are status codes 302 and 301. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you’re supposed to use the new address from then on. + +

The urllib.request module automatically “follows” redirects when it receives the appropriate status code from the HTTP server, but it doesn’t tell you that it did so. You’ll end up getting the data you asked for, but you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue pounding away at the old address, and each time you’ll get redirected to the new address, and each time the urllib.request module will “helpfully” follow the redirect. In other words, it treats permanent redirects the same as temporary redirects. That means two round trips instead of one, which is bad for the server and bad for you. + +

httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred, it will keep track of them locally and automatically rewrite redirected URLs before requesting them. + +
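
The difference between the two redirect types is easy to model. The class below is a hypothetical illustration of the rule (remember a 301, forget a 302); it is not how httplib2 is actually implemented.

```python
class RedirectMemo:
    """Remember permanent redirects so later requests skip the old address.

    Hypothetical illustration only; not httplib2's implementation.
    """

    def __init__(self):
        self._permanent = {}

    def record(self, status, old_url, new_url):
        # Only a 301 changes where future requests should go.
        if status == 301:
            self._permanent[old_url] = new_url

    def resolve(self, url):
        # Follow the chain of remembered moves, guarding against loops.
        seen = set()
        while url in self._permanent and url not in seen:
            seen.add(url)
            url = self._permanent[url]
        return url

memo = RedirectMemo()
memo.record(301, 'http://example.com/index.xml',
            'http://example.com/xml/atom.xml')
memo.record(302, 'http://example.com/temp.xml',
            'http://example.com/elsewhere.xml')
print(memo.resolve('http://example.com/index.xml'))  # the new address
print(memo.resolve('http://example.com/temp.xml'))   # still the old address
```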

⁂ + +

How Not To Fetch Data Over HTTP

+ +

Let’s say you want to download a resource over HTTP, such as an Atom feed. Since it’s a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better. +

+>>> import urllib.request
+>>> a_url = 'http://diveintopython3.org/examples/feed.xml'
+>>> data = urllib.request.urlopen(a_url).read()  
+>>> type(data)                                   
+<class 'bytes'>
+>>> print(data)
+<?xml version='1.0' encoding='utf-8'?>
+<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
+  <title>dive into mark</title>
+  <subtitle>currently between addictions</subtitle>
+  <id>tag:diveintomark.org,2001-07-29:/</id>
+  <updated>2009-03-27T21:56:07Z</updated>
+  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
+  …
+
+
    +
  1. Downloading anything over HTTP is incredibly easy in Python; in fact, it’s a one-liner. The urllib.request module has a handy urlopen() function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can’t get any easier. +
  2. The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get bytes. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it to a string. +
+ +
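
Converting those bytes to a string is a single method call once you know the encoding. The sketch below uses a made-up byte string rather than the feed, and it assumes the data is UTF-8.

```python
# Five bytes on the wire; the last two encode a single character in UTF-8.
data = b'caf\xc3\xa9'

# Assumption: the bytes are UTF-8. A wrong guess here would raise
# UnicodeDecodeError or silently produce the wrong characters.
text = data.decode('utf-8')

print(len(data), len(text))  # 5 4
print(type(text))            # <class 'str'>
```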

So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis (e.g. requesting this feed once an hour), then you’re being inefficient, and you’re being rude. + +

⁂ + +

What’s On The Wire?

+ +

To see why this is inefficient and rude, let’s turn on the debugging features of Python’s HTTP library and see what’s being sent “on the wire” (i.e. over the network). + +

+>>> from http.client import HTTPConnection
+>>> HTTPConnection.debuglevel = 1                                       
+>>> from urllib.request import urlopen
+>>> response = urlopen('http://diveintopython3.org/examples/feed.xml')  
+send: b'GET /examples/feed.xml HTTP/1.1                                 
+Host: diveintopython3.org                                               
+Accept-Encoding: identity                                               
+User-Agent: Python-urllib/3.1'                                          
+Connection: close
+reply: 'HTTP/1.1 200 OK'
+…further debugging information omitted…
+
    +
  1. As I mentioned at the beginning of the chapter, urllib.request relies on another standard Python library, http.client. Normally you don’t need to touch http.client directly. (The urllib.request module imports it automatically.) But we import it here so we can toggle the debugging flag on the HTTPConnection class that urllib.request uses to connect to the HTTP server. +
  2. Now that the debugging flag is set, information on the HTTP request and response is printed out in real time. As you can see, when you request the Atom feed, the urllib.request module sends five lines to the server. +
  3. The first line specifies the HTTP verb you’re using, and the path of the resource (minus the domain name). +
  4. The second line specifies the domain name from which we’re requesting this feed. +
  5. The third line specifies the compression algorithms that the client supports. As I mentioned earlier, urllib.request does not support compression by default. +
  6. The fourth line specifies the name of the library that is making the request. By default, this is Python-urllib plus a version number. Both urllib.request and httplib2 support changing the user agent, simply by adding a User-Agent header to the request (which will override the default value). +
+ + + +
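
With urllib.request, overriding those default headers might look like the sketch below. The user agent string is a made-up placeholder, and nothing is sent over the network because the request is built but never opened.

```python
import urllib.request

# Build (but do not send) a request with custom headers.
# 'my-feed-reader/1.0' is a hypothetical user agent string.
request = urllib.request.Request(
    'http://diveintopython3.org/examples/feed.xml',
    headers={
        'User-Agent': 'my-feed-reader/1.0',
        'Accept-Encoding': 'gzip',
    },
)

# Request normalizes header names with str.capitalize(), so look them
# up as 'User-agent' and 'Accept-encoding'.
print(request.get_header('User-agent'))
print(request.get_header('Accept-encoding'))
```

Asking for gzip this way only changes the request. urllib.request will not decompress the response for you, so your own code would have to do that.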

Now let’s look at what the server sent back in its response. + +

+# continued from previous example
+>>> print(response.headers.as_string())        
+Date: Sun, 31 May 2009 19:23:06 GMT            
+Server: Apache
+Last-Modified: Sun, 31 May 2009 06:39:55 GMT   
+ETag: "bfe-93d9c4c0"                           
+Accept-Ranges: bytes
+Content-Length: 3070                           
+Cache-Control: max-age=86400                   
+Expires: Mon, 01 Jun 2009 19:23:06 GMT
+Vary: Accept-Encoding
+Connection: close
+Content-Type: application/xml
+>>> data = response.read()                     
+>>> len(data)
+3070
+
    +
  1. The response returned from the urllib.request.urlopen() function contains all the HTTP headers the server sent back. It also contains methods to download the actual data; we’ll get to that in a minute. +
  2. The server tells you when it handled your request. +
  3. This response includes a Last-Modified header. +
  4. This response includes an ETag header. +
  5. The data is 3070 bytes long. Notice what isn’t here: a Content-encoding header. Your request stated that you only accept uncompressed data (Accept-encoding: identity), and sure enough, this response contains uncompressed data. +
  6. This response includes caching headers that state that this feed can be cached for up to 24 hours (86400 seconds). +
  7. And finally, download the actual data by calling response.read(). As you can tell from the len() function, this downloads all 3070 bytes at once. +
+ +

As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a fact that this server supports gzip compression, but HTTP compression is opt-in. We didn’t ask for it, so we didn’t get it. That means we’re downloading 3070 bytes when we could have just downloaded 941. Bad dog, no biscuit. + +

But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time. + +

+# continued from the previous example
+>>> response2 = urlopen('http://diveintopython3.org/examples/feed.xml')
+send: b'GET /examples/feed.xml HTTP/1.1
+Host: diveintopython3.org
+Accept-Encoding: identity
+User-Agent: Python-urllib/3.1'
+Connection: close
+reply: 'HTTP/1.1 200 OK'
+…further debugging information omitted…
+ +

Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No sign of If-Modified-Since headers. No sign of If-None-Match headers. No respect for the caching headers. Still no compression. + +

And what happens when you do the same thing twice? You get the same response. Twice. + +

+# continued from the previous example
+>>> print(response2.headers.as_string())     
+Date: Mon, 01 Jun 2009 03:58:00 GMT
+Server: Apache
+Last-Modified: Sun, 31 May 2009 22:51:11 GMT
+ETag: "bfe-255ef5c0"
+Accept-Ranges: bytes
+Content-Length: 3070
+Cache-Control: max-age=86400
+Expires: Tue, 02 Jun 2009 03:58:00 GMT
+Vary: Accept-Encoding
+Connection: close
+Content-Type: application/xml
+>>> data2 = response2.read()
+>>> len(data2)                               
+3070
+>>> data2 == data                            
+True
+
    +
  1. The server is still sending the same array of “smart” headers: Cache-Control and Expires to allow caching, Last-Modified and ETag to enable “not-modified” tracking. Even the Vary: Accept-Encoding header hints that the server would support compression, if only you would ask for it. But you didn’t. +
  2. Once again, fetching this data downloads the whole 3070 bytes… +
  3. …the exact same 3070 bytes you downloaded last time. +
+ +

HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that speaks HTTP fluently. + +

⁂ + +

Introducing httplib2

+ +

Before you can use httplib2, you’ll need to install it. Visit code.google.com/p/httplib2/ and download the latest version. httplib2 is available for Python 2.x and Python 3.x; make sure you get the Python 3 version, named something like httplib2-python3-0.5.0.zip. + +

Unzip the archive, open a terminal window, and go to the newly created httplib2 directory. On Windows, open the Start menu, select Run..., type cmd.exe and press ENTER. + +

+c:\Users\pilgrim\Downloads> dir
+ Volume in drive C has no label.
+ Volume Serial Number is DED5-B4F8
+
+ Directory of c:\Users\pilgrim\Downloads
+
+07/28/2009  12:36 PM    <DIR>          .
+07/28/2009  12:36 PM    <DIR>          ..
+07/28/2009  12:36 PM    <DIR>          httplib2-python3-0.5.0
+07/28/2009  12:33 PM            18,997 httplib2-python3-0.5.0.zip
+               1 File(s)         18,997 bytes
+               3 Dir(s)  61,496,684,544 bytes free
+
+c:\Users\pilgrim\Downloads> cd httplib2-python3-0.5.0
+c:\Users\pilgrim\Downloads\httplib2-python3-0.5.0> c:\python31\python.exe setup.py install
+running install
+running build
+running build_py
+running install_lib
+creating c:\python31\Lib\site-packages\httplib2
+copying build\lib\httplib2\iri2uri.py -> c:\python31\Lib\site-packages\httplib2
+copying build\lib\httplib2\__init__.py -> c:\python31\Lib\site-packages\httplib2
+byte-compiling c:\python31\Lib\site-packages\httplib2\iri2uri.py to iri2uri.pyc
+byte-compiling c:\python31\Lib\site-packages\httplib2\__init__.py to __init__.pyc
+running install_egg_info
+Writing c:\python31\Lib\site-packages\httplib2-python3_0.5.0-py3.1.egg-info
+ +

On Mac OS X, run the Terminal.app application in your /Applications/Utilities/ folder. On Linux, run the Terminal application, which is usually in your Applications menu under Accessories or System. + +

+you@localhost:~/Desktop$ unzip httplib2-python3-0.5.0.zip
+Archive:  httplib2-python3-0.5.0.zip
+  inflating: httplib2-python3-0.5.0/README
+  inflating: httplib2-python3-0.5.0/setup.py
+  inflating: httplib2-python3-0.5.0/PKG-INFO
+  inflating: httplib2-python3-0.5.0/httplib2/__init__.py
+  inflating: httplib2-python3-0.5.0/httplib2/iri2uri.py
+you@localhost:~/Desktop$ cd httplib2-python3-0.5.0/
+you@localhost:~/Desktop/httplib2-python3-0.5.0$ sudo python3 setup.py install
+running install
+running build
+running build_py
+creating build
+creating build/lib.linux-x86_64-3.1
+creating build/lib.linux-x86_64-3.1/httplib2
+copying httplib2/iri2uri.py -> build/lib.linux-x86_64-3.1/httplib2
+copying httplib2/__init__.py -> build/lib.linux-x86_64-3.1/httplib2
+running install_lib
+creating /usr/local/lib/python3.1/dist-packages/httplib2
+copying build/lib.linux-x86_64-3.1/httplib2/iri2uri.py -> /usr/local/lib/python3.1/dist-packages/httplib2
+copying build/lib.linux-x86_64-3.1/httplib2/__init__.py -> /usr/local/lib/python3.1/dist-packages/httplib2
+byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/iri2uri.py to iri2uri.pyc
+byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/__init__.py to __init__.pyc
+running install_egg_info
+Writing /usr/local/lib/python3.1/dist-packages/httplib2-python3_0.5.0.egg-info
+ +

To use httplib2, create an instance of the httplib2.Http class. + +

+>>> import httplib2
+>>> h = httplib2.Http('.cache')                                                    
+>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')  
+>>> response.status                                                                
+200
+>>> content[:52]                                                                   
+b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
+>>> len(content)
+3070
+
    +
  1. The primary interface to httplib2 is the Http object. For reasons you’ll see in the next section, you should always pass a directory name when you create an Http object. The directory does not need to exist; httplib2 will create it if necessary. +
  2. Once you have an Http object, retrieving data is as simple as calling the request() method with the address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll see how to issue other HTTP requests, like POST.) +
  3. The request() method returns two values. The first is an httplib2.Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request was successful. +
  4. The content variable contains the actual data that was returned by the HTTP server. The data is returned as a bytes object, not a string. If you want it as a string, you’ll need to determine the character encoding and convert it yourself. +
+ +
+

You probably only need one httplib2.Http object. There are valid reasons for creating more than one, but you should only do so if you know why you need them. “I need to request data from two different URLs” is not a valid reason. Re-use the Http object and just call the request() method twice. +

+ +

A Short Digression To Explain Why httplib2 Returns Bytes Instead of Strings

+ +

Bytes. Strings. What a pain. Why can’t httplib2 “just” do the conversion for you? Well, it’s complicated, because the rules for determining the character encoding are specific to what kind of resource you’re requesting. How could httplib2 know what kind of resource you’re requesting? It’s usually listed in the Content-Type HTTP header, but that’s an optional feature of HTTP and not all HTTP servers include it. If that header is not included in the HTTP response, it’s left up to the client to guess. (This is commonly called “content sniffing,” and it’s never perfect.) + +

If you know what sort of resource you’re expecting (an XML document in this case), perhaps you could “just” pass the returned bytes object to the xml.etree.ElementTree.parse() function. That’ll work as long as the XML document includes information on its own character encoding (as this one does), but that’s an optional feature and not all XML documents do that. If an XML document doesn’t include encoding information, the client is supposed to look at the enclosing transport — i.e. the Content-Type HTTP header, which can include a charset parameter. + +

[I support RFC 3023 t-shirt] + +

But it’s worse than that. Now character encoding information can be in two places: within the XML document itself, and within the Content-Type HTTP header. If the information is in both places, which one wins? According to RFC 3023 (I swear I am not making this up), if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is + +

    +
  1. the encoding given in the charset parameter of the Content-Type HTTP header, or +
  2. the encoding given in the encoding attribute of the XML declaration within the document, or +
  3. UTF-8 +
+ +

On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is + +

    +
  1. the encoding given in the charset parameter of the Content-Type HTTP header, or +
  2. us-ascii +
+ +
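
The two sets of rules above can be sketched as a single function. This is a simplified illustration under stated assumptions: `xml_encoding` is a hypothetical name, the function does not check that the media type is an XML type at all, and it assumes the document starts with an ASCII-compatible XML declaration (so it ignores byte order marks and UTF-16).

```python
import re

# Matches the encoding attribute of an XML declaration at the start of a document.
XML_DECLARATION = re.compile(
    rb'''<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']''')

def xml_encoding(content_type, body):
    """Pick a character encoding following the rules listed above.

    Hypothetical, simplified helper; not a full RFC 3023 implementation.
    """
    media_type, _, params = content_type.partition(';')
    media_type = media_type.strip().lower()

    # Rule 1 for both families: the charset parameter wins if present.
    for param in params.split(';'):
        name, _, value = param.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'')

    # text/xml and friends ignore the XML declaration entirely.
    if media_type.startswith('text/'):
        return 'us-ascii'

    # application/xml and friends fall back to the declaration, then UTF-8.
    declared = XML_DECLARATION.match(body)
    if declared:
        return declared.group(1).decode('ascii')
    return 'utf-8'

doc = b"<?xml version='1.0' encoding='koi8-r'?><feed/>"
print(xml_encoding('application/atom+xml', doc))
print(xml_encoding('application/atom+xml; charset=iso-8859-1', doc))
print(xml_encoding('text/xml', doc))
```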

And that’s just for XML documents. For HTML documents, web browsers have constructed such byzantine rules for content-sniffing [PDF] that we’re still trying to figure them all out. + +

“Patches welcome.” + +

How httplib2 Handles Caching

+ +

Remember in the previous section when I said you should always create an httplib2.Http object with a directory name? Caching is the reason. + +

+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml')  
+>>> response2.status                                                                 
+200
+>>> content2[:52]                                                                    
+b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="
+>>> len(content2)
+3070
+
    +
  1. This shouldn’t be terribly surprising. It’s the same thing you did last time, except you’re putting the result into two new variables. +
  2. The HTTP status is once again 200, just like last time. +
  3. The downloaded content is the same as last time, too. +
+ +

So… who cares? Quit your Python interactive shell and relaunch it with a new session, and I’ll show you. + +

+# NOT continued from previous example!
+# Please exit out of the interactive shell
+# and launch a new one.
+>>> import httplib2
+>>> httplib2.debuglevel = 1                                                        
+>>> h = httplib2.Http('.cache')                                                    
+>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')  
+>>> len(content)                                                                   
+3070
+>>> response.status                                                                
+200
+>>> response.fromcache                                                             
+True
+
    +
  1. Let’s turn on debugging and see what’s on the wire. This is the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being sent back. +
  2. Create an httplib2.Http object with the same directory name as before. +
  3. Request the same URL as before. Nothing appears to happen. More precisely, nothing gets sent to the server, and nothing gets returned from the server. There is absolutely no network activity whatsoever. +
  4. Yet we did “receive” some data — in fact, we received all of it. +
  5. We also “received” an HTTP status code indicating that the “request” was successful. +
  6. Here’s the rub: this “response” was generated from httplib2’s local cache. That directory name you passed in when you created the httplib2.Http object — that directory holds httplib2’s cache of all the operations it’s ever performed. +
+ + + +
+

If you want to turn on httplib2 debugging, you need to set a module-level constant (httplib2.debuglevel), then create a new httplib2.Http object. If you want to turn off debugging, you need to change the same module-level constant, then create a new httplib2.Http object. +

+ +

You previously requested the data at this URL. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 24 hours (Cache-Control: max-age=86400, which is 24 hours measured in seconds). httplib2 understands and respects those caching headers, and it stored the previous response in the .cache directory (which you passed in when you created the Http object). That cache hasn’t expired yet, so the second time you request the data at this URL, httplib2 simply returns the cached result without ever hitting the network. + +

I say “simply,” but obviously there is a lot of complexity hidden behind that simplicity. httplib2 handles HTTP caching automatically and by default. If for some reason you need to know whether a response came from the cache, you can check response.fromcache. Otherwise, it Just Works. + +
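To make the freshness decision concrete, here is a deliberately simplified sketch. It is not httplib2's actual code: the function name is_fresh is hypothetical, and it only looks at max-age and the Date header, while a real HTTP cache also weighs Expires, Age, no-cache and other directives.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def is_fresh(headers, now):
    """Return True if a cached response may be reused without the network.

    headers is a dictionary of lowercased response headers, like the ones
    httplib2 hands back; now is a timezone-aware datetime.
    """
    match = re.search(r'max-age=(\d+)', headers.get('cache-control', ''))
    if match is None or 'date' not in headers:
        return False    # no freshness information: don't assume anything
    fetched = parsedate_to_datetime(headers['date'])
    lifetime = timedelta(seconds=int(match.group(1)))
    return now < fetched + lifetime

# Header values taken from the feed response shown in this chapter.
cached = {'date': 'Tue, 02 Jun 2009 00:40:26 GMT',
          'cache-control': 'max-age=86400'}
twelve_hours_later = datetime(2009, 6, 2, 12, 40, 26, tzinfo=timezone.utc)
two_days_later = datetime(2009, 6, 4, 0, 40, 26, tzinfo=timezone.utc)
```

With these values, is_fresh(cached, twelve_hours_later) is true and is_fresh(cached, two_days_later) is false, which matches the 24-hour lifetime described above.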

Now, suppose you have data cached, but you want to bypass the cache and re-request it from the remote server. Browsers sometimes do this if the user specifically requests it. For example, pressing F5 refreshes the current page, but pressing Ctrl+F5 bypasses the cache and re-requests the current page from the remote server. You might think “oh, I’ll just delete the data from my local cache, then request it again.” You could do that, but remember that there may be more parties involved than just you and the remote server. What about those intermediate proxy servers? They’re completely beyond your control, and they may still have that data cached, and will happily return it to you because (as far as they are concerned) their cache is still valid. + +

Instead of manipulating your local cache and hoping for the best, you should use the features of HTTP to ensure that your request actually reaches the remote server. + +

+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',
+...     headers={'cache-control':'no-cache'})  
+connect: (diveintopython3.org, 80)             
+send: b'GET /examples/feed.xml HTTP/1.1
+Host: diveintopython3.org
+user-agent: Python-httplib2/$Rev: 259 $
+accept-encoding: deflate, gzip
+cache-control: no-cache'
+reply: 'HTTP/1.1 200 OK'
+…further debugging information omitted…
+>>> response2.status
+200
+>>> response2.fromcache                        
+False
+>>> print(dict(response2.items()))             
+{'status': '200',
+ 'content-length': '3070',
+ 'content-location': 'http://diveintopython3.org/examples/feed.xml',
+ 'accept-ranges': 'bytes',
+ 'expires': 'Wed, 03 Jun 2009 00:40:26 GMT',
+ 'vary': 'Accept-Encoding',
+ 'server': 'Apache',
+ 'last-modified': 'Sun, 31 May 2009 22:51:11 GMT',
+ 'connection': 'close',
+ '-content-encoding': 'gzip',
+ 'etag': '"bfe-255ef5c0"',
+ 'cache-control': 'max-age=86400',
+ 'date': 'Tue, 02 Jun 2009 00:40:26 GMT',
+ 'content-type': 'application/xml'}
+
    +
  1. httplib2 allows you to add arbitrary HTTP headers to any outgoing request. In order to bypass all caches (not just your local disk cache, but also any caching proxies between you and the remote server), add a no-cache header in the headers dictionary. +
  2. Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in both directions — as part of the incoming response and as part of the outgoing request. It noticed that you added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the network to request the data. +
  3. This response was not generated from your local cache. You knew that, of course, because you saw the debugging information on the outgoing request. But it’s nice to have that programmatically verified. +
  4. The request succeeded; you downloaded the entire feed again from the remote server. Of course, the server also sent back a full complement of HTTP headers along with the feed data. That includes caching headers, which httplib2 uses to update its local cache, in the hopes of avoiding network access the next time you request this feed. Everything about HTTP caching is designed to maximize cache hits and minimize network access. Even though you bypassed the cache this time, the remote server would really appreciate it if you would cache the result for next time. +
+ +

How httplib2 Handles Last-Modified and ETag Headers

+ +

The Cache-Control and Expires caching headers are called freshness indicators. They tell caches in no uncertain terms that you can completely avoid all network access until the cache expires. And that’s exactly the behavior you saw in the previous section: given a freshness indicator, httplib2 does not generate a single byte of network activity to serve up cached data (unless you explicitly bypass the cache, of course). + +

But what about the case where the data might have changed, but hasn’t? HTTP defines Last-Modified and ETag headers for this purpose. These headers are called validators. If the local cache is no longer fresh, a client can send the validators with the next request to see if the data has actually changed. If the data hasn’t changed, the server sends back a 304 status code and no data. So there’s still a round-trip over the network, but you end up downloading fewer bytes. + +

+>>> import httplib2
+>>> httplib2.debuglevel = 1
+>>> h = httplib2.Http('.cache')
+>>> response, content = h.request('http://diveintopython3.org/')  
+connect: (diveintopython3.org, 80)
+send: b'GET / HTTP/1.1
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 200 OK'
+>>> print(dict(response.items()))                                 
+{'-content-encoding': 'gzip',
+ 'accept-ranges': 'bytes',
+ 'connection': 'close',
+ 'content-length': '6657',
+ 'content-location': 'http://diveintopython3.org/',
+ 'content-type': 'text/html',
+ 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
+ 'etag': '"7f806d-1a01-9fb97900"',
+ 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
+ 'server': 'Apache',
+ 'status': '200',
+ 'vary': 'Accept-Encoding,User-Agent'}
+>>> len(content)                                                  
+6657
+
    +
  1. Instead of the feed, this time we’re going to download the site’s home page, which is HTML. Since this is the first time you’ve ever requested this page, httplib2 has little to work with, and it sends out a minimum of headers with the request. +
  2. The response contains a multitude of HTTP headers… but no caching information. However, it does include both an ETag and Last-Modified header. +
  3. At the time I constructed this example, this page was 6657 bytes. It’s probably changed since then, but don’t worry about it. +
+ +
+# continued from the previous example
+>>> response, content = h.request('http://diveintopython3.org/')  
+connect: (diveintopython3.org, 80)
+send: b'GET / HTTP/1.1
+Host: diveintopython3.org
+if-none-match: "7f806d-1a01-9fb97900"                             
+if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT                  
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 304 Not Modified'                                
+>>> response.fromcache                                            
+True
+>>> response.status                                               
+200
+>>> response.dict['status']                                       
+'304'
+>>> len(content)                                                  
+6657
+
    +
  1. You request the same page again, with the same Http object (and the same local cache). +
  2. httplib2 sends the ETag validator back to the server in the If-None-Match header. +
  3. httplib2 also sends the Last-Modified validator back to the server in the If-Modified-Since header. +
  4. The server looked at these validators, looked at the page you requested, and determined that the page has not changed since you last requested it, so it sends back a 304 status code and no data. +
  5. Back on the client, httplib2 notices the 304 status code and loads the content of the page from its cache. +
  6. This might be a bit confusing. There are really two status codes — 304 (returned from the server this time, which caused httplib2 to look in its cache), and 200 (returned from the server last time, and stored in httplib2’s cache along with the page data). response.status returns the status from the cache. +
  7. If you want the raw status code returned from the server, you can get that by looking in response.dict, which is a dictionary of the actual headers returned from the server. +
  8. However, you still get the data in the content variable. Generally, you don’t need to know why a response was served from the cache. (You may not even care that it was served from the cache at all, and that’s fine too. httplib2 is smart enough to let you act dumb.) By the time the request() method returns to the caller, httplib2 has already updated its cache and returned the data to you. +
+ +
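The client side of that exchange boils down to copying two values from the cached response into two differently named request headers. A minimal sketch, assuming the cached headers are available as a dictionary with lowercased keys; conditional_headers is a hypothetical helper, not an httplib2 function.

```python
def conditional_headers(cached_headers):
    """Build revalidation request headers from a cached response's validators."""
    headers = {}
    if 'etag' in cached_headers:
        # ETag comes back to the server as If-None-Match.
        headers['if-none-match'] = cached_headers['etag']
    if 'last-modified' in cached_headers:
        # Last-Modified comes back as If-Modified-Since.
        headers['if-modified-since'] = cached_headers['last-modified']
    return headers

# Validators from the home page response shown above.
cached = {'etag': '"7f806d-1a01-9fb97900"',
          'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
          'status': '200'}
revalidation = conditional_headers(cached)
```

The resulting dictionary carries the same two lines you saw in the debugging output of the second request.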

How httplib2 Handles Compression

+ + + +

HTTP supports several types of compression; the two most common types are gzip and deflate. httplib2 supports both of these. + +

+>>> response, content = h.request('http://diveintopython3.org/')
+connect: (diveintopython3.org, 80)
+send: b'GET / HTTP/1.1
+Host: diveintopython3.org
+accept-encoding: deflate, gzip                          
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 200 OK'
+>>> print(dict(response.items()))
+{'-content-encoding': 'gzip',                           
+ 'accept-ranges': 'bytes',
+ 'connection': 'close',
+ 'content-length': '6657',
+ 'content-location': 'http://diveintopython3.org/',
+ 'content-type': 'text/html',
+ 'date': 'Tue, 02 Jun 2009 03:26:54 GMT',
+ 'etag': '"7f806d-1a01-9fb97900"',
+ 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',
+ 'server': 'Apache',
+ 'status': '304',
+ 'vary': 'Accept-Encoding,User-Agent'}
+
    +
  1. Every time httplib2 sends a request, it includes an Accept-Encoding header to tell the server that it can handle either deflate or gzip compression. +
  2. In this case, the server has responded with a gzip-compressed payload. By the time the request() method returns, httplib2 has already decompressed the body of the response and placed it in the content variable. If you’re curious about whether or not the response was compressed, you can check response['-content-encoding']; otherwise, don’t worry about it. +
+ +
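Conceptually, the decompression step looks something like the sketch below. It uses only the standard library; decode_body is a hypothetical name, and the code assumes that deflate bodies use the zlib wrapper, which is what the HTTP specification calls for but not what every server sends.

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Undo the compression named in a Content-Encoding header."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    if content_encoding == 'deflate':
        return zlib.decompress(body)
    return body    # no compression, or one this sketch doesn't handle

# Simulate what a server would do before sending the page.
page = b'<!DOCTYPE html><title>Dive Into Python 3</title>' * 20
gzipped = gzip.compress(page)
deflated = zlib.compress(page)
```

Both decode_body(gzipped, 'gzip') and decode_body(deflated, 'deflate') give back the original bytes, and the compressed forms are much shorter than the page itself.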

How httplib2 Handles Redirects

+ +

HTTP defines two kinds of redirects: temporary and permanent. There’s nothing special to do with temporary redirects except follow them, which httplib2 does automatically. + +

+>>> import httplib2
+>>> httplib2.debuglevel = 1
+>>> h = httplib2.Http('.cache')
+>>> response, content = h.request('http://diveintopython3.org/examples/feed-302.xml')  
+connect: (diveintopython3.org, 80)
+send: b'GET /examples/feed-302.xml HTTP/1.1                                            
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 302 Found'                                                            
+send: b'GET /examples/feed.xml HTTP/1.1                                                
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 200 OK'
+
    +
  1. There is no feed at this URL. I’ve set up my server to issue a temporary redirect to the correct address. +
  2. There’s the request. +
  3. And there’s the response: 302 Found. Not shown here, this response also includes a Location header that points to the real URL. +
  4. httplib2 immediately turns around and “follows” the redirect by issuing another request for the URL given in the Location header: http://diveintopython3.org/examples/feed.xml +
+ +

“Following” a redirect is nothing more than this example shows. httplib2 sends a request for the URL you asked for. The server comes back with a response that says “No no, look over there instead.” httplib2 sends another request for the new URL. + +

+# continued from the previous example
+>>> response                                                          
+{'status': '200',
+ 'content-length': '3070',
+ 'content-location': 'http://diveintopython3.org/examples/feed.xml',  
+ 'accept-ranges': 'bytes',
+ 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
+ 'vary': 'Accept-Encoding',
+ 'server': 'Apache',
+ 'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT',
+ 'connection': 'close',
+ '-content-encoding': 'gzip',                                         
+ 'etag': '"bfe-4cbbf5c0"',
+ 'cache-control': 'max-age=86400',                                    
+ 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
+ 'content-type': 'application/xml'}
+
    +
  1. The response you get back from this single call to the request() method is the response from the final URL. +
  2. httplib2 adds the final URL to the response dictionary, as content-location. This is not a header that came from the server; it’s specific to httplib2. +
  3. Apropos of nothing, this feed is compressed. +
  4. And cacheable. (This is important, as you’ll see in a minute.) +
+ +

The response you get back gives you information about the final URL. What if you want more information about the intermediate URLs, the ones that eventually redirected to the final URL? httplib2 lets you do that, too. + +

+# continued from the previous example
+>>> response.previous                                                     
+{'status': '302',
+ 'content-length': '228',
+ 'content-location': 'http://diveintopython3.org/examples/feed-302.xml',
+ 'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',
+ 'server': 'Apache',
+ 'connection': 'close',
+ 'location': 'http://diveintopython3.org/examples/feed.xml',
+ 'cache-control': 'max-age=86400',
+ 'date': 'Wed, 03 Jun 2009 02:21:41 GMT',
+ 'content-type': 'text/html; charset=iso-8859-1'}
+>>> type(response)                                                        
+<class 'httplib2.Response'>
+>>> type(response.previous)
+<class 'httplib2.Response'>
+>>> response.previous.previous                                            
+>>>
+
    +
  1. The response.previous attribute holds a reference to the previous response object that httplib2 followed to get to the current response object. +
  2. Both response and response.previous are httplib2.Response objects. +
  3. That means you can check response.previous.previous to follow the redirect chain backwards even further. (Scenario: one URL redirects to a second URL which redirects to a third URL. It could happen!) In this case, we’ve already reached the beginning of the redirect chain, so the attribute is None. +
+ +
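If you want the whole chain at once, you can walk the previous attributes until you hit None. The helper below is a small sketch; redirect_chain is a hypothetical name, and the example uses stand-in objects rather than real httplib2.Response instances so that it runs without a network connection.

```python
from types import SimpleNamespace

def redirect_chain(response):
    """Return every response in a redirect chain, oldest first."""
    chain = []
    while response is not None:
        chain.append(response)
        response = response.previous
    chain.reverse()
    return chain

# Stand-ins that mimic only the two attributes this sketch relies on.
first = SimpleNamespace(status=302, previous=None)
final = SimpleNamespace(status=200, previous=first)
statuses = [r.status for r in redirect_chain(final)]
```

Here statuses comes out as the 302 followed by the 200, the same order in which the responses arrived.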

What happens if you request the same URL again? + +

+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-302.xml')  
+connect: (diveintopython3.org, 80)
+send: b'GET /examples/feed-302.xml HTTP/1.1                                              
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 302 Found'                                                              
+>>> content2 == content                                                                  
+True
+
    +
  1. Same URL, same httplib2.Http object (and therefore the same cache). +
  2. The 302 response was not cached, so httplib2 sends another request for the same URL. +
  3. Once again, the server responds with a 302. But notice what didn’t happen: there wasn’t ever a second request for the final URL, http://diveintopython3.org/examples/feed.xml. That response was cached (remember the Cache-Control header that you saw in the previous example). Once httplib2 received the 302 Found code, it checked its cache before issuing another request. The cache contained a fresh copy of http://diveintopython3.org/examples/feed.xml, so there was no need to re-request it. +
  4. By the time the request() method returns, it has read the feed data from the cache and returned it. Of course, it’s the same as the data you received last time. +
+ +

In other words, you don’t have to do anything special for temporary redirects. httplib2 will follow them automatically, and the fact that one URL redirects to another has no bearing on httplib2’s support for compression, caching, ETags, or any of the other features of HTTP. + +

Permanent redirects are just as simple. + +

+# continued from the previous example
+>>> response, content = h.request('http://diveintopython3.org/examples/feed-301.xml')  
+connect: (diveintopython3.org, 80)
+send: b'GET /examples/feed-301.xml HTTP/1.1
+Host: diveintopython3.org
+accept-encoding: deflate, gzip
+user-agent: Python-httplib2/$Rev: 259 $'
+reply: 'HTTP/1.1 301 Moved Permanently'                                                
+>>> response.fromcache                                                                 
+True
+
    +
  1. Once again, this URL doesn’t really exist. I’ve set up my server to issue a permanent redirect to http://diveintopython3.org/examples/feed.xml. +
  2. And here it is: status code 301. But again, notice what didn’t happen: there was no request to the redirect URL. Why not? Because it’s already cached locally. +
  3. httplib2 “followed” the redirect right into its cache. +
+ +

But wait! There’s more! + +

+# continued from the previous example
+>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml')  
+>>> response2.fromcache                                                                  
+True
+>>> content2 == content                                                                  
+True
+
+
    +
  1. Here’s the difference between temporary and permanent redirects: once httplib2 follows a permanent redirect, all further requests for that URL will transparently be rewritten to the target URL without hitting the network for the original URL. Remember, debugging is still turned on, yet there is no output of network activity whatsoever. +
  2. Yep, this response was retrieved from the local cache. +
  3. Yep, you got the entire feed (from the cache). +
+ +
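One way to picture the difference is a lookup table of remembered permanent redirects that is consulted before any request goes out. This is a toy model under that assumption, not a description of how httplib2 stores things internally; the name resolve is hypothetical.

```python
def resolve(url, permanent_redirects):
    """Rewrite a URL through remembered 301 targets, without any network access."""
    seen = set()
    # The seen set guards against redirect loops in the remembered table.
    while url in permanent_redirects and url not in seen:
        seen.add(url)
        url = permanent_redirects[url]
    return url

# What a client might remember after the 301 shown above.
remembered = {
    'http://diveintopython3.org/examples/feed-301.xml':
        'http://diveintopython3.org/examples/feed.xml',
}
target = resolve('http://diveintopython3.org/examples/feed-301.xml', remembered)
```

A temporary redirect would never be added to such a table, which is why the 302 example had to ask the server about the original URL every time.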

HTTP. It works. + +

⁂ + +

Beyond HTTP GET

+ +

HTTP web services are not limited to GET requests. What if you want to create something new? Whenever you post a comment on a discussion forum, update your weblog, publish your status on a microblogging service like Twitter or Identi.ca, you’re probably already using HTTP POST. + +

Twitter and Identi.ca both offer a simple HTTP-based API for publishing and updating your status in 140 characters or less. Let’s look at Identi.ca’s API documentation for updating your status: + +

+

Identi.ca REST API Method: statuses/update
+Updates the authenticating user’s status. Requires the status parameter specified below. Request must be a POST. + +

+
URL +
https://identi.ca/api/statuses/update.format +
Formats +
xml, json, rss, atom +
HTTP Method(s) +
POST +
Requires Authentication +
true +
Parameters +
status. Required. The text of your status update. URL-encode as necessary. +
+
+ +

How does this work? To publish a new message on Identi.ca, you need to issue an HTTP POST request to https://identi.ca/api/statuses/update.format. (The format bit is not part of the URL; you replace it with the data format you want the server to return in response to your request. So if you want a response in XML, you would post the request to https://identi.ca/api/statuses/update.xml.) The request needs to include a parameter called status, which contains the text of your status update. And the request needs to be authenticated. + +

Authenticated? Sure. To update your status on Identi.ca, you need to prove who you are. Identi.ca is not a wiki; only you can update your own status. Identi.ca uses HTTP Basic Authentication (a.k.a. RFC 2617) over SSL to provide secure but easy-to-use authentication. httplib2 supports both SSL and HTTP Basic Authentication, so this part is easy. + +

A POST request is different from a GET request, because it includes a payload. The payload is the data you want to send to the server. The one piece of data that this API method requires is status, and it should be URL-encoded. This is a very simple serialization format that takes a set of key-value pairs (i.e. a dictionary) and transforms it into a string. + +

+>>> from urllib.parse import urlencode              
+>>> data = {'status': 'Test update from Python 3'}  
+>>> urlencode(data)                                 
+'status=Test+update+from+Python+3'
+
    +
  1. Python comes with a utility function to URL-encode a dictionary: urllib.parse.urlencode(). +
  2. This is the sort of dictionary that the Identi.ca API is looking for. It contains one key, status, whose value is the text of a single status update. +
  3. This is what the URL-encoded string looks like. This is the payload that will be sent “on the wire” to the Identi.ca API server in your HTTP POST request. +
+ +
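The example above only had spaces to escape. Here is a short sketch with a made-up status text that contains characters with special meaning in URLs, plus a round trip through urllib.parse.parse_qs() to show that nothing is lost.

```python
from urllib.parse import urlencode, parse_qs

# A hypothetical status text with characters that are special in URLs.
data = {'status': 'Fish & chips: 100% tasty?'}
payload = urlencode(data)

# Decoding the payload recovers the original text. Note that parse_qs()
# returns a list of values for each key.
decoded = parse_qs(payload)
```

The encoded payload contains no literal spaces or ampersands, and decoded['status'][0] is the original text.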

+ +

+>>> from urllib.parse import urlencode
+>>> import httplib2
+>>> httplib2.debuglevel = 1
+>>> h = httplib2.Http('.cache')
+>>> data = {'status': 'Test update from Python 3'}
+>>> h.add_credentials('diveintomark', 'MY_SECRET_PASSWORD', 'identi.ca')    
+>>> resp, content = h.request('https://identi.ca/api/statuses/update.xml',
+...     'POST',                                                             
+...     urlencode(data),                                                    
+...     headers={'Content-Type': 'application/x-www-form-urlencoded'})      
+
    +
  1. This is how httplib2 handles authentication. Store your username and password with the add_credentials() method. When httplib2 tries to issue the request, the server will respond with a 401 Unauthorized status code, and it will list which authentication methods it supports (in the WWW-Authenticate header). httplib2 will automatically construct an Authorization header and re-request the URL. +
  2. The second parameter is the type of HTTP request, in this case POST. +
  3. The third parameter is the payload to send to the server. We’re sending the URL-encoded dictionary with a status message. +
  4. Finally, we need to tell the server that the payload is URL-encoded data. +
+ +
+

The third parameter to the add_credentials() method is the domain in which the credentials are valid. You should always specify this! If you leave out the domain and later reuse the httplib2.Http object on a different authenticated site, httplib2 might end up leaking one site’s username and password to the other site. +

+ +

This is what goes over the wire: + +

+# continued from the previous example
+send: b'POST /api/statuses/update.xml HTTP/1.1
+Host: identi.ca
+Accept-Encoding: identity
+Content-Length: 32
+content-type: application/x-www-form-urlencoded
+user-agent: Python-httplib2/$Rev: 259 $
+
+status=Test+update+from+Python+3'
+reply: 'HTTP/1.1 401 Unauthorized'                        
+send: b'POST /api/statuses/update.xml HTTP/1.1            
+Host: identi.ca
+Accept-Encoding: identity
+Content-Length: 32
+content-type: application/x-www-form-urlencoded
+authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2  
+user-agent: Python-httplib2/$Rev: 259 $
+
+status=Test+update+from+Python+3'
+reply: 'HTTP/1.1 200 OK'                                  
+
    +
  1. After the first request, the server responds with a 401 Unauthorized status code. httplib2 will never send authentication headers unless the server explicitly asks for them. This is how the server asks for them. +
  2. httplib2 immediately turns around and requests the same URL a second time. +
  3. This time, it includes the username and password that you added with the add_credentials() method. +
  4. It worked! +
+ +
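Two details of that exchange can be reproduced with the standard library. The sketch below builds a Basic Authorization header the way RFC 2617 describes it and measures the request body. The function name basic_auth_header is hypothetical, the username and password are the example pair from the RFC rather than real credentials, and encoding them as UTF-8 is an assumption (the RFC doesn't fix an encoding).

```python
import base64
from urllib.parse import urlencode

def basic_auth_header(username, password):
    """Build the value of an Authorization header for HTTP Basic Authentication."""
    credentials = '{0}:{1}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

header = basic_auth_header('Aladdin', 'open sesame')

# Base64 is an encoding, not encryption: anyone can reverse it.
recovered = base64.b64decode(header.split()[1])

# The request body from the example above, as bytes on the wire.
body = urlencode({'status': 'Test update from Python 3'}).encode('ascii')
```

Because recovered is simply the original username and password, Basic Authentication is only safe over an encrypted connection, which is why the API uses SSL. The length of body is 32, the same number as the Content-Length header in the debugging output.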

What does the server send back after a successful request? That depends entirely on the web service API. In some protocols (like the Atom Publishing Protocol), the server sends back a 201 Created status code and the location of the newly created resource in the Location header. Identi.ca sends back a 200 OK and an XML document containing information about the newly created resource. + +

+# continued from the previous example
+>>> print(content.decode('utf-8'))                             
+<?xml version="1.0" encoding="UTF-8"?>
+<status>
+ <text>Test update from Python 3</text>                        
+ <truncated>false</truncated>
+ <created_at>Wed Jun 10 03:53:46 +0000 2009</created_at>
+ <in_reply_to_status_id></in_reply_to_status_id>
+ <source>api</source>
+ <id>5131472</id>                                              
+ <in_reply_to_user_id></in_reply_to_user_id>
+ <in_reply_to_screen_name></in_reply_to_screen_name>
+ <favorited>false</favorited>
+ <user>
+  <id>3212</id>
+  <name>Mark Pilgrim</name>
+  <screen_name>diveintomark</screen_name>
+  <location>27502, US</location>
+  <description>tech writer, husband, father</description>
+  <profile_image_url>http://avatar.identi.ca/3212-48-20081216000626.png</profile_image_url>
+  <url>http://diveintomark.org/</url>
+  <protected>false</protected>
+  <followers_count>329</followers_count>
+  <profile_background_color></profile_background_color>
+  <profile_text_color></profile_text_color>
+  <profile_link_color></profile_link_color>
+  <profile_sidebar_fill_color></profile_sidebar_fill_color>
+  <profile_sidebar_border_color></profile_sidebar_border_color>
+  <friends_count>2</friends_count>
+  <created_at>Wed Jul 02 22:03:58 +0000 2008</created_at>
+  <favourites_count>30768</favourites_count>
+  <utc_offset>0</utc_offset>
+  <time_zone>UTC</time_zone>
+  <profile_background_image_url></profile_background_image_url>
+  <profile_background_tile>false</profile_background_tile>
+  <statuses_count>122</statuses_count>
+  <following>false</following>
+  <notifications>false</notifications>
+</user>
+</status>
+
    +
  1. Remember, the data returned by httplib2 is always bytes, not a string. To convert it to a string, you need to decode it using the proper character encoding. Identi.ca’s API always returns results in UTF-8, so that part is easy. +
  2. There’s the text of the status message we just published. +
  3. There’s the unique identifier for the new status message. Identi.ca uses this to construct a URL for viewing the message on the web. +
+ +

And here it is: + +

screenshot showing published status message on Identi.ca + +

⁂ + +

Beyond HTTP POST

+ +

HTTP isn’t limited to GET and POST. Those are certainly the most common types of requests, especially in web browsers. But web service APIs can go beyond GET and POST, and httplib2 is ready. + +

+# continued from the previous example
+>>> from xml.etree import ElementTree as etree
+>>> tree = etree.fromstring(content)                                          
+>>> status_id = tree.findtext('id')                                           
+>>> status_id
+'5131472'
+>>> url = 'https://identi.ca/api/statuses/destroy/{0}.xml'.format(status_id)  
+>>> resp, deleted_content = h.request(url, 'DELETE')                          
+
    +
  1. The server returned XML, right? You know how to parse XML. +
  2. The findtext() method finds the first instance of the given expression and extracts its text content. In this case, we’re just looking for an <id> element. +
  3. Based on the text content of the <id> element, we can construct a URL to delete the status message we just published. +
  4. To delete a message, you simply issue an HTTP DELETE request to that URL. +
+ +
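You can try the parsing steps without publishing anything. This sketch uses a trimmed copy of the response shown earlier (most elements are left out) and shows one subtlety: findtext('id') only looks at direct children of the root element, so the user's own id doesn't get in the way.

```python
from xml.etree import ElementTree as etree

# A trimmed copy of the server response shown earlier in this chapter.
content = b'''<?xml version="1.0" encoding="UTF-8"?>
<status>
 <text>Test update from Python 3</text>
 <id>5131472</id>
 <user>
  <id>3212</id>
  <screen_name>diveintomark</screen_name>
 </user>
</status>'''

tree = etree.fromstring(content)
status_id = tree.findtext('id')         # direct child of <status>
user_id = tree.findtext('user/id')      # nested one level down
url = 'https://identi.ca/api/statuses/destroy/{0}.xml'.format(status_id)
```

Note that etree.fromstring() accepts the bytes object directly, because the XML declaration names the character encoding.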

This is what goes over the wire: + +

+send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1      
+Host: identi.ca
+Accept-Encoding: identity
+user-agent: Python-httplib2/$Rev: 259 $
+
+'
+reply: 'HTTP/1.1 401 Unauthorized'                             
+send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1      
+Host: identi.ca
+Accept-Encoding: identity
+authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2       
+user-agent: Python-httplib2/$Rev: 259 $
+
+'
+reply: 'HTTP/1.1 200 OK'                                       
+>>> resp.status
+200
+
    +
  1. “Delete this status message.” +
  2. “I’m sorry, Dave, I’m afraid I can’t do that.” +
  3. “Unauthorized? Hmmph. Delete this status message, please… +
  4. …and here’s my username and password.” +
  5. “Consider it done!” +
+ +

And just like that, poof, it’s gone. + +

screenshot showing deleted message on Identi.ca + +

⁂ + +

Further Reading

+ +

httplib2: + +

+ +

HTTP caching: + +

+ +

RFCs: + +

+ +

+

© 2001–10 Mark Pilgrim + + + diff --git a/installing-python.html b/installing-python.html index 4793167..59df064 100755 --- a/installing-python.html +++ b/installing-python.html @@ -1,364 +1,364 @@ - - -Installing Python - Dive Into Python 3 - - - - - - -

  
-


Difficulty level: ♦♢♢♢♢ -

Installing Python

-
-

Tempora mutantur nos et mutamur in illis. (Times change, and we change with them.)
— ancient Roman proverb -

-

  -

Diving In

-

Before you can start programming in Python 3, you need to install it. Or do you? - -

Which Python Is Right For You?

- -

If you’re using an account on a hosted server, your ISP may have already installed Python 3. If you’re running Linux at home, you may already have Python 3, too. Most popular GNU/Linux distributions come with Python 2 in the default installation; a small but growing number of distributions also include Python 3. Mac OS X includes a command-line version of Python 2, but as of this writing it does not include Python 3. Microsoft Windows does not come with any version of Python. But don’t despair! You can point-and-click your way through installing Python, regardless of what operating system you have. - -

The easiest way to check for Python 3 on your Linux or Mac OS X system is to get to a command line. On Linux, look in your Applications menu for a program called Terminal. (It may be in a submenu like Accessories or System.) On Mac OS X, there is an application called Terminal.app in your /Application/Utilities/ folder. - -

Once you’re at a command line prompt, just type python3 (all lowercase, no spaces) and see what happens. On my home Linux system, Python 3 is already installed, and this command gets me into the Python interactive shell. - -

-mark@atlantis:~$ python3
-Python 3.0.1+ (r301:69556, Apr 15 2009, 17:25:52)
-[GCC 4.3.3] on linux2
-Type "help", "copyright", "credits" or "license" for more information.
->>>
- -

(Type exit() and press ENTER to exit the Python interactive shell.) - -

My web hosting provider also runs Linux and provides command-line access, but my server does not have Python 3 installed. (Boo!) - -

-mark@manganese:~$ python3
-bash: python3: command not found
- -

So back to the question that started this section, “Which Python is right for you?” Whichever one runs on the computer you already have. - -

[Read on for Windows instructions, or skip to Installing on Mac OS X, Installing on Ubuntu Linux, or Installing on Other Platforms.] - -

⁂ - -

Installing on Microsoft Windows

- -

Windows comes in two architectures these days: 32-bit and 64-bit. Of course, there are lots of different versions of Windows — XP, Vista, Windows 7 — but Python runs on all of them. The more important distinction is 32-bit v. 64-bit. If you have no idea what architecture you’re running, it’s probably 32-bit. - -

Visit python.org/download/ and download the appropriate Python 3 Windows installer for your architecture. Your choices will look something like this: - -

- -

I don’t want to include direct download links here, because minor updates of Python happen all the time and I don’t want to be responsible for you missing important updates. You should always install the most recent version of Python 3.x unless you have some esoteric reason not to. - -

    -
  1. -

    [Windows dialog: open file security warning] -

    Once your download is complete, double-click the .msi file. Windows will pop up a security alert, since you’re about to be running executable code. The official Python installer is digitally signed by the Python Software Foundation, the non-profit corporation that oversees Python development. Don’t accept imitations! -

    Click the Run button to launch the Python 3 installer. - -

  2. -

    [Python installer: select whether to install Python 3.1 for all users of this computer] -

    The first question the installer will ask you is whether you want to install Python 3 for all users or just for you. The default choice is “install for all users,” which is the best choice unless you have a good reason to choose otherwise. (One possible reason why you would want to “install just for me” is that you are installing Python on your company’s computer and you don’t have administrative rights on your Windows account. But then, why are you installing Python without permission from your company’s Windows administrator? Don’t get me in trouble here!) -

    Click the Next button to accept your choice of installation type. - -

  3. -

    [Python installer: select destination directory] -

    Next, the installer will prompt you to choose a destination directory. The default for all versions of Python 3.1.x is C:\Python31\, which should work well for most users unless you have a specific reason to change it. If you maintain a separate drive letter for installing applications, you can browse to it using the embedded controls, or simply type the pathname in the box below. You are not limited to installing Python on the C: drive; you can install it on any drive, in any folder. -

    Click the Next button to accept your choice of destination directory. - -

  4. -

    [Python installer: customize Python 3.1] -

    The next page looks complicated, but it’s not really. Like many installers, you have the option not to install every single component of Python 3. If disk space is especially tight, you can exclude certain components. -

    - -
  5. -

    [Python installer: disk space requirements] -

    If you’re unsure how much disk space you have, click the Disk Usage button. The installer will list your drive letters, compute how much space is available on each drive, and calculate how much would be left after installation. -

    Click the OK button to return to the “Customizing Python” page. - -

  6. -

    [Python installer: removing Test Suite option will save 7908KB on your hard drive] -

    If you decide to exclude an option, select the drop-down button before the option and select “Entire feature will be unavailable.” For example, excluding the test suite will save you a whopping 7908KB of disk space. -

    Click the Next button to accept your choice of options. - -

  7. -

    [Python installer: progress meter] -

    The installer will copy all the necessary files to your chosen destination directory. (This happens so quickly, I had to try it three times to even get a screenshot of it!) - -

  8. -

    [Python installer: installation completed. Special Windows thanks to Mark Hammond, without whose years of freely shared Windows expertise, Python for Windows would still be Python for DOS.] -

    Click the Finish button to exit the installer. - -

  9. -

    [Windows Python Shell, a graphical interactive shell for Python] -

    In your Start menu, there should be a new item called Python 3.1. Within that, there is a program called IDLE. Select this item to run the interactive Python Shell. - -

- -

[Skip to using the Python Shell] - -

⁂ - -

Installing on Mac OS X

- -

All modern Macintosh computers use the Intel chip (like most Windows PCs). Older Macs used PowerPC chips. You don’t need to understand the difference, because there’s just one Mac Python installer for all Macs. - -

Visit python.org/download/ and download the Mac installer. It will be called something like Python 3.1 Mac Installer Disk Image, although the version number may vary. Be sure to download version 3.x, not 2.x. - -

    - -
  1. -

    [contents of Python installer disk image] -

    Your browser should automatically mount the disk image and open a Finder window to show you the contents. (If this doesn’t happen, you’ll need to find the disk image in your downloads folder and double-click to mount it. It will be named something like python-3.1.dmg.) The disk image contains a number of text files (Build.txt, License.txt, ReadMe.txt), and the actual installer package, Python.mpkg. -

    Double-click the Python.mpkg installer package to launch the Mac Python installer. - -

  2. -

    [Python installer: welcome screen] -

    The first page of the installer gives a brief description of Python itself, then refers you to the ReadMe.txt file (which you didn’t read, did you?) for more details. -

    Click the Continue button to move along. - -

  3. -

    [Python installer: information about supported architectures, disk space, and acceptable destination folders] -

    The next page actually contains some important information: Python requires Mac OS X 10.3 or later. If you are still running Mac OS X 10.2, you should really upgrade. Apple no longer provides security updates for your operating system, and your computer is probably at risk if you ever go online. Also, you can’t run Python 3. -

    Click the Continue button to advance. - -

  4. -

    [Python installer: software license agreement] -

    Like all good installers, the Python installer displays the software license agreement. Python is open source, and its license is approved by the Open Source Initiative. Python has had a number of owners and sponsors throughout its history, each of which has left its mark on the software license. But the end result is this: Python is open source, and you may use it on any platform, for any purpose, without fee or obligation of reciprocity. -

    Click the Continue button once again. - -

  5. -

    [Python installer: dialog to accept license agreement] -

    Due to quirks in the standard Apple installer framework, you must “agree” to the software license in order to complete the installation. Since Python is open source, you are really “agreeing” that the license is granting you additional rights, rather than taking them away. -

    Click the Agree button to continue. - -

  6. -

    [Python installer: standard install screen] -

    The next screen allows you to change your install location. You must install Python on your boot drive, but due to limitations of the installer, it does not enforce this. In truth, I have never had the need to change the install location. -

    From this screen, you can also customize the installation to exclude certain features. If you want to do this, click the Customize button; otherwise click the Install button. - -

  7. -

    [Python installer: custom install screen] -

    If you choose a Custom Install, the installer will present you with the following list of features: -

    -

    Click the Install button to continue. - -

  8. -

    [Python installer: dialog to enter administrative password] -

    Because it installs system-wide frameworks and binaries in /usr/local/bin/, the installer will ask you for an administrative password. There is no way to install Mac Python without administrator privileges. -

    Click the OK button to begin the installation. - -

  9. -

    [Python installer: progress meter] -

    The installer will display a progress meter while it installs the features you’ve selected. - -

  10. -

    [Python installer: install succeeded] -

    Assuming all went well, the installer will give you a big green checkmark to tell you that the installation completed successfully. -

    Click the Close button to exit the installer. - -

  11. -

    [contents of /Applications/Python 3.1/ folder] -

    Assuming you didn’t change the install location, you can find the newly installed files in the Python 3.1 folder within your /Applications folder. The most important piece is IDLE, the graphical Python Shell. -

    Double-click IDLE to launch the Python Shell. - -

  12. -

    [Mac Python Shell, a graphical interactive shell for Python] -

    The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. - -

- -

[Skip to using the Python Shell] - -

⁂ - -

Installing on Ubuntu Linux

- -

Modern Linux distributions are backed by vast repositories of precompiled applications, ready to install. The exact details vary by distribution. In Ubuntu Linux, the easiest way to install Python 3 is through the Add/Remove application in your Applications menu. - -

    -
  1. -

    [Add/Remove: Canonical-maintained applications] -

    When you first launch the Add/Remove application, it will show you a list of preselected applications in different categories. Some are already installed; most are not. Because the repository contains over 10,000 applications, there are different filters you can apply to see small parts of the repository. The default filter is “Canonical-maintained applications,” which is a small subset of the total number of applications that are officially supported by Canonical, the company that creates and maintains Ubuntu Linux. - -

  2. -

    [Add/Remove: all open source applications] -

    Python 3 is not maintained by Canonical, so the first step is to drop down this filter menu and select “All Open Source applications.” - -

  3. -

    [Add/Remove: search for Python 3] -

    Once you’ve widened the filter to include all open source applications, use the Search box immediately after the filter menu to search for Python 3. - -

  4. -

    [Add/Remove: select Python 3.0 package] -

    Now the list of applications narrows to just those matching Python 3. You’re going to check two packages. The first is Python (v3.0). This contains the Python interpreter itself. -

  5. -

    [Add/Remove: select IDLE for Python 3.0 package] -

    The second package you want is immediately above: IDLE (using Python-3.0). This is a graphical Python Shell that you will use throughout this book. -

    After you’ve checked those two packages, click the Apply Changes button to continue. - -

  6. -

    [Add/Remove: apply changes] -

    The package manager will ask you to confirm that you want to add both IDLE (using Python-3.0) and Python (v3.0). -

    Click the Apply button to continue. - -

  7. -

    [Add/Remove: download progress meter] -

    The package manager will show you a progress meter while it downloads the necessary packages from Canonical’s Internet repository. - -

  8. -

    [Add/Remove: installation progress meter] -

    Once the packages are downloaded, the package manager will automatically begin installing them. - -

  9. -

    [Add/Remove: new applications have been installed] -

    If all went well, the package manager will confirm that both packages were successfully installed. From here, you can double-click IDLE to launch the Python Shell, or click the Close button to exit the package manager. -

    You can always relaunch the Python Shell by going to your Applications menu, then the Programming submenu, and selecting IDLE. - -

  10. -

    [Linux Python Shell, a graphical interactive shell for Python] -

    The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. - -

- -

[Skip to using the Python Shell] - -

⁂ - -

Installing on Other Platforms

- -

Python 3 is available on a number of different platforms. In particular, it is available in virtually every Linux, BSD, and Solaris-based distribution. For example, RedHat Linux uses the yum package manager; FreeBSD has its ports and packages collection; Solaris has pkgadd and friends. A quick web search for Python 3 + your operating system will tell you whether a Python 3 package is available, and how to install it. - -

⁂ - -

Using The Python Shell

- -

The Python Shell is where you can explore Python syntax, get interactive help on commands, and debug short programs. The graphical Python Shell (named IDLE) also contains a decent text editor that supports Python syntax coloring and integrates with the Python Shell. If you don’t already have a favorite text editor, you should give IDLE a try. - -

First things first. The Python Shell itself is an amazing interactive playground. Throughout this book, you’ll see examples like this: - -

->>> 1 + 1
-2
- -

The three angle brackets, >>>, denote the Python Shell prompt. Don’t type that part. That’s just to let you know that this example is meant to be followed in the Python Shell. - -

1 + 1 is the part you type. You can type any valid Python expression or command in the Python Shell. Don’t be shy; it won’t bite! The worst that will happen is you’ll get an error message. Commands get executed immediately (once you press ENTER); expressions get evaluated immediately, and the Python Shell prints out the result. - -

2 is the result of evaluating this expression. As it happens, 1 + 1 is a valid Python expression. The result, of course, is 2. - -

Let’s try another one. - -

->>> print('Hello world!')
-Hello world!
-
- -

Pretty simple, no? But there’s lots more you can do in the Python shell. If you ever get stuck — you can’t remember a command, or you can’t remember the proper arguments to pass a certain function — you can get interactive help in the Python Shell. Just type help and press ENTER. - -

->>> help
-Type help() for interactive help, or help(object) for help about object.
- -

There are two modes of help. You can get help about a single object, which just prints out the documentation and returns you to the Python Shell prompt. You can also enter help mode, where instead of evaluating Python expressions, you just type keywords or command names and it will print out whatever it knows about that command. - -

To enter the interactive help mode, type help() and press ENTER. - -

->>> help()
-Welcome to Python 3.0!  This is the online help utility.
-
-If this is your first time using Python, you should definitely check out
-the tutorial on the Internet at http://docs.python.org/tutorial/.
-
-Enter the name of any module, keyword, or topic to get help on writing
-Python programs and using Python modules.  To quit this help utility and
-return to the interpreter, just type "quit".
-
-To get a list of available modules, keywords, or topics, type "modules",
-"keywords", or "topics".  Each module also comes with a one-line summary
-of what it does; to list the modules whose summaries contain a given word
-such as "spam", type "modules spam".
-
-help> 
- -

Note how the prompt changes from >>> to help>. This reminds you that you’re in the interactive help mode. Now you can enter any keyword, command, module name, function name — pretty much anything Python understands — and read documentation on it. - -

-help> print                                                                 
-Help on built-in function print in module builtins:
-
-print(...)
-    print(value, ..., sep=' ', end='\n', file=sys.stdout)
-    
-    Prints the values to a stream, or to sys.stdout by default.
-    Optional keyword arguments:
-    file: a file-like object (stream); defaults to the current sys.stdout.
-    sep:  string inserted between values, default a space.
-    end:  string appended after the last value, default a newline.
-
-help> PapayaWhip                                                            
-no Python documentation found for 'PapayaWhip'
-
-help> quit                                                                  
-
-You are now leaving help and returning to the Python interpreter.
-If you want to ask for help on a particular object directly from the
-interpreter, you can type "help(object)".  Executing "help('string')"
-has the same effect as typing a particular string at the help> prompt.
->>>                                                                         
-
    -
  1. To get documentation on the print() function, just type print and press ENTER. The interactive help mode will display something akin to a man page: the function name, a brief synopsis, the function’s arguments and their default values, and so on. If the documentation seems opaque to you, don’t panic. You’ll learn more about all these concepts in the next few chapters. -
  2. Of course, the interactive help mode doesn’t know everything. If you type something that isn’t a Python command, module, function, or other built-in keyword, the interactive help mode will just shrug its virtual shoulders. -
  3. To quit the interactive help mode, type quit and press ENTER. -
  4. The prompt changes back to >>> to signal that you’ve left the interactive help mode and returned to the Python Shell. -
- -

IDLE, the graphical Python Shell, also includes a Python-aware text editor. - -

⁂ - -

Python Editors and IDEs

- -

IDLE is not the only game in town when it comes to writing programs in Python. While it’s useful to get started with learning the language itself, many developers prefer other text editors or Integrated Development Environments (IDEs). I won’t cover them here, but the Python community maintains a list of Python-aware editors that covers a wide range of supported platforms and software licenses. - -

You might also want to check out the list of Python-aware IDEs, although few of them support Python 3 yet. One that does is PyDev, a plugin for Eclipse that turns Eclipse into a full-fledged Python IDE. Both Eclipse and PyDev are cross-platform and open source. - -

On the commercial front, there is ActiveState’s Komodo IDE. It has per-user licensing, but students can get a discount, and a free time-limited trial version is available. - -

I’ve been programming in Python for nine years, and I edit my Python programs in GNU Emacs and debug them in the command-line Python Shell. There’s no right or wrong way to develop in Python. Find a way that works for you! - -

- -

© 2001–10 Mark Pilgrim - - - + + +Installing Python - Dive Into Python 3 + + + + + + +

  
+

You are here: Home Dive Into Python 3 +

Difficulty level: ♦♢♢♢♢ +

Installing Python

+
+

Tempora mutantur nos et mutamur in illis. (Times change, and we change with them.)
— ancient Roman proverb +

+

  +

Diving In

+

Before you can start programming in Python 3, you need to install it. Or do you? + +

Which Python Is Right For You?

+ +

If you're using an account on a hosted server, your ISP may have already installed Python 3. If you’re running Linux at home, you may already have Python 3, too. Most popular GNU/Linux distributions come with Python 2 in the default installation; a small but growing number of distributions also include Python 3. Mac OS X includes a command-line version of Python 2, but as of this writing it does not include Python 3. Microsoft Windows does not come with any version of Python. But don’t despair! You can point-and-click your way through installing Python, regardless of what operating system you have. + +

The easiest way to check for Python 3 on your Linux or Mac OS X system is to get to a command line. On Linux, look in your Applications menu for a program called Terminal. (It may be in a submenu like Accessories or System.) On Mac OS X, there is an application called Terminal.app in your /Applications/Utilities/ folder. + +

Once you’re at a command line prompt, just type python3 (all lowercase, no spaces) and see what happens. On my home Linux system, Python 3 is already installed, and this command gets me into the Python interactive shell. + +

+mark@atlantis:~$ python3
+Python 3.0.1+ (r301:69556, Apr 15 2009, 17:25:52)
+[GCC 4.3.3] on linux2
+Type "help", "copyright", "credits" or "license" for more information.
+>>>
+ +

(Type exit() and press ENTER to exit the Python interactive shell.) + +

My web hosting provider also runs Linux and provides command-line access, but my server does not have Python 3 installed. (Boo!) + +

+mark@manganese:~$ python3
+bash: python3: command not found
+ +
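The same check can be made from a script, which is handy if you already have some Python 3.3 or later interpreter and want to know whether a `python3` command is on the search path. This is only a sketch: the helper name `locate` is made up for illustration, and `shutil.which` does the real work.

```python
import shutil

def locate(command="python3"):
    """Return the full path of `command` if it is on the PATH, else None."""
    return shutil.which(command)

path = locate()
if path is None:
    # Same outcome as the "command not found" message from the shell.
    print("python3: command not found")
else:
    print("python3 found at", path)
```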

So back to the question that started this section, “Which Python is right for you?” Whichever one runs on the computer you already have. + +

[Read on for Windows instructions, or skip to Installing on Mac OS X, Installing on Ubuntu Linux, or Installing on Other Platforms.] + +

⁂ + +

Installing on Microsoft Windows

+ +

Windows comes in two architectures these days: 32-bit and 64-bit. Of course, there are lots of different versions of Windows — XP, Vista, Windows 7 — but Python runs on all of them. The more important distinction is 32-bit v. 64-bit. If you have no idea what architecture you’re running, it’s probably 32-bit. + +

Visit python.org/download/ and download the appropriate Python 3 Windows installer for your architecture. Your choices will look something like this: + +

+ +

I don’t want to include direct download links here, because minor updates of Python happen all the time and I don’t want to be responsible for you missing important updates. You should always install the most recent version of Python 3.x unless you have some esoteric reason not to. + +

    +
  1. +

    [Windows dialog: open file security warning] +

    Once your download is complete, double-click the .msi file. Windows will pop up a security alert, since you’re about to be running executable code. The official Python installer is digitally signed by the Python Software Foundation, the non-profit corporation that oversees Python development. Don’t accept imitations! +

    Click the Run button to launch the Python 3 installer. + +

  2. +

    [Python installer: select whether to install Python 3.1 for all users of this computer] +

    The first question the installer will ask you is whether you want to install Python 3 for all users or just for you. The default choice is “install for all users,” which is the best choice unless you have a good reason to choose otherwise. (One possible reason why you would want to “install just for me” is that you are installing Python on your company’s computer and you don’t have administrative rights on your Windows account. But then, why are you installing Python without permission from your company’s Windows administrator? Don’t get me in trouble here!) +

    Click the Next button to accept your choice of installation type. + +

  3. +

    [Python installer: select destination directory] +

    Next, the installer will prompt you to choose a destination directory. The default for all versions of Python 3.1.x is C:\Python31\, which should work well for most users unless you have a specific reason to change it. If you maintain a separate drive letter for installing applications, you can browse to it using the embedded controls, or simply type the pathname in the box below. You are not limited to installing Python on the C: drive; you can install it on any drive, in any folder. +

    Click the Next button to accept your choice of destination directory. + +

  4. +

    [Python installer: customize Python 3.1] +

    The next page looks complicated, but it’s not really. As with many installers, you have the option not to install every single component of Python 3. If disk space is especially tight, you can exclude certain components. +

    + +
  5. +

    [Python installer: disk space requirements] +

    If you’re unsure how much disk space you have, click the Disk Usage button. The installer will list your drive letters, compute how much space is available on each drive, and calculate how much would be left after installation. +

    Click the OK button to return to the “Customizing Python” page. + +

  6. +

    [Python installer: removing Test Suite option will save 7908KB on your hard drive] +

    If you decide to exclude an option, select the drop-down button before the option and select “Entire feature will be unavailable.” For example, excluding the test suite will save you a whopping 7908KB of disk space. +

    Click the Next button to accept your choice of options. + +

  7. +

    [Python installer: progress meter] +

    The installer will copy all the necessary files to your chosen destination directory. (This happens so quickly, I had to try it three times to even get a screenshot of it!) + +

  8. +

    [Python installer: installation completed. Special Windows thanks to Mark Hammond, without whose years of freely shared Windows expertise, Python for Windows would still be Python for DOS.] +

    Click the Finish button to exit the installer. + +

  9. +

    [Windows Python Shell, a graphical interactive shell for Python] +

    In your Start menu, there should be a new item called Python 3.1. Within that, there is a program called IDLE. Select this item to run the interactive Python Shell. + +

+ +
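Once the Python Shell is running, you can confirm which interpreter you actually installed: its version, where it lives on disk, and whether it is a 32-bit or 64-bit build. The following is a small sketch, not an official diagnostic. The function name `describe_interpreter` is hypothetical, and the pointer-size trick reports the architecture of the Python build, which may differ from that of Windows itself (a 32-bit Python runs fine on 64-bit Windows).

```python
import struct
import sys

def describe_interpreter():
    """Collect basic facts about the running Python interpreter."""
    # The size of a C pointer, in bits, tells you the build architecture.
    bits = struct.calcsize("P") * 8
    return {
        "version": "{0}.{1}.{2}".format(*sys.version_info[:3]),
        "executable": sys.executable,   # full path of the interpreter binary
        "prefix": sys.prefix,           # installation directory
        "bits": bits,
    }

for key, value in sorted(describe_interpreter().items()):
    print("{0}: {1}".format(key, value))
```

If you accepted the default destination directory during installation, the `prefix` line should point at that directory.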

[Skip to using the Python Shell] + +

⁂ + +

Installing on Mac OS X

+ +

All modern Macintosh computers use the Intel chip (like most Windows PCs). Older Macs used PowerPC chips. You don’t need to understand the difference, because there’s just one Mac Python installer for all Macs. + +

Visit python.org/download/ and download the Mac installer. It will be called something like Python 3.1 Mac Installer Disk Image, although the version number may vary. Be sure to download version 3.x, not 2.x. + +

    + +
  1. +

    [contents of Python installer disk image] +

    Your browser should automatically mount the disk image and open a Finder window to show you the contents. (If this doesn’t happen, you’ll need to find the disk image in your downloads folder and double-click to mount it. It will be named something like python-3.1.dmg.) The disk image contains a number of text files (Build.txt, License.txt, ReadMe.txt), and the actual installer package, Python.mpkg. +

    Double-click the Python.mpkg installer package to launch the Mac Python installer. + +

  2. +

    [Python installer: welcome screen] +

    The first page of the installer gives a brief description of Python itself, then refers you to the ReadMe.txt file (which you didn’t read, did you?) for more details. +

    Click the Continue button to move along. + +

  3. +

    [Python installer: information about supported architectures, disk space, and acceptable destination folders] +

    The next page actually contains some important information: Python requires Mac OS X 10.3 or later. If you are still running Mac OS X 10.2, you should really upgrade. Apple no longer provides security updates for your operating system, and your computer is probably at risk if you ever go online. Also, you can’t run Python 3. +

    Click the Continue button to advance. + +

  4. +

    [Python installer: software license agreement] +

    Like all good installers, the Python installer displays the software license agreement. Python is open source, and its license is approved by the Open Source Initiative. Python has had a number of owners and sponsors throughout its history, each of which has left its mark on the software license. But the end result is this: Python is open source, and you may use it on any platform, for any purpose, without fee or obligation of reciprocity. +

    Click the Continue button once again. + +

  5. +

    [Python installer: dialog to accept license agreement] +

    Due to quirks in the standard Apple installer framework, you must “agree” to the software license in order to complete the installation. Since Python is open source, you are really “agreeing” that the license is granting you additional rights, rather than taking them away. +

    Click the Agree button to continue. + +

  6. +

    [Python installer: standard install screen] +

    The next screen allows you to change your install location. You must install Python on your boot drive, but due to its own limitations, the installer does not enforce this. In truth, I have never needed to change the install location. +

    From this screen, you can also customize the installation to exclude certain features. If you want to do this, click the Customize button; otherwise click the Install button. + +

  7. +

    [Python installer: custom install screen] +

    If you choose a Custom Install, the installer will present you with the following list of features: +

    +

    Click the Install button to continue. + +

  8. +

    [Python installer: dialog to enter administrative password] +

    Because it installs system-wide frameworks and binaries in /usr/local/bin/, the installer will ask you for an administrative password. There is no way to install Mac Python without administrator privileges. +

    Click the OK button to begin the installation. + +

  9. +

    [Python installer: progress meter] +

    The installer will display a progress meter while it installs the features you’ve selected. + +

  10. +

    [Python installer: install succeeded] +

    Assuming all went well, the installer will give you a big green checkmark to tell you that the installation completed successfully. +

    Click the Close button to exit the installer. + +

  11. +

    [contents of /Applications/Python 3.1/ folder] +

    Assuming you didn’t change the install location, you can find the newly installed files in the Python 3.1 folder within your /Applications folder. The most important piece is IDLE, the graphical Python Shell. +

    Double-click IDLE to launch the Python Shell. + +

  12. +

    [Mac Python Shell, a graphical interactive shell for Python] +

    The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. + +

+ +

[Skip to using the Python Shell] + +

⁂ + +

Installing on Ubuntu Linux

+ +

Modern Linux distributions are backed by vast repositories of precompiled applications, ready to install. The exact details vary by distribution. In Ubuntu Linux, the easiest way to install Python 3 is through the Add/Remove application in your Applications menu. + +

    +
  1. +

    [Add/Remove: Canonical-maintained applications] +

    When you first launch the Add/Remove application, it will show you a list of preselected applications in different categories. Some are already installed; most are not. Because the repository contains over 10,000 applications, there are different filters you can apply to see small parts of the repository. The default filter is “Canonical-maintained applications,” which is a small subset of the total number of applications that are officially supported by Canonical, the company that creates and maintains Ubuntu Linux. + +

  2. +

    [Add/Remove: all open source applications] +

    Python 3 is not maintained by Canonical, so the first step is to drop down this filter menu and select “All Open Source applications.” + +

  3. +

    [Add/Remove: search for Python 3] +

    Once you’ve widened the filter to include all open source applications, use the Search box immediately after the filter menu to search for Python 3. + +

  4. +

    [Add/Remove: select Python 3.0 package] +

    Now the list of applications narrows to just those matching Python 3. You’re going to check two packages. The first is Python (v3.0). This contains the Python interpreter itself. +

  5. +

    [Add/Remove: select IDLE for Python 3.0 package] +

    The second package you want is immediately above: IDLE (using Python-3.0). This is a graphical Python Shell that you will use throughout this book. +

    After you’ve checked those two packages, click the Apply Changes button to continue. + +

  6. +

    [Add/Remove: apply changes] +

    The package manager will ask you to confirm that you want to add both IDLE (using Python-3.0) and Python (v3.0). +

    Click the Apply button to continue. + +

  7. +

    [Add/Remove: download progress meter] +

    The package manager will show you a progress meter while it downloads the necessary packages from Canonical’s Internet repository. + +

  8. +

    [Add/Remove: installation progress meter] +

    Once the packages are downloaded, the package manager will automatically begin installing them. + +

  9. +

    [Add/Remove: new applications have been installed] +

    If all went well, the package manager will confirm that both packages were successfully installed. From here, you can double-click IDLE to launch the Python Shell, or click the Close button to exit the package manager. +

    You can always relaunch the Python Shell by going to your Applications menu, then the Programming submenu, and selecting IDLE. + +

  10. +

    [Linux Python Shell, a graphical interactive shell for Python] +

    The Python Shell is where you will spend most of your time exploring Python. Examples throughout this book will assume that you can find your way into the Python Shell. + +

+ +

[Skip to using the Python Shell] + +

⁂ + +

Installing on Other Platforms

+ +

Python 3 is available on a number of different platforms. In particular, it is available in virtually every Linux, BSD, and Solaris-based distribution. For example, RedHat Linux uses the yum package manager; FreeBSD has its ports and packages collection; Solaris has pkgadd and friends. A quick web search for Python 3 + your operating system will tell you whether a Python 3 package is available, and how to install it. + +

⁂ + +

Using The Python Shell

+ +

The Python Shell is where you can explore Python syntax, get interactive help on commands, and debug short programs. The graphical Python Shell (named IDLE) also contains a decent text editor that supports Python syntax coloring and integrates with the Python Shell. If you don’t already have a favorite text editor, you should give IDLE a try. + +

First things first. The Python Shell itself is an amazing interactive playground. Throughout this book, you’ll see examples like this: + +

+>>> 1 + 1
+2
+ +

The three angle brackets, >>>, denote the Python Shell prompt. Don’t type that part. That’s just to let you know that this example is meant to be followed in the Python Shell. + +

1 + 1 is the part you type. You can type any valid Python expression or command in the Python Shell. Don’t be shy; it won’t bite! The worst that will happen is you’ll get an error message. Commands get executed immediately (once you press ENTER); expressions get evaluated immediately, and the Python Shell prints out the result. + +

2 is the result of evaluating this expression. As it happens, 1 + 1 is a valid Python expression. The result, of course, is 2. + +

Let’s try another one. + +

+>>> print('Hello world!')
+Hello world!
+
+ +

Pretty simple, no? But there’s lots more you can do in the Python Shell. If you ever get stuck (you can’t remember a command, or you can’t remember the proper arguments to pass to a certain function), you can get interactive help in the Python Shell. Just type help and press ENTER. + +

+>>> help
+Type help() for interactive help, or help(object) for help about object.
+ +

There are two modes of help. You can get help about a single object, which just prints out the documentation and returns you to the Python Shell prompt. You can also enter help mode, where instead of evaluating Python expressions, you just type keywords or command names and it will print out whatever it knows about that command. + +

To enter the interactive help mode, type help() and press ENTER. + +

+>>> help()
+Welcome to Python 3.0!  This is the online help utility.
+
+If this is your first time using Python, you should definitely check out
+the tutorial on the Internet at http://docs.python.org/tutorial/.
+
+Enter the name of any module, keyword, or topic to get help on writing
+Python programs and using Python modules.  To quit this help utility and
+return to the interpreter, just type "quit".
+
+To get a list of available modules, keywords, or topics, type "modules",
+"keywords", or "topics".  Each module also comes with a one-line summary
+of what it does; to list the modules whose summaries contain a given word
+such as "spam", type "modules spam".
+
+help> 
+ +

Note how the prompt changes from >>> to help>. This reminds you that you’re in the interactive help mode. Now you can enter any keyword, command, module name, function name — pretty much anything Python understands — and read documentation on it. + +

+help> print                                                                 
+Help on built-in function print in module builtins:
+
+print(...)
+    print(value, ..., sep=' ', end='\n', file=sys.stdout)
+    
+    Prints the values to a stream, or to sys.stdout by default.
+    Optional keyword arguments:
+    file: a file-like object (stream); defaults to the current sys.stdout.
+    sep:  string inserted between values, default a space.
+    end:  string appended after the last value, default a newline.
+
+help> PapayaWhip                                                            
+no Python documentation found for 'PapayaWhip'
+
+help> quit                                                                  
+
+You are now leaving help and returning to the Python interpreter.
+If you want to ask for help on a particular object directly from the
+interpreter, you can type "help(object)".  Executing "help('string')"
+has the same effect as typing a particular string at the help> prompt.
+>>>                                                                         
+
    +
  1. To get documentation on the print() function, just type print and press ENTER. The interactive help mode will display something akin to a man page: the function name, a brief synopsis, the function’s arguments and their default values, and so on. If the documentation seems opaque to you, don’t panic. You’ll learn more about all these concepts in the next few chapters. +
  2. Of course, the interactive help mode doesn’t know everything. If you type something that isn’t a Python command, module, function, or other built-in keyword, the interactive help mode will just shrug its virtual shoulders. +
  3. To quit the interactive help mode, type quit and press ENTER. +
  4. The prompt changes back to >>> to signal that you’ve left the interactive help mode and returned to the Python Shell. +
+ +
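The other mode, help about a single object, works outside the interactive help prompt too. The snippet below is only a sketch: instead of calling help(print), which pages the text to the screen, it uses pydoc.render_doc() from the standard library to get the same documentation back as a string. The variable names are made up for this example.

```python
import pydoc

# help(print) would display this text in a pager; render_doc() returns it
# as a plain string, so the example has something to inspect.
doc_text = pydoc.render_doc(print, renderer=pydoc.plaintext)

# The first line is a title that names the object being documented.
title = doc_text.splitlines()[0]
print(title)
```

In the Python Shell itself you would simply type help(print) and read the result.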

IDLE, the graphical Python Shell, also includes a Python-aware text editor. + +

⁂ + +

Python Editors and IDEs

+ +

IDLE is not the only game in town when it comes to writing programs in Python. While it’s useful to get started with learning the language itself, many developers prefer other text editors or Integrated Development Environments (IDEs). I won’t cover them here, but the Python community maintains a list of Python-aware editors that covers a wide range of supported platforms and software licenses. + +

You might also want to check out the list of Python-aware IDEs, although few of them support Python 3 yet. One that does is PyDev, a plugin for Eclipse that turns Eclipse into a full-fledged Python IDE. Both Eclipse and PyDev are cross-platform and open source. + +

On the commercial front, there is ActiveState’s Komodo IDE. It has per-user licensing, but students can get a discount, and a free time-limited trial version is available. + +

I’ve been programming in Python for nine years, and I edit my Python programs in GNU Emacs and debug them in the command-line Python Shell. There’s no right or wrong way to develop in Python. Find a way that works for you! + +

+ +

© 2001–10 Mark Pilgrim + + + diff --git a/iterators.html b/iterators.html index 4b4a3f5..8da4842 100755 --- a/iterators.html +++ b/iterators.html @@ -1,394 +1,394 @@ - - -Classes & Iterators - Dive Into Python 3 - - - - - - -

  
-

You are here: Home Dive Into Python 3 -

Difficulty level: ♦♦♦♢♢ -

Classes & Iterators

-
-

East is East, and West is West, and never the twain shall meet.
Rudyard Kipling -

-

  -

Diving In

-

Iterators are the “secret sauce” of Python 3. They’re everywhere, underlying everything, always just out of sight. Comprehensions are just a simple form of iterators. Generators are just a simple form of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that. - -

Remember the Fibonacci generator? Here it is as a built-from-scratch iterator: - -

[download fibonacci2.py] -

class Fib:
-    '''iterator that yields numbers in the Fibonacci sequence'''
-
-    def __init__(self, max):
-        self.max = max
-
-    def __iter__(self):
-        self.a = 0
-        self.b = 1
-        return self
-
-    def __next__(self):
-        fib = self.a
-        if fib > self.max:
-            raise StopIteration
-        self.a, self.b = self.b, self.a + self.b
-        return fib
- -

Let’s take that one line at a time. - -

class Fib:
- -

class? What’s a class? - -

⁂ - -

Defining Classes

- -

Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you’ve defined. - -

Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class. - -

class PapayaWhip:  
-    pass           
-
    -
  1. The name of this class is PapayaWhip, and it doesn’t inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement. -
  2. You probably guessed this, but everything in a class is indented, just like the code within a function, if statement, for loop, or any other block of code. The first line not indented is outside the class. -
- -

This PapayaWhip class doesn’t define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It’s a statement that does nothing, and it’s a good placeholder when you’re stubbing out functions or classes. - -

-

The pass statement in Python is like an empty set of curly braces ({}) in Java or C. -

- -

Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don’t have explicit constructors and destructors. Although it’s not required, Python classes can have something similar to a constructor: the __init__() method. - -

The __init__() Method

- -

This example shows the initialization of the Fib class using the __init__() method. - -

class Fib:
-    '''iterator that yields numbers in the Fibonacci sequence'''  
-
-    def __init__(self, max):                                      
-
    -
  1. Classes can (and should) have docstrings too, just like modules and functions. -
  2. The __init__() method is called immediately after an instance of the class is created. It would be tempting — but technically incorrect — to call this the “constructor” of the class. It’s tempting, because it looks like a C++ constructor (by convention, the __init__() method is the first method defined for the class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the __init__() method is called, and you already have a valid reference to the new instance of the class. -
- -

The first argument of every instance method (that is, every ordinary method defined in a class), including the __init__() method, is always a reference to the current instance of the class. By convention, this argument is named self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don’t call it anything but self; this is a very strong convention. - -

In the __init__() method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically. - -

⁂ - -

Instantiating Classes

- -

Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__() method requires. The return value will be the newly created object. -

->>> import fibonacci2
->>> fib = fibonacci2.Fib(100)  
->>> fib                        
-<fibonacci2.Fib object at 0x00DB8810>
->>> fib.__class__              
-<class 'fibonacci2.Fib'>
->>> fib.__doc__                
-'iterator that yields numbers in the Fibonacci sequence'
-
    -
  1. You are creating an instance of the Fib class (defined in the fibonacci2 module) and assigning the newly created instance to the variable fib. You are passing one parameter, 100, which will end up as the max argument in Fib’s __init__() method. -
  2. fib is now an instance of the Fib class. -
  3. Every class instance has a built-in attribute, __class__, which is the object’s class. Java programmers may be familiar with the Class class, which contains methods like getName() and getSuperclass() to get metadata information about an object. In Python, this kind of metadata is available through attributes, but the idea is the same. -
  4. You can access the instance’s docstring just as with a function or a module. All instances of a class share the same docstring. -
- -
-

In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit new operator like there is in C++ or Java. -

- -

⁂ - -

Instance Variables

- -

On to the next line: - -

class Fib:
-    def __init__(self, max):
-        self.max = max        
-
    -
  1. What is self.max? It’s an instance variable. It is completely separate from max, which was passed into the __init__() method as an argument. self.max is “global” to the instance. That means that you can access it from other methods. -
- -
class Fib:
-    def __init__(self, max):
-        self.max = max        
-    .
-    .
-    .
-    def __next__(self):
-        fib = self.a
-        if fib > self.max:    
-
    -
  1. self.max is defined in the __init__() method… -
  2. …and referenced in the __next__() method. -
- -

Instance variables are specific to one instance of a class. For example, if you create two Fib instances with different maximum values, they will each remember their own values. - -

->>> import fibonacci2
->>> fib1 = fibonacci2.Fib(100)
->>> fib2 = fibonacci2.Fib(200)
->>> fib1.max
-100
->>> fib2.max
-200
- -

⁂ - -

A Fibonacci Iterator

- -

Now you’re ready to learn how to build an iterator. An iterator is just a class that defines an __iter__() method and a __next__() method. - -

- -

[download fibonacci2.py] -

class Fib:                                        
-    def __init__(self, max):                      
-        self.max = max
-
-    def __iter__(self):                           
-        self.a = 0
-        self.b = 1
-        return self
-
-    def __next__(self):                           
-        fib = self.a
-        if fib > self.max:
-            raise StopIteration                   
-        self.a, self.b = self.b, self.a + self.b
-        return fib                                
-
    -
  1. To build an iterator from scratch, fib needs to be a class, not a function. -
  2. “Calling” Fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later. -
  3. The __iter__() method is called whenever someone calls iter(fib). (As you’ll see in a minute, a for loop will call this automatically, but you can also call it yourself manually.) After performing beginning-of-iteration initialization (in this case, resetting self.a and self.b, our two counters), the __iter__() method can return any object that implements a __next__() method. In this case (and in most cases), __iter__() simply returns self, since this class implements its own __next__() method. -
  4. The __next__() method is called whenever someone calls next() on an iterator of an instance of a class. That will make more sense in a minute. -
  5. When the __next__() method raises a StopIteration exception, this signals to the caller that the iteration is exhausted. Unlike most exceptions, this is not an error; it’s a normal condition that just means that the iterator has no more values to generate. If the caller is a for loop, it will notice this StopIteration exception and gracefully exit the loop. (In other words, it will swallow the exception.) This little bit of magic is actually the key to using iterators in for loops. -
  6. To spit out the next value, an iterator’s __next__() method simply returns the value. Do not use yield here; that’s a bit of syntactic sugar that only applies when you’re using generators. Here you’re creating your own iterator from scratch; use return instead. -
- -
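To see that return in a class and yield in a generator end up producing the same values, here is a small side-by-side sketch. The Fib class is the one from this chapter. The fib_generator() function is an assumption: it is written along the lines of the generator from the earlier chapter, but with <= so that it stops at exactly the same point as the class.

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib          # a hand-built iterator uses return


def fib_generator(max):
    '''hypothetical generator equivalent of the Fib class'''
    a, b = 0, 1
    while a <= max:
        yield a             # a generator uses yield
        a, b = b, a + b


from_class = list(Fib(1000))
from_generator = list(fib_generator(1000))
print(from_class == from_generator)  # True
```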

Thoroughly confused yet? Excellent. Let’s see how to call this iterator: - -

->>> from fibonacci2 import Fib
->>> for n in Fib(1000):
-...     print(n, end=' ')
-0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
- -

Why, it’s exactly the same! Byte for byte identical to how you called Fibonacci-as-a-generator (modulo one capital letter). But how? - -

There’s a bit of magic involved in for loops. Here’s what happens: - -

- -
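As a rough sketch of that magic, the for loop can be spelled out by hand with the built-in iter() and next() functions. This is an approximation of what the loop does for you, not a description of the interpreter’s internals; the names fib_iter and values are made up for the example.

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


# Roughly what "for n in Fib(20): values.append(n)" does behind the scenes.
fib_iter = iter(Fib(20))        # calls Fib.__iter__(), which returns the object
values = []
while True:
    try:
        n = next(fib_iter)      # calls Fib.__next__()
    except StopIteration:
        break                   # the loop swallows the exception and exits
    values.append(n)

print(values)  # [0, 1, 1, 2, 3, 5, 8, 13]
```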

⁂ - -

A Plural Rule Iterator

- - -

Now it’s time for the finale. Let’s rewrite the plural rules generator as an iterator. - -

[download plural6.py] -

class LazyRules:
-    rules_filename = 'plural6-rules.txt'
-
-    def __init__(self):
-        self.pattern_file = open(self.rules_filename, encoding='utf-8')
-        self.cache = []
-
-    def __iter__(self):
-        self.cache_index = 0
-        return self
-
-    def __next__(self):
-        self.cache_index += 1
-        if len(self.cache) >= self.cache_index:
-            return self.cache[self.cache_index - 1]
-
-        if self.pattern_file.closed:
-            raise StopIteration
-
-        line = self.pattern_file.readline()
-        if not line:
-            self.pattern_file.close()
-            raise StopIteration
-
-        pattern, search, replace = line.split(None, 3)
-        funcs = build_match_and_apply_functions(
-            pattern, search, replace)
-        self.cache.append(funcs)
-        return funcs
-
-rules = LazyRules()
- -

So this is a class that implements __iter__() and __next__(), so it can be used as an iterator. Then, you instantiate the class and assign it to rules. This happens just once, on import. - -

Let’s take the class one bite at a time. - -

class LazyRules:
-    rules_filename = 'plural6-rules.txt'
-
-    def __init__(self):
-        self.pattern_file = open(self.rules_filename, encoding='utf-8')  
-        self.cache = []                                                  
-
    -
  1. When we instantiate the LazyRules class, open the pattern file but don’t read anything from it. (That comes later.) -
  2. After opening the patterns file, initialize the cache. You’ll use this cache later (in the __next__() method) as you read lines from the pattern file. -
- -

Before we continue, let’s take a closer look at rules_filename. It’s not defined within the __iter__() method. In fact, it’s not defined within any method. It’s defined at the class level. It’s a class variable, and although you can access it just like an instance variable (self.rules_filename), it is shared across all instances of the LazyRules class. - -

->>> import plural6
->>> r1 = plural6.LazyRules()
->>> r2 = plural6.LazyRules()
->>> r1.rules_filename                               
-'plural6-rules.txt'
->>> r2.rules_filename
-'plural6-rules.txt'
->>> r2.rules_filename = 'r2-override.txt'           
->>> r2.rules_filename
-'r2-override.txt'
->>> r1.rules_filename
-'plural6-rules.txt'
->>> r2.__class__.rules_filename                     
-'plural6-rules.txt'
->>> r2.__class__.rules_filename = 'papayawhip.txt'  
->>> r1.rules_filename
-'papayawhip.txt'
->>> r2.rules_filename                               
-'r2-override.txt'
-
    -
  1. Each instance of the class inherits the rules_filename attribute with the value defined by the class. -
  2. Changing the attribute’s value in one instance does not affect other instances… -
  3. …nor does it change the class attribute. You can access the class attribute (as opposed to an individual instance’s attribute) by using the special __class__ attribute to access the class itself. -
  4. If you change the class attribute, all instances that are still inheriting that value (like r1 here) will be affected. -
  5. Instances that have overridden that attribute (like r2 here) will not be affected. -
- -

And now back to our show. - -

    def __iter__(self):       
-        self.cache_index = 0
-        return self           
-
-
    -
  1. The __iter__() method will be called every time someone — say, a for loop — calls iter(rules). -
  2. The one thing that every __iter__() method must do is return an iterator. In this case, it returns self, which signals that this class defines a __next__() method which will take care of returning values throughout the iteration. -
- -
    def __next__(self):                                 
-        .
-        .
-        .
-        pattern, search, replace = line.split(None, 3)
-        funcs = build_match_and_apply_functions(        
-            pattern, search, replace)
-        self.cache.append(funcs)                        
-        return funcs
-
    -
  1. The __next__() method gets called whenever someone — say, a for loop — calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that. -
  2. The last part of this function should look familiar, at least. The build_match_and_apply_functions() function hasn’t changed; it’s the same as it ever was. -
  3. The only difference is that, before returning the match and apply functions (which are stored in the tuple funcs), we’re going to save them in self.cache. -
- -

Moving backwards… - -

    def __next__(self):
-        .
-        .
-        .
-        line = self.pattern_file.readline()  
-        if not line:                         
-            self.pattern_file.close()
-            raise StopIteration              
-        .
-        .
-        .
-
    -
  1. A bit of advanced file trickery here. The readline() method (note: singular, not the plural readlines()) reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all the way down…) -
  2. If there was a line for readline() to read, line will not be an empty string. Even if the file contained a blank line, line would end up as the one-character string '\n' (a newline). If line is really an empty string, that means there are no more lines to read from the file. -
  3. When we reach the end of the file, we should close the file and raise the magic StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. ( The party’s over… ) -
- -
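The difference between a blank line and the end of the file is easy to check for yourself. This sketch uses io.StringIO as a stand-in for a real file (an assumption made so the example needs nothing on disk); readline() behaves the same way on it.

```python
import io

# Three lines of "file" content; the middle one is blank.
fake_file = io.StringIO('first\n\nlast\n')

lines = []
while True:
    line = fake_file.readline()
    if not line:            # only the empty string means end of file
        break
    lines.append(line)

print(lines)  # ['first\n', '\n', 'last\n']
```

The blank line comes back as '\n', which is a non-empty string, so the loop keeps going until readline() finally returns ''.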

Moving backwards all the way to the start of the __next__() method… - -

    def __next__(self):
-        self.cache_index += 1
-        if len(self.cache) >= self.cache_index:
-            return self.cache[self.cache_index - 1]     
-
-        if self.pattern_file.closed:
-            raise StopIteration                         
-        .
-        .
-        .
-
    -
  1. self.cache will be a list of the functions we need to match and apply individual rules. (At least that should sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t exhausted the cache yet (i.e. if the length of self.cache is greater than or equal to self.cache_index, which has just been incremented), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch. -
  2. On the other hand, if we don’t get a hit from the cache, and the file object has been closed (which could happen, further down the method, as you saw in the previous code snippet), then there’s nothing more we can do. If the file is closed, it means we’ve exhausted it — we’ve already read through every line from the pattern file, and we’ve already built and cached the match and apply functions for each pattern. The file is exhausted; the cache is exhausted; I’m exhausted. Wait, what? Hang in there, we’re almost done. -
- -

Putting it all together, here’s what happens when: - -

- -
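One way to watch the pieces work together is with a stripped-down stand-in. The LazyLines class below is hypothetical: it follows the same shape as the __next__() method walked through above, but it caches plain stripped lines instead of match and apply functions, reads from an io.StringIO instead of plural6-rules.txt, and keeps a reads counter purely for illustration.

```python
import io


class LazyLines:
    '''hypothetical stand-in for LazyRules that caches stripped lines'''

    def __init__(self, stream):
        self.stream = stream
        self.cache = []
        self.reads = 0              # counts real reads, for illustration only

    def __iter__(self):
        self.cache_index = 0
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]   # cache hit

        if self.stream.closed:
            raise StopIteration     # stream and cache are both exhausted

        line = self.stream.readline()
        self.reads += 1
        if not line:
            self.stream.close()
            raise StopIteration

        item = line.strip()
        self.cache.append(item)
        return item


rules = LazyLines(io.StringIO('alpha\nbeta\ngamma\n'))
first_pass = list(rules)            # reads from the stream and fills the cache
reads_after_first = rules.reads
second_pass = list(rules)           # served entirely from the cache
print(first_pass == second_pass, rules.reads == reads_after_first)
```

The first pass makes four reads (three lines, plus the one that discovers the end of the stream). The second pass makes none.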

We have achieved pluralization nirvana. - -

    -
  1. Minimal startup cost. The only thing that happens on import is instantiating a single class and opening a file (but not reading from it). -
  2. Maximum performance. The previous example would read through the file and build functions dynamically every time you wanted to pluralize a word. This version will cache functions as soon as they’re built, and in the worst case, it will only read through the pattern file once, no matter how many words you pluralize. -
  3. Separation of code and data. All the patterns are stored in a separate file. Code is code, and data is data, and never the twain shall meet. -
- -
-

Is this really nirvana? Well, yes and no. Here’s something to consider with the LazyRules example: the pattern file is opened (during __init__()), and it remains open until the final rule is reached. Python will eventually close the file when it exits, or after the last instance of the LazyRules class is destroyed, but still, that could be a long time. If this class is part of a long-running Python process, the Python interpreter may never exit, and the LazyRules object may never get destroyed. -

There are ways around this. Instead of opening the file during __init__() and leaving it open while you read rules one line at a time, you could open the file, read all the rules, and immediately close the file. Or you could open the file, read one rule, save the file position with the tell() method, close the file, and later re-open it and use the seek() method to continue reading where you left off. Or you could not worry about it and just leave the file open, like this example code does. Programming is design, and design is all about trade-offs and constraints. Leaving a file open too long might be a problem; making your code more complicated might be a problem. Which one is the bigger problem depends on your development team, your application, and your runtime environment. -

- -
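The tell() and seek() option might look something like this. It is a minimal sketch under stated assumptions, not a drop-in replacement for LazyRules: the read_one_line() helper and the temporary rules.txt file are both made up for the example.

```python
import os
import tempfile


def read_one_line(path, position):
    '''hypothetical helper: open the file, read one line, close it again'''
    with open(path, encoding='utf-8') as f:
        f.seek(position)            # continue where the last call left off
        line = f.readline()
        return line, f.tell()       # remember the position for next time


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rules.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('one\ntwo\n')

    position = 0
    collected = []
    while True:
        line, position = read_one_line(path, position)
        if not line:
            break
        collected.append(line.rstrip('\n'))

print(collected)  # ['one', 'two']
```

The file is never left open between reads, but every rule now costs an open, a seek, and a close. That is the trade-off.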

⁂ - -

Further Reading

- - -

- -

© 2001–10 Mark Pilgrim - - - + + +Classes & Iterators - Dive Into Python 3 + + + + + + +

  
+

You are here: Home Dive Into Python 3 +

Difficulty level: ♦♦♦♢♢ +

Classes & Iterators

+
+

East is East, and West is West, and never the twain shall meet.
Rudyard Kipling +

+

  +

Diving In

+

Iterators are the “secret sauce” of Python 3. They’re everywhere, underlying everything, always just out of sight. Comprehensions are just a simple form of iterators. Generators are just a simple form of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that. + +

Remember the Fibonacci generator? Here it is as a built-from-scratch iterator: + +

[download fibonacci2.py] +

class Fib:
+    '''iterator that yields numbers in the Fibonacci sequence'''
+
+    def __init__(self, max):
+        self.max = max
+
+    def __iter__(self):
+        self.a = 0
+        self.b = 1
+        return self
+
+    def __next__(self):
+        fib = self.a
+        if fib > self.max:
+            raise StopIteration
+        self.a, self.b = self.b, self.a + self.b
+        return fib
+ +

Let’s take that one line at a time. + +

class Fib:
+ +

class? What’s a class? + +

⁂ + +

Defining Classes

+ +

Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you’ve defined. + +

Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class. + +

class PapayaWhip:  
+    pass           
+
    +
  1. The name of this class is PapayaWhip, and it doesn’t inherit from any other class. Class names are usually capitalized, EachWordLikeThis, but this is only a convention, not a requirement. +
  2. You probably guessed this, but everything in a class is indented, just like the code within a function, if statement, for loop, or any other block of code. The first line not indented is outside the class. +
+ +

This PapayaWhip class doesn’t define any methods or attributes, but syntactically, there needs to be something in the definition, thus the pass statement. This is a Python reserved word that just means “move along, nothing to see here”. It’s a statement that does nothing, and it’s a good placeholder when you’re stubbing out functions or classes. + +

+

The pass statement in Python is like an empty set of curly braces ({}) in Java or C. +

+ +
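As a small illustration of pass as a placeholder, here is a sketch that stubs out both a class and a function. The names MangoFool and not_written_yet are made up for this example.

```python
class MangoFool:
    # a class body needs at least one statement; pass fills the slot
    pass

def not_written_yet():
    # a stubbed-out function: calling it does nothing and returns None
    pass

dessert = MangoFool()          # the empty class can still be instantiated
result = not_written_yet()     # the empty function can still be called
print(type(dessert).__name__)  # MangoFool
print(result)                  # None
```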

Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don’t have explicit constructors and destructors. Although it’s not required, Python classes can have something similar to a constructor: the __init__() method. + +

The __init__() Method

+ +

This example shows the initialization of the Fib class using the __init__() method. + +

class Fib:
+    '''iterator that yields numbers in the Fibonacci sequence'''  
+
+    def __init__(self, max):                                      
+
    +
  1. Classes can (and should) have docstrings too, just like modules and functions. +
  2. The __init__() method is called immediately after an instance of the class is created. It would be tempting — but technically incorrect — to call this the “constructor” of the class. It’s tempting, because it looks like a C++ constructor (by convention, the __init__() method is the first method defined for the class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even sounds like one. Incorrect, because the object has already been constructed by the time the __init__() method is called, and you already have a valid reference to the new instance of the class. +
+ +

The first argument of every class method, including the __init__() method, is always a reference to the current instance of the class. By convention, this argument is named self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don’t call it anything but self; this is a very strong convention. + +

In the __init__() method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically. + +
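To make the role of self concrete, here is a minimal sketch. The Counter class and its bump() method are invented for illustration. Calling a method on an instance, and calling the same function through the class with the instance passed by hand, do the same thing.

```python
class Counter:
    '''hypothetical class used only to illustrate self'''

    def __init__(self, start):
        # self is the newly created instance; start is an ordinary argument
        self.count = start

    def bump(self, step):
        # self is the instance whose method was called
        self.count += step
        return self.count

c = Counter(10)
first = c.bump(5)            # Python passes c as self automatically
second = Counter.bump(c, 5)  # the same call with self written out by hand
print(first, second)         # 15 20
```

You would never write the second form in real code; it is here only to show what Python fills in for you.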

⁂ + +

Instantiating Classes

+ +

Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__() method requires. The return value will be the newly created object. +

+>>> import fibonacci2
+>>> fib = fibonacci2.Fib(100)  
+>>> fib                        
+<fibonacci2.Fib object at 0x00DB8810>
+>>> fib.__class__              
+<class 'fibonacci2.Fib'>
+>>> fib.__doc__                
+'iterator that yields numbers in the Fibonacci sequence'
+
    +
  1. You are creating an instance of the Fib class (defined in the fibonacci2 module) and assigning the newly created instance to the variable fib. You are passing one parameter, 100, which will end up as the max argument in Fib’s __init__() method. +
  2. fib is now an instance of the Fib class. +
  3. Every class instance has a built-in attribute, __class__, which is the object’s class. Java programmers may be familiar with the Class class, which contains methods like getName() and getSuperclass() to get metadata information about an object. In Python, this kind of metadata is available through attributes, but the idea is the same. +
  4. You can access the instance’s docstring just as with a function or a module. All instances of a class share the same docstring. +
+ +
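If you want to poke at these attributes without the fibonacci2 module on hand, here is a self-contained sketch using a cut-down copy of the Fib class (only __init__() is kept). It also uses the built-in type() function, which gives you the same class object as __class__.

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

fib = Fib(100)
same_class = fib.__class__ is Fib           # the instance knows its class
same_as_type = type(fib) is fib.__class__   # type() returns the same object
class_name = fib.__class__.__name__         # metadata lives in attributes
shared_doc = fib.__doc__ is Fib.__doc__     # the docstring belongs to the class
print(same_class, same_as_type, class_name, shared_doc)
# True True Fib True
```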
+

In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit new operator like there is in C++ or Java. +

+ +

⁂ + +

Instance Variables

+ +

On to the next line: + +

class Fib:
+    def __init__(self, max):
+        self.max = max        
+
    +
  1. What is self.max? It’s an instance variable. It is completely separate from max, which was passed into the __init__() method as an argument. self.max is “global” to the instance. That means that you can access it from other methods. +
+ +
class Fib:
+    def __init__(self, max):
+        self.max = max        
+    .
+    .
+    .
+    def __next__(self):
+        fib = self.a
+        if fib > self.max:    
+
    +
  1. self.max is defined in the __init__() method… +
  2. …and referenced in the __next__() method. +
+ +

Instance variables are specific to one instance of a class. For example, if you create two Fib instances with different maximum values, they will each remember their own values. + +

+>>> import fibonacci2
+>>> fib1 = fibonacci2.Fib(100)
+>>> fib2 = fibonacci2.Fib(200)
+>>> fib1.max
+100
+>>> fib2.max
+200
+ +

⁂ + +

A Fibonacci Iterator

+ +

Now you’re ready to learn how to build an iterator. An iterator is just a class that defines an __iter__() method and a __next__() method. + +

+ +

[download fibonacci2.py] +

class Fib:                                        
+    def __init__(self, max):                      
+        self.max = max
+
+    def __iter__(self):                           
+        self.a = 0
+        self.b = 1
+        return self
+
+    def __next__(self):                           
+        fib = self.a
+        if fib > self.max:
+            raise StopIteration                   
+        self.a, self.b = self.b, self.a + self.b
+        return fib                                
+
    +
  1. To build an iterator from scratch, Fib needs to be a class, not a function. +
  2. “Calling” Fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later. +
  3. The __iter__() method is called whenever someone calls iter(fib). (As you’ll see in a minute, a for loop will call this automatically, but you can also call it yourself manually.) After performing beginning-of-iteration initialization (in this case, resetting self.a and self.b, our two counters), the __iter__() method can return any object that implements a __next__() method. In this case (and in most cases), __iter__() simply returns self, since this class implements its own __next__() method. +
  4. The __next__() method is called whenever someone calls next() on an iterator of an instance of a class. That will make more sense in a minute. +
  5. When the __next__() method raises a StopIteration exception, this signals to the caller that the iteration is exhausted. Unlike most exceptions, this is not an error; it’s a normal condition that just means that the iterator has no more values to generate. If the caller is a for loop, it will notice this StopIteration exception and gracefully exit the loop. (In other words, it will swallow the exception.) This little bit of magic is actually the key to using iterators in for loops. +
  6. To spit out the next value, an iterator’s __next__() method simply returns the value. Do not use yield here; that’s a bit of syntactic sugar that only applies when you’re using generators. Here you’re creating your own iterator from scratch; use return instead. +
+ +

Thoroughly confused yet? Excellent. Let’s see how to call this iterator: + +

+>>> from fibonacci2 import Fib
+>>> for n in Fib(1000):
+...     print(n, end=' ')
+0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
+ +

Why, it’s exactly the same! Byte for byte identical to how you called Fibonacci-as-a-generator (modulo one capital letter). But how? + +

There’s a bit of magic involved in for loops. Here’s what happens: + +

+ +
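The steps a for loop takes can be spelled out by hand with the built-in iter() and next() functions. This is a sketch of the equivalent manual loop, not a description of how the interpreter is literally implemented. The Fib class is repeated so the example is self-contained.

```python
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib

fib_iter = iter(Fib(20))       # calls Fib.__iter__(), which returns the iterator
collected = []
while True:
    try:
        n = next(fib_iter)     # calls fib_iter.__next__()
    except StopIteration:      # the signal a for loop swallows silently
        break
    collected.append(n)

print(collected)               # [0, 1, 1, 2, 3, 5, 8, 13]
```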

⁂ + +

A Plural Rule Iterator

+ + +

Now it’s time for the finale. Let’s rewrite the plural rules generator as an iterator. + +

[download plural6.py] +

class LazyRules:
+    rules_filename = 'plural6-rules.txt'
+
+    def __init__(self):
+        self.pattern_file = open(self.rules_filename, encoding='utf-8')
+        self.cache = []
+
+    def __iter__(self):
+        self.cache_index = 0
+        return self
+
+    def __next__(self):
+        self.cache_index += 1
+        if len(self.cache) >= self.cache_index:
+            return self.cache[self.cache_index - 1]
+
+        if self.pattern_file.closed:
+            raise StopIteration
+
+        line = self.pattern_file.readline()
+        if not line:
+            self.pattern_file.close()
+            raise StopIteration
+
+        pattern, search, replace = line.split(None, 3)
+        funcs = build_match_and_apply_functions(
+            pattern, search, replace)
+        self.cache.append(funcs)
+        return funcs
+
+rules = LazyRules()
+ +

So this is a class that implements __iter__() and __next__(), so it can be used as an iterator. Then, you instantiate the class and assign it to rules. This happens just once, on import. + +

Let’s take the class one bite at a time. + +

class LazyRules:
+    rules_filename = 'plural6-rules.txt'
+
+    def __init__(self):
+        self.pattern_file = open(self.rules_filename, encoding='utf-8')  
+        self.cache = []                                                  
+
    +
  1. When we instantiate the LazyRules class, open the pattern file but don’t read anything from it. (That comes later.) +
  2. After opening the patterns file, initialize the cache. You’ll use this cache later (in the __next__() method) as you read lines from the pattern file. +
+ +

Before we continue, let’s take a closer look at rules_filename. It’s not defined within the __iter__() method. In fact, it’s not defined within any method. It’s defined at the class level. It’s a class variable, and although you can access it just like an instance variable (self.rules_filename), it is shared across all instances of the LazyRules class. + +

+>>> import plural6
+>>> r1 = plural6.LazyRules()
+>>> r2 = plural6.LazyRules()
+>>> r1.rules_filename                               
+'plural6-rules.txt'
+>>> r2.rules_filename
+'plural6-rules.txt'
+>>> r2.rules_filename = 'r2-override.txt'           
+>>> r2.rules_filename
+'r2-override.txt'
+>>> r1.rules_filename
+'plural6-rules.txt'
+>>> r2.__class__.rules_filename                     
+'plural6-rules.txt'
+>>> r2.__class__.rules_filename = 'papayawhip.txt'  
+>>> r1.rules_filename
+'papayawhip.txt'
+>>> r2.rules_filename                               
+'r2-override.txt'
+
    +
  1. Each instance of the class inherits the rules_filename attribute with the value defined by the class. +
  2. Changing the attribute’s value in one instance does not affect other instances… +
  3. …nor does it change the class attribute. You can access the class attribute (as opposed to an individual instance’s attribute) by using the special __class__ attribute to access the class itself. +
  4. If you change the class attribute, all instances that are still inheriting that value (like r1 here) will be affected. +
  5. Instances that have overridden that attribute (like r2 here) will not be affected. +
+ +
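One way to see why the override sticks to r2 alone is to look at where each attribute is actually stored. The sketch below uses a stripped-down LazyRules with no methods (an assumption made so that it runs without the rules file) and the built-in vars() function, which shows the attributes stored directly on an object.

```python
class LazyRules:
    rules_filename = 'plural6-rules.txt'   # class attribute

r1 = LazyRules()
r2 = LazyRules()
r2.rules_filename = 'r2-override.txt'      # creates an attribute on r2 only

r1_own = dict(vars(r1))                    # r1 stores nothing of its own
r2_own = dict(vars(r2))                    # r2 stores the override
on_class = 'rules_filename' in vars(LazyRules)
print(r1_own, r2_own, on_class)
# {} {'rules_filename': 'r2-override.txt'} True

del r2.rules_filename                      # remove the override...
restored = r2.rules_filename               # ...and the class value shows through
print(restored)                            # plural6-rules.txt
```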

And now back to our show. + +

    def __iter__(self):       
+        self.cache_index = 0
+        return self           
+
+
    +
  1. The __iter__() method will be called every time someone — say, a for loop — calls iter(rules). +
  2. The one thing that every __iter__() method must do is return an iterator. In this case, it returns self, which signals that this class defines a __next__() method which will take care of returning values throughout the iteration. +
+ +
    def __next__(self):                                 
+        .
+        .
+        .
+        pattern, search, replace = line.split(None, 3)
+        funcs = build_match_and_apply_functions(        
+            pattern, search, replace)
+        self.cache.append(funcs)                        
+        return funcs
+
    +
  1. The __next__() method gets called whenever someone — say, a for loop — calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that. +
  2. The last part of this function should look familiar, at least. The build_match_and_apply_functions() function hasn’t changed; it’s the same as it ever was. +
  3. The only difference is that, before returning the match and apply functions (which are stored in the tuple funcs), we’re going to save them in self.cache. +
+ +

Moving backwards… + +

    def __next__(self):
+        .
+        .
+        .
+        line = self.pattern_file.readline()  
+        if not line:                         
+            self.pattern_file.close()
+            raise StopIteration              
+        .
+        .
+        .
+
    +
  1. A bit of advanced file trickery here. The readline() method (note: singular, not the plural readlines()) reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all the way down…) +
  2. If there was a line for readline() to read, line will not be an empty string. Even if the file contained a blank line, line would end up as the one-character string '\n' (a newline). If line is really an empty string, that means there are no more lines to read from the file. +
  3. When we reach the end of the file, we should close the file and raise the magic StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. (The party’s over…) +
+ +
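You can watch readline() tell a blank line apart from the end of the file without creating anything on disk. In this sketch, io.StringIO stands in for an open text file, and the three lines of content are made up.

```python
import io

# io.StringIO behaves like an open text file; the middle line is blank
pattern_file = io.StringIO('first line\n\nthird line\n')

lines = []
while True:
    line = pattern_file.readline()
    if not line:          # empty string: nothing left to read
        break
    lines.append(line)

after_eof = pattern_file.readline()   # still an empty string, not an error
print(lines)              # ['first line\n', '\n', 'third line\n']
print(repr(after_eof))    # ''
```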

Moving backwards all the way to the start of the __next__() method… + +

    def __next__(self):
+        self.cache_index += 1
+        if len(self.cache) >= self.cache_index:
+            return self.cache[self.cache_index - 1]     
+
+        if self.pattern_file.closed:
+            raise StopIteration                         
+        .
+        .
+        .
+
    +
  1. self.cache will be a list of the functions we need to match and apply individual rules. (At least that should sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t exhausted the cache yet (i.e. if the length of self.cache is greater than or equal to the freshly incremented self.cache_index), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch. +
  2. On the other hand, if we don’t get a hit from the cache, and the file object has been closed (which could happen, further down the method, as you saw in the previous code snippet), then there’s nothing more we can do. If the file is closed, it means we’ve exhausted it — we’ve already read through every line from the pattern file, and we’ve already built and cached the match and apply functions for each pattern. The file is exhausted; the cache is exhausted; I’m exhausted. Wait, what? Hang in there, we’re almost done. +
+ +

Putting it all together, here’s what happens when: + +

+ +
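The whole flow can be watched in miniature. The sketch below is a hypothetical stand-in for LazyRules, not the real class: it keeps the same __iter__() and __next__() logic, but it reads from an io.StringIO object instead of a file on disk, caches upper-cased lines instead of match and apply functions, and adds a reads counter purely for illustration.

```python
import io

class LazyLines:
    '''hypothetical stand-in for LazyRules that caches upper-cased lines'''

    def __init__(self, source):
        self.source = source          # any open file-like object
        self.cache = []
        self.reads = 0                # counts real reads, for illustration

    def __iter__(self):
        self.cache_index = 0
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]   # cache hit

        if self.source.closed:
            raise StopIteration

        line = self.source.readline()
        if not line:
            self.source.close()
            raise StopIteration

        self.reads += 1
        item = line.strip().upper()   # stands in for building the functions
        self.cache.append(item)
        return item

rules = LazyLines(io.StringIO('a\nb\nc\n'))
first_pass = list(rules)              # reads every line and fills the cache
reads_after_first = rules.reads
second_pass = list(rules)             # served entirely from the cache
print(first_pass, second_pass)        # ['A', 'B', 'C'] ['A', 'B', 'C']
print(reads_after_first, rules.reads) # 3 3
```

The second pass never touches the source: the read count stays at 3, and the closed-file check is what ends the iteration once the cache runs out.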

We have achieved pluralization nirvana. + +

    +
  1. Minimal startup cost. The only thing that happens on import is instantiating a single class and opening a file (but not reading from it). +
  2. Maximum performance. The previous example would read through the file and build functions dynamically every time you wanted to pluralize a word. This version will cache functions as soon as they’re built, and in the worst case, it will only read through the pattern file once, no matter how many words you pluralize. +
  3. Separation of code and data. All the patterns are stored in a separate file. Code is code, and data is data, and never the twain shall meet. +
+ +
+

Is this really nirvana? Well, yes and no. Here’s something to consider with the LazyRules example: the pattern file is opened (during __init__()), and it remains open until the final rule is reached. Python will eventually close the file when it exits, or after the last instance of the LazyRules class is destroyed, but still, that could be a long time. If this class is part of a long-running Python process, the Python interpreter may never exit, and the LazyRules object may never get destroyed. +

There are ways around this. Instead of opening the file during __init__() and leaving it open while you read rules one line at a time, you could open the file, read all the rules, and immediately close the file. Or you could open the file, read one rule, save the file position with the tell() method, close the file, and later re-open it and use the seek() method to continue reading where you left off. Or you could not worry about it and just leave the file open, like this example code does. Programming is design, and design is all about trade-offs and constraints. Leaving a file open too long might be a problem; making your code more complicated might be a problem. Which one is the bigger problem depends on your development team, your application, and your runtime environment. +

+ +
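Here is a rough sketch of the second alternative: open the file, read one rule, remember the position with tell(), close the file, and pick up later with seek(). The read_one_rule() helper and the contents of the temporary file are assumptions made for this example; they are not part of plural6.py.

```python
import os
import tempfile

# write a small throwaway rules file (contents are made up)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as f:
    f.write('rule one\nrule two\nrule three\n')
    path = f.name

def read_one_rule(path, position):
    '''open the file, read one line starting at position, close the file'''
    with open(path, encoding='utf-8') as rules_file:
        rules_file.seek(position)
        line = rules_file.readline()
        return line, rules_file.tell()   # remember where to resume

position = 0
rules = []
while True:
    line, position = read_one_rule(path, position)
    if not line:                         # empty string means end of file
        break
    rules.append(line.rstrip('\n'))

os.remove(path)
print(rules)   # ['rule one', 'rule two', 'rule three']
```

The file is never open for longer than a single read, at the cost of re-opening it once per rule. That is the trade-off in a nutshell.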

⁂ + +

Further Reading

+ + +

+ +

© 2001–10 Mark Pilgrim + + + diff --git a/j/.htaccess b/j/.htaccess index 35a1445..3c593e3 100644 --- a/j/.htaccess +++ b/j/.htaccess @@ -1,4 +1,4 @@ -FileETag MTime Size - -ExpiresActive On -ExpiresDefault "access plus 1 year" +FileETag MTime Size + +ExpiresActive On +ExpiresDefault "access plus 1 year" diff --git a/j/html5.js b/j/html5.js index e973e7f..6457708 100644 --- a/j/html5.js +++ b/j/html5.js @@ -1 +1,3 @@ -(function(){var e="abbr,article,aside,audio,bb,canvas,datagrid,datalist,details,dialog,figure,footer,header,mark,menu,meter,nav,output,progress,section,time,video".split(','),i=e.length;while(i--){document.createElement(e[i])}})() \ No newline at end of file +/*@cc_on@if(@_jscript_version<9)(function(p,e){function q(a,b){if(g[a])g[a].styleSheet.cssText+=b;else{var c=r[l],d=e[j]("style");d.media=a;c.insertBefore(d,c[l]);g[a]=d;q(a,b)}}function s(a,b){for(var c=new RegExp("\\b("+m+")\\b(?!.*[;}])","gi"),d=function(k){return".iepp_"+k},h=-1;++h\\s*$","i");i.innerHTML=a.outerHTML.replace(/\r|\n/g," ").replace(c,a.currentStyle.display=="block"?"":"");c=i.childNodes[0];c.className+=" iepp_"+d;c=f[f.length]=[a,c];a.parentNode.replaceChild(c[1],c[0])}s(e.styleSheets,"all")}function u(){for(var a=-1,b;++a - -Packaging Python Libraries - Dive Into Python 3 - - - - - - -

  
-


Difficulty level: ♦♦♦♦♢ -

Packaging Python Libraries

-
-

You’ll find the shame is like the pain; you only feel it once.
— Marquise de Merteuil, Dangerous Liaisons -

-

  -

Diving In

-

Real artists ship. Or so says Steve Jobs. Do you want to release a Python script, library, framework, or application? Excellent. The world needs more Python code. Python 3 comes with a packaging framework called Distutils. Distutils is many things: a build tool (for you), an installation tool (for your users), a package metadata format (for search engines), and more. It integrates with the Python Package Index (“PyPI”), a central repository for open source Python libraries. - -

All of these facets of Distutils center around the setup script, traditionally called setup.py. In fact, you’ve already seen several Distutils setup scripts in this book. You used Distutils to install httplib2 in HTTP Web Services and again to install chardet in Case Study: Porting chardet to Python 3. - -

In this chapter, you’ll learn how the setup scripts for chardet and httplib2 work, and you’ll step through the process of releasing your own Python software. - -

# chardet's setup.py
-from distutils.core import setup
-setup(
-    name = "chardet",
-    packages = ["chardet"],
-    version = "1.0.2",
-    description = "Universal encoding detector",
-    author = "Mark Pilgrim",
-    author_email = "mark@diveintomark.org",
-    url = "http://chardet.feedparser.org/",
-    download_url = "http://chardet.feedparser.org/download/python3-chardet-1.0.1.tgz",
-    keywords = ["encoding", "i18n", "xml"],
-    classifiers = [
-        "Programming Language :: Python",
-        "Programming Language :: Python :: 3",
-        "Development Status :: 4 - Beta",
-        "Environment :: Other Environment",
-        "Intended Audience :: Developers",
-        "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
-        "Operating System :: OS Independent",
-        "Topic :: Software Development :: Libraries :: Python Modules",
-        "Topic :: Text Processing :: Linguistic",
-        ],
-    long_description = """\
-Universal character encoding detector
--------------------------------------
-
-Detects
- - ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
- - Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
- - EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
- - EUC-KR, ISO-2022-KR (Korean)
- - KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
- - ISO-8859-2, windows-1250 (Hungarian)
- - ISO-8859-5, windows-1251 (Bulgarian)
- - windows-1252 (English)
- - ISO-8859-7, windows-1253 (Greek)
- - ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
- - TIS-620 (Thai)
-
-This version requires Python 3 or later; a Python 2 version is available separately.
-"""
-)
- -
-

chardet and httplib2 are open source, but there’s no requirement that you release your own Python libraries under any particular license. The process described in this chapter will work for any Python software, regardless of license. -

- -

⁂ - -

Things Distutils Can’t Do For You

- -

Releasing your first Python package is a daunting process. (Releasing your second one is a little easier.) Distutils tries to automate as much of it as possible, but there are some things you simply must do yourself. - -

- -

⁂ - -

Directory Structure

- -

To start packaging your Python software, you need to get your files and directories in order. The httplib2 directory looks like this: - -

-httplib2/                 
-|
-+--README.txt             
-|
-+--setup.py               
-|
-+--httplib2/              
-   |
-   +--__init__.py
-   |
-   +--iri2uri.py
-
    -
  1. Make a root directory to hold everything. Give it the same name as your Python module. -
  2. To accommodate Windows users, your “read me” file should include a .txt extension, and it should use Windows-style carriage returns. Just because you use a fancy text editor that runs from the command line and includes its own macro language, that doesn’t mean you need to make life difficult for your users. (Your users use Notepad. Sad but true.) Even if you’re on Linux or Mac OS X, your fancy text editor undoubtedly has an option to save files with Windows-style carriage returns. -
  3. Your Distutils setup script should be named setup.py unless you have a good reason not to. You do not have a good reason not to. -
  4. If your Python software is a single .py file, you should put it in the root directory along with your “read me” file and your setup script. But httplib2 is not a single .py file; it’s a multi-file module. But that’s OK! Just put the httplib2 directory in the root directory, so you have an __init__.py file within an httplib2/ directory within the httplib2/ root directory. That’s not a problem; in fact, it will simplify your packaging process. -
- -

The chardet directory looks slightly different. Like httplib2, it’s a multi-file module, so there’s a chardet/ directory within the chardet/ root directory. In addition to the README.txt file, chardet has HTML-formatted documentation in the docs/ directory. The docs/ directory contains several .html and .css files and an images/ subdirectory, which contains several .png and .gif files. (This will be important later.) Also, in keeping with the convention for (L)GPL-licensed software, it has a separate file called COPYING.txt which contains the complete text of the LGPL. - -


-chardet/
-|
-+--COPYING.txt
-|
-+--setup.py
-|
-+--README.txt
-|
-+--docs/
-|  |
-|  +--index.html
-|  |
-|  +--usage.html
-|  |
-|  +--images/ ...
-|
-+--chardet/
-   |
-   +--__init__.py
-   |
-   +--big5freq.py
-   |
-   +--...
-
- -

⁂ - -

Writing Your Setup Script

- -

The Distutils setup script is a Python script. In theory, it can do anything Python can do. In practice, it should do as little as possible, in as standard a way as possible. Setup scripts should be boring. The more exotic your installation process is, the more exotic your bug reports will be. - -

The first line of every Distutils setup script is always the same: - -

from distutils.core import setup
- -

This imports the setup() function, which is the main entry point into Distutils. 95% of all Distutils setup scripts consist of a single call to setup() and nothing else. (I totally just made up that statistic, but if your Distutils setup script is doing more than calling the Distutils setup() function, you should have a good reason. Do you have a good reason? I didn’t think so.) - -

The setup() function can take dozens of parameters. For the sanity of everyone involved, you must use named arguments for every parameter. This is not merely a convention; it’s a hard requirement. Your setup script will crash if you try to call the setup() function with non-named arguments. - -

The following named arguments are required: - -

- -

Although not required, I recommend that you also include the following in your setup script: - -

- -
-

Setup script metadata is defined in PEP 314. -

- -

Now let’s look at the chardet setup script. It has all of these required and recommended parameters, plus one I haven’t mentioned yet: packages. - -

from distutils.core import setup
-setup(
-    name = 'chardet',
-    packages = ['chardet'],
-    version = '1.0.2',
-    description = 'Universal encoding detector',
-    author='Mark Pilgrim',
-    ...
-)
- -

The packages parameter highlights an unfortunate vocabulary overlap in the distribution process. We’ve been talking about the “package” as the thing you’re building (and potentially listing in The Python “Package” Index). But that’s not what this packages parameter refers to. It refers to the fact that the chardet module is a multi-file module, sometimes known as… a “package.” The packages parameter tells Distutils to include the chardet/ directory, its __init__.py file, and all the other .py files that constitute the chardet module. That’s kind of important; all this happy talk about documentation and metadata is irrelevant if you forget to include the actual code! - -

⁂ - -

Classifying Your Package

- -

The Python Package Index (“PyPI”) contains thousands of Python libraries. Proper classification metadata will allow people to find yours more easily. PyPI lets you browse packages by classifier. You can even select multiple classifiers to narrow your search. Classifiers are not invisible metadata that you can just ignore! - -

To classify your software, pass a classifiers parameter to the Distutils setup() function. The classifiers parameter is a list of strings. These strings are not freeform. All classifier strings should come from this list on PyPI. - -
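Classifier strings are hierarchical, with the levels separated by " :: ". Here is a quick sketch for eyeballing your own list before you hand it to setup(). The top_level_categories() helper is hypothetical, and it does not check the strings against the official list on PyPI.

```python
classifiers = [
    'Programming Language :: Python',
    'Programming Language :: Python :: 3',
    'Development Status :: 4 - Beta',
    'Operating System :: OS Independent',
]

def top_level_categories(classifiers):
    '''hypothetical helper: group classifier strings by their first level'''
    groups = {}
    for classifier in classifiers:
        levels = classifier.split(' :: ')
        groups.setdefault(levels[0], []).append(levels[1:])
    return groups

groups = top_level_categories(classifiers)
print(sorted(groups))
# ['Development Status', 'Operating System', 'Programming Language']
print(groups['Programming Language'])
# [['Python'], ['Python', '3']]
```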

Classifiers are optional. You can write a Distutils setup script without any classifiers at all. Don’t do that. You should always include at least these classifiers: - -

- -

I also recommend that you include the following classifiers: - -

- -

Examples of Good Package Classifiers

- -

By way of example, here are the classifiers for Django, a production-ready, cross-platform, BSD-licensed web application framework that runs on your web server. (Django is not yet compatible with Python 3, so the Programming Language :: Python :: 3 classifier is not listed.) - -

Programming Language :: Python
-License :: OSI Approved :: BSD License
-Operating System :: OS Independent
-Development Status :: 5 - Production/Stable
-Environment :: Web Environment
-Framework :: Django
-Intended Audience :: Developers
-Topic :: Internet :: WWW/HTTP
-Topic :: Internet :: WWW/HTTP :: Dynamic Content
-Topic :: Internet :: WWW/HTTP :: WSGI
-Topic :: Software Development :: Libraries :: Python Modules
- -

Here are the classifiers for chardet, the character encoding detection library covered in Case Study: Porting chardet to Python 3. chardet is beta quality, cross-platform, Python 3-compatible, LGPL-licensed, and intended for developers to integrate into their own products. - -

Programming Language :: Python
-Programming Language :: Python :: 3
-License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
-Operating System :: OS Independent
-Development Status :: 4 - Beta
-Environment :: Other Environment
-Intended Audience :: Developers
-Topic :: Text Processing :: Linguistic
-Topic :: Software Development :: Libraries :: Python Modules
- -

And here are the classifiers for httplib2, the HTTP module I mentioned at the beginning of this chapter. httplib2 is beta quality, cross-platform, MIT-licensed, and intended for Python developers. - -

Programming Language :: Python
-Programming Language :: Python :: 3
-License :: OSI Approved :: MIT License
-Operating System :: OS Independent
-Development Status :: 4 - Beta
-Environment :: Web Environment
-Intended Audience :: Developers
-Topic :: Internet :: WWW/HTTP
-Topic :: Software Development :: Libraries :: Python Modules
- -

Specifying Additional Files With A Manifest

- -

By default, Distutils will include the following files in your release package: - -

- -

That will cover all the files in the httplib2 project. But for the chardet project, we also want to include the COPYING.txt license file and the entire docs/ directory that contains images and HTML files. To tell Distutils to include these additional files and directories when it builds the chardet release package, you need a manifest file. - -

A manifest file is a text file called MANIFEST.in. Place it in the project’s root directory, next to README.txt and setup.py. Manifest files are not Python scripts; they are text files that contain a series of “commands” in a Distutils-defined format. Manifest commands allow you to include or exclude specific files and directories. - -

This is the entire manifest file for the chardet project: - -

include COPYING.txt                                
-recursive-include docs *.html *.css *.png *.gif    
-
    -
  1. The first line is self-explanatory: include the COPYING.txt file from the project’s root directory. -
  2. The second line is a bit more complicated. The recursive-include command takes a directory name and one or more filenames. The filenames aren’t limited to specific files; they can include wildcards. This line means “See that docs/ directory in the project’s root directory? Look in there (recursively) for .html, .css, .png, and .gif files. I want all of them in my release package.” -
- -

All manifest commands preserve the directory structure that you set up in your project directory. That recursive-include command is not going to put a bunch of .html and .png files in the root directory of the release package. It’s going to maintain the existing docs/ directory structure, but only include those files inside that directory that match the given wildcards. (I didn’t mention it earlier, but the chardet documentation is actually written in XML and converted to HTML by a separate script. I don’t want to include the XML files in the release package, just the HTML and the images.) - -
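To get a feel for what that recursive-include line selects, here is a rough emulation using only os.walk() and fnmatch from the standard library. This is a sketch of the matching idea, not how Distutils itself is implemented; the recursive_include() helper and the throwaway file names are assumptions made for this example.

```python
import fnmatch
import os
import tempfile

def recursive_include(root, directory, patterns):
    '''rough emulation: relative paths under directory matching any pattern'''
    matches = []
    for dirpath, dirnames, filenames in os.walk(os.path.join(root, directory)):
        for filename in filenames:
            if any(fnmatch.fnmatch(filename, p) for p in patterns):
                full = os.path.join(dirpath, filename)
                # keep the path relative to the project root
                matches.append(os.path.relpath(full, root).replace(os.sep, '/'))
    return sorted(matches)

# build a throwaway project tree (file names are made up)
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'docs', 'images'))
    for relative in ('docs/index.html', 'docs/index.xml',
                     'docs/images/note.png', 'COPYING.txt'):
        with open(os.path.join(root, *relative.split('/')), 'w') as f:
            f.write('')
    included = recursive_include(root, 'docs',
                                 ['*.html', '*.css', '*.png', '*.gif'])

print(included)   # ['docs/images/note.png', 'docs/index.html']
```

The .xml file is skipped because no pattern matches it, and the matched files keep their docs/ and docs/images/ prefixes, which is the directory-preserving behavior described above.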

-

Manifest files have their own unique format. See Specifying the files to distribute and the manifest template commands for details. -

- -

To reiterate: you only need to create a manifest file if you want to include files that Distutils doesn’t include by default. If you do need a manifest file, it should only include the files and directories that Distutils wouldn’t otherwise find on its own. - -

Checking Your Setup Script for Errors

- -

There’s a lot to keep track of. Distutils comes with a built-in validation command that checks that all the required metadata is present in your setup script. For example, if you forget to include the version parameter, Distutils will remind you. - -

-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
-running check
-warning: check: missing required meta-data: version
- -

Once you include a version parameter (and all the other required bits of metadata), the check command will look like this: - -

-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
-running check
- -

⁂ - -

Creating a Source Distribution

- -

Distutils supports building multiple types of release packages. At a minimum, you should build a “source distribution” that contains your source code, your Distutils setup script, your “read me” file, and whatever additional files you want to include. To build a source distribution, pass the sdist command to your Distutils setup script. - -

-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py sdist
-running sdist
-running check
-reading manifest template 'MANIFEST.in'
-writing manifest file 'MANIFEST'
-creating chardet-1.0.2
-creating chardet-1.0.2\chardet
-creating chardet-1.0.2\docs
-creating chardet-1.0.2\docs\images
-copying files to chardet-1.0.2...
-copying COPYING -> chardet-1.0.2
-copying README.txt -> chardet-1.0.2
-copying setup.py -> chardet-1.0.2
-copying chardet\__init__.py -> chardet-1.0.2\chardet
-copying chardet\big5freq.py -> chardet-1.0.2\chardet
-...
-copying chardet\universaldetector.py -> chardet-1.0.2\chardet
-copying chardet\utf8prober.py -> chardet-1.0.2\chardet
-copying docs\faq.html -> chardet-1.0.2\docs
-copying docs\history.html -> chardet-1.0.2\docs
-copying docs\how-it-works.html -> chardet-1.0.2\docs
-copying docs\index.html -> chardet-1.0.2\docs
-copying docs\license.html -> chardet-1.0.2\docs
-copying docs\supported-encodings.html -> chardet-1.0.2\docs
-copying docs\usage.html -> chardet-1.0.2\docs
-copying docs\images\caution.png -> chardet-1.0.2\docs\images
-copying docs\images\important.png -> chardet-1.0.2\docs\images
-copying docs\images\note.png -> chardet-1.0.2\docs\images
-copying docs\images\permalink.gif -> chardet-1.0.2\docs\images
-copying docs\images\tip.png -> chardet-1.0.2\docs\images
-copying docs\images\warning.png -> chardet-1.0.2\docs\images
-creating dist
-creating 'dist\chardet-1.0.2.zip' and adding 'chardet-1.0.2' to it
-adding 'chardet-1.0.2\COPYING'
-adding 'chardet-1.0.2\PKG-INFO'
-adding 'chardet-1.0.2\README.txt'
-adding 'chardet-1.0.2\setup.py'
-adding 'chardet-1.0.2\chardet\big5freq.py'
-adding 'chardet-1.0.2\chardet\big5prober.py'
-...
-adding 'chardet-1.0.2\chardet\universaldetector.py'
-adding 'chardet-1.0.2\chardet\utf8prober.py'
-adding 'chardet-1.0.2\chardet\__init__.py'
-adding 'chardet-1.0.2\docs\faq.html'
-adding 'chardet-1.0.2\docs\history.html'
-adding 'chardet-1.0.2\docs\how-it-works.html'
-adding 'chardet-1.0.2\docs\index.html'
-adding 'chardet-1.0.2\docs\license.html'
-adding 'chardet-1.0.2\docs\supported-encodings.html'
-adding 'chardet-1.0.2\docs\usage.html'
-adding 'chardet-1.0.2\docs\images\caution.png'
-adding 'chardet-1.0.2\docs\images\important.png'
-adding 'chardet-1.0.2\docs\images\note.png'
-adding 'chardet-1.0.2\docs\images\permalink.gif'
-adding 'chardet-1.0.2\docs\images\tip.png'
-adding 'chardet-1.0.2\docs\images\warning.png'
-removing 'chardet-1.0.2' (and everything under it)
- -

Several things to note here: - -

- -
-c:\Users\pilgrim\chardet> dir dist
- Volume in drive C has no label.
- Volume Serial Number is DED5-B4F8
-
- Directory of c:\Users\pilgrim\chardet\dist
-
-07/30/2009  06:29 PM    <DIR>          .
-07/30/2009  06:29 PM    <DIR>          ..
-07/30/2009  06:29 PM           206,440 chardet-1.0.2.zip
-               1 File(s)        206,440 bytes
-               2 Dir(s)  61,424,635,904 bytes free
- -

⁂ - -

Creating a Graphical Installer

- -

In my opinion, every Python library deserves a graphical installer for Windows users. It’s easy to make (even if you don’t run Windows yourself), and Windows users appreciate it. - -

To have Distutils create a graphical Windows installer for you, pass the bdist_wininst command to your Distutils setup script. - -

-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py bdist_wininst
-running bdist_wininst
-running build
-running build_py
-creating build
-creating build\lib
-creating build\lib\chardet
-copying chardet\big5freq.py -> build\lib\chardet
-copying chardet\big5prober.py -> build\lib\chardet
-...
-copying chardet\universaldetector.py -> build\lib\chardet
-copying chardet\utf8prober.py -> build\lib\chardet
-copying chardet\__init__.py -> build\lib\chardet
-installing to build\bdist.win32\wininst
-running install_lib
-creating build\bdist.win32
-creating build\bdist.win32\wininst
-creating build\bdist.win32\wininst\PURELIB
-creating build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\big5freq.py -> build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\big5prober.py -> build\bdist.win32\wininst\PURELIB\chardet
-...
-copying build\lib\chardet\universaldetector.py -> build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\utf8prober.py -> build\bdist.win32\wininst\PURELIB\chardet
-copying build\lib\chardet\__init__.py -> build\bdist.win32\wininst\PURELIB\chardet
-running install_egg_info
-Writing build\bdist.win32\wininst\PURELIB\chardet-1.0.2-py3.1.egg-info
-creating 'c:\users\pilgrim\appdata\local\temp\tmp2f4h7e.zip' and adding '.' to it
-adding 'PURELIB\chardet-1.0.2-py3.1.egg-info'
-adding 'PURELIB\chardet\big5freq.py'
-adding 'PURELIB\chardet\big5prober.py'
-...
-adding 'PURELIB\chardet\universaldetector.py'
-adding 'PURELIB\chardet\utf8prober.py'
-adding 'PURELIB\chardet\__init__.py'
-removing 'build\bdist.win32\wininst' (and everything under it)
-c:\Users\pilgrim\chardet> dir dist
- Volume in drive C has no label.
- Volume Serial Number is AADE-E29F
-
- Directory of c:\Users\pilgrim\chardet\dist
-
-07/30/2009  10:14 PM    <DIR>          .
-07/30/2009  10:14 PM    <DIR>          ..
-07/30/2009  10:14 PM           371,236 chardet-1.0.2.win32.exe
-07/30/2009  06:29 PM           206,440 chardet-1.0.2.zip
-               2 File(s)        577,676 bytes
-               2 Dir(s)  61,424,070,656 bytes free
- -

Building Installable Packages for Other Operating Systems

- -

Distutils can help you build installable packages for Linux users. In my opinion, this probably isn’t worth your time. If you want your software distributed for Linux, your time would be better spent working with community members who specialize in packaging software for major Linux distributions. - -

For example, my chardet library is in the Debian GNU/Linux repositories (and therefore in the Ubuntu repositories as well). I had nothing to do with this; the packages just showed up there one day. The Debian community has their own policies for packaging Python libraries, and the Debian python-chardet package is designed to follow these conventions. And since the package lives in Debian’s repositories, Debian users will receive security updates and/or new versions, depending on the system-wide settings they’ve chosen to manage their own computers. - -

The Linux packages that Distutils builds offer none of these advantages. Your time is better spent elsewhere. - -

⁂ - -

Adding Your Software to The Python Package Index

- -

Uploading software to the Python Package Index is a three-step process. - -

    -
  1. Register yourself -
  2. Register your software -
  3. Upload the packages you created with setup.py sdist and setup.py bdist_* -
- -

To register yourself, go to the PyPI user registration page. Enter your desired username and password, provide a valid email address, and click the Register button. (If you have a PGP or GPG key, you can also provide that. If you don’t have one or don’t know what that means, don’t worry about it.) Check your email; within a few minutes, you should receive a message from PyPI with a validation link. Click the link to complete the registration process. - -

Now you need to register your software with PyPI and upload it. You can do this all in one step. - -

-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload  
-running register
-We need to know who you are, so please choose either:
- 1. use your existing login,
- 2. register as a new user,
- 3. have the server generate a new password for you (and email it to you), or
- 4. quit
-Your selection [default 1]:  1                                                                 
-Username: MarkPilgrim                                                                          
-Password:
-Registering chardet to http://pypi.python.org/pypi                                             
-Server response (200): OK
-running sdist                                                                                  
-... output trimmed for brevity ...
-running bdist_wininst                                                                          
-... output trimmed for brevity ...
-running upload                                                                                 
-Submitting dist\chardet-1.0.2.zip to http://pypi.python.org/pypi
-Server response (200): OK
-Submitting dist\chardet-1.0.2.win32.exe to http://pypi.python.org/pypi
-Server response (200): OK
-I can store your PyPI login so future submissions will be faster.
-(the login will be stored in c:\home\.pypirc)
-Save your login (y/N)?n                                                                        
-
    -
  1. When you release your project for the first time, Distutils will add your software to the Python Package Index and give it its own URL. Every time after that, it will simply update the project metadata with any changes you may have made in your setup.py parameters. Next, it builds a source distribution (sdist) and a Windows installer (bdist_wininst), then uploads them to PyPI (upload). -
  2. Type 1 or just press ENTER to select “use your existing login.” -
  3. Enter the username and password you selected on the PyPI user registration page. Distutils will not echo your password; it will not even echo asterisks in place of characters. Just type your password and press ENTER. -
  4. Distutils registers your package with the Python Package Index… -
  5. …builds your source distribution… -
  6. …builds your Windows installer… -
  7. …and uploads them both to the Python Package Index. -
  8. If you want to automate the process of releasing new versions, you need to save your PyPI credentials in a local file. This is completely insecure and completely optional. -
- -

Congratulations, you now have your own page on the Python Package Index! The address is http://pypi.python.org/pypi/NAME, where NAME is the string you passed in the name parameter in your setup.py file. - -

If you want to release a new version, just update your setup.py with the new version number, then run the same upload command again: - -

-c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload
-
- -

⁂ - -

The Many Possible Futures of Python Packaging

- -

Distutils is not the be-all and end-all of Python packaging, but as of this writing (August 2009), it’s the only packaging framework that works in Python 3. There are a number of other frameworks for Python 2; some focus on installation, others on testing and deployment. Some or all of these may end up being ported to Python 3 in the future. - -

These frameworks focus on installation: - -

- -

These focus on testing and deployment: - -

- -

⁂ - -

Further Reading

- -

On Distutils: - -

- -

On other packaging frameworks: - -

- -

-

© 2001–10 Mark Pilgrim - - - + + +Packaging Python Libraries - Dive Into Python 3 + + + + + + +

  
+

You are here: Home Dive Into Python 3 +

Difficulty level: ♦♦♦♦♢ +

Packaging Python Libraries

+
+

You’ll find the shame is like the pain; you only feel it once.
— Marquise de Merteuil, Dangerous Liaisons +

+

  +

Diving In

+

Real artists ship. Or so says Steve Jobs. Do you want to release a Python script, library, framework, or application? Excellent. The world needs more Python code. Python 3 comes with a packaging framework called Distutils. Distutils is many things: a build tool (for you), an installation tool (for your users), a package metadata format (for search engines), and more. It integrates with the Python Package Index (“PyPI”), a central repository for open source Python libraries. + +

All of these facets of Distutils center around the setup script, traditionally called setup.py. In fact, you’ve already seen several Distutils setup scripts in this book. You used Distutils to install httplib2 in HTTP Web Services and again to install chardet in Case Study: Porting chardet to Python 3. + +

In this chapter, you’ll learn how the setup scripts for chardet and httplib2 work, and you’ll step through the process of releasing your own Python software. + +

# chardet's setup.py
+from distutils.core import setup
+setup(
+    name = "chardet",
+    packages = ["chardet"],
+    version = "1.0.2",
+    description = "Universal encoding detector",
+    author = "Mark Pilgrim",
+    author_email = "mark@diveintomark.org",
+    url = "http://chardet.feedparser.org/",
+    download_url = "http://chardet.feedparser.org/download/python3-chardet-1.0.1.tgz",
+    keywords = ["encoding", "i18n", "xml"],
+    classifiers = [
+        "Programming Language :: Python",
+        "Programming Language :: Python :: 3",
+        "Development Status :: 4 - Beta",
+        "Environment :: Other Environment",
+        "Intended Audience :: Developers",
+        "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
+        "Operating System :: OS Independent",
+        "Topic :: Software Development :: Libraries :: Python Modules",
+        "Topic :: Text Processing :: Linguistic",
+        ],
+    long_description = """\
+Universal character encoding detector
+-------------------------------------
+
+Detects
+ - ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
+ - Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
+ - EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
+ - EUC-KR, ISO-2022-KR (Korean)
+ - KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
+ - ISO-8859-2, windows-1250 (Hungarian)
+ - ISO-8859-5, windows-1251 (Bulgarian)
+ - windows-1252 (English)
+ - ISO-8859-7, windows-1253 (Greek)
+ - ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
+ - TIS-620 (Thai)
+
+This version requires Python 3 or later; a Python 2 version is available separately.
+"""
+)
+ +
+

chardet and httplib2 are open source, but there’s no requirement that you release your own Python libraries under any particular license. The process described in this chapter will work for any Python software, regardless of license. +

+ +

⁂ + +

Things Distutils Can’t Do For You

+ +

Releasing your first Python package is a daunting process. (Releasing your second one is a little easier.) Distutils tries to automate as much of it as possible, but there are some things you simply must do yourself. + +

+ +

⁂ + +

Directory Structure

+ +

To start packaging your Python software, you need to get your files and directories in order. The httplib2 directory looks like this: + +

+httplib2/                 
+|
++--README.txt             
+|
++--setup.py               
+|
++--httplib2/              
+   |
+   +--__init__.py
+   |
+   +--iri2uri.py
+
    +
  1. Make a root directory to hold everything. Give it the same name as your Python module. +
  2. To accommodate Windows users, your “read me” file should include a .txt extension, and it should use Windows-style carriage returns. Just because you use a fancy text editor that runs from the command line and includes its own macro language, that doesn’t mean you need to make life difficult for your users. (Your users use Notepad. Sad but true.) Even if you’re on Linux or Mac OS X, your fancy text editor undoubtedly has an option to save files with Windows-style carriage returns. +
  3. Your Distutils setup script should be named setup.py unless you have a good reason not to. You do not have a good reason not to. +
  4. If your Python software is a single .py file, you should put it in the root directory along with your “read me” file and your setup script. But httplib2 is not a single .py file; it’s a multi-file module. That’s OK! Just put the httplib2 directory in the root directory, so you have an __init__.py file within an httplib2/ directory within the httplib2/ root directory. That’s not a problem; in fact, it will simplify your packaging process. +
+ +

The chardet directory looks slightly different. Like httplib2, it’s a multi-file module, so there’s a chardet/ directory within the chardet/ root directory. In addition to the README.txt file, chardet has HTML-formatted documentation in the docs/ directory. The docs/ directory contains several .html and .css files and an images/ subdirectory, which contains several .png and .gif files. (This will be important later.) Also, in keeping with the convention for (L)GPL-licensed software, it has a separate file called COPYING.txt which contains the complete text of the LGPL. + +


+chardet/
+|
++--COPYING.txt
+|
++--setup.py
+|
++--README.txt
+|
++--docs/
+|  |
+|  +--index.html
+|  |
+|  +--usage.html
+|  |
+|  +--images/ ...
+|
++--chardet/
+   |
+   +--__init__.py
+   |
+   +--big5freq.py
+   |
+   +--...
+
+ +
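As a rough illustration of the two layouts above, here is a minimal sketch that checks a project root for the files described: a README.txt, a setup.py, and an inner directory of the same name containing __init__.py. The function name `layout_problems` is hypothetical, and the check assumes a multi-file module whose root directory shares the module's name; it is not something Distutils provides.

```python
from pathlib import Path

def layout_problems(root):
    """Return a list of problems with a multi-file module's directory layout.

    Assumes the root directory has the same name as the module inside it,
    as in the httplib2/ and chardet/ examples.
    """
    root = Path(root).resolve()
    problems = []
    # The "read me" file and the setup script live in the root directory.
    for filename in ("README.txt", "setup.py"):
        if not (root / filename).is_file():
            problems.append("missing " + filename)
    # The module itself is a same-named directory with an __init__.py file.
    if not (root / root.name / "__init__.py").is_file():
        problems.append("missing " + root.name + "/__init__.py")
    return problems
```

Calling `layout_problems("httplib2")` on a correctly arranged project should return an empty list.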

⁂ + +

Writing Your Setup Script

+ +

The Distutils setup script is a Python script. In theory, it can do anything Python can do. In practice, it should do as little as possible, in as standard a way as possible. Setup scripts should be boring. The more exotic your installation process is, the more exotic your bug reports will be. + +

The first line of every Distutils setup script is always the same: + +

from distutils.core import setup
+ +

This imports the setup() function, which is the main entry point into Distutils. 95% of all Distutils setup scripts consist of a single call to setup() and nothing else. (I totally just made up that statistic, but if your Distutils setup script is doing more than calling the Distutils setup() function, you should have a good reason. Do you have a good reason? I didn’t think so.) + +

The setup() function can take dozens of parameters. For the sanity of everyone involved, you must use named arguments for every parameter. This is not merely a convention; it’s a hard requirement. Your setup script will crash if you try to call the setup() function with non-named arguments. + +
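To see why positional arguments fail, here is a minimal sketch. It deliberately does not import Distutils; `setup_stand_in` is a hypothetical function that assumes setup() accepts only keyword arguments (a `**attrs`-style signature), which is what the requirement above implies.

```python
def setup_stand_in(**attrs):
    """Hypothetical stand-in that, like setup(), accepts named arguments only."""
    return attrs

# Named arguments work and arrive as a dictionary of metadata.
metadata = setup_stand_in(name="chardet", version="1.0.2")

# Positional arguments are rejected before the function body even runs.
try:
    setup_stand_in("chardet", "1.0.2")
except TypeError as error:
    positional_error = error
```

With a signature like this, Python itself raises the TypeError, so the function never gets a chance to guess which value was meant for which parameter.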

The following named arguments are required: + +

+ +

Although not required, I recommend that you also include the following in your setup script: + +

+ +
+

Setup script metadata is defined in PEP 314. +

+ +

Now let’s look at the chardet setup script. It has all of these required and recommended parameters, plus one I haven’t mentioned yet: packages. + +

from distutils.core import setup
+setup(
+    name = 'chardet',
+    packages = ['chardet'],
+    version = '1.0.2',
+    description = 'Universal encoding detector',
+    author='Mark Pilgrim',
+    ...
+)
+ +

The packages parameter highlights an unfortunate vocabulary overlap in the distribution process. We’ve been talking about the “package” as the thing you’re building (and potentially listing in The Python “Package” Index). But that’s not what this packages parameter refers to. It refers to the fact that the chardet module is a multi-file module, sometimes known as… a “package.” The packages parameter tells Distutils to include the chardet/ directory, its __init__.py file, and all the other .py files that constitute the chardet module. That’s kind of important; all this happy talk about documentation and metadata is irrelevant if you forget to include the actual code! + +

⁂ + +

Classifying Your Package

+ +

The Python Package Index (“PyPI”) contains thousands of Python libraries. Proper classification metadata will allow people to find yours more easily. PyPI lets you browse packages by classifier. You can even select multiple classifiers to narrow your search. Classifiers are not invisible metadata that you can just ignore! + +

To classify your software, pass a classifiers parameter to the Distutils setup() function. The classifiers parameter is a list of strings. These strings are not freeform. All classifier strings should come from this list on PyPI. + +

Classifiers are optional. You can write a Distutils setup script without any classifiers at all. Don’t do that. You should always include at least these classifiers: + +

+ +

I also recommend that you include the following classifiers: + +

+ +

Examples of Good Package Classifiers

+ +

By way of example, here are the classifiers for Django, a production-ready, cross-platform, BSD-licensed web application framework that runs on your web server. (Django is not yet compatible with Python 3, so the Programming Language :: Python :: 3 classifier is not listed.) + +

Programming Language :: Python
+License :: OSI Approved :: BSD License
+Operating System :: OS Independent
+Development Status :: 5 - Production/Stable
+Environment :: Web Environment
+Framework :: Django
+Intended Audience :: Developers
+Topic :: Internet :: WWW/HTTP
+Topic :: Internet :: WWW/HTTP :: Dynamic Content
+Topic :: Internet :: WWW/HTTP :: WSGI
+Topic :: Software Development :: Libraries :: Python Modules
+ +

Here are the classifiers for chardet, the character encoding detection library covered in Case Study: Porting chardet to Python 3. chardet is beta quality, cross-platform, Python 3-compatible, LGPL-licensed, and intended for developers to integrate into their own products. + +

Programming Language :: Python
+Programming Language :: Python :: 3
+License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
+Operating System :: OS Independent
+Development Status :: 4 - Beta
+Environment :: Other Environment
+Intended Audience :: Developers
+Topic :: Text Processing :: Linguistic
+Topic :: Software Development :: Libraries :: Python Modules
+ +

And here are the classifiers for httplib2, the HTTP module I mentioned at the beginning of this chapter. httplib2 is beta quality, cross-platform, MIT-licensed, and intended for Python developers. + +

Programming Language :: Python
+Programming Language :: Python :: 3
+License :: OSI Approved :: MIT License
+Operating System :: OS Independent
+Development Status :: 4 - Beta
+Environment :: Web Environment
+Intended Audience :: Developers
+Topic :: Internet :: WWW/HTTP
+Topic :: Software Development :: Libraries :: Python Modules
+ +
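Classifier strings are hierarchical, with levels separated by `::`. The following sketch groups a list of classifiers by top-level category and reports which categories are absent. The names `group_classifiers`, `missing_categories`, and `WANTED` are hypothetical, and the three categories in `WANTED` are an assumption drawn from the categories that lead each example above, not an authoritative list.

```python
def group_classifiers(classifiers):
    """Group classifier strings by their top-level category."""
    groups = {}
    for classifier in classifiers:
        parts = [part.strip() for part in classifier.split("::")]
        groups.setdefault(parts[0], []).append(" :: ".join(parts[1:]))
    return groups

def missing_categories(classifiers, wanted):
    """Return the wanted top-level categories that no classifier covers."""
    groups = group_classifiers(classifiers)
    return [category for category in wanted if category not in groups]

# Assumption: categories taken from the examples above, for illustration only.
WANTED = ("Programming Language", "License", "Operating System")

chardet_classifiers = [
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
    "Operating System :: OS Independent",
    "Development Status :: 4 - Beta",
]
```

This only checks the shape of the strings; it does not confirm that each string appears in the list on PyPI.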

Specifying Additional Files With A Manifest

+ +

By default, Distutils will include the following files in your release package: + +

+ +

That will cover all the files in the httplib2 project. But for the chardet project, we also want to include the COPYING.txt license file and the entire docs/ directory that contains images and HTML files. To tell Distutils to include these additional files and directories when it builds the chardet release package, you need a manifest file. + +

A manifest file is a text file called MANIFEST.in. Place it in the project’s root directory, next to README.txt and setup.py. Manifest files are not Python scripts; they are text files that contain a series of “commands” in a Distutils-defined format. Manifest commands allow you to include or exclude specific files and directories. + +

This is the entire manifest file for the chardet project: + +

include COPYING.txt                                
+recursive-include docs *.html *.css *.png *.gif    
+
    +
  1. The first line is self-explanatory: include the COPYING.txt file from the project’s root directory. +
  2. The second line is a bit more complicated. The recursive-include command takes a directory name and one or more filenames. The filenames aren’t limited to specific files; they can include wildcards. This line means “See that docs/ directory in the project’s root directory? Look in there (recursively) for .html, .css, .png, and .gif files. I want all of them in my release package.” +
+ +

All manifest commands preserve the directory structure that you set up in your project directory. That recursive-include command is not going to put a bunch of .html and .png files in the root directory of the release package. It’s going to maintain the existing docs/ directory structure, but only include those files inside that directory that match the given wildcards. (I didn’t mention it earlier, but the chardet documentation is actually written in XML and converted to HTML by a separate script. I don’t want to include the XML files in the release package, just the HTML and the images.) + +

+

Manifest files have their own unique format. See Specifying the files to distribute and the manifest template commands for details. +

+ +

To reiterate: you only need to create a manifest file if you want to include files that Distutils doesn’t include by default. If you do need a manifest file, it should only include the files and directories that Distutils wouldn’t otherwise find on its own. + +
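To make the behavior of `recursive-include docs *.html *.css *.png *.gif` concrete, here is a minimal sketch that imitates it with the standard library. The function `recursive_include` is hypothetical and is not how Distutils implements the command; it only shows the idea of walking a directory, matching wildcards, and keeping paths relative to the project root.

```python
import fnmatch
import os

def recursive_include(root, directory, patterns):
    """Return files under root/directory whose names match any pattern.

    Paths are returned relative to root, so the directory structure
    (for example docs/images/) is preserved.
    """
    matches = []
    for current, _subdirs, filenames in os.walk(os.path.join(root, directory)):
        for filename in filenames:
            if any(fnmatch.fnmatch(filename, pattern) for pattern in patterns):
                full_path = os.path.join(current, filename)
                matches.append(os.path.relpath(full_path, root))
    return sorted(matches)
```

For the chardet project, `recursive_include(".", "docs", ["*.html", "*.css", "*.png", "*.gif"])` would pick up the HTML and image files while skipping the XML sources.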

Checking Your Setup Script for Errors

+ +

There’s a lot to keep track of. Distutils comes with a built-in validation command that checks that all the required metadata is present in your setup script. For example, if you forget to include the version parameter, Distutils will remind you. + +

+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
+running check
+warning: check: missing required meta-data: version
+ +

Once you include a version parameter (and all the other required bits of metadata), the check command will look like this: + +

+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
+running check
+ +
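The idea behind the check command can be sketched in a few lines. This is a rough imitation, not the Distutils implementation: `check_metadata` is a hypothetical function, and the `REQUIRED` tuple is an assumption chosen for illustration.

```python
# Assumption: an illustrative set of required fields, not an authoritative list.
REQUIRED = ("name", "version", "url")

def check_metadata(metadata):
    """Return a warning string naming missing fields, or None if all are set."""
    missing = [field for field in REQUIRED if not metadata.get(field)]
    if missing:
        return "warning: check: missing required meta-data: " + ", ".join(missing)
    return None

# A setup script that forgot its version parameter.
warning = check_metadata({
    "name": "chardet",
    "url": "http://chardet.feedparser.org/",
})
```

Here `warning` holds a message in the same style as the output shown above; adding a `version` entry makes the function return `None`.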

⁂ + +

Creating a Source Distribution

+ +

Distutils supports building multiple types of release packages. At a minimum, you should build a “source distribution” that contains your source code, your Distutils setup script, your “read me” file, and whatever additional files you want to include. To build a source distribution, pass the sdist command to your Distutils setup script. + +

+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py sdist
+running sdist
+running check
+reading manifest template 'MANIFEST.in'
+writing manifest file 'MANIFEST'
+creating chardet-1.0.2
+creating chardet-1.0.2\chardet
+creating chardet-1.0.2\docs
+creating chardet-1.0.2\docs\images
+copying files to chardet-1.0.2...
+copying COPYING -> chardet-1.0.2
+copying README.txt -> chardet-1.0.2
+copying setup.py -> chardet-1.0.2
+copying chardet\__init__.py -> chardet-1.0.2\chardet
+copying chardet\big5freq.py -> chardet-1.0.2\chardet
+...
+copying chardet\universaldetector.py -> chardet-1.0.2\chardet
+copying chardet\utf8prober.py -> chardet-1.0.2\chardet
+copying docs\faq.html -> chardet-1.0.2\docs
+copying docs\history.html -> chardet-1.0.2\docs
+copying docs\how-it-works.html -> chardet-1.0.2\docs
+copying docs\index.html -> chardet-1.0.2\docs
+copying docs\license.html -> chardet-1.0.2\docs
+copying docs\supported-encodings.html -> chardet-1.0.2\docs
+copying docs\usage.html -> chardet-1.0.2\docs
+copying docs\images\caution.png -> chardet-1.0.2\docs\images
+copying docs\images\important.png -> chardet-1.0.2\docs\images
+copying docs\images\note.png -> chardet-1.0.2\docs\images
+copying docs\images\permalink.gif -> chardet-1.0.2\docs\images
+copying docs\images\tip.png -> chardet-1.0.2\docs\images
+copying docs\images\warning.png -> chardet-1.0.2\docs\images
+creating dist
+creating 'dist\chardet-1.0.2.zip' and adding 'chardet-1.0.2' to it
+adding 'chardet-1.0.2\COPYING'
+adding 'chardet-1.0.2\PKG-INFO'
+adding 'chardet-1.0.2\README.txt'
+adding 'chardet-1.0.2\setup.py'
+adding 'chardet-1.0.2\chardet\big5freq.py'
+adding 'chardet-1.0.2\chardet\big5prober.py'
+...
+adding 'chardet-1.0.2\chardet\universaldetector.py'
+adding 'chardet-1.0.2\chardet\utf8prober.py'
+adding 'chardet-1.0.2\chardet\__init__.py'
+adding 'chardet-1.0.2\docs\faq.html'
+adding 'chardet-1.0.2\docs\history.html'
+adding 'chardet-1.0.2\docs\how-it-works.html'
+adding 'chardet-1.0.2\docs\index.html'
+adding 'chardet-1.0.2\docs\license.html'
+adding 'chardet-1.0.2\docs\supported-encodings.html'
+adding 'chardet-1.0.2\docs\usage.html'
+adding 'chardet-1.0.2\docs\images\caution.png'
+adding 'chardet-1.0.2\docs\images\important.png'
+adding 'chardet-1.0.2\docs\images\note.png'
+adding 'chardet-1.0.2\docs\images\permalink.gif'
+adding 'chardet-1.0.2\docs\images\tip.png'
+adding 'chardet-1.0.2\docs\images\warning.png'
+removing 'chardet-1.0.2' (and everything under it)
+ +

Several things to note here: + +

+ +
+c:\Users\pilgrim\chardet> dir dist
+ Volume in drive C has no label.
+ Volume Serial Number is DED5-B4F8
+
+ Directory of c:\Users\pilgrim\chardet\dist
+
+07/30/2009  06:29 PM    <DIR>          .
+07/30/2009  06:29 PM    <DIR>          ..
+07/30/2009  06:29 PM           206,440 chardet-1.0.2.zip
+               1 File(s)        206,440 bytes
+               2 Dir(s)  61,424,635,904 bytes free
+ +
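Before you hand the archive to anyone, it can be worth confirming that the files you expected actually landed in it. Here is a minimal sketch using the standard zipfile module; `archive_contains` is a hypothetical helper, and the entry names you pass in are your own expectations, not something Distutils reports.

```python
import zipfile

def archive_contains(archive_path, wanted_names):
    """Map each wanted entry name to True or False for a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # Normalize separators so Windows-style names compare equal.
        names = {name.replace("\\", "/") for name in archive.namelist()}
    return {wanted: wanted in names for wanted in wanted_names}
```

For example, `archive_contains(r"dist\chardet-1.0.2.zip", ["chardet-1.0.2/PKG-INFO", "chardet-1.0.2/docs/index.html"])` should report `True` for both entries if the manifest did its job.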

⁂ + +

Creating a Graphical Installer

+ +

In my opinion, every Python library deserves a graphical installer for Windows users. It’s easy to make (even if you don’t run Windows yourself), and Windows users appreciate it. + +

To have Distutils create a graphical Windows installer for you, pass the bdist_wininst command to your Distutils setup script. + +

+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py bdist_wininst
+running bdist_wininst
+running build
+running build_py
+creating build
+creating build\lib
+creating build\lib\chardet
+copying chardet\big5freq.py -> build\lib\chardet
+copying chardet\big5prober.py -> build\lib\chardet
+...
+copying chardet\universaldetector.py -> build\lib\chardet
+copying chardet\utf8prober.py -> build\lib\chardet
+copying chardet\__init__.py -> build\lib\chardet
+installing to build\bdist.win32\wininst
+running install_lib
+creating build\bdist.win32
+creating build\bdist.win32\wininst
+creating build\bdist.win32\wininst\PURELIB
+creating build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\big5freq.py -> build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\big5prober.py -> build\bdist.win32\wininst\PURELIB\chardet
+...
+copying build\lib\chardet\universaldetector.py -> build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\utf8prober.py -> build\bdist.win32\wininst\PURELIB\chardet
+copying build\lib\chardet\__init__.py -> build\bdist.win32\wininst\PURELIB\chardet
+running install_egg_info
+Writing build\bdist.win32\wininst\PURELIB\chardet-1.0.2-py3.1.egg-info
+creating 'c:\users\pilgrim\appdata\local\temp\tmp2f4h7e.zip' and adding '.' to it
+adding 'PURELIB\chardet-1.0.2-py3.1.egg-info'
+adding 'PURELIB\chardet\big5freq.py'
+adding 'PURELIB\chardet\big5prober.py'
+...
+adding 'PURELIB\chardet\universaldetector.py'
+adding 'PURELIB\chardet\utf8prober.py'
+adding 'PURELIB\chardet\__init__.py'
+removing 'build\bdist.win32\wininst' (and everything under it)
+c:\Users\pilgrim\chardet> dir dist
+ Volume in drive C has no label.
+ Volume Serial Number is AADE-E29F
+
+ Directory of c:\Users\pilgrim\chardet\dist
+
+07/30/2009  10:14 PM    <DIR>          .
+07/30/2009  10:14 PM    <DIR>          ..
+07/30/2009  10:14 PM           371,236 chardet-1.0.2.win32.exe
+07/30/2009  06:29 PM           206,440 chardet-1.0.2.zip
+               2 File(s)        577,676 bytes
+               2 Dir(s)  61,424,070,656 bytes free
+ +

Building Installable Packages for Other Operating Systems

+ +

Distutils can help you build installable packages for Linux users. In my opinion, this probably isn’t worth your time. If you want your software distributed for Linux, your time would be better spent working with community members who specialize in packaging software for major Linux distributions. + +

For example, my chardet library is in the Debian GNU/Linux repositories (and therefore in the Ubuntu repositories as well). I had nothing to do with this; the packages just showed up there one day. The Debian community has their own policies for packaging Python libraries, and the Debian python-chardet package is designed to follow these conventions. And since the package lives in Debian’s repositories, Debian users will receive security updates and/or new versions, depending on the system-wide settings they’ve chosen to manage their own computers. + +

The Linux packages that Distutils builds offer none of these advantages. Your time is better spent elsewhere. + +

⁂ + +

Adding Your Software to The Python Package Index

+ +

Uploading software to the Python Package Index is a three-step process. + +

    +
  1. Register yourself +
  2. Register your software +
  3. Upload the packages you created with setup.py sdist and setup.py bdist_* +
+ +

To register yourself, go to the PyPI user registration page. Enter your desired username and password, provide a valid email address, and click the Register button. (If you have a PGP or GPG key, you can also provide that. If you don’t have one or don’t know what that means, don’t worry about it.) Check your email; within a few minutes, you should receive a message from PyPI with a validation link. Click the link to complete the registration process. + +

Now you need to register your software with PyPI and upload it. You can do this all in one step. + +

+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload  
+running register
+We need to know who you are, so please choose either:
+ 1. use your existing login,
+ 2. register as a new user,
+ 3. have the server generate a new password for you (and email it to you), or
+ 4. quit
+Your selection [default 1]:  1                                                                 
+Username: MarkPilgrim                                                                          
+Password:
+Registering chardet to http://pypi.python.org/pypi                                             
+Server response (200): OK
+running sdist                                                                                  
+... output trimmed for brevity ...
+running bdist_wininst                                                                          
+... output trimmed for brevity ...
+running upload                                                                                 
+Submitting dist\chardet-1.0.2.zip to http://pypi.python.org/pypi
+Server response (200): OK
+Submitting dist\chardet-1.0.2.win32.exe to http://pypi.python.org/pypi
+Server response (200): OK
+I can store your PyPI login so future submissions will be faster.
+(the login will be stored in c:\home\.pypirc)
+Save your login (y/N)?n                                                                        
+
    +
  1. When you release your project for the first time, Distutils will add your software to the Python Package Index and give it its own URL. Every time after that, it will simply update the project metadata with any changes you may have made in your setup.py parameters. Next, it builds a source distribution (sdist) and a Windows installer (bdist_wininst), then uploads them to PyPI (upload). +
  2. Type 1 or just press ENTER to select “use your existing login.” +
  3. Enter the username and password you selected on the PyPI user registration page. Distutils will not echo your password; it will not even echo asterisks in place of characters. Just type your password and press ENTER. +
  4. Distutils registers your package with the Python Package Index… +
  5. …builds your source distribution… +
  6. …builds your Windows installer… +
  7. …and uploads them both to the Python Package Index. +
  8. If you want to automate the process of releasing new versions, you need to save your PyPI credentials in a local file. This is completely insecure and completely optional. +
+ +
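The credentials file mentioned in the last note is a small INI-style text file. As a rough sketch, assuming the conventional `.pypirc` layout (a `[distutils]` section naming the index server and a `[pypi]` section holding `username` and `password`), the following builds that text in memory instead of writing it to disk. The function name `render_pypirc` and the placeholder password are hypothetical, chosen only for illustration.

```python
import configparser
import io


def render_pypirc(username, password):
    """Return .pypirc-style text for a single 'pypi' index server.

    Assumption: the conventional layout with a [distutils] section and
    a [pypi] section. Nothing is written to disk here.
    """
    config = configparser.RawConfigParser()
    config.add_section('distutils')
    config.set('distutils', 'index-servers', 'pypi')
    config.add_section('pypi')
    config.set('pypi', 'username', username)
    config.set('pypi', 'password', password)  # stored in plain text
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == '__main__':
    # Placeholder credentials; never commit real ones.
    print(render_pypirc('MarkPilgrim', 'not-a-real-password'))
```

Seeing the password sit there as ordinary text is a fair illustration of why the note above calls this option insecure.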

Congratulations, you now have your own page on the Python Package Index! The address is http://pypi.python.org/pypi/NAME, where NAME is the string you passed in the name parameter in your setup.py file. + +
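As a small illustration of how the address is formed, here is a sketch that builds it from the `name` value. The helper `pypi_project_url` is hypothetical; only the URL pattern comes from the text above.

```python
def pypi_project_url(name):
    """Build the project page address from the setup.py name parameter."""
    return 'http://pypi.python.org/pypi/{0}'.format(name)


if __name__ == '__main__':
    # 'chardet' is the name used throughout this chapter's examples.
    print(pypi_project_url('chardet'))
```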

If you want to release a new version, just update your setup.py with the new version number, then run the same upload command again: + +

+c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload
+
+ +

⁂ + +

The Many Possible Futures of Python Packaging

+ +

Distutils is not the be-all and end-all of Python packaging, but as of this writing (August 2009), it’s the only packaging framework that works in Python 3. There are a number of other frameworks for Python 2; some focus on installation, others on testing and deployment. Some or all of these may end up being ported to Python 3 in the future. + +

These frameworks focus on installation: + +

+ +

These focus on testing and deployment: + +

+ +

⁂ + +

Further Reading

+ +

On Distutils: + +

+ +

On other packaging frameworks: + +

+ +

+

© 2001–10 Mark Pilgrim + + + diff --git a/prince.css b/prince.css index 5dbf409..5fa3299 100644 --- a/prince.css +++ b/prince.css @@ -1,59 +1,59 @@ -/* - -"Dive Into Python 3" Prince stylesheet - -Copyright (c) 2009, Mark Pilgrim, All rights reserved. - -Redistribution and use in source and binary forms, with or without modification, -are permitted provided that the following conditions are met: - -* Redistributions of source code must retain the above copyright notice, - this list of conditions and the following disclaimer. -* Redistributions in binary form must reproduce the above copyright notice, - this list of conditions and the following disclaimer in the documentation - and/or other materials provided with the distribution. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS' -AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. 
-*/ - -/* some Prince-specific rules to generate a nicer PDF */ -/* see http://www.princexml.com/ */ - -@page { - size: US-Letter; - margin: 30pt; - padding: 0; - @bottom-center { - font: 12pt/1.75 'Gill Sans', 'Gill Sans MT', Helvetica, Corbel, 'Nimbus Sans L', sans-serif; - content: counter(page); - } -} -pre { - page-break-inside: avoid; -} -h1 { - page-break-before: always; - prince-bookmark-level: 1; -} -h2 { - prince-bookmark-level: 2; -} -h3 { - prince-bookmark-level: 3; -} -ul, ol { - margin: 1.75em 20pt; -} -abbr { - text-decoration: none; -} +/* + +"Dive Into Python 3" Prince stylesheet + +Copyright (c) 2009, Mark Pilgrim, All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, +are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. +* Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS' +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. 
+*/ + +/* some Prince-specific rules to generate a nicer PDF */ +/* see http://www.princexml.com/ */ + +@page { + size: US-Letter; + margin: 30pt; + padding: 0; + @bottom-center { + font: 12pt/1.75 'Gill Sans', 'Gill Sans MT', Helvetica, Corbel, 'Nimbus Sans L', sans-serif; + content: counter(page); + } +} +pre { + page-break-inside: avoid; +} +h1 { + page-break-before: always; + prince-bookmark-level: 1; +} +h2 { + prince-bookmark-level: 2; +} +h3 { + prince-bookmark-level: 3; +} +ul, ol { + margin: 1.75em 20pt; +} +abbr { + text-decoration: none; +} diff --git a/publish b/publish index f471782..8d39edc 100755 --- a/publish +++ b/publish @@ -2,7 +2,6 @@ die () { echo "$1" >/dev/stderr - [ -n "$(which Snarl_CMD 2>/dev/null)" ] && Snarl_CMD snShowMessage 10 "Dive Into Python 3" "$1." "C:\Users\pilgrim\site-lisp\todochiku-icons\alert.png" exit 1 } @@ -119,9 +118,9 @@ java -jar util/yuicompressor-2.4.2.jar build/dip3.css > build/$revision.css && \ echo "inlining CSS, minimizing URLs, adding evil tracking code" ga=`cat j/ga.js` for f in build/*.html; do - css=`python2.6 util/lesscss.py "$f" "build/$revision.css"` || die "Failed to remove unused CSS" - mobilecss=`python2.6 util/lesscss.py "$f" "build/m-$revision.css"` || die "Failed to remove unused CSS" - printcss=`python2.6 util/lesscss.py "$f" "build/p-$revision.css"` || die "Failed to remove unused CSS" + css=`python2.5 util/lesscss.py "$f" "build/$revision.css"` || die "Failed to remove unused CSS" + mobilecss=`python2.5 util/lesscss.py "$f" "build/m-$revision.css"` || die "Failed to remove unused CSS" + printcss=`python2.5 util/lesscss.py "$f" "build/p-$revision.css"` || die "Failed to remove unused CSS" sed -i -e "s|||g" -e "s|||g" -e "s|||g" -e "s||${ga}|g" "$f" || die "Failed to inline CSS" done @@ -130,7 +129,7 @@ chmod 755 build/examples build/j build/i build/d && \ chmod 644 build/*.html build/*.css build/*.txt build/*.zip build/examples/* build/examples/.htaccess build/j/* build/j/.htaccess build/i/* 
build/i/.htaccess build/d/.htaccess build/.htaccess || die "Failed to reset file permissions" # ship it! -#die "Aborting without publishing" +die "Aborting without publishing" echo -n "publishing" rsync -essh -a build/d/.htaccess build/*.zip diveintomark.org:~/web/diveintopython3.org/d/ && \ echo -n "." && \ @@ -140,5 +139,3 @@ rsync -essh -a build/d/.htaccess build/*.zip diveintomark.org:~/web/diveintopyth echo -n "." && \ rsync -essh -a build/examples build/*.txt build/*.html build/.htaccess diveintomark.org:~/web/diveintopython3.org/ && \ echo "." || die "Failed to publish to remote server" - -[ -n "$(which Snarl_CMD 2>/dev/null)" ] && Snarl_CMD snShowMessage 10 "Dive Into Python 3" "Published." "C:\Users\pilgrim\site-lisp\todochiku-icons\clean.png" diff --git a/table-of-contents.html b/table-of-contents.html index 93772a7..5d9db97 100755 --- a/table-of-contents.html +++ b/table-of-contents.html @@ -1,446 +1,446 @@ - - -Table of contents - Dive Into Python 3 - - - - - -

 
-

You are here: Home Dive Into Python 3 -

Table of Contents

- -
    -
  1. What’s New in “Dive Into Python 3” -
      -
    1. a.k.a. “the minus level” -
    -
  2. Installing Python -
      -
    1. Diving In -
    2. Which Python Is Right For You? -
    3. Installing on Microsoft Windows -
    4. Installing on Mac OS X -
    5. Installing on Ubuntu Linux -
    6. Installing on Other Platforms -
    7. Using The Python Shell -
    8. Python Editors and IDEs -
    -
  3. Your First Python Program -
      -
    1. Diving In -
    2. Declaring Functions -
        -
      1. Optional and Named Arguments -
      -
    3. Writing Readable Code -
        -
      1. Documentation Strings -
      -
    4. The import Search Path -
    5. Everything Is An Object -
        -
      1. What’s An Object? -
      -
    6. Indenting Code -
    7. Exceptions -
        -
      1. Catching Import Errors -
      -
    8. Unbound Variables -
    9. Everything is Case-Sensitive -
    10. Running Scripts -
    11. Further Reading -
    -
  4. Native Datatypes -
      -
    1. Diving In -
    2. Booleans -
    3. Numbers -
        -
      1. Coercing Integers To Floats And Vice-Versa -
      2. Common Numerical Operations -
      3. Fractions -
      4. Trigonometry -
      5. Numbers In A Boolean Context -
      -
    4. Lists -
        -
      1. Creating A List -
      2. Slicing A List -
      3. Adding Items To A List -
      4. Searching For Values In A List -
      5. Removing Items From A List -
      6. Removing Items From A List: Bonus Round -
      7. Lists In A Boolean Context -
      -
    5. Tuples -
        -
      1. Tuples In A Boolean Context -
      2. Assigning Multiple Values At Once -
      -
    6. Sets -
        -
      1. Creating A Set -
      2. Modifying A Set -
      3. Removing Items From A Set -
      4. Common Set Operations -
      5. Sets In A Boolean Context -
      -
    7. Dictionaries -
        -
      1. Creating A Dictionary -
      2. Modifying A Dictionary -
      3. Mixed-Value Dictionaries -
      4. Dictionaries In A Boolean Context -
      -
    8. None -
        -
      1. None In A Boolean Context -
      -
    9. Further Reading -
    -
  5. Comprehensions -
      -
    1. Diving In -
    2. Working With Files And Directories -
        -
      1. The Current Working Directory -
      2. Working With Filenames and Directory Names -
      3. Listing Directories -
      4. Getting File Metadata -
      5. Constructing Absolute Pathnames -
      -
    3. List Comprehensions -
    4. Dictionary Comprehensions -
        -
      1. Other Fun Stuff To Do With Dictionary Comprehensions -
      -
    5. Set Comprehensions -
    6. Further Reading -
    -
  6. Strings -
      -
    1. Some Boring Stuff You Need To Understand Before You Can Dive In -
    2. Unicode -
    3. Diving In -
    4. Formatting Strings -
        -
      1. Compound Field Names -
      2. Format Specifiers -
      -
    5. Other Common String Methods -
        -
      1. Slicing A String -
      -
    6. Strings vs. Bytes -
    7. Postscript: Character Encoding Of Python Source Code -
    8. Further Reading -
    -
  7. Regular Expressions -
      -
    1. Diving In -
    2. Case Study: Street Addresses -
    3. Case Study: Roman Numerals -
        -
      1. Checking For Thousands -
      2. Checking For Hundreds -
      -
    4. Using The {n,m} Syntax -
        -
      1. Checking For Tens And Ones -
      -
    5. Verbose Regular Expressions -
    6. Case study: Parsing Phone Numbers -
    7. Summary -
    -
  8. Closures & Generators -
      -
    1. Diving In -
    2. I Know, Let’s Use Regular Expressions! -
    3. A List Of Functions -
    4. A List Of Patterns -
    5. A File Of Patterns -
    6. Generators -
        -
      1. A Fibonacci Generator -
      2. A Plural Rule Generator -
      -
    7. Further Reading -
    -
  9. Classes & Iterators -
      -
    1. Diving In -
    2. Defining Classes -
        -
      1. The __init__() Method -
      -
    3. Instantiating Classes -
    4. Instance Variables -
    5. A Fibonacci Iterator -
    6. A Plural Rule Iterator -
    7. Further Reading -
    -
  10. Advanced Iterators -
      -
    1. Diving In -
    2. Finding all occurrences of a pattern -
    3. Finding the unique items in a sequence -
    4. Making assertions -
    5. Generator expressions -
    6. Calculating Permutations… The Lazy Way! -
    7. Other Fun Stuff in the itertools Module -
    8. A New Kind Of String Manipulation -
    9. Evaluating Arbitrary Strings As Python Expressions -
    10. Putting It All Together -
    11. Further Reading -
    -
  11. Unit Testing -
      -
    1. (Not) Diving In -
    2. A Single Question -
    3. “Halt And Catch Fire” -
    4. More Halting, More Fire -
    5. And One More Thing… -
    6. A Pleasing Symmetry -
    7. More Bad Input -
    -
  12. Refactoring -
      -
    1. Diving In -
    2. Handling Changing Requirements -
    3. Refactoring -
    4. Summary -
    -
  13. Files -
      -
    1. Diving In -
    2. Reading From Text Files -
        -
      1. Character Encoding Rears Its Ugly Head -
      2. Stream Objects -
      3. Reading Data From A Text File -
      4. Closing Files -
      5. Closing Files Automatically -
      6. Reading Data One Line At A Time -
      -
    3. Writing to Text Files -
        -
      1. Character Encoding Again -
      -
    4. Binary Files -
    5. Stream Objects From Non-File Sources -
        -
      1. Handling Compressed Files -
      -
    6. Standard Input, Output, and Error -
        -
      1. Redirecting Standard Output -
      -
    7. Further Reading -
    -
  14. XML -
      -
    1. Diving In -
    2. A 5-Minute Crash Course in XML -
    3. The Structure Of An Atom Feed -
    4. Parsing XML -
        -
      1. Elements Are Lists -
      2. Attributes Are Dictionaries -
      -
    5. Searching For Nodes Within An XML Document -
    6. Going Further With lxml -
    7. Generating XML -
    8. Parsing Broken XML -
    9. Further Reading -
    -
  15. Serializing Python Objects -
      -
    1. Diving In -
        -
      1. A Quick Note About The Examples in This Chapter -
      -
    2. Saving Data to a Pickle File -
    3. Loading Data from a Pickle File -
    4. Pickling Without a File -
    5. Bytes and Strings Rear Their Ugly Heads Again -
    6. Debugging Pickle Files -
    7. Serializing Python Objects to be Read by Other Languages -
    8. Saving Data to a JSON File -
    9. Mapping of Python Datatypes to JSON -
    10. Serializing Datatypes Unsupported by JSON -
    11. Loading Data from a JSON File -
    12. Further Reading -
    -
  16. HTTP Web Services -
      -
    1. Diving In -
    2. Features of HTTP -
        -
      1. Caching -
      2. Last-Modified Checking -
      3. ETag Checking -
      4. Compression -
      5. Redirects -
      -
    3. How Not To Fetch Data Over HTTP -
    4. What’s On The Wire? -
    5. Introducing httplib2 -
        -
      1. A Short Digression To Explain Why httplib2 Returns Bytes Instead of Strings -
      2. How httplib2 Handles Caching -
      3. How httplib2 Handles Last-Modified and ETag Headers -
      4. How httplib2 Handles Compression -
      5. How httplib2 Handles Redirects -
      -
    6. Beyond HTTP GET -
    7. Beyond HTTP POST -
    8. Further Reading -
    -
  17. Case Study: Porting chardet to Python 3 -
      -
    1. Diving In -
    2. What is Character Encoding Auto-Detection? -
        -
      1. Isn’t That Impossible? -
      2. Does Such An Algorithm Exist? -
      -
    3. Introducing The chardet Module -
        -
      1. UTF-n With A BOM -
      2. Escaped Encodings -
      3. Multi-Byte Encodings -
      4. Single-Byte Encodings -
      5. windows-1252 -
      -
    4. Running 2to3 -
    5. A Short Digression Into Multi-File Modules -
    6. Fixing What 2to3 Can’t -
        -
      1. False is invalid syntax -
      2. No module named constants -
      3. Name 'file' is not defined -
      4. Can’t use a string pattern on a bytes-like object -
      5. Can't convert 'bytes' object to str implicitly -
      6. Unsupported operand type(s) for +: 'int' and 'bytes' -
      7. ord() expected string of length 1, but int found -
      8. Unorderable types: int() >= str() -
      9. Global name 'reduce' is not defined -
      -
    7. Summary -
    -
  18. Packaging Python Libraries -
      -
    1. Diving In -
    2. Things Distutils Can’t Do For You -
    3. Directory Structure -
    4. Writing Your Setup Script -
    5. Classifying Your Package -
        -
      1. Examples of Good Package Classifiers -
      -
    6. Specifying Additional Files With A Manifest -
    7. Checking Your Setup Script for Errors -
    8. Creating a Source Distribution -
    9. Creating a Graphical Installer -
        -
      1. Building Installable Packages for Other Operating Systems -
      -
    10. Adding Your Software to The Python Package Index -
    11. The Many Possible Futures of Python Packaging -
    12. Further Reading -
    -
  19. Porting Code to Python 3 with 2to3 -
      -
    1. Diving In -
    2. print statement -
    3. Unicode string literals -
    4. unicode() global function -
    5. long data type -
    6. <> comparison -
    7. has_key() dictionary method -
    8. Dictionary methods that return lists -
    9. Modules that have been renamed or reorganized -
        -
      1. http -
      2. urllib -
      3. dbm -
      4. xmlrpc -
      5. Other modules -
      -
    10. Relative imports within a package -
    11. next() iterator method -
    12. filter() global function -
    13. map() global function -
    14. reduce() global function -
    15. apply() global function -
    16. intern() global function -
    17. exec statement -
    18. execfile statement -
    19. repr literals (backticks) -
    20. try...except statement -
    21. raise statement -
    22. throw method on generators -
    23. xrange() global function -
    24. raw_input() and input() global functions -
    25. func_* function attributes -
    26. xreadlines() I/O method -
    27. lambda functions that take a tuple instead of multiple parameters -
    28. Special method attributes -
    29. __nonzero__ special method -
    30. Octal literals -
    31. sys.maxint -
    32. callable() global function -
    33. zip() global function -
    34. StandardError exception -
    35. types module constants -
    36. isinstance() global function -
    37. basestring datatype -
    38. itertools module -
    39. sys.exc_type, sys.exc_value, sys.exc_traceback -
    40. List comprehensions over tuples -
    41. os.getcwdu() function -
    42. Metaclasses -
    43. Matters of style -
        -
      1. set() literals (explicit) -
      2. buffer() global function (explicit) -
      3. Whitespace around commas (explicit) -
      4. Common idioms (explicit) -
      -
    -
  20. Special Method Names -
      -
    1. Diving In -
    2. Basics -
    3. Classes That Act Like Iterators -
    4. Computed Attributes -
    5. Classes That Act Like Functions -
    6. Classes That Act Like Sequences -
    7. Classes That Act Like Dictionaries -
    8. Classes That Act Like Numbers -
    9. Classes That Can Be Compared -
    10. Classes That Can Be Serialized -
    11. Classes That Can Be Used in a with Block -
    12. Really Esoteric Stuff -
    13. Further Reading -
    -
  21. Where to Go From Here -
      -
    1. Things to Read -
    2. Where To Look For Python 3-Compatible Code -
    -
- -

© 2001–10 Mark Pilgrim - + + +Table of contents - Dive Into Python 3 + + + + + +

 
+

You are here: Home Dive Into Python 3 +

Table of Contents

+ +
    +
  1. What’s New in “Dive Into Python 3” +
      +
    1. a.k.a. “the minus level” +
    +
  2. Installing Python +
      +
    1. Diving In +
    2. Which Python Is Right For You? +
    3. Installing on Microsoft Windows +
    4. Installing on Mac OS X +
    5. Installing on Ubuntu Linux +
    6. Installing on Other Platforms +
    7. Using The Python Shell +
    8. Python Editors and IDEs +
    +
  3. Your First Python Program +
      +
    1. Diving In +
    2. Declaring Functions +
        +
      1. Optional and Named Arguments +
      +
    3. Writing Readable Code +
        +
      1. Documentation Strings +
      +
    4. The import Search Path +
    5. Everything Is An Object +
        +
      1. What’s An Object? +
      +
    6. Indenting Code +
    7. Exceptions +
        +
      1. Catching Import Errors +
      +
    8. Unbound Variables +
    9. Everything is Case-Sensitive +
    10. Running Scripts +
    11. Further Reading +
    +
  4. Native Datatypes +
      +
    1. Diving In +
    2. Booleans +
    3. Numbers +
        +
      1. Coercing Integers To Floats And Vice-Versa +
      2. Common Numerical Operations +
      3. Fractions +
      4. Trigonometry +
      5. Numbers In A Boolean Context +
      +
    4. Lists +
        +
      1. Creating A List +
      2. Slicing A List +
      3. Adding Items To A List +
      4. Searching For Values In A List +
      5. Removing Items From A List +
      6. Removing Items From A List: Bonus Round +
      7. Lists In A Boolean Context +
      +
    5. Tuples +
        +
      1. Tuples In A Boolean Context +
      2. Assigning Multiple Values At Once +
      +
    6. Sets +
        +
      1. Creating A Set +
      2. Modifying A Set +
      3. Removing Items From A Set +
      4. Common Set Operations +
      5. Sets In A Boolean Context +
      +
    7. Dictionaries +
        +
      1. Creating A Dictionary +
      2. Modifying A Dictionary +
      3. Mixed-Value Dictionaries +
      4. Dictionaries In A Boolean Context +
      +
    8. None +
        +
      1. None In A Boolean Context +
      +
    9. Further Reading +
    +
  5. Comprehensions +
      +
    1. Diving In +
    2. Working With Files And Directories +
        +
      1. The Current Working Directory +
      2. Working With Filenames and Directory Names +
      3. Listing Directories +
      4. Getting File Metadata +
      5. Constructing Absolute Pathnames +
      +
    3. List Comprehensions +
    4. Dictionary Comprehensions +
        +
      1. Other Fun Stuff To Do With Dictionary Comprehensions +
      +
    5. Set Comprehensions +
    6. Further Reading +
    +
  6. Strings +
      +
    1. Some Boring Stuff You Need To Understand Before You Can Dive In +
    2. Unicode +
    3. Diving In +
    4. Formatting Strings +
        +
      1. Compound Field Names +
      2. Format Specifiers +
      +
    5. Other Common String Methods +
        +
      1. Slicing A String +
      +
    6. Strings vs. Bytes +
    7. Postscript: Character Encoding Of Python Source Code +
    8. Further Reading +
    +
  7. Regular Expressions +
      +
    1. Diving In +
    2. Case Study: Street Addresses +
    3. Case Study: Roman Numerals +
        +
      1. Checking For Thousands +
      2. Checking For Hundreds +
      +
    4. Using The {n,m} Syntax +
        +
      1. Checking For Tens And Ones +
      +
    5. Verbose Regular Expressions +
    6. Case study: Parsing Phone Numbers +
    7. Summary +
    +
  8. Closures & Generators +
      +
    1. Diving In +
    2. I Know, Let’s Use Regular Expressions! +
    3. A List Of Functions +
    4. A List Of Patterns +
    5. A File Of Patterns +
    6. Generators +
        +
      1. A Fibonacci Generator +
      2. A Plural Rule Generator +
      +
    7. Further Reading +
    +
  9. Classes & Iterators +
      +
    1. Diving In +
    2. Defining Classes +
        +
      1. The __init__() Method +
      +
    3. Instantiating Classes +
    4. Instance Variables +
    5. A Fibonacci Iterator +
    6. A Plural Rule Iterator +
    7. Further Reading +
    +
  10. Advanced Iterators +
      +
    1. Diving In +
    2. Finding all occurrences of a pattern +
    3. Finding the unique items in a sequence +
    4. Making assertions +
    5. Generator expressions +
    6. Calculating Permutations… The Lazy Way! +
    7. Other Fun Stuff in the itertools Module +
    8. A New Kind Of String Manipulation +
    9. Evaluating Arbitrary Strings As Python Expressions +
    10. Putting It All Together +
    11. Further Reading +
    +
  11. Unit Testing +
      +
    1. (Not) Diving In +
    2. A Single Question +
    3. “Halt And Catch Fire” +
    4. More Halting, More Fire +
    5. And One More Thing… +
    6. A Pleasing Symmetry +
    7. More Bad Input +
    +
  12. Refactoring +
      +
    1. Diving In +
    2. Handling Changing Requirements +
    3. Refactoring +
    4. Summary +
    +
  13. Files +
      +
    1. Diving In +
    2. Reading From Text Files +
        +
      1. Character Encoding Rears Its Ugly Head +
      2. Stream Objects +
      3. Reading Data From A Text File +
      4. Closing Files +
      5. Closing Files Automatically +
      6. Reading Data One Line At A Time +
      +
    3. Writing to Text Files +
        +
      1. Character Encoding Again +
      +
    4. Binary Files +
    5. Stream Objects From Non-File Sources +
        +
      1. Handling Compressed Files +
      +
    6. Standard Input, Output, and Error +
        +
      1. Redirecting Standard Output +
      +
    7. Further Reading +
    +
  14. XML +
      +
    1. Diving In +
    2. A 5-Minute Crash Course in XML +
    3. The Structure Of An Atom Feed +
    4. Parsing XML +
        +
      1. Elements Are Lists +
      2. Attributes Are Dictionaries +
      +
    5. Searching For Nodes Within An XML Document +
    6. Going Further With lxml +
    7. Generating XML +
    8. Parsing Broken XML +
    9. Further Reading +
    +
  15. Serializing Python Objects +
      +
    1. Diving In +
        +
      1. A Quick Note About The Examples in This Chapter +
      +
    2. Saving Data to a Pickle File +
    3. Loading Data from a Pickle File +
    4. Pickling Without a File +
    5. Bytes and Strings Rear Their Ugly Heads Again +
    6. Debugging Pickle Files +
    7. Serializing Python Objects to be Read by Other Languages +
    8. Saving Data to a JSON File +
    9. Mapping of Python Datatypes to JSON +
    10. Serializing Datatypes Unsupported by JSON +
    11. Loading Data from a JSON File +
    12. Further Reading +
    +
  16. HTTP Web Services +
      +
    1. Diving In +
    2. Features of HTTP +
        +
      1. Caching +
      2. Last-Modified Checking +
      3. ETag Checking +
      4. Compression +
      5. Redirects +
      +
    3. How Not To Fetch Data Over HTTP +
    4. What’s On The Wire? +
    5. Introducing httplib2 +
        +
      1. A Short Digression To Explain Why httplib2 Returns Bytes Instead of Strings +
      2. How httplib2 Handles Caching +
      3. How httplib2 Handles Last-Modified and ETag Headers +
      4. How httplib2 Handles Compression +
      5. How httplib2 Handles Redirects +
      +
    6. Beyond HTTP GET +
    7. Beyond HTTP POST +
    8. Further Reading +
    +
  17. Case Study: Porting chardet to Python 3 +
      +
    1. Diving In +
    2. What is Character Encoding Auto-Detection? +
        +
      1. Isn’t That Impossible? +
      2. Does Such An Algorithm Exist? +
      +
    3. Introducing The chardet Module +
        +
      1. UTF-n With A BOM +
      2. Escaped Encodings +
      3. Multi-Byte Encodings +
      4. Single-Byte Encodings +
      5. windows-1252 +
      +
    4. Running 2to3 +
    5. A Short Digression Into Multi-File Modules +
    6. Fixing What 2to3 Can’t +
        +
      1. False is invalid syntax +
      2. No module named constants +
      3. Name 'file' is not defined +
      4. Can’t use a string pattern on a bytes-like object +
      5. Can't convert 'bytes' object to str implicitly +
      6. Unsupported operand type(s) for +: 'int' and 'bytes' +
      7. ord() expected string of length 1, but int found +
      8. Unorderable types: int() >= str() +
      9. Global name 'reduce' is not defined +
      +
    7. Summary +
    +
  18. Packaging Python Libraries +
      +
    1. Diving In +
    2. Things Distutils Can’t Do For You +
    3. Directory Structure +
    4. Writing Your Setup Script +
    5. Classifying Your Package +
        +
      1. Examples of Good Package Classifiers +
      +
    6. Specifying Additional Files With A Manifest +
    7. Checking Your Setup Script for Errors +
    8. Creating a Source Distribution +
    9. Creating a Graphical Installer +
        +
      1. Building Installable Packages for Other Operating Systems +
      +
    10. Adding Your Software to The Python Package Index +
    11. The Many Possible Futures of Python Packaging +
    12. Further Reading +
    +
  19. Porting Code to Python 3 with 2to3 +
      +
    1. Diving In +
    2. print statement +
    3. Unicode string literals +
    4. unicode() global function +
    5. long data type +
    6. <> comparison +
    7. has_key() dictionary method +
    8. Dictionary methods that return lists +
    9. Modules that have been renamed or reorganized +
        +
      1. http +
      2. urllib +
      3. dbm +
      4. xmlrpc +
      5. Other modules +
      +
    10. Relative imports within a package +
    11. next() iterator method +
    12. filter() global function +
    13. map() global function +
    14. reduce() global function +
    15. apply() global function +
    16. intern() global function +
    17. exec statement +
    18. execfile statement +
    19. repr literals (backticks) +
    20. try...except statement +
    21. raise statement +
    22. throw method on generators +
    23. xrange() global function +
    24. raw_input() and input() global functions +
    25. func_* function attributes +
    26. xreadlines() I/O method +
    27. lambda functions that take a tuple instead of multiple parameters +
    28. Special method attributes +
    29. __nonzero__ special method +
    30. Octal literals +
    31. sys.maxint +
    32. callable() global function +
    33. zip() global function +
    34. StandardError exception +
    35. types module constants +
    36. isinstance() global function +
    37. basestring datatype +
    38. itertools module +
    39. sys.exc_type, sys.exc_value, sys.exc_traceback +
    40. List comprehensions over tuples +
    41. os.getcwdu() function +
    42. Metaclasses +
    43. Matters of style +
        +
      1. set() literals (explicit) +
      2. buffer() global function (explicit) +
      3. Whitespace around commas (explicit) +
      4. Common idioms (explicit) +
      +
    +
  20. Special Method Names +
      +
    1. Diving In +
    2. Basics +
    3. Classes That Act Like Iterators +
    4. Computed Attributes +
    5. Classes That Act Like Functions +
    6. Classes That Act Like Sequences +
    7. Classes That Act Like Dictionaries +
    8. Classes That Act Like Numbers +
    9. Classes That Can Be Compared +
    10. Classes That Can Be Serialized +
    11. Classes That Can Be Used in a with Block +
    12. Really Esoteric Stuff +
    13. Further Reading +
    +
  21. Where to Go From Here +
      +
    1. Things to Read +
    2. Where To Look For Python 3-Compatible Code +
    +
+ +

© 2001–10 Mark Pilgrim +
diff --git a/util/lesscss.py b/util/lesscss.py
index 9342d22..c39249c 100755
--- a/util/lesscss.py
+++ b/util/lesscss.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python2.6
+#!/usr/bin/python2.5
 from pyquery import PyQuery as pq
 import glob
@@ -12,10 +12,7 @@ SELECTOR_EXCEPTIONS = ('.w', '.b', '.str', '.kwd', '.com', '.typ', '.lit', '.pun
 filename = sys.argv[1]
 cssfilename = sys.argv[2]
 pqd = pq(filename=filename)
-
-with open(filename, 'rb') as fopen:
-    raw_data = fopen.read()
-
+raw_data = open(filename, 'rb').read()
 if raw_data.count('