diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index 0e4e4e2..8d651b4 100644 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -14,9 +14,9 @@ mark{background:#ff8;font-weight:bold}
You are here: Home ‣ Dive Into Python 3 ‣ -
chardet to Python 3chardet to Python 3-❝ Words, words. They’re all we have to go on. ❞
— Rosencrantz and Guildenstern are Dead +❝ Words, words. They’re all we have to go on. ❞
— Rosencrantz and Guildenstern are Dead
[8] Again, I should point out that map can take a list, a tuple, or any object that acts like a sequence. See previous footnote about filter.
I want to talk about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. - Generators are new in Python 2.3. But first, let's talk about how to make plural nouns. -
If you haven't read Chapter 7, Regular Expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and quickly descends into -more advanced uses. -
English is a schizophrenic language that borrows from a lot of other languages, and the rules for making singular nouns into -plural nouns are varied and complex. There are rules, and then there are exceptions to those rules, and then there are exceptions -to the exceptions. -
If you grew up in an English-speaking country or learned English in a formal school setting, you're probably familiar with -the basic rules: -
(I know, there are a lot of exceptions. “Man” becomes “men” and “woman” becomes “women”, but “human” becomes “humans”. “Mouse” becomes “mice” and “louse” becomes “lice”, but “house” becomes “houses”. “Knife” becomes “knives” and “wife” becomes “wives”, but “lowlife” becomes “lowlifes”. And don't even get me started on words that are their own plural, like “sheep”, “deer”, and “haiku”.) -
Other languages are, of course, completely different. -
Let's design a module that pluralizes nouns. Start with just English nouns, and just these four rules, but keep in mind that -you'll inevitably need to add more rules, and you may eventually need to add more languages. -
plural.py, stage 1So you're looking at words, which at least in English are strings of characters. And you have rules that say you need to - find different combinations of characters, and then do different things to them. This sounds like a job for regular expressions. -
plural1.py
-import re
-
-def plural(noun):
- if re.search('[sxz]$', noun): ①
- return re.sub('$', 'es', noun) ②
- elif re.search('[^aeioudgkprt]h$', noun):
- return re.sub('$', 'es', noun)
- elif re.search('[^aeiou]y$', noun):
- return re.sub('y$', 'ies', noun)
- else:
- return noun + 's'
-[sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. So you're checking to see if noun ends with s, x, or z.
-re.sub function performs regular expression-based string substitutions. Let's look at it in more detail.
-re.sub
->>> import re
->>> re.search('[abc]', 'Mark') ①
-<_sre.SRE_Match object at 0x001C1FA8>
->>> re.sub('[abc]', 'o', 'Mark') ②
-'Mork'
->>> re.sub('[abc]', 'o', 'rock') ③
-'rook'
->>> re.sub('[abc]', 'o', 'caps') ④
-'oops'
-Does the string Mark contain a, b, or c? Yes, it contains a.
-OK, now find a, b, or c, and replace it with o. Mark becomes Mork.
-The same function turns rock into rook.
-You might think this would turn caps into oaps, but it doesn't. re.sub replaces all of the matches, not just the first one. So this regular expression turns caps into oops, because both the c and the a get turned into o.
-plural1.py
-import re
-
-def plural(noun):
- if re.search('[sxz]$', noun):
- return re.sub('$', 'es', noun) ①
- elif re.search('[^aeioudgkprt]h$', noun): ②
- return re.sub('$', 'es', noun) ③
- elif re.search('[^aeiou]y$', noun):
- return re.sub('y$', 'ies', noun)
- else:
- return noun + 's'
-plural function. What are you doing? You're replacing the end of string with es. In other words, adding es to the string. You could accomplish the same thing with string concatenation, for example noun + 'es', but I'm using regular expressions for everything, for consistency, for reasons that will become clear later in the chapter.
-^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. You're looking for words that end in H where the H can be heard.
-a, e, i, o, or u. You're looking for words that end in Y that sounds like I.
-
->>> import re
->>> re.search('[^aeiou]y$', 'vacancy') ①
-<_sre.SRE_Match object at 0x001C1FA8>
->>> re.search('[^aeiou]y$', 'boy') ②
->>>
->>> re.search('[^aeiou]y$', 'day')
->>>
->>> re.search('[^aeiou]y$', 'pita') ③
->>>
-vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
-boy does not match, because it ends in oy, and you specifically said that the character before the y could not be o. day does not match, because it ends in ay.
-pita does not match, because it does not end in y.
-re.sub
->>> re.sub('y$', 'ies', 'vacancy') ①
-'vacancies'
->>> re.sub('y$', 'ies', 'agency')
-'agencies'
->>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') ②
-'vacancies'
-This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub.
-y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered? put it here”. In this case, you remember the c before the y, and then when you do the substitution, you substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.)
-Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder
-to read, and it doesn't directly map to the way you first described the pluralizing rules. You originally laid out rules
-like “if the word ends in S, X, or Z, then add ES”. And if you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn't get much more direct than that.
-
plural.py, stage 2Now you're going to add a level of abstraction. You started by defining a list of rules: if this, then do that, otherwise - go to the next rule. Let's temporarily complicate part of the program so you can simplify another part. -
plural2.py
-import re
-
-def match_sxz(noun):
- return re.search('[sxz]$', noun)
-
-def apply_sxz(noun):
- return re.sub('$', 'es', noun)
-
-def match_h(noun):
- return re.search('[^aeioudgkprt]h$', noun)
-
-def apply_h(noun):
- return re.sub('$', 'es', noun)
-
-def match_y(noun):
- return re.search('[^aeiou]y$', noun)
-
-def apply_y(noun):
- return re.sub('y$', 'ies', noun)
-
-def match_default(noun):
- return 1
-
-def apply_default(noun):
- return noun + 's'
-
-rules = ((match_sxz, apply_sxz),
- (match_h, apply_h),
- (match_y, apply_y),
- (match_default, apply_default)
- ) ①
-
-def plural(noun):
- for matchesRule, applyRule in rules: ②
- if matchesRule(noun):③
- return applyRule(noun) ④
-for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules tuple. On the first iteration of the for loop, matchesRule will get match_sxz, and applyRule will get apply_sxz. On the second iteration (assuming you get that far), matchesRule will be assigned match_h, and applyRule will be assigned apply_h.
-for loop, then matchesRule and applyRule are actual functions that you can call. So on the first iteration of the for loop, this is equivalent to calling match_sxz(noun).
-for loop, this is equivalent to calling apply_sxz(noun), and so forth.
-If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. This for loop is equivalent to the following:
-
plural function
-def plural(noun):
- if match_sxz(noun):
- return apply_sxz(noun)
- if match_h(noun):
- return apply_h(noun)
- if match_y(noun):
- return apply_y(noun)
- if match_default(noun):
- return apply_default(noun)
-The benefit here is that the plural function is now simplified. It takes a list of rules, defined elsewhere, and iterates through them in a generic fashion.
-Get a match rule; does it match? Then call the apply rule. The rules could be defined anywhere, in any way. The plural function doesn't care.
-
Now, was adding this level of abstraction worth it? Well, not yet. Let's consider what it would take to add a new rule to
-the function. Well, in the previous example, it would require adding an if statement to the plural function. In this example, it would require adding two functions, match_foo and apply_foo, and then updating the rules list to specify where in the order the new match and apply functions should be called relative to the other rules.
-
This is really just a stepping stone to the next section. Let's move on. -
plural.py, stage 3Defining separate named functions for each match and apply rule isn't really necessary. You never call them directly; you - define them in the rules list and call them through there. Let's streamline the rules definition by anonymizing those functions. -
plural3.py
-import re
-
-rules = \
- (
- (
- lambda word: re.search('[sxz]$', word),
- lambda word: re.sub('$', 'es', word)
- ),
- (
- lambda word: re.search('[^aeioudgkprt]h$', word),
- lambda word: re.sub('$', 'es', word)
- ),
- (
- lambda word: re.search('[^aeiou]y$', word),
- lambda word: re.sub('y$', 'ies', word)
- ),
- (
- lambda word: re.search('$', word),
- lambda word: re.sub('$', 's', word)
- )
- ) ①
-
-def plural(noun):
- for matchesRule, applyRule in rules: ②
- if matchesRule(noun):
- return applyRule(noun)
-match_sxz and apply_sxz, you have “inlined” those function definitions directly into the rules list itself, using lambda functions.
-plural function hasn't changed at all. It iterates through a set of rule functions, checks the first rule, and if it returns a
- true value, calls the second rule and returns the value. Same as above, word for word. The only difference is that the rule
- functions were defined inline, anonymously, using lambda functions. But the plural function doesn't care how they were defined; it just gets a list of rules and blindly works through them.
-Now to add a new rule, all you need to do is define the functions directly in the rules list itself: one match rule, and one apply rule. But defining the rule functions inline like this makes it very clear that
-you have some unnecessary duplication here. You have four pairs of functions, and they all follow the same pattern. The
-match function is a single call to re.search, and the apply function is a single call to re.sub. Let's factor out these similarities.
-
plural.py, stage 4Let's factor out the duplication in the code so that defining new rules can be easier. -
plural4.py
-import re
-
-def buildMatchAndApplyFunctions((pattern, search, replace)):
- matchFunction = lambda word: re.search(pattern, word) ①
- applyFunction = lambda word: re.sub(search, replace, word) ②
- return (matchFunction, applyFunction) ③
-buildMatchAndApplyFunctions is a function that builds other functions dynamically. It takes pattern, search and replace (actually it takes a tuple, but more on that in a minute), and you can build the match function using the lambda syntax to be a function that takes one parameter (word) and calls re.search with the pattern that was passed to the buildMatchAndApplyFunctions function, and the word that was passed to the match function you're building. Whoa.
-re.sub with the search and replace parameters that were passed to the buildMatchAndApplyFunctions function, and the word that was passed to the apply function you're building. This technique of using the values of outside parameters within a
- dynamic function is called closures. You're essentially defining constants within the apply function you're building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function.
-buildMatchAndApplyFunctions function returns a tuple of two values: the two functions you just created. The constants you defined within those functions
- (pattern within matchFunction, and search and replace within applyFunction) stay with those functions, even after you return from buildMatchAndApplyFunctions. That's insanely cool.
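The closure idea described above can be sketched in Python 3 syntax. This is a minimal illustration, not the chapter's own listing: the snake_case names are assumptions, nested def statements stand in for lambda, and the builder takes three separate arguments because Python 3 does not accept a tuple in a parameter list.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    # Both inner functions close over pattern, search, and replace.
    def matches_rule(word):
        return re.search(pattern, word)

    def apply_rule(word):
        return re.sub(search, replace, word)

    return (matches_rule, apply_rule)

# The three strings stay attached to the returned functions
# even after the builder itself has returned.
matches_y, apply_y = build_match_and_apply_functions('[^aeiou]y$', 'y$', 'ies')

vacancy_plural = apply_y('vacancy') if matches_y('vacancy') else None
boy_match = matches_y('boy')
```

Each call to the builder produces a fresh pair of functions with its own remembered values, so building a second rule does not disturb the first.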
-If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it. -
plural4.py continued
-patterns = \
- (
- ('[sxz]$', '$', 'es'),
- ('[^aeioudgkprt]h$', '$', 'es'),
- ('(qu|[^aeiou])y$', 'y$', 'ies'),
- ('$', '$', 's')
- ) ①
-rules = map(buildMatchAndApplyFunctions, patterns) ②
-re.search to see if this rule matches; the second and third are the search and replace expressions you would use in re.sub to actually apply the rule to turn a noun into its plural.
-buildMatchAndApplyFunctions function, which just happens to take three strings as parameters and return a tuple of two functions. This means that rules ends up being exactly the same as the previous example: a list of tuples, where each tuple is a pair of functions, where
- the first function is the match function that calls re.search, and the second function is the apply function that calls re.sub.
-I swear I am not making this up: rules ends up with exactly the same list of functions as the previous example. Unroll the rules definition, and you'll get this: -
-rules = \
- (
- (
- lambda word: re.search('[sxz]$', word),
- lambda word: re.sub('$', 'es', word)
- ),
- (
- lambda word: re.search('[^aeioudgkprt]h$', word),
- lambda word: re.sub('$', 'es', word)
- ),
- (
- lambda word: re.search('[^aeiou]y$', word),
- lambda word: re.sub('y$', 'ies', word)
- ),
- (
- lambda word: re.search('$', word),
- lambda word: re.sub('$', 's', word)
- )
- )
-plural4.py, finishing up
-def plural(noun):
- for matchesRule, applyRule in rules: ①
- if matchesRule(noun):
- return applyRule(noun)
-plural function hasn't changed. Remember, it's completely generic; it takes a list of rule functions and calls them in order.
- It doesn't care how the rules are defined. In stage 2, they were defined as separate named functions. In stage 3, they were defined as anonymous lambda functions. Now in stage 4, they are built dynamically by mapping the buildMatchAndApplyFunctions function onto a list of raw strings. Doesn't matter; the plural function still works the same way.
-Just in case that wasn't mind-blowing enough, I must confess that there was a subtlety in the definition of buildMatchAndApplyFunctions that I skipped over. Let's go back and take another look.
-
buildMatchAndApplyFunctions
-def buildMatchAndApplyFunctions((pattern, search, replace)): ①
-
->>> def foo((a, b, c)):
-... print c
-... print b
-... print a
->>> parameters = ('apple', 'bear', 'catnap')
->>> foo(parameters) ①
-catnap
-bear
-apple
-foo is with a tuple of three elements. When the function is called, the elements are assigned to different local variables within
-foo.
-Now let's go back and see why this auto-tuple-expansion trick was necessary. patterns was a list of tuples, and each tuple had three elements. When you called map(buildMatchAndApplyFunctions, patterns), that means that buildMatchAndApplyFunctions is not getting called with three parameters. Using map to map a single list onto a function always calls the function with a single parameter: each element of the list. In the
- case of patterns, each element of the list is a tuple, so buildMatchAndApplyFunctions always gets called with the tuple, and you use the auto-tuple-expansion trick in the definition of buildMatchAndApplyFunctions to assign the elements of that tuple to named variables that you can work with.
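The tuple-in-the-parameter-list syntax shown above is specific to Python 2; Python 3 removed it. A rough Python 3 equivalent, assuming you are free to change the function signature, is to accept one parameter and unpack it on the first line of the body. The names foo and foo_star are only illustrative.

```python
def foo(parameters):
    # Unpack the three-element tuple explicitly inside the function body.
    a, b, c = parameters
    return (c, b, a)

def foo_star(a, b, c):
    # Alternative: take three parameters and let the caller unpack with *.
    return (c, b, a)

parameters = ('apple', 'bear', 'catnap')

reversed_once = foo(parameters)
reversed_twice = foo_star(*parameters)

# map() still passes each element as a single argument, so foo works with it.
mapped = list(map(foo, [parameters, ('x', 'y', 'z')]))
```

The first form keeps working with map, which is the situation the paragraph above describes; the second form needs the caller to unpack each tuple.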
-
plural.py, stage 5You've factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a - list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained - separately from the code that uses them. -
First, let's create a text file that contains the rules you want. No fancy data structures, just space- (or tab-)delimited
-strings in three columns. You'll call it rules.en; “en” stands for English. These are the rules for pluralizing English nouns. You could add other rule files for other languages
-later.
-
rules.en
-[sxz]$ $ es
-[^aeioudgkprt]h$ $ es
-[^aeiou]y$ y$ ies
-$ $ s
-Now let's see how you can use this rules file. -
plural5.py
-import re
-import string
-
-def buildRule((pattern, search, replace)):
- return lambda word: re.search(pattern, word) and re.sub(search, replace, word) ①
-
-def plural(noun, language='en'): ②
- lines = file('rules.%s' % language).readlines() ③
- patterns = map(string.split, lines) ④
- rules = map(buildRule, patterns) ⑤
- for rule in rules:
- result = rule(noun) ⑥
- if result: return result
-plural function now takes an optional second parameter, language, which defaults to en.
-en, then you'll open the rules.en file, read the entire thing, break it up by carriage returns, and return a list. Each line of the file will be one element
- in the list.
-string.split function onto this list will create a new list where each element is a list of three strings. So a line like [sxz]$ $ es will be broken up into the list ['[sxz]$', '$', 'es']. This means that patterns will end up as a list of three-element sequences, just like you hard-coded it in stage 4.
-buildRule. Calling buildRule(('[sxz]$', '$', 'es')) returns a function that takes a single parameter, word. When this returned function is called, it will execute re.search('[sxz]$', word) and re.sub('$', 'es', word).
-None), then the rule didn't match and you need to try another rule.
-So the improvement here is that you've completely separated the pluralization rules into an external file. Not only can the
-file be maintained separately from the code, but you've set up a naming scheme where the same plural function can use different rule files, based on the language parameter.
-
The downside here is that you're reading that file every time you call the plural function. I thought I could get through this entire book without using the phrase “left as an exercise for the reader”, but here you go: building a caching mechanism for the language-specific rule files that auto-refreshes itself if the rule
-files change between calls is left as an exercise for the reader. Have fun.
-
plural.py, stage 6Now you're ready to talk about generators. -
plural6.py
-import re
-
-def rules(language):
- for line in file('rules.%s' % language):
- pattern, search, replace = line.split()
- yield lambda word: re.search(pattern, word) and re.sub(search, replace, word)
-
-def plural(noun, language='en'):
- for applyRule in rules(language):
- result = applyRule(noun)
- if result: return result
-This uses a technique called generators, which I'm not even going to try to explain until you look at a simpler example first. -
->>> def make_counter(x): -... print 'entering make_counter' -... while 1: -... yield x ① -... print 'incrementing x' -... x = x + 1 -... ->>> counter = make_counter(2) ② ->>> counter ③ -<generator object at 0x001C9C10> ->>> counter.next() ④ -entering make_counter -2 ->>> counter.next() ⑤ -incrementing x -3 ->>> counter.next() ⑥ -incrementing x -4 -
yield keyword in make_counter means that this is not a normal function. It is a special kind of function which generates values one at a time. You can
- think of it as a resumable function. Calling it will return a generator that can be used to generate successive values of
-x.
-make_counter generator, just call it like any other function. Note that this does not actually execute the function code. You can tell
- this because the first line of make_counter is a print statement, but nothing has been printed yet.
-make_counter function returns a generator object.
-next() method on the generator object, it executes the code in make_counter up to the first yield statement, and then returns the value that was yielded. In this case, that will be 2, because you originally created the generator by calling make_counter(2).
-next() on the generator object resumes where you left off and continues until you hit the next yield statement. The next line of code waiting to be executed is the print statement that prints incrementing x, and then after that the x = x + 1 statement that actually increments it. Then you loop through the while loop again, and the first thing you do is yield x, which returns the current value of x (now 3).
-counter.next(), you do all the same things again, but this time x is now 4. And so forth. Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values. But let's look at more productive uses of generators instead.
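For comparison, here is a sketch of the same counter in Python 3 syntax, with the print statements left out. The one difference worth noting is that Python 3 uses the built-in next() function on the generator object instead of a next() method.

```python
def make_counter(x):
    # Nothing in this body runs until the first next() call.
    while True:
        yield x
        x = x + 1

counter = make_counter(2)   # creates the generator; no code has run yet

first = next(counter)       # runs up to the first yield
second = next(counter)      # resumes after the yield, increments, yields again
third = next(counter)
```

Each next() call picks up exactly where the previous one stopped, which is why x keeps its value between calls.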
-
-def fibonacci(max):
- a, b = 0, 1 ①
- while a < max:
- yield a ②
- a, b = b, a+b ③
-0 and 1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1.
-a+b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a+b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b).
-So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way
-is easier to read. Also, it works well with for loops.
-
for loops->>> for n in fibonacci(1000): ① -... print n, ② -0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 -
fibonacci in a for loop directly. The for loop will create the generator object and successively call the next() method to get values to assign to the for loop index variable (n).
-for loop, n gets a new value from the yield statement in fibonacci, and all you do is print it out. Once fibonacci runs out of numbers (a gets bigger than max, which in this case is 1000), then the for loop exits gracefully.
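The same generator can be sketched in Python 3 and consumed by anything that iterates, not only a for loop. In this sketch the parameter is renamed to limit (an assumption, chosen to avoid shadowing the built-in max), and list() is used to collect the values.

```python
def fibonacci(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        # Both right-hand values are computed before either name is rebound.
        a, b = b, a + b

# list() drives the generator to exhaustion, just as a for loop would.
numbers = list(fibonacci(1000))

# Any consumer of iterables works the same way.
total = sum(fibonacci(10))
```

Because the generator stops on its own once a reaches the limit, the consumer needs no special end-of-sequence handling.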
-OK, let's go back to the plural function and see how you're using this.
-
-def rules(language):
- for line in file('rules.%s' % language): ①
- pattern, search, replace = line.split() ②
- yield lambda word: re.search(pattern, word) and re.sub(search, replace, word) ③
-
-def plural(noun, language='en'):
- for applyRule in rules(language): ④
- result = applyRule(noun)
- if result: return result
-for line in file(...) is a common idiom for reading lines from a file, one line at a time. It works because file actually returns an iterator whose next() method returns the next line of the file. That is so insanely cool, I wet myself just thinking about it.
-line.split() returns a list of 3 values, and you assign those values to 3 local variables.
-lambda, that is actually a closure (it uses the local variables pattern, search, and replace as constants). In other words, rules is a generator that spits out rule functions.
-rules is a generator, you can use it directly in a for loop. The first time through the for loop, you will call the rules function, which will open the rules file, read the first line out of it, dynamically build a function that matches and applies
- the first rule defined in the rules file, and yields the dynamically built function. The second time through the for loop, you will pick up where you left off in rules (which was in the middle of the for line in file(...) loop), read the second line of the rules file, dynamically build another function that matches and applies the second rule
- defined in the rules file, and yields it. And so forth.
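To make the laziness visible, here is a Python 3 sketch under a few assumptions: an in-memory io.StringIO object stands in for the rules file, a lines_read list records how many lines were actually consumed, and a small build_rule helper (hypothetical, not part of the listing above) binds each rule's three strings.

```python
import io
import re

RULES_TEXT = """\
[sxz]$ $ es
[^aeioudgkprt]h$ $ es
[^aeiou]y$ y$ ies
$ $ s
"""

lines_read = []  # bookkeeping only, to show how far the generator got

def build_rule(pattern, search, replace):
    def rule(word):
        return re.search(pattern, word) and re.sub(search, replace, word)
    return rule

def rules(rules_file):
    for line in rules_file:
        lines_read.append(line)
        pattern, search, replace = line.split()
        yield build_rule(pattern, search, replace)

def plural(noun, rules_file):
    for apply_rule in rules(rules_file):
        result = apply_rule(noun)
        if result:
            return result

result = plural('fax', io.StringIO(RULES_TEXT))
```

Since fax matches the first rule, plural returns before the generator is ever resumed, so only one line has been read at that point.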
-What have you gained over stage 5? In stage 5, you read the entire rules file and built a list of all the possible rules before you even tried the first one. -Now with generators, you can do everything lazily: you open the file and read the first rule and create a function to try -it, but if that works you don't ever read the rest of the file or create any other functions. -
You talked about several different advanced techniques in this chapter. Not all of them are appropriate for every situation. -
You should now be comfortable with all of these techniques: -
lambda.
-
-Adding abstractions, building functions dynamically, building closures, and using generators can all make your code simpler, -more readable, and more flexible. But they can also end up making it more difficult to debug later. It's up to you to find -the right balance between simplicity and power. -
Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it too much.
chardet to Python 3
-2to3
+chardet to Python 3
+2to3
There is a changelog, a feed, and discussion on Reddit. During development, you can download the book by cloning the Mercurial repository: diff --git a/iterators-and-generators.html b/iterators-and-generators.html new file mode 100644 index 0000000..54aac8c --- /dev/null +++ b/iterators-and-generators.html @@ -0,0 +1,569 @@ + +
+ +You are here: Home ‣ Dive Into Python 3 ‣ +
++❝ East is East, and West is West, and never the twain shall meet. ❞
— Rudyard Kipling +
+
Let’s talk about plural nouns. Also, functions that return other functions, advanced regular expressions, iterators, and generators. But first, let’s talk about how to make plural nouns. (If you haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and quickly descends into more advanced uses.) +
English is a schizophrenic language that borrows from a lot of other languages, and the rules for making singular nouns into plural nouns are varied and complex. There are rules, and then there are exceptions to those rules, and then there are exceptions to the exceptions. +
If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably familiar with the basic rules: +
(I know, there are a lot of exceptions. Man becomes men and woman becomes women, but human becomes humans. Mouse becomes mice and louse becomes lice, but house becomes houses. Knife becomes knives and wife becomes wives, but lowlife becomes lowlifes. And don’t even get me started on words that are their own plural, like sheep, deer, and haiku.) +
Other languages, of course, are completely different. +
Let’s design a Python library that automatically pluralizes English nouns. We’ll start with just these four rules, but keep in mind that you’ll inevitably need to add more. +
So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions! +
import re
+
+def plural(noun):
+ if re.search('[sxz]$', noun): ①
+ return re.sub('$', 'es', noun) ②
+ elif re.search('[^aeioudgkprt]h$', noun):
+ return re.sub('$', 'es', noun)
+ elif re.search('[^aeiou]y$', noun):
+ return re.sub('y$', 'ies', noun)
+ else:
+ return noun + 's'
+[sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. Combined, this regular expression tests whether noun ends with s, x, or z.
+re.sub function performs regular expression-based string substitutions.
+Let’s look at regular expression substitutions in more detail. +
+>>> import re +>>> re.search('[abc]', 'Mark') ① +<_sre.SRE_Match object at 0x001C1FA8> +>>> re.sub('[abc]', 'o', 'Mark') ② +'Mork' +>>> re.sub('[abc]', 'o', 'rock') ③ +'rook' +>>> re.sub('[abc]', 'o', 'caps') ④ +'oops'+
Does the string Mark contain a, b, or c? Yes, it contains a.
+OK, now find a, b, or c, and replace it with o. Mark becomes Mork.
+The same function turns rock into rook.
+You might think this would turn caps into oaps, but it doesn’t. re.sub replaces all of the matches, not just the first one. So this regular expression turns caps into oops, because both the c and the a get turned into o.
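If you ever do want only the first match replaced, re.sub accepts an optional count argument. The short sketch below contrasts the default behavior described above with count=1; the variable names are just for illustration.

```python
import re

# Default: every non-overlapping match is replaced.
all_replaced = re.sub('[abc]', 'o', 'caps')

# count=1 limits the substitution to the first match only.
first_only = re.sub('[abc]', 'o', 'caps', count=1)

print(all_replaced)  # oops
print(first_only)    # oaps
```

The pluralizing rules in this chapter anchor their patterns with $, so there is only ever one match and count is not needed there.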
+And now, back to the plural() function…
+
+
def plural(noun):
+ if re.search('[sxz]$', noun):
+ return re.sub('$', 'es', noun) ①
+ elif re.search('[^aeioudgkprt]h$', noun): ②
+ return re.sub('$', 'es', noun) ③
+ elif re.search('[^aeiou]y$', noun):
+ return re.sub('y$', 'ies', noun)
+ else:
+ return noun + 's'
+$) with the string es. In other words, adding es to the string. You could accomplish the same thing with string concatenation, for example noun + 'es', but I chose to use regular expressions for each rule, for reasons that will become clear later in the chapter.
+^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. You’re looking for words that end in H where the H can be heard.
+a, e, i, o, or u. You’re looking for words that end in Y that sounds like I.
+Let’s look at negation regular expressions in more detail. + +
+>>> import re +>>> re.search('[^aeiou]y$', 'vacancy') ① +<_sre.SRE_Match object at 0x001C1FA8> +>>> re.search('[^aeiou]y$', 'boy') ② +>>> +>>> re.search('[^aeiou]y$', 'day') +>>> +>>> re.search('[^aeiou]y$', 'pita') ③ +>>>+
vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
+boy does not match, because it ends in oy, and you specifically said that the character before the y could not be o. day does not match, because it ends in ay.
+pita does not match, because it does not end in y.
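The same checks can be wrapped in a small helper, which makes it easy to try more words. This is only a sketch; ends_in_consonant_y is a hypothetical name, and note that [^aeiou] accepts any character that is not one of those five vowels, not just consonants.

```python
import re

def ends_in_consonant_y(word):
    """Return True if word ends in y preceded by a character other than a, e, i, o, u."""
    return re.search('[^aeiou]y$', word) is not None

words = ['vacancy', 'boy', 'day', 'pita']
results = {word: ends_in_consonant_y(word) for word in words}
```

A bare y would not match either, because the pattern requires some character before the final y.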
++>>> re.sub('y$', 'ies', 'vacancy') ① +'vacancies' +>>> re.sub('y$', 'ies', 'agency') +'agencies' +>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') ② +'vacancies'+
This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub.
+y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered? put it right here.” In this case, you remember the c before the y; when you do the substitution, you substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.)
+Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn’t directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. If you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn’t get much more direct than that.
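To see what the combined form buys you, and how \2 works alongside \1, here is a brief sketch. The 'hello world' example is made up for illustration and has nothing to do with pluralizing.

```python
import re

# With the group in the pattern, the substitution only fires when the
# character before y is not a vowel, so no separate re.search is needed.
combined_match = re.sub('([^aeiou])y$', r'\1ies', 'vacancy')
combined_no_match = re.sub('([^aeiou])y$', r'\1ies', 'boy')

# Two remembered groups, referenced in swapped order.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
```

When the pattern does not match, re.sub returns the original string unchanged, which is why boy survives intact here.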
+
+
Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part. + +
import re
+
+def match_sxz(noun):
+ return re.search('[sxz]$', noun)
+
+def apply_sxz(noun):
+ return re.sub('$', 'es', noun)
+
+def match_h(noun):
+ return re.search('[^aeioudgkprt]h$', noun)
+
+def apply_h(noun):
+ return re.sub('$', 'es', noun)
+
+def match_y(noun): ①
+ return re.search('[^aeiou]y$', noun)
+
+def apply_y(noun): ②
+ return re.sub('y$', 'ies', noun)
+
+def match_default(noun):
+ return True
+
+def apply_default(noun):
+ return noun + 's'
+
+rules = ((match_sxz, apply_sxz), ③
+ (match_h, apply_h),
+ (match_y, apply_y),
+ (match_default, apply_default)
+ )
+
+def plural(noun):
+ for matches_rule, apply_rule in rules: ④
+ if matches_rule(noun):
+ return apply_rule(noun)
+re.sub() function.
+re.search() function to apply the appropriate pluralization rule.
+plural()) with multiple rules, you have the rules data structure, which is a sequence of pairs of functions.
+plural() function can be reduced to a few lines of code. Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules structure. On the first iteration of the for loop, matches_rule will get match_sxz, and apply_rule will get apply_sxz. On the second iteration (assuming you get that far), matches_rule will be assigned match_h, and apply_rule will be assigned apply_h. The function is guaranteed to return something eventually, because the final match rule (match_default) simply returns True, meaning the corresponding apply rule (apply_default) will always be applied.
+The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions: not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling match_sxz(noun), and if it returns a match, calling apply_sxz(noun).
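If "functions are objects" still feels abstract, a tiny experiment may help. The sketch below reuses two of the functions from the listing above; the name `pair` is just an assumption for the demo.

```python
import re

def match_sxz(noun):
    return re.search('[sxz]$', noun)

def apply_sxz(noun):
    return re.sub('$', 'es', noun)

# The tuple holds the function objects themselves, not their names.
pair = (match_sxz, apply_sxz)

# Unpacking gives two new names bound to the very same function objects.
matches_rule, apply_rule = pair
print(matches_rule is match_sxz)   # True
print(callable(apply_rule))        # True

# Calling through the new names is the same as calling the originals.
if matches_rule('box'):
    print(apply_rule('box'))       # boxes
```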
+
+
If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following:
+
+
+def plural(noun):
+ if match_sxz(noun):
+ return apply_sxz(noun)
+ if match_h(noun):
+ return apply_h(noun)
+ if match_y(noun):
+ return apply_y(noun)
+ if match_default(noun):
+ return apply_default(noun)
+
+The benefit here is that the plural() function is now simplified. It takes a list of rules, defined elsewhere, and iterates through them in a generic fashion.
+
+
The rules could be defined anywhere, in any way. The plural() function doesn’t care.
+
+
Now, was adding this level of abstraction worth it? Well, not yet. Let’s consider what it would take to add a new rule to the function. In the first example, it would require adding an if statement to the plural function. In this second example, it would require adding two functions, match_foo() and apply_foo(), and then updating the rules list to specify where in the order the new match and apply functions should be called relative to the other rules.
+
+
But this is really just a stepping stone to the next section. Let’s move on… + +
Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the rules list and call them through there. Furthermore, each function follows one of two patterns. All the match functions call re.search(), and all the apply functions call re.sub(). Let’s factor out the patterns so that defining new rules can be easier.
+
+
import re
+
+def build_match_and_apply_functions(pattern, search, replace):
+ def matches_rule(word): ①
+ return re.search(pattern, word)
+ def apply_rule(word): ②
+ return re.sub(search, replace, word)
+ return (matches_rule, apply_rule) ③
+build_match_and_apply_functions is a function that builds other functions dynamically. It takes pattern, search and replace, then defines a matches_rule() function which calls re.search() with the pattern that was passed to the build_match_and_apply_functions() function, and the word that was passed to the matches_rule() function you’re building. Whoa.
+re.sub() with the search and replace parameters that were passed to the build_match_and_apply_functions function, and the word that was passed to the apply_rule() function you’re building. This technique of using the values of outside parameters within a dynamic function is called closures. You’re essentially defining constants within the apply function you’re building: it takes one parameter (word), but it then acts on that plus two other values (search and replace) which were set when you defined the apply function.
+build_match_and_apply_functions function returns a tuple of two values: the two functions you just created. The constants you defined within those functions (pattern within matches_rule, and search and replace within apply_rule) stay with those functions, even after you return from build_match_and_apply_functions. That’s insanely cool.
+If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it. + +
+patterns = \ ①
+ [
+ ['[sxz]$', '$', 'es'],
+ ['[^aeioudgkprt]h$', '$', 'es'],
+ ['(qu|[^aeiou])y$', 'y$', 'ies'],
+ ['$', '$', 's']
+ ]
+rules = [build_match_and_apply_functions(pattern, search, replace) ②
+ for (pattern, search, replace) in patterns]
+re.search() to see if this rule matches. The second and third strings in each group are the search and replace expressions you would use in re.sub() to actually apply the rule to turn a noun into its plural.
+build_match_and_apply_functions function, which just happens to take three strings as parameters and return a tuple of two functions. This means that rules ends up being exactly the same as the previous example: a list of tuples, where each tuple is a pair of functions, where the first function is the match function that calls re.search(), and the second function is the apply function that calls re.sub().
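To see that each pair of generated functions really does carry its own "constants" around, you can build two pairs by hand and poke at them. This is a sketch for exploration only: the names `matches_y`, `apply_y`, `matches_s`, `apply_s` and `captured` are assumptions, and inspecting `__closure__` is a debugging trick rather than something the pluralizer itself needs.

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

# Two independent pairs, each closing over its own three strings.
matches_y, apply_y = build_match_and_apply_functions('[^aeiou]y$', 'y$', 'ies')
matches_s, apply_s = build_match_and_apply_functions('$', '$', 's')

print(apply_y('vacancy'))        # vacancies
print(apply_s('cat'))            # cats
print(bool(matches_y('boy')))    # False

# The captured values live in closure cells attached to each function.
captured = sorted(cell.cell_contents for cell in apply_y.__closure__)
print(captured)                  # ['ies', 'y$']
```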
+Rounding out this version of the script is the main entry point, the plural() function.
+
+
def plural(noun):
+ for matches_rule, apply_rule in rules: ①
+ if matches_rule(noun):
+ return apply_rule(noun)
+plural() function hasn’t changed at all. It’s completely generic; it takes a list of rule functions and calls them in order. It doesn’t care how the rules are defined. In the previous example, they were defined as separate named functions. Now they are built dynamically by mapping the output of the build_match_and_apply_functions() function onto a list of raw strings. It doesn’t matter; the plural() function still works the same way.
+You’ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them. + +
First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-delimited strings in three columns. Let’s call it plural4-rules.txt.
+
+
[download plural4-rules.txt]
+
[sxz]$ $ es
+[^aeioudgkprt]h$ $ es
+[^aeiou]y$ y$ ies
+$ $ s
+
+Now let’s see how you can use this rules file. + +
import re
+
+def build_match_and_apply_functions(pattern, search, replace): ①
+ def matches_rule(word):
+ return re.search(pattern, word)
+ def apply_rule(word):
+ return re.sub(search, replace, word)
+ return (matches_rule, apply_rule)
+
+rules = []
+pattern_file = open('plural4-rules.txt') ②
+try:
+ for line in pattern_file: ③
+ pattern, search, replace = line.split(None, 3) ④
+ rules.append(build_match_and_apply_functions( ⑤
+ pattern, search, replace))
+finally:
+ pattern_file.close() ⑥
+build_match_and_apply_functions() function has not changed. You’re still using closures to build two functions dynamically that use variables defined in the outer function.
+for line in <fileobject> idiom.
+split() string method. The first argument to the split() method is None, which means “split on any whitespace (tabs or spaces, it makes no difference).” The second argument is 3, which means “split on whitespace at most 3 times, then leave the rest of the line alone.” A line like [sxz]$ $ es will be broken up into the list ['[sxz]$', '$', 'es'], which means that pattern will get '[sxz]$', search will get '$', and replace will get 'es'. That’s a lot of power in one little line of code.
+try..finally block to ensure the file object is closed.
+The improvement here is that you’ve completely separated the pluralization rules into an external file, so it can be maintained separately from the code that uses it. Code is code, data is data, and life is good. + +
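It is worth checking what `line.split(None, 3)` does with lines that are not perfectly formed. The sketch below uses made-up sample lines, and `parse_rule` is a hypothetical helper (not in the original script) that turns a malformed line into a clearer error.

```python
# A well-formed line: mixed spaces and tabs, trailing newline.
good = '[sxz]$  $\tes\n'
print(good.split(None, 3))    # ['[sxz]$', '$', 'es']

# A line with extra columns: after 3 splits the remainder is kept, not dropped.
extra = 'a b c d e\n'
print(extra.split(None, 3))   # ['a', 'b', 'c', 'd e\n']

def parse_rule(line):
    # Hypothetical helper: insist on exactly three columns.
    fields = line.split(None, 3)
    if len(fields) != 3:
        raise ValueError('expected 3 columns, got {0!r}'.format(line))
    return tuple(fields)

print(parse_rule(good))       # ('[sxz]$', '$', 'es')
```

With the plain three-name unpacking used in the script, a four-column line would raise a `ValueError` at the assignment, so a stray extra word in the rules file is noticed rather than silently ignored.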
Now you’re ready to learn about generators. + +
def rules():
+ for line in open('plural5-rules.txt'):
+ pattern, search, replace = line.split(None, 3)
+ yield build_match_and_apply_functions(pattern, search, replace)
+
+def plural(noun):
+ for matches_rule, apply_rule in rules():
+ if matches_rule(noun):
+ return apply_rule(noun)
+
+How the heck does that work? Let’s look at an interactive example first. + +
+>>> def make_counter(x): +... print('entering make_counter') +... while True: +... yield x ① +... print('incrementing x') +... x = x + 1 +... +>>> counter = make_counter(2) ② +>>> counter ③ +<generator object at 0x001C9C10> +>>> next(counter) ④ +entering make_counter +2 +>>> next(counter) ⑤ +incrementing x +3 +>>> next(counter) ⑥ +incrementing x +4+
yield keyword in make_counter means that this is not a normal function. It is a special kind of function which generates values one at a time. You can think of it as a resumable function. Calling it will return a generator that can be used to generate successive values of x.
+make_counter generator, just call it like any other function. Note that this does not actually execute the function code. You can tell this because the first line of the make_counter() function calls print(), but nothing has been printed yet.
+make_counter() function returns a generator object.
+next() function takes a generator object and returns its next value. The first time you call next() with the counter generator, it executes the code in make_counter() up to the first yield statement, then returns the value that was yielded. In this case, that will be 2, because you originally created the generator by calling make_counter(2).
+next() with the same generator object resumes exactly where it left off and continues until it hits the next yield statement. All variables, local state, &c. are saved on yield and restored on next(). The next line of code waiting to be executed calls print(), which prints incrementing x. After that, the statement x = x + 1. Then it loops through the while loop again, and the first thing it hits is the statement yield x, which saves the state of everything and returns the current value of x (now 3).
+next(counter), you do all the same things again, but this time x is now 4.
+Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values. But let’s look at more productive uses of generators instead.
+
+
def fib(max):
+ a, b = 0, 1 ①
+ while a < max:
+ yield a ②
+ a, b = b, a + b ③
+0 and 1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1.
+a + b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a + b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b).
+So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops.
+
+
+>>> from fibonacci import fib +>>> for n in fib(1000): ① +... print(n, end=' ') ② +0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987+
fib() in a for loop directly. The for loop will automatically call the next() function to get values from the fib() generator and assign them to the for loop index variable (n).
+for loop, n gets a new value from the yield statement in fib(), and all you have to do is print it out. Once fib() runs out of numbers (a becomes bigger than max, which in this case is 1000), then the for loop exits gracefully.
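The same generator can also be driven by hand, which shows what "runs out of numbers" actually looks like. This is a small sketch using the `fib()` generator from the listing above with smaller limits; the name `gen` is an assumption.

```python
def fib(max):
    a, b = 0, 1
    while a < max:
        yield a
        a, b = b, a + b

# Anything that consumes an iterable can consume a generator.
print(list(fib(20)))   # [0, 1, 1, 2, 3, 5, 8, 13]

# Driving it manually: once the while loop ends, next() raises StopIteration.
gen = fib(2)
print(next(gen))       # 0
print(next(gen))       # 1
print(next(gen))       # 1
try:
    next(gen)
except StopIteration:
    print('no more values')
```

A `for` loop does that last `try`/`except` dance for you, which is why it "exits gracefully".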
+Let’s go back to plural5.py and see how this version of the plural() function works.
+
+
def rules():
+ for line in open('plural5-rules.txt'): ①
+ pattern, search, replace = line.split(None, 3) ②
+ yield build_match_and_apply_functions(pattern, search, replace) ③
+
+def plural(noun):
+ for matches_rule, apply_rule in rules(): ④
+ if matches_rule(noun):
+ return apply_rule(noun)
+for line in open(...) is a common idiom for reading from a file one line at a time. But here’s what you might not know: the reason this idiom works is that open() returns a file object, which is an iterator, and calling next() on it returns the next line of the file.
+line.split(None, 3) to get the three “columns” and assign them to three local variables.
+build_match_and_apply_functions(), which is identical to the previous examples. In other words, rules() is a generator that spits out match and apply functions on demand.
+rules() is a generator, you can use it directly in a for loop. The first time through the for loop, you will call the rules() function, which will open the pattern file, read the first line, dynamically build a match function and an apply function from the patterns on that line, and yield the dynamically built functions. The second time through the for loop, you will pick up exactly where you left off in rules() (which was in the middle of the for line in open(...) loop). The first thing it will do is read the next line of the file (which is still open), dynamically build another match and apply function based on the patterns on that line in the file, and yield the two functions.
+What have you gained over stage 4? Startup time. In stage 4, when you imported the plural4 module, it read the entire patterns file and built a list of all the possible rules, before you could even think about calling the plural() function. With generators, you can do everything lazily: you read the first rule and create functions and try them, and if that works you don’t ever read the rest of the file or create any other functions.
+
+
What have you lost? Performance! Every time you call the plural() function, the rules() generator starts over from the beginning — which means re-opening the patterns file and reading from the beginning, one line at a time.
+
+
What if you could have the best of both worlds: minimal startup cost (don’t execute any code on import), and maximum performance (don’t build the same functions over and over again). Oh, and you still want to keep the rules in a separate file (because code is code and data is data), just as long as you never have to read the same line twice.
+
+
In truth, generators are a special case of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that.
+
+
Remember the Fibonacci generator? Here it is as a built-from-scratch iterator: + +
class fib: ①
+ def __init__(self, max): ②
+ self.max = max
+
+ def __iter__(self): ③
+ self.a, self.b = 0, 1
+ return self
+
+ def __next__(self): ④
+ fib = self.a
+ if fib > self.max:
+ raise StopIteration ⑤
+ self.a, self.b = self.b, self.a + self.b
+ return fib ⑥
+fib needs to be a class, not a function.
+fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later.
+__iter__() method is called whenever someone calls iter(fib). (As you’ll see in a minute, a for loop will call this automatically, but you can also call it yourself manually.) After performing beginning-of-iteration initialization (in this case, resetting self.a and self.b, our two counters), the __iter__() method can return any object that implements a __next__() method. In this case (and in most cases), __iter__() simply returns self, since this class implements its own __next__() method.
+__next__() method is called whenever someone calls next() on an iterator of an instance of a class. That will make more sense in a minute.
+__next__() method raises a StopIteration exception, this signals to the caller that the iteration is over; no more values are available. If the caller is a for loop, it will notice this StopIteration exception and gracefully exit the loop. (In other words, it will swallow the exception.) This little bit of magic is actually the key to using iterators in for loops.
+__next__() method simply returns the value. Do not use yield here; that’s a bit of syntactic sugar that only applies when you’re using generators. Here you’re creating your own iterator from scratch; use return instead.
+Thoroughly confused yet? Excellent. Let’s see how to call this iterator:
+ ++>>> from fibonacci2 import fib +>>> for n in fib(1000): +... print(n, end=' ') +0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987+ +
Why, it’s exactly the same! Byte for byte identical to how you called Fibonacci-as-a-generator! But how? + +
I told you there was a bit of magic involved in for loops. Here’s what happens:
+
+
for loop calls fib(1000), as shown. This returns an instance of the fib class. Call this fib_inst.
+for loop calls iter(fib_inst), which returns an iterator object. Call this fib_iter. In this case, fib_iter == fib_inst, because the __iter__() method returns self, but the for loop doesn’t know (or care) about that.
+for loop calls next(fib_iter), which calls the __next__() method on the fib_iter object, which does the next-Fibonacci-number calculations and returns a value. The for loop takes this value and assigns it to n, then executes the body of the for loop for that value of n.
+for loop know when to stop? I’m glad you asked! When next(fib_iter) raises a StopIteration exception, the for loop will swallow the exception and gracefully exit. (Any other exception will pass through and be raised as usual.) And where have you seen a StopIteration exception? In the __next__() method, of course!
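Those steps can be written out by hand. The following is a rough sketch of what a `for` loop does on your behalf, using the `fib` class from above with a smaller limit; `fib_inst`, `fib_iter` and `values` are just names chosen for the demo.

```python
class fib:
    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a, self.b = 0, 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib

fib_inst = fib(20)            # create the instance
fib_iter = iter(fib_inst)     # calls fib_inst.__iter__()
print(fib_iter is fib_inst)   # True, because __iter__() returns self

values = []
while True:
    try:
        n = next(fib_iter)    # calls fib_iter.__next__()
    except StopIteration:
        break                 # the "graceful exit" a for loop performs
    values.append(n)          # stands in for the body of the for loop

print(values)                 # [0, 1, 1, 2, 3, 5, 8, 13]
```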
+Now it’s time for the finale… + +
class LazyRules:
+ def __init__(self):
+ self.pattern_file = open('plural6-rules.txt')
+ self.cache = []
+
+ def __iter__(self):
+ self.cache_index = 0
+ return self
+
+ def __next__(self):
+ self.cache_index += 1
+ if len(self.cache) >= self.cache_index:
+ return self.cache[self.cache_index - 1]
+
+ if self.pattern_file.closed:
+ raise StopIteration
+
+ line = self.pattern_file.readline()
+ if not line:
+ self.pattern_file.close()
+ raise StopIteration
+
+ pattern, search, replace = line.split(None, 3)
+ funcs = build_match_and_apply_functions(
+ pattern, search, replace)
+ self.cache.append(funcs)
+ return funcs
+
+rules = LazyRules()
+
+So this is a class that implements __iter__() and __next__(), so it can be used as an iterator. Then, you instantiate the class and assign it to rules. This happens just once, on import.
+
+
Let’s take the class one bite at a time. + +
class LazyRules:
+ def __init__(self): ①
+ self.pattern_file = open('plural6-rules.txt') ③
+ self.cache = [] ②
+__init__() method is only going to be called once, when you instantiate the class and assign it to rules.
+ def __iter__(self): ①
+ self.cache_index = 0 ②
+ return self ③
+
+__iter__() method will be called every time someone — say, a for loop — calls iter(rules).
+__iter__() method returns self, which signals that this class will take care of returning its own values throughout an iteration.
+ def __next__(self): ①
+ .
+ .
+ .
+ pattern, search, replace = line.split(None, 3)
+ funcs = build_match_and_apply_functions( ②
+ pattern, search, replace)
+ self.cache.append(funcs) ③
+ return funcs
+__next__() method gets called whenever someone — say, a for loop — calls next(rules). This method will only make sense if we start at the end and work backwards. So let’s do that.
+build_match_and_apply_functions() function hasn’t changed; it’s the same as it ever was. Each line of the pattern file will be read exactly once, as late as possible.
+self.cache. Each match and apply function will be built exactly once, as late as possible, then cached.
+Moving backwards… + +
def __next__(self):
+ .
+ .
+ .
+ line = self.pattern_file.readline() ①
+ if not line: ②
+ self.pattern_file.close()
+ raise StopIteration ③
+ .
+ .
+ .
+readline() method (note: singular, not the plural readlines()) reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all the way down…)
+readline() to read, line will not be an empty string. Even if the file contained a blank line, line would end up as the one-character string '\n' (a newline). If line is really an empty string, that means there are no more lines to read from the file.
+StopIteration exception. Remember, we got to this point because we needed a match and apply function for the next rule. The next rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return. The iteration is over. (♫ The party’s over… ♫)
+Moving backwards all the way to the start of the __next__() method…
+
+
def __next__(self):
+ self.cache_index += 1
+ if len(self.cache) >= self.cache_index:
+ return self.cache[self.cache_index - 1] ①
+
+ if self.pattern_file.closed:
+ raise StopIteration ②
+ .
+ .
+ .
+self.cache will be a list of the functions we need to match and apply individual rules. (At least that should sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t exhausted the cache yet (i.e. if the length of self.cache is at least the freshly incremented self.cache_index), then we have a cache hit! Hooray! We can return the match and apply functions from the cache instead of building them from scratch.
+Putting it all together, here’s what happens when: + +
LazyRules class, called rules, which opens the pattern file but does not read from it.
+plural() function again to pluralize a different word. The for loop in the plural() function will call iter(rules), which will reset the cache index but will not reset the open file object.
+for loop will ask for a value from rules, which will invoke its __next__() method. This time, however, the cache is primed with a single pair of match and apply functions, corresponding to the patterns in the first line of the pattern file. Since they were built and cached in the course of pluralizing the previous word, they’re retrieved from the cache. The cache index increments, and the open file is never touched.
+for loop comes around again and asks for another value from rules. This invokes the __next__() method a second time. This time, the cache is exhausted — it only contained one item, and we’re asking for a second — so the __next__() method continues. It reads another line from the open file, builds match and apply functions out of the patterns, and caches them.
+readline() command. In the meantime, the cache now has more items in it, and if we start all over again trying to pluralize a new word, each of those items in the cache will be tried before reading the next line from the pattern file.
+Thus, we have achieved our combined goal: + +
import is instantiating a single class and opening a file (but not reading from it).
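To watch the cache fill up, here is a self-contained variation you can run without a rules file. It departs from the chapter's class in one labeled way: as an assumption for this sketch, `__init__()` accepts an already-open file-like object (an `io.StringIO` here) instead of opening `plural6-rules.txt` itself. Everything else follows the listing above.

```python
import io
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

class LazyRules:
    def __init__(self, pattern_file):
        # Assumption for this sketch: the caller supplies the open file.
        self.pattern_file = pattern_file
        self.cache = []

    def __iter__(self):
        self.cache_index = 0
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]   # cache hit

        if self.pattern_file.closed:
            raise StopIteration

        line = self.pattern_file.readline()
        if not line:
            self.pattern_file.close()
            raise StopIteration

        pattern, search, replace = line.split(None, 3)
        funcs = build_match_and_apply_functions(pattern, search, replace)
        self.cache.append(funcs)
        return funcs

rules = LazyRules(io.StringIO(
    '[sxz]$ $ es\n'
    '[^aeioudgkprt]h$ $ es\n'
    '[^aeiou]y$ y$ ies\n'
    '$ $ s\n'))

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

print(plural('box'), len(rules.cache))       # boxes 1
print(plural('coach'), len(rules.cache))     # coaches 2
print(plural('fax'), len(rules.cache))       # faxes 2
print(plural('vacancy'), len(rules.cache))   # vacancies 3
```

The cache only grows when a word needs a rule that has not been read yet; `'fax'` is answered entirely from the cache.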
+© 2001–9 ℳark Pilgrim + + diff --git a/native-datatypes.html b/native-datatypes.html index f610cb8..8781014 100644 --- a/native-datatypes.html +++ b/native-datatypes.html @@ -9,9 +9,9 @@ body{counter-reset:h1 2}
You are here: Home ‣ Dive Into Python 3 ‣ -
-❝ Wonder is the foundation of all philosophy, research its progress, ignorance its end. ❞
— Michel de Montaigne +❝ Wonder is the foundation of all philosophy, inquiry its progress, ignorance its end. ❞
— Michel de Montaigne
You are here: Home ‣ Dive Into Python 3 ‣ -
2to32to3@@ -495,6 +495,7 @@ for an_iterator in a_sequence_of_iterators: reduce(a, b, c)❝ Life is pleasant. Death is peaceful. It’s the transition that’s troublesome. ❞
— Isaac Asimov (attributed)
+☞The version of
2to3that shipped with Python 3.0 would not fix thereduce()function automatically. The fix first appeared in the2to3script that shipped with Python 3.1.
apply() global functionYou are here: Home ‣ Dive Into Python 3 ‣ -
-❝ Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. ❞
— Jamie Zawinski +❝ Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. ❞
— Jamie Zawinski
❝ I’m telling you this ’cause you’re one of my friends.
-My alphabet starts where your alphabet ends! ❞
— Dr. Seuss, On Beyond Zebra! +My alphabet starts where your alphabet ends! ❞
— Dr. Seuss, On Beyond Zebra!
Did you know that the people of Bougainville have the smallest alphabet in the world? Their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and lowercase separately — plus a handful of !@#$%& punctuation marks. -
When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. +
When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.
In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish. @@ -101,7 +101,7 @@ La Peña
Let's take another look at humansize.py:
-
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'], ①
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
@@ -149,6 +149,8 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
- There's a lot going on here. First, that's a method call on a string literal. Strings are objects, and objects have methods. Second, the whole expression evaluates to a string. Third,
{0} and {1} are replacement fields, which are replaced by the arguments passed to the format() method.
+Compound field names
+
The previous example shows the simplest case, where the replacement fields are simply integers. Integer replacement fields are treated as positional indices into the argument list of the format() method. That means that {0} is replaced by the first argument (username in this case), {1} is replaced by the second argument (password), &c. You can have as many positional indices as you have arguments, and you can have as many arguments as you want. But replacement fields are much more powerful than that.
@@ -160,11 +162,11 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
'1000KB = 1MB'
-- Rather than calling any function in the
humansize module, you'll just grab one of the data structures it defines: the list of "SI" (powers-of-1000) suffixes.
+ - Rather than calling any function in the
humansize module, you're just grabbing one of the data structures it defines: the list of "SI" (powers-of-1000) suffixes.
- This looks complicated, but it's not.
{0} would refer to the first argument passed to the format() method, si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same list: 'MB'. Everything outside the curly braces, including 1000, the equals sign, and the spaces, is untouched. The final result is the string '1000KB = 1MB'.
-What this example shows is that format specifers can access items and properties of data structures using (almost) Python syntax. The following things "just work":
+
What this example shows is that format specifiers can access items and properties of data structures using (almost) Python syntax. These are called compound field names. The following compound field names "just work":
- Passing a list, and accessing an item of the list by index (as in the previous example)
@@ -193,6 +195,8 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
sys.modules["humansize"].SUFFIXES[1000][0] is the first item of the list of SI suffixes: 'KB'. Therefore, the complete replacement field {0.modules[humansize].SUFFIXES[1000][0]} is replaced by the two-character string KB.
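A few of these compound field names can be tried on small, self-contained values. In this sketch, `si_suffixes` and `limits` are sample data invented for the demo, not taken from `humansize.py`.

```python
import sys

si_suffixes = ['KB', 'MB', 'GB']

# Index into a list argument: {0[0]} is its first item, {0[1]} its second.
print('1000{0[0]} = 1{0[1]}'.format(si_suffixes))   # 1000KB = 1MB

# Dictionary keys are written without quotes inside the field name.
limits = {'unit': 'GB', 'size': 698}
print('{0[size]} {0[unit]}'.format(limits))          # 698 GB

# Attribute access on an argument, here a module object.
print('{0.__name__}'.format(sys))                    # sys
```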
+Format specifiers
+
But wait! There's more! Let's take another look at that strange line of code from humansize.py:
if size < multiple:
@@ -210,58 +214,54 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
>>> "{0:.1f} {1}".format(698.25, 'GB')
'698.2 GB'
-For all the gory details on presentation types, check the Format Specification Mini-Language in the official Python documentation.
+
For all the gory details on format specifiers, consult the Format Specification Mini-Language in the official Python documentation.
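A handful of common format specifiers, as a quick sketch. The values are arbitrary examples chosen so the rounding is unambiguous.

```python
# One decimal place, fixed-point.
print('{0:.1f} {1}'.format(698.24, 'GB'))   # 698.2 GB

# Right-align in a field 8 characters wide, two decimal places.
print('{0:>8.2f}|'.format(3.14159))         # '    3.14|'

# Thousands separator.
print('{0:,}'.format(1234567))              # 1,234,567

# Zero-padded to a total width of 8, three decimal places.
print('{0:08.3f}'.format(2.5))              # 0002.500
```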
-
-Note that (k, v) is a tuple. I told you they were good for something.
+
Other common string methods
-You might be thinking that this is a lot of work just to do simple string concatentation, and you would be right, except that
-string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
+
Besides formatting, strings can do a number of other useful tricks.
->>> uid = "sa"
->>> pwd = "secret"
->>> print pwd + " is not a good password for " + uid ①
-secret is not a good password for sa
->>> print "%s is not a good password for %s" % (pwd, uid) ②
-secret is not a good password for sa
->>> userCount = 6
->>> print "Users connected: %d" % (userCount, ) ③ ④
-Users connected: 6
->>> print "Users connected: " + userCount ⑤
-Traceback (innermost last):
- File "<interactive input>", line 1, in ?
-TypeError: cannot concatenate 'str' and 'int' objects
+>>> s = """Finished files are the re- ①
+... sult of years of scientif-
+... ic study combined with the
+... experience of years."""
+>>> s.splitlines() ②
+['Finished files are the re-',
+ 'sult of years of scientif-',
+ 'ic study combined with the',
+ 'experience of years.']
+>>> print(s.lower()) ③
+finished files are the re-
+sult of years of scientif-
+ic study combined with the
+experience of years.
+>>> s.lower().count("f") ④
+6
-+ is the string concatenation operator.
-- In this trivial case, string formatting accomplishes the same result as concatentation.
-
(userCount, ) is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether (userCount) was a tuple with one element or just the value of userCount.
-- String formatting works with integers by specifying
%d instead of %s.
- - Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works only when everything is already a string.
+
- You can input multi-line strings in the Python interactive shell. Once you start a multi-line string with triple quotation marks, just hit ENTER and the interactive shell will prompt you to continue the string. Typing the closing triple quotation marks ends the string, and the next ENTER will execute the command (in this case, assigning the string to s).
+
- The
splitlines() method takes one multi-line string and returns a list of strings, one for each line of the original. Note that the carriage returns at the end of each line are not included.
+ - The
lower() method converts the entire string to lowercase. (Similarly, the upper() method converts a string to uppercase.)
+ - The
count() method counts the number of occurrences of a substring. Yes, there really are six “f”s in that sentence!
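Two details in the notes above are easy to miss, so here is a minimal sketch of them. It assumes the same sample string; the variable names (`lines_with_endings`, `f_count` and friends) are made up for illustration. First, `splitlines()` can keep the line endings if you ask it to. Second, `count()` is case-sensitive, which is why the example lowercases the string before counting.

```python
s = """Finished files are the re-
sult of years of scientif-
ic study combined with the
experience of years."""

# By default, splitlines() drops the line endings.
lines = s.splitlines()

# Passing keepends=True keeps them on every line that had one.
lines_with_endings = s.splitlines(keepends=True)

# upper() is the mirror image of lower().
shouted = s.upper()

# count() is case-sensitive: the capital "F" in "Finished" only
# matches after the string has been lowercased.
f_count_case_sensitive = s.count("f")
f_count = s.lower().count("f")

print(lines[0])
print(repr(lines_with_endings[0]))
print(f_count_case_sensitive, f_count)
```

Both calls return new objects; the original string `s` is never modified.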
-As with printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
+
+
+
+
+
+join works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception.
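A minimal sketch of what that looks like in practice. The list `mixed` is invented for illustration; the usual workaround is to convert each element with `str()` yourself before joining.

```python
# A list with both strings and integers (hypothetical sample data).
mixed = ["server", 8080, "retries", 3]

# join() does no type coercion, so the integers raise a TypeError.
try:
    ";".join(mixed)
    join_failed = False
except TypeError:
    join_failed = True

# Converting every element explicitly makes the join work.
joined = ";".join(str(item) for item in mixed)

print(join_failed)
print(joined)
```

The explicit `str()` call is deliberate: you decide how each value becomes text, instead of the string method guessing for you.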
+
+
+
>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
@@ -302,10 +308,15 @@ is an object. You might have thought I meant that string variables are
split takes an optional second argument, which is the number of times to split. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)
-
-
-Common string operations
+
+
+anystring.split(delimiter, 1) is a useful technique when you want to search a string for a substring and then work with everything before the substring (which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
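Here is a small sketch of that technique. The sample strings are made up; the point is that a second argument of 1 stops after the first delimiter, so any later delimiters stay inside the second element.

```python
# The value itself contains the delimiter (hypothetical sample data).
setting = "pwd=secret=with=equals"

# Split only once: everything after the first "=" stays together.
key, value = setting.split("=", 1)

# Without the limit, the value is chopped into pieces too.
all_pieces = setting.split("=")

# If the delimiter never appears, you get a one-element list back,
# so check the length before unpacking into two names.
no_delimiter = "just-a-flag".split("=", 1)

print(key, value)
print(all_pieces)
print(no_delimiter)
```

If you always want exactly three pieces back, whether or not the delimiter is found, the built-in `str.partition()` method is an alternative worth knowing about.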
+
+
+
+ string modulechardet: a mini-FAQ
windows-1252
2to3
- 2to3 can't
+ 2to3 can’t
You are here: Home ‣ Dive Into Python 3 ‣ -
-❝ Certitude is not the test of certainty. We have been cocksure of many things that were not so. ❞
— Oliver Wendell Holmes, Jr. +❝ Certitude is not the test of certainty. We have been cocksure of many things that were not so. ❞
— Oliver Wendell Holmes, Jr.
to_roman() function should return the Roman numeral representation for all integers 1 to 3999.
It is not immediately obvious how this code does… well, anything. It defines a class which has no __init__() method. The class does have another method, but it is never called. The entire script has a __main__ block, but it doesn't reference the class or its method. But it does do something, I promise.
-
import roman1
import unittest
@@ -159,7 +159,7 @@ Traceback (most recent call last):
- Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass,
unittest distinguishes between failures and errors. A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort of exception raised in the code you're testing or the unit test case itself.
Now, finally, you can write the to_roman() function.
-
roman_numeral_map = (('M', 1000),
('CM', 900),
('D', 500),
@@ -233,7 +233,7 @@ OK
The to_roman() function should raise an OutOfRangeError when given an integer greater than 3999.
What would that test look like?
-
class ToRomanBadInput(unittest.TestCase): ①
def test_too_large(self): ②
@@ -298,7 +298,7 @@ FAILED (failures=1)
- Of course, the
to_roman() function isn't raising the OutOfRangeError exception you just defined, because you haven't told it to do that yet. That's excellent news! It means this is a valid test case — it fails before you write the code to make it pass.
Now you can write the code to make this test pass.
-
def to_roman(n):
"""convert integer to Roman numeral"""
if n > 3999:
diff --git a/your-first-python-program.html b/your-first-python-program.html
index 3b0c8e4..7a4f443 100644
--- a/your-first-python-program.html
+++ b/your-first-python-program.html
@@ -10,14 +10,14 @@ th{font-family:inherit !important}
You are here: Home ‣ Dive Into Python 3 ‣
-
Your first Python program
+Your First Python Program
-❝ Don’t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. ❞
— Ven. Henepola Gunararatana
+
❝ Don’t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. ❞
— Ven. Henepola Gunaratana
Diving in
Books about programming usually start with a bunch of boring chapters about fundamentals and eventually work up to building something useful. Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
-
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}