diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index ce6fb63..20ff49f 100644 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -1,106 +1,105 @@ - +
- +chardet to Python 3-❝ Words, words. They’re all we have to go on. ❞
— Rosencrantz and Guildenstern are Dead ++❝ Words, words. They’re all we have to go on. ❞
— Rosencrantz and Guildenstern are Dead-
-- Introducing
chardet+- Introducing
chardet-
-- What is character encoding auto-detection? -
- Isn’t that impossible? -
- Who wrote this detection algorithm? -
- Yippie! Screw the standards, I’ll just auto-detect everything! -
- Why bother with auto-detection if it’s slow, inaccurate, and non-standard? +
- What is character encoding auto-detection? +
- Isn’t that impossible? +
- Who wrote this detection algorithm? +
- Yippie! Screw the standards, I’ll just auto-detect everything! +
- Why bother with auto-detection if it’s slow, inaccurate, and non-standard?
- Diving in +
- Diving in -
- Running
2to3-- Fixing what
2to3can’t +- Running
2to3+- Fixing what
2to3can’t-
Falseis invalid syntax -- No module named
constants-- Name 'file' is not defined -
- Can’t use a string pattern on a bytes-like object -
- Can’t convert '
bytes' object tostrimplicitly +Falseis invalid syntax +- No module named
constants+- Name 'file' is not defined +
- Can’t use a string pattern on a bytes-like object +
- Can’t convert '
bytes' object tostrimplicitlyIntroducing
-chardet: a mini-FAQWhen you think of “text,” you probably think of “characters and symbols I see on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. -
In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever). -
What is character encoding auto-detection?
-It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key. -
Isn’t that impossible?
-In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language. +
Introducing
+chardet: a mini-FAQWhen you think of “text,” you probably think of “characters and symbols I see on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. +
In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever). +
What is character encoding auto-detection?
+It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key. +
Isn’t that impossible?
+In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.
In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings. -
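To make the "fluency" idea concrete, here is a toy sketch of scoring text by how familiar its letter pairs look. Everything in it is an assumption made for illustration: the function names, the tiny training sentence, and the scoring rule are hypothetical, and none of it is taken from chardet or the Mozilla code, which use far larger language models.

```python
from collections import Counter


def bigrams(text):
    """Yield adjacent letter pairs within each word, lowercased."""
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for a, b in zip(letters, letters[1:]):
            yield a + b


def train(sample):
    """Build a (hypothetical) language model: a tally of letter pairs."""
    return Counter(bigrams(sample))


def familiarity(model, text):
    """Return the fraction of the text's letter pairs that the model has seen."""
    pairs = list(bigrams(text))
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in model) / len(pairs)


# A toy training sample; a real model would be built from a large corpus.
SAMPLE = "the quick brown fox jumps over the lazy dog while the other dogs watch"
model = train(SAMPLE)

english = familiarity(model, "the dog watches the fox")
gibberish = familiarity(model, "txzqJv 2!dasd0a QqdKjvz")
print(english > gibberish)
```

The English phrase scores much higher than the gibberish, even though both are made of English letters. That gap is the kind of signal a detector can turn into a confidence rating.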
Who wrote this detection algorithm?
-This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors’ comments, which are quite extensive and informative. -
You may also be interested in the research paper which led to the Mozilla implementation, A composite approach to language/encoding detection. -
Yippie! Screw the standards, I’ll just auto-detect everything!
-Don’t do that. Virtually every format and protocol contains a method for specifying character encoding. +
Who wrote this detection algorithm?
+This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors’ comments, which are quite extensive and informative. +
You may also be interested in the research paper which led to the Mozilla implementation, A composite approach to language/encoding detection. +
Yippie! Screw the standards, I’ll just auto-detect everything!
+Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
-
- HTTP can define a
charsetparameter in theContent-typeheader.- HTML documents can define a
<meta http-equiv="content-type">element in the<head>of a web page.- XML documents can define an
encodingattribute in the XML prolog.If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.) -
Despite the complexity, it’s worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards. -
Why bother with auto-detection if it’s slow, inaccurate, and non-standard?
-Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all. -
If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options. -
Diving in
+If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.) +
Despite the complexity, it’s worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards. +
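Respecting explicit encoding information is usually only a few lines of code. Here is a minimal sketch for the HTTP case, assuming you already have the value of the Content-type header as a string. The helper name charset_from_content_type is hypothetical; the parsing is done by the standard library's email.message.Message class.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header value.

    The result is lowercased; `default` is returned when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/plain"))
```

This covers only one of the three sources listed above. A real program would also have to look inside the document and decide which declaration wins when they disagree.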
Why bother with auto-detection if it’s slow, inaccurate, and non-standard?
+Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all. +
If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options. +
Diving in
This is a brief guide to navigating the code itself. -
The main entry point for the detection algorithm is
universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.) +The main entry point for the detection algorithm is
universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)
There are 5 categories of encodings that
UniversalDetectorhandles:-
-UTF-nwith a BOM. This includesUTF-8, both BE and LE variants ofUTF-16, and all 4 byte-order variants ofUTF-32. -- Escaped encodings, which are entirely 7-bit ASCII compatible, where non-ASCII characters start with an escape sequence. Examples:
ISO-2022-JP(Japanese) andHZ-GB-2312(Chinese). -- Multi-byte encodings, where each character is represented by a variable number of bytes. Examples:
Big5(Chinese),SHIFT_JIS(Japanese),EUC-KR(Korean), andUTF-8without a BOM. -- Single-byte encodings, where each character is represented by one byte. Examples:
KOI8-R(Russian),windows-1255(Hebrew), andTIS-620(Thai). +UTF-nwith a BOM. This includesUTF-8, both BE and LE variants ofUTF-16, and all 4 byte-order variants ofUTF-32. +- Escaped encodings, which are entirely 7-bit ASCII compatible, where non-ASCII characters start with an escape sequence. Examples:
ISO-2022-JP(Japanese) andHZ-GB-2312(Chinese). +- Multi-byte encodings, where each character is represented by a variable number of bytes. Examples:
Big5(Chinese),SHIFT_JIS(Japanese),EUC-KR(Korean), andUTF-8without a BOM. +- Single-byte encodings, where each character is represented by one byte. Examples:
KOI8-R(Russian),windows-1255(Hebrew), andTIS-620(Thai).windows-1252, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.-
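The first category in the list above is the easiest to check. Here is a minimal sketch of BOM sniffing, using the BOM constants from the standard library's codecs module. The function name sniff_bom and the encoding labels are assumptions chosen for illustration, not what UniversalDetector actually returns, and the sketch covers only the two common UTF-32 byte orders rather than all four.

```python
import codecs

# Check longer BOMs first: the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data):
    """Return an encoding label if `data` (bytes) starts with a known BOM, else None."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"hello"))
```

The ordering of the table matters. If the UTF-16 LE entry came first, every UTF-32 LE document would be misreported.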
UTF-nwith a BOMIf the text starts with a BOM, we can reasonably assume that the text is encoded in
UTF-8,UTF-16, orUTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline inUniversalDetector, which returns the result immediately without any further processing. -Escaped encodings
-If the text contains a recognizable escape sequence that might indicate an escaped encoding,
UniversalDetectorcreates anEscCharSetProber(defined inescprober.py) and feeds it the text. -
EscCharSetProbercreates a series of state machines, based on models ofHZ-GB-2312,ISO-2022-CN,ISO-2022-JP, andISO-2022-KR(defined inescsm.py).EscCharSetProberfeeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding,EscCharSetProberimmediately returns the positive result toUniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines. -Multi-byte encodings
-Assuming no BOM,
UniversalDetectorchecks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort,windows-1252. -The multi-byte encoding prober,
MBCSGroupProber(defined inmbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding:Big5,GB2312,EUC-TW,EUC-KR,EUC-JP,SHIFT_JIS, andUTF-8.MBCSGroupProberfeeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls toUniversalDetector.feed()will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding,MBCSGroupProberreports this positive result toUniversalDetector, which reports the result to the caller. -Most of the multi-byte encoding probers are inherited from
MultiByteCharSetProber(defined inmbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and letMultiByteCharSetProberdo the rest of the work.MultiByteCharSetProberruns the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time,MultiByteCharSetProberfeeds the text to an encoding-specific distribution analyzer. -The distribution analyzers (each defined in
chardistribution.py) use language-specific models of which characters are used most frequently. OnceMultiByteCharSetProberhas fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough,MultiByteCharSetProberreturns the result toMBCSGroupProber, which returns it toUniversalDetector, which returns it to the caller. -The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between
EUC-JPandSHIFT_JIS, so theSJISProber(defined insjisprober.py) also uses 2-character distribution analysis.SJISContextAnalysisandEUCJPContextAnalysis(both defined injpcntx.pyand both inheriting from a commonJapaneseContextAnalysisclass) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level toSJISProber, which checks both analyzers and returns the higher confidence level toMBCSGroupProber. -Single-byte encodings
-The single-byte encoding prober,
SBCSGroupProber(defined insbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language:windows-1251,KOI8-R,ISO-8859-5,MacCyrillic,IBM855, andIBM866(Russian);ISO-8859-7andwindows-1253(Greek);ISO-8859-5andwindows-1251(Bulgarian);ISO-8859-2andwindows-1250(Hungarian);TIS-620(Thai);windows-1255andISO-8859-8(Hebrew). -
SBCSGroupProberfeeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class,SingleByteCharSetProber(defined insbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text.SingleByteCharSetProberprocesses the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio. -Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis,
HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored "backwards" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew). --
windows-1252If
UniversalDetectordetects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates aLatin1Prober(defined inlatin1prober.py) to try to detect English text in awindows-1252encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguishwindows-1252is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like.Latin1Proberautomatically reduces its confidence rating to allow more accurate probers to win if at all possible. -Running
-2to3We’re going to migrate the
chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic. -The main
chardetpackage is split across several different files, all in the same directory. The2to3script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and2to3will convert each of the files in turn. -C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\ ++
UTF-nwith a BOMIf the text starts with a BOM, we can reasonably assume that the text is encoded in
UTF-8,UTF-16, orUTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline inUniversalDetector, which returns the result immediately without any further processing. +Escaped encodings
+If the text contains a recognizable escape sequence that might indicate an escaped encoding,
UniversalDetectorcreates anEscCharSetProber(defined inescprober.py) and feeds it the text. +
EscCharSetProbercreates a series of state machines, based on models ofHZ-GB-2312,ISO-2022-CN,ISO-2022-JP, andISO-2022-KR(defined inescsm.py).EscCharSetProberfeeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding,EscCharSetProberimmediately returns the positive result toUniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines. +Multi-byte encodings
+Assuming no BOM,
UniversalDetectorchecks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort,windows-1252. +The multi-byte encoding prober,
MBCSGroupProber(defined inmbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding:Big5,GB2312,EUC-TW,EUC-KR,EUC-JP,SHIFT_JIS, andUTF-8.MBCSGroupProberfeeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls toUniversalDetector.feed()will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding,MBCSGroupProberreports this positive result toUniversalDetector, which reports the result to the caller. +Most of the multi-byte encoding probers are inherited from
MultiByteCharSetProber(defined inmbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and letMultiByteCharSetProberdo the rest of the work.MultiByteCharSetProberruns the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time,MultiByteCharSetProberfeeds the text to an encoding-specific distribution analyzer. +The distribution analyzers (each defined in
chardistribution.py) use language-specific models of which characters are used most frequently. OnceMultiByteCharSetProberhas fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough,MultiByteCharSetProberreturns the result toMBCSGroupProber, which returns it toUniversalDetector, which returns it to the caller. +The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between
EUC-JPandSHIFT_JIS, so theSJISProber(defined insjisprober.py) also uses 2-character distribution analysis.SJISContextAnalysisandEUCJPContextAnalysis(both defined injpcntx.pyand both inheriting from a commonJapaneseContextAnalysisclass) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level toSJISProber, which checks both analyzers and returns the higher confidence level toMBCSGroupProber. +Single-byte encodings
+The single-byte encoding prober,
SBCSGroupProber(defined insbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language:windows-1251,KOI8-R,ISO-8859-5,MacCyrillic,IBM855, andIBM866(Russian);ISO-8859-7andwindows-1253(Greek);ISO-8859-5andwindows-1251(Bulgarian);ISO-8859-2andwindows-1250(Hungarian);TIS-620(Thai);windows-1255andISO-8859-8(Hebrew). +
SBCSGroupProberfeeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class,SingleByteCharSetProber(defined insbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text.SingleByteCharSetProberprocesses the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio. +Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis,
HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored "backwards" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew). ++
windows-1252If
UniversalDetectordetects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates aLatin1Prober(defined inlatin1prober.py) to try to detect English text in awindows-1252encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguishwindows-1252is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like.Latin1Proberautomatically reduces its confidence rating to allow more accurate probers to win if at all possible. +Running
+2to3We’re going to migrate the
chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic. +The main
chardetpackage is split across several different files, all in the same directory. The2to3script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and2to3will convert each of the files in turn. +C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\ RefactoringTool: Skipping implicit fixer: buffer RefactoringTool: Skipping implicit fixer: idioms RefactoringTool: Skipping implicit fixer: set_literal @@ -566,9 +565,9 @@ RefactoringTool: chardet\sbcsgroupprober.py RefactoringTool: chardet\sjisprober.py RefactoringTool: chardet\universaldetector.py RefactoringTool: chardet\utf8prober.py-Now run the
2to3script on the testing harness,test.py. -C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py +Now run the
2to3script on the testing harness,test.py. +C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py RefactoringTool: Skipping implicit fixer: buffer RefactoringTool: Skipping implicit fixer: idioms RefactoringTool: Skipping implicit fixer: set_literal @@ -598,21 +597,21 @@ RefactoringTool: Skipping implicit fixer: ws_comma +print(count, 'tests') RefactoringTool: Files that were modified: RefactoringTool: test.py-Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work? -
Fixing what
-2to3can’t-
Falseis invalid syntaxNow for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere. -
C:\home\chardet> python test.py tests\*\* -Traceback (most recent call last): +Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work? +
Fixing what
+2to3can’t+
Falseis invalid syntaxNow for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere. +
C:\home\chardet> python test.py tests\*\* +Traceback (most recent call last): File "test.py", line 1, in <module> from chardet.universaldetector import UniversalDetector File "C:\home\chardet\chardet\universaldetector.py", line 51 self.done = constants.False ^ SyntaxError: invalid syntax-Hmm, a small snag. In Python 3,
Falseis a reserved word, so you can’t use it as a variable name. Let’s look atconstants.pyto see where it’s defined. Here’s the original version fromconstants.py, before the2to3script changed it: -Hmm, a small snag. In Python 3,
Falseis a reserved word, so you can’t use it as a variable name. Let’s look atconstants.pyto see where it’s defined. Here’s the original version fromconstants.py, before the2to3script changed it: +-import __builtin__ if not hasattr(__builtin__, 'False'): False = 0 @@ -620,84 +619,84 @@ if not hasattr(__builtin__, 'False'): else: False = __builtin__.False True = __builtin__.TrueThis piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in
Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary. -However, Python 3 will always have a
Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py. -So this line in
universaldetector.py: +This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in
Boolean type. This code detects the absence of the built-in constants True and False, and defines them if necessary. +However, Python 3 will always have a
Boolean type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of constants.True and constants.False with True and False, respectively, then delete this dead code from constants.py. +So this line in
universaldetector.py:self.done = constants.FalseBecomes
self.done = FalseAh, wasn’t that satisfying? The code is shorter and more readable already. -
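If you want to see for yourself why the old spelling can no longer even be parsed, here is a small sketch that asks Python 3 directly. The helper name compiles is hypothetical; keyword.iskeyword and the built-in compile function are standard. Nothing is executed, so it does not matter that constants is not importable here.

```python
import keyword


def compiles(source):
    """Return True if `source` is syntactically valid Python 3."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(keyword.iskeyword("False"))          # False is a keyword in Python 3
print(compiles("done = constants.False"))  # the old attribute lookup
print(compiles("False = 0"))               # the old fallback definition
print(compiles("done = False"))            # the fixed line
```

Both the attribute lookup and the fallback assignment are rejected by the parser, which is why the error shows up as a SyntaxError at import time rather than as a failing test.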
No module named
-constantsTime to run
test.pyagain and see how far it gets. -C:\home\chardet> python test.py tests\*\* -Traceback (most recent call last): +No module named
+constantsTime to run
test.pyagain and see how far it gets. +C:\home\chardet> python test.py tests\*\* +Traceback (most recent call last): File "test.py", line 1, in <module> from chardet.universaldetector import UniversalDetector File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module> import constants, sys ImportError: No module named constants-What’s that you say? No module named
constants? Of course there’s a module namedconstants. ... Oh wait, no there isn’t. Remember when the2to3script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead: +What’s that you say? No module named
constants? Of course there’s a module namedconstants. ... Oh wait, no there isn’t. Remember when the2to3script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:-from . import constantsBut wait. Wasn’t the
2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two. -The solution is to split the import statement manually. So this two-in-one import: +
But wait. Wasn’t the
2to3script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of theconstantsmodule within the library, and an absolute import of thesysmodule that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the2to3script is not smart enough to split the import statement into two. +The solution is to split the import statement manually. So this two-in-one import:
import constants, sysNeeds to become two separate imports:
-from . import constants import sysThere are variations of this problem scattered throughout the
chardet library. In some places it’s "import constants, sys"; in other places, it’s "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import. +There are variations of this problem scattered throughout the
chardet library. In some places it’s "import constants, sys"; in other places, it’s "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import. Onward! -
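To see the split import work outside of chardet, here is a minimal sketch. It builds a throwaway package in a temporary directory; the names demo_pkg, detector, and debug_level are hypothetical and exist only for this illustration, and the contents of constants.py are made up.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk so the relative import has a package
# to be relative to. "demo_pkg" is a made-up name, not part of chardet.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "constants.py").write_text("_debug = 0\n")
(pkg / "detector.py").write_text(
    "from . import constants  # relative import: a sibling module\n"
    "import sys               # absolute import: the standard library\n"
    "debug_level = constants._debug\n"
)

# Make the temporary directory importable, then import the module that
# uses the two separate import statements.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
detector = importlib.import_module("demo_pkg.detector")
print(detector.debug_level)
```

The relative import only resolves because detector.py is loaded as part of a package; running that file directly as a script would fail, since there would be no package for the dot to refer to.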
Name 'file' is not defined
+Name 'file' is not defined
FIXME intro -
C:\home\chardet> python test.py tests\*\* +C:\home\chardet> python test.py tests\*\* tests\ascii\howto.diveintomark.org.xml -Traceback (most recent call last): +Traceback (most recent call last): File "test.py", line 9, in <module> for line in file(f, 'rb'): NameError: name 'file' is not defined-This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the
io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.) +This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for open(), which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the
io module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global file() function no longer exists. However, the open() function does still exist. (Technically, it’s an alias for io.open(), but never mind that right now.) Thus, the simplest solution to the problem of the missing file() is to call open() instead:
for line in open(f, 'rb'):And that’s all I have to say about that. -
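If you want to convince yourself, here is a small sketch that checks for the missing file() builtin and then reads a file line by line with open(). The temporary file and its contents are invented for the example; the with block is my own addition to make sure the file gets closed, not something test.py does.

```python
import builtins
import os
import tempfile

# Create a small scratch file to read back (contents are made up).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\nsecond line\n")

# In Python 3 there is no global file() any more, but open() is still there.
has_file = hasattr(builtins, "file")
has_open = hasattr(builtins, "open")

# Same idiom as test.py: iterate over the file object, in binary mode.
with open(path, "rb") as f:
    lines = [line for line in f]

os.remove(path)
print(has_file, has_open, lines)
```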
Can’t use a string pattern on a bytes-like object
+Can’t use a string pattern on a bytes-like object
FIXME intro -
C:\home\chardet> python test.py tests\*\* +C:\home\chardet> python test.py tests\*\* tests\ascii\howto.diveintomark.org.xml -Traceback (most recent call last): +Traceback (most recent call last): File "test.py", line 10, in <module> u.feed(line) File "C:\home\chardet\chardet\universaldetector.py", line 98, in feed if self._highBitDetector.search(aBuf): TypeError: can't use a string pattern on a bytes-like object-Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.” -
First, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the UniversalDetector class: -
Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.” +
First, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the UniversalDetector class: +
-class UniversalDetector: def __init__(self): self._highBitDetector = re.compile(r'[\x80-\xFF]')This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255. +
This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.
And therein lies the problem. -
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (
u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py: -In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (
u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py: +-def feed(self, aBuf): . . . if self._mInputState == ePureAscii: if self._highBitDetector.search(aBuf):And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness,
test.py. -And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness,
test.py. +-u = UniversalDetector() . . . for line in open(f, 'rb'): u.feed(line)And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file:
'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don’t have characters; we have bytes. Oops. +And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file:
'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit... characters. But we don’t have characters; we have bytes. Oops. What we need this regular expression to search is not an array of characters, but an array of bytes. -
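The mismatch is easy to reproduce in isolation. This sketch uses the same string pattern as self._highBitDetector, but the sample values (text and data) are invented for the illustration; they are not taken from the chardet test suite.

```python
import re

# The same pattern as in UniversalDetector.__init__, defined with a string.
high_bit = re.compile(r'[\x80-\xFF]')

text = 'caf\xe9'    # a string: four Unicode characters
data = b'caf\xe9'   # a byte array: four raw bytes, like a line read with 'rb'

# Searching a string with a string pattern works fine.
found_in_text = high_bit.search(text) is not None

# Searching bytes with a string pattern raises the TypeError from the traceback.
try:
    high_bit.search(data)
    error = None
except TypeError as e:
    error = str(e)

print(found_in_text)
print(error)
```

The exact wording of the error message varies a little between Python 3 releases, which is why the sketch prints it instead of comparing it against a fixed string.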
Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this: +
Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this:
self._highBitDetector = re.compile(r'[\x80-\xFF]')We now have this:
@@ -705,20 +704,18 @@ for line in open(f, 'rb'):self._highBitDetector = re.compile(b'[\x80-\xFF]')self._escDetector = re.compile(r'(\033|~{)')Again, this is going to be used to search a byte array (the same aBuf variable, in fact), so the regular expression pattern needs to be defined as a byte array:
-self._escDetector = re.compile(b'(\033|~{)')Can't convert '
+bytes' object tostrimplicitlyCan't convert '
bytes' object tostrimplicitlyCuriouser and curiouser... -
C:\home\chardet> python test.py tests\*\* +C:\home\chardet> python test.py tests\*\* tests\ascii\howto.diveintomark.org.xml -Traceback (most recent call last): +Traceback (most recent call last): File "test.py", line 10, in <module> u.feed(line) File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf): TypeError: Can't convert 'bytes' object to str implicitly-... -
© 2001-4, 2009 ℳark Pilgrim, CC-BY-SA-3.0 - - - - +
... +
© 2001–4, 2009 ℳark Pilgrim, CC-BY-SA-3.0 + + diff --git a/dip2 b/dip2 index b7460d2..0f4549b 100644 --- a/dip2 +++ b/dip2 @@ -4,11 +4,10 @@
Dive Into Python -Dive Into Python
20 May 2004
Copyright © 2000, 2001, 2002, 2003, 2004 Mark Pilgrim -
This book lives at http://diveintopython3.org/. If you're reading it somewhere else, you may not have the latest version. +
This book lives at http://diveintopython3.org/. If you're reading it somewhere else, you may not have the latest version.
Table of Contents
@@ -290,21 +289,21 @@
Chapter 1. Installing Python
-Welcome to Python. Let's dive in. In this chapter, you'll install the version of Python that's right for you. +
Welcome to Python. Let's dive in. In this chapter, you'll install the version of Python that's right for you.
1.1. Which Python is right for you?
-The first thing you need to do with Python is install it. Or do you? -
If you're using an account on a hosted server, your ISP may have already installed Python. Most popular Linux distributions come with Python in the default installation. Mac OS X 10.2 and later includes a command-line version of Python, although you'll probably want to install a version that includes a more Mac-like graphical interface. +
The first thing you need to do with Python is install it. Or do you? +
If you're using an account on a hosted server, your ISP may have already installed Python. Most popular Linux distributions come with Python in the default installation. Mac OS X 10.2 and later includes a command-line version of Python, although you'll probably want to install a version that includes a more Mac-like graphical interface.
Windows does not come with any version of Python, but don't despair! There are several ways to point-and-click your way to Python on Windows. -
As you can see already, Python runs on a great many operating systems. The full list includes Windows, Mac OS, Mac OS X, and all varieties of free UNIX-compatible systems like Linux. There are also versions that run on Sun Solaris, AS/400, Amiga, OS/2, BeOS, and a plethora +
As you can see already, Python runs on a great many operating systems. The full list includes Windows, Mac OS, Mac OS X, and all varieties of free UNIX-compatible systems like Linux. There are also versions that run on Sun Solaris, AS/400, Amiga, OS/2, BeOS, and a plethora of other platforms you've probably never even heard of. -
What's more, Python programs written on one platform can, with a little care, run on any supported platform. For instance, I regularly develop Python programs on Windows and later deploy them on Linux. +
What's more, Python programs written on one platform can, with a little care, run on any supported platform. For instance, I regularly develop Python programs on Windows and later deploy them on Linux.
So back to the question that started this section, “Which Python is right for you?” The answer is whichever one runs on the computer you already have.
1.2. Python on Windows
On Windows, you have a couple of choices for installing Python.
ActiveState makes a Windows installer for Python called ActivePython, which includes a complete version of Python, an IDE with a Python-aware code editor, plus some Windows extensions for Python that allow complete access to Windows-specific services, APIs, and the Windows Registry. -
ActivePython is freely downloadable, although it is not open source. It is the IDE I used to learn Python, and I recommend you try it unless you have a specific reason not to. One such reason might be that ActiveState is generally -several months behind in updating their ActivePython installer when new versions of Python are released. If you absolutely need the latest version of Python and ActivePython is still a version behind as you read this, you'll want to use the second option for installing Python on Windows. -
The second option is the “official” Python installer, distributed by the people who develop Python itself. It is freely downloadable and open source, and it is always current with the latest version of Python. +
ActivePython is freely downloadable, although it is not open source. It is the IDE I used to learn Python, and I recommend you try it unless you have a specific reason not to. One such reason might be that ActiveState is generally +several months behind in updating their ActivePython installer when new versions of Python are released. If you absolutely need the latest version of Python and ActivePython is still a version behind as you read this, you'll want to use the second option for installing Python on Windows. +
The second option is the “official” Python installer, distributed by the people who develop Python itself. It is freely downloadable and open source, and it is always current with the latest version of Python.
Procedure 1.1. Option 1: Installing ActivePython
Here is the procedure for installing ActivePython: @@ -326,7 +325,7 @@ several months behind in updating their ActivePython installer when new version absolutely can't spare the 14MB.
- After the installation is complete, close the installer and choose Start->Programs->ActiveState ActivePython 2.2->PythonWin IDE. You'll see something like the following: +
After the installation is complete, close the installer and choose Start->Programs->ActiveState ActivePython 2.2->PythonWin IDE. You'll see something like the following:
@@ -341,7 +340,7 @@ see 'Help/About PythonWin' for further copyright information.Download the latest Python Windows installer by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then downloading the
.exeinstaller.- Double-click the installer,
Python-2.xxx.yyy.exe. The name will depend on the version of Python available when you read this. +Double-click the installer,
Python-2.xxx.yyy.exe. The name will depend on the version of Python available when you read this.Step through the installer program. @@ -350,10 +349,10 @@ see 'Help/About PythonWin' for further copyright information.
If disk space is tight, you can deselect the HTMLHelp file, the utility scripts (
Tools/), and/or the test suite (Lib/test/).- If you do not have administrative rights on your machine, you can select Advanced Options, then choose Non-Admin Install. This just affects where Registry entries and Start menu shortcuts are created. +
If you do not have administrative rights on your machine, you can select Advanced Options, then choose Non-Admin Install. This just affects where Registry entries and Start menu shortcuts are created.
- After the installation is complete, close the installer and select Start->Programs->Python 2.3->IDLE (Python GUI). You'll see something like the following: +
After the installation is complete, close the installer and select Start->Programs->Python 2.3->IDLE (Python GUI). You'll see something like the following:
@@ -370,8 +369,8 @@ Type "copyright", "credits" or "license()" for more information. IDLE 1.0 >>>1.3. Python on Mac OS X
-On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it. -
Mac OS X 10.2 and later comes with a command-line version of Python preinstalled. If you are comfortable with the command line, you can use this version for the first third of the book. However, +
On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it. +
Mac OS X 10.2 and later comes with a command-line version of Python preinstalled. If you are comfortable with the command line, you can use this version for the first third of the book. However, the preinstalled version does not come with an XML parser, so when you get to the XML chapter, you'll need to install the full version.
Rather than using the preinstalled version, you'll probably want to install the latest version, which also comes with a graphical interactive shell. @@ -430,15 +429,15 @@ Type "help", "copyright", "credits", or "license" for more information.
Double-click
PythonIDEto launch Python. -The MacPython IDE should display a splash screen, then take you to the interactive shell. If the interactive shell does not appear, select -Window->Python Interactive (Cmd-0). The opening window will look something like this: +
The MacPython IDE should display a splash screen, then take you to the interactive shell. If the interactive shell does not appear, select +Window->Python Interactive (Cmd-0). The opening window will look something like this:
Python 2.3 (#2, Jul 30 2003, 11:45:28) [GCC 3.1 20020420 (prerelease)] Type "copyright", "credits" or "license" for more information. MacPython IDE 1.0.1 >>> -Note that once you install the latest version, the pre-installed version is still present. If you are running scripts from +
Note that once you install the latest version, the pre-installed version is still present. If you are running scripts from the command line, you need to be aware of which version of Python you are using.
Example 1.1. Two versions of Python
[localhost:~] you% python @@ -479,8 +478,8 @@ Type "help", "copyright", "credits", or "license" for more information.Double-click
Python IDEto launch Python. -The MacPython IDE should display a splash screen, and then take you to the interactive shell. If the interactive shell does not appear, select -Window->Python Interactive (Cmd-0). You'll see a screen like this: +
The MacPython IDE should display a splash screen, and then take you to the interactive shell. If the interactive shell does not appear, select +Window->Python Interactive (Cmd-0). You'll see a screen like this:
Python 2.3 (#2, Jul 30 2003, 11:45:28) [GCC 3.1 20020420 (prerelease)] @@ -488,9 +487,9 @@ Type "copyright", "credits" or "license" for more information. MacPython IDE 1.0.1 >>>1.5. Python on RedHat Linux
-Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built -binary packages are available for most popular Linux distributions. Or you can always compile from source. -
Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the
rpms/directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here: +Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built +binary packages are available for most popular Linux distributions. Or you can always compile from source. +
Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the
rpms/directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here:Example 1.2. Installing on RedHat Linux 9
localhost:~$ su - Password: [enter your root password] @@ -520,19 +519,19 @@ Type "help", "copyright", "credits", or "license" for more information.- ![]()
Whoops! Just typing python gives you the older version of Python -- the one that was installed by default. That's not the one you want. + Whoops! Just typing python gives you the older version of Python -- the one that was installed by default. That's not the one you want. - ![]()
At the time of this writing, the newest version is called python2.3. You'll probably want to change the path on the first line of the sample scripts to point to the newer version. + At the time of this writing, the newest version is called python2.3. You'll probably want to change the path on the first line of the sample scripts to point to the newer version. @@ -571,7 +570,7 @@ logout Type "help", "copyright", "credits" or "license" for more information. >>> [press Ctrl+D to exit] - ![]()
This is the complete path of the newer version of Python that you just installed. Use this on the #!line (the first line of each script) to ensure that scripts are running under the latest version of Python, and be sure to type python2.3 to get into the interactive shell. +This is the complete path of the newer version of Python that you just installed. Use this on the #!line (the first line of each script) to ensure that scripts are running under the latest version of Python, and be sure to type python2.3 to get into the interactive shell.1.7. Python Installation from Source
-If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the
.tgz file, and then do the usual configure, make, make install dance. +If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the
.tgz file, and then do the usual configure, make, make install dance. Example 1.4. Installing from source
localhost:~$ su - Password: [enter your root password] @@ -611,9 +610,9 @@ Type "help", "copyright", "credits" or "license" for more information. localhost:~$1.8. The Interactive Shell
Now that you have Python installed, what's this interactive shell thing you're running? -
It's like this: Python leads a double life. It's an interpreter for scripts that you can run from the command line or run like applications, by -double-clicking the scripts. But it's also an interactive shell that can evaluate arbitrary statements and expressions. -This is extremely useful for debugging, quick hacking, and testing. I even know some people who use the Python interactive shell in lieu of a calculator! +
It's like this: Python leads a double life. It's an interpreter for scripts that you can run from the command line or run like applications, by +double-clicking the scripts. But it's also an interactive shell that can evaluate arbitrary statements and expressions. +This is extremely useful for debugging, quick hacking, and testing. I even know some people who use the Python interactive shell in lieu of a calculator!
Launch the Python interactive shell in whatever way works on your platform, and let's dive in with the steps shown here:
Example 1.5. First Steps in the Interactive Shell
>>> 1 + 1@@ -648,7 +647,7 @@ hello world
1.9. Summary
You should now have a version of Python installed that works for you. -
Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing python on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version. +
Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing python on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
Congratulations, and welcome to Python.
Chapter 2. Your First Python Program
@@ -656,7 +655,7 @@ hello world Let's skip all that.2.1. Diving in
Here is a complete, working Python program. -
It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But +
It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
Example 2.1.
odbchelper.pyIf you have not already done so, you can download this and other examples used in this book.
@@ -678,7 +677,7 @@ if __name__ == "__main__":In the ActivePython IDE on Windows, you can run the Python program you're editing by choosing -File->Run... (Ctrl-R). Output is displayed in the interactive window. +File->Run... (Ctrl-R). Output is displayed in the interactive window. @@ -687,7 +686,7 @@ File->Run... (Ctrl-R). Output is displayed in the i
In the Python IDE on Mac OS, you can run a Python program with -Python->Run window... (Cmd-R), but there is an important option you must set first. Open the .pyfile in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked. This is a per-file setting, but you'll only need to do it once per file. +Python->Run window... (Cmd-R), but there is an important option you must set first. Open the.pyfile in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked. This is a per-file setting, but you'll only need to do it once per file.@@ -699,27 +698,27 @@ Python->Run window... (Cmd-R), but there is an impor
The output of
odbchelper.pywill look like this:server=mpilgrim;uid=sa;database=master;pwd=secret2.2. Declaring Functions
-Python has functions like most other languages, but it does not have separate header files like C++ or
interface/implementationsections like Pascal. When you need a function, just declare it, like this: +Python has functions like most other languages, but it does not have separate header files like C++ or
interface/implementationsections like Pascal. When you need a function, just declare it, like this:-def buildConnectionString(params):Note that the keyword
defstarts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments +def buildConnectionString(params):Note that the keyword
defstarts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments (not shown here) are separated with commas. -Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. +
Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. In fact, every Python function returns a value; if the function ever executes a
return statement, it will return that value, otherwise it will return None, the Python null value.
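A quick sketch makes the point about return values concrete. The three function names here are made up for the illustration and do not appear in odbchelper.py.

```python
def with_return(x):
    # Executes a return statement, so the caller gets that value back.
    return x * 2


def without_return(x):
    # Computes a value but never returns it, so the caller gets None.
    x * 2


def sometimes(x):
    # Returns a value on one path only; the other path falls off the
    # end of the function and returns None.
    if x > 0:
        return "positive"


results = [with_return(3), without_return(3), sometimes(1), sometimes(-1)]
print(results)
```

Note that none of the three declarations says anything about a return type, or about whether a value comes back at all; that is decided entirely by which statements execute.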
- In Visual Basic, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's None), and all functions start with def. +In Visual Basic, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's None), and all functions start with def. The argument,
params, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.+
-The argument,
params, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.@@ -728,17 +727,17 @@ In fact, every Python function returns a value; if the function ever executes a
In Java, C++, and other statically-typed languages, you must specify the datatype of the function return value and each function argument. - In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally. + In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
- statically typed language
-- A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare - all variables with their datatypes before using them. Java and C are statically typed languages. +
- A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare + all variables with their datatypes before using them. Java and C are statically typed languages.
- dynamically typed language
-- A language in which types are discovered at execution time; the opposite of statically typed. VBScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value. +
- A language in which types are discovered at execution time; the opposite of statically typed. VBScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
- strongly typed language
-- A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it. +
- A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
- weakly typed language
-- A language in which types may be ignored; the opposite of strongly typed. VBScript is weakly typed. In VBScript, you can concatenate the string
'12'and the integer3to get the string'123', then treat that as the integer123, all without any explicit conversion. +- A language in which types may be ignored; the opposite of strongly typed. VBScript is weakly typed. In VBScript, you can concatenate the string
'12'and the integer3to get the string'123', then treat that as the integer123, all without any explicit conversion.So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters). @@ -748,8 +747,8 @@ In fact, every Python function returns a value; if the function ever executes a def buildConnectionString(params): """Build a connection string from a dictionary of parameters. - Returns string."""
Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including - carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining + Returns string."""
Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including + carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining a
docstring.-
@@ -760,13 +759,13 @@ def buildConnectionString(params): Everything between the triple quotes is the function's
docstring, which documents what the function does. Adocstring, if it exists, must be the first thing defined in a function (that is, the first thing after the colon). You don't technically -need to give your function adocstring, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: thedocstringis available at runtime as an attribute of the function.+
Everything between the triple quotes is the function's
docstring, which documents what the function does. Adocstring, if it exists, must be the first thing defined in a function (that is, the first thing after the colon). You don't technically +need to give your function adocstring, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: thedocstringis available at runtime as an attribute of the function.@@ -777,29 +776,29 @@ need to give your function a
- Many Python IDEs use the docstringto provide context-sensitive documentation, so that when you type a function name, itsdocstringappears as a tooltip. This can be incredibly helpful, but it's only as good as thedocstrings you write. +Many Python IDEs use the docstringto provide context-sensitive documentation, so that when you type a function name, itsdocstringappears as a tooltip. This can be incredibly helpful, but it's only as good as thedocstrings you write.docstring, but you always should. I k2.4. Everything Is an Object
2.6. Testing Modules
-Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them. +
Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them. Here's an example that uses the
if__name__trick.-if __name__ == "__main__":Some quick observations before you get to the good stuff. First, parentheses are not required around the
ifexpression. Second, theifstatement ends with a colon, and is followed by indented code.+if __name__ == "__main__":
-Some quick observations before you get to the good stuff. First, parentheses are not required around the
ifexpression. Second, theifstatement ends with a colon, and is followed by indented code.-
- Like C, Python uses ==for comparison and=for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing. +Like C, Python uses ==for comparison and=for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.So why is this particular
ifstatement a trick? Modules are objects, and all modules have a built-in attribute__name__. A module's__name__depends on how you're using the module. If youimportthe module, then__name__is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone +So why is this particular
ifstatement a trick? Modules are objects, and all modules have a built-in attribute__name__. A module's__name__depends on how you're using the module. If youimportthe module, then__name__is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone program, in which case__name__will be a special default value,__main__.>>> import odbchelper >>> odbchelper.__name__-'odbchelper'Knowing this, you can design a test suite for your module within the module itself by putting it in this
ifstatement. When you run the module directly,__name__is__main__, so the test suite executes. When you import the module,__name__is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating +'odbchelper'Knowing this, you can design a test suite for your module within the module itself by putting it in this
if statement. When you run the module directly, __name__ is __main__, so the test suite executes. When you import the module, __name__ is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating them into a larger program.
- @@ -813,485 +812,9 @@ them into a larger program.On MacPython, there is an additional step to make the if__name__trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and +On MacPython, there is an additional step to make the if__name__trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and make sure Run as __main__ is checked.-
Chapter 3. Native Datatypes
3.2. Introducing Lists
-Lists are Python's workhorse datatype. If your only experience with lists is arrays in Visual Basic or (God forbid) the datastore in Powerbuilder, brace yourself for Python lists.
-
- -- - -A list in Python is like an array in Perl. In Perl, variables that store arrays always start with the -@character; in Python, variables can be named anything, and Python keeps track of the datatype internally. --
-- -- - -A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the -ArrayListclass, which can hold arbitrary objects and can expand dynamically as new items are added. -3.2.1. Defining Lists
-Example 3.6. Defining a List
>>> li = ["a", "b", "mpilgrim", "z", "example"]->>> li -['a', 'b', 'mpilgrim', 'z', 'example'] ->>> li[0]
-'a' ->>> li[4]
-'example'
-Example 3.7. Negative List Indices
>>> li -['a', 'b', 'mpilgrim', 'z', 'example'] ->>> li[-1]-'example' ->>> li[-3]
-'mpilgrim'
-Example 3.8. Slicing a List
>>> li -['a', 'b', 'mpilgrim', 'z', 'example'] ->>> li[1:3]-['b', 'mpilgrim'] ->>> li[1:-1]
-['b', 'mpilgrim', 'z'] ->>> li[0:3]
-['a', 'b', 'mpilgrim']
-Example 3.9. Slicing Shorthand
>>> li -['a', 'b', 'mpilgrim', 'z', 'example'] ->>> li[:3]-['a', 'b', 'mpilgrim'] ->>> li[3:]
-['z', 'example'] ->>> li[:]
-['a', 'b', 'mpilgrim', 'z', 'example']
--
-- -- -
If the left slice index is 0, you can leave it out, and 0 is implied. So -li[:3]is the same asli[0:3]from Example 3.8, “Slicing a List”. -- -- -
Similarly, if the right slice index is the length of the list, you can leave it out. So -li[3:]is the same asli[3:5], because this list has five elements. -- -- -
Note the symmetry here. In this five-element list, -li[:3]returns the first 3 elements, andli[3:]returns the last two elements. In fact,li[:n]will always return the firstnelements, andli[n:]will return the rest, regardless of the length of the list. -- -- -
If both slice indices are left out, all elements of the list are included. But this is not the same as the original li list; it is a new list that happens to have all the same elements. -li[:]is shorthand for making a complete copy of a list. -3.2.2. Adding Elements to Lists
-Example 3.10. Adding Elements to a List
>>> li -['a', 'b', 'mpilgrim', 'z', 'example'] ->>> li.append("new")->>> li -['a', 'b', 'mpilgrim', 'z', 'example', 'new'] ->>> li.insert(2, "new")
->>> li -['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new'] ->>> li.extend(["two", "elements"])
->>> li -['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
-Example 3.11. The Difference between
extendandappend->>> li = ['a', 'b', 'c'] ->>> li.extend(['d', 'e', 'f'])->>> li -['a', 'b', 'c', 'd', 'e', 'f'] ->>> len(li)
-6 ->>> li[-1] -'f' ->>> li = ['a', 'b', 'c'] ->>> li.append(['d', 'e', 'f'])
->>> li -['a', 'b', 'c', ['d', 'e', 'f']] ->>> len(li)
-4 ->>> li[-1] -['d', 'e', 'f'] -
-3.2.3. Searching Lists
-Example 3.12. Searching a List
>>> li -['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements'] ->>> li.index("example")-5 ->>> li.index("new")
-2 ->>> li.index("c")
-Traceback (innermost last): - File "<interactive input>", line 1, in ? -ValueError: list.index(x): x not in list ->>> "c" in li
-False
--
-- -- - -Before version 2.2.1, Python had no separate boolean datatype. To compensate for this, Python accepted almost anything in a boolean context (like an -ifstatement), according to the following rules: --These rules still apply in Python 2.2.1 and beyond, but now you can also use an actual boolean, which has a value of-
-0is false; all other numbers are true. - -- An empty string (
"") is false, all other strings are true. - -- An empty list (
[]) is false; all other lists are true. - -- An empty tuple (
()) is false; all other tuples are true. - -- An empty dictionary (
{}) is false; all other dictionaries are true. - -TrueorFalse. Note the capitalization; these values, like everything else in Python, are case-sensitive. -3.2.4. Deleting List Elements
-Example 3.13. Removing Elements from a List
>>> li -['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements'] ->>> li.remove("z")->>> li -['a', 'b', 'new', 'mpilgrim', 'example', 'new', 'two', 'elements'] ->>> li.remove("new")
->>> li -['a', 'b', 'mpilgrim', 'example', 'new', 'two', 'elements'] ->>> li.remove("c")
-Traceback (innermost last): - File "<interactive input>", line 1, in ? -ValueError: list.remove(x): x not in list ->>> li.pop()
-'elements' ->>> li -['a', 'b', 'mpilgrim', 'example', 'new', 'two']
-3.2.5. Using List Operators
-Example 3.14. List Operators
>>> li = ['a', 'b', 'mpilgrim'] ->>> li = li + ['example', 'new']->>> li -['a', 'b', 'mpilgrim', 'example', 'new'] ->>> li += ['two']
->>> li -['a', 'b', 'mpilgrim', 'example', 'new', 'two'] ->>> li = [1, 2] * 3
->>> li -[1, 2, 1, 2, 1, 2]
--
-- -- -
Lists can also be concatenated with the -+operator.list = list + otherlisthas the same result aslist.extend(otherlist). But the+operator returns a new (concatenated) list as a value, whereasextendonly alters an existing list. This means thatextendis faster, especially for large lists. -- -- -
Python supports the -+=operator.li += ['two']is equivalent toli.extend(['two']). The+=operator works for lists, strings, and integers, and it can be overloaded to work for user-defined classes as well. (More - on classes in Chapter 5.) -- -- -
The -*operator works on lists as a repeater.li = [1, 2] * 3is equivalent toli = [1, 2] + [1, 2] + [1, 2], which concatenates the three lists into one. --Further Reading on Lists
--
-- How to Think Like a Computer Scientist teaches about lists and makes an important point about passing lists as function arguments. - -
- Python Tutorial shows how to use lists as stacks and queues. - -
- Python Knowledge Base answers common questions about lists and has a lot of example code using lists. - -
- Python Library Reference summarizes all the list methods. - -
3.3. Introducing Tuples
-A tuple is an immutable list. A tuple can not be changed in any way once it is created. -
Example 3.15. Defining a tuple
>>> t = ("a", "b", "mpilgrim", "z", "example")->>> t -('a', 'b', 'mpilgrim', 'z', 'example') ->>> t[0]
-'a' ->>> t[-1]
-'example' ->>> t[1:3]
-('b', 'mpilgrim')
-Example 3.16. Tuples Have No Methods
>>> t -('a', 'b', 'mpilgrim', 'z', 'example') ->>> t.append("new")-Traceback (innermost last): - File "<interactive input>", line 1, in ? -AttributeError: 'tuple' object has no attribute 'append' ->>> t.remove("z")
-Traceback (innermost last): - File "<interactive input>", line 1, in ? -AttributeError: 'tuple' object has no attribute 'remove' ->>> t.index("example")
-Traceback (innermost last): - File "<interactive input>", line 1, in ? -AttributeError: 'tuple' object has no attribute 'index' ->>> "z" in t
-True
-So what are tuples good for? -
--
-- Tuples are faster than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate - through it, use a tuple instead of a list. - -
- It makes your code safer if you “write-protect” data that does not need to be changed. Using a tuple instead of a list is like having an implied
assertstatement that shows this data is constant, and that special thought (and a specific function) is required to override that. - -- Remember that I said that dictionary keys can be integers, strings, and “a few other types”? Tuples are one of those types. Tuples can be used as keys in a dictionary, but lists can't be used this way.Actually, it's more complicated than that. Dictionary keys must be immutable. Tuples themselves are immutable, but if you - have a tuple of lists, that counts as mutable and isn't safe to use as a dictionary key. Only tuples of strings, numbers, - or other dictionary-safe tuples can be used as dictionary keys. - -
- Tuples are used in string formatting, as you'll see shortly. -
-
-- -- - -Tuples can be converted into lists, and vice-versa. The built-in -tuplefunction takes a list and returns a tuple with the same elements, and thelistfunction takes a tuple and returns a list. In effect,tuplefreezes a list, andlistthaws a tuple. --Further Reading on Tuples
--
- How to Think Like a Computer Scientist teaches about tuples and shows how to concatenate tuples. - -
- Python Knowledge Base shows how to sort a tuple. - -
- Python Tutorial shows how to define a tuple with one element. - -
3.4. Declaring variables
Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from Chapter 2,
odbchelper.py. -Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring +
Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring into existence by being assigned a value, and they are automatically destroyed when they go out of scope.
Example 3.17. Defining the myParams Variable
if __name__ == "__main__": @@ -1299,19 +822,19 @@ if __name__ == "__main__": "database":"master", \ "uid":"sa", \ "pwd":"secret" \ - }Notice the indentation. An
ifstatement is a code block and needs to be indented just like a function. + }Notice the indentation. An
ifstatement is a code block and needs to be indented just like a function.Also notice that the variable assignment is one command split over several lines, with a backslash (“
\”) serving as a line-continuation marker.-
- When a command is split among several lines with the line-continuation marker (“ \”), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python IDE auto-indents the continued line, you should probably accept its default unless you have a burning reason not to. +When a command is split among several lines with the line-continuation marker (“ \”), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python IDE auto-indents the continued line, you should probably accept its default unless you have a burning reason not to.Strictly speaking, expressions in parentheses, straight brackets, or curly braces (like defining a dictionary) can be split into multiple lines with or without the line continuation character (“
\”). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's +Strictly speaking, expressions in parentheses, straight brackets, or curly braces (like defining a dictionary) can be split into multiple lines with or without the line continuation character (“
\”). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's a matter of style. -Third, you never declared the variable myParams, you just assigned a value to it. This is like VBScript without the
option explicitoption. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception. +Third, you never declared the variable myParams, you just assigned a value to it. This is like VBScript without the
option explicitoption. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception.3.4.1. Referencing Variables
Example 3.18. Referencing an Unbound Variable
>>> x Traceback (innermost last): @@ -1334,11 +857,11 @@ NameError: There is no variable named 'x'- ![]()
v is a tuple of three elements, and (x, y, z)is a tuple of three variables. Assigning one to the other assigns each of the values of v to each of the variables, in order. +v is a tuple of three elements, and (x, y, z)is a tuple of three variables. Assigning one to the other assigns each of the values of v to each of the variables, in order.This has all sorts of uses. I often want to assign names to a range of values. In C, you would use
enumand manually list each constant and its associated value, which seems especially tedious when the values are consecutive. +This has all sorts of uses. I often want to assign names to a range of values. In C, you would use
enumand manually list each constant and its associated value, which seems especially tedious when the values are consecutive. In Python, you can use the built-inrangefunction with multi-variable assignment to quickly assign consecutive values.Example 3.20. Assigning Consecutive Values
>>> range(7)[0, 1, 2, 3, 4, 5, 6] @@ -1353,14 +876,14 @@ NameError: There is no variable named 'x'
- ![]()
The built-in rangefunction returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting - up to but not including the upper limit. (If you like, you can pass other parameters to specify a base other than0and a step other than1. You canprint range.__doc__for details.) +The built-in rangefunction returns a list of integers. In its simplest form, it takes an upper limit and returns a zero-based list counting + up to but not including the upper limit. (If you like, you can pass other parameters to specify a base other than0and a step other than1. You canprint range.__doc__for details.)- ![]()
MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, and SUNDAY are the variables you're defining. (This example came from the calendarmodule, a fun little module that prints calendars, like the UNIX programcal. Thecalendarmodule defines integer constants for days of the week.) +MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, and SUNDAY are the variables you're defining. (This example came from the calendarmodule, a fun little module that prints calendars, like the UNIX programcal. Thecalendarmodule defines integer constants for days of the week.)@@ -1371,7 +894,7 @@ NameError: There is no variable named 'x' You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of - all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the
osmodule, which you'll discuss in Chapter 6. + all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including theosmodule, which you'll discuss in Chapter 6.Further Reading on Variables
@@ -1381,7 +904,7 @@ NameError: There is no variable named 'x'
3.5. Formatting Strings
-Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is +
Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert values into a string with the
%splaceholder.Note that
(k, v)is a tuple. I told you they were good for something. +Note that
(k, v) is a tuple. I told you they were good for something. You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that -string formatting isn't just concatenation. It's not even just formatting. It's also type coercion. +string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
Example 3.22. String Formatting vs. Concatenating
>>> uid = "sa" >>> pwd = "secret" >>> print pwd + " is not a good password for " + uid@@ -1435,9 +958,9 @@ TypeError: cannot concatenate 'str' and 'int' objects
-
(userCount, )is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a - tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the - comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether(userCount)was a tuple with one element or just the value of userCount. +(userCount, )is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a + tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the + comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether(userCount)was a tuple with one element or just the value of userCount.@@ -1449,12 +972,12 @@ TypeError: cannot concatenate 'str' and 'int' objects As with
printfin C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values. +As with
printfin C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.Example 3.23. Formatting Numbers
>>> print "Today's stock price: %f" % 50.462550.462500 @@ -1479,7 +1002,7 @@ TypeError: cannot concatenate 'str' and 'int' objects
-
You can even combine modifiers. Adding the +modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding +You can even combine modifiers. Adding the @@ -1507,7 +1030,7 @@ TypeError: cannot concatenate 'str' and 'int' objects+modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding the value to exactly two decimal places.-
To make sense of this, look at it from right to left. li is the list you're mapping. Python loops through li one element at a time, temporarily assigning the value of each element to the variable elem. Python then applies the function elem*2and appends that result to the returned list. +To make sense of this, look at it from right to left. li is the list you're mapping. Python loops through li one element at a time, temporarily assigning the value of each element to the variable elem. Python then applies the function elem*2and appends that result to the returned list.@@ -1518,13 +1041,13 @@ TypeError: cannot concatenate 'str' and 'int' objects -
It is safe to assign the result of a list comprehension to the variable that you're mapping. Python constructs the new list in memory, and when the list comprehension is complete, it assigns the result to the variable. + It is safe to assign the result of a list comprehension to the variable that you're mapping. Python constructs the new list in memory, and when the list comprehension is complete, it assigns the result to the variable. Here are the list comprehensions in the
buildConnectionStringfunction that you declared in Chapter 2:-["%s=%s" % (k, v) for k, v in params.items()]First, notice that you're calling the
itemsfunction of the params dictionary. This function returns a list of tuples of all the data in the dictionary. +["%s=%s" % (k, v) for k, v in params.items()]First, notice that you're calling the
itemsfunction of the params dictionary. This function returns a list of tuples of all the data in the dictionary.Example 3.25. The
keys,values, anditemsFunctions>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"} >>> params.keys()['server', 'uid', 'database', 'pwd'] @@ -1536,24 +1059,24 @@ TypeError: cannot concatenate 'str' and 'int' objects
-
The keysmethod of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined +The keysmethod of a dictionary returns a list of all the keys. The list is not in the order in which the dictionary was defined (remember that elements in a dictionary are unordered), but it is a list.- ![]()
The valuesmethod returns a list of all the values. The list is in the same order as the list returned bykeys, soparams.values()[n] == params[params.keys()[n]]for all values of n. +The valuesmethod returns a list of all the values. The list is in the same order as the list returned bykeys, soparams.values()[n] == params[params.keys()[n]]for all values of n.- ![]()
The itemsmethod returns a list of tuples of the form(key, value). The list contains all the data in the dictionary. +The itemsmethod returns a list of tuples of the form(key, value). The list contains all the data in the dictionary.Now let's see what
buildConnectionStringdoes. It takes a list,params., and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements +items()Now let's see what
buildConnectionStringdoes. It takes a list,params., and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements asitems()params., but each element in the new list will be a string that contains both a key and its associated value from the params dictionary.items()Example 3.26. List Comprehensions in
buildConnectionString, Step by Step>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"} >>> params.items() @@ -1568,7 +1091,7 @@ asparams., but each element in theitems()- ![]()
Note that you're using two variables to iterate through the params.items()list. This is another use of multi-variable assignment. The first element ofparams.items()is('server', 'mpilgrim'), so in the first iteration of the list comprehension, k will get'server'and v will get'mpilgrim'. In this case, you're ignoring the value of v and only including the value of k in the returned list, so this list comprehension ends up being equivalent toparams.. +keys()Note that you're using two variables to iterate through the params.items()list. This is another use of multi-variable assignment. The first element ofparams.items()is('server', 'mpilgrim'), so in the first iteration of the list comprehension, k will get'server'and v will get'mpilgrim'. In this case, you're ignoring the value of v and only including the value of k in the returned list, so this list comprehension ends up being equivalent toparams..keys()@@ -1580,8 +1103,8 @@ as params., but each element in theitems()@@ -1594,18 +1117,18 @@ as - ![]()
Combining the previous two examples with some simple string formatting, you get a list of strings that include both the key and value of each element of the dictionary. This looks suspiciously - like the output of the program. All that remains is to join the elements in this list into a single string. + Combining the previous two examples with some simple string formatting, you get a list of strings that include both the key and value of each element of the dictionary. This looks suspiciously + like the output of the program. All that remains is to join the elements in this list into a single string. params., but each element in theitems()3.7. Joining Lists and Splitting Strings
-You have a list of key-value pairs in the form
key=value, and you want to join them into a single string. To join any list of strings into a single string, use thejoinmethod of a string object. +You have a list of key-value pairs in the form
key=value, and you want to join them into a single string. To join any list of strings into a single string, use thejoinmethod of a string object.Here is an example of joining a list from the
buildConnectionStringfunction:- return ";".join(["%s=%s" % (k, v) for k, v in params.items()])One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything -is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string
";"itself is an object, and you are calling itsjoinmethod. -The
joinmethod joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't -need to be a semi-colon; it doesn't even need to be a single character. It can be any string.+ return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything +is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string
";"itself is an object, and you are calling itsjoinmethod. +The
joinmethod joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't +need to be a semi-colon; it doesn't even need to be a single character. It can be any string.split.3.7.1. Historical Note on String Methods
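As a small sketch of the behavior just described, the snippet below joins a list with two different delimiters and then splits the result back apart. The list contents are sample data made up for illustration, not output from the real program, and the code is written so that it also runs on Python 3.

```python
pairs = ["server=mpilgrim", "uid=sa", "database=master", "pwd=secret"]

# join is called on the delimiter string; the list is its argument.
joined = ";".join(pairs)

# The delimiter can be any string, including one longer than a character.
spaced = " | ".join(pairs)

# split goes the other way: it breaks a string apart on a delimiter.
round_trip = joined.split(";")

print(joined)
print(spaced)
print(round_trip == pairs)
```

Note that `join` expects every element of the list to already be a string; it does no type coercion of its own.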
-When I first learned Python, I expected
jointo be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story - behind thejoinmethod. Prior to Python 1.6, strings didn't have all these useful methods. There was a separatestringmodule that contained all the string functions; each function took a string as its first argument. The functions were deemed - important enough to put onto the strings themselves, which made sense for functions likelower,upper, andsplit. But many hard-core Python programmers objected to the newjoinmethod, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of - the oldstringmodule (which still has a lot of useful stuff in it). I use the newjoinmethod exclusively, but you will see code written either way, and if it really bothers you, you can use the oldstring.joinfunction instead. +When I first learned Python, I expected
jointo be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story + behind thejoinmethod. Prior to Python 1.6, strings didn't have all these useful methods. There was a separatestringmodule that contained all the string functions; each function took a string as its first argument. The functions were deemed + important enough to put onto the strings themselves, which made sense for functions likelower,upper, andsplit. But many hard-core Python programmers objected to the newjoinmethod, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of + the oldstringmodule (which still has a lot of useful stuff in it). I use the newjoinmethod exclusively, but you will see code written either way, and if it really bothers you, you can use the oldstring.joinfunction instead.3.8. Summary
The
odbchelper.pyprogram and its output should now make perfect sense.@@ -1705,12 +1228,12 @@ if __name__ == "__main__":Chapter 4. The Power Of Introspection
-This chapter covers one of Python's strengths: introspection. As you know, everything in Python is an object, and introspection is code looking at other modules and functions in memory as objects, getting information about them, and -manipulating them. Along the way, you'll define functions with no name, call functions with arguments out of order, and reference +
This chapter covers one of Python's strengths: introspection. As you know, everything in Python is an object, and introspection is code looking at other modules and functions in memory as objects, getting information about them, and +manipulating them. Along the way, you'll define functions with no name, call functions with arguments out of order, and reference functions whose names you don't even know ahead of time.
4.1. Diving In
-Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered - in Chapter 2, Your First Python Program. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter. +
Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered + in Chapter 2, Your First Python Program. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter.
Example 4.1.
apihelper.pyIf you have not already done so, you can download this and other examples used in this book.
def info(object, spacing=10, collapse=1):![]()
![]()
@@ -1730,13 +1253,13 @@ if __name__ == "__main__":
-
This module has one function, info. According to its function declaration, it takes three parameters: object, spacing, and collapse. The last two are actually optional parameters, as you'll see shortly. +This module has one function, info. According to its function declaration, it takes three parameters: object, spacing, and collapse. The last two are actually optional parameters, as you'll see shortly.@@ -1760,7 +1283,7 @@ if __name__ == "__main__": - ![]()
The infofunction has a multi-linedocstringthat succinctly describes the function's purpose. Note that no return value is mentioned; this function will be used solely +The infofunction has a multi-linedocstringthat succinctly describes the function's purpose. Note that no return value is mentioned; this function will be used solely for its effects, rather than its value.Example 4.2. Sample Usage of
apihelper.py>>> from apihelper import info >>> li = [] @@ -1773,7 +1296,7 @@ insert L.insert(index, object) -- insert object before index pop L.pop([index]) -> item -- remove and return item at index (default last) remove L.remove(value) -- remove first occurrence of value reverse L.reverse() -- reverse *IN PLACE* -sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1By default the output is formatted to be easy to read. Multi-line
docstrings are collapsed into a single long line, but this option can be changed by specifying0for thecollapseargument. If the function names are longer than 10 characters, you can specify a larger value for thespacingargument to make the output easier to read. +sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1By default the output is formatted to be easy to read. Multi-line
docstrings are collapsed into a single long line, but this option can be changed by specifying0for thecollapseargument. If the function names are longer than 10 characters, you can specify a larger value for thespacingargument to make the output easier to read.Example 4.3. Advanced Usage of
apihelper.py>>> import odbchelper >>> info(odbchelper) buildConnectionString Build a connection string from a dictionary Returns string. @@ -1785,11 +1308,11 @@ buildConnectionString Build a connection string from a dictionary Retur Returns string.4.2. Using Optional and Named Arguments
Python allows function arguments to have default values; if the function is called without the argument, the argument gets its default - value. Furthermore, arguments can be specified in any order by using named arguments. Stored procedures in SQL Server Transact/SQL can do this, so if you're a SQL Server scripting guru, you can skim this part. + value. Furthermore, arguments can be specified in any order by using named arguments. Stored procedures in SQL Server Transact/SQL can do this, so if you're a SQL Server scripting guru, you can skim this part.
Here is an example of
info, a function with two optional arguments:-def info(object, spacing=10, collapse=1):spacing and collapse are optional, because they have default values defined. object is required, because it has no default value. If
infois called with only one argument, spacing defaults to10and collapse defaults to1. Ifinfois called with two arguments, collapse still defaults to1. -Say you want to specify a value for collapse but want to accept the default value for spacing. In most languages, you would be out of luck, because you would need to call the function with three arguments. But in +def info(object, spacing=10, collapse=1):
spacing and collapse are optional, because they have default values defined. object is required, because it has no default value. If
infois called with only one argument, spacing defaults to10and collapse defaults to1. Ifinfois called with two arguments, collapse still defaults to1. +Say you want to specify a value for collapse but want to accept the default value for spacing. In most languages, you would be out of luck, because you would need to call the function with three arguments. But in Python, arguments can be specified by name, in any order.
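The rules above can be checked with a small stand-in for info. This is only a sketch, written for Python 3: show_args is a hypothetical function that copies info's signature (including the object parameter name, which shadows the built-in of the same name) and simply returns the values it received, so the effect of each default is visible.

```python
def show_args(object, spacing=10, collapse=1):
    """Hypothetical stand-in for info: return the argument values it received."""
    # `object` mirrors the parameter name used by info; it shadows the built-in.
    return (object, spacing, collapse)


# One positional argument: spacing and collapse take their defaults.
print(show_args("odbchelper"))
# Two positional arguments: collapse still takes its default.
print(show_args("odbchelper", 12))
# Name collapse explicitly and skip spacing, which keeps its default.
print(show_args("odbchelper", collapse=0))
# Named arguments may appear in any order, even the required one.
print(show_args(spacing=15, object="odbchelper"))
```

Each call prints a tuple of (object, spacing, collapse), so you can see which values came from the caller and which came from the defaults.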
Example 4.4. Valid Calls of
infoinfo(odbchelper)@@ -1812,7 +1335,7 @@ info(spacing=15, object=odbchelper)
-
Here you are naming the collapse argument explicitly and specifying its value. spacing still gets its default value of 10. +Here you are naming the collapse argument explicitly and specifying its value. spacing still gets its default value of 10.@@ -1822,7 +1345,7 @@ info(spacing=15, object=odbchelper) ![]()
@@ -1840,11 +1363,11 @@ time, you'll call functions the “normal” way, but you always have th 4.3. Using
-type,str,dir, and Other Built-In FunctionsPython has a small set of extremely useful built-in functions. All other functions are partitioned off into modules. This was +
Python has a small set of extremely useful built-in functions. All other functions are partitioned off into modules. This was actually a conscious design decision, to keep the core language from getting bloated like other scripting languages (cough cough, Visual Basic).
4.3.1. The
-typeFunctionThe
typefunction returns the datatype of any arbitrary object. The possible types are listed in thetypesmodule. This is useful for helper functions that can handle several types of data. +The
typefunction returns the datatype of any arbitrary object. The possible types are listed in thetypesmodule. This is useful for helper functions that can handle several types of data.Example 4.5. Introducing
type>>> type(1)<type 'int'> >>> li = [] @@ -1860,7 +1383,7 @@ True
@@ -1879,12 +1402,12 @@ True - ![]()
typetakes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions, +typetakes anything -- and I mean anything -- and returns its datatype. Integers, strings, lists, dictionaries, tuples, functions, classes, modules, even types are acceptable.4.3.2. The
-strFunctionThe
strcoerces data into a string. Every datatype can be coerced into a string. +The
strcoerces data into a string. Every datatype can be coerced into a string.Example 4.6. Introducing
str>>> str(1)'1' @@ -1908,24 +1431,24 @@ True
- ![]()
However, strworks on any object of any type. Here it works on a list which you've constructed in bits and pieces. +However, strworks on any object of any type. Here it works on a list which you've constructed in bits and pieces.- ![]()
stralso works on modules. Note that the string representation of the module includes the pathname of the module on disk, so +stralso works on modules. Note that the string representation of the module includes the pathname of the module on disk, so yours will be different.- - ![]()
A subtle but important behavior of stris that it works onNone, the Python null value. It returns the string'None'. You'll use this to your advantage in theinfofunction, as you'll see shortly. +A subtle but important behavior of stris that it works onNone, the Python null value. It returns the string'None'. You'll use this to your advantage in theinfofunction, as you'll see shortly.At the heart of the
infofunction is the powerfuldirfunction.dirreturns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much +At the heart of the
infofunction is the powerfuldirfunction.dirreturns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much anything.Example 4.7. Introducing
dir>>> li = [] >>> dir(li)@@ -1941,24 +1464,24 @@ True
- ![]()
li is a list, so dir(li) returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not +li is a list, so dir(li) returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not the methods themselves.- ![]()
d is a dictionary, so dir(d) returns a list of the names of dictionary methods. At least one of these, keys, should look familiar. +d is a dictionary, so dir(d) returns a list of the names of dictionary methods. At least one of these, keys, should look familiar.- - ![]()
This is where it really gets interesting. odbchelper is a module, so dir(odbchelper) returns a list of all kinds of stuff defined in the module, including built-in attributes, like __name__, __doc__, and whatever other attributes and methods you define. In this case, odbchelper has only one user-defined method, the buildConnectionString function described in Chapter 2. +This is where it really gets interesting. odbchelper is a module, so dir(odbchelper) returns a list of all kinds of stuff defined in the module, including built-in attributes, like __name__, __doc__, and whatever other attributes and methods you define. In this case, odbchelper has only one user-defined method, the buildConnectionString function described in Chapter 2. Finally, the
callablefunction takes any object and returnsTrueif the object can be called, orFalseotherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.) +Finally, the
callablefunction takes any object and returnsTrueif the object can be called, orFalseotherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.)Example 4.8. Introducing
callable>>> import string >>> string.punctuation@@ -1973,7 +1496,7 @@ True join(list [,sep]) -> string Return a string composed of the words in list, with - intervening occurrences of sep. The default separator is a + intervening occurrences of sep. The default separator is a single space. (joinfields and join are synonymous)
@@ -1993,7 +1516,7 @@ True- ![]()
string.punctuation is not callable; it is a string. (A string does have callable methods, but the string itself is not callable.) + string.punctuation is not callable; it is a string. (A string does have callable methods, but the string itself is not callable.) @@ -2005,15 +1528,15 @@ True - ![]()
Any callable object may have a docstring. By using thecallablefunction on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes) +Any callable object may have a docstring. By using thecallablefunction on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes) and which you want to ignore (constants and so on) without knowing anything about the object ahead of time.4.3.3. Built-In Functions
-
type,str,dir, and all the rest of Python's built-in functions are grouped into a special module called__builtin__. (That's two underscores before and after.) If it helps, you can think of Python automatically executingfrom __builtin__ import *on startup, which imports all the “built-in” functions into the namespace so you can use them directly. +
type,str,dir, and all the rest of Python's built-in functions are grouped into a special module called__builtin__. (That's two underscores before and after.) If it helps, you can think of Python automatically executingfrom __builtin__ import *on startup, which imports all the “built-in” functions into the namespace so you can use them directly.The advantage of thinking like this is that you can access all the built-in functions and attributes as a group by getting - information about the
__builtin__module. And guess what, Python has a function calledinfo. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the + information about the__builtin__module. And guess what, Python has a function calledinfo. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the built-in error classes, likeAttributeError, should already look familiar.)Example 4.9. Built-in Attributes and Functions
>>> from apihelper import info >>> import __builtin__ @@ -2032,7 +1555,7 @@ IOError I/O operation failed.- @@ -2044,7 +1567,7 @@ IOError I/O operation failed.Python comes with excellent reference manuals, which you should peruse thoroughly to learn all the modules Python has to offer. But unlike most languages, where you would find yourself referring back to the manuals or man pages to remind + Python comes with excellent reference manuals, which you should peruse thoroughly to learn all the modules Python has to offer. But unlike most languages, where you would find yourself referring back to the manuals or man pages to remind yourself how to use these modules, Python is largely self-documenting. 4.4. Getting Object References With
-getattrYou already know that Python functions are objects. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the +
You already know that Python functions are objects. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the
getattrfunction.Example 4.10. Introducing
getattr>>> li = ["Larry", "Curly"] >>> li.pop@@ -2064,20 +1587,20 @@ AttributeError: 'tuple' object has no attribute 'pop'
-
This gets a reference to the popmethod of the list. Note that this is not calling thepopmethod; that would beli.pop(). This is the method itself. +This gets a reference to the popmethod of the list. Note that this is not calling thepopmethod; that would beli.pop(). This is the method itself.- ![]()
This also returns a reference to the popmethod, but this time, the method name is specified as a string argument to thegetattrfunction.getattris an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list, +This also returns a reference to the popmethod, but this time, the method name is specified as a string argument to thegetattrfunction.getattris an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list, and the attribute is thepopmethod.- ![]()
In case it hasn't sunk in just how incredibly useful this is, try this: the return value of getattris the method, which you can then call just as if you had saidli.append("Moe")directly. But you didn't call the function directly; you specified the function name as a string instead. +In case it hasn't sunk in just how incredibly useful this is, try this: the return value of getattris the method, which you can then call just as if you had saidli.append("Moe")directly. But you didn't call the function directly; you specified the function name as a string instead.@@ -2094,7 +1617,7 @@ AttributeError: 'tuple' object has no attribute 'pop' Example 4.11. The
getattrFunction inapihelper.py>>> import odbchelper >>> odbchelper.buildConnectionString<function buildConnectionString at 00D18DD4> @@ -2115,19 +1638,19 @@ True
- ![]()
This returns a reference to the buildConnectionStringfunction in theodbchelpermodule, which you studied in Chapter 2, Your First Python Program. (The hex address you see is specific to my machine; your output will be different.) +This returns a reference to the buildConnectionStringfunction in theodbchelpermodule, which you studied in Chapter 2, Your First Python Program. (The hex address you see is specific to my machine; your output will be different.)- ![]()
Using getattr, you can get the same reference to the same function. In general, getattr(object, "attribute") is equivalent to object.attribute. If object is a module, then attribute can be anything defined in the module: a function, class, or global variable. +Using getattr, you can get the same reference to the same function. In general, getattr(object, "attribute") is equivalent to object.attribute. If object is a module, then attribute can be anything defined in the module: a function, class, or global variable.- ![]()
And this is what you actually use in the infofunction. object is passed into the function as an argument; method is a string which is the name of a method or function. +And this is what you actually use in the infofunction. object is passed into the function as an argument; method is a string which is the name of a method or function.@@ -2144,10 +1667,10 @@ True 4.4.2.
-getattrAs a DispatcherA common usage pattern of
getattris as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could +A common usage pattern of
getattris as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could define separate functions for each output format and use a single dispatch function to call the right one. -For example, let's imagine a program that prints site statistics in HTML, XML, and plain text formats. The choice of output format could be specified on the command line, or stored in a configuration - file. A
statsoutmodule defines three functions,output_html,output_xml, andoutput_text. Then the main program defines a single output function, like this: +For example, let's imagine a program that prints site statistics in HTML, XML, and plain text formats. The choice of output format could be specified on the command line, or stored in a configuration + file. A
statsoutmodule defines three functions,output_html,output_xml, andoutput_text. Then the main program defines a single output function, like this:Example 4.12. Creating a Dispatcher with
getattrimport statsout @@ -2159,25 +1682,25 @@ def output(data, format="text"):![]()
- ![]()
The outputfunction takes one required argument, data, and one optional argument, format. If format is not specified, it defaults totext, and you will end up calling the plain text output function. +The outputfunction takes one required argument, data, and one optional argument, format. If format is not specified, it defaults totext, and you will end up calling the plain text output function.- ![]()
You concatenate the format argument with "output_" to produce a function name, and then go get that function from the statsoutmodule. This allows you to easily extend the program later to support other output formats, without changing this dispatch - function. Just add another function tostatsoutnamed, for instance,output_pdf, and pass "pdf" as the format into theoutputfunction. +You concatenate the format argument with "output_" to produce a function name, and then go get that function from the statsoutmodule. This allows you to easily extend the program later to support other output formats, without changing this dispatch + function. Just add another function tostatsoutnamed, for instance,output_pdf, and pass "pdf" as the format into theoutputfunction.- ![]()
Now you can simply call the output function in the same way as any other function. The output_function variable is a reference to the appropriate function from the statsoutmodule. +Now you can simply call the output function in the same way as any other function. The output_function variable is a reference to the appropriate function from the statsoutmodule.Did you see the bug in the previous example? This is a very loose coupling of strings and functions, and there is no error - checking. What happens if the user passes in a format that doesn't have a corresponding function defined in
statsout? Well, getattr will raise an AttributeError exception, so nothing is ever assigned to output_function and the program crashes before it can call an output function. That's + checking. What happens if the user passes in a format that doesn't have a corresponding function defined in statsout? Well, getattr will raise an AttributeError exception, so nothing is ever assigned to output_function and the program crashes before it can call an output function. That's bad.Luckily,
getattrtakes an optional third argument, a default value.Example 4.13.
getattrDefault Values@@ -2191,17 +1714,17 @@ def output(data, format="text"):- - ![]()
This function call is guaranteed to work, because you added a third argument to the call to getattr. The third argument is a default value that is returned if the attribute or method specified by the second argument wasn't +This function call is guaranteed to work, because you added a third argument to the call to getattr. The third argument is a default value that is returned if the attribute or method specified by the second argument wasn't found.As you can see,
getattris quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters. +As you can see,
getattris quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters.4.5. Filtering Lists
-As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (Section 3.6, “Mapping Lists”). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely. +
As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (Section 3.6, “Mapping Lists”). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely.
Here is the list filtering syntax:
-[mapping-expressionforelementinsource-listiffilter-expression]This is an extension of the list comprehensions that you know and love. The first two thirds are the same; the last part, starting with the
if, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be almost anything). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored, +[mapping-expressionforelementinsource-listiffilter-expression]This is an extension of the list comprehensions that you know and love. The first two thirds are the same; the last part, starting with the
if, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be almost anything). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored, so they are never put through the mapping expression and are not included in the output list.Example 4.14. Introducing List Filtering
>>> li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"] >>> [elem for elem in li if len(elem) > 1]@@ -2215,35 +1738,35 @@ so they are never put through the mapping expression and are not included in the
![]()
The mapping expression here is simple (it just returns the value of each element), so concentrate on the filter expression. - As Python loops through the list, it runs each element through the filter expression. If the filter expression is true, the element - is mapped and the result of the mapping expression is included in the returned list. Here, you are filtering out all the + As Python loops through the list, it runs each element through the filter expression. If the filter expression is true, the element + is mapped and the result of the mapping expression is included in the returned list. Here, you are filtering out all the one-character strings, so you're left with a list of all the longer strings. - ![]()
Here, you are filtering out a specific value, b. Note that this filters all occurrences ofb, since each time it comes up, the filter expression will be false. +Here, you are filtering out a specific value, b. Note that this filters all occurrences ofb, since each time it comes up, the filter expression will be false.- ![]()
count is a list method that returns the number of times a value occurs in a list. You might think that this filter would eliminate - duplicates from a list, returning a list containing only one copy of each value in the original list. But it doesn't, because - values that appear twice in the original list (in this case, b and d) are excluded completely. There are ways of eliminating duplicates from a list, but filtering is not the solution. +count is a list method that returns the number of times a value occurs in a list. You might think that this filter would eliminate + duplicates from a list, returning a list containing only one copy of each value in the original list. But it doesn't, because + values that appear twice in the original list (in this case, b and d) are excluded completely. There are ways of eliminating duplicates from a list, but filtering is not the solution. Let's get back to this line from
apihelper.py:- methodList = [method for method in dir(object) if callable(getattr(object, method))]This looks complicated, and it is complicated, but the basic structure is the same. The whole filter expression returns a -list, which is assigned to the methodList variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression, -which returns the value of each element.
dir(object) returns a list of object's attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the if. -The filter expression looks scary, but it's not. You already know about
callable, getattr, and in. As you saw in the previous section, the expression getattr(object, method) returns a function object if object is a module and method is the name of a function in that module. -So this expression takes an object (named object). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters -that list to weed out all the stuff that you don't care about. You do the weeding out by taking the name of each attribute/method/function -and getting a reference to the real thing, via the
getattr function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like -the pop method of a list) and user-defined (like the buildConnectionString function of the odbchelper module). You don't care about other attributes, like the __name__ attribute that's built in to every module. + methodList = [method for method in dir(object) if callable(getattr(object, method))]This looks complicated, and it is complicated, but the basic structure is the same. The whole filter expression returns a +list, which is assigned to the methodList variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression, +which returns the value of each element.
dir(object) returns a list of object's attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the if. +The filter expression looks scary, but it's not. You already know about
callable, getattr, and in. As you saw in the previous section, the expression getattr(object, method) returns a function object if object is a module and method is the name of a function in that module. +So this expression takes an object (named object). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters +that list to weed out all the stuff that you don't care about. You do the weeding out by taking the name of each attribute/method/function +and getting a reference to the real thing, via the
getattr function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like +the pop method of a list) and user-defined (like the buildConnectionString function of the odbchelper module). You don't care about other attributes, like the __name__ attribute that's built in to every module.Further Reading on Filtering Lists
@@ -2263,15 +1786,15 @@ the
popmethod of a list) and user-defined (like thebuildCon- ![]()
When using and, values are evaluated in a boolean context from left to right.0,'',[],(),{}, andNoneare false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are - true in a boolean context, but you can define special methods in your class to make an instance evaluate to false. You'll - learn all about classes and special methods in Chapter 5. If all values are true in a boolean context,andreturns the last value. In this case,andevaluates'a', which is true, then'b', which is true, and returns'b'. +When using and, values are evaluated in a boolean context from left to right.0,'',[],(),{}, andNoneare false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are + true in a boolean context, but you can define special methods in your class to make an instance evaluate to false. You'll + learn all about classes and special methods in Chapter 5. If all values are true in a boolean context,andreturns the last value. In this case,andevaluates'a', which is true, then'b', which is true, and returns'b'.- ![]()
If any value is false in a boolean context, andreturns the first false value. In this case,''is the first false value. +If any value is false in a boolean context, andreturns the first false value. In this case,''is the first false value.@@ -2288,15 +1811,15 @@ the -popmethod of a list) and user-defined (like thebuildCon >>> '' or [] or {}{} >>> def sidefx(): -... print "in sidefx()" -... return 1 +... print "in sidefx()" +... return 1 >>> 'a' or sidefx()
'a'
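The return-value rules for and and or can be checked in a few lines. This is a minimal sketch written for Python 3 (so print is a function, unlike the Python 2 session above); calls, sidefx_logged, and results are hypothetical names introduced only for this illustration. The helper records each time it runs, which makes the short-circuit behavior of or visible.

```python
calls = []  # records every time sidefx_logged actually runs


def sidefx_logged():
    """Hypothetical variant of sidefx that logs the call instead of printing."""
    calls.append("sidefx_logged")
    return 1


results = {
    # All values true: `and` returns the last value.
    "'a' and 'b'": 'a' and 'b',
    # A false value is present: `and` returns the first false value.
    "'' and 'b'": '' and 'b',
    # `or` returns the first true value and never evaluates the rest.
    "'a' or sidefx_logged()": 'a' or sidefx_logged(),
    # All values false: `or` returns the last value.
    "'' or [] or {}": '' or [] or {},
}

for expression, value in results.items():
    print(expression, "->", repr(value))

# Stays empty, because `or` stopped at 'a' before calling the function.
print("calls:", calls)
```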
If you're a C hacker, you are certainly familiar with the
bool ? a : bexpression, which evaluates to a ifboolis true, and b otherwise. Because of the wayandandorwork in Python, you can accomplish the same thing. +If you're a C hacker, you are certainly familiar with the
bool ? a : bexpression, which evaluates to a ifboolis true, and b otherwise. Because of the wayandandorwork in Python, you can accomplish the same thing.4.6.1. Using the
and-orTrickExample 4.17. Introducing the
and-orTrick>>> a = "first" >>> b = "second" @@ -2332,7 +1855,7 @@ thepopmethod of a list) and user-defined (like thebuildCon- ![]()
This syntax looks similar to the bool ? a : b expression in C. The entire expression is evaluated from left to right, so the and is evaluated first. 1 and 'first' evaluates to 'first', then 'first' or 'second' evaluates to 'first'. +This syntax looks similar to the bool ? a : b expression in C. The entire expression is evaluated from left to right, so the and is evaluated first. 1 and 'first' evaluates to 'first', then 'first' or 'second' evaluates to 'first'.@@ -2343,7 +1866,7 @@ the popmethod of a list) and user-defined (like thebuildConHowever, since this Python expression is simply boolean logic, and not a special construct of the language, there is one extremely important difference - between this
and-ortrick in Python and thebool ? a : bsyntax in C. If the value of a is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?) + between thisand-ortrick in Python and thebool ? a : bsyntax in C. If the value of a is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?)Example 4.18. When the
and-orTrick Fails>>> a = "" >>> b = "second" >>> 1 and a or b@@ -2352,12 +1875,12 @@ the
popmethod of a list) and user-defined (like thebuildCon- ![]()
Since a is an empty string, which Python considers false in a boolean context, 1 and '' evaluates to '', and then '' or 'second' evaluates to 'second'. Oops! That's not what you wanted. +Since a is an empty string, which Python considers false in a boolean context, 1 and '' evaluates to '', and then '' or 'second' evaluates to 'second'. Oops! That's not what you wanted. The
and-ortrick,bool and a or b, will not work like the C expressionbool ? a : bwhen a is false in a boolean context. -The real trick behind the
and-ortrick, then, is to make sure that the value of a is never false. One common way of doing this is to turn a into[a]and b into[b], then taking the first element of the returned list, which will be either a or b. +The real trick behind the
and-ortrick, then, is to make sure that the value of a is never false. One common way of doing this is to turn a into[a]and b into[b], then taking the first element of the returned list, which will be either a or b.Example 4.19. Using the
and-orTrick Safely>>> a = "" >>> b = "second" >>> (1 and [a] or [b])[0]@@ -2366,12 +1889,12 @@ the
popmethod of a list) and user-defined (like thebuildCon- - ![]()
Since [a]is a non-empty list, it is never false. Even if a is0or''or some other false value, the list[a]is true because it has one element. +Since [a]is a non-empty list, it is never false. Even if a is0or''or some other false value, the list[a]is true because it has one element.By now, this trick may seem like more trouble than it's worth. You could, after all, accomplish the same thing with an
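The three forms discussed above can be compared side by side. This is a minimal sketch written for Python 3 (an assumption; the chapter's own examples use Python 2 syntax), and the helper names `and_or`, `and_or_safe`, and `ternary` are hypothetical. The last helper uses the conditional expression available since Python 2.5, which sidesteps the pitfall without any list wrapping.

```python
def and_or(cond, a, b):
    """The plain and-or trick; gives the wrong answer when a is false."""
    return cond and a or b


def and_or_safe(cond, a, b):
    """Wrap both values in one-element lists so the middle operand is always true."""
    return (cond and [a] or [b])[0]


def ternary(cond, a, b):
    """Conditional expression (Python 2.5 and later); no wrapping needed."""
    return a if cond else b


plain = and_or(1, "", "second")        # '' is false, so the trick falls through to b
safe = and_or_safe(1, "", "second")    # [''] is a non-empty list, so a comes back
modern = ternary(1, "", "second")      # a comes back directly

print(repr(plain), repr(safe), repr(modern))
```

Because the conditional expression is an expression rather than a statement, it can also appear inside a `lambda`.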
ifstatement, so why go through all this fuss? Well, in many cases, you are choosing between two constant values, so you can - use the simpler syntax and not worry, because you know that the a value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so. +By now, this trick may seem like more trouble than it's worth. You could, after all, accomplish the same thing with an
ifstatement, so why go through all this fuss? Well, in many cases, you are choosing between two constant values, so you can + use the simpler syntax and not worry, because you know that the a value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so. For example, there are some cases in Python whereifstatements are not allowed, such as inlambdafunctions.Further Reading on the
@@ -2380,10 +1903,10 @@ theand-orTrickpopmethod of a list) and user-defined (like thebuildCon4.7. Using
-lambdaFunctionsPython supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called
lambdafunctions can be used anywhere a function is required. +Python supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called
lambdafunctions can be used anywhere a function is required.Example 4.20. Introducing
lambdaFunctions>>> def f(x): -... return x*2 -... +... return x*2 +... >>> f(3) 6 >>> g = lambda x: x*2@@ -2395,27 +1918,27 @@ the
popmethod of a list) and user-defined (like thebuildCon- ![]()
This is a lambdafunction that accomplishes the same thing as the normal function above it. Note the abbreviated syntax here: there are no - parentheses around the argument list, and thereturnkeyword is missing (it is implied, since the entire function can only be one expression). Also, the function has no name, +This is a lambdafunction that accomplishes the same thing as the normal function above it. Note the abbreviated syntax here: there are no + parentheses around the argument list, and thereturnkeyword is missing (it is implied, since the entire function can only be one expression). Also, the function has no name, but it can be called through the variable it is assigned to.- - ![]()
You can use a lambdafunction without even assigning it to a variable. This may not be the most useful thing in the world, but it just goes to +You can use a lambdafunction without even assigning it to a variable. This may not be the most useful thing in the world, but it just goes to show that a lambda is just an in-line function.To generalize, a
lambdafunction is a function that takes any number of arguments (including optional arguments) and returns the value of a single expression.lambdafunctions can not contain commands, and they can not contain more than one expression. Don't try to squeeze too much into +To generalize, a
lambdafunction is a function that takes any number of arguments (including optional arguments) and returns the value of a single expression.lambdafunctions can not contain commands, and they can not contain more than one expression. Don't try to squeeze too much into alambdafunction; if you need something more complex, define a normal function instead and make it as long as you want.
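As a small illustration of that generalization, here is a sketch in Python 3 (an assumption; the surrounding examples are Python 2). The variable names are made up for the example. Each `lambda` is a single expression, but its argument list can use defaults and variable-length arguments just as a `def` can.

```python
double = lambda x: x * 2                   # one required argument
scale = lambda x, factor=3: x * factor     # optional argument with a default value
total = lambda *args: sum(args)            # any number of positional arguments

results = [
    double(3),
    scale(2),
    scale(2, factor=5),
    total(1, 2, 3),
    (lambda x: x * 2)(3),                  # called without ever being given a name
]
print(results)
```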
- @@ -2423,8 +1946,8 @@ alambdafunctions are a matter of style. Using them is never required; anywhere you could use them, you could define a separate - normal function and use that instead. I use them in places where I want to encapsulate specific, non-reusable code without +lambdafunctions are a matter of style. Using them is never required; anywhere you could use them, you could define a separate + normal function and use that instead. I use them in places where I want to encapsulate specific, non-reusable code without littering my code with a lot of little one-line functions.lambdafunction; if you need something more complex, define a nor4.7.1. Real-World
lambdaFunctionsHere are the
lambdafunctions inapihelper.py:- processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)Notice that this uses the simple form of the
and-ortrick, which is okay, because alambdafunction is always true in a boolean context. (That doesn't mean that alambdafunction can't return a false value. The function is always true; its return value could be anything.) -Also notice that you're using the
splitfunction with no arguments. You've already seen it used with one or two arguments, but without any arguments it splits on whitespace. + processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)Notice that this uses the simple form of the
and-ortrick, which is okay, because alambdafunction is always true in a boolean context. (That doesn't mean that alambdafunction can't return a false value. The function is always true; its return value could be anything.) +Also notice that you're using the
splitfunction with no arguments. You've already seen it used with one or two arguments, but without any arguments it splits on whitespace.Example 4.21.
splitWith No Arguments>>> s = "this is\na\ttest">>> print s this is @@ -2437,52 +1960,52 @@ a test
- ![]()
This is a multiline string, defined by escape characters instead of triple quotes. \n is a newline, and \t is a tab character. +This is a multiline string, defined by escape characters instead of triple quotes. \n is a newline, and \t is a tab character.- ![]()
split without any arguments splits on whitespace. So three spaces, a newline, and a tab character are all the same. +split without any arguments splits on whitespace. So three spaces, a newline, and a tab character are all the same.- ![]()
You can normalize whitespace by splitting a string with splitand then rejoining it withjoin, using a single space as a delimiter. This is what theinfofunction does to collapse multi-linedocstrings into a single line. +You can normalize whitespace by splitting a string with splitand then rejoining it withjoin, using a single space as a delimiter. This is what theinfofunction does to collapse multi-linedocstrings into a single line.So what is the
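The split-then-join idiom is short enough to try on its own. A minimal sketch in Python 3 (an assumption; the example above uses the Python 2 `print` statement), with a sample string chosen only for illustration:

```python
s = "this   is\na\ttest"         # three spaces, a newline, and a tab

words = s.split()                # no arguments: split on any run of whitespace
normalized = " ".join(words)     # rejoin with exactly one space between words

print(words)
print(normalized)
```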
infofunction actually doing with theselambdafunctions,splits, andand-ortricks?- processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)processFunc is now a function, but which function it is depends on the value of the collapse variable. If collapse is true,
processFunc(string)will collapse whitespace; otherwise,processFunc(string)will return its argument unchanged. -To do this in a less robust language, like Visual Basic, you would probably create a function that took a string and a
collapseargument and used anifstatement to decide whether to collapse the whitespace or not, then returned the appropriate value. This would be inefficient, - because the function would need to handle every possible case. Every time you called it, it would need to decide whether - to collapse whitespace before it could give you what you wanted. In Python, you can take that decision logic out of the function and define alambdafunction that is custom-tailored to give you exactly (and only) what you want. This is more efficient, more elegant, and + processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)processFunc is now a function, but which function it is depends on the value of the collapse variable. If collapse is true,
processFunc(string)will collapse whitespace; otherwise,processFunc(string)will return its argument unchanged. +To do this in a less robust language, like Visual Basic, you would probably create a function that took a string and a
collapseargument and used anifstatement to decide whether to collapse the whitespace or not, then returned the appropriate value. This would be inefficient, + because the function would need to handle every possible case. Every time you called it, it would need to decide whether + to collapse whitespace before it could give you what you wanted. In Python, you can take that decision logic out of the function and define alambdafunction that is custom-tailored to give you exactly (and only) what you want. This is more efficient, more elegant, and less prone to those nasty oh-I-thought-those-arguments-were-reversed kinds of errors.Further Reading on
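The idea of deciding once and then calling a tailor-made function can be sketched as follows. This is an illustration in Python 3, not the code from `apihelper.py`: the helper name `pick_processor` is hypothetical, and it uses a conditional expression where the original line uses the and-or trick.

```python
def pick_processor(collapse):
    """Return a function chosen once, up front, based on collapse."""
    return (lambda s: " ".join(s.split())) if collapse else (lambda s: s)


text = "first line\n    second line"

collapsing = pick_processor(True)
passthrough = pick_processor(False)

# Neither returned function re-checks the collapse flag when it is called.
print(repr(collapsing(text)))
print(repr(passthrough(text)))
```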
lambdaFunctions
- Python Knowledge Base discusses using
lambdato call functions indirectly. -- Python Tutorial shows how to access outside variables from inside a
lambdafunction. (PEP 227 explains how this will change in future versions of Python.) +- Python Tutorial shows how to access outside variables from inside a
lambdafunction. (PEP 227 explains how this will change in future versions of Python.)- The Whole Python FAQ has examples of obfuscated one-liners using
lambda.4.8. Putting It All Together
-The last line of code, the only one you haven't deconstructed yet, is the one that does all the work. But by now the work - is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time +
The last line of code, the only one you haven't deconstructed yet, is the one that does all the work. But by now the work + is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time to knock them down.
This is the meat of
apihelper.py:print "\n".join(["%s %s" % (method.ljust(spacing), processFunc(str(getattr(object, method).__doc__))) - for method in methodList])Note that this is one command, split over multiple lines, but it doesn't use the line continuation character (
\). Remember when I said that some expressions can be split into multiple lines without using a backslash? A list comprehension is one of those expressions, since the entire expression is contained in + for method in methodList])Note that this is one command, split over multiple lines, but it doesn't use the line continuation character (
\). Remember when I said that some expressions can be split into multiple lines without using a backslash? A list comprehension is one of those expressions, since the entire expression is contained in square brackets. -Now, let's take it from the end and work backwards. The
-for method in methodListshows that this is a list comprehension. As you know, methodList is a list of all the methods you care about in object. So you're looping through that list with method. +
Now, let's take it from the end and work backwards. The
+for method in methodListshows that this is a list comprehension. As you know, methodList is a list of all the methods you care about in object. So you're looping through that list with method.
The next piece of the puzzle is the use of
straround thedocstring. As you may recall,stris a built-in function that coerces data into a string. But adocstringis always a string, so why bother with thestrfunction? The answer is that not every function has adocstring, and if it doesn't, its__doc__attribute isNone. +The next piece of the puzzle is the use of
straround thedocstring. As you may recall,stris a built-in function that coerces data into a string. But adocstringis always a string, so why bother with thestrfunction? The answer is that not every function has adocstring, and if it doesn't, its__doc__attribute isNone.Example 4.23. Why Use
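A quick way to see why the `str` call matters is to compare a function with a `docstring` to one without. The sketch below is written for Python 3 (an assumption), and both function names are invented for the example.

```python
def documented():
    """Does nothing, but says so."""


def undocumented():
    pass


# str() turns a missing docstring (None) into the string 'None'.
docs = [str(func.__doc__) for func in (documented, undocumented)]

# Without str(), calling a string method on the missing docstring fails.
try:
    undocumented.__doc__.split()
except AttributeError as exc:
    error = type(exc).__name__

print(docs, error)
```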
stron adocstring?>>> >>> def foo(): print 2 >>> >>> foo() 2 @@ -2532,7 +2055,7 @@ True- ![]()
You can easily define a function that has no docstring, so its__doc__attribute isNone. Confusingly, if you evaluate the__doc__attribute directly, the Python IDE prints nothing at all, which makes sense if you think about it, but is still unhelpful. +You can easily define a function that has no docstring, so its__doc__attribute isNone. Confusingly, if you evaluate the__doc__attribute directly, the Python IDE prints nothing at all, which makes sense if you think about it, but is still unhelpful.@@ -2553,12 +2076,12 @@ True - -In SQL, you must use IS NULLinstead of= NULLto compare a null value. In Python, you can use either== Noneoris None, butis Noneis faster. +In SQL, you must use IS NULLinstead of= NULLto compare a null value. In Python, you can use either== Noneoris None, butis Noneis faster.Now that you are guaranteed to have a string, you can pass the string to processFunc, which you have already defined as a function that either does or doesn't collapse whitespace. Now you see why it was important to use
strto convert aNonevalue into a string representation. processFunc is assuming a string argument and calling itssplitmethod, which would crash if you passed itNonebecauseNonedoesn't have asplitmethod. -Stepping back even further, you see that you're using string formatting again to concatenate the return value of processFunc with the return value of method's
ljustmethod. This is a new string method that you haven't seen before. +Now that you are guaranteed to have a string, you can pass the string to processFunc, which you have already defined as a function that either does or doesn't collapse whitespace. Now you see why it was important to use
strto convert aNonevalue into a string representation. processFunc is assuming a string argument and calling itssplitmethod, which would crash if you passed itNonebecauseNonedoesn't have asplitmethod. +Stepping back even further, you see that you're using string formatting again to concatenate the return value of processFunc with the return value of method's
ljustmethod. This is a new string method that you haven't seen before.Example 4.24. Introducing
ljust>>> s = 'buildConnectionString' >>> s.ljust(30)'buildConnectionString ' @@ -2568,17 +2091,17 @@ True
- ![]()
ljustpads the string with spaces to the given length. This is what theinfofunction uses to make two columns of output and line up all thedocstrings in the second column. +ljustpads the string with spaces to the given length. This is what theinfofunction uses to make two columns of output and line up all thedocstrings in the second column.- - ![]()
If the given length is smaller than the length of the string, ljustwill simply return the string unchanged. It never truncates the string. +If the given length is smaller than the length of the string, ljustwill simply return the string unchanged. It never truncates the string.You're almost finished. Given the padded method name from the
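Both behaviors of `ljust` can be checked with a few lines. A minimal sketch in Python 3 (an assumption about the interpreter; `ljust` itself behaves the same way in Python 2):

```python
name = "buildConnectionString"     # 21 characters long

padded = name.ljust(30)            # padded with trailing spaces up to width 30
unchanged = name.ljust(20)         # width smaller than the string: no truncation

print(repr(padded), len(padded))
print(repr(unchanged), len(unchanged))
```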
ljustmethod and the (possibly collapsed)docstringfrom the call to processFunc, you concatenate the two and get a single string. Since you're mapping methodList, you end up with a list of strings. Using thejoinmethod of the string"\n", you join this list into a single string, with each element of the list on a separate line, and print the result. +You're almost finished. Given the padded method name from the
ljustmethod and the (possibly collapsed)docstringfrom the call to processFunc, you concatenate the two and get a single string. Since you're mapping methodList, you end up with a list of strings. Using thejoinmethod of the string"\n", you join this list into a single string, with each element of the list on a separate line, and print the result.Example 4.25. Printing a List
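Here is one way the pieces described above might fit together. This is a sketch in Python 3, not the actual `apihelper.py` source: `format_rows` and the `Sample` class are hypothetical names, and the function returns the joined string instead of printing it so the result is easy to inspect.

```python
def format_rows(obj, method_names, spacing=10, collapse=True):
    """Build one 'name docstring' line per method and join the lines with newlines."""
    process = (lambda s: " ".join(s.split())) if collapse else (lambda s: s)
    return "\n".join(
        "%s %s" % (name.ljust(spacing), process(str(getattr(obj, name).__doc__)))
        for name in method_names
    )


class Sample:
    def ping(self):
        """Return
        'pong'."""
        return "pong"

    def bare(self):      # deliberately has no docstring
        return None


table = format_rows(Sample(), ["ping", "bare"], spacing=6)
print(table)
```

The method without a `docstring` shows up as the string `None` in the second column, which is the case the `str` call exists to handle.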
>>> li = ['a', 'b', 'c'] >>> print "\n".join(li)a @@ -2588,11 +2111,11 @@ c
- - ![]()
This is also a useful debugging trick when you're working with lists. And in Python, you're always working with lists. + This is also a useful debugging trick when you're working with lists. And in Python, you're always working with lists. That's the last piece of the puzzle. You should now understand this code. +
That's the last piece of the puzzle. You should now understand this code.
print "\n".join(["%s %s" % (method.ljust(spacing), @@ -2637,21 +2160,21 @@ sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1,- Recognizing the
and-ortrick and using it safely- Defining
lambdafunctions -- Assigning functions to variables and calling the function by referencing the variable. I can't emphasize this enough, because this mode of thought is vital - to advancing your understanding of Python. You'll see more complex applications of this concept throughout this book. +
- Assigning functions to variables and calling the function by referencing the variable. I can't emphasize this enough, because this mode of thought is vital + to advancing your understanding of Python. You'll see more complex applications of this concept throughout this book.
Chapter 5. Objects and Object-Orientation
This chapter, and pretty much every chapter after this, deals with object-oriented Python programming.
5.1. Diving In
-Here is a complete, working Python program. Read the
docstrings of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't +Here is a complete, working Python program. Read the
docstrings of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't worry about the stuff you don't understand; that's what the rest of the chapter is for.Example 5.1.
fileinfo.pyIf you have not already done so, you can download this and other examples used in this book.
"""Framework for getting filetype-specific metadata. -Instantiate appropriate class with filename. Returned object acts like a +Instantiate appropriate class with filename. Returned object acts like a dictionary, with key-value pairs for each piece of metadata. import fileinfo info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3") @@ -2662,7 +2185,7 @@ Or use listDirectory function to get info on all files in a directory. ... Framework can be extended by adding classes for particular file types, e.g. -HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for +HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for parsing its files appropriately; see MP3FileInfo for example. """ import os @@ -2730,13 +2253,13 @@ if __name__ == "__main__":- ![]()
This program's output depends on the files on your hard drive. To get meaningful output, you'll need to change the directory + This program's output depends on the files on your hard drive. To get meaningful output, you'll need to change the directory path to point to a directory of MP3 files on your own machine. -This is the output I got on my machine. Your output will be different, unless, by some startling coincidence, you share my +
This is the output I got on my machine. Your output will be different, unless, by some startling coincidence, you share my exact taste in music.
album= artist=Ghost in the Machine title=A Time Long Forgotten (Concept @@ -2784,11 +2307,11 @@ genre=255 name=/music/_singles/spinning.mp3 year=2000 comment=http://mp3.com/artists/95/vxp5.2. Importing Modules Using
-from module importPython has two ways of importing modules. Both are useful, and you should know when to use each. One way,
import module, you've already seen in Section 2.4, “Everything Is an Object”. The other way accomplishes the same thing, but it has subtle and important differences. +Python has two ways of importing modules. Both are useful, and you should know when to use each. One way,
import module, you've already seen in Section 2.4, “Everything Is an Object”. The other way accomplishes the same thing, but it has subtle and important differences.Here is the basic
from module importsyntax:from UserDict import UserDict -This is similar to the
import module syntax that you know and love, but with an important difference: the attributes and methods of the imported module are imported directly into the local namespace, so they are available directly, without qualification by module name. You +This is similar to the
import module syntax that you know and love, but with an important difference: the attributes and methods of the imported module are imported directly into the local namespace, so they are available directly, without qualification by module name. You can import individual items or use from module import * to import everything.![]()
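The difference between the two import styles is easiest to see with both in one place. The sketch below assumes Python 3, where the `UserDict` class lives in the `collections` module; the text here describes the older standalone `UserDict` module from Python 2.

```python
import collections                     # style 1: names stay behind the module name
from collections import UserDict       # style 2: name lands in the local namespace

qualified = collections.UserDict()     # must be qualified by the module name
direct = UserDict()                    # available directly, with no qualification

same_class = collections.UserDict is UserDict
print(same_class)
```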
Of course, realistically, most classes will be inherited from other classes, and they will define their own class methods -and attributes. But as you've just seen, there is nothing that a class absolutely must have, other than a name. In particular, -C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Python classes do have something similar to a constructor: the
__init__method. +and attributes. But as you've just seen, there is nothing that a class absolutely must have, other than a name. In particular, +C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Python classes do have something similar to a constructor: the__init__method.Example 5.4. Defining the
FileInfoClassfrom UserDict import UserDict @@ -2919,7 +2442,7 @@ class FileInfo(UserDict):-
In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the FileInfoclass is inherited from theUserDictclass (which was imported from theUserDictmodule).UserDictis a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior. +In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the @@ -2930,12 +2453,12 @@ class FileInfo(UserDict):FileInfoclass is inherited from theUserDictclass (which was imported from theUserDictmodule).UserDictis a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior. (There are similar classesUserListandUserStringwhich allow you to subclass lists and strings.) There is a bit of black magic behind this, which you will demystify later in this chapter when you explore theUserDictclass in more depth.![]()
- -In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. There is no special keyword like + In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. There is no special keyword like extendsin Java.Python supports multiple inheritance. In the parentheses following the class name, you can list as many ancestor classes as you +
Python supports multiple inheritance. In the parentheses following the class name, you can list as many ancestor classes as you like, separated by commas.
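As a small illustration of the parenthesized ancestor list, here is a sketch in Python 3 (an assumption). The classes `Named`, `Tagged`, and `Track` are invented for the example and are not part of `fileinfo.py`.

```python
class Named:
    def describe(self):
        return "named"


class Tagged:
    def tags(self):
        return ["demo"]


class Track(Named, Tagged):     # two ancestors, separated by a comma
    pass


t = Track()
bases = [base.__name__ for base in Track.__bases__]
print(bases, t.describe(), t.tags())
```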
5.3.1. Initializing and Coding Classes
This example shows the initialization of the
FileInfoclass using the__init__method. @@ -2953,15 +2476,15 @@ class FileInfo(UserDict):- ![]()
__init__is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor - of the class. It's tempting, because it looks like a constructor (by convention,__init__is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance - of the class), and even sounds like one (“init” certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time__init__is called, and you already have a valid reference to the new instance of the class. But__init__is the closest thing you're going to get to a constructor in Python, and it fills much the same role. +__init__is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor + of the class. It's tempting, because it looks like a constructor (by convention,__init__is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance + of the class), and even sounds like one (“init” certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time__init__is called, and you already have a valid reference to the new instance of the class. But__init__is the closest thing you're going to get to a constructor in Python, and it fills much the same role.@@ -2969,7 +2492,7 @@ class FileInfo(UserDict): - ![]()
The first argument of every class method, including __init__, is always a reference to the current instance of the class. By convention, this argument is always namedself. In the__init__method,selfrefers to the newly created object; in other class methods, it refers to the instance whose method was called. Although +The first argument of every class method, including __init__, is always a reference to the current instance of the class. By convention, this argument is always namedself. In the__init__method,selfrefers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specifyselfexplicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically.![]()
@@ -2978,7 +2501,7 @@ class FileInfo(UserDict): __init__methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making - them optional to the caller. In this case, filename has a default value ofNone, which is the Python null value. + them optional to the caller. In this case, filename has a default value ofNone, which is the Python null value.- @@ -3000,7 +2523,7 @@ class FileInfo(UserDict):By convention, the first argument of any Python class method (the reference to the current instance) is called self. This argument fills the role of the reserved wordthisin C++ or Java, butselfis not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything butself; this is a very strong convention. +By convention, the first argument of any Python class method (the reference to the current instance) is called self. This argument fills the role of the reserved wordthisin C++ or Java, butselfis not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything butself; this is a very strong convention.- ![]()
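The rules about `self`, default arguments, and calling the ancestor can be seen together in a short class. This sketch is modeled on the `FileInfo` class discussed here but written for Python 3, which is an assumption: there `UserDict` is imported from `collections` instead of from its own module.

```python
from collections import UserDict


class FileInfo(UserDict):
    "store file metadata"

    def __init__(self, filename=None):     # self is listed explicitly; filename is optional
        UserDict.__init__(self)            # ancestor's __init__ is called by hand, passing self
        self["name"] = filename            # the instance behaves like a dictionary


f = FileInfo("/music/_singles/kairo.mp3")  # self is not passed; Python supplies it
empty = FileInfo()                         # filename falls back to None

print(f["name"], empty["name"], f.__class__.__name__)
```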
I told you that this class acts like a dictionary, and here is the first sign of it. You're assigning the argument filename as the value of this object's namekey. +I told you that this class acts like a dictionary, and here is the first sign of it. You're assigning the argument filename as the value of this object's namekey.@@ -3011,16 +2534,16 @@ class FileInfo(UserDict): 5.3.2. Knowing When to Use
-selfand__init__When defining your class methods, you must explicitly list
selfas the first argument for each method, including__init__. When you call a method of an ancestor class from within your class, you must include theselfargument. But when you call your class method from outside, you do not specify anything for theselfargument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent, +When defining your class methods, you must explicitly list
selfas the first argument for each method, including__init__. When you call a method of an ancestor class from within your class, you must include theselfargument. But when you call your class method from outside, you do not specify anything for theselfargument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent, but it may appear inconsistent because it relies on a distinction (between bound and unbound methods) that you don't know about yet. -Whew. I realize that's a lot to absorb, but you'll get the hang of it. All Python classes work the same way, so once you learn one, you've learned them all. If you forget everything else, remember this +
Whew. I realize that's a lot to absorb, but you'll get the hang of it. All Python classes work the same way, so once you learn one, you've learned them all. If you forget everything else, remember this one thing, because I promise it will trip you up:
@@ -3080,23 +2603,23 @@ class FileInfo(UserDict):
- @@ -3038,8 +2561,8 @@ class FileInfo(UserDict):__init__methods are optional, but when you define one, you must remember to explicitly call the ancestor's__init__method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, +__init__methods are optional, but when you define one, you must remember to explicitly call the ancestor's__init__method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments.5.4. Instantiating Classes
-Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the -
__init__method defines. The return value will be the newly created object. +Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the +
__init__method defines. The return value will be the newly created object.Example 5.7. Creating a
FileInfoInstance>>> import fileinfo >>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3")>>> f.__class__
@@ -3052,26 +2575,26 @@ class FileInfo(UserDict):
- ![]()
You are creating an instance of the FileInfoclass (defined in thefileinfomodule) and assigning the newly created instance to the variable f. You are passing one parameter,/music/_singles/kairo.mp3, which will end up as the filename argument inFileInfo's__init__method. +You are creating an instance of the FileInfoclass (defined in thefileinfomodule) and assigning the newly created instance to the variable f. You are passing one parameter,/music/_singles/kairo.mp3, which will end up as the filename argument inFileInfo's__init__method.- ![]()
Every class instance has a built-in attribute, __class__, which is the object's class. (Note that the representation of this includes the physical address of the instance on my - machine; your representation will be different.) Java programmers may be familiar with theClassclass, which contains methods likegetNameandgetSuperclassto get metadata information about an object. In Python, this kind of metadata is available directly on the object itself through attributes like__class__,__name__, and__bases__. +Every class instance has a built-in attribute, __class__, which is the object's class. (Note that the representation of this includes the physical address of the instance on my + machine; your representation will be different.) Java programmers may be familiar with theClassclass, which contains methods likegetNameandgetSuperclassto get metadata information about an object. In Python, this kind of metadata is available directly on the object itself through attributes like__class__,__name__, and__bases__.- ![]()
You can access the instance's docstring just as with a function or a module. All instances of a class share the same docstring. +You can access the instance's docstring just as with a function or a module. All instances of a class share the same docstring.- ![]()
Remember when the __init__method assigned its filename argument toself["name"]? Well, here's the result. The arguments you pass when you create the class instance get sent right along to the__init__method (along with the object reference,self, which Python adds for free). +Remember when the __init__method assigned its filename argument toself["name"]? Well, here's the result. The arguments you pass when you create the class instance get sent right along to the__init__method (along with the object reference,self, which Python adds for free).- In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit newoperator like C++ or Java. +In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit newoperator like C++ or Java.5.4.1. Garbage Collection
-If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances, - because they are freed automatically when the variables assigned to them go out of scope. Memory leaks are rare in Python. +
If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances, + because they are freed automatically when the variables assigned to them go out of scope. Memory leaks are rare in Python.
Example 5.8. Trying to Implement a Memory Leak
>>> def leakmem(): -... f = fileinfo.FileInfo('/music/_singles/kairo.mp3')-... +... f = fileinfo.FileInfo('/music/_singles/kairo.mp3')
+... >>> for i in range(100): -... leakmem()
+... leakmem()The technical term for this form of garbage collection is “reference counting”. Python keeps a count of the references to every instance created. In the above example, there was only one reference to the
FileInfoinstance: the local variable f. When the function ends, the variable f goes out of scope, so the reference count drops to0, and Python destroys the instance automatically. -In previous versions of Python, there were situations where reference counting failed, and Python couldn't clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list, +
The technical term for this form of garbage collection is “reference counting”. Python keeps a count of the references to every instance created. In the above example, there was only one reference to the
FileInfoinstance: the local variable f. When the function ends, the variable f goes out of scope, so the reference count drops to0, and Python destroys the instance automatically. +In previous versions of Python, there were situations where reference counting failed, and Python couldn't clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list, where each node has a pointer to the previous and next node in the list), neither instance would ever be destroyed automatically - because Python (correctly) believed that there is always a reference to each instance. Python 2.0 has an additional form of garbage collection called “mark-and-sweep” which is smart enough to notice this virtual gridlock and clean up circular references correctly. + because Python (correctly) believed that there is always a reference to each instance. Python 2.0 has an additional form of garbage collection called “mark-and-sweep” which is smart enough to notice this virtual gridlock and clean up circular references correctly.
As a former philosophy major, it disturbs me to think that things disappear when no one is looking at them, but that's exactly - what happens in Python. In general, you can simply forget about memory management and let Python clean up after you. + what happens in Python. In general, you can simply forget about memory management and let Python clean up after you.
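If you want to watch both mechanisms at work, here is a minimal sketch. It assumes CPython (other implementations may free objects at a different time) and modern Python 3 syntax; the `Node` class and the helper functions are hypothetical stand-ins, not part of `fileinfo`. A weak reference lets you check whether an instance still exists without keeping it alive yourself.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for FileInfo; any class instance behaves the same."""

    def __init__(self, name):
        self.name = name
        self.other = None


def make_and_drop():
    # The only strong reference is the local variable `node`.
    node = Node("temp")
    return weakref.ref(node)  # a weak reference does not keep the instance alive


def make_cycle():
    # Two instances that reference each other, like neighbours in a linked list.
    a, b = Node("a"), Node("b")
    a.other, b.other = b, a
    return weakref.ref(a)


gc.disable()  # keep the automatic cycle collector out of the experiment
try:
    lone = make_and_drop()
    lone_freed_immediately = lone() is None  # reference count hit 0 on return

    cyc = make_cycle()
    cycle_alive_before_collect = cyc() is not None  # counts never reach 0
    gc.collect()  # ask the cycle collector to run once
    cycle_freed_after_collect = cyc() is None
finally:
    gc.enable()
```

The lone instance disappears as soon as the function returns, while the pair that reference each other survive until the cycle collector runs.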
Further Reading on Garbage Collection
@@ -3121,7 +2644,7 @@ class FileInfo(UserDict):
5.5. Exploring
-UserDict: A Wrapper ClassAs you've seen,
FileInfois a class that acts like a dictionary. To explore this further, let's look at theUserDictclass in theUserDictmodule, which is the ancestor of theFileInfoclass. This is nothing special; the class is written in Python and stored in a.pyfile, just like any other Python code. In particular, it's stored in thelibdirectory in your Python installation.+
As you've seen,
FileInfois a class that acts like a dictionary. To explore this further, let's look at theUserDictclass in theUserDictmodule, which is the ancestor of theFileInfoclass. This is nothing special; the class is written in Python and stored in a.pyfile, just like any other Python code. In particular, it's stored in thelibdirectory in your Python installation.@@ -3182,8 +2705,8 @@ class UserDict:
@@ -3147,31 +2670,31 @@ class UserDict: -
This is the __init__method that you overrode in theFileInfoclass. Note that the argument list in this ancestor class is different than the descendant. That's okay; each subclass can have - its own set of arguments, as long as it calls the ancestor with the correct arguments. Here the ancestor class has a way +This is the __init__method that you overrode in theFileInfoclass. Note that the argument list in this ancestor class is different than the descendant. That's okay; each subclass can have + its own set of arguments, as long as it calls the ancestor with the correct arguments. Here the ancestor class has a way to define initial values (by passing a dictionary in the dict argument) which theFileInfodoes not use.- ![]()
Python supports data attributes (called “instance variables” in Java and Powerbuilder, and “member variables” in C++). Data attributes are pieces of data held by a specific instance of a class. In this case, each instance of UserDictwill have a data attribute data. To reference this attribute from code outside the class, you qualify it with the instance name,instance.data, in the same way that you qualify a function with its module name. To reference a data attribute from within the class, - you useselfas the qualifier. By convention, all data attributes are initialized to reasonable values in the__init__method. However, this is not required, since data attributes, like local variables, spring into existence when they are first assigned a value. +Python supports data attributes (called “instance variables” in Java and Powerbuilder, and “member variables” in C++). Data attributes are pieces of data held by a specific instance of a class. In this case, each instance of UserDictwill have a data attribute data. To reference this attribute from code outside the class, you qualify it with the instance name,instance.data, in the same way that you qualify a function with its module name. To reference a data attribute from within the class, + you useselfas the qualifier. By convention, all data attributes are initialized to reasonable values in the__init__method. However, this is not required, since data attributes, like local variables, spring into existence when they are first assigned a value.- ![]()
The updatemethod is a dictionary duplicator: it copies all the keys and values from one dictionary to another. This does not clear the target dictionary first; if the target dictionary already has some keys, the ones from the source dictionary will - be overwritten, but others will be left untouched. Think ofupdateas a merge function, not a copy function. +The updatemethod is a dictionary duplicator: it copies all the keys and values from one dictionary to another. This does not clear the target dictionary first; if the target dictionary already has some keys, the ones from the source dictionary will + be overwritten, but others will be left untouched. Think ofupdateas a merge function, not a copy function.- ![]()
This is a syntax you may not have seen before (I haven't used it in the examples in this book). It's an ifstatement, but instead of having an indented block starting on the next line, there is just a single statement on the same - line, after the colon. This is perfectly legal syntax, which is just a shortcut you can use when you have only one statement - in a block. (It's like specifying a single statement without braces in C++.) You can use this syntax, or you can have indented code on subsequent lines, but you can't do both for the same block. +This is a syntax you may not have seen before (I haven't used it in the examples in this book). It's an ifstatement, but instead of having an indented block starting on the next line, there is just a single statement on the same + line, after the colon. This is perfectly legal syntax, which is just a shortcut you can use when you have only one statement + in a block. (It's like specifying a single statement without braces in C++.) You can use this syntax, or you can have indented code on subsequent lines, but you can't do both for the same block.Java and Powerbuilder support function overloading by argument list, i.e. one class can have multiple methods with the same name but a different number of arguments, or arguments of different types. Other languages (most notably PL/SQL) even support function overloading by argument name; i.e. one class can have multiple methods with the same name and the same number of arguments of the same type but different argument - names. Python supports neither of these; it has no form of function overloading whatsoever. Methods are defined solely by their name, - and there can be only one method per class with a given name. So if a descendant class has an
__init__method, it always overrides the ancestor__init__method, even if the descendant defines it with a different argument list. And the same rule applies to any other method. + names. Python supports neither of these; it has no form of function overloading whatsoever. Methods are defined solely by their name, + and there can be only one method per class with a given name. So if a descendant class has an__init__method, it always overrides the ancestor__init__method, even if the descendant defines it with a different argument list. And the same rule applies to any other method.@@ -3202,7 +2725,7 @@ class UserDict:
@@ -3221,8 +2744,8 @@ class UserDict:![]()
- Always assign an initial value to all of an instance's data attributes in the __init__method. It will save you hours of debugging later, tracking downAttributeErrorexceptions because you're referencing uninitialized (and therefore non-existent) attributes. +Always assign an initial value to all of an instance's data attributes in the __init__method. It will save you hours of debugging later, tracking downAttributeErrorexceptions because you're referencing uninitialized (and therefore non-existent) attributes.-
clearis a normal class method; it is publicly available to be called by anyone at any time. Notice thatclear, like all class methods, hasselfas its first argument. (Remember that you don't includeselfwhen you call the method; it's something that Python adds for you.) Also note the basic technique of this wrapper class: store a real dictionary (data) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding - method on the real dictionary. (In case you'd forgotten, a dictionary'sclearmethod deletes all of its keys and their associated values.) +clearis a normal class method; it is publicly available to be called by anyone at any time. Notice thatclear, like all class methods, hasselfas its first argument. (Remember that you don't includeselfwhen you call the method; it's something that Python adds for you.) Also note the basic technique of this wrapper class: store a real dictionary (data) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding + method on the real dictionary. (In case you'd forgotten, a dictionary'sclearmethod deletes all of its keys and their associated values.)@@ -3235,14 +2758,14 @@ class UserDict: -
You use the __class__ attribute to see if self is a UserDict; if so, you're golden, because you know how to copy a UserDict: just create a new UserDict and give it the real dictionary that you've squirreled away in self.data. Then you immediately return the new UserDict; you don't even get to the import copy on the next line. +You use the __class__ attribute to see if self is a UserDict; if so, you're golden, because you know how to copy a UserDict: just create a new UserDict and give it the real dictionary that you've squirreled away in self.data. Then you immediately return the new UserDict; you don't even get to the import copy on the next line.@@ -3258,12 +2781,12 @@ class UserDict: - ![]()
If self.__class__is notUserDict, thenselfmust be some subclass ofUserDict(like maybeFileInfo), in which case life gets trickier.UserDictdoesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined - in the subclass, so you would need to iterate through them and make sure to copy all of them. Luckily, Python comes with a module to do exactly this, and it's calledcopy. I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own). +If self.__class__is notUserDict, thenselfmust be some subclass ofUserDict(like maybeFileInfo), in which case life gets trickier.UserDictdoesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined + in the subclass, so you would need to iterate through them and make sure to copy all of them. Luckily, Python comes with a module to do exactly this, and it's calledcopy. I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own). Suffice it to say thatcopycan copy arbitrary Python objects, and that's how you're using it here.![]()
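To see why a subclass needs more than "wrap the dictionary again", here is a small sketch. It assumes Python 3, where `UserDict` lives in the `collections` module; `TaggedDict` and `Track` are hypothetical classes invented for illustration.

```python
import copy
from collections import UserDict  # Python 3 home of the UserDict class


class TaggedDict(UserDict):
    """Hypothetical subclass that carries one extra data attribute."""

    def __init__(self, initial=None, tag=""):
        super().__init__(initial)
        self.tag = tag


original = TaggedDict({"name": "kairo.mp3"}, tag="demo")
duplicate = original.copy()  # subclass path: the copy module does the work
equal_right_after_copy = duplicate == original
duplicate["genre"] = 31      # changing the copy leaves the original alone


class Track:
    """Hypothetical plain class, to show copy works on arbitrary objects."""

    def __init__(self, title):
        self.title = title
        self.tags = ["single"]


track = Track("kairo")
shallow = copy.copy(track)   # new object, but attribute values are shared
deep = copy.deepcopy(track)  # new object with recursively copied attributes
```

The copy keeps both the subclass type and the extra `tag` attribute, which a plain "create a new `UserDict`" would lose.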
- -In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries. To compensate for - this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: UserString,UserList, andUserDict. Using a combination of normal and special methods, theUserDictclass does an excellent imitation of a dictionary. In Python 2.2 and later, you can inherit classes directly from built-in datatypes likedict. An example of this is given in the examples that come with this book, infileinfo_fromdict.py. +In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries. To compensate for + this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: UserString,UserList, andUserDict. Using a combination of normal and special methods, theUserDictclass does an excellent imitation of a dictionary. In Python 2.2 and later, you can inherit classes directly from built-in datatypes likedict. An example of this is given in the examples that come with this book, infileinfo_fromdict.py.In Python, you can inherit directly from the
dictbuilt-in datatype, as shown in this example. There are three differences here compared to theUserDictversion. +In Python, you can inherit directly from the
dictbuilt-in datatype, as shown in this example. There are three differences here compared to theUserDictversion.Example 5.11. Inheriting Directly from Built-In Datatype
dictclass FileInfo(dict):"store file metadata" @@ -3274,13 +2797,13 @@ class FileInfo(dict):
- ![]()
The first difference is that you don't need to import the UserDict module, since dict is a built-in datatype and is always available. The second is that you are inheriting from dict directly, instead of from UserDict.UserDict. +The first difference is that you don't need to import the UserDict module, since dict is a built-in datatype and is always available. The second is that you are inheriting from dict directly, instead of from UserDict.UserDict.@@ -3291,10 +2814,10 @@ class FileInfo(dict): - ![]()
The third difference is subtle but important. Because of the way UserDict works internally, it requires you to manually call its __init__ method to properly initialize its internal data structures. dict does not work like this; it is not a wrapper, and it requires no explicit initialization. +The third difference is subtle but important. Because of the way UserDict works internally, it requires you to manually call its __init__ method to properly initialize its internal data structures. dict does not work like this; it is not a wrapper, and it requires no explicit initialization.![]()
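The three differences can be put side by side in a short sketch. It assumes Python 3, where the wrapper is imported as `collections.UserDict`; the class names `WrappedInfo`, `DirectInfo`, and `BrokenInfo` are hypothetical and only loosely modeled on `FileInfo`.

```python
from collections import UserDict  # Python 3 location of the wrapper class


class WrappedInfo(UserDict):
    """Wrapper style: the ancestor __init__ must run to create self.data."""

    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename


class DirectInfo(dict):
    """Built-in style: no import and no ancestor __init__ call required."""

    def __init__(self, filename=None):
        self["name"] = filename


class BrokenInfo(UserDict):
    """Wrapper style with the ancestor call forgotten."""

    def __init__(self, filename=None):
        self["name"] = filename  # self.data does not exist yet


wrapped = WrappedInfo("kairo.mp3")
direct = DirectInfo("kairo.mp3")

try:
    BrokenInfo("kairo.mp3")
    broken_error = None
except AttributeError as exc:
    broken_error = exc  # the wrapper has no internal dictionary to write to
```

Note that `direct` really is a dictionary, while `wrapped` only holds one in its `data` attribute.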
5.6. Special Class Methods
-In addition to normal class methods, there are a number of special methods that Python classes can define. Instead of being called directly by your code (like normal methods), special methods are called for +
In addition to normal class methods, there are a number of special methods that Python classes can define. Instead of being called directly by your code (like normal methods), special methods are called for you by Python in particular circumstances or when specific syntax is used. -
As you saw in the previous section, normal methods go a long way towards wrapping a dictionary in a class. But normal methods alone are not enough, because -there are a lot of things you can do with dictionaries besides call methods on them. For starters, you can get and set items with a syntax that doesn't include explicitly invoking methods. This is where special class methods come in: they +
As you saw in the previous section, normal methods go a long way towards wrapping a dictionary in a class. But normal methods alone are not enough, because +there are a lot of things you can do with dictionaries besides call methods on them. For starters, you can get and set items with a syntax that doesn't include explicitly invoking methods. This is where special class methods come in: they provide a way to map non-method-calling syntax into method calls.
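As a rough illustration of that mapping, the sketch below records every special method call so you can see which syntax triggered it. The `Recorder` class is hypothetical and deliberately much smaller than `UserDict`.

```python
class Recorder:
    """Hypothetical wrapper that logs each special method call."""

    def __init__(self):
        self.data = {}
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(("get", key))
        return self.data[key]

    def __setitem__(self, key, item):
        self.calls.append(("set", key, item))
        self.data[key] = item


f = Recorder()
f["genre"] = 32     # Python calls f.__setitem__("genre", 32)
value = f["genre"]  # Python calls f.__getitem__("genre")
```

Neither line calls a method by name, yet both methods run.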
5.6.1. Getting and Setting Items
Example 5.12. The
__getitem__Special Method@@ -3309,14 +2832,14 @@ provide a way to map non-method-calling syntax into method calls.- ![]()
The __getitem__special method looks simple enough. Like the normal methodsclear,keys, andvalues, it just redirects to the dictionary to return its value. But how does it get called? Well, you can call__getitem__directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works. The right way +The __getitem__special method looks simple enough. Like the normal methodsclear,keys, andvalues, it just redirects to the dictionary to return its value. But how does it get called? Well, you can call__getitem__directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works. The right way to use__getitem__is to get Python to call it for you.@@ -3334,22 +2857,22 @@ provide a way to map non-method-calling syntax into method calls. - ![]()
This looks just like the syntax you would use to get a dictionary value, and in fact it returns the value you would expect. But here's the missing link: under the covers, Python has converted this syntax to the method call f.__getitem__("name"). That's why__getitem__is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax. +This looks just like the syntax you would use to get a dictionary value, and in fact it returns the value you would expect. But here's the missing link: under the covers, Python has converted this syntax to the method call f.__getitem__("name"). That's why__getitem__is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax.- ![]()
Like the __getitem__ method, __setitem__ simply redirects to the real dictionary self.data to do its work. And like __getitem__, you wouldn't ordinarily call it directly like this; Python calls __setitem__ for you when you use the right syntax. +Like the __getitem__ method, __setitem__ simply redirects to the real dictionary self.data to do its work. And like __getitem__, you wouldn't ordinarily call it directly like this; Python calls __setitem__ for you when you use the right syntax.- - ![]()
This looks like regular dictionary syntax, except of course that f is really a class that's trying very hard to masquerade as a dictionary, and __setitem__is an essential part of that masquerade. This line of code actually callsf.__setitem__("genre", 32)under the covers. +This looks like regular dictionary syntax, except of course that f is really a class that's trying very hard to masquerade as a dictionary, and __setitem__is an essential part of that masquerade. This line of code actually callsf.__setitem__("genre", 32)under the covers.
__setitem__is a special class method because it gets called for you, but it's still a class method. Just as easily as the__setitem__method was defined inUserDict, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act +
__setitem__is a special class method because it gets called for you, but it's still a class method. Just as easily as the__setitem__method was defined inUserDict, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act like dictionaries in some ways but define their own behavior above and beyond the built-in dictionary. -This concept is the basis of the entire framework you're studying in this chapter. Each file type can have a handler class - that knows how to get metadata from a particular type of file. Once some attributes (like the file's name and location) are - known, the handler class knows how to derive other attributes automatically. This is done by overriding the
__setitem__method, checking for particular keys, and adding additional processing when they are found. -For example,
MP3FileInfois a descendant ofFileInfo. When anMP3FileInfo'snameis set, it doesn't just set thenamekey (like the ancestorFileInfodoes); it also looks in the file itself for MP3 tags and populates a whole set of keys. The next example shows how this works. +This concept is the basis of the entire framework you're studying in this chapter. Each file type can have a handler class + that knows how to get metadata from a particular type of file. Once some attributes (like the file's name and location) are + known, the handler class knows how to derive other attributes automatically. This is done by overriding the
__setitem__method, checking for particular keys, and adding additional processing when they are found. +For example,
MP3FileInfois a descendant ofFileInfo. When anMP3FileInfo'snameis set, it doesn't just set thenamekey (like the ancestorFileInfodoes); it also looks in the file itself for MP3 tags and populates a whole set of keys. The next example shows how this works.Example 5.14. Overriding
__setitem__inMP3FileInfodef __setitem__(self, key, item):if key == "name" and item:
@@ -3359,7 +2882,7 @@ provide a way to map non-method-calling syntax into method calls.
@@ -3372,13 +2895,13 @@ provide a way to map non-method-calling syntax into method calls. - ![]()
Notice that this __setitem__method is defined exactly the same way as the ancestor method. This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments. (Technically speaking, +Notice that this __setitem__method is defined exactly the same way as the ancestor method. This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments. (Technically speaking, the names of the arguments don't matter; only the number of arguments is important.)- ![]()
The extra processing you do for names is encapsulated in the__parsemethod. This is another class method defined inMP3FileInfo, and when you call it, you qualify it with self. Just calling__parsewould look for a normal function defined outside the class, which is not what you want. Callingself.__parsewill look for a class method defined within the class. This isn't anything new; you reference data attributes the same way. +The extra processing you do for names is encapsulated in the__parsemethod. This is another class method defined inMP3FileInfo, and when you call it, you qualify it with self. Just calling__parsewould look for a normal function defined outside the class, which is not what you want. Callingself.__parsewill look for a class method defined within the class. This isn't anything new; you reference data attributes the same way.@@ -3388,7 +2911,7 @@ provide a way to map non-method-calling syntax into method calls. - ![]()
After doing this extra processing, you want to call the ancestor method. Remember that this is never done for you in Python; you must do it manually. Note that you're calling the immediate ancestor, FileInfo, even though it doesn't have a__setitem__method. That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually +After doing this extra processing, you want to call the ancestor method. Remember that this is never done for you in Python; you must do it manually. Note that you're calling the immediate ancestor, FileInfo, even though it doesn't have a__setitem__method. That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually find and call the__setitem__defined inUserDict.- @@ -3410,14 +2933,14 @@ provide a way to map non-method-calling syntax into method calls.When accessing data attributes within a class, you need to qualify the attribute name: self.attribute. When calling other methods within a class, you need to qualify the method name:self.method. +When accessing data attributes within a class, you need to qualify the attribute name: self.attribute. When calling other methods within a class, you need to qualify the method name:self.method.- ![]()
First, you create an instance of MP3FileInfo, without passing it a filename. (You can get away with this because the filename argument of the__init__method is optional.) SinceMP3FileInfohas no__init__method of its own, Python walks up the ancestor tree and finds the__init__method ofFileInfo. This__init__method manually calls the__init__method ofUserDictand then sets thenamekey to filename, which isNone, since you didn't pass a filename. Thus, mp3file initially looks like a dictionary with one key,name, whose value isNone. +First, you create an instance of MP3FileInfo, without passing it a filename. (You can get away with this because the filename argument of the__init__method is optional.) SinceMP3FileInfohas no__init__method of its own, Python walks up the ancestor tree and finds the__init__method ofFileInfo. This__init__method manually calls the__init__method ofUserDictand then sets thenamekey to filename, which isNone, since you didn't pass a filename. Thus, mp3file initially looks like a dictionary with one key,name, whose value isNone.@@ -3430,7 +2953,7 @@ provide a way to map non-method-calling syntax into method calls. - ![]()
Now the real fun begins. Setting the namekey of mp3file triggers the__setitem__method onMP3FileInfo(notUserDict), which notices that you're setting thenamekey with a real value and callsself.__parse. Although you haven't traced through the__parsemethod yet, you can see from the output that it sets several other keys:album,artist,genre,title,year, andcomment. +Now the real fun begins. Setting the namekey of mp3file triggers the__setitem__method onMP3FileInfo(notUserDict), which notices that you're setting thenamekey with a real value and callsself.__parse. Although you haven't traced through the__parsemethod yet, you can see from the output that it sets several other keys:album,artist,genre,title,year, andcomment.5.7. Advanced Special Class Methods
-Python has more special methods than just
__getitem__and__setitem__. Some of them let you emulate functionality that you may not even know about. +Python has more special methods than just
__getitem__and__setitem__. Some of them let you emulate functionality that you may not even know about.This example shows some of the other special methods in
UserDict.Example 5.16. More Special Methods in
UserDictdef __repr__(self): return repr(self.data)@@ -3445,14 +2968,14 @@ provide a way to map non-method-calling syntax into method calls.
- ![]()
__repr__is a special method that is called when you callrepr(instance). Thereprfunction is a built-in function that returns a string representation of an object. It works on any object, not just class - instances. You're already intimately familiar withreprand you don't even know it. In the interactive window, when you type just a variable name and press the ENTER key, Python usesreprto display the variable's value. Go create a dictionary d with some data and thenprint repr(d)to see for yourself. +__repr__is a special method that is called when you callrepr(instance). Thereprfunction is a built-in function that returns a string representation of an object. It works on any object, not just class + instances. You're already intimately familiar withreprand you don't even know it. In the interactive window, when you type just a variable name and press the ENTER key, Python usesreprto display the variable's value. Go create a dictionary d with some data and thenprint repr(d)to see for yourself.- ![]()
__cmp__is called when you compare class instances. In general, you can compare any two Python objects, not just class instances, by using==. There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they +@@ -3460,14 +2983,14 @@ provide a way to map non-method-calling syntax into method calls. __cmp__is called when you compare class instances. In general, you can compare any two Python objects, not just class instances, by using==. There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they have all the same keys and values, and strings are equal when they are the same length and contain the same sequence of characters. For class instances, you can define the__cmp__method and code the comparison logic yourself, and then you can use==to compare instances of your class and Python will call your__cmp__special method for you.- ![]()
__len__is called when you calllen(instance). Thelenfunction is a built-in function that returns the length of an object. It works on any object that could reasonably be thought - of as having a length. Thelenof a string is its number of characters; thelenof a dictionary is its number of keys; thelenof a list or tuple is its number of elements. For class instances, define the__len__method and code the length calculation yourself, and then calllen(instance)and Python will call your__len__special method for you. +__len__is called when you calllen(instance). Thelenfunction is a built-in function that returns the length of an object. It works on any object that could reasonably be thought + of as having a length. Thelenof a string is its number of characters; thelenof a dictionary is its number of keys; thelenof a list or tuple is its number of elements. For class instances, define the__len__method and code the length calculation yourself, and then calllen(instance)and Python will call your__len__special method for you.@@ -3476,13 +2999,13 @@ provide a way to map non-method-calling syntax into method calls. - ![]()
__delitem__is called when you calldel instance[key], which you may remember as the way to delete individual items from a dictionary. When you usedelon a class instance, Python calls the__delitem__special method for you. +__delitem__is called when you calldel instance[key], which you may remember as the way to delete individual items from a dictionary. When you usedelon a class instance, Python calls the__delitem__special method for you.- -In Java, you determine whether two string variables reference the same physical memory location by using str1 == str2. This is called object identity, and it is written in Python asstr1 is str2. To compare string values in Java, you would usestr1.equals(str2); in Python, you would usestr1 == str2. Java programmers who have been taught to believe that the world is a better place because==in Java compares by identity instead of by value may have a difficult time adjusting to Python's lack of such “gotchas”. +In Java, you determine whether two string variables reference the same physical memory location by using str1 == str2. This is called object identity, and it is written in Python asstr1 is str2. To compare string values in Java, you would usestr1.equals(str2); in Python, you would usestr1 == str2. Java programmers who have been taught to believe that the world is a better place because==in Java compares by identity instead of by value may have a difficult time adjusting to Python's lack of such “gotchas”.At this point, you may be thinking, “All this work just to do something in a class that I can do with a built-in datatype.” And it's true that life would be easier (and the entire
UserDictclass would be unnecessary) if you could inherit from built-in datatypes like a dictionary. But even if you could, special +At this point, you may be thinking, “All this work just to do something in a class that I can do with a built-in datatype.” And it's true that life would be easier (and the entire
UserDictclass would be unnecessary) if you could inherit from built-in datatypes like a dictionary. But even if you could, special methods would still be useful, because they can be used in any class, not just wrapper classes likeUserDict. -Special methods mean that any class can store key/value pairs like a dictionary, just by defining the
__setitem__method. Any class can act like a sequence, just by defining the__getitem__method. Any class that defines the__cmp__method can be compared with==. And if your class represents something that has a length, don't define aGetLengthmethod; define the__len__method and uselen(instance).+
-Special methods mean that any class can store key/value pairs like a dictionary, just by defining the
__setitem__method. Any class can act like a sequence, just by defining the__getitem__method. Any class that defines the__cmp__method can be compared with==. And if your class represents something that has a length, don't define aGetLengthmethod; define the__len__method and uselen(instance).-
@@ -3491,9 +3014,9 @@ methods would still be useful, because they can be used in any class, not just w Python has a lot of other special methods. There's a whole set of them that let classes act like numbers, allowing you to add, -subtract, and do other arithmetic operations on class instances. (The canonical example of this is a class that represents -complex numbers, numbers with both real and imaginary components.) The
__call__method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods +Python has a lot of other special methods. There's a whole set of them that let classes act like numbers, allowing you to add, +subtract, and do other arithmetic operations on class instances. (The canonical example of this is a class that represents +complex numbers, numbers with both real and imaginary components.) The
__call__ method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods that allow classes to have read-only and write-only data attributes; you'll learn more about those in later chapters.
Further Reading on Special Class Methods
@@ -3502,7 +3025,7 @@ that allow classes to have read-only and write-only data attributes; you'll talk5.8. Introducing Class Attributes
-You already know about data attributes, which are variables owned by a specific instance of a class. Python also supports class attributes, which are variables owned by the class itself. +
You already know about data attributes, which are variables owned by a specific instance of a class. Python also supports class attributes, which are variables owned by the class itself.
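As a minimal sketch of the distinction, written for Python 3: the class and attribute names below are hypothetical and are not taken from the fileinfo.py sample. A class attribute can be read from the class itself before any instance exists, while a data attribute belongs to a single instance.

```python
class Counter:
    """Hypothetical class used only to illustrate class attributes."""

    count = 0  # class attribute: owned by the class itself

    def __init__(self, name):
        self.name = name    # data attribute: owned by this one instance
        Counter.count += 1  # update the attribute shared by the whole class


before = Counter.count  # readable before any instance has been created
a = Counter("a")
b = Counter("b")

# The class and both instances all see the same shared value.
shared = (Counter.count, a.count, b.count)
```

Reading `a.count` works because attribute lookup falls back to the class when the instance has no attribute of that name.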
Example 5.17. Introducing Class Attributes
class MP3FileInfo(FileInfo): "store ID3v1.0 MP3 tags" @@ -3539,7 +3062,7 @@ class MP3FileInfo(FileInfo):- ![]()
tagDataMap is a class attribute: literally, an attribute of the class. It is available before creating any instances of the class. + tagDataMap is a class attribute: literally, an attribute of the class. It is available before creating any instances of the class. @@ -3553,24 +3076,24 @@ class MP3FileInfo(FileInfo): - In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the statickeyword, one without). In Python, only class attributes can be defined here; data attributes are defined in the__init__method. +In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the statickeyword, one without). In Python, only class attributes can be defined here; data attributes are defined in the__init__method.Class attributes can be used as class-level constants (which is how you use them in
MP3FileInfo), but they are not really constants. You can also change them.Unlike in most languages, whether a Python function, method, or attribute is private or public is determined entirely by its name.
If the name of a Python function, class method, or attribute starts with (but doesn't end with) two underscores, it's private; everything else is -public. Python has no concept of protected class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible +public. Python has no concept of protected class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible only in their own class) or public (accessible from anywhere). -
In
MP3FileInfo, there are two methods:__parseand__setitem__. As you have already discussed,__setitem__is a special method; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could -call it directly (even from outside thefileinfomodule) if you had a really good reason. However,__parseis private, because it has two underscores at the beginning of its name.+
@@ -3834,7 +3357,7 @@ exceptions, errors occur immediately, and you can handle them in a standard wayIn
MP3FileInfo, there are two methods: __parse and __setitem__. As you have already seen, __setitem__ is a special method; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could +call it directly (even from outside the fileinfo module) if you had a really good reason. However, __parse is private, because it has two underscores at the beginning of its name.
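The naming rule can be sketched as follows, in Python 3. The `Tagged` class is a hypothetical stand-in, not the real MP3FileInfo; it has one private method and one special method so that both kinds of call can be tried from outside the class.

```python
class Tagged:
    """Hypothetical stand-in with one private and one special method."""

    def __init__(self):
        self.data = {}

    def __parse(self, raw):  # private: starts with two underscores
        return raw.strip()

    def __setitem__(self, key, raw):  # special method: public
        # Calling the private method from inside the class works normally.
        self.data[key] = self.__parse(raw)


t = Tagged()
t["title"] = "  Kairo  "         # dictionary syntax calls __setitem__
t.__setitem__("artist", " DJ ")  # calling it directly also works

try:
    t.__parse("x")  # calling the private method from outside the class
    private_call_failed = False
except AttributeError:
    private_call_failed = True
```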
- @@ -3653,9 +3176,9 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse'In Python, all special methods (like __setitem__) and built-in attributes (like__doc__) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and +In Python, all special methods (like __setitem__) and built-in attributes (like__doc__) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and attributes this way, because it will only confuse you (and others) later.- ![]()
If you try to call a private method, Python will raise a slightly misleading exception, saying that the method does not exist. Of course it does exist, but it's private, - so it's not accessible outside the class.Strictly speaking, private methods are accessible outside their class, just not easily accessible. Nothing in Python is truly private; internally, the names of private methods and attributes are mangled and unmangled on the fly to make them - seem inaccessible by their given names. You can access the __parsemethod of theMP3FileInfoclass by the name_MP3FileInfo__parse. Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a +If you try to call a private method, Python will raise a slightly misleading exception, saying that the method does not exist. Of course it does exist, but it's private, + so it's not accessible outside the class.Strictly speaking, private methods are accessible outside their class, just not easily accessible. Nothing in Python is truly private; internally, the names of private methods and attributes are mangled and unmangled on the fly to make them + seem inaccessible by their given names. You can access the @@ -3667,7 +3190,7 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse'__parsemethod of theMP3FileInfoclass by the name_MP3FileInfo__parse. Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a reason, but like many other things in Python, their privateness is ultimately a matter of convention, not force.5.10. Summary
-That's it for the hard-core object trickery. You'll see a real-world application of special class methods in Chapter 12, which uses
getattrto create a proxy to a remote web service. +That's it for the hard-core object trickery. You'll see a real-world application of special class methods in Chapter 12, which uses
getattrto create a proxy to a remote web service.The next chapter will continue using this code sample to explore other Python concepts, such as exceptions, file objects, and
forloops.Before diving into the next chapter, make sure you're comfortable doing all of these things:
@@ -3685,18 +3208,18 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse'Chapter 6. Exceptions and File Handling
-In this chapter, you will dive into exceptions, file objects,
forloops, and theosandsysmodules. If you've used exceptions in another programming language, you can skim the first section to get a sense of Python's syntax. Be sure to tune in again for file handling. +In this chapter, you will dive into exceptions, file objects,
forloops, and theosandsysmodules. If you've used exceptions in another programming language, you can skim the first section to get a sense of Python's syntax. Be sure to tune in again for file handling.6.1. Handling Exceptions
Like many other programming languages, Python has exception handling via
try...exceptblocks.-
- Python uses try...exceptto handle exceptions andraiseto generate them. Java and C++ usetry...catchto handle exceptions, andthrowto generate them. +Python uses try...exceptto handle exceptions andraiseto generate them. Java and C++ usetry...catchto handle exceptions, andthrowto generate them.Exceptions are everywhere in Python. Virtually every module in the standard Python library uses them, and Python itself will raise them in a lot of different circumstances. You've already seen them repeatedly throughout this book. +
Exceptions are everywhere in Python. Virtually every module in the standard Python library uses them, and Python itself will raise them in a lot of different circumstances. You've already seen them repeatedly throughout this book.
-
- Accessing a non-existent dictionary key will raise a
KeyErrorexception. @@ -3710,20 +3233,20 @@ AttributeError: 'MP3FileInfo' instance has no attribute '__parse'Mixing datatypes without coercion will raise a TypeErrorexception.In each of these cases, you were simply playing around in the Python IDE: an error occurred, the exception was printed (depending on your IDE, perhaps in an intentionally jarring shade of red), and that was that. This is called an unhandled exception. When the exception was raised, there was no code to explicitly notice it and deal with it, so it bubbled its -way back to the default behavior built in to Python, which is to spit out some debugging information and give up. In the IDE, that's no big deal, but if that happened while your actual Python program was running, the entire program would come to a screeching halt. -
An exception doesn't need result in a complete program crash, though. Exceptions, when raised, can be handled. Sometimes an exception is really because you have a bug in your code (like accessing a variable that doesn't exist), but -many times, an exception is something you can anticipate. If you're opening a file, it might not exist. If you're connecting -to a database, it might be unavailable, or you might not have the correct security credentials to access it. If you know +
In each of these cases, you were simply playing around in the Python IDE: an error occurred, the exception was printed (depending on your IDE, perhaps in an intentionally jarring shade of red), and that was that. This is called an unhandled exception. When the exception was raised, there was no code to explicitly notice it and deal with it, so it bubbled its +way back to the default behavior built in to Python, which is to spit out some debugging information and give up. In the IDE, that's no big deal, but if that happened while your actual Python program was running, the entire program would come to a screeching halt. +
An exception doesn't need to result in a complete program crash, though. Exceptions, when raised, can be handled. Sometimes an exception really does mean you have a bug in your code (like accessing a variable that doesn't exist), but +many times, an exception is something you can anticipate. If you're opening a file, it might not exist. If you're connecting +to a database, it might be unavailable, or you might not have the correct security credentials to access it. If you know a line of code may raise an exception, you should handle the exception using a
try...exceptblock.Example 6.1. Opening a Non-Existent File
>>> fsock = open("/notthere", "r")Traceback (innermost last): File "<interactive input>", line 1, in ? IOError: [Errno 2] No such file or directory: '/notthere' >>> try: -... fsock = open("/notthere")
+... fsock = open("/notthere")
... except IOError:
-... print "The file does not exist, exiting gracefully" +... print "The file does not exist, exiting gracefully" ... print "This line will always print"
The file does not exist, exiting gracefully This line will always print
@@ -3731,7 +3254,7 @@ This line will always print- ![]()
Using the built-in openfunction, you can try to open a file for reading (more onopenin the next section). But the file doesn't exist, so this raises theIOErrorexception. Since you haven't provided any explicit check for anIOErrorexception, Python just prints out some debugging information about what happened and then gives up. +Using the built-in openfunction, you can try to open a file for reading (more onopenin the next section). But the file doesn't exist, so this raises theIOErrorexception. Since you haven't provided any explicit check for anIOErrorexception, Python just prints out some debugging information about what happened and then gives up.@@ -3743,29 +3266,29 @@ This line will always print - ![]()
When the openmethod raises anIOErrorexception, you're ready for it. Theexcept IOError:line catches the exception and executes your own block of code, which in this case just prints a more pleasant error message. +When the openmethod raises anIOErrorexception, you're ready for it. Theexcept IOError:line catches the exception and executes your own block of code, which in this case just prints a more pleasant error message.- ![]()
Once an exception has been handled, processing continues normally on the first line after the try...exceptblock. Note that this line will always print, whether or not an exception occurs. If you really did have a file called +Once an exception has been handled, processing continues normally on the first line after the try...except block. Note that this line will always print, whether or not an exception occurs. If you really did have a file called notthere in your root directory, the call to open would succeed, the except clause would be ignored, and this line would still be executed.
Exceptions may seem unfriendly (after all, if you don't catch the exception, your entire program will crash), but consider -the alternative. Would you rather get back an unusable file object to a non-existent file? You'd need to check its validity +the alternative. Would you rather get back an unusable file object to a non-existent file? You'd need to check its validity somehow anyway, and if you forgot, your program would give you strange errors somewhere down the -line that you would need to trace back to the source. I'm sure you've experienced this, and you know it's not fun. With +line that you would need to trace back to the source. I'm sure you've experienced this, and you know it's not fun. With exceptions, errors occur immediately, and you can handle them in a standard way at the source of the problem.
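Handling the error at the source might look like the sketch below. It is written for Python 3, where `IOError` is an alias of `OSError`. The helper name `try_open` and the temporary paths are assumptions made for illustration only.

```python
import os
import tempfile


def try_open(path):
    """Return a short status string instead of letting IOError propagate."""
    try:
        fsock = open(path)
    except IOError:  # in Python 3 this name is an alias of OSError
        return "could not open file, continuing gracefully"
    fsock.close()
    return "opened"


# Hypothetical paths inside a fresh temporary directory.
workdir = tempfile.mkdtemp()
missing_path = os.path.join(workdir, "notthere")
existing_path = os.path.join(workdir, "there")
open(existing_path, "w").close()  # create an empty file to open later

missing_status = try_open(missing_path)
existing_status = try_open(existing_path)
```

The caller never receives a half-valid file object: either the open succeeded, or the failure was dealt with on the spot.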
6.1.1. Using Exceptions For Other Purposes
-There are a lot of other uses for exceptions besides handling actual error conditions. A common use in the standard Python library is to try to import a module, and then check whether it worked. Importing a module that does not exist will raise - an
ImportErrorexception. You can use this to define multiple levels of functionality based on which modules are available at run-time, +There are a lot of other uses for exceptions besides handling actual error conditions. A common use in the standard Python library is to try to import a module, and then check whether it worked. Importing a module that does not exist will raise + an
ImportErrorexception. You can use this to define multiple levels of functionality based on which modules are available at run-time, or to support multiple platforms (where platform-specific code is separated into different modules). -You can also define your own exceptions by creating a class that inherits from the built-in
Exceptionclass, and then raise your exceptions with theraisecommand. See the further reading section if you're interested in doing this. -The next example demonstrates how to use an exception to support platform-specific functionality. This code comes from the -
getpassmodule, a wrapper module for getting a password from the user. Getting a password is accomplished differently on UNIX, Windows, and Mac OS platforms, but this code encapsulates all of those differences. +You can also define your own exceptions by creating a class that inherits from the built-in
Exceptionclass, and then raise your exceptions with theraisecommand. See the further reading section if you're interested in doing this. +The next example demonstrates how to use an exception to support platform-specific functionality. This code comes from the +
getpassmodule, a wrapper module for getting a password from the user. Getting a password is accomplished differently on UNIX, Windows, and Mac OS platforms, but this code encapsulates all of those differences.Example 6.2. Supporting Platform-Specific Functionality
# Bind the name getpass to the appropriate function try: @@ -3788,34 +3311,34 @@ exceptions, errors occur immediately, and you can handle them in a standard way- ![]()
termiosis a UNIX-specific module that provides low-level control over the input terminal. If this module is not available (because it's not +termiosis a UNIX-specific module that provides low-level control over the input terminal. If this module is not available (because it's not on your system, or your system doesn't support it), the import fails and Python raises anImportError, which you catch.- ![]()
OK, you didn't have termios, so let's trymsvcrt, which is a Windows-specific module that provides an API to many useful functions in the Microsoft Visual C++ runtime services. If this import fails, Python will raise anImportError, which you catch. +OK, you didn't have termios, so let's trymsvcrt, which is a Windows-specific module that provides an API to many useful functions in the Microsoft Visual C++ runtime services. If this import fails, Python will raise anImportError, which you catch.- ![]()
If the first two didn't work, you try to import a function from EasyDialogs, which is a Mac OS-specific module that provides functions to pop up dialog boxes of various types. Once again, if this import fails, Python will raise anImportError, which you catch. +If the first two didn't work, you try to import a function from EasyDialogs, which is a Mac OS-specific module that provides functions to pop up dialog boxes of various types. Once again, if this import fails, Python will raise anImportError, which you catch.![]()
None of these platform-specific modules is available (which is possible, since Python has been ported to a lot of different platforms), so you need to fall back on a default password input function (which is - defined elsewhere in the getpassmodule). Notice what you're doing here: assigning the functiondefault_getpassto the variable getpass. If you read the officialgetpassdocumentation, it tells you that thegetpassmodule defines agetpassfunction. It does this by binding getpass to the correct function for your platform. Then when you call thegetpassfunction, you're really calling a platform-specific function that this code has set up for you. You don't need to know or + defined elsewhere in thegetpassmodule). Notice what you're doing here: assigning the functiondefault_getpassto the variable getpass. If you read the officialgetpassdocumentation, it tells you that thegetpassmodule defines agetpassfunction. It does this by binding getpass to the correct function for your platform. Then when you call thegetpassfunction, you're really calling a platform-specific function that this code has set up for you. You don't need to know or care which platform your code is running on -- just callgetpass, and it will always do the right thing.- ![]()
A try...exceptblock can have anelseclause, like anifstatement. If no exception is raised during thetryblock, theelseclause is executed afterwards. In this case, that means that thefrom EasyDialogs import AskPasswordimport worked, so you should bind getpass to theAskPasswordfunction. Each of the othertry...exceptblocks has similarelseclauses to bind getpass to the appropriate function when you find animportthat works. +A try...exceptblock can have anelseclause, like anifstatement. If no exception is raised during thetryblock, theelseclause is executed afterwards. In this case, that means that thefrom EasyDialogs import AskPasswordimport worked, so you should bind getpass to theAskPasswordfunction. Each of the othertry...exceptblocks has similarelseclauses to bind getpass to the appropriate function when you find animportthat works.6.2. Working with File Objects
-Python has a built-in function,
open, for opening a file on disk.openreturns a file object, which has methods and attributes for getting information about and manipulating the opened file. +Python has a built-in function,
open, for opening a file on disk.openreturns a file object, which has methods and attributes for getting information about and manipulating the opened file.Example 6.3. Opening a File
>>> f = open("/music/_singles/kairo.mp3", "rb")>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988> @@ -3846,15 +3369,15 @@ exceptions, errors occur immediately, and you can handle them in a standard way
- ![]()
The openmethod can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, - is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. +The openmethod can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, + is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. (print open.__doc__displays a great explanation of all the possible modes.)- ![]()
The openfunction returns an object (by now, this should not surprise you). A file object has several useful attributes. +The openfunction returns an object (by now, this should not surprise you). A file object has several useful attributes.@@ -3890,15 +3413,15 @@ Rave Mix 2000http://mp3.com/DJMARYJANE \037' - ![]()
A file object maintains state about the file it has open. The tellmethod of a file object tells you your current position in the open file. Since you haven't done anything with this file +A file object maintains state about the file it has open. The tellmethod of a file object tells you your current position in the open file. Since you haven't done anything with this file yet, the current position is0, which is the beginning of the file.- ![]()
The seekmethod of a file object moves to another position in the open file. The second parameter specifies what the first one means; -0means move to an absolute position (counting from the start of the file),1means move to a relative position (counting from the current position), and2means move to a position relative to the end of the file. Since the MP3 tags you're looking for are stored at the end of the file, you use2and tell the file object to move to a position128bytes from the end of the file. +The seekmethod of a file object moves to another position in the open file. The second parameter specifies what the first one means; +0means move to an absolute position (counting from the start of the file),1means move to a relative position (counting from the current position), and2means move to a position relative to the end of the file. Since the MP3 tags you're looking for are stored at the end of the file, you use2and tell the file object to move to a position128bytes from the end of the file.@@ -3910,21 +3433,21 @@ Rave Mix 2000http://mp3.com/DJMARYJANE \037' - ![]()
The readmethod reads a specified number of bytes from the open file and returns a string with the data that was read. The optional - parameter specifies the maximum number of bytes to read. If no parameter is specified,readwill read until the end of the file. (You could have simply saidread()here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data +The readmethod reads a specified number of bytes from the open file and returns a string with the data that was read. The optional + parameter specifies the maximum number of bytes to read. If no parameter is specified,readwill read until the end of the file. (You could have simply saidread()here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data is assigned to the tagData variable, and the current position is updated based on how many bytes were read.- ![]()
The tellmethod confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position +The tellmethod confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position has been incremented by 128.6.2.2. Closing Files
-Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's +
Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.
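A small sketch of the open-then-close life cycle, written for Python 3 and using a throwaway temporary file (an assumption of this sketch, not the MP3 file used elsewhere in the chapter):

```python
import os
import tempfile

# Create an empty scratch file so the example does not depend on any real path.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "rb")
closed_before = f.closed  # the file object currently has a file open
f.close()                 # release the file as soon as you are done with it
closed_after = f.closed

try:
    f.read()              # the object still exists, but the file is closed
    read_failed = False
except ValueError:
    read_failed = True

os.remove(path)           # clean up the scratch file
```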
Example 6.5. Closing a File
>>> f @@ -3953,13 +3476,13 @@ ValueError: I/O operation on closed file- ![]()
The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False). +The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False).@@ -3972,7 +3495,7 @@ ValueError: I/O operation on closed file - ![]()
To close a file, call the closemethod of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) +To close a file, call the closemethod of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) that the system hadn't gotten around to actually writing yet, and releases the system resources.@@ -3984,7 +3507,7 @@ ValueError: I/O operation on closed file - ![]()
Just because a file is closed doesn't mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; + Just because a file is closed doesn't mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; they all raise an exception. 6.2.3. Handling I/O Errors
-Now you've seen enough to understand the file handling code in the
fileinfo.pysample code from teh previous chapter. This example shows how to safely open and read from a file and gracefully handle +Now you've seen enough to understand the file handling code in the
fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle errors.
Example 6.6. File Objects in
MP3FileInfotry:@@ -4003,50 +3526,50 @@ ValueError: I/O operation on closed file
- ![]()
Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a try...exceptblock. (Hey, isn't standardized indentation great? This is where you start to appreciate it.) +Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a try...exceptblock. (Hey, isn't standardized indentation great? This is where you start to appreciate it.)- ![]()
The openfunction may raise anIOError. (Maybe the file doesn't exist.) +The openfunction may raise anIOError. (Maybe the file doesn't exist.)- ![]()
The seekmethod may raise anIOError. (Maybe the file is smaller than 128 bytes.) +The seekmethod may raise anIOError. (Maybe the file is smaller than 128 bytes.)- ![]()
The readmethod may raise anIOError. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.) +The readmethod may raise anIOError. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.)- ![]()
This is new: a try...finallyblock. Once the file has been opened successfully by theopenfunction, you want to make absolutely sure that you close it, even if an exception is raised by theseekorreadmethods. That's what atry...finallyblock is for: code in thefinallyblock will always be executed, even if something in thetryblock raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before. +This is new: a try...finallyblock. Once the file has been opened successfully by theopenfunction, you want to make absolutely sure that you close it, even if an exception is raised by theseekorreadmethods. That's what atry...finallyblock is for: code in thefinallyblock will always be executed, even if something in thetryblock raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before.- ![]()
At last, you handle your IOErrorexception. This could be theIOErrorexception raised by the call toopen,seek, orread. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember,passis a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the +At last, you handle your IOErrorexception. This could be theIOErrorexception raised by the call toopen,seek, orread. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember,passis a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the next line of code after thetry...exceptblock.6.2.4. Writing to Files
-As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes: +
As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes:
- "Append" mode will add data to the end of the file.
- "write" mode will overwrite the file.
Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly - "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open + "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open it and start writing.
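The difference between the two modes can be sketched as follows, in Python 3. The helper `read_all` and the temporary directory are assumptions of this sketch; the file name and the strings written follow the chapter's own test.log example.

```python
import os
import tempfile


def read_all(path):
    """Return the whole file as a string, closing the file afterwards."""
    f = open(path)
    try:
        return f.read()
    finally:
        f.close()


# Work in a fresh temporary directory so no real log file is touched.
path = os.path.join(tempfile.mkdtemp(), "test.log")

logfile = open(path, "w")  # "w" creates the file, since it does not exist yet
logfile.write("test succeeded")
logfile.close()

logfile = open(path, "a")  # "a" keeps the existing contents and adds to the end
logfile.write("line 2")
logfile.close()
after_append = read_all(path)

logfile = open(path, "w")  # "w" again: the previous contents are discarded
logfile.write("fresh\n")
logfile.close()
after_overwrite = read_all(path)
```

Nothing was written between the first two strings, so they end up on the same line; the last write ends its line explicitly with `"\n"`.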
Example 6.7. Writing to Files
>>> logfile = open('test.log', 'w')@@ -4064,7 +3587,7 @@ test succeededline 2
@@ -4077,21 +3600,21 @@ test succeededline 2 - ![]()
You start boldly by opening test.log for writing, which either creates a new file or overwrites the existing one. (The second parameter "w" means open the file for writing.) Yes, that's all as dangerous as it sounds. I hope you didn't care about the previous contents of that file, because they're gone now.
- ![]()
file is a synonym for open. This one-liner opens the file, reads its contents, and prints them.
- ![]()
You happen to know that test.log exists (since you just finished writing to it), so you can open it and append to it. (The "a" parameter means open the file for appending.) Actually you could do this even if the file didn't exist, because opening the file for appending will create the file if necessary. But appending will never harm the existing contents of the file.
@@ -4108,12 +3631,12 @@ test succeededline 2 - ![]()
As you can see, both the original line you wrote and the second line you appended are now in test.log. Also note that line breaks are not included. Since you didn't write them explicitly to the file either time, the file doesn't include them. You can write a line break with the "\n" character. Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line.
6.3. Iterating with
for Loops
Like most other languages, Python has for loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often.
Most other languages don't have a powerful list datatype like Python's, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters or other iterable entities. But in Python, a for loop simply iterates over a list, the same way list comprehensions work.
Example 6.8. Introducing the
forLoop>>> li = ['a', 'b', 'e'] >>> for s in li:-... print s
+... print s
a b e @@ -4125,7 +3648,7 @@ e
- ![]()
The syntax for a for loop is similar to list comprehensions. li is a list, and s will take the value of each element in turn, starting from the first element.
@@ -4137,14 +3660,14 @@ e - ![]()
This is the reason you haven't seen the for loop yet: you haven't needed it yet. It's amazing how often you use for loops in other languages when all you really want is a join or a list comprehension.
Doing a “normal” (by Visual Basic standards) counter for loop is also simple.
Example 6.9. Simple Counters
>>> for i in range(5):-... print i +... print i 0 1 2 @@ -4152,7 +3675,7 @@ e
4 >>> li = ['a', 'b', 'c', 'd', 'e'] >>> for i in range(len(li)):-... print li[i] +... print li[i] a b c @@ -4163,22 +3686,22 @@ e
- ![]()
As you saw in Example 3.20, “Assigning Consecutive Values”, range produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress occasionally) useful to have a counter loop.
- ![]()
Don't ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example.
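To make the contrast concrete, here is a small sketch that puts the two styles side by side. It is written for Python 3, and the enumerate built-in at the end is an addition that the surrounding text does not cover; treat it as one possible option for the rare case where the position is genuinely needed.

```python
li = ['a', 'b', 'c', 'd', 'e']

# Index-based loop: the style the text warns against.
by_index = []
for i in range(len(li)):
    by_index.append(li[i])

# Direct iteration: same result, no counter to manage.
direct = []
for s in li:
    direct.append(s)

# If the position really is needed, enumerate() supplies it with each element.
numbered = []
for i, s in enumerate(li):
    numbered.append("%d=%s" % (i, s))

print(by_index == direct)   # True
print(numbered)             # ['0=a', '1=b', '2=c', '3=d', '4=e']
```

Both loops build the same list; the direct version simply has fewer moving parts to get wrong.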
for loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a for loop to iterate through a dictionary.
Example 6.10. Iterating Through a Dictionary
>>> import os >>> for k, v in os.environ.items():![]()
-... print "%s=%s" % (k, v) +... print "%s=%s" % (k, v) USERPROFILE=C:\Documents and Settings\mpilgrim OS=Windows_NT COMPUTERNAME=MPILGRIM @@ -4186,7 +3709,7 @@ USERNAME=mpilgrim [...snip...] >>> print "\n".join(["%s=%s" % (k, v) -... for k, v in os.environ.items()])
+... for k, v in os.environ.items()])
USERPROFILE=C:\Documents and Settings\mpilgrim OS=Windows_NT COMPUTERNAME=MPILGRIM @@ -4197,22 +3720,22 @@ USERNAME=mpilgrim
- ![]()
os.environ is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables accessible from MS-DOS. In UNIX, they are the variables exported in your shell's startup scripts. In Mac OS, there is no concept of environment variables, so this dictionary is empty.
- ![]()
os.environ.items() returns a list of tuples: [(key1, value1), (key2, value2), ...]. The for loop iterates through this list. The first round, it assigns key1 to k and value1 to v, so k = USERPROFILE and v = C:\Documents and Settings\mpilgrim. In the second round, k gets the second key, OS, and v gets the corresponding value, Windows_NT.
@@ -4234,26 +3757,26 @@ USERNAME=mpilgrim - ![]()
With multi-variable assignment and list comprehensions, you can replace the entire for loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it because it makes it clear that what I'm doing is mapping a dictionary into a list, then joining the list into a single string. Other programmers prefer to write this out as a for loop. The output is the same in either case, although this version is slightly faster, because there is only one print statement instead of many.
- ![]()
tagDataMap is a class attribute that defines the tags you're looking for in an MP3 file. Tags are stored in fixed-length fields. Once you read the last 128 bytes of the file, bytes 3 through 32 of those are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth. Note that tagDataMap is a dictionary of tuples, and each tuple contains two integers and a function reference.
- ![]()
This looks complicated, but it's not. The structure of the for variables matches the structure of the elements of the list returned by items. Remember that items returns a list of tuples of the form (key, value). The first element of that list is ("title", (3, 33, <function stripnulls>)), so the first time around the loop, tag gets "title", start gets 3, end gets 33, and parseFunc gets the function stripnulls.
- ![]()
Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy. You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self. After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like.
6.4. Using
sys.modules
Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules.
Example 6.12. Introducing
sys.modules>>> import sys>>> print '\n'.join(sys.modules.keys())
win32api @@ -4279,7 +3802,7 @@ stat
@@ -4308,7 +3831,7 @@ stat - ![]()
sys.modules is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported. Python preloads some modules on startup, and if you're using a Python IDE, sys.modules contains all the modules imported by all the programs you've run within the IDE.
- ![]()
As new modules are imported, they are added to sys.modules. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in sys.modules, so importing the second time is simply a dictionary lookup.
@@ -4338,7 +3861,7 @@ stat
Now you're ready to see how sys.modules is used in fileinfo.py, the sample program introduced in Chapter 5. This example shows that portion of the code.
Example 6.15.
insys.modulesfileinfo.pydef getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):"get file info class from filename extension" @@ -4348,21 +3871,21 @@ stat
- ![]()
This is a function with two arguments; filename is required, but module is optional and defaults to the module that contains the FileInfo class. This looks inefficient, because you might expect Python to evaluate the sys.modules expression every time the function is called. In fact, Python evaluates default expressions only once, when the def statement runs, which here is the first time the module is imported. As you'll see later, you never call this function with a module argument, so module serves as a function-level constant.
- ![]()
You'll plow through this line later, after you dive into the os module. For now, take it on faith that subclass ends up as the name of a class, like MP3FileInfo.
@@ -4375,7 +3898,7 @@ stat - ![]()
You already know about getattr, which gets a reference to an object by name. hasattr is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has a particular class (although it works for any object and any attribute, just like getattr). In English, this line of code says, “If this module has the class named by subclass then return it, otherwise return the base class FileInfo.”
6.5. Working with Directories
The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory.
Example 6.16. Constructing Pathnames
>>> import os @@ -4391,27 +3914,27 @@ stat- ![]()
os.path is a reference to a module -- which module depends on your platform. Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module.
- ![]()
The join function of os.path constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing with pathnames on Windows is annoying because the backslash character must be escaped.)
- ![]()
In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language. Do not write this stupid little function in Python; smart people have already taken care of it for you.
@@ -4437,14 +3960,14 @@ stat - ![]()
expanduser will expand a pathname that uses ~ to represent the current user's home directory. This works on any platform where users have a home directory, like Windows, UNIX, and Mac OS X; it has no effect on Mac OS.
- ![]()
The split function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use multi-variable assignment to return multiple values from a function? Well, split is such a function.
- ![]()
You assign the return value of the splitfunction into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple. +You assign the return value of the splitfunction into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.@@ -4462,7 +3985,7 @@ stat @@ -4479,11 +4002,11 @@ stat 'Program Files', 'Python20', 'RECYCLER', 'System Volume Information', 'TEMP', 'WINNT'] >>> [f for f in os.listdir(dirname) -... if os.path.isfile(os.path.join(dirname, f))] - ![]()
os.pathalso contains a functionsplitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique +os.pathalso contains a functionsplitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique to assign each of them to separate variables.+... if os.path.isfile(os.path.join(dirname, f))]
['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS', 'NTDETECT.COM', 'ntldr', 'pagefile.sys'] >>> [f for f in os.listdir(dirname) -... if os.path.isdir(os.path.join(dirname, f))]
+... if os.path.isdir(os.path.join(dirname, f))]
['cygwin', 'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER', 'System Volume Information', 'TEMP', 'WINNT']
@@ -4503,13 +4026,13 @@ stat- ![]()
You can use list filtering and the isfile function of the os.path module to separate the files from the folders. isfile takes a pathname and returns 1 if the path represents a file, and 0 otherwise. Here you're using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory. You can use os.getcwd() to get the current working directory.
@@ -4532,7 +4055,7 @@ def listDirectory(directory, fileExtList): - ![]()
os.path also has an isdir function, which returns 1 if the path represents a directory, and 0 otherwise. You can use this to get a list of the subdirectories within a directory.
- ![]()
Iterating through the list with f, you use os.path.normcase(f) to normalize the case according to operating system defaults. normcase is a useful little function that compensates for case-insensitive operating systems that think that mahadeva.mp3 and mahadeva.MP3 are the same file. For instance, on Windows and Mac OS, normcase will convert the entire filename to lowercase; on UNIX-compatible systems, it will return the filename unchanged.
@@ -4559,12 +4082,12 @@ def listDirectory(directory, fileExtList):
Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python.
There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line.
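As a rough sketch of wildcard matching with the standard glob module, the snippet below builds a tiny directory tree and queries it. It is written for Python 3; the temporary directory and the file names (a.mp3, b.mp3, notes.txt) are made up for illustration and are not the music collection used elsewhere in this chapter.

```python
import glob
import os
import tempfile

# Hypothetical directory tree, built in a temp folder so the sketch is self-contained.
root = tempfile.mkdtemp()
for sub, name in [("singles", "a.mp3"), ("singles", "notes.txt"), ("albums", "b.mp3")]:
    os.makedirs(os.path.join(root, sub), exist_ok=True)
    open(os.path.join(root, sub, name), "w").close()

# One wildcard: every .mp3 file directly inside one subdirectory.
singles = glob.glob(os.path.join(root, "singles", "*.mp3"))

# Two wildcards: every .mp3 file in any immediate subdirectory of root.
all_mp3s = glob.glob(os.path.join(root, "*", "*.mp3"))

print(sorted(os.path.basename(p) for p in singles))    # ['a.mp3']
print(sorted(os.path.basename(p) for p in all_mp3s))   # ['a.mp3', 'b.mp3']
```

The results are sorted before printing because glob makes no promise about the order in which matches come back.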
Example 6.20. Listing Directories with
glob>>> os.listdir("c:\\music\\_singles\\")@@ -4595,7 +4118,7 @@ may already be familiar with from working on the command line.
![]()
The glob module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard. Here the wildcard is a directory path plus "*.mp3", which will match all .mp3 files. Note that each element of the returned list already includes the full path of the file.
@@ -4606,7 +4129,7 @@ may already be familiar with from working on the command line. @@ -4619,7 +4142,7 @@ may already be familiar with from working on the command line. - ![]()
Now consider this scenario: you have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory. You can get a list of all of those with a single call to glob, by using two wildcards at once. One wildcard is the "*.mp3" (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music. That's a crazy amount of power packed into one deceptively simple-looking function!
6.6. Putting It All Together
Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all fits together.
Example 6.21.
listDirectorydef listDirectory(directory, fileExtList):@@ -4638,8 +4161,8 @@ def listDirectory(directory, fileExtList):
-
listDirectory is the main attraction of this entire module. It takes a directory (like c:\music\_singles\ in my case) and a list of interesting file extensions (like ['.mp3']), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in that directory. And it does it in just a few straightforward lines of code.
@@ -4651,42 +4174,42 @@ def listDirectory(directory, fileExtList):
Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions -- literally, a function within a function. The nested function getFileInfoClass can be called only from the function in which it is defined, listDirectory. As with any other function, you don't need an interface declaration or anything fancy; just define the function and code it.
- ![]()
Now that you've seen the os module, this line should make more sense. It gets the extension of the file (os.path.splitext(filename)[1]), forces it to uppercase (.upper()), slices off the dot ([1:]), and constructs a class name out of it with string formatting. So c:\music\ap\mahadeva.mp3 becomes .mp3 becomes .MP3 becomes MP3 becomes MP3FileInfo.
- ![]()
Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually exists in this module. If it does, you return the class; otherwise you return the base class FileInfo. This is a very important point: this function returns a class. Not an instance of a class, but the class itself.
- ![]()
For each file in the “interesting files” list (fileList), you call getFileInfoClass with the filename (f). Calling getFileInfoClass(f) returns a class; you don't know exactly which class, but you don't care. You then create an instance of this class (whatever it is) and pass the filename (f again) to the __init__ method. As you saw earlier in this chapter, the __init__ method of FileInfo sets self["name"], which triggers __setitem__, which is overridden in the descendant (MP3FileInfo) to parse the file appropriately to pull out the file's metadata. You do all that for each interesting file and return a list of the resulting instances.
Note that listDirectory is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class: HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results.
6.7. Summary
The
fileinfo.pyprogram introduced in Chapter 5 should now make perfect sense."""Framework for getting filetype-specific metadata. -Instantiate appropriate class with filename. Returned object acts like a +Instantiate appropriate class with filename. Returned object acts like a dictionary, with key-value pairs for each piece of metadata. import fileinfo info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3") @@ -4697,7 +4220,7 @@ Or use listDirectory function to get info on all files in a directory. ... Framework can be extended by adding classes for particular file types, e.g. -HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for +HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for parsing its files appropriately; see MP3FileInfo for example. """ import os @@ -4776,18 +4299,18 @@ if __name__ == "__main__":Chapter 7. Regular Expressions
Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the re module to get an overview of the available functions and their arguments.
7.1. Diving In
Strings have methods for searching (index, find, and count), replacing (replace), and parsing (split), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations.
If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you're combining them with split and join and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions to make them practically self-documenting.
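To make the trade-off concrete, here is a minimal sketch of a case-insensitive search done both ways. The sample address is made up for illustration, and the re.IGNORECASE flag is not covered in this chapter; it is a standard flag of the re module that tells the engine to ignore case.

```python
import re

# Hypothetical sample data, mixed case on purpose.
address = "100 North Main Road"

# String-method approach: normalize the case yourself, then search
# for a search string that is already in the matching case.
found_with_methods = address.lower().find("road") != -1

# Regular-expression approach: leave the data alone and ask the
# engine to ignore case while matching.
found_with_regex = re.search("road", address, re.IGNORECASE) is not None

print(found_with_methods, found_with_regex)
```

For a single fixed substring like this, the string methods are still the simpler choice. The regular expression starts to pay off once the pattern itself gets more complicated.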
7.2. Case Study: Street Addresses
This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub - and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just + and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just make this stuff up; it's actually useful.) This example shows how I approached the problem.
Example 7.1. Matching at the End of a String
>>> s = '100 NORTH MAIN ROAD' @@ -4805,43 +4328,43 @@ within regular expressions to make them practically self-documenting.- ![]()
My goal is to standardize a street address so that 'ROAD'is always abbreviated as'RD.'. At first glance, I thought this was simple enough that I could just use the string methodreplace. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string,'ROAD', was a constant. And in this deceptively simple example,s.replacedoes indeed work. +My goal is to standardize a street address so that 'ROAD'is always abbreviated as'RD.'. At first glance, I thought this was simple enough that I could just use the string methodreplace. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string,'ROAD', was a constant. And in this deceptively simple example,s.replacedoes indeed work.- ![]()
Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that 'ROAD'appears twice in the address, once as part of the street name'BROAD'and once as its own word. Thereplacemethod sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed. +Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that 'ROAD'appears twice in the address, once as part of the street name'BROAD'and once as its own word. Thereplacemethod sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.- ![]()
To solve the problem of addresses with more than one 'ROAD'substring, you could resort to something like this: only search and replace'ROAD'in the last four characters of the address (s[-4:]), and leave the string alone (s[:-4]). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string - you're replacing (if you were replacing'STREET'with'ST.', you would need to uses[:-6]ands[-6:].replace(...)). Would you like to come back in six months and debug this? I know I wouldn't. +To solve the problem of addresses with more than one 'ROAD'substring, you could resort to something like this: only search and replace'ROAD'in the last four characters of the address (s[-4:]), and leave the string alone (s[:-4]). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string + you're replacing (if you were replacing'STREET'with'ST.', you would need to uses[:-6]ands[-6:].replace(...)). Would you like to come back in six months and debug this? I know I wouldn't.- ![]()
It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the remodule. +It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the remodule.- ![]()
Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches'ROAD'only when it occurs at the end of a string. The$means “end of the string”. (There is a corresponding character, the caret^, which means “beginning of the string”.) +Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches'ROAD'only when it occurs at the end of a string. The$means “end of the string”. (There is a corresponding character, the caret^, which means “beginning of the string”.)- ![]()
Using the re.subfunction, you search the string s for the regular expression'ROAD$'and replace it with'RD.'. This matches theROADat the end of the string s, but does not match theROADthat's part of the wordBROAD, because that's in the middle of s. +Using the re.subfunction, you search the string s for the regular expression'ROAD$'and replace it with'RD.'. This matches theROADat the end of the string s, but does not match theROADthat's part of the wordBROAD, because that's in the middle of s.Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching
'ROAD'at the end of the address, was not good enough, because not all addresses included a street designation at all; some just -ended with the street name. Most of the time, I got away with it, but if the street name was'BROAD', then the regular expression would match'ROAD'at the end of the string as part of the word'BROAD', which is not what I wanted. +ended with the street name. Most of the time, I got away with it, but if the street name was'BROAD', then the regular expression would match'ROAD'at the end of the string as part of the word'BROAD', which is not what I wanted.Example 7.2. Matching Whole Words
>>> s = '100 BROAD' >>> re.sub('ROAD$', 'RD.', s) @@ -4859,22 +4382,22 @@ ended with the street name. Most of the time, I got away with it, but if the st- ![]()
What I really wanted was to match 'ROAD'when it was at the end of the string and it was its own whole word, not a part of some larger word. To express this in a regular expression, you use\b, which means “a word boundary must occur right here”. In Python, this is complicated by the fact that the'\'character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason - why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it's a bug in syntax or +What I really wanted was to match 'ROAD'when it was at the end of the string and it was its own whole word, not a part of some larger word. To express this in a regular expression, you use\b, which means “a word boundary must occur right here”. In Python, this is complicated by the fact that the'\'character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason + why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it's a bug in syntax or a bug in your regular expression.- ![]()
To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in this string should be escaped;'\t'is a tab character, butr'\t'is really the backslash character\followed by the lettert. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly +To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in this string should be escaped;'\t'is a tab character, butr'\t'is really the backslash character\followed by the lettert. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly (and regular expressions get confusing quickly enough all by themselves).- ![]()
*sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word + *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word @@ -4882,13 +4405,13 @@ ended with the street name. Most of the time, I got away with it, but if the st'ROAD'as a whole word by itself, but it wasn't at the end, because the address had an apartment number after the street designation. Because'ROAD'isn't at the very end of the string, it doesn't match, so the entire call tore.subends up replacing nothing at all, and you get the original string back, which is not what you want.- ![]()
To solve this problem, I removed the $character and added another\b. Now the regular expression reads “match'ROAD'when it's a whole word by itself anywhere in the string,” whether at the end, the beginning, or somewhere in the middle. +To solve this problem, I removed the $character and added another\b. Now the regular expression reads “match'ROAD'when it's a whole word by itself anywhere in the string,” whether at the end, the beginning, or somewhere in the middle.7.3. Case Study: Roman Numerals
-You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies - and television shows (“Copyright
MCMXLVI” instead of “Copyright1946”), or on the dedication walls of libraries or universities (“establishedMDCCCLXXXVIII” instead of “established1888”). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really +You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies + and television shows (“Copyright
MCMXLVI” instead of “Copyright1946”), or on the dedication walls of libraries or universities (“establishedMDCCCLXXXVIII” instead of “established1888”). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really does date back to the ancient Roman empire (hence the name).In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
@@ -4904,21 +4427,21 @@ ended with the street name. Most of the time, I got away with it, but if the stThe following are some general rules for constructing Roman numerals:
-
- Characters are additive.
Iis1,IIis2, andIIIis3.VIis6(literally, “5and1”),VIIis7, andVIIIis8. +- Characters are additive.
Iis1,IIis2, andIIIis3.VIis6(literally, “5and1”),VIIis7, andVIIIis8. -- The tens characters (
I,X,C, andM) can be repeated up to three times. At4, you need to subtract from the next highest fives character. You can't represent4asIIII; instead, it is represented asIV(“1less than5”). The number40is written asXL(10less than50),41asXLI,42asXLII,43asXLIII, and then44asXLIV(10less than50, then1less than5). +- The tens characters (
I,X,C, andM) can be repeated up to three times. At4, you need to subtract from the next highest fives character. You can't represent4asIIII; instead, it is represented asIV(“1less than5”). The number40is written asXL(10less than50),41asXLI,42asXLII,43asXLIII, and then44asXLIV(10less than50, then1less than5). -- Similarly, at
9, you need to subtract from the next highest tens character:8isVIII, but9isIX(1less than10), notVIIII(since theIcharacter can not be repeated four times). The number90isXC,900isCM. +- Similarly, at
9, you need to subtract from the next highest tens character:8isVIII, but9isIX(1less than10), notVIIII(since theIcharacter can not be repeated four times). The number90isXC,900isCM. -- The fives characters can not be repeated. The number
10is always represented asX, never asVV. The number100is alwaysC, neverLL. +- The fives characters can not be repeated. The number
10 is always represented as X, never as VV. The number 100 is always C, never LL.
- Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much. -
DCis600;CDis a completely different number (400,100less than500).CIis101;ICis not even a valid Roman numeral (because you can't subtract1directly from100; you would need to write it asXCIX, for10less than100, then1less than10). +DCis600;CDis a completely different number (400,100less than500).CIis101;ICis not even a valid Roman numeral (because you can't subtract1directly from100; you would need to write it asXCIX, for10less than100, then1less than10).7.3.1. Checking for Thousands
-What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since - Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000 +
What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since + Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of
Mcharacters.Example 7.3. Checking for Thousands
>>> import re @@ -4939,12 +4462,12 @@ ended with the street name. Most of the time, I got away with it, but if the stThis pattern has three parts: -
@@ -4953,8 +4476,8 @@ ended with the street name. Most of the time, I got away with it, but if the st^to match what follows only at the beginning of the string. If this were not specified, the pattern would match no matter - where theMcharacters were, which is not what you want. You want to make sure that theMcharacters, if they're there, are at the beginning of the string. +^to match what follows only at the beginning of the string. If this were not specified, the pattern would match no matter + where theMcharacters were, which is not what you want. You want to make sure that theMcharacters, if they're there, are at the beginning of the string. -M?to optionally match a singleMcharacter. Since this is repeated three times, you're matching anywhere from zero to threeMcharacters in a row. +M?to optionally match a singleMcharacter. Since this is repeated three times, you're matching anywhere from zero to threeMcharacters in a row. -$to match what precedes only at the end of the string. When combined with the^character at the beginning, this means that the pattern must match the entire string, with no other characters before or +$to match what precedes only at the end of the string. When combined with the^character at the beginning, this means that the pattern must match the entire string, with no other characters before or after theMcharacters.- ![]()
The essence of the remodule is thesearchfunction, that takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found,searchreturns an object which has various methods to describe the match; if no match is found,searchreturnsNone, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return - value ofsearch.'M'matches this regular expression, because the first optionalMmatches and the second and third optionalMcharacters are ignored. +The essence of the remodule is thesearchfunction, that takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found,searchreturns an object which has various methods to describe the match; if no match is found,searchreturnsNone, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return + value ofsearch.'M'matches this regular expression, because the first optionalMmatches and the second and third optionalMcharacters are ignored.@@ -4972,7 +4495,7 @@ ended with the street name. Most of the time, I got away with it, but if the st - ![]()
'MMMM'does not match. All threeMcharacters match, but then the regular expression insists on the string ending (because of the$character), and the string doesn't end yet (because of the fourthM). SosearchreturnsNone. +'MMMM'does not match. All threeMcharacters match, but then the regular expression insists on the string ending (because of the$character), and the string doesn't end yet (because of the fourthM). SosearchreturnsNone.@@ -5030,33 +4553,33 @@ ended with the street name. Most of the time, I got away with it, but if the st - ![]()
This pattern starts out the same as the previous one, checking for the beginning of the string ( ^), then the thousands place (M?M?M?). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical - bars:CM,CD, andD?C?C?C?(which is an optionalDfollowed by zero to three optionalCcharacters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first +This pattern starts out the same as the previous one, checking for the beginning of the string ( ^), then the thousands place (M?M?M?). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical + bars:CM,CD, andD?C?C?C?(which is an optionalDfollowed by zero to three optionalCcharacters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first one that matches, and ignores the rest.- ![]()
'MCM'matches because the firstMmatches, the second and thirdMcharacters are ignored, and theCMmatches (so theCDandD?C?C?C?patterns are never even considered).MCMis the Roman numeral representation of1900. +'MCM'matches because the firstMmatches, the second and thirdMcharacters are ignored, and theCMmatches (so theCDandD?C?C?C?patterns are never even considered).MCMis the Roman numeral representation of1900.- ![]()
'MD' matches because the first M matches, the second and third M characters are ignored, and the D?C?C?C? pattern matches D (each of the three C characters is optional and is ignored). MD is the Roman numeral representation of 1500. +'MD' matches because the first M matches, the second and third M characters are ignored, and the D?C?C?C? pattern matches D (each of the three C characters is optional and is ignored). MD is the Roman numeral representation of 1500.
'MMMCCC'matches because all threeMcharacters match, and theD?C?C?C?pattern matchesCCC(theDis optional and is ignored).MMMCCCis the Roman numeral representation of3300. +'MMMCCC'matches because all threeMcharacters match, and theD?C?C?C?pattern matchesCCC(theDis optional and is ignored).MMMCCCis the Roman numeral representation of3300.- ![]()
'MCMC'does not match. The firstMmatches, the second and thirdMcharacters are ignored, and theCMmatches, but then the$does not match because you're not at the end of the string yet (you still have an unmatchedCcharacter). TheCdoes not match as part of theD?C?C?C?pattern, because the mutually exclusiveCMpattern has already matched. +'MCMC'does not match. The firstMmatches, the second and thirdMcharacters are ignored, and theCMmatches, but then the$does not match because you're not at the end of the string yet (you still have an unmatchedCcharacter). TheCdoes not match as part of theD?C?C?C?pattern, because the mutually exclusiveCMpattern has already matched.@@ -5067,11 +4590,11 @@ ended with the street name. Most of the time, I got away with it, but if the st Whew! See how quickly regular expressions can get nasty? And you've only covered the thousands and hundreds places of Roman - numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But + numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But let's look at another way to express the pattern.
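If you want to check the thousands-and-hundreds pattern without typing each case into the interactive shell, a short script along these lines should do it. This is only a sketch: the pattern is assembled from the pieces described above, and the matches helper is a name made up for this illustration.

```python
import re

# Thousands place (zero to three M), then exactly one of the three
# mutually exclusive hundreds patterns, anchored at both ends.
pattern = r"^M?M?M?(CM|CD|D?C?C?C?)$"

def matches(numeral):
    """Return True if the whole string fits the pattern (hypothetical helper)."""
    return re.search(pattern, numeral) is not None

# The four cases discussed above, plus the empty string, which also
# matches because every part of the pattern is optional.
for numeral in ["MCM", "MD", "MMMCCC", "MCMC", ""]:
    print(repr(numeral), matches(numeral))
```

The empty string is the surprising case. Nothing in the pattern requires even one character, so a real validator would need a separate check for it.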
7.4. Using the
-{n,m}SyntaxIn the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express - this in regular expressions, which some people find more readable. First look at the method we already used in the previous +
In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express + this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.
Example 7.5. The Old Way: Every Character Optional
>>> import re @@ -5152,7 +4675,7 @@ ended with the street name. Most of the time, I got away with it, but if the st@@ -5161,14 +4684,14 @@ ended with the street name. Most of the time, I got away with it, but if the st - ![]()
This matches the start of the string, then three Mout of a possible three, but then does not match the end of the string. The regular expression allows for up to only threeMcharacters before the end of the string, but you have four, so the pattern does not match and returnsNone. +This matches the start of the string, then three Mout of a possible three, but then does not match the end of the string. The regular expression allows for up to only threeMcharacters before the end of the string, but you have four, so the pattern does not match and returnsNone.- There is no way to programmatically determine that two regular expressions are equivalent. The best you can do is write a - lot of test cases to make sure they behave the same way on all relevant inputs. You'll talk more about writing test cases + There is no way to programmatically determine that two regular expressions are equivalent. The best you can do is write a + lot of test cases to make sure they behave the same way on all relevant inputs. You'll talk more about writing test cases later in this book. 7.4.1. Checking for Tens and Ones
-Now let's expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for +
Now let's expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
Example 7.7. Checking for Tens
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$' @@ -5187,35 +4710,35 @@ ended with the street name. Most of the time, I got away with it, but if the st- ![]()
This matches the start of the string, then the first optional M, then CM, then XL, then the end of the string. Remember, the (A|B|C) syntax means “match exactly one of A, B, or C”. You match XL, so you ignore the XC and L?X?X?X? choices, and then move on to the end of the string. MCMXL is the Roman numeral representation of 1940. +This matches the start of the string, then the first optional M, then CM, then XL, then the end of the string. Remember, the (A|B|C) syntax means “match exactly one of A, B, or C”. You match XL, so you ignore the XC and L?X?X?X? choices, and then move on to the end of the string. MCMXL is the Roman numeral representation of 1940.
This matches the start of the string, then the first optional M, thenCM, thenL?X?X?X?. Of theL?X?X?X?, it matches theLand skips all three optionalXcharacters. Then you move to the end of the string.MCMLis the Roman numeral representation of1950. +This matches the start of the string, then the first optional M, thenCM, thenL?X?X?X?. Of theL?X?X?X?, it matches theLand skips all three optionalXcharacters. Then you move to the end of the string.MCMLis the Roman numeral representation of1950.- ![]()
This matches the start of the string, then the first optional M, thenCM, then the optionalLand the first optionalX, skips the second and third optionalX, then the end of the string.MCMLXis the Roman numeral representation of1960. +This matches the start of the string, then the first optional M, thenCM, then the optionalLand the first optionalX, skips the second and third optionalX, then the end of the string.MCMLXis the Roman numeral representation of1960.- ![]()
This matches the start of the string, then the first optional M, thenCM, then the optionalLand all three optionalXcharacters, then the end of the string.MCMLXXXis the Roman numeral representation of1980. +This matches the start of the string, then the first optional M, thenCM, then the optionalLand all three optionalXcharacters, then the end of the string.MCMLXXXis the Roman numeral representation of1980.- - ![]()
This matches the start of the string, then the first optional M, thenCM, then the optionalLand all three optionalXcharacters, then fails to match the end of the string because there is still one moreXunaccounted for. So the entire pattern fails to match, and returnsNone.MCMLXXXXis not a valid Roman numeral. +This matches the start of the string, then the first optional M, thenCM, then the optionalLand all three optionalXcharacters, then fails to match the end of the string because there is still one moreXunaccounted for. So the entire pattern fails to match, and returnsNone.MCMLXXXXis not a valid Roman numeral.The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result. +
The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'So what does that look like using this alternate
{n,m}syntax? This example shows the new syntax. @@ -5234,48 +4757,48 @@ ended with the street name. Most of the time, I got away with it, but if the st- ![]()
This matches the start of the string, then one of a possible four Mcharacters, thenD?C{0,3}. Of that, it matches the optionalDand zero of three possibleCcharacters. Moving on, it matchesL?X{0,3}by matching the optionalLand zero of three possibleXcharacters. Then it matchesV?I{0,3}by matching the optional V and zero of three possibleIcharacters, and finally the end of the string.MDLVis the Roman numeral representation of1555. +This matches the start of the string, then one of a possible four Mcharacters, thenD?C{0,3}. Of that, it matches the optionalDand zero of three possibleCcharacters. Moving on, it matchesL?X{0,3}by matching the optionalLand zero of three possibleXcharacters. Then it matchesV?I{0,3}by matching the optional V and zero of three possibleIcharacters, and finally the end of the string.MDLVis the Roman numeral representation of1555.- ![]()
This matches the start of the string, then two of a possible four Mcharacters, then theD?C{0,3}with aDand one of three possibleCcharacters; thenL?X{0,3}with anLand one of three possibleXcharacters; thenV?I{0,3}with aVand one of three possibleIcharacters; then the end of the string.MMDCLXVIis the Roman numeral representation of2666. +This matches the start of the string, then two of a possible four Mcharacters, then theD?C{0,3}with aDand one of three possibleCcharacters; thenL?X{0,3}with anLand one of three possibleXcharacters; thenV?I{0,3}with aVand one of three possibleIcharacters; then the end of the string.MMDCLXVIis the Roman numeral representation of2666.- ![]()
This matches the start of the string, then four out of four M characters, then D?C{0,3} with a D and three out of three C characters; then L?X{0,3} with an L and three out of three X characters; then V?I{0,3} with a V and three out of three I characters; then the end of the string. MMMMDCCCLXXXVIII is the Roman numeral representation of 4888, and it's the longest Roman numeral you can write without extended syntax. +This matches the start of the string, then four out of four M characters, then D?C{0,3} with a D and three out of three C characters; then L?X{0,3} with an L and three out of three X characters; then V?I{0,3} with a V and three out of three I characters; then the end of the string. MMMMDCCCLXXXVIII is the Roman numeral representation of 4888, and it's the longest Roman numeral you can write without extended syntax.
Watch closely. (I feel like a magician. “Watch closely, kids, I'm going to pull a rabbit out of my hat.”) This matches the start of the string, then zero out of four M, then matchesD?C{0,3}by skipping the optionalDand matching zero out of threeC, then matchesL?X{0,3}by skipping the optionalLand matching zero out of threeX, then matchesV?I{0,3}by skipping the optionalVand matching one out of threeI. Then the end of the string. Whoa. +Watch closely. (I feel like a magician. “Watch closely, kids, I'm going to pull a rabbit out of my hat.”) This matches the start of the string, then zero out of four M, then matchesD?C{0,3}by skipping the optionalDand matching zero out of threeC, then matchesL?X{0,3}by skipping the optionalLand matching zero out of threeX, then matchesV?I{0,3}by skipping the optionalVand matching one out of threeI. Then the end of the string. Whoa.If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand - someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back - to your own regular expressions a few months later. I've done it, and it's not a pretty sight. +
If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand + someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back + to your own regular expressions a few months later. I've done it, and it's not a pretty sight.
In the next section you'll explore an alternate syntax that can help keep your expressions maintainable.
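One way to gain some confidence that the two spellings of the pattern really behave the same is to run both against a large batch of inputs and look for disagreements. The sketch below is an illustration rather than a proof: the {n,m} pattern is reconstructed from the pieces described above, and the accepts helper and the length limit of four are arbitrary choices made for this example.

```python
import itertools
import re

# Every-character-optional form, as given earlier in this section.
old_pattern = r"^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$"
# {n,m} form, reconstructed from the description above (an assumption).
new_pattern = r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"

def accepts(pattern, candidate):
    """Return True if the pattern matches the whole candidate string."""
    return re.search(pattern, candidate) is not None

# Try every string of length 0 through 4 built from the seven Roman
# numeral characters and record any input the patterns disagree on.
disagreements = []
for length in range(5):
    for letters in itertools.product("MDCLXVI", repeat=length):
        candidate = "".join(letters)
        if accepts(old_pattern, candidate) != accepts(new_pattern, candidate):
            disagreements.append(candidate)

print(disagreements)
```

An empty list only tells you the patterns agree on the inputs you tried. Longer strings would need a larger limit or hand-picked cases such as 'MMMMDCCCLXXXVIII'.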
7.5. Verbose Regular Expressions
-So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee - that you'll be able to understand it six months later. What you really need is inline documentation. -
Python allows you to do this with something called verbose regular expressions. A verbose regular expression is different from a compact regular expression in two ways: +
So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee + that you'll be able to understand it six months later. What you really need is inline documentation. +
Python allows you to do this with something called verbose regular expressions. A verbose regular expression is different from a compact regular expression in two ways:
-
-- Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're - not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a +
- Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're + not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a backslash in front of it.) -
- Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a
#character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your +- Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a
#character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your source code, but it works the same way.This will be more clear with an example. Let's revisit the compact regular expression you've been working with, and make -it a verbose regular expression. This example shows how. +
This will be more clear with an example. Let's revisit the compact regular expression you've been working with, and make +it a verbose regular expression. This example shows how.
Example 7.9. Regular Expressions with Inline Comments
>>> pattern = """ ^ # beginning of string @@ -5301,8 +4824,8 @@ it a verbose regular expression. This example shows how.![]()
The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when - working with them: @@ -5321,16 +4844,16 @@ it a verbose regular expression. This example shows how.re.VERBOSEis a constant defined in theremodule that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has - quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the + working with them:re.VERBOSEis a constant defined in theremodule that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has + quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it's a lot more readable.- ![]()
This does not match. Why? Because it doesn't have the re.VERBOSEflag, so there.searchfunction is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose. +This does not match. Why? Because it doesn't have the re.VERBOSEflag, so there.searchfunction is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.7.6. Case study: Parsing Phone Numbers
-So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions - are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where. -
This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American -phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the -area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many +
So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions + are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where. +
This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American +phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the +area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.
Here are the phone numbers I needed to be able to accept:
@@ -5345,8 +4868,8 @@ examples of regular expressions that purported to do this, but none of them were800-555-1212 ext. 1234work 1-(800) 555.1212 #1234-Quite a variety! In each of these cases, I need to know that the area code was
800, the trunk was555, and the rest of the phone number was1212. For those with an extension, I need to know that the extension was1234. -Let's work through developing a solution for phone number parsing. This example shows the first step. +
Quite a variety! In each of these cases, I need to know that the area code was
800, the trunk was555, and the rest of the phone number was1212. For those with an extension, I need to know that the extension was1234. +Let's work through developing a solution for phone number parsing. This example shows the first step.
Example 7.10. Finding Numbers
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')>>> phonePattern.search('800-555-1212').groups()
@@ -5358,21 +4881,21 @@ examples of regular expressions that purported to do this, but none of them were
- ![]()
Always read regular expressions from left to right. This one matches the beginning of the string, and then (\d{3}). What's\d{3}? Well, the{3}means “match exactly three numeric digits”; it's a variation on the{n,m} syntaxyou saw earlier.\dmeans “any numeric digit” (0through9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another - group of exactly four digits. Then match the end of the string. +Always read regular expressions from left to right. This one matches the beginning of the string, and then (\d{3}). What's\d{3}? Well, the{3}means “match exactly three numeric digits”; it's a variation on the{n,m} syntaxyou saw earlier.\dmeans “any numeric digit” (0through9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another + group of exactly four digits. Then match the end of the string.- ![]()
To get access to the groups that the regular expression parser remembered along the way, use the groups()method on the object that thesearchfunction returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you +To get access to the groups that the regular expression parser remembered along the way, use the groups()method on the object that thesearchfunction returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits.@@ -5390,9 +4913,9 @@ examples of regular expressions that purported to do this, but none of them were - ![]()
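To see what `groups()` hands back, here is a minimal sketch built on the pattern from Example 7.10. The variable names (`phone_pattern`, `parts`, `no_match`) are illustrative, not from the original listing.

```python
import re

# Pattern from Example 7.10: three hyphen-separated groups of digits.
phone_pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

match = phone_pattern.search('800-555-1212')
parts = match.groups()   # one string per parenthesized group, in order
print(parts)             # ('800', '555', '1212')

# search() returns None when the pattern does not match, so calling
# .groups() directly on the result would raise AttributeError.
no_match = phone_pattern.search('800-555-1212-1234')
print(no_match)          # None
```

Because a failed search returns `None`, real code would normally test the result before calling `groups()` on it.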
This regular expression is not the final answer, because it doesn't handle a phone number with an extension on the end. For + This regular expression is not the final answer, because it doesn't handle a phone number with an extension on the end. For that, you'll need to expand the regular expression. @@ -5406,7 +4929,7 @@ examples of regular expressions that purported to do this, but none of them were - ![]()
This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then + This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered - group of four digits. What's new is that you then match another hyphen, and a remembered group of one or more digits, then + group of four digits. What's new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string. ![]()
Unfortunately, this regular expression is not the final answer either, because it assumes that the different parts of the - phone number are separated by hyphens. What if they're separated by spaces, or commas, or dots? You need a more general + phone number are separated by hyphens. What if they're separated by spaces, or commas, or dots? You need a more general solution to match several different types of separators. @@ -5414,7 +4937,7 @@ examples of regular expressions that purported to do this, but none of them were![]()
Oops! Not only does this regular expression not do everything you want, it's actually a step backwards, because now you can't - parse phone numbers without an extension. That's not what you wanted at all; if the extension is there, you want to know what it is, but if it's not + parse phone numbers without an extension. That's not what you wanted at all; if the extension is there, you want to know what it is, but if it's not there, you still want to know what the different parts of the main number are. @@ -5435,7 +4958,7 @@ examples of regular expressions that purported to do this, but none of them were@@ -5453,14 +4976,14 @@ examples of regular expressions that purported to do this, but none of them were - ![]()
Hang on to your hat. You're matching the beginning of the string, then a group of three digits, then \D+. What the heck is that? Well,\Dmatches any character except a numeric digit, and+means “1 or more”. So\D+matches one or more characters that are not digits. This is what you're using instead of a literal hyphen, to try to match +Hang on to your hat. You're matching the beginning of the string, then a group of three digits, then \D+. What the heck is that? Well,\Dmatches any character except a numeric digit, and+means “1 or more”. So\D+matches one or more characters that are not digits. This is what you're using instead of a literal hyphen, to try to match different separators.- ![]()
Unfortunately, this is still not the final answer, because it assumes that there is a separator at all. What if the phone + Unfortunately, this is still not the final answer, because it assumes that there is a separator at all. What if the phone number is entered without any spaces or hyphens at all? @@ -5481,13 +5004,13 @@ examples of regular expressions that purported to do this, but none of them were - ![]()
Oops! This still hasn't fixed the problem of requiring extensions. Now you have two problems, but you can solve both of + Oops! This still hasn't fixed the problem of requiring extensions. Now you have two problems, but you can solve both of them with the same technique. - ![]()
The only change you've made since that last step is changing all the +to*. Instead of\D+between the parts of the phone number, you now match on\D*. Remember that+means “1 or more”? Well,*means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all. +The only change you've made since that last step is changing all the +to*. Instead of\D+between the parts of the phone number, you now match on\D*. Remember that+means “1 or more”? Well,*means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all.@@ -5500,14 +5023,14 @@ examples of regular expressions that purported to do this, but none of them were - ![]()
Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits + Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits ( 800), then zero non-numeric characters, then a remembered group of three digits (555), then zero non-numeric characters, then a remembered group of four digits (1212), then zero non-numeric characters, then a remembered group of an arbitrary number of digits (1234), then the end of the string.- ![]()
Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the groups()method still returns a tuple of four elements, but the fourth element is just an empty string. +Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the groups()method still returns a tuple of four elements, but the fourth element is just an empty string.@@ -5526,22 +5049,22 @@ examples of regular expressions that purported to do this, but none of them were - ![]()
I hate to be the bearer of bad news, but you're not finished yet. What's the problem here? There's an extra character before - the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No + I hate to be the bearer of bad news, but you're not finished yet. What's the problem here? There's an extra character before + the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No problem, you can use the same technique of “zero or more non-numeric characters” to skip over the leading characters before the area code. - ![]()
This is the same as in the previous example, except now you're matching \D*, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you're not remembering - these non-numeric characters (they're not in parentheses). If you find them, you'll just skip over them and then start remembering +This is the same as in the previous example, except now you're matching \D*, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you're not remembering + these non-numeric characters (they're not in parentheses). If you find them, you'll just skip over them and then start remembering the area code whenever you get to it.- ![]()
You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis + You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it's treated as a non-numeric separator and matched by the \D*after the first remembered group.)- ![]()
Just a sanity check to make sure you haven't broken anything that used to work. Since the leading characters are entirely + Just a sanity check to make sure you haven't broken anything that used to work. Since the leading characters are entirely optional, this matches the beginning of the string, then zero non-numeric characters, then a remembered group of three digits ( @@ -5549,15 +5072,15 @@ examples of regular expressions that purported to do this, but none of them were800), then one non-numeric character (the hyphen), then a remembered group of three digits (555), then one non-numeric character (the hyphen), then a remembered group of four digits (1212), then zero non-numeric characters, then a remembered group of zero digits, then the end of the string.- - ![]()
This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn't this phone number match? - Because there's a 1before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (\D*). Aargh. +This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn't this phone number match? + Because there's a 1before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (\D*). Aargh.Let's back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you -see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than +
Let's back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you +see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let's take a different approach: don't explicitly match the beginning -of the string at all. This approach is shown in the next example. +of the string at all. This approach is shown in the next example.
Example 7.15. Phone Number, Wherever I May Find Ye
>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
@@ -5571,8 +5094,8 @@ of the string at all. This approach is shown in the next example.
@@ -5586,7 +5109,7 @@ of the string at all. This approach is shown in the next example. - ![]()
Note the lack of ^in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need - to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out +Note the lack of ^in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need + to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.- ![]()
Sanity check. This still works. +Sanity check. This still works. - @@ -5594,7 +5117,7 @@ of the string at all. This approach is shown in the next example.
That still works too. See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can +
See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can you tell the difference between one and the next?
While you still understand the final answer (and it is the final answer; if you've discovered a case it doesn't handle, I don't want to know about it), let's write it out as a verbose regular expression, before you forget why you made the choices @@ -5627,7 +5150,7 @@ you made.
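As a rough sketch, the compact pattern from Example 7.15 could be spelled out in verbose form along these lines. The comments and the names `phone_pattern`, `samples`, and `results` are my own illustration; only the pattern itself comes from the text.

```python
import re

phone_pattern = re.compile(r"""
                # no ^ anchor: the number may start anywhere in the string
    (\d{3})     # area code: exactly 3 digits
    \D*         # optional separator: zero or more non-digits
    (\d{3})     # trunk: exactly 3 digits
    \D*         # optional separator
    (\d{4})     # rest of the number: exactly 4 digits
    \D*         # optional separator
    (\d*)       # extension: optional, any number of digits
    $           # end of string
    """, re.VERBOSE)

samples = ['work 1-(800) 555.1212 #1234', '800-555-1212', '80055512121234']
results = [phone_pattern.search(s).groups() for s in samples]
for sample, groups in zip(samples, results):
    print(sample, '->', groups)
# work 1-(800) 555.1212 #1234 -> ('800', '555', '1212', '1234')
# 800-555-1212 -> ('800', '555', '1212', '')
# 80055512121234 -> ('800', '555', '1212', '1234')
```

Dropping the `re.VERBOSE` argument would make the whitespace and the `#` comments part of the pattern, and none of these samples would match.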
- ![]()
Final sanity check. Yes, this still works. You're done. +Final sanity check. Yes, this still works. You're done. @@ -5639,7 +5162,7 @@ you made.7.7. Summary
-This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely +
This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely overwhelmed by them now, believe me, you ain't seen nothing yet.
You should now be familiar with the following techniques:
@@ -5664,10 +5187,10 @@ you made.(a|b|c)matches eitheraorborc. -(x)in general is a remembered group. You can get the value of what matched by using thegroups()method of the object returned byre.search. +(x)in general is a remembered group. You can get the value of what matched by using thegroups()method of the object returned byre.search. -Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough +
Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than they solve.
@@ -5688,8 +5211,8 @@ they solve.Chapter 8. HTML Processing
8.1. Diving in
I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions. -
Here is a complete, working Python program in two parts. The first part,
BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part,dialect.py, is an example of how to useBaseHTMLProcessor.pyto translate the text of an HTML document but leave the tags alone. Read thedocstrings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how -any of these class methods ever get called. Don't worry, all will be revealed in due time. +Here is a complete, working Python program in two parts. The first part,
BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part,dialect.py, is an example of how to useBaseHTMLProcessor.pyto translate the text of an HTML document but leave the tags alone. Read thedocstrings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how +any of these class methods ever get called. Don't worry, all will be revealed in due time.Example 8.1.
BaseHTMLProcessor.pyIf you have not already done so, you can download this and other examples used in this book.
from sgmllib import SGMLParser @@ -5832,7 +5355,7 @@ class ChefDialectizer(Dialectizer): (r'V', r'F'), (r'w', r'w'), (r'W', r'W'), - (r'([a-z])[.]', r'\1. Bork Bork Bork!')) + (r'([a-z])[.]', r'\1. Bork Bork Bork!')) class FuddDialectizer(Dialectizer): """convert HTML to Elmer Fudd-speak""" @@ -5916,7 +5439,7 @@ def test(url): if __name__ == "__main__": test("http://diveintopython3.org/odbchelper_list.html")Example 8.3. Output of
-dialect.pyRunning this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the +
Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the code listings and screen examples were left untouched.
<div class="abstract"> <p>Lists awe <span class="application">Pydon</span>'s wowkhowse datatype. @@ -5926,37 +5449,37 @@ in <span class="application">Powewbuiwdew</span>, bwace youwsewf fow <span class="application">Pydon</span> wists.</p> </div>8.2. Introducing
-sgmllib.pyHTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by
sgmllib.py, a part of the standard Python library. -The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags -and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool.
sgmllib.pypresents HTML structurally. -
sgmllib.pycontains one important class:SGMLParser.SGMLParserparses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, -it calls a method on itself based on what it found. In order to use the parser, you subclass theSGMLParserclass and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method. +HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by
sgmllib.py, a part of the standard Python library. +The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags +and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool.
sgmllib.pypresents HTML structurally. +
sgmllib.pycontains one important class:SGMLParser.SGMLParserparses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, +it calls a method on itself based on what it found. In order to use the parser, you subclass theSGMLParserclass and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.
SGMLParserparses HTML into 8 kinds of data, and calls a separate method for each of them:
- Start tag
-- An HTML tag that starts a block, like
<html>,<head>,<body>, or<pre>, or a standalone tag like<br>or<img>. When it finds a start tagtagname,SGMLParserwill look for a method calledstart_ortagnamedo_. For instance, when it finds atagname<pre>tag, it will look for astart_preordo_premethod. If found,SGMLParsercalls this method with a list of the tag's attributes; otherwise, it callsunknown_starttagwith the tag name and list of attributes. +- An HTML tag that starts a block, like
<html>,<head>,<body>, or<pre>, or a standalone tag like<br>or<img>. When it finds a start tagtagname,SGMLParserwill look for a method calledstart_ortagnamedo_. For instance, when it finds atagname<pre>tag, it will look for astart_preordo_premethod. If found,SGMLParsercalls this method with a list of the tag's attributes; otherwise, it callsunknown_starttagwith the tag name and list of attributes.- End tag
-- An HTML tag that ends a block, like
</html>,</head>,</body>, or</pre>. When it finds an end tag,SGMLParserwill look for a method calledend_. If found,tagnameSGMLParsercalls this method, otherwise it callsunknown_endtagwith the tag name. +- An HTML tag that ends a block, like
</html>,</head>,</body>, or</pre>. When it finds an end tag,SGMLParserwill look for a method calledend_. If found,tagnameSGMLParsercalls this method, otherwise it callsunknown_endtagwith the tag name.- Character reference
-- An escaped character referenced by its decimal or hexadecimal equivalent, like
 . When found,SGMLParsercallshandle_charrefwith the text of the decimal or hexadecimal character equivalent. +- An escaped character referenced by its decimal or hexadecimal equivalent, like
 . When found,SGMLParsercallshandle_charrefwith the text of the decimal or hexadecimal character equivalent.- Entity reference
-- An HTML entity, like
©. When found,SGMLParsercallshandle_entityrefwith the name of the HTML entity. +- An HTML entity, like
©. When found,SGMLParsercallshandle_entityrefwith the name of the HTML entity.- Comment
-- An HTML comment, enclosed in
<!-- ... -->. When found,SGMLParsercallshandle_commentwith the body of the comment. +- An HTML comment, enclosed in
<!-- ... -->. When found,SGMLParsercallshandle_commentwith the body of the comment.- Processing instruction
-- An HTML processing instruction, enclosed in
<? ... >. When found,SGMLParsercallshandle_piwith the body of the processing instruction. +- An HTML processing instruction, enclosed in
<? ... >. When found,SGMLParsercallshandle_piwith the body of the processing instruction.- Declaration
-- An HTML declaration, such as a
DOCTYPE, enclosed in<! ... >. When found,SGMLParsercallshandle_declwith the body of the declaration. +- An HTML declaration, such as a
DOCTYPE, enclosed in<! ... >. When found,SGMLParsercallshandle_declwith the body of the declaration.- Text data
-- A block of text. Anything that doesn't fit into the other 7 categories. When found,
SGMLParsercallshandle_datawith the text. +- A block of text. Anything that doesn't fit into the other 7 categories. When found,
SGMLParsercallshandle_datawith the text.@@ -5964,22 +5487,22 @@ it calls a method on itself based on what it found. In order to use the parser,
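The callback design described above can be sketched in a few lines. One loud caveat: `sgmllib` was removed from the standard library in Python 3, so this sketch swaps in `html.parser.HTMLParser`, a different class that works on the same principle but uses generic `handle_starttag` and `handle_endtag` methods instead of per-tag `start_tagname` methods. The `EventLogger` class, its `events` list, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record each piece of the document as a (kind, detail) pair."""

    def __init__(self):
        # convert_charrefs=False keeps character and entity references
        # as separate events instead of folding them into text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_charref(self, name):
        self.events.append(('charref', name))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_comment(self, data):
        self.events.append(('comment', data))

    def handle_pi(self, data):
        self.events.append(('pi', data))

    def handle_decl(self, decl):
        self.events.append(('decl', decl))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = EventLogger()
logger.feed('<!DOCTYPE html><p class="x">&copy; 2004&#160;<!-- note --></p>')
logger.close()
for kind, detail in logger.events:
    print(kind, repr(detail))
```

The structure of the markup, not your code, decides which methods run and in what order; your subclass only decides what to do with each piece.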
-- Python 2.0 had a bug where SGMLParserwould not recognize declarations at all (handle_declwould never be called), which meant thatDOCTYPEs were silently ignored. This is fixed in Python 2.1. +Python 2.0 had a bug where SGMLParserwould not recognize declarations at all (handle_declwould never be called), which meant thatDOCTYPEs were silently ignored. This is fixed in Python 2.1.
sgmllib.pycomes with a test suite to illustrate this. You can runsgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing +
sgmllib.pycomes with a test suite to illustrate this. You can runsgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing theSGMLParserclass and definingunknown_starttag,unknown_endtag,handle_dataand other methods which simply print their arguments.
- In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces. + In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces. Example 8.4. Sample test of
-sgmllib.pyHere is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython3.org/.
+Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython3.org/.)
c:\python23\lib> type "c:\downloads\diveintopython3\html\toc\index.html"<!DOCTYPE html @@ -6026,7 +5549,7 @@ data: '\n 'Along the way, you'll also learn about
locals,globals, and dictionary-based string formatting.8.3. Extracting data from HTML documents
To extract data from HTML documents, subclass the
SGMLParserclass and define methods for each tag or entity you want to capture. -The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages. +
The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.
Example 8.5. Introducing
urllib>>> import urllib>>> sock = urllib.urlopen("http://diveintopython3.org/")
@@ -6052,19 +5575,19 @@ data: '\n '
- ![]()
The urllibmodule is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages). +The urllibmodule is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages).- ![]()
The simplest use of urllibis to retrieve the entire text of a web page using theurlopenfunction. Opening a URL is similar to opening a file. The return value ofurlopenis a file-like object, which has some of the same methods as a file object. +The simplest use of urllibis to retrieve the entire text of a web page using theurlopenfunction. Opening a URL is similar to opening a file. The return value ofurlopenis a file-like object, which has some of the same methods as a file object.- ![]()
The simplest thing to do with the file-like object returned by urlopenisread, which reads the entire HTML of the web page into a single string. The object also supportsreadlines, which reads the text line by line into a list. +The simplest thing to do with the file-like object returned by urlopenisread, which reads the entire HTML of the web page into a single string. The object also supportsreadlines, which reads the text line by line into a list.@@ -6097,14 +5620,14 @@ class URLLister(SGMLParser): - ![]()
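The session above is Python 2. As a rough sketch of the same idea in Python 3, where `urlopen` lives in `urllib.request`: the `data:` URL below is a made-up stand-in for a live page (an assumption, so the sketch runs without a network connection), and `read` hands back bytes rather than a string, so you decode them yourself.

```python
from urllib.request import urlopen

# A data: URL stands in for a live web page so this runs offline;
# the markup in it is invented for illustration.
url = "data:text/html,<html><body>Hello</body></html>"

# urlopen returns a file-like object, much like opening a file.
with urlopen(url) as sock:
    raw = sock.read()              # the entire response, as bytes

html_source = raw.decode("utf-8")  # decode the bytes into a string
print(html_source)
```

The encoding is assumed to be UTF-8 here; a real page may declare something else.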
resetis called by the__init__method ofSGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, +resetis called by the__init__method ofSGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it inreset, not in__init__, so that it will be re-initialized properly when someone re-uses a parser instance.- ![]()
start_ais called bySGMLParserwhenever it finds an<a>tag. The tag may contain anhrefattribute, and/or other attributes, likenameortitle. The attrs parameter is a list of tuples,[(attribute, value), (attribute, value), ...]. Or it may be just an<a>, a valid (if useless) HTML tag, in which case attrs would be an empty list. +start_ais called bySGMLParserwhenever it finds an<a>tag. The tag may contain anhrefattribute, and/or other attributes, likenameortitle. The attrs parameter is a list of tuples,[(attribute, value), (attribute, value), ...]. Or it may be just an<a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.@@ -6159,20 +5682,20 @@ download/diveintopython3-common-5.0.zip - ![]()
You should closeyour parser object, too, but for a different reason. You've read all the data and fed it to the parser, but thefeedmethod isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to callcloseto flush the buffer and force everything to be fully parsed. +You should closeyour parser object, too, but for a different reason. You've read all the data and fed it to the parser, but thefeedmethod isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to callcloseto flush the buffer and force everything to be fully parsed.- ![]()
Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.) +Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.)8.4. Introducing
-BaseHTMLProcessor.py
SGMLParserdoesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it - finds, but the methods don't do anything.SGMLParseris an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclassSGMLParserto define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll - take this one step further by defining a class that catches everythingSGMLParserthrows at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer. +
SGMLParserdoesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it + finds, but the methods don't do anything.SGMLParseris an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclassSGMLParserto define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll + take this one step further by defining a class that catches everythingSGMLParserthrows at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer.
BaseHTMLProcessorsubclassesSGMLParserand provides all 8 essential handler methods:unknown_starttag,unknown_endtag,handle_charref,handle_entityref,handle_comment,handle_pi,handle_decl, andhandle_data.Example 8.8. Introducing
BaseHTMLProcessorclass BaseHTMLProcessor(SGMLParser): @@ -6210,13 +5733,13 @@ class BaseHTMLProcessor(SGMLParser):- ![]()
reset, called bySGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method. self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML thatSGMLParserparsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but +reset, called bySGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method. self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML thatSGMLParserparsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but Python is much more efficient at dealing with lists.[2]- ![]()
Since BaseHTMLProcessordoes not define any methods for specific tags (like thestart_amethod inURLLister),SGMLParserwill callunknown_starttagfor every start tag. This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces. The string formatting here is a little strange; you'll untangle that (and also the odd-lookinglocalsfunction) later in this chapter. +Since BaseHTMLProcessordoes not define any methods for specific tags (like thestart_amethod inURLLister),SGMLParserwill callunknown_starttagfor every start tag. This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces. The string formatting here is a little strange; you'll untangle that (and also the odd-lookinglocalsfunction) later in this chapter.@@ -6228,15 +5751,15 @@ Python is much more efficient at dealing with lists.[ - ![]()
When SGMLParserfinds a character reference, it callshandle_charrefwith the bare reference. If the HTML document contains the reference , ref will be160. Reconstructing the original complete character reference just involves wrapping ref in&#...;characters. +When SGMLParserfinds a character reference, it callshandle_charrefwith the bare reference. If the HTML document contains the reference , ref will be160. Reconstructing the original complete character reference just involves wrapping ref in&#...;characters.- ![]()
Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference - requires wrapping ref in &...; characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard -HTML entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs. Hence the extra if statement.) +Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference + requires wrapping ref in &...; characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard +HTML entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs. Hence the extra if statement.)@@ -6263,7 +5786,7 @@ Python is much more efficient at dealing with lists.[ ![]()
- @@ -6276,7 +5799,7 @@ Python is much more efficient at dealing with lists.[The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). BaseHTMLProcessoris not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs,SGMLParsermay incorrectly think that it has found tags and attributes.SGMLParseralways converts tags and attribute names to lowercase, which may break the script, andBaseHTMLProcessoralways encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script +The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). BaseHTMLProcessoris not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs,SGMLParsermay incorrectly think that it has found tags and attributes.SGMLParseralways converts tags and attribute names to lowercase, which may break the script, andBaseHTMLProcessoralways encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script within HTML comments.- ![]()
This is the one method in BaseHTMLProcessorthat is never called by the ancestorSGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it. +This is the one method in BaseHTMLProcessorthat is never called by the ancestorSGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it.@@ -6294,28 +5817,28 @@ Python is much more efficient at dealing with lists.[ 8.5.
-localsandglobalsLet's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions,
localsandglobals, which provide dictionary-based access to local and global variables. +Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions,
localsandglobals, which provide dictionary-based access to local and global variables.Remember
locals? You first saw it here:def unknown_starttag(self, tag, attrs): strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) self.pieces.append("<%(tag)s%(strattrs)s>" % locals()) -No, wait, you can't learn about
localsyet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention. -Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names -of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute. -
At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which -keeps track of the function's variables, including function arguments and locally defined variables. Each module has its +
No, wait, you can't learn about
localsyet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention. +Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names +of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute. +
At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which +keeps track of the function's variables, including function arguments and locally defined variables. Each module has its own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any -other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any +other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any module, which holds built-in functions and exceptions.
When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:
-
- local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching. +
- local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching. -
- global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching. +
- global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching. -
- built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable. +
- built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable.
If Python doesn't find x in any of these namespaces, it gives up and raises a
NameErrorwith the message There is no variable named 'x', which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn't appreciate how much work Python was doing before giving you that error.@@ -6323,15 +5846,15 @@ module, which holds built-in functions and exceptions.
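The search order described above can be watched in action. This is a minimal Python 3 sketch; the function names and `undefined_name` are hypothetical, chosen only to show which namespace answers each lookup.

```python
import builtins

x = "global x"              # lives in the module's (global) namespace

def show_local():
    x = "local x"           # the local namespace is searched first
    return x

def show_global():
    return x                # no local x, so the module's x is used

def show_builtin():
    return len              # not local, not global: found in builtins

def show_missing():
    try:
        return undefined_name   # hypothetical name, defined nowhere
    except NameError as err:
        return str(err)         # Python gave up after all three lookups

print(show_local())
print(show_global())
print(show_builtin() is builtins.len)
print(show_missing())
```

The exact wording of the `NameError` message varies between Python versions, so the sketch prints it rather than relying on it.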
-- Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or lambdafunction, Python will search for that variable in the current (nested orlambda) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested orlambda) function's namespace, then in the parent function's namespace, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:+Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or lambdafunction, Python will search for that variable in the current (nested orlambda) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested orlambda) function's namespace, then in the parent function's namespace, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:from __future__ import nested_scopesAre you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in
localsfunction, and the global (module level) namespace is accessible via the built-inglobalsfunction. +Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in
localsfunction, and the global (module level) namespace is accessible via the built-inglobalsfunction.Example 8.10. Introducing
locals>>> def foo(arg):-... x = 1 -... print locals() -... +... x = 1 +... print locals() +... >>> foo(7)
{'arg': 7, 'x': 1} >>> foo('bar')
@@ -6346,22 +5869,22 @@ from __future__ import nested_scopes
- ![]()
localsreturns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values - of the dictionary are the actual values of the variables. So callingfoowith7prints the dictionary containing the function's two local variables: arg (7) and x (1). +localsreturns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values + of the dictionary are the actual values of the variables. So callingfoowith7prints the dictionary containing the function's two local variables: arg (7) and x (1).- - ![]()
Remember, Python has dynamic typing, so you could just as easily pass a string in for arg; the function (and the call to locals) would still work just as well.localsworks with all variables of all datatypes. +Remember, Python has dynamic typing, so you could just as easily pass a string in for arg; the function (and the call to locals) would still work just as well.localsworks with all variables of all datatypes.What
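For readers on Python 3, where `print` is a function, the same experiment might look like the sketch below. Returning the dictionary instead of printing it inside `foo` is a small change made here for convenience, not something the original example does.

```python
def foo(arg):
    x = 1
    # locals() maps each local name (as a string) to its current value
    return locals()

first = foo(7)
second = foo("bar")     # dynamic typing: a string works just as well

print(first)
print(second)
```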
localsdoes for the local (function) namespace,globalsdoes for the global (module) namespace.globalsis more exciting, though, because a module's namespace is more exciting.[3] Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes -defined in the module. Plus, it includes anything that was imported into the module. +What
localsdoes for the local (function) namespace,globalsdoes for the global (module) namespace.globalsis more exciting, though, because a module's namespace is more exciting.[3] Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes +defined in the module. Plus, it includes anything that was imported into the module.Remember the difference between
from module importandimport module? Withimport module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access -any of its functions or attributes:module.function. But withfrom module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you -access them directly without referencing the original module they came from. With theglobalsfunction, you can actually see this happen. +any of its functions or attributes:module.function. But withfrom module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you +access them directly without referencing the original module they came from. With theglobalsfunction, you can actually see this happen.Example 8.11. Introducing
globalsLook at the following block of code at the bottom of
BaseHTMLProcessor.py:if __name__ == "__main__": @@ -6371,7 +5894,7 @@ if __name__ == "__main__":@@ -6386,25 +5909,25 @@ __name__ = __main__ - ![]()
Just so you don't get intimidated, remember that you've seen all this before. The globalsfunction returns a dictionary, and you're iterating through the dictionary using theitemsmethod and multi-variable assignment. The only thing new here is theglobalsfunction. +Just so you don't get intimidated, remember that you've seen all this before. The globalsfunction returns a dictionary, and you're iterating through the dictionary using theitemsmethod and multi-variable assignment. The only thing new here is theglobalsfunction.-
SGMLParserwas imported fromsgmllib, usingfrom module import. That means that it was imported directly into the module's namespace, and here it is. +SGMLParserwas imported fromsgmllib, usingfrom module import. That means that it was imported directly into the module's namespace, and here it is.- ![]()
Contrast this with htmlentitydefs, which was imported usingimport. That means that thehtmlentitydefsmodule itself is in the namespace, but the entitydefs variable defined withinhtmlentitydefsis not. +Contrast this with htmlentitydefs, which was imported usingimport. That means that thehtmlentitydefsmodule itself is in the namespace, but the entitydefs variable defined withinhtmlentitydefsis not.- ![]()
This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class. +This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class.@@ -6413,12 +5936,12 @@ __name__ = __main__ - ![]()
Remember the if __name__trick? When running a module (as opposed to importing it from another module), the built-in__name__attribute is a special value,__main__. Since you ran this module as a script from the command line,__name__is__main__, which is why the little test code to print theglobalsgot executed. +Remember the if __name__trick? When running a module (as opposed to importing it from another module), the built-in__name__attribute is a special value,__main__. Since you ran this module as a script from the command line,__name__is__main__, which is why the little test code to print theglobalsgot executed.- -Using the localsandglobalsfunctions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors +Using the localsandglobalsfunctions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors the functionality of thegetattrfunction, which allows you to access arbitrary functions dynamically by providing the function name as a string.There is one other important difference between the
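Here is a small Python 3 sketch of the same observation. `sgmllib` and `htmlentitydefs` are not part of the Python 3 standard library, so `html.parser` and `html.entities` stand in for them here; that substitution is an assumption of this sketch, not what `BaseHTMLProcessor.py` imports.

```python
import html.entities                  # binds only the name "html"
from html.parser import HTMLParser    # binds "HTMLParser" directly

namespace = globals()

# from module import: the class itself is in this module's namespace
print("HTMLParser" in namespace)

# import module: the package name is there...
print("html" in namespace)

# ...but a variable defined inside the module is not
print("entitydefs" in namespace)

# Looking a name up by string, much as getattr does for attributes
parser_class = namespace["HTMLParser"]
print(parser_class is HTMLParser)
```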
localsandglobalsfunctions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning +There is one other important difference between the
localsandglobalsfunctions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning it.Example 8.12.
localsis read-only,globalsis notdef foo(arg): @@ -6437,14 +5960,14 @@ print "z=",z![]()
- ![]()
Since foois called with3, this will print{'arg': 3, 'x': 1}. This should not be a surprise. +Since foois called with3, this will print{'arg': 3, 'x': 1}. This should not be a surprise.@@ -6457,7 +5980,7 @@ print "z=",z - ![]()
localsis a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this - would change the value of the local variable x to2, but it doesn't.localsdoes not actually return the local namespace, it returns a copy. So changing it does nothing to the value of the variables +localsis a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this + would change the value of the local variable x to2, but it doesn't.localsdoes not actually return the local namespace, it returns a copy. So changing it does nothing to the value of the variables in the local namespace.![]()
- ![]()
After being burned by locals, you might think that this wouldn't change the value of z, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself),globalsreturns the actual global namespace, not a copy: the exact opposite behavior oflocals. So any changes to the dictionary returned byglobalsdirectly affect your global variables. +After being burned by locals, you might think that this wouldn't change the value of z, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself),globalsreturns the actual global namespace, not a copy: the exact opposite behavior oflocals. So any changes to the dictionary returned byglobalsdirectly affect your global variables.@@ -6468,9 +5991,9 @@ print "z=",z @@ -6526,14 +6049,14 @@ meaningful keys and values already. Like![]()
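A compact Python 3 sketch of the difference, under the assumption that you run it as a module or script (so that `globals()` refers to the namespace where `z` lives):

```python
z = 7

def foo(arg):
    x = 1
    locals()["x"] = 2       # changes a dictionary, not the variable x
    return x

result = foo(3)             # x is still 1 inside foo

globals()["z"] = 8          # changes the real module namespace
print(result, z)
```

The takeaway is the same as in the text: treat the dictionary from `locals` as something to read, not something to write to.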
8.6. Dictionary-based string formatting
-Why did you learn about
localsandglobals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in -place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple -values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're +Why did you learn about
localsandglobals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in +place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple +values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're constantly switching between reading the string and reading the tuple of values.There is an alternative form of string formatting that uses dictionaries instead of tuples of values.
Example 8.13. Introducing dictionary-based string formatting
@@ -6485,13 +6008,13 @@ constantly switching between reading the string and reading the tuple of values.- ![]()
Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker. +Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker.@@ -6503,7 +6026,7 @@ constantly switching between reading the string and reading the tuple of values.
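A minimal sketch of the named-marker form follows. Only the `pwd` key and its value `secret` come from the text; the other entries in `params` are hypothetical, added to show more than one marker in a single string.

```python
# "pwd": "secret" is from the text; "uid" and "server" are made up.
params = {"uid": "sa", "pwd": "secret", "server": "example"}

# One named marker: the key in parentheses selects the value
single = "%(pwd)s" % params

# Several named markers, in any order, each looked up by key
combined = "%(uid)s logs in to %(server)s with %(pwd)s" % params

print(single)
print(combined)
```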
Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the + Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the formatting will fail with a KeyError.So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys and values simply to do string formatting in the next line; it's really most useful when you happen to have a dictionary of -meaningful keys and values already. Like
locals. +meaningful keys and values already. Likelocals.Example 8.14. Dictionary-based string formatting in
BaseHTMLProcessor.pydef handle_comment(self, text): self.pieces.append("<!--%(text)s-->" % locals())@@ -6512,8 +6035,8 @@ meaningful keys and values already. Like
-
Using the built-in localsfunction is the most common use of dictionary-based string formatting. It means that you can use the names of local variables - within your string (in this case, text, which was passed to the class method as an argument) and each named variable will be replaced by its value. If text is'Begin page footer', the string formatting"<!--%(text)s-->" % locals()will resolve to the string'<!--Begin page footer-->'. +Using the built-in localsfunction is the most common use of dictionary-based string formatting. It means that you can use the names of local variables + within your string (in this case, text, which was passed to the class method as an argument) and each named variable will be replaced by its value. If text is'Begin page footer', the string formatting"<!--%(text)s-->" % locals()will resolve to the string'<!--Begin page footer-->'.-
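To try this outside the class, here is a stand-alone sketch. It is written as a plain function that returns the string, rather than a method that appends to self.pieces, purely so it can be run on its own.

```python
def handle_comment(text):
    # locals() is {'text': text}, so %(text)s finds the argument by name
    return "<!--%(text)s-->" % locals()

comment = handle_comment("Begin page footer")
print(comment)
```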
When this method is called, attrs is a list of key/value tuples, just like the itemsof a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down: +When this method is called, attrs is a list of key/value tuples, just like the -itemsof a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down:
- Suppose attrs is
[('href', 'index.html'), ('title', 'Go to home page')].- In the first round of the list comprehension, key will get
'href', and value will get'index.html'. -- The string formatting
' %s="%s"' % (key, value)will resolve to' href="index.html"'. This string becomes the first element of the list comprehension's return value. +- The string formatting
' %s="%s"' % (key, value)will resolve to' href="index.html"'. This string becomes the first element of the list comprehension's return value.- In the second round, key will get
'title', and value will get'Go to home page'. @@ -6547,7 +6070,7 @@ meaningful keys and values already. Like![]()
Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is 'a', the final result would be'<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces. +Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is @@ -6556,30 +6079,30 @@ meaningful keys and values already. Like'a', the final result would be'<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces.![]()
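The walk-through above can be reproduced step by step. In this sketch `rebuild_starttag` is a hypothetical stand-alone version of `unknown_starttag` that returns the string instead of appending it to self.pieces.

```python
def rebuild_starttag(tag, attrs):
    # One ' key="value"' fragment per attribute, joined into one string
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # locals() supplies both tag and strattrs by name
    return "<%(tag)s%(strattrs)s>" % locals()

attrs = [("href", "index.html"), ("title", "Go to home page")]

# The list comprehension on its own, before the join
fragments = [' %s="%s"' % (key, value) for key, value in attrs]
print(fragments)

start_tag = rebuild_starttag("a", attrs)
print(start_tag)
```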
- Using dictionary-based string formatting with localsis a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a +Using dictionary-based string formatting with localsis a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a slight performance hit in making the call tolocals, sincelocalsbuilds a copy of the local namespace.8.7. Quoting attribute values
-A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through
BaseHTMLProcessor. -
BaseHTMLProcessorconsumes HTML (since it's descended fromSGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase +A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through
BaseHTMLProcessor. +
BaseHTMLProcessorconsumes HTML (since it's descended fromSGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes -at all. It is this last side effect that you can take advantage of. +at all. It is this last side effect that you can take advantage of.Example 8.16. Quoting attribute values
>>> htmlSource = """-... <html> -... <head> -... <title>Test page</title> -... </head> -... <body> -... <ul> -... <li><a href=index.html>Home</a></li> -... <li><a href=toc.html>Table of contents</a></li> -... <li><a href=history.html>Revision history</a></li> -... </body> -... </html> -... """ +... <html> +... <head> +... <title>Test page</title> +... </head> +... <body> +... <ul> +... <li><a href=index.html>Home</a></li> +... <li><a href=toc.html>Table of contents</a></li> +... <li><a href=history.html>Revision history</a></li> +... </body> +... </html> +... """ >>> from BaseHTMLProcessor import BaseHTMLProcessor >>> parser = BaseHTMLProcessor() >>> parser.feed(htmlSource)
@@ -6599,7 +6122,7 @@ at all. It is this last side effect that you can take advantage of.
- ![]()
Note that the attribute values of the hrefattributes in the<a>tags are not properly quoted. (Also note that you're using triple quotes for something other than adocstring. And directly in the IDE, no less. They're very useful.) +Note that the attribute values of the hrefattributes in the<a>tags are not properly quoted. (Also note that you're using triple quotes for something other than adocstring. And directly in the IDE, no less. They're very useful.)@@ -6610,13 +6133,13 @@ at all. It is this last side effect that you can take advantage of. - ![]()
Using the outputfunction defined inBaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think +Using the outputfunction defined inBaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think about how much has actually happened here:SGMLParserparsed the entire HTML document, breaking it down into tags, refs, data, and so forth;BaseHTMLProcessorused those elements to reconstruct pieces of HTML (which are still stored in parser.pieces, if you want to see them); finally, you calledparser.output, which joined all the pieces of HTML into one string.8.8. Introducing
-dialect.py
Dialectizeris a simple (and silly) descendant ofBaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within ablock passes through unaltered. +<pre>...</pre>
Dialectizeris a simple (and silly) descendant ofBaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within ablock passes through unaltered.<pre>...</pre>To handle the
<pre>blocks, you define two methods inDialectizer:start_preandend_pre.Example 8.17. Handling specific tags
def start_pre(self, attrs):@@ -6630,25 +6153,25 @@ at all. It is this last side effect that you can take advantage of.
- ![]()
start_preis called every timeSGMLParserfinds a<pre>tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, just likeunknown_starttagtakes. +start_preis called every timeSGMLParserfinds a<pre>tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, just likeunknown_starttagtakes.- ![]()
In the resetmethod, you initialize a data attribute that serves as a counter for<pre>tags. Every time you hit a<pre>tag, you increment the counter; every time you hit a</pre>tag, you'll decrement the counter. (You could just use this as a flag and set it to1and reset it to0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested<pre>tags.) In a minute, you'll see how this counter is put to good use. +In the resetmethod, you initialize a data attribute that serves as a counter for<pre>tags. Every time you hit a<pre>tag, you increment the counter; every time you hit a</pre>tag, you'll decrement the counter. (You could just use this as a flag and set it to1and reset it to0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested<pre>tags.) In a minute, you'll see how this counter is put to good use.- ![]()
That's it, that's the only special processing you do for <pre>tags. Now you pass the list of attributes along tounknown_starttagso it can do the default processing. +That's it, that's the only special processing you do for <pre>tags. Now you pass the list of attributes along tounknown_starttagso it can do the default processing.- ![]()
end_preis called every timeSGMLParserfinds a</pre>tag. Since end tags can not contain attributes, the method takes no parameters. +end_preis called every timeSGMLParserfinds a</pre>tag. Since end tags can not contain attributes, the method takes no parameters.@@ -6663,7 +6186,7 @@ at all. It is this last side effect that you can take advantage of. -At this point, it's worth digging a little further into
SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) thatSGMLParserlooks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition ofstart_preandend_preto handle<pre>and</pre>. But how does this happen? Well, it's not magic, it's just good Python coding. +At this point, it's worth digging a little further into
SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) thatSGMLParserlooks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition ofstart_preandend_preto handle<pre>and</pre>. But how does this happen? Well, it's not magic, it's just good Python coding.Example 8.18.
SGMLParserdef finish_starttag(self, tag, attrs):try: @@ -6688,14 +6211,14 @@ at all. It is this last side effect that you can take advantage of.
- ![]()
At this point, SGMLParserhas already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a +At this point, SGMLParserhas already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).- ![]()
The “magic” of SGMLParseris nothing more than your old friend,getattr. What you may not have realized before is thatgetattrwill find methods defined in descendants of an object as well as the object itself. Here the object isself, the current instance. So if tag is'pre', this call togetattrwill look for astart_premethod on the current instance, which is an instance of theDialectizerclass. +The “magic” of SGMLParseris nothing more than your old friend,getattr. What you may not have realized before is thatgetattrwill find methods defined in descendants of an object as well as the object itself. Here the object isself, the current instance. So if tag is'pre', this call togetattrwill look for astart_premethod on the current instance, which is an instance of theDialectizerclass.@@ -6708,19 +6231,19 @@ at all. It is this last side effect that you can take advantage of. - ![]()
Since you didn't find a start_xxxmethod, you'll also look for ado_xxxmethod before giving up. This alternate naming scheme is generally used for standalone tags, like<br>, which have no corresponding end tag. But you can use either naming scheme; as you can see,SGMLParsertries both for every tag. (You shouldn't define both astart_xxxanddo_xxxhandler method for the same tag, though; only thestart_xxxmethod will get called.) +Since you didn't find a start_xxxmethod, you'll also look for ado_xxxmethod before giving up. This alternate naming scheme is generally used for standalone tags, like<br>, which have no corresponding end tag. But you can use either naming scheme; as you can see,SGMLParsertries both for every tag. (You shouldn't define both astart_xxxanddo_xxxhandler method for the same tag, though; only thestart_xxxmethod will get called.)- ![]()
Another AttributeError, which means that the call togetattrfailed withdo_xxx. Since you found neither astart_xxxnor ado_xxxmethod for this tag, you catch the exception and fall back on the default method,unknown_starttag. +Another AttributeError, which means that the call togetattrfailed withdo_xxx. Since you found neither astart_xxxnor ado_xxxmethod for this tag, you catch the exception and fall back on the default method,unknown_starttag.- ![]()
Remember, try...exceptblocks can have anelseclause, which is called if no exception is raised during thetry...exceptblock. Logically, that means that you did find ado_xxxmethod for this tag, so you're going to call it. +Remember, try...exceptblocks can have anelseclause, which is called if no exception is raised during thetry...exceptblock. Logically, that means that you did find ado_xxxmethod for this tag, so you're going to call it.@@ -6728,22 +6251,22 @@ at all. It is this last side effect that you can take advantage of. By the way, don't worry about these different return values; in theory they mean something, but they're never actually used. Don't worry about the self.stack.append(tag)either;SGMLParserkeeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this - information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably - not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now. + information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably + not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.- - ![]()
start_xxxanddo_xxxmethods are not called directly; the tag, method, and attributes are passed to this function,handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call - the method (start_xxxordo_xxx) with the list of attributes. Remember, method is a function, returned fromgetattr, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run +start_xxxanddo_xxxmethods are not called directly; the tag, method, and attributes are passed to this function,handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call + the method (start_xxxordo_xxx) with the list of attributes. Remember, method is a function, returned fromgetattr, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and - this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named, + this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named, or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.Now back to our regularly scheduled program:
Dialectizer. When you left, you were in the process of defining specific handler methods for<pre>and</pre>tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, +Now back to our regularly scheduled program:
Dialectizer. When you left, you were in the process of defining specific handler methods for<pre>and</pre>tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, you need to override thehandle_datamethod.Example 8.19. Overriding the
handle_datamethoddef handle_data(self, text):@@ -6758,16 +6281,16 @@ you need to override the
handle_datamethod.- - ![]()
In the ancestor BaseHTMLProcessor, thehandle_datamethod simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of ablock, self.verbatim will be some value greater than<pre>...</pre>0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the - substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using theand-ortrick. +In the ancestor BaseHTMLProcessor, thehandle_datamethod simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of ablock, self.verbatim will be some value greater than<pre>...</pre>0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the + substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using theand-ortrick.You're close to completely understanding
Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes -later indialect.pydefine a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough +You're close to completely understanding
Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes +later indialect.pydefine a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough for one chapter.8.9. Putting it all together
-It's time to put everything you've learned so far to good use. I hope you were paying attention. +
It's time to put everything you've learned so far to good use. I hope you were paying attention.
Example 8.20. The
translatefunction, part 1def translate(url, dialectName="chef"):import urllib
@@ -6779,15 +6302,15 @@ def translate(url, dialectName="chef"):
-
The translatefunction has an optional argument dialectName, which is a string that specifies the dialect you'll be using. You'll see how this is used in a minute. +The translatefunction has an optional argument dialectName, which is a string that specifies the dialect you'll be using. You'll see how this is used in a minute.- ![]()
Hey, wait a minute, there's an importstatement in this function! That's perfectly legal in Python. You're used to seeingimportstatements at the top of a program, which means that the imported module is available anywhere in the program. But you can - also import modules within a function, which means that the imported module is only available within the function. If you - have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find +Hey, wait a minute, there's an @@ -6809,30 +6332,30 @@ def translate(url, dialectName="chef"):importstatement in this function! That's perfectly legal in Python. You're used to seeingimportstatements at the top of a program, which means that the imported module is available anywhere in the program. But you can + also import modules within a function, which means that the imported module is only available within the function. If you + have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you'll appreciate this.)![]()
capitalizeis a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else - to lowercase. Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If dialectName is the string'chef', parserName will be the string'ChefDialectizer'. + to lowercase. Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If dialectName is the string'chef', parserName will be the string'ChefDialectizer'.- ![]()
You have the name of a class as a string (parserName), and you have the global namespace as a dictionary ( globals()). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string'ChefDialectizer', parserClass will be the classChefDialectizer. +You have the name of a class as a string (parserName), and you have the global namespace as a dictionary ( globals()). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string'ChefDialectizer', parserClass will be the classChefDialectizer.- - ![]()
Finally, you have a class object (parserClass), and you want an instance of the class. Well, you already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable - like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the classChefDialectizer. +Finally, you have a class object (parserClass), and you want an instance of the class. Well, you already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable + like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the classChefDialectizer.Why bother? After all, there are only 3
Dialectizerclasses; why not just use acasestatement? (Well, there's nocasestatement in Python, but why not just use a series ofifstatements?) One reason: extensibility. Thetranslatefunction has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a newFooDialectizertomorrow;translatewould work by passing'foo'as the dialectName. -Even better, imagine putting
FooDialectizerin a separate module, and importing it withfrom module import. You've already seen that this includes it inglobals(), sotranslatewould still work without modification, even thoughFooDialectizerwas in a separate file. +Why bother? After all, there are only 3
Dialectizerclasses; why not just use acasestatement? (Well, there's nocasestatement in Python, but why not just use a series ofifstatements?) One reason: extensibility. Thetranslatefunction has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a newFooDialectizertomorrow;translatewould work by passing'foo'as the dialectName. +Even better, imagine putting
FooDialectizerin a separate module, and importing it withfrom module import. You've already seen that this includes it inglobals(), sotranslatewould still work without modification, even thoughFooDialectizerwas in a separate file.Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted -value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page. -
Finally, imagine a
Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't +value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page. +

Finally, imagine a
Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an -appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go. +appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go.

Example 8.22. The
translatefunction, part 3parser.feed(htmlSource)parser.close()
@@ -6842,14 +6365,14 @@ appropriately-named file in the plug-ins directory (like
foodialect.py- ![]()
After all that imagining, this is going to seem pretty boring, but the feedfunction is what does the entire transformation. You had the entire HTML source in a single string, so you only had to callfeedonce. However, you can callfeedas often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you - were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser. The result would be the same. +After all that imagining, this is going to seem pretty boring, but the feedfunction is what does the entire transformation. You had the entire HTML source in a single string, so you only had to callfeedonce. However, you can callfeedas often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you + were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser. The result would be the same.@@ -6864,11 +6387,11 @@ appropriately-named file in the plug-ins directory (like - ![]()
Because feedmaintains an internal buffer, you should always call the parser'sclosemethod when you're done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing +Because feedmaintains an internal buffer, you should always call the parser'sclosemethod when you're done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing the last few bytes.foodialect.pyFurther reading
-
- You thought I was kidding about the server-side scripting idea. So did I, until I found this web-based dialectizer. Unfortunately, source code does not appear to be available. +
- You thought I was kidding about the server-side scripting idea. So did I, until I found this web-based dialectizer. Unfortunately, source code does not appear to be available.
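The class lookup that translate performs can be tried in isolation. The following is a minimal sketch, not the code from dialect.py: the two Dialectizer classes are hypothetical stand-ins with a made-up substitute method, and make_dialectizer is a name invented for this illustration. Only capitalize, string formatting, and globals() come from the discussion above.

```python
class ChefDialectizer:
    """Hypothetical stand-in; the real class lives in dialect.py."""
    def substitute(self, text):
        return text + " Bork bork bork!"


class FuddDialectizer:
    """Hypothetical stand-in for a second dialect."""
    def substitute(self, text):
        return text.replace("r", "w")


def make_dialectizer(dialectName="chef"):
    # 'chef' -> 'ChefDialectizer'; capitalize() also lowercases the rest.
    parserName = "%sDialectizer" % dialectName.capitalize()
    # Look the class up by name in this module's global namespace.
    parserClass = globals()[parserName]
    # Calling the class object like a function creates an instance.
    return parserClass()


parser = make_dialectizer("fudd")
print(parser.__class__.__name__)     # FuddDialectizer
print(parser.substitute("rabbit"))   # wabbit
```

In this sketch an unknown dialect name raises KeyError from the globals() lookup. Adding a third class with a matching name would make it available without touching make_dialectizer, which is the extensibility argument made above.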
8.10. Summary
-Python provides you with a powerful tool,
sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways. +Python provides you with a powerful tool,
sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways.
- parsing the HTML looking for something specific @@ -6887,30 +6410,30 @@ appropriately-named file in the plug-ins directory (like
foodialect.py
-[1] The technical term for a parser like
SGMLParseris a consumer: it consumes HTML and breaks it down. Presumably, the namefeedwas chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or +[1] The technical term for a parser like
SGMLParseris a consumer: it consumes HTML and breaks it down. Presumably, the namefeedwas chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the - only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that's just me. In any event, it's an interesting mental image. + only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that's just me. In any event, it's an interesting mental image.-[2] The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list - just adds the element and updates the index. Since strings can not be changed after they are created, code like
s = s + newpiece will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original - string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets - longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n^2). +

[2] The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list + just adds the element and updates the index. Since strings can not be changed after they are created, code like
s = s + newpiece will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original + string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets + longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n^2).

[3] I don't get out much.
-[4] All right, it's not that common a question. It's not up there with “What editor should I use to write Python code?” (answer: Emacs) or “Is Python better or worse than Perl?” (answer: “Perl is worse than Python because people wanted it worse.” -Larry Wall, 10/14/1998) But questions about HTML processing pop up in one form or another about once a month, and among those questions, this is a popular one. +
[4] All right, it's not that common a question. It's not up there with “What editor should I use to write Python code?” (answer: Emacs) or “Is Python better or worse than Perl?” (answer: “Perl is worse than Python because people wanted it worse.” -Larry Wall, 10/14/1998) But questions about HTML processing pop up in one form or another about once a month, and among those questions, this is a popular one.
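Footnote [2] can be made concrete with a short sketch. The two helper names below (build_by_concatenation and build_by_joining) are invented for this illustration. Both return the same string; the second follows the append-then-join approach that BaseHTMLProcessor takes with self.pieces. No timings are shown, since they vary by interpreter and version.

```python
def build_by_concatenation(pieces):
    # Each pass builds a brand-new string from the old one plus the new piece.
    s = ""
    for piece in pieces:
        s = s + piece
    return s


def build_by_joining(pieces):
    # Collect the pieces in a list, then join them once at the end.
    buffer = []
    for piece in pieces:
        buffer.append(piece)
    return "".join(buffer)


pieces = ["<a", ' href="index.html"', ">", "Home", "</a>"]
print(build_by_joining(pieces))   # <a href="index.html">Home</a>
```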
Chapter 9. XML Processing
9.1. Diving in
-These next two chapters are about XML processing in Python. It would be helpful if you already knew what an XML document looks like, that it's made up of structured tags to form a hierarchy of elements, and so on. If this doesn't make +
These next two chapters are about XML processing in Python. It would be helpful if you already knew what an XML document looks like, that it's made up of structured tags to form a hierarchy of elements, and so on. If this doesn't make sense to you, there are many XML tutorials that can explain the basics.
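For readers who want a quick picture of "a hierarchy of elements", here is a small sketch. The XML document is made up for this illustration (it only borrows the ref and p tag names that appear later in the grammar discussion), and it is parsed with the standard xml.dom.minidom module.

```python
from xml.dom import minidom

# A tiny, made-up XML document: a root element containing nested children.
xml_source = """<grammar>
  <ref id="bit">
    <p>0</p>
    <p>1</p>
  </ref>
</grammar>"""

doc = minidom.parseString(xml_source)
root = doc.documentElement                       # the <grammar> element
ref = root.getElementsByTagName("ref")[0]        # its first <ref> child
choices = [p.firstChild.data for p in ref.getElementsByTagName("p")]

print(root.tagName)             # grammar
print(ref.getAttribute("id"))   # bit
print(choices)                  # ['0', '1']
```

Each tag becomes an element node, each attribute is reachable from its element, and the text inside a tag is a child node of that element.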
If you're not particularly interested in XML, you should still read these chapters, which cover important topics like Python packages, Unicode, command line arguments, and how to use
getattrfor method dispatching.Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer science. -
There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the
sgmllibmodule works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM. -The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry yet if you don't understand what that means; you'll examine both the program's input and its output +
There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the
sgmllibmodule works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM. +The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry yet if you don't understand what that means; you'll examine both the program's input and its output in more depth throughout these next two chapters.
Example 9.1.
kgp.pyIf you have not already done so, you can download this and other examples used in this book.
@@ -6921,7 +6444,7 @@ Generates mock philosophy based on a context-free grammar Usage: python kgp.py [options] [source] Options: - -g ..., --grammar=... use specified grammar file or URL + -g ..., --grammar=... use specified grammar file or URL -h, --help show this help -d show debugging information while parsing @@ -6977,7 +6500,7 @@ class KantGenerator: """guess default source of the current grammar The default source will be one of the <ref>s that is not - cross-referenced. This sounds complicated but it's not. + cross-referenced. This sounds complicated but it's not. Example: The default source for kant.xml is "<xref id='section'/>", because 'section' is the one <ref> that is not <xref>'d anywhere in the grammar. @@ -7031,9 +6554,9 @@ class KantGenerator: """parse a single XML node A parsed XML document (from minidom.parse) is a tree of nodes - of various types. Each node is represented by an instance of the + of various types. Each node is represented by an instance of the corresponding Python class (Element for a tag, Text for - text data, Document for the top-level document). The following + text data, Document for the top-level document). The following statement constructs the name of a class method based on the type of node we're parsing ("parse_Element" for an Element node, "parse_Text" for a Text node, etc.) and then calls the method. @@ -7054,8 +6577,8 @@ class KantGenerator: """parse a text node The text of a text node is usually added to the output buffer - verbatim. The one exception is that <p class='sentence'> sets - a flag to capitalize the first letter of the next word. If + verbatim. The one exception is that <p class='sentence'> sets + a flag to capitalize the first letter of the next word. If that flag is set, we capitalize the text and reset the flag. """ text = node.data @@ -7071,7 +6594,7 @@ class KantGenerator: An XML element corresponds to an actual tag in the source: <xref id='...'>, <p chance='...'>, <choice>, etc. 
- Each element type is handled in its own method. Like we did in + Each element type is handled in its own method. Like we did in parse(), we construct a method name based on the name of the element ("do_xref" for an <xref> tag, etc.) and call the method. @@ -7090,7 +6613,7 @@ class KantGenerator: """handle <xref id='...'> tag An <xref id='...'> tag is a cross-reference to a <ref id='...'> - tag. <xref id='sentence'/> evaluates to a randomly chosen child of + tag. <xref id='sentence'/> evaluates to a randomly chosen child of <ref id='sentence'>. """ id = node.attributes["id"].value @@ -7099,10 +6622,10 @@ class KantGenerator: def do_p(self, node): """handle <p> tag - The <p> tag is the core of the grammar. It can contain almost + The <p> tag is the core of the grammar. It can contain almost anything: freeform text, <choice> tags, <xref> tags, even other - <p> tags. If a "class='sentence'" attribute is found, a flag - is set and the next word will be capitalized. If a "chance='X'" + <p> tags. If a "class='sentence'" attribute is found, a flag + is set and the next word will be capitalized. If a "chance='X'" attribute is found, there is an X% chance that the tag will be evaluated (and therefore a (100-X)% chance that it will be completely ignored) @@ -7122,7 +6645,7 @@ class KantGenerator: def do_choice(self, node): """handle <choice> tag - A <choice> tag contains one or more <p> tags. One <p> tag + A <choice> tag contains one or more <p> tags. One <p> tag is chosen at random and evaluated; the rest are ignored. """ self.parse(self.randomChildElement(node)) @@ -7162,7 +6685,7 @@ def openAnything(source): This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) - and deal with it in a uniform manner. Returned object is guaranteed + and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). 
Just .close() the object when you're done with it. @@ -7207,49 +6730,49 @@ def openAnything(source): reference to ends, abstract from all content of knowledge; in the study of space, the discipline of human reason, in accordance with the principles of philosophy, is the clue to the discovery of the -Transcendental Deduction. The transcendental aesthetic, in all +Transcendental Deduction. The transcendental aesthetic, in all theoretical sciences, occupies part of the sphere of human reason concerning the existence of our ideas in general; still, the never-ending regress in the series of empirical conditions constitutes -the whole content for the transcendental unity of apperception. What +the whole content for the transcendental unity of apperception. What we have alone been able to show is that, even as this relates to the architectonic of human reason, the Ideal may not contradict itself, but it is still possible that it may be in contradictions with the employment of the pure employment of our hypothetical judgements, but natural causes (and I assert that this is the case) prove the validity -of the discipline of pure reason. As we have already seen, time (and +of the discipline of pure reason. As we have already seen, time (and it is obvious that this is true) proves the validity of time, and the architectonic of human reason, in the full sense of these terms, -abstracts from all content of knowledge. I assert, in the case of the +abstracts from all content of knowledge. I assert, in the case of the discipline of practical reason, that the Antinomies are just as necessary as natural causes, since knowledge of the phenomena is a posteriori. The discipline of human reason, as I have elsewhere shown, is by its very nature contradictory, but our ideas exclude the possibility of -the Antinomies. We can deduce that, on the contrary, the pure +the Antinomies. 
We can deduce that, on the contrary, the pure employment of philosophy, on the contrary, is by its very nature contradictory, but our sense perceptions are a representation of, in -the case of space, metaphysics. The thing in itself is a -representation of philosophy. Applied logic is the clue to the -discovery of natural causes. However, what we have alone been able to +the case of space, metaphysics. The thing in itself is a +representation of philosophy. Applied logic is the clue to the +discovery of natural causes. However, what we have alone been able to show is that our ideas, in other words, should only be used as a canon for the Ideal, because of our necessary ignorance of the conditions. -[...snip...]This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although -very verbose -- Kant wasn't what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least +[...snip...]
This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although +very verbose -- Kant wasn't what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing that Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent. But all of it is in the style of Immanuel Kant.
Let me repeat that this is much, much funnier if you are now or have ever been a philosophy major. -
The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous -example was derived from the grammar file,
kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be +The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous +example was derived from the grammar file,
kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be completely different.Example 9.4. Simpler output from
kgp.py[you@localhost kgp]$ python kgp.py -g binary.xml 00101001 [you@localhost kgp]$ python kgp.py -g binary.xml -10110100You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is +10110100
You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is that the grammar file defines the structure of the output, and the
kgp.pyprogram reads through the grammar and makes random decisions about which words to plug in where.9.2. Packages
-Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour +
Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages.
Example 9.5. Loading an XML document (a sneak peek)
>>> from xml.dom import minidom@@ -7258,12 +6781,12 @@ that the grammar file defines the structure of the output, and the
kgp.py<- - ![]()
This is a syntax you haven't seen before. It looks almost like the from module importyou know and love, but the"."gives it away as something above and beyond a simple import. In fact,xmlis what is known as a package,domis a nested package withinxml, andminidomis a module withinxml.dom. +This is a syntax you haven't seen before. It looks almost like the from module importyou know and love, but the"."gives it away as something above and beyond a simple import. In fact,xmlis what is known as a package,domis a nested package withinxml, andminidomis a module withinxml.dom.That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than -directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still +
That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than +directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still just
.pyfiles, like always, except that they're in a subdirectory instead of the mainlib/directory of your Python installation.Example 9.6. File layout of a package
Python21/ root Python installation (home of the executable) | @@ -7275,8 +6798,8 @@ just.pyfiles, like always, except that they're in a subdirectory | +--dom/ xml.dom package (contains minidom.py) | - +--parsers/ xml.parsers package (used internally)So when you say
from xml.dom import minidom, Python figures out that that means “look in thexmldirectory for adomdirectory, and look in that for theminidommodule, and import it asminidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import -specific classes or functions from a module contained within a package. You can also import the package itself as a module. + +--parsers/ xml.parsers package (used internally)So when you say
from xml.dom import minidom, Python figures out that that means “look in thexmldirectory for adomdirectory, and look in that for theminidommodule, and import it asminidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import +specific classes or functions from a module contained within a package. You can also import the package itself as a module. The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.Example 9.7. Packages are modules, too
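To see that file layout for yourself, here is a minimal sketch that asks each imported object where it lives on disk. It is written for Python 3 and assumes a standard CPython installation in which the `xml` package is stored as ordinary files; the variable names are only for illustration.

```python
import os
import xml
import xml.dom
from xml.dom import minidom

# A package's __file__ points at its __init__.py; a module's __file__
# points at the module's own file.
xml_dir = os.path.dirname(xml.__file__)          # the xml/ directory
dom_dir = os.path.dirname(xml.dom.__file__)      # the xml/dom/ subdirectory
minidom_dir = os.path.dirname(minidom.__file__)  # where minidom itself lives

print(xml_dir)
print(dom_dir)
print(minidom.__file__)

# The nested package is a subdirectory of the outer package,
# and the module is a file inside the nested package.
nested = os.path.dirname(dom_dir) == xml_dir
inside = minidom_dir == dom_dir
print(nested, inside)
```

The exact paths depend on where Python is installed, but the nesting should mirror the dotted name `xml.dom.minidom`.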
>>> from xml.dom import minidom>>> minidom @@ -7298,19 +6821,19 @@ The syntax is all the same; Python figures out what you mean based on the file l
- ![]()
Here you're importing a module ( minidom) from a nested package (xml.dom). The result is thatminidomis imported into your namespace, and in order to reference classes within theminidommodule (likeElement), you need to preface them with the module name. +Here you're importing a module ( minidom) from a nested package (xml.dom). The result is thatminidomis imported into your namespace, and in order to reference classes within theminidommodule (likeElement), you need to preface them with the module name.- ![]()
Here you are importing a class ( Element) from a module (minidom) from a nested package (xml.dom). The result is thatElementis imported directly into your namespace. Note that this does not interfere with the previous import; theElementclass can now be referenced in two ways (but it's all still the same class). +Here you are importing a class ( Element) from a module (minidom) from a nested package (xml.dom). The result is thatElementis imported directly into your namespace. Note that this does not interfere with the previous import; theElementclass can now be referenced in two ways (but it's all still the same class).@@ -7322,22 +6845,22 @@ The syntax is all the same; Python figures out what you mean based on the file l - ![]()
Here you are importing the dom package (a nested package of xml) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even +Here you are importing the dom package (a nested package of xml) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even have its own attributes and methods, just like the modules you've seen before.

So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)? -The answer is the magical
__init__.pyfile. You see, packages are not simply directories; they are directories with a specific file,__init__.py, inside. This file defines the attributes and methods of the package. For instance,xml.domcontains aNodeclass, which is defined inxml/dom/__init__.py. When you import a package as a module (likedomfromxml), you're really importing its__init__.pyfile.+The answer is the magical
@@ -7394,20 +6917,20 @@ package architecture. It's one of the many things Python is good at, so take ad__init__.pyfile. You see, packages are not simply directories; they are directories with a specific file,__init__.py, inside. This file defines the attributes and methods of the package. For instance,xml.domcontains aNodeclass, which is defined inxml/dom/__init__.py. When you import a package as a module (likedomfromxml), you're really importing its__init__.pyfile.-
- A package is a directory with the special __init__.pyfile in it. The__init__.pyfile defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, - but it has to exist. But if__init__.pydoesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages. +A package is a directory with the special __init__.pyfile in it. The__init__.pyfile defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, + but it has to exist. But if__init__.pydoesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages.So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an
xmlpackage withsaxanddompackages inside, the authors could have chosen to put all thesaxfunctionality inxmlsax.pyand all thedomfunctionality inxmldom.py, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the XML package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different +So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an
xmlpackage withsaxanddompackages inside, the authors could have chosen to put all thesaxfunctionality inxmlsax.pyand all thedomfunctionality inxmldom.py, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the XML package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different areas simultaneously).If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good -package architecture. It's one of the many things Python is good at, so take advantage of it. +package architecture. It's one of the many things Python is good at, so take advantage of it.
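The role of `__init__.py` can be tried out without touching your Python installation. The following is a minimal sketch for Python 3 that builds a throwaway package in a temporary directory; the package name `demo_pkg_sketch`, the module `shapes`, and their contents are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical package name, chosen only for this illustration.
PACKAGE = "demo_pkg_sketch"

with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, PACKAGE)
    os.mkdir(pkg_dir)

    # __init__.py defines the attributes of the package itself.
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("GREETING = 'hello from the package'\n")

    # An ordinary module that lives inside the package directory.
    with open(os.path.join(pkg_dir, "shapes.py"), "w") as f:
        f.write("def area(w, h):\n    return w * h\n")

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        pkg = importlib.import_module(PACKAGE)                 # runs __init__.py
        shapes = importlib.import_module(PACKAGE + ".shapes")  # runs shapes.py
        greeting = pkg.GREETING
        result = shapes.area(3, 4)
    finally:
        sys.path.remove(root)

print(greeting)  # hello from the package
print(result)    # 12
```

Importing the package runs `__init__.py`, which is why `GREETING` shows up as an attribute of the package. Note that Python 3.3 and later will also import a directory that has no `__init__.py` as a "namespace package", so the requirement described in the note above is stricter than what current interpreters enforce.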
9.3. Parsing XML
-As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you. +
As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you.
Example 9.8. Loading an XML document (for real this time)
>>> from xml.dom import minidom>>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')
@@ -7365,20 +6888,20 @@ package architecture. It's one of the many things Python is good at, so take ad
- ![]()
Here is the one line of code that does all the work: minidom.parsetakes one argument and returns a parsed representation of the XML document. The argument can be many things; in this case, it's simply a filename of an XML document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.) - But you can also pass a file object, or even a file-like object. You'll take advantage of this flexibility later in this chapter. +Here is the one line of code that does all the work: minidom.parsetakes one argument and returns a parsed representation of the XML document. The argument can be many things; in this case, it's simply a filename of an XML document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.) + But you can also pass a file object, or even a file-like object. You'll take advantage of this flexibility later in this chapter.- ![]()
The object returned from minidom.parseis aDocumentobject, a descendant of theNodeclass. ThisDocumentobject is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed tominidom.parse. +The object returned from minidom.parseis aDocumentobject, a descendant of theNodeclass. ThisDocumentobject is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed tominidom.parse.- ![]()
toxmlis a method of theNodeclass (and is therefore available on theDocumentobject you got fromminidom.parse).toxmlprints out the XML that thisNoderepresents. For theDocumentnode, this prints out the entire XML document. +toxmlis a method of theNodeclass (and is therefore available on theDocumentobject you got fromminidom.parse).toxmlprints out the XML that thisNoderepresents. For theDocumentnode, this prints out the entire XML document.- ![]()
Every Nodehas achildNodesattribute, which is a list of theNodeobjects. ADocumentalways has only one child node, the root element of the XML document (in this case, thegrammarelement). +Every Nodehas achildNodesattribute, which is a list of theNodeobjects. ADocumentalways has only one child node, the root element of the XML document (in this case, thegrammarelement).- ![]()
To get the first (and in this case, the only) child node, just use regular list syntax. Remember, there is nothing special + To get the first (and in this case, the only) child node, just use regular list syntax. Remember, there is nothing special going on here; this is just a regular Python list of regular Python objects. @@ -7458,7 +6981,7 @@ package architecture. It's one of the many things Python is good at, so take ad - ![]()
Since getting the first child node of a node is a useful and common activity, the Nodeclass has afirstChildattribute, which is synonymous withchildNodes[0]. (There is also alastChildattribute, which is synonymous withchildNodes[-1].) +Since getting the first child node of a node is a useful and common activity, the Nodeclass has afirstChildattribute, which is synonymous withchildNodes[0]. (There is also alastChildattribute, which is synonymous withchildNodes[-1].)- ![]()
Looking at the XML in binary.xml, you might think that thegrammarhas only two child nodes, the tworefelements. But you're missing something: the carriage returns! After the'<grammar>'and before the first'<ref>'is a carriage return, and this text counts as a child node of thegrammarelement. Similarly, there is a carriage return after each'</ref>'; these also count as child nodes. Sogrammar.childNodesis actually a list of 5 objects: 3Textobjects and 2Elementobjects. +Looking at the XML in binary.xml, you might think that thegrammarhas only two child nodes, the tworefelements. But you're missing something: the carriage returns! After the'<grammar>'and before the first'<ref>'is a carriage return, and this text counts as a child node of thegrammarelement. Similarly, there is a carriage return after each'</ref>'; these also count as child nodes. Sogrammar.childNodesis actually a list of 5 objects: 3Textobjects and 2Elementobjects.@@ -7533,34 +7056,34 @@ u'0' - ![]()
The .dataattribute of aTextnode gives you the actual string that the text node represents. But what is that'u'in front of the string? The answer to that deserves its own section. +The .dataattribute of aTextnode gives you the actual string that the text node represents. But what is that'u'in front of the string? The answer to that deserves its own section.9.4. Unicode
-Unicode is a system to represent characters from all the world's different languages. When Python parses an XML document, all data is stored in memory as unicode. +
Unicode is a system to represent characters from all the world's different languages. When Python parses an XML document, all data is stored in memory as unicode.
You'll get to all that in a minute, but first, some background.
Historical note. Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent -that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the +that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets. Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things. Then think about trying to store these documents in the same place (like in the same database table); you would need to store the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around. -Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used +Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used escape codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac Greek -mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve. -
To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.[5] Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used +mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve. +
To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.[5] Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number. Unicode data is never ambiguous. -
Of course, there is still the matter of all these legacy encoding systems. 7-bit ASCII, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “
A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it -(241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters +Of course, there is still the matter of all these legacy encoding systems. 7-bit ASCII, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “
A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it +(241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters for other languages with the remaining numbers, 256 through 65535.When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding -systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding -scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an XML document which explicitly specifies the encoding scheme. +systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding +scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an XML document which explicitly specifies the encoding scheme.
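The character numbers above, and the conversion back to a legacy encoding, can be checked in a few lines. This is a minimal sketch for Python 3, where every `str` is already a unicode string and the result of encoding is a `bytes` object; the sample text is an assumption chosen for illustration.

```python
# Character numbers mentioned above, checked with ord().
print(ord("A"), ord("a"))        # 65 97
print(ord("\xf1"), ord("\xfc"))  # 241 252 (n-with-tilde, u-with-two-dots)

text = "La Pe\xf1a"              # contains character 241

# ISO-8859-1 (latin-1) has a slot for character 241, so this works
# and yields exactly one byte per character.
latin1_bytes = text.encode("iso-8859-1")
print(latin1_bytes)              # b'La Pe\xf1a'

# 7-bit ASCII stops at 127, so the same conversion fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                  # False

# Decoding with the matching encoding gets the original text back.
round_trip = latin1_bytes.decode("iso-8859-1")
print(round_trip == text)        # True
```

The failure in the ASCII case is the same kind of error the interactive examples in this section run into when the default encoding cannot represent a character.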
And on that note, let's get back to Python. -
Python has had unicode support throughout the language since version 2.0. The XML package uses unicode to store all parsed XML data, but you can use unicode anywhere. +
Python has had unicode support throughout the language since version 2.0. The XML package uses unicode to store all parsed XML data, but you can use unicode anywhere.
Example 9.13. Introducing unicode
>>> s = u'Dive in'>>> s @@ -7571,13 +7094,13 @@ Dive in
- ![]()
To create a unicode string instead of a regular ASCII string, add the letter “ u” before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as unicode. +To create a unicode string instead of a regular ASCII string, add the letter “ u” before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as unicode.@@ -7593,7 +7116,7 @@ La Peña - ![]()
When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference. + When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference. - ![]()
The real advantage of unicode, of course, is its ability to store non-ASCII characters, like the Spanish “ ñ” (nwith a tilde over it). The unicode character code for the tilde-n is0xf1in hexadecimal (241 in decimal), which you can type like this:\xf1. +The real advantage of unicode, of course, is its ability to store non-ASCII characters, like the Spanish “ ñ” (nwith a tilde over it). The unicode character code for the tilde-n is0xf1in hexadecimal (241 in decimal), which you can type like this:\xf1.@@ -7605,8 +7128,8 @@ La Peña @@ -7623,14 +7146,14 @@ sys.setdefaultencoding('iso-8859-1') - ![]()
Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. s is a unicode string, but print can only print a regular string. To solve this problem, you call the encode method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme, - which you pass as a parameter. In this case, you're using latin-1 (also known as iso-8859-1), which includes the tilde-n (whereas the default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127). +Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. s is a unicode string, but print can only print a regular string. To solve this problem, you call the encode method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme, + which you pass as a parameter. In this case, you're using latin-1 (also known as iso-8859-1), which includes the tilde-n (whereas the default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127).-
sitecustomize.pyis a special script; Python will try to import it on startup, so any code in it will be run automatically. As the comment mentions, it can go anywhere +sitecustomize.pyis a special script; Python will try to import it on startup, so any code in it will be run automatically. As the comment mentions, it can go anywhere (as long asimportcan find it), but it usually goes in thesite-packagesdirectory within your Pythonlibdirectory.@@ -7645,8 +7168,8 @@ La Peña - ![]()
setdefaultencodingfunction sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string. +setdefaultencodingfunction sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string.- ![]()
This example assumes that you have made the changes listed in the previous example to your sitecustomize.pyfile, and restarted Python. If your default encoding still says'ascii', you didn't set up yoursitecustomize.pyproperly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even - callsys.setdefaultencodingafter Python has started up. Dig intosite.pyand search for “setdefaultencoding” to find out how.) +This example assumes that you have made the changes listed in the previous example to your sitecustomize.pyfile, and restarted Python. If your default encoding still says'ascii', you didn't set up yoursitecustomize.pyproperly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even + callsys.setdefaultencodingafter Python has started up. Dig intosite.pyand search for “setdefaultencoding” to find out how.)@@ -7657,11 +7180,11 @@ La Peña Example 9.17. Specifying encoding in
-.pyfilesIf you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual
.pyfile by putting an encoding declaration at the top of each file. This declaration defines the.pyfile to be UTF-8:+If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual
.pyfile by putting an encoding declaration at the top of each file. This declaration defines the.pyfile to be UTF-8:#!/usr/bin/env python # -*- coding: UTF-8 -*- -Now, what about XML? Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R -is popular for Russian texts. The encoding, if specified, is in the header of the XML document. +
Now, what about XML? Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R +is popular for Russian texts. The encoding, if specified, is in the header of the XML document.
Example 9.18.
russiansample.xml<?xml version="1.0" encoding="koi8-r"?><preface> @@ -7671,13 +7194,13 @@ is popular for Russian texts. The encoding, if specified, is in the header of t
- ![]()
This is a sample extract from a real Russian XML document; it's part of a Russian translation of this very book. Note the encoding, koi8-r, specified in the header. +This is a sample extract from a real Russian XML document; it's part of a Russian translation of this very book. Note the encoding, koi8-r, specified in the header.@@ -7701,7 +7224,7 @@ UnicodeError: ASCII encoding error: ordinal not in range(128) - ![]()
These are Cyrillic characters which, as far as I know, spell the Russian word for “Preface”. If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded + These are Cyrillic characters which, as far as I know, spell the Russian word for “Preface”. If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded using the koi8-r encoding scheme, but they're being displayed in iso-8859-1.@@ -7727,13 +7250,13 @@ UnicodeError: ASCII encoding error: ordinal not in range(128) - ![]()
I'm assuming here that you saved the previous example as russiansample.xmlin the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back +I'm assuming here that you saved the previous example as russiansample.xmlin the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back to'ascii'by removing yoursitecustomize.pyfile, or at least commenting out thesetdefaultencodingline.- ![]()
Printing the koi8-r-encoded string will probably show gibberish on your screen, because your Python IDE is interpreting those characters asiso-8859-1, notkoi8-r. But at least they do print. (And, if you look carefully, it's the same gibberish that you saw when you opened the original -XML document in a non-unicode-aware text editor. Python converted it fromkoi8-rinto unicode when it parsed the XML document, and you've just converted it back.) +Printing the koi8-r-encoded string will probably show gibberish on your screen, because your Python IDE is interpreting those characters asiso-8859-1, notkoi8-r. But at least they do print. (And, if you look carefully, it's the same gibberish that you saw when you opened the original +XML document in a non-unicode-aware text editor. Python converted it fromkoi8-rinto unicode when it parsed the XML document, and you've just converted it back.)To sum up, unicode itself is a bit intimidating if you've never seen it before, but unicode data is really very easy to handle -in Python. If your XML documents are all 7-bit ASCII (like the examples in this chapter), you will literally never think about unicode. Python will convert the ASCII data in the XML documents into unicode while parsing, and auto-coerce it back to ASCII whenever necessary, and you'll never even notice. But if you need to deal with that in other languages, Python is ready. +in Python. If your XML documents are all 7-bit ASCII (like the examples in this chapter), you will literally never think about unicode. Python will convert the ASCII data in the XML documents into unicode while parsing, and auto-coerce it back to ASCII whenever necessary, and you'll never even notice. But if you need to deal with that in other languages, Python is ready.
Further reading
@@ -7745,7 +7268,7 @@ in Python. If your XML documents are all 7-bit ASCI
9.5. Searching for elements
-Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within +
Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within your XML document, there is a shortcut you can use to find it quickly:
getElementsByTagName.For this section, you'll be using the
binary.xmlgrammar file, which looks like this:Example 9.20.
binary.xml<?xml version="1.0"?> @@ -7759,7 +7282,7 @@ in Python. If your XML documents are all 7-bit ASCI <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p> </ref> -</grammar>It has two
refs,'bit'and'byte'. Abitis either a'0'or'1', and abyteis 8bits. +</grammar>It has two
refs,'bit'and'byte'. Abitis either a'0'or'1', and abyteis 8bits.Example 9.21. Introducing
getElementsByTagName>>> from xml.dom import minidom >>> xmldoc = minidom.parse('binary.xml') @@ -7781,7 +7304,7 @@ in Python. If your XML documents are all 7-bit ASCI@@ -7815,7 +7338,7 @@ in Python. If your XML documents are all 7-bit ASCI - ![]()
getElementsByTagName takes one argument, the name of the element you wish to find. It returns a list of Element objects, corresponding to the XML elements that have that name. In this case, you find two ref elements. +getElementsByTagName takes one argument, the name of the element you wish to find. It returns a list of Element objects, corresponding to the XML elements that have that name. In this case, you find two ref elements.@@ -7834,7 +7357,7 @@ in Python. If your XML documents are all 7-bit ASCI - ![]()
Just as before, the getElementsByTagNamemethod returns a list of all the elements it found. In this case, you have two, one for each bit. +Just as before, the getElementsByTagNamemethod returns a list of all the elements it found. In this case, you have two, one for each bit.- ![]()
Note carefully the difference between this and the previous example. Previously, you were searching for pelements within firstref, but here you are searching forpelements within xmldoc, the root-level object that represents the entire XML document. This does find thepelements nested within therefelements within the rootgrammarelement. +Note carefully the difference between this and the previous example. Previously, you were searching for pelements within firstref, but here you are searching forpelements within xmldoc, the root-level object that represents the entire XML document. This does find thepelements nested within therefelements within the rootgrammarelement.@@ -7857,7 +7380,7 @@ in Python. If your XML documents are all 7-bit ASCI - @@ -7883,13 +7406,13 @@ in Python. If your XML documents are all 7-bit ASCIThis section may be a little confusing, because of some overlapping terminology. Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents. I told you it was confusing. I am open to suggestions on how to distinguish these + This section may be a little confusing, because of some overlapping terminology. Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents. 
I told you it was confusing. I am open to suggestions on how to distinguish these more clearly. - ![]()
Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it. +Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it.- ![]()
Treating the NamedNodeMapas a dictionary, you can get a list of the names of the attributes of this element by usingattributes.keys(). This element has only one attribute,'id'. +Treating the NamedNodeMapas a dictionary, you can get a list of the names of the attributes of this element by usingattributes.keys(). This element has only one attribute,'id'.@@ -7901,14 +7424,14 @@ in Python. If your XML documents are all 7-bit ASCI - ![]()
Again treating the NamedNodeMapas a dictionary, you can get a list of the values of the attributes by usingattributes.values(). The values are themselves objects, of typeAttr. You'll see how to get useful information out of this object in the next example. +Again treating the NamedNodeMapas a dictionary, you can get a list of the values of the attributes by usingattributes.values(). The values are themselves objects, of typeAttr. You'll see how to get useful information out of this object in the next example.@@ -7924,7 +7447,7 @@ u'bit' - ![]()
Still treating the NamedNodeMapas a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been - paying extra-close attention will already know how theNamedNodeMapclass accomplishes this neat trick: by defining a__getitem__special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.) +Still treating the NamedNodeMapas a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been + paying extra-close attention will already know how theNamedNodeMapclass accomplishes this neat trick: by defining a__getitem__special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.)- ![]()
The Attrobject completely represents a single XML attribute of a single XML element. The name of the attribute (the same name as you used to find this object in thebitref.attributesNamedNodeMappseudo-dictionary) is stored ina.name. +The Attrobject completely represents a single XML attribute of a single XML element. The name of the attribute (the same name as you used to find this object in thebitref.attributesNamedNodeMappseudo-dictionary) is stored ina.name.@@ -7939,13 +7462,13 @@ u'bit' - Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the Attrobjects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes +Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the Attrobjects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes by name, like the keys of a dictionary.9.7. Segue
-OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on +
OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on other aspects that make the program more flexible: using streams for input processing, using
getattrfor method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code.Before moving on to the next chapter, you should be comfortable doing all of these things:
@@ -7957,20 +7480,20 @@ u'bit'
-[5] This, sadly, is still an oversimplification. Unicode now has been extended to handle ancient Chinese, Korean, and Japanese texts, which had so - many different characters that the 2-byte unicode system could not represent them all. But Python doesn't currently support that out of the box, and I don't know if there is a project afoot to add it. You've reached the +
[5] This, sadly, is still an oversimplification. Unicode now has been extended to handle ancient Chinese, Korean, and Japanese texts, which had so + many different characters that the 2-byte unicode system could not represent them all. But Python doesn't currently support that out of the box, and I don't know if there is a project afoot to add it. You've reached the limits of my expertise, sorry.
Chapter 10. Scripts and Streams
10.1. Abstracting input sources
One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object.
Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close -it when they're done. But they don't. Instead, they take a file-like object. -
In the simplest case, a file-like object is any object with a
readmethod with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When +it when they're done. But they don't. Instead, they take a file-like object. +In the simplest case, a file-like object is any object with a
readmethod with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left off and returns the next chunk of data. -This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on -disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply +
This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on +disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply calls the object's
readmethod, the function can handle any kind of input source without specific code to handle each kind.In case you were wondering how this relates to XML processing,
minidom.parseis one such function which can take a file-like object.Example 10.1. Parsing XML from a file
@@ -7994,7 +7517,7 @@ calls the object'sreadmethod, the function can handle any kind of- ![]()
First, you open the file on disk. This gives you a file object. + First, you open the file on disk. This gives you a file object. @@ -8006,7 +7529,7 @@ calls the object's readmethod, the function can handle any kind of- ![]()
Be sure to call the closemethod of the file object after you're done with it.minidom.parsewill not do this for you. +Be sure to call the closemethod of the file object after you're done with it.minidom.parsewill not do this for you.@@ -8016,8 +7539,8 @@ calls the object's -readmethod, the function can handle any kind ofWell, that all seems like a colossal waste of time. After all, you've already seen that
minidom.parsecan simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're -just going to be parsing a local file, you can pass the filename andminidom.parseis smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet. +Well, that all seems like a colossal waste of time. After all, you've already seen that
minidom.parsecan simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're +just going to be parsing a local file, you can pass the filename andminidom.parseis smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet.Example 10.2. Parsing XML from a URL
>>> import urllib >>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf')@@ -8050,13 +7573,13 @@ just going to be parsing a local file, you can pass the filename and
minid- ![]()
As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the HTML source of the web page. +As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the HTML source of the web page.- ![]()
Now you pass the file-like object to minidom.parse, which obediently calls thereadmethod of the object and parses the XML data that thereadmethod returns. The fact that this XML data is now coming straight from a web page is completely irrelevant.minidom.parsedoesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects. +Now you pass the file-like object to minidom.parse, which obediently calls thereadmethod of the object and parses the XML data that thereadmethod returns. The fact that this XML data is now coming straight from a web page is completely irrelevant.minidom.parsedoesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects.@@ -8068,7 +7591,7 @@ just going to be parsing a local file, you can pass the filename and -minid@@ -8082,13 +7605,13 @@ just going to be parsing a local file, you can pass the filename and - ![]()
By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site. + By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site. minid- - ![]()
minidomhas a method,parseString, which takes an entire XML document as a string and parses it. You can use this instead ofminidom.parseif you know you already have your entire XML document in a string. +minidomhas a method,parseString, which takes an entire XML document as a string and parses it. You can use this instead ofminidom.parseif you know you already have your entire XML document in a string.OK, so you can use the
minidom.parsefunction for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a -file, a URL, or a string, you'll need special logic to check whether it's a string, and call theparseStringfunction instead. How unsatisfying. -If there were a way to turn a string into a file-like object, then you could simply pass this object to
minidom.parse. And in fact, there is a module specifically designed for doing just that:StringIO. +OK, so you can use the
minidom.parsefunction for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a +file, a URL, or a string, you'll need special logic to check whether it's a string, and call theparseStringfunction instead. How unsatisfying. +If there were a way to turn a string into a file-like object, then you could simply pass this object to
minidom.parse. And in fact, there is a module specifically designed for doing just that:StringIO.Example 10.4. Introducing
StringIO>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>" >>> import StringIO @@ -8109,20 +7632,20 @@ file, a URL, or a string, you'll need special logic to check- ![]()
The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object. The StringIO class takes the string as a parameter when creating an instance. +The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object. The StringIO class takes the string as a parameter when creating an instance.
Now you have a file-like object, and you can do all sorts of file-like things with it. Like read, which returns the original string. +Now you have a file-like object, and you can do all sorts of file-like things with it. Like read, which returns the original string.- ![]()
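If you are following along in Python 3, the same idea can be sketched with `io.StringIO`, which is the Python 3 counterpart of the `StringIO.StringIO` class used here. This is only a sketch under that assumption; the variable names `first_read` and `refs` are made up for illustration.

```python
import io
from xml.dom import minidom

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

# Wrap the string in a file-like object.
ssock = io.StringIO(contents)

# read() with no size argument returns everything, i.e. the original string.
first_read = ssock.read()

# Rewind before handing the object to the parser, because read()
# has already consumed the whole buffer.
ssock.seek(0)
xmldoc = minidom.parse(ssock)
ssock.close()

refs = xmldoc.getElementsByTagName('ref')
print(first_read == contents, len(refs))
```

`minidom.parse` only ever calls `read` on the object it is given, so it cannot tell this string apart from a file on disk.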
Calling readagain returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any - more without explicitly seeking to the beginning of the file. TheStringIOobject works the same way. +Calling readagain returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any + more without explicitly seeking to the beginning of the file. TheStringIOobject works the same way.@@ -8140,7 +7663,7 @@ file, a URL, or a string, you'll need special logic to check @@ -8161,7 +7684,7 @@ file, a URL, or a string, you'll need special logic to check - ![]()
At any time, readwill return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term +At any time, readwill return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term file-like object.So now you know how to use a single function,
minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you useurlopento get a file-like object; for a local file, you useopen; and for a string, you useStringIO. Now let's take it one step further and generalize these differences as well. +So now you know how to use a single function,
minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you useurlopento get a file-like object; for a local file, you useopen; and for a string, you useStringIO. Now let's take it one step further and generalize these differences as well.Example 10.6.
openAnythingdef openAnything(source):# try to open with urllib (if source is http, ftp, or file URL) @@ -8184,26 +7707,26 @@ def openAnything(source):
-
The openAnythingfunction takes a single parameter, source, and returns a file-like object. source is a string of some sort; it can either be a URL (like'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like'binary.xml'), or a string that contains actual XML data to be parsed. +The openAnythingfunction takes a single parameter, source, and returns a file-like object. source is a string of some sort; it can either be a URL (like'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like'binary.xml'), or a string that contains actual XML data to be parsed.- ![]()
First, you see if source is a URL. You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL. This is actually elegant in the sense that, if urllibever supports new types of URLs in the future, you will also support them without recoding. Ifurllibis able to open source, then thereturnkicks you out of the function immediately and the followingtrystatements never execute. +First, you see if source is a URL. You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL. This is actually elegant in the sense that, if urllibever supports new types of URLs in the future, you will also support them without recoding. Ifurllibis able to open source, then thereturnkicks you out of the function immediately and the followingtrystatements never execute.- ![]()
On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between platforms, so you'd probably - get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors. +On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between platforms, so you'd probably + get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors.@@ -8215,23 +7738,23 @@ class KantGenerator: xmldoc = minidom.parse(sock).documentElement sock.close() return xmldoc - ![]()
By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIOto create a file-like object out of it and return that. (In fact, since you're using thestrfunction, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its__str__special method.) +By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIOto create a file-like object out of it and return that. (In fact, since you're using thestrfunction, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its__str__special method.)10.2. Standard input, output, and error
-UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for +
UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. -
Standard output and standard error (commonly abbreviated
stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program -prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system +Standard output and standard error (commonly abbreviated
stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program +prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.)Example 10.8. Introducing
stdoutandstderr>>> for i in range(3): -... print 'Dive in'+... print 'Dive in'
Dive in Dive in Dive in >>> import sys >>> for i in range(3): -... sys.stdout.write('Dive in')
+... sys.stdout.write('Dive in')
Dive inDive inDive in >>> for i in range(3): -... sys.stderr.write('Dive in')
+... sys.stderr.write('Dive in')
Dive inDive inDive in
-
@@ -8243,17 +7766,17 @@ Dive inDive inDive in - ![]()
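stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what print really does: it adds a carriage return to the end of the string you're printing, and calls sys.stdout.write. +stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what print really does: it adds a carriage return to the end of the string you're printing, and calls sys.stdout.write.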
In the simplest case, stdout and stderr send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like stdout, stderr does not add carriage returns for you; if you want them, add them yourself. +In the simplest case, stdout and stderr send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like stdout, stderr does not add carriage returns for you; if you want them, add them yourself.
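The difference between printing and calling `write` directly is easier to see when the output lands somewhere you can inspect. The following is a minimal sketch, assuming Python 3, where `print` is a function that accepts a `file` argument. The in-memory buffer `buf` is a stand-in for the terminal and is not part of the original example.

```python
import io

# Stand-in for a terminal, so the exact characters can be inspected afterwards.
buf = io.StringIO()

for i in range(3):
    print('Dive in', file=buf)   # print appends a newline after each string

for i in range(3):
    buf.write('Dive in')         # write emits exactly what it is given

output = buf.getvalue()
print(repr(output))
```

The first three strings end up on separate lines and the last three run together, which matches what the interactive session above shows.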
stdoutandstderrare both file-like objects, like the ones you discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have noreadmethod, onlywrite. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output. +
stdoutandstderrare both file-like objects, like the ones you discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have noreadmethod, onlywrite. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output.Example 10.9. Redirecting output
[you@localhost kgp]$ python stdout.py Dive in @@ -8287,7 +7810,7 @@ fsock.close()![]()
- ![]()
Open a file for writing. If the file doesn't exist, it will be created. If the file does exist, it will be overwritten. +Open a file for writing. If the file doesn't exist, it will be created. If the file does exist, it will be overwritten. - @@ -8342,13 +7865,13 @@ raise Exception, 'this error will be logged'
![]()
Raise an exception. Note from the screen output that this does not print anything on screen. All the normal traceback information has been written to error.log. +Raise an exception. Note from the screen output that this does not print anything on screen. All the normal traceback information has been written to error.log.@@ -8365,13 +7888,13 @@ entering function - ![]()
Also note that you're not explicitly closing your log file, nor are you setting stderrback to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for us, and it doesn't make any difference thatstderris never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important forstdout, if you expect to go do other stuff within the same script afterwards. +Also note that you're not explicitly closing your log file, nor are you setting stderrback to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for us, and it doesn't make any difference thatstderris never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important forstdout, if you expect to go do other stuff within the same script afterwards.- ![]()
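When the script does carry on after redirecting, the save-and-restore step can be sketched like this. The sketch assumes Python 3, and it uses an in-memory `io.StringIO` target (a hypothetical choice, named `capture` here) instead of a log file so the result is easy to check.

```python
import io
import sys

saved_stdout = sys.stdout      # remember the real stdout before redirecting
capture = io.StringIO()        # hypothetical in-memory target for the output

sys.stdout = capture
try:
    print('This message will be captured')
finally:
    # Restore the original even if the code above raises an exception.
    sys.stdout = saved_stdout

captured = capture.getvalue()
print('Back on the real stdout')
```

The `try...finally` block puts the original back even when the redirected code fails, so later output does not silently disappear into the wrong place.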
This shorthand syntax of the stderrwithout affecting subsequentThis shorthand syntax of the stderrwithout affecting subsequentStandard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some -previous program. This will likely not make much sense to classic Mac OS users, or even Windows users unless you were ever fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output -becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any +previous program. This will likely not make much sense to classic Mac OS users, or even Windows users unless you were ever fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output +becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any special redirecting itself, just doing normal
Example 10.12. Chaining commands
@@ -8402,24 +7925,24 @@ one program's output to the next program's input.- ![]()
This simply prints out the entire contents of binary.xml. (Windows users should usetypeinstead ofcat.) +This simply prints out the entire contents of binary.xml. (Windows users should usetypeinstead ofcat.)- ![]()
This prints the contents of binary.xml, but the “|” character, called the “pipe” character, means that the contents will not be printed to the screen. Instead, they will become the standard input of the +This prints the contents of binary.xml, but the “|” character, called the “pipe” character, means that the contents will not be printed to the screen. Instead, they will become the standard input of the next command, which in this case calls your Python script.@@ -8440,16 +7963,16 @@ def openAnything(source): - ![]()
Instead of specifying a module (like binary.xml), you specify “-”, which causes your script to load the grammar from standard input instead of from a specific file on disk. (More on how +Instead of specifying a module (like binary.xml), you specify “-”, which causes your script to load the grammar from standard input instead of from a specific file on disk. (More on how this happens in the next example.) So the effect is the same as the first syntax, where you specified the grammar filename - directly, but think of the expansion possibilities here. Instead of simply doingcat binary.xml, you could run a script that dynamically generates the grammar, then you can pipe it into your script. It could come from - anywhere: a database, or some grammar-generating meta-script, or whatever. The point is that you don't need to change your -kgp.pyscript at all to incorporate any of this functionality. All you need to do is be able to take grammar files from standard + directly, but think of the expansion possibilities here. Instead of simply doingcat binary.xml, you could run a script that dynamically generates the grammar, then you can pipe it into your script. It could come from + anywhere: a database, or some grammar-generating meta-script, or whatever. The point is that you don't need to change your +kgp.pyscript at all to incorporate any of this functionality. All you need to do is be able to take grammar files from standard input, and you can separate all the other logic into another program.- ![]()
This is the openAnythingfunction fromtoolbox.py, which you previously examined in Section 10.1, “Abstracting input sources”. All you've done is add three lines of code at the beginning of the function to check if the source is “-”; if so, you returnsys.stdin. Really, that's it! Remember,stdinis a file-like object with areadmethod, so the rest of the code (inkgp.py, where you callopenAnything) doesn't change a bit. +This is the openAnythingfunction fromtoolbox.py, which you previously examined in Section 10.1, “Abstracting input sources”. All you've done is add three lines of code at the beginning of the function to check if the source is “-”; if so, you returnsys.stdin. Really, that's it! Remember,stdinis a file-like object with areadmethod, so the rest of the code (inkgp.py, where you callopenAnything) doesn't change a bit.10.3. Caching node lookups
-
kgp.py employs several tricks which may or may not be useful to you in your XML processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes. -A grammar file defines a series of
ref elements. Each ref contains one or more p elements, which can contain a lot of different things, including xrefs. Whenever you encounter an xref, you look for a corresponding ref element with the same id attribute, and choose one of the ref element's children and parse it. (You'll see how this random choice is made in the next section.) -This is how you build up the grammar: define
ref elements for the smallest pieces, then define ref elements which "include" the first ref elements by using xref, and so forth. Then you parse the "largest" reference and follow each xref, and eventually output real text. The text you output depends on the (random) decisions you make each time you fill in an +
kgp.py employs several tricks which may or may not be useful to you in your XML processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes. +A grammar file defines a series of
ref elements. Each ref contains one or more p elements, which can contain a lot of different things, including xrefs. Whenever you encounter an xref, you look for a corresponding ref element with the same id attribute, and choose one of the ref element's children and parse it. (You'll see how this random choice is made in the next section.) +This is how you build up the grammar: define
ref elements for the smallest pieces, then define ref elements which "include" the first ref elements by using xref, and so forth. Then you parse the "largest" reference and follow each xref, and eventually output real text. The text you output depends on the (random) decisions you make each time you fill in an xref, so the output is different each time. -This is all very flexible, but there is one downside: performance. When you find an
xref and need to find the corresponding ref element, you have a problem. The xref has an id attribute, and you want to find the ref element that has that same id attribute, but there is no easy way to do that. The slow way to do it would be to get the entire list of ref elements each time, then manually loop through and look at each id attribute. The fast way is to do that once and build a cache, in the form of a dictionary. +This is all very flexible, but there is one downside: performance. When you find an
xref and need to find the corresponding ref element, you have a problem. The xref has an id attribute, and you want to find the ref element that has that same id attribute, but there is no easy way to do that. The slow way to do it would be to get the entire list of ref elements each time, then manually loop through and look at each id attribute. The fast way is to do that once and build a cache, in the form of a dictionary.
Example 10.14.
loadGrammardef loadGrammar(self, grammar): self.grammar = self._load(grammar) @@ -8466,19 +7989,19 @@ def openAnything(source):- ![]()
As you saw in Section 9.5, “Searching for elements”, getElementsByTagName returns a list of all the elements of a particular name. You can easily get a list of all the ref elements, then simply loop through that list. +As you saw in Section 9.5, “Searching for elements”, getElementsByTagName returns a list of all the elements of a particular name. You can easily get a list of all the ref elements, then simply loop through that list.- ![]()
As you saw in Section 9.6, “Accessing element attributes”, you can access individual attributes of an element by name, using standard dictionary syntax. So the keys of the self.refs dictionary will be the values of the idattribute of eachrefelement. +As you saw in Section 9.6, “Accessing element attributes”, you can access individual attributes of an element by name, using standard dictionary syntax. So the keys of the self.refs dictionary will be the values of the idattribute of eachrefelement.@@ -8488,8 +8011,8 @@ def openAnything(source): id = node.attributes["id"].value self.parse(self.randomChildElement(self.refs[id])) - ![]()
The values of the self.refs dictionary will be the refelements themselves. As you saw in Section 9.3, “Parsing XML”, each element, each node, each comment, each piece of text in a parsed XML document is an object. +The values of the self.refs dictionary will be the refelements themselves. As you saw in Section 9.3, “Parsing XML”, each element, each node, each comment, each piece of text in a parsed XML document is an object.You'll explore the
randomChildElementfunction in the next section.10.4. Finding direct children of a node
-Another useful technique when parsing XML documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a
ref element can have several p elements, each of which can contain many things, including other p elements. You want to find just the p elements that are children of the ref, not p elements that are children of other p elements. -You might think you could simply use
getElementsByTagName for this, but you can't. getElementsByTagName searches recursively and returns a single list for all the elements it finds. Since p elements can contain other p elements, you can't use getElementsByTagName, because it would return nested p elements that you don't want. To find only direct child elements, you'll need to do it yourself. +Another useful technique when parsing XML documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a
ref element can have several p elements, each of which can contain many things, including other p elements. You want to find just the p elements that are children of the ref, not p elements that are children of other p elements. +You might think you could simply use
getElementsByTagName for this, but you can't. getElementsByTagName searches recursively and returns a single list for all the elements it finds. Since p elements can contain other p elements, you can't use getElementsByTagName, because it would return nested p elements that you don't want. To find only direct child elements, you'll need to do it yourself.
Example 10.16. Finding direct child elements
def randomChildElement(self, node): choices = [e for e in node.childNodes @@ -8506,26 +8029,26 @@ def openAnything(source):- ![]()
However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by childNodescontains all different types of nodes, including text nodes. That's not what you're looking for here. You only want the +However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by childNodescontains all different types of nodes, including text nodes. That's not what you're looking for here. You only want the children that are elements.- ![]()
Each node has a nodeType attribute, which can be ELEMENT_NODE,TEXT_NODE,COMMENT_NODE, or any number of other values. The complete list of possible values is in the__init__.pyfile in thexml.dompackage. (See Section 9.2, “Packages” for more on packages.) But you're just interested in nodes that are elements, so you can filter the list to only include +Each node has a nodeType attribute, which can be ELEMENT_NODE,TEXT_NODE,COMMENT_NODE, or any number of other values. The complete list of possible values is in the__init__.pyfile in thexml.dompackage. (See Section 9.2, “Packages” for more on packages.) But you're just interested in nodes that are elements, so you can filter the list to only include those nodes whose nodeType isELEMENT_NODE.- ![]()
Once you have a list of actual elements, choosing a random one is easy. Python comes with a module called random which includes several useful functions. The random.choice function takes a list of any number of items and returns a random item. For example, if a ref element contains several p elements, then choices would be a list of p elements, and chosen would end up being assigned exactly one of them, selected at random. +Once you have a list of actual elements, choosing a random one is easy. Python comes with a module called random which includes several useful functions. The random.choice function takes a list of any number of items and returns a random item. For example, if a ref element contains several p elements, then choices would be a list of p elements, and chosen would end up being assigned exactly one of them, selected at random.
10.5. Creating separate handlers by node type
-The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a
Documentobject. TheDocumentthen contains one or moreElementobjects (for actual XML tags), each of which may contain otherElementobjects,Textobjects (for bits of text), orCommentobjects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type. +The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a
Documentobject. TheDocumentthen contains one or moreElementobjects (for actual XML tags), each of which may contain otherElementobjects,Textobjects (for bits of text), orCommentobjects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type.Example 10.17. Class names of parsed XML objects
>>> from xml.dom import minidom >>> xmldoc = minidom.parse('kant.xml')@@ -8545,18 +8068,18 @@ def openAnything(source):
- ![]()
As you saw in Section 9.2, “Packages”, the object returned by parsing an XML document is a Document object, as defined in minidom.py in the xml.dom package. As you saw in Section 5.4, “Instantiating Classes”, __class__ is a built-in attribute of every Python object. +As you saw in Section 9.2, “Packages”, the object returned by parsing an XML document is a Document object, as defined in minidom.py in the xml.dom package. As you saw in Section 5.4, “Instantiating Classes”, __class__ is a built-in attribute of every Python object.- - ![]()
Furthermore, __name__is a built-in attribute of every Python class, and it is a string. This string is not mysterious; it's the same as the class name you type when you define a class - yourself. (See Section 5.3, “Defining Classes”.) +Furthermore, __name__is a built-in attribute of every Python class, and it is a string. This string is not mysterious; it's the same as the class name you type when you define a class + yourself. (See Section 5.3, “Defining Classes”.)Fine, so now you can get the class name of any particular XML node (since each XML node is represented as a Python object). How can you use this to your advantage to separate the logic of parsing each node type? The answer is
getattr, which you first saw in Section 4.4, “Getting Object References With getattr”. +Fine, so now you can get the class name of any particular XML node (since each XML node is represented as a Python object). How can you use this to your advantage to separate the logic of parsing each node type? The answer is
getattr, which you first saw in Section 4.4, “Getting Object References With getattr”.Example 10.18.
parse, a generic XML node dispatcherdef parse(self, node): parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)![]()
@@ -8565,7 +8088,7 @@ def openAnything(source):
- ![]()
First off, notice that you're constructing a larger string based on the class name of the node you were passed (in the node argument). So if you're passed a Documentnode, you're constructing the string'parse_Document', and so forth. +First off, notice that you're constructing a larger string based on the class name of the node you were passed (in the node argument). So if you're passed a Documentnode, you're constructing the string'parse_Document', and so forth.@@ -8576,7 +8099,7 @@ def openAnything(source): @@ -8604,39 +8127,39 @@ def openAnything(source): - ![]()
Finally, you can call that function and pass the node itself as an argument. The next example shows the definitions of each + Finally, you can call that function and pass the node itself as an argument. The next example shows the definitions of each of these functions. - ![]()
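As a standalone sketch of the getattr dispatch described above, the following assumes Python 3 and uses a hypothetical TinyParser class and collect_text helper (neither is part of kgp.py). It only illustrates how a method name built from node.__class__.__name__ selects the handler; the handlers in the real script do more work.

```python
from xml.dom import minidom


class TinyParser:
    """Hypothetical stand-in for the chapter's generator class; dispatch only."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # Build a name such as "parse_Document" or "parse_Text" from the
        # node's class name, look the method up on self, and call it.
        handler = getattr(self, "parse_%s" % node.__class__.__name__)
        handler(node)

    def parse_Document(self, node):
        # A Document has exactly one root element; hand it back to parse().
        self.parse(node.documentElement)

    def parse_Element(self, node):
        # Recurse into every child, whatever its node type turns out to be.
        for child in node.childNodes:
            self.parse(child)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Comment(self, node):
        # Comments are ignored, but the handler must exist or getattr fails.
        pass


def collect_text(xml_string):
    """Parse an XML string and return all of its text, skipping comments."""
    parser = TinyParser()
    parser.parse(minidom.parseString(xml_string))
    return "".join(parser.pieces)
```

Deleting parse_Comment from this sketch makes parse raise AttributeError on the first comment node, which is why a do-nothing handler is still worth defining.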
parse_Documentis only ever called once, since there is only oneDocumentnode in an XML document, and only oneDocumentobject in the parsed XML representation. It simply turns around and parses the root element of the grammar file. +parse_Documentis only ever called once, since there is only oneDocumentnode in an XML document, and only oneDocumentobject in the parsed XML representation. It simply turns around and parses the root element of the grammar file.- ![]()
parse_Textis called on nodes that represent bits of text. The function itself does some special processing to handle automatic capitalization +parse_Textis called on nodes that represent bits of text. The function itself does some special processing to handle automatic capitalization of the first word of a sentence, but otherwise simply appends the represented text to a list.- ![]()
parse_Commentis just apass, since you don't care about embedded comments in the grammar files. Note, however, that you still need to define the function - and explicitly make it do nothing. If the function did not exist, the genericparsefunction would fail as soon as it stumbled on a comment, because it would try to find the non-existentparse_Commentfunction. Defining a separate function for every node type, even ones you don't use, allows the genericparsefunction to stay simple and dumb. +parse_Commentis just apass, since you don't care about embedded comments in the grammar files. Note, however, that you still need to define the function + and explicitly make it do nothing. If the function did not exist, the genericparsefunction would fail as soon as it stumbled on a comment, because it would try to find the non-existentparse_Commentfunction. Defining a separate function for every node type, even ones you don't use, allows the genericparsefunction to stay simple and dumb.- - ![]()
The parse_Elementmethod is actually itself a dispatcher, based on the name of the element's tag. The basic idea is the same: take what distinguishes - elements from each other (their tag names) and dispatch to a separate function for each of them. You construct a string like -'do_xref'(for an<xref>tag), find a function of that name, and call it. And so forth for each of the other tag names that might be found in the +The parse_Elementmethod is actually itself a dispatcher, based on the name of the element's tag. The basic idea is the same: take what distinguishes + elements from each other (their tag names) and dispatch to a separate function for each of them. You construct a string like +'do_xref'(for an<xref>tag), find a function of that name, and call it. And so forth for each of the other tag names that might be found in the course of parsing a grammar file (<p>tags,<choice>tags).In this example, the dispatch functions
parseandparse_Elementsimply find other methods in the same class. If your processing is very complex (or you have many different tag names), +In this example, the dispatch functions
parseandparse_Elementsimply find other methods in the same class. If your processing is very complex (or you have many different tag names), you could break up your code into separate modules, and use dynamic importing to import each module and call whatever functions -you needed. Dynamic importing will be discussed in Chapter 16, Functional Programming. +you needed. Dynamic importing will be discussed in Chapter 16, Functional Programming.10.6. Handling command-line arguments
Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short- - or long-style flags to specify various options. None of this is XML-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it. + or long-style flags to specify various options. None of this is XML-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.
It's difficult to talk about command-line processing without understanding how command-line arguments are exposed to your Python program, so let's write a simple program to see them.
Example 10.20. Introducing sys.argv
@@ -8650,7 +8173,7 @@ for arg in sys.argv:![]()
- ![]()
Each command-line argument passed to the program will be in sys.argv, which is just a list. Here you are printing each argument on a separate line. + Each command-line argument passed to the program will be in sys.argv, which is just a list. Here you are printing each argument on a separate line. @@ -8672,8 +8195,8 @@ kant.xml- ![]()
The first thing to know about sys.argv is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later, - in Chapter 16, Functional Programming. Don't worry about it for now. + The first thing to know about sys.argv is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later, + in Chapter 16, Functional Programming. Don't worry about it for now. @@ -8691,14 +8214,14 @@ kant.xml - ![]()
To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag - ( -m) which takes an argument (kant.xml). Both the flag itself and the flag's argument are simply sequential elements in the sys.argv list. No attempt is made to associate one with the other; all you get is a list. +To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag + ( -m) which takes an argument (kant.xml). Both the flag itself and the flag's argument are simply sequential elements in the sys.argv list. No attempt is made to associate one with the other; all you get is a list.So as you can see, you certainly have all the information passed on the command line, but then again, it doesn't look like -it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags, -you can simply use
sys.argv[1]to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need thegetoptmodule. +it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags, +you can simply usesys.argv[1]to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need thegetoptmodule.Example 10.22. Introducing
getoptdef main(argv): grammar = "kant.xml"@@ -8716,34 +8239,34 @@ if __name__ == "__main__":
- ![]()
First off, look at the bottom of the example and notice that you're calling the mainfunction withsys.argv[1:]. Remember,sys.argv[0]is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off +First off, look at the bottom of the example and notice that you're calling the mainfunction withsys.argv[1:]. Remember,sys.argv[0]is the name of the script that you're running; you don't care about that for command-line processing, so you chop it off and pass the rest of the list.- ![]()
This is where all the interesting processing happens. The getoptfunction of thegetoptmodule takes three parameters: the argument list (which you got fromsys.argv[1:]), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer - command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is +This is where all the interesting processing happens. The getoptfunction of thegetoptmodule takes three parameters: the argument list (which you got fromsys.argv[1:]), a string containing all the possible single-character command-line flags that this program accepts, and a list of longer + command-line flags that are equivalent to the single-character versions. This is quite confusing at first glance, and is explained in more detail below.- ![]()
If anything goes wrong trying to parse these command-line flags, getoptwill raise an exception, which you catch. You toldgetoptall the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand. +If anything goes wrong trying to parse these command-line flags, getoptwill raise an exception, which you catch. You toldgetoptall the flags you understand, so this probably means that the end user passed some command-line flag that you don't understand.![]()
As is standard practice in the UNIX world, when the script is passed flags it doesn't understand, you print out a summary of proper usage and exit gracefully. - Note that I haven't shown the usagefunction here. You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic. + Note that I haven't shown theusagefunction here. You would still need to code that somewhere and have it print out the appropriate summary; it's not automatic.So what are all those parameters you pass to the
getoptfunction? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element, -the script name, which you already chopped off before calling themainfunction). The second is the list of short command-line flags that the script accepts. +the script name, which you already chopped off before calling themainfunction). The second is the list of short command-line flags that the script accepts.
"hg:d"@@ -8755,9 +8278,9 @@ the script name, which you already chopped off before calling the
mainshow debugging information while parsingThe first and third flags are simply standalone flags; you specify them or you don't, and they do things (print help) or change -state (turn on debugging). However, the second flag (
-g) must be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address, -and you don't know which yet (you'll figure it out later), but you know it has to be something. So you tellgetoptthis by putting a colon after thegin that second parameter to thegetoptfunction. -To further complicate things, the script accepts either short flags (like
-h) or long flags (like--help), and you want them to do the same thing. This is what the third parameter togetoptis for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter. +state (turn on debugging). However, the second flag (-g) must be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address, +and you don't know which yet (you'll figure it out later), but you know it has to be something. So you tellgetoptthis by putting a colon after thegin that second parameter to thegetoptfunction. +To further complicate things, the script accepts either short flags (like
-h) or long flags (like--help), and you want them to do the same thing. This is what the third parameter togetoptis for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.
["help", "grammar="]@@ -8769,11 +8292,11 @@ and you don't know which yet (you'll figure it out later), but you know it has t
Three things of note here:
-
@@ -8804,28 +8327,28 @@ def main(argv):- All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling
getopt. They are understood. +- All long flags are preceded by two dashes on the command line, but you don't include those dashes when calling
getopt. They are understood. -- The
--grammarflag must always be followed by an additional argument, just like the-gflag. This is notated by an equals sign,"grammar=". +- The
--grammarflag must always be followed by an additional argument, just like the-gflag. This is notated by an equals sign,"grammar=". -- The list of long flags is shorter than the list of short flags, because the
-d flag does not have a corresponding long version. This is fine; only -d will turn on debugging. Note that getopt does not pair short and long flags by position, so their order does not matter; you match +- The list of long flags is shorter than the list of short flags, because the
-d flag does not have a corresponding long version. This is fine; only -d will turn on debugging. Note that getopt does not pair short and long flags by position, so their order does not matter; you match each short flag with its long equivalent yourself when you loop through the results.![]()
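The short-flag string and long-flag list above can be tried in isolation. This is a minimal sketch, assuming Python 3; parse_flags is a hypothetical helper rather than the main function from kgp.py, and it silently ignores -h and --help to stay short.

```python
import getopt


def parse_flags(argv):
    """Return (grammar, debug, leftover_args) for a kgp.py-style command line.

    argv is expected to be sys.argv[1:], that is, without the script name.
    """
    grammar = "kant.xml"  # default when neither -g nor --grammar is given
    debug = False
    # "g:" and "grammar=" both mean: this flag must be followed by an argument.
    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    for opt, arg in opts:
        # getopt reports the flag exactly as typed, so check both spellings.
        if opt in ("-g", "--grammar"):
            grammar = arg
        elif opt == "-d":
            debug = True
    return grammar, debug, args
```

Calling parse_flags(["-g", "husserl.xml", "-d", "source.xml"]) yields the grammar name, the debug switch, and the remaining arguments. An unrecognized flag such as -x makes getopt.getopt raise getopt.GetoptError, which the caller is expected to catch.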
- ![]()
The grammar variable will keep track of the grammar file you're using. You initialize it here in case it's not specified on the command + The grammar variable will keep track of the grammar file you're using. You initialize it here in case it's not specified on the command line (using either the -gor the--grammarflag).- ![]()
The opts variable that you get back from getopt contains a list of tuples: flag and argument. If the flag doesn't take an argument, then arg will simply be an empty string. This makes it easier to loop through the flags. +The opts variable that you get back from getopt contains a list of tuples: flag and argument. If the flag doesn't take an argument, then arg will simply be an empty string. This makes it easier to loop through the flags.![]()
getoptvalidates that the command-line flags are acceptable, but it doesn't do any sort of conversion between short and long flags. - If you specify the-hflag, opt will contain"-h"; if you specify the--helpflag, opt will contain"--help". So you need to check for both. + If you specify the-hflag, opt will contain"-h"; if you specify the--helpflag, opt will contain"--help". So you need to check for both.@@ -8838,14 +8361,14 @@ def main(argv): - ![]()
Remember, the -dflag didn't have a corresponding long flag, so you only need to check for the short form. If you find it, you set a global - variable that you'll refer to later to print out debugging information. (I used this during the development of the script. +Remember, the -dflag didn't have a corresponding long flag, so you only need to check for the short form. If you find it, you set a global + variable that you'll refer to later to print out debugging information. (I used this during the development of the script. What, you thought all these examples worked on the first try?)![]()
- ![]()
That's it. You've looped through and dealt with all the command-line flags. That means that anything left must be command-line - arguments. These come back from the getoptfunction in the args variable. In this case, you're treating them as source material for the parser. If there are no command-line arguments +That's it. You've looped through and dealt with all the command-line flags. That means that anything left must be command-line + arguments. These come back from the getoptfunction in the args variable. In this case, you're treating them as source material for the parser. If there are no command-line arguments specified, args will be an empty list, and source will end up as the empty string.10.7. Putting it all together
-You've covered a lot of ground. Let's step back and see how all the pieces fit together. +
You've covered a lot of ground. Let's step back and see how all the pieces fit together.
To start with, this is a script that takes its arguments on the command line, using the
getoptmodule.def main(argv): @@ -8857,7 +8380,7 @@ def main(argv): for opt, arg in opts: ...You create a new instance of the
KantGeneratorclass, and pass it the grammar file and source that may or may not have been specified on the command line.- k = KantGenerator(grammar, source)The
KantGeneratorinstance automatically loads the grammar, which is an XML file. You use your customopenAnythingfunction to open the file (which could be stored in a local file or a remote web server), then use the built-inminidomparsing functions to parse the XML into a tree of Python objects. + k = KantGenerator(grammar, source)The
KantGeneratorinstance automatically loads the grammar, which is an XML file. You use your customopenAnythingfunction to open the file (which could be stored in a local file or a remote web server), then use the built-inminidomparsing functions to parse the XML into a tree of Python objects.def _load(self, source): sock = toolbox.openAnything(source) @@ -8875,7 +8398,7 @@ the "top-level" reference (that isn't referenced by anything else) and use that xrefs[xref.attributes["id"].value] = 1 xrefs = xrefs.keys() standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs] - return '<xref id="%s"/>' % random.choice(standaloneXrefs)Now you rip through the source material. The source material is also XML, and you parse it one node at a time. To keep the code separated and more maintainable, you use separate handlers for each node type. + return '<xref id="%s"/>' % random.choice(standaloneXrefs)
Now you rip through the source material. The source material is also XML, and you parse it one node at a time. To keep the code separated and more maintainable, you use separate handlers for each node type.
def parse_Element(self, node): handlerMethod = getattr(self, "do_%s" % node.tagName) @@ -8902,7 +8425,7 @@ def main(argv): ... k = KantGenerator(grammar, source) print k.output()10.8. Summary
-Python comes with powerful libraries for parsing and manipulating XML documents. The
minidom module takes an XML file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments, +Python comes with powerful libraries for parsing and manipulating XML documents. The
minidom module takes an XML file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments, error handling, even the ability to take input from the piped result of a previous program.
Before moving on to the next chapter, you should be comfortable doing all of these things:
@@ -8918,14 +8441,14 @@ def main(argv):11.1. Diving in
You've learned about HTML processing and XML processing, and along the way you saw how to download a web page and how to parse XML from a URL, but let's dive into the more general topic of HTTP web services.
Simply stated, HTTP web services are programmatic ways of sending and receiving data from remote servers using the operations -of HTTP directly. If you want to get data from the server, use a straight HTTP GET; if you want to send new data to the server, -use HTTP POST. (Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using +of HTTP directly. If you want to get data from the server, use a straight HTTP GET; if you want to send new data to the server, +use HTTP POST. (Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using HTTP PUT and HTTP DELETE.) In other words, the “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for receiving, sending, modifying, and deleting data. -
The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data +
The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data -- usually XML data -- can be built and stored statically, or generated dynamically by a server-side script, and all major -languages include an HTTP library for downloading it. Debugging is also easier, because you can load up the web service in -any web browser and see the raw data. Modern browsers will even nicely format and pretty-print XML data for you, to allow +languages include an HTTP library for downloading it. Debugging is also easier, because you can load up the web service in +any web browser and see the raw data. Modern browsers will even nicely format and pretty-print XML data for you, to allow you to quickly navigate through it.
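To make the verb-to-operation mapping concrete, here is a minimal sketch that builds (but never sends) one request per verb. It assumes Python 3, where the Request class lives in urllib.request; the URL, the request body, and the build_request helper are all made up for illustration.

```python
from urllib.request import Request

# Hypothetical service URL, used only for illustration.
URL = 'http://example.com/xml/atom.xml'

def build_request(verb, url, body=None):
    """Build (but do not send) a request that uses the given HTTP verb."""
    return Request(url, data=body, method=verb)

get_req = build_request('GET', URL)                   # receive data
post_req = build_request('POST', URL, b'<entry/>')    # send new data
put_req = build_request('PUT', URL, b'<entry/>')      # modify existing data
delete_req = build_request('DELETE', URL)             # delete data

for req in (get_req, post_req, put_req, delete_req):
    print(req.get_method(), req.get_full_url())
```

Nothing goes over the wire here; each Request object simply records which verb it would use once an opener sends it.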
Examples of pure XML-over-HTTP web services:
@@ -8940,7 +8463,7 @@ you to quickly navigate through it.In later chapters, you'll explore APIs which use HTTP as a transport for sending and receiving data, but don't map application -semantics to the underlying HTTP semantics. (They tunnel everything over HTTP POST.) But this chapter will concentrate on +semantics to the underlying HTTP semantics. (They tunnel everything over HTTP POST.) But this chapter will concentrate on using HTTP GET to get data from a remote server, and you'll explore several HTTP features you can use to get the maximum benefit out of pure HTTP web services.
Here is a more advanced version of the
openanythingmodule that you saw in the previous chapter: @@ -8976,7 +8499,7 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT): This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) - and deal with it in a uniform manner. Returned object is guaranteed + and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it. @@ -8985,7 +8508,7 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT): If the lastmodified argument is supplied, it must be a formatted date/time string in GMT (as returned in the Last-Modified header of - a previous request). The formatted date/time will be used + a previous request). The formatted date/time will be used as the value of an If-Modified-Since request header. If the agent argument is supplied, it will be used as the value of a @@ -9046,9 +8569,9 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):11.2. How not to fetch data over HTTP
-Let's say you want to download a resource over HTTP, such as a syndicated Atom feed. But you don't just want to download +
Let's say you want to download a resource over HTTP, such as a syndicated Atom feed. But you don't just want to download it once; you want to download it over and over again, every hour, to get the latest news from the site that's offering the - news feed. Let's do it the quick-and-dirty way first, and then see how you can do better. + news feed. Let's do it the quick-and-dirty way first, and then see how you can do better.
Example 11.2. Downloading a feed the quick-and-dirty way
>>> import urllib >>> data = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()@@ -9066,13 +8589,13 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
- - ![]()
Downloading anything over HTTP is incredibly easy in Python; in fact, it's a one-liner. The urllib module has a handy urlopen function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can't get much easier. +Downloading anything over HTTP is incredibly easy in Python; in fact, it's a one-liner. The urllib module has a handy urlopen function that takes the address of the page you want, and returns a file-like object that you can just read() from to get the full contents of the page. It just can't get much easier.
So what's wrong with this? Well, for a quick one-off during testing or development, there's nothing wrong with it. I do +it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any +web page. But once you start thinking in terms of a web service that you want to access on a regular basis -- and remember, you said you were planning on retrieving this syndicated feed once an hour -- then you're being inefficient, and you're being rude.
Let's talk about some of the basic features of HTTP. @@ -9080,54 +8603,54 @@ rude.
There are five important features of HTTP which you should support.
11.3.1.
User-AgentThe
User-Agentis simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web - service over HTTP. When the client requests a resource, it should always announce who it is, as specifically as possible. + service over HTTP. When the client requests a resource, it should always announce who it is, as specifically as possible. This allows the server-side administrator to get in touch with the client-side developer if anything is going fantastically wrong. -By default, Python sends a generic
User-Agent:Python-urllib/1.15. In the next section, you'll see how to change this to something more specific. +By default, Python sends a generic
User-Agent:Python-urllib/1.15. In the next section, you'll see how to change this to something more specific.11.3.2. Redirects
-Sometimes resources move around. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. - A syndicated feed at
http://example.com/index.xmlmight be moved tohttp://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; for instance,http://www.example.com/index.xmlmight be redirected tohttp://server-farm-1.example.com/index.xml. -Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status - code
200means “everything's normal, here's the page you asked for”. Status code404means “page not found”. (You've probably seen 404 errors while browsing the web.) -HTTP has two different ways of signifying that a resource has moved. Status code
302is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in aLocation:header). Status code301is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in aLocation:header). If you get a302status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but - the next time you want to access the same resource, you should retry the old address. But if you get a301status code and a new address, you're supposed to use the new address from then on. +Sometimes resources move around. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. + A syndicated feed at
http://example.com/index.xmlmight be moved tohttp://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; for instance,http://www.example.com/index.xmlmight be redirected tohttp://server-farm-1.example.com/index.xml. +Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status + code
200means “everything's normal, here's the page you asked for”. Status code404means “page not found”. (You've probably seen 404 errors while browsing the web.) +HTTP has two different ways of signifying that a resource has moved. Status code
302is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in aLocation:header). Status code301is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in aLocation:header). If you get a302status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but + the next time you want to access the same resource, you should retry the old address. But if you get a301status code and a new address, you're supposed to use the new address from then on.
urllib.urlopenwill automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn't tell you when - it does so. You'll end up getting data you asked for, but you'll never know that the underlying library “helpfully” followed a redirect for you. So you'll continue pounding away at the old address, and each time you'll get redirected to - the new address. That's two round trips instead of one: not very efficient! Later in this chapter, you'll see how to work + it does so. You'll end up getting data you asked for, but you'll never know that the underlying library “helpfully” followed a redirect for you. So you'll continue pounding away at the old address, and each time you'll get redirected to + the new address. That's two round trips instead of one: not very efficient! Later in this chapter, you'll see how to work around this so you can deal with permanent redirects properly and efficiently.11.3.3.
-Last-Modified/If-Modified-SinceSome data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the +
Some data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the home page of Google.com only changes once every few weeks (when they put up a special holiday logo, or advertise a new service). Web services are no different; usually the server knows when the data you requested last changed, and HTTP provides a way for the server to include this last-modified date along with the data you requested.
If you ask for the same data a second time (or third, or fourth), you can tell the server the last-modified date that you - got last time: you send an
If-Modified-Sinceheader with your request, with the date you got back from the server last time. If the data hasn't changed since then, the - server sends back a special HTTP status code304, which means “this data hasn't changed since the last time you asked for it”. Why is this an improvement? Because when the server sends a304, it doesn't re-send the data. All you get is the status code. So you don't need to download the same data over and over again if it hasn't changed; + got last time: you send anIf-Modified-Sinceheader with your request, with the date you got back from the server last time. If the data hasn't changed since then, the + server sends back a special HTTP status code304, which means “this data hasn't changed since the last time you asked for it”. Why is this an improvement? Because when the server sends a304, it doesn't re-send the data. All you get is the status code. So you don't need to download the same data over and over again if it hasn't changed; the server assumes you have the data cached locally. -All modern web browsers support last-modified date checking. If you've ever visited a page, re-visited the same page a day - later and found that it hadn't changed, and wondered why it loaded so quickly the second time -- this could be why. Your +
All modern web browsers support last-modified date checking. If you've ever visited a page, re-visited the same page a day + later and found that it hadn't changed, and wondered why it loaded so quickly the second time -- this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically - sent the last-modified date it got from the server the first time. The server simply says
304: Not Modified, so your browser knows to load the page from its cache. Web services can be this smart too. + sent the last-modified date it got from the server the first time. The server simply says304: Not Modified, so your browser knows to load the page from its cache. Web services can be this smart too.Python's URL library has no built-in support for last-modified date checking, but since you can add arbitrary headers to each request and read arbitrary headers in each response, you can add support for it yourself.
11.3.4.
ETag/If-None-MatchETags are an alternate way to accomplish the same thing as the last-modified date checking: don't re-download data that hasn't - changed. The way it works is, the server sends some sort of hash of the data (in an
ETagheader) along with the data you requested. Exactly how this hash is determined is entirely up to the server. The second - time you request the same data, you include the ETag hash in anIf-None-Match:header, and if the data hasn't changed, the server will send you back a304status code. As with the last-modified date checking, the server just sends the304; it doesn't send you the same data a second time. By including the ETag hash in your second request, you're telling the + changed. The way it works is, the server sends some sort of hash of the data (in anETagheader) along with the data you requested. Exactly how this hash is determined is entirely up to the server. The second + time you request the same data, you include the ETag hash in anIf-None-Match:header, and if the data hasn't changed, the server will send you back a304status code. As with the last-modified date checking, the server just sends the304; it doesn't send you the same data a second time. By including the ETag hash in your second request, you're telling the server that there's no need to re-send the same data if it still matches this hash, since you still have the data from the last time.Python's URL library has no built-in support for ETags, but you'll see how to add it later in this chapter.
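The exchange described above can be simulated without touching the network. This is only a sketch: the make_etag and respond functions are hypothetical stand-ins for a server, and hashing the data with SHA-256 is an assumption, since how the ETag is computed is entirely up to the server.

```python
import hashlib

def make_etag(data):
    """Compute an ETag for some bytes (the SHA-256 choice is an assumption)."""
    return '"%s"' % hashlib.sha256(data).hexdigest()

def respond(data, if_none_match=None):
    """Simulate a server: return (status, etag, body) for a request."""
    etag = make_etag(data)
    if if_none_match == etag:
        # The client already has this exact data, so send no body.
        return 304, etag, b''
    return 200, etag, data

feed = b'<feed><entry>first post</entry></feed>'

# First request: no ETag yet, so the server sends everything.
status1, etag1, body1 = respond(feed)

# Second request: send the ETag back in If-None-Match.
status2, etag2, body2 = respond(feed, if_none_match=etag1)

# The feed changes, so the old ETag no longer matches.
new_feed = feed + b'<!-- updated -->'
status3, etag3, body3 = respond(new_feed, if_none_match=etag1)

print(status1, status2, status3)
```

The client never needs to understand the hash; it just stores the value and hands it back on the next request.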
11.3.5. Compression
-The last important HTTP feature is gzip compression. When you talk about HTTP web services, you're almost always talking - about moving XML back and forth over the wire. XML is text, and quite verbose text at that, and text generally compresses - well. When you request a resource over HTTP, you can ask the server that, if it has any new data to send you, to please send - it in compressed format. You include the
Accept-encoding: gzipheader in your request, and if the server supports compression, it will send you back gzip-compressed data and mark it with +The last important HTTP feature is gzip compression. When you talk about HTTP web services, you're almost always talking + about moving XML back and forth over the wire. XML is text, and quite verbose text at that, and text generally compresses + well. When you request a resource over HTTP, you can ask the server that, if it has any new data to send you, to please send + it in compressed format. You include the
Accept-encoding: gzipheader in your request, and if the server supports compression, it will send you back gzip-compressed data and mark it with aContent-encoding: gzipheader. -Python's URL library has no built-in support for gzip compression per se, but you can add arbitrary headers to the request. And +
Python's URL library has no built-in support for gzip compression per se, but you can add arbitrary headers to the request. And Python comes with a separate
gzipmodule, which has functions you can use to decompress the data yourself. -Note that our little one-line script to download a syndicated feed did not support any of these HTTP features. Let's see how you can improve it. +
Note that our little one-line script to download a syndicated feed did not support any of these HTTP features. Let's see how you can improve it.
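Here is a small sketch of both halves of gzip support: asking for compressed data, and decompressing whatever comes back. It assumes Python 3 (urllib.request and gzip.decompress); the compressed_request and decode_body helpers and the sample feed are hypothetical, and the "server response" is faked locally with gzip.compress so that nothing is actually downloaded.

```python
import gzip
from urllib.request import Request

# Hypothetical URL, used only for illustration.
URL = 'http://example.com/xml/atom.xml'

def compressed_request(url):
    """Build a request that tells the server gzip-compressed data is welcome."""
    request = Request(url)
    request.add_header('Accept-encoding', 'gzip')
    return request

def decode_body(body, content_encoding):
    """Decompress the body only if the server marked it as gzip-compressed."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    return body

request = compressed_request(URL)

# Fake a server response locally: repetitive XML compresses well.
original = b'<feed>' + b'<entry>hello</entry>' * 50 + b'</feed>'
wire_data = gzip.compress(original)

restored = decode_body(wire_data, 'gzip')
print(len(original), len(wire_data), restored == original)
```

Checking the Content-encoding value before decompressing matters, because a server that does not support compression will simply send the data uncompressed.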
11.4. Debugging HTTP web services
-First, let's turn on the debugging features of Python's HTTP library and see what's being sent over the wire. This will be useful throughout the chapter, as you add more and +
First, let's turn on the debugging features of Python's HTTP library and see what's being sent over the wire. This will be useful throughout the chapter, as you add more and more features.
Example 11.3. Debugging HTTP
>>> import httplib @@ -9154,62 +8677,62 @@ header: Connection: close- ![]()
urllibrelies on another standard Python library,httplib. Normally you don't need toimport httplibdirectly (urllibdoes that automatically), but you will here so you can set the debugging flag on theHTTPConnectionclass thaturllibuses internally to connect to the HTTP server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there's no particular standard for naming them or turning them on; you need to read +urllibrelies on another standard Python library,httplib. Normally you don't need toimport httplibdirectly (urllibdoes that automatically), but you will here so you can set the debugging flag on theHTTPConnectionclass thaturllibuses internally to connect to the HTTP server. This is an incredibly useful technique. Some other Python libraries have similar debug flags, but there's no particular standard for naming them or turning them on; you need to read the documentation of each library to see if such a feature is available.- ![]()
Now that the debugging flag is set, information on the HTTP request and response is printed out in real time. The first + Now that the debugging flag is set, information on the HTTP request and response is printed out in real time. The first thing it tells you is that you're connecting to the server diveintomark.org on port 80, which is the standard port for HTTP.- ![]()
When you request the Atom feed, urllibsends three lines to the server. The first line specifies the HTTP verb you're using, and the path of the resource (minus - the domain name). All the requests in this chapter will useGET, but in the next chapter on SOAP, you'll see that it usesPOSTfor everything. The basic syntax is the same, regardless of the verb. +When you request the Atom feed, urllibsends three lines to the server. The first line specifies the HTTP verb you're using, and the path of the resource (minus + the domain name). All the requests in this chapter will useGET, but in the next chapter on SOAP, you'll see that it usesPOSTfor everything. The basic syntax is the same, regardless of the verb.- ![]()
The second line is the Hostheader, which specifies the domain name of the service you're accessing. This is important, because a single HTTP server - can host multiple separate domains. My server currently hosts 12 domains; other servers can host hundreds or even thousands. +The second line is the Hostheader, which specifies the domain name of the service you're accessing. This is important, because a single HTTP server + can host multiple separate domains. My server currently hosts 12 domains; other servers can host hundreds or even thousands.- ![]()
The third line is the User-Agentheader. What you see here is the genericUser-Agentthat theurlliblibrary adds by default. In the next section, you'll see how to customize this to be more specific. +The third line is the User-Agentheader. What you see here is the genericUser-Agentthat theurlliblibrary adds by default. In the next section, you'll see how to customize this to be more specific.- ![]()
The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the feeddata variable). The status code here is 200, meaning “everything's normal, here's the data you requested”. The server also tells you the date it responded to your request, some information about the server itself, and the content - type of the data it's giving you. Depending on your application, this might be useful, or not. It's certainly reassuring +The server replies with a status code and a bunch of headers (and possibly some data, which got stored in the feeddata variable). The status code here is 200, meaning “everything's normal, here's the data you requested”. The server also tells you the date it responded to your request, some information about the server itself, and the content + type of the data it's giving you. Depending on your application, this might be useful, or not. It's certainly reassuring that you thought you were asking for an Atom feed, and lo and behold, you're getting an Atom feed (application/atom+xml, which is the registered content type for Atom feeds).- ![]()
The server tells you when this Atom feed was last modified (in this case, about 13 minutes ago). You can send this date back + The server tells you when this Atom feed was last modified (in this case, about 13 minutes ago). You can send this date back to the server the next time you request the same feed, and the server can do last-modified checking. - ![]()
The server also tells you that this Atom feed has an ETag hash of "e8284-68e0-4de30f80". The hash doesn't mean anything by itself; there's nothing you can do with it, except send it back to the server the next - time you request this same feed. Then the server can use it to tell you if the data has changed or not. +The server also tells you that this Atom feed has an ETag hash of "e8284-68e0-4de30f80". The hash doesn't mean anything by itself; there's nothing you can do with it, except send it back to the server the next + time you request this same feed. Then the server can use it to tell you if the data has changed or not.11.5. Setting the
-User-AgentThe first step to improving your HTTP web services client is to identify yourself properly with a
User-Agent. To do that, you need to move beyond the basicurlliband dive intourllib2. +The first step to improving your HTTP web services client is to identify yourself properly with a
User-Agent. To do that, you need to move beyond the basicurlliband dive intourllib2.Example 11.4. Introducing
urllib2>>> import httplib >>> httplib.HTTPConnection.debuglevel = 1@@ -9243,22 +8766,22 @@ header: Connection: close
- ![]()
Fetching an HTTP resource with urllib2is a three-step process, for good reasons that will become clear shortly. The first step is to create aRequestobject, which takes the URL of the resource you'll eventually get around to retrieving. Note that this step doesn't actually +Fetching an HTTP resource with urllib2is a three-step process, for good reasons that will become clear shortly. The first step is to create aRequestobject, which takes the URL of the resource you'll eventually get around to retrieving. Note that this step doesn't actually retrieve anything yet.- ![]()
The second step is to build a URL opener. This can take any number of handlers, which control how responses are handled. - But you can also build an opener without any custom handlers, which is what you're doing here. You'll see how to define + The second step is to build a URL opener. This can take any number of handlers, which control how responses are handled. + But you can also build an opener without any custom handlers, which is what you're doing here. You'll see how to define and use custom handlers later in this chapter when you explore redirects. @@ -9269,7 +8792,7 @@ header: Connection: close >>> request.get_full_url() http://diveintomark.org/xml/atom.xml >>> request.add_header('User-Agent', -... 'OpenAnything/1.0 +http://diveintopython3.org/') - ![]()
The final step is to tell the opener to open the URL, using the Requestobject you created. As you can see from all the debugging information that gets printed, this step actually retrieves the +The final step is to tell the opener to open the URL, using the Requestobject you created. As you can see from all the debugging information that gets printed, this step actually retrieves the resource and stores the returned data in feeddata.+... 'OpenAnything/1.0 +http://diveintopython3.org/')
>>> feeddata = opener.open(request).read()
connect: (diveintomark.org, 80) send: ' @@ -9297,9 +8820,9 @@ header: Connection: close
@@ -9312,15 +8835,15 @@ header: Connection: close - ![]()
Using the add_headermethod on theRequestobject, you can add arbitrary HTTP headers to the request. The first argument is the header, the second is the value you're - providing for that header. Convention dictates that aUser-Agentshould be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form, - and you'll see a lot of variations in the wild, but somewhere it should include a URL of your application. TheUser-Agentis usually logged by the server along with other details of your request, and including a URL of your application allows +Using the add_headermethod on theRequestobject, you can add arbitrary HTTP headers to the request. The first argument is the header, the second is the value you're + providing for that header. Convention dictates that aUser-Agentshould be in this specific format: an application name, followed by a slash, followed by a version number. The rest is free-form, + and you'll see a lot of variations in the wild, but somewhere it should include a URL of your application. TheUser-Agentis usually logged by the server along with other details of your request, and including a URL of your application allows server administrators looking through their access logs to contact you if something is wrong.- ![]()
And here's you sending your custom User-Agent, in place of the generic one that Python sends by default. If you look closely, you'll notice that you defined aUser-Agentheader, but you actually sent aUser-agentheader. See the difference?urllib2changed the case so that only the first letter was capitalized. It doesn't really matter; HTTP specifies that header field +And here's you sending your custom User-Agent, in place of the generic one that Python sends by default. If you look closely, you'll notice that you defined aUser-Agentheader, but you actually sent aUser-agentheader. See the difference?urllib2changed the case so that only the first letter was capitalized. It doesn't really matter; HTTP specifies that header field names are completely case-insensitive.11.6. Handling
Last-ModifiedandETagNow that you know how to add custom HTTP headers to your web service requests, let's look at adding support for
Last-ModifiedandETagheaders. -These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can -turn it off by setting
httplib.HTTPConnection.debuglevel = 0. Or you can just leave debugging on, if that helps you. +These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can +turn it off by setting
httplib.HTTPConnection.debuglevel = 0. Or you can just leave debugging on, if that helps you.Example 11.6. Testing
Last-Modified>>> import urllib2 >>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml') @@ -9336,7 +8859,7 @@ turn it off by settinghttplib.HTTPConnection.debuglevel = 0. Or y 'accept-ranges': 'bytes', 'connection': 'close'} >>> request.add_header('If-Modified-Since', -... firstdatastream.headers.get('Last-Modified'))+... firstdatastream.headers.get('Last-Modified'))
>>> seconddatastream = opener.open(request)
Traceback (most recent call last): File "<stdin>", line 1, in ? @@ -9367,20 +8890,20 @@ urllib2.HTTPError: HTTP Error 304: Not Modified
- ![]()
On the second request, you add the If-Modified-Sinceheader with the last-modified date from the first request. If the data hasn't changed, the server should return a304status code. +On the second request, you add the If-Modified-Sinceheader with the last-modified date from the first request. If the data hasn't changed, the server should return a304status code.- - ![]()
Sure enough, the data hasn't changed. You can see from the traceback that urllib2throws a special exception,HTTPError, in response to the304status code. This is a little unusual, and not entirely helpful. After all, it's not an error; you specifically asked the +Sure enough, the data hasn't changed. You can see from the traceback that urllib2throws a special exception,HTTPError, in response to the304status code. This is a little unusual, and not entirely helpful. After all, it's not an error; you specifically asked the server not to send you any data if it hadn't changed, and the data didn't change, so the server told you it wasn't sending - you any data. That's not an error; that's exactly what you were hoping for. + you any data. That's not an error; that's exactly what you were hoping for.
urllib2also raises anHTTPErrorexception for conditions that you would think of as errors, such as404(page not found). In fact, it will raiseHTTPErrorfor any status code other than200(OK),301(permanent redirect), or302(temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without -throwing an exception. To do that, you'll need to define a custom URL handler. +
urllib2also raises anHTTPErrorexception for conditions that you would think of as errors, such as404(page not found). In fact, it will raiseHTTPErrorfor any status code other than200(OK),301(permanent redirect), or302(temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without +throwing an exception. To do that, you'll need to define a custom URL handler.Example 11.7. Defining URL handlers
This custom URL handler is part of
openanything.py.class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):@@ -9394,20 +8917,20 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
-
urllib2is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens - -- like an HTTP error, or even a304code --urllib2introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in Chapter 9, XML Processing to define handlers for different node types, buturllib2is more flexible, and introspects over as many handlers as are defined for the current request. +urllib2is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens + -- like an HTTP error, or even a304code --urllib2introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in Chapter 9, XML Processing to define handlers for different node types, buturllib2is more flexible, and introspects over as many handlers as are defined for the current request.- ![]()
urllib2searches through the defined handlers and calls thehttp_error_defaultmethod when it encounters a304status code from the server. By defining a custom error handler, you can preventurllib2from raising an exception. Instead, you create theHTTPErrorobject, but return it instead of raising it. +urllib2searches through the defined handlers and calls thehttp_error_defaultmethod when it encounters a304status code from the server. By defining a custom error handler, you can preventurllib2from raising an exception. Instead, you create theHTTPErrorobject, but return it instead of raising it.@@ -9417,7 +8940,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): - ![]()
This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access + This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access to it from the calling program. >>> import openanything >>> opener = urllib2.build_opener( -... openanything.DefaultErrorHandler())
+... openanything.DefaultErrorHandler())
>>> seconddatastream = opener.open(request) >>> seconddatastream.status
304 @@ -9434,30 +8957,30 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
-
This is the key: now that you've defined your custom URL handler, you need to tell urllib2to use it. Remember how I said thaturllib2broke up the process of accessing an HTTP resource into three steps, and for good reason? This is why building the URL opener +This is the key: now that you've defined your custom URL handler, you need to tell urllib2to use it. Remember how I said thaturllib2broke up the process of accessing an HTTP resource into three steps, and for good reason? This is why building the URL opener is its own step, because you can build it with your own custom URL handlers that overrideurllib2's default behavior.- ![]()
Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use seconddatastream.headers.dict to access them), also contains the HTTP status code. In this case, as you expected, the status is 304, meaning this data hasn't changed since the last time you asked for it. +Now you can quietly open the resource, and what you get back is an object that, along with the usual headers (use seconddatastream.headers.dict to access them), also contains the HTTP status code. In this case, as you expected, the status is 304, meaning this data hasn't changed since the last time you asked for it.- - ![]()
Note that when the server sends back a 304status code, it doesn't re-send the data. That's the whole point: to save bandwidth by not re-downloading data that hasn't - changed. So if you actually want that data, you'll need to cache it locally the first time you get it. +Note that when the server sends back a 304status code, it doesn't re-send the data. That's the whole point: to save bandwidth by not re-downloading data that hasn't + changed. So if you actually want that data, you'll need to cache it locally the first time you get it.Handling
ETagworks much the same way, but instead of checking forLast-Modifiedand sendingIf-Modified-Since, you check forETagand sendIf-None-Match. Let's start with a fresh IDE session. +Handling
ETagworks much the same way, but instead of checking forLast-Modifiedand sendingIf-Modified-Since, you check forETagand sendIf-None-Match. Let's start with a fresh IDE session.Example 11.9. Supporting
ETag/If-None-Match>>> import urllib2, openanything >>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml') >>> opener = urllib2.build_opener( -... openanything.DefaultErrorHandler()) +... openanything.DefaultErrorHandler()) >>> firstdatastream = opener.open(request) >>> firstdatastream.headers.get('ETag')'"e842a-3e53-55d97640"' @@ -9472,7 +8995,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
<-- rest of feed omitted for brevity --> >>> request.add_header('If-None-Match', -... firstdatastream.headers.get('ETag'))
+... firstdatastream.headers.get('ETag'))
>>> seconddatastream = opener.open(request) >>> seconddatastream.status
304 @@ -9483,7 +9006,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
-
Using the firstdatastream.headers pseudo-dictionary, you can get the ETagreturned from the server. (What happens if the server didn't send back anETag? Then this line would returnNone.) +Using the firstdatastream.headers pseudo-dictionary, you can get the ETagreturned from the server. (What happens if the server didn't send back anETag? Then this line would returnNone.)@@ -9500,13 +9023,13 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): -
The second call succeeds quietly (without throwing an exception), and once again you see that the server has sent back a 304status code. Based on theETagyou sent the second time, it knows that the data hasn't changed. +The second call succeeds quietly (without throwing an exception), and once again you see that the server has sent back a 304status code. Based on theETagyou sent the second time, it knows that the data hasn't changed.@@ -9515,7 +9038,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): - ![]()
Regardless of whether the 304is triggered byLast-Modifieddate checking orETaghash matching, you'll never get the data along with the304. That's the whole point. +Regardless of whether the 304is triggered byLast-Modifieddate checking orETaghash matching, you'll never get the data along with the304. That's the whole point.![]()
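The conditional-request exchange above can be sketched for Python 3, where `urllib2` became `urllib.request`. This is a minimal sketch under that assumption, not the code from `openanything.py`: the names `ReturnErrorHandler` and `build_conditional_request` are hypothetical, and the status is read from the `code` attribute that `HTTPError` already carries rather than from a separately saved `status` attribute. The last few lines call the handler directly with a made-up 304 response, so nothing is sent over the network.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


class ReturnErrorHandler(urllib.request.HTTPDefaultErrorHandler):
    """Hypothetical handler: hand back the HTTPError instead of raising it."""

    def http_error_default(self, req, fp, code, msg, headers):
        # HTTPError doubles as a file-like response object; its `code`
        # attribute holds the HTTP status (304, 404, ...).
        return urllib.error.HTTPError(
            req.get_full_url(), code, msg, headers, fp)


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a Request that sends only the validators you actually have."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header('If-None-Match', etag)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    return request


# Offline check: feed the handler a made-up 304 instead of a real response.
request = build_conditional_request(
    'http://diveintomark.org/xml/atom.xml', etag='"e842a-3e53-55d97640"')
response = ReturnErrorHandler().http_error_default(
    request, io.BytesIO(b''), 304, 'Not Modified', Message())
print(response.code)
```

In real use you would pass an instance of the handler to `urllib.request.build_opener()` and open the request through that opener. Because the builder skips any validator that is `None`, it copes with servers that sent only an `ETag`, only a `Last-Modified` date, or neither.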
- @@ -9527,7 +9050,7 @@ class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):In these examples, the HTTP server has supported both Last-ModifiedandETagheaders, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively +In these examples, the HTTP server has supported both Last-ModifiedandETagheaders, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively in case a server only supports one or the other, or neither.>>> import urllib2, httplib >>> httplib.HTTPConnection.debuglevel = 1
>>> request = urllib2.Request( -... 'http://diveintomark.org/redir/example301.xml')
+... 'http://diveintomark.org/redir/example301.xml')
>>> opener = urllib2.build_opener() >>> f = opener.open(request) connect: (diveintomark.org, 80) @@ -9608,14 +9131,14 @@ AttributeError: addinfourl instance has no attribute 'status'
![]()
The object you get back from the opener contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent - address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary - or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at - the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from + address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary + or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at + the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from now on. -This is suboptimal, but easy to fix.
urllib2doesn't behave exactly as you want it to when it encounters a301or302, so let's override its behavior. How? With a custom URL handler, just like you did to handle304codes. +This is suboptimal, but easy to fix.
urllib2doesn't behave exactly as you want it to when it encounters a301or302, so let's override its behavior. How? With a custom URL handler, just like you did to handle304codes.Example 11.11. Defining the redirect handler
This class is defined in
openanything.py.class SmartRedirectHandler(urllib2.HTTPRedirectHandler):@@ -9635,13 +9158,13 @@ class SmartRedirectHandler(urllib2.HTTPRedirectHandler):
-
Redirect behavior is defined in urllib2in a class calledHTTPRedirectHandler. You don't want to completely override the behavior, you just want to extend it a little, so you'll subclassHTTPRedirectHandlerso you can call the ancestor class to do all the hard work. +Redirect behavior is defined in urllib2in a class calledHTTPRedirectHandler. You don't want to completely override the behavior, you just want to extend it a little, so you'll subclassHTTPRedirectHandlerso you can call the ancestor class to do all the hard work.- ![]()
When it encounters a 301status code from the server,urllib2will search through its handlers and call thehttp_error_301method. The first thing ours does is just call thehttp_error_301method in the ancestor, which handles the grunt work of looking for theLocation:header and following the redirect to the new address. +When it encounters a 301status code from the server,urllib2will search through its handlers and call thehttp_error_301method. The first thing ours does is just call thehttp_error_301method in the ancestor, which handles the grunt work of looking for theLocation:header and following the redirect to the new address.@@ -9664,7 +9187,7 @@ follow redirects, but now it will also expose the redirect status code. >>> import openanything, httplib >>> httplib.HTTPConnection.debuglevel = 1 >>> opener = urllib2.build_opener( -... openanything.SmartRedirectHandler()) +... openanything.SmartRedirectHandler())
>>> f = opener.open(request) connect: (diveintomark.org, 80) send: 'GET /redir/example301.xml HTTP/1.0 @@ -9708,23 +9231,23 @@ header: Content-Type: application/atom+xml
- ![]()
You sent off a request, and you got a 301status code in response. At this point, thehttp_error_301method gets called. You call the ancestor method, which follows the redirect and sends a request at the new location (http://diveintomark.org/xml/atom.xml). +You sent off a request, and you got a 301status code in response. At this point, thehttp_error_301method gets called. You call the ancestor method, which follows the redirect and sends a request at the new location (http://diveintomark.org/xml/atom.xml).![]()
This is the payoff: now, not only do you have access to the new URL, but you have access to the redirect status code, so you - can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location - ( http://diveintomark.org/xml/atom.xml, as specified in f.url). If you had stored the location in a configuration file or a database, you need to update that so you don't keep pounding - the server with requests at the old address. It's time to update your address book. + can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location + (http://diveintomark.org/xml/atom.xml, as specified in f.url). If you had stored the location in a configuration file or a database, you need to update that so you don't keep pounding + the server with requests at the old address. It's time to update your address book.The same redirect handler can also tell you that you shouldn't update your address book.
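Put as code, that address-book rule might look like the following. It is only a sketch of the policy described here; the function name `url_for_next_request` and its parameters are hypothetical and do not appear in `openanything.py`.

```python
def url_for_next_request(requested_url, final_url, status):
    """Pick the address to use next time, given how the last fetch went.

    requested_url: the address you asked for
    final_url:     the address the data actually came from (f.url)
    status:        the status code the handler recorded (f.status)
    """
    if status == 301:
        # Permanent redirect: update your address book.
        return final_url
    # Temporary redirect (302) or no redirect at all: keep the old address.
    return requested_url
```

A caller would store the returned value wherever it keeps the feed's address, so a `301` rewrites the stored location and a `302` leaves it alone.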
Example 11.13. Using the redirect handler to detect temporary redirects
>>> request = urllib2.Request( -... 'http://diveintomark.org/redir/example302.xml')+... 'http://diveintomark.org/redir/example302.xml')
>>> f = opener.open(request) connect: (diveintomark.org, 80) send: ' @@ -9769,28 +9292,28 @@ http://diveintomark.org/xml/atom.xml
- ![]()
The server sends back a 302status code, indicating a temporary redirect. The temporary new location of the data is given in theLocation:header. +The server sends back a 302status code, indicating a temporary redirect. The temporary new location of the data is given in theLocation:header.- ![]()
urllib2calls yourhttp_error_302method, which calls the ancestor method of the same name inurllib2.HTTPRedirectHandler, which follows the redirect to the new location. Then yourhttp_error_302method stores the status code (302) so the calling application can get it later. +urllib2calls yourhttp_error_302method, which calls the ancestor method of the same name inurllib2.HTTPRedirectHandler, which follows the redirect to the new location. Then yourhttp_error_302method stores the status code (302) so the calling application can get it later.- ![]()
And here you are, having successfully followed the redirect to http://diveintomark.org/xml/atom.xml. f.status tells you that this was a temporary redirect, which means that you should continue to request data from the original address - (http://diveintomark.org/redir/example302.xml). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It's not for you - to say. The server said this redirect was only temporary, so you should respect that. And now you're exposing enough information +And here you are, having successfully followed the redirect to http://diveintomark.org/xml/atom.xml. f.status tells you that this was a temporary redirect, which means that you should continue to request data from the original address + (http://diveintomark.org/redir/example302.xml). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It's not for you + to say. The server said this redirect was only temporary, so you should respect that. And now you're exposing enough information that the calling application can respect that.11.8. Handling compressed data
-The last important HTTP feature you want to support is compression. Many web services have the ability to send data compressed, - which can cut down the amount of data sent over the wire by 60% or more. This is especially true of XML web services, since +
The last important HTTP feature you want to support is compression. Many web services have the ability to send data compressed, + which can cut down the amount of data sent over the wire by 60% or more. This is especially true of XML web services, since XML data compresses very well.
Servers won't give you compressed data unless you tell them you can handle it.
Example 11.14. Telling the server you would like compressed data
@@ -9823,7 +9346,7 @@ header: Content-Type: application/atom+xml- ![]()
This is the key: once you've created your Requestobject, add anAccept-encodingheader to tell the server you can accept gzip-encoded data.gzipis the name of the compression algorithm you're using. In theory there could be other compression algorithms, butgzipis the compression algorithm used by 99% of web servers. +This is the key: once you've created your Requestobject, add anAccept-encodingheader to tell the server you can accept gzip-encoded data.gzipis the name of the compression algorithm you're using. In theory there could be other compression algorithms, butgzipis the compression algorithm used by 99% of web servers.@@ -9840,7 +9363,7 @@ header: Content-Type: application/atom+xml @@ -9870,16 +9393,16 @@ header: Content-Type: application/atom+xml - ![]()
The Content-Lengthheader is the length of the compressed data, not the uncompressed data. As you'll see in a minute, the actual length of +The Content-Lengthheader is the length of the compressed data, not the uncompressed data. As you'll see in a minute, the actual length of the uncompressed data was 15955, so gzip compression cut your bandwidth by over 60%!- ![]()
Continuing from the previous example, f is the file-like object returned from the URL opener. Using its read()method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first +Continuing from the previous example, f is the file-like object returned from the URL opener. Using its read()method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first step towards getting the data you really want.- ![]()
OK, this step is a bit of a messy workaround. Python has a gzip module, which reads (and actually writes) gzip-compressed files on disk. But you don't have a file on disk, you have a gzip-compressed - buffer in memory, and you don't want to write out a temporary file just so you can uncompress it. So what you're going to - do is create a file-like object out of the in-memory data (compresseddata), using the StringIO module. You first saw the StringIO module in the previous chapter, but now you've found another use for it. +OK, this step is a bit of a messy workaround. Python has a gzip module, which reads (and actually writes) gzip-compressed files on disk. But you don't have a file on disk, you have a gzip-compressed + buffer in memory, and you don't want to write out a temporary file just so you can uncompress it. So what you're going to + do is create a file-like object out of the in-memory data (compresseddata), using the StringIO module. You first saw the StringIO module in the previous chapter, but now you've found another use for it.@@ -9891,7 +9414,7 @@ header: Content-Type: application/atom+xml - ![]()
This is the line that does all the actual work: “reading” from GzipFilewill decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. gzipper is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; gzipper is really just “reading” from the file-like object you created withStringIOto wrap the compressed data, which is only in memory in the variable compresseddata. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built withurllib2.build_opener. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it. +This is the line that does all the actual work: “reading” from GzipFilewill decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. gzipper is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; gzipper is really just “reading” from the file-like object you created withStringIOto wrap the compressed data, which is only in memory in the variable compresseddata. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built withurllib2.build_opener. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it.@@ -9900,7 +9423,7 @@ header: Content-Type: application/atom+xml -Look ma, real data. (15955 bytes of it, in fact.) “But wait!” I hear you cry. “This could be even easier!” I know what you're thinking. You're thinking that opener.open returns a file-like object, so why not cut out the
StringIOmiddleman and just pass f directly toGzipFile? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work. +“But wait!” I hear you cry. “This could be even easier!” I know what you're thinking. You're thinking that opener.open returns a file-like object, so why not cut out the
StringIOmiddleman and just pass f directly toGzipFile? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work.Example 11.16. Decompressing the data directly from the server
>>> f = opener.open(request)>>> f.headers.get('Content-Encoding')
@@ -9924,7 +9447,7 @@ AttributeError: addinfourl instance has no attribute 'tell'
@@ -9932,14 +9455,14 @@ AttributeError: addinfourl instance has no attribute 'tell' - ![]()
Simply opening the request will get you the headers (though not download any data yet). As you can see from the returned + Simply opening the request will get you the headers (though not download any data yet). As you can see from the returned Content-Encodingheader, this data has been sent gzip-compressed.![]()
Since opener.openreturns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data, - why not simply pass that file-like object directly toGzipFile? As you “read” from theGzipFileinstance, it will “read” compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't - work. Because of the way gzip compression works,GzipFileneeds to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and - forth through the data stream. So the inelegant hack of usingStringIOis the best solution: download the compressed data, create a file-like object out of it withStringIO, and then decompress the data from that. + why not simply pass that file-like object directly toGzipFile? As you “read” from theGzipFileinstance, it will “read” compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't + work. Because of the way gzip compression works,GzipFileneeds to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and + forth through the data stream. So the inelegant hack of usingStringIOis the best solution: download the compressed data, create a file-like object out of it withStringIO, and then decompress the data from that.11.9. Putting it all together
-You've seen all the pieces for building an intelligent HTTP web services client. Now let's see how they all fit together. +
You've seen all the pieces for building an intelligent HTTP web services client. Now let's see how they all fit together.
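One of those pieces, the in-memory decompression step from the previous section, can be sketched in Python 3 as follows. This assumes Python 3, where compressed data is `bytes` and `io.BytesIO` plays the role that `StringIO` plays in the text; the function name `decompress_gzip_bytes` is hypothetical, and the sample bytes stand in for a body you would normally have downloaded.

```python
import gzip
import io


def decompress_gzip_bytes(compressed_data):
    """Decompress gzip data that lives in memory rather than on disk."""
    # Wrap the raw bytes in a file-like object, as StringIO does in the text.
    compressed_stream = io.BytesIO(compressed_data)
    # Reading from GzipFile decompresses whatever the wrapped stream yields.
    with gzip.GzipFile(fileobj=compressed_stream) as gzipper:
        return gzipper.read()


# Stand-in for a downloaded body: compress some sample bytes locally.
sample = b'<feed version="0.3">...</feed>'
print(decompress_gzip_bytes(gzip.compress(sample)) == sample)
```

Python 3 also offers `gzip.decompress()`, which does the same job in one call; the longer form is shown because it mirrors the chain of file-like objects described above.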
Example 11.17. The
openanythingfunctionThis function is defined in
openanything.py.def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT): @@ -9960,14 +9483,14 @@ def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):- ![]()
urlparse is a handy utility module for, you guessed it, parsing URLs. Its primary function, also called urlparse, takes a URL and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier). +urlparse is a handy utility module for, you guessed it, parsing URLs. Its primary function, also called urlparse, takes a URL and splits it into a tuple of (scheme, domain, path, params, query string parameters, and fragment identifier). Of these, the only thing you care about is the scheme, to make sure that you're dealing with an HTTP URL (which urllib2 can handle).- ![]()
You identify yourself to the HTTP server with the User-Agentpassed in by the calling function. If noUser-Agentwas specified, you use a default one defined earlier in theopenanything.pymodule. You never use the default one defined byurllib2. +You identify yourself to the HTTP server with the User-Agentpassed in by the calling function. If noUser-Agentwas specified, you use a default one defined earlier in theopenanything.pymodule. You never use the default one defined byurllib2.@@ -10032,7 +9555,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT): - ![]()
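The scheme check and the `User-Agent` handling described in these two notes could look roughly like this in Python 3, where `urlparse` lives in `urllib.parse` and `urllib2` became `urllib.request`. This is a sketch, not the body of `openAnything`: the helper name `build_http_request` and the `USER_AGENT` value below are hypothetical placeholders.

```python
import urllib.request
from urllib.parse import urlparse

# Hypothetical default; the real USER_AGENT in openanything.py is not shown here.
USER_AGENT = 'ExampleClient/1.0'


def build_http_request(source, agent=USER_AGENT):
    """Return a Request for an HTTP URL, or None for anything else."""
    scheme = urlparse(source)[0]  # first item of the 6-tuple is the scheme
    if scheme != 'http':
        return None
    request = urllib.request.Request(source)
    # Always identify yourself; never fall back to the library's default.
    request.add_header('User-Agent', agent)
    return request
```

Returning `None` for non-HTTP sources leaves the caller free to try something else, such as opening the source as a local file.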
Read the actual data returned from the server. This may be compressed; if so, you'll decompress it later. +Read the actual data returned from the server. This may be compressed; if so, you'll decompress it later. @@ -10077,9 +9600,9 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT): <feed version="0.3" <-- rest of data omitted for brevity -->'} >>> if params['status'] == 301:
-... url = params['url'] +... url = params['url'] >>> newparams = openanything.fetch( -... url, params['etag'], params['lastmodified'], useragent)
+... url, params['etag'], params['lastmodified'], useragent)
>>> newparams {'url': 'http://diveintomark.org/xml/atom.xml', 'lastmodified': None, @@ -10091,7 +9614,7 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
- ![]()
The very first time you fetch a resource, you don't have an ETaghash orLast-Modifieddate, so you'll leave those out. (They're optional parameters.) +The very first time you fetch a resource, you don't have an ETaghash orLast-Modifieddate, so you'll leave those out. (They're optional parameters.)@@ -10139,15 +9662,15 @@ def fetch(source, etag=None, last_modified=None, agent=USER_AGENT): Chapter 12. SOAP Web Services
-Chapter 11 focused on document-oriented web services over HTTP. The “input parameter” was the URL, and the “return value” was an actual XML document which it was your responsibility to parse. -
This chapter will focus on SOAP web services, which take a more structured approach. Rather than dealing with HTTP requests and XML documents directly, -SOAP allows you to simulate calling functions that return native data types. As you will see, the illusion is almost perfect; -you can “call” a function through a SOAP library, with the standard Python calling syntax, and the function appears to return Python objects and values. But under the covers, the SOAP library has actually performed a complex transaction involving multiple XML documents and a remote server. -
SOAP is a complex specification, and it is somewhat misleading to say that SOAP is all about calling remote functions. Some people would pipe up to add that SOAP allows for one-way asynchronous message passing, and document-oriented web services. And those people would be correct; -SOAP can be used that way, and in many different ways. But this chapter will focus on so-called “RPC-style” SOAP -- calling a remote function and getting results back. +
Chapter 11 focused on document-oriented web services over HTTP. The “input parameter” was the URL, and the “return value” was an actual XML document which it was your responsibility to parse. +
This chapter will focus on SOAP web services, which take a more structured approach. Rather than dealing with HTTP requests and XML documents directly, +SOAP allows you to simulate calling functions that return native data types. As you will see, the illusion is almost perfect; +you can “call” a function through a SOAP library, with the standard Python calling syntax, and the function appears to return Python objects and values. But under the covers, the SOAP library has actually performed a complex transaction involving multiple XML documents and a remote server. +
SOAP is a complex specification, and it is somewhat misleading to say that SOAP is all about calling remote functions. Some people would pipe up to add that SOAP allows for one-way asynchronous message passing, and document-oriented web services. And those people would be correct; +SOAP can be used that way, and in many different ways. But this chapter will focus on so-called “RPC-style” SOAP -- calling a remote function and getting results back.
12.1. Diving In
-You use Google, right? It's a popular search engine. Have you ever wished you could programmatically access Google search - results? Now you can. Here is a program to search Google from Python. +
You use Google, right? It's a popular search engine. Have you ever wished you could programmatically access Google search + results? Now you can. Here is a program to search Google from Python.
Example 12.1.
search.pyfrom SOAPpy import WSDL # you'll need to configure these two values; @@ -10171,7 +9694,7 @@ if __name__ == '__main__': print r['title'] print r['link'] print r['description'] - printYou can import this as a module and use it from a larger program, or you can run the script from the command line. On the + print
You can import this as a module and use it from a larger program, or you can run the script from the command line. On the command line, you give the search query as a command-line argument, and it prints out the URL, title, and description of the top five Google search results.
Here is the sample output for a search for the word “python”. @@ -10225,17 +9748,17 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</b
Go to http://pyxml.sourceforge.net/, click Downloads, and download the latest version for your operating system.
- -
If you are using Windows, there are several choices. Make sure to download the version of PyXML that matches the version of Python you are using. +
If you are using Windows, there are several choices. Make sure to download the version of PyXML that matches the version of Python you are using.
- -
Double-click the installer. If you download PyXML 0.8.3 for Windows and Python 2.3, the installer program will be
PyXML-0.8.3.win32-py2.3.exe. +Double-click the installer. If you download PyXML 0.8.3 for Windows and Python 2.3, the installer program will be
PyXML-0.8.3.win32-py2.3.exe.Step through the installer program.
- -
After the installation is complete, close the installer. There will not be any visible indication of success (no programs - installed on the Start Menu or shortcuts installed on the desktop). PyXML is simply a collection of XML libraries used by other programs. +
After the installation is complete, close the installer. There will not be any visible indication of success (no programs + installed on the Start Menu or shortcuts installed on the desktop). PyXML is simply a collection of XML libraries used by other programs.
To verify that you installed PyXML correctly, run your Python IDE and check the version of the XML libraries you have installed, as shown here. @@ -10245,7 +9768,7 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</b '0.8.3'
This version number should match the version number of the PyXML installer program you downloaded and ran.
12.2.2. Installing fpconst
-The second library you need is fpconst, a set of constants and functions for working with IEEE754 double-precision special values. This provides support for the +
The second library you need is fpconst, a set of constants and functions for working with IEEE754 double-precision special values. This provides support for the special values Not-a-Number (NaN), Positive Infinity (Inf), and Negative Infinity (-Inf), which are part of the SOAP datatype specification.
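To make those three special values concrete, here is a small standard-library sketch. It does not use fpconst or reproduce its API; it assumes only that your Python supports `float('nan')` and `float('inf')`, and the names `NAN`, `POS_INF`, `NEG_INF`, and `classify` are hypothetical.

```python
import math

# Standard-library spellings of the three IEEE 754 special values.
NAN = float('nan')
POS_INF = float('inf')
NEG_INF = float('-inf')


def classify(value):
    """Name the kind of double: 'nan', 'inf', '-inf', or 'finite'."""
    if math.isnan(value):
        return 'nan'
    if math.isinf(value):
        return 'inf' if value > 0 else '-inf'
    return 'finite'
```

Note that NaN never compares equal to anything, including itself, which is why the check uses `math.isnan()` rather than `==`.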
Procedure 12.2.
@@ -10255,11 +9778,11 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</bDownload the latest version of fpconst from http://www.analytics.washington.edu/statcomp/projects/rzope/fpconst/.
- -
There are two downloads available, one in
.tar.gzformat, the other in.zipformat. If you are using Windows, download the.zipfile; otherwise, download the.tar.gzfile. +There are two downloads available, one in
.tar.gzformat, the other in.zipformat. If you are using Windows, download the.zipfile; otherwise, download the.tar.gzfile.- -
Decompress the downloaded file. On Windows XP, you can right-click on the file and choose Extract All; on earlier versions - of Windows, you will need a third-party program such as WinZip. On Mac OS X, you can double-click the compressed file to decompress it with Stuffit Expander. +
Decompress the downloaded file. On Windows XP, you can right-click on the file and choose Extract All; on earlier versions + of Windows, you will need a third-party program such as WinZip. On Mac OS X, you can double-click the compressed file to decompress it with Stuffit Expander.
Open a command prompt and navigate to the directory where you decompressed the fpconst files. @@ -10284,7 +9807,7 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</b
Go to http://pywebsvcs.sourceforge.net/ and select Latest Official Release under the SOAPpy section.
- -
There are two downloads available. If you are using Windows, download the
.zipfile; otherwise, download the.tar.gzfile. +There are two downloads available. If you are using Windows, download the
.zipfile; otherwise, download the.tar.gzfile.Decompress the downloaded file, just as you did with fpconst. @@ -10303,8 +9826,8 @@ Dive Into <b>Python</b>. This book is still being written. <b>...</b '0.11.4'
This version number should match the version number of the SOAPpy archive you downloaded and installed.
12.3. First Steps with SOAP
-The heart of SOAP is the ability to call remote functions. There are a number of public access SOAP servers that provide simple functions for demonstration purposes. -
The most popular public access SOAP server is http://www.xmethods.net/. This example uses a demonstration function that takes a United States zip code and returns the current temperature in that +
The heart of SOAP is the ability to call remote functions. There are a number of public access SOAP servers that provide simple functions for demonstration purposes. +
The most popular public access SOAP server is http://www.xmethods.net/. This example uses a demonstration function that takes a United States zip code and returns the current temperature in that region.
Example 12.6. Getting the Current Temperature
>>> from SOAPpy import SOAPProxy@@ -10318,29 +9841,29 @@ region.
- ![]()
You access the remote SOAP server through a proxy class, SOAPProxy. The proxy handles all the internals of SOAP for you, including creating the XML request document out of the function name and argument list, sending the request over - HTTP to the remote SOAP server, parsing the XML response document, and creating native Python values to return. You'll see what these XML documents look like in the next section. +You access the remote SOAP server through a proxy class, SOAPProxy. The proxy handles all the internals of SOAP for you, including creating the XML request document out of the function name and argument list, sending the request over + HTTP to the remote SOAP server, parsing the XML response document, and creating native Python values to return. You'll see what these XML documents look like in the next section.- ![]()
Every SOAP service has a URL which handles all the requests. The same URL is used for all function calls. This particular service only has a single function, but later in this chapter you'll see - examples of the Google API, which has several functions. The service URL is shared by all functions.Each SOAP service also has a namespace, which is defined by the server and is completely arbitrary. It's simply part of the configuration - required to call SOAP methods. It allows the server to share a single service URL and route requests between several unrelated services. It's like dividing Python modules into packages. + Every SOAP service has a URL which handles all the requests. The same URL is used for all function calls. This particular service only has a single function, but later in this chapter you'll see + examples of the Google API, which has several functions. The service URL is shared by all functions.Each SOAP service also has a namespace, which is defined by the server and is completely arbitrary. It's simply part of the configuration + required to call SOAP methods. It allows the server to share a single service URL and route requests between several unrelated services. It's like dividing Python modules into packages. - ![]()
You're creating the SOAPProxywith the service URL and the service namespace. This doesn't make any connection to the SOAP server; it simply creates a local Python object. +You're creating the SOAPProxywith the service URL and the service namespace. This doesn't make any connection to the SOAP server; it simply creates a local Python object.@@ -10404,13 +9927,13 @@ region. - ![]()
Now with everything configured properly, you can actually call remote SOAP methods as if they were local functions. You pass arguments just like a normal function, and you get a return value just - like a normal function. But under the covers, there's a heck of a lot going on. + Now with everything configured properly, you can actually call remote SOAP methods as if they were local functions. You pass arguments just like a normal function, and you get a return value just + like a normal function. But under the covers, there's a heck of a lot going on. - - ![]()
Third, call the remote SOAP method as usual. The SOAP library will print out both the outgoing XML request document, and the incoming XML response document. This is all the hard - work that SOAPProxyis doing for you. Intimidating, isn't it? Let's break it down. +Third, call the remote SOAP method as usual. The SOAP library will print out both the outgoing XML request document, and the incoming XML response document. This is all the hard + work that SOAPProxyis doing for you. Intimidating, isn't it? Let's break it down.Most of the XML request document that gets sent to the server is just boilerplate. Ignore all the namespace declarations; -they're going to be the same (or similar) for all SOAP calls. The heart of the “function call” is this fragment within the
<Body>element: +Most of the XML request document that gets sent to the server is just boilerplate. Ignore all the namespace declarations; +they're going to be the same (or similar) for all SOAP calls. The heart of the “function call” is this fragment within the
<Body>element:<ns1:getTempxmlns:ns1="urn:xmethods-Temperature"
@@ -10422,7 +9945,7 @@ they're going to be the same (or similar) for all SOAP calls.
@@ -10430,17 +9953,17 @@ they're going to be the same (or similar) for all SOAP calls. - ![]()
The element name is the function name, getTemp.SOAPProxyusesgetattras a dispatcher. Instead of calling separate local methods based on the method name, it actually uses the method name to construct the XML +The element name is the function name, getTemp.SOAPProxyusesgetattras a dispatcher. Instead of calling separate local methods based on the method name, it actually uses the method name to construct the XML request document.![]()
The function's XML element is contained in a specific namespace, which is the namespace you specified when you created the - SOAPProxyobject. Don't worry about theSOAP-ENC:root; that's boilerplate too. +SOAPProxyobject. Don't worry about theSOAP-ENC:root; that's boilerplate too.- - ![]()
The arguments of the function also got translated into XML. SOAPProxyintrospects each argument to determine its datatype (in this case it's a string). The argument datatype goes into thexsi:typeattribute, followed by the actual string value. +The arguments of the function also got translated into XML. SOAPProxyintrospects each argument to determine its datatype (in this case it's a string). The argument datatype goes into thexsi:typeattribute, followed by the actual string value.The XML return document is equally easy to understand, once you know what to ignore. Focus on this fragment within the
<Body>: +The XML return document is equally easy to understand, once you know what to ignore. Focus on this fragment within the
<Body>:<ns1:getTempResponsexmlns:ns1="urn:xmethods-Temperature"
@@ -10452,36 +9975,36 @@ they're going to be the same (or similar) for all SOAP calls.
- ![]()
The server wraps the function return value within a <getTempResponse>element. By convention, this wrapper element is the name of the function, plusResponse. But it could really be almost anything; the important thing thatSOAPProxynotices is not the element name, but the namespace. +The server wraps the function return value within a <getTempResponse>element. By convention, this wrapper element is the name of the function, plusResponse. But it could really be almost anything; the important thing thatSOAPProxynotices is not the element name, but the namespace.![]()
The server returns the response in the same namespace we used in the request, the same namespace we specified when we first - created the SOAPProxy. Later in this chapter we'll see what happens if you forget to specify the namespace when creating the SOAPProxy. + created the SOAPProxy. Later in this chapter we'll see what happens if you forget to specify the namespace when creating the SOAPProxy.- ![]()
The return value is specified, along with its datatype (it's a float). SOAPProxyuses this explicit datatype to create a Python object of the correct native datatype and return it. +The return value is specified, along with its datatype (it's a float). SOAPProxyuses this explicit datatype to create a Python object of the correct native datatype and return it.12.5. Introducing WSDL
-The
SOAPProxy class proxies local method calls and transparently turns them into invocations of remote SOAP methods. As you've seen, this is a lot of work, and SOAPProxy does it quickly and transparently. What it doesn't do is provide any means of method introspection. -Consider this: the previous two sections showed an example of calling a simple remote SOAP method with one argument and one return value, both of simple data types. This required knowing, and keeping track of, the -service URL, the service namespace, the function name, the number of arguments, and the datatype of each argument. If any of these is +
The
SOAPProxy class proxies local method calls and transparently turns them into invocations of remote SOAP methods. As you've seen, this is a lot of work, and SOAPProxy does it quickly and transparently. What it doesn't do is provide any means of method introspection. +Consider this: the previous two sections showed an example of calling a simple remote SOAP method with one argument and one return value, both of simple data types. This required knowing, and keeping track of, the +service URL, the service namespace, the function name, the number of arguments, and the datatype of each argument. If any of these is missing or wrong, the whole thing falls apart. -
That shouldn't come as a big surprise. If I wanted to call a local function, I would need to know what package or module -it was in (the equivalent of service URL and namespace). I would need to know the correct function name and the correct number of arguments. Python deftly handles datatyping without explicit types, but I would still need to know how many arguments to pass, and how many +
That shouldn't come as a big surprise. If I wanted to call a local function, I would need to know what package or module +it was in (the equivalent of service URL and namespace). I would need to know the correct function name and the correct number of arguments. Python deftly handles datatyping without explicit types, but I would still need to know how many arguments to pass, and how many return values to expect. -
The big difference is introspection. As you saw in Chapter 4, Python excels at letting you discover things about modules and functions at runtime. You can list the available functions within +
The big difference is introspection. As you saw in Chapter 4, Python excels at letting you discover things about modules and functions at runtime. You can list the available functions within a module, and with a little work, drill down to individual function declarations and arguments. -
WSDL lets you do that with SOAP web services. WSDL stands for “Web Services Description Language”. Although designed to be flexible enough to describe many types of web services, it is most often used to describe SOAP web services. -
A WSDL file is just that: a file. More specifically, it's an XML file. It usually lives on the same server you use to access the -SOAP web services it describes, although there's nothing special about it. Later in this chapter, we'll download the WSDL file for the Google API and use it locally. That doesn't mean we're calling Google locally; the WSDL file still describes the remote functions sitting on Google's server. +
WSDL lets you do that with SOAP web services. WSDL stands for “Web Services Description Language”. Although designed to be flexible enough to describe many types of web services, it is most often used to describe SOAP web services. +
A WSDL file is just that: a file. More specifically, it's an XML file. It usually lives on the same server you use to access the +SOAP web services it describes, although there's nothing special about it. Later in this chapter, we'll download the WSDL file for the Google API and use it locally. That doesn't mean we're calling Google locally; the WSDL file still describes the remote functions sitting on Google's server.
A WSDL file contains a description of everything involved in calling a SOAP web service:
@@ -10496,8 +10019,8 @@ a module, and with a little work, drill down to individual function declarations
In other words, a WSDL file tells you everything you need to know to be able to call a SOAP web service.
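To make this concrete, here is a minimal sketch of pulling those pieces out of a WSDL document using only the Python 3 standard library instead of SOAPpy. The WSDL text is a hand-written, heavily trimmed stand-in (an assumption for illustration, not the real temperature service file), the service URL and the `urn:example-temperature` target namespace are placeholders, and `describe_wsdl` is a hypothetical helper name. Only the two namespace URIs in `NS` come from the WSDL 1.1 specification.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by WSDL 1.1 and its SOAP binding extension.
NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
}

# Hypothetical, trimmed-down WSDL document (assumption for illustration only).
SAMPLE_WSDL = """\
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:tns="urn:example-temperature"
             targetNamespace="urn:example-temperature"
             name="TemperatureService">
  <portType name="TemperaturePortType">
    <operation name="getTemp"/>
  </portType>
  <binding name="TemperatureBinding" type="tns:TemperaturePortType">
    <operation name="getTemp">
      <input>
        <soap:body use="encoded" namespace="urn:xmethods-Temperature"/>
      </input>
    </operation>
  </binding>
  <service name="TemperatureService">
    <port name="TemperaturePort" binding="tns:TemperatureBinding">
      <soap:address location="http://example.com/soap/servlet/rpcrouter"/>
    </port>
  </service>
</definitions>
"""


def describe_wsdl(wsdl_text):
    """Return (service URL, service namespace, operation names) from a WSDL 1.1 document."""
    root = ET.fromstring(wsdl_text)
    # The service URL lives on the soap:address element of the service port.
    address = root.find("wsdl:service/wsdl:port/soap:address", NS)
    # The service namespace is declared on the soap:body of the bound operation.
    body = root.find("wsdl:binding/wsdl:operation/wsdl:input/soap:body", NS)
    # The available functions are the operations listed in the portType.
    operations = [op.get("name")
                  for op in root.findall("wsdl:portType/wsdl:operation", NS)]
    return address.get("location"), body.get("namespace"), operations


url, namespace, operations = describe_wsdl(SAMPLE_WSDL)
print(url, namespace, operations)
```

A real WSDL file also describes argument and return datatypes in its message and types sections, which this sketch leaves out.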
12.6. Introspecting SOAP Web Services with WSDL
-Like many things in the web services arena, WSDL has a long and checkered history, full of political strife and intrigue. I will skip over this history entirely, since it - bores me to tears. There were other standards that tried to do similar things, but WSDL won, so let's learn how to use it. +
Like many things in the web services arena, WSDL has a long and checkered history, full of political strife and intrigue. I will skip over this history entirely, since it + bores me to tears. There were other standards that tried to do similar things, but WSDL won, so let's learn how to use it.
The most fundamental thing that WSDL allows you to do is discover the available methods offered by a SOAP server.
Example 12.8. Discovering The Available Methods
>>> from SOAPpy import WSDL@@ -10510,24 +10033,24 @@ a module, and with a little work, drill down to individual function declarations
- ![]()
SOAPpy includes a WSDL parser. At the time of this writing, it was labeled as being in the early stages of development, but I had no problem parsing + SOAPpy includes a WSDL parser. At the time of this writing, it was labeled as being in the early stages of development, but I had no problem parsing any of the WSDL files I tried. - ![]()
To use a WSDL file, you again use a proxy class, WSDL.Proxy, which takes a single argument: the WSDL file. Note that in this case you are passing in the URL of a WSDL file stored on the remote server, but the proxy class works just as well with a local copy of the WSDL file. The act of creating the WSDL proxy will download the WSDL file and parse it, so if there are any errors in the WSDL file (or it can't be fetched due to networking problems), you'll know about it immediately. +To use a WSDL file, you again use a proxy class, WSDL.Proxy, which takes a single argument: the WSDL file. Note that in this case you are passing in the URL of a WSDL file stored on the remote server, but the proxy class works just as well with a local copy of the WSDL file. The act of creating the WSDL proxy will download the WSDL file and parse it, so if there are any errors in the WSDL file (or it can't be fetched due to networking problems), you'll know about it immediately.- ![]()
The WSDL proxy class exposes the available functions as a Python dictionary, server.methods. So getting the list of available methods is as simple as calling the dictionary method keys(). +The WSDL proxy class exposes the available functions as a Python dictionary, server.methods. So getting the list of available methods is as simple as calling the dictionary method keys().Okay, so you know that this SOAP server offers a single method:
getTemp. But how do you call it? The WSDL proxy object can tell you that too. +Okay, so you know that this SOAP server offers a single method:
getTemp. But how do you call it? The WSDL proxy object can tell you that too.Example 12.9. Discovering A Method's Arguments
>>> callInfo = server.methods['getTemp']>>> callInfo.inparams
@@ -10541,7 +10064,7 @@ u'zipcode'
- ![]()
The server.methods dictionary is filled with a SOAPpy-specific structure called CallInfo. ACallInfoobject contains information about one specific function, including the function arguments. +The server.methods dictionary is filled with a SOAPpy-specific structure called CallInfo. ACallInfoobject contains information about one specific function, including the function arguments.@@ -10553,14 +10076,14 @@ u'zipcode' - ![]()
Each ParameterInfoobject contains a name attribute, which is the argument name. You are not required to know the argument name to call the function through SOAP, but SOAP does support calling functions with named arguments (just like Python), andWSDL.Proxywill correctly handle mapping named arguments to the remote function if you choose to use them. +Each ParameterInfoobject contains a name attribute, which is the argument name. You are not required to know the argument name to call the function through SOAP, but SOAP does support calling functions with named arguments (just like Python), andWSDL.Proxywill correctly handle mapping named arguments to the remote function if you choose to use them.@@ -10577,13 +10100,13 @@ u'return' - ![]()
Each parameter is also explicitly typed, using datatypes defined in XML Schema. You saw this in the wire trace in the previous - section; the XML Schema namespace was part of the “boilerplate” I told you to ignore. For our purposes here, you may continue to ignore it. The zipcode parameter is a string, and if you pass in a Python string to the WSDL.Proxyobject, it will map it correctly and send it to the server. +Each parameter is also explicitly typed, using datatypes defined in XML Schema. You saw this in the wire trace in the previous + section; the XML Schema namespace was part of the “boilerplate” I told you to ignore. For our purposes here, you may continue to ignore it. The zipcode parameter is a string, and if you pass in a Python string to the WSDL.Proxyobject, it will map it correctly and send it to the server.- ![]()
The adjunct to callInfo.inparams for function arguments is callInfo.outparams for return value. It is also a list, because functions called through SOAP can return multiple values, just like Python functions. + The adjunct to callInfo.inparams for function arguments is callInfo.outparams for return value. It is also a list, because functions called through SOAP can return multiple values, just like Python functions. @@ -10633,38 +10156,38 @@ u'return' - ![]()
Each ParameterInfoobject contains name and type. This function returns a single value, named return, which is a float. +Each ParameterInfoobject contains name and type. This function returns a single value, named return, which is a float.- ![]()
The configuration is simpler than calling the SOAP service directly, since the WSDL file contains both the service URL and the namespace you need to call the service. Creating the WSDL.Proxy object downloads the WSDL file, parses it, and configures a SOAPProxy object that it uses to call the actual SOAP web service. +The configuration is simpler than calling the SOAP service directly, since the WSDL file contains both the service URL and the namespace you need to call the service. Creating the WSDL.Proxy object downloads the WSDL file, parses it, and configures a SOAPProxy object that it uses to call the actual SOAP web service.- ![]()
Once the WSDL.Proxyobject is created, you can call a function as easily as you did with theSOAPProxyobject. This is not surprising; theWSDL.Proxyis just a wrapper around theSOAPProxywith some introspection methods added, so the syntax for calling functions is the same. +Once the WSDL.Proxyobject is created, you can call a function as easily as you did with theSOAPProxyobject. This is not surprising; theWSDL.Proxyis just a wrapper around theSOAPProxywith some introspection methods added, so the syntax for calling functions is the same.- ![]()
You can access the WSDL.Proxy's SOAPProxy with server.soapproxy. This is useful for turning on debugging, so that when you call functions through the WSDL proxy, its SOAPProxy will dump the outgoing and incoming XML documents that are going over the wire. +You can access the WSDL.Proxy's SOAPProxy with server.soapproxy. This is useful for turning on debugging, so that when you call functions through the WSDL proxy, its SOAPProxy will dump the outgoing and incoming XML documents that are going over the wire.
12.7. Searching Google
Let's finally turn to the sample code that you saw at the beginning of this chapter, which does something more useful and exciting than get the current temperature. -
Google provides a SOAP API for programmatically accessing Google search results. To use it, you will need to sign up for Google Web Services. +
Google provides a SOAP API for programmatically accessing Google search results. To use it, you will need to sign up for Google Web Services.
Procedure 12.4. Signing Up for Google Web Services
- -
Go to http://www.google.com/apis/ and create a Google account. This requires only an email address. After you sign up you will receive your Google API license - key by email. You will need this key to pass as a parameter whenever you call Google's search functions. +
Go to http://www.google.com/apis/ and create a Google account. This requires only an email address. After you sign up you will receive your Google API license + key by email. You will need this key to pass as a parameter whenever you call Google's search functions.
- -
Also on http://www.google.com/apis/, download the Google Web APIs developer kit. This includes some sample code in several programming languages (but not Python), and more importantly, it includes the WSDL file. +
Also on http://www.google.com/apis/, download the Google Web APIs developer kit. This includes some sample code in several programming languages (but not Python), and more importantly, it includes the WSDL file.
- -
Decompress the developer kit file and find
GoogleSearch.wsdl. Copy this file to some permanent location on your local drive. You will need it later in this chapter. +Decompress the developer kit file and find
GoogleSearch.wsdl. Copy this file to some permanent location on your local drive. You will need it later in this chapter.Once you have your developer key and your Google WSDL file in a known place, you can start poking around with Google Web Services. @@ -10675,7 +10198,7 @@ u'return' [u'doGoogleSearch', u'doGetCachedPage', u'doSpellingSuggestion'] >>> callInfo = server.methods['doGoogleSearch'] >>> for arg in callInfo.inparams:
-... print arg.name.ljust(15), arg.type +... print arg.name.ljust(15), arg.type key (u'http://www.w3.org/2001/XMLSchema', u'string') q (u'http://www.w3.org/2001/XMLSchema', u'string') start (u'http://www.w3.org/2001/XMLSchema', u'int') @@ -10697,16 +10220,16 @@ oe (u'http://www.w3.org/2001/XMLSchema', u'string')
- ![]()
According to the WSDL file, Google offers three functions: doGoogleSearch,doGetCachedPage, anddoSpellingSuggestion. These do exactly what they sound like: perform a Google search and return the results programmatically, get access to the +According to the WSDL file, Google offers three functions: doGoogleSearch,doGetCachedPage, anddoSpellingSuggestion. These do exactly what they sound like: perform a Google search and return the results programmatically, get access to the cached version of a page from the last time Google saw it, and offer spelling suggestions for commonly misspelled search words.@@ -10715,18 +10238,18 @@ oe (u'http://www.w3.org/2001/XMLSchema', u'string') - ![]()
The doGoogleSearchfunction takes a number of parameters of various types. Note that while the WSDL file can tell you what the arguments are called and what datatype they are, it can't tell you what they mean or how to use - them. It could theoretically tell you the acceptable range of values for each parameter, if only specific values were allowed, - but Google's WSDL file is not that detailed.WSDL.Proxycan't work magic; it can only give you the information provided in the WSDL file. +The doGoogleSearchfunction takes a number of parameters of various types. Note that while the WSDL file can tell you what the arguments are called and what datatype they are, it can't tell you what they mean or how to use + them. It could theoretically tell you the acceptable range of values for each parameter, if only specific values were allowed, + but Google's WSDL file is not that detailed.WSDL.Proxycan't work magic; it can only give you the information provided in the WSDL file.
- key - Your Google API key, which you received when you signed up for Google web services. -
- q - The search word or phrase you're looking for. The syntax is exactly the same as Google's web form, so if you know any +
- q - The search word or phrase you're looking for. The syntax is exactly the same as Google's web form, so if you know any advanced search syntax or tricks, they all work here as well. -
- start - The index of the result to start on. Like the interactive web version of Google, this function returns 10 results at a - time. If you wanted to get the second “page” of results, you would set start to 10. +
- start - The index of the result to start on. Like the interactive web version of Google, this function returns 10 results at a + time. If you wanted to get the second “page” of results, you would set start to 10. -
- maxResults - The number of results to return. Currently capped at 10, although you can specify fewer if you are only interested in +
- maxResults - The number of results to return. Currently capped at 10, although you can specify fewer if you are only interested in a few results and want to save a little bandwidth.
- filter - If
True, Google will filter out duplicate pages from the results. -- restrict - Set this to
countryplus a country code to get results only from a particular country. Example:countryUKto search pages in the United Kingdom. You can also specifylinux,mac, orbsdto search a Google-defined set of technical sites, orunclesamto search sites about the United States government. +- restrict - Set this to
countryplus a country code to get results only from a particular country. Example:countryUKto search pages in the United Kingdom. You can also specifylinux,mac, orbsdto search a Google-defined set of technical sites, orunclesamto search sites about the United States government.- safeSearch - If
True, Google will filter out porn sites. @@ -10740,7 +10263,7 @@ oe (u'http://www.w3.org/2001/XMLSchema', u'string') >>> server = WSDL.Proxy('/path/to/your/GoogleSearch.wsdl') >>> key = 'YOUR_GOOGLE_API_KEY' >>> results = server.doGoogleSearch(key, 'mark', 0, 10, False, "", -... False, "", "utf-8", "utf-8")+... False, "", "utf-8", "utf-8")
>>> len(results.resultElements)
10 >>> results.resultElements[0].URL
@@ -10752,24 +10275,24 @@ oe (u'http://www.w3.org/2001/XMLSchema', u'string')
- ![]()
After setting up the WSDL.Proxyobject, you can callserver.doGoogleSearchwith all ten parameters. Remember to use your own Google API key that you received when you signed up for Google web services. +After setting up the WSDL.Proxyobject, you can callserver.doGoogleSearchwith all ten parameters. Remember to use your own Google API key that you received when you signed up for Google web services.- ![]()
There's a lot of information returned, but let's look at the actual search results first. They're stored in results.resultElements, and you can access them just like a normal Python list. + There's a lot of information returned, but let's look at the actual search results first. They're stored in results.resultElements, and you can access them just like a normal Python list. - - ![]()
Each element in the resultElements is an object that has a URL, title, snippet, and other useful attributes. At this point you can use normal Python introspection techniques like dir(results.resultElements[0]) to see the available attributes. Or you can introspect through the WSDL proxy object and look through the function's outparams. Each technique will give you the same information. + Each element in the resultElements is an object that has a URL, title, snippet, and other useful attributes. At this point you can use normal Python introspection techniques like dir(results.resultElements[0]) to see the available attributes. Or you can introspect through the WSDL proxy object and look through the function's outparams. Each technique will give you the same information. The results object contains more than the actual search results. It also contains information about the search itself, such as how long -it took and how many results were found (even though only 10 were returned). The Google web interface shows this information, +
The results object contains more than the actual search results. It also contains information about the search itself, such as how long +it took and how many results were found (even though only 10 were returned). The Google web interface shows this information, and you can access it programmatically too.
Example 12.14. Accessing Secondary Information From Google
>>> results.searchTime@@ -10788,27 +10311,27 @@ and you can access it programmatically too.
- ![]()
This search took 0.224919 seconds. That does not include the time spent sending and receiving the actual SOAP XML documents. It's just the time that Google spent processing your request once it received it. + This search took 0.224919 seconds. That does not include the time spent sending and receiving the actual SOAP XML documents. It's just the time that Google spent processing your request once it received it. - ![]()
In total, there were approximately 30 million results. You can access them 10 at a time by changing the start parameter and calling server.doGoogleSearchagain. +In total, there were approximately 30 million results. You can access them 10 at a time by changing the start parameter and calling server.doGoogleSearchagain.- ![]()
For some queries, Google also returns a list of related categories in the Google Directory. You can append these URLs to http://directory.google.com/ to construct the link to the directory category page. + For some queries, Google also returns a list of related categories in the Google Directory. You can append these URLs to http://directory.google.com/ to construct the link to the directory category page. 12.8. Troubleshooting SOAP Web Services
-Of course, the world of SOAP web services is not all happiness and light. Sometimes things go wrong. -
As you've seen throughout this chapter, SOAP involves several layers. There's the HTTP layer, since SOAP is sending XML documents to, and receiving XML documents from, an HTTP server. So all the debugging techniques you learned -in Chapter 11, HTTP Web Services come into play here. You can import httplib and then set httplib.HTTPConnection.debuglevel = 1 to see the underlying HTTP traffic. -
Beyond the underlying HTTP layer, there are a number of things that can go wrong. SOAPpy does an admirable job hiding the SOAP syntax from you, but that also means it can be difficult to determine where the problem is when things don't work. +
Of course, the world of SOAP web services is not all happiness and light. Sometimes things go wrong. +
As you've seen throughout this chapter, SOAP involves several layers. There's the HTTP layer, since SOAP is sending XML documents to, and receiving XML documents from, an HTTP server. So all the debugging techniques you learned +in Chapter 11, HTTP Web Services come into play here. You can import httplib and then set httplib.HTTPConnection.debuglevel = 1 to see the underlying HTTP traffic. +
Beyond the underlying HTTP layer, there are a number of things that can go wrong. SOAPpy does an admirable job hiding the SOAP syntax from you, but that also means it can be difficult to determine where the problem is when things don't work.
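For the HTTP layer, the debugging switch mentioned above is a single class attribute. The sketch below assumes Python 3, where the httplib module was renamed http.client. `set_http_debug` is a hypothetical helper name used only for this illustration.

```python
import http.client  # Python 3 name for the Python 2 httplib module


def set_http_debug(level=1):
    """Set the class-wide HTTP debug level and return the previous value."""
    previous = http.client.HTTPConnection.debuglevel
    http.client.HTTPConnection.debuglevel = level
    return previous


# Connections created after this call print their request and response traffic.
previous_level = set_http_debug(1)
current_level = http.client.HTTPConnection.debuglevel

# Restore the original setting once you are done debugging.
set_http_debug(previous_level)
```

Because the attribute is set on the class, it affects every connection that does not override it, so restoring the old value afterwards keeps unrelated code quiet.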
Here are a few examples of common mistakes that I've made in using SOAP web services, and the errors they generated.
Example 12.15. Calling a Method With an Incorrectly Configured Proxy
>>> from SOAPpy import SOAPProxy @@ -10832,18 +10355,18 @@ Unable to determine object id from call: is the method element namespaced?>- ![]()
Did you spot the mistake? You're creating a SOAPProxymanually, and you've correctly specified the service URL, but you haven't specified the namespace. Since multiple services may be routed through the same service URL, the namespace is essential to determine which service you're trying to talk to, and therefore which method you're really +Did you spot the mistake? You're creating a SOAPProxymanually, and you've correctly specified the service URL, but you haven't specified the namespace. Since multiple services may be routed through the same service URL, the namespace is essential to determine which service you're trying to talk to, and therefore which method you're really calling.- - ![]()
The server responds by sending a SOAP Fault, which SOAPpy turns into a Python exception of type SOAPpy.Types.faultType. All errors returned from any SOAP server will always be SOAP Faults, so you can easily catch this exception. In this case, the human-readable part of the SOAP Fault gives a clue to the problem: the method element is not namespaced, because the original SOAPProxy object was not configured with a service namespace.
Misconfiguring the basic elements of the SOAP service is one of the problems that WSDL aims to solve. The WSDL file contains the service URL and namespace, so you can't get it wrong. Of course, there are still other things you can get wrong.
Example 12.16. Calling a Method With the Wrong Arguments
>>> wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl' >>> server = WSDL.Proxy(wsdlFile) @@ -10865,15 +10388,15 @@ services.temperature.TempService.getTemp(int) -- no signature match>- ![]()
Did you spot the mistake? It's a subtle one: you're calling server.getTemp with an integer instead of a string. As you saw from introspecting the WSDL file, the getTemp() SOAP function takes a single argument, zipcode, which must be a string. WSDL.Proxy will not coerce datatypes for you; you need to pass the exact datatypes that the server expects.
@@ -10891,7 +10414,7 @@ TypeError: unpack non-sequence
Again, the server returns a SOAP Fault, and the human-readable part of the error gives a clue as to the problem: you're calling a getTemp function with an integer value, but there is no function defined with that name that takes an integer. In theory, SOAP allows you to overload functions, so you could have two functions in the same SOAP service with the same name and the same number of arguments, but with arguments of different datatypes. This is why it's important to match the datatypes exactly, and why WSDL.Proxy doesn't coerce datatypes for you. If it did, you could end up calling a completely different function! Good luck debugging that one. It's much easier to be picky about datatypes and fail as quickly as possible if you get them wrong.
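The risk of silent coercion can be sketched locally, without any SOAP traffic. The block below is only an illustration under stated assumptions: the OVERLOADS registry, the two handler functions and the call helper are hypothetical names invented here, not part of SOAPpy, and the second (integer) overload does not exist in the real temperature service. It is written for a current Python 3 interpreter.

```python
# Hypothetical stand-in for a service that overloads one method name
# on argument type. None of these names come from SOAPpy.

def temp_by_zipcode(zipcode):
    return "lookup by zip code %s" % zipcode

def temp_by_station(station_id):
    return "lookup by station number %d" % station_id

# Each overload is registered under (method name, argument types).
OVERLOADS = {
    ("getTemp", (str,)): temp_by_zipcode,
    ("getTemp", (int,)): temp_by_station,
}

def call(name, *args):
    """Dispatch on the exact argument types; never coerce."""
    signature = (name, tuple(type(arg) for arg in args))
    try:
        handler = OVERLOADS[signature]
    except KeyError:
        # Fail immediately instead of guessing which overload was meant.
        raise TypeError("%s%r -- no signature match" % (name, args))
    return handler(*args)

print(call("getTemp", "27502"))  # string argument: the zip code overload
print(call("getTemp", 27502))    # integer argument: a different function
```

If the caller's integer were quietly turned into a string (or the reverse), the request would reach the other handler and return a plausible but wrong answer, which is much harder to notice than an immediate error.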
Did you spot the mistake? server.getTemp only returns one value, a float, but you've written code that assumes you're getting two values and trying to assign them to two different variables. Note that this does not fail with a SOAP fault. As far as the remote server is concerned, nothing went wrong at all. The error only occurred after the SOAP transaction was complete, WSDL.Proxy returned a float, and your local Python interpreter tried to accommodate your request to split it into two different variables. Since the function only returned one value, you get a Python exception trying to split it, not a SOAP Fault.
@@ -10901,7 +10424,7 @@ TypeError: unpack non-sequence
>>> from SOAPpy import WSDL
>>> server = WSDL.Proxy(r'/path/to/local/GoogleSearch.wsdl')
>>> results = server.doGoogleSearch('foo', 'mark', 0, 10, False, "",
... False, "", "utf-8", "utf-8")
<Fault SOAP-ENV:Server:
Exception from service object: Invalid authorization key: foo: <SOAPpy.Types.structType detail at 14164616>: @@ -10977,14 +10500,14 @@ Caused by: com.google.soap.search.UserKeyInvalidException: Key was of wrong size
Can you spot the mistake? There's nothing wrong with the calling syntax, or the number of arguments, or the datatypes. The problem is application-specific: the first argument is supposed to be my application key, but foo is not a valid Google key.
@@ -10996,8 +10519,8 @@ Caused by: com.google.soap.search.UserKeyInvalidException: Key was of wrong size
The Google server responds with a SOAP Fault and an incredibly long error message, which includes a complete Java stack trace. Remember that all SOAP errors are signified by SOAP Faults: errors in configuration, errors in function arguments, and application-specific errors like this. Buried in there somewhere is the crucial piece of information: Invalid authorization key: foo.
12.9. Summary
SOAP web services are very complicated. The specification is very ambitious and tries to cover many different use cases for web services. This chapter has touched on some of the simpler use cases.
Before diving into the next chapter, make sure you're comfortable doing all of these things:
@@ -11014,19 +10537,19 @@ Caused by: com.google.soap.search.UserKeyInvalidException: Key was of wrong size
Chapter 13. Unit Testing
13.1. Introduction to Roman numerals
In previous chapters, you “dived in” by immediately looking at code and trying to understand it as quickly as possible. Now that you have some Python under your belt, you're going to step back and look at the steps that happen before the code gets written.
In the next few chapters, you're going to write, debug, and optimize a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in Section 7.3, “Case Study: Roman Numerals”, but now let's step back and consider what it would take to expand that into a two-way utility.
The rules for Roman numerals lead to a number of interesting observations:
- There is only one correct way to represent a particular number as Roman numerals.
- The converse is also true: if a string of characters is a valid Roman numeral, it represents only one number (i.e. it can only be read one way).
- There is a limited range of numbers that can be expressed as Roman numerals, specifically 1 through 3999. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by 1000, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from 1 to 3999.)
- There is no way to represent 0 in Roman numerals. (Amazingly, the ancient Romans had no concept of 0 as a number. Numbers were for counting things you had; how can you count what you don't have?)
- There is no way to represent negative numbers in Roman numerals.
- There is no way to represent fractions or non-integer numbers in Roman numerals. @@ -11045,7 +10568,7 @@ numerals. You saw the mechanics of constructing and validating Roman numerals i
- fromRoman should fail when given an invalid Roman numeral.
- If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number you started with. So fromRoman(toRoman(n)) == n for all n in 1..3999.
- toRoman should always return a Roman numeral using uppercase letters.
@@ -11061,39 +10584,39 @@ numerals. You saw the mechanics of constructing and validating Roman numerals i
13.2. Diving in
Now that you've completely defined the behavior you expect from your conversion functions, you're going to do something a little unexpected: you're going to write a test suite that puts these functions through their paces and makes sure that they behave the way you want them to. You read that right: you're going to write code that tests code that you haven't written yet.
This is called unit testing, since the set of two conversion functions can be written and tested as a unit, separate from any larger program they may become part of later. Python has a framework for unit testing, the appropriately-named unittest module.
unittest is included with Python 2.1 and later. Python 2.0 users can download it from pyunit.sourceforge.net.
Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:
- Before writing code, it forces you to detail your requirements in a useful fashion.
- While writing code, it keeps you from over-coding. When all the test cases pass, the function is complete.
- When refactoring code, it assures you that the new version behaves the same way as the old version.
- When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code. (“But sir, all the unit tests passed when I checked it in...”)
- When writing code in a team, it increases confidence that the code you're about to commit isn't going to break other people's code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that won't play well with others.)
13.3. Introducing romantest.py
This is the complete test suite for your Roman numeral conversion functions, which are yet to be written but will eventually be in roman.py. It is not immediately obvious how it all fits together; none of these classes or methods reference any of the others. There are good reasons for this, as you'll see shortly.
Example 13.1. romantest.py
If you have not already done so, you can download this and other examples used in this book.
@@ -11245,16 +10768,16 @@ if __name__ == "__main__":
13.4. Testing for success
The most fundamental part of unit testing is constructing individual test cases. A test case answers a single question about the code it is testing.
A test case should be able to...
- ...run completely by itself, without any human input. Unit testing is about automation.
- ...determine by itself whether the function it is testing has passed or failed, without a human interpreting the results.
- ...run in isolation, separate from any other test cases (even if they test the same functions). Each test case is an island.
Given that, let's build the first test case. You have the following requirement:
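The three properties above can be sketched with a deliberately tiny example. Everything in it is an assumption made for illustration: the double function, the two test classes and the run_case helper are hypothetical names, and the code is written for a current Python 3 interpreter rather than the Python release this chapter targets.

```python
import io
import unittest

def double(n):
    # Hypothetical function under test, used only for this illustration.
    return n * 2

class DoubleKnownValue(unittest.TestCase):
    def testDouble(self):
        # No prompts and no output for a human to read: the assertion decides.
        self.assertEqual(double(21), 42)

class DoubleZero(unittest.TestCase):
    def testZero(self):
        # Builds its own input; does not rely on DoubleKnownValue having run.
        self.assertEqual(double(0), 0)

def run_case(case_class):
    """Run a single TestCase class on its own and report whether it passed."""
    suite = unittest.TestLoader().loadTestsFromTestCase(case_class)
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite).wasSuccessful()

print(run_case(DoubleZero))        # each case can run alone...
print(run_case(DoubleKnownValue))  # ...and in any order
```

Because neither class touches shared state, either one can be run first, or run without the other, and the pass/fail answer comes back as a plain boolean.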
- toRoman should return the Roman numeral representation for all integers 1 to 3999.
@@ -11328,43 +10851,43 @@ class KnownValues(unittest.TestCase):
To write a test case, first subclass the TestCase class of the unittest module. This class provides many useful methods which you can use in your test case to test specific conditions.
This is a list of integer/numeral pairs that I verified manually. It includes the lowest ten numbers, the highest number, every number that translates to a single-character Roman numeral, and a random sampling of other valid numbers. The point of a unit test is not to test every possible input, but to test a representative sample.
Every individual test is its own method, which must take no parameters and return no value. If the method exits normally without raising an exception, the test is considered passed; if the method raises an exception, the test is considered failed.
Here you call the actual toRoman function. (Well, the function hasn't been written yet, but once it is, this is the line that will call it.) Notice that you have now defined the API for the toRoman function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the API is different than that, this test is considered failed.
Also notice that you are not trapping any exceptions when you call toRoman. This is intentional. toRoman shouldn't raise an exception when you call it with valid input, and these input values are all valid. If toRoman raises an exception, this test is considered failed.
Assuming the toRoman function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the right value. This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal. If the result returned from toRoman (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail. If the two values are equal, assertEqual will do nothing. If every value returned from toRoman matches the known value you expect, assertEqual never raises an exception, so testToRomanKnownValues eventually exits normally, which means toRoman has passed this test.
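As a rough sketch of how these pieces fit together, the block below runs a known-values test end to end. It borrows the names mentioned in the notes above (KnownValues, testToRomanKnownValues, integer, numeral, result), but the shortened sample list and the local toRoman stand-in are assumptions added so the sketch can run on its own; the book's test calls the function from the roman module instead. It is written for a current Python 3 interpreter.

```python
import io
import unittest

# Stand-in converter so the sketch is runnable; an assumption, not the
# implementation developed later in the book.
ROMAN_MAP = (("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
             ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
             ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1))

def toRoman(n):
    result = ""
    for numeral, integer in ROMAN_MAP:
        while n >= integer:
            result += numeral
            n -= integer
    return result

class KnownValues(unittest.TestCase):
    # A short, hand-checked sample rather than every possible input.
    knownValues = ((1, "I"), (4, "IV"), (9, "IX"), (40, "XL"),
                   (500, "D"), (1994, "MCMXCIV"), (3999, "MMMCMXCIX"))

    def testToRomanKnownValues(self):
        for integer, numeral in self.knownValues:
            result = toRoman(integer)          # no try...except on purpose
            self.assertEqual(numeral, result)  # raises if they differ

suite = unittest.TestLoader().loadTestsFromTestCase(KnownValues)
outcome = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(outcome.testsRun, outcome.wasSuccessful())
```

If any pair in the sample disagreed with the converter, assertEqual would raise inside the loop, the method would not exit normally, and wasSuccessful() would report False.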
The TestCase class of the unittest module provides the assertRaises method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments you're passing that function. (If the function you're testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you're testing.) Pay close attention to what you're doing here: instead of calling toRoman directly and manually checking that it raises a particular exception (by wrapping it in a try...except block), assertRaises has encapsulated all of that for us. All you do is give it the exception (roman.OutOfRangeError), the function (toRoman), and toRoman's arguments (4000), and assertRaises takes care of calling toRoman and checking to make sure that it raises roman.OutOfRangeError. (Also note that you're passing the toRoman function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned recently how handy it is that everything in Python is an object, including functions and exceptions?)
Along with testing numbers that are too large, you need to test numbers that are too small. Remember, Roman numerals cannot express 0 or negative numbers, so you have a test case for each of those (testZero and testNegative). In testZero, you are testing that toRoman raises a roman.OutOfRangeError exception when called with 0; if it does not raise a roman.OutOfRangeError (either because it returns an actual value, or because it raises some other exception), this test is considered failed.
@@ -11429,7 +10952,7 @@ class ToRomanBadInput(unittest.TestCase):
Requirement #3 specifies that toRoman cannot accept a non-integer number, so here you test to make sure that toRoman raises a roman.NotIntegerError exception when called with 0.5. If toRoman does not raise a roman.NotIntegerError, this test is considered failed.
- fromRoman should fail when given an invalid Roman numeral.
Requirement #4 is handled in the same way as requirement #1, iterating through a sampling of known values and testing each in turn. Requirement #5 is handled in the same way as requirements #2 and #3, by testing a series of bad inputs and making sure fromRoman raises the appropriate exception.
Example 13.4. Testing bad input to fromRoman
class FromRomanBadInput(unittest.TestCase):
@@ -11452,19 +10975,19 @@ class FromRomanBadInput(unittest.TestCase):
Not much new to say about these; the pattern is exactly the same as the one you used to test bad input to toRoman. I will briefly note that you have another exception: roman.InvalidRomanNumeralError. That makes a total of three custom exceptions that will need to be defined in roman.py (along with roman.OutOfRangeError and roman.NotIntegerError). You'll see how to define these custom exceptions when you actually start writing roman.py, later in this chapter.
13.6. Testing for sanity
Often, you will find that a unit of code contains a set of reciprocal functions, usually in the form of conversion functions where one converts A to B and the other converts B to A. In these cases, it is useful to create a “sanity check” to make sure that you can convert A to B and back to A without losing precision, incurring rounding errors, or triggering any other sort of bug.
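The round-trip idea works for any reciprocal pair, not only Roman numerals. The sketch below assumes a hypothetical pair, toBinary and fromBinary, chosen only because they fit in two lines; the test class mirrors the shape of a sanity check without claiming to be the book's listing. It is written for a current Python 3 interpreter.

```python
import io
import unittest

# Hypothetical reciprocal pair, standing in for any A-to-B / B-to-A converters.
def toBinary(n):
    return format(n, "b")

def fromBinary(s):
    return int(s, 2)

class SanityCheck(unittest.TestCase):
    def testSanity(self):
        # Convert every value in the range to B and back to A.
        for integer in range(1, 4000):
            encoded = toBinary(integer)
            result = fromBinary(encoded)
            self.assertEqual(integer, result)

suite = unittest.TestLoader().loadTestsFromTestCase(SanityCheck)
outcome = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(outcome.wasSuccessful())
```

A check like this says nothing about whether either function is correct on its own terms; it only confirms that the two agree with each other, which is why it sits alongside the known-values tests rather than replacing them.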
Consider this requirement:
- If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number you started with. So fromRoman(toRoman(n)) == n for all n in 1..3999.
Example 13.5. Testing toRoman against fromRoman
@@ -11479,7 +11002,7 @@ class SanityCheck(unittest.TestCase):
You've seen the range function before, but here it is called with two arguments, which returns a list of integers starting at the first argument (1) and counting consecutively up to but not including the second argument (4000). Thus, 1..3999, which is the valid range for converting to Roman numerals.
@@ -11491,7 +11014,7 @@ class SanityCheck(unittest.TestCase):
@@ -11503,8 +11026,8 @@ class SanityCheck(unittest.TestCase):
The actual testing logic here is straightforward: take a number (integer), convert it to a Roman numeral (numeral), then convert it back to a number (result) and make sure you end up with the same number you started with. If not, assertEqual will raise an exception and the test will immediately be considered failed. If all the numbers match, assertEqual will always return silently, the entire testSanity method will eventually return silently, and the test will be considered passed.
- fromRoman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
In fact, they are somewhat arbitrary. You could, for instance, have stipulated that fromRoman accept lowercase and mixed case input. But they are not completely arbitrary; if toRoman is always returning uppercase output, then fromRoman must at least accept uppercase input, or the “sanity check” (requirement #6) would fail. The fact that it only accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying the behavior up front. And if it's worth specifying, it's worth testing.
Example 13.6. Testing for case
class CaseCheck(unittest.TestCase): def testToRomanCase(self): @@ -11524,32 +11047,32 @@ class CaseCheck(unittest.TestCase):- ![]()
The most interesting thing about this test case is all the things it doesn't test. It doesn't test that the value returned from toRoman is right or even consistent; those questions are answered by separate test cases. You have a whole test case just to test for uppercase-ness. You might be tempted to combine this with the sanity check, since both run through the entire range of values and call toRoman.[6] But that would violate one of the fundamental rules: each test case should answer only a single question. Imagine that you combined this case check with the sanity check, and then that test case failed. You would need to do further analysis to figure out which part of the test case failed to determine what the problem was. If you need to analyze the results of your unit testing just to figure out what they mean, it's a sure sign that you've mis-designed your test cases.
There's a similar lesson to be learned here: even though “you know” that toRoman always returns uppercase, you are explicitly converting its return value to uppercase here to test that fromRoman accepts uppercase input. Why? Because the fact that toRoman always returns uppercase is an independent requirement. If you changed that requirement so that, for instance, it always returned lowercase, the testToRomanCase test case would need to change, but this test case would still work. This was another of the fundamental rules: each test case must be able to work in isolation from any of the others. Every test case is an island.
Note that you're not assigning the return value of fromRoman to anything. This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value. In this case, that's what you want. This test case doesn't test anything about the return value; it just tests that fromRoman accepts the uppercase input without raising an exception.
@@ -11561,7 +11084,7 @@ class CaseCheck(unittest.TestCase):
This is a complicated line, but it's very similar to what you did in the ToRomanBadInput and FromRomanBadInput tests. You are testing to make sure that calling a particular function (roman.fromRoman) with a particular value (numeral.lower(), the lowercase version of the current Roman numeral in the loop) raises a particular exception (roman.InvalidRomanNumeralError). If it does (each time through the loop), the test passes; if even one time it does something else (like raising a different exception, or returning a value without raising an exception at all), the test fails.
Chapter 14. Test-First Programming
14.1.
-roman.py, stage 1Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're +
Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps in
roman.py.Example 14.1.
@@ -11587,8 +11110,8 @@ def fromRoman(s):roman1.py@@ -11611,8 +11134,8 @@ def fromRoman(s): - - ![]()
This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not - required) that you subclass Exception, which is the base class that all built-in exceptions inherit from. Here I am definingRomanError(inherited fromException) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily +This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not + required) that you subclass Exception, which is the base class that all built-in exceptions inherit from. Here I am definingRomanError(inherited fromException) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily have inherited each individual exception from theExceptionclass directly.Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At -this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to
romantest.pyand re-evaluate why you coded a test so useless that it passes with do-nothing functions. +Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At +this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to
romantest.pyand re-evaluate why you coded a test so useless that it passes with do-nothing functions.Run
romantest1.pywith the-vcommand-line option, which will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:Example 14.2. Output of
romantest1.pyagainstroman1.pyfromRoman should only accept uppercase input ... ERROR @@ -11740,13 +11263,13 @@ FAILED (failures=10, errors=2)-
Running the script runs unittest.main(), which runs each test case, which is to say each method defined in each class within romantest.py. For each test case, it prints out the docstring of the method and whether that test passed or failed. As expected, none of the test cases passed. +Running the script runs unittest.main(), which runs each test case, which is to say each method defined in each class within romantest.py. For each test case, it prints out the docstring of the method and whether that test passed or failed. As expected, none of the test cases passed.
For each failed test case, unittest displays the trace information showing exactly what happened. In this case, the call to assertRaises (also called failUnlessRaises) raised an AssertionError because it was expecting toRoman to raise an OutOfRangeError and it didn't. +For each failed test case, unittest displays the trace information showing exactly what happened. In this case, the call to assertRaises (also called failUnlessRaises) raised an AssertionError because it was expecting toRoman to raise an OutOfRangeError and it didn't.
@@ -11758,8 +11281,8 @@ FAILED (failures=10, errors=2) @@ -11811,12 +11334,12 @@ def fromRoman(s):-
Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass, unittest distinguishes between failures and errors. A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort - of exception raised in the code you're testing or the unit test case itself. For instance, the testFromRomanCase method (“fromRoman should only accept uppercase input”) was an error, because the call to numeral.upper() raised an AttributeError exception, because toRoman was supposed to return a string but didn't. But testZero (“toRoman should fail with 0 input”) was a failure, because the call to toRoman did not raise the OutOfRangeError exception that assertRaises was looking for. +Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass, unittest distinguishes between failures and errors. A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort + of exception raised in the code you're testing or the unit test case itself. For instance, the testFromRomanCase method (“fromRoman should only accept uppercase input”) was an error, because the call to numeral.upper() raised an AttributeError exception, because toRoman was supposed to return a string but didn't. But testZero (“toRoman should fail with 0 input”) was a failure, because the call to toRoman did not raise the OutOfRangeError exception that assertRaises was looking for.
romanNumeralMap is a tuple of tuples which defines three things: @@ -11825,7 +11348,7 @@ def fromRoman(s):-
- The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals; +
- The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals; you're also defining two-character pairs like
CM(“one hundred less than one thousand”); this will make thetoRomancode simpler later. -- The order of the Roman numerals. They are listed in descending value order, from
Mall the way down toI. +- The order of the Roman numerals. They are listed in descending value order, from
Mall the way down toI. -- The value of each Roman numeral. Each inner tuple is a pair of
(numeral, value). +- The value of each Roman numeral. Each inner tuple is a pair of
(numeral, value).![]()
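To make those three properties concrete, here is a minimal sketch of what such a table, and a loop that consumes it, might look like. It is written for Python 3 and is not necessarily identical to the code in roman2.py. The numeral values are the standard ones, and the function deliberately skips all input validation, which is an assumption made to keep the example short.

```python
# A sketch of the lookup table: two-character pairs are included,
# entries are in descending value order, and each entry is (numeral, value).
romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)


def toRoman(n):
    """Convert a positive integer to a Roman numeral (no validation yet)."""
    result = ""
    for numeral, integer in romanNumeralMap:
        # Use the largest remaining value as many times as it fits.
        while n >= integer:
            result += numeral
            n -= integer
    return result


print(toRoman(1424))  # MCDXXIV
```

Because pairs like CM and IV are ordinary entries in the table, the loop never needs a special case for the subtraction rule.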
Here's where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule. - To convert to Roman numerals, you simply iterate through romanNumeralMap looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation + To convert to Roman numerals, you simply iterate through romanNumeralMap looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat. @@ -11844,7 +11367,7 @@ subtracting 10 from input, adding X to output subtracting 10 from input, adding X to output subtracting 4 from input, adding IV to output 'MCDXXIV' -So
toRomanappears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely. +So
toRomanappears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely.Example 14.5. Output of
romantest2.pyagainstroman2.pyRemember to run
romantest2.pywith the-vcommand-line flag to enable verbose mode.fromRoman should only accept uppercase input ... FAIL toRoman should always return uppercase ... ok@@ -11862,25 +11385,25 @@ toRoman should fail with 0 input ... FAIL
toRoman does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already. +toRoman does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already.
Here's the big news: this version of the toRomanfunction passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including - inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it. +Here's the big news: this version of the toRomanfunction passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including + inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.- - ![]()
However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to - be raised (via assertRaises), and you're never raising them. You'll do that in the next stage. +However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to + be raised (via assertRaises), and you're never raising them. You'll do that in the next stage.Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.
+Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.
====================================================================== FAIL: fromRoman should only accept uppercase input ---------------------------------------------------------------------- @@ -12024,13 +11547,13 @@ def fromRoman(s):- ![]()
This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to if not ((0 < n) and (n < 4000)), but it's much easier to read. This is the range check, and it should catch inputs that are too large, negative, or zero. +This is a nice Pythonic shortcut: multiple comparisons at once. This is equivalent to if not ((0 < n) and (n < 4000)), but it's much easier to read. This is the range check, and it should catch inputs that are too large, negative, or zero.- ![]()
You raise exceptions yourself with the raisestatement. You can raise any of the built-in exceptions, or you can raise any of your custom exceptions that you've defined. +You raise exceptions yourself with the @@ -12038,7 +11561,7 @@ def fromRoman(s):raisestatement. You can raise any of the built-in exceptions, or you can raise any of your custom exceptions that you've defined. The second parameter, the error message, is optional; if given, it is displayed in the traceback that is printed if the exception is never handled.- ![]()
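As a rough illustration of the range check and the raise statement working together, here is a small Python 3 sketch. The exception names come from this chapter, but checkInput is a hypothetical helper invented for this example (the book puts these checks directly inside toRoman), and the NotIntegerError message text is an assumption. Python 3 spells "not equal" as != rather than the older <> operator.

```python
class RomanError(Exception):
    """Base class for the custom exceptions in this sketch."""


class OutOfRangeError(RomanError):
    pass


class NotIntegerError(RomanError):
    pass


def checkInput(n):
    """Hypothetical helper: validate n the way toRoman is described as doing."""
    # Chained comparison: the same as (0 < n) and (n < 4000).
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    # Same pattern for non-integers; the message text here is an assumption.
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    return n


try:
    checkInput(4000)
except OutOfRangeError as e:
    print("rejected:", e)
```

The message passed to the exception is what shows up in the traceback if nobody catches it.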
This is the non-integer check. Non-integers can not be converted to Roman numerals. +This is the non-integer check. Non-integers can not be converted to Roman numerals. @@ -12076,13 +11599,13 @@ toRoman should fail with 0 input ... ok
toRoman still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything. +toRoman still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything.
More exciting is the fact that all of the bad input tests now pass. This test, testNonInteger, passes because of theint(n) <> ncheck. When a non-integer is passed totoRoman, theint(n) <> ncheck notices it and raises theNotIntegerErrorexception, which is whattestNonIntegeris looking for. +More exciting is the fact that all of the bad input tests now pass. This test, testNonInteger, passes because of theint(n) <> ncheck. When a non-integer is passed totoRoman, theint(n) <> ncheck notices it and raises theNotIntegerErrorexception, which is whattestNonIntegeris looking for.@@ -12155,7 +11678,7 @@ FAILED (failures=6) @@ -12164,13 +11687,13 @@ FAILED (failures=6)-
You're down to 6 failures, and all of them involve fromRoman: the known values test, the three separate bad input tests, the case check, and the sanity check. That means thattoRomanhas passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires thatfromRomanbe written, which it isn't yet.) Which means that you must stop codingtoRomannow. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard. +You're down to 6 failures, and all of them involve fromRoman: the known values test, the three separate bad input tests, the case check, and the sanity check. That means thattoRomanhas passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires thatfromRomanbe written, which it isn't yet.) Which means that you must stop codingtoRomannow. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard.- The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for - a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module. + The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for + a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module. 14.4.
-roman.py, stage 4Now that
toRomanis done, it's time to start codingfromRoman. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than +Now that
toRomanis done, it's time to start codingfromRoman. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than thetoRomanfunction.Example 14.9.
roman4.pyThis file is available in
py/roman/stage4/in the examples directory. @@ -12214,7 +11737,7 @@ def fromRoman(s):@@ -12250,13 +11773,13 @@ toRoman should fail with 0 input ... ok - ![]()
The pattern here is the same as toRoman. You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer +The pattern here is the same as toRoman. You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer values as often as possible, you match the “highest” Roman numeral character strings as often as possible.- ![]()
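If it helps to see that pattern spelled out, here is a minimal Python 3 sketch of the idea. It is an illustration under the same assumptions as the description above (a descending tuple of (numeral, value) pairs), not necessarily the exact code in roman4.py, and it does no validation at all.

```python
romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)


def fromRoman(s):
    """Convert a Roman numeral to an integer (no validation yet)."""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        # Match the "highest" numeral string as often as it appears
        # at the current position, then move on to the next one.
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result


print(fromRoman("MCDXXIV"))  # 1424
```

Notice that nothing here rejects bad input: an empty string, or a string the loop cannot consume, just quietly produces a number.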
Two pieces of exciting news here. The first is that fromRomanworks for good input, at least for all the known values you test. +Two pieces of exciting news here. The first is that fromRomanworks for good input, at least for all the known values you test.@@ -12303,21 +11826,21 @@ Ran 12 tests in 1.222s FAILED (failures=4) - ![]()
The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both toRomanandfromRomanwork properly for all possible good values. (This is not guaranteed; it is theoretically possible thattoRomanhas a bug that produces the wrong Roman numeral for some particular set of inputs, and thatfromRomanhas a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals thattoRomangenerated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write +The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both toRomanandfromRomanwork properly for all possible good values. (This is not guaranteed; it is theoretically possible thattoRomanhas a bug that produces the wrong Roman numeral for some particular set of inputs, and thatfromRomanhas a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals thattoRomangenerated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write more comprehensive test cases until it doesn't bother you.)14.5.
roman.py, stage 5Now that
fromRomanworks properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input. - That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult + That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult than validating numeric input intoRoman, but you have a powerful tool at your disposal: regular expressions.If you're not familiar with regular expressions and didn't read Chapter 7, Regular Expressions, now would be a good time. -
As you saw in Section 7.3, “Case Study: Roman Numerals”, there are several simple rules for constructing a Roman numeral, using the letters
M,D,C,L,X,V, andI. Let's review the rules: +As you saw in Section 7.3, “Case Study: Roman Numerals”, there are several simple rules for constructing a Roman numeral, using the letters
M,D,C,L,X,V, andI. Let's review the rules:-
- Characters are additive.
Iis1,IIis2, andIIIis3.VIis6(literally, “5and1”),VIIis7, andVIIIis8. +- Characters are additive.
Iis1,IIis2, andIIIis3.VIis6(literally, “5and1”),VIIis7, andVIIIis8. -- The tens characters (
I,X,C, andM) can be repeated up to three times. At4, you need to subtract from the next highest fives character. You can't represent4asIIII; instead, it is represented asIV(“1less than5”).40is written asXL(“10less than50”),41asXLI,42asXLII,43asXLIII, and then44asXLIV(“10less than50, then1less than5”). +- The tens characters (
I,X,C, andM) can be repeated up to three times. At4, you need to subtract from the next highest fives character. You can't represent4asIIII; instead, it is represented asIV(“1less than5”).40is written asXL(“10less than50”),41asXLI,42asXLII,43asXLIII, and then44asXLIV(“10less than50, then1less than5”). -- Similarly, at
9, you need to subtract from the next highest tens character:8isVIII, but9isIX(“1less than10”), notVIIII(since theIcharacter can not be repeated four times).90isXC,900isCM. +- Similarly, at
9, you need to subtract from the next highest tens character:8isVIII, but9isIX(“1less than10”), notVIIII(since theIcharacter can not be repeated four times).90isXC,900isCM. -- The fives characters can not be repeated.
10is always represented asX, never asVV.100is alwaysC, neverLL. +- The fives characters can not be repeated.
10is always represented asX, never asVV.100is alwaysC, neverLL. -- Roman numerals are always written highest to lowest, and read left to right, so order of characters matters very much.
DCis600;CDis a completely different number (400, “100less than500”).CIis101;ICis not even a valid Roman numeral (because you can't subtract1directly from100; you would need to write it asXCIX, “10less than100, then1less than10”). +- Roman numerals are always written highest to lowest, and read left to right, so order of characters matters very much.
DCis600;CDis a completely different number (400, “100less than500”).CIis101;ICis not even a valid Roman numeral (because you can't subtract1directly from100; you would need to write it asXCIX, “10less than100, then1less than10”).Example 14.12.
@@ -12381,19 +11904,19 @@ def fromRoman(s):roman5.py- ![]()
This is just a continuation of the pattern you discussed in Section 7.3, “Case Study: Roman Numerals”. The tens place is either XC (90), XL (40), or an optional L followed by 0 to 3 optional X characters. The ones place is either IX (9), IV (4), or an optional V followed by 0 to 3 optional I characters. +This is just a continuation of the pattern you discussed in Section 7.3, “Case Study: Roman Numerals”. The tens place is either XC (90), XL (40), or an optional L followed by 0 to 3 optional X characters. The ones place is either IX (9), IV (4), or an optional V followed by 0 to 3 optional I characters.
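Here is one way a pattern like that could be written and tried out in Python 3. The thousands and hundreds sections are filled in by analogy with the tens and ones sections described above, so treat the exact pattern as a sketch rather than a copy of roman5.py. The isValidRoman helper is hypothetical and exists only for this example.

```python
import re

# Thousands, hundreds, tens, ones: each section follows the same shape.
romanNumeralPattern = "^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$"


def isValidRoman(s):
    """Hypothetical helper: True if s matches the pattern."""
    # re.search returns a match object on success and None on failure.
    return re.search(romanNumeralPattern, s) is not None


for candidate in ("MCDXXIV", "MCMC", "IIII", "VV", "IC", "mcm"):
    print(candidate, isValidRoman(candidate))
```

Only the first candidate is accepted. The rest break one of the rules reviewed earlier: a malformed antecedent, too many repeats, a repeated fives character, an illegal subtraction, and lowercase input.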
Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If + Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If re.searchreturns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid.At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of -invalid Roman numerals. But don't take my word for it, look at the results: +invalid Roman numerals. But don't take my word for it, look at the results:
Example 14.13. Output of
romantest5.pyagainstroman5.pyfromRoman should only accept uppercase input ... oktoRoman should always return uppercase ... ok @@ -12416,20 +11939,20 @@ OK
One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression -romanNumeralPattern was expressed in uppercase characters, the re.searchcheck will reject any input that isn't completely uppercase. So the uppercase input test passes. +One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression +romanNumeralPattern was expressed in uppercase characters, the re.searchcheck will reject any input that isn't completely uppercase. So the uppercase input test passes.- ![]()
More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like MCMC. As you've seen, this does not match the regular expression, sofromRomanraises anInvalidRomanNumeralErrorexception, which is what the malformed antecedents test case is looking for, so the test passes. +More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like MCMC. As you've seen, this does not match the regular expression, sofromRomanraises anInvalidRomanNumeralErrorexception, which is what the malformed antecedents test case is looking for, so the test passes.@@ -12451,7 +11974,7 @@ OK - ![]()
In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test + In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test cases. ![]()
Chapter 15. Refactoring
15.1. Handling bugs
-Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven't written yet. +
Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven't written yet.
Example 15.1. The bug
>>> import roman5
>>> roman5.fromRoman("")
0
@@ -12460,7 +11983,7 @@ OK![]()
Remember in the previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals? - Well, it turns out that this is still true for the final version of the regular expression. And that's a bug; you want an + Well, it turns out that this is still true for the final version of the regular expression. And that's a bug; you want an empty string to raise an @@ -12479,7 +12002,7 @@ class FromRomanBadInput(unittest.TestCase):InvalidRomanNumeralErrorexception just like any other sequence of characters that don't represent a valid Roman numeral.@@ -12563,23 +12086,23 @@ OK - ![]()
Pretty simple stuff here. Call fromRoman with an empty string and make sure it raises an InvalidRomanNumeralError exception. The hard part was finding the bug; now that you know about it, testing for it is the easy part. +Pretty simple stuff here. Call fromRoman with an empty string and make sure it raises an InvalidRomanNumeralError exception. The hard part was finding the bug; now that you know about it, testing for it is the easy part.-
All the other test cases still pass, which means that this bug fix didn't break anything else. Stop coding. +All the other test cases still pass, which means that this bug fix didn't break anything else. Stop coding. -Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs -will require complex test cases. In a testing-centric environment, it may seem like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case), -then fix the bug itself. Then if the test case doesn't pass right away, you need to figure out whether the fix was wrong, -or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code -tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can -easily re-run all the test cases along with your new one, you are much less likely to break old code when fixing new code. Today's unit test +
Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs +will require complex test cases. In a testing-centric environment, it may seem like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case), +then fix the bug itself. Then if the test case doesn't pass right away, you need to figure out whether the fix was wrong, +or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code +tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can +easily re-run all the test cases along with your new one, you are much less likely to break old code when fixing new code. Today's unit test is tomorrow's regression test.
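To recap the whole cycle in one place, here is a compact Python 3 sketch: the test case that pins the bug down, plus one plausible fix. The explicit blank check and both error messages are assumptions made for this illustration; the fix in the book's own stage 6 code may be worded differently.

```python
import re
import unittest


class InvalidRomanNumeralError(Exception):
    pass


romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)
romanNumeralPattern = "^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$"


def fromRoman(s):
    """Convert a Roman numeral to an integer, rejecting bad input."""
    # The fix: the pattern alone accepts "", so check for it explicitly.
    if not s:
        raise InvalidRomanNumeralError("Input can not be blank")
    if not re.search(romanNumeralPattern, s):
        raise InvalidRomanNumeralError("Invalid Roman numeral: %s" % s)
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result


class FromRomanBadInput(unittest.TestCase):
    def testBlank(self):
        """fromRoman should fail with blank string"""
        self.assertRaises(InvalidRomanNumeralError, fromRoman, "")
```

Saved as a module, this can be run with python -m unittest. Take the blank check out and testBlank fails, which is exactly the regression protection you are after.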
15.2. Handling changing requirements
Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible - nasty things involving scissors and hot wax, requirements will change. Most customers don't know what they want until they - see it, and even if they do, they aren't that good at articulating what they want precisely enough to be useful. And even - if they do, they'll want more in the next release anyway. So be prepared to update your test cases as requirements change. -
Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception -to that rule by having 4
Mcharacters in a row to represent4000. If you make this change, you'll be able to expand the range of convertible numbers from1..3999to1..4999. But first, you need to make some changes to the test cases. + nasty things involving scissors and hot wax, requirements will change. Most customers don't know what they want until they + see it, and even if they do, they aren't that good at articulating what they want precisely enough to be useful. And even + if they do, they'll want more in the next release anyway. So be prepared to update your test cases as requirements change. +Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception +to that rule by having 4
Mcharacters in a row to represent4000. If you make this change, you'll be able to expand the range of convertible numbers from1..3999to1..4999. But first, you need to make some changes to the test cases.Example 15.6. Modifying test cases for new requirements (
romantest71.py)This file is available in
py/roman/stage7/in the examples directory.If you have not already done so, you can download this and other examples used in this book.
@@ -12729,25 +12252,25 @@ if __name__ == "__main__":![]()
The existing known values don't change (they're all still reasonable values to test), but you need to add a few more in the - 4000range. Here I've included4000(the shortest),4500(the second shortest),4888(the longest), and4999(the largest). +4000range. Here I've included4000(the shortest),4500(the second shortest),4888(the longest), and4999(the largest).- ![]()
The definition of “large input” has changed. This test used to call toRomanwith4000and expect an error; now that4000-4999are good values, you need to bump this up to5000. +The definition of “large input” has changed. This test used to call toRomanwith4000and expect an error; now that4000-4999are good values, you need to bump this up to5000.- ![]()
The definition of “too many repeated numerals” has also changed. This test used to call fromRomanwith'MMMM'and expect an error; now thatMMMMis considered a valid Roman numeral, you need to bump this up to'MMMMM'. +The definition of “too many repeated numerals” has also changed. This test used to call fromRomanwith'MMMM'and expect an error; now thatMMMMis considered a valid Roman numeral, you need to bump this up to'MMMMM'.@@ -12844,8 +12367,8 @@ OutOfRangeError: number out of range (must be 1..3999) - ![]()
The sanity check and case checks loop through every number in the range, from 1to3999. Since the range has now expanded, theseforloops need to be updated as well to go up to4999. +The sanity check and case checks loop through every number in the range, from 1to3999. Since the range has now expanded, theseforloops need to be updated as well to go up to4999.Example 15.8. Coding the new requirements (
roman72.py)This file is available in
py/roman/stage7/in the examples directory.@@ -12909,18 +12432,18 @@ def fromRoman(s):- ![]()
toRomanonly needs one small change, in the range check. Where you used to check0 < n < 4000, you now check0 < n < 5000. And you change the error message that youraiseto reflect the new acceptable range (1..4999instead of1..3999). You don't need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds'M'for each thousand that it finds; given4000, it will spit out'MMMM'. The only reason it didn't do this before is that you explicitly stopped it with the range check.) +toRomanonly needs one small change, in the range check. Where you used to check0 < n < 4000, you now check0 < n < 5000. And you change the error message that youraiseto reflect the new acceptable range (1..4999instead of1..3999). You don't need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds'M'for each thousand that it finds; given4000, it will spit out'MMMM'. The only reason it didn't do this before is that you explicitly stopped it with the range check.)- - ![]()
You don't need to make any changes to fromRomanat all. The only change is to romanNumeralPattern; if you look closely, you'll notice that you added another optionalMin the first section of the regular expression. This will allow up to 4Mcharacters instead of 3, meaning you will allow the Roman numeral equivalents of4999instead of3999. The actualfromRomanfunction is completely general; it just looks for repeated Roman numeral characters and adds them up, without caring how - many times they repeat. The only reason it didn't handle'MMMM'before is that you explicitly stopped it with the regular expression pattern matching. +You don't need to make any changes to fromRomanat all. The only change is to romanNumeralPattern; if you look closely, you'll notice that you added another optionalMin the first section of the regular expression. This will allow up to 4Mcharacters instead of 3, meaning you will allow the Roman numeral equivalents of4999instead of3999. The actualfromRomanfunction is completely general; it just looks for repeated Roman numeral characters and adds them up, without caring how + many times they repeat. The only reason it didn't handle'MMMM'before is that you explicitly stopped it with the regular expression pattern matching.You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself: +
You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself:
Example 15.9. Output of
romantest72.pyagainstroman72.pyfromRoman should only accept uppercase input ... ok toRoman should always return uppercase ... ok fromRoman should fail with blank string ... ok @@ -12943,16 +12466,16 @@ OK![]()
- ![]()
All the test cases pass. Stop coding. +All the test cases pass. Stop coding. Comprehensive unit testing means never having to rely on a programmer who says “Trust me.”
15.3. Refactoring
The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even - the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly. -
Refactoring is the process of taking working code and making it work better. Usually, “better” means “faster”, although it can also mean “using less memory”, or “using less disk space”, or simply “more elegantly”. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any + the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly. +
Refactoring is the process of taking working code and making it work better. Usually, “better” means “faster”, although it can also mean “using less memory”, or “using less disk space”, or simply “more elegantly”. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any program. -
Here, “better” means “faster”. Specifically, the
fromRomanfunction is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals. +Here, “better” means “faster”. Specifically, the
fromRomanfunction is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals. It's probably not worth trying to do away with the regular expression altogether (it would be difficult, and it might not end up any faster), but you can speed up the function by precompiling the regular expression.Example 15.10. Compiling regular expressions
@@ -12971,14 +12494,14 @@ end up any faster), but you can speed up the function by precompiling the regula- ![]()
This is the syntax you've seen before: re.search takes a regular expression as a string (pattern) and a string to match against it ('M'). If the pattern matches, the function returns a match object which can be queried to find out exactly what matched and +This is the syntax you've seen before: re.search takes a regular expression as a string (pattern) and a string to match against it ('M'). If the pattern matches, the function returns a match object which can be queried to find out exactly what matched and how.
@@ -12991,7 +12514,7 @@ end up any faster), but you can speed up the function by precompiling the regula - ![]()
This is the new syntax: re.compile takes a regular expression as a string and returns a pattern object. Note there is no string to match here. Compiling a +This is the new syntax: re.compile takes a regular expression as a string and returns a pattern object. Note there is no string to match here. Compiling a regular expression has nothing to do with matching it against any specific strings (like 'M'); it only involves the regular expression itself.
@@ -13032,19 +12555,19 @@ def fromRoman(s): - ![]()
Calling the compiled pattern object's search method with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.) +Calling the compiled pattern object's search method with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)- ![]()
This looks very similar, but in fact a lot has changed. romanNumeralPattern is no longer a string; it is a pattern object which was returned from re.compile. +This looks very similar, but in fact a lot has changed. romanNumeralPattern is no longer a string; it is a pattern object which was returned from re.compile.- ![]()
That means that you can call methods on romanNumeralPattern directly. This will be much, much faster than calling re.search every time. The regular expression is compiled once and stored in romanNumeralPattern when the module is first imported; then, every time you call fromRoman, you can immediately match the input string against the regular expression, without any intermediate steps occurring under +That means that you can call methods on romanNumeralPattern directly. This will be much, much faster than calling re.search every time. The regular expression is compiled once and stored in romanNumeralPattern when the module is first imported; then, every time you call fromRoman, you can immediately match the input string against the regular expression, without any intermediate steps occurring under the covers.
So how much faster is it to compile regular expressions? See for yourself: -
Example 15.12. Output of
romantest81.pyagainstroman81.py.............+
Example 15.12. Output of
romantest81.pyagainstroman81.py.............---------------------------------------------------------------------- Ran 13 tests in 3.385s
@@ -13053,13 +12576,13 @@ OK
![]()
- ![]()
Just a note in passing here: this time, I ran the unit test without the -voption, so instead of the fulldocstringfor each test, you only get a dot for each test that passes. (If a test failed, you'd get anF, and if it had an error, you'd get anE. You'd still get complete tracebacks for each failure and error, so you could track down any problems.) +Just a note in passing here: this time, I ran the unit test without the -voption, so instead of the fulldocstringfor each test, you only get a dot for each test that passes. (If a test failed, you'd get anF, and if it had an error, you'd get anE. You'd still get complete tracebacks for each failure and error, so you could track down any problems.)- - ![]()
You ran 13 tests in 3.385 seconds, compared to 3.685 seconds without precompiling the regular expressions. That's an 8% improvement overall, and remember that most of the time spent during the unit test is spent doing other things. (Separately, +You ran 13 tests in 3.385 seconds, compared to 3.685 seconds without precompiling the regular expressions. That's an 8% improvement overall, and remember that most of the time spent during the unit test is spent doing other things. (Separately, I time-tested the regular expressions by themselves, apart from the rest of the unit tests, and found that compiling this regular expression speeds up the search by an average of 54%.) Not bad for such a simple fix.
@@ -13070,8 +12593,8 @@ OK
Oh, and in case you were wondering, precompiling the regular expression didn't break anything, and you just proved it.
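A rough way to check this kind of claim yourself is the standard timeit module. The sketch below is not the timing harness behind the numbers quoted above. The pattern is only assumed to have the same shape as the one in this chapter, and the names time_uncompiled and time_compiled are made up for illustration. It is written for Python 3, and results will vary by machine and Python version. The re module keeps an internal cache of recently compiled patterns, so the gap you measure may well be smaller than the figure quoted above.

```python
import re
import timeit

# Assumed pattern, modelled on the M?M?M?M? form discussed in this chapter.
PATTERN = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
COMPILED = re.compile(PATTERN)


def time_uncompiled(s, number=10000):
    """Time re.search, which has to look up (or build) the pattern object on every call."""
    return timeit.timeit(lambda: re.search(PATTERN, s), number=number)


def time_compiled(s, number=10000):
    """Time the search method of a pattern object that was compiled once, up front."""
    return timeit.timeit(lambda: COMPILED.search(s), number=number)


if __name__ == '__main__':
    sample = 'MCMLXXXVIII'
    print('re.search:       %.4f s' % time_uncompiled(sample))
    print('compiled.search: %.4f s' % time_compiled(sample))
```

Both calls return the same kind of match object; only the bookkeeping around the pattern differs.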
There is one other performance optimization that I want to try. Given the complexity of regular expression syntax, it should -come as no surprise that there is frequently more than one way to write the same expression. After some discussion about +
There is one other performance optimization that I want to try. Given the complexity of regular expression syntax, it should +come as no surprise that there is frequently more than one way to write the same expression. After some discussion about this module on comp.lang.python, someone suggested that I try using the
{m,n}syntax for the optional repeated characters.Example 15.13.
roman82.pyThis file is available in
py/roman/stage8/in the examples directory. @@ -13090,11 +12613,11 @@ romanNumeralPattern = \- - ![]()
You have replaced M?M?M?M? with M{0,4}. Both mean the same thing: “match 0 to 4 M characters”. Similarly, C?C?C? became C{0,3} (“match 0 to 3 C characters”) and so forth for X and I. +You have replaced M?M?M?M? with M{0,4}. Both mean the same thing: “match 0 to 4 M characters”. Similarly, C?C?C? became C{0,3} (“match 0 to 3 C characters”) and so forth for X and I.
This form of the regular expression is a little shorter (though not any more readable). The big question is, is it any faster? +
This form of the regular expression is a little shorter (though not any more readable). The big question is, is it any faster?
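Before looking at timings, the claim that both spellings accept the same strings can be spot-checked. This is a minimal sketch for Python 3. The two patterns are assumed to have the same shape as the ones in roman81.py and roman82.py, and the helper name agree is invented here.

```python
import re

# The two spellings of the pattern discussed above (assumed shapes, for illustration).
OLD = re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')
NEW = re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')


def agree(s):
    """Return True when both patterns give the same accept/reject answer for s."""
    return bool(OLD.search(s)) == bool(NEW.search(s))


samples = ['', 'M', 'MMMM', 'MMMMM', 'MCMXC', 'CCCC', 'XLII', 'IIII', 'MMMMCMXCIX']
for s in samples:
    # Columns: the input, whether the new pattern accepts it, whether old and new agree.
    print(repr(s), bool(NEW.search(s)), agree(s))
```

A handful of samples is not a proof, which is exactly why the unit tests are run again after the change.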
Example 15.14. Output of
romantest82.pyagainstroman82.py............. ---------------------------------------------------------------------- Ran 13 tests in 3.315s@@ -13104,8 +12627,8 @@ OK
![]()
- ![]()
Overall, the unit tests run 2% faster with this form of regular expression. That doesn't sound exciting, but remember that - the search function is a small part of the overall unit test; most of the time is spent doing other things. (Separately, I time-tested +Overall, the unit tests run 2% faster with this form of regular expression. That doesn't sound exciting, but remember that + the search function is a small part of the overall unit test; most of the time is spent doing other things. (Separately, I time-tested just the regular expressions, and found that the search function is 11% faster with this syntax.) By precompiling the regular expression and rewriting part of it to use this new syntax, you've improved the regular expression performance by over 60%, and improved the overall performance of the entire unit test by over 10%.
@@ -13113,18 +12636,18 @@ OK
![]()
- ![]()
More important than any performance boost is the fact that the module still works perfectly. This is the freedom I was talking + More important than any performance boost is the fact that the module still works perfectly. This is the freedom I was talking about earlier: the freedom to tweak, change, or rewrite any piece of it and verify that you haven't messed anything up in - the process. This is not a license to endlessly tweak your code just for the sake of tweaking it; you had a very specific + the process. This is not a license to endlessly tweak your code just for the sake of tweaking it; you had a very specific objective (“make -fromRomanfaster”), and you were able to accomplish that objective without any lingering doubts about whether you introduced new bugs in the process.One other tweak I would like to make, and then I promise I'll stop refactoring and put this module to bed. As you've seen -repeatedly, regular expressions can get pretty hairy and unreadable pretty quickly. I wouldn't like to come back to this -module in six months and try to maintain it. Sure, the test cases pass, so I know that it works, but if I can't figure out -how it works, it's still going to be difficult to add new features, fix new bugs, or otherwise maintain it. As you saw in Section 7.5, “Verbose Regular Expressions”, Python provides a way to document your logic line-by-line. +
One other tweak I would like to make, and then I promise I'll stop refactoring and put this module to bed. As you've seen +repeatedly, regular expressions can get pretty hairy and unreadable pretty quickly. I wouldn't like to come back to this +module in six months and try to maintain it. Sure, the test cases pass, so I know that it works, but if I can't figure out +how it works, it's still going to be difficult to add new features, fix new bugs, or otherwise maintain it. As you saw in Section 7.5, “Verbose Regular Expressions”, Python provides a way to document your logic line-by-line.
Example 15.15.
roman83.pyThis file is available in
py/roman/stage8/in the examples directory.If you have not already done so, you can download this and other examples used in this book.
@@ -13152,8 +12675,8 @@ romanNumeralPattern = re.compile('''![]()
The re.compile function can take an optional second argument, which is a set of one or more flags that control various options about the - compiled regular expression. Here you're specifying the re.VERBOSE flag, which tells Python that there are in-line comments within the regular expression itself. The comments and all the whitespace around them are -not considered part of the regular expression; the re.compile function simply strips them all out when it compiles the expression. This new, “verbose” version is identical to the old version, but it is infinitely more readable. + compiled regular expression. Here you're specifying the re.VERBOSE flag, which tells Python that there are in-line comments within the regular expression itself. The comments and all the whitespace around them are +not considered part of the regular expression; the re.compile function simply strips them all out when it compiles the expression. This new, “verbose” version is identical to the old version, but it is infinitely more readable.![]()
@@ -13166,25 +12689,25 @@ OK
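To see the re.VERBOSE flag in isolation, here is a small sketch for Python 3 that compiles a compact pattern and a commented one and checks that they accept and reject the same inputs. The patterns are assumed to have the same shape as the one in this chapter; the comments inside the verbose version are illustrative, not copied from roman83.py.

```python
import re

COMPACT = re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

# With re.VERBOSE, whitespace and #-comments inside the pattern are ignored.
VERBOSE = re.compile(r'''
    ^                   # beginning of string
    M{0,4}              # thousands: 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds: 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #           or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens: 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #       or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones: 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #       or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    ''', re.VERBOSE)

for s in ['MCMLXXXVIII', 'MMMMCMXCIX', 'IIII', 'VX', 'M M']:
    # Both patterns should give the same answer for every input.
    print(repr(s), bool(COMPACT.search(s)), bool(VERBOSE.search(s)))
```

Note that 'M M' is rejected by both: in verbose mode the whitespace is stripped from the pattern, not from the string being searched.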
- ![]()
This new, “verbose” version runs at exactly the same speed as the old version. In fact, the compiled pattern objects are the same, since the + This new, “verbose” version runs at exactly the same speed as the old version. In fact, the compiled pattern objects are the same, since the re.compilefunction strips out all the stuff you added.- ![]()
This new, “verbose” version passes all the same tests as the old version. Nothing has changed, except that the programmer who comes back to + This new, “verbose” version passes all the same tests as the old version. Nothing has changed, except that the programmer who comes back to this module in six months stands a fighting chance of understanding how the function works. 15.4. Postscript
-A clever reader read the previous section and took it to the next level. The biggest headache (and performance drain) in the program as it is currently written is - the regular expression, which is required because you have no other way of breaking down a Roman numeral. But there's only +
A clever reader read the previous section and took it to the next level. The biggest headache (and performance drain) in the program as it is currently written is + the regular expression, which is required because you have no other way of breaking down a Roman numeral. But there's only 5000 of them; why don't you just build a lookup table once, then simply read that? This idea gets even better when you realize - that you don't need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals, + that you don't need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers. -
And best of all, he already had a complete set of unit tests. He changed over half the code in the module, but the unit tests +
And best of all, he already had a complete set of unit tests. He changed over half the code in the module, but the unit tests stayed the same, so he could prove that his code worked just as well as the original.
Example 15.17.
roman9.pyThis file is available in
py/roman/stage9/in the examples directory. @@ -13264,8 +12787,8 @@ Ran 13 tests in 0.791s OK -Remember, the best performance you ever got in the original version was 13 tests in 3.315 seconds. Of course, it's not entirely -a fair comparison, because this version will take longer to import (when it fills the lookup tables). But since import is +
Remember, the best performance you ever got in the original version was 13 tests in 3.315 seconds. Of course, it's not entirely +a fair comparison, because this version will take longer to import (when it fills the lookup tables). But since import is only done once, this is negligible in the long run.
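The lookup-table idea can be sketched in a few lines. This is not the roman9.py listing; it is a simplified illustration written for Python 3, and the names build_tables, to_roman_table, from_roman_table, to_roman and from_roman are invented here. It assumes the 1..4999 range used earlier in the chapter.

```python
# (numeral, value) pairs in descending order of value.
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))
MAX_ROMAN_NUMERAL = 4999  # assumed upper bound, from the range check earlier in the chapter


def build_tables():
    """Build both lookup tables once: integer -> numeral and numeral -> integer."""
    to_table = {}
    from_table = {}
    for integer in range(1, MAX_ROMAN_NUMERAL + 1):
        remaining, result = integer, ''
        for numeral, value in ROMAN_NUMERAL_MAP:
            while remaining >= value:
                result += numeral
                remaining -= value
        to_table[integer] = result
        from_table[result] = integer
    return to_table, from_table


# Filled once, at import time; every later call is a dictionary lookup.
to_roman_table, from_roman_table = build_tables()


def to_roman(n):
    """Convert an integer to a Roman numeral by table lookup."""
    if n not in to_roman_table:
        raise ValueError('number out of range (must be 1..4999)')
    return to_roman_table[n]


def from_roman(s):
    """Convert a Roman numeral to an integer; anything not in the table is invalid."""
    if s not in from_roman_table:
        raise ValueError('Invalid Roman numeral: %r' % (s,))
    return from_roman_table[s]
```

No regular expression is needed in this version: a string is a valid Roman numeral exactly when it is a key in from_roman_table.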
The moral of the story?
@@ -13276,12 +12799,12 @@ only done once, this is negligible in the long run.15.5. Summary
Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility - in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, - or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers - are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional - testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you've seen it + in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, + or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers + are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional + testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you've seen it work, you'll wonder how you ever got along without it. -
This chapter covered a lot of ground, and much of it wasn't even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts: +
This chapter covered a lot of ground, and much of it wasn't even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts:
@@ -13320,20 +12843,20 @@ only done once, this is negligible in the long run.
Chapter 16. Functional Programming
16.1. Diving in
-In Chapter 13, Unit Testing, you learned about the philosophy of unit testing. In Chapter 14, Test-First Programming, you stepped through the implementation of basic unit tests in Python. In Chapter 15, Refactoring, you saw how unit testing makes large-scale refactoring easier. This chapter will build on those sample programs, but here +
In Chapter 13, Unit Testing, you learned about the philosophy of unit testing. In Chapter 14, Test-First Programming, you stepped through the implementation of basic unit tests in Python. In Chapter 15, Refactoring, you saw how unit testing makes large-scale refactoring easier. This chapter will build on those sample programs, but here we will focus more on advanced Python-specific techniques, rather than on unit testing itself. -
The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual -modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the -build process for this book; I have unit tests for several of the example programs (not just the
roman.pymodule featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this -regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to +The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual +modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the +build process for this book; I have unit tests for several of the example programs (not just the
roman.pymodule featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this +regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to download them and sit around scratching your head and yelling at your monitor and wondering why they don't work.Example 16.1.
regression.pyIf you have not already done so, you can download this and other examples used in this book.
"""Regression testing framework This module will search for scripts in the same directory named -XYZtest.py. Each such script should be a test suite that tests a -module through PyUnit. (As of Python 2.1, PyUnit is included in +XYZtest.py. Each such script should be a test suite that tests a +module through PyUnit. (As of Python 2.1, PyUnit is included in the standard library as "unittest".) This script will aggregate all found test suites into one big test suite and run them all at once. """ @@ -13414,7 +12937,7 @@ OK16.2. Finding the path
When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.
This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember -once you see it. The key to it is
sys.argv. As you saw in Chapter 9, XML Processing, this is a list that holds the list of command-line arguments. However, it also holds the name of the running script, exactly +once you see it. The key to it issys.argv. As you saw in Chapter 9, XML Processing, this is a list that holds the list of command-line arguments. However, it also holds the name of the running script, exactly as it was called from the command line, and this is enough information to determine its location.Example 16.3.
fullpath.pyIf you have not already done so, you can download this and other examples used in this book.
@@ -13428,25 +12951,25 @@ print 'full path =', os.path.abspath(pathname)![]()
- ![]()
Regardless of how you run a script, sys.argv[0] will always contain the name of the script, exactly as it appears on the command line. This may or may not include any path +Regardless of how you run a script, sys.argv[0] will always contain the name of the script, exactly as it appears on the command line. This may or may not include any path information, as you'll see shortly.- ![]()
os.path.dirname takes a filename as a string and returns the directory path portion. If the given filename does not include any path information, +os.path.dirname takes a filename as a string and returns the directory path portion. If the given filename does not include any path information, os.path.dirname returns an empty string.- - ![]()
os.path.abspath is the key here. It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname. +os.path.abspath is the key here. It takes a pathname, which can be partial or even blank, and returns a fully qualified pathname.
os.path.abspathdeserves further explanation. It is very flexible; it can take any kind of pathname. +
os.path.abspathdeserves further explanation. It is very flexible; it can take any kind of pathname.Example 16.4. Further explanation of
os.path.abspath>>> import os >>> os.getcwd()@@ -13487,7 +13010,7 @@ print 'full path =', os.path.abspath(pathname)
![]()
- ![]()
os.path.abspath also normalizes the pathname it returns. Note that this example worked even though I don't actually have a 'foo' directory. os.path.abspath never checks your actual disk; this is all just string manipulation. +os.path.abspath also normalizes the pathname it returns. Note that this example worked even though I don't actually have a 'foo' directory. os.path.abspath never checks your actual disk; this is all just string manipulation.
@@ -13504,7 +13027,7 @@ print 'full path =', os.path.abspath(pathname)
![]()
- os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without +os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without turning it into a full pathname, use os.path.normpath instead.
@@ -13527,19 +13050,19 @@ full path = /home/you/diveintopython3/common/py
-
In the first case, sys.argv[0] includes the full path of the script. You can then use the os.path.dirname function to strip off the script name and return the full directory name, and os.path.abspath simply returns what you give it. +In the first case, sys.argv[0] includes the full path of the script. You can then use the os.path.dirname function to strip off the script name and return the full directory name, and os.path.abspath simply returns what you give it.- ![]()
If the script is run by using a partial pathname, sys.argv[0] will still contain exactly what appears on the command line. os.path.dirname will then give you a partial pathname (relative to the current directory), and os.path.abspath will construct a full pathname from the partial pathname. +If the script is run by using a partial pathname, sys.argv[0] will still contain exactly what appears on the command line. os.path.dirname will then give you a partial pathname (relative to the current directory), and os.path.abspath will construct a full pathname from the partial pathname.
@@ -13548,13 +13071,13 @@ full path = /home/you/diveintopython3/common/py - ![]()
If the script is run from the current directory without giving any path, os.path.dirname will simply return an empty string. Given an empty string, os.path.abspath returns the current directory, which is what you want, since the script was run from the current directory. +If the script is run from the current directory without giving any path, os.path.dirname will simply return an empty string. Given an empty string, os.path.abspath returns the current directory, which is what you want, since the script was run from the current directory.![]()
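The three cases can be reproduced without actually running a script three different ways. The helper below (script_directory is a made-up name) applies the same dirname and abspath steps to a string standing in for sys.argv[0]. It is written for Python 3, and the sample paths are invented.

```python
import os


def script_directory(argv0):
    """Return the absolute directory for a script that was invoked as argv0."""
    pathname = os.path.dirname(argv0)   # may be a full path, a partial path, or ''
    return os.path.abspath(pathname)    # always a full, normalized path


if __name__ == '__main__':
    for argv0 in ('/home/you/project/fullpath.py',   # full path
                  'project/fullpath.py',             # partial path
                  'fullpath.py'):                    # no path at all
        print('%-32s -> %s' % (argv0, script_directory(argv0)))
```

In a real script you would pass sys.argv[0] instead of a literal string. The last two results depend on whatever directory you happen to be in when you run it.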
- Like the other functions in the osandos.pathmodules,os.path.abspathis cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash - as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of theosmodule. +Like the other functions in the osandos.pathmodules,os.path.abspathis cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash + as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of theosmodule.Addendum. One reader was dissatisfied with this solution, and wanted to be able to run all the unit tests in the current directory, -not the directory where
regression.pyis located. He suggests this approach instead: +not the directory whereregression.pyis located. He suggests this approach instead:Example 16.6. Running scripts in the current directory
import sys, os, re, unittest def regressionTest(): @@ -13566,7 +13089,7 @@ def regressionTest():- ![]()
Instead of setting path to the directory where the currently running script is located, you set it to the current working directory instead. This + Instead of setting path to the directory where the currently running script is located, you set it to the current working directory instead. This will be whatever directory you were in before you ran the script, which is not necessarily the same as the directory the script is in. (Read that sentence a few times until you get it.) @@ -13574,7 +13097,7 @@ def regressionTest():- ![]()
Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them. You didn't need to do this when path was the directory of the currently running script, because Python always looks in that directory. + Append this directory to the Python library search path, so that when you dynamically import the unit test modules later, Python can find them. You didn't need to do this when path was the directory of the currently running script, because Python always looks in that directory. @@ -13583,25 +13106,25 @@ def regressionTest(): -The rest of the function is the same. This technique will allow you to re-use this
regression.pyscript on multiple projects. Just put the script in a common directory, then change to the project's directory before running - it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory whereregression.pyis located. +This technique will allow you to re-use this
regression.pyscript on multiple projects. Just put the script in a common directory, then change to the project's directory before running + it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory whereregression.pyis located.16.3. Filtering lists revisited
-You're already familiar with using list comprehensions to filter lists. There is another way to accomplish this same thing, which some people feel is more expressive. +
You're already familiar with using list comprehensions to filter lists. There is another way to accomplish this same thing, which some people feel is more expressive.
Python has a built-in filter function which takes two arguments, a function and a list, and returns a list.[7] The function passed as the first argument to filter must itself take one argument, and the list that filter returns will contain all the elements from the list passed to filter for which the function passed to filter returns true.
Got all that? It's not as difficult as it sounds.
Example 16.7. Introducing
filter>>> def odd(n):-... return n % 2 -... +... return n % 2 +... >>> li = [1, 2, 3, 5, 9, 10, 256, -3] >>> filter(odd, li)
[1, 3, 5, 9, -3] >>> [e for e in li if odd(e)]
>>> filteredList = [] >>> for n in li:
-... if odd(n): -... filteredList.append(n) -... +... if odd(n): +... filteredList.append(n) +... >>> filteredList [1, 3, 5, 9, -3]
@@ -13614,7 +13137,7 @@ def regressionTest():
-@@ -13627,7 +13150,7 @@ def regressionTest(): - ![]()
filter takes two arguments, a function (odd) and a list (li). It loops through the list and calls odd with each element. If odd returns a true value (remember, any non-zero value is true in Python), then the element is included in the returned list, otherwise it is filtered out. The result is a list of only the odd +filter takes two arguments, a function (odd) and a list (li). It loops through the list and calls odd with each element. If odd returns a true value (remember, any non-zero value is true in Python), then the element is included in the returned list, otherwise it is filtered out. The result is a list of only the odd numbers from the original list, in the same order as they appeared in the original.
@@ -13641,25 +13164,25 @@ def regressionTest(): - ![]()
You could also accomplish the same thing with a for loop. Depending on your programming background, this may seem more “straightforward”, but functions like filter are much more expressive. Not only is it easier to write, it's easier to read, too. Reading the for loop is like standing too close to a painting; you see all the details, but it may take a few seconds to be able to step +You could also accomplish the same thing with a for loop. Depending on your programming background, this may seem more “straightforward”, but functions like filter are much more expressive. Not only is it easier to write, it's easier to read, too. Reading the for loop is like standing too close to a painting; you see all the details, but it may take a few seconds to be able to step back and see the bigger picture: “Oh, you're just filtering the list!”![]()
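The interactive session in Example 16.7 shows filter returning a list directly, which is how it behaves in Python 2. In Python 3, filter returns a lazy iterator instead, so the same comparison needs a list() call. A short sketch of the three approaches side by side, written for Python 3 (the variable names with_filter, with_comprehension and with_loop are made up for illustration):

```python
def odd(n):
    """Return a true value (1) for odd integers and a false value (0) for even ones."""
    return n % 2


li = [1, 2, 3, 5, 9, 10, 256, -3]

# filter: in Python 3 this is an iterator, so wrap it in list() to see the elements.
with_filter = list(filter(odd, li))

# The equivalent list comprehension.
with_comprehension = [e for e in li if odd(e)]

# The equivalent explicit loop.
with_loop = []
for n in li:
    if odd(n):
        with_loop.append(n)

print(with_filter)
print(with_filter == with_comprehension == with_loop)
```

All three produce the same elements in the same order; they differ only in how much of the mechanics you have to read through.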
As you saw in Section 16.2, “Finding the path”, path may contain the full or partial pathname of the directory of the currently running script, or it may contain an empty string - if the script is being run from the current directory. Either way, files will end up with the names of the files in the same directory as this script you're running. + if the script is being run from the current directory. Either way, files will end up with the names of the files in the same directory as this script you're running. - ![]()
This is a compiled regular expression. As you saw in Section 15.3, “Refactoring”, if you're going to use the same regular expression over and over, you should compile it for faster performance. The compiled - object has a search method which takes a single argument, the string to search. If the regular expression matches the string, the search method returns a Match object containing information about the regular expression match; otherwise it returns None, the Python null value. +This is a compiled regular expression. As you saw in Section 15.3, “Refactoring”, if you're going to use the same regular expression over and over, you should compile it for faster performance. The compiled + object has a search method which takes a single argument, the string to search. If the regular expression matches the string, the search method returns a Match object containing information about the regular expression match; otherwise it returns None, the Python null value. -
For each element in the files list, you're going to call the search method of the compiled regular expression object, test. If the regular expression matches, the method will return a Match object, which Python considers to be true, so the element will be included in the list returned by filter. If the regular expression does not match, the search method will return None, which Python considers to be false, so the element will not be included. +For each element in the files list, you're going to call the search method of the compiled regular expression object, test. If the regular expression matches, the method will return a Match object, which Python considers to be true, so the element will be included in the list returned by filter. If the regular expression does not match, the search method will return None, which Python considers to be false, so the element will not be included.
Historical note. Versions of Python prior to 2.0 did not have list comprehensions, so you couldn't filter using list comprehensions; the
filter function was the only game in town. Even with the introduction of list comprehensions in 2.0, some people still prefer the -old-style filter (and its companion function, map, which you'll see later in this chapter). Both techniques work at the moment, so which one you use is a matter of style. +Historical note. Versions of Python prior to 2.0 did not have list comprehensions, so you couldn't filter using list comprehensions; the
filter function was the only game in town. Even with the introduction of list comprehensions in 2.0, some people still prefer the +old-style filter (and its companion function, map, which you'll see later in this chapter). Both techniques work at the moment, so which one you use is a matter of style. There is discussion that map and filter might be deprecated in a future version of Python, but no decision has been made.
Example 16.9. Filtering using list comprehensions instead
files = os.listdir(path) @@ -13669,16 +13192,16 @@ There is discussion thatmapandfiltermight be depre- ![]()
This will accomplish exactly the same result as using the filter function. Which way is more expressive? That's up to you. +This will accomplish exactly the same result as using the filter function. Which way is more expressive? That's up to you.
16.4. Mapping lists revisited
-You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in
map function. It works much the same way as the filter function. +You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in
map function. It works much the same way as the filter function.
Example 16.10. Introducing
map
>>> def double(n):
-...     return n*2
-...
+...     return n*2
+...
>>> li = [1, 2, 3, 5, 9, 10, 256, -3]
>>> map(double, li)
[2, 4, 6, 10, 18, 20, 512, -6]
@@ -13686,22 +13209,22 @@ There is discussion that mapandfiltermight be depre
[2, 4, 6, 10, 18, 20, 512, -6]
>>> newlist = []
>>> for n in li:
-...     newlist.append(double(n))
-...
+...     newlist.append(double(n))
+...
>>> newlist
[2, 4, 6, 10, 18, 20, 512, -6]
-
- ![]()
map takes a function and a list [8] and returns a new list by calling the function with each element of the list in order. In this case, the function simply +map takes a function and a list [8] and returns a new list by calling the function with each element of the list in order. In this case, the function simply multiplies each element by 2. -
You could accomplish the same thing with a list comprehension. List comprehensions were first introduced in Python 2.0; map has been around forever. +You could accomplish the same thing with a list comprehension. List comprehensions were first introduced in Python 2.0; map has been around forever.
@@ -13719,14 +13242,14 @@ There is discussion that mapandfiltermight be depre-
As a side note, I'd like to point out that map works just as well with lists of mixed datatypes, as long as the function you're using correctly handles each type. In this - case, the double function simply multiplies the given argument by 2, and Python Does The Right Thing depending on the datatype of the argument. For integers, this means actually multiplying it by 2; for +As a side note, I'd like to point out that map works just as well with lists of mixed datatypes, as long as the function you're using correctly handles each type. In this + case, the double function simply multiplies the given argument by 2, and Python Does The Right Thing depending on the datatype of the argument. For integers, this means actually multiplying it by 2; for strings, it means concatenating the string with itself; for tuples, it means making a new tuple that has all of the elements of the original, then all of the elements of the original again.
All right, enough play time. Let's look at some real code. +
All right, enough play time. Let's look at some real code.
Example 16.12. map in regression.py
filenameToModuleName = lambda f: os.path.splitext(f)[0]
moduleNames = map(filenameToModuleName, files)
@@ -13734,13 +13257,13 @@ There is discussion thatmapandfiltermight be depre- ![]()
As you saw in Section 4.7, “Using lambda Functions”, lambda defines an inline function. And as you saw in Example 6.17, “Splitting Pathnames”, os.path.splitext takes a filename and returns a tuple (name, extension). So filenameToModuleName is a function which will take a filename and strip off the file extension, and return just the name. +As you saw in Section 4.7, “Using lambda Functions”, lambda defines an inline function. And as you saw in Example 6.17, “Splitting Pathnames”, os.path.splitext takes a filename and returns a tuple (name, extension). So filenameToModuleName is a function which will take a filename and strip off the file extension, and return just the name.
@@ -13748,32 +13271,32 @@ There is discussion that -
Calling map takes each filename listed in files, passes it to the function filenameToModuleName, and returns a list of the return values of each of those function calls. In other words, you strip the file extension off +Calling map takes each filename listed in files, passes it to the function filenameToModuleName, and returns a list of the return values of each of those function calls. In other words, you strip the file extension off of each filename, and store the list of all those stripped filenames in moduleNames.mapandfiltermight be depreAs you'll see in the rest of the chapter, you can extend this type of data-centric thinking all the way to the final goal, which is to define and execute a single test suite that contains the tests from all of those individual test suites.
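The mapping step described above can be sketched as a small standalone script. The filenames here are made up for illustration, and the snake_case names are not from regression.py. The sketch also assumes Python 3, where map returns a lazy iterator instead of a list, so it is wrapped in list().

```python
import os

# Hypothetical filenames standing in for the filtered `files` list
files = ["romantest.py", "apihelpertest.py", "kgptest.py"]

# os.path.splitext returns (name, extension); keep only the name
filename_to_module_name = lambda f: os.path.splitext(f)[0]

# Python 3's map is lazy, so list() forces it into a real list
module_names = list(map(filename_to_module_name, files))

# The equivalent list comprehension, for comparison
comprehension_names = [os.path.splitext(f)[0] for f in files]
```

Both forms produce the same list of module names; which one reads better is the same matter of style discussed earlier.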
16.5. Data-centric programming
-By now you're probably scratching your head wondering why this is better than using
for loops and straight function calls. And that's a perfectly valid question. Mostly, it's a matter of perspective. Using +By now you're probably scratching your head wondering why this is better than using
for loops and straight function calls. And that's a perfectly valid question. Mostly, it's a matter of perspective. Using map and filter forces you to center your thinking around your data. -In this case, you started with no data at all; the first thing you did was get the directory path of the current script and get a list of files in that directory. That was the bootstrap, and it gave you real data to work +
In this case, you started with no data at all; the first thing you did was get the directory path of the current script and get a list of files in that directory. That was the bootstrap, and it gave you real data to work with: a list of filenames. -
However, you knew you didn't care about all of those files, only the ones that were actually test suites. You had too much data, so you needed to
filter it. How did you know which data to keep? You needed a test to decide, so you defined one and passed it to the filter function. In this case you used a regular expression to decide, but the concept would be the same regardless of how you +However, you knew you didn't care about all of those files, only the ones that were actually test suites. You had too much data, so you needed to
filter it. How did you know which data to keep? You needed a test to decide, so you defined one and passed it to the filter function. In this case you used a regular expression to decide, but the concept would be the same regardless of how you constructed the test.
Now you had the filenames of each of the test suites (and only the test suites, since everything else had been filtered out), -but you really wanted module names instead. You had the right amount of data, but it was in the wrong format. So you defined a function that would transform a single filename into a module name, and you mapped that function onto -the entire list. From one filename, you can get a module name; from a list of filenames, you can get a list of module names. -
Instead of
filter, you could have used a for loop with an if statement. Instead of map, you could have used a for loop with a function call. But using for loops like that is busywork. At best, it simply wastes time; at worst, it introduces obscure bugs. For instance, you need -to figure out how to test for the condition “is this file a test suite?” anyway; that's the application-specific logic, and no language can write that for us. But once you've figured that out, +but you really wanted module names instead. You had the right amount of data, but it was in the wrong format. So you defined a function that would transform a single filename into a module name, and you mapped that function onto +the entire list. From one filename, you can get a module name; from a list of filenames, you can get a list of module names. +Instead of
filter, you could have used a for loop with an if statement. Instead of map, you could have used a for loop with a function call. But using for loops like that is busywork. At best, it simply wastes time; at worst, it introduces obscure bugs. For instance, you need +to figure out how to test for the condition “is this file a test suite?” anyway; that's the application-specific logic, and no language can write that for us. But once you've figured that out, do you really want to go to all the trouble of defining a new empty list and writing a for loop and an if statement and manually calling append to add each element to the new list if it passes the condition and then keeping track of which variable holds the new filtered data and which one holds the old unfiltered data? Why not just define the test condition, then let Python do the rest of that work for us? -Oh sure, you could try to be fancy and delete elements in place without creating a new list. But you've been burned by that -before. Trying to modify a data structure that you're looping through can be tricky. You delete an element, then loop to -the next element, and suddenly you've skipped one. Is Python one of the languages that works that way? How long would it take you to figure it out? Would you remember for certain whether +
Oh sure, you could try to be fancy and delete elements in place without creating a new list. But you've been burned by that +before. Trying to modify a data structure that you're looping through can be tricky. You delete an element, then loop to +the next element, and suddenly you've skipped one. Is Python one of the languages that works that way? How long would it take you to figure it out? Would you remember for certain whether it was safe the next time you tried? Programmers spend so much time and make so many mistakes dealing with purely technical -issues like this, and it's all pointless. It doesn't advance your program at all; it's just busywork. -
I resisted list comprehensions when I first learned Python, and I resisted
filter and map even longer. I insisted on making my life more difficult, sticking to the familiar way of for loops and if statements and step-by-step code-centric programming. And my Python programs looked a lot like Visual Basic programs, detailing every step of every operation in every function. And they had all the same types of little problems -and obscure bugs. And it was all pointless. -Let it all go. Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have -too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind. +issues like this, and it's all pointless. It doesn't advance your program at all; it's just busywork. +
I resisted list comprehensions when I first learned Python, and I resisted
filter and map even longer. I insisted on making my life more difficult, sticking to the familiar way of for loops and if statements and step-by-step code-centric programming. And my Python programs looked a lot like Visual Basic programs, detailing every step of every operation in every function. And they had all the same types of little problems +and obscure bugs. And it was all pointless. +Let it all go. Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have +too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind.
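To make the comparison concrete, here is a minimal sketch that puts the data-centric version, the busywork loop, and the delete-in-place trap side by side. The directory listing and the "ends in test.py" pattern are assumptions for illustration, and the code is written for Python 3, so filter and map are wrapped in list().

```python
import os
import re

# Hypothetical directory listing; the chapter gets this from os.listdir(path)
files = ["romantest.py", "roman.py", "kgptest.py", "notes.txt", "regression.py"]

# Assumed test condition: the filename ends in "test.py"
test = re.compile(r"test\.py$", re.IGNORECASE)

# Data-centric version: too much data, so filter it; wrong format, so map it
test_files = list(filter(test.search, files))
module_names = list(map(lambda f: os.path.splitext(f)[0], test_files))

# Busywork version: same result, more bookkeeping
loop_names = []
for f in files:
    if test.search(f):
        loop_names.append(os.path.splitext(f)[0])

# The "fancy" version: deleting in place while looping skips elements
leftovers = list(files)
for f in leftovers:
    if not test.search(f):
        leftovers.remove(f)
# "regression.py" is never examined, so it survives in leftovers
```

The first two versions agree. The third quietly leaves regression.py in the list, because each removal shifts the remaining elements under the loop's position. So yes, Python is one of the languages that works that way.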
16.6. Dynamically importing modules
-OK, enough philosophizing. Let's talk about dynamically importing modules. -
First, let's look at how you normally import modules. The
import module syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once -this way, with a comma-separated list. You did this on the very first line of this chapter's script. +OK, enough philosophizing. Let's talk about dynamically importing modules. +
First, let's look at how you normally import modules. The
import module syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once +this way, with a comma-separated list. You did this on the very first line of this chapter's script.
Example 16.13. Importing multiple modules at once
import sys, os, re, unittest
@@ -13806,13 +13329,13 @@ import sys, os, re, unittest-
The variable sys is now the sys module, just as if you had said import sys. The variable os is now the os module, and so forth. +The variable sys is now the -sys module, just as if you had said import sys. The variable os is now the os module, and so forth.
So
__import__ imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string, -but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module -to doesn't need to match the module name, either. You could import a series of modules and assign them to a list. +So
__import__ imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string, +but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module +to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.
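As a rough sketch of that last idea, the snippet below maps __import__ over a list of module names. The names module_names, modules, and operating_system are illustrative only, and list() is there because the sketch assumes Python 3, where map is lazy.

```python
# Module names as plain strings; they could just as well come from
# a variable or a function call
module_names = ["sys", "os", "re", "unittest"]

# __import__ takes the name as a string, so it can be mapped over the list
modules = list(map(__import__, module_names))

# The variable that holds a module doesn't need to match the module's name
operating_system = modules[1]
split_result = operating_system.path.splitext("romantest.py")
```

One caveat: for a dotted name such as os.path, __import__ returns the top-level package, not the submodule. In modern Python, importlib.import_module is the usual choice for that case.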